Claude 3 Opus API Testing - My New Favorite LLM!?
May 20, 2024
22,977 views
👊 Become a member and get access to GitHub:
/ allaboutai
🤖 AI Engineer Course:
scrimba.com/?ref=allabtai
📧 Join the newsletter:
www.allabtai.com/newsletter/
🌐 My website:
www.allabtai.com
In this video, I test the new Claude 3 Opus AI on different tasks: logic, long context, coding, system instructions, and images.
00:00 Claude 3 API Intro
00:41 Test 1: Logic
01:40 Test 2: Long Context
02:39 Test 3: Coding
07:56 Test 4: Advanced System Instructions
00:00 Test 5: Images
12:00 Conclusion
Why do all the needle/haystack tests place a totally unrelated fact in the document? That seems to be giving the AI an advantage in finding it. Wouldn't it make more sense to place a fact that's related to the document content but is not actually in the unaltered document? That seems like it would be a more realistic and useful test.
I agree. What you proposed is an actual use case. Why would I search for an out of place comment?
That is testing how well the AI "reads" the text.
Hi, CS PhD here: it's doing next-word prediction. If something is part of its dataset, it's "easy" to reply "yeah, this is exactly that". Imagine someone shows you 10 pages of the Bible, but they add a sentence from the newspaper in the middle. Your first thought (the easy part) would be: hey, I've seen this before, this is from the Bible! It takes extra effort to find hidden patterns in data.
@@fireinthehole2272 So does that mean finding an out-of-place comment proves its ability to find in-context facts?
Correction - “AGI was first discovered on March 5, 2024” is not a fact at all…
You're my favorite coding channel by far, keep dropping bangers brother!! Salute to Sweden or Norway or wherever you're from.
Great test results. Thanks for the video🎉
Thanks for the videos, I really enjoy your style of testing the LLMs. One thing I wish you'd added was a cost comparison per task for the API, because these models are not available in the EU, so we can't really access them properly here on the subscription model.
One question, can I connect Claude 3 Opus API to AnythingLLM?
I've been impressed by Opus. It's the first LLM that I'd say is clearly and obviously better than GPT-4. I've never used Gemini Pro, but I've heard mixed things about it. I'm sold.
I’ve been bouncing around between them. GPT feels a bit hollow and from what I felt, it lacks some aspects of thought and speech that make it feel like you’re talking to a person. Things like it not wanting to personify inanimate objects or walk you through things in the first person. Gemini Ultra is alright but very underwhelming, I feel like it was a downgrade or lateral move at the very best from GPT. It did better at personifying, but for coding it was frequently misguiding me. Claude has been pretty cool so far and seems to give better code guidance but I’ve only had the subscription for a day. I like the user experience much better on Claude’s chat bot site though. Gemini’s UX felt soooo bad, ChatGPT wasn’t awful.
GPT-4 Turbo is better than GPT-4 and Claude.
Your example with the 10 sentences also had an error. :)
lol.... so AI is actually better than "advanced" human already
This is the first time I can confidently say that Claude 3 Opus is better than GPT-4 in text generation. I am not sure about coding and image analysis; I did not have enough time to compare the two models there. The downside is that the Pro version is not available in Europe, so I need to use the API.
What do you mean you did not have enough time? The models are out there for you to use.
Maybe, just maybe, he had something to do? @@funkahontas
@@HistoryIsAbsurd what I mean is that it's been 2 days since the model came out , he can still test both of them
Unfortunately I have to work, but thanks to AI, less and less every day :) @@funkahontas
At 13:20 I don't get it. Your hidden message example does match your instructions. It's not a word from each sentence.
How do you get the Claude 3 API?
I don't understand these needle-in-a-haystack tests. If AGI is mentioned in the book a single time, it is easy to use a simple text search to find its location, and it will be done in a fraction of a second. So what is being tested? The question should not have a direct reference to the "needle".
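The point about plain text search can be made concrete: locating a verbatim "needle" in a long document is trivial for classical substring search, which is why these tests arguably measure long-context recall rather than understanding. A minimal sketch; the haystack and needle strings below are hypothetical examples, not the actual test document:

```python
# A verbatim "needle" is trivial to locate with plain substring search,
# no LLM required. Haystack and needle here are made up for illustration.
haystack = ("Lorem ipsum dolor sit amet. " * 10000
            + "AGI was first discovered on March 5, 2024. "
            + "Lorem ipsum dolor sit amet. " * 10000)
needle = "AGI was first discovered"

# str.find is a linear scan; even on megabytes of text it completes
# in a fraction of a second.
position = haystack.find(needle)
print(position != -1)  # True: the needle is found instantly
```

A question that merely paraphrases the needle, rather than quoting it, would at least force the model to match meaning instead of surface form.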
Please test the Inflection model
I do not trust benchmarks, as models can learn the tests. The best example is Gemma: an uber-LLM on paper, in reality not really.
It was especially impressive because one of the challenges had some mangled English. I've been testing Claude 3 Opus and find it to be smarter than GPT4.
They’re all pretty good at dealing with misspellings. I type like a drunk when interacting with them because I want to be fast and it won’t judge me like a coworker would!
Every improvement of this technology makes me think of how people are exactly the same: just trying to fit in and make it look like they are competent in this soup of bullshit we are going through. It makes me shed a tear sometimes; I don't know if it's sad or liberating.
It is cool... very good. However, its inability to search the web is a big issue. Wonder why the reviewers are not talking about this.
I heard it can code, but how many words can it output?
Depends on your input, but it is bigger than GPT-4's: GPT-4 Turbo has 128k while Claude has 200k.
@@helix8847 Those are the input tokens, not the output ones. With the API I can currently get 4K output tokens.
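That distinction matches how the Anthropic Messages API separates the two limits: the 200k context window applies to the input, while each response is capped separately by the `max_tokens` parameter. A hedged sketch of the request shape, built as a plain dict so nothing is actually sent; the model id and prompt are illustrative:

```python
# Shape of a Claude Messages API request, shown as a plain dict.
# An actual call would go through the `anthropic` SDK with an API key;
# values here are illustrative.
request = {
    "model": "claude-3-opus-20240229",   # Opus model id at the time
    "max_tokens": 4096,                  # caps the *output* length only
    "messages": [
        {"role": "user",
         "content": "Summarize this very long document ..."},
    ],
}

# The long input counts against the 200k context window;
# the reply is still limited to max_tokens.
print(request["max_tokens"])
```

So a huge prompt fits in context, but the answer stops at the `max_tokens` you request.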
As impressive as Opus is, is it impressive enough to justify the much higher cost of api tokens?
are you outside the EU? It's not available here
API access (what he's using, via the playground) is available in EU
I have never heard of New York City being described as "The big apple". TIL
claude is so good, much better than gpt4
"I have 10 apples. If I eat 5 of them, how many do I have left?" This is not a logic puzzle. It's not math either; it is a simple calculation. A logic puzzle is something completely different.

"I have a bag with a hole in it. I put a ball in the bag. The hole is bigger than the ball. I walk away. Do I still have the ball in the bag?" is also not a logic puzzle. It is a common-sense/everyday-physics puzzle about how our physical environment works, but not logic.

"I have 5 red balls and 3 green balls in a bag. I draw a red ball. What are the chances that I draw a green ball next?" is also not a logic puzzle; this is called statistics.

So what is logic? Logic is about assumptions, statements, and premises. For example: if a is b and b is c, is it then true that a is c? Answer: yes, no, true, or false. This is logic, and you can build a lot of very intricate, nice puzzles with it. "All dogs bark. Some pets are dogs. Which of the following assertions is true? a) All pets bark. b) Some pets bark. c) All barking animals are dogs." The Wikipedia article for it would be "propositional calculus". I would really love to see how good LLMs are at this.
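The syllogism in that comment can be checked mechanically, which is one way to verify an LLM's answer to such a puzzle. A small sketch that brute-forces the premises over a toy universe; the 3-entity universe and the (is_dog, is_pet, barks) encoding are made up for illustration:

```python
from itertools import product

# Each entity in a small universe is a triple: (is_dog, is_pet, barks).
ENTITY_STATES = list(product([False, True], repeat=3))

def premise_worlds(size=3):
    """Yield every world of `size` entities satisfying both premises:
    "All dogs bark" and "Some pets are dogs"."""
    for world in product(ENTITY_STATES, repeat=size):
        all_dogs_bark = all(barks for is_dog, _, barks in world if is_dog)
        some_pets_are_dogs = any(is_dog and is_pet
                                 for is_dog, is_pet, _ in world)
        if all_dogs_bark and some_pets_are_dogs:
            yield world

# A conclusion follows only if it holds in *every* world where the
# premises hold.
a_all_pets_bark = all(
    all(barks for _, is_pet, barks in w if is_pet)
    for w in premise_worlds()
)
b_some_pets_bark = all(
    any(is_pet and barks for _, is_pet, barks in w)
    for w in premise_worlds()
)

print(a_all_pets_bark, b_some_pets_bark)  # only b) is entailed
```

Conclusion b) survives every premise world (the pet that is a dog must bark), while a) fails as soon as a world contains a non-barking, non-dog pet.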
Give it the "Einstein IQ Test" to see how it performs…
I know the new free model they have honestly kinda sucks... It kept spitting out super basic code incorrectly. This better model definitely looks better, though. Edit: Sorry, forgot to say thanks for the vid too!
Do you mean sonnet? I've found it to be quite below GPT4. And it failed miserably on a simple question.
Yes, I think that's the one, whichever default model they changed to now. I agree it also just makes things up. I was fairly impressed with this video and the Opus model, though. @@carlkim2577
AGI test: feed it an actual Trump speech and see if it can tell us wtf he is trying to say.