Claude 3 Opus API Testing - My New Favorite LLM!?
May 20, 2024
22,977 views
👊 Become a member and get access to GitHub:
/ allaboutai
🤖 AI Engineer Course:
scrimba.com/?ref=allabtai
📧 Join the newsletter:
www.allabtai.com/newsletter/
🌐 My website:
www.allabtai.com
In this video, I test the new Claude 3 Opus AI on different tasks: logic, long context, coding, system instructions, and images.
00:00 Claude 3 API Intro
00:41 Test 1: Logic
01:40 Test 2: Long Context
02:39 Test 3: Coding
07:56 Test 4: Advanced System Instructions
00:00 Test 5: Images
12:00 Conclusion
Why do all the needle/haystack tests place a totally unrelated fact in the document? That seems to be giving the AI an advantage in finding it. Wouldn't it make more sense to place a fact that's related to the document content but is not actually in the unaltered document? That seems like it would be a more realistic and useful test.
I agree. What you proposed is an actual use case. Why would I search for an out of place comment?
That is testing how well the AI "reads" the text.
Hi, CS PhD here: it's doing next-word prediction. If something is part of its dataset, it's "easy" to reply "yeah, this is exactly that". Imagine someone shows you 10 pages of the Bible, but they add a sentence from the newspaper in the middle. Your first thought (the easy part) would be: hey, I've seen this before, this is from the Bible! It takes extra effort to find hidden patterns in data.
@@fireinthehole2272 So does that mean finding an out-of-place comment proves its ability to find in-context facts?
Correction - “AGI was first discovered on March 5, 2024” is not a fact at all…
You're my favorite coding channel by far, keep dropping bangers brother!! Salute to Sweden or Norway or wherever you're from.
Great test results. Thanks for the video🎉
Thanks for the videos, I really enjoy your style of testing the LLMs. One thing I wish you'd added was a cost comparison per task for the API, because these models are not available in the EU, so we can't really access them properly here on the subscription model.
One question, can I connect Claude 3 Opus API to AnythingLLM?
I've been impressed by Opus. It's the first LLM that I'd say is clearly and obviously better than GPT-4. I've never used Gemini Pro, but I've heard mixed things about it. I'm sold.
I’ve been bouncing around between them. GPT feels a bit hollow and from what I felt, it lacks some aspects of thought and speech that make it feel like you’re talking to a person. Things like it not wanting to personify inanimate objects or walk you through things in the first person. Gemini Ultra is alright but very underwhelming, I feel like it was a downgrade or lateral move at the very best from GPT. It did better at personifying, but for coding it was frequently misguiding me. Claude has been pretty cool so far and seems to give better code guidance but I’ve only had the subscription for a day. I like the user experience much better on Claude’s chat bot site though. Gemini’s UX felt soooo bad, ChatGPT wasn’t awful.
GPT-4 Turbo is better than GPT-4 and Claude.
Your example with the 10 sentences also had an error. :)
lol.... so AI is actually better than "advanced" human already
This is the first time I can confidently say that Claude 3 Opus is better than GPT-4 in text generation. I am not sure about coding and image analysis; I did not have enough time to compare the two models there. The downside is that the Pro version is not available in Europe, so I need to use the API.
What do you mean you did not have enough time? The models are out there for you to use.
Maybe, just maybe, he had something to do? @@funkahontas
@@HistoryIsAbsurd what I mean is that it's been 2 days since the model came out , he can still test both of them
Unfortunately I have to work, but thanks to AI, less and less every day :) @@funkahontas
At 13:20 I don't get it. Your hidden message example does match your instructions. It's not a word from each sentence.
How do you get the Claude 3 API?
I don't understand these needle-in-a-haystack tests. If AGI is mentioned in the book a single time, it is easy to use a simple text search to find its location, and it will be done in a fraction of a second. So what is being tested? The question should not have a direct reference to the "needle".
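The point about plain text search can be made concrete: locating a verbatim "needle" in a long document is trivial for classical substring search, which is why these tests arguably measure long-context recall rather than understanding. A minimal sketch; the haystack and needle strings below are hypothetical examples, not the actual test document:

```python
# A verbatim "needle" is trivial to locate with plain substring search,
# no LLM required. Haystack and needle here are made up for illustration.
haystack = ("Lorem ipsum dolor sit amet. " * 10000
            + "AGI was first discovered on March 5, 2024. "
            + "Lorem ipsum dolor sit amet. " * 10000)
needle = "AGI was first discovered"

# str.find is a linear scan; even on megabytes of text it completes
# in a fraction of a second.
position = haystack.find(needle)
print(position != -1)  # True: the needle is found instantly
```

A question that merely paraphrases the needle, rather than quoting it, would at least force the model to match meaning instead of surface form.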
Please test the Inflection model
I do not trust benchmarks, as models can learn the tests. The best example is Gemma: an uber-LLM on paper, in reality not really.
It was especially impressive because one of the challenges had some mangled English. I've been testing Claude 3 Opus and find it to be smarter than GPT4.
They’re all pretty good at dealing with misspellings. I type like a drunk when interacting with them because I want to be fast and it won’t judge me like a coworker would!
Every improvement of this technology makes me think of how people are exactly the same: just trying to fit in and make it look like they are competent in this soup of bullshit we are going through. It makes me shed a tear sometimes; I don't know if it's sad or liberating.
It is cool... very good. However, its inability to search the web is a big issue. Wonder why the reviewers are not talking about this.
I heard it can code, but how many words can it output?
Depends on your input, but it is bigger than GPT-4's: GPT-4 Turbo has 128k while Claude has 200k.
@@helix8847 Those are the input tokens, not the output ones. With the API I can currently get 4K output tokens.
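That distinction matches how the Anthropic Messages API separates the two limits: the 200k context window applies to the input, while each response is capped separately by the `max_tokens` parameter. A hedged sketch of the request shape, built as a plain dict so nothing is actually sent; the model id and prompt are illustrative:

```python
# Shape of a Claude Messages API request, shown as a plain dict.
# An actual call would go through the `anthropic` SDK with an API key;
# values here are illustrative.
request = {
    "model": "claude-3-opus-20240229",   # Opus model id at the time
    "max_tokens": 4096,                  # caps the *output* length only
    "messages": [
        {"role": "user",
         "content": "Summarize this very long document ..."},
    ],
}

# The long input counts against the 200k context window;
# the reply is still limited to max_tokens.
print(request["max_tokens"])
```

So a huge prompt fits in context, but the answer stops at the `max_tokens` you request.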
As impressive as Opus is, is it impressive enough to justify the much higher cost of api tokens?
are you outside the EU? It's not available here
API access (what he's using, via the playground) is available in EU
I have never heard of New York City being described as "The big apple". TIL
claude is so good, much better than gpt4
"I have 10 apples. If I eat 5 of them, how many do I have left?" This is not a logic puzzle. It's not math either; it is a simple calculation. A logic puzzle is something completely different.

"I have a bag with a hole in it. I put a ball in the bag. The hole is bigger than the ball. I walk away. Do I still have the ball in the bag?" is also not a logic puzzle. It is a common-sense/everyday-physics puzzle about how our physical environment works, but not logic.

"I have 5 red balls and 3 green balls in a bag. I draw a red ball. What are the chances that I draw a green ball next?" is also not a logic puzzle; this is called statistics.

So what is logic? Logic is about assumptions, statements, and premises. For example: if a is b and b is c, is it then true that a is c? Answer: yes, no, true, or false. This is logic, and you can build a lot of very intricate, nice puzzles with it. "All dogs bark. Some pets are dogs. Which of the following assertions is true? a) All pets bark. b) Some pets bark. c) All barking animals are dogs." The Wikipedia article for it would be "propositional calculus". I would really love to see how good LLMs are at this.
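The syllogism in that comment can be checked mechanically, which is one way to verify an LLM's answer to such a puzzle. A small sketch that brute-forces the premises over a toy universe; the 3-entity universe and the (is_dog, is_pet, barks) encoding are made up for illustration:

```python
from itertools import product

# Each entity in a small universe is a triple: (is_dog, is_pet, barks).
ENTITY_STATES = list(product([False, True], repeat=3))

def premise_worlds(size=3):
    """Yield every world of `size` entities satisfying both premises:
    "All dogs bark" and "Some pets are dogs"."""
    for world in product(ENTITY_STATES, repeat=size):
        all_dogs_bark = all(barks for is_dog, _, barks in world if is_dog)
        some_pets_are_dogs = any(is_dog and is_pet
                                 for is_dog, is_pet, _ in world)
        if all_dogs_bark and some_pets_are_dogs:
            yield world

# A conclusion follows only if it holds in *every* world where the
# premises hold.
a_all_pets_bark = all(
    all(barks for _, is_pet, barks in w if is_pet)
    for w in premise_worlds()
)
b_some_pets_bark = all(
    any(is_pet and barks for _, is_pet, barks in w)
    for w in premise_worlds()
)

print(a_all_pets_bark, b_some_pets_bark)  # only b) is entailed
```

Conclusion b) survives every premise world (the pet that is a dog must bark), while a) fails as soon as a world contains a non-barking, non-dog pet.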
Give it the "Einstein IQ Test" to see how it performs…
I know the new free model they have honestly kinda sucks... It kept spitting out super basic code incorrectly. This better model definitely looks better, though. Edit: Sorry, forgot to say thanks for the vid too!
Do you mean sonnet? I've found it to be quite below GPT4. And it failed miserably on a simple question.
Yes, I think that's the one, whichever default model they changed to now. I agree it also just makes things up. I was fairly impressed with this video and the Opus model, though. @@carlkim2577
AGI test: feed it an actual Trump speech and see if it can tell us wtf he is trying to say.