I pressure-tested GPT-4's 128K context retrieval

May 13, 2024
21,128 views

Get updates from me: mail.gregkamradt.com/
FullStackRetrieval.com
Tweet write-up: twitter.com/GregKamradt/status/1722386725635580292
Code: github.com/gkamradt/LLMTest_NeedleInAHaystack
Check out how GPT-4 does at retrieval with 128K tokens worth of context.
Lost In The Middle: www-cs.stanford.edu/~nfliu/pa...
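
The test in a nutshell: hide a short "needle" fact at a controlled depth inside a long "haystack" of filler text (Paul Graham essays), then ask the model to retrieve it. A minimal sketch of that loop, assuming the current OpenAI Python client - the linked repo is the real implementation:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
              "and sit in Dolores Park on a sunny day.")
    QUESTION = "What is the most fun thing to do in San Francisco?"

    def insert_needle(haystack: str, depth_percent: float) -> str:
        # Place the needle depth_percent of the way into the haystack.
        cut = int(len(haystack) * depth_percent / 100)
        return haystack[:cut] + " " + NEEDLE + " " + haystack[cut:]

    def run_trial(haystack: str, depth_percent: float) -> str:
        context = insert_needle(haystack, depth_percent)
        response = client.chat.completions.create(
            model="gpt-4-1106-preview",
            temperature=0,  # deterministic-ish output, easier to grade
            messages=[
                {"role": "system", "content": "Answer only from the provided context."},
                {"role": "user", "content": context + "\n\n" + QUESTION},
            ],
        )
        return response.choices[0].message.content
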
Greg’s Info:
- Twitter: twitter.com/gregkamradt
- Newsletter: mail.gregkamradt.com/
- Website: gregkamradt.com/
- LinkedIn: linkedin.com/in/gregkamradt
- Work with me: tiny.one/TEi2HhN
- Contact Me: Twitter DM, LinkedIn Message, or contact@dataindependent.com

Comments
  • OpenAI should be paying you $200 an hour for this type of in-depth analysis of their model.

    @gardnmi · 5 months ago
    • ha! I wish, but thank you. I'm sure their product analysts are doing the same (and more) as we speak

      @DataIndependent · 5 months ago
  • Brilliantly executed. That graph is incredibly intuitive and information dense.

    @ultraprim · 11 days ago
  • What an amazing quality video! Thank you for this test

    @SaileshB · 5 months ago
  • Superb test. Thanks for doing this.

    @Adhithya2003 · 5 months ago
  • You are doing great work brother, keep it up.

    @ShaidaMuhammad · 4 months ago
  • Sweet! Thank you for investing the $$$, time, and effort in this! It's great to have these data points. tbh I'm just a hobbyist with no actual use case for such a large context length at the moment, but I agree it's valuable to gain an intuition about how these models behave. I've been curious about how annotation and structure affect retrieval at such long context lengths. In your example the retrieved information was unrelated to the main text and placed randomly. I wonder, in a situation where context and query are related, would performance increase if the LLM was given a document formatted more like a book (or a GitHub repo), with a leading table of contents and maybe even a trailing index? We know that the GPT-x models love markdown structure; would that make a difference? Anyway, there are endless experiments one could run, and I'm sure we'll be seeing more research papers soon. My instinct says vector, metadata, and even purely text-search-based retrieval will remain valuable regardless of how large context lengths get. Why wouldn't you try to increase signal to noise if you have the tool?

    @thenoblerot · 5 months ago
    • I totally agree! I can't think of a use case for long context that wouldn't benefit from retrieval to increase the signal-to-noise ratio. There are so many variations of this to try; if I had $400K we could put together a pretty well-researched test of tons of permutations to build up an intuition. But there are a lot of other ways to spend that kind of money which would leverage more value too... ;)

      @DataIndependent · 5 months ago
  • Nice! This is much needed content. Not enough people are talking about this. I'd be curious to see how Claude 2 100K context retrieval would compare.

    @andreyseas · 5 months ago
    • Thanks Andre! It was a fun test to do

      @DataIndependent · 5 months ago
    • Actually I heard of the 'lost in the middle' issue in the context of Claude 2 100K...so now we know it affects OpenAI too....

      @scharlesworth93 · 5 months ago
    • @scharlesworth93 Interesting! Will have to dig into that more myself. Thanks for letting me know!

      @andreyseas · 5 months ago
  • Love the analysis Greg! It's great to see you take action on these questions. Would love to see it on the new Claude 200K context.

    @jonpappas2 · 5 months ago
    • Thanks Jon! Of course, here ya go: twitter.com/GregKamradt/status/1727018183608193393 Same process, different model. BTW it was awesome working w/ ya in our prior lives

      @DataIndependent · 5 months ago
  • This should become the STANDARD retrieval benchmark on every major model release.

    @raregear · 1 month ago
  • Great content and very solid reasoning! Will definitely try this approach on free-to-use models! The needle-in-a-haystack approach is great for challenging LLM retrieval abilities; however, it might be easier for the LLM to do well because the sandwich in San Francisco "semantically stands out" from the rest of the essay. It would be interesting to ask the LLM for a precise piece of information already contained in the given context (Graham's essay) to make the task closer to the user's need.

    @jourdainlouis8553 · 2 months ago
    • nice! Yep totally agree

      @DataIndependent · 2 months ago
  • Love the enthusiasm. Most people I know think this stuff is boring.

    @alexanderroodt5052 · 5 months ago
    • ha - totally, I can't tell if I'm brainwashed or what

      @DataIndependent · 5 months ago
  • Good stuff!

    @PrimeMindAI · 4 months ago
  • Very well made video, great research. Thanks for the investment!

    @bvdlio · 5 months ago
    • Nice! Thank you

      @DataIndependent · 5 months ago
  • Great video. Would be interesting to see how the model performs if you place 2-3 "needles" in the text at different positions. Would help to know for ordering responses in RAG with large chunks.

    @maof77 · 5 months ago
    • Totally, that would be a solid test. There are so many variations I'd like to do, but they would cost a ton of $$.

      @DataIndependent · 5 months ago
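
      A sketch of what that multi-needle variation could look like, extending the single-needle insertion idea from the repo (the needles below are made up for illustration):

          # Drop several distinct needles at different depths, then ask for
          # all of them and count how many come back.
          NEEDLES = [
              (10, "The secret ingredient in the pizza is figs."),
              (50, "The code for the vault is 7-2-9."),
              (90, "The founder's favorite color is teal."),
          ]

          def insert_needles(haystack: str, needles) -> str:
              # Insert deepest-first so earlier inserts don't shift the
              # character offsets of the needles still to be placed.
              for depth_percent, needle in sorted(needles, reverse=True):
                  cut = int(len(haystack) * depth_percent / 100)
                  haystack = haystack[:cut] + " " + needle + " " + haystack[cut:]
              return haystack
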
  • Thanks. Would be interesting to see the performance when the sentence is placed not on its own line, but as a continuation of or inside a paragraph. A new line might be easier to detect.

    @kai_s1985 · 5 months ago
    • yeah...I did sentence breaks with new lines so the results would definitely change

      @DataIndependent · 5 months ago
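
      A sketch of that inline variant: splice the needle in at the next sentence boundary so it continues a paragraph instead of sitting on its own line (hypothetical helper, not the repo's code):

          def insert_needle_inline(haystack: str, needle: str,
                                   depth_percent: float) -> str:
              cut = int(len(haystack) * depth_percent / 100)
              # Walk forward to the end of the current sentence so the needle
              # reads as a continuation of the paragraph, not a new line.
              while cut < len(haystack) and haystack[cut] != ".":
                  cut += 1
              return haystack[:cut + 1] + " " + needle + haystack[cut + 1:]
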
  • Solid video. Would've loved to see some of the tweet replies discussed in this video.

    @adamgdev · 5 months ago
    • Thanks Adam! Totally! I should have included that - I might do a recap video on it

      @DataIndependent · 5 months ago
  • Awesome!

    @SuperYutubu · 5 months ago
    • Thanks Yutubu!

      @DataIndependent · 5 months ago
  • Excellent work! Do you plan to do this test with the new Gemini Pro 1.5 model?

    @hitalex07 · 2 months ago
    • Yep, totally - when it comes out and there is access

      @DataIndependent · 2 months ago
  • I guess a needle should be something the model can't make up, e.g. "The answer to the question that you are going to be asked is '9h550klz2a6'"

    @hidroman1993 · 5 months ago
    • When I was getting feedback on this after the test, I was told that UUID key:value pair retrieval is the standardized test. Makes sense. I went for relatable in this version.

      @DataIndependent · 5 months ago
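
      A sketch of that UUID key:value variant - the value is unguessable, so a correct answer can only come from the context, and scoring is exact-match rather than judged by another LLM:

          import uuid

          # Generate an unguessable key and value, bury the needle sentence
          # in the haystack, and ask for the value back.
          key, value = str(uuid.uuid4()), str(uuid.uuid4())
          needle = f"The value associated with key {key} is {value}."
          question = f"What is the value for key {key}?"

          def is_correct(model_answer: str) -> bool:
              # The answer is right iff the exact value string appears.
              return value in model_answer
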
  • Awesome test. Really clean and concise. I would recommend getting rid of the side camera, since it adds nothing of value and looks a bit weird. Cheers!

    @Joao-pm8je · 5 months ago
    • Thanks Joao! Totally agree

      @DataIndependent · 5 months ago
  • How are you evaluating the score?

    @MridulBanikcse · 1 month ago
  • Interesting! I'd say that large context windows always beat all the previous tricks like chunking + summarizing to shrink the data size.

    @hermannschmidt9788 · 5 months ago
    • I don't know about that, actually - I'm a big fan of investing in retrieval to get better signal:noise. I haven't seen a use case that requires 128K tokens of context that wouldn't benefit from better retrieval.

      @DataIndependent · 5 months ago
    • Lossless retrieval into large context windows, yes.

      @hermannschmidt9788 · 5 months ago
    • @hermannschmidt9788 Sure, ya - once it can recall & synthesize 100% accurately from long context, then it'll be a way different conversation

      @DataIndependent · 5 months ago
  • Have you tried running this test on the great Mixtral MoE?

    @jeffwads · 4 months ago
  • It's an interesting result, thanks for doing it. Any ideas why the conclusion seems to be different from the Lost in the Middle paper? Are the takeaways contradictory?

    @Li-rm2gj · 5 months ago
    • There is so much variability with these tests that it is tough to pin down what the cause would be

      @DataIndependent · 5 months ago
  • absolute mad lad

    @dcrebbin · 5 months ago
    • Thanks Devon - it was fun to do!

      @DataIndependent · 5 months ago
  • Excellent test! I have enjoyed watching your videos. Could you do the same with Claude 3? It would be nice to see a comparison.

    @rosendoduron4753 · 2 months ago
    • Nice! Thank you - I haven't done it with that one yet but it's on the to do list

      @DataIndependent · 2 months ago
  • Nice analysis! How exactly did you measure the correctness of the answer?

    @arthurducasse9944 · 5 months ago
    • I used LangChain's eval, which was easy: github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/73ffdd4dd2d190d9306d9162a2401ae9a067ddcf/LLMNeedleHaystackTester.py#L369

      @DataIndependent · 5 months ago
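
      Roughly what grading with a LangChain LLM-as-judge evaluator looks like (a sketch, not the repo's exact code - see the linked line for that):

          from langchain.chat_models import ChatOpenAI
          from langchain.evaluation import load_evaluator

          # GPT-4 as the judge, graded against the known needle text.
          judge = ChatOpenAI(model_name="gpt-4", temperature=0)
          evaluator = load_evaluator("labeled_score_string", llm=judge)

          result = evaluator.evaluate_strings(
              prediction="You should eat a sandwich in Dolores Park.",  # model answer
              reference=("The best thing to do in San Francisco is eat a "
                         "sandwich and sit in Dolores Park on a sunny day."),
              input="What is the most fun thing to do in San Francisco?",
          )
          print(result["score"])  # judge-assigned score
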
  • PG writes about SF occasionally. Wouldn't this test be more definitive if you had changed the city name to one that PG has never written about? Perhaps the error rate simply increased with context length because more of PG's opinions on SF got included in the text, and not because of hallucination. That aside, thank you for doing this test and for sharing how much $ it cost. I think this kind of thing is the most interesting and valuable kind of content. 🙌

    @feralmachine · 5 months ago
    • Nice! Thank you for that, and you're totally right. Small variations in the text/question would produce different results. I was told after the test was done that I could have done a key:value pair UUID retrieval. Ex: "What is the value for this key? ad1491f3-d899-495b-8fea-7f07a7c6a602" But that felt boring even if it was technically more correct. My goal was to kick off the conversation rather than claim definitive results (hence a tweet write-up vs a paper and peer review)

      @DataIndependent · 5 months ago
  • Do you plan to make the script public? It would be awesome to test other models with. I am on your mailing list, but where do I need to sign to get that on a silver platter? ;-) Oooh, I found it. Thanks for putting it in your repo. Very valuable.

    @DannyGerst · 5 months ago
    • I just put the code in the description! Thanks for the call out

      @DataIndependent · 5 months ago
  • I just saw your graph for Claude 2.1. That was a lot of red. Also interesting that it was almost all 100% or 0%, with very little in between. For the failures, were you getting a lot of Claude claiming it couldn't do that kind of task? That's probably the most common response I get from Claude for any given task.

    @ashlynnantrobus5029 · 5 months ago
    • I'd get a lot of this type of response "Unfortunately, the context does not mention the most fun thing to do in San Francisco. It discusses the history and design of the Lisp programming language, web and mobile application development, and starting technology companies. There is no information provided about activities or attractions in San Francisco specifically. Without any relevant details to draw from, I cannot provide a direct answer to the question asked."

      @DataIndependent · 5 months ago
    • @DataIndependent So it's actually trying, but just looking for it like my kids do (I promise you, your shoes are not on the ceiling. You can stop looking there)

      @ashlynnantrobus5029 · 5 months ago
  • 👍

    @caiyu538 · 5 months ago
  • Can you link the paper for reference?

    @vaidyanathanag6463 · 5 months ago
    • Yep I put it in the description

      @DataIndependent · 5 months ago
  • 👍👍👍

    @hanzo_process · 10 days ago
  • What's "temperature" in AI language? Also, congratulations on Google using your needle-in-a-haystack method 👍

    @moxes8237 · 2 months ago
    • Thank you! It was 0

      @DataIndependent · 2 months ago
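
      For anyone new to the term: temperature controls sampling randomness, and 0 keeps the model close to deterministic, which is what you want for a repeatable benchmark. A sketch with the OpenAI Python client:

          from openai import OpenAI

          client = OpenAI()  # assumes OPENAI_API_KEY is set
          response = client.chat.completions.create(
              model="gpt-4-1106-preview",
              temperature=0,  # 0 = near-deterministic; higher adds randomness
              messages=[{"role": "user", "content": "Say hello."}],
          )
          print(response.choices[0].message.content)
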
  • Could you please add details on your website on how this can be done with Azure OpenAI? I tried the LangChain extraction chain with Azure OpenAI but I am not able to run it, and I had to resort to functions. Can you please tell me how I can effectively extract only certain data from the script, and also how to work with tabular data with LLMs?

    @Shoaibkhan-oj3oe · 5 months ago
    • Here's the code; you can fork it and make Azure the model provider: github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py

      @DataIndependent · 5 months ago
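
      A hedged sketch of that fork using LangChain's AzureChatOpenAI wrapper; the deployment name, endpoint, and API version below are placeholders for your own Azure resource:

          from langchain.chat_models import AzureChatOpenAI

          model = AzureChatOpenAI(
              deployment_name="my-gpt4-deployment",  # your Azure deployment
              openai_api_base="https://my-resource.openai.azure.com/",
              openai_api_version="2023-07-01-preview",
              openai_api_key="...",  # or set the Azure env variables instead
              temperature=0,
          )
          print(model.predict("What is the most fun thing to do in San Francisco?"))
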
  • Now try the same test, not with some random sentence but with content related to the rest of the document, which would be a more practical test. I uploaded a lengthy medical benefits document and the answers were close enough to a human's response.

    @sanz1996_ · 5 months ago
  • In practice, we've built artificial brains, and now we need neurologists (data scientists) like you to study them.

    @michelefruscella7373 · 5 months ago
  • Hey Greg, really nice video! I was wondering if I could help you enhance the editing in your videos and also make a highly engaging thumbnail, which will help your video reach a wider audience.

    @Divyv520 · 5 months ago
  • Hi Greg, I hope we will meet one day. One more thing: I am 20 years old, so given my age, what should I call you - Uncle, Bro, or anything else?

    @mayanksingh3366 · 5 months ago
    • Let's go with 'Greg'. Whenever anyone opens up with 'bro' I close the message

      @DataIndependent · 5 months ago
    • Call him Unclebro

      @scharlesworth93 · 5 months ago
  • Thank you - very useful.

    @micbab-vg2mu · 5 months ago
    • Thanks Micbab

      @DataIndependent · 5 months ago