There's More To Retrieval Than Vector Stores

May 13, 2024
6,073 views

What I'm Researching In AI: mail.gregkamradt.com/signup
Twitter: @gregkamradt
Colin's Article: colinharman.substack.com/p/be...
0:00 - Intro
0:31 - Relevancy Is Everything
1:15 - The Problem
1:53 - The Assumption
2:02 - Startup Differentiation
2:51 - Greg's Recommendation
3:36 - The Wrap Up

Comments
  • Hi Greg, another well structured and insightful video, thanks. I have been experimenting with practical large custom KB-type apps and would 100% agree. What I have settled on now is a combined, multi-layered approach incorporating standard relational DB lookups and vector stores, which seems promising. Also, by extracting keywords and other metadata on ingestion and adding these to the vector embeds, it is possible to do relational DB queries as a "first pass", then use those results to filter the vector search retrieval through metadata, providing a more fine-tuned context for the semantic search before passing it all to the LLM. I have also recently developed a technique to parse PDFs by semantic structure (i.e. headers -> sections) and use this to split docs for ingestion, as opposed to arbitrary size-based chunking. This further improves relevance and semantic coherence within the semantic search and retrieval, and ultimately the LLM context, while also leveraging the new 16k token window. I will hopefully be able to share the basics of the technique soon, if you are interested?

    @sethhavens1574 · 10 months ago
    • First off, thanks for the awesome comment and sharing! Second off, yes! I think both would be interesting to read about, but I'm a big fan of the 2nd (chunking) because I see this problem all over.

      @DataIndependent · 10 months ago
    • ok dude no probs - what is the best way to contact you to share the repo when ready?

      @sethhavens1574 · 10 months ago
    • @sethhavens1574 I'm interested too, and I need this for a non-profit project. Please do share!

      @MarxOrx · 7 months ago
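The two-pass approach described in this thread (relational metadata filter first, semantic ranking second) can be sketched in a few lines. This is a minimal illustration, not the commenter's actual implementation: the record fields, filter predicate, and toy embedding vectors are all invented for the example.

```python
import math

# Toy corpus: each record carries metadata (for the relational first pass)
# and a pre-computed embedding (for the semantic second pass).
corpus = [
    {"id": 1, "doc_type": "report",  "year": 2023, "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "doc_type": "invoice", "year": 2023, "embedding": [0.8, 0.2, 0.1]},
    {"id": 3, "doc_type": "report",  "year": 2021, "embedding": [0.1, 0.9, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_type, year, k=2):
    # First pass: exact metadata filter, as a relational DB query would do.
    candidates = [r for r in corpus if r["doc_type"] == doc_type and r["year"] == year]
    # Second pass: semantic ranking restricted to the filtered candidates.
    return sorted(candidates, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)[:k]

hits = retrieve([1.0, 0.0, 0.0], doc_type="report", year=2023)
```

In a real system the first pass would be a SQL `WHERE` clause (or a vector store's metadata pre-filter) and the embeddings would come from a model, but the shape of the pipeline is the same.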
  • Well, there is also the consideration of integrating both, then depending on the task, mixing and matching them - or specifically using semantic search vs. traditional keywords. Many vector stores allow pre-filtering the data based on metadata before the semantic search is performed; this is an example of combining traditional keyword lookup with similarity search. Since it is not much of a hassle to implement, it's worthwhile to keep the tool in mind - large search engines like Google have been using it for years; they just never used it for AI-augmented search until now, along with other tech giants. I'd say the few places where a vector database shouldn't be used are when exact data is expected, but as soon as context is a variable intent in search, keyword lookup alone is not good enough.

    @TheMaxiviper117 · 10 months ago
    • Totally agree - good call

      @DataIndependent · 10 months ago
  • Absolutely agree - for some use cases, a traditional database query will be sufficient.

    @changmianwang7414 · 9 months ago
  • Great explanation.

    @matthew_berman · 10 months ago
    • Awesome thanks Matthew

      @DataIndependent · 10 months ago
  • Great video - the technique that you described works perfectly for me.

    @micbab-vg2mu · 10 months ago
    • Nice!

      @DataIndependent · 10 months ago
  • Definitely agree with this - I fell into this trap while developing and at one point I was like “Wait, why do I need this to be stored in the first place…”

    @tajwarakmal · 10 months ago
  • Short video that packs an important point. For context, I have built a production feature for our product using OpenAI APIs. When scoping the build, it was confusing for a minute to see how vector stores were everywhere in resources. It's a powerful tool but it has been oversold.

    @akashtandon1 · 9 months ago
  • Agreed, it's important to look at the problem you're solving before committing to any tech stack. Many problems simply require relational, precise data. Vector stores are a great fit for quickly building something that needs fuzzy semantic retrieval (though BM25 is surprisingly effective and, like you say, cheap!). You can go a long way by just "stuffing" top-k results into the prompt, but the limitations will soon become obvious - mainly the shallow data representation and lack of control. This could be addressed somewhat by adding metadata + filters on top of the vector index, but that feels like a band-aid to me. All of this makes me excited for the future of this space!

    @tomwalczak4992 · 10 months ago
    • Nice! I totally agree. Retrieval as a whole is on my top list of interesting areas in AI right now. IMO our bottleneck right now isn't GPT-4 or reasoning, but rather the superstructures around it - of which good retrieval is 75% of the limiting factor.

      @DataIndependent · 10 months ago
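The "BM25 is surprisingly effective and cheap" point from this thread is easy to demonstrate: the whole scoring function fits in a few lines of plain Python. This is a minimal sketch with the usual k1/b defaults, naive whitespace tokenization, and an invented toy corpus - not a production implementation.

```python
import math
from collections import Counter

# Toy corpus; tokenization is naive whitespace splitting, for illustration only.
docs = [
    "the quick brown fox",
    "vector stores hold embeddings",
    "keyword search with bm25 is cheap and effective",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(t for d in tokenized for t in set(d))  # document frequency per term

def bm25_score(query, doc, k1=1.5, b=0.75):
    # Okapi BM25: idf-weighted, length-normalized term-frequency saturation.
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

best = max(range(N), key=lambda i: bm25_score("cheap keyword search", tokenized[i]))
```

The top-scoring documents are what you would "stuff" into the prompt as context; no embedding model or vector store is involved.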
  • Agree. Retrieval must be a mix of tech approaches: semantic/embeddings, lexical search, structured SQL type queries, etc...

    @cuttheace6407 · 10 months ago
      • Totally! Start with the problem first, then the solution - not the other way around.

      @DataIndependent · 10 months ago
  • Well, it's clear enough that there is a lot of hype in tech; we've all experienced it. This video would add more value if it dove deeper into *why* vector stores aren't a silver bullet - the pros, the cons, the use cases. It would be good to understand when exactly to use them, when not to, what the alternatives are, etc.

    @anatoli25236 · 7 months ago
    • Nice! Thanks for the feedback on it

      @DataIndependent · 7 months ago
  • Hi Greg, I'm new to LLMs and found your videos super helpful! Recently I'm working on a use case that summarizes the top X frequent topics from a long text file (each line is a topic). The approach I took was to put it into a vector store and use RetrievalQA from LangChain. But it responded based on a limited number of source documents (4 by default), which is not a good summary of the whole file. I wonder if there's any better approach for this. Thanks!

    @wilsonchu5853 · 10 months ago
    • Hm, that sounds like a fun project. I haven't seen a great way to do that yet. I would need to check out the data to give a better recommendation

      @DataIndependent · 10 months ago
  • Please add some tutorials about combining Knowledge Graphs with LLMs. Thanks in advance!

    @ahmadalis1517 · 10 months ago
  • My boss insisted I store CSV/Data Frames in vectors 😂 He claimed he was trying to be “visionary”.

    @WANHandler · 8 months ago
    • Ha! Yes - exactly

      @DataIndependent · 8 months ago
  • With the right workload, vector stores have improved my latency, upload and indexing time, etc. Otherwise, what are your thoughts on the LangChain SQL agent's potential? But dear god, keep LLMs out of INSERT operations. :-) Can you imagine?

    @collectivelogic · 10 months ago
    • Would be cool for discovery and natural sql db interaction... From a read-only perspective 😅

      @dualityninja · 10 months ago
    • I like the potential of them, but totally depends on the problem at hand. I haven’t experimented with them too much

      @DataIndependent · 10 months ago
      • @DataIndependent A good use case is as a replacement for old-school ways of building data dashboards and data-insight reports. Natural language to query and analyze existing data for additional insight; better and more relevant correlation is possible too if the right models are trained and integrated. Build diagram and chart generation into chat, and have a good time. Amazing for fraud and accounting analysis and contextual cross-referencing (useful in larger multifaceted organizations). All this allows you to create different dimensions on existing data... worst case, it's great for training with existing data. Customer service bots and direct integration for self-help. HR-related matters, without outputting PII data to other sources. Use temporary memory vectors for conversation. Etc.

      @collectivelogic · 10 months ago
  • Totally Agreed bruh...

    @10points56 · 10 months ago
  • Could you do an example where vector stores are misplaced? For any company-related LLM question I immediately think about vector stores. thx and br

    @DanielWeikert · 10 months ago
    • Depends - what kinds of company related questions? Something like "what is apple's revenue in 2022?" doesn't need a vectorstore

      @DataIndependent · 10 months ago
  • I was wondering if there are approaches where the context is not only retrieved based on semantic (embedding wise) similarity to the search query but from a kind of knowledge graph? I can think of a lot of use cases where the relevant pieces of information might not be retrievable by simple semantic embedding distance approach.

    @TerenceChill286 · 10 months ago
    • Llama Index has advanced structures like that: gpt-index.readthedocs.io/en/latest/reference/indices/kg.html

      @DataIndependent · 10 months ago
    • @DataIndependent Awesome, thanks a lot

      @TerenceChill286 · 10 months ago
  • Greg, how has the quality of retrieval been among vector databases? (At least among the ones you’ve used). Is there a clear difference in quality? Or are they all fairly close? Also, great video as always. 😊

    @real23lions · 10 months ago
    • It's been all the same for me, no clear winners yet. But there will be differentiation soon. The limiting factor of retrieval has been my ability to code up a more robust solution rather than the DB itself

      @DataIndependent · 10 months ago
  • Fully agree with that 💯. I am implementing my own vector store over a SQLite database - it works amazingly well with my small-dataset application (thousands of records so far).

    @itstartsnow7 · 10 months ago
    • Nice! Thanks for sharing

      @DataIndependent · 10 months ago
  • I am waiting for a new video from you about fine-tuning GPT-3.5. Thanks in advance!

    @ahmadalis1517 · 8 months ago
    • Working on it now - starting my exploration to match tone: twitter.com/GregKamradt/status/1695123517766078534

      @DataIndependent · 8 months ago
  • Totally agree. There is no substitute for understanding a customer's needs first. Otherwise you end up with a solution looking for a problem to solve. But perversely, it does solve the problem of company valuation.

    @cstan2381 · 10 months ago
    • Big time - I want to make a whole video reinforcing this point. The same tenets of product development still apply.

      @DataIndependent · 10 months ago
  • I get your point and I agree. But we are talking about the usage of LLMs. Of course, if the data is in a database which is ready to be searched by the LLM, then there's no need for a second vector storage. But keep in mind, the LLM itself relies on huge vector storage. It can only analyze your structured database because it knows so much "embedded language". So in some way, embeddings and vector storage are the foundation of LLMs. The hype is bullshit, I agree. But large quantities of data in the form of text/PDFs etc. (which LLMs are good at) can only be parsed as embeddings in vector stores. Why is this important? Because the new use cases for data analysis arise exactly from this (mostly) new possibility to investigate large pieces of text documents, like contracts. Sure, you can also use an LLM to analyze well-structured data, and it will be easier, but those are not the main use cases.

    @GI002 · 10 months ago
    • First off, thank you for the in-depth comment - this is the fun part, talking to people in the industry. > can only be parsed as embeddings in vectorstores - I would slightly disagree and say it depends on the project at hand. Totally agree that there is a new world of data unlocked w/ embeddings. What I was trying to argue is that it is not the only way, and one should start w/ the problem they're trying to solve first.

      @DataIndependent · 10 months ago
  • Agree Greg. I think we got a lot of new people into the data space thanks to AI (LLMs). As such, many of them just don't know the whole data search and retrieval environment. They just know the vector stores and ChatGPT

    @tubingphd · 10 months ago
    • What’s the best way to get the word out? More projects that highlight other types of search?

      @DataIndependent · 10 months ago
  • Another challenge for enterprise apps is usage and ownership of data. Pinecone specifically states you give them the right to "access, use, reproduce, modify and display the Customer Data", and goes on to say that they own your aggregated data: "Customer further agrees that Pinecone shall OWN such Aggregated and Anonymous Data and may retain, use and disclose such data for any lawful business purpose, including to improve its products and services." If you are using a vector database for large amounts of corporate documentation, how exactly is your data being made anonymous when the entire body contains proprietary and identifiable data? Kudos to Pinecone for being very transparent about this, but it is a challenge for serious business apps to use vector DBs.

    @TimeWasted8675309 · 10 months ago
    • Wow I didn’t know that - I feel like that isn’t well known. Thanks for raising that

      @DataIndependent · 10 months ago
    • Can you share the pinecone link where it says that?

      @insigh01 · 10 months ago
    • @insigh01 Hmm, tried 3 times to post the link but it keeps disappearing. It is pinecone DOT io SLASH user-agreement. Maybe that will work.

      @TimeWasted8675309 · 10 months ago
  • Are vector stores suitable for NER? I tried using them when extracting book titles from a transcript, but it didn't work well.

    @ahmadalis1517 · 10 months ago
    • If you want to extract book titles from transcripts check out my "topic modeling" video. That method may help. I wouldn't use vectors for that task

      @DataIndependent · 10 months ago
    • @DataIndependent Of course! I've learned and am still learning a lot from your tutorials, especially on topic modeling, summarization, and deploying to a Streamlit app.

      @ahmadalis1517 · 10 months ago
  • I'm interested in the concept of llama-index's tree index as an alternative to a vector store, but haven't seen much content on it. If you're looking for content ideas, it would be good to go over the pros/cons of different alternatives like the tree index.

    @nathancanbereached · 10 months ago
    • Nice - thanks for the ideas. I'll check it out

      @DataIndependent · 10 months ago
  • I don't think these startups are "evil" or "lying", but if the only tool you sell is a hammer, you'll see every problem as a nail. And most content creators kinda latched on because "it worked for someone before" and "it works in this use case". In my case, I want to make a chatbot that takes a wiki that no one has time to properly tag for semantic search, a daily newsletter, and a lot of PDF manuals, and gives people summaries and points to the right page when someone needs it. So far only a vector DB has managed to take this chaotic mix and return useful information with a minimum of success. Normal search engines get stuck on keywords and fail to rate relevance automatically, and a QA index needs someone actively maintaining it. But if there's an easier or cheaper automated way of doing it, please do a video about it. You already have my view and like when you do.

    @dukemagus · 10 months ago
    • Nice! For that use case I totally agree a vector store is a good way to get the MVP going.

      @DataIndependent · 10 months ago
    • Also, thank you very much for the follow and thoughtful comment - this is the fun part of talking to people in the industry

      @DataIndependent · 10 months ago
  • I am finishing up a small events app project and my experience started out with 'AI, vector, yada...' yet ended up with a more traditional implementation. I did use AI effectively, just not as first envisioned.

    @gr8tbigtreehugger · 10 months ago
    • Nice! That's a solid case study, thanks for sharing

      @DataIndependent · 10 months ago
  • I didn’t hear any argument against vectoring. I heard bad implementation is an issue, but that can be said for any tech. No real reasons given against them. They are indeed a novel way to group near/like data easily.

    @shannons1886 · 10 months ago
    • Totally - the main point I was trying to emphasize is that the market would lead you to believe it's the only solution for AI apps. I love vectors, but they're only a single tool in the toolbox.

      @DataIndependent · 10 months ago
  • Hey, I completely agree with not using a vector store on a database, but what about if we have 300 PDF files, each of 300 pages? Should we go with OpenAI embeddings or offline embeddings?

    @prashantpandya5397 · 10 months ago
    • Depends on the problem you’re trying to solve. If you want all paragraphs that contain the word “dog” then no need for vector search. If you wanted all paragraphs connected to the idea of animals then vectors come into play

      @DataIndependent · 10 months ago
    • @DataIndependent What if you have some financially relevant and bond data? And yes, definitely paragraphs.

      @prashantpandya5397 · 10 months ago
    • OpenAI vs. offline embedding is not the right question. The question is whether embedding is the correct approach at all. What are you trying to accomplish?

      @cuttheace6407 · 10 months ago
    • @cuttheace6407 A chatbot that can go through all the PDF text data; if I ask questions about how much revenue each stream is making, it can go in depth into those PDF files.

      @prashantpandya5397 · 10 months ago
    • You should find good results with data chunking and embedding. You have to figure out: which embeddings tool or API will you use? OpenAI embeddings are good IMO and are not expensive for calculating embeddings for a few documents here and there. Then you have to figure out where to store the embeddings: Pinecone, Weaviate, or a local solution like Chroma. If you are just running this for your own benefit and not trying to put it on a server for public use, just use a local vector store.

      @cuttheace6407 · 10 months ago
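The "data chunking" step recommended in this thread is often just a fixed-size window with overlap. Here is a minimal sketch; sizes are in characters for simplicity (real pipelines usually count tokens), and the numbers are illustrative defaults, not a recommendation.

```python
# Fixed-size chunker with overlap: consecutive chunks share `overlap`
# characters so that sentences cut at a boundary still appear whole
# in at least one chunk.
def chunk_text(text, chunk_size=200, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping stride
    return chunks

pieces = chunk_text("x" * 500, chunk_size=200, overlap=50)
```

Each chunk would then be embedded and stored (in Chroma, Pinecone, etc.) with metadata pointing back to its source document. The semantic splitting by headers mentioned earlier in the comments is a smarter variant of this same step.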
  • Nothing wrong with vector stores, imho, for unstructured data or multi-modal (semantic) data combining text and images. For pure structured data they are overkill. I prefer a hybrid approach combining semantic (embedding vectors), lexical (e.g. keyword BM25), and structured (relational) data - either as an add-on to already existing relational databases, or solutions that incorporate these features by design.

    @toddnedd2138 · 10 months ago
    • I'm with you on the hybrid approach

      @DataIndependent · 10 months ago
  • The audacity of VC's is kinda baffling.🤦🏾‍♂

    @hiranga · 10 months ago
    • They have an incentive to call out vector DBs as its own category. The feedback loop works for them.

      @DataIndependent · 10 months ago
  • Seems to me vector stores are only useful if you want them to do the similarity search for you, and you have enough vectors that brute-force search against every vector is not fast enough. If not, store your vectors locally (CSV? Pickle?) and iterate with cosine similarity - done.

    @willsmithorg · 10 months ago
    • Ya, that's not a bad approach! Totally depends on the type of app you're trying to build

      @DataIndependent · 10 months ago
    • @DataIndependent Thanks for the quick reply. Although, using Chroma in LangChain is a few-line no-brainer, easier to code than pickle/CSV.

      @willsmithorg · 10 months ago
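The brute-force approach from this thread - keep the vectors in a local file and iterate with cosine similarity - is genuinely a few lines. A minimal sketch with invented document names and toy 2-D vectors; in practice the vectors would come from an embedding model and be loaded from CSV or pickle.

```python
import math

# Locally stored vectors, keyed by document id - no vector database involved.
stored = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.1, 0.9],
    "doc_c": [0.7, 0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, k=1):
    # Brute force: score every stored vector against the query, take the top k.
    ranked = sorted(stored, key=lambda name: cosine(query, stored[name]), reverse=True)
    return ranked[:k]

top = nearest([1.0, 0.0], k=2)
```

This is O(n) per query, which is exactly the trade-off the commenter describes: fine until the corpus grows large enough that scanning every vector becomes the bottleneck, at which point an approximate-nearest-neighbor index (what vector stores provide) starts to pay for itself.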
  • We went through all this when MongoDB and NoSQL were the shiny new object that would solve all scalability problems.

    @christinefeaster4102 · 10 months ago
    • ;) How’d the story end?

      @DataIndependent · 10 months ago
  • Thank you for calling this out. It's just blatant marketing.

    @annaczgli2983 · 10 months ago
    • Thanks! It'll pass eventually

      @DataIndependent · 10 months ago
  • Audio and video are not in sync - it's like a robot.

    @ayoubelmhamdi7920 · 10 months ago
    • ooo whoops!

      @DataIndependent · 10 months ago
  • I've found vector retrieval to be very hit and miss. Sometimes it gets the right answers, sometimes it doesn't. Way too unreliable.

    @yellowboat8773 · 10 months ago
    • Ya - totally depends on the answer or target chunk you're trying to look up

      @DataIndependent · 10 months ago
  • Brother, you spent the entire video bashing people using vector stores for retrieval, acting all smug af, and didn't even explain why, nor did you propose alternative solutions for retrieval. Are you okay? What was the point of hitting record on that camera exactly? To act smug?

    @azero79 · 10 months ago
    • Yo! Thanks for the feedback! What are you building? How can I help?

      @DataIndependent · 10 months ago
    • Did you suggest alternatives to vector stores, though? I was genuinely curious about your title claim.

    @kongweiying3892 · 10 months ago
    • @kongweiying3892 I would push back and claim that the question is "what other types of retrieval do you like?" vs. "how do you replace vector stores?" Totally agree that vector stores have a place in our hearts, especially when doing anything with semantic reasoning. But a bunch of other retrieval mechanisms out there will do fine for non-semantic searches.

      @DataIndependent · 10 months ago