The 5 Levels Of Text Splitting For Retrieval

May 13, 2024
41,270 views

Get Code: fullstackretrieval.com/
Get updates from me: mail.gregkamradt.com/
- ChunkViz: www.chunkviz.com/
Greg’s Info:
- Twitter: / gregkamradt
- Newsletter: mail.gregkamradt.com/
- Website: gregkamradt.com/
- LinkedIn: / gregkamradt
- Work with me: tiny.one/TEi2HhN
- Contact Me: Twitter DM, LinkedIn Message, or contact@dataindependent.com
Outline:
0:00 - Intro
3:42 - Theory
6:57 - Level 1: Character Split
16:04 - Level 2: Recursive Character Split
20:59 - Level 3: Document Specific Splitting
32:10 - Level 4: Semantic Splitting (With Embeddings)
48:02 - Level 5: Agentic Splitting
1:02:47 - Bonus Level: Alternative Representation

Comments
  • Both LangChain and Llama Index have added Semantic Chunking (level 4) to their libraries.
    LangChain: python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker
    Llama Index: llamahub.ai/l/llama-packs/llama-index-packs-node-parser-semantic-chunking?from=all

    @DataIndependent · 2 months ago
    • But the Semantic Chunker in LangChain only works with the OpenAI embedder, doesn't it? What I mean: is there a way to use an embedding model other than the OpenAI embedder?

      @GeorgAubele · 28 days ago
    • @GeorgAubele No, you can use your own: check out the docs and replace the embeddings engine you use

      @DataIndependent · 28 days ago
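A minimal sketch of that swap, assuming LangChain's experimental SemanticChunker and a Hugging Face sentence-transformers model (any class implementing LangChain's embeddings interface should work; exact import paths vary across LangChain versions):

```python
# Semantic chunking (level 4) with a non-OpenAI embedding model.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Swap in any LangChain-compatible embeddings object here.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # split where the distance between
)                                            # adjacent sentence embeddings spikes

docs = splitter.create_documents(["First topic sentence. More on that topic. Now something new."])
for doc in docs:
    print(doc.page_content)
```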
  • With the continuous influx of short form content, props to you for making this so interesting to watch. Didn't even realise it was an hour long. Loved every second of it. Thanks!

    @adityasankhla1433 · 9 days ago
  • First video I came across that actually explains LangChain in detail so that a layman can understand how it actually works

    @AshWickramasinghe · 4 months ago
    • Nice I love that - thank you!

      @DataIndependent · 4 months ago
  • Why did YouTube take so long to recommend me this channel? Incredible work!

    @stavroskyriakidis4839 · 2 months ago
    • Glad you're here my friend

      @DataIndependent · 1 month ago
  • What the ___. How good can a tutorial be? Such a gem of a video. Thx for making this. New to ML and found this very helpful

    @drakongames5417 · 27 days ago
  • Wow! I hadn't even thought about Agentic Chunking! I need to try this. I did some extensive experimentation with chunking on a project at work for a clinical knowledge base and I found that chunking strategies can make the difference between an ok retrieval and an awesome retrieval that works across a higher percentage of queries.

    @NadaaTaiyab · 25 days ago
  • I thought the explanation, and the way you showed your experimentation for semantic splitting, was creative. Thank you very much.

    @kenchang3456 · 3 months ago
  • Thanks for this Greg. I've been looking at agentic chunking for a while and this video really helped me with implementation. Not heard of you before I searched but now subbed. Thanks a lot :)

    @truthwillout2371 · 2 months ago
    • Awesome - love it thanks for sharing

      @DataIndependent · 2 months ago
  • Your channel is a gem, thank you!

    @Munk-tt6tz · 17 days ago
  • Human beings are always continuously learning. LLMs should have all the abilities that we have.

    @DarrenAllatt · 6 days ago
  • thank you so much! well done!

    @KaptainLuis · 3 months ago
  • Nice vid, Greg! You're on the cutting edge with some of these splitting techniques. Well done. 😎

    @andreyseas · 4 months ago
    • Thanks man - they were fun explorations

      @DataIndependent · 4 months ago
  • Extremely helpful, thanks for the great tutorial!

    @JelckedeBoer · 4 months ago
    • nice! thank you

      @DataIndependent · 4 months ago
  • Thanks Greg! Love the long form instructional video :D Greatly appreciated

    @srikanthganta7626 · 4 months ago
    • Awesome! Glad it worked out

      @DataIndependent · 4 months ago
  • Best chunking video to date.

    @oleksandr.brazhii · 3 months ago
  • You really deserve that like button. Really, thanks for this out-of-this-world content

    @chakerayachi8468 · 1 month ago
  • Hi Greg, many thanks for the work you put into this and to help all of us learn. Great clarity, depth and tempo! 💪

    @artislove491 · 2 months ago
    • Awesome thank you! The tempo part is good to hear because you never know

      @DataIndependent · 2 months ago
  • One useful technique for performing sentence embedding is to apply COREFERENCE RESOLUTION, with a cheap model like Haiku: """Identify pronouns, definite noun phrases, and other referring expressions in the text (for example "it", "he", "she", "they", "this", "that", "those", "their" etc.). Determine the antecedent or the entity to which each referring expression refers. Apply coreference resolution by replacing the referring expressions with their antecedents or a more explicit description of the entity."""

    @AdamTwardoch · 1 month ago
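A hedged sketch of that pass, using Anthropic's Python SDK with Claude 3 Haiku; the condensed prompt and model string are illustrative assumptions, and the grammar-simplification prompt in the follow-up below can be run through the same function:

```python
# Rewrite a chunk so each sentence stands alone, by resolving pronouns
# and other referring expressions with a cheap LLM.
import anthropic

COREF_PROMPT = (
    "Identify pronouns, definite noun phrases, and other referring expressions "
    'in the text (for example "it", "he", "she", "they", "this", "that"). '
    "Determine the antecedent of each referring expression, then replace each "
    "referring expression with its antecedent or a more explicit description "
    "of the entity. Return only the rewritten text.\n\nText:\n"
)

def resolve_coreferences(text: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{"role": "user", "content": COREF_PROMPT + text}],
    )
    return response.content[0].text

print(resolve_coreferences("Greg made a video. He covered chunking in it."))
# Expected style of output: "Greg made a video. Greg covered chunking in the video."
```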
    • This makes each sentence self-sufficient semantically, and you get huge improvements this way. This is useful for any kind of chunking.

      @AdamTwardoch · 1 month ago
    • Grammar simplification also may be useful: """Optimize the syntax and grammar. Identify syntactic and grammatical problems such as complex phrases, clauses, and sentence structures, as well as passive voice, embedded clauses, or convoluted sentence constructions. Break these syntactic and grammatical problems down into simpler, more concise sentences, and rephrase them using simpler structures, such as active voice or straightforward subject-verb-object constructions."""

      @AdamTwardoch · 1 month ago
    • Do you have any example notebook or code for this? Or any way I can contact you?

      @TalhaJSiam · 1 month ago
    • There is a spaCy module for coreference resolution called "neuralcoref" which works without an LLM. I wonder whether neuralcoref + a clustering algorithm, e.g. BERTopic, could replace the use of LLMs and make the process cheaper.

      @jlsachse · 20 days ago
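For comparison, a rough sketch of that LLM-free route; neuralcoref is unmaintained and, to my knowledge, only works with spaCy 2.x, so treat this as illustrative:

```python
# Coreference resolution without an LLM: spaCy 2.x + neuralcoref.
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # registers the doc._.coref_* extension attributes

doc = nlp("Greg made a video. He covered chunking in it.")
print(doc._.coref_resolved)  # text with referring expressions replaced by antecedents
```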
  • Hi Greg, thanks for the video. It's awesome to have someone publishing good content who's doing the exact same thing as me. Hope to see more videos on advanced topics like this!

    @Jonathan-rm6kt · 4 months ago
    • Awesome thank you Jonathan! What is the domain you're working in?

      @DataIndependent · 4 months ago
  • Love your videos, especially this one. The information density and presentation are off the charts. It is so altruistic of you to put this out there for free. I am especially interested in the semantic chunking. One use case is transcripts, which often have distinct conversation blocks or question-answer pairs. Since it is important to capture the question and answer for full context, I was wondering what methodology might work best. Alternatively, semantically chunking a document against pre-defined themes - sort of the opposite direction from the agentic chunker: first generate or define the overarching themes or buckets, then assign chunks to them. It seems that there is some real possibility in the semantic chunking methods. 🎉 Looking forward to experimenting more. Thank you again.

    @robxmccarthy · 3 months ago
    • Nice! For that one I actually recommend a slightly different method to explore. No idea if it'll work better for your use case, but it might. Check out this video where I do topic extraction from podcasts; I bet you could use this method and switch up the prompts a bit to pull out Q&A pairs w/ answers: kzhead.info/sun/o6mkqLaJfYB3pmw/bejne.html

      @DataIndependent · 3 months ago
  • Thank you very much! Great video!

    @GeorgAubele · 1 month ago
  • Great stuff!

    @christosmelissourgos2757 · 8 days ago
  • great lectures, great teacher

    @caiyu538 · 4 months ago
    • Thanks Caiyu!

      @DataIndependent · 4 months ago
  • Thanks! I was thinking about solving my own retrieval problem. I already have a small, crude proof of concept using just simple chunking, embedding, RAG, etc. Now I need to handle bigger user inputs that come in bigger PDF files. I thought of using agents to get around the context window; your agentic chunker is a good starting point and does make intuitive sense. I will try this route.

    @JunYamog · 3 months ago
  • Thank you.

    @AllDomainDefense · 1 month ago
  • Another banger hit from Greg! How does he do it? Love this video!

    @jessaco.8653 · 4 months ago
  • Amazing!! I am fascinated by how document-specific splitting or the bonus level also ties into how we structure our data schema, e.g. extracting metadata like "Introduction" in level 3, or applying a summary to the podcast and indexing that to then link to the raw clip in the bonus level. All amazing, super useful stuff. I am a bit skeptical of embedding-based splitting though; maybe I just need to dive in further! Mostly bullish on level 5: agentic splitting with multimodal LLMs that kind of blends levels 3 and 5

    @connorshorten6311 · 4 months ago
    • Awesome Connor I love the comment!

      @DataIndependent · 4 months ago
  • This video is a piece of art ❤

    @frimis · 2 months ago
    • Thanks Frimis

      @DataIndependent · 2 months ago
  • Buy you a cup of coffee? How about a Starbucks franchise! This is some very powerful material. Looking forward to implementing these into my pipelines! THANK YOU!

    @derekcarroll7904 · 3 months ago
  • Another great video - thank you :) In my case I need to try Semantic Splitting and Document Specific Splitting.

    @micbab-vg2mu · 4 months ago
    • Awesome, thanks Micbab

      @DataIndependent · 4 months ago
  • Love the upload

    @jj55222 · 4 months ago
    • Awesome thanks

      @DataIndependent · 4 months ago
  • Congrats, that's an excellent job! I hope you will continue your work, and more benchmarks will come. I am particularly curious whether the benefits of semantic & agentic chunking shrink or grow when code, HTML, or CSV is chunked.

    @DemoP.AUSSEIL-bb1ew · 3 months ago
  • This was next level

    @p3drocr · 1 month ago
  • Incredible! I love the approach to Semantic Splitting. I'm working on creating AI tools that will analyze customer interviews (i.e. founders or user researchers talking to customers and then using AI for the analysis/synthesis). In those transcripts, there are multiple speakers. I'm incorporating your approach here and trying to find a better way to chunk those transcripts by the topic of conversation. Thanks a ton for sharing your work!

    @BrianRhea · 3 months ago
    • Awesome, thank you Brian! Love it - I'm doing a ton of work on transcripts as well. This company was just shown to me, built around user research calls for consultants: www.myjunior.ai/

      @DataIndependent · 3 months ago
    • Any tips yet based on your findings? I've also been experimenting with semantic chunking of transcripts with somewhat mixed results.

      @robxmccarthy · 3 months ago
  • Wow, it has been a very long time since I made a comment. This content is outstanding! Thank you for creating such a great video.

    @furkandemirturk3646 · 3 months ago
    • heck ya! Thank you! glad to see you back on the comments

      @DataIndependent · 3 months ago
  • This video should have a million views already. Amazing work

    @MrSawaiz · 2 months ago
    • Thanks again sawaiz - text splitting, not sexy, but it's fun!

      @DataIndependent · 2 months ago
  • Fantastic!

    @leisdodigital · 1 month ago
    • Thanks Leisdo

      @DataIndependent · 1 month ago
  • Great video, starting out with naive and easy to understand methods of text chunking, ending up with novel ideas that may point to the future

    @paalhoff63 · 3 months ago
    • Awesome - thank you!

      @DataIndependent · 3 months ago
  • That agentic chunking really does sound like an interesting approach. How can we predefine the topics instead of them being automatically generated?

    @ahmadzaimhilmi · 1 month ago
  • That was great! Semantic and agentic ideas are definitely a way forward. Branching off that, here's a thought: building a meta-transformer that uses a classic transformer's multi-head attention to associate high-dimensional vectors between semantic chunks, giving more efficient parallel processing, capturing more nuanced relations between chunks, and macro-managing the splitting iteratively. GPT formatting:
    Proposed Meta-Transformer Approach:
    - Chunk-Level Semantic Analysis: the meta-transformer, as proposed, would operate on semantically split chunks, not just individual tokens.
    - High-Dimensional Semantic Space: each chunk (sequence of tokens) is mapped onto a high-dimensional semantic space.
    - Iterative Mapping for Optimal Chunking: through multi-head attention, the model would iteratively determine the best separation points for these chunks.

    @Arvolve · 4 months ago
    • That’s a fun idea - I’d love to see a demo or implementation if you share it out

      @DataIndependent · 4 months ago
  • Could attention be used here instead of the embeddings? Input every 2 sentences with overlap into an encoder. Above a certain threshold of "attention" from one sentence to another, have both in the same chunk

    @zinebbhr651 · 3 months ago
  • Fantastic video. Been thinking about level 5 - a brilliant way to approach chunking, and I see other applications. Level 4 is clever. Retrieving over synthetic representations, I believe, will become the standard as time moves on.

    @danielvalentine132 · 4 months ago
    • Totally agree

      @DataIndependent · 4 months ago
  • Thanks, excellent video about chunking strategies 👍 Question: can I store an HTML table pulled with unstructured in a vector database together with normal text, and ask questions over both (RAG)?

    @henkhbit5748 · 19 days ago
  • Great video. Some concepts in it overlap with the RAPTOR paper for RAG

    @cag6825 · 1 month ago
  • Great content. FYI - Google's Gemini models are built to be multi-modal from the outset, so they seem to overcome some of the challenges you mentioned when combining text and images.

    @RushyNova · 4 months ago
    • Awesome thanks Rushy - ya, I’m ready for a multi modal embedding model

      @DataIndependent · 4 months ago
  • Thanks!

    @TitanWellnessCenter2852 · 4 months ago
    • Woah, this is cool - I think it's my first tip. I appreciate it, and I will be enjoying In-N-Out animal-style fries with it

      @DataIndependent · 4 months ago
    • You're doing an amazing job. I have really enjoyed the hard work you have put in. Keep it up. @DataIndependent

      @TitanWellnessCenter2852 · 4 months ago
  • I really enjoyed this thanks. I’ve had good IRL business results with your tiers 2 and 3. I’ve used semantic search quite a bit and my jury is still out on the match score’s reliability to granular levels like decision-making breakpoints. So I would probably find tier 4 still more of an aspirational novelty. I like the concepts of 4 & 5 though on the more distant horizon. As an aside - the term “naive” a lot of folks are using lately in the Langchain llamaindex crowd makes me roll my eyes. It just smells like smug Silicon Valley 20-something (not specifically throwing shade at you I’m seeing it all over the place). If someone is chunking a set of documentation and the content is divided into topics by markdown tags they’d call your “tier 3” implementation naive even if it’s clearly the most practical way to chunk the data and achieve an outcome. I would love to see a term arise to discuss the simple-but-often-practical methods with less negative baggage.

    @ccapp3389 · 4 months ago
    • Nice! Thank you for the solid comment. Totally agree that 4 and 5 are experimental for now. It's really tough to beat the ROI on recursive character. Definitely open to a new word if one fits better

      @DataIndependent · 4 months ago
    • Perhaps a specific term isn’t even needed? They’re all just different methods that can add value in different scenarios. Some are useful for concept-level education, some are useful for practical implementations today, some are useful for future-state theory crafting 🤷‍♂️

      @ccapp3389 · 4 months ago
  • Love it

    @MrSawaiz · 2 months ago
    • thanks sawaiz

      @DataIndependent · 2 months ago
  • What a great video. It would have taken me forever if I were to research and learn more about this on my own. What a life saver. Do you have a video or a good resource about optimizing other RAG hyperparameters and about reranking of chunks?

    @krishnaprasad5874 · 2 months ago
    • Nope not yet, but there is more at FullStackRetrieval.com on RAG in general

      @DataIndependent · 2 months ago
  • @DataIndependent Hi Greg. Which type of splitting would you recommend when working with bank statements, invoices, balance sheets, etc.?

    @MuhammadDanyalKhan · 1 month ago
  • Gold.

    @shingyanyuen3420 · 2 months ago
    • nice!!

      @DataIndependent · 2 months ago
  • Theory & Importance of Text Splitting:
    - Context Limits: language models have limitations on the amount of data they can process at once. Splitting helps by breaking down large texts into manageable chunks.
    - Signal-to-Noise Ratio: providing focused information relevant to the task improves the model's accuracy and efficiency. Splitting eliminates unnecessary data, enhancing the signal-to-noise ratio.
    - Retrieval Optimization: splitting prepares data for effective retrieval, ensuring the model can easily access the necessary information for its task.
    Five Levels of Text Splitting:
    - Level 1: Character Splitting. Concept: dividing text based on a fixed number of characters. Pros: simplicity and ease of implementation. Cons: rigidity and disregard for text structure. Tools: LangChain's CharacterTextSplitter.
    - Level 2: Recursive Character Text Splitting. Concept: recursively splitting text using a hierarchy of separators like double new lines, new lines, spaces, and characters. Pros: leverages text structure (paragraphs) for more meaningful splits. Cons: may still split sentences if chunk size is too small. Tools: LangChain's RecursiveCharacterTextSplitter.
    - Level 3: Document Specific Splitting. Concept: tailoring splitting strategies to specific document types like markdown, Python code, JavaScript code, and PDFs. Pros: utilizes document structure (headers, functions, classes) for better grouping of similar information. Cons: requires specific splitters for different document types. Tools: LangChain's various document-specific splitters, Unstructured library for PDFs and images.
    - Level 4: Semantic Splitting. Concept: grouping text chunks based on their meaning and context using embedding comparisons. Pros: creates semantically coherent chunks, overcoming limitations of physical structure-based methods. Cons: requires more processing power and is computationally expensive. Methods: hierarchical clustering with positional reward, finding breakpoints between sequential sentences.
    - Level 5: Agentic Chunking. Concept: employing an agent-like system that iteratively decides whether new information belongs to an existing chunk or should initiate a new one. Pros: emulates human-like chunking with dynamic decision-making. Cons: highly experimental, slow, and computationally expensive. Tools: LangChain Hub prompts for proposition extraction, custom agentic chunker script.
    - Bonus Level: Alternative Representations. Concept: exploring ways to represent text beyond raw form for improved retrieval. Methods: multi-vector indexing (using summaries or hypothetical questions), parent document retrieval, graph structure extraction.
    Key Takeaways: the ideal splitting strategy depends on your specific task, data type, and desired outcome. Consider the trade-off between simplicity, accuracy, and computational cost when choosing a splitting method. Experiment with different techniques and evaluate their effectiveness for your application. Be mindful of future advancements in language models and chunking technologies.
    Further Exploration: Full Stack Retrieval website (tutorials, code examples, and resources for retrieval and chunking); LangChain library (text splitters, document loaders, and retrieval tools); Unstructured library (extracting information from PDFs and images); LlamaIndex library (alternative chunking and retrieval methods); research papers and articles on text splitting and retrieval.

    @nfaza80 · 15 days ago
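To make levels 1 and 2 of that summary concrete, a small sketch using the LangChain splitter classes the video names (import paths may differ across LangChain versions):

```python
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = (
    "Paragraph one develops a single idea, sentence after sentence.\n\n"
    "Paragraph two switches to a different idea entirely."
)

# Level 1: fixed-size character chunks, blind to any structure in the text.
level1 = CharacterTextSplitter(separator="", chunk_size=40, chunk_overlap=0)
print(level1.split_text(text))

# Level 2: recursively tries "\n\n", then "\n", then " ", then "", so chunks
# tend to align with paragraphs and words instead of cutting mid-word.
level2 = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
print(level2.split_text(text))
```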
  • Great tutorials. Are there any courses or books written by you? Your explanation is excellent. Thank you. Can you please share the code that was shown in the demo?

    @karthikb.s.k.4486 · 4 months ago
    • Check out fullstackretrieval.com for the code

      @DataIndependent · 4 months ago
  • I feel like my mind was blown, brought together, then blown again by the 'Level 4: Semantic Splitting' part of the video

    @Sylarleft · 1 month ago
    • Love it! Thanks for the comment

      @DataIndependent · 1 month ago
  • This might just work for my meeting transcripts. It's similar to something David Shapiro did, where knowledge base articles are written and then retrieved and updated during a conversation. I like the idea of using propositions and doing this at the article level.

    @nathank5140 · 2 months ago
  • If I could like this video twice, I would have

    @balkisdirahoui7622 · 2 months ago
    • Thank you!

      @DataIndependent · 2 months ago
  • Hi... Any suggestions on how we can handle large chunks? Some of the chunks have a token length greater than 4k!

    @Himanshu-gg6vo · 14 days ago
  • Hi Greg, nice video! As for level 4, did you consider using fine-tuned NLI models? I.e., combine 2 sentences if the model predicts an entailment/neutral relationship?

    @tomor3880 · 3 months ago
    • I did, but it seemed like way overkill for the tutorial scope. I'd like to explore that another time

      @DataIndependent · 3 months ago
  • Great video. Thanks for sharing. The level 5 implementation doesn't rewrite the proposition (e.g. it would still say "He likes walking", not "Greg likes walking"), or am I missing something!? I guess that would be another level of improvement? Any ideas on how to implement that rewrite?

    @actorjohanmatsfredkarlsson2293 · 4 months ago
    • Ah, the answer seems to be in the bonus part: use a graph retriever.

      @actorjohanmatsfredkarlsson2293 · 4 months ago
    • Hey, thanks for the comment. The first step of getting the propositions would remove any of the "he likes doing X". Or maybe I'm not understanding the question correctly

      @DataIndependent · 3 months ago
  • Great content! I have one quick question though: you said that typically you go with chunk sizes around 2000-4000 characters. But isn't that a problem for the embedding stage? I believe 4000 characters roughly corresponds to around 600-1000 tokens, and popular small-sized sentence transformers (for embedding purposes) typically have a context size around 512. What am I missing here? How do you meaningfully embed the long chunks? Any suggestions? Thanks in advance.

    @erdoganyildiz617 · 1 month ago
  • Nice video, next level of chunking! Are you planning to add a max_chunk_size soon?

    @IvanTsvetanov-yq7xu · 1 month ago
    • Hey, nope not at the moment, but it would be cool to add

      @DataIndependent · 1 month ago
  • Hello Greg, great video! Any chance we can get access to the agentic chunking code?

    @surajthakkar3420 · 4 months ago
    • It's in this repo! Which you can find at fullstackretrieval.com

      @DataIndependent · 4 months ago
    • Thank you for the reply. Maybe I'm just dumb, but I cannot find the link to the repository anywhere after signing up. I tried the search bar as well as Ctrl+F. It would be great if you could post it here. @DataIndependent

      @surajthakkar3420 · 4 months ago
    • Here ya go: github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb @surajthakkar3420

      @DataIndependent · 4 months ago
    • Thank you so much Greg! @DataIndependent

      @surajthakkar3420 · 4 months ago
  • So, as you are using LangChain and LlamaIndex - which do you prefer for which task? What are the pros and cons of each? I've also used both and have formed an opinion.

    @sticksen · 3 months ago
    • Nice! They both have pros and cons for different tasks. It's up to the dev and what they are most comfortable with

      @DataIndependent · 3 months ago
  • Hi sir, what is the best chunking method to process complex PDFs such as 10-K reports? 10-K reports have so many TABLES. How do I load those tables into vector DBs?

    @vijaybrock · 12 days ago
  • Agentic chunking is a paradox. We aim to split the document into concise units to eliminate the noise so that the LLM can generate better answers, but we are asking the LLM to figure out the concise units by dumping all the propositions on it.

    @AR_7333 · 4 months ago
    • Thanks for the comment! I take the other side of the argument: the correct chunks are task dependent, and creating those with character-based methods is too crude.

      @DataIndependent · 4 months ago
    • @DataIndependent I agree that creating chunks with character-based methods is a naive approach. But my concern is: won't the LLM suffer from the same difficulty in processing all the propositions to group the relevant ones together as it did when the entire document (without chunking) was given as context to the LLM?

      @AR_7333 · 4 months ago
  • Running the code, it always throws an error when using unstructured --> "No module named 'unstructured_inference.inference.elements'". Has anyone solved it?

    @ultracycling_vik · 3 months ago
  • Awesome. I'm trying to do a similar thing with semantic chunking on historic chat messages, but every new message that comes in means you have to redo the chunking. Can you think of a better way of chunking chat message history?

    @zugbob · 1 month ago
    • Instead of redoing all the chunks again, you could try finding which cluster the embedding is closest to and naively add it to that one?

      @DataIndependent · 1 month ago
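A rough sketch of that reply's idea; the centroid comparison and the 0.8 threshold are illustrative assumptions to tune, and the message vectors come from whatever embedding model you already use:

```python
# Assign each new chat message to the nearest existing topic cluster
# instead of re-chunking the whole history.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_message(msg_vec: np.ndarray,
                   clusters: list[list[np.ndarray]],
                   threshold: float = 0.8) -> int:
    """Return the index of the cluster the message joins (a new one if none fit)."""
    best_idx, best_sim = -1, -1.0
    for i, members in enumerate(clusters):
        centroid = np.mean(members, axis=0)   # cluster summary vector
        sim = cosine(msg_vec, centroid)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim >= threshold:
        clusters[best_idx].append(msg_vec)    # close enough: join this topic
        return best_idx
    clusters.append([msg_vec])                # otherwise start a new topic
    return len(clusters) - 1
```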
    • @DataIndependent Cheers, I did that at first but ended up doing something similar to the percentile method you mentioned. The issue was that overlapping, possibly unrelated messages threw off the cluster. I get the embedding of each new message and measure the similarity distance between messages; when a new message comes in, if the distance is above the 85th percentile it splits there (with a minimum of 4-5 messages in a cluster, with overlap).

      @zugbob · 29 days ago
  • Is there a way to dynamically change the chunk size? I have text that I want to split according to 4 anchors, let's say. The 4 anchors have x amount of text in between them. So chunk size can stay constant, and I'm trying to use regex to split the text.

    @Akimbofmg9_ · 3 months ago
    • Check out level 2 and specify your own separators, then the chunk size

      @DataIndependent · 3 months ago
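A small sketch of that suggestion; the API-log sample and the separator list are hypothetical stand-ins for the commenter's data:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

log_text = (
    "API: CreateFileW\nargs: path, mode\nret: 0\n\n"
    "API: ReadFile\nargs: handle, size\nret: 1"
)

# Try the most meaningful boundary first (the blank line between API calls),
# then fall back to line, space, and character splits.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=60,
    chunk_overlap=0,
)
print(splitter.split_text(log_text))
```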
    • @DataIndependent I'm sorry, I meant to say chunk size cannot stay constant. This is for API call sequences from Windows executables. They have varied names and argument sizes, but they do have module name, API name, arguments, and return values as constant fields. The actual text in each field (args, return value, etc.) can vary according to the specific API.

      @Akimbofmg9_ · 3 months ago
  • Does anyone have an example of agentic chunking (level 5) in JavaScript?

    @joshlopez7727 · 2 months ago
    • I bet you could feed the agentic chunking Python code into Gemini (or Claude 3) and get a pretty good starting point to make it yourself

      @DataIndependent · 2 months ago
  • Has anyone solved this issue when running the function partition_pdf()? I get this error: module 'PIL.Image' has no attribute 'LINEAR'

    @datagus · 2 months ago
    • I would try upgrading all packages

      @DataIndependent · 2 months ago
  • When I try to run the same code for reading tables from a PDF and saving images from a PDF, my kernel shuts down and gives a message that it will restart. How do I overcome this? Thanks

    @shuvobarman9294 · 2 months ago
    • weird - I haven't seen that one before. I would double check that all packages are up to date

      @DataIndependent · 2 months ago
    • I have tried the same code in Google Colab, and it's working just fine. The issue was with my Anaconda environment, it seems. Thanks a lot for creating such an in-depth video. Learned a lot.

      @shuvobarman9294 · 2 months ago
  • How would you suggest splitting legal documents?

    @ivant_true · 4 months ago
    • Depends on the format - assuming they're PDFs, probably starting with unstructured as a proof of concept

      @DataIndependent · 4 months ago
    • @DataIndependent They are HTMLs and PDFs. unstructured is for parsing the documents themselves, I guess; I was asking about splitters (to further split those documents)

      @ivant_true · 4 months ago
  • OMG this video is so precious to me ❤. I am a web dev and started studying LLMs a few days ago. I had no idea about "splitting", "embeddings", "retrieval", etc. You explained them really well here. Thanks!

    @54peace · 1 month ago
  • Just found the channel and I already know it's gonna be big. Remember me when you're famous

    @eyemazed · 4 months ago
    • The goal is to keep putting out good work - thanks Eye!

      @DataIndependent · 4 months ago
  • Maybe I'm missing something here, but isn't there the risk that in the Date&Time chunk you end up getting a chunk that reads "The month is October" and another that reads "The year is 2023" but that don't relate to the same context in the original source? In other words, are we sure that those wouldn't be placed together unless it was actually October 2023 in the source?

    @RobertoFabrizi · 4 months ago
    • Because we use propositions, the hypothesis and aim is that if Oct 2023 was related to something, then the proposition would include that something. So for "It was October 2023. Bob was 18," the proposition would be "Bob was 18 in Oct 2023." I think that is what you're referring to?

      @DataIndependent · 4 months ago
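For context, a sketch of that proposition step, assuming the LangChain Hub prompt used in the video (wfh/proposal-indexing) and an OpenAI chat model; the "input" key follows the notebook and is an assumption here:

```python
from langchain import hub
from langchain_openai import ChatOpenAI

prompt = hub.pull("wfh/proposal-indexing")      # proposition-extraction prompt
llm = ChatOpenAI(model="gpt-4", temperature=0)
runnable = prompt | llm

paragraph = "It was October 2023. Bob was 18."
print(runnable.invoke({"input": paragraph}).content)
# Expect decontextualized propositions such as "Bob was 18 in October 2023."
```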
    • Yeah, I had a similar feeling, but the wfh prompt template asks the LLM to break sentences into propositions. Greg then fed it by paragraph, so the assumption is the paragraph has a direct relation. It would be good to test this against a paragraph that vaguely makes two related claims. Example: "The year is 2023 and Jon had many gatherings. Jon celebrated his birthday last month in July. Jon's favorite birthday party was in 2006."

      @jonintc · 4 months ago
    • @jonintc Definitely valid points. I did a paragraph because it was a manageable chunk. The example in the prompt is much longer and more complicated, so you could experiment with increasing this window size.

      @DataIndependent · 4 months ago
    • Wouldn't it make sense to use propositional hierarchical chunking, where each embedding vector includes indexing to the document, section, and paragraph? That way, propositions that seem unrelated but relate to the same discussion are mapped close together on some axis.

      @OccamsPlasmaGun · 3 months ago
    • @OccamsPlasmaGun Definitely worth a try! Sounds like a good experiment

      @DataIndependent · 3 months ago
  • Cool ideas. Another way is KeyBERT until keyword repetition stops

    @kyunglee1924 · 4 months ago
    • Cool thanks for sharing

      @DataIndependent · 4 months ago
  • Unstructured is a great package... but very, very slow. If you need to work with big PDFs, you will need 10+ minutes to extract all images and tables...

    @MrDespik · 4 months ago
    • Ooo - have you found an alternative method that works better?

      @DataIndependent · 4 months ago
    • @DataIndependent I am trying to extract tables with the same approach that unstructured uses: detect tables with detectron2, crop the table image, and extract table info with Table Transformer from Microsoft. Or we can try to do everything with Table Transformer. There is a great package, llmsherpa... but they provide an API, so it is difficult to use in production, because I don't want to send out client documents

      @MrDespik · 4 months ago
    • I know that unstructured provides an API and it is possible to work with this API asynchronously, but I haven't tried it yet.

      @MrDespik · 4 months ago
    • Ah, and of course, Azure Document Intelligence extracts tables really well. And it can chunk text too.

      @MrDespik · 4 months ago
  • Hey Greg, do you have a Discord?

    @joseph_thacker · 4 months ago
    • Nope! I don’t want to manage a community unless it’s curated and focused and it is not a priority for my energy right now. Maybe in the future

      @DataIndependent · 4 months ago