Pinecone LangChain - Questions/Answer on Your Own TXT/PDF Files - Code in 9 Minutes!
This video guides you through the basics of loading a custom TXT and a PDF file externally into Pinecone as embeddings (vectors). It also covers the basics of querying your custom TXT/PDF file to get answers back (semantic search) from the Pinecone vector database, via the OpenAI LLM API. Using LLMs to query your own data is a powerful way to become operationally efficient at tasks that require looking up information in large documents.
Watch: Tutorial for "OpenAI Function Calling API + LangChain Bot" : • LangChain OpenAI Funct...
Playlist for all LangChain Tutorials: • LangChain OpenAI API -...
Thanks for watching! 🙏
😃 SUBSCRIBE 🌟 👍 LIKE 🌟 💬 COMMENT 🌟 SHARE
▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Resources and Support ▬▬▬▬▬▬▬▬▬▬▬
☕ Buy me a Coffee: ko-fi.com/goodaitechnology
📕 Github Tutorial Repo: github.com/GoodAITechnology/L...
📖 This Notebook: github.com/GoodAITechnology/L...
🍿 Watch LangChain Tutorial Playlist: • LangChain OpenAI API -...
🍿 Watch Q/A your own data (Multiple PDF Files): • Pinecone LangChain - R...
🍿 Watch Open AI Function Calling (Weather Bot): • LangChain OpenAI Funct...
🍿 Watch LangChain Constitutional AI Tutorial (Try Prevent Prompt Hacking): • 👩🚒 LangChain Prompt I...
Thanks for watching! 😃
Watch LangChain Tutorials:
- OpenAI Function Calling with Code (Build Your Own Weather Bot): kzhead.info/sun/fdJvqs6eqpuAq68/bejne.html
- OpenAI Talk to GPT using LangChain Basics: kzhead.info/sun/a6qSirSaimOPi2w/bejne.html
Git Repo:
- OpenAI LangChain Tutorials: github.com/GoodAITechnology/LangChain-Tutorials
Hi! Thanks for the video. What is the embed_model = 'text-embedding-ada-002' variable used for?
You are welcome! 'text-embedding-ada-002' is the name of the OpenAI embedding model that we use to create embeddings from our book text data and from the query. For example, it is used in this call where we create the Pinecone vectorstore: book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name). You can check out all the OpenAI embedding models available here: platform.openai.com/docs/guides/embeddings/what-are-embeddings
In the second example with the PDF, you get the PDF content into the `book_texts` variable, but after that you don't update Pinecone with this `book_texts` data (i.e., book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name)). The `book_docsearch` still contains the older Pinecone object for the TXT file. So how is it giving results from the PDF? Please explain/clarify.
Great video! One problem I'm facing while creating book_docsearch: if I run the cell again, it generates the embeddings again. Is there a way to reuse the embeddings already stored, if I just want to do inference?
Thanks! Try this method: Pinecone.from_existing_index(index_name, embeddings)
Can we have a meet call to discuss potential collaboration opportunities?
Great video! Even though I had no clue about this, it helped me. I didn't quite understand it fully, but I'm sure this will help people in need. The music is a bit too loud though.
Thanks, I’m glad you found it useful! And thanks for the feedback on the music. I’ve gotten similar feedback before, and I’m going to try a lower volume/different music on upcoming videos 😊
Thanks! You upsert the txt docs; is it the same for uploading a PDF, just using .pdf instead of .txt in the script?
The embeddings that you create from your text or PDF documents are upserted to the vector DB.
Very interesting, I'm eager to learn about this, but I only have a very basic understanding of Python lol! What videos do you recommend I watch before this one? It would be of great help! :)
Thanks! I am very excited that you're interested in this and in coding. There are so many resources online to learn from, both videos and courses. Watch/read those, and to augment them, I'd say the best way is to learn by doing small projects like this one. I find that while doing a project, you come across a concept or something you don't know or understand and need to research; in the process you grow your knowledge, and at the end you have a sense of accomplishment. Here are some resources you might find useful: www.coursera.org/ www.w3schools.com/python/ You can download Python/Anaconda etc. locally for this purpose. Alternatively, you can use Google Colab: colab.research.google.com/ I hope this helps. Happy learning! And thank you for watching :)
@@goodaitechnology Thank you for this! Means a lot. Yes I agree that you learn way faster by actually doing and getting your hands dirty.
@goodaitechnology How do I get the page numbers the data is picked from?
This video might be helpful; it walks through how to query multiple files and cite sources along with page numbers: kzhead.info/sun/a7ZrcdaugGusnqM/bejne.html
Can you please make a video on how to deploy it on a website, so that customers can chat with my PDF data and get answers quickly?
Thank you so much for this video! So I’m having a problem: every time I run my script it says the token size is too large. But I have it at around 1500 tokens with a 50-token overlap, I uploaded a small PDF, and I asked a very simple question. Yet every time I run the script the token size gets bigger and bigger. Does Pinecone and/or the OpenAI embeddings need to be cleared or initialized somehow? I’m a beginner, so any help you can offer would be great!
Not sure what the issue might be without knowing the details. You may have to set max_token_limit. Is your prompt very big? Perhaps your context is somehow getting too big.
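One thing worth checking is that every chunk sent to the model really stays under your limit. A minimal, library-free sketch of fixed-size chunking with overlap (the sizes here are illustrative, not tied to the video's settings; real splitters count tokens rather than characters):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous
    one by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
chunks = chunk_text(doc, chunk_size=1000, overlap=50)
# Chunk lengths are capped at chunk_size no matter how long the document is.
print([len(c) for c in chunks])  # → [1000, 1000, 600]
```

If chunk sizes are bounded like this, a "token size too large" error usually points at the prompt or the accumulated context (e.g. chat history) rather than the document chunks themselves.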
@@goodaitechnology Thanks for this reply, I’m still trying to figure out what happened. I haven’t had much time to devote to the project and I’m a beginning coder. But I’ll keep at it and let you know if I figure out the issue.
@@br1rocks Great, thanks! :)
Can you suggest which vector DB gives the best results for PDF question answering?
I've gotten good results with Pinecone, although you may want to try out others too!
This application is really quite interesting. However, if I want to query my own document, the document details will become available to OpenAI through their API. What is the use case for this application if I do not want customer information to be shared with OpenAI? Can you please help clarify?
OpenAI lets you use their GPT 3.5/4 API so that they can continuously train their GPT models further with your data. Would you be interested in paying for software that lets you query your own documents and also does not make your documents publicly available?
@user-kk1li5mk7q, this is tricky, since in order to embed, you have to send the data to the vendor's API. Also, the retrieved data, when sent to OpenAI for final synthesis, will be available for training and might be saved on their platform. OpenAI provides various levels of privacy around using/not using and saving/not saving your data for training; you'll need to review these before opting for this vendor's solution. You could potentially use a local embedding model and foundation model (FM) for your use case. You may also want to assess using data anonymization for your purposes: python.langchain.com/docs/guides/privacy/presidio_data_anonymization/
Share the notebook please
Please share, as it saves a lot of typing!
Thanks for the input :) I agree, for sure. I'll be setting up the git repo for the channel in a day or two and will let you know!
@@goodaitechnology Thank you! That would be wonderful!!
Pls share the notebook
@borismiz, @IR240474, @colmxbyrne You can now find the links in the video description. Thanks for the suggestion!
Will there be any false or wrong replies? Suppose I ask something like, "How many times is the name 'asdfsdfsadf' used in the story?"
There is always some chance of hallucination when LLMs are involved, even with the use of a vector database, depending on the amount of data you've used to create the embeddings, the prompt you've engineered for your app, and the user queries (genuine or malicious). You will need to make your system prompts and other prompts solid and watertight to prevent these.
Obviously this video starts somewhere after certain knowledge has already been delivered. A terminology primer would be useful.
@rayfellers, thanks for watching! And also thanks for your feedback. There are a couple of basic tutorials on this channel; please check them out and see if they help you get started. I might try making a primer post sometime in the future.
Very informational, but it is very unfortunate that you have that annoying background music.
Sorry that you found it annoying. I've gotten that input from a few other viewers as well, so I'll try something else in future videos. Thanks for your input.