Pinecone LangChain - Questions/Answer on Your Own TXT/PDF Files - Code in 9 Minutes!
This video guides you through the basics of loading a custom TXT and a PDF file externally into Pinecone as embeddings (vectors). It also covers the basics of querying your custom TXT/PDF file to get answers back (semantic search) from the Pinecone vector database, via the OpenAI LLM API. Using LLMs to query your own data is a powerful way to become operationally efficient at tasks that require looking up information in large documents.
Watch: Tutorial for "OpenAI Function Calling API + LangChain Bot" : • LangChain OpenAI Funct...
Playlist for all LangChain Tutorials: • LangChain OpenAI API -...
Thanks for watching! 🙏
😃 SUBSCRIBE 🌟 👍 LIKE 🌟 💬 COMMENT 🌟 SHARE
▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Resources and Support ▬▬▬▬▬▬▬▬▬▬▬
☕ Buy me a Coffee: ko-fi.com/goodaitechnology
📕 Github Tutorial Repo: github.com/GoodAITechnology/L...
📖 This Notebook: github.com/GoodAITechnology/L...
🍿 Watch LangChain Tutorial Playlist: • LangChain OpenAI API -...
🍿 Watch Q/A your own data (Multiple PDF Files): • Pinecone LangChain - R...
🍿 Watch Open AI Function Calling (Weather Bot): • LangChain OpenAI Funct...
🍿 Watch LangChain Constitutional AI Tutorial (Try Prevent Prompt Hacking): • 👩🚒 LangChain Prompt I...
Thanks for watching! 😃
Watch LangChain Tutorials:
- OpenAI Function Calling with Code (Build Your Own Weather Bot): kzhead.info/sun/fdJvqs6eqpuAq68/bejne.html
- OpenAI Talk to GPT using LangChain Basics: kzhead.info/sun/a6qSirSaimOPi2w/bejne.html
Git Repo:
- OpenAI LangChain Tutorials: github.com/GoodAITechnology/LangChain-Tutorials
Hi! Thanks for the video. What is the embed_model = 'text-embedding-ada-002' variable used for?
You are welcome! 'text-embedding-ada-002' is the name of the OpenAI embedding model that we use to create embeddings from our book text data and from the query. For example, it is used in this call where we create the Pinecone vectorstore: book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name). You can check out all the OpenAI embedding models available here: platform.openai.com/docs/guides/embeddings/what-are-embeddings
In the second example with the PDF, you get the PDF content into the `book_texts` variable, but after that you don't update Pinecone with this `book_texts` data (i.e., book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name)). The `book_docsearch` still contains the older Pinecone object for the TXT file. So how is it giving results from the PDF? Please explain/clarify.
Great video! One problem I'm facing while creating book_docsearch: if I run the cell again, it generates the embeddings again. Is there a way to reuse the embeddings already stored, if I just want to do inference?
Thanks! Try this method: Pinecone.from_existing_index(index_name, embeddings)
Can we have a meet call to discuss potential collaboration opportunities?
Great video! Even though I had no clue about this, it helped me. I didn't quite understand it fully, but I'm sure this will help people in need. The music is a bit too loud though.
Thanks, I’m glad you found it useful! And thanks for the feedback on the music. I’ve gotten similar feedback before, and I’m going to try a lower volume/different music on upcoming videos 😊
Thanks! You upsert the txt docs; is it the same for uploading a PDF, just using .pdf instead of .txt in the script?
The embeddings that you create from your text or PDF documents are upserted to the vector DB.
Very interesting, I'm eager to learn about this, but I only have a very basic understanding of Python lol! What videos do you recommend I watch before this one? It would be of great help! :)
Thanks! I am very excited that you're interested in this and in coding. There are so many resources online to learn from, both videos and courses. Watch/read those, and to augment them, I'd say the best way is to learn by doing small projects like this one. I find that while doing a project, you come across a concept or something you don't know or understand and need to research; in the process you grow your knowledge, and at the end you have a sense of accomplishment. Here are some resources you might find useful: www.coursera.org/ www.w3schools.com/python/ You can download Python/Anaconda etc. locally for this purpose. Alternatively, you can use Google Colab: colab.research.google.com/ I hope this helps. Happy learning! And thank you for watching :)
@@goodaitechnology Thank you for this! Means a lot. Yes I agree that you learn way faster by actually doing and getting your hands dirty.
@goodaitechnology How do I get the page numbers the data is picked from?
This video might be helpful; it walks through how to query multiple files and cite sources along with page numbers: kzhead.info/sun/a7ZrcdaugGusnqM/bejne.html
Can you please make a video on how to deploy it on a website, so that customers can chat with my PDF data and get answers quickly?
Thank you so much for this video! So I’m having a problem: every time I run my script it says the token size is too large. But I have it at around 1500 tokens with a 50-token overlap, I uploaded a small PDF, and I asked a very simple question. Yet every time I run the script the token size gets bigger and bigger. Does Pinecone and/or the OpenAI embeddings need to be cleared or initialized somehow? I’m a beginner, so any help you can offer would be great!
Not sure what the issue might be without knowing the details. You may have to set max_token_limit. Is your prompt very big? Perhaps your context is somehow getting too big.
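One thing worth checking is that every chunk sent to the model really stays under your limit. A minimal, library-free sketch of fixed-size chunking with overlap (the sizes here are illustrative, not tied to the video's settings; real splitters count tokens rather than characters):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous
    one by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
chunks = chunk_text(doc, chunk_size=1000, overlap=50)
# Chunk lengths are capped at chunk_size no matter how long the document is.
print([len(c) for c in chunks])  # → [1000, 1000, 600]
```

If chunk sizes are bounded like this, a "token size too large" error usually points at the prompt or the accumulated context (e.g. chat history) rather than the document chunks themselves.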
@@goodaitechnology Thanks for this reply, I’m still trying to figure out what happened. I haven’t had much time to devote to the project and I’m a beginning coder. But I’ll keep at it and let you know if I figure out the issue.
@@br1rocks Great, thanks! :)
Can you suggest which vector DB gives the best results for PDF question answering?
I've gotten good results with Pinecone, although you may want to try out others too!
This application is really quite interesting. However, if I want to query my own document, the document details will become available to OpenAI through their API. What is the use case for this application if I do not want customer information to be shared with OpenAI? Can you please help clarify?
OpenAI lets you use their GPT 3.5/4 API so that they can continuously train their GPT models further with your data. Would you be interested in paying for software that lets you query your own documents and also does not make your documents publicly available?
@user-kk1li5mk7q, this is tricky, since in order to embed, you have to send the data to the vendor's API. Also, the retrieved data, when sent to OpenAI for final synthesis, will be available for training and might be saved on their platform. OpenAI provides various levels of privacy around using/not using and saving/not saving your data for training; you'll need to review these before opting for this vendor's solution. You could potentially use a local embedding model and foundation model (FM) for your use case. You may also want to assess using data anonymization for your purposes: python.langchain.com/docs/guides/privacy/presidio_data_anonymization/
Share the notebook please
Please share, as it saves a lot of typing!
Thanks for the input :) I agree, for sure. I'll be setting up the git repo for the channel in a day or two and will let you know!
@@goodaitechnology Thank you! That would be wonderful!!
Pls share the notebook
@borismiz, @IR240474, @colmxbyrne You can now find the links in the video description. Thanks for the suggestion!
Will there be any false or wrong replies? Suppose I ask something like, "How many times is the name 'asdfsdfsadf' used in the story?"
There is always some chance of hallucination when LLMs are involved, even with the use of a vector database, depending on the amount of data you've used to create the embeddings, the prompt you've engineered for your app, and the user queries (genuine or malicious). You will need to make your system prompts and other prompts solid and watertight to prevent these.
Obviously this video starts somewhere after certain knowledge has already been delivered. A terminology primer would be useful.
@rayfellers, thanks for watching! And also thanks for your feedback. There are a couple of basic tutorials on this channel; please check them out and see if they help you get started. I might try making a primer post sometime in the future.
Very informational, but it is very unfortunate that you have that annoying background music.
Sorry that you found it annoying. I've gotten that input from a few other viewers as well, so I'll try something else in future videos. Thanks for your input.