Upload and Search Your Own Documents With Bubble, OpenAI, and Pinecone (Part 1)

10 May 2023

6,523 views

In this video we'll focus on splitting a document's text into multiple parts, embedding each part as a vector (using @OpenAI), and then upserting the vectors into Pinecone (@pinecone-io). Everything is done without any code, all coordinated within Bubble (@BubbleIO).
www.buymeacoffee.com/blanksla...

Comments

Spectacular tutorial! Very easy to follow. It's the first video I watched from you and I'm def subscribing. Here is a timeline of the video:
00:01 Learn how to connect Bubble with OpenAI and Pinecone for document question-answering
04:06 Setting up OpenAI and Pinecone APIs
08:25 Building a curl command for Pinecone API
12:47 API setup for database, OpenAI embedding, and Pinecone upsert
17:19 Convert PDF to text and chunk it into smaller pieces using a backend workflow
21:22 Create a document chunk and split it into an array of words
25:34 Process for chunking and connecting text content with Pinecone and Bubble
29:38 Create a workflow to chunk text and send it to Pinecone
@dniliveact 6 months ago
- Thanks so much for putting this together! 🙏
  @BlankSlateLabs 6 months ago
Love it! Thank you for the full tutorial!
@tursunable 1 year ago
This is a great tutorial, thank you for taking the time to share the knowledge!
@kieranball 1 year ago
- Thanks so much Kieran!
  @BlankSlateLabs 1 year ago
Have to STRONGLY disagree with the use of JSON-safe formatting for the API calls (vs. keeping quotes in)... Couldn't help but laugh when you forgot at the end! Thanks so much for this, truly one of the better resources out there for integrating these new AI tools with Bubble.
@mattk2531 8 months ago
- Hey Matt, curious why you're not a fan of using JSON-safe formatting. Is there a particular reason why or just personal preference?
  @jameslakin 8 months ago
- Ha, appreciate the strong opinion. I do it mainly to escape special characters. The fact that it adds the quotation marks automatically is a downside. I'd definitely be down for another way though. What is your approach to escaping any special characters before sending them in an API call? Thanks for the feedback!
  @BlankSlateLabs 8 months ago
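For readers who want to see the trade-off discussed above outside of Bubble, here is a minimal Python sketch. It assumes that JSON-safe formatting behaves like standard JSON string encoding (escaping special characters and wrapping the result in quotation marks); the function name `json_safe`, the `<text>` placeholder, and the model name are illustrative assumptions, not taken from the video.

```python
import json


def json_safe(text: str) -> str:
    """Encode text as a JSON string literal (escapes specials, adds quotes)."""
    return json.dumps(text)


# A chunk containing characters that would break a hand-written JSON body.
chunk = 'He said "hi"\nand left.'
escaped = json_safe(chunk)

# Because the escaped value already carries its own quotation marks,
# the template must NOT wrap the placeholder in quotes.
# The model name below is an assumption for illustration only.
body_template = '{"input": <text>, "model": "text-embedding-ada-002"}'
body = body_template.replace("<text>", escaped)

# The body parses cleanly and round-trips to the original text.
parsed = json.loads(body)
```

The downside mentioned in the reply shows up here too: if the template already had quotes around the placeholder, the body would contain doubled quotation marks and fail to parse.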
Excellent tutorial. Generous in its presentation of a valuable and conceptually complex topic and implementation. Thank you.
@ronmartinez8781 11 months ago
- No problem! Really glad it was helpful.
  @BlankSlateLabs 11 months ago
Thanks, this is great.
@davidliu8538 11 months ago
- Thanks so much David!
  @BlankSlateLabs 11 months ago
Fantastic tutorial! Would love it if you could also show us other implementations like: - How to update your vector database - How to implement the status of the upload to Pinecone - How to search for multiple uploads. Thank you so much for sharing this so far.
@jvanh8926 8 months ago
- Thanks so much! I can try to do some quick videos on those. a) valuable! thanks for the suggestions. b) all pretty easy to implement, just some small changes to how you implement them.
  @BlankSlateLabs 8 months ago
- @@BlankSlateLabs Thank you for replying! It would be amazing if you could make some quick videos. Another idea would be to create a Bubble template that people can buy (if you want to monetize it) or use as a quick start alongside your (soon to be made) advanced tutorials. With great anticipation I hope to see more content sharing your knowledge in this field. Thanks!
  @jvanh8926 8 months ago
- Thanks for the suggestions! Yeah, also something to think about. :) Let me look into it!
  @BlankSlateLabs 8 months ago
Hey, great video!! Looking forward to the next one!! Why do we need this overlap? Is it a best practice for vector databases?
@sebastianescalona3738 1 year ago
- It protects against losing relevant context if the relevant information is at the split point (so it would match and pull in the info both before and after).
  @BlankSlateLabs 1 year ago
Great video. There are some new tools that will take this no-code approach to the next level. I wish I had your understanding of workflows in Bubble. I understand the front end, but the back end not so much. Would love to chat but I can't leave my email here. Thanks, Mike
@michaelkelly508 11 months ago
- Hi Mike, feel free to email me at hello [at] blankslate-labs.com!
  @BlankSlateLabs 11 months ago
Hi! I was wondering if the parameter in "select until #" has to be chunk_size or start_index+chunk_size. In this case chunk_size would remain 100 all the time while start_index changes from 0 to 95, 190, etc. Then the command would say, e.g., from item #190 to item #100. Would that work or generate an error? Such a precious tutorial btw! Thanks!
@StartupStudio_MA 8 months ago
- Ha, what you are describing is what I initially assumed it would do as well! It seems logical that the "select until" would be the overall index into the array. But it is actually the index counted from the starting point. When you apply the "select from", it treats that position as the new 0 index, so the "select until" is additive from there rather than counted against the whole array. Does that make sense?
  @BlankSlateLabs 8 months ago
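The relative indexing described in the reply above can be mimicked with two chained slices. This is only a sketch of the behaviour as described in the thread, not Bubble's actual implementation; the helper name `select_chunk` and the 0-based offsets are assumptions made for illustration.

```python
from typing import List


def select_chunk(words: List[str], start_index: int, chunk_size: int) -> List[str]:
    """Mimic "select from" followed by "select until".

    The second slice counts from the new start, not from the
    beginning of the original list.
    """
    remaining = words[start_index:]   # "select from": start becomes index 0
    return remaining[:chunk_size]     # "select until": relative to that start


words = [f"w{i}" for i in range(1, 301)]  # 300 placeholder words

# Start at offset 190 and take 100 items: yields w191 through w290.
chunk = select_chunk(words, 190, 100)

# Reading "until 100" as an absolute position would select nothing instead.
absolute_reading = words[190:100]
```

So chunk_size can stay fixed at 100 while only the start offset moves forward on each pass.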
Great tutorial. I tried to build it and it works. But how do I do it for different users? As of now, different users will see the same content. @Blank Slate Labs
@soonstudio101 8 months ago
- Hey! Glad it was helpful. One approach I have done is to use namespace in Pinecone as the user or team filter. So you can set Namespace when you upsert the vector as the user ID (if doing user-based access control) or team ID (if doing team-based access control). Then you query only for that Namespace.
  @BlankSlateLabs 8 months ago
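As a rough illustration of the namespace approach in the reply above, the sketch below builds the JSON bodies for an upsert and a query that share one namespace. The field names follow the upsert body quoted later in this thread plus Pinecone's query fields (`vector`, `topK`, `includeMetadata`); the helper names and the sample user ID are hypothetical, and no network call is made.

```python
from typing import Any, Dict, List


def build_upsert_body(vector_id: str, values: List[float],
                      document_id: str, namespace: str) -> Dict[str, Any]:
    """Body for an upsert scoped to one user's (or team's) namespace."""
    return {
        "vectors": [
            {"id": vector_id, "values": values,
             "metadata": {"document_id": document_id}}
        ],
        "namespace": namespace,
    }


def build_query_body(query_vector: List[float], namespace: str,
                     top_k: int = 5) -> Dict[str, Any]:
    """Body for a query that only searches the same namespace."""
    return {
        "vector": query_vector,
        "topK": top_k,
        "includeMetadata": True,
        "namespace": namespace,
    }


user_id = "user_123"  # hypothetical user ID used as the namespace
upsert_body = build_upsert_body("doc1-chunk0", [0.1, 0.2, 0.3], "doc1", user_id)
query_body = build_query_body([0.1, 0.2, 0.3], user_id)
```

Swapping the user ID for a team ID gives team-based access control with the same two bodies.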
Great video, thank you so much for such an effort! I was wondering what to do if my document is already split into chunks. For example, I have around 50 separate fields for every user, so that's 50 chunks. Can the backend workflow be bypassed then? I am trying to figure out how to set up the workflow but I am having a really hard time.
@FtgStudio 1 year ago
- No problem! At some stage, you'll still need to do the vector embedding and the upsert to Pinecone for each text chunk. An option could be to 1) add a field to the text chunks called "upsertComplete" with "yes/no" type. Then 2) create a backend workflow that takes in 1 parameter (the text chunk) and does the vector embedding and upsert to Pinecone, with a final step that changes the field "upsertComplete" to "yes". Finally, 3) run a bulk action that calls that backend workflow on that table for all text chunks.
  @BlankSlateLabs 1 year ago
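The three steps in the reply above can be sketched in Python as follows. This is an illustration of the control flow only: `TextChunk`, `process_chunk`, and `run_bulk` are hypothetical names, and the `embed` and `upsert` callables stand in for the OpenAI and Pinecone API calls, which are not made here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextChunk:
    chunk_id: str
    text: str
    upsert_complete: bool = False  # step 1: the yes/no "upsertComplete" field


def process_chunk(chunk: TextChunk,
                  embed: Callable[[str], List[float]],
                  upsert: Callable[[str, List[float]], None]) -> None:
    """Step 2: embed one chunk, upsert it, then mark it complete."""
    vector = embed(chunk.text)
    upsert(chunk.chunk_id, vector)
    chunk.upsert_complete = True


def run_bulk(chunks: List[TextChunk],
             embed: Callable[[str], List[float]],
             upsert: Callable[[str, List[float]], None]) -> int:
    """Step 3: run the workflow on every chunk not yet upserted."""
    processed = 0
    for chunk in chunks:
        if chunk.upsert_complete:
            continue  # already done, safe to re-run the bulk action
        process_chunk(chunk, embed, upsert)
        processed += 1
    return processed


# Stand-ins for the real API calls (assumptions for illustration).
fake_index: Dict[str, List[float]] = {}
fake_embed = lambda text: [float(len(text))]
fake_upsert = lambda chunk_id, vector: fake_index.__setitem__(chunk_id, vector)

chunks = [TextChunk("c1", "first field"), TextChunk("c2", "second field")]
first_run = run_bulk(chunks, fake_embed, fake_upsert)
second_run = run_bulk(chunks, fake_embed, fake_upsert)
```

The flag makes the bulk action safe to repeat, since chunks that were already upserted are skipped.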
- @@BlankSlateLabs I was thinking of putting the embedding and upsert in the same workflow that saves the inputs for each chunk. Once the user submits the input for one of their 50 tabs, the workflow starts and does the embedding and then the upsert for that chunk. Every user has to go through those inputs anyway. Would it be possible that way?
  @petarsavic3684 1 year ago
Hello, could you tell me what program you used to record the screen like that? It looks great!
@neuralthinkio 10 months ago
- www.screen.studio/ It's great!
  @BlankSlateLabs 10 months ago
Excellent tutorial! I can't get past the stage of initializing the Upsert Vectors call at 15:27. I keep getting this response in Bubble: "Raw error for the API read ECONNRESET"
@andreschurrla949 10 months ago
- Thanks! For that error, I am not sure. It's a connection error (not a setup error), so it could be an issue on the Bubble side, the Pinecone side, or with your network connection. Maybe try again?
  @BlankSlateLabs 10 months ago
- @@BlankSlateLabs Yeah, I will retry the steps
  @andreschurrla949 10 months ago
Great one! I almost followed all your steps, but I'm facing an issue in Pinecone. I can't add more than 1 vector in a namespace though the loop goes more than 1 time in the backend API workflow. It's weird. Tried everything, but I couldn't succeed. Do you have an idea where I could have gone wrong? Thanks!
@baskaranm5385 11 months ago
- If it is getting to another iteration of the recursive workflow (as seen by multiple document_chunk records being created in the DB), then two potential reasons the Upsert fails on the later iterations are: 1) the embed step isn't passing a proper vector value into the upsert to Pinecone step. 2) the ID you are using in the Upsert to Pinecone step is not unique. Also happy to do a quick chat if you message me (Jeff) at hello [at] blankslate-labs.com
  @BlankSlateLabs 11 months ago
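The second cause in the reply above is easy to reproduce with a toy model. An upsert writes to whatever ID it is given, so reusing one ID overwrites the previous vector instead of adding a new one. The sketch below models a namespace as a plain dictionary; the `upsert` helper and the `document_id-index` ID scheme are assumptions for illustration, not the scheme used in the video.

```python
from typing import Dict, List


def upsert(namespace_store: Dict[str, List[float]],
           vector_id: str, values: List[float]) -> None:
    """Insert the vector, or overwrite it if the ID already exists."""
    namespace_store[vector_id] = values


vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Same ID on every loop iteration: each upsert replaces the last one.
same_id_store: Dict[str, List[float]] = {}
for values in vectors:
    upsert(same_id_store, "doc1", values)

# Unique ID per chunk (document ID plus chunk index): all vectors are kept.
unique_id_store: Dict[str, List[float]] = {}
for index, values in enumerate(vectors):
    upsert(unique_id_store, f"doc1-{index}", values)
```

With a shared ID the namespace ends up holding a single vector no matter how many times the loop runs, which matches the symptom described in the question.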
- @@BlankSlateLabs Thanks Jeff for your reply. Really smart. As mentioned, I was making a mistake in the second step. Not passing a unique ID. Now it is working perfectly.
  @baskaranm5385 11 months ago
- @@baskaranm5385 🙌 phew, awesome! glad it's all working now.
  @BlankSlateLabs 11 months ago
Brilliant! How long did it take Pinecone to reach 115 vectors? I’m interested in letting my users upload a PDF and chat with their own documents. If the doc is massive, I’m wondering if Pinecone would take too long to finish before a user gives up.
@BradleyJH 11 months ago
- Hey Bradley! In this example the main thing that takes time is the document splitting process on the Bubble side. (Bubble is not optimized for running recursive backend workflows, so things kinda run slow right now). So with that scenario, it’s actually a couple of minutes to get everything upserted and ready to go in Pinecone for 115 vectors. However, that can be sped up by doing the splitting and upserting either in something like Xano or using code with a cloud function. Then it would be in a matter of seconds.
  @BlankSlateLabs 11 months ago
- @@BlankSlateLabs I would love to get in touch with you and hire you to build it. What’s the best way to reach you? If not, no problem. I love these videos and will continue to share and promote them.
  @BradleyJH 11 months ago
- Sure! Would be great to connect. Feel free to email me (Jeff) at hello [at] blankslate-labs.com
  @BlankSlateLabs 11 months ago
Hey mate, what is your browser?
@Djaxad 9 months ago
- Arc - arc.net/
  @BlankSlateLabs 9 months ago
Does it work for image embeddings?
@proan_1992 10 months ago
- For this setup, no, it would not work for image embeddings. Image embeddings and language embeddings are generally not compatible with each other and are not interchangeable.
  @BlankSlateLabs 10 months ago
How can we add follow-up data under the same namespace? E.g., the user upserted their data today and will come back every day to add more data.
@tursunable 1 year ago
- You'd probably then use the "Namespace" as something dynamic. Maybe the user creates a project that they then add more documents to and then the namespace becomes the project unique ID.
  @BlankSlateLabs 1 year ago
- @@BlankSlateLabs Thank you! That was what I was thinking of recently but wanted to confirm with other guys (Pinecone never mentioned this issue on their platform). So, every time users add info they would create a new namespace. Does the number of namespaces increase the cost of data storage?
  @tursunable 1 year ago
- @@tursunable No problem! Actually, you should be able to reuse the same namespace, as long as you store the value on the Bubble side (or whichever platform you use) and send it along with each Upsert, it should all be combined within the namespace.
  @BlankSlateLabs 1 year ago
I watched another video in which the developer did not chunk the text before embedding it, as he said that embedding includes chunking.
@salemmohammad2701 10 months ago
- Embedding itself is not chunking. Basically think of embedding as translating to another language. In this case it's a language defined by a vector in multi-dimensional space (1,536 dimensions). Each unique vector represents unique meaning. The closer the vectors are to each other, the closer they are in meaning. So when you embed to a vector, you give it a new language to search for things with similar meaning, instead of just searching by text in a language like English. The chunking is done so that when you search, you get back the most relevant excerpts of text to the search term (instead of the entire document text). If you did not chunk and just embedded, you'd just always return the whole document text.
  @BlankSlateLabs 10 months ago
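To make "closer vectors are closer in meaning" concrete, here is a small sketch using cosine similarity, one common closeness measure. The three-dimensional vectors are made-up toy values standing in for real 1,536-dimension embeddings, and the choice of cosine similarity is an assumption, since the thread does not say which metric the index uses.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (made up): two related phrases and one unrelated phrase.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
hiking_trails = [0.0, 0.1, 0.9]

related = cosine_similarity(refund_policy, money_back)
unrelated = cosine_similarity(refund_policy, hiking_trails)
```

A search embeds the query the same way and returns the stored chunks whose vectors score highest against it.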
- @@BlankSlateLabs What are the factors that determine the appropriate size for the chunk?
  @salemmohammad2701 10 months ago
- @@salemmohammad2701 The ultimate goal is to break it down into pieces of text that represent distinct topics. Some methods will define the length of the chunk dynamically by parsing the text by sentences or paragraphs. Others will use the headers to define sections and create a chunk for each section. For what I did in the video, it's a bit more rudimentary, since I am just using Bubble natively (no-code). Thus I set the chunk size at 100 words. This is about a paragraph length (20 words per sentence average, 5 sentences). I then overlap by 10 words (each chunk also includes the last 10 words of the previously created chunk).
  @BlankSlateLabs 10 months ago
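The fixed-size approach described in the reply above (100-word chunks, each repeating the last 10 words of the previous chunk) can be sketched in a few lines of Python. The function name `chunk_words` is hypothetical, and this is a plain-code equivalent of the idea rather than a reproduction of the Bubble workflow.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 100, overlap: int = 10) -> List[str]:
    """Split text into word chunks where each chunk repeats the
    last `overlap` words of the chunk before it."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the start moves each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


# 250 placeholder words produce chunks starting at word 0, 90, and 180.
sample_text = " ".join(f"word{i}" for i in range(250))
chunks = chunk_words(sample_text)
```

The overlap means a sentence that straddles a split point still appears whole in at least one chunk.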
- @@BlankSlateLabs Can the embedding model handle any size of chunk? And does the size of the chunk affect what the model's provider will charge?
  @salemmohammad2701 10 months ago
- @@salemmohammad2701 OpenAI's Embedding cost is $0.0001 / 1,000 tokens (a token is about 4 characters). For their latest model, the max tokens you can input into the embedding model is 8,000 tokens.
  @BlankSlateLabs 10 months ago
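Using only the figures quoted in the reply above ($0.0001 per 1,000 tokens and roughly 4 characters per token), a back-of-the-envelope estimate might look like the sketch below. The function name is hypothetical, the 4-characters rule is an approximation, and the price is the one stated in this thread, which may have changed since.

```python
def estimate_embedding_cost(text_length_chars: int,
                            usd_per_1k_tokens: float = 0.0001,
                            chars_per_token: float = 4.0) -> float:
    """Rough embedding cost in USD from a character count."""
    estimated_tokens = text_length_chars / chars_per_token
    return estimated_tokens / 1000 * usd_per_1k_tokens


# A 400,000-character document is roughly 100,000 tokens,
# which works out to about one cent at the quoted rate.
document_cost = estimate_embedding_cost(400_000)

# Chunk size changes how many calls are made, not the total characters,
# so the total stays about the same apart from the overlapping words.
```

Under this pricing the total depends on how much text is embedded, so chunk size matters mainly through the extra words that overlap adds.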
It's all good.....but what is the use of this...?? Can you give some practical real life use cases for this application?
@Statsjk 11 months ago
- Sure! One use would be to create custom responses to customer support questions. So instead of just returning a list of matching help docs, it could formulate an exact, personalized response based on multiple docs.
  @BlankSlateLabs 11 months ago
Text in the diagrams is too small!
@tylersnard 10 months ago
- Sorry 'bout that! Here are links to images: drive.google.com/file/d/17ubXUoAaGG0I6LE-qBSCmy9DwjFJi51r/view?usp=drive_link drive.google.com/file/d/16zJd50aVTrVZoO46yqHr-1QodFhLRVqb/view?usp=drive_link
  @BlankSlateLabs 10 months ago
How can I reach you, please? Contact details, please.
@MahmoudAmmar0 11 months ago
- Hey! Sure, feel free to reach out to me (Jeff) at hello [at] blankslate-labs.com
  @BlankSlateLabs 11 months ago
I'm having a problem at 15:42. It came up as: "There was an issue setting up your call. Raw response for the API Status code 400 Expected an object key or }. 34, 0.0023024268, ^" Please help.
@shoshi475 10 months ago
- This most likely is caused by a missing comma or bracket in the JSON body. Make sure it is this: { "vectors": [ { "values": [], "metadata": { "document_id": }, "id": } ], "namespace": } Another possible reason is you copied a { or extra comma or quotation mark when you grabbed the vector and copied to the value. Make sure the start and end do not include any special characters.
  @BlankSlateLabs 10 months ago
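One way to check a body like the one in the reply above before pasting it into the API setup is to build and parse it with a JSON library. The sketch below is an assumption-laden illustration: the helper names and sample values are hypothetical, and only the field layout is taken from the reply.

```python
import json
from typing import List


def build_upsert_json(values: List[float], document_id: str,
                      vector_id: str, namespace: str) -> str:
    """Serialize the upsert body so commas, quotes, and brackets are
    always balanced."""
    body = {
        "vectors": [
            {"values": values,
             "metadata": {"document_id": document_id},
             "id": vector_id}
        ],
        "namespace": namespace,
    }
    return json.dumps(body)


def is_valid_json(body: str) -> bool:
    """Return True if the body parses as JSON."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


good_body = build_upsert_json([0.0023024268, -0.01], "doc1", "doc1-0", "user_123")

# A stray trailing comma, as can happen when copying a vector by hand,
# is enough to make the whole body invalid.
bad_body = '{"vectors": [{"values": [0.0023024268,], "id": "doc1-0"}]}'

good_ok = is_valid_json(good_body)
bad_ok = is_valid_json(bad_body)
```

Pasting the failing body into any JSON validator gives the same kind of check and usually points at the offending character.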
- I have the same problem, and I checked many times :( @@BlankSlateLabs
  @JoaoSilva-ij6dx 8 months ago