Build Your First RAG Pipeline for Better RAG (step-by-step)

Channel: nateherk · Video ID: 5uw1wE6niGc · Published: October 17, 2025
Duration: 16:24 · Views: 33,277 · Likes: 845

Transcript: 3,856 words · Language: en · Auto-generated

Today we're going to be talking about RAG pipelines and the importance of keeping your database up to date. At this point, I'm assuming you've already built some sort of vector database RAG agent before. If you haven't, I built a full course on that; you can go watch that video up here. And then, when you're done with that video, come back over here and we're going to build out a data pipeline. In today's example, we're going to make sure that whenever we drop a PDF into a Google Drive folder, it gets put into our vector database. Whenever we update that file in Google Drive, the new version will also get put into our database and the old one will be deleted. And then of course, if we delete the file out of Google Drive, it will also be deleted out of our vector database. So I don't want to waste any time. Let's get into the video.

All right, real quick before we hop into n8n, I wanted to do a few slides about why data pipelines matter for the success of your AI agents. The whole point of setting up a knowledge base that all of your agents can pull from is that the knowledge in there is accurate and up to date. So if your data is messy, outdated, or scattered everywhere, your AI agents are going to struggle to deliver actual, real answers. What we need to do is design automated RAG pipelines that are constantly checking that the vector database, or wherever the data is being stored, is accurate.

When I think of a data pipeline, I think of three steps: the raw material that we take in, the processing line of what actually happens to that raw material, and then where it ends up sitting. A quick practical example of this: in this workflow, I've got my transcripts pipeline. The raw material that I'm giving it is the URL of a YouTube video that I'm uploading right here. Then we move into the processing flow, where I get the transcript from it. I'm extracting the actual transcript, extracting the timestamps, and merging that back together. That was basically me cleaning up and getting the data ready to be ingested into the final product, which is our Supabase vector database right over here.

So we've got four essential components to be thinking about. The first one is the trigger: what actually starts the process of getting data into a vector database or deleting data from a vector database. This could be new emails coming in that you want to get vectorized. This could be a new row in a Google Sheet. It could be a file upload, or it could even be some sort of criteria being met. Let me show you again what I mean by that with a real example from this YouTube transcript pipeline. After we get a YouTube video into our vector database, we then put it in a Google Sheet. The Google Sheet would look like this: we'd have the title, the URL, and the transcript, and we'd also have a status. It would be "processed", or if I changed it to "remove", that would trigger off this second flow down here. This goes off whenever a row's status equals "remove"; it then filters out all the other rows and gets rid of the vectors that came from that video. Hopefully that makes sense. If it didn't, you can go ahead and watch this YouTube transcript video, which I'll tag right up here. But that's just a way for me to make sure that my vector database only has YouTube videos that I want to chat with.

And then we have inputs. These are the data sources that we need to process.
You really want to know exactly what your data sources look like and how they're going to be coming in, because predictability is your best friend. Are they going to be PDFs? Are they going to be CSVs? Are they going to be both? Are there going to be images, or is it just going to be text? You need to understand this stuff in order to make that middle portion of your RAG pipeline actually good. And then of course, we take those inputs and we process them. We clean them up, we remove duplicates, we make sure they're ready to go, we give them metadata, stuff like that. And then we actually shove them into our vector database or a relational database, wherever we actually want to keep them.

I really just wanted to preface this stuff because it's really, really important to think about what data you're currently processing, and then later, how you can scale this up. A great example: today we're just going to be building a flow to handle PDFs. But later on, if we knew we might also need Word docs and Excel files and stuff like that, then you could come in here and build a system like this where you're watching a Google Drive folder, but you also have a switch to handle PDFs, text files, and Excel files. They all get processed differently because they're different types of files, but ultimately they all go into the same vector database. That's just an example I wanted to show you guys real quick of what I meant by understanding these core components and why predictability is your best friend.

So, now that we've got all of that boring stuff out of the way, let's get started with this build. The first thing we're going to build is the pipeline that takes a new doc we drop into a Google Drive folder and puts it into a vector database. Super simple: we're going to start off here by grabbing a Google Drive node, and we're going to grab a trigger that is on changes involving a specific folder. The first thing here, after you connect your Google Drive account, is to choose the folder you're going to be looking in. We are going to grab one that I just made called "rag". There we go. And then, what are we watching for? We're watching for a new file being created in this folder.

So, what I'm going to do real quick is go over to my Google Drive, take this policy and FAQ document, and move it into our folder called "rag". As you can see right here, it's moving into "rag". And then, when we go back to n8n and I hit fetch test event, we should now see that that folder has arrived. Or sorry, not the folder, the file. You can see, if I scroll over, there it is: it's called policy and FAQ document. So, we've got that data here. What I'm going to do now is just pin this to keep it here for now.

The next thing we're going to do is actually download this file, because all that came back here was metadata about the file: its ID, its title, all that kind of stuff. So, I'm going to grab another Google Drive node, we're going to do download file, and I'm going to change the file we're looking for to be by ID. Then all I have to do here is find the ID of the file that triggered this workflow. Okay, I had to scroll down a little bit, but I found it. It is right here. I'm going to drag that into the box, and now we have this variable which represents the ID of the incoming file. And I'm just going to click execute step.
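If you want to see what those two Google Drive nodes boil down to outside of n8n, here's a minimal sketch using the Google Drive API in Python. The folder ID, credentials file, and overall shape are assumptions for illustration only, not part of the actual workflow:

```python
# Sketch only: roughly what "watch a folder for new files" + "download file by ID" amount to.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

FOLDER_ID = "your-rag-folder-id"  # hypothetical ID of the "rag" folder
creds = Credentials.from_authorized_user_file("token.json")  # assumes OAuth was done elsewhere

drive = build("drive", "v3", credentials=creds)

# 1. The "trigger": list files recently created in the watched folder.
new_files = drive.files().list(
    q=f"'{FOLDER_ID}' in parents and trashed = false",
    orderBy="createdTime desc",
    pageSize=5,
    fields="files(id, name, mimeType)",
).execute()["files"]

# 2. "Download file by ID": Google Docs have to be exported (e.g. as PDF),
#    while regular binary files can be downloaded directly.
for f in new_files:
    if f["mimeType"] == "application/vnd.google-apps.document":
        content = drive.files().export(fileId=f["id"], mimeType="application/pdf").execute()
    else:
        content = drive.files().get_media(fileId=f["id"]).execute()
    print(f["name"], len(content), "bytes")
```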
And now we should see the binary over here. Actually, I forgot that this is a Google Doc. So what I'm going to do is add an option down here that lets me download any Google Doc as a PDF. I can click on add conversion, and rather than turning a doc into HTML, I can turn a doc into a PDF. And if I run this again, we should now see right here that this is coming through as a PDF. Perfect, we've got what we want.

Now it's as simple as this: I'm just going to add a Supabase step. We're going to add a Supabase vector store, and we're going to add documents to it. So I'm choosing the table to put it in in Supabase, which is called documents. As you can see, here's my environment and this is the table we're going to put it in. We don't need to add any other options; we just need to add our default data loader. And this is important, because right now it's looking for JSON, but what we actually want to give it is binary. As you can see, we have our PDF right here as binary, so I'm going to change that to binary. We're going to leave everything else up here as default for the sake of the example, but we are going to add some metadata. This is going to be very important for us later, when we need to update and delete files.

So, I'm going to add metadata. The first thing I'm going to add is the file name, so I'm just going to do some camel case there and put in fileName. Then we just need to go back to the schema of this file and find its name. If I scroll down here, we can see the name is policy and FAQ document, and I'm going to throw that right in there. Then we're going to add one more metadata property, which is going to be date, and I am just going to type two open curly braces and $now. So whenever a new piece of information is put into our vector database, we can see the exact date and time that it was uploaded. That way, we can later validate that if we update a file in our Google Drive, it updates in Supabase as well. Okay, cool. So we have fileName and date as our metadata. That's all we're going to do for now.

Then I'm going to add an embedding. I'm going to choose OpenAI; I've already got this all set up. We've got text-embedding-3-small, which has to be the same as the embedding model for your database, so we're good to go here. Now I'm just going to run this, and this is going to put that policy and FAQ document into our Supabase. Cool, it says five items, so we should refresh this and see five items. Oh, they popped up right there. And if we go to the metadata and open this up, we can see that we have title and producer, because I guess it got that from the binary data itself, but we also have the metadata down here that we added, which was date and fileName, right there. Instead of fileName, you could have also used the file ID, as long as you have some sort of unique variable that you can reference later. You guys will see exactly what I mean by that when we do this next pipeline.
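Conceptually, what that Supabase vector store step is doing is chunking the document text, embedding each chunk, and inserting one row per chunk with the metadata attached. Here's a rough sketch of that idea in Python, assuming the documents table and the fileName/date metadata keys from this walkthrough; the chunking strategy, library choices, and placeholder credentials are illustrative, not what n8n does internally:

```python
# Sketch of "chunk -> embed -> insert with metadata" against the documents table.
from datetime import datetime, timezone
from openai import OpenAI
from pypdf import PdfReader
from supabase import create_client

supabase = create_client("https://your-project.supabase.co", "service-role-key")  # placeholders
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ingest_pdf(path: str, file_name: str) -> None:
    # Pull the raw text out of the downloaded PDF.
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

    # Naive fixed-size chunking, just to show the shape of the data.
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

    # The embedding model must match the one the table and agent were set up with.
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunks
    ).data

    rows = [
        {
            "content": chunk,
            "embedding": emb.embedding,
            "metadata": {
                "fileName": file_name,  # used later to update/delete this file's vectors
                "date": datetime.now(timezone.utc).isoformat(),
            },
        }
        for chunk, emb in zip(chunks, embeddings)
    ]
    supabase.table("documents").insert(rows).execute()

ingest_pdf("policy-and-faq.pdf", "policy and FAQ document")
```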
Real quick, before we build that next pipeline, I'm just going to build a really, really quick AI agent so we can validate that it's able to read this document. Okay, set that up real quick. I'm just going to ask it: what is our shipping policy? Shoot that off, and we should get an answer from the vector database. I didn't even give the agent a prompt or anything; we just hooked it up to a tool, and look how smart this guy is. So, we've got our shipping policy: orders are processed within 1 to 2 business days, and standard shipping takes 3 to 7 business days. And you can see right here that it is correct.

All right, cool. The next step is that we now need to create a flow so that when we update this file, it also updates in our Supabase vector database. What we're going to do is add another trigger, which is going to be another Google Drive trigger. You might think to just do "on changes to a specific file", which is fine if your vector database only has one file, but what we're going to do instead is "on changes involving a specific folder", just in case you drop many files into this folder. So we're going to choose that same one again, which was called "rag", and we're going to be watching for a file updated rather than a file created.

All right, so I just changed the name. It used to be Tech Haven. As you can see in the vector database, the policy and FAQ doc's store name is Tech Haven, but I just came in here and changed it to Green Grass. So now, when we test this trigger, it should pull in that file, because the file had a change made to it. And we got this information back.

But now, before we download the file, what we want to do is get rid of all of the vectors in Supabase where the file name equals policy and FAQ document, because these are now outdated vectors. To do this, we're going to add another node, and this is going to be a Supabase node (not a vector store node, just a regular Supabase node), and we're going to choose delete a row. Once again, we need to choose the table, which is documents. Keep in mind this is a table that has embeddings, so it is a vector store, but we're able to use the regular Supabase node here. So we're going to delete a row in this documents table, but instead of "build manually" we're going to choose "string", change this to an expression, and paste in this expression right here, which is metadata->>fileName (the metadata field), equals, like, period, asterisk. Kind of a mouthful and not a super intuitive string, but this is how it's going to work. What we need to do now is just go down and grab the file name of this file. And like I said, you could use the ID, you could use anything that's unique to this file; I just decided to go with the name because it looks a little less intimidating for the sake of the demo.

So now, any vectors where the fileName metadata matches this are going to get deleted. If I hit execute step, we should see five items output, because we had five vectors right here, and these should disappear any second now. There you go, they're gone. So now we know our vector database is clean of old vectors.
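For reference, that "delete a row" step with the string filter is just a PostgREST-style filter on the metadata JSON column. Here's roughly the same operation with the Supabase Python client, using an exact match instead of the like filter from the demo; the table name and fileName key come from this walkthrough, and the credentials are placeholders:

```python
# Sketch of "delete the old vectors for this file" via a metadata filter.
from supabase import create_client

supabase = create_client("https://your-project.supabase.co", "service-role-key")  # placeholders

def delete_vectors_for_file(file_name: str) -> None:
    # The n8n node used a filter string along the lines of
    #   metadata->>fileName=like.*policy and FAQ document*
    # Here we do the equivalent with an exact match on that same JSON field.
    supabase.table("documents").delete().eq("metadata->>fileName", file_name).execute()

delete_vectors_for_file("policy and FAQ document")
```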
And now all we have to do is the same thing as up here: download the file and then put it into Supabase. So I'm actually going to copy this Supabase node right here and just put it right here. Then I'm going to grab another Google Drive node in order to download the file, and we just need to download by ID once again, choosing the ID from the Google Drive file that triggered this workflow, which is at the bottom right here. Same thing as before, though: I'm going to add the file conversion, make sure the doc is getting turned into a PDF, and then download it.

Okay, one thing did happen there, so let me explain. We pulled back five items, but they're all the same file. The reason is that when we deleted five rows from Supabase, that node output five items, which makes Google Drive think it needs to output five items as well. So, we're going to click on this node, go to settings, and just say execute only once. Now when we run this again, it's only going to have one item, as you can see, and we're able to just hook that puppy into Supabase.

When we run this, I believe everything should be set up. We should still have the metadata in our data loader, but we do need to fix it, because the name is not mapped correctly. So we're going to go back to the Google Drive trigger and scroll down to get to the name, which is all the way near the bottom (they make it so hard to find). There it is: policy and FAQ doc. So now this flow has the right metadata, and you can see it's good: we're passing over a new date and time. Then we're going to go ahead and run this one as well, and because it's basically the same file, it should still be five items. We go to Supabase, wait for these to pop in, and you can see that the vectors are updated, because the store name now says Green Grass rather than Tech Haven. And once again, we could come up here and chat with the agent and ask, what is our store name? Send that off, it should hit Supabase right here and come back with Green Grass. There you go: our store name is Green Grass.
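As a side note on those agent tests: when the agent answers "what is our shipping policy?" or "what is our store name?", its vector store tool is essentially embedding the question and running a similarity search over the documents table. Here's a rough sketch of that lookup, assuming the stock match_documents function that the standard Supabase vector store template creates; the function name and parameters come from that template, not from anything built in this video:

```python
# Sketch of the retrieval side: embed the question, ask Supabase for the closest chunks.
from openai import OpenAI
from supabase import create_client

supabase = create_client("https://your-project.supabase.co", "service-role-key")  # placeholders
openai_client = OpenAI()

def ask_vector_store(question: str, k: int = 5) -> list[dict]:
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",  # must match the model used at ingestion time
        input=question,
    ).data[0].embedding

    # match_documents is the similarity-search function from the Supabase
    # vector store template; adjust the name if your setup differs.
    result = supabase.rpc(
        "match_documents",
        {"query_embedding": query_embedding, "match_count": k},
    ).execute()
    return result.data  # matched chunks with their content and metadata

for match in ask_vector_store("What is our store name?"):
    print(match["content"][:120])
```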
Okay, so we now have our flow that puts a file into Supabase when it's dropped into a Google Drive folder, and we've got one that, when we update that file, deletes the old records and puts the new ones into Supabase. Now the last thing we need to handle is what happens if we actually delete that file, or want to delete it. How do we make sure Supabase deletes those vectors? We have a bit of a band-aid fix, but it does work. Also, keep in mind that what I'm trying to show you guys here is the idea of building these pipelines; I'm not saying this is the optimized way to chunk, split, and embed data into a Supabase vector store. We're just keeping it simple with the main foundational, high-level concepts.

So anyways, here's the fix. As you can see, when we go to the Google Drive triggers and choose "on changes involving a specific folder" (sorry, let me just grab the folder), there's no "watch for file deleted" option. Not exactly sure why; obviously they have something on their end that's sending the data over off of their webhooks and triggers, but that option just isn't there. So what we do instead is go with file created, but choose a different folder: a new one we made called the recycling bin. Now it's watching a separate folder for anything that gets put in the recycling bin.

So, what I'm going to do is go over to this policy and FAQ doc and move it once again, this time into the recycling bin folder. As you can see, it's now gone from there, and it should have gone into our recycling bin right here. Same file, though. And now, if we go into n8n and fetch a test event, we should see that we got this file, which is once again, hopefully, the policy and FAQ document right here. And then it's actually really simple, because we don't have to ingest anything; we just have to delete. All we have to do is throw this delete node right in here and make sure that everything's mapped up correctly, which is the metadata fileName "like" filter with this file's name. Execute that, and we should get five items over here, and then we should see them be deleted from Supabase. Just like that.

So, it was really that simple. And now we have two different Google Drive folders: as soon as we either drop something in, update its contents, or move it from there to a recycling bin folder, our Supabase vector database will be taken care of. Once again, this wasn't about showing you how to optimally process data and put it in; this was more about the idea of it, and how you can think about creating these different triggers and using metadata to actually filter and delete things.

So, I hope you guys were able to watch this one, understand what's going on, and follow along. As always, though, you'll be able to download this exact workflow. I'll also have sticky notes, a setup guide, and stuff like that. All you have to do to get that for free is join my free Skool community; the link for that will be down in the description. Once you get in there, it will look like this. You'll just need to navigate to the YouTube resources, and every single one of my videos here has some sort of resource. So, right here is my developer agent, and you have the developer agent JSON to download. And if you're looking to dive a little deeper with a more hands-on learning experience, then definitely check out my plus community. The link for that is also down in the description. It's a great community of over 200 members who are building with n8n every day and building businesses with n8n. It's a super active group, and we've been having some really fun calls and discussions lately. We also have a full classroom section where we dive into the foundations with Agent Zero and 10 Hours to 10 Seconds, plus a new course for our annual members called One-Person AI Automation Agency. So I'd love to see you guys in those calls and in the community.

But that's going to do it for the video. If you enjoyed this one or you learned something new, please give it a like; it definitely helps me out a ton. And as always, I appreciate you guys making it to the end of the video. I'll see you on the next one. Thanks everyone.
