RAG Pipeline: 6 Steps for Creating a Naive RAG App

If you’re getting started with large language models, you’ve probably heard of RAG (Retrieval-Augmented Generation). It’s the magic that lets AI chatbots talk about your data (your company’s PDFs, your private notes, or any other new information) without “hallucinating.”

It might sound complex, but the core logic of a naive RAG pipeline boils down to six simple steps. We’re going to walk through the “conductor” script that runs this pipeline, showing how data flows from a raw document to a grounded, factual answer.

Our entire system is built on this simple mantra:

  • Phase 1 (Indexing): Chunk → Embed → Store
  • Phase 2 (Querying): Embed → Retrieve → Generate

Let’s look at the Python code that brings this mantra to life.

Step 0: Loading Our “Brains” (The Models)

Before we can define our 6 steps, we need to load the tools we’ll be using. In a real application, you load these once when the app starts.

import torch
import numpy as np
import faiss  # For our vector database
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer

print("Loading all models... (This might take a moment)")

# 1. The Embedding Model (for steps 2 & 4)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. The LLM and Tokenizer (for step 6)
llm_name = 'google/flan-t5-base'
llm_tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForSeq2SeqLM.from_pretrained(llm_name)

print("Models loaded successfully!")

Tutor’s Notes:

We’re loading two different “brains”:

  1. SentenceTransformer('all-MiniLM-L6-v2'): This is our Embedding Model. Its only job is to read text and turn it into a vector (a list of numbers) that represents its meaning. It’s a “translator” from words to math.
  2. AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base'): This is our LLM (Large Language Model). This is the “generator.” Its job is to read a prompt (which we’ll give it) and generate a new text answer. We also load its AutoTokenizer to “talk” to it.
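
If you’re curious what the embedding model actually produces, a quick sanity check like the one below makes “words to math” concrete. This is just an illustrative snippet (the sample sentence is arbitrary), not part of the pipeline:

# Quick sanity check (illustrative only): what does an embedding look like?
sample_vector = embedding_model.encode("Hello, RAG!")
print(sample_vector.shape)  # (384,) -> one 384-dimensional vector
print(sample_vector[:5])    # the first few numbers in that vector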

Phase 1: Building the Library (Indexing)

This is the one-time, offline prep work. We’re going to build a searchable “card catalog” of our knowledge.

Step 1. CHUNK

First, we need our documents. For this lab, we’ll just use a simple list of strings.

def chunk_data():
    """
    1. CHUNK: Get our documents.
    For this simple lab, "chunking" just means loading our list of text.
    """
    print("STEP 1: CHUNK (Loading documents)")
    documents = [
        "Agent 'Alpha' is a data analysis bot, specialized in processing financial data and generating reports.",
        "Agent 'Bravo' is a customer support chatbot, designed to handle user inquiries and provide 24/7 assistance.",
        "Agent 'Charlie' is a logistics coordinator, responsible for tracking shipments and managing inventory.",
        "The headquarters for 'Alpha' and 'Bravo' is in New York, while 'Charlie' operates from a base in London."
    ]
    return documents
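
In a real project, your documents won’t arrive as tidy one-sentence strings, so “chunking” usually means splitting long text into overlapping pieces before embedding. The helper below is a minimal sketch of that idea (the function name and sizes are just examples, not part of our pipeline):

def naive_chunk(text, chunk_size=500, overlap=50):
    """Split a long string into overlapping character chunks (illustrative sketch)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks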

Step 2. EMBED (Docs)

Now, we turn those text chunks into “meaning vectors” using our embedding model.

def embed_documents(documents_list):
    """
    2. EMBED (Docs): Convert all document chunks to vectors.
    """
    print("STEP 2: EMBED (Docs) -> Converting text to vectors")
    doc_embeddings = embedding_model.encode(documents_list)
    return doc_embeddings

Step 3. STORE

We take those vectors and put them into our faiss vector database so we can search them.

def store_vectors(doc_embeddings):
    """
    3. STORE: Build a searchable FAISS index with our vectors.
    """
    print("STEP 3: STORE -> Building FAISS vector index")
    # Get the dimension of our vectors (all-MiniLM-L6-v2 is 384)
    d = doc_embeddings.shape[1] 
    
    # Create a simple L2 (Euclidean distance) index
    index = faiss.IndexFlatL2(d)
    
    # Add our document vectors to the index
    index.add(np.array(doc_embeddings).astype('float32'))
    return index
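
Because Phase 1 is one-time work, you would normally save the index to disk instead of rebuilding it on every run. FAISS provides write_index and read_index for this; here is a minimal sketch (the file name "docs.index" is just a placeholder):

# Persist the index once after building it...
faiss.write_index(index, "docs.index")

# ...and load it back later instead of re-embedding everything
index = faiss.read_index("docs.index")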

Phase 2: Answering a Question (Querying)

This is the “live” part that happens every time a user asks a question.

Step 4. EMBED (Query)

We use the exact same embedding model from Step 2 to convert the user’s question into a vector.

def embed_query(query):
    """
    4. EMBED (Query): Convert the user's question to a vector.
    """
    print("STEP 4: EMBED (Query) -> Converting question to a vector")
    query_vector = embedding_model.encode([query])
    return np.array(query_vector).astype('float32')
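
One detail worth checking: the query vector must have the same dimensionality as the document vectors from Step 2, or FAISS will refuse the search. A quick illustrative check:

q = embed_query("What does Agent Alpha do?")
print(q.shape)  # (1, 384) -- must match the dimension d used in Step 3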

Step 5. RETRIEVE

We take the “question vector” and search our FAISS index for the document vectors that are most similar.

def retrieve_chunks(query_vector, index, original_documents, top_k=2):
    """
    5. RETRIEVE: Search the index for the top_k most similar chunks.
    """
    print(f"STEP 5: RETRIEVE -> Searching index for top {top_k} chunks")
    # Search the index
    distances, indices = index.search(query_vector, top_k)
    
    # Use the indices to get the original text
    retrieved_chunks = [original_documents[i] for i in indices[0]]
    
    print("...Chunks found:")
    for chunk in retrieved_chunks:
        print(f"  - {chunk}")
        
    return retrieved_chunks
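
A quick note on the distances FAISS returns: with IndexFlatL2, smaller numbers mean “more similar.” Many RAG setups use cosine similarity instead; one common way to get it with FAISS is to L2-normalize the vectors and use an inner-product index. The sketch below shows that swap (it is an alternative, not what our store_vectors() above does):

# Cosine-similarity variant (illustrative): normalize vectors, then use inner product
doc_vecs = np.array(doc_embeddings).astype('float32')
faiss.normalize_L2(doc_vecs)                         # in-place L2 normalization
cosine_index = faiss.IndexFlatIP(doc_vecs.shape[1])
cosine_index.add(doc_vecs)

# The query vector must be normalized the same way before searching
faiss.normalize_L2(query_vector)
scores, indices = cosine_index.search(query_vector, 2)  # larger score = more similar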

Step 6. GENERATE

This is the “Augmented Generation” part. We “stuff” the relevant text chunks we just retrieved into a prompt, add the user’s question, and ask the LLM to write an answer based only on that context.

def generate_answer(query, retrieved_chunks):
    """
    6. GENERATE: Give the LLM the context and question.
    """
    print("STEP 6: GENERATE -> Building prompt and calling LLM")
    # Combine the retrieved chunks into one string
    context_string = "\n".join(retrieved_chunks)
    
    # Create the augmented prompt
    prompt = f"""
    Answer the following question based ONLY on the context provided.
    
    Context:
    {context_string}
    
    Question:
    {query}
    
    Answer:
    """
    
    # Tokenize and generate
    inputs = llm_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
    output_ids = llm.generate(inputs.input_ids, max_length=50)
    answer = llm_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    return answer
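
If answers come back cut off, the max_length=50 cap on generation is usually the reason. The variant below (an optional tweak, not something the pipeline requires) gives the model more room and skips gradient tracking during inference:

# Optional variant of the generate call: longer answers, no gradient tracking
with torch.no_grad():
    output_ids = llm.generate(
        inputs.input_ids,
        max_new_tokens=100,  # cap on newly generated tokens rather than total length
        num_beams=4,         # simple beam search for slightly more stable answers
    )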

The “Conductor”: Bringing It All Together

Finally, we need a “conductor” to run all our steps in the right order. The if __name__ == "__main__": block is the main entry point of our script.

if __name__ == "__main__":
    
    # --- PHASE 1: INDEXING ---
    # (In a real app, you'd do this once and save the index)
    documents = chunk_data()
    doc_vectors = embed_documents(documents)
    vector_index = store_vectors(doc_vectors)
    
    print("\n--- RAG System is Indexed and Ready ---\n")
    
    # --- PHASE 2: QUERYING (Query 1) ---
    user_query = "What does Agent Alpha do?"
    print(f"User Query: \"{user_query}\"")
    
    query_vector = embed_query(user_query)
    context = retrieve_chunks(query_vector, vector_index, documents)
    final_answer = generate_answer(user_query, context)
    
    print("\n--- FINAL ANSWER ---")
    print(final_answer)

    # --- PHASE 2: QUERYING (Query 2) ---
    print("\n--- Running a second query ---\n")
    user_query_2 = "Where is Agent Charlie based?"
    print(f"User Query: \"{user_query_2}\"")

    query_vector_2 = embed_query(user_query_2)
    context_2 = retrieve_chunks(query_vector_2, vector_index, documents)
    final_answer_2 = generate_answer(user_query_2, context_2)

    print("\n--- FINAL ANSWER ---")
    print(final_answer_2)

Why This is So Powerful

You’ve just built a system where:

  1. Knowledge is Flexible: To point the app at your own data, you just change the text in the chunk_data() function and re-run (see the sketch after this list). No expensive re-training or fine-tuning required.
  2. Answers are Factual: The LLM is forced to use the context you provide. This “grounding” is what stops it from making things up (hallucinating).
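
As promised in point 1, here is one minimal way to swap in your own knowledge: replace chunk_data() with a loader that reads a folder of .txt files. The folder name below is just a placeholder:

from pathlib import Path

def chunk_data_from_folder(folder="my_docs"):
    """Illustrative replacement for chunk_data(): one chunk per .txt file."""
    documents = []
    for path in Path(folder).glob("*.txt"):
        documents.append(path.read_text(encoding="utf-8"))
    return documents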

Bookmark this post! This 6-step pattern is the fundamental building block for almost every modern RAG system you see today.

Ajitesh Kumar

I have recently been working in the area of data analytics, including data science and machine learning / deep learning. I am also passionate about different technologies, including programming languages such as Java/JEE, JavaScript, Python, R, and Julia, and technologies such as blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, and big data. I would love to connect with you on LinkedIn. Check out my latest book, First Principles Thinking: Building winning products using first principles thinking.
Posted in Generative AI, Large Language Models.