Have you ever wondered how to seamlessly integrate the vast knowledge of Large Language Models (LLMs) with the specificity of domain-specific knowledge or external databases? As the world of machine learning continues to evolve, the need for more sophisticated and contextually relevant responses from models becomes paramount. For data scientists and product managers keen on deploying LLMs in production, the Retrieval Augmented Generation (RAG) pattern offers a compelling solution. In this blog post, we’ll dive deep into the RAG pattern, illustrating its power and potential with practical examples. Whether you’re aiming to enhance your product’s AI capabilities or simply curious about the next big thing in machine learning, this exploration of RAG and LLMs is tailored just for you.
What is the Retrieval Augmented Generation (RAG) Pattern for Leveraging LLMs?
The Retrieval Augmented Generation (RAG) pattern is a machine learning approach that combines the capabilities of large language models (LLMs) with external knowledge or information retrieval systems. Instead of relying solely on the pre-trained knowledge of an LLM, the RAG pattern fetches relevant external information to provide more informed and contextually accurate responses. A system built on the RAG pattern has two key stages:
- Retrieval Phase: Given an input query (like a question), the RAG system first retrieves relevant documents or passages from a large corpus using a retriever. This is often done using efficient dense vector space methods, like Dense Passage Retrieval (DPR), which embeds both the query and the documents into a continuous vector space and retrieves the documents closest to the query under a distance or similarity metric.
- Generation Phase: Once the top-k relevant documents or passages are retrieved, they are fed into a sequence-to-sequence generator along with the original query. The generator is then responsible for producing the desired output (like an answer to the question) using both the query and the retrieved passages as context.
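The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production retriever: `retrieve` ranks passages by simple word overlap instead of dense vectors, and `generate` is a hypothetical stand-in that only assembles the input a real generator would receive. All function names and the sample corpus are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=2):
    """Retrieval phase: return the k passages sharing the most words with the query.

    Word overlap is a toy relevance score; a real system would compare
    dense embeddings instead.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def generate(query, passages):
    """Generation phase (stand-in): build the input a generator would receive.

    A real implementation would send this string to an LLM and return its answer.
    """
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


def rag_answer(query, corpus, k=2):
    """Run retrieval first, then generation on the retrieved passages."""
    passages = retrieve(query, corpus, k)
    return generate(query, passages)


corpus = [
    "The Colosseum hosted gladiatorial contests in ancient Rome.",
    "Aqueducts carried water into Roman cities.",
    "Cleopatra VII ruled Egypt.",
]
print(rag_answer("What contests were held in the Colosseum?", corpus, k=1))
```

Keeping `retrieve` and `generate` as separate functions mirrors the separation of the two phases: either one can be replaced without touching the other.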
Let’s look at a real-life example to understand the RAG pattern.
Imagine you have a vast database of scientific articles, and you want to answer a specific question using an LLM like GPT-4: “What are the latest advancements in CRISPR technology?”
- Without RAG: The LLM might provide a general overview of CRISPR based on its training data, which could be outdated or lack the latest research insights.
- With RAG:
  - First, the system would query the database of scientific articles to retrieve the most relevant and recent articles on CRISPR advancements.
  - These retrieved articles would then be used as context when posing the question to the LLM.
  - The LLM, now equipped with this fresh context, would generate a response that incorporates the latest findings from the retrieved articles, offering a more up-to-date and detailed answer.
As this example shows, the RAG pattern enhances the LLM’s capabilities by integrating external, up-to-date knowledge sources, so that the generated responses are both contextually relevant and informed by the most current information available in the knowledge base.
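To make the difference concrete, the sketch below builds the two prompts side by side. The `build_prompt` helper, its instruction wording, and the placeholder excerpts are assumptions for illustration; in a real system the excerpts would come from the article database.

```python
def build_prompt(question, articles=None):
    """Return the prompt text sent to the LLM.

    Without articles, the model must rely on its training data alone.
    With articles, the retrieved text is placed ahead of the question.
    """
    if not articles:
        return f"Question: {question}\nAnswer:"
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(articles, start=1)
    )
    return (
        "Answer the question using only the articles below. "
        "Cite article numbers where relevant.\n\n"
        f"Articles:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


question = "What are the latest advancements in CRISPR technology?"

# Placeholder text standing in for retrieved article excerpts.
retrieved = [
    "<excerpt from retrieved article 1>",
    "<excerpt from retrieved article 2>",
]

print(build_prompt(question))             # without RAG
print(build_prompt(question, retrieved))  # with RAG
```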
Why implement the RAG pattern?
The following are a few benefits of using the RAG pattern for your use-case-specific search needs:
- Scalability: By separating the retrieval and generation phases, RAG can effectively leverage massive external corpora without having to encode the entire corpus directly in the generator.
- Flexibility: Since the retriever can be updated or even replaced without retraining the generator, it provides adaptability to changing data or requirements.
- Quality: In many tasks, especially open-domain question answering, RAG has shown improved performance over models that rely solely on generation or solely on retrieval.
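The flexibility point can be illustrated with a small interface sketch. The `Retriever` protocol and the two toy retrievers are hypothetical names chosen for this example; the idea is only that the generation side depends on a `retrieve` method rather than on one specific retriever.

```python
from typing import Callable, List, Protocol


class Retriever(Protocol):
    """Anything with a retrieve(query, k) method can serve as the retriever."""

    def retrieve(self, query: str, k: int) -> List[str]: ...


class KeywordRetriever:
    """Toy retriever: keeps documents containing any query word."""

    def __init__(self, documents: List[str]):
        self.documents = documents

    def retrieve(self, query: str, k: int) -> List[str]:
        words = query.lower().split()
        hits = [d for d in self.documents if any(w in d.lower() for w in words)]
        return hits[:k]


class NewestFirstRetriever:
    """Toy replacement: returns the most recently added documents."""

    def __init__(self, documents: List[str]):
        self.documents = documents

    def retrieve(self, query: str, k: int) -> List[str]:
        return list(reversed(self.documents))[:k]


def answer(query: str, retriever: Retriever, llm: Callable[[str], str], k: int = 2) -> str:
    """Generation side: unchanged no matter which retriever is plugged in."""
    context = "\n".join(retriever.retrieve(query, k))
    return llm(f"{context}\n\nQuestion: {query}")


docs = ["Roman aqueducts", "Greek temples", "Egyptian pyramids"]
echo_llm = lambda prompt: prompt  # stand-in for a real model call

print(answer("roman", KeywordRetriever(docs), echo_llm))
print(answer("roman", NewestFirstRetriever(docs), echo_llm))
```

Swapping `KeywordRetriever` for `NewestFirstRetriever` changes what context is retrieved, while `answer` and the model call stay exactly as they were.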
4 Key Steps for Implementing the RAG Pattern for LLMs
The Retrieval-Augmented Generation (RAG) pattern is an innovative approach that combines the power of large language models (LLMs) with external knowledge sources to generate more informed and contextually relevant responses. Let’s delve into the steps involved in leveraging the RAG pattern, supplemented with examples:
Step 1. Deploy a Large Language Model (LLM)
Deploy a large language model, such as one from OpenAI’s GPT series. These models are trained on vast amounts of text, enabling them to generate human-like text based on the input they receive. For instance, imagine deploying GPT-4 to answer questions about world history. While LLMs can answer a wide range of questions, their responses are based solely on their training data.
Step 2. Ask a Question to the LLM Without Providing the Context
Pose a question to the LLM without giving any specific context. For example, asking “Who was Cleopatra?” without specifying which Cleopatra or any other context might yield a generic response like “Cleopatra was a famous Egyptian queen.” This highlights the inherent limitations of LLMs.
Step 3. Improve the Answer to the Same Question Using Prompt Engineering with Insightful Context
By refining the question or providing additional context, you can guide the LLM to produce a more accurate or detailed answer. For instance, asking “What was Cleopatra VII’s role in Roman history?” might generate a more specific answer such as “Cleopatra VII was known for her relationships with Roman leaders Julius Caesar and Mark Antony.”
Step 4. Use RAG Based Approach to Identify the Correct Documents, and Use Them Along with Prompts and Questions to Query LLM
To harness the full potential of Large Language Models (LLMs), we recommend a Retrieval Augmented Generation (RAG) approach. The core idea is to use document embeddings to pinpoint the most pertinent documents in our knowledge library. These documents are then combined with specific prompts when querying the LLM. The following is a step-by-step process for implementing RAG. The picture below (courtesy: RAG on AWS) illustrates these steps.
- Deploying the Model Endpoint for the Embedding Model: Before you can retrieve relevant documents, you need an embedding model that can convert text into numerical vectors. Consider deploying a BERT-based model to generate embeddings for historical texts. These embedding models capture the semantic essence of documents, making them crucial for similarity-based retrieval. To store the resulting document embeddings and query them later, you can use a vector database service such as Pinecone.
- Generate Embeddings for Each Document in the Knowledge Library with the Embedding Model: Convert each document in your knowledge library into a vector representation. For instance, transforming texts about ancient civilizations into numerical vectors will serve as the foundation for indexing and retrieval. Recall that embeddings are a way of representing text, or really any kind of data, in a numerical format, typically as dense vectors in a high-dimensional space. For text, these embeddings are generated using Natural Language Processing (NLP) models that can capture the semantic essence of the content. The beauty of embeddings is that semantically similar items (like documents or words) will have similar embeddings, i.e., they’ll be close to each other in the embedding space.
- Index the Embedding Knowledge Library: A nearest-neighbor algorithm, such as K-Nearest Neighbors (KNN) search, can be used to index the generated embeddings. As an example, you might be indexing embeddings of texts about Roman, Greek, and Egyptian civilizations. Exact KNN compares the query against every stored vector, so large libraries typically rely on approximate nearest-neighbor indexes to keep search scalable and efficient, ensuring that the most relevant documents are retrieved quickly.
- Retrieve the Most Relevant Documents: Based on the query’s embedding, retrieve the top ‘k’ most similar documents from the indexed knowledge library. For a query about “Roman architecture,” you might retrieve documents discussing the Colosseum, aqueducts, and Roman temples. This ensures that the LLM has access to the most relevant external information when generating a response.
- Combine the Retrieved Documents, Prompt, and Question to Query the LLM: The final step involves feeding the LLM with the retrieved documents, the refined prompt, and the original question. For instance, combining documents about the Colosseum with the question “How did Romans use the Colosseum?” might yield a detailed answer like “Romans used the Colosseum for gladiatorial contests and public spectacles.”
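The embedding, indexing, and retrieval steps can be sketched without any external service. The example below is a toy under stated assumptions: `embed` is a hypothetical bag-of-words embedding over a fixed vocabulary (a real deployment would call the embedding model endpoint instead), and the index is a plain list that is scanned exhaustively, which amounts to exact K-nearest-neighbor search.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCAB = ["roman", "colosseum", "aqueduct", "temple", "greek", "egyptian", "pyramid"]


def embed(text):
    """Toy embedding: count each vocabulary word, then L2-normalize the counts."""
    lowered = text.lower()
    counts = [float(lowered.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def cosine(a, b):
    """Cosine similarity of two L2-normalized vectors reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))


def build_index(documents):
    """Index step: store each document next to its embedding."""
    return [(doc, embed(doc)) for doc in documents]


def top_k(query, index, k=2):
    """Retrieve step: exact KNN by scoring every indexed document."""
    q = embed(query)
    scored = [(cosine(q, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


library = [
    "The Colosseum is a Roman amphitheatre.",
    "A Roman aqueduct carried water to the city.",
    "The Egyptian pyramid at Giza is a tomb.",
    "A Greek temple often honored a single deity.",
]
index = build_index(library)
print(top_k("Roman architecture: Colosseum and aqueduct", index, k=2))
```

Because the query and the documents pass through the same `embed` function, documents about related topics score higher than unrelated ones. The documents returned by `top_k` are what would be combined with the prompt and question in the final step.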
The following is how we can use Pinecone to store document embeddings and then later query the embedding storage to find similar text or documents. The results are then combined with questions and prompts and sent to the LLM to get more accurate answers.
- Storing Document Embeddings:
  - Text to Vector: First, you need to convert your text documents into embeddings using an NLP model, like BERT, RoBERTa, or any other model suitable for your data.
  - Pinecone Initialization: Initialize Pinecone and create a new vector index to store the document embeddings.
  - Uploading Vectors: Using Pinecone’s APIs, you can then batch upload your document embeddings to the created index.
- Querying for Similar Context:
  - Generating Query Embedding: When you have a new query (or a document for which you want to find similar content), you first convert it into its corresponding embedding using the same NLP model you used earlier.
  - Cosine Similarity Search: Pinecone allows for similarity searches based on several metrics. Cosine similarity is a popular choice for textual embeddings. By querying Pinecone with the generated embedding and specifying cosine similarity as the metric, you’ll retrieve the most similar document embeddings stored in the Pinecone index.
  - Interpreting Results: The returned results will include identifiers for the matched documents and their similarity scores. You can use this information to fetch the actual content from your original document store or display relevant excerpts to the user.
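The workflow above might look like the following sketch. The two helper functions are written against an `index` object that is assumed to expose `upsert(vectors=...)` and `query(vector=..., top_k=...)` in the general shape of Pinecone's Python client; client setup and response types differ between client versions, so check them against the Pinecone documentation before relying on this. To keep the example runnable without an account, `StubIndex` is a hypothetical in-memory stand-in, not part of Pinecone, and the vectors are made-up three-dimensional values rather than real model embeddings.

```python
import math


class StubIndex:
    """Hypothetical in-memory stand-in for a vector index (not Pinecone itself)."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, vectors):
        # Each item is an (id, values) pair.
        for doc_id, values in vectors:
            self._vectors[doc_id] = values

    def query(self, vector, top_k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = sorted(
            ((cosine(vector, v), doc_id) for doc_id, v in self._vectors.items()),
            reverse=True,
        )
        return {"matches": [{"id": i, "score": s} for s, i in scored[:top_k]]}


def upload_in_batches(index, embeddings, batch_size=100):
    """Upload {doc_id: vector} pairs to the index in fixed-size batches."""
    items = list(embeddings.items())
    for start in range(0, len(items), batch_size):
        index.upsert(vectors=items[start:start + batch_size])


def find_similar(index, query_vector, document_store, top_k=2):
    """Query the index, then map matched IDs back to the original text."""
    response = index.query(vector=query_vector, top_k=top_k)
    return [
        (match["id"], round(match["score"], 3), document_store[match["id"]])
        for match in response["matches"]
    ]


# Original documents, kept outside the vector index and looked up by ID.
document_store = {
    "doc-1": "Text about the Colosseum",
    "doc-2": "Text about Roman temples",
    "doc-3": "Text about Egyptian pyramids",
}
# Made-up embeddings; a real pipeline would produce these with the NLP model.
embeddings = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.9, 0.1, 0.0],
    "doc-3": [0.0, 0.0, 1.0],
}

index = StubIndex()
upload_in_batches(index, embeddings, batch_size=2)
for doc_id, score, text in find_similar(index, [1.0, 0.0, 0.0], document_store):
    print(doc_id, score, text)
```

Keeping the original text in a separate `document_store` follows the last step of the workflow: the index returns identifiers and scores, and the content is fetched by ID afterwards.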
The fusion of Retrieval Augmented Generation with Large Language Models is a testament to the evolving landscape of machine learning. As we’ve explored, this synergy not only amplifies the capabilities of LLMs but also ensures that the responses generated are both contextually relevant and informed by the most current information available. By leveraging advanced embedding models and efficient search algorithms, businesses and researchers can unlock a new realm of possibilities, making AI-driven solutions more precise and insightful than ever before. Are you inspired to integrate the RAG approach into your AI solutions? If you’ve already begun experimenting with this method, I’d love to hear about your experiences and insights.