Last updated: 26th Jan, 2024
Have you ever wondered how to seamlessly integrate the vast knowledge of Large Language Models (LLMs) with domain-specific knowledge stored in file storage, image storage, vector databases, etc.? As the world of machine learning continues to evolve, the need for more sophisticated and contextually relevant responses from LLMs becomes paramount. A lack of contextual knowledge can lead to LLM hallucination, producing inaccurate, unsafe, and factually incorrect responses. This is where augmenting prompts with context, and hence the retrieval augmented generation method, comes into the picture. For data scientists and product managers keen on deploying LLMs in production, the Retrieval Augmented Generation pattern offers a compelling solution when they want to enrich end users' prompts with contextual information. In this blog, we’ll dive deep into the LLM RAG pattern, illustrating its power and potential with practical examples. Whether you’re aiming to enhance your product’s generative AI capabilities or simply curious about the next big thing in machine learning, this RAG LLM tutorial is tailored just for you.
The Retrieval Augmented Generation (RAG) pattern is a generative AI approach that combines the capabilities of large language models (LLMs) with contextual information stored in different forms of data storage, such as vector databases, file storage, image repositories, etc.
Let LLM represent a Large Language Model and IR represent a retriever or information retrieval system. The RAG pattern can then be expressed as:
RAG = f(LLM, IR)
Here, f is a function that combines the outputs of the LLM and the IR system.
Instead of relying solely on the pre-trained knowledge of an LLM, the RAG pattern fetches relevant contextual information to provide more informed and accurate responses. The key stages of a system implemented based on the RAG pattern are: retrieval of relevant context from external data stores, augmentation of the user’s prompt with that context, and generation of the response by the LLM.
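To make the RAG = f(LLM, IR) formulation concrete, here is a minimal conceptual sketch in Python. The retrieve and generate functions are hypothetical stand-ins for any retriever and any LLM API, not part of a specific library; they are hard-coded here only so the example runs end to end.

```python
# Conceptual sketch of RAG = f(LLM, IR). The retriever and LLM below are
# hypothetical stand-ins; in practice they would be a vector-database search
# and a hosted LLM API call, respectively.

def retrieve(query: str, k: int = 3) -> list[str]:
    """IR component: return the k most relevant context passages for the query."""
    knowledge_base = [
        "Doc 1: RAG combines an LLM with an external information retrieval system.",
        "Doc 2: The retriever supplies up-to-date, domain-specific context at query time.",
    ]
    return knowledge_base[:k]  # a real retriever would rank passages by similarity to the query

def generate(prompt: str) -> str:
    """LLM component: return the model's completion for the prompt."""
    return f"[LLM answer conditioned on the prompt below]\n{prompt}"

def rag(query: str) -> str:
    """f(LLM, IR): augment the user's query with retrieved context, then generate."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

print(rag("What does the RAG pattern combine?"))
```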
The following picture represents a RAG LLM system, showcasing the retriever (explained above) and the LLM working together.
Can RAG & fine-tuning be used together?
RAG can be integrated with off-the-shelf foundation models or with fine-tuned and human-aligned models specific to generative use cases and domains. RAG and fine-tuning can be used together. They are not mutually exclusive.
When is RAG useful?
RAG is useful in any case where you want the language model to have access to additional data that is not contained in what the LLM learned during pretraining and fine-tuning. This could be data that did not exist in the original training data, such as proprietary information from your organization’s internal data stores.
Let’s look at a real-life example to understand the RAG LLM pattern.
Imagine you have a vast database of scientific articles, and you want to answer a specific question using an LLM like GPT-4: “What are the latest advancements in CRISPR technology?” With the RAG pattern, the system first retrieves the most relevant (and most recent) articles from that database and then passes them, along with the question, to the LLM, so that the generated answer is grounded in the retrieved material rather than in the model’s training data alone.
Based on the above, we can understand that the RAG pattern enhances the LLM’s capabilities by integrating real-time, external knowledge sources, ensuring that the generated responses are both contextually relevant and informed by the most current information available.
Implementing the Retrieval Augmented Generation (RAG) pattern in generative AI applications offers a myriad of advantages, particularly for tasks that require nuanced, context-rich, and accurate responses. In my experience, the key benefits include more contextually relevant and accurate responses, reduced hallucination, access to information beyond the model’s training data, and the ability to ground answers in proprietary or domain-specific data without retraining the model.
Let’s delve into the steps involved in leveraging the RAG pattern for LLM, supplemented with examples:
Step 1. Deploy a Large Language Model (LLM): Deploy a large language model, such as one from OpenAI’s GPT series. These models are trained on vast amounts of text, enabling them to generate human-like text based on the input they receive. For instance, imagine deploying GPT-4 to answer questions about world history. While LLMs can answer a wide range of questions, their responses are based solely on their training data.
Step 2. Ask a Question to LLM: Pose a question to the LLM without giving any specific context. For example, asking “Who was Cleopatra?” without specifying which Cleopatra or any other context might yield a generic response like “Cleopatra was a famous Egyptian queen.” This highlights the inherent limitations of LLMs.
Step 3. Improve the Answer by Adding Insightful Context to the Same Question, based on the RAG pattern: By refining the question or providing additional context, you can guide the LLM to produce a more accurate or detailed answer. For instance, asking “What was Cleopatra VII’s role in Roman history?” might generate a more specific answer such as “Cleopatra VII was known for her relationships with the Roman leaders Julius Caesar and Mark Antony.” A minimal sketch contrasting Steps 2 and 3 is shown below.
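As a minimal sketch of the contrast between Step 2 and Step 3, the snippet below sends a bare question and then a context-augmented question to an LLM. It assumes the OpenAI Python client (v1.x) with an API key configured in the environment; the model name and the context passage are illustrative assumptions rather than fixed choices.

```python
# Sketch of Step 2 (bare question) vs Step 3 (context-augmented question)
# using the OpenAI Python client v1.x. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Step 2: ask the question without any context -- the answer is only as good
# as what the model learned during pretraining.
bare = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Who was Cleopatra?"}],
)
print(bare.choices[0].message.content)

# Step 3: prepend retrieved context (here a hypothetical passage) so the model
# grounds its answer in that context instead of relying on memory alone.
context = (
    "Cleopatra VII was the last active ruler of the Ptolemaic Kingdom of Egypt, "
    "known for her alliances with Julius Caesar and Mark Antony."
)
augmented = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": "What was Cleopatra VII's role in Roman history?"},
    ],
)
print(augmented.choices[0].message.content)
```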
In our quest to harness the full potential of Large Language Models (LLMs), it is recommended to use a Retrieval Augmented Generation (RAG) approach to add the insightful context described above. The core idea is to utilize document embeddings to pinpoint the most pertinent documents from our expansive knowledge library. These documents are then amalgamated with specific prompts when querying the LLM. At a high level, the step-by-step process is: embed the documents and index them in a vector store, embed the user’s question, retrieve the most similar documents, and combine them with the prompt sent to the LLM. The picture below (courtesy: RAG on AWS) illustrates these steps.
The following is how we can use Pinecone for storing document embeddings and then querying the embedding store to retrieve similar text or documents. The retrieved text is then combined with the user’s question in a prompt and sent to the LLM to get more accurate results.
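Below is a hedged sketch of that flow, assuming the Pinecone Python client (v3.x), the Sentence Transformers library, and an already-created 384-dimensional Pinecone index named rag-demo (matching the all-MiniLM-L6-v2 model). The index name, model choice, and example documents are illustrative assumptions rather than fixed requirements.

```python
# Sketch: store document embeddings in Pinecone, then retrieve similar documents
# for a question and assemble them into context for the LLM prompt.
# Assumes pinecone client v3.x, sentence-transformers, and an existing
# 384-dimensional index named "rag-demo" (all names here are illustrative).
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sBERT model
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("rag-demo")

# 1. Embed the documents and upsert them into the vector database.
documents = {
    "doc-1": "CRISPR-Cas9 enables precise editing of DNA sequences.",
    "doc-2": "Recent work explores base editing and prime editing variants of CRISPR.",
}
vectors = [
    {"id": doc_id, "values": embedder.encode(text).tolist(), "metadata": {"text": text}}
    for doc_id, text in documents.items()
]
index.upsert(vectors=vectors)

# 2. Embed the user's question and query the index for the most similar documents.
question = "What are the latest advancements in CRISPR technology?"
query_vector = embedder.encode(question).tolist()
results = index.query(vector=query_vector, top_k=2, include_metadata=True)

# 3. Combine the retrieved text with the question into a single prompt for the LLM
#    (the LLM call itself follows the same pattern as the earlier OpenAI sketch).
context = "\n".join(match.metadata["text"] for match in results.matches)
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer using the context above."
print(prompt)
```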
A question that often comes up is which model should be used to embed the data for RAG. Sentence BERT (sBERT) is often recommended for almost all applications. In a RAG setup, sBERT can be used to embed both the query (from the user or the language model) and the documents in the external knowledge source. The embeddings help in efficiently retrieving the most relevant documents or pieces of information from a large corpus, which the language model can then use to generate more informed and accurate responses.
Sentence BERT (sBERT) takes a pre-trained BERT encoder model and further trains it to produce semantically meaningful embeddings. To do this, we can use a siamese/triplet network structure, where we separately embed two (or three) pieces of text with BERT, producing an embedding for each piece of text. Then, given pairs (or triplets) of text that are either similar or not similar, we can train sBERT to accurately predict textual similarity from these embeddings using an added regression or classification head. This approach is cheap/efficient, drastically improves the semantic meaningfulness of BERT embeddings, and produces a high-performing (BERT-based) bi-encoder that we can use for vector search.
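The snippet below is a small illustration, using the Sentence Transformers library, of what such a bi-encoder provides: semantically similar sentences receive embeddings with high cosine similarity. The model name is a commonly used pre-trained example, not a specific recommendation.

```python
# Illustration: sBERT (via the Sentence Transformers library) maps sentences to
# embeddings whose cosine similarity reflects semantic similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a widely used pre-trained sBERT model

sentences = [
    "CRISPR allows precise editing of genes.",
    "Gene editing can be performed accurately with CRISPR.",
    "The stock market fell sharply yesterday.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: the first two sentences should score much
# higher with each other than either does with the third.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```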
A variety of open-source sBERT models are available via the Sentence Transformers library. To use these models for RAG, we simply need to split our documents into chunks, embed each chunk with sBERT, and then index all of these chunks for search in a vector database. From here, we can perform an efficient vector search (ideally, a hybrid search that also includes lexical components) over these chunks to retrieve relevant data at inference time for RAG.
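As a rough illustration of the chunking step, the function below splits a document into overlapping word-based chunks before embedding and indexing. The chunk size and overlap are arbitrary example values; production systems often chunk by tokens, sentences, or document sections instead.

```python
# Sketch: split a long document into overlapping chunks prior to embedding and
# indexing. Chunk size and overlap here are illustrative, not recommendations.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk would then be embedded with sBERT and upserted into the vector
# database, exactly as in the Pinecone example above.
```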
The fusion of Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) is a testament to the evolving landscape of machine learning. As we’ve explored, this synergy not only amplifies the capabilities of LLMs but also ensures that the responses generated are both contextually relevant and informed by the most current information available. By leveraging advanced embedding models and efficient search algorithms, businesses and researchers can unlock a new realm of possibilities, making AI-driven solutions more precise and insightful than ever before. Are you inspired to integrate the RAG LLM approach into your generative AI solutions? If you’ve already begun experimenting with this method, I’d love to hear about your experiences and insights.