RAG with LlamaIndex#

LlamaIndex is a very popular framework for building connections between data sources and LLMs, and a top choice for building RAG applications. In this tutorial, we will go through how to use LlamaIndex to combine bge-base-en-v1.5 and GPT-4o-mini into a RAG application.

0. Preparation#

First, install the required packages in your environment.

%pip install llama-index-llms-openai llama-index-embeddings-huggingface llama-index-vector-stores-faiss
%pip install llama_index 

Then fill in your OpenAI API key below:

# For openai key
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

BGE-M3 is a very powerful embedding model. We would like to know what that ‘M3’ stands for.

Let’s first ask GPT the question:

from llama_index.llms.openai import OpenAI

# non-streaming
response = OpenAI().complete("What does M3-Embedding stand for?")
print(response)
M3-Embedding stands for Multimodal Multiscale Embedding. It is a technique used in machine learning and data analysis to embed high-dimensional data into a lower-dimensional space while preserving the structure and relationships within the data. This technique is particularly useful for analyzing complex datasets that contain multiple modalities or scales of information.

By checking the description in the GitHub repo of BGE-M3, we can be pretty sure that GPT is hallucinating. Let’s build a RAG pipeline to solve the problem!

1. Data#

First, download the BGE-M3 paper to a directory and load it through SimpleDirectoryReader.

Note that SimpleDirectoryReader can read all the documents under that directory and supports a lot of commonly used file types.
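If you do not have the paper locally yet, the cell below is a minimal sketch for downloading it into a data/ directory. It assumes the paper is the PDF hosted on arXiv (2402.03216); adjust the URL or path if you use a different copy.

import os
import urllib.request

# create the target directory if it does not exist yet
os.makedirs("data", exist_ok=True)

# assumed arXiv PDF of the BGE-M3 paper; replace with your own copy if needed
url = "https://arxiv.org/pdf/2402.03216"
urllib.request.urlretrieve(url, os.path.join("data", "bge_m3.pdf"))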

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader("data")
# reader = SimpleDirectoryReader("DIR_TO_FILE")
documents = reader.load_data()
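As an optional sanity check, you can look at how many Document objects were loaded and what metadata they carry:

# number of loaded documents (a PDF is typically split into one Document per page)
print(len(documents))

# metadata of the first document, e.g. file name and page label
print(documents[0].metadata)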

The Settings object holds the global settings for the RAG pipeline. Its attributes have defaults (OpenAI's GPT and embedding models) and can be overridden by users. Large attributes such as models are only loaded when they are actually used.

Here, we set node_parser to a SentenceSplitter() with our chosen parameters, use the open-source bge-base-en-v1.5 as our embedding model, and gpt-4o-mini as our LLM.

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# set the parser with parameters
Settings.node_parser = SentenceSplitter(
    chunk_size=1000,    # maximum number of tokens per chunk
    chunk_overlap=150,  # number of overlapping tokens between chunks
)

# set the specific embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# set the llm we want to use
Settings.llm = OpenAI(model="gpt-4o-mini")
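(Optional) If you want to preview how the SentenceSplitter will chunk the documents before building the index, a quick sketch using the parser we just configured is:

# split the loaded documents into nodes with the configured SentenceSplitter
nodes = Settings.node_parser.get_nodes_from_documents(documents)

print(f"number of nodes: {len(nodes)}")
print(nodes[0].text[:300])  # preview the beginning of the first chunk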

2. Indexing#

Indexing is one of the most important parts of RAG. LlamaIndex integrates with a great number of vector databases. Here we will use Faiss as an example.

First, check the dimension of the embeddings, which we will need to initialize a Faiss index.

embedding = Settings.embed_model.get_text_embedding("Hello world")
dim = len(embedding)
print(dim)
768

Then create the index with Faiss and our documents. Here LlamaIndex helps encapsulate the Faiss function calls. If you would like to know more about Faiss, refer to the tutorial on Faiss and indexing.

import faiss
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

# init Faiss and create a vector store
faiss_index = faiss.IndexFlatL2(dim)
vector_store = FaissVectorStore(faiss_index=faiss_index)

# customize the storage context using our vector store
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# use the loaded documents to build the index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
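(Optional) If you would like to reuse the index later without re-embedding the documents, the sketch below persists it to disk and loads it back. It assumes a local ./storage directory and follows the Faiss vector store example from the LlamaIndex documentation.

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore

# save the index, including the underlying Faiss index, to disk
index.storage_context.persist(persist_dir="./storage")

# load it back later without recomputing the embeddings
vector_store = FaissVectorStore.from_persist_dir("./storage")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage"
)
index = load_index_from_storage(storage_context=storage_context)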

3. Retrieve and Generate#

With a well-constructed index, we can now build the query engine to accomplish our task:

query_engine = index.as_query_engine()
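(Optional) If you are curious which chunks will be fed to the LLM, you can query the underlying retriever directly. The sketch below uses similarity_top_k=3, an arbitrary choice for illustration:

# retrieve the 3 most similar chunks for our question (illustrative top_k)
retriever = index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve("What does M3-Embedding stand for?")

for node_with_score in retrieved_nodes:
    # similarity score and a short preview of the chunk text
    print(node_with_score.score, node_with_score.node.get_content()[:100])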

The following cell displays the default prompt template for Q&A in our pipeline:

# check the default prompt template
prompt_template = query_engine.get_prompts()['response_synthesizer:text_qa_template']
print(prompt_template.get_template())
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 

(Optional) You can modify the prompt to match your use case:

from llama_index.core import PromptTemplate

template = """
You are a Q&A chat bot.
Using only the given context, answer the question.

<context>
{context_str}
</context>

Question: {query_str}
"""

new_template = PromptTemplate(template)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": new_template}
)

prompt_template = query_engine.get_prompts()['response_synthesizer:text_qa_template']
print(prompt_template.get_template())
You are a Q&A chat bot.
Using only the given context, answer the question.

<context>
{context_str}
</context>

Question: {query_str}

Finally, let’s see how the RAG application performs on our query!

response = query_engine.query("What does M3-Embedding stand for?")
print(response)
M3-Embedding stands for Multi-Linguality, Multi-Functionality, and Multi-Granularity.
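The answer now matches the paper. To double-check which chunks the answer was grounded on, you can also inspect the source nodes attached to the response:

# show the retrieved chunks (and their scores) that were passed to the LLM
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.get_content()[:100])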