Indexing Using Faiss#

In practice, datasets often contain thousands or millions of entries, and looping through the whole corpus to find the best match for a query quickly becomes too slow and memory hungry. In this tutorial, we’ll show how to use an index to make retrieval fast and tidy.

Step 0: Setup#

Install the dependencies in the environment.

%pip install -U FlagEmbedding

faiss-gpu on Linux (x86_64)#

Faiss maintains its latest releases on conda. So if you have GPUs on a Linux x86_64 machine, create a conda virtual environment and run:

conda install -c pytorch -c nvidia faiss-gpu=1.8.0

and make sure you select that conda env as the kernel for this notebook.

faiss-cpu#

Otherwise, simply run the following cell to install faiss-cpu:

%pip install -U faiss-cpu
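If you want to double-check the installation (either the GPU or CPU build), importing faiss and printing its version is a quick sanity check:

import faiss

# confirm faiss imports correctly and report the installed version
print(faiss.__version__)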

Step 1: Dataset#

Below is a tiny corpus with only 10 sentences, which will be the dataset we use.

Each sentence is a concise description of a famous person in a specific domain.

corpus = [
    "Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.",
    "Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.",
    "Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'",
    "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.",
    "Eminem is a renowned rapper and one of the best-selling music artists of all time.",
    "Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.",
    "Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.",
    "Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.",
    "Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.",
    "Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.",
]

And a few queries (add your own queries and check the result!):

queries = [
    "Who is Robert Downey Jr.?",
    "An expert of neural network",
    "A famous female singer",
]

Step 2: Text Embedding#

Here we use the BGE embedding model to encode the whole corpus; since it contains only 10 sentences, this is fast.

from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus)

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
shape of the corpus embeddings: (10, 768)
data type of the embeddings:  float32

Faiss only accepts float32 inputs.

So make sure the dtype of corpus_embeddings is float32 before adding them to the index.

import numpy as np

corpus_embeddings = corpus_embeddings.astype(np.float32)
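The index we build below uses inner product as its metric, which only behaves like cosine similarity when the embeddings are L2-normalized. BGE embeddings returned by FlagModel should already be normalized, but it is easy to verify with a small optional check:

# vector norms should all be (approximately) 1 if the embeddings are L2-normalized
norms = np.linalg.norm(corpus_embeddings, axis=1)
print("min norm:", norms.min(), " max norm:", norms.max())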

Step 3: Indexing#

In this step, we build an index and add the embedding vectors to it.

import faiss

# get the dimension of our embedding vectors; vectors from bge-base-en-v1.5 have dimension 768
dim = corpus_embeddings.shape[-1]

# create the faiss index; the corpus embeddings will be added to its vector space below
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)

# if you installed faiss-gpu, uncomment the following lines to move the index to your GPU(s).

# co = faiss.GpuMultipleClonerOptions()
# index = faiss.index_cpu_to_all_gpus(index, co)

There is no need to train the index when we use the “Flat” index type with METRIC_INNER_PRODUCT as the metric. Some other index types that rely on quantization (such as IVF or PQ) do need training; see the short sketch after the next cell.

# check if the index is trained
print(index.is_trained)  
# index.train(corpus_embeddings)

# add all the vectors to the index
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
True
total number of vectors: 10
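For comparison, index types that rely on quantization, such as IVF, must be trained before vectors can be added. Below is a minimal sketch; the nlist value is a hypothetical choice for this tiny corpus, and training on only 10 vectors will trigger clustering warnings, so treat it as an illustration rather than a recommended setup:

# an IVF index partitions the vectors into nlist clusters and needs training first
nlist = 4  # hypothetical number of clusters for this tiny corpus
quantizer = faiss.IndexFlatIP(dim)
ivf_index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

print(ivf_index.is_trained)        # False before training
ivf_index.train(corpus_embeddings)
ivf_index.add(corpus_embeddings)
print(ivf_index.is_trained, ivf_index.ntotal)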

Step 3.5 (Optional): Saving Faiss index#

Once you have an index with the embedding vectors added, you can save it locally for future use.

# change the path to where you want to save the index
path = "./index.bin"
faiss.write_index(index, path)

If you already have an index stored in your local directory, you can load it with:

index = faiss.read_index("./index.bin")
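A quick check that the index loaded correctly is to confirm it still contains all the vectors:

# the loaded index should report the same number of vectors as before saving
print(index.ntotal)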

Step 4: Find answers to the queries#

First, get the embeddings of all the queries:

query_embeddings = model.encode_queries(queries)

Then, use the Faiss index to run a k-nearest-neighbor (kNN) search in the vector space:

dists, ids = index.search(query_embeddings, k=3)
print(dists)
print(ids)
[[0.6686779  0.37858668 0.3767978 ]
 [0.6062041  0.59364545 0.527691  ]
 [0.5409331  0.5097007  0.42427146]]
[[9 7 2]
 [3 1 8]
 [5 0 4]]

Now let’s look at the top result for each query:

for i, q in enumerate(queries):
    print(f"query:\t{q}\nanswer:\t{corpus[ids[i][0]]}\n")
query:	Who is Robert Downey Jr.?
answer:	Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.

query:	An expert of neural network
answer:	Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.

query:	A famous female singer
answer:	Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.
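The loop above only prints the top-1 match for each query. If you want to inspect all k retrieved passages together with their scores, a small extension of the same loop (using the dists and ids from above) works:

# print every retrieved passage together with its inner-product score
for i, q in enumerate(queries):
    print(f"query:\t{q}")
    for score, idx in zip(dists[i], ids[i]):
        print(f"score: {score:.4f}\tanswer: {corpus[idx]}")
    print()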