{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Indexing Using Faiss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practical cases, datasets contain thousands or millions of rows. Scanning the whole corpus to find the best answer to every query is costly in both time and memory. In this tutorial, we'll show how to use an index to make retrieval fast and simple." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the dependencies in the environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U FlagEmbedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### faiss-gpu on Linux (x86_64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Faiss publishes its latest releases on conda. If you have GPUs on Linux x86_64, create a conda virtual environment and run:\n", "\n", "```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```\n", "\n", "Then make sure you select that conda env as the kernel for this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### faiss-cpu\n", "\n", "Otherwise, just run the following cell to install `faiss-cpu`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U faiss-cpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a tiny corpus of only 10 sentences, which will serve as our dataset.\n", "\n", "Each sentence is a concise description of a famous person in a specific domain." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "corpus = [\n", " \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\",\n", " \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\",\n", " \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\",\n", " \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\",\n", " \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\",\n", " \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\",\n", " \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\",\n", " \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\",\n", " \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\",\n", " \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And a few queries (add your own queries and check the result!): " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "queries = [\n", " \"Who is Robert Downey Jr.?\",\n", " \"An expert of neural network\",\n", " \"A famous female singer\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Text Embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our corpus has only 10 sentences, so we embed all of them." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape of the corpus embeddings: (10, 768)\n", "data type of the embeddings: float32\n" ] } ], "source": [ "from FlagEmbedding import FlagModel\n", "\n", "# get the BGE embedding model\n", "model = FlagModel('BAAI/bge-base-en-v1.5',\n", " query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", " use_fp16=True)\n", "\n", "# get the embedding of the corpus\n", "corpus_embeddings = model.encode(corpus)\n", "\n", "print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)\n", "print(\"data type of the embeddings: \", corpus_embeddings.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Faiss only accepts float32 inputs.\n", "\n", "So make sure the dtype of corpus_embeddings is float32 before adding them to the index." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "corpus_embeddings = corpus_embeddings.astype(np.float32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we build an index and add the embedding vectors to it." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import faiss\n", "\n", "# get the length of our embedding vectors; vectors from bge-base-en-v1.5 have length 768\n", "dim = corpus_embeddings.shape[-1]\n", "\n", "# create the faiss index that will hold the corpus embeddings\n", "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n", "\n", "# if you installed faiss-gpu, uncomment the following lines to move the index to your GPUs.\n", "\n", "# co = faiss.GpuMultipleClonerOptions()\n", "# index = faiss.index_cpu_to_all_gpus(index, co)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A \"Flat\" index stores the vectors as they are, so it needs no training. Other index types that use quantization may need to be trained before vectors are added." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "total number of vectors: 10\n" ] } ], "source": [ "# check if the index is trained\n", "print(index.is_trained) \n", "# index.train(corpus_embeddings)\n", "\n", "# add all the vectors to the index\n", "index.add(corpus_embeddings)\n", "\n", "print(f\"total number of vectors: {index.ntotal}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3.5 (Optional): Saving Faiss index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once your index holds the embedding vectors, you can save it locally for future use." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# change the path to where you want to save the index\n", "path = \"./index.bin\"\n", "faiss.write_index(index, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have an index stored locally, you can load it with:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "index = faiss.read_index(\"./index.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Find answers to the queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, get the embeddings of all the queries:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "query_embeddings = model.encode_queries(queries)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, use the Faiss index to run a k-nearest-neighbor (kNN) search in the vector space. It returns the inner-product scores and the corpus ids of the top `k` matches for each query:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.6686779 0.37858668 0.3767978 ]\n", " [0.6062041 0.59364545 0.527691 ]\n", " [0.5409331 0.5097007 0.42427146]]\n", "[[9 7 2]\n", " [3 1 8]\n", " [5 0 4]]\n" ] } ], "source": [ "dists, ids = index.search(query_embeddings, k=3)\n", "print(dists)\n", "print(ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at the top result for each query:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "query:\tWho is Robert Downey Jr.?\n", "answer:\tRobert Downey Jr. 
is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\n", "\n", "query:\tAn expert of neural network\n", "answer:\tGeoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\n", "\n", "query:\tA famous female singer\n", "answer:\tTaylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\n", "\n" ] } ], "source": [ "for i, q in enumerate(queries):\n", " print(f\"query:\\t{q}\\nanswer:\\t{corpus[ids[i][0]]}\\n\")" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }