{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indexing Using Faiss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practical cases, datasets contain thousands or millions of rows. Looping through the whole corpus to find the best answer to a query is very time and space consuming. In this tutorial, we'll introduce how to use indexing to make our retrieval fast and neat."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0: Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Install the dependencies in the environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### faiss-gpu on Linux (x86_64)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Faiss maintain the latest updates on conda. So if you have GPUs on Linux x86_64, create a conda virtual environment and run:\n",
    "\n",
    "```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```\n",
    "\n",
    "and make sure you select that conda env as the kernel for this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### faiss-cpu\n",
    "\n",
    "Otherwise it's simple, just run the following cell to install `faiss-cpu`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U faiss-cpu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a super tiny courpus with only 10 sentences, which will be the dataset we use.\n",
    "\n",
    "Each sentence is a concise discription of a famous people in specific domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = [\n",
    "    \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\",\n",
    "    \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\",\n",
    "    \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\",\n",
    "    \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\",\n",
    "    \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\",\n",
    "    \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\",\n",
    "    \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\",\n",
    "    \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\",\n",
    "    \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\",\n",
    "    \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And a few queries (add your own queries and check the result!): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries = [\n",
    "    \"Who is Robert Downey Jr.?\",\n",
    "    \"An expert of neural network\",\n",
    "    \"A famous female singer\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Text Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, for the sake of speed, we just embed the first 500 docs in the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape of the corpus embeddings: (10, 768)\n",
      "data type of the embeddings:  float32\n"
     ]
    }
   ],
   "source": [
    "from FlagEmbedding import FlagModel\n",
    "\n",
    "# get the BGE embedding model\n",
    "model = FlagModel('BAAI/bge-base-en-v1.5',\n",
    "                  query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n",
    "                  use_fp16=True)\n",
    "\n",
    "# get the embedding of the corpus\n",
    "corpus_embeddings = model.encode(corpus)\n",
    "\n",
    "print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)\n",
    "print(\"data type of the embeddings: \", corpus_embeddings.dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Faiss only accepts float32 inputs.\n",
    "\n",
    "So make sure the dtype of corpus_embeddings is float32 before adding them to the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "corpus_embeddings = corpus_embeddings.astype(np.float32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we build an index and add the embedding vectors to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import faiss\n",
    "\n",
    "# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768\n",
    "dim = corpus_embeddings.shape[-1]\n",
    "\n",
    "# create the faiss index and store the corpus embeddings into the vector space\n",
    "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n",
    "\n",
    "# if you installed faiss-gpu, uncomment the following lines to make the index on your GPUs.\n",
    "\n",
    "# co = faiss.GpuMultipleClonerOptions()\n",
    "# index = faiss.index_cpu_to_all_gpus(index, co)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No need to train if we use \"Flat\" quantizer and METRIC_INNER_PRODUCT as metric. Some other indices that using quantization might need training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "total number of vectors: 10\n"
     ]
    }
   ],
   "source": [
    "# check if the index is trained\n",
    "print(index.is_trained)  \n",
    "# index.train(corpus_embeddings)\n",
    "\n",
    "# add all the vectors to the index\n",
    "index.add(corpus_embeddings)\n",
    "\n",
    "print(f\"total number of vectors: {index.ntotal}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3.5 (Optional): Saving Faiss index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have your index with the embedding vectors, you can save it locally for future usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# change the path to where you want to save the index\n",
    "path = \"./index.bin\"\n",
    "faiss.write_index(index, path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you already have stored index in your local directory, you can load it by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = faiss.read_index(\"./index.bin\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Find answers to the query"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, get the embeddings of all the queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_embeddings = model.encode_queries(queries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, use the Faiss index to do a knn search in the vector space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.6686779  0.37858668 0.3767978 ]\n",
      " [0.6062041  0.59364545 0.527691  ]\n",
      " [0.5409331  0.5097007  0.42427146]]\n",
      "[[9 7 2]\n",
      " [3 1 8]\n",
      " [5 0 4]]\n"
     ]
    }
   ],
   "source": [
    "dists, ids = index.search(query_embeddings, k=3)\n",
    "print(dists)\n",
    "print(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see the result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query:\tWho is Robert Downey Jr.?\n",
      "answer:\tRobert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\n",
      "\n",
      "query:\tAn expert of neural network\n",
      "answer:\tGeoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\n",
      "\n",
      "query:\tA famous female singer\n",
      "answer:\tTaylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i, q in enumerate(queries):\n",
    "    print(f\"query:\\t{q}\\nanswer:\\t{corpus[ids[i][0]]}\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}