{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluation is a crucial part of any machine learning task. In this notebook, we will walk through the whole pipeline of evaluating an embedding model on [MS Marco](https://microsoft.github.io/msmarco/), and use three metrics to measure its performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the dependencies in the environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U FlagEmbedding faiss-cpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Load Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, download the queries and the MS Marco dataset from the Hugging Face Hub:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "import numpy as np\n", "\n", "data = load_dataset(\"namespace-Pt/msmarco\", split=\"dev\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Considering the time cost, we will use a truncated dataset in this tutorial: `queries` contains the first 100 queries from the dataset, and `corpus` is formed by the positives of the first 5,000 queries." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "queries = np.array(data[:100][\"query\"])\n", "corpus = sum(data[:5000][\"positive\"], [])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a GPU and would like to try out the full evaluation of MS Marco, uncomment and run the following cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data = load_dataset(\"namespace-Pt/msmarco\", split=\"dev\")\n", "# queries = np.array(data[\"query\"])\n", "\n", "# corpus = load_dataset(\"namespace-Pt/msmarco-corpus\", split=\"train\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Choose the embedding model that we would like to evaluate, and encode the corpus into embeddings."
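] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The encoding cell below uses `BAAI/bge-base-en-v1.5`. If you would prefer to evaluate a different encoder, any model that maps a list of strings to a `(n, dim)` NumPy array works with the rest of this pipeline. The optional sketch below uses `sentence-transformers` purely for illustration (an assumption, not used elsewhere in this notebook); whatever model you choose, keep the embeddings L2-normalized, since the index we build later uses inner product." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch of an alternative encoder (assumes `pip install sentence-transformers`).\n", "# The rest of the notebook only needs `corpus_embeddings` to be a (n, dim) float NumPy array.\n", "# from sentence_transformers import SentenceTransformer\n", "\n", "# st_model = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n", "# corpus_embeddings = st_model.encode(corpus, normalize_embeddings=True)"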
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Inference Embeddings: 100%|██████████| 21/21 [02:10<00:00, 6.22s/it]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "shape of the corpus embeddings: (5331, 768)\n", "data type of the embeddings: float32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from FlagEmbedding import FlagModel\n", "\n", "# load the BGE embedding model\n", "model = FlagModel('BAAI/bge-base-en-v1.5',\n", "                  query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", "                  use_fp16=True)\n", "\n", "# get the embeddings of the corpus\n", "corpus_embeddings = model.encode(corpus)\n", "\n", "print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)\n", "print(\"data type of the embeddings: \", corpus_embeddings.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the `faiss.index_factory()` function to create the Faiss index we want:\n", "\n", "- The first argument `dim` is the dimension of the vector space, which is 768 if you're using bge-base-en-v1.5.\n", "\n", "- The second argument `'Flat'` makes the index do exhaustive search.\n", "\n", "- The third argument `faiss.METRIC_INNER_PRODUCT` tells the index to use inner product as the distance metric." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total number of vectors: 5331\n" ] } ], "source": [ "import faiss\n", "\n", "# get the dimension of our embedding vectors; vectors from bge-base-en-v1.5 have dimension 768\n", "dim = corpus_embeddings.shape[-1]\n", "\n", "# create the faiss index and store the corpus embeddings into the vector space\n", "index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)\n", "# faiss only accepts float32 vectors\n", "corpus_embeddings = corpus_embeddings.astype(np.float32)\n", "# train the index (a no-op for a Flat index, but required by other index types) and add the embeddings\n", "index.train(corpus_embeddings)\n", "index.add(corpus_embeddings)\n", "\n", "print(f\"total number of vectors: {index.ntotal}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the embedding process is time-consuming, it's a good idea to save the index for reproducibility or other experiments.\n", "\n", "Uncomment the following lines to save the index." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# path = \"./index.bin\"\n", "# faiss.write_index(index, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have a stored index in your local directory, you can load it by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# index = faiss.read_index(\"./index.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Retrieval" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the embeddings of all the queries, along with their corresponding ground-truth answers for evaluation." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "query_embeddings = model.encode_queries(queries)\n", "ground_truths = [d[\"positive\"] for d in data]\n", "corpus = np.asarray(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the Faiss index to search for the top $k$ answers to each query."
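] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of the search API before running the batched loop: `index.search` takes a float32 array of query vectors and returns two arrays of shape `(n_queries, k)` holding the similarity scores and the integer ids of the matched corpus entries. The optional probe below is a minimal sketch using the objects defined above; uncomment it to search with a single query." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional probe: search the index with the first query embedding only.\n", "# `probe_scores` holds inner-product scores, `probe_ids` holds row indices into `corpus`.\n", "# probe_scores, probe_ids = index.search(query_embeddings[:1].astype(np.float32), k=5)\n", "# print(probe_scores.shape, probe_ids.shape)  # both (1, 5)\n", "# print(corpus[probe_ids[0][0]])  # top-ranked passage for the first query"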
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Searching: 100%|██████████| 1/1 [00:00<00:00, 20.91it/s]\n" ] } ], "source": [ "from tqdm import tqdm\n", "\n", "res_scores, res_ids, res_text = [], [], []\n", "query_size = len(query_embeddings)\n", "batch_size = 256\n", "# the cutoffs we will use during evaluation; set k to the maximum of the cutoffs\n", "cut_offs = [1, 10]\n", "k = max(cut_offs)\n", "\n", "for i in tqdm(range(0, query_size, batch_size), desc=\"Searching\"):\n", "    q_embedding = query_embeddings[i: min(i+batch_size, query_size)].astype(np.float32)\n", "    # search the top k answers for each of the queries\n", "    score, idx = index.search(q_embedding, k=k)\n", "    res_scores += list(score)\n", "    res_ids += list(idx)\n", "    res_text += list(corpus[idx])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1 Recall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall measures the model's ability to retrieve the positive instances among all the actual positive samples in the dataset.\n", "\n", "$$\\textbf{Recall}=\\frac{\\text{True Positives}}{\\text{True Positives}+\\text{False Negatives}}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall is useful when the cost of false negatives is high: we want to find all objects of the positive class, even if this results in some false positives. This makes recall a useful metric for text retrieval tasks." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "recall@1: 0.97\n", "recall@10: 1.0\n" ] } ], "source": [ "def calc_recall(preds, truths, cutoffs):\n", "    recalls = np.zeros(len(cutoffs))\n", "    for text, truth in zip(preds, truths):\n", "        for i, c in enumerate(cutoffs):\n", "            # count how many ground-truth passages appear in the top-c retrieved passages\n", "            recall = np.intersect1d(truth, text[:c])\n", "            recalls[i] += len(recall) / max(min(c, len(truth)), 1)\n", "    recalls /= len(preds)\n", "    return recalls\n", "\n", "recalls = calc_recall(res_text, ground_truths, cut_offs)\n", "for i, c in enumerate(cut_offs):\n", "    print(f\"recall@{c}: {recalls[i]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 MRR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.\n", "\n", "$$MRR=\\frac{1}{|Q|}\\sum_{i=1}^{|Q|}\\frac{1}{rank_i}$$\n", "\n", "where\n", "- $|Q|$ is the total number of queries.\n", "- $rank_i$ is the rank position of the first relevant document for the $i$-th query."
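] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a small worked example (illustrative numbers only): suppose three queries have their first relevant passage at ranks 1, 2, and 4. Then $MRR = (1/1 + 1/2 + 1/4)/3 \\approx 0.58$. With a cutoff $c$ (written MRR@$c$), only hits at rank $c$ or better contribute, so a query whose first relevant passage falls beyond the cutoff adds 0 to the sum. The implementation below follows this definition."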
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def MRR(preds, truth, cutoffs):\n", "    mrr = [0 for _ in range(len(cutoffs))]\n", "    for pred, t in zip(preds, truth):\n", "        for i, c in enumerate(cutoffs):\n", "            # find the first relevant hit within the cutoff and add its reciprocal rank\n", "            for j, p in enumerate(pred):\n", "                if j < c and p in t:\n", "                    mrr[i] += 1/(j+1)\n", "                    break\n", "    mrr = [k/len(preds) for k in mrr]\n", "    return mrr" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MRR@1: 0.97\n", "MRR@10: 0.9825\n" ] } ], "source": [ "mrr = MRR(res_text, ground_truths, cut_offs)\n", "for i, c in enumerate(cut_offs):\n", "    print(f\"MRR@{c}: {mrr[i]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 nDCG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the position of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:\n", "\n", "1. Compute the Discounted Cumulative Gain (DCG), which measures the ranking quality of the retrieved list.\n", "\n", "$$DCG_p=\\sum_{i=1}^p\\frac{2^{rel_i}-1}{\\log_2(i+1)}$$\n", "\n", "2. Normalize by the ideal DCG to make the score comparable across queries.\n", "\n", "$$nDCG_p=\\frac{DCG_p}{IDCG_p}$$\n", "\n", "where $IDCG_p$ is the maximum possible DCG for the given set of documents, assuming they are perfectly ranked in order of relevance." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# binary relevance labels: 1 if a retrieved passage is in the ground truth, 0 otherwise\n", "pred_hard_encodings = []\n", "for pred, label in zip(res_text, ground_truths):\n", "    pred_hard_encoding = list(np.isin(pred, label).astype(int))\n", "    pred_hard_encodings.append(pred_hard_encoding)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nDCG@1: 0.97\n", "nDCG@10: 0.9869253606521631\n" ] } ], "source": [ "from sklearn.metrics import ndcg_score\n", "\n", "for i, c in enumerate(cut_offs):\n", "    nDCG = ndcg_score(pred_hard_encodings, res_scores, k=c)\n", "    print(f\"nDCG@{c}: {nDCG}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You have walked through the full pipeline of evaluating an embedding model. Feel free to play with different datasets and models!" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }