{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation Metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we'll cover a list of metrics that are widely used for evaluating embedding model's performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Preparation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install numpy scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we have a corpus with document ids from 0 - 30. \n", "- `ground_truth` contains the actual relevant document ids to each query.\n", "- `results` contains the search results of each query by some retrieval system." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "ground_truth = [\n", " [11, 1, 7, 17, 21],\n", " [ 4, 16, 1],\n", " [26, 10, 22, 8],\n", "]\n", "\n", "results = [\n", " [11, 1, 17, 7, 21, 8, 0, 28, 9, 20],\n", " [16, 1, 6, 18, 3, 4, 25, 19, 8, 14],\n", " [24, 10, 26, 2, 8, 28, 4, 23, 13, 21],\n", "]" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 19,\n", " 21, 22, 24, 25, 26, 28])" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.intersect1d(ground_truth, results)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.isin(ground_truth, results).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we are interested in the following cutoffs:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "cutoffs = [1, 5, 10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will use the above small example to show how different metrics evaluate the retrieval system's quality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Recall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall represents the model's capability of correctly predicting positive instances from all the actual positive samples in the dataset.\n", "\n", "$$\\textbf{Recall}=\\frac{\\text{True Positives}}{\\text{True Positives}+\\text{False Negatives}}$$\n", "\n", "to write it in the form of information retrieval, which is the ratio of relevant documents retrieved to the total number of relevant documents in the corpus. 
"In practice, we usually take the denominator to be the minimum of the current cutoff $k$ (usually 1, 5, 10, 100, etc.) and the total number of relevant documents in the corpus:\n", "\n", "$$\\textbf{Recall@}k=\\frac{|\\text{\\{Relevant docs\\}}\\cap\\text{\\{Retrieved docs\\}}|}{\\text{min}(k, |\\text{\\{Relevant docs\\}}|)}$$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def calc_recall(preds, truths, cutoffs):\n", "    recalls = np.zeros(len(cutoffs))\n", "    for pred, truth in zip(preds, truths):\n", "        for i, c in enumerate(cutoffs):\n", "            # relevant documents found in the top-c results\n", "            hits = np.intersect1d(truth, pred[:c])\n", "            recalls[i] += len(hits) / max(min(c, len(truth)), 1)\n", "    recalls /= len(preds)\n", "    return recalls" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "recall@1: 0.6666666666666666\n", "recall@5: 0.8055555555555555\n", "recall@10: 0.9166666666666666\n" ] } ], "source": [ "recalls = calc_recall(results, ground_truth, cutoffs)\n", "for i, c in enumerate(cutoffs):\n", "    print(f\"recall@{c}: {recalls[i]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. MRR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.\n", "\n", "$$MRR=\\frac{1}{|Q|}\\sum_{i=1}^{|Q|}\\frac{1}{rank_i}$$\n", "\n", "where\n", "- $|Q|$ is the total number of queries.\n", "- $rank_i$ is the rank position of the first relevant document of the $i$-th query." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def calc_MRR(preds, truths, cutoffs):\n", "    mrr = [0 for _ in range(len(cutoffs))]\n", "    for pred, truth in zip(preds, truths):\n", "        for i, c in enumerate(cutoffs):\n", "            for j, p in enumerate(pred):\n", "                # only the first relevant document within the cutoff counts\n", "                if j < c and p in truth:\n", "                    mrr[i] += 1/(j+1)\n", "                    break\n", "    mrr = [k/len(preds) for k in mrr]\n", "    return mrr" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MRR@1: 0.6666666666666666\n", "MRR@5: 0.8333333333333334\n", "MRR@10: 0.8333333333333334\n" ] } ], "source": [ "mrr = calc_MRR(results, ground_truth, cutoffs)\n", "for i, c in enumerate(cutoffs):\n", "    print(f\"MRR@{c}: {mrr[i]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. nDCG" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the positions of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:\n", "\n", "1. Discounted Cumulative Gain (DCG) measures the ranking quality of the retrieved documents:\n", "\n", "$$DCG_p=\\sum_{i=1}^p\\frac{2^{rel_i}-1}{\\log_2(i+1)}$$\n", "\n", "2. Normalize DCG by the ideal DCG (IDCG) to make it comparable across queries:\n", "\n", "$$nDCG_p=\\frac{DCG_p}{IDCG_p}$$\n", "\n", "where $IDCG$ is the maximum possible DCG for the given set of documents, assuming they are perfectly ranked in order of relevance."
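] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before using a library implementation, here is a minimal sketch that computes nDCG directly from the formulas above for a single ranked list of binary relevance labels. The helper names `dcg` and `ndcg_at_k` are our own, and the ideal DCG here is computed from the labels of the retrieved list only; for 0/1 relevance this should agree with scikit-learn's `ndcg_score` used below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def dcg(relevances):\n", "    # DCG = sum of (2^rel_i - 1) / log2(i + 1) over positions i = 1..p\n", "    return sum((2**rel - 1) / np.log2(i + 1) for i, rel in enumerate(relevances, start=1))\n", "\n", "def ndcg_at_k(relevances, k):\n", "    # normalize DCG@k by the ideal DCG@k, i.e. the same labels sorted in decreasing order\n", "    idcg = dcg(sorted(relevances, reverse=True)[:k])\n", "    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0"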
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "pred_hard_encodings = []\n", "for pred, label in zip(results, ground_truth):\n", " pred_hard_encoding = list(np.isin(pred, label).astype(int))\n", " pred_hard_encodings.append(pred_hard_encoding)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nDCG@1: 0.0\n", "nDCG@5: 0.3298163165186628\n", "nDCG@10: 0.5955665344840209\n" ] } ], "source": [ "from sklearn.metrics import ndcg_score\n", "\n", "for i, c in enumerate(cutoffs):\n", " nDCG = ndcg_score(pred_hard_encodings, results, k=c)\n", " print(f\"nDCG@{c}: {nDCG}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Precision \n", "\n", "$$\\textbf{Recall}=\\frac{\\text{True Positives}}{\\text{True Positives}+\\text{False Positive}}$$\n", "\n", "in information retrieval, it's the ratio of relevant documents retrieved to the totoal number of documents retrieved:\n", "\n", "$$\\textbf{Recall}=\\frac{|\\text{\\{Relevant docs\\}}\\cap\\text{\\{Retrieved docs\\}}|}{|\\text{\\{Retrieved docs\\}}|}$$" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def calc_precision(preds, truths, cutoffs):\n", " prec = np.zeros(len(cutoffs))\n", " for text, truth in zip(preds, truths):\n", " for i, c in enumerate(cutoffs):\n", " hits = np.intersect1d(truth, text[:c])\n", " prec[i] += len(hits) / c\n", " prec /= len(preds)\n", " return prec" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision@1: 0.6666666666666666\n", "precision@5: 0.6666666666666666\n", "precision@10: 0.3666666666666667\n" ] } ], "source": [ "precisions = calc_precision(results, ground_truth, cutoffs)\n", "for i, c in enumerate(cutoffs):\n", " print(f\"precision@{c}: {precisions[i]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. MAP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean Average Precision (MAP) measures the effectiveness of a system at returning relevant documents across multiple queries. \n", "\n", "First, Average Precision (AP) evals how well relevant documents are ranked within the retrieved documents. It's computed by averaging the precision values for each position of relevant document in the ranking of all the retrieved documents:\n", "\n", "$$\\textbf{AP}=\\frac{\\sum_{k=1}^{M}\\text{Relevance}(k) \\times \\text{Precision}(k)}{|\\{\\text{Relevant Docs}\\}|}$$\n", "\n", "where \n", "- $M$ is the total number of documents retrieved.\n", "- $\\text{Relevance}(k)$ is a binary value, indicating whether document at position $k$ is relevant (=1) or not (=0).\n", "- $\\text{Precision}(k)$ is the precision when considering only top $k$ retrieved items." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then calculate the average AP across multiple queries to get the MAP:\n", "\n", "$$\\textbf{MAP}=\\frac{1}{N}\\sum_{i=1}^{N}\\text{AP}_i$$\n", "\n", "where\n", "- $N$ is the total number of queries.\n", "- $\\text{AP}_i$ is the average precision of the $i^{th}$ query." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def calc_AP(encoding):\n", " rel = 0\n", " precs = 0.0\n", " for k, hit in enumerate(encoding, start=1):\n", " if hit == 1:\n", " rel += 1\n", " precs += rel/k\n", "\n", " return 0 if rel == 0 else precs/rel" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def calc_MAP(encodings, cutoffs):\n", " res = []\n", " for c in cutoffs:\n", " ap_sum = 0.0\n", " for encoding in encodings:\n", " ap_sum += calc_AP(encoding[:c])\n", " res.append(ap_sum/len(encodings))\n", " \n", " return res" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAP@1: 0.6666666666666666\n", "MAP@5: 0.862962962962963\n", "MAP@10: 0.8074074074074075\n" ] } ], "source": [ "maps = calc_MAP(pred_hard_encodings, cutoffs)\n", "for i, c in enumerate(cutoffs):\n", " print(f\"MAP@{c}: {maps[i]}\")" ] } ], "metadata": { "kernelspec": { "display_name": "test", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 2 }