Evaluation Metrics

In this tutorial, we’ll cover a list of metrics that are widely used for evaluating an embedding model’s retrieval performance.

0. Preparation

%pip install numpy scikit-learn

Suppose we have a corpus with document ids 0 to 30.

  • ground_truth contains the ids of the documents that are actually relevant to each query.

  • results contains the search results for each query, returned by some retrieval system.

import numpy as np

ground_truth = [
    [11,  1,  7, 17, 21],
    [ 4, 16,  1],
    [26, 10, 22,  8],
]

results = [
    [11,  1, 17,  7, 21,  8,  0, 28,  9, 20],
    [16,  1,  6, 18,  3,  4, 25, 19,  8, 14],
    [24, 10, 26,  2,  8, 28,  4, 23, 13, 21],
]
# check which relevant documents were retrieved for each query
for gt, res in zip(ground_truth, results):
    print(np.intersect1d(gt, res))
[ 1  7 11 17 21]
[ 1  4 16]
[ 8 10 26]
# binary relevance of each retrieved document, in retrieval order
np.array([np.isin(res, gt) for res, gt in zip(results, ground_truth)]).astype(int)
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]])

And we are interested in the following cutoffs:

cutoffs = [1, 5, 10]

We will use this small example throughout the tutorial to show how different metrics evaluate the retrieval system’s quality.

1. Recall

Recall measures the model’s ability to correctly identify the positive instances among all the actual positive samples in the dataset.

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

Written in information retrieval terms, recall is the ratio of relevant documents retrieved to the total number of relevant documents in the corpus. In practice, we usually take the denominator to be the minimum of the cutoff (typically 1, 5, 10, 100, etc.) and the total number of relevant documents for the query:

$$\textbf{Recall}=\frac{|\text{{Relevant docs}}\cap\text{{Retrieved docs}}|}{\text{min}(|\text{{Retrieved docs}}|, |\text{{Relevant docs}}|)}$$
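To make this concrete with the example above: at cutoff 5, the three queries have 5 of 5, 2 of 3, and 3 of 4 of their relevant documents in the top 5 results, so

$$\textbf{Recall@5}=\frac{1}{3}\left(\frac{5}{5}+\frac{2}{3}+\frac{3}{4}\right)\approx 0.806$$

which matches the recall@5 value computed by the code below.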

def calc_recall(preds, truths, cutoffs):
    recalls = np.zeros(len(cutoffs))
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # relevant documents appearing in the top c results
            hits = np.intersect1d(truth, pred[:c])
            recalls[i] += len(hits) / max(min(c, len(truth)), 1)
    # average over all queries
    recalls /= len(preds)
    return recalls
recalls = calc_recall(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"recall@{c}: {recalls[i]}")
recall@1: 0.6666666666666666
recall@5: 0.8055555555555555
recall@10: 0.9166666666666666

2. MRR

Mean Reciprocal Rank (MRR) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.

$$MRR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$

where

  • $|Q|$ is the total number of queries.

  • $rank_i$ is the rank position of the first relevant document of the i-th query.
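To make this concrete with the example data above: the first relevant document appears at rank 1 for the first two queries and at rank 2 (document 10) for the third, so at cutoff 5

$$MRR@5=\frac{1}{3}\left(\frac{1}{1}+\frac{1}{1}+\frac{1}{2}\right)\approx 0.833$$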

def calc_MRR(preds, truth, cutoffs):
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                # reciprocal rank of the first relevant document within the cutoff
                if j < c and p in t:
                    mrr[i] += 1/(j+1)
                    break
    # average over all queries
    mrr = [k/len(preds) for k in mrr]
    return mrr
mrr = calc_MRR(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"MRR@{c}: {mrr[i]}")
MRR@1: 0.6666666666666666
MRR@5: 0.8333333333333334
MRR@10: 0.8333333333333334

3. nDCG

Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the position of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:

  1. Discounted cumulative gain (DCG) measures the ranking quality in retrieval tasks:

$$DCG_p=\sum_{i=1}^p\frac{2^{rel_i}-1}{\log_2(i+1)}$$

  2. Normalize by the ideal DCG to make the score comparable across queries:

$$nDCG_p=\frac{DCG_p}{IDCG_p}$$

where $IDCG_p$ is the maximum possible DCG for the given set of documents, assuming they are perfectly ranked in order of relevance.
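To see the two steps directly in code, here is a minimal sketch that implements the formula above for binary relevance, reusing the results and ground_truth defined earlier; the helper names dcg_at_k and ndcg_at_k are just illustrative.

import numpy as np

def dcg_at_k(relevances, k):
    # DCG@k with the (2^rel - 1) gain and log2 position discount
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return np.sum((2**rel - 1) / discounts)

def ndcg_at_k(retrieved, relevant, k):
    # nDCG@k for a single query with binary relevance
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved]
    idcg = dcg_at_k(sorted(gains, reverse=True), k)  # ideal ordering
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# average nDCG@5 over the three example queries
print(np.mean([ndcg_at_k(res, set(gt), 5) for res, gt in zip(results, ground_truth)]))

With binary relevance the exponential gain $2^{rel_i}-1$ reduces to 1, so this agrees with the scikit-learn computation below, which we use in practice.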

# binary relevance of each retrieved document, in retrieval order
pred_hard_encodings = []
for pred, label in zip(results, ground_truth):
    pred_hard_encoding = list(np.isin(pred, label).astype(int))
    pred_hard_encodings.append(pred_hard_encoding)
from sklearn.metrics import ndcg_score

# scores that decrease with rank, so ndcg_score reproduces the retrieval order
pred_scores = [[len(pred) - i for i in range(len(pred))] for pred in results]

for i, c in enumerate(cutoffs):
    nDCG = ndcg_score(pred_hard_encodings, pred_scores, k=c)
    print(f"nDCG@{c}: {nDCG:.4f}")
nDCG@1: 0.6667
nDCG@5: 0.8259
nDCG@10: 0.8816

4. Precision

Precision measures the proportion of retrieved documents that are actually relevant:

$$\textbf{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$

In information retrieval, it’s the ratio of relevant documents retrieved to the total number of documents retrieved:

$$\textbf{Precision}=\frac{|\text{{Relevant docs}}\cap\text{{Retrieved docs}}|}{|\text{{Retrieved docs}}|}$$
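For the example queries, the top 10 results contain 5, 3, and 3 relevant documents respectively, so

$$\textbf{Precision@10}=\frac{1}{3}\left(\frac{5}{10}+\frac{3}{10}+\frac{3}{10}\right)\approx 0.367$$

which the code below reproduces.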

def calc_precision(preds, truths, cutoffs):
    prec = np.zeros(len(cutoffs))
    for pred, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # relevant documents among the top c results
            hits = np.intersect1d(truth, pred[:c])
            prec[i] += len(hits) / c
    # average over all queries
    prec /= len(preds)
    return prec
precisions = calc_precision(results, ground_truth, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"precision@{c}: {precisions[i]}")
precision@1: 0.6666666666666666
precision@5: 0.6666666666666666
precision@10: 0.3666666666666667

5. MAP

Mean Average Precision (MAP) measures the effectiveness of a system at returning relevant documents across multiple queries.

First, Average Precision (AP) evaluates how well relevant documents are ranked within the retrieved documents. It’s computed by averaging the precision values at each position in the ranking where a relevant document appears:

$$\textbf{AP}=\frac{\sum_{k=1}^{M}\text{Relevance}(k) \times \text{Precision}(k)}{|{\text{Relevant Docs}}|}$$

where

  • $M$ is the total number of documents retrieved.

  • $\text{Relevance}(k)$ is a binary value, indicating whether document at position $k$ is relevant (=1) or not (=0).

  • $\text{Precision}(k)$ is the precision when considering only top $k$ retrieved items.

Then calculate the average AP across multiple queries to get the MAP:

$$\textbf{MAP}=\frac{1}{N}\sum_{i=1}^{N}\text{AP}_i$$

where

  • $N$ is the total number of queries.

  • $\text{AP}_i$ is the average precision of the $i^{th}$ query.
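As a worked example with the data above, take the third query at cutoff 5: its relevant documents appear at ranks 2, 3, and 5, so

$$\textbf{AP@5}=\frac{1}{3}\left(\frac{1}{2}+\frac{2}{3}+\frac{3}{5}\right)\approx 0.589$$

Averaging this with the AP@5 of the other two queries (both equal to 1) gives the MAP@5 value computed below.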

def calc_AP(encoding):
    rel = 0       # number of relevant documents seen so far
    precs = 0.0   # running sum of precision at each relevant position
    for k, hit in enumerate(encoding, start=1):
        if hit == 1:
            rel += 1
            precs += rel/k

    return 0 if rel == 0 else precs/rel
def calc_MAP(encodings, cutoffs):
    res = []
    for c in cutoffs:
        ap_sum = 0.0
        for encoding in encodings:
            # AP of each query, truncated at the current cutoff
            ap_sum += calc_AP(encoding[:c])
        res.append(ap_sum/len(encodings))

    return res
maps = calc_MAP(pred_hard_encodings, cutoffs)
for i, c in enumerate(cutoffs):
    print(f"MAP@{c}: {maps[i]}")
MAP@1: 0.6666666666666666
MAP@5: 0.862962962962963
MAP@10: 0.8074074074074075