Evaluation#

Evaluation is a crucial part of any machine learning task. In this notebook, we will walk through the whole pipeline of evaluating the performance of an embedding model on MS MARCO, using three metrics to measure its retrieval quality.

Step 0: Setup#

Install the dependencies in the environment.

%pip install -U FlagEmbedding faiss-cpu

Step 1: Load Dataset#

First, download the MS MARCO queries and passages from Hugging Face Datasets:

from datasets import load_dataset
import numpy as np

data = load_dataset("namespace-Pt/msmarco", split="dev")

Considering the time cost, we will use a truncated dataset in this tutorial: queries contains the first 100 queries from the dataset, and corpus is formed by the positives of the first 5,000 queries.

queries = np.array(data[:100]["query"])
corpus = sum(data[:5000]["positive"], [])
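
As a quick sanity check, you can print the sizes of the truncated query set and corpus (the exact corpus size depends on how many positives the selected queries have):

print(f"number of queries: {len(queries)}")
print(f"number of corpus passages: {len(corpus)}")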

If you have a GPU and would like to try the full evaluation of MS MARCO, uncomment and run the following cell:

# data = load_dataset("namespace-Pt/msmarco", split="dev")
# queries = np.array(data["query"])

# corpus = load_dataset("namespace-Pt/msmarco-corpus", split="train")

Step 2: Embedding#

Choose the embedding model we would like to evaluate, and encode the corpus into embeddings.

from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus)

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
Inference Embeddings: 100%|██████████| 21/21 [02:10<00:00,  6.22s/it]
shape of the corpus embeddings: (5331, 768)
data type of the embeddings:  float32
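
FlagModel L2-normalizes the output embeddings by default (normalize_embeddings=True), so the inner product of two embeddings is their cosine similarity, which is why we build the index with faiss.METRIC_INNER_PRODUCT in the next step. A quick optional check that the vectors are normalized:

# the norms should all be close to 1 if the embeddings are L2-normalized
print(np.linalg.norm(corpus_embeddings, axis=1)[:5])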

Step 3: Indexing#

We use the index_factory() function to create the Faiss index we want:

  • The first argument dim is the dimension of the vector space, which is 768 in this case since we are using bge-base-en-v1.5.

  • The second argument 'Flat' makes the index perform an exhaustive (brute-force) search.

  • The third argument faiss.METRIC_INNER_PRODUCT tells the index to use inner product as the distance metric.

import faiss

# get the dimension of our embedding vectors; bge-base-en-v1.5 produces 768-dimensional embeddings
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
total number of vectors: 5331
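
A 'Flat' index performs exact search and is fine for a few thousand vectors, but its cost grows linearly with corpus size. If you later index the full MS MARCO corpus, you may prefer an approximate index instead. A minimal sketch (reusing the same corpus_embeddings) with an IVF index:

# approximate alternative: partition the vectors into 256 clusters (cells)
ivf_index = faiss.index_factory(dim, "IVF256,Flat", faiss.METRIC_INNER_PRODUCT)
ivf_index.train(corpus_embeddings)  # IVF indexes must be trained before adding vectors
ivf_index.add(corpus_embeddings)
faiss.extract_index_ivf(ivf_index).nprobe = 16  # number of cells to visit per query

Searching an IVF index trades a little recall for much faster queries on large corpora.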

Since the embedding process is time consuming, it is a good idea to save the index for reproducibility or for later experiments.

Uncomment the following lines to save the index.

# path = "./index.bin"
# faiss.write_index(index, path)

If you already have an index stored in your local directory, you can load it with:

# index = faiss.read_index("./index.bin")

Step 4: Retrieval#

Encode all the queries into embeddings, and collect their corresponding ground truth answers for evaluation.

query_embeddings = model.encode_queries(queries)
ground_truths = [d["positive"] for d in data]
corpus = np.asarray(corpus)

Use the Faiss index to search the top $k$ answers for each query.

from tqdm import tqdm

res_scores, res_ids, res_text = [], [], []
query_size = len(query_embeddings)
batch_size = 256
# the cutoffs we will use during evaluation; set k to the maximum of the cutoffs
cut_offs = [1, 10]
k = max(cut_offs)

for i in tqdm(range(0, query_size, batch_size), desc="Searching"):
    q_embedding = query_embeddings[i: min(i+batch_size, query_size)].astype(np.float32)
    # search the top k answers for each of the queries
    score, idx = index.search(q_embedding, k=k)
    res_scores += list(score)
    res_ids += list(idx)
    res_text += list(corpus[idx])
Searching: 100%|██████████| 1/1 [00:00<00:00, 20.91it/s]
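
Before computing any metrics, it can be helpful to eyeball a result, for example the top-ranked passage for the first query:

print("query:    ", queries[0])
print("top-1 hit:", res_text[0][0][:200])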

Step 5: Evaluate#

5.1 Recall#

Recall measures the model's ability to correctly identify positive instances among all the actual positive samples in the dataset.

$$\textbf{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}}$$

Recall is useful when the cost of false negatives is high. In other words, we are trying to find all objects of the positive class, even if this results in some false positives. This attribute makes recall a useful metric for text retrieval tasks.

def calc_recall(preds, truths, cutoffs):
    recalls = np.zeros(len(cutoffs))
    for text, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            recall = np.intersect1d(truth, text[:c])
            recalls[i] += len(recall) / max(min(c, len(truth)), 1)
    recalls /= len(preds)
    return recalls

recalls = calc_recall(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"recall@{c}: {recalls[i]}")
recall@1: 0.97
recall@10: 1.0
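
To see what calc_recall computes, here is a toy example with one query, two relevant passages ["a", "d"], and ranked predictions ["a", "b", "c"]: recall@1 is 1.0 (the single retrieved passage is relevant) and recall@10 is 0.5 (only one of the two relevant passages was retrieved):

print(calc_recall([["a", "b", "c"]], [["a", "d"]], [1, 10]))  # [1.  0.5]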

5.2 MRR#

Mean Reciprocal Rank (MRR) is a widely used metric in information retrieval to evaluate the effectiveness of a system. It measures the rank position of the first relevant result in a list of search results.

$$MRR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$

where

  • $|Q|$ is the total number of queries.

  • $rank_i$ is the rank position of the first relevant document for the $i$-th query.
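
For example, if the first relevant passages of three queries appear at rank 1, rank 3, and nowhere within the cutoff, then $MRR = \frac{1}{3}\left(1 + \frac{1}{3} + 0\right) \approx 0.44$; queries with no relevant result within the cutoff contribute 0, matching the implementation below.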

def MRR(preds, truth, cutoffs):
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                if j < c and p in t:
                    mrr[i] += 1/(j+1)
                    break
    mrr = [k/len(preds) for k in mrr]
    return mrr
mrr = MRR(res_text, ground_truths, cut_offs)
for i, c in enumerate(cut_offs):
    print(f"MRR@{c}: {mrr[i]}")
MRR@1: 0.97
MRR@10: 0.9825

5.3 nDCG#

Normalized Discounted Cumulative Gain (nDCG) measures the quality of a ranked list of search results by considering both the position of the relevant documents and their graded relevance scores. The calculation of nDCG involves two main steps:

  1. Discounted Cumulative Gain (DCG) measures the ranking quality in retrieval tasks:

$$DCG_p=\sum_{i=1}^p\frac{2^{rel_i}-1}{\log_2(i+1)}$$

  2. Normalize by the ideal DCG to make the score comparable across queries: $$nDCG_p=\frac{DCG_p}{IDCG_p}$$ where $IDCG_p$ is the maximum possible DCG for the given set of documents, assuming they are perfectly ranked in order of relevance. In our case the relevance labels are binary, so $2^{rel_i}-1$ is simply 1 for a relevant passage and 0 otherwise.

from sklearn.metrics import ndcg_score

# binarize relevance: 1 if a retrieved passage is among the query's positives, else 0
pred_hard_encodings = []
for pred, label in zip(res_text, ground_truths):
    pred_hard_encoding = list(np.isin(pred, label).astype(int))
    pred_hard_encodings.append(pred_hard_encoding)

for i, c in enumerate(cut_offs):
    nDCG = ndcg_score(pred_hard_encodings, res_scores, k=c)
    print(f"nDCG@{c}: {nDCG}")
nDCG@1: 0.97
nDCG@10: 0.9869253606521631
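
As a quick intuition check for ndcg_score: with a single query whose only relevant passage is ranked second by the scores, the DCG is $1/\log_2(3)\approx 0.63$ while the ideal DCG is $1$:

# one query, three passages; the relevant one (label 1) gets the second-highest score
print(ndcg_score([[0, 1, 0]], [[0.9, 0.8, 0.7]]))  # ~0.631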

Congrats! You have walked through a full pipeline of evaluating an embedding model. Feel free to play with different datasets and models!