Evaluate the Fine-tuned Model#

In the previous sections, we prepared the dataset and fine-tuned the model. In this section, we will evaluate the fine-tuned model on the test dataset we constructed.

0. Installation#

% pip install -U datasets pytrec_eval FlagEmbedding

1. Load Data#

We first load data from the files we processed.

from datasets import load_dataset

queries = load_dataset("json", data_files="ft_data/test_queries.jsonl")["train"]
corpus = load_dataset("json", data_files="ft_data/corpus.jsonl")["train"]
qrels = load_dataset("json", data_files="ft_data/test_qrels.jsonl")["train"]

queries_text = queries["text"]
# flatten the nested lists of passages in corpus["text"] into a single list
corpus_text = [text for sub in corpus["text"] for text in sub]

# build the relevance judgments as {query_id: {doc_id: relevance}}
qrels_dict = {}
for line in qrels:
    if line['qid'] not in qrels_dict:
        qrels_dict[line['qid']] = {}
    qrels_dict[line['qid']][line['docid']] = line['relevance']

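2. Search#

The evaluation step below calls a search() helper that is not defined in this section. The sketch here is one minimal way to implement it: a brute-force inner-product search with NumPy. It assumes the query and corpus records each carry an "id" field matching the qid/docid values in the qrels, and that each corpus record holds a single passage, so row i of the flattened corpus_text corresponds to corpus["id"][i]. A real setup might use an ANN library such as Faiss instead.

import numpy as np
from tqdm import tqdm

def search(model, queries_text, corpus_text, top_k=100):
    """Return retrieval results as {query_id: {doc_id: score}}, the format
    expected by evaluate_metrics and evaluate_mrr."""
    # embed queries and passages with the given FlagModel
    queries_embeddings = model.encode_queries(queries_text)
    corpus_embeddings = model.encode_corpus(corpus_text)

    # id lookups: assumes an "id" field in test_queries.jsonl and corpus.jsonl,
    # and that each corpus record holds a single passage so that row i of the
    # flattened corpus_text corresponds to corpus["id"][i]
    query_ids = queries["id"]
    doc_ids = corpus["id"]

    results = {}
    # brute-force inner-product search over the whole corpus, 32 queries at a time
    for start in tqdm(range(0, len(queries_embeddings), 32), desc="Searching"):
        batch_scores = queries_embeddings[start:start + 32] @ corpus_embeddings.T
        top_indices = np.argsort(-batch_scores, axis=1)[:, :top_k]
        for row, indices in enumerate(top_indices):
            results[query_ids[start + row]] = {
                doc_ids[int(i)]: float(batch_scores[row, i]) for i in indices
            }
    return results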
3. Evaluation#

from FlagEmbedding.abc.evaluation.utils import evaluate_metrics, evaluate_mrr
from FlagEmbedding import FlagModel

k_values = [10, 100]  # cutoffs for the evaluation metrics

raw_name = "BAAI/bge-large-en-v1.5"                            # original model
finetuned_path = "test_encoder_only_base_bge-large-en-v1.5"    # fine-tuned checkpoint

First, the results for the original model:

raw_model = FlagModel(
    raw_name, 
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)

results = search(raw_model, queries_text, corpus_text)

eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)

for res in eval_res:
    print(res)
print(mrr)
pre tokenize: 100%|██████████| 3/3 [00:00<00:00, 129.75it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 3/3 [00:00<00:00, 11.08it/s]
pre tokenize: 100%|██████████| 28/28 [00:00<00:00, 164.29it/s]
Inference Embeddings: 100%|██████████| 28/28 [00:04<00:00,  6.09it/s]
Searching: 100%|██████████| 22/22 [00:08<00:00,  2.56it/s]
defaultdict(<class 'list'>, {'NDCG@10': 0.70405, 'NDCG@100': 0.73528})
defaultdict(<class 'list'>, {'MAP@10': 0.666, 'MAP@100': 0.67213})
defaultdict(<class 'list'>, {'Recall@10': 0.82286, 'Recall@100': 0.97286})
defaultdict(<class 'list'>, {'P@10': 0.08229, 'P@100': 0.00973})
defaultdict(<class 'list'>, {'MRR@10': 0.666, 'MRR@100': 0.67213})

Then the results for the fine-tuned model:

ft_model = FlagModel(
    finetuned_path, 
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)

results = search(ft_model, queries_text, corpus_text)

eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)

for res in eval_res:
    print(res)
print(mrr)
pre tokenize: 100%|██████████| 3/3 [00:00<00:00, 164.72it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.45it/s]
pre tokenize: 100%|██████████| 28/28 [00:00<00:00, 160.19it/s]
Inference Embeddings: 100%|██████████| 28/28 [00:04<00:00,  6.06it/s]
Searching: 100%|██████████| 22/22 [00:07<00:00,  2.80it/s]
defaultdict(<class 'list'>, {'NDCG@10': 0.84392, 'NDCG@100': 0.85792})
defaultdict(<class 'list'>, {'MAP@10': 0.81562, 'MAP@100': 0.81875})
defaultdict(<class 'list'>, {'Recall@10': 0.93143, 'Recall@100': 0.99429})
defaultdict(<class 'list'>, {'P@10': 0.09314, 'P@100': 0.00994})
defaultdict(<class 'list'>, {'MRR@10': 0.81562, 'MRR@100': 0.81875})

We can see a clear improvement across all the metrics: for example, NDCG@10 rises from 0.704 to 0.844 and MRR@10 from 0.666 to 0.816.