Evaluate the Fine-tuned Model#
In the previous sections, we prepared the dataset and fine-tuned the model. In this section, we will go through how to evaluate the fine-tuned model on the test dataset we constructed.
0. Installation#
% pip install -U datasets pytrec_eval FlagEmbedding
1. Load Data#
We first load data from the files we processed.
from datasets import load_dataset
queries = load_dataset("json", data_files="ft_data/test_queries.jsonl")["train"]
corpus = load_dataset("json", data_files="ft_data/corpus.jsonl")["train"]
qrels = load_dataset("json", data_files="ft_data/test_qrels.jsonl")["train"]
queries_text = queries["text"]
corpus_text = [text for sub in corpus["text"] for text in sub]
qrels_dict = {}
for line in qrels:
    if line['qid'] not in qrels_dict:
        qrels_dict[line['qid']] = {}
    qrels_dict[line['qid']][line['docid']] = line['relevance']
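As a quick sanity check, you can peek at one record from each file. Based on the fields accessed above, query and corpus records are assumed to carry "id" and "text", while qrels records carry "qid", "docid", and "relevance" (the values shown in the comments are illustrative only):
# Inspect one record from each dataset to confirm the assumed schema.
print(queries[0])  # e.g. {"id": ..., "text": ...}
print(corpus[0])   # e.g. {"id": ..., "text": ...}
print(qrels[0])    # e.g. {"qid": ..., "docid": ..., "relevance": ...}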
2. Search#
Then we prepare a function that encodes the queries and the corpus into embeddings and searches for the top results:
import faiss
import numpy as np
from tqdm import tqdm
def search(model, queries_text, corpus_text):
    # encode the queries and the corpus into embeddings
    queries_embeddings = model.encode_queries(queries_text)
    corpus_embeddings = model.encode_corpus(corpus_text)

    # create and store the embeddings in a Faiss index
    dim = corpus_embeddings.shape[-1]
    index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
    corpus_embeddings = corpus_embeddings.astype(np.float32)
    index.train(corpus_embeddings)
    index.add(corpus_embeddings)

    query_size = len(queries_embeddings)
    all_scores = []
    all_indices = []

    # search the top 100 answers for all the queries, in batches of 32
    for i in tqdm(range(0, query_size, 32), desc="Searching"):
        j = min(i + 32, query_size)
        query_embedding = queries_embeddings[i: j]
        score, indice = index.search(query_embedding.astype(np.float32), k=100)
        all_scores.append(score)
        all_indices.append(indice)

    all_scores = np.concatenate(all_scores, axis=0)
    all_indices = np.concatenate(all_indices, axis=0)

    # store the results in the {query id: {doc id: score}} format used for evaluation
    results = {}
    for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
        results[queries["id"][idx]] = {}
        for score, doc_idx in zip(scores, indices):
            if doc_idx != -1:
                results[queries["id"][idx]][corpus["id"][doc_idx]] = float(score)
    return results
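The function returns a nested dictionary mapping each query id to its retrieved document ids and their similarity scores; since FlagModel L2-normalizes embeddings by default, the inner-product scores from the Flat index behave like cosine similarities. Purely as an illustration of the shape of the run (the ids and scores below are made up):
# Illustrative shape of the run produced by search(); ids and scores are placeholders.
example_run = {
    "q1": {"doc3": 0.91, "doc7": 0.75},
    "q2": {"doc1": 0.88, "doc4": 0.52},
}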
3. Evaluation#
from FlagEmbedding.abc.evaluation.utils import evaluate_metrics, evaluate_mrr
from FlagEmbedding import FlagModel
k_values = [10,100]
raw_name = "BAAI/bge-large-en-v1.5"
finetuned_path = "test_encoder_only_base_bge-large-en-v1.5"
First, the results for the original model:
raw_model = FlagModel(
    raw_name,
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)
results = search(raw_model, queries_text, corpus_text)
eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)
for res in eval_res:
    print(res)
print(mrr)
pre tokenize: 100%|██████████| 3/3 [00:00<00:00, 129.75it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 3/3 [00:00<00:00, 11.08it/s]
pre tokenize: 100%|██████████| 28/28 [00:00<00:00, 164.29it/s]
Inference Embeddings: 100%|██████████| 28/28 [00:04<00:00, 6.09it/s]
Searching: 100%|██████████| 22/22 [00:08<00:00, 2.56it/s]
defaultdict(<class 'list'>, {'NDCG@10': 0.70405, 'NDCG@100': 0.73528})
defaultdict(<class 'list'>, {'MAP@10': 0.666, 'MAP@100': 0.67213})
defaultdict(<class 'list'>, {'Recall@10': 0.82286, 'Recall@100': 0.97286})
defaultdict(<class 'list'>, {'P@10': 0.08229, 'P@100': 0.00973})
defaultdict(<class 'list'>, {'MRR@10': 0.666, 'MRR@100': 0.67213})
Then, the results for the fine-tuned model:
ft_model = FlagModel(
    finetuned_path,
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)
results = search(ft_model, queries_text, corpus_text)
eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)
for res in eval_res:
    print(res)
print(mrr)
pre tokenize: 100%|██████████| 3/3 [00:00<00:00, 164.72it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 3/3 [00:00<00:00, 9.45it/s]
pre tokenize: 100%|██████████| 28/28 [00:00<00:00, 160.19it/s]
Inference Embeddings: 100%|██████████| 28/28 [00:04<00:00, 6.06it/s]
Searching: 100%|██████████| 22/22 [00:07<00:00, 2.80it/s]
defaultdict(<class 'list'>, {'NDCG@10': 0.84392, 'NDCG@100': 0.85792})
defaultdict(<class 'list'>, {'MAP@10': 0.81562, 'MAP@100': 0.81875})
defaultdict(<class 'list'>, {'Recall@10': 0.93143, 'Recall@100': 0.99429})
defaultdict(<class 'list'>, {'P@10': 0.09314, 'P@100': 0.00994})
defaultdict(<class 'list'>, {'MRR@10': 0.81562, 'MRR@100': 0.81875})
The fine-tuned model shows a clear improvement on every metric: for example, NDCG@10 rises from 0.704 to 0.844 and Recall@100 from 0.973 to 0.994.
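If you also want per-query numbers in addition to the averages printed above, you can run pytrec_eval (installed in step 0) directly on the same qrels_dict and the latest results run. A minimal sketch, assuming pytrec_eval's measure-name conventions (parametrized measures are requested with a dot, e.g. ndcg_cut.10, and returned with an underscore, e.g. ndcg_cut_10):
import pytrec_eval

# Request NDCG and Recall at cut-offs 10 and 100 for every query.
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {"ndcg_cut.10,100", "recall.10,100"}
)
per_query = evaluator.evaluate(results)

# Averaging the per-query scores should roughly reproduce the aggregates above.
ndcg_10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
recall_100 = sum(q["recall_100"] for q in per_query.values()) / len(per_query)
print(f"NDCG@10: {ndcg_10:.5f}, Recall@100: {recall_100:.5f}")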