Evaluate Reranker#

Rerankers usually capture the latent semantic relationship between sentences better than embedding models. But unlike an embedding model, which encodes each text once, a cross-encoder reranker must score every query-document pair, which takes quadratic $O(N^2)$ running time over the whole dataset. Thus the most common use case of rerankers in information retrieval or RAG is reranking the top-k candidates retrieved according to embedding similarities.

Evaluating a reranker follows a similar idea: we compare how much the rerankers improve the ranking of candidates retrieved by the same embedder. In this tutorial, we will evaluate the performance of two rerankers on the BEIR benchmark, with bge-large-en-v1.5 as the base embedding model.

Note: We highly recommend running this notebook with a GPU, since the whole pipeline is very time-consuming. For simplicity, we only use a single BEIR task, FiQA.

0. Installation#

First, install the required dependency:

%pip install FlagEmbedding
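
With the dependency installed, we can make the retrieve-then-rerank pattern from the introduction concrete. Below is a minimal sketch using FlagEmbedding's FlagModel and FlagReranker; the toy corpus, query, and k value are made-up placeholders for illustration, not part of the evaluation that follows:

import numpy as np
from FlagEmbedding import FlagModel, FlagReranker

# toy data for illustration only
corpus = [
    "The giant panda is a bear species endemic to China.",
    "Stocks fell sharply after the central bank's announcement.",
    "Index funds passively track a market index.",
]
query = "What is a good passive investment strategy?"

# stage 1: dense retrieval -- encode once, then score all N documents linearly
embedder = FlagModel(
    "BAAI/bge-large-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,
)
q_emb = embedder.encode_queries([query])
c_emb = embedder.encode(corpus)
sims = (q_emb @ c_emb.T)[0]
top_k = np.argsort(sims)[::-1][:2]  # k=2 just for this toy example

# stage 2: rerank only the k retrieved candidates with the cross-encoder
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)
scores = reranker.compute_score([[query, corpus[i]] for i in top_k])
print(sorted(zip(scores, [corpus[i] for i in top_k]), reverse=True))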

1. bge-reranker-large#

The first model is bge-reranker-large, a BERT-like reranker with about 560M parameters.
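
Before launching the full evaluation, the model can be sanity-checked directly with FlagReranker, following the standard usage from its model card (the query/passage pair is a made-up example):

from FlagEmbedding import FlagReranker

# use_fp16 speeds up inference with a negligible effect on accuracy
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# compute_score returns a relevance logit; higher means more relevant
score = reranker.compute_score(["what is panda?", "The giant panda is a bear species endemic to China."])
print(score)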

We can use FlagEmbedding's evaluation pipeline to run the whole process directly. The embedder first retrieves the top 1000 candidates for each query (--search_top_k 1000), and the reranker then reorders the top 100 of them (--rerank_top_k 100) before the metrics are computed:

%%bash
python -m FlagEmbedding.evaluation.beir \
--eval_name beir \
--dataset_dir ./beir/data \
--dataset_names fiqa \
--splits test dev \
--corpus_embd_save_dir ./beir/corpus_embd \
--output_dir ./beir/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path /root/.cache/huggingface/hub \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./beir/beir_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--ignore_identical_ids True \
--embedder_name_or_path BAAI/bge-large-en-v1.5 \
--reranker_name_or_path BAAI/bge-reranker-large \
--embedder_batch_size 1024 \
--reranker_batch_size 1024 \
--devices cuda:0
Split 'dev' not found in the dataset. Removing it from the list.
ignore_identical_ids is set to True. This means that the search results will not contain identical ids. Note: Dataset such as MIRACL should NOT set this to True.
pre tokenize: 100%|██████████| 57/57 [00:03<00:00, 14.68it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/share/project/xzy/Envs/ft/lib/python3.11/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
Inference Embeddings: 100%|██████████| 57/57 [00:44<00:00,  1.28it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 61.59it/s]
Inference Embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.22it/s]
Searching: 100%|██████████| 21/21 [00:00<00:00, 68.26it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████| 64/64 [00:08<00:00,  7.15it/s]
Compute Scores: 100%|██████████| 64/64 [01:39<00:00,  1.56s/it]

2. bge-reranker-v2-m3#

The second model is bge-reranker-v2-m3, a lightweight multilingual reranker built on the bge-m3 backbone. This run uses four GPUs (--devices) and raises --reranker_max_length to 1024; the rest of the configuration matches the previous command.
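
It can likewise be sanity-checked directly before the long run. Per the model card usage, passing normalize=True applies a sigmoid that maps the raw relevance logit into [0, 1] (the example pair is made up):

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

pair = ["what is panda?", "The giant panda is a bear species endemic to China."]
score = reranker.compute_score(pair)                        # raw logit
norm_score = reranker.compute_score(pair, normalize=True)   # sigmoid -> [0, 1]
print(score, norm_score)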

%%bash
python -m FlagEmbedding.evaluation.beir \
--eval_name beir \
--dataset_dir ./beir/data \
--dataset_names fiqa \
--splits test dev \
--corpus_embd_save_dir ./beir/corpus_embd \
--output_dir ./beir/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path /root/.cache/huggingface/hub \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./beir/beir_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--ignore_identical_ids True \
--embedder_name_or_path BAAI/bge-large-en-v1.5 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--embedder_batch_size 1024 \
--reranker_batch_size 1024 \
--devices cuda:0 cuda:1 cuda:2 cuda:3 \
--reranker_max_length 1024
Split 'dev' not found in the dataset. Removing it from the list.
ignore_identical_ids is set to True. This means that the search results will not contain identical ids. Note: Dataset such as MIRACL should NOT set this to True.
initial target device: 100%|██████████| 4/4 [01:14<00:00, 18.51s/it]
pre tokenize: 100%|██████████| 15/15 [00:01<00:00, 11.21it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████| 15/15 [00:01<00:00, 11.32it/s]
pre tokenize: 100%|██████████| 15/15 [00:01<00:00, 10.29it/s]
pre tokenize: 100%|██████████| 15/15 [00:01<00:00, 13.99it/s]
/share/project/xzy/Envs/ft/lib/python3.11/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
Inference Embeddings: 100%|██████████| 15/15 [00:12<00:00,  1.24it/s]
Inference Embeddings: 100%|██████████| 15/15 [00:12<00:00,  1.23it/s]
Inference Embeddings: 100%|██████████| 15/15 [00:12<00:00,  1.22it/s]
Inference Embeddings: 100%|██████████| 15/15 [00:12<00:00,  1.21it/s]
Chunks: 100%|██████████| 4/4 [00:30<00:00,  7.70s/it]
Chunks: 100%|██████████| 4/4 [00:00<00:00, 47.90it/s]
Searching: 100%|██████████| 21/21 [00:00<00:00, 128.34it/s]
initial target device: 100%|██████████| 4/4 [01:09<00:00, 17.43s/it]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████| 16/16 [00:03<00:00,  4.12it/s]
pre tokenize: 100%|██████████| 16/16 [00:04<00:00,  3.78it/s]
pre tokenize: 100%|██████████| 16/16 [00:04<00:00,  3.95it/s]
pre tokenize: 100%|██████████| 16/16 [00:04<00:00,  3.81it/s]
Compute Scores: 100%|██████████| 67/67 [00:29<00:00,  2.30it/s]
Compute Scores: 100%|██████████| 67/67 [00:29<00:00,  2.27it/s]
Compute Scores: 100%|██████████| 67/67 [00:29<00:00,  2.27it/s]
Compute Scores: 100%|██████████| 67/67 [00:30<00:00,  2.19it/s]
Chunks: 100%|██████████| 4/4 [00:51<00:00, 12.97s/it]
/share/project/xzy/Envs/ft/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

3. Comparison#

import json

# load the evaluation results saved by the two runs above
with open('beir/search_results/bge-large-en-v1.5/bge-reranker-large/EVAL/eval_results.json') as f:
    results_1 = json.load(f)
    print(results_1)

with open('beir/search_results/bge-large-en-v1.5/bge-reranker-v2-m3/EVAL/eval_results.json') as f:
    results_2 = json.load(f)
    print(results_2)
{'fiqa-test': {'ndcg_at_10': 0.40991, 'ndcg_at_100': 0.48028, 'map_at_10': 0.32127, 'map_at_100': 0.34227, 'recall_at_10': 0.50963, 'recall_at_100': 0.75987, 'precision_at_10': 0.11821, 'precision_at_100': 0.01932, 'mrr_at_10': 0.47786, 'mrr_at_100': 0.4856}}
{'fiqa-test': {'ndcg_at_10': 0.44828, 'ndcg_at_100': 0.51525, 'map_at_10': 0.36551, 'map_at_100': 0.38578, 'recall_at_10': 0.519, 'recall_at_100': 0.75987, 'precision_at_10': 0.12299, 'precision_at_100': 0.01932, 'mrr_at_10': 0.53382, 'mrr_at_100': 0.54108}}
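
To make the comparison easier to read, a small helper can print the two runs side by side with per-metric differences (it only uses the results_1 and results_2 dicts loaded above):

# tabulate both runs and the delta for each metric
for metric, v1 in results_1['fiqa-test'].items():
    v2 = results_2['fiqa-test'][metric]
    print(f"{metric:>16}  large: {v1:.5f}  v2-m3: {v2:.5f}  diff: {v2 - v1:+.5f}")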

From the results above we can see that bge-reranker-v2-m3 outperforms bge-reranker-large on almost every metric. The two cutoff-100 numbers (recall_at_100 and precision_at_100) are identical because both rerankers reorder the same top-100 candidate set from the embedder, which cannot change which documents fall within the top 100.