MTEB#
For the evaluation of embedding models, MTEB is one of the most well-known benchmarks. In this tutorial, we’ll introduce MTEB, cover its basic usage, and run an evaluation so you can see how your model would perform on the tasks behind the MTEB leaderboard.
0. Installation#
Install the packages we will use in your environment:
%%capture
%pip install sentence_transformers mteb
1. Intro#
The Massive Text Embedding Benchmark (MTEB) is a large-scale evaluation framework designed to assess the performance of text embedding models across a wide variety of natural language processing (NLP) tasks. Introduced to standardize the evaluation of text embeddings, MTEB measures how well these models generalize to real-world applications. It covers a wide range of datasets spanning eight main NLP task types and many languages, and provides a simple pipeline for evaluation.
MTEB is also well known for the MTEB leaderboard, which ranks the latest state-of-the-art embedding models. We’ll cover that in the next tutorial. For now, let’s look at how to use MTEB to run an evaluation easily.
import mteb
from sentence_transformers import SentenceTransformer
With the packages imported, we first load the model that we would like to evaluate:
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)
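As an optional sanity check (not required for the evaluation itself), you can encode a couple of sentences and confirm that the model returns embeddings of the expected dimension (768 for bge-base-en-v1.5):
# Optional sanity check: encode two sentences and inspect the embedding shape.
sentences = [
    "MTEB is a massive benchmark for text embedding models.",
    "BGE is a family of open-source embedding models.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # expected: (2, 768) for bge-base-en-v1.5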
Below is the list of retrieval datasets used by MTEB’s English leaderboard.
For its retrieval section, MTEB directly uses the open-source benchmark BEIR, which contains 15 datasets (note that CQADupstack has 12 subsets).
retrieval_tasks = [
"ArguAna",
"ClimateFEVER",
"CQADupstackAndroidRetrieval",
"CQADupstackEnglishRetrieval",
"CQADupstackGamingRetrieval",
"CQADupstackGisRetrieval",
"CQADupstackMathematicaRetrieval",
"CQADupstackPhysicsRetrieval",
"CQADupstackProgrammersRetrieval",
"CQADupstackStatsRetrieval",
"CQADupstackTexRetrieval",
"CQADupstackUnixRetrieval",
"CQADupstackWebmastersRetrieval",
"CQADupstackWordpressRetrieval",
"DBPedia",
"FEVER",
"FiQA2018",
"HotpotQA",
"MSMARCO",
"NFCorpus",
"NQ",
"QuoraRetrieval",
"SCIDOCS",
"SciFact",
"Touche2020",
"TRECCOVID",
]
For demonstration, let’s just run the first one, “ArguAna”.
For a full list of tasks and languages that MTEB supports, check the tasks page in the MTEB documentation.
tasks = mteb.get_tasks(tasks=retrieval_tasks[:1])
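You can also browse what’s available directly from code: mteb can filter tasks by type and language. The snippet below is only an illustrative sketch, and the exact keyword arguments (task_types, languages) may vary between mteb versions:
# Illustrative: list the retrieval tasks available for English in the installed mteb version.
english_retrieval = mteb.get_tasks(task_types=["Retrieval"], languages=["eng"])
print(len(english_retrieval))
print([t.metadata.name for t in english_retrieval][:5])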
Then, initialize an MTEB instance with our chosen tasks and run the evaluation.
# use the tasks we chose to initialize the MTEB instance
evaluation = mteb.MTEB(tasks=tasks)
# call run() with the model and output_folder
results = evaluation.run(model, output_folder="results")
───────────────────────────────────────────────── Selected tasks ─────────────────────────────────────────────────
Retrieval
- ArguAna, s2p
Batches: 100%|██████████| 44/44 [00:41<00:00, 1.06it/s]
Batches: 100%|██████████| 272/272 [03:36<00:00, 1.26it/s]
The results should be stored in {output_folder}/{model_name}/{model_revision}/{task_name}.json.
Opening the JSON file, you should see contents like the following: the evaluation results on “ArguAna”, with various metrics at cutoffs from 1 to 1000.
{
"dataset_revision": "c22ab2a51041ffd869aaddef7af8d8215647e41a",
"evaluation_time": 260.14976954460144,
"kg_co2_emissions": null,
"mteb_version": "1.14.17",
"scores": {
"test": [
{
"hf_subset": "default",
"languages": [
"eng-Latn"
],
"main_score": 0.63616,
"map_at_1": 0.40754,
"map_at_10": 0.55773,
"map_at_100": 0.56344,
"map_at_1000": 0.56347,
"map_at_20": 0.56202,
"map_at_3": 0.51932,
"map_at_5": 0.54023,
"mrr_at_1": 0.4139402560455192,
"mrr_at_10": 0.5603739077423295,
"mrr_at_100": 0.5660817425350153,
"mrr_at_1000": 0.5661121884705748,
"mrr_at_20": 0.564661930998293,
"mrr_at_3": 0.5208629682313899,
"mrr_at_5": 0.5429113323850182,
"nauc_map_at_1000_diff1": 0.15930478114759905,
"nauc_map_at_1000_max": -0.06396189194646361,
"nauc_map_at_1000_std": -0.13168797291549253,
"nauc_map_at_100_diff1": 0.15934819555197366,
"nauc_map_at_100_max": -0.06389635013430676,
"nauc_map_at_100_std": -0.13164524259533786,
"nauc_map_at_10_diff1": 0.16057318234658585,
"nauc_map_at_10_max": -0.060962623117325254,
"nauc_map_at_10_std": -0.1300413865104607,
"nauc_map_at_1_diff1": 0.17346152653542332,
"nauc_map_at_1_max": -0.09705499215630589,
"nauc_map_at_1_std": -0.14726476953035533,
"nauc_map_at_20_diff1": 0.15956349246366208,
"nauc_map_at_20_max": -0.06259296677860492,
"nauc_map_at_20_std": -0.13097093150054095,
"nauc_map_at_3_diff1": 0.15620049317363813,
"nauc_map_at_3_max": -0.06690213479396273,
"nauc_map_at_3_std": -0.13440904793529648,
"nauc_map_at_5_diff1": 0.1557795701081579,
"nauc_map_at_5_max": -0.06255283252590663,
"nauc_map_at_5_std": -0.1355361594910923,
"nauc_mrr_at_1000_diff1": 0.1378988612808882,
"nauc_mrr_at_1000_max": -0.07507962333910836,
"nauc_mrr_at_1000_std": -0.12969109830101241,
"nauc_mrr_at_100_diff1": 0.13794450668758515,
"nauc_mrr_at_100_max": -0.07501290390362861,
"nauc_mrr_at_100_std": -0.12964855554504057,
"nauc_mrr_at_10_diff1": 0.1396047981645623,
"nauc_mrr_at_10_max": -0.07185174301688693,
"nauc_mrr_at_10_std": -0.12807325096717753,
"nauc_mrr_at_1_diff1": 0.15610387932529113,
"nauc_mrr_at_1_max": -0.09824591983546396,
"nauc_mrr_at_1_std": -0.13914318784294258,
"nauc_mrr_at_20_diff1": 0.1382786098284509,
"nauc_mrr_at_20_max": -0.07364476417961506,
"nauc_mrr_at_20_std": -0.12898192060943495,
"nauc_mrr_at_3_diff1": 0.13118224861025093,
"nauc_mrr_at_3_max": -0.08164985279853691,
"nauc_mrr_at_3_std": -0.13241573571401533,
"nauc_mrr_at_5_diff1": 0.1346130730317385,
"nauc_mrr_at_5_max": -0.07404093236468848,
"nauc_mrr_at_5_std": -0.1340775377068567,
"nauc_ndcg_at_1000_diff1": 0.15919987960292029,
"nauc_ndcg_at_1000_max": -0.05457945565481172,
"nauc_ndcg_at_1000_std": -0.12457339152558143,
"nauc_ndcg_at_100_diff1": 0.1604091882521101,
"nauc_ndcg_at_100_max": -0.05281549383775287,
"nauc_ndcg_at_100_std": -0.12347288098914058,
"nauc_ndcg_at_10_diff1": 0.1657018523692905,
"nauc_ndcg_at_10_max": -0.036222943297402846,
"nauc_ndcg_at_10_std": -0.11284619565817842,
"nauc_ndcg_at_1_diff1": 0.17346152653542332,
"nauc_ndcg_at_1_max": -0.09705499215630589,
"nauc_ndcg_at_1_std": -0.14726476953035533,
"nauc_ndcg_at_20_diff1": 0.16231721725673165,
"nauc_ndcg_at_20_max": -0.04147115653921931,
"nauc_ndcg_at_20_std": -0.11598700704312062,
"nauc_ndcg_at_3_diff1": 0.15256475371124711,
"nauc_ndcg_at_3_max": -0.05432154580979357,
"nauc_ndcg_at_3_std": -0.12841084787822227,
"nauc_ndcg_at_5_diff1": 0.15236205846534961,
"nauc_ndcg_at_5_max": -0.04356123278888682,
"nauc_ndcg_at_5_std": -0.12942556865700913,
"nauc_precision_at_1000_diff1": -0.038790629929866066,
"nauc_precision_at_1000_max": 0.3630826341915611,
"nauc_precision_at_1000_std": 0.4772189839676386,
"nauc_precision_at_100_diff1": 0.32118609204433185,
"nauc_precision_at_100_max": 0.4740132817600036,
"nauc_precision_at_100_std": 0.3456396169952022,
"nauc_precision_at_10_diff1": 0.22279659689895104,
"nauc_precision_at_10_max": 0.16823918613191954,
"nauc_precision_at_10_std": 0.0377209694331257,
"nauc_precision_at_1_diff1": 0.17346152653542332,
"nauc_precision_at_1_max": -0.09705499215630589,
"nauc_precision_at_1_std": -0.14726476953035533,
"nauc_precision_at_20_diff1": 0.23025740175221762,
"nauc_precision_at_20_max": 0.2892313928157665,
"nauc_precision_at_20_std": 0.13522755012490692,
"nauc_precision_at_3_diff1": 0.1410889527057097,
"nauc_precision_at_3_max": -0.010771302313530132,
"nauc_precision_at_3_std": -0.10744937823276193,
"nauc_precision_at_5_diff1": 0.14012953903010988,
"nauc_precision_at_5_max": 0.03977485677045894,
"nauc_precision_at_5_std": -0.10292184602358977,
"nauc_recall_at_1000_diff1": -0.03879062992990034,
"nauc_recall_at_1000_max": 0.36308263419153386,
"nauc_recall_at_1000_std": 0.47721898396760526,
"nauc_recall_at_100_diff1": 0.3211860920443005,
"nauc_recall_at_100_max": 0.4740132817599919,
"nauc_recall_at_100_std": 0.345639616995194,
"nauc_recall_at_10_diff1": 0.22279659689895054,
"nauc_recall_at_10_max": 0.16823918613192046,
"nauc_recall_at_10_std": 0.037720969433127145,
"nauc_recall_at_1_diff1": 0.17346152653542332,
"nauc_recall_at_1_max": -0.09705499215630589,
"nauc_recall_at_1_std": -0.14726476953035533,
"nauc_recall_at_20_diff1": 0.23025740175221865,
"nauc_recall_at_20_max": 0.2892313928157675,
"nauc_recall_at_20_std": 0.13522755012490456,
"nauc_recall_at_3_diff1": 0.14108895270570979,
"nauc_recall_at_3_max": -0.010771302313529425,
"nauc_recall_at_3_std": -0.10744937823276134,
"nauc_recall_at_5_diff1": 0.14012953903010958,
"nauc_recall_at_5_max": 0.039774856770459645,
"nauc_recall_at_5_std": -0.10292184602358935,
"ndcg_at_1": 0.40754,
"ndcg_at_10": 0.63616,
"ndcg_at_100": 0.66063,
"ndcg_at_1000": 0.6613,
"ndcg_at_20": 0.65131,
"ndcg_at_3": 0.55717,
"ndcg_at_5": 0.59461,
"precision_at_1": 0.40754,
"precision_at_10": 0.08841,
"precision_at_100": 0.00991,
"precision_at_1000": 0.001,
"precision_at_20": 0.04716,
"precision_at_3": 0.22238,
"precision_at_5": 0.15149,
"recall_at_1": 0.40754,
"recall_at_10": 0.88407,
"recall_at_100": 0.99147,
"recall_at_1000": 0.99644,
"recall_at_20": 0.9431,
"recall_at_3": 0.66714,
"recall_at_5": 0.75747
}
]
},
"task_name": "ArguAna"
}
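If you’d rather read the scores programmatically than open the file by hand, you can load the JSON from the output folder. Below is a minimal sketch assuming the default output layout described above; since the revision directory name depends on the downloaded model, we simply search for the file:
import json
from pathlib import Path

# Locate the ArguAna result file under results/{model_name}/{model_revision}/.
result_file = next(Path("results").glob("**/ArguAna.json"))
with open(result_file) as f:
    result = json.load(f)

# For retrieval tasks, main_score is nDCG@10.
print(result["task_name"], result["scores"]["test"][0]["main_score"])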
Now we’ve successfully run an evaluation using mteb! In the next tutorial, we’ll show how to evaluate your model on all 56 tasks of the English MTEB and compare it with the models on the leaderboard.