MTEB Leaderboard#
In the last tutorial, we showed how to evaluate an embedding model on a dataset supported by MTEB. In this tutorial, we will go through how to run a full evaluation and compare the results with the MTEB English leaderboard.
Caution: Evaluating on the full English MTEB is very time consuming, even with a GPU. We encourage you to read through the notebook first to get an idea of the process, and run the experiments when you have enough compute and time.
0. Installation#
Install the packages we will use in your environment:
%%capture
%pip install sentence_transformers mteb
1. Run the Evaluation#
The MTEB English leaderboard contains 56 datasets across 7 task types:
Classification: The embeddings of the train set are used to train a logistic regression classifier, which is then scored on the test set. F1 is the main metric.
Clustering: Train a mini-batch k-means model with batch size 32 and k equal to the number of distinct labels, then score with v-measure (see the toy sketch after this list).
Pair Classification: A pair of text inputs is provided and a binary label needs to be assigned to it. The main metric is average precision.
Reranking: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.
Retrieval: Each dataset comprises corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.
Semantic Textual Similarity (STS): Determine the similarity between each sentence pair. Spearman correlation based on cosine similarity serves as the main metric.
Summarization: Only one dataset is used in this task. Machine-generated summaries are scored against human-written summaries by computing the distances between their embeddings. The main metric is also Spearman correlation based on cosine similarity.
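As an illustration of the clustering protocol described above, here is a toy sketch. Random vectors stand in for real sentence embeddings and the labels are made up, so the numbers are meaningless; it only shows the mini-batch k-means plus v-measure scoring recipe.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

# Toy stand-ins: 100 random "embeddings" of dimension 768 with 4 made-up labels.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 768))
labels = rng.integers(0, 4, size=100)

# k equals the number of distinct labels and the batch size is 32, as described above.
kmeans = MiniBatchKMeans(n_clusters=np.unique(labels).size, batch_size=32)
predictions = kmeans.fit_predict(embeddings)

print(v_measure_score(labels, predictions))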
The benchmark is widely accepted by researchers and engineers as a fair way to evaluate and compare the models they train. Now let's take a look at the whole evaluation pipeline.
Import MTEB_MAIN_EN to check all 56 datasets:
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
print(MTEB_MAIN_EN.tasks)
['AmazonCounterfactualClassification', 'AmazonPolarityClassification', 'AmazonReviewsClassification', 'ArguAna', 'ArxivClusteringP2P', 'ArxivClusteringS2S', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P', 'BiorxivClusteringS2S', 'CQADupstackAndroidRetrieval', 'CQADupstackEnglishRetrieval', 'CQADupstackGamingRetrieval', 'CQADupstackGisRetrieval', 'CQADupstackMathematicaRetrieval', 'CQADupstackPhysicsRetrieval', 'CQADupstackProgrammersRetrieval', 'CQADupstackStatsRetrieval', 'CQADupstackTexRetrieval', 'CQADupstackUnixRetrieval', 'CQADupstackWebmastersRetrieval', 'CQADupstackWordpressRetrieval', 'ClimateFEVER', 'DBPedia', 'EmotionClassification', 'FEVER', 'FiQA2018', 'HotpotQA', 'ImdbClassification', 'MSMARCO', 'MTOPDomainClassification', 'MTOPIntentClassification', 'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P', 'MedrxivClusteringS2S', 'MindSmallReranking', 'NFCorpus', 'NQ', 'QuoraRetrieval', 'RedditClustering', 'RedditClusteringP2P', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', 'STS17', 'STS22', 'STSBenchmark', 'SciDocsRR', 'SciFact', 'SprintDuplicateQuestions', 'StackExchangeClustering', 'StackExchangeClusteringP2P', 'StackOverflowDupQuestions', 'SummEval', 'TRECCOVID', 'Touche2020', 'ToxicConversationsClassification', 'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering', 'TwitterSemEval2015', 'TwitterURLCorpus']
Load the model we want to evaluate:
from sentence_transformers import SentenceTransformer
model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(model_name)
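As an optional sanity check (not part of MTEB itself), you can encode a couple of sentences and inspect the embedding shape; normalize_embeddings=True is the usual setting for BGE models.

# Optional sanity check: encode two sentences and inspect the output shape.
sentences = ["MTEB benchmarks embedding models.", "BGE is a family of embedding models."]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 768) for bge-base-en-v1.5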
Alternatively, MTEB provides the popular models on its leaderboard so that their results can be reproduced:
model_name = "BAAI/bge-base-en-v1.5"
model = mteb.get_model(model_name)
Then start to evaluate on each dataset:
for task in MTEB_MAIN_EN.tasks:
    # MSMARCO is evaluated on the dev split; all other datasets use the test split
    eval_splits = ["dev"] if task == "MSMARCO" else ["test"]
    evaluation = mteb.MTEB(
        tasks=[task], task_langs=["en"]
    )  # Remove "en" to run all available languages
    evaluation.run(
        model, output_folder="results", eval_splits=eval_splits
    )
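Once the runs finish, each task's scores are written as JSON under the output folder. The exact file layout and schema depend on your mteb version, so treat the following as a rough sketch: it assumes each result file contains a "scores" field mapping splits to lists of score dicts with a "main_score" entry.

import json
from pathlib import Path

# Rough sketch: collect one main score per saved task result and average them.
# Adjust the glob pattern and JSON keys to the layout your mteb version writes.
main_scores = {}
for path in Path("results").rglob("*.json"):
    data = json.loads(path.read_text())
    scores = data.get("scores") if isinstance(data, dict) else None
    if not isinstance(scores, dict):
        continue  # skip files without per-split scores, e.g. model metadata
    for split_entries in scores.values():
        if split_entries:
            main_scores[path.stem] = split_entries[0]["main_score"]
            break

if main_scores:
    average = sum(main_scores.values()) / len(main_scores)
    print(f"Average main score over {len(main_scores)} tasks: {average:.4f}")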
2. Submit to MTEB Leaderboard#
After the evaluation is done, all the evaluation results should be stored in results/{model_name}/{model_revision}.
Then run the following shell command to create model_card.md. Replace {model_name} and {model_revision} with your own paths.
!mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
If the model already has a README, use the following instead:
# !mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
Copy and paste the contents of model_card.md to the top of the README.md of your model on the HF Hub. Now relax and wait for the daily refresh of the leaderboard. Your model will show up soon!
3. Partially Evaluate#
Note that you don’t need to finish all the tasks to get on to the leaderboard.
For example, suppose you fine-tune a model for clustering and only care about how it performs on clustering, not on the other tasks. Then you can test its performance on just the clustering tasks of MTEB and submit the results to the leaderboard.
TASK_LIST_CLUSTERING = [
"ArxivClusteringP2P",
"ArxivClusteringS2S",
"BiorxivClusteringP2P",
"BiorxivClusteringS2S",
"MedrxivClusteringP2P",
"MedrxivClusteringS2S",
"RedditClustering",
"RedditClusteringP2P",
"StackExchangeClustering",
"StackExchangeClusteringP2P",
"TwentyNewsgroupsClustering",
]
Run the evaluation with only clustering tasks:
evaluation = mteb.MTEB(tasks=TASK_LIST_CLUSTERING)
results = evaluation.run(model, output_folder="results")
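If your installed mteb version provides mteb.get_tasks (newer releases do), you can also build such a task list by filtering on task type instead of hard-coding the names; a minimal sketch:

# Newer mteb versions can filter tasks by type and language programmatically.
clustering_tasks = mteb.get_tasks(task_types=["Clustering"], languages=["eng"])
print(clustering_tasks)
# The returned task objects can also be passed directly to mteb.MTEB(tasks=clustering_tasks).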
Then repeat Step 2 to submit your model. After the leaderboard refresh, you can find your model in the “Clustering” section of the leaderboard.
4. Future Work#
MTEB is working on a new version of the English benchmark, which contains updated and more concise tasks and will make the evaluation process faster.
Please check out their GitHub page for future updates and releases.