C-MTEB#

C-MTEB is the largest benchmark for Chinese text embeddings, similar to MTEB. In this tutorial, we will go through how to evaluate an embedding model’s ability on Chinese tasks in C-MTEB.

0. Installation#

First install dependent packages:

%pip install FlagEmbedding mteb

1. Datasets#

C-MTEB uses similar task splits and metrics as English MTEB. It contains 35 datasets in 6 different tasks: Classification, Clustering, Pair Classification, Reranking, Retrieval, and Semantic Textual Similarity (STS).

  1. Classification: Use the embeddings to train a logistic regression on the train set and is scored on the test set. F1 is the main metric.

  2. Clustering: Train a mini-batch k-means model with batch size 32 and k equals to the number of different labels. Then score using v-measure.

  3. Pair Classification: A pair of text inputs is provided and a label which is a binary variable needs to be assigned. The main metric is average precision score.

  4. Reranking: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.

  5. Retrieval: Each dataset comprises corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.

  6. Semantic Textual Similarity (STS): Determine the similarity between each sentence pair. Spearman correlation based on cosine similarity serves as the main metric.

Check the HF page for the details of each dataset.

ChineseTaskList = [
    'TNews', 'IFlyTek', 'MultilingualSentiment', 'JDReview', 'OnlineShopping', 'Waimai',
    'CLSClusteringS2S.v2', 'CLSClusteringP2P.v2', 'ThuNewsClusteringS2S.v2', 'ThuNewsClusteringP2P.v2',
    'Ocnli', 'Cmnli',
    'T2Reranking', 'MMarcoReranking', 'CMedQAv1-reranking', 'CMedQAv2-reranking',
    'T2Retrieval', 'MMarcoRetrieval', 'DuRetrieval', 'CovidRetrieval', 'CmedqaRetrieval', 'EcomRetrieval', 'MedicalRetrieval', 'VideoRetrieval',
    'ATEC', 'BQ', 'LCQMC', 'PAWSX', 'STSB', 'AFQMC', 'QBQTC'
]

2. Model#

First, load the model for evaluation. Note that the instruction here is used for retreival tasks.

from ...C_MTEB.flag_dres_model import FlagDRESModel

instruction = "为这个句子生成表示以用于检索相关文章:"
model_name = "BAAI/bge-base-zh-v1.5"

model = FlagDRESModel(model_name_or_path="BAAI/bge-base-zh-v1.5",
                      query_instruction_for_retrieval=instruction,
                      pooling_method="cls")

Otherwise, you can load a model using sentence_transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PATH_TO_MODEL")

Or implement a class following the structure below:

class MyModel():
    def __init__(self):
        """initialize the tokenizer and model"""
        pass

    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()

3. Evaluate#

After we’ve prepared the dataset and model, we can start the evaluation. For time efficiency, we highly recommend to use GPU for evaluation.

import mteb
from mteb import MTEB

tasks = mteb.get_tasks(ChineseTaskList)

for task in tasks:
    evaluation = MTEB(tasks=[task])
    evaluation.run(model, output_folder=f"zh_results/{model_name.split('/')[-1]}")

4. Submit to MTEB Leaderboard#

After the evaluation is done, all the evaluation results should be stored in zh_results/{model_name}/.

Then run the following shell command to create the model_card.md. Change {model_name} and its following to your path.

!!mteb create_meta --results_folder results/{model_name}/ --output_path model_card.md

Copy and paste the contents of model_card.md to the top of README.md of your model on HF Hub. Then goto the MTEB leaderboard and choose the Chinese leaderboard to find your model! It will appear soon after the website’s daily refresh.