{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# C-MTEB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "C-MTEB is the largest benchmark for Chinese text embeddings, similar to MTEB. In this tutorial, we will go through how to evaluate an embedding model's ability on Chinese tasks in C-MTEB." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Installation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First install dependent packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install FlagEmbedding mteb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "C-MTEB uses similar task splits and metrics as English MTEB. It contains 35 datasets in 6 different tasks: Classification, Clustering, Pair Classification, Reranking, Retrieval, and Semantic Textual Similarity (STS). \n", "\n", "1. **Classification**: Use the embeddings to train a logistic regression on the train set and is scored on the test set. F1 is the main metric.\n", "2. **Clustering**: Train a mini-batch k-means model with batch size 32 and k equals to the number of different labels. Then score using v-measure.\n", "3. **Pair Classification**: A pair of text inputs is provided and a label which is a binary variable needs to be assigned. The main metric is average precision score.\n", "4. **Reranking**: Rank a list of relevant and irrelevant reference texts according to a query. Metrics are mean MRR@k and MAP.\n", "5. **Retrieval**: Each dataset comprises corpus, queries, and a mapping that links each query to its relevant documents within the corpus. The goal is to retrieve relevant documents for each query. The main metric is nDCG@k. MTEB directly adopts BEIR for the retrieval task.\n", "6. **Semantic Textual Similarity (STS)**: Determine the similarity between each sentence pair. Spearman correlation based on cosine\n", "similarity serves as the main metric.\n", "\n", "\n", "Check the [HF page](https://huggingface.co/C-MTEB) for the details of each dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ChineseTaskList = [\n", " 'TNews', 'IFlyTek', 'MultilingualSentiment', 'JDReview', 'OnlineShopping', 'Waimai',\n", " 'CLSClusteringS2S.v2', 'CLSClusteringP2P.v2', 'ThuNewsClusteringS2S.v2', 'ThuNewsClusteringP2P.v2',\n", " 'Ocnli', 'Cmnli',\n", " 'T2Reranking', 'MMarcoReranking', 'CMedQAv1-reranking', 'CMedQAv2-reranking',\n", " 'T2Retrieval', 'MMarcoRetrieval', 'DuRetrieval', 'CovidRetrieval', 'CmedqaRetrieval', 'EcomRetrieval', 'MedicalRetrieval', 'VideoRetrieval',\n", " 'ATEC', 'BQ', 'LCQMC', 'PAWSX', 'STSB', 'AFQMC', 'QBQTC'\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, load the model for evaluation. Note that the instruction here is used for retreival tasks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ...C_MTEB.flag_dres_model import FlagDRESModel\n", "\n", "instruction = \"为这个句子生成表示以用于检索相关文章:\"\n", "model_name = \"BAAI/bge-base-zh-v1.5\"\n", "\n", "model = FlagDRESModel(model_name_or_path=\"BAAI/bge-base-zh-v1.5\",\n", " query_instruction_for_retrieval=instruction,\n", " pooling_method=\"cls\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Otherwise, you can load a model using sentence_transformers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "\n", "model = SentenceTransformer(\"PATH_TO_MODEL\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or implement a class following the structure below:\n", "\n", "```python\n", "class MyModel():\n", " def __init__(self):\n", " \"\"\"initialize the tokenizer and model\"\"\"\n", " pass\n", "\n", " def encode(self, sentences, batch_size=32, **kwargs):\n", " \"\"\" Returns a list of embeddings for the given sentences.\n", " Args:\n", " sentences (`List[str]`): List of sentences to encode\n", " batch_size (`int`): Batch size for the encoding\n", "\n", " Returns:\n", " `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences\n", " \"\"\"\n", " pass\n", "\n", "model = MyModel()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we've prepared the dataset and model, we can start the evaluation. For time efficiency, we highly recommend to use GPU for evaluation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mteb\n", "from mteb import MTEB\n", "\n", "tasks = mteb.get_tasks(ChineseTaskList)\n", "\n", "for task in tasks:\n", " evaluation = MTEB(tasks=[task])\n", " evaluation.run(model, output_folder=f\"zh_results/{model_name.split('/')[-1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Submit to MTEB Leaderboard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the evaluation is done, all the evaluation results should be stored in `zh_results/{model_name}/`.\n", "\n", "Then run the following shell command to create the model_card.md. Change {model_name} and its following to your path." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!!mteb create_meta --results_folder results/{model_name}/ --output_path model_card.md" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy and paste the contents of model_card.md to the top of README.md of your model on HF Hub. Then goto the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and choose the Chinese leaderboard to find your model! It will appear soon after the website's daily refresh." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }