MLDR
MLDR is a Multilingual Long-Document Retrieval dataset built on Wikipedia, Wudao and mC4, covering 13 typologically diverse languages. Specifically, we sample lengthy articles from the Wikipedia, Wudao and mC4 datasets and randomly choose paragraphs from them. We then use GPT-3.5 to generate questions based on these paragraphs. Each generated question and the article it was sampled from form a new text pair in the dataset.
An example from the train set looks like:
{
    'query_id': 'q-zh-<...>',
    'query': '...',
    'positive_passages': [
        {
            'docid': 'doc-zh-<...>',
            'text': '...'
        }
    ],
    'negative_passages': [
        {
            'docid': 'doc-zh-<...>',
            'text': '...'
        },
        ...
    ]
}
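A train record in this shape can be expanded into (query, positive, negative) triplets, which is a common input format for contrastive training. The following is a minimal sketch only: the helper `to_triplets` and the sample record values are hypothetical and are not part of the dataset or of FlagEmbedding.

```python
from itertools import product


def to_triplets(record):
    """Expand one train record into (query, positive_text, negative_text) triplets."""
    query = record["query"]
    positives = [p["text"] for p in record["positive_passages"]]
    negatives = [n["text"] for n in record["negative_passages"]]
    # Pair every positive passage with every negative passage.
    return [(query, pos, neg) for pos, neg in product(positives, negatives)]


# Placeholder record (made-up ids and text), shaped like the train example above.
record = {
    "query_id": "q-example",
    "query": "example question",
    "positive_passages": [{"docid": "doc-pos", "text": "relevant article"}],
    "negative_passages": [
        {"docid": "doc-neg-1", "text": "unrelated article 1"},
        {"docid": "doc-neg-2", "text": "unrelated article 2"},
    ],
}

triplets = to_triplets(record)
for triplet in triplets:
    print(triplet)
```

With one positive and two negatives, the record yields two triplets.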
An example from the dev and test sets looks like:
{
    'query_id': 'q-zh-<...>',
    'query': '...',
    'positive_passages': [
        {
            'docid': 'doc-zh-<...>',
            'text': '...'
        }
    ],
    'negative_passages': []
}
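Because dev and test records carry only positive passages, they map naturally onto relevance judgments (qrels) for retrieval over the full corpus. The sketch below illustrates this under assumptions: `build_qrels` is a hypothetical helper, the ids are placeholders, and binary relevance (1 for each positive) is assumed rather than taken from the dataset documentation.

```python
def build_qrels(records):
    """Map each query_id to {docid: 1} for its positive passages."""
    qrels = {}
    for record in records:
        relevant = qrels.setdefault(record["query_id"], {})
        for passage in record["positive_passages"]:
            # Assumption: every positive passage gets relevance grade 1.
            relevant[passage["docid"]] = 1
    return qrels


# Placeholder records (made-up ids), shaped like the dev/test example above.
records = [
    {
        "query_id": "q-1",
        "query": "first question",
        "positive_passages": [{"docid": "doc-a", "text": "article a"}],
        "negative_passages": [],
    },
    {
        "query_id": "q-2",
        "query": "second question",
        "positive_passages": [{"docid": "doc-b", "text": "article b"}],
        "negative_passages": [],
    },
]

qrels = build_qrels(records)
print(qrels)
```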
An example from the corpus looks like:
{
    'docid': 'doc-zh-<...>',
    'text': '...'
}
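Corpus entries are keyed by `docid`, so a simple index makes it easy to resolve the documents referenced by queries and to check that every positive passage is present in the corpus. This is an illustrative sketch: `index_corpus` and `missing_positives` are hypothetical helpers, and the ids are placeholders.

```python
def index_corpus(corpus):
    """Map each docid to its text for constant-time lookup."""
    return {doc["docid"]: doc["text"] for doc in corpus}


def missing_positives(records, corpus_index):
    """Return positive docids referenced by queries but absent from the corpus."""
    return sorted(
        passage["docid"]
        for record in records
        for passage in record["positive_passages"]
        if passage["docid"] not in corpus_index
    )


# Placeholder data (made-up ids), shaped like the examples above.
corpus = [
    {"docid": "doc-a", "text": "article a"},
    {"docid": "doc-b", "text": "article b"},
]
records = [
    {"query_id": "q-1", "query": "question",
     "positive_passages": [{"docid": "doc-a", "text": "article a"}],
     "negative_passages": []},
    {"query_id": "q-2", "query": "question",
     "positive_passages": [{"docid": "doc-z", "text": "article z"}],
     "negative_passages": []},
]

corpus_index = index_corpus(corpus)
missing = missing_positives(records, corpus_index)
print(missing)
```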
You can evaluate a model's performance on MLDR by running our provided shell script:
chmod +x ./examples/evaluation/mldr/eval_mldr.sh
./examples/evaluation/mldr/eval_mldr.sh
Or by running:
python -m FlagEmbedding.evaluation.mldr \
--eval_name mldr \
--dataset_dir ./mldr/data \
--dataset_names hi \
--splits test \
--corpus_embd_save_dir ./mldr/corpus_embd \
--output_dir ./mldr/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path /root/.cache/huggingface/hub \
--overwrite False \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./mldr/mldr_eval_results.md \
--eval_metrics ndcg_at_10 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir /root/.cache/huggingface/hub \
--embedder_passage_max_length 8192 \
--reranker_max_length 8192
Change the embedder, reranker, devices and cache directory arguments to suit your setup.
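To make the `ndcg_at_10` metric requested in the command above more concrete, here is a minimal sketch of nDCG@k with binary relevance. It follows the standard textbook definition and is not FlagEmbedding's implementation, which may differ in detail; the function name `ndcg_at_k` and the docids are hypothetical.

```python
import math


def ndcg_at_k(ranked_docids, relevant_docids, k=10):
    """Compute nDCG@k for one query, assuming binary relevance."""
    # DCG: each relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, docid in enumerate(ranked_docids[:k])
        if docid in relevant_docids
    )
    # Ideal DCG: all relevant documents ranked first, capped at k.
    ideal_hits = min(len(relevant_docids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Placeholder ranking: the single relevant document is retrieved at position 2.
score = ndcg_at_k(["doc-a", "doc-b", "doc-c"], {"doc-b"})
print(round(score, 3))  # 1 / log2(3), approximately 0.631
```

Since each MLDR query in the examples above has a single positive passage, the score for a query under this definition depends only on the rank at which that passage is retrieved.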