AIR-Bench

AIR-Bench (Automated Heterogeneous Information Retrieval Benchmark) is a dynamic, actively updated benchmark for information retrieval. It currently has two versions (AIR-Bench_24.04 and AIR-Bench_24.05). Note that the test data is generated by LLMs without human intervention. This makes it easier and faster to extend the evaluation to new domains, and it also ensures that no model has the test data covered in its training set.

You can evaluate a model's performance on AIR-Bench by running our provided shell script:

chmod +x ./examples/evaluation/air_bench/eval_air_bench.sh
./examples/evaluation/air_bench/eval_air_bench.sh

Or by running:

python -m FlagEmbedding.evaluation.air_bench \
--benchmark_version AIR-Bench_24.05 \
--task_types qa long-doc \
--domains arxiv \
--languages en \
--splits dev test \
--output_dir ./air_bench/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_dir /root/.cache/huggingface/hub \
--overwrite False \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--model_cache_dir /root/.cache/huggingface/hub \
--reranker_max_length 1024

Change the embedder, reranker, devices, and cache directories to fit your setup. With the settings above, the embedder first retrieves the top 1000 candidates for each query (--search_top_k), and the reranker then rescores the top 100 of those candidates (--rerank_top_k).
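
If you prefer to launch the evaluation from a Python script instead of the command line, the sketch below shows one way to do it. It assumes the module exposes AIRBenchEvalArgs, AIRBenchEvalModelArgs, and AIRBenchEvalRunner, following the pattern of FlagEmbedding's other evaluation modules; check your installed version if the names differ.

from transformers import HfArgumentParser

from FlagEmbedding.evaluation.air_bench import (
    AIRBenchEvalArgs,       # benchmark/search settings (assumed name)
    AIRBenchEvalModelArgs,  # embedder/reranker settings (assumed name)
    AIRBenchEvalRunner,     # orchestrates retrieval + reranking (assumed name)
)

# Parse the same flags the CLI accepts; passing an explicit list instead of
# reading sys.argv keeps the example self-contained.
parser = HfArgumentParser((AIRBenchEvalArgs, AIRBenchEvalModelArgs))
eval_args, model_args = parser.parse_args_into_dataclasses(args=[
    "--benchmark_version", "AIR-Bench_24.05",
    "--task_types", "qa", "long-doc",
    "--domains", "arxiv",
    "--languages", "en",
    "--splits", "dev", "test",
    "--output_dir", "./air_bench/search_results",
    "--search_top_k", "1000",
    "--rerank_top_k", "100",
    "--embedder_name_or_path", "BAAI/bge-m3",
    "--reranker_name_or_path", "BAAI/bge-reranker-v2-m3",
    "--devices", "cuda:0", "cuda:1",
    "--reranker_max_length", "1024",
])

runner = AIRBenchEvalRunner(eval_args=eval_args, model_args=model_args)
runner.run()

This runs the same retrieve-then-rerank pipeline as the shell command above, writing the search results under the directory given by --output_dir.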