Evaluate on BEIR

BEIR (Benchmarking-IR) is a heterogeneous evaluation benchmark for information retrieval. It is designed for evaluating the performance of NLP-based retrieval models and is widely used in research on modern embedding models.

0. Installation

First install the libraries we are using:

% pip install beir FlagEmbedding

1. Evaluate using BEIR

BEIR contains 18 datasets, which can be downloaded from the link. A few of them are private and require appropriate licenses (see the `Public` column below). If you want access to the private datasets, take a look at the BEIR wiki for more information. The information below is collected, and the code adapted, from the BEIR GitHub repo.

Dataset Name	Type	Queries	Documents	Avg. Docs/Q	Public
`msmarco`	`Train` `Dev` `Test`	6,980	8.84M	1.1	Yes
`trec-covid`	`Test`	50	171K	493.5	Yes
`nfcorpus`	`Train` `Dev` `Test`	323	3.6K	38.2	Yes
`bioasq`	`Train` `Test`	500	14.91M	8.05	No
`nq`	`Train` `Test`	3,452	2.68M	1.2	Yes
`hotpotqa`	`Train` `Dev` `Test`	7,405	5.23M	2.0	Yes
`fiqa`	`Train` `Dev` `Test`	648	57K	2.6	Yes
`signal1m`	`Test`	97	2.86M	19.6	No
`trec-news`	`Test`	57	595K	19.6	No
`arguana`	`Test`	1,406	8.67K	1.0	Yes
`webis-touche2020`	`Test`	49	382K	49.2	Yes
`cqadupstack`	`Test`	13,145	457K	1.4	Yes
`quora`	`Dev` `Test`	10,000	523K	1.6	Yes
`dbpedia-entity`	`Dev` `Test`	400	4.63M	38.2	Yes
`scidocs`	`Test`	1,000	25K	4.9	Yes
`fever`	`Train` `Dev` `Test`	6,666	5.42M	1.2	Yes
`climate-fever`	`Test`	1,535	5.42M	3.0	Yes
`scifact`	`Train` `Test`	300	5K	1.1	Yes

1.1 Load Dataset

First prepare the logging setup.

import logging
from beir import LoggingHandler

logging.basicConfig(format='%(message)s',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

We use the arguana dataset for a quick demonstration.

import os
from beir import util

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip"
out_dir = os.path.join(os.getcwd(), "data")
data_path = util.download_and_unzip(url, out_dir)
print(f"Dataset is stored at: {data_path}")

Dataset is stored at: /share/project/xzy/Projects/FlagEmbedding/Tutorials/4_Evaluation/data/arguana
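The other public datasets in the table can probably be fetched the same way. The sketch below assumes they are hosted under the same URL pattern as the arguana archive above; `beir_dataset_url` and `BEIR_BASE_URL` are hypothetical names introduced here, not part of the `beir` package.

```python
# Assumption: every public BEIR dataset is hosted under the same base URL
# as arguana.zip above. `beir_dataset_url` is a hypothetical helper.
BEIR_BASE_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"


def beir_dataset_url(dataset_name: str) -> str:
    """Build the download URL for a dataset name taken from the table."""
    return f"{BEIR_BASE_URL}/{dataset_name}.zip"


print(beir_dataset_url("scifact"))
```

The returned string can be passed to `util.download_and_unzip(url, out_dir)` exactly like the arguana URL.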

from beir.datasets.data_loader import GenericDataLoader

# Reuse the path returned by download_and_unzip instead of hard-coding it
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

2024-11-15 03:54:55,809 - Loading Corpus...

100%|██████████| 8674/8674 [00:00<00:00, 158928.31it/s]

2024-11-15 03:54:55,891 - Loaded 8674 TEST Documents.
2024-11-15 03:54:55,891 - Doc Example: {'text': "You don’t have to be vegetarian to be green. Many special environments have been created by livestock farming – for example chalk down land in England and mountain pastures in many countries. Ending livestock farming would see these areas go back to woodland with a loss of many unique plants and animals. Growing crops can also be very bad for the planet, with fertilisers and pesticides polluting rivers, lakes and seas. Most tropical forests are now cut down for timber, or to allow oil palm trees to be grown in plantations, not to create space for meat production.  British farmer and former editor Simon Farrell also states: “Many vegans and vegetarians rely on one source from the U.N. calculation that livestock generates 18% of global carbon emissions, but this figure contains basic mistakes. It attributes all deforestation from ranching to cattle, rather than logging or development. It also muddles up one-off emissions from deforestation with on-going pollution.”  He also refutes the statement of meat production inefficiency: “Scientists have calculated that globally the ratio between the amounts of useful plant food used to produce meat is about 5 to 1. If you feed animals only food that humans can eat — which is, indeed, largely the case in the Western world — that may be true. But animals also eat food we can't eat, such as grass. So the real conversion figure is 1.4 to 1.” [1] At the same time eating a vegetarian diet may be no more environmentally friendly than a meat based diet if it is not sustainably sourced or uses perishable fruit and vegetables that are flown in from around the world. Eating locally sourced food can has as big an impact as being vegetarian. 
[2]  [1] Tara Kelly, Simon Fairlie: How Eating Meat Can Save the World, 12 October 2010  [2] Lucy Siegle, ‘It is time to become a vegetarian?’ The Observer, 18th May 2008", 'title': 'animals environment general health health general weight philosophy ethics'}
2024-11-15 03:54:55,891 - Loading Queries...
2024-11-15 03:54:55,903 - Loaded 1406 TEST Queries.
2024-11-15 03:54:55,903 - Query Example: Being vegetarian helps the environment  Becoming a vegetarian is an environmentally friendly thing to do. Modern farming is one of the main sources of pollution in our rivers. Beef farming is one of the main causes of deforestation, and as long as people continue to buy fast food in their billions, there will be a financial incentive to continue cutting down trees to make room for cattle. Because of our desire to eat fish, our rivers and seas are being emptied of fish and many species are facing extinction. Energy resources are used up much more greedily by meat farming than my farming cereals, pulses etc. Eating meat and fish not only causes cruelty to animals, it causes serious harm to the environment and to biodiversity. For example consider Meat production related pollution and deforestation  At Toronto’s 1992 Royal Agricultural Winter Fair, Agriculture Canada displayed two contrasting statistics: “it takes four football fields of land (about 1.6 hectares) to feed each Canadian” and “one apple tree produces enough fruit to make 320 pies.” Think about it — a couple of apple trees and a few rows of wheat on a mere fraction of a hectare could produce enough food for one person! [1]  The 2006 U.N. Food and Agriculture Organization (FAO) report concluded that worldwide livestock farming generates 18% of the planet's greenhouse gas emissions — by comparison, all the world's cars, trains, planes and boats account for a combined 13% of greenhouse gas emissions. [2]  As a result of the above point producing meat damages the environment. The demand for meat drives deforestation. Daniel Cesar Avelino of Brazil's Federal Public Prosecution Office says “We know that the single biggest driver of deforestation in the Amazon is cattle.” This clearing of tropical rainforests such as the Amazon for agriculture is estimated to produce 17% of the world's greenhouse gas emissions. 
[3] Not only this but the production of meat takes a lot more energy than it ultimately gives us chicken meat production consumes energy in a 4:1 ratio to protein output; beef cattle production requires an energy input to protein output ratio of 54:1.  The same is true with water use due to the same phenomenon of meat being inefficient to produce in terms of the amount of grain needed to produce the same weight of meat, production requires a lot of water. Water is another scarce resource that we will soon not have enough of in various areas of the globe. Grain-fed beef production takes 100,000 liters of water for every kilogram of food. Raising broiler chickens takes 3,500 liters of water to make a kilogram of meat. In comparison, soybean production uses 2,000 liters for kilogram of food produced; rice, 1,912; wheat, 900; and potatoes, 500 liters. [4] This is while there are areas of the globe that have severe water shortages. With farming using up to 70 times more water than is used for domestic purposes: cooking and washing. A third of the population of the world is already suffering from a shortage of water. [5] Groundwater levels are falling all over the world and rivers are beginning to dry up. Already some of the biggest rivers such as China’s Yellow river do not reach the sea. [6]  With a rising population becoming vegetarian is the only responsible way to eat.  [1] Stephen Leckie, ‘How Meat-centred Eating Patterns Affect Food Security and the Environment’, International development research center  [2] Bryan Walsh, Meat: Making Global Warming Worse, Time magazine, 10 September 2008 .  [3] David Adam, Supermarket suppliers ‘helping to destroy Amazon rainforest’, The Guardian, 21st June 2009.  [4] Roger Segelken, U.S. could feed 800 million people with grain that livestock eat, Cornell Science News, 7th August 1997.  
[5] Fiona Harvey, Water scarcity affects one in three, FT.com, 21st August 2003  [6] Rupert Wingfield-Hayes, Yellow river ‘drying up’, BBC News, 29th July 2004
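To make the three loaded objects easier to picture, here is a toy sketch with the same shape: `corpus` maps document ids to a dict with `title` and `text` (as in the Doc Example above), `queries` maps query ids to query text, and `qrels` maps each query id to its relevant document ids with a relevance label. The ids and texts below are made up for illustration.

```python
# Toy stand-ins with the same shape as the objects returned by
# GenericDataLoader(...).load(); the ids and texts are made up.
corpus = {
    "doc1": {"title": "Diet and environment", "text": "Livestock farming shapes many landscapes."},
    "doc2": {"title": "Water use", "text": "Grain-fed beef needs a lot of water."},
}
queries = {"q1": "Being vegetarian helps the environment"}
qrels = {"q1": {"doc1": 1}}  # query id -> {doc id: relevance label}

# Walk from each query to its relevant documents.
for qid, query in queries.items():
    for doc_id, label in qrels[qid].items():
        print(f"{qid!r} -> {doc_id!r} (relevance {label}): {corpus[doc_id]['title']}")
```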

1.2 Evaluation

Then we load bge-base-en-v1.5 from Hugging Face and evaluate its performance on arguana.

from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES


# Load bge model using Sentence Transformers
model = DRES(models.SentenceBERT("BAAI/bge-base-en-v1.5"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

# Get the searching results
results = retriever.retrieve(corpus, queries)

2024-11-15 04:00:45,253 - Use pytorch device_name: cuda
2024-11-15 04:00:45,254 - Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-15 04:00:48,750 - Encoding Queries...

Batches: 100%|██████████| 11/11 [00:01<00:00,  8.27it/s]

2024-11-15 04:00:50,177 - Sorting Corpus by document length (Longest first)...
2024-11-15 04:00:50,183 - Encoding Corpus in batches... Warning: This might take a while!
2024-11-15 04:00:50,183 - Scoring Function: Cosine Similarity (cos_sim)
2024-11-15 04:00:50,184 - Encoding Batch 1/1...

Batches: 100%|██████████| 68/68 [00:07<00:00,  9.43it/s]
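Conceptually, exact dense search with `cos_sim` scores every query embedding against every document embedding and keeps the best hits per query. The following is a minimal pure-Python sketch of that idea, not BEIR's actual implementation (which works on batched tensors); the 2-dimensional "embeddings" are made up. The output has the same shape as `results`: query id -> {doc id: score}.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_embs, doc_embs, top_k):
    """Score every query against every document and keep the top_k per query."""
    results = {}
    for qid, q_vec in query_embs.items():
        scores = {doc_id: cos_sim(q_vec, d_vec) for doc_id, d_vec in doc_embs.items()}
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        results[qid] = dict(ranked[:top_k])
    return results


# Made-up 2-dimensional "embeddings" for illustration only.
doc_embs = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [1.0, 1.0]}
query_embs = {"q1": [2.0, 0.1]}

results = exact_search(query_embs, doc_embs, top_k=2)
print(results)
```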

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

2024-11-15 04:00:58,514 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]
2024-11-15 04:00:58,514 - For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
2024-11-15 04:00:59,184 - 

2024-11-15 04:00:59,188 - NDCG@1: 0.4075
2024-11-15 04:00:59,188 - NDCG@3: 0.5572
2024-11-15 04:00:59,188 - NDCG@5: 0.5946
2024-11-15 04:00:59,188 - NDCG@10: 0.6361
2024-11-15 04:00:59,188 - NDCG@100: 0.6606
2024-11-15 04:00:59,188 - NDCG@1000: 0.6613
2024-11-15 04:00:59,188 - 

2024-11-15 04:00:59,188 - MAP@1: 0.4075
2024-11-15 04:00:59,188 - MAP@3: 0.5193
2024-11-15 04:00:59,188 - MAP@5: 0.5402
2024-11-15 04:00:59,188 - MAP@10: 0.5577
2024-11-15 04:00:59,188 - MAP@100: 0.5634
2024-11-15 04:00:59,188 - MAP@1000: 0.5635
2024-11-15 04:00:59,188 - 

2024-11-15 04:00:59,188 - Recall@1: 0.4075
2024-11-15 04:00:59,188 - Recall@3: 0.6671
2024-11-15 04:00:59,188 - Recall@5: 0.7575
2024-11-15 04:00:59,188 - Recall@10: 0.8841
2024-11-15 04:00:59,188 - Recall@100: 0.9915
2024-11-15 04:00:59,189 - Recall@1000: 0.9964
2024-11-15 04:00:59,189 - 

2024-11-15 04:00:59,189 - P@1: 0.4075
2024-11-15 04:00:59,189 - P@3: 0.2224
2024-11-15 04:00:59,189 - P@5: 0.1515
2024-11-15 04:00:59,189 - P@10: 0.0884
2024-11-15 04:00:59,189 - P@100: 0.0099
2024-11-15 04:00:59,189 - P@1000: 0.0010
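For intuition about the numbers above, here is a minimal sketch of recall, precision and NDCG at a cutoff k for a single query with binary relevance. BEIR delegates the real computation to `pytrec_eval`; the function names and the example ranking below are hypothetical.

```python
import math


def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the top k that is relevant."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / k


def ndcg_at_k(ranked_doc_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# One query whose single relevant document is ranked third.
ranking = ["doc7", "doc2", "doc5", "doc9"]
relevant = {"doc5"}

print(recall_at_k(ranking, relevant, 3))     # 1.0
print(precision_at_k(ranking, relevant, 3))  # 0.333...
print(ndcg_at_k(ranking, relevant, 3))       # 0.5
```

When each query has about one relevant document, as the table suggests for arguana (Avg. Docs/Q of 1.0), P@k is roughly Recall@k divided by k. This matches the log above: Recall@10 is 0.8841 and P@10 is 0.0884, and all four metrics coincide at k=1 (0.4075).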

2. Evaluate using FlagEmbedding

FlagEmbedding provides its own evaluation pipeline for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in the example folder.

Load the arguments:

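import sys

# The leading "-" is a placeholder for sys.argv[0] (the program name),
# which the argument parser skips. Adjust --cache_path and --devices
# to match your own machine.
arguments = """-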
    --eval_name beir 
    --dataset_dir ./beir/data 
    --dataset_names arguana
    --splits test dev 
    --corpus_embd_save_dir ./beir/corpus_embd 
    --output_dir ./beir/search_results 
    --search_top_k 1000 
    --rerank_top_k 100 
    --cache_path /root/.cache/huggingface/hub 
    --overwrite True 
    --k_values 10 100 
    --eval_output_method markdown 
    --eval_output_path ./beir/beir_eval_results.md 
    --eval_metrics ndcg_at_10 recall_at_100 
    --ignore_identical_ids True 
    --embedder_name_or_path BAAI/bge-base-en-v1.5 
    --embedder_batch_size 1024
    --devices cuda:4
""".replace('\n','')

sys.argv = arguments.split()

Then pass the arguments to HfArgumentParser and run the evaluation.

from transformers import HfArgumentParser

from FlagEmbedding.evaluation.beir import (
    BEIREvalArgs, BEIREvalModelArgs,
    BEIREvalRunner
)


parser = HfArgumentParser((
    BEIREvalArgs,
    BEIREvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: BEIREvalArgs
model_args: BEIREvalModelArgs

runner = BEIREvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()

Split 'dev' not found in the dataset. Removing it from the list.
ignore_identical_ids is set to True. This means that the search results will not contain identical ids. Note: Dataset such as MIRACL should NOT set this to True.
pre tokenize: 100%|██████████| 9/9 [00:00<00:00, 16.19it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 9/9 [00:11<00:00,  1.27s/it]
pre tokenize: 100%|██████████| 2/2 [00:00<00:00, 19.54it/s]
Inference Embeddings: 100%|██████████| 2/2 [00:02<00:00,  1.29s/it]
Searching: 100%|██████████| 44/44 [00:00<00:00, 208.73it/s]
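The `ignore_identical_ids` message refers to dropping hits whose document id is the same as the query id, which matters for datasets where queries and documents share ids. A minimal sketch of such a filter on a results dict follows; it is not FlagEmbedding's actual implementation, and `drop_identical_ids` is a hypothetical name.

```python
def drop_identical_ids(results):
    """Remove hits whose document id equals the query id."""
    return {
        qid: {doc_id: score for doc_id, score in hits.items() if doc_id != qid}
        for qid, hits in results.items()
    }


# Made-up ids and scores for illustration.
results = {"arg-1": {"arg-1": 1.0, "arg-2": 0.83, "arg-3": 0.41}}
print(drop_identical_ids(results))
# {'arg-1': {'arg-2': 0.83, 'arg-3': 0.41}}
```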

Take a look at the results, and use whichever of the two evaluation approaches you prefer!

with open('beir/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())

{
    "arguana-test": {
        "ndcg_at_10": 0.63668,
        "ndcg_at_100": 0.66075,
        "map_at_10": 0.55801,
        "map_at_100": 0.56358,
        "recall_at_10": 0.88549,
        "recall_at_100": 0.99147,
        "precision_at_10": 0.08855,
        "precision_at_100": 0.00991,
        "mrr_at_10": 0.55809,
        "mrr_at_100": 0.56366
    }
}
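If you would rather work with the scores than print the raw file, the JSON can be parsed into nested dicts. The sketch below uses an inline copy of two values from the output above so that it runs without the results file; in practice you would call `json.load` on the opened file instead.

```python
import json

# Inline copy of part of the JSON printed above, so the sketch runs without the file.
raw = """
{
    "arguana-test": {
        "ndcg_at_10": 0.63668,
        "recall_at_100": 0.99147
    }
}
"""

eval_results = json.loads(raw)

# One line per (dataset-split, metric) pair.
for split_name, metrics in eval_results.items():
    for metric_name, value in metrics.items():
        print(f"{split_name:<15}{metric_name:<15}{value:.4f}")
```

The two pipelines agree closely on this run: NDCG@10 is 0.6361 with BEIR and 0.63668 with FlagEmbedding.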