dataset loader
- class FlagEmbedding.abc.evaluation.AbsEvalDataLoader(eval_name: str, dataset_dir: str | None = None, cache_dir: str | None = None, token: str | None = None, force_redownload: bool = False)
Base class of data loader for evaluation.
- Parameters:
eval_name (str) – The experiment name of the current evaluation.
dataset_dir (str, optional) – Path to the datasets. Defaults to None.
cache_dir (str, optional) – Path to the HuggingFace cache directory. Defaults to None.
token (str, optional) – HF_TOKEN used to access private datasets/models on HF. Defaults to None.
force_redownload (bool, optional) – If True, re-download the dataset and overwrite the local copy. Defaults to False.
Methods
- AbsEvalDataLoader.available_dataset_names() → List[str]
Returns: List[str]: Available dataset names.
- abstractmethod AbsEvalDataLoader.available_splits(dataset_name: str | None = None) → List[str]
Returns: List[str]: Available splits in the dataset.
- AbsEvalDataLoader.check_dataset_names(dataset_names: str | List[str]) → List[str]
Check the validity of dataset names.
- Parameters:
dataset_names (Union[str, List[str]]) – a dataset name (str) or a list of dataset names (List[str])
- Raises:
ValueError –
- Returns:
List of valid dataset names.
- Return type:
List[str]
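The validation described above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: `check_names` and its `available` argument are hypothetical, and both the error message and the choice to raise on the first unknown name are assumptions.

```python
from typing import List, Union


def check_names(dataset_names: Union[str, List[str]], available: List[str]) -> List[str]:
    """Hypothetical stand-in for check_dataset_names (assumed behavior)."""
    # Accept a single name by wrapping it in a list.
    if isinstance(dataset_names, str):
        dataset_names = [dataset_names]
    # Assumption: any unknown name is an error rather than being skipped.
    for name in dataset_names:
        if name not in available:
            raise ValueError(f"Unknown dataset name: {name!r}. Available: {available}")
    return list(dataset_names)
```

Normalizing a single string to a list up front lets the rest of the function, and every caller, handle only one input shape.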
- AbsEvalDataLoader.check_splits(splits: str | List[str], dataset_name: str | None = None) → List[str]
Check whether the splits are available in the dataset.
- Parameters:
splits (Union[str, List[str]]) – Splits to check.
dataset_name (Optional[str], optional) – Name of dataset to check. Defaults to None.
- Returns:
The available splits.
- Return type:
List[str]
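A minimal sketch of the split check, written as a standalone function. The name `filter_splits` and the `available` argument are hypothetical. Because no exception is documented for this method, the sketch assumes that unavailable splits are dropped rather than raised on.

```python
from typing import List, Union


def filter_splits(splits: Union[str, List[str]], available: List[str]) -> List[str]:
    """Hypothetical stand-in for check_splits (assumed behavior)."""
    # Accept a single split by wrapping it in a list.
    if isinstance(splits, str):
        splits = [splits]
    # Assumption: keep only the requested splits that the dataset provides.
    return [s for s in splits if s in available]
```

For example, requesting `["train", "test"]` from a dataset that only offers `dev` and `test` would, under this assumption, yield `["test"]`.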
- AbsEvalDataLoader.load_corpus(dataset_name: str | None = None) → DatasetDict
Load the corpus from the dataset.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
- Returns:
A dict of corpus with id as key, title and text as value.
- Return type:
datasets.DatasetDict
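For orientation, the return shape described above (id as key, title and text as value) might look like the following mapping. The ids and passages are invented for illustration, a plain dict stands in for `datasets.DatasetDict`, and the `passage_text` helper is hypothetical.

```python
# Hypothetical corpus: document id -> {"title", "text"}.
corpus = {
    "doc-1": {"title": "Dense retrieval", "text": "Encode queries and passages as vectors."},
    "doc-2": {"title": "Reranking", "text": "Score query-passage pairs with a cross-encoder."},
}


def passage_text(doc: dict) -> str:
    """Join title and text into one string (one possible way to build retriever input)."""
    return f"{doc['title']} {doc['text']}".strip()


# Build the per-document strings a retriever might embed.
passages = {doc_id: passage_text(doc) for doc_id, doc in corpus.items()}
```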
- AbsEvalDataLoader.load_qrels(dataset_name: str | None = None, split: str = 'test') → DatasetDict
Load the qrels from the dataset.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – The split to load relevance from. Defaults to 'test'.
- Raises:
ValueError –
- Returns:
A dict of relevance of query and document.
- Return type:
datasets.DatasetDict
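The description above does not spell out the layout of the relevance mapping. A common convention, assumed here, is a nested mapping from query id to document id to relevance label. The data and the `relevant_docs` helper below are hypothetical.

```python
# Hypothetical qrels (assumed layout): query id -> {document id -> relevance label}.
qrels = {
    "q-1": {"doc-1": 1, "doc-2": 0},
    "q-2": {"doc-2": 1},
}


def relevant_docs(qrels: dict, query_id: str) -> list:
    """Return the sorted ids of documents with a positive label for one query."""
    return sorted(doc_id for doc_id, rel in qrels.get(query_id, {}).items() if rel > 0)
```

Under this layout, looking up the positives for a query is a single dictionary access followed by a filter.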
- AbsEvalDataLoader.load_queries(dataset_name: str | None = None, split: str = 'test') → DatasetDict
Load the queries from the dataset.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – The split to load queries from. Defaults to 'test'.
- Raises:
ValueError –
- Returns:
A dict of queries with id as key, query text as value.
- Return type:
datasets.DatasetDict
- AbsEvalDataLoader._load_remote_corpus(dataset_name: str | None = None, save_dir: str | None = None) → DatasetDict
Abstract method to load the corpus from a remote dataset, to be overridden in the child class.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
save_dir (Optional[str], optional) – Path to save the newly downloaded corpus. Defaults to None.
- Raises:
NotImplementedError – Loading remote corpus is not implemented.
- Returns:
A dict of corpus with id as key, title and text as value.
- Return type:
datasets.DatasetDict
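The override pattern described above can be sketched without the library itself. `BaseLoader` and `ToyLoader` are hypothetical stand-ins for the abstract class and a child class, and a plain dict replaces `datasets.DatasetDict`.

```python
from typing import Dict, Optional


class BaseLoader:
    """Hypothetical stand-in for the abstract base class."""

    def _load_remote_corpus(self, dataset_name: Optional[str] = None,
                            save_dir: Optional[str] = None) -> Dict[str, dict]:
        # The base class provides no remote source, so it refuses to load.
        raise NotImplementedError("Loading remote corpus is not implemented.")


class ToyLoader(BaseLoader):
    """Hypothetical child class that overrides the remote hook."""

    def _load_remote_corpus(self, dataset_name: Optional[str] = None,
                            save_dir: Optional[str] = None) -> Dict[str, dict]:
        # A real override would download and parse the dataset here;
        # this toy version returns a fixed in-memory corpus.
        return {"doc-1": {"title": "Example", "text": "An example passage."}}
```

Calling the hook on `BaseLoader` raises `NotImplementedError`, while `ToyLoader` returns its corpus, which is the contract a concrete loader is expected to satisfy.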
- AbsEvalDataLoader._load_remote_qrels(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict
Abstract method to load relevance from a remote dataset, to be overridden in the child class.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.
save_dir (Optional[str], optional) – Path to save the newly downloaded relevance. Defaults to None.
- Raises:
NotImplementedError – Loading remote qrels is not implemented.
- Returns:
A dict of relevance of query and document.
- Return type:
datasets.DatasetDict
- AbsEvalDataLoader._load_remote_queries(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict
Abstract method to load queries from a remote dataset, to be overridden in the child class.
- Parameters:
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.
save_dir (Optional[str], optional) – Path to save the newly downloaded queries. Defaults to None.
- Raises:
NotImplementedError –
- Returns:
A dict of queries with id as key, query text as value.
- Return type:
datasets.DatasetDict
- AbsEvalDataLoader._load_local_corpus(save_dir: str, dataset_name: str | None = None) → DatasetDict
Load corpus from local dataset.
- Parameters:
save_dir (str) – Path to save the loaded corpus.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
- Returns:
A dict of corpus with id as key, title and text as value.
- Return type:
datasets.DatasetDict
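As a rough illustration of local loading, the sketch below reads a corpus from a JSON Lines file. The file name `corpus.jsonl`, the `id`/`title`/`text` fields, and the helper `read_local_corpus` are all assumptions made for this example; the on-disk layout is not specified above.

```python
import json
import os
import tempfile


def read_local_corpus(save_dir: str) -> dict:
    """Hypothetical reader: parse <save_dir>/corpus.jsonl into id -> {title, text}."""
    corpus = {}
    with open(os.path.join(save_dir, "corpus.jsonl"), encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            corpus[row["id"]] = {"title": row.get("title", ""), "text": row["text"]}
    return corpus


# Usage: write a tiny corpus file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "corpus.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": "doc-1", "title": "T", "text": "Body"}) + "\n")
    loaded = read_local_corpus(tmp)
```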
- AbsEvalDataLoader._load_local_qrels(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict
Load relevance from local dataset.
- Parameters:
save_dir (str) – Path to save the loaded relevance.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the local dataset. Defaults to 'test'.
- Raises:
ValueError –
- Returns:
A dict of relevance of query and document.
- Return type:
datasets.DatasetDict
- AbsEvalDataLoader._load_local_queries(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict
Load queries from local dataset.
- Parameters:
save_dir (str) – Path to save the loaded queries.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the local dataset. Defaults to 'test'.
- Raises:
ValueError –
- Returns:
A dict of queries with id as key, query text as value.
- Return type:
datasets.DatasetDict
- AbsEvalDataLoader._download_file(download_url: str, save_dir: str)
Download file from provided URL.
- Parameters:
download_url (str) – Source URL of the file.
save_dir (str) – Path to the directory to save the file.
- Raises:
FileNotFoundError –
- Returns:
The path of the downloaded file.
- Return type:
str
- AbsEvalDataLoader._get_fpath_size(fpath: str) → int
Get the total size of the files in provided path.
- Parameters:
fpath (str) – Path of the files whose total size is computed.
- Returns:
The total size in bytes.
- Return type:
int
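One way to compute such a total is to walk the directory tree and sum the file sizes. The helper name `total_size` is hypothetical, and accepting either a single file or a directory is an assumption of this sketch.

```python
import os
import tempfile


def total_size(fpath: str) -> int:
    """Hypothetical helper: size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(fpath):
        return os.path.getsize(fpath)
    total = 0
    # Visit every subdirectory and add up the size of each regular file.
    for root, _dirs, files in os.walk(fpath):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Usage: two files of 5 and 3 bytes, one of them in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"12345")
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"123")
    size = total_size(tmp)
```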
- AbsEvalDataLoader._download_gz_file(download_url: str, save_dir: str)
Download and decompress the gzip file from the provided URL.
- Parameters:
download_url (str) – Source URL of the gzip file.
save_dir (str) – Path to the directory to save the gzip file.
- Raises:
FileNotFoundError –
- Returns:
The path to the file after unzip.
- Return type:
str
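The decompression half of this step can be sketched with the standard library alone. The download itself is omitted, the helper name `gunzip` is hypothetical, and writing the output next to the archive with the `.gz` suffix removed is an assumption.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(gz_path: str) -> str:
    """Hypothetical helper: decompress a .gz file next to itself and return the new path."""
    if not os.path.isfile(gz_path):
        raise FileNotFoundError(gz_path)
    # Assumption: the output name is the archive name without its .gz suffix.
    out_path = gz_path[:-3] if gz_path.endswith(".gz") else gz_path + ".out"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files are not held in memory
    return out_path


# Usage: create a small gzip file, then decompress it.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "queries.jsonl.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b'{"id": "q-1"}\n')
    out_path = gunzip(gz_path)
    with open(out_path, "rb") as f:
        content = f.read()
```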
- AbsEvalDataLoader._download_zip_file(download_url: str, save_dir: str)
Download and unzip the zip file from provided URL.
- Parameters:
download_url (str) – Source URL of the zip file.
save_dir (str) – Path to the directory to save the zip file.
- Raises:
FileNotFoundError –
- Returns:
The path to the file after unzip.
- Return type:
str
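The extraction half of this step might look like the sketch below, which again leaves out the download. The helper name `unzip` is hypothetical, and returning the extraction directory is an assumption about what "the path to the file after unzip" refers to.

```python
import os
import tempfile
import zipfile


def unzip(zip_path: str, save_dir: str) -> str:
    """Hypothetical helper: extract an archive into save_dir and return that directory."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(save_dir)
    return save_dir


# Usage: build a one-file archive, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("corpus.jsonl", '{"id": "doc-1"}\n')
    out_dir = unzip(zip_path, os.path.join(tmp, "extracted"))
    extracted = sorted(os.listdir(out_dir))
```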