dataset loader

class FlagEmbedding.abc.evaluation.AbsEvalDataLoader(eval_name: str, dataset_dir: str | None = None, cache_dir: str | None = None, token: str | None = None, force_redownload: bool = False)

Base class for evaluation data loaders.

Parameters:

eval_name (str) – The experiment name of the current evaluation.
dataset_dir (str, optional) – Path to the datasets. Defaults to None.
cache_dir (str, optional) – Path to the HuggingFace cache directory. Defaults to None.
token (str, optional) – HF_TOKEN used to access private datasets/models on HF. Defaults to None.
force_redownload (bool, optional) – If True, redownload the dataset and overwrite the local copy. Defaults to False.

Methods

AbsEvalDataLoader.available_dataset_names() → List[str]

Returns:

Available dataset names.

Return type:

List[str]

abstractmethod AbsEvalDataLoader.available_splits(dataset_name: str | None = None) → List[str]

Returns:

Available splits in the dataset.

Return type:

List[str]

AbsEvalDataLoader.check_dataset_names(dataset_names: str | List[str]) → List[str]

Check the validity of dataset names.

Parameters:

dataset_names (Union[str, List[str]]) – A dataset name (str) or a list of dataset names (List[str]).

Raises:

ValueError –

Returns:

List of valid dataset names.

Return type:

List[str]
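The validation described above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the free function, its `available` parameter, and the error message are assumptions made so the example runs without FlagEmbedding installed.

```python
from typing import List, Union


def check_dataset_names(
    dataset_names: Union[str, List[str]],
    available: List[str],  # hypothetical: stands in for available_dataset_names()
) -> List[str]:
    """Normalize the input to a list and reject unknown dataset names."""
    # A single name is wrapped so callers can pass either form.
    if isinstance(dataset_names, str):
        dataset_names = [dataset_names]
    for name in dataset_names:
        if name not in available:
            # Assumed message; the documentation only states that ValueError is raised.
            raise ValueError(f"Dataset name '{name}' not found. Available: {available}")
    return dataset_names


print(check_dataset_names("nq", ["nq", "hotpotqa"]))
```

Accepting both a string and a list keeps call sites short when only one dataset is evaluated.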

AbsEvalDataLoader.check_splits(splits: str | List[str], dataset_name: str | None = None) → List[str]

Check whether the splits are available in the dataset.

Parameters:

splits (Union[str, List[str]]) – Splits to check.
dataset_name (Optional[str], optional) – Name of dataset to check. Defaults to None.

Returns:

The available splits.

Return type:

List[str]
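One plausible reading of this method is a filter that keeps only the requested splits that exist. The sketch below assumes unknown splits are skipped with a warning rather than raising; that behavior, the free function, and the `available` parameter are assumptions, not confirmed by this page.

```python
import logging
from typing import List, Union

logger = logging.getLogger(__name__)


def check_splits(
    splits: Union[str, List[str]],
    available: List[str],  # hypothetical: stands in for available_splits(dataset_name)
) -> List[str]:
    """Return the requested splits that are actually available."""
    if isinstance(splits, str):
        splits = [splits]
    checked = []
    for split in splits:
        if split in available:
            checked.append(split)
        else:
            # Assumed behavior: warn and skip instead of failing.
            logger.warning("Split '%s' not found; skipping.", split)
    return checked


print(check_splits(["test", "train"], ["dev", "test"]))
```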

AbsEvalDataLoader.load_corpus(dataset_name: str | None = None) → DatasetDict

Load the corpus from the dataset.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader.load_qrels(dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load the qrels from the dataset.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – The split to load relevance from. Defaults to 'test'.

Raises:

ValueError –

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict

AbsEvalDataLoader.load_queries(dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load the queries from the dataset.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – The split to load queries from. Defaults to 'test'.

Raises:

ValueError –

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict
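To make the three return shapes concrete, here is a small sketch using plain dicts in place of `datasets.DatasetDict`. The corpus and query layouts follow the descriptions above; the nested qrels layout (query id, then document id, then relevance score), the sample records, and the `relevant_docs` helper are assumptions for illustration only.

```python
# Corpus: document id -> title and text.
corpus = {
    "doc-1": {"title": "Paris", "text": "Paris is the capital of France."},
    "doc-2": {"title": "Berlin", "text": "Berlin is the capital of Germany."},
}

# Queries: query id -> query text.
queries = {"q-1": "What is the capital of France?"}

# Qrels (assumed layout): query id -> {document id -> relevance score}.
qrels = {"q-1": {"doc-1": 1, "doc-2": 0}}


def relevant_docs(qrels: dict, query_id: str) -> list:
    """Return ids of documents judged relevant (score > 0) for a query."""
    return [doc_id for doc_id, score in qrels.get(query_id, {}).items() if score > 0]


for query_id, text in queries.items():
    titles = [corpus[doc_id]["title"] for doc_id in relevant_docs(qrels, query_id)]
    print(query_id, text, titles)
```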

AbsEvalDataLoader._load_remote_corpus(dataset_name: str | None = None, save_dir: str | None = None) → DatasetDict

Abstract method to load the corpus from a remote dataset, to be overridden in child classes.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
save_dir (Optional[str], optional) – Path to save the new downloaded corpus. Defaults to None.

Raises:

NotImplementedError – Loading remote corpus is not implemented.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_remote_qrels(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict

Abstract method to load relevance from a remote dataset, to be overridden in child classes.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.
save_dir (Optional[str], optional) – Path to save the new downloaded relevance. Defaults to None.

Raises:

NotImplementedError – Loading remote qrels is not implemented.

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_remote_queries(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict

Abstract method to load queries from a remote dataset, to be overridden in child classes.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.
save_dir (Optional[str], optional) – Path to save the new downloaded queries. Defaults to None.

Raises:

NotImplementedError –

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict
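The three `_load_remote_*` hooks above share one pattern: the base class raises NotImplementedError and a child class overrides the hook. The sketch below shows that pattern with a hypothetical stand-in base class (`BaseLoader`) so it runs without FlagEmbedding; it is not the library's code, and it returns a plain dict rather than a `datasets.DatasetDict`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class BaseLoader(ABC):
    """Hypothetical stand-in for the documented base class."""

    @abstractmethod
    def available_splits(self, dataset_name: Optional[str] = None) -> List[str]:
        """Child classes must say which splits exist."""

    def _load_remote_corpus(
        self, dataset_name: Optional[str] = None, save_dir: Optional[str] = None
    ) -> Dict[str, dict]:
        # Not declared abstract: it fails only when actually called.
        raise NotImplementedError("Loading remote corpus is not implemented.")


class ToyLoader(BaseLoader):
    """Overrides the hook; a real loader would download data here."""

    def available_splits(self, dataset_name=None):
        return ["test"]

    def _load_remote_corpus(self, dataset_name=None, save_dir=None):
        return {"doc-1": {"title": "Paris", "text": "Paris is the capital of France."}}


class IncompleteLoader(BaseLoader):
    """Implements only the abstract method, so the remote hook still raises."""

    def available_splits(self, dataset_name=None):
        return ["test"]


print(ToyLoader()._load_remote_corpus())
try:
    IncompleteLoader()._load_remote_corpus()
except NotImplementedError as err:
    print("error:", err)
```

Raising at call time rather than marking the hook abstract lets a loader that only reads local data skip the remote hooks entirely.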

AbsEvalDataLoader._load_local_corpus(save_dir: str, dataset_name: str | None = None) → DatasetDict

Load corpus from local dataset.

Parameters:

save_dir (str) – Path to save the loaded corpus.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_local_qrels(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load relevance from local dataset.

Parameters:

save_dir (str) – Path to save the loaded relevance.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the local dataset. Defaults to 'test'.

Raises:

ValueError –

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_local_queries(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load queries from local dataset.

Parameters:

save_dir (str) – Path to save the loaded queries.
dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.
split (str, optional) – Split to load from the local dataset. Defaults to 'test'.

Raises:

ValueError –

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._download_file(download_url: str, save_dir: str)

Download file from provided URL.

Parameters:

download_url (str) – Source URL of the file.
save_dir (str) – Path to the directory to save the downloaded file.

Raises:

FileNotFoundError –

Returns:

The path of the downloaded file.

Return type:

str

AbsEvalDataLoader._get_fpath_size(fpath: str) → int

Get the total size of the files in provided path.

Parameters:

fpath (str) – Path of files to compute the size.

Returns:

The total size in bytes.

Return type:

int
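A size computation like this can be done with the standard library alone. The sketch below is an assumption about how such a helper might work (a single file returns its own size, a directory is walked recursively); the function name and the sample files are illustrative.

```python
import os
import tempfile


def get_fpath_size(fpath: str) -> int:
    """Return the size in bytes of a file, or of all files under a directory."""
    if not os.path.isdir(fpath):
        return os.path.getsize(fpath)
    total = 0
    for root, _dirs, files in os.walk(fpath):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "wb") as f:
        f.write(b"hello")  # 5 bytes
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "b.txt"), "wb") as f:
        f.write(b"world!")  # 6 bytes
    print(get_fpath_size(tmp))  # prints 11
```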

AbsEvalDataLoader._download_gz_file(download_url: str, save_dir: str)

Download and unzip the gzip file from provided URL.

Parameters:

download_url (str) – Source URL of the gzip file.
save_dir (str) – Path to the directory to save the gzip file.

Raises:

FileNotFoundError –

Returns:

The path to the file after unzip.

Return type:

str
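The decompression half of this step can be sketched with the standard `gzip` module. The download itself is left out so the example runs offline; the helper name `gunzip_file`, the output naming rule (strip the `.gz` suffix), and the sample file are assumptions, not the library's behavior.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(gz_path: str, save_dir: str) -> str:
    """Decompress gz_path into save_dir and return the decompressed file path."""
    if not os.path.isfile(gz_path):
        raise FileNotFoundError(f"File not found: {gz_path}")
    os.makedirs(save_dir, exist_ok=True)
    # Assumed naming rule: "corpus.jsonl.gz" becomes "corpus.jsonl".
    out_name = os.path.basename(gz_path)
    if out_name.endswith(".gz"):
        out_name = out_name[:-3]
    out_path = os.path.join(save_dir, out_name)
    # Stream the data so large files are not held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "corpus.jsonl.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b'{"id": "doc-1"}\n')
    print(os.path.basename(gunzip_file(gz_path, os.path.join(tmp, "out"))))
```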

AbsEvalDataLoader._download_zip_file(download_url: str, save_dir: str)

Download and unzip the zip file from provided URL.

Parameters:

download_url (str) – Source URL of the zip file.
save_dir (str) – Path to the directory to save the zip file.

Raises:

FileNotFoundError –

Returns:

The path to the file after unzip.

Return type:

str
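The extraction half of this step can be sketched with the standard `zipfile` module. As before, the download is omitted so the example runs offline; the helper name `unzip_file` and the choice to extract into a folder named after the archive are assumptions for illustration.

```python
import os
import tempfile
import zipfile


def unzip_file(zip_path: str, save_dir: str) -> str:
    """Extract zip_path under save_dir and return the extraction directory."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"File not found: {zip_path}")
    # Assumed layout: "data.zip" is extracted into "<save_dir>/data".
    archive_name = os.path.splitext(os.path.basename(zip_path))[0]
    out_dir = os.path.join(save_dir, archive_name)
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("queries.jsonl", '{"id": "q-1"}\n')
    out_dir = unzip_file(zip_path, tmp)
    print(sorted(os.listdir(out_dir)))
```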
