dataset loader

class FlagEmbedding.abc.evaluation.AbsEvalDataLoader(eval_name: str, dataset_dir: str | None = None, cache_dir: str | None = None, token: str | None = None, force_redownload: bool = False)

Base class of data loader for evaluation.

Parameters:
  • eval_name (str) – The experiment name of the current evaluation.

  • dataset_dir (str, optional) – Path to the datasets. Defaults to None.

  • cache_dir (str, optional) – Path to the Hugging Face cache directory. Defaults to None.

  • token (str, optional) – Hugging Face token (HF_TOKEN) used to access private datasets and models. Defaults to None.

  • force_redownload (bool, optional) – If True, re-download the dataset and overwrite the local copy. Defaults to False.

Methods

AbsEvalDataLoader.available_dataset_names() → List[str]

Returns: List[str]: Available dataset names.

abstractmethod AbsEvalDataLoader.available_splits(dataset_name: str | None = None) → List[str]

Returns: List[str]: Available splits in the dataset.

AbsEvalDataLoader.check_dataset_names(dataset_names: str | List[str]) → List[str]

Check the validity of dataset names.

Parameters:

dataset_names (Union[str, List[str]]) – A dataset name (str) or a list of dataset names (List[str]).

Raises:

ValueError

Returns:

List of valid dataset names.

Return type:

List[str]
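
The validation described above can be sketched as follows. This is not the library's implementation: it assumes that a single name is wrapped into a list and that any name missing from the available list triggers the ValueError. The helper name check_names and its available argument are hypothetical.

```python
from typing import List, Union


def check_names(names: Union[str, List[str]], available: List[str]) -> List[str]:
    """Hypothetical stand-in for check_dataset_names."""
    # Accept a single name as well as a list of names.
    if isinstance(names, str):
        names = [names]
    # Assumption: an unknown name is an error rather than being skipped.
    for name in names:
        if name not in available:
            raise ValueError(f"Dataset name '{name}' not found; available: {available}")
    return list(names)


valid = check_names("nq", available=["nq", "hotpotqa"])
```

Normalizing to a list first lets callers pass either form while the rest of the code handles only one type.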

AbsEvalDataLoader.check_splits(splits: str | List[str], dataset_name: str | None = None) → List[str]

Check whether the splits are available in the dataset.

Parameters:
  • splits (Union[str, List[str]]) – Splits to check.

  • dataset_name (Optional[str], optional) – Name of dataset to check. Defaults to None.

Returns:

The available splits.

Return type:

List[str]
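
A minimal sketch of this check, under the assumption that requested splits which are not available are dropped instead of raising an error. The function name filter_splits and its available argument are hypothetical and stand in for the result of available_splits().

```python
from typing import List, Union


def filter_splits(splits: Union[str, List[str]], available: List[str]) -> List[str]:
    """Hypothetical stand-in for check_splits."""
    # Accept a single split name as well as a list.
    if isinstance(splits, str):
        splits = [splits]
    # Assumption: unavailable splits are dropped, keeping the requested order.
    return [s for s in splits if s in available]


kept = filter_splits(["dev", "test", "train"], available=["dev", "test"])
```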

AbsEvalDataLoader.load_corpus(dataset_name: str | None = None) → DatasetDict

Load the corpus from the dataset.

Parameters:

dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict
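
The return value is documented as a mapping from document id to title and text. The sketch below builds that shape with a plain dict instead of a datasets.DatasetDict; the JSON Lines input, the field names, and the empty-title fallback are assumptions made for illustration.

```python
import json
from typing import Dict

# Two hypothetical corpus records, one JSON object per line.
raw_lines = [
    '{"id": "d1", "title": "Paris", "text": "Paris is the capital of France."}',
    '{"id": "d2", "text": "A passage without a title."}',
]

corpus: Dict[str, Dict[str, str]] = {}
for line in raw_lines:
    record = json.loads(line)
    # Assumption: a missing title falls back to an empty string.
    corpus[record["id"]] = {"title": record.get("title", ""), "text": record["text"]}
```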

AbsEvalDataLoader.load_qrels(dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load the qrels from the dataset.

Parameters:
  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – The split to load relevance from. Defaults to 'test'.

Raises:

ValueError

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict
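
One plausible layout for such a relevance mapping is a nested dict from query id to document id to relevance score. The sketch below assumes that layout and uses hypothetical judgment rows; the library's actual structure may differ.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical (query id, document id, relevance) judgments.
rows = [("q1", "d1", 1), ("q1", "d7", 0), ("q2", "d3", 2)]

# Assumed layout: query id -> {document id -> relevance score}.
qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
for qid, docid, relevance in rows:
    qrels[qid][docid] = relevance

# Documents judged relevant (score above zero) for one query.
relevant_for_q1 = [docid for docid, rel in qrels["q1"].items() if rel > 0]
```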

AbsEvalDataLoader.load_queries(dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load the queries from the dataset.

Parameters:
  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – The split to load queries from. Defaults to 'test'.

Raises:

ValueError

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_remote_corpus(dataset_name: str | None = None, save_dir: str | None = None) → DatasetDict

Abstract method to load the corpus from a remote dataset, to be overridden in child classes.

Parameters:
  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • save_dir (Optional[str], optional) – Path to save the newly downloaded corpus. Defaults to None.

Raises:

NotImplementedError – Loading remote corpus is not implemented.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_remote_qrels(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict

Abstract method to load relevance from a remote dataset, to be overridden in child classes.

Parameters:
  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.

  • save_dir (Optional[str], optional) – Path to save the newly downloaded relevance. Defaults to None.

Raises:

NotImplementedError – Loading remote qrels is not implemented.

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_remote_queries(dataset_name: str | None = None, split: str = 'test', save_dir: str | None = None) → DatasetDict

Abstract method to load queries from a remote dataset, to be overridden in child classes.

Parameters:
  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – Split to load from the remote dataset. Defaults to 'test'.

  • save_dir (Optional[str], optional) – Path to save the newly downloaded queries. Defaults to None.

Raises:

NotImplementedError

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict
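
The three remote loaders above share one pattern: the base class raises NotImplementedError and a child class supplies the download logic. The sketch below shows that pattern with a stand-in base class so that it runs without FlagEmbedding installed. BaseLoader, ToyLoader, and the in-memory data are hypothetical, and plain dicts replace datasets.DatasetDict.

```python
from typing import Dict, Optional


class BaseLoader:
    """Stand-in for the documented base class (not the real one)."""

    def _load_remote_queries(
        self,
        dataset_name: Optional[str] = None,
        split: str = "test",
        save_dir: Optional[str] = None,
    ) -> Dict[str, str]:
        # The base class offers no remote source of its own.
        raise NotImplementedError("Loading remote queries is not implemented.")


class ToyLoader(BaseLoader):
    """Child class that overrides the remote hook."""

    # Hypothetical in-memory data standing in for a remote download.
    _REMOTE = {"test": {"q1": "what is the capital of France?"}}

    def _load_remote_queries(self, dataset_name=None, split="test", save_dir=None):
        # Return a copy so callers cannot mutate the shared data.
        return dict(self._REMOTE[split])


queries = ToyLoader()._load_remote_queries(split="test")
```

Keeping the signature identical in the child class means callers in the base class can invoke the hook without knowing which subclass is in use.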

AbsEvalDataLoader._load_local_corpus(save_dir: str, dataset_name: str | None = None) → DatasetDict

Load corpus from local dataset.

Parameters:
  • save_dir (str) – Path to the local directory where the corpus is saved.

  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

Returns:

A dict of corpus with id as key, title and text as value.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_local_qrels(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load relevance from local dataset.

Parameters:
  • save_dir (str) – Path to the local directory where the relevance is saved.

  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – Split to load from the local dataset. Defaults to 'test'.

Raises:

ValueError

Returns:

A dict of relevance of query and document.

Return type:

datasets.DatasetDict

AbsEvalDataLoader._load_local_queries(save_dir: str, dataset_name: str | None = None, split: str = 'test') → DatasetDict

Load queries from local dataset.

Parameters:
  • save_dir (str) – Path to the local directory where the queries are saved.

  • dataset_name (Optional[str], optional) – Name of the dataset. Defaults to None.

  • split (str, optional) – Split to load from the local dataset. Defaults to 'test'.

Raises:

ValueError

Returns:

A dict of queries with id as key, query text as value.

Return type:

datasets.DatasetDict
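
A rough sketch of loading queries from a local directory, returning the documented id-to-text mapping as a plain dict. The file name pattern '<split>_queries.jsonl', the 'id' and 'text' fields, the available_splits argument, and the condition that raises ValueError are all assumptions made for illustration, not confirmed library behavior.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_local_queries(save_dir: str, split: str, available_splits: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for _load_local_queries."""
    # Assumption: an unavailable split is what triggers the documented ValueError.
    if split not in available_splits:
        raise ValueError(f"Split '{split}' not found; available: {available_splits}")
    # Assumption: queries live in '<split>_queries.jsonl' with 'id' and 'text' fields.
    path = os.path.join(save_dir, f"{split}_queries.jsonl")
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            queries[record["id"]] = record["text"]
    return queries


# Write one query to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "test_queries.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": "q1", "text": "what is the capital of France?"}) + "\n")
    local_queries = load_local_queries(tmp, "test", available_splits=["test"])
```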

AbsEvalDataLoader._download_file(download_url: str, save_dir: str)

Download a file from the provided URL.

Parameters:
  • download_url (str) – Source URL of the file.

  • save_dir (str) – Path to the directory to save the downloaded file.

Raises:

FileNotFoundError

Returns:

The path of the downloaded file.

Return type:

str
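
A minimal standard-library sketch of such a download helper. It is not the library's code: the function name download_file, the way the local file name is derived from the URL, and the point at which FileNotFoundError is raised are assumptions.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(download_url: str, save_dir: str) -> str:
    """Hypothetical stand-in for _download_file."""
    os.makedirs(save_dir, exist_ok=True)
    # Name the local file after the last URL path component.
    file_name = download_url.rstrip("/").split("/")[-1]
    file_path = os.path.join(save_dir, file_name)
    # Stream the response to disk instead of reading it into memory.
    with urllib.request.urlopen(download_url) as response, open(file_path, "wb") as out:
        shutil.copyfileobj(response, out)
    # Assumption: a missing file after the transfer is reported as FileNotFoundError.
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Failed to download {download_url}")
    return file_path


# Usage with a file:// URL so the example needs no network access.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source" / "data.txt"
    source.parent.mkdir()
    source.write_text("hello")
    saved = download_file(source.as_uri(), os.path.join(tmp, "downloads"))
    saved_name = os.path.basename(saved)
    saved_text = Path(saved).read_text()
```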

AbsEvalDataLoader._get_fpath_size(fpath: str) → int

Get the total size of the files in the provided path.

Parameters:

fpath (str) – Path to the files whose total size is computed.

Returns:

The total size in bytes.

Return type:

int
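
One way to compute such a total with the standard library is sketched below. The function name get_path_size is hypothetical, and handling both a single file and a directory tree is an assumption about the intended behavior.

```python
import os
import tempfile


def get_path_size(fpath: str) -> int:
    """Hypothetical stand-in for _get_fpath_size."""
    # A single file: report its own size.
    if os.path.isfile(fpath):
        return os.path.getsize(fpath)
    # A directory: sum the sizes of all files below it.
    total = 0
    for root, _dirs, files in os.walk(fpath):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Build a small tree holding 5 + 3 bytes and measure it.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"12345")
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"678")
    total_bytes = get_path_size(tmp)
```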

AbsEvalDataLoader._download_gz_file(download_url: str, save_dir: str)

Download and decompress the gzip file from the provided URL.

Parameters:
  • download_url (str) – Source URL of the gzip file.

  • save_dir (str) – Path to the directory to save the gzip file.

Raises:

FileNotFoundError

Returns:

The path to the decompressed file.

Return type:

str
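
The decompression step can be sketched with the standard gzip module. Only the local unpacking is shown, not the download. The function name gunzip and the rule for naming the output file are assumptions.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(gz_path: str) -> str:
    """Hypothetical helper: decompress a local .gz file next to the archive."""
    if not os.path.exists(gz_path):
        raise FileNotFoundError(gz_path)
    # Assumption: the output keeps the name without the '.gz' suffix.
    out_path = gz_path[:-3] if gz_path.endswith(".gz") else gz_path + ".out"
    # Stream the decompressed bytes to the output file.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Create a small archive, then unpack it.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "corpus.jsonl.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b'{"id": "d1"}\n')
    unpacked = gunzip(gz_path)
    unpacked_name = os.path.basename(unpacked)
    with open(unpacked, "rb") as f:
        unpacked_bytes = f.read()
```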

AbsEvalDataLoader._download_zip_file(download_url: str, save_dir: str)

Download and unzip the zip file from the provided URL.

Parameters:
  • download_url (str) – Source URL of the zip file.

  • save_dir (str) – Path to the directory to save the zip file.

Raises:

FileNotFoundError

Returns:

The path to the file after unzipping.

Return type:

str
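
The extraction step can be sketched with the standard zipfile module. Only the local unpacking is shown, not the download. The function name unzip and the choice to extract into a sibling directory named after the archive are assumptions.

```python
import os
import tempfile
import zipfile


def unzip(zip_path: str) -> str:
    """Hypothetical helper: extract a local zip archive and return the output directory."""
    if not os.path.exists(zip_path):
        raise FileNotFoundError(zip_path)
    # Assumption: the archive is extracted into a sibling directory named after it.
    out_dir = os.path.splitext(zip_path)[0]
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return out_dir


# Create a small archive, then extract it and list the result.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "dataset.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("queries.jsonl", '{"id": "q1"}\n')
    out_dir = unzip(zip_path)
    extracted = sorted(os.listdir(out_dir))
```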