AbsDataset#

AbsRerankerTrainDataset#

class FlagEmbedding.abc.finetune.reranker.AbsRerankerTrainDataset(args: AbsRerankerDataArguments, tokenizer: PreTrainedTokenizer)[source]#

Abstract class for reranker training dataset.

Parameters:
  • args (AbsRerankerDataArguments) – Data arguments for the dataset.

  • tokenizer (PreTrainedTokenizer) – Tokenizer to use for encoding the data.

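A minimal construction sketch, assuming AbsRerankerDataArguments is importable from the same module and exposes a train_data field pointing at a training file (both assumptions based on the signatures on this page); the model name and file path are placeholders:

    from transformers import AutoTokenizer
    from FlagEmbedding.abc.finetune.reranker import (
        AbsRerankerDataArguments,
        AbsRerankerTrainDataset,
    )

    # Placeholder model name and data path; substitute your own.
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")
    data_args = AbsRerankerDataArguments(train_data="./train_data.jsonl")  # assumed field
    dataset = AbsRerankerTrainDataset(args=data_args, tokenizer=tokenizer)
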
Methods#

AbsRerankerTrainDataset.create_one_example(qry_encoding: str, doc_encoding: str)[source]#

Creates a single input example by encoding and preparing a query and document pair for the model.

Parameters:
  • qry_encoding (str) – Query to be encoded.

  • doc_encoding (str) – Document to be encoded.

Returns:

A dictionary containing tokenized and prepared inputs, ready for model consumption.

Return type:

dict
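
Continuing the sketch above, a single query-document pair can be turned into a model-ready input; the strings here are illustrative only:

    # `dataset` is the AbsRerankerTrainDataset from the earlier sketch.
    item = dataset.create_one_example(
        "what does a reranker do?",
        "A reranker scores query-document pairs by relevance.",
    )
    # `item` is a dict of tokenized inputs (e.g. input_ids), ready for the model.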

AbsRerankerTrainDataset._load_dataset(file_path: str)[source]#

Load dataset from path.

Parameters:

file_path (str) – Path to load the datasets from.

Raises:

ValueError – pos_scores and neg_scores not found in the features of training data.

Returns:

Loaded HF dataset.

Return type:

datasets.Dataset
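
A sketch of one training record that _load_dataset can consume, assuming the usual FlagEmbedding fine-tuning layout of a JSON-lines file with query/pos/neg fields; pos_scores and neg_scores are the optional teacher-score fields whose absence triggers the ValueError above when they are required:

    # One line of the assumed .jsonl training file, shown as a Python dict.
    record = {
        "query": "what does a reranker do?",
        "pos": ["A reranker scores query-document pairs by relevance."],
        "neg": ["A tokenizer splits raw text into tokens."],
        "pos_scores": [0.95],  # optional teacher scores (assumed, per the ValueError)
        "neg_scores": [0.05],  # optional teacher scores (assumed, per the ValueError)
    }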

AbsRerankerTrainDataset._shuffle_text(text)[source]#

Shuffle the input text.

Parameters:

text (str) – Input text.

Returns:

Shuffled text.

Return type:

str
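
One plausible implementation of such shuffling, shown purely as a sketch (the chunking strategy of the real _shuffle_text is not specified here): split the text into a few chunks and permute them as a light data augmentation.

    import random

    def shuffle_text_sketch(text: str, n_chunks: int = 3) -> str:
        # Split the text into roughly equal chunks and shuffle their order.
        size = max(1, len(text) // n_chunks + 1)
        chunks = [text[i:i + size] for i in range(0, len(text), size)]
        random.shuffle(chunks)
        return " ".join(chunks)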

AbsRerankerCollator#

class FlagEmbedding.abc.finetune.reranker.AbsRerankerCollator(tokenizer: PreTrainedTokenizerBase, padding: bool | str | PaddingStrategy = True, max_length: int | None = None, pad_to_multiple_of: int | None = None, return_tensors: str = 'pt', query_max_len: int = 32, passage_max_len: int = 128)[source]#

The abstract reranker collator.
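
A minimal sketch of wiring the collator into a PyTorch DataLoader, assuming the collator is callable on a list of dataset items in the usual Hugging Face data-collator style; tokenizer and dataset come from the earlier sketches:

    from torch.utils.data import DataLoader
    from FlagEmbedding.abc.finetune.reranker import AbsRerankerCollator

    collator = AbsRerankerCollator(
        tokenizer=tokenizer,   # from the dataset sketch above
        query_max_len=32,
        passage_max_len=128,
    )
    loader = DataLoader(dataset, batch_size=4, collate_fn=collator)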

AbsLLMRerankerTrainDataset#

class FlagEmbedding.abc.finetune.reranker.AbsLLMRerankerTrainDataset(args: AbsRerankerDataArguments, tokenizer: PreTrainedTokenizer)[source]#

Abstract class for LLM reranker training dataset.

Parameters:
  • args (AbsRerankerDataArguments) – Data arguments for the dataset.

  • tokenizer (PreTrainedTokenizer) – Tokenizer to use for encoding the data.

AbsLLMRerankerCollator#

class FlagEmbedding.abc.finetune.reranker.AbsLLMRerankerCollator(tokenizer: PreTrainedTokenizerBase, model: Any | None = None, padding: bool | str | PaddingStrategy = True, max_length: int | None = None, pad_to_multiple_of: int | None = None, label_pad_token_id: int = -100, return_tensors: str = 'pt', query_max_len: int = 32, passage_max_len: int = 128)[source]#

Wrapper that converts List[Tuple[encode_qry, encode_psg]] to List[qry], List[psg] and passes each batch separately to the actual collator, abstracting data details away from the model.
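
Construction mirrors AbsRerankerCollator; a sketch, with label_pad_token_id = -100 matching the value PyTorch's cross-entropy loss ignores by default:

    from FlagEmbedding.abc.finetune.reranker import AbsLLMRerankerCollator

    llm_collator = AbsLLMRerankerCollator(
        tokenizer=tokenizer,       # from the dataset sketch above
        label_pad_token_id=-100,   # padded label positions ignored by the loss
        query_max_len=32,
        passage_max_len=128,
    )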