Dataset#

DecoderOnlyEmbedderICLSameDatasetTrainDataset#

class FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLSameDatasetTrainDataset(args: DecoderOnlyEmbedderICLDataArguments, default_batch_size: int, seed: int, tokenizer: PreTrainedTokenizer, process_index: int = 0, num_processes: int = 1)[source]#

Dataset class for the ICL (in-context learning) model. A minimal construction sketch is shown after the parameter list below.

Parameters:
  • args (DecoderOnlyEmbedderICLDataArguments) – Data argument class for the ICL model.

  • default_batch_size (int) – The default batch size.

  • seed (int) – Random seed to use.

  • tokenizer (PreTrainedTokenizer) – Tokenizer to use.

  • process_index (int, optional) – Current process index. Defaults to 0.

  • num_processes (int, optional) – Total number of processes. Defaults to 1.
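
A minimal construction sketch, assuming the data-arguments class is importable from the same module and that train_data is the field pointing at a prepared training file; the model name, file path, and field values below are placeholders, not part of this reference.

# Hedged construction sketch for a single-process run; import paths, field
# names, and the data file are assumptions and may need adjusting.
from transformers import AutoTokenizer
from FlagEmbedding.finetune.embedder.decoder_only.icl import (
    DecoderOnlyEmbedderICLDataArguments,
    DecoderOnlyEmbedderICLSameDatasetTrainDataset,
)

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")

data_args = DecoderOnlyEmbedderICLDataArguments(
    train_data=["./finetune_data.jsonl"],  # hypothetical training file
)

dataset = DecoderOnlyEmbedderICLSameDatasetTrainDataset(
    args=data_args,
    default_batch_size=8,
    seed=42,
    tokenizer=tokenizer,
    process_index=0,   # single-process run
    num_processes=1,
)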

Methods#

DecoderOnlyEmbedderICLSameDatasetTrainDataset._create_batch_data(batch_raw_data)[source]#

Create a complete batch of data with queries, documents, and teacher scores.

Parameters:

batch_raw_data (datasets.Dataset) – One batch of raw data.

Returns:

  • List[str] – Queries with instruction format.

  • List[str] – Documents with instruction format.

  • List[float] – Teacher scores for model distillation.

Return type:

Tuple[List[str], List[str], List[float]]
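
The three lists are returned together as a tuple. The sketch below only illustrates their shapes with placeholder values; _create_batch_data is called internally by the dataset, and the exact instruction template is model-specific.

# Illustrative shapes for a hypothetical batch of two queries, each with one
# positive and one negative passage; all values are placeholders.
queries = ["<formatted query 1>", "<formatted query 2>"]                     # List[str]
passages = ["<pos for q1>", "<neg for q1>", "<pos for q2>", "<neg for q2>"]  # List[str]
teacher_scores = [1.2, -0.3, 0.9, -0.1]                                      # List[float], used for distillation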

AbsEmbedderSameDatasetCollator#

class FlagEmbedding.finetune.embedder.decoder_only.icl.AbsEmbedderSameDatasetCollator(tokenizer: PreTrainedTokenizerBase, padding: bool | str | PaddingStrategy = True, max_length: int | None = None, pad_to_multiple_of: int | None = None, return_tensors: str = 'pt', query_max_len: int = 32, passage_max_len: int = 128, sub_batch_size: int = -1)[source]#

Embedding collator for the same-dataset training strategy. Note that when using this collator, the training_args should be set as follows:

training_args.per_device_train_batch_size = 1

training_args.dataloader_num_workers = 0    # avoid multi-processing
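
A hedged usage sketch, reusing the tokenizer and dataset from the construction sketch above and pairing the collator with a plain PyTorch DataLoader; in the finetune scripts the collator is normally handed to the trainer instead, and the output directory below is a placeholder.

from torch.utils.data import DataLoader
from transformers import TrainingArguments
from FlagEmbedding.finetune.embedder.decoder_only.icl import AbsEmbedderSameDatasetCollator

collator = AbsEmbedderSameDatasetCollator(
    tokenizer=tokenizer,        # tokenizer from the dataset sketch above
    query_max_len=32,
    passage_max_len=128,
)

training_args = TrainingArguments(
    output_dir="./output",            # hypothetical path
    per_device_train_batch_size=1,    # the dataset already yields a full batch
    dataloader_num_workers=0,         # avoid multi-processing
)

dataloader = DataLoader(
    dataset,                          # dataset from the sketch above
    batch_size=training_args.per_device_train_batch_size,
    collate_fn=collator,
    num_workers=training_args.dataloader_num_workers,
)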