Dataset#
DecoderOnlyEmbedderICLSameDatasetTrainDataset#
- class FlagEmbedding.finetune.embedder.decoder_only.icl.DecoderOnlyEmbedderICLSameDatasetTrainDataset(args: DecoderOnlyEmbedderICLDataArguments, default_batch_size: int, seed: int, tokenizer: PreTrainedTokenizer, process_index: int = 0, num_processes: int = 1)[source]#
Dataset class for the ICL (in-context learning) model; a construction sketch follows the parameter list.
- Parameters:
args (DecoderOnlyEmbedderICLDataArguments) – Data argument class for the ICL model.
default_batch_size (int) – The default batch size.
seed (int) – Random seed to use.
tokenizer (PreTrainedTokenizer) – Tokenizer to use.
process_index (int, optional) – Current process index. Defaults to 0.
num_processes (int, optional) – Total number of processes. Defaults to 1.
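A minimal construction sketch. The data-argument field train_data and the checkpoint name BAAI/bge-en-icl are illustrative assumptions, not values prescribed by this class; check DecoderOnlyEmbedderICLDataArguments for the exact fields in your installed FlagEmbedding version.
from transformers import AutoTokenizer
from FlagEmbedding.finetune.embedder.decoder_only.icl import (
    DecoderOnlyEmbedderICLDataArguments,
    DecoderOnlyEmbedderICLSameDatasetTrainDataset,
)

# Tokenizer of the decoder-only embedder being fine-tuned (checkpoint name is illustrative).
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-en-icl")

# Hypothetical data arguments: train_data (path(s) to training files) is an assumed field name.
data_args = DecoderOnlyEmbedderICLDataArguments(train_data=["./icl_train_data"])

dataset = DecoderOnlyEmbedderICLSameDatasetTrainDataset(
    args=data_args,
    default_batch_size=8,   # the default batch size described above
    seed=42,
    tokenizer=tokenizer,
    process_index=0,        # rank of this process
    num_processes=1,        # total number of distributed processes
)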
Methods#
- DecoderOnlyEmbedderICLSameDatasetTrainDataset._create_batch_data(batch_raw_data)[source]#
Create a complete batch of data with queries, documents, and teacher scores; a sketch of the returned tuple follows the return type below.
- Parameters:
batch_raw_data (datasets.Dataset) – One batch of raw data.
- Returns:
List[str]: Queries with instruction format.
List[str]: Documents with instruction format.
List[float]: Teacher scores for model distillation.
- Return type:
Tuple[List[str], List[str], List[float]]
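To make the return structure concrete, a small sketch of the tuple shape this helper produces; whether teacher scores may be absent (None) and the consuming function are illustrative assumptions, not part of the library.
from typing import List, Optional, Tuple

# Shape of one batch as described above; Optional teacher scores are an assumption for illustration.
BatchData = Tuple[List[str], List[str], Optional[List[float]]]

def consume_batch(batch: BatchData) -> None:
    queries, documents, teacher_scores = batch
    print(f"{len(queries)} queries, {len(documents)} documents")
    if teacher_scores is not None:
        print(f"{len(teacher_scores)} teacher scores for distillation")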
AbsEmbedderSameDatasetCollator#
- class FlagEmbedding.finetune.embedder.decoder_only.icl.AbsEmbedderSameDatasetCollator(tokenizer: PreTrainedTokenizerBase, padding: bool | str | PaddingStrategy = True, max_length: int | None = None, pad_to_multiple_of: int | None = None, return_tensors: str = 'pt', query_max_len: int = 32, passage_max_len: int = 128, sub_batch_size: int = -1)[source]#
EmbedCollator for SameDataset. Note that when using this collator, the training_args should be set as:
training_args.per_device_train_batch_size = 1
training_args.dataloader_num_workers = 0 # avoid multi-processing
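A hedged sketch of plugging the collator into a PyTorch DataLoader under the settings above; it reuses the tokenizer and dataset from the earlier sketch, and the max-length values are illustrative.
from torch.utils.data import DataLoader
from FlagEmbedding.finetune.embedder.decoder_only.icl import AbsEmbedderSameDatasetCollator

# Each dataset item is already a full same-dataset batch, hence batch_size=1
# and num_workers=0 below, mirroring the training_args settings noted above.
collator = AbsEmbedderSameDatasetCollator(
    tokenizer=tokenizer,       # same tokenizer as the dataset
    query_max_len=64,          # illustrative values
    passage_max_len=256,
    sub_batch_size=-1,         # -1 is the default; assumed to disable sub-batching
)

dataloader = DataLoader(
    dataset,
    batch_size=1,              # per_device_train_batch_size = 1
    num_workers=0,             # dataloader_num_workers = 0
    collate_fn=collator,
)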