{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we walked through how to construct the training and testing data. In this tutorial, we fine-tune the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fine-tune BGE models with FlagEmbedding, install the package with the `finetune` dependency:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U FlagEmbedding[finetune]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fine-tune" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below are the arguments for fine-tuning:\n", "\n", "The following arguments are for the model:\n", "- `model_name_or_path`: The model checkpoint for initialization.\n", "- `config_name`: Pretrained config name or path if not the same as model_name.\n", "- `tokenizer_name`: Pretrained tokenizer name or path if not the same as model_name.\n", "- `cache_dir`: Where to store the downloaded pre-trained models.\n", "- `trust_remote_code`: Whether to trust remote code when loading the model.\n", "- `token`: The token to use when accessing the model.\n", "\n", "The following arguments are for the data:\n", "- `train_data`: One or more paths to training data. `query: str`, `pos: List[str]`, `neg: List[str]` are required in the training data. Argument type: multiple.\n", "- `cache_path`: Where to store the cached data.\n", "- `train_group_size`: (No metadata provided)\n", "- `query_max_len`: The maximum total input sequence length after tokenization for query. Sequences longer than this will be truncated.\n", "- `passage_max_len`: The maximum total input sequence length after tokenization for passage. 
Sequences longer than this will be truncated.\n", "- `pad_to_multiple_of`: If set, pad the sequence to a multiple of the provided value.\n", "- `max_example_num_per_dataset`: The maximum number of examples for each dataset.\n", "- `query_instruction_for_retrieval`: Instruction for query.\n", "- `query_instruction_format`: Format for query instruction.\n", "- `knowledge_distillation`: Use knowledge distillation when `pos_scores: List[float]` and `neg_scores: List[float]` are present in the features of the training data.\n", "- `passage_instruction_for_retrieval`: Instruction for passage.\n", "- `passage_instruction_format`: Format for passage instruction.\n", "- `shuffle_ratio`: The ratio of shuffling the text.\n", "- `same_dataset_within_batch`: All samples in the same batch come from the same dataset.\n", "- `small_threshold`: The threshold for a small dataset. All small datasets in the same directory will be merged into one dataset.\n", "- `drop_threshold`: The threshold for dropping the merged small dataset. If the number of examples in the merged small dataset is less than this threshold, it will be dropped.\n", "\n", "And the following extra arguments:\n", "- `negatives_cross_device`: Share negatives across devices.\n", "- `temperature`: Temperature used for similarity score.\n", "- `fix_position_embedding`: Freeze the parameters of position embeddings.\n", "- `sentence_pooling_method`: The pooling method. Available options: cls, mean, last_token. Default: cls.\n", "- `normalize_embeddings`: Whether to normalize the embeddings.\n", "- `sub_batch_size`: Sub batch size for training.\n", "- `kd_loss_type`: The loss type for knowledge distillation. Available options: kl_div, m3_kd_loss. Default: kl_div." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "W1223 06:27:06.807000 1362426 site-packages/torch/distributed/run.py:793] \n", "W1223 06:27:06.807000 1362426 site-packages/torch/distributed/run.py:793] *****************************************\n", "W1223 06:27:06.807000 1362426 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. \n", "W1223 06:27:06.807000 1362426 site-packages/torch/distributed/run.py:793] *****************************************\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml\n", " warnings.warn(\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. 
Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2024-12-23 06:27:31,423] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n", "[2024-12-23 06:27:31,424] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)\n", "[2024-12-23 06:27:40,529] [INFO] [comm.py:652:init_distributed] cdb=None\n", "[2024-12-23 06:27:40,529] [INFO] [comm.py:652:init_distributed] cdb=None\n", "[2024-12-23 06:27:40,529] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "12/23/2024 06:27:40 - WARNING - FlagEmbedding.abc.finetune.embedder.AbsRunner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True\n", "12/23/2024 06:27:40 - INFO - FlagEmbedding.abc.finetune.embedder.AbsRunner - Training/evaluation parameters AbsEmbedderTrainingArguments(\n", "_n_gpu=1,\n", "accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},\n", "adafactor=False,\n", "adam_beta1=0.9,\n", "adam_beta2=0.999,\n", "adam_epsilon=1e-08,\n", "auto_find_batch_size=False,\n", "batch_eval_metrics=False,\n", "bf16=False,\n", "bf16_full_eval=False,\n", "data_seed=None,\n", "dataloader_drop_last=True,\n", "dataloader_num_workers=0,\n", "dataloader_persistent_workers=False,\n", "dataloader_pin_memory=True,\n", "dataloader_prefetch_factor=None,\n", "ddp_backend=None,\n", "ddp_broadcast_buffers=None,\n", "ddp_bucket_cap_mb=None,\n", "ddp_find_unused_parameters=None,\n", "ddp_timeout=1800,\n", "debug=[],\n", "deepspeed=config/ds_stage0.json,\n", "disable_tqdm=False,\n", "dispatch_batches=None,\n", 
"do_eval=False,\n", "do_predict=False,\n", "do_train=False,\n", "eval_accumulation_steps=None,\n", "eval_delay=0,\n", "eval_do_concat_batches=True,\n", "eval_on_start=False,\n", "eval_steps=None,\n", "eval_strategy=IntervalStrategy.NO,\n", "eval_use_gather_object=False,\n", "evaluation_strategy=None,\n", "fix_position_embedding=False,\n", "fp16=True,\n", "fp16_backend=auto,\n", "fp16_full_eval=False,\n", "fp16_opt_level=O1,\n", "fsdp=[],\n", "fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},\n", "fsdp_min_num_params=0,\n", "fsdp_transformer_layer_cls_to_wrap=None,\n", "full_determinism=False,\n", "gradient_accumulation_steps=1,\n", "gradient_checkpointing=True,\n", "gradient_checkpointing_kwargs=None,\n", "greater_is_better=None,\n", "group_by_length=False,\n", "half_precision_backend=auto,\n", "hub_always_push=False,\n", "hub_model_id=None,\n", "hub_private_repo=False,\n", "hub_strategy=HubStrategy.EVERY_SAVE,\n", "hub_token=,\n", "ignore_data_skip=False,\n", "include_inputs_for_metrics=False,\n", "include_num_input_tokens_seen=False,\n", "include_tokens_per_second=False,\n", "jit_mode_eval=False,\n", "kd_loss_type=kl_div,\n", "label_names=None,\n", "label_smoothing_factor=0.0,\n", "learning_rate=1e-05,\n", "length_column_name=length,\n", "load_best_model_at_end=False,\n", "local_rank=0,\n", "log_level=passive,\n", "log_level_replica=warning,\n", "log_on_each_node=True,\n", "logging_dir=./test_encoder_only_base_bge-large-en-v1.5/runs/Dec23_06-27-30_job-40fb0ce3-8bfb-46ea-b409-0a2e2a1a3163-master-0,\n", "logging_first_step=False,\n", "logging_nan_inf_filter=True,\n", "logging_steps=1.0,\n", "logging_strategy=IntervalStrategy.STEPS,\n", "lr_scheduler_kwargs={},\n", "lr_scheduler_type=SchedulerType.LINEAR,\n", "max_grad_norm=1.0,\n", "max_steps=-1,\n", "metric_for_best_model=None,\n", "mp_parameters=,\n", "neftune_noise_alpha=None,\n", "negatives_cross_device=True,\n", "no_cuda=False,\n", 
"normalize_embeddings=True,\n", "num_train_epochs=2.0,\n", "optim=OptimizerNames.ADAMW_TORCH,\n", "optim_args=None,\n", "optim_target_modules=None,\n", "output_dir=./test_encoder_only_base_bge-large-en-v1.5,\n", "overwrite_output_dir=True,\n", "past_index=-1,\n", "per_device_eval_batch_size=8,\n", "per_device_train_batch_size=2,\n", "prediction_loss_only=False,\n", "push_to_hub=False,\n", "push_to_hub_model_id=None,\n", "push_to_hub_organization=None,\n", "push_to_hub_token=,\n", "ray_scope=last,\n", "remove_unused_columns=True,\n", "report_to=[],\n", "restore_callback_states_from_checkpoint=False,\n", "resume_from_checkpoint=None,\n", "run_name=./test_encoder_only_base_bge-large-en-v1.5,\n", "save_on_each_node=False,\n", "save_only_model=False,\n", "save_safetensors=True,\n", "save_steps=1000,\n", "save_strategy=IntervalStrategy.STEPS,\n", "save_total_limit=None,\n", "seed=42,\n", "sentence_pooling_method=cls,\n", "skip_memory_metrics=True,\n", "split_batches=None,\n", "sub_batch_size=None,\n", "temperature=0.02,\n", "tf32=None,\n", "torch_compile=False,\n", "torch_compile_backend=None,\n", "torch_compile_mode=None,\n", "torch_empty_cache_steps=None,\n", "torchdynamo=None,\n", "tpu_metrics_debug=False,\n", "tpu_num_cores=None,\n", "use_cpu=False,\n", "use_ipex=False,\n", "use_legacy_prediction_loop=False,\n", "use_mps_device=False,\n", "warmup_ratio=0.1,\n", "warmup_steps=0,\n", "weight_decay=0.0,\n", ")\n", "12/23/2024 06:27:40 - INFO - FlagEmbedding.abc.finetune.embedder.AbsRunner - Model parameters AbsEmbedderModelArguments(model_name_or_path='BAAI/bge-large-en-v1.5', config_name=None, tokenizer_name=None, cache_dir='./cache/model', trust_remote_code=False, token=None)\n", "12/23/2024 06:27:40 - INFO - FlagEmbedding.abc.finetune.embedder.AbsRunner - Data parameters AbsEmbedderDataArguments(train_data=['./ft_data/training.json'], cache_path='./cache/data', train_group_size=8, query_max_len=512, passage_max_len=512, pad_to_multiple_of=8, 
max_example_num_per_dataset=100000000, query_instruction_for_retrieval='Represent this sentence for searching relevant passages: ', query_instruction_format='{}{}', knowledge_distillation=False, passage_instruction_for_retrieval=None, passage_instruction_format='{}{}', shuffle_ratio=0.0, same_dataset_within_batch=False, small_threshold=0, drop_threshold=0)\n", "12/23/2024 06:27:40 - WARNING - FlagEmbedding.abc.finetune.embedder.AbsRunner - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True\n", "12/23/2024 06:35:01 - INFO - FlagEmbedding.finetune.embedder.encoder_only.base.runner - Config: BertConfig {\n", " \"_name_or_path\": \"BAAI/bge-large-en-v1.5\",\n", " \"architectures\": [\n", " \"BertModel\"\n", " ],\n", " \"attention_probs_dropout_prob\": 0.1,\n", " \"classifier_dropout\": null,\n", " \"gradient_checkpointing\": false,\n", " \"hidden_act\": \"gelu\",\n", " \"hidden_dropout_prob\": 0.1,\n", " \"hidden_size\": 1024,\n", " \"id2label\": {\n", " \"0\": \"LABEL_0\"\n", " },\n", " \"initializer_range\": 0.02,\n", " \"intermediate_size\": 4096,\n", " \"label2id\": {\n", " \"LABEL_0\": 0\n", " },\n", " \"layer_norm_eps\": 1e-12,\n", " \"max_position_embeddings\": 512,\n", " \"model_type\": \"bert\",\n", " \"num_attention_heads\": 16,\n", " \"num_hidden_layers\": 24,\n", " \"pad_token_id\": 0,\n", " \"position_embedding_type\": \"absolute\",\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.44.2\",\n", " \"type_vocab_size\": 2,\n", " \"use_cache\": true,\n", " \"vocab_size\": 30522\n", "}\n", "\n", "12/23/2024 06:35:01 - INFO - FlagEmbedding.abc.finetune.embedder.AbsDataset - loading data from ./ft_data/training.json ...\n", "Generating train split: 6300 examples [00:00, 46043.95 examples/s]\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. 
Please import deepspeed modules directly from transformers.integrations\n", " warnings.warn(\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations\n", " warnings.warn(\n", "12/23/2024 06:35:02 - WARNING - accelerate.utils.other - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[1734935704.354551] [job-40fb0ce3-8bfb-46ea-b409-0a2e2a1a3163-master-0:1362491:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device\n", "[1734935704.383634] [job-40fb0ce3-8bfb-46ea-b409-0a2e2a1a3163-master-0:1362492:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...\n", "Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...\n", "Detected CUDA files, patching ldflags\n", "Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. \n", "If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].\n", " warnings.warn(\n", "Building extension module fused_adam...\n", "Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ninja: no work to do.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading extension module fused_adam...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time to load fused_adam op: 1.1966907978057861 seconds\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading extension module fused_adam...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time to load fused_adam op: 1.2037739753723145 seconds\n", "[2024-12-23 06:35:06,883] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started\n", "[2024-12-23 06:35:06,888] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n", " 0%| | 0/3150 [00:00