Data Preparation for Fine-tuning#
In this tutorial, we will show an example of the first step for fine-tuning: dataset preparation.
0. Installation#
% pip install -U datasets
Suppose we want to fine-tune our model for financial tasks. We found an open-source dataset that could be useful: financial-qa-10k. Let’s see how to properly prepare our dataset for fine-tuning.
The raw dataset has the following structure:
5 columns of: ‘question’, ‘answer’, ‘context’, ‘ticker’, and ‘filing’.
7000 rows.
from datasets import load_dataset
ds = load_dataset("virattt/financial-qa-10K", split="train")
ds
Dataset({
features: ['question', 'answer', 'context', 'ticker', 'filing'],
num_rows: 7000
})
1. Data for Fine-tuning#
Construct the dataset into the following format:
{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[int], "neg_scores": List[int], "prompt": str, "type": str}
- query is the query.
- pos is a list of positive texts.
- neg is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as the negatives.
- pos_scores is a list of scores corresponding to the query and pos, and neg_scores is a list of scores corresponding to the query and neg. If you don’t use knowledge distillation, they can be ignored.
- prompt is the prompt used for the query; it will override query_instruction_for_retrieval.
- type is used for bge-en-icl; it includes normal, symmetric_class, symmetric_clustering, etc.
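For concreteness, here is one hypothetical record in this format (the query, passages, and scores below are made up for illustration; the score fields only matter if you use knowledge distillation):

```python
# A made-up single training example in the required schema.
example = {
    "query": "What was the company's revenue in fiscal 2023?",
    "pos": ["Revenue for fiscal 2023 was $5.2 billion."],
    "neg": ["The board declared a quarterly dividend of $0.25 per share."],
    "pos_scores": [1],    # only needed for knowledge distillation
    "neg_scores": [0],    # only needed for knowledge distillation
    "prompt": "Represent this sentence for searching relevant passages: ",
    "type": "normal",
}

# Basic shape checks matching the schema above.
assert isinstance(example["query"], str)
assert isinstance(example["pos"], list) and isinstance(example["neg"], list)
assert all(isinstance(t, str) for t in example["pos"] + example["neg"])
```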
We select the columns ‘question’ and ‘context’ as our query and positive text (pos), and rename the columns accordingly. Then we add an ‘id’ column for later use in evaluation.
ds = ds.select_columns(column_names=["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.rename_column("context", "pos")
ds = ds.add_column("id", [str(i) for i in range(len(ds))])
ds[0]
{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
'pos': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',
'id': '0'}
Negative examples are important when training embedding models. Our initial dataset does not come with negative texts, so we simply sample some from the whole corpus.
import numpy as np

np.random.seed(520)
neg_num = 10

def str_to_lst(data):
    # wrap the single positive passage in a list, as the format requires
    data["pos"] = [data["pos"]]
    return data

# sample negative texts, resampling whenever the row's own index is drawn
new_col = []
for i in range(len(ds)):
    ids = np.random.randint(0, len(ds), size=neg_num)
    while i in ids:
        ids = np.random.randint(0, len(ds), size=neg_num)
    neg = [ds[j.item()]["pos"] for j in ids]
    new_col.append(neg)
ds = ds.add_column("neg", new_col)

# change the value of 'pos' to a list
ds = ds.map(str_to_lst)
Map: 100%|██████████| 7000/7000 [00:00<00:00, 22336.83 examples/s]
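The rejection-sampling loop above guarantees that a row's own positive passage is never drawn as one of its negatives. The same idea on a small toy corpus (the names `corpus` and `negatives` are just for this sketch):

```python
import numpy as np

# Toy corpus standing in for the 7000 'pos' passages.
corpus = [f"passage {k}" for k in range(20)]
np.random.seed(0)
neg_num = 5

negatives = []
for i in range(len(corpus)):
    ids = np.random.randint(0, len(corpus), size=neg_num)
    # resample until the current row's own index is excluded
    while i in ids:
        ids = np.random.randint(0, len(corpus), size=neg_num)
    negatives.append([corpus[j] for j in ids])

# Each row's positive never appears among its sampled negatives.
assert all(corpus[i] not in negatives[i] for i in range(len(corpus)))
```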
Lastly, we add the prompt used for the query. It will serve as the query_instruction_for_retrieval during inference.
instruction = "Represent this sentence for searching relevant passages: "
ds = ds.add_column("prompt", [instruction]*len(ds))
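A rough sketch of what this prompt does at inference time: the instruction is prepended to each query before encoding (the exact mechanics depend on the model configuration, so treat this as an illustration only):

```python
# Illustration: at retrieval time, the instruction is prepended to the
# query text before it is passed to the embedding model.
instruction = "Represent this sentence for searching relevant passages: "
query = "What area did NVIDIA initially focus on?"

model_input = instruction + query
assert model_input.startswith(instruction)
assert model_input.endswith(query)
```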
Now a single row of the dataset is:
ds[0]
{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
'pos': ['Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.'],
'id': '0',
'neg': ['Kroger expects that its value creation model will deliver total shareholder return within a target range of 8% to 11% over time.',
'CSB purchased First Mortgages of $2.9 billion during 2023.',
'See Note 13 to our Consolidated Financial Statements for information on certain legal proceedings for which there are contingencies.',
'Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.',
'In the year ended December 31, 2023, Total net sales and revenue increased primarily due to: (1) increased net wholesale volumes primarily due to increased sales of crossover vehicles and full-size pickup trucks, partially offset by decreased sales of mid-size pickup trucks; (2) favorable Price as a result of low dealer inventory levels and strong demand for our products; (3) favorable Mix associated with increased sales of full-size pickup trucks and full-size SUVs and decreased sales of vans, passenger cars and mid-size pickup trucks, partially offset by increased sales of crossover vehicles; and (4) favorable Other due to increased sales of parts and accessories.',
'As of December 31, 2023, we had 3,157 full-time employees.',
'Item 3. Legal Proceedings. The information contained in Note 18 ‘‘Commitments and Contingencies’’ included in Item 8 of this 10-K is incorporated herein by reference.',
'Under the amended 2019 Secured Facility, the maturity date is set to July 20, 2026.',
'Accounts receivable for Las Vegas Sands Corp. on December 31, 2023, totaled $685 million, with a provision for credit losses of $201 million, resulting in a net balance of $484 million.',
'Operating expenses as a percentage of segment net sales decreased 25 basis points for fiscal 2023 when compared to the previous fiscal year, primarily driven by strong sales growth and lower incremental COVID-19 related costs, partially offset by increased wage costs.'],
'prompt': 'Represent this sentence for searching relevant passages: '}
Then we split the dataset into a training set and a testing set.
split = ds.train_test_split(test_size=0.1, shuffle=True, seed=520)
train = split["train"]
test = split["test"]
Now we are ready to store the data for later fine-tuning:
train.to_json("ft_data/training.json")
Creating json from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 39.73ba/s]
16583481
2. Test Data for Evaluation#
The last step is to construct the testing dataset for evaluation.
test
Dataset({
features: ['query', 'pos', 'id', 'neg', 'prompt'],
num_rows: 700
})
First select the columns for queries:
queries = test.select_columns(column_names=["id", "query"])
queries = queries.rename_column("query", "text")
queries[0]
{'id': '1289',
'text': 'How does Starbucks recognize the interest and penalties related to income tax matters on their financial statements?'}
Then select the columns for corpus:
corpus = ds.select_columns(column_names=["id", "pos"])
corpus = corpus.rename_column("pos", "text")
Finally, make the qrels that indicate the relevance between each query and its corresponding corpus entry:
qrels = test.select_columns(["id"])
qrels = qrels.rename_column("id", "qid")
qrels = qrels.add_column("docid", list(test["id"]))
qrels = qrels.add_column("relevance", [1]*len(test))
qrels[0]
Flattening the indices: 100%|██████████| 700/700 [00:00<00:00, 180956.10 examples/s]
{'qid': '1289', 'docid': '1289', 'relevance': 1}
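To show how these qrels would later be consumed, here is a minimal, hypothetical recall@k over a retriever's ranked results (this is not the actual evaluation code; `results` below is a made-up retriever output):

```python
# qrels: qid -> {docid: relevance}, matching the structure built above.
qrels = {"1289": {"1289": 1}}
# Hypothetical retriever output: qid -> ranked list of docids.
results = {"1289": ["42", "1289", "7"]}

def recall_at_k(qrels, results, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = 0
    for qid, rels in qrels.items():
        retrieved = set(results.get(qid, [])[:k])
        if any(docid in retrieved for docid in rels):
            hits += 1
    return hits / len(qrels)

assert recall_at_k(qrels, results, 2) == 1.0  # '1289' is ranked 2nd
assert recall_at_k(qrels, results, 1) == 0.0  # top-1 is '42', not relevant
```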
Store the test queries, corpus, and qrels:
queries.to_json("ft_data/test_queries.jsonl")
corpus.to_json("ft_data/corpus.jsonl")
qrels.to_json("ft_data/test_qrels.jsonl")
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 210.42ba/s]
Creating json from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 261.19ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 591.08ba/s]
30574