Hard Negatives#
Hard negatives are negative samples that are particularly difficult for the model to distinguish from the positives. They typically lie close to the decision boundary or share features that make them highly similar to the positive samples. Hard negative mining is therefore widely used in machine learning to force the model to focus on subtle differences between similar instances, leading to better discrimination.
In a text retrieval system, a hard negative is a document that shares some surface similarities with the query but does not truly satisfy the query's intent. During retrieval, such documents can rank higher than the true answers, so it is valuable to explicitly train the model on these hard negatives.
1. Preparation#
First, load an embedding model:
from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-base-en-v1.5')
Then, load the queries and corpus from the dataset:
from datasets import load_dataset
corpus = load_dataset("BeIR/scifact", "corpus")["corpus"]
queries = load_dataset("BeIR/scifact", "queries")["queries"]
corpus_ids = corpus.select_columns(["_id"])["_id"]
corpus = corpus.select_columns(["text"])["text"]
We create a dictionary mapping the auto-generated ids (starting from 0) used by the FAISS index back to the original corpus ids, for later use.
corpus_ids_map = {}
for i in range(len(corpus)):
    corpus_ids_map[i] = corpus_ids[i]
2. Indexing#
Use the embedding model to encode the queries and corpus:
p_vecs = model.encode(corpus)
p_vecs.shape
(5183, 768)
Then create a FAISS index:
import torch, faiss
import numpy as np

# create a basic flat inner-product index whose dimension matches our embeddings
index = faiss.IndexFlatIP(len(p_vecs[0]))
# make sure the embeddings are float32
p_vecs = np.asarray(p_vecs, dtype=np.float32)
# use GPUs to accelerate index searching
if torch.cuda.is_available():
    co = faiss.GpuMultipleClonerOptions()
    co.shard = True
    co.useFloat16 = True
    index = faiss.index_cpu_to_all_gpus(index, co=co)
# add all the embeddings to the index
index.add(p_vecs)
3. Searching#
For better demonstration, let’s use a single query:
query = queries[0]
query
{'_id': '0',
'title': '',
'text': '0-dimensional biomaterials lack inductive properties.'}
Get the id and content of that query, then use our embedding model to get its embedding vector.
q_id, q_text = query["_id"], query["text"]
# use the encode_queries() function to encode query
q_vec = model.encode_queries(queries=q_text)
Use the index to search for closest results:
_, ids = index.search(np.expand_dims(q_vec, axis=0), k=15)
# convert the auto ids back to ids in the original dataset
converted = [corpus_ids_map[id] for id in ids[0]]
converted
['4346436',
'17388232',
'14103509',
'37437064',
'29638116',
'25435456',
'32532238',
'31715818',
'23763738',
'7583104',
'21456232',
'2121272',
'35621259',
'58050905',
'196664003']
Then load the qrels, which record the labeled positive passages for each query:
qrels = load_dataset("BeIR/scifact-qrels")["train"]
pos_id = qrels[0]
pos_id
{'query-id': 0, 'corpus-id': 31715818, 'score': 1}
Lastly, we use the method of top-k shifted by N, which takes the top 10 results after rank 5 as negatives (excluding the labeled positive):
negatives = [id for id in converted[5:] if int(id) != pos_id["corpus-id"]]
negatives
['25435456',
'32532238',
'23763738',
'7583104',
'21456232',
'2121272',
'35621259',
'58050905',
'196664003']
Now we have selected a group of hard negatives for the first query!
There are other methods to refine the process of choosing hard negatives. For example, the implementation in our GitHub repo takes the top 200 shifted by 10, i.e. ranks 10-210, and then samples 15 from those 200 candidates. The reason is that directly choosing the top K may introduce false negatives (passages that are somewhat related to the query but are not actual answers to it) into the negative set, which could hurt the model's performance. A rough sketch of this sampled-range strategy is shown below.
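The snippet below is a minimal sketch of that strategy, not the exact implementation in the repo. It reuses the model, index, corpus_ids_map, queries, and qrels objects from above: it retrieves the top 210 candidates for every query, drops the first 10, filters out labeled positives, and samples 15 hard negatives per query. The output dictionary format and the fixed random seed are illustrative choices only.
import random
import numpy as np

random.seed(42)  # illustrative seed, only for reproducible sampling

# map each query id to the set of its labeled positive corpus ids
positives = {}
for row in qrels:
    positives.setdefault(str(row["query-id"]), set()).add(str(row["corpus-id"]))

# encode all queries in one batch
q_ids = queries["_id"]
q_vecs = model.encode_queries(queries=queries["text"])
q_vecs = np.asarray(q_vecs, dtype=np.float32)

# retrieve the top 210 candidates for every query
_, all_ids = index.search(q_vecs, k=210)

hard_negatives = {}
for q_id, row in zip(q_ids, all_ids):
    # skip the first 10 results to reduce false negatives, keeping ranks 10-210
    candidates = [corpus_ids_map[i] for i in row[10:]]
    # drop any passage that is actually a labeled positive for this query
    candidates = [c for c in candidates if c not in positives.get(q_id, set())]
    # sample 15 hard negatives from the remaining candidates
    hard_negatives[q_id] = random.sample(candidates, k=min(15, len(candidates)))
In practice you would write these out (for example, as JSON lines with query, positive, and negative fields) to feed your fine-tuning pipeline; the exact format depends on your training setup.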