{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hard Negatives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hard negatives are those negative samples that are particularly challenging for the model to distinguish from the positive ones. They are often close to the decision boundary or exhibit features that make them highly similar to the positive samples. Thus hard negative mining is widely used in machine learning tasks to make the model focus on subtle differences between similar instances, leading to better discrimination.\n", "\n", "In text retrieval system, a hard negative could be document that share some feature similarities with the query but does not truly satisfy the query's intent. During retrieval, those documents could rank higher than the real answers. Thus it's valuable to explicitly train the model on these hard negatives." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, load an embedding model:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from FlagEmbedding import FlagModel\n", "\n", "model = FlagModel('BAAI/bge-base-en-v1.5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, load the queries and corpus from dataset:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "corpus = load_dataset(\"BeIR/scifact\", \"corpus\")[\"corpus\"]\n", "queries = load_dataset(\"BeIR/scifact\", \"queries\")[\"queries\"]\n", "\n", "corpus_ids = corpus.select_columns([\"_id\"])[\"_id\"]\n", "corpus = corpus.select_columns([\"text\"])[\"text\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a dictionary maping auto generated ids (starting from 0) used by FAISS index, for later use." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "corpus_ids_map = {}\n", "for i in range(len(corpus)):\n", " corpus_ids_map[i] = corpus_ids[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Indexing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the embedding model to encode the queries and corpus:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "pre tokenize: 100%|██████████| 21/21 [00:00<00:00, 46.18it/s]\n", "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n", "Attempting to cast a BatchEncoding to type None. This is not supported.\n", "/share/project/xzy/Envs/ft/lib/python3.11/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. 
Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml\n", " warnings.warn(\n", "Inference Embeddings: 0%| | 0/21 [00:00