{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrieval Demo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will show how to use BGE models on a text retrieval task in 5 minutes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, install FlagEmbedding in the environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -U FlagEmbedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a super tiny courpus with only 10 sentences, which will be the dataset we use.\n", "\n", "Each sentence is a concise discription of a famous people in specific domain." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "corpus = [\n", " \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\",\n", " \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\",\n", " \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\",\n", " \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\",\n", " \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\",\n", " \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\",\n", " \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\",\n", " \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\",\n", " \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\",\n", " \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to know which one of these people could be an expert of neural network and who he/she is. \n", "\n", "Thus we generate the following query:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "query = \"Who could be an expert of neural network?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Text -> Embedding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's use a [BGE embedding model](https://huggingface.co/BAAI/bge-base-en-v1.5) to create sentence embedding for the corpus." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from FlagEmbedding import FlagModel\n", "\n", "# get the BGE embedding model\n", "model = FlagModel('BAAI/bge-base-en-v1.5',\n", " query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", " use_fp16=True)\n", "\n", "# get the embedding of the query and corpus\n", "corpus_embeddings = model.encode(corpus)\n", "query_embedding = model.encode(query)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The embedding of each sentence is a vector with length 768. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape of the query embedding: (768,)\n", "shape of the corpus embeddings: (10, 768)\n" ] } ], "source": [ "print(\"shape of the query embedding: \", query_embedding.shape)\n", "print(\"shape of the corpus embeddings:\", corpus_embeddings.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following print line to take a look at the first 10 elements of the query embedding vector." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-0.00790005 -0.00683443 -0.00806659 0.00756918 0.04374858 0.02838556\n", " 0.02357143 -0.02270943 -0.03611493 -0.03038301]\n" ] } ], "source": [ "print(query_embedding[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Calculate Similarity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we have the embeddings of the query and the corpus. The next step is to calculate the similarity between the query and each sentence in the corpus. Here we use the dot product/inner product as our similarity metric." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.39290053 0.6031525 0.32672375 0.6082418 0.39446455 0.35350388\n", " 0.4626108 0.40196604 0.5284606 0.36792332]\n" ] } ], "source": [ "sim_scores = query_embedding @ corpus_embeddings.T\n", "print(sim_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a list of score representing the query's similarity to: [sentence 0, sentence 1, sentence 2, ...]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we have the similarity score of the query to each sentence in the corpus, we can rank them from large to small." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 1, 8, 6, 7, 4, 0, 9, 5, 2]\n" ] } ], "source": [ "# get the indices in sorted order\n", "sorted_indices = sorted(range(len(sim_scores)), key=lambda k: sim_scores[k], reverse=True)\n", "print(sorted_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now from the ranking, the sentence with index 3 is the best answer to our query \"Who could be an expert of neural network?\"\n", "\n", "And that person is Geoffrey Hinton!" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\n" ] } ], "source": [ "print(corpus[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the order of indecies, we can print out the ranking of people that our little retriever got." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Score of 0.608: \"Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.\"\n", "Score of 0.603: \"Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.\"\n", "Score of 0.528: \"Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.\"\n", "Score of 0.463: \"Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.\"\n", "Score of 0.402: \"Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.\"\n", "Score of 0.394: \"Eminem is a renowned rapper and one of the best-selling music artists of all time.\"\n", "Score of 0.393: \"Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.\"\n", "Score of 0.368: \"Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.\"\n", "Score of 0.354: \"Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.\"\n", "Score of 0.327: \"Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'\"\n" ] } ], "source": [ "# iteratively print the score and corresponding sentences in descending order\n", "\n", "for i in sorted_indices:\n", " print(f\"Score of {sim_scores[i]:.3f}: \\\"{corpus[i]}\\\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the ranking, not surprisingly, the similarity scores of the query and the discriptions of Geoffrey Hinton and Fei-Fei Li is way higher than others, following by those of Andrew Ng and Sam Altman. \n", "\n", "While the key phrase \"neural network\" in the query does not appear in any of those discriptions, the BGE embedding model is still powerful enough to get the semantic meaning of query and corpus well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've seen the embedding model performed pretty well on the \"neural network\" query. What about the more general quality?\n", "\n", "Let's generate a very small dataset of queries and corresponding ground truth answers. Note that the ground truth answers are the indices of sentences in the corpus." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "queries = [\n", " \"Who could be an expert of neural network?\",\n", " \"Who might had won Grammy?\",\n", " \"Won Academy Awards\",\n", " \"One of the most famous female singers.\",\n", " \"Inventor of AlexNet\",\n", "]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "ground_truth = [\n", " [1, 3],\n", " [0, 4, 5],\n", " [2, 7, 9],\n", " [5],\n", " [3],\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we repeat the steps we covered above to get the predicted ranking of each query." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[3, 1, 8, 6, 7, 4, 0, 9, 5, 2],\n", " [5, 0, 3, 4, 1, 9, 7, 2, 6, 8],\n", " [3, 2, 7, 5, 9, 0, 1, 4, 6, 8],\n", " [5, 0, 4, 7, 1, 9, 2, 3, 6, 8],\n", " [3, 1, 8, 6, 0, 7, 5, 9, 4, 2]]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# use bge model to generate embeddings for all the queries\n", "queries_embedding = model.encode(queries)\n", "# compute similarity scores\n", "scores = queries_embedding @ corpus_embeddings.T\n", "# get he final rankings\n", "rankings = [sorted(range(len(sim_scores)), key=lambda k: sim_scores[k], reverse=True) for sim_scores in scores]\n", "rankings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean Reciprocal Rank ([MRR](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) is a widely used metric in information retrieval to evaluate the effectiveness of a system. Here we use that to have a very rough idea how our system performs." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def MRR(preds, labels, cutoffs):\n", " mrr = [0 for _ in range(len(cutoffs))]\n", " for pred, label in zip(preds, labels):\n", " for i, c in enumerate(cutoffs):\n", " for j, index in enumerate(pred):\n", " if j < c and index in label:\n", " mrr[i] += 1/(j+1)\n", " break\n", " mrr = [k/len(preds) for k in mrr]\n", " return mrr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We choose to use 1 and 5 as our cutoffs, with the result of 0.8 and 0.9 respectively." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MRR@1: 0.8\n", "MRR@5: 0.9\n" ] } ], "source": [ "cutoffs = [1, 5]\n", "mrrs = MRR(rankings, ground_truth, cutoffs)\n", "for i, c in enumerate(cutoffs):\n", " print(f\"MRR@{c}: {mrrs[i]}\")" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }