Intro to Embedding#
For text retrieval, pattern matching is the most intuitive approach: people look for certain characters, words, phrases, or sentence patterns. However, it is not only tedious for humans; it is also extremely inefficient for a computer to pattern-match a query against a collection of text files to find possible results.
Images have RGB pixels and acoustic waves have digital signals. Similarly, to accomplish more sophisticated natural language tasks such as retrieval, classification, clustering, or semantic search, we need a way to represent text data numerically. That’s where text embedding comes onto the stage.
1. Background#
Traditional text embedding methods like one-hot encoding and bag-of-words (BoW) represent words and sentences as sparse vectors based on statistical features, such as word appearance and frequency within a document. More advanced methods like TF-IDF and BM25 improve on these by considering a word’s importance across an entire corpus, while n-gram techniques capture word order within small groups of words. However, these approaches suffer from the “curse of dimensionality” and fail to capture semantics: the similarity between “cat” and “kitty”, or the difference between “play the watch” and “watch the play”.
# example of bag-of-words
sentence1 = "I love basketball"
sentence2 = "I have a basketball match"
# vocabulary collected from both sentences
words = ['I', 'love', 'basketball', 'have', 'a', 'match']
# each sentence becomes a vector of word occurrences over the vocabulary
sen1_vec = [1, 1, 1, 0, 0, 0]
sen2_vec = [1, 0, 1, 1, 1, 1]
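As a quick illustration of the word-order problem (a hypothetical sketch, not part of the original example), the two phrases below get identical bag-of-words vectors, while 2-grams still tell them apart:
# bag-of-words loses word order: both phrases map to the same vector
phrase1 = "play the watch"
phrase2 = "watch the play"
vocab = ['play', 'the', 'watch']
vec1 = [1, 1, 1]  # from "play the watch"
vec2 = [1, 1, 1]  # from "watch the play" -- identical, order is lost
# 2-grams keep local word order and can distinguish the phrases
bigrams1 = ['play the', 'the watch']
bigrams2 = ['watch the', 'the play']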
To overcome these limitations, dense word embeddings were developed, mapping words to vectors in a low-dimensional space that captures semantic and relational information. Early models like Word2Vec demonstrated the power of dense embeddings using neural networks. Subsequent advances in neural network architectures such as RNNs, LSTMs, and Transformers have enabled more sophisticated models like BERT, RoBERTa, and GPT to excel at capturing complex word relationships and contexts. BAAI General Embedding (BGE) provides a series of open-source models that cover all kinds of demands.
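To get a feel for what dense embeddings capture, here is a minimal sketch using pre-trained GloVe word vectors through gensim (assuming gensim is installed; the model name and word pairs are only illustrative):
import gensim.downloader as api
# download a small set of pre-trained 50-dimensional GloVe word vectors
wv = api.load("glove-wiki-gigaword-50")
# semantically related words end up close together in the vector space
print(wv.similarity("cat", "kitten"))
# vector arithmetic captures relations: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))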
2. Get Embedding#
The first step of modern text retrieval is embedding the text. So let’s take a look at how to use the embedding models.
Install the packages:
%%capture
%pip install -U FlagEmbedding sentence_transformers openai voyageai
We’ll use the following three sentences as the inputs:
sentences = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
Open-source Models#
A huge portion of embedding models come from the open-source community. The advantages of open-source models include:
Free, with no extra cost; just make sure to check the license and your use case before using a model.
No rate limits; inference can be accelerated considerably if you have enough GPUs to parallelize.
Transparent, and possibly reproducible.
Let’s take a look at two representatives:
BGE#
BGE is a series of embedding models and rerankers published by BAAI. Several of them reached SOTA performance at the time they were released.
from FlagEmbedding import FlagModel
# Load BGE model
model = FlagModel('BAAI/bge-base-en-v1.5')
# encode the sentences into embeddings
embeddings = model.encode(sentences)
print(f"Embeddings:\n{embeddings.shape}")
# the embeddings are normalized, so the inner product gives cosine similarity
scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
Embeddings:
(3, 768)
Similarity scores:
[[1. 0.7900386 0.57525384]
[0.7900386 0.9999998 0.59190154]
[0.57525384 0.59190154 0.99999994]]
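For retrieval, the BGE v1.5 models are typically used with an instruction prepended to the query side only. Here is a minimal sketch using FlagModel’s encode_queries; the instruction string is the one suggested for the English v1.5 models, and the query text is just an example:
# encode queries with the retrieval instruction, passages without it
retrieval_model = FlagModel(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)
q_embeddings = retrieval_model.encode_queries(["Is it sunny today?"])
p_embeddings = retrieval_model.encode(sentences)
print(q_embeddings @ p_embeddings.T)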
Sentence Transformers#
Sentence Transformers is a library for sentence embeddings that offers a huge number of embedding models and datasets for related tasks.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embeddings:\n{embeddings.shape}")
scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
Embeddings:
(3, 384)
Similarity scores:
[[0.99999976 0.6210502 0.24906276]
[0.6210502 0.9999997 0.21061528]
[0.24906276 0.21061528 0.9999999 ]]
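Sentence Transformers also ships small utilities for computing similarity, so you don’t have to write the matrix product yourself. A minimal sketch using util.cos_sim (the query text is just an example):
from sentence_transformers import util
# cos_sim returns a tensor of pairwise cosine similarities
query_embedding = model.encode("Is it sunny today?", normalize_embeddings=True)
print(util.cos_sim(query_embedding, embeddings))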
Commercial Models#
There are also plenty of commercial models to choose from. Their advantages include:
Efficient memory usage and fast inference, with no need for GPUs on your side.
Systematic support: commercial models integrate closely with the vendors’ other products.
Better training data: commercial models might be trained on larger, higher-quality datasets than some open-source models.
OpenAI#
Along with the GPT series, OpenAI has its own embedding models. Make sure to fill in your own API key in the field "YOUR_API_KEY".
import os
import numpy as np
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
Then run the following cells to get the embeddings. Check their official documentation for more details.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(input=sentences, model="text-embedding-3-small")
embeddings = np.asarray([response.data[i].embedding for i in range(len(sentences))])
print(f"Embeddings:\n{embeddings.shape}")
scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
Embeddings:
(3, 1536)
Similarity scores:
[[1.00000004 0.697673 0.34739798]
[0.697673 1.00000005 0.31969923]
[0.34739798 0.31969923 0.99999998]]
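The text-embedding-3 models also accept a dimensions parameter to return shorter vectors, trading a bit of accuracy for lower storage cost. A minimal sketch (the value 256 is just an example):
# request shortened embeddings from the same model
response = client.embeddings.create(
    input=sentences, model="text-embedding-3-small", dimensions=256
)
short_embeddings = np.asarray([d.embedding for d in response.data])
print(short_embeddings.shape)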
Voyage AI#
Voyage AI provides embedding models and rerankers for different purposes and in various fields. Their API can be used for free within certain rate and token limits.
os.environ["VOYAGE_API_KEY"] = "YOUR_API_KEY"
Check their official documentation for more details.
import voyageai
vo = voyageai.Client()
result = vo.embed(sentences, model="voyage-large-2-instruct")
embeddings = np.asarray(result.embeddings)
print(f"Embeddings:\n{embeddings.shape}")
scores = embeddings @ embeddings.T
print(f"Similarity scores:\n{scores}")
Embeddings:
(3, 1024)
Similarity scores:
[[0.99999997 0.87282517 0.63276503]
[0.87282517 0.99999998 0.64720015]
[0.63276503 0.64720015 0.99999999]]
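For retrieval with Voyage models, queries and documents can be embedded with different input_type values so the model applies the appropriate prompt on each side. A minimal sketch (the query text is just an example):
# mark queries and documents explicitly for retrieval
q_result = vo.embed(["Is it sunny today?"], model="voyage-large-2-instruct", input_type="query")
d_result = vo.embed(sentences, model="voyage-large-2-instruct", input_type="document")
q_emb = np.asarray(q_result.embeddings)
d_emb = np.asarray(d_result.embeddings)
print(q_emb @ d_emb.T)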