Similarity#

In this section, we will introduce several different ways to measure similarity.

1. Jaccard Similarity#

Before directly calculating the similarity between embedding vectors, let's first take a look at a simple, classical method for measuring how similar two sentences are: Jaccard similarity.

Definition: For sets $A$ and $B$, the Jaccard index, or Jaccard similarity coefficient, between them is the size of their intersection divided by the size of their union: $$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

The value of $J(A,B)$ falls in the range of $[0, 1]$.

def jaccard_similarity(sentence1, sentence2):
    set1 = set(sentence1.split(" "))
    set2 = set(sentence2.split(" "))
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection)/len(union)
s1 = "Hawaii is a wonderful place for holiday"
s2 = "Peter's favorite place to spend his holiday is Hawaii"
s3 = "Anna enjoys baking during her holiday"
jaccard_similarity(s1, s2)
0.3333333333333333
jaccard_similarity(s1, s3)
0.08333333333333333

We can see that sentences 1 and 2 share 'Hawaii', 'place', 'holiday', and 'is', giving them a higher similarity score (0.333) than sentences 1 and 3 (0.083), which only share 'holiday'.
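
We can confirm this by looking at the shared words directly:

print(set(s1.split(" ")) & set(s2.split(" ")))   # the four shared words: Hawaii, is, place, holiday
print(set(s1.split(" ")) & set(s3.split(" ")))   # only 'holiday' is shared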

2. Euclidean Distance#

import torch

A = torch.randint(1, 7, (1, 4), dtype=torch.float32)
B = torch.randint(1, 7, (1, 4), dtype=torch.float32)
print(A, B)
tensor([[5., 2., 2., 6.]]) tensor([[4., 6., 6., 4.]])

Definition: For vectors $A$ and $B$, the Euclidean distance or L2 distance between them is defined as: $$d(A, B) = |A-B|_2 = \sqrt{\sum_{i=1}^n (A_i-B_i)^2}$$

The value of $d(A, B)$ falls in the range of [0, $+\infty$). Since this is a measurement of distance, the closer the value is to 0, the more similar the two vectors are; the larger the value is, the more dissimilar they are.

You can calculate the Euclidean distance step by step or directly call torch.cdist():

dist = torch.sqrt(torch.sum(torch.pow(torch.subtract(A, B), 2), dim=-1))
dist.item()
6.082762718200684
torch.cdist(A, B, p=2).item()
6.082762718200684

3. Cosine Similarity#

For vectors $A$ and $B$, their cosine similarity is defined as: $$\cos(\theta)=\frac{A\cdot B}{|A||B|}$$

The value of $\cos(\theta)$ falls in the range of $[-1, 1]$. Unlike Euclidean distance, a value close to -1 means the two vectors are not similar at all, while a value close to +1 means they are very similar.

3.1 Naive Approach#

The naive approach is just expanding the expression: $$\frac{A\cdot B}{|A||B|}=\frac{\sum_{i=1}^{n}A_i B_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\cdot\sqrt{\sum_{i=1}^{n}B_i^2}}$$

# Compute the dot product of A and B
dot_prod = sum(a*b for a, b in zip(A[0], B[0]))

# Compute the magnitude of A and B
A_norm = torch.sqrt(sum(a*a for a in A[0]))
B_norm = torch.sqrt(sum(b*b for b in B[0]))
cos_1 = dot_prod / (A_norm * B_norm)
print(cos_1.item())
0.802726686000824

3.2 PyTorch Implementation#

The naive approach has a few issues:

  • There is a chance of losing precision in the numerator and the denominator

  • Losing precision may cause the computed cosine similarity to exceed 1.0

Thus PyTorch computes it in the following way:

$$ \frac{A\cdot B}{|A||B|}=\frac{A}{|A|}\cdot\frac{B}{|B|} $$

res = torch.mm(A / A.norm(dim=1), B.T / B.norm(dim=1))
print(res.item())
0.802726686000824

3.3 PyTorch Function Call#

In practice, the most convenient way is to directly use cosine_similarity() in torch.nn.functional:

import torch.nn.functional as F

F.cosine_similarity(A, B).item()
0.802726686000824
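
As a quick sanity check of the range described above, a vector and its negation point in exactly opposite directions, so their cosine similarity should be -1:

F.cosine_similarity(A, -A).item()   # -1.0 (up to floating point error)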

4. Inner Product/Dot Product#

Coordinate definition: $$A\cdot B = \sum_{i=1}^{n}A_i B_i$$

Geometric definition: $$A\cdot B = |A||B|\cos(\theta)$$

dot_prod = A @ B.T
dot_prod.item()
68.0
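
As a quick numerical check of the geometric definition, multiplying the two magnitudes by the cosine similarity computed earlier should reproduce the same value:

# |A| * |B| * cos(theta) should match the dot product above (68.0, up to floating point error)
(A.norm() * B.norm() * F.cosine_similarity(A, B)).item()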

Relationship with Cosine Similarity#

For computing the distance/similarity between two vectors, dot product and cosine similarity are closely related. Cosine similarity only cares about the difference in angle (because it is normalized by the product of the two vectors' magnitudes), while dot product takes both magnitude and angle into consideration. So the two metrics are preferred in different use cases.
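
For example, scaling one of the vectors changes the dot product but leaves the cosine similarity untouched:

A_scaled = 2 * A
print((A @ B.T).item(), (A_scaled @ B.T).item())                                   # dot product doubles
print(F.cosine_similarity(A, B).item(), F.cosine_similarity(A_scaled, B).item())   # cosine similarity unchanged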

The BGE series models already normalize the output embedding vectors to have a magnitude of 1. Thus using dot product and cosine similarity will give the same result.

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
sentence = "I am very interested in natural language processing"
embedding = torch.tensor(model.encode(sentence))
torch.norm(embedding).item()
1.0
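
Since the embeddings have unit norm, the dot product of two embeddings equals their cosine similarity. As a quick check (the second sentence below is just an arbitrary example added for illustration):

sentence_b = "Natural language processing is a fascinating field"  # arbitrary second sentence for illustration
embedding_b = torch.tensor(model.encode(sentence_b))

dot = (embedding @ embedding_b).item()
cos = F.cosine_similarity(embedding.view(1, -1), embedding_b.view(1, -1)).item()
print(dot, cos)  # the two values should match up to floating point error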

5. Examples#

Now that we've learned the mechanisms of these different similarity measures, let's look at a real example.

sentence_1 = "I will watch a show tonight"
sentence_2 = "I will show you my watch tonight"
sentence_3 = "I'm going to enjoy a performance this evening"

It's clear to us that in sentence 1, 'watch' is a verb and 'show' is a noun.

But in sentence 2, 'show' is a verb and 'watch' is a noun, which gives the two sentences very different meanings.

Sentence 3, on the other hand, has a meaning very similar to sentence 1.

Now let's see how the different similarity metrics describe the relationships between these sentences.

print(jaccard_similarity(sentence_1, sentence_2))
print(jaccard_similarity(sentence_1, sentence_3))
0.625
0.07692307692307693

The results show that sentences 1 and 2 (0.625) are scored as far more similar than sentences 1 and 3 (0.077), which is the opposite of the conclusion we just reached.

Now let’s first get the embeddings of these sentences.

embeddings = torch.from_numpy(model.encode([sentence_1, sentence_2, sentence_3]))
embedding_1 = embeddings[0].view(1, -1)
embedding_2 = embeddings[1].view(1, -1)
embedding_3 = embeddings[2].view(1, -1)

print(embedding_1.shape)
torch.Size([1, 1024])

Then let’s compute the Euclidean distance:

euc_dist1_2 = torch.cdist(embedding_1, embedding_2, p=2).item()
euc_dist1_3 = torch.cdist(embedding_1, embedding_3, p=2).item()
print(euc_dist1_2)
print(euc_dist1_3)
0.714613139629364
0.5931472182273865

Then, let’s see the cosine similarity:

cos_dist1_2 = F.cosine_similarity(embedding_1, embedding_2).item()
cos_dist1_3 = F.cosine_similarity(embedding_1, embedding_3).item()
print(cos_dist1_2)
print(cos_dist1_3)
0.7446640729904175
0.8240882158279419

Using embeddings, both Euclidean distance and cosine similarity give the result we expect, unlike Jaccard similarity: sentences 1 and 3 are more similar than sentences 1 and 2.