BGE-M3#
0. Installation#
Install the required packages in your environment.
%%capture
%pip install -U transformers FlagEmbedding accelerate
1. BGE-M3 structure#
from transformers import AutoTokenizer, AutoModel
import torch, os
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
raw_model = AutoModel.from_pretrained("BAAI/bge-m3")
The base model of BGE-M3 is XLM-RoBERTa-large, which is a multilingual version of RoBERTa.
raw_model.eval()
XLMRobertaModel(
(embeddings): XLMRobertaEmbeddings(
(word_embeddings): Embedding(250002, 1024, padding_idx=1)
(position_embeddings): Embedding(8194, 1024, padding_idx=1)
(token_type_embeddings): Embedding(1, 1024)
(LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): XLMRobertaEncoder(
(layer): ModuleList(
(0-23): 24 x XLMRobertaLayer(
(attention): XLMRobertaAttention(
(self): XLMRobertaSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): XLMRobertaSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): XLMRobertaIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): XLMRobertaOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): XLMRobertaPooler(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(activation): Tanh()
)
)
2. Multi-Functionality#
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 240131.91it/s]
2.1 Dense Retrieval#
Using BGE M3 for dense embedding has similar steps to BGE or BGE 1.5 models.
Use the normalized hidden state of the special token [CLS] as the embedding:
$$e_q = norm(H_q[0])$$
Then compute the relevance score between the query and passage:
$$s_{dense}=f_{sim}(e_p, e_q)$$
where $e_p, e_q$ are the embedding vectors of passage and query, respectively.
$f_{sim}$ is the score function (such as inner product and L2 distance) for comupting two embeddings’ similarity.
# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.
embeddings_1 = model.encode(sentences_1, max_length=10)['dense_vecs']
embeddings_2 = model.encode(sentences_2, max_length=100)['dense_vecs']
# compute the similarity scores
s_dense = embeddings_1 @ embeddings_2.T
print(s_dense)
[[0.6259035 0.34749585]
[0.349868 0.6782462 ]]
2.2 Sparse Retrieval#
Set return_sparse
to true to make the model return sparse vector. If a term token appears multiple times in the sentence, we only retain its max weight.
BGE-M3 generates sparce embeddings by adding a linear layer and a ReLU activation function following the hidden states:
$$w_{qt} = \text{Relu}(W_{lex}^T H_q [i])$$
where $W_{lex}$ representes the weights of linear layer and $H_q[i]$ is the encoder’s output of the $i^{th}$ token.
output_1 = model.encode(sentences_1, return_sparse=True)
output_2 = model.encode(sentences_2, return_sparse=True)
# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
[{'What': 0.08362077, 'is': 0.081469566, 'B': 0.12964639, 'GE': 0.25186998, 'M': 0.17001738, '3': 0.26957875, '?': 0.040755156}, {'De': 0.050144322, 'fin': 0.13689369, 'ation': 0.045134712, 'of': 0.06342201, 'BM': 0.25167602, '25': 0.33353207}]
Based on the tokens’ weights of query and passage, the relevance score between them is computed by the joint importance of the co-existed terms within the query and passage:
$$s_{lex} = \sum_{t\in q\cap p}(w_{qt} * w_{pt})$$
where $w_{qt}, w_{pt}$ are the importance weights of each co-existed term $t$ in query and passage, respectively.
# compute the scores via lexical mathcing
s_lex_10_20 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
s_lex_10_21 = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][1])
print(s_lex_10_20)
print(s_lex_10_21)
0.19554448500275612
0.00880391988903284
2.3 Multi-Vector#
The multi-vector method utilizes the entire output embeddings for the representation of query $E_q$ and passage $E_p$.
$$E_q = norm(W_{mul}^T H_q)$$ $$E_p = norm(W_{mul}^T H_p)$$
where $W_{mul}$ is the learnable projection matrix.
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)
print(f"({len(output_1['colbert_vecs'][0])}, {len(output_1['colbert_vecs'][0][0])})")
print(f"({len(output_2['colbert_vecs'][0])}, {len(output_2['colbert_vecs'][0][0])})")
(8, 1024)
(30, 1024)
Following ColBert, we use late-interaction to compute the fine-grained relevance score:
$$s_{mul}=\frac{1}{N}\sum_{i=1}^N\max_{j=1}^M E_q[i]\cdot E_p^T[j]$$
where $E_q, E_p$ are the entire output embeddings of query and passage, respectively.
This is a summation of average of maximum similarity of each $v\in E_q$ with vectors in $E_p$
s_mul_10_20 = model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]).item()
s_mul_10_21 = model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]).item()
print(s_mul_10_20)
print(s_mul_10_21)
0.7796662449836731
0.4621177911758423
2.4 Hybrid Ranking#
BGE-M3’s multi-functionality gives the possibility of hybrid ranking to improve retrieval. Firstly, due to the heavy cost of multi-vector method, we can retrieve the candidate results by either of the dense or sparse method. Then, to get the final result, we can rerank the candidates based on the integrated relevance score:
$$s_{rank} = w_1\cdot s_{dense}+w_2\cdot s_{lex} + w_3\cdot s_{mul}$$
where the values chosen for $w_1, w_2$ and $w_3$ varies depending on the downstream scenario (here 1/3 is just for demonstration).
s_rank_10_20 = 1/3 * s_dense[0][0] + 1/3 * s_lex_10_20 + 1/3 * s_mul_10_20
s_rank_10_21 = 1/3 * s_dense[0][1] + 1/3 * s_lex_10_21 + 1/3 * s_mul_10_21
print(s_rank_10_20)
print(s_rank_10_21)
0.5337047390639782
0.27280585498859483