BGE Series
In this part, we will walk through the BGE series and show how to use the BGE embedding models.
1. BAAI General Embedding
BGE stands for BAAI General Embedding, a series of embedding models developed and published by the Beijing Academy of Artificial Intelligence (BAAI).
Full support for the APIs and related usage of BGE is maintained in FlagEmbedding on GitHub.
Run the following cell to install FlagEmbedding in your environment.
%%capture
%pip install -U FlagEmbedding
The collection of BGE models can be found in the Hugging Face collection.
2. BGE Series Models
2.1 BGE
The very first version of BGE has 6 models, with 'large', 'base', and 'small' variants for both English and Chinese.
| Model | Language | Parameters | Model Size | Description | Base Model |
|---|---|---|---|---|---|
| BAAI/bge-large-en | English | 335M | 1.34 GB | Embedding model that maps text into vectors | BERT |
| BAAI/bge-base-en | English | 109M | 438 MB | A base-scale model with ability similar to bge-large-en | BERT |
| BAAI/bge-small-en | English | 33.4M | 133 MB | A small-scale model with competitive performance | BERT |
| BAAI/bge-large-zh | Chinese | 326M | 1.3 GB | Embedding model that maps text into vectors | BERT |
| BAAI/bge-base-zh | Chinese | 102M | 409 MB | A base-scale model with ability similar to bge-large-zh | BERT |
| BAAI/bge-small-zh | Chinese | 24M | 95.8 MB | A small-scale model with competitive performance | BERT |
For inference, import FlagModel from FlagEmbedding and initialize the model.
from FlagEmbedding import FlagModel
# Load BGE model
model = FlagModel('BAAI/bge-base-en',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]
# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)
# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
To use FlagModel:

FlagModel.encode(sentences, batch_size=256, max_length=512, convert_to_numpy=True)

The encode() function directly encodes the input sentences into embedding vectors.

FlagModel.encode_queries(sentences, batch_size=256, max_length=512, convert_to_numpy=True)

The encode_queries() function concatenates the query_instruction_for_retrieval with each input query, and then calls encode().
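As a minimal sketch of the difference (reusing the model, queries, and corpus from the cell above), encode_queries() adds the instruction on the query side only, while passages are encoded as-is:

# encode_queries() prepends query_instruction_for_retrieval to each query;
# the corpus side is encoded without any instruction
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(corpus)
scores = q_embeddings @ p_embeddings.T
print(scores)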
2.2 BGE v1.5
BGE v1.5 alleviates the issue of the overly concentrated similarity distribution and enhances retrieval ability when no instruction is used.
| Model | Language | Parameters | Model Size | Description | Base Model |
|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | English | 335M | 1.34 GB | Version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-base-en-v1.5 | English | 109M | 438 MB | Version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-small-en-v1.5 | English | 33.4M | 133 MB | Version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-large-zh-v1.5 | Chinese | 326M | 1.3 GB | Version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-base-zh-v1.5 | Chinese | 102M | 409 MB | Version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-small-zh-v1.5 | Chinese | 24M | 95.8 MB | Version 1.5 with more reasonable similarity distribution | BERT |
BGE v1.5 models share the same FlagModel API as the original BGE models.
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)
queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]
# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)
# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
[[0.736794 0.5989914]
[0.5684842 0.7461165]]
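The score matrix can be turned into a ranking directly. A small follow-up sketch, assuming numpy is available (encode() returns numpy arrays by default):

import numpy as np

# each row of `scores` holds one query's similarities over the corpus
best_match = np.argmax(scores, axis=1)
for i, j in enumerate(best_match):
    print(f"query {i} best matches passage {j} with score {scores[i][j]:.4f}")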
2.3 LLM-Embedder
LLM-Embedder is a unified embedding model supporting diverse retrieval augmentation needs for LLMs. It is fine-tuned over 6 tasks:
Question Answering (qa)
Conversational Search (convsearch)
Long Conversation (chat)
Long-Range Language Modeling (lrlm)
In-Context Learning (icl)
Tool Learning (tool)
| Model | Language | Parameters | Model Size | Description | Base Model |
|---|---|---|---|---|---|
| BAAI/llm-embedder | English | 109M | 438 MB | A unified embedding model to support diverse retrieval augmentation needs for LLMs | BERT |
To use LLMEmbedder:
LLMEmbedder.encode_queries(
    queries,
    batch_size=256,
    max_length=256,
    task='qa'
)
The encode_queries() function calls _encode() (similar to encode() in FlagModel) and adds the query instruction corresponding to the given task in front of each input query.
LLMEmbedder.encode_keys(
    keys,
    batch_size=256,
    max_length=512,
    task='qa'
)
Similarly, encode_keys() also calls _encode() and automatically adds instructions according to the given task.
from FlagEmbedding import LLMEmbedder
# load the LLMEmbedder model
model = LLMEmbedder('BAAI/llm-embedder', use_fp16=False)
# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]
# Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)
task = "qa"
query_embeddings = model.encode_queries(queries, task=task)
key_embeddings = model.encode_keys(keys, task=task)
# compute the similarity scores
similarity = query_embeddings @ key_embeddings.T
print(similarity)
[[0.89705944 0.85341793]
[0.8462474 0.90914035]]
2.4 BGE M3
BGE-M3 is the new version of the BGE models, distinguished by its versatility in:
Multi-Functionality: Simultaneously performs the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
Multi-Linguality: Supports more than 100 working languages.
Multi-Granularity: Processes inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
For more details, feel free to check out the paper.
| Model | Language | Parameters | Model Size | Description | Base Model |
|---|---|---|---|---|---|
| BAAI/bge-m3 | Multilingual | 568M | 2.27 GB | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), Multi-Linguality, and Multi-Granularity (8192 tokens) | XLM-RoBERTa |
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
sentences = ["What is BGE M3?", "Defination of BM25"]
BGEM3FlagModel.encode(
    sentences,
    batch_size=12,
    max_length=8192,
    return_dense=True,
    return_sparse=False,
    return_colbert_vecs=False
)
It returns a dictionary like:
{
    'dense_vecs': 'array of dense embeddings of inputs if return_dense=True, otherwise None',
    'lexical_weights': 'array of dictionaries with token ids as keys and their corresponding weights as values if return_sparse=True, otherwise None',
    'colbert_vecs': 'array of multi-vector embeddings of inputs if return_colbert_vecs=True, otherwise None',
}
# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.
embeddings = model.encode(
sentences,
max_length=10,
return_dense=True,
return_sparse=True,
return_colbert_vecs=True
)
print(f"dense embedding:\n{embeddings['dense_vecs']}")
print(f"sparse embedding:\n{embeddings['lexical_weights']}")
print(f"multi-vector:\n{embeddings['colbert_vecs']}")
dense embedding:
[[-0.03411707 -0.04707828 -0.00089447 ... 0.04828531 0.00755427
-0.02961654]
[-0.01041734 -0.04479263 -0.02429199 ... -0.00819298 0.01503995
0.01113793]]
sparse embedding:
[defaultdict(<class 'int'>, {'4865': 0.08362077, '83': 0.081469566, '335': 0.12964639, '11679': 0.25186998, '276': 0.17001738, '363': 0.26957875, '32': 0.040755156}), defaultdict(<class 'int'>, {'262': 0.050144322, '5983': 0.13689369, '2320': 0.045134712, '111': 0.06342201, '90017': 0.25167602, '2588': 0.33353207})]
multi-vector:
[array([[-8.6726490e-03, -4.8921868e-02, -3.0449261e-03, ...,
-2.2082448e-02, 5.7268854e-02, 1.2811369e-02],
[-8.8765034e-03, -4.6860173e-02, -9.5845405e-03, ...,
-3.1404708e-02, 5.3911421e-02, 6.8714428e-03],
[ 1.8445771e-02, -4.2359587e-02, 8.6754939e-04, ...,
-1.9803897e-02, 3.8384371e-02, 7.6852231e-03],
...,
[-2.5543230e-02, -1.6561864e-02, -4.2125367e-02, ...,
-4.5030322e-02, 4.4091221e-02, -1.0043185e-02],
[ 4.9905590e-05, -5.5475257e-02, 8.4884483e-03, ...,
-2.2911752e-02, 6.0379632e-02, 9.3577225e-03],
[ 2.5895271e-03, -2.9331330e-02, -1.8961012e-02, ...,
-8.0389353e-03, 3.2842189e-02, 4.3894034e-02]], dtype=float32), array([[ 0.01715658, 0.03835309, -0.02311821, ..., 0.00146474,
0.02993429, -0.05985384],
[ 0.00996143, 0.039217 , -0.03855301, ..., 0.00599566,
0.02722942, -0.06509776],
[ 0.01777726, 0.03919311, -0.01709837, ..., 0.00805702,
0.03988946, -0.05069073],
...,
[ 0.05474931, 0.0075684 , 0.00329455, ..., -0.01651684,
0.02397249, 0.00368039],
[ 0.0093503 , 0.05022853, -0.02385841, ..., 0.02575599,
0.00786822, -0.03260205],
[ 0.01805054, 0.01337725, 0.00016697, ..., 0.01843987,
0.01374448, 0.00310114]], dtype=float32)]
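Each of the three outputs can be turned into a relevance score between the two sentences. A short sketch using two helpers provided by BGEM3FlagModel, compute_lexical_matching_score() for the sparse weights and colbert_score() for the multi-vectors (treat the exact names as assumptions if your FlagEmbedding version differs):

# dense score: inner product of the two dense vectors
dense_score = embeddings['dense_vecs'][0] @ embeddings['dense_vecs'][1].T

# sparse score: sum of weight products over tokens shared by both sentences
sparse_score = model.compute_lexical_matching_score(
    embeddings['lexical_weights'][0], embeddings['lexical_weights'][1]
)

# multi-vector score: ColBERT-style late interaction over token embeddings
colbert_score = model.colbert_score(
    embeddings['colbert_vecs'][0], embeddings['colbert_vecs'][1]
)

print(f"dense: {dense_score}, sparse: {sparse_score}, colbert: {colbert_score}")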