BGE Series#

In this part, we will walk through the BGE series and introduce how to use the BGE embedding models.

1. BAAI General Embedding#

BGE stands for BAAI General Embedding, a series of embedding models developed and published by the Beijing Academy of Artificial Intelligence (BAAI).

Full support for the APIs and related usage of BGE is maintained in FlagEmbedding on GitHub.

Run the following cell to install FlagEmbedding in your environment.

%%capture
%pip install -U FlagEmbedding

The collection of BGE models can be found in the Hugging Face collection.

2. BGE Series Models#

2.1 BGE#

The very first version of BGE has 6 models, with ‘large’, ‘base’, and ‘small’ variants for both English and Chinese.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-large-en | English | 335M | 1.34 GB | embedding model which maps text into vectors | BERT |
| BAAI/bge-base-en | English | 109M | 438 MB | a base-scale model but with similar ability to bge-large-en | BERT |
| BAAI/bge-small-en | English | 33.4M | 133 MB | a small-scale model but with competitive performance | BERT |
| BAAI/bge-large-zh | Chinese | 326M | 1.3 GB | embedding model which maps text into vectors | BERT |
| BAAI/bge-base-zh | Chinese | 102M | 409 MB | a base-scale model but with similar ability to bge-large-zh | BERT |
| BAAI/bge-small-zh | Chinese | 24M | 95.8 MB | a small-scale model but with competitive performance | BERT |

For inference, import FlagModel from FlagEmbedding and initialize the model.

from FlagEmbedding import FlagModel

# Load BGE model
model = FlagModel('BAAI/bge-base-en',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
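FlagModel normalizes the embeddings by default, so the inner product above behaves like cosine similarity. A minimal check of this, reusing q_embeddings from the cell above:

import numpy as np

# FlagModel L2-normalizes embeddings by default, so every vector has unit norm
# and q_embeddings @ p_embeddings.T is the cosine similarity matrix
print(np.linalg.norm(q_embeddings, axis=1))  # expected to be ~1.0 for each query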

To use FlagModel:

FlagModel.encode(sentences, batch_size=256, max_length=512, convert_to_numpy=True)

The encode() function directly encodes the input sentences into embedding vectors.

FlagModel.encode_queries(sentences, batch_size=256, max_length=512, convert_to_numpy=True)

The encode_queries() function concatenates the query_instruction_for_retrieval with each input query, and then calls encode().
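In other words, encode_queries() is roughly equivalent to prepending the instruction to each query yourself. A minimal sketch, reusing the model and queries from the example above:

# encode_queries() prepends query_instruction_for_retrieval to each query,
# so these two calls should produce (nearly) identical embeddings
instruction = "Represent this sentence for searching relevant passages:"
q_emb_auto = model.encode_queries(queries)
q_emb_manual = model.encode([instruction + q for q in queries])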

2.2 BGE v1.5#

BGE v1.5 alleviates the issue of the similarity distribution and enhances retrieval ability without instructions.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-large-en-v1.5 | English | 335M | 1.34 GB | version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-base-en-v1.5 | English | 109M | 438 MB | version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-small-en-v1.5 | English | 33.4M | 133 MB | version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-large-zh-v1.5 | Chinese | 326M | 1.3 GB | version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-base-zh-v1.5 | Chinese | 102M | 409 MB | version 1.5 with more reasonable similarity distribution | BERT |
| BAAI/bge-small-zh-v1.5 | Chinese | 24M | 95.8 MB | version 1.5 with more reasonable similarity distribution | BERT |

BGE v1.5 models share the same FlagModel API as the original BGE models.

model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode(queries)
p_embeddings = model.encode(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
[[0.736794  0.5989914]
 [0.5684842 0.7461165]]
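Since v1.5 enhances retrieval ability without instructions, queries can also be encoded with plain encode(). A minimal sketch reusing the variables above (the resulting scores will differ slightly from the instructed version):

# v1.5 retrieves reasonably well even without the query instruction
q_embeddings_no_inst = model.encode(queries)
print(q_embeddings_no_inst @ p_embeddings.T)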

2.3 LLM-Embedder#

LLM-Embedder is a unified embedding model supporting diverse retrieval augmentation needs for LLMs. It is fine-tuned over 6 tasks:

  • Question Answering (qa)

  • Conversational Search (convsearch)

  • Long Conversation (chat)

  • Long-Range Language Modeling (lrlm)

  • In-Context Learning (icl)

  • Tool Learning (tool)

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/llm-embedder | English | 109M | 438 MB | a unified embedding model to support diverse retrieval augmentation needs for LLMs | BERT |

To use LLMEmbedder:

LLMEmbedder.encode_queries(
    queries, 
    batch_size=256, 
    max_length=256, 
    task='qa'
)

The encode_queries() function calls the internal _encode() function (similar to encode() in FlagModel) and prepends the corresponding query instruction of the given task to each input query.

LLMEmbedder.encode_keys(
    keys, 
    batch_size=256, 
    max_length=512, 
    task='qa'
)

Similarly, encode_keys() also calls _encode() and automatically adds instructions according to the given task.

from FlagEmbedding import LLMEmbedder

# load the LLMEmbedder model
model = LLMEmbedder('BAAI/llm-embedder', use_fp16=False)

# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]

# Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)
task = "qa"
query_embeddings = model.encode_queries(queries, task=task)
key_embeddings = model.encode_keys(keys, task=task)

# compute the similarity scores
similarity = query_embeddings @ key_embeddings.T
print(similarity)
[[0.89705944 0.85341793]
 [0.8462474  0.90914035]]
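Switching the task changes the instruction that LLMEmbedder prepends internally. A minimal sketch reusing the model, queries, and keys above:

# encoding the same inputs under a different task, e.g. in-context learning,
# prepends a different task-specific instruction and yields different embeddings
icl_query_embeddings = model.encode_queries(queries, task="icl")
icl_key_embeddings = model.encode_keys(keys, task="icl")
print(icl_query_embeddings @ icl_key_embeddings.T)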

2.4 BGE M3#

BGE-M3 is the new version of the BGE models, distinguished by its versatility in:

  • Multi-Functionality: Simultaneously performs the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval (see the scoring sketch at the end of this section).

  • Multi-Linguality: Supports more than 100 working languages.

  • Multi-Granularity: Can process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

For more details, feel free to check out the paper.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-m3 | Multilingual | 568M | 2.27 GB | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), Multi-Linguality, and Multi-Granularity (8192 tokens) | XLM-RoBERTa |

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "Defination of BM25"]

To use BGEM3FlagModel:

BGEM3FlagModel.encode(
    sentences, 
    batch_size=12, 
    max_length=8192, 
    return_dense=True, 
    return_sparse=False, 
    return_colbert_vecs=False
)

It returns a dictionary like:

{
    'dense_vecs': 'array of dense embeddings of the inputs if return_dense=True, otherwise None',
    'lexical_weights': 'array of dictionaries mapping token ids to their corresponding weights if return_sparse=True, otherwise None',
    'colbert_vecs': 'array of multi-vector embeddings of the inputs if return_colbert_vecs=True, otherwise None',
}
# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.
embeddings = model.encode(
    sentences, 
    max_length=10,
    return_dense=True, 
    return_sparse=True, 
    return_colbert_vecs=True
)
print(f"dense embedding:\n{embeddings['dense_vecs']}")
print(f"sparse embedding:\n{embeddings['lexical_weights']}")
print(f"multi-vector:\n{embeddings['colbert_vecs']}")
dense embedding:
[[-0.03411707 -0.04707828 -0.00089447 ...  0.04828531  0.00755427
  -0.02961654]
 [-0.01041734 -0.04479263 -0.02429199 ... -0.00819298  0.01503995
   0.01113793]]
sparse embedding:
[defaultdict(<class 'int'>, {'4865': 0.08362077, '83': 0.081469566, '335': 0.12964639, '11679': 0.25186998, '276': 0.17001738, '363': 0.26957875, '32': 0.040755156}), defaultdict(<class 'int'>, {'262': 0.050144322, '5983': 0.13689369, '2320': 0.045134712, '111': 0.06342201, '90017': 0.25167602, '2588': 0.33353207})]
multi-vector:
[array([[-8.6726490e-03, -4.8921868e-02, -3.0449261e-03, ...,
        -2.2082448e-02,  5.7268854e-02,  1.2811369e-02],
       [-8.8765034e-03, -4.6860173e-02, -9.5845405e-03, ...,
        -3.1404708e-02,  5.3911421e-02,  6.8714428e-03],
       [ 1.8445771e-02, -4.2359587e-02,  8.6754939e-04, ...,
        -1.9803897e-02,  3.8384371e-02,  7.6852231e-03],
       ...,
       [-2.5543230e-02, -1.6561864e-02, -4.2125367e-02, ...,
        -4.5030322e-02,  4.4091221e-02, -1.0043185e-02],
       [ 4.9905590e-05, -5.5475257e-02,  8.4884483e-03, ...,
        -2.2911752e-02,  6.0379632e-02,  9.3577225e-03],
       [ 2.5895271e-03, -2.9331330e-02, -1.8961012e-02, ...,
        -8.0389353e-03,  3.2842189e-02,  4.3894034e-02]], dtype=float32), array([[ 0.01715658,  0.03835309, -0.02311821, ...,  0.00146474,
         0.02993429, -0.05985384],
       [ 0.00996143,  0.039217  , -0.03855301, ...,  0.00599566,
         0.02722942, -0.06509776],
       [ 0.01777726,  0.03919311, -0.01709837, ...,  0.00805702,
         0.03988946, -0.05069073],
       ...,
       [ 0.05474931,  0.0075684 ,  0.00329455, ..., -0.01651684,
         0.02397249,  0.00368039],
       [ 0.0093503 ,  0.05022853, -0.02385841, ...,  0.02575599,
         0.00786822, -0.03260205],
       [ 0.01805054,  0.01337725,  0.00016697, ...,  0.01843987,
         0.01374448,  0.00310114]], dtype=float32)]
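To turn these three kinds of outputs into similarity scores, BGEM3FlagModel provides the helpers compute_lexical_matching_score() and colbert_score(). A minimal sketch reusing the embeddings above, treating the first sentence as the query and the second as the passage:

# dense score: inner product of the two dense vectors
dense_score = embeddings['dense_vecs'][0] @ embeddings['dense_vecs'][1].T

# sparse (lexical) score: weighted overlap of the tokens shared by both inputs
lexical_score = model.compute_lexical_matching_score(
    embeddings['lexical_weights'][0], embeddings['lexical_weights'][1]
)

# multi-vector (ColBERT) score: late interaction over token-level vectors
colbert_score = model.colbert_score(
    embeddings['colbert_vecs'][0], embeddings['colbert_vecs'][1]
)

print(dense_score, lexical_score, colbert_score)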