BGE Auto Embedder#

FlagEmbedding provides a high-level class, FlagAutoModel, that unifies inference across embedding models. Besides the BGE series, it also supports other popular open-source embedding models such as E5, GTE, and SFR. In this tutorial, we will show how to use it.

% pip install FlagEmbedding

1. Usage#

First, import FlagAutoModel from FlagEmbedding, and use the from_finetuned() function to initialize the model:

from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    devices="cuda:0",   # if not specified, use all available GPUs, or CPU when no GPU is available
)

Then use the model exactly as you would use FlagModel (BGEM3FlagModel if using BGE M3, FlagLLMModel if using BGE Multilingual Gemma2, FlagICLModel if using BGE ICL):

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
[[0.76   0.6714]
 [0.6177 0.7603]]
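The matrix product above works as a similarity score because BGE v1.5 models L2-normalize their embeddings, so the dot product of two embeddings equals their cosine similarity. A minimal sketch with toy vectors (random stand-ins for real model outputs) illustrates this:

```python
import numpy as np

# Toy "embeddings" standing in for model outputs.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))
p = rng.normal(size=(2, 8))

# Normalize each row to unit length, as the embedder does internally.
q = q / np.linalg.norm(q, axis=1, keepdims=True)
p = p / np.linalg.norm(p, axis=1, keepdims=True)

# Pairwise scores, shape (n_queries, n_passages).
scores = q @ p.T

# For unit vectors, the dot product is exactly the cosine similarity.
cos_00 = np.dot(q[0], p[0]) / (np.linalg.norm(q[0]) * np.linalg.norm(p[0]))
assert np.isclose(scores[0, 0], cos_00)
```

If a model does not normalize its outputs, divide by the norms explicitly (as in `cos_00` above) before comparing scores.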

2. Explanation#

FlagAutoModel uses an OrderedDict, AUTO_EMBEDDER_MAPPING, to store the configurations of all supported models:

from FlagEmbedding.inference.embedder.model_mapping import AUTO_EMBEDDER_MAPPING

list(AUTO_EMBEDDER_MAPPING.keys())
['bge-en-icl',
 'bge-multilingual-gemma2',
 'bge-m3',
 'bge-large-en-v1.5',
 'bge-base-en-v1.5',
 'bge-small-en-v1.5',
 'bge-large-zh-v1.5',
 'bge-base-zh-v1.5',
 'bge-small-zh-v1.5',
 'bge-large-en',
 'bge-base-en',
 'bge-small-en',
 'bge-large-zh',
 'bge-base-zh',
 'bge-small-zh',
 'e5-mistral-7b-instruct',
 'e5-large-v2',
 'e5-base-v2',
 'e5-small-v2',
 'multilingual-e5-large-instruct',
 'multilingual-e5-large',
 'multilingual-e5-base',
 'multilingual-e5-small',
 'e5-large',
 'e5-base',
 'e5-small',
 'gte-Qwen2-7B-instruct',
 'gte-Qwen2-1.5B-instruct',
 'gte-Qwen1.5-7B-instruct',
 'gte-multilingual-base',
 'gte-large-en-v1.5',
 'gte-base-en-v1.5',
 'gte-large',
 'gte-base',
 'gte-small',
 'gte-large-zh',
 'gte-base-zh',
 'gte-small-zh',
 'SFR-Embedding-2_R',
 'SFR-Embedding-Mistral',
 'Linq-Embed-Mistral']
print(AUTO_EMBEDDER_MAPPING['bge-en-icl'])
EmbedderConfig(model_class=<class 'FlagEmbedding.inference.embedder.decoder_only.icl.ICLLLMEmbedder'>, pooling_method=<PoolingMethod.LAST_TOKEN: 'last_token'>, trust_remote_code=False, query_instruction_format='<instruct>{}\n<query>{}')

Taking a look at the value of each key: it is an EmbedderConfig object, which consists of four attributes:

@dataclass
class EmbedderConfig:
    model_class: Type[AbsEmbedder]
    pooling_method: PoolingMethod
    trust_remote_code: bool = False
    query_instruction_format: str = "{}{}"
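To see what query_instruction_format does, here is a minimal illustration of how such a template is filled in (the embedder applies this formatting internally before encoding; the instruction and query strings below are made up):

```python
# bge-en-icl's query instruction format, as shown in the mapping above
query_instruction_format = "<instruct>{}\n<query>{}"

instruction = "Given a web search query, retrieve relevant passages that answer the query."
query = "what is a bge embedding model?"

# The first slot takes the instruction, the second takes the query itself.
formatted_query = query_instruction_format.format(instruction, query)
print(formatted_query)
```

The default format "{}{}" simply concatenates the instruction and the query with no extra markup.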

It supports not only the BGE series but also other models, such as E5, in the same way:

print(AUTO_EMBEDDER_MAPPING['e5-mistral-7b-instruct'])

3. Customization#

If you want to use your own models through FlagAutoModel, consider the following steps:

  1. Check the type of your embedding model and choose the appropriate model class: is it an encoder or a decoder model?

  2. What kind of pooling method does it use: CLS token, mean pooling, or last token?

  3. Does your model need trust_remote_code=True to run?

  4. Is there a query instruction format for retrieval?

Once these four attributes are determined, add your model name as the key and the corresponding EmbedderConfig as the value to AUTO_EMBEDDER_MAPPING. Now give it a try!
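The registration pattern can be sketched standalone, without FlagEmbedding installed. The classes below (PoolingMethod, AbsEmbedder, CustomEmbedder) and the model name "my-custom-model" are placeholders mimicking FlagEmbedding's types; in practice you would subclass the real AbsEmbedder and edit the mapping in FlagEmbedding.inference.embedder.model_mapping:

```python
from collections import OrderedDict
from dataclasses import dataclass
from enum import Enum
from typing import Type


# Stand-ins for FlagEmbedding's types, so this sketch runs on its own.
class PoolingMethod(Enum):
    CLS = "cls"
    MEAN = "mean"
    LAST_TOKEN = "last_token"


class AbsEmbedder:  # placeholder for FlagEmbedding's abstract embedder
    pass


@dataclass
class EmbedderConfig:
    model_class: Type[AbsEmbedder]
    pooling_method: PoolingMethod
    trust_remote_code: bool = False
    query_instruction_format: str = "{}{}"


# A hypothetical custom encoder model that uses CLS pooling.
class CustomEmbedder(AbsEmbedder):
    pass


# Register the model: name as the key, EmbedderConfig as the value.
AUTO_EMBEDDER_MAPPING = OrderedDict()
AUTO_EMBEDDER_MAPPING["my-custom-model"] = EmbedderConfig(
    model_class=CustomEmbedder,
    pooling_method=PoolingMethod.CLS,
    trust_remote_code=True,
    query_instruction_format="Instruct: {}\nQuery: {}",
)

print(AUTO_EMBEDDER_MAPPING["my-custom-model"])
```

The four constructor arguments correspond one-to-one to the four questions above.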