M3Embedder
- class FlagEmbedding.inference.embedder.encoder_only.m3.M3Embedder(model_name_or_path: str, normalize_embeddings: bool = True, use_fp16: bool = True, query_instruction_for_retrieval: str | None = None, query_instruction_format: str = '{}{}', devices: None | str | List[str] = None, pooling_method: str = 'cls', trust_remote_code: bool = False, cache_dir: str | None = None, colbert_dim: int = -1, batch_size: int = 256, query_max_length: int = 512, passage_max_length: int = 512, return_dense: bool = True, return_sparse: bool = False, return_colbert_vecs: bool = False, **kwargs: Any)
Embedder class for BGE-M3.
- Parameters:
model_name_or_path (str) – If this is a path to a local model, the model is loaded from that path. Otherwise, the model with this name is downloaded from the HuggingFace Hub and loaded.
normalize_embeddings (bool, optional) – If True, normalize the dense embedding vector. Defaults to True.
use_fp16 (bool, optional) – If True, use half-precision floating point to speed up computation, at the cost of a slight drop in accuracy. Defaults to True.
query_instruction_for_retrieval (Optional[str], optional) – Query instruction for retrieval tasks, which will be used with query_instruction_format. Defaults to None.
query_instruction_format (str, optional) – The template for query_instruction_for_retrieval. Defaults to "{}{}".
devices (Optional[Union[str, int, List[str], List[int]]], optional) – Devices to use for model inference. Defaults to None.
pooling_method (str, optional) – Pooling method used to get the embedding vector from the last hidden state. Defaults to "cls".
trust_remote_code (bool, optional) – Value of trust_remote_code used when loading HF datasets or models. Defaults to False.
cache_dir (Optional[str], optional) – Cache directory for the model. Defaults to None.
colbert_dim (int, optional) – Output dimension of the ColBERT linear layer. If -1, the hidden_size is used. Defaults to -1.
batch_size (int, optional) – Batch size for inference. Defaults to 256.
query_max_length (int, optional) – Maximum length for a query. Defaults to 512.
passage_max_length (int, optional) – Maximum length for a passage. Defaults to 512.
return_dense (bool, optional) – If True, return the dense embedding. Defaults to True.
return_sparse (bool, optional) – If True, return the sparse embedding. Defaults to False.
return_colbert_vecs (bool, optional) – If True, return the ColBERT vectors. Defaults to False.
- DEFAULT_POOLING_METHOD
The default pooling method when running the model.
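To illustrate how query_instruction_for_retrieval and query_instruction_format might fit together, here is a minimal sketch. It assumes the template is filled with the instruction first and the query second, which the page does not state explicitly. The helper name `apply_query_instruction` and the instruction strings are hypothetical and are not part of the library.

```python
def apply_query_instruction(instruction_format: str, instruction: str, query: str) -> str:
    """Fill the template; assumed order: instruction first, query second."""
    return instruction_format.format(instruction, query)


# Hypothetical instruction text, for illustration only.
instruction = "Represent this sentence for searching relevant passages: "
query = "what is BGE-M3?"

# With the default template "{}{}" the instruction is simply prepended.
default_style = apply_query_instruction("{}{}", instruction, query)

# A custom template can add separators or labels around both parts.
custom_style = apply_query_instruction(
    "Instruct: {}\nQuery: {}", "Find relevant passages.", query
)

print(default_style)
print(custom_style)
```

Under this assumption, the default template leaves spacing entirely to the instruction string, so the instruction should end with whatever separator is wanted.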
Methods
- M3Embedder.encode_queries(queries: List[str] | str, batch_size: int | None = None, max_length: int | None = None, return_dense: bool | None = None, return_sparse: bool | None = None, return_colbert_vecs: bool | None = None, **kwargs: Any) -> Dict[Literal['dense_vecs', 'lexical_weights', 'colbert_vecs'], ndarray | List[Dict[str, float]] | List[ndarray]]
Encode the queries into the requested representations (dense, sparse, and/or ColBERT vectors).
- Parameters:
queries (Union[List[str], str]) – The input queries to encode.
batch_size (Optional[int], optional) – Number of sentences per iteration. Defaults to None.
max_length (Optional[int], optional) – Maximum length in tokens. Defaults to None.
return_dense (Optional[bool], optional) – If True, compute and return the dense embedding. Defaults to None.
return_sparse (Optional[bool], optional) – If True, compute and return the sparse embedding. Defaults to None.
return_colbert_vecs (Optional[bool], optional) – If True, compute and return the ColBERT vectors. Defaults to None.
- Returns:
Dict[Literal["dense_vecs", "lexical_weights", "colbert_vecs"], Union[np.ndarray, List[Dict[str, float]], List[np.ndarray]]]
- M3Embedder.encode_corpus(corpus: List[str] | str, batch_size: int | None = None, max_length: int | None = None, return_dense: bool | None = None, return_sparse: bool | None = None, return_colbert_vecs: bool | None = None, **kwargs: Any) -> Dict[Literal['dense_vecs', 'lexical_weights', 'colbert_vecs'], ndarray | List[Dict[str, float]] | List[ndarray]]
Encode the corpus into the requested representations (dense, sparse, and/or ColBERT vectors).
- Parameters:
corpus (Union[List[str], str]) – The input corpus to encode.
batch_size (Optional[int], optional) – Number of sentences per iteration. Defaults to None.
max_length (Optional[int], optional) – Maximum length in tokens. Defaults to None.
return_dense (Optional[bool], optional) – If True, compute and return the dense embedding. Defaults to None.
return_sparse (Optional[bool], optional) – If True, compute and return the sparse embedding. Defaults to None.
return_colbert_vecs (Optional[bool], optional) – If True, compute and return the ColBERT vectors. Defaults to None.
- Returns:
Dict[Literal["dense_vecs", "lexical_weights", "colbert_vecs"], Union[np.ndarray, List[Dict[str, float]], List[np.ndarray]]]
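The two methods above are typically used together: queries go through encode_queries, passages through encode_corpus, and the dense vectors are compared. The sketch below shows one way this could look. It accepts any object exposing those two methods with the documented return shape, so it does not load a model itself. The helper names `dot` and `dense_similarity` are illustrative and not part of the library.

```python
from typing import Any, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product; works on lists and on rows of a NumPy array."""
    return float(sum(x * y for x, y in zip(a, b)))


def dense_similarity(embedder: Any, queries: List[str], passages: List[str]) -> List[List[float]]:
    """Return a len(queries) x len(passages) matrix of dense dot-product scores."""
    q_vecs = embedder.encode_queries(queries, return_dense=True)["dense_vecs"]
    p_vecs = embedder.encode_corpus(passages, return_dense=True)["dense_vecs"]
    return [[dot(q, p) for p in p_vecs] for q in q_vecs]


# Intended usage (not run here, since it downloads model weights):
#   from FlagEmbedding.inference.embedder.encoder_only.m3 import M3Embedder
#   embedder = M3Embedder("BAAI/bge-m3")
#   scores = dense_similarity(embedder, ["what is BGE-M3?"], ["BGE-M3 is an embedding model."])
```

When normalize_embeddings is True the dense vectors have unit length, so the dot product computed here equals cosine similarity.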
- M3Embedder.encode(sentences: List[str] | str, batch_size: int | None = None, max_length: int | None = None, return_dense: bool | None = None, return_sparse: bool | None = None, return_colbert_vecs: bool | None = None, **kwargs: Any) -> Dict[Literal['dense_vecs', 'lexical_weights', 'colbert_vecs'], ndarray | List[Dict[str, float]] | List[ndarray]]
Encode the sentences into the requested representations (dense, sparse, and/or ColBERT vectors).
- Parameters:
sentences (Union[List[str], str]) – The input sentences to encode.
batch_size (Optional[int], optional) – Number of sentences per iteration. Defaults to None.
max_length (Optional[int], optional) – Maximum length in tokens. Defaults to None.
return_dense (Optional[bool], optional) – If True, compute and return the dense embedding. Defaults to None.
return_sparse (Optional[bool], optional) – If True, compute and return the sparse embedding. Defaults to None.
return_colbert_vecs (Optional[bool], optional) – If True, compute and return the ColBERT vectors. Defaults to None.
- Returns:
Dict[Literal["dense_vecs", "lexical_weights", "colbert_vecs"], Union[np.ndarray, List[Dict[str, float]], List[np.ndarray]]]
- M3Embedder.convert_id_to_token(lexical_weights: List[Dict])
Convert the ids back to tokens.
- Parameters:
lexical_weights (List[Dict]) – A list of dictionaries of id & weights.
- Returns:
A list of dictionaries of tokens & weights.
- Return type:
List[Dict]
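As a rough picture of what this conversion does, the sketch below re-keys each weight dictionary from token ids to token strings. The real method presumably relies on the model's tokenizer. Here a plain dictionary `id_to_token` stands in for it, and both that mapping and the helper `ids_to_tokens` are hypothetical.

```python
from typing import Dict, List


def ids_to_tokens(
    lexical_weights: List[Dict[str, float]], id_to_token: Dict[str, str]
) -> List[Dict[str, float]]:
    """Replace each id key with its token string, keeping the weight unchanged."""
    converted = []
    for weights in lexical_weights:
        converted.append({id_to_token[token_id]: w for token_id, w in weights.items()})
    return converted


# Hypothetical vocabulary and weights, for illustration only.
vocab = {"101": "what", "2054": "is", "999": "BM25"}
readable = ids_to_tokens([{"101": 0.12, "999": 0.31}], vocab)
print(readable)
```

This simplified version assumes every id maps to a distinct token; if two ids decoded to the same string, the later weight would overwrite the earlier one.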
- M3Embedder.compute_lexical_matching_score(lexical_weights_1: Dict[str, float] | List[Dict[str, float]], lexical_weights_2: Dict[str, float] | List[Dict[str, float]]) -> ndarray | float
Compute the lexical matching score of two given sets of lexical weights.
- Parameters:
lexical_weights_1 (Union[Dict[str, float], List[Dict[str, float]]]) – First array of lexical weights.
lexical_weights_2 (Union[Dict[str, float], List[Dict[str, float]]]) – Second array of lexical weights.
- Returns:
The lexical matching scores computed across the two arrays of lexical weights.
- Return type:
Union[np.ndarray, float]
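A common way to define a lexical matching score between two weight dictionaries is to sum the products of the weights of the tokens they share. The sketch below implements that definition for a single pair. It is an assumption that the method follows this formula, and the helper `lexical_matching_score` and the sample weights are illustrative only.

```python
from typing import Dict


def lexical_matching_score(weights_1: Dict[str, float], weights_2: Dict[str, float]) -> float:
    """Sum weight products over tokens present in both dictionaries."""
    return float(
        sum(w * weights_2[token] for token, w in weights_1.items() if token in weights_2)
    )


# Hypothetical weights: only "BM25" appears in both dictionaries.
query_weights = {"what": 0.5, "BM25": 2.0}
passage_weights = {"BM25": 1.5, "is": 0.25}
score = lexical_matching_score(query_weights, passage_weights)
print(score)
```

Tokens that appear on only one side contribute nothing, so two texts with no shared tokens score zero under this definition.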
- M3Embedder.colbert_score(q_reps, p_reps)
Compute colbert scores of input queries and passages.
- Parameters:
q_reps (np.ndarray) – Multi-vector embeddings for queries.
p_reps (np.ndarray) – Multi-vector embeddings for passages/corpus.
- Returns:
Computed colbert scores.
- Return type:
torch.Tensor
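ColBERT-style scoring is usually described as late interaction: each query token vector is matched with its most similar passage token vector, and the per-token maxima are aggregated. The sketch below averages the maxima over the query tokens, for one query and one passage. It is written in plain Python and returns a float rather than a torch.Tensor. Treat the averaging step and the helper name `maxsim_score` as assumptions rather than the library's exact behavior.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return float(sum(x * y for x, y in zip(a, b)))


def maxsim_score(q_reps: List[Vector], p_reps: List[Vector]) -> float:
    """For each query token take its best passage-token similarity, then average."""
    best_per_query_token = [max(dot(q, p) for p in p_reps) for q in q_reps]
    return sum(best_per_query_token) / len(q_reps)


# Two query token vectors and three passage token vectors (toy values).
q = [[1.0, 0.0], [0.0, 1.0]]
p = [[0.5, 0.0], [0.0, 0.25], [0.5, 0.5]]
score = maxsim_score(q, p)
print(score)
```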
- M3Embedder.encode_single_device(sentences: List[str] | str, batch_size: int = 256, max_length: int = 512, return_dense: bool = True, return_sparse: bool = False, return_colbert_vecs: bool = False, device: str | None = None, **kwargs: Any)
Encode the input sentences on a single device.
- Parameters:
sentences (Union[List[str], str]) – The input sentences to encode.
batch_size (int, optional) – Number of sentences per iteration. Defaults to 256.
max_length (int, optional) – Maximum length in tokens. Defaults to 512.
return_dense (bool, optional) – If True, compute and return the dense embedding. Defaults to True.
return_sparse (bool, optional) – If True, compute and return the sparse embedding. Defaults to False.
return_colbert_vecs (bool, optional) – If True, compute and return the ColBERT vectors. Defaults to False.
device (Optional[str], optional) – The device to use. Defaults to None.
- Returns:
Dict[Literal["dense_vecs", "lexical_weights", "colbert_vecs"], Union[np.ndarray, List[Dict[str, float]], List[np.ndarray]]]
- M3Embedder.compute_score(sentence_pairs: List[Tuple[str, str]] | Tuple[str, str], batch_size: int | None = None, max_query_length: int | None = None, max_passage_length: int | None = None, weights_for_different_modes: None | List[float] = None, **kwargs: Any) -> Dict[Literal['colbert', 'sparse', 'dense', 'sparse+dense', 'colbert+sparse+dense'], List[float]]
Compute relevance scores for the sentence pairs under each scoring mode (dense, sparse, ColBERT, and their combinations).
- Parameters:
sentence_pairs (Union[List[Tuple[str, str]], Tuple[str, str]]) – Pairs of sentences to compute the score.
batch_size (Optional[int], optional) – Batch size for inference. Defaults to None.
max_query_length (Optional[int], optional) – Maximum length for a query. Defaults to None.
max_passage_length (Optional[int], optional) – Maximum length for a passage. Defaults to None.
weights_for_different_modes (Optional[List[float]], optional) – The weights for different methods. Defaults to None.
- Returns:
Dict[Literal["colbert", "sparse", "dense", "sparse+dense", "colbert+sparse+dense"], List[float]]
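The combined keys in the return type suggest that the per-mode scores are merged with weights_for_different_modes. The sketch below shows one plausible form, a weighted average, for a single sentence pair. The weight order (dense, sparse, ColBERT), the normalization by the sum of the weights, and the helper name `combine_scores` are all assumptions that this page does not confirm.

```python
from typing import Dict, Sequence


def combine_scores(
    dense: float,
    sparse: float,
    colbert: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> Dict[str, float]:
    """Weighted averages of per-mode scores; assumed weight order: dense, sparse, colbert."""
    w_dense, w_sparse, w_colbert = weights
    return {
        "colbert": colbert,
        "sparse": sparse,
        "dense": dense,
        "sparse+dense": (w_sparse * sparse + w_dense * dense) / (w_sparse + w_dense),
        "colbert+sparse+dense": (w_colbert * colbert + w_sparse * sparse + w_dense * dense)
        / (w_dense + w_sparse + w_colbert),
    }


# Toy scores for one (query, passage) pair with equal weights.
scores = combine_scores(dense=0.5, sparse=0.25, colbert=0.75)
print(scores["sparse+dense"], scores["colbert+sparse+dense"])
```

The documented method returns a list of floats per key, one entry per sentence pair; this sketch handles a single pair to keep the arithmetic visible.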
- M3Embedder.compute_score_multi_process(sentence_pairs: List[Tuple[str, str]], pool: Dict[Literal['input', 'output', 'processes'], Any], **kwargs)
- M3Embedder.compute_score_single_device(sentence_pairs: List[Tuple[str, str]] | Tuple[str, str], batch_size: int = 256, max_query_length: int = 512, max_passage_length: int = 512, weights_for_different_modes: List[float] | None = None, device: str | None = None, **kwargs: Any) -> Dict[Literal['colbert', 'sparse', 'dense', 'sparse+dense', 'colbert+sparse+dense'], List[float]]
Compute relevance scores for the sentence pairs under each scoring mode (dense, sparse, ColBERT, and their combinations) on a single device.
- Parameters:
sentence_pairs (Union[List[Tuple[str, str]], Tuple[str, str]]) – Pairs of sentences to compute the score.
batch_size (int, optional) – Batch size for inference. Defaults to 256.
max_query_length (int, optional) – Maximum length for a query. Defaults to 512.
max_passage_length (int, optional) – Maximum length for a passage. Defaults to 512.
weights_for_different_modes (Optional[List[float]], optional) – The weights for different methods. Defaults to None.
device (Optional[str], optional) – The device to use. Defaults to None.
- Returns:
Dict[Literal["colbert", "sparse", "dense", "sparse+dense", "colbert+sparse+dense"], List[float]]