BGE-VL#
BGE-VL is a series of multimodal retrieval models trained on MegaPairs.
BGE-VL includes lightweight CLIP-based models as well as more powerful LLaVA-NeXT-based MLLM models:
Model | Language | Parameters | Model Size | Description
---|---|---|---|---
BAAI/BGE-VL-base | English | 150M | 299 MB | Lightweight multimodal embedder for image and text
BAAI/BGE-VL-large | English | 428M | 855 MB | Large-scale multimodal embedder for image and text
BAAI/BGE-VL-MLLM-S1 | English | 7.57B | 15.14 GB | SOTA in composed image retrieval, trained on the MegaPairs dataset
BAAI/BGE-VL-MLLM-S2 | English | 7.57B | 15.14 GB | BGE-VL-MLLM-S1 fine-tuned for one epoch on the MMEB training set
BGE-VL-CLIP#
The base and large models are trained from CLIP-vit-base-patch16 and CLIP-vit-large-patch14, respectively. For composed image-text inputs, the model uses score fusion: it directly sums the outputs of the visual encoder and the text encoder to obtain the final embedding.
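To make the score-fusion idea concrete, here is a minimal sketch using a vanilla CLIP checkpoint from Hugging Face rather than the BGE-VL weights; the image path and prompt are placeholders reused from the example below, and the point is only to show the sum-then-normalize step.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustration only: a plain CLIP model, not the BGE-VL checkpoint
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
clip.eval()

image = Image.open("./assets/cir_query.png")  # placeholder path
text = "Make the background dark, as if the camera has taken the photo at night"

with torch.no_grad():
    img_emb = clip.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = clip.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# Score fusion: sum the visual and textual embeddings, then normalize,
# to obtain a single embedding for the composed image + text query
fused_query = F.normalize(img_emb + txt_emb, dim=-1)
print(fused_query.shape)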
Tip
Our code works well on transformers==4.45.2, and we recommend using this version.
You can easily use the BGE-VL-CLIP models with transformers:
import torch
from transformers import AutoModel

MODEL_NAME = "BAAI/BGE-VL-base"  # or "BAAI/BGE-VL-large"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)  # You must set trust_remote_code=True
model.set_processor(MODEL_NAME)
model.eval()

with torch.no_grad():
    # Encode the composed query: a reference image plus a modification instruction
    query = model.encode(
        images="./assets/cir_query.png",
        text="Make the background dark, as if the camera has taken the photo at night"
    )
    # Encode the candidate images
    candidates = model.encode(
        images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"]
    )

    # Similarity between the query and each candidate
    scores = query @ candidates.T
print(scores)
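To turn the raw score matrix into a retrieval result, you can simply rank the candidates by similarity. A small follow-up sketch, assuming encode returns one L2-normalized embedding per input so the dot products act as cosine similarities:

# scores has shape (1, num_candidates); pick the highest-scoring candidate
candidate_paths = ["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"]
best_idx = scores.argmax(dim=-1).item()
print("Best match:", candidate_paths[best_idx])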
BGE-VL-MLLM#
Multimodal large language models (MLLMs) incorporate a visual encoder, typically based on a vision transformer, into a large language model (LLM). This integration allows image tokens to be processed directly by the LLM. Consequently, MLLMs can effectively handle diverse multimodal inputs by converting any type of input into a sequence of tokens.
BGE-VL-MLLM builds upon LLaVA-NeXT (LLaVA 1.6). In both the training and inference stages, MMRet uses task-specific instructions for query inputs to improve generalization, aligning with standard practices in LLM-based embedding models. A typical multimodal query input is structured as follows:

\[\langle\text{instruct}\rangle\,\{task\_inst\}\ \ \langle\text{query}\rangle\,\{q_t\}\ \{q_i\}\ [\text{EOS}]\]

where \(\{task\_inst\}\) represents the task-specific instruction, \(\{q_t\}\) denotes the input query text, and \(\{q_i\}\) is the input query image. The normalized last hidden state of the [EOS] token in the MLLM is used as the embedding of any given input sequence.
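In code, this embedding rule amounts to taking the hidden state at the final ([EOS]) position and L2-normalizing it. The sketch below uses random tensors with made-up shapes purely to illustrate the indexing; the full example that follows applies the same two steps to real model outputs.

import torch
import torch.nn.functional as F

batch, seq_len, hidden = 2, 16, 4096            # made-up shapes for illustration
last_hidden_state = torch.randn(batch, seq_len, hidden)

eos_hidden = last_hidden_state[:, -1, :]        # hidden state at the [EOS] position
embedding = F.normalize(eos_hidden, dim=-1)     # unit-norm embedding
print(embedding.shape)                          # torch.Size([2, 4096])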
import torch
from transformers import AutoModel
from PIL import Image

MODEL_NAME = "BAAI/BGE-VL-MLLM-S1"

model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()
model.cuda()

with torch.no_grad():
    model.set_processor(MODEL_NAME)

    # Build the query input: task instruction + query text + query image
    query_inputs = model.data_process(
        text="Make the background dark, as if the camera has taken the photo at night",
        images="./assets/cir_query.png",
        q_or_c="q",
        task_instruction="Retrieve the target image that best meets the combined criteria by using both the provided image and the image retrieval instructions: "
    )
    # Candidate inputs are images only
    candidate_inputs = model.data_process(
        images=["./assets/cir_candi_1.png", "./assets/cir_candi_2.png"],
        q_or_c="c",
    )

    # Take the hidden state of the last ([EOS]) token as the embedding
    query_embs = model(**query_inputs, output_hidden_states=True)[:, -1, :]
    candi_embs = model(**candidate_inputs, output_hidden_states=True)[:, -1, :]

    # L2-normalize so the dot product is cosine similarity
    query_embs = torch.nn.functional.normalize(query_embs, dim=-1)
    candi_embs = torch.nn.functional.normalize(candi_embs, dim=-1)

    scores = torch.matmul(query_embs, candi_embs.T)
print(scores)
For more details, check out the MegaPairs repository.