LightweightLLMReranker#

class FlagEmbedding.inference.reranker.decoder_only.lightweight.LightweightLLMReranker(model_name_or_path: str, peft_path: str | None = None, use_fp16: bool = False, use_bf16: bool = False, query_instruction_for_rerank: str = 'A: ', query_instruction_format: str = '{}{}', passage_instruction_for_rerank: str = 'B: ', passage_instruction_format: str = '{}{}', cache_dir: str | None = None, trust_remote_code: bool = False, devices: str | List[str] | List[int] | None = None, cutoff_layers: List[int] | None = None, compress_layers: List[int] = [8], compress_ratio: int = 1, prompt: str | None = None, batch_size: int = 128, query_max_length: int | None = None, max_length: int = 512, normalize: bool = False, **kwargs: Any)[source]#

Base reranker class for lightweight decoder-only LLM rerankers.

Parameters:
  • model_name_or_path (str) – If it’s a path to a local model, it loads the model from the path. Otherwise tries to download and load a model from HuggingFace Hub with the name.

  • peft_path (Optional[str], optional) – Path to the PEFT config. Defaults to None.

  • use_fp16 (bool, optional) – If True, use half-precision floating-point (fp16) to speed up computation, at the cost of a slight loss of accuracy. Defaults to False.

  • use_bf16 (bool, optional) – Another half-precision floating-point format; use bf16 if your hardware supports it. Defaults to False.

  • query_instruction_for_rerank (str, optional) – Query instruction for retrieval tasks, which will be used with query_instruction_format. Defaults to "A: ".

  • query_instruction_format (str, optional) – The template for query_instruction_for_rerank. Defaults to "{}{}".

  • passage_instruction_for_rerank (str, optional) – Passage instruction for retrieval tasks, which will be used with passage_instruction_format. Defaults to "B: ".

  • passage_instruction_format (str, optional) – The template for passage_instruction_for_rerank. Defaults to "{}{}".

  • cache_dir (Optional[str], optional) – Cache directory for the model. Defaults to None.

  • trust_remote_code (bool, optional) – Whether to trust and execute remote code when loading the model. Defaults to False.

  • devices (Union[str, List[str], List[int]], optional) – Devices to use for model inference, such as ["cuda:0"] or ["0"]. Defaults to None.

  • cutoff_layers (Optional[List[int]]) – Pick which layers are used for computing the score. Defaults to None.

  • compress_layers (List[int], optional) – Choose the layers to compress. Defaults to [8].

  • compress_ratio (int, optional) – Ratio to compress the selected layers, supported ratios: [1, 2, 4, 8]. Defaults to 1.

  • prompt (Optional[str], optional) – Prompt for the specific task. Defaults to None.

  • batch_size (int, optional) – Batch size for inference. Defaults to 128.

  • query_max_length (Optional[int], optional) – Maximum length for queries. If not specified, it is set to 3/4 of max_length. Defaults to None.

  • max_length (int, optional) – Maximum length of passages. Defaults to 512.

  • normalize (bool, optional) – If True, use Sigmoid to normalize the results. Defaults to False.
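The instruction and format parameters above compose the final model inputs by simple string formatting. A minimal sketch of how the defaults combine (the formatting call itself is inferred from the "{}{}" templates, not shown explicitly in this reference):

```python
# Default instruction templates, as documented above.
query_instruction_for_rerank = "A: "
query_instruction_format = "{}{}"
passage_instruction_for_rerank = "B: "
passage_instruction_format = "{}{}"

query = "What is the capital of France?"
passage = "Paris is the capital and largest city of France."

# The template receives the instruction first, then the text.
formatted_query = query_instruction_format.format(query_instruction_for_rerank, query)
formatted_passage = passage_instruction_format.format(passage_instruction_for_rerank, passage)
# formatted_query   -> "A: What is the capital of France?"
# formatted_passage -> "B: Paris is the capital and largest city of France."
```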

Methods#

FlagEmbedding.inference.reranker.decoder_only.lightweight.LightweightLLMReranker.compute_score_single_gpu(sentence_pairs: List[Tuple[str, str]] | Tuple[str, str], batch_size: int | None = None, query_max_length: int | None = None, max_length: int | None = None, cutoff_layers: List[int] | None = None, compress_layer: List[int] | None = None, compress_layers: List[int] | None = None, compress_ratio: int | None = None, prompt: str | None = None, normalize: bool | None = None, device: str | None = None, **kwargs: Any)#

Compute the relevance scores using a single GPU.

Parameters:
  • sentence_pairs (Union[List[Tuple[str, str]], Tuple[str, str]]) – Input sentence pairs to compute scores.

  • batch_size (Optional[int], optional) – Number of inputs per batch. Defaults to None.

  • query_max_length (Optional[int], optional) – Maximum length of tokens of queries. Defaults to None.

  • max_length (Optional[int], optional) – Maximum length of tokens. Defaults to None.

  • cutoff_layers (Optional[List[int]], optional) – Pick which layers are used for computing the score. Defaults to None.

  • compress_layer (Optional[List[int]]) – Deprecated, use compress_layers instead. Defaults to None.

  • compress_layers (Optional[List[int]]) – Selected layers to compress. Defaults to None.

  • compress_ratio (Optional[int]) – Ratio to compress the selected layers, supported ratios: [1, 2, 4, 8]. Defaults to None.

  • prompt (Optional[str], optional) – Prompt for the specific task. Defaults to None.

  • normalize (Optional[bool], optional) – If True, use Sigmoid to normalize the results. Defaults to None.

  • device (Optional[str], optional) – Device to use for computation. Defaults to None.

Returns:

The computed scores.

Return type:

List[float]
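A usage sketch of the single-GPU scoring path. The checkpoint name, cutoff_layers value, and compress_ratio value are assumptions chosen for illustration (any compatible lightweight reranker checkpoint would do), and the call that loads the model is left commented out since it downloads weights. The sigmoid at the end illustrates what normalize=True does to raw scores:

```python
import math

sentence_pairs = [
    ("What is the capital of France?", "Paris is the capital of France."),
    ("What is the capital of France?", "The Eiffel Tower is a landmark."),
]

def rerank(pairs):
    """Sketch only: load the reranker and score pairs on one device."""
    # Import deferred so this file runs without FlagEmbedding installed.
    from FlagEmbedding.inference.reranker.decoder_only.lightweight import (
        LightweightLLMReranker,
    )
    reranker = LightweightLLMReranker(
        "BAAI/bge-reranker-v2.5-gemma2-lightweight",  # assumed checkpoint name
        use_bf16=True,
        cutoff_layers=[28],  # assumed: score from layer 28's outputs
        compress_ratio=2,    # supported ratios: 1, 2, 4, 8
    )
    return reranker.compute_score_single_gpu(pairs, normalize=True, device="cuda:0")

# scores = rerank(sentence_pairs)  # commented out: requires downloading weights

# With normalize=True, raw scores are mapped through a sigmoid into (0, 1).
# The mapping is monotonic, so the ranking order is unchanged:
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

raw = [2.5, -1.0]  # illustrative raw scores, not real model output
normalized = [sigmoid(s) for s in raw]
```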