LLM-Blender

We introduce LLM-Blender, a novel ensembling framework that attains consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). LLM-Blender filters out weaknesses through ranking and integrates strengths through generative fusion to enhance the capability of LLMs.
Our framework consists of two complementary modules, PairRanker and GenFuser, motivated by the observation that the best LLM can vary significantly from one example to another. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. GenFuser takes the top-ranked candidates, as determined by aggregating PairRanker's pairwise comparisons, and merges them into an improved output that capitalizes on their strengths and mitigates their weaknesses.
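As a minimal sketch of how pairwise comparisons could be aggregated into a ranking, the Python below counts wins for each candidate over all ordered pairs and keeps the top K. The `compare` callable is a hypothetical stand-in for PairRanker, and the win-count aggregation, the function names, and `k` are illustrative assumptions rather than the framework's exact procedure.

```python
from itertools import permutations
from typing import Callable, List


def rank_by_wins(
    candidates: List[str],
    compare: Callable[[str, str], bool],
) -> List[str]:
    """Order candidates by how many pairwise comparisons they win.

    `compare(a, b)` is a hypothetical stand-in for a pairwise ranker:
    it returns True when `a` is judged better than `b`.
    """
    wins = [0] * len(candidates)
    # Visit every ordered pair (i, j) with i != j, so each pair is judged
    # in both presentation orders.
    for i, j in permutations(range(len(candidates)), 2):
        if compare(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort indices by win count, highest first; ties keep input order.
    order = sorted(range(len(candidates)), key=lambda i: -wins[i])
    return [candidates[i] for i in order]


def top_k(ranked: List[str], k: int) -> List[str]:
    """Keep the k best candidates, e.g. as input to a fusion model."""
    return ranked[:k]
```

With N candidates this makes N(N-1) comparator calls, which is the price of comparing candidates directly instead of scoring each one in isolation.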
To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons for testing purposes. Our LLM-Blender significantly surpasses the best LLMs and baseline ensembling methods across various metrics on MixInstruct, establishing a substantial performance gap.

Open-source large language models exhibit diverse strengths and weaknesses due to variations in data, architectures, and hyperparameters, making them complementary to each other. There are two primary approaches to ensembling the multiple output candidates from various language models: (1) Ranking and (2) Fusing Generation. Ranking aims to select the best candidate from the pool by training a critic model. Fusing Generation aims to merge the candidates into a better output by framing the task as a sequence-to-sequence learning problem. Recent studies have demonstrated that ranking and fusing generation can improve the performance of a single language model.

MLM-Scoring is an unsupervised ranking method that encodes only the candidate text with a pre-trained BERT and computes the pseudo-log-likelihood of the masked tokens as the candidate score.
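The pseudo-log-likelihood idea can be sketched as follows: mask one token at a time and sum the log-probabilities that a masked language model assigns to the hidden tokens. The `token_log_prob` callable and the `MASK` string are hypothetical placeholders for a real masked language model and its mask token, not an actual library API.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # assumed mask symbol; a real tokenizer defines its own


def pseudo_log_likelihood(
    tokens: Sequence[str],
    token_log_prob: Callable[[List[str], int, str], float],
) -> float:
    """Sum log P(token_t | all other tokens) over every position t.

    `token_log_prob(masked_tokens, position, target)` is a hypothetical
    hook for a masked language model: it returns the log-probability the
    model assigns to `target` at the masked `position`.
    """
    total = 0.0
    for t, target in enumerate(tokens):
        masked = list(tokens)
        masked[t] = MASK  # hide exactly one token per forward pass
        total += token_log_prob(masked, t, target)
    return total
```

Under this scheme a candidate with a higher (less negative) score would be ranked higher, and the input text plays no role in the score.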
SimCLS is a supervised ranking method that encodes the input text and the candidate text with a single encoder and computes the cosine similarity of their representations as the score of each candidate. It is trained with a margin ranking loss to maximize the margin between the score of the best candidate and the score of the worst candidate.
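To make the scoring and the training signal concrete, here is a small sketch of cosine-similarity scoring and a hinge-style margin ranking loss on a single better/worse pair. The vectors stand in for encoder outputs, and the function names and `margin` parameter are illustrative assumptions, not the SimCLS implementation.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_ranking_loss(
    score_better: float, score_worse: float, margin: float
) -> float:
    """Hinge loss: zero once the better candidate leads by at least `margin`."""
    return max(0.0, margin - (score_better - score_worse))
```

The loss only pushes scores apart while the gap is smaller than the margin, so well-separated pairs contribute no gradient.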
SummaReranker is another supervised ranking method. It encodes the input text and the candidate text with a single cross-encoder and adopts a mixture-of-experts layer to handle signals from various metrics. It takes the average of the prediction scores for these metrics as the score of each candidate. It is trained with a binary cross-entropy loss that maximizes the score of the best candidate only and minimizes the scores of all the other candidates.
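The averaging and the loss described above can be illustrated with a simplified sketch. It assumes each metric head already outputs a probability per candidate, and it applies binary cross-entropy to one probability per candidate for brevity; the function names are hypothetical and the cross-encoder and mixture-of-experts layer are omitted.

```python
import math
from typing import Sequence


def candidate_score(metric_probs: Sequence[float]) -> float:
    """Average the per-metric predictions into one candidate score."""
    return sum(metric_probs) / len(metric_probs)


def binary_cross_entropy(probs: Sequence[float], best_index: int) -> float:
    """Mean BCE with label 1 for the best candidate and 0 for the rest."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)
        label = 1.0 if i == best_index else 0.0
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(probs)
```

Because only one candidate receives a positive label, this objective treats every non-best candidate the same, regardless of how close it is to the best one.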
SummaFusion is a fusing generation method that applies the fusion-in-decoder architecture to fuse all the candidates into a better output. However, it does not filter out bad candidates during fusion, which may lead to a worse output.
Limitations: All three ranking methods score each candidate individually based on the input text, and they never directly compare candidates in a pairwise manner. Among the output candidates from LLMs, differences can be quite subtle, as the candidates are all produced by very sophisticated models and one may be only marginally better than another. Even for humans, it can be challenging to gauge candidate quality without direct comparison. In addition, the current fusing generation method never filters out bad candidates during fusion, which may lead to a worse output.
LLM-Blender (ours) is a hybrid ensembling framework that combines the strengths of ranking and fusing generation. It first ranks the candidates in a pairwise manner and then fuses the top-ranked candidates via a sequence-to-sequence model. See the next section for more details.
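The two-stage flow described above can be sketched as a small function that ranks, selects, and then fuses. The `rank` and `fuse` callables, the `blend` name, and the default `k` are hypothetical placeholders for a pairwise ranker and a sequence-to-sequence fuser, not the project's actual API.

```python
from typing import Callable, List, Sequence


def blend(
    instruction: str,
    candidates: Sequence[str],
    rank: Callable[[str, Sequence[str]], List[str]],
    fuse: Callable[[str, Sequence[str]], str],
    k: int = 3,  # illustrative default, not a value taken from the source
) -> str:
    """Rank the candidates, keep the top k, and fuse them into one output.

    `rank` and `fuse` are hypothetical hooks: `rank` returns the candidates
    ordered best-first, and `fuse` maps the instruction plus the selected
    candidates to a single text (for example, via a seq2seq model).
    """
    ranked = rank(instruction, candidates)
    selected = ranked[:k]  # drop low-ranked candidates before fusion
    return fuse(instruction, selected)
```

Filtering before fusion is the design point: the fuser never sees candidates that the ranker placed below the cutoff.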


Citation

 
@inproceedings{llm-blender-2023,
  title = "LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion",
  author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
  year = "2023"
}

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion [ACL2023]
