Think Before You Score: Self-Rationalizing Evaluators are State-of-the-Art for Fine-grained Evaluation
Fine-tuning on Rationales Improves Judge Rationales and Scores
Overview
As LLM systems grow more sophisticated, it has become increasingly important to align them with robust, feedback-driven learning mechanisms such as RLHF and RLAIF. Reinforcement Learning from AI Feedback (RLAIF), being more scalable and less resource-intensive than curating human feedback, has accelerated the development of techniques that leverage large language models' ability to evaluate AI-generated content for coherence, quality, and alignment. Beyond scoring the content, an LLM-as-a-judge can also be tasked with producing detailed rationales that not only increase transparency but also help the model calibrate its own judgments. To harness this potential, we introduce Self-Rationalizing Evaluators (SRE), a novel approach that improves the fine-grained judging ability of an LLM-as-a-judge by iteratively reflecting on its own rationales, improving its rationalization and, consequently, its judgment scores.
As shown in the table below, SRE addresses the limitations of current judge-training methods: relying only on examples from the training data, our model significantly improves both its rationale quality and its scoring accuracy, outperforming similarly sized and even larger models, as well as alternative post-SFT methods, across diverse evaluation benchmarks. We also demonstrate the improvement in rationale quality produced by our recipe through human evaluation.
Methodology
We propose a novel training strategy to enhance the fine-grained performance of LLM-as-a-Judge through iterative alignment on synthetic preference data, with a key focus on conditioning the model's score on its own rationale. Our iterative approach consists of four stages (see the sketch after this list):
Seed Initialization: Supervised fine-tune a base judge model on an initial labeled dataset.
Self-rationalization: Given an input conversation, generate N judgments, each comprising a rationale followed by a score.
Preference Data Selection: Construct synthetic (chosen, rejected) preference pairs from the N judgments via a selection process.
Preference Optimization: Refine the judge model with DPO, yielding an improved judge model at each iteration.
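To make the loop concrete, below is a minimal Python sketch of a single SRE iteration. The callables `sample_judgments` (drawing N rationale-then-score samples from the current judge) and `dpo_update` (one DPO training pass over the preference pairs) are hypothetical placeholders, and the specific selection rule shown here, pairing judgments that match the gold score against judgments at least a margin apart, is an illustrative assumption rather than the exact recipe from the paper.

```python
# Minimal sketch of one SRE iteration (assumed helper names and selection rule).
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Judgment:
    rationale: str  # free-text explanation generated before the score
    score: int      # fine-grained score, e.g. on a 1-5 rubric

def build_preference_pairs(
    judgments: List[Judgment],
    gold_score: int,
    margin_threshold: int = 3,
) -> List[Tuple[Judgment, Judgment]]:
    """Pair judgments that match the gold score (chosen) against judgments
    whose scores differ from the chosen score by at least margin_threshold
    (rejected). This selection rule is an assumption for illustration."""
    chosen = [j for j in judgments if j.score == gold_score]
    rejected = [j for j in judgments if j.score != gold_score]
    return [
        (c, r)
        for c in chosen
        for r in rejected
        if abs(c.score - r.score) >= margin_threshold
    ]

def sre_iteration(
    judge,                                     # current judge model (opaque here)
    train_set: List[Tuple[str, int]],          # (conversation, gold score) pairs
    sample_judgments: Callable[[object, str, int], List[Judgment]],  # hypothetical sampler
    dpo_update: Callable[[object, List[Tuple[Judgment, Judgment]]], object],  # hypothetical DPO step
    n_samples: int = 8,
):
    """One self-rationalization round: sample N judgments per input,
    select (chosen, rejected) pairs, then run a DPO update on the judge."""
    pairs: List[Tuple[Judgment, Judgment]] = []
    for conversation, gold_score in train_set:
        judgments = sample_judgments(judge, conversation, n_samples)
        pairs.extend(build_preference_pairs(judgments, gold_score))
    return dpo_update(judge, pairs)
```

Iterating this function, with each round's output judge used for the next round's sampling, corresponds to the iterative improvement described above.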
Results
Self-rationalization improves fine-grained evaluation: As shown in Table 2, SRE outperforms post-SFT methods like consistency (majority voting) and best-of-N sampling, as well as other baseline models.
Rationales with DPO improve judging: Conditioning the score on the rationale, combined with DPO, outperforms models trained or prompted to score without giving rationales.
Self-rationalization implicitly leads to better rationale quality: With each iteration, the quality of the rationales increases.
Quality of preference pairs is key: The DPO model trained with a high margin threshold (i.e., a large difference between the scores of the chosen and rejected judgments) outperformed other preference-selection methods.
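As a small illustration of the margin criterion, the snippet below filters hypothetical sampled scores (on an assumed 1-5 rubric) at two thresholds; the rule that the chosen score equals the gold label is an assumption for illustration, not necessarily the exact selection used in the paper.

```python
# Hypothetical sampled scores for one conversation whose gold score is 5.
sampled_scores = [5, 4, 2, 1]
gold = 5

chosen = [s for s in sampled_scores if s == gold]
rejected = [s for s in sampled_scores if s != gold]

for margin in (1, 3):
    pairs = [(c, r) for c in chosen for r in rejected if c - r >= margin]
    print(f"margin >= {margin}: {pairs}")
# margin >= 1: [(5, 4), (5, 2), (5, 1)]
# margin >= 3: [(5, 2), (5, 1)]   <- only clearly separated pairs survive
```

A higher threshold keeps fewer but cleaner pairs, which is consistent with the finding above that preference-pair quality drives the gains.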
At Collinear AI, we're not just participants in the Enterprise LLM market: we're setting new standards, ensuring our customers have access to effective, scalable, and automated feedback for their AI applications.
Try it out for yourself - https://app.collinear.ai/