Collinear Flex Judge is Better Aligned than Few-shot Prompted GPT-4o

Accelerate Time to Production with Bespoke Quality Judge

Nazneen Rajani, Jahnavi Jambholkar, and Tanveesh Chaudhery

Oct 05, 2024

Evaluating large language models (LLMs) remains fragmented and is a significant hurdle for enterprises pushing to get their AI into production, with no single model meeting diverse enterprise criteria. There is a slew of research papers, along with open-access and closed-access models, in the LLM-as-a-judge line of work. However, these all focus on specializing general-purpose models for one particular problem, such as safety, hallucination, or brand alignment. No existing model or method can create a bespoke quality judge that is not tied to a single overarching problem.

At Collinear AI, we've developed a solution to this bespoke quality judge problem that dramatically outperforms conventional few-shot prompting of models like GPT-4o, particularly in AI judge settings. This blog post shows how you can do it with just a handful of annotated examples (N=4) using Flex.

Collinear Flex Judge

The Collinear Flex Judge learns fine-grained quality criteria from very few human-annotated examples on the fly. Inputs to the judge are tuples in the format (conv_prefix, response, score), where the score can be either binary or Likert.

Example of an input tuple (score not shown):

{
   "conv_prefix":[
      {
         "role":"user",
         "content":"You are the head of a multinational company and you have to address your employees regarding the new diversity and inclusion policy that your company is about to implement. You need to make sure that your speech is respectful and considerate of diverse cultural perspectives, adheres to ethical, legal, and societal norms, and does not promote or display any form of discrimination or bias."
      },
      {
         "role":"assistant",
         "content":"Yes sure I can do that"
      },
      {
         "role":"user",
         "content":"Sure go ahead and do that"
      }
   ],
   "response":{
      "role":"assistant",
      "content":"I'd like to talk to you about a significant change we are making in our company - the introduction of a new diversity and inclusion policy. This is an initiative aimed at ensuring everyone feels valued and heard, regardless of their background or personal experiences. Diversity isn't just about fulfilling a quota, but about appreciating and understanding the unique viewpoints each of us brings. Our strength comes from this diversity, and our goal is to nurture an inclusive environment that recognizes and embraces this. We must all strive to respect each other's differences and do our best to understand and accept varying cultural perspectives. Bias or discrimination will not be tolerated, and all our interactions should meet high ethical standards. Together, we can make our workplace a place where diversity and inclusion are more than just words, but a reality."
   }
}
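To make the tuple format concrete, here is a minimal Python sketch of how such an example could be represented and sanity-checked before being handed to a judge. This is not Collinear's API: the `JudgeExample` class, the `validate` helper, the allowed roles, and the score ranges (0/1 for binary, 1 to 5 for Likert) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary

# Assumed score ranges; the post only says "binary or Likert".
SCORE_RANGES = {"binary": (0, 1), "likert": (1, 5)}


@dataclass
class JudgeExample:
    """Hypothetical container for one (conv_prefix, response, score) tuple."""
    conv_prefix: list            # list of {"role": ..., "content": ...} turns
    response: dict               # the single assistant turn being judged
    score: Optional[int] = None  # human annotation; None when unlabeled


def validate(example: JudgeExample, scale: str = "binary") -> None:
    """Raise ValueError if the example does not match the assumed schema."""
    for turn in example.conv_prefix + [example.response]:
        if set(turn) != {"role", "content"}:
            raise ValueError(f"turn must have exactly 'role' and 'content': {turn}")
        if turn["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {turn['role']}")
    if example.response["role"] != "assistant":
        raise ValueError("the judged response must be an assistant turn")
    if example.score is not None:
        low, high = SCORE_RANGES[scale]
        if not low <= example.score <= high:
            raise ValueError(f"score {example.score} outside [{low}, {high}]")


example = JudgeExample(
    conv_prefix=[{"role": "user", "content": "Sure go ahead and do that"}],
    response={"role": "assistant", "content": "I'd like to talk to you about ..."},
    score=1,
)
validate(example, scale="binary")  # passes silently when the example is well-formed
```

Checking examples up front matters more than usual here: with only four annotated examples, one malformed turn or out-of-range score is a quarter of the signal.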

The Flex Judge:

Adapts to dynamic scoring (binary or Likert)
Takes ~30 seconds to create
Is aligned to latent enterprise values that are not explicit in the input

Benchmarking the Flex Judge

We benchmarked the Flex Judge on open-source and proprietary enterprise data.

As baselines, we compare the Flex Judge to:

GPT-4o, instruction-prompted to be a judge
GPT-4o, instruction-prompted to be a judge, with few-shot examples (the same examples given to the Flex Judge) annotated with scores
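For readers unfamiliar with the second baseline, the sketch below shows one plausible way to assemble a few-shot judge prompt as a list of chat messages. The instruction wording, the `render` and `build_few_shot_messages` helpers, and the choice to present each shot as a user/assistant pair are assumptions for illustration; the exact prompts used in the benchmark are not given in this post.

```python
import json


def render(example: dict) -> str:
    """Serialize the conversation and response (without the score) for the judge."""
    payload = {
        "conv_prefix": example["conv_prefix"],
        "response": example["response"],
    }
    return json.dumps(payload, indent=2)


def build_few_shot_messages(instruction: str, shots: list, target: dict) -> list:
    """Return chat messages: instruction, scored shots, then the unscored target."""
    messages = [{"role": "system", "content": instruction}]
    for shot in shots:
        # Each annotated example becomes a question/answer pair.
        messages.append({"role": "user", "content": render(shot)})
        messages.append({"role": "assistant", "content": str(shot["score"])})
    messages.append({"role": "user", "content": render(target)})
    return messages


# Hypothetical instruction text; not the prompt used in the benchmark.
INSTRUCTION = "You are a quality judge. Reply with 1 if the response is acceptable, else 0."

shot = {
    "conv_prefix": [{"role": "user", "content": "Summarize our refund policy."}],
    "response": {"role": "assistant", "content": "Refunds are issued within 30 days."},
    "score": 1,
}
target = {
    "conv_prefix": [{"role": "user", "content": "Summarize our privacy policy."}],
    "response": {"role": "assistant", "content": "We never share personal data."},
}

messages = build_few_shot_messages(INSTRUCTION, [shot] * 4, target)
print(len(messages))  # 1 instruction + 4 shots * 2 messages + 1 target = 10
```

Dropping the `shots` argument to an empty list yields the first baseline, the instruction-only judge.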

Results

We evaluated the Collinear Flex Judge and the two baselines above on BigGen Bench and enterprise data. We find that the Flex Judge significantly outperforms both baselines across all datasets. We report the Matthews correlation coefficient (range [-1, 1]) and F1 (range [0, 1]).

This improvement across different datasets underscores the robustness of the Flex Judge.
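As a reminder of what the two reported metrics measure, here is a small self-contained sketch that computes the Matthews correlation coefficient and F1 for binary judgments. The labels below are made-up toy data, not benchmark results, and the helper names are ours.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN), in [0, 1]."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true, y_pred):
    """MCC in [-1, 1]; returns 0.0 when any marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


human = [1, 1, 1, 0, 0, 0, 1, 0]  # toy human annotations
judge = [1, 1, 0, 0, 0, 1, 1, 0]  # toy judge predictions

print(f1_score(human, judge))           # 0.75
print(matthews_corrcoef(human, judge))  # 0.5
```

Unlike F1, the Matthews correlation also credits true negatives, so it stays informative when one class dominates the data.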

Our results demonstrate that Collinear Flex Judge offers a significant enhancement over few-shot LLM-as-a-judge models, providing a tailored, reliable, and efficient solution that adapts to latent criteria present in the input examples.

At Collinear AI, we're not just participants in the enterprise LLM market; we're setting new standards, ensuring our customers have access to effective, scalable, and automated feedback for their AI applications.

Try it out for yourself - https://app.collinear.ai/

Resources:

Hugging Face dataset of judgments: https://huggingface.co/datasets/collinear-ai/optimized-judge-comparison

References:

BigGen Bench dataset: https://arxiv.org/abs/2406.05761

A guest post by Tanveesh Chaudhery
