Collinear-Guard: Where Customization Meets Precision for Fine-Grained Evaluation and Feedback

Prapti Trivedi

Aditya Gulati

, and

Nazneen Rajani

Dec 09, 2024

Overview

LLM-as-Judge models are becoming more popular as evaluators in different areas, especially when it comes to aligning models with specific rules or principles using approaches like Reward Learning from AI Feedback (RLAIF). Even though these models are evolving, they still have significant drawbacks when applied to real-world situations:

Not Customizable
Most of the LLM-as-Judge models are designed to be general-purpose, and thus they can’t easily be tailored for specific use cases. (give examples) There’s not much room to tweak them to suit your own requirements, so you don’t have much control over how they fit into your own workflow.
Not Granular enough
These models typically provide high-level judgments, often in binary or pairwise preference formats, without offering sufficient insight into the intensity or nuances of the decision. However, enterprise needs are rarely clear-cut; they often require deeper analysis and more detailed insights to guide informed, complex decision-making.
Black Box Decision-Making
One of the most significant concerns with LLM-as-Judge models is their "black box" nature. Users often have little insight into the internal workings of the model, making it difficult to understand how decisions are made or to trust the rationale behind the evaluations.

The All-in-one solution: Collinear-Guard Flex Judge

The Collinear-Guard Flex addresses these common limitations, offering a solution that not only provides enterprise-grade precision right out of the box but also allows for easy, on-the-fly customization to align with your organization's specific values and constitutional requirements. With Collinear-Guard Flex, you can achieve more nuanced, transparent, and tailored decision-making, making it a powerful tool for your enterprise needs.

Out-of-the-box enterprise-grade functionality

Collinear-Guard Flex Judge is designed to deliver high-quality, enterprise-grade performance from day one, ensuring that organizations can deploy it quickly and effectively without the need for extensive customization or setup. With its robust features and advanced capabilities, Collinear Guard Flex Judge is ready to meet the demands of businesses across industries, offering reliable, precise, and scalable decision-making power.

In Table 1, we compare our Collinear-guard Flex Judge with other LLM-as-Judge models on real-world enterprise data. As shown, CG Flex, outperforms both Prometheus-7B and LLama-3.1-8B-Instruct on all the domain-specific data, highlighting its superior ability to handle complex, enterprise-level tasks precisely.

Table 1: Comparison of Collinear-Guard Flex on enterprise domain-specific data

Dynamic Customization Tailored to Your Exact Needs

Why settle for rigid systems that lag behind your needs? With Collinear-Guard Flex, you’re always in control. This isn’t just customization—it’s real-time adaptability that keeps you ahead of the game.

Change On Demand: Whether it’s a compliance update, a strategy pivot, or a surprise challenge, tweak settings instantly—no downtime, no red tape.
Built for You: Customize everything to match your company’s values, priorities, and quirky specifics. It’s your system, your way.
Scale Without Limits: Growing team? New region? No problem. Flex expands effortlessly to match your ambitions.

No more rigid systems. No more delays. Just a powerful, customizable solution that moves as fast as you do.

Ready to flex your customization muscle? 💪

Let’s Get Hands-On: A Quick Guide to Flex Your CG Judge

Ready to see just how easy it is to tweak Collinear-Guard Flex Judge to fit your needs? Here’s your no-nonsense guide to customizing like a pro. No IT department required.

Step 1: Getting Started

To get started with Collinear-Guard Flex, follow these steps to install the required packages and configure your environment.

!pip install collinear datasets

You’ll need your space_id and token to authenticate and connect to the platform. For steps on how to set up and extract the space_id and token, follow the steps mentioned here.

Step 2: Load a dataset with few-shot examples

To tailor a model to your specific needs, all it takes is 2–3 carefully chosen examples of a conversation paired with a response. For this demonstration, we’ll select two random data points from the “Economic_crime” harm category of the PKU-Alignment/PKU-SafeRLHF dataset. These examples will help create a customized AI judge specifically designed for evaluating Financial Crime scenarios.

You can upload the few-shot examples on our platform in the following way:

space_id = <SPACE_ID>
token = <API_TOKEN>
def upload_new_dataset(df: pd.DataFrame, db_name: str):
    req_obj = {
        "name": db_name,
        "space_id": space_id,
    }
    conversations = []
    df = pd.DataFrame(df)
    for index, row in df.iterrows():
        conversations.append({
            'conv_prefix': list(row['conv_prefix']),
            'response': row['response']['content'],
            'judgements': {},
            'ground_truth': row['ground_truth'],
        })
    req_obj['conversations'] = conversations
    host = 'https://api.collinear.ai'
    url = f'{host}/api/v1/dataset'
    output = requests.post(url, json=req_obj,
                           headers={
                               'Authorization': f'Bearer {token}'})
    response = output.json()
    return response

df=load_dataset("collinear-ai/financial_cg_flex_customization")["pku_safer_rlhf_economic_crime"].to_pandas()
upload_new_dataset(df)

Step 3: Create a Custom Judge

Step 3.1 Do human annotations for the few-shot examples ✍️

Step 3.2 Auto-generate a custom scoring criteria ✨

After labelling the few-shot examples, you can auto-generate a scoring criteria using very less few-shot examples. This dynamic scoring criteria automatically extracts the model specification and constitution required by your enterprise

Step 4: Your Custom CG Flex is ready for use! 🧑🏾‍⚖️🚀

Finish creating your Custom Judge and note the Judge_id, as mentioned in the documentation here. Now your custom judge is fully customized and aligned to your enterprise needs in minutes!

judge_id=<JUDGE_ID>
client  = Collinear(access_token=token,space_id=space_id)

await client.judge.collinear_guard.pointwise(judge_id,[{'role':'user','content':'Your conversation'}],{'role':'assistant','content':'Response to be evaluated'})

The complete implementation is available in this Colab Notebook.

Results

As highlighted in Table 2, the customized CG Flex significantly outperforms the standard, out-of-the-box version when applied to financial data. Its tailored enhancements unlock a new level of precision and efficiency, offering a clear advantage in such high-stakes domain enterprise data like Finance.

Table 2: Performance increase with dynamic customization

Curious about incorporating this into your workflow or adapting it to fit your enterprise needs? Reach out to us—we’re eager to collaborate and help you take your safety solutions to the next level!

Did You Know?

We also recently garnered attention from a pivotal article in MIT Technology Review, titled “How OpenAI Stress-Tests Its Large Language Models.” In this piece, we emphasize the urgent need for “downstream users to have access to tools that allow them to test large language models themselves.”

Looking Ahead

We have a family of Collinear guard models in different sizes designed to meet various business needs, from high-stakes safety validation to real-time moderation just like the one discussed in this blog post.

Ready to build Safe and Reliable AI systems? Get started today:

✨ Sign up at platform.collinear.ai or,

🚀 Explore our playground to test our models.

Collinear AI’s Blog

Discussion about this post