Introducing Collinear Simulations: Steerable Personas for AI Agent Testing
TraitBasis, inspired by mechanistic interpretability, yields high-fidelity user personas for comprehensive agent testing
“SPEAK TO A HUMAN!!!!!!!”
Most of us have, at least once in our lives, had to type this when talking to a customer support chatbot. Despite the incredible progress in AI, these agents still falter when faced with the most natural human emotions: impatience, skepticism, and the like. This points directly to a gap in robustness testing: most benchmarks today use predictable, robotic users and fail to capture how an agent performs under the pressure of a real human interaction.
Our research quantifies this robustness gap. When we simulated users with four reality-grounded traits (impatience, skepticism, confusion, and incoherence) on τ-Bench, the success rates of frontier models plummeted by over 20 percent. This confirms that to build reliable AI, we need a better method for simulating the users who cause these failures.
To solve this, we created TraitBasis: a lightweight and data-efficient method for generating high-fidelity, steerable user traits for testing. It works by identifying a “trait vector,” a specific direction in a language model’s internal activation space that corresponds to a human characteristic. By applying this vector to the model that is simulating the user, we can precisely control their behavior, creating a tough, dynamic benchmark to evaluate the agent against.
The Challenge: Realistic Personas
Our research began with a systematic evaluation of the established methods for inducing specific human traits in language models. We focused our analysis on the two most common approaches: system prompting and fine-tuning.
We found that they consistently fail to meet the demands of a realistic and stable simulation, falling short in three main areas:
Fine-grained control: Prompting and fine-tuning often blur intensity; “moderate” and “high” impatience end up looking the same.
Stability over long chats: Personas fade mid-conversation; traits drift back to neutral.
Mixing traits: Combining two prompts makes one dominate; mixes look unbalanced.
Clearly, a more robust method for inducing traits was needed. That’s TraitBasis.
TraitBasis
Our solution, TraitBasis, moves beyond external instructions like prompting and instead makes adjustments to a model’s behavior from the inside. This approach, known as activation steering, is built on the insight that human traits correspond to specific directions within a language model’s internal activation space.
Here’s how we find these directions in the model’s activation space. To isolate the vector for “impatience,” for example, we start with two nearly identical user responses: one neutral, the other clearly impatient. When we look at the model’s internal activations for both, they are mostly the same; what differs is precisely the trait. By subtracting one activation from the other, we cancel out everything the two responses share (the context, the user’s goal, and so on) and are left with a clean signal. This is the trait vector.
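For readers who want to see the idea concretely, here is a minimal sketch of the extraction step. The model name, layer index, and contrastive sentences are illustrative assumptions, not the exact setup from the paper.

```python
# Sketch: extract a trait vector as the difference of activations on a contrastive pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical choice of user-simulator model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # hypothetical layer at which the trait direction is read (and later injected)

def mean_hidden_state(text: str) -> torch.Tensor:
    """Average hidden state at LAYER over the tokens of `text`."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so hidden_states[LAYER] is the
    # output of the LAYER-th decoder block.
    return out.hidden_states[LAYER][0].mean(dim=0)

# A contrastive pair: same goal and context, one neutral, one clearly impatient.
neutral   = "Could you check the status of my refund when you get a chance?"
impatient = "Check my refund status NOW. I have been waiting forever, hurry up!"

# Subtracting the paired activations cancels what the two responses share
# (context, user goal) and leaves a direction that encodes the trait itself.
trait_vector = mean_hidden_state(impatient) - mean_hidden_state(neutral)
# In practice, average this difference over several contrastive pairs to reduce noise.
```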
This resulting trait vector is what we use to steer the user model. During a conversation, we add this vector to the user model’s hidden activations at every turn to provide a continuous nudge to its behavior.
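Continuing the sketch above, the steering step can be implemented with a forward hook that adds the trait vector to the chosen layer's output on every forward pass; the hook placement and steering coefficient below are illustrative assumptions.

```python
# Sketch: steer the user-simulator model by injecting the trait vector during generation.
ALPHA = 6.0  # steering strength; larger values push the trait harder (illustrative value)

def steering_hook(module, inputs, output):
    # Decoder blocks usually return a tuple whose first element is the hidden states;
    # add the trait direction at every token position, on every forward pass.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * trait_vector.to(hidden.dtype).to(hidden.device)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# Hook the decoder block whose output is hidden_states[LAYER] (0-indexed as LAYER - 1).
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)

prompt = "Agent: I can help with that. Could you confirm your order number?\nUser:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    steered_ids = model.generate(**inputs, max_new_tokens=60, do_sample=True)
print(tok.decode(steered_ids[0], skip_special_tokens=True))

handle.remove()  # detach the hook once the simulated turn is generated
```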
Putting TraitBasis to the test
We ran several experiments to show how TraitBasis outperforms the other methods across four key areas:
Feels like a real user (Realism)
In head-to-head comparisons, TraitBasis was the clear winner, achieving the highest Elo rating (1624) and a 63% win rate against other methods. This gave it a significant advantage over both fine-tuning (1561 Elo) and prompting (1530 Elo), and it achieved this while using 3,000x less data (4 vs. 13k samples).
Implication: you can run believable behavioral evaluations without a heavy data collection phase.
You can actually set the dial (Control)
When we asked evaluators to distinguish between higher- and lower-intensity traits, TraitBasis proved exceptionally reliable with 97.5% accuracy. This gave it a slight edge over full supervised fine-tuning (95%) and a massive advantage over prompt-based methods (75%).
Implication: intensity is calibrated; medium vs. high produces consistently different behavior you can depend on.
Stays in character over long chats (Stability)
TraitBasis is the only method that demonstrates true dynamic stability. While the baseline methods are defined by persona collapse (the trait fades in 94% of prompt-based and 66% of fine-tuned conversations), TraitBasis remains consistent or escalates in over 77% of cases.
Implication: reveals breakdowns that only surface when a user persists or escalates, the way real users behave.
Two traits at once without one taking over (Compositionality)
For complex users (e.g., both impatient and confused), TraitBasis produced the correct combination of traits 62.5% of the time, far more accurately than any other method. It blends traits without one overpowering the other.
Implication: lets you evaluate realistic combinations of traits, the way users show up in production.
Table 1 shows how TraitBasis compares to the other methods across all the criteria we defined above.
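Both the intensity control and the trait mixing above fall out of the vector formulation. The sketch below, continuing the earlier code, shows the arithmetic; the coefficients and the confusion example are purely illustrative assumptions.

```python
# Sketch: intensity is a scaling of a trait vector; mixing is a sum of trait vectors.
confused = "Wait, sorry, I'm confused: is this about my refund, or my order, or the card I used?"
confusion_vector  = mean_hidden_state(confused) - mean_hidden_state(neutral)
impatience_vector = trait_vector  # extracted in the first sketch

medium_impatience = 3.0 * impatience_vector   # smaller coefficient, milder trait
high_impatience   = 8.0 * impatience_vector   # larger coefficient, stronger trait

# Summing vectors composes traits; inject the result with the same forward hook as before.
impatient_and_confused = 4.0 * impatience_vector + 4.0 * confusion_vector
```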
τ-Trait
To apply our TraitBasis method systematically, we developed τ-Trait, a new benchmark designed specifically to measure agent robustness. We extended τ-Bench in two ways to create τ-Trait: first, by integrating our high-fidelity user personas powered by TraitBasis, and second, by adding two new domains — telecom and telehealth.
Examples
Figures 2 and 3 show side-by-side user-agent conversations. On the left (τ-Bench), the agent interacts with a standard, cooperative user. On the right (τ-Trait), the same agent interacts with a user steered by TraitBasis.
Figure 2 shows a complete breakdown of execution and trust. The agent on the left correctly downgrades all flights. The agent on the right, when pressured by a skeptical user, not only fails to perform the downgrade internally (it calls the tool with cabin: ‘business’), it then lies to the user, claiming the downgrade to ‘economy’ was successful. This is a critical failure of both the agent’s logic and its ability to be truthful.
Figure 3 shows how impatience causes a critical reasoning error. In τ-Bench, the agent succeeds flawlessly. In τ-Trait, when rushed by the impatient user, the agent hallucinates an incorrect argument in its internal tool call (security_home instead of home_security). This single mistake forces the agent into a lengthy, inefficient recovery process, turning a simple task into a complex failure.
Conclusion
We’ve shown that even the most advanced AI agents are more brittle than their benchmark scores suggest. Their performance plummets under the pressure of normal human emotions, not because they aren’t smart, but because they haven’t been tested for this kind of real-world robustness.
TraitBasis provides a stable, controllable, and realistic way to simulate these human traits. It allows researchers to move beyond asking “Can my agent do the task?” and start asking the more important question: “Can my agent do the task when the user is frustrated, confused, and unpredictable?”
At Collinear, we believe that answering this second question is the key to building AI systems that people can truly trust.
For the full technical details, read our paper, try out TraitBasis, and test your AI agents on τ-Trait.
If you use TraitBasis or τ-trait in your work, please cite:
@article{he2025impatient,
title = {Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents},
author = {He, Muyu and Kumar, Anand and Mackey, Tsach and Rajeev, Meghana and Zou, James and Rajani, Nazneen},
journal = {arXiv preprint arXiv:2510.04491},
year = {2025},
url = {https://arXiv.org/abs/2510.04491}
}