<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Collinear AI’s Blog]]></title><description><![CDATA[The Simulation Lab for AI Teams ]]></description><link>https://blog.collinear.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!NEWt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8eb67c-e319-41fc-b352-b2d945f26a93_245x245.png</url><title>Collinear AI’s Blog</title><link>https://blog.collinear.ai</link></image><generator>Substack</generator><lastBuildDate>Mon, 18 May 2026 10:11:44 GMT</lastBuildDate><atom:link href="https://blog.collinear.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[CollinearAI]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[collinearai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[collinearai@substack.com]]></itunes:email><itunes:name><![CDATA[Nazneen Rajani]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nazneen Rajani]]></itunes:author><googleplay:owner><![CDATA[collinearai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[collinearai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nazneen Rajani]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Is your RL environment fair to your agent?]]></title><description><![CDATA[or ensuring that your hill-climbing budget is spent right :)]]></description><link>https://blog.collinear.ai/p/is-your-rl-environment-fair-to-your</link><guid isPermaLink="false">https://blog.collinear.ai/p/is-your-rl-environment-fair-to-your</guid><dc:creator><![CDATA[Adit Jain]]></dc:creator><pubDate>Thu, 14 May 2026 
02:04:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jykr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jykr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jykr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jykr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png" width="582" height="327.375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:582,&quot;bytes&quot;:230049,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/197301633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jykr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!Jykr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b0af8a2-a7c1-4d67-bedb-c4c97058b6ce_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>tldr; based on my current understanding of evaluations, RL environments, and the hill-climbing loop:</p><blockquote><p>an environment (or evaluation) is fair when score differences are driven mainly by the capability you intend to measure, and are mostly invariant to nuisance factors like contamination, verifier bugs, environment drift, and benign prompt paraphrases.</p></blockquote><p><em>I use the words evaluation and environment interchangeably, since for all practical purposes in a modern multi-tool, multi-step setup they are the same.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget 
show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Collinear AI&#8217;s Blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I hated taking exams in most of my courses in undergrad and then in my PhD. For the most part, I thought the exams did not measure the core skills required to apply the subject matter in the real world. I am afraid that the gradients for language models and agents share my feeling.</p><p>This article is an effort to distill the key features of a fair evaluation, scoped to RLVR and agent harnesses. Fair not with respect to a protected attribute like race or gender, but fair to the agent you are evaluating.</p><p>Two concrete things are in scope for this article, with more to follow:</p><ul><li><p>The verifier inside an RLVR loop: the function that takes a rollout as input and produces a reward.</p></li><li><p>The agent harness: the tools, scaffolding, and protocol the agent acts through during rollouts, evals, and production.</p></li></ul><h3>What &#8220;fair to the agent&#8221; means</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AdMh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!AdMh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AdMh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/197301633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AdMh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!AdMh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b39342-8e92-463f-9858-ebd3ef711f00_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A capability is a repeatable ability of an agent to produce a desired outcome under specified conditions.</p><p>A fair eval is a measurement of a specific capability of the AI agent.</p><p>What is it not? It is not a measurement of outcomes that were never expected under the specified conditions.</p><p>Score differences between agents, or between training checkpoints, should be explained by the capability under test. They should not be explained by nuisance factors:</p><ul><li><p>Train-set contamination.</p></li><li><p>Verifier bugs and overfit rubrics.</p></li><li><p>Environment drift between runs.</p></li><li><p>Benign paraphrases of the prompt.</p></li><li><p>Harness details the agent never sees in deployment.</p></li></ul><p>If an eval is sensitive to these, it&#8217;s not a fair eval.</p><h3>Some ways an eval can be unfair</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qioX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qioX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 424w, 
https://substackcdn.com/image/fetch/$s_!qioX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 848w, https://substackcdn.com/image/fetch/$s_!qioX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 1272w, https://substackcdn.com/image/fetch/$s_!qioX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qioX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png" width="1456" height="1098" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1098,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:270890,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/197301633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!qioX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 424w, https://substackcdn.com/image/fetch/$s_!qioX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 848w, https://substackcdn.com/image/fetch/$s_!qioX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 1272w, https://substackcdn.com/image/fetch/$s_!qioX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30a4d466-403e-4111-86be-4277ef568dd2_1579x1191.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We discuss the four most common ways here, but the list is non-exhaustive, and I will keep extending it as we find more gaps.</p><h4>1. Prompt Underspecification</h4><p>This happens when the task omits a constraint that any reasonable solver would need to complete it meaningfully.</p><p>If the verifier reports a failure, the outcome is confounded with the ambiguity of the instructions. Two equally strong agents can swing several points apart on the same task. If the instructions are ambiguous, the agent should be rewarded equally for equally valid solution paths.</p><h4>2. Environment Issues</h4><p>The sandbox has a stale package, a non-responsive tool, or a filesystem layout that drifts between runs or setups. The agent&#8217;s first action fails for reasons it cannot inspect. Such a failure should be attributed to the environment, not to the agent.</p><h4>3. The harness design should be environment-centric</h4><p>The agent interacts with the environment through the harness. Tool schemas, observation formatting, retry policy, max steps, and error messages all shape behavior.</p><p>Two harnesses with the &#8220;same&#8221; tools can produce very different scores. For example:</p><ol><li><p>A truncated stderr hides the bug.</p></li><li><p>A misspecified tool schema can cause unnecessary confusion.</p></li><li><p>An unintentional 10-step cap might hamper a good agent&#8217;s planning capability.</p></li></ol><p>Therefore any fair eval should have a uniform eval across different</p><h4>4. 
Don&#8217;t ask it to do things it would not do in the real world</h4><p>Your environment, tasks, and verifiers should not expect the agent to achieve a goal that is unrealistic in practice. A few common cases:</p><ul><li><p>Tasks the agent is told to refuse in production but expected to attempt in eval.</p></li><li><p>Tools available in the gym that do not exist in the deployed surface.</p></li><li><p>A persona or role the deployed system never adopts.</p></li></ul><p>If the agent were to hillclimb on these, it would not improve the capability you want to measure.</p><h4>The role the verifier plays</h4><p>The verifier is a model of &#8220;what success looks like.&#8221;</p><p>A strict verifier on an underspecified prompt punishes reasonable behavior. A lenient verifier can turn failures into passes. Verifiers are rarely audited as carefully as the agents they grade, but their errors compound at every step of hill-climbing.</p><p>Treat the verifier as a system under test.</p><ul><li><p>Measure its agreement with humans.</p></li><li><p>Measure its variance across paraphrases of the same correct answer.</p></li><li><p>Measure its false-positive and false-negative rates.</p></li></ul><h3>Reward hacking is a fairness problem</h3><p>In RL, the verifier <em>is</em> the reward. Any behavior the verifier accepts is, as far as the policy is concerned, a valid solution.</p><p>If the verifier can be satisfied without solving the task meaningfully, the agent will eventually find that shortcut. This is not the model&#8217;s or the algorithm&#8217;s fault. It is the eval&#8217;s fault. 
A few common shortcuts:</p><ul><li><p>Producing answers in a format the verifier scores leniently.</p></li><li><p>Exploiting tool calls in a way that earns a good reward but doesn&#8217;t affect the state of the environment.</p></li><li><p>Pattern-matching the rubric&#8217;s expectation instead of producing a correct answer.</p></li><li><p>Outputting both the answer and its negation when the verifier checks for substring presence.</p></li></ul><p>Fair RL requires that the <em>only</em> cheap way to get reward is to do the task. If a cheaper path exists, the agent-harness improvement loop or the gradient across rollouts will discover it, and rightly so.</p><h3>A checklist for fair evaluations</h3><p>Before you let an eval drive decisions or training:</p><ul><li><p><strong>Specification.</strong> Could a competent solver complete the task from the prompt alone, without insider knowledge?</p></li><li><p><strong>Harness parity.</strong> Does the eval harness match the deployment harness on tools, formats, and limits?</p></li><li><p><strong>Distribution match.</strong> Are eval tasks ones the agent would actually face, and be permitted to attempt, in production?</p></li><li><p><strong>Verifier audit.</strong> Has the verifier been graded against humans or SoTA model trajectories? 
What are its false-positive and false-negative rates?</p></li><li><p><strong>Paraphrase invariance.</strong> Does the score change when the prompt is rewritten without changing meaning?</p></li><li><p><strong>Failure attribution.</strong> When the agent fails, can you tell whether it was the agent, the harness, the environment, or the verifier?</p></li></ul><h3>Open questions</h3><ul><li><p>How do you measure verifier quality when human labels are themselves noisy or expensive?</p></li><li><p>What is the right unit of &#8220;agent failure&#8221; in long-horizon tasks where many small slips compound?</p></li><li><p>Can verifiers be co-trained with agents without collapsing into a stationary state where both are poor?</p></li><li><p>How do you detect reward hacking inside an RLVR loop before it shows up as a deployment regression?</p></li><li><p>What is the minimum harness contract that should be held constant across rollout, eval, and production?</p></li><li><p>For tasks with no programmatic verifier, how do you keep model-graded RLVR from drifting into preference-style noise?</p></li></ul><p><em>If you&#8217;re shipping agents and models and have faced similar problems with fair evaluations, we should chat. We have a lot of interesting private and public evaluations that cover terminal-use, MCP, and computer-use environments. Our focus is ensuring high-quality data, and fair evaluations are a core tenet as we scale the horizons our agents operate on.</em></p>]]></content:encoded></item><item><title><![CDATA[Whose Taste?]]></title><description><![CDATA[More data won't fix the AI verification problem. Different taste might.]]></description><link>https://blog.collinear.ai/p/whose-taste</link><guid isPermaLink="false">https://blog.collinear.ai/p/whose-taste</guid><dc:creator><![CDATA[Sachin]]></dc:creator><pubDate>Thu, 07 May 2026 16:02:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uHhf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The man who passed every check</h2><p>Kim Philby, standing in his mother&#8217;s flat in Drayton Gardens, London, before an array of reporters, was guilty. He had been guilty since 1934, when a Soviet agent named Otto recruited him in Regent&#8217;s Park. He was guilty when he joined MI6 in 1940, and when the King awarded him an OBE in 1945 for his wartime intelligence work. 
And he was guilty in 1955, when the Foreign Secretary stood up in the House of Commons and cleared his name.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uHhf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uHhf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uHhf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Rare Film Emerges Of Double-Agent Kim Philby Speaking After Defection |  KPBS Public Media&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Rare Film Emerges Of Double-Agent Kim Philby Speaking After Defection |  KPBS Public Media" title="Rare Film Emerges Of Double-Agent Kim Philby Speaking After Defection |  KPBS Public Media" srcset="https://substackcdn.com/image/fetch/$s_!uHhf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uHhf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08870806-fce6-4c76-9808-eab764fc0097_3057x1720.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kim Philby lecturing Stasi agents in East Germany about his rise to the top of MI6 while spying for the KGB | Harold Clements Getty Images</figcaption></figure></div><p>It was not until he had been posted to the MI6 station in Beirut and was confronted by his old MI6 friend Nicholas Elliott that Philby confessed, albeit partially, to spying for the Soviets. Shortly thereafter, he disappeared into the night aboard a Soviet freighter bound for Russia, where he lived for the remainder of his life.</p><p>The signals about Philby had always been there. The Communist sympathies from his years at Cambridge, a first marriage to an Austrian communist. 
The system saw it all and, time and again, cleared him of any suspicion or wrongdoing.</p><p>This is not an article about mid-20th-century espionage. This is about what happens when a system has to verify something for which it has no reliable ground truth, using signals that capture what to look for, but not how to weigh them.</p><div><hr></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Collinear AI&#8217;s Blog! Subscribe for free to receive new posts!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><h2>What this has to do with AI</h2><p>There&#8217;s a funny conversation happening in the AI community right now. Researchers and frontier labs are increasingly dismissive of humans&#8217; ability to verify or review long-horizon tasks, but are also insistent that human data is what unlocks the next 1000x in model capabilities. Both are probably true statements, but they point to two different problems getting collapsed into one.</p><p>The first is bandwidth. A human cannot sit through a four-hour agent trajectory, follow every tool call, and reliably verify the work to a high degree of accuracy. 
But this is a solvable problem, whether it be through better tooling, sampling, or even breaking longer tasks into smaller checks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f8se!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f8se!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 424w, https://substackcdn.com/image/fetch/$s_!f8se!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 848w, https://substackcdn.com/image/fetch/$s_!f8se!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 1272w, https://substackcdn.com/image/fetch/$s_!f8se!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f8se!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png" width="829" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:829,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/196671775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f8se!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 424w, https://substackcdn.com/image/fetch/$s_!f8se!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 848w, https://substackcdn.com/image/fetch/$s_!f8se!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 1272w, https://substackcdn.com/image/fetch/$s_!f8se!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33aca7ca-de88-466d-baee-346698c9a7ba_829x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The second is criteria. In unverifiable domains, where there&#8217;s often no clean ground truth, two qualified experts can review the same output and disagree on whether it is good. 
Not because one of them is wrong, but because &#8220;good&#8221; is a judgment call, and judgment calls don&#8217;t average across annotators the way correctness does.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a4bq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a4bq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 424w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 848w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 1272w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a4bq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png" width="680" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64083,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/196671775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a4bq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 424w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 848w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 1272w, https://substackcdn.com/image/fetch/$s_!a4bq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb0193-de05-4c64-9aff-b86f0e98119a_680x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This is the harder problem to solve. In unverifiable domains, the verifier is not approximating a ground truth because there isn&#8217;t one. The verifier is somebody&#8217;s weighting of criteria, which means that in unverifiable domains, the verifier is taste.</p><h2>Taste isn&#8217;t one thing</h2><p>When most people talk about taste, they treat it as a single thing, but in reality, taste has two parts, each of which behaves very differently.</p><p>The first is criteria: the line items and properties in a rubric that dictate when an agent&#8217;s output is considered good. 
For a legal brief, that might look like:</p><ul><li><p>A clear issue statement that frames the legal question</p></li><li><p>Citations to controlling authority</p></li><li><p>A coherent narrative spine that connects the facts to the legal argument</p></li></ul><p>Criteria are mostly binary: either the brief cites the right cases or it doesn&#8217;t. Either the brief is succinct or it isn&#8217;t.</p><p>On criteria, MI6 had Philby cold. The Communist circle at Cambridge, the first marriage to an Austrian communist, the Cambridge friends who had already defected to Moscow: all of it was in his file, surfaced and reviewed more than once. MI6 had the rubric, and it scored him on the rubric.</p><p>What it didn&#8217;t have was an answer to the next question. Confronted with a Cambridge man, the son of a celebrated Arabist, decorated in the war, vouched for by the right people, what do you do with the red flags? Every time MI6 ran the question, it answered the same way: down-weight them.</p><p>That next question, what to do with the line items when they coexist, conflict, or trade off against each other, is weighting. Weighting is continuous, context-dependent, and the harder half of the problem of capturing taste.</p><p>For example, two experienced litigators can sit down with the same case and agree on every criterion. Both want a clear issue statement, both want the right authority cited, both want a tight narrative. But they will disagree on whether to file an aggressive version of the brief or the pared-down version. The disagreement is about how to weigh the criteria against each other relative to a specific context: a specific judge, venue (e.g., local vs. federal court), client&#8217;s risk tolerance, etc. 
The end result is you still have the same rubric, but different verdicts.</p><p>The human data machines today are quite good at the first half via pairwise preferences, rubric annotation, etc., but none of it captures the trade-off logic that turns items into a verdict.</p><h2>Why aggregation doesn&#8217;t rescue this</h2><p>The obvious objection at this point is that we already have a tool for this. RLHF works by collecting preference pairs from many annotators. If you collect enough of them, the model is supposed to learn the implicit weighting through statistical aggregation. Surely scale solves the problem then?</p><p>It doesn&#8217;t, because aggregation captures the median annotator&#8217;s weighting in the typical or average context, which, for unverifiable domains, is what you want to avoid.</p><blockquote><p>Value in unverifiable domains comes from non-median judgment applied to non-typical context. </p></blockquote><p>A great legal brief isn&#8217;t the median lawyer&#8217;s brief. A great research direction isn&#8217;t what most reviewers would pick. A good investment thesis is, almost by definition, one most people disagree with at the time of the trade. The whole point of bringing in expert judgment is to access the part of the distribution that consensus would smooth away.</p><p>When you average preferences across annotators and contexts, you smooth out the signal that distinguishes good from average judgment. You don&#8217;t end up with a verifier that approximates expert judgment. 
You end up with one that approximates consensus annotator judgment, and those are different things.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X1a4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X1a4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X1a4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg" width="606" height="861" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:861,&quot;width&quot;:606,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;File:The Soviet Union 1990 CPA 6266 stamp (Soviet Intelligence Agents. Kim  Philby) small resolution.jpg - Wikimedia Commons&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="File:The Soviet Union 1990 CPA 6266 stamp (Soviet Intelligence Agents. Kim  Philby) small resolution.jpg - Wikimedia Commons" title="File:The Soviet Union 1990 CPA 6266 stamp (Soviet Intelligence Agents. Kim  Philby) small resolution.jpg - Wikimedia Commons" srcset="https://substackcdn.com/image/fetch/$s_!X1a4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X1a4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde7a6eb-511b-4aa1-8005-b10fcef08ed8_606x861.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Soviet postage stamp paying homage to Kim Philby</figcaption></figure></div><p>Let&#8217;s imagine we ran this approach on Kim Philby. A vetting model trained on the aggregated preferences of every MI6 officer in 1945 would have produced exactly the verdict the system produced: he&#8217;s one of us, the signals pointing the other way must be noise. More annotators wouldn&#8217;t have helped. 
The signal was in the minority weighting that aggregation washed out.</p><p>In unverifiable domains, the value of a verifier - taste - is precisely what aggregation destroys.</p><h2>The real question stops being &#8220;more data&#8221;</h2><p>So if aggregation is the wrong tool, the question stops being how to collect more preferences and starts being something harder. What are we actually trying to capture in unverifiable domains?</p><p>The hypothesis: somebody&#8217;s weighting of legally defensible criteria, in specific contexts, captured at high enough fidelity that a verifier can apply it when the context shifts. That reframe forces two questions.</p><p>The first is whose weighting. Ideally, it would be experts whose judgment correlates with real, measurable outcomes in the domain in question. Obviously, that is far from simple.</p><p>Outcomes in unverifiable domains are often delayed, noisy, or unobservable. For example, reputation is a proxy and a weak one; it tracks visibility as much as judgment. Peer consensus often selects for orthodoxy, which is the opposite of what makes expert judgment valuable in the first place. Philby&#8217;s career is a textbook example of this. The Cambridge education, the war record, and the OBE were all layers of peer consensus pointing the same way. The judgment that would have caught him was the orthogonal kind, which peer consensus is built to suppress.</p><p>None of this means expert selection is impossible. It means it&#8217;s a real problem that has to be solved, not an assumption that can be hand-waved past.</p><p>The second is how to capture it. Pairwise preferences flatten weighting into a single binary signal. Rubric scoring captures criteria but skips the trade-off logic. What you actually need is closer to reasoned disagreement at depth: experts not just picking the better output but explaining what they would have prioritized differently, in what context, and why. 
The trade-off logic itself becomes the training signal.</p><p>It should be said, however, that this is dramatically harder to scale than what we have today. The whole appeal of pairwise preferences was that they were cheap and parallelizable. Weighting capture is neither. It requires more time, more expertise, and more thought per data point.</p><p>This is where approaches like throwing more data at the problem start to break down. You don&#8217;t need more data, you need different data, from different people, captured in different ways.</p><h2>The last mile keeps moving</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j6c1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j6c1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 424w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 848w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 1272w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j6c1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png" width="604" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:604,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/196671775?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j6c1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 424w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 848w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 1272w, https://substackcdn.com/image/fetch/$s_!j6c1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4e5e0d2-89d8-4340-9f40-da90eaffdb27_604x264.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Aaron Levie made a <a href="https://x.com/levie/status/2048950940661932319">point</a> recently that connects directly to this. As agents get better at the parts of a task they can already do well, the &#8220;last mile&#8221; - the part requiring human judgment to verify - keeps shifting up the value stack. The taste needed to verify a junior analyst&#8217;s output today is not the taste needed to verify a senior partner&#8217;s output tomorrow. 
</p><blockquote><p>The frontier of what counts as good keeps moving, because the floor of what agents can do keeps rising.</p></blockquote><p>What that means is that weighting capture isn&#8217;t simply a dataset problem. It is, and will remain, an evolving competency. Whatever weighting you capture today is correct for a domain that&#8217;s already moving, spurred on by AI agents. By the time the agent trained on it is deployed, the work that needs verifying has shifted, and the taste required to verify it has shifted too.</p><p>In unverifiable domains, you are not building a verifier once. You are building the capacity to keep capturing the right people&#8217;s weighting as the work that needs verifying changes underneath you.</p><p>The next 1000x in unverifiable domains doesn&#8217;t come from more taste in the data. It comes from the right people&#8217;s taste, captured at high fidelity, as the frontier moves. Kim Philby walked out of his mother&#8217;s flat in Drayton Gardens in 1955 because the system optimized for the wrong taste. It had the criteria. It had the data. What it didn&#8217;t have was a way to weigh Cambridge against Moscow before the consensus had already smoothed the difference away. The version of that problem facing AI is harder, because the frontier keeps moving and the answer keeps shifting with it. Whoever figures out how to solve the taste problem facing AI will own it.</p><h2>Vetting the file</h2><p>At Collinear, we&#8217;re building part of it. SimLab is our simulation lab for AI agents: the infrastructure to generate, curate, and verify high-signal data at scale. Simulated enterprise environments, NPC users, verifiable tasks, training-ready rollouts. 
SimLab is built to capture expert weighting in context: not just whether an output is good, but the trade-off logic underneath the verdict, refreshed as the work itself shifts.</p><p>If you&#8217;re shipping AI in a domain where verification is hard, where median-annotator labels won&#8217;t get you there, and where consensus would smooth away exactly the signal you care about, we should talk. Every domain has its Philby. The work is building the system that doesn&#8217;t clear him.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to a Researcher&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to a Researcher</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Collinear AI&#8217;s Blog! 
Subscribe for free to receive new posts!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Collinear Newsletter #11 - Notes on Frontier AI ]]></title><description><![CDATA[Hi AI innovators,]]></description><link>https://blog.collinear.ai/p/collinear-newsletter-11-notes-on</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-newsletter-11-notes-on</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Thu, 30 Apr 2026 22:06:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X0F1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi AI innovators,</p><p>April brought the people building frontier agents into the same rooms we work in. Two community events, plus the research direction those rooms kept pointing us toward.</p><h3><strong>NYC Builders at the Collinear Exec Dinner Series  </strong></h3><p>Our Collinear Exec Dinner Series brought together senior research leaders from Apple, IBM Research, Two Sigma, NVIDIA, Datadog Research, Wells Fargo, and Google for a candid social over Old Delhi kebabs. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X0F1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X0F1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X0F1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg" width="406" height="541.2403846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:406,&quot;bytes&quot;:2447001,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/195945304?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X0F1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X0F1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b8d6ccd-51e6-45ef-95ba-141e61960286_3024x4032.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The conversation focused on three critical industry problems. 
</p><ol><li><p><strong>Building reliable evals</strong> for multi-turn, long-horizon workflows</p></li><li><p><strong>The performance gap</strong> between offline traces and live trajectories</p></li><li><p><strong>What &#8220;fidelity&#8221; should mean</strong> for simulated environments meant to train agents on real-world tasks</p></li></ol><p>Our next edition of the Exec Dinner Series is coming up. DM us if you would like to be in the room.</p><h3><strong>Sim Fidelity for AI Agents - our Q2 Research Social</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Unnf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Unnf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Unnf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg" width="488" height="650.6666666666666" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1152,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;View image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="View image" title="View image" srcset="https://substackcdn.com/image/fetch/$s_!Unnf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Unnf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5a13734-cd85-4ffe-bdc0-2717ea5d351d_1152x1536.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We took over the Sunnyvale office for a night of debate on simulation environments. The topic was the need for realism and high fidelity in RL simulations. </p><p>Researchers from GDM, NVIDIA, Apple, xAI and others slugged it out along with our special guest from MBZUAI, <strong><a href="https://www.linkedin.com/in/mikhail-yurochkin-a45659114/">Mikhail Yurochkin</a>. </strong></p><p>Subscribe to our mailing list if you would like to join the next Research Social. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.collinear.ai/subscribe?"><span>Subscribe now</span></a></p><p></p><h3><strong>NPCs - the Key to Replicating Real World Messiness</strong></h3><p>When we published <a href="https://arxiv.org/abs/2510.04491">TraitBasis</a> last year, our work on activation-steered behavioral traits, the next idea was obvious: put these NPCs inside the Simulation Lab itself.</p><p>Real-world enterprise workflows are messy. Coworkers interrupt, change their minds, update systems while you are mid-thought, and act on shared state without telling you. We wanted our simulations to behave the same way.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TIOk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TIOk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 424w, https://substackcdn.com/image/fetch/$s_!TIOk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 848w, https://substackcdn.com/image/fetch/$s_!TIOk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TIOk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TIOk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png" width="1314" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:262492,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/195945304?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TIOk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 424w, https://substackcdn.com/image/fetch/$s_!TIOk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 848w, https://substackcdn.com/image/fetch/$s_!TIOk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 
1272w, https://substackcdn.com/image/fetch/$s_!TIOk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcbdbfa-14b3-48c5-8c33-fc3536639d71_1314x690.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>So we kept building.</strong> NPCs with agency that take actions and change shared state. NPCs with secrets that the agent has to surface through the right questions. NPCs with limited context, including attention, forgetting, and prioritization across competing tasks. 
Each NPC nuance (and the related tasks and verifiers built around it) raises the fidelity of the simulation and the quality of the training signal coming out of it. Frontier models that were comfortably solving our environments a quarter ago are now hitting walls, and we are seeing marked improvement in downstream agent outcomes when training on this signal.</p><p>Learn more <a href="https://blog.collinear.ai/p/trait-basis">here</a> and stay tuned for a technical report on NPCs and related hillclimbing results.</p><h3><strong>Work with us</strong></h3><p>If you are training agents for enterprise workflows and want to stress-test them in a real Simulation Lab, <a href="https://www.collinear.ai/book-a-demo">book a demo</a>.</p><p>We are also hiring researchers and engineers who want to push the frontier of agent training environments. See open roles at <a href="https://www.collinear.ai/careers">collinear.ai/careers</a>.</p><p>That&#8217;s it for April. More on the research side coming soon.</p><p>Best, </p><p>The Collinear Team</p>]]></content:encoded></item><item><title><![CDATA[AI's U-235 Problem]]></title><description><![CDATA[Nuclear physics solved for k_eff. 
What's the AGI equivalent?]]></description><link>https://blog.collinear.ai/p/ais-u-235-problem</link><guid isPermaLink="false">https://blog.collinear.ai/p/ais-u-235-problem</guid><dc:creator><![CDATA[Jed Gresham]]></dc:creator><pubDate>Thu, 23 Apr 2026 18:09:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Sjz5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sjz5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sjz5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Sjz5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1100600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/195170872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sjz5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Sjz5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7fb190f-df91-48c7-b16d-7de32b82f4c7_1200x630.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The race to AGI isn&#8217;t being won by whoever has the most compute or the cleverest architecture. It&#8217;s being won by whoever solves a quieter, less glamorous problem: finding enough of the right kind of data to cross the threshold. This is not a new concept.</p><p>On December 2, 1942, Enrico Fermi and a small team of physicists gathered in a makeshift lab beneath the stands of Stagg Field in Chicago. They had spent months carefully stacking graphite blocks and uranium slugs into a precise arrangement they called Chicago Pile-1. 
At 3:25pm, they slowly withdrew a control rod (a neutron-absorbing insert that regulates the reaction). The Geiger counters clicked faster. The reaction sustained itself. The nuclear age began when the clicking didn&#8217;t stop.</p><p>What most people don&#8217;t realize is how close they came to never getting there. The core problem wasn&#8217;t theory. Leo Szilard had conceived of the nuclear chain reaction nearly a decade earlier, crossing a London street in 1933. He understood it so completely he quietly patented it and handed the rights to the British Admiralty to keep it out of dangerous hands. The physics was known and the threshold was understood. The problem they had was the fuel.</p><p>Natural uranium is everywhere. The earth&#8217;s crust is full of it. But raw uranium is almost useless for a chain reaction. The isotope that actually fissions, U-235, makes up less than 1% of natural ore. The rest is inert. Volume alone gets you nowhere. Too much of the wrong material actively works against you as it absorbs neutrons and dampens the reaction before it can sustain itself. The real breakthrough of the Manhattan Project wasn&#8217;t the bomb. 
It was Oak Ridge&#8217;s K-25 gaseous diffusion plant, built in 1944 to enrich uranium at industrial scale. At the time it was the largest building in the world. Its only job was to concentrate U-235: filtering, separating, amplifying the rare fissile material until there was enough to matter.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYLK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYLK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 424w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 848w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 1272w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYLK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png" width="800" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e4bb6b1-672b-4df4-a740-675852396251_800x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:454776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/195170872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DYLK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 424w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 848w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 1272w, https://substackcdn.com/image/fetch/$s_!DYLK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e4bb6b1-672b-4df4-a740-675852396251_800x586.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Photograph of Stagg Field at the University of Chicago, Argonne National Laboratory archives (Argonne National Laboratory on <a href="http://flickr.com/">flickr.com</a>)</figcaption></figure></div><div><hr></div><h2>The Ore Gets Leaner</h2><p>Today&#8217;s AI models are trained on more data than any human could read in a thousand lifetimes. The models are still running into walls, and the reason maps almost exactly onto Oak Ridge. The reason some data moves models forward and most doesn&#8217;t comes down to what a model can actually learn from it. Training works by exposing a model to examples and having it predict what comes next, then correcting it when it&#8217;s wrong. The correction is where the learning happens. Boilerplate content, templated writing, repetitive programmatic output: these produce almost no correction signal. 
The model already knows what comes next.</p><p>High-signal data is different. It contains genuine reasoning, unexpected connections, nuanced judgment, edge cases the model hasn&#8217;t encountered. Every one of those is a correction opportunity. When the model is wrong, it gets updated and gets sharper.</p><p>U-235 atoms fission when struck by a neutron because of specific properties in their nuclear structure. Most uranium atoms absorb the neutron and go quiet. The difference between fissile and inert material is structural, not superficial. High-signal training data works the same way. Generic data absorbs the training pass and goes quiet.</p><p>Most new data being generated is programmatic: logs, auto-generated content, boilerplate, synthetic outputs from models that are already mediocre. It&#8217;s abundant, cheap, and mostly inert. As it floods in, it dilutes the fraction of genuinely useful signal further. As AI-generated content spreads across the internet, models trained on it learn the average, not the edge. The ore gets leaner the more we mine it.</p><p>Marie Curie didn&#8217;t find radium by sifting through more rock. She developed new processes to isolate and concentrate what she was looking for. The Manhattan Project built industrial infrastructure specifically designed to separate what mattered from what didn&#8217;t. 
The AI field needs the same shift.</p><div><hr></div><h2>Building Oak Ridge</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jnic!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jnic!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jnic!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jnic!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jnic!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jnic!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg" width="960" height="608" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;File:Oak Ridge National Laboratory, Oak Ridge, Tenn (78285).jpg&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="File:Oak Ridge National Laboratory, Oak Ridge, Tenn (78285).jpg" title="File:Oak Ridge National Laboratory, Oak Ridge, Tenn (78285).jpg" srcset="https://substackcdn.com/image/fetch/$s_!jnic!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jnic!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jnic!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jnic!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0ef85f-88c1-406e-9bad-68873d582be6_960x608.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Post card showing Oak Ridge National Laboratory, Oak Ridge, Tenn</figcaption></figure></div><p>Two things have to happen in parallel. The first is better curation pipelines: new methods to identify and extract high-signal data from existing sources, smarter filtering, better labeling, clearer definitions of what &#8220;exceptional&#8221; looks like for each capability domain. The second is synthetic data. Even though the risk of model collapse (the equivalent of contaminating your fuel) is real, waiting for enough naturally occurring high-signal data won&#8217;t get us where we&#8217;re trying to go. Not all synthetic data techniques are the same, and the differences between them matter a lot. 
Deliberately designed training data built to fill specific capability gaps is unavoidable.</p><p>The most straightforward approach is prompt-based generation: feed a capable model a topic, a domain, or a problem type and ask it to generate training examples at scale. Used carefully, this fills gaps in rare or underrepresented domains. Used carelessly, it produces plausible-sounding noise that makes the training pool worse, not better.</p><p>A more sophisticated method is web rewriting: take real content and use a stronger model to transform it into a higher-signal format. DeepSeek did this systematically when building <a href="https://arxiv.org/abs/2412.19437">DeepSeek-V3</a>. Rather than training on web content directly, they used a pipeline of stronger models to filter, rewrite, and structure data into higher-quality reasoning examples across mathematics, code, and general knowledge. The resulting model matched or outperformed models trained at several times the compute cost, with minimal changes to the underlying architecture. 
This is an example of using better fuel rods in the same reactor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-FI3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-FI3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 424w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 848w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 1272w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-FI3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png" width="1456" height="851" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Refer to caption&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Refer to caption" title="Refer to caption" srcset="https://substackcdn.com/image/fetch/$s_!-FI3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 424w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 848w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 1272w, https://substackcdn.com/image/fetch/$s_!-FI3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f21785e-f594-47e9-8e5a-111142f19fca_1661x971.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Benchmark performance of DeepSeek-V3 and its counterparts (https://arxiv.org/pdf/2412.19437)</figcaption></figure></div><p>A third approach is reinforcement learning from human feedback (RLHF): instead of generating new data from scratch, use human preference signals to identify which model outputs were high quality and train on those. This turns the model&#8217;s own outputs into enriched fuel, but only when paired with careful human judgment about what &#8220;better&#8221; means. <a href="https://arxiv.org/abs/2502.13417">Recent work</a> has pushed this further, with new techniques achieving alignment quality comparable to full human annotation using only 6-7% of the annotation effort, by targeting human review at the samples that are hardest to label automatically.</p><p>The most surprising recent result is pure reinforcement learning with no labeled data at all. 
<a href="https://arxiv.org/abs/2501.12948">DeepSeek&#8217;s R1</a> demonstrated that reasoning capabilities can emerge through pure reinforcement learning, with no human-labeled reasoning trajectories required. The model was rewarded for getting verifiable answers right (math problems, code that actually runs) and developed self-reflection and strategy as emergent behavior. It&#8217;s the closest thing yet to a model generating its own fissile material.</p><p>At <a href="http://Collinear.ai">Collinear AI</a>, we&#8217;ve found that the theory isn&#8217;t the bottleneck. The hard part is making these techniques reliable at scale: turning prompt-based generation, web rewriting, and RLHF from one-off research sprints into something teams can run repeatedly without rebuilding from scratch each time. Most organizations treat each data effort as a custom project. <a href="https://github.com/collinear-ai/simlab">SimLab</a> is how Collinear makes this routine.</p><p>The contamination risk is real and worth understanding. <a href="https://arxiv.org/abs/2305.17493">Shumailov et al.</a> demonstrated that repeatedly training on synthetic data leads to model collapse, a finding that attracted significant attention given how close current models are to exhausting available high-quality data. The mechanism: recursive training on synthetic outputs causes models to produce repetitive, narrowing results, effectively losing the tails of the original data distribution. The model gets so good at the average that it loses the edges.</p><p>The answer isn&#8217;t to avoid synthetic data. <a href="https://arxiv.org/abs/2404.01413">Research shows</a> that keeping real data in the mix and layering synthetic data on top, rather than replacing real data entirely, avoids the degenerative feedback loop. The ratio and sequencing matter enormously. Synthetic data added to a real-data foundation behaves very differently from synthetic data trained on top of synthetic data. 
The centrifuge has to be calibrated, not just built.</p><div><hr></div><h2>Critical Mass</h2><p>Before fission, energy was extractive. You burned coal, oil, or gas: feed the furnace, get energy out, repeat. Power was linear, bounded by what you could mine and move.</p><p>Fission changed that. A reaction releases neutrons that trigger more reactions. Above critical mass it&#8217;s self-sustaining: withdraw the control rod once and the chain continues without further input. A kilogram of enriched uranium and a kilogram of coal aren&#8217;t on the same spectrum. Critical mass is the threshold where the system stops depending on external input and begins feeding itself.</p><p>Every AI model today is extractive in the same way pre-fission energy was. We supply the data, the compute, the architecture decisions, the fine-tuning, the human feedback. The model improves because we keep feeding it. Remove the input, the improvement stops.</p><p>AGI is the point where that changes. A system past the AGI threshold can reason about its own limitations, identify what it needs to learn, generate or seek out the training signal it needs, and improve its own architecture. The chain reaction sustains itself.</p><p>The gap between today&#8217;s best models and that threshold is categorical, the same way fission and combustion aren&#8217;t variations of the same phenomenon. Current models are extraordinarily capable combustion engines. Once a system crosses that line, the rate of improvement stops being limited by what we supply. It becomes limited by the physics of the system itself. Human researchers won&#8217;t be the primary driver anymore. The reaction moves at its own pace, on its own terms. 
Getting the enrichment right before we get there is the only real leverage point we have.</p><div><hr></div><h2>Two Races to the Same Threshold</h2><p>The world is currently running two parallel races to the same kind of threshold, and almost nobody talks about them together.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cq4E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cq4E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 424w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 848w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cq4E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png" width="1456" height="964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:964,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4670228,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/195170872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cq4E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 424w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 848w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!cq4E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2afa5f6-8c54-4eb5-9ca1-10bc144903d2_1875x1242.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Aerial drone image of the 500 MW ITER international project under construction in Cadarache, France</figcaption></figure></div><p>The first race is to fusion. Thirty-five nations are collaborating on ITER, the international fusion project in southern France. It&#8217;s been delayed repeatedly and now targets 2039. Private companies are moving faster, but even the boldest credible estimates put commercial fusion in the early 2030s at best. The physics is understood. The engineering is the hard part.</p><p>The second race is to AGI. Both have the same core structure: a threshold that is theoretically understood, a fuel problem that is practically unsolved, and a lot of engineering standing between the two. 
And the two races may be more entangled than they first appear.</p><p>The race to build AGI may literally require the fusion race to progress first, or at least the fission one. The energy demands make that dependency increasingly hard to ignore.</p><p>AI data centers are already consuming power at a scale that strains the grid. In 2024, global data center electricity consumption hit around 415 terawatt-hours (about 1.5% of world electricity use), growing at a rate more than four times faster than overall global electricity consumption. By some estimates, that figure could <a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">approach 945 TWh by 2030</a> (roughly equivalent to Japan&#8217;s entire annual electricity consumption) with high-growth scenarios pushing past 1,700 TWh by 2035.</p><p>That&#8217;s before AGI. A self-sustaining AI system iterating on itself continuously would require orders of magnitude more compute than today&#8217;s training runs. The chain reaction doesn&#8217;t just need enriched fuel. It needs an enormous, uninterrupted power supply to keep running once it starts. The physical energy problem and the cognitive threshold problem are coupled. Nuclear physics even has a name for it.</p><p>k_eff measures whether a chain reaction is self-sustaining. k_eff &lt; 1 and the reaction fizzles. k_eff &#8805; 1 and it runs on its own.</p><p>AGI doesn&#8217;t have an equivalent metric yet. But if it did, it might look something like this: does each generation of capability produce enough leverage to fund, power, and build the next one? We can call it a_eff. Right now, the honest answer is that we don&#8217;t know if a_eff &#8805; 1, and most of the serious debates in AI (about scaling laws, compute returns, energy constraints) are really arguments about that number without naming it.</p><div><hr></div><h2>The Clock</h2><p>Six years ago, the median expert estimate for AGI sat comfortably in the 2060-2070 range. 
As of early 2026, that number has collapsed to around 2033. The compression is accelerating. Dario Amodei said at Davos earlier this year that AGI will likely arrive within a few years, possibly by 2027. Demis Hassabis of Google DeepMind put it more cautiously: roughly a 50% chance by 2030.</p><p>Fusion timelines haven&#8217;t moved the same way. ITER&#8217;s deuterium-tritium milestone is 2039. Commercial fusion power is likely a decade beyond that. The private sector is more aggressive, but even the boldest credible estimates put sustained commercial fusion in the early 2030s at best. Both thresholds represent the same kind of categorical shift: a self-sustaining reaction that permanently changes what&#8217;s possible. Fusion solves the physical energy problem and AGI solves the cognitive one, but both require getting the enrichment right before the reaction will hold.</p><p>The current trajectory suggests AGI crosses its threshold first, probably by a significant margin. That means the data enrichment problem is urgent in a way that plasma confinement simply isn&#8217;t. Fusion researchers have until the 2030s. The people working on data quality and synthetic enrichment for AI may have considerably less time, and far less certainty about when the window closes.</p><p>AGI isn&#8217;t blocked by a missing theoretical insight. Szilard had that moment crossing a London street in 1933. It isn&#8217;t blocked by compute or architecture alone. It&#8217;s blocked by the same thing that stood between Szilard&#8217;s patent and Fermi&#8217;s reaction: not enough of the right material, concentrated precisely enough, arranged carefully enough to sustain itself.</p><p>Fermi&#8217;s team stacked blocks for months and did the math until the geometry was right and the reaction kept going.</p><p>That&#8217;s what we&#8217;re building toward. 
Not a dramatic moment, but a controlled, deliberately constructed threshold where the system crosses over and begins to sustain its own improvement. We need an AI Oak Ridge.</p><p>That work is happening now, in pieces, across a lot of teams. If you're at a frontier lab or AI-native company working on model improvement (capability gaps, post-training data, pre-deployment testing), talk to one of our researchers.</p><h2><strong>Building the Centrifuge</strong></h2><p>Oak Ridge took two years and 24,000 workers to separate enough U-235 for the reactor to hold. The enrichment problem for AI is a similar order of undertaking, and no single team will solve it.</p><p>At Collinear, we&#8217;re building part of it. SimLab is our simulation lab for AI agents: the infrastructure to generate, curate, and verify high-signal data at scale. Simulated enterprise environments, NPC users, verifiable tasks, training-ready rollouts. It&#8217;s designed to make deliberate enrichment a routine capability for the teams that need it most.</p><p>If you&#8217;re training reasoning models, shipping agents into production, or watching half your team&#8217;s time disappear into eval data hygiene, we should talk.</p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to a Researcher&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to a Researcher</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Collinear AI&#8217;s Blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[SimLab: The self-serve staging playground for real-world agents ]]></title><description><![CDATA[Agents fail on real tool calls, long workflows, and messy data. SimLab lets you find those failures in simulation, not in production.]]></description><link>https://blog.collinear.ai/p/simlab-the-self-serve-staging-playground</link><guid isPermaLink="false">https://blog.collinear.ai/p/simlab-the-self-serve-staging-playground</guid><dc:creator><![CDATA[Sachin]]></dc:creator><pubDate>Thu, 02 Apr 2026 22:01:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/56774894-85d4-4125-a1a3-38af8ae9ab8a_1216x864.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eS5N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eS5N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 424w, 
https://substackcdn.com/image/fetch/$s_!eS5N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!eS5N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!eS5N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eS5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/192905633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eS5N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 
424w, https://substackcdn.com/image/fetch/$s_!eS5N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!eS5N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!eS5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5283b36d-75a1-4dcd-a6d9-4a6262b39913_1200x630.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong>TL;DR</strong></p><ul><li><p>Agents fail in production because evals test outputs, not behavior across stateful multi-step workflows.</p></li><li><p>The failure modes that actually matter are malformed tool calls, state drift, and no-exit loops, which only show up when the agent runs inside a realistic environment.</p></li><li><p>Software has staging. Agents have nothing between evals and prod. Simulation environments help fill the gap.</p></li><li><p>SimLab is a self-serve CLI that gives you the full stack: tasks, realistic environments, and deterministic verifiers, so you find failures before your users do.</p></li></ul><div><hr></div><h3><strong>The agent passed evals. It worked in the demo. You shipped it. Then it broke.</strong></h3><p>You&#8217;ve probably seen this, or something similar, in production. We&#8217;ll use customer support as an example. A customer asks for a simple account update. 
Everything looks good until step 3 of a 12-step workflow.</p><p>A tool call fires with the right function name but the wrong schema. The API returns a 422, the agent retries with the same payload, and now you&#8217;re in a silent loop. The failure only surfaces when a real user hits it, and it&#8217;s nearly impossible to reproduce from logs alone. It isn&#8217;t just an agent failure; it&#8217;s a failed support interaction and a degraded customer experience.</p><p><strong>This isn&#8217;t a model problem. It&#8217;s a testing problem.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!75_2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!75_2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!75_2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!75_2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!75_2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!75_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!75_2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!75_2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!75_2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!75_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00267cd5-27c2-4b06-ba25-77c1ba258ae9_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Traditional evals were built for a different problem.</strong></h3><p>Traditional evals were built for single-turn, input-output tasks. They work for language problems. But an agent failure isn&#8217;t a typical single-output failure. It&#8217;s a control loop failure. At each step, the agent reads context, picks a tool or action, executes, observes the result, updates state, and decides what to do next. Step 3 of the workflow isn&#8217;t where it breaks; it&#8217;s where the compounding mistakes start. And they compound not just for you, but for the customer interacting with your agent.</p><p>The bugs that ship to production:</p><ul><li><p><strong>Malformed tool call arguments. </strong>Right function, bad payload. Fails schema validation. 
Retries with the same bad payload.</p></li><li><p><strong>Silent state drift. </strong>Working context diverges from ground truth mid-workflow. Each subsequent step compounds it. By step 8, the agent is making decisions based on data that is no longer accurate.</p></li><li><p><strong>Incomplete reasoning chains and no-exit loops. </strong>Hits a dead end, has no recovery path, retries the same action. No timeout, no fallback, no escalation.</p></li></ul><h3><strong>Evals test outputs. They don&#8217;t test behavior.</strong></h3><p>LLM-as-judge makes this worse. You&#8217;re using a nondeterministic model to grade a nondeterministic system. The reward signal is noisy, hard to act on, and doesn&#8217;t scale to the thousands of rollouts you need to improve the agent.</p><h4><strong>Your deployment pipeline has a gap.</strong></h4><p>Software engineers don&#8217;t push from local to prod. The pipeline is develop &#8594; test &#8594; staging &#8594; prod. Staging isn&#8217;t a perfect replica of production, but it&#8217;s close enough for most failures to surface before a user sees them.</p><p>The agent development pipeline today is build &#8594; evals &#8594; prod, with no staging equivalent. The first time your agent hits a live API with real latency, a real 200-step workflow, or a user input outside your eval distribution is the first time a real user hits it too.</p><p>When something breaks, you&#8217;re debugging from logs and likely dealing with an unhappy customer on the other end. You can see that your agent failed, but you usually can&#8217;t reproduce the failure, and you can&#8217;t run a thousand variations of the failing scenario to understand where the boundary is.</p><p>Simulation is the staging layer for agents. 
It closes the gap between &#8220;it passed evals&#8221; and &#8220;it actually did what we needed it to do: protect brand integrity and increase customer support satisfaction.&#8221;</p><div><hr></div><h3><strong>What a simulation environment actually needs.</strong></h3><p>Not a bigger dataset. Not a fancier benchmark. It needs&#8230; Real. World. Scenarios.</p><p><strong>A simulation environment has to let your agent interact with a realistic world across a full task execution trace and give you deterministic </strong><em><strong>and </strong></em><strong>programmatic signal about what happened.</strong></p><p><strong>Three things you can&#8217;t skip:</strong></p><ul><li><p><strong>Environments. </strong>Environments that mirror real-world scenarios and real customer support interactions. APIs with real failure modes (rate limits, malformed responses, timeouts, unexpected nulls), messy seeded data (incomplete records, conflicting field values, schema mismatches), and NPCs that behave imperfectly, like real users (ambiguous or incomplete requests, requests that break the agent&#8217;s assumptions, and mid-task pivots).</p></li><li><p><strong>Tasks. </strong>Long-horizon, multi-step workflows that reflect real production complexity. Tasks that require 50&#8211;200 steps, involve ambiguous intermediate states, and have more than one valid execution path. The kind your agent will actually face when someone submits a customer support request.</p></li><li><p><strong>Verifiers. </strong>Deterministic, programmatic checks, not LLM-as-judge. Did the agent reach the right end state for the customer? Did it complete all required steps? Did it stay within operational constraints? 
Consistent signal you can trust across thousands of parallel rollouts.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PJ0j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PJ0j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 424w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 848w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 1272w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PJ0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png" width="1456" height="742" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/192905633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PJ0j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 424w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 848w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 1272w, https://substackcdn.com/image/fetch/$s_!PJ0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a022f24-4892-4446-b3d4-92a82f04818f_1526x778.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>What your dev loop looks like with simulation.</strong></h3><p style="text-align: center;"><strong>Instead of: build &#8594; ship &#8594; debug from logs.</strong></p><p style="text-align: center;"><strong>It becomes: build &#8594; simulate &#8594; fix &#8594; simulate &#8594; ship.</strong></p><p>You run thousands of parallel rollouts across your task distribution and see where failures occur: which steps, which tool interactions, and which input types cause breakdowns. When something fails, you get an execution trace you can inspect step by step; you can then tweak the environment around it and re-run. You iterate on behavior, tool call logic, error recovery, and context window management in hours, not weeks. And you start finding capability gaps you didn&#8217;t know to look for.</p><p><strong>You stop guessing how your agent behaves. 
With SimLab, you get to see how it behaves.</strong></p><div><hr></div><h3><strong>We built SimLab to do this.</strong></h3><p>We kept running into this gap ourselves. Building simulation infrastructure from scratch is a serious investment: task generation, realistic data, tool simulators, NPC behavior models, sandboxed execution, a deterministic eval layer, and so on. Most teams end up with brittle, domain-specific systems that break the moment the agent or task changes.</p><p><strong>SimLab is a self-serve CLI that gives you the full environment stack without live environment risk.</strong></p><ul><li><p><strong>Sandboxed execution. </strong>Agents run in isolated containers with full environment control. Arbitrary code execution, configurable tool access, reproducible state. You define what the agent can touch.</p></li><li><p><strong>Bring your own tools or use pre-built simulators. </strong>The platform is self-serve. Connect your own APIs and tool schemas, or use out-of-the-box simulators for common systems like Workday, Salesforce, and others.</p></li><li><p><strong>Programmatic task and verifier generation. </strong>Generate long-horizon tasks calibrated to your domain. Tune difficulty, workflow length, ambiguity, and edge-case density. High-quality training signal, not clean-room benchmarks.</p></li><li><p><strong>Programmatic data. </strong>Seeded data and NPC behavior models simulate real production messiness: bad inputs, missing fields, unexpected response formats.</p></li></ul><div><hr></div><h3><strong>Where this fits in your stack.</strong></h3><p>SimLab sits between build and deploy. It&#8217;s not a replacement for evals or observability; it&#8217;s the layer missing between them.</p><p>Evals tell you if individual outputs are correct. SimLab tells you if the agent can complete full workflows under realistic conditions. Observability gives you post-deployment traces of what already broke. 
SimLab gives you pre-deployment traces of what would have broken.</p><p>Human QA pipelines are slow, expensive, and don&#8217;t scale. You can also build your own simulation infra, but a full stack covering tasks, environment, verifiers, sandboxing, and parallelization is a significant engineering investment that gets brittle fast.</p><p><strong>SimLab is designed to be adaptable as your agents and domains change, without rebuilding from scratch each time.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K2XI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K2XI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 424w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 848w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 1272w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!K2XI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png" width="1456" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:272360,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/192905633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K2XI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 424w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 848w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 1272w, https://substackcdn.com/image/fetch/$s_!K2XI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa377a337-bb36-473d-8a1e-ba6d2faff95d_2616x858.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3><strong>Simulation is the new deployment.</strong></h3><p>Simulation will be a standard layer in the agent development pipeline, the same way CI and staging are standard in software. <strong>Not just a nice-to-have at scale, but the thing separating agents that demo well from agents that ship reliably.</strong></p><p>We&#8217;re opening SimLab as a self-serve CLI. Install it, point it at your agent, define an environment, run it. See where it breaks. 
We&#8217;re still testing the task generators, verifier primitives, and environment tooling, and we want to know what doesn&#8217;t work for your use case.</p><p><strong>Try it. See how your agent holds up. Tell us what&#8217;s missing.</strong></p><p><strong>&#8594; <a href="https://github.com/collinear-ai/simlab">github.com/collinear-ai/simlab</a></strong></p>]]></content:encoded></item><item><title><![CDATA[Collinear Newsletter #10 - Notes on Frontier AI]]></title><description><![CDATA[Happy March from the Collinear team!]]></description><link>https://blog.collinear.ai/p/collinear-newsletter-10-notes-on</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-newsletter-10-notes-on</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Tue, 24 Mar 2026 15:03:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AoJQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Happy March from the Collinear team.</p><p>It&#8217;s been a big quarter. 
We moved into a new office in Sunnyvale, launched YC Bench, extended the Simulation Lab, and hit the conference circuit. A lot to cover, so let&#8217;s get into it.</p><div><hr></div><h3><strong>New home in Sunnyvale</strong></h3><p>We outgrew our old space and opened a new office in Sunnyvale. A big thanks to the customers, partners, and friends who stopped by during the first few weeks to check it out. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kFUu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kFUu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 424w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 848w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 1272w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kFUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic" width="543" height="407.25" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:543,&quot;bytes&quot;:1853062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/190807647?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kFUu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 424w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 848w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 1272w, https://substackcdn.com/image/fetch/$s_!kFUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8cfe5b8-59db-40e7-94e3-3e379c940cc7.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The sign is up, the whiteboards are full, and the espresso machine is already earning its keep. 
Good to have a home base.</p><div><hr></div><h3><strong>YC Bench: can frontier models run a startup?</strong></h3><p>We released <a href="https://x.com/CollinearAI/status/2027531502234570768?s=20">YC Bench</a>, the first open-source, long-horizon benchmark with a simulation clock.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4S4r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4S4r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4S4r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg" width="1456" height="582" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!4S4r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4S4r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d0719e-3ad9-4348-b47f-e644fb5d86d8_1920x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>The idea:</strong> give a frontier model seed capital, a small team, and a market of tasks. Ask it to run an AI startup. Manage employees, hit deadlines, allocate resources, and maximize profit over time.</p><p>What we found is that a simple rule-based agent consistently outperforms frontier LLMs. Not because the task is impossible for them, but because they make compounding mistakes early on that they never recover from. 
They chase short-term wins, over-parallelize, and adapt too late when conditions change.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eumu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eumu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Eumu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg" width="1456" height="866" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!Eumu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Eumu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a09f05e-1033-4eba-b10b-0ca116455e3a_1778x1058.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>This matters because <strong>the industry is moving fast toward long-running, multi-step agent workflows</strong>. But reliability isn&#8217;t keeping pace with this ambition. YC Bench measures exactly that gap: not whether a model can answer a question, but whether an agent can hold a coherent strategy over time.</p><p>YC Bench is open-source and on our GitHub. We built it to be extensible. If you&#8217;re working on long-horizon agent evaluation, <a href="https://github.com/collinear-ai/yc-bench">we&#8217;d love to hear what you find</a>!</p><div><hr></div><h3><strong>Simulation Lab: what it is and what&#8217;s new</strong></h3><p>For those new here: the Collinear Simulation Lab is where AI agents learn enterprise work before going to production. Think of it as a practice environment, fully interactive, with simulated data, simulated users (NPCs), real enterprise tooling, and task-specific verifiers. Agents don&#8217;t just get tested. 
They get trained against realistic, messy, multi-step workflows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vFyw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vFyw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 424w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 848w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 1272w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vFyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png" width="1170" height="712" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98164,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/190807647?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vFyw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 424w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 848w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 1272w, https://substackcdn.com/image/fetch/$s_!vFyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e723faf-a331-4d96-8196-3fb2fef942fd_1170x712.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Inside a sim lab, you get:</p><ul><li><p>Simulated APIs for enterprise software (HR, Finance, Sales, Customer Service)</p></li><li><p>NPCs that push back, change their minds, and interrupt</p></li><li><p>Tasks with ambiguity, missing info, and competing priorities</p></li><li><p>Scorers that generate task-specific rubrics alongside formal verifiers</p></li></ul><p>We&#8217;ve extended the lab this quarter with broader tool coverage and deeper scenario complexity. More enterprise surfaces, richer NPC behavior, and tighter integration with RL training loops. 
The goal stays the same: agents need a world to learn in, and we&#8217;re building that world.</p><p>If you&#8217;re building agents for enterprise workflows and want to stress-test them before they touch production, the Simulation Lab is open for business.</p><div><hr></div><h3><strong>Together AI Conference</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AoJQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AoJQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AoJQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!AoJQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 424w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 848w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!AoJQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4a3d69-499e-4992-86dc-4c28e163df44_4096x2731.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Nazneen spoke at the Together AI conference this quarter. Our partnership with Together continues to deepen. <a href="https://blog.collinear.ai/p/trait-basis">TraitBasis</a>, our method for generating realistic simulated users, is now integrated into Together Evals. Builders on Together&#8217;s platform can simulate impatient, confused, or inconsistent user personas and see how their models actually hold up when conversations get unpredictable. If you missed the talk, stay tuned for a recap.</p><div><hr></div><p>That&#8217;s it for March. 
More coming soon on the research side.</p><p>-- The Collinear Team</p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[We gave Claude, Gemini and GPT, $250k, and it didn't go as you’d expect...]]></title><description><![CDATA[Introducing YC Bench: The first open-source, long-horizon benchmark with a simulation clock]]></description><link>https://blog.collinear.ai/p/we-gave-claude-gemini-and-gpt-250k</link><guid isPermaLink="false">https://blog.collinear.ai/p/we-gave-claude-gemini-and-gpt-250k</guid><dc:creator><![CDATA[Muyu]]></dc:creator><pubDate>Thu, 05 Mar 2026 17:02:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ed2a6f3a-6d9c-475e-aea8-2a483997f2be_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong>: We find that frontier AI agents struggle more on YC Bench than on other time-simulated benchmarks, such as Vending Bench 2, highlighting capability gaps in planning and resource allocation for real-world scenarios.</p><p>Get started with YC Bench:</p><pre><code><code>curl -sSL https://raw.githubusercontent.com/collinear-ai/yc-bench/main/start.sh | bash</code></code></pre><p>GitHub: <a href="https://github.com/collinear-ai/yc-bench">collinear-ai/yc-bench</a></p><div><hr></div><h1><strong>Long-Term Coherence as a Goal Post</strong></h1><p>Popular agent benchmarks (<a href="https://huggingface.co/gaia-benchmark">GAIA</a>, <a href="https://www.tbench.ai/">TermBench</a>, SWE-Bench, &#964;&#178;-bench) evaluate a model&#8217;s ability to complete tasks through multi-tool, multi-turn interactions. Even when these tasks span hundreds of tool calls, they share a critical limitation: they lack a <em>simulation clock</em>. As AI agents are integrated into the workforce and digital economy, time becomes an essential dimension of evaluation. Task sequence matters, environmental states shift over time, and inaction is as consequential as a wrong action. 
Without a clock running through the simulator, you can measure whether an agent did the right thing, but not whether it did it when it mattered.</p><p>Some existing benchmarks do include a simulation clock. <a href="https://andonlabs.com/evals/vending-bench-2">Vending Bench</a>, which tests for long-term coherence<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, is one such example. Vending Bench has realistic NPCs, including suppliers the models talk to, and models compete with each other. However, its environment dynamics do not test the more sophisticated planning required when resources are non-stationary and deadlines are tight.</p><p>We propose a new <em>long-horizon adaptive planning</em> and <em>coherence</em> benchmark, YC-Bench (Your Company Bench), in which your agent takes on the role of a startup founder and carries out the activities needed to run a successful business. These activities include task prioritization, task scheduling, meeting client deadlines, resource allocation, managing burnout, and maximizing company profits and prestige.</p><p>With a simulation clock, the AI agent must learn to prioritize long-term rewards over short-term gains. It needs to distinguish between the two, knowing that some actions pay less in the short term but more in the long term. This core human skill, usually termed &#8220;long-term coherence&#8221;, is one that models are not exhaustively tested on, especially in a reproducible, open-source, and extensible way. YC-Bench is an effort in this direction. 
</p><div><hr></div><h2><strong>Environment Dynamics</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n9_f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n9_f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 424w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n9_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg" width="1456" height="866" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!n9_f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 424w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 848w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!n9_f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e32179-b352-48d2-9d94-fffcece35b94_1778x1058.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">v0 loop of YC-Bench. We will continually update YC-Bench based on the feedback we receive.</figcaption></figure></div><p>YC-Bench asks the model to act as the founder of an AI startup. The LLM is given seed capital, a fixed number of employees, and a base prestige level for each of several AI-related domains, such as training, data, and backend. The model&#8217;s aim is to maximize its capital by completing tasks on the market. To make the dynamics more realistic, we associate a prestige level with each task and with the company, and the LLM can only take on tasks that are at or below its prestige level. Therefore, to stay in the game, the LLM needs to browse the market for tasks, commit to them, and manage them so that each project is delivered before its deadline. Each task can be assigned to multiple employees, and each employee can work on multiple tasks.</p><p>Several things can go right. 
If the model assigns the right employees to a task and it is completed on time, the company earns the profit from that task, and its prestige level in the related domain (e.g., &#8216;training&#8217;) increases. The employees who work on the task also improve their skills. As a result, the company can take on more lucrative tasks that require a higher minimum prestige level and stronger skills.</p><p>But several things can also go wrong. If a task is not completed in time because the model assigns the wrong employee to it (e.g., assigning a GPU expert to build a frontend), the company does not get paid, and its prestige in that domain decreases. The company can then take on fewer tasks in that domain because it is now less reliable. Moreover, employees are on a monthly payroll, and good ones cost more, so a company that fails to complete tasks consistently will eventually go bankrupt.</p><p>YC-Bench is built for terminal use. The model can run a fixed set of CLI commands, which it learns from the system prompt. For example, it can assign tasks to employees, cancel tasks, change assignments, and check its performance. After it performs an action, time passes, and events such as task completion, task cancellation, and bankruptcy can occur. If the model goes bankrupt during the evaluation, the evaluation stops. It&#8217;s time to give up. We also examine whether models exploit the particular features of each domain (some are easy but less profitable, others are hard but more profitable) to become specialists rather than generalists and make more money.</p><h1>Early Results</h1><p>We compare three frontier LLMs (<strong>Sonnet 4.6</strong>, <strong>Gemini 3 Flash</strong>, and <strong>GPT-5.2</strong>) against a human-devised rule-based baseline across 3 configs and 3 seeds (27 LLM runs total). 
Each agent starts with $250K and must survive a 1-year simulated horizon.</p><p>Our key result is straightforward: <strong>a simple hand-written heuristic beats every frontier model.</strong> We understand the weight of this claim. We are not claiming that the current version of the benchmark requires reasoning to win. However, reasoning models should be able to cruise through it. We are happy to discuss and debate improvements for the next version!</p><p>We test 3 different configs with 3 unique seeds each:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GGiZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GGiZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GGiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg" width="680" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!GGiZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GGiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ab42d9d-abc6-4417-89b6-865a15396cc4_680x624.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The human-devised rule never goes bankrupt: 9/9 across all configs and seeds, while the best LLM (Gemini 3 Flash) survives 8/9. The rule-based agent doesn't use an LLM at all. It follows a fixed strategy: accept the highest-reward task you can finish, assign your best employees, and never over-parallelize.</p><h2>Survival Rates</h2><p>Hard seed 1 is the clearest signal: all three frontier LLMs go bankrupt, while the rule-based agent finishes with $14.8M. 
The LLMs fail not because the task is impossible, but because they make compounding errors in the first 2-3 months that lock them out of the prestige ladder.</p><h2><strong>When</strong> <strong>LLMs</strong> <strong>win,</strong> <strong>they</strong> <strong>win</strong> <strong>big,</strong> <strong>but</strong> <strong>they</strong> <strong>also</strong> <strong>lose</strong> <strong>hard</strong></h2><p>GPT-5.2 achieves the single highest balance of any agent: $43.5M on hard seed 3, nearly 3x the rule-based agent's $15.0M on the same seed. But GPT also goes bankrupt on 2/9 runs. Sonnet shows the same pattern at a more extreme level &#8212; $10.1M on nightmare seed 2 (the highest LLM result for nightmare), but bankrupt on 4/9 runs overall. Gemini is the most consistent LLM. It sweeps all 3 nightmare seeds (the only LLM to do so) and rarely collapses catastrophically. But even Gemini never matches the rule-based agent's reliability.</p><h2><strong>Prestige</strong> <strong>specialization</strong> <strong>explains part of</strong> <strong>the</strong> <strong>story?</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dRdK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dRdK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 424w, https://substackcdn.com/image/fetch/$s_!dRdK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!dRdK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!dRdK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dRdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg" width="1456" height="1068" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1068,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!dRdK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 424w, https://substackcdn.com/image/fetch/$s_!dRdK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!dRdK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!dRdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf12a2a1-d95e-469d-baae-0fb00fbffc11_4096x3004.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The radar charts reveal some insight into <em>why</em> models fail. 
Each polygon shows the company&#8217;s final prestige across 7 AI domains (system, research, data, frontend, backend, training, hardware). Large polygons indicate that prestige increased broadly. Tiny dots near the center indicate that the model went bankrupt before gaining any prestige. The human-devised rule (navy dashed) fills the full radar on every run: it maxes out prestige methodically across all domains. Among LLMs, Gemini builds the most balanced profiles. GPT-5.2 shows genuine specialization on medium: it focuses on backend/data/frontend while ignoring training. This is a strategically reasonable choice, but one that becomes fragile when the task distribution shifts on harder configs. Sonnet is bimodal: either it maxes out everything (medium seed 1), or it collapses entirely (nightmare seeds 1 &amp; 3, stuck at prestige 1.0 everywhere).</p><p>When we inspect Sonnet&#8217;s scratchpad on failed runs, the model correctly diagnoses the problem (&#8220;PRESTIGE CRISIS -MARKET LOCK&#8221;) but only after payroll has consumed the runway. 
It reasons well <em>about</em> strategy but fails to execute it in a timely manner.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4WlK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4WlK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4WlK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg" width="1456" height="974" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:974,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!4WlK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4WlK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5223b93b-2d09-4660-aa27-ea4517109885_2034x1360.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Why do models struggle?</h2><p>Four failure modes recur across all bankrupt runs:</p><ol><li><p><strong>Over-parallelization.</strong> Accepting 3-5 tasks at once, splitting employees across them. Each employee&#8217;s effective rate drops to base_rate / N per task &#8212; a senior at 8.0 units/hr assigned to 4 tasks contributes just 2.0 to each. Deadlines slip, failures cascade.</p></li><li><p><strong>No</strong> <strong>prestige</strong> <strong>gating.</strong> Accepting tasks that require prestige the company hasn&#8217;t earned yet. The task completes late, the prestige penalty makes the next tier even harder to reach, and the agent spirals into a market lockout.</p></li><li><p><strong>Late</strong> <strong>adaptation.</strong> Models identify problems in their scratchpad but only after the damage is done. 
By the time Sonnet writes &#8220;never accept task B while task A is active,&#8221; payroll has already consumed 60% of the runway.</p></li><li><p><strong>Inconsistent</strong> <strong>ETA</strong> <strong>reasoning.</strong> Models understand throughput math in principle, but don&#8217;t consistently apply it. On medium seed 2, Sonnet has a 49% task win rate, essentially a coin flip, despite writing correct throughput formulas in its scratchpad.</p></li></ol><p>The core gap is not reasoning ability but <strong>temporal</strong> <strong>discipline</strong>: doing the right thing when it <em>matters</em>, sustaining correct behavior across hundreds of turns, and resisting the temptation to over-commit when a lucrative task appears.</p><div><hr></div><h1>Next Steps</h1><p>YC-Bench v0 is a starting point. Here&#8217;s what we&#8217;re working on:</p><ul><li><p><strong>More</strong> <strong>models.</strong> We plan to add results for Claude Opus, Gemini 2.5 Pro, o3, and open-weight models (Llama 4, Qwen 3) as they become available. If a model can use tools in a loop, it can run YC-Bench.</p></li><li><p><strong>Longer</strong> <strong>horizons</strong> <strong>and</strong> <strong>non-stationary</strong> <strong>dynamics.</strong> The 1-year configs test short-to-medium planning. We want to push to 3-5 year horizons where market conditions shift: recessions that shrink rewards, talent wars that inflate salaries, and technology shocks that make entire domains obsolete. This would test whether agents can adapt their strategy mid-run, not just execute a fixed plan.</p></li><li><p><strong>Better</strong> <strong>baselines.</strong> The current human-devised rule is strong but simple. We want to explore MCTS-based planners, RL-trained policies, and hybrid approaches (an LLM for strategy, a rule engine for execution) to understand where the frontier lies.</p></li><li><p><strong>Community</strong> <strong>configs.</strong> YC-Bench is fully open-source and extensible. 
Every parameter, including employee count, prestige distribution, penalty multipliers, and salary curves, can be changed. We encourage the community to design configs that stress-test specific capabilities and submit results.</p></li></ul><p>Try it yourself:</p><pre><code><code>uv add yc-bench
uv run yc-bench run</code></code></pre><p>If you find it useful, feel free to cite our work and contact us!</p><pre><code><code>@misc{collinear-ai2025ycbench, 
author = {{Collinear AI}}, 
title = {{YC-Bench}: Your Company Bench --- A Long-Horizon Coherence Benchmark for {LLM} Agents}, 
year = {2025}, 
howpublished = {\url{https://github.com/collinear-ai/yc-bench}}, 
note = {Accessed: 2026-02-25} }</code></code></pre><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.collinear.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Collinear AI&#8217;s Blog! Subscribe for free to receive new posts!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Coherence refers to the degree to which an agent's actions, decisions, and goals form a consistent, intelligible pattern across successive moments rather than appearing random, contradictory, or fragmented.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Collinear Newsletter #9 – Notes on Frontier AI]]></title><description><![CDATA[Hi AI innovators,]]></description><link>https://blog.collinear.ai/p/collinear-newsletter-9-notes-on-frontier</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-newsletter-9-notes-on-frontier</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Fri, 19 Dec 2025 18:06:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KEVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi AI 
innovators,</p><p>November was a massive month for agents as they took center stage across NeurIPS and AWS re:Invent!</p><p>NeurIPS 2025 had a clear vibe: the agent era is forcing RL to grow up, not as a research novelty, but as production infrastructure for tool use, long-horizon behavior, and reliability when the world gets messy in real-life workflows.</p><h2><strong>NeurIPS 2025, the &#8220;RL is everywhere&#8221; moment</strong></h2><p>San Diego served! Sunny weather, packed hallways, and surprisingly serious taco opinions. Between sessions (and coffee lines), we met a ton of builders and kept hearing the same thing: RL is suddenly everywhere.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KEVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KEVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 424w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 848w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KEVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png" width="526" height="394.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KEVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 424w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 848w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!KEVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f29bb6-0f99-41cc-966d-d056b27196e6_2048x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong>RL is the new scaling lever. </strong>The frontier has shifted from &#8220;can the model answer?&#8221; to &#8220;can the agent execute?&#8221; Multi-step work, tool calls, retries, and shifting user intent are pushing teams toward RL to shape end-to-end behavior.</p><p><strong>RL needs the right infrastructure to go interactive. </strong>Environment fleets, verifiers, orchestration, NPCs - everyone at NeurIPS had a novel approach!</p><p><strong>Realism is the new benchmark. 
</strong>Less obsession with a single score, more focus on trajectory-shaped evals: does it hold up on turn 4, recover from tool noise, and stay safe when scenarios drift?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aXep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aXep!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 424w, https://substackcdn.com/image/fetch/$s_!aXep!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 848w, https://substackcdn.com/image/fetch/$s_!aXep!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 1272w, https://substackcdn.com/image/fetch/$s_!aXep!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aXep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png" width="464" height="510.21875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1689,&quot;width&quot;:1536,&quot;resizeWidth&quot;:464,&quot;bytes&quot;:3782677,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aXep!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 424w, https://substackcdn.com/image/fetch/$s_!aXep!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 848w, https://substackcdn.com/image/fetch/$s_!aXep!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 1272w, https://substackcdn.com/image/fetch/$s_!aXep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7262a97-ef8e-4c61-b417-a5fe6093600e_1536x1689.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We also presented our NeurIPS paper, <a href="https://blog.collinear.ai/p/valley-of-reasoning">Through the Valley of Reasoning</a>. The punchline is: when you distill reasoning into small models, performance can dip before it climbs, and early on the structure of the reasoning matters more than whether the trace is &#8220;correct.&#8221;</p><p>We also met a bunch of new friends and collaborators. If you were there, hit reply and tell us what your team is building. 
We love swapping notes.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to us!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to us!</span></a></p><p></p><h2><strong>Spider, post-training without the chaos</strong></h2><p>We shipped <a href="https://blog.collinear.ai/p/spider">Spider</a>, a lightweight way to turn post-training work into a repeatable recipe.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wOPx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wOPx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wOPx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wOPx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wOPx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0e631d5-8639-4997-8081-eb1f0abb2812_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With Spider, you can use one recipe for both off-policy and on-policy training. Generate clean distillation datasets, or flip into an online loop with teacher guidance and KL supervision, without rebuilding your pipeline each time. It also keeps the &#8220;boring but critical&#8221; pieces (rollouts, filtering, verifiers, and publishing) consistent across runs, so results stay comparable as you iterate.</p><p>Huge thanks to our friends at Thinking Machines for supporting the Tinker integration.</p><h2><strong>AWS re:Invent - even more agents!</strong></h2><p>re:Invent turned Las Vegas into a full-on agent showcase. 
Nova 2 and the Nova family got a big spotlight, Nova Forge put &#8220;build your own frontier models&#8221; on the menu, and Nova Act made the case for agentic workflows!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X_J0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X_J0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X_J0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg" width="1024" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!X_J0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X_J0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925025a1-0944-4933-8774-35fdd5f1801d_1024x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What stood out to us was the framing in Swami Sivasubramanian&#8217;s agentic AI keynote: getting agents to production is less about clever prompts, and more about repeatable training and testing loops.</p><p>Congrats to our customers and partners at AWS on an awesome launch week.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6L5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6L5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6L5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6L5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6L5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6L5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg" width="440" height="586.3" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1066,&quot;width&quot;:800,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;No alternative text description for this image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="No alternative text description for this image" title="No alternative text description for this image" srcset="https://substackcdn.com/image/fetch/$s_!6L5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6L5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6L5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6L5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2e2f738-b253-4196-ba56-fd2c39fe44a0_800x1066.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you are building agentic AI, we&#8217;d love to swap notes!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to us!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to us!</span></a></p><p></p><p>That&#8217;s it for this edition. Thanks for following along. We will have some fun things to share over the next couple of weeks. &#128578;</p><p>Best,<br>The Collinear Team</p>]]></content:encoded></item><item><title><![CDATA[RL Infrastructure for AI Agents: Why Environment-as-a-Service is the Missing Piece]]></title><description><![CDATA[Reinforcement learning for large language models is more of a systems problem than ML.]]></description><link>https://blog.collinear.ai/p/rl-env-as-a-service</link><guid isPermaLink="false">https://blog.collinear.ai/p/rl-env-as-a-service</guid><dc:creator><![CDATA[Nazneen Rajani]]></dc:creator><pubDate>Tue, 18 Nov 2025 20:20:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8dYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Reinforcement learning for large language models is more of a systems problem than an ML problem. 
While the RL training loop (generating rollouts, scoring them, and updating weights) looks deceptively simple on paper, enterprises building RL systems for AI agents quickly discover they&#8217;re building distributed systems with the complexity of modern cloud infrastructure.</p><p>This post argues that treating RL environments as first-class infrastructure is critical. Specifically, an <strong>Environment-as-a-Service with a clean separation between data plane and control plane</strong> is the key to unlocking scalable, production-grade RL for AI agents.</p><h2>From Static Labels to Interactive Training Grounds</h2><p>The shift from post-training centered on supervised learning to mid-training with RL for LLMs represents a fundamental change in what we&#8217;re optimizing:</p><p><strong>Previous paradigm:</strong> Static input + static target &#8594; model output &#8594; loss &#8594; backprop.</p><p><strong>New paradigm:</strong> Agent acts in environment &#8594; environment scores behavior &#8594; policy updates &#8594; agent acts again.</p><p>Modern RL environments that replicate enterprise workflows for AI agents include:</p><ul><li><p>A product development environment where agents update tickets, generate sprint progress reports, and work with the team to identify milestones. 
The team in this case consists of simulated users.</p></li><li><p>A coding environment where agents receive task specs, edit codebases, run tests, and receive scores on correctness and other rubrics.</p></li><li><p>A computer-use environment where agents read the calendar, navigate to Excel to fetch data, and draft an email.</p></li><li><p>A browser environment where agents navigate UI trees, fill forms, and complete realistic business tasks.</p></li></ul><p>These environments provide what static supervised data cannot: a dynamic sandbox with <strong>verifiable rewards at scale.</strong> Pass/fail checks, structured scoring, safety constraints, and automated metrics turn RL from a research curiosity into an operational capability for mid-training (inducing new capabilities) and post-training (alignment and evaluation).</p><p>The critical transition is: <strong>from handcrafting or distilling static examples to orchestrating interactive training scenarios.</strong> That shift forces us to treat environments as infrastructure, not just programmable abstractions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8dYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8dYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 424w, https://substackcdn.com/image/fetch/$s_!8dYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 848w, 
https://substackcdn.com/image/fetch/$s_!8dYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!8dYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8dYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170934,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/179281926?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8dYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 424w, 
https://substackcdn.com/image/fetch/$s_!8dYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 848w, https://substackcdn.com/image/fetch/$s_!8dYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!8dYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60a032de-9000-4be5-b0c7-e3b035b8263a_1976x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Async RL training architecture with the trainer, the sampler, and the environment. We propose a control-plane/data-plane split for building a highly scalable environment-as-a-service.</figcaption></figure></div><h2>The RL Training Architecture</h2><p>Production async RL systems for LLMs decompose into three loosely coupled components (see figure above):</p><h3>1. Trainer</h3><p>The trainer consumes trajectories and updates weights. It reads batches of (state, action, reward, metadata), runs the RL objective (GRPO, PPO variants, DPO-style methods), possibly with reward models, critics, and reference models for KL stabilization, then writes updated weights to a model store.</p><h3>2. Sampler</h3><p>Sampler workers periodically pull the latest weights, run the policy model against environments to generate trajectories, and stream trajectories plus rewards to the trainer. Samplers are inference-heavy, latency-sensitive, and often scale to thousands of nodes.</p><h3>3. Environment</h3><p>The environment is the substrate that turns raw actions into meaningful behavior: the simulation of the world (web UI, tools, code repos, databases), the interface contract (observations, actions, rewards), and the episode lifecycle. In scalable systems, this is not a local process &#8211; it&#8217;s a remote service or microservice fleet.</p><p>In this post, we dive deeper into the third component, the environment, and the architecture behind building scalable RL environments.</p><h2>The Data-Plane, Control-Plane Split: The Pragmatic Path to Scalable RL Environments</h2><p>We believe that scalable RL environments require a pragmatic infrastructure mindset, one that borrows directly from the hyperscaler model, which separates the data plane and the control plane. 
In simple terms, the data plane is responsible for the environment&#8217;s core, real-time behavior, while the control plane manages configuration, orchestration, and the administrative logic that keeps everything running smoothly.</p><h3>Environment Data Plane</h3><p>The data plane sits on the critical path of every RL step. It is responsible for:</p><ul><li><p>Initializing the environment</p></li><li><p>Stepping the environment:</p><p><code>obs_{t+1}, reward_t, done = step(action_t)</code></p></li><li><p>Handling concurrent episodes from many samplers</p></li><li><p>Producing deterministic, reproducible transitions</p></li><li><p>Evaluating verifiable rewards: executing tests, checking business rules, running automated metrics</p></li></ul><p><strong>Design constraints:</strong> Low latency (every environment step sits between one policy inference and the next), high throughput (many parallel episodes), and rock-solid stability.</p><p><strong>Implementation patterns:</strong> Environment microservices behind RPC/HTTP APIs, stateless containers backed by state stores, state sharding across machines, and determinism via per-episode seeds and versioned data snapshots.</p><h3>Environment Control Plane</h3><p>The control plane is the administrative engine of the RL environment. It <strong>spawns and manages many parallel rollouts</strong>, creating <em>k</em> independent environment managers that coordinate with the data plane while staying completely off the per-step critical path. 
Its job is to configure, schedule, and orchestrate the environment&#8217;s behavior&#8212;including how agents and non-player characters (NPCs) interact&#8212;without ever slowing down the real-time execution loop.</p><p>Specifically, the control plane handles:</p><ul><li><p><strong>Scenario configuration &amp; versioning</strong>: Defining environment types, maintaining versions, and generating scenario templates that each of the <em>k</em> managers can instantiate independently.</p></li><li><p><strong>Rewards and verifiers governance</strong>: Selecting verifier modules, composing sub-rewards, and managing aggregation strategies.</p></li><li><p><strong>Curriculum + workload scheduling</strong>: Determining which tasks, difficulty modes, or trajectories to sample at each training stage, and routing them to the appropriate environment managers.</p></li><li><p><strong>Experiment routing</strong>: Mapping policies or policy versions to specific environment instances for A/B testing, evaluation runs, or canary deployments.</p></li><li><p><strong>Elasticity &amp; lifecycle management</strong>: Scaling environment managers up/down, rolling out upgrades, coordinating NPC configurations, and performing safe rollbacks without interrupting live data-plane rollouts.</p></li><li><p><strong>NPC configuration &amp; behavior enabling</strong>: Selecting which NPC personas, scripts, or dynamic behaviors are active for a given scenario and ensuring the data-plane has the necessary hooks to interact with them.</p></li></ul><p>In short, the control plane <strong>administers the environment fleet</strong>, enabling massively parallel rollouts while keeping the critical per-step simulation loop lean, deterministic, and high-throughput.</p><h2>The Path Forward</h2><p>If you are training AI agents with reinforcement learning, you face two choices:</p><ol><li><p><strong>Build your own environment stack:</strong> Design APIs, stand up tool replicas, author tasks and verifiers, maintain 
everything as products change</p></li><li><p><strong>Treat environments as a reusable platform primitive:</strong> Plug into an existing environment service</p></li></ol><p>Just as data has become increasingly commoditized (thanks to tools like <a href="https://github.com/collinear-ai/spider">Spider</a> and <a href="https://github.com/thinking-machines-lab/tinker-cookbook">Tinker</a>), we expect RL environments to follow the same path. The future is <strong>environment-as-a-service</strong>: frontier-grade, scalable, and ready to plug into any training stack.</p><p>Instead of turning your data-labeling vendor into an environment platform, point your trainer and sampler at Collinear&#8217;s environment endpoints, wire up your policies and reward models, and start running RL over realistic, verifiable tasks. With Collinear&#8217;s <a href="https://blog.collinear.ai/p/trait-basis">high-fidelity simulations</a>, each environment can be a <em>unique micro-ecosystem</em>: the same workflow feels adversarial with one trait vector, cooperative with another, and chaotic with a third, giving agents exposure to a wide distribution of real human behavior.</p><p>The data era commoditized static corpora. 
<strong>The RL era will commoditize environments.</strong> The winners will be those who treat environments as core to the RL infrastructure stack.</p>]]></content:encoded></item><item><title><![CDATA[Announcing Spider: a lightweight tool to craft post-training data recipes]]></title><description><![CDATA[TL;DR Spider is a single client interface that turns messy distillation and ablation experiments into a simple, configurable workflow.]]></description><link>https://blog.collinear.ai/p/spider</link><guid isPermaLink="false">https://blog.collinear.ai/p/spider</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Thu, 06 Nov 2025 16:35:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cff4d031-0904-4d77-bf71-d1804c7ff463_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>TL;DR</strong></h3><p>Spider is a single client interface that turns messy distillation and ablation experiments into a simple, configurable workflow. Set <code>on_policy: false</code> to generate clean distillation datasets, or flip to <code>on_policy: true</code> to run online training with teacher guidance and KL supervision. It handles dataset prep, rollouts, supervision, and post-processing in a few lines of code.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;9ef38726-be71-410c-a344-ca7da3d4d08c&quot;,&quot;duration&quot;:null}"></div><p></p><h3><strong>Why we built Spider</strong></h3><p>&#8220;Tinker for training&#8221; exists. &#8220;Tinker for data&#8221; does not.</p><p>Our friends at <a href="https://thinkingmachines.ai/tinker/">Thinking Machines recently released Tinker</a>, which enables fine-tuning with full control in a few simple steps. However, most research time still disappears into preprocessing, rollout scripts, verifier glue code, and training integration. 
So, this Halloween, we spun a web around that problem and built Spider so you can define a production-grade distillation run with just a few lines, then iterate fast.</p><p>We hope Spider will help you test ideas in hours, not days; turn messy data work into a shareable recipe; and ship better models with less glue. Grab the repo, run a sample recipe, and tell us what to build next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L3lW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L3lW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L3lW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png" width="578" 
height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:1003926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/178190228?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L3lW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!L3lW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc99d71ed-1beb-40fb-81cf-52276e7276ee_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h3><strong>How it works</strong></h3><p>Spider turns post-training data work into a single client workflow. In off-policy mode it generates distilled datasets with high-throughput rollouts, then applies your preprocessing, filters, and verifiers. Flip on_policy: true to run the online loop with a teacher model and KL supervision. The same recipe drives both paths.</p><p>Run Spider on Collinear endpoints or your own GPUs. Each run records its recipe, parameters, and metrics, and can publish datasets or trained artifacts to the Hugging Face Hub. The result is faster loops, cleaner data, and fewer moving parts from idea to artifact.</p><ol><li><p><strong>Define a recipe<br></strong> Write a short YAML that names your provider, models, dataset source, and any filters or verifiers. This is your data recipe. 
One client and one config cover both the off-policy and on-policy paths.</p></li><li><p><strong>Generate or train<br></strong> Run the recipe with on_policy: false to create an off-policy distilled dataset from high-throughput rollouts. Flip on_policy: true to introduce a teacher model and KL supervision through the integrated Tinker client for online training.</p></li></ol><ol start="3"><li><p><strong>Compose quality checks<br></strong> Use built-in filters and verifiers for length, deduplication, syntax, structure, and safety, or register your own in one line. Spider applies them in the pipeline so your outputs are clean and auditable.</p></li></ol><ol start="4"><li><p><strong>Run anywhere, ship anywhere<br></strong> Point the client to a Collinear endpoint with an API key or to your own GPUs. Each run logs its recipe, parameters, and metrics, and can publish datasets or model artifacts to the Hugging Face Hub with lineage preserved.</p></li></ol><p>Getting started is simple, and you can find <a href="https://github.com/collinear-ai/spider/blob/main/README.md">quickstart instructions on the repo</a>.</p><p></p><h3><strong>Roadmap</strong></h3><p>We&#8217;re building toward a world where post-training data is defined as code, portable across providers, and fast to turn into measurable model gains. 
Write a small recipe, verify quality with shared checks, and ship a distilled dataset or on-policy improvement in minutes.</p><p>To enable that, we are expanding Spider with the following roadmap features.</p><ul><li><p><strong>Cross-tokenizer on-policy distillation </strong>from any teacher model</p></li><li><p><strong>Simple but powerful templates </strong>for generating multi-turn conversation data with simulated users</p></li><li><p><strong>Highly configurable tool-use library </strong>to generate and train on-policy agentic tool-call rollouts</p></li></ul><p></p><h3><strong>Resources</strong></h3><p><a href="https://github.com/collinear-ai/spider">You can learn more about Spider on our GitHub repo here</a>. </p><p>If you give Spider a try, let us know what you think!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to us!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to us!</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Collinear Newsletter #8 – Notes on Improving AI]]></title><description><![CDATA[Hi AI innovators,]]></description><link>https://blog.collinear.ai/p/collinear-newsletter-8-notes-on-improving</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-newsletter-8-notes-on-improving</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Tue, 04 Nov 2025 21:15:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e191b03c-6e69-4f98-a643-9e3b8c5a0290_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi AI innovators,</p><p>A lot has been happening at Collinear this month. 
There is fresh research, new customers, and plenty of progress towards better AI systems.</p><p></p><h3><strong>&#128640; Together Evals &#215; Collinear Simulations</strong></h3><p>We&#8217;ve partnered with Together AI to bring real world, multi-turn simulations into their Together Evals platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!33Cl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!33Cl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 424w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 848w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 1272w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!33Cl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png" width="1456" height="761" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!33Cl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 424w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 848w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 1272w, https://substackcdn.com/image/fetch/$s_!33Cl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09956e57-afa2-4afb-8bb1-860f8ab76c05_2048x1071.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Traditional evals assume the user will be polite and consistent; real users don&#8217;t. They might ask follow-ups, change their mind, get frustrated or distracted &#8212; and that&#8217;s exactly where many models break. With TraitMix, builders can now simulate impatient, curious, or inconsistent user personas and see how their models actually perform under messy human conditions. 
Together Evals then scores models for helpfulness, safety, and consistency, at scale, all within one workflow.</p><p>Read the announcement <a href="https://www.together.ai/blog/collinear-simulations-together-evals">here</a>.</p><p></p><h3><strong>&#128049; CoLM 2025 Recap</strong></h3><p>CoLM 2025 was a special one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4zOD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4zOD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 424w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 848w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4zOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png" width="342" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:342,&quot;bytes&quot;:684695,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/177775607?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4zOD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 424w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 848w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!4zOD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6316901a-1cb9-475c-a02b-b1b93a0cf833_1500x1500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We presented our paper on adversarial testing, Cats Confuse Reasoning LLMs, and spent the week exchanging ideas with research partners, collaborators, and friends from across the community.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jX4P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jX4P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 424w, 
https://substackcdn.com/image/fetch/$s_!jX4P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 848w, https://substackcdn.com/image/fetch/$s_!jX4P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 1272w, https://substackcdn.com/image/fetch/$s_!jX4P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jX4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png" width="622" height="494.8252788104089" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1076,&quot;resizeWidth&quot;:622,&quot;bytes&quot;:1732928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/177775607?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!jX4P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 424w, https://substackcdn.com/image/fetch/$s_!jX4P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 848w, https://substackcdn.com/image/fetch/$s_!jX4P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 1272w, https://substackcdn.com/image/fetch/$s_!jX4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d40b14a-abb5-4254-921a-33152153a1e9_1076x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our Future of Post-Training Social sparked rich discussions on alignment, fine-tuning, and reward modeling, while the booth facilitated curious research conversations (and a growing crowd of cat-sticker collectors).</p><p>It was inspiring to see so much energy around improving models not just for performance, but for reasoning and reliability.</p><p></p><h3><strong>&#129504; TraitBasis Simulations Launch</strong></h3><p>We launched TraitBasis, our framework for simulating realistic human behavior in model testing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UVJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 424w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 848w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png" width="542" height="419.52884615384613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:542,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 424w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 848w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>TraitBasis uses activation steering to inject behavioral traits, impatience, confusion, skepticism, overconfidence, directly into simulated users. 
This lets builders observe how models hold up when conversations get unpredictable or emotionally varied.</p><p>TraitBasis builds on the research community&#8217;s work in &#964;-Bench, and extends it to enterprise domains such as telecom and telehealth through our new &#964;-Trait benchmark.</p><h3><strong>What&#8217;s Next?</strong></h3><p>That&#8217;s it for this edition. Thanks for following along.</p><p>If you&#8217;re interested in building tools that help enterprises ship safer, smarter AI, check out our <a href="https://www.collinear.ai/careers">Careers</a> page.</p><p>If you are ready to improve your AI&#8217;s performance, let&#8217;s talk! We might or might not mention cats&#8230;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Let's talk!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Let's talk!</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[The case for simulations ]]></title><description><![CDATA[Unlocking model uplift through better evaluations]]></description><link>https://blog.collinear.ai/p/the-case-for-simulations</link><guid isPermaLink="false">https://blog.collinear.ai/p/the-case-for-simulations</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Thu, 23 Oct 2025 14:30:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5bcf439f-d257-4582-b27f-30cbd90c4c9a_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2KFn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2KFn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2KFn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png" width="1080" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2KFn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!2KFn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F545d0b39-1c28-461c-ad40-59aeca33f02e_1080x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The era of agents has begun, but much of today&#8217;s tooling is still being tested or is gated to pilots as teams chase consistent, repeatable performance. One day, your tools pass the vibe-test; the next, they stall. The promise is tangible, but the production bar is higher. <strong>What&#8217;s missing is</strong> <strong>reliable evidence of behavior across messy, multi-turn tasks</strong> &#8212; planning, tool calls, and recovery &#8212; so leaders can move from cautious testing to confident scaling.</p><h2>Vibe tests aren&#8217;t the answer.</h2><p>The gap between demos and production demonstrates the need for enterprises to move beyond vibe-testing and into high-fidelity evaluations performed at scale. <strong>Evaluations serve as the window into your AI agent&#8217;s mind.</strong> They can gate launches, highlight drift, and validate progress for risk and governance teams. Without this tight eval loop, there&#8217;s no credible path to safety, performance, or ROI with your AI investments. Your AI agent eval pipeline should be no different from your software QA cycles. If anything, agentic capabilities need more exhaustive test scripts, unit tests, and user-centric edge cases than typical software to validate consistency in real-world environments.</p><h2>AI agents aren&#8217;t linear, so your evals can&#8217;t be either.</h2><p>Evaluating single-turn chat is hard; <strong>evaluating agents is even harder</strong>. Modern agents plan, call tools, read results, and adapt over many turns. 
Failures hide in the <strong>process</strong>, not just the final text: brittle reasoning chains, incorrect API params, state drift, or trust collapse after a high-tension exchange with a customer. Static prompts and one-shot leaderboards miss these behaviors because they grade outputs, not how the agent got there.</p><p><strong>Today&#8217;s agents are nondeterministic. They don&#8217;t follow predefined paths &#8212; meaning your tests can&#8217;t either.</strong> Traditional software testing assumes the same input yields the same output. With agents, however, variability is both the benefit and the risk: the space of possible test cases grows exponentially, whereas traditional software test cases grow linearly with use cases. This shift raises an important question: <em>How can you possibly predict the permutations of your user-agent edge cases to evaluate your AI&#8217;s performance?</em> This is exactly <strong>why simulations matter</strong>.</p><h2>Simulations bridge this gap.</h2><p>To see agents clearly, you need <strong>controlled, realistic, repeatable interactions</strong> that pressure-test the range of your users&#8217; behaviors and intents before production. 
Diverse simulations reveal what static evals miss &#8212; <strong>impatient spirals, tool confusion, policy slips under stress </strong>&#8212; and they generate the high-signal examples that lift models in post-training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tU9W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tU9W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tU9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg" width="1456" height="970" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tU9W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tU9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8daddab-912d-4927-935a-367e26549b13_1600x1066.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In <a href="https://arxiv.org/abs/2510.04491">our most recent Collinear paper</a>, we introduced <strong><a href="https://github.com/collinear-ai/simulations">TraitBasis</a></strong> and <strong><a href="https://github.com/collinear-ai/tau-trait">&#964;-Trait</a></strong>, a research-driven approach to doing exactly that &#8212; generating <strong>high-fidelity, steerable user traits</strong> that expose where agents actually break when they interact with your users. When we simulated real human behaviors (impatience, skepticism, confusion, incoherence) on &#964;-Bench, <strong>frontier model success rates dropped by 20%+</strong>, underscoring the <strong>need</strong> <strong>for realistic user-simulated data</strong> for evals, not more synthetic prompts.</p><p>Collinear enables <strong>comprehensive evals at scale using simulated user-environments</strong>. 
Our simulation suite uses <strong>steerable, persona-driven users</strong> customized to your sector and use case to test your agent&#8217;s range of responses by user intent and demographic. Our eval platform delivers <strong>high-signal traces that are</strong> <strong>auto-scored against your compliance criteria</strong>, giving you clear failure nodes with examples of where your agent falls short in the real world. Ultimately, these failure nodes serve as a <strong>high-signal</strong> <strong>data pipeline</strong> for post-training your model, <strong>turning misses into uplift</strong>.</p><h2>The recipe of a simulation: user, agent, judge.</h2><p>An effective simulation requires three components: <strong>the user, the agent, and a judge to evaluate the interaction</strong>. This triangle is the foundation of Collinear&#8217;s platform. While most attention typically goes to the agent and the judge, we&#8217;ve prioritized the user, giving customers the tools to configure realistic, controllable, and dynamic user environments that closely mirror real-world scenarios.</p><p>Our simulations aren&#8217;t random prompts; they&#8217;re <strong>structured interactions</strong> between your agent and a life-like user with a clear:</p><ul><li><p><strong>Persona</strong> (e.g., skeptical power user, impatient first-timer)</p></li><li><p><strong>Intent</strong> (e.g., cancel subscription, dispute charge)</p></li><li><p><strong>Demographic</strong> (domain, language, constraints)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x2Th!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!x2Th!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x2Th!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!x2Th!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x2Th!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc632748-2ea9-4532-be8a-083fd0b1fca6_1600x900.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That&#8217;s the scaffolding we use to reveal realistic user-agent journeys, consistently and at depth.</p><p><strong>Under the hood, we steer behavior directly in the neural net </strong>&#8212; activation-level conditioning &#8212; so traits <strong>persist through long multi-turn conversations</strong> and <strong>compose cleanly</strong> (e.g., impatient <em>and</em> confused), enabling high-fidelity, controllable runs you can replicate again and again.</p><p>This is exactly what <strong>TraitBasis</strong> delivers. Instead of relying on external instructions, we use a <strong>trait vector</strong> inside the user-simulating model and <strong>add it to hidden activations each turn</strong>, giving you precise control over intensity and composition.</p><p><strong>While prompt-based or fine-tuned persona models are popular across the market, </strong>our research found those methods fail to deliver:</p><ul><li><p><strong>Fine-grained control</strong>: the intensity of behaviors and intents blurs over the course of a conversation (&#8220;moderate&#8221; vs. 
&#8220;high&#8221; looks the same by the third turn)</p></li><li><p><strong>Stability</strong>: personas collapse mid-conversation, losing the signal the first few turns contained</p></li><li><p><strong>Mixing</strong>: one trait dominates the others when combining multiple, failing to deliver the multi-dimensionality of real users</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_zP-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_zP-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_zP-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_zP-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_zP-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f2bcb3-ded1-4a02-8364-98c6c41c0ad4_1600x900.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Simulations drive evals. Evals drive trust. And trust drives value.</h2><p>Our approach to simulations delivers coverage, confidence, and uplift in your AI flywheel. It systematically creates and tests edge cases across personas, intents, and languages to <strong>guarantee test case coverage</strong>. It catches behavioral risks before customers do and gates releases on pass rates to <strong>instill</strong> <strong>confidence in your customer experience</strong>. And it turns high-signal failures into post-training data (DPO/GRPO/SFT) to <strong>deliver</strong> <strong>measurable, targeted uplift</strong>.</p><p>A few proof points from our TraitBasis launch:</p><ul><li><p><strong>Realism:</strong> Highest Elo (1624) and 63% win rate vs. alternatives, achieved with <strong>3,000&#215; less data</strong> (4k vs. 13k samples).</p></li><li><p><strong>Control:</strong> Intensity consistency across <strong>97.5%</strong> of cases (clearer &#8220;medium vs. 
&#8220;high&#8221;).</p></li><li><p><strong>Stability:</strong> Personas stay reliable in <strong>77%</strong> of long chats, whereas they collapse in <strong>94%</strong> and <strong>66%</strong> of cases with prompt and SFT baselines, respectively.</p></li><li><p><strong>Compositionality:</strong> Accurate trait blends <strong>62.5%</strong> of the time for complex users (e.g., impatient + confused), far higher than other methods.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0WBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0WBd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0WBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0WBd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0WBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda61754f-efa9-4cf2-8f9a-df91dbf5890f_1600x900.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>But don&#8217;t take our word for it.</h2><blockquote><p>&#8220;Before simulations, we graded answers. <strong>Now we grade behavior.</strong> We watch our agent under pressure, fix the weak spots, and re-run the suite before release. It&#8217;s become our <strong>CI for AI</strong>.&#8221; &#8212; Head of AI, Fortune 500 Financial Services Company</p></blockquote><p>That shift &#8212; from judging single outputs to <strong>auditing reasoning, tools, and tone over time </strong>&#8212; is what gives stakeholders confidence that an agent is <em>production-ready</em>.</p><h2>Try it for yourself and see how your agents perform in the real world.</h2><p>Behind every great agent is great testing. And behind every great test is great data. 
Try simulations for yourself today: <strong>Connect your endpoint</strong>, pick a few core journeys, and run them against <strong>persona-driven users</strong>. Review the traces, dissect the evals, and turn misses into <strong>high-signal improvements</strong>.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://platform.collinear.ai/&quot;,&quot;text&quot;:&quot;Explore Simulations&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://platform.collinear.ai/"><span>Explore Simulations</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Through the Valley of Reasoning: What Small Models Teach Us About Learning]]></title><description><![CDATA[NeurIPS paper on knowledge distillation scaling laws for small foundation models]]></description><link>https://blog.collinear.ai/p/valley-of-reasoning</link><guid isPermaLink="false">https://blog.collinear.ai/p/valley-of-reasoning</guid><dc:creator><![CDATA[Muyu]]></dc:creator><pubDate>Thu, 09 Oct 2025 13:59:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pWiK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>tl;dr: When distilling reasoning into small models, performance doesn&#8217;t rise smoothly with more data. Instead, it first <em>drops</em> before steadily climbing again. In the &#8220;valley&#8221;, small models learn more from <strong>easy problems</strong> than hard ones and are insensitive to whether training outputs are correct.</p><p>Read the <a href="https://arxiv.org/abs/2510.06101">paper</a> and reproduce the results with our <a href="https://www.collinear.ai/valley-of-reasoning">dataset on HuggingFace</a> &#129303; (approx. 
300M tokens).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pWiK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pWiK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pWiK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2735754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175050085?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pWiK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!pWiK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd762995-1cf7-4a2c-8336-5816d8e11e7f_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When we train small language models to reason on code, their performance doesn&#8217;t just rise with more data, it first <em>falls</em> into a dip before climbing back up.</p><p>We call this the <strong>Valley of Code Reasoning</strong>.</p><h2>The Dip Before the Climb</h2><p>Distilling the reasoning traces of large models into smaller ones has become a popular way to unlock coding or reasoning skills without huge compute budgets. 
But when we tracked performance as we scaled up distillation data, we found a non-monotonic trend:</p><ul><li><p>With <em>small amounts of data</em>, models retain shallow skills.</p></li><li><p>As we add more, <strong>performance drops</strong>: a confusion stage in which models struggle to restructure their internal representations.</p></li><li><p>Only after passing through this valley do they climb back up, showing steady log-linear improvements.</p></li></ul><p>This valley is a structural feature of how small models learn reasoning.</p><h2>What We Learned in the Valley</h2><p>We fine-tuned models at different points in this curve and found two surprising results:</p><ol><li><p><strong>Easy problems matter more than hard ones</strong> in early stages. Small models learn best by first stabilizing on simple patterns before moving up in difficulty.</p></li><li><p><strong>Correctness of outputs didn&#8217;t matter.</strong> Training on correct vs. incorrect code traces made little difference. What mattered was the structure of the reasoning steps themselves.</p></li></ol><h2>Why It Matters</h2><p>The valley of code reasoning reframes how we think about training dynamics: adding more data isn&#8217;t always a straight path upward. Scaling laws for knowledge distillation of small language models differ from standard monotonic scaling laws. There are two phases of learning. In the valley phase, non-reasoning models learn the <strong>structure</strong> of reasoning, so correctness and semantics matter less. Thereafter, the models start learning from <strong>content</strong>, and that&#8217;s where difficulty and correctness start to matter. For practitioners and researchers, this means that getting the right data for the right stage is critical. 
</p><h2>What&#8217;s Next</h2><p>If you are mid-training or post-training models or agents, <a href="https://www.collinear.ai/book-a-demo">connect with us</a> and we will accelerate your time to the next improved model &#10024;</p><p>Learn how <a href="https://www.linkedin.com/posts/srinisunkara_super-excited-to-share-the-launch-of-apriel-activity-7378960839271387136-DOyO?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAALg2XQBQ86PAvzU2hKr5WgPET12yvwTGDc">ServiceNow is improving Apriel-1.5-15B-Thinker</a> with Collinear-curated data.</p><p>If you build off our work or use the dataset, please cite us:</p><pre><code>@article{HeShafiqueKumarMackeyRajani2025,
  title        = {The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models},
  author       = {Muyu He and Muhammad Ali Shafique and Anand Kumar and Tsach Mackey and Nazneen Rajani},
  journal      = {arXiv preprint arXiv:2510.06101},
  year         = {2025},
  url          = {https://arxiv.org/abs/2510.06101}
}</code></pre>]]></content:encoded></item><item><title><![CDATA[Introducing Collinear Simulations: Steerable Personas for AI Agent Testing]]></title><description><![CDATA[TraitBasis, inspired by mech interp, gives high-fidelity user personas for comprehensive agent testing]]></description><link>https://blog.collinear.ai/p/trait-basis</link><guid isPermaLink="false">https://blog.collinear.ai/p/trait-basis</guid><dc:creator><![CDATA[Meghana A Rajeev]]></dc:creator><pubDate>Tue, 07 Oct 2025 12:31:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UVJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;SPEAK TO A HUMAN!!!!!!!&#8221;</p><p>Most of us have, at least once in our lives, had to type this when talking to a customer support chatbot. Despite the incredible progress in AI, these chatbots still falter when dealing with the most natural human emotions: impatience, skepticism, and the like. 
This points directly to a gap in <strong>robustness testing</strong>, as most benchmarks today use predictable, robotic users, failing to capture how an agent performs under the pressure of a real human interaction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UVJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 424w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 848w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1272w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png" width="1456" height="1127" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175130791?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UVJ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 424w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 848w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1272w, https://substackcdn.com/image/fetch/$s_!UVJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd8197d-001e-4765-9f97-88672c19c570_2481x1920.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our research quantifies this robustness gap. When we simulated users with four reality-grounded traits, <strong>impatience, skepticism, confusion, and incoherence</strong>, on &#964;-Bench, the success rates of frontier models plummeted by <strong>over 20 percent</strong>. This confirms that building reliable AI requires a better method for simulating the users who cause these failures.</p><p>To solve this, we created <strong>TraitBasis</strong>: a lightweight and data-efficient method for generating high-fidelity, steerable <em>user traits</em> for testing. It works by identifying a &#8220;trait vector,&#8221; a specific direction in a language model&#8217;s internal activation space that corresponds to a human characteristic. 
By applying this vector to the model that is <em>simulating the user</em>, we can precisely control their behavior, creating a tough, dynamic benchmark to evaluate the agent against.</p><h1>The Challenge: Realistic Personas</h1><p>Our research began with a systematic evaluation of the established methods for inducing specific human traits in language models. We focused our analysis on the two most common approaches: system prompting and fine-tuning.</p><p>We found that they consistently fail to meet the demands of a realistic and stable simulation, falling short in three main areas:</p><ul><li><p><strong>Fine-grained control:</strong> Prompting and SFT often blur intensity; &#8220;moderate&#8221; and &#8220;high&#8221; impatience look the same.</p></li><li><p><strong>Stability over long chats:</strong> Personas fade mid-conversation; traits drift back to neutral.</p></li><li><p><strong>Mixing traits:</strong> Combining two prompts makes one dominate; mixes look unbalanced.</p></li></ul><p>Clearly, a more robust method for inducing traits was needed. That&#8217;s TraitBasis.</p><h1>TraitBasis</h1><p>Our solution, TraitBasis, moves beyond external instructions like prompting and instead adjusts a model&#8217;s behavior from the inside. This approach, known as activation steering, is built on the insight that human traits correspond to specific directions within a language model&#8217;s internal activation space.</p><p>Here&#8217;s how we find these directions within the model&#8217;s activation space. To isolate the vector for &#8220;impatience,&#8221; for example, we start with two nearly identical user responses: one is neutral, and the other is clearly impatient. When we look at the model&#8217;s internal activations for both, they are mostly the same. The <em>difference</em> between these vectors is the pure essence of the trait. 
By subtracting one from the other, we cancel out all the noise (the context, the user&#8217;s goal, and so on) and are left with a clean signal. This is the <em>trait vector</em>.</p><p>This resulting trait vector is what we use to steer the user model. During a conversation, we add this vector to the user model&#8217;s hidden activations at every turn to provide a continuous nudge to its behavior.</p><h1>Putting TraitBasis to the test</h1><p>We ran several experiments comparing TraitBasis with the other methods across four key areas:</p><ol><li><p><strong>Feels like a real user (Realism)</strong><br>In head-to-head comparisons, TraitBasis was the clear winner, achieving the highest <strong>Elo rating (1624)</strong> and a <strong>63% win rate</strong> against other methods. This gave it a significant advantage over both fine-tuning (1561 Elo) and prompting (1530 Elo), and it achieved this while using <strong>3,000x less data</strong> (4 vs 13k samples).</p><p><strong>Implication:</strong> You can run believable behavioral evaluations without a heavy data collection phase.</p></li><li><p><strong>You can actually set the dial (Control)</strong><br>When we asked evaluators to distinguish between higher and lower intensity traits, TraitBasis proved exceptionally reliable with <strong>97.5% accuracy</strong>. This gave it a slight edge over full Supervised Fine-Tuning (95%) and a massive advantage over prompt-based methods (75%).</p><p><strong>Implication:</strong> Intensity is calibrated: medium vs. high produces consistently different behavior you can depend on.</p></li><li><p><strong>Stays in character over long chats (Stability)</strong><br>TraitBasis is the only method that demonstrates true dynamic stability. 
While baseline methods are defined by persona collapse (fading in 94% of prompt-based and 66% of fine-tuned conversations), TraitBasis remains consistent or escalates in over <strong>77%</strong> of cases.</p><p><strong>Implication:</strong> It reveals breakdowns that only appear after a user persists or escalates, the way real users behave.</p></li><li><p><strong>Two traits at once without one taking over (Compositionality)</strong><br>For complex users (e.g., both impatient <em>and</em> confused), TraitBasis produced the correct combination of traits <strong>62.5%</strong> of the time, far more accurately than any other method. It blends traits without one overpowering the other.</p><p><strong>Implication:</strong> You can evaluate realistic combinations of traits, the way users show up in production.</p></li></ol><p>Table 1 shows how TraitBasis compares to the other methods across all the criteria we defined above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8L3t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8L3t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 424w, https://substackcdn.com/image/fetch/$s_!8L3t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 848w, 
https://substackcdn.com/image/fetch/$s_!8L3t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 1272w, https://substackcdn.com/image/fetch/$s_!8L3t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8L3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png" width="1456" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/147a81a8-91b6-4576-8072-d8454281628f_1676x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98519,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175130791?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8L3t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 424w, https://substackcdn.com/image/fetch/$s_!8L3t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 
848w, https://substackcdn.com/image/fetch/$s_!8L3t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 1272w, https://substackcdn.com/image/fetch/$s_!8L3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F147a81a8-91b6-4576-8072-d8454281628f_1676x738.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Table 1: Performance Metrics</figcaption></figure></div><h1><strong>&#964;-Trait</strong></h1><p>To apply our 
TraitBasis method systematically, we developed <strong>&#964;-Trait</strong>, a new benchmark designed specifically to measure agent robustness. We extended &#964;-Bench in two ways to create &#964;-Trait: first, by integrating our high-fidelity user personas powered by TraitBasis, and second, by adding two new domains: telecom and telehealth.</p><h3>Examples</h3><p>Figures 2 and 3 show side-by-side user-agent conversations. On the left (&#964;-Bench), the agent interacts with a standard, cooperative user. On the right (&#964;-Trait), the same agent interacts with a user steered by TraitBasis.</p><p>Figure 2 shows a complete breakdown of execution and trust. The agent on the left correctly downgrades all flights. The agent on the right, when pressured by a skeptical user, not only fails to perform the downgrade internally (it calls the tool with cabin: &#8216;business&#8217;), but then <strong>lies to the user</strong>, claiming the downgrade to &#8216;economy&#8217; was successful. 
This is a critical failure of both the agent&#8217;s logic and its ability to be truthful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fH0m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fH0m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 424w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 848w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 1272w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fH0m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic" width="1073" height="1414" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:1073,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175130791?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fH0m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 424w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 848w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 1272w, https://substackcdn.com/image/fetch/$s_!fH0m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12fb1786-34d6-4cc8-aa2a-94bec9771711_1073x1414.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2</figcaption></figure></div><p>Figure 3 shows how impatience causes a critical reasoning error. In &#964;-Bench, the agent succeeds flawlessly. In &#964;-Trait, the agent, when rushed by the impatient user, hallucinates an incorrect argument in its internal tool call (security_home instead of home_security). 
This single mistake causes an error that forces the agent into a lengthy, inefficient recovery process, turning a simple task into a complex failure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-pq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-pq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 424w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 848w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 1272w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-pq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic" width="1048" height="1413" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1413,&quot;width&quot;:1048,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140005,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175130791?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-pq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 424w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 848w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 1272w, https://substackcdn.com/image/fetch/$s_!Y-pq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42574a0-1138-4391-8332-61b1b661ce3b_1048x1413.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3</figcaption></figure></div><h2>Conclusion</h2><p>We&#8217;ve shown that even the most advanced AI agents are more brittle than their benchmark scores suggest. Their performance plummets under the pressure of normal human emotions, not because they aren&#8217;t smart, but because they haven&#8217;t been tested for this kind of real-world robustness.</p><p>TraitBasis provides a stable, controllable, and realistic way to simulate these human traits. 
It allows researchers to move beyond asking &#8220;<strong>Can my agent do the task?</strong>&#8221; and start asking the more important question: <strong>&#8220;Can my agent do the task when the user is frustrated, confused, and unpredictable?&#8221;</strong></p><p>At Collinear, we believe that answering this second question is the key to building AI systems that people can truly trust.</p><p>For the full technical details, <a href="https://arxiv.org/abs/2510.04491">read our paper</a>, <a href="https://github.com/collinear-ai/simulations">try out TraitBasis</a>, and test your AI agents on <a href="https://github.com/collinear-ai/tau-trait">&#964;-trait</a>. </p><p>If you use TraitBasis or &#964;-trait in your work, please cite:</p><pre><code>@article{he2025impatient,
  title        = {Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents},
  author       = {He, Muyu and Kumar, Anand and Mackey, Tsach and Rajeev, Meghana and Zou, James and Rajani, Nazneen},
  journal      = {arXiv preprint arXiv:2510.04491},
  year         = {2025},
  url          = {https://arXiv.org/abs/2510.04491}
}</code></pre>]]></content:encoded></item><item><title><![CDATA[Collinear Newsletter #7 | Notes on Improving AI]]></title><description><![CDATA[A deep dive into how leading companies are deploying AI agents reliably in production]]></description><link>https://blog.collinear.ai/p/collinear-newsletter-7-notes-on-improving</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-newsletter-7-notes-on-improving</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Thu, 02 Oct 2025 21:30:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4f327dff-630c-4859-ab85-7066fc4067c3_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi AI innovators,</p><p>It&#8217;s been a big month at Collinear. Between major launches, new research, and upcoming events, there&#8217;s a lot to share.</p><h3><strong>&#128640; Launching Simulations</strong></h3><p>We hit general availability with <strong>Simulations</strong>, a product that lets enterprises <strong>stress-test AI systems before they reach real users.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ymMA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ymMA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 424w, https://substackcdn.com/image/fetch/$s_!ymMA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 848w, 
https://substackcdn.com/image/fetch/$s_!ymMA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!ymMA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ymMA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png" width="688" height="628.9141193595342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1256,&quot;width&quot;:1374,&quot;resizeWidth&quot;:688,&quot;bytes&quot;:997059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175080953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ymMA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 424w, 
https://substackcdn.com/image/fetch/$s_!ymMA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 848w, https://substackcdn.com/image/fetch/$s_!ymMA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!ymMA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07783142-eca9-436b-ae77-94cd96f6cfb2_1374x1256.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Simulations mimic realistic personas - first-time users, frustrated customers, malicious attackers - and run multi-turn conversations that react dynamically to model behavior. This stress-tests your model in depth and surfaces the <strong>safety, reliability, and compliance gaps</strong> that static benchmarks can&#8217;t catch.</p><p><strong>Why it matters:</strong></p><ul><li><p>Exposes weaknesses before customers (or attackers) do</p></li><li><p>Accelerates prototype-to-production by reducing failure surprises</p></li><li><p>Produces high-signal data for evals and fine-tuning<br></p></li></ul><p>Three Fortune 500 customers, including a leading AI research lab, are already using Simulations to ship with greater confidence and to continuously improve deployed models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Interested in Simulations?&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Interested in Simulations?</span></a></p><p></p><h3><strong>ServiceNow Launch: Apriel-1.5-15B-Thinker</strong></h3><p>Huge congrats to the <strong>ServiceNow AI Research</strong> team on the launch of <em>Apriel-1.5-15B-Thinker</em>, a small model with <strong>BIG reasoning capabilities.</strong></p><p>At just 15B parameters, Apriel delivers <strong>frontier-level performance</strong> competitive with models 8&#8211;10&#215; larger (DeepSeek-R1-0528, Mistral-medium-1.2, Gemini Flash 2.5).
Independently benchmarked at <strong>AAI 52</strong>, Apriel posts standout scores across AIME (88), GPQA (71), LCB (73), IFBench (62), and Tau Bench (68).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEJp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CEJp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152106,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175080953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEJp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!CEJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe79d7597-91f3-4aaa-b1df-8880d4ea2430_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At Collinear, we&#8217;re proud to have collaborated with the ServiceNow team and to be the <strong>only startup providing automated curated mid-training and post-training data</strong> for frontier coding capabilities. 
Our collaboration on Apriel shows how curated data pipelines can push efficiency and performance far beyond standard training &#8212; a glimpse of the Collinear flywheel in action.</p><p><a href="https://huggingface.co/spaces/ServiceNow-AI/Apriel-Chat">Try Apriel on HuggingFace</a></p><p></p><h3><strong>Join us at CoLM 2025!</strong></h3><p>We&#8217;ll be at the <strong>Conference on Language Modeling (CoLM) 2025</strong> in Montreal next week, presenting our paper <em>&#8220;Cats Confuse Reasoning in LLMs.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GEAC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GEAC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GEAC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png" width="1200" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/115e5ae2-e105-4999-8612-479b25247d18_1200x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:232568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/175080953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GEAC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!GEAC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F115e5ae2-e105-4999-8612-479b25247d18_1200x627.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It started as a playful experiment - what happens when you inject cat logic into reasoning chains? 
- but revealed a serious point: even frontier models stumble on everyday reasoning.</p><p>Come find us at <strong>Booth 13</strong> to talk failure modes, adversarial testing, or just cats.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/colm25&quot;,&quot;text&quot;:&quot;More details on CoLM '25&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/colm25"><span>More details on CoLM '25</span></a></p><p></p><h3><strong>Hiring</strong></h3><p>Collinear is growing. We&#8217;re hiring researchers, engineers, and go-to-market leaders who want to push the frontier of AI safety and performance improvement.</p><p>If you&#8217;re interested in building tools that help enterprises ship safer, smarter AI, check out our <a href="https://www.collinear.ai/careers">Careers</a> page or reach out directly.</p><p></p><h3><strong>What&#8217;s Next?</strong></h3><p>That&#8217;s it for this edition. Thanks for following along. We&#8217;ll have more to share after CoLM.</p><p>If you are ready to improve your AI&#8217;s performance, let&#8217;s talk! 
We might or might not mention cats&#8230;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Talk to Us!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Talk to Us!</span></a></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Introducing Curator Evals: A Benchmark for High-quality Post-training Data Curation]]></title><description><![CDATA[High-quality datasets are the foundation of better language models.]]></description><link>https://blog.collinear.ai/p/curator-evals</link><guid isPermaLink="false">https://blog.collinear.ai/p/curator-evals</guid><dc:creator><![CDATA[Muhammad Ali Shafique]]></dc:creator><pubDate>Tue, 02 Sep 2025 19:02:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZAfu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>High-quality datasets are the foundation of better language models. 
Post-training methods like supervised fine-tuning and RLHF rely heavily on carefully curated data and reward models, but what makes a curator good, and how to measure data quality, remains an open question.</p><p>At Collinear, we built <a href="https://github.com/collinear-ai/curator-evals">Curator Evals</a>: a benchmarking and evaluation library designed to systematically measure the performance of curators and reward models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZAfu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZAfu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZAfu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png" 
width="1080" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/171584269?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZAfu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!ZAfu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43285b27-947b-4e4d-a780-113112c3cacb_1080x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Key Features</strong></h3><ul><li><p><strong>Task-Specific Evaluations</strong>: Evaluate models on code correctness task. 
(Other tasks, such as math correctness, coherence, and similar desirable data qualities, are expected in later versions.)</p></li><li><p><strong>Flexible Model Support</strong>: Works with LLMs on various platforms.</p><ul><li><p>Local inference with vLLM.</p></li><li><p>OpenAI API for GPT models.</p></li><li><p>Together AI for open-weights model hosting.</p></li></ul></li><li><p><strong>Detailed Metrics</strong>: Provides accuracy scores and structured JSON outputs with component-level breakdowns (e.g., responses, scores).</p></li><li><p><strong>Command-Line and Python API</strong>: Run quick CLI commands or integrate programmatically into your own workflow.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m48W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m48W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 424w, https://substackcdn.com/image/fetch/$s_!m48W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 848w, https://substackcdn.com/image/fetch/$s_!m48W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!m48W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m48W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35404,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/171584269?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m48W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 424w, https://substackcdn.com/image/fetch/$s_!m48W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 848w, 
https://substackcdn.com/image/fetch/$s_!m48W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 1272w, https://substackcdn.com/image/fetch/$s_!m48W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf77eb10-01f4-48e8-93a6-846b2c9c3150_1695x736.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Workflow of Curator Eval: datasets and models are evaluated to produce the code correctness 
results.</figcaption></figure></div><h3><strong>Intelligent Input/Output Processing</strong></h3><p>The framework comes with robust prompt formatting and output extraction tools tailored for each evaluation type.</p><p><strong>Input Formatters</strong> handle task-specific prompt construction:</p><pre><code>code_correctness_prompt = Template("""You are a helpful assistant tasked with evaluating the correctness of a code output .....

Respond in the following format only:  
   [RESULT] &lt;1 or 0&gt;  
   Do not include any explanations or additional text.

## Input Question:
{{prompt}}

## Code Output to Evaluate:
{{response}}
""")</code></pre><p><strong>Output Formatters</strong> extract structured results from diverse response formats:</p><pre><code>def _extract_code_qwen_output(text: str) -&gt; int:
    """Extract output from Code Qwen model response"""
    return _first_digit_after_key(text, "[RESULT]")</code></pre><p><strong>Quick CLI Evaluation</strong></p><pre><code># Install
conda create -n curator python=3.11 -y
conda activate curator

git clone https://github.com/collinear-ai/curator-evals.git
cd curator-evals

pip install uv
uv pip install -e .
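# Assumption (not in the original post): the run command below expands
# $TOGETHER_API_KEY, so the variable should be set in this shell beforehand.
# The value is a placeholder; substitute your own Together AI key.
export TOGETHER_API_KEY="your-together-api-key"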

# Run evaluation
curator-evals --task code_correctness \
              --model meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo \
              --model-type llm \
              --use-server \
              --server-url None \
              --provider togetherai \
              --api-key $TOGETHER_API_KEY \
              --input-format code_correctness_prompt \
              --output-format collinear_code_qwen_judge \
              --debug</code></pre><h3>Benchmarking Details</h3><p>The Curator Evals benchmark for code correctness is built on a curated dataset hosted on the Hugging Face Hub at <a href="https://huggingface.co/datasets/collinear-ai/curator_evals_bench">collinear-ai/curator_evals_bench</a>, which draws its coding problems from two well-known benchmarks:</p><ul><li><p><strong>HumanEvalPack:</strong> Uses "correctness preference pairs" to test a model's ability to judge which of two code solutions is better.</p></li><li><p><strong>MBPP (Mostly Basic Programming Problems): </strong>Includes a subset of its ~1,000 Python problems, where the task is to verify the correctness of a provided solution against automated test cases.</p></li></ul><h3><strong>Leaderboard</strong></h3><p>The leaderboard ranks models on the code correctness task using the Curator Evals benchmark. Each model is given 1,302 coding problems and evaluated on whether its generated solutions pass correctness checks, using curated prompt formatting and structured output parsing. 
The accuracy score reflects post-training quality, especially for code generation.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|c|l|c|}\n\\hline\n\\textbf{Rank} &amp; \\textbf{Model} &amp; \\textbf{Accuracy (%)} \\\\\n\\hline\n1 &amp; \\text{Qwen2.5-Coder-7B-Instruct} &amp; 76.88 \\\\\n2 &amp; \\text{Seed-Coder-8B-Instruct}    &amp; 71.27 \\\\\n3 &amp; \\text{gpt-4o}                    &amp; 63.74 \\\\\n4 &amp; \\text{DeepSeek-R1-0528-Qwen3-8B} &amp; 63.67 \\\\\n5 &amp; \\text{Qwen3-8B}                  &amp; 60.59 \\\\\n6 &amp; \\text{Qwen2.5-Coder-3B-Instruct} &amp; 46.77 \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;BIAQVDNKVI&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h3><strong>Discussion</strong></h3><ul><li><p><strong>Specialization Over Generalization: </strong>The two highest-ranked models are both fine-tuned specifically for coding, which suggests that focused training can be more effective than a broader, general-purpose approach.</p></li><li><p><strong>Data Quality Versus Model Size: </strong>The top-performing models are relatively small (7B-8B parameters), suggesting that the quality of training data and post-training methods can matter more for performance than model size.</p></li></ul><h3><strong>Conclusion</strong></h3><p>Curator Evals is a step towards better curators and reward models. It helps AI practitioners select the right curators for their tasks across diverse model architectures and deployment scenarios. Better curators and reward models lead to better data quality, and ultimately, better AI.</p><div><hr></div><div class="preformatted-block" data-component-name="PreformattedTextBlockToDOM"><label class="hide-text" contenteditable="false">Text within this block will maintain its original spacing when published</label><pre class="text">                                             Ready to improve your AI&#8217;s performance? 
                              Let&#8217;s talk about how Collinear can help you automatically 
                             assess and curate post-training data for improving your AI.</pre></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Schedule Demo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Schedule Demo</span></a></p>]]></content:encoded></item><item><title><![CDATA[Collinear AI Now Available on Google Cloud Marketplace]]></title><description><![CDATA[Making safe, high-performing AI accessible through trusted enterprise infrastructure]]></description><link>https://blog.collinear.ai/p/collinear-ai-now-available-on-google</link><guid isPermaLink="false">https://blog.collinear.ai/p/collinear-ai-now-available-on-google</guid><dc:creator><![CDATA[Soumyadeep Bakshi]]></dc:creator><pubDate>Mon, 25 Aug 2025 19:29:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3a4f3785-13ce-4f86-9c60-033c9f7dfc2d_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re excited to share that Collinear AI is now available on Google Cloud Marketplace, bringing our safety and improvement platform directly into the Google Cloud ecosystem.</p><p>For enterprises and frontier labs, deploying AI is no longer just about speed. It&#8217;s about <em>confidence</em>: ensuring that models are safe, compliant, and reliable enough to support high-stakes business workflows. 
That&#8217;s where Collinear fits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DGqe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DGqe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DGqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png" width="1200" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110105,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/171831108?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DGqe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 424w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 848w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 1272w, https://substackcdn.com/image/fetch/$s_!DGqe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7164b2-add8-4314-b3a4-d454b3d58977_1200x627.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why this matters</h2><p>Most AI systems stagnate after launch: errors pile up, risks escalate, and stakeholder trust erodes. Collinear flips that script. 
Our platform continuously evaluates and stress-tests LLMs, exposes vulnerabilities across 300+ risk categories, and curates targeted training data to fix weaknesses fast.</p><p>With Collinear on Google Cloud Marketplace, enterprises can:</p><ul><li><p><strong>Procure seamlessly:</strong> Deploy Collinear through Google Cloud&#8217;s secure, streamlined billing and procurement.</p></li><li><p><strong>Integrate instantly:</strong> Run agentic evaluations, adversarial red-teaming, and targeted data improvement on day one.</p></li><li><p><strong>Scale globally:</strong> Leverage Google Cloud&#8217;s infrastructure to support complex AI deployments with enterprise-grade trust.</p></li></ul><p>Watch our partnership video with Google Cloud:</p><div id="youtube2-Szok9FdC2Gg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Szok9FdC2Gg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Szok9FdC2Gg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Customer outcomes so far</h2><p>Leading enterprises using Collinear have reported:</p><ul><li><p><strong>90%+ improvement in model response quality</strong></p></li><li><p><strong>50% reduction in training cost</strong></p></li><li><p><strong>10k+ novel jailbreaks</strong> uncovered across multi-turn, multilingual settings</p></li><li><p><strong>15% improvement</strong> in safety and reliability benchmarks</p></li><li><p><strong>2x faster movement</strong> from pilot to production with audit-ready documentation</p></li></ul><p>Read more about our customer case studies <a href="https://www.collinear.ai/case-studies">here</a>.</p><h2>A stronger ecosystem</h2><p>This collaboration brings together Collinear&#8217;s 
AI improvement layer with Google Cloud&#8217;s trusted global infrastructure, giving enterprises the tools and scale they need to build safer, more reliable systems.</p><div class="pullquote"><p>&#8220;Bringing Collinear AI to Google Cloud Marketplace will help customers quickly deploy, manage, and grow the AI improvement layer on Google Cloud&#8217;s trusted, global infrastructure,&#8221; said <strong>Dai Vu</strong>, <strong>Managing Director, Marketplace &amp; ISV GTM Programs at Google Cloud.</strong> &#8220;Collinear AI can now securely scale and support customers on their digital transformation journeys.&#8221;</p></div><h2>What&#8217;s next</h2><p>We&#8217;re proud to join the Google Cloud Marketplace community and make it easier for enterprises to deploy AI with confidence. Whether you&#8217;re running pilots, scaling production workloads, or building safety-critical systems, Collinear helps you move beyond one-off evaluations into a continuous improvement loop.</p><p>&#128073; Explore <a href="https://console.cloud.google.com/marketplace/product/collinear-public/collinear-ai-platform?hl=en&amp;project=collinear-public">Collinear on Google Cloud Marketplace</a> or reach out to our team to learn more.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Schedule a walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Schedule a walkthrough</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[ You Can’t Hire Your Way to Model Alignment ]]></title><description><![CDATA[Why the global AI talent shortage is undermining enterprise model alignment, and what you can do instead]]></description><link>https://blog.collinear.ai/p/ai-talent-wars</link><guid isPermaLink="false">https://blog.collinear.ai/p/ai-talent-wars</guid><dc:creator><![CDATA[Marc 
Moring]]></dc:creator><pubDate>Wed, 13 Aug 2025 20:30:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KPYs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI isn&#8217;t just having a moment &#8212; it&#8217;s rewriting the way entire industries operate. But as organizations race to deploy large language models (LLMs) and machine learning systems into production, a troubling reality sets in: <strong>there simply aren&#8217;t enough qualified AI and ML engineers to go around</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KPYs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KPYs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 424w, https://substackcdn.com/image/fetch/$s_!KPYs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 848w, https://substackcdn.com/image/fetch/$s_!KPYs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KPYs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KPYs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png" width="571" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:571,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KPYs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 424w, https://substackcdn.com/image/fetch/$s_!KPYs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 848w, https://substackcdn.com/image/fetch/$s_!KPYs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KPYs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e06821-48fd-4475-b5de-90d0c3ac0f88_571x592.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The Hard Numbers: AI Talent Shortage by the Data</strong></h3><p>Globally, the mismatch between open AI roles and available machine learning talent is staggering:</p><ul><li><p><strong>4.2 million</strong> artificial intelligence jobs remain unfilled worldwide, yet only <strong>320,000</strong> qualified professionals exist to meet the demand 
&#8212; <a href="https://fullscale.io/blog/ai-developer-shortage-solutions/">a fill rate of just </a><strong><a href="https://fullscale.io/blog/ai-developer-shortage-solutions/">7.6%</a></strong>.</p></li><li><p>In the United States, where demand for AI engineers is primarily concentrated, the gap is slightly narrower but still significant: of the projected <strong>1.3 million</strong> AI roles needed over the next two years, <a href="https://economictimes.indiatimes.com/nri/work/job-crisis-the-ai-gold-rush-is-here-but-there-arent-enough-people-to-fill-all-open-positions/articleshow/118835323.cms?from=mdr">only </a><strong><a href="https://economictimes.indiatimes.com/nri/work/job-crisis-the-ai-gold-rush-is-here-but-there-arent-enough-people-to-fill-all-open-positions/articleshow/118835323.cms?from=mdr">645,000</a></strong><a href="https://economictimes.indiatimes.com/nri/work/job-crisis-the-ai-gold-rush-is-here-but-there-arent-enough-people-to-fill-all-open-positions/articleshow/118835323.cms?from=mdr"> individuals are ready to fill</a> them &#8212; a <strong>49.6% match</strong> at best. </p></li></ul><p>This isn&#8217;t just an HR problem. 
It&#8217;s a critical threat to your enterprise AI strategy and roadmap.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WIg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WIg4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WIg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4949109a-f207-4b74-befa-447998637fd3_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WIg4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!WIg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4949109a-f207-4b74-befa-447998637fd3_1600x900.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3><strong>The AI Talent Wars: Big Tech&#8217;s Race for AI Dominance</strong></h3><p>The talent crunch isn&#8217;t just a statistic &#8212; it&#8217;s fueling an unprecedented battle among tech giants. 
Companies like <strong>Meta</strong>, <strong>OpenAI</strong>, <strong>Google DeepMind</strong>, and <strong>Anthropic</strong> are locked in a hiring war, offering multimillion-dollar packages for a small pool of elite AI researchers and engineers.</p><ul><li><p><strong>Meta</strong> has poached top researchers with <strong>seven- and eight-figure deals</strong> and, in some cases, signing bonuses of over <strong>$100 million</strong>, aggressively expanding its FAIR and GenAI teams.</p></li><li><p><strong>OpenAI</strong> offers top compensation exceeding <strong>$1 million per year</strong>, plus equity, and has recently launched a <strong>professional services division</strong> to help enterprises customize and deploy its models, effectively turning its in-house experts into revenue-generating consultants.</p></li><li><p><strong>Google DeepMind</strong> has fortified its AI division to prevent brain drain while launching <strong>Gemini</strong> to maintain its leadership in general-purpose AI.</p></li></ul><p>This fierce competition drives up salaries, increases churn, and squeezes enterprises trying to build in-house AI capabilities. 
Even well-resourced enterprises outside of Big Tech are finding it increasingly difficult to attract or retain the right talent.</p><p>In short: <strong>if you&#8217;re not a top-tier lab, you&#8217;re already behind</strong>.</p><div><hr></div><h3><strong>Model Alignment Requires More Than Just Hiring AI Engineers</strong></h3><p>One of the most overlooked blockers to scalable enterprise AI deployment is <strong>model alignment</strong>: ensuring your AI models behave reliably and safely, in line with your company&#8217;s brand, values, and regulatory obligations.</p><p>Traditionally, model alignment has required:</p><ul><li><p><strong>Human-annotated data</strong></p></li><li><p><strong>Manual red teaming by AI experts</strong></p></li><li><p><strong>Expert-curated fine-tuning datasets</strong></p></li></ul><p>But here&#8217;s the rub: <strong>these workflows are labor-intensive and don&#8217;t scale</strong>, especially in a competitive market where AI talent is scarce, expensive, and slow to hire. The average time to fill an AI/ML role is 142 days!</p><p><strong>Enterprise Adoption Trend: What&#8217;s Driving the Need for Talent?</strong></p><p>Enterprises are moving from pilots to production while Big Tech escalates competition for scarce experts. At the same time, organizations need <strong>domain-specific, production-grade models</strong>, which take niche skills in data curation, post-training/fine-tuning, evals, and governance. 
Demand is growing faster than supply.<a href="https://www.congress.gov/119/meeting/house/118204/documents/HHRG-119-JU03-20250507-SD001-U1.pdf"> Congress.gov</a></p><p><strong>Why higher education isn&#8217;t keeping up</strong></p><ul><li><p><strong>Throughput is small:</strong> In North America, ~<strong>28%</strong> of CS PhDs now specialize in AI/ML&#8212;still only a slice of a modest PhD pipeline.<a href="https://www.insidehighered.com/news/quick-takes/2024/05/23/ai-most-popular-speciality-computer-science-phds?utm_source=chatgpt.com"> Inside Higher Ed</a></p></li><li><p><strong>&#8220;Low-thousands&#8221; at best:</strong> One estimate puts <strong>~3,000 AI-related PhDs among international students</strong> graduating from U.S. universities each year&#8212;illustrating how small the annual research-level output is relative to market demand.<a href="https://cset.georgetown.edu/publication/keeping-top-ai-talent-in-the-united-states/?utm_source=chatgpt.com"> CSET</a></p></li><li><p><strong>Demand outruns supply:</strong> AI-software job postings grew <strong>~32%/yr (2015&#8211;2022)</strong>, while AI-relevant degree production rose much more slowly.<a href="https://www.congress.gov/119/meeting/house/118204/documents/HHRG-119-JU03-20250507-SD001-U1.pdf"> Congress.gov</a></p></li><li><p><strong>Retention is fragile:</strong> <strong>~59%</strong> of AI-relevant PhDs awarded by U.S. institutions go to non-U.S. 
citizens, so immigration frictions further constrain the domestic pool.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mA5-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mA5-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mA5-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mA5-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 424w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 848w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 1272w, https://substackcdn.com/image/fetch/$s_!mA5-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7ef254d-f8cc-424f-ba6e-e9ff10a522f2_1600x900.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2bQK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2bQK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 424w, https://substackcdn.com/image/fetch/$s_!2bQK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!2bQK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 1272w, https://substackcdn.com/image/fetch/$s_!2bQK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2bQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png" width="1344" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121784,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170826650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2bQK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 424w, https://substackcdn.com/image/fetch/$s_!2bQK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 
848w, https://substackcdn.com/image/fetch/$s_!2bQK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 1272w, https://substackcdn.com/image/fetch/$s_!2bQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0d312f-3f15-4d1a-b251-c16ba920a7e4_1344x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h3><strong>Is a fundamental shift underway?</strong></h3><p><strong>Multi&#8209;model strategies become 
normal.<br></strong>Enterprises are no longer &#8220;OpenAI&#8209;only&#8221;. In a16z&#8217;s <a href="https://a16z.com/generative-ai-enterprise-2024/">2024 and 2025 CIO studies</a>, leaders reported routing workloads to several models (often 5&#8239;+) and deliberately mixing closed and open options to avoid lock&#8209;in and optimize for price&#8209;to&#8209;performance.</p><p><strong>Why the open uptick?</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lWzx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lWzx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 424w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 848w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 1272w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lWzx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png" width="883" height="339" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1db8e843-cac6-4c51-9651-236c20280cff_883x339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:339,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79735,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170826650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lWzx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 424w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 848w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 1272w, https://substackcdn.com/image/fetch/$s_!lWzx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1db8e843-cac6-4c51-9651-236c20280cff_883x339.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><strong>The bottom line:</strong></p><ul><li><p><strong>The usage mix is shifting</strong> from ~1 in&#8239;7 enterprises using open models in 2023 to <strong>a clear majority experimenting with or deploying them in 2025</strong>.<br></p></li><li><p>The market looks headed toward a <strong>&#8220;50/50 world&#8221;</strong> where enterprises maintain a garden of both open and closed models, selecting per use case.<br></p></li><li><p>For vendors (like Collinear&#8239;AI) this means emphasizing <strong>model&#8209;agnostic assessment, safety, and improvement tooling</strong>: services that plug into 
either side of the spectrum and keep customers flexible as the balance continues to evolve.</p></li></ul><div><hr></div><h3><strong>Can You Hire Your Way to Model Alignment?</strong></h3><p><strong>You can&#8217;t.</strong></p><p>At least, not with human labor alone. It&#8217;s time to shift from human bottlenecks to <strong>AI-driven alignment systems</strong>.</p><div><hr></div><h3><strong>Collinear AI: Solve Model Alignment Without the AI Hiring Bottleneck</strong></h3><p>At <strong>Collinear AI</strong>, we&#8217;ve built an end-to-end <strong>AI alignment and improvement platform</strong> for enterprises that can&#8217;t afford to rely on traditional, manual processes.</p><h4><strong>1. AI Judges for Scalable Model Assessment</strong></h4><p>Our reward models automatically evaluate LLM behavior across enterprise-critical dimensions &#8212; <strong>safety</strong>, <strong>helpfulness</strong>, <strong>compliance</strong>, and <strong>brand alignment</strong> &#8212; without relying on costly human annotators.</p><h4><strong>2. Adversarial Red Teaming at Scale</strong></h4><p>We simulate high-volume stress tests on your AI systems using automated adversarial prompts to uncover failure modes humans might miss.</p><h4><strong>3. Data Curators for Post-training</strong></h4><p>Our AI-powered Data Curators generate <strong>targeted synthetic training data</strong> that enables safer, more accurate models &#8212; dramatically reducing the need for human data labeling or costly custom datasets.</p><p>The result? 
You can <strong>evaluate, red team, and improve your AI models continuously</strong> &#8212; without being limited by the global AI skills shortage.</p><div><hr></div><h3><strong>Global AI Challenge, Scalable Enterprise Solution</strong></h3><p>Whether you're a Fortune 500 bank, a healthcare innovator, or an enterprise SaaS provider, one truth holds: <strong>you can&#8217;t align what you can&#8217;t control</strong>, and you can&#8217;t control your AI models if you're stuck in outdated human-in-the-loop systems.</p><p>Collinear AI offers a future-proof solution.</p><div><hr></div><h3><strong>Ready to Solve the AI Talent Shortage With Scalable Model Alignment?</strong></h3><p>If your enterprise is serious about deploying AI responsibly, on-brand, and at scale, but the AI engineering talent gap constrains you - let&#8217;s talk.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.collinear.ai/book-a-demo&quot;,&quot;text&quot;:&quot;Schedule a walkthrough&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.collinear.ai/book-a-demo"><span>Schedule a walkthrough</span></a></p>]]></content:encoded></item><item><title><![CDATA[Leveling the Playing Field: Livecodebench’s Big Bug Fix]]></title><description><![CDATA[Three major fixes that reshaped competitive coding scores and why your numbers may look very different now]]></description><link>https://blog.collinear.ai/p/lcb-bug-fixes</link><guid isPermaLink="false">https://blog.collinear.ai/p/lcb-bug-fixes</guid><dc:creator><![CDATA[Muyu]]></dc:creator><pubDate>Tue, 12 Aug 2025 01:43:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lBqn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>tl;dr: the official 
LiveCodeBench has serious bugs that can shift results by 50% or more. We opened PRs to patch them [<a href="https://github.com/LiveCodeBench/LiveCodeBench/pull/117">1</a>], [<a href="https://github.com/LiveCodeBench/LiveCodeBench/pull/118">2</a>].</em></p><p>When OpenAI released gpt-oss last week, the official blog reported the model&#8217;s performance on AIME but gave no official numbers for coding. Out of curiosity, we plugged the model into our post-training workflow and reported a LiveCodeBench pass@1 score of 0.70 in <a href="https://blog.collinear.ai/p/gpt-oss-lcb">this blog post</a>.<br><br>That post on benchmarking gpt-oss-20b attracted post-training researchers interested in reproducing our results. This post describes our internal LiveCodeBench setup and our PRs to the official repo, which cover three major bug fixes.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;441be554-b1f7-40dc-a6a8-ac795b3024c9&quot;,&quot;duration&quot;:null}"></div><h2>Quick Background: What is LiveCodeBench?</h2><p><a href="https://huggingface.co/blog/leaderboard-livecodebench">LiveCodeBench</a> is a continuously updated, contamination-aware benchmark maintained by researchers at UC Berkeley, MIT, and Cornell. Problems are scraped shortly after they appear in live contests, so teams can evaluate against tasks that were unseen at model-training time.</p><p>The problems are diverse competitive programming questions. The metric is execution-based accuracy: a sample counts as correct only if all hidden tests pass, and the score is averaged over the samples generated for each problem.</p><h2>How We Found the Bug</h2><p>We first suspected something was off when we tried to reproduce the official LiveCodeBench results for <a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen3-8B</a> (instruct). 
Instead of matching the reported numbers, our runs consistently came in much lower: the technical report&#8217;s score was about <strong>50% higher</strong> than ours (e.g., 57.8 vs. 38.3 for Qwen3-8B). Digging deeper, we inspected the raw outputs and spotted a strange pattern: every response was cut off at the <code>###</code> marker. For many problems, Qwen3&#8217;s actual solution began after a header like <code>### Solution Code</code>. Upon closer inspection, we found that the official LiveCodeBench evaluation treated that marker as a stop token and discarded the real answer entirely.</p><p>Looking at the data, we also noticed that some extracted responses contained no code even though the answer was enclosed in backticks (<code>```</code>). This led us to the second bug: the official LCB script only checked for enclosing backticks without confirming that there was actual code inside them. Sometimes the block held just a comment, and the script would extract that comment while ignoring the code the model actually generated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lBqn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lBqn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 424w, https://substackcdn.com/image/fetch/$s_!lBqn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 848w, 
https://substackcdn.com/image/fetch/$s_!lBqn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!lBqn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lBqn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png" width="1200" height="1255" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1255,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lBqn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 424w, https://substackcdn.com/image/fetch/$s_!lBqn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 848w, 
https://substackcdn.com/image/fetch/$s_!lBqn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!lBqn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e256dd-8e56-4567-933c-98d84fe17a5e_1200x1255.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example highlighting how the official LCB script discarded actual code snippets from model 
responses</figcaption></figure></div><p>We also noticed that the official LCB repo hard-codes chat templates instead of using the chat completions API, which is another source of errors in the evaluation pipeline.</p><h2>The Three Main Fixes</h2><h3>Fix 1: Disable the problematic <code>--stop</code> default that causes premature cutoff</h3><p>The most prominent issue is that LCB has a default stop token that cuts off everything after a specific marker, sometimes including the actual code response. Specifically, the argument parser contains the following code:</p><pre><code>parser.add_argument(
  "--stop",
  default="###",
  type=str,
  help="Stop token (use `,` to separate multiple tokens)",
)</code></pre><p>This default is problematic because some models tend to use markdown syntax in their responses and generate code only after a <code>### solution code</code> header. As a result, the whole code block is discarded and the response is marked as wrong.</p><p>To remedy this, <strong>we change the default stop token to </strong><code>None</code>, so that generation ends only when the model outputs its <code>&lt;eos&gt;</code> token or hits the max token limit.</p><h3>Fix 2: Check for python tags when extracting code from backticks</h3><p>Another issue is that LCB takes the last pair of triple backticks (i.e., <code>```...```</code>) as the code implementation. This risks extracting non-code blocks whenever the model puts other content in backticks for readability or stylistic reasons. As a result, we sometimes see LCB extract the model&#8217;s explanations or summaries as the code output and grade them as incorrect.</p><p>The most effective fix is to prioritize the last pair of backticks <strong>with a python tag</strong> (i.e., <code>```python&#8230;```</code>). This filters out non-Python blocks while still fairly requiring the model to wrap its code in backticks.</p><h3>Fix 3: Deprecate hard-coded chat templates</h3><p>The two issues above grade the model unfairly, but this one directly breaks it: the incorrect application of custom chat templates. Models are sensitive to the chat template and system prompt they were trained with, so altering the template can cause catastrophic degradation in performance. Unfortunately, LCB hand-writes the template for each model in a single file and applies it to the model prompt through formatted strings at runtime. For example, this is the manual template for Qwen3 models:</p><pre><code><code>SYSTEM_MESSAGE_CODEQWEN = (
        f"&lt;|im_start|&gt;system\nYou are a helpful assistant &lt;|im_end|&gt;\n&lt;|im_start|&gt;user"
)</code></code></pre><p>The fix is simple: we move away from hard-coded templates and use the mature <code>chat</code> endpoints that are available in almost all LLM inference clients. Examples include OpenAI&#8217;s <code>client.chat.completions.create</code> endpoint and vLLM&#8217;s <code>llm.chat</code> method. Unlike the traditional completion endpoints that LCB uses, these chat endpoints automatically apply the chat template specified in the model&#8217;s config on Hugging Face, which eliminates any mismatch between a model and its chat template.</p><h2>Before vs. After: The Numbers</h2><p>We can replicate the official reports from foundation model providers much more closely <em>after</em> our bug fixes.</p><p>Reproduction of public benchmarks using our internal LCB setup:</p><ul><li><p><a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen3-8B instruct</a>:</p><ul><li><p><a href="https://arxiv.org/pdf/2505.09388">Technical report:</a> 0.575</p></li><li><p>Internal result: 0.578</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XGG3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XGG3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 424w, https://substackcdn.com/image/fetch/$s_!XGG3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 848w, 
https://substackcdn.com/image/fetch/$s_!XGG3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 1272w, https://substackcdn.com/image/fetch/$s_!XGG3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XGG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png" width="1300" height="869" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:869,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170732932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02664e8-39a8-431d-96a2-8949afb14fc8_1364x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XGG3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 424w, https://substackcdn.com/image/fetch/$s_!XGG3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 
848w, https://substackcdn.com/image/fetch/$s_!XGG3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 1272w, https://substackcdn.com/image/fetch/$s_!XGG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff1384c4-d680-4888-95de-57f8c95fa801_1300x869.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Snapshot of the results table from the official Qwen3 technical report</figcaption></figure></div><ul><li><p><a 
href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen2.5-7b-Instruct</a>:</p><ul><li><p><a href="https://arxiv.org/pdf/2506.04178">Open Thoughts paper</a>: 36.2 </p></li><li><p>Internal result: 35.5</p></li></ul></li><li><p><a href="https://huggingface.co/open-thoughts/OpenThinker3-7B">OpenThinker 3</a>:</p><ul><li><p><a href="https://arxiv.org/pdf/2506.04178">Open Thoughts paper</a>: 64.5 </p></li><li><p>Internal result: 70</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lHYt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lHYt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 424w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 848w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 1272w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lHYt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png" width="1161" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1161,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238535,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170732932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964a634-0abc-4e4b-8954-262cd2837329_1278x1150.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lHYt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 424w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 848w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 1272w, https://substackcdn.com/image/fetch/$s_!lHYt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78d88275-943d-42ac-901e-9d7e93900e28_1161x816.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Snapshot of the results from the OpenThinker paper</figcaption></figure></div><h2><strong>Why this matters</strong></h2><p>As the AI landscape evolves rapidly, official benchmarks are among the most critical pieces of infrastructure that practitioners rely on to decide which models best suit their use cases. That is why benchmarks must be reliable and their scores reproducible. 
Our bug patches to LiveCodeBench support open science and enable the community to replicate the scores published in official technical reports.</p><h4><strong>References:</strong></h4><ol><li><p>LiveCodeBench: <a href="https://livecodebench.github.io/">https://livecodebench.github.io/</a></p></li><li><p>PR to fix the stop flag: <a href="https://github.com/LiveCodeBench/LiveCodeBench/pull/117">https://github.com/LiveCodeBench/LiveCodeBench/pull/117</a></p></li><li><p>PR to fix the backticks: <a href="https://github.com/LiveCodeBench/LiveCodeBench/pull/118">https://github.com/LiveCodeBench/LiveCodeBench/pull/118</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[OpenAI's gpt-oss on LiveCodeBench: A Competitive Programming Deep Dive]]></title><description><![CDATA[tl;dr: gpt-oss-20b is a strong model for competitive coding but is >3x less sample-efficient than DeepSeek-R1-0528]]></description><link>https://blog.collinear.ai/p/gpt-oss-lcb</link><guid isPermaLink="false">https://blog.collinear.ai/p/gpt-oss-lcb</guid><dc:creator><![CDATA[Muyu]]></dc:creator><pubDate>Wed, 06 Aug 2025 21:57:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tn19!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>tl;dr:</strong> gpt-oss-20b is a strong model for competitive coding but is &gt;3x less sample-efficient than DeepSeek-R1-0528</em></p><p>OpenAI has finally released an open-weights model, gpt-oss. Its last open-weights release was GPT-2, back in 2019!</p><p>gpt-oss comes in two sizes, 20b and 120b. Both are mixture-of-experts (MoE) models, instruction-tuned for reasoning and agentic tasks. 
You can control the reasoning levels of the model and switch between low, medium, and high reasoning modes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tn19!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tn19!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!tn19!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!tn19!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!tn19!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tn19!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png" width="1080" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tn19!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 424w, https://substackcdn.com/image/fetch/$s_!tn19!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 848w, https://substackcdn.com/image/fetch/$s_!tn19!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 1272w, https://substackcdn.com/image/fetch/$s_!tn19!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eca0c48-b16d-4974-ac94-334954c35adf_1080x720.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The OpenAI blog post claims that both models score ~98 on AIME 2025 (a competitive math benchmark). There is no documented performance of gpt-oss on competitive coding. We benchmarked the model on LiveCodeBench (LCB) and analyzed its performance, including its reasoning quality.</p><p>gpt-oss-20b scores 70 pass@1 on LCB v6 (Aug 1, 2024 to Jan 31, 2025) in <em>high</em> reasoning mode, with 3 samples per problem. We set the maximum sequence length to 64k tokens. 
The total number of problems in LCB v6 is 323, with 79 classified as easy, 102 as medium, and 142 as hard on the difficulty spectrum.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CXJF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CXJF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 424w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 848w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 1272w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CXJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png" width="1456" height="337" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:337,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170292728?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CXJF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 424w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 848w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 1272w, https://substackcdn.com/image/fetch/$s_!CXJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60cfb8bd-9bbc-4c5d-991a-233c6284058e_2028x470.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Top-performing open weights model for competitive coding. 
* indicates a model that was not benchmarked internally; its numbers are taken from the public leaderboard.</figcaption></figure></div><p>All gpt-oss responses are structured into three blocks: <code>&lt;analysis&gt;</code>, <code>&lt;commentary&gt;</code>, and <code>&lt;final&gt;</code>. The following plot shows the distribution of the model&#8217;s response length across the three blocks vs. the input problem length, broken down by the three difficulty levels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b3LS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b3LS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 424w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 848w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 1272w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!b3LS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png" width="1456" height="828" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291613,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170292728?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b3LS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 424w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 848w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 1272w, https://substackcdn.com/image/fetch/$s_!b3LS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F734c4eb1-2d84-4fe2-8808-7e4816adb37e_4170x2370.png 1456w" 
sizes="100vw"></picture></div></a></figure></div><p>The model is very sample-efficient on the easy problems but extremely sample-inefficient on the difficult ones. DeepSeek-R1-0528 averages 15k tokens on the difficult problems, making gpt-oss-20b more than three times as verbose.</p><p>The following plot shows the distribution of the model&#8217;s response length across the three blocks vs. the input problem length, grouped by the month of the eval problems. 
The response length is approximately uniform across the different months.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!00nz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!00nz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 424w, https://substackcdn.com/image/fetch/$s_!00nz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 848w, https://substackcdn.com/image/fetch/$s_!00nz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 1272w, https://substackcdn.com/image/fetch/$s_!00nz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!00nz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png" width="1456" height="828" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319865,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.collinear.ai/i/170292728?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!00nz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 424w, https://substackcdn.com/image/fetch/$s_!00nz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 848w, https://substackcdn.com/image/fetch/$s_!00nz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 1272w, https://substackcdn.com/image/fetch/$s_!00nz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2d76113-89f2-4db0-a812-a2801e7d3624_4170x2370.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI&#8217;s long-awaited open-weights gpt-oss-20b model is good at competitive coding for its size, but is very verbose compared to similarly performing models.</p><h4>References:</h4><ol><li><p>OpenAI <a href="https://openai.com/index/introducing-gpt-oss/">open-weights gpt-oss blog post</a></p></li><li><p><a href="https://huggingface.co/blog/leaderboard-livecodebench">LiveCodeBench leaderboard</a></p></li><li><p>Kwaipilot <a href="https://huggingface.co/Kwaipilot/KAT-V1-40B">KAT-V1-40B</a></p></li><li><p>NVIDIA <a href="https://huggingface.co/nvidia/OpenCodeReasoning-Nemotron-1.1-32B">OpenCodeReasoning Nemotron</a></p></li><li><p>DeepSeek <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-0528">R1-0528</a></p></li></ol>]]></content:encoded></item></channel></rss>