2026-05-05

leaderboards are dead. the vals ai print is the funeral

grok 4.3 topping vals ai legal and finance, while a friendly academic admits public benchmarks can't measure risk, is the same data point twice. private vertical evals are the only scoreboard that prices into revenue.

the print

on may 4, vals ai posted grok 4.3 at #1 on its private legal and finance benchmarks, with gpt-5.1 and claude variants trailing on both. these are not lmsys vibes or another mmlu refresh. vals runs evals that are held out from training, gated behind buyer contracts, and graded on tasks those two verticals actually pay for, where a wrong answer costs real dollars on the finance side and draws a bar complaint on the legal side.

the same week, ethan mollick, who is about as friendly to frontier ai as any tenured academic gets, conceded the obvious: "a challenge with AI regulation and vetting is how bad our benchmarks of AI model performance and risks are." two data points, one signal. the public leaderboard era is functionally over for anyone making a buying decision, and the private vertical eval is the only thing replacing it.

the dynamic

call it the dollar-denominated eval. a benchmark prices into revenue if and only if (a) it is held out from training, (b) the buyer in that vertical trusts it, and (c) being wrong on it costs the user real money. mmlu fails all three. swe-bench passes (a) and arguably (c), which is why coding agents are the one workflow where capability claims actually convert to gross profit. vals legal and vals finance pass all three, which is why grok winning them matters more than xai's last six press releases combined.

the consensus read is that grok 4.3 is a benchmark curiosity and the real game is still gpt-5.x distribution plus claude code stickiness. that read is not wrong about distribution. it is wrong about what the benchmark print means. this is the second consecutive quarter we have seen xai win on a domain where accuracy is dollar-denominated, and the first one where the buyer-side eval house, not the model lab, picked the winner. that is a different kind of receipt.

why public leaderboards stopped working

public benchmarks rotted on a predictable schedule. the moment a benchmark has a public test set, the next training run absorbs it, directly or through paraphrase. the moment it has a public leaderboard, lab pms get bonused on it. by the time it shows up in a keynote, it is a marketing surface, not a measurement.
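to make the absorption mechanic concrete, here is a minimal sketch of the kind of verbatim n-gram overlap check an eval house might run to estimate whether a public test set has already leaked into a training corpus. everything below is an illustrative assumption, not anyone's actual contamination pipeline, and it only catches word-for-word leakage, not the paraphrase route.

```python
# minimal sketch, illustrative only: flag test items whose n-grams already
# appear verbatim in a training corpus. real contamination pipelines are
# fuzzier (dedup hashes, embedding similarity); this misses paraphrase entirely.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs: list[str], test_items: list[str], n: int = 8) -> float:
    """fraction of test items sharing at least one n-gram with the training corpus."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return hits / max(len(test_items), 1)

# toy usage: once a benchmark's test set is on the public web, the next
# scrape picks it up and the overlap rate climbs toward 1.0.
train_corpus = ["... the indemnification obligations in section 7 survive termination of this agreement for two years ..."]
public_test = ["q: the indemnification obligations in section 7 survive termination of this agreement for how long?"]
print(contamination_rate(train_corpus, public_test, n=6))  # 1.0 -> already compromised
```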

mollick's framing is the soft version of this. the harder version: there is no current public benchmark that reliably predicts cyber-uplift, deception, agent misuse, or sustained reliability under real tool use. nist has the aisi mandate but no production-grade eval suite, which is why any pre-release review regime would land on top of measurement infrastructure that does not yet exist. if the federal pre-release model review the trump administration has floated becomes real, the first procurement check goes to nist to build the evals the labs themselves failed to ship. that is a separate moat conversation, but it shares the same root cause: nobody has a trustworthy public number.

private vertical evals fill the vacuum because they invert every failure mode of the public ones. vals does not publish its test set. its customers are casey-quinn law firms and middle-office finance teams, not lab marketing departments. the leaderboard is a buying signal, not a recruiting signal. when grok wins vals legal, a general counsel somewhere updates a vendor shortlist. when a model wins lmsys, an intern updates a tweet.

the steelman

the counter is simple and partly correct: vals ai is one eval house, run by a small team, with a methodology that is itself not fully public. one private benchmark winner does not make a category. and grok's enterprise distribution among regulated legal and finance buyers is, charitably, embryonic. claude has anthropic's enterprise team and the deployment muscle anthropic just financed via the blackstone jv. gpt-5.1 has chatgpt enterprise and the new openai deployco. winning the eval and winning the procurement cycle are different sports, and historically the second one is decided by sales coverage and soc2 paperwork, not by accuracy deltas.

that is real. it is also exactly why this print matters. grok now has the one thing xai's go-to-market could not previously manufacture: a third-party, vertical-specific accuracy claim that a buyer in a regulated industry can defend internally. it does not close deals on its own. it gets you into the bake-off you previously were not invited to.

what to do with this

two concrete updates. first, stop using public leaderboards as a capability proxy in any memo that touches revenue. if a vendor is pitching you on lmsys rank or generic mmlu, that is a tell about their buyer, not their model. ask for the vertical eval that matches your workflow, or commission one. iso 42001 will be the soc2 of ai inside eighteen months, and the vertical eval will be the actual technical diligence underneath the checkbox.

second, watch which eval houses the regulators adopt. if a federal pre-release review regime lands, the eval suite it standardizes on becomes the de facto national benchmark, and whichever private vendor gets that contract becomes the moody's of model risk. vals, mlcommons, and whatever nist greenlights are the three names to track. the labs already know this. the buyers are about to find out.

leaderboards measured persuasion. private vertical evals measure outcomes. only the second one shows up in arr.
