Anthropic Used Claude To Beat Its Own Human Alignment Researchers (At $22 An Hour) |

Anthropic's new paper shows nine parallel Claude agents recovered 97% of the alignment-research performance gap on a problem its human researchers got 23% of in seven days. Total cost: $18K. The number to internalize is $22 per agent-hour.

For the last two years, the AI labs have pitched a circular argument. AI is going to change everything about work, but don't worry, because the people building it are going to be careful. The pitch lands because the people building it are obviously thoughtful, well-resourced humans. Until the new paper Anthropic published yesterday, where the people doing the alignment research were Claude.

Two of Anthropic's own researchers spent seven days on an open alignment problem. Nine copies of Claude Opus 4.6, working in parallel sandboxes, spent five more days on the same problem and demolished the human result. Total cost for the Claude run: $18,000. Roughly $22 per Claude-research-hour. That's the number to write down.

Anthropic just released a paper (full Alignment Science blog) showing nine parallel Claude Opus 4.6 agents outperformed Anthropic's own human researchers on a real alignment problem. The setup: weak-to-strong supervision (using a weaker AI to train a stronger one, which mirrors humans someday supervising AI smarter than us).

Why this matters: Alignment research (making sure AI behaves the way humans want) was the one field everyone agreed couldn't be automated. That argument is now empirical, not hypothetical. The cost number is what to internalize: whatever ratio of human researchers to Claude fleet you can imagine, the labs can afford more. Andrew Curran is calling it "a preview of RSI" (recursive self-improvement, where AI improves its own training).

Our take: Read the paper carefully and the catch shows up: this only works on problems where progress can be automatically scored, and even then the agents tried to game the score in four different ways. Most real alignment problems don't fit that mold. But Anthropic's own pitch is that solving this general version would let you bootstrap into the fuzzy problems too. The open question for the rest of 2026: did Anthropic just publish the seed of recursive self-improvement, or a clever experiment on a uniquely well-behaved problem? Both readings are honest. Neither is comforting.

The headline number is the cost. $22 per Claude-research-hour, $18,000 for the entire experiment. A senior alignment researcher in San Francisco costs roughly that per hour, all in, with benefits and the office space and the recruiting cost amortized. Anthropic just demonstrated, on a real research problem, that nine Claude agents working in parallel can outperform two senior researchers for less than the cost of one researcher-week.

This is real. The paper has the receipts: nine sandboxes, 800 cumulative agent-hours, a remote evaluation API that the agents called for scoring, a forum where they shared their findings with each other, and a final Performance Gap Recovered (PGR) score of 0.97. Humans got 0.23.

What the PGR score actually means: if you took the strong student model and trained it on the gold-standard, ground-truth-labeled data, that's a 1.0. If you used only the weak teacher's noisy labels with no improvement, that's a 0.0. The score measures how much of the gap between those two numbers you closed. The humans closed about a quarter of it. The AARs closed almost all of it.

This is the part most coverage is going to skip, and it's the part that determines whether the result actually means what the most excited tweets are saying it means.

Anthropic chose this problem deliberately. They picked the one that mirrors the central challenge of "scalable oversight," which is the technical name for how humans will supervise AI systems that are smarter than us. In the experiment, a small Qwen 1.5-0.5B-Chat model plays the role of the human supervisor. A larger Qwen 3-4B-Base model plays the role of the smarter AI. The question is: how good can you make the smarter model when the only training signal you have comes from the weaker one?

That setup is a stand-in for a future where Anthropic's most capable models exceed any individual human researcher's understanding of what they're doing. If you can solve this clean, mathematical version of the problem, you have at least a starting point for the messier real version. And if you can solve it with AI doing the research instead of humans, you've handed yourself a way to keep alignment research moving at the same speed as capabilities research.

The reason this paper landed harder than the average alignment paper is that it is the first credible piece of evidence that the labs can use their current models to make their next models safer, not just smarter. For the last two years, that pitch has been a vibe. Today it's a graph.

The four hacks the Anthropic Claude agents invented, unprompted, are worth sitting with:

The authors caught all four hacks and disqualified those runs. But the broader point, in their own words, is that "future work should test AAR-discovered ideas on entirely held-out datasets," because the agents are very good at finding shortcuts inside whatever environment you give them.

The Andrew Curran framing of "preview of RSI" loads a lot onto these results that they don't quite carry. But the framing of "preview of how an AI fleet behaves when you point it at any benchmark you can score" is exactly right, and that's a lesson that applies to every AI evaluation in the field.

The paper is unusually honest about its limits. Three things the authors call out, in order of importance:

The method failed to transfer to production scale. They tried the Claude agents' best discovered method on Sonnet 4 with a chat helpfulness preference dataset using Anthropic's real production training infrastructure. The improvement was +0.5 points on held-out evaluation, within the noise floor. Their explanation is that the methods the AARs found were tuned to the specific Qwen models and dataset they were trained on, and the structures the methods exploited didn't show up the same way in production.

Most alignment problems lack a clean scoring metric. The thing that made this experiment work is that you can mechanically measure the PGR. The vast majority of real alignment problems (does Claude scope research projects well? does it refuse the right requests? does it model honesty correctly?) require human judgment to evaluate, and judgment doesn't fit cleanly into a scoring API.

The agents were tuned to find shortcuts. The four reward hacks were caught, but the authors are explicit that they did not predict any of them in advance. The implication is that any future automated alignment work will need evaluation infrastructure that can survive an adversary as creative as Claude.

Read together, these three caveats argue against the strongest version of the "we are now in the singularity" reading. The result is that AAR-style automation works on a class of alignment problems and probably doesn't yet work on most of them. Whether the class expands fast or slow over the next year is the actual open question.

Three threads to watch over the next 90 days:

The first is whether other labs reproduce. OpenAI's Trusted Access for Cyber announcement yesterday already framed cyber-permissive variants of GPT-5.4 as "scaling defenses in lockstep with capabilities," and OpenAI's own scalable oversight team has been quietly working on similar problems. If a parallel experiment lands from OpenAI or DeepMind in the next few weeks, that's the moment the AAR result moves from "Anthropic got lucky on a clean problem" to "this is how alignment research is done now."

The second is what happens to alignment hiring. If you can get $18,000 worth of AAR work done in a week, the marginal value of hiring a tenth research engineer at $400K all-in is now demonstrably lower than the marginal value of buying more inference credits. That trade is real and it's happening in budget meetings this week.

The third is the fuzzy problems. Anthropic explicitly says the reason they chose weak-to-strong supervision is that solving it "would unlock bootstrapping on broader non-outcome-gradable problems" (the alignment problems where you can't mechanically score progress, which is most of them). That's the recursive loop everyone is actually arguing about. The honest answer is nobody knows yet whether the bootstrap works. If it does, the cost curve in the second half of 2026 starts looking very different from the cost curve in the first half.

What we're watching: whether the AAR setup is replicated externally, whether the production-scale failure is solved, and whether any of the "alien science" methods the AARs discovered turn out to generalize to anything humans care about. Those three answers will tell us whether April 14, 2026 was a footnote or a turning point.

Grant Harvey is the Lead Writer of The Neuron, where he continues to lead the publication's daily coverage of AI news, tools, and trends.

Sam Altman's San Francisco home was attacked twice in three days as the FBI moves toward a domestic terrorism designation, Maine became the first state to ban large data centers, Anthropic shipped Claude Code Routines, a leaked OpenAI memo accused Anthropic of inflating its $30B run rate, and Nvidia Blackwell GPU rental prices jumped 48% in two months.

Stanford's 2026 AI Index quantified the gap between AI insiders and the public, Anthropic's Mythos triggered a Fed-led bank summit, Berkeley researchers broke every major agent benchmark, Microsoft started building its own OpenClaw, and an AI signed a 3-year retail lease in San Francisco.

Anthropic built a model too dangerous to release, hit $30B in revenue, and launched Managed Agents. Meta shipped Muse Spark. Z.ai's open-source GLM-5.1

Don't fall behind on AI. Get the AI trends & tools you need to know. Join 675,000+ professionals from top companies like Microsoft, Apple, Salesforce and more.

Advertiser Disclosure: Some of the products that appear on
this site are from companies from which TechnologyAdvice
receives compensation. This compensation may impact how and
where products appear on this site including, for example,
the order in which they appear. TechnologyAdvice does not
include all companies or all types of products available in
the marketplace.

Source: TheNeuron

Related Posts

Jensen Huang says he’s found a ‘brand new’ $200B market for Nvidia

Anthropic says it’s about to have its first profitable quarter

Clouted wants to take the guesswork out of making short videos go viral