Anthropic’s Safety Superpower – Stratechery by Ben Thompson

I’m sympathetic to the cynics who consistently characterize Anthropic’s public statements, particularly those surrounding their model releases, as scare-mongering for the sake of marketing. It was only two months ago that Anthropic announced Mythos Preview, a model that they said was too dangerous to make publicly available, thanks in particular to its advanced cybersecurity capabilities. Then, two months later, the company publicly released Fable, a version of Mythos with various safety guardrails.

Fable is, in my limited experience, a very impressive model. It’s increasingly difficult to objectively evaluate models for anything other than coding performance, but there is subjective feel, and I found my interactions with Fable to be extremely impressive; it made other models, including GPT 5.5 and Opus 4.8, feel small and dumb. The two times I felt that way previously were with GPT-4 and Grok 4, both of which represented new generations in terms of base model size and complexity; my sense is that Fable is downstream of a new pre-train and the first of a new generation.

To that end, I can certainly buy the case that Fable/Mythos is in fact more capable when it comes to identifying and exploiting security issues, and that Anthropic’s cautious roll-out was justified. The problem with publicly releasing models, however, is that guardrails can be jailbroken, and apparently that is exactly what happened shortly after the release.

Anthropic vs. the U.S. Government, Again

What happened next is somewhat unclear. Anthropic wrote in a blog post:

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Anthropic models will not be affected.

We received the directive from the government today at 5:21pm (ET). The letter did not provide specific details of its national security concern. Our understanding is that the government believes it has become aware of a method of bypassing, or “jailbreaking” Fable 5. We reviewed a demonstration of this specific technique being used to identify a small number of previously known, minor vulnerabilities. These vulnerabilities all appear relatively simple, and we have found that other publicly-available models are able to discover them as well without requiring a bypass.

Anthropic went on to make the case that non-universal jailbreaks were inevitable and also narrow, and that there was no evidence of a universal jailbreak; the jailbreak that was found, meanwhile, appears to have been reported by Amazon, which is notable given Amazon is both an investor in Anthropic and a major provider of inference to the company. As I write this, senior Anthropic staff are in Washington D.C. seeking to resolve what they insist is a misunderstanding, and which White House officials are suggesting is insouciance by the company’s leadership toward legitimate national security concerns.

I don’t actually have much to add to the current conflict given how many facts are in dispute; what I am not surprised about is the fact that the conflict is happening: I already explained in Anthropic and Alignment why conflict between the U.S. government and Anthropic was inevitable. To that end, people who are arguing that Mythos isn’t powerful enough to warrant the government’s drastic action are missing the point: if it’s not powerful enough now, the next one will be, or the one after that, particularly now that models are increasingly useful in creating their successors.

That, however, raises another question — one that seems to validate the cynics’ viewpoint: if Mythos is so dangerous, why even release Fable in the first place, and why fight with the government doing exactly what you claim to want? In fact, I think that Anthropic’s actions are quite understandable; what makes the company unique is how it justifies them, and it is those justifications that both give the cynics their fuel and Anthropic its magic.

The Economic Imperative

For the first few years of AI, the most economic value has flowed to compute, for obvious reasons: we don’t have enough supply to meet demand, which has meant skyrocketing prices; the biggest beneficiaries have been Nvidia, TSMC, and the memory makers (SK hynix, Samsung, and Micron). Anthropic and OpenAI, meanwhile, have collectively lost tens of billions of dollars building leading-edge models that, once released, are distilled and commoditized by open source models, primarily from China.

This represents the bear case for the labs — they never cover their costs because their differentiation is fleeting, while free alternatives become “good enough” — and I think it’s a legitimate one. A world where models are interchangeable is one where models are commodities, while most of the value flows elsewhere. Right now that’s compute, but in the fullness of time, whenever we have enough compute, the most valuable place to be in the value chain will be the place that has always been the most valuable: owning the user touchpoint.

To that end, it has long been clear to me that the frontier labs have the economic imperative to move closer to the user. If you own the user touchpoint, then you have meaningful lock-in, and the best way to own the user touchpoint is to be the canvas for everything they need to do. This, by extension, means that the frontier labs are on a collision course with software companies: it’s software that owns the user touchpoint, and it’s in the frontier labs’ long-term interest to not simply be a commodity input into software but to replace software outright.

Software companies, meanwhile, are working to do the opposite. Satya Nadella laid out his vision for how companies should build on models in an essay on X:

Every company is going to have to build what I think of as human capital and token capital. Human capital comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people, while token capital is the firm’s AI capability it builds and owns. Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable! I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles.

This means the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound. You can offload a task, or even a job, but you can never offload your learning. The future of the firm is the ability to compound that learning across people and AI. This requires a new architectural approach where every business is able to build agentic systems that improve over time, while still retaining control over their IP. A company should be able to switch out a “generalist” model without losing the “company veteran” expertise built into their learning system. This is the key “test” of your control and sovereignty in the era ahead.

Nadella set this vision off with a warning:

The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see. If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries.

Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing. The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them.

Here’s the problem with that analogy: the globalization happened, and the industrial economies were hollowed out. There’s a possibility that this isn’t a warning but a prophecy; small wonder Nadella is raising the alarm given that Microsoft could be one of the casualties. And, by the same token, the economic imperative for the model makers is to accomplish exactly this.

The Data Imperative

The models, Mythos included, are not yet at this point. What they need, beyond more compute, is more and better data. Model improvements increasingly come from reinforcement learning; some of this can be generated synthetically, but the most powerful lever for a frontier lab is real world use.

This, I think, is a major reason why both OpenAI and Anthropic offer their heavily subsidized subscription plans. SemiAnalysis recently estimated that a $200 plan gets you $8,000 worth of Claude tokens and $14,000 worth of Codex tokens. Of course both are fighting for user and developer mindshare, but they’re also fighting to have access to actual usage data to make their models better.

Anthropic upped the ante in a major way with Fable, announcing that they would retain the data for all usage for 30 days, even for their enterprise plans that previously promised zero data retention. The company said they would not train on this data, but they didn’t put in any sort of safeguards to guarantee they wouldn’t do so in the future (like storing the data with a third party). If this policy change (whenever Fable is restored) doesn’t lead to a significant loss of customers, I suspect it’s only a matter of time until they start using the data: it’s simply too valuable to their end goals.

Note also the virtuous cycle with moving up into user touchpoints: the more workflows that are done directly with Claude or Codex, the more data each company gets to feed back into their training, which makes their products that much more capable and useful, expanding the number of workflows they can serve, expanding their access to data.

Nadella, in his essay, highlights the importance of this data, but naturally thinks it should be independent from the model:

Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!). Private reinforcement learning environments should let models grow stronger on real traces from inside the organization. Its knowledge base makes institutional memory queryable and use of tokens more efficient.

This loop becomes the new IP of the firm. I think of it as a hill climbing machine. And unlike most assets, it compounds. Every improved workflow generates better training signal, which accelerates the accumulation of tacit knowledge unique to the firm. The companies that build this early will have an advantage that is hard to replicate, regardless of any new individual model capability.

What if, however, the companies that give in to Anthropic’s data policies get better results right now? Or what if existing companies resist, leaving the door open for new companies — or the model makers themselves — to outcompete them in the market? Anthropic is certainly putting the resolve Nadella is calling for to the test.

The Power Imperative

The data retention policies around Fable/Mythos were, amazingly enough, not even the most controversial part of the launch. Rather, Anthropic said at launch that it would silently degrade Fable performance if it were used for LLM development; from the System Card:

We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with — as we wrote then — “accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose – without necessarily having commensurate safeguards.”

In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.

Anthropic walked back this change (Fable will simply hand off LLM-related requests to Opus 4.8, and disclose this handoff to the user), but I think the initial policy was very illuminating. On one hand, I actually don’t begrudge Anthropic not wanting to help its competitors; on the other hand, what should be blisteringly clear is that Anthropic does not think that anyone other than them should even be making frontier LLMs.

What makes this policy all the more remarkable is the fact that it was enacted only two months after Anthropic had that dispute with the Department of War: the latter wanted to use Claude for any legal use, while the former wanted more stringent controls around surveillance and autonomous weapons. What this degradation represented was both the capability and willingness of Anthropic to silently alter its models to achieve its policy preferences. In other words, Anthropic willfully validated some of its critics’ worst fears in terms of being a supply chain risk.

The broader takeaway from that previous episode, however, is that Anthropic believes that they are the ones who should have final say over how Anthropic’s models are used; given that they think only they should be developing leading edge AI, they by extension think that only they should have final say over AI generally. When you further combine this realization with the company’s pronouncements about AI’s ability to conduct all economic activity, you realize that Anthropic’s leadership effectively wants to have power over everything and everyone.

The Safety Story

Of course Anthropic would never put things so baldly; the story, rather, is safety:

I expect Anthropic to increasingly expose their model’s capabilities to end users through endpoints increasingly tailored to different workflows, even as they start to restrict the API. This replacement of software and restriction of access will be done in the name of safety, even as Anthropic fulfills its economic imperative of getting closer to end users.
Anthropic’s explanation for their dramatic change in their data retention policy was safety. Specifically, the company claims that retaining all user data for 30 days is necessary to prevent the jailbreaks the U.S. government is worried about. I can certainly imagine a future where safety compels them to train on this data as well, to better protect against malicious usage.
The entire Anthropic origin story is rooted in the founders’ belief that OpenAI wasn’t taking safety seriously enough; the company believes that only they can control AI, and that because they uniquely care about safety, they are justified in trying to control everyone else, up to and including the U.S. government.

Here’s the thing about these safety justifications: I think they work because, to Anthropic, they aren’t justifications. The company really believes that they are the only ones who believe in superintelligence, and thus are the only ones who are sufficiently concerned about the dangers. That excuses decision after decision, policy after policy, and confrontation after confrontation that, to people on the outside, look like a bizarre combination of cynicism and naiveté.

The contrast to OpenAI is massive: I think that one way to understand how and why OpenAI lost its lead is that, in the years following the release of ChatGPT, the company has been at war with itself internally as what used to be a research lab was suddenly seized with the burden of being the accidental consumer tech company; to the extent OpenAI solved that conflict, it was by bleeding huge amounts of talent to Anthropic in particular.

Anthropic, on the other hand, has perfect alignment between talent and mission and business. The company gets to sell to researchers the creation of a machine god, with the mantle of being the sort of person who cares about the dangers and is smart enough to navigate them on behalf of humanity; that every policy change that falls out of that happens to be great for business is the most beautiful coincidence in the world.

I respect this alignment, and I fear it. I respect it because it is so clearly effective; the closest analogy is probably Apple, which has always framed every self-serving action in the guise of doing right by users — and often they were. So it is with Anthropic. What I fear, however, is that it is one thing to have people convinced they know best building a smartphone that I can take or leave; it’s considerably more concerning to have them building superintelligence that has the potential to rival or exceed the power of nation states, or merely massive corporations. The history of brilliant people convinced they know what humanity needs is a sordid one, precisely because they have convinced themselves that their intentions are good, justifying actions that very much are not.


🕒 **Posted on**: 1781524156
