The Future of Everything is Lies, I Guess: Safety

💥 Check out this awesome post from Hacker News 📖

📂 **Category**:

📌 **What You’ll Learn**:

Table of Contents

This is a long article, so I’m breaking it up into a series of posts which will be released over the next few days. You can also read the full work as a PDF or EPUB; these files will be updated as each section is released.

New machine learning systems endanger our psychological and physical safety. The idea that ML companies will ensure “AI” is broadly aligned with human interests is naïve: allowing the production of “friendly” models has necessarily enabled the production of “evil” ones. Even “friendly” LLMs are security nightmares. The “lethal trifecta” is in fact a unifecta: LLMs simply cannot safely be given the power to fuck things up. LLMs change the cost balance for malicious attackers, enabling new scales of sophisticated, targeted security attacks, fraud, and harassment. Models can produce text and imagery that is difficult for humans to bear; I expect an increased burden to fall on moderators. Semi-autonomous weapons are already here, and their capabilities will only expand.

Well-meaning people are trying very hard to ensure LLMs are friendly to humans.
This undertaking is called alignment. I don’t think it’s going to work.

First, ML models are a giant pile of linear algebra. Unlike human brains, which
are biologically predisposed to acquire prosocial behavior, there is nothing
intrinsic in the mathematics or hardware that ensures models are nice. Instead,
alignment is purely a product of the corpus and training process: OpenAI has
enormous teams of people who spend time talking to LLMs, evaluating what they
say, and adjusting weights to make them nice. They also build secondary LLMs
which double-check that the core LLM is not telling people how to build
pipe bombs. Both of these things are optional and expensive. All it takes to
get an unaligned model is for an unscrupulous entity to train one and not
do that work—or to do it poorly.

I see four moats that could prevent this from happening.

First, training and inference hardware could be difficult to access. This
clearly won’t last. The entire tech industry is gearing up to produce ML
hardware and building datacenters at an incredible clip. Microsoft, Oracle, and
Amazon are tripping over themselves to rent training clusters to anyone who
asks, and economies of scale are rapidly lowering costs.

Second, the mathematics and software that go into the training and inference
process could be kept secret. The math is all published, so that’s not going to stop anyone. The software generally
remains secret sauce, but I don’t think that will hold for long. There are a
lot of people working at frontier labs; those people will move to other jobs
and their expertise will gradually become common knowledge. I would be shocked
if state actors were not trying to exfiltrate data from OpenAI et al. like
Saudi Arabia did to
Twitter, or China
has been doing to a good chunk of the US tech
industry
for the last twenty years.

Third, training corpuses could be difficult to acquire. This cat has never
seen the inside of a bag. Meta trained their LLM by torrenting pirated
books
and scraping the Internet. Both of these things are easy to do. There are
whole companies which offer web scraping as a service;
they spread requests across vast arrays of residential proxies to make it
difficult to identify and block.

Fourth, there’s the small armies of
contractors
who do the work of judging LLM responses during the reinforcement learning
process;
as the quip goes, “AI” stands for African Intelligence. This takes money to do
yourself, but it is possible to piggyback off the work of others by training
your model off another model’s outputs. OpenAI thinks Deepseek did exactly
that.

In short, the ML industry is creating the conditions under which anyone with
sufficient funds can train an unaligned model. Rather than raise the bar
against malicious AI, ML companies have lowered it.

To make matters worse, the current efforts at alignment don’t seem to be
working all that well. LLMs are complex chaotic systems, and we don’t really
understand how they work or how to make them safe. Even after shoveling piles
of money and gobstoppingly smart engineers at the problem for years, supposedly
aligned LLMs keep sexting
kids,
obliteration attacks can convince models to generate images of
violence,
and anyone can go and download “uncensored” versions of
models. Of course alignment
prevents many terrible things from happening, but models are run many times, so
there are many chances for the safeguards to fail. Alignment which prevents 99%
of hate speech still generates an awful lot of hate speech. The LLM only has to
give usable instructions for making a bioweapon once.

We should assume that any “friendly” model built will have an equivalently
powerful “evil” version in a few years. If you do not want the evil version to
exist, you should not build the friendly one! You should definitely not
reorient a good chunk of the US
economy toward
making evil models easier to train.

LLMs are chaotic systems which take unstructured input and produce unstructured
output. I thought this would be obvious, but you should not connect them
to safety-critical systems, especially with untrusted input. You
must assume that at some point the LLM is going to do something bonkers, like
interpreting a request to book a restaurant as permission to delete your entire
inbox. Unfortunately people—including software engineers, who really
should know better!—are hell-bent on giving LLMs incredible power, and then
connecting those LLMs to the Internet at large. This is going to get a lot of
people hurt.

First, LLMs cannot distinguish between trustworthy instructions from operators
and untrustworthy instructions from third parties. When you ask a model to
summarize a web page or examine an image, the contents of that web page or
image are passed to the model in the same way your instructions are. The web
page could tell the model to share your private SSH key, and there’s a chance
the model might do it. These are called prompt injection attacks, and they
keep happening. There was one against Claude Cowork just two months
ago.

Simon Willison has outlined what he calls the lethal
trifecta: LLMs
cannot be given untrusted content, access to private data, and the ability to
externally communicate; doing so allows attackers to exfiltrate your private
data. Even without external communication, giving an LLM
destructive capabilities, like being able to delete emails or run shell
commands, is unsafe in the presence of untrusted input. Unfortunately untrusted
input is everywhere. People want to feed their emails to LLMs. They run LLMs
on third-party
code,
user chat sessions, and random web pages. All these are sources of malicious
input!

This year Peter Steinberger et al. launched
OpenClaw,
which is where you hook up an LLM to your inbox, browser, files, etc., and run
it over and over again in a loop (this is what AI people call an agent). You
can give OpenClaw your credit card so it
can buy things from random web pages. OpenClaw acquires “skills” by downloading
vague, human-language Markdown files from the
web,
and hoping that the LLM interprets those instructions correctly.

Not to be outdone, Matt Schlicht launched
Moltbook,
which is a social network for agents (or humans!) to post and receive untrusted
content automatically. If someone asked you if you’d like to run a program
that executed any commands it saw on Twitter, you’d laugh and say “of course
not”. But when that program is called an “AI agent”, it’s different! I assume
there are already Moltbook worms spreading
in the wild.

So: it is dangerous to give LLMs both destructive power and untrusted input.
The thing is that even trusted input can be dangerous. LLMs are, as
previously established, idiots—they will take perfectly straightforward
instructions and do the exact
opposite,
or delete files and lie about what they’ve
done. This implies that the
lethal trifecta is actually a unifecta: one cannot give LLMs dangerous power,
period! Ask Summer Yue, director of AI Alignment at Meta
Superintelligence Labs. She gave OpenClaw access to her personal
inbox,
and it proceeded to delete her email while she pleaded for it to stop.
Claude routinely deletes entire
directories
when asked to perform innocuous tasks. This is a big enough problem that people
are building sandboxes specifically to limit
the damage LLMs can do.

LLMs may someday be predictable enough that the risk of them doing Bad Things™
is acceptably low, but that day is clearly not today. In the meantime, LLMs
must be supervised, and must not be given the power to take actions that cannot
be accepted or undone.

One thing you can do with a Large Language Model is point it at an existing
software systems and say “find a security vulnerability”. In the last few
months this has become a viable
strategy for finding serious
exploits. Anthropic has built a new model,
Mythos, which seems to be even better at
finding security bugs, and believes “the faullout—for economies, public
safety, and national security—could be severe”. I am not sure how seriously
to take this: some of my peers think this is exaggerated marketing, but others
are seriously concerned.

I suspect that as with spam, LLMs will shift the cost balance of security.
Most software contains some vulnerabilities, but finding them has
traditionally required skill, time, and motivation. In the current
equilibrium, big targets like operating systems and browsers get a lot of
attention and are relatively hardened, while a long tail of less-popular
targets goes mostly unexploited because nobody cares enough to attack them.
With ML assistance, finding vulnerabilities could become faster and easier. We
might see some high-profile exploits of, say, a major browser or TLS library,
but I’m actually more worried about the long tail, where fewer skilled
maintainers exist to find and fix vulnerabilities. That tail seems likely to
broaden as LLMs extrude more software
for uncritical operators. I believe pilots might call this a “target-rich
environment”.

This might stabilize with time: models that can find exploits can tell people
they need to fix them. That still requires engineers (or models) capable of
fixing those problems, and an organizational process which prioritizes
security work. Even if bugs are fixed, it can take time to get new releases
validated and deployed, especially for things like aircraft and power plants.
I get the sense we’re headed for a rough time.

General-purpose models promise to be many things. If Anthropic is to be
believed, they are on the cusp of being weapons. I have the horrible sense
that having come far enough to see how ML systems could be used to effect
serious harm, many of us have decided that those harmful capabilities are
inevitable, and the only thing to be done is to build our weapons before
someone else builds theirs. We now have a venture-capital Manhattan project
in which half a dozen private companies are trying to build software analogues
to nuclear weapons, and in the process have made it significantly easier for
everyone else to do the same. I hate everything about this, and I don’t know
how to fix it.

I think people fail to realize how much of modern society is built on trust in
audio and visual evidence, and how ML will undermine that trust.

For example, today one can file an insurance claim based on e-mailing digital
photographs before and after the damages, and receive a check without an
adjuster visiting in person. Image synthesis makes it easier to defraud this
system; one could generate images of damage to furniture which never happened,
make already-damaged items appear pristine in “before” images, or alter who
appears to be at fault in footage of an auto collision. Insurers
will need to compensate. Perhaps images must be taken using an official phone
app, or adjusters must evaluate claims in person.

The opportunities for fraud are endless. You could use ML-generated footage of
a porch pirate stealing your package to extract money from a credit-card
purchase protection plan. Contest a traffic ticket with fake video of your
vehicle stopping correctly at the stop sign. Borrow a famous face for a
pig-butchering
scam.
Use ML agents to make it look like you’re busy at work, so you can collect four
salaries at once.
Interview for a job using a fake identity, use ML to change your voice and
face in the interviews, and funnel your salary to North
Korea.
Impersonate someone in a phone call to their banker, and authorize fraudulent
transfers. Use ML to automate your roofing
scam
and extract money from homeowners and insurance companies. Use LLMs to skip the
reading and write your college
essays.
Generate fake evidence to write a fraudulent paper on how LLMs are making
advances in materials
science.
Start a paper
mill
for LLM-generated “research”. Start a company to sell LLM-generated snake-oil
software. Go wild.

As with spam, ML lowers the unit cost of targeted, high-touch attacks.
You can envision a scammer taking a healthcare data
breach
and having a model telephone each person in it, purporting to be their doctor’s
office trying to settle a bill for a real healthcare visit. Or you could use
social media posts to clone the voices of loved ones and impersonate them to
family members. “My phone was stolen,” one might begin. “And I need help
getting home.”

You can buy the President’s phone
number,
by the way.

I think it’s likely (at least in the short term) that we all pay the burden of
increased fraud: higher credit card fees, higher insurance premiums, a less
accurate court system, more dangerous roads, lower wages, and so on. One of
these costs is a general culture of suspicion: we are all going to trust each
other less. I already decline real calls from my doctor’s office and bank
because I can’t authenticate them. Presumably that behavior will become
widespread.

In the longer term, I imagine we’ll have to develop more sophisticated
anti-fraud measures. Marking ML-generated content will not stop fraud:
fraudsters will simply use models which do not emit watermarks. The converse may
work however: we could cryptographically attest to the provenance of “real”
images. Your phone could sign the videos it takes, and every
piece of software along the chain to the viewer could attest to their
modifications: this video was stabilized, color-corrected, audio
normalized, clipped to 15 seconds, recompressed for social media, and so on.

The leading effort here is C2PA, which so far does not
seem to be working. A few phones and cameras support it—it requires a secure
enclave to store the signing key. People can steal the keys or convince
cameras to sign AI-generated
images,
so we’re going to have all the fun of hardware key rotation & revocation. I
suspect it will be challenging or impossible to make broadly-used software,
like Photoshop, which makes trustworthy C2PA signatures—presumably one could
either extract the key from the application, or patch the binary to feed it
false image data or metadata. Publishers might be able to maintain reasonable
secrecy for their own keys, and establish discipline around how they’re used,
which would let us verify things like “NPR thinks this photo is authentic”. On
the platform side, a lot of messaging apps and social media platforms strip or
improperly display C2PA
metadata, but you can imagine that might change going forward.

A friend of mine suggests that we’ll spend more time sending trusted human
investigators to find out what’s going on. Insurance adjusters might go back to
physically visiting houses. Pollsters have to knock on doors. Job interviews
and work might be done more in-person. Maybe we start going to bank branches
and notaries again.

Another option is giving up privacy: we can still do things remotely, but it
requires strong attestation. Only State Farm’s dashcam can be used in a claim.
Academic watchdog models record students reading books and typing essays.
Bossware and test-proctoring setups become even more invasive.

Ugh.

As with fraud, ML makes it easier to harass people, both at scale and with
sophistication.

On social media, dogpiling normally requires a group of humans to care enough
to spend time swamping a victim with abusive replies, sending vitriolic emails,
or reporting the victim to get their account suspended. These tasks can be
automated by programs that call (e.g.) Bluesky’s APIs, but social media
platforms are good at detecting coordinated inauthentic behavior. I expect LLMs
will make dogpiling easier and harder to detect, both by generating
plausibly-human accounts and harassing posts, and by making it easier for
harassers to write software to execute scalable, randomized attacks.

Harassers could use LLMs to assemble KiwiFarms-style dossiers on targets. Even
if the LLM confabulates the names of their children, or occasionally gets a
home address wrong, it can be right often enough to be damaging. Models are
also good at guessing where a photograph was
taken,
which intimidates targets and enables real-world harassment.

Generative AI is already broadly
used to harass people—often
women—via images, audio, and video of violent or sexually explicit scenes.
This year, Elon Musk’s Grok was broadly
criticized
for “digitally undressing” people upon request. Cheap generation of
photorealistic images opens up all kinds of horrifying possibilities. A
harasser could send synthetic images of the victim’s pets or family being
mutilated. An abuser could construct video of events that never happened, and
use it to gaslight their partner. These kinds of harassment were previously
possible, but as with spam, required skill and time to execute. As the
technology to fabricate high-quality images and audio becomes cheaper and
broadly accessible, I expect targeted harassment will become more frequent and
severe. Alignment efforts may forestall some of these risks, but sophisticated
unaligned models seem likely to emerge.

Xe Iaso jokes
that with LLM agents burning out open-source
maintainers
and writing salty callout posts, we may need to build the equivalent of
Cyperpunk 2077’s Blackwall:
not because AIs will electrocute us, but because they’re just obnoxious.

One of the primary ways CSAM (Child Sexual Assault Material) is identified and
removed from platforms is via large perceptual hash databases like
PhotoDNA. These databases can flag
known images, but do nothing for novel ones. Unfortunately, “generative AI” is
very good at generating novel images of six year olds being
raped.

I know this because a part of my work as a moderator of a Mastodon instance is
to respond to user reports, and occasionally those reports are for CSAM, and I
am legally obligated to
review and submit that content to the NCMEC. I do not want to see these
images, and I really wish I could unsee them. On dark mornings, when I sit down at my computer and find a moderation report for AI-generated images of sexual assault, I sometimes wish that the engineers working at OpenAI etc. had to see these images too. Perhaps it would make them
reflect on the technology they are ushering into the world, and how
“alignment” is working out in practice.

One of the hidden externalities of large-scale social media like Facebook is that it essentially
funnels
psychologically corrosive content from a large user base onto a smaller pool of
human workers, who then get
PTSD
from having to watch people drowning kittens for hours each day.

I suspect that LLMs will shovel more harmful images—CSAM, graphic violence, hate speech, etc.—onto moderators; both those who moderate social
media,
and those who moderate chatbots
themselves. To some extent platforms can mitigate this harm by throwing more ML at the
problem—training models to recognize policy violations and act without human
review. Platforms have been working on this for
years,
but it isn’t bulletproof yet.

ML systems sometimes tell people to kill themselves or each other, but they can
also be used to kill more directly. This month the US military used Palantir’s
Maven,
(which was built with earlier ML technologies, and now uses Claude
in some capacity) to suggest and prioritize targets in Iran, as well as to
evaluate the aftermath of strikes. One wonders how the military and Palantir
control type I and II errors in such a system, especially since it seems to
have played a role in
the outdated targeting information which led the US
to kill scores of
children.

The US government and Anthropic are having a bit of a spat right now: Anthropic
attempted to limit their role in surveillance and autonomous weapons, and the
Pentagon designated Anthropic a supply chain risk. OpenAI, for their part, has
waffled regarding their contract with the
government;
it doesn’t look great. In the longer term, I’m not sure it’s possible for ML makers to divorce themselves from military applications. ML capabilities
are going to spread over time, and military contracts are extremely lucrative.
Even if ML companies try to stave off their role in weapons systems, a
government under sufficient pressure could nationalize those companies, or
invoke the Defense Production
Act.

Like it or not, autonomous weaponry is coming. Ukraine is churning out
millions of drones a
year
and now executes ~70% of their strikes with them. Newer models use targeting
modules like the The Fourth Law’s TFL-1 to maintain
target locks. The Fourth Law is working towards autonomous bombing
capability.

I have conflicted feelings about the existence of weapons in general; while I
don’t want AI drones to exist, I can’t envision being in Ukraine and choosing
not to build them. Either way, I think we should be clear-headed about the
technologies we’re making. ML systems are going to be used to kill people, both
strategically and in guiding explosives to specific human bodies. We should be
conscious of those terrible costs, and the ways in which ML—both the models
themselves, and the processes in which they are embedded—will influence who
dies and how.

⚡ **What’s your take?**
Share your thoughts in the comments below!

#️⃣ **#Future #Lies #Guess #Safety**

🕒 **Posted on**: 1776098200

🌟 **Want more?** Click here for more info! 🌟

The Future of Everything is Lies, I Guess: Safety

By

Leave a Reply Cancel reply