Billion-Parameter Theories – Sean Linehan

For most of human history, the things we couldn’t explain, we called
mystical. The movement of stars, the trajectories of projectiles,
the behavior of gases. Then, over the course of a few centuries, we
pulled these phenomena into the domain of human inquiry. We called
it science.

What’s remarkable, in retrospect, is how terse those
explanations turned out to be. F=ma. E=mc². PV=nRT.

The universe, or at least vast swaths of it, submitted to
compression ratios that seem almost unreasonable. You could capture
the behavior of every falling object on Earth in three variables and
describe the relationship between matter and energy in five
characters.

The deepest truths fit on a napkin.

They had to. When your tools are pencils, chalkboards, and human
working memory, a theory has to be small or you can’t use it. The
decompression happens in a human brain in real time. So theories
needed to be not just correct, but operable at human scale.
A physicist scribbling equations on paper needs to be able to hold
the model in her head while she works through implications.

And so we developed an implicit belief that good theories are small.
If a theory was elegant, we learned to trust it. If you couldn’t
express it concisely, you probably didn’t understand it well enough.

This worked extraordinarily well for a certain class of problems.
Call them the complicated.

A complicated system is one with many parts that interact in
structured ways, but that ultimately yields to decomposition. A jet
engine is complicated, and so are orbital mechanics and the
circuit board in your laptop. You can break these systems into
components, study each one, and reassemble your understanding into a
coherent picture. The picture might be intricate, but it is, in
principle, completable.

The Enlightenment and its intellectual descendants gave us a
powerful toolkit for taming the complicated. And then we made the
natural mistake of assuming that toolkit would scale to everything.

The Complex

Poverty is not complicated. It is complex.

So is climate change. So is drug addiction, mental health, immune
response, urban decay, ecosystem collapse, and the behavior of
financial markets.

These are systems where the interactions between dimensions are
themselves dynamic. Feedback loops create emergent behavior that
isn’t derivable from studying the components in isolation.
Interventions in one area produce non-obvious cascading effects in
others. And in many cases, like markets or public health,
studying the system can cause changes to the system itself through
reflexivity.

We’ve known about this distinction for decades. The Santa Fe
Institute, founded in 1984 by scientists who realized their own
disciplines couldn’t speak to each other about the problems that
actually mattered, was built around precisely this insight.

Researchers there, working across physics, biology, economics, and
computer science, identified recurring features of complex systems,
from power law distributions and self-organized criticality to
sensitivity to initial conditions and phase transitions. They
created a vocabulary and a set of concepts that advanced our
understanding.

But they also ran into a wall.

The concepts they developed were descriptive rather than
prescriptive. Knowing that a system exhibits power law behavior
tells you the shape of what will happen without telling you
the specifics. You couldn’t pick these principles up and
use them to intervene in the world with precision.

There’s a parallel in linguistics. Chomsky showed that all human
languages share deep recursive structure. True, and essentially
irrelevant to the language models that actually learned to do
something with language. The universal principles were
correct, but too general to be operable.

Complex systems remained resistant to science. But we tried anyway.
Economics attempted to become the physics of human markets. We built
elegant mathematical models with perfectly rational agents and
perpetual equilibrium. The models were so mathematically pristine
that physicists who encountered them marveled at the technique while
questioning whether any of it described the actual world.

Pharmacology tried to treat the body as a complicated machine,
targeting individual pathways with individual molecules. Sometimes
it works brilliantly. Sometimes it works partially. And often it
doesn’t work at all, because the body is a web of interactions that
doesn’t respect the boundaries we draw around individual
mechanisms.

The pattern repeated everywhere we applied Enlightenment tools to
complex problems. Partial success, persistent failure, and the
lingering sense that we were missing something fundamental.

Practice Before Theory

There’s an old pattern in science. Practice comes first.

Blacksmiths worked metal for millennia before metallurgy existed as
a discipline. Medieval architects built Gothic cathedrals that still
stand today without any formal understanding of structural
engineering. Farmers selectively bred crops for thousands of years
before anyone had heard of genetics.

In each case, practitioners developed reliable and useful
capabilities without any theoretical understanding of the underlying
mechanisms. And then, when theory finally caught up, it didn’t just
explain what practitioners were already doing. It blew the doors
open. Metallurgy didn’t just explain blacksmithing, it gave us
titanium alloys and semiconductors. Structural engineering didn’t
just explain cathedrals, it gave us skyscrapers.

I think we’re in an analogous moment with complexity.

The tools of modern AI, from deep neural networks to transformer
architectures, let us build compressed models of complex systems
that actually work. We can do things with them. But we are, in a
meaningful sense, the blacksmiths. We make improvements through
intuition and experiment. We know what works without fully
understanding why.

The Santa Fe Institute spent the late 1980s building early
prototypes of exactly these tools. Researchers there created
artificial stock markets with adaptive agents that spontaneously
produced bubbles and crashes. They built self-organizing networks
and genetic algorithms. But the models remained too small to be
operable, and the elegant law of self-organization they hoped to
discover never materialized.

The Missing Medium

So why do today’s models work when SFI’s didn’t?

Not because we found better equations. Because the theory these
problems require is simply very large, and we finally have tools
that can hold it.

Elegant equations might not exist for complex systems. The most
compressed possible representation of how a complex system
behaves might still be billions of parameters large. Larger than
anything a human brain can hold in working memory. For as long as
our only tool for operationalizing theories was the human mind armed
with pencil and paper, these problems were simply beyond our reach.

They aren’t anymore.

Take large language models. Fundamentally, a large language
model is a compressed model of an extraordinarily complex system,
the totality of human language use, which itself reflects human
thought, culture, social dynamics, and reasoning. The compression
ratio is enormous. The model is unimaginably smaller than the system
it represents. That makes it a theory of that system, in
every sense that matters, a lossy but useful representation that
lets you make predictions and run counterfactuals.

It’s just not a theory that fits on a t-shirt.

Good Explanations Have Reach

There’s a reasonable objection to everything I’ve argued so far, and
it comes from the physicist and philosopher David Deutsch. Deutsch
holds that good explanations are compact and general, hard to vary
without breaking. The more caveats and carve-outs a theory requires,
the worse it smells. E=mc² has reach because it applies
universally and you can’t tinker with it. A lookup table of
experimental results does not.

By this standard, a billion-parameter neural network doesn’t look
like a theory. It might give you useful
predictions about a particular complex system, but it offers no
portable understanding. You can’t pick it up and carry it to a new
problem.

Deutsch would look at “the model is the theory” and see
capitulation.

This objection has force. But it rests on a conflation.

When we talk about a trained model, we’re talking about the weights.
Billions of numerical parameters encoding what the model learned
from a specific dataset. Those weights are large and parochial.

But the architecture of the model, the structure that made
learning possible in the first place, is something else entirely.

The architecture of a transformer can be described on a few sheets
of paper. Attention mechanisms, feed-forward layers, residual
connections, layer normalization. And this same compact structure,
when trained on language, learns language. Trained on protein
structures, it learns protein folding. Trained on weather patterns,
it learns weather.

That’s reach.
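
To make concrete just how small that description is, here is a minimal sketch of one transformer block in PyTorch. The layer sizes and names are illustrative assumptions, not drawn from any particular published implementation:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: attention, a feed-forward layer, residual connections, layer norm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention, with a residual connection around it
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Feed-forward network, with a residual connection around it
        x = x + self.ff(self.norm2(x))
        return x
```

Stack a few dozen of these behind an embedding layer and, structurally, that is the whole architecture. Everything else lives in the training data and the weights it produces.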

So perhaps there are two layers of theory here. The system-specific
layer, the trained weights, is large and particular to its domain.
This will likely always be true. The theory of this economy
or this climate will always be vast.

But the meta-layer, the minimal architecture that can learn to
represent arbitrary complex systems, might be compact and universal.
It might be exactly the kind of good explanation Deutsch would
champion.

If that’s right, the physics of complexity would look different from
what anyone at the Santa Fe Institute expected. It would not be a
law about how complex systems behave. It would be a description of
what structure can learn them.

Andrej Karpathy’s work on nanoGPT is, in a practical sense, a
search for exactly this: the smallest possible implementation that
can still be trained to model complex phenomena. Strip away
everything that isn’t load-bearing. What’s left?

We haven’t found it yet. The transformer might not be the final
answer. But for the first time, we have candidate architectures that
demonstrably work across wildly different domains of complexity.

Interpretability as Complexity Science

The architecture might be compact, but the trained models remain
vast and opaque. And there’s a tempting conclusion to draw from
this. We’ve built useful oracles, but oracles aren’t science.

The emerging field of mechanistic interpretability suggests
otherwise. Researchers are developing tools to understand
how neural networks do what they do, from network ablation
and selective activation to feature visualization and circuit
tracing. These techniques let you study a trained model the way a
biologist studies an organism, through careful experimentation and
observation.
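
To make "network ablation" concrete, here is a minimal sketch of the idea in PyTorch. The model, layer, and metric are hypothetical stand-ins for whatever trained network and behavior you happen to be probing, and the sketch assumes the layer's output is a single tensor:

```python
import torch

def ablation_effect(model, layer, unit_idx, inputs, metric):
    """Zero out one unit's activations and measure how much a behavior degrades."""
    with torch.no_grad():
        baseline = metric(model(inputs))

    def zero_unit(module, inp, out):
        out = out.clone()
        out[..., unit_idx] = 0.0  # ablate a single unit in this layer's output
        return out

    handle = layer.register_forward_hook(zero_unit)
    try:
        with torch.no_grad():
            ablated = metric(model(inputs))
    finally:
        handle.remove()

    # A large drop suggests the unit is load-bearing for the measured behavior.
    return baseline - ablated
```

Feature visualization and circuit tracing are more elaborate, but the posture is the same: perturb the specimen, observe what changes.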

By studying how these models internally represent complex phenomena,
we may extract more compressible truths about the phenomena
themselves. If a neural network trained on climate data develops
internal representations that cluster certain variables together in
unexpected ways, that’s a clue about the structure of the underlying
system.

The model becomes not just a tool for prediction, but a
specimen for study.

In this light, mechanistic interpretability might be the actual
emerging science of complexity. The method is different from
anything in the Enlightenment toolkit. You don’t start with first
principles and derive equations. You train a model that captures the
behavior of a complex system, and then you study the model to
discover what structure it found.

The theory is extracted from the compression, rather than
the compression being derived from the theory.

It’s early, but the direction is promising.

What This Changes

If this framing is right, many of the hardest problems facing
humanity, from chronic disease and addiction to poverty and
climate, were never fundamentally intractable. They were just too
complex for the only medium of theory we had.

And now we have a new medium.

The problems remain hard. Building a sufficiently rich model of a
complex system is an enormous undertaking. And the epistemology
shifts in ways that might be uncomfortable. Instead of “I understand
the causal mechanism and can predict what happens if I change X,”
you get something more like “I have a sufficiently rich model that I
can simulate what happens if I change X, with probabilistic
confidence.” The answers are distributions, not deterministic
outputs. That’s a different kind of knowing.

But it might be the kind of knowing these problems actually admit.

We spent centuries wishing complex systems would yield to terse,
elegant theories. The models that capture any particular complex
system will probably always be large. But the structure that can
learn them all might yet prove to be small.

It’s remarkable how much of reality turned out to be modelable by
theories that fit in a few symbols. Perhaps it shouldn’t be
remarkable at all that not everything can be.
