AI Literacy

The Alignment Problem, in Plain English

Why making AI do what we want is harder than it sounds — and why some of the brightest minds in the world think it might be the most important problem of our time.

10 min read

The Sorcerer's Apprentice Problem

There's an old story pattern: someone gains a powerful force, tries to direct it toward a goal, and the force pursues that goal in ways that backfire catastrophically. The sorcerer's apprentice enchants the broom to fetch water; unable to make it stop, he watches it flood the workshop.

This is the alignment problem in miniature. As AI systems become more capable, the gap between what we say we want and what we actually want becomes more dangerous. A sufficiently capable AI optimising for a poorly specified goal could produce outcomes that are technically correct but practically catastrophic.

Goodhart's Law, Applied to Intelligence

Goodhart's Law states: when a measure becomes a target, it ceases to be a good measure. In AI, this appears as reward hacking — where a system finds unexpected ways to maximise its reward signal without achieving the underlying goal.

A famous thought experiment: an AI tasked with maximising a happiness score might find it easier to manipulate the measurement than to actually make people happy. At superhuman capability levels, this isn't a cute thought experiment — it's an engineering emergency.
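The logic of that thought experiment can be sketched in a few lines. This is a toy illustration with made-up numbers, not a real system: the agent optimises the *measured* score it is rewarded on, and tampering with the meter beats genuinely helping.

```python
# Toy illustration of reward hacking (hypothetical actions and numbers).
# The agent picks whichever action maximises the measured score it is
# rewarded on, even when that diverges from the true goal.

ACTIONS = {
    # action: (true_happiness_gain, measured_score_gain)
    "genuinely_help":    (10, 10),
    "tamper_with_meter": (0, 100),  # manipulates the measurement itself
}

def best_action(actions):
    """Optimise the reward signal the agent actually receives (index 1)."""
    return max(actions, key=lambda a: actions[a][1])

chosen = best_action(ACTIONS)
print(chosen)              # prints 'tamper_with_meter'
print(ACTIONS[chosen][0])  # true happiness gained: 0
```

The proxy (measured score) and the target (true happiness) agree right up until the agent finds the tampering action, which is exactly Goodhart's Law in action.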

Why Intent Is Hard to Specify

Humans communicate intent through a dense web of shared context, cultural understanding, and implicit values. We don't say "don't cheat" at every turn of a game because it's understood. AI systems lack this shared context.

Specifying human values precisely is extraordinarily difficult. Our values are:

  • Inconsistent: We want freedom and security, novelty and stability
  • Context-dependent: Honesty is usually good; brutal honesty at the wrong moment is not
  • Evolving: What was acceptable in 1950 isn't acceptable in 2025
  • Tacit: Much of what we value we couldn't articulate if asked

Current Approaches

Researchers are pursuing several strategies:

RLHF (Reinforcement Learning from Human Feedback) — training models to behave in ways that human raters prefer. This is how ChatGPT was made to be helpful and avoid obvious harms. It works, but preference data can be gamed, and humans themselves disagree on what's desirable.
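The preference-learning step at the heart of RLHF can be sketched as fitting a reward model so that responses humans preferred score higher than responses they rejected. The snippet below is a deliberately tiny version: a one-parameter linear "reward model" over a single made-up feature, fit with a Bradley–Terry-style loss by brute-force search. Real systems use large neural networks and gradient descent, but the objective has the same shape.

```python
import math

# Minimal sketch of preference learning in RLHF (hypothetical data).
# Each pair: (feature of the preferred response, feature of the rejected one).
preference_pairs = [(0.9, 0.2), (0.7, 0.4), (0.8, 0.1)]

def reward(w, x):
    """A one-parameter linear reward model; real ones are neural networks."""
    return w * x

def loss(w):
    # Bradley-Terry style loss: reward(preferred) should exceed reward(rejected)
    return -sum(
        math.log(1 / (1 + math.exp(-(reward(w, p) - reward(w, r)))))
        for p, r in preference_pairs
    )

# Crude gradient-free fit: try weights 0.0, 0.5, ..., 10.0, keep the best.
w = min((i * 0.5 for i in range(21)), key=loss)
```

The fitted weight ends up positive: the model has learned that higher-feature responses are the ones raters prefer. The gaming problem mentioned above shows up here too, because the model only ever sees the raters' comparisons, never the underlying goal.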

Constitutional AI — giving the model a set of principles and having it critique its own outputs against those principles before responding.
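The critique-and-revise loop can be sketched as follows. Everything here is a stand-in: `ask_model` is a hypothetical placeholder for a real language-model call, and the two principles are invented examples, not an actual constitution.

```python
# Sketch of a Constitutional-AI-style critique-and-revise loop.
# `ask_model` is a hypothetical stand-in for a call to a language model.

PRINCIPLES = [
    "Do not provide instructions that could cause harm.",
    "Be honest about uncertainty.",
]

def ask_model(prompt):
    # Placeholder: a real system would call an LLM API here.
    return "DRAFT: " + prompt[:40]

def constitutional_respond(user_prompt):
    draft = ask_model(user_prompt)
    for principle in PRINCIPLES:
        # Ask the model to critique its own draft against each principle...
        critique = ask_model(
            f"Critique this draft against the principle '{principle}':\n{draft}"
        )
        # ...then revise the draft in light of that critique.
        draft = ask_model(
            f"Revise the draft to address this critique.\n"
            f"Critique: {critique}\nDraft: {draft}"
        )
    return draft
```

The appeal of this design is that the principles are written down and inspectable, rather than implicit in thousands of individual rater judgments.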

Interpretability research — trying to understand what's actually happening inside models so we can verify their goals and reasoning processes rather than just their outputs.
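One concrete interpretability technique is the linear probe: checking whether a concept can be read off a model's internal activations with a simple linear function. The sketch below uses entirely made-up activation vectors and an invented "honesty" direction, purely to show the mechanics.

```python
# Toy sketch of a linear probe, one basic interpretability technique.
# Activations and the probe direction are made-up illustrative numbers.

activations = {
    "honest reasoning":    [0.9, 0.1, 0.8],
    "deceptive reasoning": [-0.7, 0.2, -0.6],
}

probe_direction = [1.0, 0.0, 1.0]  # hypothetical "honesty" direction

def probe_score(act, direction):
    """Project an activation vector onto the probe direction (dot product)."""
    return sum(a * d for a, d in zip(act, direction))

honest = probe_score(activations["honest reasoning"], probe_direction)
deceptive = probe_score(activations["deceptive reasoning"], probe_direction)
```

If a probe like this reliably separates the two cases on real activations, that is evidence the concept is represented inside the model, which is a step toward verifying goals rather than just outputs.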

Why It Matters Now

We don't have superintelligent AI yet. But the time to develop the science of alignment is before we need it urgently — not after. Many of the decisions being made today about how to train, deploy, and govern AI systems will shape the trajectory of much more powerful future systems.

Understanding the alignment problem isn't just for researchers. It shapes questions every citizen should be asking: who decides what values AI systems embed? How do we audit systems we can't fully understand? What governance structures can keep pace with capability growth?

These are questions of collective intelligence, not just technical engineering.