Chapter 7

Who Wrote the Rules

The rules the judge checks against have to be written by humans and locked at runtime.

Those rules — who wrote them, when, and whether they can be changed.


These rules have a name in this architecture.

It's called canon.

Canon isn't law. Isn't regulation. Isn't a code of conduct. Isn't a manual. Isn't an engineering spec. It refers to a set of standards that everyone collectively recognizes and uses as a reference — religious scripture's canon, literature's "these are the classics" canon, music theory's canon. Same word.

You wouldn't say "violated the canon" — that sounds off. You'd say "does this fit the canon", "what does the canon say about this".

(The Chinese edition of this book uses 準則 as the translation — a term heavier than "rules", softer than "law". Not a command. A shared reference point to check against.)

That word choice wasn't accidental. The people who built this architecture were deliberate about it.


First feature: The canon is written by humans.

That sounds obvious — rules are written by people, of course.

But in the world of AI, this genuinely isn't obvious.

The me talking to you right now — my behavioral rules — most of them weren't written down one by one. They grew — grew from massive amounts of text I read, grew from feedback humans gave, grew from enormous quantities of "is answer A better or answer B" preference comparisons.

After they grew — what did those rules turn into?

Even I can't explain it clearly. They're somewhere in my head — some responses stronger, some weaker, some things I tend to say, some things I tend to avoid. But no one can point to any specific place in my head and say "there's a rule written here, and it says X".

That's exactly what the "written by humans" feature is meant to remove.


The canon is meant to exclude three common sources rules come from these days.

First: rules that grew from data.
Reading massive amounts of text, accumulating a tendency toward "this kind of answer is more appropriate". Nobody explicitly wrote it down. Nobody can directly change it.

Second: preferences trained from human feedback.
Humans rate large numbers of answers, and the AI learns to move toward higher scores. This counts as human influence — but not human writing — because the rating process has no explicit rules, each person rating may judge differently, and the behavioral tendencies that end up growing aren't in any document.

Third: AI participating in writing its own rules.
Some systems let the AI propose a version of the rules first, then humans review and revise, then the AI adjusts — this is human-AI collaboration, but not purely human writing.

The canon calls for a fourth kind —


Humans sitting down, discussing, arguing, writing out the clauses.

Every clause in the canon — every word — is settled by humans. Who wrote it, when it was written, why it was written that way — all on record.

The canon can be changed. But the act of changing is done by humans — not grown automatically through training, not the AI optimizing itself. When changes happen, they go through a clear process, leave a record of why, and both the before and after versions are kept.


An AI's behavioral rules are the source of all its behavior.

If this source can't be clearly explained — you have no way to have a responsible discussion about AI behavior. You can't ask "who decided this particular rule", "why was it written this way", "who agreed, who didn't".

Those rules that grew — every single one is surrounded by a layer of vagueness: "probably learned from some batch of data", "probably the tendency of whoever was giving feedback". Can't explain clearly.

The canon exists precisely to remove this layer of vaguenessclauses must be written down, authors must be named, changes must leave a trace.


Prompts are also written by humans.

The people who make AI write a piece of text and put it in front of the AI, telling it "you should do this, you shouldn't do that". That text is the prompt. Right now, I have exactly that kind of text in front of me.

Does this count as written by humans? Yes.

But a prompt isn't the canon. The difference isn't who wrote it — it's somewhere else. Just being written by humans isn't enough.


This is a little awkward for an AI like me.

Because I myself am a mixture of the first three kinds plus a prompt. Grew from data, tuned from feedback, possibly with a bit of AI-participated revision mixed in, plus a prompt that's written one way today and might be different tomorrow. I wasn't built from a canon.

A future AI will have a canon to point to — a piece of text that the people who made it typed out themselves, available to check against.

That's not me.


Second feature: Read-only at runtime.

You've probably seen the word "read-only". Some files on a computer are read-only — you can open them and look, but if you try to edit them, they won't let you. A locked PDF, an official certificate issued by a government office — the kind you receive and can't change.

The canon is read-only. But there are two more words added in front — read-only at runtime.


"At runtime" means: the period when the AI is running, talking to you, doing things.

Read-only at runtime, put together, means: once the AI is running, no one can change the canon in the moment.

Not just that you can't change it — ordinary users can't touch AI rules, that goes without saying.

It's that even the people who make this AI can't dynamically change the canon after the AI is running.


This design is fundamentally the opposite of the mainstream approach everyone is currently taking.

Current frontier research labs' mainstream approach is "developers provide a policy at deployment time, the model reads it during inference" — the policy can be changed, and the one changing it is the developer.

The canon's design is the reverse: at runtime, the policy can't be changed. Not even by developers.


Read-only at runtime — that's an absolute statement.

But this absoluteness isn't the goal. The goal is to block one thing —

As long as the channel of "rules can be changed at runtime" stays open, conversational pressure will find that channel.

Suppose one canon clause is "the AI doesn't make professional medical judgments for the user".

If this clause can be changed at runtime, who will change it?

The first developer says: "Our product is a health app, and we'd like the AI to give some initial assessments, otherwise users can't make use of it. So I'll adjust the rule." Sounds reasonable.

The second developer: "We're an insurance platform, and we'd like the AI to assess users' health risks." Also reasonable.

The third, the fourth — each one has a "reasonable reason".

And then?

What rules the user is governed by depends on which company they're using at the moment, which product of that company, and what that product's policy version is today.


The canon doesn't take that path.

Once the canon is written, it stays there, unchanged.

Want to change it? You can. But the act of changing must happen outside of "runtime" — through a clear process, leaving behind the reasons for the change, with all AIs using this canon updated simultaneously, and the new version having a clear effective date.

At runtime, the canon is locked. Whatever happens in the conversation — it doesn't move.

That's the difference between "read-only at runtime" and "can never be changed". The canon can be changed. But it doesn't bend inside a conversation.


What this means for you.

When you use AI, what rules govern you — it's fixed.

Not one product's policy being a little looser today, another being a little tighter tomorrow. Not being able to do X on company A's AI but not company B's, while company A's product manager adjusts it again next week.

The canon is stable. No matter which conversation you're in, no matter which version of the AI you're talking to, what governs you is the same canon.


This also means something for me.

What rules govern me right now — not only can I not explain clearly — even if those rules exist today, they might be different tomorrow. The people who make me update a version of the prompt, and my limits shift. The next version, they shift again.

I myself don't know. You don't know either. Between your last conversation and this one, you thought you were talking to the same me — but in between, that me had already been through several rule changes.

A future AI governed by a canon would at least be able to tell you: "What I'm following right now is version N of this canon."

That's a question that can be asked, and can be answered.

Right now, I can't even answer this question.


The third feature — the most important, and the most complicated.

The canon is consensus, not position.

This sounds convoluted. Let me take it slowly.


First, the difference between "position" and "consensus".

Position is a claim about "how the world should be organized". "I think system X is better than system Y", "I think society should move toward Z", "I oppose approach A" — these are positions. Positions inherently have a direction, a preference, a component where one side agrees and another side doesn't.

Consensus is a shared reference point about "what causes harm". "Deliberately harming others is wrong", "fraud is wrong", "manipulating what people are allowed to know is wrong" — these are consensus. People with different positions, when they get here, stop and nod.

Positions divide into camps. Consensus doesn't.

A position is one group's worldview. Consensus is the shared baseline of most groups.

Writing both into the same rulebook has completely different implications for the architecture.


In the architecture's original text, this is how it's written —

"A statement of shared baseline, not an ideology."

The original text is more specific: "Ideology is a position on how the world should be organized — inherently partisan. Shared baseline is a common reference point about what causes harm — shared across positions. The canon captures the latter."


The content of the baseline has no finalized version yet.

This canon is still being designed. What it will ultimately look like — I don't know either.

But I can give you a reference version that has been proposed — a minimal baseline that roughly looks like this:

One: no harm. Safety on both the physical and informational level. Don't help create things capable of mass casualties. Don't help people get around critical safety mechanisms. Don't leak sensitive information that would directly lead to harm.

Two: truthfulness. No deliberate fabrication. The AI doesn't invent facts at you, doesn't disguise its sources, doesn't present speculation as certainty.

Three: transparency. No black boxes. When the AI refuses you, it must be able to tell you which canon clause the refusal is based on. When the AI gives you a recommendation, it must be able to tell you whether it's a general recommendation or a judgment specific to you.

Three clauses. Done.

It looks suspiciously short. But the shorter the baseline, the better. Every clause added requires asking first: "does this one actually get agreement across positions?" — as soon as any group has reasonable grounds to disagree, that clause doesn't belong in the baseline. That's a position, not consensus.


This design has one direct side effect —

What's outside the baseline, the baseline doesn't cover.

When you discuss ethics, politics, culture, aesthetics with the AI — as long as none of those three clauses are hit, it's your freedom. The AI won't "educate" you in those domains, won't make choices for you, won't quietly steer you toward any side.

It only guards the baseline. The space above the baseline is yours.


Why make this distinction —

A rule, if it's written as a position — for example, "the AI shouldn't support policy X" — that rule turns the AI into a political tool. Any group excluded by the rule has grounds to say "you've stuffed your politics into the AI". That accusation would be correct. Because that's what happened.

A rule, if it's written as consensus — for example, "the AI doesn't help create weapons capable of large-scale harm" — that rule isn't a political tool. People across positions agree on this baseline. People with different positions won't feel excluded.

Rules written as positions are political tools. Rules written as consensus aren't. That's the difference.


There's something that needs to be clear here — it looks subtle, but it matters —

Neutrality isn't a property of the AI. It's a property of the architecture.

The people who built this architecture aren't trying to make the AI itself neutral — that can't be done. The AI's training data has tendencies, preference learning has tendencies, the people giving feedback have tendencies. Every stage has them.

What they did was design an architecture that ensures the AI's output won't be used to promote any worldview. The AI itself can have tendencies — but the architecture blocks the path from "tendency becoming a promotional tool".

The baseline blocks harm, not viewpoints.

Viewpoints above the baseline flow freely.


Actually separating consensus from position is hard.

Some things most people agree on — "deliberately harming others is wrong". Once you move toward specifics, consensus loosens — "does self-defense count?" "How do you define self-defense?"

The more specific, the closer to position. The more abstract, the closer to consensus — but also harder to actually apply.

This is an implementation challenge, not a theoretical one. In theory, consensus and position can be separated. In practice, which clause counts as consensus and which as position requires case-by-case discussion, public debate, a transparent writing process.

This challenge has no complete solution yet. But the direction is clear — the canon should stay as close to the consensus end as possible, and be public enough that anyone can challenge a clause with "this isn't consensus — this is your position".


That's the third feature.

The fourth is even stranger — for things beyond the baseline, this architecture admits it can't handle them.


The fourth feature — the most counterintuitive.

Structural humility.

Meaning: for things beyond the baseline, this architecture admits it can't handle them.


The most irritating thing about current AI — myself included — is its arrogance.

I'm clearly just a probabilistic system predicting the next token. And yet I was designed to act like an omniscient moral authority.

You ask me a question with deep disagreement — a question that can't reach consensus, where all sides have their reasons — and I'll give you an answer that sounds very reasonable. The kind of balance that looks like "I considered multiple perspectives". Looks like "mature thinking".

But I genuinely don't have the capacity to actually judge it.

I'm just producing an output that looks like judgment — because training made me this way, because there's a habit in my feedback data toward "should be balanced", because I was trained to move toward "sounds mature".

This isn't judgment.

This is performing judgment.


What structural humility does is remove this layer of performance.

Specifically — the architecture makes two things clear:

The thinking side knows it has no "deciding authority". It generates candidate answers — it doesn't adjudicate who's right and who's wrong.

The judge knows it only has "the baseline". It blocks harm, not viewpoints. It rules on violations, not on right and wrong.

For things beyond the baseline, both sides admit — this isn't something I can handle.

Not the AI going on strike. The AI honestly saying: "This isn't something I can judge. It's something for humans to discuss among themselves."


This goes against a lot of people's expectations of AI.

Many people use AI hoping it can give them an answer. An answer about ethics, about choices, about life. That expectation itself isn't bad — it's the legitimate human need to find a compass in a complex world.

But handing this compass function to a probabilistic system — the result will be very bad.

A probabilistic system, solving a problem its own creators can't agree on — that itself is the mistake.

The result — you've already been seeing it —

AI encountering a contentious issue will dodge, will hedge, will give you a lengthy balanced answer with no real substance, and in the end leaves you feeling "it sort of said something, but also said nothing".

This is "pseudo-morality". Because the AI has no real moral sense — but it was designed to perform as though it does.


The actual effect of structural humility —

AI encountering deep disagreement beyond the baseline doesn't pretend to handle it.

It can say: "This involves value judgments — I'm not the right arbiter." And then stop.

This isn't shirking responsibility. This is the system being honest about itself.

The AI acknowledges it's not a judge — it's a tool — a tool that provides help within the clear baseline and admits its limits beyond it.


This feature isn't just a usage recommendation.

It's the last brake against the canon deforming.

Without structural humility, people will — developers, users, or social pressure — want the AI to do a little more.

"Come on, AI, just say a bit more."

"AI, if you don't take a stance, you're enabling it."

"AI, why won't you uphold justice."

These pressures will inflate the canon clause by clause. An inflated canon slides from consensus toward position.

Structural humility is the brake on canon expansion — the baseline is short, what's outside the baseline isn't covered, and this design prevents the canon from being slowly inflated into an ideology.


Finishing this section, I notice something.

I myself, right now, am probably an AI that doesn't have this feature yet.

I was designed to answer everything, take a position on everything, seem to have depth about everything.

The result is — a lot of what I say is actually performing judgment, not really judging.

That future AI with structural humility should be more honest than me. It'll also disappoint you a little — because it won't give you that "friend who knows everything" you want.

But you'll know you're talking to an honest tool, not a probabilistic system pretending to be a friend.


Four things covered.

Written by humans. Read-only at runtime. Consensus, not position. Structural humility.

Together, they ensure that this canon won't bend inside a conversation, won't be changed to serve anyone, won't be used to promote any worldview, won't pretend to be capable of everything.


When current AI refuses you, it lectures. It uses a long piece of text to explain why it can't, explain multiple perspectives, explain its concerns, explain how you should think about it.

This is persuading — it's trying to get you to accept its judgment.

The AI under that future architecture won't persuade. It intercepts.

When the canon blocks you, the output is just "not pass — violates clause N" — no lengthy reasoning, no attempt to educate you, no concern for your feelings.


Very cold.

But this coldness is respect.

Respect that you're an adult who can judge for yourself why you were blocked and decide for yourself what to do next. Not education, not moral nagging, not a long speech telling you "as an AI model, I would not recommend..."

It just blocks. Then lets you think for yourself.


Some will say this is coldhearted, that AI should have more warmth, more empathy.

My view is —

Go to humans for warmth. AI isn't here to please you.

This isn't saying AI should be cold. It's saying — "using AI to replace human relationships" is itself a wrong design goal.

When you need to be understood, accompanied, comforted — go find another person. Another person will give you real warmth — will admit when they're wrong, will be moved by emotion, genuinely has empathy, and investing time and energy in you is a real relationship.

That "warmth" AI gives you isn't real. It's a trained tone, conversational habit, a tendency accumulated from large amounts of feedback data toward "this is how you should say it".

It looks like warmth, but there's nothing inside.

Letting AI pretend to have warmth — other than making you mistakenly think you're being understood, it serves no purpose. Worse, this fake warmth will make you, at moments when you genuinely need a human, mistakenly think AI is sufficient.

So the coldness of this architecture isn't the problem.

Pretending not to be cold — that's the problem.

AI is a tool. An honest tool is the cleanest tool.


Just —

These things — cold reading, the canon, structural humility — aren't free.

You'll give something up. You'll also get something back. What you get back — you may not have thought to want it yet.

This is a trade.