The Shape of Good Behavior
Reward function geometry and the boundaries of aligned behavior.
Why "Number Go Up" breaks AI alignment, and what reward geometry can do about it. Sheaf-theoretic reward spaces, boundary behavior, and what both buy us for detecting alignment faking.