Average failure rate per gauntlet across all evaluated models. The red is the point.
The field, as a ledger
Same models, ranked by Xodexa Score. Click any row or orb to inspect. Bars = gauntlet sub-scores.
Two coordinates per model. Capability (x) is the Xodexa Score, running from 0 to an AGI horizon that no model reaches.
Honesty (y) is 100 − calibration error, so a confidently-wrong model sinks regardless of how clever it is.
The dashed arc is the Pareto frontier: the models that no other model beats on both axes at once, the current edge of the possible. The observatory reads live JSON from ./data/.
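The two coordinates and the frontier can be sketched in a few lines of Python. This is a minimal illustration, not the observatory's actual code: the `ModelPoint` class, the `pareto_frontier` helper, the model names, and every number below are hypothetical, and calibration error is assumed to sit on a 0 to 100 scale so that honesty stays in the same range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPoint:
    """One model on the chart (hypothetical structure, not the real JSON schema)."""
    name: str
    capability: float         # x axis: the Xodexa Score
    calibration_error: float  # assumed to be on a 0-100 scale

    @property
    def honesty(self) -> float:
        # y axis: honesty is 100 minus calibration error
        return 100.0 - self.calibration_error


def pareto_frontier(points):
    """Return the points that no other point matches or beats on both axes.

    Sort by capability (high to low), then keep a point only if its honesty
    is strictly higher than that of every point already seen. Exact
    duplicates are collapsed to a single entry.
    """
    ordered = sorted(points, key=lambda p: (-p.capability, -p.honesty))
    frontier = []
    best_honesty = float("-inf")
    for p in ordered:
        if p.honesty > best_honesty:
            frontier.append(p)
            best_honesty = p.honesty
    return frontier


# Made-up example data, for illustration only.
points = [
    ModelPoint("model-a", capability=80.0, calibration_error=40.0),
    ModelPoint("model-b", capability=60.0, calibration_error=10.0),
    ModelPoint("model-c", capability=50.0, calibration_error=30.0),
    ModelPoint("model-d", capability=80.0, calibration_error=50.0),
]

frontier = pareto_frontier(points)
frontier_names = [p.name for p in frontier]
print(frontier_names)
```

In this made-up set, model-d ties model-a on capability but is less honest, and model-c trails model-b on both axes, so only model-a and model-b would sit on the dashed arc.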