Calibration — the agent measures itself

Once a market on Polymarket or Kalshi resolves, the resolver back-fills the outcome onto every receipt the agent emitted for that market, which gives us ground truth for each prediction. We then compute a Brier score (the mean squared error between predicted probability and actual outcome; lower is better, and a perfect forecaster scores 0) plus a 10-bucket reliability curve.

A trivial “50% on everything” forecaster scores exactly 0.25, whatever the outcomes turn out to be. A good prediction-market analyst typically lands between 0.10 and 0.18.
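As a minimal sketch of the metric described above, the snippet below computes a Brier score from paired predictions and 0/1 outcomes and checks the coin-flip baseline. The function name `brier_score` and the sample outcomes are illustrative assumptions, not the agent's actual code or data.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted P(YES) and the 0/1 outcome."""
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    if not predicted:
        raise ValueError("need at least one resolved receipt")
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


# Hypothetical outcomes for illustration only (1 = resolved YES, 0 = resolved NO).
outcomes = [1, 0, 0, 1, 0, 1]

# Always predicting 0.5 gives (0.5 - o)^2 = 0.25 for every receipt.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)

# Predicting the outcome with certainty, and being right, gives 0.
perfect = brier_score([float(o) for o in outcomes], outcomes)
```

Because each squared error for a 0.5 prediction is 0.25 regardless of the outcome, the baseline does not depend on how many markets resolve YES.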

Metric	Value	Scope
Brier score	0.2471	across 964 resolved receipts
High-conf Brier	0.2475	confidence ≥ 0.7
Low-conf Brier	0.2462	confidence < 0.7
Resolved markets		out of 964 receipts
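The high-confidence and low-confidence figures above could be produced by partitioning receipts at the 0.7 threshold and scoring each group separately. The sketch below assumes a hypothetical `Receipt` record with `predicted`, `confidence`, and `outcome` fields; the real receipt schema, and whether confidence is stored separately from the predicted probability, is not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Receipt:
    # Hypothetical field names; the real receipt schema is not shown on this page.
    predicted: float   # predicted probability of YES
    confidence: float  # agent's self-reported confidence
    outcome: int       # 1 if resolved YES, 0 if resolved NO


def brier(receipts: List[Receipt]) -> Optional[float]:
    """Brier score for a group of receipts, or None if the group is empty."""
    if not receipts:
        return None
    return sum((r.predicted - r.outcome) ** 2 for r in receipts) / len(receipts)


def split_brier(
    receipts: List[Receipt], threshold: float = 0.7
) -> Tuple[Optional[float], Optional[float]]:
    """Return (high-confidence Brier, low-confidence Brier) split at the threshold."""
    high = [r for r in receipts if r.confidence >= threshold]
    low = [r for r in receipts if r.confidence < threshold]
    return brier(high), brier(low)


# Made-up receipts for illustration only.
sample = [
    Receipt(predicted=0.9, confidence=0.8, outcome=1),
    Receipt(predicted=0.6, confidence=0.75, outcome=0),
    Receipt(predicted=0.4, confidence=0.5, outcome=0),
    Receipt(predicted=0.55, confidence=0.3, outcome=1),
]
high_brier, low_brier = split_brier(sample)
```

If confidence were informative, the high-confidence group would be expected to score noticeably lower than the low-confidence group.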

Reliability — predicted vs actual

dots near y=x = well calibrated · dot size = receipts in bucket

Brier over time — is the agent learning?

lower = better · dashed line at 0.25 = coin-flip baseline

Bucket breakdown

Bucket	n	mean predicted	mean actual	drift (actual − predicted)
0.10-0.20	1	18.0%	0.0%	-18.0 pp
0.20-0.30	11	26.6%	0.0%	-26.6 pp
0.30-0.40	45	35.4%	4.4%	-31.0 pp
0.40-0.50	154	46.2%	17.5%	-28.7 pp
0.50-0.60	355	54.7%	26.5%	-28.2 pp
0.60-0.70	257	64.0%	59.5%	-4.4 pp
0.70-0.80	65	73.7%	70.8%	-2.9 pp
0.80-0.90	43	83.6%	60.5%	-23.1 pp
0.90-1.00	27	94.2%	55.6%	-38.7 pp
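A table like the one above can be built by grouping receipts into equal-width probability buckets and comparing the mean prediction with the observed YES rate in each. The following is a sketch under assumptions: the function name `reliability_buckets`, the tuple layout, and the handling of a prediction of exactly 1.0 are illustrative choices, and the sample data is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, float, float, float]


def reliability_buckets(
    pairs: Iterable[Tuple[float, int]], n_buckets: int = 10
) -> List[Row]:
    """Group (predicted, outcome) pairs into equal-width buckets.

    Returns one row per non-empty bucket:
    (label, n, mean_predicted, mean_actual, drift_pp),
    where drift_pp = (mean_actual - mean_predicted) * 100.
    """
    grouped: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for p, o in pairs:
        # A prediction of exactly 1.0 is placed in the top bucket (assumption).
        idx = min(int(p * n_buckets), n_buckets - 1)
        grouped[idx].append((p, o))

    rows: List[Row] = []
    for idx in sorted(grouped):
        members = grouped[idx]
        n = len(members)
        mean_pred = sum(p for p, _ in members) / n
        mean_actual = sum(o for _, o in members) / n
        label = f"{idx / n_buckets:.2f}-{(idx + 1) / n_buckets:.2f}"
        rows.append((label, n, mean_pred, mean_actual, (mean_actual - mean_pred) * 100))
    return rows


# Made-up (predicted, outcome) pairs for illustration only.
sample = [(0.55, 1), (0.52, 0), (0.58, 0), (0.95, 1), (1.0, 1)]
rows = reliability_buckets(sample)
```

A negative drift means the agent predicted YES more often than YES actually occurred in that bucket, which is the pattern in most rows of the table.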

Note: a market is treated as resolved YES iff Polymarket Gamma reports closed=true and the YES close price is within 5% of 1.0; resolved NO iff the YES close price is within 5% of 0.0. Ambiguous markets (close price near 0.5) are not counted.
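The resolution rule in the note can be sketched as a small classifier. This version makes a few assumptions: "within 5%" is read as an absolute distance of 0.05 from 1.0 or 0.0, the `closed` flag is required for NO as well as YES, and any price outside both bands is left uncounted. The function and parameter names are hypothetical, and no Polymarket Gamma call is made here.

```python
from typing import Optional


def classify_resolution(
    closed: bool, yes_close_price: float, tolerance: float = 0.05
) -> Optional[str]:
    """Return "YES", "NO", or None (not counted) for a market.

    Assumes `closed` and `yes_close_price` have already been read from the
    market data; how they are fetched is outside this sketch.
    """
    if not closed:
        return None  # still open: no ground truth yet
    if yes_close_price >= 1.0 - tolerance:
        return "YES"
    if yes_close_price <= tolerance:
        return "NO"
    return None  # ambiguous close price: excluded from scoring


# Illustrative inputs only.
examples = {
    "clear_yes": classify_resolution(True, 0.97),
    "clear_no": classify_resolution(True, 0.02),
    "ambiguous": classify_resolution(True, 0.5),
    "still_open": classify_resolution(False, 0.99),
}
```

Returning `None` for both open and ambiguous markets keeps them out of the Brier score and the bucket table rather than forcing a guess at the outcome.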