Human Benchmarking

A key challenge in studying AI reasoning is establishing what “normal” reasoning looks like. Our human benchmarking work uses the Forma annotation platform to collect structured assessments of reasoning-trace quality from human raters.

We focus on inter-rater reliability, demographic variation in assessment patterns, and the relationship between human intuitions about reasoning quality and our quantitative coupling metrics.
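
As an illustration of what an inter-rater reliability check can look like, the sketch below computes Cohen's kappa for two raters who labeled the same set of reasoning traces. This is a minimal example under stated assumptions: the function name, the "good"/"poor" label set, and the choice of Cohen's kappa are hypothetical and are not taken from the Forma platform or from our actual analysis pipeline.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same items (hypothetical helper)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each one labeled independently according to their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1 - expected)


# Illustrative labels only; not real annotation data.
rater_1 = ["good", "good", "poor", "poor"]
rater_2 = ["good", "poor", "poor", "poor"]
print(cohens_kappa(rater_1, rater_2))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, so the two raters above agree on 3 of 4 traces but score 0.5 rather than 0.75. With more than two raters or ordinal rating scales, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would likely be a better fit.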