Human Benchmarking
Establishing human baselines for process-coupled metrics through large-scale annotation studies, enabling meaningful comparison between human and model reasoning traces.
A key challenge in studying AI reasoning is establishing what “normal” looks like: without a human baseline, there is no reference point for judging whether a model's reasoning trace is unusual. Our human benchmarking work uses the Forma annotation platform to collect structured assessments of reasoning trace quality from human raters.
We focus on inter-rater reliability, demographic variation in assessment patterns, and the relationship between human intuitions about reasoning quality and our quantitative coupling metrics.
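To make inter-rater reliability concrete, here is a minimal sketch of one common agreement statistic, Cohen's kappa, for two raters labelling the same traces. The choice of statistic, the function name `cohens_kappa`, and the "good"/"poor" labels are assumptions made for illustration; nothing here describes the measure or the rating scale actually used in the Forma studies.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items.

    Hypothetical helper for illustration only; not part of any Forma API.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if each rater assigned labels independently,
    # at their own observed label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # here, and this sketch reports it as perfect agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Two hypothetical raters judging four reasoning traces.
rater_a = ["good", "good", "poor", "poor"]
rater_b = ["good", "poor", "poor", "poor"]
kappa = cohens_kappa(rater_a, rater_b)
```

A value of 1 means the raters always agree, and a value near 0 means they agree about as often as chance would predict. Studies with more than two raters or with missing ratings typically need a different statistic, such as Fleiss' kappa or Krippendorff's alpha.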