9. The Public Benchmark Returned 56%: Nine Experiments and What Got Ruled Out

I hit 80% on my own 30-question benchmark, but only 56% on BIRD Mini-Dev’s 50 public questions. Nine experiments later, I have ruled out the multi-candidate hypothesis from three different angles. What’s left is schema understanding and methodology.

April 19, 2026 · 5 min · Junho Lee