9. The Public Benchmark Returned 56%: Nine Experiments and What Got Ruled Out

I hit 80% on my own 30-question benchmark, but only 56% on BIRD Mini-Dev’s 50 public questions. Nine experiments later, I had ruled out the multi-candidate hypothesis from three different angles. The remaining suspects are schema understanding and methodology.

April 19, 2026 · 5 min · Junho Lee

1. Why We're Building DataNexus

“What’s Your VIP Criteria?” This happened during a BI Agent project for a retail company. A business user who was testing the Agent asked, “Show me last month’s VIP customer revenue.” The system spit out a number, but the user didn’t look happy. “Something’s off. I think the VIP criteria are different from what our team uses.” Marketing and CRM defined VIP differently. Revenue had the same problem: depending on whether you meant net revenue (순매출) or gross revenue (총매출), the figure could differ by hundreds of millions of won. ...

February 16, 2026 · 6 min · Junho Lee