NL2SQL accuracy hit 80% on a 30-question benchmark (previous post). Moving the same model to a public benchmark (BIRD Mini-Dev) produced completely different numbers. BIRD is the standard NL2SQL benchmark: 500 questions across 11 database domains.

**80% → 56%**

The model worked well on the retail domain I trained on, but on an unseen domain it was barely half. I ran 9 experiments to push past that 56%. (Honestly, I thought it would come quickly.)

## The one-line conclusion

More SQL generation attempts wasn't the answer. Getting it right on the first try was.

## Why try the multi-candidate approach

Instead of asking the LLM to generate SQL once, generate several and pick the best one. This is already the dominant direction in NL2SQL research:

- Self-consistency (generate the same query multiple times, pick by majority vote)
- Execution-based selection (filter candidates based on actual execution results)
- Multi-agent pipelines (multiple specialized agents collaborate to generate SQL)

Four of the top five systems on the BIRD leaderboard use this approach. But the leaderboard wasn't what drove my choice. Two assumptions did:

1. Single-shot LLM output has an accuracy ceiling.
2. Splitting into exploration + selection can break through it.

## 1. More hints to the LLM (4 failures)

**Result: No effect on BIRD. In some cases, the hint overrode a correct answer.**

The hypothesis: give the LLM better hints, and accuracy will improve. For example:

- Which aggregation function to use (AVG/SUM/COUNT)
- Which column maps to "revenue"
- A list of candidate columns for the question

The outcome:

- Own 30 questions: held steady
- BIRD: no change
- In some cases it got worse

The LLM was already correctly choosing `products.price`, and the hint switched it to `order_items.unit_price`. The hint overrode the correct answer.

60% of failures weren't about the wrong column. They were about the wrong SQL pattern. For example:

- `SUM(CASE WHEN ...)`
- `COUNT(CASE WHEN ...)`

Both are valid SQL, but NULL handling differences produce different results. Hints can fix column selection.
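The NULL-handling difference between the two patterns above can be shown concretely. This is a minimal sketch using an in-memory SQLite database; the `orders` table and its values are illustrative, not from the benchmark. When no row matches the condition, `SUM` over a `CASE` with no `ELSE` yields NULL, while `COUNT` yields 0, so a result-set comparison grades them differently.

```python
import sqlite3

# Hypothetical orders table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "paid"), (2, "paid"), (3, "shipped")],
)

# Pattern A: SUM over a CASE with no ELSE returns NULL when nothing matches.
sum_result = conn.execute(
    "SELECT SUM(CASE WHEN status = 'refunded' THEN 1 END) FROM orders"
).fetchone()[0]

# Pattern B: COUNT ignores NULLs, so it returns 0 instead.
count_result = conn.execute(
    "SELECT COUNT(CASE WHEN status = 'refunded' THEN 1 END) FROM orders"
).fetchone()[0]

print(sum_result)    # None
print(count_result)  # 0
```

Both queries are valid and both "answer" the question, which is why a column-level hint cannot steer the model between them.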
They can't change which SQL pattern the LLM prefers.

## 2. Generate multiple candidates and pick one (3 failures)

**Result: Generating more candidates didn't change the outcome. The first SQL generated was always the one that got used.**

Settings: k=3, temperature=0.3, result-based selection. Accuracy: 56%, no change.

| Metric | Value |
| --- | --- |
| 3 candidates converging to the same result | 92% |
| First candidate selected | 100% |

The selector never overrode the first candidate. Not once.

### The "+8pp" that wasn't real

It looked like 48% → 56%, a +8pp gain. Re-grading the old baseline with the current grader showed it was already 56%: the grader logic had drifted. That got a grader drift guard baked in as permanent infrastructure.

### Raising diversity

- Temperature raised to 0.8
- 5 prompt variations added

The SQL text varied; the execution results were identical. On 62% of questions, all 5 candidates returned the same result. There is one correct answer, and raising the temperature and varying the prompts didn't change the execution results.

## 3. Force diversity at the system level (3 failures)

**Observation: Even with forced candidates, the selector can't pick the right one. Execution results alone provide no signal.**

If the LLM won't produce diversity on its own, force it.

### Experiment 1: forced column binding

- Forcing the correct column → 60% pass
- Forcing the wrong column → 0% pass

Schema binding determines accuracy.

### Experiment 2: selector validation

Given 4 candidates (correct / obviously wrong / plausibly wrong / subtly different), can the selector pick the right one? 28.6%, effectively random.

### The most telling case

The actual value stored in the DB was the Czech `VYBER` (cash withdrawal). The LLM had no way to know this value existed, so all 4 candidates used English in the WHERE clause (`cash withdrawal`), and none of them matched anything in the DB. Result:

- All 4 returned empty results (0 rows)
- The selector saw 4 identical results and called it "consensus"
- Wrong answer selected

If the right answer never appears, consensus is meaningless.
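The failure mode above can be reproduced in a few lines. This is a sketch of execution-based self-consistency, not my actual pipeline: a toy table stands in for the BIRD financial domain, and the three candidate queries are hypothetical. Every candidate filters on the English phrase, none matches the Czech value actually stored, and the majority vote is unanimous on an empty result.

```python
import sqlite3
from collections import Counter

# Toy table standing in for the BIRD financial domain; values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (id INTEGER, operation TEXT)")
conn.execute("INSERT INTO trans VALUES (1, 'VYBER')")  # Czech for cash withdrawal

# Hypothetical candidates: all guess the English phrase in the WHERE clause.
candidates = [
    "SELECT COUNT(*) FROM trans WHERE operation = 'cash withdrawal'",
    "SELECT COUNT(id) FROM trans WHERE operation = 'cash withdrawal'",
    "SELECT COUNT(*) FROM trans WHERE operation LIKE 'cash withdrawal'",
]

def execute(sql: str) -> tuple:
    """Run a candidate and return its result set as a hashable key."""
    return tuple(conn.execute(sql).fetchall())

# Execution-based self-consistency: group candidates by result, take the majority.
votes = Counter(execute(sql) for sql in candidates)
winning_result, count = votes.most_common(1)[0]

print(winning_result, count)  # ((0,),) 3 -- unanimous "consensus" on a wrong answer
```

The vote is 3 out of 3, which looks like the strongest possible signal, yet every candidate agrees on the same wrong answer. Consensus measures agreement, not correctness.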
### Experiment 3: score-based selector

Instead of execution results alone, V2 scored each candidate on a mix of signals: value distribution, column count, row count, and more. The highest-scoring SQL got selected.

- V1: 40%
- V2: 33%

More sophistication made it worse. qid 819 shows why: V2 gave the correct SQL 55 points and the obviously wrong SQL 75 points. The wrong answer had the higher score, so the wrong answer won. Execution results alone cannot tell you which SQL is more correct.

## Ruled out, remaining

All three directions are closed:

- Hint injection → failed
- LLM-driven multi-candidate → failed
- Execution-based selection → failed

### The common cause

The problem isn't the selection stage. It's the stage before generation. Not "pick the best answer after execution": the right tables and columns have to be locked in before execution even starts.

### The new direction

One direction remains: strengthen schema understanding. Get it right on the first try.

Next hypothesis: Schema Binding Plan. Don't generate SQL directly. Instead:

1. First output a JSON plan specifying tables / columns / join conditions
2. The system validates the plan
3. Then generate SQL

The forced-binding validation in the third round of experiments confirmed it: when binding is forced, the LLM follows it 100% of the time. The problem was never SQL generation. It was schema interpretation.

## What the 9 failures left behind

The experiments failed. The infrastructure didn't.

- **Grader drift guard.** Keeps past results comparable even as the grader logic evolves. Without it, this experiment would have been logged as a "+8pp success."
- **Signal classifier.** Rates each question on a four-level scale (STRONG to MISLEADING): is there a detectable signal pointing to the right answer? Separates "the selector is weak" from "there was no signal to detect."
- **Forced binding verification code.** Automatically checks whether the SQL the LLM generated actually uses the column it was told to use (via the sqlglot SQL parser). Reusable as-is for the next schema grounding experiments.
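The plan-validation step of the Schema Binding Plan can be sketched roughly as follows. The post doesn't show the actual JSON format, so `SCHEMA`, the plan shape, and `validate_plan` are all assumptions for illustration; the point is only that a bad binding is caught before any SQL is generated.

```python
import json

# Hypothetical schema catalog; in practice this would be read from the database.
SCHEMA = {
    "products": {"id", "name", "price"},
    "order_items": {"order_id", "product_id", "unit_price"},
}

def validate_plan(plan_json: str) -> list:
    """Check a binding plan against the schema; return a list of problems."""
    plan = json.loads(plan_json)
    errors = []
    for table in plan["tables"]:
        if table not in SCHEMA:
            errors.append(f"unknown table: {table}")
    for col in plan["columns"]:
        table, _, column = col.partition(".")
        if column not in SCHEMA.get(table, set()):
            errors.append(f"unknown column: {col}")
    return errors

# A plan the model might emit before writing any SQL (second column is bogus).
plan = json.dumps({
    "tables": ["products"],
    "columns": ["products.price", "products.revenue"],
    "joins": [],
})

print(validate_plan(plan))  # ['unknown column: products.revenue']
```

The design choice is that rejection happens on the plan, not on the finished SQL, so the model can be re-prompted with a concrete error before committing to a query.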
One more piece of infrastructure: a **stop criteria / experimental design framework.** Lock in "if this threshold isn't met, stop" before running a benchmark, and use small spot checks to make fast directional decisions.

And one more lesson: SOTA pointing in one direction doesn't mean it's the right direction for my problem. After the experiments, I ran deep research sessions with Claude, ChatGPT, and Gemini. Of 10 suggestions, 8 directly conflicted with already-closed directions. Without the data from 9 experiments, I would have followed them.

## Closing

In post 8 I wrote "80% is the start." That 80% was domain-specific. On an unseen domain, it's 56%. That's the real starting point.

The next post reports where that number moves once the Schema Binding Plan is in.