NL2SQL accuracy hit 80% on a 30-question benchmark (previous post). Moving the same model to a public benchmark (BIRD Mini-Dev) produced completely different numbers. BIRD is the standard NL2SQL benchmark: 500 questions across 11 database domains.

**80% → 56%**

The model worked well on the retail domain I trained on, but on an unseen domain it was barely half. I ran 9 experiments to push past that 56%. (Honestly, I thought it would come quickly.)

## The one-line conclusion

More SQL generation attempts wasn't the answer. Getting it right on the first try was.

## Why try the multi-candidate approach

Instead of asking the LLM to generate SQL once, generate several and pick the best one. This is already the dominant direction in NL2SQL research:

- Self-consistency (generate the same query multiple times, pick by majority vote)
- Execution-based selection (filter candidates based on actual execution results)
- Multi-agent pipelines (multiple specialized agents collaborate to generate SQL)

Four of the top five systems on the BIRD leaderboard use this approach. But the leaderboard wasn't what drove my choice. Two assumptions did:

1. Single-shot LLM output has an accuracy ceiling.
2. Splitting into exploration + selection can break through it.

## 1. More hints to the LLM (4 failures)

**Result: No effect on BIRD. In some cases, the hint overrode a correct answer.**

The hypothesis: give the LLM better hints, and accuracy will improve. For example:

- Which aggregation function to use (AVG/SUM/COUNT)
- Which column maps to "revenue"
- A list of candidate columns for the question

The outcome:

- Own 30 questions: held steady
- BIRD: no change
- In some cases it got worse

The LLM was already correctly choosing `products.price`, and the hint switched it to `order_items.unit_price`. The hint overrode the correct answer.

60% of failures weren't about the wrong column. They were about the wrong SQL pattern. For example:

- `SUM(CASE WHEN ...)`
- `COUNT(CASE WHEN ...)`

Both are valid SQL, but NULL handling differences produce different results. Hints can fix column selection.
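The NULL-handling difference between the two patterns above can be shown concretely. This is a minimal sketch using an in-memory SQLite database; the `orders` table and its values are illustrative, not from the benchmark. When no row matches the condition, `SUM` over a `CASE` with no `ELSE` yields NULL, while `COUNT` yields 0, so a result-set comparison grades them differently.

```python
import sqlite3

# Hypothetical orders table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "paid"), (2, "paid"), (3, "shipped")],
)

# Pattern A: SUM over a CASE with no ELSE returns NULL when nothing matches.
sum_result = conn.execute(
    "SELECT SUM(CASE WHEN status = 'refunded' THEN 1 END) FROM orders"
).fetchone()[0]

# Pattern B: COUNT ignores NULLs, so it returns 0 instead.
count_result = conn.execute(
    "SELECT COUNT(CASE WHEN status = 'refunded' THEN 1 END) FROM orders"
).fetchone()[0]

print(sum_result)    # None
print(count_result)  # 0
```

Both queries are valid and both "answer" the question, which is why a column-level hint cannot steer the model between them.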
They can't change which SQL pattern the LLM prefers.

## 2. Generate multiple candidates and pick one (3 failures)

**Result: Generating more candidates didn't change the outcome. The first SQL generated was always the one that got used.**

Settings: k=3, temperature=0.3, result-based selection. Accuracy: 56%, no change.

| Metric | Value |
| --- | --- |
| 3 candidates converging to the same result | 92% |
| First candidate selected | 100% |

The selector never overrode the first candidate. Not once.

### The "+8pp" that wasn't real

It looked like 48% → 56%, a +8pp gain. Re-grading the old baseline with the current grader showed it was already 56%: the grader logic had drifted. That got a grader drift guard baked in as permanent infrastructure.

### Raising diversity

- Temperature raised to 0.8
- 5 prompt variations added

The SQL text varied; the execution results were identical. On 62% of questions, all 5 candidates returned the same result. There is one correct answer, and raising the temperature and varying the prompts didn't change the execution results.

## 3. Force diversity at the system level (3 failures)

**Observation: Even with forced candidates, the selector can't pick the right one. Execution results alone provide no signal.**

If the LLM won't produce diversity on its own, force it.

### Experiment 1: forced column binding

- Forcing the correct column → 60% pass
- Forcing the wrong column → 0% pass

Schema binding determines accuracy.

### Experiment 2: selector validation

Given 4 candidates (correct / obviously wrong / plausibly wrong / subtly different), can the selector pick the right one? 28.6%, effectively random.

### The most telling case

The actual value stored in the DB was the Czech `VYBER` (cash withdrawal). The LLM had no way to know this value existed, so all 4 candidates used English in the WHERE clause (`cash withdrawal`), and none of them matched anything in the DB. Result:

- All 4 returned empty results (0 rows)
- The selector saw 4 identical results and called it "consensus"
- Wrong answer selected

If the right answer never appears, consensus is meaningless.
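The failure mode above can be reproduced in a few lines. This is a sketch of execution-based self-consistency, not my actual pipeline: a toy table stands in for the BIRD financial domain, and the three candidate queries are hypothetical. Every candidate filters on the English phrase, none matches the Czech value actually stored, and the majority vote is unanimous on an empty result.

```python
import sqlite3
from collections import Counter

# Toy table standing in for the BIRD financial domain; values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (id INTEGER, operation TEXT)")
conn.execute("INSERT INTO trans VALUES (1, 'VYBER')")  # Czech for cash withdrawal

# Hypothetical candidates: all guess the English phrase in the WHERE clause.
candidates = [
    "SELECT COUNT(*) FROM trans WHERE operation = 'cash withdrawal'",
    "SELECT COUNT(id) FROM trans WHERE operation = 'cash withdrawal'",
    "SELECT COUNT(*) FROM trans WHERE operation LIKE 'cash withdrawal'",
]

def execute(sql: str) -> tuple:
    """Run a candidate and return its result set as a hashable key."""
    return tuple(conn.execute(sql).fetchall())

# Execution-based self-consistency: group candidates by result, take the majority.
votes = Counter(execute(sql) for sql in candidates)
winning_result, count = votes.most_common(1)[0]

print(winning_result, count)  # ((0,),) 3 -- unanimous "consensus" on a wrong answer
```

The vote is 3 out of 3, which looks like the strongest possible signal, yet every candidate agrees on the same wrong answer. Consensus measures agreement, not correctness.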
### Experiment 3: score-based selector

Instead of execution results alone, V2 scored each candidate on a mix of signals: value distribution, column count, row count, and more. The highest-scoring SQL got selected.

- V1: 40%
- V2: 33%

More sophistication made it worse. qid 819 shows why: V2 gave the correct SQL 55 points and the obviously wrong SQL 75 points. The wrong answer had the higher score, so the wrong answer won. Execution results alone cannot tell you which SQL is more correct.

## Ruled out, remaining

All three directions are closed:

- Hint injection → failed
- LLM-driven multi-candidate → failed
- Execution-based selection → failed

### The common cause

The problem isn't the selection stage. It's the stage before generation. Not "pick the best answer after execution": the right tables and columns have to be locked in before execution even starts.

### The new direction

One direction remains: strengthen schema understanding. Get it right on the first try.

Next hypothesis: Schema Binding Plan. Don't generate SQL directly. Instead:

1. First output a JSON plan specifying tables / columns / join conditions
2. The system validates the plan
3. Then generate SQL

The forced-binding validation in the third round of experiments confirmed it: when binding is forced, the LLM follows it 100% of the time. The problem was never SQL generation. It was schema interpretation.

## What the 9 failures left behind

The experiments failed. The infrastructure didn't.

- **Grader drift guard.** Keeps past results comparable even as the grader logic evolves. Without it, this experiment would have been logged as a "+8pp success."
- **Signal classifier.** Rates each question on a four-level scale (STRONG to MISLEADING): is there a detectable signal pointing to the right answer? Separates "the selector is weak" from "there was no signal to detect."
- **Forced binding verification code.** Automatically checks whether the SQL the LLM generated actually uses the column it was told to use (via the sqlglot SQL parser). Reusable as-is for the next schema grounding experiments.
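The plan-validation step of the Schema Binding Plan can be sketched roughly as follows. The post doesn't show the actual JSON format, so `SCHEMA`, the plan shape, and `validate_plan` are all assumptions for illustration; the point is only that a bad binding is caught before any SQL is generated.

```python
import json

# Hypothetical schema catalog; in practice this would be read from the database.
SCHEMA = {
    "products": {"id", "name", "price"},
    "order_items": {"order_id", "product_id", "unit_price"},
}

def validate_plan(plan_json: str) -> list:
    """Check a binding plan against the schema; return a list of problems."""
    plan = json.loads(plan_json)
    errors = []
    for table in plan["tables"]:
        if table not in SCHEMA:
            errors.append(f"unknown table: {table}")
    for col in plan["columns"]:
        table, _, column = col.partition(".")
        if column not in SCHEMA.get(table, set()):
            errors.append(f"unknown column: {col}")
    return errors

# A plan the model might emit before writing any SQL (second column is bogus).
plan = json.dumps({
    "tables": ["products"],
    "columns": ["products.price", "products.revenue"],
    "joins": [],
})

print(validate_plan(plan))  # ['unknown column: products.revenue']
```

The design choice is that rejection happens on the plan, not on the finished SQL, so the model can be re-prompted with a concrete error before committing to a query.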
One more piece of infrastructure: a **stop criteria / experimental design framework.** Lock in "if this threshold isn't met, stop" before running a benchmark, and use small spot checks to make fast directional decisions.

And one more lesson: SOTA pointing in one direction doesn't mean it's the right direction for my problem. After the experiments, I ran deep research sessions with Claude, ChatGPT, and Gemini. Of 10 suggestions, 8 directly conflicted with already-closed directions. Without the data from 9 experiments, I would have followed them.

## Closing

In post 8 I wrote "80% is the start." That 80% was domain-specific. On an unseen domain, it's 56%. That's the real starting point.

The next post reports where that number moves once the Schema Binding Plan is in.