I implemented the router (previous post) and ran a 30-question benchmark. EX (Execution Accuracy): 66.67%, 20 out of 30. Four measure-and-fix loops later: 80%.

The method was simple. Run 30 questions, classify the failures, fix a few high-impact ones. Then run it again. I did this four times.

Starting point: every query goes through the LLM

At Phase 0.5, Vanna was trained on the DDL and a Glossary YAML with 25 business terms. The PoC DB was an e-commerce sample: 21 tables, around 230k rows. I built a 30-question test set against it and measured EX for the first time: 66.67%. 20 out of 30.

The problem was this. Easy queries and hard queries all went through the LLM. “What’s this month’s total revenue?” (trivial aggregation) and “Give me quarterly YoY growth” (complex) were both generated end-to-end by Vanna. LLM-generated SQL isn’t consistent. The same question produces slightly different results each time. There’s no reason to burn the LLM on queries whose patterns are obvious.

So I built a QueryRouter.

Cycle 1: QueryRouter

QueryRouter classifies queries into three categories.

  • DETERMINISTIC: queries with fixed patterns. “Sum of total revenue”, “order count by category” — things a SQL template can solve. No LLM, just generate SQL directly
  • HYBRID: queries that include a glossary term but have complex conditions. Rules draft the SQL, the LLM validates it
  • PROBABILISTIC: free-form queries that don’t match the above. Hand it to Vanna

It’s an MVP, so the classifier isn’t fancy. Keyword matching and regex. If keywords like “sum”, “total”, “average” appear, it becomes a DETERMINISTIC candidate, and the router checks whether a registered glossary term is in the query.
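As a rough sketch of that three-way routing (the keyword list, glossary terms, and function name here are illustrative, not the actual implementation):

```python
# Naive keyword/regex router, as described above. Lists are illustrative;
# the real ones come from the Glossary YAML.
DET_KEYWORDS = ("sum", "total", "average", "count")
GLOSSARY_TERMS = ("매출", "revenue", "orders")  # registered glossary terms

def route(query: str) -> str:
    q = query.lower()
    # naive substring match; "sum" also hits "summer" -- good enough for an MVP
    has_keyword = any(k in q for k in DET_KEYWORDS)
    has_term = any(t in q for t in GLOSSARY_TERMS)
    if has_keyword and has_term:
        return "DETERMINISTIC"   # fixed pattern -> SQL template, no LLM
    if has_term:
        return "HYBRID"          # rules draft the SQL, the LLM validates
    return "PROBABILISTIC"       # free-form -> hand it to Vanna
```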

Ran 30 questions. EX 70%. +3.33%p. Looks tiny on the number line, but there were two meaningful changes underneath.

One is the synonym recognition rate, which jumped from 33% to 67%. When someone asked for “Revenue” instead of “매출” (the Korean glossary term for revenue), the previous system would miss it. The Router picks it up from the glossary’s synonym list.
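A minimal sketch of that synonym lookup (the glossary shape and names here are assumptions):

```python
# canonical term -> synonym set, as loaded from the Glossary YAML (shape assumed)
SYNONYMS = {
    "매출": {"revenue", "sales", "매출액"},
    "쿠폰": {"coupon", "voucher"},
}

def canonicalize(query: str) -> list:
    """Return the canonical glossary terms mentioned in a query."""
    q = query.lower()
    return [term for term, syns in SYNONYMS.items()
            if term in query or any(s in q for s in syns)]
```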

The other is P95 latency (95th percentile: the time under which 95% of queries finish), which dropped from 26 seconds to 3.3 seconds, an 87% cut. The DETERMINISTIC path skips the LLM and runs the SQL template directly. Cheaper and faster.

But an unexpected problem showed up here.

The “top 10” trap: Fake Determinism

Of the 20 queries routed as DETERMINISTIC, 7 were wrong. A 35% error rate.

A representative case: “Top 10 customers by coupon usage.” The router saw “top” (상위) and dispatched it to the HIERARCHY_ANCESTORS template — the template that walks up a hierarchy. But the actual intent was ORDER BY coupon_count DESC LIMIT 10. It’s a ranking query, not a hierarchy walk.

The word “top” (상위 in Korean) has two meanings. “Top category” (상위 카테고리) refers to an upper node in a hierarchy. “Top 10 customers” (상위 10명) refers to a ranking. Keyword matching can’t distinguish these.

The dangerous part is that this is worse than LLM fallback. An LLM reads the context and has a chance of saying “ah, this is a ranking” and generating the correct SQL. But once it’s routed as DETERMINISTIC, the template executes immediately and the wrong answer goes out unverified. It’s being deterministically wrong. I shared the result with ChatGPT and it named the pattern “Fake Determinism.” Fitting name.

Cycle 2: top-N patch + sanity check

I shipped three patches together.

First, top-N patterns got a regex exclusion. If a ranking expression matches first, the query no longer goes to HIERARCHY; it falls through to HYBRID or PROBABILISTIC.

TOP_N_PATTERNS = [
    r"상위\s*\d+",       # "top 10" in Korean
    r"top\s*\d+",        # "top 5"
    r"가장\s*(많|높|큰)", # "most / highest / largest"
]

That simple rule alone brought the Fake Det Rate from 35% down to 20%.
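For completeness, the gate over those patterns is just a scan (patterns restated so the snippet is self-contained; `is_top_n` is my name for it, not the actual function):

```python
import re

TOP_N_PATTERNS = [
    r"상위\s*\d+",         # "top 10" in Korean
    r"top\s*\d+",          # "top 5"
    r"가장\s*(많|높|큰)",   # "most / highest / largest"
]

def is_top_n(query: str) -> bool:
    """True if the query is a ranking ("top N"), not a hierarchy walk."""
    return any(re.search(p, query, re.IGNORECASE) for p in TOP_N_PATTERNS)
```

Note that “상위 카테고리” (top category, the hierarchy sense) has no number after it, so none of the patterns fire and the query can still route to HIERARCHY.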

Next, a sanity check. After the DETERMINISTIC path runs, if the result looks abnormal we retry once via LLM fallback. “Abnormal” has two criteria: zero rows, or a NULL ratio above 70%.

if det_result.row_count == 0 or det_result.null_ratio > 0.7:
    return await self._execute_probabilistic(query)  # one retry only

The fallback runs exactly once, so there is no infinite recursion. Whatever PROBABILISTIC returns, right or wrong, goes out as-is.

Then few-shot exemplar reinforcement. Classifying the failures by error type gave me 6 wrong_mapping (table/column mapping errors) and 2 wrong_formula (calculation errors). I added SQL examples for each type into Vanna’s training data: DATE_TRUNC patterns for “this month’s revenue,” the net-revenue formula that subtracts discounts and returns, and so on.
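If the Vanna instance is `vn`, question/SQL pairs go in through its `train()` API. The pairs below are representative of the two buckets, not the actual training set; column names are assumptions:

```python
# (question, SQL) exemplars targeting the two dominant error buckets (illustrative)
EXEMPLARS = [
    # wrong_mapping: anchor "this month" to a DATE_TRUNC pattern
    ("What is this month's total revenue?",
     "SELECT SUM(amount) FROM orders "
     "WHERE DATE_TRUNC('month', order_date) = DATE_TRUNC('month', CURRENT_DATE)"),
    # wrong_formula: net revenue subtracts discounts and returns
    ("What is this month's net revenue?",
     "SELECT SUM(amount - discount_amount - refund_amount) FROM orders "
     "WHERE DATE_TRUNC('month', order_date) = DATE_TRUNC('month', CURRENT_DATE)"),
]

def register_exemplars(vn):
    # Vanna accepts question/SQL pairs via train()
    for question, sql in EXEMPLARS:
        vn.train(question=question, sql=sql)
```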

Result: EX 76.67%. +6.67%p. HYBRID dispatches went from 1 to 6. The router was moving in the “when in doubt, send it to the LLM” direction, which is what I wanted.

Cycle 3: the is_active trap, and why gold SQL can also be wrong

One pattern kept showing up in wrong_mapping. When querying the customers table, the generated SQL was missing is_active=true. Once inactive customers (churned, dormant) are included, aggregates drift. Adding more few-shot examples didn’t fix it.

This isn’t “knowledge” — it’s “policy.” “We only count active customers” is a business rule, not something the LLM should be inferring from context. So I wrote a hard rule using sqlparse: parse the SQL, and if a table with an is_active column appears in the query, auto-inject the WHERE condition.

Not a string replacement — AST-level (Abstract Syntax Tree, the parsed grammar tree of the SQL) handling. Subqueries are left alone, and you don’t end up with conditions pinned after GROUP BY by accident. The rule runs right after SQL generation and right before execution, across all DET/HYBRID/PROB paths.

Enabled the is_active rule, ran the benchmark. EX dropped to 56.67%.

I thought I misread the numbers at first. 8 cases flipped from PASS to FAIL, all for the same reason: the gold SQL didn’t have is_active=true. The system was injecting the filter according to the business policy, but the reference answers were written without that policy, so strict comparison flagged them as mismatches.

Confirmed something here: the evaluation criterion itself can be wrong. The BIRD benchmark has similar cases — strict scoring agrees with human expert judgment only 62% of the time, and the remaining 38% are mostly false negatives where correct SQL gets marked wrong.

Cycle 3.1: fixing the gold SQL → 80%

The fix was simple. Added is_active=true to the 8 gold SQL statements. Aligned the policy between the system output and the reference.

Re-measured: EX 80.00%. Hit the 80% target.

Summary from Phase 0.5 to here:

Version     EX        Key change                              Failures
Phase 0.5   66.67%    Vanna + Glossary RAG                    10/30
v1          70.00%    QueryRouter added                       9/30
v2          76.67%    top-N fix + sanity check + few-shot     7/30
v3.1        80.00%    is_active hard rule + gold alignment    6/30

+13.33%p cumulative. Easy difficulty is at 100%, Fake Det Rate dropped from 35% to 13.3%. Hard difficulty sits unchanged at 37.5% — that’s a formula/JOIN complexity issue, an area that needs DataHub synonym expansion and more exemplars.

What I learned

Don’t leave business policy to the LLM. I tried teaching the is_active filter via few-shot multiple times. It didn’t stick. It’s not a probabilistic problem. A single sqlparse-based hard rule beat stacking more few-shot examples. “When querying customers, is_active=true is mandatory” is policy, not knowledge, and policy should be enforced by the system.

Evaluation criteria need design too. If the gold SQL doesn’t reflect business policy, a correct improvement can lower the score. When building the test set, I needed a step to check “does the gold SQL reflect our system’s policy?” Only 8 cases this time, but with hundreds of test items, catching this kind of mismatch later becomes much harder.
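That check is automatable. A crude lint over the gold set might look like this (`required_filters` encodes the policy; names are mine):

```python
def lint_gold_sql(gold_sql: str, required_filters: dict) -> list:
    """Flag gold SQL that touches a policy table but omits its mandatory filter."""
    sql = gold_sql.lower()
    return [f"{table}: missing '{predicate}'"
            for table, predicate in required_filters.items()
            if table in sql and predicate not in sql]
```

Running it over the test set before the first benchmark would have surfaced those 8 mismatches up front instead of mid-cycle.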

Change one thing at a time, then measure. Every time — adding the router, adding is_active — I ran the same 30 questions and looked at the delta. Without this I couldn’t have told which change improved things and which broke them. Especially for is_active: if I hadn’t isolated the variable, I might have concluded “EX dropped, so the is_active rule is wrong.” The rule was actually fine; the gold SQL was the issue. Freezing a baseline and moving one variable at a time feels tedious, but it’s what makes later analysis possible.

Fake Determinism is a structural limit of rule-based classifiers. Keyword matching can’t distinguish whether “top” means hierarchy or ranking. I covered it with exceptions like the top-N exclusion, but once these exceptions pile up, the rules get messy and unmaintainable. I may have to revisit an LLM classifier in Phase 2. Parked for now.

Error buckets show the direction. If you leave wrong queries as “wrong,” there’s no way to know what to fix. Classifying them as wrong_mapping, wrong_formula, wrong_aggregation, hallucination made the next action obvious: “6 wrong_mapping, so reinforce mapping exemplars; 2 wrong_formula, so write the formula into the glossary.” Without buckets, “raise accuracy” gives you nowhere to start.
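Mechanically, the bucketing is nothing more than tagging each failure and counting. The question IDs below are made up; the bucket counts match the cycle-2 numbers:

```python
from collections import Counter

# each benchmark failure gets a question id and an error bucket (ids illustrative)
FAILURES = [
    ("q03", "wrong_mapping"), ("q07", "wrong_mapping"), ("q11", "wrong_mapping"),
    ("q14", "wrong_mapping"), ("q19", "wrong_mapping"), ("q22", "wrong_mapping"),
    ("q08", "wrong_formula"), ("q27", "wrong_formula"),
]

buckets = Counter(bucket for _, bucket in FAILURES)
# buckets.most_common() surfaces the biggest bucket, i.e. what to fix first
```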