The journey of building DataNexus, an ontology-based data agent platform

9. The Public Benchmark Returned 56%: Nine Experiments and What Got Ruled Out

I hit 80% on my own 30-question benchmark, but only 56% on BIRD Mini-Dev’s 50 public questions. Nine experiments later, I had ruled out the multi-candidate hypothesis from three different angles. What’s left is schema understanding and methodology.

April 19, 2026 · 5 min · Junho Lee

8. From 66% to 80% NL2SQL Accuracy: Four Measure-and-Fix Loops

After wiring up the router, I ran a 30-question benchmark and pushed NL2SQL EX (Execution Accuracy) from 66.67% to 80%. Here’s what I fixed across four cycles and where things broke.

April 14, 2026 · 7 min · Junho Lee

7. When a Question Comes In, Who Decides the Routing?

The term definitions are done. But when a user asks a question, who decides whether to search the graph, write SQL, or run a vector search? Things I ran into while designing the router.

April 11, 2026 · 3 min · Junho Lee

6. When You Don't Have to Build Agent Infra Yourself, Harnesses Become Obsolete. What About the Ontology?

Shortly after the Conway leak, Anthropic officially launched Claude Managed Agents. As agent infrastructure gets absorbed into platforms, here’s why DataNexus’s ontology layer remains safe.

April 10, 2026 · 4 min · Junho Lee

5. Automating Metadata Maintenance: Karpathy's LLM Wiki Architecture

RAG starts from scratch every time. Karpathy proposes having the LLM maintain a wiki directly so knowledge accumulates. DataNexus’s ontology catalog needs the same principle to avoid abandonment.

April 5, 2026 · 4 min · Junho Lee

Design guides for data warehouse modeling

4. Super-Sub Types — Can a Customer Be Both Individual and Corporate?

Super-sub types clarify business classifications at the logical model level. When converting to a physical model, three options emerge — and in a DW, that choice reshapes the entire dimension design.

February 22, 2026 · 4 min · Junho Lee

3. ERD Notation — Same Diagram, Different Interpretation

Same Crow’s Foot, different meaning. A single dashed line means different things in different tools. If you want models to serve as a shared language on your project, start by aligning on notation.

February 21, 2026 · 4 min · Junho Lee

2. OLTP vs DW Models — Different Purpose, Different Design

Even when the ERDs look similar, the design philosophies are completely different. OLTP is about transactional integrity; DW is about analytical access paths. That difference creates unfamiliar things like Unknown records and point-in-time data.

February 20, 2026 · 6 min · Junho Lee

1. Is Kimball Still Relevant in the Cloud DW Era?

How DW modeling considerations have shifted with Synapse, BigQuery, and Redshift. Kimball, Data Vault, One Big Table — practical criteria for choosing the right approach.

February 19, 2026 · 6 min · Junho Lee

Design guides for ETL/ELT pipeline architecture

4. SCD - When a Customer Moves, Where Were Past Orders Shipped?

When dimension data changes, do you overwrite history or preserve it? We implement SCD Types 1, 2, and 3 in SQL to show how they differ, then build a production pattern with dbt snapshot.

February 22, 2026 · 10 min · Junho Lee

3. Silver Layer - Promoting Bronze to an Analysis-Ready State

Cleanse and standardize the raw data stacked in Bronze. Fix types, unify column names, remove duplicates. We define this process as SQL models in dbt.

February 22, 2026 · 6 min · Junho Lee

2. Bronze Layer - Load the Source Data Exactly As-Is

There are two ways to load data into Bronze. Overwrite everything, or bring only what changed. Which one you choose completely changes the complexity of your pipeline.

February 22, 2026 · 6 min · Junho Lee

1. Medallion Architecture - Why We Stack Data in Three Layers

Bronze, Silver, Gold. What changes when you load data into separate layers. We build it hands-on with DuckDB and dbt.

February 22, 2026 · 5 min · Junho Lee

Generative Engine Optimization — technical strategies for creating content that AI engines cite

5. AEO - Why Coding Agents Read Documentation Differently

If GEO optimizes for consumer AI, AEO optimizes for coding agents. This article covers document length constraints, llms.txt, skill.md, and AGENTS.md — the files that matter.

April 17, 2026 · 6 min · Junho Lee

4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site

Even with perfect On-Site GEO, half of AI citations come from external channels. We cover platform-specific Off-Site strategies and how to diagnose your robots.txt setup.

April 4, 2026 · 7 min · Junho Lee

3. On-Site GEO Technical Architecture - From Product DB to JSON-LD

How product master DB data flows through a 3-stage pipeline to become JSON-LD in your HTML. Covers the pipeline architecture and SSR-based automated deployment.

April 1, 2026 · 8 min · Junho Lee

2. Each AI Cites Different Sources

ChatGPT favors Wikipedia, Perplexity leans on Reddit, and Gemini prefers official websites. Covering all AI platforms with a single strategy is impossible.

March 29, 2026 · 7 min · Junho Lee

1. What Is GEO - AI Citation Strategy Beyond SEO

Only 9% of Google’s top 10 pages are cited by AI. In an era where SEO rankings no longer guarantee AI citations, we break down the three core principles of GEO and their academic foundations.

March 26, 2026 · 6 min · Junho Lee

A curated collection of data engineering, AI, and technology trend insights

What Actually Improves Claude Code Performance: Configuration and Architecture

Keeping Claude Code minimal, forcing full reasoning with three settings.json lines, and investing in system architecture — a practical take on what actually moves the needle.

April 15, 2026 · 2 min · Junho Lee

32 Slash Command Shortcuts That LLMs Instantly Understand

32 slash commands that work out of the box with Claude, ChatGPT, and Gemini – no custom definitions needed. Categorized by use case with practical combination strategies.

April 7, 2026 · 2 min · Junho Lee

Analyzing the Official GitHub Repository of a Major Korean Brokerage Open API

A structural analysis of the official sample code from a major Korean brokerage API, optimized for LLM agents and Python environments.

April 5, 2026 · 2 min · Junho Lee

macshot: A Native macOS Tool Emerging as an Alternative to Paid Apps

A look at macshot, a native open-source macOS screenshot and screen recording tool that delivers powerful features without the subscription burden.

April 5, 2026 · 2 min · Junho Lee

OpenDocuments: A Local RAG Platform That Unifies Fragmented Team Knowledge

An open-source platform that connects documents scattered across Notion, GitHub, and S3, then queries them with a local LLM. Runs entirely on-premise without external APIs.

April 1, 2026 · 2 min · Junho Lee