Junho Lee

Building DataNexus

The journey of building DataNexus, an ontology-based data agent platform

9. The Public Benchmark Returned 56%: Nine Experiments and What Got Ruled Out

I hit 80% on my own 30-question benchmark, but only 56% on BIRD Mini-Dev’s 50 public questions. Nine experiments later, I had ruled out the multi-candidate hypothesis from three different angles. What’s left is schema understanding and methodology.

8. From 66% to 80% NL2SQL Accuracy: Four Measure-and-Fix Loops

After wiring up the router, I ran a 30-question benchmark and pushed NL2SQL EX (Execution Accuracy) from 66.67% to 80%. Here’s what I fixed across four cycles and where things broke.

7. When a Question Comes In, Who Decides the Routing?

The term definitions are done. But when a user asks a question, who decides whether to search the graph, write SQL, or run a vector search? Things I ran into while designing the router.

6. When You Don't Have to Build Agent Infra Yourself, Harnesses Become Obsolete. What About the Ontology?

Shortly after the Conway leak, Anthropic officially launched Claude Managed Agents. As agent infrastructure gets absorbed into platforms, here’s why DataNexus’s ontology layer remains safe.

5. Automating Metadata Maintenance: Karpathy's LLM Wiki Architecture

RAG starts from scratch every time. Karpathy proposes having the LLM maintain a wiki directly so knowledge accumulates. DataNexus’s ontology catalog needs the same principle to avoid being abandoned.

DW Modeling

Design guides for data warehouse modeling

4. Super-Sub Types — Can a Customer Be Both Individual and Corporate?

Super-sub types clarify business classifications at the logical model level. When converting to a physical model, three options emerge — and in a DW, that choice reshapes the entire dimension design.

3. ERD Notation — Same Diagram, Different Interpretation

Same Crow’s Foot, different meaning. A single dashed line means different things in different tools. If you want models to serve as a shared language on your project, start by aligning on notation.

2. OLTP vs DW Models — Different Purpose, Different Design

Even when the ERDs look similar, the design philosophies are completely different. OLTP is about transactional integrity; DW is about analytical access paths. That difference creates unfamiliar things like Unknown records and point-in-time data.

1. Is Kimball Still Relevant in the Cloud DW Era?

How DW modeling considerations have shifted with Synapse, BigQuery, and Redshift. Kimball, Data Vault, One Big Table — practical criteria for choosing the right approach.

ETL Design

Design guides for ETL/ELT pipeline architecture

4. SCD - When a Customer Moves, Where Were Past Orders Shipped?

When dimension data changes, do you overwrite history or preserve it? We implement SCD Types 1, 2, and 3 in SQL to show how they differ, then build a production pattern with dbt snapshot.

3. Silver Layer - Promoting Bronze to an Analysis-Ready State

Cleanse and standardize the raw data stacked in Bronze. Fix types, unify column names, remove duplicates. We define this process as SQL models in dbt.

2. Bronze Layer - Load the Source Data Exactly As-Is

There are two ways to load data into Bronze. Overwrite everything, or bring only what changed. Which one you choose completely changes the complexity of your pipeline.

1. Medallion Architecture - Why We Stack Data in Three Layers

Bronze, Silver, Gold. What changes when you load data into separate layers. We build it hands-on with DuckDB and dbt.

GEO Optimization Guide

Generative Engine Optimization — technical strategies for creating content that AI engines cite

5. AEO - Why Coding Agents Read Documentation Differently

If GEO optimizes for consumer AI, AEO optimizes for coding agents. This article covers document length constraints, llms.txt, skill.md, and AGENTS.md — the files that matter.

4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site

Even with perfect On-Site GEO, half of AI citations come from external channels. We cover platform-specific Off-Site strategies and how to diagnose your robots.txt setup.

3. On-Site GEO Technical Architecture - From Product DB to JSON-LD

How product master DB data flows through a 3-stage pipeline to become JSON-LD in your HTML. Covers the pipeline architecture and SSR-based automated deployment.

2. Each AI Cites Different Sources

ChatGPT favors Wikipedia, Perplexity leans on Reddit, and Gemini prefers official websites. Covering all AI platforms with a single strategy is impossible.

1. What Is GEO - AI Citation Strategy Beyond SEO

Only 9% of Google’s top 10 pages are cited by AI. In an era where SEO rankings no longer guarantee AI citations, we break down the three core principles of GEO and their academic foundations.

Curations

A curated collection of data engineering, AI, and technology trend insights

What Actually Improves Claude Code Performance: Configuration and Architecture

Keeping Claude Code minimal, forcing full reasoning with three settings.json lines, and investing in system architecture — a practical take on what actually moves the needle.

32 Slash Command Shortcuts That LLMs Instantly Understand

32 slash commands that work out of the box with Claude, ChatGPT, and Gemini – no custom definitions needed. Categorized by use case with practical combination strategies.

Analyzing the Official GitHub Repository of a Major Korean Brokerage Open API

A structural analysis of the official sample code from a major Korean brokerage API, optimized for LLM agents and Python environments.

macshot: A Native macOS Tool Emerging as an Alternative to Paid Apps

A look at macshot, a native open-source macOS screenshot and screen recording tool that delivers powerful features without the subscription burden.

OpenDocuments: A Local RAG Platform That Unifies Fragmented Team Knowledge

An open-source platform that connects documents scattered across Notion, GitHub, and S3, then queries them with a local LLM. Runs entirely on-premise without external APIs.