[{"content":"\rTry it in Google Colab\rThe Customer Moved. What About Past Order Shipping Addresses? Say a jaffle_shop customer moved from Seoul to Busan. You UPDATE the city column in the customers table to \u0026lsquo;Busan\u0026rsquo;. Now when you query past orders, every shipping address shows \u0026ldquo;Busan\u0026rdquo; \u0026ndash; even for orders that were actually shipped to Seoul.\nThis is the same \u0026ldquo;point-in-time data\u0026rdquo; problem we covered in DW Modeling Part 2\r. OLTP only manages the current state. A DW needs to know \u0026ldquo;what was the value at that point in time?\u0026rdquo; When a sales rep changes, whose numbers do past deals belong to? When a customer\u0026rsquo;s tier changes, which tier should past orders be aggregated under? Same problem.\nKimball systematized this. SCD (Slowly Changing Dimension) \u0026ndash; a framework for handling attribute changes in dimension tables, organized by type. The name \u0026ldquo;slowly\u0026rdquo; contrasts with fact data (orders, logs) that accumulates continuously. Customer addresses, product categories, employee departments \u0026ndash; they don\u0026rsquo;t change often, but they do change.\nSCD Type 1 - Overwrite The simplest approach. UPDATE with the current value and move on. History is lost.\n-- Type 1: overwrite with current value UPDATE dim_customers SET city = \u0026#39;Busan\u0026#39; WHERE customer_id = 1; After execution, whether you join this customer\u0026rsquo;s past orders or current orders, everything shows \u0026ldquo;Busan.\u0026rdquo; The time they lived in Seoul is gone.\nThere are valid cases for Type 1. Typo corrections are the classic example. Changing \u0026ldquo;Seoul Special City\u0026rdquo; to \u0026ldquo;Seoul\u0026rdquo; doesn\u0026rsquo;t warrant keeping history. 
Code table description updates, customer name typo fixes \u0026ndash; use this for attributes where knowing the past value serves no purpose.\nSCD Type 2 - Accumulate History When you need to preserve past values, use Type 2. Close the existing row and add a new one.\nThree columns are added to the table:\nvalid_from \u0026ndash; when this row became effective valid_to \u0026ndash; when this row was superseded (current rows use 9999-12-31) is_current \u0026ndash; whether this row is currently active -- Initial state: customer_id = 1, Seoul -- dim_customers_sk | customer_id | city | valid_from | valid_to | is_current -- 1001 | 1 | Seoul | 2025-01-01 | 9999-12-31 | true When the customer moves to Busan, two steps are executed.\n-- Step 1: close the existing row UPDATE dim_customers SET valid_to = \u0026#39;2026-02-15\u0026#39;, is_current = false WHERE customer_id = 1 AND is_current = true; -- Step 2: insert the new row INSERT INTO dim_customers (dim_customers_sk, customer_id, city, valid_from, valid_to, is_current) VALUES (1002, 1, \u0026#39;Busan\u0026#39;, \u0026#39;2026-02-15\u0026#39;, \u0026#39;9999-12-31\u0026#39;, true); Now one customer_id has two rows.\n-- dim_customers_sk | customer_id | city | valid_from | valid_to | is_current -- 1001 | 1 | Seoul | 2025-01-01 | 2026-02-15 | false -- 1002 | 1 | Busan | 2026-02-15 | 9999-12-31 | true Here, dim_customers_sk is the surrogate key. Since one customer can have multiple rows, customer_id alone can no longer uniquely identify a row. That\u0026rsquo;s why a separate surrogate key is needed. 
Design details are covered in the Gold edition.\nPoint-in-time queries work like this.\n-- Join with the customer address at order time SELECT o.order_id, o.order_date, c.city AS city_at_order_time FROM fct_orders o JOIN dim_customers c ON o.customer_id = c.customer_id AND o.order_date \u0026gt;= c.valid_from AND o.order_date \u0026lt; c.valid_to; A June 2025 order shows \u0026ldquo;Seoul,\u0026rdquo; a March 2026 order shows \u0026ldquo;Busan.\u0026rdquo; Each order reflects the actual value at its point in time. The half-open condition (\u0026gt;=, \u0026lt;) matters: on the change date itself, the old row\u0026rsquo;s valid_to equals the new row\u0026rsquo;s valid_from, so an inclusive BETWEEN would match both rows and double-count that order.\nAs mentioned in DW Modeling Part 1\r, \u0026ldquo;the storage overhead of SCD Type 2 has diminished.\u0026rdquo; In cloud columnar storage, the cost of additional rows is far lower than on-premises. This is an environment where you can use Type 2 more aggressively.\nSCD Type 3 - Keep the Previous Value as a Column When a single level of history is sufficient, use Type 3. Store the previous value in a separate column.\n-- Type 3: previous value as a column -- customer_id | city | previous_city -- 1 | Busan | Seoul Implementation is straightforward.\nUPDATE dim_customers SET previous_city = city, city = \u0026#39;Busan\u0026#39; WHERE customer_id = 1; Row count doesn\u0026rsquo;t increase. However, the value from two changes ago is lost. If the city changes from Seoul → Busan → Daejeon, \u0026ldquo;Seoul\u0026rdquo; disappears.\nThere are practical use cases. Comparing before and after an organizational restructuring: \u0026ldquo;What department was this employee in before the reorg?\u0026rdquo; When a single previous value is enough and full history isn\u0026rsquo;t needed.\nWhich Type to Choose Criterion Type 1 Type 2 Type 3 History preservation None Full Previous value only Implementation complexity Low High Medium Storage No change Rows keep growing Column addition Point-in-time analysis Not possible Fully supported Limited Best for Typos, code descriptions Address, tier, department Before/after reorg comparison The decision criterion is simple. 
\u0026ldquo;Do I need to analyze using past values?\u0026rdquo; If yes, Type 2. If no, Type 1. Type 3 is limited to the special case where only the previous value is needed.\nYou can mix types within a single table by column. Track city with Type 2 for full history, but overwrite phone with Type 1. There\u0026rsquo;s no analytical reason to keep historical phone numbers.\nPreparing Lab Data jaffle_shop\u0026rsquo;s raw_customers doesn\u0026rsquo;t have an address column. We need to generate synthetic data to demonstrate SCD.\nimport duckdb conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) # Customer data for SCD demo: add city, membership_grade, updated_at conn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE SCHEMA IF NOT EXISTS bronze; CREATE OR REPLACE TABLE bronze.customers_v2 AS SELECT id AS customer_id, first_name, last_name, CASE WHEN id % 3 = 0 THEN \u0026#39;Seoul\u0026#39; WHEN id % 3 = 1 THEN \u0026#39;Busan\u0026#39; ELSE \u0026#39;Daejeon\u0026#39; END AS city, CASE WHEN id % 4 = 0 THEN \u0026#39;Gold\u0026#39; WHEN id % 4 = 1 THEN \u0026#39;Silver\u0026#39; WHEN id % 4 = 2 THEN \u0026#39;Bronze\u0026#39; ELSE \u0026#39;Standard\u0026#39; END AS membership_grade, TIMESTAMP \u0026#39;2025-01-15 09:00:00\u0026#39; AS updated_at FROM read_csv_auto( \u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv\u0026#39; ); \u0026#34;\u0026#34;\u0026#34;) conn.execute(\u0026#34;SELECT * FROM bronze.customers_v2 LIMIT 5\u0026#34;).fetchdf() We also create change simulation data. 
A scenario where some customers moved and their tiers were upgraded.\n# Change data: some customers have moved conn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE OR REPLACE TABLE bronze.customers_v2_updated AS SELECT customer_id, first_name, last_name, CASE WHEN customer_id IN (1, 3, 5) THEN \u0026#39;Jeju\u0026#39; ELSE city END AS city, CASE WHEN customer_id IN (2, 4) THEN \u0026#39;Gold\u0026#39; ELSE membership_grade END AS membership_grade, CASE WHEN customer_id IN (1, 2, 3, 4, 5) THEN TIMESTAMP \u0026#39;2026-02-20 14:00:00\u0026#39; ELSE updated_at END AS updated_at FROM bronze.customers_v2; \u0026#34;\u0026#34;\u0026#34;) # Check changed customers conn.execute(\u0026#34;\u0026#34;\u0026#34; SELECT a.customer_id, a.city AS before_city, b.city AS after_city, a.membership_grade AS before_grade, b.membership_grade AS after_grade FROM bronze.customers_v2 a JOIN bronze.customers_v2_updated b ON a.customer_id = b.customer_id WHERE a.city != b.city OR a.membership_grade != b.membership_grade \u0026#34;\u0026#34;\u0026#34;).fetchdf() Automating SCD Type 2 with dbt snapshot What Is a Snapshot We implemented Type 2 manually in SQL above. Close the existing row, insert the new one, manage valid_from/valid_to. That\u0026rsquo;s manageable for one table. When dimension tables grow to 10 or 20, writing this logic from scratch every time isn\u0026rsquo;t realistic.\ndbt snapshot handles this for you. 
Define a single snapshot file, and dbt detects changes in the source data and manages the valid_from/valid_to rows automatically.\nWriting the Snapshot File import os os.makedirs(\u0026#39;jaffle_shop/snapshots\u0026#39;, exist_ok=True) %%writefile jaffle_shop/snapshots/snap_customers.sql {% snapshot snap_customers %} {{ config( target_schema=\u0026#39;snapshots\u0026#39;, unique_key=\u0026#39;customer_id\u0026#39;, strategy=\u0026#39;timestamp\u0026#39;, updated_at=\u0026#39;updated_at\u0026#39; ) }} select * from bronze.customers_v2 {% endsnapshot %} strategy='timestamp' \u0026ndash; determines change based on the updated_at column. If updated_at is newer than the previous snapshot point, the row is considered changed.\nunique_key='customer_id' \u0026ndash; the key that identifies which row is which. Previous and current values are compared based on this key.\nRunning the Snapshot from dbt.cli.main import dbtRunner # Colab\u0026#39;s ! shell commands spawn a separate process. # DuckDB uses file locks that prevent concurrent writes across processes. # Running via dbtRunner within the same process avoids lock conflicts. result = dbtRunner().invoke([\u0026#39;snapshot\u0026#39;, \u0026#39;--project-dir\u0026#39;, \u0026#39;jaffle_shop\u0026#39;, \u0026#39;--profiles-dir\u0026#39;, \u0026#39;jaffle_shop\u0026#39;]) This is the first run. All rows are new, so they\u0026rsquo;re inserted as-is. dbt automatically adds dbt_valid_from and dbt_valid_to columns.\nconn.execute(\u0026#34;SELECT * FROM snapshots.snap_customers LIMIT 5\u0026#34;).fetchdf() dbt_valid_to is NULL for all rows. That means they\u0026rsquo;re currently active. 
dbt snapshot uses NULL instead of 9999-12-31.\nNow let\u0026rsquo;s inject the change data and run again.\n# Replace source table with changed data conn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE OR REPLACE TABLE bronze.customers_v2 AS SELECT * FROM bronze.customers_v2_updated; \u0026#34;\u0026#34;\u0026#34;) conn.close() # Re-run snapshot result = dbtRunner().invoke([\u0026#39;snapshot\u0026#39;, \u0026#39;--project-dir\u0026#39;, \u0026#39;jaffle_shop\u0026#39;, \u0026#39;--profiles-dir\u0026#39;, \u0026#39;jaffle_shop\u0026#39;]) conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) # Check customers with history conn.execute(\u0026#34;\u0026#34;\u0026#34; SELECT customer_id, city, membership_grade, dbt_valid_from, dbt_valid_to FROM snapshots.snap_customers WHERE customer_id IN (1, 2, 3) ORDER BY customer_id, dbt_valid_from \u0026#34;\u0026#34;\u0026#34;).fetchdf() Customer with customer_id = 1 now has two rows. The first row\u0026rsquo;s dbt_valid_to is populated, and the second row is the currently active one. Type 2 implemented without writing a single line of SQL.\nThe Check Strategy Some sources don\u0026rsquo;t have an updated_at column. As mentioned in Part 2\r, surprisingly many systems UPDATE data without touching updated_at.\nIn these cases, use the check strategy. It directly compares whether specified column values have changed.\n%%writefile jaffle_shop/snapshots/snap_customers_check.sql {% snapshot snap_customers_check %} {{ config( target_schema=\u0026#39;snapshots\u0026#39;, unique_key=\u0026#39;customer_id\u0026#39;, strategy=\u0026#39;check\u0026#39;, check_cols=[\u0026#39;city\u0026#39;, \u0026#39;membership_grade\u0026#39;] ) }} select customer_id, first_name, last_name, city, membership_grade from bronze.customers_v2 {% endsnapshot %} check_cols=['city', 'membership_grade'] \u0026ndash; if either of these column values differs from before, it\u0026rsquo;s treated as a change. No updated_at needed. 
The trade-off is that it compares all rows every time, so it\u0026rsquo;s slower than the timestamp strategy for large datasets.\nConnecting Snapshots to Silver/Gold Snapshots are stored in a separate schema (snapshots), neither Bronze nor Silver. Here\u0026rsquo;s how this fits into the Medallion Architecture.\nSource → [Bronze] → [Silver] → [Gold] ↑ Bronze → [Snapshot] ──┘ This is the layer structure from Part 1\rwith snapshots added. Snapshots look directly at Bronze data, and Silver or Gold models reference the snapshot results.\nHere\u0026rsquo;s how a Silver model references a snapshot.\n%%writefile jaffle_shop/models/staging/stg_customers_hist.sql with source as ( select * from {{ ref(\u0026#39;snap_customers\u0026#39;) }} ), cleaned as ( select customer_id, first_name, last_name, city, membership_grade, dbt_valid_from AS valid_from, coalesce(dbt_valid_to, \u0026#39;9999-12-31\u0026#39;::timestamp) AS valid_to, dbt_valid_to IS NULL AS is_current from source ) select * from cleaned We renamed dbt\u0026rsquo;s dbt_valid_from/dbt_valid_to to valid_from/valid_to, and converted NULL to 9999-12-31. This makes BETWEEN joins in Gold more convenient.\nA point-in-time join from a Gold fact table looks like this.\n-- Gold: join with customer info at order time select o.order_id, o.order_date, c.city AS city_at_order, c.membership_grade AS grade_at_order from stg_orders o join stg_customers_hist c on o.customer_id = c.customer_id and o.order_date \u0026gt;= c.valid_from and o.order_date \u0026lt; c.valid_to SCD Application Patterns Summary Pattern Target Implementation Type 1 (overwrite) Typos, code descriptions, phone numbers Simple UPDATE Type 2 (history accumulation) Address, tier, department dbt snapshot (timestamp / check) Type 3 (preserve previous value) Before/after reorg comparison Add previous_ column Mixed Per-column differentiation within a table Combine Type 1 + Type 2 In practice, Type 2 dominates. 
Since dbt snapshot handles the implementation, the overhead isn\u0026rsquo;t significant. Type 1 is reserved for attributes that don\u0026rsquo;t need history. Type 3 is used in rare cases where only the previous value matters.\nPractical Reference: Running dbt snapshot from Airflow In Part 3\r, we set up an Airflow DAG with dbt run → dbt test. Adding snapshots changes the order.\nfrom airflow import DAG from airflow.operators.bash import BashOperator from datetime import datetime with DAG( dag_id=\u0026#39;medallion_with_snapshot\u0026#39;, schedule=\u0026#39;0 6 * * *\u0026#39;, start_date=datetime(2026, 1, 1), catchup=False, ) as dag: # 1. Run snapshot first — capture change history from Bronze run_snapshot = BashOperator( task_id=\u0026#39;dbt_snapshot\u0026#39;, bash_command=\u0026#39;cd /opt/dbt/jaffle_shop \u0026amp;\u0026amp; dbt snapshot\u0026#39;, ) # 2. Silver transformation — some models reference snapshot results run_staging = BashOperator( task_id=\u0026#39;dbt_run_staging\u0026#39;, bash_command=\u0026#39;cd /opt/dbt/jaffle_shop \u0026amp;\u0026amp; dbt run --select staging\u0026#39;, ) # 3. Gold transformation run_marts = BashOperator( task_id=\u0026#39;dbt_run_marts\u0026#39;, bash_command=\u0026#39;cd /opt/dbt/jaffle_shop \u0026amp;\u0026amp; dbt run --select marts\u0026#39;, ) run_snapshot \u0026gt;\u0026gt; run_staging \u0026gt;\u0026gt; run_marts The critical ordering is run_snapshot \u0026gt;\u0026gt; run_staging. The Silver model stg_customers_hist references snap_customers. Snapshots must run first so Silver reflects the latest history. If you run snapshots after Silver, changes detected in this batch won\u0026rsquo;t appear in Silver until the next batch. A full day\u0026rsquo;s delay.\nThe next post covers the Gold layer. The process of combining cleansed Silver data and snapshot history to build fact and dimension tables. 
The dbt marts directory takes center stage.\nTry it in Google Colab\r","permalink":"https://datanexus-kr.github.io/en/guides/etl-design/004-scd/","summary":"When dimension data changes, do you overwrite history or preserve it? We implement SCD Type 1, 2, and 3 differences in SQL, then build a production pattern with dbt snapshot.","title":"4. SCD - When a Customer Moves, Where Were Past Orders Shipped?"},{"content":"\rTry it in Google Colab\rWhat Happens When You Use Bronze Data Directly In Part 2\r, we loaded raw data into Bronze as-is. No transformation. That principle is correct. The problem is that Bronze data isn\u0026rsquo;t in a state you can use for analysis.\nLook at jaffle_shop\u0026rsquo;s bronze.orders. The order_date column came in as VARCHAR. You can\u0026rsquo;t use date functions on it. The status column has a mix of returned, return_pending, completed, placed, and shipped \u0026ndash; and from the schema alone, you can\u0026rsquo;t tell which value represents the final state.\nThe amount column in bronze.payments is an integer in cents. To convert to dollars, you need to divide by 100. Doing this division manually every time you analyze is a recipe for mistakes.\nSilver is the layer that cleans all of this up in one place. Fix types, unify column names, convert units. No business logic yet. Silver\u0026rsquo;s job is to create a \u0026ldquo;clean state ready for analysis.\u0026rdquo;\nWhat Silver Does and Doesn\u0026rsquo;t Do Maintaining boundaries is important. 
Once you start putting business logic in Silver, the whole point of separating Bronze and Silver disappears.\nWhat Silver does:\nType casting \u0026ndash; VARCHAR to DATE, INTEGER to DECIMAL Column name standardization \u0026ndash; unify user_id and userId to user_id Unit conversion \u0026ndash; cents to dollars, milliseconds to seconds Deduplication \u0026ndash; when the same record was loaded twice NULL handling \u0026ndash; unify empty strings to NULL What Silver doesn\u0026rsquo;t do:\nKPI calculation \u0026ndash; business metrics like revenue or margin rate Table joins \u0026ndash; combining orders and customers into a single view Aggregation \u0026ndash; summarizing with GROUP BY Joins and aggregation belong to Gold. Silver only cleanses at the individual table level.\nWhy dbt Is Needed In Part 1\r, we introduced dbt as a tool. Why not just run SQL files directly?\nRunning SQL files one by one works fine at first. When Silver tables grow to 5, 10, or more, things change. You lose track of which table depends on which Bronze table, what order to run them in, and when the last run happened.\ndbt solves this. Each SQL file is a model. Declare dependencies between models using the ref() function, and dbt figures out the execution order automatically. Since transformation logic lives in SQL files, you get version history through Git as well.\ndbt Project Setup Create a dbt project in Colab.\n!pip install -q duckdb dbt-core dbt-duckdb import os # Create dbt project directory structure os.makedirs(\u0026#39;jaffle_shop/models/staging\u0026#39;, exist_ok=True) os.makedirs(\u0026#39;jaffle_shop/models/marts\u0026#39;, exist_ok=True) Create the dbt configuration file. 
Specify DuckDB as the database.\n%%writefile jaffle_shop/dbt_project.yml name: \u0026#39;jaffle_shop\u0026#39; version: \u0026#39;1.0.0\u0026#39; profile: \u0026#39;jaffle_shop\u0026#39; model-paths: [\u0026#34;models\u0026#34;] %%writefile jaffle_shop/profiles.yml jaffle_shop: target: dev outputs: dev: type: duckdb path: /content/warehouse.duckdb Writing Silver Models In dbt, Silver layer models go in the models/staging/ directory. The stg_ prefix stands for staging (= Silver).\nstg_orders %%writefile jaffle_shop/models/staging/stg_orders.sql with source as ( select * from bronze.orders ), cleaned as ( select id as order_id, user_id as customer_id, cast(order_date as date) as order_date, status from source ) select * from cleaned We renamed Bronze\u0026rsquo;s id to order_id. When joining multiple tables, a bare id doesn\u0026rsquo;t tell you which table it belongs to. We also renamed user_id to customer_id to clarify meaning. order_date is cast to DATE.\nstg_customers %%writefile jaffle_shop/models/staging/stg_customers.sql with source as ( select * from bronze.customers ), cleaned as ( select id as customer_id, first_name, last_name from source ) select * from cleaned stg_payments %%writefile jaffle_shop/models/staging/stg_payments.sql with source as ( select * from bronze.payments ), cleaned as ( select id as payment_id, order_id, payment_method, amount / 100.0 as amount_dollars from source ) select * from cleaned We divided amount by 100 to convert to dollars. The column name is changed to amount_dollars so the unit is immediately obvious from the name.\nRunning dbt !cd jaffle_shop \u0026amp;\u0026amp; dbt run --select staging.* dbt executes the three models: stg_orders, stg_customers, stg_payments. 
Each is created as a view in DuckDB.\nChecking the Results import duckdb conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) # Check Silver layer conn.execute(\u0026#34;SELECT * FROM stg_orders LIMIT 5\u0026#34;).fetchdf() # Check types — has order_date been converted to DATE? conn.execute(\u0026#34;\u0026#34;\u0026#34; SELECT column_name, data_type FROM information_schema.columns WHERE table_name = \u0026#39;stg_orders\u0026#39; \u0026#34;\u0026#34;\u0026#34;).fetchdf() # Has payments amount been converted to dollars? conn.execute(\u0026#34;SELECT * FROM stg_payments LIMIT 5\u0026#34;).fetchdf() order_date, which was VARCHAR in Bronze, is now DATE. amount has been converted from cents to dollars. Column names are unified. That\u0026rsquo;s Silver.\nThe CTE Pattern There\u0026rsquo;s a recurring pattern in the SQL above: with source as (...), cleaned as (...) select * from cleaned. This is the CTE (Common Table Expression) pattern widely used in the dbt community.\nwith source as ( -- Step 1: pull the raw data from Bronze select * from bronze.orders ), cleaned as ( -- Step 2: apply cleansing logic select id as order_id, cast(order_date as date) as order_date from source ) -- Step 3: return the final result select * from cleaned source → cleaned → select. Each step\u0026rsquo;s purpose is readable from its name. When cleansing logic grows more complex, just add more CTEs. Some teams use names like renamed, filtered, or deduplicated to break things into stages.\nDeduplication Pattern Sometimes the same record ends up in Bronze twice. Maybe the source system re-sent data, or there was a bug in the incremental load logic. 
Silver needs to catch this.\nwith source as ( select * from bronze.orders ), deduplicated as ( select *, row_number() over ( partition by id order by _loaded_at desc ) as row_num from source ), cleaned as ( select id as order_id, user_id as customer_id, cast(order_date as date) as order_date, status from deduplicated where row_num = 1 ) select * from cleaned row_number() keeps only the most recently loaded record when the same id appears multiple times. The _loaded_at metadata column we added in Part 2\ris what makes this work.\nChange Silver Carelessly and Gold Breaks Gold models rely on Silver table column names, types, and units. If stg_orders has order_date as DATE and Gold uses date functions on it, then someone renaming the column to ordered_at in Silver causes every Gold model to throw errors.\nAdding columns is fine. Renaming or changing the type of existing columns is dangerous. dbt\u0026rsquo;s ref() function tracks dependencies, so you can at least see what gets affected.\nPractical Reference: Running dbt from Airflow There are several ways to run dbt from Airflow. The simplest is calling dbt run via BashOperator. 
For more sophisticated setups, you can use the cosmos library.\nfrom airflow import DAG from airflow.operators.bash import BashOperator from datetime import datetime with DAG( dag_id=\u0026#39;silver_transformation\u0026#39;, schedule=\u0026#39;0 6 * * *\u0026#39;, start_date=datetime(2026, 1, 1), catchup=False, ) as dag: # Run Silver transformation after Bronze load completes run_staging = BashOperator( task_id=\u0026#39;dbt_run_staging\u0026#39;, bash_command=\u0026#39;cd /opt/dbt/jaffle_shop \u0026amp;\u0026amp; dbt run --select staging\u0026#39;, ) # Validate Silver data quality with dbt test test_staging = BashOperator( task_id=\u0026#39;dbt_test_staging\u0026#39;, bash_command=\u0026#39;cd /opt/dbt/jaffle_shop \u0026amp;\u0026amp; dbt test --select staging\u0026#39;, ) run_staging \u0026gt;\u0026gt; test_staging dbt run followed by dbt test. Quality validation runs immediately after Silver transformation finishes. If tests fail, the pipeline doesn\u0026rsquo;t proceed to Gold transformation. This structure prevents bad data from propagating up to Gold.\nWith the cosmos library, you can split each dbt model into its own Airflow task. If stg_orders fails, stg_customers can still succeed independently. This granularity becomes meaningful when you have dozens of models.\nThe next post covers SCD (Slowly Changing Dimension). When a customer\u0026rsquo;s address changes, how do you preserve the historical address? The differences between Type 1, 2, and 3, and how to choose.\nTry it in Google Colab\r","permalink":"https://datanexus-kr.github.io/en/guides/etl-design/003-silver-layer/","summary":"Cleanse and standardize the raw data stacked in Bronze. Fix types, unify column names, remove duplicates. We define this process as SQL models in dbt.","title":"3. Silver Layer - Promoting Bronze to an Analysis-Ready State"},{"content":"\rTry it in Google Colab\rTouch the Original and There\u0026rsquo;s No Going Back In Part 1\r, we established the Bronze layer principle. 
Data from source systems is stored without any transformation. No type casting, no column renaming.\nThe principle is simple, but sticking to it in practice is hard. The temptation arises: \u0026ldquo;The date column is a string \u0026ndash; can\u0026rsquo;t I just cast it to DATE on the way in?\u0026rdquo; No. If you change types in Bronze, you lose the ability to restore the original. When the source system sends an invalid date like \u0026quot;2026-02-30\u0026quot;, casting to DATE either throws an error or silently converts it to NULL. You lose track of what the original value was.\nBronze is insurance. Even if your Silver transformation logic has a bug, or the source system suddenly changes its schema, you can start over from Bronze. Give up this insurance, and every time something goes wrong, you\u0026rsquo;ll need to pull data from the source system again. There\u0026rsquo;s no guarantee the source system owner will be cooperative.\nFull Load and Incremental Load There are two main ways to load data into Bronze.\nFull Load fetches the entire source table and overwrites it every time. Simple. Since what\u0026rsquo;s in the source is exactly what\u0026rsquo;s in Bronze, there\u0026rsquo;s no data consistency headache. The downside is that costs grow as data grows. If the orders table has 100 million rows but only 10,000 new orders come in daily, you\u0026rsquo;re re-fetching the same 99.99 million rows that haven\u0026rsquo;t changed.\nIncremental Load fetches only the data that changed since the last load. Efficient \u0026ndash; you only bring 10,000 rows. But complex. 
You need to decide how to determine \u0026ldquo;since the last load\u0026rdquo; and how to detect deleted records.\nWhich to use depends on the table\u0026rsquo;s characteristics.\nAspect Full Load Incremental Load Implementation difficulty Low High Network / cost Proportional to data size Proportional to changes Delete detection Automatic (full overwrite) Requires separate handling Best for Code tables, small masters High-volume transactions In practice, you use both. Small tables like code tables or product masters get Full Load for simplicity. High-volume tables like orders, logs, and events get Incremental Load to bring only the changes.\nHow to Define the Increment The most critical decision in Incremental Load is the criterion for \u0026ldquo;what has changed.\u0026rdquo; Three approaches are commonly used.\nTimestamp-based. If the source table has a column like updated_at, this is the simplest approach. You fetch only rows newer than the last load timestamp. One condition: the source system must honestly update the modification timestamp. Surprisingly many systems UPDATE data without touching updated_at.\nAuto-incrementing key-based. If there\u0026rsquo;s a monotonically increasing PK like order_id, you fetch rows after the last loaded ID. This catches INSERTs but misses UPDATEs. Best suited for log-style tables where the ID never changes once issued.\nCDC (Change Data Capture). Read the source database\u0026rsquo;s change log directly. Tools like Debezium capture MySQL or PostgreSQL WAL (Write-Ahead Log) and catch all INSERTs, UPDATEs, and DELETEs. 
Most accurate, but requires separate infrastructure.\nTimestamp-based: WHERE updated_at \u0026gt; \u0026#39;last_load_timestamp\u0026#39; Auto-incrementing key: WHERE order_id \u0026gt; last_loaded_id CDC: Capture database change logs Comparing Both Approaches in DuckDB We continue from the environment set up in Part 1\r.\nimport duckdb conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) Full Load Simulation Full Load is straightforward. Drop the existing data and reload everything.\n# Simulate a source data change scenario # In practice, you\u0026#39;d SELECT * from the source system conn.execute(\u0026#34;\u0026#34;\u0026#34; -- Full Load: replace entirely CREATE OR REPLACE TABLE bronze.orders AS SELECT * FROM read_csv_auto( \u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv\u0026#39; ); \u0026#34;\u0026#34;\u0026#34;) print(\u0026#34;Full Load complete:\u0026#34;, conn.execute(\u0026#34;SELECT count(*) FROM bronze.orders\u0026#34;).fetchone()[0], \u0026#34;rows\u0026#34;) CREATE OR REPLACE TABLE is the key. The table is recreated every time. Previous data is gone, and the current state of the source is loaded as-is.\nIncremental Load Simulation Incremental Load requires one extra step. 
You need to remember where you left off.\n# Watermark table: records the last load point conn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE TABLE IF NOT EXISTS bronze.watermarks ( table_name VARCHAR PRIMARY KEY, last_loaded_id INTEGER, last_loaded_at TIMESTAMP DEFAULT current_timestamp ); \u0026#34;\u0026#34;\u0026#34;) # Check current watermark watermark = conn.execute(\u0026#34;\u0026#34;\u0026#34; SELECT COALESCE(last_loaded_id, 0) FROM bronze.watermarks WHERE table_name = \u0026#39;orders\u0026#39; \u0026#34;\u0026#34;\u0026#34;).fetchone() last_id = watermark[0] if watermark else 0 print(f\u0026#34;Last loaded ID: {last_id}\u0026#34;) # Incremental load: only fetch data after last_id conn.execute(f\u0026#34;\u0026#34;\u0026#34; INSERT INTO bronze.orders SELECT * FROM read_csv_auto( \u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv\u0026#39; ) WHERE id \u0026gt; {last_id}; \u0026#34;\u0026#34;\u0026#34;) # Update watermark conn.execute(\u0026#34;\u0026#34;\u0026#34; INSERT OR REPLACE INTO bronze.watermarks (table_name, last_loaded_id, last_loaded_at) SELECT \u0026#39;orders\u0026#39;, MAX(id), current_timestamp FROM bronze.orders; \u0026#34;\u0026#34;\u0026#34;) print(\u0026#34;Incremental Load complete\u0026#34;) The watermarks table is the heart of incremental loading. It records how far you\u0026rsquo;ve loaded, and the next run fetches only what comes after. This pattern is called the High Watermark pattern.\nAdding Metadata Columns If you only store the raw data in Bronze, you\u0026rsquo;ll eventually face questions you can\u0026rsquo;t answer. 
\u0026ldquo;When was this data loaded?\u0026rdquo; \u0026ldquo;Which source did it come from?\u0026rdquo;\nKeep the original columns intact and add metadata columns alongside them.\nconn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE OR REPLACE TABLE bronze.orders_with_meta AS SELECT *, current_timestamp AS _loaded_at, \u0026#39;jaffle_shop\u0026#39; AS _source_system, \u0026#39;full\u0026#39; AS _load_type FROM read_csv_auto( \u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv\u0026#39; ); \u0026#34;\u0026#34;\u0026#34;) conn.execute(\u0026#34;SELECT * FROM bronze.orders_with_meta LIMIT 3\u0026#34;).fetchdf() _loaded_at, _source_system, _load_type. The underscore prefix distinguishes them from original columns. The source might have its own loaded_at column, after all.\nWith these metadata columns, when something goes wrong in Silver transformation, you can narrow down the scope: \u0026ldquo;Data loaded up to this point was fine; it\u0026rsquo;s the data after that which is problematic.\u0026rdquo; The deduplication pattern using _loaded_at in Part 3 also starts here.\nLoading Patterns Summary Here\u0026rsquo;s a summary of Bronze loading patterns.\nPattern Target Implementation Full Load (overwrite) Code tables, small masters CREATE OR REPLACE TABLE Full Load (snapshot) When daily state history is needed Use load date as partition key Incremental (timestamp) Tables with updated_at WHERE updated_at \u0026gt; watermark Incremental (auto-increment key) Logs, events, orders WHERE id \u0026gt; watermark CDC When delete detection is needed Debezium + Kafka There\u0026rsquo;s one more Full Load variant: the snapshot approach. Instead of overwriting, you store each day\u0026rsquo;s full state separately, partitioned by load date. Useful when you want to compare yesterday\u0026rsquo;s product master state with today\u0026rsquo;s. 
It consumes more storage, but as we discussed in Part 1\r, storage costs in cloud environments are practically negligible.\nPractical Reference: Bronze Loading with Airflow When you implement Bronze loading as an Airflow DAG, you can separate tasks per table based on Full Load vs. Incremental Load.\nfrom airflow import DAG from airflow.operators.python import PythonOperator from datetime import datetime import duckdb def load_full(table_name, source_url, **context): \u0026#34;\u0026#34;\u0026#34;Full Load: complete replacement\u0026#34;\u0026#34;\u0026#34; conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) conn.execute(f\u0026#34;\u0026#34;\u0026#34; CREATE OR REPLACE TABLE bronze.{table_name} AS SELECT *, current_timestamp AS _loaded_at, \u0026#39;{table_name}\u0026#39; AS _source_system, \u0026#39;full\u0026#39; AS _load_type FROM read_csv_auto(\u0026#39;{source_url}\u0026#39;) \u0026#34;\u0026#34;\u0026#34;) conn.close() def load_incremental(table_name, source_url, key_column, **context): \u0026#34;\u0026#34;\u0026#34;Incremental Load: only after watermark\u0026#34;\u0026#34;\u0026#34; conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) wm = conn.execute(f\u0026#34;\u0026#34;\u0026#34; SELECT COALESCE(last_loaded_id, 0) FROM bronze.watermarks WHERE table_name = \u0026#39;{table_name}\u0026#39; \u0026#34;\u0026#34;\u0026#34;).fetchone() last_id = wm[0] if wm else 0 conn.execute(f\u0026#34;\u0026#34;\u0026#34; INSERT INTO bronze.{table_name} SELECT *, current_timestamp AS _loaded_at FROM read_csv_auto(\u0026#39;{source_url}\u0026#39;) WHERE {key_column} \u0026gt; {last_id} \u0026#34;\u0026#34;\u0026#34;) conn.close() with DAG( dag_id=\u0026#39;bronze_ingestion\u0026#39;, schedule=\u0026#39;0 5 * * *\u0026#39;, start_date=datetime(2026, 1, 1), catchup=False, ) as dag: # Small master → Full Load load_customers = PythonOperator( task_id=\u0026#39;load_customers_full\u0026#39;, python_callable=load_full, op_kwargs={\u0026#39;table_name\u0026#39;: 
\u0026#39;customers\u0026#39;, \u0026#39;source_url\u0026#39;: \u0026#39;...\u0026#39;}, ) # High-volume transactions → Incremental Load load_orders = PythonOperator( task_id=\u0026#39;load_orders_incremental\u0026#39;, python_callable=load_incremental, op_kwargs={ \u0026#39;table_name\u0026#39;: \u0026#39;orders\u0026#39;, \u0026#39;source_url\u0026#39;: \u0026#39;...\u0026#39;, \u0026#39;key_column\u0026#39;: \u0026#39;id\u0026#39;, }, ) # Parallel execution — no dependencies between tables [load_customers, load_orders] Small masters use load_full, high-volume transactions use load_incremental. Functions are split by table characteristics. Since there are no dependencies between tables, Airflow runs them in parallel.\nThe next post covers the Silver layer. The process of cleansing and standardizing the raw data we\u0026rsquo;ve stacked in Bronze. This is where dbt starts to shine.\nPractice in Google Colab\r","permalink":"https://datanexus-kr.github.io/en/guides/etl-design/002-bronze-layer/","summary":"There are two ways to load data into Bronze. Overwrite everything, or bring only what changed. Which one you choose completely changes the complexity of your pipeline.","title":"2. Bronze Layer - Load the Source Data Exactly As-Is"},{"content":"\rPractice in Google Colab\rHow a Data Lake Turns into a Swamp There are teams that dump files into a data lake and try to analyze them right away. At first, it\u0026rsquo;s fast. Upload a CSV, write one SQL query, and you get results.\nThree months later, things look different. Nobody knows who uploaded which file. You can\u0026rsquo;t tell whether it\u0026rsquo;s the raw source or a processed version. The same revenue table shows different numbers depending on which department you ask. This is what people call a data swamp.\nThe cause is simple. Raw data and processed artifacts are mixed in the same space.
Separating them into layers solves this problem.\nBronze, Silver, Gold The Medallion Architecture divides data into three layers. Databricks coined the name and popularized it, but the concept itself is the same layered approach traditional data warehouses have used for decades.\nSource System → [Bronze] → [Silver] → [Gold] → BI / Analytics Raw Load Cleanse \u0026amp; Business Standardize Aggregation Bronze is the raw source. Data from source systems is stored without any transformation. CSV, JSON, API responses \u0026ndash; exactly as-is. This is the starting point for data lineage. If you change anything here, you lose the original.\nSilver is cleansing and standardization. You fix data types from Bronze, remove duplicates, and unify keys. This layer makes data \u0026ldquo;ready for analysis.\u0026rdquo; No business logic goes in here yet.\nGold is business-level aggregation. Fact tables, dimension tables, KPI marts. This is the layer end users query directly. The star schema we covered in DW Modeling Part 1\rbelongs here.\nThe key point is that each layer has a clear role. No transformation in Bronze. No business logic in Silver. Business-level processing only happens in Gold. Once you break these rules, the whole point of having layers disappears.\nMapping to Traditional DW Layers In the DW Modeling series\r, we covered the Raw → Staging → Integration → Mart structure. The names differ from Medallion, but the roles are nearly identical.\nMedallion Traditional DW Purpose Bronze Raw / Staging Raw load, no transformation Silver Integration (3NF / Data Vault) Cleansing, standardization, key unification Gold Mart (Star Schema) Business aggregation, analytics-ready In traditional DW, an ETL server handled heavy transformations between Staging and Integration. Medallion follows the ELT paradigm. You first load into Bronze, then build Silver and Gold inside the DW engine. 
The difference is that transformations are handled by the DW engine\u0026rsquo;s compute power rather than a separate server.\nThe Lab Environment for This Series We\u0026rsquo;ll use three tools throughout this series. All free, and you can run them directly in Google Colab without a cloud account.\nTool Role DuckDB Local DW engine. Columnar storage-based, so it works the same way as BigQuery/Snowflake dbt-core + dbt-duckdb Transformation layer. Defines Bronze → Silver → Gold in SQL Soda Core Data quality validation. Sets quality gates between layers There\u0026rsquo;s a reason we chose DuckDB. It installs with a single pip install, yet works the same way as real cloud DWs. It reads Parquet and CSV natively, analyzes with SQL, and supports column-based scanning through columnar storage. Think of it as a miniature cloud DW running locally.\nEnvironment Setup Run the following in a Colab cell and you\u0026rsquo;re ready to go.\n# Install tools !pip install -q duckdb dbt-core dbt-duckdb import duckdb # Create DuckDB database conn = duckdb.connect(\u0026#39;warehouse.duckdb\u0026#39;) print(f\u0026#34;DuckDB {duckdb.__version__} ready\u0026#34;) Preparing Sample Data The sample data for this series is a simple e-commerce dataset. Three tables: orders, customers, and payments.
It\u0026rsquo;s the same domain as the structure we covered in DW Modeling Part 2\r.\n# Bronze layer: load raw data as-is conn.execute(\u0026#34;\u0026#34;\u0026#34; CREATE SCHEMA IF NOT EXISTS bronze; CREATE OR REPLACE TABLE bronze.orders AS SELECT * FROM read_csv_auto(\u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv\u0026#39;); CREATE OR REPLACE TABLE bronze.customers AS SELECT * FROM read_csv_auto(\u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv\u0026#39;); CREATE OR REPLACE TABLE bronze.payments AS SELECT * FROM read_csv_auto(\u0026#39;https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_payments.csv\u0026#39;); \u0026#34;\u0026#34;\u0026#34;) # Verify load conn.execute(\u0026#34;SELECT count(*) as cnt FROM bronze.orders\u0026#34;).fetchdf() Data loaded into Bronze. We read CSVs and put them into DuckDB \u0026ndash; nothing more. No type casting, no column renaming. That\u0026rsquo;s the Bronze principle.\n# Check Bronze data conn.execute(\u0026#34;SELECT * FROM bronze.orders LIMIT 5\u0026#34;).fetchdf() At this point, you\u0026rsquo;ll be tempted to run analytical queries directly. Resist. If you use Bronze data directly for analysis, you\u0026rsquo;ll end up in a data swamp within three months. The next post covers the process of promoting Bronze to Silver.\nWhy Go Through All This Trouble People ask whether separating layers makes things slower. After all, you use more storage and add more transformation steps.\nTrue. But you gain three things in return.\nReprocessing becomes possible. If there\u0026rsquo;s a bug in your Silver logic, just rebuild from Bronze. The original data is still there. Without Bronze, you\u0026rsquo;d have to pull from the source system all over again.\nProblem tracing works. If Gold numbers look wrong, check Silver. If Silver looks wrong, check Bronze.
You can pinpoint exactly which layer introduced the problem.\nResponsibilities are separated. Data engineers own Bronze-to-Silver. Analytics engineers own Silver-to-Gold. Neither team needs to touch the other\u0026rsquo;s territory.\nIn cloud environments, storage costs are practically negligible. The cost of maintaining one extra layer is far less than the cost of falling into a data swamp.\nPractical Reference: Airflow DAG This series uses Colab + dbt for hands-on exercises, but in production, Airflow handles pipeline scheduling and orchestration. Here\u0026rsquo;s what the Medallion Architecture\u0026rsquo;s Bronze → Silver → Gold flow looks like as an Airflow DAG.\nfrom airflow import DAG from airflow.operators.bash import BashOperator from datetime import datetime with DAG( dag_id=\u0026#39;medallion_pipeline\u0026#39;, schedule=\u0026#39;0 6 * * *\u0026#39;, # Daily at 6 AM start_date=datetime(2026, 1, 1), catchup=False, ) as dag: bronze = BashOperator( task_id=\u0026#39;load_bronze\u0026#39;, bash_command=\u0026#39;python scripts/load_bronze.py\u0026#39;, ) silver = BashOperator( task_id=\u0026#39;run_silver\u0026#39;, bash_command=\u0026#39;cd dbt_project \u0026amp;\u0026amp; dbt run --select staging\u0026#39;, ) gold = BashOperator( task_id=\u0026#39;run_gold\u0026#39;, bash_command=\u0026#39;cd dbt_project \u0026amp;\u0026amp; dbt run --select marts\u0026#39;, ) bronze \u0026gt;\u0026gt; silver \u0026gt;\u0026gt; gold bronze \u0026gt;\u0026gt; silver \u0026gt;\u0026gt; gold. The dependency chain reads in a single line. Bronze loading must finish before Silver runs, and Silver must finish before Gold runs. Airflow guarantees this order, sends alerts on failure, and handles reruns.\ndbt defines \u0026ldquo;what to transform.\u0026rdquo; Airflow defines \u0026ldquo;when and in what order to run it.\u0026rdquo; Different roles.\nThe next post dives into the Bronze layer in depth. 
The difference between Full Load and Incremental Load, and how to choose the right column for incremental extraction.\nPractice in Google Colab\r","permalink":"https://datanexus-kr.github.io/en/guides/etl-design/001-medallion-architecture/","summary":"Bronze, Silver, Gold. What changes when you load data into separate layers. We build it hands-on with DuckDB and dbt.","title":"1. Medallion Architecture - Why We Stack Data in Three Layers"},{"content":"NL2SQL accuracy hit 80% on a 30-question benchmark (previous post\r).\nMoving the same model to a public benchmark (BIRD Mini-Dev) produced completely different numbers. BIRD is the standard NL2SQL benchmark — 500 questions across 11 different database domains.\n80% → 56%\nIt worked well on the retail domain I trained on, but on an unseen domain it was barely half.\nI ran 9 experiments to push past that 56%. (Honestly, I thought it would come quickly.)\nThe one-line conclusion More SQL generation attempts wasn\u0026rsquo;t the answer. Getting it right on the first try was.\nWhy try the multi-candidate approach Instead of asking the LLM to generate SQL once, generate several and pick the best one.\nAlready the dominant direction in NL2SQL research.\nself-consistency (generate the same query multiple times, pick by majority vote) execution-based selection (filter candidates based on actual execution results) multi-agent pipelines (multiple specialized agents collaborate to generate SQL) Four of the top five systems on the BIRD leaderboard use this approach.\nBut the leaderboard wasn\u0026rsquo;t what drove this choice. Two assumptions did.\nSingle-shot LLM output has an accuracy ceiling. Splitting into exploration + selection can break through it. 1. More hints to the LLM (4 failures) Result: No effect on BIRD.
In some cases, the hint overrode a correct answer.\nGive the LLM better hints, and accuracy will improve.\nFor example:\nWhich aggregation function to use (AVG/SUM/COUNT)\nWhich column maps to \u0026ldquo;revenue\u0026rdquo;\nA list of candidate columns for the question\nOwn 30 questions: held steady\nBIRD: no change\nIn some cases it got worse. The LLM was already correctly choosing products.price, and the hint switched it to order_items.unit_price. The hint overrode the correct answer.\n60% of failures weren\u0026rsquo;t about the wrong column. They were about the wrong SQL pattern.\nFor example:\nSUM(CASE WHEN ...) COUNT(CASE WHEN ...) Both are valid SQL, but NULL handling differences produce different results.\nHints can fix column selection. They can\u0026rsquo;t change which SQL pattern the LLM prefers.\n2. Generate multiple candidates and pick one (3 failures) Generating more candidates didn\u0026rsquo;t change the outcome. The first SQL generated was always the one that got used.\nSettings:\nk=3 temperature=0.3 result-based selection 56% accuracy — no change\nMetric Value 3 candidates converging to the same result 92% First candidate selected 100% The selector never overrode the first candidate. Not once.\nThe \u0026ldquo;+8pp\u0026rdquo; that wasn\u0026rsquo;t real It looked like 48% → 56%, a +8pp gain.\nRe-grading the old baseline with the current grader: it was already 56%. The grader logic had drifted. That got a grader drift guard baked in as permanent infrastructure.\nRaising diversity temperature raised to 0.8\n5 prompt variations added\nSQL text varied\nExecution results: identical\n62% of questions had all 5 candidates returning the same result.\nThere is one correct answer.\nRaising the temperature and varying the prompts didn\u0026rsquo;t change the execution results.\n3. Force diversity at the system level (3 failures) Observation: Even with forced candidates, the selector can\u0026rsquo;t pick the right one. 
Execution results alone provide no signal.\nIf the LLM won\u0026rsquo;t produce diversity on its own, force it.\nExperiment 1: forced column binding Forcing the correct column → 60% pass Forcing the wrong column → 0% pass Schema binding determines accuracy.\nExperiment 2: selector validation Correct / obviously wrong / plausibly wrong / subtly different → Can the selector pick the right one from 4 candidates?\n28.6% — effectively random\nThe most telling case The actual value stored in the DB was Czech VYBER (cash withdrawal). The LLM had no way to know this value existed.\nAll 4 candidates used English in the WHERE clause (cash withdrawal). None of them matched anything in the DB.\nResult:\nAll 4 returned empty results (0 rows) The selector saw 4 identical results and called it \u0026ldquo;consensus\u0026rdquo; Wrong answer selected If the right answer never appears, consensus is meaningless.\nExperiment 3: score-based selector Instead of execution results alone, V2 scored each candidate on a mix of signals: value distribution, column count, row count, and more. The highest-scoring SQL got selected.\nV1: 40% V2: 33% More sophisticated made it worse.\nqid 819 shows why. V2 gave the correct SQL 55 points and the obviously wrong SQL 75 points. The wrong answer had the higher score, so the wrong answer won.\nExecution results alone cannot tell you which SQL is more correct.\nRuled out, remaining All three directions are closed:\nHint injection → failed LLM-driven multi-candidate → failed Execution-based selection → failed The common cause The problem isn\u0026rsquo;t the selection stage. It\u0026rsquo;s the stage before generation.\nNot \u0026ldquo;pick the best answer after execution.\u0026rdquo; The right tables and columns have to be locked in before execution even starts.\nThe new direction One direction remains.\nStrengthen schema understanding. 
Get it right on the first try.\nNext hypothesis: Schema Binding Plan Don\u0026rsquo;t generate SQL directly.\nFirst: output a JSON specifying tables / columns / join conditions System validates the plan Then generate SQL The third-attempt validation confirmed:\nWhen binding is forced, the LLM follows it 100% of the time.\nThe problem was never SQL generation. It was schema interpretation.\nWhat the 9 failures left behind The experiments failed. The infrastructure didn\u0026rsquo;t.\nGrader drift guard. Keeps past results comparable even as the grader logic evolves. Without it, this experiment would have been logged as a \u0026ldquo;+8pp success.\u0026rdquo;\nSignal classifier. Rates each question on a four-level scale (STRONG to MISLEADING): is there a detectable signal pointing to the right answer? Separates \u0026ldquo;the selector is weak\u0026rdquo; from \u0026ldquo;there was no signal to detect.\u0026rdquo;\nForced binding verification code. Automatically checks whether the SQL the LLM generated actually uses the column it was told to use (via the sqlglot SQL parser). Reusable as-is for the next schema grounding experiments.\nStop criteria / experimental design framework. Lock in \u0026ldquo;if this threshold isn\u0026rsquo;t met, stop\u0026rdquo; before running a benchmark, use small spot checks to make fast directional decisions.\nAnd one more thing.\nSOTA pointing one direction doesn\u0026rsquo;t mean it\u0026rsquo;s the right direction for my problem.\nAfter the experiments, I ran deep research sessions with Claude, ChatGPT, and Gemini. Of 10 suggestions, 8 directly conflicted with already-closed directions. Without the data from 9 experiments, I would have followed them.\nClosing In post 8 I wrote \u0026ldquo;80% is the start.\u0026rdquo;\nThat 80% was domain-specific. On an unseen domain, it\u0026rsquo;s 56%.\nThat\u0026rsquo;s the real starting point. 
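As a rough illustration of the Schema Binding Plan idea above, validating an LLM-emitted JSON plan before any SQL is generated might look like this. The schema, plan fields, and helper function are hypothetical, not the project's actual code:

```python
# Hypothetical sketch: the LLM emits a JSON binding plan (tables / columns /
# join conditions), and the system validates it against the known schema
# before SQL generation starts.
SCHEMA = {
    "customers": {"customer_id", "city", "is_active"},
    "orders": {"order_id", "customer_id", "order_date"},
}

def validate_plan(plan: dict) -> list[str]:
    """Return a list of violations; an empty list means the plan is usable."""
    errors = []
    for table in plan["tables"]:
        if table not in SCHEMA:
            errors.append(f"unknown table: {table}")
    for table, column in plan["columns"]:
        if column not in SCHEMA.get(table, set()):
            errors.append(f"unknown column: {table}.{column}")
    for left, right in plan.get("joins", []):
        for ref in (left, right):
            table, _, column = ref.partition(".")
            if column not in SCHEMA.get(table, set()):
                errors.append(f"unknown join key: {ref}")
    return errors

# A plan the LLM might emit for "orders per city"
plan = {
    "tables": ["customers", "orders"],
    "columns": [("customers", "city"), ("orders", "order_id")],
    "joins": [("customers.customer_id", "orders.customer_id")],
}
print(validate_plan(plan))                                 # → []
print(validate_plan({"tables": ["payments"], "columns": []}))  # → ['unknown table: payments']
```

Only a plan that validates cleanly would be handed to the SQL generation step.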
The next post reports where that number moves once Schema Binding Plan is in.\n","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/009-multi-candidate-seal/","summary":"I hit 80% on my own 30-question benchmark, but only 56% on BIRD Mini-Dev\u0026rsquo;s 50 public questions. Nine experiments later, I had ruled out the multi-candidate hypothesis from three different angles. What\u0026rsquo;s left is schema understanding and methodology.","title":"9. The Public Benchmark Returned 56%: Nine Experiments and What Got Ruled Out"},{"content":"\rGEO Optimization Guide — Full Series\n1. What Is GEO - AI Citation Strategy Beyond SEO\r2. Each AI Cites Different Sources\r3. On-Site GEO Technical Architecture - From Product DB to JSON-LD\r4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site\r5. AEO - Why Coding Agents Read Documentation Differently ← current post\rAI Does Not Read Just One Kind of Document Through Part 4\r, we covered On-Site and Off-Site GEO — JSON-LD on your official site, external directories, community channels. The goal was to make your brand appear as a cited source when consumers ask something on ChatGPT or Perplexity.\nBut that is not the only kind of document AI reads.\nWhen a developer tells Claude Code or Cursor to \u0026ldquo;integrate this API,\u0026rdquo; the agent crawls the API docs on its own. The way an agent processes those documents is fundamentally different from how a person browses a page.\nThis is called AEO (Agentic Engine Optimization).
The concept was formalized recently\r, so there is almost no discussion of it in Korea yet — but for anyone doing GEO, it is worth understanding.\nHow AEO Differs From GEO The same AI calls for different optimization depending on who is using it.\nItem GEO AEO Target ChatGPT, Perplexity, Gemini Claude Code, Cursor, Cline, Aider Consumer Person asking questions (indirect) Code-writing agent (direct) Content type Product pages, brand information API docs, developer portals Key format JSON-LD, Schema.org Markdown, llms.txt, skill.md Metric Citation rate Token efficiency, parse success rate Failure mode Not appearing in answers Agent making wrong API calls There are overlapping principles. SSR-based serving, robots.txt review, structured content — all three need to be in place for both. Organizations that have properly set up GEO face a lower barrier to entering AEO.\nGEO asks \u0026ldquo;is the content cited in answers?\u0026rdquo; AEO asks \u0026ldquo;is the agent using the API correctly?\u0026rdquo; When the latter fails, broken code ships without anyone noticing.\nAgents Do Not Read Documentation Like Humans When a person lands on a developer portal, they scan the menu, click Getting Started, try running the sample code, and follow a few related links over 4–8 minutes. All of this behavior gets recorded in analytics.\nAn agent fetches the page in one or two HTTP GET requests, parses it, and moves on. No scrolling, no clicking. In GA, it shows up as a single request with a 400ms session duration.\nYou can identify agents from server logs via User-Agent.\nAgent User-Agent Claude Code axios/1.8.4 Cursor got (sindresorhus/got) Cline, Junie curl/8.4.0 Windsurf colly Aider, OpenCode Headless Chromium (Playwright) A significant share of what previously showed up as \u0026ldquo;unknown crawlers\u0026rdquo; may well be these agents.\nAgents Don\u0026rsquo;t Read Long Documents to the End Agents have context limits. 
Claude and GPT-class models typically sit between 100K and 200K tokens. When a single document approaches or exceeds that window, the agent quietly does one of the following.\nTruncates the tail. If the important content was near the end, the answer is wrong Moves to a shorter alternative document Spends time chunking, adding latency, and producing errors Gives up and answers from its trained knowledge — which is hallucination Some API reference documents exceed 100K tokens. At that scale, a single document can consume the agent\u0026rsquo;s entire context window by itself.\nDocument length therefore becomes a metric. Recommended guidelines:\nContent type Token target Quick Start / Getting Started Under 15K Individual API reference Under 25K Concept guide Under 20K (link out for details) Full API reference Split by resource or endpoint GEO has no such constraint. Consumer-facing AI just extracts the snippets it needs. AEO requires the entire document to fit into context, so length needs to be designed intentionally.\nFour Files to Start With No complex technology required — four files are enough to get started.\nFirst, robots.txt. Same file covered in Part 4, but now coding agent User-Agents also need to be considered. An overly broad block will prevent agents from reading your docs at all.\nNext, llms.txt. An agent-facing sitemap in Markdown, served at /llms.txt. Rather than page titles, it describes what a reader will learn at each URL. 
Including token counts per page lets agents decide upfront whether to read a given document.\n# MyService Documentation ## Getting Started - [Quick Start](/docs/quickstart): First API call in 5 minutes (8K tokens) - [Authentication](/docs/auth): OAuth 2.0 and API Key auth (12K tokens) ## API Reference - [Users API](/docs/api/users): User CRUD operations (12K tokens) - [Events API](/docs/api/events): Event streaming and webhooks (8K tokens) skill.md is a file that declaratively states what a service \u0026ldquo;can do.\u0026rdquo; It lets an agent understand your capabilities without reading through long documentation. A basic structure has four sections: What I can accomplish, Required inputs, Constraints, and Key documentation.\nFinally, AGENTS.md. A README variant for coding agents, placed in the project root. It is typically the first file a coding agent looks for when opening a project. Many open-source projects are already adopting it as a standard starting point.\nWho in Korea Should Actually Care About AEO Honestly, most Korean companies are not the primary audience for AEO right now. Developer portals are not that common. For e-commerce, distribution, finance, and service businesses, well-executed consumer GEO will deliver far greater returns.\nAEO is worth taking seriously in these cases:\nCompanies with public APIs: Payment gateways, logistics, maps, authentication, and data API providers. Developers using Cursor to write integration code is already the default Large corporations running internal dev platforms: Shared group APIs, auth gateways, and internal data platform docs are natural targets Open source and developer tools: If you have a public GitHub project, AGENTS.md is close to mandatory MCP server providers: skill.md maps directly to the expected convention When we tried connecting an internal data platform to agents, the agents frequently misread internal wikis and metadata, producing garbled answers. 
Applying a few AEO principles made a measurable difference. Serving docs as Markdown, adding token counts to pages, and stripping navigation noise was enough to produce a noticeable improvement in parse quality.\nFour files will not solve everything. Organization-specific terminology and department-level tacit context remain. Those are a different class of problem that file design alone cannot fix.\nThe Incremental Cost Is Low If You Have GEO in Place If GEO is already built, the implementation cost is smaller than it looks. A lot of the work overlaps.\nReview robots.txt through an AI crawler lens (GPTBot, ClaudeBot, PerplexityBot, plus coding agent User-Agents) Calculate token counts per documentation page (approximate as character count ÷ 4) Draft /llms.txt with your key document list and token counts Write skill.md for your top 3–5 APIs Add AGENTS.md to internal GitHub repositories Serve developer docs as Markdown — e.g., returning raw Markdown when .md is appended to the URL Segment coding agent traffic in server logs to establish a baseline Reviewing robots.txt and drafting llms.txt can be done in half a day. skill.md and AGENTS.md, starting with just a handful of top APIs, are not a heavy lift either.\n","permalink":"https://datanexus-kr.github.io/en/guides/geo-optimization/005-what-is-aeo/","summary":"If GEO optimizes for consumer AI, AEO optimizes for coding agents. This article covers document length constraints, llms.txt, skill.md, and AGENTS.md — the files that matter.","title":"5. AEO - Why Coding Agents Read Documentation Differently"},{"content":"When Claude Code performance gets inconsistent, the instinct is to add more tooling. That\u0026rsquo;s backwards. Strip things down, and output quality climbs.\nRemove the Skill Sets Stacking external skill sets (Superpowers-type plugins and the like) bloats the system prompt with noise. The model spends resources figuring out how to operate tools instead of reading and reasoning over code. 
Vanilla Claude is noticeably smarter — you feel it within a session or two.\nForce Full Reasoning with Three Settings Add the following to ~/.claude/settings.json:\n{ \u0026#34;effortLevel\u0026#34;: \u0026#34;max\u0026#34;, \u0026#34;env\u0026#34;: { \u0026#34;CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING\u0026#34;: \u0026#34;1\u0026#34;, \u0026#34;CLAUDE_CODE_DISABLE_1M_CONTEXT\u0026#34;: \u0026#34;1\u0026#34; } } effortLevel: max forces maximum reasoning every turn. Opus 4.6 only.\nDISABLE_ADAPTIVE_THINKING blocks the model from deciding \u0026ldquo;this seems simple, I\u0026rsquo;ll skip the thinking.\u0026rdquo; Leave it enabled and hallucinations — fake SHAs, non-existent API versions — become a recurring problem.\nDISABLE_1M_CONTEXT caps the context at 200k, which keeps reasoning focused. Without this, token consumption can run 5–7x higher than necessary.\nWhat Stays Relevant as Models Improve The ability to transmit clear requirements logically and structurally doesn\u0026rsquo;t go away. That\u0026rsquo;s not a prompting skill — it\u0026rsquo;s a thinking problem. No matter how capable models get, this part stays on the human side.\nWhere to Put the Energy Obsessing over prompt engineering or assembling skill sets is spending energy on what models are already handling. The real leverage is elsewhere: memory systems, ontologies, pipeline architecture — the structural layer that sits above the model. That\u0026rsquo;s where investment compounds.\nStacking external skill sets bloats the system prompt and degrades the model\u0026rsquo;s native reasoning. effortLevel: max, disabling Adaptive Thinking, and limiting the context window directly reduce hallucinations and token costs. System architecture and data structure design deliver more leverage than prompt engineering. 
","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-15-claude-code-performance-optimization-settings/","summary":"Keeping Claude Code minimal, forcing full reasoning with three settings.json lines, and investing in system architecture — a practical take on what actually moves the needle.","title":"What Actually Improves Claude Code Performance: Configuration and Architecture"},{"content":"I implemented the router (previous post\r) and ran a 30-question benchmark. EX (Execution Accuracy): 66.67%, 20 out of 30. Four measure-and-fix loops later: 80%.\nThe method was simple. Run 30 questions, classify the failures, fix a few high-impact ones. Then run it again. I did this four times.\nStarting point: every query goes through the LLM At Phase 0.5, Vanna was trained on the DDL and a Glossary YAML with 25 business terms. The PoC DB was an e-commerce sample: 21 tables, around 230k rows. I built a 30-question test set against it and measured EX for the first time: 66.67%. 20 out of 30.\nThe problem was this. Easy queries and hard queries all went through the LLM. \u0026ldquo;What\u0026rsquo;s this month\u0026rsquo;s total revenue?\u0026rdquo; (trivial aggregation) and \u0026ldquo;Give me quarterly YoY growth\u0026rdquo; (complex) were both generated end-to-end by Vanna. LLM-generated SQL isn\u0026rsquo;t consistent. The same question produces slightly different results each time. There\u0026rsquo;s no reason to burn the LLM on queries whose patterns are obvious.\nSo I built a QueryRouter.\nCycle 1: QueryRouter QueryRouter classifies queries into three categories.\nDETERMINISTIC: queries with fixed patterns. \u0026ldquo;Sum of total revenue\u0026rdquo;, \u0026ldquo;order count by category\u0026rdquo; — things a SQL template can solve. No LLM, just generate SQL directly HYBRID: queries that include a glossary term but have complex conditions. 
Rules draft the SQL, the LLM validates it PROBABILISTIC: free-form queries that don\u0026rsquo;t match the above. Hand it to Vanna It\u0026rsquo;s an MVP, so the classifier isn\u0026rsquo;t fancy. Keyword matching and regex. If keywords like \u0026#34;sum\u0026#34;, \u0026#34;total\u0026#34;, \u0026#34;average\u0026#34; appear, it becomes a DETERMINISTIC candidate, and the router checks whether a registered glossary term is in the query.\nRan 30 questions. EX 70%. +3.33%p. Looks tiny on the number line, but there were two meaningful changes underneath.\nOne is the synonym recognition rate, which jumped from 33% to 67%. When someone asked for \u0026#34;Revenue\u0026#34; instead of \u0026#34;매출\u0026#34;, the previous system would miss it. The Router picks it up from the glossary\u0026rsquo;s synonym list.\nThe other is P95 latency (the slowest 5% of queries), which dropped from 26 seconds to 3.3 seconds, an 87% cut. The DETERMINISTIC path skips the LLM and runs the SQL template directly. Cheaper and faster.\nBut an unexpected problem showed up here.\nThe \u0026ldquo;top 10\u0026rdquo; trap: Fake Determinism 20 queries got routed as DETERMINISTIC, and 7 of them were wrong. 35%.\nA representative case: \u0026ldquo;Top 10 customers by coupon usage.\u0026rdquo; The router saw \u0026ldquo;top\u0026rdquo; (상위) and dispatched it to the HIERARCHY_ANCESTORS template — the template that walks up a hierarchy. But the actual intent was ORDER BY coupon_count DESC LIMIT 10. It\u0026rsquo;s a ranking query, not a hierarchy walk.\nThe word \u0026ldquo;top\u0026rdquo; (상위 in Korean) has two meanings. \u0026ldquo;Top category\u0026rdquo; (상위 카테고리) refers to an upper node in a hierarchy. \u0026ldquo;Top 10 customers\u0026rdquo; (상위 10명) refers to a ranking. Keyword matching can\u0026rsquo;t distinguish these.\nThe dangerous part is that this is worse than LLM fallback.
An LLM reads the context and has a chance of saying \u0026ldquo;ah, this is a ranking\u0026rdquo; and generating the correct SQL. But once it\u0026rsquo;s routed as DETERMINISTIC, the template executes immediately and the wrong answer goes out unverified. It\u0026rsquo;s being deterministically wrong. I shared the result with ChatGPT and it named the pattern \u0026ldquo;Fake Determinism.\u0026rdquo; Fitting name.\nCycle 2: top-N patch + sanity check I shipped three patches together.\nFirst, top-N patterns got a regex exclusion. If a ranking expression matches first, the query no longer goes to HIERARCHY; it falls through to HYBRID or PROBABILISTIC.\nTOP_N_PATTERNS = [\n    r\u0026#34;상위\\s*\\d+\u0026#34;,       # \u0026#34;top 10\u0026#34; in Korean\n    r\u0026#34;top\\s*\\d+\u0026#34;,        # \u0026#34;top 5\u0026#34;\n    r\u0026#34;가장\\s*(많|높|큰)\u0026#34;,  # \u0026#34;most / highest / largest\u0026#34;\n]\nThat simple rule alone brought the Fake Det Rate from 35% down to 20%.\nNext, a sanity check. After the DETERMINISTIC path runs, if the result looks abnormal we retry once via LLM fallback. \u0026ldquo;Abnormal\u0026rdquo; has two criteria: zero rows, or a NULL ratio above 70%.\nif det_result.row_count == 0 or det_result.null_ratio \u0026gt; 0.7:\n    return await self._execute_probabilistic(query)  # one retry only\nNo infinite recursion, so we fall back exactly once. Whatever PROBABILISTIC returns, right or wrong, is returned as-is.\nThen few-shot exemplar reinforcement. Classifying the failures by error type gave me 6 wrong_mapping (table/column mapping errors) and 2 wrong_formula (calculation errors). I added SQL examples for each type into Vanna\u0026rsquo;s training data: DATE_TRUNC patterns for \u0026ldquo;this month\u0026rsquo;s revenue,\u0026rdquo; the net-revenue formula that subtracts discounts and returns, and so on.\nResult: EX 76.67%. +6.67%p. HYBRID dispatches went from 1 to 6. 
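Put together, the routing logic from Cycle 1 plus the Cycle 2 top-N exclusion amounts to something like this. A sketch only: TOP_N_PATTERNS comes from the post, but the keyword and glossary lists are made-up stand-ins, and for brevity top-N matches go straight to PROBABILISTIC rather than falling through via HYBRID.

```python
import re

# TOP_N_PATTERNS is from the post; everything else is a hypothetical
# stand-in to show the routing shape, not the actual implementation.
TOP_N_PATTERNS = [
    r"상위\s*\d+",        # "top 10" in Korean
    r"top\s*\d+",         # "top 5"
    r"가장\s*(많|높|큰)",   # "most / highest / largest"
]
AGG_KEYWORDS = ["sum", "total", "average", "count"]        # hypothetical
GLOSSARY_TERMS = ["revenue", "net revenue", "churn rate"]  # hypothetical

def classify(query: str) -> str:
    q = query.lower()
    # Cycle 2 fix: ranking expressions are excluded before any template match
    if any(re.search(p, q) for p in TOP_N_PATTERNS):
        return "PROBABILISTIC"  # simplified; the real router lets these fall to HYBRID or PROBABILISTIC
    has_agg = any(k in q for k in AGG_KEYWORDS)
    has_term = any(t in q for t in GLOSSARY_TERMS)
    if has_agg and has_term:
        return "DETERMINISTIC"   # fixed pattern: SQL template, no LLM
    if has_term:
        return "HYBRID"          # rules draft the SQL, the LLM validates
    return "PROBABILISTIC"       # free-form: hand it to Vanna

print(classify("total revenue this month"))          # DETERMINISTIC
print(classify("top 10 customers by coupon usage"))  # PROBABILISTIC
```

The order matters: the top-N check runs before the DETERMINISTIC check, which is exactly what keeps "top 10 customers" out of the hierarchy template.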
The router was moving in the \u0026ldquo;when in doubt, send it to the LLM\u0026rdquo; direction, which is what I wanted.\nCycle 3: the is_active trap, and why gold SQL can also be wrong One pattern kept showing up in wrong_mapping. When querying the customers table, the generated SQL was missing is_active=true. Once inactive customers (churned, dormant) are included, aggregates drift. Adding more few-shot examples didn\u0026rsquo;t fix it.\nThis isn\u0026rsquo;t \u0026ldquo;knowledge\u0026rdquo; — it\u0026rsquo;s \u0026ldquo;policy.\u0026rdquo; \u0026ldquo;We only count active customers\u0026rdquo; is a business rule, not something the LLM should be inferring from context. So I wrote a hard rule using sqlparse: parse the SQL, and if a table with an is_active column appears in the query, auto-inject the WHERE condition.\nNot a string replacement — AST-level (Abstract Syntax Tree, the parsed grammar tree of the SQL) handling. Subqueries are left alone, and you don\u0026rsquo;t end up with conditions pinned after GROUP BY by accident. The rule runs right after SQL generation and right before execution, across all DET/HYBRID/PROB paths.\nEnabled the is_active rule, ran the benchmark. EX dropped to 56.67%.\nI thought I misread the numbers at first. 8 cases flipped from PASS to FAIL, all for the same reason: the gold SQL didn\u0026rsquo;t have is_active=true. The system was injecting the filter according to the business policy, but the reference answers were written without that policy, so strict comparison flagged them as mismatches.\nConfirmed something here: the evaluation criterion itself can be wrong. The BIRD benchmark has similar cases — strict scoring agrees with human expert judgment only 62% of the time, and the remaining 38% are mostly false negatives where correct SQL gets marked wrong.\nCycle 3.1: fixing the gold SQL → 80% The fix was simple. Added is_active=true to the 8 gold SQL statements. 
Aligned the policy between the system output and the reference.\nRe-measured: EX 80.00%. Hit the 80% target.\nSummary from Phase 0.5 to here:\nVersion | EX | Key change | Failures\nPhase 0.5 | 66.67% | Vanna + Glossary RAG | 10/30\nv1 | 70.00% | QueryRouter added | 9/30\nv2 | 76.67% | top-N fix + sanity check + few-shot | 7/30\nv3.1 | 80.00% | is_active hard rule + gold alignment | 6/30\n+13.33%p cumulative. Easy difficulty is at 100%, Fake Det Rate dropped from 35% to 13.3%. Hard difficulty sits unchanged at 37.5% — that\u0026rsquo;s a formula/JOIN complexity issue, an area that needs DataHub synonym expansion and more exemplars.\nWhat I learned Don\u0026rsquo;t leave business policy to the LLM. I tried teaching the is_active filter via few-shot multiple times. It didn\u0026rsquo;t stick. It\u0026rsquo;s not a probabilistic problem. A single sqlparse-based hard rule beat stacking more few-shot examples. \u0026ldquo;When querying customers, is_active=true is mandatory\u0026rdquo; is policy, not knowledge, and policy should be enforced by the system.\nEvaluation criteria need design too. If the gold SQL doesn\u0026rsquo;t reflect business policy, a correct improvement can lower the score. When building the test set, I needed a step to check \u0026ldquo;does the gold SQL reflect our system\u0026rsquo;s policy?\u0026rdquo; Only 8 cases this time, but with hundreds of test items, catching this kind of mismatch later becomes much harder.\nChange one thing at a time, then measure. Every time — adding the router, adding is_active — I ran the same 30 questions and looked at the delta. Without this I couldn\u0026rsquo;t have told which change improved things and which broke them. Especially for is_active: if I hadn\u0026rsquo;t isolated the variable, I might have concluded \u0026ldquo;EX dropped, so the is_active rule is wrong.\u0026rdquo; The rule was actually fine; the gold SQL was the issue. 
Freezing a baseline and moving one variable at a time feels tedious, but it\u0026rsquo;s what makes later analysis possible.\nFake Determinism is a structural limit of rule-based classifiers. Keyword matching can\u0026rsquo;t distinguish whether \u0026ldquo;top\u0026rdquo; means hierarchy or ranking. I covered it with an exception like top-N exclusion, but once these exceptions pile up, the rules get messy and unmaintainable. I may have to revisit an LLM classifier in Phase 2. Parked for now.\nError buckets show the direction. If you leave wrong queries as \u0026ldquo;wrong,\u0026rdquo; there\u0026rsquo;s no way to know what to fix. Classifying them as wrong_mapping, wrong_formula, wrong_aggregation, hallucination made the next action obvious: \u0026ldquo;6 wrong_mapping, so reinforce mapping exemplars; 2 wrong_formula, so write the formula into the glossary.\u0026rdquo; Without buckets, \u0026ldquo;raise accuracy\u0026rdquo; gives you nowhere to start.\n","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/008-pdca-ex-80/","summary":"After wiring up the router, I ran a 30-question benchmark and pushed NL2SQL EX (Execution Accuracy) from 66.67% to 80%. Here\u0026rsquo;s what I fixed across four cycles and where things broke.","title":"8. From 66% to 80% NL2SQL Accuracy: Four Measure-and-Fix Loops"},{"content":"The term engine design is done. \u0026ldquo;VIP customer\u0026rdquo; has a definition. \u0026ldquo;Net revenue\u0026rdquo; has a formula.\nThen I got stuck.\nA user types \u0026ldquo;Show me last month\u0026rsquo;s VIP customer revenue.\u0026rdquo; What\u0026rsquo;s supposed to happen inside the system?\nFirst, Decide Where the Answer Comes From There are multiple sources that can produce an answer. You can traverse the graph. 
You can write SQL to pull from the DW, or just run a vector search against past queries.\nThe answer changes depending on where you send it.\nSend \u0026ldquo;VIP customer revenue\u0026rdquo; to SQL and Vanna assembles tables to produce a number. Send it to the graph and it pulls term definitions and relationships first. You might think: just send it to all three and merge. Try it\u0026hellip; they conflict immediately.\nSame word \u0026ldquo;churn rate,\u0026rdquo; but the marketing report number and the CRM dashboard number don\u0026rsquo;t match. A human reads the context and picks the right one. An agent just sees two different numbers.\nWhen a Question Comes In, Who Moves First? When a question comes in, the Router catches it first. It reads term_type from the term metadata \u0026ndash; metric goes to SQL, concept goes to the graph. If it routes wrong here, everything downstream falls apart. Every term already has a type and linked columns nailed down. So the Router just reads and branches. Done.\n\u0026hellip;what happens without this? You end up asking the model \u0026ldquo;Is this SQL or graph?\u0026rdquo; every time. Try it a few times and you\u0026rsquo;ll feel it immediately. It doesn\u0026rsquo;t hold up.\nThe real headache is the Supervisor.\nWhen answers come back from multiple sources with different numbers, someone has to decide whose answer wins. If you don\u0026rsquo;t force a single standard, it turns into a fight every time. Marketing says \u0026ldquo;churn rate 12%.\u0026rdquo; CRM says \u0026ldquo;churn rate 8%.\u0026rdquo; Different formulas, neither wrong.\nSo I just forced a priority order. Called it HoT \u0026ndash; Hierarchy of Truth. The term engine\u0026rsquo;s standard definition wins, SQL execution results come next, vector search ranks last. 
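A minimal sketch of what that priority resolution could look like. The Answer shape and source labels here are hypothetical illustrations, not the actual DataNexus design.

```python
from dataclasses import dataclass

# Hierarchy of Truth (HoT): the term engine's standard definition wins,
# SQL execution results come next, vector search ranks last.
HOT_PRIORITY = ["term_engine", "sql", "vector"]

@dataclass
class Answer:
    source: str  # which sub-system produced it (hypothetical labels)
    value: str

def resolve(answers: list[Answer]) -> Answer:
    # Pick one winner by fixed priority instead of asking the model to arbitrate
    return min(answers, key=lambda a: HOT_PRIORITY.index(a.source))

conflict = [Answer("sql", "churn rate 12%"),
            Answer("term_engine", "churn rate 8%")]
print(resolve(conflict).value)  # churn rate 8%
```

The resolution is a lookup, not a judgment call, so conflicting numbers never reach the model as an open question.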
Without this, the Supervisor has to ask the model \u0026ldquo;which one\u0026rsquo;s right?\u0026rdquo; every time there\u0026rsquo;s a conflict.\nThe Graph DBA validates the schema before any Cypher query fires \u0026ndash; if someone writes a query with an unregistered term, it gets blocked before execution.\nWhat I Still Don\u0026rsquo;t Know The thing I\u0026rsquo;m most uneasy about is compound questions.\n\u0026ldquo;Analyze VIP customer purchase patterns over the last 3 months\u0026rdquo; \u0026ndash; that needs both the graph and SQL. The Router reading term_type and routing should handle most single questions fine, but when something like this comes in, it needs to be split. Where to cut, what order to send. I\u0026rsquo;ve sketched it out in the design, but I won\u0026rsquo;t really know until I run it.\nHoT is also on my mind. What if the definition is six months old? I\u0026rsquo;ve added staleness detection, but that\u0026rsquo;s a problem for later.\nThe Graph DBA is the opposite worry \u0026ndash; blocking too aggressively.\nNext: Running It Against a Real DB After finishing the sub-module design, I\u0026rsquo;m running an A/B experiment on the sql-tutorial DB with 21 tables. EX (Execution Accuracy) +15%p needs to land. If it doesn\u0026rsquo;t, the design gets torn up. Push this on intuition and it\u0026rsquo;ll blow up later.\nIf the numbers come in, I\u0026rsquo;ll write about it.\nDocumenting the process of designing and building DataNexus. GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/007-multi-agent-router/","summary":"The term definitions are done. But when a user asks a question, who decides whether to search the graph, write SQL, or run a vector search? Things I ran into while designing the router.","title":"7. When a Question Comes In, Who Decides the Routing?"},{"content":"The Conway leak was barely old news when Anthropic dropped the real thing. 
Claude Managed Agents.\nThere\u0026rsquo;s a list of things that always tagged along when building an agent. Infrastructure, state management, permissions, orchestration. All of it, now bundled into a managed runtime. What took months from prototype to production now takes days, they say (Managed Agents Official Announcement).\nExactly the direction the Conway leak predicted. But I didn\u0026rsquo;t expect it this fast, or this polished.\nWhat Shipped A structure that separates sessions, harnesses, and sandboxes.\nPreviously, all three were crammed into one container. A crash wiped session data. Debugging meant accessing a container holding user data.\nThey split it apart.\nAnthropic\u0026rsquo;s engineering blog called this \u0026ldquo;decoupling the brain from the hands,\u0026rdquo; and the metaphor is accurate (Scaling Managed Agents: Decoupling the brain from the hands). Reasoning (brain) and execution (hands) can now scale independently.\nAnd the name is Agents. Plural.\nMulti-agent orchestration is a built-in assumption. Hand a complex task to one agent and it can spawn sub-agents. Notion, Asana, and Sentry are already running this in production.\nPricing is simple. Token costs plus per-session-hour billing. Idle time isn\u0026rsquo;t charged (Managed Agents Introduction Video).\nWith this kind of structure, the experiment-to-deployment cycle should get quite a bit faster.\nThe Shelf Life of a Harness An interesting point comes up in the engineering blog (Scaling Managed Agents (Engineering Blog)).\nA harness is essentially \u0026ldquo;code built on the assumption that the model can\u0026rsquo;t do X.\u0026rdquo; When the model improves, that assumption collapses.\nCase in point: they solved context instability issues with a harness workaround on Sonnet 4.5. On Opus 4.5, the problem simply vanished.\nThat breaks the assumption that harness engineering is a competitive advantage.\nThere was a time when it was. 
It\u0026rsquo;s hard to assume that will keep holding.\nHence the talk of a \u0026ldquo;meta-harness.\u0026rdquo; A structure where internals change but the interface stays stable.\nBut at this point, the thinking shifts.\nIs being good at writing agent loops itself a durable advantage? If you have to rework the harness every time the model levels up, the ROI on that investment drops.\nAt least right now, the domain side has a better chance of lasting.\nWhy DataNexus Is Safe There\u0026rsquo;s a problem that keeps showing up.\nThe same word \u0026ldquo;churn rate\u0026rdquo; means different things to different teams. Table names like T_CUST_MST are confusing even to humans.\nAn LLM guessing this correctly is even harder.\nDDL doesn\u0026rsquo;t carry this context.\nTaking the definitions that only lived in people\u0026rsquo;s heads and turning them into an ontology. That\u0026rsquo;s where it goes.\nThen feeding it to the model alongside the query.\nAttach it once and the SQL comes out completely different.\nBut these definitions aren\u0026rsquo;t publicly available. You can\u0026rsquo;t scrape it.\nTake \u0026ldquo;active customer\u0026rdquo; alone \u0026ndash; marketing says 30 days, CRM says 90 days, and some team defines it as \u0026ldquo;at least one login.\u0026rdquo;\nIt\u0026rsquo;s mostly rules that were set internally. 
So it\u0026rsquo;s not the kind of thing a runtime solves.\nOne More Distribution Channel I didn\u0026rsquo;t see this as a threat.\nIt felt more like gaining one more distribution channel.\nThink about running DataNexus\u0026rsquo;s ontology engine on top of Managed Agents \u0026ndash; there\u0026rsquo;s less reason to carry your own infrastructure.\nMCP wrapping was already in the plan, and it connects naturally with this structure.\nGeneral-purpose AI is, in the end, general-purpose.\nTo explain why a number looks wrong inside a specific company\u0026rsquo;s DW, you need the metric formulas and team-specific interpretation rules that live inside.\nThat\u0026rsquo;s the ontology side.\nThe Wall of Regulated Industries Law, healthcare, accounting, manufacturing.\nThe barriers here are high. Not because of data, but because of the logic inside.\nThink about hospital data \u0026ndash; it\u0026rsquo;s nearly impossible to construct real logic from externally accessible information alone.\nA single insurance claim has layers upon layers of internal rules.\nFor a global AI company, digging into all of this doesn\u0026rsquo;t pencil out.\nDW/BI is similar.\nThousands of tables, abbreviated columns, definitions that differ by team.\nIt\u0026rsquo;s not just a data problem \u0026ndash; it\u0026rsquo;s the kind where interpretation keeps piling on.\nInfrastructure Gets Caught Up Fast The speed is what\u0026rsquo;s interesting.\nHours after the Managed Agents announcement, an open-source framework replicating the core functionality appeared (Multica).\nIt always goes like this. One shows up and clones follow fast.\nSomeone builds infrastructure, and almost immediately someone else builds something similar.\nThis layer keeps converging. Doesn\u0026rsquo;t matter much whose version you use.\nDirection Check Infrastructure moves up, and what\u0026rsquo;s left below is data and definitions.\nNext up is MCP wrapping.\nDocumenting the process of designing and building DataNexus. 
GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/006-managed-agents-and-ontology/","summary":"Shortly after the Conway leak, Anthropic officially launched Claude Managed Agents. As agent infrastructure gets absorbed into platforms, here\u0026rsquo;s why DataNexus\u0026rsquo;s ontology layer remains safe.","title":"6. When You Don't Have to Build Agent Infra Yourself, Harnesses Become Obsolete. What About the Ontology?"},{"content":" 32 shortcut hacks for faster Claude prompts! These commands work without any custom definitions because Claude, ChatGPT, and Gemini naturally recognize them. Why? They\u0026rsquo;ve appeared hundreds of millions of times in training data. Thanks to instruction tuning and structured prompt design, these patterns just work. — @lucas_flatwhite (original: @rubenhassid)\nWriting longer prompts does not guarantee better results. A single well-designed command often outperforms several paragraphs of instructions. These shortcuts work immediately because LLMs have already internalized these patterns through instruction tuning \u0026ndash; no prompt engineering required.\nOutput Control Commands The most frequently used group in practice controls output format. /ELI5 simplifies complex concepts for non-experts. /TLDR condenses lengthy reports. /EXEC SUMMARY produces executive-level summaries. /CHECKLIST and /STEP-BY-STEP transform the same information into actionable formats. /FORMAT AS forces specific output structures like tables, JSON, or markdown.\nThe key insight is that changing the output format alone completely transforms the utility of a response. Summarize meeting notes with /TLDR, then convert with /CHECKLIST, and you have an immediately assignable task list.\nMetacognitive Commands for Deeper Thinking The most interesting group controls the AI\u0026rsquo;s reasoning process itself. /CHAIN OF THOUGHT reveals intermediate reasoning steps. 
/FIRST PRINCIPLES strips away assumptions and rebuilds from fundamentals. /DELIBERATE THINKING suppresses hasty answers. /NO AUTOPILOT prevents generic pattern repetition.\n/REFLECTIVE MODE and /EVAL-SELF make the model critique its own output. Add /SYSTEMATIC BIAS CHECK and you get bias detection on top. To use AI as a genuine thinking partner rather than a simple Q\u0026amp;A tool, combine these commands deliberately.\nRole and Perspective Switching /ACT AS is the most widely known role assignment pattern. /DEV MODE and /PM MODE instantly shift to profession-specific viewpoints. /MULTI-PERSPECTIVE and /PARALLEL LENSES illuminate a single issue from multiple angles simultaneously. /SWOT and /COMPARE apply decision-making frameworks directly.\nCombining /AUDIENCE with /TONE lets you produce both a beginner-friendly explanation and a technical specification from the same input. /GUARDRAIL explicitly constrains response boundaries, preventing the model from drifting off topic.\nNo need to memorize these upfront \u0026ndash; try them once in context and they become second nature. This is the most practical approach to cutting prompt authoring time while raising output quality.\nKey takeaways\nSlash command patterns already learned through LLM instruction tuning work instantly without custom definitions Metacognitive commands like /CHAIN OF THOUGHT, /FIRST PRINCIPLES, and /NO AUTOPILOT let you control the quality of AI reasoning itself Chaining output format switches (/TLDR → /CHECKLIST) with role switches (/DEV MODE, /PM MODE) generates diverse deliverables from a single input Source\nhttps://x.com/i/status/2041125496755470589\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-07-llm-prompt-slash-command-shortcuts/","summary":"32 slash commands that work out of the box with Claude, ChatGPT, and Gemini \u0026ndash; no custom definitions needed. 
Categorized by use case with practical combination strategies.","title":"32 Slash Command Shortcuts That LLMs Instantly Understand"},{"content":"When working with brokerage APIs, there are moments when running actual code matters far more than reading documentation. A major Korean securities firm released an official GitHub repository with sample code that fills that gap with remarkably high completeness.\nStructural Design for AI Agents The repository goes beyond simple API calls and provides a structure well-suited for AI agents to use as tools. Directories are split by function, helping external AI models discover and execute specific functions. MCP server support is a notable addition, reflecting the latest trends in AI integration.\nThe traditional approach required users to read hundreds of pages of documentation and implement logic manually. This repository instead guides AI to call functions directly, improving development efficiency. Data engineers can skip the heavy lifting of building complex pipelines.\nStructured API responses minimize post-processing when integrating base asset data. Clean response formats make a real difference in overall development speed.\nExtensibility for Live Trading Package manager and configuration files drastically reduce environment setup time. A single config file edit switches between live and simulated trading environments. Orders go through REST API while real-time quotes stream via WebSocket \u0026ndash; the standard pattern.\nThe category coverage is broad, spanning domestic and international equities, bonds, futures, and options. Derivatives data flows without interruption, and the examples are ready for immediate deployment in automated trading. No need to assemble individual functions from scratch.\nFor developers building Python-based financial services, this serves as a solid reference point. 
The flow from environment setup to data ingestion is smooth, maintaining fast response times while using infrastructure resources efficiently.\nRunning the official repository\u0026rsquo;s examples alone lays the foundation for a stable automated trading system.\nKey takeaways\nFunction-level directory structure designed for LLMs to discover and call APIs as tools MCP server support strengthens integration with latest AI models like Claude uv package manager and YAML config enable rapid switching between live and simulated trading environments Source\nhttps://x.com/i/status/2039681334038442123\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-05-kis-open-api-official-github-analysis/","summary":"A structural analysis of the official sample code from a major Korean brokerage API, optimized for LLM agents and Python environments.","title":"Analyzing the Official GitHub Repository of a Major Korean Brokerage Open API"},{"content":"Launching a screen capture app only to watch it eat hundreds of megabytes of memory is frustrating. Popular paid apps are excellent, but monthly subscriptions or one-time fees add up. macshot is a welcome open-source alternative \u0026ndash; lightweight and built to scratch the itches engineers actually have.\nNative Swift Performance By ditching Electron and building entirely with Swift and AppKit, macshot keeps memory usage extremely low. System resource consumption is minimal while launch speed is instant. The native experience feels distinctly different from tools wrapped in web technology.\nThe interface is intuitive, carrying the same feel as open-source tools from other operating systems. Annotation tools like arrows, text, shapes, blur effects, and pixelation are rich enough that there\u0026rsquo;s no need to open a separate editor. 
The entire flow from capture to annotation happens in one place, cutting out extra steps.\nPII Auto-Redaction and Workflow The one-click PII redaction feature catches sensitive information like emails, phone numbers, and API keys before they leak. Useful for code review screen shares and technical documentation where manual masking used to eat up time. OCR text extraction from images works smoothly, dropping results straight into the clipboard.\nScroll capture uses the OS framework to stitch vertical or horizontal images seamlessly. Long log outputs or entire web pages become a single file without hassle. Screen recording supports both MP4 and GIF formats with simultaneous system audio and microphone control.\nStorage Integration and Open-Source Value Cloud storage integration generates upload links immediately after capture. Sharing images with teammates skips the file transfer step entirely, keeping the workflow lean. Installation via a single terminal command is a nice touch.\nThe GPLv3 open-source license inspires confidence, and recent updates have stabilized the project. Multi-monitor support works without hiccups, including drag-based composition across screens. 
The level of polish rivals paid alternatives.\nA native app that takes the headache out of choosing a productivity tool.\nKey takeaways\nNative Swift/AppKit instead of Electron, keeping memory usage around 8MB PII auto-redaction and S3-compatible storage integration tailored to engineer workflows High-performance scroll capture stitching powered by Apple Vision framework Source\nhttps://github.com/sw33tLie/macshot\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-05-macshot-native-macos-screenshot-tool-review/","summary":"A look at macshot, a native open-source macOS screenshot and screen recording tool that delivers powerful features without the subscription burden.","title":"macshot: A Native macOS Tool Emerging as an Alternative to Paid Apps"},{"content":"Why Data Dictionaries Don\u0026rsquo;t Survive Six Months I once worked with multiple vendors for over a year on a next-gen data warehouse project. Early on, we built an ambitious data dictionary. We aligned definitions of \u0026ldquo;revenue,\u0026rdquo; \u0026ldquo;cost,\u0026rdquo; and \u0026ldquo;net revenue\u0026rdquo; across vendors and meticulously filled in table mappings. Six months in, new tables weren\u0026rsquo;t being registered, existing definitions hadn\u0026rsquo;t caught up with schema changes, and cross-department interpretation differences sat untouched. \u0026ldquo;It\u0026rsquo;s not accurate anyway\u0026rdquo; became the team\u0026rsquo;s official stance.\nWhether it\u0026rsquo;s called a catalog, a wiki, or a data dictionary, only the name changes. People fill it in enthusiastically at the start, then can\u0026rsquo;t keep up with maintenance and abandon it.\nA few days ago, Karpathy posted an LLM knowledge base idea on X, then followed up with the full architecture in a GitHub Gist (Karpathy LLM Wiki). 
It was about building a personal knowledge base, but the problems looked similar to what I\u0026rsquo;ve been hitting while building the DataNexus catalog.\nRAG Starts from Scratch Every Time Most LLM document workflows use RAG. NotebookLM, ChatGPT file uploads, internal document search \u0026ndash; they all chunk the source and vector-search. It works well enough, but there\u0026rsquo;s one thing that bugs me. Yesterday the LLM synthesized an answer from five documents; ask the same thing today and it starts the same search over again. It remembers nothing.\nThis bothered me with DataNexus too. In Post 1\r, I designed a structure that injects context by feeding the ontology into a RAG Store. But if the ontology falls out of sync with the actual schema, NL2SQL generates wrong SQL. I haven\u0026rsquo;t solved the problem of keeping the RAG Store itself up to date.\nHand the Wiki to the LLM Karpathy\u0026rsquo;s approach, put simply, is this: instead of searching raw sources every time like RAG, have the LLM manage a markdown wiki directly. Humans add the raw materials (Raw Sources), the LLM reads them and organizes into the wiki (Wiki). A configuration document (Schema) defines the wiki\u0026rsquo;s structure and rules so the LLM doesn\u0026rsquo;t write however it pleases but maintains consistency.\nWhen a new source comes in (Ingest), the LLM summarizes and updates 10-15 related pages. When you ask the wiki a question (Query), discoveries from the answer get written back. A periodic check (Lint) catches contradictions and stale information.\nKarpathy himself works with Obsidian open alongside an LLM agent. He used the analogy of Obsidian as the IDE, the LLM as the programmer, the wiki as the codebase. If you\u0026rsquo;re a developer, that clicks immediately.\nWhat caught my eye was the Schema. In Post 3\r, I wrote about the struggle of using DataHub\u0026rsquo;s Business Glossary as an ontology store with only 4 relationship types. 
Karpathy\u0026rsquo;s Schema serves a similar purpose \u0026ndash; a rulebook telling the LLM \u0026ldquo;connect this term using only these relationship types.\u0026rdquo; Without it, the LLM organizes however it wants and makes a mess.\nThe Real Bottleneck Is Maintenance Writing a data dictionary or wiki isn\u0026rsquo;t hard. Put a few people on it at the start of a project and it gets done. The problem is what comes after. Once pages exceed a hundred, the time spent updating cross-references, refreshing summaries, and detecting contradictions grows noticeably. People start quietly stepping away.\nThis is what worries me about DataNexus. Term registration, relationship configuration, DozerDB sync \u0026ndash; all built. But DW schemas keep changing after go-live. Phased releases add tables, new deduction items appear in the \u0026ldquo;net revenue\u0026rdquo; formula. The question is who reflects these changes in the catalog.\nAn LLM can update 15 files simultaneously. DataHub already emits MCL (Metadata Change Log) events, so a setup where the LLM receives these events, updates affected term pages, and refreshes cross-references is feasible. The SKOS compatibility layer rules from Post 4\rwould serve as the Schema.\nIn Post 1, I wrote that \u0026ldquo;we need a pipeline that detects changes and automatically refreshes the RAG Store.\u0026rdquo; Back then it was vague. After seeing Karpathy\u0026rsquo;s Ingest/Query/Lint pattern, that pipeline finally has a sketch.\nI don\u0026rsquo;t expect this to work right away. The LLM might create wrong relationships when auto-updating the ontology, and I don\u0026rsquo;t yet know how much domain expert review is needed. That\u0026rsquo;s something to figure out while building the metadata change detection pipeline.\nWhat Humans Can\u0026rsquo;t Do Karpathy pulled in Vannevar Bush\u0026rsquo;s 1945 Memex. 
An 80-year-old vision that kept failing because people couldn\u0026rsquo;t keep up with the management cost.\nKarpathy said he\u0026rsquo;s spending more LLM tokens on organizing knowledge than writing code these days. Building DataNexus, I\u0026rsquo;m heading the same direction. Term definitions, mappings, reconciling interpretation gaps. People give up on this within six months. I want to see what happens when you hand it to an LLM.\nDocumenting the process of designing and building DataNexus. GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/005-llm-wiki-and-metadata-maintenance/","summary":"RAG starts from scratch every time. Karpathy proposes having the LLM maintain a wiki directly so knowledge accumulates. DataNexus\u0026rsquo;s ontology catalog needs the same principle to avoid abandonment.","title":"5. Automating Metadata Maintenance: Karpathy's LLM Wiki Architecture"},{"content":"\rGEO Optimization Guide — Full Series\n1. What Is GEO - AI Citation Strategy Beyond SEO\r2. Each AI Cites Different Sources\r3. On-Site GEO Technical Architecture - From Product DB to JSON-LD\r4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site ← current post\r5. AEO - Why Coding Agents Read Documentation Differently\rYou Added JSON-LD, So Why Is a Blog Getting Cited Instead? We covered On-Site GEO through Part 3\r. Pulled JSON-LD from the product database, injected it into the HTML \u0026lt;head\u0026gt; via SSR, validated with Rich Results Test. Technically, nothing was missing.\nThen we asked ChatGPT to \u0026ldquo;recommend products from brand ○○.\u0026rdquo; It cited a Naver blog and TripAdvisor instead of the official site. 
On Perplexity, a Reddit thread showed up as the source.\nNo matter how well you build your own site, if AI primarily looks at external channels, the impact is cut in half.\nHere is the platform-by-platform citation source data from Part 2\r:\nPlatform Top Citation Source Share ChatGPT Directories/Listings (Yelp, G2, etc.) 49% Perplexity Reddit/Communities 31% Gemini Official websites 52% Google AIO YouTube #1 domain Except for Gemini, official websites do not dominate. Half of ChatGPT\u0026rsquo;s citations come from external directories. If you are not managing those channels, you are leaving half of your citation share on the table.\nThat is Off-Site GEO.\nHow Off-Site GEO Differs Part 1\rbriefly distinguished On-Site from Off-Site. Let\u0026rsquo;s dig deeper.\nOn-Site GEO is about making your own site easy for AI to read. JSON-LD, Schema.org, SSR. The engineering team fixes code and ships it.\nOff-Site GEO is about managing your brand across the external channels that AI actually references. Directory profiles, community mentions, YouTube videos. Marketing and PR have to drive this.\nDimension On-Site GEO Off-Site GEO Target Your own domain External platforms Core techniques JSON-LD, SSR, FAQ Schema Directory management, communities, YouTube Owner Engineering Marketing / PR / Brand Control level High (direct edits) Low (indirect influence) Effective on Gemini (52%) ChatGPT, Perplexity, AIO You cannot do just one. Raise the quality of official data with On-Site, and align brand consistency across external channels with Off-Site. They work as a set.\nPlatform-Specific Off-Site Strategies ChatGPT: Directories and Listings Make Up Half 49% of ChatGPT citations come from third-party directories like Yelp, TripAdvisor, G2, and Capterra (Yext). Directory profiles get cited before your own website.\nWhy? ChatGPT has a weak native search index. It relies on Bing\u0026rsquo;s search layer, and Bing assigns high domain authority to directory sites. 
Information listed on directories reaches ChatGPT\u0026rsquo;s answers first.\nWhat you can do right away:\nCheck whether your profiles exist on key directories for your industry (Yelp, Google Business, G2, Capterra, TripAdvisor). Create them if missing, update them if stale Ensure NAP consistency. Name, Address, Phone must be identical across all directories. If \u0026ldquo;Company Inc.\u0026rdquo; and \u0026ldquo;Company Corp.\u0026rdquo; are mixed, AI may treat them as separate entities Manage reviews. AI uses review count and rating as trust signals. A profile with zero reviews is unlikely to be cited Perplexity: Reddit and Communities Are the Source 31% of Perplexity citations come from community threads, including Reddit. It trusts real user discussions over official announcements.\nThis does not mean you should just post on Reddit. The reason Perplexity favors Reddit is that its question-and-answer structure is optimized for AI parsing. \u0026ldquo;What do you think of this product?\u0026rdquo; → \u0026ldquo;Used it for 6 months, ○○ is great but ○○ not so much.\u0026rdquo; This kind of dialogue is the easiest format for AI to cite.\nWhat to focus on:\nIdentify subreddits where your brand or category is discussed. Monitor them regularly Contribute genuinely useful answers to product-related questions. Promotional posts get downvoted immediately on Reddit For the Korean market, the dynamics differ. Instead of Reddit, communities like DCInside, Clien, and Ppomppu play a similar role. Data on how much Perplexity cites these sites for Korean-language queries is still scarce. This area needs hands-on testing Google AI Overview: YouTube Is Surging YouTube is the #1 cited domain in Google AI Overview (Ahrefs Brand Radar). Its share grew 34% in just six months.\nAs we noted in Part 2, the characteristics of cited videos are surprising. Videos with under 1,000 views get cited. Plenty have just a few dozen likes. 
What AI looks at is not popularity but how well the information is organized.\nCommon elements in videos that get cited:\nElement Description Citation Impact Timestamps/Chapters Topic segments within the video High Structured description Table of contents, links, key takeaways High Clear title Question-based or \u0026ldquo;How to\u0026rdquo; format Medium Subtitles/Transcript Even auto-generated ones enable parsing Medium View count/Likes Popularity metrics Low Even for videos you have already uploaded, adding timestamps to the description bumps up AI citation potential. Lay out a table of contents like \u0026ldquo;What this video covers: 1. ○○ 2. ○○\u0026rdquo; and place relevant links. Titles with clear search intent like \u0026ldquo;How to ○○\u0026rdquo; or \u0026ldquo;○○ vs ○○ Comparison\u0026rdquo; tend to do better.\nBut First: Check Your robots.txt Before diving into Off-Site, there is one thing to verify. Make sure your own site is not blocking AI crawlers.\nIf you block GPTBot or PerplexityBot in robots.txt, those AI engines cannot crawl your site. Your On-Site GEO could be flawless, but if AI cannot read it, none of it matters.\nWe built a tool that lets you run the competitor robots.txt analysis discussed in Part 2 hands-on. Feed it a list of domains and it shows allow/block status for 10 AI crawlers as a heatmap.\nTry it in Google Colab\rIt runs on Python\u0026rsquo;s standard library alone, no API keys needed. Swap in competitor domains to map out your entire industry.\nWhat You Can Read from robots.txt If a competitor is blocking GPTBot, your chances of getting cited on that AI platform go up relatively. It is a gap you can fill.\nConversely, if competitors have fully opened up and you are the only one blocking, only competitors show up in AI search results while you are invisible.\nOne thing worth knowing: blocking GPTBot does not block ChatGPT-User (browsing mode), which is a separate User-Agent. Browsing mode may still access your site. 
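These per-crawler distinctions can be checked programmatically with nothing but Python's standard library, in the spirit of the Colab analyzer mentioned above. A minimal sketch, with an illustrative robots.txt policy and crawler list; a real run would fetch each domain's live robots.txt instead:

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: allow search, block AI training, keep browsing open.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

# User-agent strings to probe; extend this list for other AI crawlers.
AI_CRAWLERS = ["Googlebot", "Google-Extended", "GPTBot", "ChatGPT-User"]

def crawler_policy(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {user_agent: allowed?} for one robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in AI_CRAWLERS}

# GPTBot ends up blocked while ChatGPT-User (browsing mode) stays allowed.
print(crawler_policy(ROBOTS_TXT))
```

Swapping the hard-coded string for a download of each competitor domain's robots.txt turns this into the heatmap-style comparison the analyzer produces.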
Blocking Google-Extended does not affect the base Googlebot. You can keep search visibility while blocking AI training specifically.\n# Allow search, block AI training only User-agent: Googlebot Allow: / User-agent: Google-Extended Disallow: / User-agent: GPTBot Disallow: / User-agent: ChatGPT-User Allow: / With this setup, your site appears normally in Google search but is excluded from Gemini AI training and ChatGPT training data. ChatGPT browsing mode is still allowed, keeping real-time citation possible.\nOff-Site Channel Priorities by Industry The same strategy does not work for every industry. The external channels AI references most vary by sector.\nIndustry #1 Off-Site Channel #2 Notes E-commerce/Retail Google Business + Directories YouTube reviews Balancing catalog protection vs AI exposure SaaS/B2B G2, Capterra reviews Reddit (r/SaaS, etc.) Review count directly drives citation likelihood Hotels/Travel TripAdvisor, Booking YouTube tours Freshness of pricing/availability data is key Food/Consumer goods Community reviews YouTube food/review content In Korea, Naver blogs still carry significant weight Finance/Fintech News/Media Specialized forums Many block AI crawlers due to regulatory concerns E-commerce is especially tough. Expose product prices and inventory to AI, and competitors can scrape it in real time. Block it, and you vanish from AI search. 
Where to draw that line is the critical GEO decision for retail.\nOff-Site GEO Checklist Starting with what you can execute immediately:\nThis week\nCheck whether your robots.txt blocks AI crawlers → Diagnose with the Colab analyzer\rCompare robots.txt across 3 competitors Verify profile existence on major directories (Google Business, industry-specific directories) This month\nUpdate directory profile information (verify NAP consistency) Add timestamps/chapters/structured descriptions to existing YouTube videos Build a list of communities/subreddits where your brand is mentioned This quarter\nEstablish per-platform AI citation monitoring Audit brand consistency across Off-Site channels Redesign robots.txt policy to align with GEO strategy ","permalink":"https://datanexus-kr.github.io/en/guides/geo-optimization/004-offsite-geo-strategy/","summary":"Even with perfect On-Site GEO, half of AI citations come from external channels. We cover platform-specific Off-Site strategies and how to diagnose your robots.txt setup.","title":"4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site"},{"content":"\rGEO Optimization Guide — Full Series\n1. What Is GEO - AI Citation Strategy Beyond SEO\r2. Each AI Cites Different Sources\r3. On-Site GEO Technical Architecture - From Product DB to JSON-LD ← current post\r4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site\r5. AEO - Why Coding Agents Read Documentation Differently\rWhere Do You Build JSON-LD and Where Does It Go In the previous article\r, we confirmed that each AI platform prefers different citation sources. Gemini favors official websites, ChatGPT leans on directories, and Perplexity gravitates toward community discussions. One thing they share: pages with structured data get cited more often across all platforms.\nSo the technical core of On-Site GEO boils down to this question. 
How do you transform product master DB data into JSON-LD and inject it into the HTML \u0026lt;head\u0026gt;?\nIt sounds simple, but once you dig in, the tangles pile up fast. Product DB field names are cryptic abbreviations. The attributes AI needs do not exist in the DB. Sites built as SPAs cannot serve JSON-LD to crawlers. This article covers how to solve these problems with a structured approach.\nThe Concentric Architecture of a GEO System A GEO system expands outward through four layers.\nLayer Components Role Core Product Master DB SSOT (Single Source of Truth). The origin of all data Channel Website / Mobile App JSON-LD injection, SSR rendering API Product Query API Interface for AI agent integration Agent ChatGPT / Gemini / Perplexity End consumer touchpoint Data flows from Core through Channel to Agent. Its shape changes at each layer. Raw DB fields become structured JSON-LD, which becomes the citation source in AI answers.\nThe API layer is easy to overlook. You might think just embedding JSON-LD is enough, but once you consider AI agent integrations like ChatGPT Plugins or MCP (Model Context Protocol), a separate API layer becomes necessary. Even if you do not need it right now, accounting for it in the design phase saves pain later.\nThe 3-Stage Data Pipeline Instead of managing product descriptions as monolithic blobs, decompose them into individual fields. AI cites more accurately when data is field-level structured. That is the core idea behind this pipeline.\nStage 1: DB Refinement - Field Mapping This stage maps existing product master DB fields to Schema.org fields. You are not creating new data \u0026ndash; just organizing what already exists.\nDB Field → Schema.org Field ───────────────────────────────────────── PROD_NM → name BRND_CD (code lookup) → brand.name GTIN_13 → gtin13 PRC_AMT → offers.price STCK_YN → offers.availability IMG_URL → image CTG_NM → category The field count runs around 15-18 depending on the industry. 
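The Stage 1 mapping can be sketched as a small dictionary transform. The field names follow the example above; the brand lookup table, the availability URLs for the Y/N stock flag, and the sample row are illustrative assumptions, not a real master DB:

```python
# Illustrative code-to-text lookup; in practice this comes from a code table.
BRAND_CODES = {"P1042": "FoodCo"}

# Assumes STCK_YN is a Y/N flag; mapped to Schema.org availability URLs.
AVAILABILITY = {
    "Y": "https://schema.org/InStock",
    "N": "https://schema.org/OutOfStock",
}

def map_to_schema_org(row: dict) -> dict:
    """Stage 1: rename raw DB fields to Schema.org fields, resolving codes."""
    return {
        "@type": "Product",
        "name": row["PROD_NM"],
        "brand": {"@type": "Brand", "name": BRAND_CODES[row["BRND_CD"]]},
        "gtin13": row["GTIN_13"],
        "image": row["IMG_URL"],
        "category": row["CTG_NM"],
        "offers": {
            "@type": "Offer",
            "price": row["PRC_AMT"],
            "priceCurrency": "KRW",
            "availability": AVAILABILITY[row["STCK_YN"]],
        },
    }

# Sample row in the cryptic-abbreviation style described above.
product = map_to_schema_org({
    "PROD_NM": "Choco Stick Original", "BRND_CD": "P1042",
    "GTIN_13": "8801234567890", "PRC_AMT": 1500,
    "STCK_YN": "Y", "IMG_URL": "https://example.com/p.jpg",
    "CTG_NM": "Snacks",
})
```

The one non-mechanical step is exactly the code-to-text conversion: the transform fails loudly on an unknown BRND_CD, which is preferable to emitting a raw code AI cannot interpret.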
Since most values already exist in the DB, development effort is modest. The catch is converting code values to human-readable text. You need to transform BRND_CD = P1042 into brand.name = \u0026quot;FoodCo\u0026quot; for AI to understand it.\nThe biggest stumbling block at this stage is GTIN. It is a GS1 standard identifier, and different variants of the same product (size, flavor) need different GTINs. If you lump \u0026ldquo;Choco Stick Original\u0026rdquo; and \u0026ldquo;Choco Stick Almond\u0026rdquo; under one master code, AI cannot tell them apart.\nStage 2: LLM Extraction - AI-Generated Attributes Some attributes AI needs for citation do not exist in the DB. Target users, usage occasions, sentiment keywords. Having humans write these manually becomes impractical when you have thousands of SKUs.\nInstead, let an LLM read existing product descriptions, reviews, and category data to extract them automatically.\nSource Field Description Example DB @type Schema.org type Product DB name Product name Gram 16 DB gtin13 GS1 identifier 8801056038800 LLM targetUser Target user Students, professionals LLM occasion Usage occasion Graduation gift, work use LLM sentiment Sentiment keywords Lightweight, sleek LLM nutrition Nutrition info Sugar-free LLM safety Safety info CAS 9002-88-4 LLM extraction fields vary by industry. For food, nutrition facts and ingredients are key. For hotels, amenities and check-in times matter. For chemicals/B2B, it is material properties and certifications.\nThis stage adds 10-15 fields. 
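One way to sketch the Stage 2 extraction step. The helper names (build_extraction_prompt, parse_extraction) are hypothetical and the actual LLM call is left out; the point is constraining the model to strict JSON over the expected keys and discarding anything else:

```python
import json

# The five LLM-sourced attributes from the table above.
LLM_FIELDS = ["targetUser", "occasion", "sentiment", "nutrition", "safety"]

def build_extraction_prompt(description: str, reviews: list) -> str:
    """Ask the model to return strict JSON with only the expected keys."""
    return (
        "From the product description and reviews below, extract a JSON "
        f"object with exactly these keys: {', '.join(LLM_FIELDS)}. "
        "Use null for anything not stated.\n\n"
        f"Description: {description}\nReviews: {' | '.join(reviews)}"
    )

def parse_extraction(response_text: str) -> dict:
    """Keep only expected keys so a chatty response cannot pollute the record."""
    data = json.loads(response_text)
    return {k: data.get(k) for k in LLM_FIELDS}

prompt = build_extraction_prompt(
    "Chocolate-coated crispy stick snack. 200kcal per 46g serving.",
    ["Crispy and not too sweet, great for a snack break."],
)

# Simulated model response; a real pipeline would send `prompt` to your LLM.
fields = parse_extraction(
    '{"targetUser": "students", "occasion": "snack break", '
    '"sentiment": ["crispy", "sweet"], "nutrition": "200kcal per 46g", '
    '"safety": null, "extra": "ignored"}'
)
```

Keeping the key whitelist in one place also makes it easy to swap the field set per industry (nutrition for food, amenities for hotels, certifications for B2B chemicals).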
Combined with Stage 1, each product ends up with 25-33 structured fields.\nStage 3: JSON-LD Output - Automated Conversion and SSR Deployment Fields from Stages 1 and 2 are converted into Schema.org-compliant JSON-LD and automatically injected into the HTML \u0026lt;head\u0026gt; via SSR.\n{ \u0026#34;@context\u0026#34;: \u0026#34;https://schema.org\u0026#34;, \u0026#34;@type\u0026#34;: \u0026#34;Product\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Choco Stick Original\u0026#34;, \u0026#34;gtin13\u0026#34;: \u0026#34;8801234567890\u0026#34;, \u0026#34;brand\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Brand\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;FoodCo\u0026#34; }, \u0026#34;description\u0026#34;: \u0026#34;Chocolate-coated crispy stick snack. 200kcal per 46g serving.\u0026#34;, \u0026#34;offers\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Offer\u0026#34;, \u0026#34;price\u0026#34;: 1500, \u0026#34;priceCurrency\u0026#34;: \u0026#34;KRW\u0026#34;, \u0026#34;availability\u0026#34;: \u0026#34;https://schema.org/InStock\u0026#34; }, \u0026#34;nutrition\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;NutritionInformation\u0026#34;, \u0026#34;calories\u0026#34;: \u0026#34;200 calories\u0026#34;, \u0026#34;servingSize\u0026#34;: \u0026#34;1 pack (46g)\u0026#34; } } Once this JSON-LD sits inside the \u0026lt;head\u0026gt; tag, the Invisible GEO discussed in Part 1\ris complete. Invisible to users, but parsed directly by AI and search engines.\nSee how the before and after compares for a food product in the demo.\nDemo - JSON-LD Before/After Four Principles for Writing Descriptions The most human-dependent part of the pipeline is product descriptions. Descriptions that AI cites well follow a pattern.\nFact-based - Include only objective information. AI ignores advertising copy like \u0026ldquo;industry-leading\u0026rdquo; or \u0026ldquo;customer satisfaction #1.\u0026rdquo;\n100-300 characters - The sweet spot for AI reference. 
Too short lacks context; too long buries the key points.\nNatural keywords - As confirmed by Princeton/Georgia Tech research\r, keyword stuffing actually decreases AI visibility. Weave keywords into natural sentences.\nUnique per SKU - Copy-pasting the same template with only the product name swapped gets flagged as duplicate content by AI. Each product needs its own description.\n\u0026lt;!-- Bad: vague boilerplate with no facts --\u0026gt; \u0026lt;meta name=\u0026#34;description\u0026#34; content=\u0026#34;About Us\u0026#34;/\u0026gt; \u0026lt;!-- Good: fact-based, natural language, proper length --\u0026gt; \u0026lt;meta name=\u0026#34;description\u0026#34; content=\u0026#34;ChemCo is a global petrochemical company supplying PE/PP products to 50 countries with annual revenue of $11B. Leading ESG management and carbon neutrality by 2050.\u0026#34;/\u0026gt; Why SSR Is Non-Negotiable Even if you build the JSON-LD perfectly, it is useless if AI crawlers cannot read it. This is where SPAs (Single Page Applications) become a bottleneck.\nSPAs require JavaScript execution in the browser to render content. It looks fine to humans, but AI crawlers like GPTBot and Google-Extended mostly do not execute JS. 
Even if you put JSON-LD in the \u0026lt;head\u0026gt;, when the server sends an empty HTML shell, crawlers see nothing.\nSwitching to SSR (Server-Side Rendering) means the server sends fully rendered HTML, so crawlers can read JSON-LD immediately without JS execution.\nHere is how it looks with the Next.js App Router:\n// app/product/[id]/page.tsx export default async function ProductPage({ params }) { const product = await fetchProduct(params.id); const jsonLd = { \u0026#34;@context\u0026#34;: \u0026#34;https://schema.org\u0026#34;, \u0026#34;@type\u0026#34;: \u0026#34;Product\u0026#34;, \u0026#34;name\u0026#34;: product.name, \u0026#34;gtin13\u0026#34;: product.gtin, \u0026#34;brand\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Brand\u0026#34;, \u0026#34;name\u0026#34;: product.brand }, \u0026#34;description\u0026#34;: product.description, \u0026#34;image\u0026#34;: product.imageUrl, \u0026#34;offers\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Offer\u0026#34;, \u0026#34;price\u0026#34;: product.price, \u0026#34;priceCurrency\u0026#34;: \u0026#34;KRW\u0026#34;, \u0026#34;availability\u0026#34;: \u0026#34;https://schema.org/InStock\u0026#34;, \u0026#34;url\u0026#34;: product.pageUrl } }; return ( \u0026lt;\u0026gt; \u0026lt;script type=\u0026#34;application/ld+json\u0026#34; dangerouslySetInnerHTML={{ __html: JSON.stringify(jsonLd) }} /\u0026gt; \u0026lt;ProductDetail product={product} /\u0026gt; \u0026lt;/\u0026gt; ); } The server fetches DB data via fetchProduct, builds the JSON-LD object, and injects it as a \u0026lt;script\u0026gt; tag. This HTML reaches crawlers as-is.\nIf SSR adoption feels too heavy, Google Tag Manager (GTM) can inject JSON-LD as a transitional approach. 
Less effective than full SSR, but viable when you cannot convert an SPA right away.\nSSR Trade-offs Aspect Advantage Disadvantage Mitigation SEO optimization Crawlers read without JS Initial dev cost SDK/shared module Data reflection Auto-updates on DB changes Increased server load Redis caching + ISR Central management Site-wide uniform deployment Dev team dependency Admin console for non-devs Validation Build-time schema validation Legacy system migration GTM hybrid fallback Server load is largely mitigated by Redis caching and ISR (Incremental Static Regeneration). As long as product data has not changed, cached HTML is served directly.\nData Freshness Drives Citations Even well-structured data gets deprioritized when stale.\nAnalyzing pages with high Perplexity citations, over three-quarters had been updated within the past month. ChatGPT Shopping refreshes feeds every 15 minutes (OpenAI). Pages untouched for over three months are likely to drop in AI citation rankings.\nFreshness management guidelines:\nCritical data (price, inventory, promotions): refresh within 24 hours General data (descriptions, images): refresh within 7 days Static data (brand info, company overview): monthly review Keeping lastmod dates in sitemap.xml aligned with actual update timestamps, and using the IndexNow API to notify search engines of changes immediately, also makes a difference.\n// next-sitemap.config.js module.exports = { siteUrl: \u0026#39;https://www.example.com\u0026#39;, generateRobotsTxt: true, changefreq: \u0026#39;daily\u0026#39;, transform: async (config, path) =\u0026gt; ({ loc: path, changefreq: path.includes(\u0026#39;/product/\u0026#39;) ? \u0026#39;daily\u0026#39; : \u0026#39;weekly\u0026#39;, priority: path.includes(\u0026#39;/product/\u0026#39;) ? 0.9 : 0.5, lastmod: new Date().toISOString(), }), }; Validation - If You Added It, Verify It Inserting JSON-LD is not the finish line. 
You need to confirm crawlers can actually read it.\nGoogle Rich Results Test - Enter your URL at search.google.com/test/rich-results\rto instantly check whether structured data is being recognized.\nCrawler simulation with curl - Send requests with AI crawler User-Agents to verify JSON-LD is included in the HTML response.\n# Request as GPTBot curl -A \u0026#34;GPTBot\u0026#34; https://www.example.com/product/12345 | grep \u0026#34;application/ld+json\u0026#34; # Extract JSON-LD from HTML source curl -s https://www.example.com/product/12345 \\ | grep -oP \u0026#39;\u0026lt;script type=\u0026#34;application/ld\\+json\u0026#34;\u0026gt;.*?\u0026lt;/script\u0026gt;\u0026#39; If your site is still an SPA without SSR, curl results will likely show no JSON-LD. That is exactly why SSR is non-negotiable.\nTry building JSON-LD yourself with the interactive builder to get a hands-on feel for the structure.\nDemo - JSON-LD Builder Common issues encountered in practice:\nSymptom Cause Fix JSON-LD not crawled robots.txt blocking Set GPTBot, Google-Extended to Allow AI not citing data Schema.org type error Validate with Rich Results Test Slow API response No caching Apply Redis caching + minimize fields Server overload after SSR DB query on every request ISR + Redis caching ","permalink":"https://datanexus-kr.github.io/en/guides/geo-optimization/003-geo-data-pipeline/","summary":"How product master DB data flows through a 3-stage pipeline to become JSON-LD in your HTML \u003chead\u003e. Covers the pipeline architecture and SSR-based automated deployment.","title":"3. On-Site GEO Technical Architecture - From Product DB to JSON-LD"},{"content":"When team documents are scattered across Notion, GitHub Issues, and S3 buckets, building connectors for each source becomes the bottleneck. OpenDocuments is an open-source RAG platform that has that wiring pre-built.\nOver 12 Connectors It connects to Notion, GitHub, S3, PDFs, and Jupyter notebooks. 
Using Ollama, data stays entirely within the server with no external API calls. This architecture fits environments with strict network isolation like financial institutions or retail companies.\nBuilt on SQLite and LanceDB, installation finishes with a single Docker command. Without heavy infrastructure, spinning it up as a side project or internal team tool feels lightweight. Once configured, embedding and storage happen automatically when files come in. No manual classification work needed from the admin.\nHybrid Search Is the Default It comes with hybrid search combining vector and keyword approaches, plus reranking. Even mixed Korean-English queries pick up context and cite sources. Displaying the basis documents for each answer reduces hallucination concerns.\nSwitching from local to a high-performance external API is a one-line config change. It supports MCP servers so you can plug it directly into coding agents, and the plugin architecture is open for adding custom parsers and connectors.\nKey Takeaways\nOllama integration enables on-premise RAG without external APIs, suitable for network-isolated environments Hybrid search with reranking returns accurate sources even for Korean-English cross-language queries MCP server support lets you connect coding agents directly to internal knowledge bases Source: https://news.hada.io/topic?id=27910\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-01-opendocuments-local-rag-platform/","summary":"An open-source platform that connects documents scattered across Notion, GitHub, and S3, then queries them with a local LLM. Runs entirely on-premise without external APIs.","title":"OpenDocuments: A Local RAG Platform That Unifies Fragmented Team Knowledge"},{"content":"Even after adopting agents, if the work structure itself does not change, the agent just replaces what you were already doing. Productivity does not scale. 
Reid Hoffman\u0026rsquo;s conductor strategy starts by breaking that illusion.\nBe the Conductor, Not the Performer Handling individual tasks yourself has limits. The role shifts toward placing multiple agents in the right positions and managing the flow. It is not easy. Catching agents when they lose context or veer off course is more cumbersome than expected. A Moroccan driver who built a business with chatbots is mentioned \u0026ndash; what stands out is that willingness to use the tool mattered more than the tool\u0026rsquo;s capabilities.\nThe SaaS Moat Is Crumbling The feature advantages that legacy software companies built up are eroding. As the cost of building tools drops, even distributors and grocery companies are hiring their own engineers. Custom solutions tailored to the organization now outweigh general-purpose features.\nStop Writing Prompts Yourself Having AI craft the optimal prompt for you is the baseline. Voice input lets you convey richer context faster than text. Assigning an agent a specific expert identity to critique your reasoning is an intermediate technique. Recognizing that a model\u0026rsquo;s training data is frozen in the past and mixing in real-time search is also fundamental.\nShare the Process, Not the Output To prove practical application skills, sharing the workflow beats sharing the deliverable. Differentiation comes from focusing on personalized experiences that large platforms cannot reach. 
Let machines handle the generic content anyone can produce, and concentrate on work infused with your unique context.\nKey Takeaways\nHaving AI write prompts for you instead of writing them yourself produces noticeably better results SaaS feature-accumulation advantages are being replaced by low-cost AI-based custom builds, making organization-specific solutions more advantageous Shifting from individual executor to conductor managing multiple agents is the practical productivity scaling path right now Source: https://eopla.net/magazines/40952\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-01-ai-era-five-percent-conductor-strategy/","summary":"Even after adopting agents, if the work structure itself does not change, productivity does not scale. Reid Hoffman\u0026rsquo;s conductor strategy starts by breaking that illusion.","title":"Thinking Like an Agent Conductor in the AI Era"},{"content":"To feed legal data into an AI agent, you first have to parse XML from public institutions. Table structures break, article numbering systems get tangled, and preprocessing alone eats half a day. Beopmang is a service that has solved this upfront.\nWhat Changes When You Receive JSON Legal data from public institutions is not machine-ready out of the box. Beopmang parses even complex table structures into a consistent JSON format. It covers most Korean statutes and updates weekly.\nToken consumption drops. Without the need for preprocessing to extract text from XML or HTML, the pipeline simplifies. Numerical values inside tables come out as clean arrays, reducing the chance of the model misreading context.\nVector Search Is Built In Keyword matching alone struggles to capture the context of legal articles. Beopmang has converted key articles into pgvector-based vector data. A single API call enables semantic search. 
The core point is that no separate infrastructure needs to be set up.\nYou can immediately run a RAG structure where the model directly references articles to generate answers.\nIntegration Is Simple There is no authentication process. It works without API key issuance. Rate limits are generous, and no user logs are kept, making it frictionless for prototyping.\nRevision history comparison is also available via API. Even non-experts can track how articles have changed over time.\nKey Takeaways\nPre-cleaned JSON legal data means virtually no preprocessing effort Built-in pgvector-based semantic search plugs directly into RAG pipelines No authentication required, generous rate limits \u0026ndash; fast to get started Related Posts\nWhy DataNexus\r— Structuring domain knowledge determines AI accuracy Legal Data Beyond Pipes \u0026ndash; Gaining a Brain\r— Synergy between korean-law-mcp and DataNexus ontology Source: https://news.hada.io/topic?id=28050\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-01-beopmang-api-korean-law-json-for-ai-agent/","summary":"Parsing raw legal XML means broken table structures and half a day lost on preprocessing. Beopmang is an API that delivers pre-cleaned JSON, solving that problem upfront.","title":"Using Korean Legal Data for AI Agent Development via Beopmang API"},{"content":"When large-scale Excel downloads keep hitting OOM no matter how much memory you add, and writing Apache POI streaming code from scratch every time is costly, StreamSheet is a Spring Boot library that strips away that repetition with a single annotation.\n1 Million Rows Without OOM Data is never loaded into memory all at once. Using JPA Stream or JDBC ResultSet, it reads sequentially while running queries and writing the file simultaneously. 
Garbage collector load stays constant, and process utilization remained stable in load tests.\nIt supports various database environments, so there are no constraints when attaching it to existing projects.\nOne Annotation and Done Add an annotation to the object and configuration is complete. Spring Boot Auto-configuration is built in, so it works immediately just by adding the dependency. Specify the return type and the internal converter opens a stream and transmits data.\nChanging column order or names requires modifying just one config value. No need to cross-reference column numbers manually, and the effort of maintaining Apache POI boilerplate code by hand disappears.\nKey Takeaways\nSequential processing with JPA Stream and JDBC ResultSet means no OOM on 1M-row Excel downloads The declarative annotation-based approach eliminates the need to write Apache POI boilerplate Spring Boot Auto-configuration support means it plugs into existing projects by just adding the dependency Related Posts\nBronze Layer — Stacking Raw Data As-Is\r— Collection strategies for large-scale data ingestion Source: https://news.hada.io/topic?id=27997\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-01-streamsheet-excel-export-spring-boot/","summary":"A Spring Boot library for streaming 1M-row Excel downloads without OOM. A single annotation eliminates the boilerplate.","title":"Optimizing Large-Scale Excel Downloads in Spring Boot"},{"content":"When Notion pages, GitHub issues, and wiki pages each live in different places, building RAG requires creating connectors for every source from scratch. OpenDocuments is a self-hosted RAG tool that comes with that wiring pre-built.\nOne-Line Installation Start with a single npm install. Auto-loading of API keys minimizes manual setup, and the opendocuments doctor command catches configuration errors on the spot.\nBuilt on TypeScript and Hono, it is lightweight. 
Using SQLite and LanceDB means no separate database server is required. Connect Ollama and everything processes locally without external API calls. This architecture works in environments with strict network isolation like financial institutions.\nKorean-English Cross-Language Queries Work It parses various formats including PDFs and Jupyter notebooks, splits them into semantic units, and stores them as vectors. Search uses a hybrid approach combining vector retrieval and keyword matching. Ask in Korean and it finds answers from English documents, with source documents cited so you can filter out hallucinations.\nMCP server support means coding agents can query the internal knowledge base directly. The monorepo structure makes it straightforward to add custom parsers and connectors as plugins. Hundreds of test cases are in place, so stability is not a concern when layering custom logic on top.\nKey Takeaways\nOllama integration processes data locally only, making it suitable for air-gapped environments Hybrid search with multilingual support finds English documents from Korean queries MCP server support connects coding agents directly to internal knowledge bases Related Posts\nWhy DataNexus\r— The problem of structuring scattered domain knowledge How We Chose These Four Open-Source Tools\r— Evaluation criteria for RAG engine selection Sources\nhttps://github.com/joungminsung/OpenDocuments\rhttps://news.hada.io/topic?id=27910\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-04/2026-04-01-opendocuments-self-hosted-rag-platform/","summary":"A self-hosted RAG platform that runs on Ollama to unify team documents scattered across Notion, GitHub, and S3. Natural language search works even in air-gapped environments without external APIs.","title":"OpenDocuments: A Self-Hosted RAG Platform Connecting Scattered Team Knowledge"},{"content":"Repeating \u0026ldquo;fix this part\u0026rdquo; to Claude only circles within problems you already know. 
The more fix instructions pile up, the more blind spots you never recognized remain untouched. Switching from giving fix instructions to requesting evaluations reveals areas that were previously invisible.\nThe Role You Assign Determines the Result Give Claude a senior UX designer role and have it evaluate user flows, and it goes beyond fixing a single button to re-examining the entire navigation. From a PM perspective, it cross-references customer feedback with actual metrics to prioritize issues.\nHave a senior engineer review the codebase and it catches performance bottlenecks and duplicate code simultaneously. The more specific the role, the denser the feedback becomes.\nFull Automation Blows Up Costs Chaining agents into a pipeline looks impressive. But running the entire codebase through multiple roles drives token costs through the roof and introduces response latency.\nIt is better to call exactly one persona at exactly the right moment. If stability feels shaky before deployment, pull out just the QA lead and ask them to list error handling gaps by severity. Running the full pipeline is unnecessary.\nOne More Look Through a QA Lens Confirming that features work is not enough. Assign a QA role and have it examine the product, and UI breaks and missing exception handling surface in surprising numbers. 
These are points you would have simply overlooked working alone.\nExamining the same code through rotating roles is the practical way to reduce the technical debt that accumulates during vibe coding.\nKey Takeaways\nExpert role-based evaluations catch more latent defects than simple fix instructions Calling a single persona at the right moment costs fewer tokens than a full automation pipeline Verifying edge cases with a QA lead persona reduces the technical debt that piles up after feature implementation ","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-claude-vibe-coding-role-based-evaluation/","summary":"Repeating \u0026lsquo;fix this\u0026rsquo; only circles within problems you already know. Deploying expert personas reveals blind spots, and learning when to call which persona also saves tokens.","title":"Stop Giving Coding Instructions -- Borrow an Expert's Perspective Instead"},{"content":"When creating PPTs with NotebookLM, generation sometimes stalls around 20 slides. When it stops before even reaching half the intended volume, you have to split the work and start over from scratch. This video covers a structural workaround for that problem.\nCreate a Design Prompt First Pick a layout from Behance or Dribbble and capture it with GoFullPage. Upload the image to Gemini or ChatGPT and they will extract the color scheme and structure as text. The key is extracting four types: title, body, data visualization, and process. Splitting by type lets you pick and choose styles per slide later.\nDesignate a Master Script as the Source Feeding multiple sources at once causes the AI to lose context. First, extract a script containing slide numbers, titles, and key data for all 40 slides in the chat. Save it as a note and convert it to a NotebookLM source.
Uncheck all existing sources and activate only this script \u0026ndash; the AI will not wander off track.\nUse Code-Style Commands Instead of Natural Language Requesting all 40 slides at once causes a mid-generation cutoff. Split the request into slides 1\u0026ndash;20 and 21\u0026ndash;40. Use structured system-administrator-style commands instead of natural language. Include a continuation rule telling it not to stop at page 20 but to flow directly into page 21. Merge the two generated PPTX files with the keep-original-formatting option and you are done.\nKey Takeaways\nBreaking design references into four types (title, body, visualization, process) as prompts lets you select styles per slide Activating only a single master script as the source keeps the AI from losing context across materials Splitting generation into segments with code-style commands produces 40 consistent slides Source\nhttps://www.youtube.com/watch?v=rlVWuvgEftU\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-notebooklm-slide-limit-bypass-strategy/","summary":"When NotebookLM stops at 20 slides, most people just give up. Designate a single master script as the source and push through with code-style commands to generate 40 slides.","title":"Overcoming NotebookLM's Generation Limits with Script-Based Control"},{"content":"When you try to handle lecture planning entirely with Claude, you frequently hit walls. Bottlenecks appear whenever you need real-time information search or deep synthesis across multiple sources. The solution is not swapping tools but assigning distinct roles at each stage to keep the flow moving.\nClaude Designs, Perplexity Researches Claude is used to analyze learning objectives and structure a curriculum in hourly units. Getting the skeleton right first prevents the remaining stages from stalling. When real-time information is needed, the task moves to Perplexity. 
NotebookLM then consolidates the gathered materials into deeper analysis.\nEach tool excels at something different. Cramming everything into one tool inevitably forces a compromise somewhere.\nSlide Copy and Visualization Also Get Separate Roles Turning organized information into slide copy is Claude\u0026rsquo;s job again. After refining sentences into action-oriented phrasing, the output moves to the visualization stage. NotebookLM shapes the structure, and a dedicated tool produces concept diagrams and schematics. The final check ensures there is no gap between planning intent and deliverables.\nConnection Order Matters More Than Tool Count Using many tools should never become the goal. Each function needs to operate independently at its designated stage while feeding seamlessly into the next. When the steps mesh in sequence, human intervention drops significantly.\nKey Takeaways\nUsing Claude for the skeleton and Perplexity for real-time research speeds up initial drafts Consolidating materials in NotebookLM turns scattered information into a single coherent perspective Assigning roles by stage is what makes each tool effective ","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-ai-lecture-planning-workflow-agent-structure/","summary":"How assigning Claude, Perplexity, and NotebookLM to distinct stages of lecture planning changes the workflow. The point is not using more tools but knowing which tool fits which stage.","title":"Building an Organic AI Collaboration Flow from Planning to Visualization"},{"content":"When feeding web data into a RAG pipeline, it is convenient to have a tool that accepts a URL and converts the page to Markdown. defuddle does exactly that. The problem is that results vary dramatically depending on the site\u0026rsquo;s structure.\nWhether Semantic HTML Exists or Not Tech blogs and official documentation work fine. When heading hierarchy is intact and body tags are clear, the extraction output is usable. 
The trouble starts with e-commerce sites and layout-driven pages. Without semantic structure, body text and ads come out mixed together.\nFeeding this into RAG indexing contaminates search quality. In dynamic environments where JavaScript renders the page, the content itself often drops out entirely. Static wiki-level sources work well enough, but complex commercial site structures hit clear limits.\nAutomation Tools Do Not Replace Preprocessing The expectation was less effort than writing parsing logic from scratch, but inspecting extraction results ended up taking even more time. Metadata and body boundaries blurring was a recurring issue.\nMaintaining separate preprocessing scripts per source domain is the realistic approach. No matter how good the model, low-quality source data corrupts the index itself, and that corruption spreads across all search results.\nKey Takeaways\ndefuddle extraction performance depends heavily on the target site\u0026rsquo;s adherence to semantic HTML On sites with poor SEO optimization, distinguishing body from noise (ads, menus) is difficult When building RAG, a separate preprocessing stage to validate auto-extracted results must be designed Related Posts\nSilver Layer — Lifting Bronze to an Analyzable State\r— Data cleansing principles and quality gates What Is GEO — AI Citation Strategy Beyond SEO\r— The importance of semantic HTML and structured data RAG Quality and Markdown Conversion Tools for Data Preprocessing\r— Document preprocessing with MarkItDown Source\nhttps://share.google/8V29VWarTG9YMxXI7\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-defuddle-web-markdown-extraction-seo-limitations/","summary":"Tried extracting web data for a RAG pipeline with defuddle and found results vary wildly by site structure. 
On sites where semantic HTML has collapsed, body text and ads mix together, and in dynamic rendering environments the content itself vanishes.","title":"The Unexpected Walls When Converting Web Pages with defuddle"},{"content":"Letting an agent handle planning end-to-end leads to production incidents where it loops and stalls. The guide published by OpenAI addresses this head-on. It starts from the premise that an agent is not an intelligent being but an automation program that uses tools.\nPredictability Before Autonomy Do not hand over the entire flow to the model. Structure it like a state machine and explicitly guide the system to make decisions at each step. A system without control causes incidents in production.\nTool calls follow the same principle. Just passing an API spec is not enough. Input data formats must be strictly constrained, and when errors occur, the cause is fed back to the model. Connecting errors to a retry loop instead of just surfacing them increases the success rate. Hallucination issues can be significantly reduced just by adding a few examples to the instructions.\nDesign the Memory Structure Before Plugging In RAG Plugging in RAG does not immediately improve performance. As conversations grow, resource consumption rises, and without filtering for only relevant data, response speed just gets slower. Separating fixed information (user profiles, etc.) from variable information (current conversation) and managing them independently reduces token costs.\nDo not defer the evaluation framework. Before modifying prompts, establish quantitative metrics that distinguish success from failure. Running hundreds of test cases repeatedly and accumulating scores is what reveals bottlenecks.\nThe Model Is a Component Tasks requiring precise calculation or strict specifications belong to external code or specialized libraries. Cramming every rule into a single long instruction makes maintenance impossible. 
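The retry loop that feeds tool-call errors back to the model can be sketched minimally. The flaky model and JSON validator below are toy stand-ins invented for illustration, not anything from the guide itself:

```python
import json

def call_with_error_feedback(model_fn, request, validate, max_tries=3):
    """Feed the failure cause back into the next attempt instead of
    just surfacing the error, so the model can correct its input."""
    feedback = ""
    for _ in range(max_tries):
        output = model_fn(request + feedback)
        try:
            return validate(output)
        except ValueError as err:
            feedback = f"\nPrevious attempt failed: {err}. Fix the format."
    raise RuntimeError("tool call failed after retries")

def must_be_json(text):
    """Strictly constrain the input data format, as the guide advises."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err

def flaky_model(prompt):
    """Toy model: emits malformed output until it sees error feedback."""
    return '{"amount": 42}' if "failed" in prompt else "amount: 42"

result = call_with_error_feedback(flaky_model, "Extract the amount.", must_be_json)
```

The first attempt fails validation, the error text is appended to the prompt, and the second attempt succeeds — the loop converts a surfaced error into a recovered call.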
Splitting functions and ensuring each module performs a clear, single role is the entire point.\nKey Takeaways\nDeveloper-defined explicit workflow control is more stable in production than autonomous planning A loop structure that feeds tool call errors back to the model raises the agent\u0026rsquo;s success rate Building quantitative evaluation metrics is a prerequisite that should precede prompt engineering Related Posts\nWhy DataNexus\r— The semantic gap problem and an ontology-based approach How We Chose These Four Open-Source Tools\r— Tech stack selection for agent systems Source\nhttps://share.google/OobJU2T2JLz7gxlim\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-openai-agent-building-practical-guide/","summary":"Letting an agent handle planning end-to-end causes it to loop and stall in production. The OpenAI guide addresses this with explicit workflow control.","title":"Practical Agent Design Lessons from the OpenAI Guide"},{"content":"When you hand off work to AI agents, the early results are inconsistent no matter how carefully you craft the prompts. More often than not, the problem is not model performance but the environment the agent operates in. Without a proper folder structure and SOPs, the agent has to re-establish context from scratch every time, and output formats drift with each run.\nWhat Changes When You Align Folders to the Org Chart A week after deploying the agent, its analytical ability was decent, but outputs were saved in random directories and the format changed with every request. The folder structure was reorganized to mirror the actual org chart \u0026ndash; executive office, chief of staff, and each team forming the main branches, with manuals, tools, data, and output subfolders fixed underneath. Once the skeleton was in place, the chief of staff no longer had to explain everything from scratch each time work was delegated to the sales team. 
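The fixed skeleton can be created in a few lines. This is a minimal sketch of the pattern described above — the team and subfolder names follow the post, but the exact layout is illustrative:

```python
from pathlib import Path
import tempfile

# The four fixed subfolders the post pins under every team branch.
SUBFOLDERS = ("manuals", "tools", "data", "output")

def build_org_tree(root, teams):
    """One branch per team off the org chart, each with the same fixed
    subfolders, so an agent always finds context in the same place."""
    for team in teams:
        for sub in SUBFOLDERS:
            Path(root, team, sub).mkdir(parents=True, exist_ok=True)

root = tempfile.mkdtemp()
build_org_tree(root, ["executive_office", "chief_of_staff", "sales"])
```

Because every branch has an identical shape, a delegated task needs only the team name to locate its manuals and drop its output in a predictable place.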
Guidelines on how to work were already defined in each team\u0026rsquo;s folder.\nWhen a sales review was assigned, a superficially polished report came back. It lacked field terminology and decision criteria. The agent was fed internal chat logs and documents, then tasked with designing the organization\u0026rsquo;s own standard operating procedures. This was not about training the model on data \u0026ndash; it was about codifying how work gets done.\nSOP Quality Determines Output Quality When the sales agent brought back a draft, feedback went back and forth. Metric calculation logic was revised, and raw number lists were reshaped into issue-driven messages. Once it was clear which data tables to reference and which lens to interpret through, quality became consistent. Now a single command runs variance checks and anomaly extraction without human involvement.\nHumans can muddle through even with sloppy manuals. Machines do exactly what is written. Each morning starts with an integrated briefing from the chief of staff.\nKey Takeaways\nAligning folder structure to the org chart lets agents find context without repeated explanations Codifying domain knowledge and work methods as SOPs matters more for quality than prompt tuning Specifying reference tables and interpretation perspectives keeps outputs consistent Related Posts\nHow We Chose These Four Open-Source Tools\r— A systematic approach to architecture decisions Practical Agent Design from the OpenAI Guide\r— Agent modularization and workflow control Source: https://www.linkedin.com/posts/leekh929_ai-\u0026hellip;\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-ai-agent-organization-sop-folder-structure/","summary":"When you first deploy AI agents, outputs land in random folders and formats vary every time. 
Aligning folder structure to the org chart and codifying SOPs is what makes quality consistent.","title":"Directory Design and SOPs That Make or Break AI Agent Organizations"},{"content":"When first setting up Claude Code, many people freeze in front of the plugin list. With dozens of plugins available, there are no obvious criteria for what to enable. Plugin Advisor is a tool that cuts that decision cost.\nStarting with a Preset Pack Changes Everything Instead of enabling everything and hoping for the best, you pick a Preset Pack that matches your project\u0026rsquo;s nature. Applying a preset immediately catches missing packages in your local environment.\nThe pre-copy checklist is the core feature. It prevents the scenario where you run code without an API key or environment variable and hit a runtime error. In environments where a single mismatched DDL definition or endpoint address can halt an entire pipeline, start-stage validation dramatically reduces debugging time. Mistakes like running a deploy script with an expired token are also caught at this step.\nHow the Setting Plan Helps with Connector Management Plugin Advisor\u0026rsquo;s Setting Plan produces a step-by-step tailored guide. It records failure points and feeds them as input to the next design stage. As the number of connectors grows, this structure has a real impact on working speed.\nTo run Claude Code as an agent, the configuration foundation must come first. For data engineering environments managing numerous connectors, this tool serves as a checklist.
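The pre-flight idea itself is tool-agnostic. As a hedged sketch — the variable names are made up and this is not Plugin Advisor's implementation — a checklist that reports missing configuration before anything runs:

```python
import os

def preflight(required_vars, env=None):
    """Return the names of required settings that are absent or empty,
    so a run can be refused before it hits a runtime error."""
    env = os.environ if env is None else env
    return [name for name in required_vars if not env.get(name)]

# Hypothetical requirements for a connector; real names vary per project.
missing = preflight(["API_KEY", "DB_URL"], env={"API_KEY": "sk-test"})
```

A non-empty result halts the run with an actionable list instead of letting a pipeline fail halfway through on an unset variable.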
You can focus on implementing logic on top of a lean configuration with only what is needed.\nKey Takeaways\nStarting with a Preset Pack eliminates the decision cost of toggling plugins one by one The pre-flight checklist catches missing environment variables and expired tokens before execution The Setting Plan records failure points and auto-feeds them into the next configuration stage Source\nhttps://plugin-advisor.vercel.app/\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-claude-code-plugin-advisor-efficiency/","summary":"If you have ever stalled in front of the plugin list when first setting up Claude Code, Plugin Advisor narrows that entrance. Preset Packs and pre-flight checklists catch runtime errors before they happen.","title":"Cutting Through Claude Code Setup Confusion with Plugin Advisor"},{"content":"Turning a single report into a slide deck takes half a day, and it is the formatting \u0026ndash; not the content organization \u0026ndash; that eats most of that time. When a single AI model tries to handle both structuring and formatting, quality drops on one side or the other. Splitting the work across NotebookLM, Claude, and Gemini in a staged pipeline solves this problem.\nNotebookLM Structures, Claude Formats Upload the full report to NotebookLM and it classifies issues, scenarios, and implications while extracting headlines for each section. The key at this stage is embedding constraints upfront when requesting the initial draft.\n\u0026ldquo;Write 7 pages in English. Keep headlines and governing subtitles identical, no periods at the end of sentences. White background with blue as the accent color, polished with pictograms and infographics.\u0026rdquo;\nBaking in fine details like removing periods and fixing background color from the start leaves almost nothing to fix by hand later. The difference is larger than you might expect.\nApplying internal formatting is Claude\u0026rsquo;s turn. 
Feed it an existing report PDF or image, and it extracts fonts, section tab layouts, and page number positions into a style sheet.\n\u0026ldquo;This page I just created is the final version. Extract it into a style sheet I can reuse next time. Also create a prompt template.\u0026rdquo;\nPaste this style sheet into the Claude PPT add-in and layouts auto-arrange. The repetitive work of manually dragging text boxes around disappears.\nHow to Handle Image Cropping and Cover Pages To extract specific parts from NotebookLM-generated images, use PowerPoint\u0026rsquo;s shape merge feature. Draw a freeform shape over the desired area via [Insert] \u0026gt; [Freeform Shape], Shift-click both the image and the shape, then apply [Shape Format] \u0026gt; [Merge Shapes] \u0026gt; [Intersect]. Only the clean object remains. No separate design tool needed.\nGemini handles the cover page. Feed it the existing draft image as a reference and request a polygon-style transformation for a result that captures the overall mood.\nKey Takeaways\nNotebookLM handles structuring and drafts while Claude extracts formatting and injects it into the PPT add-in, nearly eliminating manual layout work Embedding fine details like period removal and color specs into prompts from the start significantly cuts post-editing time Extracting a style sheet once lets you reuse the same formatting for subsequent reports Source\nhttps://www.youtube.com/watch?v=q8C8IrPYulQ\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-ai-pipeline-business-slide-automation/","summary":"Connecting NotebookLM, Claude, and Gemini as a pipeline to turn reports into slides. 
A single model struggles to handle structuring and formatting simultaneously \u0026ndash; splitting roles produces better results at each stage.","title":"Automating Business Reports with a Multi-AI Pipeline"},{"content":"With LangChain\u0026rsquo;s transition to v1, a significant portion of legacy chain-based example code has become outdated. Controlling complex flows with chains also leads to tangled code quickly. This repository by baem1n bridges that gap with dozens of Korean-language Jupyter notebooks.\nHow to Design State and Nodes in LangGraph LangGraph defines flows around State and Node. It makes explicit how data moves through each stage and branches paths based on conditions. It sounds abstract in words, but running the Python code in a notebook makes the structure click fast.\nSet up the environment with the uv package manager and execute nodes one by one \u0026ndash; you start to see the principle of data flowing between nodes. Advanced RAG flows that rewrite questions or adjust priorities are also covered step by step, ready to transplant into production projects.\nManaging Prompt Patterns as Modules The data analysis examples show a flow where an LLM generates a query, executes it in a sandboxed environment, and continues to visualization. Using the Deep Agents harness to separate recurring prompt patterns into modules makes it easier to maintain conventions in team-scale development.\nPre-recorded execution results are included, so you can trace how a bot selects tools and operates without spinning up a large model yourself. 
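The State/Node idea can be miniaturized outside LangGraph entirely. The sketch below is not LangGraph's API — just plain functions over a shared state dict plus one conditional edge, to show the shape of the flow; the node names and toy corpus are invented:

```python
def rewrite_question(state):
    """Node: normalize the question (the 'rewrite' step in advanced RAG)."""
    state["question"] = state["question"].strip().rstrip("?") + "?"
    return state

def retrieve(state):
    """Node: look up context. A dict stands in for a vector store here."""
    corpus = {"what is langgraph?": "a library for stateful agent flows"}
    state["context"] = corpus.get(state["question"].lower(), "")
    return state

def route(state):
    """Conditional edge: branch on whether retrieval found anything."""
    return "answer" if state["context"] else "fallback"

def answer(state):
    state["answer"] = f"Based on context: {state['context']}"
    return state

def fallback(state):
    state["answer"] = "No relevant documents found."
    return state

NODES = {"rewrite": rewrite_question, "retrieve": retrieve,
         "answer": answer, "fallback": fallback}

def run(state):
    """Fixed edges first, then the conditional edge picks the last node."""
    for name in ("rewrite", "retrieve"):
        state = NODES[name](state)
    return NODES[route(state)](state)

result = run({"question": " What is LangGraph "})
```

Data visibly flows through each node and the branch is decided from state — the same principle the notebooks demonstrate with the real library.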
This saves considerable time when trying to understand the internal mechanics.\nKey Takeaways\nRunning LangGraph State/Node design as Python code step by step makes non-deterministic flow control concepts click fast Managing dependencies with uv cuts environment setup time and lets you dive straight into practice Modularizing prompt patterns with the Deep Agents harness keeps team development conventions consistent Related Posts\nHow We Chose These Four Open-Source Tools\r— Agent orchestration tech stack selection Practical Agent Design from the OpenAI Guide\r— Predictability and control in agent design Source: https://www.linkedin.com/posts/baem1n_github-baem1nlangchain-langgraph-deepagents-notebooks-activity-7435985777458561024-TULI?utm_source=share\u0026amp;utm_medium=member_android\u0026amp;rcm=ACoAAAGPfasB-djMifTErXTP5V7RQzL6YbO5POo\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-langchain-v1-langgraph-deepagents-guide/","summary":"With LangChain\u0026rsquo;s move to v1, legacy chain code has become outdated. A repository of Korean Jupyter notebooks walks through LangGraph\u0026rsquo;s State and Node design step by step.","title":"Hands-On Notebooks for the Updated LangChain Ecosystem"},{"content":"Starting every morning at 9 AM by manually sifting through a flood of numbers is an exhausting ritual. Developers working in finance or investment know all too well how much repetitive data collection eats into their resources. Macro-Pulse was born to strip away this inefficiency. It goes well beyond simply scraping numbers \u0026ndash; the attention paid to stable system operations is evident throughout.\nThe overall architecture is focused on practicality. Built on Python, it fetches a wide range of financial data in real time \u0026ndash; not just domestic and international market indices, but also interest rates and commodity price volatility. 
Since plain text hurts readability, the system also captures heatmap screenshots that visualize market conditions at a glance. A single Telegram message delivers the morning market pulse in one shot.\nThe choice of uv as the package manager is noteworthy. Adopting a tool that has been gaining traction for its speed, it reduces the hassle of environment setup. Test coverage is equally thorough \u0026ndash; logic verification and external communication are rigorously separated to boost operational stability. The approach of adjusting test scope based on environment variables reflects real-world operational thinking.\nTo minimize infrastructure costs, it uses GitHub Actions as the scheduler. Workflows run at set times, and the results are published to GitHub Pages. This demonstrates that a stable dashboard can be maintained without any paid servers. Execution history is retained for rapid incident response, and notification features further reduce the operational burden.\nThe ability to flexibly adjust report format through a single configuration file is another strength. Output items and section ordering can be modified without touching code. Tailoring the report layout to market conditions or selectively including specific data is straightforward. Sensitive information is managed through secure variables, and detailed guidelines are well documented.\nRunning it locally, browser configuration turned out to be the trickiest point. Differences between server and local environments frequently cause screenshot capture to malfunction. The decision to pre-configure the browser inside a container is an excellent choice for preventing this. The internal logic is also cleanly separated between data acquisition and processing, making it easy to read. 
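The configuration-driven layout described above can be sketched in a few lines. The config shape here is a guess at the pattern, not Macro-Pulse's actual schema:

```python
import json

def render_report(config_json, sections):
    """Assemble the report from whatever sections the config lists, in
    config order -- changing layout means editing config, not code."""
    config = json.loads(config_json)
    return "\n".join(sections[name]
                     for name in config["sections"] if name in sections)

# Illustrative section bodies; a real run would fill these from collectors.
sections = {"indices": "KOSPI +0.4%",
            "rates": "UST10Y 4.2%",
            "commodities": "WTI -1.1%"}
report = render_report('{"sections": ["rates", "indices"]}', sections)
```

Reordering or dropping a section is a one-line config edit, and unknown names are silently skipped rather than crashing the morning run.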
An option to test functionality without actually sending messages significantly improves developer convenience.\nThe project provides a container-based runtime infrastructure to ensure it can be executed anywhere without constraints. Mounting configuration files and sharing output follows standard development workflows. Running the full test suite and confirming how robustly the code is written reveals the structural completeness.\nAutomating repetitive tasks frees up time to focus on what truly matters: designing core logic and conducting analysis. Deciding what to do next with the collected data is now the human\u0026rsquo;s job. The hours once spent filling spreadsheets can now be devoted to writing more valuable code. Next, we plan to explore architectures that extract deeper insights from the collected metrics.\nKey Takeaways\nCombining GitHub Actions and Cron enables building automated data pipelines with zero server costs. uv package manager and Docker ensure consistency across development and deployment environments. JSON-based configuration files allow flexible management of report layouts and collection items without any code changes. Source: https://github.com/yeseoLee/Macro-Pulse\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-31-automation-macro-report-telegram-bot/","summary":"Examining an automated reporting system built with Python and GitHub Actions to escape the tedium of manually checking macroeconomic indicators every day.","title":"Automating Tedious Metric Collection with Python and GitHub Actions"},{"content":"Imagine it is the night before a morning meeting and you have zero presentation materials in hand. Just filling in the content is overwhelming enough, and worrying about design on top of that quickly becomes a headache. Even leaning on tools like Cursor or Claude Code yields inconsistent results \u0026ndash; sometimes pulling in the wrong library and breaking the build entirely. 
Fine-tuning prompts alone is not enough to guarantee consistent quality.\nmake-slide, released by Kuneosu, offers an alternative precisely at this pain point. Instead of making the agent figure out design from scratch, it guides the agent to reference pre-defined theme templates. After running an npx command to complete the initial setup, a guideline file is generated at a specific path, and the model reads this file to start generating HTML code according to established rules.\nThis approach dramatically reduces unnecessary overhead. Rather than describing layouts verbally in painstaking detail, you simply instruct the agent to refer to the reference data at a given path. The finished output is extracted as a single file that runs in any browser without additional library installations. Navigating presenter notes via keyboard shortcuts or printing the document during the actual presentation flows seamlessly as well.\nThe workflow is flexible. Whether it is a simple topic, specific text, or a full script \u0026ndash; the agent analyzes any form of input and proposes an overall structure. This includes selecting the layout style that best fits the context, choosing from options like centered or split arrangements. Once the user reviews and approves the proposed outline, the actual implementation phase begins.\nThe entire workflow is meticulously designed. Instead of piling heavy theme files locally, it fetches them from an online repository only when needed, keeping the workspace clean. Code highlighting is built in by default, making it ideal for engineers sharing technical content with clearly rendered source code.\nThe range of available styles is broad. From themes with a technical aesthetic to polished business formats, you can choose what fits the occasion. Beyond automatically searching and placing high-quality images, there are also rules for converting output into office-software-compatible formats. 
Browsing the full list of options through the available commands makes it easy to find the optimal configuration for your needs.\nRetail or financial companies looking to establish an internal design system can add custom themes and operate standardized templates. As long as reference files and guidelines are in place, the agent strictly adheres to them. This ultimately serves as a useful tool for narrowing the visual gap between design and development teams.\nThis goes beyond a simple authoring tool. It demonstrates a methodology for teaching automation systems domain-specific knowledge and design principles. Once linked in a configuration file, a single command wraps up complex work. It offers the satisfaction of escaping the tedious loop of copying and pasting content by hand.\nThanks to its data-driven structure, fine-grained modifications are straightforward as well. Bar charts and metric cards can be visualized without any additional libraries. The files themselves are lightweight enough to upload to an online repository and share on the web without issues.\nIn the next article, we plan to discuss automation techniques for visually unpacking relationships within data.\nKey Takeaways\nBy having AI reference pre-defined design templates instead of making design decisions independently, both token efficiency and output consistency are achieved. The single HTML file output approach eliminates dependency issues and delivers universal portability \u0026ndash; presentations work anywhere with a browser. Leveraging agent-specific guideline systems like .claude/skills/ to automate complex workflows with a single command is a proven and effective pattern. 
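The single-file output format is worth seeing concretely. A minimal sketch — not make-slide's actual template, just the principle of emitting one dependency-free HTML document:

```python
# Double braces escape literal CSS braces for str.format below.
SLIDE_TEMPLATE = """<!DOCTYPE html>
<html><head><style>
section {{ width: 100vw; height: 100vh; }}
</style></head><body>
{slides}
</body></html>"""

def build_deck(titles):
    """Emit one self-contained HTML file: no external libraries, so the
    deck opens in any browser and can be shared as a single artifact."""
    slides = "\n".join(f"<section><h1>{t}</h1></section>" for t in titles)
    return SLIDE_TEMPLATE.format(slides=slides)

deck = build_deck(["Intro", "Architecture", "Demo"])
```

Everything — markup, styling, content — travels in one string, which is exactly what removes the dependency and portability problems the post credits the tool with solving.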
Source: https://github.com/Kuneosu/make-slide\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-30-ai-eijeonteuwa-make-slidero-guhyeonhaneun-gopumjil/","summary":"Analyzing a workflow where an AI agent trained on design principles generates polished presentation materials as a single HTML file.","title":"Automating High-Quality HTML Slides with AI Agents and make-slide"},{"content":" I\u0026rsquo;ll get the most important n8n fundamentals into your head within an hour. -\nRepeatedly flipping through API documentation and manually sending requests eventually brings a certain sense of futility. Even trying existing automation tools, the lack of flexibility quickly reveals their limits. n8n is a solid choice for breaking through these constraints. Self-hosting it on your own server cuts operating costs, and JavaScript support enables fine-grained control. It functions as an engine for implementing complex logic, going well beyond simply connecting lines.\nThe first priority is understanding that data flowing between nodes takes the form of arrays. Miss this, and loops will run erratically or fire only once. To process individual items from a result set that arrives as a list, a data-splitting step must come first. When you need to reassemble scattered data, an aggregation node restores order.\nExpressions available in the mapping panel are more powerful than they first appear. Beyond basic value extraction, you can apply conditionals and text manipulation logic on the fly. If regex gets tangled up, stepping into a code node to write explicit scripts is the wiser move. Long, convoluted expressions are hard to read, but clearly written scripts pay dividends during future reviews.\nCredential management should be handled through a dedicated menu for safety. Entering sensitive keys directly inside a workflow definition is a practice to avoid. Separating development and production server URLs through environment variables is also essential. 
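That environment separation can be made concrete with a tiny resolver. The variable name and URLs below are illustrative, not an n8n convention:

```python
import os

def api_base_url(env=None):
    """Resolve the server URL from the environment instead of
    hard-coding it in the workflow, so a dev run cannot hit production."""
    env = os.environ if env is None else env
    stage = env.get("STAGE", "dev")  # default to the safe environment
    urls = {"dev": "https://dev.api.example.com",
            "prod": "https://api.example.com"}
    return urls[stage]

dev_url = api_base_url(env={})
prod_url = api_base_url(env={"STAGE": "prod"})
```

Defaulting to the dev endpoint when the variable is unset is the deliberate design choice: a forgotten setting degrades to a harmless test call rather than a production write.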
There was a time when mixing test data with production while connecting a retail API caused real headaches \u0026ndash; proper environment separation alone could have prevented it.\nAutomation starts with triggering at the right moment. Use webhooks for immediate reactions to external signals; use the schedule feature for actions that need to fire at set times. When periodically fetching data, recording the last-processed position is critical to avoid duplicates. Storing the checkpoint in a database or leveraging built-in state management ensures the flow continues reliably.\nBuilding scenarios for exception handling is non-negotiable. Network latency or invalid input can halt the entire process at any time. Setting up a dedicated error workflow allows graceful recovery under failure conditions. Configuring messenger notifications on failures significantly speeds up response time. Developing the habit of regularly reviewing execution logs also shortens root-cause analysis.\nWhen operating in containerized environments, storage path configuration demands careful attention. A careless oversight can mean losing painstakingly built workflows on restart. Managing the configuration file containing database connection details should not be neglected either. While the runtime footprint is light, memory allocation should be checked in advance when handling large files to prevent unexpected shutdowns.\nFor handling images, the dedicated binary module proves useful. Receiving data and saving it or transferring it to cloud storage proceeds quite smoothly. Occasionally, file names get garbled or formats go unrecognized \u0026ndash; specifying header information explicitly resolves most of these issues. Working on a financial-sector project that involved building a receipt processing pipeline, I recall hitting exactly this snag.\nn8n can be leveraged as more than simple automation \u0026ndash; it works as a lightweight backend server. 
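The duplicate-avoidance pattern mentioned earlier, persisting the last-processed position between runs, can be sketched as a plain function. This is a generic Python illustration, not n8n's built-in state API; the record shape is assumed to carry an increasing `id`.

```python
# Hypothetical sketch of duplicate-safe incremental fetching:
# remember the last processed id and only handle newer records.
def process_new_records(records, checkpoint):
    """records: list of dicts with an increasing 'id' field;
    checkpoint: last id already handled (state you persist)."""
    fresh = [r for r in records if r["id"] > checkpoint]
    for r in fresh:
        pass  # handle each new record here
    # advance the checkpoint only when something was processed
    new_checkpoint = fresh[-1]["id"] if fresh else checkpoint
    return fresh, new_checkpoint
```

In n8n itself, the checkpoint would live in a database table or the built-in workflow static data, as described above.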
Careful thought is needed to separate each step and modularize logic so that it does not become tangled. Wrapping frequently used patterns into sub-workflows for reuse boosts efficiency. As scale grows, the number of management points multiplies, but building a solid foundation early saves considerable pain later. Focusing only on appearances while neglecting internal design ultimately results in an unmanageable mess.\nNext up, we plan to explore the process of connecting n8n with large language models to build agents suitable for real-world deployment.\nKey Takeaways\nUnderstanding n8n\u0026rsquo;s item-list data structure is the key to mastering loops and data processing. Using Code Nodes for explicit scripting rather than complex expressions improves long-term maintainability. Securing operational stability through Error Workflows and proper Credentials management is essential for production use. Source\nhttps://youtube.com/watch?v=y9u1IdDYHZQ\u0026amp;si=n_kmaaX4HH9MHwiS\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-30-n8n-weokeupeulrou-hyoyuleul-nopineun-deiteo-gujo-s/","summary":"Addressing data bottlenecks encountered during API integrations and applying n8n\u0026rsquo;s core mechanics to real-world practice.","title":"Data Structure Design for Efficient n8n Workflows"},{"content":"When pulling PDF data into an analysis model, garbled text is all too common. Mangled text is the prime culprit for clouding AI judgment. Finding a reliable tool is no small task either, but a recently released library is alleviating the chronic pain points of the preprocessing pipeline.\nIt goes beyond simply extracting characters \u0026ndash; it faithfully preserves table and heading structures in Markdown format. A recent major update brought significant architectural changes, shifting from a model that accepted file paths directly to one that processes stream data. 
This eliminates the need for temporary files, making it advantageous for conserving server resources. Installation options have also become more flexible: you can configure all features at once or selectively include only the components you need. A welcome change for large-scale infrastructure where management efficiency is paramount.\nThe scope of supported formats is remarkably wide, covering not just word processor files but video subtitles and audio files as well. Images within documents are interpreted by connecting to vision capabilities, and scanned documents are cleanly processed through optical character recognition. Structured text is the most LLM-friendly format, making it effective at reducing computational costs. The process of stripping away noise and retaining only core data is handled smoothly.\nExecution is as simple as a single command or a short code snippet. Since it supports external interface specifications, it works well for connecting to desktop analysis apps for real-time data inspection. Linking it with document intelligence features from major cloud services can further elevate processing capacity. The library handles character recognition on its own without additional installation complexity, simplifying the system deployment process.\nWhen designing complex pipelines, lightweight utilities like this make an excellent alternative. There is no need to deploy heavy frameworks when something this lean will suffice. A runtime environment above certain specifications is recommended, and running it in an isolated environment ensures stability. Unlike previous versions, it now requires byte-level data as input, so developers accustomed to the older approach should review their integration code.\nThe converted output is in an optimal state for analysis tools to consume immediately. The focus is on maximizing machine comprehension rather than visual flair. This plays a pivotal role in enabling AI to grasp the full context. 
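In code, the byte-level input requirement described above means preparing a binary stream before calling the converter. A minimal sketch follows; the `MarkItDown` class and `convert_stream` method named in the comment follow the project's README, but treat the exact signature as an assumption.

```python
import io

# The newer API expects byte-level input: read the file yourself and
# pass a binary stream instead of a path (no temporary files needed).
def as_binary_stream(path: str) -> io.BytesIO:
    with open(path, "rb") as f:
        return io.BytesIO(f.read())

# Hedged usage, following the project README (names are assumptions):
#   from markitdown import MarkItDown      # pip install markitdown
#   result = MarkItDown().convert_stream(as_binary_stream("report.pdf"))
#   print(result.text_content)             # Markdown, structure preserved
```

Developers migrating from the path-based versions mainly need this wrapping step in their integration code.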
Tangled cell structures in spreadsheets and hierarchies in presentation files are all unraveled intelligently. Setting up an independent runtime environment to operate a dedicated conversion pipeline is also worth considering.\nWe plan to explore concrete automation use cases built on knowledge graphs using this tool in the near future.\nKey Takeaways\nThe 0.1.0 update eliminates temporary file creation and transitions to binary stream processing, boosting server resource efficiency. Markdown format delivers superior token efficiency and inference accuracy for LLMs compared to HTML or JSON, thanks to higher native comprehension. MCP (Model Context Protocol) server support enables direct integration with LLM desktop apps like Claude. Source: https://github.com/microsoft/markitdown\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-30-deiteo-jeonceoriga-gareuneun-rag-pumjilgwa-makeuda/","summary":"Examining key features of a Python-based document conversion tool released by a global IT company, and proposing efficient approaches to handling data in LLM pipelines.","title":"How Data Preprocessing Determines RAG Quality and Leveraging Markdown Conversion Tools"},{"content":" Even if the MCP fetches the exact statutory text\u0026hellip; interpreting it correctly is a different matter. Especially for things like delegation chains between enforcement decrees and enforcement rules, or transitional provisions in supplementary clauses \u0026ndash; these are areas where AI easily loses context. korean-law-mcp connects the legislative API as a \u0026ldquo;pipe,\u0026rdquo; and when you plug DataNexus in, a \u0026ldquo;brain\u0026rdquo; forms in the middle. For example, when you look up Article 38 of the Occupational Safety and Health Act, what you get now is just the article text. 
With DataNexus\u0026rsquo;s ontology layer, the relationships \u0026ldquo;Article 38 delegates to Enforcement Decree Article XX, which re-delegates to Enforcement Rule Article XX, with 3 related court precedents, and transitional provisions from a recent amendment currently in effect\u0026rdquo; are all mapped as knowledge graph nodes. AI does not need to reason \u0026ndash; it just traverses the graph.\nFetching the original text of legislation accurately remains a challenging task. South Korea has over 1,600 laws and more than 10,000 administrative regulations. All of this information is hidden behind government APIs that offer virtually no developer experience. The Korea Legislation Research Institute\u0026rsquo;s Open API certainly exists, but anyone who has tried using it directly knows the frustration.\nThis is where the korean-law-mcp project enters the picture. The tool neatly organizes the complex access methods of the legislative Open API into 64 structured tools. It provides a wide range of capabilities: fetching article text, automatically resolving abbreviations, converting HWPX attachments to Markdown, and more. Available as either an MCP server or CLI, it integrates smoothly with AI clients like Claude Desktop. In essence, it builds a robust pipeline for legal data.\nYet no matter how good the pipeline is, correctly interpreting the fetched legislative text is an entirely separate domain. In particular, the subtle contexts of delegation chains between enforcement decrees and enforcement rules, or transitional provisions in supplementary clauses, are areas where AI easily drops the ball. Even if you retrieve the exact text of Article 38 of the Occupational Safety and Health Act, it is far from trivial for AI to reason on its own about which other regulations this article connects to.\nThis is where DataNexus plays a critical role. DataNexus adds an ontology layer on top of the raw data that korean-law-mcp provides. 
This layer explicitly links the complex interconnections between statutes as nodes in a knowledge graph. For example, when querying Article 38 of the Occupational Safety and Health Act, it does not simply display the article text. DataNexus\u0026rsquo;s knowledge graph pre-constructs the context: \u0026ldquo;Article 38 is delegated to Enforcement Decree Article XX, which is re-delegated to Enforcement Rule Article XX, with 3 related court precedents, and transitional provisions from a recent amendment currently in effect.\u0026rdquo;\nWith an explicitly connected knowledge graph like this, AI no longer needs to perform complex reasoning. It simply traverses the graph to find the information it needs. If korean-law-mcp is the sturdy \u0026ldquo;pipe\u0026rdquo; that fetches data well, DataNexus is the \u0026ldquo;brain\u0026rdquo; that interprets the data and identifies connections. The combination of the two represents an important step in elevating legal information systems to the next level. It demonstrates the potential for knowledge-graph-based approaches to be applied not just to law, but across a wide range of specialized domains.\nKey Takeaways\nkorean-law-mcp provides 64 tools that dramatically improve accessibility to the Korea Legislation Research Institute\u0026rsquo;s Open API. DataNexus\u0026rsquo;s ontology layer and knowledge graph explicitly connect complex delegation relationships and transitional provisions between statutes \u0026ndash; without requiring AI reasoning. The combination of korean-law-mcp (pipe) and DataNexus (brain) strengthens legal AI\u0026rsquo;s data utilization and interpretation capabilities. 
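The "traverse, don't reason" idea can be illustrated with a toy graph. Every node name and edge below is hypothetical, invented for illustration; it is not actual DataNexus or statutory data.

```python
# Toy knowledge graph: statute nodes with explicit delegation edges.
# All node names are placeholders, not real legal references.
graph = {
    "OSH Act Art.38": ["Enforcement Decree Art.X"],
    "Enforcement Decree Art.X": ["Enforcement Rule Art.Y"],
    "Enforcement Rule Art.Y": [],
}

def delegation_chain(start: str) -> list[str]:
    """Follow pre-built delegation edges instead of asking a model
    to infer the chain from raw article text."""
    chain, node = [start], start
    while graph[node]:
        node = graph[node][0]  # one delegation edge per node in this toy
        chain.append(node)
    return chain
```

With the edges made explicit, retrieving the full delegation chain is a lookup, which is the point of putting an ontology layer in front of the AI.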
Source\nhttps://github.com/chrisryugj/korean-law-mcp\r","permalink":"https://datanexus-kr.github.io/en/curations/2026-03/2026-03-30-beobryul-deiteo-paipeureul-neomeo-noereul-eodda-ko/","summary":"korean-law-mcp streamlines legal data access while DataNexus\u0026rsquo;s knowledge-graph-based ontology layer explicitly maps complex statutory connections, significantly enhancing AI\u0026rsquo;s ability to interpret law accurately.","title":"Legal Data Gets a Brain: The Synergy of korean-law-mcp and DataNexus"},{"content":"\rGEO Optimization Guide — Full Series\n1. What Is GEO - AI Citation Strategy Beyond SEO\r2. Each AI Cites Different Sources ← current post\r3. On-Site GEO Technical Architecture - From Product DB to JSON-LD\r4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site\r5. AEO - Why Coding Agents Read Documentation Differently\rDo Not Take \u0026ldquo;We Show Up in AI\u0026rdquo; at Face Value After deciding to implement GEO, the first thing most people do is ask ChatGPT about their brand. \u0026ldquo;Our company shows up \u0026ndash; great, it\u0026rsquo;s working.\u0026rdquo; And that is where it ends.\nBut ask the same question on Perplexity, and you get a different answer. Google AI Overview gives yet another result. On some platforms, the official site gets cited; on others, blog reviews become the source. Even within ChatGPT itself, citation sources change depending on whether web search mode is on or off.\nYou cannot treat AI search as a single channel. Each platform has its own citation logic and preferred source types.\nPlatform Citation Sources, by the Data Combining analysis data from three providers reveals stark differences across platforms. While Yext, Qwairy, and GrackerAI differ in scale, they all point in the same direction.\nPlatform Preferred Sources Characteristics Gemini Official websites (52%) Based on Google\u0026rsquo;s search index. 
Prioritizes own-domain sites with structured data ChatGPT Directories/listings (49%) High dependency on third-party aggregation sites like Yelp and TripAdvisor Perplexity Reddit/communities (31%) Actively cites real user discussion threads. Also leverages industry-specific directories Google AIO YouTube (23%) Citation share grew 34% over six months. Video content dominance is unmatched Gemini inherits Google\u0026rsquo;s search logic and trusts structured content from official sites. ChatGPT relies on external search layers, making it heavily influenced by directories and listing sites. Perplexity pulls answers from community threads where people actually exchanged opinions.\nWhat is striking is how little overlap exists between platforms. Only 11% of domains are cited by both ChatGPT and Perplexity. Visibility on one does not mean visibility on the other.\nSource and Citation Are Different Concepts When you look closely at AI responses, you will notice two types of attribution.\nOne is the source \u0026ndash; reference links listed at the bottom of the answer. A URL gets included here if it is deemed trustworthy. The other is the citation \u0026ndash; a hyperlink embedded mid-answer that supports a specific sentence as evidence.\nGetting cited requires passing two gates: domain-level trust of the URL, and information trust of the page content. If the domain is reliable but the content structure is hard to parse, it shows up as a source but not as a citation. Conversely, if the content is well-structured but the domain itself is weak, it stays in the source list only.\nConnecting this back to the three GEO principles from the previous article\r:\nIdentity \u0026ndash; The foundation of domain trust. GTIN, Organization schema contribute here Context \u0026ndash; Information quality of the content. Categories, use cases, and variant relationships must be structured for parsing Citability \u0026ndash; The gateway from source to citation. 
JSON-LD and FAQ Schema determine this stage Results Differ Based on Web Search Mode If you ask ChatGPT the same question twice, you may get different answers. The difference lies in whether web search is on or off.\nWith web search off, answers are based on pre-trained data. Content that existed at the time of training becomes the citation candidate. When web search is on, it switches to real-time crawling. At that point, the current state of structured data, robots.txt permissions, and content freshness drive citations.\nFailing to distinguish this when monitoring GEO leads to misjudgment. \u0026ldquo;Our brand shows up fine on ChatGPT\u0026rdquo; \u0026ndash; if you tested with web search off, that result came from pre-trained data. You need to test with real-time crawling enabled to see your current GEO status accurately.\nYouTube Is Surging in Google AIO The most notable change in Google AI Overview\u0026rsquo;s citation patterns is YouTube.\nYouTube is the number one cited domain in AI Overview (Ahrefs Brand Radar). Its share grew 34% in six months. Among social platforms, YouTube ranks second after Reddit in citations (OtterlyAI).\nWhat is interesting is that it is not high-view videos getting cited. Nearly half of YouTube videos cited by AI had fewer than 1,000 views. Plenty had only a few dozen likes. AI looks at information structure, not popularity. Timestamps, chapter divisions, clear titles \u0026ndash; these are what determine citation.\nIn contrast, YouTube citations are nearly nonexistent on ChatGPT and Perplexity. 
The same video content has entirely different value depending on the platform.\nCheck Competitors\u0026rsquo; robots.txt to See Their AI Strategy The fastest way to understand how competitors approach GEO: type https://competitor-domain/robots.txt into your browser.\nE-commerce companies often block AI crawlers to protect product catalogs \u0026ndash; preventing prices, inventory, and product details from being exposed to competitor AI systems. B2B SaaS companies do the opposite and open things up, since visibility in AI search benefits lead generation.\nFor large conglomerates, robots.txt policies often vary wildly across subsidiaries. Some block GPTBot entirely while others allow everything. The result of each subsidiary setting its own policy without a group-level standard.\nIn fact, when surveying the subsidiaries of a major retail group, zero had applied FAQPage Schema, and only one \u0026ndash; a hotel subsidiary \u0026ndash; had adequate AI citations. Even with robots.txt open, if there is no structured data, there is nothing for AI to read.\n# Blocked (GEO impossible) User-agent: GPTBot Disallow: / # Allowed (GEO possible) User-agent: GPTBot Allow: / User-agent: Google-Extended Allow: / User-agent: anthropic-ai Allow: / Restructuring Alone Can Change Citation Rates \u0026ndash; No New Content Needed The most common misconception when starting GEO is that you need to create new content. While new content certainly helps, simply restructuring existing content can already change AI citation outcomes.\nPages with structured data are 36% more likely to appear in AI Overview (GrackerAI), and applying complete Schema pushes the ChatGPT visibility rate up to 80% (Search Engine Land). 
With only basic Schema, it is 20%.\nThings you can do with existing blog posts:\nAdd a 40-60 word TLDR summary at the top Include author bio and expertise credentials Add an FAQ section wrapped with FAQPage Schema Rewrite \u0026lt;meta name=\u0026quot;description\u0026quot;\u0026gt; to be more specific, at least 100 characters Content freshness also matters. More than three-quarters of pages highly cited by Perplexity were updated within the past month. Pages untouched for over three months get pushed down.\nReddit Appearing in Korean Queries Is Not a Coincidence You may have noticed Reddit threads being cited in response to Korean-language queries. It seems odd, but there is a structural reason.\nReddit uses AI translation and hreflang tags to operate in 22 language versions. The infrastructure to match Korean queries is already in place. Reddit accounts for 6.6% of all Perplexity citations and 2.2% in Google AI Overview \u0026ndash; both top-tier. Its share climbs even higher for subjective queries like \u0026ldquo;best XX\u0026rdquo; or \u0026ldquo;XX recommendations.\u0026rdquo;\nThis has significant implications from an Off-Site GEO perspective. There are domains that On-Site structuring alone cannot cover. For query types where AI prioritizes \u0026ldquo;real user opinions,\u0026rdquo; community and review platform influence is substantial.\nOne Strategy Cannot Cover All AI Platforms Each platform trusts different things:\nTrust Basis Description Favored Platform Own-site structuring Schema.org, JSON-LD, FAQ Gemini, Google AIO Third-party listing consistency Directory and review site data alignment ChatGPT Community reputation Reddit, forums, UGC Perplexity Video content structuring YouTube chapters, timestamps Google AIO On-Site GEO is the baseline. 
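As a concrete example of the FAQ structuring listed earlier, a FAQPage JSON-LD block can be generated like this. The `@type` names follow schema.org's FAQPage vocabulary; the question and answer text are placeholders.

```python
import json

# Sketch of a FAQPage JSON-LD block (see schema.org/FAQPage);
# the question/answer strings are placeholder content.
faq_schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Does this product ship internationally?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, to most regions within 5-7 business days.",
            },
        }
    ],
}

# Embed the serialized result in the page inside:
# <script type="application/ld+json"> ... </script>
print(json.dumps(faq_schema, indent=2))
```

Validating the output with a structured-data testing tool before deploying is worth the extra step, since malformed JSON-LD is simply ignored by crawlers.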
Open robots.txt, add JSON-LD, structure FAQs \u0026ndash; and you will see strong results on Gemini and Google AIO in particular.\nTo gain visibility on ChatGPT, you need to additionally ensure consistency across third-party listings. To target Perplexity as well, natural mentions in community discussions become necessary.\nThis series focuses on On-Site GEO. However, depending on which platform you prioritize, Off-Site work may be needed too.\n","permalink":"https://datanexus-kr.github.io/en/guides/geo-optimization/002-ai-citation-sources/","summary":"ChatGPT favors directory listings, Perplexity leans on Reddit, and Gemini prefers official websites. Covering all AI platforms with a single strategy is impossible.","title":"2. Each AI Cites Different Sources"},{"content":"\rGEO Optimization Guide — Full Series\n1. What Is GEO - AI Citation Strategy Beyond SEO ← current post\r2. Each AI Cites Different Sources\r3. On-Site GEO Technical Architecture - From Product DB to JSON-LD\r4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site\r5. AEO - Why Coding Agents Read Documentation Differently\rRanking on Google\u0026rsquo;s First Page Felt Like Enough You worked hard on SEO and landed on Google\u0026rsquo;s first page. Organic traffic naturally increased. So far, a familiar scenario.\nBut lately, the way people search has changed. They type \u0026ldquo;recommend a budget laptop\u0026rdquo; into ChatGPT or look up \u0026ldquo;family-friendly hotels in Seoul\u0026rdquo; on Perplexity. Google AI Overview now drops answers right above the search results.\nClicks are disappearing. Because AI is answering on behalf of the user.\nOf the URLs cited by AI, only 9% rank in Google\u0026rsquo;s top 10 (Ahrefs). High SEO rankings do not guarantee AI citations. 
A separate optimization layer is needed.\nThat layer is GEO (Generative Engine Optimization).\nWhat Is GEO GEO is about making AI search engines like ChatGPT, Perplexity, Gemini, and Google AI Overview cite our content in their answers. It is the work of restructuring data itself so that AI can read it easily.\nSEO was about getting people to click. GEO is about getting AI to name us as a source.\nAspect SEO GEO Goal Drive clicks (Traffic) Get cited in AI answers (Citation) Trust criteria Keyword density, backlink count Identifiability, structured data Recognized by Humans + search engine bots Generative AI models Core techniques Meta tags, content optimization JSON-LD, Schema.org, FAQ structuring KPI Ranking, CTR Mention rate, citation accuracy One thing to be clear about: adopting GEO does not mean abandoning SEO. A solid SEO foundation is what makes GEO work. GEO is built on top of it.\nWhy Now The data shows the tide has already turned.\nChatGPT WAU has surpassed 800 million (OpenAI). South Korea is the world\u0026rsquo;s second-largest paid subscription market, with one in three economically active people using AI. Gartner projects that traditional search volume will decline by 25% by 2026 (Gartner). According to a Capgemini report, more than two-thirds of consumers actually purchase products recommended by AI (Capgemini). Over half of Google searches end as zero-click results, and this ratio will only climb higher once AI Overview is layered in.\nThe conversion side is even more compelling. The purchase conversion rate from AI search traffic is 14.2% \u0026ndash; 5 times higher than traditional Google search (GrackerAI). Revenue per AI-referred visit is also more than 2.5 times higher (Adobe).\nTraffic is declining, yet conversion rates for AI-curated results are actually higher. What drives revenue is no longer how widely you are exposed, but whether AI selects you.\nThe Three Principles of GEO When applying GEO, certain questions keep coming up. 
Can AI distinguish this product from others? Does it understand the purpose and context? And once it reads the content, is the structure there for it to cite the source?\nIdentity - Identifiability AI must be able to clearly distinguish a product or service.\nInternational standard identifiers like GS1 GTIN/GLN are key. For AI to recognize \u0026ldquo;ChocoStick Original\u0026rdquo; and \u0026ldquo;ChocoStick Almond\u0026rdquo; as separate products, each needs its own unique GTIN. If they are bundled under a single representative code, AI cannot tell them apart.\nContext - Contextual Connectivity AI must be able to understand the purpose, relationships, and positioning of a product.\nCategory hierarchies, variant relationships (flavor/size/color), brand-product-SKU structures \u0026ndash; these contexts need to be structured so AI can match the right product to a query like \u0026ldquo;running shoes for men in their 20s.\u0026rdquo;\nCitability - Citation Readiness The content must be structured so that AI can read it and attribute the source.\nJSON-LD, FAQ Schema, and robots.txt configuration fall under this principle. No matter how good the data is, if AI crawlers cannot access it or if the structure is hard to parse, AI will simply skip it.\nHere is the summary table:\nPrinciple Key Question Core Technology Verification Criteria Identity Can AI distinguish this from others? GS1 GTIN/GLN Identifier registration, variant differentiation Context Does AI understand the purpose and relationships? Categories, knowledge graph Metadata quality, cross-channel consistency Citability Can AI read this and cite the source? JSON-LD, FAQ, robots.txt Structured data validity, crawler access permission Invisible GEO vs Visible GEO Implementation splits into two approaches.\nInvisible GEO is JSON-LD inside the \u0026lt;head\u0026gt; tag. It is invisible to users but directly parsed by AI and search engines. It is the most powerful method for boosting AI citation rates. 
However, if your site is an SPA, a transition to SSR (Server-Side Rendering) must come first.\n\u0026lt;!-- Invisible GEO: JSON-LD inside \u0026lt;head\u0026gt; --\u0026gt; \u0026lt;script type=\u0026#34;application/ld+json\u0026#34;\u0026gt; { \u0026#34;@context\u0026#34;: \u0026#34;https://schema.org\u0026#34;, \u0026#34;@type\u0026#34;: \u0026#34;Product\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;ChocoStick Original\u0026#34;, \u0026#34;gtin13\u0026#34;: \u0026#34;8801234567890\u0026#34;, \u0026#34;brand\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Brand\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;K Foods\u0026#34; }, \u0026#34;offers\u0026#34;: { \u0026#34;@type\u0026#34;: \u0026#34;Offer\u0026#34;, \u0026#34;price\u0026#34;: 1500, \u0026#34;priceCurrency\u0026#34;: \u0026#34;KRW\u0026#34;, \u0026#34;availability\u0026#34;: \u0026#34;https://schema.org/InStock\u0026#34; } } \u0026lt;/script\u0026gt; Visible GEO is HTML content inside \u0026lt;body\u0026gt;. FAQ pages, detailed product descriptions, nutrition tables. Both humans and AI read it. The technical barrier is low, so you can start right away.\nItem Invisible GEO Visible GEO Location \u0026lt;head\u0026gt; JSON-LD \u0026lt;body\u0026gt; HTML SEO impact High Moderate AI citation rate High High Implementation difficulty High (requires SSR) Low User experience None (machine-only) Directly visible In practice, both are used. JSON-LD is added for machine parsing, while HTML is laid out for both humans and AI to read.\nAI Already Answers Well \u0026ndash; the Problem Is the Source \u0026ldquo;Recommend a family-trip hotel.\u0026rdquo; I threw this query simultaneously at Genspark, Perplexity, and ChatGPT. All three returned similar answers \u0026ndash; pool information, room prices, breakfast details. Answer quality is already sufficient.\nThe problem is the source. Genspark directly cites Schema data from official websites. Perplexity scrapes from blogs and travel platforms. 
ChatGPT references official sites but loses precision without structured data. The same hotel, yet each AI shows a different price.\nSchema does not guarantee official citations. What it does is make official sites easy for AI to parse, increasing the probability that AI chooses the official source over a blog. That is why On-Site GEO matters.\nDemo - AI Search Comparison What the Research Says A research team from Princeton and Georgia Tech\ranalyzed 10,000 queries, and the results are quite clear:\nContent with explicit sources: AI visibility +40% Content with statistics: +30% Content with repeated keyword stuffing: actually -10% Keyword repetition, which worked in SEO, actually hurts in GEO. AI does not care how many times the same word appears \u0026ndash; it looks at how systematically organized and trustworthy the information is.\nEmpirical evidence on structured data is also accumulating. Brands cited in AI Overview see a 35% higher organic search CTR, and paid ad CTR nearly doubles (Seer Interactive). Adding structured data increases the probability of appearing in AI Overview by 36% (GrackerAI). Sites with complete Schema have an 80% chance of appearing in ChatGPT, compared to 20% with only basic Schema (Search Engine Land).\nContent freshness cannot be overlooked either. More than three-quarters of pages highly cited by Perplexity were updated within the past month. 
Pages untouched for over three months get pushed down.\nOn-Site GEO and Off-Site GEO GEO broadly splits into two domains.\nAspect On-Site GEO Off-Site GEO Definition Making your own site readable and citable by AI Exposing your brand on external sites that AI references Core techniques JSON-LD, Schema.org, SSR, robots.txt, FAQ Reddit, Wikipedia, news outlets, communities Responsibility Dev team / technical organization Marketing / PR / brand strategy This series focuses on On-Site GEO \u0026ndash; the domain where developers can apply changes directly through code.\n","permalink":"https://datanexus-kr.github.io/en/guides/geo-optimization/001-what-is-geo/","summary":"Only 9% of Google\u0026rsquo;s top 10 pages are cited by AI. In an era where SEO rankings no longer guarantee AI citations, we break down the three core principles of GEO and their academic foundations.","title":"1. What Is GEO - AI Citation Strategy Beyond SEO"},{"content":"The Customer Table Has Both SSN and Business Registration Number This is a structure you see all the time in transactional systems. A single customer table mixes individual customer attributes (social security number, date of birth) with corporate customer attributes (business registration number, CEO name). For individual customers, the business registration number is NULL; for corporate customers, the SSN is NULL. A single customer type code column differentiates them.\nWhen data volumes are small, it\u0026rsquo;s not a big deal. When customers number in the tens of millions, the story changes. Columns needed only for individuals take up space in corporate rows, and columns needed only for corporates sit as NULLs in individual rows. As columns grow, the table gets wider and its meaning becomes muddier. 
You can\u0026rsquo;t tell from the DDL alone which columns belong to which customer type.\nSuper-sub typing is the method for sorting this out at the logical model stage.\nSeparating Common and Type-Specific Attributes The principle behind super-sub types is simple. Common attributes go in the super type (Customer), and type-specific attributes go in the sub types (Individual Customer, Corporate Customer).\n[Customer] ← Super type: CustomerID, CustomerName, Contact ├─ [Individual Customer] ← Sub type: SSN, DateOfBirth └─ [Corporate Customer] ← Sub type: BusinessRegNo, CEOName A single CustomerID links the super type to its sub types. The Individual Customer table contains only attributes relevant to individual customers. The wide table full of NULLs disappears.\nThere\u0026rsquo;s another reason to split into sub types. Each sub type can independently form relationships with other entities. Perhaps only corporate customers have a relationship with credit limits, or only individual customers relate to membership tiers. When all relationships hang off a single super type, the meaning of each relationship becomes ambiguous. Splitting into sub types makes \u0026ldquo;which type does this relationship apply to?\u0026rdquo; immediately readable from the model.\nExclusive or Overlapping? When designing sub types, there\u0026rsquo;s one thing you must determine first: whether a single instance belongs to exactly one sub type (Exclusive) or can belong to multiple sub types simultaneously (Inclusive).\nExclusive is overwhelmingly more common. A customer is either individual or corporate. An account is savings, time deposit, or fixed deposit — one of them. A product is physical or digital. A single type code handles the classification.\nInclusive is rare, but missing it means major rework later. Service products are a typical example — a single product might target both B2B and B2C simultaneously. Employee roles work similarly. 
When one person handles both sales and technical support, the \u0026ldquo;Employee Role\u0026rdquo; sub type becomes Inclusive.\nAt the design stage, always double-check: \u0026ldquo;Is this classification truly exclusive?\u0026rdquo; If you build the model assuming Exclusive and then overlapping cases surface, you\u0026rsquo;ll have to rework everything from the type code scheme to the relationship structure.\nOptions When Moving to a Physical Model In the logical model, super-sub types are clean. The choices diverge when converting to a physical model.\nSingle table. Merge the super type and sub types into one table. You end up with the same wide table that was the original problem, but there are no joins, so queries are simple. When type-specific attributes are few, this is a pragmatic choice.\nSeparate tables. Create individual tables for the super type and each sub type. No NULLs and the structure is clear, but to see complete customer information, you need to join the super type with the sub type.\nSub types only. No super type table — just Individual Customer and Corporate Customer tables. Common attributes are duplicated in each. This fits when analysis is fully independent per sub type, but seeing all customers requires a UNION.\nThere\u0026rsquo;s no right answer. It depends on the number of sub types, the volume of type-specific attributes, and query patterns.\nWhen It Feeds Into DW Dimension Design In a DW, this choice directly ties into dimension design. The \u0026ldquo;access paths\u0026rdquo; perspective from Part 2\rbecomes the decision criteria.\nConsider designing a customer dimension. If you split individual and corporate customers into separate dimensions, the fact table gets more FKs, and analysts have to choose which dimension to join every time. 
If you build a unified dimension, you get a wide table with many NULLs — but as discussed in Part 1\r, cloud columnar storage makes the scan cost of NULL columns virtually zero.\nThe deciding factor is the analysis pattern. If individual customer revenue is analyzed by age group and region, while corporate customer revenue is analyzed by industry and revenue size, then the dimension attributes themselves differ — so splitting makes more sense. If most analysis treats all customers as a single axis, a unified dimension is more convenient.\nA common compromise seen in practice is to maintain a unified dimension as the baseline, and add separate views or marts when sub-type-specific analysis is frequent. In cloud environments where storage costs are low, the overhead of redundant storage is minimal.\nThe next post compares the Inmon and Kimball approaches. We\u0026rsquo;ll dig deeper into what was briefly mentioned in Part 1.\n","permalink":"https://datanexus-kr.github.io/en/guides/dw-modeling/004-super-sub-type/","summary":"Super-sub types clarify business classifications at the logical model level. When converting to a physical model, three options emerge — and in a DW, that choice reshapes the entire dimension design.","title":"4. Super-Sub Types — Can a Customer Be Both Individual and Corporate?"},{"content":"Change the Tool, Change the Model When you join a DW project, you usually start by reviewing the existing model. There\u0026rsquo;s one thing that, if left unchecked, will cause problems later: which tool was used to draw the model, and what notation does that tool use.\nOpen an ERwin model in DA# and the relationship lines get interpreted differently. A dashed line that means \u0026ldquo;non-identifying relationship\u0026rdquo; in one tool means \u0026ldquo;optional participation\u0026rdquo; in the other. Both are correct — the notation is just different. 
The problem is that if you don\u0026rsquo;t know this during a model review, people end up having different conversations while looking at the same ERD.\nAn ERD is a shared language read by modelers, developers, and business stakeholders alike. If you don\u0026rsquo;t realize that language has multiple dialects, communication breaks down.\nSame Crow\u0026rsquo;s Foot, Different Interpretations The most widely used ERD notation family is Crow\u0026rsquo;s Foot. It combines symbols at the ends of relationship lines — dash (1), circle (0), and crow\u0026rsquo;s foot (N) — to express cardinality. That much is common.\nThe problem is that there are two schools within the Crow\u0026rsquo;s Foot family.\nThe IE (Information Engineering) approach is the default in ERwin, PowerDesigner, and similar tools. Identifying relationships use solid lines; non-identifying relationships use dashed lines. An identifying relationship means the parent entity\u0026rsquo;s PK is included as part of the child entity\u0026rsquo;s PK.\nThe Barker approach is used in Oracle-family tools and DA#. It looks like the same Crow\u0026rsquo;s Foot, but the symbols carry different meanings.\nSymbol | IE Approach | Barker Approach\nDashed line | Non-identifying relationship | Optional participation (0 or 1)\nSolid line | Identifying relationship | Mandatory participation (exactly 1)\nDash | Exactly 1 | Identifying relationship\nThe meaning of a dashed line is completely different. In IE, a dashed line conveys structural information: \u0026ldquo;the parent PK is not part of the child PK.\u0026rdquo; In Barker, a dashed line conveys a business rule: \u0026ldquo;participation is optional.\u0026rdquo; The same symbol carries information from different layers of abstraction.\nWhen you review a model designed in DA# using ERwin, someone unfamiliar with this difference will read every dashed line as a non-identifying relationship. 
The design intent gets distorted.\nNotations That Don\u0026rsquo;t Use Crow\u0026rsquo;s Foot IDEF1X is a notation that uses circles instead of Crow\u0026rsquo;s Foot to indicate cardinality. No circle means exactly 1, an empty circle means 0 or 1, and a filled circle means 0 or N. Identifying vs. non-identifying is distinguished the same way as IE — solid vs. dashed lines.\nThere are variations. One approach uses filled circles for everything and adds letters like Z (0 or 1) and P (1 or N). Switching ERwin to IDEF1X mode enables this notation.\nThe Crow\u0026rsquo;s Foot family and IDEF1X have fundamentally different symbol systems for expressing cardinality. While confusion within the Crow\u0026rsquo;s Foot family (IE vs. Barker) is about reading the same symbol differently, the difference with IDEF1X is about not being able to read the symbols at all if you don\u0026rsquo;t know them. These are different types of confusion.\nThe Reality on Projects The point isn\u0026rsquo;t to memorize every notation. It\u0026rsquo;s to be aware of the issues that actually arise on projects.\nWhen switching modeling tools, notation conversion happens automatically — but it\u0026rsquo;s not perfect. Subtle expressions may change, and structures like super-sub types (covered later) can be altered entirely. Even within the same IE notation, minor differences exist between tools — like representing two dashes as one.\nAt the start of a project, align on three things:\nDecide which notation you\u0026rsquo;ll use Make sure everyone on the team understands the symbols in that notation Check the tool\u0026rsquo;s help documentation for notation details at least once On projects where this isn\u0026rsquo;t done, model reviews devolve into notation debates. Time that should be spent discussing design gets consumed by \u0026ldquo;what does this line mean?\u0026rdquo;\nCode-Based ERD In cloud DW environments, many teams don\u0026rsquo;t draw ERDs in GUI tools at all. 
They define models in SQL via dbt and express relationships using text-based diagrams like Mermaid or DBML. The advantage is being able to track model change history like code reviews.\nEven when the tool changes, what needs to be expressed stays the same: cardinality, identifying vs. non-identifying relationships, mandatory vs. optional participation. Without understanding these concepts, you can\u0026rsquo;t properly read a model — whether it\u0026rsquo;s built in a GUI tool or text-based notation.\nThe next post covers super-sub types. It\u0026rsquo;s the structure where you split \u0026ldquo;Customer\u0026rdquo; into \u0026ldquo;Individual Customer\u0026rdquo; and \u0026ldquo;Corporate Customer.\u0026rdquo; When moving from logical to physical models, the options diverge — and in a DW, those choices directly shape dimension design.\n","permalink":"https://datanexus-kr.github.io/en/guides/dw-modeling/003-erd-notation/","summary":"Same Crow\u0026rsquo;s Foot, different meaning. A single dashed line means different things in different tools. If you want models to serve as a shared language on your project, start by aligning on notation.","title":"3. ERD Notation — Same Diagram, Different Interpretation"},{"content":"\u0026ldquo;Who put this Unknown record in here?\u0026rdquo; This question comes up at every DW model review.\nOpen the product master table and you\u0026rsquo;ll find a record named \u0026ldquo;Unknown.\u0026rdquo; Same thing in the employee table. If you\u0026rsquo;ve spent your career in transactional systems, this is naturally puzzling. Dummy data in a master table?\nA follow-up question usually arrives: \u0026ldquo;What\u0026rsquo;s this \u0026lsquo;Point-in-Time Sales Rep\u0026rsquo; column in the order fact table? 
That didn\u0026rsquo;t exist in the source order table.\u0026rdquo; For someone new to DW models, this is equally baffling.\nThere\u0026rsquo;s no shortage of material explaining the differences between the two models using keywords — denormalization, star schema, snowflake. A quick search will surface them. The problem is that keywords alone don\u0026rsquo;t explain why the design works this way. You need to start with purpose.\nTransactional Models Guard Integrity The goal of an OLTP data model is clear: maintain data integrity through frequent inserts and updates.\nThis goal determines the model\u0026rsquo;s shape. Relationships between entities are strict. You can\u0026rsquo;t register an employee without a department, you can\u0026rsquo;t create an order without a product, and an order without a customer doesn\u0026rsquo;t exist. Every relationship has preconditions, and those conditions must be met at the moment data is created.\nNormalization is how you enforce this. Reducing redundancy means there\u0026rsquo;s only one place to update, which minimizes the chance of breaking integrity. You register top-level masters (like code tables) first, then stack transactional data on top. The sequence cannot be broken.\nThink of it this way: a grandfather must exist before a father can exist, and a father before a son. These are existence relationships. A person must exist before their actions can be recorded. These are behavioral relationships. OLTP models focus on reflecting all such relational constraints without exception.\nDW Models Design Access Paths DW data models solve a different problem: load all data without omission, and create paths for accessing analytical subjects.\n\u0026ldquo;Access paths\u0026rdquo; is the key concept. Say the analytical subject is order performance. You need to be able to approach it from the employee dimension, the product dimension, or the customer dimension. 
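As a rough sketch of such access paths (Python's sqlite3 standing in for the DW engine; every table and column name is illustrative), the same central fact table can be aggregated through either dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- Illustrative star schema: order facts in the center, dimensions around it.
    CREATE TABLE dim_employee (employee_sk INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE dim_product  (product_sk  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE fct_orders (
        order_id    INTEGER PRIMARY KEY,
        employee_sk INTEGER REFERENCES dim_employee,
        product_sk  INTEGER REFERENCES dim_product,
        amount      INTEGER
    );
    INSERT INTO dim_employee VALUES (1, 'A'), (2, 'B');
    INSERT INTO dim_product  VALUES (10, 'Widget'), (20, 'Gadget');
    INSERT INTO fct_orders   VALUES (100, 1, 10, 500), (101, 2, 20, 300);
""")

# Access path 1: reach the facts through the employee dimension.
by_employee = cur.execute("""
    SELECT e.employee_name, SUM(f.amount) FROM fct_orders f
    JOIN dim_employee e ON f.employee_sk = e.employee_sk
    GROUP BY e.employee_name
""").fetchall()

# Access path 2: reach the same facts through the product dimension.
by_product = cur.execute("""
    SELECT p.product_name, SUM(f.amount) FROM fct_orders f
    JOIN dim_product p ON f.product_sk = p.product_sk
    GROUP BY p.product_name
""").fetchall()

# Different entry points, same underlying facts: grand totals agree.
assert sum(v for _, v in by_employee) == sum(v for _, v in by_product)
```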
Whichever path you take, the results should be the same, and performance should be comparable. The star schema expresses this structure most intuitively.\n[Employee] | [Product] — Order Facts — [Customer] | [Job] Order facts sit at the center, with dimension tables — the access paths — surrounding them.\nSomeone with deep OLTP experience might look at this model and think it\u0026rsquo;s just \u0026ldquo;denormalized OLTP.\u0026rdquo; Because the tool is the same (ERD), the output must be the same kind of thing, right? The tool is the same. The design starting point is not.\nPoint-in-Time Data: An Unfamiliar Concept An OLTP order table has a \u0026ldquo;Sales Rep\u0026rdquo; column pointing to the current sales rep. A DW order fact table has a Point-in-Time Sales Rep — the sales rep assigned at the exact moment the order was placed.\nWhy is this needed? Say a product\u0026rsquo;s assigned sales rep changed from A to B this year. In OLTP, the current rep is B. That\u0026rsquo;s it. In the DW, it\u0026rsquo;s different. \u0026ldquo;I want to see last year\u0026rsquo;s performance under A and this year\u0026rsquo;s under B\u0026rdquo; is a natural requirement. Using historical data like product sales rep history or customer job history, the point-in-time data is constructed at the moment of loading into the order fact table.\nIn OLTP, when an employee leaves, you deactivate them in the master table and move on. In the DW, employees who only existed at a past point in time, and job codes that are no longer valid, must all remain in the master tables. Data needed for historical analysis cannot be missing.\nWhy Unknown Records Exist There\u0026rsquo;s a common situation in DW projects. You want to analyze 10 years of order history, but product master management was sloppy and only recent products remain. The order records have product IDs, but the product table has no matching entries.\nIn OLTP, this situation simply cannot occur. 
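A minimal demonstration of that guarantee, assuming sqlite3 with foreign-key enforcement switched on (the two-table schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when enabled
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES products(product_id)
    )
""")

# An order referencing a nonexistent product is rejected at write time.
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (1, 99)")
except sqlite3.IntegrityError:
    rejected = True  # no product 99 in the master, so the row never gets in
```

The violating row never enters the table; integrity is enforced at the moment the data is generated.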
The design prevents orders from being created without a corresponding product. The DW has a different stance. Historical data that already happened must be loaded as-is.\nAt this point, there are several options:\nReplace the order\u0026rsquo;s product ID with an ID corresponding to \u0026ldquo;Unknown\u0026rdquo; Add an extra analytical product ID column for dual management Insert the unmapped product IDs into the product master at load time, filling remaining attributes with NULL or placeholder values Regardless of the approach, one thing is common: a reference record called Unknown is pre-inserted into the product master table. Since the sales rep for that product is also unknown, an Unknown record goes into the employee table too. Relationships are formally satisfied — but at load time rather than at the time the data originated, through deliberate intervention.\nOLTP modelers may find this uncomfortable. Satisfying relationships with artificial dummy data? Considering the purpose of a DW, it\u0026rsquo;s a rational decision. Analytical data cannot be dropped, and the structure must remain consistent regardless of which access path is used.\nWhen Relationships Are Established Differs Here\u0026rsquo;s the summary.\nOLTP satisfies relationship conditions at the time data is generated. You can\u0026rsquo;t register an employee without a department, and you can\u0026rsquo;t place an order without a customer. If a relationship is violated, the data simply doesn\u0026rsquo;t get in.\nDW aligns relationships at the time data is loaded. If the source has gaps, Unknown fills them. If point-in-time data is needed, it\u0026rsquo;s derived from history. 
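A hedged sketch of the first option above, the Unknown fill at load time (sqlite3 again; the table names and the -1 surrogate are illustrative choices, not a prescription):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE stg_orders  (order_id INTEGER, product_id INTEGER, amount INTEGER);
    -- The product master only kept recent products; an old order row still
    -- references a product that no longer exists in the master.
    INSERT INTO dim_product VALUES (10, 'Widget');
    INSERT INTO stg_orders  VALUES (1, 10, 500), (2, 99, 300);
""")

# Pre-insert the Unknown reference record into the dimension.
cur.execute("INSERT INTO dim_product VALUES (-1, 'Unknown')")

# Load facts, remapping any product_id without a master row to Unknown.
cur.execute("""
    CREATE TABLE fct_orders AS
    SELECT o.order_id,
           COALESCE(p.product_id, -1) AS product_id,
           o.amount
    FROM stg_orders o
    LEFT JOIN dim_product p ON o.product_id = p.product_id
""")

# Every fact row now joins cleanly back to the product dimension.
orphans = cur.execute("""
    SELECT COUNT(*) FROM fct_orders f
    LEFT JOIN dim_product p ON f.product_id = p.product_id
    WHERE p.product_id IS NULL
""").fetchone()[0]
```

No history is dropped and every access path still resolves; the intervention is explicit, and it happens at load rather than at origination.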
Relationships are established through deliberate intervention during the loading process.\nAspect | OLTP | DW\nPurpose | Transaction processing, integrity guarantee | Analytical data loading, access path design\nWhen relationships are met | At data generation | At data loading\nMissing data | Not allowed | Handled with Unknown\nHistory management | Current state focus | Point-in-time data generation\nDesign direction | Normalization (minimize redundancy) | Access-path-centric (analytical convenience)\nOnce you understand this difference, most of the \u0026ldquo;why did they do it this way?\u0026rdquo; questions about DW models resolve themselves.\nSame Story in the Cloud Era The previous post\rcovered how physical constraints have changed in cloud DW environments. Storage got cheaper, columnar storage shifted join patterns, and we moved to the ELT paradigm.\nPhysical constraints have changed, but the difference in purpose between OLTP and DW remains. Whether you\u0026rsquo;re using BigQuery or Synapse, you still need to design access paths for analytical data. Situations requiring Unknown records and requirements for point-in-time data management don\u0026rsquo;t disappear just because the infrastructure changed.\nWhat has changed is that history management can now be more aggressive. With lower storage costs, maintaining dimension history via SCD Type 2 is less burdensome. SCD type-specific design approaches will be covered later in this series.\nThe next post examines ERD notation differences. Even when the same relationship is drawn, the interpretation changes depending on whether it\u0026rsquo;s Crow\u0026rsquo;s Foot or IDEF1X. Without understanding the notation, two people looking at the same model will be having different conversations.\n","permalink":"https://datanexus-kr.github.io/en/guides/dw-modeling/002-oltp-vs-dw-model/","summary":"Even when the ERDs look similar, the design philosophies are completely different. OLTP is about transactional integrity; DW is about analytical access paths. 
That difference creates unfamiliar things like Unknown records and point-in-time data.","title":"2. OLTP vs DW Models — Different Purpose, Different Design"},{"content":"Why Are We Suddenly Talking About External System Integration Through Post 3, I wrote from a single perspective: NL2SQL accuracy. \u0026ldquo;How well can we inject business context into the LLM\u0026rdquo; was the criterion for every decision.\nFrom here, a second perspective enters the picture. Platform.\nIf DataNexus ends up as an NL2SQL tool that only runs inside one client, external integration isn\u0026rsquo;t needed. The DozerDB graph works fine on its own. The problem is that Post 1 already laid out a bigger vision \u0026ndash; multi-tenancy per group company, Data Moat, temporal knowledge graph. All of those words sit on the premise that DataNexus needs to become a platform where multiple organizations exchange ontologies, not just a standalone system.\nConsider a retail conglomerate with department stores, hypermarkets, and an online mall, each defining \u0026ldquo;revenue\u0026rdquo; differently. To unify these, you need to export each subsidiary\u0026rsquo;s ontology in a common format and map them. A proprietary DataNexus-only format can\u0026rsquo;t do that.\nThe SKOS compatibility layer doesn\u0026rsquo;t directly improve NL2SQL accuracy. Instead, it helps in a different way.\nImporting industry standard ontologies like FIBO (finance) or GPC (retail) means less time defining terms from scratch. Faster build means earlier context injection into the NL2SQL engine. In Post 1, I wrote \u0026ldquo;DataNexus\u0026rsquo;s data accumulation speed needs to outpace the generalization speed of general-purpose models\u0026rdquo; \u0026ndash; leveraging standards is one way to do that. If a client already uses Collibra or Alation, an inability to export in a standard format blocks adoption entirely. 
No matter how good the NL2SQL accuracy is, if it can\u0026rsquo;t coexist with existing infrastructure, it won\u0026rsquo;t get used in the field. Lesson learned from the retail project \u0026ndash; field fit, not technology, drives adoption. Post 4 isn\u0026rsquo;t about the NL2SQL engine\u0026rsquo;s internal performance. It\u0026rsquo;s about interface design for DataNexus to function as a platform. Different perspective, different problems to solve.\nExternal System Integration Didn\u0026rsquo;t Work In the previous post\r, the DataHub + DozerDB dual structure solved the internal ontology problem. It was sufficient for internal use only.\nThe problem was external integration. While exploring the finance domain, I found FIBO (Financial Industry Business Ontology) \u0026ndash; an industry-standard term hierarchy with concepts like \u0026ldquo;Financial Product,\u0026rdquo; \u0026ldquo;Loan,\u0026rdquo; \u0026ldquo;Interest Rate\u0026rdquo; organized in layers. Retail has its counterparts. GS1\u0026rsquo;s GPC (Global Product Classification) has standardized product taxonomies like \u0026ldquo;Apparel -\u0026gt; Women\u0026rsquo;s Wear -\u0026gt; Dresses.\u0026rdquo; Healthcare has SNOMED CT, manufacturing has ISA-95. Every domain has thousands of pre-organized terms \u0026ndash; if we could import these, there\u0026rsquo;d be no need to build an ontology from scratch.\nI opened a FIBO file. It was in OWL format. Trying to load it into the DozerDB graph, the structures just didn\u0026rsquo;t match. The reverse direction was the same \u0026ndash; to export the DataNexus ontology to a client\u0026rsquo;s existing system (Collibra, TopBraid, etc.), there was no standard format to use. It worked perfectly internally but became useless the moment you tried to take it outside.\nThe problems with no external compatibility stack up. Many large enterprises already use metadata management tools like Collibra or Alation. 
Adopting DataNexus doesn\u0026rsquo;t mean abandoning their existing term systems. If you can export in a standard format, coexistence is possible. If you can\u0026rsquo;t, you\u0026rsquo;re looking at manually migrating hundreds of terms. That alone eats months.\nFor retail conglomerates where department stores, hypermarkets, and online malls each define \u0026ldquo;revenue\u0026rdquo; differently, unifying or at least mapping terms at the group level requires a common format. Without one, each subsidiary operates in isolation. Financial institutions face regulatory requirements to report data lineage and term definitions to supervisory authorities. Then there\u0026rsquo;s vendor lock-in. If you use DataNexus but need to switch platforms later, standard format export means you can migrate. Without it, you\u0026rsquo;re trapped. This weighs heavily in adoption decisions.\nSame Graph, Different Language DozerDB uses the LPG (Labeled Property Graph) model.\nNodes (circles) get names and properties: Net Revenue {definition: \u0026quot;Gross Revenue - Returns - Discounts\u0026quot;} Arrows between nodes, and properties on those arrows too: -[MANUFACTURES {since: \u0026quot;2024-01-01\u0026quot;}]-\u0026gt; The key point is that you can put information like \u0026ldquo;since when\u0026rdquo; and \u0026ldquo;confidence level\u0026rdquo; directly on the arrow itself. This is what we leveraged in the previous post\rwhen creating MANUFACTURES and STOCKS relationships.\nSKOS and other web standards use a completely different system. RDF (Resource Description Framework) \u0026ndash; every piece of information is broken into three-word sentences.\nNet Revenue -\u0026gt; broader -\u0026gt; Revenue (Net Revenue\u0026rsquo;s broader concept is Revenue) Net Revenue -\u0026gt; prefLabel -\u0026gt; \u0026quot;Net Revenue\u0026quot;@en (The English name is \u0026ldquo;Net Revenue\u0026rdquo;) Subject-predicate-object \u0026ndash; these three words form one unit. 
Called a triple.\nThis is where they diverge. LPG can freely attach properties to relationships, but in RDF, the triple is the atomic unit, so you can\u0026rsquo;t directly put properties on a relationship. In exchange, RDF is URI-based, so the same concept can be referenced by the same address from anywhere in the world. For inter-system data exchange, RDF wins decisively.\nSo we needed LPG\u0026rsquo;s expressiveness internally and RDF\u0026rsquo;s compatibility externally. Both were needed.\nOWL Is Overkill, RDFS Is Too Light The RDF world has multiple standards.\nOWL (Web Ontology Language) is the most powerful. Class inheritance, constraints, automated reasoning. Think of it as legal text \u0026ndash; you can precisely specify every clause and exception, but you need a separate Reasoner engine and the learning curve is steep. FIBO uses OWL precisely because of the complexity of financial regulations.\nWhat DataNexus is doing isn\u0026rsquo;t reasoning. It\u0026rsquo;s context provision \u0026ndash; telling the NL2SQL engine \u0026ldquo;what is average transaction value, in which table, in which column.\u0026rdquo; OWL was overkill.\nRDFS (RDF Schema) swings the other way, too lightweight. subClassOf works, but there are no standard properties for synonyms or term definitions.\nSKOS (Simple Knowledge Organization System) landed in the middle. It\u0026rsquo;s a W3C standard originally built for library classification systems and thesauri (a thesaurus maps synonyms, near-synonyms, and hierarchical terms). 
What DataNexus does is basically business term dictionary management, so it\u0026rsquo;s not a stretch to use SKOS for this.\nHere\u0026rsquo;s how SKOS concepts map to the DataNexus structure:\nSKOS DataNexus (DataHub + DozerDB) In plain terms skos:Concept Glossary Term / Entity node A single term skos:broader IsA relationship (broader concept) \u0026ldquo;ATV is a type of sales metric\u0026rdquo; skos:narrower IsA reverse (narrower concept) \u0026ldquo;Sales metrics includes ATV\u0026rdquo; skos:related RelatedTo family \u0026ldquo;Related terms\u0026rdquo; * skos:prefLabel Term name (primary label) Official name skos:altLabel Synonyms (translations, abbreviations) \u0026ldquo;ATV\u0026rdquo; = \u0026ldquo;Average Transaction Value\u0026rdquo; skos:definition Term definition What the term means skos:ConceptScheme Domain-specific term grouping \u0026ldquo;Retail Terms\u0026rdquo;, \u0026ldquo;Finance Terms\u0026rdquo; * Note: skos:related is bidirectional. \u0026ldquo;A related B\u0026rdquo; automatically implies \u0026ldquo;B related A.\u0026rdquo; DozerDB relationships like SELLS or SUPPLIED_BY are directional. Store A sells Product B, but Product B doesn\u0026rsquo;t sell Store A. This directional information is lost when exporting to SKOS. More on this later.\nOverlaying SKOS on DozerDB The one rule I set was: don\u0026rsquo;t touch the existing graph.\nWe\u0026rsquo;d already built MANUFACTURES, STOCKS, CALCULATED_FROM relationships in DozerDB and had queries running against them. Ripping that apart to comply with a standard would be the kind of mistake I\u0026rsquo;ve seen too many times on other projects.\nSo we overlaid SKOS metadata on existing nodes instead. 
Like placing a transparent film on top.\n// Add SKOSConcept label and SKOS properties to existing Entity node MATCH (net:Entity {name: \u0026#39;Net Revenue\u0026#39;}) SET net:SKOSConcept SET net.skos_prefLabel = \u0026#39;Net Revenue\u0026#39; SET net.skos_altLabel = [\u0026#39;Net Sales\u0026#39;, \u0026#39;순매출액\u0026#39;] SET net.skos_definition = \u0026#39;Amount after deducting returns and discounts from gross revenue\u0026#39; SET net.skos_inScheme = \u0026#39;finance-terms\u0026#39; Same for the retail domain.\n// Retail domain term example MATCH (atv:Entity {name: \u0026#39;ATV\u0026#39;}) SET atv:SKOSConcept SET atv.skos_prefLabel = \u0026#39;ATV\u0026#39; SET atv.skos_altLabel = [\u0026#39;Average Transaction Value\u0026#39;, \u0026#39;객단가\u0026#39;, \u0026#39;객단\u0026#39;] SET atv.skos_definition = \u0026#39;Total revenue divided by number of purchasing customers\u0026#39; SET atv.skos_inScheme = \u0026#39;retail-terms\u0026#39; Existing Entity nodes remain untouched. Just a SKOSConcept label and skos_-prefixed properties added on top. No impact on existing Cypher queries.\nFor broader/narrower relationships, there were two approaches: create BROADER and NARROWER edges alongside IsA upfront, or convert existing IsA relationships to skos:broader at export time.\nWe chose the latter. Creating dual edges means every time IsA changes, BROADER needs to sync too. If sync drifts, data gets corrupted. The Source of Truth should be singular. Converting once at export time is simpler and safer.\nImport and Export Once the overlay was in place, two things became possible that weren\u0026rsquo;t before.\nImport \u0026ndash; Pulling finance terms from FIBO, product taxonomy from GS1 GPC into DataNexus. FIBO is originally distributed in OWL, but there are derived SKOS versions. GPC is also SKOS-mappable. 
Product hierarchies like \u0026ldquo;Apparel -\u0026gt; Women\u0026rsquo;s Wear -\u0026gt; Dresses\u0026rdquo; can be imported directly as the backbone for a retail client\u0026rsquo;s ontology. OWL\u0026rsquo;s complex constraints get dropped, but what DataNexus needs is just term names, definitions, and hierarchical relationships. The SKOS subset is sufficient.\nExport \u0026ndash; Sending DataNexus terms to a client\u0026rsquo;s system. Extract nodes and relationships for a specific domain (e.g., retail-terms) from the DozerDB graph and convert to SKOS Turtle format.\n@prefix skos: \u0026lt;http://www.w3.org/2004/02/skos/core#\u0026gt; . @prefix dnx: \u0026lt;http://datanexus.ai/ontology/\u0026gt; . dnx:atv a skos:Concept ; skos:prefLabel \u0026#34;ATV\u0026#34;@en ; skos:altLabel \u0026#34;Average Transaction Value\u0026#34;@en, \u0026#34;객단가\u0026#34;@ko ; skos:definition \u0026#34;Total revenue divided by number of purchasing customers\u0026#34;@en ; skos:broader dnx:sales-metrics ; skos:inScheme dnx:retail-terms . dnx:sales-metrics a skos:Concept ; skos:prefLabel \u0026#34;Sales Metrics\u0026#34;@en ; skos:narrower dnx:atv, dnx:net-sales, dnx:upt ; skos:inScheme dnx:retail-terms . In the retail field, what one system calls \u0026ldquo;ATV\u0026rdquo; (Average Transaction Value), another calls \u0026ldquo;average spend per customer\u0026rdquo; (객단가). Put all these aliases in altLabel and the NL2SQL engine can find the same table regardless of which name is used in the question. This file can be loaded into any SKOS-compatible system \u0026ndash; Collibra, TopBraid, you name it.\nWith import/export working, the problems mentioned earlier are solved. Say a retail conglomerate\u0026rsquo;s department store defines \u0026ldquo;revenue\u0026rdquo; as store-level POS totals, the online mall uses payment confirmation basis, and the hypermarket uses post-return basis. 
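As a toy illustration of what producing that Turtle might involve, here is a minimal serializer for a single concept. Everything in it (the function, the dict layout, the @en tags) is an invented sketch, not the actual DataNexus export pipeline.

```python
# Toy SKOS Turtle serializer for one concept (illustrative only).
def to_turtle(term: dict) -> str:
    alt = ", ".join(f'"{label}"@en' for label in term["alt_labels"])
    return (
        f'dnx:{term["id"]} a skos:Concept ;\n'
        f'    skos:prefLabel "{term["pref_label"]}"@en ;\n'
        f'    skos:altLabel {alt} ;\n'
        f'    skos:broader dnx:{term["broader"]} ;\n'
        f'    skos:inScheme dnx:{term["scheme"]} .'
    )

atv = {
    "id": "atv",
    "pref_label": "ATV",
    "alt_labels": ["Average Transaction Value"],
    "broader": "sales-metrics",
    "scheme": "retail-terms",
}
ttl = to_turtle(atv)
```

Any SKOS-aware catalog can load output shaped like this; what it cannot see is whatever the standard properties leave out, which is where the custom extensions come in.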
Each subsidiary exports their terms via SKOS from DataNexus, and headquarters can receive these and build a mapping table. \u0026ldquo;Department store revenue = Online mall confirmed revenue = Hypermarket net revenue\u0026rdquo; \u0026ndash; this relationship gets standardized. Financial clients needing to report term definitions and data lineage to regulators can submit the SKOS Turtle file directly or convert to the required format. Without standards, all of this is manual work.\nSchema.org and other RDFS/OWL-based standards are outside the scope of this SKOS layer. If needed, a separate converter can be built, but it\u0026rsquo;s not a priority right now.\nRemaining Limitations There are things SKOS can\u0026rsquo;t do, and I knew that going in.\nSKOS has an extension called SKOS-XL that lets you attach metadata to labels themselves. You could record when \u0026ldquo;Net Revenue\u0026rdquo; was registered or who approved it. If multilingual label management gets complex, we might need it. Haven\u0026rsquo;t added it yet.\nOWL-level reasoning is also outside SKOS scope. \u0026ldquo;If A is narrower than B, and B is narrower than C, then A is narrower than C\u0026rdquo; \u0026ndash; that kind of automated inference. Not needed when the ontology is small, but things may change with thousands of terms.\nThe biggest limitation is custom relationship export. DozerDB\u0026rsquo;s domain-specific retail relationships like SELLS, STOCKS, SUPPLIED_BY have no SKOS standard equivalent. \u0026ldquo;Store A sells Product B\u0026rdquo; is a directional relationship, but lumping it into skos:related erases both direction and meaning. Extending with a custom namespace like dnx:sells preserves the information, but the receiving system needs to understand these custom relationships. Information loss vs. compatibility \u0026ndash; it\u0026rsquo;s a tradeoff.\nHow Custom Relationships Are Actually Exported I said lumping into skos:related erases meaning. 
So what do we actually do?\nWe define a DataNexus-specific namespace.\n@prefix dnx: \u0026lt;http://datanexus.ai/ontology/relation/\u0026gt; . dnx:atv-store a skos:Concept ; skos:prefLabel \u0026#34;ATV-Store Relationship\u0026#34;@en ; dnx:relationshipType \u0026#34;SoldBy\u0026#34; ; dnx:direction \u0026#34;outgoing\u0026#34; ; dnx:confidence 0.95 ; dnx:validFrom \u0026#34;2024-01-01\u0026#34; . Custom properties like dnx:relationshipType, dnx:direction, dnx:confidence preserve the directionality and metadata from DozerDB\u0026rsquo;s SELLS relationship. If the receiving system understands the dnx: namespace, it can restore the information without loss. If not, it falls back to skos:related. The information doesn\u0026rsquo;t disappear \u0026ndash; it\u0026rsquo;s just only visible to systems that can read it.\nIn practice, we operate like this:\nExport Target Method Information Preservation SKOS-native systems (Collibra, TopBraid) Standard skos: properties only ~80% (direction, properties lost) DataNexus-to-DataNexus (subsidiary to subsidiary) Include dnx: custom namespace ~95% (near-complete preservation) Regulatory reporting skos: + custom relationships as text in skos:note ~85% (human-readable level) On the DataHub side, we\u0026rsquo;ve also defined rules for handling unmapped properties at export time:\nDozerDB Property SKOS Export Handling confidence dnx:confidence (custom) or text in skos:note since / valid_until dnx:validFrom / dnx:validUntil or skos:historyNote cardinality dnx:cardinality (custom only, no SKOS equivalent) operator (CALCULATED_FROM) dnx:calculationOperator I\u0026rsquo;m aware this isn\u0026rsquo;t clean. The dnx: namespace only means something inside the DataNexus ecosystem, and there\u0026rsquo;s no guarantee external systems will understand it. Filling standard gaps with custom extensions is basically making up a new non-standard. To do this properly, we\u0026rsquo;d need SKOS-XL or a formal Application Profile. 
That feels like overkill right now. We\u0026rsquo;ll add it if clients actually need it.\nRoughly 80% is covered by SKOS standard, the remaining 20% by DozerDB custom properties. Not ideal, but better than pretending everything fits neatly into SKOS when it doesn\u0026rsquo;t.\nDocumenting the process of designing and building DataNexus. GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/004-skos-compatibility-layer/","summary":"Why we chose SKOS to connect the DataNexus ontology with external systems. Designing a compatibility layer between LPG and RDF \u0026ndash; two different graph models.","title":"4. Why We Added a SKOS Compatibility Layer"},{"content":"\u0026ldquo;Do we still need a star schema?\u0026rdquo; This question comes up in every cloud DW migration project.\nYou\u0026rsquo;re in a meeting about moving an on-premises DW — one that\u0026rsquo;s been running for years — to BigQuery or Azure Synapse. Someone asks: \u0026ldquo;Those platforms use columnar storage, so join costs are different. Does that mean we can skip the star schema?\u0026rdquo;\nIt depends. But not many people can articulate what exactly \u0026ldquo;it depends\u0026rdquo; on.\nThe On-Premises Playbook There was an era when Kimball\u0026rsquo;s Dimensional (Star-Schema) modeling was the de facto standard.\nThe reason was straightforward. Disk I/O was expensive, and joins were even more expensive. Joining 10 tables on row-based storage pushed query response times into minutes. Pre-joining data was the rational choice.\nModeling decisions were performance decisions. How far to denormalize, how many aggregate table layers to build, which partition key to choose — these decisions meant the difference between seconds and minutes in query response time.\nKimball\u0026rsquo;s methodology was an attempt to achieve both business readability and query performance within these constraints. 
It unified dimension tables — previously siloed by department — into enterprise-wide Conformed Dimensions and used a Bus Matrix to design enterprise integration. The methodology\u0026rsquo;s rigor holds up well even today.\nThe problem is that the premises behind this methodology have changed.\nWhat the Cloud Changed Moving to cloud DW fundamentally altered the physical constraints.\nColumnar Storage. BigQuery, Redshift, and Synapse are all column-based. SELECT reads only the columns you need. Query five columns from a table with hundreds, and only those five get scanned. Row-based storage had to read everything.\nCompute/Storage Separation. Storage became cheap. The cost of redundant data storage is negligible. This is a different world from on-premises, where normalization was partly driven by disk capacity constraints.\nMPP Architecture. Massively parallel processing is the default. Join costs are relatively lower compared to on-premises RDBMS. It\u0026rsquo;s not free — shuffles still hurt — but \u0026ldquo;avoid joins at all costs\u0026rdquo; is no longer an absolute rule.\nELT Paradigm. Load raw data first, then transform inside the DW. Transformation logic runs on the DW engine\u0026rsquo;s compute power. Unlike the ETL era, there\u0026rsquo;s no separate transformation server doing the heavy lifting before loading.\nSemi-structured Data Support. JSON, ARRAY, and STRUCT are handled natively. Flexible schemas that were awkward in traditional relational models can now be processed directly within the DW.\nThese changes undermine the original rationale for many modeling principles. But weakened rationale is not the same as a wrong principle.\nThree Options Three approaches are commonly compared in cloud DW environments.\n1. Kimball Dimensional Modeling Star schema, facts and dimensions, Conformed Dimensions. Still the most widely used approach.\nThere are good reasons it remains relevant in the cloud. Business users find it intuitive. 
The structure of \u0026ldquo;slice the sales fact by the customer dimension\u0026rdquo; plays well with BI tools. Power BI, Tableau, and Looker are all optimized for this structure.\nWhat has changed: pre-built aggregate tables are less necessary. Columnar storage can aggregate raw facts on the fly with sufficient speed. History management patterns like SCD Type 2 are more practical now that storage costs have dropped.\nThe weakness is flexibility. When schemas change, you have to redesign the fact/dimension structure. Structurally, it\u0026rsquo;s difficult to keep up with fast-moving, agile requirements.\n2. Data Vault 2.0 Hubs (business keys), Satellites (attributes), Links (relationships). A methodology focused on storing raw data as-is along with its complete history.\nThe strength is clear: auditability. It fully preserves what the source data looked like at any given time. Adding new source systems or changing schemas doesn\u0026rsquo;t impact existing structures. Parallel loading is possible, making it a natural fit for the ELT paradigm.\nThere are practical hurdles. Querying directly is cumbersome. Multiple Hub-Satellite joins are needed to reconstruct a single business entity. As a result, you typically need a separate presentation layer (usually a star schema). That\u0026rsquo;s one more modeling layer to maintain. If the team has no Data Vault experience, the learning curve is steep.\nIt shines in regulated industries like finance and healthcare where audit trails are mandatory, or in environments where source systems are constantly being added.\n3. One Big Table (OBT) Merge facts and dimensions into a single wide table. Extreme denormalization.\nThere\u0026rsquo;s a reason this works in the cloud. With columnar storage, even if the table has 200 columns, only the 5 used in a query get scanned — so the performance hit is minimal. No joins means simpler queries. Development is fast. 
One SELECT statement in dbt and you\u0026rsquo;re done.\nThe trade-offs are real. There\u0026rsquo;s no structural mechanism to guarantee data consistency. If a customer\u0026rsquo;s address changes, every OBT has to be rebuilt. When the same dimension attribute is duplicated across multiple OBTs, there\u0026rsquo;s no way to know which one is correct. It\u0026rsquo;s fast when data is small and the domain is simple, but management complexity skyrockets at scale.\nIt works well for prototyping or single-domain analytical marts. It\u0026rsquo;s risky as the foundation for an enterprise DW.\nPractical Decision Criteria \u0026ldquo;Which one is best?\u0026rdquo; is the wrong question. Evaluate based on these criteria:\nTeam capability. To do Data Vault properly, you need someone on the team who knows the methodology. Without that, Kimball is the realistic choice. OBT has a low entry barrier, but as scale grows, an experienced modeler becomes even more critical.\nData complexity. Are there 3 source systems or 30? One domain or many? Higher complexity calls for Kimball\u0026rsquo;s Conformed Dimensions or Data Vault\u0026rsquo;s Hub structure.\nRate of change. If requirements shift frequently, Data Vault has the advantage. In a stable environment, Kimball is sufficient.\nRegulatory requirements. If audit trails are legally required, consider Data Vault. Otherwise, it may be over-engineering.\nQuery patterns. If the focus is BI dashboards, Kimball\u0026rsquo;s structure pairs well with BI tools. If ad-hoc analysis is common, OBT\u0026rsquo;s simplicity becomes an advantage.\nA pattern frequently seen in practice is a layered approach.\nRaw (source ingestion) → Staging (cleansing) → Integration (unified model) → Mart (analytics) A common combination is designing the Integration layer with Data Vault and serving the Mart as a star schema. 
Data Vault\u0026rsquo;s Hub-Link-Satellite structure excels at history tracking and flexible extension, while the star schema — with its central fact table surrounded by dimension tables — is optimized for analytics. Another traditional approach uses a normalized relational model (3NF) for the Integration layer with Kimball-style Marts. Either way, the key is separating the layers.\nThe moment you try to handle both source preservation and analytical optimization in a single layer, complexity explodes.\nModeling Still Matters What changed with the cloud is the criteria for \u0026ldquo;why we model this way\u0026rdquo; — not whether modeling is needed at all.\nStorage is cheaper, so you can denormalize more aggressively. Join costs are lower, so you can build fewer aggregate tables. But consistency of business terminology, data lineage, and cross-domain integration — these problems don\u0026rsquo;t disappear when the infrastructure changes.\nIn fact, teams that start without modeling in the cloud tend to suffer more later. They build quickly with OBT, and it works at first. A year later, the same metric has different definitions across tables, and nobody knows which source is authoritative. Technical debt accumulates silently.\nAs tools and infrastructure improve, the center of gravity in modeling is shifting from \u0026ldquo;performance optimization\u0026rdquo; to \u0026ldquo;semantic consistency.\u0026rdquo; This is the same problem DataNexus is tackling with ontology. Without machine-readable business context, data loses its meaning — no matter how good the infrastructure underneath.\n","permalink":"https://datanexus-kr.github.io/en/guides/dw-modeling/001-cloud-era-dw-modeling/","summary":"How DW modeling considerations have shifted with Synapse, BigQuery, and Redshift. Kimball, Data Vault, One Big Table — practical criteria for choosing the right approach.","title":"1. 
Is Kimball Still Relevant in the Cloud DW Era?"},{"content":"\rTry it in Google Colab\rWhy We Tried Using the Glossary as an Ontology The core idea of DataNexus is simple. If you define relationships between business terms as a graph, the NL2SQL engine can reference that graph to convert natural language to SQL. That graph is the ontology.\n\u0026ldquo;Ontology\u0026rdquo; sounds like something from an academic paper, but there\u0026rsquo;s nothing fancy about it. \u0026ldquo;Net revenue is a type of revenue (IsA).\u0026rdquo; \u0026ldquo;Revenue includes gross revenue, returns, and discounts (HasA).\u0026rdquo; It\u0026rsquo;s just business knowledge from people\u0026rsquo;s heads, written down in a format machines can read.\nThe question was where to store it. Spinning up a dedicated ontology system adds another management point. We were already using DataHub as the metadata platform, and it came with a Business Glossary. Term registration, relationship configuration \u0026ndash; it\u0026rsquo;s all there. In the previous post\r, I\u0026rsquo;d assessed that the Glossary\u0026rsquo;s 4 relationship types (IsA, HasA, RelatedTo, Values) were sufficient to express business term hierarchies.\nEliminating one system was appealing. Programmatic access via GraphQL API was there, and term changes automatically trigger Kafka MCL (Metadata Change Log) events. A reasonable starting point.\nStarted Putting Terms In First thing after setting up DataHub was registering Glossary Terms. \u0026ldquo;Net Revenue IsA Revenue\u0026rdquo;, \u0026ldquo;Revenue HasA Gross Revenue, Returns, Discounts.\u0026rdquo; Entered terms and set up relationships like this.\nBasic hierarchies went in cleanly. 
Revenue -\u0026gt; Gross Revenue, Net Revenue -\u0026gt; Actual Revenue.\nThen problems started.\nThe Limits of 4 Relationship Types: \u0026ldquo;Factory and Product\u0026rdquo; Hit a wall when modeling actual business data.\n\u0026ldquo;Product B manufactured at Factory A\u0026rdquo; and \u0026ldquo;Product B in stock at Factory A.\u0026rdquo; Both are relationships between factory and product, but one is manufacturing (Manufactures) and the other is inventory (Stocks). Completely different meanings.\nExpress this in DataHub\u0026rsquo;s Glossary? Both become RelatedTo. You get two \u0026ldquo;Factory RelatedTo Product\u0026rdquo; entries, with no way to tell which is manufacturing and which is inventory.\nWhy this is fatal \u0026ndash; DataNexus\u0026rsquo;s NL2SQL engine builds SQL by looking at the ontology. When a question like \u0026ldquo;Show me products manufactured at Factory A\u0026rdquo; comes in, the engine looks up the factory-product relationship to determine the relevant tables and JOIN paths.\nUser question: \u0026ldquo;What products are manufactured at Factory A?\u0026rdquo;\nOntology lookup: Factory -\u0026gt; RelatedTo -\u0026gt; Product (Manufacturing? Inventory? Unknown)\n-\u0026gt; LLM might JOIN the inventory table instead of the production table -\u0026gt; Wrong results returned\nWith only RelatedTo as the relationship type, the engine has no basis for judgment. A wrong JOIN means wrong data delivered to the user.\nAdding Types Requires Redeployment Can\u0026rsquo;t you just add more granular relationships to DataHub? No.\nYou\u0026rsquo;d need to define a new Aspect in PDL (Pegasus Data Language), declare the relationship type with @Relationship annotations, then build and redeploy DataHub. This cycle repeats for every new relationship type.\nIn practice, business modeling means relationships keep multiplying. 
\u0026ldquo;Supplies,\u0026rdquo; \u0026ldquo;Inspects,\u0026rdquo; \u0026ldquo;Returns\u0026rdquo;\u0026hellip; business context can require dozens. Editing code and redeploying for each one isn\u0026rsquo;t realistic.\nDigging Deeper, More Issues Surfaced Relationship types weren\u0026rsquo;t the only problem.\nSynonym Conflicts I registered \u0026ldquo;Net Revenue\u0026rdquo; (순매출) and \u0026ldquo;Actual Revenue\u0026rdquo; (실매출) as synonyms. Same concept, different names. But both terms had \u0026ldquo;Net Sales\u0026rdquo; as an English synonym. One English name mapped to two Korean terms \u0026ndash; DataHub just lets this pass. No warnings.\nIn NL2SQL, if synonym mappings get tangled, the engine references the wrong term. Once you\u0026rsquo;re past hundreds of terms, catching these conflicts by eye is impossible. You have to build custom validation logic separately.\nVisualization DataHub\u0026rsquo;s UI is designed for data lineage exploration. A directional tree showing data flowing from Table A to Table B.\nOntology has a different structure. Dozens to hundreds of nodes connected in many-to-many mesh networks. \u0026ldquo;Product\u0026rdquo; is connected to \u0026ldquo;Factory,\u0026rdquo; \u0026ldquo;Warehouse,\u0026rdquo; \u0026ldquo;Supplier,\u0026rdquo; \u0026ldquo;Category\u0026rdquo; \u0026ndash; each with different relationships \u0026ndash; and those nodes are interconnected among themselves. DataHub simply doesn\u0026rsquo;t have a screen for exploring this kind of graph.\nIf you can\u0026rsquo;t see the big picture of what you\u0026rsquo;ve built, you can\u0026rsquo;t manage it.\nCan\u0026rsquo;t Attach Properties to Relationships This was the biggest problem.\nWhen you set \u0026ldquo;A RelatedTo B\u0026rdquo; in DataHub\u0026rsquo;s Glossary, you can\u0026rsquo;t add anything more to that relationship. In practice, you often need metadata on the relationship itself.\nConfidence is a prime example. 
An auto-extracted relationship might be 0.7, while an expert-defined one is 0.95 \u0026ndash; the NL2SQL engine needs to know this difference. Validity period is similar. When organizational restructuring changes department-to-product mappings, you need to track when that relationship was valid. Without this, you end up querying current data with past organizational structures, which is a classic cause of mismatched report numbers. Cardinality directly affects JOIN strategy.\nSummary: What Works and What Doesn\u0026rsquo;t\nWorks | Doesn\u0026rsquo;t Work\nTerm definitions (name, definition) | Granular relationship types (MANUFACTURES, STOCKS, etc.)\nSynonym registration (custom fields) | Automatic synonym conflict detection\n4 relationship types (IsA, HasA, RelatedTo, Values) | Properties on relationships (confidence, validity period)\nProgrammatic access via GraphQL API | Complex graph exploration UI\nKafka MCL event stream | Real-time relationship type extension without redeployment\nAs a term dictionary, it\u0026rsquo;s decent. As an ontology store, it lacked expressiveness.\nSplit the Roles: DataHub + DozerDB Throwing away the Glossary entirely wasn\u0026rsquo;t the answer. Nothing could replace DataHub as the Source of Truth for term definitions. GraphQL API, Kafka MCL events \u0026ndash; building this infrastructure from scratch in another tool would be a waste of time.\nEach got assigned what it does best.\nDataHub Glossary -\u0026gt; Source of Truth for term definitions and basic relationships\nDozerDB -\u0026gt; Handles granular relationships, property-annotated edges, and graph reasoning\nWe chose DozerDB because it supports Cypher queries. Properties can be freely attached to relationships (edges), and adding new relationship types doesn\u0026rsquo;t require schema changes or redeployment.\nThe sync flow is straightforward. When a Glossary Term changes in DataHub, a Kafka MCL event is emitted. 
A consumer subscribing to the event reflects it in DozerDB\u0026rsquo;s ontology graph. Basic information like names and definitions stays with DataHub; DozerDB adds granular relationships and properties on top.\nRelationship Definitions in DozerDB The \u0026ldquo;factory-product\u0026rdquo; problem from earlier \u0026ndash; here\u0026rsquo;s how it resolves in DozerDB.\n// Entity creation (terms synced from DataHub)\nCREATE (factory:Entity {name: \u0026#39;Factory A\u0026#39;, type: \u0026#39;Factory\u0026#39;})\nCREATE (product:Entity {name: \u0026#39;Product B\u0026#39;, type: \u0026#39;Product\u0026#39;})\n// Manufacturing relationship — start date and confidence as properties\nCREATE (factory)-[:MANUFACTURES { since: \u0026#39;2024-01-01\u0026#39;, confidence: 0.95 }]-\u0026gt;(product)\n// Inventory relationship — separate edge, quantity and update timestamp\nCREATE (factory)-[:STOCKS { quantity: 500, last_updated: \u0026#39;2026-02-01\u0026#39; }]-\u0026gt;(product)\nMANUFACTURES and STOCKS are separate relationship types. When the question \u0026ldquo;What products are manufactured at Factory A?\u0026rdquo; comes in, the engine finds MANUFACTURES and correctly JOINs to the production table. Fundamentally different from lumping everything under a single RelatedTo.\nDerived Metrics Go in the Graph Too Managing derived metric definitions in Excel means when the source term changes, the derived spreadsheet doesn\u0026rsquo;t follow. 
\u0026ldquo;Net Revenue = Gross Revenue - Returns - Discounts\u0026rdquo; \u0026ndash; if the gross revenue definition changes but the net revenue side stays the same, that inconsistency propagates up to reports.\nThis time, we used CALCULATED_FROM relationships to put the formulas directly in the graph.\n// Expressing net revenue\u0026#39;s calculation structure as relationships\nMATCH (net:Entity {name: \u0026#39;Net Revenue\u0026#39;})\nMATCH (gross:Entity {name: \u0026#39;Gross Revenue\u0026#39;})\nMATCH (returns:Entity {name: \u0026#39;Returns\u0026#39;})\nMATCH (discounts:Entity {name: \u0026#39;Discounts\u0026#39;})\nCREATE (net)-[:CALCULATED_FROM {operator: \u0026#39;subtract\u0026#39;}]-\u0026gt;(gross)\nCREATE (net)-[:CALCULATED_FROM {operator: \u0026#39;subtract\u0026#39;}]-\u0026gt;(returns)\nCREATE (net)-[:CALCULATED_FROM {operator: \u0026#39;subtract\u0026#39;}]-\u0026gt;(discounts)\nWhen the formula changes, you update the relationship. The graph DB tracks the change history. Better than having it buried in some Excel sheet where nobody knows who changed what or when.\nDoesn\u0026rsquo;t This Break the Source of Truth? One thing to address here.\nIn Post 1, I called the metadata catalog the \u0026ldquo;Source of Truth for the ontology.\u0026rdquo; In Post 2, I also said the Glossary\u0026rsquo;s 4 relationship types were sufficient. But now in Post 3, I\u0026rsquo;ve added DozerDB. \u0026ldquo;So now there are two Sources of Truth?\u0026rdquo; A fair question.\nThe subject of the SoT changed.\nSoT doesn\u0026rsquo;t mean \u0026ldquo;put everything in one system.\u0026rdquo; It means \u0026ldquo;for a specific data category, which system has the final authority.\u0026rdquo; DataHub and DozerDB answer different questions.\n\u0026ldquo;What is net revenue?\u0026rdquo; -\u0026gt; DataHub answers. Name, definition, synonyms, owning department. DataHub has final authority on everything about the term\u0026rsquo;s identity. 
\u0026ldquo;How is net revenue connected to which tables, through which paths?\u0026rdquo; -\u0026gt; DozerDB answers. Granular relationships like CALCULATED_FROM and MANUFACTURES, confidence, validity periods. DozerDB owns the semantics of inter-term connections. If there\u0026rsquo;s a conflict between the two? DataHub wins. If a node\u0026rsquo;s name or definition in DozerDB differs from the DataHub Glossary, DataHub is correct. Kafka MCL events flow in one direction only \u0026ndash; DataHub to DozerDB. There is no reverse sync.\nI should have made this distinction clear from the start. When I wrote \u0026ldquo;Source of Truth for the ontology\u0026rdquo; in Post 1, what I really meant was \u0026ldquo;Source of Truth for term definitions.\u0026rdquo; The initial assumption that a single system could handle both term definitions and relationship semantics was wrong. Post 3 exposed that, and the DataHub + DozerDB dual structure is the result.\nThe SoT didn\u0026rsquo;t break. Its scope narrowed.\nRemaining Issue: Standards Compatibility DataHub\u0026rsquo;s Glossary model is a DataHub-specific structure. The industry has standard ontologies like FIBO (finance) and Schema.org (general-purpose). To import industry standards or export the DataNexus ontology, you need standard format support. Right now, it\u0026rsquo;s a proprietary system that only works inside DataNexus.\nAn ontology that can\u0026rsquo;t be exchanged externally ends up being used only internally and then abandoned.\nThe next post covers why we added a SKOS compatibility layer.\rTry it in Google Colab\rDocumenting the process of designing and building DataNexus. GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/003-datahub-glossary-as-ontology/","summary":"We tried using DataHub\u0026rsquo;s Business Glossary as an ontology store. What worked, what didn\u0026rsquo;t, and how we worked around it.","title":"3. 
Can DataHub's Glossary Work as an Ontology?"},{"content":"Too Many Candidates There were too many options.\nFor the metadata catalog alone: DataHub, Amundsen, Apache Atlas, OpenMetadata. Add commercial options and you get Collibra, Alation. Then you still need to fill the other three axes \u0026ndash; NL2SQL engine, document knowledge engine, graph DB \u0026ndash; and the number of combinations exploded exponentially.\nI made a comparison spreadsheet. Rows for candidate tools, columns for evaluation criteria. Three weeks in, the spreadsheet had grown to 7 tabs. With this many options, the hard part isn\u0026rsquo;t choosing one. Pick one and the viable combinations with the rest shift, forcing you to compare from scratch.\nFour Components, Each With Its Own Requirements In the previous post\r, I defined DataNexus\u0026rsquo;s four components: metadata catalog, NL2SQL engine, document knowledge engine, and graph DB.\nThree non-negotiable common criteria. It must be open source. It must support or enable multi-tenancy \u0026ndash; data isolation per group company is mandatory. It must be production-ready \u0026ndash; community activity, release cadence, documentation quality all mattered.\nEach component had additional requirements. The metadata catalog needed the ability to define relationships between terms in the Business Glossary and emit change events in real-time. The NL2SQL engine required per-user context isolation and Row-level Security. The document knowledge engine couldn\u0026rsquo;t rely on vector search alone \u0026ndash; it needed hybrid graph search. The graph DB required Multi-DB support and Cypher query support as prerequisites.\nWith these criteria in hand, I filtered the candidates.\nMetadata Catalog DataHub, OpenMetadata, Amundsen, Apache Atlas, commercial (Collibra/Alation). Five options on the table.\nCommercial was eliminated first. Licensing costs aside, what this project needed was to use the catalog\u0026rsquo;s Glossary like an ontology store. 
Commercial Glossary features are powerful, but they have limitations when it comes to accessing the internal data model for customization.\nApache Atlas is tied to the Hadoop ecosystem. You need to spin up HBase, Solr, and Kafka. It\u0026rsquo;s a 2016-era design that\u0026rsquo;s too heavy for cloud-native environments. Amundsen is decent as a search-focused catalog, but its ability to define relationships between terms in the Glossary is lacking. Couldn\u0026rsquo;t use it as an ontology store.\nOpenMetadata was the one I deliberated on the longest. Clean architecture, built-in data quality measurement \u0026ndash; excellent as a standalone catalog. The issue was that Glossary relationships are primarily Parent-Child and RelatedTerms. Not enough for ontology representation where you need to clearly distinguish inheritance (IsA) from containment (HasA). Real-time event sync was also webhook-based, which is less reliable than Kafka-native for large-scale streaming.\nI went with DataHub.\nGlossary relationships come in 4 types: IsA (inheritance), HasA (containment), Values (value lists), RelatedTo (general association). These four are enough to express hierarchies between business terms. \u0026ldquo;Net Revenue IsA Revenue\u0026rdquo;, \u0026ldquo;Revenue HasA Gross Revenue, Returns, Discounts\u0026rdquo; \u0026ndash; that kind of thing.\nGraphQL API also played a role. You need to be able to read and write metadata programmatically to auto-sync the ontology to the NL2SQL engine\u0026rsquo;s RAG Store, and with GraphQL you can pick exactly the fields you need.\nThe biggest factor was Kafka MCL events. DataHub exports Metadata Change Logs to Kafka, and when a Glossary Term changes, an event is published. Subscribe to that and you can sync the graph DB ontology in real-time. Manually reflecting metadata changes will inevitably result in gaps as scale grows. This was a non-negotiable requirement.\nNL2SQL Engine At first, I considered building it from scratch. 
I\u0026rsquo;d already been through building a conversational BI solution \u0026ndash; connecting GPT and Gemini for NL2SQL, optimizing prompt engineering, even designing a multi-agent architecture.\nTwo lessons came out of that. First, LLMs can\u0026rsquo;t understand business context from DDL alone. Second, building from scratch means auxiliary features balloon endlessly \u0026ndash; user auth, query logging, data filtering, response streaming, query learning. The estimate came out to over a month of work.\nThat\u0026rsquo;s when Vanna hit 2.0.\nVersion 1.x was simple. Inherit a Python class, call train() and ask(). Fine for prototyping, but not production-ready. No per-user context isolation, no security features.\n2.0 is a different beast. It switched to an Agent-based architecture where you compose independent components, and added a User-Aware structure where user ID automatically propagates through all components. Row-level Security is supported at the framework level. Tool Memory for auto-learning from successful queries is built in. Streaming with Rich UI Components (tables, charts) in real-time.\nUser-Aware and Row-level Security were the most important. DataNexus needs to isolate data per group company, and having the NL2SQL engine support this at the framework level means significantly less custom code.\nTool Memory was also significant. One of the most reliable ways to improve NL2SQL accuracy is accumulating successful queries and reusing them for similar questions \u0026ndash; and this is built into the framework. Building it separately means handling query storage, similarity matching, and version management. All of that effort gone.\nDocument Knowledge Engine Vector search alone isn\u0026rsquo;t enough.\nWhen searching business reports or internal policy documents, pulling chunks by vector similarity alone breaks context. 
You want to find \u0026ldquo;Business Unit A\u0026rsquo;s revenue recognition criteria,\u0026rdquo; but vector search just lists chunks containing \u0026ldquo;revenue\u0026rdquo; by similarity score. Graph-structural information like the relationship between Business Unit A and revenue recognition criteria, or when the criteria changed, doesn\u0026rsquo;t live in vectors.\nApeRAG solves this by combining three types of search. Vector Search for embedding-based semantic search. Full-text Search for cases where the literal string matters, like proper nouns or code names. GraphRAG for traversing relationships between entities extracted from documents. All three run simultaneously.\nThere\u0026rsquo;s a specific reason this hybrid works especially well with DataNexus. If you inject DataHub\u0026rsquo;s Glossary Terms as the Taxonomy for ApeRAG\u0026rsquo;s Entity Extraction, entities extracted from documents are automatically linked to business terms. It goes through 4-stage Entity Resolution: Exact Match, Synonym Match, Fuzzy Match (threshold 0.85), and Context Match.\nThere\u0026rsquo;s also MinerU integration. Enterprise documents commonly have complex tables, formulas, and multi-column layouts. Standard PDF parsers break table rows and columns. Especially documents like annual reports with lots of merged cells \u0026ndash; parsing results are disastrous. MinerU preserves document structure during parsing, directly solving this problem.\nGraph DB The biggest variable was the Neo4j license.\nThe critical difference between Community Edition and Enterprise Edition is Multi-DB. Community allows one graph per instance. Enterprise allows multiple databases within the same instance.\nMulti-DB is mandatory for DataNexus. We need to isolate ontology graphs per group company. groupA_ontology_db, groupB_ontology_db \u0026ndash; separate databases per tenant with access controlled by user permissions. 
Shoving everything into a single Community DB and distinguishing by labels doesn\u0026rsquo;t make sense from a security standpoint.\nBut we can\u0026rsquo;t buy a Neo4j Enterprise license either. That goes against the project\u0026rsquo;s open-source principles.\nDozerDB solved this dilemma. It\u0026rsquo;s an open-source plugin that adds Enterprise features on top of Neo4j Community Edition, including Multi-DB support. You can create per-tenant graphs with CREATE DATABASE, and Cypher queries work as-is.\nI also looked at ArangoDB. The multi-model approach (document + graph + key-value) is appealing, but you can\u0026rsquo;t use Cypher. Its own query language AQL is fine for graph traversal, but you lose access to the Neo4j ecosystem\u0026rsquo;s libraries and tools. Since patterns and references for querying ontologies with Cypher are overwhelmingly abundant, I chose ecosystem compatibility.\nI\u0026rsquo;m aware of DozerDB\u0026rsquo;s limitations. Fabric \u0026ndash; cross-DB queries \u0026ndash; isn\u0026rsquo;t supported yet, so querying across different databases in a single Cypher statement isn\u0026rsquo;t possible. Deferred to Phase 3. For now, single-tenant queries are sufficient.\nConnecting the Four Line up the four and they\u0026rsquo;re just four tools.\nWhen a Glossary Term changes in DataHub, a Kafka MCL event is published. This event is reflected in real-time to the DozerDB ontology graph, and simultaneously to Vanna\u0026rsquo;s RAG Store. The context injected into NL2SQL prompts is automatically refreshed. Since ApeRAG\u0026rsquo;s Entity Extraction references the DataHub Glossary as its Taxonomy, document search results are also linked to the latest term system.\nFix a term in one place and four places update simultaneously. 
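That one-way fan-out is simple enough to sketch as a single dispatch loop. Everything below is illustrative: the handler names and the event dict are placeholders standing in for the real DozerDB, Vanna, and ApeRAG sync code, not DataHub's actual MCL schema.

```python
# Sketch of the one-way sync: one DataHub MCL event fans out to every
# downstream store. Handler names and event shape are assumptions.

updated = []  # records which stores saw the change, for illustration

def sync_dozerdb(event):
    # Would run a Cypher MERGE on the matching ontology node
    updated.append(("dozerdb", event["term"]))

def sync_vanna_rag(event):
    # Would refresh the term's entry in the NL2SQL RAG store
    updated.append(("vanna_rag", event["term"]))

def sync_aperag_taxonomy(event):
    # Would update the taxonomy used by Entity Extraction
    updated.append(("aperag", event["term"]))

# One subscription, N consumers: adding a store never touches DataHub
SUBSCRIBERS = [sync_dozerdb, sync_vanna_rag, sync_aperag_taxonomy]

def on_mcl_event(event: dict) -> None:
    """Dispatch one Metadata Change Log event to all downstream stores."""
    if event.get("entityType") != "glossaryTerm":
        return  # only glossary changes drive the ontology sync
    for handler in SUBSCRIBERS:
        handler(event)

on_mcl_event({"entityType": "glossaryTerm", "term": "Net Revenue"})
print(updated)  # every downstream store saw the same change once
```

The point of the structure is that the fix-once-propagate-everywhere behavior lives in the subscription, not in anyone's runbook.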
Manual metadata propagation will inevitably miss something as scale grows.\nNext Post The limitations and workarounds when using DataHub\u0026rsquo;s Business Glossary as an ontology.\rDocumenting the process of designing and building DataNexus. GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/002-architecture-decisions/","summary":"How we decided on DataHub + Vanna + ApeRAG + DozerDB for DataNexus. What got eliminated from the candidate list, and why.","title":"2. How We Chose These 4 Open-Source Tools"},{"content":"\u0026ldquo;What\u0026rsquo;s Your VIP Criteria?\u0026rdquo; This happened during a BI Agent project for a retail company.\nA business user was testing the Agent and asked, \u0026ldquo;Show me last month\u0026rsquo;s VIP customer revenue.\u0026rdquo; The system spit out a number, but the user didn\u0026rsquo;t look happy. \u0026ldquo;Something\u0026rsquo;s off. I think the VIP criteria are different from what our team uses.\u0026rdquo;\nMarketing\u0026rsquo;s VIP and CRM\u0026rsquo;s VIP were different. Same with revenue. Depending on whether you meant net revenue (순매출) or gross revenue (총매출), the difference could be hundreds of millions of won.\nThis wasn\u0026rsquo;t the first time I\u0026rsquo;d seen this. I saw it when migrating a DW to the cloud, and again when building a next-gen analytics system with multiple vendors over a year-long project. Each vendor had different definitions of \u0026ldquo;revenue\u0026rdquo; and \u0026ldquo;cost,\u0026rdquo; and we\u0026rsquo;d lose weeks trying to reconcile data. One misaligned term could push back the entire schedule. I\u0026rsquo;ve never worked on a DW/BI project where this problem didn\u0026rsquo;t come up.\nEnterprise data warehouses have tables and columns. What they don\u0026rsquo;t have is context. 
\u0026ldquo;What does this column mean in business terms\u0026rdquo; isn\u0026rsquo;t defined anywhere in a machine-readable format.\nSo I decided to build it myself.\nNL2SQL Is Not a Silver Bullet There are plenty of NL2SQL tools out there now. Converting natural language to SQL is already possible.\nReal-world deployment is a different story. We connected a model with high benchmark scores to an actual DW, and the perceived accuracy dropped significantly. The environment integrated internal and external data \u0026ndash; card companies, telcos, public data. The LLM couldn\u0026rsquo;t handle this level of complexity (table structures and question difficulty).\nOpen up an enterprise DW\u0026rsquo;s DDL and the reason becomes clear. Abbreviated names like T_CUST_MST.CUST_GRD_CD and T_ORD_DTL.SALE_AMT number in the thousands. This is a completely different world from benchmark DBs with columns named customer_name and order_date. Even within the same company, naming conventions differ across business units, and the word \u0026ldquo;revenue\u0026rdquo; can point to different tables depending on which unit you ask.\nDerived metrics are even worse. \u0026ldquo;Net revenue\u0026rdquo; (순매출) isn\u0026rsquo;t a single column. It\u0026rsquo;s a formula like SUM(SALE_AMT) - SUM(RTN_AMT) - SUM(DC_AMT), and this formula isn\u0026rsquo;t written in any DDL. It lives in someone\u0026rsquo;s head, or at best, buried somewhere in an Excel spec document.\nThe bottleneck for NL2SQL isn\u0026rsquo;t SQL generation capability. It\u0026rsquo;s the lack of context.\nChoosing Ontology No matter how much I optimized prompt engineering for conversational BI, DDL alone had clear limits. I also designed a multi-agent architecture, but the root problem was the same. There was simply no context to give the LLM.\nI decided to attach an ontology.\nIt\u0026rsquo;s not as academic as it sounds. 
In practical terms, it looks like this:\n# Net revenue term definition - term: Net Revenue (순매출) definition: Amount after deducting returns and discounts from gross revenue formula: SUM(SALE_AMT) - SUM(RTN_AMT) - SUM(DC_AMT) synonyms: [Net Sales, 순매출액, 넷세일즈] related_tables: [T_SALE_DTL, T_RTN_DTL] owner: Finance Team You register these term definitions in a metadata catalog and auto-sync them to the NL2SQL engine\u0026rsquo;s RAG Store. When the LLM encounters the word \u0026ldquo;net revenue,\u0026rdquo; it knows which tables, which columns, and which formula to combine.\nHere\u0026rsquo;s the processing flow:\nOur internal target for the before/after difference with ontology is a +15-20%p improvement in EX (Execution Accuracy). MVP goal is EX 80%+, stabilization phase 90%+. Whether these numbers are realistic \u0026ndash; we\u0026rsquo;ll validate as we build.\nWhy Pasting DDL Doesn\u0026rsquo;t Work Pasting an entire enterprise DW\u0026rsquo;s DDL means tens to hundreds of thousands of tokens. In environments with hundreds or thousands of tables, you can\u0026rsquo;t fit it all in the context window. Even if you could, having the LLM pick out the right table from all that is a Needle-in-a-Haystack problem.\nSecurity is another issue. You can\u0026rsquo;t send an entire corporate schema to an external API. Even for the same \u0026ldquo;revenue,\u0026rdquo; users from Group Company A and Group Company B should see different scopes. This is an environment that requires Row-level Security.\nThe biggest problem is sustainability. DDL changes. Business term definitions change. Even after launching a next-gen analytics system, phased releases keep coming and metadata changes every time. 
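Because metadata changes every time, a refresh has to know which term definitions actually changed since the last sync. A minimal sketch of one way to do that, diffing content hashes of catalog entries; the catalog shape, term names, and store layout here are hypothetical stand-ins, not the actual DataNexus pipeline:

```python
import hashlib
import json

def fingerprint(term: dict) -> str:
    """Stable, order-independent hash of a term definition."""
    return hashlib.sha256(json.dumps(term, sort_keys=True).encode()).hexdigest()

def changed_terms(catalog: dict, known_hashes: dict) -> list[str]:
    """Names of terms whose current definition no longer matches the stored hash."""
    return [name for name, term in catalog.items()
            if known_hashes.get(name) != fingerprint(term)]

# Hypothetical catalog snapshot at last sync.
catalog = {
    "net_revenue": {"formula": "SUM(SALE_AMT) - SUM(RTN_AMT) - SUM(DC_AMT)"},
    "gross_revenue": {"formula": "SUM(SALE_AMT)"},
}
known = {name: fingerprint(term) for name, term in catalog.items()}

# Finance updates one formula; only that term needs re-embedding in the RAG Store.
catalog["net_revenue"]["formula"] = "SUM(SALE_AMT) - SUM(RTN_AMT)"
print(changed_terms(catalog, known))
```

Only the terms this returns would be re-embedded and pushed to the RAG Store, rather than rebuilding context for every definition on every release.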
You need a pipeline that detects changes and automatically refreshes the RAG Store \u0026ndash; not a one-time prompt.\nThat\u0026rsquo;s when I knew I needed to build a separate platform.\nWhat DataNexus Is Trying to Do It consists of four components.\nMetadata Catalog \u0026ndash; Manages business term definitions, table metadata, and data lineage in one place. The Source of Truth for the ontology. NL2SQL Engine \u0026ndash; Converts natural language to SQL, but injects context pulled from the ontology into the prompt. The accuracy gap compared to just throwing DDL at it is significant. Document Knowledge Engine \u0026ndash; Searches unstructured data like annual reports and policy documents using GraphRAG + vector hybrid retrieval. Graph DB \u0026ndash; Stores the ontology as a knowledge graph. With Multi-DB isolation per group company. The ontology defined in the catalog auto-syncs to NL2SQL and document search, so user questions get served with context. The open-source tools chosen for each component and the reasoning behind them will be covered in the next post.\nWhy Now General-purpose models are getting better fast, and simple planning or document generation will soon be a commodity. To differentiate, you need a system that can structure enterprise data context and inject it into models.\nNo matter how smart the LLM gets, it doesn\u0026rsquo;t know your company\u0026rsquo;s definition of \u0026ldquo;net revenue\u0026rdquo; (순매출). Because that knowledge only exists inside the enterprise.\nLLM research calls this the \u0026ldquo;Non-verifiable Domain\u0026rdquo;. Math and coding have auto-verifiable answers, but tacit enterprise knowledge, role-specific interpretations, and private operational data are hard to judge from the outside. The competitive advantage built on this kind of data is what AI strategy calls a \u0026ldquo;Data Moat\u0026rdquo;.\nI don\u0026rsquo;t think this advantage is permanent. 
The data accumulation speed of DataNexus needs to outpace the generalization speed of general-purpose models.\nHere\u0026rsquo;s how to build the Data Moat:\nOntology-based context \u0026ndash; A metadata catalog that gets thicker as domain experts refine terms Role-specific interpretation \u0026ndash; Persona optimization that gives different answers to finance and marketing for the same question. Gets more personalized as usage patterns accumulate. Temporal Knowledge Graph \u0026ndash; Distinguishes \u0026ldquo;VIP definition as of Q4 last year\u0026rdquo; from \u0026ldquo;VIP definition as of this year\u0026rdquo; Private data assets \u0026ndash; Graph DB isolation per group company + Row-level Security. Each group company\u0026rsquo;s data becomes an independent asset. The goal is to ship the MVP by the first half of this year and start spinning the data accumulation loop.\nPurpose of This Blog I\u0026rsquo;m documenting the decisions, struggles, and solutions encountered while building DataNexus.\nTopics I\u0026rsquo;ll cover:\nHow we selected the tech stack (including why candidates were eliminated) Limitations and workarounds when using the metadata catalog\u0026rsquo;s Business Glossary as an ontology Why we added a SKOS compatibility layer User-Aware design and Row-level Security in the NL2SQL engine How to pre-validate an ontology with CQ (Competency Questions) Criteria for splitting deterministic vs. probabilistic routing in the Query Router The 79% Rule for splitting agent tasks I\u0026rsquo;ll focus on problems we actually hit and how we solved them (or haven\u0026rsquo;t yet), rather than theory.\nNext Post DataNexus Tech Stack \u0026ndash; The process of deciding on this combination of 4 open-source tools. What was eliminated from the candidate list, and why.\rDocumenting the process of designing and building DataNexus. 
GitHub\r| LinkedIn\r","permalink":"https://datanexus-kr.github.io/en/posts/datanexus/001-why-datanexus/","summary":"\u003ch2 id=\"whats-your-vip-criteria\"\u003e\u0026ldquo;What\u0026rsquo;s Your VIP Criteria?\u0026rdquo;\u003c/h2\u003e\n\u003cp\u003eThis happened during a BI Agent project for a retail company.\u003c/p\u003e\n\u003cp\u003eA business user was testing the Agent and asked, \u0026ldquo;Show me last month\u0026rsquo;s VIP customer revenue.\u0026rdquo; The system spit out a number, but the user didn\u0026rsquo;t look happy. \u0026ldquo;Something\u0026rsquo;s off. I think the VIP criteria are different from what our team uses.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eMarketing\u0026rsquo;s VIP and CRM\u0026rsquo;s VIP were different. Same with revenue. Depending on whether you meant net revenue (순매출) or gross revenue (총매출), the difference could be hundreds of millions of won.\u003c/p\u003e","title":"1. Why We're Building DataNexus"},{"content":"Junho Lee (이준호) Data \u0026amp; AI Platform Architect | PM\nI have designed and led large-scale DW cloud migrations and next-generation data platform builds on the front lines of DW/BI delivery. My career started as a Web/ERP developer, then progressed through DW/BI engineer, Technical Lead, and Consulting Division Head — and now I\u0026rsquo;m building an ontology-based AI data platform.\nCareer Summary The first half of my career was focused on enterprise DW/BI. I led large-scale DW cloud migrations as a Tech Leader and operated next-generation information system projects as a multi-vendor PMO. I\u0026rsquo;ve worked across a wide range of industry domains including retail, telecommunications, manufacturing, and construction.\nI also have experience building a consulting organization from the ground up — growing a small team to 20+ people and scaling revenue several times over. 
Hiring, training, managing a technical organization, Presales, C-level seminars — I\u0026rsquo;m not just someone who codes.\nMore recently, I\u0026rsquo;ve been working at the intersection of data and AI. While building an LLM-based BI Agent, I encountered the real-world limitations of NL2SQL firsthand, and through that process became convinced of the need for an ontology-driven approach. I am now designing and building DataNexus, an integrated data agent platform.\nDataNexus \u0026ldquo;Everyone is an Analyst.\u0026rdquo;\nA platform designed to solve the structural problems of enterprise data analytics. An AI agent that lets anyone explore and analyze internal data through natural language — easy to say, but in practice, table names look like T_CUST_MST, full of abbreviations, and a single term like \u0026ldquo;net revenue\u0026rdquo; carries different calculation logic across departments. LLMs cannot understand business context from DDL alone.\nDataNexus tackles this problem by combining an ontology-based NL2SQL engine, GraphRAG, and a Data Catalog. 
Built on an open-source composite architecture, it provides a single interface for handling both unstructured documents and structured databases.\nThis blog documents the process of building DataNexus — architecture decisions, reasons behind technology choices, and the struggles and solutions along the way, recorded as-is.\nTechnical Areas AI/ML — Ontology LLM RAG, NL2SQL, Langchain, MCP, multi-agent system design DW/Data Platform — Azure Synapse, BigQuery, Redshift, PostgreSQL, Oracle, Yellowbrick, Palantir Foundry BI — Power BI, Tableau, MicroStrategy, Qlik Sense, Looker, Superset ETL/ELT — ADF, SAP Data Services, IBM DataStage, Informatica, Databricks, SSIS Cloud — Azure (Synapse, ADF, ML), AWS (Redshift, S3, Glue), GCP (BigQuery, Gemini) Graph/Catalog — DataHub, Neo4j (DozerDB), ApeRAG Contact GitHub: @datanexus-kr\rLinkedIn: linkedin.com/in/leejuno\r","permalink":"https://datanexus-kr.github.io/en/about/","summary":"Junho Lee - Data \u0026amp; AI Platform Architect","title":"About"},{"content":"","permalink":"https://datanexus-kr.github.io/en/dashboard/","summary":"DataNexus enterprise ontology-based NL2SQL autonomous data agent platform development roadmap and real-time progress","title":"Development Roadmap"}]