[Hands-on in Google Colab]

## The Customer Moved. What About Past Order Shipping Addresses?

Say a jaffle_shop customer moved from Seoul to Busan. You UPDATE the city column in the customers table to 'Busan'. Now when you query past orders, every shipping address shows "Busan" – even for orders that were actually shipped to Seoul.

This is the same "point-in-time data" problem we covered in DW Modeling Part 2. OLTP only manages the current state; a DW needs to answer "what was the value at that point in time?" When a sales rep changes, whose numbers do past deals belong to? When a customer's tier changes, which tier should past orders be aggregated under? Same problem.

Kimball systematized this as SCD (Slowly Changing Dimension) – a framework for handling attribute changes in dimension tables, organized by type. The name "slowly" contrasts with fact data (orders, logs) that accumulates continuously: customer addresses, product categories, and employee departments don't change often, but they do change.

## SCD Type 1 - Overwrite

The simplest approach: UPDATE with the current value and move on. History is lost.

```sql
-- Type 1: overwrite with current value
UPDATE dim_customers
SET city = 'Busan'
WHERE customer_id = 1;
```

After execution, whether you join this customer's past orders or current orders, everything shows "Busan." The time they lived in Seoul is gone.

There are valid cases for Type 1. Typo corrections are the classic example: changing "Seoul Special City" to "Seoul" doesn't warrant keeping history. Code table description updates, customer name typo fixes – use Type 1 for attributes where knowing the past value serves no purpose.

## SCD Type 2 - Accumulate History

When you need to preserve past values, use Type 2. Close the existing row and add a new one.
Three columns are added to the table:

- valid_from – when this row became effective
- valid_to – when this row was superseded (current rows use 9999-12-31)
- is_current – whether this row is currently active

```sql
-- Initial state: customer_id = 1, Seoul
-- dim_customers_sk | customer_id | city  | valid_from | valid_to   | is_current
-- 1001             | 1           | Seoul | 2025-01-01 | 9999-12-31 | true
```

When the customer moves to Busan, two steps are executed.

```sql
-- Step 1: close the existing row
UPDATE dim_customers
SET valid_to = '2026-02-15', is_current = false
WHERE customer_id = 1 AND is_current = true;

-- Step 2: insert the new row
INSERT INTO dim_customers
    (dim_customers_sk, customer_id, city, valid_from, valid_to, is_current)
VALUES
    (1002, 1, 'Busan', '2026-02-15', '9999-12-31', true);
```

Now one customer_id has two rows.

```sql
-- dim_customers_sk | customer_id | city  | valid_from | valid_to   | is_current
-- 1001             | 1           | Seoul | 2025-01-01 | 2026-02-15 | false
-- 1002             | 1           | Busan | 2026-02-15 | 9999-12-31 | true
```

Here, dim_customers_sk is the surrogate key. Since one customer can have multiple rows, customer_id alone can no longer uniquely identify a row; that's why a separate surrogate key is needed. Design details are covered in the Gold edition.

Point-in-time queries work like this.

```sql
-- Join with the customer address at order time
SELECT
    o.order_id,
    o.order_date,
    c.city AS city_at_order_time
FROM fct_orders o
JOIN dim_customers c
    ON o.customer_id = c.customer_id
    AND o.order_date BETWEEN c.valid_from AND c.valid_to;
```

A June 2025 order shows "Seoul," a March 2026 order shows "Busan." Each order reflects the actual value at its point in time. One caveat: BETWEEN is inclusive on both ends, so an order placed exactly on a change date (2026-02-15 here) would match both rows; a half-open range join (>= valid_from AND < valid_to), as used in the Gold query later in this post, avoids the duplicate.

As mentioned in DW Modeling Part 1, "the storage overhead of SCD Type 2 has diminished." In cloud columnar storage, the cost of additional rows is far lower than on-premises, so this is an environment where you can use Type 2 more aggressively.

## SCD Type 3 - Keep the Previous Value as a Column

When a single level of history is sufficient, use Type 3.
Store the previous value in a separate column.

```sql
-- Type 3: previous value as a column
-- customer_id | city  | previous_city
-- 1           | Busan | Seoul
```

Implementation is straightforward.

```sql
UPDATE dim_customers
SET previous_city = city,
    city = 'Busan'
WHERE customer_id = 1;
```

Row count doesn't increase. However, the value from two changes ago is lost: if the city changes from Seoul → Busan → Daejeon, "Seoul" disappears.

There are practical use cases, such as comparing before and after an organizational restructuring ("What department was this employee in before the reorg?") – situations where a single previous value is enough and full history isn't needed.

## Which Type to Choose

| Criterion | Type 1 | Type 2 | Type 3 |
|---|---|---|---|
| History preservation | None | Full | Previous value only |
| Implementation complexity | Low | High | Medium |
| Storage | No change | Rows keep growing | Column addition |
| Point-in-time analysis | Not possible | Fully supported | Limited |
| Best for | Typos, code descriptions | Address, tier, department | Before/after reorg comparison |

The decision criterion is simple: "Do I need to analyze using past values?" If yes, Type 2. If no, Type 1. Type 3 is limited to the special case where only the previous value is needed.

You can also mix types within a single table, column by column: track city with Type 2 for full history, but overwrite phone with Type 1 – there's no analytical reason to keep historical phone numbers.

## Preparing Lab Data

jaffle_shop's raw_customers doesn't have an address column, so we need to generate synthetic data to demonstrate SCD.
```python
import duckdb

conn = duckdb.connect('warehouse.duckdb')

# Customer data for SCD demo: add city, membership_grade, updated_at
conn.execute("""
CREATE SCHEMA IF NOT EXISTS bronze;

CREATE OR REPLACE TABLE bronze.customers_v2 AS
SELECT
    id AS customer_id,
    first_name,
    last_name,
    CASE WHEN id % 3 = 0 THEN 'Seoul'
         WHEN id % 3 = 1 THEN 'Busan'
         ELSE 'Daejeon' END AS city,
    CASE WHEN id % 4 = 0 THEN 'Gold'
         WHEN id % 4 = 1 THEN 'Silver'
         WHEN id % 4 = 2 THEN 'Bronze'
         ELSE 'Standard' END AS membership_grade,
    TIMESTAMP '2025-01-15 09:00:00' AS updated_at
FROM read_csv_auto(
    'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv'
);
""")

conn.execute("SELECT * FROM bronze.customers_v2 LIMIT 5").fetchdf()
```

We also create change simulation data: a scenario where some customers moved and some tiers were upgraded.

```python
# Change data: some customers have moved
conn.execute("""
CREATE OR REPLACE TABLE bronze.customers_v2_updated AS
SELECT
    customer_id,
    first_name,
    last_name,
    CASE WHEN customer_id IN (1, 3, 5) THEN 'Jeju'
         ELSE city END AS city,
    CASE WHEN customer_id IN (2, 4) THEN 'Gold'
         ELSE membership_grade END AS membership_grade,
    CASE WHEN customer_id IN (1, 2, 3, 4, 5) THEN TIMESTAMP '2026-02-20 14:00:00'
         ELSE updated_at END AS updated_at
FROM bronze.customers_v2;
""")

# Check changed customers
conn.execute("""
SELECT
    a.customer_id,
    a.city AS before_city,
    b.city AS after_city,
    a.membership_grade AS before_grade,
    b.membership_grade AS after_grade
FROM bronze.customers_v2 a
JOIN bronze.customers_v2_updated b ON a.customer_id = b.customer_id
WHERE a.city != b.city OR a.membership_grade != b.membership_grade
""").fetchdf()
```

## Automating SCD Type 2 with dbt snapshot

### What Is a Snapshot

We implemented Type 2 manually in SQL above: close the existing row, insert the new one, manage valid_from/valid_to. That's manageable for one table. When dimension tables grow to 10 or 20, writing this logic from scratch every time isn't realistic.

dbt snapshot handles this for you.
Define a single snapshot file, and dbt detects changes in the source data and manages the valid_from/valid_to rows automatically.

### Writing the Snapshot File

```python
import os
os.makedirs('jaffle_shop/snapshots', exist_ok=True)
```

```sql
%%writefile jaffle_shop/snapshots/snap_customers.sql
{% snapshot snap_customers %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from bronze.customers_v2

{% endsnapshot %}
```

- strategy='timestamp' – determines change based on the updated_at column. If updated_at is newer than the previous snapshot point, the row is considered changed.
- unique_key='customer_id' – the key that identifies which row is which. Previous and current values are compared based on this key.

### Running the Snapshot

```python
from dbt.cli.main import dbtRunner

# Colab's ! shell commands spawn a separate process.
# DuckDB uses file locks that prevent concurrent writes across processes.
# Running via dbtRunner within the same process avoids lock conflicts.
result = dbtRunner().invoke(['snapshot',
                             '--project-dir', 'jaffle_shop',
                             '--profiles-dir', 'jaffle_shop'])
```

This is the first run, so all rows are new and are inserted as-is. dbt automatically adds dbt_valid_from and dbt_valid_to columns.

```python
conn.execute("SELECT * FROM snapshots.snap_customers LIMIT 5").fetchdf()
```

dbt_valid_to is NULL for all rows, meaning they're currently active. dbt snapshot uses NULL instead of 9999-12-31.

Now let's inject the change data and run again.
```python
# Replace source table with changed data
conn.execute("""
CREATE OR REPLACE TABLE bronze.customers_v2 AS
SELECT * FROM bronze.customers_v2_updated;
""")
conn.close()

# Re-run snapshot
result = dbtRunner().invoke(['snapshot',
                             '--project-dir', 'jaffle_shop',
                             '--profiles-dir', 'jaffle_shop'])

conn = duckdb.connect('warehouse.duckdb')

# Check customers with history
conn.execute("""
SELECT customer_id, city, membership_grade, dbt_valid_from, dbt_valid_to
FROM snapshots.snap_customers
WHERE customer_id IN (1, 2, 3)
ORDER BY customer_id, dbt_valid_from
""").fetchdf()
```

The customer with customer_id = 1 now has two rows. The first row's dbt_valid_to is populated, and the second row is the currently active one. Type 2 implemented without writing a single line of SQL.

### The Check Strategy

Some sources don't have an updated_at column. As mentioned in Part 2, surprisingly many systems UPDATE data without touching updated_at. In these cases, use the check strategy, which directly compares whether specified column values have changed.

```sql
%%writefile jaffle_shop/snapshots/snap_customers_check.sql
{% snapshot snap_customers_check %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['city', 'membership_grade']
    )
}}

select
    customer_id,
    first_name,
    last_name,
    city,
    membership_grade
from bronze.customers_v2

{% endsnapshot %}
```

check_cols=['city', 'membership_grade'] – if either of these column values differs from before, it's treated as a change. No updated_at needed. The trade-off is that it compares all rows every time, so it's slower than the timestamp strategy for large datasets.

### Connecting Snapshots to Silver/Gold

Snapshots are stored in a separate schema (snapshots), neither Bronze nor Silver. Here's how this fits into the Medallion Architecture.

```
Source → [Bronze] → [Silver] → [Gold]
                                  ↑
         Bronze → [Snapshot] ─────┘
```

This is the layer structure from Part 1 with snapshots added.
Snapshots look directly at Bronze data, and Silver or Gold models reference the snapshot results. Here's how a Silver model references a snapshot.

```sql
%%writefile jaffle_shop/models/staging/stg_customers_hist.sql
with source as (
    select * from {{ ref('snap_customers') }}
),

cleaned as (
    select
        customer_id,
        first_name,
        last_name,
        city,
        membership_grade,
        dbt_valid_from AS valid_from,
        coalesce(dbt_valid_to, '9999-12-31'::timestamp) AS valid_to,
        dbt_valid_to IS NULL AS is_current
    from source
)

select * from cleaned
```

We renamed dbt's dbt_valid_from/dbt_valid_to to valid_from/valid_to and converted NULL to 9999-12-31, which makes range joins in Gold more convenient.

A point-in-time join from a Gold fact table looks like this.

```sql
-- Gold: join with customer info at order time
select
    o.order_id,
    o.order_date,
    c.city AS city_at_order,
    c.membership_grade AS grade_at_order
from stg_orders o
join stg_customers_hist c
    on o.customer_id = c.customer_id
    and o.order_date >= c.valid_from
    and o.order_date < c.valid_to
```

## SCD Application Patterns Summary

| Pattern | Target | Implementation |
|---|---|---|
| Type 1 (overwrite) | Typos, code descriptions, phone numbers | Simple UPDATE |
| Type 2 (history accumulation) | Address, tier, department | dbt snapshot (timestamp / check) |
| Type 3 (preserve previous value) | Before/after reorg comparison | Add previous_ column |
| Mixed | Per-column differentiation within a table | Combine Type 1 + Type 2 |

In practice, Type 2 dominates, and since dbt snapshot handles the implementation, the overhead isn't significant. Type 1 is reserved for attributes that don't need history; Type 3 for the rare cases where only the previous value matters.

## Practical Reference: Running dbt snapshot from Airflow

In Part 3, we set up an Airflow DAG with dbt run → dbt test. Adding snapshots changes the order.
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='medallion_with_snapshot',
    schedule='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    # 1. Run snapshot first — capture change history from Bronze
    run_snapshot = BashOperator(
        task_id='dbt_snapshot',
        bash_command='cd /opt/dbt/jaffle_shop && dbt snapshot',
    )

    # 2. Silver transformation — some models reference snapshot results
    run_staging = BashOperator(
        task_id='dbt_run_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select staging',
    )

    # 3. Gold transformation
    run_marts = BashOperator(
        task_id='dbt_run_marts',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select marts',
    )

    run_snapshot >> run_staging >> run_marts
```

The critical ordering is run_snapshot >> run_staging. The Silver model stg_customers_hist references snap_customers, so snapshots must run first for Silver to reflect the latest history. If you run snapshots after Silver, changes detected in this batch won't appear in Silver until the next batch – a full day's delay.

The next post covers the Gold layer: combining cleansed Silver data and snapshot history to build fact and dimension tables, with the dbt marts directory taking center stage.