Try it in Google Colab

The Customer Moved. What About Past Order Shipping Addresses?

Say a jaffle_shop customer moved from Seoul to Busan. You UPDATE the city column in the customers table to ‘Busan’. Now when you query past orders, every shipping address shows “Busan” – even for orders that were actually shipped to Seoul.

This is the same “point-in-time data” problem we covered in DW Modeling Part 2. OLTP only manages the current state. A DW needs to know “what was the value at that point in time?” When a sales rep changes, whose numbers do past deals belong to? When a customer’s tier changes, which tier should past orders be aggregated under? Same problem.

Kimball systematized this. SCD (Slowly Changing Dimension) – a framework for handling attribute changes in dimension tables, organized by type. The name “slowly” contrasts with fact data (orders, logs) that accumulates continuously. Customer addresses, product categories, employee departments – they don’t change often, but they do change.

SCD Type 1 - Overwrite

The simplest approach. UPDATE with the current value and move on. History is lost.

-- Type 1: overwrite with current value
UPDATE dim_customers
SET city = 'Busan'
WHERE customer_id = 1;

After execution, whether you join this customer’s past orders or current orders, everything shows “Busan.” The time they lived in Seoul is gone.

There are valid cases for Type 1. Typo corrections are the classic example. Changing “Seoul Special City” to “Seoul” doesn’t warrant keeping history. Code table description updates, customer name typo fixes – use this for attributes where knowing the past value serves no purpose.

SCD Type 2 - Accumulate History

When you need to preserve past values, use Type 2. Close the existing row and add a new one.

Three columns are added to the table:

  • valid_from – when this row became effective
  • valid_to – when this row was superseded (current rows use 9999-12-31)
  • is_current – whether this row is currently active

-- Initial state: customer_id = 1, Seoul
-- dim_customers_sk | customer_id | city  | valid_from | valid_to   | is_current
-- 1001             | 1           | Seoul | 2025-01-01 | 9999-12-31 | true

When the customer moves to Busan, two steps are executed.

-- Step 1: close the existing row
UPDATE dim_customers
SET valid_to = '2026-02-15',
    is_current = false
WHERE customer_id = 1
  AND is_current = true;

-- Step 2: insert the new row
INSERT INTO dim_customers (dim_customers_sk, customer_id, city, valid_from, valid_to, is_current)
VALUES (1002, 1, 'Busan', '2026-02-15', '9999-12-31', true);

Now one customer_id has two rows.

-- dim_customers_sk | customer_id | city  | valid_from | valid_to   | is_current
-- 1001             | 1           | Seoul | 2025-01-01 | 2026-02-15 | false
-- 1002             | 1           | Busan | 2026-02-15 | 9999-12-31 | true

Here, dim_customers_sk is the surrogate key. Since one customer can have multiple rows, customer_id alone can no longer uniquely identify a row. That’s why a separate surrogate key is needed. Design details are covered in the Gold layer post.

Point-in-time queries work like this.

-- Join with the customer address at order time.
-- Half-open interval (>= / <): on the change date, the old row's valid_to
-- equals the new row's valid_from, so an inclusive BETWEEN would match both rows.
SELECT
    o.order_id,
    o.order_date,
    c.city AS city_at_order_time
FROM fct_orders o
JOIN dim_customers c
  ON o.customer_id = c.customer_id
 AND o.order_date >= c.valid_from
 AND o.order_date < c.valid_to;

A June 2025 order shows “Seoul,” a March 2026 order shows “Busan.” Each order reflects the actual value at its point in time.

As mentioned in DW Modeling Part 1, “the storage overhead of SCD Type 2 has diminished.” In cloud columnar storage, the cost of additional rows is far lower than on-premises. This is an environment where you can use Type 2 more aggressively.

SCD Type 3 - Keep the Previous Value as a Column

When a single level of history is sufficient, use Type 3. Store the previous value in a separate column.

-- Type 3: previous value as a column
-- customer_id | city  | previous_city
-- 1           | Busan | Seoul

Implementation is straightforward.

UPDATE dim_customers
SET previous_city = city,
    city = 'Busan'
WHERE customer_id = 1;

Row count doesn’t increase. However, the value from two changes ago is lost. If the city changes from Seoul → Busan → Daejeon, “Seoul” disappears.

There are practical use cases. Comparing before and after an organizational restructuring: “What department was this employee in before the reorg?” When a single previous value is enough and full history isn’t needed.

Which Type to Choose

Criterion                 | Type 1                    | Type 2                    | Type 3
History preservation      | None                      | Full                      | Previous value only
Implementation complexity | Low                       | High                      | Medium
Storage                   | No change                 | Rows keep growing         | Column addition
Point-in-time analysis    | Not possible              | Fully supported           | Limited
Best for                  | Typos, code descriptions  | Address, tier, department | Before/after reorg comparison

The decision criterion is simple. “Do I need to analyze using past values?” If yes, Type 2. If no, Type 1. Type 3 is limited to the special case where only the previous value is needed.

You can mix types within a single table by column. Track city with Type 2 for full history, but overwrite phone with Type 1. There’s no analytical reason to keep historical phone numbers.

Preparing Lab Data

jaffle_shop’s raw_customers doesn’t have an address column. We need to generate synthetic data to demonstrate SCD.

import duckdb

conn = duckdb.connect('warehouse.duckdb')

# Customer data for SCD demo: add city, membership_grade, updated_at
conn.execute("""
CREATE SCHEMA IF NOT EXISTS bronze;

CREATE OR REPLACE TABLE bronze.customers_v2 AS
SELECT
    id AS customer_id,
    first_name,
    last_name,
    CASE
        WHEN id % 3 = 0 THEN 'Seoul'
        WHEN id % 3 = 1 THEN 'Busan'
        ELSE 'Daejeon'
    END AS city,
    CASE
        WHEN id % 4 = 0 THEN 'Gold'
        WHEN id % 4 = 1 THEN 'Silver'
        WHEN id % 4 = 2 THEN 'Bronze'
        ELSE 'Standard'
    END AS membership_grade,
    TIMESTAMP '2025-01-15 09:00:00' AS updated_at
FROM read_csv_auto(
    'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv'
);
""")

conn.execute("SELECT * FROM bronze.customers_v2 LIMIT 5").fetchdf()

We also create change simulation data. A scenario where some customers moved and their tiers were upgraded.

# Change data: some customers have moved
conn.execute("""
CREATE OR REPLACE TABLE bronze.customers_v2_updated AS
SELECT
    customer_id,
    first_name,
    last_name,
    CASE
        WHEN customer_id IN (1, 3, 5) THEN 'Jeju'
        ELSE city
    END AS city,
    CASE
        WHEN customer_id IN (2, 4) THEN 'Gold'
        ELSE membership_grade
    END AS membership_grade,
    CASE
        WHEN customer_id IN (1, 2, 3, 4, 5) THEN TIMESTAMP '2026-02-20 14:00:00'
        ELSE updated_at
    END AS updated_at
FROM bronze.customers_v2;
""")

# Check changed customers
conn.execute("""
SELECT a.customer_id, a.city AS before_city, b.city AS after_city,
       a.membership_grade AS before_grade, b.membership_grade AS after_grade
FROM bronze.customers_v2 a
JOIN bronze.customers_v2_updated b ON a.customer_id = b.customer_id
WHERE a.city != b.city OR a.membership_grade != b.membership_grade
""").fetchdf()

Automating SCD Type 2 with dbt snapshot

What Is a Snapshot

We implemented Type 2 manually in SQL above. Close the existing row, insert the new one, manage valid_from/valid_to. That’s manageable for one table. When dimension tables grow to 10 or 20, writing this logic from scratch every time isn’t realistic.

dbt snapshot handles this for you. Define a single snapshot file, and dbt detects changes in the source data and manages the valid_from/valid_to rows automatically.

Writing the Snapshot File

import os
os.makedirs('jaffle_shop/snapshots', exist_ok=True)

Note that %%writefile must be the first line of its own cell, so the file is written from a separate cell.

%%writefile jaffle_shop/snapshots/snap_customers.sql
{% snapshot snap_customers %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from bronze.customers_v2

{% endsnapshot %}

strategy='timestamp' – determines change based on the updated_at column. If updated_at is newer than the previous snapshot point, the row is considered changed.

unique_key='customer_id' – the key that identifies which row is which. Previous and current values are compared based on this key.

Running the Snapshot

from dbt.cli.main import dbtRunner

# Colab's ! shell commands spawn a separate process.
# DuckDB uses file locks that prevent concurrent writes across processes.
# Running via dbtRunner within the same process avoids lock conflicts.
result = dbtRunner().invoke(['snapshot', '--project-dir', 'jaffle_shop', '--profiles-dir', 'jaffle_shop'])

This is the first run. All rows are new, so they’re inserted as-is. dbt automatically adds dbt_valid_from and dbt_valid_to columns.

conn.execute("SELECT * FROM snapshots.snap_customers LIMIT 5").fetchdf()

dbt_valid_to is NULL for all rows. That means they’re currently active. dbt snapshot uses NULL instead of 9999-12-31.

Now let’s inject the change data and run again.

# Replace source table with changed data
conn.execute("""
CREATE OR REPLACE TABLE bronze.customers_v2 AS
SELECT * FROM bronze.customers_v2_updated;
""")

conn.close()
# Re-run snapshot
result = dbtRunner().invoke(['snapshot', '--project-dir', 'jaffle_shop', '--profiles-dir', 'jaffle_shop'])
conn = duckdb.connect('warehouse.duckdb')

# Check customers with history
conn.execute("""
SELECT customer_id, city, membership_grade,
       dbt_valid_from, dbt_valid_to
FROM snapshots.snap_customers
WHERE customer_id IN (1, 2, 3)
ORDER BY customer_id, dbt_valid_from
""").fetchdf()

Customer with customer_id = 1 now has two rows. The first row’s dbt_valid_to is populated, and the second row is the currently active one. Type 2, implemented without hand-writing any of the close-and-insert logic.

The Check Strategy

Some sources don’t have an updated_at column. As mentioned in Part 2, surprisingly many systems UPDATE data without touching updated_at.

In these cases, use the check strategy. It directly compares whether specified column values have changed.

%%writefile jaffle_shop/snapshots/snap_customers_check.sql
{% snapshot snap_customers_check %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['city', 'membership_grade']
    )
}}

select
    customer_id,
    first_name,
    last_name,
    city,
    membership_grade
from bronze.customers_v2

{% endsnapshot %}

check_cols=['city', 'membership_grade'] – if either of these column values differs from before, it’s treated as a change. No updated_at needed. The trade-off is that it compares all rows every time, so it’s slower than the timestamp strategy for large datasets.

Connecting Snapshots to Silver/Gold

Snapshots are stored in a separate schema (snapshots), neither Bronze nor Silver. Here’s how this fits into the Medallion Architecture.

Source → [Bronze] → [Silver] → [Gold]
         Bronze → [Snapshot] ──┘

This is the layer structure from Part 1 with snapshots added. Snapshots look directly at Bronze data, and Silver or Gold models reference the snapshot results.

Here’s how a Silver model references a snapshot.

%%writefile jaffle_shop/models/staging/stg_customers_hist.sql
with source as (
    select * from {{ ref('snap_customers') }}
),

cleaned as (
    select
        customer_id,
        first_name,
        last_name,
        city,
        membership_grade,
        dbt_valid_from AS valid_from,
        coalesce(dbt_valid_to, '9999-12-31'::timestamp) AS valid_to,
        dbt_valid_to IS NULL AS is_current
    from source
)

select * from cleaned

We renamed dbt’s dbt_valid_from/dbt_valid_to to valid_from/valid_to, and converted NULL to 9999-12-31. This makes point-in-time range joins in Gold more convenient, since comparisons against NULL would silently drop the current rows.

A point-in-time join from a Gold fact table looks like this.

-- Gold: join with customer info at order time
select
    o.order_id,
    o.order_date,
    c.city AS city_at_order,
    c.membership_grade AS grade_at_order
from stg_orders o
join stg_customers_hist c
  on o.customer_id = c.customer_id
 and o.order_date >= c.valid_from
 and o.order_date < c.valid_to

SCD Application Patterns Summary

Pattern                          | Target                                     | Implementation
Type 1 (overwrite)               | Typos, code descriptions, phone numbers    | Simple UPDATE
Type 2 (history accumulation)    | Address, tier, department                  | dbt snapshot (timestamp / check)
Type 3 (preserve previous value) | Before/after reorg comparison              | Add previous_ column
Mixed                            | Per-column differentiation within a table  | Combine Type 1 + Type 2

In practice, Type 2 dominates. Since dbt snapshot handles the implementation, the overhead isn’t significant. Type 1 is reserved for attributes that don’t need history. Type 3 is used in rare cases where only the previous value matters.

Practical Reference: Running dbt snapshot from Airflow

In Part 3, we set up an Airflow DAG running dbt run → dbt test. Adding snapshots changes the order.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='medallion_with_snapshot',
    schedule='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    # 1. Run snapshot first — capture change history from Bronze
    run_snapshot = BashOperator(
        task_id='dbt_snapshot',
        bash_command='cd /opt/dbt/jaffle_shop && dbt snapshot',
    )

    # 2. Silver transformation — some models reference snapshot results
    run_staging = BashOperator(
        task_id='dbt_run_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select staging',
    )

    # 3. Gold transformation
    run_marts = BashOperator(
        task_id='dbt_run_marts',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select marts',
    )

    run_snapshot >> run_staging >> run_marts

The critical ordering is run_snapshot >> run_staging. The Silver model stg_customers_hist references snap_customers. Snapshots must run first so Silver reflects the latest history. If you run snapshots after Silver, changes detected in this batch won’t appear in Silver until the next batch. A full day’s delay.

The next post covers the Gold layer. The process of combining cleansed Silver data and snapshot history to build fact and dimension tables. The dbt marts directory takes center stage.
