Try it in Google Colab

What Happens When You Use Bronze Data Directly

In Part 2, we loaded raw data into Bronze as-is, with no transformation. That principle is correct. The problem is that Bronze data isn’t in a state you can use for analysis.

Look at jaffle_shop’s bronze.orders. The order_date column came in as VARCHAR. You can’t use date functions on it. The status column has a mix of returned, return_pending, completed, placed, and shipped – and from the schema alone, you can’t tell which value represents the final state.

The amount column in bronze.payments is an integer in cents. To convert to dollars, you need to divide by 100. Doing this division manually every time you analyze is a recipe for mistakes.

Silver is the layer that cleans all of this up in one place. Fix types, unify column names, convert units. No business logic yet. Silver’s job is to create a “clean state ready for analysis.”

What Silver Does and Doesn’t Do

Maintaining boundaries is important. Once you start putting business logic in Silver, the whole point of separating Bronze and Silver disappears.

What Silver does:

  • Type casting – VARCHAR to DATE, INTEGER to DECIMAL
  • Column name standardization – unify user_id and userId to user_id
  • Unit conversion – cents to dollars, milliseconds to seconds
  • Deduplication – when the same record was loaded twice
  • NULL handling – unify empty strings to NULL

What Silver doesn’t do:

  • KPI calculation – business metrics like revenue or margin rate
  • Table joins – combining orders and customers into a single view
  • Aggregation – summarizing with GROUP BY

Joins and aggregation belong to Gold. Silver only cleanses at the individual table level.
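The whole do-list can land in a single SELECT. A minimal sketch using SQLite (Python’s stdlib) as a stand-in for DuckDB, with a toy bronze_orders table and made-up values:

```python
import sqlite3

# Toy stand-in for bronze.orders: everything arrives as text or raw integers.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_orders (
        id INTEGER,
        user_id INTEGER,
        order_date TEXT,      -- dates arrived as text
        status TEXT,          -- empty string instead of NULL
        amount_cents INTEGER  -- money in cents
    )
""")
conn.execute("INSERT INTO bronze_orders VALUES (1, 42, '2024-01-15', '', 2999)")

# Silver-style cleansing: rename, cast, convert units, normalize NULLs.
row = conn.execute("""
    SELECT
        id                    AS order_id,        -- standardize column names
        user_id               AS customer_id,
        date(order_date)      AS order_date,      -- text -> ISO date
        NULLIF(status, '')    AS status,          -- empty string -> NULL
        amount_cents / 100.0  AS amount_dollars   -- cents -> dollars
    FROM bronze_orders
""").fetchone()

print(row)  # (1, 42, '2024-01-15', None, 29.99)
```

Note what is absent: no joins, no GROUP BY. Every transformation touches a single table.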

Why dbt Is Needed

In Part 1, we introduced dbt as a tool. Why not just run SQL files directly?

Running SQL files one by one works fine at first. When Silver tables grow to 5, 10, or more, things change. You lose track of which table depends on which Bronze table, what order to run them in, and when the last run happened.

dbt solves this. Each SQL file is a model. Declare dependencies between models using the ref() function, and dbt figures out the execution order automatically. Since transformation logic lives in SQL files, you get version history through Git as well.
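For instance, a downstream model would point at stg_orders through ref() rather than a hard-coded table name (the model shown here is hypothetical, not one we build in this series):

```sql
-- models/marts/order_dates.sql (hypothetical downstream model)
select
    order_id,
    order_date
from {{ ref('stg_orders') }}
```

From this one line, dbt knows stg_orders must be built first, and the dependency shows up in the project’s graph.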

dbt Project Setup

Create a dbt project in Colab.

!pip install -q duckdb dbt-core dbt-duckdb

import os

# Create dbt project directory structure
os.makedirs('jaffle_shop/models/staging', exist_ok=True)
os.makedirs('jaffle_shop/models/marts', exist_ok=True)

Create the dbt configuration files: dbt_project.yml describes the project, and profiles.yml tells dbt to use DuckDB and where the database file lives.

%%writefile jaffle_shop/dbt_project.yml
name: 'jaffle_shop'
version: '1.0.0'
profile: 'jaffle_shop'

model-paths: ["models"]

%%writefile jaffle_shop/profiles.yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /content/warehouse.duckdb

Writing Silver Models

In dbt, Silver layer models go in the models/staging/ directory. The stg_ prefix stands for staging (= Silver).

stg_orders

%%writefile jaffle_shop/models/staging/stg_orders.sql
with source as (
    select * from bronze.orders
),

cleaned as (
    select
        id as order_id,
        user_id as customer_id,
        cast(order_date as date) as order_date,
        status
    from source
)

select * from cleaned

We renamed Bronze’s id to order_id. When joining multiple tables, a bare id doesn’t tell you which table it belongs to. We also renamed user_id to customer_id to clarify meaning. order_date is cast to DATE.

stg_customers

%%writefile jaffle_shop/models/staging/stg_customers.sql
with source as (
    select * from bronze.customers
),

cleaned as (
    select
        id as customer_id,
        first_name,
        last_name
    from source
)

select * from cleaned

stg_payments

%%writefile jaffle_shop/models/staging/stg_payments.sql
with source as (
    select * from bronze.payments
),

cleaned as (
    select
        id as payment_id,
        order_id,
        payment_method,
        amount / 100.0 as amount_dollars
    from source
)

select * from cleaned

We divided amount by 100 to convert to dollars. The column name is changed to amount_dollars so the unit is immediately obvious from the name.

Running dbt

!cd jaffle_shop && dbt run --select staging.*

dbt executes the three models: stg_orders, stg_customers, stg_payments. Each is created as a view in DuckDB.
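Views are dbt’s default materialization. To make that explicit, or to later switch the staging models to tables, the setting goes in dbt_project.yml (a sketch; not required for this tutorial):

```yaml
# dbt_project.yml (excerpt)
models:
  jaffle_shop:
    staging:
      +materialized: view   # the default; change to `table` to persist results
```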

Checking the Results

import duckdb

conn = duckdb.connect('/content/warehouse.duckdb')  # same path as profiles.yml

# Check Silver layer
conn.execute("SELECT * FROM stg_orders LIMIT 5").fetchdf()

# Check types — has order_date been converted to DATE?
conn.execute("""
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'stg_orders'
""").fetchdf()

# Has payments amount been converted to dollars?
conn.execute("SELECT * FROM stg_payments LIMIT 5").fetchdf()

order_date, which was VARCHAR in Bronze, is now DATE. amount has been converted from cents to dollars. Column names are unified. That’s Silver.

The CTE Pattern

There’s a recurring pattern in the SQL above: with source as (...), cleaned as (...) select * from cleaned. This is the CTE (Common Table Expression) pattern widely used in the dbt community.

with source as (
    -- Step 1: pull the raw data from Bronze
    select * from bronze.orders
),

cleaned as (
    -- Step 2: apply cleansing logic
    select
        id as order_id,
        cast(order_date as date) as order_date
    from source
)

-- Step 3: return the final result
select * from cleaned

source → cleaned → select. Each step’s purpose is readable from its name. When cleansing logic grows more complex, just add more CTEs. Some teams use names like renamed, filtered, or deduplicated to break things into stages.
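A sketch of what a multi-stage version of the orders model might look like (the stage names and the filter condition are illustrative, not from this series):

```sql
with source as (
    select * from bronze.orders
),

renamed as (
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from source
),

filtered as (
    -- illustrative: drop rows that could never be joined downstream
    select * from renamed
    where order_id is not null
),

cleaned as (
    select
        order_id,
        customer_id,
        cast(order_date as date) as order_date,
        status
    from filtered
)

select * from cleaned
```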

Deduplication Pattern

Sometimes the same record ends up in Bronze twice. Maybe the source system re-sent data, or there was a bug in the incremental load logic. Silver needs to catch this.

with source as (
    select * from bronze.orders
),

deduplicated as (
    select
        *,
        row_number() over (
            partition by id
            order by _loaded_at desc
        ) as row_num
    from source
),

cleaned as (
    select
        id as order_id,
        user_id as customer_id,
        cast(order_date as date) as order_date,
        status
    from deduplicated
    where row_num = 1
)

select * from cleaned

row_number() keeps only the most recently loaded record when the same id appears multiple times. The _loaded_at metadata column we added in Part 2 is what makes this work.
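The pattern is easy to verify outside dbt. A minimal sketch using SQLite (Python’s stdlib, standing in for DuckDB; window functions need SQLite 3.25+) with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bronze_orders (id INTEGER, status TEXT, _loaded_at TEXT)"
)
# The same order arrived twice; the later load carries the corrected status.
conn.executemany(
    "INSERT INTO bronze_orders VALUES (?, ?, ?)",
    [
        (1, "placed",    "2024-01-01 06:00:00"),
        (1, "completed", "2024-01-02 06:00:00"),  # duplicate of id 1, loaded later
        (2, "shipped",   "2024-01-01 06:00:00"),
    ],
)

# Keep only the most recently loaded row per id.
rows = conn.execute("""
    WITH deduplicated AS (
        SELECT *,
               row_number() OVER (
                   PARTITION BY id
                   ORDER BY _loaded_at DESC
               ) AS row_num
        FROM bronze_orders
    )
    SELECT id, status FROM deduplicated
    WHERE row_num = 1
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 'completed'), (2, 'shipped')]
```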

Change Silver Carelessly and Gold Breaks

Gold models rely on Silver table column names, types, and units. If stg_orders has order_date as DATE and Gold uses date functions on it, then someone renaming the column to ordered_at in Silver causes every Gold model to throw errors.

Adding columns is fine. Renaming or changing the type of existing columns is dangerous. dbt’s ref() function tracks dependencies, so you can at least see what gets affected: for example, dbt ls --select stg_orders+ lists stg_orders and everything downstream of it.

Practical Reference: Running dbt from Airflow

There are several ways to run dbt from Airflow. The simplest is calling dbt run via BashOperator. For more sophisticated setups, you can use the cosmos library.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='silver_transformation',
    schedule='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    # Run Silver transformation after Bronze load completes
    run_staging = BashOperator(
        task_id='dbt_run_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select staging',
    )

    # Validate Silver data quality with dbt test
    test_staging = BashOperator(
        task_id='dbt_test_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt test --select staging',
    )

    run_staging >> test_staging

dbt run followed by dbt test. Quality validation runs immediately after Silver transformation finishes. If tests fail, the pipeline doesn’t proceed to Gold transformation. This structure prevents bad data from propagating up to Gold.
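One thing the DAG glosses over: dbt test only runs tests that have been declared. A minimal schema.yml sketch for the staging models (the accepted status values are the ones we saw in Bronze earlier):

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
```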

With the cosmos library, you can split each dbt model into its own Airflow task. If stg_orders fails, stg_customers can still succeed independently. This granularity becomes meaningful when you have dozens of models.

The next post covers SCD (Slowly Changing Dimensions): when a customer’s address changes, how do you preserve the historical address? We’ll look at the differences between Types 1, 2, and 3, and how to choose between them.
