Try it in Google Colab

## What Happens When You Use Bronze Data Directly

In Part 2, we loaded raw data into Bronze as-is. No transformation. That principle is correct. The problem is that Bronze data isn't in a state you can use for analysis.

Look at jaffle_shop's `bronze.orders`. The `order_date` column came in as VARCHAR, so you can't use date functions on it. The `status` column has a mix of `returned`, `return_pending`, `completed`, `placed`, and `shipped` – and from the schema alone, you can't tell which value represents the final state. The `amount` column in `bronze.payments` is an integer in cents. To convert to dollars, you need to divide by 100. Doing this division manually every time you analyze is a recipe for mistakes.

Silver is the layer that cleans all of this up in one place. Fix types, unify column names, convert units. No business logic yet. Silver's job is to create a "clean state ready for analysis."

## What Silver Does and Doesn't Do

Maintaining boundaries is important. Once you start putting business logic in Silver, the whole point of separating Bronze and Silver disappears.

What Silver does:

- **Type casting** – VARCHAR to DATE, INTEGER to DECIMAL
- **Column name standardization** – unify `user_id` and `userId` to `user_id`
- **Unit conversion** – cents to dollars, milliseconds to seconds
- **Deduplication** – when the same record was loaded twice
- **NULL handling** – unify empty strings to NULL

What Silver doesn't do:

- **KPI calculation** – business metrics like revenue or margin rate
- **Table joins** – combining orders and customers into a single view
- **Aggregation** – summarizing with GROUP BY

Joins and aggregation belong to Gold. Silver only cleanses at the individual table level.

## Why dbt Is Needed

In Part 1, we introduced dbt as a tool. Why not just run SQL files directly?

Running SQL files one by one works fine at first. When Silver tables grow to 5, 10, or more, things change. You lose track of which table depends on which Bronze table, what order to run them in, and when the last run happened. dbt solves this.
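For example, suppose a later Gold model needs daily order counts from the staging layer. Writing `{{ ref('stg_orders') }}` instead of a hard-coded table name is what tells dbt about the dependency (this is a hypothetical model, shown only to illustrate the mechanism):

```sql
-- hypothetical Gold model: because it uses ref('stg_orders'),
-- dbt knows stg_orders must be built before this model runs
select
    order_date,
    count(*) as orders_placed
from {{ ref('stg_orders') }}
group by order_date
```

dbt parses every model for `ref()` calls, assembles the dependency graph, and runs models in the right order; no hand-maintained run script is needed.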
Each SQL file is a model. Declare dependencies between models using the `ref()` function, and dbt figures out the execution order automatically. Since transformation logic lives in SQL files, you get version history through Git as well.

## dbt Project Setup

Create a dbt project in Colab.

```python
!pip install -q duckdb dbt-core dbt-duckdb
```

```python
import os

# Create dbt project directory structure
os.makedirs('jaffle_shop/models/staging', exist_ok=True)
os.makedirs('jaffle_shop/models/marts', exist_ok=True)
```

Create the dbt configuration files. Specify DuckDB as the database.

```yaml
%%writefile jaffle_shop/dbt_project.yml
name: 'jaffle_shop'
version: '1.0.0'
profile: 'jaffle_shop'
model-paths: ["models"]
```

```yaml
%%writefile jaffle_shop/profiles.yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /content/warehouse.duckdb
```

## Writing Silver Models

In dbt, Silver layer models go in the `models/staging/` directory. The `stg_` prefix stands for staging (= Silver).

### stg_orders

```sql
%%writefile jaffle_shop/models/staging/stg_orders.sql
with source as (
    select * from bronze.orders
),

cleaned as (
    select
        id as order_id,
        user_id as customer_id,
        cast(order_date as date) as order_date,
        status
    from source
)

select * from cleaned
```

We renamed Bronze's `id` to `order_id`. When joining multiple tables, a bare `id` doesn't tell you which table it belongs to. We also renamed `user_id` to `customer_id` to clarify meaning. `order_date` is cast to DATE.

### stg_customers

```sql
%%writefile jaffle_shop/models/staging/stg_customers.sql
with source as (
    select * from bronze.customers
),

cleaned as (
    select
        id as customer_id,
        first_name,
        last_name
    from source
)

select * from cleaned
```

### stg_payments

```sql
%%writefile jaffle_shop/models/staging/stg_payments.sql
with source as (
    select * from bronze.payments
),

cleaned as (
    select
        id as payment_id,
        order_id,
        payment_method,
        amount / 100.0 as amount_dollars
    from source
)

select * from cleaned
```

We divided `amount` by 100 to convert to dollars.
The column name is changed to `amount_dollars` so the unit is immediately obvious from the name.

## Running dbt

```python
!cd jaffle_shop && dbt run --select staging.*
```

dbt executes the three models: stg_orders, stg_customers, and stg_payments. Each is created as a view in DuckDB.

## Checking the Results

```python
import duckdb

conn = duckdb.connect('warehouse.duckdb')

# Check Silver layer
conn.execute("SELECT * FROM stg_orders LIMIT 5").fetchdf()

# Check types — has order_date been converted to DATE?
conn.execute("""
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'stg_orders'
""").fetchdf()

# Has payments amount been converted to dollars?
conn.execute("SELECT * FROM stg_payments LIMIT 5").fetchdf()
```

`order_date`, which was VARCHAR in Bronze, is now DATE. `amount` has been converted from cents to dollars. Column names are unified. That's Silver.

## The CTE Pattern

There's a recurring pattern in the SQL above: `with source as (...), cleaned as (...) select * from cleaned`. This is the CTE (Common Table Expression) pattern widely used in the dbt community.

```sql
with source as (
    -- Step 1: pull the raw data from Bronze
    select * from bronze.orders
),

cleaned as (
    -- Step 2: apply cleansing logic
    select
        id as order_id,
        cast(order_date as date) as order_date
    from source
)

-- Step 3: return the final result
select * from cleaned
```

source → cleaned → select. Each step's purpose is readable from its name. When cleansing logic grows more complex, just add more CTEs. Some teams use names like `renamed`, `filtered`, or `deduplicated` to break things into stages.

## Deduplication Pattern

Sometimes the same record ends up in Bronze twice. Maybe the source system re-sent data, or there was a bug in the incremental load logic. Silver needs to catch this.
```sql
with source as (
    select * from bronze.orders
),

deduplicated as (
    select
        *,
        row_number() over (
            partition by id
            order by _loaded_at desc
        ) as row_num
    from source
),

cleaned as (
    select
        id as order_id,
        user_id as customer_id,
        cast(order_date as date) as order_date,
        status
    from deduplicated
    where row_num = 1
)

select * from cleaned
```

`row_number()` keeps only the most recently loaded record when the same `id` appears multiple times. The `_loaded_at` metadata column we added in Part 2 is what makes this work.

## Change Silver Carelessly and Gold Breaks

Gold models rely on Silver table column names, types, and units. If stg_orders has `order_date` as DATE and Gold uses date functions on it, then someone renaming the column to `ordered_at` in Silver causes every Gold model to throw errors.

Adding columns is fine. Renaming or changing the type of existing columns is dangerous. dbt's `ref()` function tracks dependencies, so you can at least see what gets affected.

## Practical Reference: Running dbt from Airflow

There are several ways to run dbt from Airflow. The simplest is calling `dbt run` via BashOperator. For more sophisticated setups, you can use the cosmos library.

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='silver_transformation',
    schedule='0 6 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # Run Silver transformation after Bronze load completes
    run_staging = BashOperator(
        task_id='dbt_run_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt run --select staging',
    )

    # Validate Silver data quality with dbt test
    test_staging = BashOperator(
        task_id='dbt_test_staging',
        bash_command='cd /opt/dbt/jaffle_shop && dbt test --select staging',
    )

    run_staging >> test_staging
```

`dbt run` followed by `dbt test`. Quality validation runs immediately after the Silver transformation finishes. If tests fail, the pipeline doesn't proceed to Gold transformation.
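What does `dbt test` actually check? Generic tests are declared in a YAML file that sits next to the models. This series hasn't defined those tests, so the file below is an illustrative sketch using dbt's built-in `unique`, `not_null`, and `accepted_values` tests against stg_orders:

```yaml
# models/staging/schema.yml -- illustrative sketch, not defined earlier in this series
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
```

With a file like this in place, `dbt test --select staging` fails the task whenever an `order_id` repeats, an `order_id` is NULL, or an unexpected `status` value appears.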
This structure prevents bad data from propagating up to Gold.

With the cosmos library, you can split each dbt model into its own Airflow task. If stg_orders fails, stg_customers can still succeed independently. This granularity becomes meaningful when you have dozens of models.

The next post covers SCD (Slowly Changing Dimension). When a customer's address changes, how do you preserve the historical address? The differences between Type 1, 2, and 3, and how to choose.