*Try it in Google Colab*

## How a Data Lake Turns into a Swamp

Some teams dump files into a data lake and try to analyze them right away. At first, it's fast. Upload a CSV, write one SQL query, and you get results.

Three months later, things look different. Nobody knows who uploaded which file. You can't tell whether a file is the raw source or a processed version. The same revenue table shows different numbers depending on which department you ask. This is what people call a data swamp.

The cause is simple: raw data and processed artifacts are mixed in the same space. Separating them into layers solves the problem.

## Bronze, Silver, Gold

The Medallion Architecture divides data into three layers. Databricks coined the name and popularized it, but the concept itself is the same layered approach traditional data warehouses have used for decades.

```
Source System → [Bronze]  → [Silver]     → [Gold]      → BI / Analytics
                Raw Load    Cleanse &      Business
                            Standardize    Aggregation
```

**Bronze** is the raw source. Data from source systems is stored without any transformation – CSV, JSON, API responses, exactly as-is. This is the starting point for data lineage. If you change anything here, you lose the original.

**Silver** is cleansing and standardization. You fix data types from Bronze, remove duplicates, and unify keys. This layer makes data "ready for analysis." No business logic goes in yet.

**Gold** is business-level aggregation: fact tables, dimension tables, KPI marts. This is the layer end users query directly. The star schema we covered in DW Modeling Part 1 belongs here.

The key point is that each layer has a clear role. No transformation in Bronze. No business logic in Silver. Business-level processing happens only in Gold. Once you break these rules, the whole point of having layers disappears.

## Mapping to Traditional DW Layers

In the DW Modeling series, we covered the Raw → Staging → Integration → Mart structure. The names differ from Medallion, but the roles are nearly identical.
| Medallion | Traditional DW | Purpose |
|---|---|---|
| Bronze | Raw / Staging | Raw load, no transformation |
| Silver | Integration (3NF / Data Vault) | Cleansing, standardization, key unification |
| Gold | Mart (Star Schema) | Business aggregation, analytics-ready |

In traditional DW, an ETL server handled heavy transformations between Staging and Integration. Medallion follows the ELT paradigm: you first load into Bronze, then build Silver and Gold inside the DW engine. The difference is that transformations are handled by the DW engine's compute power rather than by a separate server.

## The Lab Environment for This Series

We'll use three tools throughout this series. All are free, and you can run them directly in Google Colab without a cloud account.

| Tool | Role |
|---|---|
| DuckDB | Local DW engine. Columnar storage, so it behaves the same way as BigQuery/Snowflake |
| dbt-core + dbt-duckdb | Transformation layer. Defines Bronze → Silver → Gold in SQL |
| Soda Core | Data quality validation. Sets quality gates between layers |

There's a reason we chose DuckDB. It installs with a single `pip install`, yet works the same way as real cloud DWs: it reads Parquet and CSV natively, analyzes them with SQL, and scans by column thanks to columnar storage. Think of it as a miniature cloud DW running locally.

## Environment Setup

Run the following in a Colab cell and you're ready to go.

```python
# Install tools
!pip install -q duckdb dbt-core dbt-duckdb

import duckdb

# Create a DuckDB database file
conn = duckdb.connect('warehouse.duckdb')
print(f"DuckDB {duckdb.__version__} ready")
```

## Preparing Sample Data

The sample data for this series is a simple e-commerce dataset with three tables: orders, customers, and payments. It's the same domain as the structure we covered in DW Modeling Part 2.
```python
# Bronze layer: load raw data as-is
conn.execute("""
    CREATE SCHEMA IF NOT EXISTS bronze;

    CREATE OR REPLACE TABLE bronze.orders AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv');

    CREATE OR REPLACE TABLE bronze.customers AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv');

    CREATE OR REPLACE TABLE bronze.payments AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_payments.csv');
""")

# Verify the load
conn.execute("SELECT count(*) AS cnt FROM bronze.orders").fetchdf()
```

The data is now in Bronze. We read CSVs and put them into DuckDB – nothing more. No type casting, no column renaming. That's the Bronze principle.

```python
# Check Bronze data
conn.execute("SELECT * FROM bronze.orders LIMIT 5").fetchdf()
```

At this point, you'll be tempted to run analytical queries directly against Bronze. Resist. If you use Bronze data directly for analysis, you'll end up in a data swamp within three months. The next post covers the process of promoting Bronze to Silver.

## Why Go Through All This Trouble

People ask whether separating layers makes things slower. After all, you use more storage and add more transformation steps. True. But you gain three things in return.

**Reprocessing becomes possible.** If there's a bug in your Silver logic, just rebuild from Bronze. The original data is still there. Without Bronze, you'd have to pull from the source system all over again.

**Problem tracing works.** If Gold numbers look wrong, check Silver. If Silver looks wrong, check Bronze. You can pinpoint exactly which layer introduced the problem.

**Responsibilities are separated.** Data engineers own Bronze-to-Silver. Analytics engineers own Silver-to-Gold. Neither team needs to touch the other's territory.

In cloud environments, storage costs are practically negligible.
The cost of maintaining one extra layer is far less than the cost of falling into a data swamp.

## Practical Reference: Airflow DAG

This series uses Colab + dbt for hands-on exercises, but in production, Airflow handles pipeline scheduling and orchestration. Here's what the Medallion Architecture's Bronze → Silver → Gold flow looks like as an Airflow DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='medallion_pipeline',
    schedule='0 6 * * *',  # daily at 6 AM
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    bronze = BashOperator(
        task_id='load_bronze',
        bash_command='python scripts/load_bronze.py',
    )
    silver = BashOperator(
        task_id='run_silver',
        bash_command='cd dbt_project && dbt run --select staging',
    )
    gold = BashOperator(
        task_id='run_gold',
        bash_command='cd dbt_project && dbt run --select marts',
    )

    bronze >> silver >> gold
```

`bronze >> silver >> gold` – the dependency chain reads in a single line. Bronze loading must finish before Silver runs, and Silver must finish before Gold runs. Airflow guarantees this order, sends alerts on failure, and handles reruns.

dbt defines *what* to transform. Airflow defines *when and in what order* to run it. Different roles.

The next post dives into the Bronze layer in depth: the difference between Full Load and Incremental Load, and how to choose the right column for incremental extraction.