*Try it in Google Colab*

## How a Data Lake Turns into a Swamp

Some teams dump files into a data lake and try to analyze them right away. At first, it's fast. Upload a CSV, write one SQL query, and you get results.

Three months later, things look different. Nobody knows who uploaded which file. You can't tell whether a file is the raw source or a processed version. The same revenue table shows different numbers depending on which department you ask. This is what people call a data swamp.

The cause is simple: raw data and processed artifacts are mixed in the same space. Separating them into layers solves the problem.

## Bronze, Silver, Gold

The Medallion Architecture divides data into three layers. Databricks coined the name and popularized it, but the concept itself is the same layered approach traditional data warehouses have used for decades.

```
Source System → [Bronze]  → [Silver]     → [Gold]      → BI / Analytics
                Raw Load    Cleanse &      Business
                            Standardize    Aggregation
```

**Bronze** is the raw source. Data from source systems is stored without any transformation – CSV, JSON, API responses, exactly as-is. This is the starting point for data lineage. If you change anything here, you lose the original.

**Silver** is cleansing and standardization. You fix data types from Bronze, remove duplicates, and unify keys. This layer makes data "ready for analysis." No business logic goes in yet.

**Gold** is business-level aggregation: fact tables, dimension tables, KPI marts. This is the layer end users query directly. The star schema we covered in DW Modeling Part 1 belongs here.

The key point is that each layer has a clear role. No transformation in Bronze. No business logic in Silver. Business-level processing happens only in Gold. Once you break these rules, the whole point of having layers disappears.

## Mapping to Traditional DW Layers

In the DW Modeling series, we covered the Raw → Staging → Integration → Mart structure. The names differ from Medallion, but the roles are nearly identical.
| Medallion | Traditional DW | Purpose |
|---|---|---|
| Bronze | Raw / Staging | Raw load, no transformation |
| Silver | Integration (3NF / Data Vault) | Cleansing, standardization, key unification |
| Gold | Mart (Star Schema) | Business aggregation, analytics-ready |

In traditional DW, an ETL server handled heavy transformations between Staging and Integration. Medallion follows the ELT paradigm: you first load into Bronze, then build Silver and Gold inside the DW engine. The difference is that transformations are handled by the DW engine's compute power rather than by a separate server.

## The Lab Environment for This Series

We'll use three tools throughout this series. All are free, and you can run them directly in Google Colab without a cloud account.

| Tool | Role |
|---|---|
| DuckDB | Local DW engine. Columnar storage, so it behaves the same way as BigQuery/Snowflake |
| dbt-core + dbt-duckdb | Transformation layer. Defines Bronze → Silver → Gold in SQL |
| Soda Core | Data quality validation. Sets quality gates between layers |

There's a reason we chose DuckDB. It installs with a single `pip install`, yet works the same way as real cloud DWs: it reads Parquet and CSV natively, analyzes them with SQL, and scans by column thanks to columnar storage. Think of it as a miniature cloud DW running locally.

## Environment Setup

Run the following in a Colab cell and you're ready to go.

```python
# Install tools
!pip install -q duckdb dbt-core dbt-duckdb

import duckdb

# Create a DuckDB database file
conn = duckdb.connect('warehouse.duckdb')
print(f"DuckDB {duckdb.__version__} ready")
```

## Preparing Sample Data

The sample data for this series is a simple e-commerce dataset with three tables: orders, customers, and payments. It's the same domain as the structure we covered in DW Modeling Part 2.
```python
# Bronze layer: load raw data as-is
conn.execute("""
    CREATE SCHEMA IF NOT EXISTS bronze;

    CREATE OR REPLACE TABLE bronze.orders AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_orders.csv');

    CREATE OR REPLACE TABLE bronze.customers AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_customers.csv');

    CREATE OR REPLACE TABLE bronze.payments AS
    SELECT * FROM read_csv_auto(
        'https://raw.githubusercontent.com/dbt-labs/jaffle_shop/main/seeds/raw_payments.csv');
""")

# Verify the load
conn.execute("SELECT count(*) AS cnt FROM bronze.orders").fetchdf()
```

The data is now in Bronze. We read CSVs and put them into DuckDB – nothing more. No type casting, no column renaming. That's the Bronze principle.

```python
# Check Bronze data
conn.execute("SELECT * FROM bronze.orders LIMIT 5").fetchdf()
```

At this point, you'll be tempted to run analytical queries directly against Bronze. Resist. If you use Bronze data directly for analysis, you'll end up in a data swamp within three months. The next post covers the process of promoting Bronze to Silver.

## Why Go Through All This Trouble

People ask whether separating layers makes things slower. After all, you use more storage and add more transformation steps. True. But you gain three things in return.

**Reprocessing becomes possible.** If there's a bug in your Silver logic, just rebuild from Bronze. The original data is still there. Without Bronze, you'd have to pull from the source system all over again.

**Problem tracing works.** If Gold numbers look wrong, check Silver. If Silver looks wrong, check Bronze. You can pinpoint exactly which layer introduced the problem.

**Responsibilities are separated.** Data engineers own Bronze-to-Silver. Analytics engineers own Silver-to-Gold. Neither team needs to touch the other's territory.

In cloud environments, storage costs are practically negligible.
The cost of maintaining one extra layer is far less than the cost of falling into a data swamp.

## Practical Reference: Airflow DAG

This series uses Colab + dbt for hands-on exercises, but in production, Airflow handles pipeline scheduling and orchestration. Here's what the Medallion Architecture's Bronze → Silver → Gold flow looks like as an Airflow DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='medallion_pipeline',
    schedule='0 6 * * *',  # daily at 6 AM
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    bronze = BashOperator(
        task_id='load_bronze',
        bash_command='python scripts/load_bronze.py',
    )
    silver = BashOperator(
        task_id='run_silver',
        bash_command='cd dbt_project && dbt run --select staging',
    )
    gold = BashOperator(
        task_id='run_gold',
        bash_command='cd dbt_project && dbt run --select marts',
    )

    bronze >> silver >> gold
```

`bronze >> silver >> gold` – the dependency chain reads in a single line. Bronze loading must finish before Silver runs, and Silver must finish before Gold runs. Airflow guarantees this order, sends alerts on failure, and handles reruns.

dbt defines *what* to transform. Airflow defines *when and in what order* to run it. Different roles.

The next post dives into the Bronze layer in depth: the difference between Full Load and Incremental Load, and how to choose the right column for incremental extraction.