“Do we still need a star schema?”

This question comes up in every cloud DW migration project.

You’re in a meeting about moving an on-premises DW — one that’s been running for years — to BigQuery or Azure Synapse. Someone asks: “Those platforms use columnar storage, so join costs are different. Does that mean we can skip the star schema?”

It depends. But few people can articulate exactly what it depends on.

The On-Premises Playbook

There was an era when Kimball's Dimensional Modeling (the star schema) was the de facto standard.

The reason was straightforward. Disk I/O was expensive, and joins were even more expensive. Joining 10 tables on row-based storage pushed query response times into minutes. Pre-joining data was the rational choice.

Modeling decisions were performance decisions. How far to denormalize, how many aggregate table layers to build, which partition key to choose — these decisions meant the difference between seconds and minutes in query response time.

Kimball’s methodology was an attempt to achieve both business readability and query performance within these constraints. It unified dimension tables — previously siloed by department — into enterprise-wide Conformed Dimensions and used a Bus Matrix to design enterprise integration. The methodology’s rigor holds up well even today.

The problem is that the premises behind this methodology have changed.

What the Cloud Changed

Moving to cloud DW fundamentally altered the physical constraints.

Columnar Storage. BigQuery, Redshift, and Synapse are all column-based. SELECT reads only the columns you need. Query five columns from a table with hundreds, and only those five get scanned. Row-based storage, by contrast, had to read every row in full.

Compute/Storage Separation. Storage became cheap. The cost of redundant data storage is negligible. This is a different world from on-premises, where normalization was partly driven by disk capacity constraints.

MPP Architecture. Massively parallel processing is the default. Join costs are relatively lower compared to on-premises RDBMS. It’s not free — shuffles still hurt — but “avoid joins at all costs” is no longer an absolute rule.

ELT Paradigm. Load raw data first, then transform inside the DW. Transformation logic runs on the DW engine’s compute power. Unlike the ETL era, there’s no separate transformation server doing the heavy lifting before loading.

Semi-structured Data Support. JSON, ARRAY, and STRUCT are handled natively. Flexible schemas that were awkward in traditional relational models can now be processed directly within the DW.
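To make the last point concrete, here is a minimal sketch of querying nested JSON inside the warehouse itself. The cloud engines each have their own syntax (BigQuery's JSON functions and STRUCT dot notation, for example); SQLite's built-in JSON support stands in here so the example is runnable anywhere, and the table and field names are illustrative.

```python
import sqlite3

# Load semi-structured payloads as-is; no upfront relational flattening.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, '{"user": {"id": "u1", "plan": "pro"}, "items": [1, 2]}'),
        (2, '{"user": {"id": "u2", "plan": "free"}, "items": []}'),
    ],
)
# Extract nested attributes directly in SQL at query time.
rows = conn.execute(
    """
    SELECT json_extract(payload, '$.user.plan') AS plan,
           json_array_length(payload, '$.items') AS n_items
    FROM events
    ORDER BY id
    """
).fetchall()
print(rows)  # [('pro', 2), ('free', 0)]
```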

These changes undermine the original rationale for many modeling principles. But weakened rationale is not the same as a wrong principle.

Three Options

Three approaches are commonly compared in cloud DW environments.

1. Kimball Dimensional Modeling

Star schema, facts and dimensions, Conformed Dimensions. Still the most widely used approach.

There are good reasons it remains relevant in the cloud. Business users find it intuitive. The structure of “slice the sales fact by the customer dimension” plays well with BI tools. Power BI, Tableau, and Looker are all optimized for this structure.
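The "slice the fact by a dimension" shape can be sketched in a few lines. This is a toy star schema in SQLite with illustrative names, not any particular warehouse's model, but the query is exactly the pattern BI tools generate against Kimball structures.

```python
import sqlite3

# Minimal star schema: one sales fact, one customer dimension.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        region        TEXT
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
    """
)
# Slice the sales fact by a dimension attribute: one join, one GROUP BY.
rows = conn.execute(
    """
    SELECT d.region, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.region
    ORDER BY d.region
    """
).fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 150.0)]
```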

What has changed: pre-built aggregate tables are less necessary. Columnar storage can aggregate raw facts on the fly with sufficient speed. History management patterns like SCD Type 2 are more practical now that storage costs have dropped.
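For readers who haven't used it, SCD Type 2 keeps every version of a dimension row with a validity window, which is what cheap storage makes practical. A minimal sketch, with hypothetical column names (the '9999-12-31' sentinel and is_current flag are common conventions, not a standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- SCD Type 2: each attribute change becomes a new row with a validity window.
    CREATE TABLE dim_customer_scd2 (
        customer_key INTEGER,   -- surrogate key, one per version
        customer_id  TEXT,      -- stable business key
        region       TEXT,
        valid_from   TEXT,
        valid_to     TEXT,      -- '9999-12-31' marks the open-ended current row
        is_current   INTEGER
    );
    INSERT INTO dim_customer_scd2 VALUES
        (1, 'C001', 'EMEA', '2020-01-01', '2023-06-30', 0),
        (2, 'C001', 'APAC', '2023-07-01', '9999-12-31', 1);
    """
)
# As-of query: which version of C001 was valid on 2022-03-15?
row = conn.execute(
    """
    SELECT region FROM dim_customer_scd2
    WHERE customer_id = 'C001'
      AND '2022-03-15' BETWEEN valid_from AND valid_to
    """
).fetchone()
print(row)  # ('EMEA',)
```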

The weakness is flexibility. When schemas change, you have to redesign the fact/dimension structure. Structurally, it’s difficult to keep up with fast-moving, agile requirements.

2. Data Vault 2.0

Hubs (business keys), Satellites (attributes), Links (relationships). A methodology focused on storing raw data as-is along with its complete history.

The strength is clear: auditability. It fully preserves what the source data looked like at any given time. Adding new source systems or changing schemas doesn’t impact existing structures. Parallel loading is possible, making it a natural fit for the ELT paradigm.

There are practical hurdles. Querying directly is cumbersome. Multiple Hub-Satellite joins are needed to reconstruct a single business entity. As a result, you typically need a separate presentation layer (usually a star schema). That’s one more modeling layer to maintain. If the team has no Data Vault experience, the learning curve is steep.
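The query overhead is easier to see in code. Below is a deliberately tiny Hub-plus-Satellite sketch with illustrative names: even reconstructing the current state of a single entity already requires a join plus latest-row logic, and a real model multiplies this across many Satellites and Links.

```python
import sqlite3

# Minimal Data Vault sketch: the Hub holds the business key, the Satellite
# holds descriptive attributes with full load history.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hub_customer (
        customer_hk   TEXT PRIMARY KEY,  -- hash key of the business key
        customer_id   TEXT,
        load_ts       TEXT,
        record_source TEXT
    );
    CREATE TABLE sat_customer_details (
        customer_hk TEXT REFERENCES hub_customer(customer_hk),
        load_ts     TEXT,
        name        TEXT,
        region      TEXT
    );
    INSERT INTO hub_customer VALUES ('hk1', 'C001', '2023-01-01', 'crm');
    INSERT INTO sat_customer_details VALUES
        ('hk1', '2023-01-01', 'Acme', 'EMEA'),
        ('hk1', '2023-07-01', 'Acme', 'APAC');  -- later change; history is kept
    """
)
# Current view of one entity: join Hub to Satellite, keep the latest row.
rows = conn.execute(
    """
    SELECT h.customer_id, s.name, s.region
    FROM hub_customer h
    JOIN sat_customer_details s ON s.customer_hk = h.customer_hk
    WHERE s.load_ts = (
        SELECT MAX(load_ts) FROM sat_customer_details
        WHERE customer_hk = h.customer_hk
    )
    """
).fetchall()
print(rows)  # [('C001', 'Acme', 'APAC')]
```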

It shines in regulated industries like finance and healthcare where audit trails are mandatory, or in environments where source systems are constantly being added.

3. One Big Table (OBT)

Merge facts and dimensions into a single wide table. Extreme denormalization.

There’s a reason this works in the cloud. With columnar storage, even if the table has 200 columns, only the 5 used in a query get scanned — so the performance hit is minimal. No joins means simpler queries. Development is fast. One SELECT statement in dbt and you’re done.
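For contrast with the star-schema example, here is the OBT shape, again with made-up names. Dimension attributes are flattened into the fact rows, so the consuming query needs no joins, and a columnar engine would scan only the two columns it references.

```python
import sqlite3

# One Big Table: fact measures and dimension attributes in a single wide table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE obt_sales (
        order_id         INTEGER,
        amount           REAL,
        customer_name    TEXT,
        customer_region  TEXT,
        product_name     TEXT,
        product_category TEXT
        -- in practice, dozens or hundreds more columns
    );
    INSERT INTO obt_sales VALUES
        (1, 100.0, 'Acme',   'EMEA', 'Widget', 'Hardware'),
        (2,  75.0, 'Globex', 'APAC', 'Widget', 'Hardware');
    """
)
# The entire "model" is consumed with a single join-free SELECT.
rows = conn.execute(
    """
    SELECT customer_region, SUM(amount)
    FROM obt_sales
    GROUP BY customer_region
    ORDER BY customer_region
    """
).fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 100.0)]
```

Note the flip side visible even here: customer_region is baked into every row, so a change to a customer's region means rewriting rows rather than updating one dimension record.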

The trade-offs are real. There’s no structural mechanism to guarantee data consistency. If a customer’s address changes, every OBT has to be rebuilt. When the same dimension attribute is duplicated across multiple OBTs, there’s no way to know which one is correct. It’s fast when data is small and the domain is simple, but management complexity skyrockets at scale.

It works well for prototyping or single-domain analytical marts. It’s risky as the foundation for an enterprise DW.

Practical Decision Criteria

“Which one is best?” is the wrong question. Evaluate based on these criteria:

Team capability. To do Data Vault properly, you need someone on the team who knows the methodology. Without that, Kimball is the realistic choice. OBT has a low entry barrier, but as scale grows, an experienced modeler becomes even more critical.

Data complexity. Are there 3 source systems or 30? One domain or many? Higher complexity calls for Kimball’s Conformed Dimensions or Data Vault’s Hub structure.

Rate of change. If requirements shift frequently, Data Vault has the advantage. In a stable environment, Kimball is sufficient.

Regulatory requirements. If audit trails are legally required, consider Data Vault. Otherwise, it may be over-engineering.

Query patterns. If the focus is BI dashboards, Kimball’s structure pairs well with BI tools. If ad-hoc analysis is common, OBT’s simplicity becomes an advantage.

A pattern frequently seen in practice is a layered approach.

Raw (source ingestion) → Staging (cleansing) → Integration (unified model) → Mart (analytics)

A common combination is designing the Integration layer with Data Vault and serving the Mart as a star schema. Data Vault’s Hub-Link-Satellite structure excels at history tracking and flexible extension, while the star schema — with its central fact table surrounded by dimension tables — is optimized for analytics. Another traditional approach uses a normalized relational model (3NF) for the Integration layer with Kimball-style Marts. Either way, the key is separating the layers.
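One way to picture the Data-Vault-integration-plus-star-mart combination: the mart can be nothing more than a presentation view derived from the Hub/Satellite tables. A minimal sketch with illustrative names (real pipelines would typically materialize this via dbt models rather than a raw view):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Integration layer (Data Vault): raw history, flexible extension.
    CREATE TABLE hub_customer (customer_hk TEXT, customer_id TEXT);
    CREATE TABLE sat_customer (customer_hk TEXT, load_ts TEXT, region TEXT);
    INSERT INTO hub_customer VALUES ('hk1', 'C001');
    INSERT INTO sat_customer VALUES
        ('hk1', '2023-01-01', 'EMEA'),
        ('hk1', '2023-07-01', 'APAC');

    -- Mart layer: a star-schema dimension exposing only current values.
    CREATE VIEW dim_customer AS
    SELECT h.customer_id, s.region
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE s.load_ts = (
        SELECT MAX(load_ts) FROM sat_customer
        WHERE customer_hk = h.customer_hk
    );
    """
)
# Analysts query the mart; the Vault's join complexity stays behind the view.
rows = conn.execute("SELECT * FROM dim_customer").fetchall()
print(rows)  # [('C001', 'APAC')]
```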

The moment you try to handle both source preservation and analytical optimization in a single layer, complexity explodes.

Modeling Still Matters

What changed with the cloud is the criteria for “why we model this way” — not whether modeling is needed at all.

Storage is cheaper, so you can denormalize more aggressively. Join costs are lower, so you can build fewer aggregate tables. But consistency of business terminology, data lineage, and cross-domain integration — these problems don’t disappear when the infrastructure changes.

In fact, teams that start without modeling in the cloud tend to suffer more later. They build quickly with OBT, and it works at first. A year later, the same metric has different definitions across tables, and nobody knows which source is authoritative. Technical debt accumulates silently.

As tools and infrastructure improve, the center of gravity in modeling is shifting from “performance optimization” to “semantic consistency.” This is the same problem DataNexus is tackling with ontology. Without machine-readable business context, data loses its meaning — no matter how good the infrastructure underneath.