Data Architecture
Strategic guidance for designing modern data platforms, covering storage paradigms (data lake, warehouse, lakehouse), modeling approaches (dimensional, normalized, data vault, wide tables), data mesh principles, and medallion architecture patterns.
When to Use
Use when:
- Designing a new data platform or modernizing legacy systems
- Choosing between data lake, data warehouse, or data lakehouse
- Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)
- Evaluating centralized vs data mesh architecture
- Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)
- Designing medallion architecture (bronze, silver, gold layers)
- Implementing data governance and cataloging
Key Features
Storage Paradigms
Three primary patterns for analytical data storage:
- Data Lake: Centralized repository for raw data at scale (schema-on-read, cost-optimized)
- Data Warehouse: Structured repository optimized for BI (schema-on-write, ACID, fast queries)
- Data Lakehouse: Hybrid combining lake flexibility with warehouse reliability (commonly cited at 60-80% cost savings versus a traditional warehouse)
Decision Framework:
- BI/Reporting only + Known queries → Data Warehouse
- ML/AI primary + Raw data needed → Data Lake or Lakehouse
- Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)
Data Modeling Approaches
- Dimensional (Kimball): Star/snowflake schemas for BI dashboards (see the star-schema sketch after this list)
- Normalized (3NF): Eliminate redundancy for transactional systems
- Data Vault 2.0: Flexible model with complete audit trail for compliance
- Wide Tables: Denormalized, optimized for columnar storage and ML
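To make the dimensional approach concrete, here is a minimal star-schema sketch: one fact table keyed to a conformed dimension, plus the kind of aggregate query it is built to serve. Table and column names (dim_customer, fct_sales, customer_key) are illustrative assumptions, and the types follow the Spark SQL style used in the Quick Start below.
-- Minimal star schema: one fact table joined to a conformed dimension
CREATE TABLE dim_customer (
    customer_key  BIGINT,   -- surrogate key
    customer_id   STRING,   -- natural/business key from the source system
    customer_name STRING,
    region        STRING
);

CREATE TABLE fct_sales (
    order_id     BIGINT,
    customer_key BIGINT,    -- references dim_customer.customer_key
    order_date   DATE,
    amount       DECIMAL(10,2)
);

-- Typical BI query: aggregate the fact, slice by a dimension attribute
SELECT d.region, SUM(f.amount) AS total_sales
FROM fct_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key
GROUP BY d.region;
The surrogate key keeps the fact table narrow and stable even when source-system identifiers change, which is the core trade the Kimball style makes in exchange for denormalized dimensions.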
Medallion Architecture
Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level); a bronze-to-silver example follows the layer summary below
- Bronze Layer: Exact copy of source data, immutable, append-only
- Silver Layer: Validated, deduplicated, typed data
- Gold Layer: Business logic, aggregates, dimensional models, ML features
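A minimal sketch of the bronze-to-silver hop, assuming Spark SQL, an Iceberg target, and hypothetical bronze.raw_sales columns (including an ingested_at load timestamp). The point is the pattern, not the exact DDL: bronze stays untouched, and typing, deduplication, and basic validation happen on the way into silver.
-- Silver: typed, deduplicated copy of the immutable bronze feed
CREATE TABLE silver.sales
USING iceberg AS
SELECT
    CAST(order_id   AS BIGINT)        AS order_id,
    CAST(amount     AS DECIMAL(10,2)) AS amount,
    CAST(order_date AS DATE)          AS order_date
FROM (
    -- keep only the latest record per order_id from the raw feed
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingested_at DESC) AS rn
    FROM bronze.raw_sales
) deduped
WHERE rn = 1
  AND order_id IS NOT NULL;
Gold-layer models (aggregates, dimensional marts, feature tables) then read from silver rather than from the raw bronze data.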
Quick Start
Apache Iceberg Table (Recommended)
-- Create a lakehouse table with ACID guarantees (Spark SQL, Iceberg runtime)
CREATE TABLE catalog.db.sales (
    order_id   BIGINT,
    amount     DECIMAL(10,2),
    order_date DATE
)
USING iceberg
PARTITIONED BY (days(order_date));

-- Time travel: query the table as it existed at a point in time
SELECT * FROM catalog.db.sales
TIMESTAMP AS OF '2025-01-01';
dbt Transformation (Gold Layer)
-- models/marts/fct_sales.sql
WITH source AS (
    SELECT * FROM {{ source('silver', 'sales') }}
),

cleaned AS (
    SELECT
        order_id,
        customer_id,
        UPPER(customer_name) AS customer_name,
        amount
    FROM source
    WHERE order_id IS NOT NULL
)

SELECT * FROM cleaned
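In practice the gold layer is often materialized incrementally instead of being fully rebuilt on every run. A sketch of that variant is below, assuming the same silver.sales source and an order_date column to bound new rows; the names and cutoff logic are illustrative.
-- models/marts/fct_sales_incremental.sql (illustrative variant)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    amount,
    order_date
FROM {{ source('silver', 'sales') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already loaded
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
The unique_key tells dbt to merge (or delete+insert, depending on the adapter) rather than append, so late-arriving updates to an order replace the earlier row.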
Data Mesh Readiness Assessment
Score each of the following six factors from 1 (low) to 5 (high):
- Domain clarity
- Team maturity
- Platform capability
- Governance maturity
- Scale need
- Organizational buy-in
Scoring (sum of all six factors): 24-30: Strong mesh candidate | 18-23: Hybrid approach | 12-17: Build foundation first | 6-11: Stay centralized (see the worked example below)
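A toy worked example of the rubric, written as plain SQL only because that matches the rest of this page; in practice this is a spreadsheet or workshop exercise, and the individual scores below are hypothetical.
-- Hypothetical self-assessment: six factors scored 1-5, summed, then banded
WITH scores AS (
    SELECT 4 AS score   -- domain clarity
    UNION ALL SELECT 3  -- team maturity
    UNION ALL SELECT 3  -- platform capability
    UNION ALL SELECT 2  -- governance maturity
    UNION ALL SELECT 4  -- scale need
    UNION ALL SELECT 3  -- organizational buy-in
)
SELECT
    SUM(score) AS total,    -- 19 for these inputs
    CASE
        WHEN SUM(score) >= 24 THEN 'Strong data mesh candidate'
        WHEN SUM(score) >= 18 THEN 'Hybrid approach'
        WHEN SUM(score) >= 12 THEN 'Build the foundation first'
        ELSE 'Stay centralized'
    END AS recommendation   -- 'Hybrid approach' here
FROM scores;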
Tool Recommendations (2025)
Research-Validated (Context7):
- dbt: Score 87.0, 3,532+ code snippets - SQL-based transformations, industry standard
- Apache Iceberg: Score 79.7, 832+ code snippets - Open table format, vendor-neutral
By Organization Size (headcount):
- Startup (<50): BigQuery + Airbyte + dbt + Metabase (<$1K/month)
- Growth (50-500): Snowflake + Fivetran + dbt + Airflow + Tableau ($10K-50K/month)
- Enterprise (>500): Snowflake + Databricks + Fivetran + Kafka + dbt + Airflow + Alation ($50K-500K/month)
Decision Frameworks
Storage Paradigm Selection
Step 1: Identify Primary Use Case
- BI/Reporting only → Data Warehouse
- ML/AI primary → Data Lake or Lakehouse
- Mixed BI + ML → Data Lakehouse
- Exploratory → Data Lake
Step 2: Evaluate Budget
- High budget, known queries → Data Warehouse
- Cost-sensitive, flexible → Data Lakehouse
Open Table Format Selection
- Multi-engine flexibility → Apache Iceberg (recommended)
- Databricks ecosystem → Delta Lake
- Frequent upserts/CDC → Apache Hudi
Best Practices
- Start simple: Avoid over-engineering; begin with warehouse or basic lakehouse
- Invest in governance early: Catalog, lineage, quality from day one
- Medallion architecture: Use bronze-silver-gold for clear quality layers
- Open table formats: Prefer Iceberg or Delta Lake to avoid vendor lock-in
- Assess mesh readiness: Don't decentralize prematurely (<500 people)
- Automate quality: Integrate tests (Great Expectations, dbt tests) into CI/CD (see the example after this list)
- Document as code: Use dbt docs, DataHub, YAML for self-service
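To make the automated-quality point concrete: in dbt, beyond the generic tests declared in YAML, a singular test is just a SQL file under tests/ that selects rows violating a rule, and the build fails if any rows come back. The model and column names below are assumptions carried over from the Quick Start.
-- tests/assert_fct_sales_amount_positive.sql
-- dbt treats every returned row as a test failure
SELECT
    order_id,
    amount
FROM {{ ref('fct_sales') }}
WHERE amount IS NULL
   OR amount <= 0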
Related Skills
- Streaming Data - Real-time data pipelines with Kafka, Flink
- Data Transformation - dbt and Spark transformations
- SQL Optimization - Query performance tuning