Data Lakehouse Architecture: Unifying Data Lakes and Warehouses

A Data Lakehouse combines the flexibility and scale of a data lake with the reliability and governance of a data warehouse, offering a unified platform for diverse data workloads. This architecture reduces data duplication and simplifies data management for analytics and machine learning.


Introduction

A Data Lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses, providing a single, unified platform for all data workloads. It enables organizations to store massive amounts of raw, semi-structured, and unstructured data at low cost, while also supporting reliable, high-performance analytics and machine learning with ACID transactions and strong data governance.

Configuration Checklist

Element	Version / Link
Language / Runtime	Python, Java/Scala, or SQL with an engine such as Apache Spark (implied for processing and querying; no versions specified)
Main library	An open table format such as Apache Iceberg, Delta Lake, or Apache Hudi (no library versions specified)
Required APIs	Cloud object storage APIs such as AWS S3 or Google Cloud Storage (implied)
Keys / credentials needed	Cloud provider credentials for object storage and governance tools (implied)

Step-by-Step Guide to Data Lakehouse Architecture

Step 1 — Establish Lakehouse Storage with Object Storage

The foundation of a data lakehouse is a single, unified storage layer, typically leveraging cheap, highly available, and durable object storage services like AWS S3 or Google Cloud Storage (GCS). This layer stores all data, including raw order events, payment records, support logs, and curated analytics tables, eliminating the need for redundant data copies across separate systems.

-- Example: Data landing in object storage (conceptual)
-- Raw data (logs, images, video, audio, text) is stored as-is.
-- Curated data (for example, tables backed by Parquet files) is written to the same store after processing.

-- Object storage is usually provisioned through the cloud provider's console or infrastructure as code (IaC), not through SQL.
-- Ingestion might use a tool such as Kafka for streaming, or ETL jobs for batch loads.
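
One way to keep raw and curated data apart inside a single bucket is a consistent key layout. The sketch below is only an illustration under assumed conventions: the `raw` and `curated` zone names, the `dt=` partition prefix, and the file names are hypothetical, and no cloud API is called.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical bucket and zone names, chosen only for illustration.
BUCKET = "my-lakehouse-bucket"
ZONES = {"raw", "curated"}


def object_uri(zone, dataset, event_date, filename):
    """Build an s3:// URI with a date-partitioned prefix for one object."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    key = PurePosixPath(zone, dataset, f"dt={event_date.isoformat()}", filename)
    return f"s3://{BUCKET}/{key}"


# Raw events and curated table files share one bucket but never one prefix.
raw_uri = object_uri("raw", "order_events", date(2024, 5, 1), "events-0001.json")
curated_uri = object_uri("curated", "orders", date(2024, 5, 1), "part-0000.parquet")
print(raw_uri)
print(curated_uri)
```

A fixed layout like this lets ingestion jobs and table formats agree on where data lives without copying it between systems.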

Step 2 — Implement Transactional Tables with Open Table Formats

To bring database-like reliability and features to the object storage, open table formats such as Apache Iceberg, Delta Lake, or Apache Hudi are used. These formats manage table metadata, snapshots, and commit history, enabling ACID transactions directly on the data files in object storage. This ensures data consistency, even during concurrent updates, and allows for schema evolution without rewriting historical data.
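
To make the idea of snapshots and commit history concrete, here is a toy in-memory model. It is not how Iceberg, Delta Lake, or Hudi are implemented; `ToyTable`, `Snapshot`, and the file paths are hypothetical names. It only illustrates the general pattern: a commit succeeds only when the writer started from the current head, and older snapshots remain readable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of file paths visible in this snapshot


@dataclass
class ToyTable:
    """Toy stand-in for table metadata: an append-only list of snapshots."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self):
        return self.snapshots[-1]

    def commit_append(self, base_id, new_files):
        """Commit only if nobody else committed since the writer read base_id."""
        head = self.current()
        if head.snapshot_id != base_id:
            raise RuntimeError("conflict: table changed, retry from the new head")
        snap = Snapshot(head.snapshot_id + 1, head.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, snapshot_id):
        """Time travel: list the files visible in an older snapshot."""
        return self.snapshots[snapshot_id].data_files


table = ToyTable()
base = table.current().snapshot_id                    # both writers read snapshot 0
table.commit_append(base, ["orders/a.parquet"])       # writer A commits first
try:
    table.commit_append(base, ["orders/b.parquet"])   # writer B is now stale
    conflict = False
except RuntimeError:
    conflict = True
    # Writer B retries against the new head instead of overwriting writer A.
    table.commit_append(table.current().snapshot_id, ["orders/b.parquet"])
```

Readers that pinned snapshot 1 keep seeing only `orders/a.parquet`, while new readers see both files. This is the consistency property the paragraph above describes.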

-- Example: Creating a table using an open table format (conceptual)
-- Typically run through a processing engine such as Spark or Flink; this uses Spark SQL syntax with Iceberg.
CREATE TABLE orders (
  order_id INT,
  price DECIMAL(10, 2),  -- exact decimal avoids FLOAT rounding on money values
  user_id STRING
) USING iceberg
LOCATION 's3://my-lakehouse-bucket/orders';

-- Example: Renaming a column (a metadata-only operation)
ALTER TABLE orders RENAME COLUMN user_id TO customer_id;
-- [Editor's note: Specific syntax depends on the chosen open table format and query engine.]
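
A rename can be metadata-only when the format identifies columns by a stable ID rather than by name, as Iceberg does. The following is a simplified sketch of that idea, not any format's real metadata layout; `ToySchema` and the sample row are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToySchema:
    """Maps stable field IDs to current column names (toy model)."""
    names_by_id: dict

    def rename(self, old, new):
        """Change only the name attached to a field ID; stored data is untouched."""
        for field_id, name in self.names_by_id.items():
            if name == old:
                self.names_by_id[field_id] = new
                return
        raise KeyError(old)

    def read_row(self, stored):
        """Project a stored row (keyed by field ID) onto current column names."""
        return {name: stored.get(fid) for fid, name in self.names_by_id.items()}


schema = ToySchema({1: "order_id", 2: "price", 3: "user_id"})
stored_row = {1: 1001, 2: 19.99, 3: "u-42"}   # what a data file holds, keyed by ID

before = schema.read_row(stored_row)
schema.rename("user_id", "customer_id")
after = schema.read_row(stored_row)
print(before)
print(after)
```

The stored row never changes. Only the name that readers see for field 3 does, which is why historical files do not need to be rewritten.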

Step 3 — Integrate a Shared Catalog for Metadata Management

A shared catalog is crucial for different query engines and tools to discover and interact with the transactional tables. It maps table names to their metadata, schema, and current version, acting as a single source of truth for all data consumers. This allows various tools, such as Trino for fast queries or Apache Spark for heavy-duty ingestion, to access the same consistent view of the data.

-- Example: Querying data via a shared catalog (conceptual)
-- Assumes the orders table also has order_time and revenue columns.
SELECT DATE(order_time) AS day,
       SUM(revenue) AS total_revenue
FROM orders
GROUP BY DATE(order_time)  -- repeat the expression; not every engine accepts a SELECT alias in GROUP BY
ORDER BY day;

-- [Editor's note: The catalog itself is usually a service (e.g., AWS Glue Data Catalog, Databricks Unity Catalog) that integrates with query engines.]
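
At its core, a catalog can be pictured as a guarded map from table names to the location of each table's current metadata. The toy class below illustrates that picture only; `ToyCatalog`, its methods, and the metadata paths are hypothetical and do not mirror the API of AWS Glue Data Catalog or Unity Catalog.

```python
import threading


class ToyCatalog:
    """Maps table names to the location of their current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def register(self, name, metadata_location):
        with self._lock:
            if name in self._pointers:
                raise ValueError(f"table already exists: {name}")
            self._pointers[name] = metadata_location

    def load(self, name):
        with self._lock:
            return self._pointers[name]

    def swap(self, name, expected, new):
        """Atomically move the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(name) != expected:
                return False
            self._pointers[name] = new
            return True


catalog = ToyCatalog()
v1 = "s3://my-lakehouse-bucket/orders/metadata/v1.json"
v2 = "s3://my-lakehouse-bucket/orders/metadata/v2.json"
v3 = "s3://my-lakehouse-bucket/orders/metadata/v3.json"
catalog.register("orders", v1)

committed = catalog.swap("orders", expected=v1, new=v2)      # ingestion engine commits
seen_by_query_engine = catalog.load("orders")                # query engine resolves the same name
stale_commit = catalog.swap("orders", expected=v1, new=v3)   # writer with an outdated view
```

Because every engine resolves `orders` through the same pointer, a successful commit is visible to all of them at once, and a writer holding an outdated view is rejected rather than silently overwriting newer data.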

Step 4 — Implement Robust Data Governance

As the data platform grows, robust data governance becomes essential to answer critical questions about data existence, lineage, and access control. Tools like AWS Lake Formation or Databricks Unity Catalog provide a central place to manage access policies, audit data usage, and enforce security rules, including locking down sensitive columns. This ensures that only authorized users and applications can access specific data, maintaining compliance and data security.

-- Example: Granting read access to a specific principal (conceptual)
-- Principal syntax (USER, ROLE, quoting) varies by engine.
GRANT SELECT ON TABLE orders TO finance_analyst;

-- Example: Masking sensitive payment fields (conceptual; syntax modeled on Snowflake masking policies)
CREATE MASKING POLICY mask_payment_info AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'finance_team' THEN val ELSE '****' END;
ALTER TABLE payments MODIFY COLUMN card_number SET MASKING POLICY mask_payment_info;

-- [Editor's note: Specific governance commands depend on the chosen governance tool and cloud provider.]
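
The combination of table grants, column masking, and auditing can be sketched in a few lines. This is a toy policy engine for illustration only: `ToyPolicy`, the `support_team` and `marketing_team` roles, and the sample row are assumptions, and real tools such as AWS Lake Formation enforce these rules inside the platform rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    """Toy access policy: table grants per role plus masked columns."""
    select_grants: dict    # table -> set of roles allowed to read it
    masked_columns: dict   # (table, column) -> roles that may see clear text
    audit_log: list = field(default_factory=list)

    def read(self, role, table, row):
        allowed = role in self.select_grants.get(table, set())
        # Every attempt is recorded, whether or not it succeeds.
        self.audit_log.append((role, table, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{role} may not read {table}")
        result = {}
        for column, value in row.items():
            clear_roles = self.masked_columns.get((table, column))
            if clear_roles is not None and role not in clear_roles:
                result[column] = "****"   # masked for roles outside the clear-text set
            else:
                result[column] = value
        return result


policy = ToyPolicy(
    select_grants={"payments": {"finance_team", "support_team"}},
    masked_columns={("payments", "card_number"): {"finance_team"}},
)
row = {"payment_id": 7, "card_number": "4111111111111111"}

finance_view = policy.read("finance_team", "payments", row)
support_view = policy.read("support_team", "payments", row)
try:
    policy.read("marketing_team", "payments", row)
    denied = False
except PermissionError:
    denied = True
```

Here `finance_view` contains the full card number, `support_view` shows `****` in its place, and the denied attempt still lands in `audit_log`.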

Comparison: Data Warehouse vs. Data Lake vs. Data Lakehouse

Feature / System	Data Warehouse	Data Lake	Data Lakehouse
Primary Use Case	Fast analytics, reporting	Raw data storage, ML training	Unified analytics, ML, streaming
Data Types	Structured, curated	Raw, semi-structured, unstructured	Structured, semi-structured, unstructured
ACID Transactions	Yes	No (limited)	Yes
Schema Enforcement	Schema-on-write	Schema-on-read	Flexible schema-on-read/write
Cost	Expensive (compute & storage tightly coupled)	Cheap (object storage)	Moderate (cheap storage, flexible compute)
Performance	Optimized for fast SQL queries	Slower for structured queries	Fast for diverse workloads
Data Duplication	High (data copied from lake)	Low (raw data stored once)	Low (single storage layer)
Engineering Effort	Lower (fully managed)	Higher (manual data management)	Higher (platform engineering for maintenance)

⚠️ Common Mistakes & Pitfalls

Data Type Inconsistencies Across Engines: Different query engines (e.g., Apache Spark, Trino) might interpret data types, especially timestamps, differently. This can lead to incorrect query results. Fix: Establish strict data type standards and thoroughly test core data types across all engines before allowing teams to build on top of them.
Performance Degradation from Tiny Files: As new data streams in, object storage can accumulate thousands of small files, which significantly slows down query performance. Fix: Schedule regular background jobs to periodically merge these tiny files into larger, more efficient files (e.g., using compaction processes provided by open table formats).
Assuming the Lakehouse Is Fully Managed: A data lakehouse provides flexibility and scale, but it is not a fully managed database. Teams take on new platform responsibilities for maintenance and optimization. Fix: Allocate dedicated platform engineering resources to manage the lakehouse infrastructure, including file compaction, schema evolution, and performance tuning.
Breaking Downstream Consumers with Schema Changes: Because the data layer is deeply shared, a poorly managed schema update can simultaneously break multiple downstream consumers, such as BI dashboards and machine learning pipelines. Fix: Implement robust schema evolution practices, versioning, and thorough testing of schema changes. Leverage the metadata management capabilities of open table formats to perform schema changes as metadata operations rather than data rewrites.
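
The tiny-files fix above comes down to deciding which small files to merge together. Below is a minimal planning sketch using first-fit decreasing bin packing; the function name, thresholds, and file names are hypothetical. In practice the table format's own compaction (for example Delta Lake's OPTIMIZE or Iceberg's rewrite_data_files procedure) would do this work, including the actual rewrite and commit.

```python
def plan_compaction(file_sizes, target_bytes, small_bytes):
    """Group small files into batches whose total size stays within target_bytes.

    Assumes small_bytes <= target_bytes, so any single small file fits in a batch.
    """
    small = sorted(
        (item for item in file_sizes.items() if item[1] < small_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    batches = []  # each batch is [remaining_capacity, [paths]]
    for path, size in small:
        for batch in batches:
            if batch[0] >= size:   # first batch with enough room
                batch[0] -= size
                batch[1].append(path)
                break
        else:
            batches.append([target_bytes - size, [path]])
    # Rewriting a batch of one file gains nothing, so drop those.
    return [paths for _, paths in batches if len(paths) > 1]


MB = 1024 * 1024
files = {
    "part-0.parquet": 200 * MB,  # already large, left alone
    "part-1.parquet": 40 * MB,
    "part-2.parquet": 30 * MB,
    "part-3.parquet": 50 * MB,
    "part-4.parquet": 20 * MB,
}
plan = plan_compaction(files, target_bytes=128 * MB, small_bytes=64 * MB)
print(plan)
```

With these sizes, the three files of 50, 40, and 30 MB are merged into one batch of 120 MB, while the 20 MB file waits for a later run because it no longer fits under the 128 MB target.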

Glossary

Data Lakehouse: A modern data architecture that combines the low-cost storage and flexibility of a data lake with the data management and ACID transaction capabilities of a data warehouse.
Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale, typically in its raw format.
Data Warehouse: A system used for reporting and data analysis that stores structured, curated data. It is considered a core component of business intelligence.
ACID Transactions: A set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee reliable processing of database transactions.
Object Storage: A data storage architecture that manages data as objects, offering high scalability, durability, and cost-effectiveness for large volumes of unstructured data.
Open Table Format: A specification (like Apache Iceberg, Delta Lake, Apache Hudi) that defines how data files in object storage are organized into tables, enabling transactional capabilities and schema evolution.
Catalog: A metadata store that tracks information about data assets, including table schemas, locations, and versions, allowing various tools to discover and access data.
Governance: The overall management of the availability, usability, integrity, and security of data in an enterprise.

Key Takeaways

Data Lakehouses unify data lakes and data warehouses into a single, shared data layer, reducing duplication and simplifying architecture.
They leverage cheap object storage for massive scale while providing transactional reliability through open table formats like Iceberg, Delta Lake, or Hudi.
Open table formats enable ACID transactions, schema evolution, and time travel capabilities directly on object storage files.
A shared catalog is essential for providing a single source of truth, allowing diverse query engines and tools to access consistent data views.
Robust data governance, often managed by specialized tools, is crucial for controlling access, ensuring data lineage, and maintaining security at scale.
While offering flexibility and scale, data lakehouses require dedicated platform engineering effort for maintenance, optimization (e.g., file compaction), and managing data type standards.
The choice between a Data Warehouse, Data Lake, or Data Lakehouse depends on your team size, specific workloads, and the trade-off between cost, performance, and engineering effort.
