What is a data lakehouse?

Organizations everywhere are searching for storage solutions to manage the volume, latency, resiliency, and data access requirements of big data. Traditional siloed approaches, which maintain separate data lakes and warehouses, often result in high costs, data duplication, and inconsistent insights.

The data lakehouse has emerged as a new hybrid architecture that combines the low-cost, flexible storage of a data lake with the performance, structure, and data management features of a warehouse. By unifying siloed data, a lakehouse provides a single platform for business intelligence (BI), predictive analytics, and generative AI workflows.

Google Cloud provides an open, enterprise-ready, and AI-native cross-cloud data lakehouse designed to help you go from data to AI to action faster.

Data lakehouse defined

A data lakehouse is a modern data architecture that creates a single platform by combining the raw data storage capabilities of data lakes with the organized structure of data warehouses. It enables organizations to use low-cost storage for all types of data (structured, unstructured, and semi-structured) while providing essential management functions like ACID transactions and schema enforcement.
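
To make "ACID transactions and schema enforcement" more concrete, the following is a minimal in-memory sketch in Python. The Table class, its append_batch method, and the example schema are hypothetical names invented for illustration; real lakehouse table formats provide these guarantees through transaction logs and metadata files rather than Python lists.

```python
from typing import Any


class SchemaError(ValueError):
    """Raised when a row does not match the table's declared schema."""


class Table:
    """Toy table (hypothetical): a declared schema plus committed rows."""

    def __init__(self, schema: dict[str, type]) -> None:
        self.schema = schema
        self.rows: list[dict[str, Any]] = []

    def _validate(self, row: dict[str, Any]) -> None:
        # Schema enforcement: column names and value types must match.
        if set(row) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"column {column!r} expects {expected.__name__}")

    def append_batch(self, batch: list[dict[str, Any]]) -> None:
        # Atomicity: validate the whole batch first, then commit in one step,
        # so a bad row never leaves a partially written batch behind.
        for row in batch:
            self._validate(row)
        self.rows.extend(batch)


orders = Table({"order_id": int, "amount": float})
orders.append_batch([{"order_id": 1, "amount": 9.5}])

rejected = None
try:
    orders.append_batch([
        {"order_id": 2, "amount": 20.0},
        {"order_id": 3, "amount": "oops"},  # wrong type: whole batch is rejected
    ])
except SchemaError as err:
    rejected = str(err)

committed_ids = [row["order_id"] for row in orders.rows]
```

After the failed batch, only the first order remains committed: the valid row with order_id 2 is not stored, because it arrived in the same batch as an invalid row.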

Historically, data lakes and data warehouses were kept separate to avoid overloading systems, which required data to be constantly shifted between repositories. The lakehouse architecture breaks down these silos, eliminating issues around data freshness, duplication, and high engineering overhead.

How does a data lakehouse work?

A data lakehouse uses the low-cost cloud object storage of data lakes to provide on-demand, scalable storage for massive volumes of data in its raw form. It then integrates metadata layers over this store to provide warehouse-like performance and optimization.

The architecture consists of three core layers:

  • Storage layer: A low-cost object store for all raw datasets, decoupled from compute resources to allow independent scaling
  • Staging layer: A metadata layer that provides a detailed catalog, applying management features such as indexing, caching, and access control
  • Semantic layer: The user-facing layer where client apps, analytics tools, and data scientists access data for experimentation and BI presentation
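
The three layers above can be sketched as a small Python model. This is only an illustration under stated assumptions: ObjectStore, Catalog, row_count, the file paths, and the user names are all hypothetical, and the metadata layer here covers just cataloging and access control, not indexing or caching.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Storage layer: raw objects addressed by path, independent of compute."""
    objects: dict[str, bytes] = field(default_factory=dict)

    def put(self, path: str, data: bytes) -> None:
        self.objects[path] = data

    def get(self, path: str) -> bytes:
        return self.objects[path]


@dataclass
class Catalog:
    """Staging (metadata) layer: maps table names to files and checks access."""
    store: ObjectStore
    tables: dict[str, list[str]] = field(default_factory=dict)
    readers: dict[str, set[str]] = field(default_factory=dict)

    def register(self, table: str, paths: list[str], allowed: set[str]) -> None:
        self.tables[table] = paths
        self.readers[table] = allowed

    def read(self, table: str, user: str) -> list[bytes]:
        if user not in self.readers.get(table, set()):
            raise PermissionError(f"{user} may not read {table}")
        return [self.store.get(path) for path in self.tables[table]]


def row_count(catalog: Catalog, table: str, user: str) -> int:
    """Semantic layer: a client tool that asks for tables, never raw paths."""
    files = catalog.read(table, user)
    return sum(len(data.decode().splitlines()) for data in files)


store = ObjectStore()
store.put("raw/events/part-0.csv", b"1,click\n2,view\n")
store.put("raw/events/part-1.csv", b"3,click\n")

catalog = Catalog(store)
catalog.register(
    "events",
    ["raw/events/part-0.csv", "raw/events/part-1.csv"],
    allowed={"analyst"},
)

total = row_count(catalog, "events", "analyst")
```

The design point is that clients address a table name through the catalog, so files can be added, moved, or access-controlled without changing the tools that sit on top.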

Driving unique value for data scientists

For data scientists, the data lakehouse architecture is a critical enabler for AI data analytics and machine learning.

  • High scalability: Decoupled compute and storage provide nearly limitless and instantaneous scalability for training large-scale models
  • Support for diverse workloads: Data scientists can connect AI frameworks, SQL engines, and exploratory tools directly to the same repository
  • Improved data quality: Enforced schemas and data integrity ensure that ML models are trained on trusted, consistent, and fresh data
  • Unified governance: Centralized management makes it easier to implement security controls across datasets used for both BI and AI

Building an open lakehouse on Google Cloud

Google Cloud’s approach focuses on an open, managed, and high-performance architecture that leverages the best of open-source standards and serverless technology. 

Open standards with Apache Iceberg and BigLake

Apache Iceberg is changing lakehouses by bringing warehouse capabilities like time travel and schema evolution directly to data lakes. Google Cloud Lakehouse enables enterprise storage, governance, and performance to build scalable analytical, operational, and real-time AI use cases on a unified, cross-cloud, and multimodal open lakehouse. This allows you to leverage open-source engines directly on Cloud Storage while avoiding vendor lock-in.
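
As a rough illustration of what time travel and schema evolution mean, here is a toy snapshot-based table in Python. It is not the Apache Iceberg API: SnapshotTable and its methods are hypothetical. Iceberg tracks snapshots through metadata and manifest files and treats adding a column as a metadata-only change, whereas this sketch copies rows in memory for simplicity.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: its columns and rows at one commit."""
    snapshot_id: int
    columns: tuple[str, ...]
    rows: tuple[tuple, ...]


class SnapshotTable:
    """Toy table (hypothetical) where every change creates a new snapshot."""

    def __init__(self, columns: tuple[str, ...]) -> None:
        self.snapshots = [Snapshot(0, columns, ())]

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def _commit(self, columns: tuple[str, ...], rows: tuple[tuple, ...]) -> int:
        snapshot = Snapshot(len(self.snapshots), columns, rows)
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def append(self, row: tuple) -> int:
        if len(row) != len(self.current.columns):
            raise ValueError("row width does not match the current schema")
        return self._commit(self.current.columns, self.current.rows + (row,))

    def add_column(self, name: str, default: Any = None) -> int:
        # Schema evolution: the new snapshot gains a column; older snapshots
        # keep their original schema and stay readable.
        rows = tuple(row + (default,) for row in self.current.rows)
        return self._commit(self.current.columns + (name,), rows)

    def read(self, snapshot_id: Optional[int] = None) -> list[dict]:
        # Time travel: read the latest snapshot, or any earlier one by id.
        snap = self.current if snapshot_id is None else self.snapshots[snapshot_id]
        return [dict(zip(snap.columns, row)) for row in snap.rows]


table = SnapshotTable(("id", "name"))
v1 = table.append((1, "ada"))
v2 = table.add_column("country")
v3 = table.append((2, "lin", "SG"))

latest = table.read()
as_of_v1 = table.read(v1)
```

Reading as of v1 returns the single two-column row that existed at that commit, while the latest read returns both rows with the added country column.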

Cross-cloud lakehouse

Achieve high-speed, low-latency data access regardless of location. Cross-cloud catalog federation unifies discovery and analysis across diverse ecosystems.

Autonomous AI with BigQuery

BigQuery is Google’s serverless, autonomous data-to-AI platform. It automates the entire data lifecycle and can directly query Iceberg tables in Cloud Storage, allowing users to leverage powerful SQL analytics on managed data without the need for data movement.

Enterprise-grade governance with Knowledge Catalog

Knowledge Catalog provides unified governance and AI-powered metadata management across your entire lakehouse. It ensures consistent semantics for both data analysts and AI agents, breaking down silos between business and technical metadata.

High-performance Spark for data science

With Managed Service for Apache Spark, data engineers and scientists can develop applications in familiar tools like BigQuery Studio notebooks. You can submit jobs with a single command without the need to create, configure, or manage clusters.

Data lakehouse versus data lake versus data warehouse

Feature        Data warehouse     Data lake                     Data lakehouse
Data types     Structured         Unstructured and structured   Structured, semi-structured, unstructured
Primary use    BI and reporting   Big data and ML               BI, data science, and AI
Optimization   Schema-on-write    Schema-on-read                Multi-layered metadata
Cost           High               Low                           Low (object storage)
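
The difference between schema-on-write and schema-on-read in the Optimization row can be sketched in a few lines of Python. The schema, the sample records, and the helper names below are hypothetical, and the sketch covers only the warehouse and lake columns; the lakehouse's metadata layers are not modeled here.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"user_id": int, "event": str}


def conforms(record: dict) -> bool:
    """True when the record has exactly the schema's columns and types."""
    return set(record) == set(SCHEMA) and all(
        isinstance(record[name], kind) for name, kind in SCHEMA.items()
    )


raw_lines = [
    '{"user_id": 1, "event": "click"}',
    '{"user_id": "2", "event": "view"}',  # user_id arrives as a string
    '{"user_id": 3, "event": "click", "extra": true}',  # unexpected column
]

# Schema-on-write: validate before storing, so nonconforming records
# are turned away at load time.
warehouse, rejected = [], []
for line in raw_lines:
    record = json.loads(line)
    (warehouse if conforms(record) else rejected).append(record)

# Schema-on-read: store everything as-is and interpret it at query time.
lake = list(raw_lines)


def read_events(lines):
    """Apply the schema while reading, coercing types where possible."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {"user_id": int(record["user_id"]), "event": str(record["event"])}
        except (KeyError, TypeError, ValueError):
            continue  # this reader skips records it cannot interpret


lake_rows = list(read_events(lake))
```

Here the write-time check keeps one record and rejects two, while the read-time approach keeps all three raw lines and leaves it to each reader to decide how to interpret them.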

Solve your business challenges with Google Cloud

New customers get $300 in free credits to spend on Google Cloud.
Talk to a Google Cloud sales specialist to discuss your unique challenge in more detail.
