What is Apache Iceberg?

Last Updated: 05/01/2026

Apache Iceberg is an open source table format designed for large-scale analytical datasets stored in data lakes. Think of it as an intelligent layer that sits on top of your data lake storage, such as Cloud Storage, providing database-like capabilities for your massive datasets. Instead of simply tracking individual files, Iceberg manages tables as collections of data files, which improves reliability, performance, and flexibility and enables features like schema evolution, time travel, and more efficient query planning. This allows data analysts, data scientists, and engineers to work with data in data lakes with greater ease and efficiency, and to scale their analytical workloads.

What is a transactional data lake?

A transactional data lake not only stores data at scale but also supports transactional operations to ensure data is accurate and consistent. Iceberg tables provide the four properties that make this possible, collectively known as ACID:

Atomicity: Guarantees that each transaction is treated as a single unit that either succeeds or fails completely, with no partial state
Consistency: Ensures that all written data is valid according to the defined rules of the data lake
Isolation: Allows multiple transactions to occur simultaneously without interfering with one another
Durability: Ensures that data is not lost or corrupted once a transaction is committed, even in the event of a system failure
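
One way to picture atomicity and isolation is as a compare-and-swap on a single pointer to the table's current metadata: a commit either moves the pointer or leaves it alone. The following is a minimal in-memory sketch under that assumption, not Iceberg's actual implementation. ToyCatalog, CommitConflict, and the metadata file names are hypothetical names chosen for illustration.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata file."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: the pointer moves only if nobody else moved it first.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: expected {expected!r}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.commit("sales", None, "v1.metadata.json")

# Two writers start from the same base version.
base = catalog.current("sales")
catalog.commit("sales", base, "v2-writer-a.metadata.json")  # writer A wins

try:
    catalog.commit("sales", base, "v2-writer-b.metadata.json")  # writer B is stale
    writer_b_committed = True
except CommitConflict:
    writer_b_committed = False  # writer B would re-read the table and retry

print(catalog.current("sales"), writer_b_committed)
```

In this sketch a reader always sees either the old pointer or the new one, never a half-applied change, and the losing writer learns about the conflict instead of silently overwriting the winner.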

Common use cases for Iceberg tables

Iceberg tables are suited for a variety of modern data lake and lakehouse use cases, including:

Compliance and privacy: Ideal for data lakes that require frequent deletes to comply with data privacy laws
Record-level updates: Enables updates to individual records without republishing entire datasets, such as sales data that changes due to customer returns
Managing unpredictable changes: Supports Slowly Changing Dimension (SCD) tables, such as customer records where contact information may change at unknown intervals
Time travel and auditing: Maintains a history of table snapshots, allowing users to query historical versions for trend analysis or to roll back and correct issues
Machine learning: Provides consistent and versioned datasets crucial for training reliable models

Who uses Iceberg tables?

Various technical personas leverage Iceberg tables to manage large datasets efficiently:

Data engineers and administrators: Use Iceberg tables to design and build scalable, reliable storage systems
Data analysts and scientists: Use Iceberg tables to analyze massive datasets with the familiarity of SQL and reproducible historical snapshots

Key benefits of Iceberg tables

SQL familiarity

Allows users familiar with standard SQL to perform complex data lake operations without learning a new language.

Schema evolution

Enables seamless changes to data structures (adding, renaming, or removing columns) without disrupting queries.
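
Iceberg identifies columns by a unique ID rather than by name, which is what lets a rename or an added column leave existing data files untouched. The following toy sketch illustrates that idea only. The row layout, the schema dictionaries, and the read helper are hypothetical and do not reflect Iceberg's file formats or APIs.

```python
# Data files written under the original schema; values are keyed by column ID, not name.
data_file_rows = [
    {1: 101, 2: "ada@example.com"},
    {1: 102, 2: "lin@example.com"},
]

schema_v1 = {1: "customer_id", 2: "email"}

# Evolve the schema: rename column 2 and add column 3. No data file is rewritten.
schema_v2 = {1: "customer_id", 2: "contact_email", 3: "phone"}


def read(rows, schema):
    """Project stored rows through a schema; columns missing from a file read as None."""
    return [{name: row.get(col_id) for col_id, name in schema.items()} for row in rows]


old_view = read(data_file_rows, schema_v1)
new_view = read(data_file_rows, schema_v2)
print(new_view[0])
```

Because the stored values are looked up by ID, the renamed column still resolves to the same data, and the new column simply reads as empty for files written before it existed.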

Incremental processing

Supports Change Data Capture (CDC), allowing users to process only the data that has changed since the last run to improve efficiency.
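
As a rough illustration of processing only what has changed, the sketch below keeps a list of snapshots and returns the files added after the last snapshot a job already handled. It is a simplification under stated assumptions: the Snapshot class, the file names, and the use of small increasing integers as snapshot IDs are all hypothetical, and real snapshot lineage is tracked differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    added_files: tuple  # data files appended by this snapshot


# Hypothetical append-only history, oldest first.
history = [
    Snapshot(1, ("day1-a.parquet", "day1-b.parquet")),
    Snapshot(2, ("day2-a.parquet",)),
    Snapshot(3, ("day3-a.parquet", "day3-b.parquet")),
]


def files_added_since(history, last_processed_id):
    """Return files appended by snapshots newer than last_processed_id."""
    new_files = []
    for snap in history:
        if snap.snapshot_id > last_processed_id:
            new_files.extend(snap.added_files)
    return new_files


todo = files_added_since(history, last_processed_id=1)
print(todo)
```

A job that remembers the last snapshot it processed can skip everything from earlier runs and read only the newly added files.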

Performance optimization

Uses file-level metadata, such as partition values and column value ranges, to skip files that cannot match a query's filters, accelerating query execution through techniques like predicate pushdown.
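
To make file pruning concrete, the sketch below compares a query's date range against per-file minimum and maximum values and keeps only the files whose ranges overlap it. This is an illustrative model, not a query engine's planner. DataFileStats, the order_date column, and the file names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    path: str
    min_order_date: str  # ISO dates compare correctly as strings
    max_order_date: str


manifest = [
    DataFileStats("f1.parquet", "2024-01-01", "2024-01-31"),
    DataFileStats("f2.parquet", "2024-02-01", "2024-02-29"),
    DataFileStats("f3.parquet", "2024-03-01", "2024-03-31"),
]


def prune(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in manifest if f.max_order_date >= lo and f.min_order_date <= hi]


# Equivalent in spirit to: WHERE order_date BETWEEN '2024-02-10' AND '2024-03-05'
to_scan = prune(manifest, "2024-02-10", "2024-03-05")
print(to_scan)
```

The January file is never opened because its recorded range cannot contain a matching row, and the decision is made from metadata alone.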

Cross-platform interoperability

Compatible with various engines like Spark, Flink, Hive, and Presto.

How do Apache Iceberg tables work?

Apache Iceberg introduces a metadata layer that sits above the actual data files in your data lake. This metadata tracks the structure and content of your tables in a more organized and robust way than traditional file-based systems. Here's a breakdown of its key mechanisms:

Metadata management: Iceberg maintains metadata files that describe the table's schema, partitions, and the locations of the data files. These metadata files are typically stored in the data lake alongside the data.
Catalog: Iceberg relies on a catalog to keep track of the location of the current metadata for each table. This catalog can be a service like the Hive Metastore, a file system-based implementation, or a cloud-native catalog service.
Table snapshots: Every time a change is made to the table (for example, adding data, deleting data, or schema evolution), Iceberg creates a new snapshot of the table's metadata. These snapshots are immutable and provide a historical record of the table's state.
Manifest lists and manifest files: Each snapshot points to a manifest list, which in turn lists one or more manifest files. Manifest files contain metadata about individual data files, including their location, partition values, and statistics (like row counts and value ranges).
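
The relationship between snapshots, manifest lists, and manifest files can be sketched with a few immutable records. The model below is a deliberately small approximation: ToyTable, its append and files methods, and the sequential snapshot IDs are hypothetical, and real manifests carry far more detail (partition values, statistics, and so on).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    data_files: tuple  # paths of the data files this manifest tracks


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # the manifests that make up this version of the table


@dataclass
class ToyTable:
    snapshots: dict = field(default_factory=dict)
    current_id: int = 0

    def append(self, *paths):
        """Commit new files as a new snapshot; earlier snapshots are left untouched."""
        previous = self.snapshots.get(self.current_id)
        inherited = previous.manifest_list if previous else ()
        new_id = self.current_id + 1
        self.snapshots[new_id] = Snapshot(new_id, inherited + (Manifest(paths),))
        self.current_id = new_id

    def files(self, snapshot_id=None):
        """List data files as of a snapshot (defaults to the current one)."""
        snap = self.snapshots[snapshot_id or self.current_id]
        return [p for m in snap.manifest_list for p in m.data_files]


table = ToyTable()
table.append("a.parquet")
table.append("b.parquet", "c.parquet")

print(table.files())               # current state
print(table.files(snapshot_id=1))  # time travel to the first snapshot
```

Each new snapshot reuses the manifests it inherits and adds one of its own, so older snapshots stay readable and querying a past version is just a matter of starting from a different snapshot.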

Apache Iceberg architecture

The architecture of Apache Iceberg involves several key components working together:

Data lake storage: This is the underlying storage layer, such as Cloud Storage, where the actual data files (in formats like Parquet, ORC, or Avro) and Iceberg's metadata files are stored.
Iceberg catalog: This component manages the metadata pointers for Iceberg tables. It acts as a central registry that tracks the current version of each table's metadata, and it can be exposed to engines through the Iceberg REST catalog API. Common catalog implementations include:
Hive Metastore: A widely used metadata repository, often employed with Hadoop-based systems.
File system catalog: A simple implementation where the catalog information is stored directly in the data lake file system.
Cloud-native catalog services: Managed services offered by cloud providers for storing and managing metadata.
Iceberg metadata: This consists of several layers of metadata files that track the table's structure and data:
Table metadata file: This file contains high-level information about the table, such as its schema and partitioning specification, and identifies the current snapshot, which points to that snapshot's manifest list.
Manifest list: This file lists the manifest files that contain metadata about the data files in a specific snapshot of the table.
Manifest files: These files contain detailed information about individual data files, including their location, partition values, and statistics.
Query engines and processing frameworks: These are the tools that interact with Iceberg tables to read and write data. These engines leverage Iceberg's metadata to optimize query planning and execution.
Compute resources: These are the underlying infrastructure (for example, virtual machines and containers) that run the query engines and processing frameworks.
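
The way these components fit together on the read path can be sketched as a chain of lookups: the catalog names the current table metadata file, which leads to a manifest list, then to manifests, then to data files. The example below is a toy model under several assumptions: the storage and catalog dictionaries, the paths, the field names, and the use of JSON for every file are hypothetical and chosen for brevity.

```python
import json

# Hypothetical object store: path -> file contents. Real metadata files use
# other layouts and formats; JSON is used everywhere here for brevity.
storage = {
    "meta/v2.metadata.json": json.dumps(
        {"schema": ["order_id", "amount"], "current_manifest_list": "meta/snap-2.json"}
    ),
    "meta/snap-2.json": json.dumps(["meta/manifest-a.json", "meta/manifest-b.json"]),
    "meta/manifest-a.json": json.dumps([{"path": "data/f1.parquet", "rows": 100}]),
    "meta/manifest-b.json": json.dumps([{"path": "data/f2.parquet", "rows": 250}]),
}

# Hypothetical catalog: table name -> current table metadata file.
catalog = {"sales.orders": "meta/v2.metadata.json"}


def plan_scan(table_name):
    """Walk catalog -> table metadata -> manifest list -> manifests -> data files."""
    metadata = json.loads(storage[catalog[table_name]])
    manifest_paths = json.loads(storage[metadata["current_manifest_list"]])
    data_files = []
    for manifest_path in manifest_paths:
        data_files.extend(json.loads(storage[manifest_path]))
    return data_files


plan = plan_scan("sales.orders")
print([f["path"] for f in plan], sum(f["rows"] for f in plan))
```

A query engine in this picture never lists directories in the data lake. It starts from the catalog and follows metadata until it has the exact set of files to read.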

Apache Iceberg and data lakes

Apache Iceberg significantly enhances the capabilities of data lakes by adding a reliable and performant table format. In traditional data lakes without a table format like Iceberg, data is often just a collection of files. This can lead to several challenges:

Lack of schema evolution: Changing the structure of the data can be complex and error-prone
Inconsistent reads: Concurrent write operations can lead to queries reading a mix of old and new data
Slow query performance: Without metadata to guide query engines, they often have to scan large portions of the data
Difficulty with data management: Features like time travel and versioning are not readily available

Iceberg addresses these limitations by providing a structured layer on top of the data lake. It brings database-like features to data lakes, transforming them into more powerful and manageable data lakehouses. By managing tables as collections of files with rich metadata, Iceberg enables:

Reliable and consistent data access: ACID properties ensure data integrity
Efficient query processing: Metadata-driven data skipping and filtering accelerate queries
Flexible data management: Schema evolution and time travel simplify data maintenance and analysis
Interoperability: Iceberg is designed to be compatible with various query engines and processing frameworks commonly used with data lakes

Google Cloud and Apache Iceberg

Google Cloud provides a robust environment for leveraging Apache Iceberg. Several Google Cloud services integrate well with Iceberg, enabling users to build powerful and scalable data lakehouse solutions.