What is Apache Kafka?

Last Updated: 06/12/2026

Apache Kafka is a popular open-source event streaming platform used to collect, process, and store continuous streams of events. Kafka is commonly used as messaging middleware, but it also offers the scalability and redundancy that distributed applications need to handle anything from a single event per day to billions per second. Unlike traditional messaging systems, Kafka is also a durable storage system that stores records in an ordered log that can be read and re-read reproducibly. This makes Kafka a common system for distributing changes from transactional databases, which can then be used to rebuild, or materialize, the data in analytics and other systems. This pattern is sometimes called event sourcing.

Thus, Kafka is important both for traditional event bus patterns, where event-driven applications are integrated through messaging middleware, and for data syndication (or “heterogeneous materialization”) architectures. In both cases, it moves data with low latency and in a cost-effective way.

Explore Google Cloud Managed Service for Apache Kafka to automate your streaming infrastructure and accelerate data-to-AI workflows.

Overview of Apache Kafka

Kafka takes streaming data and records exactly what happened and when. This record is called an append-only log. It is immutable in the sense that new entries can be appended to it, but existing entries cannot be changed. From there, applications can subscribe to the log to access the data, or publish to it to add more data in real time.
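
To make the idea of an append-only log concrete, here is a minimal in-memory sketch in Python. It is a toy model, not Kafka's API: the `Record` and `AppendOnlyLog` names and their methods are hypothetical, chosen only to show that records are added at the end, keep their position (offset), and can be re-read in the same order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """One immutable entry in the log (hypothetical structure)."""
    offset: int       # position in the log, assigned at append time
    timestamp: float  # when the event happened
    value: str        # the event payload


class AppendOnlyLog:
    """Toy in-memory log: records are added at the end and never edited."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, timestamp: float, value: str) -> int:
        """Publish: add a record at the end and return its offset."""
        offset = len(self._records)
        self._records.append(Record(offset, timestamp, value))
        return offset

    def read_from(self, offset: int) -> List[Record]:
        """Subscribe: return every record at or after the given offset."""
        return list(self._records[offset:])


log = AppendOnlyLog()
log.append(1.0, "order placed")
log.append(2.0, "order paid")

first_read = log.read_from(0)
second_read = log.read_from(0)  # re-reading yields the same records, same order
```

Because the class exposes no way to update or delete a record, every reader that starts from the same offset sees the same history.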

While “Kafka” often refers to the core low-latency storage system, the streaming platform includes other important components. First is Kafka Connect, an integration system that provides horizontally scalable connectors to many important systems. This includes change data capture (CDC) connectors, cluster-to-cluster replication (MirrorMaker), and the ability to write data to downstream systems such as lakehouses (Apache Iceberg), lakes (Avro or Parquet files on object storage), and data warehouses such as BigQuery. Second, the Kafka project ships with a set of powerful clients, including administrative command-line clients for manipulating clusters and topics as well as high-performance client libraries for reading and writing data.

Historically, data processing was handled with periodic batch jobs, where raw data was first stored and later processed at arbitrary intervals. For example, a retail company might wait until the end of the day to analyze sales data. One of the limitations of batch processing is that it’s not real time. Increasingly, organizations and data scientists want to analyze data in real time as it is generated to make timely business decisions and power real-time AI models.

This is where event streaming comes in. Event streaming is the process of continuously processing unbounded streams of events as they are created. This captures the time-value of data and enables push-based applications that take action whenever something interesting happens. For data scientists, this means the ability to perform real-time feature engineering and deliver low-latency predictions.
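
The difference between the two styles can be sketched in a few lines of Python. This is an illustration only: the `Sale` event shape and both function names are made up, and a plain list stands in for the stream. The batch function answers once, after all data is in; the streaming function produces an updated answer after every event.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Sale = Tuple[str, float]  # (store, amount); a made-up event shape


def batch_totals(sales: List[Sale]) -> Dict[str, float]:
    """Batch style: wait for the full day's data, then compute once."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
    return dict(totals)


def streaming_totals(sales: Iterable[Sale]) -> Iterator[Dict[str, float]]:
    """Streaming style: update the result after every event and emit it."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
        yield dict(totals)  # a fresh snapshot is available immediately


sales = [("north", 10.0), ("south", 5.0), ("north", 2.5)]
snapshots = list(streaming_totals(sales))
```

Both approaches end at the same totals; the streaming version simply makes every intermediate state available as soon as the event that caused it arrives.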

Why data engineers use Apache Kafka

While many organizations focus on the downstream insights generated by data scientists, the primary practitioners of Apache Kafka are data engineers. These professionals are responsible for building the critical "data pipes" and integrations that connect a company's applications and databases.

Building scalable integrations

Data engineers use Kafka to create reliable connections across the technology stack. These integrations can take several forms:

Application-to-application: Enabling microservices to communicate through event-driven architectures
Database-to-database: Synchronizing data between different storage systems for redundancy or specialized processing
Application-to-database: Capturing front-end events—such as user interactions on a mobile app—and streaming them into back-end databases

Data syndication and event exporting

In a typical enterprise, the data engineer works closely with application teams to ensure that user events, business transactions, and database updates are exported to Kafka. This process, known as data syndication, makes these events available to multiple users and systems across the organization simultaneously.
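
A rough way to picture data syndication is one shared stream with several independent readers, each keeping its own position. The Python sketch below is a toy model under that assumption; the `Topic` class, its `publish` and `poll` methods, and the reader names are hypothetical and do not mirror a real Kafka client.

```python
from typing import Dict, List


class Topic:
    """Toy stand-in for a topic: shared events plus one offset per reader."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._offsets: Dict[str, int] = {}  # reader name -> next offset to read

    def publish(self, event: str) -> None:
        self._events.append(event)

    def poll(self, reader: str) -> List[str]:
        """Return the events this reader has not seen yet, then advance it."""
        start = self._offsets.get(reader, 0)
        new_events = self._events[start:]
        self._offsets[reader] = len(self._events)
        return new_events


topic = Topic()
topic.publish("user_signed_up")
topic.publish("order_created")

analytics_first = topic.poll("analytics")    # sees both events
topic.publish("order_refunded")
analytics_second = topic.poll("analytics")   # sees only the new event
fraud_first = topic.poll("fraud-detection")  # a late joiner still sees all three
```

Reading does not remove anything from the stream, so adding a new downstream system does not affect the ones already consuming.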

Transforming logs for data science

Data engineers write the code for pipelines that transform raw application logs into structured, high-quality formats. This transformation is essential for data scientists, who generally require "clean" data stored in queryable environments like data lakes, lakehouses, or data warehouses rather than interacting with the raw Kafka stream directly.
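
As a small illustration of this kind of transformation, the sketch below parses raw log lines into structured JSON records and drops lines it cannot parse. The log format (timestamp, level, then `key=value` fields) and the function names are assumptions made for the example; real pipelines would follow whatever format the application actually emits.

```python
import json
from typing import Dict, Iterable, List, Optional


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Turn one raw log line into a structured record, or None if malformed."""
    parts = line.split()
    if len(parts) < 3:
        return None
    timestamp, level, *pairs = parts
    record = {"timestamp": timestamp, "level": level}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            return None  # not a key=value field; treat the line as malformed
        record[key] = value
    return record


def transform(lines: Iterable[str]) -> List[str]:
    """Keep well-formed lines and serialize each as one JSON document."""
    records = (parse_line(line) for line in lines)
    return [json.dumps(r, sort_keys=True) for r in records if r is not None]


raw = [
    "2026-01-15T10:00:00Z INFO user=42 action=click",
    "corrupted line",
    "2026-01-15T10:00:01Z WARN user=7 action=purchase",
]
clean = transform(raw)
```

Here `clean` holds two JSON documents, one per well-formed line, ready to be loaded into a queryable store.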

Prioritizing data access over latency

In the context of data science and AI, the value of Kafka lies primarily in data access. It serves as a comprehensive source for application logs and database changes. While Kafka is famous for its speed, for most data science workflows, the breadth and reliability of the data source are far more critical than low-latency delivery.

Why is Kafka important to AI systems?

AI systems depend on high-quality training data and on up-to-date context during inference. For training, Kafka is often critical for collecting data from a variety of source systems, from interaction logs to database changes. In many organizations it is used as an event bus aggregating events from many services or simply as a staging location for application logs. This makes it a natural, single source of data for generating training datasets. Because Kafka stores records in an ordered sequence, it can also be a particularly good fit for LLMs that operate on sequences.

Kafka is also essential for many online inference tasks. The ability of an application or agent to provide a relevant product recommendation, search response, or prompt relies on having the most up-to-date context for a user. Because Kafka supports low-latency, scalable communication, it allows an inference system to update user context with the latest events within tens of milliseconds. For example, if a user declines the latest song recommendation in a music app or if an equity price changes in a financial application, a recommendation service can immediately generate a better suggestion that takes this input into account.
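
The music example can be sketched as a small per-user context store that is updated as events arrive. Everything here is hypothetical: the `Event` shape, the `UserContext` class, and the deliberately naive `recommend` function stand in for a real feature store and model.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Tuple

Event = Tuple[str, str, str]  # (user, action, song); a made-up event shape


class UserContext:
    """Keeps the most recent events per user so a model can read fresh context."""

    def __init__(self, max_events: int = 5) -> None:
        self._recent: Dict[str, Deque[Tuple[str, str]]] = defaultdict(
            lambda: deque(maxlen=max_events)
        )

    def update(self, event: Event) -> None:
        user, action, song = event
        self._recent[user].append((action, song))

    def declined(self, user: str) -> List[str]:
        return [song for action, song in self._recent[user] if action == "declined"]


def recommend(user: str, candidates: List[str], context: UserContext) -> Optional[str]:
    """Naive recommender: first candidate the user has not recently declined."""
    skipped = set(context.declined(user))
    return next((song for song in candidates if song not in skipped), None)


context = UserContext()
candidates = ["song-a", "song-b", "song-c"]
before = recommend("ana", candidates, context)
context.update(("ana", "declined", "song-a"))  # event arrives from the stream
after = recommend("ana", candidates, context)
```

The recommendation changes from `song-a` to `song-b` as soon as the decline event is applied, which is the behavior a low-latency stream makes possible.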

What are the benefits of Kafka?

Open source ecosystem

Kafka’s source code is freely available, benefiting from a global community that contributes a broad range of connectors, monitoring tools, and plugins.

Scale and speed

Kafka is a distributed platform, meaning processing is divided among multiple machines. This allows it to scale to handle massive data volumes while keeping latency low, typically on the order of milliseconds.
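
One common way to divide a stream among machines is to split it into partitions and route each event by a hash of its key, so that events with the same key stay together and in order. The sketch below illustrates that idea only; it uses CRC32 as a stand-in hash, which is an assumption for the example and not the hash function Kafka's own clients use.

```python
import zlib
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in, not Kafka's actual hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(events: List[Tuple[str, str]], num_partitions: int) -> Dict[int, List[str]]:
    """Append each (key, value) event to the partition chosen by its key."""
    partitions: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append(value)
    return partitions


events = [
    ("user-1", "login"),
    ("user-2", "signup"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
partitions = distribute(events, num_partitions=3)
```

Each partition can then be stored and served by a different machine, while all events for `user-1` remain in a single partition in their original order.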

High availability

Because it is distributed, Kafka remains reliable even if individual machines fail, making it suitable for mission-critical applications.
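
The intuition behind surviving machine failures is replication: data is kept on more than one machine, so another copy can take over. The toy model below assumes every write is copied to all live replicas and that a surviving replica is promoted when the leader fails. The `ReplicatedLog` class and broker names are hypothetical, and real replication and leader election are considerably more involved.

```python
from typing import Dict, List


class ReplicatedLog:
    """Toy model: every write is copied to all live replicas; one replica leads."""

    def __init__(self, brokers: List[str]) -> None:
        self.replicas: Dict[str, List[str]] = {b: [] for b in brokers}
        self.leader = brokers[0]

    def append(self, record: str) -> None:
        for replica_log in self.replicas.values():
            replica_log.append(record)

    def fail(self, broker: str) -> None:
        """Remove a broker; if it was the leader, promote a surviving replica."""
        del self.replicas[broker]
        if not self.replicas:
            raise RuntimeError("all replicas lost")
        if broker == self.leader:
            self.leader = next(iter(self.replicas))

    def read(self) -> List[str]:
        return list(self.replicas[self.leader])


log = ReplicatedLog(["broker-1", "broker-2", "broker-3"])
log.append("payment-received")
log.fail("broker-1")           # the leader goes down
log.append("payment-settled")  # writes continue on the new leader
```

After the failure, reads are served from the promoted replica and still include the record written before the original leader was lost.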

Kafka as a managed service

Setting up on-premises Kafka clusters is notoriously difficult, requiring teams to provision machines, manage security, and handle routine patching. With a managed service, a provider handles the underlying infrastructure, allowing you to focus on building applications. This is particularly beneficial for data science teams who want to focus on model development and insights rather than infrastructure management.

How does Kafka work?

Kafka enables streaming event processing through four core functions:

Produce: A source can write data, such as logs, events, or records, into topics (groupings of data events).
Store: Apache Kafka provides durable, highly available storage, often serving as a reliable "source of truth" for event-driven architectures. This is particularly useful when you need to go back and look at what happened historically, rather than simply reacting to events live. Because Kafka stores records in an ordered log that can be read and re-read reproducibly, it allows teams to rebuild data in analytics systems or investigate past transactions with full context.
Consume: An application can subscribe to one or more topics to read and process the resulting data stream.
Connect: Reusable connectors link Kafka to existing systems like BigQuery and Dataproc.
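
The four functions above can be sketched end to end with a stream of database change events. This is a simplified model, not Kafka client code: a Python list plays the role of the stored log, and the `Change` shape and the `produce`, `consume`, and `sink_to_table` functions are hypothetical names used for illustration.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[dict]]  # (primary key, new row or None for a delete)


def produce(log: List[Change], key: str, row: Optional[dict]) -> None:
    """Produce: append a change event to the stored log."""
    log.append((key, row))


def consume(log: List[Change], offset: int = 0) -> List[Change]:
    """Consume: read events starting at an offset, in log order."""
    return log[offset:]


def sink_to_table(changes: List[Change]) -> Dict[str, dict]:
    """Connect: apply changes in order to rebuild (materialize) a table."""
    table: Dict[str, dict] = {}
    for key, row in changes:
        if row is None:
            table.pop(key, None)  # a None row marks a delete
        else:
            table[key] = row
    return table


log: List[Change] = []  # Store: the durable, ordered log (here just a list)
produce(log, "acct-1", {"balance": 100})
produce(log, "acct-2", {"balance": 50})
produce(log, "acct-1", {"balance": 70})
produce(log, "acct-2", None)

table = sink_to_table(consume(log))
rebuilt = sink_to_table(consume(log))  # replaying the log gives the same table
```

Because the log is ordered and replayable, a downstream system can be rebuilt from scratch at any time and arrive at the same state.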

Additional resources

BigQuery Studio Overview: A unified workspace for data practitioners to accelerate data-to-AI workflows with SQL and Python notebooks
Iceberg Tables Overview: Create managed Iceberg tables that allow open-source engines like Spark to query streaming data with high performance
