What is Apache Kafka?
Apache Kafka is a popular event streaming platform used to collect, process, and store streaming event data, that is, data that has no discrete beginning or end. Kafka makes possible a new generation of distributed applications capable of scaling to handle billions of streamed events per minute.
Until the arrival of event streaming systems like Apache Kafka and Google Cloud Pub/Sub, data processing was typically handled with periodic batch jobs, where raw data is first stored and then processed later at arbitrary time intervals. For example, a telecom company might wait until the end of the day, week, or month to analyze millions of call records and calculate accumulated charges.
One of the limitations of batch processing is that it’s not real time. Increasingly, organizations want to analyze data in real time in order to make timely business decisions and take action when interesting things happen. For example, the same telecom company mentioned above might benefit from keeping customers apprised of charges in real time as a way to enhance the overall customer experience.
This is where event streaming comes in. Event streaming is the process of continuously processing infinite streams of events, as they are created, in order to capture the time-value of data as well as create push-based applications that take action whenever something interesting happens. Examples of event streaming include continuously analyzing log files generated by customer-facing web applications, monitoring and responding to customer behavior as users browse e-commerce websites, keeping a continuous pulse on customer sentiment by analyzing changes in clickstream data generated by social networks, or collecting and responding to telemetry data generated by Internet of Things (IoT) devices.
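The difference between the two styles can be sketched with the telecom example above. This is a minimal illustration in plain Python, not Kafka code: the record shape, the flat per-minute rate, and both function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical call record: (customer_id, call_minutes).
CallRecord = Tuple[str, int]
RATE_CENTS_PER_MINUTE = 10  # assumed flat rate, for illustration only


def batch_charges(records: Iterable[CallRecord]) -> Dict[str, int]:
    """Batch style: read the full set of stored records, then report totals once."""
    totals: Dict[str, int] = defaultdict(int)
    for customer, minutes in records:
        totals[customer] += minutes * RATE_CENTS_PER_MINUTE
    return dict(totals)


def streaming_charges(records: Iterable[CallRecord]) -> Iterator[Tuple[str, int]]:
    """Streaming style: emit an updated running total as each event arrives."""
    totals: Dict[str, int] = defaultdict(int)
    for customer, minutes in records:
        totals[customer] += minutes * RATE_CENTS_PER_MINUTE
        yield customer, totals[customer]
```

Both functions do the same arithmetic. The batch version produces an answer only after every record has been read, while the streaming version yields a fresh total after each call, which is what would let the company keep customers apprised of charges as they accrue.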
Kafka takes streaming data and records exactly what happened and when. This record is called an immutable commit log. It is immutable because it can be appended to, but not otherwise changed. From there, you can subscribe to the log (access the data) and you can also publish to it (add more data) from any number of streaming real-time applications, as well as other systems.
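The idea of an append-only log can be illustrated with a toy model. The sketch below is an in-memory stand-in, not Kafka's implementation: the class name `CommitLog` and its methods are hypothetical, and a real Kafka log is persisted to disk and spread across machines.

```python
from typing import List, Tuple

# A record is (timestamp, event): what happened and when.
Record = Tuple[float, str]


class CommitLog:
    """Toy append-only log: records are added at the end and never changed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, timestamp: float, event: str) -> int:
        """Publish: add a record to the end and return its offset (position)."""
        self._records.append((timestamp, event))
        return len(self._records) - 1

    def read_from(self, offset: int) -> Tuple[Record, ...]:
        """Subscribe: return every record at or after the given offset.

        The result is a tuple of tuples, so readers cannot modify the log.
        """
        return tuple(self._records[offset:])
```

Because the log exposes no update or delete operation, two readers that start from different offsets always see a consistent history. Each reader only needs to remember the offset it has reached.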
For example, you could use Kafka to take all transaction data streaming from your website to feed an application that tracks product sales in real time, compares it to the amount of product in stock, and in turn enables just-in-time inventory replenishment.
Kafka is open source
This means its source code is freely available for anyone to use, modify, and distribute as their own version, for any purpose, under the terms of the Apache License 2.0. There are no licensing fees. Kafka also benefits from having a global community of developers working with and contributing to it. As a result, Kafka offers a broad range of connectors, plugins, monitoring tools, and configuration tools as part of a growing ecosystem.
Scale and speed
Kafka not only scales with ever-increasing volumes of data, but also delivers that data across the business in real time. Kafka is a distributed platform, which means that processing is divided among multiple physical or virtual machines. This has two advantages. First, with some work, Kafka can scale out by adding machines when more processing power or storage is needed. Second, it is reliable, because the platform keeps running even if individual machines fail. However, the distributed nature of Kafka can be very difficult to manage at scale.
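One common way a distributed log divides work is to split a stream into partitions and spread those partitions over the available machines. The sketch below illustrates that idea under stated assumptions: both function names are hypothetical, `zlib.crc32` is only a stand-in for whatever hash a real client uses, and the round-robin placement is a simplification of how a real cluster assigns partitions.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map an event key to a partition; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, machines: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the machines currently in the cluster."""
    assignment: Dict[str, List[int]] = {machine: [] for machine in machines}
    for partition in range(num_partitions):
        assignment[machines[partition % len(machines)]].append(partition)
    return assignment
```

With six partitions, two machines hold three partitions each. Adding a third machine and recomputing the assignment leaves each machine with two, which is the scale-out effect described above. Deciding when and how to move partitions between machines is part of what makes this hard to manage at scale.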
Kafka as a managed service
Despite all the advantages of Kafka, it is a challenging technology to deploy. On-premises Kafka clusters are difficult to set up, scale, and manage in production. When establishing the on-premises infrastructure to run Kafka, you need to provision machines and configure Kafka. You must also design the cluster of distributed machines to ensure availability, make sure data is stored securely, set up monitoring, and carefully scale the cluster to support load changes. Then you have to maintain that infrastructure, replacing machines when they fail and doing routine patching and upgrading.
An alternative approach is to utilize Kafka as a managed service in the cloud. A third-party vendor takes care of provisioning, building, and maintaining the Kafka infrastructure. You build and run the applications. This makes it easy for you to deploy Kafka without needing specific Kafka infrastructure management expertise. You spend less time managing infrastructure and more time creating value for your business.
A data source can publish, or place, a stream of data events into one or more Kafka topics, which are groupings of similar data events. For example, you can take data streaming from an IoT device (say, a network router) and publish it to a topic read by an application that does predictive maintenance to calculate when that router is likely to fail.
An application can subscribe to, or take data from, one or more Kafka topics and process the resulting stream of data. For example, an application can take data from multiple social media streams and analyze it to determine the tenor of online conversations about a brand.
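The publish and subscribe roles described above can be modeled in a few lines. This is a toy in-memory sketch, not a Kafka client: `ToyBroker`, `publish`, and `poll` are hypothetical names, and the topic names in the usage note are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyBroker:
    """Toy broker: named topics plus a read position per (subscriber, topic)."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[str]] = defaultdict(list)
        # Next offset each subscriber will read from each topic.
        self._positions: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, event: str) -> None:
        """Append an event to the end of the named topic."""
        self._topics[topic].append(event)

    def poll(self, subscriber: str, topics: List[str]) -> List[Tuple[str, str]]:
        """Return (topic, event) pairs this subscriber has not seen yet."""
        unseen: List[Tuple[str, str]] = []
        for topic in topics:
            events = self._topics[topic]
            start = self._positions[(subscriber, topic)]
            unseen.extend((topic, event) for event in events[start:])
            # Advance this subscriber only; other subscribers are unaffected.
            self._positions[(subscriber, topic)] = len(events)
        return unseen
```

For example, a hypothetical `sentiment` subscriber could poll `["social-a", "social-b"]` while a `maintenance` subscriber polls `["router-telemetry"]`. Publishing never removes anything, so a subscriber that joins later still receives every earlier event on its topics.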
The Kafka Streams API can act as a stream processor, consuming incoming data streams from one or more topics and producing an outgoing data stream to one or more topics.
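The consume, transform, produce pattern of a stream processor can be sketched as follows. This is not the Kafka Streams API, which is a Java library; it is a plain-Python illustration in which topics are modeled as lists in a dictionary, and `run_processor` and its parameters are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional


def run_processor(
    topics: Dict[str, List[str]],
    input_topics: Iterable[str],
    output_topic: str,
    transform: Callable[[str], Optional[str]],
) -> None:
    """Read each event from the input topics, transform it, and append
    the result to the output topic. Input topics are left unchanged."""
    output = topics.setdefault(output_topic, [])
    for name in input_topics:
        # Iterate over a copy so the loop ends even if a topic is both input and output.
        for event in list(topics.get(name, [])):
            result = transform(event)
            if result is not None:  # a transform returning None drops the event
                output.append(result)
```

A transform that keeps only purchase events, for instance, would turn two clickstream topics into a single `purchases` topic that other applications can subscribe to in turn.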
You can also build reusable producer or consumer connectors that link Kafka topics to existing applications. There are hundreds of existing connectors already available, including connectors to key services like Dataproc, BigQuery, and more.
Apache Kafka provides durable storage. Kafka can act as a "source of truth" because it can distribute data across multiple nodes for a highly available deployment within a single data center or across multiple availability zones.
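Why copies on multiple nodes make the data durable can be shown with a toy model. The sketch below is a deliberate simplification and an assumption of this example: every write goes to all live nodes at once, whereas a real cluster coordinates replicas through its own protocol. The class `ReplicatedLog` and the node names are hypothetical.

```python
from typing import Dict, List, Set


class ReplicatedLog:
    """Toy replicated log: every append is copied to each node that is still up."""

    def __init__(self, nodes: List[str]) -> None:
        self._replicas: Dict[str, List[str]] = {node: [] for node in nodes}
        self._alive: Set[str] = set(nodes)

    def append(self, event: str) -> None:
        if not self._alive:
            raise RuntimeError("no replicas available")
        for node in self._alive:
            self._replicas[node].append(event)

    def fail(self, node: str) -> None:
        """Simulate losing a machine (or a whole availability zone)."""
        self._alive.discard(node)

    def read(self) -> List[str]:
        """Serve the full history from any surviving replica."""
        for node in sorted(self._alive):
            return list(self._replicas[node])
        raise RuntimeError("no replicas available")
```

With three nodes, the log stays readable and writable after two of them fail, because the surviving node holds a complete copy. This toy does not model a failed node catching up after it recovers.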