Data Analytics

Cloud Pub/Sub 2024 highlights: Native integrations, sharing and more

December 17, 2024

Prateek Duble

Product Management Lead, Cloud Pub/Sub

Joaquín Ibar

Technical Account Manager, Google Cloud

In today's rapidly evolving digital landscape, organizations need to leverage real-time data for actionable insights and improved decision-making. The availability of real-time data is becoming a key factor in how businesses evolve and grow. Pub/Sub is Google Cloud’s simple, reliable, and scalable messaging service that serves as a versatile entry point to ingest streaming data into Google Cloud’s ecosystem, and is integrated with products like BigQuery, Cloud Storage, Dataflow, and more. You can then use this data for downstream analytics, visualization, and AI applications. This year we launched several new features and enhancements to help meet the demands of modern streaming workloads, across three key data analytics patterns:

Streaming ingestion - Stream data directly into BigQuery and Cloud Storage for downstream use cases such as analytics and ML with BigQuery or for backup in Cloud Storage.
Streaming analytics - Process and analyze real-time event streams and make business decisions based on high-value, real-time insights with Dataflow, BigQuery Engine for Apache Flink, or BigQuery continuous queries.
Stream sharing and export - Curate, share and monetize your valuable streaming data through data exchanges with your internal teams and/or external customers.

Let’s take a closer look at the Pub/Sub highlights of 2024 across these three areas.

Streaming ingestion

Many customers have some workloads on one public cloud and other workloads (e.g. analytical) on another. Pub/Sub has traditionally supported streaming ingestion into BigQuery and Cloud Storage through export subscriptions. This year, we simplified import into Pub/Sub from various sources, starting with AWS Kinesis Data Streams. Pub/Sub import topics are a new no-code, one-click way to ingest streaming data from AWS Kinesis Data Streams into Pub/Sub. This helps simplify streaming data ingestion pipelines by removing the overhead of maintaining and running a custom connector.
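Outside the console, an import topic is a regular topic created with ingestion settings attached. As a rough sketch, the snippet below builds the JSON body you might send when creating such a topic through the Pub/Sub REST API. The field names are assumed to follow the `ingestionDataSourceSettings.awsKinesis` shape of the Topic resource; the function name, ARNs and service account are placeholders invented for this example, so verify them against the current API reference.

```python
import json


def kinesis_import_topic_body(stream_arn, consumer_arn, aws_role_arn, gcp_service_account):
    """Build a topic-creation body that ingests from AWS Kinesis Data Streams.

    Assumption: keys mirror the Pub/Sub REST Topic resource
    (ingestionDataSourceSettings.awsKinesis). All values are placeholders.
    """
    return {
        "ingestionDataSourceSettings": {
            "awsKinesis": {
                "streamArn": stream_arn,
                "consumerArn": consumer_arn,
                "awsRoleArn": aws_role_arn,
                "gcpServiceAccount": gcp_service_account,
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    body = kinesis_import_topic_body(
        stream_arn="arn:aws:kinesis:us-east-1:111111111111:stream/example-stream",
        consumer_arn="arn:aws:kinesis:us-east-1:111111111111:stream/example-stream/consumer/example:1",
        aws_role_arn="arn:aws:iam::111111111111:role/example-role",
        gcp_service_account="ingest@example-project.iam.gserviceaccount.com",
    )
    print(json.dumps(body, indent=2))
```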

Another typical streaming ingestion use case is to ingest batch data into Pub/Sub. To ingest data from Cloud Storage into Pub/Sub, you used to have to either configure, deploy, run, manage and scale a custom connector, or use a Dataflow template. Now you can enable the ingestion property to create a Cloud Storage import topic to ingest batch data from Cloud Storage into a Pub/Sub topic. Once the data is flowing into an import topic, you can create a subscription (Pull, Push, BigQuery or Cloud Storage) to get the data to your choice of sink for downstream processing.

There are two key use cases for Cloud Storage import topics:

Batch to streaming - To leverage batch data for streaming analytics use cases like predictions and activations, you must first transform it into a streaming format. With Cloud Storage import topics, you can perform this ingestion in a fully managed way.
Streaming archive data - Many customers need to store historical data; using Pub/Sub with Cloud Storage subscriptions makes it easier to build their archive. From there, Cloud Storage import topics make it easy to ingest historical data into a Pub/Sub topic for streaming analytics use cases.
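To make the Cloud Storage case more concrete, here is a minimal sketch of the ingestion settings for a Cloud Storage import topic, again expressed as a REST-style JSON body. The keys (`bucket`, `textFormat`, `matchGlob`, `minimumObjectCreateTime`) are assumed to match the Topic resource's `cloudStorage` ingestion settings; the bucket name, glob and helper function are hypothetical.

```python
import json
from datetime import datetime, timezone


def cloud_storage_import_topic_body(bucket, match_glob="**", delimiter="\n", min_create_time=None):
    """Build a topic-creation body that ingests text objects from a bucket.

    Assumption: keys mirror ingestionDataSourceSettings.cloudStorage in the
    Pub/Sub REST Topic resource. min_create_time must be timezone-aware.
    """
    settings = {
        "bucket": bucket,
        # Each delimited chunk of a text object becomes one message.
        "textFormat": {"delimiter": delimiter},
        # Only objects whose names match this glob are ingested.
        "matchGlob": match_glob,
    }
    if min_create_time is not None:
        # Serialize as an RFC 3339 UTC timestamp.
        settings["minimumObjectCreateTime"] = min_create_time.astimezone(
            timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"ingestionDataSourceSettings": {"cloudStorage": settings}}


if __name__ == "__main__":
    body = cloud_storage_import_topic_body(
        bucket="example-archive-bucket",  # hypothetical bucket
        match_glob="events/2024/**.jsonl",
        min_create_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    )
    print(json.dumps(body, indent=2))
```

Filtering by glob and creation time is one way to replay only a slice of an archive rather than the whole bucket.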

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_-_Import_Topic.gif

This year we launched BigQuery tables for Apache Iceberg in preview, a fully managed, Apache Iceberg-compatible storage engine from BigQuery with features such as autonomous storage optimizations, clustering and high-throughput streaming ingestion. Pub/Sub BigQuery subscriptions integrate with BigQuery tables for Apache Iceberg for high-throughput streaming ingestion that durably stores the ingested tuples in a row-oriented format, and periodically converts them to Parquet stored in a customer-owned Cloud Storage bucket. In other words, you can also use BigQuery tables for Apache Iceberg with Pub/Sub to store streaming data in Cloud Storage in Parquet format.
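For reference, a BigQuery subscription is a subscription whose delivery target is a table rather than a pull or push endpoint. The sketch below builds a REST-style subscription body with a `bigqueryConfig` block. It assumes an Iceberg-backed table is referenced the same way as any other BigQuery table (`project.dataset.table`), which the text above does not spell out; the helper and all resource names are hypothetical.

```python
import json


def bigquery_subscription_body(topic_path, table, use_topic_schema=False, write_metadata=False):
    """Build a subscription body that writes messages to a BigQuery table.

    Assumptions: keys mirror the Pub/Sub REST Subscription resource, and
    `table` uses the "project.dataset.table" form.
    """
    return {
        "topic": topic_path,
        "bigqueryConfig": {
            "table": table,
            # Map topic schema fields to table columns when enabled.
            "useTopicSchema": use_topic_schema,
            # Also write message metadata (for example, publish time) when enabled.
            "writeMetadata": write_metadata,
        },
    }


if __name__ == "__main__":
    body = bigquery_subscription_body(
        topic_path="projects/example-project/topics/example-topic",
        table="example-project.example_dataset.events_iceberg",
        write_metadata=True,
    )
    print(json.dumps(body, indent=2))
```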

Streaming analytics

Customers use Pub/Sub in conjunction with stream processing engines to power streaming-analytics use cases such as anomaly detection and personalization. With Pub/Sub already natively integrated with Dataflow, in 2024 we focused on supporting Apache Flink, an open-source stream-processing framework that is seeing growing adoption across enterprises. You can now use Apache Flink with Pub/Sub in two ways:

1. BigQuery Engine for Apache Flink
We recently launched BigQuery Engine for Apache Flink in preview, which lets you use the familiar Apache Flink API and ecosystem for stateful stream processing with Java, Python and SQL. It’s also a serverless offering with fully managed deployments, autoscaling, transparent upgrades and pay-as-you-go billing, and is natively integrated into our unified data and AI platform. Pub/Sub is also integrated with BigQuery Engine for Apache Flink.

2. Pub/Sub Apache Flink connector
To support streaming analytics with existing Apache Flink deployments, we launched a new version of the Pub/Sub Flink connector. Now generally available, the connector lets you connect your existing Apache Flink deployment to Pub/Sub in just a few steps. With it, you can publish Apache Flink output to Pub/Sub topics or use Pub/Sub subscriptions as a source in Apache Flink applications.

Stream sharing & export

BigQuery Analytics Hub lets businesses share batch data assets across organizations efficiently and securely. However, many organizations also need to share real-time streaming data with partners and customers, as well as with internal teams. To help, Pub/Sub topics sharing in Analytics Hub, now in preview, provides:

Real-time data sharing - Data providers can share data updates instantly, giving consumers timely access to the freshest data.
Enhanced data discovery - By listing Pub/Sub topics as data products, producers can increase the visibility and discoverability of their data streams.
Simplified data access - An integrated experience lets you centrally manage access to your organization’s streaming data.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_-_Analytics_Hub_Listing.gif

To simplify streaming real-time data from BigQuery to external systems and vendors, you can use BigQuery continuous queries with Pub/Sub. Continuous queries extend BigQuery with new streaming SQL capabilities in the form of SQL jobs that can run indefinitely and process real-time data the moment it arrives, letting you analyze streaming data in real time and act on those insights immediately.

You can even use Pub/Sub as both the input and the output for real-time data processing: use a BigQuery subscription to ingest streaming data into BigQuery, run a BigQuery continuous query to process and analyze it, and export the query results to a separate Pub/Sub topic to build event-driven data pipelines that communicate insights to downstream applications. Multiple Google Cloud ISV partners already support Pub/Sub messages generated from a continuous query, including (but not limited to) Aiven, Census, Confluent, Estuary, Hightouch, Keboola, Lytics, Nexla, Qlik, and Redpanda.
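As an illustration of the output half of this pattern, the sketch below assembles the kind of `EXPORT DATA` statement a continuous query might use to publish each result row to a Pub/Sub topic as a JSON string. The statement shape (a `CLOUD_PUBSUB` format, a `pubsub.googleapis.com` topic URI, and a single `message` column) is an assumption based on the documented export syntax at the time of writing and may have changed; the helper function, project, table and column names are hypothetical.

```python
def export_to_pubsub_sql(project, topic, source_table, columns, where=None):
    """Return an EXPORT DATA statement that publishes query rows to Pub/Sub.

    Assumption: continuous queries accept format = 'CLOUD_PUBSUB' with a
    topic URI, and each row is sent as the single `message` column.
    Inputs are interpolated as-is, so only pass trusted identifiers.
    """
    uri = f"https://pubsub.googleapis.com/projects/{project}/topics/{topic}"
    struct_fields = ", ".join(columns)
    where_clause = f"\n  WHERE {where}" if where else ""
    return (
        "EXPORT DATA\n"
        "  OPTIONS (\n"
        "    format = 'CLOUD_PUBSUB',\n"
        f"    uri = '{uri}')\n"
        "AS (\n"
        f"  SELECT TO_JSON_STRING(STRUCT({struct_fields})) AS message\n"
        f"  FROM `{source_table}`{where_clause}\n"
        ");"
    )


if __name__ == "__main__":
    print(
        export_to_pubsub_sql(
            project="example-project",
            topic="insights-out",
            source_table="example-project.example_dataset.events",
            columns=["event_id", "event_time", "score"],
            where="score > 0.9",
        )
    )
```

The statement would then be submitted as a continuous query job; how the job is started and which permissions its service account needs are outside the scope of this sketch.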

Observability

New support for OpenTelemetry in Pub/Sub lets you see a detailed trace of your message lifecycle, including the ability to see a distributed trace from the moment a message is published to when it's received and processed. Analyzing these traces can decrease troubleshooting time by allowing you to quickly identify bottlenecks, misconfigurations, and other failures in your Pub/Sub applications.

Looking ahead

As we look ahead to 2025, we have planned innovation across the following key areas:

Simplified Kafka ingestion - Oftentimes customers migrate from Kafka to Pub/Sub to simplify their messaging infrastructure and enjoy Pub/Sub’s key benefits of simplicity, reliability and auto-scalability. To make this migration journey simpler, we will be launching cross-cloud Kafka sources with Import Topics in early 2025.
Single message transforms - Almost all streaming data pipelines need some form of transformation. Some customers prefer to transform the data after it has landed in a data lake or data warehouse (ELT pattern), while others prefer to transform the data before landing it in the sink (data lake, data warehouse). In 2025, we plan to further simplify streaming analytics architectures by providing native, lightweight, single-message transformations. Pub/Sub Single Message Transforms (SMT) will help you perform simple, lightweight modifications to message attributes and/or data with JavaScript user-defined functions (UDFs).

Thanks for reading this far. We are excited to get these capabilities to you. Get started with Pub/Sub today and start exploring these new features to solve your hardest business challenges.
