
Easily stream data from AWS Kinesis to Google Cloud with Pub/Sub import topics

May 30, 2024
Jaume Marhuenda-Beltran

Software Engineer


Many companies use a multi-cloud model to support their business, whether it’s to avoid being tied to one provider, to increase redundancy, or to take advantage of differentiated products from different cloud providers. 

One of Google Cloud’s most beloved and differentiated products is BigQuery, a fully managed, AI-ready, multi-cloud data analytics platform. BigQuery Omni provides a unified management interface where you can use BigQuery to query data in AWS or Azure and see the results displayed in the Google Cloud console. And if you want to combine and move data between clouds in real time, Pub/Sub now offers a capability that enables one-click streaming ingestion into Pub/Sub from external sources: import topics. The first supported external source is Amazon Kinesis Data Streams. Let's explore how you can leverage these new import topics, along with Pub/Sub's BigQuery subscriptions, to make your streaming data in AWS available in BigQuery with only a few clicks.

Getting to know import topics

Pub/Sub is a scalable, asynchronous messaging service that lets you decouple services that produce messages from services that process those messages. Pub/Sub can be used to stream data from any source to any sink with its client libraries, and is well integrated within the Google Cloud ecosystem. Pub/Sub supports export subscriptions that automatically stream data to BigQuery and Cloud Storage. It is also natively integrated with Cloud Functions and Cloud Run, and can deliver messages to any publicly reachable endpoint, for example on Google Kubernetes Engine (GKE), Google Compute Engine, or even on-premises.
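If you haven't used Pub/Sub before, publishing a message takes only a few lines. Here's a minimal sketch using the Python client library, with placeholder project and topic names:

```python
# Minimal Pub/Sub publish sketch. "my-project" and "my-topic" are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Payloads are raw bytes; optional attributes are string key/value pairs.
future = publisher.publish(topic_path, b"Hello from Pub/Sub!", origin="sample")
print(f"Published message ID: {future.result()}")
```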

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_cross_cloud_streaming.max-1400x1400.png

What export subscriptions provide for writing data to BigQuery and Cloud Storage, import topics provide for reading data from Amazon Kinesis Data Streams: a fully managed, streamlined way to ingest data from Amazon Kinesis Data Streams directly into Pub/Sub. As a result, the complexity of setting up data pipelines between clouds is significantly reduced. Import topics also provide out-of-the-box monitoring for visibility into the health and performance of the data ingestion processes. Additionally, import topics offer automatic scaling, eliminating the need for manual configuration to handle fluctuations in data volume.

In addition to enabling multi-cloud analytics with BigQuery, import topics allow for the easy migration of streaming data from Amazon Kinesis Data Streams into Pub/Sub. Once a connection between the two systems has been established via an import topic, the Amazon Kinesis producers can be gradually migrated to Pub/Sub publishers on a schedule of your choosing.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_aws_kinesis_migration.max-1000x1000.png

Please note that only Enhanced Fan-Out Amazon Kinesis consumers are currently supported.

Analyzing your Amazon Kinesis Data Streams data in BigQuery

Now, imagine you operate a business with a highly variable volume of streaming data residing in Amazon Kinesis Data Streams. This data is crucial for analysis and decision-making, and you want to leverage BigQuery to analyze it. First, create an import topic by following these comprehensive instructions. You can create an import topic with any of the official Pub/Sub client libraries, as well as through the Google Cloud console. After clicking “Create Topic” on the Pub/Sub page in the console, you’ll see:

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_create_topic.max-600x600.png
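If you'd rather script this step than click through the console, the sketch below creates an import topic with the Python client library. The project, topic, ARNs, and service account are all placeholders to replace with your own values (and note the data-loss caveat below before pointing an import topic at a stream that already contains data):

```python
# Sketch: creating a Kinesis import topic with the Python client library.
# All names, ARNs, and the service account below are placeholders.
from google.cloud import pubsub_v1
from google.pubsub_v1.types import IngestionDataSourceSettings, Topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-import-topic")

request = Topic(
    name=topic_path,
    ingestion_data_source_settings=IngestionDataSourceSettings(
        aws_kinesis=IngestionDataSourceSettings.AwsKinesis(
            stream_arn="arn:aws:kinesis:us-east-1:111111111111:stream/my-stream",
            consumer_arn=(
                "arn:aws:kinesis:us-east-1:111111111111:"
                "stream/my-stream/consumer/my-consumer:1111111111"
            ),
            aws_role_arn="arn:aws:iam::111111111111:role/pubsub-ingestion",
            gcp_service_account="ingest-sa@my-project.iam.gserviceaccount.com",
        )
    ),
)

topic = publisher.create_topic(request=request)
print(f"Created import topic: {topic.name}")
```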

Important: As soon as you hit "Create", Pub/Sub begins reading from the Amazon Kinesis data stream and publishing messages to the import topic. If you already have data in your Kinesis data stream, there are some steps you need to take to prevent data loss when creating an import topic. If a topic has no subscriptions attached to it and does not have message retention enabled, then Pub/Sub may drop messages published to that topic. Creating a default subscription when creating the topic is not sufficient; these are still two separate operations, and there is a brief period of time during which the topic exists without the subscription.

To prevent data loss, you have two options:

  1. Create a topic, and then update it to become an import topic:

    1. Create a non-import topic.

    2. Create a subscription to the topic.

    3. Update the topic configuration to enable ingestion from Amazon Kinesis Data Streams, thus turning it into an import topic.

  2. Enable message retention and seek your subscription: 

    1. Create an import topic with message retention enabled.

    2. Create a subscription to the topic.

    3. Seek the subscription to a timestamp prior to the creation of the topic.

Note that export subscriptions start writing data as soon as they are created, so seeking back in time can result in duplicates. Because of this, the recommended approach when using export subscriptions is the first option, sketched below.
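Here's a rough sketch of that first option using the Python client library, again with placeholder names, ARNs, and service account:

```python
# Sketch of option 1: create a regular topic and a subscription first, then
# update the topic to enable Kinesis ingestion. All identifiers are placeholders.
from google.cloud import pubsub_v1
from google.protobuf import field_mask_pb2
from google.pubsub_v1.types import (
    IngestionDataSourceSettings,
    Topic,
    UpdateTopicRequest,
)

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "my-import-topic")
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# 1. Create a plain (non-import) topic.
publisher.create_topic(request={"name": topic_path})

# 2. Attach a subscription so published messages cannot be dropped.
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)

# 3. Enable ingestion; Pub/Sub starts reading from Kinesis once this succeeds.
updated_topic = Topic(
    name=topic_path,
    ingestion_data_source_settings=IngestionDataSourceSettings(
        aws_kinesis=IngestionDataSourceSettings.AwsKinesis(
            stream_arn="arn:aws:kinesis:us-east-1:111111111111:stream/my-stream",
            consumer_arn=(
                "arn:aws:kinesis:us-east-1:111111111111:"
                "stream/my-stream/consumer/my-consumer:1111111111"
            ),
            aws_role_arn="arn:aws:iam::111111111111:role/pubsub-ingestion",
            gcp_service_account="ingest-sa@my-project.iam.gserviceaccount.com",
        )
    ),
)
publisher.update_topic(
    request=UpdateTopicRequest(
        topic=updated_topic,
        update_mask=field_mask_pb2.FieldMask(
            paths=["ingestion_data_source_settings"]
        ),
    )
)
```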

To route the data to BigQuery, create a BigQuery subscription by navigating to the Pub/Sub subscriptions page in the Google Cloud console and clicking “Create Subscription”:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image7_Q7hE1yB.max-600x600.png
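The same subscription can be created programmatically. In this sketch the `table` value is a placeholder in `project.dataset.table` form, and `write_metadata=True` asks Pub/Sub to also write message metadata such as `publish_time` to the table:

```python
# Sketch: creating a BigQuery subscription with the Python client library.
# The topic, subscription, and table names are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "my-import-topic")
subscription_path = subscriber.subscription_path("my-project", "my-bq-subscription")

bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.my_dataset.kinesis_events",
    write_metadata=True,  # also write publish_time, message_id, etc.
)

with subscriber:
    subscription = subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
print(f"Created BigQuery subscription: {subscription.name}")
```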

Pub/Sub autoscales by actively monitoring the Amazon Kinesis data stream. It periodically queries the Amazon Kinesis ListShards API to maintain an up-to-date view of the stream’s shards. Whenever the Amazon Kinesis data stream changes (for example, through resharding), Pub/Sub automatically adapts its ingestion configuration to ensure that all data is captured and published to your Pub/Sub topic.

Pub/Sub uses the Amazon Kinesis SubscribeToShard API to establish a persistent connection for each shard that either doesn’t have a parent shard or whose parent shard has already been ingested, ensuring continuous ingestion of data from the different shards of the Amazon Kinesis data stream. Pub/Sub doesn’t start ingesting a child shard until its parent has been completely ingested. However, there are no strict ordering guarantees, since messages are published without an ordering key. Each Amazon Kinesis record is transformed into a Pub/Sub message by copying the record’s data blob into the data field of the Pub/Sub message, which is then published. Pub/Sub attempts to maximize the data read rate per Amazon Kinesis shard.
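To see what the ingested messages look like, you can attach a regular pull subscription to the import topic and inspect the payloads. A minimal sketch, with placeholder names:

```python
# Sketch: pulling ingested messages from a subscription on the import topic.
# Each message's `data` field carries the raw blob of one Kinesis record.
from concurrent import futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Kinesis record payload: {message.data!r}")
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=30)  # listen for 30 seconds, then stop
    except futures.TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()
```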

You can now validate the successful data transfer by querying the BigQuery table directly. A quick SQL query confirms that the data from Amazon Kinesis has been populated into the table, and is now ready for further analysis and to be integrated into broader analytics workflows:

https://storage.googleapis.com/gweb-cloudblog-publish/images/5_sql_query.max-1000x1000.png
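As a sketch, a spot check like the following with the BigQuery Python client would do; it assumes the placeholder table from earlier and a subscription created with `write_metadata=True` (which is what populates the `publish_time` column):

```python
# Sketch: spot-checking ingested rows with the BigQuery Python client.
# The table name is a placeholder; `data` holds the Pub/Sub message payload.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
query = """
    SELECT publish_time, data
    FROM `my-project.my_dataset.kinesis_events`
    ORDER BY publish_time DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.publish_time, row.data)
```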

Monitoring your cross-cloud import

Monitoring your data ingestion pipeline is essential for ensuring smooth operations. We recently added three new metrics to Pub/Sub that let you verify the health of an import topic and understand its performance: the byte count, the message count, and the topic’s state. If the state is anything other than “ACTIVE”, ingestion is blocked, for example by a misconfiguration, a missing stream, or a missing consumer. Refer to the official documentation for a comprehensive list of potential error states and their troubleshooting steps. These metrics can be conveniently accessed from the topic’s detail page, where you can see your topic’s state (“ACTIVE”), its throughput, and its messages per second:

https://storage.googleapis.com/gweb-cloudblog-publish/images/6_dashboards.max-1100x1100.png
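You can also read these metrics programmatically through the Cloud Monitoring API. The sketch below assumes the ingestion byte count metric is named `pubsub.googleapis.com/topic/ingestion_byte_count`; double-check the exact metric names in the Pub/Sub monitoring documentation before relying on them:

```python
# Sketch: reading an import topic's ingestion throughput from Cloud Monitoring.
# The metric type below is an assumption; verify it against the docs.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(time.time())},
        "start_time": {"seconds": int(time.time()) - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'metric.type = "pubsub.googleapis.com/topic/ingestion_byte_count" '
            'AND resource.labels.topic_id = "my-import-topic"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```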

Conclusion

Running in multiple cloud environments has become standard operating procedure for many companies. Even when using different clouds for different parts of your business, you should still be able to take advantage of the best products each has to offer, which may mean moving data among them. Pub/Sub now makes it easy to stream data into Google Cloud from AWS. To get started, visit Pub/Sub in the Google Cloud console or sign up for a free trial today.
