Read from a Pub/Sub streaming source

Cloud Data Fusion supports Pub/Sub sources in streaming data pipelines.

Before you begin

Roles and permissions

To get the permissions that you need to read from a Pub/Sub streaming source, ask your administrator to grant you the Pub/Sub Editor (roles/pubsub.editor) IAM role on the service account used to access the Pub/Sub subscription. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to read from a Pub/Sub streaming source. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to read from a Pub/Sub streaming source:

  • pubsub.snapshots.create
  • pubsub.snapshots.delete
  • pubsub.snapshots.seek
  • pubsub.subscriptions.consume
  • pubsub.topics.attachSubscription

You might also be able to get these permissions with custom roles or other predefined roles.

Grant the role on the service account that you specified in the plugin properties for accessing Pub/Sub. If no service account is specified, grant the role on the Dataproc service account.

For more information about granting roles, see Manage access.
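
One way to confirm that the service account holds these permissions is to call testIamPermissions on the project with that service account's credentials (for example, by pointing GOOGLE_APPLICATION_CREDENTIALS at its key file). The following Python sketch uses the Resource Manager client library; the project ID is a placeholder, and the sketch assumes the role was granted at the project level.

    from google.cloud import resourcemanager_v3
    from google.iam.v1 import iam_policy_pb2

    # Placeholder: replace with the project that contains the Pub/Sub subscription.
    project_id = "my-project"

    # Permissions used by the Pub/Sub streaming source.
    required_permissions = [
        "pubsub.snapshots.create",
        "pubsub.snapshots.delete",
        "pubsub.snapshots.seek",
        "pubsub.subscriptions.consume",
        "pubsub.topics.attachSubscription",
    ]

    client = resourcemanager_v3.ProjectsClient()

    # Returns only the subset of permissions that the caller actually holds
    # on the project.
    response = client.test_iam_permissions(
        request=iam_policy_pb2.TestIamPermissionsRequest(
            resource=f"projects/{project_id}",
            permissions=required_permissions,
        )
    )

    missing = set(required_permissions) - set(response.permissions)
    print("Missing permissions:", sorted(missing) if missing else "none")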

Add a Pub/Sub source to your streaming data pipeline

  1. Go to your instance:

    1. In the Google Cloud console, go to the Cloud Data Fusion page.

    2. To open the instance in the Cloud Data Fusion web interface, click Instances, and then click View instance.

  2. In the Cloud Data Fusion web interface, click Studio.

  3. Select Data Pipeline - Realtime.

  4. In the Source menu, select Pub/Sub. A Pub/Sub streaming source node appears in the pipeline.

  5. On the Pub/Sub node, click Properties to configure the source. For more information, see Pub/Sub Streaming Source.
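
After you configure and deploy the source, one way to verify it end to end is to publish a few test messages to the topic that the configured subscription is attached to. The following Python sketch uses the google-cloud-pubsub client library; the project and topic names are placeholders.

    from google.cloud import pubsub_v1

    # Placeholders: replace with your project and the topic that the
    # source's subscription is attached to.
    project_id = "my-project"
    topic_id = "my-topic"

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    for i in range(5):
        data = f"test message {i}".encode("utf-8")
        # publish() returns a future that resolves to the server-assigned
        # message ID once the message is accepted.
        future = publisher.publish(topic_path, data)
        print("Published message ID:", future.result())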

Support for a single Pub/Sub source with no Windower plugins

Cloud Data Fusion version 6.9.1 supports real-time pipelines with a single Pub/Sub streaming source and no Windower plugins.

  • The Pub/Sub streaming source has built-in support for at-least-once processing, so enabling Spark checkpointing isn't required.
  • The Pub/Sub streaming source creates a Pub/Sub snapshot at the beginning of each batch and removes it at the end of each batch, as illustrated in the sketch after this list.
  • Creating Pub/Sub snapshots has a cost associated with it. For more information, see Pub/Sub pricing.
  • You can monitor snapshot creation in Cloud Audit Logs.
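
You don't manage these snapshots yourself, but the following Python sketch shows the equivalent Pub/Sub operations that the source relies on, which is also why the snapshot permissions listed earlier are required. It isn't the plugin's actual implementation, and the resource names are placeholders.

    from google.cloud import pubsub_v1

    # Placeholders: replace with your project, subscription, and a snapshot name.
    project_id = "my-project"
    subscription_id = "my-subscription"
    snapshot_id = "my-batch-snapshot"

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)
    snapshot_path = subscriber.snapshot_path(project_id, snapshot_id)

    # Capture the subscription's acknowledgment state before the batch starts.
    subscriber.create_snapshot(
        request={"name": snapshot_path, "subscription": subscription_path}
    )

    # ... the batch is processed here. If processing fails, the subscription
    # can be rewound to the snapshot so the same messages are redelivered,
    # which is what gives at-least-once processing:
    # subscriber.seek(
    #     request={"subscription": subscription_path, "snapshot": snapshot_path}
    # )

    # Remove the snapshot once the batch completes successfully.
    subscriber.delete_snapshot(request={"snapshot": snapshot_path})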

Upgrade a pipeline with a Pub/Sub streaming source

Cloud Data Fusion supports direct application upgrades for streaming pipelines with a Pub/Sub streaming source created in version 6.9.1 or later.

Cloud Data Fusion doesn't support upgrades for data pipelines with a Pub/Sub streaming source in version 6.9.0 or earlier. Instead, upgrade those pipelines to 6.9.1:

  1. Stop publishing data to the topic before the planned instance upgrade.
  2. Wait for the pipeline to finish processing the published data. To confirm that the subscription backlog has drained, see the sketch after these steps.
  3. After the data is processed completely, stop the pipeline.
  4. Upgrade the instance.
  5. Duplicate the existing pipeline and update to the newest plugins.
  6. Deploy the pipeline.
  7. Run the new pipeline to read data.

    The new pipeline version automatically uses Pub/Sub snapshots instead of Spark checkpointing.

  8. Delete the old pipeline.
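
For step 2, one way to confirm that the pipeline has processed the published data is to check that the subscription's num_undelivered_messages metric has dropped to zero in Cloud Monitoring. The following Python sketch uses the google-cloud-monitoring client library; the project and subscription names are placeholders.

    import time

    from google.cloud import monitoring_v3

    # Placeholders: replace with your project and the source's subscription.
    project_id = "my-project"
    subscription_id = "my-subscription"

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    # Look at the last five minutes of the backlog metric.
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 300}}
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": (
                'metric.type = '
                '"pubsub.googleapis.com/subscription/num_undelivered_messages" '
                f'AND resource.labels.subscription_id = "{subscription_id}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        # Points are returned newest first; a value of 0 means the backlog
        # has drained.
        latest = series.points[0]
        print("Undelivered messages:", latest.value.int64_value)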

What's next