Destination

Overview

The stream destination for Datastream is Cloud Storage.

In this section, you learn about:

  • Setting up this stream destination
  • Specifying the destination bucket and prefix, which determine where the data is written
  • Behavior associated with data written to the destination

Setup

Enabling the Datastream API is sufficient when the destination Cloud Storage bucket is in the same project as Datastream.

If you're writing to a bucket in a different project, then you must give the service account associated with Datastream permission to access the bucket.

  1. Obtain the email address used for the service account. To do so, find the Project number on the Cloud Console home page. The email address of the service account is service-[project_number]@gcp-sa-datastream.iam.gserviceaccount.com.
  2. Grant this email address the roles required to write to the destination bucket (typically roles/storage.objectViewer, roles/storage.objectCreator, and roles/storage.legacyBucketReader).
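
The two steps above can be sketched in Python. This is a minimal illustration, not an official client: the function names and the sample project number are hypothetical, and the output is only the standard IAM binding shape (a role plus a list of members) that you would then apply to the bucket with your tool of choice.

```python
# Roles listed in step 2 as typically required on the destination bucket.
REQUIRED_ROLES = (
    "roles/storage.objectViewer",
    "roles/storage.objectCreator",
    "roles/storage.legacyBucketReader",
)


def datastream_service_account(project_number: str) -> str:
    """Build the Datastream service account email from a project number."""
    project_number = str(project_number).strip()
    # The address uses the numeric project number, not the project ID.
    if not project_number.isdigit():
        raise ValueError("expected a numeric project number")
    return f"service-{project_number}@gcp-sa-datastream.iam.gserviceaccount.com"


def bucket_bindings(project_number: str) -> list:
    """Return one IAM binding per required role for the service account."""
    member = f"serviceAccount:{datastream_service_account(project_number)}"
    return [{"role": role, "members": [member]} for role in REQUIRED_ROLES]


if __name__ == "__main__":
    # "123456789012" is a made-up project number used only for illustration.
    for binding in bucket_bindings("123456789012"):
        print(binding["role"], binding["members"][0])
```

The project number here is the one for the project where Datastream runs, and the bindings are granted on the bucket in the other project.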

Specify the destination bucket and prefix

Provide the project name, bucket name, and optional file prefix in the stream configuration to determine where the data is written.

Write behavior

  • The data for a given stream is written under the provided bucket and file prefix at: [bucket]/[prefix]/[object name]/yyyy/mm/dd/hh/mm/[filename(idempotent)].
  • The object name in the path for database sources is the schema name followed by the table name (separated by an underscore '_').
  • The timestamp in the path (hh/mm) is the source timestamp from the event metadata.
    • For backfill events, the timestamp represents the time when the backfill task was started (typically when the stream was started, or when backfill was manually triggered).
    • For CDC events, the timestamp represents when the record was changed on the source.
  • Files are rotated whenever either the maximum file size or maximum timeout is reached, whichever comes first.
  • Additionally, files are rotated each time a source schema change is detected (for example, when a new column is added).
  • There can be multiple files within the same path for the same minute.
  • You can consume the data by using the Cloud Storage API.
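
As a rough illustration of the path layout and rotation rules listed above, the following Python sketch computes the directory portion of an object path and decides when a file would be rotated. It is not Datastream's implementation. The function and parameter names are hypothetical, and the sketch assumes that an empty prefix simply drops that path segment. The filename itself is generated by Datastream and is not modeled here.

```python
from datetime import datetime, timezone


def object_directory(bucket: str, prefix: str, schema: str, table: str,
                     source_timestamp: datetime) -> str:
    """Return [bucket]/[prefix]/[object name]/yyyy/mm/dd/hh/mm for one event."""
    # For database sources, the object name is schema and table joined by "_".
    object_name = f"{schema}_{table}"
    parts = [bucket]
    if prefix:  # assumption: an omitted prefix leaves no empty segment
        parts.append(prefix.strip("/"))
    parts.append(object_name)
    # Year, month, day, hour, and minute come from the event's source timestamp.
    parts.append(source_timestamp.strftime("%Y/%m/%d/%H/%M"))
    return "/".join(parts)


def should_rotate(size_bytes: int, age_seconds: float, schema_changed: bool,
                  max_size_bytes: int, max_age_seconds: float) -> bool:
    """Rotate on a schema change, or when size or timeout is reached first."""
    return (schema_changed
            or size_bytes >= max_size_bytes
            or age_seconds >= max_age_seconds)


if __name__ == "__main__":
    ts = datetime(2021, 3, 5, 14, 7, tzinfo=timezone.utc)
    print(object_directory("my-bucket", "cdc", "hr", "employees", ts))
```

Because several files can land in the same minute directory, a consumer would typically list every object under the directory rather than expect a single file.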

In some cases, most commonly when a stream first starts running, backfill and CDC overlap, which results in duplicate events. This happens when Datastream starts reading from the current database log file (which was created before the stream was started), and this log file includes some events that are also captured by the backfill. This is expected behavior, and you can eliminate duplicate events by using the event metadata, if necessary.
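
One possible way to drop such duplicates on the consumer side is sketched below in Python. This section does not say which metadata fields identify a duplicate, so the key is left to the caller. The field names used in the example ("object" and "row_id") are placeholders invented for illustration, not actual Datastream metadata fields.

```python
from typing import Callable, Hashable, Iterable, List


def deduplicate(events: Iterable[dict],
                key: Callable[[dict], Hashable]) -> List[dict]:
    """Keep the first event seen for each key, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        k = key(event)
        if k in seen:
            continue  # same key already emitted, for example by the backfill
        seen.add(k)
        unique.append(event)
    return unique


if __name__ == "__main__":
    # Placeholder events: the same row captured once by backfill and once by CDC.
    events = [
        {"object": "hr_employees", "row_id": 1, "origin": "backfill"},
        {"object": "hr_employees", "row_id": 1, "origin": "cdc"},
        {"object": "hr_employees", "row_id": 2, "origin": "cdc"},
    ]
    for e in deduplicate(events, key=lambda e: (e["object"], e["row_id"])):
        print(e)
```

This keeps the whole set of keys in memory, so for large volumes the same idea is usually expressed in the downstream query engine instead.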