Create a stream

Overview

In this section, you learn how to create a stream. Datastream uses this stream to transfer data from a source Oracle, MySQL, SQL Server or PostgreSQL database into BigQuery or Cloud Storage.

Creating a stream includes:

Defining settings for the stream.
Selecting the connection profile that you created for your source Oracle, MySQL, SQL Server, or PostgreSQL database (the source connection profile), or creating a source connection profile if you haven't created one.
Configuring information about the source database for the stream by specifying the tables and schemas in the source database that Datastream:
- Can transfer into the destination.
- Is restricted from transferring into the destination.
Determining whether Datastream will backfill historical data, as well as stream ongoing changes into the destination, or stream only changes to the data. When enabling historical backfill, you can optionally specify schemas and tables in the source database that Datastream shouldn't backfill into the destination.
Selecting the connection profile that you created for BigQuery or Cloud Storage (the destination connection profile), or creating a destination connection profile if you haven't created one.
Configuring information about the destination for the stream. This information includes:
- For BigQuery:
  - The datasets into which Datastream will replicate schemas, tables, and data from a source database.
- For Cloud Storage:
  - The folder of the destination bucket into which Datastream will transfer schemas, tables, and data from a source database.
Validating the stream to ensure that it will run successfully. Validating a stream checks:
- Whether the source is configured properly to allow Datastream to stream data from it.
- Whether the stream can connect to both the source and the destination.
- The end-to-end configuration of the stream.

Define settings for the stream

Go to the Streams page in the Google Cloud Console.

Go to the Streams page
Click CREATE STREAM.

Use the following table to populate the fields of the Define stream details section of the Create stream page:

Field	Description
Stream name	Enter the display name of the stream.
Stream ID	Datastream populates this field automatically based on the stream name that you enter. You can keep the ID that's auto-generated or change it.
Region	Select the region where the stream is stored. Streams, like all resources, are saved in a region. Region selection doesn't impact whether your stream can connect to the source database or the destination, but can impact availability if the region experiences downtime.
Source type	Select the profile type that you specified when you created a connection profile for a source Oracle, MySQL, SQL Server, or PostgreSQL database. Or, if you haven't created a connection profile for the source database, you can create one now.
Destination type	Select the profile type that you specified when you created a connection profile for a BigQuery or Cloud Storage destination. Or, if you haven't created a connection profile for the destination, you can create one now.
Encryption	By default, your data is encrypted with a key that's managed by Google Cloud. If you want to manage your encryption, then you can use a customer-managed encryption key (CMEK): Select the Cloud KMS key checkbox. From the Key type drop-down menu, select Cloud KMS and then select your CMEK. If you don't see your key, then click ENTER KEY RESOURCE NAME to provide the resource name of the key that you want to use. For example, you can enter `projects/my-project-name/locations/my-location/keyRings/my-keyring/cryptoKeys/my-key` in the Key resource name field, and then click SAVE. As part of creating the stream, Datastream will verify that the CMEK exists, and that Datastream has permissions to use the key. For more information about creating a CMEK or giving Datastream permissions for the key, see Use customer-managed encryption keys (CMEK).

Optionally, use labels to organize your Datastream resources.
1. To create a label, click ADD LABEL, and then enter the key-value pair for the label.
2. To remove the label, click the trashcan icon to the right of the row that contains the label.
Optionally, add alerting policies for your stream. Alerting policies define when and how you want to be notified about stream failures.
1. To create an alerting policy, click Add alerting policy.
2. The Create alerting policy page appears in Cloud Monitoring. On this page, you define the alerting policy in case your stream fails.
For more information about alerting policies, see Managing metric-based alerting policies.
Review the required prerequisites that are generated automatically to reflect how the environment must be prepared for a stream. These prerequisites can include how to configure the source database and how to connect it to the destination. It's best to complete these prerequisites at this step, but you can complete them at any time before you test the stream or start it. For more information about these prerequisites, see Sources.
Click CONTINUE. The Define connection profile panel of the Create stream page appears for your source database type.

Note: If your source database type is Oracle, then the panel name is Define Oracle connection profile. If you select a different source database, then that database name appears in the panel name.

Specify information about the source connection profile

If you have created a source connection profile for Oracle, MySQL, SQL Server, or PostgreSQL, then select it from the list of connection profiles.

If you haven't created a source connection profile, then create one by clicking CREATE CONNECTION PROFILE at the bottom of the drop-down list, and then perform the same steps as in Create connection profiles.
Click RUN TEST to verify that the source database and Datastream can communicate with each other.

If the test fails, then the issue associated with the connection profile appears. Refer to the Diagnose issues page for troubleshooting steps. Make the necessary changes to correct the issue, and then retest.
Click CONTINUE. The Configure stream source panel of the Create stream page appears.

Configure information about the source database for the stream

Configure PostgreSQL source database

If you selected PostgreSQL as your source database, you need to define its replication properties.

In the Replication properties section, specify the following properties:
1. In the Replication slot name field, enter the name of the slot you created specifically for this stream. The database server uses this slot to send events to Datastream.
2. In the Publication name field, enter the name of the publication you created in the database. A publication is a group of all tables that you want to replicate changes from using this stream.
Follow the steps described in the Configure source databases section.

Configure SQL Server source database

If you selected SQL Server as your source database, you need to define the CDC method.

In the CDC method drop-down menu, select the CDC method that matches your source database configuration:
- Transaction logs: select this method to process changes directly from the database logs. This method provides the best performance and is more efficient, but requires additional configuration steps.
- Change tables: select this method to process changes from dedicated change tables. This method is easier to configure and has fewer limitations, but supports lower throughput and produces higher load on your database than the transaction logs method.
For more information about configuring your source SQL Server database, see Configure a source SQL Server database.
Follow the steps described in the Configure source databases section.

Configure source databases

In the Select objects to include section, use the Objects to include drop-down menu to specify the tables and schemas in the source database that Datastream can transfer into the destination.
The menu fails to load the list of tables and schemas if your database has more than 5,000 objects.
1. If you want Datastream to transfer all tables and schemas, then select All tables from all schemas.
2. If you want Datastream to transfer only specific tables and schemas, select Specific schemas and tables, and then select the checkboxes for the schemas and tables that you want Datastream to pull.
  For each schema that you select, there's a Future tables checkbox.
  - If this checkbox is selected, then any tables for the schema that you add to the source after the stream is created will be transferred into the destination automatically.
  - If this checkbox isn't selected, then after the stream is created, Datastream won't transfer any new tables for the schema from the source into the destination.
  By default, the Future tables checkbox is selected for each schema.
  For each schema table that you select, there's a SELECT COLUMNS button. Click this button to filter any columns of the table that you want to exclude from being streamed from the source into the destination. Any columns that you clear are excluded from the stream.
  
  By default, the checkboxes for all columns of the table are selected.
3. If you want to provide a textual definition of the tables and schemas that you want Datastream to transfer, select Custom, and then, in the Object matching criteria field, enter the schemas and tables that you want Datastream to pull.
  
  If your database has large numbers of tables and schemas, we recommend using the Custom option because some tables and schemas might not be included in the list of objects to pull.
  Enter schemas and tables as a comma-separated list, using the [schema].[table] format, where both the schema and the table can be a wildcard (asterisk). You can't use partial wildcards.
  
  Also, if Oracle is your source database, and unless specified explicitly, then Oracle's predefined schemas and tables won't be pulled into the destination.
Optionally, expand the Select objects to exclude node. In the Objects to exclude field, enter the tables and schemas in the source database that you want to restrict Datastream from pulling. The Objects to exclude list takes precedence over the Objects to include list. If an object meets the criteria in both the include and exclude lists, the object is excluded from the stream.
You define these schemas and tables with the same comma-separated logic that you use if you're specifying schemas and tables that you want Datastream to pull.
Optionally, expand the Choose backfill mode for historical data node, and then make one of the following selections:
1. Select the Automatic option to stream all existing data, in addition to changes to the data, from the source into the destination.
  
  In the Objects excluded from automatic backfill field, enter the tables and schemas in the source database that you want to restrict Datastream from backfilling into the destination.
2. Select the Manual option to stream only changes to the data into the destination.

Click CONTINUE. The Define connection profile panel of the Create stream page appears for your destination type.

Note: If your destination type is BigQuery, then the panel name is Define BigQuery connection profile. If your destination type is Cloud Storage, then Define Cloud Storage connection profile appears as the panel name.

Select a destination connection profile

If you have created a destination connection profile, then select it from the list of connection profiles.

If you haven't created a destination connection profile, then create one by clicking CREATE CONNECTION PROFILE at the bottom of the drop-down list, and then perform the same steps as in Create connection profiles.
Click CONTINUE. The Configure stream destination panel of the Create stream page appears.

Configure information about the destination for the stream

BigQuery destination

Under Configure the connection from Datastream to BigQuery, specify how Datastream should stream into a BigQuery dataset. Select one of the following options:
- Dataset for each schema: Datastream creates a BigQuery dataset for each source schema, based on the schema name.
  
  If you select this option, then for every schema in the source database, Datastream creates a dataset in the project that contains the stream.
  
  In addition, when selecting the Dataset for each schema option:
  1. In the Location field, enter the region where you want the new dataset to be created.
  2. (Optional) In the Prefix field, define a prefix for the dataset created by the stream. The string that you specify is added to the source schema name. For example, if the schema is called <mySchema>, and you define the dataset prefix as <myPrefix>, Datastream creates a dataset called <myPrefix><mySchema>.
  3. Specify how you want your data to be encrypted in BigQuery by selecting a Google-managed or customer-managed encryption key.
- Single dataset for all schemas: Datastream creates all the tables inside the existing BigQuery dataset that you specify. Each table that Datastream creates is named using a combination of the source schema name and table name, separated by an underscore (for example, <mySchemaName>_<myTableName>).
Specify the Stream write mode. Select your write mode to define how data is written and processed in BigQuery:
- Merge: select this option if you want your data to be synchronised with the source.
- Append-only: select this option if you want to keep all historical data for your change events.
Specify the limit of data staleness. Only available when Merge mode is selected in the preceding step. Select a limit to balance BigQuery query performance and cost versus data freshness. BigQuery applies the changes in the background on an ongoing basis, or at query run time, according to the staleness limit. Lower staleness (fresher data) may increase BigQuery processing costs.
If you select 0 minutes, BigQuery always returns the query results based on the freshest data. This results in the highest cost of the query.
Click CONTINUE. The Review stream details and create panel of the Create stream page appears.

Cloud Storage destination

Optionally, in the Stream path prefix field, enter the folder of the destination bucket into which Datastream will transfer schemas, tables, and data from a source database.

For example, if you want Datastream to transfer data from the source database into the /root/example folder in the destination bucket, then enter /root/example in the Stream path prefix field.
In the Output format field, choose the format of files written to Cloud Storage.
Datastream supports two output formats: Avro and JSON.
Optional. If you select the JSON format, then two checkboxes appear:
1. Include a Unified Types schema file in the file path: If you select this checkbox, then Datastream writes two files to Cloud Storage: a JSON data file and an Avro schema file. The schema file has the same name as the data file, with a .schema extension. If you don't select this checkbox, then Datastream writes only the JSON data file to Cloud Storage. By default, this checkbox isn't selected.
2. Enable gzip compression: If you select this checkbox, then Datastream uses the gzip utility to compress the files that Datastream writes to Cloud Storage. If you don't select this checkbox, then Datastream writes the files to Cloud Storage without compressing them. By default, this checkbox is selected.
Click CONTINUE. The Review stream details and create panel of the Create stream page appears.

Create the stream

Verify details about the stream as well as the source and destination connection profiles that the stream will use to transfer data from a source database into a destination.

Click RUN VALIDATION to validate the stream. By validating a stream, Datastream checks that the source is configured properly, validates that the stream can connect to both the source and the destination, and verifies the end-to-end configuration of the stream.

If your source database is Oracle, then Datastream performs the following checks:

Check	Description
Objects list validity	Datastream verifies that the exclude list doesn't subsume the include list.
Forward SSH tunnel connectivity	Datastream verifies that it can connect to a bastion host through a Forward SSH tunnel.
Connectivity to Oracle database	Datastream verifies that it can connect to the source Oracle database.
Oracle user permissions	Datastream verifies that the user which it uses to connect to the source database has all of the necessary permissions to retrieve schemas, tables, and data from the database so that Datastream can stream this information into the destination.
Logging mode configuration	Datastream verifies that the logging mode for Oracle database is set to ARCHIVELOG.
Supplemental logging configuration	Datastream verifies that supplemental logging is turned on for the database tables that are being streamed from the source into the destination.
Archive log files configuration	Datastream verifies that archive logging is configured in the source, and that archive log files are present.
Cloud Storage permissions (for Cloud Storage destinations)	Datastream verifies that it has the necessary permissions to write to the destination bucket in Cloud Storage.

If your source database is MySQL, then Datastream performs the following checks:

Check	Description
Objects list validity	Datastream verifies that the exclude list doesn't subsume the include list.
Forward SSH tunnel connectivity	Datastream verifies that it can connect to a bastion host through a Forward SSH tunnel.
Connectivity to MySQL database	Datastream verifies that it can connect to the source MySQL database.
Binary logging is enabled	Datastream verifies that the binary log files are configured correctly, and that there are log files present.
Binary log format configuration	Datastream verifies that the binary log format of the MySQL database is set to `ROW`.
Replication permissions	Datastream verifies that it has permissions to replicate a MySQL database.
Cloud Storage permissions (for Cloud Storage destinations)	Datastream verifies that it has the necessary permissions to write to the destination bucket in Cloud Storage.

If your source database is PostgreSQL, then Datastream performs the following checks:

Check	Description
Connectivity to PostgreSQL database	Datastream verifies that it can connect to the source PostgreSQL database.
Logical decoding is enabled	Datastream verifies that `wal_level` parameter of the PostgreSQL database is set to `logical`.
Replication slot configuration	Datastream verifies whether the PostgreSQL replication slot exists and is active, whether Datastream has the permissions required to access it, and whether it is configured correctly.
Publication configuration	Datastream verifies that the publication exists and that the tables to replicate for the stream are included in the publication.
Backfill permissions	Datastream verifies that tables required to backfill historical data exist and that it has the necessary permissions to read them.

If your source database is SQL Server, then Datastream performs the following checks:

Check	Description
SQL Server edition	Datastream verifies that the source SQL Server database edition is supported. For information about supported editions, see Versions.
SQL Server user permissions	Datastream verifies that the user has the permissions required to access the source SQL Server database.
SQL Server CDC is enabled	Datastream verifies whether CDC is enabled for the database and all tables included in the stream.
SQL Server database validations	Datastream checks whether the source database is configured properly, whether all replication requirements are met and whether there are no unsupported data types in the columns.
Cloud Storage or BigQuery permissions	Datastream verifies that the user has the necessary permissions to write to the destination.

After all validation checks pass, click CREATE & START to create the stream and start it immediately, or CREATE to create the stream without starting it immediately.

If you don't start the stream now, then you can start it from the Streams page by clicking START.

Note: The stream creation process may incur an increased load on the source database. To put off that load, create the stream without starting it, and come back to start the stream when the load can be incurred.

After creating a stream, you can view high-level and detailed information about it.

Events and streams

Run a stream