Use Datastream to ingest data into partitioned tables in BigQuery

You might need to partition your BigQuery tables into smaller segments to improve query performance and control costs. Because Datastream doesn't support partitioning tables in BigQuery, you need to manually add the partitions before starting your stream. For general information about partitioning in BigQuery, see Introduction to partitioned tables.

Partition tables in BigQuery

To partition your tables in BigQuery, use one of the options described in the sections that follow, depending on your use case.

Option 1: The table already exists in BigQuery and is included in a stream

  1. Exclude the table from the source configuration of your stream. For more information about including and excluding objects from your source configuration, see Configure source databases.
  2. Wait a few minutes to ensure that Datastream has completed processing all events for the table.
  3. Create your partitioned table in BigQuery. If you want to keep the data that was already in the original BigQuery table, give the table a different, temporary name.
  4. Copy the data from the original table to the new partitioned table.
  5. Drop or rename the original table.
  6. Change the temporary name of the new table to the name of the original table.
  7. Add the source table to the configuration of your stream.

Option 2: The table doesn't exist in BigQuery

  1. Create the table in BigQuery using one of the following approaches:

    • Use the BigQuery Migration Toolkit.
    • Manually create a Datastream-compatible BigQuery table. For example, if you want to create a table and partition the data based on the TIMESTAMP column, you can use a query similar to the following:

      CREATE TABLE dataset.partitioned_table (
      'id' INT64,
      'name' STRING
      'update_date' DATETIME,
      'datastream_metadata' STRUCT<'uuid' STRING, 'source_timestamp' INT64>,
      PRIMARY KEY ('id') NOT ENFORCED
      )
      PARTITION BY TIMESTAMP(update_date)
      
  2. After you create the partitioned table, make sure that its max_staleness value is set according to your requirements. If you don't set the value, the default value of 0 is set. For more information, see Use BigQuery tables with the max_staleness option.

  3. Add the source table to the configuration of your stream.

  4. Optionally, if you've set manual backfill for the stream, initiate backfill for the table.