Spanner to BigQuery template

The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.

Pipeline requirements

  • The source Spanner table must exist prior to running the pipeline.
  • The BigQuery dataset must exist prior to running the pipeline.
  • A JSON file that describes your BigQuery schema.

    The file must contain a top-level JSON array titled fields. The contents of the fields array must use the following pattern:
    {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

    The following JSON describes an example BigQuery schema:

    {
      "fields": [
        {
          "name": "location",
          "type": "STRING"
        },
        {
          "name": "name",
          "type": "STRING"
        },
        {
          "name": "age",
          "type": "STRING"
        },
        {
          "name": "color",
          "type": "STRING"
        },
        {
          "name": "coffee",
          "type": "STRING"
        }
      ]
    }
    

    The Spanner to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.

Template parameters

Parameter Description
spannerInstanceId The instance ID of the Spanner database to read from.
spannerDatabaseId The database ID of the Spanner database to export.
spannerTableId The table name of the Spanner database to export.
sqlQuery The SQL query to use to read data from the Spanner database.
outputTableSpec The BigQuery table to write to. For example, <project>:<dataset>.<table_name>.
spannerProjectId Optional: The ID of the project that the Spanner database resides in. The default value for this parameter is the project where the Dataflow pipeline is running.
spannerRpcPriority Optional: The request priority for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is HIGH.
bigQuerySchemaPath Optional: The Cloud Storage path (gs://) to the JSON file that defines your BigQuery schema. For example, gs://path/to/my/schema.json.
writeDisposition Optional: The BigQuery WriteDisposition. Supported values are WRITE_APPEND, WRITE_EMPTY, and WRITE_TRUNCATE. The default value is WRITE_APPEND.
createDisposition Optional: The BigQuery CreateDisposition. Supported values are CREATE_IF_NEEDED and CREATE_NEVER. The default value is CREATE_IF_NEEDED.
useStorageWriteApi Optional: If true, the pipeline uses the BigQuery Storage Write API. The default value is false. For more information, see Using the Storage Write API.
useStorageWriteApiAtLeastOnce Optional: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics, set this parameter to true. To use exactly-once semantics, set the parameter to false. This parameter applies only when useStorageWriteApi is true. The default value is false.

Run the template

Console

  1. Go to the Dataflow Create job from template page.
  2. Go to Create job from template
  3. In the Job name field, enter a unique job name.
  4. Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.

    For a list of regions where you can run a Dataflow job, see Dataflow locations.

  5. From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
  6. In the provided parameter fields, enter your parameter values.
  7. Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       spannerInstanceId=SPANNER_INSTANCE_ID,\
       spannerDatabaseId=SPANNER_DATABASE_ID,\
       spannerTableId=SPANNER_TABLE_ID,\
       sqlQuery=SQL_QUERY,\
       outputTableSpec=OUTPUT_TABLE_SPEC,\

Replace the following:

  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_INSTANCE_ID: the Spanner instance ID
  • SPANNER_DATABASE_ID: the Spanner database ID
  • SPANNER_TABLE_ID: the Spanner table name
  • SQL_QUERY: the SQL query
  • OUTPUT_TABLE_SPEC: the BigQuery table location

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launchParameter": {
     "jobName": "JOB_NAME",
     "parameters": {
       "spannerInstanceId": "SPANNER_INSTANCE_ID",
       "spannerDatabaseId": "SPANNER_DATABASE_ID",
       "spannerTableId": "SPANNER_TABLE_ID",
       "sqlQuery": "SQL_QUERY",
       "outputTableSpec": "OUTPUT_TABLE_SPEC",
     },
     "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
     "environment": { "maxWorkers": "10" }
  }
}

Replace the following:

  • PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
  • JOB_NAME: a unique job name of your choice
  • VERSION: the version of the template that you want to use

    You can use the following values:

  • LOCATION: the region where you want to deploy your Dataflow job—for example, us-central1
  • SPANNER_INSTANCE_ID: the Spanner instance ID
  • SPANNER_DATABASE_ID: the Spanner database ID
  • SPANNER_TABLE_ID: the Spanner table name
  • SQL_QUERY: the SQL query
  • OUTPUT_TABLE_SPEC: the BigQuery table location

What's next