Using Dataflow SQL

This page explains how to use Dataflow SQL and create Dataflow SQL jobs.

To create a Dataflow SQL job, write and run a Dataflow SQL query.

Using the Dataflow SQL UI

The Dataflow SQL UI is a setting in the BigQuery web UI for creating Dataflow SQL jobs.

You can access the Dataflow SQL UI from the BigQuery web UI.

  1. Go to the BigQuery web UI.

  2. Switch to the Cloud Dataflow engine.

    a. Click the More drop-down menu and select Query settings.

    b. In the Query settings menu, select Dataflow engine.

    c. If the Dataflow and Data Catalog APIs are not enabled, a prompt appears; click Enable APIs. (You can also enable these APIs from the command line, as shown after these steps.)

    d. Click Save.

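If you prefer to enable the required APIs from the command line instead of using the Enable APIs prompt, you can use the gcloud services enable command. The following is a minimal sketch; run it in the project you plan to use for Dataflow SQL:

# Enable the APIs that the Dataflow engine setting relies on.
gcloud services enable \
  dataflow.googleapis.com \
  datacatalog.googleapis.com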

You can also access the Dataflow SQL UI from the Dataflow monitoring interface.

  1. Go to the Dataflow monitoring interface.

  2. Click Create job from SQL.

Writing Dataflow SQL queries

Dataflow SQL queries use the Dataflow SQL query syntax, which is similar to BigQuery standard SQL.

You can use the Dataflow SQL streaming extensions to aggregate data from continuously updating Dataflow sources like Pub/Sub.

For example, the following query counts, in one-minute intervals, the passengers picked up in a Pub/Sub stream of taxi rides:

SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE
  ride_status = "pickup"
GROUP BY
  TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')

Running Dataflow SQL queries

When you run a Dataflow SQL query, Dataflow turns the query into an Apache Beam pipeline and executes the pipeline.

You can run a Dataflow SQL query using the Cloud Console or gcloud command-line tool.

Console

To run a Dataflow SQL query, use the Dataflow SQL UI.

  1. Go to the Dataflow SQL UI.

  2. Enter the Dataflow SQL query into the query editor.

  3. Click Create Cloud Dataflow job to open a panel of job options.

  4. (Optional) Click Show optional parameters and set Dataflow pipeline options.

  5. In the Destination section of the panel, select an Output type.

  6. Click Create.

gcloud

To run a Dataflow SQL query, use the gcloud dataflow sql query command.

gcloud dataflow sql query \
  --job-name=job-name \
  --region=region \
  --bigquery-table=table-name \
  --bigquery-dataset=destination-dataset \
'query'
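
For example, the following command runs the taxi-ride query shown earlier on this page and writes the results to a BigQuery table. The job name, region, dataset, and table are hypothetical placeholders; replace them with your own values:

# Placeholder job name and destination; the query is the taxi-ride example above.
gcloud dataflow sql query \
  --job-name=dfsql-taxi-pickups \
  --region=us-central1 \
  --bigquery-dataset=taxi_rides \
  --bigquery-table=passenger_counts \
'SELECT
   TUMBLE_START("INTERVAL 1 MINUTE") AS period_start,
   SUM(passenger_count) AS pickup_count
 FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
 WHERE ride_status = "pickup"
 GROUP BY TUMBLE(event_timestamp, "INTERVAL 1 MINUTE")'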

For more information about querying data and writing Dataflow SQL query results, see Using data sources and destinations.

Setting pipeline options

You can set Dataflow pipeline options for Dataflow SQL jobs. Dataflow pipeline options are execution parameters that configure how and where to run Dataflow SQL queries.

To set Dataflow pipeline options for Dataflow SQL jobs, specify the following parameters when you run a Dataflow SQL query.

Console

Parameter Type Description Default value
Regional endpoint String The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow regional endpoint. If not set, defaults to us-central1.
Max workers int The maximum number of Compute Engine instances available to your pipeline during execution. If unspecified, the Dataflow service automatically determines an appropriate number of workers.
Worker region String The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow regional endpoint. If not set, defaults to the specified Dataflow regional endpoint.
Worker zone String The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow regional endpoint. If not set, defaults to a zone in the worker region. If the worker region is not set, defaults to a zone in the specified Dataflow regional endpoint.
Service account email String The email address of the controller service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. If not set, Dataflow workers use the Compute Engine service account of the current project as the controller service account.
Machine type String The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. If not set, Dataflow automatically chooses the machine type.
Additional experiments String The experiments to enable. An experiment can be a value, like enable_streaming_engine, or a key-value pair, such as shuffle_mode=service. The experiments must be in a comma-separated list. If unspecified, no experiments are enabled.
Worker IP Address Configuration String Specifies whether Dataflow workers use public IP addresses. If the value is set to Private, Dataflow workers use private IP addresses for all communication. The specified Network or Subnetwork must have Private Google Access enabled. If the value is set to Private and the Subnetwork option is specified, the Network option is ignored. If not set, defaults to Public.
Network String The Compute Engine network to which workers are assigned. If not set, defaults to the network default.
Subnetwork String The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. If not set, Dataflow automatically determines the subnetwork.

gcloud

Flag Type Description Default value
‑‑region String The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow regional endpoint. If not set, the command returns an error.
‑‑max‑workers int The maximum number of Compute Engine instances available to your pipeline during execution. If unspecified, Dataflow automatically determines an appropriate number of workers.
‑‑num‑workers int The initial number of Compute Engine instances to use when executing your pipeline. This parameter determines how many workers Dataflow starts up when your job begins. If unspecified, Dataflow automatically determines an appropriate number of workers.
‑‑worker‑region String The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow regional endpoint. You can specify one of ‑‑worker‑region or ‑‑worker‑zone, but not both. If not set, defaults to the specified Dataflow regional endpoint.
‑‑worker‑zone String The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow regional endpoint. You can specify one of ‑‑worker‑region or ‑‑worker‑zone, but not both. If not set, defaults to a zone in the specified Dataflow regional endpoint.
‑‑worker‑machine‑type String The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. If not set, Dataflow automatically chooses the machine type.
‑‑service‑account‑email String The email address of the controller service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. If not set, Dataflow workers use the Compute Engine service account of the current project as the controller service account.
‑‑disable‑public‑ips boolean Specifies whether Dataflow workers use public IP addresses. If set to true, Dataflow workers use private IP addresses for all communication. If not set, defaults to false and Dataflow workers use public IP addresses.
‑‑network String The Compute Engine network to which workers are assigned. If not set, defaults to the network default.
‑‑subnetwork String The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. If not set, Dataflow automatically determines the subnetwork.
‑‑dataflow‑kms‑key String The customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. The key must be in the same location as the job. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.
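
As a sketch of how these flags combine, the following command sets several pipeline options on a single query run. The job name, destination dataset and table, worker counts, machine type, and service account shown here are hypothetical placeholders; the flags themselves are the ones described in the table above:

# Placeholder values; flags correspond to the pipeline options listed above.
gcloud dataflow sql query \
  --job-name=dfsql-taxi-pickups \
  --region=us-central1 \
  --bigquery-dataset=taxi_rides \
  --bigquery-table=passenger_counts \
  --max-workers=5 \
  --num-workers=2 \
  --worker-machine-type=n1-standard-2 \
  --service-account-email=my-service-account-name@my-project.iam.gserviceaccount.com \
'query'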

For more information, see the gcloud dataflow sql query command reference.

Stopping Dataflow SQL jobs

To stop Dataflow SQL jobs, use the Cancel command. Stopping a Dataflow SQL job with Drain is not supported.
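
If you are working from the command line, one way to cancel a running job is the gcloud dataflow jobs cancel command. This is a sketch; job-id and region are placeholders:

# job-id and region are placeholders; find the job ID in the Dataflow monitoring interface.
gcloud dataflow jobs cancel job-id \
  --region=region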

Pricing

Dataflow SQL uses the standard Dataflow pricing; it does not have separate pricing. You are billed for the resources consumed by the Dataflow jobs that you create based on your SQL statements. The charges for these resources are the standard Dataflow charges for vCPU, memory, Persistent Disk, Streaming Engine, and Dataflow Shuffle.

A Dataflow SQL job might consume additional resources such as Pub/Sub and BigQuery, each billed at its own pricing.

For more information about Dataflow pricing, see the Dataflow pricing page.

What's next