This page explains how to use Dataflow SQL and create Dataflow SQL jobs.
To create a Dataflow SQL job, you must write and run a Dataflow SQL query.
Use the Dataflow SQL editor
The Dataflow SQL editor is a page in the Google Cloud console where you write and run queries for creating Dataflow SQL jobs.
To access the Dataflow SQL editor, follow these steps:
In the Google Cloud console, go to the Dataflow SQL Editor page.
You can also access the Dataflow SQL editor from the Dataflow monitoring interface by following these steps:
In the Google Cloud console, go to the Dataflow Jobs page.
In the Dataflow menu, click SQL Workspace.
Write Dataflow SQL queries
Dataflow SQL queries use the Dataflow SQL query syntax. The Dataflow SQL query syntax is similar to BigQuery standard SQL.
You can use the Dataflow SQL streaming extensions to aggregate data from continuously updating Dataflow sources like Pub/Sub.
For example, the following query counts the passengers in a Pub/Sub stream of taxi rides every minute:
SELECT
  DATETIME(tr.window_start) AS starttime,
  SUM(tr.passenger_count) AS pickup_count
FROM TUMBLE ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
  DESCRIPTOR(event_timestamp), 'INTERVAL 1 MINUTE') AS tr
WHERE tr.ride_status = "pickup"
GROUP BY DATETIME(tr.window_start)
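The streaming extensions also include a HOP function for sliding (overlapping) windows. The following sketch reuses the same public taxi rides topic and assumes a HOP argument order of slide interval followed by window size; check the streaming extensions reference for the exact signature. It emits a count every 30 seconds over the preceding 2 minutes:

-- Sliding-window sketch (assumes HOP arguments: data, timestamp descriptor,
-- slide interval, window size).
SELECT
  DATETIME(tr.window_start) AS starttime,
  SUM(tr.passenger_count) AS pickup_count
FROM HOP ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
  DESCRIPTOR(event_timestamp), 'INTERVAL 30 SECOND', 'INTERVAL 2 MINUTE') AS tr
WHERE tr.ride_status = "pickup"
GROUP BY DATETIME(tr.window_start)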
Run Dataflow SQL queries
When you run a Dataflow SQL query, Dataflow turns the query into an Apache Beam pipeline and runs the pipeline.
You can run a Dataflow SQL query using the Google Cloud console or Google Cloud CLI.
Console
To run a Dataflow SQL query, use the Dataflow SQL editor:
Go to the Dataflow SQL Editor page.
Enter the Dataflow SQL query into the query editor.
Click Create job to open a panel of job options.
Optional: For Job name, enter a unique job name.
For Regional endpoint, select a value from the menu.
Optional: Click Show optional parameters, and then enter values for the provided Dataflow pipeline options.
For Destination, select an Output type, and then enter values for the provided fields.
Optional: In the SQL query parameters section, add parameters and then enter values in the provided fields.
Click Create.
gcloud
To run a Dataflow SQL query, use the gcloud dataflow sql query
command. The following example command runs a SQL query and writes the results to a BigQuery table:
gcloud dataflow sql query \
  --job-name=JOB_NAME \
  --region=REGION \
  --bigquery-table=BIGQUERY_TABLE \
  --bigquery-dataset=BIGQUERY_DATASET \
  --bigquery-project=BIGQUERY_PROJECT \
  'SQL_QUERY'
Replace the following:
- JOB_NAME: a name for your Dataflow SQL job
- REGION: the Dataflow location for deploying your Dataflow job
- BIGQUERY_TABLE: the name of the BigQuery table to which you want to write the output
- BIGQUERY_DATASET: the BigQuery dataset ID that contains the output table
- BIGQUERY_PROJECT: the Google Cloud project ID that contains the output BigQuery table
- SQL_QUERY: your Dataflow SQL query
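For example, a filled-in command might look like the following. The job name, region, and BigQuery destination are hypothetical values; the query is the taxi rides example from earlier, with double quotes around the interval literal so that it fits inside the single-quoted shell argument:

gcloud dataflow sql query \
  --job-name=taxi-pickup-counts \
  --region=us-central1 \
  --bigquery-table=passenger_counts \
  --bigquery-dataset=taxi_rides \
  --bigquery-project=my-project-id \
  'SELECT DATETIME(tr.window_start) AS starttime, SUM(tr.passenger_count) AS pickup_count
   FROM TUMBLE ((SELECT * FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`),
     DESCRIPTOR(event_timestamp), "INTERVAL 1 MINUTE") AS tr
   WHERE tr.ride_status = "pickup"
   GROUP BY DATETIME(tr.window_start)'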
Set pipeline options
You can set Dataflow pipeline options for Dataflow SQL jobs. Dataflow pipeline options are execution parameters that configure how and where to run Dataflow SQL queries.
To set Dataflow pipeline options for Dataflow SQL jobs, specify the following parameters when you run a Dataflow SQL query.
Console
Parameter | Type | Description | Default value |
---|---|---|---|
Regional endpoint | String | The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow location. | If not set, defaults to us-central1. |
Max workers | int | The maximum number of Compute Engine instances available to your pipeline during execution. | If unspecified, the Dataflow service automatically determines an appropriate number of workers. |
Worker region | String | The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow job region. | If not set, defaults to the specified Dataflow region. |
Worker zone | String | The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow job region. | If not set, defaults to a zone in the worker region. If the worker region is not set, defaults to a zone in the specified Dataflow region. |
Service account email | String | The email address of the worker service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. | If not set, Dataflow workers use the Compute Engine service account of the current project as the worker service account. |
Machine type | String | The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. | If not set, Dataflow automatically chooses the machine type. |
Additional experiments | String | The experiments to enable. An experiment can be a value, like enable_streaming_engine, or a key-value pair, such as shuffle_mode=service. The experiments must be in a comma-separated list. | If unspecified, no experiments are enabled. |
Worker IP Address Configuration | String | Specifies whether Dataflow workers use public IP addresses. If the value is set to Private, Dataflow workers use private IP addresses for all communication. If the value is set to Public, Dataflow workers use public IP addresses. | If not set, defaults to Public. |
Network | String | The Compute Engine network to which workers are assigned. | If not set, defaults to the network named default. |
Subnetwork | String | The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. | If not set, Dataflow automatically determines the subnetwork. |
gcloud
Flag | Type | Description | Default value |
---|---|---|---|
--region | String | The region to run the query in. Dataflow SQL queries can be run in regions that have a Dataflow location. | If not set, the command returns an error. |
--max-workers | int | The maximum number of Compute Engine instances available to your pipeline during execution. | If unspecified, Dataflow automatically determines an appropriate number of workers. |
--num-workers | int | The initial number of Compute Engine instances to use when executing your pipeline. This parameter determines how many workers Dataflow starts up when your job begins. | If unspecified, Dataflow automatically determines an appropriate number of workers. |
--worker-region | String | The Compute Engine region for launching worker instances to run your pipeline. The Compute Engine worker region can be in a different region than the Dataflow job region. You can specify --worker-region or --worker-zone, but not both. | If not set, defaults to the specified Dataflow region. |
--worker-zone | String | The Compute Engine zone for launching worker instances to run your pipeline. The Compute Engine zone can be in a different region than the Dataflow job region. You can specify --worker-region or --worker-zone, but not both. | If not set, defaults to a zone in the specified Dataflow region. |
--worker-machine-type | String | The Compute Engine machine type that Dataflow uses when starting workers. You can use any of the available Compute Engine machine type families as well as custom machine types. Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. | If not set, Dataflow automatically chooses the machine type. |
--service-account-email | String | The email address of the worker service account with which to run the pipeline. The email address must be in the form my-service-account-name@<project-id>.iam.gserviceaccount.com. | If not set, Dataflow workers use the Compute Engine service account of the current project as the worker service account. |
--disable-public-ips | boolean | Specifies whether Dataflow workers use public IP addresses. If set, Dataflow workers use private IP addresses for all communication. | If not set, Dataflow workers use public IP addresses. |
--network | String | The Compute Engine network to which workers are assigned. | If not set, defaults to the network named default. |
--subnetwork | String | The Compute Engine subnetwork to which workers are assigned. The subnetwork must be in the form regions/region/subnetworks/subnetwork. | If not set, Dataflow automatically determines the subnetwork. |
--dataflow-kms-key | String | The customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. The key must be in the same location as the job. | If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. |
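As an illustration, the following command combines several of the flags from the preceding table to run a job on private IP addresses in a custom network. The region, machine type, worker count, network, and subnetwork are hypothetical values; replace them and the remaining placeholders with your own:

gcloud dataflow sql query \
  --job-name=JOB_NAME \
  --region=us-central1 \
  --max-workers=10 \
  --worker-machine-type=n1-standard-2 \
  --disable-public-ips \
  --network=my-network \
  --subnetwork=regions/us-central1/subnetworks/my-subnetwork \
  --bigquery-table=BIGQUERY_TABLE \
  --bigquery-dataset=BIGQUERY_DATASET \
  --bigquery-project=BIGQUERY_PROJECT \
  'SQL_QUERY'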
For more information, see the gcloud dataflow sql query command reference.
Stop Dataflow SQL jobs
To stop a Dataflow SQL job, you must cancel it.
Stopping a Dataflow SQL job with the drain option is not supported.
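For example, you can cancel a job with the gcloud CLI. The region shown is an example value, and JOB_ID is the ID of the job you want to stop, which you can look up with gcloud dataflow jobs list:

# List active jobs to find the job ID.
gcloud dataflow jobs list --region=us-central1 --status=active

# Cancel the job.
gcloud dataflow jobs cancel JOB_ID --region=us-central1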
Pricing
Dataflow SQL uses the standard Dataflow pricing; it does not have separate pricing. You are billed for the resources consumed by the Dataflow jobs that you create based on your SQL statements. The charges for these resources are the standard Dataflow charges for vCPU, memory, Persistent Disk, Streaming Engine, and Dataflow Shuffle.
A Dataflow SQL job might consume additional resources such as Pub/Sub and BigQuery, each billed at its own pricing.
For more information about Dataflow pricing, see Dataflow pricing.
What's next
- Walk through the Joining streaming data with Dataflow SQL tutorial.
- Explore the Google Cloud CLI for Dataflow SQL.