Introduction to Cloud Dataflow SQL

Dataflow SQL lets you use SQL queries to develop and run Dataflow jobs from the BigQuery web UI. Dataflow SQL integrates with Apache Beam SQL and supports a variant of the ZetaSQL query syntax. You can use ZetaSQL's streaming extensions to define streaming, parallel data-processing pipelines:

  • Use your existing SQL skills to develop and run streaming pipelines from the BigQuery web UI. You do not need to set up an SDK development environment or know how to program in Java or Python.
  • Join streams (such as Pub/Sub) with snapshotted datasets (such as BigQuery tables and Cloud Storage filesets), as shown in the example query after this list.
  • Query your streams or static datasets with SQL by associating schemas with objects, such as tables, Cloud Storage filesets, and Pub/Sub topics.
  • Write your results into a BigQuery table for analysis and dashboarding.
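
For example, the following query is a minimal sketch of such a pipeline; the project, topic, dataset, and table names are illustrative. It assumes a Pub/Sub topic of JSON transaction messages with state and amount fields, and a BigQuery table that maps state codes to sales regions; event_timestamp is the publish timestamp that Dataflow SQL adds to the topic's schema. The query joins the stream with the lookup table and totals sales per region over 15-second tumbling windows. You choose the BigQuery destination table for the results when you create the Dataflow job, not in the query itself.

  -- Aggregate streaming sales per region over 15-second tumbling windows.
  -- Identifiers (my-project, transactions, dataset_name, us_state_salesregions)
  -- are placeholders for your own project, topic, dataset, and table.
  SELECT
    sr.sales_region,
    TUMBLE_START("INTERVAL 15 SECOND") AS period_start,
    SUM(tr.amount) AS amount
  FROM pubsub.topic.`my-project`.transactions AS tr
    INNER JOIN bigquery.table.`my-project`.dataset_name.us_state_salesregions AS sr
    ON tr.state = sr.state_code
  GROUP BY
    sr.sales_region,
    TUMBLE(tr.event_timestamp, "INTERVAL 15 SECOND")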

Supported regions

Dataflow SQL can run jobs in regions that have a Dataflow regional endpoint.

Limitations

The current version of Cloud Dataflow SQL is subject to the following limitations:

  • Dataflow SQL only supports a subset of BigQuery standard SQL. See the Dataflow SQL reference for details.
  • With Dataflow SQL, there is a single aggregated output per window grouping when the watermark indicates that the window is complete. Data that arrives later is dropped (see the windowed aggregation sketch after this list).
  • Dataflow SQL has millisecond timestamp precision:
    • Your BigQuery TIMESTAMP fields must have at most millisecond precision. If a TIMESTAMP field has sub-millisecond precision, Dataflow SQL throws an IllegalArgumentException.
    • Pub/Sub publish timestamps are truncated to milliseconds.
  • Sources: Reading is limited to Pub/Sub topics, Cloud Storage filesets, and BigQuery tables.
  • Dataflow SQL expects messages in Pub/Sub topics to be serialized in JSON format. Support for other formats such as Avro will be added in the future.
  • Dataflow SQL expects CSV files in Cloud Storage filesets. CSV files must not contain a header row with column names; the first row in each CSV file is interpreted as a data record.
  • Destinations: Writing is limited to BigQuery tables.
  • You can only run jobs in regions that have a Dataflow regional endpoint.
  • Dataflow uses autoscaling of resources and chooses the execution mode for the job (batch or streaming). There are no parameters to control this behavior.
  • Creating a Dataflow job can take several minutes. The job fails if there are any errors during pipeline execution.
  • Stopping a pipeline with the Drain command is not supported. Use the Cancel command to stop your pipeline.
  • Updating a running pipeline is not supported.
  • You can only edit the SQL queries of previous jobs if those jobs are currently running (streaming or batch) or are batch jobs that completed successfully.
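
The following query is a minimal sketch of the windowing behavior described in this list; the project and topic names are illustrative. It counts Pub/Sub messages per one-minute tumbling window. Dataflow SQL emits a single result row for each window once the watermark passes the end of that window, and messages that arrive after that point are dropped rather than producing an updated result.

  -- Count messages per one-minute tumbling window: one output row per window.
  -- event_timestamp is the publish timestamp that Dataflow SQL adds to the
  -- Pub/Sub topic's schema; my-project and transactions are placeholders.
  SELECT
    TUMBLE_START("INTERVAL 1 MINUTE") AS minute_start,
    COUNT(*) AS message_count
  FROM pubsub.topic.`my-project`.transactions
  GROUP BY TUMBLE(event_timestamp, "INTERVAL 1 MINUTE")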

Quotas

For information about Dataflow quotas and limits, see Quotas & limits.

Pricing

Dataflow SQL uses the standard Dataflow pricing; it does not have separate pricing. You are billed for the resources consumed by the Dataflow jobs that you create based on your SQL statements. The charges for these resources are the standard Dataflow charges for vCPU, memory, and Persistent Disk. In addition, a job might consume other resources, such as Pub/Sub and BigQuery, each billed at its own pricing.

For more information about Dataflow pricing, see the Dataflow pricing page.

What's next
