Introduction to Cloud Dataflow SQL

Cloud Dataflow SQL lets you use SQL queries to develop and run Cloud Dataflow jobs from the BigQuery web UI. Cloud Dataflow SQL integrates with Apache Beam SQL and supports a variant of the ZetaSQL query syntax. You can use ZetaSQL's streaming extensions to define your streaming data parallel-processing pipelines:

  • Use your existing SQL skills to develop and run streaming pipelines from the BigQuery web UI. You do not need to set up an SDK development environment or know how to program in Java or Python.
  • Join streams (such as Cloud Pub/Sub) with snapshotted datasets (such as BigQuery tables), as in the example query after this list.
  • Query your streams or static datasets with SQL by associating schemas with objects, such as tables, files, and Cloud Pub/Sub topics.
  • Write your results into a BigQuery table for analysis and dashboarding.
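
For example, a query along the following lines joins a stream from a Cloud Pub/Sub topic with a static BigQuery table and aggregates the result over tumbling windows. This is a minimal sketch: the project ID, the transactions topic and its schema fields (state, amount, event_timestamp), and the us_state_salesregions table are hypothetical placeholders; see the Cloud Dataflow SQL reference for the exact source-naming and windowing syntax.

    -- Stream from a Pub/Sub topic (placeholder name), joined with a static
    -- BigQuery table and aggregated over 15-second tumbling windows.
    SELECT
      sr.sales_region,
      TUMBLE_START("INTERVAL 15 SECOND") AS period_start,
      SUM(tr.amount) AS amount
    FROM pubsub.topic.`project-id`.transactions AS tr
    INNER JOIN
      bigquery.table.`project-id`.dataflow_sql_dataset.us_state_salesregions AS sr
      ON tr.state = sr.state_code
    GROUP BY
      sr.sales_region,
      TUMBLE(tr.event_timestamp, "INTERVAL 15 SECOND")

Running such a query from the BigQuery web UI with the Cloud Dataflow engine selected launches a Cloud Dataflow job that writes its results to the BigQuery table you choose as the destination.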

Supported regions

Cloud Dataflow SQL can run jobs in regions that have a Cloud Dataflow regional endpoint.

Limitations

The current version of Cloud Dataflow SQL is subject to the following limitations:

  • Cloud Dataflow SQL only supports a subset of BigQuery standard SQL. See the Cloud Dataflow SQL reference for details.
  • With Cloud Dataflow SQL, there is a single aggregated output per window grouping when the watermark indicates that the window is complete. Data that arrives later is dropped (see the windowing sketch after this list).
  • Cloud Dataflow SQL has millisecond timestamp precision:
    • Your BigQuery TIMESTAMP fields must have at most millisecond precision. If a TIMESTAMP field has sub-millisecond precision, Cloud Dataflow SQL throws an IllegalArgumentException.
    • Cloud Pub/Sub publish timestamps are truncated to milliseconds.
  • Sources: Reading is limited to Cloud Pub/Sub topics and BigQuery tables.
  • Cloud Dataflow SQL expects messages in Cloud Pub/Sub topics to be serialized in JSON format. Support for other formats such as Avro will be added in the future.
  • Destinations: Writing is limited to BigQuery tables.
  • You can only run jobs in regions that have a Cloud Dataflow regional endpoint.
  • Cloud Dataflow uses autoscaling of resources and chooses the execution mode for the job (batch or streaming). There are no parameters to control this behavior.
  • Creating a Cloud Dataflow job can take several minutes. The job fails if there are any errors during pipeline execution.
  • BigQuery buffers the data that you stream into your BigQuery tables. As a result, there is a delay in displaying your data in the preview pane. However, you can query the table using regular SQL commands.
  • Stopping a pipeline with the Drain command is not supported. Use the Cancel command to stop your pipeline.
  • Updating a running pipeline is not supported.
  • You can edit only the SQL queries of running jobs (streaming or batch) and of successfully completed batch jobs.
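
To illustrate the windowing behavior noted above, the following hedged sketch groups a hypothetical Cloud Pub/Sub topic of JSON messages (for example, {"ride_status": "enroute", "amount": 3.5}) into one-minute tumbling windows. Each window produces a single aggregated row once the watermark passes the end of the window; messages arriving after that point are dropped. The topic name, project ID, and payload fields are placeholders, and event_timestamp is assumed to be the Cloud Pub/Sub publish timestamp (truncated to milliseconds, as noted above).

    -- Hypothetical topic `rides` whose associated schema includes
    -- ride_status (STRING) and amount (FLOAT64); event_timestamp is the
    -- publish timestamp. One row is emitted per ride_status per one-minute
    -- window when the watermark passes the window end; late data is dropped.
    SELECT
      ride_status,
      TUMBLE_START("INTERVAL 1 MINUTE") AS window_start,
      COUNT(*) AS ride_count,
      SUM(amount) AS total_amount
    FROM pubsub.topic.`project-id`.rides
    GROUP BY
      ride_status,
      TUMBLE(event_timestamp, "INTERVAL 1 MINUTE")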

Quotas

For information about Cloud Dataflow quotas and limits, see Quotas & limits.

Pricing

Cloud Dataflow SQL uses the standard Cloud Dataflow pricing; it does not have separate pricing. You are billed for the resources consumed by the Cloud Dataflow jobs that you create based on your SQL statements. The charges for these resources are the standard Cloud Dataflow charges for vCPU, memory, and Persistent Disk. In addition, a job might use additional services such as Cloud Pub/Sub and BigQuery, each billed at its own pricing.

For more information about Cloud Dataflow pricing, see the Cloud Dataflow pricing page.

What's next
