Quickstart using SQL

This page shows you how to create a Dataflow job using the gcloud command-line tool for Dataflow SQL. The Dataflow job writes the results of a Dataflow SQL query to a table in a BigQuery dataset.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. Enable the Dataflow, Compute Engine, Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Resource Manager, and Data Catalog APIs. You can enable them in the Cloud Console, or from the command line as shown in the sketch after this list.

  5. Install and initialize the Cloud SDK.
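
If you prefer the command line, the APIs in step 4 can also be enabled with the gcloud tool. This is a minimal sketch; the service names below are the standard identifiers for these APIs, and PROJECT_ID is a placeholder for your project ID.

# Set the default project; PROJECT_ID is a placeholder for your project ID.
gcloud config set project PROJECT_ID

# Enable the APIs that this quickstart uses.
gcloud services enable \
  dataflow.googleapis.com \
  compute.googleapis.com \
  logging.googleapis.com \
  storage.googleapis.com \
  storage-api.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  cloudresourcemanager.googleapis.com \
  datacatalog.googleapis.com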

Create a BigQuery dataset

Create a BigQuery dataset named taxirides.

bq mk taxirides
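
To confirm that the dataset was created, you can list the datasets in your project. This assumes bq is configured with the same default project as the rest of this quickstart; taxirides should appear in the output.

bq ls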

Query a Pub/Sub topic

Query the public Pub/Sub topic taxirides-realtime for the number of passengers picked up every 10 seconds. In this query, TUMBLE assigns each message to a fixed, non-overlapping 10-second window based on its event timestamp, and TUMBLE_START returns the start time of that window.

gcloud beta dataflow sql query \
  --job-name=dataflow-sql-quickstart \
  --region=us-central1 \
  --bigquery-dataset=taxirides \
  --bigquery-table=passengers_per_minute \
'SELECT
  TUMBLE_START("INTERVAL 10 SECOND") AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE
  ride_status = "pickup"
GROUP BY
  TUMBLE(event_timestamp, "INTERVAL 10 SECOND")'
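
The command returns after it submits the job, and Dataflow continues to run the pipeline in the background. As a quick check from the command line, you can list your jobs; this sketch assumes the same region as the command above.

# List active Dataflow jobs in the quickstart's region.
gcloud dataflow jobs list --region=us-central1 --status=active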

View the query results

  1. Confirm that the Dataflow job is running.

    1. Go to the Dataflow monitoring interface.

    2. In the list of jobs, click dataflow-sql-quickstart.

    3. In the Job info panel, confirm that the Job status field is set to Running.

      Starting the job might take several minutes. The Job status is set to Queued until the job starts.

    4. In the Job graph tab, confirm that every step has been running for at least 1 second.

      [Screenshot: The Dataflow job graph with two composite steps. The first step has been running for 6 min 45 sec and the second step for 1 sec.]

      After the job starts, the steps might take several minutes to start running.

  2. Return the busiest intervals from the passengers_per_minute table.

    bq query \
    'SELECT *
    FROM taxirides.passengers_per_minute
    ORDER BY pickup_count DESC
    LIMIT 5'
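
Each row contains a period_start timestamp and a pickup_count total, one per 10-second window. If you want to inspect the table's schema directly, the bq tool can print it; this sketch assumes the table name used above.

bq show --schema --format=prettyjson taxirides.passengers_per_minute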
    

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this quickstart, follow these steps.

  1. Delete the taxirides dataset.

    1. Run the bq rm command. The -r flag also removes the tables in the dataset:

      bq rm -r taxirides
      
    2. To confirm, type y.

  2. Cancel the Dataflow job.

    1. Go to the Dataflow monitoring interface.

    2. In the list of jobs, click dataflow-sql-quickstart.

    3. Click Stop. In the Stop job panel, select Cancel, and then click Stop job.
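
You can also cancel the job from the command line. This sketch uses the gcloud Dataflow commands shown earlier; JOB_ID is a placeholder for the ID that the list command returns.

# Find the ID of the running job.
gcloud dataflow jobs list --region=us-central1 --status=active

# Cancel the job; JOB_ID is a placeholder for the ID from the previous command.
gcloud dataflow jobs cancel JOB_ID --region=us-central1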

What's next