Quickstart using SQL

In this quickstart, you learn how to write SQL syntax to query a publicly available Pub/Sub topic. The SQL query runs a Dataflow pipeline, and the results of the pipeline are written to a BigQuery table.

To run a Dataflow SQL job, you can use the Google Cloud Console, the Cloud SDK installed on a local machine, or Cloud Shell. This example uses the Cloud Console together with either a local machine that has the Cloud SDK installed or Cloud Shell.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Resource Manager, and Google Cloud Data Catalog APIs.

    Enable the APIs
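
    If you prefer the command line, you can usually enable the same APIs with gcloud services enable. The service identifiers below are the standard names for these products; if one fails, check gcloud services list --available for the current name:

    gcloud services enable dataflow.googleapis.com compute.googleapis.com \
        logging.googleapis.com storage-component.googleapis.com \
        storage-api.googleapis.com bigquery.googleapis.com \
        pubsub.googleapis.com cloudresourcemanager.googleapis.com \
        datacatalog.googleapis.com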


Install and initialize Cloud SDK

  • Download the Cloud SDK package for your operating system, and then install and configure the Cloud SDK.

    Depending on your internet connection, the download might take a while.
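
    After installation, a typical first run initializes the SDK, signs you in, and sets a default project; gcloud init walks you through these prompts interactively. PROJECT_ID below is a placeholder for your own project ID:

    # Initialize the SDK and sign in (interactive).
    gcloud init

    # Or, if the SDK is already initialized, point it at your project.
    gcloud config set project PROJECT_ID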

Create a BigQuery dataset

In this quickstart, the Dataflow SQL pipeline writes its output to a BigQuery table inside the BigQuery dataset that you create in this section.

  • Create a BigQuery dataset that is named taxirides:

    bq mk taxirides
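
    To confirm that the dataset was created, you can list the datasets in the current project; taxirides should appear in the output:

    bq ls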
    

Run the pipeline

  • Run a Dataflow SQL pipeline that calculates the number of passengers picked up per minute, using data from a publicly available Pub/Sub topic about taxi rides. The command also creates a BigQuery table named passengers_per_minute to store the output.

    gcloud dataflow sql query \
        --job-name=dataflow-sql-quickstart \
        --region=us-central1 \
        --bigquery-dataset=taxirides \
        --bigquery-table=passengers_per_minute \
    'SELECT
         TUMBLE_START("INTERVAL 60 SECOND") AS period_start,
         SUM(passenger_count) AS pickup_count
    FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
    WHERE
        ride_status = "pickup"
    GROUP BY
        TUMBLE(event_timestamp, "INTERVAL 60 SECOND")'
    

    It might take a while for the Dataflow SQL job to start running.
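
    While you wait, you can optionally poll for the job from the same terminal. The --region and --status flags limit the list to active jobs in the region used by this quickstart:

    gcloud dataflow jobs list --region=us-central1 --status=active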

The following describes the values used in the Dataflow SQL pipeline:

  • dataflow-sql-quickstart: the name of the Dataflow job
  • us-central1: the region where the job runs
  • taxirides: the name of the BigQuery dataset used as the sink
  • passengers_per_minute: the name of the BigQuery table
  • taxirides-realtime: the name of the Pub/Sub topic used as the source

The SQL command queries the Pub/Sub topic taxirides-realtime for the total number of passengers picked up every 60 seconds. This public topic is based on the NYC Taxi & Limousine Commission's open dataset.
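
Once the job has written its first results, you can optionally inspect the schema of the output table. Based on the SELECT list above, the table should contain a period_start timestamp column and a pickup_count numeric column:

    bq show --schema --format=prettyjson taxirides.passengers_per_minute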

View the results

  • Verify that the pipeline is running.

    Console

    1. In the Cloud Console, go to the Dataflow Jobs page.

      Go to Jobs

    2. In the list of jobs, click dataflow-sql-quickstart.

    3. In the Job info panel, confirm that the Job status field is set to Running.

      The job might take several minutes to start. The Job status is set to Queued until the job starts.

    4. In the Job graph tab, confirm that every step is running.

      After the job starts, the steps might take several minutes to start running.

      A screenshot of the Dataflow job graph shows two composite steps: the first step has been running for 6 minutes 45 seconds, and the second step for 1 second.

    5. In the Cloud Console, go to the BigQuery page.

      Go to BigQuery

    6. In the Editor, paste the following SQL query and click Run:

      SELECT *
      FROM taxirides.passengers_per_minute
      ORDER BY pickup_count DESC
      LIMIT 5
      

      This query returns the busiest intervals from the passengers_per_minute table.

    gcloud

    1. Get the list of Dataflow jobs running in your project:

      gcloud dataflow jobs list
      
    2. Get more information about the dataflow-sql-quickstart job:

      gcloud dataflow jobs describe JOB_ID
      

      Replace JOB_ID with the job ID of the dataflow-sql-quickstart job from your project. A scripted way to capture the job ID is sketched after these steps.

    3. Return the busiest intervals from the passengers_per_minute table:

      bq query \
      'SELECT *
      FROM taxirides.passengers_per_minute
      ORDER BY pickup_count DESC
      LIMIT 5'
      
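
      If you prefer not to copy the job ID by hand in step 2, you can capture it with gcloud's global --filter and --format flags. The id and name field names here are assumptions based on the Dataflow job resource; adjust them if your gcloud release reports different fields:

      # Capture the job ID of the dataflow-sql-quickstart job (field names assumed).
      JOB_ID=$(gcloud dataflow jobs list \
          --region=us-central1 \
          --filter="name=dataflow-sql-quickstart" \
          --format="value(id)")
      gcloud dataflow jobs describe $JOB_ID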

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

  1. To cancel the Dataflow job, go to the Jobs page. A command-line alternative is sketched after these steps.

    Go to Jobs

  2. In the list of jobs, click dataflow-sql-quickstart.

  3. Click Stop. In the Stop job panel, select Cancel, and then click Stop job.

  4. Delete the taxirides dataset:

    bq rm -r taxirides
    
  5. To confirm the deletion, type y.
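
As an alternative to the console steps above, you can usually cancel the job and delete the dataset from the command line. Replace JOB_ID with the job ID from your project; the -r flag deletes the dataset together with any tables it contains:

    # Cancel the running Dataflow job.
    gcloud dataflow jobs cancel JOB_ID --region=us-central1

    # Delete the taxirides dataset and its tables.
    bq rm -r taxirides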

What's next