Using the Dataflow command-line interface

When you run your pipeline using the Dataflow-managed service, you can obtain information about your Dataflow job by using the Dataflow command-line interface, which is part of the gcloud command-line tool in the Cloud SDK.

If you'd rather view and interact with your Dataflow jobs using Cloud Console, use the Dataflow monitoring interface.

Installing the Dataflow command-line component

To use the Dataflow command-line interface from your local terminal, install and configure Cloud SDK.

For Cloud Shell, the Dataflow command-line interface is automatically available.

Running the commands

You interact with the Dataflow command-line interface by running the available commands. To see the list of available Dataflow commands, type the following command into your shell or terminal:

  gcloud dataflow --help

As the output shows, the gcloud dataflow command has three groups: flex-template, jobs, and sql.

Flex Template commands

The flex-template sub-command group enables you to work with Dataflow Flex Templates. The following operations are supported:

  • build: Builds a Flex Template file from the specified parameters.
  • run: Runs a job from the specified path.

To run a template, you must create a template specification file that is stored in a Cloud Storage bucket. The template specification file contains all the necessary information to run the job, such as the SDK information and metadata. Additionally, the metadata.json file contains information about the template such as the name, description, and input parameters. After creating the template specification file, you can then build the Flex Template by using either Java or Python.
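The two operations can be sketched as follows. This is a hedged example: the bucket path, image name, job name, and parameter names are placeholders, not values from this document.

```shell
# Build a template spec file in Cloud Storage from a prebuilt container image.
gcloud dataflow flex-template build gs://BUCKET_NAME/templates/my-template.json \
    --image "gcr.io/PROJECT_ID/my-image:latest" \
    --sdk-language "PYTHON" \
    --metadata-file "metadata.json"

# Run a job from the template spec file that the build step produced.
gcloud dataflow flex-template run "my-flex-job" \
    --template-file-gcs-location gs://BUCKET_NAME/templates/my-template.json \
    --region us-central1 \
    --parameters input_table=PROJECT_ID:input_dataset.input_table
```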

To create and run a Flex Template using the gcloud command-line tool, see the tutorial on Using Flex Templates.

Jobs commands

The jobs sub-command group enables you to work with Dataflow jobs in your project. The following operations are supported:

  • cancel: Cancels all jobs that match the command-line arguments.
  • describe: Outputs the Job object resulting from the Get API.
  • drain: Drains all jobs that match the command-line arguments.
  • list: Lists all jobs in a particular project, optionally filtered by region.
  • run: Runs a job from the specified path.
  • show: Shows a short description of the given job.
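For example, cancel and drain both stop a job but differ in how in-flight data is handled. A minimal sketch, where JOB_ID is a placeholder for a real job ID:

```shell
# Cancel a job immediately; data that is still being processed may be lost.
gcloud dataflow jobs cancel JOB_ID --region=us-central1

# Drain a streaming job: Dataflow stops pulling new input but finishes
# processing already-buffered data before stopping the job.
gcloud dataflow jobs drain JOB_ID --region=us-central1
```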

To get a list of all the Dataflow jobs in your project, run the following command in your shell or terminal:

  gcloud dataflow jobs list

The command returns a list of the Dataflow jobs in your project. The following is a sample output:

  ID                                        NAME                                    TYPE   CREATION_TIME        STATE   REGION
  2015-06-03_16_39_22-4020553808241078833   wordcount-janedoe-0603233849            Batch  2015-06-03 16:39:22  Done    us-central1
  2015-06-03_16_38_28-4363652261786938862   wordcount-johndoe-0603233820            Batch  2015-06-03 16:38:28  Done    us-central1
  2015-05-21_16_24_11-17823098268333533078  bigquerytornadoes-johndoe-0521232402    Batch  2015-05-21 16:24:11  Done    europe-west1
  2015-05-21_13_38_06-16409850040969261121  bigquerytornadoes-johndoe-0521203801    Batch  2015-05-21 13:38:06  Done    us-central1
  2015-05-21_13_17_18-18349574013243942260  bigquerytornadoes-johndoe-0521201710    Batch  2015-05-21 13:17:18  Done    europe-west1
  2015-05-21_12_49_37-9791290545307959963   wordcount-johndoe-0521194928            Batch  2015-05-21 12:49:37  Done    us-central1
  2015-05-20_15_54_51-15905022415025455887  wordcount-johndoe-0520225444            Batch  2015-05-20 15:54:51  Failed  us-central1
  2015-05-20_15_47_02-14774624590029708464  wordcount-johndoe-0520224637            Batch  2015-05-20 15:47:02  Done    us-central1
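The listing can be narrowed with standard flags. A sketch using the documented --status choices and the global --limit flag:

```shell
# List only jobs that are currently running.
gcloud dataflow jobs list --status=active

# List the five most recently created jobs in one region.
gcloud dataflow jobs list --region=us-central1 --limit=5
```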

Using the job ID displayed for each job, you can run the describe command to display more information about a job.

  gcloud dataflow jobs describe JOB_ID

Replace JOB_ID with the job ID of one of the Dataflow jobs from your project.

For example, if you run the command for job ID 2015-02-09_11_39_40-15635991037808002875, the following is a sample output:

  createTime: '2015-02-09T19:39:41.140Z'
  currentState: JOB_STATE_DONE
  currentStateTime: '2015-02-09T19:56:39.510Z'
  id: 2015-02-09_11_39_40-15635991037808002875
  name: tfidf-bchambers-0209193926
  projectId: google.com:clouddfe
  type: JOB_TYPE_BATCH

To format the result into JSON, run the command with the --format=json option:

  gcloud --format=json dataflow jobs describe JOB_ID

Replace JOB_ID with the job ID of one of the Dataflow jobs from your project.

The following sample output is formatted as JSON:

  {
    "createTime": "2015-02-09T19:39:41.140Z",
    "currentState": "JOB_STATE_DONE",
    "currentStateTime": "2015-02-09T19:56:39.510Z",
    "id": "2015-02-09_11_39_40-15635991037808002875",
    "name": "tfidf-bchambers-0209193926",
    "projectId": "google.com:clouddfe",
    "type": "JOB_TYPE_BATCH"
  }
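Because the JSON output is machine-readable, it is easy to post-process in a script. In the following sketch, a shell variable holding an abbreviated copy of the sample output stands in for a live gcloud call, and python3 (assumed to be installed) extracts the job state:

```shell
# Stand-in for: gcloud --format=json dataflow jobs describe JOB_ID
describe_output='{
  "currentState": "JOB_STATE_DONE",
  "id": "2015-02-09_11_39_40-15635991037808002875",
  "type": "JOB_TYPE_BATCH"
}'

# Extract the currentState field from the JSON.
state=$(printf '%s' "$describe_output" |
  python3 -c 'import json, sys; print(json.load(sys.stdin)["currentState"])')

echo "$state"
```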

SQL commands

The sql sub-command group enables you to work with Dataflow SQL. The gcloud dataflow sql query command accepts and runs a user-specified SQL query on Dataflow.

For example, to run a simple SQL query on a Dataflow job that reads from a BigQuery dataset and writes to another BigQuery dataset, run the following in your shell or terminal:

  gcloud dataflow sql query \
      'SELECT word FROM bigquery.table.PROJECT_ID.input_dataset.input_table WHERE count > 3' \
      --job-name=JOB_NAME \
      --region=us-west1 \
      --bigquery-dataset=OUTPUT_DATASET \
      --bigquery-table=OUTPUT_TABLE

Replace the following:

  • PROJECT_ID: the globally unique ID of your Cloud project
  • JOB_NAME: a name for your Dataflow job
  • OUTPUT_DATASET: a name for the output dataset
  • OUTPUT_TABLE: a name for the output table

Starting a Dataflow SQL job might take several minutes, and you cannot update a job after creating it. Dataflow SQL jobs use autoscaling, and Dataflow automatically chooses whether to run the job in batch or streaming mode; you cannot control this behavior for Dataflow SQL jobs. To stop a Dataflow SQL job, use the cancel command. Stopping a Dataflow SQL job with drain is not supported.
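For example, to stop a running Dataflow SQL job (JOB_ID and the region below are placeholders):

```shell
# Find the ID of the running SQL job, then cancel it in its region.
gcloud dataflow jobs list --region=us-west1 --status=active
gcloud dataflow jobs cancel JOB_ID --region=us-west1
```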

For more information about using SQL commands for Dataflow, see the Dataflow SQL reference and the gcloud dataflow sql query documentation.

Using commands with regional endpoints

The Dataflow command-line interface supports regional endpoints in gcloud tool version 176 and later. Use the --region option with any command to specify the regional endpoint that manages your job.

For example, gcloud dataflow jobs list lists jobs from all regions, but gcloud dataflow jobs list --region=europe-west1 only lists jobs managed from europe-west1.