Using the Cloud Dataflow command-line interface

When you execute your pipeline using the Dataflow managed service, you can obtain information about your Dataflow job by using the Dataflow command-line interface. The Dataflow command-line interface is part of the gcloud command-line tool in the Cloud SDK.

NOTE: If you'd rather view and interact with your Dataflow jobs using the web-based UI, use the Dataflow monitoring interface.

Installing the Cloud Dataflow command-line component

To use the Dataflow command-line interface, you first need to install the gcloud tool.
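
If you already have the Cloud SDK installed, a minimal setup sketch looks like the following; the project ID is a placeholder for your own project:

# Keep the installed SDK components up to date.
gcloud components update

# Authenticate and point the gcloud tool at the project that owns your Dataflow jobs.
gcloud auth login
gcloud config set project <PROJECT_ID>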

Running the available commands

You interact with the Dataflow command-line interface by running its available commands. To run a command, type the following into your shell or terminal:

  gcloud dataflow

The Dataflow command-line interface has three major subcommands: jobs, logs, and metrics.
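
Each of these subcommand groups accepts the standard --help flag, which prints the commands and flags it supports. For example:

gcloud dataflow jobs --help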

Jobs commands

The jobs subcommand group lets you view and interact with the Dataflow jobs in your Google Cloud project. You can use these commands to view a list of your jobs, cancel a job, show a description of a specific job, and more. For example, to view a list of all your Dataflow jobs, type the following command into your shell or terminal:

gcloud dataflow jobs list

The gcloud tool returns a list of your current jobs, as follows:

  ID                                        NAME                                    TYPE   CREATION_TIME        STATE   REGION
  2015-06-03_16_39_22-4020553808241078833   wordcount-janedoe-0603233849            Batch  2015-06-03 16:39:22  Done    us-central1
  2015-06-03_16_38_28-4363652261786938862   wordcount-johndoe-0603233820            Batch  2015-06-03 16:38:28  Done    us-central1
  2015-05-21_16_24_11-17823098268333533078  bigquerytornadoes-johndoe-0521232402    Batch  2015-05-21 16:24:11  Done    europe-west1
  2015-05-21_13_38_06-16409850040969261121  bigquerytornadoes-johndoe-0521203801    Batch  2015-05-21 13:38:06  Done    us-central1
  2015-05-21_13_17_18-18349574013243942260  bigquerytornadoes-johndoe-0521201710    Batch  2015-05-21 13:17:18  Done    europe-west1
  2015-05-21_12_49_37-9791290545307959963   wordcount-johndoe-0521194928            Batch  2015-05-21 12:49:37  Done    us-central1
  2015-05-20_15_54_51-15905022415025455887  wordcount-johndoe-0520225444            Batch  2015-05-20 15:54:51  Failed  us-central1
  2015-05-20_15_47_02-14774624590029708464  wordcount-johndoe-0520224637            Batch  2015-05-20 15:47:02  Done    us-central1
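
If you need to stop a job that is still running, you can pass its ID to the cancel command:

export JOBID=<X>
gcloud dataflow jobs cancel $JOBID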

Using the job ID, you can run the describe command to display more information about a job.

export JOBID=<X>
gcloud dataflow jobs describe $JOBID

For example, if you run the command for job ID 2015-02-09_11_39_40-15635991037808002875, the gcloud tool returns the following information:

createTime: '2015-02-09T19:39:41.140Z'
currentState: JOB_STATE_DONE
currentStateTime: '2015-02-09T19:56:39.510Z'
id: 2015-02-09_11_39_40-15635991037808002875
name: tfidf-bchambers-0209193926

You can run the command with the --format=json option to format the result as JSON.

gcloud --format=json dataflow jobs describe $JOBID

The gcloud tool returns the following formatted information:

  "createTime": "2015-02-09T19:39:41.140Z",
  "currentState": "JOB_STATE_DONE",
  "currentStateTime": "2015-02-09T19:56:39.510Z",
  "id": "2015-02-09_11_39_40-15635991037808002875",
  "name": "tfidf-bchambers-0209193926",
  "projectId": "",
  "type": "JOB_TYPE_BATCH"

For a complete list of jobs commands, see the gcloud dataflow jobs command in the Cloud SDK documentation.

Logs commands

The logs commands display log entries for jobs run on the Dataflow service.

For example, you can use the list command to print the logs that provide information about what your job is doing.

export JOBID=<X>
gcloud dataflow logs list $JOBID

For job ID 2015-02-09_11_39_40-15635991037808002875, the gcloud tool returns:

Listed 0 items.

In this example, no logs showed up at the default severity (Warning). You can include the BASIC logs by running the list command with the --importance=detailed option.

gcloud dataflow logs list $JOBID --importance=detailed

The gcloud tool prints out the following logs:

d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d72606f7 (39b2a31f5e883423): Starting worker pool synchronously
d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d7260871 (39b2a31f5e883ce9): Worker pool is running
d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d7260874 (39b2a31f5e883b77): Executing operation Count.PerElement/Sum.PerKey/GroupByKey/GroupByKeyOnly…
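
If you are looking for a specific operation among the detailed entries, a plain shell pipe is enough; grep here is ordinary shell usage, not a gcloud feature:

gcloud dataflow logs list $JOBID --importance=detailed | grep -i "worker pool"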

For a complete list of logs commands, see the gcloud dataflow logs command in the Cloud SDK documentation.

Metrics commands

The metrics commands allow you to view the metrics for a given Dataflow job.

Note: The metric command names are subject to change and certain metrics are subject to deletion.

You can use the list command to get information about the steps in your job.

gcloud dataflow metrics list $JOBID

For this command, the gcloud tool returns:

---
name:
  name: s09-s14-start-msecs
  origin: dataflow/v1b3
scalar: 137
updateTime: '2016-08-29T16:35:50.007Z'
---
name:
  context:
    output_user_name: WordCount.CountWords/Count.PerElement/Init-out0
  name: ElementCount
  origin: dataflow/v1b3
scalar: 26181
updateTime: '2016-08-29T16:35:50.007Z'
---
name:
  context:
    step: s2
  name: emptyLines
  origin: user
scalar: 1080
updateTime: '2016-08-29T16:35:50.007Z'
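
To pick out your own counters (origin: user), such as the emptyLines counter above, a simple shell filter over this YAML output is often enough; again, this is plain grep usage rather than a gcloud feature:

gcloud dataflow metrics list $JOBID | grep -B 1 -A 2 'origin: user'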

You can use the gcloud dataflow metrics list command to obtain tentative metrics while your job is running or shortly after it finishes. To view tentative metrics, run the command with the --tentative flag. A metric marked tentative is updated frequently as worker instances process your pipeline's data, but it can decrease if a worker experiences an error. Tentative metrics become committed values when a worker finishes processing and commits the results.
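
For example, while a job is still running you can poll its tentative values:

gcloud dataflow metrics list $JOBID --tentative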

For a complete list of metrics commands, see the gcloud dataflow metrics command in the Cloud SDK documentation.

Using commands with regional endpoints

The Dataflow command-line interface supports regional endpoints in gcloud tool version 176 and later. Use the --region option with any command to specify the regional endpoint that manages your job.

For example, gcloud dataflow jobs list lists jobs from all regions, while gcloud dataflow jobs list --region=europe-west1 lists only the jobs managed from europe-west1.
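
Because --region works with any command, the same pattern applies to the other command groups; for example, to describe a job and read its logs through the europe-west1 regional endpoint:

gcloud dataflow jobs describe $JOBID --region=europe-west1
gcloud dataflow logs list $JOBID --region=europe-west1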

Note: The --region option is required to obtain job information from a regional endpoint. If you do not specify a regional endpoint, us-central1 will be used as the default endpoint.