When you execute your pipeline using the Dataflow managed service, you can obtain information about your Dataflow job by using the Dataflow command-line interface. The Dataflow command-line interface is part of the gcloud command-line tool in the Cloud SDK.
Note: If you'd rather view and interact with your Dataflow jobs using the web-based UI, use the Dataflow monitoring interface.
Installing the Dataflow command-line component
To use the Dataflow command-line interface, you first need to install the gcloud tool.
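After the Cloud SDK is installed, you can verify that the gcloud tool is available from your shell before running any Dataflow commands (a quick sanity check; the exact output depends on your installation):
gcloud version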
Running the available commands
You interact with the Dataflow command-line interface by running its available commands. All Dataflow commands begin with the following, which you type into your shell or terminal:
gcloud dataflow
The Dataflow command-line interface has three major subcommands: jobs, logs, and metrics.
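Each of these command groups, like any gcloud command, also accepts the standard --help flag, which you can use to explore the available subcommands and options from the terminal:
gcloud dataflow --help
gcloud dataflow jobs --help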
Jobs commands
The jobs subcommand group lets you view and interact with the Dataflow jobs in your Google Cloud project. You can use these commands to view a list of your jobs, cancel a job, show a description of a specific job, and more. For example, to view a list of all your Dataflow jobs, type the following command into your shell or terminal:
gcloud dataflow jobs list
The gcloud tool returns a list of your current jobs, as follows:
ID                                        NAME                                  TYPE   CREATION_TIME        STATE   REGION
2015-06-03_16_39_22-4020553808241078833   wordcount-janedoe-0603233849          Batch  2015-06-03 16:39:22  Done    us-central1
2015-06-03_16_38_28-4363652261786938862   wordcount-johndoe-0603233820          Batch  2015-06-03 16:38:28  Done    us-central1
2015-05-21_16_24_11-17823098268333533078  bigquerytornadoes-johndoe-0521232402  Batch  2015-05-21 16:24:11  Done    europe-west1
2015-05-21_13_38_06-16409850040969261121  bigquerytornadoes-johndoe-0521203801  Batch  2015-05-21 13:38:06  Done    us-central1
2015-05-21_13_17_18-18349574013243942260  bigquerytornadoes-johndoe-0521201710  Batch  2015-05-21 13:17:18  Done    europe-west1
2015-05-21_12_49_37-9791290545307959963   wordcount-johndoe-0521194928          Batch  2015-05-21 12:49:37  Done    us-central1
2015-05-20_15_54_51-15905022415025455887  wordcount-johndoe-0520225444          Batch  2015-05-20 15:54:51  Failed  us-central1
2015-05-20_15_47_02-14774624590029708464  wordcount-johndoe-0520224637          Batch  2015-05-20 15:47:02  Done    us-central1
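As one example of interacting with a job, you can cancel a job by its ID with the cancel subcommand; the sketch below assumes that $JOBID holds the ID of a job that is still running:
gcloud dataflow jobs cancel $JOBID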
Using the job ID, you can run the describe command to display more information about a job.
export JOBID=<X>
gcloud dataflow jobs describe $JOBID
For example, if you run the command for job ID 2015-02-09_11_39_40-15635991037808002875, the gcloud tool returns the following information:
createTime: '2015-02-09T19:39:41.140Z'
currentState: JOB_STATE_DONE
currentStateTime: '2015-02-09T19:56:39.510Z'
id: 2015-02-09_11_39_40-15635991037808002875
name: tfidf-bchambers-0209193926
projectId: google.com:clouddfe
type: JOB_TYPE_BATCH
You can run the command with the --format=json option to format the result as JSON.
gcloud --format=json dataflow jobs describe $JOBID
The gcloud tool returns the following formatted information:
{ "createTime": "2015-02-09T19:39:41.140Z", "currentState": "JOB_STATE_DONE", "currentStateTime": "2015-02-09T19:56:39.510Z", "id": "2015-02-09_11_39_40-15635991037808002875", "name": "tfidf-bchambers-0209193926", "projectId": "google.com:clouddfe", "type": "JOB_TYPE_BATCH" }
For a complete list of jobs commands, see the gcloud dataflow jobs command in the Cloud SDK documentation.
Logs commands
The logs commands display log entries for jobs run on the Dataflow service.
For example, you can use the list command to print the logs that provide information about what your job is doing.
export JOBID=<X>
gcloud dataflow logs list $JOBID
For job ID 2015-02-09_11_39_40-15635991037808002875, the gcloud tool returns:
Listed 0 items.
In this example, no logs showed up at the default severity (Warning). You can include the BASIC logs by running the list command with the --importance=detailed option.
gcloud dataflow logs list $JOBID --importance=detailed
The gcloud tool prints out the following logs:
d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d72606f7 (39b2a31f5e883423): Starting worker pool synchronously
d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d7260871 (39b2a31f5e883ce9): Worker pool is running
d 2016-08-29T09:33:28 2015-02-09_11_39_40-15635991037808002875_00000156d7260874 (39b2a31f5e883b77): Executing operation Count.PerElement/Sum.PerKey/GroupByKey/GroupByKeyOnly…
...
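The --importance option sets the minimum level of log entries to display, so you can also filter the output more aggressively. For example, the following sketch shows only error-level entries (assuming the error level is accepted by your installed gcloud version):
gcloud dataflow logs list $JOBID --importance=error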
For a complete list of logs commands, see the gcloud dataflow logs command in the Cloud SDK documentation.
Metrics commands
The metrics commands allow you to view the metrics for a given Dataflow job.
Note: The metrics command names are subject to change, and certain metrics are subject to deletion.
You can use the list command to get information about the steps in your job.
gcloud dataflow metrics list $JOBID
For this command, the gcloud tool returns:
---
name:
  name: s09-s14-start-msecs
  origin: dataflow/v1b3
scalar: 137
updateTime: '2016-08-29T16:35:50.007Z'
---
name:
  context:
    output_user_name: WordCount.CountWords/Count.PerElement/Init-out0
  name: ElementCount
  origin: dataflow/v1b3
scalar: 26181
updateTime: '2016-08-29T16:35:50.007Z'
---
name:
  context:
    step: s2
  name: emptyLines
  origin: user
scalar: 1080
updateTime: '2016-08-29T16:35:50.007Z'
...
You can use the gcloud dataflow metrics list command to obtain tentative metrics while your job is running or shortly after it finishes. To view tentative metrics, run the command with the --tentative flag. A metric marked tentative is updated frequently as worker instances process your pipeline's data. It may decrease if a worker experiences an error. Tentative metrics become committed values as a worker finishes work and commits the results.
For a complete list of metrics commands, see the gcloud dataflow metrics command in the Cloud SDK documentation.
Using commands with regional endpoints
The Dataflow command-line interface supports regional endpoints in gcloud tool version 176 and later. Use the --region option with any command to specify the regional endpoint that manages your job.
For example, gcloud dataflow jobs list lists jobs from all regions, but gcloud dataflow jobs list --region=europe-west1 lists only jobs managed from europe-west1.
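Similarly, you can pass --region to the other command groups; the following sketch describes a job that is managed from the europe-west1 regional endpoint (assuming $JOBID identifies a job in that region):
gcloud dataflow jobs describe $JOBID --region=europe-west1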
Note: The --region option is required to obtain job information from a regional endpoint. If you do not specify a regional endpoint, us-central1 is used as the default endpoint.