When you run your pipeline using the Dataflow-managed service, you can obtain information about your Dataflow job by using the Dataflow command-line interface. The Dataflow command-line interface is part of the gcloud command-line tool in the Google Cloud CLI.
If you'd rather view and interact with your Dataflow jobs using Google Cloud console, use the Dataflow monitoring interface.
Installing the Dataflow command-line component
To use the Dataflow command-line interface from your local terminal, install and configure Google Cloud CLI.
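For example, one possible one-time setup after installing the gcloud CLI (PROJECT_ID is a placeholder for your project ID) is:
gcloud init
gcloud config set project PROJECT_ID
The init command authenticates you and creates a default configuration; setting the project determines which project your Dataflow commands act on.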
For Cloud Shell, the Dataflow command-line interface is automatically available.
Running the commands
You interact with the Dataflow command-line interface by running the available commands. To see the list of available Dataflow commands, type the following command into your shell or terminal:
gcloud dataflow --help
As seen in the output, the Dataflow command has the following four groups: flex-template, jobs, snapshots, and sql.
Flex Template commands
The flex-template sub-command group enables you to work with Dataflow Flex Templates. The following operations are supported:
- build: Builds a Flex Template file from the specified parameters.
- run: Runs a job from the specified path.
To run a template, you must create a template specification file that is stored in a Cloud Storage bucket. The template specification file contains all the necessary information to run the job, such as the SDK information and metadata. Additionally, the metadata.json file contains information about the template such as the name, description, and input parameters. After creating the template specification file, you can then build the Flex Template by using either Java or Python.
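As a minimal sketch, assuming a Java pipeline (the bucket, Artifact Registry path, JAR, main class, and output parameter below are placeholders rather than values from this page), building a Flex Template and then running a job from it might look like the following:
gcloud dataflow flex-template build gs://BUCKET_NAME/my-template.json \
    --image-gcr-path=LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/my-template:latest \
    --sdk-language=JAVA \
    --flex-template-base-image=JAVA11 \
    --metadata-file=metadata.json \
    --jar=target/my-pipeline-bundled-1.0.jar \
    --env=FLEX_TEMPLATE_JAVA_MAIN_CLASS=com.example.MyPipeline
gcloud dataflow flex-template run my-template-job \
    --template-file-gcs-location=gs://BUCKET_NAME/my-template.json \
    --region=us-central1 \
    --parameters=output=gs://BUCKET_NAME/output-
The build command writes the template specification file to the Cloud Storage path you give it, and the run command launches a job from that file.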
For information about creating and running a Flex Template using the Google Cloud CLI, see the tutorial Build and run Flex Templates.
Jobs commands
The jobs sub-command group enables you to work with Dataflow jobs in your project. The following operations are supported:
- cancel: Cancels all jobs that match the command-line arguments.
- describe: Outputs the Job object resulting from the Get API.
- drain: Drains all jobs that match the command-line arguments.
- list: Lists all jobs in a particular project, optionally filtered by region.
- run: Runs a job from the specified path.
- show: Shows a short description of the given job.
To get a list of all the Dataflow jobs in your project, run the following command in your shell or terminal:
gcloud dataflow jobs list
The command returns a list of your current jobs. The following is a sample output:
ID                                        NAME                                  TYPE   CREATION_TIME        STATE   REGION
2015-06-03_16_39_22-4020553808241078833   wordcount-janedoe-0603233849          Batch  2015-06-03 16:39:22  Done    us-central1
2015-06-03_16_38_28-4363652261786938862   wordcount-johndoe-0603233820          Batch  2015-06-03 16:38:28  Done    us-central1
2015-05-21_16_24_11-17823098268333533078  bigquerytornadoes-johndoe-0521232402  Batch  2015-05-21 16:24:11  Done    europe-west1
2015-05-21_13_38_06-16409850040969261121  bigquerytornadoes-johndoe-0521203801  Batch  2015-05-21 13:38:06  Done    us-central1
2015-05-21_13_17_18-18349574013243942260  bigquerytornadoes-johndoe-0521201710  Batch  2015-05-21 13:17:18  Done    europe-west1
2015-05-21_12_49_37-9791290545307959963   wordcount-johndoe-0521194928          Batch  2015-05-21 12:49:37  Done    us-central1
2015-05-20_15_54_51-15905022415025455887  wordcount-johndoe-0520225444          Batch  2015-05-20 15:54:51  Failed  us-central1
2015-05-20_15_47_02-14774624590029708464  wordcount-johndoe-0520224637          Batch  2015-05-20 15:47:02  Done    us-central1
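To narrow the list, the list command also accepts filters. For example, assuming the --status and --region flags available in current gcloud CLI releases, the following lists only active jobs in us-central1:
gcloud dataflow jobs list --status=active --region=us-central1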
Using the job ID displayed for each job, you can run the describe command to display more information about a job.
gcloud dataflow jobs describe JOB_ID
Replace JOB_ID with the job ID of one of the Dataflow jobs from your project.
For example, if you run the command for job ID 2015-02-09_11_39_40-15635991037808002875, the following is a sample output:
createTime: '2015-02-09T19:39:41.140Z'
currentState: JOB_STATE_DONE
currentStateTime: '2015-02-09T19:56:39.510Z'
id: 2015-02-09_11_39_40-15635991037808002875
name: tfidf-bchambers-0209193926
projectId: google.com:clouddfe
type: JOB_TYPE_BATCH
To format the result into JSON, run the command with the --format=json option:
gcloud --format=json dataflow jobs describe JOB_ID
Replace JOB_ID with the job ID of one of the Dataflow jobs from your project.
The following sample output is formatted as JSON:
{
  "createTime": "2015-02-09T19:39:41.140Z",
  "currentState": "JOB_STATE_DONE",
  "currentStateTime": "2015-02-09T19:56:39.510Z",
  "id": "2015-02-09_11_39_40-15635991037808002875",
  "name": "tfidf-bchambers-0209193926",
  "projectId": "google.com:clouddfe",
  "type": "JOB_TYPE_BATCH"
}
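The jobs group also covers stopping work. As a sketch, assuming a job running in us-central1 (JOB_ID is a placeholder), you can cancel the job immediately or drain it; draining lets a streaming job finish processing buffered data before stopping and is not available for batch jobs:
gcloud dataflow jobs cancel JOB_ID --region=us-central1
gcloud dataflow jobs drain JOB_ID --region=us-central1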
Snapshots commands
The snapshots sub-command group enables you to work with Dataflow snapshots. The following operations are supported:
- create: Creates a snapshot for a Dataflow job, as shown in the example after this list.
- delete: Deletes a Dataflow snapshot.
- describe: Describes a Dataflow snapshot.
- list: Lists all Dataflow snapshots in a project in the specified region, optionally filtered by job ID.
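For example, a minimal sketch of creating a snapshot of a running streaming job and then listing the snapshots for that job (JOB_ID and us-central1 are placeholders):
gcloud dataflow snapshots create --job-id=JOB_ID --region=us-central1
gcloud dataflow snapshots list --job-id=JOB_ID --region=us-central1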
For more information about using snapshots in Dataflow, see Using Dataflow snapshots.
SQL commands
The sql sub-command group enables you to work with Dataflow SQL. The gcloud dataflow sql query command accepts and runs a user-specified SQL query on Dataflow.
For example, to run a simple SQL query on a Dataflow job that reads from a BigQuery dataset and writes to another BigQuery dataset, run the following in your shell or terminal:
gcloud dataflow sql query 'SELECT word FROM bigquery.table.PROJECT_ID.input_dataset.input_table where count > 3' \
    --job-name=JOB_NAME \
    --region=us-west1 \
    --bigquery-dataset=OUTPUT_DATASET \
    --bigquery-table=OUTPUT_TABLE
Replace the following:
- PROJECT_ID: the globally unique ID of your Google Cloud project
- JOB_NAME: a name for your Dataflow job
- OUTPUT_DATASET: a name for the output dataset
- OUTPUT_TABLE: a name for the output table
Starting a Dataflow SQL job might take several minutes. You cannot update the job after creating it. Dataflow SQL jobs use autoscaling, and Dataflow automatically chooses the execution mode to be either batch or streaming. You cannot control this behavior for Dataflow SQL jobs. To stop Dataflow SQL jobs, use the cancel command. Stopping a Dataflow SQL job with drain is not supported.
For more information about using SQL commands for Dataflow, see the Dataflow SQL reference and gcloud dataflow sql query documentation.
Using commands with regions
The Dataflow command-line interface supports regions starting in gcloud CLI version 176. Use the --region option with any command to specify the region that manages your job.
For example, gcloud dataflow jobs list lists jobs from all regions, but gcloud dataflow jobs list --region=europe-west1 only lists jobs managed from europe-west1.
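The same flag applies to commands that act on a single job. For example, to describe a job that runs in europe-west1 (JOB_ID is a placeholder):
gcloud dataflow jobs describe JOB_ID --region=europe-west1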