Submit a Job

You can submit a job via a Cloud Dataproc API jobs.submit request, using the Google Cloud SDK gcloud command-line tool, or from the Google Cloud Platform Console. You can also connect to a machine instance in your cluster using SSH, and then run a job from the instance.

Submitting a Cloud Dataproc job

gcloud command

To submit a job to a Cloud Dataproc cluster, use the Cloud SDK gcloud dataproc jobs submit command.
gcloud dataproc jobs submit job-command --cluster cluster-name \
  job-specific flags and args
PySpark job example
gsutil cp gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py .

# hello-world.py (the file copied above): sort and print two words with Spark.
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!'])
words = sorted(rdd.collect())
print(words)

gcloud dataproc jobs submit pyspark --cluster cluster-name hello-world.py

Copying file:///tmp/ [Content-Type=text/x-python]...
Job [dc1c28ac-c380-4d6c-a543-2a6ca43691eb] submitted.
Waiting for job output...
['Hello,', 'world!']
Job finished successfully.
Spark job example
gcloud dataproc jobs submit spark --cluster cluster-name \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

Job [54825071-ae28-4c5b-85a5-58fae6a597d6] submitted.
Waiting for job output…
Pi is roughly 3.14177148
Job finished successfully.
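Besides the gcloud tool, you can also submit jobs from code. The snippet below is a minimal sketch (an illustration, not part of this guide) that submits the same SparkPi job with the google-cloud-dataproc Python client library; your-project-id is a placeholder, and the exact submit_job signature may differ between client library versions.

# Minimal sketch: submit the SparkPi job through the google-cloud-dataproc
# Python client library. "your-project-id" and "cluster-name" are placeholders;
# the submit_job call signature can vary across client library versions.
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient()

job = {
    "placement": {"cluster_name": "cluster-name"},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

# Submit the job and print the job ID that Cloud Dataproc assigns to it.
response = client.submit_job(project_id="your-project-id", region="global", job=job)
print(response.reference.job_id)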


REST API

Use the Cloud Dataproc jobs.submit API to submit a job to a cluster. Here is a POST request that submits a Spark job to compute the approximate value of pi:
POST /v1/projects/vigilant-sunup-163401/regions/global/jobs:submit/
{
  "projectId": "vigilant-sunup-163401",
  "job": {
    "placement": {
      "clusterName": "cluster-1"
    },
    "reference": {
      "jobId": "d566957c-5bd1-464b-86a8-72a06907e493"
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    }
  }
}

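To illustrate calling this endpoint from code, here is a minimal Python sketch (an illustration, not part of this guide) that submits the same Spark job using the google-auth library's AuthorizedSession. It assumes Application Default Credentials are available (for example, via gcloud auth application-default login) and omits the optional job reference so the service assigns its own job ID.

# Minimal sketch: POST to the jobs:submit endpoint with google-auth's
# AuthorizedSession, which attaches OAuth credentials from the environment.
# The project ID and region are carried in the URL path; only the job goes
# in the request body here.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

body = {
    "job": {
        "placement": {"clusterName": "cluster-1"},
        "sparkJob": {
            "args": ["1000"],
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        },
    },
}

response = session.post(
    "https://dataproc.googleapis.com/v1/projects/vigilant-sunup-163401"
    "/regions/global/jobs:submit",
    json=body,
)
print(response.json())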

Console

Open the Cloud Dataproc Submit a job page in the GCP Console.
Spark job example

To submit a sample Spark job, fill in the fields on the Submit a job page as follows:

  1. Select your Cluster name from the cluster list.
  2. Set Job type to Spark.
  3. Set Main class or jar to org.apache.spark.examples.SparkPi.
  4. Set Arguments to the single argument 1000.
  5. Add file:///usr/lib/spark/examples/jars/spark-examples.jar to Jar files:
    1. file:/// denotes a Hadoop LocalFileSystem scheme. Cloud Dataproc installed /usr/lib/spark/examples/jars/spark-examples.jar on the cluster's master node when it created the cluster.
    2. Alternatively, you can specify a Cloud Storage path (gs://your-bucket/your-jarfile.jar) or a Hadoop Distributed File System path (hdfs://path-to-jar.jar) to one of your jars.

Click Submit to start the job. Once the job starts, it is added to the Jobs list.

Click the Job ID to open the Jobs page, where you can view the job's driver output (see Accessing job driver output–CONSOLE). Since this job produces long output lines that exceed the width of the browser window, you can check the Line wrapping box to bring all output text within view and display the calculated result for pi.

You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below (for more information, see Accessing job driver output–GCLOUD COMMAND). Copy and paste your project ID as the value for the --project flag and your Job ID (shown on the Jobs list) as the final argument.

gcloud dataproc --project=project-id jobs wait job-id

Here are snippets from the driver output for the sample SparkPi job submitted above:

gcloud dataproc --project=spark-pi-demo jobs wait \
  c556b47a-4b46-4a94-9ba2-2dcee31167b2


2015-06-25 23:27:23,810 INFO [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 0 (reduce at
SparkPi.scala:35) finished in 21.169 s

2015-06-25 23:27:23,810 INFO [task-result-getter-3] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 0.0, whose tasks have all
completed, from pool

2015-06-25 23:27:23,819 INFO [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 0 finished: reduce at SparkPi.scala:35,
took 21.674931 s

Pi is roughly 3.14189648


Job [c556b47a-4b46-4a94-9ba2-2dcee31167b2] finished successfully.


Submit a job directly on your cluster

If you want to run a job directly on your cluster without using the Cloud Dataproc service, SSH into the cluster's master node and run the job from there.

SSH into the master instance

You can connect to a Compute Engine VM instance in your cluster by using SSH from the command line or from the GCP Console.

gcloud command

Use gcloud compute ssh to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix).

gcloud compute ssh cluster-name-m

The following snippet uses gcloud compute ssh to SSH into the master node of cluster-1.

gcloud compute ssh cluster-1-m
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1~...


Console

Use the GCP Console to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix).
  1. In the GCP Console, go to the VM Instances page.
  2. In the list of virtual machine instances, click SSH in the row of the master instance (-m suffix) that you want to connect to.

A browser window opens at your home directory on the master node.

Connected, host fingerprint: ssh-rsa ...
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 ...

Run a Spark command locally on your cluster

After establishing an SSH connection to the VM master instance, run the following commands to:

  1. Open a Spark shell.
  2. Run a simple Spark job to count the number of lines in a (seven-line) Python "hello-world" file located in a publicly accessible Cloud Storage bucket.
  3. Quit the shell.

    user@cluster-name-m:~$ spark-shell
    scala> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    + "/src/pyspark/hello-world/hello-world.py").count
    res0: Long = 7
    scala> :quit
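
For comparison, roughly the same count can be run from the master node with the pyspark shell instead of spark-shell (a sketch, not part of the original walkthrough):

    user@cluster-name-m:~$ pyspark
    >>> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    ...             "/src/pyspark/hello-world/hello-world.py").count()
    7
    >>> exit()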
