Submit a Job

You can submit a job to an existing Cloud Dataproc cluster through a Cloud Dataproc API jobs.submit HTTP or programmatic request, with the Cloud SDK gcloud command-line tool in a local terminal window or in Cloud Shell, or from the Google Cloud Platform Console opened in a local browser. You can also SSH into the master instance in your cluster and then run a job directly from that instance without using the Cloud Dataproc service.
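
For example, the jobs.submit request can be made programmatically through the Cloud Dataproc client libraries. The snippet below is a minimal sketch using the Python client library (google-cloud-dataproc); the project ID, region, and cluster name are illustrative placeholders, and the exact client surface can vary by library version.

# Minimal sketch: submit the SparkPi example job with the Python client
# library. "your-project-id", "us-central1", and "cluster-1" are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"
cluster_name = "cluster-1"

# The job controller endpoint is regional.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

# Returns the created Job resource, including the server-assigned job ID.
submitted_job = job_client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print("Submitted job", submitted_job.reference.job_id)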

Submitting a Cloud Dataproc job

gcloud command

To submit a job to a Cloud Dataproc cluster, run the Cloud SDK gcloud dataproc jobs submit command locally in a terminal window or in Cloud Shell.
gcloud dataproc jobs submit job-command \
  --cluster cluster-name --region region \
  job-specific flags and args
PySpark job submit example
  1. List the contents of the publicly accessible hello-world.py file located in Cloud Storage.
    gsutil cat gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
    
    File Listing:
    #!/usr/bin/python
    import pyspark
    sc = pyspark.SparkContext()
    rdd = sc.parallelize(['Hello,', 'world!'])
    words = sorted(rdd.collect())
    print(words)
    
  2. Submit the PySpark job to Cloud Dataproc.
    gcloud dataproc jobs submit pyspark \
        --cluster cluster-name --region region \
        gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
    
    Terminal output:
    Waiting for job output...
    …
    ['Hello,', 'world!']
    Job finished successfully.
    
Spark job submit example
  1. Run the SparkPi example pre-installed on the Cloud Dataproc cluster's master node.
    gcloud dataproc jobs submit spark \
        --cluster cluster-name --region region \
        --class org.apache.spark.examples.SparkPi \
        --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
    
    Terminal output:
    Job [54825071-ae28-4c5b-85a5-58fae6a597d6] submitted.
    Waiting for job output…
    …
    Pi is roughly 3.14177148
    …
    Job finished successfully.
    …
    

REST API

Use the Cloud Dataproc jobs.submit API to submit a job to a cluster. Here is an HTTP POST request to submit a Spark job to compute the approximate value of pi:
POST /v1/projects/vigilant-sunup-163401/regions/global/jobs:submit
{
  "projectId": "vigilant-sunup-163401",
  "job": {
    "placement": {
      "clusterName": "cluster-1"
    },
    "reference": {
      "jobId": "d566957c-5bd1-464b-86a8-72a06907e493"
    },
    "sparkJob": {
      "args": [
        "1000"
      ],
      "mainClass": "org.apache.spark.examples.SparkPi",
      "jarFileUris": [
        "file:///usr/lib/spark/examples/jars/spark-examples.jar"
      ]
    }
  }
}
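
As a rough sketch, the same request can be sent from Python over HTTP using application default credentials (via the google-auth library). The project ID and cluster name below mirror the placeholder values in the request body above; the reference.jobId field is omitted so the service generates one.

# Minimal sketch: POST the jobs:submit request above with application
# default credentials. Values mirror the sample request body.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

project_id = "vigilant-sunup-163401"
region = "global"
url = (
    "https://dataproc.googleapis.com/v1/projects/"
    f"{project_id}/regions/{region}/jobs:submit"
)
body = {
    "projectId": project_id,
    "job": {
        "placement": {"clusterName": "cluster-1"},
        "sparkJob": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
        },
    },
}

response = session.post(url, json=body)
response.raise_for_status()
print(response.json())  # the created Job resource, including its job reference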

Console

Open the Cloud Dataproc Submit a job page in the GCP Console in your browser.
Spark job example

To submit a sample Spark job, fill in the fields on the Submit a job page as follows:

  1. Select your Cluster name from the cluster list.
  2. Set Job type to Spark.
  3. Set Main class or jar to org.apache.spark.examples.SparkPi.
  4. Set Arguments to the single argument 1000.
  5. Add file:///usr/lib/spark/examples/jars/spark-examples.jar to Jar files:
    1. file:/// denotes a Hadoop LocalFileSystem scheme. Cloud Dataproc installed /usr/lib/spark/examples/jars/spark-examples.jar on the cluster's master node when it created the cluster.
    2. Alternatively, you can specify a Cloud Storage path (gs://your-bucket/your-jarfile.jar) or a Hadoop Distributed File System path (hdfs://path-to-jar.jar) to one of your jars.

Click Submit to start the job. Once the job starts, it is added to the Jobs list.

Click the Job ID to open the Jobs page, where you can view the job's driver output (see Accessing job driver output–CONSOLE). Since this job produces long output lines that exceed the width of the browser window, you can check the Line wrapping box to bring all output text within view and display the calculated result for pi.

You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below (for more information, see Accessing job driver output–GCLOUD COMMAND). Copy and paste your project ID as the value for the --project flag and your Job ID (shown on the Jobs list) as the final argument.

gcloud dataproc --project=project-id jobs wait job-id

Here are snippets from the driver output for the sample SparkPi job submitted above:

gcloud dataproc --project=spark-pi-demo jobs wait \
  c556b47a-4b46-4a94-9ba2-2dcee31167b2

...

2015-06-25 23:27:23,810 INFO [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 0 (reduce at
SparkPi.scala:35) finished in 21.169 s

2015-06-25 23:27:23,810 INFO [task-result-getter-3] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 0.0, whose tasks have all
completed, from pool

2015-06-25 23:27:23,819 INFO [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 0 finished: reduce at SparkPi.scala:35,
took 21.674931 s

Pi is roughly 3.14189648

...

Job [c556b47a-4b46-4a94-9ba2-2dcee31167b2] finished successfully.

driverOutputUri:
gs://sample-staging-bucket/google-cloud-dataproc-metainfo/cfeaa033-749e-48b9-...
...

Submit a job directly on your cluster

If you want to run a job directly on your cluster without using the Cloud Dataproc service, SSH into the master node of your cluster, then run the job on the master node.

SSH into the master instance

You can connect to a Compute Engine VM instance in your cluster by using SSH from the command line or from the GCP Console.

gcloud command

Run the gcloud compute ssh command in a local terminal window or from Cloud Shell to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix).

gcloud compute ssh --project=project-id cluster-name-m

The following snippet uses gcloud compute ssh to SSH into the master node of cluster-1.

gcloud compute ssh --project=my-project-id cluster-1-m
...
Linux cluster-1-m 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6...
...
user@cluster-1-m:~$

Console

Use the GCP Console to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix).
  1. In the GCP Console, go to the VM Instances page.
  2. In the list of virtual machine instances, click SSH in the row of the master instance (-m suffix) that you want to connect to.

A browser window opens at your home directory on the master node.

Connected, host fingerprint: ssh-rsa ...
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 ...
...
user@cluster-1-m:~$

Run a Spark job on the master node

After establishing an SSH connection to the VM master instance, run commands in a terminal window on the cluster's master node to:

  1. Open a Spark shell.
  2. Run a simple Spark job to count the number of lines in the (seven-line) Python "hello-world" file stored in publicly accessible Cloud Storage.
  3. Quit the shell.

    user@cluster-name-m:~$ spark-shell
    ...
    scala> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    + "/src/pyspark/hello-world/hello-world.py").count
    ...
    res0: Long = 7
    scala> :quit
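
The same line count can also be run from the PySpark shell instead of the Scala shell. The following is a minimal sketch, assuming the pre-installed pyspark command, which starts a shell with a SparkContext available as sc:

    user@cluster-name-m:~$ pyspark
    ...
    >>> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    ...             "/src/pyspark/hello-world/hello-world.py").count()
    7
    >>> exit()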
    
