Submit a Job

You can submit a job via a Cloud Dataproc API jobs.submit request, with the Google Cloud SDK gcloud command-line tool, or from the Google Cloud Platform Console. You can also connect to a VM instance in your cluster over SSH and run a job directly from the instance.
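
For the API path, the following minimal Python sketch shows one way to call jobs.submit through the Google API Python client library. It is illustrative only: it assumes the google-api-python-client package is installed and Application Default Credentials are configured, and the project ID, region, cluster name, and gs:// path are placeholders to replace with your own values.

#!/usr/bin/python
# Minimal jobs.submit sketch using the Google API Python client library.
# Assumptions: google-api-python-client is installed and Application
# Default Credentials are available; all identifiers below are placeholders.
from googleapiclient.discovery import build

project_id = 'your-project-id'        # placeholder project ID
region = 'global'                     # placeholder Dataproc region
cluster_name = 'my-dataproc-cluster'  # placeholder cluster name

dataproc = build('dataproc', 'v1')

job_details = {
    'projectId': project_id,
    'job': {
        'placement': {'clusterName': cluster_name},
        'pysparkJob': {
            # Main file staged in Cloud Storage; placeholder path.
            'mainPythonFileUri': 'gs://your-bucket/hello-world.py'
        }
    }
}

result = dataproc.projects().regions().jobs().submit(
    projectId=project_id, region=region, body=job_details).execute()
print('Submitted job: {}'.format(result['reference']['jobId']))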

Using the command line

Run one of the following examples to submit a sample job.

Submit a sample PySpark job

gsutil cp gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py .
cat hello-world.py

#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!'])
words = sorted(rdd.collect())
print words

gcloud dataproc jobs submit pyspark --cluster <my-dataproc-cluster> hello-world.py

Copying file:///tmp/hello-world.py [Content-Type=text/x-python]...
…
Job [dc1c28ac-c380-4d6c-a543-2a6ca43691eb] submitted.
Waiting for job output...
…
['Hello,', 'world!']
Job finished successfully.
…

Submit a sample Spark job

gcloud dataproc jobs submit spark --cluster <my-dataproc-cluster> \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

Job [54825071-ae28-4c5b-85a5-58fae6a597d6] submitted.
Waiting for job output…
…
Pi is roughly 3.14177148
…
Job finished successfully.
…

Using the Google Cloud Platform Console

To submit a job from the Cloud Platform Console after creating a cluster, open the console, select your project, and then click Continue. The first time you submit a job, a dialog appears.

Click Submit a job.

To submit a sample Spark job, fill in the fields on the Submit a job page as follows:

  • Select your Cluster name from the cluster list.
  • Set Job type to Spark.
  • Add file:///usr/lib/spark/examples/jars/spark-examples.jar to Jar files:
    • file:/// denotes a Hadoop LocalFileSystem scheme; Cloud Dataproc installed /usr/lib/spark/examples/jars/spark-examples.jar on the cluster's master node when it created the cluster.
    • Alternatively, you can specify a Cloud Storage path (gs://your-bucket/your-jarfile.jar) or a Hadoop Distributed File System path (hdfs://examples/example.jar) to one of your jars (see the staging example after this list).
  • Set Main class or jar to org.apache.spark.examples.SparkPi.
  • Set Arguments to the single argument 1000.
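
For example, one way to stage your own jar in Cloud Storage (a sketch; your-jarfile.jar and gs://your-bucket are placeholders) is to copy it with gsutil and then reference the resulting gs:// path in the Jar files field:

gsutil cp your-jarfile.jar gs://your-bucket/your-jarfile.jar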

Click Submit to start the job. Once the job starts, it is added to the Jobs list.

Click the Job ID to open the Jobs page, where you can view the job's driver output (see Accessing job driver output—Cloud Platform Console) and confirm the job's configuration settings. Because this job produces output lines that exceed the width of the browser window, you can check the Line wrapping box to bring all of the output text, including the calculated value of pi, into view.

You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below (for more information, see Accessing job driver output—Google Cloud SDK). Copy and paste your project ID as the value for the --project flag and your Job ID (shown on the Jobs list) as the final argument.

gcloud dataproc --project=<your-project-id> jobs wait <your-job-id>
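
If you need to find a Job ID from the command line rather than the Console, gcloud can also list your project's Dataproc jobs (a sketch; available flags and output columns can vary by Cloud SDK version):

gcloud dataproc --project=<your-project-id> jobs list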

Here are snippets from the driver output for the sample SparkPi job submitted above:

gcloud dataproc --project=spark-pi-demo jobs wait \
  c556b47a-4b46-4a94-9ba2-2dcee31167b2

...

2015-06-25 23:27:23,810 INFO [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 0 (reduce at
SparkPi.scala:35) finished in 21.169 s

2015-06-25 23:27:23,810 INFO [task-result-getter-3] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 0.0, whose tasks have all
completed, from pool

2015-06-25 23:27:23,819 INFO [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 0 finished: reduce at SparkPi.scala:35,
took 21.674931 s

Pi is roughly 3.14189648

...

Job [c556b47a-4b46-4a94-9ba2-2dcee31167b2] finished successfully.

driverOutputUri:
gs://sample-staging-bucket/google-cloud-dataproc-metainfo/cfeaa033-749e-48b9-...
...

SSH into an instance

You can connect to a Compute Engine VM instance in your cluster by using SSH from the command line or from the Cloud Platform Console.

SSH using the command line

Use gcloud compute ssh to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix):

gcloud compute ssh <dataproc-cluster-name>-m

The following snippet uses gcloud compute ssh to SSH into the master node of the cluster-1 cluster:

gcloud compute ssh cluster-1-m
...
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1~...
...
user@cluster-1-m:~$

SSH using the Cloud Platform Console

Use the Cloud Platform Console to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix):

  1. In the Cloud Platform Console, go to the VM Instances page.

  2. In the list of virtual machine instances, click the SSH button in the row of the instance to which you want to connect.

A browser window opens at your home directory on the master node.

Connected, host fingerprint: ssh-rsa ...
Linux cluster-1-m 3.16.0-0.bpo.4-amd64 ...
...
user@cluster-1-m:~$

Run a Spark command on your instance

After establishing an SSH connection to the VM master instance, run the following commands to:

  1. Open a Spark shell.
  2. Run a simple Spark job that counts the number of lines in the (seven-line) Python "hello-world" file stored in Cloud Storage.
  3. Quit the shell.

    user@my-dataproc-cluster-m:~$ spark-shell
    ...
    scala> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    + "/src/pyspark/hello-world/hello-world.py").count
    ...
    res0: Long = 7
    scala> :quit
    
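A roughly equivalent PySpark session (a sketch, assuming the pyspark shell that ships with Spark is on the master node's PATH) produces the same count:

    user@my-dataproc-cluster-m:~$ pyspark
    ...
    >>> sc.textFile("gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f"
    ...             "/src/pyspark/hello-world/hello-world.py").count()
    7
    >>> exit()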

