This tutorial includes a Cloud Shell walkthrough that uses the Google Cloud client libraries for Python to programmatically call Dataproc gRPC APIs to create a cluster and submit a job to the cluster.
The following sections explain the operation of the walkthrough code contained in the GitHub GoogleCloudPlatform/python-dataproc repository.
Run the Cloud Shell walkthrough
Click Open in Google Cloud Shell to run the walkthrough.
Understand the Python example code
Application Default Credentials
The Cloud Shell walkthrough in this tutorial provides authentication by using your Google Cloud project credentials. When you run code locally, the recommended practice is to use service account credentials to authenticate your code.
Create a Dataproc cluster
You can create a new Dataproc cluster with the CreateCluster API.
You must specify the following values when creating a cluster:
- The project in which the cluster will be created
- The name of the cluster
- The region to use. If you specify the
global
region (the tutorial code uses a--global_region
flag to select the global region), you must also specify a zone (seezone_uri
). If you specify a non-global region and leave thezone_uri
field empty, Dataproc Auto Zone Placement will select a zone for your cluster.
You can also override default cluster config settings. For example, you
can specify the number of workers (default = 2), whether to use preemptible VMs
(default = 0), and network settings (default = default network
. See
CreateClusterRequest
for more information.
List Dataproc clusters
You can list clusters within a project by calling the ListClusters API. The output returns a JSON object that lists the clusters. You can transverse the JSON response to print cluster details.
Submit a job to a Dataproc cluster
You can submit a job to an existing cluster with the SubmitJob API. When you submit a job, it runs asynchronously.
To submit a job, you must specify the following information:
- The name of the cluster to which the job will be submitted
- The region to use
- The type of job being submitted (such as
Hadoop
,Spark
, 'PySpark) - Job details for the type of job being submitted (see SubmitJobRequest for more information).
The following code submits a Spark Job to a cluster.
By default, the tutorial code runs the SparkPi example job included with Spark.
Delete a Dataproc cluster
Call the DeleteCluster API to delete a cluster.