This tutorial includes a Cloud Shell walkthrough that uses the Google Cloud client libraries for Python to programmatically call Dataproc gRPC APIs to create a cluster and submit a job to the cluster.
The following sections explain the operation of the walkthrough code contained in the GoogleCloudPlatform/python-docs-samples/dataproc repository on GitHub.
Run the Cloud Shell walkthrough
Click Open in Google Cloud Shell to run the walkthrough.
Understand the Python example code
Application Default Credentials
The Cloud Shell walkthrough in this tutorial provides authentication by using your Google Cloud project credentials. When you run code locally, the recommended practice is to use service account credentials to authenticate your code.
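In a Python client, Application Default Credentials can be resolved explicitly with google.auth.default(); the sketch below is illustrative only, since the Dataproc client libraries pick up these credentials automatically.

```python
import google.auth

# Resolve Application Default Credentials. In Cloud Shell these come from
# your Google Cloud project credentials; the client libraries do this
# automatically, so an explicit call is only useful if you also want the
# resolved project ID.
credentials, project_id = google.auth.default()
```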
Get cluster and job clients
Two Google Cloud Dataproc API clients are needed to run the tutorial code: a ClusterControllerClient to call clusters gRPC APIs and a JobControllerClient to call jobs gRPC APIs. If the cluster to create and run jobs on is in the Dataproc global region, the code uses the default gRPC endpoint. If the cluster region is non-global, a regional gRPC endpoint is used.
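The following sketch shows one way to construct the two clients with the google-cloud-dataproc library; the region value is an illustrative placeholder, and project_id is assumed to come from the credentials sketch above.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder; use "global" for the global region

if region == "global":
    # The global region uses the default gRPC endpoint.
    cluster_client = dataproc_v1.ClusterControllerClient()
    job_client = dataproc_v1.JobControllerClient()
else:
    # Non-global regions use a regional gRPC endpoint.
    endpoint = f"{region}-dataproc.googleapis.com:443"
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": endpoint}
    )
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": endpoint}
    )
```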
List Dataproc clusters
You can list clusters within a project by calling the ListClusters API. The response is a JSON object that lists the clusters. You can traverse the JSON response to print cluster details.
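A minimal sketch of listing clusters, assuming the cluster_client, project_id, and region from the previous sketches:

```python
# List clusters in the project and region, printing a short summary per cluster.
for cluster in cluster_client.list_clusters(
    request={"project_id": project_id, "region": region}
):
    print(f"{cluster.cluster_name} - {cluster.status.state.name}")
```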
Create a Dataproc cluster
You can create a new Dataproc cluster with the CreateCluster API.
You must specify the following values when creating a cluster:
- The project in which the cluster will be created
- The name of the cluster
- The region to use. If you specify the global region (the tutorial code uses a --global_region flag to select the global region), you must also specify a zone (see zone_uri). If you specify a non-global region and leave the zone_uri field empty, Dataproc Auto Zone Placement will select a zone for your cluster.
You can also override default cluster config settings. For example, you can specify the number of workers (default = 2), whether to use preemptible VMs (default = 0), and network settings (default = the default network). See CreateClusterRequest for more information.
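A minimal sketch of a CreateCluster call, assuming the clients set up earlier; the cluster name and machine types are illustrative placeholders.

```python
# Define a small cluster: one master and two workers (placeholder settings).
cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# CreateCluster returns a long-running operation; wait for it to finish.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```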
Submit a job to a Dataproc cluster
You can submit a job to an existing cluster with the SubmitJob API. When you submit a job, it runs asynchronously.
To submit a job, you must specify the following information:
- The name of the cluster to which the job will be submitted
- The region to use
- The type of job being submitted (such as Hadoop, Spark, or PySpark)
- Job details for the type of job being submitted (see SubmitJobRequest for more information)
The following code submits a PySpark Job to a cluster.
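A minimal sketch of such a submission, assuming the job_client created earlier; the cluster name and Cloud Storage path are illustrative placeholders.

```python
# Submit a PySpark job to an existing cluster. SubmitJob returns immediately;
# the job itself runs asynchronously on the cluster.
job = {
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your_job.py"},
}

job_response = job_client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
job_id = job_response.reference.job_id
print(f"Submitted job {job_id}")
```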
By default, the tutorial code runs the following small PySpark job.
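The tutorial's actual job lives in the sample repository; the sketch below is a comparable illustrative PySpark script that sorts a short list of words.

```python
# Illustrative small PySpark job (the tutorial's sample may differ):
# distribute a few words and print them in sorted order.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-example").getOrCreate()
words = spark.sparkContext.parallelize(["scala", "java", "python", "go"])
print(sorted(words.collect()))
```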
Because the job runs asynchronously, its output is displayed only after the job finishes. You can call GetJob while the job is running to check the JobStatus, and again after the job completes to get job details.
Get job status and details
Make a GetJobRequest with the following required information:
- The project of the cluster where the job was submitted
- The cluster region
- The job ID (UUID)
The following code checks a job's status, and returns job details when the job completes.
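A minimal polling sketch, assuming the job_client and identifiers from the earlier sketches; the actual tutorial code may structure this differently.

```python
import time

# Poll GetJob until the job reaches a terminal state, then print its details.
while True:
    job = job_client.get_job(
        request={"project_id": project_id, "region": region, "job_id": job_id}
    )
    if job.status.state in (
        dataproc_v1.JobStatus.State.DONE,
        dataproc_v1.JobStatus.State.ERROR,
        dataproc_v1.JobStatus.State.CANCELLED,
    ):
        print(f"Job finished with state {job.status.state.name}")
        print(job)
        break
    time.sleep(5)
```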
Delete a Dataproc cluster
Call the DeleteCluster API to delete a cluster.
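A minimal sketch of a DeleteCluster call, assuming the cluster_client and placeholder names from the earlier sketches.

```python
# DeleteCluster returns a long-running operation; wait for deletion to finish.
operation = cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": "example-cluster",
    }
)
operation.result()
print("Cluster deleted")
```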