This tutorial includes a Cloud Shell walkthrough that uses the Google Cloud client libraries for Python to programmatically call Dataproc gRPC APIs to create a cluster and submit a job to the cluster.
The following sections explain the walkthrough code, which is contained in the GitHub GoogleCloudPlatform/python-dataproc repository.
Run the Cloud Shell walkthrough
Click Open in Cloud Shell to run the walkthrough.
Understand the code
Application Default Credentials
The Cloud Shell walkthrough in this tutorial authenticates by using your Google Cloud project credentials. When you run code locally, the recommended practice is to authenticate with service account credentials.
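The client libraries discover these credentials through Application Default Credentials (ADC). As a minimal sketch, assuming the google-auth package is installed (the key filename below is a placeholder, not a file from the repository):

```python
import google.auth
from google.oauth2 import service_account

# In Cloud Shell, Application Default Credentials resolve automatically
# to the credentials of your logged-in Google Cloud project.
credentials, project_id = google.auth.default()

# When running locally, you can instead load a service account key file
# explicitly (the filename below is a placeholder).
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json"
)
```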
Create a Dataproc cluster
The following values are set to create the cluster:
- The project in which the cluster will be created
- The region where the cluster will be created
- The name of the cluster
- The cluster config, which specifies one master and two primary workers
Default config settings are used for the remaining cluster settings, and you can override them. For example, you can add secondary workers (default = 0) or specify a non-default VPC network for the cluster. For more information, see CreateCluster.
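The following is a minimal sketch of this step using the google-cloud-dataproc Python client; the project_id, region, cluster_name, and machine type values are placeholders rather than values taken from the repository:

```python
from google.cloud import dataproc_v1 as dataproc

project_id = "your-project-id"      # placeholder
region = "us-central1"              # placeholder
cluster_name = "your-cluster-name"  # placeholder

# A regional endpoint routes the request to the region that will
# host the cluster.
cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# One master and two primary workers; all other settings take defaults.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks
# until the cluster is ready.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```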
Submit a job
The following values are set to submit the job:
- The project of the cluster that the job is submitted to
- The region where the cluster is located
- The job config, which specifies the cluster name and the Cloud Storage filepath (URI) of the PySpark job
For more information, see SubmitJob.
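As a sketch, reusing the placeholder project_id, region, and cluster_name values from the cluster-creation sketch above and assuming a PySpark script already uploaded to Cloud Storage (the gs:// URI below is a placeholder):

```python
from google.cloud import dataproc_v1 as dataproc

job_client = dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The job config names the target cluster and points at the PySpark
# script in Cloud Storage.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your-job.py"},
}

# submit_job_as_operation returns a long-running operation; result()
# blocks until the job finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished with state: {response.status.state.name}")
```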
Delete the cluster
The following values are set to delete the cluster:
- The project that contains the cluster
- The region where the cluster is located
- The name of the cluster to delete
For more information, see DeleteCluster.
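A minimal sketch of the deletion call, again reusing the placeholder values and the regional-endpoint client shown in the sketches above:

```python
from google.cloud import dataproc_v1 as dataproc

cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# delete_cluster also returns a long-running operation; result() blocks
# until the cluster and its VMs are removed.
operation = cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
    }
)
operation.result()
print(f"Cluster {cluster_name} deleted")
```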