The Cloud Shell walkthrough in this tutorial authenticates by using your
Google Cloud project credentials. When you run code locally, the recommended
practice is to authenticate your code with service account credentials.
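For local runs, one way to supply service account credentials is to load a key file and pass the credentials to the client explicitly. The following is a minimal sketch, not part of the walkthrough: the helper names `regional_endpoint` and `make_cluster_client` and the `key_path` argument are hypothetical, and it assumes the `google-cloud-dataproc` and `google-auth` packages are installed.

```python
def regional_endpoint(region: str) -> str:
    """Return the regional Dataproc endpoint used by the clients in this tutorial."""
    return f"{region}-dataproc.googleapis.com:443"


def make_cluster_client(region: str, key_path: str):
    """Build a ClusterControllerClient that authenticates with a service account key.

    key_path is a hypothetical local path to a downloaded JSON key file.
    """
    # Imported lazily so regional_endpoint() stays usable without the Google libraries.
    from google.cloud import dataproc_v1
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file(key_path)
    return dataproc_v1.ClusterControllerClient(
        credentials=credentials,
        client_options={"api_endpoint": regional_endpoint(region)},
    )
```

Alternatively, setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the key file path lets the client libraries find the credentials without any code change.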
Create a Dataproc cluster
The following values are set to create the cluster:
The project in which the cluster will be created
The region where the cluster will be created
The name of the cluster
The cluster config, which specifies one master and two primary
workers
Default config settings are used for the remaining cluster settings.
You can override default cluster config settings. For example, you
can add secondary VMs (default = 0) or specify a non-default
VPC network for the cluster. For more information, see
CreateCluster.
import re

from google.cloud import dataproc_v1
from google.cloud import storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
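The overrides mentioned above (secondary VMs and a non-default VPC network) can be sketched as extra keys in the cluster config. This is an illustrative sketch under assumptions: the helper name `build_cluster` and its parameters are hypothetical, and the network value in the usage line is a placeholder. The `secondary_worker_config` and `gce_cluster_config.network_uri` field names follow the Dataproc `ClusterConfig` message.

```python
def build_cluster(project_id, cluster_name, num_secondary_workers=0, network_uri=None):
    """Return a cluster dict like the one in the walkthrough, with optional overrides."""
    config = {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    }
    # Secondary workers default to 0, so the key is only added when requested.
    if num_secondary_workers > 0:
        config["secondary_worker_config"] = {"num_instances": num_secondary_workers}
    # Omitting gce_cluster_config leaves the cluster on the default network.
    if network_uri is not None:
        config["gce_cluster_config"] = {"network_uri": network_uri}
    return {"project_id": project_id, "cluster_name": cluster_name, "config": config}


# Hypothetical values for illustration only.
cluster = build_cluster(
    "example-project",
    "example-cluster",
    num_secondary_workers=2,
    network_uri="projects/example-project/global/networks/example-network",
)
```

The resulting dict can be passed as the `cluster` value in the `create_cluster` request in place of the hard-coded config.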
Submit a job
The following values are set to submit the job:
The project in which the cluster was created
The region where the cluster was created
The job config, which specifies the cluster name and the Cloud Storage
filepath (URI) of the PySpark job
    # Create the job client.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the job config.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{spark_filename}"},
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()

    # Dataproc job output is saved to the Cloud Storage bucket
    # allocated to the job. Use regex to obtain the bucket and blob info.
    matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

    output = (
        storage.Client()
        .get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )

    print(f"Job finished successfully: {output}\r\n")
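The regex step in the sample splits the driver output URI into a bucket name and an object prefix, then appends `.000000000` to address the first output object. A small standalone sketch of that parsing follows. The helper names and the example URI are hypothetical, and the explicit error for a non-matching URI is an addition that the sample does not include.

```python
import re


def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object_path) using the sample's regex."""
    matches = re.match("gs://(.*?)/(.*)", uri)
    if matches is None:
        raise ValueError(f"Not a gs://bucket/object URI: {uri!r}")
    return matches.group(1), matches.group(2)


def first_output_blob(driver_output_resource_uri):
    """Return (bucket, blob_name) of the first driver output object."""
    bucket, prefix = split_gcs_uri(driver_output_resource_uri)
    return bucket, f"{prefix}.000000000"


# Hypothetical URI for illustration only.
bucket, blob_name = first_output_blob(
    "gs://example-staging-bucket/jobs/1234/driveroutput"
)
```

The lazy group `(.*?)` stops at the first slash after `gs://`, so everything after it, including further slashes, becomes the object path.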
Delete the cluster
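The following values are set to delete the cluster:
The project in which the cluster was created
The region where the cluster was created
The name of the cluster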
    # Delete the cluster once the job has terminated.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": cluster_name,
        }
    )
    operation.result()

    print(f"Cluster {cluster_name} successfully deleted.")
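In the sample, the delete call runs only if the job step returns normally. One possible variation, which is not what the walkthrough code does, is to wrap the job step in `try`/`finally` so the cluster is deleted even when the job raises. The sketch below uses hypothetical callables (`create`, `submit`, `delete`) as stand-ins for the three API steps.

```python
def run_then_cleanup(create, submit, delete):
    """Run create, then submit, and always run delete once create has succeeded."""
    create()
    try:
        return submit()
    finally:
        # Runs whether submit() returned or raised, so the cluster is not left running.
        delete()


# Usage with stand-in callables that only record the order of the steps.
steps = []
result = run_then_cleanup(
    lambda: steps.append("create"),
    lambda: steps.append("submit") or "job output",
    lambda: steps.append("delete"),
)
```

In real use, each callable would wrap the corresponding `create_cluster`, `submit_job_as_operation`, and `delete_cluster` call together with its `operation.result()` wait.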
Last updated 2025-03-27 UTC.