The Cloud Shell walkthrough in this tutorial provides authentication by using your Google Cloud project credentials. When you run code locally, the recommended practice is to use service account credentials to authenticate your code.
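For local runs, a minimal sketch of service account authentication might look like the following; the key file path and the region endpoint are placeholders, not values from this tutorial.

# Sketch: authenticate a Dataproc client with a service account key file
# when running locally. The key path and region below are placeholders.
from google.cloud import dataproc_v1
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)
cluster_client = dataproc_v1.ClusterControllerClient(
    credentials=credentials,
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"},
)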
Create a Dataproc cluster
The following values are set to create the cluster:
The project in which the cluster will be created
The region in which the cluster will be created
The cluster name
The cluster config, which specifies one master and two primary workers
Default config settings are used for the remaining cluster settings.
You can override the default cluster config settings. For example, you can add secondary VMs (default = 0) or specify a non-default VPC network for the cluster. For more information, see CreateCluster. A hedged override sketch follows the code sample below.
import re

from google.cloud import dataproc_v1, storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
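As a hedged illustration of the overrides mentioned above, the cluster dict in the sample could be extended as follows; the subnetwork URI and the secondary worker count are illustrative placeholders, not part of the tutorial's sample.

# Sketch: overriding cluster config defaults. The subnetwork URI and
# the secondary worker count are illustrative placeholders.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "gce_cluster_config": {
            # Non-default VPC subnetwork (hypothetical URI).
            "subnetwork_uri": "projects/my-project/regions/us-central1/subnetworks/my-subnet",
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        # Secondary workers default to 0; this requests two.
        "secondary_worker_config": {"num_instances": 2},
    },
}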
Submit a job
The following values are set to submit the job:
The project in which the cluster will be created
The region in which the cluster will be created
The job config, which specifies the cluster name and the Cloud Storage filepath (URI) of the PySpark job
See SubmitJob for more information.
    # Create the job client.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the job config.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{pyspark_file}"},
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()

    # Dataproc job output is saved to the Cloud Storage bucket
    # allocated to the job. Use a regex to obtain the bucket and blob info.
    matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

    output = (
        storage.Client()
        .get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )

    print(f"Job finished successfully: {output}\r\n")
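The submit_job_as_operation call returns a long-running operation that completes when the job terminates, so result() blocks until the job finishes and returns the Job resource. Its driver_output_resource_uri field points to the Cloud Storage location where Dataproc staged the driver output, which the regex above splits into bucket and blob names.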
Delete the cluster
The following values are set to delete the cluster:
The project in which the cluster will be created
The region in which the cluster will be created
The cluster name
For more information, see DeleteCluster.
    # Delete the cluster once the job has terminated.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": cluster_name,
        }
    )
    operation.result()

    print(f"Cluster {cluster_name} successfully deleted.")
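Putting the pieces together, a hypothetical invocation of the function might look like the following; every value is a placeholder to replace with your own.

# Hypothetical invocation; replace each placeholder with your own values.
quickstart(
    project_id="my-project-id",
    region="us-central1",
    cluster_name="my-quickstart-cluster",
    gcs_bucket="my-bucket",
    pyspark_file="hello_world.py",
)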
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-07-31 UTC."],[[["This tutorial guides users through a Cloud Shell walkthrough to interact with Dataproc gRPC APIs using Google Cloud client libraries for Python."],["The walkthrough code demonstrates how to programmatically create a Dataproc cluster, submit a job to the cluster, and then delete the cluster."],["The tutorial details the required values to set when creating a cluster, such as project ID, region, cluster name, and cluster configuration, allowing for default setting overides."],["The tutorial also describes the necessary values to submit a job, including project ID, region, cluster name, and the Cloud Storage filepath of the PySpark job."],["Users can utilize an inline workflow to perform all actions with one API request, rather than making separate requests, as shown in the provided example."]]],[]]