Default config settings are used for the remaining cluster settings. You can override default cluster config settings. For example, you can add secondary VMs (default = 0) or specify a non-default VPC network for the cluster. For more information, see [CreateCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.ClusterController.CreateCluster).
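As a sketch of what such an override might look like, the `config` dict in the cluster-creation code below could be extended with a `secondary_worker_config` entry and a non-default VPC network. The project and network names here are hypothetical placeholders:

```python
# Sketch of optional overrides that could be merged into the cluster "config"
# dict shown below. "example-project" and "example-net" are hypothetical names;
# substitute your own.
config_overrides = {
    "secondary_worker_config": {"num_instances": 2},  # secondary workers (default is 0)
    "gce_cluster_config": {
        "network_uri": "projects/example-project/global/networks/example-net"
    },
}
```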
```python
import re

from google.cloud import dataproc_v1
from google.cloud import storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
```
**Submit a job**

The following values are set to submit the job:

- The project in which the cluster was created
- The region where the cluster was created
- The job config, which specifies the cluster name and the Cloud Storage filepath (URI) of the PySpark job

See [SubmitJob](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.JobController.SubmitJob) for more information.

```python
    # (Continues the body of quickstart().)
    # Create the job client.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the job config.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{pyspark_file}"},
    }

    # Submit the job and wait for it to finish.
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()

    # Dataproc job output is saved to the Cloud Storage bucket
    # allocated to the job. Use regex to obtain the bucket and blob info.
    matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

    output = (
        storage.Client()
        .get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )

    print(f"Job finished successfully: {output}\r\n")
```
**Delete the cluster**

The following values are set to delete the cluster:

- The project in which the cluster was created
- The region where the cluster was created
- The name of the cluster

For more information, see [DeleteCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.ClusterController.DeleteCluster).

```python
    # (Continues the body of quickstart().)
    # Delete the cluster once the job has terminated.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": cluster_name,
        }
    )
    operation.result()

    print(f"Cluster {cluster_name} successfully deleted.")
```
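For reference, a minimal sketch of calling the completed `quickstart` function. All argument values here are hypothetical placeholders, and the PySpark script is assumed to already exist in the Cloud Storage bucket:

```python
# Hypothetical values: replace with your own project, region, bucket, and
# the name of a PySpark script uploaded to the bucket.
if __name__ == "__main__":
    quickstart(
        project_id="example-project",
        region="us-central1",
        cluster_name="example-cluster",
        gcs_bucket="example-bucket",
        pyspark_file="example_job.py",
    )
```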