Default config settings are used for the remaining cluster settings. You can override default cluster config settings. For example, you can add secondary VMs (default = 0) or specify a non-default VPC network for the cluster. For more information, see [CreateCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.ClusterController.CreateCluster).
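As a sketch of what such an override might look like, the `config` dict in the cluster-creation code below could be extended with a `secondary_worker_config` entry and a non-default VPC network. The project and network names here are hypothetical placeholders:

```python
# Sketch of optional overrides that could be merged into the cluster "config"
# dict shown below. "example-project" and "example-net" are hypothetical names;
# substitute your own.
config_overrides = {
    "secondary_worker_config": {"num_instances": 2},  # secondary workers (default is 0)
    "gce_cluster_config": {
        "network_uri": "projects/example-project/global/networks/example-net"
    },
}
```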
```python
import re

from google.cloud import dataproc_v1
from google.cloud import storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
```
**Submit a job**

The following values are set to submit the job:

- The project in which the cluster was created
- The region where the cluster was created
- The job config, which specifies the cluster name and the Cloud Storage filepath (URI) of the PySpark job

See [SubmitJob](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.JobController.SubmitJob) for more information.

```python
    # (Continues the body of quickstart().)
    # Create the job client.
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the job config.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{pyspark_file}"},
    }

    # Submit the job and wait for it to finish.
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()

    # Dataproc job output is saved to the Cloud Storage bucket
    # allocated to the job. Use regex to obtain the bucket and blob info.
    matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

    output = (
        storage.Client()
        .get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )

    print(f"Job finished successfully: {output}\r\n")
```
**Delete the cluster**

The following values are set to delete the cluster:

- The project in which the cluster was created
- The region where the cluster was created
- The name of the cluster

For more information, see [DeleteCluster](/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.ClusterController.DeleteCluster).

```python
    # (Continues the body of quickstart().)
    # Delete the cluster once the job has terminated.
    operation = cluster_client.delete_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster_name": cluster_name,
        }
    )
    operation.result()

    print(f"Cluster {cluster_name} successfully deleted.")
```
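For reference, a minimal sketch of calling the completed `quickstart` function. All argument values here are hypothetical placeholders, and the PySpark script is assumed to already exist in the Cloud Storage bucket:

```python
# Hypothetical values: replace with your own project, region, bucket, and
# the name of a PySpark script uploaded to the bucket.
if __name__ == "__main__":
    quickstart(
        project_id="example-project",
        region="us-central1",
        cluster_name="example-cluster",
        gcs_bucket="example-bucket",
        pyspark_file="example_job.py",
    )
```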