# Dataproc cluster configuration

*Last updated: 2025-09-04 (UTC).*

In Cloud Data Fusion, cluster configuration refers to defining how your
data processing pipelines use computational resources when running Spark
jobs on Dataproc.
This page describes the main approaches to
cluster configuration.

Default ephemeral clusters (recommended)
----------------------------------------

Using the default clusters is the recommended approach for
Cloud Data Fusion pipelines.

- Cloud Data Fusion automatically provisions and manages ephemeral Dataproc clusters for each pipeline execution: it creates a cluster at the beginning of the pipeline run, and then deletes it after the run completes.
- Benefits of ephemeral clusters:
  - **Simplicity**: you don't need to manually configure or manage the cluster.
  - **Cost-effectiveness**: you pay only for the resources used during pipeline execution.

| **Note:** By default, Cloud Data Fusion uses the Dataproc Autoscaling compute profile, which creates ephemeral clusters according to the default configurations.

To adjust clusters and tune performance, see [Cluster sizing](/data-fusion/docs/concepts/cluster-sizing).

Static clusters (for specific scenarios)
----------------------------------------

In the following scenarios, you can use static clusters:

- **Long-running pipelines**: for pipelines that run continuously or for extended periods, a static cluster can be more cost-effective than repeatedly creating and tearing down ephemeral clusters.
- **Centralized cluster management**: if your organization requires centralized control over cluster creation and management policies, static clusters can be used alongside tools like Terraform.
- **Cluster creation time**: when the time it takes to create a new cluster for every pipeline run is prohibitive for your use case.

However, static clusters require more manual configuration, and you must manage
the cluster lifecycle yourself.

To use a static cluster, set the following
[properties](/dataproc/docs/concepts/configuring-clusters/cluster-properties)
on the Dataproc cluster:

    dataproc:dataproc.conscrypt.provider.enable=false
    capacity-scheduler:yarn.scheduler.capacity.resource-calculator="org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator"

### Cluster configuration options for static clusters

If you choose to use static clusters, Cloud Data Fusion offers
configuration options for the following aspects:

- **Worker machine type**: specify the virtual machine type for the worker nodes in your cluster. This determines the vCPUs and memory available to each worker.
- **Number of workers**: define the initial number of worker nodes in your cluster. Dataproc might still autoscale this number, based on workload.
- **Zone**: select your cluster's Google Cloud zone. Location can affect data locality and network performance.
- **Additional configurations**: you can configure advanced options for your static cluster, such as preemption settings, network settings, and initialization actions.

Best practices
--------------

When you create a static cluster for your pipelines, use the configurations
described in [Run a pipeline against an existing Dataproc cluster](/data-fusion/docs/how-to/running-against-existing-dataproc).

Reusing clusters
----------------

You can reuse Dataproc clusters between runs to improve
processing time. Cluster reuse is implemented in a model similar to connection
pooling or thread pooling: a cluster is kept running for a specified time after
a run finishes. When a new run starts, it tries to find an idle cluster that
matches the configuration of the compute profile. If one is available, it is
used; otherwise, a new cluster is started.

### Considerations for reusing clusters

- Clusters are not shared. Similar to the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time. 
A cluster is reused only if it is idle.
- If you enable cluster reuse for all your runs, the necessary number of clusters to process all your runs is created as needed. Similar to the ephemeral Dataproc provisioner, there is no direct control over the number of clusters created. You can still use Google Cloud quotas to manage resources. For example, if you execute 100 runs with a maximum of 7 parallel runs, you will have up to 7 clusters at any given point in time.
- Clusters are reused between different pipelines as long as those pipelines
  use the same profile and share the same profile settings. If profile
  customization is used, clusters are still reused, but only if the
  customizations are exactly the same, including all cluster settings such as
  cluster labeling.

- When cluster reuse is enabled, there are two main cost considerations:

  - Fewer resources are used for cluster startup and initialization.
  - More resources are used for clusters to sit idle between pipeline runs and after the last pipeline run.

While it's hard to predict the cost effect of cluster reuse, you can employ a
strategy to get maximum savings: identify a critical path of
chained pipelines and enable cluster reuse for that critical path. This
ensures the cluster is immediately reused, no idle time is wasted, and maximum
performance benefits are achieved.

### Enable cluster reuse

In the **Compute Config** section of a deployed pipeline's configuration, or when
creating a new compute profile:

- Enable **Skip Cluster Delete**.
- **Max Idle Time** is the amount of time a cluster waits for the next pipeline to reuse it. The default Max Idle Time is 30 minutes. When setting Max Idle Time, consider the cost versus cluster availability for reuse. 
The higher the value of Max Idle Time, the more clusters sit idle, ready for a run.

Troubleshoot: Version compatibility
-----------------------------------

**Problem**: The version of your Cloud Data Fusion environment might
not be compatible with the version of your Dataproc cluster.

**Recommended**: Upgrade to the latest Cloud Data Fusion version and
use one of the [supported Dataproc versions](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions).

Earlier versions of Cloud Data Fusion are only compatible with
[unsupported versions of Dataproc](/dataproc/docs/concepts/versioning/dataproc-versions#unsupported_dataproc_versions).
Dataproc does not provide updates and support for clusters
created with these versions. Although you can continue running a cluster that
was created with an unsupported version, we recommend replacing it with one
created with a
[supported version](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions).

^\*^ Cloud Data Fusion versions 6.4 and later are compatible with [supported versions of Dataproc](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions). Unless specific OS features are needed, the recommended practice is to specify the [`major.minor` image version](/dataproc/docs/concepts/versioning/overview#how_versioning_works).
If you specify the OS version for your Dataproc cluster, that version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion version in the preceding table.

^\*\*^ Cloud Data Fusion versions 6.1 to 6.6 are compatible with [unsupported Dataproc version 1.3](/dataproc/docs/concepts/versioning/dataproc-versions#unsupported_dataproc_versions).

^\*\*\*^ Certain [issues](/data-fusion/docs/release-notes#October_24_2024) are detected with this image version. 
This Dataproc image version is not recommended for production use.

Troubleshoot: Container exited with a non-zero exit code 3
----------------------------------------------------------

**Problem**: An autoscaling policy isn't used, and the static
Dataproc clusters are encountering memory pressure, causing an
out-of-memory exception to appear in the logs: `Container exited with a non-zero
exit code 3`.

**Recommended**: Increase the executor memory.

Increase the memory by adding a `task.executor.system.resources.memory` runtime
argument to the pipeline. The following example runtime argument sets the memory
to 4096 MB:

    "task.executor.system.resources.memory": 4096

For more information, see [Cluster sizing](/data-fusion/docs/concepts/cluster-sizing).

What's next
-----------

- Learn [how to change the Dataproc image version](/data-fusion/docs/how-to/change-dataproc-image).
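As a recap of the static-cluster setup described earlier, the two required properties can be applied when you create the cluster. The following is a minimal sketch using the `gcloud` CLI; `CLUSTER_NAME`, `REGION`, and the machine types and worker count are placeholder assumptions, not prescribed values:

```shell
# Create a static Dataproc cluster with the properties that
# Cloud Data Fusion requires. Substitute your own cluster name,
# region, machine types, and worker count.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=n2-standard-4 \
    --worker-machine-type=n2-standard-4 \
    --num-workers=2 \
    --properties='dataproc:dataproc.conscrypt.provider.enable=false,capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator'
```

The `--properties` flag takes comma-separated `prefix:key=value` pairs, which is why both properties appear in a single quoted argument.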