You can reuse Dataproc clusters between runs to improve processing time. Cluster reuse is implemented in a model similar to connection pooling or thread pooling: a cluster is kept up and running for a specified time after its run finishes. When a new run starts, it tries to find an idle cluster that matches the configuration of its compute profile. If one is present, it is used; otherwise, a new cluster is started.
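The matching behavior can be pictured as a keyed pool. The following Python sketch is purely illustrative and is not the Cloud Data Fusion implementation; the `profile_key` and `acquire_cluster` names are hypothetical. It shows why two runs reuse a cluster only when every profile setting matches exactly:

```python
import hashlib
import json

def profile_key(settings: dict) -> str:
    """Hash every profile setting (labels included) into a match key."""
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def acquire_cluster(idle_pool: dict, settings: dict) -> str:
    """Reuse an idle cluster whose profile matches exactly; else start a new one."""
    key = profile_key(settings)
    if idle_pool.get(key):
        return idle_pool[key].pop()        # reuse: startup cost is skipped
    return "new-cluster-" + key[:8]        # no exact match: provision a fresh cluster

# Two runs with identical settings share a key; changing any setting
# (for example, a cluster label) produces a different key, so no reuse.
base = {"machineType": "n2-standard-4", "labels": {"team": "etl"}}
pool = {profile_key(base): ["cluster-1"]}
print(acquire_cluster(pool, base))                    # cluster-1 (reused)
print(acquire_cluster(pool, {**base, "labels": {}}))  # new cluster (label differs)
```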
### Considerations for reusing clusters
- Clusters are not shared. As in the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time, and a cluster is reused only while it is idle.
- If you enable cluster reuse for all of your runs, as many clusters as are needed to process them are created on demand. As with the ephemeral Dataproc provisioner, there is no direct control over the number of clusters created, but you can still use Google Cloud quotas to manage resources. For example, if you execute 100 runs with a maximum of 7 parallel runs, there are up to 7 clusters at any given point in time.
- Clusters are reused across different pipelines as soon as those pipelines use the same profile and share the same profile settings. If profile customization is used, clusters are still reused, but only if the customizations are exactly the same, including all cluster settings such as cluster labels.
When cluster reuse is enabled, there are two main cost considerations:

- Fewer resources are used for cluster startup and initialization.
- More resources are used by clusters sitting idle between pipeline runs and after the last pipeline run.
While the cost effect of cluster reuse is hard to predict, you can adopt a strategy for maximum savings: identify the critical path of your chained pipelines and enable cluster reuse for that path. This ensures that clusters are reused immediately, that no idle time is wasted, and that you get the maximum performance benefit.
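As a rough back-of-the-envelope check (with illustrative numbers only, not pricing guidance), reuse pays off when the startup overhead you avoid exceeds the cost of the idle time you keep:

```
savings ≈ (reused_runs × startup_minutes − total_idle_minutes) × cluster_cost_per_minute

# Example: 20 reused runs that each skip a 3-minute startup, at the cost
# of 30 total idle minutes, net (20 × 3 − 30) = 30 cluster-minutes saved.
```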
### Enable cluster reuse
In the **Compute Config** section of a deployed pipeline's configuration, or when creating a new compute profile, do the following:
- Enable **Skip Cluster Delete**.
- Set **Max Idle Time**, the time a cluster waits for the next pipeline to reuse it. The default Max Idle Time is 30 minutes. When choosing a value, weigh cost against cluster availability for reuse: the higher the Max Idle Time, the more clusters sit idle, ready for a run. (A runtime-argument sketch follows this list.)
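If you prefer to set these outside the UI, the same toggles can typically be expressed as provisioner properties via runtime arguments. The property names below (`skipDelete`, `idleTTL`) are an assumption based on the CDAP Dataproc provisioner's naming and are not confirmed by this page; verify them against your Cloud Data Fusion version's provisioner reference before relying on them:

```
# Assumed property names -- verify for your Cloud Data Fusion version:
"system.profile.properties.skipDelete": "true"   # Skip Cluster Delete
"system.profile.properties.idleTTL": "30"        # Max Idle Time, in minutes
```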
Troubleshoot: Version compatibility
-----------------------------------
**Problem**: The version of your Cloud Data Fusion environment might not be compatible with the version of your Dataproc cluster.
**Recommended**: Upgrade to the latest Cloud Data Fusion version and use one of the supported Dataproc versions.
Earlier versions of Cloud Data Fusion are compatible only with unsupported versions of Dataproc. Dataproc does not provide updates or support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
\* Cloud Data Fusion versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the `major.minor` image version. To specify the OS version used in your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion version in the preceding table.
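For example, when creating a static cluster with the gcloud CLI, you can pin the image to a `major.minor` version with the `--image-version` flag; the cluster name and region below are placeholders. Append an OS suffix (such as `2.1-debian11`) only if you need specific OS features:

```
# Placeholder name and region; pinning major.minor only means sub-minor
# image updates are picked up automatically:
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.1
```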
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-08(UTC)"],[[["\u003cp\u003eCloud Data Fusion offers two main approaches to cluster configuration for data processing pipelines: default ephemeral clusters, which are automatically managed, and static clusters, which require manual setup and management.\u003c/p\u003e\n"],["\u003cp\u003eEphemeral clusters are recommended for their simplicity and cost-effectiveness, as they are automatically provisioned and deleted for each pipeline run, ensuring you only pay for resources used during execution.\u003c/p\u003e\n"],["\u003cp\u003eStatic clusters are suitable for specific scenarios, such as long-running pipelines, centralized cluster management, or when cluster creation time is prohibitive, but they require manual configuration and management of the cluster lifecycle.\u003c/p\u003e\n"],["\u003cp\u003eCluster reuse can improve processing time by keeping clusters active after a pipeline run, allowing subsequent runs with the same configuration to utilize them, thus reducing resource consumption and enhancing efficiency.\u003c/p\u003e\n"],["\u003cp\u003eCompatibility between Cloud Data Fusion and Dataproc versions is critical, with newer Cloud Data Fusion versions supporting specific, more recent Dataproc versions, and using unsupported Dataproc versions being only compatible with previous Cloud Data Fusion versions.\u003c/p\u003e\n"]]],[],null,["# Dataproc cluster configuration\n\nIn Cloud Data Fusion, cluster configuration refers to defining how your\ndata processing pipelines utilize computational resources when running Spark\njobs on Dataproc. This page describes the main approaches to\ncluster configuration.\n\nDefault ephemeral clusters (recommended)\n----------------------------------------\n\nUsing the default clusters is the recommended approach for\nCloud Data Fusion pipelines.\n\n- Cloud Data Fusion automatically provisions and manages ephemeral Dataproc clusters for each pipeline execution. 
It creates a cluster at the beginning of the pipeline run, and then deletes it after the pipeline run completes.\n- Benefits of ephemeral clusters:\n - **Simplicity**: you don't need to manually configure or manage the cluster.\n - **Cost-effectiveness**: you only pay for the resources used during pipeline execution.\n\n| **Note:** Cloud Data Fusion, by default, uses Dataproc Autoscaling compute profile which creates ephemeral clusters as per the default configurations.\n\nTo adjust clusters and tune performance, see [Cluster sizing](/data-fusion/docs/concepts/cluster-sizing).\n\nStatic clusters (for specific scenarios)\n----------------------------------------\n\nIn the following scenarios, you can use static clusters:\n\n- **Long-running pipelines**: for pipelines that run continuously or for extended periods, a static cluster can be more cost-effective than repeatedly creating and tearing down ephemeral clusters.\n- **Centralized cluster management**: if your organization requires centralized control over cluster creation and management policies, static clusters can be used alongside tools like Terraform.\n- **Cluster creation time**: when the time it takes to create a new cluster for every pipeline is prohibitive for your use case.\n\nHowever, static clusters require more manual configuration and involve managing\nthe cluster lifecycle yourself.\n\nTo use a static cluster, you must set the following\n[properties](/dataproc/docs/concepts/configuring-clusters/cluster-properties)\non the Dataproc cluster: \n\n dataproc:dataproc.conscrypt.provider.enable=false\n capacity-scheduler:yarn.scheduler.capacity.resource-calculator=\"org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator\"\n\n### Cluster configuration options for static clusters\n\nIf you choose to use static clusters, Cloud Data Fusion offers\nconfiguration options for the following aspects:\n\n- **Worker machine type**: specify the virtual machine type for the worker nodes in your cluster. This determines the vCPUs and memory available for each worker.\n- **Number of workers**: define the initial number of worker nodes in your cluster. Dataproc might still autoscale this number, based on workload.\n- **Zone**: select your cluster's Google Cloud zone. Location can affect data locality and network performance.\n- **Additional configurations**: you can configure advanced options for your static cluster, such as preemption settings, network settings, and initialization actions.\n\nBest practices\n--------------\n\nWhen creating a static cluster for your pipelines, use the following\nconfigurations.\n\nFor more information, see [Run a pipeline against an existing Dataproc cluster](/data-fusion/docs/how-to/running-against-existing-dataproc).\n\nReusing clusters\n----------------\n\nYou can reuse Dataproc clusters between runs to improve\nprocessing time. Cluster reuse is implemented in a model similar to connection\npooling or thread pooling. Any cluster is kept up and running for a specified\ntime after the run is finished. When a new run is started, it will try to find\nan idle cluster available that matches the configuration of the compute profile.\nIf one is present, it will be used, otherwise a new cluster will be started.\n\n### Considerations for reusing clusters\n\n- Clusters are not shared. Similar to the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time. 
A cluster is reused only if it is idle**.**\n- If you enable cluster reuse for all your runs, the necessary number of clusters to process all your runs will be created as needed. Similar to the ephemeral Dataproc provisioner, there is no direct control on the number of clusters created. You can still use Google Cloud quotes to manage resources. For example, if you run 100 runs with 7 maximum parallel runs, you will have up to 7 clusters at a given point of time.\n- Clusters are reused between different pipelines as soon as those pipelines\n are using the same profile and share the same profile settings. If profile\n customization is used, clusters will still be reused, but only if\n customizations are exactly the same, including all cluster settings like\n cluster labeling.\n\n- When cluster reuse is enabled, there are two main cost considerations:\n\n - Less resources are used for cluster startup and initialization.\n - More resources are used for clusters to sit idle between the pipeline runs and after the last pipeline run.\n\nWhile it's hard to predict the cost effect of cluster reuse, you can employ a\nstrategy to get maximum savings. The strategy is to identify a critical path for\nchained pipelines and enable cluster reuse for this critical path. This would\nensure the cluster is immediately reused, no idle time is wasted and maximum\nperformance benefits are achieved.\n\n### Enable Cluster Reuse\n\nIn the Compute Config section of deployed pipeline configuration or when\ncreating new compute profile:\n\n- Enable **Skip Cluster Delete**.\n- Max Idle Time is the time up to which a cluster waits for the next pipeline to reuse it. The default Max Idle Time is 30 minutes. For Max Idle Time, consider the cost versus cluster availability for reuse. The higher the value of Max Idle Time, the more clusters sit idle, ready for a run.\n\nTroubleshoot: Version compatibility\n-----------------------------------\n\n**Problem**: The version of your Cloud Data Fusion environment might\nnot be compatible with the version of your Dataproc cluster.\n\n**Recommended** : Upgrade to the latest Cloud Data Fusion version and\nuse one of the [supported Dataproc versions](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions).\n\nEarlier versions of Cloud Data Fusion are only compatible with\n[unsupported versions of Dataproc](/dataproc/docs/concepts/versioning/dataproc-versions#unsupported_dataproc_versions).\nDataproc does not provide updates and support for clusters\ncreated with these versions. Although you can continue running a cluster that\nwas created with an unsupported version, we recommend replacing it with one\ncreated with a\n[supported version](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions).\n\n^\\*^ Cloud Data Fusion versions 6.4 and later are compatible with [supported versions of Dataproc](/dataproc/docs/concepts/versioning/dataproc-versions#supported_dataproc_versions). Unless specific OS features are needed, the recommended practice is to specify the [`major.minor` image version](/dataproc/docs/concepts/versioning/overview#how_versioning_works). 
\nTo specify the OS version used in your Dataproc cluster, the OS version must be compatible with one of the supported Dataproc versions for your Cloud Data Fusion in the preceding table.\n\n\u003cbr /\u003e\n\n^\\*\\*^ Cloud Data Fusion versions 6.1 to 6.6 are compatible with [unsupported Dataproc version 1.3](/dataproc/docs/concepts/versioning/dataproc-versions#unsupported_dataproc_versions).\n\n\u003cbr /\u003e\n\n^\\*\\*\\*^ Certain [issues](/data-fusion/docs/release-notes#October_24_2024) are detected with this image version. This Dataproc image version is not recommended for production use.\n\n\u003cbr /\u003e\n\nTroubleshoot: Container exited with a non-zero exit code 3\n----------------------------------------------------------\n\n**Problem** : An autoscaling policy isn't used, and the static\nDataproc clusters are encountering memory pressure, causing an\nout of memory exception to appear in the logs: `Container exited with a non-zero\nexit code 3`.\n\n**Recommended**: Increase the executor memory.\n\nIncrease the memory by adding a `task.executor.system.resources.memory` runtime\nargument to the pipeline. The following example runtime argument sets the memory\nto 4096 MB: \n\n \"task.executor.system.resources.memory\": 4096\n\nFor more information, see [Cluster sizing](/data-fusion/docs/concepts/cluster-sizing).\n\nWhat's next\n-----------\n\n- Refer to the [How to change Dataproc image version](/data-fusion/docs/how-to/change-dataproc-image)."]]