By default, Cloud Data Fusion uses Autoscale as the compute profile. Estimating the best number of cluster workers (nodes) for a workload is difficult, and a single cluster size for an entire pipeline is often not ideal. Dataproc autoscaling provides a mechanism for automating cluster resource management and enables autoscaling of cluster worker VMs. For more information, see [Autoscaling](/dataproc/docs/concepts/configuring-clusters/autoscaling).
On the **Compute config** page, where you can see a list of profiles, there is a **Total cores** column, which shows the maximum number of vCPUs that the profile can scale up to, such as `Up to 84`.
If you want to use the Dataproc Compute profile, you can manage cluster sizes based on the pipeline size.
Master node
-----------
Master nodes use resources proportional to the number of pipelines or additional applications that are running on the cluster. If you're running pipelines on ephemeral clusters, use 2 CPUs and 8 GB of memory for the master nodes. If you're using persistent clusters, you might need larger master nodes to keep up with the workflow. To understand whether you need larger master nodes, monitor memory and CPU usage on the node.
To minimize execution time, ensure that your cluster has enough nodes to allow for as much parallel processing as possible.
Workers
-------
The following sections describe aspects of sizing worker nodes.
### CPU and Memory
We recommend sizing your worker nodes with at least 2 CPUs and 8 GB of memory. If you've configured your pipelines to use larger amounts of memory, use larger workers. For example, with a 4 CPU, 15 GB worker node, each worker has 4 CPUs and 12 GB available to run YARN containers. If your pipeline is configured to run 1 CPU, 8 GB executors, YARN can't run more than one container per worker node. Each worker node would then have an extra 3 CPUs and 4 GB that are wasted, because they can't be used to run anything. To maximize resource utilization on your cluster, make the YARN memory and CPUs an exact multiple of the amount needed per Spark executor. You can check how much memory each worker has reserved for YARN by checking the `yarn.nodemanager.resource.memory-mb` property in YARN.
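To make this concrete, the following Python sketch (a hypothetical helper, not part of Cloud Data Fusion or YARN) estimates how many executors of a given shape fit on one worker node and how much capacity is left unused, using the worker and executor sizes from the preceding example.

```python
# Minimal sketch (assumed helper, not a Cloud Data Fusion or YARN API): checks
# how many executors of a given shape fit on one worker node and what's left
# unused. The numbers mirror the example above; replace them with your own
# values, such as the memory reported by yarn.nodemanager.resource.memory-mb.

def executors_per_worker(worker_cpus, worker_yarn_mem_gb,
                         executor_cpus, executor_mem_gb):
    """Return (containers, leftover_cpus, leftover_mem_gb) for one worker node."""
    by_cpu = worker_cpus // executor_cpus
    by_mem = int(worker_yarn_mem_gb // executor_mem_gb)
    containers = min(by_cpu, by_mem)
    return (containers,
            worker_cpus - containers * executor_cpus,
            worker_yarn_mem_gb - containers * executor_mem_gb)

# A 4 CPU, 15 GB worker exposes roughly 4 CPUs and 12 GB to YARN.
print(executors_per_worker(4, 12, executor_cpus=1, executor_mem_gb=8))
# -> (1, 3, 4): one container per worker; 3 CPUs and 4 GB are wasted.

# With executor memory an exact divisor of the YARN memory, nothing is wasted.
print(executors_per_worker(4, 12, executor_cpus=1, executor_mem_gb=3))
# -> (4, 0, 0)
```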
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-04。"],[[["\u003cp\u003eCloud Data Fusion utilizes Autoscale as the default compute profile, but Dataproc Autoscaling offers automated cluster resource management for dynamic worker VM scaling.\u003c/p\u003e\n"],["\u003cp\u003eMaster nodes require resources based on the number of pipelines or applications running, with a recommendation of at least 2 CPUs and 8 GB of memory for ephemeral clusters, and possibly larger nodes for persistent clusters.\u003c/p\u003e\n"],["\u003cp\u003eWorker nodes should ideally be sized with at least 2 CPUs and 8 GB of memory, adjusting for pipelines that need more memory, and worker node resource utilization can be optimized by ensuring YARN memory and CPU requirements align with Spark executor needs.\u003c/p\u003e\n"],["\u003cp\u003eTo minimize pipeline execution time, ensure the cluster has enough nodes to process as much as possible in parallel, for instance, if a pipeline uses 100 data splits, then the cluster needs to run 100 executors simultaneously.\u003c/p\u003e\n"],["\u003cp\u003eEnhanced Flexibility Mode (EFM) is recommended for pipelines with shuffles on static clusters, as it involves only primary worker nodes in data shuffling, leading to more stable and efficient cluster scaling.\u003c/p\u003e\n"]]],[],null,["# Cluster sizing\n\nCloud Data Fusion by default used Autoscale as the compute profile.\nEstimating the best number of cluster workers (nodes) for a workload is\ndifficult, and a single cluster size for an entire pipeline is often not ideal.\nThe Dataproc Autoscaling provides a mechanism for automating\ncluster resource management and enables cluster worker VM autoscaling. For more\ninformation, see [Autoscaling](/dataproc/docs/concepts/configuring-clusters/autoscaling)\n\nOn the **Compute config** page, where you can see a list of profiles, there is a\n**Total cores** column, which has the maximum v CPUs that the profile can\nscale up to, such as `Up to 84`.\n\nIf you want to use the Dataproc Compute profile , you can manage\ncluster sizes based on the pipeline size.\n\nMaster node\n-----------\n\nMaster nodes use resources proportional to the number of pipelines or additional\napplications that are running on the cluster. If you're running pipelines on\nephemeral clusters, use 2 CPUs and 8 GB of memory for the master\nnodes. If you're using persistent clusters, you might need larger master nodes\nto keep up with the workflow. To understand if you need larger master nodes, you\ncan monitor memory and CPU usage on the node. We recommend sizing your\nworker nodes with at least 2 CPUs and 8 GB of memory. If you've\nconfigured your pipelines to use larger amounts of memory, then you must use\nlarger workers.\n\nTo minimize execution time, ensure that your cluster has enough nodes to allow\nfor as much parallel processing as possible.\n\nWorkers\n-------\n\nThe following sections describe aspects of sizing worker nodes.\n\n### CPU and Memory\n\nWe recommend sizing your worker nodes with at least 2 CPU and 8 GB\nmemory. If you configured your pipelines to use larger amounts of memory, use\nlarger workers. 
If you're using Dataproc, the memory available for YARN containers will be roughly 75% of the VM memory. The minimum YARN container size is also adjusted depending on the size of the worker VMs. Some common worker sizes and their corresponding YARN settings are given in the following table.

Keep in mind that Spark requests more memory than the executor memory set for the pipeline, and that YARN rounds the requested amount up. For example, suppose you have set your executor memory to 2048 MB and have not given a value for `spark.yarn.executor.memoryOverhead`, which means the default of 384 MB is used. That means Spark requests 2048 MB + 384 MB for each executor, which YARN rounds up to an exact multiple of the YARN minimum allocation. When running on an 8 GB worker node, because the YARN minimum allocation is 512 MB, it gets rounded up to 2.5 GB. This means each worker can run two containers, using up all available CPUs but leaving 1 GB of YARN memory (6 GB - 2.5 GB - 2.5 GB) unused. This means the worker node can actually be sized a little smaller, or the executors can be given a little more memory. When running on a 16 GB worker node, 2048 MB + 1024 MB is rounded up to 3 GB because the YARN minimum allocation is 1024 MB. This means each worker node is able to run four containers, with all CPUs and YARN memory in use.
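If you want to check this rounding for your own settings, the following Python sketch illustrates the calculation. It's a hypothetical helper, not a Spark or YARN API; the sample overheads and minimum allocations mirror the examples above.

```python
# Minimal sketch of the rounding described above: Spark asks YARN for executor
# memory plus spark.yarn.executor.memoryOverhead, and YARN rounds the request
# up to a multiple of its minimum allocation. The helper and sample values are
# illustrative assumptions, not a Spark or YARN API.
import math

def yarn_container_mb(executor_mem_mb, overhead_mb, yarn_min_alloc_mb):
    requested = executor_mem_mb + overhead_mb
    return math.ceil(requested / yarn_min_alloc_mb) * yarn_min_alloc_mb

# 8 GB worker node, YARN minimum allocation 512 MB:
print(yarn_container_mb(2048, 384, 512))    # 2560 MB, that is 2.5 GB per executor

# 16 GB worker node, YARN minimum allocation 1024 MB:
print(yarn_container_mb(2048, 1024, 1024))  # 3072 MB, that is 3 GB per executor
```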
To help give context, the following table shows recommended worker sizes given some common executor sizes.

For example, a 26 GB worker node translates to 20 GB of memory usable for running YARN containers. With executor memory set to 4 GB, 1 GB is added as overhead, which means 5 GB YARN containers for each executor. This means the worker can run four containers without any extra resources left over. You can also multiply the size of the workers. For example, if executor memory is set to 4096 MB, a worker with 8 CPUs and 52 GB of memory would also work well. Compute Engine VMs restrict how much memory the VM can have based on the number of cores. For example, a VM with 4 cores must have at least 7.25 GB of memory and at most 26 GB of memory. This means an executor set to use 1 CPU and 8 GB of memory uses 2 CPUs and 26 GB of memory on the VM. If executors are instead configured to use 2 CPUs and 8 GB of memory, all of the CPUs are utilized.

### Disk

Disk is important for some pipelines, but not all of them. If your pipeline doesn't contain any shuffles, disk is only used when Spark runs out of memory and needs to spill data to disk. For these types of pipelines, disk size and type generally don't have a big impact on performance. If your pipeline shuffles a lot of data, disk performance makes a difference. If you're using Dataproc, use disk sizes of at least 1 TB, because disk performance scales up with disk size. For information about disk performance, see [Configure disks to meet performance requirements](/compute/docs/disks/performance).

### Number of workers

To minimize execution time, ensure that your cluster is large enough that it can run as much as it can in parallel. For example, if your pipeline source reads data using 100 splits, make sure the cluster is large enough to run 100 executors at once.

The easiest way to tell whether your cluster is undersized is by looking at the YARN pending memory over time. If you're using Dataproc, you can find a graph on the cluster detail page.

If pending memory is high for long periods of time, you can increase the number of workers to add that much extra capacity to your cluster. In the preceding example, the cluster should be increased by around 28 GB to ensure that the maximum level of parallelism is achieved.

#### Enhanced Flexibility Mode (EFM)

EFM lets you specify that only primary worker nodes are involved when shuffling data. Because secondary workers are no longer responsible for intermediate shuffle data, Spark jobs don't run into delays or errors when those workers are removed from a cluster. Because primary workers are never scaled down, the cluster scales down with more stability and efficiency. If you run pipelines with shuffles on a static cluster, we recommend that you use EFM.

For more information about EFM, see [Dataproc enhanced flexibility mode](/dataproc/docs/concepts/configuring-clusters/flex).
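As a rough way to apply the guidance in the preceding *Number of workers* section, the following Python sketch (a hypothetical helper, not a Dataproc or Cloud Data Fusion API) estimates how many workers are needed to run a target number of executors at once.

```python
# Rough sketch, tying back to the "Number of workers" guidance: estimate how
# many workers are needed to run a target number of executors concurrently.
# The helper name and sample values are illustrative assumptions, not a
# Dataproc or Cloud Data Fusion API.
import math

def workers_needed(target_executors, executors_per_worker):
    return math.ceil(target_executors / executors_per_worker)

# A source that reads 100 splits wants 100 executors running in parallel.
# If each worker fits four executors (for example, 5 GB YARN containers on a
# worker that exposes 20 GB to YARN), that calls for roughly 25 workers.
print(workers_needed(100, 4))  # -> 25
```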