[[["わかりやすい","easyToUnderstand","thumb-up"],["問題の解決に役立った","solvedMyProblem","thumb-up"],["その他","otherUp","thumb-up"]],[["わかりにくい","hardToUnderstand","thumb-down"],["情報またはサンプルコードが不正確","incorrectInformationOrSampleCode","thumb-down"],["必要な情報 / サンプルがない","missingTheInformationSamplesINeed","thumb-down"],["翻訳に関する問題","translationIssue","thumb-down"],["その他","otherDown","thumb-down"]],["最終更新日 2025-09-04 UTC。"],[[["\u003cp\u003eCloud Data Fusion utilizes Autoscale as the default compute profile, but Dataproc Autoscaling offers automated cluster resource management for dynamic worker VM scaling.\u003c/p\u003e\n"],["\u003cp\u003eMaster nodes require resources based on the number of pipelines or applications running, with a recommendation of at least 2 CPUs and 8 GB of memory for ephemeral clusters, and possibly larger nodes for persistent clusters.\u003c/p\u003e\n"],["\u003cp\u003eWorker nodes should ideally be sized with at least 2 CPUs and 8 GB of memory, adjusting for pipelines that need more memory, and worker node resource utilization can be optimized by ensuring YARN memory and CPU requirements align with Spark executor needs.\u003c/p\u003e\n"],["\u003cp\u003eTo minimize pipeline execution time, ensure the cluster has enough nodes to process as much as possible in parallel, for instance, if a pipeline uses 100 data splits, then the cluster needs to run 100 executors simultaneously.\u003c/p\u003e\n"],["\u003cp\u003eEnhanced Flexibility Mode (EFM) is recommended for pipelines with shuffles on static clusters, as it involves only primary worker nodes in data shuffling, leading to more stable and efficient cluster scaling.\u003c/p\u003e\n"]]],[],null,["# Cluster sizing\n\nCloud Data Fusion by default used Autoscale as the compute profile.\nEstimating the best number of cluster workers (nodes) for a workload is\ndifficult, and a single cluster size for an entire pipeline is often not ideal.\nThe Dataproc Autoscaling provides a mechanism for automating\ncluster resource management and enables cluster worker VM autoscaling. For more\ninformation, see [Autoscaling](/dataproc/docs/concepts/configuring-clusters/autoscaling)\n\nOn the **Compute config** page, where you can see a list of profiles, there is a\n**Total cores** column, which has the maximum v CPUs that the profile can\nscale up to, such as `Up to 84`.\n\nIf you want to use the Dataproc Compute profile , you can manage\ncluster sizes based on the pipeline size.\n\nMaster node\n-----------\n\nMaster nodes use resources proportional to the number of pipelines or additional\napplications that are running on the cluster. If you're running pipelines on\nephemeral clusters, use 2 CPUs and 8 GB of memory for the master\nnodes. If you're using persistent clusters, you might need larger master nodes\nto keep up with the workflow. To understand if you need larger master nodes, you\ncan monitor memory and CPU usage on the node. We recommend sizing your\nworker nodes with at least 2 CPUs and 8 GB of memory. If you've\nconfigured your pipelines to use larger amounts of memory, then you must use\nlarger workers.\n\nTo minimize execution time, ensure that your cluster has enough nodes to allow\nfor as much parallel processing as possible.\n\nWorkers\n-------\n\nThe following sections describe aspects of sizing worker nodes.\n\n### CPU and Memory\n\nWe recommend sizing your worker nodes with at least 2 CPU and 8 GB\nmemory. If you configured your pipelines to use larger amounts of memory, use\nlarger workers. 
If you're using Dataproc, the memory available for YARN containers is roughly
75% of the VM memory. The minimum YARN container size is also adjusted depending
on the size of the worker VMs. Some common worker sizes and their corresponding
YARN settings are used in the following examples.

Keep in mind that Spark requests more memory than the executor memory set for
the pipeline, and that YARN rounds the requested amount up. For example, suppose
you have set your executor memory to 2048 MB and haven't given a value for
`spark.yarn.executor.memoryOverhead`, which means the default of 384 MB is
used. That means Spark requests 2048 MB + 384 MB for each executor, which
YARN rounds up to an exact multiple of the YARN minimum allocation. When running
on an 8 GB worker node, because the YARN minimum allocation is 512 MB, the
request gets rounded up to 2.5 GB. This means each worker can run two
containers, using up all available CPUs but leaving 1 GB of YARN memory
(6 GB - 2.5 GB - 2.5 GB) unused. This means the worker node can actually be
sized a little smaller, or the executors can be given a little more memory.
When running on a 16 GB worker node, the 2048 MB + 384 MB request is
rounded up to 3 GB because the YARN minimum allocation is 1024 MB. This
means each worker node is able to run four containers, with all CPUs and YARN
memory in use.
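The container-size arithmetic is easy to reproduce. The following sketch assumes
the default Spark overhead of max(384 MB, 10% of executor memory) when
`spark.yarn.executor.memoryOverhead` isn't set, and rounds the request up to a
multiple of the YARN minimum allocation; it reproduces the 2.5 GB and 3 GB
figures above. The helper function is illustrative only.

```python
import math

def yarn_container_mb(executor_memory_mb, yarn_min_allocation_mb,
                      memory_overhead_mb=None):
    """Approximate YARN container size for a Spark executor request.

    Assumes the default overhead of max(384 MB, 10% of executor memory) when
    spark.yarn.executor.memoryOverhead isn't set, and that YARN rounds the
    request up to a multiple of its minimum allocation.
    """
    if memory_overhead_mb is None:
        memory_overhead_mb = max(384, int(0.10 * executor_memory_mb))
    requested = executor_memory_mb + memory_overhead_mb
    return math.ceil(requested / yarn_min_allocation_mb) * yarn_min_allocation_mb

print(yarn_container_mb(2048, 512))    # 2560 MB (2.5 GB) on an 8 GB worker
print(yarn_container_mb(2048, 1024))   # 3072 MB (3 GB) on a 16 GB worker
```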
To help give context, consider recommended worker sizes for some common executor
sizes. For example, a 26 GB worker node translates to 20 GB of memory usable
for running YARN containers. With executor memory set to 4 GB, 1 GB is added
as overhead, which means 5 GB YARN containers for each executor. This means
the worker can run four containers without any extra resources left over. You
can also multiply the size of the workers. For example, if executor memory is
set to 4096 MB, a worker with 8 CPUs and 52 GB of memory would also work
well. Compute Engine VMs restrict how much memory the VM can have based on the
number of cores. For example, a VM with 4 cores must have at least 7.25 GB of
memory and at most 26 GB of memory. This means an executor set to use 1 CPU
and 8 GB of memory uses 2 CPUs and 26 GB of memory on the VM. If executors
are instead configured to use 2 CPUs and 8 GB of memory, all of the CPUs are
utilized.

### Disk

Disk is important for some pipelines, but not all of them. If your pipeline
doesn't contain any shuffles, disk is used only when Spark runs out of memory
and needs to spill data to disk. For these types of pipelines, disk size and
type generally don't have a big impact on performance. If your pipeline shuffles
a lot of data, disk performance makes a difference. If you're using
Dataproc, we recommend disk sizes of at least 1 TB, because disk performance
scales up with disk size. For information about disk performance, see
[Configure disks to meet performance
requirements](/compute/docs/disks/performance).

### Number of workers

To minimize execution time, ensure that your cluster is large enough to run as
much work as possible in parallel. For example, if your pipeline source reads
data using 100 splits, make sure the cluster is large enough to run
100 executors at once.

The easiest way to tell whether your cluster is undersized is by looking at the
YARN pending memory over time. If you're using Dataproc, you can find a graph
on the cluster detail page.

If pending memory is high for long periods of time, you can increase the number
of workers to add that much extra capacity to your cluster. For example, if the
graph shows around 28 GB of pending memory over a sustained period, increase
the cluster by around 28 GB of worker capacity to ensure that the maximum
level of parallelism is achieved.

#### Enhanced Flexibility Mode (EFM)

EFM lets you specify that only primary worker nodes be involved when shuffling
data. Because secondary workers are no longer responsible for intermediate
shuffle data, Spark jobs don't run into delays or errors when those workers are
removed from a cluster. Because primary workers are never scaled down, the
cluster scales down with more stability and efficiency. If you're running
pipelines with shuffles on a static cluster, we recommend that you use EFM.

For more information on EFM, see [Dataproc enhanced flexibility
mode](/dataproc/docs/concepts/configuring-clusters/flex).
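To tie the sizing guidance together, here is a small back-of-the-envelope sketch
(not an official formula) that estimates how many workers are needed to run all
of a pipeline's executors in parallel, using the rough 75% YARN-memory rule from
this page. It ignores container rounding and overhead, and the pipeline and
worker shapes in the example are hypothetical.

```python
import math

def workers_needed(target_executors, executor_cpus, executor_memory_gb,
                   worker_cpus, worker_memory_gb, yarn_memory_fraction=0.75):
    """Estimate the worker count needed to run all executors at once."""
    yarn_memory_gb = worker_memory_gb * yarn_memory_fraction
    per_worker = min(worker_cpus // executor_cpus,
                     int(yarn_memory_gb // executor_memory_gb))
    return math.ceil(target_executors / per_worker)

# Hypothetical pipeline: the source reads 100 splits, so we want 100 executors
# of 1 CPU / 4 GB each, on 4 CPU / 16 GB workers (about 12 GB usable by YARN).
print(workers_needed(100, 1, 4, 4, 16))  # 34 workers
```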