This page provides information about Dataproc on Compute Engine VM out-of-memory (OOM) errors, and explains steps you can take to troubleshoot and resolve OOM errors.
OOM error effects
When Dataproc on Compute Engine VMs encounter out-of-memory (OOM) errors, the effects include the following conditions:
Master and worker VMs freeze for a period of time.
Master VM OOM errors cause jobs to fail with "task not acquired" errors.
Worker VM OOM errors cause a loss of the node on YARN HDFS, which delays Dataproc job execution.
YARN memory controls
Apache YARN provides the following types of memory controls:
Polling based (legacy)
Strict
Elastic
By default, Dataproc doesn't set
yarn.nodemanager.resource.memory.enabled to enable YARN memory controls, for
the following reasons (a sketch of setting this property follows the list):
Strict memory control can cause the termination of containers when there is
sufficient memory if container sizes aren't configured correctly.
Elastic memory control requirements can adversely affect job execution.
YARN memory controls can fail to prevent OOM errors when processes
aggressively consume memory.
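If you nevertheless want to experiment with YARN memory controls, you can set this property at cluster creation through the yarn: properties prefix, which maps values into yarn-site.xml. The following is a minimal sketch, not a recommendation; the cluster name and region are placeholders, and full cgroups-based enforcement typically needs additional NodeManager configuration that is out of scope here:

```
# Sketch only: cluster name and region are placeholders.
# The yarn: prefix writes the property into yarn-site.xml.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.resource.memory.enabled=true'
```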
Dataproc memory protection
When a Dataproc cluster VM is under memory pressure,
Dataproc memory protection terminates processes or containers
until the OOM condition is removed.
Dataproc provides memory protection for the following cluster nodes in the
following Dataproc on Compute Engine image versions:
| Role           | 1.5           | 2.0     | 2.1     | 2.2 |
|----------------|---------------|---------|---------|-----|
| Master VM      | 1.5.74+       | 2.0.48+ | all     | all |
| Worker VM      | Not Available | 2.0.76+ | 2.1.24+ | all |
| Driver Pool VM | Not Available | 2.0.76+ | 2.1.24+ | all |
Use Dataproc image versions with memory protection to help avoid VM OOM errors.
Identify and confirm memory protection terminations
You can use the following information to identify and confirm job terminations due to memory pressure.
Process terminations
Processes that Dataproc memory protection terminates exit
with code 137 or 143.
When Dataproc terminates a process due to memory pressure,
the following actions or conditions can occur:
Dataproc increments the
dataproc.googleapis.com/node/problem_count cumulative metric, and sets the
reason to ProcessKilledDueToMemoryPressure.
See Dataproc resource metric collection.
Dataproc writes a google.dataproc.oom-killer log with the message:
"A process is killed due to memory pressure: process name".
To view these messages, enable Logging, then use the following log filter (an example command-line query follows the filter):
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"A process is killed due to memory pressure:"
Master node or driver node pool job terminations
When a Dataproc master node or driver node pool job
terminates due to memory pressure, the job fails with the error
Driver received SIGTERM/SIGKILL signal and exited with INT code. To view these messages, enable Logging, then use the following log filter:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"Driver received SIGTERM/SIGKILL signal and exited with"
Check the
google.dataproc.oom-killer log or the dataproc.googleapis.com/node/problem_count metric
to confirm that Dataproc Memory Protection terminated the
job (see Process terminations).
Solutions (example commands follow this list):
If the cluster has a
driver pool,
increase driver-required-memory-mb to match actual job memory usage.
If the cluster doesn't have a driver pool, recreate the cluster, lowering the
maximum number of concurrent jobs
running on the cluster.
Use a master node machine type with increased memory.
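The following commands sketch these options. The job class, JAR path, machine type, and numeric values are illustrative placeholders, and dataproc:dataproc.scheduler.max-concurrent-jobs is assumed here as the cluster property that caps concurrent jobs:

```
# Driver pool: request more driver memory at submit time
# (4096 is illustrative; size it to actual job usage).
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --driver-required-memory-mb=4096 \
    --class=org.example.ExampleJob \
    --jars=gs://example-bucket/example-job.jar

# No driver pool: recreate the cluster with fewer concurrent jobs
# and a higher-memory master machine type.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --master-machine-type=n2-highmem-4 \
    --properties='dataproc:dataproc.scheduler.max-concurrent-jobs=3'
```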
Worker node YARN container terminations
Dataproc writes the following message in the YARN resource manager: container id exited with code
EXIT_CODE. To view these messages, enable
Logging, then use the following log filter:
resource.type="cloud_dataproc_cluster"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.cluster_uuid="CLUSTER_UUID"
jsonPayload.message:"container" AND "exited with code" AND "which potentially signifies memory pressure on NODE
If a container exits with code INT, check the
google.dataproc.oom-killer log or the dataproc.googleapis.com/node/problem_count metric
to confirm that Dataproc Memory Protection terminated the job
(see Process terminations).
Solutions (example commands follow this list):
Check that container sizes are configured correctly.
Consider lowering yarn.nodemanager.resource.memory-mb. This property
controls the amount of memory used for scheduling YARN containers.
If job containers consistently fail, check whether data skew is causing
increased usage of specific containers. If so, repartition the job or
increase worker size to accommodate additional memory requirements.
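As a sketch of these adjustments, the following commands lower the NodeManager memory budget at cluster creation and raise the shuffle partition count for a skewed Spark job; all names and numeric values are illustrative placeholders:

```
# Lower the memory YARN can schedule per node, leaving more headroom
# for the OS and daemons (the yarn: prefix maps to yarn-site.xml).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.resource.memory-mb=24576'

# Mitigate skew by spreading shuffle data across more partitions,
# so no single container has to hold an oversized partition.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --properties=spark.sql.shuffle.partitions=400 \
    --class=org.example.ExampleJob \
    --jars=gs://example-bucket/example-job.jar
```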
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-09-04 UTC."],[[["\u003cp\u003eThis document details how to troubleshoot and resolve out-of-memory (OOM) errors that can occur in Dataproc on Compute Engine VMs.\u003c/p\u003e\n"],["\u003cp\u003eOOM errors in Dataproc VMs can result in frozen VMs, job failures, and delays in job execution due to node loss on YARN HDFS.\u003c/p\u003e\n"],["\u003cp\u003eDataproc offers memory protection that terminates processes or containers when VMs experience memory pressure, using exit codes 137 or 143 to indicate terminations.\u003c/p\u003e\n"],["\u003cp\u003eJob terminations due to memory pressure can be confirmed by reviewing the \u003ccode\u003egoogle.dataproc.oom-killer\u003c/code\u003e log or checking the \u003ccode\u003edataproc.googleapis.com/node/problem_count\u003c/code\u003e metric.\u003c/p\u003e\n"],["\u003cp\u003eSolutions to memory pressure issues include increasing driver memory, recreating clusters with lower concurrent job limits, or adjusting YARN container memory configurations.\u003c/p\u003e\n"]]],[],null,["This page provides information on Dataproc on\nCompute Engine VM out-of-memory (OOM) errors, and explains steps you can take\nto troubleshoot and resolve OOM errors.\n\nOOM error effects\n\nWhen Dataproc on Compute Engine VMs encounter out-of-memory\n(OOM) errors, the effects include the following conditions:\n\n- Master and worker VMs freeze for a period of time.\n\n- Master VMs OOM errors cause jobs to fail with \"task not acquired\" errors.\n\n- Worker VM OOM errors cause a loss of the node on YARN HDFS, which delays\n Dataproc job execution.\n\nYARN memory controls\n\n[Apache YARN](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html)\nprovides the following types of\n[memory controls](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCGroupsMemory.html):\n\n- Polling based (legacy)\n- Strict\n- Elastic\n\nBy default, Dataproc doesn't set\n`yarn.nodemanager.resource.memory.enabled` to enable YARN memory controls, for\nthe following reasons:\n\n- Strict memory control can cause the termination of containers when there is sufficient memory if container sizes aren't configured correctly.\n- Elastic memory control requirements can adversely affect job execution.\n- YARN memory controls can fail to prevent OOM errors when processes aggressively consume memory.\n\nDataproc memory protection\n\nWhen a Dataproc cluster VM is under memory pressure,\nDataproc memory protection terminates processes or containers\nuntil the OOM condition is removed.\n\nDataproc provides memory protection for the following cluster\nnodes in the following\n[Dataproc on Compute Engine image versions](/dataproc/docs/concepts/versioning/dataproc-version-clusters):\n\n| Role | 1.5 | 2.0 | 2.1 | 2.2 |\n|----------------|---------------|---------|---------|-----|\n| Master VM | 1.5.74+ | 2.0.48+ | all | all |\n| Worker VM | Not Available | 2.0.76+ | 2.1.24+ | all |\n| Driver Pool VM | Not Available | 2.0.76+ | 2.1.24+ | all |\n\n| Use Dataproc image versions with memory 
protection to help avoid VM OOM errors.\n\nIdentify and confirm memory protection terminations\n\nYou can use the following information to identify and confirm\njob terminations due to memory pressure.\n\nProcess terminations\n\n- Processes that Dataproc memory protection terminates exit\n with code `137` or `143`.\n\n- When Dataproc terminates a process due to memory pressure,\n the following actions or conditions can occur:\n\n - Dataproc increments the `dataproc.googleapis.com/node/problem_count` cumulative metric, and sets the `reason` to `ProcessKilledDueToMemoryPressure`. See [Dataproc resource metric collection](/dataproc/docs/guides/dataproc-metrics#dataproc_resource_metric_collection).\n - Dataproc writes a `google.dataproc.oom-killer` log with the message: `\"A process is killed due to memory pressure: `\u003cvar translate=\"no\"\u003eprocess name\u003c/var\u003e. To view these messages, enable Logging, then use the following log filter: \n\n ```\n resource.type=\"cloud_dataproc_cluster\"\n resource.labels.cluster_name=\"CLUSTER_NAME\"\n resource.labels.cluster_uuid=\"CLUSTER_UUID\"\n jsonPayload.message:\"A process is killed due to memory pressure:\"\n ```\n\nMaster node or driver node pool job terminations\n\n- When a Dataproc master node or driver node pool job\n terminates due to memory pressure, the job fails with error\n `Driver received SIGTERM/SIGKILL signal and exited with `\u003cvar translate=\"no\"\u003eINT\u003c/var\u003e\n code. To view these messages, enable Logging, then use the\n following log filter:\n\n ```\n resource.type=\"cloud_dataproc_cluster\"\n resource.labels.cluster_name=\"CLUSTER_NAME\"\n resource.labels.cluster_uuid=\"CLUSTER_UUID\"\n jsonPayload.message:\"Driver received SIGTERM/SIGKILL signal and exited with\"\n \n ```\n\n \u003cbr /\u003e\n\n - Check the `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count` to confirm that Dataproc Memory Protection terminated the job (see [Process terminations](#process_terminations)).\n\n **Solutions:**\n - If the cluster has a [driver pool](/dataproc/docs/guides/node-groups/dataproc-driver-node-groups), increase `driver-required-memory-mb` to actual job memory usage.\n - If the cluster does not have a driver pool, recreate the cluster, lowering the [maximum number of concurrent jobs](/dataproc/docs/concepts/jobs/life-of-a-job#job_concurrency) running on the cluster.\n - Use a master node machine type with increased memory.\n\nWorker node YARN container terminations\n\n- Dataproc writes the following message in the YARN\n resource manager: \u003cvar translate=\"no\"\u003econtainer id\u003c/var\u003e` exited with code\n `\u003cvar translate=\"no\"\u003eEXIT_CODE\u003c/var\u003e. To view these messages, enable\n Logging, then use the following log filter:\n\n ```\n resource.type=\"cloud_dataproc_cluster\"\n resource.labels.cluster_name=\"CLUSTER_NAME\"\n resource.labels.cluster_uuid=\"CLUSTER_UUID\"\n jsonPayload.message:\"container\" AND \"exited with code\" AND \"which potentially signifies memory pressure on NODE\n ```\n- If a container exited with `code `\u003cvar translate=\"no\"\u003eINT\u003c/var\u003e, check the\n `google.dataproc.oom-killer` log or the `dataproc.googleapis.com/node/problem_count`\n to confirm that Dataproc Memory Protection terminated the job\n (see [Process terminations](#process_terminations)).\n\n **Solutions:**\n - Check that container sizes are configured correctly.\n - Consider lowering `yarn.nodemanager.resource.memory-mb`. 
This property controls the amount of memory used for scheduling YARN containers.\n - If job containers consistently fail, check if data skew is causing increased usage of specific containers. If so, repartition the job or increase worker size to accommodate additional memory requirements."]]