OOM effects
When Dataproc on Compute Engine cluster VMs encounter out-of-memory (OOM) errors:
- Master and worker VMs freeze for a period of time.
- Master VM OOM errors cause jobs to fail with a "task not acquired" error.
- Worker VM OOM errors cause the loss of the node in YARN and HDFS, which delays Dataproc job execution.
YARN memory controls
YARN provides three types of memory controls:
- Polling based (legacy)
- Strict
- Elastic
By default, Dataproc doesn't set `yarn.nodemanager.resource.memory.enabled` to enable YARN memory controls, for the following reasons:
- Strict memory control can terminate containers even when sufficient memory is available if container sizes aren't configured correctly.
- Elastic memory control requirements can adversely affect job execution.
- YARN memory controls can fail to prevent OOM errors when processes aggressively consume memory.
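If you still want to experiment with YARN's own memory controls, the standard yarn-site properties can be set as cluster properties at cluster creation. The following is an illustrative sketch, not a Dataproc recommendation; the cluster name, region, and property values are placeholders:

```
# Illustrative only: enables YARN's cgroup-based strict memory control and
# keeps elastic memory control off. Cluster name and region are placeholders.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='yarn:yarn.nodemanager.resource.memory.enabled=true,yarn:yarn.nodemanager.elastic-memory-control.enabled=false'
```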
Dataproc memory protection
When a Dataproc cluster VM is under memory pressure, Dataproc memory protection terminates processes or containers until the OOM condition is removed.
Dataproc memory protection is provided for the following cluster node roles in the following Dataproc on Compute Engine image versions:
| Role | 1.5 | 2.0 | 2.1 |
|---|---|---|---|
| Master VM | 1.5.74+ | 2.0.48+ | all |
| Worker VM | Not available | 2.0.76+ | 2.1.24+ |
| Driver Pool VM | Not available | 2.0.76+ | 2.1.24+ |
How to identify Dataproc memory protection terminations
- Processes that Dataproc memory protection terminates exit with code `137` or `143`.
- Worker node termination:
  - Dataproc increments the `dataproc.googleapis.com/node/problem_count` cumulative metric, and sets the `reason` to `ProcessKilledDueToMemoryPressure`.
  - If Cloud Logging is enabled, Dataproc writes a `google.dataproc.oom-killer` log with the message "A process is killed due to memory pressure: [process name]."
  - If a YARN container is terminated, Dataproc writes the following message in the YARN resource manager: "[container id] exited with code 137, which potentially signifies a memory pressure on [node id]".
- Master or driver pool node termination: the job driver fails with `Driver received SIGTERM/SIGKILL signal and exited with [INT] code`.
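If Cloud Logging is enabled, one way to surface these terminations is to query the cluster's logs for `google.dataproc.oom-killer` entries. The following is a sketch only; the project ID and cluster name are placeholders, and the log name filter may need adjusting for your environment:

```
# Illustrative only: list recent memory-pressure terminations for a cluster.
gcloud logging read \
  'resource.type="cloud_dataproc_cluster"
   resource.labels.cluster_name="example-cluster"
   logName:"google.dataproc.oom-killer"' \
  --project=example-project \
  --limit=20
```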
OOM solutions
This section offers recommendations for job and container terminations that can result from OOM issues.
Job fails with "Driver received SIGTERM/SIGKILL signal and exited with [INT] code"
Recommendations:
- If the cluster has a driver pool, increase `driver-required-memory-mb` to match actual job memory usage.
- If the cluster does not have a driver pool, recreate the cluster, lowering the maximum number of concurrent jobs, which is calculated as `(total master memory in MB - 3584MB) / driver-size-mb`. You can lower this number by:
  - Setting `dataproc:dataproc.scheduler.max-concurrent-jobs`, or
  - Setting `dataproc:dataproc.scheduler.driver-size-mb` to a larger number (the default is `1024MB`). See the sketch after this list.
- Consider using a master node machine type with additional memory.
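As a worked example of the formula above: a master VM with 15360 MB of memory and the default `driver-size-mb` of 1024 MB allows roughly `(15360 - 3584) / 1024`, or about 11 concurrent jobs; raising `driver-size-mb` to 2048 MB lowers that ceiling to about 5. A minimal sketch of recreating a cluster with both scheduler properties set follows; the cluster name, region, and values are placeholders, not recommendations:

```
# Illustrative only: cap concurrent jobs and reserve more driver memory per job.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='dataproc:dataproc.scheduler.max-concurrent-jobs=5,dataproc:dataproc.scheduler.driver-size-mb=2048'
```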
Container exited with exit code 137 or 143
Recommendations:
If Dataproc memory protection terminated the container (see How to identify Dataproc memory protection terminations):
- Check that container sizes are configured correctly.
- Consider lowering `yarn.nodemanager.resource.memory-mb`. This property controls the amount of memory used for scheduling YARN containers (see the sketch after this list).
- If job containers consistently fail, check whether data skew is causing increased usage of specific containers. If so, repartition the job or increase worker size to accommodate additional memory requirements.
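If you decide to lower `yarn.nodemanager.resource.memory-mb`, you can set it as a cluster property at creation time. A minimal sketch, assuming an `n1-standard-4` worker (15 GB) where roughly 10 GB is left for YARN containers; the cluster name, region, machine type, and value are placeholders:

```
# Illustrative only: cap the memory YARN schedules per worker to leave headroom
# for non-YARN processes on the node.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --worker-machine-type=n1-standard-4 \
    --properties='yarn:yarn.nodemanager.resource.memory-mb=10240'
```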