You can create a Monitoring alert that notifies you when a Dataproc cluster or job metric exceeds a specified threshold.
Steps to create an alert
To create an alert:
- Open the Alerting page in the Google Cloud console.
- Click + Create Policy to open the Create alerting policy page.
- Click Select Metric.
- In the "Filter by resource or metric name" input box, type "dataproc" to list Dataproc metrics. Navigate through the hierarchy of Cloud Dataproc metrics to select a cluster, job, batch, or session metric.
- Click Apply.
- Click Next to open the Configure alert trigger pane.
- Set a threshold value to trigger the alert.
- Click Next to open the Configure notifications and finalize alert pane.
- Set notification channels, documentation, and the alert policy name.
- Click Next to review the alert policy.
- Click Create Policy to create the alert.
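If you prefer to script policy creation instead of using the console, you can pass a policy file to gcloud alpha monitoring policies create --policy-from-file. The following is a minimal sketch, assuming the Cloud Monitoring AlertPolicy REST schema with an MQL condition; the display names and the 1800s trigger duration are illustrative, not prescribed values:

```json
{
  "displayName": "Dataproc long-running job",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Job in RUNNING state too long",
      "conditionMonitoringQueryLanguage": {
        "query": "fetch cloud_dataproc_job | metric 'dataproc.googleapis.com/job/state' | filter metric.state == 'RUNNING' | group_by [resource.job_id, metric.state], 1m | condition val() == true()",
        "duration": "1800s"
      }
    }
  ]
}
```

You would still attach notification channels, either in the policy file or with the gcloud notification-channel flags.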
This section describes a sample alert for a job submitted to the Dataproc service, an alert for a job run as a YARN application, and an alert based on cluster capacity deviation.
Long-running Dataproc job alert
Dataproc emits the dataproc.googleapis.com/job/state metric, which tracks how long a job has been in different states. You can find this metric in the Google Cloud console Metrics Explorer under the Cloud Dataproc Job resource. You can use this metric to set up an alert that notifies you when a job remains in the RUNNING state longer than a duration threshold.
Job duration alert setup
This example uses the Monitoring Query Language (MQL) to create an alert policy (see Creating MQL alerting policies (console)). The following query selects jobs in the RUNNING state; with a 30-minute trigger duration, the alert fires when a job has been running for more than 30 minutes.

fetch cloud_dataproc_job
| metric 'dataproc.googleapis.com/job/state'
| filter metric.state == 'RUNNING'
| group_by [resource.job_id, metric.state], 1m
| condition val() == true()
You can modify the query to apply it to a specific job by filtering on resource.job_id:

fetch cloud_dataproc_job
| metric 'dataproc.googleapis.com/job/state'
| filter (resource.job_id == '1234567890') && (metric.state == 'RUNNING')
| group_by [resource.job_id, metric.state], 1m
| condition val() == true()
Long-running YARN application alert
The previous sample shows an alert that triggers when a Dataproc job runs longer than a specified duration, but it applies only to jobs submitted to the Dataproc service through the Google Cloud console, the Google Cloud CLI, or direct calls to the jobs API. You can also use OSS metrics
to set up similar alerts that monitor the running time of YARN applications.
First, some background: YARN emits running-time metrics in multiple buckets. By default, YARN uses 60, 300, and 1440 minutes as bucket thresholds and emits four metrics:
- running_0 records the number of jobs with a runtime between 0 and 60 minutes.
- running_60 records the number of jobs with a runtime between 60 and 300 minutes.
- running_300 records the number of jobs with a runtime between 300 and 1440 minutes.
- running_1440 records the number of jobs with a runtime greater than 1440 minutes.
For example, a job running for 72 minutes is recorded in running_60, but not in running_0.
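The bucket semantics can be illustrated with a short sketch. The helper below is hypothetical (not part of any Dataproc or YARN API); given ascending bucket thresholds, it returns the running_* bucket that a job's runtime falls into:

```python
def yarn_runtime_bucket(runtime_minutes, thresholds=(60, 300, 1440)):
    """Return the name of the YARN running_* bucket metric that a job's
    runtime falls into, given ascending bucket thresholds in minutes."""
    bucket = 0
    for threshold in sorted(thresholds):
        if runtime_minutes >= threshold:
            bucket = threshold  # the job has outlived this threshold
        else:
            break
    return f"running_{bucket}"

print(yarn_runtime_bucket(72))  # a 72-minute job lands in running_60
```

The same helper also models custom thresholds, for example yarn_runtime_bucket(45, thresholds=(30, 60, 90)) falls into running_30.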
You can modify these default bucket thresholds by passing new values to the yarn.resourcemanager.metrics.runtime.buckets cluster property during Dataproc cluster creation. When you define custom bucket thresholds, you must also define metric overrides. For example, to specify bucket thresholds of 30, 60, and 90 minutes, the gcloud dataproc clusters create command should include the following flag:
--metric-overrides=yarn:ResourceManager:QueueMetrics:running_0,yarn:ResourceManager:QueueMetrics:running_30,yarn:ResourceManager:QueueMetrics:running_60,yarn:ResourceManager:QueueMetrics:running_90
Sample cluster creation command
gcloud dataproc clusters create test-cluster \
    --properties ^#^yarn:yarn.resourcemanager.metrics.runtime.buckets=30,60,90 \
    --metric-sources=yarn \
    --metric-overrides=yarn:ResourceManager:QueueMetrics:running_0,yarn:ResourceManager:QueueMetrics:running_30,yarn:ResourceManager:QueueMetrics:running_60,yarn:ResourceManager:QueueMetrics:running_90
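Because every custom threshold needs a matching metric override, it can help to generate the flag value programmatically. The helper below is a hypothetical sketch; the yarn:ResourceManager:QueueMetrics:running_ prefix comes from the command above:

```python
def metric_overrides_flag(thresholds):
    """Build the --metric-overrides value for a set of custom YARN
    runtime bucket thresholds (in minutes). Bucket 0 is always present."""
    buckets = [0] + sorted(thresholds)
    prefix = "yarn:ResourceManager:QueueMetrics:running_"
    return ",".join(f"{prefix}{b}" for b in buckets)

print(metric_overrides_flag([30, 60, 90]))
```

For thresholds 30, 60, and 90, this reproduces the flag value shown in the sample command.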
These metrics are listed in the Google Cloud console Metrics Explorer under the VM Instance (gce_instance) resource.
YARN application alert setup
- Create a cluster with the required buckets and metrics enabled.
- Create an alert policy that triggers when the number of applications in a YARN metric bucket exceeds a specified threshold.
- Optionally, add a filter to alert only on clusters that match a pattern.
- Configure the threshold for triggering the alert.
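As a sketch of the alert-policy step, an MQL condition on one of the bucket metrics might look like the following. The metric path shown is an assumption: Dataproc publishes OSS metrics as custom metrics on the gce_instance resource, so confirm the exact metric name in Metrics Explorer before using it.

fetch gce_instance
| metric 'custom.googleapis.com/yarn/ResourceManager/QueueMetrics/running_60'
| group_by [resource.instance_name], 1m
| condition val() > 0 '1'

This condition fires when any application has been running long enough to land in the running_60 bucket; raise the threshold above 0 to tolerate a given number of long-running applications.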
Cluster capacity deviation alert
Dataproc emits the dataproc.googleapis.com/cluster/capacity_deviation metric, which reports the difference between the expected node count in the cluster and the active YARN node count. You can find this metric in the Google Cloud console Metrics Explorer under the Cloud Dataproc Cluster resource. You can use this metric to create an alert that notifies you when cluster capacity deviates from expected capacity for longer than a specified duration.
The following operations can cause temporary under-reporting of cluster nodes in the capacity_deviation metric. To avoid false-positive alerts, set the metric alert threshold to account for these operations:
- Cluster creation and updates: The capacity_deviation metric is not emitted during cluster create or update operations.
- Cluster initialization actions: Initialization actions are performed after a node is provisioned.
- Secondary worker updates: Secondary workers are added asynchronously, after the update operation completes.
Capacity deviation alert setup
This example uses the Monitoring Query Language (MQL) to create an alert policy. The following query reports the capacity deviation each minute; configure the alert trigger so that it fires when the deviation is nonzero for more than 30 minutes.

fetch cloud_dataproc_cluster
| metric 'dataproc.googleapis.com/cluster/capacity_deviation'
| every 1m
| condition val() <> 0 '1'
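The trigger logic can be illustrated with a sketch. The helper below is hypothetical (it is not Monitoring's actual evaluator): given per-minute samples of expected and active node counts, it reports whether the deviation stayed nonzero for more than a threshold duration.

```python
def sustained_deviation(samples, threshold_minutes=30, interval_minutes=1):
    """samples: list of (expected_nodes, active_nodes) pairs sampled at
    interval_minutes. Returns True if expected - active is nonzero for
    more than threshold_minutes of consecutive samples."""
    run = 0
    for expected, active in samples:
        if expected - active != 0:
            run += interval_minutes
            if run > threshold_minutes:
                return True
        else:
            run = 0  # deviation cleared; restart the window
    return False

# 31 consecutive minutes of one missing node exceeds a 30-minute window
print(sustained_deviation([(5, 4)] * 31))  # True
```

This mirrors why the operations listed above matter: a brief deviation during, say, a secondary worker update resets before the window elapses and never triggers the alert.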
When an alert is triggered by a metric threshold condition, Monitoring creates an incident and a corresponding event. You can view incidents from the Monitoring Alerting page in the Google Cloud console.
If you defined a notification mechanism in the alert policy, such as an email or SMS notification, Monitoring sends a notification of the incident.
- See the Introduction to alerting.