Cloud Monitoring

Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. Google Cloud's operations suite collects and ingests metrics, events, and metadata from Dataproc clusters, including per-cluster HDFS, YARN, job, and operation metrics, to generate insights via dashboards and charts (see Cloud Monitoring Dataproc metrics).

Use Cloud Monitoring cluster metrics to monitor the performance and health of Dataproc clusters.

Dataproc cluster metrics

Dataproc cluster resource metrics are automatically enabled on Dataproc clusters. Use Monitoring to view these metrics.

View cluster metrics

You can view cluster metrics from the Google Cloud console or by using the Monitoring API.

Console

  1. After creating a cluster, go to Monitoring in the console to view cluster monitoring data.

    After the Monitoring console opens, you can install the Monitoring agent on VMs in your project as an additional setup step. You don't need to install the agent on VMs in Dataproc clusters because this step is performed for you when you create a Dataproc cluster.

  2. Select Metrics Explorer. From the "Find resource type and metric" drop-down list, select the "Cloud Dataproc Cluster" resource (or type "cloud_dataproc_cluster" in the box).
  3. Click again in the input box, and then select a metric from the drop-down list, for example, "YARN memory size". Hovering over the metric name displays information about the metric.

    You can select filters, group by metric labels, perform aggregations, and select chart viewing options (see the Monitoring documentation).

API

You can use the Monitoring timeSeries.list API to capture and list metrics defined by a filter expression. Use the Try this API template on the API page to send an API request and display the response.

Example: The following Monitoring timeSeries.list parameters produce a templated request whose JSON response lists Dataproc cluster HDFS storage capacity (an equivalent curl request is sketched after the parameter list):

  • name: projects/example-project-id
  • filter: metric.type="dataproc.googleapis.com/cluster/hdfs/storage_capacity"
  • interval.endTime: 2018-02-27T11:54:00.000-08:00
  • interval.startTime: 2018-02-20T00:00:00.000-08:00
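
The Try this API template builds the request for you, but you can also issue the same call directly. The following is a minimal sketch of the equivalent timeSeries.list request using curl; it assumes gcloud is installed and authenticated, and reuses the example project ID and interval values listed above.

# Sketch of the timeSeries.list request above. The project ID and time
# interval are the example values; replace them with your own.
PROJECT_ID=example-project-id
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
  --data-urlencode 'filter=metric.type="dataproc.googleapis.com/cluster/hdfs/storage_capacity"' \
  --data-urlencode 'interval.startTime=2018-02-20T00:00:00.000-08:00' \
  --data-urlencode 'interval.endTime=2018-02-27T11:54:00.000-08:00'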

OSS metrics

Dataproc collects Dataproc cluster OSS (open source software) component metrics and integrates them into Monitoring. Dataproc OSS metrics are collected in the following format:

custom.googleapis.com/OSS_COMPONENT/METRIC.

OSS metric examples:

custom.googleapis.com/spark/driver/DAGScheduler/job/allJobs
custom.googleapis.com/hiveserver2/memory/MaxNonHeapMemory

Available OSS metrics

You can enable Dataproc to collect the OSS metrics listed in the following tables. The Collected by default column is marked with "y" if Dataproc collects the metric by default when you enable the associated metric source. Any of the metrics listed for a metric source, and all Spark metrics, can be enabled for collection if you override the collection of default metrics for the metric source (see Enable OSS metric collection).

Hadoop metrics

HDFS metrics

Metric Metrics Explorer name Collected by default
hdfs:NameNode:FSNamesystem:CapacityTotalGB dfs/FSNamesystem/CapacityTotalGB y
hdfs:NameNode:FSNamesystem:CapacityUsedGB dfs/FSNamesystem/CapacityUsedGB y
hdfs:NameNode:FSNamesystem:CapacityRemainingGB dfs/FSNamesystem/CapacityRemainingGB y
hdfs:NameNode:FSNamesystem:FilesTotal dfs/FSNamesystem/FilesTotal y
hdfs:NameNode:FSNamesystem:MissingBlocks dfs/FSNamesystem/MissingBlocks n
hdfs:NameNode:FSNamesystem:ExpiredHeartbeats dfs/FSNamesystem/ExpiredHeartbeats n
hdfs:NameNode:FSNamesystem:TransactionsSinceLastCheckpoint dfs/FSNamesystem/TransactionsSinceLastCheckpoint n
hdfs:NameNode:FSNamesystem:TransactionsSinceLastLogRoll dfs/FSNamesystem/TransactionsSinceLastLogRoll n
hdfs:NameNode:FSNamesystem:LastWrittenTransactionId dfs/FSNamesystem/LastWrittenTransactionId n
hdfs:NameNode:FSNamesystem:CapacityTotal dfs/FSNamesystem/CapacityTotal n
hdfs:NameNode:FSNamesystem:CapacityUsed dfs/FSNamesystem/CapacityUsed n
hdfs:NameNode:FSNamesystem:CapacityRemaining dfs/FSNamesystem/CapacityRemaining n
hdfs:NameNode:FSNamesystem:CapacityUsedNonDFS dfs/FSNamesystem/CapacityUsedNonDFS n
hdfs:NameNode:FSNamesystem:TotalLoad dfs/FSNamesystem/TotalLoad n
hdfs:NameNode:FSNamesystem:SnapshottableDirectories dfs/FSNamesystem/SnapshottableDirectories n
hdfs:NameNode:FSNamesystem:Snapshots dfs/FSNamesystem/Snapshots n
hdfs:NameNode:FSNamesystem:BlocksTotal dfs/FSNamesystem/BlocksTotal n
hdfs:NameNode:FSNamesystem:PendingReplicationBlocks dfs/FSNamesystem/PendingReplicationBlocks n
hdfs:NameNode:FSNamesystem:UnderReplicatedBlocks dfs/FSNamesystem/UnderReplicatedBlocks n
hdfs:NameNode:FSNamesystem:CorruptBlocks dfs/FSNamesystem/CorruptBlocks n
hdfs:NameNode:FSNamesystem:ScheduledReplicationBlocks dfs/FSNamesystem/ScheduledReplicationBlocks n
hdfs:NameNode:FSNamesystem:PendingDeletionBlocks dfs/FSNamesystem/PendingDeletionBlocks n
hdfs:NameNode:FSNamesystem:ExcessBlocks dfs/FSNamesystem/ExcessBlocks n
hdfs:NameNode:FSNamesystem:PostponedMisreplicatedBlocks dfs/FSNamesystem/PostponedMisreplicatedBlocks n
hdfs:NameNode:FSNamesystem:PendingDataNodeMessageCount dfs/FSNamesystem/PendingDataNodeMessageCount n
hdfs:NameNode:FSNamesystem:MillisSinceLastLoadedEdits dfs/FSNamesystem/MillisSinceLastLoadedEdits n
hdfs:NameNode:FSNamesystem:BlockCapacity dfs/FSNamesystem/BlockCapacity n
hdfs:NameNode:FSNamesystem:StaleDataNodes dfs/FSNamesystem/StaleDataNodes n
hdfs:NameNode:FSNamesystem:TotalFiles dfs/FSNamesystem/TotalFiles n
hdfs:NameNode:JvmMetrics:MemHeapUsedM dfs/jvm/MemHeapUsedM n
hdfs:NameNode:JvmMetrics:MemHeapCommittedM dfs/jvm/MemHeapCommittedM n
hdfs:NameNode:JvmMetrics:MemHeapMaxM dfs/jvm/MemHeapMaxM n
hdfs:NameNode:JvmMetrics:MemMaxM dfs/jvm/MemMaxM n

YARN metrics

Metric Metrics Explorer name Collected by default
yarn:ResourceManager:ClusterMetrics:NumActiveNMs yarn/ClusterMetrics/NumActiveNMs y
yarn:ResourceManager:ClusterMetrics:NumDecommissionedNMs yarn/ClusterMetrics/NumDecommissionedNMs n
yarn:ResourceManager:ClusterMetrics:NumLostNMs yarn/ClusterMetrics/NumLostNMs n
yarn:ResourceManager:ClusterMetrics:NumUnhealthyNMs yarn/ClusterMetrics/NumUnhealthyNMs n
yarn:ResourceManager:ClusterMetrics:NumRebootedNMs yarn/ClusterMetrics/NumRebootedNMs n
yarn:ResourceManager:QueueMetrics:running_0 yarn/QueueMetrics/running_0 y
yarn:ResourceManager:QueueMetrics:running_60 yarn/QueueMetrics/running_60 y
yarn:ResourceManager:QueueMetrics:running_300 yarn/QueueMetrics/running_300 y
yarn:ResourceManager:QueueMetrics:running_1440 yarn/QueueMetrics/running_1440 y
yarn:ResourceManager:QueueMetrics:AppsSubmitted yarn/QueueMetrics/AppsSubmitted y
yarn:ResourceManager:QueueMetrics:AvailableMB yarn/QueueMetrics/AvailableMB y
yarn:ResourceManager:QueueMetrics:PendingContainers yarn/QueueMetrics/PendingContainers y
yarn:ResourceManager:QueueMetrics:AppsRunning yarn/QueueMetrics/AppsRunning n
yarn:ResourceManager:QueueMetrics:AppsPending yarn/QueueMetrics/AppsPending n
yarn:ResourceManager:QueueMetrics:AppsCompleted yarn/QueueMetrics/AppsCompleted n
yarn:ResourceManager:QueueMetrics:AppsKilled yarn/QueueMetrics/AppsKilled n
yarn:ResourceManager:QueueMetrics:AppsFailed yarn/QueueMetrics/AppsFailed n
yarn:ResourceManager:QueueMetrics:AllocatedMB yarn/QueueMetrics/AllocatedMB n
yarn:ResourceManager:QueueMetrics:AllocatedVCores yarn/QueueMetrics/AllocatedVCores n
yarn:ResourceManager:QueueMetrics:AllocatedContainers yarn/QueueMetrics/AllocatedContainers n
yarn:ResourceManager:QueueMetrics:AggregateContainersAllocated yarn/QueueMetrics/AggregateContainersAllocated n
yarn:ResourceManager:QueueMetrics:AggregateContainersReleased yarn/QueueMetrics/AggregateContainersReleased n
yarn:ResourceManager:QueueMetrics:AvailableVCores yarn/QueueMetrics/AvailableVCores n
yarn:ResourceManager:QueueMetrics:PendingMB yarn/QueueMetrics/PendingMB n
yarn:ResourceManager:QueueMetrics:PendingVCores yarn/QueueMetrics/PendingVCores n
yarn:ResourceManager:QueueMetrics:ReservedMB yarn/QueueMetrics/ReservedMB n
yarn:ResourceManager:QueueMetrics:ReservedVCores yarn/QueueMetrics/ReservedVCores n
yarn:ResourceManager:QueueMetrics:ReservedContainers yarn/QueueMetrics/ReservedContainers n
yarn:ResourceManager:QueueMetrics:ActiveUsers yarn/QueueMetrics/ActiveUsers n
yarn:ResourceManager:QueueMetrics:ActiveApplications yarn/QueueMetrics/ActiveApplications n
yarn:ResourceManager:QueueMetrics:FairShareMB yarn/QueueMetrics/FairShareMB n
yarn:ResourceManager:QueueMetrics:FairShareVCores yarn/QueueMetrics/FairShareVCores n
yarn:ResourceManager:QueueMetrics:MinShareMB yarn/QueueMetrics/MinShareMB n
yarn:ResourceManager:QueueMetrics:MinShareVCores yarn/QueueMetrics/MinShareVCores n
yarn:ResourceManager:QueueMetrics:MaxShareMB yarn/QueueMetrics/MaxShareMB n
yarn:ResourceManager:QueueMetrics:MaxShareVCores yarn/QueueMetrics/MaxShareVCores n
yarn:ResourceManager:JvmMetrics:MemHeapUsedM yarn/jvm/MemHeapUsedM n
yarn:ResourceManager:JvmMetrics:MemHeapCommittedM yarn/jvm/MemHeapCommittedM n
yarn:ResourceManager:JvmMetrics:MemHeapMaxM yarn/jvm/MemHeapMaxM n
yarn:ResourceManager:JvmMetrics:MemMaxM yarn/jvm/MemMaxM n

Spark metrics

Spark driver metrics

Metric Metrics Explorer name Collected by default
spark:driver:BlockManager:disk.diskSpaceUsed_MB spark/driver/BlockManager/disk/diskSpaceUsed_MB y
spark:driver:BlockManager:memory.maxMem_MB spark/driver/BlockManager/memory/maxMem_MB y
spark:driver:BlockManager:memory.memUsed_MB spark/driver/BlockManager/memory/memUsed_MB y
spark:driver:DAGScheduler:job.allJobs spark/driver/DAGScheduler/job/allJobs y
spark:driver:DAGScheduler:stage.failedStages spark/driver/DAGScheduler/stage/failedStages y
spark:driver:DAGScheduler:stage.waitingStages spark/driver/DAGScheduler/stage/waitingStages y

Spark executor metrics

Metric Metrics Explorer name Collected by default
spark:executor:executor:bytesRead spark/executor/bytesRead y
spark:executor:executor:bytesWritten spark/executor/bytesWritten y
spark:executor:executor:cpuTime spark/executor/cpuTime y
spark:executor:executor:diskBytesSpilled spark/executor/diskBytesSpilled y
spark:executor:executor:recordsRead spark/executor/recordsRead y
spark:executor:executor:recordsWritten spark/executor/recordsWritten y
spark:executor:executor:runTime spark/executor/runTime y
spark:executor:executor:shuffleRecordsRead spark/executor/shuffleRecordsRead y
spark:executor:executor:shuffleRecordsWritten spark/executor/shuffleRecordsWritten y

Spark History Server metrics

Dataproc collects the following Spark History Server JVM memory metrics:

Metric Metrics Explorer name Collected by default
sparkHistoryServer:JVM:Memory:HeapMemoryUsage.committed sparkHistoryServer/memory/CommittedHeapMemory y
sparkHistoryServer:JVM:Memory:HeapMemoryUsage.used sparkHistoryServer/memory/UsedHeapMemory y
sparkHistoryServer:JVM:Memory:HeapMemoryUsage.max sparkHistoryServer/memory/MaxHeapMemory y
sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.committed sparkHistoryServer/memory/CommittedNonHeapMemory y
sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.used sparkHistoryServer/memory/UsedNonHeapMemory y
sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.max sparkHistoryServer/memory/MaxNonHeapMemory y

HiveServer 2 metrics

Metric Metrics Explorer name Collected by default
hiveserver2:JVM:Memory:HeapMemoryUsage.committed hiveserver2/memory/CommittedHeapMemory y
hiveserver2:JVM:Memory:HeapMemoryUsage.used hiveserver2/memory/UsedHeapMemory y
hiveserver2:JVM:Memory:HeapMemoryUsage.max hiveserver2/memory/MaxHeapMemory y
hiveserver2:JVM:Memory:NonHeapMemoryUsage.committed hiveserver2/memory/CommittedNonHeapMemory y
hiveserver2:JVM:Memory:NonHeapMemoryUsage.used hiveserver2/memory/UsedNonHeapMemory y
hiveserver2:JVM:Memory:NonHeapMemoryUsage.max hiveserver2/memory/MaxNonHeapMemory y

Dataproc monitoring agent metrics

By default, Dataproc collects the following monitoring agent metrics, which are published with an agent.googleapis.com prefix:

CPU
agent.googleapis.com/cpu/load_15m
agent.googleapis.com/cpu/load_1m
agent.googleapis.com/cpu/load_5m
agent.googleapis.com/cpu/usage_time
agent.googleapis.com/cpu/utilization

Disk
agent.googleapis.com/disk/bytes_used
agent.googleapis.com/disk/io_time
agent.googleapis.com/disk/merged_operations
agent.googleapis.com/disk/operation_count
agent.googleapis.com/disk/operation_time
agent.googleapis.com/disk/pending_operations
agent.googleapis.com/disk/percent_used
agent.googleapis.com/disk/read_bytes_count

Swap
agent.googleapis.com/swap/bytes_used
agent.googleapis.com/swap/io
agent.googleapis.com/swap/percent_used

Memory
agent.googleapis.com/memory/bytes_used
agent.googleapis.com/memory/percent_used

Processes (a few attributes follow a slightly different quota policy)
agent.googleapis.com/processes/count_by_state
agent.googleapis.com/processes/cpu_time
agent.googleapis.com/processes/disk/read_bytes_count
agent.googleapis.com/processes/disk/write_bytes_count
agent.googleapis.com/processes/fork_count
agent.googleapis.com/processes/rss_usage
agent.googleapis.com/processes/vm_usage

Interface
agent.googleapis.com/interface/errors
agent.googleapis.com/interface/packets
agent.googleapis.com/interface/traffic

Network
agent.googleapis.com/network/tcp_connections

Enable OSS metric collection

When you create a Dataproc cluster, you can use the gcloud CLI or the Dataproc API to enable the collection of OSS metrics in two ways (you can use one or both collection methods):

  1. Enable the collection of only the default metrics from one or more OSS metric sources
  2. Enable the collection of only specified ("override") metrics from one or more OSS metric sources

gcloud command

Default metric collection

Use the gcloud dataproc clusters create --metric-sources flag to enable the collection of default available OSS metrics from one or more metric sources.

gcloud dataproc clusters create cluster-name \
    --metric-sources=METRIC_SOURCE(s) \
    ... other flags

Notes:

  • --metric-sources: Required to enable default metric collection. Specify one or more of the following metric sources: spark, hdfs, yarn, spark-history-server, hiveserver2, and monitoring-agent-defaults. The metric source name is case insensitive (for example, either "yarn" or "YARN" is acceptable).
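
For example, the following command (the cluster name and region are hypothetical) enables collection of the default Spark and YARN metrics; other flags are omitted for brevity:

# Hypothetical example: collect the default spark and yarn OSS metrics.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --metric-sources=spark,yarn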

Override metric collection

Optionally, add the --metric-overrides or --metric-overrides-file flag to enable the collection of one or more of the available OSS metrics from one or more metric sources.

  • Any of the available OSS metrics, and all Spark metrics, can be listed for collection as a metric override. Override metric values are case sensitive and, if appropriate, must be provided in camelcase format.

    Examples

    • sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.committed
    • hiveserver2:JVM:Memory:NonHeapMemoryUsage.used
    • yarn:ResourceManager:JvmMetrics:MemHeapMaxM
  • Only the specified overridden metrics will be collected from a given metric source. For example, if one or more spark:executor metrics are listed as metric overrides, other spark metrics will not be collected. The collection of default OSS metrics from other metric sources is unaffected. For example, if both spark and yarn metric sources are enabled, and overrides are provided for Spark metrics only, all default YARN metrics will be collected.
  • The source of the specified metric override must be enabled. For example, if one or more spark:driver metrics are provided as metric overrides, the spark metric source must be enabled (--metric-sources=spark).

Override metrics list

gcloud dataproc clusters create cluster-name \
    --metric-sources=METRIC_SOURCE(s) \
    --metric-overrides=LIST_OF_METRIC_OVERRIDES \
    ... other flags

Notes:

  • --metric-sources: Required to enable default metric collection. Specify one or more of the following metric sources: spark, hdfs, yarn, spark-history-server, hiveserver2, and monitoring-agent-defaults. The metric source name is case insensitive (for example, either "yarn" or "YARN" is acceptable).
  • --metric-overrides: Provide a list of metrics in the following format:

    METRIC_SOURCE:INSTANCE:GROUP:METRIC

    Use camelcase format as appropriate.

    Example: --metric-overrides=sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.committed

  • This flag is an alternative to and cannot be used with the --metric-overrides-file flag.
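
As an illustration (the cluster name and region are hypothetical), the following command enables the spark and yarn metric sources but collects only the two listed Spark driver metrics from the spark source; yarn continues to emit its default metrics:

# Hypothetical example: yarn keeps its default metrics; spark collects only
# the two overridden driver metrics.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --metric-sources=spark,yarn \
    --metric-overrides=spark:driver:DAGScheduler:job.allJobs,spark:driver:BlockManager:disk.diskSpaceUsed_MB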

Override metrics file

gcloud dataproc clusters create cluster-name \
    --metric-sources=METRIC_SOURCE(s) \
    --metric-overrides-file=METRIC_OVERRIDES_FILENAME \
    ... other flags

Notes:

  • --metric-sources: Required to enable default metric collection. Specify one or more of the following metric sources: spark, hdfs, yarn, spark-history-server, hiveserver2, and monitoring-agent-defaults. The metric source name is case insensitive, for example, either "yarn" or "YARN" is acceptable.
  • --metric-overrides-file: Specify a local or Cloud Storage file (gs://bucket/filename) that contains one or more metrics in the following format:

    METRIC_SOURCE:INSTANCE:GROUP:METRIC

    Use camelcase format as appropriate.

    Example of a metric listed in the overrides file: sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.committed

  • This flag is an alternative to and cannot be used with the --metric-overrides flag.
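
The following is a sketch of the file-based variant, assuming the overrides file lists one metric per line; the bucket, file, and cluster names are hypothetical:

# metric-overrides.txt (one metric per line), staged in Cloud Storage:
#   sparkHistoryServer:JVM:Memory:NonHeapMemoryUsage.committed
#   hiveserver2:JVM:Memory:NonHeapMemoryUsage.used
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --metric-sources=spark-history-server,hiveserver2 \
    --metric-overrides-file=gs://example-bucket/metric-overrides.txt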

REST API

Use DataprocMetricConfig as part of a clusters.create request to enable the collection of OSS metrics.
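
The following curl sketch shows where DataprocMetricConfig fits in a clusters.create request body; the project, region, cluster name, and chosen metrics are placeholders, and other cluster config fields are omitted for brevity:

# Sketch: create a cluster that collects default spark metrics plus one
# overridden YARN metric. Replace the placeholder project, region, and
# cluster values with your own.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1/projects/example-project-id/regions/us-central1/clusters" \
  -d '{
    "projectId": "example-project-id",
    "clusterName": "example-cluster",
    "config": {
      "dataprocMetricConfig": {
        "metrics": [
          {"metricSource": "SPARK"},
          {"metricSource": "YARN",
           "metricOverrides": ["yarn:ResourceManager:JvmMetrics:MemHeapMaxM"]}
        ]
      }
    }
  }'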

Build a Monitoring dashboard

You can build a custom Monitoring dashboard that displays charts of selected Dataproc cluster metrics.

  1. Select + CREATE DASHBOARD from the Monitoring Dashboards Overview page. Provide a name for the dashboard, then click Add Chart in the upper-right menu to open the Add Chart window. Select "Cloud Dataproc Cluster" as the resource type. Select one or more metrics, set metric and chart properties, and then Save the chart.

  2. You can add additional charts to your dashboard. After you Save the dashboard, its title appears in the Monitoring Dashboards Overview page. Dashboard charts can be viewed, updated, and deleted from the dashboard display page.
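
Dashboards can also be defined as configuration. The following is a hedged sketch that creates a dashboard from a JSON definition with gcloud monitoring dashboards create; the display names and file name are placeholders, the layout fields follow the Monitoring dashboards API, and the chart filter reuses the HDFS storage capacity metric from the earlier example.

# Sketch: a one-chart dashboard for the Dataproc HDFS capacity metric.
cat > dataproc-dashboard.json <<'EOF'
{
  "displayName": "Dataproc cluster overview",
  "gridLayout": {
    "widgets": [
      {
        "title": "HDFS capacity",
        "xyChart": {
          "dataSets": [
            {
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "metric.type=\"dataproc.googleapis.com/cluster/hdfs/storage_capacity\" resource.type=\"cloud_dataproc_cluster\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN"
                  }
                }
              }
            }
          ]
        }
      }
    ]
  }
}
EOF
gcloud monitoring dashboards create --config-from-file=dataproc-dashboard.json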

Create alerts

You can create a Monitoring alert that notifies you when a Dataproc cluster or job metric crosses a specified threshold, for example, when HDFS free capacity is low.

  1. Open Monitoring Alerting in the console. Click + CREATE POLICY to open the Create new alerting policy form. Define an alert by adding alert conditions, policy triggers, notification channels, and documentation.

  2. Select ADD CONDITION to open the alert condition form with the Metric tab selected. Fill in the fields to define an alert condition, then click ADD. For example, an alert condition can trigger when Dataproc cluster HDFS capacity falls below a 930 GiB (binary GB) threshold (998,579,896,320 bytes) for 1 minute.

  3. After adding the alert condition, complete the alert policy by setting notification channels, policy triggers, documentation, and the alert policy name.
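
Alert policies can also be created from a JSON definition instead of the console form. The following is a hedged sketch using gcloud alpha monitoring policies create; it mirrors the 930 GiB HDFS capacity condition described in step 2, and the display names and file name are placeholders (add notification channels as needed).

# Sketch: alert when Dataproc HDFS capacity stays below the 930 GiB
# threshold (998,579,896,320 bytes) for 1 minute.
cat > hdfs-capacity-policy.json <<'EOF'
{
  "displayName": "Low Dataproc HDFS capacity",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "HDFS capacity below 930 GiB",
      "conditionThreshold": {
        "filter": "metric.type=\"dataproc.googleapis.com/cluster/hdfs/storage_capacity\" resource.type=\"cloud_dataproc_cluster\"",
        "comparison": "COMPARISON_LT",
        "thresholdValue": 998579896320,
        "duration": "60s",
        "aggregations": [
          {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
        ]
      }
    }
  ]
}
EOF
gcloud alpha monitoring policies create --policy-from-file=hdfs-capacity-policy.json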

View alerts

When an alert is triggered by a metric threshold condition, Monitoring creates an incident and a corresponding event. You can view incidents from the Monitoring Alerting page in the console. If you defined a notification mechanism in the alert policy, such as an email or SMS notification, Monitoring also sends a notification of the incident.

What's next