This document describes how to monitor and troubleshoot Dataproc Serverless for Spark batch workloads using the Spark UI.
Spark UI
The Spark UI collects Apache Spark execution details from Dataproc Serverless for Spark batch workloads. There is no charge for the Spark UI feature, which is enabled by default.
Data collected by the Spark UI feature is retained for 90 days. You can use this web interface to monitor and debug Spark workloads without having to create a Persistent History Server.
Limitations
The Spark UI does not support the following Dataproc Serverless workloads:
Required Identity and Access Management permissions and roles
The following permissions are required to use the Spark UI feature with batch workloads.
- Data collection permission: dataproc.batches.sparkApplicationWrite. This permission must be granted to the service account that runs batch workloads. It is included in the Dataproc Worker role, which is automatically granted to the Compute Engine default service account that Dataproc Serverless uses by default (see Dataproc Serverless service account). However, if you specify a custom service account for your batch workload, you must add the dataproc.batches.sparkApplicationWrite permission to that service account (typically, by granting the service account the Dataproc Worker role).
- Spark UI access permission: dataproc.batches.sparkApplicationRead. This permission must be granted to a user to access the Spark UI in the Google Cloud console. It is included in the Dataproc Viewer, Dataproc Editor, and Dataproc Administrator roles. To open the Spark UI in the Google Cloud console, you must have one of these roles or a custom role that includes this permission.
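The permission rules above can be sketched as a small lookup. This is only an illustration: the mapping lists just the role and permission pairs named in this section (real roles contain many more permissions), and the function name `has_permission` is hypothetical, not part of any Google Cloud library.

```python
# Role-to-permission pairs taken only from the text above.
# Real IAM roles include many more permissions than shown here.
ROLE_PERMISSIONS = {
    "Dataproc Worker": {"dataproc.batches.sparkApplicationWrite"},
    "Dataproc Viewer": {"dataproc.batches.sparkApplicationRead"},
    "Dataproc Editor": {"dataproc.batches.sparkApplicationRead"},
    "Dataproc Administrator": {"dataproc.batches.sparkApplicationRead"},
}


def has_permission(granted_roles, permission, custom_permissions=()):
    """Return True if a granted role or a custom role's permissions include `permission`."""
    if permission in set(custom_permissions):
        return True
    return any(
        permission in ROLE_PERMISSIONS.get(role, set()) for role in granted_roles
    )


# A user with the Dataproc Viewer role can open the Spark UI.
print(has_permission(["Dataproc Viewer"], "dataproc.batches.sparkApplicationRead"))

# A custom service account with no roles cannot write Spark UI data
# until the write permission (or the Dataproc Worker role) is added.
print(has_permission([], "dataproc.batches.sparkApplicationWrite"))
```

To check real grants, inspect the IAM policy of the project or service account instead of relying on a static table like this one.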
Open the Spark UI
The Spark UI page is available in the Google Cloud console for Spark batch workloads.
Go to the Dataproc Batches page.
Click a Batch ID to open the batch details page.
Click View Spark UI in the top menu.
The View Spark UI button is disabled in the following cases:
- If a required permission isn't granted
- If you clear the Enable Spark UI checkbox on the Batch details page
- If you set the spark.dataproc.appContext.enabled property to false when you submit a batch workload
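As a rough aid for reasoning about the three cases above, the following sketch collects the reasons the button would be disabled. The function and argument names are hypothetical, and the default of "true" for the property is an assumption based on the statement that the Spark UI is enabled by default.

```python
def spark_ui_disabled_reasons(has_read_permission, checkbox_checked, spark_properties):
    """Return the reasons (possibly none) why View Spark UI would be disabled."""
    reasons = []
    if not has_read_permission:
        reasons.append("required permission not granted")
    if not checkbox_checked:
        reasons.append("Enable Spark UI checkbox cleared")
    # Assumption: the property defaults to "true" when it is not set.
    value = str(spark_properties.get("spark.dataproc.appContext.enabled", "true"))
    if value.strip().lower() == "false":
        reasons.append("spark.dataproc.appContext.enabled set to false")
    return reasons


# A workload submitted with the property set to false.
print(spark_ui_disabled_reasons(
    has_read_permission=True,
    checkbox_checked=True,
    spark_properties={"spark.dataproc.appContext.enabled": "false"},
))
```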
Persistent History Server
Dataproc Serverless for Spark creates the compute resources that are needed to run a workload, runs the workload on those resources, and then deletes the resources when the workload finishes. Workload metrics and events don't persist after a workload completes. However, you can use a Persistent History Server (PHS) to retain workload application history (event logs) in Cloud Storage.
To use a PHS with a batch workload, do the following:
Specify your PHS when you submit a workload.
Use the Component Gateway to connect to the PHS to view application details, scheduler stages, task-level details, and environment and executor information.
Dataproc Serverless for Spark logs
Logging is enabled by default in Dataproc Serverless for Spark, and workload logs persist after a workload finishes. Dataproc Serverless for Spark collects workload logs in Cloud Logging. You can access Dataproc Serverless for Spark logs under the Cloud Dataproc Batch resource in the Logs Explorer.
Query Dataproc Serverless for Spark logs
The Logs Explorer in the Google Cloud console provides a query pane to help you build a query to examine batch workload logs. To build a query, follow these steps:
- Your current project is selected. You can click Refine scope Project to select a different project.
Define a batch logs query.
Use filter menus to filter for a batch workload.
Under All resources, select the Cloud Dataproc Batch resource.
In the Select resource panel, select the batch LOCATION, then the BATCH ID. These batch parameters are listed on the Dataproc Batches page in the Google Cloud console.
Click Apply.
Under Select log names, enter dataproc.googleapis.com in the Search log names box to limit the log types to query. Select one or more of the listed log file names.
Use the query editor to filter for VM-specific logs.
Specify the resource type and VM resource name as shown in the following example:
  resource.type="cloud_dataproc_batch"
  labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-VM_SUFFIX"
Notes:
- BATCH_UUID: The batch UUID is listed on the Batch details page in the Google Cloud console, which opens when you click the Batch ID on the Batches page.
The batch logs also list the batch UUID in the VM resource name. Here's an example from a batch driver.log:
Click Run query.
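The VM-specific filter from the steps above can also be assembled programmatically, for example before pasting it into the query editor. This is a minimal sketch: the helper names are hypothetical, and the UUID and VM suffix in the usage line are made-up placeholder values, not real identifiers.

```python
def vm_resource_name(batch_uuid, vm_suffix):
    """Build the VM resource name pattern shown in the example query."""
    return f"gdpic-srvls-batch-{batch_uuid}-{vm_suffix}"


def vm_log_filter(batch_uuid, vm_suffix):
    """Return a Logs Explorer filter for logs from one batch VM.

    Clauses on separate lines are combined with an implicit AND.
    """
    name = vm_resource_name(batch_uuid, vm_suffix)
    return "\n".join([
        'resource.type="cloud_dataproc_batch"',
        f'labels."dataproc.googleapis.com/resource_name"="{name}"',
    ])


# Placeholder values for illustration only.
print(vm_log_filter("BATCH_UUID", "VM_SUFFIX"))
```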
Dataproc Serverless for Spark log types and sample queries
The following list describes different Dataproc Serverless log types and provides sample Logs Explorer queries for each log type.
dataproc.googleapis.com/output: This log file contains batch workload output. Dataproc Serverless for Spark streams batch output to the output namespace, and sets the filename to JOB_ID.driver.log.
Sample Logs Explorer query for output logs:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Foutput"
dataproc.googleapis.com/spark: The spark namespace aggregates Spark logs for daemons and executors running on Dataproc cluster master and worker VMs. Each log entry includes a master, worker, or executor component label to identify the log source, as follows:
- executor: Logs from user-code executors. Typically, these are distributed logs.
- master: Logs from the Spark standalone resource manager master, which are similar to Dataproc on Compute Engine YARN ResourceManager logs.
- worker: Logs from the Spark standalone resource manager worker, which are similar to Dataproc on Compute Engine YARN NodeManager logs.
Sample Logs Explorer query for all logs in the spark namespace:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fspark"
Sample Logs Explorer query for Spark standalone component logs in the spark namespace:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fspark"
  jsonPayload.component="COMPONENT"
dataproc.googleapis.com/startup: The startup namespace includes the batch (cluster) startup logs. Any initialization script logs are included. Components are identified by label, for example:
  startup-script[855]: ... activate-component-spark[3050]: ... enable spark-worker
Sample Logs Explorer query for startup logs on a specified VM:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fstartup"
  labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCH_UUID-VM_SUFFIX"
dataproc.googleapis.com/agent: The agent namespace aggregates Dataproc agent logs. Each log entry includes a filename label that identifies the log source.
Sample Logs Explorer query for agent logs generated by a specified worker VM:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fagent"
  labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCHUUID-wWORKER#"
dataproc.googleapis.com/autoscaler: The autoscaler namespace aggregates Dataproc Serverless for Spark autoscaler logs.
Sample Logs Explorer query for autoscaler logs generated by a specified worker VM:
  resource.type="cloud_dataproc_batch"
  resource.labels.location="REGION"
  resource.labels.batch_id="BATCH_ID"
  logName="projects/PROJECT_ID/logs/dataproc.googleapis.com%2Fautoscaler"
  labels."dataproc.googleapis.com/resource_name"="gdpic-srvls-batch-BATCHUUID-wWORKER#"
For more information, see Dataproc logs.
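The sample queries above share one shape and differ only in the log type and optional extra clauses. The following sketch generates them, including the URL-encoded slash (%2F) in the log name. The function name and the placeholder argument values are hypothetical; only the clause layout comes from the samples.

```python
from urllib.parse import quote

# Log types described in the list above.
LOG_TYPES = ("output", "spark", "startup", "agent", "autoscaler")


def batch_log_filter(project_id, region, batch_id, log_type, component=None):
    """Return a Logs Explorer filter for one Dataproc Serverless log type."""
    if log_type not in LOG_TYPES:
        raise ValueError(f"unknown log type: {log_type!r}")
    # The log ID must be URL-encoded inside logName, so "/" becomes "%2F".
    log_id = quote(f"dataproc.googleapis.com/{log_type}", safe="")
    clauses = [
        'resource.type="cloud_dataproc_batch"',
        f'resource.labels.location="{region}"',
        f'resource.labels.batch_id="{batch_id}"',
        f'logName="projects/{project_id}/logs/{log_id}"',
    ]
    if component is not None:
        # Only meaningful for the spark namespace: master, worker, or executor.
        clauses.append(f'jsonPayload.component="{component}"')
    return "\n".join(clauses)


# Placeholder values; replace them with your project, region, and batch ID.
print(batch_log_filter("PROJECT_ID", "REGION", "BATCH_ID", "spark", component="executor"))
```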
Dataproc Serverless audit logs
For information on Dataproc Serverless audit logs, see Dataproc audit logging.
Workload metrics
By default, Dataproc Serverless for Spark enables the collection of available Spark metrics, unless you use Spark metrics collection properties to disable or override the collection of one or more Spark metrics.
You can view workload metrics from the Metrics Explorer or the Batch details page in the Google Cloud console.
Batch metrics
Dataproc batch resource metrics provide insight into batch resources, such as the number of batch executors. Batch metrics are prefixed with dataproc.googleapis.com/batch.
Spark metrics
Available Spark metrics include Spark driver and executor metrics, and system metrics. Available Spark metrics are prefixed with custom.googleapis.com/.
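The two metric prefixes can be used to separate batch metrics from Spark metrics when you process a list of metric types. This is a minimal sketch; the function name is hypothetical, and the metric names in the usage lines are illustrative strings rather than a confirmed list of available metrics.

```python
BATCH_PREFIX = "dataproc.googleapis.com/batch"
SPARK_PREFIX = "custom.googleapis.com/"


def metric_family(metric_type):
    """Classify a metric type string by the prefixes described above."""
    if metric_type.startswith(BATCH_PREFIX):
        return "batch"
    if metric_type.startswith(SPARK_PREFIX):
        return "spark"
    return "other"


# Illustrative metric names only.
for name in (
    "dataproc.googleapis.com/batch/example_metric",
    "custom.googleapis.com/example/spark_metric",
    "compute.googleapis.com/instance/cpu/utilization",
):
    print(name, "->", metric_family(name))
```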
Set up metric alerts
You can create Dataproc metric alerts to receive notice of workload issues.
Create charts
You can create charts that visualize workload metrics by using the Metrics Explorer in the Google Cloud console. For example, you can create a chart to display disk:bytes_used, and then filter by batch_id.
Cloud Monitoring
Monitoring uses workload metadata and metrics to provide insights into the health and performance of Dataproc Serverless for Spark workloads. Workload metrics include Spark metrics, batch metrics, and operation metrics.
You can use Cloud Monitoring in the Google Cloud console to explore metrics, add charts, create dashboards, and create alerts.
Create dashboards
You can create a dashboard to monitor workloads using metrics from multiple projects and different Google Cloud products. For more information, see Create and manage custom dashboards.
Advanced troubleshooting (Preview)
This section covers the advanced troubleshooting features (Preview) that are available in the Google Cloud console. These features include Gemini-assisted troubleshooting for Dataproc Serverless, which is part of the Gemini in BigQuery offering.
Access to preview features
To sign up for the preview release of the advanced troubleshooting features, complete and submit the Gemini in BigQuery Pre-GA Sign-up form. Once the form is approved, projects listed in the form have access to preview features.
Preview pricing
There is no additional charge for participation in the preview. Charges will apply to the following preview features when they become generally available (GA):
Advance notice of GA charges will be sent to the email address that you provide in the preview sign-up form.
Feature requirements
Sign-up: You must sign up for the feature.
Permission: You must have the dataproc.batches.analyze permission.
- If you have the predefined roles/dataproc.admin, roles/dataproc.editor, or roles/dataproc.viewer role, you have the required permission. No further action is necessary.
- If you use a custom role to access Dataproc services, the custom role must have the dataproc.batches.analyze permission. You can use the gcloud CLI to add the permission, as shown in the following command, which adds the permission at the project level:
  gcloud iam roles update CUSTOM_ROLE_ID --project=PROJECT_ID \
      --add-permissions="dataproc.batches.analyze"
Enable Gemini-assisted troubleshooting for Dataproc Serverless: You enable Gemini-assisted troubleshooting for Dataproc Serverless when you submit each recurring Spark batch workload using the Google Cloud console, gcloud CLI, or the Dataproc API. Once this feature is enabled on a recurring batch workload, Dataproc stores a copy of the workload logs for 30 days, and uses the saved log data to provide Gemini-assisted troubleshooting for the workload. For information on Spark workload log content, see Dataproc Serverless for Spark logs.
Console
Perform the following steps to enable Gemini-assisted troubleshooting on each recurring Spark batch workload:
In the Google Cloud console, go to the Dataproc Batches page.
To create a batch workload, click Create.
In the Container section, fill in the Cohort name, which identifies the batch as one of a series of recurring workloads. Gemini-assisted analysis is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
Fill in other sections of the Create batch page as needed, then click Submit. For more information, see Submit a batch workload.
gcloud
Run the following gcloud CLI gcloud dataproc batches submit command locally in a terminal window or in Cloud Shell to enable Gemini-assisted troubleshooting on each recurring Spark batch workload:
  gcloud dataproc batches submit COMMAND \
      --region=REGION \
      --cohort=COHORT \
      other arguments ...
Replace the following:
- COMMAND: the Spark workload type, such as spark, pyspark, spark-sql, or spark-r.
- REGION: the region where your workload will run.
- COHORT: the cohort name, which identifies the batch as one of a series of recurring workloads. Gemini-assisted analysis is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
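If recurring workloads are submitted from a scheduler script, the command line above can be assembled in code so that every run uses the same cohort name. The sketch below only builds and prints the command and does not run it. The helper name is hypothetical, the workload types are spelled as the lowercase gcloud subcommand names, and `job.py` and `us-central1` are placeholder values.

```python
import shlex

# gcloud subcommand names for the workload types listed above.
WORKLOAD_TYPES = ("spark", "pyspark", "spark-sql", "spark-r")


def submit_command(workload_type, region, cohort, extra_args=()):
    """Build the argv list for a batch submission that sets a cohort name."""
    if workload_type not in WORKLOAD_TYPES:
        raise ValueError(f"unknown workload type: {workload_type!r}")
    return [
        "gcloud", "dataproc", "batches", "submit", workload_type,
        f"--region={region}",
        f"--cohort={cohort}",
        *extra_args,
    ]


# Placeholder arguments for illustration; nothing is executed here.
argv = submit_command("pyspark", "us-central1", "TPCH-Query1", ["job.py"])
print(shlex.join(argv))
```

Keeping the command as a list until the last moment avoids quoting problems if a cohort name ever contains spaces.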
API
Include the RuntimeConfig.cohort name in a batches.create request to enable Gemini-assisted troubleshooting on each recurring Spark batch workload. Gemini-assisted analysis is applied to the second and subsequent workloads submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
Example:
...
runtimeConfig:
  cohort: TPCH-Query1
...
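When the request body is built in code, the cohort can be merged into an existing body without disturbing other runtimeConfig fields. This is a sketch under assumptions: the helper name `with_cohort` is hypothetical, the body is treated as a plain dictionary that is later serialized to JSON, and the rest of the request (workload type, files, and so on) is left to the caller.

```python
import json


def with_cohort(batch_body, cohort):
    """Return a copy of a batches.create request body with runtimeConfig.cohort set."""
    body = dict(batch_body)
    # Copy the nested dict so the caller's original body is not modified.
    runtime_config = dict(body.get("runtimeConfig", {}))
    runtime_config["cohort"] = cohort
    body["runtimeConfig"] = runtime_config
    return body


# Start from an empty body for illustration; a real request needs more fields.
print(json.dumps(with_cohort({}, "TPCH-Query1"), indent=2))
```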
Gemini-assisted troubleshooting for Dataproc Serverless
The following Gemini-assisted troubleshooting preview features are available on the Batch details and Batches list pages in the Google Cloud console.
Investigate tab: The Investigate tab on the Batch details page provides a Health Overview (Preview) section with the following Gemini-assisted troubleshooting panels:
- What was autotuned? If you enabled autotuning on one or more workloads, this panel displays the most recent autotuning changes that were applied to running, completed, and failed workloads.
- What is happening now? and What can I do about it? Click Ask Gemini to request recommendations to help fix failed workloads or improve successful but slow workloads.
If you click Ask Gemini, Gemini for Google Cloud generates a summary of any errors, anomalies, or highlights from workload logs, Spark metrics, and Spark events. Gemini for Google Cloud can also display a list of recommended steps you can take to fix a failed workload or improve the performance of a successful, but slow workload.
Gemini-assisted troubleshooting columns: As part of the preview release, the Dataproc Batches list page in the Google Cloud console includes What was Autotuned, What is happening now?, and What can I do about it? columns.
The Ask Gemini button is displayed and enabled only if a completed batch is in a Failed, Cancelled, or Succeeded state. If you click Ask Gemini, Gemini for Google Cloud generates a summary of any errors, anomalies, or highlights from workload logs, Spark metrics, and Spark events. Gemini for Google Cloud can also display a list of recommended steps you can take to fix a failed workload or improve the performance of a successful, but slow workload.
Batch metric highlights
As part of the preview release, the Batch details page in the Google Cloud console includes charts that display important batch workload metric values. The metric charts are populated with values after the batch completes.
The following table lists the Spark workload metrics displayed on the Batch details page in the Google Cloud console, and describes how metric values can provide insight into workload status and performance.
Metric | What does it show? |
---|---|
Metrics at the Executor level | |
Ratio of JVM GC Time to Runtime | This metric shows the ratio of JVM GC (garbage collection) time to runtime per executor. High ratios can indicate memory leaks within tasks running on particular executors or inefficient data structures, which can lead to high object churn. |
Disk Bytes Spilled | This metric shows the total number of disk bytes spilled across different executors. If an executor shows high disk bytes spilled, this can indicate data skew. If the metric increases over time, this can indicate that there are stages with memory pressure or memory leaks. |
Bytes Read and Written | This metric shows the bytes written versus bytes read per executor. Large discrepancies in bytes read or written can indicate scenarios where replicated joins lead to data amplification on specific executors. |
Records Read and Written | This metric shows records read and written per executor. Large numbers of records read with low numbers of records written can indicate a bottleneck in processing logic on specific executors, leading to records being read while waiting. Executors that consistently lag in reads and writes can indicate resource contention on those nodes or executor-specific code inefficiencies. |
Ratio of Shuffle Write Time to Run Time | This metric shows the amount of time the executor spent on shuffle writes as compared to overall runtime. If this value is high for some executors, it can indicate data skew or inefficient data serialization. You can identify stages with long shuffle write times in the Spark UI. Look for outlier tasks within those stages taking more than the average time to complete. Check whether the executors with high shuffle write times also show high disk I/O activity. More efficient serialization and additional partitioning steps might help. Very large record writes compared to record reads can indicate unintended data duplication due to inefficient joins or incorrect transformations. |
Metrics at the Application level | |
Stages Progression | This metric shows the number of stages in failed, waiting, and running states. A large number of failed or waiting stages can indicate data skew. Check for data partitions, and debug the reason for stage failure using the Stages tab in the Spark UI. |
Batch Spark Executors | This metric shows the number of executors that might be required versus the number of executors running. A large difference between required and running executors can indicate autoscaling issues. |
Metrics at the VM level | |
Memory Used | This metric shows the percentage of VM memory in use. If the master percentage is high, it can indicate that the driver is under memory pressure. For other VM nodes, a high percentage can indicate that the executors are running out of memory, which can lead to high disk spillage and a slower workload runtime. Use the Spark UI to analyze executors to check for high GC time and high task failures. Also debug Spark code for large dataset caching and unnecessary broadcast of variables. |
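
To make the executor-level ratios in the table concrete, the following sketch computes the ratio of JVM GC time to runtime per executor and flags executors above a cutoff. The input layout, the sample numbers, and the 0.1 threshold are all assumptions chosen for illustration; this document does not define a threshold.

```python
def gc_ratios(executors):
    """Map executor ID to JVM GC time divided by runtime (0.0 if runtime is zero)."""
    return {
        executor_id: (gc_ms / run_ms if run_ms else 0.0)
        for executor_id, (gc_ms, run_ms) in executors.items()
    }


def flag_high_gc(executors, threshold=0.1):
    """Return the sorted IDs of executors whose GC ratio exceeds the threshold.

    The default threshold is a hypothetical value, not an official limit.
    """
    ratios = gc_ratios(executors)
    return sorted(eid for eid, ratio in ratios.items() if ratio > threshold)


# Made-up sample data: executor ID -> (GC time in ms, runtime in ms).
sample = {
    "1": (1200, 60000),
    "2": (15000, 60000),
    "3": (900, 45000),
}
print(gc_ratios(sample))
print(flag_high_gc(sample))
```

The same pattern applies to the shuffle write ratio: divide shuffle write time by runtime per executor and compare executors against each other rather than against a fixed number.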
Job logs
As part of the preview release, the Batch details page in the Google Cloud console lists job (batch workload) logs. The logs include warnings and errors filtered from workload output and Spark logs. You can select log Severity, add a Filter, and then click the View in Logs Explorer icon to open the selected batch logs in the Logs Explorer.
Example: Logs Explorer opens after choosing Errors from the Severity selector on the Batch details page in the Google Cloud console.