Optimizing a Spark workload for performance and resiliency can be challenging due to the number of Spark configuration options and the difficulty of assessing how those options impact a workload. Dataproc Serverless autotuning provides an alternative to manual workload configuration by automatically applying Spark configuration settings to a recurring Spark workload based on Spark optimization best practices and an analysis of workload runs.
Sign up for Dataproc Serverless autotuning
To sign up for access to the Dataproc Serverless autotuning preview release described on this page, complete and submit the Gemini in BigQuery Preview signup form. After the form is approved, projects listed in the form have access to preview features.
Benefits
Dataproc Serverless autotuning can provide the following benefits:
- Improved performance: Optimization tuning to increase performance
- Quicker optimization: Auto-configuration to avoid time-consuming manual configuration testing
- Increased resiliency: Automatic memory allocation to avoid memory-related failures
Limitations
Dataproc Serverless autotuning has the following limitations:
- Autotuning is calculated and applied to the second and subsequent runs of a workload. The first run of a recurring workload is not autotuned because Dataproc Serverless autotuning uses workload history for optimization.
- Memory downsizing is not supported.
- Autotuning is not applied retroactively to running workloads, only to newly submitted workload cohorts.
Autotuning cohorts
Autotuning is applied to recurring executions of a batch workload, called cohorts. The cohort name that you specify when you submit a workload identifies it as one of the successive runs of the recurring workload. We recommend that you use cohort names that describe the type of workload or that otherwise help to identify the runs of a workload as part of a recurring workload. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
Autotuning scenarios
You apply Dataproc Serverless autotuning to your workload by selecting one or more of the following autotuning scenarios:
- MEMORY: Autotune Spark memory allocation to predict and avoid potential workload out-of-memory errors, and to fix a workload that previously failed due to an out-of-memory (OOM) error.
- SCALING: Autotune Spark autoscaling configuration settings.
- BHJ: Autotune Spark configuration settings to optimize SQL broadcast join performance.
Pricing
Dataproc Serverless autotuning is offered during preview without additional charge. Standard Dataproc Serverless pricing applies.
Regional availability
You can use Dataproc Serverless autotuning with batches that are submitted in available Compute Engine regions.
Use Dataproc Serverless autotuning
You can enable Dataproc Serverless autotuning on a workload by using the Google Cloud console, Google Cloud CLI, or Dataproc API.
Console
To enable Dataproc Serverless autotuning on each submission of a recurring batch workload, perform the following steps:
In the Google Cloud console, go to the Dataproc Batches page.
To create a batch workload, click Create.
In the Container section, fill in the following fields for your Spark workload:
- Cohort: the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
- Autotuning scenarios: one or more autotuning scenarios to use to optimize the workload, for example, BHJ, MEMORY, and SCALING. You can change the scenario selection with each batch cohort submission.
Fill in other sections of the Create batch page as needed, then click Submit. For more information about these fields, see Submit a batch workload.
gcloud
To enable Dataproc Serverless autotuning on each submission of a recurring batch workload, run the following gcloud dataproc batches submit command locally in a terminal window or in Cloud Shell.

    gcloud dataproc batches submit COMMAND \
        --region=REGION \
        --cohort=COHORT \
        --autotuning-scenarios=SCENARIOS \
        other arguments ...

Replace the following:
- COMMAND: the Spark workload type, such as spark, pyspark, spark-sql, or spark-r.
- REGION: the region where your workload will run.
- COHORT: the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
- SCENARIOS: one or more comma-separated autotuning scenarios to use to optimize the workload, for example, --autotuning-scenarios=BHJ,MEMORY,SCALING. You can change the scenario list with each batch cohort submission.
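As an illustration of how the placeholders above fit together, the following Python sketch assembles the command line for a PySpark workload and prints it for review. The function name, the Cloud Storage path, and the region are hypothetical; the sketch assumes a PySpark batch (which takes its main Python file as a positional argument) and checks scenarios against the three names listed on this page.

```python
import shlex

# Scenario names listed in the "Autotuning scenarios" section of this page.
KNOWN_SCENARIOS = ("BHJ", "MEMORY", "SCALING")


def build_pyspark_submit_command(main_file, region, cohort, scenarios):
    """Return the argv list for a `gcloud dataproc batches submit pyspark` call.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    if not scenarios:
        raise ValueError("select at least one autotuning scenario")
    unknown = [s for s in scenarios if s not in KNOWN_SCENARIOS]
    if unknown:
        raise ValueError(f"unknown autotuning scenarios: {unknown}")
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_file,
        f"--region={region}",
        f"--cohort={cohort}",
        f"--autotuning-scenarios={','.join(scenarios)}",
    ]


if __name__ == "__main__":
    argv = build_pyspark_submit_command(
        main_file="gs://example-bucket/tpch_query1.py",  # hypothetical path
        region="us-central1",                            # example region
        cohort="TPCH-Query1",
        scenarios=["BHJ", "MEMORY", "SCALING"],
    )
    # Print the command for review; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Reusing the same cohort value on every scheduled run is what lets the second and subsequent submissions be autotuned.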
API
To enable Dataproc Serverless autotuning on each submission of a recurring batch workload, submit a batches.create request that includes the following fields:
- RuntimeConfig.cohort: the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, specify TPCH-Query1 as the cohort name for a scheduled workload that runs a daily TPC-H query.
- AutotuningConfig.scenarios: one or more autotuning scenarios to use to optimize the workload, for example, BHJ, MEMORY, and SCALING. You can change the scenario list with each batch cohort submission.
Example:
    ...
    runtimeConfig:
      cohort: TPCH-Query1
      autotuningConfig:
        scenarios:
        - BHJ
        - MEMORY
        - SCALING
    ...
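The request fields above can also be assembled programmatically. The following Python sketch builds a JSON body for a batches.create request; it is a minimal illustration, not a complete client. It assumes a PySpark batch, assumes that autotuningConfig is nested under runtimeConfig, and uses a hypothetical helper name and Cloud Storage path.

```python
import json


def build_batch_request(main_python_file_uri, cohort, scenarios):
    """Return a batches.create request body with autotuning fields set.

    Hypothetical helper for illustration. Assumes a PySpark batch and that
    autotuningConfig sits inside runtimeConfig.
    """
    return {
        "pysparkBatch": {"mainPythonFileUri": main_python_file_uri},
        "runtimeConfig": {
            "cohort": cohort,
            "autotuningConfig": {"scenarios": list(scenarios)},
        },
    }


if __name__ == "__main__":
    body = build_batch_request(
        main_python_file_uri="gs://example-bucket/tpch_query1.py",  # hypothetical
        cohort="TPCH-Query1",
        scenarios=["BHJ", "MEMORY", "SCALING"],
    )
    # Serialize the body; send it with an authenticated HTTP client of your choice.
    print(json.dumps(body, indent=2))
```

Keeping the cohort value constant across submissions while varying only the scenario list matches the guidance that the scenario list can change with each batch cohort submission.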
View autotuning changes
To view Dataproc Serverless autotuning changes to a batch workload, run the gcloud dataproc batches describe command.
Example: gcloud dataproc batches describe output is similar to the following:
    ...
    runtimeInfo:
      propertiesInfo:
        # Properties set by autotuning.
        autotuningProperties:
          spark.driver.memory:
            annotation: Driver OOM was detected
            value: 11520m
          spark.driver.memoryOverhead:
            annotation: Driver OOM was detected
            value: 4608m
        # Old overwritten properties.
        userProperties:
    ...
You can also view the most recent autotuning changes that were applied to a running, completed, or failed workload from the Batch details page in the Google Cloud console, under the Investigate tab.
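To inspect autotuning changes in a script, one option is to request the describe output as JSON (for example, with the gcloud --format=json flag) and read the autotuningProperties map. The following Python sketch assumes that the JSON output mirrors the structure of the example output on this page; the function name is hypothetical and the sample data is copied from that example.

```python
import json


def autotuned_properties(batch):
    """Map each autotuned Spark property to its (value, annotation) pair.

    Hypothetical helper; returns an empty dict when no autotuning
    properties are present (for example, on the first run of a cohort).
    """
    properties_info = batch.get("runtimeInfo", {}).get("propertiesInfo", {})
    tuned = properties_info.get("autotuningProperties", {})
    return {
        name: (info.get("value"), info.get("annotation"))
        for name, info in tuned.items()
    }


# Sample data mirroring the example describe output on this page.
SAMPLE_DESCRIBE_JSON = """
{
  "runtimeInfo": {
    "propertiesInfo": {
      "autotuningProperties": {
        "spark.driver.memory": {
          "annotation": "Driver OOM was detected",
          "value": "11520m"
        },
        "spark.driver.memoryOverhead": {
          "annotation": "Driver OOM was detected",
          "value": "4608m"
        }
      }
    }
  }
}
"""

if __name__ == "__main__":
    batch = json.loads(SAMPLE_DESCRIBE_JSON)
    for name, (value, annotation) in autotuned_properties(batch).items():
        print(f"{name} = {value}  ({annotation})")
```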