This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your analytics workloads in Google Cloud.
BigQuery
This section provides recommendations to help you optimize the performance of queries in BigQuery.
Optimize query design
Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that's passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the following documentation:
- Introduction to optimizing query performance
- Managing input data and data sources
- Optimizing communication between slots
- Optimize query computation
- Manage query outputs
- Avoiding SQL anti-patterns
Define and use materialized views efficiently
To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don't create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.
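For example, the following sketch defines one materialized view that pre-aggregates an orders table by store and day, so queries that group by store, by day, or by both can reuse it. The dataset, table, and column names are hypothetical, and the snippet assumes the google-cloud-bigquery Python client library.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset, table, and column names, for illustration only.
# A single materialized view serves several query patterns (by store, by day,
# or by both) instead of one view per query permutation.
ddl = """
CREATE MATERIALIZED VIEW my_dataset.daily_store_sales AS
SELECT
  store_id,
  TIMESTAMP_TRUNC(order_timestamp, DAY) AS order_day,
  COUNT(*) AS order_count,
  SUM(order_amount) AS total_amount
FROM my_dataset.orders
GROUP BY store_id, order_day
"""

client.query(ddl).result()  # run the DDL statement and wait for it to finish
```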
Improve JOIN performance
You can use materialized views to reduce the cost and latency of a query that performs aggregation on top of a JOIN.
Consider a case where you join a large fact table with a few small dimension tables, and then perform an aggregation on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table, with the foreign keys as grouping keys. Then, join the result with the dimension tables. Finally, perform a post-aggregation.
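As a sketch of this rewrite, with hypothetical table and column names, and assuming the google-cloud-bigquery Python client library:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical fact table (orders) and dimension table (stores).
# Aggregate the large fact table on its foreign key first, join the much
# smaller result with the dimension table, and then post-aggregate.
query = """
WITH fact_agg AS (
  SELECT
    store_id,                      -- foreign key into the dimension table
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_amount
  FROM my_dataset.orders           -- large fact table
  GROUP BY store_id
)
SELECT
  stores.region,
  SUM(fact_agg.order_count) AS region_orders,
  SUM(fact_agg.total_amount) AS region_amount
FROM fact_agg
JOIN my_dataset.stores AS stores   -- small dimension table
  USING (store_id)
GROUP BY stores.region
"""

for row in client.query(query).result():
    print(row.region, row.region_orders, row.region_amount)
```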
Dataflow
This section provides recommendations to help you optimize the performance of your Dataflow pipelines.
When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.
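For example, the following Apache Beam Python sketch sets a few common execution parameters, including the worker machine type. The project, region, bucket paths, and machine type are placeholder values; the options that matter depend on your pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, bucket, and machine type values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    machine_type="n2-standard-4",  # Compute Engine machine type for the worker VMs
    max_num_workers=20,            # upper bound for Horizontal Autoscaling
)

# A small word-count pipeline, used only to show where the options are applied.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "CountWords" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}\t{count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```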
After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, the following features of Dataflow help optimize the performance of the pipelines:
- Parallelization: Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing. For more information, see parallelization and distribution.
- Optimization: Dataflow uses your pipeline code to create an execution graph that represents PCollection objects and transforms in the pipeline. It then optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations. For more information, see Fusion optimization and Combine optimization.
- Automatic tuning: Dataflow dynamically optimizes jobs while they are running by using Horizontal Autoscaling, Vertical Autoscaling, and Dynamic Work Rebalancing.
You can monitor the performance of Dataflow pipelines by using the web-based monitoring interface or the Dataflow command-line interface in the gcloud CLI.
Dataproc
This section describes best practices to optimize the performance of your Dataproc clusters.
Autoscale clusters
To ensure that your Dataproc clusters deliver predictable performance, you can enable autoscaling. Dataproc uses Hadoop YARN memory metrics and an autoscaling policy that you define to automatically adjust the number of worker VMs in a cluster. For more information about how to use and configure autoscaling, see Autoscaling clusters.
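For illustration, the following sketch creates a basic autoscaling policy with the google-cloud-dataproc Python client library. The project, region, policy ID, and scaling values are hypothetical, and you still need to attach the policy to a cluster through its autoscaling configuration.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
policy_client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Hypothetical policy: scale primary and secondary workers based on YARN memory metrics.
policy = {
    "id": "example-scaling-policy",
    "worker_config": {"min_instances": 2, "max_instances": 20},
    "secondary_worker_config": {"max_instances": 50},
    "basic_algorithm": {
        "yarn_config": {
            "graceful_decommission_timeout": {"seconds": 3600},
            "scale_up_factor": 0.5,    # fraction of pending YARN memory to add capacity for
            "scale_down_factor": 0.5,  # fraction of available YARN memory to remove capacity for
        }
    },
}

policy_client.create_autoscaling_policy(
    parent=f"projects/my-project/locations/{region}", policy=policy
)
```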
Provision appropriate storage
Choose an appropriate storage option for your Dataproc cluster based on your performance and cost requirements:
- If you need a low-cost Hadoop-compatible file system (HCFS) that Hadoop and Spark jobs can read from and write to with minimal changes, use Cloud Storage. The data stored in Cloud Storage is persistent, and can be accessed by other Dataproc clusters and other products such as BigQuery.
- If you need a low-latency Hadoop Distributed File System (HDFS) for your Dataproc cluster, use Compute Engine persistent disks attached to the worker nodes. The data stored in HDFS storage is transient, and the storage cost is higher than the Cloud Storage option.
- To get the performance advantage of Compute Engine persistent disks and the cost and durability benefits of Cloud Storage, you can combine both of the storage options. For example, you can store your source and final datasets in Cloud Storage, and provision limited HDFS capacity for the intermediate datasets. When you decide on the size and type of the disks for HDFS storage, consider the recommendations in the Persistent disks and local SSDs section.
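To illustrate the combined approach in the last item, the following sketch creates a cluster whose source and final datasets stay in Cloud Storage while moderately sized worker boot disks provide limited HDFS capacity for intermediate data. It assumes the google-cloud-dataproc Python client library, and the project, region, machine types, and disk sizes are placeholder values.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Hypothetical cluster: source and final datasets live in Cloud Storage;
# pd-ssd boot disks on the workers hold only intermediate HDFS data.
cluster = {
    "cluster_name": "example-cluster",
    "config": {
        "config_bucket": "my-dataproc-staging-bucket",
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {
            "num_instances": 4,
            "machine_type_uri": "n2-standard-8",
            "disk_config": {
                "boot_disk_type": "pd-ssd",
                "boot_disk_size_gb": 500,  # sized for intermediate data only
            },
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to finish
```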
Reduce latency when using Cloud Storage
To reduce latency when you access data that's stored in Cloud Storage, we recommend the following:
- Create your Cloud Storage bucket in the same region as the Dataproc cluster.
- Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
- When using Spark SQL, consider creating Dataproc clusters with the latest versions of available images. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
- To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism, or the Hadoop parameter mapreduce.job.reduces, as shown in the sketch after this list.
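The following PySpark sketch shows where those Spark SQL parameters are set; the parallelism values and bucket paths are placeholders that you would tune for your data volume.

```python
from pyspark.sql import SparkSession

# Placeholder parallelism values and bucket paths; tune them for your data volume.
spark = (
    SparkSession.builder
    .appName("write-to-cloud-storage")
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.default.parallelism", "200")
    .getOrCreate()
)

df = spark.read.parquet("gs://my-bucket/source/")

# Repartitioning before the write controls the number of output files,
# which helps avoid writing many small files to Cloud Storage.
df.repartition(32).write.mode("overwrite").parquet("gs://my-bucket/output/")
```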
Monitor and adjust storage load and capacity
The persistent disks attached to the worker nodes in a Dataproc cluster hold shuffle data. To perform optimally, the worker nodes need sufficient disk space. If the nodes don't have sufficient disk space, the nodes are marked as UNHEALTHY in the YARN NodeManager log. If this issue occurs, either increase the disk size for the affected nodes, or run fewer jobs concurrently.
Enable EFM
When worker nodes are removed from a running Dataproc cluster, for example because of downscaling or preemption, shuffle data might be lost. To minimize job delays in such scenarios, we recommend that you enable Enhanced Flexibility Mode (EFM) for clusters that use preemptible VMs or that only autoscale the secondary worker group.
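As a sketch, the following snippet enables EFM primary-worker shuffle through a cluster property when creating a cluster with preemptible secondary workers, using the google-cloud-dataproc Python client library. The project, region, cluster name, and worker counts are hypothetical, and the exact property to set depends on your Dataproc image version and on whether the jobs use Spark or MapReduce.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Hypothetical cluster: preemptible secondary workers plus EFM primary-worker
# shuffle, so shuffle data is written to primary workers and survives preemption.
cluster = {
    "cluster_name": "example-efm-cluster",
    "config": {
        "software_config": {
            "properties": {"dataproc:efm.spark.shuffle": "primary-worker"}
        },
        "worker_config": {"num_instances": 4},
        "secondary_worker_config": {
            "num_instances": 10,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # wait for cluster creation to finish
```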
What's next
Review the best practices for optimizing the performance of your compute, storage, networking, and database resources:
- Optimize compute performance.
- Optimize storage performance.
- Optimize networking performance.
- Optimize database performance.