Optimize analytics performance

Last reviewed 2023-08-04 UTC

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your analytics workloads in Google Cloud.

BigQuery

This section provides recommendations to help you optimize the performance of queries in BigQuery.

Optimize query design

Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that's passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the BigQuery documentation, such as Introduction to optimizing query performance.
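
For example, before you run a query, you can use a dry run to check how many bytes the query reads when you select only the columns you need and filter on a partitioning column. The following minimal sketch uses the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Select only the columns you need and filter on the partitioning
    # column so that BigQuery prunes partitions and reads fewer bytes.
    # The table and columns are placeholders for your own schema.
    sql = """
        SELECT order_id, order_total
        FROM `my-project.sales.orders`
        WHERE order_date = '2023-08-01'
    """

    # A dry run reports the bytes the query would read, without running it.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    print(f"This query would process {job.total_bytes_processed} bytes.")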

Define and use materialized views efficiently

To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don't create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.
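
For example, a single materialized view that aggregates at a fine grain can serve several query patterns, such as per-store totals, per-day totals, and combined rollups. The following minimal sketch creates one such view with the google-cloud-bigquery Python client; the project, dataset, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # One materialized view at a daily, per-store grain can answer several
    # query patterns, so you don't need a separate view for every
    # permutation of a query. The schema is a placeholder.
    ddl = """
        CREATE MATERIALIZED VIEW `my-project.sales.daily_store_sales` AS
        SELECT store_id,
               DATE(order_timestamp) AS order_date,
               COUNT(*) AS order_count,
               SUM(order_total) AS total_sales
        FROM `my-project.sales.orders`
        GROUP BY store_id, order_date
    """
    client.query(ddl).result()  # Wait for the DDL statement to finish.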

Improve JOIN performance

You can use materialized views to reduce the cost and latency of a query that performs aggregation on top of a JOIN. Consider a case where you join a large fact table with a few small dimension tables, and then perform an aggregation on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys. Then, join the result with the dimension tables. Finally, perform a post-aggregation.
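
The following minimal sketch shows the original and rewritten forms of such a query against a hypothetical star schema; the project, dataset, table, and column names are placeholders:

    # Original pattern: join the large fact table with a dimension table,
    # then aggregate.
    original = """
        SELECT d.region, SUM(f.amount) AS total_amount
        FROM `my-project.sales.fact_orders` AS f
        JOIN `my-project.sales.dim_store` AS d
          ON f.store_id = d.store_id
        GROUP BY d.region
    """

    # Rewritten pattern: aggregate the fact table first, with the foreign
    # key as the grouping key, then join the much smaller result with the
    # dimension table and post-aggregate.
    rewritten = """
        SELECT d.region, SUM(a.amount_by_store) AS total_amount
        FROM (
            SELECT store_id, SUM(amount) AS amount_by_store
            FROM `my-project.sales.fact_orders`
            GROUP BY store_id
        ) AS a
        JOIN `my-project.sales.dim_store` AS d
          ON a.store_id = d.store_id
        GROUP BY d.region
    """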

Dataflow

This section provides recommendations to help you optimize the performance of your Dataflow pipelines.

When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.
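
For example, with the Apache Beam Python SDK you can set the worker machine type through pipeline options. In the following minimal sketch, the project, region, bucket paths, and machine type are hypothetical values:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Execution parameters, including the Compute Engine machine type for
    # the Dataflow worker VMs, are set through pipeline options.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        machine_type="n2-standard-4",  # Machine type for the worker VMs.
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
        )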

After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, Dataflow features like autoscaling and dynamic work rebalancing help optimize the performance of the pipelines.

You can monitor the performance of Dataflow pipelines by using the web-based monitoring interface or the gcloud CLI.
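
For example, the following minimal sketch shells out to the gcloud CLI from Python to list the active jobs in a region; the project and region are hypothetical:

    import subprocess

    # List the active Dataflow jobs in a region by using the gcloud CLI.
    result = subprocess.run(
        [
            "gcloud", "dataflow", "jobs", "list",
            "--project=my-project",
            "--region=us-central1",
            "--status=active",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)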

Dataproc

This section describes best practices to optimize the performance of your Dataproc clusters.

Autoscale clusters

To ensure that your Dataproc clusters deliver predictable performance, you can enable autoscaling. Dataproc uses Hadoop YARN memory metrics and an autoscaling policy that you define to automatically adjust the number of worker VMs in a cluster. For more information about how to use and configure autoscaling, see Autoscaling clusters.
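
The following minimal sketch creates an autoscaling policy with the google-cloud-dataproc Python client; the project, region, policy ID, and scaling values are hypothetical, and you attach the policy to a cluster in a separate step:

    from google.cloud import dataproc_v1
    from google.protobuf import duration_pb2

    # Use the regional endpoint that matches the cluster's region.
    client = dataproc_v1.AutoscalingPolicyServiceClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    policy = dataproc_v1.AutoscalingPolicy(
        id="my-autoscaling-policy",  # Hypothetical policy ID.
        basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
            yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
                # How long to let a node being removed finish its work.
                graceful_decommission_timeout=duration_pb2.Duration(seconds=600),
                scale_up_factor=0.5,
                scale_down_factor=0.5,
            )
        ),
        # Bounds for the number of primary worker VMs.
        worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
            min_instances=2,
            max_instances=10,
        ),
    )

    client.create_autoscaling_policy(
        parent="projects/my-project/regions/us-central1", policy=policy
    )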

Provision appropriate storage

Choose an appropriate storage option for your Dataproc cluster based on your performance and cost requirements:

  • If you need a low-cost Hadoop-compatible file system (HCFS) that Hadoop and Spark jobs can read from and write to with minimal changes, use Cloud Storage. The data stored in Cloud Storage is persistent, and can be accessed by other Dataproc clusters and other products such as BigQuery.
  • If you need a low-latency Hadoop Distributed File System (HDFS) for your Dataproc cluster, use Compute Engine persistent disks attached to the worker nodes. The data stored in HDFS storage is transient, and the storage cost is higher than the Cloud Storage option.
  • To get the performance advantage of Compute Engine persistent disks and the cost and durability benefits of Cloud Storage, you can combine both of the storage options. For example, you can store your source and final datasets in Cloud Storage, and provision limited HDFS capacity for the intermediate datasets. When you decide on the size and type of the disks for HDFS storage, consider the recommendations in the Persistent disks and local SSDs section.
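
The following minimal sketch creates such a mixed-storage cluster with the google-cloud-dataproc Python client: datasets stay in Cloud Storage, and each worker's persistent disk provides limited HDFS capacity for intermediate data. The project, region, names, machine types, and disk sizes are hypothetical:

    from google.cloud import dataproc_v1

    # Use the regional endpoint that matches the cluster's region.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "my-cluster",
        "config": {
            # Staging bucket; source and final datasets can also live in
            # Cloud Storage buckets.
            "config_bucket": "my-dataproc-bucket",
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "n2-standard-4",
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": "n2-standard-4",
                # Limited pd-ssd capacity on each worker backs HDFS for
                # intermediate datasets.
                "disk_config": {"boot_disk_type": "pd-ssd",
                                "boot_disk_size_gb": 500},
            },
        },
    }

    operation = client.create_cluster(
        project_id="my-project", region="us-central1", cluster=cluster
    )
    operation.result()  # Wait for cluster creation to complete.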

Reduce latency when using Cloud Storage

To reduce latency when you access data that's stored in Cloud Storage, we recommend the following:

  • Create your Cloud Storage bucket in the same region as the Dataproc cluster.
  • Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
  • When you use Spark SQL, consider creating Dataproc clusters with the latest available image versions. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
  • To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism or the Hadoop parameter mapreduce.job.reduces, as shown in the sketch after this list.
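
The following minimal sketch tunes those Spark parameters in a PySpark job; the partition counts and bucket paths are hypothetical, and you should size them to your data volume:

    from pyspark.sql import SparkSession

    # Lower the shuffle parallelism so that Spark SQL writes fewer, larger
    # output files to Cloud Storage instead of many small ones.
    spark = (
        SparkSession.builder
        .appName("fewer-output-files")
        .config("spark.sql.shuffle.partitions", "200")
        .config("spark.default.parallelism", "200")
        .getOrCreate()
    )

    df = spark.read.parquet("gs://my-bucket/input/")
    # coalesce() also reduces the number of files that a single write produces.
    df.coalesce(32).write.parquet("gs://my-bucket/output/")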

Monitor and adjust storage load and capacity

The persistent disks attached to the worker nodes in a Dataproc cluster hold shuffle data. To perform optimally, the worker nodes need sufficient disk space. If the nodes don't have sufficient disk space, the nodes are marked as UNHEALTHY in the YARN NodeManager log. If this issue occurs, either increase the disk size for the affected nodes, or run fewer jobs concurrently.

Enable EFM

When worker nodes are removed from a running Dataproc cluster, for example because of downscaling or preemption, shuffle data might be lost. To minimize job delays in such scenarios, we recommend that you enable Enhanced Flexibility Mode (EFM) for clusters that use preemptible VMs or that autoscale only the secondary worker group.
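
The following minimal sketch creates an EFM-enabled cluster by shelling out to the gcloud CLI; the cluster name, project, region, and worker counts are hypothetical. Setting the dataproc:efm.spark.shuffle property to primary-worker writes Spark shuffle data to primary workers, so that it survives the loss of preemptible secondary workers:

    import subprocess

    # Create a cluster with EFM enabled for Spark shuffle data.
    subprocess.run(
        [
            "gcloud", "dataproc", "clusters", "create", "my-efm-cluster",
            "--project=my-project",
            "--region=us-central1",
            "--properties=dataproc:efm.spark.shuffle=primary-worker",
            "--num-workers=2",
            "--num-secondary-workers=4",
            "--secondary-worker-type=preemptible",
        ],
        check=True,
    )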

What's next

Review the best practices for optimizing the performance of your compute, storage, networking, and database resources.