Optimize environment performance and costs

Cloud Composer 1 | Cloud Composer 2

This page explains how to tune your environment's scale and performance parameters to the needs of your project, so that you improve performance and reduce costs for resources that your environment does not use.

Other pages about scaling:

Optimization process overview

Making changes to the parameters of your environment can affect many aspects of your environment's performance. We recommend optimizing your environment in iterations:

  1. Start with environment presets.
  2. Run your DAGs.
  3. Observe your environment's performance.
  4. Adjust your environment scale and performance parameters, then repeat from the previous step.

Start with environment presets

When you create an environment in Google Cloud console, you can select one of three environment presets. These presets set the initial scale and performance configuration of your environment; after you create your environment, you can change all scale and performance parameters provided by a preset.

We recommend starting with one of the presets, based on the following estimates:

  • Total number of DAGs that you plan to deploy in the environment
  • Maximum number of concurrent DAG runs
  • Maximum number of concurrent tasks

Your environment's performance depends on the implementation of specific DAGs that you run in your environment. The following table lists estimates that are based on the average resource consumption. If you expect your DAGs to consume more resources, adjust the estimates accordingly.

| Recommended preset | Total DAGs | Max concurrent DAG runs | Max concurrent tasks |
|---|---|---|---|
| Small | 50 | 15 | 18 |
| Medium | 250 | 60 | 100 |
| Large | 1000 | 250 | 400 |

For example, suppose an environment must run 40 DAGs, all at the same time with one active task each. This environment would use a Medium preset, because the maximum numbers of concurrent DAG runs and concurrent tasks exceed the recommended estimates for the Small preset.

Run your DAGs

Once your environment is created, upload your DAGs to it. Run your DAGs and observe the environment's performance.

We recommend running your DAGs on a schedule that reflects the real-life application of your DAGs. For example, if you want to run multiple DAGs at the same time, make sure to check your environment's performance while all of these DAGs are running simultaneously.

Observe your environment's performance

This section focuses on the most common Cloud Composer 2 capacity and performance tuning aspects. We recommend following this guide step by step because the most common performance considerations are covered first.

Go to the Monitoring dashboard

You can monitor your environment's performance metrics on the Monitoring dashboard of your environment.

To go to the Monitoring dashboard for your environment:

  1. In the Google Cloud console, go to the Environments page.

    Go to Environments

  2. Click the name of your environment.

  3. Go to the Monitoring tab.

Monitor scheduler CPU and memory metrics

The Airflow schedulers' CPU and memory metrics help you check whether scheduler performance is a bottleneck in the overall Airflow performance.

Graphs for Airflow schedulers
Figure 1. Graphs for Airflow schedulers (click to enlarge)

On the Monitoring dashboard, in the Schedulers section, observe graphs for the Airflow schedulers of your environment:

  • Total schedulers CPU usage
  • Total schedulers memory usage

Adjust according to your observations:

Monitor the total parse time for all DAG files

The schedulers parse DAGs before scheduling DAG runs. If DAGs take a long time to parse, parsing consumes the schedulers' capacity and might reduce the performance of DAG runs.

Total DAG parse time graph
Figure 2. Graph for DAG parse time (click to enlarge)

On the Monitoring dashboard, in the DAG Runs section, observe graphs for the total DAG parse time.

If this number exceeds about 10 seconds, your schedulers might be overloaded with DAG parsing and cannot run DAGs effectively. The default DAG parsing frequency in Airflow is 30 seconds; if DAG parsing time exceeds this threshold, parsing cycles start to overlap, which then exhausts the schedulers' capacity.

According to your observations, you might want to:

Monitor worker pod evictions

Pod eviction can happen when a particular pod in your environment's cluster reaches its resource limits.

Worker pod evictions graph
Figure 3. Graph that displays worker pod evictions (click to enlarge)

If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow.

The majority of issues with worker pod evictions happen because of out-of-memory situations in workers.

On the Monitoring dashboard, in the Workers section, observe the Worker Pods evictions graphs for your environment.

The Total workers memory usage graph shows a total perspective of the environment. A single worker can still exceed the memory limit, even if the memory utilization is healthy at the environment level.

According to your observations, you might want to:

Monitor active workers

The number of workers in your environment automatically scales in response to the queued tasks.

Active workers and queued tasks graphs
Figure 4. Active workers and queued tasks graphs (click to enlarge)

On the Monitoring dashboard, in the Workers section, observe graphs for the number of active workers and the number of tasks in the queue:

  • Active workers
  • Running and queued tasks

Adjust according to your observations:

  • If the environment frequently reaches its maximum limit for workers, and at the same time the number of tasks in the queue is continuously high, you might want to increase the maximum number of workers.
  • If there are long inter-task scheduling delays, but at the same time the environment does not scale up to its maximum number of workers, then there is likely an Airflow setting that throttles execution and prevents Cloud Composer mechanisms from scaling the environment. Because Cloud Composer 2 environments scale based on the size of the Airflow queue, configure Airflow to not throttle tasks on their way into the queue:

    • Increase worker concurrency. Worker concurrency must be set to a value that is higher than the expected maximum number of concurrent tasks divided by the maximum number of workers in the environment. For example, if you expect at most 60 concurrent tasks and at most 6 workers, set worker concurrency to a value higher than 10.
    • Increase DAG concurrency, if a single DAG is running a large number of tasks in parallel, which can lead to reaching the maximum number of running task instances per DAG.
    • Increase max active runs per DAG, if you run the same DAG multiple times in parallel, which can lead to Airflow throttling the execution because the max active runs per DAG limit is reached.

Monitor worker CPU and memory usage

Monitor the total CPU and memory usage aggregated across all workers in your environment to determine whether Airflow workers use the resources of your environment efficiently.

Workers CPU and memory graphs
Figure 5. Workers CPU and memory graphs (click to enlarge)

On the Monitoring dashboard, in the Workers section, observe graphs for the CPU and memory usage by Airflow workers:

  • Total workers CPU usage
  • Total workers memory usage

These graphs represent aggregated resource usage; individual workers might still reach their capacity limits, even if the aggregate view shows spare capacity.

Adjust according to your observations:

Monitor running and queued tasks

You can monitor the number of queued and running tasks to check the efficiency of the scheduling process.

Graph that displays running and queued tasks
Figure 6. Graph that displays running and queued tasks (click to enlarge)

On the Monitoring dashboard, in the Workers section, observe the Running and queued tasks graph for your environment.

Tasks in the queue are waiting to be executed by workers. If your environment has queued tasks, this might mean that workers in your environment are busy executing other tasks.

Some queuing is always present in an environment, especially during processing peaks. However, if you observe a high number of queued tasks, or a growing trend in the graph, then this might indicate that workers do not have enough capacity to process the tasks, or that Airflow is throttling task execution.

A high number of queued tasks is typically observed when the number of running tasks also reaches the maximum level.

To address both problems:

Monitor the database CPU and memory usage

Airflow database performance issues can lead to overall DAG execution issues. Database disk usage is typically not a cause for concern because the storage is automatically extended as needed.

Database CPU and memory graphs
Figure 7. Database CPU and memory graphs (click to enlarge)

On the Monitoring dashboard, in the SQL database section, observe graphs for the CPU and memory usage by the Airflow database:

  • Database CPU usage
  • Database memory usage

If the database CPU usage exceeds 80% for more than a few percent of the total time, the database is overloaded and requires scaling.

Database size settings are controlled by the environment size property of your environment. To scale the database up or down, change the environment size to a different tier (Small, Medium, or Large). Increasing the environment size increases the costs of your environment.

Monitor the task scheduling latency

If the latency between tasks exceeds the expected levels (for example, 20 seconds or more), then this might indicate that the environment cannot handle the load of tasks generated by DAG runs.

Task latency graph (Airflow UI)
Figure 8. Task latency graph, Airflow UI (click to enlarge)

You can view the task scheduling latency graph in the Airflow UI of your environment.

In this example, the delays (2.5 and 3.5 seconds) are well within the acceptable limits, but significantly higher latencies might indicate that:

Monitor web server CPU and memory

The Airflow web server performance affects the Airflow UI. It is not common for the web server to be overloaded. If this happens, the Airflow UI performance might deteriorate, but this does not affect the performance of DAG runs.

Web server CPU and memory graphs
Figure 9. Web server CPU and memory graphs (click to enlarge)

On the Monitoring dashboard, in the Web server section, observe graphs for the Airflow web server:

  • Web server CPU usage
  • Web server memory usage

Based on your observations:

Adjust your environment's scale and performance parameters

Change the number of schedulers

Adjusting the number of schedulers improves the scheduler capacity and resilience of Airflow scheduling.

Examples:

Console

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

gcloud

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

gcloud composer environments update example-environment \
    --scheduler-count=2

Terraform

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        count = 2
      }
    }
  }
}

Change scheduler CPU and memory

The CPU and memory parameters apply to each scheduler in your environment. For example, if your environment has two schedulers, the total capacity is twice the specified CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit the CPU or memory attribute, if required.

gcloud composer environments update example-environment \
  --scheduler-cpu=0.5 \
  --scheduler-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit the CPU or memory attribute, if required.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        cpu = "0.5"
        memory_gb = "3.75"
      }
    }
  }
}

Change the maximum number of workers

Increasing the maximum number of workers allows your environment to automatically scale to a higher number of workers, if needed.

Decreasing the maximum number of workers reduces the maximum capacity of the environment but might also be helpful to reduce the environment costs.

Examples:

Console

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

gcloud

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

gcloud composer environments update example-environment \
    --max-workers=6

Terraform

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        max_count = "6"
      }
    }
  }
}

Change worker CPU and memory

  • Decreasing worker memory can be helpful when the worker usage graph indicates very low memory utilization.

  • Increasing worker memory allows workers to handle more tasks concurrently or handle memory-intensive tasks. It might address the problem of worker pod evictions.

  • Decreasing worker CPU can be helpful when the worker CPU usage graph indicates that the CPU resources are highly overallocated.

  • Increasing worker CPU allows workers to handle more tasks concurrently and in some cases reduce the time it takes to process these tasks.

Changing worker CPU or memory restarts workers, which might affect running tasks. We recommend making these changes when no DAGs are running.

The CPU and memory parameters apply to each worker in your environment. For example, if your environment has four workers, the total capacity is four times the specified CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit the CPU or memory attribute, if required.

gcloud composer environments update example-environment \
  --worker-memory=3.75 \
  --worker-cpu=2

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit the CPU or memory parameter, if required.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change web server CPU and memory

Decreasing the web server CPU or memory can be helpful when the web server usage graph indicates that it is continuously underutilized.

Changing web server parameters restarts the web server, which causes temporary web server downtime. We recommend making these changes outside of regular usage hours.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit the CPU or memory attribute, if required.

gcloud composer environments update example-environment \
    --web-server-cpu=2 \
    --web-server-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit the CPU or memory attribute, if required.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      web_server {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change the environment size

Changing the environment size modifies the capacity of Cloud Composer backend components, such as the Airflow database and the Airflow queue.

  • Consider changing the environment size to a smaller size (for example, Large to Medium, or Medium to Small) when Database usage metrics show substantial underutilization.
  • Consider increasing the environment size if you observe high usage of the Airflow database.

Console

Follow the steps in Adjust the environment size to set the environment size.

gcloud

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

gcloud composer environments update example-environment \
    --environment-size=medium

Terraform

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    environment_size = "medium"
  }
}

Change the DAG directory listing interval

Increasing the DAG directory listing interval reduces the scheduler load associated with discovery of new DAGs in the environment's bucket.

  • Consider increasing this interval if you deploy new DAGs infrequently.
  • Consider decreasing this interval if you want Airflow to react faster to newly deployed DAG files.

To change this parameter, override the following Airflow configuration option:

| Section | Key | Value | Notes |
|---|---|---|---|
| scheduler | dag_dir_list_interval | New value for the listing interval | The default value, in seconds, is 300. |
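As a sketch, this override can also be applied with gcloud, following the pattern of the other gcloud examples on this page. The environment name and the 600-second value are illustrative:

```shell
# Override scheduler.dag_dir_list_interval to 600 seconds.
# gcloud expects Airflow overrides in the section-key format.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-dag_dir_list_interval=600
```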

Change the DAG file parsing interval

Increasing the DAG file parsing interval reduces the scheduler load associated with the continuous parsing of DAGs in the DAG bag.

Consider increasing this interval when you have a high number of DAGs that do not change too often, or observe a high scheduler load in general.

To change this parameter, override the following Airflow configuration option:

| Section | Key | Value | Notes |
|---|---|---|---|
| scheduler | min_file_process_interval | New value for the DAG parsing interval | The default value, in seconds, is 30. |
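For example, a gcloud command similar to the following sketch raises the parsing interval; the 120-second value and environment name are illustrative:

```shell
# Parse each DAG file at most once every 120 seconds.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-min_file_process_interval=120
```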

Change worker concurrency

Changing this parameter adjusts the number of tasks that a single worker can execute at the same time.

For example, a worker with 0.5 CPU can typically handle 6 concurrent tasks; an environment with three such workers can handle up to 18 concurrent tasks.

  • Increase this parameter when there are tasks waiting in the queue, and your workers use a low percentage of their CPUs and memory at the same time.

  • Decrease this parameter when you are getting pod evictions; this would reduce the number of tasks that a single worker attempts to process. As an alternative, you can increase worker memory.

The default value for worker concurrency is equal to 12 * worker_CPU, where worker_CPU is the number of CPUs allocated to a single worker. For example, if workers in your environment use 0.5 CPU each, then default worker concurrency is 6. This value does not depend on the number of workers in your environment.

To change this parameter, override the following Airflow configuration option:

| Section | Key | Value | Notes |
|---|---|---|---|
| celery | worker_concurrency | New value for worker concurrency | The default value is equal to 12 * worker_CPU for your environment. |
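As a hedged example, the override can be applied with gcloud as follows; the value of 24 is illustrative and must match the expected load and each worker's CPU and memory:

```shell
# Let each worker run up to 24 tasks at the same time.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=24
```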

Change DAG concurrency

DAG concurrency defines the maximum number of task instances allowed to run concurrently in each DAG. Increase it when your DAGs run a high number of concurrent tasks. If this setting is low, the scheduler delays putting more tasks into the queue, which also reduces the efficiency of environment autoscaling.

To change this parameter, override the following Airflow configuration option:

| Section | Key | Value | Notes |
|---|---|---|---|
| core | max_active_tasks_per_dag | New value for DAG concurrency | The default value is 100. |
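For example, a gcloud command along the following lines raises the limit; the value of 250 is illustrative:

```shell
# Allow up to 250 concurrent task instances per DAG.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_tasks_per_dag=250
```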

Increase max active runs per DAG

This attribute defines the maximum number of active DAG runs per DAG. When the same DAG must be run multiple times concurrently, for example, with different input arguments, this attribute allows the scheduler to start such runs in parallel.

To change this parameter, override the following Airflow configuration option:

| Section | Key | Value | Notes |
|---|---|---|---|
| core | max_active_runs_per_dag | New value for max active runs per DAG | The default value is 25. |
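As a sketch, this attribute can be overridden with gcloud as follows; the value of 50 is illustrative:

```shell
# Allow up to 50 active runs of the same DAG in parallel.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_runs_per_dag=50
```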

What's next