Optimize environment performance and costs

This page explains how to tune your environment's scale and performance parameters to the needs of your project, so that you get improved performance and reduce costs for resources that are not utilized by your environment.

Optimization process overview

Making changes to your environment's parameters can affect many aspects of its performance. We recommend optimizing your environment in iterations:

  1. Start with environment presets.
  2. Run your DAGs.
  3. Observe your environment's performance.
  4. Adjust your environment scale and performance parameters, then repeat from the previous step.

Start with environment presets

When you create an environment in Google Cloud console, you can select one of three environment presets. These presets set the initial scale and performance configuration of your environment; after you create your environment, you can change all scale and performance parameters provided by a preset.

We recommend starting with one of the presets, based on the following estimates:

  • Total number of DAGs that you plan to deploy in the environment
  • Maximum number of concurrent DAG runs
  • Maximum number of concurrent tasks

Your environment's performance depends on the implementation of specific DAGs that you run in your environment. The following table lists estimates that are based on the average resource consumption. If you expect your DAGs to consume more resources, adjust the estimates accordingly.

Recommended preset | Total DAGs | Max concurrent DAG runs | Max concurrent tasks
Small              | 50         | 15                      | 18
Medium             | 250        | 60                      | 100
Large              | 1000       | 250                     | 400

For example, consider an environment that must run 40 DAGs, all at the same time with one active task each. This environment should use the Medium preset, because the maximum numbers of concurrent DAG runs and tasks exceed the recommended estimates for the Small preset.
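
If you create the environment with gcloud instead of the Google Cloud console, you can get a similar starting point by setting the matching environment size. The following is a minimal sketch; the location and image version are placeholders to replace with your own values, and the console presets also tune scheduler and worker parameters that you can adjust separately:

# Placeholder location and image version; the Medium preset roughly
# corresponds to the medium environment size.
gcloud composer environments create example-environment \
    --location=us-central1 \
    --image-version=composer-2-airflow-2 \
    --environment-size=medium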

Run your DAGs

Once your environment is created, upload your DAGs to it. Run your DAGs and observe the environment's performance.

We recommend running your DAGs on a schedule that reflects their real-life application. For example, if you want to run multiple DAGs at the same time, check your environment's performance while all of these DAGs run simultaneously.

Observe your environment's performance

This section focuses on the most common Cloud Composer 2 capacity and performance tuning aspects. We recommend following this guide step by step because the most common performance considerations are covered first.

Go to the Monitoring dashboard

You can monitor your environment's performance metrics on the Monitoring dashboard of your environment.

To go to the Monitoring dashboard for your environment:

  1. In the Google Cloud console, go to the Environments page.

  2. Click the name of your environment.

  3. Go to the Monitoring tab.

Monitor scheduler CPU and memory metrics

The Airflow schedulers' CPU and memory metrics help you check whether scheduler performance is a bottleneck in overall Airflow performance.

Figure 1. Graphs for Airflow schedulers

On the Monitoring dashboard, in the Schedulers section, observe graphs for the Airflow schedulers of your environment:

  • Total schedulers CPU usage
  • Total schedulers memory usage

Adjust according to your observations:

Monitor the total parse time for all DAG files

The schedulers parse DAGs before scheduling DAG runs. If DAGs take a long time to parse, this consumes the schedulers' capacity and might reduce the performance of DAG runs.

Figure 2. Graph for the total DAG parse time

On the Monitoring dashboard, in the DAG Statistics section, observe graphs for the total DAG parse time.

If the total parse time exceeds about 10 seconds, your schedulers might be overloaded with DAG parsing and cannot run DAGs effectively. The default DAG parsing frequency in Airflow is 30 seconds; if the DAG parsing time exceeds this threshold, parsing cycles start to overlap, which then exhausts the schedulers' capacity.
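
To check which DAG files take the longest to parse, you can run the Airflow dags report command through gcloud. This is a sketch that assumes an Airflow 2 environment; replace the location with your own:

# Prints DagBag parsing statistics, including per-file parse duration.
gcloud composer environments run example-environment \
    --location=us-central1 \
    dags report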

According to your observations, you might want to:

Monitor worker pod evictions

Pod eviction can happen when a particular pod in your environment's cluster reaches its resource limits.

Figure 3. Graph that displays worker pod evictions

If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow.

The majority of issues with worker pod evictions happen because of out-of-memory situations in workers.

On the Monitoring dashboard, in the Workers section, observe the Worker Pods evictions graphs for your environment.

The Total workers memory usage graph shows memory usage aggregated across the environment. A single worker can still exceed its memory limit, even if memory utilization looks healthy at the environment level.

According to your observations, you might want to:

Monitor active workers

The number of workers in your environment automatically scales in response to the queued tasks.

Figure 4. Active workers and queued tasks graphs

On the Monitoring dashboard, in the Workers section, observe graphs for the number of active workers and the number of tasks in the queue:

  • Active workers
  • Airflow tasks

Adjust according to your observations:

  • If the environment frequently reaches its maximum limit for workers, and at the same time the number of tasks in the Celery queue is continuously high, you might want to increase the maximum number of workers.
  • If there are long inter-task scheduling delays, but at the same time the environment does not scale up to its maximum number of workers, then there is likely an Airflow setting that throttles the execution and prevents Cloud Composer mechanisms from scaling the environment. Because Cloud Composer 2 environments scale based on the number of tasks in the Celery queue, configure Airflow to not throttle tasks on the way into the queue (see the example command after this list):

    • Increase worker concurrency. Worker concurrency must be set to a value that is higher than the expected maximum number of concurrent tasks, divided by the maximum number of workers in the environment.
    • Increase DAG concurrency, if a single DAG is running a large number of tasks in parallel, which can lead to reaching the maximum number of running task instances per DAG.
    • Increase max active runs per DAG, if you run the same DAG multiple times in parallel, which can lead to Airflow throttling the execution because the max active runs per DAG limit is reached.
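
The following command is a sketch that applies all three overrides in a single update; the values are hypothetical and depend on your expected workload:

# Hypothetical values; pick numbers that match your expected peak load.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=36,core-max_active_tasks_per_dag=100,core-max_active_runs_per_dag=50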

Monitor workers CPU and memory usage

Monitor the total CPU and memory usage aggregated across all workers in your environment to determine if Airflow workers utilize the resources of your environment properly.

Figure 5. Workers CPU and memory graphs

On the Monitoring dashboard, in the Workers section, observe graphs for the CPU and memory usage by Airflow workers:

  • Total workers CPU usage
  • Total workers memory usage

These graphs represent aggregated resource usage; individual workers might still reach their capacity limits, even if the aggregate view shows spare capacity.

Adjust according to your observations:

Monitor running and queued tasks

You can monitor the number of queued and running tasks to check the efficiency of the scheduling process.

Figure 6. Graph that displays running and queued tasks

On the Monitoring dashboard, in the Workers section, observe the Airflow tasks graph for your environment.

Tasks in the queue are waiting to be executed by workers. If your environment has queued tasks, this might mean that workers in your environment are busy executing other tasks.

Some queuing is always present in an environment, especially during processing peaks. However, if you observe a high number of queued tasks, or a growing trend in the graph, then this might indicate that workers do not have enough capacity to process the tasks, or that Airflow is throttling task execution.

A high number of queued tasks is typically observed when the number of running tasks also reaches the maximum level.

To address both problems:

Monitor the database CPU and memory usage

Airflow database performance issues can lead to overall DAG execution issues. Database disk usage is typically not a cause for concern because the storage is automatically extended as needed.

Figure 7. Database CPU and memory graphs

On the Monitoring dashboard, in the SQL database section, observe graphs for the CPU and memory usage by the Airflow database:

  • Database CPU usage
  • Database memory usage

If the database CPU usage exceeds 80% for more than a few percent of the total time, the database is overloaded and requires scaling.

Database size settings are controlled by the environment size property of your environment. To scale the database up or down, change the environment size to a different tier (Small, Medium, or Large). Increasing the environment size increases the costs of your environment.

Monitor the task scheduling latency

If the latency between tasks exceeds the expected levels (for example, 20 seconds or more), then this might indicate that the environment cannot handle the load of tasks generated by DAG runs.

Figure 8. Task latency graph in the Airflow UI

You can view the task scheduling latency graph in the Airflow UI of your environment.

In this example, the delays (2.5 and 3.5 seconds) are well within the acceptable limits, but significantly higher latencies might indicate that:

Monitor web server CPU and memory

The Airflow web server's performance affects the Airflow UI. It is not common for the web server to be overloaded, but if this happens, the Airflow UI performance might deteriorate. This does not affect the performance of DAG runs.

Figure 9. Web server CPU and memory graphs

On the Monitoring dashboard, in the Web server section, observe graphs for the Airflow web server:

  • Web server CPU usage
  • Web server memory usage

Based on your observations:

Adjust your environment's scale and performance parameters

Change the number of schedulers

Adjusting the number of schedulers improves scheduler capacity and the resilience of Airflow scheduling.

Increasing the number of schedulers also increases traffic to and from the Airflow database. We recommend using two Airflow schedulers in most scenarios; more than two schedulers are needed only in rare cases that call for special considerations.

If you need faster scheduling:

Examples:

Console

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

gcloud

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

gcloud composer environments update example-environment \
    --scheduler-count=2

Terraform

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        count = 2
      }
    }
  }
}

Change scheduler CPU and memory

The CPU and memory parameters are for each scheduler in your environment. For example, if your environment has two schedulers, the total capacity is twice the specified amount of CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
  --scheduler-cpu=0.5 \
  --scheduler-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        cpu = "0.5"
        memory_gb = "3.75"
      }
    }
  }
}

Change the maximum number of workers

Increasing the maximum number of workers allows your environment to automatically scale to a higher number of workers, if needed.

Decreasing the maximum number of workers reduces the maximum capacity of the environment but might also be helpful to reduce the environment costs.

Examples:

Console

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

gcloud

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

gcloud composer environments update example-environment \
    --max-workers=6

Terraform

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        max_count = 6
      }
    }
  }
}

Change worker CPU and memory

  • Decreasing worker memory can be helpful when the worker usage graph indicates very low memory utilization.

  • Increasing worker memory allows workers to handle more tasks concurrently or handle memory-intensive tasks. It might address the problem of worker pod evictions.

  • Decreasing worker CPU can be helpful when the worker CPU usage graph indicates that the CPU resources are highly overallocated.

  • Increasing worker CPU allows workers to handle more tasks concurrently and in some cases reduce the time it takes to process these tasks.

Changing worker CPU or memory restarts workers, which might affect running tasks. We recommend making these changes when no DAGs are running.

The CPU and memory parameters are for each worker in your environment. For example, if your environment has four workers, the total capacity is four times the specified amount of CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
  --worker-memory=3.75 \
  --worker-cpu=2

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change web server CPU and memory

Decreasing the web server CPU or memory can be helpful when the web server usage graph indicates that it is continuously underutilized.

Changing web server parameters restarts the web server, which causes temporary web server downtime. We recommend making these changes outside of regular usage hours.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
    --web-server-cpu=2 \
    --web-server-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      web_server {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change the environment size

Changing the environment size modifies the capacity of Cloud Composer backend components, such as the Airflow database and the Airflow queue.

  • Consider changing the environment size to a smaller size (for example, Large to Medium, or Medium to Small) when Database usage metrics show substantial underutilization.
  • Consider increasing the environment size if you observe high usage of the Airflow database.

Console

Follow the steps in Adjust the environment size to set the environment size.

gcloud

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

gcloud composer environments update example-environment \
    --environment-size=medium

Terraform

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    environment_size = "medium"
  }
}

Change the DAG directory listing interval

Increasing the DAG directory listing interval reduces the scheduler load associated with discovery of new DAGs in the environment's bucket.

  • Consider increasing this interval if you deploy new DAGs infrequently.
  • Consider decreasing this interval if you want Airflow to react faster to newly deployed DAG files.

To change this parameter, override the following Airflow configuration option:

Section   | Key                   | Value                              | Notes
scheduler | dag_dir_list_interval | New value for the listing interval | The default value, in seconds, is 120.
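
For example, the following command raises the listing interval to 300 seconds; the value is illustrative and should reflect how often you deploy new DAG files:

# Illustrative value; increase it if you rarely deploy new DAGs.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-dag_dir_list_interval=300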

Change the DAG file parsing interval

Increasing the DAG file parsing interval reduces the scheduler load associated with the continuous parsing of DAGs in the DAG bag.

Consider increasing this interval when you have a high number of DAGs that do not change too often, or observe a high scheduler load in general.

To change this parameter, override the following Airflow configuration option:

Section   | Key                       | Value                                  | Notes
scheduler | min_file_process_interval | New value for the DAG parsing interval | The default value, in seconds, is 30.
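
For example, the following command raises the parsing interval to 120 seconds; the value is illustrative and should reflect how often your DAG definitions change:

# Illustrative value; increase it if your DAGs rarely change.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-min_file_process_interval=120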

Worker concurrency

Concurrency performance and your environment's ability to autoscale are connected to two settings:

  • The minimum number of Airflow workers
  • The [celery]worker_concurrency parameter

The default values provided by Cloud Composer are optimal for the majority of use cases, but your environment might benefit from custom adjustments.

Worker concurrency performance considerations

The [celery]worker_concurrency parameter defines the number of tasks a single worker can pick up from the task queue. Task execution speed depends on multiple factors, such as worker CPU, memory, and the type of work itself.

Worker autoscaling

Cloud Composer monitors the task queue and spawns additional workers to pick up any waiting tasks. Setting [celery]worker_concurrency to a high value means that every worker can pick up a lot of tasks, so under certain circumstances the queue might never fill up, causing autoscaling to never trigger.

For example, in a Cloud Composer environment with two Airflow workers, [celery]worker_concurrency set to 100, and 200 tasks in the queue, each worker would pick up 100 tasks. This leaves the queue empty and doesn't trigger autoscaling. If these tasks take a long time to complete, this could lead to performance issues.

But if the tasks are small and quick to execute, a high value in the [celery]worker_concurrency setting could lead to overeager scaling. For example, if that environment has 300 tasks in the queue, Cloud Composer begins to create new workers. But if the first 200 tasks finish execution by the time the new workers are ready, the existing workers can pick up the remaining tasks. The end result is that autoscaling creates new workers, but there are no tasks for them.

Adjusting [celery]worker_concurrency for special cases should be based on your peak task execution times and queue lengths:

  • For tasks that take longer to complete, workers shouldn't be able to empty the queue completely.
  • For quicker, smaller tasks, increase the minimum number of Airflow workers to avoid overeager scaling.

Synchronization of task logs

Airflow workers feature a component that synchronizes task execution logs to Cloud Storage buckets. A high number of concurrent tasks performed by a single worker leads to a high number of synchronization requests. This can overload your worker and lead to performance issues.

If you observe performance issues caused by a high volume of log synchronization traffic, lower the [celery]worker_concurrency value and instead adjust the minimum number of Airflow workers.
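
As a sketch, the following single command lowers worker concurrency and raises the minimum number of workers; both values are hypothetical and depend on your workload:

# Hypothetical values; tune them to your workload.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=16 \
    --min-workers=2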

Change worker concurrency

Changing this parameter adjusts the number of tasks that a single worker can execute at the same time.

For example, a worker with 0.5 CPU can typically handle 6 concurrent tasks; an environment with three such workers can handle up to 18 concurrent tasks.

  • Increase this parameter when there are tasks waiting in the queue, and your workers use a low percentage of their CPUs and memory at the same time.

  • Decrease this parameter when you are getting pod evictions; this would reduce the number of tasks that a single worker attempts to process. As an alternative, you can increase worker memory.

The default value for worker concurrency is equal to:

  • In Airflow 2.6.3 and later versions, the minimum of 32, 12 * worker_CPU, and 6 * worker_memory.
  • In Airflow versions before 2.6.3, the minimum of 32, 12 * worker_CPU, and 8 * worker_memory.
  • In Airflow versions before 2.3.3, 12 * worker_CPU.

The worker_CPU value is the number of CPUs allocated to a single worker. The worker_memory value is the amount of memory allocated to a single worker. For example, if workers in your environment use 0.5 CPU and 4 GB of memory each, then the worker concurrency is set to 6. The worker concurrency value does not depend on the number of workers in your environment.

To change this parameter, override the following Airflow configuration option:

Section | Key                | Value
celery  | worker_concurrency | New value for worker concurrency
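
For example, the following command sets worker concurrency to 12; the value is illustrative:

# Illustrative value; see the guidance above for choosing a number.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=12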

Change DAG concurrency

DAG concurrency defines the maximum number of task instances allowed to run concurrently in each DAG. Increase it when your DAGs run a high number of concurrent tasks. If this setting is low, the scheduler delays putting more tasks into the queue, which also reduces the efficiency of environment autoscaling.

To change this parameter, override the following Airflow configuration option:

Section | Key                      | Value                         | Notes
core    | max_active_tasks_per_dag | New value for DAG concurrency | The default value is 16.
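
For example, the following command raises DAG concurrency to 32; the value is illustrative:

# Illustrative value; match it to the parallelism of your DAGs.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_tasks_per_dag=32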

Increase max active runs per DAG

This attribute defines the maximum number of active DAG runs per DAG. When the same DAG must be run multiple times concurrently, for example, with different input arguments, this attribute allows the scheduler to start such runs in parallel.

To change this parameter, override the following Airflow configuration option:

Section | Key                     | Value                                 | Notes
core    | max_active_runs_per_dag | New value for max active runs per DAG | The default value is 25.
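
For example, the following command raises the limit to 50 active runs per DAG; the value is illustrative:

# Illustrative value; match it to how many parallel runs you expect.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_runs_per_dag=50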

What's next