Optimize environment performance and costs

This page explains how to tune your environment's scale and performance parameters to the needs of your project, so that you get improved performance and reduce costs for resources that are not utilized by your environment.

Optimization process overview

Making changes to your environment's parameters can affect many aspects of its performance. We recommend optimizing your environment in iterations:

  1. Start with environment presets.
  2. Run your DAGs.
  3. Observe your environment's performance.
  4. Adjust your environment scale and performance parameters, then repeat from the previous step.

Start with environment presets

When you create an environment in Google Cloud console, you can select one of three environment presets. These presets set the initial scale and performance configuration of your environment; after you create your environment, you can change all scale and performance parameters provided by a preset.

We recommend starting with one of the presets, based on the following estimates:

  • Total number of DAGs that you plan to deploy in the environment
  • Maximum number of concurrent DAG runs
  • Maximum number of concurrent tasks

Your environment's performance depends on the implementation of specific DAGs that you run in your environment. The following table lists estimates that are based on the average resource consumption. If you expect your DAGs to consume more resources, adjust the estimates accordingly.

Recommended preset | Total DAGs | Max concurrent DAG runs | Max concurrent tasks
Small              | 50         | 15                      | 18
Medium             | 250        | 60                      | 100
Large              | 1000       | 250                     | 400

For example, consider an environment that must run 40 DAGs, all at the same time with one active task each. This environment should use the Medium preset, because the maximum numbers of concurrent DAG runs and tasks exceed the recommended estimates for the Small preset.
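
If you create the environment with gcloud instead of the Google Cloud console, you can get a similar starting point by setting the matching environment size. The following is a minimal sketch; the location and image version are placeholders to replace with your own values, and the console presets also tune scheduler and worker parameters that you can adjust separately:

# Placeholder location and image version; the Medium preset roughly
# corresponds to the medium environment size.
gcloud composer environments create example-environment \
    --location=us-central1 \
    --image-version=composer-2-airflow-2 \
    --environment-size=medium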

Run your DAGs

Once your environment is created, upload your DAGs to it. Run your DAGs and observe the environment's performance.

We recommend running your DAGs on a schedule that reflects their real-life application. For example, if you want to run multiple DAGs at the same time, check your environment's performance while all of these DAGs run simultaneously.

Observe your environment's performance

This section focuses on the most common Cloud Composer 2 capacity and performance tuning aspects. We recommend following this guide step by step because the most common performance considerations are covered first.

Go to the Monitoring dashboard

You can monitor your environment's performance metrics on the Monitoring dashboard of your environment.

To go to the Monitoring dashboard for your environment:

  1. In the Google Cloud console, go to the Environments page.

  2. Click the name of your environment.

  3. Go to the Monitoring tab.

Monitor scheduler CPU and memory metrics

The Airflow schedulers' CPU and memory metrics help you check whether scheduler performance is a bottleneck in overall Airflow performance.

Figure 1. Graphs for Airflow schedulers

On the Monitoring dashboard, in the Schedulers section, observe graphs for the Airflow schedulers of your environment:

  • Total schedulers CPU usage
  • Total schedulers memory usage

Adjust according to your observations:

Monitor the total parse time for all DAG files

The schedulers parse DAGs before scheduling DAG runs. If DAGs take a long time to parse, this consumes the schedulers' capacity and might reduce the performance of DAG runs.

Figure 2. Graph for the total DAG parse time

On the Monitoring dashboard, in the DAG Statistics section, observe graphs for the total DAG parse time.

If the total parse time exceeds about 10 seconds, your schedulers might be overloaded with DAG parsing and cannot run DAGs effectively. The default DAG parsing frequency in Airflow is 30 seconds; if the DAG parsing time exceeds this threshold, parsing cycles start to overlap, which then exhausts the schedulers' capacity.
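
To check which DAG files take the longest to parse, you can run the Airflow dags report command through gcloud. This is a sketch that assumes an Airflow 2 environment; replace the location with your own:

# Prints DagBag parsing statistics, including per-file parse duration.
gcloud composer environments run example-environment \
    --location=us-central1 \
    dags report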

According to your observations, you might want to:

Monitor worker pod evictions

Pod eviction can happen when a particular pod in your environment's cluster reaches its resource limits.

Figure 3. Graph that displays worker pod evictions

If an Airflow worker pod is evicted, all task instances running on that pod are interrupted, and later marked as failed by Airflow.

The majority of issues with worker pod evictions happen because of out-of-memory situations in workers.

On the Monitoring dashboard, in the Workers section, observe the Worker Pods evictions graphs for your environment.

The Total workers memory usage graph shows memory usage aggregated across the environment. A single worker can still exceed its memory limit, even if memory utilization looks healthy at the environment level.

According to your observations, you might want to:

Monitor active workers

The number of workers in your environment automatically scales in response to the queued tasks.

Figure 4. Active workers and queued tasks graphs

On the Monitoring dashboard, in the Workers section, observe graphs for the number of active workers and the number of tasks in the queue:

  • Active workers
  • Airflow tasks

Adjust according to your observations:

  • If the environment frequently reaches its maximum limit for workers, and at the same time the number of tasks in the Celery queue is continuously high, you might want to increase the maximum number of workers.
  • If there are long inter-task scheduling delays, but at the same time the environment does not scale up to its maximum number of workers, then there is likely an Airflow setting that throttles the execution and prevents Cloud Composer mechanisms from scaling the environment. Because Cloud Composer 2 environments scale based on the number of tasks in the Celery queue, configure Airflow to not throttle tasks on the way into the queue (see the example command after this list):

    • Increase worker concurrency. Worker concurrency must be set to a value that is higher than the expected maximum number of concurrent tasks, divided by the maximum number of workers in the environment.
    • Increase DAG concurrency, if a single DAG is running a large number of tasks in parallel, which can lead to reaching the maximum number of running task instances per DAG.
    • Increase max active runs per DAG, if you run the same DAG multiple times in parallel, which can lead to Airflow throttling the execution because the max active runs per DAG limit is reached.
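
The following command is a sketch that applies all three overrides in a single update; the values are hypothetical and depend on your expected workload:

# Hypothetical values; pick numbers that match your expected peak load.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=36,core-max_active_tasks_per_dag=100,core-max_active_runs_per_dag=50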

Monitor workers CPU and memory usage

Monitor the total CPU and memory usage aggregated across all workers in your environment to determine if Airflow workers utilize the resources of your environment properly.

Figure 5. Workers CPU and memory graphs

On the Monitoring dashboard, in the Workers section, observe graphs for the CPU and memory usage by Airflow workers:

  • Total workers CPU usage
  • Total workers memory usage

These graphs represent aggregated resource usage; individual workers might still reach their capacity limits, even if the aggregate view shows spare capacity.

Adjust according to your observations:

Monitor running and queued tasks

You can monitor the number of queued and running tasks to check the efficiency of the scheduling process.

Figure 6. Graph that displays running and queued tasks

On the Monitoring dashboard, in the Workers section, observe the Airflow tasks graph for your environment.

Tasks in the queue are waiting to be executed by workers. If your environment has queued tasks, this might mean that workers in your environment are busy executing other tasks.

Some queuing is always present in an environment, especially during processing peaks. However, if you observe a high number of queued tasks, or a growing trend in the graph, then this might indicate that workers do not have enough capacity to process the tasks, or that Airflow is throttling task execution.

A high number of queued tasks is typically observed when the number of running tasks also reaches the maximum level.

To address both problems:

Monitor the database CPU and memory usage

Airflow database performance issues can lead to overall DAG execution issues. Database disk usage is typically not a cause for concern because the storage is automatically extended as needed.

Figure 7. Database CPU and memory graphs

On the Monitoring dashboard, in the SQL database section, observe graphs for the CPU and memory usage by the Airflow database:

  • Database CPU usage
  • Database memory usage

If the database CPU usage exceeds 80% for more than a few percent of the total time, the database is overloaded and requires scaling.

Database size settings are controlled by the environment size property of your environment. To scale the database up or down, change the environment size to a different tier (Small, Medium, or Large). Increasing the environment size increases the costs of your environment.

Monitor the task scheduling latency

If the latency between tasks exceeds the expected levels (for example, 20 seconds or more), then this might indicate that the environment cannot handle the load of tasks generated by DAG runs.

Figure 8. Task latency graph in the Airflow UI

You can view the task scheduling latency graph in the Airflow UI of your environment.

In this example, the delays (2.5 and 3.5 seconds) are well within the acceptable limits, but significantly higher latencies might indicate that:

Monitor web server CPU and memory

The Airflow web server's performance affects the Airflow UI. It is not common for the web server to be overloaded, but if this happens, the Airflow UI performance might deteriorate. This does not affect the performance of DAG runs.

Figure 9. Web server CPU and memory graphs

On the Monitoring dashboard, in the Web server section, observe graphs for the Airflow web server:

  • Web server CPU usage
  • Web server memory usage

Based on your observations:

Adjust your environment's scale and performance parameters

Change the number of schedulers

Adjusting the number of schedulers improves scheduler capacity and the resilience of Airflow scheduling.

Increasing the number of schedulers also increases traffic to and from the Airflow database. We recommend using two Airflow schedulers in most scenarios; more than two schedulers are needed only in rare cases that call for special considerations.

If you need faster scheduling:

Examples:

Console

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

gcloud

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

gcloud composer environments update example-environment \
    --scheduler-count=2

Terraform

Follow the steps in Adjust the number of schedulers to set the required number of schedulers for your environment.

The following example sets the number of schedulers to two:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        count = 2
      }
    }
  }
}

Change scheduler CPU and memory

The CPU and memory parameters are for each scheduler in your environment. For example, if your environment has two schedulers, the total capacity is twice the specified amount of CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
  --scheduler-cpu=0.5 \
  --scheduler-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for schedulers.

The following example changes the CPU and memory for schedulers. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      scheduler {
        cpu = "0.5"
        memory_gb = "3.75"
      }
    }
  }
}

Change the maximum number of workers

Increasing the maximum number of workers allows your environment to automatically scale to a higher number of workers, if needed.

Decreasing the maximum number of workers reduces the maximum capacity of the environment but might also be helpful to reduce the environment costs.

Examples:

Console

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

gcloud

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

gcloud composer environments update example-environment \
    --max-workers=6

Terraform

Follow the steps in Adjust the minimum and maximum number of workers to set the required maximum number of workers for your environment.

The following example sets the maximum number of workers to six:

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        max_count = 6
      }
    }
  }
}

Change worker CPU and memory

  • Decreasing worker memory can be helpful when the worker usage graph indicates very low memory utilization.

  • Increasing worker memory allows workers to handle more tasks concurrently or handle memory-intensive tasks. It might address the problem of worker pod evictions.

  • Decreasing worker CPU can be helpful when the worker CPU usage graph indicates that the CPU resources are highly overallocated.

  • Increasing worker CPU allows workers to handle more tasks concurrently and in some cases reduce the time it takes to process these tasks.

Changing worker CPU or memory restarts workers, which might affect running tasks. We recommend making these changes when no DAGs are running.

The CPU and memory parameters are for each worker in your environment. For example, if your environment has four workers, the total capacity is four times the specified amount of CPU and memory.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
  --worker-memory=3.75 \
  --worker-cpu=2

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for workers.

The following example changes the CPU and memory for workers. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      worker {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change web server CPU and memory

Decreasing the web server CPU or memory can be helpful when the web server usage graph indicates that it is continuously underutilized.

Changing web server parameters restarts the web server, which causes temporary web server downtime. We recommend making these changes outside of regular usage hours.

Console

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

gcloud

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set the CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit either the CPU or the memory attribute, as needed.

gcloud composer environments update example-environment \
    --web-server-cpu=2 \
    --web-server-memory=3.75

Terraform

Follow the steps in Adjust worker, scheduler, and web server scale and performance parameters to set CPU and memory for the web server.

The following example changes the CPU and memory for the web server. You can omit either the CPU or the memory attribute, as needed.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    workloads_config {
      web_server {
        cpu = "2"
        memory_gb = "3.75"
      }
    }
  }
}

Change the environment size

Changing the environment size modifies the capacity of Cloud Composer backend components, such as the Airflow database and the Airflow queue.

  • Consider changing the environment size to a smaller size (for example, Large to Medium, or Medium to Small) when Database usage metrics show substantial underutilization.
  • Consider increasing the environment size if you observe high usage of the Airflow database.

Console

Follow the steps in Adjust the environment size to set the environment size.

gcloud

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

gcloud composer environments update example-environment \
    --environment-size=medium

Terraform

Follow the steps in Adjust the environment size to set the environment size.

The following example changes the size of the environment to Medium.

resource "google_composer_environment" "example-environment" {

  # Other environment parameters

  config {
    environment_size = "medium"
  }
}

Change the DAG directory listing interval

Increasing the DAG directory listing interval reduces the scheduler load associated with discovery of new DAGs in the environment's bucket.

  • Consider increasing this interval if you deploy new DAGs infrequently.
  • Consider decreasing this interval if you want Airflow to react faster to newly deployed DAG files.

To change this parameter, override the following Airflow configuration option:

Section   | Key                   | Value                              | Notes
scheduler | dag_dir_list_interval | New value for the listing interval | The default value, in seconds, is 120.
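
For example, the following command raises the listing interval to 300 seconds; the value is illustrative and should reflect how often you deploy new DAG files:

# Illustrative value; increase it if you rarely deploy new DAGs.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-dag_dir_list_interval=300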

Change the DAG file parsing interval

Increasing the DAG file parsing interval reduces the scheduler load associated with the continuous parsing of DAGs in the DAG bag.

Consider increasing this interval when you have a high number of DAGs that do not change too often, or observe a high scheduler load in general.

To change this parameter, override the following Airflow configuration option:

Section   | Key                       | Value                                  | Notes
scheduler | min_file_process_interval | New value for the DAG parsing interval | The default value, in seconds, is 30.
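
For example, the following command raises the parsing interval to 120 seconds; the value is illustrative and should reflect how often your DAG definitions change:

# Illustrative value; increase it if your DAGs rarely change.
gcloud composer environments update example-environment \
    --update-airflow-configs=scheduler-min_file_process_interval=120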

Worker concurrency

Concurrency performance and your environment's ability to autoscale are connected to two settings:

  • The minimum number of Airflow workers
  • The [celery]worker_concurrency parameter

The default values provided by Cloud Composer are optimal for the majority of use cases, but your environment might benefit from custom adjustments.

Worker concurrency performance considerations

The [celery]worker_concurrency parameter defines the number of tasks a single worker can pick up from the task queue. Task execution speed depends on multiple factors, such as worker CPU, memory, and the type of work itself.

Worker autoscaling

Cloud Composer monitors the task queue and spawns additional workers to pick up any waiting tasks. Setting [celery]worker_concurrency to a high value means that every worker can pick up a lot of tasks, so under certain circumstances the queue might never fill up, causing autoscaling to never trigger.

For example, in a Cloud Composer environment with two Airflow workers, [celery]worker_concurrency set to 100, and 200 tasks in the queue, each worker would pick up 100 tasks. This leaves the queue empty and doesn't trigger autoscaling. If these tasks take a long time to complete, this could lead to performance issues.

But if the tasks are small and quick to execute, a high value in the [celery]worker_concurrency setting could lead to overeager scaling. For example, if that environment has 300 tasks in the queue, Cloud Composer begins to create new workers. But if the first 200 tasks finish execution by the time the new workers are ready, the existing workers can pick up the remaining tasks. The end result is that autoscaling creates new workers, but there are no tasks for them.

Adjusting [celery]worker_concurrency for special cases should be based on your peak task execution times and queue lengths:

  • For tasks that take longer to complete, workers shouldn't be able to empty the queue completely.
  • For quicker, smaller tasks, increase the minimum number of Airflow workers to avoid overeager scaling.

Synchronization of task logs

Airflow workers feature a component that synchronizes task execution logs to Cloud Storage buckets. A high number of concurrent tasks performed by a single worker leads to a high number of synchronization requests. This can overload your worker and lead to performance issues.

If you observe performance issues caused by a high volume of log synchronization traffic, lower the [celery]worker_concurrency value and instead adjust the minimum number of Airflow workers.
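
As a sketch, the following single command lowers worker concurrency and raises the minimum number of workers; both values are hypothetical and depend on your workload:

# Hypothetical values; tune them to your workload.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=16 \
    --min-workers=2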

Change worker concurrency

Changing this parameter adjusts the number of tasks that a single worker can execute at the same time.

For example, a worker with 0.5 CPU can typically handle 6 concurrent tasks; an environment with three such workers can handle up to 18 concurrent tasks.

  • Increase this parameter when there are tasks waiting in the queue, and your workers use a low percentage of their CPUs and memory at the same time.

  • Decrease this parameter when you are getting pod evictions; this would reduce the number of tasks that a single worker attempts to process. As an alternative, you can increase worker memory.

The default value for worker concurrency is equal to:

  • In Airflow 2.6.3 and later versions, the minimum of 32, 12 * worker_CPU, and 6 * worker_memory.
  • In Airflow versions before 2.6.3, the minimum of 32, 12 * worker_CPU, and 8 * worker_memory.
  • In Airflow versions before 2.3.3, 12 * worker_CPU.

The worker_CPU value is the number of CPUs allocated to a single worker. The worker_memory value is the amount of memory allocated to a single worker. For example, if workers in your environment use 0.5 CPU and 4 GB of memory each, then the worker concurrency is set to 6. The worker concurrency value does not depend on the number of workers in your environment.

To change this parameter, override the following Airflow configuration option:

Section | Key                | Value
celery  | worker_concurrency | New value for worker concurrency
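
For example, the following command sets worker concurrency to 12; the value is illustrative:

# Illustrative value; see the guidance above for choosing a number.
gcloud composer environments update example-environment \
    --update-airflow-configs=celery-worker_concurrency=12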

Change DAG concurrency

DAG concurrency defines the maximum number of task instances allowed to run concurrently in each DAG. Increase it when your DAGs run a high number of concurrent tasks. If this setting is low, the scheduler delays putting more tasks into the queue, which also reduces the efficiency of environment autoscaling.

To change this parameter, override the following Airflow configuration option:

Section | Key                      | Value                         | Notes
core    | max_active_tasks_per_dag | New value for DAG concurrency | The default value is 16.
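
For example, the following command raises DAG concurrency to 32; the value is illustrative:

# Illustrative value; match it to the parallelism of your DAGs.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_tasks_per_dag=32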

Increase max active runs per DAG

This attribute defines the maximum number of active DAG runs per DAG. When the same DAG must be run multiple times concurrently, for example, with different input arguments, this attribute allows the scheduler to start such runs in parallel.

To change this parameter, override the following Airflow configuration option:

Section | Key                     | Value                                 | Notes
core    | max_active_runs_per_dag | New value for max active runs per DAG | The default value is 25.
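
For example, the following command raises the limit to 50 active runs per DAG; the value is illustrative:

# Illustrative value; match it to how many parallel runs you expect.
gcloud composer environments update example-environment \
    --update-airflow-configs=core-max_active_runs_per_dag=50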

What's next