Pipeline performance

This page describes the pipeline settings that you can adjust and the impact they have on performance.

The performance of a pipeline depends on the following:

  • Size and characteristics of your data
  • Structure of your pipeline
  • Cluster sizing
  • Plugins that your Cloud Data Fusion pipeline uses

Cluster sizing

Master nodes use resources proportional to the number of pipelines or additional applications that are running on the cluster. If you're running pipelines on ephemeral clusters, use 2 CPUs and 8 GB of memory for the master nodes. If you're using persistent clusters, you might need larger master nodes to keep up with the workflow. To understand whether you need larger master nodes, monitor memory and CPU usage on the node.

We recommend sizing your worker nodes with at least 2 CPUs and 8 GB of memory. If you've configured your pipelines to use larger amounts of memory, then you must use larger workers.
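
If you provision a static cluster yourself, the following is a minimal sketch of applying that guidance with the google-cloud-dataproc Python client. The project, region, cluster name, node count, and machine types are illustrative assumptions; e2-standard-4 machines provide 4 vCPUs and 16 GB of memory, and e2-standard-2 machines provide 2 vCPUs and 8 GB.

```python
from google.cloud import dataproc_v1 as dataproc


def create_static_cluster(project_id: str, region: str, cluster_name: str):
    """Create a Dataproc cluster sized along the lines of the guidance above."""
    client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # A persistent cluster may need a larger master node:
            # e2-standard-4 = 4 vCPUs, 16 GB of memory.
            "master_config": {"num_instances": 1, "machine_type_uri": "e2-standard-4"},
            # Workers need at least 2 vCPUs and 8 GB; e2-standard-2 meets that minimum.
            "worker_config": {"num_instances": 4, "machine_type_uri": "e2-standard-2"},
        },
    }
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    return operation.result()  # Blocks until the cluster is running.
```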

To minimize execution time, ensure that your cluster has enough nodes to allow for as much parallel processing as possible.
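
As a rough way to reason about parallelism, the following sketch estimates how many tasks a cluster can run at once from the worker count, the vCPUs per worker, and the CPUs assigned to each executor. The helper name and numbers are hypothetical, and the result is an upper bound because YARN and cluster daemons reserve some resources.

```python
def parallel_task_slots(worker_nodes: int, vcpus_per_worker: int, cpus_per_executor: int) -> int:
    """Upper bound on the number of tasks the cluster can run at the same time.

    Each executor runs one task per CPU assigned to it, so the bound is the
    number of executors that fit on the workers times the CPUs per executor.
    """
    executors_per_worker = vcpus_per_worker // cpus_per_executor
    return worker_nodes * executors_per_worker * cpus_per_executor


# Example: 4 workers with 4 vCPUs each and 2 CPUs per executor -> up to 16 parallel tasks.
print(parallel_task_slots(worker_nodes=4, vcpus_per_worker=4, cpus_per_executor=2))
```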

Learn more about cluster sizing.

Resources

Pipelines let you specify the number of CPUs and amount of memory to give to the Spark driver and to each Spark executor. Because the driver coordinates the pipeline rather than processing the data itself, the default of 1 CPU and 2 GB of memory is enough to run most pipelines. You might need to increase the memory for pipelines that contain many stages or large schemas. The number of CPUs assigned to an executor determines the number of tasks the executor can run in parallel.
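
These settings correspond to standard Spark resource properties. The following sketch shows roughly how the same values look when expressed directly against Spark; in Cloud Data Fusion you set them in the pipeline's resource configuration rather than in code, and the executor values shown are illustrative.

```python
from pyspark import SparkConf

# Roughly the Spark properties that the pipeline's resource settings control.
conf = (
    SparkConf()
    .set("spark.driver.cores", "1")      # Default of 1 CPU for the driver...
    .set("spark.driver.memory", "2g")    # ...and 2 GB of memory is enough for most pipelines.
    .set("spark.executor.cores", "2")    # Each executor can then run 2 tasks in parallel.
    .set("spark.executor.memory", "4g")  # Illustrative; raise it for memory-heavy pipelines.
)
```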

Learn more about resources.

Execution engine tuning

In Cloud Data Fusion versions 6.4 and later, the execution engine is automatically configured for the best performance on ephemeral Dataproc clusters. For static Dataproc clusters, configure the execution engine yourself.
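
On static clusters, engine settings are typically supplied as custom Spark properties in the pipeline's engine configuration. The following sketch only illustrates the shape of such a configuration; the specific keys and values are assumptions for the example, not recommendations, and the right properties depend on your workload.

```python
# Illustrative custom Spark properties for a pipeline's engine configuration.
# The keys and values are assumptions for the example, not recommended settings.
engine_config = {
    "spark.sql.shuffle.partitions": "200",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.dynamicAllocation.enabled": "true",
}
```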

Learn more

To learn more about the concepts introduced on this page, see the CDAP data pipeline performance tuning guide.