Pipeline performance

Pipeline performance depends on the size and characteristics of your data, the structure of your pipeline, cluster sizing, and the plugins your Cloud Data Fusion pipeline is using. This page describes the pipeline settings that you can adjust and the impact they have on performance.

Cluster sizing

Master nodes use resources proportional to the number of pipelines or additional applications that are running on the cluster. If you're running pipelines on ephemeral clusters, use 2 CPUs and 8 GB of memory for the master nodes. If you're using persistent clusters, you might need larger master nodes to keep up with the workflow. Monitor memory and CPU usage on the node to determine whether you need larger master nodes. We recommend sizing your worker nodes with at least 2 CPUs and 8 GB of memory. Use larger workers if your pipelines are configured to use larger amounts of memory.
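
For example, the following sketch shows how this sizing guidance might map to a Dataproc cluster definition created with the google-cloud-dataproc Python client. The project, region, cluster name, and machine types are placeholder assumptions, not values required by Cloud Data Fusion; the same sizing is normally set through a Cloud Data Fusion compute profile.

    # Hypothetical cluster definition: e2-standard-2 provides 2 vCPUs and 8 GB
    # of memory, matching the master and minimum worker sizing described above.
    from google.cloud import dataproc_v1

    project_id = "my-project"                    # placeholder
    region = "us-central1"                       # placeholder
    cluster_name = "data-fusion-static-cluster"  # placeholder

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            # Master node: 2 CPUs and 8 GB of memory.
            "master_config": {"num_instances": 1, "machine_type_uri": "e2-standard-2"},
            # Worker nodes: at least 2 CPUs and 8 GB of memory each.
            "worker_config": {"num_instances": 4, "machine_type_uri": "e2-standard-2"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    operation.result()  # blocks until the cluster is ready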

To minimize execution time, ensure that your cluster has enough nodes to allow for as much parallel processing as possible.
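
As a rough back-of-the-envelope check (all numbers below are assumptions), Spark runs one task per executor CPU, so the total number of executor CPUs across the worker nodes caps how many tasks can run at the same time:

    # Assumed cluster and pipeline settings, for illustration only.
    worker_nodes = 4        # workers in the cluster
    cpus_per_worker = 4     # vCPUs available for executors on each worker
    cpus_per_executor = 2   # executor CPUs configured in the pipeline

    executors = worker_nodes * (cpus_per_worker // cpus_per_executor)
    max_parallel_tasks = executors * cpus_per_executor
    print(executors, max_parallel_tasks)  # 8 executors, up to 16 tasks at a time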

Learn more about cluster sizing.

Resources

Pipelines let you specify the number of CPUs and the amount of memory to give the Spark driver and each Spark executor. Because the driver doesn't do much work, its default of 1 CPU and 2 GB of memory is generally enough to run most pipelines. You might need to increase the memory for pipelines that contain many stages or large schemas. The number of CPUs assigned to an executor determines the number of tasks the executor can run in parallel.
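
These resource settings correspond to standard Spark properties. The values below are illustrative assumptions, shown as a plain mapping; in Cloud Data Fusion you normally set them through the pipeline's resource configuration rather than editing Spark properties directly.

    # Illustrative values only; not defaults beyond those noted in the text above.
    engine_resources = {
        "spark.driver.cores": "1",      # driver default is usually sufficient
        "spark.driver.memory": "2g",    # increase for many stages or large schemas
        "spark.executor.cores": "2",    # number of tasks an executor runs in parallel
        "spark.executor.memory": "4g",  # memory available to each executor
    }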

Learn more about resources.

Execution engine tuning

In Cloud Data Fusion versions 6.4 and later, the execution engine is automatically configured for the best performance on ephemeral Dataproc clusters. For static Dataproc clusters, configure the execution engine yourself.
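
For a static cluster, one way to attach Spark engine settings is through the cluster's software configuration when the cluster is created. The following sketch extends the earlier google-cloud-dataproc example; the specific properties and values are assumptions for illustration, not recommended settings.

    # The "spark:" prefix tells Dataproc that a property belongs to the Spark engine.
    cluster_config = {
        "master_config": {"num_instances": 1, "machine_type_uri": "e2-standard-2"},
        "worker_config": {"num_instances": 4, "machine_type_uri": "e2-standard-4"},
        "software_config": {
            "properties": {
                "spark:spark.executor.cores": "2",    # assumed value
                "spark:spark.executor.memory": "4g",  # assumed value
            }
        },
    }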

Learn more

To learn more about the concepts introduced here, see the CDAP data pipeline performance tuning guide.