Pipeline performance

This document provides a brief overview of how pipelines are executed. You can learn more about the concepts introduced here by clicking the links to the CDAP documentation site, provided in the sections below.

Pipeline performance depends on the size and characteristics of your data, the structure of your pipeline, and the plugins your pipeline is using. By learning about pipeline execution, you will understand what pipeline settings you can adjust and the performance impact they will have.

Parallel processing

Pipelines are executed on clusters of machines. The work a pipeline needs to do is split up and run in parallel across the cluster. In general, the greater the number of splits (also called partitions), the faster the pipeline can run. The level of parallelism in your pipeline is determined by the sources and shuffle stages in the pipeline.
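
For example, here is a minimal PySpark sketch of how partitions map to parallel work. This is not pipeline-specific configuration; the input path is a placeholder:

```python
# Minimal PySpark sketch of how partitions map to parallel work.
# The bucket path is a placeholder, not a real input.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()
df = spark.read.parquet("gs://example-bucket/input")  # hypothetical input

# Each partition becomes one task; tasks run in parallel across executor cores.
print(df.rdd.getNumPartitions())

# More partitions allow more of the cluster to work at once, at the cost of
# a shuffle to redistribute the records.
df = df.repartition(200)
```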

Pipelines are executed on Hadoop clusters, each of which consists of one to three master nodes and multiple worker nodes. Master nodes coordinate the work that needs to be done, and worker nodes perform the work. Hadoop YARN is used as the work coordination system. YARN runs a Resource Manager service on the master node(s) and a Node Manager service on each worker node.

Learn more about parallel processing.

Pipeline structure

Pipelines present a logical view of your business logic. A single pipeline breaks down into one or more Spark jobs. Each job reads data, processes it, and then writes the data out. A common misconception is that each pipeline stage is processed completely before moving on to the next one. In fact, multiple pipeline stages are grouped together into jobs based on the structure of the pipeline.
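
As an illustration, the two transformations in this hedged PySpark sketch run together in a single Spark stage even though they look like separate pipeline stages; only the aggregation forces a shuffle boundary. Column names and paths are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structure-demo").getOrCreate()
df = spark.read.parquet("gs://example-bucket/sales")  # hypothetical input

# These two steps look like separate pipeline stages, but Spark pipelines
# them together: each record flows through both in a single pass.
cleaned = (df.filter(F.col("amount") > 0)
             .withColumn("amount_usd", F.col("amount") * 0.85))

# The aggregation introduces a shuffle, so it starts a new Spark stage.
totals = cleaned.groupBy("country").agg(F.sum("amount_usd").alias("total"))
totals.write.parquet("gs://example-bucket/totals")  # writing triggers the job
```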

Learn more about pipeline structure.

Joins

Joins are often the most expensive part of a pipeline. Just like other parts of a pipeline, joins are executed in parallel. The first step of a join is shuffling the data so that every record with the same join key is sent to the same executor. After all the data is shuffled, it is joined, and the output is sent to the rest of the pipeline.
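
A sketch of what that looks like in PySpark; the input paths and the join key are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
orders = spark.read.parquet("gs://example-bucket/orders")  # hypothetical inputs
users = spark.read.parquet("gs://example-bucket/users")

# Spark shuffles both inputs so that all records sharing a "user_id" land on
# the same executor, then matches them there.
joined = orders.join(users, on="user_id", how="inner")
```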

When the join keys are fairly evenly distributed, joins perform well because they can be executed in parallel. Like any shuffle, data skew negatively impacts performance. If one join key appears far more frequently than the others, the executor responsible for that key does more work than the other executors and becomes a bottleneck. If you notice that a join is skewed, there are two ways to improve performance: in-memory joins and key distribution.
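
Continuing the orders/users sketch above, here is roughly how each technique looks in PySpark. The salt factor and the assumption that the users table fits in memory are illustrative:

```python
from pyspark.sql import functions as F

# In-memory (broadcast) join: copy the small input to every executor so the
# large input never needs to be shuffled. Viable only if "users" fits in memory.
joined = orders.join(F.broadcast(users), on="user_id")

# Key distribution ("salting"): split a hot key into N sub-keys so its records
# are spread over N executors instead of piling up on one.
N = 10  # assumed salt factor
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("int"))
replicated_users = users.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt"))
joined = salted_orders.join(replicated_users, on=["user_id", "salt"])
```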

Learn more about joins.

Resources

Pipelines allow you to specify the number of CPUs and the amount of memory to give the Spark driver and each Spark executor. Because the driver doesn't do much work, its default of 1 CPU and 2 GB of memory is generally enough to run most pipelines. You may need to increase the memory for pipelines that contain many stages or large schemas. The number of CPUs assigned to an executor determines the number of tasks the executor can run in parallel.
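
If you were setting the equivalent Spark properties directly, the defaults described above would look something like the sketch below. The values are illustrative; in a pipeline you would normally set them through the pipeline's resource settings rather than in code:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("resources-demo")
    .config("spark.driver.cores", "1")     # the driver mostly coordinates, so 1 CPU...
    .config("spark.driver.memory", "2g")   # ...and 2 GB are usually enough
    .config("spark.executor.cores", "2")   # each executor can run 2 tasks at once
    .config("spark.executor.memory", "4g")
    .getOrCreate())
```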

Learn more about resources.

Cluster sizing

Master nodes use resources proportional to the number of pipelines or additional applications that are running on the cluster. If you're running pipelines on ephemeral clusters, use 2 CPUs and 8 GB of memory for the master nodes. If you're using persistent clusters, you may need larger master nodes to keep up with the workload. You can monitor memory and CPU usage on the node to understand whether you need larger master nodes. We recommend sizing your worker nodes with at least 2 CPUs and 8 GB of memory. You will need to use larger workers if you have configured your pipelines to use larger amounts of memory.

Disk space is important for some pipelines. If your pipeline doesn't contain any shuffles, disk is used only when Spark runs out of memory and needs to spill data to disk. In these cases, disk size and type don't have a big impact on pipeline performance. However, if your pipeline shuffles a lot of data, disk size makes a big difference on performance. If you're using Dataproc, use disk sizes of at least 1 TB, as disk performance scales up with disk size.

To minimize execution time, ensure that your cluster has enough nodes to allow for as much parallel processing as possible.
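
As a rough sizing check, you can compare the number of tasks your data produces with the number of tasks the cluster can run at once. All numbers below are assumed for illustration:

```python
# Back-of-the-envelope parallelism check with illustrative numbers.
worker_nodes = 10
cores_per_worker = 8
total_cores = worker_nodes * cores_per_worker  # 80 tasks can run at once

partitions = 200                               # tasks produced by the data
waves = -(-partitions // total_cores)          # ceiling division
print(waves)  # 3 "waves" of tasks; a smaller cluster would need more waves
```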

Learn more about cluster sizing.