Google Cloud

Writing Dataflow pipelines with scalability in mind

Dataflow usually removes the need for you to think too much about how to make a pipeline scale. A lot of work has gone into sophisticated algorithms that can automatically parallelize and tune your pipeline across many machines. However, as with any such system, there are some anti-patterns that can bottleneck your pipeline at scale.

In this blog post we will go over three of these anti-patterns and discuss how to address them. It’s assumed that you are already familiar with the Dataflow programming model. If not, we recommend beginning with our Getting Started guide and Tyler Akidau’s Streaming 101 and Streaming 102 blog posts. You might also read the Dataflow model paper published in VLDB 2015.

Today we’re going to talk about scaling your pipeline — or more specifically, why your pipeline might not scale. When we say scalability, we mean the ability of the pipeline to operate efficiently as input size increases and key distribution changes. The scenario: you’ve written a cool new Dataflow pipeline, one that the high-level operations we provide made easy to write. You’ve tested this pipeline locally on your machine using DirectPipelineRunner, and everything looks fine. You’ve even tried deploying it on a small number of Compute Engine VMs, and things still look rosy. Then you try to scale up to a larger data volume, and the picture becomes decidedly worse. For a batch pipeline, it takes far longer than expected for the pipeline to complete. For a streaming pipeline, the lag reported in the Dataflow UI keeps increasing as the pipeline falls further and further behind. We’re going to describe some reasons why this might happen, and how to address them.

Expensive Per-Record Operations

One common problem we see is pipelines that perform needlessly expensive or slow operations for each record processed. Technically this isn’t a hard scaling bottleneck — given enough resources, Dataflow can still distribute this pipeline on enough machines to make it perform well. However, when running over many millions or billions of records, the cost of these per-record operations adds up to an unexpectedly large number. Usually these problems aren’t noticeable at all at lower scale.

Here’s an example of one such operation, taken from a real Dataflow pipeline.

  import javax.json.Json;
  import javax.json.JsonObject;
  import javax.json.JsonReader;
  import java.io.StringReader;

  PCollection<String> output = input.apply(ParDo.of(new DoFn<String, String>() {
    @Override
    public void processElement(ProcessContext c) {
      JsonReader reader = Json.createReader(new StringReader(c.element()));
      JsonObject entry = reader.readObject();
      // Perform some processing on entry.
    }
  }));

At first glance it’s not obvious that anything is wrong with this code, yet when run at scale this pipeline ran extremely slowly.

Since the actual business logic of our code shouldn't have caused a slowdown, we suspected that something was adding per-record overhead to our pipeline. To get more information on this, we had to ssh to the VMs to get actual thread profiles from workers. After a bit of digging, we found threads were often stuck in the following stack trace:
  sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java)
  java.net.URLClassLoader$3$1.hasMoreElements(URLClassLoader.java)
  ...

Each call to Json.createReader was searching the classpath trying to find a registered JsonProvider. As you can see from the stack trace, this involves loading and unzipping JAR files. Doing this per record on a high-scale pipeline is not likely to perform very well!

The solution here was for the user to create a static JsonReaderFactory and use that to instantiate the individual reader objects. You might be tempted to create a JsonReaderFactory per bundle of records instead, inside Dataflow’s startBundle method. However, while this will work well for a batch pipeline, a streaming pipeline can have very small bundles - sometimes just a few records. As a result, we don’t recommend doing expensive work per bundle either. Even if you believe your pipeline will only be used in batch mode, you may in the future want to run it as a streaming pipeline. So future-proof your pipelines by making sure they’ll work well in either mode!
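The fix generalizes beyond JSON: any expensive, reusable setup (a provider lookup, a compiled regex, a parsed schema) should be created once per JVM rather than once per record or once per bundle. Here is a minimal, JDK-only sketch of the pattern, using a compiled regex as a stand-in for the JsonReaderFactory; the class name and record format are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerRecordSetup {
  // Expensive setup done once, when the class loads. A static final field is
  // safe to share across DoFn instances as long as the object is thread-safe
  // (Pattern is immutable; a JsonReaderFactory is likewise meant to be reused).
  private static final Pattern USER_ID = Pattern.compile("user=(\\d+)");

  // Per-record work stays cheap: only a Matcher is allocated per element.
  public static String extractUserId(String record) {
    Matcher m = USER_ID.matcher(record);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(extractUserId("ts=1;user=42;action=click"));  // prints 42
  }
}
```

The same shape applies to the JSON case: hold the JsonReaderFactory in a static field and call its createReader method inside processElement.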

Hot Keys

A fundamental primitive in Dataflow is GroupByKey. GroupByKey allows you to group a PCollection of key-value pairs so that all values for a given key are grouped together and processed as a unit. Most of Dataflow’s built-in aggregating transforms — Count, Top, Combine, etc. — use GroupByKey under the covers. You might have a hot key problem if a single worker is extremely busy (e.g., high CPU use, determined by looking at the set of Google Compute Engine workers for the job) while other workers are idle, yet the pipeline still falls farther and farther behind.

The DoFn that processes the result of a GroupByKey is given an input type of KV<KeyType, Iterable<ValueType>>. This means that the entire set of values for that key (within the current window, if windowing is used) is modeled as a single Iterable element. In particular, all values for that key must be processed on the same machine, in fact on the same thread. Performance problems can occur in the presence of hot keys — when one or more keys receive data faster than it can be processed on a single CPU. For example, consider the following code snippet:

  p.apply(Read.from(new UserWebEventSource()))
   .apply(new ExtractBrowserString())
   .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
   .apply(GroupByKey.create())
   .apply(ParDo.of(new ProcessEventsByBrowser()));

This code keys all user events by the user’s web browser, and then processes all events for each browser as a unit. However, there are a small number of very popular browsers (such as Chrome, IE, Firefox, Safari), and those keys will be very hot — possibly too hot to process on one CPU. In addition to performance, this is also a scalability bottleneck. Adding more workers to the pipeline will not help if there are four hot keys, since those keys can be processed on at most four workers. You’ve structured your pipeline so that Dataflow can’t scale it up without violating the API contract.

One way to alleviate this is to structure ProcessEventsByBrowser as a combiner. A combiner is a special type of user function that allows piecewise processing of the Iterable. For example, if the goal was to count the number of events per browser per second, Count.perKey() can be used instead of a ParDo. Dataflow is able to lift part of the combining operation above the GroupByKey, which allows for more parallelism (for those of you coming from the Database world, this is similar to pushing a predicate down); some of the work can be done in a previous stage which hopefully is better distributed.
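The reason lifting helps is that a combiner like count is associative: each worker can pre-aggregate its own shard of the data, and only the small per-key partial results need to move to the machine that owns the key. The following plain-Java sketch (invented class and method names, not Dataflow SDK code) shows the two stages for a per-browser count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerLifting {
  // Stage 1, run on every worker: count the browser strings in a local shard.
  static Map<String, Long> partialCount(List<String> shard) {
    Map<String, Long> counts = new HashMap<>();
    for (String browser : shard) {
      counts.merge(browser, 1L, Long::sum);
    }
    return counts;
  }

  // Stage 2, run after the shuffle: merge the small partial maps. Only these
  // partial counts, not the raw events, cross the GroupByKey boundary.
  static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
    Map<String, Long> merged = new HashMap<>();
    for (Map<String, Long> partial : partials) {
      partial.forEach((k, v) -> merged.merge(k, v, Long::sum));
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> shard1 = Arrays.asList("Chrome", "Chrome", "Firefox");
    List<String> shard2 = Arrays.asList("Chrome", "Safari");
    Map<String, Long> result =
        mergeCounts(Arrays.asList(partialCount(shard1), partialCount(shard2)));
    System.out.println(result);  // Chrome counted 3 times across both shards
  }
}
```

Because the merge in stage 2 operates on one small entry per key per shard, the hot key's worker does far less work than it would iterating over every raw event.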

Unfortunately, while using a combiner often helps, it may not be enough — especially if the hot keys are very hot. This is especially true for streaming pipelines. You might also see this when using the global variants of combine (Combine.globally(), Count.globally(), Top.largest(), among others). Under the covers, these operations are performing a per-key combine on a single static key, and may not perform well if the volume to this key is too high. To address this, we allow you to provide extra parallelism hints using the Combine.PerKey.withHotKeyFanout or Combine.Globally.withFanout. These operations will create an extra step in your pipeline to pre-aggregate the data on many machines before performing the final aggregation on the target machines. There's no magic number for these operations, but the general strategy would be to split any hot key into enough sub-shards so that any single shard is well under the per-worker throughput that your pipeline can sustain.
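Conceptually, hot-key fanout spreads a single hot key across several sub-keys, pre-aggregates each sub-key in parallel, then strips the shard suffix and combines the partials. The sketch below (a simplification with invented names, not what the SDK actually does internally) simulates that two-stage flow for a sum.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class HotKeyFanout {
  static final int FANOUT = 4;

  // Stage 1: route each element of a hot key to one of FANOUT sub-keys and
  // pre-aggregate there. Each sub-key can live on a different worker.
  static Map<String, Long> preAggregate(String key, List<Long> values) {
    Map<String, Long> partials = new HashMap<>();
    for (long v : values) {
      int shard = ThreadLocalRandom.current().nextInt(FANOUT);
      partials.merge(key + "#" + shard, v, Long::sum);
    }
    return partials;
  }

  // Stage 2: strip the shard suffix and combine at most FANOUT partial sums,
  // instead of every raw value, on the key's final worker.
  static long finalCombine(String key, Map<String, Long> partials) {
    long total = 0;
    for (Map.Entry<String, Long> e : partials.entrySet()) {
      if (e.getKey().startsWith(key + "#")) {
        total += e.getValue();
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> partials =
        preAggregate("Chrome", Arrays.asList(1L, 2L, 3L, 4L, 5L));
    System.out.println(finalCombine("Chrome", partials));  // prints 15
  }
}
```

This only works because the combine function is associative and commutative, which is exactly what the Combine API requires of user combiners.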

Large Windows

Dataflow provides a sophisticated windowing facility for bucketing data according to time. This is most useful in streaming pipelines when processing unbounded data, though it is fully supported for batch, bounded pipelines as well. When a windowing strategy has been attached to a PCollection, any subsequent grouping operation (most notably GroupByKey) performs a separate grouping per window. Unlike other systems that provide only globally-synchronized windows, Dataflow windows the data for each key separately. This is what enables us to provide flexible per-key windows such as sessions. For more information, we recommend that you read the windowing guide in the Dataflow documentation.

As a consequence of the fact that windows are per-key, Dataflow buffers elements on the receiver side while waiting for each window to close. If using very long windows — e.g., a 24-hour fixed window — a lot of data has to be buffered, which can be a performance bottleneck for the pipeline. This can manifest as slowness (like for hot keys), or even as out of memory errors on the workers (visible in the logs). We again recommend using combiners to reduce the data size. The difference between writing this:

  pcollection.apply(Window.into(FixedWindows.of(Duration.standardDays(1))))
             .apply(GroupByKey.create())
             .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
               public void processElement(ProcessContext c) {
                 long count = 0;
                 for (Long value : c.element().getValue()) {
                   count++;
                 }
                 c.output(KV.of(c.element().getKey(), count));
               }
             }));

. . . and this . . .

  pcollection.apply(Window.into(FixedWindows.of(Duration.standardDays(1))))
             .apply(Count.perKey());

. . . isn’t just brevity. In the latter snippet Dataflow knows that a count combiner is being applied, and so only needs to store the count so far for each key, no matter how long the window is. In contrast, Dataflow understands less about the first snippet of code and is forced to buffer an entire day’s worth of data on receivers, even though the two snippets are logically equivalent!
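The difference in buffered state can be made concrete. This plain-Java sketch (invented class names, not Dataflow internals) contrasts what each approach must hold per key and window while waiting for a 24-hour window to close: the combiner keeps one long, the raw ParDo approach keeps every element.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowStateSize {
  // Combiner-style state: one long, no matter how many elements arrive.
  static class CountAccumulator {
    private long count = 0;
    void add(String element) { count++; }
    long result() { return count; }
  }

  // ParDo-over-GroupByKey state: every element buffered until the window closes.
  static class ElementBuffer {
    private final List<String> buffered = new ArrayList<>();
    void add(String element) { buffered.add(element); }
    long result() { return buffered.size(); }
    int storedElements() { return buffered.size(); }
  }

  public static void main(String[] args) {
    CountAccumulator acc = new CountAccumulator();
    ElementBuffer buf = new ElementBuffer();
    for (int i = 0; i < 100_000; i++) {
      acc.add("event-" + i);
      buf.add("event-" + i);
    }
    // Same answer, very different memory footprint.
    System.out.println(acc.result() + " counted vs "
        + buf.storedElements() + " elements buffered");
  }
}
```

Both produce the same count, but the buffering version's state grows linearly with the day's traffic, which is what leads to the slowness and out-of-memory errors described above.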

If it’s impossible to express your operation as a combiner, then we recommend looking at the triggers API. This will allow you to optimistically process portions of the window before the window closes, and so reduce the size of buffered data.

Note that many of these limitations do not apply to the batch runner. However, as mentioned above, you're always better off future-proofing your pipeline and making sure it runs well in both modes.

We've talked about hot keys, large windows, and expensive per-record operations. Other guidance can be found in our documentation. Although this post has focused on challenges you may encounter with scaling your pipeline, there are many benefits to Dataflow that are largely transparent — techniques such as dynamic work rebalancing to minimize the negative effects of stragglers, throughput-based auto-scaling, and job resource management that adapt to many different pipeline and data shapes without user intervention. We're always trying to make our system more adaptive, and plan to automatically incorporate some of the above strategies into the core execution engine over time. Thanks for reading, and happy Dataflowing!