About Dataflow HPC highly parallel workloads

Highly parallel workloads, also known as embarrassingly parallel workloads, are common in finance, media, and life-sciences enterprises. For parallel workloads like these, enterprises typically deploy a cluster of compute nodes. Each node can carry out independent processing tasks, in a configuration referred to as grid computing. To process data for parallel workloads, you can use Apache Beam with Dataflow. For more information about Apache Beam, see the Apache Beam programming guide.

Using Dataflow for highly parallel workloads provides many benefits, including fully managed execution and built-in security features.

These workloads require distribution of data to functions that run across many cores. This distribution often requires very high-concurrency reads, followed by a large fan-out of data that is absorbed by downstream systems. The core competencies of Dataflow are distributing batch and stream workloads across resources and managing autoscaling and dynamic work rebalancing across those resources. Therefore, when you use Dataflow for your highly parallel workloads, performance, scalability, availability, and security needs are handled automatically.

Incorporate external code into your pipeline

Apache Beam currently has built-in SDKs for Java, Python, and Go. However, many highly parallel workloads use code written in C++. You can use Dataflow and other Google Cloud services to run C++ binaries (libraries) as external code using Apache Beam. Including C++ binaries lets you unlock these types of workloads using fully managed services. It also gives you the ability to build complete pipelines using a sophisticated directed acyclic graph (DAG).

The same approach for running C++ binaries is also relevant for code written in other languages where you can compile a standalone binary.
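For illustration, the following sketch wraps a standalone binary in a Beam `DoFn` by invoking it as a subprocess. The binary path, its command-line interface, and how it is staged on the workers (for example, through a custom container) are assumptions made for this example, not a prescribed implementation.

```python
# A minimal sketch of wrapping an external C++ binary in a Beam DoFn.
# The binary path and its command-line interface are hypothetical.
import subprocess

import apache_beam as beam


class RunCppBinary(beam.DoFn):
    """Invokes an external binary once per element and emits its stdout."""

    def __init__(self, binary_path):
        self._binary_path = binary_path

    def process(self, element):
        # Pass the element to the binary on stdin and capture its output.
        result = subprocess.run(
            [self._binary_path],
            input=str(element).encode("utf-8"),
            capture_output=True,
            check=True,
        )
        yield result.stdout.decode("utf-8").strip()


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateInputs" >> beam.Create(["task-1", "task-2", "task-3"])
        | "RunBinary" >> beam.ParDo(RunCppBinary("/usr/local/bin/process_task"))
        | "Print" >> beam.Map(print)
    )
```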

End-to-end highly parallel pipelines

With Dataflow, you can do the read and write I/O processing, analysis, and task output all in the same pipeline, which enables you to run complete highly parallel pipelines.

For example, an HPC highly parallel workload might incorporate the following steps:

  1. Ingest raw data, both from internal and external sources. Data might come from unbounded or bounded sources. Unbounded sources are mostly converted to bounded sources to accommodate the technologies used for task farming.

  2. Pre-process the raw data into a data shape and encoding that the task farming component can use.

  3. Use a system to distribute the calculations to hosts and to retrieve data from a source, and then materialize the results for post-analysis.

  4. Conduct post-analysis to convert the results into output.

You can use Dataflow to manage all of these steps in a single pipeline while taking advantage of the following Dataflow features:

  • Because a single system is responsible for all stages, you don't need an external orchestration system to coordinate the running of multiple pipelines.

  • With data locality, you don't need to explicitly materialize and dematerialize data between stage boundaries, which increases efficiency.

  • With better in-system telemetry, information about the total bytes in each stage is available, which helps with designing later stages.

  • With autoscaling, when the data is in the system, the resources scale based on the data volumes as the data moves through the pipeline stages.

The core Dataflow HPC highly parallel pipeline uses modern DAG execution engines. All the typical pipeline processes can be completed in a single DAG and, therefore, a single Dataflow pipeline. You can use a DAG generated by Apache Beam to define the shape of the pipeline.
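As an illustration, the four steps listed earlier can map onto a single Beam pipeline. The following Python sketch shows one possible shape; the helper functions and the gs:// paths are hypothetical placeholders.

```python
# A sketch of the four workload stages as one Beam pipeline. The helper
# functions (preprocess, run_calculation, post_analyze) and the gs:// paths
# are hypothetical placeholders.
import apache_beam as beam


def preprocess(line):
    # Shape and encode the raw record for the compute stage.
    return line.strip().split(",")


def run_calculation(record):
    # Stand-in for the distributed calculation, such as an external binary call.
    return {"key": record[0], "value": len(record)}


def post_analyze(result):
    # Convert the raw result into the final output shape.
    return f'{result["key"]},{result["value"]}'


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Ingest" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")   # step 1
        | "Preprocess" >> beam.Map(preprocess)                           # step 2
        | "Calculate" >> beam.Map(run_calculation)                       # step 3
        | "PostAnalysis" >> beam.Map(post_analyze)                       # step 4
        | "WriteResults" >> beam.io.WriteToText("gs://my-bucket/results/out")
    )
```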

If you're migrating from a task farm system to a highly parallel workflow, you need to shift from tasks to data. In Apache Beam, a PTransform contains a DoFn, which has a process function that takes in a data element. The data element can be any object with one or more properties.
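For example, a minimal DoFn that processes one data element per call might look like the following sketch. The element fields and the `CalculateRisk` name are illustrative assumptions.

```python
# Minimal sketch: the unit of work is a data element, not a task.
# The element here is a hypothetical dict describing one grid calculation.
import apache_beam as beam


class CalculateRisk(beam.DoFn):
    def process(self, element):
        # The element is any object with one or more properties; here a dict.
        notional = element["notional"]
        rate = element["rate"]
        yield {"trade_id": element["trade_id"], "exposure": notional * rate}


# Usage inside a pipeline:
#   results = trades | beam.ParDo(CalculateRisk())
```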

By using a DAG and a single pipeline, you can load all the data within the system during the entire workflow. You don't need to output data to databases or storage.

Google Cloud components used with highly parallel workflows

Grid computing applications require data to be distributed to functions running on many cores. This pattern requires high-concurrency reads, often followed by a large fan-out of data absorbed by downstream systems.

Dataflow is integrated with other Google Cloud managed services that can absorb massive-scale parallelized data I/O:

  • Pub/Sub: global event stream ingestion service
  • Bigtable: wide-column store for caching and serving
  • Cloud Storage: unified object store
  • BigQuery: petabyte-scale data warehouse service

Used together, these services provide a compelling solution for highly parallel workloads.

A common architecture for highly parallel workloads running on Google Cloud includes the following components:

  • The Dataflow Runner for Apache Beam. This runner distributes work to the grid nodes with a processing flow derived from a DAG. A single Apache Beam DAG lets you define complex multi-stage pipelines in which parallel pipelined stages can be brought back together using side-inputs or joins.

  • Cloud Storage. This service provides a location to stage the C++ binaries. When large files need to be stored, as in many media use cases, those files also reside in Cloud Storage.

  • Bigtable, BigQuery, and Pub/Sub. These services are used as both sources and sinks.

The following diagram outlines the high-level architecture of an example workflow.

Architecture of a grid computing solution

You can also use other storage systems. For details, see the list of storage systems and streaming sources in the pipeline I/O page in the Apache Beam documentation.

The Dataflow runner for Apache Beam

Use Dataflow to transform and enrich your data in both streaming and batch modes. Dataflow is based on Apache Beam.

Cloud Storage

Cloud Storage is a unified object store that encompasses live data serving, data analytics, machine learning (ML), and data archiving. For highly parallel workloads with Dataflow, Cloud Storage provides access to C++ binaries. In some use cases, Cloud Storage also provides the location for data needed by the processing phase.

For the high-burst loads required by grid computing, you need to understand Cloud Storage performance characteristics. For more information about Cloud Storage data-serving performance, see Request Rate and Access Distribution Guidelines in the Cloud Storage documentation.
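As one possible pattern, a DoFn can fetch a binary staged in Cloud Storage during setup, so that each worker downloads it once before processing begins. The following sketch uses the Beam FileSystems API; the bucket path and local path are hypothetical.

```python
# A sketch of fetching a staged C++ binary from Cloud Storage in DoFn.setup,
# so each worker downloads it once. The gs:// path is a hypothetical example.
import os
import stat

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class FetchBinary(beam.DoFn):
    BINARY_URI = "gs://my-bucket/binaries/process_task"
    LOCAL_PATH = "/tmp/process_task"

    def setup(self):
        # Runs once when the worker initializes this DoFn instance.
        if not os.path.exists(self.LOCAL_PATH):
            src = FileSystems.open(self.BINARY_URI)
            with open(self.LOCAL_PATH, "wb") as dst:
                dst.write(src.read())
            src.close()
            os.chmod(self.LOCAL_PATH, stat.S_IRWXU)

    def process(self, element):
        # The binary at LOCAL_PATH is now available for subprocess calls.
        yield element
```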

Bigtable

Bigtable is a high-performance NoSQL database service optimized for large analytical and operational workloads. Bigtable is complementary to Dataflow. The primary characteristics of Bigtable, low-latency reads and writes (6 ms at the 90th percentile), allow it to handle many thousands of concurrent clients and heavy-burst workloads. These features make Bigtable ideal as a sink and as a data source within the DoFn function in the processing phase of Dataflow.
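For example, the following sketch writes grid results to Bigtable with the Beam Python Bigtable connector. The project, instance, table, column family, and row fields are hypothetical placeholders.

```python
# A sketch of writing grid results to Bigtable with the Beam connector.
# Project, instance, table, and column family names are hypothetical.
import datetime

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row


def to_bigtable_row(result):
    # Build a DirectRow keyed by the hypothetical trade_id field.
    direct_row = row.DirectRow(row_key=result["trade_id"].encode("utf-8"))
    direct_row.set_cell(
        "results",                                   # column family
        b"exposure",                                 # column qualifier
        str(result["exposure"]).encode("utf-8"),
        timestamp=datetime.datetime.utcnow(),
    )
    return direct_row


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateResults" >> beam.Create([{"trade_id": "t1", "exposure": 12.5}])
        | "ToRows" >> beam.Map(to_bigtable_row)
        | "WriteRows" >> WriteToBigTable(
            project_id="my-project",
            instance_id="my-instance",
            table_id="grid-results",
        )
    )
```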

BigQuery

BigQuery is a fast, economical, and fully managed enterprise data warehouse for large-scale data analytics. Grid results are often used for analysis, and BigQuery enables you to run large-scale aggregations against the data output of the grid.
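As a sketch, the following pipeline fragment writes grid output to BigQuery with the built-in WriteToBigQuery transform. The table name and schema are hypothetical; running on Dataflow also requires standard pipeline options such as a temporary Cloud Storage location.

```python
# A sketch of loading grid output into BigQuery for post-analysis.
# The project, dataset, table, and schema are hypothetical placeholders.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateResults" >> beam.Create([{"trade_id": "t1", "exposure": 12.5}])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table="my-project:grid_results.exposures",
            schema="trade_id:STRING,exposure:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```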

Pub/Sub

Pub/Sub is an asynchronous and scalable messaging service that decouples services producing messages from services processing those messages. You can use Pub/Sub for streaming analytics and data integration pipelines to ingest and distribute data. It's equally effective as a messaging-oriented middleware for service integration or as a queue to parallelize tasks.
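For example, the following sketch reads task requests from a Pub/Sub subscription in streaming mode. The subscription path is a hypothetical placeholder.

```python
# A sketch of ingesting unbounded task requests from Pub/Sub.
# The subscription path is a hypothetical placeholder; reading from
# Pub/Sub requires a streaming pipeline.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRequests" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/grid-tasks"
        )
        | "Decode" >> beam.Map(lambda message: message.decode("utf-8"))
        | "Print" >> beam.Map(print)
    )
```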

The Dataflow DAG

The Apache Beam SDK enables you to build expressive DAGs, which in turn lets you create stream or batch multi-stage pipelines. Data movement is handled by the runner, with data expressed as PCollection objects, which are immutable parallel collections of elements.

The following diagram illustrates this flow.

Flow using DAG

The Apache Beam SDK lets you define a DAG. In the DAG, you can include user-defined code as functions. Normally, the same programming language (Java, Python, or Go) is used for both the declaration of the DAG and for the user-defined code. You can also use code that isn't built-in, such as C++, for the user-defined code.
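The following sketch shows a small branching DAG in Python in which two stages derived from the same PCollection are brought back together with a side input. All names and values are illustrative.

```python
# A sketch of a branching DAG: two parallel stages derived from the same
# PCollection are joined back together with a side input.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    trades = pipeline | "CreateTrades" >> beam.Create(
        [{"id": "t1", "notional": 100.0}, {"id": "t2", "notional": 250.0}]
    )

    # Branch 1: compute a global total across the whole collection.
    total = (
        trades
        | "Notionals" >> beam.Map(lambda t: t["notional"])
        | "Sum" >> beam.CombineGlobally(sum)
    )

    # Branch 2: per-element processing that consumes branch 1 as a side input.
    weighted = trades | "Weight" >> beam.Map(
        lambda t, total_notional: {
            "id": t["id"],
            "weight": t["notional"] / total_notional,
        },
        total_notional=beam.pvalue.AsSingleton(total),
    )

    weighted | "Print" >> beam.Map(print)
```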

What's next

  • Learn about best practices for working with Dataflow HPC highly parallel pipelines.

  • Follow the tutorial to create a pipeline that uses custom containers with C++ libraries.