Highly parallel workloads, also known as embarrassingly parallel workloads, are common in finance, media, and life-sciences enterprises. For parallel workloads like these, enterprises typically deploy a cluster of compute nodes. Each node can carry out independent processing tasks, in a configuration referred to as grid computing. To process data for parallel workloads, you can use Apache Beam with Dataflow. For more information about Apache Beam, see the Apache Beam programming guide.
Using Dataflow for highly parallel workloads provides many benefits:
- Create a fully managed workflow, with data processing and orchestration in the same pipeline.
- The Dataflow user interface and API include observability features.
- Dataflow has centralized logging for all pipeline stages.
- Dataflow offers autoscaling to maximize performance and optimize resource usage.
- Dataflow is fault tolerant and provides dynamic load balancing.
- Dataflow offers straggler detection and correction.
- Use a single system for all aspects of the pipeline, for both pre- and post-processing and for task processing. You can even use your existing C++ code in the pipeline.
- Use the built-in exactly-once processing that Dataflow provides.
In addition, Dataflow includes various security features:
- Use customer-managed encryption keys (CMEK) with your pipeline.
- Define firewall rules for the network associated with your Dataflow job.
- Use a VPC network.
These workloads require distribution of data to functions that run across many cores. This distribution often requires very high-concurrency reads followed by a large fan-out of data, absorbed by downstream systems. The core competencies of Dataflow are the distribution of batch and stream workloads across resources and the management of autoscaling and dynamic work rebalancing across those resources. Therefore, when you use Dataflow for your highly parallel workloads, performance, scalability, availability, and security needs are handled automatically.
Incorporate external code into your pipeline
Apache Beam currently has built-in SDKs for Java, Python, and Go. However, many highly parallel workloads use code written in C++. You can use Dataflow and other Google Cloud services to run C++ binaries (libraries) as external code using Apache Beam. Including C++ binaries lets you unlock these types of workloads using fully managed services. It also gives you the ability to build complete pipelines using a sophisticated directed acyclic graph (DAG).
The same approach for running C++ binaries is also relevant for code written in other languages where you can compile a standalone binary.
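As an illustration, the following sketch (not from the source) shows one common way to wrap an external binary in an Apache Beam Python DoFn by invoking it with subprocess. The binary path `/usr/local/bin/my_solver` and its command-line interface are hypothetical, and the binary is assumed to already be staged on the worker, for example through a custom container image.

```python
import subprocess

import apache_beam as beam


class RunCppBinary(beam.DoFn):
    """Sketch: invokes a standalone C++ binary once per input element."""

    def process(self, element):
        # Hypothetical binary, assumed to be staged on the worker already.
        result = subprocess.run(
            ["/usr/local/bin/my_solver", str(element)],
            capture_output=True,
            text=True,
            check=True,
        )
        # Treat each line of stdout as an output element.
        for line in result.stdout.splitlines():
            yield line
```

You can then apply the transform with `beam.ParDo(RunCppBinary())` like any other step in the pipeline.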
End-to-end highly parallel pipelines
With Dataflow, you can do the I/O read/write processing, analysis, and task output all in the same pipeline, which enables you to run complete highly parallel pipelines.
For example, an HPC highly parallel workload might incorporate the following steps (a minimal pipeline sketch follows the list):
- Ingest raw data from both internal and external sources. Data might come from unbounded or bounded sources. Unbounded sources are mostly converted to bounded sources to accommodate the technologies used for task farming.
- Pre-process the raw data into a data shape and encoding that the task farming component can use.
- Use a system to distribute the calculations to hosts and to retrieve data from a source, and then materialize the results for post-analysis.
- Conduct post-analysis to convert the results into output.
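The following sketch, which is not from the source, strings these four stages together in a single Apache Beam Python pipeline. The bucket paths and the per-element logic are hypothetical placeholders, and a real Dataflow run would also need project, region, and temp_location pipeline options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def preprocess(line):
    # Reshape a raw CSV line into the element the task step expects.
    return line.split(",")


def run_task(fields):
    # Stand-in for the compute-heavy task (for example, a C++ binary call).
    return len(fields)


def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            # 1. Ingest raw data from a bounded source (hypothetical bucket).
            | "Ingest" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
            # 2. Pre-process into a shape the task step can consume.
            | "Preprocess" >> beam.Map(preprocess)
            # 3. Distribute the calculations across workers.
            | "RunTasks" >> beam.Map(run_task)
            # 4. Post-analysis: aggregate and materialize the results.
            | "Sum" >> beam.CombineGlobally(sum)
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/summary")
        )


if __name__ == "__main__":
    run()
```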
You can use Dataflow to manage all of these steps in a single pipeline while taking advantage of Dataflow features:
- Because a single system is responsible for all stages, you don't need an external orchestration system to coordinate the running of multiple pipelines.
- With data locality, you don't need to explicitly materialize and dematerialize data between stage boundaries, which increases efficiency.
- With better in-system telemetry, information about the total bytes in each stage is available, which helps with designing later stages.
- With autoscaling, when the data is in the system, resources scale based on the data volumes as the data moves through the pipeline stages.
The core Dataflow HPC highly parallel pipeline uses modern DAG execution engines. All the typical pipeline processes can be completed in a single DAG and, therefore, a single Dataflow pipeline. You can use a DAG generated by Apache Beam to define the shape of the pipeline.
If you're migrating from a task farm system to a highly parallel workflow, you need to shift from tasks to data. A PTransform contains a DoFn, which has a process function that takes in a data element. The data element can be any object with one or more properties.
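For example, the following minimal sketch (not from the source) shows the data-oriented shape: a hypothetical `SimulationInput` element with a few properties, processed one element at a time by a DoFn's process method.

```python
from dataclasses import dataclass, field

import apache_beam as beam


@dataclass
class SimulationInput:
    """One data element; in a task farm this would have been a scheduled job."""
    scenario_id: str
    parameters: dict = field(default_factory=dict)


class RunScenario(beam.DoFn):
    def process(self, element: SimulationInput):
        # The per-element work replaces what the task farm scheduled as a task.
        score = sum(element.parameters.values())
        yield {"scenario_id": element.scenario_id, "score": score}
```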
By using a DAG and a single pipeline, you can keep all the data within the system during the entire workflow. You don't need to write intermediate data out to databases or storage between stages.
Google Cloud components used with highly parallel workflows
Grid computing applications require data to be distributed to functions running on many cores. This pattern often requires high-concurrency reads and is often followed by a large fan-out of data absorbed by downstream systems.
Dataflow is integrated with other Google Cloud managed services that can absorb massive-scale parallelized data I/O:
- Pub/Sub: global event stream ingestion service
- Bigtable: wide-column store for caching and serving
- Cloud Storage: unified object store
- BigQuery: petabyte-scale data warehouse service
Used together, these services provide a compelling solution for highly parallel workloads.
A common architecture for highly parallel workloads running on Google Cloud includes the following:
- The Dataflow runner for Apache Beam. This runner distributes work to the grid nodes with a processing flow derived from a DAG. A single Apache Beam DAG lets you define complex multi-stage pipelines in which parallel pipelined stages can be brought back together by using side inputs or joins.
- Cloud Storage. This service provides a location to stage the C++ binaries. When large files need to be stored, as in many media use cases, those files also reside in Cloud Storage.
- Bigtable, BigQuery, and Pub/Sub. These services are used as both sources and sinks.
The following diagram outlines the high-level architecture of an example workflow.
You can also use other storage systems. For details, see the list of storage systems and streaming sources in the pipeline I/O page in the Apache Beam documentation.
The Dataflow runner for Apache Beam
Use Dataflow to transform and enrich your data in both streaming and batch modes. Dataflow is based on Apache Beam.
Cloud Storage
Cloud Storage is a unified object store that encompasses live data serving, data analytics, machine learning (ML), and data archiving. For highly parallel workloads with Dataflow, Cloud Storage provides access to C++ binaries. In some use cases, Cloud Storage also provides the location for data needed by the processing phase.
For the high-burst loads required by grid computing, you need to understand Cloud Storage performance characteristics. For more information about Cloud Storage data-serving performance, see Request Rate and Access Distribution Guidelines in the Cloud Storage documentation.
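As one possible pattern, not from the source, the following sketch downloads a staged binary from Cloud Storage in DoFn.setup so the fetch happens once per worker instance rather than once per element. It uses the google-cloud-storage client library; the bucket name, object name, and local path are hypothetical.

```python
import os
import stat

import apache_beam as beam
from google.cloud import storage


class DownloadBinary(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance on the worker.
        client = storage.Client()
        blob = client.bucket("my-binaries").blob("my_solver")  # hypothetical
        self.local_path = "/tmp/my_solver"
        blob.download_to_filename(self.local_path)
        os.chmod(self.local_path, stat.S_IRWXU)  # make the binary executable

    def process(self, element):
        # The binary at self.local_path can now be invoked per element.
        yield element
```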
Bigtable
Bigtable is a high-performance NoSQL database service optimized for large analytical and operational workloads. Bigtable is complementary to Dataflow. The primary characteristics of Bigtable, low-latency reads and writes (6 ms at the 90th percentile), let it handle many thousands of concurrent clients and heavy-burst workloads. These features make Bigtable ideal as a sink and as a data source within the DoFn function in the processing phase of Dataflow.
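For illustration only, the following sketch looks up reference data in Bigtable from within a DoFn by using the google-cloud-bigtable client library. The project, instance, table, column family, and qualifier names are hypothetical.

```python
import apache_beam as beam
from google.cloud import bigtable


class EnrichFromBigtable(beam.DoFn):
    def setup(self):
        # One client per worker instance; IDs below are hypothetical.
        client = bigtable.Client(project="my-project")
        self.table = client.instance("my-instance").table("reference-data")

    def process(self, element):
        # Look up one row keyed by the element and attach a value if found.
        row = self.table.read_row(element["key"].encode("utf-8"))
        if row is not None:
            cell = row.cells["cf"][b"price"][0]
            element["price"] = cell.value.decode("utf-8")
        yield element
```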
BigQuery
BigQuery is a fast, economical, and fully managed enterprise data warehouse for large-scale data analytics. Grid results are often used for analysis, and BigQuery enables you to run large-scale aggregations against the data output of the grid.
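As a minimal sketch, not from the source, the following pipeline writes grid output rows to BigQuery for post-analysis. The table spec, schema, and sample rows are hypothetical.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "CreateResults" >> beam.Create(
            [{"scenario_id": "a", "score": 0.42},
             {"scenario_id": "b", "score": 1.70}])
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:grid_results.scores",  # hypothetical table
            schema="scenario_id:STRING,score:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```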
Pub/Sub
Pub/Sub is an asynchronous and scalable messaging service that decouples services producing messages from services processing those messages. You can use Pub/Sub for streaming analytics and data integration pipelines to ingest and distribute data. It's equally effective as a messaging-oriented middleware for service integration or as a queue to parallelize tasks.
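For example, a streaming pipeline can ingest an event stream from Pub/Sub and aggregate it in fixed windows, as in the following sketch; the topic name is hypothetical and the snippet is not from the source.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/grid-events")  # hypothetical
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )
```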
The Dataflow DAG
The Apache Beam SDK enables you to build expressive DAGs, which in turn lets you create stream or batch multi-stage pipelines. Data movement is handled by the runner, with data expressed as PCollection objects, which are immutable parallel element collections.
The following diagram illustrates this flow.
The Apache Beam SDK lets you define a DAG. In the DAG, you can include user-defined code as functions. Normally, the same programming language (Java, Python, or Go) is used for both the declaration of the DAG and for the user-defined code. You can also use code that isn't built-in, such as C++, for the user-defined code.
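The following sketch, which is illustrative rather than from the source, shows a small DAG in the Python SDK: one PCollection branches into two parallel stages that are then joined again through a side input. The transform names and values are hypothetical.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    numbers = p | "Create" >> beam.Create([1, 2, 3, 4, 5])

    # Two parallel branches over the same immutable PCollection.
    squares = numbers | "Square" >> beam.Map(lambda x: x * x)
    total = numbers | "Sum" >> beam.CombineGlobally(sum)

    # Join the branches again by passing one as a side input.
    normalized = squares | "Normalize" >> beam.Map(
        lambda sq, t: sq / t, t=beam.pvalue.AsSingleton(total))

    normalized | "Print" >> beam.Map(print)
```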
What's next
- Learn about best practices for working with Dataflow HPC highly parallel pipelines.
- Follow the tutorial to create a pipeline that uses custom containers with C++ libraries.