Migrating from App Engine MapReduce to Cloud Dataflow

This tutorial is intended for App Engine MapReduce users. It shows how to migrate from using App Engine MapReduce to Cloud Dataflow.

Why migrate?

App Engine MapReduce is a programming model for processing large amounts of data in a parallel and distributed fashion. It is useful for large, long-running tasks that cannot be handled within the scope of a single request, such as:

  • Analyzing application logs
  • Aggregating related data from external sources
  • Transforming data from one format to another
  • Exporting data for external analysis

However, App Engine MapReduce is a community-maintained, open source library that is built on top of App Engine services and is no longer supported by Google.

Cloud Dataflow, on the other hand, is fully supported by Google, and provides extended functionality compared to App Engine MapReduce.

Migration cases

The following are some example cases where you could benefit from migrating from App Engine MapReduce to Cloud Dataflow:

  • Store application data from your Cloud Datastore database in a BigQuery data warehouse for analytical processing with SQL.
  • Use Cloud Dataflow as an alternative to App Engine MapReduce for maintaining and updating your Cloud Datastore dataset.
  • Back up parts of your Cloud Datastore database to Cloud Storage.

What is Cloud Dataflow?

Cloud Dataflow is a unified programming model and a managed service for developing and executing a wide variety of data processing patterns. It enables you to easily create complex pipelines for both batch and streaming. Cloud Dataflow provides SDKs for defining data processing workflows and a Google Cloud Platform managed service to run those workflows on GCP resources such as Compute Engine, BigQuery, and more.

In 2016, Google donated the Cloud Dataflow SDKs to the Apache Software Foundation as part of the Apache Beam project. Google is actively involved as one of the many contributors to the Apache Beam community. Starting with version 2.0.0, the Cloud Dataflow SDKs are based on Apache Beam, and the Cloud Dataflow service runs pipelines written with the Beam SDKs. The Cloud Dataflow SDK is available for Java and Python.

Getting started with Cloud Dataflow

To get started, follow the Cloud Dataflow quickstart for the language of your choice (Java or Python).

Creating and running a pipeline

When using App Engine MapReduce, you create data processing classes, add the MapReduce library, and once the job's specification and settings are defined, you create and start the job in one step using the static start() method on the appropriate job class.

For map-only jobs, you create the Input and Output classes and the Map class that performs the mapping. For full App Engine MapReduce jobs, you create the Input and Output classes and define the Mapper and Reducer classes that perform the data transformations.

In Cloud Dataflow, the approach is slightly different: you create a pipeline. You use input and output connectors to read from your data sources and write to your data sinks. You use predefined data transforms, or write your own, to implement your data transformations. Once your code is ready, you deploy your pipeline to the Cloud Dataflow service.
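For example, a minimal pipeline written with the Beam Python SDK might look like the following sketch; the bucket paths are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Read from a source, apply a transform, and write to a sink. The pipeline
# runs when the "with" block exits.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')     # input connector
     | 'ToUpper' >> beam.Map(lambda line: line.upper())                 # data transform
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result'))  # output connector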

Converting App Engine MapReduce jobs to Cloud Dataflow pipelines

The following table presents the Cloud Dataflow equivalents of the map, shuffle, and reduce steps of the App Engine MapReduce model.

Java: SDK 2.x

App Engine MapReduce Cloud Dataflow equivalent
Map MapElements<InputT,OutputT>
Shuffle GroupByKey<K,V>
Reduce Combine.GroupedValues<K,InputT,OutputT>

A common practice is to use Combine.PerKey<K,InputT,OutputT> instead of GroupByKey followed by Combine.GroupedValues.

Python

App Engine MapReduce Cloud Dataflow equivalent
Map beam.Map
Shuffle beam.GroupByKey
Reduce beam.CombineValues

A common practice is to use beam.CombinePerKey instead of beam.GroupByKey and beam.CombineValues.
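To illustrate the difference, the following sketch assumes an existing PCollection named pairs that contains (word, 1) tuples:
import apache_beam as beam

# 'pairs' is an assumed PCollection of (word, 1) tuples.
# Explicit shuffle and reduce: group the values for each key, then combine them.
counts = (
    pairs
    | 'Shuffle' >> beam.GroupByKey()        # produces (word, [1, 1, ...])
    | 'Reduce' >> beam.CombineValues(sum))  # produces (word, count)

# Equivalent, and more common: combine the values per key in a single step.
counts = pairs | 'GroupAndSum' >> beam.CombinePerKey(sum)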


The following sample code demonstrates how to implement the App Engine MapReduce model in Cloud Dataflow:

Java: SDK 2.x

Taken from MinimalWordCount.java:
p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))

 // Apply a ParDo that returns a PCollection, where each element is an
 // individual word in Shakespeare's texts.
 .apply("ExtractWords", ParDo.of(new DoFn() {
                   @ProcessElement
                   public void processElement(ProcessContext c) {
                     for (String word : c.element().split(ExampleUtils.TOKENIZER_PATTERN)) {
                       if (!word.isEmpty()) {
                         c.output(word);
                       }
                     }
                   }
                 }))

 // Apply the Count transform that returns a new PCollection of key/value pairs,
 // where each key represents a unique word in the text.
 .apply(Count.perElement())

 // Apply a MapElements transform that formats our PCollection of word counts
 // into a printable string, suitable for writing to an output file.
 .apply("FormatResults", MapElements.via(new SimpleFunction, String>() {
                   @Override
                   public String apply(KV input) {
                     return input.getKey() + ": " + input.getValue();
                   }
                 }))

 // Apply a write transform that writes the contents of the PCollection to a
 // series of text files.
 .apply(TextIO.write().to("wordcounts"));

Python

Taken from wordcount_minimal.py:
# Read the text file[pattern] into a PCollection.
lines = p | ReadFromText(known_args.input)

# Count the occurrences of each word.
counts = (
    lines
    | 'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
                  .with_output_types(str))
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'GroupAndSum' >> beam.CombinePerKey(sum))

# Format the counts into a PCollection of strings.
output = counts | 'Format' >> beam.MapTuple(lambda w, c: '%s: %s' % (w, c))

# Write the output using a "Write" transform that has side effects.
# pylint: disable=expression-not-assigned
output | WriteToText(known_args.output)
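The full examples that these snippets come from also construct the pipeline from command-line options and run it. To run a pipeline on the Cloud Dataflow managed service instead of locally, you pass Dataflow-specific pipeline options, as in the following sketch; the project, region, and bucket names are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace with your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    staging_location='gs://my-bucket/staging')

with beam.Pipeline(options=options) as p:
    # A trivial step stands in for the word count transforms shown above.
    (p
     | 'Create' >> beam.Create(['hello', 'dataflow'])
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/smoke-test'))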

Additional Cloud Dataflow benefits

If you choose to migrate your App Engine MapReduce jobs to Cloud Dataflow pipelines, you will benefit from several features that Cloud Dataflow provides.

Scheduling Cloud Dataflow jobs

If you are familiar with App Engine task queues, you can use App Engine Cron in a similar way to schedule your recurring Cloud Dataflow pipelines.
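One possible shape for such a setup is a cron-invoked App Engine handler that launches a pipeline you previously staged as a Dataflow template, as in the following sketch; the project ID, region, template path, and job name are placeholders:
from googleapiclient.discovery import build

def launch_dataflow_template():
    # Call the Dataflow REST API (templates.launch) through the Google API client.
    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().locations().templates().launch(
        projectId='my-project-id',
        location='us-central1',
        gcsPath='gs://my-bucket/templates/my-template',
        body={'jobName': 'scheduled-dataflow-job', 'parameters': {}})
    return request.execute()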

There are several additional ways to schedule execution of your pipeline.

Monitoring Cloud Dataflow jobs

To monitor your App Engine MapReduce jobs, you depend on an appspot.com-hosted URL.

When you execute your pipelines using the Cloud Dataflow managed service, you can monitor the pipelines using the convenient Cloud Dataflow web-based monitoring user interface. You can also use Stackdriver Monitoring for additional information about your pipelines.

Reading and writing

App Engine MapReduce's Readers and Writers are called data sources and sinks, or I/O connectors, in Cloud Dataflow.

Cloud Dataflow has many I/O connectors for several Google Cloud services, such as Cloud Bigtable, BigQuery, Cloud Datastore, Cloud SQL, and others. There are also I/O connectors created by Apache Beam contributors for non-Google services, such as Apache Cassandra and MongoDB.
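For example, a pipeline that reads newline-delimited JSON records from Cloud Storage and loads them into BigQuery might look like the following sketch; the bucket name, table name, and one-line schema are placeholders:
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadJson' >> beam.io.ReadFromText('gs://my-bucket/export/*.json')  # source connector
     | 'Parse' >> beam.Map(json.loads)
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
         'my-project-id:my_dataset.my_table',
         schema='name:STRING,value:INTEGER',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))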
