Google Cloud Platform

How to use Google Cloud Dataflow with TensorFlow for batch predictive analysis

As demonstrated in this new solution guide, Cloud Dataflow provides flexible, serverless infrastructure for machine-learning workloads.

Google Cloud Dataflow is a fully managed service in Google Cloud Platform (GCP) for developing and executing a range of data-processing patterns, including ETL, batch computation, and continuous (streaming) computation, in a unified way. As the latest in a line of successive frameworks developed inside Google that includes MapReduce, FlumeJava, and MillWheel, Cloud Dataflow passes the design lessons we’ve learned along that journey on to Google Cloud customers. For example, the Cloud Dataflow SDKs, which are transitioning to Apache Beam in their 2.x releases, provide the programming primitives needed for unified development across batch and streaming workloads, which is an advantage for data engineers in their ongoing battle to standardize on key APIs.

For many customers, Cloud Dataflow has also proven to be a useful way to bring machine learning into data-analytics pipelines. Preprocessing data for consistency and quality (to reconcile differing data formats, for example) is a precondition for effectively training machine-learning models, which makes Cloud Dataflow a natural complement to feature development in machine learning. Cloud Dataflow can also run predictive-analysis jobs for large-scale machine learning: because it supports “switchable” parallel pipelines in which data sources and sinks can be swapped out without modifying code, model development is more flexible (for example, more or different data can be brought into the training dataset non-disruptively), and data engineers needn’t concern themselves with distributing or redistributing machine-learning jobs across worker nodes.

Along these lines, we’ve created a solution guide that demonstrates how to build a Cloud Dataflow batch-processing pipeline (in Python) that supports a TensorFlow-based model for predictive handwriting analysis. It also shows how Google Cloud Storage and Google BigQuery can be used interchangeably as data sources or sinks, again without pipeline modifications or changes to the underlying model.


Read the guide in its entirety here. For additional background, consider the following resources as well: