AI & Machine Learning

AI in Depth: Cloud Dataproc meets TensorFlow on YARN: Let TonY help you train right in your cluster

January 30, 2019

Gonzalo Gasca Meza

Developer Programs Engineer, Google

Keqiu Hu

Staff Software Engineer, LinkedIn

Apache Hadoop has become an established and long-running framework for distributed storage and data processing. Google’s Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. With Cloud Dataproc, you can set up a distributed storage platform without worrying about the underlying infrastructure. But what if you want to train TensorFlow workloads directly on your distributed data store?

This post will explain how to install a Hadoop cluster for LinkedIn open-source project TonY (TensorFlow on YARN). You will deploy a Hadoop cluster using Cloud Dataproc and TonY to launch a distributed machine learning job. We’ll explore how you can use two of the most popular machine learning frameworks: TensorFlow and PyTorch.

TensorFlow supports distributed training, allowing portions of the model’s graph to be computed on different nodes. This distributed property can be used to split up computation to run on multiple servers in parallel. Orchestrating distributed TensorFlow is not a trivial task and not something that all data scientists and machine learning engineers have the expertise, or desire, to do—particularly since it must be done manually. TonY provides a flexible and sustainable way to bridge the gap between the analytics powers of distributed TensorFlow and the scaling powers of Hadoop. With TonY, you no longer need to configure your cluster specification manually, a task that can be tedious, especially for large clusters.

The components of our system:

First, Apache Hadoop

Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provides for data storage, data processing, data access, data governance, security, and operations.

Next, Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc’s automation capability helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

And now TonY

TonY is a framework that enables you to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow and PyTorch. TonY enables running either single node or distributed training as a Hadoop application. This native connector, together with other TonY features, runs machine learning jobs reliably and flexibly.

Installation

Setup a Google Cloud Platform project

Get started on Google Cloud Platform (GCP) by creating a new project, using the instructions found here.

Create a Cloud Storage bucket

Then create a Cloud Storage bucket. Reference here.

Create a Hadoop cluster via Cloud Dataproc using initialization actions

You can create your Hadoop cluster directly from Cloud Console or via an appropriate `gcloud` command. The following command initializes a cluster that consists of 1 master and 2 workers:

When creating a Cloud Dataproc cluster, you can specify in your TonY initialization actions script that Cloud Dataproc should run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.

Note: Use Cloud Dataproc version 1.3-deb9, which is supported for this deployment. Cloud Dataproc version 1.3-deb9 provides Hadoop version 2.9.0. Check this version list for details.

Once your cluster is created. You can verify that under Cloud Console > Big Data > Cloud Dataproc > Clusters, that cluster installation is completed and your cluster’s status is Running.

Go to Cloud Console > Big Data > Cloud Dataproc > Clusters and select your new cluster:

https://storage.googleapis.com/gweb-cloudblog-publish/images/1._Master_and_Worker_nodes.max-700x700.png

You will see the Master and Worker nodes.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2._Master_and_Worker_nodes.max-700x700.png

Connect to your Cloud Dataproc master server via SSH

Click on SSH and connect remotely to Master server.

Verify that your YARN nodes are active

Example

Installing TonY

TonY’s Cloud Dataproc initialization action will do the following:

Install and build TonY from GitHub repository.
Create a sample folder containing TonY examples, for the following frameworks:

TensorFlow
PyTorch

The following folders are created:

TonY install folder (TONY_INSTALL_FOLDER) is located by default in:

TonY samples folder (TONY_SAMPLES_FOLDER) is located by default in:

The Tony samples folder will provide 2 examples to run distributed machine learning jobs using:

TensorFlow MNIST example
PyTorch MNIST example

Running a TensorFlow distributed job

Launch a TensorFlow training job

You will be launching the Dataproc job using a `gcloud` command.

The following folder structure was created during installation in `TONY_SAMPLES_FOLDER`, where you will find a sample Python script to run the distributed TensorFlow job.

This is a basic MNIST model, but it serves as a good example of using TonY with distributed TensorFlow. This MNIST example uses “data parallelism,” by which you use the same model in every device, using different training samples to train the model in each device. There are many ways to specify this structure in TensorFlow, but in this case, we use “between-graph replication” using tf.train.replica_device_setter.

Dependencies

TensorFlow version 1.9

Note: If you require a more recent TensorFlow and TensorBoard version, take a look at the progress of this issue to be able to upgrade to latest TensorFlow version.

Connect to Cloud Shell

Open Cloud Shell via the console UI:

https://storage.googleapis.com/gweb-cloudblog-publish/images/3._Open_Cloud_Shell.max-1200x1200.png

Use the following gcloud command to create a new job. Once launched, you can monitor the job. (See the section below on where to find the job monitoring dashboard in Cloud Console.)

Running a PyTorch distributed job

Launch your PyTorch training job

For PyTorch as well, you can launch your Cloud Dataproc job using gcloud command.

The following folder structure was created during installation in the TONY_SAMPLES_FOLDER, where you will find an available sample script to run the TensorFlow distributed job:

Dependencies

PyTorch version 0.4
Torch Vision 0.2.1

Launch a PyTorch training job

Verify your job is running successfully

You can track Job status from the Dataproc Jobs tab: navigate to Cloud Console > Big Data > Dataproc > Jobs.

https://storage.googleapis.com/gweb-cloudblog-publish/images/4._track_Job_status.max-1100x1100.png

Access your Hadoop UI

Logging via web to Cloud Dataproc’s master node via web: http://<Node_IP>:8088 and track Job status. Please take a look at this section to see how to access the Cloud Dataproc UI.

https://storage.googleapis.com/gweb-cloudblog-publish/images/5._Hadoop_UI.max-2000x2000.png

Cleanup resources

Delete your Cloud Dataproc cluster

Conclusion

Deploying TensorFlow on YARN enables you to train models straight from your data infrastructure that lives in HDFS and Cloud Storage. If you’d like to learn more about some of the related topics mentioned in this post, feel free to check out the following documentation links:

Acknowledgements: Anthony Hsu, LinkedIn Software Engineer; and Zhe Zhang, LinkedIn Core Big Data Infra team manager.

Posted in

Data Analytics

How to use gen AI for better data schema handling, data quality, and data generation

By Deb Lee • 9-minute read

Data Analytics

BigQuery ML is now compatible with open-source gen AI models

By Vaibhav Sethi • 3-minute read

Containers & Kubernetes

With MultiKueue, grab GPUs for your GKE cluster, wherever they may be

By Jean-Baptiste Leroy • 5-minute read