Developing a TensorFlow Training Application

Cloud Machine Learning Engine can run an existing TensorFlow training application with little or no alteration. Developing a training application is a complex process that is largely outside the scope of this documentation. You can start learning by working through TensorFlow's getting started guide.

When you understand the TensorFlow fundamentals, the best way to learn what goes into a training application is to study a good example.

Read through the samples provided with this documentation. These TensorFlow samples have been developed specifically to work well with Cloud ML Engine.

To start with, study the set of classification samples that use a United States Census dataset to predict income level. The sample applications in the set have nearly identical functionality, but they use different TensorFlow APIs. You can learn a lot by reading the samples and comparing how each one accomplishes the same task.

Pay particular attention to the following points as you read the samples:

  • How they use command-line arguments to get important information that may change from one training job to another. You specify the arguments when you start a training job, and Cloud ML Engine passes them to each replica of your training application running in the cloud. Command-line arguments are the primary mechanism for communicating with your application at run time.

  • How they use the TF_CONFIG environment variable to set up a distributed processing cluster. Cloud ML Engine uses TF_CONFIG to communicate job information to the individual replicas of your training application that run on the allocated training instances. See the guide to getting details from TF_CONFIG.

  • How they manage distributed processing to account for the different task types (master, parameter server, and worker) in one application.

  • How they accommodate different stages in the training process (notably training, evaluation, and export) by using checkpoints and variations of the computation graph.

  • How they define input and output data, both for training the model and for exporting it.
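
The first three points can be illustrated with a minimal sketch that uses only the Python standard library. It is not taken from the samples. The flag names (--train-files, --job-dir, --train-steps) and the helper names parse_args and read_tf_config are hypothetical, chosen for illustration. The sketch also assumes that TF_CONFIG holds a JSON object with a "cluster" entry (task type mapped to a list of addresses) and a "task" entry (this replica's type and index), and that a replica with no TF_CONFIG set can be treated as a single master.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Parse settings that change from job to job.

    The flag names are hypothetical; use whatever your application needs.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-files", nargs="+", required=True)
    parser.add_argument("--job-dir", required=True)
    parser.add_argument("--train-steps", type=int, default=1000)
    return parser.parse_args(argv)


def read_tf_config(environ=None):
    """Return (cluster, task_type, task_index) read from TF_CONFIG.

    Assumption: when TF_CONFIG is unset or empty (for example, a local
    run), the process is treated as a single "master" with index 0.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type", "master"), int(task.get("index", 0))
```

A replica might call parse_args() and read_tf_config() once at startup, then branch on the returned task type: start a server for a parameter server, run the training loop for a worker, and additionally handle evaluation and export on the master. How the samples themselves split this work depends on the TensorFlow API each one uses.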
