TensorFlow Trainer Development Considerations

Cloud Machine Learning Engine can run an existing TensorFlow training application with little or no alteration. Developing a trainer is a complex process that is largely outside the scope of this documentation. If you are new to TensorFlow, start by working through its getting started guide.

Once you understand some TensorFlow fundamentals, the best way to learn about what goes into a training application is to study a good example. The samples page of this documentation includes information about TensorFlow samples that have been developed specifically to work well with Cloud ML Engine. We suggest that you use the two samples that use U.S. census data to predict income level as primary sources for TensorFlow best practices when working with Cloud ML Engine. The two trainers have nearly identical functionality, but one uses the lower-level APIs of TensorFlow core and the other uses the higher-level estimator-based APIs. You can learn a lot by reading these two samples, and by comparing the ways that they accomplish the same thing.

Here are some important things to look for as you read the samples:

  • How they work with command-line arguments to get important information that may change from one training job to another. Cloud ML Engine passes arguments that you specify when you start a training job to each replica of your trainer that it runs in the cloud. Command-line arguments are the primary mechanism for communicating with your trainer at the time of execution.

  • How they use the TF_CONFIG environment variable to set up a distributed processing cluster. This is the method by which Cloud ML Engine communicates job information to the individual replicas of your trainer that run on the allocated training instances.

  • How they manage distributed processing to account for the different task types (master, parameter server, and worker) in one application.

  • How they accommodate different stages in the training process (notably training, evaluation, and export) by using checkpoints and variations of the computation graph.

  • How they define input and output data, both for training the model and then for exporting it.
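
The first three points can be illustrated with a minimal sketch that uses only the Python standard library. It is not taken from the census samples. The `--train-steps` flag, the default values, and the helper names `parse_args`, `read_tf_config`, and `describe_role` are hypothetical, and the role descriptions reflect a common convention rather than a requirement of Cloud ML Engine. The sketch assumes that TF_CONFIG holds a JSON object with a `cluster` entry (task type mapped to a list of addresses) and a `task` entry (the `type` and `index` of the current replica).

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Parse trainer arguments. The flag names and defaults are illustrative."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", default="/tmp/trainer-output",
                        help="Location for checkpoints and exported models.")
    parser.add_argument("--train-steps", type=int, default=1000,
                        help="Hypothetical flag: number of training steps.")
    # Ignore flags this sketch does not know about instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def read_tf_config(environ=None):
    """Return (cluster, task_type, task_index) read from TF_CONFIG.

    When the variable is unset or empty (for example in a local run), this
    sketch assumes a single replica of type "master" with index 0.
    """
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = config.get("cluster") or {}
    task = config.get("task") or {}
    return cluster, task.get("type", "master"), int(task.get("index", 0))


def describe_role(task_type):
    """Map a task type to the work one application might do for it.

    The division of labor below is an assumed convention for illustration.
    """
    roles = {
        "master": "train, and coordinate checkpoints, evaluation, and export",
        "worker": "train",
        "ps": "host model variables for the other replicas",
    }
    if task_type not in roles:
        raise ValueError("unexpected task type: {!r}".format(task_type))
    return roles[task_type]
```

In a trainer's entry point you might call `parse_args()` and `read_tf_config()` once at startup, then branch on the returned task type so that the same application serves as master, parameter server, or worker.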
