AI & Machine Learning
Jeff Dean on machine learning, part 2: TensorFlow
Jeff Dean talks about TensorFlow, the machine learning toolkit open sourced by Google.
TensorFlow is the machine-learning library open sourced by Google in November 2015. It gained over 11,000 stars on GitHub in its first week after launch, and has built up quite a community since then: at the time of this writing, TensorFlow has over 45,000 stars, 13,000 commits and 21,000 forks. This is the second installment in our interview series with Jeff Dean, Google Senior Fellow and lead of the Google Brain research team. In our first installment, we talked about the landscape of machine learning: its past, present and future. In this installment, we’ll cover TensorFlow: why we built it originally, how to use it, and what its future may hold. (And keep in mind that TensorFlow will be a topic at Google Cloud NEXT ‘17 next month; see the last section of this post for details.)
The origins of TensorFlow inside Google
Why did Google build TensorFlow?
JD: One of the things we did as part of our early work in machine learning at Google is build a first-generation system for solving machine learning problems called DistBelief, which we didn’t open source. That system was really good for scalable machine learning: we could train large models on large datasets. And it was really good for production deployment of those models, so once we trained a model, we could actually use it in our products very easily.
But what it didn’t do very well was give us a lot of flexibility in the kinds of machine-learning algorithms and techniques we applied to different problems. If what you wanted to do fit well in the framework it provided, that was great, but if it was a little bit off, it was a bit of a mismatch.
So one of the reasons we built TensorFlow, our next-generation system, the system that we’ve actually open sourced for machine learning, is that we wanted to keep the scalable attributes and production readiness of our first system, but make it a much more flexible platform for doing all kinds of machine-learning research and product development.
One of the things we saw as we developed TensorFlow was that, because it was more flexible, we could use it on more and more different kinds of machine-learning problems, and it very quickly spread throughout Google, even more rapidly than our first system. Then, once we open sourced TensorFlow, we saw the same spread of its use across many different organizations, domains and application areas in the external community that we had seen internally.
P.S.: The DistBelief research and applications paved the way for more scalable, distributed deep learning, demonstrating how to train deep networks with billions of parameters using tens of thousands of CPU cores. It addressed model parallelism both within a single machine (via multithreading) and across machines (via message passing), as well as data parallelism, where multiple replicas of the model are trained in parallel and their updates combined. This allowed researchers to work with models far larger than could fit in a single GPU's memory. In the paper, the largest model had 1.7 billion parameters and, running on 81 machines, trained with a 12x speedup. This enabled an impressive 60% relative improvement over the previous state of the art in image recognition accuracy.
Where DistBelief fell short was in accommodating other machine-learning models and methods, and in smaller-scale use cases such as mobile. Maintaining separate systems for large- and small-scale deployments led to increased maintenance burdens and leaky abstractions. TensorFlow was born from the need for a more flexible programming model and the ability to target a wider variety of heterogeneous hardware platforms. It was also designed to be suitable both for experimentation and prototyping and for high-performance production training and inference. You can find out more about the origins of TensorFlow in this whitepaper.
Optimizing code for TensorFlow models
What’s the future of TensorFlow?
JD: One of the things we’re really excited about for TensorFlow is that we’ll soon be releasing a compiler for TensorFlow that we call XLA. We have some documentation of it in the GitHub repository now, and in the next couple of months we’ll be releasing an open source version that can generate optimized code for TensorFlow models.
Essentially, knowing the sizes and the exact computation that is being done, we can generate an optimized piece of code that is tailored to do exactly that computation, and it should be much faster and also have much smaller code size for mobile applications.
P.S.: Today, most developers interact with TensorFlow through its Python interface, which constructs a computational graph that the underlying C++ runtime then executes, allowing for fast execution and improved portability across hardware platforms. XLA, which stands for "Accelerated Linear Algebra," seeks to further improve performance through a number of optimizations to that graph.
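To make the graph model concrete, here is a minimal sketch in TensorFlow's Python API. (Note: this uses the `tf.function` tracing API from later TensorFlow 2.x releases purely as an illustration; at the time of this article, graphs were built and run explicitly via `tf.Session`.)

```python
import tensorflow as tf

# A Python function traced into a graph: the first call records the
# ops into a tf.Graph, which the C++ runtime then executes.
@tf.function
def affine(x, w, b):
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([0.5])
y = affine(x, w, b)
print(float(y))  # 11.5
```

Because the computation is captured as a graph rather than executed line by line in Python, the same graph can be optimized, serialized, and run on CPUs, GPUs, or other hardware.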
XLA rewrites the graph to improve execution speed, decrease the memory used by intermediate buffers, reduce mobile binary footprint by several orders of magnitude, and improve portability to other hardware platforms. It is especially useful for developers targeting hardware-accelerated platforms. (At the time of this writing, XLA is still experimental, but look for it to become an official part of TensorFlow soon.)
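As a sketch of what opting into XLA looks like, here is a small example using the `jit_compile` flag from later TensorFlow 2.x releases (the experimental interface available at the time of this article has since changed):

```python
import tensorflow as tf

# jit_compile=True asks XLA to compile the function into fused,
# shape-specialized code instead of dispatching ops one at a time.
@tf.function(jit_compile=True)
def dense(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([4, 8])
w = tf.random.normal([8, 16])
b = tf.zeros([16])
print(dense(x, w, b).shape)  # (4, 16)
```

Because XLA knows the exact shapes and operations involved, it can fuse the matrix multiply, bias add and ReLU into a single compiled kernel, which is the kind of specialization Jeff describes above.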
TensorFlow in action
What are the ways that TensorFlow can be used?
JD: One of the really nice things about TensorFlow and the Google Cloud Machine Learning service is that it’s very flexible. You can decide to train your model in our data center using our managed Cloud Machine Learning service, or you could train the model on machines with a bunch of GPU cards under your desk using the TensorFlow open source release. And then once you’ve trained the model, you can take that model and deploy it in a lot of different settings.
You can deploy it on virtual machines in our cloud environment. You can deploy it in our managed service. You can deploy it in mobile applications, by extracting the trained model and baking it into your mobile app. It’s a very flexible thing, and we don’t lock you into using our cloud environment for all purposes. You can use it in whatever way makes sense, and use the TensorFlow open source release to run it in other environments that are not on our cloud machine learning platform.
P.S.: Training locally allows for faster iteration when you're first starting out, but when you want to train on large datasets (e.g., 500,000 hours of video), it becomes increasingly important to have a scalable solution. Fortunately, TensorFlow works well both for prototyping on your local machine and for scaling up to production training loads. So whether you choose to deploy your own system using virtual machines in the cloud or use a managed service such as Cloud Machine Learning, you don't need to rewrite your entire system just to productionize it. Once your TensorFlow model is trained, it can run on many platforms, from Linux, Mac and Windows to Android, iOS, and even Raspberry Pi!
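The "train once, deploy anywhere" workflow above rests on exporting the trained model in a portable format. Here is a minimal sketch using the SavedModel format (the API shown is from later TensorFlow 2.x releases, and the trivial `Doubler` module stands in for a real trained model):

```python
import tempfile
import tensorflow as tf

# A stand-in for a trained model: a module with one exported function.
class Doubler(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def double(self, x):
        return 2.0 * x

# Export once; the same SavedModel can then be served in the cloud,
# loaded on a server, or converted for mobile deployment.
export_dir = tempfile.mkdtemp()
tf.saved_model.save(Doubler(), export_dir)

restored = tf.saved_model.load(export_dir)
print(restored.double(tf.constant([1.0, 2.0])).numpy())  # [2. 4.]
```

The exported directory contains the graph and weights with no dependency on the original Python class, which is what makes the model portable across serving environments.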
Ready to go deeper? Check out the original TensorFlow whitepaper. Feeling ambitious? Make a contribution to TensorFlow! And be sure to catch the TensorFlow Dev Summit on February 15th, live streamed worldwide!