Google Cloud Big Data and Machine Learning Blog

Innovation in data processing and machine learning technology

Google Cloud provides a unified, streamlined way to execute your ML strategy

Thursday, November 30, 2017

By Lak Lakshmanan and Saptarshi Mukherjee, Google Cloud

In a recent article in the Harvard Business Review, Quentin Hardy looked at companies with successful machine learning strategies, and found that they do three things: (1) they derive value from unique data, data that they have and that no one else has, (2) they find adjacent areas that can be affected by data by looking at their business in a systematic way, and (3) they package up the resulting machine learning models to drive personalization for customers.

We built Google Cloud to help businesses achieve all three of these outcomes, and Google BigQuery is a fundamental part of helping businesses excel at this ML strategy. This approach comes from our own experience transitioning to become an AI-first company that infuses machine learning into nearly every product. To find and collect unique data, you have to be capable of storing large and varied data, and of querying it interactively. BigQuery has the scale and speed to analyze petabytes of data in seconds, and to do so for ad hoc queries, that is, queries you didn’t prepare your database system to anticipate. To find adjacent areas that are affected, it is important that the data not be siloed within your organization. BigQuery is a global service, so, subject to access controls that you set, anyone in your organization can query the datasets and combine the results with a wide variety of public or commercial datasets, ranging from transportation to weather.

Indeed, this ability to scale and to break down silos is why 80% of Googlers use BigQuery every month to access some form of data, and it’s not just our engineers, but almost every other role at Google! Beyond Google, companies renowned for their ML expertise, as well as those making the transition to deriving value from machine learning, credit BigQuery for spreading the use of data across their organizations.

“In an environment where speed is key, this technology allows us to stay ahead of everyone else. With BigQuery and Bigtable we can work with unimaginably large quantities of big data that we use to inform our predictive buying through the bidder. It’s a conjunction of Google technologies that lets us create a better version of our business.”

—Larraine Criss, Chief Product Officer, MainAd

However, having unique data, and making it available across your organization, is not enough to execute an ML strategy successfully. It is enough for data analytics, but for machine learning, you need one more thing: the ability to train models on this data and to deploy these models into production. This is where BigQuery’s integration with the rest of Google Cloud, and the push towards democratization of these services, really starts to shine.

The reference architecture for machine learning on Google Cloud involves taking both streaming and batch data, transforming it, and storing it in BigQuery. Because BigQuery offers a connector for TensorFlow, you can train your TensorFlow ML model directly against data stored in BigQuery. In real time, as new data comes in, the same set of transformations is carried out and the transformed data is used to make predictions.

A key aspect of this design is Cloud Dataflow, which is able to apply the same set of transformations to both historical data (for training) and streaming data (for predictions). Since Cloud Dataflow can stream into BigQuery, up-to-date data is available to applications and reports for agile decision making.
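The idea of sharing one set of transformations between training and prediction can be sketched in plain Python. This is a minimal illustration and not Dataflow code: the function `preprocess`, the field names, and the sample records are all hypothetical, and a real pipeline would express the same function as a step in a Dataflow pipeline.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, float]


def preprocess(record: Record) -> Record:
    """Hypothetical transformation shared by the training and prediction paths."""
    return {
        # Clamp gestation to a plausible range so outliers do not skew the model.
        "gestation_weeks": min(max(record["gestation_weeks"], 20.0), 45.0),
        # Rescale the mother's age to roughly the 0..1 range.
        "mother_age_scaled": record["mother_age"] / 100.0,
    }


def transform_batch(records: List[Record]) -> List[Record]:
    """Batch path: transform a bounded set of historical records for training."""
    return [preprocess(r) for r in records]


def transform_stream(records: Iterable[Record]) -> Iterator[Record]:
    """Streaming path: transform records one at a time as they arrive."""
    for r in records:
        yield preprocess(r)


# Hypothetical sample data standing in for historical records.
historical = [
    {"gestation_weeks": 39.0, "mother_age": 28.0},
    {"gestation_weeks": 99.0, "mother_age": 35.0},  # outlier, gets clamped
]

training_rows = transform_batch(historical)
# Simulate the same records arriving later as a stream.
serving_rows = list(transform_stream(iter(historical)))

# Both paths produce identical features because they share preprocess().
print(training_rows == serving_rows)
```

Keeping the transformation in one function is what prevents the training and serving paths from drifting apart; only the way records are fed in differs between the two.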

The infrastructure is important, but so is the tooling. How easy and convenient is it to develop the above architecture for your machine learning needs? The entire workflow above can be developed and deployed from Cloud Datalab, a hosted Jupyter notebook. Try out this example notebook to explore and preprocess the natality data, and then train and deploy a machine learning model and use it to make predictions.
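As a rough idea of what the exploration step in a notebook might start from, the sketch below builds a query over BigQuery's public natality sample table. It assumes the `google-cloud-bigquery` client library and the table `bigquery-public-data.samples.natality`; the helper names `build_natality_query` and `run_query`, the chosen columns, and the filter values are illustrative assumptions, not taken from the example notebook.

```python
def build_natality_query(min_year: int, limit: int) -> str:
    """Build a standard SQL query over the public natality sample table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks\n"
        "FROM `bigquery-public-data.samples.natality`\n"
        f"WHERE year > {int(min_year)} AND weight_pounds IS NOT NULL\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the query with the google-cloud-bigquery client.

    Requires the library to be installed and Google Cloud credentials to be
    configured, so it is defined here but not called.
    """
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    return client.query(sql).result()


# Hypothetical filter values chosen only for illustration.
sql = build_natality_query(min_year=2000, limit=1000)
print(sql)
```

With credentials in place, `run_query(sql)` would return an iterator of rows that can then be inspected or plotted in the notebook.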

We continue to work on making this workflow easier to implement. The majority of the time and effort in implementing an AI strategy is spent in data exploration, cleanup, and transformation. In the reference architecture, this is carried out using Datalab and Dataflow. While Python notebooks like Cloud Datalab are familiar to data scientists, they are typically not the tool of choice for data analysts and data engineers at many companies. On the other hand, Cloud Dataflow requires knowledge of either Java or Python programming. To democratize data exploration and preparation, therefore, Google Cloud offers Cloud Dataprep, a visual, web-based interface to explore your data and author Dataflow pipelines that can then be deployed on both batch and streaming data.

“Cloud Dataprep allows us to quickly view and understand new datasets, and its flexibility supports our data transformation needs. The GUI is nicely designed, so the learning curve is minimal. Our initial data preparation work is now completed in minutes, not hours or days.”

—Henry Culver, IT Architect at Merkle Inc

For example, Cloud Dataprep provides the ability to explore the natality data directly from BigQuery:

Having explored the data, you can set up a set of preprocessing actions in Cloud Dataprep:

Finally, you can launch the preprocessing transformations at scale on Dataflow, all without writing any code!

Then, the training job can be launched on Cloud ML Engine and monitored from the Datalab notebook itself:

In summary, Google Cloud offers you an integrated platform on which to execute an AI strategy to collect unique data, break down silos within your organization, and carry out machine learning end-to-end.

Here are some next steps you might take to learn how to create ML models starting from BigQuery:

  1. Try out these hands-on labs to become familiar with the reference architecture for machine learning on Google Cloud Platform:
    • Analyzing Natality data using Datalab and BigQuery: try a Codelab or a Qwiklab.
    • Predicting baby weight with TensorFlow on Cloud ML Engine: try a Codelab or a Qwiklab.
  2. To learn how to design, train, and deploy machine learning models, take the Coursera course Serverless Machine Learning with TensorFlow. This course leads you step-by-step from exploring a dataset in BigQuery and building an accurate machine learning model with TensorFlow to deploying it as a RESTful web service.

In the coming months, we’ll be launching a 10-course specialization on Coursera titled “Machine Learning on Google Cloud Platform,” which will offer a deep dive into machine learning. The first course in this series, “How Google Does ML,” provides an executive-level overview of ML and ML strategy, and expands upon the points in this blog post. Watch this space for the launch announcement in January.

Join us for the “Cloud OnAir: The journey from big data to AI” online event to learn more about Google Cloud data analytics and AI solutions.
