Google Cloud Platform

Data Science on the Google Cloud Platform: the first book

This upcoming O’Reilly Media book is the first one dedicated to doing data science on the modern public cloud

Valliappa (Lak) Lakshmanan, Technical Lead for Data & ML Professional Services at Google Cloud, is the author of the upcoming O’Reilly Media book "Data Science on the Google Cloud Platform" (now in Early Release). In the following Q&A, Lak describes his reasons for writing this book, its intended readers, what readers will learn and how to think about the practice of data science on Google Cloud Platform (GCP)-based architecture.


Why did you decide to write this book?

There are data science books and there are cloud books, but there are no books about how you do data science on a modern public cloud. In my current role at Google, I work with customers who are onboarding onto our cloud platform, so I get to see how data scientists and data engineers in a variety of industries approach the public cloud. Some try to do the same things, the same way, just on rented computing resources. The visionary users, though, rethink their systems, transform how they work with data, and thereby get to innovate faster.

What will readers learn from this book, exactly?

In this book, you get to look over my shoulder as I walk you through an example of this new, transformative way of doing data science. You'll learn how to implement an end-to-end pipeline: starting with data ingest, working through data exploration, dashboards, relational databases, and streaming data, and ending at real-time machine learning. Regardless of what your current work involves, you'll also realize how easy and accessible these tools are, and hopefully pick up some ancillary skills around data processing and data analytics.

Is the book intended for data scientists in any particular set of industries?

The target audience for this book is anyone who uses computers to work with data, whether they go by the title of data analyst, database administrator, data engineer, data scientist, or systems programmer today. I foresee that their roles will soon require creating data science models as well as implementing them at scale in production systems.

In other words, the book is intended for data scientists/engineers in any industry. Although the particular case-study example involves transportation, it's a familiar problem to most readers. The concepts should be transferable to any set of vertical use cases.

In what ways is the practice of data science changing with the availability of modern cloud platforms like GCP?

The current separation of responsibility between data analysts, database administrators, data scientists and systems programmers exists because each of these roles requires a lot of specialized knowledge in today’s corporate environments. The cloud changes that equation; the thinking behind Google’s data engineer certification, for example, is that a practicing data engineer can and should be able to do all these things.

Autoscaled, fully-managed services make it easier to implement data science models at scale — which is why data scientists will no longer need to hand off their models to data engineers. Instead, they can write a data science workload, submit it to the cloud and have that workload executed automatically in an autoscaled manner. At the same time, data science packages are becoming simpler and simpler. So, it has become extremely easy for an engineer to slurp in data and use a canned model to get an initial (and often very good) model up and running. With well-designed packages and easy-to-consume APIs, you don't need to know the esoteric details of data science algorithms — only what each algorithm does, and how to link algorithms together to solve realistic problems.
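To illustrate how little specialized knowledge a "canned" model now requires, here is a short sketch of my own (using scikit-learn on a toy dataset; this is not an example from the book):

```python
# Illustration of a "canned" model: a few lines of a well-designed package,
# no esoteric algorithm details required.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                # slurp in a toy dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A canned classifier: the library picks sensible defaults for you.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)           # often surprisingly good out of the box
```

The same pattern -- load data, fit a stock model, measure -- is usually enough to get that initial baseline up and running before any tuning.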

If you're an IT person whose job so far has been to manage processes, but you know some programming (particularly Python) and understand your business domain well, it's quite possible for you to start designing data processing pipelines and addressing business problems with just those programming skills.

In this book, I talk about all these aspects of data-based services because data engineers will be involved in all these aspects: designing the services, developing the statistical and machine-learning models and implementing them in large-scale production and in real-time.

What is it about GCP that you think readers will find different or unique?

GCP is designed to make you forget about infrastructure. Our marquee data services (Google BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Cloud ML) are all serverless and autoscaling. You submit a query to BigQuery, it gets executed on thousands of nodes, and you get your result back; you don't spin up a cluster.
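To make this concrete, here is a minimal sketch of submitting a BigQuery query from Python with the `google-cloud-bigquery` client library. The helper names and the sample table are my own illustration, not from the book:

```python
# Sketch only: helper names and the table reference are placeholders of mine.
def count_rows_sql(table):
    # Standard SQL string; BigQuery fans the query out across workers for you.
    return "SELECT COUNT(*) AS n FROM `{}`".format(table)

def run_query(sql):
    # Requires the google-cloud-bigquery package and application credentials.
    from google.cloud import bigquery
    client = bigquery.Client()        # no cluster to create or size
    return list(client.query(sql))    # blocks until the managed service returns rows
```

A call like `run_query(count_rows_sql("bigquery-public-data.samples.shakespeare"))` would return the row count without any infrastructure work on your part.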

Similarly, in Cloud Dataflow, you submit a data pipeline, and in Cloud Machine Learning you submit a machine-learning job, in either case without worrying about cluster management. Cloud Pub/Sub autoscales to the throughput and number of subscribers and publishers without any work on your part. Plus, data is encrypted both at rest and in transit, and kept secure. As a data scientist, not having to manage infrastructure is incredibly liberating.
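A Pub/Sub publish call is similarly hands-off. Here is a hedged sketch using the `google-cloud-pubsub` client; the project and topic identifiers are placeholders of mine:

```python
# Sketch only: project and topic IDs are placeholders, not from the book.
def encode_event(event):
    # Pub/Sub message payloads are bytes; JSON-encode the event dict first.
    import json
    return json.dumps(event).encode("utf-8")

def publish_event(project_id, topic_id, event):
    # Requires the google-cloud-pubsub package and application credentials.
    from google.cloud import pubsub_v1
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    # The service scales to your throughput; there are no brokers to manage.
    return publisher.publish(topic_path, encode_event(event)).result()
```

Notice that nothing in the code sizes or provisions anything; you publish, and the service absorbs whatever load arrives.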

Even when you're running open-source software like Apache Spark that's designed to operate on a cluster, GCP makes it easy. Leave your data on Google Cloud Storage, not in HDFS, and spin up a job-specific cluster to run the Spark job. After the job completes, you can safely delete the cluster. Because of this job-specific infrastructure, there's no need to fear overprovisioning hardware or running out of capacity to run a job when you need it.
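If you drive this from a script, the create/submit/delete cycle might be sketched like the following; the cluster name, region, and bucket path are placeholders of mine, and the flags follow the public `gcloud` CLI:

```python
# Sketch only: names and paths are placeholders; flags follow the gcloud CLI.
def ephemeral_spark_steps(cluster, region, pyspark_uri):
    """Build the three gcloud commands for a job-scoped Dataproc cluster:
    create it, run a PySpark job that reads from Cloud Storage (gs://...),
    then delete the cluster once the job completes."""
    return [
        ["gcloud", "dataproc", "clusters", "create", cluster,
         "--region={}".format(region), "--num-workers=2"],
        ["gcloud", "dataproc", "jobs", "submit", "pyspark", pyspark_uri,
         "--cluster={}".format(cluster), "--region={}".format(region)],
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         "--region={}".format(region), "--quiet"],
    ]

# To execute (requires the Cloud SDK and credentials):
#   for step in ephemeral_spark_steps("demo", "us-central1", "gs://my-bucket/job.py"):
#       subprocess.run(step, check=True)
```

Because the data stays on Cloud Storage rather than in cluster-local HDFS, deleting the cluster at the end loses nothing.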

What makes these advantages possible?

The reason you can afford to forget about virtual machines and clusters when running on GCP comes down to networking. The bisection bandwidth of the network within a GCP data center is on the order of a petabit per second, so sustained reads off Cloud Storage are extremely fast. This means you don't need to shard your data as you would with traditional MapReduce jobs. Instead, GCP can autoscale your compute jobs by shuffling the data onto new compute nodes as needed. Hence, you're liberated from cluster management when doing data science on GCP.

Did you personally learn anything about doing data science on GCP that you didn’t already know?

I’ll let you in on a secret: Everything I know about GCP, I learned while I was writing this book! When I took the job at Google (exactly one year ago), I had used the public cloud the way I described in my first answer: to spin up virtual machines and run the same software I had always run, in exactly the same way as I had run it before. Fortunately, I realized that Google’s big data stack was different, and so I set out to learn how to use all the data and machine-learning tools on GCP properly.

The way I learn best is to write code, and so that’s what I did. When the PyData Singapore meetup recently asked me to do a talk about GCP (video of the talk), I described the code that I had written. In the process, I discovered that walking through the code and contrasting the different approaches to the problem was quite educational for the attendees. So, I wrote up the essence of my talk and sent a book proposal to O’Reilly Media.

The reason this single use case turned out to be so educational was that I kept asking my friendly colleagues why things were designed the way they were, and why there were three different ways to do something. By way of response, I would be told a story about a situation that had cropped up, and how this particular way of doing things was the way to address that type of situation. The thing about Google’s big data stack is that it came about as a result of realistic needs of the product teams; BigQuery may be relatively new to the outside world, but Dremel has been used for more than a decade within Google. I cannot repeat the underlying product stories, but in the book I do summarize the lessons learned and the situations in which one aspect or another of a product becomes useful. You may not get the colorful stories, but you do get the rationales.

Ever since I joined Google, a week hasn’t gone by where I haven’t reflected on some system I have built, or some algorithm I’ve developed, and wished I could go back and do it again. This time, of course, I would do it faster and better. With this book, I hope to help the broader community do data science better, as well.

Read my book, go forth, and build cool things!

NEXT steps

At Google Cloud NEXT '17 in San Francisco (March 8-10), you’ll be able to dive into many facets of doing data science on GCP in person. Here are a few sessions led by Lak that you might consider attending: