Hadoop on Google Cloud Platform

What is Hadoop on Google Cloud Platform?

Apache Hadoop is an open-source implementation of MapReduce, the computing paradigm that has come to define big data processing since the original research paper was released in 2004. Hadoop on Google Cloud Platform is a set of setup scripts and software libraries optimized for Google infrastructure, bringing ease of use and high performance to data processing with Hadoop.

With the Google Cloud Storage connector for Hadoop, you can perform MapReduce jobs directly on data in Google Cloud Storage, without copying to local disk and running Hadoop Distributed File System (HDFS). The connector simplifies Hadoop deployment, reduces cost, and provides performance comparable to HDFS, all while increasing reliability by eliminating the single point of failure of the name node.

Hadoop on Google Cloud Platform also provides connectors that enable you to access data stored in BigQuery and Datastore, as well as Google Cloud Storage.


Try a Hello World Example


A few Hadoop fundamentals


A Hadoop job is a batch process that runs on a Hadoop cluster. A job might transform data, run MapReduce, or perform parallel computations.


MapReduce is a distributed parallel computing paradigm. A MapReduce consists of a map step, performed on subsets of the input data; and a reduce step, that combines the output. MapReduce can run on structured or unstructured data, and have become one of the most popular ways to analyze massive data sets in parallel. For more information, see Running a MapReduce Job.