Building a data lake on Google Cloud Platform

Store, process, and analyze massive volumes of data in a cost-efficient and agile way.

Cloud Data Lake Overview

A place to capture and use all your data

Land your data in Google Cloud Platform in its raw state — structured or unstructured — and store it separately from compute resources to move away from costly on-premises storage models. Eliminate the headaches of preprocessing data and constantly redesigning schemas to handle new data types. Take advantage of Google Cloud Platform’s cutting-edge processing, analysis, and machine-learning services to enable impactful use cases inside your company. Leverage the same secure-by-design infrastructure that Google uses to protect identities, applications, and devices.

From ingest to insight

[Diagram: Data in a GCP data lake]

Getting data into your GCP data lake

From batch to streaming, Google Cloud Platform makes it easy to move your data from wherever it lives into the cloud. Whether you are migrating data across your network, using an offline transfer appliance, or capturing real-time streams, GCP’s products and services scale to meet your needs without complexity.
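Whether to migrate over the network or use an offline transfer appliance often comes down to how long the copy would take on your link. A back-of-the-envelope sketch (the one-week threshold, 50% utilization assumption, and helper names are illustrative choices, not a GCP API):

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 0.5) -> float:
    """Estimate days to copy `data_tb` terabytes over a `link_mbps` megabit/sec
    link, assuming only `utilization` of the bandwidth is actually usable."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400

def suggest_ingest_path(data_tb: float, link_mbps: float) -> str:
    """Illustrative rule of thumb: if the copy would take more than about a
    week, consider an offline transfer appliance instead of the network."""
    if transfer_days(data_tb, link_mbps) > 7:
        return "offline transfer appliance"
    return "network transfer"

print(suggest_ingest_path(100, 1000))  # 100 TB over 1 Gbps -> "offline transfer appliance"
print(suggest_ingest_path(1, 1000))    # 1 TB over 1 Gbps -> "network transfer"
```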


Storing data at petabyte scale

Use Cloud Storage as the central hub for your data lake to benefit from its strong consistency, its high durability (designed for 99.999999999% annual durability), and its decoupling of storage from compute, freeing you from the compute-bound models of traditional on-premises storage. Cloud Storage’s multiple storage classes also let you optimize for both cost and availability, so you can build petabyte-scale data lakes that remain cost efficient. Most importantly, data stored in Cloud Storage is easily accessible to a wide array of other Google Cloud Platform products, making it the ideal heart of a data lake for every kind of data asset and use case.
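One common way to exploit those storage classes is an object lifecycle policy that demotes aging raw data to colder, cheaper classes automatically. A minimal sketch of the lifecycle document that `gsutil lifecycle set` accepts (the 30- and 365-day thresholds are arbitrary example values):

```python
import json

def lifecycle_policy(nearline_after_days: int, coldline_after_days: int) -> dict:
    """Build a Cloud Storage lifecycle document that moves objects to
    Nearline, then Coldline, as they age."""
    return {
        "rule": [
            {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
             "condition": {"age": nearline_after_days}},
            {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
             "condition": {"age": coldline_after_days}},
        ]
    }

# Write this to lifecycle.json, then apply with:
#   gsutil lifecycle set lifecycle.json gs://my-bucket
print(json.dumps(lifecycle_policy(30, 365), indent=2))
```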


Process data how you want

With your data lake built on Cloud Storage, you can process data in whatever way makes sense for your company. Take advantage of your organization’s existing Hadoop experience with Cloud Dataproc, GCP’s fully managed Hadoop and Spark service: spin up clusters on demand and pay only for the time your jobs take to run. Alternatively, explore Cloud Dataflow, GCP’s fully managed Apache Beam service, to handle both streaming and batch workloads in a serverless data-processing experience that removes provisioning and management complexity.
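The "clusters on demand" model typically means describing a small, short-lived cluster per job and deleting it once the job finishes. A sketch of assembling such a cluster description in Python (field names follow the shape of the Dataproc v1 REST API's cluster resource; the name, worker count, and machine type are placeholders):

```python
def ephemeral_cluster_body(name: str, workers: int,
                           machine_type: str = "n1-standard-4") -> dict:
    """Assemble a Dataproc cluster description (v1 REST-style field names)
    for a short-lived, per-job cluster: one master, `workers` workers."""
    return {
        "clusterName": name,
        "config": {
            "masterConfig": {"numInstances": 1, "machineTypeUri": machine_type},
            "workerConfig": {"numInstances": workers, "machineTypeUri": machine_type},
        },
    }

# A per-job cluster sized for a nightly ETL run (illustrative values).
body = ephemeral_cluster_body("nightly-etl", workers=4)
print(body["clusterName"], body["config"]["workerConfig"]["numInstances"])
```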


Serverless data warehouse for analytics on top of your data lake

Use BigQuery, GCP’s serverless petabyte-scale data warehouse, to perform analytics on structured data living in your data lake. Benefit from blazing query speeds against massive data volumes to support enterprise reporting and business intelligence needs. Enjoy built-in machine-learning capabilities that can be accessed using familiar SQL and help support a data-driven culture inside your company.
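BigQuery’s built-in machine learning (BigQuery ML) is indeed driven from SQL: a `CREATE MODEL` statement trains a model directly over a table. The helper below merely assembles such a statement; the dataset, table, and column names are hypothetical examples:

```python
def bqml_create_model_sql(model: str, source_table: str, label_column: str) -> str:
    """Build a BigQuery ML CREATE MODEL statement for a logistic-regression
    classifier trained on `source_table` (all names are placeholders)."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='logistic_reg', input_label_cols=['{label_column}'])\n"
        f"AS SELECT * FROM `{source_table}`"
    )

# Example: predict customer churn from a hypothetical customers table.
print(bqml_create_model_sql("mydataset.churn_model",
                            "mydataset.customers",
                            "churned"))
```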


Advanced analytics using machine learning

Leverage your data lake in GCP to carry out data science experiments and build machine-learning models from the data assets stored in Cloud Storage. Use native integrations with Google’s cutting-edge Cloud AI products to do everything from deriving insights from images and video to customizing, deploying, and scaling your own tailored ML models with Cloud Machine Learning Engine.

Map on-premises Hadoop data lake workloads to GCP products

Building a cloud data lake on GCP: if you are running an Apache Hadoop cluster on-premises, map each workload as follows.

- I’m processing streaming data
  - We use Apache Beam → Cloud Dataflow
  - We use Apache Spark or Kafka → Cloud Dataproc
- I’m doing interactive data analysis or ad-hoc querying
  - We use Apache Spark with interactive web notebooks → Cloud Dataproc with the optional Jupyter or Zeppelin components
  - We use SQL with Apache Hive, Apache Drill, Impala, Presto, or similar
    - Interested in keeping these SQL queries as they are? Yes → Cloud Dataproc
    - No, I’m interested in learning more about a serverless solution → BigQuery
- I’m doing ELT/ETL or batch processing
  - We use MapReduce, Spark, Pig, or Hive → Cloud Dataproc
  - We use Oozie for workflow orchestration
    - Interested in keeping these workflow jobs as they are? Yes → Cloud Dataproc
    - No, I’m interested in learning more about a managed solution → Cloud Composer
- I’m supporting NoSQL workloads
  - We use Apache Accumulo → Cloud Dataproc
  - We use Apache HBase
    - Need to use coprocessors or SQL with Apache Phoenix? Yes → Cloud Dataproc
    - No → Cloud Bigtable
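The decision tree above can be encoded as a small lookup table — a sketch in which the workload and technology key strings are our own shorthand, not GCP terminology:

```python
# Maps (workload, current technology) to the GCP product(s) suggested by the
# decision tree for on-premises Hadoop migrations.
HADOOP_TO_GCP = {
    ("streaming", "apache beam"): "Cloud Dataflow",
    ("streaming", "spark or kafka"): "Cloud Dataproc",
    ("interactive analysis", "spark with notebooks"):
        "Cloud Dataproc with Jupyter/Zeppelin optional components",
    ("interactive analysis", "sql (hive/drill/impala/presto)"):
        "Cloud Dataproc (keep queries as-is) or BigQuery (serverless)",
    ("batch etl", "mapreduce/spark/pig/hive"): "Cloud Dataproc",
    ("batch etl", "oozie"):
        "Cloud Dataproc (keep jobs as-is) or Cloud Composer (managed)",
    ("nosql", "accumulo"): "Cloud Dataproc",
    ("nosql", "hbase"):
        "Cloud Dataproc (coprocessors/Phoenix) or Cloud Bigtable",
}

print(HADOOP_TO_GCP[("streaming", "apache beam")])  # -> Cloud Dataflow
```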

Resources