This article explores how to build a data lake on Google Cloud Platform (GCP). A data lake offers organizations like yours the flexibility to capture every aspect of your business operations in data form. Over time, this data can accumulate into the petabytes or even exabytes, but with the separation of storage and compute, it's now more economical than ever to store all of this data.
After capturing and storing the data, you can apply a variety of processing techniques to extract insights from it. Data warehousing has been the standard approach to doing business analytics. However, this approach requires fairly rigid schemas for well-understood types of data, such as orders, order details, and inventory. Analytics that are built solely on traditional data warehousing make it challenging to deal with data that doesn't conform to a well-defined schema, because that data is often discarded and lost forever.
Moving from data warehousing to the "store everything" approach of a data lake is useful only if it's still possible to extract insights from all of the data. Data scientists, engineers, and analysts often want to use the analytics tools of their choice to process and analyze data in the lake. In addition, the lake must support the ingestion of vast amounts of data from multiple data sources.
With these considerations in mind, here's how you can build a data lake on GCP. The following diagram depicts the key stages in a data lake solution.
This article explores the stages in more detail and discusses how GCP can help.
Store: Cloud Storage as the data lake
Cloud Storage is well suited to serve as the central storage repository for many reasons.
Performance and durability: With Cloud Storage, you can start with a few small files and grow your data lake to exabytes in size. Cloud Storage supports high-volume ingestion of new data and high-volume consumption of stored data in combination with other services such as Cloud Pub/Sub. While performance is critical for a data lake, durability is even more important, and Cloud Storage is designed for 99.999999999% annual durability.
Strong consistency: One key characteristic that sets Cloud Storage apart from many other object stores is its support for strong consistency in scenarios such as read-after-write operations, listing buckets and objects, and granting access to resources. Without this consistency, you must implement complex, time-consuming workarounds to determine when data is available for processing.
Cost efficiency: Cloud Storage provides a number of storage classes at multiple prices to suit different access patterns and availability needs, and to offer the flexibility to balance cost and frequency of data access. Without sacrificing performance, you can access data from these various storage classes by using a consistent API. For instance, you can archive infrequently used data to Cloud Storage Nearline or Cloud Storage Coldline using a lifecycle policy, and then access it later, maybe to gather training data for machine learning, with subsecond latency.
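To make the lifecycle idea concrete, here is a minimal sketch of the JSON lifecycle configuration that `gsutil lifecycle set` accepts. The age thresholds (30 and 90 days) and bucket usage are illustrative assumptions, not recommendations.

```python
import json

# Illustrative lifecycle policy: move objects to colder storage classes
# as they age. Apply with: gsutil lifecycle set policy.json gs://BUCKET
lifecycle_policy = {
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30},   # days since object creation
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": 90},
            },
        ]
    }
}

with open("policy.json", "w") as f:
    json.dump(lifecycle_policy, f, indent=2)
```

Because objects in Nearline and Coldline remain accessible through the same API, downstream consumers such as machine learning training jobs don't need to change when data transitions between classes.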
Flexible processing: Cloud Storage provides native integration with a number of powerful GCP services, such as BigQuery, Cloud Dataproc (Hadoop ecosystem), Cloud Dataflow for serverless analytics, Cloud Video Intelligence and Cloud Vision, and Cloud Machine Learning Engine, giving you the flexibility to choose the right tool to analyze your data.
Central repository: By offering a central location for storing and accessing data across teams and departments, Cloud Storage helps you avoid data silos that have to be kept in sync.
Security: Because data lakes are designed to store all types of data, enterprises expect strong access control capabilities to help ensure that their data doesn't fall into the wrong hands. Cloud Storage offers a number of mechanisms to implement fine-grained access control over your data assets.
Ingest: transferring data to the lake
A data lake architecture must be able to ingest varying volumes of data from different sources such as Internet of Things (IoT) sensors, clickstream activity on websites, online transaction processing (OLTP) data, and on-premises data, to name just a few. In this section, you learn how GCP can support a wide variety of ingestion use cases.
Cloud Pub/Sub and Cloud Dataflow: You can ingest real-time data with Cloud Pub/Sub and use Cloud Dataflow to stream it directly into Cloud Storage, with both services scaling in and out in response to data volume.
Storage Transfer Service: Moving large amounts of data is seldom as straightforward as issuing a single command. You have to deal with issues such as scheduling periodic data transfers, synchronizing files between source and sink, or moving files selectively based on filters. Storage Transfer Service provides a robust mechanism to accomplish these tasks.
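The sketch below shows the general shape of a transfer job body for the Storage Transfer Service REST API (`transferJobs.create`), covering scheduling, source/sink, and filter-based selection. The project, bucket names, and prefix are placeholders, and the exact field set should be checked against the API reference.

```python
# Illustrative transfer job for the Storage Transfer Service REST API.
# All identifiers below are placeholders.
transfer_job = {
    "description": "Nightly sync of raw logs into the data lake",
    "status": "ENABLED",
    "projectId": "my-project",
    "schedule": {
        # Recurs daily from this start date when no end date is given.
        "scheduleStartDate": {"year": 2018, "month": 10, "day": 1},
    },
    "transferSpec": {
        "awsS3DataSource": {"bucketName": "source-logs"},
        "gcsDataSink": {"bucketName": "my-data-lake"},
        # Move files selectively based on a prefix filter.
        "objectConditions": {"includePrefixes": ["logs/"]},
    },
}
```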
gsutil: For one-time or manually initiated transfers, you might consider using gsutil, an open source command-line tool that is available for Windows, Linux, and Mac. It supports multithreaded and multiprocessing transfers, parallel composite uploads, retries, and resumability.
Transfer Appliance: Depending on your network bandwidth, if you want to migrate large volumes of data to the cloud for analysis, you might find it less time consuming to perform the migration offline by using the Transfer Appliance.
See a more detailed overview of the ingest options and the key decision-making criteria that are involved in choosing an ingest option.
Processing and analytics
After you have ingested and stored data, the next step is to make it available for analysis. In some cases, you can store data in a well-understood schema immediately after ingestion, which simplifies in-place querying. For instance, if you store incoming data in Avro format in Cloud Storage, you can do the following:
- Use Hive on Cloud Dataproc to issue SQL queries against the data.
- Issue queries directly against the data from BigQuery.
- Load the data into BigQuery and then query it.
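For the third option, loading Avro into BigQuery can be as simple as a single `bq load` invocation; because Avro files embed their own schema, no schema argument is needed. The helper below, which assembles the command, and the dataset, table, and bucket names are all illustrative.

```python
def bq_load_command(dataset, table, gcs_uri):
    """Assemble a `bq load` command for Avro data (illustrative helper).
    Avro files carry their own schema, so none is supplied here."""
    return [
        "bq", "load",
        "--source_format=AVRO",
        f"{dataset}.{table}",
        gcs_uri,
    ]

# Placeholder dataset, table, and bucket names.
cmd = bq_load_command("lake", "orders", "gs://my-data-lake/orders/*.avro")
```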
However, you cannot always shape data into a well-known schema as it is being ingested and stored. In fact, the main reason to maintain a data lake instead of a data warehouse is to store everything now so that you can extract insights later. Depending on the nature of the raw data and the types of analytics involved, the workflow can range from simple to complex. The following diagram provides a high-level overview.
Data mining and exploration
Because a large portion of the data stored in the lake is not ready for immediate consumption, you must first mine this data for latent value. Jupyter notebooks are a favorite tool for exploring raw data, and for this purpose GCP offers Cloud Datalab, a fully managed Jupyter notebook service.
Cloud Datalab comes preinstalled with a wide range of popular data science libraries such as TensorFlow and NumPy. In addition to using Cloud Datalab, you have access to the traditional Hadoop ecosystem of tools in Cloud Dataproc and fully serverless analytics with Cloud Dataflow. For powerful SQL-based analysis, you can transform raw data with Cloud Dataprep by Trifacta and load it into BigQuery.
When you understand the analytics potential of a subset of raw data in the lake, you can make that subset available to a broader audience.
Design and deploy workflows
Making a data subset more widely available means creating focused data marts, as shown in the preceding diagram. You can keep these data marts up to date by using orchestrated data pipelines that take raw data and transform it into a format that downstream processes and users can consume. These pipelines vary based on the nature of the data and the types of analytics used. Here are some common analytics workflows and how you can implement them on GCP.
Transformation of raw data and load into BigQuery
In the simple but common workflow illustrated in the following diagram, you use an extract, transform, and load (ETL) process to ingest data into a BigQuery data warehouse. You can then query the data by using SQL. Cloud Dataprep, a visual tool for cleansing and preparing data, is well suited for simple ETL jobs, while Cloud Dataflow with Apache Beam provides additional flexibility for more involved ETL jobs.
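The heart of such an ETL job is the per-record transform. The function below is a pure-Python sketch of the kind of logic you would place inside an Apache Beam DoFn in a Cloud Dataflow pipeline: parse a raw CSV line, clean the fields, and emit a row dict shaped for a BigQuery table. The field names and cleanup rules are assumptions for illustration.

```python
import csv
import io

def parse_order(line):
    """Turn one raw CSV line into a row dict ready for BigQuery.
    Illustrative transform of the kind that would live inside a
    Beam DoFn; field names are assumed, not from a real schema."""
    order_id, customer, amount = next(csv.reader(io.StringIO(line)))
    return {
        "order_id": int(order_id),
        "customer": customer.strip().lower(),   # normalize free-text field
        "amount_usd": round(float(amount), 2),  # clamp to cents
    }

row = parse_order("42, Alice ,19.999")
```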
If you want to use the Hadoop ecosystem of products for batch analytics, store your transformed data in a separate Cloud Storage location. Then you can use Cloud Dataproc to run queries against this data by using Spark, Spark SQL, SQL on Hive, and similar tools. Apache Avro, Apache Parquet, and Apache ORC are popular formats for this refined data. The following diagram summarizes this workflow.
Streaming analytics
If you want a straightforward, SQL-based pipeline, stream processing on BigQuery gives you the ability to query data as it is ingested. Adding Cloud Pub/Sub and Cloud Dataflow with Beam offers deeper stream-processing capability so that, for instance, your users can perform aggregations, windowing, and filtering before they store data in BigQuery. For time-series analytics, you can store ingested data in Cloud Bigtable to facilitate rapid analysis. The following diagram illustrates this workflow.
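To illustrate what windowed aggregation means in this context, here is a pure-Python sketch of counting events per fixed (tumbling) window, the kind of aggregation Cloud Dataflow with Beam performs before writing results to BigQuery. The function name, event shape, and 60-second window are assumptions for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Count (timestamp, payload) events per fixed tumbling window,
    keyed by window start time. A local sketch of the windowed
    aggregation a Beam pipeline would express declaratively."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_secs)  # align to window boundary
        counts[window_start] += 1
    return dict(counts)

events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
per_window = tumbling_window_counts(events, window_secs=60)
```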
Machine learning
Machine learning can benefit immensely from the vast amount of data in a data lake. After you identify useful training data, the associated data preparation steps, and the machine learning network architecture, you can orchestrate these steps as shown in the following diagram. Cloud Machine Learning Engine makes it easy to hone models and then use them to do both batch and online predictions.
Not all machine learning use cases warrant the design and training of custom models. GCP includes pretrained models for speech, vision, video intelligence, and natural language processing. For these cases, you pass the appropriate inputs, such as audio, images, or video, to the respective GCP service. You then extract valuable metadata and store that metadata in a service such as BigQuery for further querying and analysis. The following diagram illustrates this flow.
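The "extract metadata, then load it for analysis" step can be sketched as follows. The input here mimics the shape of the Cloud Vision API's label annotations (a description plus a confidence score); the helper name, the image URI, and the 0.5 confidence cutoff are assumptions for illustration.

```python
def labels_to_rows(image_uri, annotations):
    """Flatten label annotations (shaped like the Cloud Vision API's
    labelAnnotations: description plus confidence score) into flat
    rows suitable for loading into BigQuery. Illustrative helper."""
    return [
        {
            "image_uri": image_uri,
            "label": a["description"],
            "score": round(a["score"], 3),
        }
        for a in annotations
        if a["score"] >= 0.5   # assumed cutoff: keep confident labels only
    ]

rows = labels_to_rows(
    "gs://my-data-lake/images/cat.jpg",   # placeholder object path
    [{"description": "cat", "score": 0.98},
     {"description": "sofa", "score": 0.41}],
)
```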
What's next
- Data Lifecycle
- Learn how to transition from data warehousing in Teradata to big data services such as BigQuery, Cloud Dataflow, and Cloud Dataprep.
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.