This article describes how you can transition from on-premises and cloud data warehousing to Google Cloud Platform (GCP), which includes big data services such as BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Dataprep by Trifacta, and Cloud Pub/Sub.
Over the past few decades, organizations have mastered the science of data warehousing, and increasingly apply descriptive analytics to large quantities of stored data, gaining insight into their core business operations. Classic Business Intelligence (BI) might have been a differentiating factor in the past, either making or breaking a company, but it's no longer sufficient.
Now, not only do organizations need to understand past events using descriptive analytics, they also need predictive analytics, which often use machine learning (ML) to extract data patterns and make probabilistic claims about the future. The ultimate goal is to develop prescriptive analytics that combine lessons from the past with predictions about the future to guide real-time actions.
Innovative ideas are often slowed by the size and complexity of the IT required to implement them. It takes years and millions of dollars to build a scalable, highly available, and secure data warehouse architecture. It is even more difficult and expensive to build a data lake for a big data operation. Fortunately, cloud technology makes it possible for you to focus on advancing your core business while delegating infrastructure maintenance and platform development to the cloud provider. On top of advanced and cost effective Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings, GCP offers sophisticated Software as a Service (SaaS) technology, which can be used for serverless data warehouse and/or data lake operations.
Rather than advocating an all-or-nothing approach, this article describes how to make the transition to big data in gradual stages. While operating a data lake in GCP might not necessarily be your immediate goal, a simpler task, such as migrating a relational data warehouse to GCP, often offers a high return on investment and opens the door to new opportunities. It's up to you to decide how quickly, and to what degree, you adopt big data.
This article presents Teradata as the system (or warehouse) to be migrated. However, the general concepts are applicable to any data warehousing technology. The architectural patterns you will explore are powered by GCP, enabling you to move beyond common, inefficient, and decades-old data analysis practices.
Before you begin
This article assumes that you have reviewed BigQuery for Data Warehouse Practitioners.
Data warehousing and big data industry definitions are sometimes fuzzy, so it's important to clarify some definitions up front.
Data characteristics include the following terms:
Velocity refers to how quickly data becomes available for analysis.
Variety refers to the scope of data beyond the data warehouse. A data warehouse only retains data that follows a certain structure and achieves a high level of quality. New kinds of data appear in a data lake faster than they can become streamlined into a data warehouse.
Volume refers to the size of data retained in a data storage. Since scaling a traditional data warehouse requires hardware upgrades, often strict data size limits are enforced. However, given the distributed architecture, this is not a concern for a GCP data warehouse.
Quality refers to the level of data cleanliness. While duplicate records and missing values are acceptable in a data lake, they are strictly controlled by the Extract, Load, and Transform (ETL) process in a data warehouse operation.
Interactivity refers to the completion speed of a user query. Queries that take more than a few seconds qualify as noninteractive and are better suited for batch or asynchronous executions.
Data warehouse refers to relational data storage as well as the related processes involved in storing analysis-ready data. Data is stored in a highly structured fashion intended for known queries. Typical data warehouse practices capture raw data from various sources using an Online Transactional Processing (OLTP) system. Then, on batch cycles, a subset of data is extracted, transformed based on a defined schema, and loaded into the data warehouse: this is known as the ETL process. The fact that data warehouses capture a subset of data on batch cycles and store data based on rigid schemas makes them incapable of handling real-time analysis or responding to spontaneous queries. Big data concepts emerged in response to these inherent limitations.
Data lake refers to distributed and often heterogeneous data storage where an ongoing stream of business data is persisted in its natural state. A data lake operation addresses the limitations of data warehousing by embracing the "3V" characteristics of data:
- Volume. It's assumed that data flows from an infinite source.
- Velocity. Newly arrived data is quickly incorporated into storage and made available for analysis.
- Variety. Data doesn't necessarily conform to a predefined structure.
Big data refers to an analytics paradigm based on raw data in a data lake. Immediate access to large quantities of data has allowed a new class of analytics to prosper. Using big data, analysts find answers to previously intractable problems, apply machine learning to discover emerging data patterns, and test new hypotheses. As a result, decision makers have timely insight into how their business is performing, enabling them to modify processes for better results. In addition, the end user's experience is often enriched with relevant insights gleaned from big data analysis.
When migrating a complicated data warehousing operation to the cloud, it's better to take a staged approach. This guide presents a gradually maturing architecture consisting of four steps, where each step builds on the previous. Where you start and end your migration depends on your specific business needs.
Stages differ from each other by the method in which data is inserted into BigQuery. Irrespective of how the data gets there, using BigQuery as a data warehouse is a common practice that is discussed in a separate section on Operating a data warehouse in the cloud. It isn't until Stage 4 that a data lake operation is presented.
Stage 1: Data warehousing in GCP
Figure 1 presents an architecture that minimally impacts existing operations. Data is transferred from the OLAP DB in Teradata to GCP. By transferring clean and quality-controlled data, you can reuse existing tools and processes. For guidance on how to transfer your data to GCP, see Transferring Big Data Sets.
This article maintains the OLD Analyze layer, which consists of the OLAP DB in Teradata as well as existing reporting tools. Keeping existing systems is an essential tactic for validation and benchmarking purposes, but suboptimal in the long run. Production of the old layer adds to the overall processing time and negatively impacts velocity. As you will see in the following section, the next stage is to transfer data to GCP directly from the staging storage without maintaining the OLD Analyze layer.
The following highlights the pros and cons of this stage compared to the next stages.
- It is the fastest and least risky to develop.
- Allows reports and dashboards to be validated against old versions; this is critical until the GCP-based operation is fully adopted within your organization.
- Because OLAP data is maintained in redundant places, this operation isn't cost effective.
- Longer processing time negatively impacts velocity. You cannot deliver operational BI or any type of near real-time analytics based on this architecture.
Figure 1: Stage 1 migration
The callout numbers in the illustration above refer to the following steps:
Extract, Transform, and Transfer: Use BTEQ or FastExport to export data out of Teradata. Since you are transferring already deduped and cleansed OLAP data into Cloud Storage, it is recommended that you use your existing ETL tools such as Informatica, talend, or other partner tools to transform the data into the optimal schema for BigQuery. From there, transfer the ready-to-load files to Cloud Storage.
Load: Use BigQuery load jobs as described in BigQuery for Data Warehouse Practitioners to load the data into BigQuery.
Analyze: Use the variety of analysis methods that are described in BigQuery for Data Warehouse Practitioners to make the data available to your users.
Stage 2: Data staging and warehousing in GCP
Figure 2 illustrates Stage 2, an architecture eliminating the production of OLAP DB in Teradata. Instead, data is transferred from existing staging storage directly to GCP. This stage builds on top of Stage 1 by enhancing the ETL section of the architecture while keeping the Analyze and Acquire layers intact.
Stage 2 also reduces the number of ETL steps. It takes advantage of serverless cloud tools for ETL. Compared to the Stage 1 architecture, Stage 2 delivers a more optimal solution. However, it requires initial ramp-up time to gain expertise with the new tools to perform intensive data cleansing and transformations. Stage 3 further enhances this operation by connecting input systems directly to the GCP.
The following highlights the pros and cons of Stage 2 compared to other stages.
- Compared to Stage 1: reduces overall processing time and improves velocity.
- Compared to Stage 1: takes advantage of cloud-managed services for data cleansing and transformation.
- Compared to Stage 1: eliminates infrastructure and operational costs for OLAP DB in Teradata.
- Compared to next stages: keeps source systems unchanged.
- Compared to Stage 1: requires more upfront development investment.
- Compared to next stages: although you have gained some operational efficiency, you still cannot deliver operational BI or any type of near real-time analytics based on this architecture.
Figure 2: Stage 2 migration
The callout numbers in the illustration above refer to the following steps:
Extract and Transfer: Depending on how and where you currently stage the data, you will use different tools to export the data. Data is often staged in Teradata; in this case, you could use BTEQ or FastExport to export the data. Once you have a set of files, you transfer them to Cloud Storage. It is important to note that data is transferred in its native schema and data cleansing and transformation is delayed until step 2.
Analyze: Use a variety of analysis methods that are described in BigQuery for Data Warehouse Practitioners to make the data available to your users.
Stage 3: Full data warehouse operation in GCP
The Stage 3 architecture presented in Figure 3 completes the picture for data warehousing in GCP. In Stage 3, all the data from source systems is directly persisted in GCP. Because all the input systems must be changed, this stage is considered the most disruptive and therefore, the slowest to implement. However, it's often the case that autoscaling, the ability to provide near real-time analytics, and cost-effective serverless operation justify the initial development investment. More importantly, it sets you up for Stage 4 architecture, which builds a data lake operation on top of this stage.
Stage 3 builds on the Stage 2 architecture and, therefore, has built-in risk mitigation. The Stage 3 architecture maintains the Analyze layer from Stage 1, the ETL layer from Stage 2, and expands on the Acquire layer.
Depending on the number and complexity of the source systems, delivering this architecture can be further staged by tackling source systems one at a time according to priority, interdependencies, integration risks, or other business factors.
The following highlights the pros and cons of Stage 3 compared to previous stages.
- Improves velocity, making operational BI and near real-time analysis a reality.
- Takes better advantage of GCP, reducing the overall cost of ownership.
- Sets you up for a complete big data operation as in the next step.
- Requires all input systems to interface with GCP and therefore, demands widespread and disruptive changes. In other words, it poses higher cost of prolonged initial development.
Figure 3: Stage 3 migration
The callout numbers in the illustration above refer to the following steps:
Capture - Synchronous: Use synchronous integration between data sources and GCP backend when dealing with use cases that require explicit user actions as part of the flow. Typically, these use cases have a transactional nature. For example, a user adding an item to the cart, selecting an airline seat, or conducting a financial transaction is done in a synchronous fashion. The user must be aware of success or failure in order to take further explicit actions.
Capture - Asynchronous: Use Cloud Pub/Sub to implement asynchronous integration between data sources and GCP back-ends. This is ideal for use cases where:
- A large volume of concurrent event data is captured in a fire-and-forget fashion.
- Data is captured in the background in ways that don't require explicit user actions. For example, in a retail scenario, a user entering a store or liking an item falls under this category.
Persist: Data is persisted using one or more GCP storage products. You need to select the storage type according to the nature of data, as well as other requirements. See Choosing the right storage stype for more details.
ETL to BigQuery: From this point onward, the rest of the architecture remains identical to Stage 2.
Stage 4: Full big data operation in GCP
The Stage 4 architecture presented in Figure 4 builds on top of your investments in the previous stages. You will operate a data warehouse in the cloud as well as a data lake, which opens the door to more sophisticated analytics.
Notice that Stage 4 retains the Stage 3 architecture fueling the data warehouse and builds the data lake as an add-on.
In order to differentiate the variety of ways that data is made available to users, Stage 4 distinguishes between four grades of data. Each grade of data is suitable for a set of use cases according to user roles. These grades demonstrate different data characteristics. A full circle indicates acceptable behavior; a half circle indicates average; and an empty circle indicates poor performance.
|Bronze||Refers to raw data as natively stored in the Persistence layer; suitable for batch processing.|
|Silver||Refers to raw data queried through BigQuery data federation.|
|Gold||Refers to the relational data warehouse in BigQuery.|
|Platinum||Refers to a subset of critical data that is streamed into BigQuery for realtime interactive analysis.|
Figure 4: Full big data operation in GCP
The callout numbers in the illustration above refer to the following steps:
Unchanged: Refer to Stage 3 architecture for details.
Stream for real-time descriptive analytics: Refer to streaming data into BigQuery and life of a BigQuery streaming insert for more details. Because quota limitations and extra streaming costs are involved, you need to be selective with the use cases you address with this data stream. Typically, this service is reserved for mission-critical use cases; some examples are discussed in step 8 below. If you find the quotas too limiting, talk to your Google representative to find out how to increase the quotas.
Federate for raw descriptive analytics: BigQuery can make SQL queries against external data sources. Because data remains external to BigQuery, queries aren't as quick compared to when data natively resides in BigQuery's columnar storage. Furthermore, data quality isn't as high as the data warehouse, hence the "Silver" grade. Refer to step 5 below for the use cases.
Traditional BI: This is the traditional BI practice that many organizations operate today. Data analysts create new queries, reports, and visualization dashboards to help business executives make decisions.
This part of the diagram was collapsed to make room for new additions. Refer to the Analyze layer in Stage 1 or 2.
Semi-raw descriptive analytics: This type of semi-raw and high velocity data is ideal for data analysts who need to perform custom studies. Custom studies involve analyzing data that is yet to be streamlined into the data warehouse. This type of data is referred to as semi-raw because BigQuery's data federation feature allows SQL queries against log files in Cloud Storage, transactional records in Cloud Bigtable, and many other kinds of data outside BigQuery.
Another common use case for this type of data is by data engineers working on the next release of Cloud Dataflow jobs in order to expand variety in the data warehouse. Custom studies by data analysts often feed into data warehouse expansions.
Scientific experiments: Although data science in GCP is beyond the scope of this paper, the toolset is described here at a very high level. For a more comprehensive guide, see Valliappa Lakshmanan's book Data Science on GCP.
BigQuery offers the convenience of SQL and the flexibility of user-defined functions as well as parallel computational power. This presentation shows how complicated data processing algorithms can be executed using BigQuery at massive scales. Data scientists can leverage the integration of BigQuery with Cloud Datalab, allowing them to write, execute, and plot query results all from a notebook.
BigQuery is often used for heavy duty pre-processing work during feature engineering. However, you would need to ensure scientific workloads don't impact other users by hogging available slots. When training an ML model, it's inefficient to fetch data for each batch from BigQuery unless it's for small experiments because each batch randomly samples the data, meaning the same row of data can be processed several times, leading to unnecessary query costs. Instead, it's better to export processed data from BigQuery into Cloud Storage and read batch data from Cloud Storage.
Cloud Dataproc allows you to transfer your existing on-premises Hadoop and Spark workloads to a managed environment and continue extending your workloads. Similarly, if you are using Spark MLlib on-premises, you can transfer them to the managed environment provided by Cloud Dataproc. As your models get more complicated and demand larger-scale feature engineering, you might hit the limits of the MLlib platform, and that's when you transition into Cloud Machine Learning.
Cloud Machine Learning enables you to transfer your existing on-premises TensorFlow models to the managed Cloud Machine Learning environment and run them at scale. If you are just starting your ML practice, you might find it helpful to experiment with TensorFlow on a small scale on your local machine before running your jobs using Cloud Machine Learning. Cloud Machine Learning makes it easy to go back and forth between your local environment and GCP.
Cloud Dataprep is a handy tool that can be used for familiarizing yourself with the data, discovering patterns, cleansing, and transforming data in an interactive, zero-development fashion. It is especially powerful in the hands of data scientists doing feature engineering during supervised machine learning.
Cloud Datalab, built on top of Jupyter (formerly known as iPython), is a powerful interactive tool created to explore, analyze, transform, and visualize data. Data scientists use Cloud Datalab notebooks to capture their thoughts, source code, code outcome, and data visualizations all in one place and collaborate in a team environment.
Real-time BI: You will use the same set of tools and processes listed in step 4 above to create and share data visualizations; the only difference is velocity and variety. Because data is made available as a live stream, you can build real-time reports and dashboards. However, as discussed in step 2 above, this stream of data is reserved for a special grade of data necessary for mission-critical use cases.
Enhance user experience with real-time descriptive analytics: You can build sophisticated user experiences around a real-time stream of data when it is fused with historical data. This could be a back-office employee alerted on their mobile app about low stock; an online customer, who might benefit from knowing that spending another dollar would put her on the next reward tier; or a nurse, who gets alerted about a patient's vital signs on their smartwatch, enabling them to take the best course of action by pulling up the patient's treatment history on their tablet. Don't mistake these use cases with the ones in step 9, where ML is employed to deliver predictive and prescriptive analytics based on patterns that are hard to detect or describe by humans.
Enhance user experience with predictive/prescriptive analytics: You can use Google's pre-trained models for Image analysis, video analysis, speech recognition, natural language, and translation, or serve your custom-trained model using Cloud Machine Learning to augment your user experience. Thanks to pre-trained models, building ML-enriched apps involves little more than making API calls to GCP, offering easier implementation of features such as voice recognition, multilingual services, chatbots, OCR, image recognition, and so forth. You can use your custom-trained models for use cases tailored to your specific business needs. This might involve recommending a product based on market trends and user purchase behavior; predicting a flight delay; detecting fraudulent activities; flagging inappropriate content, or other innovative ideas that could differentiate your app from the competition.
Choosing the right storage type
GCP offers various data storage products. While the product catalog is larger than what is presented in the following table, the ones listed are applicable to this article. By presenting the products side by side and highlighting their characteristics, the following table can help you select the right product for storing your data. Sometimes persisting data across various storage types is inevitable. However, in general it's a good idea to consolidate storage types as much as possible.
|Feature||Cloud Storage||Cloud Datastore||Cloud Bigtable||Cloud Spanner|
|Access metaphore||Like files in a file system||Persistent hashmap||Key-values, HBase API||Relational|
|Read||Have to copy to local disk||Filter objects on property||Scan rows||SELECT rows|
|Write||One file||Put object||Put row||INSERT rows|
|Update granularity||An object (a "file")||Attribute||Row||Field|
|Usage||Store blobs||Structured data from web
and mobile apps
|No-Ops, high throughput,
scalable, flattened data
|No-Ops SQL database|
Operating a data warehouse in GCP
BigQuery for Data Warehouse Practitioners explains how to use BigQuery as a data warehouse. The article shows how common data warehouse concepts map to those in BigQuery and describes how standard data warehouse tasks are done in a GCP-based data warehouse operation. It also covers the various ways that you can connect to BigQuery and analyze the data.
The abovementioned guide isn't geared towards a specific migratory technology, so the following table maps equivalent GCP services that replace Teradata tools and concepts.
|Teradata tools and concepts||GCP equivalent|
Enterprise data warehouse platform
Refer to sections service model comparison and costs for more details.
bq command-line tool
Refer to section managing data for more details.
|Use BigQuery data export to get data out of the data warehouse and the different ways explained in section loading data to get data into the data warehouse.|
|Teradata SQL Assistant||BigQuery web
More details are covered in section querying data.
Performance monitoring for administrators and business users
|Section monitoring and auditing addresses this topic.|
|Teradata Virtual Storage||BigQuery storage management is explained in section storage management.|
|Teradata Row Level Security||Row level security and other aspects of permission management is discussed in section granting permissions.|
|Teradata Workload Management||This topic is addressed in section managing workloads and concurrency.|
Operating a data lake in GCP
Descriptive Analytics or classic BI looks at past performance and attempts to understand it by mining historical data to answer the question: "What happened?" The architecture presented in Stage 4 outlines, at a very high level, various aspects of operating a data lake in GCP. It shows how you can build pipelines to help deliver more timely answers to the question "What will happen?", using predictive analytics and "What to do?", using prescriptive analytics.
Contact your Google representative to obtain a copy of the detailed guide for Teradata to BigQuery migration.
Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.