This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for data analytics in Google Cloud. You learn about some of the key data-analytics services, and how they can help at the various stages of the data lifecycle. These best practices help you to meet your data analytics needs and create your system design.
Businesses want to analyze data and generate actionable insights from that data. Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization. Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, such as to self-host Apache Hadoop or Beam.
Your particular focus, team expertise, and strategic outlook help you to determine which Google Cloud services you adopt to support your data analytics needs. For example, Dataflow lets you write complex transformations in a serverless approach, but you must rely on an opinionated version of configurations for compute and processing needs. Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.
In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.
The following table provides a high-level overview of Google Cloud analytics services:
|Google Cloud service||Description|
|Pub/Sub||Simple, reliable, and scalable foundation for stream analytics and event-driven computing systems.|
|Dataflow||A fully managed service to transform and enrich data in stream (real time) and batch (historical) modes.|
|Dataprep by Trifacta||Intelligent data service to visually explore, clean, and prepare structured and unstructured data for analysis.|
|Dataproc||Fast, easy-to-use, and fully managed cloud service to run Apache Spark and Apache Hadoop clusters.|
|Cloud Data Fusion||Fully managed, data integration service that's built for the cloud and lets you build and manage ETL/ELT data pipelines. Cloud DataFusion provides a graphical interface and a broad open source library of preconfigured connectors and transformations.|
|BigQuery||Fully managed, low-cost, serverless data warehouse that scales with your storage and compute power needs. BigQuery is a columnar and ANSI SQL database that can analyze terabytes to petabytes of data.|
|Cloud Composer||Fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span clouds and on-premises data centers.|
|Data Catalog||Fully managed and scalable metadata management service that helps you discover, manage, and understand all your data.|
|Looker Studio||Fully managed visual analytics service that can help you unlock insights from data through interactive dashboards.|
|Looker||Enterprise platform that connects, analyzes, and visualizes data across multi-cloud environments.|
|Dataform||Fully managed product to help you collaborate, create, and deploy data pipelines, and ensure data quality.|
|Dataplex||Managed data lake service that centrally manages, monitors, and governs data across data lakes, data warehouses, and data marts using consistent controls.|
|AnalyticsHub||Platform that efficiently and securely exchanges data analytics assets across your organization to address challenges of data reliability and cost.|
When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle.
The data lifecycle includes the following stages and example services:
- Ingestion includes services such as Pub/Sub, Storage Transfer Service, Transfer Appliance, and BigQuery.
- Storage includes services such as Cloud Storage, Bigtable, Memorystore, and BigQuery.
- Processing and transformation includes services such as Dataflow, Dataproc, Dataprep, Sensitive Data Protection, and BigQuery.
- Analysis and warehousing includes services such as BigQuery.
- Reporting and visualization includes services such as Looker Studio and Looker.
The following stages and services run across the entire data lifecycle:
- Data integration includes services such as Data Fusion.
- Metadata management and governance includes services such as Data Catalog.
- Workflow management includes services such as Cloud Composer.
Apply the following data ingestion best practices to your own environment.
Determine the data source for ingestion
Data typically comes from another cloud provider or service, or from an on-premises location:
For on-premises data ingestion, consider the volume of data to ingest and your team's skill set. If your team prefers a low-code, graphical user interface (GUI) approach, use Cloud Data Fusion with a suitable connector, such as Java Database Connectivity (JDBC). For large volumes of data, you can use Transfer Appliance or Storage Transfer Service.
Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.
Identify streaming or batch data sources
Consider how you need to use your data and identify where you have streaming or batch use cases. For example, if you run a global streaming service that has low latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting uses, you can stream data into BigQuery.
If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template. For batch workloads, the first step is usually to ingest data into Cloud Storage. Use the gsutil tool or Storage Transfer Service to ingest data.
Ingest data with automated tools
Manually moving data from other systems into the cloud can be a challenge. If possible, use tools that let you automate the data ingestion processes. For example, Cloud Data Fusion provides connectors and plugins to bring data from external sources with a drag-and-drop GUI. If your teams want to write some code, Data Flow or BigQuery can help to automate data ingestion. Pub/Sub can help in both a low-code or code-first approach. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.
Use migration tools to ingest from another data warehouse
If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance. The BigQuery Data Transfer Service also provides third-party transfers that help you ingest data on a schedule from external sources. For more information, see the detailed migration approaches for each data warehouse.
Estimate your data ingestion needs
The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second. Capacity, storage, and regional requirements for your data help you to determine whether Pub/Sub Lite is a better option for your system design. For more information, see Choosing Pub/Sub or Pub/Sub Lite.
For batch ingestion of data, estimate how much data you want to transfer in total, and how quickly you want to do it. Review the available migration options, including an estimate on time and comparison of online versus offline transfers.
Use appropriate tools to regularly ingest data on a schedule
Storage Transfer Service
BigQuery Data Transfer Service
both let you schedule ingestion jobs. For fine-grain control of the timing of
ingestion or the source and destination system, use a workflow-management system
If you want a more manual approach, you can
use Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
If you want to manage the Compute infrastructure, you can use the gsutil command with cron for data transfer of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
Review FTP/SFTP server data ingest needs
If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins. If you want to modernize and create a long-term workflow solution, Cloud Composer is a fully managed service that lets you read and write from various sources and sinks.
Use Apache Kafka connectors to ingest data
If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data using one of the Apache Kafka connectors. For example, the open source Pub/Sub Kafka connector lets you ingest data from Pub/Sub or Pub/Sub Lite.
- Cloud Storage Transfer Service agent best practices
- How to ingest data into BigQuery so you can analyze it
- Ingesting clinical and operational data with Cloud Data Fusion
- Optimizing large-scale ingestion of analytics events and logs
Apply the following data storage best practices to your own environment.
Choose the appropriate data store for your needs
To help you choose what type of storage solution to use, review and understand the downstream usage of your data. The following common use cases for your data give recommendations for which Google Cloud product to use:
|Data use case||Product recommendation|
|Transaction processing||Cloud SQL|
|Business intelligence (BI) & analytics||BigQuery|
|Batch processing||Cloud Storage
Bigtable if incoming data is time series and you need low latency access to it.
BigQuery if you use SQL.
Review your data structure needs
For most unstructured data, such as documents and text files, audio and video files, or logs, an object-based store is the most suitable choice. You can then load and process the data from object storage when you need it.
For semi-structured data, such as XML or JSON, your use cases and data access patterns help guide your choice. You can load such datasets into BigQuery for automatic schema detection. If you have low latency requirements, you can load your JSON data into Bigtable. If you have legacy requirements or your applications work with relational databases, you can also load datasets into a relation store.
For structured data, such as CSV, Parquet, Avro, or ORC, you can use BigQuery if you have BI and analytics requirements that use SQL. For more information, see how to batch load data. If you want to create a data lake on open standards and technologies, you can use Cloud Storage.
Migrate data and reduce costs for HDFS
Look for ways to move Hadoop Distributed File System (HDFS) data from on-premises or from another cloud provider to a cheaper object-storage system. Cloud Storage is the most common choice that enterprises make as an alternative data store. For information about the advantages and disadvantages of this choice, see HDFS vs. Cloud Storage.
You can move data with a push or pull method. Both methods use the
distcp command. For more information, see
Migrating HDFS Data from On-Premises to Google Cloud.
You can also use the open source Cloud Storage connector to let Hadoop and Spark jobs access data in Cloud Storage. The connector is installed by default on Dataproc clusters, and can be manually installed on other clusters.
Use object storage to build a cohesive data lake
A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. You can use Cloud Composer and Cloud Data Fusion to build a data lake.
To build a modern data platform, you can use BigQuery as your central data source instead of Cloud Storage. BigQuery is a modern data warehouse with separation of storage and compute. A data lake built on BigQuery lets you perform traditional analytics from BigQuery in the Cloud console. It also lets you access the data stored from other frameworks such as Apache Spark.
- Best practices for Cloud Storage
- Best practices for Cloud Storage cost optimization
- Best practices for ensuring privacy and security of your data in Cloud Storage
- Best practices for Memorystore
- Optimizing storage in BigQuery
- Bigtable schema design
Process and transform data
Apply the following data analytics best practices to your own environment when you process and transform data.
Explore the open source software you can use in Google Cloud
Many Google Cloud services use open source software to help make your transition seamless. Google Cloud offers managed and serverless solutions that have Open APIs and are compatible with open source frameworks to reduce vendor lock-in.
Dataproc is a Hadoop-compatible managed service that lets you host open source software, with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and Zookeeper. It also provides Hive Metastore as a managed service to remove itself as a single point of failure in the Hadoop ecosystem.
You can migrate to Dataflow if you currently use Apache Beam as a batch and streaming processing engine. Dataflow is a fully managed and serverless service that uses Apache Beam. Use Dataflow to write jobs in Beam, but let Google Cloud manage the execution environment.
Determine your ETL or ELT data-processing needs
- For new environments, we recommend Dataflow for a unified way to create batch and streaming applications.
- For a fully managed approach, Data Fusion provides a drag-and-drop GUI to help you create pipelines.
For ELT pipelines, use BigQuery, which supports both batch and streaming data load. After your data is in BigQuery, use SQL to perform all transformations to derive new datasets for your business use cases.
Use the appropriate framework for your data use case
Your data use cases determine which tools and frameworks to use. Some Google Cloud products are built to handle all of the following data use cases, while others best support only one particular use case.
- For a batch data processing system, you can process and transform
data in BigQuery with a familiar SQL interface. If you have
an existing pipeline that runs on Apache Hadoop or Spark on-premises or in
another public cloud, you can use Dataproc.
- You can also use Dataflow if you want a unified programing interface for both batch and streaming use cases. We recommend that you modernize and use Dataflow for ETL and BigQuery for ELT.
For streaming data pipelines, you use a managed and serverless service like Dataflow that provides windowing, autoscaling, and templates. For more information, see Building production-ready data pipelines using Dataflow.
- If you have analytics and SQL-focused teams and capabilities, you can also stream data into BigQuery.
For real-time use cases, such as time series analysis or streaming video analytics, use Dataflow.
Retain future control over your execution engine
To minimize vendor lock-in and to be able to use a different platform in the future, use the Apache Beam programming model and Dataflow as a managed serverless solution. The Beam programming model lets you change the underlying execution engine, such as changing from Dataflow to Apache Flink or Apache Spark.
Use Dataflow to ingest data from multiple sources
To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow. Dataflow is a managed serverless service that supports Dataflow templates, which lets your teams run templates from different tools.
Dataflow Prime provides horizontal and vertical autoscaling of machines that are used in the execution process of a pipeline. It also provides smart diagnostics and recommendations that identify problems and suggest how to fix them.
Discover, identify, and protect sensitive data
Use Sensitive Data Protection to inspect and transform structured and unstructured data. Sensitive Data Protection works for data located anywhere in Google Cloud, such as in Cloud Storage or databases. You can classify, mask, and tokenize your sensitive data to continue to use it safely for downstream processing. Use Sensitive Data Protection to perform actions such as to scan BigQuery data or de-identify and re-identify PII in large-scale datasets.
Modernize your data transformation processes
Use Dataform to write data transformations as code and to start to use version control by default. You can also adopt software development best practices such as CI/CD, unit tests, and version control to SQL code. Dataform supports all major cloud data warehouse products and databases, such as PostgreSQL.
- Data Fusion
- Sensitive Data Protection
Data analytics and warehouses
Apply the following data analytics and warehouse best practices to your own environment.
Review your data storage needs
Data lakes and data warehouses aren't mutually exclusive. Data lakes are useful for unstructured and semi-structured data storage and processing. Data warehouses are best for analytics and BI.
Review your data needs to help determine where to store your data and which Google Cloud product is the most appropriate to process and analyze your data. Products like BigQuery can process PBs of data and grow with your demands.
Identify opportunities to migrate from a traditional data warehouse to BigQuery
Review the traditional data warehouses that are currently in use in your environment. To reduce complexity and potentially reduce costs, identify opportunities to migrate your traditional data warehouses to a Google Cloud service like BigQuery. For more information and example scenarios, see Migrating data warehouses to BigQuery.
Plan for federated access to data
Review your data requirements and how you might need to interact with other products and services. Identify your data federation needs, and create an appropriate system design.
For example, BigQuery lets you define external tables that can read data from other sources, such as Bigtable, Cloud SQL, Cloud Storage, or Google Drive. You can join these external sources with tables that you store in BigQuery.
Use BigQuery flex slots to provide on-demand burst capacity
Sometimes you need extra capacity to do experimental or exploratory analysis that needs a lot of compute resources. BigQuery lets you get additional compute capacity in the form of flex slots. These flex slots help you when there's a period of high demand or when you want to complete an important analysis.
Understand schema differences if you migrate to BigQuery
BigQuery supports both star and snowflake schemas, but by default it uses nested and repeated fields. Nested and repeated fields can be easier to read and correlate compared to other schemas. If your data is represented in a star or snowflake schema, and if you want to migrate to BigQuery, review your system design for any necessary changes to processes or analytics.
- Best practices for multi-tenant workloads on BigQuery
- Best practices for row-level security in BigQuery
- Best practices for materialized views in BigQuery
Reports and visualization
Apply the following reporting and visualization best practices to your own environment.
Use BigQuery BI Engine to visualize your data
BigQuery BI Engine is a fast, in-memory analysis service. You can use BI Engine to analyze data stored in BigQuery with subsecond query response time and with high concurrency. BI Engine is integrated into the BigQuery API. Use reserved BI Engine capacity to manage the on-demand or flat-rate pricing for your needs. BI Engine can also work with other BI or custom dashboard applications that require subsecond response times.
Modernize your BI processes with Looker
Looker is a modern, enterprise platform for BI, data applications, and embedded analytics. You can create consistent data models on top of your data with speed and accuracy, and you can access data inside transactional and analytical datastores. Looker can also analyze your data on multiple databases and clouds. If you have existing BI processes and tools, we recommend that you modernize and use a central platform such as Looker.
- Migrating data warehouses to BigQuery: Reporting and analysis
- Architecture for connecting visualization software to Hadoop on Google Cloud
- Speeding up small queries in BigQuery with BI Engine
Use workflow management tools
Data analytics involves many processes and services. Data moves across different tools and processing pipelines during the data analytics lifecycle. To manage and maintain end-to-end data pipelines, use appropriate workflow management tools. Cloud Composer is a fully managed workflow management tool based on the open source Apache Airflow project.
You can use Cloud Composer to launch Dataflow pipelines and to use Dataproc Workflow Templates. Cloud Composer can also help you create a CI/CD pipeline for your data processing workflow or automate your infrastructure tasks. For more information, watch Cloud Composer: Development best practices.
If you already run a data analytics platform and if you want to migrate some or all of the workloads to Google Cloud, review the following migration resources for best practices and guidance:
- General migration guidance
- Cloud Storage migration
- Pub/Sub migration
- Bigtable migration
- Dataproc migration
- BigQuery migration
- Composer migration
Learn about system design best practices for Google Cloud AI and machine learning, including the following:
- Learn about Google Cloud AI and machine learning services that support system design.
- Learn ML data processing best practices.
- Learn best practices for model development and training.
Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.