Analyze your data

Last reviewed 2023-08-08 UTC

This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for data analytics in Google Cloud. You learn about some of the key data-analytics services, and how they can help at the various stages of the data lifecycle. These best practices help you to meet your data analytics needs and create your system design.

Core principles

Businesses want to analyze data and generate actionable insights from that data. Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization. Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, for example to self-host Apache Hadoop or Apache Beam.

Your particular focus, team expertise, and strategic outlook help you to determine which Google Cloud services you adopt to support your data analytics needs. For example, Dataflow lets you write complex transformations in a serverless approach, but you rely on its opinionated configuration of compute and processing resources. Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.

Key services

The following list provides a high-level overview of Google Cloud analytics services:

  • Pub/Sub: Simple, reliable, and scalable foundation for stream analytics and event-driven computing systems.
  • Dataflow: Fully managed service to transform and enrich data in stream (real time) and batch (historical) modes.
  • Dataprep by Trifacta: Intelligent data service to visually explore, clean, and prepare structured and unstructured data for analysis.
  • Dataproc: Fast, easy-to-use, and fully managed cloud service to run Apache Spark and Apache Hadoop clusters.
  • Cloud Data Fusion: Fully managed data integration service that's built for the cloud and lets you build and manage ETL/ELT data pipelines. Cloud Data Fusion provides a graphical interface and a broad open source library of preconfigured connectors and transformations.
  • BigQuery: Fully managed, low-cost, serverless data warehouse that scales with your storage and compute power needs. BigQuery is a columnar and ANSI SQL database that can analyze terabytes to petabytes of data.
  • Cloud Composer: Fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span clouds and on-premises data centers.
  • Data Catalog: Fully managed and scalable metadata management service that helps you discover, manage, and understand all your data.
  • Looker Studio: Fully managed visual analytics service that can help you unlock insights from data through interactive dashboards.
  • Looker: Enterprise platform that connects, analyzes, and visualizes data across multi-cloud environments.
  • Dataform: Fully managed product to help you collaborate, create, and deploy data pipelines, and ensure data quality.
  • Dataplex: Managed data lake service that centrally manages, monitors, and governs data across data lakes, data warehouses, and data marts using consistent controls.
  • Analytics Hub: Platform that efficiently and securely exchanges data analytics assets across your organization to address challenges of data reliability and cost.

Data lifecycle

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle.

The data lifecycle includes the following stages and example services:

  • Data ingestion includes services such as Pub/Sub, Storage Transfer Service, and BigQuery Data Transfer Service.
  • Data storage includes services such as Cloud Storage, BigQuery, and Bigtable.
  • Process and transform includes services such as Dataflow, Dataproc, and BigQuery.
  • Data analytics and warehouses includes services such as BigQuery.
  • Reports and visualization includes services such as Looker Studio and Looker.

The following stages and services run across the entire data lifecycle:

  • Data integration includes services such as Cloud Data Fusion.
  • Metadata management and governance includes services such as Data Catalog.
  • Workflow management includes services such as Cloud Composer.

Data ingestion

Apply the following data ingestion best practices to your own environment.

Determine the data source for ingestion

Data typically comes from another cloud provider or service, or from an on-premises location.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.

Identify streaming or batch data sources

Consider how you need to use your data and identify where you have streaming or batch use cases. For example, if you run a global streaming service that has low latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting uses, you can stream data into BigQuery.
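
For example, the following minimal sketch publishes a JSON event to Pub/Sub for stream processing and streams the same row directly into BigQuery for analytics, using the Python client libraries. The project, topic, and table names are placeholders.

    import json
    from google.cloud import bigquery, pubsub_v1

    # Placeholder project, topic, and table names.
    PROJECT = "my-project"
    TOPIC = "clickstream-events"
    TABLE = "my-project.analytics.page_views"

    # Publish a low-latency event to Pub/Sub for downstream stream processing.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, TOPIC)
    event = {"user_id": "u-123", "page": "/home"}
    publisher.publish(topic_path, json.dumps(event).encode("utf-8")).result()

    # Alternatively, stream the row straight into BigQuery for analytics and reporting.
    bq = bigquery.Client(project=PROJECT)
    errors = bq.insert_rows_json(TABLE, [event])
    if errors:
        raise RuntimeError(f"BigQuery streaming insert failed: {errors}")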

If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template. For batch workloads, the first step is usually to ingest data into Cloud Storage. Use the gsutil tool or Storage Transfer Service to ingest data.

Ingest data with automated tools

Manually moving data from other systems into the cloud can be a challenge. If possible, use tools that let you automate the data ingestion processes. For example, Cloud Data Fusion provides connectors and plugins to bring data from external sources with a drag-and-drop GUI. If your teams want to write some code, Dataflow or BigQuery can help to automate data ingestion. Pub/Sub can help with both low-code and code-first approaches. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.
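
If your teams do write code, a minimal sketch of loading a local file into a Cloud Storage bucket with the Python client library might look like the following; the bucket and object names are placeholders.

    from google.cloud import storage

    # Placeholder bucket and object names.
    client = storage.Client()
    bucket = client.bucket("my-landing-bucket")
    blob = bucket.blob("raw/2024-01-01/export.csv")

    # Upload a local file; for transfers larger than about 1 TB,
    # prefer Storage Transfer Service as described above.
    blob.upload_from_filename("export.csv")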

Use migration tools to ingest from another data warehouse

If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance. The BigQuery Data Transfer Service also provides third-party transfers that help you ingest data on a schedule from external sources. For more information, see the detailed migration approaches for each data warehouse.

Estimate your data ingestion needs

The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second. Capacity, storage, and regional requirements for your data help you to determine whether Pub/Sub Lite is a better option for your system design. For more information, see Choosing Pub/Sub or Pub/Sub Lite.

For batch ingestion of data, estimate how much data you want to transfer in total, and how quickly you want to do it. Review the available migration options, including an estimate on time and comparison of online versus offline transfers.

Use appropriate tools to regularly ingest data on a schedule

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grained control of the timing of ingestion or of the source and destination systems, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function, as sketched below.
If you want to manage the compute infrastructure yourself, you can use the gsutil command with cron for data transfers of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
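
As a sketch of the manual approach, the following Pub/Sub-triggered Cloud Function (Python) copies newly staged objects into a landing bucket whenever Cloud Scheduler publishes a message. The function name, bucket names, and prefix are hypothetical.

    from google.cloud import storage

    def ingest(event, context):
        """Background Cloud Function triggered by a Pub/Sub message from Cloud Scheduler."""
        client = storage.Client()
        source = client.bucket("staging-bucket")        # hypothetical source bucket
        destination = client.bucket("landing-bucket")   # hypothetical destination bucket

        # Copy each staged object into the landing bucket for downstream processing.
        for blob in source.list_blobs(prefix="daily/"):
            source.copy_blob(blob, destination, blob.name)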

Review FTP/SFTP server data ingest needs

If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins. If you want to modernize and create a long-term workflow solution, Cloud Composer is a fully managed service that lets you read and write from various sources and sinks.

Use Apache Kafka connectors to ingest data

If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data using one of the Apache Kafka connectors. For example, the open source Pub/Sub Kafka connector lets you ingest data from Pub/Sub or Pub/Sub Lite.

Data storage

Apply the following data storage best practices to your own environment.

Choose the appropriate data store for your needs

To help you choose what type of storage solution to use, review and understand the downstream usage of your data. The following common use cases for your data map to a recommended Google Cloud product:

  • File-based: Filestore
  • Object-based: Cloud Storage
  • Low latency: Bigtable
  • Time series: Bigtable
  • Online cache: Memorystore
  • Transaction processing: Cloud SQL
  • Business intelligence (BI) & analytics: BigQuery
  • Batch processing: Cloud Storage

Use Bigtable if the incoming data is time series and you need low-latency access to it. Use BigQuery if you use SQL.

Review your data structure needs

For most unstructured data, such as documents and text files, audio and video files, or logs, an object-based store is the most suitable choice. You can then load and process the data from object storage when you need it.

For semi-structured data, such as XML or JSON, your use cases and data access patterns help guide your choice. You can load such datasets into BigQuery for automatic schema detection. If you have low latency requirements, you can load your JSON data into Bigtable. If you have legacy requirements or your applications work with relational databases, you can also load datasets into a relational store.

For structured data, such as CSV, Parquet, Avro, or ORC, you can use BigQuery if you have BI and analytics requirements that use SQL. For more information, see how to batch load data. If you want to create a data lake on open standards and technologies, you can use Cloud Storage.
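
For example, a minimal sketch of batch loading newline-delimited JSON from Cloud Storage into BigQuery with automatic schema detection follows; the source URI and destination table name are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema from the data
    )
    load_job = client.load_table_from_uri(
        "gs://my-landing-bucket/raw/events-*.json",   # placeholder source URI
        "my-project.analytics.events_raw",            # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the batch load to complete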

Migrate data and reduce costs for HDFS

Look for ways to move Hadoop Distributed File System (HDFS) data from on-premises or from another cloud provider to a cheaper object-storage system. Cloud Storage is the most common choice that enterprises make as an alternative data store. For information about the advantages and disadvantages of this choice, see HDFS vs. Cloud Storage.

You can move data with a push or pull method. Both methods use the hadoop distcp command. For more information, see Migrating HDFS Data from On-Premises to Google Cloud.

You can also use the open source Cloud Storage connector to let Hadoop and Spark jobs access data in Cloud Storage. The connector is installed by default on Dataproc clusters, and can be manually installed on other clusters.
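
For example, with the Cloud Storage connector available on the cluster (it's preinstalled on Dataproc), a PySpark job can read migrated HDFS data directly from a gs:// path; the bucket and paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-from-gcs").getOrCreate()

    # Read Parquet files that were copied from HDFS into Cloud Storage.
    df = spark.read.parquet("gs://my-data-lake/warehouse/events/")  # placeholder path
    df.groupBy("event_type").count().show()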

Use object storage to build a cohesive data lake

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. You can use Cloud Composer and Cloud Data Fusion to build a data lake.

To build a modern data platform, you can use BigQuery as your central data source instead of Cloud Storage. BigQuery is a modern data warehouse with separation of storage and compute. A data lake built on BigQuery lets you perform traditional analytics from BigQuery in the Cloud console. It also lets you access the data stored from other frameworks such as Apache Spark.
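
For example, with the Spark BigQuery connector available on the cluster, a PySpark job can read a BigQuery table directly; the table name is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-from-bigquery").getOrCreate()

    # Read a BigQuery table through the Spark BigQuery connector.
    df = (
        spark.read.format("bigquery")
        .option("table", "my-project.analytics.events_raw")  # placeholder table
        .load()
    )
    df.printSchema()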

Process and transform data

Apply the following data analytics best practices to your own environment when you process and transform data.

Explore the open source software you can use in Google Cloud

Many Google Cloud services use open source software to help make your transition seamless. Google Cloud offers managed and serverless solutions that have open APIs and are compatible with open source frameworks to reduce vendor lock-in.

Dataproc is a Hadoop-compatible managed service that lets you host open source software with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and Zookeeper. It also provides Hive Metastore as a managed service to remove it as a single point of failure in the Hadoop ecosystem.

You can migrate to Dataflow if you currently use Apache Beam as a batch and streaming processing engine. Dataflow is a fully managed and serverless service that uses Apache Beam. Use Dataflow to write jobs in Beam, but let Google Cloud manage the execution environment.

If you use CDAP as your data integration platform, you can migrate to Cloud Data Fusion for a fully managed experience.

Determine your ETL or ELT data-processing needs

Your team's experience and preferences help determine your system design for how to process data. Google Cloud lets you use either traditional ETL or more modern ELT data-processing systems.

Use the appropriate framework for your data use case

Your data use cases determine which tools and frameworks to use. Some Google Cloud products are built to handle all of the following data use cases, while others best support only one particular use case.

  • For a batch data processing system, you can process and transform data in BigQuery with a familiar SQL interface. If you have an existing pipeline that runs on Apache Hadoop or Spark on-premises or in another public cloud, you can use Dataproc.
    • You can also use Dataflow if you want a unified programming interface for both batch and streaming use cases. We recommend that you modernize and use Dataflow for ETL and BigQuery for ELT.
  • For streaming data pipelines, use a managed and serverless service like Dataflow that provides windowing, autoscaling, and templates (see the sketch after this list). For more information, see Building production-ready data pipelines using Dataflow.
  • For real-time use cases, such as time series analysis or streaming video analytics, use Dataflow.
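
The following minimal Apache Beam (Python) sketch shows the windowing that Dataflow provides for streaming pipelines: it reads events from Pub/Sub, counts them per user in one-minute fixed windows, and writes the results to BigQuery. The topic, table, and field names are placeholders.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    options = PipelineOptions(streaming=True)  # add Dataflow options when deploying

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/clickstream-events")
            | "Parse" >> beam.Map(lambda msg: (json.loads(msg)["user_id"], 1))
            | "Window" >> beam.WindowInto(FixedWindows(60))  # one-minute fixed windows
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: {"user_id": kv[0], "events": kv[1]})
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:analytics.event_counts",          # placeholder table
                schema="user_id:STRING,events:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )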

Retain future control over your execution engine

To minimize vendor lock-in and to be able to use a different platform in the future, use the Apache Beam programming model and Dataflow as a managed serverless solution. The Beam programming model lets you change the underlying execution engine, such as changing from Dataflow to Apache Flink or Apache Spark.
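
A minimal sketch of this portability: the same Beam pipeline code runs on Dataflow or on another runner purely by changing the pipeline options. The project and bucket paths are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Swap "--runner=DataflowRunner" for "--runner=FlinkRunner" or "--runner=SparkRunner"
    # to execute the same pipeline on a different engine.
    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",                 # placeholder project
        "--region=us-central1",
        "--temp_location=gs://my-bucket/tmp",   # placeholder bucket
    ])

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            | beam.Map(str.strip)
            | beam.io.WriteToText("gs://my-bucket/output/cleaned")
        )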

Use Dataflow to ingest data from multiple sources

To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow. Dataflow is a managed serverless service that supports Dataflow templates, which let your teams run templates from different tools.

Dataflow Prime provides horizontal and vertical autoscaling of the machines that run your pipeline. It also provides smart diagnostics and recommendations that identify problems and suggest how to fix them.

Discover, identify, and protect sensitive data

Use Sensitive Data Protection to inspect and transform structured and unstructured data. Sensitive Data Protection works for data located anywhere in Google Cloud, such as in Cloud Storage or databases. You can classify, mask, and tokenize your sensitive data to continue to use it safely for downstream processing. Use Sensitive Data Protection to perform actions such as to scan BigQuery data or de-identify and re-identify PII in large-scale datasets.
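
For example, a minimal sketch of de-identifying free text with the Sensitive Data Protection (DLP) Python client follows; the project and the info types to match are placeholders you would adjust.

    from google.cloud import dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    parent = "projects/my-project/locations/global"  # placeholder project

    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            },
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        # Replace each finding with its info type name, for example [EMAIL_ADDRESS].
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": "Contact jane@example.com or 555-0100."},
        }
    )
    print(response.item.value)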

Modernize your data transformation processes

Use Dataform to write data transformations as code and to start to use version control by default. You can also apply software development best practices such as CI/CD, unit testing, and version control to your SQL code. Dataform supports all major cloud data warehouse products and databases, such as PostgreSQL.

Data analytics and warehouses

Apply the following data analytics and warehouse best practices to your own environment.

Review your data storage needs

Data lakes and data warehouses aren't mutually exclusive. Data lakes are useful for unstructured and semi-structured data storage and processing. Data warehouses are best for analytics and BI.

Review your data needs to help determine where to store your data and which Google Cloud product is the most appropriate to process and analyze your data. Products like BigQuery can process petabytes of data and grow with your demands.

Identify opportunities to migrate from a traditional data warehouse to BigQuery

Review the traditional data warehouses that are currently in use in your environment. To reduce complexity and potentially reduce costs, identify opportunities to migrate your traditional data warehouses to a Google Cloud service like BigQuery. For more information and example scenarios, see Migrating data warehouses to BigQuery.

Plan for federated access to data

Review your data requirements and how you might need to interact with other products and services. Identify your data federation needs, and create an appropriate system design.

For example, BigQuery lets you define external tables that can read data from other sources, such as Bigtable, Cloud SQL, Cloud Storage, or Google Drive. You can join these external sources with tables that you store in BigQuery.
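
For example, the following sketch uses the BigQuery Python client to define an external table over a CSV file in Cloud Storage and then joins it with a native table; all project, dataset, and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Define an external table that reads directly from Cloud Storage.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-landing-bucket/ref/products.csv"]  # placeholder URI
    external_config.autodetect = True

    table = bigquery.Table("my-project.analytics.products_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Join the external source with a native BigQuery table.
    query = """
        SELECT o.order_id, p.product_name
        FROM `my-project.analytics.orders` AS o
        JOIN `my-project.analytics.products_external` AS p USING (product_id)
    """
    for row in client.query(query).result():
        print(row)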

Use BigQuery flex slots to provide on-demand burst capacity

Sometimes you need extra capacity to do experimental or exploratory analysis that needs a lot of compute resources. BigQuery lets you get additional compute capacity in the form of flex slots. These flex slots help you when there's a period of high demand or when you want to complete an important analysis.
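
As a sketch, assuming the BigQuery Reservation API Python client, you might purchase a short-term flex slot commitment programmatically as follows; the project, location, and slot count are placeholders, and you can also do this in the console or with the bq tool.

    from google.cloud import bigquery_reservation_v1 as reservation

    client = reservation.ReservationServiceClient()
    parent = "projects/my-project/locations/US"  # placeholder admin project and location

    # Purchase a short-term (flex) commitment of 100 slots for burst capacity.
    commitment = reservation.CapacityCommitment(
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
        slot_count=100,
    )
    client.create_capacity_commitment(parent=parent, capacity_commitment=commitment)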

Understand schema differences if you migrate to BigQuery

BigQuery supports both star and snowflake schemas, but by default it uses nested and repeated fields. Nested and repeated fields can be easier to read and correlate compared to other schemas. If your data is represented in a star or snowflake schema, and if you want to migrate to BigQuery, review your system design for any necessary changes to processes or analytics.
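
For example, the following sketch uses the BigQuery Python client to create a table with a nested, repeated field instead of a separate dimension table; the table and field names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # "line_items" is a repeated RECORD: each order row embeds its own line items,
    # which in a star or snowflake schema would live in a separate table.
    schema = [
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField(
            "line_items", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
                bigquery.SchemaField("unit_price", "NUMERIC"),
            ],
        ),
    ]

    table = bigquery.Table("my-project.sales.orders", schema=schema)  # placeholder table
    client.create_table(table, exists_ok=True)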

Reports and visualization

Apply the following reporting and visualization best practices to your own environment.

Use BigQuery BI Engine to visualize your data

BigQuery BI Engine is a fast, in-memory analysis service. You can use BI Engine to analyze data stored in BigQuery with subsecond query response time and with high concurrency. BI Engine is integrated into the BigQuery API. Use reserved BI Engine capacity to manage the on-demand or flat-rate pricing for your needs. BI Engine can also work with other BI or custom dashboard applications that require subsecond response times.

Modernize your BI processes with Looker

Looker is a modern, enterprise platform for BI, data applications, and embedded analytics. You can create consistent data models on top of your data with speed and accuracy, and you can access data inside transactional and analytical datastores. Looker can also analyze your data on multiple databases and clouds. If you have existing BI processes and tools, we recommend that you modernize and use a central platform such as Looker.

Use workflow management tools

Data analytics involves many processes and services. Data moves across different tools and processing pipelines during the data analytics lifecycle. To manage and maintain end-to-end data pipelines, use appropriate workflow management tools. Cloud Composer is a fully managed workflow management tool based on the open source Apache Airflow project.

You can use Cloud Composer to launch Dataflow pipelines and to use Dataproc Workflow Templates. Cloud Composer can also help you create a CI/CD pipeline to test, synchronize, and deploy DAGs or use a CI/CD pipeline for data-processing workflows. For more information, watch Cloud Composer: Development best practices.
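
As a sketch, assuming the Google provider package in your Cloud Composer environment, a DAG that launches a templated Dataflow job on a daily schedule might look like the following; the DAG ID, template path, and parameters are placeholders.

    from datetime import datetime

    from airflow import models
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowTemplatedJobStartOperator,
    )

    with models.DAG(
        dag_id="daily_ingest",                 # placeholder DAG ID
        schedule_interval="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        load_events = DataflowTemplatedJobStartOperator(
            task_id="load_events",
            template="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",  # placeholder template
            location="us-central1",
            parameters={  # placeholder parameters for the chosen template
                "inputFilePattern": "gs://my-landing-bucket/raw/*.json",
                "outputTable": "my-project:analytics.events_raw",
            },
        )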

Migration resources

If you already run a data analytics platform and you want to migrate some or all of the workloads to Google Cloud, review migration resources such as Migrating data warehouses to BigQuery and Migrating HDFS Data from On-Premises to Google Cloud for best practices and guidance.

What's next

Learn about system design best practices for Google Cloud AI and machine learning.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.