Navigating Google Cloud: a decision tree for data & analytics workloads

July 25, 2023

Priyanka Vergadia

Staff Developer Advocate, Google Cloud

Alicia Williams

Developer Advocate

Google Cloud provides a wide range of services for running data and analytics workloads, which can mean sifting through a lot of information when choosing the right tools for your specific use cases. Each workload requires a unique set of services, from data ingestion and processing to storage, governance, and orchestration. To simplify the decision-making process, we've developed a handy decision tree that provides a roadmap for researching and selecting the best services based on your specific needs.

https://storage.googleapis.com/gweb-cloudblog-publish/images/decision-tree-da-v2.max-2200x2200.png

A decision tree for data and analytics workloads on Google Cloud.

In this post, we'll break down each workload area and how to choose the right Google Cloud services to match.

Data Ingestion

The first step in any data and analytics workflow is getting the data into your system. Ingestion may be a one-time bulk load as part of a migration, or a recurring need once a workload is up and running. Depending on the type of data you're ingesting and where it's coming from, you may need to use different services.

For real-time data ingestion, there are a few options to choose from:

Datastream: If your data is coming from an operational database, look at Datastream, a serverless, real-time ingestion service that uses non-intrusive change data capture (CDC) to replicate data reliably into BigQuery and enable streaming and operational analytics. It integrates with Dataflow templates and Data Fusion for building custom workflows with advanced transformation capabilities.
Pub/Sub: If you need to process and analyze data as it arrives, Pub/Sub is the ideal choice. It is a fully managed messaging service designed for real-time data ingestion that integrates directly with our data processing services, including BigQuery.

For batch data ingestion, there are multiple options to choose from:

Cloud Storage: A very convenient way to import data into Google Cloud is to use object storage buckets. You can use the command line tool, gsutil, which optimizes the movement of data from a client or other buckets to a Cloud Storage bucket while maximizing the level of parallelism.
Storage Transfer Service: If you are transferring large amounts of data from on-premises or from other clouds, you can use Storage Transfer Service.
Transfer Appliance: If you need to transfer large amounts of on-premises data over low bandwidth, Transfer Appliance provides a more secure and efficient option using a physical device you ship to Google Cloud.
BigQuery Data Transfer Service: If you are specifically ingesting data from SaaS or third-party apps into your BigQuery data warehouse, you can use BigQuery Data Transfer Service, which provides pre-built connectors for popular data sources, along with scheduling, monitoring, and management features.
Dataflow: As part of its broader data processing capabilities, Dataflow lets you reliably manage large, complex, and parameterized data ingestion across thousands of sources.
Dataproc: You can also use Dataproc, a fully managed Hadoop/Spark service that is 100% open source. Dataproc enables you to ingest data from on-premises or other clouds via ready-to-use, configurable templates powered by Dataproc Serverless.
Data Fusion: Data Fusion enables you to ingest batch data with a point-and-click interface via 150+ connectors (and with code-free analysis too!).
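The ingestion choices above can be read as a small set of branching questions. The following is a minimal sketch of that reading in Python, not an official selection tool: the `IngestionNeed` fields, the `suggest_ingestion_service` function, and the order in which the questions are asked are all assumptions made for illustration. Only the service names come from the lists above.

```python
from dataclasses import dataclass


@dataclass
class IngestionNeed:
    """Hypothetical description of one ingestion requirement."""
    real_time: bool = False        # must data be handled as it arrives?
    source: str = "files"          # "operational_db", "saas", or "files"
    large_transfer: bool = False   # large volume from on-premises or another cloud
    low_bandwidth: bool = False    # limited network capacity for that volume
    code_free: bool = False        # prefer a point-and-click interface
    uses_spark: bool = False       # team already runs Hadoop/Spark jobs
    many_sources: bool = False     # large, complex, parameterized ingestion


def suggest_ingestion_service(need: IngestionNeed) -> str:
    """Return one service name; the branch order is an assumption."""
    if need.real_time:
        # Operational databases are replicated with CDC; other streams use messaging.
        return "Datastream" if need.source == "operational_db" else "Pub/Sub"
    if need.source == "saas":
        return "BigQuery Data Transfer Service"
    if need.large_transfer:
        return "Transfer Appliance" if need.low_bandwidth else "Storage Transfer Service"
    if need.code_free:
        return "Data Fusion"
    if need.uses_spark:
        return "Dataproc"
    if need.many_sources:
        return "Dataflow"
    return "Cloud Storage"  # default: copy files into a bucket
```

For example, `suggest_ingestion_service(IngestionNeed(real_time=True, source="operational_db"))` returns `"Datastream"`. Real workloads often combine several of these services, so treat the single return value as a starting point for research rather than a final answer.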

Data Processing

Once your raw data is ingested, you'll likely need to process it into a more usable form. Data processing can include activities such as cleaning, filtering, aggregating, and transforming data to make it more accessible, organized, and understandable. The specific Google Cloud tools you use will depend on where and how you want to process your data before storing it in your data lakes, databases, and data warehouses.

Dataflow: For a fully managed, serverless, scalable, and reliable service for both batch and streaming data processing using Apache Beam and programming languages including Java, Python, and Go, head over to Dataflow.
Dataproc: For your Apache Hadoop/Spark workloads, you can use Dataproc to process vast amounts of data stored in different file formats, including table formats such as Delta, Iceberg, or Hudi.
Data Fusion: If you need code-free processing, you can use Data Fusion, which supports a variety of transformation tasks.
BigQuery: If your workload can be managed with SQL-based ELT processing, you can benefit from the price-performance advantages of BigQuery, a serverless, highly scalable, and cost-effective cloud data warehouse.
Cloud Data Loss Prevention: Cloud DLP is a fully managed service that helps you discover, classify, and protect sensitive data. As part of your data processing pipeline, it can apply de-identification in migrations, data workloads, and real-time data collection and processing.
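Unlike ingestion, the processing options are not mutually exclusive: Cloud DLP, for instance, can sit alongside any of the other services. One way to sketch this is a rule table in which every matching rule contributes a candidate. The `Workload` fields, the `RULES` table, and the `candidate_processing_services` function below are hypothetical names chosen for illustration, and the mapping from framework to service is a simplification of the list above.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    """Hypothetical summary of a processing workload."""
    framework: str = ""               # "beam", "spark", "hadoop", "sql", or "" if undecided
    code_free: bool = False           # team wants to avoid writing code
    has_sensitive_data: bool = False  # pipeline handles data that needs de-identification


# Ordered (predicate, service) pairs; every matching rule contributes a candidate.
RULES = [
    (lambda w: w.framework == "beam", "Dataflow"),
    (lambda w: w.framework in ("spark", "hadoop"), "Dataproc"),
    (lambda w: w.code_free, "Data Fusion"),
    (lambda w: w.framework == "sql", "BigQuery"),
    (lambda w: w.has_sensitive_data, "Cloud Data Loss Prevention"),
]


def candidate_processing_services(workload):
    """Return every service whose rule matches, in rule order."""
    return [service for matches, service in RULES if matches(workload)]
```

A SQL-based workload that handles sensitive data, `Workload(framework="sql", has_sensitive_data=True)`, yields both BigQuery and Cloud Data Loss Prevention as candidates. Returning a list rather than a single name reflects that these services are often used together in one pipeline.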

Data Storage

Next, it's time to store your data securely and efficiently to easily access, analyze, and use it in downstream applications such as business intelligence or machine learning. There are multiple options for storing data in Google Cloud and the specific service you choose will depend on your use case. Here are a few focused on storage for data and analytics workloads:

Cloud Storage: A good place to start for data lake storage is Cloud Storage, a scalable, durable, and highly available object storage service that can store structured, semi-structured, and unstructured data. It offers dual-region storage, which provides redundancy with low latency, requires no manual replication, and manages failover if needed.
BigQuery: For structured or semi-structured data (native JSON type, nested fields), store it in BigQuery and get access to fast SQL analytics.
Filestore: If your use case requires especially high performance and low latency, such as I/O-driven analytic training workloads, take a look at using Filestore.

As your data may be stored across BigQuery, Cloud Storage, and even other clouds, it's important to unify it and make it accessible using BigLake. BigLake is a data access engine that enables you to unify, manage, and analyze data across your data lakes and data warehouses. It provides increased performance and allows extra levels of governance and security, including column- and row-level security.
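The storage guidance above can be sketched as a per-dataset choice plus a check for whether the resulting plan spans both a data lake and a data warehouse, which is the situation where BigLake is suggested. This is an illustrative simplification under stated assumptions: the function names, the `kind` labels, and the rule that structured data always goes to BigQuery are hypothetical, and Cloud Storage can of course hold structured data too.

```python
def suggest_storage(kind, high_performance_io=False):
    """Pick one storage service for a dataset (simplified, hypothetical rules)."""
    if high_performance_io:
        return "Filestore"      # especially high performance and low latency
    if kind in ("structured", "semi-structured"):
        return "BigQuery"       # SQL analytics over tables, JSON, nested fields
    return "Cloud Storage"      # data lake object storage for everything else


def plan_storage(datasets):
    """Map each dataset name to a service and flag when BigLake may help.

    `datasets` maps a name to a (kind, high_performance_io) tuple.
    """
    plan = {
        name: suggest_storage(kind, fast_io)
        for name, (kind, fast_io) in datasets.items()
    }
    # BigLake is suggested when data lives in both the lake and the warehouse.
    consider_biglake = {"BigQuery", "Cloud Storage"} <= set(plan.values())
    return plan, consider_biglake
```

Calling `plan_storage({"orders": ("structured", False), "images": ("unstructured", False)})` places `orders` in BigQuery and `images` in Cloud Storage, and sets the BigLake flag because the plan spans both.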

Governance

It is increasingly important for companies to establish guidelines and best practices for data management to ensure that data is accurate, consistent, protected, and compliant with regulations. Data governance can include activities such as data cataloging, data lineage, data quality management, PII identification, and data access control.

Dataplex helps you with these tasks and centralizes governance across your data lakes, data warehouses, and data marts in Google Cloud and beyond. Within Dataplex, you can use Data Catalog, a fully managed metadata repository, to help you discover, understand, and enrich your data.

You will also find governance-related features built directly into Google Cloud products. For example, BigQuery supports customer-managed encryption keys (CMEK) and column- and row-level security. This functionality extends to object storage via BigLake tables.

Orchestration

Finally, you'll want to coordinate and manage your workflow’s various components using orchestration. Orchestration can include defining pipelines, scheduling data processing jobs, and monitoring your data pipelines to ensure that your data is processed in a timely and efficient manner.

Google Cloud offers two orchestration services:

Composer: You can write, schedule, and monitor your data pipelines using this fully managed Apache Airflow service, which integrates with the data processing options mentioned above.
Dataform: If you want to build and manage ETL/ELT data pipelines using SQL, Dataform allows you to develop and operationalize scalable data transformation pipelines in BigQuery.
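At its core, orchestration means running tasks in an order that respects their dependencies, which is why Airflow pipelines are expressed as directed acyclic graphs (DAGs). The sketch below illustrates only that idea using Python's standard-library `graphlib` module (Python 3.9+). It is not the Airflow, Composer, or Dataform API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "check_quality": {"clean"},
    "load_warehouse": {"aggregate", "check_quality"},
}


def run_pipeline(dag, actions):
    """Call each task's action once, only after all of its dependencies have run."""
    completed = []
    for task in TopologicalSorter(dag).static_order():
        actions[task]()
        completed.append(task)
    return completed


# Placeholder actions; a real pipeline would launch processing jobs here.
actions = {task: (lambda: None) for task in PIPELINE}
order = run_pipeline(PIPELINE, actions)
```

Here `ingest` always runs first and `load_warehouse` last, while `aggregate` and `check_quality` may run in either order because neither depends on the other. A managed orchestrator adds what this sketch leaves out, such as scheduling, retries, and monitoring.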

Data Consumption

With your data workflows in place, you're ready to take your data wherever you want it to go next!

Want to perform fast SQL analytics? Head to BigQuery.
Want to securely share data and insights at scale without moving the data? Use Analytics Hub.
Want to visualize data or create dashboards for reporting? Looker Studio is a robust and intuitive BI tool.
Want to develop custom machine learning models with your data? Let Vertex AI unify your machine learning workflows end-to-end.

Next steps

Data and analytics workloads involve multiple stages, from ingesting data from various sources to processing, storing, governing, orchestrating, and sharing the data. We want to make it as easy as possible for you to find the right tools and technologies to match your needs, so bookmark this decision tree and keep a lookout as we publish more decision trees for other cloud workloads in the future.

Let us know what you think of this post and the decision tree by heading over to the Cloud Analytics Discord channel! Just make sure you've joined Innovators and the Google Developers Discord.
