Google Cloud for AWS Professionals: Big Data

Updated October 30, 2019

This article compares the big data services that Amazon provides through Amazon Web Services (AWS) with those that Google provides through Google Cloud.

The article discusses the following service types:

  • Ingestion services, which are used to ingest data from a source environment into a reliable and stable target environment or data type.
  • Transformation and preparation services, which allow you to filter, extract, and transform data from one data type or model to another.
  • Warehousing and analytics services, which allow you to store, analyze, visualize, and interact with the processed data.

Ingestion services

This section compares ways to ingest data in both AWS and Google Cloud.

Connectivity

For some initial migrations, and especially for ongoing data ingestion, you typically use a high-bandwidth network connection between your destination cloud and another network. The following table summarizes AWS and Google Cloud connectivity options.

| Feature | AWS | Google Cloud |
| --- | --- | --- |
| Virtual private network | Site-to-Site VPN | Cloud VPN |
| Private connectivity to a VPC | Direct Connect | Dedicated Interconnect, Partner Interconnect |
| High-speed connectivity to other cloud services | Direct Connect | Direct Peering, Carrier Peering |

For more information about the Google Cloud options, see the Private connectivity to other networks section in Google Cloud for AWS Professionals: Networking.

Stream ingestion

Amazon Kinesis Data Streams and Google Pub/Sub can both be used to ingest data streams into their respective cloud environments. However, each service accomplishes this task using different service models.

Service model comparison

The following table compares features of Amazon Kinesis Data Streams and Pub/Sub.

| Feature | Amazon Kinesis Data Streams | Pub/Sub |
| --- | --- | --- |
| Unit of deployment | Stream | Topic |
| Unit of provisioning | Shard | N/A (fully managed) |
| Data unit | Record | Message |
| Data source | Producer | Publisher |
| Data destination | Consumer | Subscriber |
| Data partitioning | User-supplied partition key | N/A (fully managed) |
| Retention period | Up to 7 days | Up to 7 days |
| Data delivery order | Service-supplied sequence key (best effort) | Service-supplied publish time (best effort) |
| Max data size | 1 MB | 10 MB |
| Deployment locality | Regional | Global |
| Pricing model | Per shard-hour, PUT payload units, and optional data retention | Message ingestion and delivery, and optional message retention |

Amazon Kinesis Data Streams

Amazon Kinesis Data Streams uses a streaming model to ingest data. In this model, producers send data to a stream that you create and provision by shard. Each shard in a stream provides a maximum of 1 MB per second of input bandwidth and 1,000 record writes per second.

Users send data to Amazon Kinesis Data Streams by using the low-level REST API or the higher-level Kinesis Producer Library (KPL). This data is stored in data records that consist of the following:

  • An incremental sequence number
  • A user-supplied partition key
  • A data blob

The partition key is used to load-balance the records across the available shards. By default, records are retained for 24 hours. However, users can increase this retention period to a maximum of 7 days.

The user sets up a consumer application that retrieves the data records from the stream on a per-shard basis, and then processes them. The application is responsible for multiplexing across the available shards. Incorporating Amazon's Kinesis Client Library into your application simplifies this multiplexing across shards, and also manages load balancing and failure management across the cluster of consumer application nodes.
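As an illustration of this flow, the following minimal sketch uses the boto3 library to put a record and then read it back from a single shard. The stream name, region, and payload are hypothetical, and a production consumer would normally use the Kinesis Client Library rather than reading shards directly.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer: the partition key determines which shard receives the record.
kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"user": "alice", "action": "click"}).encode(),
    PartitionKey="alice",
)

# Simplified consumer: read one shard from the oldest available record.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["PartitionKey"], record["Data"])
```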

Pub/Sub

Pub/Sub presents a messaging service that uses a publisher/subscriber model. After you create a Pub/Sub topic, you can publish data to that topic, and each application that subscribes to the topic can retrieve the ingested data from the topic. This approach eliminates the need to define a specific capacity, such as the number of shards.

Each application that is registered with Pub/Sub can retrieve messages by using either a push model or a pull model:

  • In the push model, the Pub/Sub server sends a request to the subscriber application at a preconfigured URL endpoint.
  • In the pull model, the subscriber application requests messages from the server, and then acknowledges receipt when the messages arrive. Pull subscribers can retrieve messages either asynchronously or synchronously.

Each data message published to a topic must be base64-encoded and no larger than 10 MB. At the time of ingestion, Pub/Sub adds a messageId attribute and a publishTime attribute to each data message. The messageId attribute is a message ID that is guaranteed to be unique within the topic, and the publishTime attribute is a timestamp added by the system at the time of data ingestion. Publishers can add attributes to the data in the form of key-value pairs.
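For comparison, the following sketch publishes a message with an attribute and consumes it with a pull subscriber using the google-cloud-pubsub client library. The project, topic, and subscription names are placeholders, and the client library handles the base64 encoding that the underlying REST API requires.

```python
from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder project and resource names

# Publisher: data must be bytes; attributes are optional key-value strings.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
future = publisher.publish(topic_path, b'{"user": "alice"}', origin="web")
print("Published message ID:", future.result())

# Pull subscriber: acknowledge each message after processing it.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

def callback(message):
    print(message.message_id, message.publish_time, message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # block briefly for demonstration
except Exception:
    streaming_pull.cancel()
```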

Data order

This section describes how Amazon Kinesis Data Streams and Pub/Sub manage the ordering of data that's requested by a consumer or subscriber application.

Amazon Kinesis Data Streams

By default, Amazon Kinesis Data Streams maintains data order through the use of the partition key and the sequence number. When a producer adds a record to a stream, the producer provides a partition key that determines the shard to which the record will be sent. The shard adds an incremental sequence number to the record, and then stores the record reliably.

Consumer applications request records by shard, and receive the records in sequence number order. This model aids per-shard ordering, but ordering is not guaranteed if the consumer application makes requests across shards. In addition, a record can be delivered to a consumer more than once, so the application must enforce exactly-once semantics. For more information, see Handling Duplicate Records in the Amazon Kinesis Data Streams documentation.

Pub/Sub

Pub/Sub delivers messages on a best-effort basis, using the system-supplied publishTime attribute to deliver messages in the order that they were published. Pub/Sub does not guarantee only-once or in-order delivery: on occasion, a message might be delivered more than once, and out of order. Your subscriber should be idempotent when processing messages and, if necessary, it should be able to handle messages that are received out of order.

You can achieve stricter ordering by using application-supplied sequence numbers and buffering consumed messages. If the final target of your data is a persistent storage service that supports time-based queries, such as Cloud Firestore or BigQuery, you can also view data in a strict order by sorting your queries by timestamp. If your target is Dataflow, you can use record IDs to establish exactly-once processing.
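As an example of the buffering approach, the sketch below assumes the publisher attaches a monotonically increasing seq attribute to each message, and it releases messages to the application strictly in that order. The attribute name, resource names, and handler are hypothetical.

```python
import threading

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "ordered-sub")

lock = threading.Lock()
buffered = {}   # out-of-order messages, keyed by sequence number
next_seq = 0    # next sequence number the application expects


def handle(message):
    """Application-specific processing; here we just print the payload."""
    print(message.attributes["seq"], message.data)


def callback(message):
    global next_seq
    with lock:  # callbacks can run on multiple threads
        buffered[int(message.attributes["seq"])] = message
        # Release any consecutive run of buffered messages, in order.
        while next_seq in buffered:
            msg = buffered.pop(next_seq)
            handle(msg)
            msg.ack()
            next_seq += 1


subscriber.subscribe(subscription_path, callback=callback).result()
```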

Operations

This section examines operational and maintenance overhead for production workloads on each service.

Amazon Kinesis Data Streams

Because Amazon Kinesis Data Streams users must scale shards up and down manually, they might need to monitor usage with Amazon CloudWatch and modify scale as needed. This scaling process is called resharding, and can only be done on a shard-by-shard basis. Resharding supports two operations: a shard can be split into two shards, or two shards can be merged into a single shard. As such, doubling the capacity of N shards requires N individual shard-split operations.
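For example, doubling the capacity of a stream with boto3 means splitting each open shard at the midpoint of its hash key range, waiting for the stream to return to the ACTIVE state between operations. The stream name below is a placeholder.

```python
import boto3

kinesis = boto3.client("kinesis")
stream_name = "my-stream"

shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]

for shard in shards:
    # Skip shards that were already closed by earlier reshard operations.
    if "EndingSequenceNumber" in shard["SequenceNumberRange"]:
        continue
    hash_range = shard["HashKeyRange"]
    midpoint = (int(hash_range["StartingHashKey"]) +
                int(hash_range["EndingHashKey"])) // 2
    kinesis.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(midpoint),
    )
    # Each split puts the stream into UPDATING; wait until it is ACTIVE again.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
```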

Due to the fixed nature of shards, you should account for each shard's capacity individually in your design. For example, if you choose a partition key that directs a spike in traffic to a single shard, that spike could overwhelm the shard's ingestion capacity, and resharding alone can't prevent the problem from recurring. In cases like these, the only way to address the issue permanently is to redesign the application with a different partition key.

You can avoid the shard management of Kinesis Data Streams by using Kinesis Data Firehose. Kinesis Data Firehose automates the management, monitoring, and scaling of Kinesis streams for one specific use case: aggregating data from a stream into Amazon S3 or Amazon Redshift. Users specify an S3 bucket or Redshift cluster, and Kinesis Firehose creates and manages a stream on the user's behalf, depositing the data in specified intervals into the specified location.
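With Firehose, the producer side reduces to writing records to the delivery stream and letting the service handle buffering and delivery. The delivery stream name in the following sketch is a placeholder.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and writes them to the configured S3 bucket or
# Redshift cluster at the intervals specified when the stream was created.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"user": "alice", "action": "click"}) + "\n").encode()},
)
```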

Amazon Kinesis is a regional service, with streams scoped to specific regions. As such, all ingested data must travel to the region in which the stream is defined.

Pub/Sub

Pub/Sub does not require provisioning, and handles sharding, replication, and scaling for you.

In addition, you don't need to use partition keys—Pub/Sub manages data partitioning on your behalf. Though these features greatly reduce managerial overhead, they also mean that Pub/Sub can make fewer guarantees about message ordering.

Pub/Sub uses Google's HTTP(S) load balancer to support data ingestion globally across all Google Cloud regions. When a publisher publishes data to Pub/Sub, Google's HTTP(S) load balancer automatically directs the traffic to Pub/Sub servers in an appropriate region in order to minimize latency.

Costs

Amazon Kinesis Data Streams is priced by shard hour, data volume, and data retention period. By default, data is retained for 24 hours. Increasing the retention period incurs additional costs. Because Amazon Kinesis Data Streams uses a provisioned model, you must pay for the resources you provision even if you do not use the resources.

Amazon Kinesis Data Firehose is priced by data volume.

Pub/Sub is priced by data volume. Because Pub/Sub does not require resource provisioning, you pay for only the resources you consume.

Bulk ingestion

AWS Snowball and Google Transfer Appliance can both be used to ingest data in bulk into their respective cloud environments.

AWS Snowball comes in 50 TB (North America only) and 80 TB versions. Transfer Appliance comes in a 100 TB version known as the TA100, and a 480 TB version known as the TA480.

Summary comparison

The following table compares features of AWS Snowball and Google Transfer Appliance.

| Feature | AWS Snowball | Transfer Appliance |
| --- | --- | --- |
| Capacity per unit | 50 TB or 80 TB | 100 TB or 480 TB |
| Maximum transfer rate | 10 Gbps | 20 Gbps for TA100, 40 Gbps for TA480 (both with automatic link aggregation) |
| Email status updates? | No | Yes |
| Rack-mountable? | No | Yes for TA100, no for TA480 |
| Use fee | $200 for 50 TB, $250 for 80 TB | $300 for TA100, $1,800 for TA480 |
| Daily fee | $15/day after 10 days | $30/day after 10 days for TA100, $90/day after 25 days for TA480 |
| Transfer modes | Push | Push or pull |
| Transfer data out of object store? | Yes | No |

Operations

The two services have a similar workflow (receive shipment, set up, transfer data, ship back), but there are some important differences in how you set them up and load data onto them.

AWS Snowball is not rack-mountable. Instead, it's meant to be free standing, similar to an ATX PC case. The Transfer Appliance TA100 model comes in a 2U rack-mountable form for use in data centers. The TA480 model arrives in its own case with casters; it is not rack-mountable.

Perhaps the largest contrast between AWS Snowball and Transfer Appliance is networking throughput. Both support 1 Gbps or 10 Gbps over an RJ-45 connection and 10 Gbps over a fiber optic connection. However, both Transfer Appliance models offer four 10 Gbps Ethernet ports with adaptive load-balancing link aggregation, making it possible to achieve multi-stream throughput well above 10 Gbps.

You set up Snowball using its built-in E Ink touch screen. Transfer Appliance requires a VGA display and a USB keyboard to access its console, which you use to configure a web-based management console. From there, you perform all administration remotely in a web browser.

For getting data onto the device, both Snowball and Transfer Appliance offer workstation client push models. Snowball also offers an Amazon S3 API push. Transfer Appliance offers both NFS Pull (where it acts as an NFS client) and NFS Push (where it acts as an NFS server) transfer modes.

For both Snowball and Transfer Appliance, you return the device through a shipping carrier.

Finally, when your data is loaded into object storage, there is one important difference between the AWS and Google devices. For Snowball, decryption of the device data is included in the service. Transfer Appliance customers must use a rehydrator Compute Engine virtual appliance to decrypt the device data; normal Compute Engine VM pricing applies to rehydrator instances.

Object storage transfer

Because you might be considering moving Big Data workloads from AWS to Google Cloud, Google's Storage Transfer Service might be helpful to you. You can use Storage Transfer Service to create one-time or recurring jobs to copy data from Amazon S3 buckets to Google Cloud Storage buckets. Other data sources are supported as well.
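As a rough sketch of what such a job looks like, the following example creates a one-time transfer job through the storagetransfer v1 API using the Google API client library. The project, bucket names, dates, and AWS credentials are placeholders, and the field names follow the documented TransferJob resource.

```python
import googleapiclient.discovery

storagetransfer = googleapiclient.discovery.build("storagetransfer", "v1")

transfer_job = {
    "description": "One-time copy from S3 to Cloud Storage",
    "status": "ENABLED",
    "projectId": "my-project",
    "schedule": {
        # Identical start and end dates make this a one-time job.
        "scheduleStartDate": {"year": 2019, "month": 11, "day": 1},
        "scheduleEndDate": {"year": 2019, "month": 11, "day": 1},
    },
    "transferSpec": {
        "awsS3DataSource": {
            "bucketName": "my-s3-bucket",
            "awsAccessKey": {"accessKeyId": "AKIA...", "secretAccessKey": "..."},
        },
        "gcsDataSink": {"bucketName": "my-gcs-bucket"},
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print("Created transfer job:", result["name"])
```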

Transformation and preparation

After you've ingested your data into your cloud environment, you can transform the data, filtering and processing it as needed.

This document covers three categories of services to perform this work: partially managed ETL, fully managed ETL, and stream transformation.

Partially managed ETL

A common approach to data transformation tasks is to use tools based on Apache Hadoop, which typically provide flexible and scalable batch processing. Both Google Cloud and AWS offer managed Hadoop services: Google Dataproc and Amazon Elastic MapReduce (EMR) both provide automatic provisioning and configuration, simple job management, sophisticated monitoring, and flexible pricing. For stream-based data, both Dataproc and Amazon EMR support Apache Spark Streaming.

In addition, Google Cloud provides Dataflow, which is based on Apache Beam rather than on Hadoop. While Apache Spark Streaming treats streaming data as small batch jobs, Dataflow is a native stream-focused processing engine.

Service model comparison

The following table compares features of Amazon EMR, Dataproc, and Dataflow.

| Feature | Amazon Elastic MapReduce | Google Dataproc | Google Dataflow |
| --- | --- | --- | --- |
| Open source library | Apache Hadoop and Apache Spark | Apache Hadoop and Apache Spark | Apache Beam |
| Scaling | Manual | Manual | Auto |
| Unit of deployment | Cluster | Cluster | Pipeline |
| Unit of scale | Nodes (master, core, and task) | Nodes (master and worker) | Workers |
| Unit of work | Step | Job | Job |
| Programming model | MapReduce, Apache Hive, Pig, Flink, Spark, Spark SQL, PySpark | MapReduce, Apache Hive, Pig, Flink, Spark, Spark SQL, PySpark | Apache Beam |

Dataproc and Amazon EMR

Dataproc and Amazon EMR have similar service models. Each is a scalable platform for filtering and aggregating data, and each is tightly integrated with Apache's big data tools and services, including Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig.

In both Dataproc and Amazon EMR, you create a cluster that consists of a number of nodes. The service creates a single master node and a variable number of worker nodes. Amazon EMR further classifies worker nodes into core nodes and task nodes.

After a cluster has been provisioned, the user submits an application—called a job in Dataproc and in Amazon EMR—for execution by the cluster. Application dependencies are typically added by the user to the cluster nodes using custom Bash scripts called initialization actions in Dataproc and bootstrap actions in Amazon EMR. Applications typically read data from stable storage, such as Amazon S3, Cloud Storage, or HDFS, and then process the data using an Apache data processing tool or service. After the data has been processed, the resulting data can be further processed or pushed back to stable storage.
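A typical job submitted to either service is simply an Apache Spark application. The following PySpark sketch, for example, reads text files from object storage, computes word counts, and writes the results back. The bucket paths are placeholders, and on Amazon EMR you would use s3:// paths instead of gs://.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read from stable storage (Cloud Storage here; Amazon S3 or HDFS work the same way).
lines = spark.read.text("gs://my-bucket/input/*.txt")

# Split lines into words and count occurrences.
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.filter(F.col("word") != "").groupBy("word").count()

# Write the processed data back to stable storage.
counts.write.csv("gs://my-bucket/output/word-counts")

spark.stop()
```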

Dataflow

Dataflow uses the Apache Beam programming model to perform batch processing and stream processing. This model offers improved flexibility and expressiveness when compared to the Apache Spark model used by Amazon EMR and Dataproc, particularly for real-time data processing.

In Dataflow, you specify an abstract pipeline, using a Dataflow SDK library to provide the primitives for parallel processing and for aggregation. When specifying a pipeline, the user defines a set of transformations that are then submitted for execution in the pipeline. These transformations are in turn mapped to a set of worker nodes that are provisioned and configured for execution by Dataflow. Some nodes might be used for reading data from Pub/Sub, and others might perform other downstream transformations; the details are managed by the Dataflow runtime.
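For example, a minimal batch pipeline built with the Apache Beam Python SDK might look like the following. The project, region, and bucket values are placeholders, and running it with the DirectRunner instead of the DataflowRunner executes the same code locally.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # or "DirectRunner" for local execution
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/word-counts")
    )
```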

The Dataflow model, SDKs, and pipeline runners have been open sourced as Apache Beam. As a result, Dataflow applications can also be executed in a Flink or Spark cluster, or in a local development environment.

For a detailed comparison of the Apache Beam and Apache Spark programming models, see Dataflow/Beam & Spark: A Programming Model Comparison.

Scaling

This section discusses how to manage scaling with Amazon EMR, Dataproc, and Dataflow.

Dataproc and Amazon EMR

Amazon EMR and Dataproc allow you to manually adjust the number of nodes in a cluster after the cluster is started. You can determine the size of the cluster, as well as the scaling actions, by monitoring the performance and usage of the cluster to decide how to manage it. In both services, users pay for the number of nodes that are provisioned.

Dataflow

With Dataflow, you specify only the maximum number of nodes. The Dataflow runtime system then autoscales the nodes, actively managing node provisioning and allocation to different parts of the pipeline as needed.

Costs

This section discusses how costs are assessed for Amazon EMR, Dataproc, and Dataflow.

Amazon EMR and Dataproc

Both Amazon EMR and Dataproc support on-demand pricing as well as discounts for short-term and long-term use. Both services are priced by the second. To reduce the cost of nodes, Amazon EMR users can pre-purchase reserved instances. Dataproc automatically provides sustained-use discounts.

In addition, each service offers options for buying discounted surplus compute capacity. Amazon EMR supports provisioning nodes using Amazon EC2 Spot Instances, in which unused capacity is auctioned to users in short-term increments. These nodes can be reclaimed by EC2, but the cluster continues processing as nodes are added or removed. Similarly, Dataproc supports preemptible VMs that can be reclaimed at any time. Preemptible VMs are not auctioned through a market. Instead, they offer a fixed hourly discount for each Compute Engine machine type.

For a detailed comparison of managed Hadoop pricing for common cloud environments, including Google Cloud and AWS, see Understanding Cloud Pricing: Big Data Processing Engines.

Dataflow

Dataflow is priced per hour depending on the Dataflow worker type. For more information, see Dataflow pricing.

Fully managed ETL

Both AWS and Google Cloud have offerings that reduce the work of configuring transformation by automating significant parts of the work and generating transformation pipelines.

In AWS, you can use AWS Glue, a fully-managed AWS service that combines the concerns of a data catalog and data preparation into a single service. AWS Glue employs user-defined crawlers that automate the process of populating the AWS Glue data catalog from various data sources. After the data catalog is populated, you can define an AWS Glue job. Creating the job generates a Python or Scala script that's compatible with Apache Spark, which you can then customize. AWS Glue jobs can run based on time-based schedules or can be started by events.
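Programmatically, running a crawler and then a previously defined Glue job can be done through boto3; the crawler and job names here are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Populate (or refresh) the data catalog from the configured data sources.
glue.start_crawler(Name="my-crawler")

# Run a job that was defined earlier in the AWS Glue console or API.
run = glue.start_job_run(JobName="my-etl-job")
print("Started job run:", run["JobRunId"])
```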

Google Dataprep is a fully managed service, operated by Trifacta, that integrates with your Google Cloud projects and data. You can use Dataprep to explore and clean up data you've identified for further analysis. Your data can be structured or unstructured, and can be sourced from Cloud Storage, BigQuery, or a file upload. Work is organized around flows, which represent one or more source datasets, transformations, and prepared datasets. Dataprep offers a GUI for discovering information and planning a transformation flow. Transformations are specified in the Wrangle domain-specific language, either manually or through the GUI. Your flow then runs on fully managed Dataflow to perform the transformations.

Stream transformation

There are several services in both AWS and Google Cloud that can be used to transform data streams.

Dataproc and Amazon EMR

Amazon EMR implements a streaming data model natively by supporting Amazon Kinesis Data Streams as a method of ingesting data. In this model, the application reads the available data stored in the stream until no new data arrives for a specified period of time. When all the shards are clear, the reduce step of the operation begins and the data is aggregated. Amazon EMR also supports streaming from third-party services such as Apache Kafka through a native implementation of Apache Spark Streaming.

Although Dataproc cannot read streaming data directly from Pub/Sub, Apache Spark comes preinstalled on all Dataproc clusters. This lets you use Dataproc to read streaming data from Apache Kafka. In addition, you can use Dataflow to read and process streaming data from Pub/Sub.
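For example, a Spark Structured Streaming job submitted to a Dataproc cluster can consume a Kafka topic and write the results to Cloud Storage. The broker address, topic, and bucket paths are placeholders, and the job must be submitted with the Spark Kafka connector package.

```python
from pyspark.sql import SparkSession

# Submit with the Kafka connector, for example:
#   --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "gs://my-bucket/stream-output/")
    .option("checkpointLocation", "gs://my-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```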

Google Dataflow and Amazon Kinesis Data Firehose

Dataflow supports stream processing in addition to batch processing, as described earlier. The streaming engine runs Apache Beam, just as with batch, and you can apply a transformation to both batch and stream sources without any code changes. Pub/Sub is the only event source used with Dataflow in streaming mode, and Pub/Sub can process messages up to 10 MB. As with batch transformations, Dataflow streaming transformations are fully managed and autoscaled, with scaling independent across components in the transformation pipeline.
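A streaming version of the earlier batch sketch changes only the source, the windowing, and the sink: messages are read from a Pub/Sub topic, counted in one-minute fixed windows, and written to BigQuery. All resource names below are placeholders.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "PairWithOne" >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"event": kv[0], "count": kv[1]})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.event_counts",
            schema="event:STRING,count:INTEGER",
        )
    )
```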

Amazon Kinesis Data Firehose can perform stream transformation by attaching an AWS Lambda function to the stream. The function can process input up to 6 MB, and can return up to 6 MB of data. You can mirror this approach in Google Cloud by using Pub/Sub and Cloud Functions.
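The following sketch shows both sides of that pattern: a Lambda handler in the record format that Firehose transformations expect, and a Pub/Sub-triggered Cloud Function that applies the same change and republishes the message. The two handlers would be deployed separately, and the topic names and transformation logic are hypothetical.

```python
import base64
import json

from google.cloud import pubsub_v1

# AWS side: Kinesis Data Firehose invokes the function with a batch of records
# and expects each one back with a recordId, a result, and re-encoded data.
def firehose_transform(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # example transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

# Google Cloud side: a Pub/Sub-triggered Cloud Function (Python background
# function) that transforms each message and publishes it to another topic.
publisher = pubsub_v1.PublisherClient()
output_topic = publisher.topic_path("my-project", "transformed-events")

def pubsub_transform(event, context):
    payload = json.loads(base64.b64decode(event["data"]))
    payload["processed"] = True
    publisher.publish(output_topic, json.dumps(payload).encode())
```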

Warehousing and analysis

After ingesting and transforming your data, you can perform data analysis and create visualizations from the data. Typically, data that's ready for analysis ends up in one of two places:

  • An object storage service, such as Amazon S3 or Google Cloud Storage.
  • A managed data warehouse, such as Amazon Redshift or Google BigQuery.

Managed data warehouses

This section focuses on Amazon Redshift and Google BigQuery's native storage.

Service model comparison

The following table compares features of Amazon Redshift and Google BigQuery.

| Feature | Amazon Redshift | Google BigQuery |
| --- | --- | --- |
| Unit of deployment | Cluster | N/A (fully managed) |
| Unit of provisioning | Node | N/A (fully managed) |
| Node storage types | HDD/SSD | N/A (fully managed) |
| Compute scaling | Manual, to a maximum of 128 nodes | Automatically adjusted, no limit |
| Query scaling | Up to 50 simultaneous queries across all user-defined queues | Up to 1,000 simultaneous queries |
| Table scaling | Up to 20,000 tables for large node types | No limit; performance is best under 50,000 tables per dataset; unlimited datasets |
| Backup management | Snapshots | N/A (fully managed) |
| Deployment locality | Zonal | Regional |
| Pricing model | Hourly | By storage and query volume |
| Query language | PostgreSQL compatible | Legacy BigQuery SQL or Standard SQL |
| Built-in machine learning? | No | Yes |

Amazon Redshift

Amazon Redshift uses a massively parallel processing architecture across a cluster of provisioned nodes to provide high-performance SQL execution. When you use Amazon Redshift, your data is stored in a columnar database that is automatically replicated across the nodes of the cluster. Your data lives within the cluster, so the cluster must be kept running to preserve the data. (However, you can export your data from Amazon Redshift to Amazon S3 and reload it into an Amazon Redshift cluster to query later.) An extension to Amazon Redshift, Redshift Spectrum, provides an alternative that lets you directly query data stored in supported formats in Amazon S3.

As noted, Amazon Redshift uses a provisioned model. In this model, you select an instance type, and then provision a specific number of nodes according to your needs. After you've done the provisioning, you can connect to the cluster and then load and query your data using the PostgreSQL-compatible connector of your choice.
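For example, after provisioning a cluster you might bulk-load data from Amazon S3 with the COPY command and query it over a standard PostgreSQL connection. The endpoint, credentials, table, bucket, and IAM role in this sketch are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)
cur = conn.cursor()

# Bulk-load data from S3 into an existing table.
cur.execute("""
    COPY events
    FROM 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
""")
conn.commit()

# Query the loaded data.
cur.execute("SELECT user_id, COUNT(*) FROM events GROUP BY user_id LIMIT 10;")
print(cur.fetchall())
conn.close()
```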

Amazon Redshift is a partially managed service. If you want to scale a cluster up or down—for example, to reduce costs during periods of low usage, or to increase resources during periods of heavy usage—you must do so manually. In addition, Amazon Redshift requires you to define and manage your distribution keys and sort keys, and to perform data cleanup and defragmentation processes manually.

Google BigQuery

BigQuery is fully managed. You don't need to provision resources; instead, you can simply push data into BigQuery, and then query the data. BigQuery manages the required resources and scales them automatically as appropriate. BigQuery also supports federated queries, which can include data stored in open source formats in Cloud Storage or Google Drive, and also data stored natively in Cloud Bigtable.

Behind the scenes, BigQuery uses the same powerful, global-scale services that Google uses internally:

  • It stores, encrypts, and replicates data using Colossus, Google's latest-generation distributed file system.
  • It processes tasks using Borg, Google's large-scale cluster management system.
  • It executes queries with Dremel, Google's internal query engine.

For more information, see the BigQuery Under the Hood post in the Google Cloud Blog.

BigQuery tables are append-only, with support for limited deletes to fix mistakes. Users can perform interactive queries and create and execute batch query jobs. BigQuery supports two query languages:

  • Legacy SQL, which is a BigQuery-specific dialect of SQL.
  • Standard SQL, which is compliant with the SQL 2011 standard and includes extensions for querying nested and repeated data.

In addition, BigQuery supports integration with a number of third-party tools, connectors, and partner services for ingestion, analysis, visualization, and development.
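The interaction model is correspondingly simple: load or stream data in, then query it. The following sketch runs a standard SQL query against a public dataset with the google-cloud-bigquery client library; no cluster or capacity configuration is involved.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# client.query() starts the job; result() waits for it and returns the rows.
for row in client.query(query).result():
    print(row.name, row.total)
```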

Machine learning

BigQuery includes native support for machine learning. BigQuery ML offers a number of models to address common business questions. Examples include linear regression for sales forecasts, and binary logistic regression for classification such as whether a customer is likely to make a purchase. Multiple BigQuery datasets can be used for both model training and prediction.
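Training and prediction are expressed in SQL. The sketch below trains a binary logistic regression model and then calls ML.PREDICT on it; the dataset, tables, and columns are hypothetical, and the column aliased as label is the training target.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a binary logistic regression model on historical sessions.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.purchase_model`
    OPTIONS (model_type = 'logistic_reg') AS
    SELECT
      visits,
      time_on_site,
      IF(made_purchase, 1, 0) AS label
    FROM `my_dataset.customer_sessions`
""").result()

# Use the trained model to score new sessions.
predictions = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.purchase_model`,
                    (SELECT visits, time_on_site FROM `my_dataset.new_sessions`))
""").result()

for row in predictions:
    print(dict(row))
```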

Scale

Amazon Redshift can scale from a single node to a maximum of 32 or 128 nodes for different node types. Using Dense Storage nodes, Redshift has a maximum capacity of 2 PB of stored data, including replicated data. Amazon Redshift's ingestion and query mechanisms use the same resource pool, which means that query performance can degrade when you load very large amounts of data.

Amazon Redshift Spectrum extends this capacity. However, when you use Redshift Spectrum, an Amazon Redshift cluster must be running in order to run queries against this data. Queries are processed between two layers (Amazon Redshift and Redshift Spectrum), and you must construct queries to use each layer most efficiently.

In contrast, BigQuery has no practical limits on the size of a stored dataset. Ingestion resources scale quickly, and ingestion itself is extremely fast—by using the BigQuery API, you can ingest millions of rows into BigQuery each second. In addition, ingestion resources are decoupled from query resources, so an ingestion load cannot degrade the performance of a query load. BigQuery can also perform queries of data stored in Google Cloud Storage. These federated queries require no changes to the way queries are written—the data is just viewed as another table.
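Streaming ingestion through the API is a single call per batch of rows; the table and field names below are hypothetical, and inserted rows become available to queries within seconds.

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"event_time": "2019-10-30T12:00:00Z", "user": "alice", "action": "click"},
    {"event_time": "2019-10-30T12:00:01Z", "user": "bob", "action": "view"},
]

# Stream the rows into an existing table.
table = client.get_table("my-project.my_dataset.events")
errors = client.insert_rows_json(table, rows)
if errors:
    print("Encountered errors:", errors)
```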

Operations

This section compares operational considerations of using Amazon Redshift and Google BigQuery.

Amazon Redshift

Amazon Redshift is partially managed: it takes care of many of the operational details needed to run a data warehouse, including data backups, data replication, failure management, and software deployment and configuration. However, several operational details remain your responsibility, including performance management, scaling, and concurrency.

To achieve good performance, you must define static distribution keys when you create tables. These distribution keys are then used by the system to shard the data across the nodes so that queries can be performed in parallel. Because distribution keys can have a significant effect on query performance, you must choose these keys carefully. After you define distribution keys, the keys cannot be changed; to use different keys, you must create a new table with the new keys and copy the data from the old table.
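Distribution and sort keys are declared as part of the table DDL, as in the following hypothetical table definition executed over a PostgreSQL connection.

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)
cur = conn.cursor()

# The distribution key shards rows across nodes; the sort key orders them on
# disk. Neither can be changed after the table is created.
cur.execute("""
    CREATE TABLE page_views (
        view_id    BIGINT,
        user_id    BIGINT,
        view_time  TIMESTAMP
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (view_time);
""")
conn.commit()
conn.close()
```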

In addition, Amazon recommends that you perform periodic maintenance to maintain consistent query performance. Updates and deletes do not automatically compact the data on disk, which can eventually lead to performance bottlenecks. For more information, see Vacuuming Tables in the Amazon Redshift documentation.

In Amazon Redshift, you must manage end users and applications carefully. For example, users must tune the number of concurrent queries they perform. By default, Amazon Redshift performs up to 5 concurrent queries. You can increase the number of concurrent queries up to 50. However, because resources are provisioned ahead of time, as you increase this limit, performance and throughput can be affected. For more information, see the Concurrency Levels section of Implementing Manual WLM in the Amazon Redshift documentation.

You must also size your cluster to support the overall data size, query performance, and number of concurrent users. You can scale up the cluster; however, given the provisioned model, you pay for what you provision, regardless of usage.

Finally, Amazon Redshift clusters are restricted to a single zone by default. To create a highly available, multi-regional Amazon Redshift architecture, you must create additional clusters in other zones, and then build out a mechanism for achieving consistency across clusters. For more information, see the Building Multi-AZ or Multi-Region Amazon Redshift Clusters post in the Amazon Big Data Blog.

For details about other Amazon Redshift quotas and limits, see Limits in Amazon Redshift.

BigQuery

BigQuery is fully managed, with little or no operational toil for the user:

  • BigQuery handles sharding automatically. You don't need to create and maintain distribution keys.
  • BigQuery is an on-demand service rather than a provisioned one. You don't need to worry about underprovisioning, which can cause bottlenecks, or overprovisioning, which can result in unnecessary costs.
  • BigQuery provides global, managed data replication. You don't need to set up and manage multiple deployments.
  • BigQuery supports up to 50 concurrent interactive queries, with no effect on performance or throughput.

For details about BigQuery's quotas and limits, see the Quota Policy page in the BigQuery documentation.

Costs

Amazon Redshift has two types of pricing: on-demand pricing and reserved instance pricing. Pricing is based on the number and type of provisioned instances. You can get discounted rates by purchasing reserved instances up front. Amazon offers one-year and three-year reserve terms. For more information, see the Amazon Redshift pricing page.

BigQuery charges you for usage. Pricing is based on data storage size, query compute, and streaming inserts. If a table is not updated for 90 consecutive days, the storage price for that table drops by half. Queries are billed per terabyte of data processed. BigQuery offers both on-demand and flat-rate pricing; flat-rate pricing can result in significant savings for predictable workloads. BigQuery also offers significant free usage, up to 20 GB of storage and 1 TB of query reads per month. For more information, see the BigQuery pricing page.

Object storage warehouses

Object stores are another common big data storage mechanism. Amazon S3 and Google Cloud Storage are comparable, fully-managed object storage services. For a more detailed discussion of the two, see the section on distributed object storage in the storage comparison document.

For large amounts of data that you access infrequently, Google Cloud Storage Coldline is a good choice, comparable to Amazon Glacier storage. For a detailed discussion of the two, see Google Cloud for AWS Professionals: Storage.

Analysis in object storage

This section focuses on Amazon Athena and Google BigQuery's compatibility with object storage.

Service model

Amazon Athena is a serverless query service that lets you run SQL queries directly against data stored in Amazon S3, using schemas that you define. BigQuery federated queries are comparable, supporting data in Cloud Storage, Google Drive, and Cloud Bigtable.

Both Athena and BigQuery on Cloud Storage are fully managed, including automatic scaling, so the service models are similar.
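The following sketch shows roughly equivalent calls: starting an Athena query over data in S3 with boto3, and running a BigQuery federated query over CSV files in Cloud Storage by attaching an external table definition to the query job. All database, table, and bucket names are placeholders.

```python
import boto3
from google.cloud import bigquery

# Amazon Athena: results are written to the S3 output location you specify.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM logs",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)

# BigQuery federated query: define an external table over Cloud Storage files.
client = bigquery.Client()
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/logs/*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(table_definitions={"logs": external_config})
for row in client.query("SELECT COUNT(*) AS n FROM logs", job_config=job_config).result():
    print(row.n)
```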

Scale

In terms of data scale, both Amazon S3 and Cloud Storage offer exabyte-scale storage. Amazon S3 limits buckets to 100 per account. Cloud Storage rate-limits bucket creation to one bucket every two seconds, but there is no limit on the number of buckets in a project, folder, or organization.

In terms of query scale, Athena queries time out at 30 minutes, while BigQuery queries time out after 6 hours. Athena has a soft limit of up to 20 DDL queries and 20 DML queries at one time. BigQuery supports up to 50 concurrent interactive queries, with no effect on performance or throughput. Dry-run queries do not contribute to this limit. This limit can be raised at the project level.

Athena query strings are limited to 262,144 bytes. BigQuery legacy SQL queries are limited to 256 KB unresolved, while standard SQL queries are limited to 1 MB unresolved. After resolution, which expands views and wildcard tables referenced by the query into the overall query size, the limit for both BigQuery SQL dialects is 12 MB.

Operations

Both Athena and BigQuery are fully managed, with little or no operational overhead for the user.

Costs

For storage costs, Google Cloud Storage and Amazon S3 are comparable, with charges for transfers and storage making up the bulk of the cost. Cloud Storage customers who need cost stability can enroll in the Storage Growth Plan to keep costs the same each month.

Both Amazon Athena and BigQuery (at the on-demand price) charge $5 per terabyte for queries. However, a terabyte is measured differently between the two services. Athena bills on bytes read from Amazon S3, which means that queries of compressed data cost less than queries of uncompressed data. BigQuery bills on bytes processed, so the cost is the same regardless of where and how the data is stored.

Both services have a minimum of 10 MB billed per query. Neither service charges for failed queries, but both services charge for work done on canceled queries.

Athena does not have a free tier. BigQuery offers the first 1 TB per month for free, for the lifetime of your account.

Data visualization

If you use Amazon QuickSight, you can find comparable features in Google Data Studio. The main difference is pricing: Data Studio is free, while QuickSight is billed per session. In addition, Data Studio is integrated with G Suite for easy sharing within your organization, just like Docs, Sheets, and Slides.

For more control or for scientific work, Google also offers Colaboratory, a free, no-setup service that integrates with BigQuery using Jupyter notebooks.
