Google Cloud Platform for AWS Professionals: Big Data

Updated June 29, 2016

This article compares the big data services that Amazon and Google provide in their respective cloud environments. It discusses the following service types:

  • Data ingestion services, which are used to ingest data from a source environment into a reliable and stable target environment or data type.
  • Data transformation services, which allow you to filter, extract, and transform data from one data type or model to another.
  • Data analytics services, which allow you to analyze, visualize, and interact with the processed data.

Data ingestion

Amazon Kinesis and Google Cloud Pub/Sub can both be used to ingest data into their respective cloud environments. However, each service accomplishes this task using different service models.

Service model comparison

Cloud Pub/Sub terms and concepts map to those of Amazon Kinesis as follows:

| Feature | Amazon Kinesis | Google Cloud Pub/Sub |
| --- | --- | --- |
| Unit of deployment | Stream | Topic |
| Unit of provisioning | Shard | N/A (fully managed) |
| Data unit | Record | Message |
| Data source | Producer | Publisher |
| Data destination | Consumer | Subscriber |
| Data partitioning | User-supplied partition key | N/A (fully managed) |
| Retention period | 1 – 7 days | 7 days |
| Data delivery order | Service-supplied sequence number (best effort) | Service-supplied publishTime (best effort) |
| Max data size | 1 MB | 10 MB |
| Deployment locality | Regional | Global |
| Pricing model | Per shard-hour, per number of operations, and per length of data retention | Per number of operations |

Amazon Kinesis

Amazon Kinesis uses a streaming model to ingest data. In this model, producers send data to a stream that you create and provision by shard. Each shard in a stream can provide a maximum of 1MB/sec of input bandwidth and 1000 data puts per second.

Users send data to Amazon Kinesis by using the low-level REST API or the higher-level Kinesis Producer Library (KPL). This data is stored in data records that comprise the following:

  • An incremental sequence number
  • A user-supplied partition key
  • A data blob

The partition key is used to load balance the records across the available shards. By default, records are retained for 24 hours. However, users can increase this retention period to a maximum of 7 days.
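
The following sketch illustrates this producer flow using the boto3 Python library. It is a minimal example rather than production code; the stream name, region, and record payload are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream provisioned with two shards; each shard accepts up to
# 1MB/sec of input and 1000 puts per second.
kinesis.create_stream(StreamName="example-stream", ShardCount=2)

# Wait until the stream becomes ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="example-stream")

# Put a record: the user-supplied partition key determines the target shard,
# and the service assigns the sequence number.
response = kinesis.put_record(
    StreamName="example-stream",
    Data=b'{"symbol": "GOOG", "price": 719.85}',
    PartitionKey="GOOG",
)
print(response["ShardId"], response["SequenceNumber"])
```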

The user sets up a consumer application that retrieves the data records from the stream on a per-shard basis, and then processes them. The application is responsible for multiplexing across the available shards. Amazon's Kinesis Client Library simplifies this management across shards, and also manages load balancing and failure management across the cluster of consumer application nodes.
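
For illustration, the following sketch reads records shard by shard using the low-level boto3 API. A long-running consumer would typically use the Kinesis Client Library instead, which handles shard multiplexing, load balancing, and checkpointing; the stream name and region here are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = kinesis.describe_stream(StreamName="example-stream")["StreamDescription"]

for shard in stream["Shards"]:
    # Start reading from the oldest record still retained in this shard.
    iterator = kinesis.get_shard_iterator(
        StreamName="example-stream",
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
```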

Cloud Pub/Sub

In contrast, Cloud Pub/Sub is a messaging service that uses a publisher/subscriber model. After you create a Cloud Pub/Sub topic, you can publish data to that topic, and each application that subscribes to the topic can retrieve the ingested data from the topic. This approach eliminates the need for provisioning.

Each application that is registered with Cloud Pub/Sub can retrieve messages using either a push model or a pull model:

  • In the push model, the Cloud Pub/Sub server sends a request to the subscriber application at a preconfigured URL endpoint.
  • In the pull model, the subscriber application requests messages from the server, and then acknowledges receipt when the messages arrive.

Each data message published to a topic must be base64-encoded and no larger than 10MB in size. At the time of ingestion, Cloud Pub/Sub adds a messageId attribute and a publishTime attribute to each data message. The messageId attribute is a message ID that is guaranteed to be unique within the topic, and the publishTime attribute is a timestamp added by the system at the time of data ingestion. Optional attributes can be added to the data in the form of key names and values.
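
As an illustration, the following sketch publishes a message with a custom attribute and then pulls it using the google-cloud-pubsub Python client library. The project, topic, and subscription names are placeholders, and the client library performs the base64 encoding required by the underlying REST API on your behalf.

```python
from google.cloud import pubsub_v1

project = "my-project"

# Publish a message with an optional custom attribute.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "example-topic")
future = publisher.publish(topic_path, data=b'{"price": 719.85}', symbol="GOOG")
print("Published message ID:", future.result())  # system-assigned messageId

# Pull the message from an existing subscription and acknowledge it.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project, "example-subscription")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})

for received in response.received_messages:
    msg = received.message
    print(msg.message_id, msg.publish_time, msg.attributes, msg.data)
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
    )
```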

Data order

This section describes how Amazon Kinesis and Cloud Pub/Sub manage the ordering of data requested by a consumer or subscriber application.

Amazon Kinesis

By default, Amazon Kinesis maintains data order through the use of the partition key and the sequence number. When a producer adds a record to a stream, the producer provides a partition key that determines the shard to which the record will be sent. The shard adds an incremental sequence number to the record, and then stores the record reliably.

Consumer applications request records by shard, and receive the records in order of sequence number. While this model ensures per-shard ordering, ordering is not guaranteed when making requests across shards. In addition, a record can be delivered to a consumer more than once, so the application is responsible for enforcing exactly-once semantics if needed. For more information, see Handling Duplicate Records in the Amazon Kinesis documentation.

Cloud Pub/Sub

Cloud Pub/Sub delivers messages on a best-effort basis, using the system-supplied publishTime attribute to deliver messages in the order that they were published. Cloud Pub/Sub does not guarantee exactly-once or in-order delivery: on occasion, a message might be delivered more than once, and out of order. Your subscriber should be idempotent when processing messages and, if necessary, be able to handle messages that arrive out of order.

You can achieve stricter ordering by using application-supplied sequence numbers and buffering. If the final target of your data is a persistent storage service that supports time-based queries, such as Cloud Datastore or BigQuery, you can also achieve stricter ordering by sorting your queries by timestamp. If your target is Cloud Dataflow, you can use record IDs to establish exactly-once processing.
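
A simple way to make a subscriber idempotent is to track the identifiers of messages it has already handled. The sketch below keys on the system-supplied messageId using an in-memory set; a real application would use durable storage and could equally key on an application-supplied sequence number. The project and subscription names, and the handle function, are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "example-subscription")

processed_ids = set()  # use durable storage (for example, Cloud Datastore) in production

def handle(data):
    print("processing", data)  # hypothetical application-specific work

def callback(message):
    if message.message_id in processed_ids:
        message.ack()  # duplicate delivery: acknowledge and skip
        return
    handle(message.data)
    processed_ids.add(message.message_id)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=60)  # process messages for up to a minute
except Exception:
    streaming_pull.cancel()
```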

Operations

This section examines operational and maintenance overhead for production workloads on each service.

Amazon Kinesis

Because Amazon Kinesis shards are provisioned, and therefore fixed, users must scale shards up and down manually, monitoring usage with Amazon CloudWatch and scaling as needed. This scaling process is called resharding, and can only be done on a shard-by-shard basis. Resharding supports two operations: a shard can be split into two shards, or two shards can be merged into a single shard. As such, doubling the capacity of N shards requires N individual shard-split operations.
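
For example, the following sketch doubles the capacity of a stream by splitting every shard at the midpoint of its hash key range using boto3. It assumes all listed shards are open, and the stream name is a placeholder; the stream must return to the ACTIVE state between splits.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = kinesis.describe_stream(StreamName="example-stream")["StreamDescription"]

for shard in stream["Shards"]:
    hash_range = shard["HashKeyRange"]
    midpoint = (int(hash_range["StartingHashKey"]) +
                int(hash_range["EndingHashKey"])) // 2

    # Split this shard into two shards at the midpoint of its hash key range.
    kinesis.split_shard(
        StreamName="example-stream",
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(midpoint),
    )

    # Each split puts the stream into UPDATING; wait for ACTIVE before continuing.
    kinesis.get_waiter("stream_exists").wait(StreamName="example-stream")
```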

Due to the fixed nature of shards, users must be careful to architect around their limitations. For example, if you choose to use an inappropriate partition key—such as a stock symbol in a stock market application—a spike in traffic can overwhelm a shard's ingestion capacity, and the problem will not be solvable by resharding. In such cases, the only way to address the issue is to redesign the application with a different partition key.

You can mitigate some of the management overhead of Amazon Kinesis Streams by using Amazon Kinesis Firehose. Kinesis Firehose automates the management, monitoring, and scaling of Kinesis Streams for one specific use case: aggregating data from a stream into Amazon S3 or Amazon Redshift. Users specify an Amazon S3 bucket or Amazon Redshift cluster, and then Firehose creates and manages a stream on the user's behalf, depositing the data in specified intervals into the desired location.

Finally, Amazon Kinesis is a regional service, with streams scoped to specific regions. As such, all ingested data must travel to the region in which the stream is defined.

Cloud Pub/Sub

Cloud Pub/Sub does not require provisioning, and handles sharding, replication, and scaling opaquely. Administrators do not need to monitor and scale anything manually.

In addition, the user does not use partition keys—Cloud Pub/Sub manages data partitioning on the user's behalf. Though these features greatly reduce managerial overhead, they also mean that Cloud Pub/Sub can make fewer guarantees about message ordering.

Cloud Pub/Sub uses Google's HTTP(S) load balancer to support data ingestion globally across all Cloud Platform regions. When a publisher publishes data to Cloud Pub/Sub, Google's HTTP(S) load balancer automatically directs the traffic to Cloud Pub/Sub servers in an appropriate region to minimize latency.

Costs

Amazon Kinesis Streams is priced by shard hour, data volume, and data retention period. By default, data is retained for 24 hours. Increasing the retention period will incur additional costs. Because Amazon Kinesis Streams uses a provisioned model, you must pay for the resources you provision even if you do not use the resources.

Amazon Kinesis Firehose is priced by data volume.

Cloud Pub/Sub is priced by data volume. Because Cloud Pub/Sub does not require resource provisioning, you pay for only the resources you consume.

Data transformation

After you've ingested your data into your cloud environment, you can transform the data, filtering and processing it as needed.

A common approach to data-transformation tasks is to use Apache-Hadoop-based tools, which typically provide flexible and scalable batch processing. Both Google Cloud Platform and Amazon Web Services offer managed Hadoop services. Google Cloud Dataproc and Amazon Elastic MapReduce (EMR) both provide automatic provisioning and configuration, simple job management, sophisticated monitoring, and flexible pricing. For stream-based data, both Cloud Dataproc and Amazon EMR support Apache Spark Streaming.

In addition, Google Cloud Platform provides Google Cloud Dataflow, which is based on Apache Beam rather than Hadoop. While Apache Spark Streaming treats streaming data as small batch jobs, Cloud Dataflow is a native stream-focused processing engine.

Service model comparison

Amazon EMR, Cloud Dataproc, and Cloud Dataflow compare to each other as follows:

| Feature | Amazon Elastic MapReduce | Google Cloud Dataproc | Google Cloud Dataflow |
| --- | --- | --- | --- |
| Open source library | Apache Hadoop and Apache Spark | Apache Hadoop and Apache Spark | Apache Beam |
| Service integration | Yes | Yes | Yes |
| Scaling | Manual | Manual | Auto |
| Deployment locality | Zonal | Zonal | Zonal |
| Pricing model | Per hour | Per second | Per minute |
| Unit of deployment | Cluster | Cluster | Pipeline |
| Unit of scale | Nodes (master, core, and task) | Nodes (master and worker) | Workers |
| Unit of work | Step | Job | Job |
| Programming model | MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, PySpark | MapReduce, Apache Hive, Apache Pig, Apache Spark, Spark SQL, PySpark | Apache Beam |
| Customization | Bootstrap actions | Initialization actions | File staging |

Cloud Dataproc and Amazon EMR

Cloud Dataproc and Amazon EMR have very similar service models. Each is a scalable platform for filtering and aggregating data, and each is tightly integrated with Apache's big data tools and services, including Apache Hadoop, Apache Spark, Apache Hive, and Apache Pig.

In both services, a user creates a cluster that comprises a number of nodes. The service creates a single master node and a variable number of worker nodes. Amazon EMR further classifies worker nodes into core nodes and task nodes.

Once a cluster has been provisioned, the user submits an application—called a job in Cloud Dataproc and a step in Amazon EMR—for execution by the cluster. Application dependencies are typically added to the cluster nodes using custom Bash scripts, called initialization actions in Cloud Dataproc and bootstrap actions in Amazon EMR. Applications typically read data from stable storage, such as Amazon S3, Cloud Storage, or HDFS, and then process the data using an Apache data processing tool or service. After the data has been processed, the resulting data can be further processed or pushed back to stable storage.
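
The following PySpark word count is a minimal sketch of the kind of application that might be submitted as a Cloud Dataproc job or an Amazon EMR step. The bucket paths are placeholders; an Amazon EMR version would read from and write to s3:// paths instead.

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count")

# Read input from stable storage (Cloud Storage in this example).
lines = sc.textFile("gs://example-bucket/input/*.txt")

# Classic MapReduce-style aggregation expressed with Spark primitives.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Push the results back to stable storage for further processing or analysis.
counts.saveAsTextFile("gs://example-bucket/output/word-counts")
```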

Cloud Dataflow

Cloud Dataflow uses the Apache Beam programming model to perform batch processing and stream processing. This model offers improved flexibility and expressiveness when compared to the Apache Spark model used by Amazon EMR and Cloud Dataproc, particularly for real-time data processing.

In Cloud Dataflow, the user specifies an abstract pipeline, using a Cloud Dataflow SDK to provide the primitives for data-parallel processing and aggregation. When specifying a pipeline, the user defines a set of transformations that are subsequently submitted for execution in the pipeline. These transformations are then mapped to a set of worker nodes that are provisioned and configured for execution. Some nodes might be used for reading data from Cloud Pub/Sub, and others might perform other downstream transformations—the details are managed by the Cloud Dataflow runtime.
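
The following sketch shows the shape of such a pipeline using the Apache Beam Python SDK: it reads messages from Cloud Pub/Sub, applies a transformation, and writes the results to another topic. The project, topics, bucket, and worker settings are placeholders, and the Cloud Dataflow SDKs expose an equivalent Java API.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://example-bucket/tmp",
    max_num_workers=10,  # upper bound; the service autoscales below this limit
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/raw-events")
     | "Transform" >> beam.Map(lambda msg: msg.decode("utf-8").upper().encode("utf-8"))
     | "WriteToPubSub" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/clean-events"))
```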

The Cloud Dataflow model, SDKs, and pipeline runners have been accepted into the Apache open source incubator as Apache Beam. This development means that Cloud Dataflow applications can also be executed in a Flink or Spark cluster, or in a local development environment.

For a detailed comparison of the Apache Beam and Apache Spark programming models, see Dataflow/Beam & Spark: A Programming Model Comparison.

Scaling

Cloud Dataproc and Amazon EMR

Both Amazon EMR and Cloud Dataproc allow you to manually adjust the number of nodes in a cluster after the cluster is started. The size of the cluster, as well as any scaling actions, is determined by the user or administrator, who monitors the performance and usage of the cluster to decide how to manage it. In both services, users pay for the number of nodes provisioned.

Steps submitted to an Amazon EMR cluster are queued and run on a first-come, first-served basis. Each step might require multiple Hadoop iterations, each of which is managed by the Hadoop scheduler across the set of worker nodes.

Cloud Dataproc natively supports parallel job submission, leveraging Hadoop's Fair Scheduler and YARN to schedule across the applications. Conversely, Amazon EMR does not provide native support for running multiple jobs in parallel, though workarounds are available.

Cloud Dataflow

With Cloud Dataflow, the user only specifies the maximum number of nodes. The Cloud Dataflow runtime system then autoscales the nodes, actively managing node provisioning and allocation as needed.

Streaming

Cloud Dataproc and Amazon EMR

Both Cloud Dataproc and Amazon EMR operate in batch mode. An application reads a large file or set of files stored on reliable object storage, processes the data in parallel, and stores the resulting files back in object storage.

Amazon EMR implements a streaming data model natively by supporting Amazon Kinesis Streams as a method of ingesting data. In this model, the application runs and reads the available data stored in the stream until no new data has arrived for a specified period of time. Once all the shards are clear, the reduce step of the operation begins, and the data is aggregated. Amazon EMR also supports streaming from third-party services such as Apache Kafka through a native implementation of Apache Spark Streaming.

Though Cloud Dataproc cannot read streaming data directly from Cloud Pub/Sub, Apache Spark comes preinstalled on all Cloud Dataproc clusters, allowing Cloud Dataproc to read streaming data from Apache Kafka. In addition, you can use Cloud Dataflow to read and process streaming data.
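
As an illustration, the following sketch consumes an Apache Kafka topic with Spark Streaming, which could run on either a Cloud Dataproc cluster or an Amazon EMR cluster. The broker address and topic are placeholders, and the job assumes a Spark version that ships the Python Kafka DStream API (spark-streaming-kafka) on the cluster.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-stream")
ssc = StreamingContext(sc, batchDuration=10)  # stream is processed as 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],
    kafkaParams={"metadata.broker.list": "kafka-broker:9092"},
)

# Each element is a (key, value) pair; count the messages in each micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```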

Cloud Dataflow

Cloud Dataflow supports both stream processing and batch processing.

Costs

Cloud Dataproc and Amazon EMR

Both Amazon EMR and Cloud Dataproc support on-demand pricing, as well as discounts for short-term and long-term use. Amazon EMR is priced by the hour, and Cloud Dataproc is priced by the second. To reduce the cost of nodes, Amazon EMR users can pre-purchase reserved instances. Cloud Dataproc automatically provides sustained use discounts.

In addition, each service offers inexpensive options for utilizing temporarily unused capacity. Amazon EMR supports provisioning nodes from the Amazon EC2 Spot market, in which unused capacity is auctioned off to users in short-term increments. These nodes can be reclaimed by the service, but the cluster continues processing as nodes are added or removed. Similarly, Cloud Dataproc supports Preemptible VMs that can be reclaimed at any time. Preemptible VMs are not auctioned through a market. Instead, they offer a fixed hourly discount for each Compute Engine machine type.

For a detailed comparison of managed Hadoop pricing for common cloud environments, including Google Cloud Platform and AWS, see Understanding Cloud Pricing: Big Data Processing Engines.

Cloud Dataflow

Cloud Dataflow is priced per hour depending on the Dataflow worker type. See Cloud Dataflow pricing for details.

Data analytics

After ingesting and transforming your data, you can perform data analysis and create visualizations from the data.

Typically, during the data transformation stage, you output the transformed data to one of two services:

  • An object storage service, such as Amazon S3 or Google Cloud Storage.
  • A managed data warehouse, such as Amazon Redshift or Google BigQuery.

This section focuses on Amazon Redshift and Google BigQuery.

Service model comparison

BigQuery's terms and concepts map to those of Amazon Redshift as follows:

| Feature | Amazon Redshift | Google BigQuery |
| --- | --- | --- |
| Unit of deployment | Cluster | N/A (fully managed) |
| Unit of provisioning | Node | N/A (fully managed) |
| Node types | Spinning disk / SSD | N/A (fully managed) |
| Scaling | Manual | Automatically adjusted |
| Backup management | Snapshots | N/A (fully managed) |
| Deployment locality | Zonal | Regional |
| Pricing model | Hourly | By storage and query volume |
| Query language | PostgreSQL compatible | Legacy BigQuery SQL or Standard SQL (Beta) |

Amazon Redshift

Amazon Redshift uses a massively parallel processing architecture across a cluster of provisioned nodes to provide high-performance SQL execution. When you use Amazon Redshift, your data is stored in a columnar database that is automatically replicated across the nodes of the cluster. In addition, you can export your data from Amazon Redshift to Amazon S3 for backup purposes.

As noted, Amazon Redshift uses a provisioned model. In this model, users select an instance type, and then provision a specific number of nodes according to their needs. After provisioning, users can connect to the cluster, and then load and query their data using the PostgreSQL-compatible connector of their choice.
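
For example, the following sketch connects to a cluster with psycopg2, one of many PostgreSQL-compatible connectors, and runs a query. The cluster endpoint, database, credentials, and table are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="example-password",
)

with conn.cursor() as cur:
    cur.execute("SELECT symbol, AVG(price) FROM trades GROUP BY symbol;")
    for symbol, avg_price in cur.fetchall():
        print(symbol, avg_price)

conn.close()
```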

Amazon Redshift is a partially managed service. If Amazon Redshift users want to scale a cluster up or down—for example, to reduce costs during periods of low usage, or to increase resources during periods of heavy usage—they must do so manually. In addition, Amazon Redshift requires users to carefully define and manage their distribution and sort keys, and to perform data cleanup and defragmentation processes manually.

Google BigQuery

In contrast, BigQuery is fully managed. Users do not need to provision resources; instead, they can simply push data into BigQuery, and then query across the data. The BigQuery service manages the associated resources opaquely and scales them automatically as appropriate.

Behind the scenes, BigQuery uses the same powerful, global-scale services that Google uses internally. BigQuery stores, encrypts, and replicates data using Colossus, Google's latest-generation distributed file system; processes tasks using Borg, Google's large-scale cluster management system; and executes queries with Dremel, Google's internal query engine. For more information, see the BigQuery Under the Hood post in the Google Cloud Blog.

BigQuery tables are append-only. Users can both perform interactive queries and create and execute batch query jobs. BigQuery supports two query languages:

  • Legacy SQL, which is a BigQuery-specific dialect of SQL.
  • Standard SQL, which is compliant with the SQL 2011 standard and includes extensions for querying nested and repeated data.
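
The sketch below runs a query with the google-cloud-bigquery Python client library, using the use_legacy_sql option to select the dialect; it queries a public sample dataset, and the project name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# use_legacy_sql=False selects standard SQL; set it to True for legacy SQL.
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query, job_config=job_config).result():
    print(row.word, row.total)
```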

In addition, BigQuery supports integration with a number of third-party tools, connectors, and partner services for ingestion, analysis, visualization, and development.

Scale

Amazon Redshift can scale from a single node to a maximum of either 128 nodes for 8xlarge node types or 32 nodes for smaller node types. These limits mean that Amazon Redshift has a maximum capacity of 2PB of stored data, including replicated data.

Amazon Redshift's ingestion and query mechanisms use the same resource pool, which means that query performance can degrade when you load very large amounts of data.

In contrast, BigQuery has no practical limits on the size of a stored dataset. Ingestion resources scale quickly, and ingestion itself is extremely fast—by using the BigQuery API, you can ingest millions of rows into BigQuery per second. In addition, ingestion resources are decoupled from query resources, so an ingestion load cannot degrade the performance of a query load.
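
As a sketch, the following snippet streams rows into a table with the google-cloud-bigquery client library; the table reference and row contents are placeholders, and the destination table is assumed to already exist with a matching schema.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

rows = [
    {"symbol": "GOOG", "price": 719.85, "ts": "2016-06-29T12:00:00"},
    {"symbol": "AMZN", "price": 715.60, "ts": "2016-06-29T12:00:01"},
]

# Streaming inserts are decoupled from query resources; errors, if any,
# are returned per row.
errors = client.insert_rows_json("my-project.example_dataset.trades", rows)
if errors:
    print("Insert errors:", errors)
```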

Operations

Amazon Redshift

Amazon Redshift is partially managed, taking care of many of the operational details needed to run a data warehouse. These details include data backups, data replication, failure management, and software deployment and configuration. However, several operational details remain the responsibility of the user or administrator, including performance management, scaling, and concurrency.

To achieve good performance, the user must define their static distribution keys at the time of table creation. These distribution keys are then used by the system to shard the data across the nodes so that queries can be performed in parallel. Because distribution keys have a significant effect on query performance, the user must choose these keys carefully. After the user defines their distribution keys, the keys cannot be changed; to use different keys, the user must create a new table with the new keys and copy their data from the old table.
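
For illustration, the following sketch declares a distribution key and a sort key at table-creation time through the same kind of PostgreSQL-compatible connection shown earlier; the table definition and column choices are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="example-password",
)

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE trades (
            trade_id   BIGINT,
            symbol     VARCHAR(10),
            price      DECIMAL(10, 2),
            traded_at  TIMESTAMP
        )
        DISTSTYLE KEY
        DISTKEY (symbol)      -- shards rows across nodes by symbol
        SORTKEY (traded_at);  -- orders rows on disk by trade time
    """)

conn.commit()
conn.close()
```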

In addition, Amazon recommends that the administrator perform periodic maintenance to reclaim lost space. Because updates and deletes do not automatically compact the resident data on disk, they can eventually lead to performance bottlenecks. For more information, see Vacuuming Tables in the Amazon Redshift documentation.

Amazon Redshift administrators must manage their end users and applications carefully. For example, users must tune the number of concurrent queries they perform. By default, Amazon Redshift performs up to 5 concurrent queries. Because resources are provisioned ahead of time, as you increase this limit—the maximum is 50—performance and throughput can begin to suffer. See the Concurrency Levels section of Defining Query Queues in the Amazon Redshift documentation for details.

Amazon Redshift administrators must also size their cluster to support the overall data size, query performance, and number of concurrent users. Administrators can scale up the cluster; however, given the provisioned model, the users pay for what they provision, regardless of usage.

Finally, Amazon Redshift clusters are restricted to a single zone by default. To create a highly available, multi-regional Amazon Redshift architecture, the user must create additional clusters in other zones, and then build out a mechanism for achieving consistency across clusters. For more information, see the Building Multi-AZ or Multi-Region Amazon Redshift Clusters post in the Amazon Big Data Blog.

For details about other Amazon Redshift quotas and limits, see Limits in Amazon Redshift.

BigQuery

BigQuery is fully managed, with little or no operational overhead for the user:

  • BigQuery handles sharding automatically. Users do not need to create and maintain distribution keys.
  • BigQuery is an on-demand service rather than a provisioned one. Users do not need to worry about underprovisioning, which can cause bottlenecks, or overprovisioning, which can result in unnecessary costs.
  • BigQuery provides global, managed data replication. Users do not need to set up and manage multiple deployments.
  • BigQuery supports up to 50 concurrent interactive queries, with no effect on performance or throughput.

For details about BigQuery's quotas and limits, see the Quota Policy page in the BigQuery documentation.

Costs

Amazon Redshift has two types of pricing: on-demand pricing and reserved instance pricing. Pricing is based on the number and type of provisioned instances. Users can get discounted rates by purchasing reserved instances up front. Amazon offers one-year and three-year reserve terms. See the Amazon Redshift pricing page for more information.

In contrast, BigQuery charges for what you consume rather than what you provision. Pricing is based on data storage size, query compute cost, and streaming inserts. If a table is not updated for 90 consecutive days, its storage price drops by half. Queries are billed per TB of data processed, with opt-in high-compute queries available at a scaling rate. See the BigQuery pricing page for more information.

For a scenario-based comparison of the two pricing models, see Understanding Cloud Pricing Part 3.2 in the Cloud Platform Blog.

What's next?

Check out the other Google Cloud Platform for AWS Professionals articles:
