Jump to Content
Google Cloud

Understanding Cloud Pricing Part 3: Data Warehouses

June 19, 2015
Peter-Mark Verwoerd

Platform Solutions Architect, Google Cloud

We’re back once again to talk about cloud pricing. We hope you’ve enjoyed our previous installments on Local SSD and Virtual Machines. Today we’re going to look at pricing data warehouses in the cloud: how they are priced and what information you need to make a decision about running a cloud data warehouse.

The last several years have seen a marked increase in the amount of data applications have generated coupled with a significant decrease in the cost of storage. However, just storing data doesn’t do much for anyone. To take advantage of it, you need a tool which will let you run queries on your data, build shareable reports, and actually analyze the data. More often than not, a data warehouse is what will help you do exactly this.

So what do we mean when we use the term “data warehouse?” Generally, what we are discussing in this post are data storage systems used for reporting and analysis, rather than application transactional storage. Put another way, we will focus on OLAP rather than OLTP systems. There’s a bit of overlap, but we’re just focusing on systems which are capable of: 1) reporting and analytics; and 2) storing large data sets.

Next, how is data warehousing different in the cloud? It has to do with the type of system you’re running. In the cloud, you have two kinds of data warehouses:

  1. In-cloud systems that could run in traditional data centers and cloud native systems. A cloud native data warehouse is one that can’t exist in a traditional data center because of its elasticity with storage, network, and compute require a cloud. BigQuery is an example of such a system. There are no nodes to add, no storage to provision and it all scales seamlessly.

  2. In-cloud systems that could or do run in traditional data centers are systems that take a traditional approach to deployment - adding more nodes to increase compute power and storage. Examples of this type of data warehouse are HP Vertica, SAP HANA, EMC Greenplum, and Amazon Redshift. Even though Redshift does not run private data centers, it closely resembles the other data warehouses in function.

BigQuery is an excellent choice for running your data warehouse in the cloud, both in terms of the cost efficiency and the scaling performance you gain.

An Example Cluster

Now we are ready to talk about a practical application and look at a data warehouse with 1 petabyte of data. Why exactly 1 PB? Well, when discussing OLAP data warehouses, a term often thrown around is “petabyte scale”. So, why not actually examine some systems using a petabyte of data stored? The specific purpose of the data warehouse doesn’t matter for our pricing purposes - again, keeping it simple. With this scenario, we will just examine BigQuery and a generic commercial data warehouse which operates in a columnar storage mode. Most column-oriented databases are either commercial products or have commercial entities that support them and charge for maintenance and patching. These could include data warehouses such as HP Vertica, SAP HANA, or many more. We will look at the prices of running a generic one of these on Google Cloud Platform (see our pricing post for more info). These systems meet our basic requirements - store large amounts of data; run analytics and reports on them; and run full versions in a public cloud.

We also need to look at how these data warehouses are priced. BigQuery has a pure consumption-based model - you are charged only for what you consume, both for storage and compute. The price for storage is $0.02/GB/month, and the price for data analyzed is $5/TB analyzed at the time of this post.

Most commercial data warehouses do not disclose pricing publicly. But it is common practice to price based on the amount of data stored. So for our generic commercial data warehouse we will assume $1,000/TB/year for the commercial license. This will likely vary significantly depending on the vendor, the customer, and the amount of data stored. But we will show all our calculations so you can plug in different numbers that more closely match your experience. Therefore the total cost will be the cost of running a cluster on Google Compute Engine and the total license cost.

Sizing Our Clusters

In order to correctly price BigQuery, we must estimate the number of queries being run using reasonable averages based on our customers. We will estimate 100 users, 40 queries each per day, with 100 GB average query size which is a usage pattern close to the 90th percentile of our user base. That comes out to 4000 queries per day and 12,000 TB data analyzed per month (simplifying to 30 day month).

In order to accurately price a generic system, we will attempt to use guidelines that are broadly applicable. The smallest node size for a data warehouse to perform as expected should be at least 16 cores and at least 60 GB of RAM. In order to make sure our comparison meets at least the minimum, we will estimate with the Google Compute Engine n1-highmem-16 instance, which has 16 cores and 104 GB of RAM. Additionally, we will only check the pricing for spinning disks, not SSDs. The reasoning is because showing how a data warehouse is priced on one type of disk is the same as for the other. While it is possible to run a 1 PB SSD-based cluster on Google Compute Engine we feel that SSD based data warehouses tend to be for more specialized functions on smaller data sets and are considerably more expensive.

The generic data warehouse estimate includes 3 components: the estimated license cost, the cost of the instances, and the cost of the Persistent Disk. There is a limit to how much Persistent Disk you can attach to an instance - 10 TB. This means to get 1 PB we will use 100 instances.

Pricing

Please note we completed these calculations on May 13, 2015, and have included the output prices in this post. Any discrepancies are likely due to pricing or calculator changes following the publishing of this post.

BigQuery estimate: Monthly: $79,999.50

This includes $20,000 for storage and $59,999.50 for query costs.

Generic data warehouse estimate: Monthly: $100,502.40

The assumed estimated monthly license cost is $1000/TB/year. This is $83,333.33/month (1000 * 1000 / 12)

Effective monthly: $183,835.73

Obviously this is more expensive than BigQuery. However, by running your cluster on Google Compute Engine, you can take advantage of Sustained Usage Discounts, which lower your costs significantly below what the on-demand rate is without having to take any action on your part. This definitely makes running any kind of large scale persistent cluster more cost effective. On the other hand, depending on the cost of the data warehouse license, it may add significantly to the operating expense of running on Google Compute Engine.

Scaling

While pricing is obviously important when you’re considering what kind of system to use, it is also important to take a look at scaling performance.

There are 2 data warehouse scaling characteristics that matter: how much data storage is available and the speed with which you can access it. The method to add more storage to a column-oriented data warehouse is to add additional nodes, limited by how many nodes you can provision. We used 100 nodes to get to 1 PB of storage, but you can go much higher on Google Compute Engine. However, this also means your storage and performance are tied together. There’s no way for you to add more storage separately from your compute needs. This also means discussing what it would look like to only add compute capacity would be almost the exact same discussion - adding more nodes.

BigQuery scales and performs differently - our pricing is set so you can separate your needs for storage and compute. If you need to store more data but don’t increase the amount of work you need to do (a common usage pattern - your reports are the same as you add data) - then you can do that with BigQuery without having to pay for a lot more compute you don’t need. On the other hand, if you have the same sized data set but need to do more work on it than you previously were (more reports! more ad-hoc queries!), you can do that too without having to pay for more storage you don’t need.

To take a final look at how the pricing for this increase would look, let’s consider what would happen if either your compute or your storage needs dramatically increase. For storage, let’s suppose it goes from 1 PB to 10 PB’s, with all the same previous assumptions, monthly below.

BigQuery estimate: $259,999.50

This breaks down to $200,000 for storage and $59,999.50 for queries. The exact same price for queries as when the example stored 1 PB. That’s because those pricing metrics are completely independent. We’re storing 10 times the data so the storage cost is 10 times as much, but the actual cost to analyze the data hasn’t changed at all.

BigQuery estimate analyzing 5x data: $499,999.50

This breaks down to $200,000 for storage and $299,997.50 for queries. Note the pricing metrics of storage and query are independent, so it’s possible the query behavior and scope doesn’t change and costs the same as before at $59,999.50 with the 1 PB example. In practice it is possible those queries would span a larger footprint of data in this 10 PB example, so for the purposes of this comparison we’ve assumed queries will analyze five times the data they did in the 1 PB example.

Generic estimate: $1,005,024

The assumed monthly license cost would be $833,333.33 (1000 * 10000 / 12 )

This is practically 10 times the cost of the 1 PB example because we’re adding in compute along with storage.

Just to put this in perspective, before people ask - “but who would use 10 PB in reality?” Well, we have external production customers using BQ in the 10’s of PB’s.

Additionally, consider that this is a high workload BigQuery system we’re comparing to the minimum requirement generic system. An expensive Google BigQuery system is still cheaper than a barebones generic one running on Google Compute Engine.

TCO Considerations

There are some aspects of comparing different tools in the cloud that don’t come out in the hourly (or monthly or yearly, etc) pricing. There are a few features of BigQuery where it shines in this regard.

Perhaps the biggest point to consider is that BigQuery is entirely no-ops. There’s no management of underlying virtual (or physical) machines, capacity or price planning, maintenance, or cluster resizing. This means you save money in time saved and freeing up your team to work on things which actually matter to your business. In all the examples above, even for a pricing example, we had to estimate the size of every cluster except BigQuery. That’s time, effort, and expertise needed to do correctly.

Being no-ops means you also don’t need a database admin to manage your schema, keys, indexes, etc., in your BigQuery cluster. This is all managed for you by Google. A skilled DBA could be working on unique problems to your company, rather than on the maintenance of a data warehouse schema.

Also, with a pure consumption based model you don’t have to plan for peak capacity. BigQuery can scale from 0 to the capacity you need, when you need it. Again, this saves you from operational expense and lets your people work on your business, not on figuring out how to scale your data warehouse.

Finally, while we mentioned above how you can separate out the costs of a BigQuery cluster between storage and compute, it’s actually quite a bit more than that. BigQuery allows you to separate the costs for storage, ingest, batch queries, and streaming queries. This allows you to focus on what you need from BigQuery without having to pay for resources you don’t need.

Summary

While it may seem daunting to price out a data warehouse in the cloud, we hopefully showed you how you can make sense of it all. While examining these data warehouses, we showed how much money and time BigQuery can save you, while still growing to the size you may need.

Posted in