This page describes schema design concepts, patterns, and examples for storing time series data in Cloud Bigtable. Before you read this page, you should be familiar with the overview of Cloud Bigtable. You should also be familiar with Designing Your Schema.
Whenever you measure something, and you record the time together with the measurement, you're building a time series. Time series are all around us:
When you look at the plot of memory usage on your computer because it's running slow, you're looking at a time series.
When you look at temperature over time on a news report, you're looking at a time series.
If you're a foreign exchange trader, and your job involves plotting 5-, 10-, and 30-day moving average prices for USD/JPY, you're looking at time series.
Time series are also incredibly important:
Time series help us optimize resource usage, decrease energy usage, minimize environmental impact, and reduce cost.
Time series help us identify trends in data, letting us demonstrate concretely what happened in the past and make informed estimates about what will happen in the future.
Time series underpin some of the complex analysis and machine learning in fields such as financial services, retail, insurance, physics, and chemistry.
This guide provides detailed strategies and a walkthrough for storing and querying time-series data in Cloud Bigtable.
Time series and Cloud Bigtable
Storing time-series data in Cloud Bigtable is a natural fit. Cloud Bigtable stores data as unstructured columns in rows; each row has a row key, and row keys are sorted lexicographically.
There are two commonly used ways to retrieve data from Cloud Bigtable:
- You can get a single row by specifying the row key.
- You can get multiple rows by specifying a range of row keys.
These methods are ideal for querying time-series data, since you often want data for a given time range (for example, all of the market data for the day, or server CPU statistics for the last 15 minutes). As a result, Cloud Bigtable is functionally a great fit for time series.
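For example, here is a minimal sketch of both access patterns using the Cloud Bigtable client library for Java. The project, instance, table, and row-key values are placeholders:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;

public class TimeSeriesReads {
  public static void main(String[] args) throws Exception {
    // Placeholder project and instance IDs.
    try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
      // 1. Get a single row by specifying the row key.
      Row row = client.readRow("metrics", "MACHINE1#20150301");
      System.out.println(row == null ? "not found" : row.getKey().toStringUtf8());

      // 2. Get multiple rows by specifying a range of row keys
      //    (start key inclusive, end key exclusive).
      Query query = Query.create("metrics").range("MACHINE1#20150301", "MACHINE1#20150316");
      for (Row r : client.readRows(query)) {
        System.out.println(r.getKey().toStringUtf8());
      }
    }
  }
}
```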
Of course, there's always a devil in the details. For Cloud Bigtable, the devil is that the schema for your data—the columns and the row-key structure—must be designed carefully. A good schema results in excellent performance and scalability, and a bad schema can lead to a poorly performing system. However, there is no single schema design that provides the best fit for all use cases.
The remainder of this guide presents a number of patterns for schema design in Cloud Bigtable. You can use these patterns to design an ideal schema for your use case. After the patterns are enumerated and explained, you can learn from examples for the following use cases:
- Financial market data
- Server metrics (for example, CPU, memory, and network usage)
- Intelligent energy meters (part of the "Internet of Things", or IoT)
Schema design patterns for time series
The schema design patterns for storing time series in Cloud Bigtable fit into the following categories:
- General patterns
- Patterns for row key design
- Patterns for data column design
General patterns
Keep names short but meaningful
When you transfer data from Cloud Bigtable, you're also transferring metadata, including:
- The row key
- The column family, an identifier that's used to group related columns
- The column qualifier, a unique name within a column family
As a result, it's important to choose meaningful names that are also as short as possible, because the size of each name contributes to storage and RPC overhead. For example, rather than using `CELLPHONE_NUMBER` as a column qualifier, you might use `CELL` as a short but meaningful abbreviation.
Patterns for row key design
Use tall and narrow tables
A tall and narrow table has a small number of events per row, which could be just one event, whereas a short and wide table has a large number of events per row. As explained in a moment, tall and narrow tables are best suited for time-series data.
For example, suppose you take the temperature in your vegetable garden each morning. Because you take one reading per day, one row per day is appropriate; as a result, your table is tall and narrow. Note that the timestamp isn't the first element of the row key. As explained later, using a timestamp as the first element of a row key can cause a variety of problems.
Row key | Column data |
---|---|
VEGGIEGARDEN#20150301 | DAILY:TEMP:60.4 |
VEGGIEGARDEN#20150302 | DAILY:TEMP:61.2 |
VEGGIEGARDEN#20150303 | DAILY:TEMP:61.0 |
VEGGIEGARDEN#20150304 | DAILY:TEMP:65.1 |
VEGGIEGARDEN#20150305 | DAILY:TEMP:62.2 |
... | ... |
VEGGIEGARDEN#20150331 | DAILY:TEMP:60.4 |
In contrast, suppose you want to plot the temperature over each month, so one row per month is appropriate. The following example shows the short and wide table that you get as a result:
Row key | Column data | ||||||
---|---|---|---|---|---|---|---|
VEGGIEGARDEN#20150301 | TEMP:1:60.4 | TEMP:2:61.2 | TEMP:3:61.0 | TEMP:4:65.1 | TEMP:5:62.2 | ... | TEMP:31:60.4 |
For time series, you should generally use tall and narrow tables, for two reasons: storing one event per row makes it easier to run queries against your data, and storing many events per row makes it more likely that the total row size will exceed the recommended maximum (see Rows can be big but are not infinite).
As an optimization, you can use short and wide tables, but avoid unbounded numbers of events. For example, if you usually need to retrieve an entire month of events at once, the temperature table above is a reasonable optimization—the row is bounded in size to the number of days in a month.
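As a concrete illustration, a write to the tall and narrow garden table might look like the following sketch using the Cloud Bigtable client library for Java; the table name and client setup are assumed:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class DailyTemperatureWriter {
  // One event per row: the date lives in the row key, so each morning's
  // reading becomes a new, narrow row rather than a new column.
  static void recordTemperature(BigtableDataClient client, String yyyymmdd, double temp) {
    RowMutation mutation = RowMutation.create("garden", "VEGGIEGARDEN#" + yyyymmdd)
        .setCell("DAILY", "TEMP", Double.toString(temp));
    client.mutateRow(mutation);
  }
}
```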
Prefer rows to column versions
In Cloud Bigtable, columns can have timestamped versions. As a result, it's theoretically possible to store a time series as a set of versions of a column. For example, if you wanted to record the closing price of ZXZZT shares each day, you could have a single column with a timestamped version for each day:
Row key | Column data | ||||
---|---|---|---|---|---|
ZXZZT | STOCK:PRICE (V1 03/01/15):558.40 | STOCK:PRICE (V2 03/02/15):571.34 | STOCK:PRICE (V3 03/03/15):573.64 | STOCK:PRICE (V4 03/04/15):573.37 | STOCK:PRICE (V5 03/05/15):575.33 |
However, this isn't the best way to store this data.
By default, use new rows instead of column versions. Using multiple rows, with a single version of an event in each row, is the simplest way to represent, understand, and query your data.
It is acceptable to use versions of a column where the use case is actually amending a value, and the value's history is important. For example, suppose you did a set of calculations based on the closing price of ZXZZT, and initially the data was mistakenly entered as 559.40 for the closing price instead of 558.40. In this case, it might be important to know the value's history in case the incorrect value had caused other miscalculations.
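Here is a sketch of such an amendment, assuming the STOCK:PRICE layout shown above; writing with an explicit timestamp creates a new version of the cell rather than a new row:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class PriceCorrection {
  // Writing with an explicit timestamp (in microseconds) adds a new version
  // of STOCK:PRICE; the mistaken 559.40 stays in the cell's version history.
  static void correctClosingPrice(BigtableDataClient client) {
    long nowMicros = System.currentTimeMillis() * 1000;
    RowMutation correction = RowMutation.create("stocks", "ZXZZT")
        .setCell("STOCK", "PRICE", nowMicros, "558.40");
    client.mutateRow(correction);
  }
}
```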
Design your row key with your queries in mind
When Cloud Bigtable stores rows, it sorts them by row key in lexicographic order. There is effectively a single index per table, which is the row key. Queries that access a single row, or a contiguous range of rows, execute quickly and efficiently. All other queries result in a full table scan, which will be far, far slower. A full table scan is exactly what it sounds like—every row of your table is examined in turn. For Cloud Bigtable, where you could be storing many petabytes of data in a single table, the performance of a full table scan will only get worse as your system grows.
For example, consider a table where players' scores from video games are stored, which might be designed as follows.
Row key | Column data | ||
---|---|---|---|
LoL#20150301 | GAME:PLAYER:Corrie | GAME:WIN:false | GAME:KDA:4.25 |
LoL#20150302 | GAME:PLAYER:Jo | GAME:WIN:true | GAME:KDA:7.00 |
LoL#20150302 | GAME:PLAYER:Sam | GAME:WIN:true | GAME:KDA:7.00 |
LoL#20150303 | GAME:PLAYER:Corrie | GAME:WIN:true | GAME:KDA:9.50 |
Starcraft#20150303 | GAME:PLAYER:Eriko | GAME:WIN:true | GAME:KDA:6.00 |
Suppose you want to query this data to answer the question "How many games of LoL did Corrie win in March?" With the schema shown above, you will have to scan most of the table to answer this question. In contrast, if you design the table as follows, you could complete this query by retrieving a specific range of row keys:
Row key | Column data | |
---|---|---|
LoL#Corrie#20150301 | GAME:WIN:false | GAME:KDA:4.25 |
LoL#Corrie#20150303 | GAME:WIN:true | GAME:KDA:9.50 |
LoL#Jo#20150302 | GAME:WIN:true | GAME:KDA:7.00 |
LoL#Sam#20150302 | GAME:WIN:true | GAME:KDA:7.00 |
Starcraft#Eriko#20150303 | GAME:WIN:true | GAME:KDA:6.00 |
Choosing a row key that facilitates common queries is of paramount importance to the overall performance of the system. Enumerate your queries, put them in order of importance, and then design row keys that work for those queries.
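For example, with the player-first schema above, the question about Corrie's wins becomes a single prefix scan. A sketch, assuming a table named `games`:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowCell;

public class WinCounter {
  // "How many games of LoL did Corrie win in March?" becomes a prefix scan
  // over a contiguous range of row keys, not a full table scan.
  static long countWins(BigtableDataClient client) {
    long wins = 0;
    Query query = Query.create("games").prefix("LoL#Corrie#201503");
    for (Row row : client.readRows(query)) {
      for (RowCell cell : row.getCells("GAME", "WIN")) {
        if ("true".equals(cell.getValue().toStringUtf8())) {
          wins++;
        }
      }
    }
    return wins;
  }
}
```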
How do you deal with a situation where there is no perfect row key? For example, suppose the queries for all LoL games in March and all LoL games played by Corrie in March were equally important. The schema above would enable you to query for Corrie's LoL games in March, but it wouldn't help with all LoL games in March—the best you could do is query all LoL games and then filter for March. There are two ways to solve this problem:
Denormalization
- Use two tables, each with a row key appropriate to one of the queries. This is a good solution, because it results in a robust, scalable system.
Query and filter
- Stick with the schema shown above, and have one query (all LoL games in March) that underperforms because you are filtering a large number of rows. This is not normally a good solution, because it results in a less scalable system that could easily deteriorate as usage increases.
Ensure that your row key avoids hotspotting
The most common issue for time series in Cloud Bigtable is hotspotting. This issue can affect any type of row key that contains a monotonically increasing value.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting. For example, if you're storing a cell phone's battery status, and your row key consists of the word "BATTERY" plus a timestamp (as shown below), the row key will always increase in sequence. Because Cloud Bigtable stores adjacent row keys on the same server node, all writes will focus on one node until that node is full, at which point writes will move to the next node in the cluster.
Row key | Column data | |
---|---|---|
BATTERY#20150301124501001 | METRIC:USER:Corrie | METRIC:PERCENTAGE:98 |
BATTERY#20150301124501002 | METRIC:USER:Jo | METRIC:PERCENTAGE:54 |
BATTERY#20150301124501003 | METRIC:USER:Corrie | METRIC:PERCENTAGE:96 |
BATTERY#20150301124501004 | METRIC:USER:Sam | METRIC:PERCENTAGE:43 |
BATTERY#20150301124501005 | METRIC:USER:Sam | METRIC:PERCENTAGE:38 |
There are a few ways to solve this problem:
Field promotion
- Move fields from the column data into the row key to make writes non-contiguous.

Salting
- Add an additional calculated element to the row key to artificially make writes non-contiguous.
The Key Visualizer tool can help you identify hotspots and makes it much easier to troubleshoot issues with Cloud Bigtable.
Field promotion
In this example, you will promote `USER` from a column to an element of the row key. This change would solve the hotspotting issue because user identifiers will provide a more uniform distribution of row keys. As a result, writes will be split across multiple nodes in your cluster.
The advantage of field promotion is that it often makes your queries more efficient as well, making this strategy a clear winner. The (slight) disadvantage is that your queries are constrained by your promoted fields, leading to rework if you don't promote the right fields.
Row key | Column data |
---|---|
BATTERY#Corrie#20150301124501001 | METRIC:PERCENTAGE:98 |
BATTERY#Corrie#20150301124501003 | METRIC:PERCENTAGE:96 |
BATTERY#Jo#20150301124501002 | METRIC:PERCENTAGE:54 |
BATTERY#Sam#20150301124501004 | METRIC:PERCENTAGE:43 |
BATTERY#Sam#20150301124501005 | METRIC:PERCENTAGE:38 |
Salting
In this example, you will take a hash of the timestamp, divide it by 3, and add the remainder to the row key. Why 3? It's an estimate of the number of nodes in the cluster in this case and would provide a good division of activity across those nodes.
The advantage of salting is its simplicity—it's essentially a simple hashing function. One disadvantage is that when you query for time ranges, you'll have to do multiple scans—one scan per salt value—and combine the results in your own code. Another disadvantage is that it's difficult to choose a salt value that both distributes activity across nodes and operates well as you scale your system up or down. Because of these disadvantages, and because it's best to use human-readable row keys, avoid salting unless you can find no other way to prevent hotspotting.
Row key | Column data | |
---|---|---|
BATTERY#1#20150301124501003 | METRIC:USER:Corrie | METRIC:PERCENTAGE:96 |
BATTERY#1#20150301124501004 | METRIC:USER:Sam | METRIC:PERCENTAGE:43 |
BATTERY#2#20150301124501002 | METRIC:USER:Jo | METRIC:PERCENTAGE:54 |
BATTERY#3#20150301124501005 | METRIC:USER:Sam | METRIC:PERCENTAGE:38 |
BATTERY#3#20150301124501001 | METRIC:USER:Corrie | METRIC:PERCENTAGE:98 |
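A sketch of this salting scheme follows; the hash function, the table name, and the modulus of 3 are illustrative, and note that reads must fan out across every salt value:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import java.util.ArrayList;
import java.util.List;

public class SaltedKeys {
  private static final int SALT_BUCKETS = 3; // Estimated number of nodes.

  // Derive the salt from a hash of the timestamp, so writes spread across
  // SALT_BUCKETS contiguous key ranges instead of one.
  static String saltedRowKey(long timestampMillis) {
    int salt = Math.floorMod(Long.hashCode(timestampMillis), SALT_BUCKETS) + 1; // 1..3
    return "BATTERY#" + salt + "#" + timestampMillis;
  }

  // Reading a time range requires one scan per salt value; results must be
  // combined (and re-sorted, if needed) in your own code.
  static List<Row> readDay(BigtableDataClient client, String day) {
    List<Row> results = new ArrayList<>();
    for (int salt = 1; salt <= SALT_BUCKETS; salt++) {
      Query query = Query.create("metrics").prefix("BATTERY#" + salt + "#" + day);
      for (Row row : client.readRows(query)) {
        results.add(row);
      }
    }
    return results;
  }
}
```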
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Use salting only where field promotion does not resolve hotspotting. In the rare case where you apply a salting function, be careful not to make too many assumptions about the underlying size of the cluster. The example above uses a salting function that assumes there are 3 nodes in the cluster; this assumption is safe only because a Cloud Bigtable cluster can contain a limited number of nodes. If you could create clusters with hundreds of nodes, you would want to use a different salting function.
Reverse timestamps only when necessary
You can reverse timestamps by subtracting the timestamp from your programming language's maximum value for long integers (such as Java's `java.lang.Long.MAX_VALUE`). By reversing the timestamp, you can design a row key where the most recent event appears at the start of the table instead of the end. As a result, you can get the N most recent events simply by retrieving the first N rows of the table.
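For example, a minimal sketch; the `BATTERY` key layout comes from the earlier example, and zero-padding the reversed value is an added detail that keeps lexicographic order aligned with numeric order:

```java
public class ReverseTimestamp {
  // Subtract from Long.MAX_VALUE so the most recent event sorts first.
  // Zero-padding to 19 digits keeps lexicographic order aligned with
  // numeric order.
  static String reversedRowKey(String user, long timestampMillis) {
    long reversed = Long.MAX_VALUE - timestampMillis;
    return "BATTERY#" + user + "#" + String.format("%019d", reversed);
  }
}
```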
Prefer reversed timestamps only where your most common query is for the latest values, because reversing timestamps makes every other query more complex and complicates the overall schema.
Patterns for data column design
Rows can be big but are not infinite
Rows in Cloud Bigtable can contain ~100 column families and millions of columns, with a 100 MB limit on each value stored in a column. These generous limits provide great flexibility. However, you shouldn't assume that big rows are the right way to store data, and that you should therefore fill each row with as much data as possible. Always keep in mind that retrieving large values takes extra time and memory.
In general, keep row sizes below approximately 100 MB. This is a guideline rather than a rule; rows can be larger than 100 MB. However, if you have many rows larger than this, you should expect performance issues.
In general, keep column values below approximately 10 MB. Again, this is a guideline rather than a rule; you can store some values that are larger than 10 MB, but they are likely to cause performance issues.
To reiterate, if you often rely on large rows, or large individual values, expect performance issues in your system.
Keep related data in the same table, keep unrelated data in different tables
Cloud Bigtable is a key/value store, not a relational store. It does not support joins, nor does it support transactions except within a single row. As a result, it's best to access data in individual rows or in a set of contiguous rows.
One result of this pattern is fairly obvious: In the vast majority of cases, time-series queries are accessing a given dataset for a given time period. Therefore, make sure that all of the data for a given time period is stored in contiguous rows, unless doing so would cause hotspotting.
Another result is that when you read data for a row, or a range of rows, that data should be useful on its own—you should not need to combine it with other data. For example, suppose you are storing user activity on a shopping website, and you often need to retrieve the last five actions performed by the user so you can display them in a sidebar. In this case, you should consider denormalizing your data and including some user and product details in the table of recent actions. In contrast, with a relational database, you would likely store the user ID and product ID in one table, then join that table with separate user and product tables in your SQL query.
That said, you don't need to include every piece of data about an entity in every single row. For example, if you're displaying information about a user's recent actions, you don't need to store the user's phone number, or the address of a product's manufacturer, because you aren't going to display that information in a sidebar.
Look for opportunities to denormalize data to satisfy queries, but only include as much data as required by the queries.
Store data you will access in a single query in a single column family
Column qualifiers in a single column family have a physical as well as a logical relationship. In general, all of the column qualifiers in a single column family are stored together, accessed together, and cached together. As a result, a query that accesses a single column family might execute more efficiently than a query spanning column families.
Ensure your common queries are as efficient as possible by retrieving data from as few column families as possible.
Don't exploit atomicity of single rows
Cloud Bigtable does not support transactions, with one exception: Operations on a single row are transactional. Transactions are also expensive, meaning that a system that relies on transactions will not perform as well as one that does not.
When working with time series, do not leverage the transactional behavior of rows. Changes to data in an existing row should be stored as a new, separate row, not changed in the existing row. This is an easier model to construct, and it enables you to maintain a history of activity without relying upon column versions.
Schema design examples
Now you will apply the schema design patterns to create examples for the following types of data:
- Financial market data
- Server metrics
- Intelligent energy meters (Internet of Things)
Remember, these are just examples! To find the best schema for your time-series data, you'll need to consider what data you're storing and how you plan to query the data, then apply the design patterns from the previous section.
Financial market data
This example takes a hypothetical equity market data message that represents information about an imaginary stock:
Field | Example data |
---|---|
Ticker Symbol | ZXZZT |
Bid | 600.55 |
Ask | 600.60 |
Bid size | 500 |
Ask size | 1500 |
Last sale | 600.58 |
Last size | 300 |
Quote time | 12:53:32.156 |
Trade time | 12:53:32.045 |
Exchange | NASDAQ |
Volume | 89000 |
Some observations about equity market data messages before beginning:
- The message aggregates quote data and trade data, which are logically separate.
- There is a relatively large number of ticker symbols (several thousand tickers).
- Several hundred of these tickers will account for 90% of messages received, because relatively few stocks are actively traded.
- Messages arrive frequently, from hundreds to tens of thousands per second, with an average of several thousand per second.
- Typical queries will be for quote data or for trade data separately, not for both at the same time.
For both quote data and trade data, a typical query will specify:
- An exchange (such as NASDAQ)
- A ticker symbol (such as ZXZZT)
- A start and end time
Now you can design the table for this use case:
Keep related data in the same table, keep unrelated data in different tables
- Store quote data in a table called QUOTE.
- Store trade data in a table called TRADE.
- Queries can include arbitrary time ranges, so you will store data from a single message in each row.
Rows can be big, but not infinitely big
- Each row stores data from a single message. This does not raise any size concerns.
Don't exploit atomicity of single rows
- Each message is self-contained, so there are no concerns.
Keep names short but meaningful
- For simplicity, this example uses the fields from the message, capitalized with spaces removed, as names.
These decisions give you the following column layout:
QUOTE table example:
Column data | | | | | | |
---|---|---|---|---|---|---|
MD:SYMBOL:ZXZZT | MD:BID:600.55 | MD:ASK:600.60 | MD:BIDSIZE:500 | MD:ASKSIZE:1500 | MD:QUOTETIME:1426535612156 | MD:EXCHANGE:NASDAQ |
TRADE table example:
Column data | | | | | |
---|---|---|---|---|---|
MD:SYMBOL:ZXZZT | MD:LASTSALE:600.58 | MD:LASTSIZE:300 | MD:TRADETIME:1426535612045 | MD:EXCHANGE:NASDAQ | MD:VOLUME:89000 |
Next, design the row key:
Use tall and narrow tables
- Each row will store data from one message, resulting in a very large number of relatively narrow rows.
Prefer rows to column versions
- Use column versions only in the exceptional circumstance where a value was incorrect.
Design your row key with your queries in mind
- `QUOTE` and `TRADE` row keys can follow the same form.
- Because this is a time series, you can assume by default that `QUOTETIME` will be part of the row key.
- To query by exchange and ticker for a given start and end time, you will need to use the values of `EXCHANGE`, `SYMBOL`, and `QUOTETIME`.
- Therefore, you will promote `EXCHANGE` (as a 6-character code; exchanges with fewer than 6 characters are right-padded with spaces), `SYMBOL` (as a 5-character code; tickers with fewer than 5 characters are right-padded with spaces), and `QUOTETIME` (as a 13-digit numeral). By padding `EXCHANGE` and `SYMBOL` with spaces, you ensure that each part of the row key is at a predictable offset.
- Taking these values together, the row key will be of the form `EXCHANGE` + `SYMBOL` + `QUOTETIME` (for example, `NASDAQ#ZXZZT#1426535612156`), as in the helper sketched after this list.
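A sketch of a helper that builds this row key, using the padding rules above and the `#` separators from the example key:

```java
public class QuoteRowKey {
  // EXCHANGE is right-padded to 6 characters and SYMBOL to 5, so each part
  // of the row key sits at a predictable offset; QUOTETIME is a 13-digit
  // millisecond timestamp.
  static String of(String exchange, String symbol, long quoteTimeMillis) {
    return String.format("%-6s#%-5s#%013d", exchange, symbol, quoteTimeMillis);
  }
  // of("NASDAQ", "ZXZZT", 1426535612156L) returns "NASDAQ#ZXZZT#1426535612156"
}
```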
Ensure that your row key avoids hotspotting
- Having `EXCHANGE` and `SYMBOL` in the leading positions in the row key will naturally distribute activity.
- Given that 90% of the messages are concentrated in a few hundred tickers, there is some risk of hotspotting, but you would need to stress-test the system before making further changes. If this concentration resulted in poor performance, you could apply salting to break up activity more effectively.
Reverse timestamps only when necessary
- You will not reverse timestamps in this case, because queries do not always require access to the latest data.
After this design exercise, you have the following tables:
QUOTE table example:
Row key | Column data | | | | | | |
---|---|---|---|---|---|---|---|
NASDAQ#ZXZZT#1426535612156 | MD:SYMBOL:ZXZZT | MD:BID:600.55 | MD:ASK:600.60 | MD:BIDSIZE:500 | MD:ASKSIZE:1500 | MD:QUOTETIME:1426535612156 | MD:EXCHANGE:NASDAQ |
TRADE table example:
Row key | Column data | | | | | |
---|---|---|---|---|---|---|
NASDAQ#ZXZZT#1426535612045 | MD:SYMBOL:ZXZZT | MD:LASTSALE:600.58 | MD:LASTSIZE:300 | MD:TRADETIME:1426535612045 | MD:EXCHANGE:NASDAQ | MD:VOLUME:89000 |
These tables will grow at the rate of hundreds of millions of rows per day, which Cloud Bigtable can handle without difficulty.
Server metrics
The following example uses a hypothetical server-monitoring system that collects a large variety of metrics (such as per-core CPU, memory, and disk usage) from a large inventory of machines. You will go through multiple iterations of the schema in this example.
You can make the following assumptions about the data:
- Collect 100 metrics per machine.
- Collect metrics from 100,000 machines.
- Metrics are collected every 5 seconds.
Typical queries will be one of the following:
- Metrics for a given machine for a given start and end time
- The latest metrics for the entire inventory of machines
With that use case in mind, you can design the table:
Iteration 1
Keep related data in the same table, keep unrelated data in different tables
- You will store metrics data in a table called `METRIC`.
- There are several categories of metrics, and you will group these using appropriate column families.
- Queries can include arbitrary time ranges, so you will have each row store a single set of metrics from a machine at a given time.
Rows can be big, but not infinitely big
- Each row stores a single set of metrics, which causes no size concerns.
Don't exploit atomicity of single rows
- You won't rely upon the atomicity of rows in your schema design.
Keep names short but meaningful
- For simplicity, you will use the field names from the metrics, capitalized with spaces removed, as column qualifier names.
These decisions give you the following column layout:
Column data | | | | | |
---|---|---|---|---|---|
METRIC:HOSTNAME:server1.bbb.com | METRIC:CPU/CPU1_USR:0.02 | METRIC:CPU/CPU1_NICE:0.00 | ... | METRIC:IO/BLK_READ:253453634 | METRIC:IO/BLK_WRTN:657365234 |
Next, design the row key based on the patterns:
Use tall and narrow tables
- Each row in the table will store one set of metrics for one machine, resulting in a very large number of rows.
Prefer rows to column versions
- You will not use column versions.
Design your row key with your queries in mind
- Because this is a time series, include the timestamp, `TS`, in the row key.
- To retrieve metrics for a given machine for a given start and end time, you will retrieve a range of rows using `HOSTNAME` plus `TS`.
- Retrieving the latest metrics for the entire inventory of machines is complicated. You can't simply reverse the timestamp and scan N rows, because there is no guarantee that would pick up each machine in the inventory.
Now, you have run into a problem with designing the row key. The solution here is denormalization. You will create a separate table that holds the latest versions of the metrics, called `CURRENT_METRIC`, which you will update whenever you update `METRIC`. When you update the existing metrics for a machine, you will simply overwrite the row for that machine.
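A sketch of that dual write, reusing the table and column names from this example (the client setup is assumed):

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class MetricWriter {
  // Denormalization: every reading appends a timestamped row to METRIC and
  // overwrites the machine's single row in CURRENT_METRIC.
  static void write(BigtableDataClient client, String hostname, long ts, String cpuUsr) {
    client.mutateRow(RowMutation.create("METRIC", hostname + "#" + ts)
        .setCell("METRIC", "CPU/CPU1_USR", cpuUsr));
    client.mutateRow(RowMutation.create("CURRENT_METRIC", hostname)
        .setCell("METRIC", "CPU/CPU1_USR", cpuUsr));
  }
}
```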
Next, iterate on your original design:
Iteration 2
Keep related data in the same table, keep unrelated data in different tables
- You will store metrics data in a table called `METRIC`.
- You will store the latest version of the metrics in a table called `CURRENT_METRIC`.
- Other information stays the same as iteration 1.
Rows can be big, but not infinitely big
- Stays the same as iteration 1.
Don't exploit atomicity of single rows
- You will rely upon the atomicity of rows for updating each machine's data in `CURRENT_METRIC`. This is a simple row mutation, with little potential for contention, so it will not cause any issues.
Keep names short but meaningful
- Stays the same as iteration 1.
Next, design the row key based on the patterns:
Use tall and narrow tables
- In both tables, each row in the table will store one set of metrics for one machine, resulting in a large number of rows.
Prefer rows to column versions
- Stays the same as iteration 1.
Design your row key with your queries in mind
- Because this is a time series, you will include the timestamp, `TS`, in the row key. To retrieve metrics for a given machine for a given start and end time, you will retrieve a range of rows from `METRIC` using `HOSTNAME` plus `TS`.
- Therefore, you will promote `HOSTNAME` to the row key and use a row key of the form `HOSTNAME` + `TS`.
- To find the latest metrics for specific machines, scan `CURRENT_METRIC`, filtering on the row key prefix, or `HOSTNAME`, of those machines.
- To find the latest metrics for the entire inventory of machines, scan `CURRENT_METRIC` without specifying a row key.
- Therefore, you do not need to promote any additional fields to the row key, and the `METRIC` row key will have the simple form `HOSTNAME` + `TS`. This leads to a simple, easily understood schema that makes it possible for tables to be sharded effectively. In the `CURRENT_METRIC` table, you no longer need the timestamp, so the row key will be `HOSTNAME`.
- `METRIC` row key example: `server1.bbb.com#1426535612045`
- `CURRENT_METRIC` row key example: `server1.bbb.com`
Ensure that your row key avoids hotspotting
- For both `METRIC` and `CURRENT_METRIC`, there is no concern about hotspotting, because having the hostname at the start of the row key will distribute activity across nodes.
Reverse timestamps only when necessary
- You are storing data for the most recent metrics in a separate table, so there's no need to reverse the timestamps.
After the design exercise, you have the following:
METRIC table:
Row key | Column data | | | | |
---|---|---|---|---|---|
server1.bbb.com#1426535612045 | METRIC:CPU/CPU1_USR:0.02 | METRIC:CPU/CPU1_NICE:0.00 | ... | METRIC:IO/BLK_READ:253453634 | METRIC:IO/BLK_WRTN:657365234 |
This table will grow at the rate of approximately 2 billion rows per day, which Cloud Bigtable can handle without difficulty.
CURRENT_METRIC table:
Row key | Column data | | | | |
---|---|---|---|---|---|
server1.bbb.com | METRIC:CPU/CPU1_USR:0.02 | METRIC:CPU/CPU1_NICE:0.00 | ... | METRIC:IO/BLK_READ:253453634 | METRIC:IO/BLK_WRTN:657365234 |
Intelligent energy meters (Internet of Things)
This example uses a hypothetical IoT scenario where there are intelligent energy meters sending sensor readings periodically to a centralized system. Once again, you will go through multiple iterations of the schema in this example.
You can make the following assumptions about the data:
- There are 10,000,000 operational meters.
- Each meter sends a sensor reading every 15 minutes.
- The meter ID is a unique numeric ID.
Typical queries will be one of the following:
- All data for a given meter for a given day
- All data for a given day
Given that use case, you can design the table:
Keep related data in the same table, keep unrelated data in different tables
- You will store the sensor data in a table called SENSOR.
- Queries are in daily increments, so you will have each row store data for one meter for one day.
Rows can be big, but not infinitely big
- Each row will contain data for one meter for one day, for a total of 96 columns (24 hours in a day * 60 minutes in an hour / 15 minutes per reading), which causes no size concerns.
Don't exploit atomicity of single rows
- You will rely upon the atomicity of rows: when the data is received, it will be added to the appropriate daily row. This is a simple row mutation, with little potential for contention, so it will not cause any issues.
Keep names short but meaningful
- A column, `ID`, to store the meter ID (a unique integer).
- A series of columns with names `0000` through `2345` to contain the 96 values recorded every 15 minutes during the day (see the sketch after this list). This scheme is adopted because it enables you to change to frequencies other than 15 minutes if necessary.
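A sketch of the qualifier calculation referenced above:

```java
public class ReadingColumn {
  // Map a reading taken at hour:minute to one of the 96 column qualifiers
  // "0000" through "2345" by rounding down to the nearest 15 minutes.
  static String qualifier(int hour, int minute) {
    int rounded = (minute / 15) * 15;
    return String.format("%02d%02d", hour, rounded);
  }
  // qualifier(23, 52) returns "2345"; qualifier(0, 7) returns "0000"
}
```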
These decisions give you the following column layout:
Iteration 1
Column data | | | | | |
---|---|---|---|---|---|
METER:ID:987654 | METER:0000:12.34 | METER:0015:13.45 | ... | METER:2330:27.89 | METER:2345:28.90 |
Next, design the row key:
Use tall and narrow tables
- Each row will store data for one day, as previously explained.
Prefer rows to column versions
- Use column versions only in the exceptional circumstance where a value was incorrect.
Design your row key with your queries in mind
- Because this is a time series, include `DATE` in the row key. Because you are interested only in precision to the day, the last five digits of the timestamp will be zero and will therefore be omitted.
- To query a given meter for a given day, you will retrieve a single row using `METER` and `DATE`.
- To query all meters for a given day, retrieve a range of rows using `DATE`.
- Therefore, promote `DATE` (as an 8-digit numeral) and `METER` (as a 10-digit numeral, with the meter `ID` left-padded to 10 digits to accommodate 1 billion potential meters while retaining lexicographical ordering).
- Taking the queries together, the row key will need to be of the form `DATE` + `METER` (for example, `20170726#0000987654`).
At this point, you might notice a problem: At the start of each day, all the meters will write to a single node, because the date is in the leading position in the row key. With 10 million meters, this is likely to be an issue that impacts performance every single day. The solution is to find a better way to query meter data for a specific day. If you run a single query each night and store the results as a new table, called `SENSOR_YYYYMMDD`, you won't need to optimize your row key for date-based queries.
Let's iterate to solve this problem:
Iteration 2
Keep related data in the same table, keep unrelated data in different tables
- Store the sensor data in a table called SENSOR.
- Queries will retrieve data in daily increments, so you will have each row store data for one meter for one day.
- You will run an overnight batched query that produces another table called `SENSOR_YYYYMMDD` (where `YYYYMMDD` is the date) to store all meter data for that date.
Rows can be big, but not infinitely big
- Stays the same as iteration 1.
Don't exploit atomicity of single rows
- Stays the same as iteration 1.
Keep names short but meaningful
- Stays the same as iteration 1.
Putting this all together, an example row would look like the following for both the `SENSOR` table and the `SENSOR_YYYYMMDD` tables:
Column data | | | | | |
---|---|---|---|---|---|
METER:ID:987654 | METER:0000:12.34 | METER:0015:13.45 | ... | METER:2330:27.89 | METER:2345:28.90 |
Next, design the row key:
Use tall and narrow tables
- Stays the same as iteration 1.
Prefer rows to column versions
- Stays the same as iteration 1.
Design your row key with your queries in mind
- Because this is a time series, include `DATE` in the row key.
- To query a given meter for a given day, you will need to retrieve a single row using `METER` and `DATE`.
- Therefore, promote `DATE` (as an 8-digit numeral) and `METER` (as a 10-digit numeral, with the meter `ID` left-padded to 10 digits to accommodate 1 billion potential meters while retaining lexicographical ordering).
- The row key will need to be of the form `METER` + `DATE` (for example, `0000987654#20170726`), which will satisfy your queries and lead to a good distribution of activity across nodes. A helper for building this key is sketched after this list.
- You will also run a batched query once per day that scans the entire table and stores yesterday's data in a new table called `SENSOR_YYYYMMDD`, where `YYYYMMDD` is yesterday's date.
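A sketch of the row-key helper referenced above:

```java
public class SensorRowKey {
  // Left-pad the meter ID to 10 digits so row keys sort lexicographically,
  // then append the 8-digit date.
  static String of(long meterId, String yyyymmdd) {
    return String.format("%010d#%s", meterId, yyyymmdd);
  }
  // of(987654L, "20170726") returns "0000987654#20170726"
}
```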
Ensure that your row key avoids hotspotting
- Hotspotting isn't a concern for the `SENSOR` table. Writes will be distributed evenly because the row key has `METER` in the leading position.
- Hotspotting isn't a concern for the `SENSOR_YYYYMMDD` tables. Each table is constructed only once, as a batched query, where performance is a lesser concern. However, creating these tables requires a full scan of `SENSOR`, so you will want to create the `SENSOR_YYYYMMDD` tables when there are few other queries of the `SENSOR` table.
Reverse timestamps only when necessary
- In this case, you don't need to reverse timestamps.
After the design exercise, you have the following for both the `SENSOR` table and the `SENSOR_YYYYMMDD` tables:
Row key | Column data | | | | | |
---|---|---|---|---|---|---|
0000987654#20170726 | METER:ID:987654 | METER:0000:12.34 | METER:0015:13.45 | ... | METER:2330:27.89 | METER:2345:28.90 |
This table will grow at the rate of a little less than 10 million rows per day, which Cloud Bigtable can handle without difficulty.
What's next
- Learn how to design schemas for other types of data.
- Read about the performance you can expect from Cloud Bigtable.
- Explore the diagnostic capabilities of Key Visualizer.