Design a schema
The ideal schema for a Bigtable table depends heavily on your use case, your data access patterns, and the data you plan to store. This page provides an overview of the Bigtable schema design process.
Before you read this page, you should understand schema design concepts and best practices. If applicable, also read Schema design for time series data.
Before you begin
Create or identify a Bigtable instance that you can use to test your schema.
Gather information
- Identify the data that you plan to store in Bigtable.
Questions to ask include:
- What format does the data use? Possible formats include raw bytes, strings, protobufs, and JSON.
- What constitutes an entity in your data? For example, are you storing page views, stock prices, ad placements, device measurements, or some other type of entity? What are the entities composed of?
- Is the data time-based?
- Identify and rank the queries that you use to get the data you need. Considering the entities that you plan to store, think about how you want the data to be sorted and grouped when you use it. Your schema design might not satisfy all of your queries, but ideally it satisfies the most important or most frequently used queries. Examples of queries might include the following (a sketch of row key patterns that could serve queries like these follows the list):
- A month's worth of temperature readings for IoT devices.
- Daily ad views for an IP address.
- The most recent location of a mobile device.
- All application events per day per user.
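Because Bigtable stores rows in lexicographic order by row key, each of these queries is cheapest when its rows fall into a single contiguous key range. The following sketch shows hypothetical row key patterns for two of the example queries; every field name, delimiter, and ordering choice here is an illustrative assumption, not a fixed convention.

```python
# Hypothetical row key patterns for two of the example queries.

# A month of temperature readings for one IoT device: leading with the
# device ID and a timestamp means that a prefix scan on
# "device-1234#202406" returns the whole month as one contiguous range.
keys = [
    "device-1234#20240601-1200",
    "device-1234#20240601-1215",
    "device-5678#20240601-1200",
]

# sorted() mirrors Bigtable's lexicographic row ordering: all rows for
# device-1234 sort next to each other, ahead of device-5678.
assert sorted(keys) == keys

# Daily ad views for an IP address: one row per address per day, read
# with a single row lookup.
ad_views_key = "203.0.113.7#20240601"
```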
Design
Decide on an initial schema design. This means planning the pattern that your row keys will follow, the column families your table will have, and the column qualifiers for the columns you want within those column families. Follow the general schema design guidelines. If your data is time-based, also follow the guidelines for time series data.
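As one way to capture a draft design in code, the following sketch creates a table with the google-cloud-bigtable Python client. The project, instance, table, and column family names are placeholders, and the garbage collection rule is just one reasonable starting point.

```python
from google.cloud import bigtable
from google.cloud.bigtable import column_family

# Placeholder IDs; substitute your own project, instance, and table.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table = instance.table("my-table")

# One column family per group of related columns. A garbage collection
# rule such as MaxVersionsGCRule(2) caps how many cell versions
# Bigtable keeps per column.
table.create(
    column_families={
        "stats_summary": column_family.MaxVersionsGCRule(2),
        "stats_detail": column_family.MaxVersionsGCRule(2),
    }
)
```

Note that column qualifiers are not declared when you create the table; Bigtable creates them implicitly the first time you write to them. The design step only has to fix the row key pattern and the column families in advance.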
Test
- Create a table using the column families and column qualifiers that you came up with for your schema.
- Load the table with at least 30 GB of test data, using row keys that you identified in your draft plan; a batch-write sketch follows this list. Stay below the per-node storage utilization limits.
- Run a heavy load test for several minutes. This step gives Bigtable a chance to balance data across nodes based on the access patterns that it observes.
- Run a one-hour simulation of the reads and writes you would normally send to the table.
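A minimal sketch of the loading step, again using the Python client: it writes synthetic rows in one batch, with row keys that follow the hypothetical device#timestamp pattern from earlier. All IDs, family names, qualifiers, and values are assumptions.

```python
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("my-table")

# Build a batch of test rows whose keys follow the draft key pattern.
rows = []
for device_id in range(1000):
    row_key = "device-{:04d}#20240601-1200".format(device_id).encode()
    row = table.direct_row(row_key)
    row.set_cell(
        "stats_summary",            # column family from the draft schema
        b"temperature",             # column qualifier
        str(20 + device_id % 10),   # synthetic cell value
        timestamp=datetime.datetime.utcnow(),
    )
    rows.append(row)

# mutate_rows sends the batch in one request and returns one status
# per row; a nonzero code indicates that the row failed to write.
statuses = table.mutate_rows(rows)
failures = [status for status in statuses if status.code != 0]
print("{} of {} rows failed".format(len(failures), len(rows)))
```

Running loops like this from several parallel workers, with randomized keys and values, approximates the heavy load test and the one-hour simulation described in the steps above.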
Review the results of your simulation using Key Visualizer and Cloud Monitoring.
The Key Visualizer tool for Bigtable provides scans that show the usage patterns for each table in a cluster. Key Visualizer helps you check whether your schema design and usage patterns are causing undesirable results, such as hotspots on specific rows.
Cloud Monitoring lets you check metrics, such as the CPU utilization of the hottest node in a cluster, to determine whether your schema design is causing problems.
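If you want to pull that metric programmatically rather than read it in the console, the following sketch uses the Cloud Monitoring Python client. The project ID is a placeholder, and the query assumes the Bigtable metric bigtable.googleapis.com/cluster/cpu_load_hottest_node, which reports the CPU load of the busiest node in each cluster.

```python
import time

from google.cloud import monitoring_v3

# Placeholder project ID.
client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

# Look at the window covering the one-hour simulation.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},
    }
)

# cpu_load_hottest_node reports the CPU utilization of the busiest
# node in each cluster as a value between 0.0 and 1.0.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/cpu_load_hottest_node"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    cluster = series.resource.labels["cluster"]
    peak = max(point.value.double_value for point in series.points)
    print("cluster {}: hottest node peaked at {:.0%} CPU".format(cluster, peak))
```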
Refine
- Revise your schema design as necessary, based on what you learned with Key Visualizer. For instance:
- If you see evidence of hotspotting, use different row keys; a before-and-after sketch follows this list.
- If you notice latency, find out whether your rows exceed the 100 MB per row limit.
- If you find that you have to use filters to get the data you need, consider normalizing the data in a way that allows simpler (and faster) reads: reading a single row or ranges of rows by row key.
- After you've revised your schema, test and review the results again.
- Continue modifying your schema design and testing until an inspection in Key Visualizer tells you that the schema design is optimal.
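As a hypothetical illustration of the hotspotting fix, the sketch below contrasts a timestamp-first row key, which pushes every new write to the same end of the sorted keyspace, with a revised key that promotes the device ID to the front so writes fan out across many prefixes. The key format is an assumption carried over from the earlier sketches.

```python
def hot_key(device_id: str, ts: str) -> str:
    # Timestamp-first: new writes always land at the tail of the sorted
    # keyspace, concentrating load on a single node (a hotspot).
    return "{}#{}".format(ts, device_id)

def revised_key(device_id: str, ts: str) -> str:
    # Device-first: writes spread across device prefixes, while a
    # prefix scan on one device still reads a contiguous time range.
    return "{}#{}".format(device_id, ts)

print(hot_key("device-1234", "20240601-1200"))      # 20240601-1200#device-1234
print(revised_key("device-1234", "20240601-1200"))  # device-1234#20240601-1200
```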
What's next
- Watch a presentation on the iterative design process that Twitter used for Bigtable.
- Learn more about Bigtable performance.