Cloud Bigtable Reads

This page describes the types of read requests you can send to Cloud Bigtable, discusses the performance implications, and presents a few recommendations for specific types of queries. Before you read this page, you should be familiar with the overview of Cloud Bigtable.

Overview

Read requests to Cloud Bigtable stream back the contents of the requested rows in key order, meaning rows are returned in the order in which they are stored. Any write that has returned a response is available to read.

The queries that your table supports should help determine the type of read that is best for your use case. Cloud Bigtable read requests fall into two general categories:

  • Reading a single row
  • Scans, or reading multiple rows

Reads are atomic at the row level. This means that when you send a read request for a row, Cloud Bigtable returns either the entire row or, in the event the request fails, none of the row. A partial row is never returned unless you specifically request one.

We strongly recommend that you use our Cloud Bigtable client libraries to read data from a table instead of calling the API directly. Code samples showing how to send read requests are available in multiple languages. All read requests make the ReadRows API call.

Single-row reads

You can request a single row based on its row key.
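
As an illustration, here is a minimal sketch of a single-row read using the Python client library. The project, instance, table, and row key are placeholders, not values from this page:

```python
from google.cloud import bigtable

# Placeholder identifiers; substitute your own project, instance, and table.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# read_row issues a ReadRows call scoped to a single key and
# returns None if the row does not exist.
row = table.read_row(b"r1")
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            # Cells are ordered newest first, so cells[0] is the latest version.
            print(family, qualifier.decode(), cells[0].value)
```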

Scans

Scans are the most common way to read Cloud Bigtable data. You can read a contiguous range of rows or multiple ranges of rows by specifying a row key prefix or by specifying beginning and ending row keys.
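
For example, here is a minimal sketch of a range scan in Python, using the same placeholder names as the previous example. The RowSet restricts the scan to one contiguous key range; the hypothetical `user#` key scheme is for illustration only:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Scan one contiguous range of row keys. The end key is exclusive
# unless you pass end_inclusive=True.
row_set = RowSet()
row_set.add_row_range_from_keys(start_key=b"user#100", end_key=b"user#200")

# Rows stream back in key order.
for row in table.read_rows(row_set=row_set):
    print(row.row_key.decode())
```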

Filtered reads

If you need only rows that contain specific values, or only parts of each row, you can use a filter with your read request. Filters let you be highly selective about the data you retrieve.

Filters also let you make sure that reads match the garbage collection policies that your table is using. This is particularly useful if you frequently write new timestamped cells to existing columns. Because garbage collection can take up to a week to remove expired data, using a filter to read data can ensure you don't read more data than you need.

The overview of filters provides detailed explanations of the types of filters that you can use. Using filters shows examples in multiple languages.
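
As a sketch, the following Python read applies a filter chain that keeps only the most recent cell in each column and only cells written in the last seven days, mirroring a hypothetical one-version, seven-day garbage collection policy. All names are placeholders:

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Match a hypothetical "1 version, 7 days" garbage collection policy
# so that expired-but-not-yet-collected cells are never returned.
week_ago = datetime.datetime.utcnow() - datetime.timedelta(days=7)
recent_latest = row_filters.RowFilterChain(
    filters=[
        row_filters.CellsColumnLimitFilter(1),  # newest version of each column
        row_filters.TimestampRangeFilter(row_filters.TimestampRange(start=week_ago)),
    ]
)

# In practice you would usually also restrict the rowset;
# see Reads and performance below.
for row in table.read_rows(filter_=recent_latest):
    print(row.row_key.decode())
```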

Reads and performance

Reads that use filters are slower than reads without filters, and they increase CPU utilization. On the other hand, they can significantly reduce the amount of network bandwidth that you use, by limiting the amount of data that is returned. In general, filters should be used to control throughput efficiency, not latency.

If you want to optimize your read performance, consider the following strategies:

  1. Restrict the rowset as much as possible. Limiting the number of rows that your nodes have to scan is the first step toward improving time to first byte and overall query latency. If you don't restrict the rowset, Cloud Bigtable will almost certainly have to scan your entire table. This is why we recommend that you design your schema in a way that allows your most common queries to work this way.

  2. For additional performance tuning after you've restricted the rowset, try adding a basic filter. Restricting the set of columns or the number of versions returned generally doesn't increase latency and can sometimes help Cloud Bigtable seek more efficiently past irrelevant data in each row. A sketch combining this strategy with the previous one appears after this list.

  3. If you want to fine-tune your read performance even more after the first two strategies, consider using a more complicated filter. You might try this for a few reasons:

    • You're still getting back a lot of data you don't want.
    • You want to simplify your application code by pushing the query down into Cloud Bigtable.

    Be aware, however, that filters requiring conditions, interleaves, or regex matching on large values tend to do more harm than good if they allow most of the scanned data through. This harm comes in the form of increased CPU utilization in your cluster without large savings client-side.
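
As a rough illustration of the first two strategies together, here is a hedged Python sketch: the RowSet restricts the scan to one key range, and a basic filter limits the versions and column families returned. The `user#` key scheme and the `stats` family name are hypothetical:

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Strategy 1: restrict the rowset so nodes scan only the keys you need.
row_set = RowSet()
row_set.add_row_range_from_keys(start_key=b"user#1000", end_key=b"user#2000")

# Strategy 2: a basic filter. Limiting versions and column families
# generally doesn't add latency and lets Cloud Bigtable seek past
# irrelevant data within each row.
basic_filter = row_filters.RowFilterChain(
    filters=[
        row_filters.CellsColumnLimitFilter(1),       # latest version only
        row_filters.FamilyNameRegexFilter("stats"),  # one column family
    ]
)

for row in table.read_rows(row_set=row_set, filter_=basic_filter):
    print(row.row_key.decode())
```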

Reads for specific situations

Large rows

Cloud Bigtable limits the size of a row to 256 MB, but it's possible to accidentally exceed that maximum. If you need to read a row that has grown larger than the limit, you can paginate your request and use a cells per row limit filter and a cells per row offset filter. Be aware that if a write arrives for the row between the paginated read requests, the read might not be atomic.
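
Here is a hedged sketch of that pagination pattern in Python, with a hypothetical page size of 1,000 cells and a placeholder row key: each request skips the cells already read with a cells per row offset filter and caps the response with a cells per row limit filter.

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

PAGE_SIZE = 1000            # hypothetical number of cells per request
row_key = b"oversized-row"  # placeholder key for the too-large row
offset = 0

while True:
    # Skip the cells read so far, then take the next page. The offset
    # filter is omitted on the first request because the offset is zero.
    limit = row_filters.CellsRowLimitFilter(PAGE_SIZE)
    if offset:
        page_filter = row_filters.RowFilterChain(
            filters=[row_filters.CellsRowOffsetFilter(offset), limit]
        )
    else:
        page_filter = limit

    row = table.read_row(row_key, filter_=page_filter)
    if row is None:
        break  # no cells beyond the current offset

    cell_count = sum(
        len(cells)
        for columns in row.cells.values()
        for cells in columns.values()
    )
    # ...process this page of cells here...
    if cell_count < PAGE_SIZE:
        break  # a short page means the row is exhausted
    offset += cell_count
```

As noted above, a write that lands between two of these requests can make the combined result non-atomic.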

Simulating a reverse scan

Unlike HBase, Cloud Bigtable does not offer reverse scans. However, you can simulate one. This technique is useful if you want to find the most recent version of a value that is not always stored in the same row and column.

This approach assumes that you have created row keys that contain values that can be subtracted from, such as numbers or dates. Start with a row, X, that you want to search back from. First search the Y rows before X: the range X-Y to X. If the value is not found, search the next set of rows back, X-Y*2 to X-Y, then X-Y*3 to X-Y*2, and so on until the value is found.

For example, let's say your row keys consist of a customer ID and a date, in the format 123ABC#2020-05-02, and one of the columns is password_reset, which stores the hour when the password was reset. Cloud Bigtable stores the rows lexicographically by key, as shown in the following example. Note that the password_reset column does not exist for rows (days) when the password was not reset.

```
123ABC#2020-02-12,password_reset:03
123ABC#2020-04-02,password_reset:11
123ABC#2020-04-14
123ABC#2020-05-02
223ABC#2020-05-22
```

If you want to find the last time customer 123ABC reset their password, you can start by reading the range 123ABC#2020-05-21 to 123ABC#2020-05-22. Then, if the value is not found, read the range 123ABC#2020-05-19 to 123ABC#2020-05-20, and so on, until you retrieve a row that contains a value in the password_reset column.
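
Here is a hedged Python sketch of that backward search, assuming the 123ABC#YYYY-MM-DD key scheme above, a hypothetical stats_summary column family, and the placeholder client names from the earlier examples. It reads two-day windows, stepping further into the past until a password_reset cell turns up:

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import row_filters
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

CUSTOMER = "123ABC"
FAMILY = "stats_summary"  # hypothetical family holding password_reset
STEP = datetime.timedelta(days=2)  # Y: how far back each request looks


def last_password_reset(search_from, max_requests=30):
    """Search backward in STEP-sized windows for the newest password_reset cell."""
    end = search_from
    for _ in range(max_requests):
        start = end - STEP
        row_set = RowSet()
        row_set.add_row_range_from_keys(
            start_key=f"{CUSTOMER}#{start.isoformat()}".encode(),
            end_key=f"{CUSTOMER}#{end.isoformat()}".encode(),
            end_inclusive=True,
        )
        # Fetch only the column we care about.
        reset_only = row_filters.ColumnQualifierRegexFilter(b"password_reset")
        newest = None
        # Rows stream back in ascending key order, so the last match
        # in the window is the most recent one.
        for row in table.read_rows(row_set=row_set, filter_=reset_only):
            cells = row.cells.get(FAMILY, {}).get(b"password_reset")
            if cells:
                newest = (row.row_key.decode(), cells[0].value)
        if newest:
            return newest
        end = start  # step the window further into the past
    return None  # give up after max_requests windows


print(last_password_reset(datetime.date(2020, 5, 22)))
```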
