Query protobuf data

This document provides examples of common query patterns for reading and querying protocol buffer (protobuf) data stored in Bigtable.

Before you read this page, familiarize yourself with the following:

Example data

The following examples use a Music table that stores information about albums and artists. The data is stored in a column family named album_details, with a column qualifier album. The column qualifier contains protobuf messages.

The protobuf schema is defined in the following proto files:

  • artist.proto:

    syntax = "proto3";
    
    package package_name;
    
    message Artist {
     string name = 1;
    }
    
  • album.proto:

    syntax = "proto3";
    
    package package_name;
    
    import "artist.proto";
    
    message Album {
     string title = 1;
     Artist artist = 2;
     int32 release_year = 3;
    }
    

As a result, Bigtable creates a schema bundle for this table that contains the following descriptor set for these protobuf definitions:

file {
  name: "artist.proto"
  package: "package_name"
  message_type {
    name: "Artist"
    field {
      name: "name"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "name"
    }
  }
  syntax: "proto3"
}
file {
  name: "album.proto"
  package: "package_name"
  dependency: "artist.proto"
  message_type {
    name: "Album"
    field {
      name: "title"
      number: 1
      label: LABEL_OPTIONAL
      type: TYPE_STRING
      json_name: "title"
    }
    field {
      name: "artist"
      number: 2
      label: LABEL_OPTIONAL
      type: TYPE_MESSAGE
      type_name: ".package_name.Artist"
      json_name: "artist"
    }
    field {
      name: "release_year"
      number: 3
      label: LABEL_OPTIONAL
      type: TYPE_INT32
      json_name: "releaseYear"
    }
  }
  syntax: "proto3"
}

Example queries

The following examples show how to query protobuf data using GoogleSQL for Bigtable and BigQuery external tables.

Cast a column to a protobuf message

You can use the CAST operator to interpret a BYTES value as a protobuf message. To do this, you must provide the full name of the protobuf message in the following format: SCHEMA_BUNDLE_ID.FULLY_QUALIFIED_MESSAGE_NAME.

Replace the following:

  • SCHEMA_BUNDLE_ID: The unique ID you assigned to your schema bundle when you created it.
  • FULLY_QUALIFIED_MESSAGE_NAME: The full name of the message, which must include the package name defined in your proto file–for example, package_name.message_name.

The following sample query casts the album column to the Album protobuf message. This message is defined in the package_name package and is part of a schema bundle named bundle_name:

SELECT
  CAST(album_details['album'] AS bundle_name.package_name.Album).title
FROM
  Music;

Access nested fields

You can access nested fields within a protobuf message using dot notation.

The following query retrieves the name of the artist from the nested Artist message within the Album message:

SELECT
  CAST(album_details['album'] AS bundle_name.package_name.Album).artist.name
FROM
  Music;

Filter based on protobuf fields

You can use the WHERE clause to filter rows based on the values of fields within a protobuf message.

The following query selects all albums by the artist Dana A.:

SELECT
  *
FROM
  Music
WHERE
  CAST(album_details['album'] AS bundle_name.package_name.Album).artist.name = 'Dana A.';

Aggregate protobuf fields

You can use aggregate functions like SUM, AVG, MIN, MAX, and COUNT on numeric fields within your protobuf messages.

The following query calculates the average release year of all albums in the table:

SELECT
  AVG(CAST(album_details['album'] AS bundle_name.package_name.Album).release_year)
FROM
  Music;

Order by protobuf fields

You can use the ORDER BY clause to sort the result set based on a field in your protobuf message.

The following query retrieves all albums and sorts them by their release year in descending order:

SELECT
  *
FROM
  Music
ORDER BY
  CAST(album_details['album'] AS bundle_name.package_name.Album).release_year DESC;

Use with BigQuery external tables

You can query protobuf data stored in Bigtable from BigQuery by creating an external table. When creating the external table, you specify the column type as JSON, its encoding as PROTO_BINARY, and you associate it with a schema bundle. This process converts the protobuf message bytes into the equivalent JSON data, allowing you to query its fields directly.

The following is an example of a table definition file for creating a BigQuery external table over the Music table:

{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/PROJECT_ID/instances/INSTANCE_ID/tables/Music"
    ],
    "bigtableOptions": {
        "columnFamilies" : [
            {
                "familyId": "album_details",
                "columns": [
                    {
                        "qualifierString": "album",
                        "type": "JSON",
                        "encoding": "PROTO_BINARY",
                        "protoConfig": {
                            "schemaBundleId": "bundle_name",
                            "protoMessageName": "package_name.Album"
                        }
                    }
                ]
            }
        ]
    }
}

Once the external table is created, you can query the protobuf data as a JSON column in BigQuery.

The following query retrieves the title of all albums released in 2022 from the BigQuery external table:

SELECT JSON_VALUE(value, '$.title') AS title
FROM
`PROJECT_ID.DATASET.TABLE_NAME` AS t,
UNNEST(t.album_details.album.cell)
WHERE INT64(JSON_EXTRACT(value, '$.releaseYear')) = 2022;

What's next