Bigtable I/O

The Dataflow SDKs provide an API for reading data from, and writing data to, Google Cloud Bigtable. The BigtableIO source lets you read a PCollection of Bigtable Row objects from a given Bigtable table, and the BigtableIO sink lets you write a PCollection of row mutations to one.

Setting Bigtable Options

When you read from or write to Bigtable, you'll need to provide a table ID and a set of Bigtable options. These options contain information necessary to identify the target Bigtable cluster, including:

  • Project ID
  • Cluster ID
  • Zone ID

The easiest way to provide these options is to construct them using BigtableOptions.Builder, the nested builder of the class com.google.cloud.bigtable.config.BigtableOptions:

  BigtableOptions.Builder optionsBuilder =
     new BigtableOptions.Builder()
         .setProjectId("project")
         .setClusterId("cluster")
         .setZoneId("zone");

Reading from Bigtable

To read from Bigtable, apply the BigtableIO.read() transform to your Pipeline object. You'll need to specify the table ID and BigtableOptions using .withTableId and .withBigtableOptions, respectively. By default, BigtableIO.read() scans the entire specified Bigtable table and returns a PCollection of Bigtable Row objects:

  // Scan the entire table.
  PCollection<Row> btRows = p.apply("read",
      BigtableIO.read()
          .withBigtableOptions(optionsBuilder)
          .withTableId("table"));

If you want to scan a subset of the rows in the specified Bigtable, you can provide a Bigtable RowFilter object. If you provide a RowFilter, BigtableIO.read() will return only the Rows that match the filter:

  // Read only rows that match the specified filter.
  RowFilter filter = ...;

  PCollection<Row> filteredBtRows = p.apply("filtered read",
      BigtableIO.read()
          .withBigtableOptions(optionsBuilder)
          .withTableId("table")
          .withRowFilter(filter));

Writing to Bigtable

To write to Bigtable, apply the BigtableIO.write() transform to the PCollection containing your output data. You'll need to specify the table ID and BigtableOptions using .withTableId and .withBigtableOptions, respectively.
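For example, a write might look like the following sketch. It assumes optionsBuilder is configured as shown earlier, and that outputRows is a PCollection already formatted for the sink (the required element type is described in the next section):

```java
  // Write the formatted output to Bigtable. outputRows is assumed to be a
  // PCollection<KV<ByteString, Iterable<Mutation>>> produced earlier in the
  // pipeline; "table" is the ID of the target table.
  outputRows.apply("write",
      BigtableIO.write()
          .withBigtableOptions(optionsBuilder)
          .withTableId("table"));
```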

Formatting Bigtable Output Data

The BigtableIO data sink performs each write operation as a set of row mutations on the target Bigtable table. As such, you must format your output data as a PCollection<KV<ByteString, Iterable<Mutation>>>. Each element in the PCollection must contain:

  • The key of the row to be written as a ByteString.
  • An Iterable of Mutation objects that represent a series of idempotent row mutation operations.
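As a sketch, a DoFn that produces elements of this shape might look like the following. The input type MyRecord and its accessors, as well as the column family name "cf" and qualifier "col", are hypothetical placeholders; setting an explicit timestamp on each SetCell keeps the mutation idempotent if the write is retried:

```java
  // Hypothetical DoFn that formats pipeline records for the Bigtable sink.
  static class FormatForBigtable
      extends DoFn<MyRecord, KV<ByteString, Iterable<Mutation>>> {
    @Override
    public void processElement(ProcessContext c) {
      MyRecord record = c.element();
      // The row key, as a ByteString.
      ByteString rowKey = ByteString.copyFromUtf8(record.getKey());
      // A single SetCell mutation; an explicit timestamp makes it idempotent.
      Mutation mutation = Mutation.newBuilder()
          .setSetCell(Mutation.SetCell.newBuilder()
              .setFamilyName("cf")
              .setColumnQualifier(ByteString.copyFromUtf8("col"))
              .setTimestampMicros(record.getTimestampMicros())
              .setValue(ByteString.copyFromUtf8(record.getValue())))
          .build();
      c.output(KV.of(rowKey, ImmutableList.of(mutation)));
    }
  }
```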
