Google Cloud Dataflow SDK for Java, version 1.9.1

com.google.cloud.dataflow.sdk.io

Class BigQueryIO



  • public class BigQueryIO
    extends Object
    PTransforms for reading and writing BigQuery tables.

    Table References

    A fully-qualified BigQuery table name consists of three components:

    • projectId: the Cloud project id (defaults to GcpOptions.getProject()).
    • datasetId: the BigQuery dataset id, unique within a project.
    • tableId: a table id, unique within a dataset.

    BigQuery table references are stored as a TableReference, which comes from the BigQuery Java Client API. Tables can be referred to as Strings, with or without the projectId. A helper function is provided (parseTableSpec(String)) that parses the following string forms into a TableReference:

    • [project_id]:[dataset_id].[table_id]
    • [dataset_id].[table_id]
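
    For example, a minimal sketch of parsing both forms (the project, dataset, and table names here are hypothetical):

    
     // Fully-qualified form: all three components are given explicitly.
     TableReference ref1 = BigQueryIO.parseTableSpec("my-project:my_dataset.my_table");
    
     // Short form: the projectId is omitted and the default project is used.
     TableReference ref2 = BigQueryIO.parseTableSpec("my_dataset.my_table");
     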

    Reading

    To read from a BigQuery table, apply a BigQueryIO.Read transformation. This produces a PCollection of TableRows as output:

    
     PCollection<TableRow> weatherData = pipeline.apply(
         BigQueryIO.Read.named("Read")
                        .from("clouddataflow-readonly:samples.weather_stations"));
     

    See TableRow for more information on the TableRow object.
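
    Individual cells can then be read out of each TableRow by field name. As a minimal sketch (the mean_temp field matches the sample table read above; converting through the value's string form is just one common pattern, since cell values come back as Objects):

    
     PCollection<Double> meanTemps = weatherData.apply(
         ParDo.named("ExtractMeanTemp").of(new DoFn<TableRow, Double>() {
           @Override
           public void processElement(ProcessContext c) {
             TableRow row = c.element();
             // Cell values arrive as Objects; convert through their string form.
             c.output(Double.parseDouble(row.get("mean_temp").toString()));
           }
         }));
     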

    Instead of reading an entire table, users may supply a query to execute; if a query is specified, its result set is used as the input of the transform.

    
     PCollection<TableRow> meanTemperatureData = pipeline.apply(
         BigQueryIO.Read.named("Read")
                        .fromQuery("SELECT year, mean_temp FROM [samples.weather_stations]"));
     

    When creating a BigQuery input transform, users should provide either a query or a table. Pipeline construction will fail with a validation error if neither or both are specified.

    Writing

    To write to a BigQuery table, apply a BigQueryIO.Write transformation. This consumes a PCollection of TableRows as input.

    
     PCollection<TableRow> quotes = ...
    
     List<TableFieldSchema> fields = new ArrayList<>();
     fields.add(new TableFieldSchema().setName("source").setType("STRING"));
     fields.add(new TableFieldSchema().setName("quote").setType("STRING"));
     TableSchema schema = new TableSchema().setFields(fields);
    
     quotes.apply(BigQueryIO.Write
         .named("Write")
         .to("my-project:output.output_table")
         .withSchema(schema)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
     

    See BigQueryIO.Write for details on how to specify whether a write should append to an existing table, replace the table, or verify that the table is empty. Note that the dataset being written to must already exist. In streaming mode, only the WRITE_EMPTY and WRITE_APPEND write dispositions are supported; WRITE_TRUNCATE is not.
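
    For example, a sketch of an append-mode write that also creates the table if it is missing (reusing quotes and schema from the snippet above; the table name is hypothetical):

    
     quotes.apply(BigQueryIO.Write
         .named("WriteAppend")
         .to("my-project:output.output_table")
         .withSchema(schema)
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
     

    Note that CREATE_IF_NEEDED requires a table schema to be supplied via withSchema(TableSchema).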

    Sharding BigQuery output tables

    A common use case is to dynamically generate BigQuery table names based on the current window. To support this, BigQueryIO.Write.to(SerializableFunction) accepts a function mapping the current window to a tablespec. For example, here's code that outputs daily tables to BigQuery:

    
     PCollection<TableRow> quotes = ...
     quotes.apply(Window.<TableRow>into(CalendarWindows.days(1)))
           .apply(BigQueryIO.Write
             .named("Write")
             .withSchema(schema)
             .to(new SerializableFunction<BoundedWindow, String>() {
               public String apply(BoundedWindow window) {
                 // The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
                 String dayString = DateTimeFormat.forPattern("yyyy_MM_dd")
                      .withZone(DateTimeZone.UTC)
                      .print(((IntervalWindow) window).start());
                 return "my-project:output.output_table_" + dayString;
               }
             }));
     

    Per-window tables are not yet supported in batch mode.

    Permissions

    Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Please refer to the documentation of the corresponding PipelineRunners for more details.

    Please see BigQuery Access Control for security and permission related information specific to BigQuery.

    • Method Detail

      • parseTableSpec

        public static TableReference parseTableSpec(String tableSpec)
        Parse a table specification in the form "[project_id]:[dataset_id].[table_id]" or "[dataset_id].[table_id]".

        If the project id is omitted, the default project id is used.
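
        For example, a short sketch of the defaulting behavior (names hypothetical). With the short form, the returned TableReference carries only the dataset and table ids; the SDK fills in the default project when the transform runs:

        
         TableReference full = BigQueryIO.parseTableSpec("my-project:logs.events");
         full.getProjectId();    // "my-project"
        
         TableReference partial = BigQueryIO.parseTableSpec("logs.events");
         partial.getDatasetId(); // "logs"
         partial.getTableId();   // "events"
         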

