Loading Avro Data from Google Cloud Storage

Avro is an open source data format that bundles serialized data with the data's schema in the same file.

When you load Avro data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

For information about loading Avro data from a local file, see Loading data into BigQuery from a local data source.

Advantages of Avro

Avro is the preferred format for loading data into BigQuery. Loading Avro files has the following advantages over CSV and JSON (newline delimited):

  • The Avro binary format:
    • Is faster to load. The data can be read in parallel, even if the data blocks are compressed.
    • Doesn't require typing or serialization.
    • Is easier to parse because there are no encoding issues found in other formats such as ASCII.
  • You do not need to provide a schema definition. BigQuery infers the schema from the data.

Required permissions

When you load data into BigQuery, you need project or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data.

BigQuery permissions

When you are loading data into BigQuery from Cloud Storage, you must be granted the bigquery.dataOwner or bigquery.dataEditor role at the project level or at the dataset level. Both roles grant users and groups permission to load data into a new table or to append to or overwrite an existing table.

Granting the roles at the project level gives the user or group permission to load data into tables in every dataset in the project. Granting the roles at the dataset level gives the user or group the ability to load data only into tables in that dataset.

For more information on configuring dataset access, see Assigning access controls to datasets. For more information on IAM roles in BigQuery, see Access Control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions at the project level or on that individual bucket. If you are using a URI wildcard, you must also have storage.objects.list permissions.

You can grant the predefined IAM role storage.objectViewer to provide both storage.objects.get and storage.objects.list permissions.

Avro schemas

When you load Avro files into BigQuery, the table schema is automatically retrieved from the source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, suppose you have the following Avro files in Cloud Storage:

gs://mybucket/00/
  a.avro
  z.avro
gs://mybucket/01/
  b.avro

The following bq command loads all of the files (as a comma-separated list of URIs), and the schema is derived from mybucket/01/b.avro:

bq load --source_format=AVRO [DATASET].[TABLE_NAME] "gs://mybucket/00/*.avro","gs://mybucket/01/*.avro"

When importing multiple Avro files with different Avro schemas, all schemas must be compatible with Avro's Schema Resolution.
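
For example, here is a minimal sketch of two compatible schemas. It uses the fastavro library (an assumption; any Avro tooling works) and illustrative field names; the newer schema adds a field with a default, which Avro schema resolution treats as compatible with data written under the older schema:

import io
import fastavro

# Hypothetical writer schema used by older files (illustration only).
schema_v1 = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"}],
}

# The newer schema adds a nullable field with a default.
schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, fastavro.parse_schema(schema_v1), [{"id": 1}])
buf.seek(0)

# Reading v1 data with the v2 reader schema succeeds because of the default.
for row in fastavro.reader(buf, reader_schema=fastavro.parse_schema(schema_v2)):
    print(row)  # {'id': 1, 'email': None}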

When BigQuery detects the schema, some Avro data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. For more information, see Avro conversions.

Avro compression

Compressed Avro files are not supported, but compressed data blocks are. BigQuery supports the DEFLATE and Snappy codecs.
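
For example, a minimal sketch of producing a BigQuery-loadable file with compressed data blocks, assuming the fastavro library (the schema and file name are illustrative):

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "count", "type": "long"},
    ],
})

records = [{"name": "click", "count": 3}, {"name": "view", "count": 10}]

# codec="deflate" (or "snappy", if python-snappy is installed) compresses the
# data blocks inside the file; the file itself remains an uncompressed .avro file.
with open("events.avro", "wb") as out:
    fastavro.writer(out, schema, records, codec="deflate")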

Loading Avro data into a new table

To load Avro data from Google Cloud Storage into a new BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported.
    • For File format, select Avro.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is inferred from Avro files.
  6. Click Create Table.

Command-line

Use the bq load command, specify AVRO as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs or a URI containing a wildcard.

bq load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [FORMAT] is AVRO.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.avro into a table named mytable in mydataset:

    bq load --source_format=AVRO mydataset.mytable gs://mybucket/mydata.avro
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard:

    bq load --source_format=AVRO mydataset.mytable gs://mybucket/mydata*.avro
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards:

    bq load --source_format=AVRO mydataset.mytable "gs://mybucket/00/*.avro","gs://mybucket/01/*.avro"
    

API

Set the following properties to load Avro data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  3. Specify the Avro data format by setting the configuration.load.sourceFormat property to AVRO.

  4. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
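
As a sketch, the same steps with the Python client library for BigQuery (google-cloud-bigquery); the bucket, dataset, and table names are placeholders, and the exact client surface depends on the installed library version:

from google.cloud import bigquery

client = bigquery.Client()

# Point the load job at the Avro source data in Cloud Storage and set the
# source format (configuration.load.sourceFormat) to AVRO.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

load_job = client.load_table_from_uri(
    "gs://mybucket/00/*.avro",
    "mydataset.mytable",
    job_config=job_config,
)

# result() polls the job and raises an exception if status.errorResult is set.
load_job.result()
print("Loaded {} rows.".format(load_job.output_rows))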

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
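
For instance, a minimal sketch of that pattern with the Python client library (the job ID scheme here is hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

# Reuse the same job ID on retries: at most one insert with this ID succeeds,
# so a retried call cannot load the data twice. On a conflict, poll the known ID.
job_id = "load_mytable_20180101"  # hypothetical deterministic ID

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "mydataset.mytable",
    job_id=job_id,
    job_config=job_config,
)
load_job.result()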

Appending to or overwriting a table with Avro data

You can load additional data into a table either from source files or by appending query results.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

  • Write if empty
    • CLI flag: none
    • BigQuery API property: WRITE_EMPTY
    • Writes the data only if the table is empty.
  • Append to table
    • CLI flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is to append
    • BigQuery API property: WRITE_APPEND (default)
    • Appends the data to the end of the table.
  • Overwrite table
    • CLI flag: --replace or --replace=true
    • BigQuery API property: WRITE_TRUNCATE
    • Erases all existing data in a table before writing the new data.

To load Avro data from Google Cloud Storage and to append to or overwrite a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported.
    • For File format, select Avro.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending to or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. Schema information is inferred from Avro files.
  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data.

bq load --[no]replace --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [FORMAT] is AVRO.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.avro and overwrites a table named mytable in mydataset:

    bq load --replace --source_format=AVRO mydataset.mytable gs://mybucket/mydata.avro
    
  • The following command loads data from gs://mybucket/mydata.avro and appends data to a table named mytable in mydataset:

    bq load --noreplace --source_format=AVRO mydataset.mytable gs://mybucket/mydata.avro
    

API

Set the following properties to load Avro data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Wildcards are also supported when loading Avro data from Google Cloud Storage.

  3. Specify the data format by setting the configuration.load.sourceFormat property to AVRO.

  4. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY.
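
As a sketch with the Python client library (bucket, dataset, and table names are placeholders), overwriting an existing table might look like this:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO
# WRITE_TRUNCATE overwrites the table; use WRITE_APPEND to append,
# or WRITE_EMPTY to write only if the table is empty.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.avro",
    "mydataset.mytable",
    job_config=job_config,
)
load_job.result()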

Avro conversions

BigQuery converts Avro data types to the following BigQuery data types:

Primitive types

  • null → BigQuery ignores these values
  • boolean → BOOLEAN
  • int → INTEGER
  • long → INTEGER
  • float → FLOAT
  • double → FLOAT
  • bytes → BYTES
  • string → STRING (UTF-8 only)

Complex types

  • record → RECORD
    • Aliases are ignored
    • Doc is ignored
    • Default values are set at read time
    • Order is ignored
    • Recursive fields are dropped; only the first level of nesting is maintained for recursive fields
  • enum → STRING
    • The string is the symbolic value of the enum
    • Aliases are ignored
    • Doc is ignored
  • array → repeated fields
    • Arrays of arrays are not supported
  • map<T> → RECORD (see the sketch after this list)
    • BigQuery converts an Avro map<T> field to a repeated RECORD that contains two fields: a key and a value. BigQuery stores the key as a STRING, and converts the value to its corresponding data type in BigQuery.
  • union → nullable field, or RECORD with a list of nullable fields
    • When a union has only one non-null type, it converts to a nullable field.
    • Otherwise, it converts to a RECORD with a list of nullable fields. Only one of these fields is set at read time.
  • fixed → BYTES
    • Aliases are ignored
    • Size is ignored
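
For example, here is a sketch of an Avro schema with a map field and the shape BigQuery gives it after loading. It assumes the fastavro library, and the field names are illustrative:

import fastavro

# Avro record with a map<long> field.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "PageStats",
    "fields": [
        {"name": "page", "type": "string"},
        {"name": "counts", "type": {"type": "map", "values": "long"}},
    ],
})

rows = [{"page": "/home", "counts": {"clicks": 3, "views": 10}}]

with open("stats.avro", "wb") as out:
    fastavro.writer(out, schema, rows)

# After loading this file, BigQuery represents "counts" as a repeated RECORD
# with two fields, key (STRING) and value (INTEGER), one element per map entry.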

Logical types

Currently, Avro logical types are not supported in BigQuery. The logicalType attribute is ignored, and the underlying Avro type is used instead.

  • date → INTEGER
  • time-millis → INTEGER
  • time-micros → INTEGER (converted from LONG)
  • timestamp-millis → INTEGER (converted from LONG)
  • timestamp-micros → INTEGER (converted from LONG)
  • duration → BYTES (converted from fixed type of size 12)

For more information on Avro data types, see the Apache Avro™ 1.8.2 Specification.
