Loading Parquet Data from Google Cloud Storage

This page provides an overview of loading Parquet data from Cloud Storage into BigQuery.

Parquet is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load Parquet data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

For information about loading Parquet data from a local file, see Loading data into BigQuery from a local data source.

Parquet schemas

When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, suppose you have the following Parquet files in Cloud Storage:

gs://mybucket/00/
  a.parquet
  z.parquet
gs://mybucket/01/
  b.parquet

The following command loads all of the files in a single bq invocation (as a comma-separated list of URIs), and the schema is derived from mybucket/01/b.parquet, the alphabetically last file:

bq --location=US load --source_format=PARQUET [DATASET].[TABLE] "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

When you load multiple Parquet files that have different schemas, identical columns specified in multiple schemas must have the same mode in each schema definition.

When BigQuery detects the schema, some Parquet data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. For more information, see Parquet conversions.

Parquet compression

Compressed Parquet files are not supported, but compressed data blocks are. BigQuery supports Snappy and GZip codecs.
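
BigQuery reads the compression codec from the Parquet file itself, so there is no compression setting to supply on the load job. As an illustration, the following sketch writes a Parquet file with Snappy-compressed data blocks using the pyarrow library (pyarrow is an assumption here; the file and column names are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it with Snappy compression applied to the data pages.
table = pa.Table.from_pydict({'state': ['WA', 'OR'], 'population': [7600000, 4200000]})
pq.write_table(table, 'states.parquet', compression='snappy')  # 'gzip' also produces data blocks BigQuery can read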

Loading Parquet data into a new table

To load Parquet data from Google Cloud Storage into a new BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
    • For File format, select Parquet.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is self-described in Parquet files.
  6. Click Create Table.

Command-line

Use the bq load command, specify PARQUET as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

Supply the --location flag and set the value to your location.

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is PARQUET.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.parquet into a table named mytable in mydataset. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --source_format=PARQUET mydataset.mytable gs://mybucket/mydata*.parquet
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards. mybucket and mydataset were created in the asia-northeast1 region.

    bq --location=asia-northeast1 load --source_format=PARQUET mydataset.mytable "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"
    

API

Set the following properties to load Parquet data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  4. Specify the Parquet data format by setting the configuration.load.sourceFormat property to PARQUET.

  5. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job has finished running; check status.errorResult to determine whether it succeeded.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
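
For example, with the Python client library you can supply your own job ID when starting the load and then poll or retry on that ID. The following is a minimal sketch of that best practice; the bucket, dataset, and table names are placeholders:

import uuid

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.PARQUET

# Generate the job ID up front so a retry after a network failure reuses the same ID.
job_id = 'load_parquet_{}'.format(uuid.uuid4())

load_job = client.load_table_from_uri(
    'gs://mybucket/mydata.parquet',                 # placeholder URI
    client.dataset('mydataset').table('mytable'),   # placeholder dataset and table
    job_id=job_id,
    job_config=job_config)  # API request

load_job.result()  # Waits for completion and raises on a fatal error.
if load_job.errors:
    print('Non-fatal errors: {}'.format(load_job.errors))  # corresponds to status.errors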

Python

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

Use the Client.load_table_from_uri() method to start a load job from Cloud Storage. To use Parquet, set the LoadJobConfig.source_format property to the SourceFormat constant PARQUET and pass the job config as the job_config argument to the load_table_from_uri() method.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.PARQUET
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet'

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for table load to complete.

assert load_job.state == 'DONE'
assert client.get_table(dataset_ref.table('us_states')).num_rows > 0

Appending to or overwriting a table with Parquet data

You can load additional data into a table either from source files or by appending query results.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

Web UI option | CLI flag | BigQuery API property | Description
Write if empty | None | WRITE_EMPTY | Writes the data only if the table is empty.
Append to table | --noreplace or --replace=false (if --[no]replace is unspecified, the default is append) | WRITE_APPEND | (Default) Appends the data to the end of the table.
Overwrite table | --replace or --replace=true | WRITE_TRUNCATE | Erases all existing data in a table before writing the new data.
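
In the Python client library, the same three write preferences correspond to WriteDisposition constants, shown here as a minimal sketch (the full load examples appear in the Python instructions below):

from google.cloud import bigquery

# Mapping from the options above to the client library's constants.
WRITE_IF_EMPTY = bigquery.WriteDisposition.WRITE_EMPTY      # Write if empty
APPEND_TO_TABLE = bigquery.WriteDisposition.WRITE_APPEND    # Append to table (default)
OVERWRITE_TABLE = bigquery.WriteDisposition.WRITE_TRUNCATE  # Overwrite table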

To load Parquet data from Google Cloud Storage and to append to or overwrite a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
    • For File format, select Parquet.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. Schema information is self-described in Parquet files.
  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Enter the bq load command with the --replace flag to overwrite the table, or with the --noreplace flag to append data to the table; if neither flag is specified, the default is to append data. Specify PARQUET as the source_format, and supply the --location flag set to your location.

bq --location=[LOCATION] load --[no]replace --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is PARQUET.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.parquet and overwrites a table named mytable in mydataset. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --replace --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet
    
  • The following command loads data from gs://mybucket/mydata.parquet and appends data to a table named mytable in mydataset. mybucket and mydataset were created in the asia-northeast1 region.

    bq --location=asia-northeast1 load --noreplace --source_format=PARQUET mydataset.mytable gs://mybucket/mydata.parquet
    

API

Set the following properties to load Parquet data using the API.

  1. Create a load job that points to the source data in Google Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Wildcards are also supported when loading Parquet data from Google Cloud Storage.

  4. Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY.

Python

For more on installing and creating a BigQuery client, refer to BigQuery Client Libraries.

To append rows to an existing table, set the LoadJobConfig.write_disposition property to the WriteDisposition constant WRITE_APPEND. Note that the schema of the data must match the destination table's schema in order to append the data.

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

previous_rows = client.get_table(table_ref).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# The schema of the parquet file must match the table schema in an append
job_config.source_format = bigquery.SourceFormat.PARQUET
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet'
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for table load to complete.

assert load_job.state == 'DONE'
assert client.get_table(table_ref).num_rows == previous_rows + 50

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the WriteDisposition constant WRITE_TRUNCATE.

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

previous_rows = client.get_table(table_ref).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.source_format = bigquery.SourceFormat.PARQUET
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet'
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for table load to complete.

assert load_job.state == 'DONE'
assert client.get_table(table_ref).num_rows == 50

Parquet conversions

BigQuery converts Parquet data types to the following BigQuery data types:

Type conversions

Parquet type | Parquet converted type(s) | BigQuery data type
BOOLEAN | NONE | Boolean
INT32 | NONE, UINT_8, UINT_16, UINT_32, INT_8, INT_16, INT_32 | Integer
INT32 | DECIMAL (see DECIMAL annotation) | Numeric
INT32 | DATE | Date
INT64 | NONE, UINT_64, INT_64 | Integer
INT64 | DECIMAL (see DECIMAL annotation) | Numeric
INT64 | TIMESTAMP_MILLIS | Timestamp
INT64 | TIMESTAMP_MICROS | Timestamp
INT96 | NONE | Timestamp
FLOAT | NONE | Floating point
DOUBLE | NONE | Floating point
BYTE_ARRAY | NONE | Bytes
BYTE_ARRAY | UTF8 | String
FIXED_LEN_BYTE_ARRAY | DECIMAL (see DECIMAL annotation) | Numeric
FIXED_LEN_BYTE_ARRAY | NONE | Bytes

Other combinations of Parquet types and converted types are not supported.

Decimal annotation

Parquet types with the DECIMAL annotation may have a precision of at most 38 (total number of digits) and a scale of at most 9 (digits to the right of the decimal point). The number of integer digits, which is the precision minus the scale, may be at most 29. For example, DECIMAL(38, 9) is supported because its precision is 38 and its scale is 9, leaving 29 integer digits. DECIMAL(38, 5) is not supported because its precision is 38 and its scale is 5, leaving 33 integer digits, which exceeds the limit of 29.
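
As an illustration, the following sketch uses pyarrow (an assumption; the column and file names are placeholders) to write a DECIMAL(38, 9) column that stays within these limits:

import decimal

import pyarrow as pa
import pyarrow.parquet as pq

# DECIMAL(38, 9): precision 38, scale 9, so 29 integer digits, which BigQuery accepts as Numeric.
amounts = pa.array([decimal.Decimal('12345.123456789')], type=pa.decimal128(38, 9))
pq.write_table(pa.Table.from_arrays([amounts], names=['amount']), 'amounts.parquet')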

Column name conversions

A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. A column name cannot use any of the following prefixes:

  • _TABLE_
  • _FILE_
  • _PARTITION

Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.

Currently, you cannot load Parquet files containing columns that have a period (.) in the column name.

If a Parquet column name contains other characters (aside from a period), the characters are replaced with underscores. Trailing underscores may be added to column names to avoid collisions. For example, if a Parquet file contains two columns named Column1 and column1, the columns are loaded as Column1 and column1_ respectively.
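
The following sketch mirrors these renaming rules when cleaning up your own column names before writing Parquet; it is an illustration of the rules described above, not BigQuery's implementation, and it does not enforce the leading-character or reserved-prefix restrictions:

import re

def bigquery_safe_columns(names):
    """Replace unsupported characters with underscores and append trailing
    underscores to resolve case-insensitive duplicates."""
    seen = set()
    safe = []
    for name in names:
        cleaned = re.sub(r'[^A-Za-z0-9_]', '_', name)[:128]
        while cleaned.lower() in seen:  # names differing only in case collide
            cleaned += '_'
        seen.add(cleaned.lower())
        safe.append(cleaned)
    return safe

print(bigquery_safe_columns(['Column1', 'column1', 'unit price']))
# ['Column1', 'column1_', 'unit_price']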
