Stay organized with collections Save and categorize content based on your preferences.

Loading Parquet data from Cloud Storage

This page provides an overview of loading Parquet data from Cloud Storage into BigQuery.

Parquet is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load Parquet data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi- regional location as the Cloud Storage bucket.

For information about loading Parquet data from a local file, see Loading data from local files.

Limitations

You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:

  • If your dataset's location is set to a value other than the US multi-region, then the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset.
  • BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
  • BigQuery does not support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.

Input file requirements

To avoid resourcesExceeded errors when loading Parquet files into BigQuery, follow these guidelines:

  • Keep record sizes to 50 MB or less.
  • If your input data contains more than 100 columns, consider reducing the page size to be smaller than the default page size (1 * 1024 * 1024 bytes). This is especially helpful if you are using significant compression.

Before you begin

Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.

Required permissions

To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.

Permissions to load data into BigQuery

To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create

Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin (includes the bigquery.jobs.create permission)
  • bigquery.user (includes the bigquery.jobs.create permission)
  • bigquery.jobUser (includes the bigquery.jobs.create permission)

Additionally, if you have the bigquery.datasets.create permission, you can create and update tables using a load job in the datasets that you create.

For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Permissions to load data from Cloud Storage

To load data from a Cloud Storage bucket, you need the following IAM permissions:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list (required if you are using a URI wildcard)

Create a dataset

Create a BigQuery dataset to store your data.

Parquet schemas

When you load Parquet files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, you have the following Parquet files in Cloud Storage:

gs://mybucket/00/
  a.parquet
  z.parquet
gs://mybucket/01/
  b.parquet

Running this command in the bq command-line tool loads all of the files (as a comma-separated list), and the schema is derived from mybucket/01/b.parquet:

bq load \
--source_format=PARQUET \
dataset.table \
"gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

When you load multiple Parquet files that have different schemas, identical columns specified in multiple schemas must have the same mode in each schema definition.

When BigQuery detects the schema, some Parquet data types are converted to BigQuery data types to make them compatible with GoogleSQL syntax. For more information, see Parquet conversions.

To provide a table schema for creating external tables, set the referenceFileSchemaUri property in BigQuery API or
--reference_file_schema_uri parameter in bq command-line tool to the URL of the reference file.

For example, --reference_file_schema_uri="gs://mybucket/schema.parquet".

Parquet compression

BigQuery supports the following compression codecs for Parquet file contents:

  • GZip
  • LZO_1C and LZO_1X
  • Snappy
  • ZSTD

Loading Parquet data into a new table

You can load Parquet data into a new table by using one of the following:

  • The Google Cloud console
  • The bq command-line tool's bq load command
  • The jobs.insert API method and configuring a load job
  • The client libraries

To load Parquet data from Cloud Storage into a new BigQuery table:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.
  3. In the Dataset info section, click Create table.
  4. In the Create table panel, specify the following details:
    1. In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
      1. Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite. select source file to create a BigQuery table
      2. For File format, select Parquet.
    2. In the Destination section, specify the following details:
      1. For Dataset, select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to Native table.
    3. In the Schema section, no action is necessary. The schema is self-described in Parquet files.
    4. Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
    5. Click Advanced options and do the following:
      • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
      • If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
      • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
    6. Click Create table.

SQL

Use the LOAD DATA DDL statement. The following example loads a Parquet file into the new table mytable:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    LOAD DATA OVERWRITE mydataset.mytable
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://bucket/path/file.parquet']);
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

bq

Use the bq load command, specify PARQUET using the --source_format flag, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Possible values are HOUR, DAY, MONTH, and YEAR. This flag is optional when you create a table partitioned on a DATE, DATETIME, or TIMESTAMP column. The default partition type for time-based partitioning is DAY. You cannot change the partitioning specification on an existing table.
  • --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter can reduce cost and improve performance. For more information, see Querying partitioned tables.
  • --clustering_fields: A comma-separated list of up to four column names used to create a clustered table.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

    For more information on partitioned tables, see:

    For more information on clustered tables, see:

    For more information on table encryption, see:

To load Parquet data into BigQuery, enter the following command:

bq --location=LOCATION load \
--source_format=FORMAT \
DATASET.TABLE \
PATH_TO_SOURCE

Replace the following:

  • LOCATION: your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • FORMAT: PARQUET.
  • DATASET: an existing dataset.
  • TABLE: the name of the table into which you're loading data.
  • PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.parquet into a table named mytable in mydataset.

    bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a new ingestion-time partitioned table named mytable in mydataset.

    bq load \
    --source_format=PARQUET \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

    bq load \
    --source_format=PARQUET \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

    bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata*.parquet

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma- separated list of Cloud Storage URIs with wildcards.

    bq load \
    --source_format=PARQUET \
    mydataset.mytable \
    "gs://mybucket/00/*.parquet","gs://mybucket/01/*.parquet"

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. Each URI can contain one '*' wildcard character.

  4. Specify the Parquet data format by setting the sourceFormat property to PARQUET.

  5. To check the job status, call jobs.get(JOB_ID*), replacing JOB_ID with the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully; although, there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent: if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquet demonstrates loading Apache Parquet data from Cloud Storage into a table.
func importParquet(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquet {

  public static void runLoadParquet() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquet(datasetName);
  }

  public static void loadParquet(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet loaded successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("GCS Parquet was not loaded. \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the Parquet file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.parquet
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadTableGCSParquet() {
  // Imports a GCS file into a table with Parquet source format.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET');
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

Use the Client.load_table_from_uri() method to start a load job from Cloud Storage. To use Parquet, set the LoadJobConfig.source_format property to the string PARQUET and pass the job config as the job_config argument to the load_table_from_uri() method.
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Appending to or overwriting a table with Parquet data

You can load additional data into a table either from source files or by appending query results.

In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result.

You have the following options when you load additional data into a table:

Console option bq tool flag BigQuery API property Description
Write if empty Not supported WRITE_EMPTY Writes the data only if the table is empty.
Append to table --noreplace or --replace=false; if --[no]replace is unspecified, the default is append WRITE_APPEND (Default) Appends the data to the end of the table.
Overwrite table --replace or --replace=true WRITE_TRUNCATE Erases all existing data in a table before writing the new data. This action also deletes the table schema and removes any Cloud KMS key.

If you load data into an existing table, the load job can append the data or overwrite the table.

You can append or overwrite a table by using one of the following:

  • The Google Cloud console
  • The bq command-line tool's bq load command
  • The jobs.insert API method and configuring a load job
  • The client libraries

To append or overwrite a table with Parquet data:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.
  3. In the Dataset info section, click Create table.
  4. In the Create table panel, specify the following details:
    1. In the Source section, select Google Cloud Storage in the Create table from list. Then, do the following:
      1. Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite. select source file to create a BigQuery table
      2. For File format, select Parquet.
    2. In the Destination section, specify the following details:
      1. For Dataset, select the dataset in which you want to create the table.
      2. In the Table field, enter the name of the table that you want to create.
      3. Verify that the Table type field is set to Native table.
    3. In the Schema section, no action is necessary. The schema is self-described in Parquet files.
    4. Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
    5. Click Advanced options and do the following:
      • For Write preference, choose Append to table or Overwrite table.
      • If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
      • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
    6. Click Create table.

SQL

Use the LOAD DATA DDL statement. The following example appends a Parquet file to the table mytable:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    LOAD DATA INTO mydataset.mytable
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://bucket/path/file.parquet']);
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

bq

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to PARQUET. Because Parquet schemas are automatically retrieved from the self-describing source data, you do not need to provide a schema definition.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --destination_kms_key: The Cloud KMS key for encryption of the table data.
bq --location=LOCATION load \
--[no]replace \
--source_format=FORMAT \
DATASET.TABLE \
PATH_TO_SOURCE

Replace the following:

  • location: your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
  • format: PARQUET.
  • dataset: an existing dataset.
  • table: the name of the table into which you're loading data.
  • path_to_source: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.parquet and overwrites a table named mytable in mydataset.

    bq load \
    --replace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

The following command loads data from gs://mybucket/mydata.parquet and appends data to a table named mytable in mydataset.

    bq load \
    --noreplace \
    --source_format=PARQUET \
    mydataset.mytable \
    gs://mybucket/mydata.parquet

For information on appending and overwriting partitioned tables using the bq command-line tool, see Appending to and overwriting partitioned table data.

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs property must be fully qualified, in the format gs://BUCKET/OBJECT. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.

  4. Specify the data format by setting the configuration.load.sourceFormat property to PARQUET.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.

Go

Before trying this sample, follow the Go setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Go API reference documentation.

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// importParquetTruncate demonstrates loading Apache Parquet data from Cloud Storage into a table
// and overwriting/truncating existing data in the table.
func importParquetTruncate(projectID, datasetID, tableID string) error {
	// projectID := "my-project-id"
	// datasetID := "mydataset"
	// tableID := "mytable"
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return fmt.Errorf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.parquet")
	gcsRef.SourceFormat = bigquery.Parquet
	gcsRef.AutoDetect = true
	loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}

	if status.Err() != nil {
		return fmt.Errorf("job completed with error: %v", status.Err())
	}
	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.


import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobInfo.WriteDisposition;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.math.BigInteger;

public class LoadParquetReplaceTable {

  public static void runLoadParquetReplaceTable() {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    loadParquetReplaceTable(datasetName);
  }

  public static void loadParquetReplaceTable(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Imports a GCS file into a table and overwrites table data if table already exists.
      // This sample loads CSV file at:
      // https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
      String sourceUri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet";
      TableId tableId = TableId.of(datasetName, "us_states");

      // For more information on LoadJobConfiguration see:
      // https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigquery/LoadJobConfiguration.Builder.html
      LoadJobConfiguration configuration =
          LoadJobConfiguration.builder(tableId, sourceUri)
              .setFormatOptions(FormatOptions.parquet())
              // Set the write disposition to overwrite existing table data.
              .setWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
              .build();

      // For more information on Job see:
      // https://googleapis.dev/java/google-cloud-clients/latest/index.html?com/google/cloud/bigquery/package-summary.html
      // Load the table
      Job job = bigquery.create(JobInfo.of(configuration));

      // Load data from a GCS parquet file into the table
      // Blocks until this load table job completes its execution, either failing or succeeding.
      Job completedJob = job.waitFor();
      if (completedJob == null) {
        System.out.println("Job not executed since it no longer exists.");
        return;
      } else if (completedJob.getStatus().getError() != null) {
        System.out.println(
            "BigQuery was unable to load into the table due to an error: \n"
                + job.getStatus().getError());
        return;
      }

      // Check number of rows loaded into the table
      BigInteger numRows = bigquery.getTable(tableId).getNumRows();
      System.out.printf("Loaded %d rows. \n", numRows);

      System.out.println("GCS parquet overwrote existing table successfully.");
    } catch (BigQueryException | InterruptedException e) {
      System.out.println("Table extraction job was interrupted. \n" + e.toString());
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the CSV file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.parquet';

async function loadParquetFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'PARQUET',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableID = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.parquet';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('PARQUET')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Python API reference documentation.

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the WRITE_TRUNCATE.
import io

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
)

body = io.BytesIO(b"Washington,WA")
client.load_table_from_file(body, table_id, job_config=job_config).result()
previous_rows = client.get_table(table_id).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)

uri = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))

Loading hive-partitioned Parquet data

BigQuery supports loading hive partitioned Parquet data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading externally partitioned data.

Parquet conversions

This section describes how BigQuery parses various data types when loading Parquet data.

Some Parquet data types (such as INT32, INT64, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY) can be converted into multiple BigQuery data types. To ensure BigQuery converts the Parquet data types correctly, specify the appropriate data type in the Parquet file.

For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:

optional int32 date_col (DATE);

BigQuery converts Parquet data types to the BigQuery data types that are described in the following sections.

Type conversions

Parquet type Parquet logical type(s) BigQuery data type
BOOLEAN None BOOLEAN
INT32 None, INTEGER (UINT_8, UINT_16, UINT_32, INT_8, INT_16, INT_32) INTEGER
INT32 DECIMAL NUMERIC, BIGNUMERIC, or STRING
INT32 DATE DATE
INT64 None, INTEGER (UINT_64, INT_64) INTEGER
INT64 DECIMAL NUMERIC, BIGNUMERIC, or STRING
INT64 TIMESTAMP, precision=MILLIS (TIMESTAMP_MILLIS) TIMESTAMP
INT64 TIMESTAMP, precision=MICROS (TIMESTAMP_MICROS) TIMESTAMP
INT96 None TIMESTAMP
FLOAT None FLOAT
DOUBLE None FLOAT
BYTE_ARRAY None BYTES
BYTE_ARRAY STRING (UTF8) STRING
FIXED_LEN_BYTE_ARRAY DECIMAL NUMERIC, BIGNUMERIC, or STRING
FIXED_LEN_BYTE_ARRAY None BYTES

Nested groups are converted into STRUCT types. Other combinations of Parquet types and converted types are not supported.

Unsigned logical types

The Parquet UINT_8, UINT_16, UINT_32, and UINT_64 types are unsigned. BigQuery will treat values with these types as unsigned when loading into a BigQuery signed INTEGER column. In the case of UINT_64, an error will be returned if the unsigned value exceeds the maximum INTEGER value of 9,223,372,036,854,775,807.

Decimal logical type

Decimal logical types can be converted to NUMERIC, BIGNUMERIC , or STRING types. The converted type depends on the precision and scale parameters of the decimal logical type and the specified decimal target types. Specify the decimal target type as follows:

Enum logical type

Enum logical types can be converted to STRING or BYTES. Specify the converted target type as follows:

List logical type

You can enable schema inference for Parquet LIST logical types. BigQuery checks whether the LIST node is in the standard form or in one of the forms described by the backward-compatibility rules:

// standard form
<optional | required> group <name> (LIST) {
  repeated group list {
    <optional | required> <element-type> element;
  }
}

If yes, the corresponding field for the LIST node in the converted schema is treated as if the node has the following schema:

repeated <element-type> <name>

The nodes "list" and "element" are omitted.

Column name conversions

A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 300 characters. A column name cannot use any of the following prefixes:

  • _TABLE_
  • _FILE_
  • _PARTITION
  • _ROW_TIMESTAMP
  • __ROOT__
  • _COLIDENTIFIER

Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.

You cannot load Parquet files containing columns that have a period (.) in the column name.

If a Parquet column name contains other characters (aside from a period), the characters are replaced with underscores. You can add trailing underscores to column names to avoid collisions. For example, if a Parquet file contains 2 columns Column1 and column1, the columns are loaded as Column1 and column1_ respectively.