Loading ORC data from Cloud Storage

This page provides an overview of loading ORC data from Cloud Storage into BigQuery.

ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.

For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.

ORC schemas

When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, suppose you have the following ORC files in Cloud Storage:

gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc

The following command loads all of the files in a single CLI invocation (as a comma-separated list), and the schema is derived from mybucket/01/b.orc:

bq load \
--source_format=ORC \
dataset.table \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"

When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. All fields in the detected schema are NULLABLE. For more information, see ORC conversions.

When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
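
To confirm how the schema was converted after a load completes, you can inspect the table schema, for example with the bq command-line tool (mydataset.mytable is a placeholder for your own table):

    bq show --schema --format=prettyjson mydataset.mytable

Each field in the output shows its converted BigQuery data type and mode.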

ORC compression

Compressed ORC files are not supported, but compressed file footers and stripes are. Supported compression types are Zlib, Snappy, LZO, and LZ4.

Required permissions

When you load data into BigQuery, you need permissions to run a load job and permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.

BigQuery permissions

At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.

  • bigquery.tables.create
  • bigquery.tables.updateData
  • bigquery.jobs.create

The following predefined Cloud IAM roles include both bigquery.tables.create and bigquery.tables.updateData permissions:

  • bigquery.dataEditor
  • bigquery.dataOwner
  • bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create permissions:

  • bigquery.user
  • bigquery.jobUser
  • bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that user creates a dataset, they are granted bigquery.dataOwner access to it. bigquery.dataOwner access gives the user the ability to create and update tables in the dataset via a load job.
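
For example, one way to grant a predefined role at the project level is with the gcloud command-line tool; the project ID and email address below are placeholders:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:load-user@example.com" \
        --role="roles/bigquery.dataEditor"

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:load-user@example.com" \
        --role="roles/bigquery.jobUser"

This grants the bigquery.tables.create and bigquery.tables.updateData permissions through bigquery.dataEditor, and the bigquery.jobs.create permission through bigquery.jobUser.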

For more information on Cloud IAM roles and permissions in BigQuery, see Access control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted storage.objects.get permissions. If you are using a URI wildcard, you must also have storage.objects.list permissions.

The predefined Cloud IAM role storage.objectViewer can be granted to provide both storage.objects.get and storage.objects.list permissions.
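
For example, one way to grant this role on a specific bucket is with the gsutil command-line tool; the bucket name and email address below are placeholders:

    gsutil iam ch user:load-user@example.com:objectViewer gs://example-bucket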

Loading ORC data into a new table

You can load ORC data into a new table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

To load ORC data from Cloud Storage into a new BigQuery table:

Console

  1. Open the BigQuery web UI in the GCP Console.

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.


  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the GCP Console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.


    • For File format, select ORC.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.


    • Verify that Table type is set to Native table.

    • In the Table name field, enter the name of the table you're creating in BigQuery.

  6. In the Schema section, no action is necessary. The schema is self-described in ORC files.

  7. (Optional) To partition the table, choose your options in the Partition and cluster settings:

    • To create a partitioned table, click No partitioning, select Partition by field and choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
    • To create an ingestion-time partitioned table, click No partitioning and select Partition by ingestion time.
  8. (Optional) For Partitioning filter, click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if No partitioning is selected.

  9. (Optional) To cluster the table, in the Clustering order box, enter between one and four field names. Currently, clustering is supported only for partitioned tables.

  10. (Optional) Click Advanced options.

    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, an invalid message is returned and the job fails.
    • For Unknown values, leave Ignore unknown values unchecked. This option applies only to CSV and JSON files.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
  11. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • Click Create from source.
    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
    • For File format, select ORC.
  4. In the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is self-described in ORC files.

  6. (Optional) In the Options section:

    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, an invalid message is returned and the job fails.
    • For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
    • To partition the table:
      • For Partitioning Type, click None and choose Day.
      • For Partitioning Field:
      • To create a partitioned table, choose a DATE or TIMESTAMP column. This option is unavailable if your schema does not include a DATE or TIMESTAMP column.
      • To create an ingestion-time partitioned table, leave the default value: _PARTITIONTIME.
      • Click the Require partition filter box to require users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if Partitioning type is set to None.
    • To cluster the table, in the Clustering fields box, enter between one and four field names.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --time_partitioning_type: Enables time-based partitioning on a table and sets the partition type. Currently, the only possible value is DAY which generates one partition per day. This flag is optional when you create a table partitioned on a DATE or TIMESTAMP column.
  • --time_partitioning_expiration: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
  • --time_partitioning_field: The DATE or TIMESTAMP column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
  • --require_partition_filter: When enabled, this option requires users to include a WHERE clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables.
  • --clustering_fields: A comma-separated list of up to four column names used to create a clustered table. This flag can only be used with partitioned tables.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

    For more information on partitioned tables, clustered tables, and table encryption, see the related pages in the BigQuery documentation. An example that combines several of these flags follows.
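
The following sketch combines several of the optional flags above to load ORC data into a table partitioned on a TIMESTAMP column, with a required partition filter and clustering; the column names mytimestamp, mykey1, and mykey2 are illustrative placeholders:

    bq load \
    --source_format=ORC \
    --time_partitioning_field mytimestamp \
    --require_partition_filter \
    --clustering_fields "mykey1,mykey2" \
    mydataset.mytable \
    gs://mybucket/mydata.orc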

To load ORC data into BigQuery, enter the following command:

bq --location=location load \
--source_format=format \
dataset.table \
path_to_source

Where:

  • location is your location. The --location flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • format is ORC.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.orc into a table named mytable in mydataset.

    bq load \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc

The following command loads data from gs://mybucket/mydata.orc into an ingestion-time partitioned table named mytable in mydataset.

    bq load \
    --source_format=ORC \
    --time_partitioning_type=DAY \
    mydataset.mytable \
    gs://mybucket/mydata.orc

The following command loads data from gs://mybucket/mydata.orc into a partitioned table named mytable in mydataset. The table is partitioned on the mytimestamp column.

    bq load \
    --source_format=ORC \
    --time_partitioning_field mytimestamp \
    mydataset.mytable \
    gs://mybucket/mydata.orc

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard.

    bq load \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata*.orc

The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards.

    bq load --autodetect \
    --source_format=ORC \
    mydataset.mytable \
    "gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully qualified, in the format gs://bucket/object. Each URI can contain one '*' wildcard character.

  4. Specify the ORC data format by setting the sourceFormat property to ORC.

  5. To check the job status, call jobs.get(job_id), where job_id is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is loaded.
    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.
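
As an illustrative sketch of these steps (not the only way to call the API), you can post a job resource to jobs.insert with a tool such as curl; PROJECT_ID, the dataset, table, and bucket names are placeholders, and the access token comes from the gcloud command-line tool:

    curl --request POST \
      --header "Authorization: Bearer $(gcloud auth print-access-token)" \
      --header "Content-Type: application/json" \
      "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs" \
      --data '{
        "jobReference": {"jobId": "my_unique_job_id", "location": "US"},
        "configuration": {
          "load": {
            "sourceUris": ["gs://mybucket/mydata.orc"],
            "sourceFormat": "ORC",
            "destinationTable": {
              "projectId": "PROJECT_ID",
              "datasetId": "mydataset",
              "tableId": "mytable"
            }
          }
        }
      }'

Supplying your own jobReference.jobId, as in this sketch, also supports the retry pattern described in the API notes below.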

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.


using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrc
{
    public void LoadTableGcsOrc(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null,
            options: jobOptions
        );
        loadJob.PollUntilCompleted();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
gcsRef.SourceFormat = bigquery.ORC
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadTableGCSORC() {
  // Imports a GCS file into a table with ORC source format.

  /**
   * TODO(developer): Uncomment the following line before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table'

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC');
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.ORC
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

load_job = client.load_table_from_uri(
    uri, dataset_ref.table("us_states"), job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(dataset_ref.table("us_states"))
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

require "google/cloud/bigquery"

def load_table_gcs_orc dataset_id = "your_dataset_id"
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
  table_id = "us_states"

  load_job = dataset.load_job table_id, gcs_uri, format: "orc"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done!  # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table(table_id)
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

Appending to or overwriting a table with ORC data

You can load additional data into a table either from source files or by appending query results.

In the console and the classic BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result.

You have the following options when you load additional data into a table:

  • Write if empty: In the console and classic web UI, select Write if empty. There is no CLI flag for this option. BigQuery API property: WRITE_EMPTY. Writes the data only if the table is empty.
  • Append to table: In the console and classic web UI, select Append to table. CLI flag: --noreplace or --replace=false; if --[no]replace is unspecified, the default is to append. BigQuery API property: WRITE_APPEND. (Default) Appends the data to the end of the table.
  • Overwrite table: In the console and classic web UI, select Overwrite table. CLI flag: --replace or --replace=true. BigQuery API property: WRITE_TRUNCATE. Erases all existing data in a table before writing the new data.

If you load data into an existing table, the load job can append the data or overwrite the table.

You can append or overwrite a table by:

  • Using the GCP Console or the classic web UI
  • Using the CLI's bq load command
  • Calling the jobs.insert API method and configuring a load job
  • Using the client libraries

To append or overwrite a table with ORC data:

Console

  1. Open the BigQuery web UI in the GCP Console.

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. On the right side of the window, in the details panel, click Create table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.


  4. On the Create table page, in the Source section:

    • For Create table from, select Cloud Storage.

    • In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.


    • For File format, select ORC.

  5. On the Create table page, in the Destination section:

    • For Dataset name, choose the appropriate dataset.


    • In the Table name field, enter the name of the table you're appending or overwriting in BigQuery.

    • Verify that Table type is set to Native table.

  6. In the Schema section, no action is necessary. The schema is self-described in ORC files.

  7. For Partition and cluster settings, leave the default values. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the GCP Console does not support appending to or overwriting partitioned or clustered tables in a load job.

  8. Click Advanced options.

    • For Write preference, choose Append to table or Overwrite table.
    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, an invalid message is returned and the job fails.
    • For Unknown values, leave Ignore unknown values unchecked. This option applies only to CSV and JSON files.
    • For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.


  9. Click Create table.

Classic UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
    • For File format, select ORC.
  4. On the Create Table page, in the Destination Table section:

    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. Schema information is self-described in ORC files.

  6. In the Options section:

    • For Number of errors allowed, accept the default value of 0 or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, an invalid message is returned and the job fails.
    • For Write preference, choose Append to table or Overwrite table.
    • Leave the default values for Partitioning Type, Partitioning Field, Require partition filter, and Clustering Fields. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the web UI does not support appending to or overwriting partitioned or clustered tables in a load job.
    • For Destination encryption, choose Customer-managed encryption to use a Cloud Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
  7. Click Create Table.

CLI

Enter the bq load command with the --replace flag to overwrite the table. Use the --noreplace flag to append data to the table. If no flag is specified, the default is to append data. Supply the --source_format flag and set it to ORC. Because ORC schemas are automatically retrieved from the self-describing source data, you do not need to provide a schema definition.

(Optional) Supply the --location flag and set the value to your location.

Other optional flags include:

  • --max_bad_records: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is 0. At most, five errors of any type are returned regardless of the --max_bad_records value.
  • --destination_kms_key: The Cloud KMS key for encryption of the table data.

To append to or overwrite a table with ORC data, enter the following command:

bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source

Where:

  • location is your location. The --location flag is optional. You can set a default value for the location by using the .bigqueryrc file.
  • format is ORC.
  • dataset is an existing dataset.
  • table is the name of the table into which you're loading data.
  • path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

The following command loads data from gs://mybucket/mydata.orc and overwrites a table named mytable in mydataset.

    bq load \
    --replace \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc

The following command loads data from gs://mybucket/mydata.orc and appends data to a table named mytable in mydataset.

    bq load \
    --noreplace \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc

For information on appending and overwriting partitioned tables using the CLI, see: Appending to and overwriting partitioned table data.

API

  1. Create a load job that points to the source data in Cloud Storage.

  2. (Optional) Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully qualified, in the format gs://bucket/object. You can include multiple URIs as a comma-separated list. Wildcards are also supported.

  4. Specify the data format by setting the configuration.load.sourceFormat property to ORC.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE or WRITE_APPEND.
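
For example, relative to the load configuration sketched in the earlier API section, the only change is the writeDisposition property; the values shown are placeholders:

    "configuration": {
      "load": {
        "sourceUris": ["gs://mybucket/mydata.orc"],
        "sourceFormat": "ORC",
        "writeDisposition": "WRITE_TRUNCATE",
        "destinationTable": {
          "projectId": "PROJECT_ID",
          "datasetId": "mydataset",
          "tableId": "mytable"
        }
      }
    }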

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.


using Google.Apis.Bigquery.v2.Data;
using Google.Cloud.BigQuery.V2;
using System;

public class BigQueryLoadTableGcsOrcTruncate
{
    public void LoadTableGcsOrcTruncate(
        string projectId = "your-project-id",
        string datasetId = "your_dataset_id",
        string tableId = "your_table_id"
    )
    {
        BigQueryClient client = BigQueryClient.Create(projectId);
        var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
        var dataset = client.GetDataset(datasetId);
        TableReference destinationTableRef = dataset.GetTableReference(
            tableId: "us_states");
        // Create job configuration
        var jobOptions = new CreateLoadJobOptions()
        {
            SourceFormat = FileFormat.Orc,
            WriteDisposition = WriteDisposition.WriteTruncate
        };
        // Create and run job
        var loadJob = client.CreateLoadJob(
            sourceUri: gcsURI,
            destination: destinationTableRef,
            // Pass null as the schema because the schema is inferred when
            // loading Orc data
            schema: null, options: jobOptions);
        loadJob.PollUntilCompleted();  // Waits for the job to complete.
        // Display the number of rows uploaded
        BigQueryTable table = client.GetTable(destinationTableRef);
        Console.WriteLine(
            $"Loaded {table.Resource.NumRows} rows to {table.FullyQualifiedId}");
    }
}

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
gcsRef.SourceFormat = bigquery.ORC
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
// Default for import jobs is to append data to a table.  WriteTruncate
// specifies that existing data should instead be replaced/overwritten.
loader.WriteDisposition = bigquery.WriteTruncate

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate the clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

async function loadORCFromGCSTruncate() {
  /**
   * Imports a GCS file into a table and overwrites
   * table data if table already exists.
   */

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = "my_dataset";
  // const tableId = "my_table";

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'ORC',
    // Set the write disposition to overwrite existing table data.
    writeDisposition: 'WRITE_TRUNCATE',
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);
  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the WriteDisposition constant WRITE_TRUNCATE.

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.source_format = bigquery.SourceFormat.ORC
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print("Job finished.")

destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

require "google/cloud/bigquery"

def load_table_gcs_orc_truncate(
    dataset_id = "your_dataset_id",
    table_id   = "your_table_id"
  )
  bigquery = Google::Cloud::Bigquery.new
  dataset  = bigquery.dataset dataset_id
  gcs_uri  = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

  load_job = dataset.load_job table_id,
                              gcs_uri,
                              format: "orc",
                              write:  "truncate"
  puts "Starting job #{load_job.job_id}"

  load_job.wait_until_done!  # Waits for table load to complete.
  puts "Job finished."

  table = dataset.table(table_id)
  puts "Loaded #{table.rows_count} rows to table #{table.id}"
end

ORC conversions

BigQuery converts ORC data types to the following BigQuery data types:

Primitive types

  • boolean → BOOLEAN
  • byte → INTEGER
  • short → INTEGER
  • int → INTEGER
  • long → INTEGER
  • float → FLOAT
  • double → FLOAT
  • string → STRING (UTF-8 only)
  • varchar → STRING (UTF-8 only)
  • char → STRING (UTF-8 only)
  • binary → BYTES
  • date → DATE
  • timestamp → TIMESTAMP. ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read.
  • decimal → NUMERIC or STRING. NUMERIC values are exact numeric values with 38 digits of precision and 9 decimal digits of scale; see NUMERIC type for details. If a decimal type in the ORC schema has a scale of at most 9 and a precision minus scale of at most 29, it is converted to NUMERIC; otherwise, it is converted to STRING. When a decimal type is converted to STRING, a warning message is returned.

Complex types

  • struct → RECORD
    • All fields are NULLABLE.
    • The order of fields is ignored.
    • The name of a field must be a valid column name.
  • map<K,V> → RECORD. An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE. (An illustrative schema sketch follows this table.)
  • list → repeated fields. Nested lists and lists of maps are not supported.
  • union → RECORD
    • When a union has only one variant, it is converted to a NULLABLE field.
    • Otherwise, a union is converted to a RECORD with a list of NULLABLE fields. The NULLABLE fields have suffixes such as field_0, field_1, and so on. Only one of these fields is assigned a value when the data is read.
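
As an illustration of the map conversion described above, a hypothetical ORC column my_map of type map<string,int> would appear in the converted table schema roughly as the following repeated RECORD (shown in the JSON layout that bq show --schema prints; the column name is a placeholder):

    {
      "name": "my_map",
      "type": "RECORD",
      "mode": "REPEATED",
      "fields": [
        {"name": "key", "type": "STRING", "mode": "NULLABLE"},
        {"name": "value", "type": "INTEGER", "mode": "NULLABLE"}
      ]
    }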

Column names

A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. A column name cannot use any of the following prefixes:

  • _TABLE_
  • _FILE_
  • _PARTITION

Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.

NULL values

Note that currently, BigQuery ignores NULL elements for the list compound type.

For more information on ORC data types, see the Apache ORC™ Specification v1.
