Loading ORC Data from Google Cloud Storage

This page provides an overview of loading ORC data from Cloud Storage into BigQuery.

ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.

When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
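For example, you can verify that the two locations are compatible before you start a load job. The following sketch is not part of the loading steps on this page; it assumes the google-cloud-bigquery and google-cloud-storage Python clients, and the dataset and bucket names are placeholders.

# Sketch only: compare the dataset location with the bucket location.
# "my_dataset" and "mybucket" are placeholders.
from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

dataset = bq_client.get_dataset(bq_client.dataset('my_dataset'))
bucket = gcs_client.get_bucket('mybucket')

# Location strings may differ in case (for example, "US" vs. "us"), so this
# simple check compares them case-insensitively; see the location
# requirements above for the full rules.
print(dataset.location, bucket.location)
if dataset.location.lower() != bucket.location.lower():
    print('Warning: the dataset and bucket are in different locations.')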

For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.

ORC schemas

When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.

For example, you have the following ORC files in Cloud Storage:

gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc

The following command loads all of the files (as a comma-separated list of URIs), and the schema is derived from mybucket/01/b.orc:

bq --location=US load --source_format=ORC [DATASET].[TABLE] "gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"

When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. All fields in the detected schema are NULLABLE. For more information, see ORC conversions.

When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
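To confirm the conversions, you can inspect the schema that BigQuery detected after the load completes. The following sketch is not part of the loading steps; it assumes the google-cloud-bigquery Python client shown later on this page, and the dataset and table names are placeholders.

# Sketch only: print the schema BigQuery detected from the ORC files.
# Every field should report the NULLABLE mode, and the types follow the
# ORC conversions section below.
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('my_dataset').table('my_table')  # placeholders
table = client.get_table(table_ref)

for field in table.schema:
    print(field.name, field.field_type, field.mode)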

ORC compression

Compressed ORC files are not supported, but compressed file footers and stripes are. Supported compression types are Zlib, Snappy, LZO, and LZ4.
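For example, the following sketch writes an ORC file whose stripes and footer are Zlib-compressed. It is not part of this page's loading steps; it assumes the third-party pyorc Python package, and any ORC writer (Hive, Spark, and so on) that uses one of the supported codecs works equally well.

# Sketch only: write a small ORC file with Zlib-compressed stripes and footer
# using the third-party pyorc package (pip install pyorc).
import pyorc

with open('us-states-sample.orc', 'wb') as f:
    with pyorc.Writer(
        f,
        'struct<name:string,post_abbr:string>',
        compression=pyorc.CompressionKind.ZLIB,  # SNAPPY, LZO, and LZ4 also work
    ) as writer:
        writer.write(('Alabama', 'AL'))
        writer.write(('Alaska', 'AK'))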

Loading ORC data into a new table

To load ORC data from Google Cloud Storage into a new BigQuery table, or append data to an existing table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover on a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
    • For File format, select ORC.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. The schema is self-described in ORC files.
  6. Click Create Table.

Command-line

Use the bq load command, specify ORC as the source_format, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs or a URI containing a wildcard.

Supply the --location flag and set the value to your location.

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. For example, if you are using BigQuery in the Tokyo region, set the flag's value to asia-northeast1. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is ORC.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.orc into a table named mytable in mydataset. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --source_format=ORC mydataset.mytable gs://mybucket/mydata.orc
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The Cloud Storage URI uses a wildcard. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --source_format=ORC mydataset.mytable gs://mybucket/mydata*.orc
    
  • The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The command includes a comma-separated list of Cloud Storage URIs with wildcards. mybucket and mydataset were created in the asia-northeast1 region.

    bq --location=asia-northeast1 load --source_format=ORC mydataset.mytable "gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
    

API

Set the following properties to load ORC data using the API.

  1. Create a load job that points to the source data in Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. Each URI can contain one '*' wildcard character.

  4. Specify the ORC data format by setting the configuration.load.sourceFormat property to ORC.

  5. To check the job status, call jobs.get([JOB_ID]), where [JOB_ID] is the ID of the job returned by the initial request.

    • If status.state = DONE, the job completed successfully.
    • If the status.errorResult property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is added.

    • If status.errorResult is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's status.errors property.

API notes:

  • Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.

  • As a best practice, generate a unique ID and pass it as jobReference.jobId when calling jobs.insert() to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

  • Calling jobs.insert() on a given job ID is idempotent; in other words, you can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
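The following sketch ties steps 1 through 5 and the job ID best practice together. It is not one of the official client library samples below; it assumes the google-api-python-client package (googleapiclient) with application default credentials, and the project, dataset, table, and bucket names are placeholders.

# Sketch only: insert a load job by calling the REST API directly through
# the google-api-python-client discovery client.
import time
import uuid

from googleapiclient import discovery

project_id = 'your-project-id'                   # placeholder
job_id = 'load_orc_{}'.format(uuid.uuid4())      # unique, retry-safe job ID

service = discovery.build('bigquery', 'v2')
body = {
    'jobReference': {
        'projectId': project_id,
        'jobId': job_id,
        'location': 'US',                        # your location
    },
    'configuration': {
        'load': {
            'sourceUris': ['gs://mybucket/mydata*.orc'],  # one '*' wildcard allowed
            'sourceFormat': 'ORC',
            'destinationTable': {
                'projectId': project_id,
                'datasetId': 'mydataset',
                'tableId': 'mytable',
            },
        }
    },
}
service.jobs().insert(projectId=project_id, body=body).execute()

# Poll jobs.get until status.state is DONE, then check status.errorResult.
while True:
    job = service.jobs().get(
        projectId=project_id, jobId=job_id, location='US').execute()
    if job['status']['state'] == 'DONE':
        break
    time.sleep(1)

if 'errorResult' in job['status']:
    raise RuntimeError(job['status']['errorResult'])
print('Job completed. Non-fatal errors: {}'.format(job['status'].get('errors')))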

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.

// To run this sample, create a BigQuery client and supply the dataset and
// table IDs, for example:
// var client = BigQueryClient.Create(projectId);
// var datasetId = "my_dataset";
// var tableId = "my_table";
var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
var dataset = client.GetDataset(datasetId);
var jobOptions = new CreateLoadJobOptions()
{
    SourceFormat = FileFormat.Orc
};
// Pass null as the schema because the schema is inferred when
// loading Orc data
var loadJob = client.CreateLoadJob(gcsURI, dataset.GetTableReference(tableId),
    null, jobOptions);
loadJob.PollUntilCompleted();

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
gcsRef.SourceFormat = bigquery.ORC
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Imports the Google Cloud client libraries
const BigQuery = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = "your-project-id";
// const datasetId = "my_dataset";
// const tableId = "my_table";

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

// Instantiates clients
const bigquery = new BigQuery({
  projectId: projectId,
});

const storage = new Storage({
  projectId: projectId,
});

// Configure the load job. For full list of options, see:
// https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
const metadata = {
  sourceFormat: 'ORC',
};

// Loads data from a Google Cloud Storage file into the table
bigquery
  .dataset(datasetId)
  .table(tableId)
  .load(storage.bucket(bucketName).file(filename), metadata)
  .then(results => {
    const job = results[0];

    // load() waits for the job to finish
    console.log(`Job ${job.id} completed.`);

    // Check the job's status for errors
    const errors = job.status.errors;
    if (errors && errors.length > 0) {
      throw errors;
    }
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId  = 'The Google project ID';
// $datasetId  = 'The BigQuery dataset ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$dataset = $bigQuery->dataset($datasetId);
$table = $dataset->table('us_states');

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC');
$job = $table->runJob($loadConfig);
// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});
// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.ORC
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc'

load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

# project_id = "Your Google Cloud project ID"
# dataset_id = "ID of the dataset to create table in"

require "google/cloud/bigquery"

bigquery     = Google::Cloud::Bigquery.new project: project_id
dataset      = bigquery.dataset dataset_id
table_id     = "us_states"
storage_path = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

puts "Importing data from Cloud Storage file: #{storage_path}"
load_job = dataset.load_job table_id,
                            storage_path,
                            format: "orc"

puts "Waiting for load job to complete: #{load_job.job_id}"
load_job.wait_until_done!

puts "Data imported"

Overwriting a table with ORC data

You can load additional data into a table either from source files or by appending query results.

In the BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result. The CLI and API include the following options:

  • Write if empty: no CLI flag; API property WRITE_EMPTY. Writes the data only if the table is empty.
  • Append to table: CLI flag --noreplace or --replace=false (if --[no]replace is unspecified, the default is to append); API property WRITE_APPEND. (Default) Appends the data to the end of the table.
  • Overwrite table: CLI flag --replace or --replace=true; API property WRITE_TRUNCATE. Erases all existing data in a table before writing the new data.

By default, load jobs append data to a table unless you alter the write disposition. To replace the existing data with data from a load job, you can overwrite the data in a BigQuery table:

Web UI

  1. Go to the BigQuery web UI.

  2. In the navigation panel, hover on a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.

  3. On the Create Table page, in the Source Data section:

    • For Location, select Google Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
    • For File format, select ORC.
  4. On the Create Table page, in the Destination Table section:
    • For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
    • Verify that Table type is set to Native table.
  5. In the Schema section, no action is necessary. Schema information is self-described in ORC files.
  6. In the Options section, for Write preference, choose Write if empty, Append to table, or Overwrite table.

  7. Click Create Table.

Command-line

Use the bq load command with the --replace flag to overwrite the table, or with the --noreplace flag to append data to the table; if neither flag is specified, the default is to append data. Specify ORC as the source_format, and supply the --location flag set to your location.

bq --location=[LOCATION] load --[no]replace --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE]

where:

  • [LOCATION] is your location. The --location flag is optional if your data is in the US or the EU multi-region location. You can set a default value for the location using the .bigqueryrc file.
  • [FORMAT] is ORC.
  • [DATASET] is an existing dataset.
  • [TABLE] is the name of the table into which you're loading data.
  • [PATH_TO_SOURCE] is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.

Examples:

  • The following command loads data from gs://mybucket/mydata.orc and overwrites a table named mytable in mydataset. mybucket and mydataset were created in the US multi-region location.

    bq --location=US load --replace --source_format=ORC mydataset.mytable gs://mybucket/mydata.orc
    
  • The following command loads data from gs://mybucket/mydata.orc and appends data to a table named mytable in mydataset. mybucket and mydataset were created in the asia-northeast1 region.

    bq --location=asia-northeast1 load --noreplace --source_format=ORC mydataset.mytable gs://mybucket/mydata.orc
    

API

Set the following properties to load ORC data using the API.

  1. Create a load job that points to the source data in Cloud Storage.

  2. Specify your location in the location property in the jobReference section of the job resource.

  3. The source URIs must be fully-qualified, in the format gs://[BUCKET]/[OBJECT]. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported when loading ORC data from Cloud Storage.

  4. Specify the data format by setting the configuration.load.sourceFormat property to ORC.

  5. Specify the write preference by setting the configuration.load.writeDisposition property to WRITE_TRUNCATE, WRITE_APPEND, or WRITE_EMPTY. For load jobs, the default mode is WRITE_APPEND.
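For example, a sketch of the relevant part of the job resource for an overwrite (the surrounding jobs.insert call is the same as in the earlier API section; the names are placeholders):

# Sketch only: the configuration.load section of the job resource, with the
# write disposition set to overwrite the destination table.
load_config = {
    'sourceUris': ['gs://mybucket/mydata.orc'],
    'sourceFormat': 'ORC',
    # WRITE_APPEND is the default; WRITE_EMPTY writes only if the table is empty.
    'writeDisposition': 'WRITE_TRUNCATE',
    'destinationTable': {
        'projectId': 'your-project-id',
        'datasetId': 'mydataset',
        'tableId': 'mytable',
    },
}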

C#

Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.

// To run this sample, create a BigQuery client and supply the dataset and
// table IDs, for example:
// var client = BigQueryClient.Create(projectId);
// var datasetId = "my_dataset";
// var tableId = "my_table";
var gcsURI = "gs://cloud-samples-data/bigquery/us-states/us-states.orc";
var dataset = client.GetDataset(datasetId);
var jobOptions = new CreateLoadJobOptions()
{
    SourceFormat = FileFormat.Orc,
    WriteDisposition = WriteDisposition.WriteTruncate
};
// Pass null as the schema because the schema is inferred when
// loading Orc data
var loadJob = client.CreateLoadJob(gcsURI, dataset.GetTableReference(tableId),
    null, jobOptions);
loadJob.PollUntilCompleted();

Go

Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.

// To run this sample, you will need to create (or reuse) a context and
// an instance of the bigquery client.  For example:
// import "cloud.google.com/go/bigquery"
// ctx := context.Background()
// client, err := bigquery.NewClient(ctx, "your-project-id")
gcsRef := bigquery.NewGCSReference("gs://cloud-samples-data/bigquery/us-states/us-states.orc")
gcsRef.SourceFormat = bigquery.ORC
loader := client.Dataset(datasetID).Table(tableID).LoaderFrom(gcsRef)
// Default for import jobs is to append data to a table.  WriteTruncate
// specifies that existing data should instead be replaced/overwritten.
loader.WriteDisposition = bigquery.WriteTruncate

job, err := loader.Run(ctx)
if err != nil {
	return err
}
status, err := job.Wait(ctx)
if err != nil {
	return err
}

if status.Err() != nil {
	return fmt.Errorf("Job completed with error: %v", status.Err())
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.

// Imports the Google Cloud client libraries
const BigQuery = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = "your-project-id";
// const datasetId = "my_dataset";
// const tableId = "my_table";

/**
 * This sample loads the ORC file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.orc
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.orc';

// Instantiates clients
const bigquery = new BigQuery({
  projectId: projectId,
});

const storage = new Storage({
  projectId: projectId,
});

// Configure the load job. For full list of options, see:
// https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
const metadata = {
  sourceFormat: 'ORC',
  // Set the write disposition to overwrite existing data in the table.
  writeDisposition: 'WRITE_TRUNCATE',
};

// Loads data from a Google Cloud Storage file into the table
bigquery
  .dataset(datasetId)
  .table(tableId)
  .load(storage.bucket(bucketName).file(filename), metadata)
  .then(results => {
    const job = results[0];

    // load() waits for the job to finish
    console.log(`Job ${job.id} completed.`);

    // Check the job's status for errors
    const errors = job.status.errors;
    if (errors && errors.length > 0) {
      throw errors;
    }
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

PHP

Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.

use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

/** Uncomment and populate these variables in your code */
// $projectId = 'The Google project ID';
// $datasetId = 'The BigQuery dataset ID';
// $tableId = 'The BigQuery table ID';

// instantiate the bigquery table service
$bigQuery = new BigQueryClient([
    'projectId' => $projectId,
]);
$table = $bigQuery->dataset($datasetId)->table($tableId);

// create the import job
$gcsUri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc';
$loadConfig = $table->loadFromStorage($gcsUri)->sourceFormat('ORC')->writeDisposition('WRITE_TRUNCATE');
$job = $table->runJob($loadConfig);

// poll the job until it is complete
$backoff = new ExponentialBackoff(10);
$backoff->execute(function () use ($job) {
    print('Waiting for job to complete' . PHP_EOL);
    $job->reload();
    if (!$job->isComplete()) {
        throw new Exception('Job has not yet completed', 500);
    }
});

// check if the job has errors
if (isset($job->info()['status']['errorResult'])) {
    $error = $job->info()['status']['errorResult']['message'];
    printf('Error running job: %s' . PHP_EOL, $error);
} else {
    print('Data imported successfully' . PHP_EOL);
}

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

To replace the rows in an existing table, set the LoadJobConfig.write_disposition property to the WriteDisposition constant WRITE_TRUNCATE.

# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')

previous_rows = client.get_table(table_ref).num_rows
assert previous_rows > 0

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.source_format = bigquery.SourceFormat.ORC
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.orc'
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(table_ref)
print('Loaded {} rows.'.format(destination_table.num_rows))

Ruby

Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.

# project_id = "Your Google Cloud project ID"
# dataset_id = "ID of the dataset containing the table"
# table_id   = "ID of the table to overwrite"

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project: project_id
dataset  = bigquery.dataset dataset_id
table    = dataset.table table_id

storage_path = "gs://cloud-samples-data/bigquery/us-states/us-states.orc"

puts "Importing data from Cloud Storage file: #{storage_path}"
load_job = table.load_job storage_path,
                          format: "orc",
                          write: "WRITE_TRUNCATE"

puts "Waiting for load job to complete: #{load_job.job_id}"
load_job.wait_until_done!

puts "Data imported"

ORC conversions

BigQuery converts ORC data types to the following BigQuery data types:

Primitive types

ORC data type   BigQuery data type   Notes
boolean         BOOLEAN
byte            INTEGER
short           INTEGER
int             INTEGER
long            INTEGER
float           FLOAT
double          FLOAT
string          STRING               UTF-8 only
varchar         STRING               UTF-8 only
char            STRING               UTF-8 only
binary          BYTES
date            DATE
timestamp       TIMESTAMP            ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read.
decimal         NUMERIC or STRING    NUMERIC types are exact numeric values with 38 digits of precision and 9 decimal digits of scale. See NUMERIC type for details. If a decimal type in an ORC schema has a scale of at most 9 and a precision minus scale of at most 29, it is converted to NUMERIC. Otherwise, it is converted to STRING. If a decimal type is converted to STRING, a warning message is returned.
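For example, decimal(38,9) is converted to NUMERIC, while decimal(20,10) is converted to STRING because its scale exceeds 9. The following sketch expresses the rule as stated above; it is an illustration, not an official API.

# Sketch only: apply the documented rule to an ORC decimal(precision, scale).
def converted_decimal_type(precision, scale):
    if scale <= 9 and (precision - scale) <= 29:
        return 'NUMERIC'
    return 'STRING'

print(converted_decimal_type(38, 9))   # NUMERIC
print(converted_decimal_type(20, 10))  # STRING (scale is greater than 9)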

Complex types

ORC data type   BigQuery data type   Notes
struct          RECORD               All fields are NULLABLE. The order of fields is ignored. The name of a field must be a valid column name.
map<K,V>        RECORD               An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE.
list            repeated fields      Nested lists and lists of maps are not supported.
union           RECORD               When a union has only one variant, it is converted to a NULLABLE field. Otherwise, a union is converted to a RECORD with a list of NULLABLE fields. The NULLABLE fields have suffixes such as field_0, field_1, and so on. Only one of these fields is assigned a value when the data is read.
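For example, an ORC column declared as scores map<string,int> surfaces as a repeated RECORD with two NULLABLE subfields. The sketch below expresses that shape with the Python client's SchemaField type; it is an illustration rather than output generated by BigQuery, and the subfield names shown are illustrative.

# Sketch only: the converted shape of an ORC map<string,int> column,
# expressed as a google-cloud-bigquery SchemaField.
from google.cloud import bigquery

scores_field = bigquery.SchemaField(
    'scores', 'RECORD', mode='REPEATED',
    fields=[
        bigquery.SchemaField('key', 'STRING', mode='NULLABLE'),
        bigquery.SchemaField('value', 'INTEGER', mode='NULLABLE'),
    ],
)
print(scores_field)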

Column names

A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. A column name cannot use any of the following prefixes:

  • _TABLE_
  • _FILE_
  • _PARTITION

Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.
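A sketch of these rules as a simple check (not an official validator); the prefix list and length limit come from the rules above, and the duplicate check is case-insensitive:

# Sketch only: check a candidate column name against the rules above.
import re

_NAME_RE = re.compile(r'^[A-Za-z_][A-Za-z0-9_]{0,127}$')   # at most 128 chars
_RESERVED_PREFIXES = ('_TABLE_', '_FILE_', '_PARTITION')

def is_valid_column_name(name, existing_names=()):
    if not _NAME_RE.match(name):
        return False
    if name.startswith(_RESERVED_PREFIXES):
        return False
    # Duplicates are rejected even when only the case differs.
    return name.lower() not in {n.lower() for n in existing_names}

print(is_valid_column_name('Column1', existing_names=['column1']))  # False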

NULL values

Note that currently, BigQuery ignores NULL values for all the ORC data types.

For more information on ORC data types, see the Apache ORC™ Specification v1.
