This page provides an overview of loading ORC data from Cloud Storage into BigQuery.
ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.
ORC schemas
When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, you have the following ORC files in Cloud Storage:
gs://mybucket/00/
  a.orc
  z.orc
gs://mybucket/01/
  b.orc
The following command loads all of the files in a single CLI command (as a comma-separated list), and the schema is derived from `mybucket/01/b.orc`:

bq load \
--source_format=ORC \
dataset.table \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
When BigQuery detects the schema, some ORC data types are converted to BigQuery data types to make them compatible with BigQuery SQL syntax. All fields in the detected schema are `NULLABLE`. For more information, see ORC conversions.
When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
ORC compression
Compressed ORC files are not supported, but compressed file footers and stripes are. Supported compression types are Zlib, Snappy, LZO, and LZ4.
Required permissions
When you load data into BigQuery, you need permissions to run a load job and permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need permissions to access the bucket that contains your data.
BigQuery permissions
At a minimum, the following permissions are required to load data into BigQuery. These permissions are required if you are loading data into a new table or partition, or if you are appending or overwriting a table or partition.
- `bigquery.tables.create`
- `bigquery.tables.updateData`
- `bigquery.jobs.create`
The following predefined Cloud IAM roles include both `bigquery.tables.create` and `bigquery.tables.updateData` permissions:
- `bigquery.dataEditor`
- `bigquery.dataOwner`
- `bigquery.admin`
The following predefined Cloud IAM roles include `bigquery.jobs.create` permissions:
- `bigquery.user`
- `bigquery.jobUser`
- `bigquery.admin`
In addition, if a user has `bigquery.datasets.create` permissions, when that user creates a dataset, they are granted `bigquery.dataOwner` access to it. `bigquery.dataOwner` access gives the user the ability to create and update tables in the dataset via a load job.
For more information on Cloud IAM roles and permissions in BigQuery, see Access control.
Cloud Storage permissions
In order to load data from a Cloud Storage bucket, you must be granted `storage.objects.get` permissions. If you are using a URI wildcard, you must also have `storage.objects.list` permissions.

The predefined Cloud IAM role `storage.objectViewer` can be granted to provide both `storage.objects.get` and `storage.objects.list` permissions.
Loading ORC data into a new table
You can load ORC data into a new table by:
- Using the Cloud Console or the classic web UI
- Using the CLI's `bq load` command
- Calling the `jobs.insert` API method and configuring a `load` job
- Using the client libraries
To load ORC data from Cloud Storage into a new BigQuery table:
Console
Open the BigQuery web UI in the Cloud Console.
Go to the Cloud Console
In the navigation panel, in the Resources section, expand your project and select a dataset.
On the right side of the window, in the details panel, click Create table. The process for loading data is the same as the process for creating an empty table.
On the Create table page, in the Source section:
For Create table from, select Cloud Storage.
In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the Cloud Console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
For File format, select ORC.
On the Create table page, in the Destination section:
For Dataset name, choose the appropriate dataset.
Verify that Table type is set to Native table.
In the Table name field, enter the name of the table you're creating in BigQuery.
In the Schema section, no action is necessary. The schema is self-described in ORC files.
(Optional) To partition the table, choose your options in the Partition and cluster settings:
- To create a partitioned table, click No partitioning, select Partition by field, and choose a `DATE` or `TIMESTAMP` column. This option is unavailable if your schema does not include a `DATE` or `TIMESTAMP` column.
- To create an ingestion-time partitioned table, click No partitioning and select Partition by ingestion time.
(Optional) For Partitioning filter, click the Require partition filter box to require users to include a `WHERE` clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if No partitioning is selected.
(Optional) To cluster the table, in the Clustering order box, enter between one and four field names. Currently, clustering is supported only for partitioned tables.
(Optional) Click Advanced options.
- For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
- For Number of errors allowed, accept the default value of `0` or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an `invalid` message and fail.
- For Unknown values, leave Ignore unknown values unchecked. This option applies only to CSV and JSON files.
- For Encryption, click Customer-managed key to use a Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
Click Create table.
Classic UI
Go to the BigQuery web UI.
In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for loading data is the same as the process for creating an empty table.
On the Create Table page, in the Source Data section:
- Click Create from source.
- For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're creating.
- For File format, select ORC.
In the Destination Table section:
- For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're creating in BigQuery.
- Verify that Table type is set to Native table.
In the Schema section, no action is necessary. The schema is self-described in ORC files.
(Optional) In the Options section:
- For Number of errors allowed, accept the default value of `0` or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an `invalid` message and fail.
- For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
- To partition the table:
  - For Partitioning Type, click None and choose Day.
  - For Partitioning Field:
    - To create a partitioned table, choose a `DATE` or `TIMESTAMP` column. This option is unavailable if your schema does not include a `DATE` or `TIMESTAMP` column.
    - To create an ingestion-time partitioned table, leave the default value: `_PARTITIONTIME`.
  - Click the Require partition filter box to require users to include a `WHERE` clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables. This option is unavailable if Partitioning type is set to None.
- To cluster the table, in the Clustering fields box, enter between one and four field names.
- For Destination encryption, choose Customer-managed encryption to use a Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
Click Create Table.
CLI
Use the `bq load` command, specify ORC as the `source_format`, and include a Cloud Storage URI. You can include a single URI, a comma-separated list of URIs, or a URI containing a wildcard.

(Optional) Supply the `--location` flag and set the value to your location.
Other optional flags include:
- `--max_bad_records`: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is `0`. At most, five errors of any type are returned regardless of the `--max_bad_records` value.
- `--time_partitioning_type`: Enables time-based partitioning on a table and sets the partition type. Currently, the only possible value is `DAY`, which generates one partition per day. This flag is optional when you create a table partitioned on a `DATE` or `TIMESTAMP` column.
- `--time_partitioning_expiration`: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.
- `--time_partitioning_field`: The `DATE` or `TIMESTAMP` column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.
- `--require_partition_filter`: When enabled, this option requires users to include a `WHERE` clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables.
- `--clustering_fields`: A comma-separated list of up to four column names used to create a clustered table. This flag can only be used with partitioned tables.
- `--destination_kms_key`: The KMS key for encryption of the table data.

For more information on partitioned tables, see:
For more information on clustered tables, see:
For more information on table encryption, see:
To load ORC data into BigQuery, enter the following command:
bq --location=location load \
--source_format=format \
dataset.table \
path_to_source
Where:
- location is your location. The `--location` flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value to `asia-northeast1`. You can set a default value for the location using the .bigqueryrc file.
- format is `ORC`.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from `gs://mybucket/mydata.orc` into a table named `mytable` in `mydataset`.
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from `gs://mybucket/mydata.orc` into an ingestion-time partitioned table named `mytable` in `mydataset`.
bq load \
--source_format=ORC \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from `gs://mybucket/mydata.orc` into a partitioned table named `mytable` in `mydataset`. The table is partitioned on the `mytimestamp` column.
bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from multiple files in `gs://mybucket/` into a table named `mytable` in `mydataset`. The Cloud Storage URI uses a wildcard.
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata*.orc
The following command loads data from multiple files in `gs://mybucket/` into a table named `mytable` in `mydataset`. The command includes a comma-separated list of Cloud Storage URIs with wildcards.
bq load --autodetect \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
API
Create a `load` job that points to the source data in Cloud Storage.

(Optional) Specify your location in the `location` property in the `jobReference` section of the job resource.

The `source URIs` property must be fully-qualified, in the format `gs://bucket/object`. Each URI can contain one '*' wildcard character.

Specify the ORC data format by setting the `sourceFormat` property to `ORC`.

To check the job status, call `jobs.get(job_id)`, where job_id is the ID of the job returned by the initial request.
- If `status.state = DONE`, the job completed successfully.
- If the `status.errorResult` property is present, the request failed, and that object will include information describing what went wrong. When a request fails, no table is created and no data is loaded.
- If `status.errorResult` is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object's `status.errors` property.
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.
As a best practice, generate a unique ID and pass it as `jobReference.jobId` when calling `jobs.insert` to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.

Calling `jobs.insert` on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations will succeed.
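For illustration, the following is a minimal Python sketch of this pattern using the `google-cloud-bigquery` client library. The project, dataset, table, and bucket names are placeholders, and the polling loop stands in for calling `jobs.get` directly.

```python
import time
import uuid

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers; substitute your own project, dataset, table, and bucket.
table_id = "my-project.mydataset.mytable"
uri = "gs://mybucket/mydata.orc"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)

# A client-generated job ID makes retries safe: calling jobs.insert again
# with the same ID starts at most one job.
job_id = f"load_orc_{uuid.uuid4().hex}"

job = client.load_table_from_uri(uri, table_id, job_id=job_id, job_config=job_config)

# Poll jobs.get (job.reload) until the job state is DONE, then check errorResult.
while True:
    job.reload()
    if job.state == "DONE":
        break
    time.sleep(1)

if job.error_result:
    print(f"Job {job.job_id} failed: {job.error_result}")
else:
    print(f"Job {job.job_id} succeeded; non-fatal errors: {job.errors}")
```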
C#
Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
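The published sample is not reproduced here; as a rough sketch, an ORC load with the `google-cloud-bigquery` package looks like the following. The table ID and Cloud Storage URI are placeholders.

```python
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# Placeholder destination table in the form "project.dataset.table".
table_id = "your-project.your_dataset.your_table"

# No schema is supplied: ORC files are self-describing.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.ORC)
uri = "gs://mybucket/mydata.orc"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete.

destination_table = client.get_table(table_id)
print(f"Loaded {destination_table.num_rows} rows to {table_id}.")
```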
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.
Appending to or overwriting a table with ORC data
You can load additional data into a table either from source files or by appending query results.
In the console and the classic BigQuery web UI, you use the Write preference option to specify what action to take when you load data from a source file or from a query result.
You have the following options when you load additional data into a table:
| Console option | Classic web UI option | CLI flag | BigQuery API property | Description |
|---|---|---|---|---|
| Write if empty | Write if empty | None | `WRITE_EMPTY` | Writes the data only if the table is empty. |
| Append to table | Append to table | `--noreplace` or `--replace=false`; if `--[no]replace` is unspecified, the default is append | `WRITE_APPEND` | (Default) Appends the data to the end of the table. |
| Overwrite table | Overwrite table | `--replace` or `--replace=true` | `WRITE_TRUNCATE` | Erases all existing data in a table before writing the new data. |
If you load data into an existing table, the load job can append the data or overwrite the table.
You can append or overwrite a table by:
- Using the Cloud Console or the classic web UI
- Using the CLI's `bq load` command
- Calling the `jobs.insert` API method and configuring a `load` job
- Using the client libraries
To append or overwrite a table with ORC data:
Console
Open the BigQuery web UI in the Cloud Console.
Go to the Cloud Console
In the navigation panel, in the Resources section, expand your project and select a dataset.
On the right side of the window, in the details panel, click Create table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.
On the Create table page, in the Source section:
For Create table from, select Cloud Storage.
In the source field, browse to or enter the Cloud Storage URI. Note that you cannot include multiple URIs in the BigQuery web UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
For File format, select ORC.
On the Create table page, in the Destination section:
For Dataset name, choose the appropriate dataset.
In the Table name field, enter the name of the table you're appending or overwriting in BigQuery.
Verify that Table type is set to Native table.
In the Schema section, no action is necessary. The schema is self-described in ORC files.
For Partition and cluster settings, leave the default values. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the Cloud Console does not support appending to or overwriting partitioned or clustered tables in a load job.
Click Advanced options.
- For Write preference, choose Append to table or Overwrite table.
- For Number of errors allowed, accept the default value of `0` or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an `invalid` message and fail.
- For Unknown values, leave Ignore unknown values unchecked. This option applies only to CSV and JSON files.
For Encryption, click Customer-managed key to use a Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
Click Create table.
Classic UI
Go to the BigQuery web UI.
In the navigation panel, hover over a dataset, click the down arrow icon, and click Create new table. The process for appending and overwriting data in a load job is the same as the process for creating a table in a load job.
On the Create Table page, in the Source Data section:
- For Location, select Cloud Storage and in the source field, enter the Cloud Storage URI. Note that you cannot include multiple URIs in the UI, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you're appending or overwriting.
- For File format, select ORC.
On the Create Table page, in the Destination Table section:
- For Table name, choose the appropriate dataset, and in the table name field, enter the name of the table you're appending or overwriting.
- Verify that Table type is set to Native table.
In the Schema section, no action is necessary. Schema information is self-described in ORC files.
In the Options section:
- For Number of errors allowed, accept the default value of `0` or enter the maximum number of rows containing errors that can be ignored. If the number of rows with errors exceeds this value, the job will result in an `invalid` message and fail.
- For Write preference, choose Append to table or Overwrite table.
- Leave the default values for Partitioning Type, Partitioning Field, Require partition filter, and Clustering Fields. You cannot convert a table to a partitioned or clustered table by appending or overwriting it, and the web UI does not support appending to or overwriting partitioned or clustered tables in a load job.
- For Destination encryption, choose Customer-managed encryption to use a Key Management Service key to encrypt the table. If you leave the Default setting, BigQuery encrypts the data at rest using a Google-managed key.
Click Create Table.
CLI
Enter the `bq load` command with the `--replace` flag to overwrite the table. Use the `--noreplace` flag to append data to the table. If no flag is specified, the default is to append data. Supply the `--source_format` flag and set it to `ORC`. Because ORC schemas are automatically retrieved from the self-describing source data, you do not need to provide a schema definition.

(Optional) Supply the `--location` flag and set the value to your location.
Other optional flags include:
- `--max_bad_records`: An integer that specifies the maximum number of bad records allowed before the entire job fails. The default value is `0`. At most, five errors of any type are returned regardless of the `--max_bad_records` value.
- `--destination_kms_key`: The KMS key for encryption of the table data.
bq --location=location load \
--[no]replace \
--source_format=format \
dataset.table \
path_to_source
Where:
- location is your location. The `--location` flag is optional. You can set a default value for the location by using the .bigqueryrc file.
- format is `ORC`.
- dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from `gs://mybucket/mydata.orc` and overwrites a table named `mytable` in `mydataset`.
bq load \
--replace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from `gs://mybucket/mydata.orc` and appends data to a table named `mytable` in `mydataset`.
bq load \
--noreplace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
For information on appending and overwriting partitioned tables using the CLI, see: Appending to and overwriting partitioned table data.
API
Create a `load` job that points to the source data in Cloud Storage.

(Optional) Specify your location in the `location` property in the `jobReference` section of the job resource.

The `source URIs` property must be fully-qualified, in the format `gs://bucket/object`. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.

Specify the data format by setting the `configuration.load.sourceFormat` property to `ORC`.

Specify the write preference by setting the `configuration.load.writeDisposition` property to `WRITE_TRUNCATE` or `WRITE_APPEND`.
C#
Before trying this sample, follow the C# setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery C# API reference documentation.
Go
Before trying this sample, follow the Go setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Go API reference documentation.
Node.js
Before trying this sample, follow the Node.js setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Node.js API reference documentation.
PHP
Before trying this sample, follow the PHP setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery PHP API reference documentation.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
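The published sample is not reproduced here; as a rough sketch, appending to or overwriting a table with the `google-cloud-bigquery` package amounts to setting `write_disposition` on the load job configuration. The table ID and URI below are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table; it must already exist when appending or overwriting.
table_id = "your-project.your_dataset.your_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.ORC,
    # WRITE_TRUNCATE overwrites the table; use WRITE_APPEND to append instead.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
uri = "gs://mybucket/mydata.orc"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete.

destination_table = client.get_table(table_id)
print(f"Table {table_id} now has {destination_table.num_rows} rows.")
```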
Ruby
Before trying this sample, follow the Ruby setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Ruby API reference documentation.
Loading hive-partitioned ORC data
BigQuery supports loading hive-partitioned ORC data stored in Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading Externally Partitioned Data from Cloud Storage.
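As an illustration only, the following Python sketch shows one way to configure such a load, assuming your version of the `google-cloud-bigquery` library exposes `HivePartitioningOptions`; the mode, source URI prefix, and table ID are placeholder values.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder

# Assumed hive-partitioning configuration; check your client library version.
hive_config = bigquery.HivePartitioningOptions()
hive_config.mode = "AUTO"  # infer the partition key types from the directory layout
hive_config.source_uri_prefix = "gs://mybucket/orc/"  # prefix before the key=value segments

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.ORC,
    hive_partitioning=hive_config,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/orc/*.orc", table_id, job_config=job_config
)
load_job.result()  # Wait for the load job to complete.
```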
ORC conversions
BigQuery converts ORC data types to the following BigQuery data types:
Primitive types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| boolean | BOOLEAN | |
| byte | INTEGER | |
| short | INTEGER | |
| int | INTEGER | |
| long | INTEGER | |
| float | FLOAT | |
| double | FLOAT | |
| string | STRING | UTF-8 only |
| varchar | STRING | UTF-8 only |
| char | STRING | UTF-8 only |
| binary | BYTES | |
| date | DATE | |
| timestamp | TIMESTAMP | ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read. |
| decimal | NUMERIC or STRING | NUMERIC types are exact numeric values with 38 digits of precision and 9 decimal digits of scale. See NUMERIC type for details. If a decimal type in an ORC schema has a scale of at most 9 and a precision minus scale of at most 29, it is converted to NUMERIC; for example, decimal(38,9) becomes NUMERIC. Otherwise, it is converted to STRING. If a decimal type is converted to STRING, a warning message is returned. |
Complex types
| ORC data type | BigQuery data type | Notes |
|---|---|---|
| struct | RECORD | |
| map<K,V> | RECORD | An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE. |
| list | repeated fields | Nested lists and lists of maps are not supported. |
| union | RECORD | |
Column names
A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. A column name cannot use any of the following prefixes:
- `_TABLE_`
- `_FILE_`
- `_PARTITION`
Duplicate column names are not allowed even if the case differs. For example, a column named `Column1` is considered identical to a column named `column1`.
NULL values
Note that for load jobs, BigQuery ignores `NULL` elements for the `list` compound type, since otherwise they would be translated to `NULL` `ARRAY` elements, which cannot persist to a table (see Data Types for details).
For more information on ORC data types, see the Apache ORC™ Specification v1.