Loading ORC data from Cloud Storage
This page provides an overview of loading ORC data from Cloud Storage into BigQuery.
ORC is an open source column-oriented data format that is widely used in the Apache Hadoop ecosystem.
When you load ORC data from Cloud Storage, you can load the data into a new table or partition, or you can append to or overwrite an existing table or partition. When your data is loaded into BigQuery, it is converted into columnar format for Capacitor (BigQuery's storage format).
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi- regional location as the Cloud Storage bucket.
For information about loading ORC data from a local file, see Loading data into BigQuery from a local data source.
Limitations
You are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket:
- If your dataset's location is set to a value other than the
US
multi-region, then the Cloud Storage bucket must be in the same region or contained in the same multi-region as the dataset. - BigQuery does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
- BigQuery does not support Cloud Storage object versioning. If you include a generation number in the Cloud Storage URI, then the load job fails.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and create a dataset to store your data.
Required permissions
To load data into BigQuery, you need IAM permissions to run a load job and load data into BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need IAM permissions to access the bucket that contains your data.
Permissions to load data into BigQuery
To load data into a new BigQuery table or partition or to append or overwrite an existing table or partition, you need the following IAM permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.tables.update
bigquery.jobs.create
Each of the following predefined IAM roles includes the permissions that you need in order to load data into a BigQuery table or partition:
roles/bigquery.dataEditor
roles/bigquery.dataOwner
roles/bigquery.admin
(includes thebigquery.jobs.create
permission)bigquery.user
(includes thebigquery.jobs.create
permission)bigquery.jobUser
(includes thebigquery.jobs.create
permission)
Additionally, if you have the bigquery.datasets.create
permission, you can create and
update tables using a load job in the datasets that you create.
For more information on IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Permissions to load data from Cloud Storage
To get the permissions that you need to load data from a Cloud Storage bucket,
ask your administrator to grant you the
Storage Admin (roles/storage.admin
) IAM role on the bucket.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data from a Cloud Storage bucket. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data from a Cloud Storage bucket:
-
storage.buckets.get
-
storage.objects.get
-
storage.objects.list (required if you are using a URI wildcard)
You might also be able to get these permissions with custom roles or other predefined roles.
Create a dataset
Create a BigQuery dataset to store your data.
ORC schemas
When you load ORC files into BigQuery, the table schema is automatically retrieved from the self-describing source data. When BigQuery retrieves the schema from the source data, the alphabetically last file is used.
For example, you have the following ORC files in Cloud Storage:
gs://mybucket/00/ a.orc z.orc gs://mybucket/01/ b.orc
Running this command in the bq command-line tool loads all of the files (as a
comma-separated list), and the schema is derived from mybucket/01/b.orc
:
bq load \ --source_format=ORC \ dataset.table \ "gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
When BigQuery detects the schema, some ORC data types are
converted to BigQuery data types to make them compatible with
GoogleSQL syntax. All fields in the detected schema are
NULLABLE
. For more information, see
ORC conversions.
When you load multiple ORC files that have different schemas, identical fields (with the same name and same nested level) specified in multiple schemas must map to the same converted BigQuery data type in each schema definition.
To provide a table schema for creating external tables, set thereferenceFileSchemaUri
property in BigQuery API or --reference_file_schema_uri
parameter in bq command-line tool
to the URL of the reference file.
For example, --reference_file_schema_uri="gs://mybucket/schema.orc"
.
ORC compression
BigQuery supports the following compression codecs for ORC file contents:
Zlib
Snappy
LZO
LZ4
Data in ORC files doesn't remain compressed after it is uploaded to
BigQuery. Data storage is reported in logical bytes or physical
bytes, depending on the
dataset storage billing model.
To get information on storage usage, query the
INFORMATION_SCHEMA.TABLE_STORAGE
view.
Loading ORC data into a new table
You can load ORC data into a new table by:
- Using the Google Cloud console
- Using the bq command-line tool's
bq load
command - Calling the
jobs.insert
API method and configuring aload
job - Using the client libraries
To load ORC data from Cloud Storage into a new BigQuery table:
Console
In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, expand your project, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table panel, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
- Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
- For File format, select ORC.
- In the Destination section, specify the following details:
- For Dataset, select the dataset in which you want to create the table.
- In the Table field, enter the name of the table that you want to create.
- Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables.
- Click Advanced options and do the following:
- For Write preference, leave Write if empty selected. This option creates a new table and loads your data into it.
- If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
- For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the
LOAD DATA
DDL statement.
The following example loads an ORC file into the new table mytable
:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
LOAD DATA OVERWRITE mydataset.mytable FROM FILES ( format = 'ORC', uris = ['gs://bucket/path/file.orc']);
Click
Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq load
command, specify ORC as the source_format
, and include a
Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs or a URI
containing a wildcard.
(Optional) Supply the --location
flag and set the value to your
location.
Other optional flags include:
--time_partitioning_type
: Enables time-based partitioning on a table and sets the partition type. Possible values areHOUR
,DAY
,MONTH
, andYEAR
. This flag is optional when you create a table partitioned on aDATE
,DATETIME
, orTIMESTAMP
column. The default partition type for time-based partitioning isDAY
. You cannot change the partitioning specification on an existing table.--time_partitioning_expiration
: An integer that specifies (in seconds) when a time-based partition should be deleted. The expiration time evaluates to the partition's UTC date plus the integer value.--time_partitioning_field
: TheDATE
orTIMESTAMP
column used to create a partitioned table. If time-based partitioning is enabled without this value, an ingestion-time partitioned table is created.--require_partition_filter
: When enabled, this option requires users to include aWHERE
clause that specifies the partitions to query. Requiring a partition filter may reduce cost and improve performance. For more information, see Querying partitioned tables.--clustering_fields
: A comma-separated list of up to four column names used to create a clustered table.--destination_kms_key
: The Cloud KMS key for encryption of the table data.For more information about partitioned tables, see:
For more information about clustered tables, see:
For more information about table encryption, see:
To load ORC data into BigQuery, enter the following command:
bq --location=location load \ --source_format=format \ dataset.table \ path_to_source
Where:
- location is your location. The
--location
flag is optional. For example, if you are using BigQuery in the Tokyo region, you can set the flag's value toasia-northeast1
. You can set a default value for the location using the .bigqueryrc file. - format is
ORC
. - dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc
into a
table named mytable
in mydataset
.
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from gs://mybucket/mydata.orc
into a new
ingestion-time partitioned table named mytable
in mydataset
.
bq load \
--source_format=ORC \
--time_partitioning_type=DAY \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from gs://mybucket/mydata.orc
into a
partitioned table named mytable
in mydataset
. The table is partitioned
on the mytimestamp
column.
bq load \
--source_format=ORC \
--time_partitioning_field mytimestamp \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from multiple files in gs://mybucket/
into a table named mytable
in mydataset
. The Cloud Storage URI uses a
wildcard.
bq load \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata*.orc
The following command loads data from multiple files in gs://mybucket/
into a table named mytable
in mydataset
. The command includes a comma-
separated list of Cloud Storage URIs with wildcards.
bq load --autodetect \
--source_format=ORC \
mydataset.mytable \
"gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"
API
Create a
load
job that points to the source data in Cloud Storage.(Optional) Specify your location in the
location
property in thejobReference
section of the job resource.The
source URIs
property must be fully-qualified, in the formatgs://bucket/object
. Each URI can contain one '*' wildcard character.Specify the ORC data format by setting the
sourceFormat
property toORC
.To check the job status, call
jobs.get(job_id*)
, where job_id is the ID of the job returned by the initial request.- If
status.state = DONE
, the job completed successfully. - If the
status.errorResult
property is present, the request failed, and that object includes information describing what went wrong. When a request fails, no table is created and no data is loaded. - If
status.errorResult
is absent, the job finished successfully, although there might have been some non-fatal errors, such as problems importing a few rows. Non-fatal errors are listed in the returned job object'sstatus.errors
property.
- If
API notes:
Load jobs are atomic and consistent; if a load job fails, none of the data is available, and if a load job succeeds, all of the data is available.
As a best practice, generate a unique ID and pass it as
jobReference.jobId
when callingjobs.insert
to create a load job. This approach is more robust to network failure because the client can poll or retry on the known job ID.Calling
jobs.insert
on a given job ID is idempotent. You can retry as many times as you like on the same job ID, and at most one of those operations succeeds.
C#
Before trying this sample, follow the C# setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery C# API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Go
Before trying this sample, follow the Go setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Go API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Java
Node.js
Before trying this sample, follow the Node.js setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Node.js API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
PHP
Before trying this sample, follow the PHP setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery PHP API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Python
Before trying this sample, follow the Python setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Python API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Ruby
Before trying this sample, follow the Ruby setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Ruby API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Append to or overwrite a table with ORC data
You can load additional data into a table either from source files or by appending query results.
In the Google Cloud console, use the Write preference option to specify what action to take when you load data from a source file or from a query result.
You have the following options when you load additional data into a table:
Console option | bq tool flag | BigQuery API property | Description |
---|---|---|---|
Write if empty | Not supported | WRITE_EMPTY |
Writes the data only if the table is empty. |
Append to table | --noreplace or --replace=false ; if
--[no]replace is unspecified, the default is append |
WRITE_APPEND |
(Default) Appends the data to the end of the table. |
Overwrite table | --replace or --replace=true |
WRITE_TRUNCATE |
Erases all existing data in a table before writing the new data. This action also deletes the table schema, row level security, and removes any Cloud KMS key. |
If you load data into an existing table, the load job can append the data or overwrite the table.
You can append or overwrite a table by:
- Using the Google Cloud console
- Using the bq command-line tool's
bq load
command - Calling the
jobs.insert
API method and configuring aload
job - Using the client libraries
To append or overwrite a table with ORC data:
Console
In the Google Cloud console, go to the BigQuery page.
- In the Explorer pane, expand your project, and then select a dataset.
- In the Dataset info section, click Create table.
- In the Create table panel, specify the following details:
- In the Source section, select Google Cloud Storage in the Create table from list.
Then, do the following:
- Select a file from the Cloud Storage bucket, or enter the Cloud Storage URI. You cannot include multiple URIs in the Google Cloud console, but wildcards are supported. The Cloud Storage bucket must be in the same location as the dataset that contains the table you want to create, append, or overwrite.
- For File format, select ORC.
- In the Destination section, specify the following details:
- For Dataset, select the dataset in which you want to create the table.
- In the Table field, enter the name of the table that you want to create.
- Verify that the Table type field is set to Native table.
- In the Schema section, no action is necessary. The schema is self-described in ORC files.
- Optional: Specify Partition and cluster settings. For more information, see Creating partitioned tables and Creating and using clustered tables. You cannot convert a table to a partitioned or clustered table by appending or overwriting it. The Google Cloud console does not support appending to or overwriting partitioned or clustered tables in a load job.
- Click Advanced options and do the following:
- For Write preference, choose Append to table or Overwrite table.
- If you want to ignore values in a row that are not present in the table's schema, then select Unknown values.
- For Encryption, click Customer-managed key to use a Cloud Key Management Service key. If you leave the Google-managed key setting, BigQuery encrypts the data at rest.
- Click Create table.
SQL
Use the
LOAD DATA
DDL statement.
The following example appends an ORC file to the table mytable
:
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
LOAD DATA INTO mydataset.mytable FROM FILES ( format = 'ORC', uris = ['gs://bucket/path/file.orc']);
Click
Run.
For more information about how to run queries, see Run an interactive query.
bq
Enter the bq load
command with the --replace
flag to overwrite the
table. Use the --noreplace
flag to append data to the table. If no flag is
specified, the default is to append data. Supply the --source_format
flag
and set it to ORC
. Because ORC schemas are automatically retrieved
from the self-describing source data, you don't need to provide a schema
definition.
(Optional) Supply the --location
flag and set the value to your
location.
Other optional flags include:
--destination_kms_key
: The Cloud KMS key for encryption of the table data.
bq --location=location load \ --[no]replace \ --source_format=format \ dataset.table \ path_to_source
Where:
- location is your location.
The
--location
flag is optional. You can set a default value for the location by using the .bigqueryrc file. - format is
ORC
. - dataset is an existing dataset.
- table is the name of the table into which you're loading data.
- path_to_source is a fully-qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported.
Examples:
The following command loads data from gs://mybucket/mydata.orc
and
overwrites a table named mytable
in mydataset
.
bq load \
--replace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
The following command loads data from gs://mybucket/mydata.orc
and
appends data to a table named mytable
in mydataset
.
bq load \
--noreplace \
--source_format=ORC \
mydataset.mytable \
gs://mybucket/mydata.orc
For information about appending and overwriting partitioned tables using the bq command-line tool, see: Appending to and overwriting partitioned table data.
API
Create a
load
job that points to the source data in Cloud Storage.(Optional) Specify your location in the
location
property in thejobReference
section of the job resource.The
source URIs
property must be fully-qualified, in the formatgs://bucket/object
. You can include multiple URIs as a comma-separated list. Note that wildcards are also supported.Specify the data format by setting the
configuration.load.sourceFormat
property toORC
.Specify the write preference by setting the
configuration.load.writeDisposition
property toWRITE_TRUNCATE
orWRITE_APPEND
.
C#
Before trying this sample, follow the C# setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery C# API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Go
Before trying this sample, follow the Go setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Go API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Java
Node.js
Before trying this sample, follow the Node.js setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Node.js API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
PHP
Before trying this sample, follow the PHP setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery PHP API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Python
Before trying this sample, follow the Python setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Python API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Ruby
Before trying this sample, follow the Ruby setup instructions in the
BigQuery quickstart using
client libraries.
For more information, see the
BigQuery Ruby API
reference documentation.
To authenticate to BigQuery, set up Application Default Credentials.
For more information, see
Set up authentication for client libraries.
Load hive-partitioned ORC data
BigQuery supports loading hive partitioned ORC data stored on Cloud Storage and populates the hive partitioning columns as columns in the destination BigQuery managed table. For more information, see Loading Externally Partitioned Data from Cloud Storage.
ORC conversions
BigQuery converts ORC data types to the following BigQuery data types:
Primitive types
BigQuery data type | Notes | |
---|---|---|
boolean | BOOLEAN | |
byte | INTEGER | |
short | INTEGER | |
int | INTEGER | |
long | INTEGER | |
float | FLOAT | |
double | FLOAT | |
string | STRING | UTF-8 only |
varchar | STRING | UTF-8 only |
char | STRING | UTF-8 only |
binary | BYTES | |
date | DATE | An attempt to convert any value in the ORC data that is less than
-719162 days or greater than 2932896 days returns an invalid date
value error. If this affects you, contact
Support to have
unsupported values converted to the BigQuery minimum value
of 0001-01-01 or maximum value of 9999-12-31 ,
as appropriate. |
timestamp | TIMESTAMP |
ORC supports nanosecond precision, but BigQuery converts sub-microsecond values to microseconds when the data is read. An attempt to convert any value in the ORC data that is less than
-719162 days or greater than 2932896 days returns an |
decimal | NUMERIC, BIGNUMERIC, or STRING | See Decimal type. |
Decimal type
Decimal
logical types can be converted to NUMERIC
, BIGNUMERIC
, or STRING
types. The converted type depends
on the precision and scale parameters of the decimal
logical type and the
specified decimal target types. Specify the decimal target type as follows:
- For a load job using the
jobs.insert
API: use theJobConfigurationLoad.decimalTargetTypes
field. - For a load job using the
bq load
command in the bq command-line tool: use the--decimal_target_types
flag. - For a query against a table with external sources:
use the
ExternalDataConfiguration.decimalTargetTypes
field. - For a persistent external table created with DDL:
use the
decimal_target_types
option.
Complex types
BigQuery data type | Notes | |
---|---|---|
struct | RECORD |
|
map<K,V> | RECORD | An ORC map<K,V> field is converted to a repeated RECORD that contains two fields: a key of the same data type as K, and a value of the same data type as V. Both fields are NULLABLE. |
list | repeated fields | Nested lists and lists of maps are not supported. |
union | RECORD |
|
Column names
A column name can contain letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. If you use flexible column names, BigQuery supports starting a column name with a number. Exercise caution when starting columns with a number, since using flexible column names with the BigQuery Storage Read API or BigQuery Storage Write API requires special handling. For more information about flexible column name support, see flexible column names.
Column names have a maximum length of 300 characters. Column names can't use any of the following prefixes:
_TABLE_
_FILE_
_PARTITION
_ROW_TIMESTAMP
__ROOT__
_COLIDENTIFIER
Duplicate column names are not allowed even if the case differs. For example, a
column named Column1
is considered identical to a column named column1
. To
learn more about column naming rules, see Column
names in the
GoogleSQL reference.
If a table name (for example, test
) is the same as one of its column names
(for example, test
), the SELECT
expression interprets the test
column as
a STRUCT
containing all other table columns. To avoid this collision, use
one of the following methods:
Avoid using the same name for a table and its columns.
Assign the table a different alias. For example, the following query assigns a table alias
t
to the tableproject1.dataset.test
:SELECT test FROM project1.dataset.test AS t;
Include the table name when referencing a column. For example:
SELECT test.test FROM project1.dataset.test;
Flexible column names
You have more flexibility in what you name columns, including expanded access to characters in languages other than English as well as additional symbols.
Flexible column names support the following characters:
- Any letter in any language, as represented by the Unicode regular expression
\p{L}
. - Any numeric character in any language as represented by the Unicode regular
expression
\p{N}
. - Any connector punctuation character, including underscores, as represented
by the Unicode regular expression
\p{Pc}
. - A hyphen or dash as represented by the Unicode regular expression
\p{Pd}
. - Any mark intended to accompany another character as represented by the
Unicode regular expression
\p{M}
. For example, accents, umlauts, or enclosing boxes. - The following special characters:
- An ampersand (
&
) as represented by the Unicode regular expression\u0026
. - A percent sign (
%
) as represented by the Unicode regular expression\u0025
. - An equals sign (
=
) as represented by the Unicode regular expression\u003D
. - A plus sign (
+
) as represented by the Unicode regular expression\u002B
. - A colon (
:
) as represented by the Unicode regular expression\u003A
. - An apostrophe (
'
) as represented by the Unicode regular expression\u0027
. - A less-than sign (
<
) as represented by the Unicode regular expression\u003C
. - A greater-than sign (
>
) as represented by the Unicode regular expression\u003E
. - A number sign (
#
) as represented by the Unicode regular expression\u0023
. - A vertical line (
|
) as represented by the Unicode regular expression\u007c
. - Whitespace.
- An ampersand (
Flexible column names don't support the following special characters:
- An exclamation mark (
!
) as represented by the Unicode regular expression\u0021
. - A quotation mark (
"
) as represented by the Unicode regular expression\u0022
. - A dollar sign (
$
) as represented by the Unicode regular expression\u0024
. - A left parenthesis (
(
) as represented by the Unicode regular expression\u0028
. - A right parenthesis (
)
) as represented by the Unicode regular expression\u0029
. - An asterisk (
*
) as represented by the Unicode regular expression\u002A
. - A comma (
,
) as represented by the Unicode regular expression\u002C
. - A period (
.
) as represented by the Unicode regular expression\u002E
. - A slash (
/
) as represented by the Unicode regular expression\u002F
. - A semicolon (
;
) as represented by the Unicode regular expression\u003B
. - A question mark (
?
) as represented by the Unicode regular expression\u003F
. - An at sign (
@
) as represented by the Unicode regular expression\u0040
. - A left square bracket (
[
) as represented by the Unicode regular expression\u005B
. - A backslash (
\
) as represented by the Unicode regular expression\u005C
. - A right square bracket (
]
) as represented by the Unicode regular expression\u005D
. - A circumflex accent (
^
) as represented by the Unicode regular expression\u005E
. - A grave accent (
`
) as represented by the Unicode regular expression\u0060
. - A left curly bracket {
{
) as represented by the Unicode regular expression\u007B
. - A right curly bracket (
}
) as represented by the Unicode regular expression\u007D
. - A tilde (
~
) as represented by the Unicode regular expression\u007E
.
For additional guidelines, see Column names.
The expanded column characters are supported by both the BigQuery Storage Read API
and the BigQuery Storage Write API. To use the expanded list of Unicode characters
with the BigQuery Storage Read API, you must set a flag. You can use the
displayName
attribute to retrieve the column name. The following example
shows how to set a flag with the Python client:
from google.cloud.bigquery_storage import types
requested_session = types.ReadSession()
#set avro serialization options for flexible column.
options = types.AvroSerializationOptions()
options.enable_display_name_attribute = True
requested_session.read_options.avro_serialization_options = options
To use the expanded list of Unicode characters with the BigQuery Storage Write API,
you must provide the schema with column_name
notation, unless you are using
the JsonStreamWriter
writer object. The following example shows how to
provide the schema:
syntax = "proto2";
package mypackage;
// Source protos located in github.com/googleapis/googleapis
import "google/cloud/bigquery/storage/v1/annotations.proto";
message FlexibleSchema {
optional string item_name_column = 1
[(.google.cloud.bigquery.storage.v1.column_name) = "name-列"];
optional string item_description_column = 2
[(.google.cloud.bigquery.storage.v1.column_name) = "description-列"];
}
In this example, item_name_column
and item_description_column
are
placeholder names which need to be compliant with the
protocol buffer naming
convention. Note that column_name
annotations always take precedence over
placeholder names.
- Loading Parquet data does not support flexible column names by default. To enroll in this preview complete the enrollment form. Note that after enrolling in the preview, any invalid column names (for example, column name collation) returns an error. For projects not enrolled, the load request replaces the invalid characters with underscores instead of returning an error.
- Loading CSV data using schema autodetection does not support flexible column names by default. To enroll in this preview complete the enrollment form. Note that after enrolling in the preview, any invalid column names (for example, column name collation) returns an error. For projects not enrolled, the load request replaces the invalid characters with underscores instead of returning an error.
Limitations
Flexible column names are not supported with external tables.
NULL
values
Note that for load jobs, BigQuery ignores NULL
elements for the
list
compound type, since otherwise they would be translated to NULL
ARRAY
elements which cannot persist to a table (see
Data Types for
details).
For more information on ORC data types, see the Apache ORC™ Specification v1.