About collation
Collation determines how strings are sorted and compared in collation-supported operations. If you would like to customize collation for a collation-supported operation, you must assign a collation specification to at least one string in the operation. Some operations can't use collation, but can pass collation through them.
Operations affected by collation
When an operation is affected by collation, this means that the operation takes into consideration collation during the operation. These query operations are affected by collation when sorting and comparing strings:
Operations | Notes |
---|---|
Collation-supported comparison operations | |
Join operations | |
ORDER BY
|
In the WINDOW clause, ORDER BY is not
supported with collation. |
GROUP BY
|
|
WINDOW
for window functions |
In the WINDOW clause, ORDER BY and
PARTITION BY is not supported with collation. |
Collation-supported scalar functions | |
Collation-supported aggregate functions | |
Set operations |
Operations that propagate collation
Collation can pass through some query operations to other parts of a query. When collation passes through an operation in a query, this is known as propagation. During propagation:
- If an input contains no collation specification or an empty collation specification and another input contains an explicitly defined collation, the explicitly defined collation is used for all of the inputs.
- All inputs with a non-empty explicitly defined collation specification must have the same type of collation specification, otherwise an error is thrown.
BigQuery has several functions, operators, and expressions that can propagate collation.
In the following example, the 'und:ci'
collation specification is propagated
from the character
column to the ORDER BY
operation.
-- With collation
SELECT *
FROM UNNEST([
COLLATE('B', 'und:ci'),
'b',
'a'
]) AS character
ORDER BY character
+-----------+
| character |
+-----------+
| a |
| B |
| b |
+-----------+
-- Without collation
SELECT *
FROM UNNEST([
'B',
'b',
'a'
]) AS character
ORDER BY character
+-----------+
| character |
+-----------+
| B |
| a |
| b |
+-----------+
Functions
These functions let collation propagate through them:
Function | Notes |
---|---|
AEAD.DECRYPT_STRING |
|
ANY_VALUE |
|
ARRAY_AGG |
Collation on input arguments are propagated as collation on the array element. |
ARRAY_TO_STRING |
Collation on array elements are propagated to output. |
COLLATE |
|
CONCAT |
|
FORMAT |
Collation from format_string to the returned string is propagated. |
FORMAT_DATE |
Collation from format_string to the returned string is propagated. |
FORMAT_DATETIME |
Collation from format_string to the returned string is propagated. |
FORMAT_TIME |
Collation from format_string to the returned string is propagated. |
FORMAT_TIMESTAMP |
Collation from format_string to the returned string is propagated. |
GREATEST |
|
LAG |
|
LEAD |
|
LEAST |
|
LEFT |
|
LOWER |
|
LPAD |
|
MAX |
|
MIN |
|
NET.HOST |
|
NET.PUBLIC_SUFFIX |
|
NET.REG_DOMAIN |
|
NORMALIZE |
|
NORMALIZE_AND_CASEFOLD |
|
REPEAT |
|
REPLACE |
|
REVERSE |
|
RIGHT |
|
RPAD |
|
SOUNDEX |
|
SPLIT |
Collation on input arguments are propagated as collation on the array element. |
STRING_AGG |
|
SUBSTR |
|
UPPER |
Operators
These operators let collation propagate through them:
Operator | Notes |
---|---|
|| concatenation operator |
|
Array subscript operator | Propagated to output. |
Set operators | Collation of an output column is decided by the collations of input columns at the same position. |
STRUCT field access operator |
When getting a STRUCT , collation on the STRUCT field is propagated as the output collation. |
UNNEST |
Collation on the input array element is propagated to output. |
Expressions
These expressions let collation propagate through them:
Expression | Notes |
---|---|
ARRAY |
When you construct an ARRAY , collation on input arguments is propagated on the elements in the ARRAY . |
CASE |
|
CASE expr |
|
COALESCE |
|
IF |
|
IFNULL |
|
NULLIF |
|
STRUCT |
When you construct a STRUCT , collation on input arguments is propagated on the fields in the STRUCT . |
Where you can assign a collation specification
A collation specification can be assigned to these collation-supported types:
- A
STRING
- A
STRING
field in aSTRUCT
- A
STRING
element in anARRAY
In addition:
- You can assign a default collation specification to a schema when you create or alter it. This assigns a default collation specification to all future tables that are added to the schema if the tables do not have their own default collation specifications.
- You can assign a default collation specification to a table when you create or alter it. This assigns a collation specification to all future collation-supported columns that are added to the table if the columns do not have collation specifications. This overrides a default collation specification on a schema.
- You can assign a collation specification to a collation-supported type in a column. A column that contains a collation-supported type in its column schema is a collation-supported column. This overrides a default collation specification on a table.
- You can assign a collation specification to a collation-supported query operation.
- You can assign a collation specification to a collation-supported expression
with the
COLLATE
function. This overrides any collation specifications set previously.
In summary:
You can define a default collation specification for a schema. For example:
CREATE SCHEMA (...)
DEFAULT COLLATE 'und:ci'
You can define a default collation specification for a table. For example:
CREATE TABLE (...)
DEFAULT COLLATE 'und:ci'
You can define a collation specification for a collation-supported column. For example:
CREATE TABLE (
case_insensitive_column STRING COLLATE 'und:ci'
)
You can specify a collation specification for a collation-supported expression
with the COLLATE
function. For example:
SELECT COLLATE('a', 'und:ci') AS character
DDL statements
Location | Support | Notes |
---|---|---|
Schema | CREATE SCHEMA
|
Create a schema and optionally add a default collation specification to the schema. |
Schema | ALTER SCHEMA
|
Updates the default collation specification for a schema. |
Table | CREATE TABLE
|
Create a table and optionally add a default
collation specification to a table
or a collation specification to a
collation-supported type in a column.
You cannot have collation on a column used with CLUSTERING .
You can use collated-supported functions with AS SELECT when creating a table. |
Table | ALTER TABLE
|
Update the default collation specification for collation-supported type in a table. |
Column | ADD COLUMN
|
Add a collation specification to a collation-supported type in a new column in an existing table. |
Data types
Type | Support | Notes |
---|---|---|
Data type | STRING
|
You can apply a collation specification directly to this data type. |
Data type | STRUCT
|
You can apply a collation specification to a
STRING field in a STRUCT . A STRUCT can
have STRING fields with different
collation specifications. |
Data type | ARRAY
|
You can apply a collation specification to a
STRING element in an ARRAY . An ARRAY can
have STRING elements with different
collation specifications. |
Use the COLLATE
function to apply a collation specification to
collation-supported expressions.
Functions, operators, and conditional expressions
Functions
Type | Support | Notes |
---|---|---|
Scalar | COLLATE |
|
Scalar | ENDS_WITH |
|
Scalar | GREATEST
|
|
Scalar | INSTR |
|
Scalar | LEAST |
|
Scalar | REPLACE
|
|
Scalar | SPLIT |
|
Scalar | STARTS_WITH |
|
Scalar | STRPOS |
|
Aggregate | COUNT
|
This operator is only affected by
collation when the input includes
the DISTINCT argument. |
Aggregate | MAX |
|
Aggregate | MIN
|
Operators
Support | Notes |
---|---|
< |
|
<= |
|
> |
|
>= |
|
= |
|
!= |
|
[NOT] BETWEEN |
|
[NOT] IN
|
This operator generally supports collation,
however, [NOT] IN UNNEST is not
supported.
If used with a list, requires at least one
item in the list. |
[NOT] IN |
Conditional expressions
Support | |
---|---|
CASE |
|
CASE expr |
|
NULLIF |
The preceding collation-supported operations (functions, operators, and conditional expressions) can include input with explicitly defined collation specifications for collation-supported types. In a collation-supported operation:
- All inputs with a non-empty, explicitly defined collation specification must be the same, otherwise an error is thrown.
- If an input does not contain an explicitly defined collation and another input contains an explicitly defined collation, the explicitly defined collation is used for both.
For example:
-- Assume there is a table with this column declaration:
CREATE TABLE table_a
(
col_a STRING COLLATE 'und:ci',
col_b STRING COLLATE '',
col_c STRING,
col_d STRING COLLATE 'und:ci'
);
-- This runs. Column 'b' has a collation specification and the
-- column 'c' does not.
SELECT STARTS_WITH(col_b_expression, col_c_expression)
FROM table_a;
-- This runs. Column 'a' and 'd' have the same collation specification.
SELECT STARTS_WITH(col_a_expression, col_d_expression)
FROM table_a;
-- This runs. Even though column 'a' and 'b' have different
-- collation specifications, column 'b' is considered the default collation
-- because it's assigned to an empty collation specification.
SELECT STARTS_WITH(col_a_expression, col_b_expression)
FROM table_a;
-- This works. Even though column 'a' and 'b' have different
-- collation specifications, column 'b' is updated to use the same
-- collation specification as column 'a'.
SELECT STARTS_WITH(col_a_expression, COLLATE(col_b_expression, 'und:ci'))
FROM table_a;
-- This runs. Column 'c' does not have a collation specification, so it uses the
-- collation specification of column 'd'.
SELECT STARTS_WITH(col_c_expression, col_d_expression)
FROM table_a;
Collation specification details
A collation specification determines how strings are sorted and compared in
collation-supported operations. You can define the
Unicode collation specification, und:ci
, for
collation-supported types.
If a collation specification is not defined, the default collation specification is used. To learn more, see the next section.
Default collation specification
When a collation specification is not assigned or is empty,
'binary'
collation is used. Binary collation indicates that the
operation should return data in Unicode code point order.
You cannot set binary collation explicitly.
Unicode collation specification
collation_specification:
'language_tag:collation_attribute'
A unicode collation specification indicates that the operation should use the
Unicode Collation Algorithm to sort and compare
strings. The collation specification can be a STRING
literal or a
query parameter.
The language tag
The language tag determines how strings are generally sorted and compared.
Allowed values for language_tag
are:
und
: A locale string representing the undetermined locale.und
is a special language tag defined in the IANA language subtag registry and used to indicate an undetermined locale. This is also known as the root locale and can be considered the default Unicode collation. It defines a reasonable, locale agnostic collation.
The collation attribute
In addition to the language tag, the unicode collation specification must
have a collation_attribute
, which enables additional rules for sorting
and comparing strings. Allowed values are:
ci
: Collation is case-insensitive.
Collation specification example
This is what the ci
collation attribute looks like when used with the
und
language tag in the COLLATE
function:
COLLATE('orange1', 'und:ci')
Caveats
- Differing strings can be considered equal.
For instance,
ẞ
(LATIN CAPITAL LETTER SHARP S) is considered equal to'SS'
on primary level thus'ẞ1' < 'SS2'
. This is similar to how case insensitivity works. There are a wide range of unicode code points (punctuation, symbols, etc), that are treated as if they are not there. So strings with and without them are sorted identically. For example, the format control code point U+2060 is ignored when the following strings are sorted:
SELECT * FROM UNNEST([ COLLATE('oran\u2060ge1', 'und:ci'), COLLATE('\u2060orange2', 'und:ci'), COLLATE('orange3', 'und:ci') ]) AS fruit ORDER BY fruit +---------+ | fruit | +---------+ | orange1 | | orange2 | | orange3 | +---------+
Ordering may change. The Unicode specification of the
und
collation can change occasionally, which can affect sorting order.
Limitations
Limitations for supported features are captured in the previous sections, but here are a few general limitations to keep in mind:
und:ci
and empty collation are supported, but not other collation specifications.- Operations and functions that do not support collation produce an error if they encounter collated values.
You cannot set non-empty collation on a clustering field, using an API call or DDL query.
CREATE TABLE mydataset.mytable ( word STRING COLLATE 'und:ci', number INT64 ) CLUSTER BY word; -- User error: -- “CLUSTER BY STRING column word with -- collation und:ci is not supported"
A materialized view with collated output columns is not supported.
CREATE MATERIALIZED VIEW bigquery.public.data.samples.shakespeare AS SELECT COLLATE('abc', 'und:ci') x; -- User error: -- “Creating MATERIALIZED VIEW with collation -- on the output column is not supported”
A view with collated output columns is not supported.
CREATE VIEW bigquery.public.data.samples.shakespeare AS SELECT COLLATE('abc', 'und:ci') x; -- User error: -- "Creating VIEW with collation on the output column -- is not supported"
After a view or materialized view is created and a column collation of the base table is altered to make the column collated, the view or materialized view is invalidated and cannot be queried.
CREATE TABLE dataset.base ( col_no_collation STRING, col INT64 ) AS SELECT 'a', 1 UNION ALL SELECT 'b', 2 UNION ALL SELECT 'A', 3 UNION ALL SELECT 'B', 4; CREATE VIEW/MATERIALIZED VIEW dataset.view AS SELECT * FROM dataset.base; -- Change the collation of string column of the base table so -- that the string column of view/materialized view is collated. ALTER TABLE dataset.base ALTER COLUMN col_no_collation SET DATA TYPE STRING COLLATE 'und:ci'; -- Should return an error since the column of view/materialized -- contains collation. SELECT * FROM dataset.view ORDER BY col; -- User error: -- "Collation on the output column for view dataset.view is -- not supported"
You cannot create a materialized view with collated sort keys in an aggregate function.
CREATE TABLE dataset.base ( col_no_collation STRING, col INT64 ) AS SELECT 'a', 1 UNION ALL SELECT 'b', 2 UNION ALL SELECT 'A', 3 UNION ALL SELECT 'B', 4; CREATE VIEW/MATERIALIZED VIEW dataset.view AS SELECT * FROM dataset.base; -- Change the collation of string column of the base table so -- that the string column of view/materialized view is -- collated. ALTER TABLE dataset.base ALTER COLUMN col_no_collation SET DATA TYPE STRING COLLATE 'und:ci'; -- Should return an error since the column of view/materialized -- contains collation. SELECT * FROM dataset.view ORDER BY col; -- User error: -- "Collation on the output column for view dataset.view is -- not supported"
A table function with collated output columns is not supported.
CREATE TABLE FUNCTION bigquery.public.data.samples.shakespeare() AS SELECT COLLATE('abc', 'und:ci') x; -- User error: -- "Creating TABLE FUNCTION with collation on the output -- column is not supported"
User defined functions (UDFs) cannot take collated arguments.
CREATE FUNCTION tmp_dataset.f(x STRING) AS (x); SELECT tmp_dataset.f(col_ci) FROM shared_dataset.table_collation_simple -- User error: -- “Collation is not allowed on argument x ("und:ci"). -- Use COLLATE(arg, '') to remove collation at [1:8]”
If a
SELECT
query with a user-specified destination table produces output columns with collation, an error is produced.-- The following query returns an error since the resulting schema of -- the query has a column string_ci with collation ‘und:ci’. SELECT string_ci FROM collated_table
BigQuery BI engine does not support tables with collation or the
COLLATE
function.