GoogleSQL for Spanner supports collation. You can learn more about collation in this topic.
About collation
Collation determines how strings are sorted and compared in an
ORDER BY
operation. If you would like to use custom
collation in the operation, you can include the
COLLATE
clause with a
collation specification.
Where you can assign a collation specification
In the ORDER BY
clause, you can specify a collation specification for a
collation-supported column. This overrides any
collation specifications set previously.
For example:
SELECT Place
FROM Locations
ORDER BY Place COLLATE "und:ci"
Query statements
Type | Support |
---|---|
Sorting | ORDER BY clause |
Collation specification details
A collation specification determines how strings are sorted and compared in collation-supported operations. You can define a collation specification for collation-supported types. These types of collation specifications are available:
If a collation specification isn't defined, the default collation specification is used. To learn more, see the next section.
Default collation specification
When a collation specification isn't assigned or is empty,
the ordering behavior is identical to 'unicode'
collation,
which you can learn about in the
Unicode collation specification.
Unicode collation specification
collation_specification:
'language_tag[:collation_attribute]'
A unicode collation specification indicates that the operation should use the
Unicode Collation Algorithm to sort and compare
strings. The collation specification can be a STRING
literal or a
query parameter.
The language tag
The language tag determines how strings are generally sorted and compared.
Allowed values for language_tag
are:
- A standard locale string: This name is usually two or three letters
that represent the language, optionally followed by an underscore or dash and
two letters that represent the region — for example,
en_US
. These names are defined by the Common Locale Data Repository (CLDR). und
: A locale string representing the undetermined locale.und
is a special language tag defined in the IANA language subtag registry and used to indicate an undetermined locale. This is also known as the root locale and can be considered the default Unicode collation. It defines a reasonable, locale agnostic collation. It differs significantly fromunicode
.unicode
: Returns data in Unicode code point order, which is identical to the ordering behavior whenCOLLATE
isn't used. The sort order will look largely arbitrary to human users.
The collation attribute
In addition to the language tag, the unicode collation specification can have
an optional collation_attribute
, which enables additional rules for sorting
and comparing strings. Allowed values are:
ci
: Collation is case-insensitive.cs
: Collation is case-sensitive. By default,collation_attribute
is implicitlycs
.
If you're using the unicode
language tag with a collation attribute, these
caveats apply:
unicode:cs
is identical tounicode
.unicode:ci
is identical tound:ci
. It's recommended to migrateunicode:ci
tobinary
.
Collation specification example
This is what the ci
collation attribute looks like when used with the
und
language tag in the ORDER BY
clause:
SELECT Place
FROM Locations
ORDER BY Place COLLATE 'und:ci'
Caveats
- Differing strings can be considered equal.
For instance,
ẞ
(LATIN CAPITAL LETTER SHARP S) is considered equal to'SS'
on the primary level, thus'ẞ1' < 'SS2'
are equal. This is similar to how case insensitivity works. - In search operations, strings with different lengths could be considered equal. To ensure consistency, collation should be used without search tailoring.
There are a wide range of unicode code points (punctuation, symbols, etc), that are treated as if they aren't there. So strings with and without them are sorted identically. For example, the format control code point
U+2060
is ignored when the following strings are sorted:SELECT * FROM UNNEST([ 'oran\u2060ge1', '\u2060orange2', 'orange3' ]) AS fruit ORDER BY fruit COLLATE 'und' /*---------* | fruit | +---------+ | orange1 | | orange2 | | orange3 | *---------*/
Ordering may change. The Unicode specification of the
und
collation can change occasionally, which can affect sorting order. If you need a stable sort order that's guaranteed to never change, useunicode
collation.