String Collation Rules

Collation is the arrangement of written content into a standardized order. String comparison functions use the collation rules for Latin. The rules are summarized below:

  • Comparisons are case-sensitive.
    • Uppercase letters are greater than lowercase versions of the same letter.
    • However, a lowercase letter that comes later in the alphabet is greater than the uppercase version of a preceding letter. For example, b is greater than A.
  • Two strings are equal if they match exactly.
    • If two strings are identical except that the second string contains one additional character at the end, the second string is greater.
  • The normalized version of a letter is its unaccented, lowercase form. In string comparison, the normalized form has the lowest value of all of the letter's variants.
    • a is less than ă.
    • However, when compared to b, a and ă are treated as equal: both are less than b.
    • The set of Latin normalized characters contains more than 26 characters.
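
The letter rules above can be approximated with a multi-level sort key. The following Python sketch is illustrative only and is not the product's implementation: the function name collation_key is hypothetical, and the three-level design (base letter first, then accents, then case) is an assumption chosen because it reproduces the rules listed above. It does not model the ordering of whitespace, punctuation, or digits.

```python
import unicodedata

def collation_key(text):
    """Build a three-level sort key: base letters, then accents, then case.

    Hypothetical helper for illustration; not the product's algorithm.
    """
    base, accents, case = [], [], []
    # NFD splits an accented letter into its base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and accents:
            # A combining mark belongs to the preceding base character.
            accents[-1] += (ord(ch),)
            continue
        base.append(ch.lower())               # level 1: normalized letter
        accents.append(())                    # level 2: no accent yet
        case.append(1 if ch.isupper() else 0) # level 3: lowercase sorts first
    return (base, accents, case)

# Lowercase sorts before uppercase of the same letter; both sort before b.
print(sorted(["b", "A", "a"], key=collation_key))  # ['a', 'A', 'b']
# a is less than ă, yet both are less than b.
print(collation_key("a") < collation_key("ă") < collation_key("b"))  # True
# A string with one extra trailing character is greater.
print(collation_key("abc") < collation_key("abcd"))  # True
```

Because the base letters are compared first, accent and case differences only break ties between strings whose normalized letters are otherwise identical.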

The following table illustrates some general rules of Latin collation, listed in ascending order.

Order  Description  Lesser Example  Greater Example
1      Whitespace   (space)         (return)
2      Punctuation  '               @
3      Digits       1               2
4      Letters      a               A
5                   A               b
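
The class ordering in the table (whitespace, then punctuation, then digits, then letters) can be sketched as a rank function over Unicode general categories. This is a rough illustration under stated assumptions: the name class_rank is hypothetical, symbols are grouped with punctuation, and any other character is assumed to sort last. Ordering within a class is not modeled here.

```python
import unicodedata

def class_rank(ch):
    """Rank one character by broad class, mirroring the table's order.

    Hypothetical helper for illustration; not the product's algorithm.
    """
    if ch.isspace():
        return 1  # whitespace, including space and return
    category = unicodedata.category(ch)
    if category[0] in ("P", "S"):
        return 2  # punctuation (symbols grouped here by assumption)
    if category[0] == "N":
        return 3  # digits and other numeric characters
    if category[0] == "L":
        return 4  # letters
    return 5      # anything else (assumption: sorts last)

# Sorting by class rank only: whitespace, punctuation, digit, letter.
print(sorted(["a", "1", "@", " "], key=class_rank))  # [' ', '@', '1', 'a']
```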

Resources:

NOTE: In the charts linked below, values near the top of a page are lower than values farther down the page. Similarly, the charts in the left navigation bar are listed in ascending order.

For more information on the applicable collation rules, see http://www.unicode.org/charts/collation/.
