Migrating to GoogleSQL
BigQuery supports two SQL dialects: GoogleSQL and legacy SQL. This document explains the differences between the two dialects, including syntax, functions, and semantics, and gives examples of some of the highlights of GoogleSQL.
Comparison of legacy and GoogleSQL
When initially released, BigQuery ran queries using a non-GoogleSQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for GoogleSQL, and renamed BigQuery SQL to legacy SQL. GoogleSQL is the preferred SQL dialect for querying data stored in BigQuery.
Do I have to migrate to GoogleSQL?
We recommend migrating from legacy SQL to GoogleSQL, but it's not required. For example, suppose that you execute many queries that use legacy SQL, but you want to take advantage of a GoogleSQL feature for a new query. You can create new queries using GoogleSQL that run alongside queries using legacy SQL.
Enabling GoogleSQL
You have a choice of whether to use legacy or GoogleSQL when you run a query. For information about switching between SQL dialects, see BigQuery SQL dialects.
Advantages of GoogleSQL
GoogleSQL complies with the SQL 2011 standard, and has extensions that support querying nested and repeated data. It has several advantages over legacy SQL, including:
- Composability using
WITH
clauses and SQL functions - Subqueries in the
SELECT
list andWHERE
clause - Correlated subqueries
ARRAY
andSTRUCT
data types- Inserts, updates, and deletes
COUNT(DISTINCT <expr>)
is exact and scalable, providing the accuracy ofEXACT_COUNT_DISTINCT
without its limitations- Automatic predicate push-down through
JOIN
s - Complex
JOIN
predicates, including arbitrary expressions
For examples that demonstrate some of these features, see GoogleSQL highlights.
Type differences
Legacy SQL types have an equivalent in GoogleSQL and vice versa. In some cases, the type has a different name. The following table lists each legacy SQL data type and its GoogleSQL equivalent.
Legacy SQL | GoogleSQL | Notes |
---|---|---|
BOOL |
BOOL |
|
INTEGER |
INT64 |
|
FLOAT |
FLOAT64 |
|
STRING |
STRING |
|
BYTES |
BYTES |
|
RECORD |
STRUCT |
|
REPEATED |
ARRAY |
|
TIMESTAMP |
TIMESTAMP |
See TIMESTAMP differences |
DATE |
DATE |
Legacy SQL has limited support for DATE |
TIME |
TIME |
Legacy SQL has limited support for TIME |
DATETIME |
DATETIME |
Legacy SQL has limited support for DATETIME |
For more information about the GoogleSQL type system, see the GoogleSQL data types reference. For more information about data types in BigQuery, see the BigQuery data types reference.
TIMESTAMP
differences
GoogleSQL has a
stricter range of valid TIMESTAMP
values
than legacy SQL does. In GoogleSQL, valid TIMESTAMP
values are in the
range of 0001-01-01 00:00:00.000000
to 9999-12-31 23:59:59.999999
. For
example, you can select the minimum and maximum TIMESTAMP
values using
GoogleSQL:
#standardSQL
SELECT
min_timestamp,
max_timestamp,
UNIX_MICROS(min_timestamp) AS min_unix_micros,
UNIX_MICROS(max_timestamp) AS max_unix_micros
FROM (
SELECT
TIMESTAMP '0001-01-01 00:00:00.000000' AS min_timestamp,
TIMESTAMP '9999-12-31 23:59:59.999999' AS max_timestamp
);
This query returns -62135596800000000
as min_unix_micros
and
253402300799999999
as max_unix_micros
.
If you select a column that contains timestamp values outside of this range, you receive an error:
#standardSQL
SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps;
This query returns the following error:
Cannot return an invalid timestamp value of -8446744073709551617
microseconds relative to the Unix epoch. The range of valid
timestamp values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
To correct the error, one option is to define and use a user-defined function to filter the invalid timestamps:
#standardSQL
CREATE TEMP FUNCTION TimestampIsValid(t TIMESTAMP) AS (
t >= TIMESTAMP('0001-01-01 00:00:00') AND
t <= TIMESTAMP('9999-12-31 23:59:59.999999')
);
SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps
WHERE TimestampIsValid(timestamp_column_with_invalid_values);
Another option to correct the error is to use the
SAFE_CAST
function with the timestamp column. For example:
#standardSQL
SELECT SAFE_CAST(timestamp_column_with_invalid_values AS STRING) AS timestamp_string
FROM MyTableWithInvalidTimestamps;
This query returns NULL
rather than a timestamp string for invalid
timestamp values.
GoogleSQL highlights
This section discusses some of the highlights of GoogleSQL compared to legacy SQL.
Composability using WITH
clauses
Some of the GoogleSQL examples on this page make use of a
WITH
clause,
which enables extraction or reuse of named subqueries. For example:
#standardSQL
WITH T AS (
SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
)
SELECT x / (SELECT SUM(x) FROM T) AS weighted_x
FROM T;
This query defines a named subquery T
that contains x
values of 1, 2, 3,
and 4. It selects x
values from T
and divides them by the sum of all x
values in T
. This query is equivalent to a query where the contents of T
are inline:
#standardSQL
SELECT
x / (SELECT SUM(x)
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x)) AS weighted_x
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);
As another example, consider this query, which uses multiple named subqueries:
#standardSQL
WITH T AS (
SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
),
TPlusOne AS (
SELECT x + 1 AS y
FROM T
),
TPlusOneTimesTwo AS (
SELECT y * 2 AS z
FROM TPlusOne
)
SELECT z
FROM TPlusOneTimesTwo;
This query defines a sequence of transformations of the original data, followed
by a SELECT
statement over TPlusOneTimesTwo
. This query is equivalent to the
following query, which inlines the computations:
#standardSQL
SELECT (x + 1) * 2 AS z
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);
For more information, see
WITH
clause.
Composability using SQL functions
GoogleSQL supports user-defined SQL functions. You can use user-defined SQL functions to define common expressions and then reference them from the query. For example:
#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
-- n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>) AS
(
ARRAY_LENGTH(arr) / (SELECT SUM(1 / x) FROM UNNEST(arr) AS x)
);
WITH T AS (
SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;
This query defines a SQL function named HarmonicMean
and then applies it to
the array column arr
from T
.
Subqueries in more places
GoogleSQL supports subqueries in the SELECT
list, WHERE
clause, and
anywhere else in the query that expects an expression. For example, consider the
following GoogleSQL query that computes the fraction of warm days in Seattle
in 2015:
#standardSQL
WITH SeattleWeather AS (
SELECT *
FROM `bigquery-public-data.noaa_gsod.gsod2015`
WHERE stn = '994014'
)
SELECT
COUNTIF(max >= 70) /
(SELECT COUNT(*) FROM SeattleWeather) AS warm_days_fraction
FROM SeattleWeather;
The Seattle weather station has an ID of '994014'
. The query computes the
number of warm days based on those where the temperature reached 70 degrees
Fahrenheit, or approximately 21 degrees Celsius, divided by the total number of
recorded days for that station in 2015.
Correlated subqueries
In GoogleSQL, subqueries can reference correlated columns; that is, columns that originate from the outer query. For example, consider the following GoogleSQL query:
#standardSQL
WITH WashingtonStations AS (
SELECT weather.stn AS station_id, ANY_VALUE(station.name) AS name
FROM `bigquery-public-data.noaa_gsod.stations` AS station
INNER JOIN `bigquery-public-data.noaa_gsod.gsod2015` AS weather
ON station.usaf = weather.stn
WHERE station.state = 'WA' AND station.usaf != '999999'
GROUP BY station_id
)
SELECT washington_stations.name,
(SELECT COUNT(*)
FROM `bigquery-public-data.noaa_gsod.gsod2015` AS weather
WHERE washington_stations.station_id = weather.stn
AND max >= 70) AS warm_days
FROM WashingtonStations AS washington_stations
ORDER BY warm_days DESC;
This query computes the names of weather stations in Washington state and the
number of days in 2015 that the temperature reached 70 degrees Fahrenheit, or
approximately 21 degrees Celsius. Notice that there is a subquery in the
SELECT
list, and that the subquery references washington_stations.station_id
from the outer scope, namely FROM WashingtonStations AS washington_stations
.
Arrays and structs
ARRAY
and STRUCT
are powerful concepts in GoogleSQL. As an example that
uses both, consider the following query, which computes the top two articles
for each day in the HackerNews dataset:
#standardSQL
WITH TitlesAndScores AS (
SELECT
ARRAY_AGG(STRUCT(title, score)) AS titles,
EXTRACT(DATE FROM time_ts) AS date
FROM `bigquery-public-data.hacker_news.stories`
WHERE score IS NOT NULL AND title IS NOT NULL
GROUP BY date)
SELECT date,
ARRAY(SELECT AS STRUCT title, score
FROM UNNEST(titles)
ORDER BY score DESC
LIMIT 2)
AS top_articles
FROM TitlesAndScores
ORDER BY date DESC;
The WITH
clause defines TitlesAndScores
, which contains two columns. The
first is an array of structs, where one field is an article title and the second
is a score. The ARRAY_AGG
expression returns an array of these structs for
each day.
The SELECT
statement following the WITH
clause uses an ARRAY
subquery to
sort and return the top two articles within each array in accordance with the
score
, then returns the results in descending order by date.
For more information about arrays and ARRAY
subqueries, see
Working with arrays. See also the
references for arrays
and structs.## GoogleSQL highlights
This section discusses some of the highlights of GoogleSQL compared to legacy SQL.
Syntax differences
Escaping reserved keywords and invalid identifiers
In legacy SQL, you escape reserved keywords and identifiers that contain
invalid characters such as a space or hyphen
-
using square brackets []
.
In GoogleSQL, you escape such keywords and identifiers using backticks
`
. For example:
#standardSQL
SELECT
word,
SUM(word_count) AS word_count
FROM
`bigquery-public-data.samples.shakespeare`
WHERE word IN ('me', 'I', 'you')
GROUP BY word;
Legacy SQL allows reserved keywords in some places that GoogleSQL does not.
For example, the following query fails due to a Syntax error
using standard
SQL:
#standardSQL
SELECT
COUNT(*) AS rows
FROM
`bigquery-public-data.samples.shakespeare`;
To fix the error, escape the alias rows
using backticks:
#standardSQL
SELECT
COUNT(*) AS `rows`
FROM
`bigquery-public-data.samples.shakespeare`;
For a list of reserved keywords and what constitutes valid identifiers, see Lexical structure.
Project-qualified table names
In legacy SQL, to query a table with a project-qualified name, you use a
colon, :
, as a separator. For example:
#legacySQL
SELECT
word
FROM
[bigquery-public-data:samples.shakespeare]
LIMIT 1;
In GoogleSQL, you use a period, .
, instead. For example:
#standardSQL
SELECT
word
FROM
`bigquery-public-data.samples.shakespeare`
LIMIT 1;
If your project name includes a domain, such as example.com:myproject
, you use
example.com:myproject
as the project name, including the :
.
Table decorators
GoogleSQL does not support table decorators.
You can achieve the semantics of time decorators (formerly known as snapshot
decorators) by using the
FOR SYSTEM_TIME AS OF
clause, which references the historical version of a table at a specified
timestamp. For more information, see
Accessing historical data using time travel.
There is no exact equivalent to range decorators in GoogleSQL. You can
achieve similar semantics by creating a time-partitioned table and using a
partition filter when querying data. For more information, see
Querying partitioned tables.
Another option is to create date-sharded tables and filter on the
_TABLE_SUFFIX
pseudocolumn. For more information, see
Wildcard tables.
Wildcard functions
GoogleSQL does not support the TABLE_DATE_RANGE
, TABLE_DATE_RANGE_STRICT
,
or TABLE_QUERY
functions.
You can achieve the same semantics of TABLE_DATE_RANGE
and TABLE_QUERY
using
a filter on the _TABLE_SUFFIX
pseudocolumn. For example, consider the
following legacy SQL query, which counts the number of rows across 2010 and 2011
in the National Oceanic and Atmospheric Administration GSOD (global summary of
the day) tables:
#legacySQL
SELECT COUNT(*)
FROM TABLE_QUERY([bigquery-public-data:noaa_gsod],
'table_id IN ("gsod2010", "gsod2011")');
An equivalent query using GoogleSQL is:
#standardSQL
SELECT COUNT(*)
FROM `bigquery-public-data.noaa_gsod.*`
WHERE _TABLE_SUFFIX IN ("gsod2010", "gsod2011");
For more information, including examples that use TABLE_DATE_RANGE
, see
Migrating legacy SQL table wildcard functions.
Comma operator with tables
In legacy SQL, the comma operator ,
has the non-standard meaning of
UNION ALL
when applied to tables. In GoogleSQL, the comma operator has the
standard meaning of JOIN
. For example, consider the following legacy SQL
query:
#legacySQL
SELECT
x,
y
FROM
(SELECT 1 AS x, "foo" AS y),
(SELECT 2 AS x, "bar" AS y);
This is equivalent to the GoogleSQL query:
#standardSQL
SELECT
x,
y
FROM
(SELECT 1 AS x, "foo" AS y UNION ALL
SELECT 2 AS x, "bar" AS y);
Note also that in GoogleSQL, UNION ALL
associates columns by position
rather than by name. This query is equivalent to:
#standardSQL
SELECT
x,
y
FROM
(SELECT 1 AS x, "foo" AS y UNION ALL
SELECT 2, "bar");
A common usage of the comma operator in GoogleSQL is to JOIN
with an array.
For example:
#standardSQL
WITH T AS (
SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
SELECT 1, [4, 5])
SELECT
x,
y
FROM
T,
UNNEST(arr) AS y;
This returns the cross product of the table T
with the elements of arr
. You
can also express the query in GoogleSQL as:
#standardSQL
WITH T AS (
SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
SELECT 1, [4, 5])
SELECT
x,
y
FROM
T
JOIN
UNNEST(arr) AS y;
In this query, JOIN
has the same meaning as the ,
comma operator separating
T
and UNNEST(arr) AS y
in the example above it.
Logical views
You cannot query a logical view defined with legacy SQL using GoogleSQL and vice versa due to differences in syntax and semantics between the dialects. Instead, you would need to create a new view that uses GoogleSQL--possibly under a different name--to replace a view that uses legacy SQL.
As an example, suppose that view V
is defined using legacy SQL as:
#legacySQL
SELECT *, UTC_USEC_TO_DAY(timestamp_col) AS day
FROM MyTable;
Suppose that view W
is defined using legacy SQL as:
#legacySQL
SELECT user, action, day
FROM V;
Suppose that you execute the following legacy SQL query daily, but you want to migrate it to use GoogleSQL instead:
#legacySQL
SELECT EXACT_COUNT_DISTINCT(user), action, day
FROM W
GROUP BY action, day;
One possible migration path is to create new views using different names. The steps involved are:
Create a view named V2
using GoogleSQL with the following contents:
#standardSQL
SELECT *, EXTRACT(DAY FROM timestamp_col) AS day
FROM MyTable;
Create a view named W2
using GoogleSQL with the following contents:
#standardSQL
SELECT user, action, day
FROM V2;
Change your query that executes daily to use GoogleSQL and refer to W2
instead:
#standardSQL
SELECT COUNT(DISTINCT user), action, day
FROM W2
GROUP BY action, day;
Another option is to delete views V
and W
, then recreate them using standard
SQL under the same names. With this option, you would need to migrate all of
your queries that reference V
or W
to use GoogleSQL at the same time,
however.
Function comparison
The following is a partial list of legacy SQL functions and their GoogleSQL equivalents.
For more information about GoogleSQL functions, see All functions.
COUNT
function comparison
Both legacy SQL and GoogleSQL contain a COUNT function. However, each function behaves differently, depending on the SQL dialect you use.
In legacy SQL, COUNT(DISTINCT x)
returns an approximate count. In standard
SQL, it returns an exact count. For an approximate count of distinct values
that runs faster and requires fewer resources, use
APPROX_COUNT_DISTINCT
.
URL function comparison
Both legacy SQL and GoogleSQL contain functions for parsing URLs. In legacy
SQL, these functions are HOST(url)
, TLD(url)
, and DOMAIN(url)
. In standard
SQL, these functions are
NET.HOST(url)
,
NET.PUBLIC_SUFFIX(url)
,
and NET.REG_DOMAIN(url)
.
Improvements over legacy SQL functions
- GoogleSQL URL functions can parse URLs starting with "//".
- When the input is not compliant with RFC 3986 or is not a URL (for example, "mailto:?to=&subject=&body="), different rules are applied to parse the input. In particular, GoogleSQL URL functions can parse non-standard inputs without "//", such as "www.google.com". For best results, it is recommended that you ensure that inputs are URLs and comply with RFC 3986.
NET.PUBLIC_SUFFIX
returns results without leading dots. For example, it returns "com" instead of ".com". This complies with the format in the public suffix list.NET.PUBLIC_SUFFIX
andNET.REG_DOMAIN
support uppercase letters and internationalized domain names.TLD
andDOMAIN
do not support them (might return unexpected results).
Minor differences on edge cases
- If the input does not contain any suffix in the
public suffix list,
NET.PUBLIC_SUFFIX
andNET.REG_DOMAIN
return NULL, whileTLD
andDOMAIN
return non-NULL values as best effort guesses. - If the input contains only a public suffix without a preceding label (for
example, "http://com"),
NET.PUBLIC_SUFFIX
returns the public suffix, whileTLD
returns an empty string. Similarly,NET.REG_DOMAIN
returns NULL, whileDOMAIN
returns the public suffix. - For inputs with IPv6 hosts,
NET.HOST
does not remove brackets from the result, as specified by RFC 3986. - For inputs with IPv4 hosts,
NET.REG_DOMAIN
returns NULL, whileDOMAIN
returns the first 3 octets.
Examples
In the table below, gray text color indicates results that are the same between legacy and GoogleSQL.
URL (description) | HOST | NET.HOST | TLD | NET.PUBLIC _SUFFIX | DOMAIN | NET.REG_DOMAIN |
---|---|---|---|---|---|---|
"//google.com" (starting with "//") |
NULL | "google.com" | NULL | "com" | NULL | "google.com" |
"google.com" (non-standard; no "//") |
NULL | "google.com" | NULL | "com" | NULL | "google.com" |
"http://user:pass@word@x.com" (non-standard with multiple "@") |
"word@x.com" | "x.com" | ".com" | "com" | "word@x.com" | "x.com" |
"http://foo.com:1:2" (non-standard with multiple ":") |
"foo.com:1" | "foo.com" | ".com:1" | "com" | "foo.com" | "foo.com" |
"http://x.Co.uk" (upper case letters) |
"x.Co.uk" | "x.Co.uk" | ".uk" | "Co.uk" | "Co.uk" | "x.Co.uk" |
"http://a.b" (public suffix not found) |
"a.b" | "a.b" | ".b" | NULL | "a.b" | NULL |
"http://com" (host contains only a public suffix) |
"com" | "com" | "" | "com" | "com" | NULL |
"http://[::1]" (IPv6 host; no public suffix) |
"::1" | "[::1]" | "" | NULL | "::1" | NULL |
"http://1.2.3.4" (IPv4 host; no public suffix) |
"1.2.3.4" | "1.2.3.4" | "" | NULL | "1.2.3" | NULL |
Differences in repeated field handling
A REPEATED
type in legacy SQL is equivalent to an ARRAY
of that type in
GoogleSQL. For example, REPEATED INTEGER
is equivalent to ARRAY<INT64>
in
GoogleSQL. The following section discusses some of the differences in
operations on repeated fields between legacy and GoogleSQL.
NULL
elements and NULL
arrays
GoogleSQL supports NULL
array elements, but raises an error if there is a
NULL
array element in the query result. If there is a NULL
array column in
the query result, GoogleSQL stores it as an empty array.
Selecting nested repeated leaf fields
Using legacy SQL, you can "dot" into a nested repeated field without needing to consider where the repetition occurs. In GoogleSQL, attempting to "dot" into a nested repeated field results in an error. For example:
#standardSQL
SELECT
repository.url,
payload.pages.page_name
FROM
`bigquery-public-data.samples.github_nested`
LIMIT 5;
Attempting to execute this query returns:
Cannot access field page_name on a value with type
ARRAY<STRUCT<action STRING, html_url STRING, page_name STRING, ...>>
To correct the error and return an array of page_name
s in the result, use an
ARRAY
subquery instead. For example:
#standardSQL
SELECT
repository.url,
ARRAY(SELECT page_name FROM UNNEST(payload.pages)) AS page_names
FROM
`bigquery-public-data.samples.github_nested`
LIMIT 5;
For more information about arrays and ARRAY
subqueries, see
Working with arrays.
Filtering repeated fields
Using legacy SQL, you can filter repeated fields directly using a WHERE
clause. In GoogleSQL, you can express similar logic with a JOIN
comma
operator followed by a filter. For example, consider the following legacy SQL
query:
#legacySQL
SELECT
payload.pages.title
FROM
[bigquery-public-data:samples.github_nested]
WHERE payload.pages.page_name IN ('db_jobskill', 'Profession');
This query returns all title
s of pages for which the page_name
is either
db_jobskill
or Profession
. You can express a similar query in GoogleSQL
as:
#standardSQL
SELECT
page.title
FROM
`bigquery-public-data.samples.github_nested`,
UNNEST(payload.pages) AS page
WHERE page.page_name IN ('db_jobskill', 'Profession');
One difference between the preceding legacy SQL and GoogleSQL queries is that
if you unset the Flatten Results option and execute the legacy SQL query,
payload.pages.title
is REPEATED
in the query result. To achieve the
same semantics in GoogleSQL and return an array for the title
column, use
an ARRAY
subquery instead:
#standardSQL
SELECT
title
FROM (
SELECT
ARRAY(SELECT title FROM UNNEST(payload.pages)
WHERE page_name IN ('db_jobskill', 'Profession')) AS title
FROM
`bigquery-public-data.samples.github_nested`)
WHERE ARRAY_LENGTH(title) > 0;
This query creates an array of title
s where the page_name
is either
'db_jobskill'
or 'Profession'
, then filters any rows where the array did not
match that condition using ARRAY_LENGTH(title) > 0
.
For more information about arrays, see Working with arrays.
Structure of selected nested leaf fields
Legacy SQL preserves the structure of nested leaf fields in the SELECT
list
when the Flatten Results option is unset, whereas GoogleSQL does not. For
example, consider the following legacy SQL query:
#legacySQL
SELECT
repository.url,
repository.has_downloads
FROM
[bigquery-public-data.samples.github_nested]
LIMIT 5;
This query returns url
and has_downloads
within a record named repository
when Flatten Results is unset. Now consider the following GoogleSQL query:
#standardSQL
SELECT
repository.url,
repository.has_downloads
FROM
`bigquery-public-data.samples.github_nested`
LIMIT 5;
This query returns url
and has_downloads
as top-level columns; they are not
part of a repository
record or struct. To return them as part of a struct, use
the STRUCT
operator:
#standardSQL
SELECT
STRUCT(
repository.url,
repository.has_downloads) AS repository
FROM
`bigquery-public-data.samples.github_nested`
LIMIT 5;
Removing repetition with FLATTEN
GoogleSQL does not have a FLATTEN
function as in legacy SQL, but you can
achieve similar semantics using the JOIN
(comma) operator. For example,
consider the following legacy SQL query:
#legacySQL
SELECT
repository.url,
payload.pages.page_name
FROM
FLATTEN([bigquery-public-data:samples.github_nested], payload.pages.page_name)
LIMIT 5;
You can express a similar query in GoogleSQL as follows:
#standardSQL
SELECT
repository.url,
page.page_name
FROM
`bigquery-public-data.samples.github_nested`,
UNNEST(payload.pages) AS page
LIMIT 5;
Or, equivalently, use JOIN
rather than the comma ,
operator:
#standardSQL
SELECT
repository.url,
page.page_name
FROM
`bigquery-public-data.samples.github_nested`
JOIN
UNNEST(payload.pages) AS page
LIMIT 5;
One important difference is that the legacy SQL query returns a row where
payload.pages.page_name
is NULL
if payload.pages
is empty. The standard
SQL query, however, does not return a row if payload.pages
is empty. To
achieve exactly the same semantics, use a LEFT JOIN
or LEFT OUTER JOIN
. For
example:
#standardSQL
SELECT
repository.url,
page.page_name
FROM
`bigquery-public-data.samples.github_nested`
LEFT JOIN
UNNEST(payload.pages) AS page
LIMIT 5;
For more information about arrays, see
Working with arrays. For more
information about UNNEST
, see the
UNNEST
topic.
Filtering rows with OMIT RECORD IF
The OMIT IF
clause from legacy SQL lets you filter rows based on a
condition that can apply to repeated fields. In GoogleSQL, you can model
an OMIT IF
clause with an EXISTS
clause, IN
clause, or simple filter.
For example, consider the following legacy SQL query:
#legacySQL
SELECT
repository.url,
FROM
[bigquery-public-data:samples.github_nested]
OMIT RECORD IF
EVERY(payload.pages.page_name != 'db_jobskill'
AND payload.pages.page_name != 'Profession');
The analogous GoogleSQL query is:
#standardSQL
SELECT
repository.url
FROM
`bigquery-public-data.samples.github_nested`
WHERE EXISTS (
SELECT 1 FROM UNNEST(payload.pages)
WHERE page_name = 'db_jobskill'
OR page_name = 'Profession');
Here the EXISTS
clause evaluates to true
if there is at least one element of
payload.pages
where the page name is 'db_jobskill'
or 'Profession'
.
Alternatively, suppose that the legacy SQL query uses IN
:
#legacySQL
SELECT
repository.url,
FROM
[bigquery-public-data:samples.github_nested]
OMIT RECORD IF NOT
SOME(payload.pages.page_name IN ('db_jobskill', 'Profession'));
In GoogleSQL, you can express the query using an EXISTS
clause with IN
:
#standardSQL
SELECT
repository.url
FROM
`bigquery-public-data.samples.github_nested`
WHERE EXISTS (
SELECT 1 FROM UNNEST(payload.pages)
WHERE page_name IN ('db_jobskill', 'Profession'));
Consider the following legacy SQL query that filters records with 80 or fewer pages:
#legacySQL
SELECT
repository.url,
FROM
[bigquery-public-data:samples.github_nested]
OMIT RECORD IF
COUNT(payload.pages.page_name) <= 80;
In this case, you can use a filter with ARRAY_LENGTH
in GoogleSQL:
#standardSQL
SELECT
repository.url
FROM
`bigquery-public-data.samples.github_nested`
WHERE
ARRAY_LENGTH(payload.pages) > 80;
Note that the ARRAY_LENGTH
function applies to the repeated payload.pages
field directly rather than the nested field payload.pages.page_name
as in the
legacy SQL query.
For more information about arrays and ARRAY
subqueries, see
Working with arrays.
Semantic differences
The semantics of some operations differ between legacy and GoogleSQL.
Automatic data type coercions
Both legacy and GoogleSQL support coercions (automatic conversions) between
certain data types. For example, BigQuery coerces a value of type INT64
to
FLOAT64
if the query passes it to a function that requires FLOAT64
as input.
GoogleSQL does not support the following coercions that legacy SQL supports.
Instead, you must use an explicit CAST
.
INT64
literal toTIMESTAMP
. Instead, useTIMESTAMP_MICROS(micros_value)
.STRING
literal toINT64
,FLOAT64
, orBOOL
. Instead, useCAST(str AS INT64)
,CAST(str AS FLOAT64)
, orCAST(str AS BOOL)
.STRING
toBYTES
. Instead, useCAST(str AS BYTES)
.
Runtime errors
Some functions in legacy SQL return NULL
for invalid input, potentially
masking problems in queries or in data. GoogleSQL is generally more strict,
and raises an error if an input is invalid.
- For all mathematical functions and operators, legacy SQL does not check for
overflows. GoogleSQL adds overflow checks, and raises an error if a
computation overflows. This includes the
+
,-
,*
operators, theSUM
,AVG
, andSTDDEV
aggregate functions, and others. - GoogleSQL raises an error upon division by zero, whereas legacy SQL returns
NULL
. To returnNULL
for division by zero in GoogleSQL, useSAFE_DIVIDE
. - GoogleSQL raises an error for
CAST
s where the input format is invalid or out of range for the target type, whereas legacy SQL returnsNULL
. To avoid raising an error for an invalid cast in GoogleSQL, useSAFE_CAST
.
Nested repeated results
Queries executed using GoogleSQL preserve any nesting and repetition of the
columns in the result, and the Flatten Results option has no effect. To
return top-level columns for nested fields, use the .*
operator on struct
columns. For example:
#standardSQL
SELECT
repository.*
FROM
`bigquery-public-data.samples.github_nested`
LIMIT 5;
To return top-level columns for repeated nested fields (ARRAY
s of STRUCT
s),
use a JOIN
to take the cross product of the table's rows and the elements of
the repeated nested field. For example:
#standardSQL
SELECT
repository.url,
page.*
FROM
`bigquery-public-data.samples.github_nested`
JOIN
UNNEST(payload.pages) AS page
LIMIT 5;
For more information about arrays and ARRAY
subqueries, see
Working with arrays.
NOT IN conditions and NULL
Legacy SQL does not comply with the SQL standard in its handling of NULL
with
NOT IN
conditions, whereas GoogleSQL does. Consider the following legacy
SQL query, which finds the number of words that don't appear in the GitHub
sample table as locations:
#legacySQL
SELECT COUNT(*)
FROM [bigquery-public-data.samples.shakespeare]
WHERE word NOT IN (
SELECT actor_attributes.location
FROM [bigquery-public-data.samples.github_nested]
);
This query returns 163,716 as the count, indicating that there are 163,716 words that don't appear as locations in the GitHub table. Now consider the following GoogleSQL query:
#standardSQL
SELECT COUNT(*)
FROM `bigquery-public-data.samples.shakespeare`
WHERE word NOT IN (
SELECT actor_attributes.location
FROM `bigquery-public-data.samples.github_nested`
);
This query returns 0 as the count. The difference is due to the semantics of
NOT IN
with GoogleSQL, which returns NULL
if any value on the right hand
side is NULL
. To achieve the same results as with the legacy SQL query, use a
WHERE
clause to exclude the NULL
values:
#standardSQL
SELECT COUNT(*)
FROM `bigquery-public-data.samples.shakespeare`
WHERE word NOT IN (
SELECT actor_attributes.location
FROM `bigquery-public-data.samples.github_nested`
WHERE actor_attributes.location IS NOT NULL
);
This query returns 163,716 as the count. Alternatively, use a NOT EXISTS
condition:
#standardSQL
SELECT COUNT(*)
FROM `bigquery-public-data.samples.shakespeare` AS t
WHERE NOT EXISTS (
SELECT 1
FROM `bigquery-public-data.samples.github_nested`
WHERE t.word = actor_attributes.location
);
This query also returns 163,716 as the count. For further reading, see the
comparison operators
section of the documentation, which explains the semantics of IN
, NOT IN
,
EXISTS
, and other comparison operators.
Differences in user-defined JavaScript functions
The User-defined functions topic documents how to use JavaScript user-defined functions with GoogleSQL. This section explains some of the key differences between user-defined functions in legacy and GoogleSQL.
Functions in the query text
With GoogleSQL, you use CREATE TEMPORARY FUNCTION
as part of the query body
rather than specifying user-defined functions separately. Examples of defining
functions separately include using the UDF Editor in the Google Cloud console
or using the --udf_resource
flag in
the bq command-line tool.
Consider the following GoogleSQL query:
#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
-- n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS """
var sum_of_reciprocals = 0;
for (var i = 0; i < arr.length; ++i) {
sum_of_reciprocals += 1 / arr[i];
}
return arr.length / sum_of_reciprocals;
""";
WITH T AS (
SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;
This query defines a JavaScript function named HarmonicMean
and then applies
it to the array column arr
from T
.
For more information about user-defined functions, see the User-defined functions topic.
Functions operate on values rather than rows
In legacy SQL, JavaScript functions operate on rows from a table. In standard SQL, as in the example above, JavaScript functions operate on values. To pass a row value to a JavaScript function using GoogleSQL, define a function that takes a struct of the same row type as the table. For example:
#standardSQL
-- Takes a struct of x, y, and z and returns a struct with a new field foo.
CREATE TEMPORARY FUNCTION AddField(s STRUCT<x FLOAT64, y BOOL, z STRING>)
RETURNS STRUCT<x FLOAT64, y BOOL, z STRING, foo STRING> LANGUAGE js AS """
var new_struct = new Object();
new_struct.x = s.x;
new_struct.y = s.y;
new_struct.z = s.z;
if (s.y) {
new_struct.foo = 'bar';
} else {
new_struct.foo = 'baz';
}
return new_struct;
""";
WITH T AS (
SELECT x, MOD(off, 2) = 0 AS y, CAST(x AS STRING) AS z
FROM UNNEST([5.0, 4.0, 3.0, 2.0, 1.0]) AS x WITH OFFSET off
)
SELECT AddField(t).*
FROM T AS t;
This query defines a JavaScript function that takes a struct with the same row
type as T
and creates a new struct with an additional field named foo
. The
SELECT
statement passes the row t
as input to the function and uses .*
to
return the fields of the resulting struct in the output.
Migrating legacy SQL table wildcard functions
In legacy SQL, you can use the following table wildcard functions to query multiple tables.
TABLE_DATE_RANGE()
andTABLE_DATE_RANGE_STRICT()
TABLE_QUERY()
The TABLE_DATE_RANGE() functions
The legacy SQL TABLE_DATE_RANGE()
functions work on tables that conform to a
specific naming scheme: <prefix>YYYYMMDD
, where the <prefix>
represents the
first part of a table name and YYYYMMDD
represents the date associated with
that table's data.
For example, the following legacy SQL query finds the average temperature from a set of daily tables that contain Seattle area weather data:
#legacySQL
SELECT
ROUND(AVG(TemperatureF),1) AS AVG_TEMP_F
FROM
TABLE_DATE_RANGE([mydataset.sea_weather_],
TIMESTAMP("2016-05-01"),
TIMESTAMP("2016-05-09"))
In GoogleSQL, an equivalent query uses a table wildcard and the BETWEEN
clause.
#standardSQL
SELECT
ROUND(AVG(TemperatureF),1) AS AVG_TEMP_F
FROM
`mydataset.sea_weather_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160501' AND '20160509'
The TABLE_QUERY() function
The legacy SQL TABLE_QUERY()
function enables you to find table names based
on patterns. When migrating a TABLE_QUERY()
function to GoogleSQL, which
does not support the TABLE_QUERY()
function, you can instead filter using
the _TABLE_SUFFIX
pseudocolumn. Keep the following differences in mind when
migrating:
In legacy SQL, you place the
TABLE_QUERY()
function in theFROM
clause, whereas in GoogleSQL, you filter using the_TABLE_SUFFIX
pseudocolumn in theWHERE
clause.In legacy SQL, the
TABLE_QUERY()
function operates on the entire table name (ortable_id
), whereas in GoogleSQL, the_TABLE_SUFFIX
pseudocolumn contains part or all of the table name, depending on how you use the wildcard character.
Filter in the WHERE clause
When migrating from legacy SQL to GoogleSQL, move the filter to the WHERE
clause. For example, the following query finds the maximum temperatures across
all years that end in the number 0
:
#legacySQL
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
TABLE_QUERY([bigquery-public-data:noaa_gsod],
'REGEXP_MATCH(table_id, r"0$")')
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
ORDER BY
max DESC
In GoogleSQL, an equivalent query uses a table wildcard and places the
regular expression function, REGEXP_CONTAINS()
, in the WHERE
clause:
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
AND REGEXP_CONTAINS(_TABLE_SUFFIX, r"0$")
ORDER BY
max DESC
Differences between table_id and _TABLE_SUFFIX
In the legacy SQL TABLE_QUERY(dataset, expr)
function, the second parameter
is an expression that operates over the entire table name, using the value
table_id
. When migrating to GoogleSQL, the filter that you create in the
WHERE
clause operates on the value of _TABLE_SUFFIX
, which can include part
or all of the table name, depending on your use of the wildcard character.
For example, the following legacy SQL query uses the entire table name in a
regular expression to find the maximum temperatures across all years that end
in the number 0
:
#legacySQL
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
TABLE_QUERY([bigquery-public-data:noaa_gsod],
'REGEXP_MATCH(table_id, r"gsod\d{3}0")')
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
ORDER BY
max DESC
In GoogleSQL, an equivalent query can use the entire table name or only a part of the table name. You can use an empty prefix in GoogleSQL so that your filter operates over the entire table name:
# Empty prefix
FROM
`bigquery-public-data.noaa_gsod.*`
However, longer prefixes perform better than empty prefixes, so the following
example uses a longer prefix, which means that the value of _TABLE_SUFFIX
is only part of the table name.
#standardSQL
SELECT
max,
ROUND((max-32)*5/9,1) celsius,
year
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
max != 9999.9 # code for missing data
AND max > 100 # to improve ORDER BY performance
AND REGEXP_CONTAINS(_TABLE_SUFFIX, r"\d{3}0")
ORDER BY
max DESC