Migrating to GoogleSQL
BigQuery supports two SQL dialects: GoogleSQL and legacy SQL. This topic explains the differences between the two dialects, including syntax, functions, and semantics, and gives examples of some of the highlights of GoogleSQL.
Comparison of legacy and GoogleSQL
When initially released, BigQuery ran queries using a non-GoogleSQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for GoogleSQL, and renamed BigQuery SQL to legacy SQL. GoogleSQL is the preferred SQL dialect for querying data stored in BigQuery.
Do I have to migrate to GoogleSQL?
We recommend migrating from legacy SQL to GoogleSQL, but it's not required. For example, suppose that you execute many queries that use legacy SQL, but you want to take advantage of a GoogleSQL feature for a new query. You can create new queries using GoogleSQL that run alongside queries using legacy SQL.
Enabling GoogleSQL
You have a choice of whether to use legacy or GoogleSQL when you run a query. For information about switching between SQL dialects, see BigQuery SQL dialects.
Advantages of GoogleSQL
GoogleSQL complies with the SQL 2011 standard, and has extensions that support querying nested and repeated data. It has several advantages over legacy SQL, including:
- Composability using
WITH
clauses and SQL functions - Subqueries in the
SELECT
list andWHERE
clause - Correlated subqueries
ARRAY
andSTRUCT
data types- Inserts, updates, and deletes
COUNT(DISTINCT <expr>)
is exact and scalable, providing the accuracy ofEXACT_COUNT_DISTINCT
without its limitations- Automatic predicate push-down through
JOIN
s - Complex
JOIN
predicates, including arbitrary expressions
For examples that demonstrate some of these features, see GoogleSQL highlights.
Type differences
Legacy SQL types have an equivalent in GoogleSQL and vice versa. In some cases, the type has a different name. The following table lists each legacy SQL data type and its GoogleSQL equivalent.
Legacy SQL | GoogleSQL | Notes |
---|---|---|
BOOL |
BOOL |
|
INTEGER |
INT64 |
|
FLOAT |
FLOAT64 |
|
STRING |
STRING |
|
BYTES |
BYTES |
|
RECORD |
STRUCT |
|
REPEATED |
ARRAY |
|
TIMESTAMP |
TIMESTAMP |
See TIMESTAMP differences |
DATE |
DATE |
Legacy SQL has limited support for DATE |
TIME |
TIME |
Legacy SQL has limited support for TIME |
DATETIME |
DATETIME |
Legacy SQL has limited support for DATETIME |
For more information on the GoogleSQL type system, see the GoogleSQL data types reference. For more information on data types in BigQuery, see the BigQuery data types reference.
TIMESTAMP
differences
GoogleSQL has a
stricter range of valid TIMESTAMP
values
than legacy SQL does. In GoogleSQL, valid TIMESTAMP
values are in the
range of 0001-01-01 00:00:00.000000
to 9999-12-31 23:59:59.999999
. For
example, you can select the minimum and maximum TIMESTAMP
values using
GoogleSQL:
#standardSQL
SELECT
min_timestamp,
max_timestamp,
UNIX_MICROS(min_timestamp) AS min_unix_micros,
UNIX_MICROS(max_timestamp) AS max_unix_micros
FROM (
SELECT
TIMESTAMP '0001-01-01 00:00:00.000000' AS min_timestamp,
TIMESTAMP '9999-12-31 23:59:59.999999' AS max_timestamp
);
This query returns -62135596800000000
as min_unix_micros
and
253402300799999999
as max_unix_micros
.
If you select a column that contains timestamp values outside of this range, you will receive an error:
#standardSQL
SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps;
This query returns the following error:
Cannot return an invalid timestamp value of -8446744073709551617
microseconds relative to the Unix epoch. The range of valid
timestamp values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]
To correct the error, one option is to define and use a user-defined function to filter the invalid timestamps:
#standardSQL
CREATE TEMP FUNCTION TimestampIsValid(t TIMESTAMP) AS (
t >= TIMESTAMP('0001-01-01 00:00:00') AND
t <= TIMESTAMP('9999-12-31 23:59:59.999999')
);
SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps
WHERE TimestampIsValid(timestamp_column_with_invalid_values);
Another option to correct the error is to use the
SAFE_CAST
function with the timestamp column. For example:
#standardSQL
SELECT SAFE_CAST(timestamp_column_with_invalid_values AS STRING) AS timestamp_string
FROM MyTableWithInvalidTimestamps;
This query returns NULL
rather than a timestamp string for invalid
timestamp values.
GoogleSQL highlights
This section discusses some of the highlights of GoogleSQL compared to legacy SQL.
Composability using WITH
clauses
Some of the GoogleSQL examples on this page make use of a
WITH
clause,
which enables extraction or reuse of named subqueries. For example:
#standardSQL
WITH T AS (
SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
)
SELECT x / (SELECT SUM(x) FROM T) AS weighted_x
FROM T;
This query defines a named subquery T
that contains x
values of 1, 2, 3,
and 4. It selects x
values from T
and divides them by the sum of all x
values in T
. This query is equivalent to a query where the contents of T
are inline:
#standardSQL
SELECT
x / (SELECT SUM(x)
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x)) AS weighted_x
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);
As another example, consider this query, which uses multiple named subqueries:
#standardSQL
WITH T AS (
SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
),
TPlusOne AS (
SELECT x + 1 AS y
FROM T
),
TPlusOneTimesTwo AS (
SELECT y * 2 AS z
FROM TPlusOne
)
SELECT z
FROM TPlusOneTimesTwo;
This query defines a sequence of transformations of the original data, followed
by a SELECT
statement over TPlusOneTimesTwo
. This query is equivalent to the
following query, which inlines the computations:
#standardSQL
SELECT (x + 1) * 2 AS z
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);
For more information, see
WITH
clause.
Composability using SQL functions
GoogleSQL supports user-defined SQL functions. You can use user-defined SQL functions to define common expressions and then reference them from the query. For example:
#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
-- n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>) AS
(
ARRAY_LENGTH(arr) / (SELECT SUM(1 / x) FROM UNNEST(arr) AS x)
);
WITH T AS (
SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;
This query defines a SQL function named HarmonicMean
and then applies it to
the array column arr
from T
.
Subqueries in more places
GoogleSQL supports subqueries in the SELECT
list, WHERE
clause, and
anywhere else in the query that expects an expression. For example, consider the
following GoogleSQL query that computes the fraction of warm days in Seattle
in 2015:
#standardSQL
WITH SeattleWeather AS (
SELECT *
FROM `bigquery-public-data.noaa_gsod.gsod2015`
WHERE stn = '994014'
)
SELECT
COUNTIF(max >= 70) /
(SELECT COUNT(*) FROM SeattleWeather) AS warm_days_fraction
FROM SeattleWeather;
The Seattle weather station has an ID of '994014'
. The query computes the
number of warm days based on those where the temperature reached 70 degrees
Fahrenheit, or approximately 21 degrees Celsius, divided by the total number of
recorded days for that station in 2015.
Correlated subqueries
In GoogleSQL, subqueries can reference correlated columns; that is, columns that originate from the outer query. For example, consider the following GoogleSQL query:
#standardSQL
WITH WashingtonStations AS (
SELECT weather.stn AS station_id, ANY_VALUE(station.name) AS name
FROM `bigquery-public-data.noaa_gsod.stations` AS station
INNER JOIN `bigquery-public-data.noaa_gsod.gsod2015` AS weather
ON station.usaf = weather.stn
WHERE station.state = 'WA' AND station.usaf != '999999'
GROUP BY station_id
)
SELECT washington_stations.name,
(SELECT COUNT(*)
FROM `bigquery-public-data.noaa_gsod.gsod2015` AS weather
WHERE washington_stations.station_id = weather.stn
AND max >= 70) AS warm_days
FROM WashingtonStations AS washington_stations
ORDER BY warm_days DESC;
This query computes the names of weather stations in Washington state and the
number of days in 2015 that the temperature reached 70 degrees Fahrenheit, or
approximately 21 degrees Celsius. Notice that there is a subquery in the
SELECT
list, and that the subquery references washington_stations.station_id
from the outer scope, namely FROM WashingtonStations AS washington_stations
.
Arrays and structs
ARRAY
and STRUCT
are powerful concepts in GoogleSQL. As an example that
uses both, consider the following query, which computes the top two articles
for each day in the HackerNews dataset:
#standardSQL
WITH TitlesAndScores AS (
SELECT
ARRAY_AGG(STRUCT(title, score)) AS titles,
EXTRACT(DATE FROM time_ts) AS date
FROM `bigquery-public-data.hacker_news.stories`
WHERE score IS NOT NULL AND title IS NOT NULL
GROUP BY date)
SELECT date,
ARRAY(SELECT AS STRUCT title, score
FROM UNNEST(titles)
ORDER BY score DESC
LIMIT 2)
AS top_articles
FROM TitlesAndScores
ORDER BY date DESC;
The WITH
clause defines TitlesAndScores
, which contains two columns. The
first is an array of structs, where one field is an article title and the second
is a score. The ARRAY_AGG
expression returns an array of these structs for
each day.
The