Migrating to GoogleSQL

BigQuery supports two SQL dialects: GoogleSQL and legacy SQL. This document explains the differences between the two dialects, including syntax, functions, and semantics, and gives examples of some of the highlights of GoogleSQL.

Comparison of legacy and GoogleSQL

When initially released, BigQuery ran queries using a non-GoogleSQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for GoogleSQL, and renamed BigQuery SQL to legacy SQL. GoogleSQL is the preferred SQL dialect for querying data stored in BigQuery.

Do I have to migrate to GoogleSQL?

We recommend migrating from legacy SQL to GoogleSQL, but it's not required. For example, suppose that you execute many queries that use legacy SQL, but you want to take advantage of a GoogleSQL feature for a new query. You can create new queries using GoogleSQL that run alongside queries using legacy SQL.

Enabling GoogleSQL

You have a choice of whether to use legacy or GoogleSQL when you run a query. For information about switching between SQL dialects, see BigQuery SQL dialects.

Advantages of GoogleSQL

GoogleSQL complies with the SQL 2011 standard, and has extensions that support querying nested and repeated data. It has several advantages over legacy SQL, including:

For examples that demonstrate some of these features, see GoogleSQL highlights.

Type differences

Legacy SQL types have an equivalent in GoogleSQL and vice versa. In some cases, the type has a different name. The following table lists each legacy SQL data type and its GoogleSQL equivalent.

Legacy SQL GoogleSQL Notes
BOOL BOOL
INTEGER INT64
FLOAT FLOAT64
STRING STRING
BYTES BYTES
RECORD STRUCT
REPEATED ARRAY
TIMESTAMP TIMESTAMP See TIMESTAMP differences
DATE DATE Legacy SQL has limited support for DATE
TIME TIME Legacy SQL has limited support for TIME
DATETIME DATETIME Legacy SQL has limited support for DATETIME

For more information about the GoogleSQL type system, see the GoogleSQL data types reference. For more information about data types in BigQuery, see the BigQuery data types reference.

TIMESTAMP differences

GoogleSQL has a stricter range of valid TIMESTAMP values than legacy SQL does. In GoogleSQL, valid TIMESTAMP values are in the range of 0001-01-01 00:00:00.000000 to 9999-12-31 23:59:59.999999. For example, you can select the minimum and maximum TIMESTAMP values using GoogleSQL:

#standardSQL
SELECT
  min_timestamp,
  max_timestamp,
  UNIX_MICROS(min_timestamp) AS min_unix_micros,
  UNIX_MICROS(max_timestamp) AS max_unix_micros
FROM (
  SELECT
    TIMESTAMP '0001-01-01 00:00:00.000000' AS min_timestamp,
    TIMESTAMP '9999-12-31 23:59:59.999999' AS max_timestamp
);

This query returns -62135596800000000 as min_unix_micros and 253402300799999999 as max_unix_micros.

If you select a column that contains timestamp values outside of this range, you receive an error:

#standardSQL
SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps;

This query returns the following error:

Cannot return an invalid timestamp value of -8446744073709551617
microseconds relative to the Unix epoch. The range of valid
timestamp values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

To correct the error, one option is to define and use a user-defined function to filter the invalid timestamps:

#standardSQL
CREATE TEMP FUNCTION TimestampIsValid(t TIMESTAMP) AS (
  t >= TIMESTAMP('0001-01-01 00:00:00') AND
  t <= TIMESTAMP('9999-12-31 23:59:59.999999')
);

SELECT timestamp_column_with_invalid_values
FROM MyTableWithInvalidTimestamps
WHERE TimestampIsValid(timestamp_column_with_invalid_values);

Another option to correct the error is to use the SAFE_CAST function with the timestamp column. For example:

#standardSQL
SELECT SAFE_CAST(timestamp_column_with_invalid_values AS STRING) AS timestamp_string
FROM MyTableWithInvalidTimestamps;

This query returns NULL rather than a timestamp string for invalid timestamp values.

GoogleSQL highlights

This section discusses some of the highlights of GoogleSQL compared to legacy SQL.

Composability using WITH clauses

Some of the GoogleSQL examples on this page make use of a WITH clause, which enables extraction or reuse of named subqueries. For example:

#standardSQL
WITH T AS (
  SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
)
SELECT x / (SELECT SUM(x) FROM T) AS weighted_x
FROM T;

This query defines a named subquery T that contains x values of 1, 2, 3, and 4. It selects x values from T and divides them by the sum of all x values in T. This query is equivalent to a query where the contents of T are inline:

#standardSQL
SELECT
  x / (SELECT SUM(x)
       FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x)) AS weighted_x
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);

As another example, consider this query, which uses multiple named subqueries:

#standardSQL
WITH T AS (
  SELECT x FROM UNNEST([1, 2, 3, 4]) AS x
),
TPlusOne AS (
  SELECT x + 1 AS y
  FROM T
),
TPlusOneTimesTwo AS (
  SELECT y * 2 AS z
  FROM TPlusOne
)
SELECT z
FROM TPlusOneTimesTwo;

This query defines a sequence of transformations of the original data, followed by a SELECT statement over TPlusOneTimesTwo. This query is equivalent to the following query, which inlines the computations:

#standardSQL
SELECT (x + 1) * 2 AS z
FROM (SELECT x FROM UNNEST([1, 2, 3, 4]) AS x);

For more information, see WITH clause.

Composability using SQL functions

GoogleSQL supports user-defined SQL functions. You can use user-defined SQL functions to define common expressions and then reference them from the query. For example:

#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
--   n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>) AS
(
  ARRAY_LENGTH(arr) / (SELECT SUM(1 / x) FROM UNNEST(arr) AS x)
);

WITH T AS (
  SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
  FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;

This query defines a SQL function named HarmonicMean and then applies it to the array column arr from T.

Subqueries in more places

GoogleSQL supports subqueries in the SELECT list, WHERE clause, and anywhere else in the query that expects an expression. For example, consider the following GoogleSQL query that computes the fraction of warm days in Seattle in 2015:

#standardSQL
WITH SeattleWeather AS (
  SELECT *
  FROM `bigquery-public-data.noaa_gsod.gsod2015`
  WHERE stn = '994014'
)
SELECT
  COUNTIF(max >= 70) /
    (SELECT COUNT(*) FROM SeattleWeather) AS warm_days_fraction
FROM SeattleWeather;

The Seattle weather station has an ID of '994014'. The query computes the number of warm days based on those where the temperature reached 70 degrees Fahrenheit, or approximately 21 degrees Celsius, divided by the total number of recorded days for that station in 2015.

Correlated subqueries

In GoogleSQL, subqueries can reference correlated columns; that is, columns that originate from the outer query. For example, consider the following GoogleSQL query:

#standardSQL
WITH WashingtonStations AS (
  SELECT weather.stn AS station_id, ANY_VALUE(station.name) AS name
  FROM `bigquery-public-data.noaa_gsod.stations` AS station
  INNER JOIN `bigquery-public-data.noaa_gsod.gsod2015` AS weather
  ON station.usaf = weather.stn
  WHERE station.state = 'WA' AND station.usaf != '999999'
  GROUP BY station_id
)
SELECT washington_stations.name,
  (SELECT COUNT(*)
   FROM `bigquery-public-data.noaa_gsod.gsod2015` AS weather
   WHERE washington_stations.station_id = weather.stn
   AND max >= 70) AS warm_days
FROM WashingtonStations AS washington_stations
ORDER BY warm_days DESC;

This query computes the names of weather stations in Washington state and the number of days in 2015 that the temperature reached 70 degrees Fahrenheit, or approximately 21 degrees Celsius. Notice that there is a subquery in the SELECT list, and that the subquery references washington_stations.station_id from the outer scope, namely FROM WashingtonStations AS washington_stations.

Arrays and structs

ARRAY and STRUCT are powerful concepts in GoogleSQL. As an example that uses both, consider the following query, which computes the top two articles for each day in the HackerNews dataset:

#standardSQL
WITH TitlesAndScores AS (
  SELECT
    ARRAY_AGG(STRUCT(title, score)) AS titles,
    EXTRACT(DATE FROM time_ts) AS date
  FROM `bigquery-public-data.hacker_news.stories`
  WHERE score IS NOT NULL AND title IS NOT NULL
  GROUP BY date)
SELECT date,
  ARRAY(SELECT AS STRUCT title, score
        FROM UNNEST(titles)
        ORDER BY score DESC
        LIMIT 2)
  AS top_articles
FROM TitlesAndScores
ORDER BY date DESC;

The WITH clause defines TitlesAndScores, which contains two columns. The first is an array of structs, where one field is an article title and the second is a score. The ARRAY_AGG expression returns an array of these structs for each day.

The SELECT statement following the WITH clause uses an ARRAY subquery to sort and return the top two articles within each array in accordance with the score, then returns the results in descending order by date.

For more information about arrays and ARRAY subqueries, see Working with arrays. See also the references for arrays and structs.## GoogleSQL highlights

This section discusses some of the highlights of GoogleSQL compared to legacy SQL.

Syntax differences

Escaping reserved keywords and invalid identifiers

In legacy SQL, you escape reserved keywords and identifiers that contain invalid characters such as a space or hyphen - using square brackets []. In GoogleSQL, you escape such keywords and identifiers using backticks `. For example:

#standardSQL
SELECT
  word,
  SUM(word_count) AS word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE word IN ('me', 'I', 'you')
GROUP BY word;

Legacy SQL allows reserved keywords in some places that GoogleSQL does not. For example, the following query fails due to a Syntax error using standard SQL:

#standardSQL
SELECT
  COUNT(*) AS rows
FROM
  `bigquery-public-data.samples.shakespeare`;

To fix the error, escape the alias rows using backticks:

#standardSQL
SELECT
  COUNT(*) AS `rows`
FROM
  `bigquery-public-data.samples.shakespeare`;

For a list of reserved keywords and what constitutes valid identifiers, see Lexical structure.

Project-qualified table names

In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. For example:

#legacySQL
SELECT
  word
FROM
  [bigquery-public-data:samples.shakespeare]
LIMIT 1;

In GoogleSQL, you use a period, ., instead. For example:

#standardSQL
SELECT
  word
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT 1;

If your project name includes a domain, such as example.com:myproject, you use example.com:myproject as the project name, including the :.

Table decorators

GoogleSQL does not support table decorators. You can achieve the semantics of time decorators (formerly known as snapshot decorators) by using the FOR SYSTEM_TIME AS OF clause, which references the historical version of a table at a specified timestamp. For more information, see Accessing historical data using time travel.

There is no exact equivalent to range decorators in GoogleSQL. You can achieve similar semantics by creating a time-partitioned table and using a partition filter when querying data. For more information, see Querying partitioned tables. Another option is to create date-sharded tables and filter on the _TABLE_SUFFIX pseudocolumn. For more information, see Wildcard tables.

Wildcard functions

GoogleSQL does not support the TABLE_DATE_RANGE, TABLE_DATE_RANGE_STRICT, or TABLE_QUERY functions.

You can achieve the same semantics of TABLE_DATE_RANGE and TABLE_QUERY using a filter on the _TABLE_SUFFIX pseudocolumn. For example, consider the following legacy SQL query, which counts the number of rows across 2010 and 2011 in the National Oceanic and Atmospheric Administration GSOD (global summary of the day) tables:

#legacySQL
SELECT COUNT(*)
FROM TABLE_QUERY([bigquery-public-data:noaa_gsod],
                 'table_id IN ("gsod2010", "gsod2011")');

An equivalent query using GoogleSQL is:

#standardSQL
SELECT COUNT(*)
FROM `bigquery-public-data.noaa_gsod.*`
WHERE _TABLE_SUFFIX IN ("gsod2010", "gsod2011");

For more information, including examples that use TABLE_DATE_RANGE, see Migrating legacy SQL table wildcard functions.

Comma operator with tables

In legacy SQL, the comma operator , has the non-standard meaning of UNION ALL when applied to tables. In GoogleSQL, the comma operator has the standard meaning of JOIN. For example, consider the following legacy SQL query:

#legacySQL
SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y),
  (SELECT 2 AS x, "bar" AS y);

This is equivalent to the GoogleSQL query:

#standardSQL
SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y UNION ALL
   SELECT 2 AS x, "bar" AS y);

Note also that in GoogleSQL, UNION ALL associates columns by position rather than by name. This query is equivalent to:

#standardSQL
SELECT
  x,
  y
FROM
  (SELECT 1 AS x, "foo" AS y UNION ALL
   SELECT 2, "bar");

A common usage of the comma operator in GoogleSQL is to JOIN with an array. For example:

#standardSQL
WITH T AS (
  SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
  SELECT 1, [4, 5])
SELECT
  x,
  y
FROM
  T,
  UNNEST(arr) AS y;

This returns the cross product of the table T with the elements of arr. You can also express the query in GoogleSQL as:

#standardSQL
WITH T AS (
  SELECT 0 AS x, [1, 2, 3] AS arr UNION ALL
  SELECT 1, [4, 5])
SELECT
  x,
  y
FROM
  T
JOIN
  UNNEST(arr) AS y;

In this query, JOIN has the same meaning as the , comma operator separating T and UNNEST(arr) AS y in the example above it.

Logical views

You cannot query a logical view defined with legacy SQL using GoogleSQL and vice versa due to differences in syntax and semantics between the dialects. Instead, you would need to create a new view that uses GoogleSQL--possibly under a different name--to replace a view that uses legacy SQL.

As an example, suppose that view V is defined using legacy SQL as:

#legacySQL
SELECT *, UTC_USEC_TO_DAY(timestamp_col) AS day
FROM MyTable;

Suppose that view W is defined using legacy SQL as:

#legacySQL
SELECT user, action, day
FROM V;

Suppose that you execute the following legacy SQL query daily, but you want to migrate it to use GoogleSQL instead:

#legacySQL
SELECT EXACT_COUNT_DISTINCT(user), action, day
FROM W
GROUP BY action, day;

One possible migration path is to create new views using different names. The steps involved are:

Create a view named V2 using GoogleSQL with the following contents:

#standardSQL
SELECT *, EXTRACT(DAY FROM timestamp_col) AS day
FROM MyTable;

Create a view named W2 using GoogleSQL with the following contents:

#standardSQL
SELECT user, action, day
FROM V2;

Change your query that executes daily to use GoogleSQL and refer to W2 instead:

#standardSQL
SELECT COUNT(DISTINCT user), action, day
FROM W2
GROUP BY action, day;

Another option is to delete views V and W, then recreate them using standard SQL under the same names. With this option, you would need to migrate all of your queries that reference V or W to use GoogleSQL at the same time, however.

Function comparison

The following is a partial list of legacy SQL functions and their GoogleSQL equivalents.

Legacy SQL GoogleSQL Notes
INTEGER(x) SAFE_CAST(x AS INT64)
CAST(x AS INTEGER) SAFE_CAST(x AS INT64)
DATEDIFF(t1, t2) TIMESTAMP_DIFF(t1, t2, DAY)
NOW CURRENT_TIMESTAMP
STRFTIME_UTC_USEC(t, fmt) FORMAT_TIMESTAMP(fmt, t)
UTC_USEC_TO_DAY(t) TIMESTAMP_TRUNC(t, DAY)
UTC_USEC_TO_WEEK(t, day) TIMESTAMP_TRUNC(TIMESTAMP_MICROS(t), WEEK)
REGEXP_MATCH(s, pattern) REGEXP_CONTAINS(s, pattern)
IS_NULL(x) x IS NULL
LEFT(s, len) SUBSTR(s, 0, len)
RIGHT(s, len) SUBSTR(s, -len)
s CONTAINS "foo" STRPOS(s, "foo") > 0 or s LIKE '%foo%'
INSTR('str1', 'str2') STRPOS(s, "foo") > 0
x % y MOD(x, y)
NEST(x) ARRAY_AGG(x)
ANY(x) ANY_VALUE(x)
GROUP_CONCAT_UNQUOTED(s, sep) STRING_AGG(s, sep)
SOME(x) IFNULL(LOGICAL_OR(x), false)
EVERY(x) IFNULL(LOGICAL_AND(x), true)
COUNT(DISTINCT x) APPROX_COUNT_DISTINCT(x) see notes below
EXACT_COUNT_DISTINCT(x) COUNT(DISTINCT x) see notes below
QUANTILES(x, buckets + 1) APPROX_QUANTILES(x, buckets)
TOP(x, num), COUNT(*)