Dataflow SQL query syntax

Dataflow SQL supports a variant of the ZetaSQL query syntax and includes additional streaming extensions for running Dataflow streaming jobs. Not all ZetaSQL language features are supported; however, additional constructs are under development.

Query statements scan one or more tables, streams, or expressions and return the computed result rows. This topic describes the syntax for SQL queries in Cloud Dataflow SQL.

SQL Syntax

query_statement:
    query_expr

query_expr:
    [ WITH with_query_name AS ( query_expr ) [, ...] ]
    { select | ( query_expr ) | query_expr set_op query_expr }
    [ [ ORDER BY expression [{ ASC | DESC }] [, ...] ] LIMIT count [ OFFSET skip_rows ] ]

select:
    SELECT  [ ALL | DISTINCT ] { * | expression [ [ AS ] alias ] } [, ...]
    [ FROM from_item ]
    [ WHERE bool_expression ]
    [ GROUP BY expression [, ...] ]
    [ HAVING bool_expression ]

set_op:
    UNION { ALL | DISTINCT } | INTERSECT { ALL | DISTINCT } | EXCEPT { ALL | DISTINCT }

from_item: {
    table_name [ [ AS ] alias ] |
    join |
    ( query_expr ) [ [ AS ] alias ] |
    with_query_name [ [ AS ] alias ]
}
table_name:
    identifier [ . identifier ...]

join:
    from_item [ join_type ] JOIN from_item
    ON bool_expression

join_type:
    { INNER | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }

Notation:

  • Square brackets "[ ]" indicate optional clauses.
  • Parentheses "( )" indicate literal parentheses.
  • The vertical bar "|" indicates a logical OR.
  • Curly braces "{ }" enclose a set of options.
  • A comma followed by an ellipsis within square brackets "[, ... ]" indicates that the preceding item can repeat in a comma-separated list.

SELECT list

Syntax:

SELECT  [ ALL ]
    { * | expression [ [ AS ] alias ] } [, ...]

The SELECT list defines the columns that the query will return. Expressions in the SELECT list can refer to columns in any of the from_items in its corresponding FROM clause.

Each item in the SELECT list is one of:

  • *
  • expression

SELECT *

SELECT *, often referred to as select star, produces one output column for each column that is visible after executing the full query.

SELECT * FROM (SELECT "apple" AS fruit, "carrot" AS vegetable);

+-------+-----------+
| fruit | vegetable |
+-------+-----------+
| apple | carrot    |
+-------+-----------+

SELECT expression

Items in a SELECT list can be expressions. These expressions evaluate to a single value and produce one output column, with an optional explicit alias.

If the expression does not have an explicit alias, it receives an implicit alias according to the rules for implicit aliases, if possible. Otherwise, the column is anonymous and you cannot refer to it by name elsewhere in the query.

SELECT modifiers

You can modify the results returned from a SELECT query, as follows.

SELECT ALL

A SELECT ALL statement returns all rows, including duplicate rows. SELECT ALL is the default behavior of SELECT.

Aliases

See Aliases for information on syntax and visibility for SELECT list aliases.

FROM clause

The FROM clause indicates the tables or streams from which to retrieve rows, and specifies how to join those rows together to produce a single stream of rows for processing in the rest of the query.

Syntax

from_item: {
    table_name [ [ AS ] alias ] |
    join |
    ( query_expr ) [ [ AS ] alias ] |
    with_query_name [ [ AS ] alias ]
}

table_name

The fully-qualified SQL name of a data source queryable by Cloud Dataflow SQL, specified by a dot-separated list of identifiers using Standard SQL lexical structure. You must use backticks to enclose identifiers that contain characters which are not letters, numbers, or underscores.

SELECT * FROM bigquery.table.`my-project`.baseball.roster;
SELECT * FROM pubsub.topic.`my-project`.incoming_events;

join

See JOIN Types below.

select

( select ) [ [ AS ] alias ] is a table subquery.

with_query_name

The query names in a WITH clause (see WITH Clause) act like names of temporary tables that you can reference anywhere in the FROM clause. In the example below, subQ1 and subQ2 are with_query_names.

Example:

WITH
  subQ1 AS (SELECT * FROM Roster WHERE SchoolID = 52),
  subQ2 AS (SELECT SchoolID FROM subQ1)
SELECT DISTINCT * FROM subQ2;

The WITH clause hides any permanent tables with the same name for the duration of the query, unless you qualify the table name, e.g. db.Roster.

Subqueries

A subquery is a query that appears inside another statement, and is written inside parentheses. These are also referred to as "sub-SELECTs" or "nested SELECTs". The full SELECT syntax is valid in subqueries.

There are two types of subquery:

  • Expression subqueries, which you can use in a query wherever expressions are valid. Expression subqueries return a single value.
  • Table subqueries, which you can use only in a FROM clause. The outer query treats the result of the subquery as a table.

Note that there must be parentheses around both types of subqueries.

Example:

SELECT AVG ( PointsScored )
FROM
( SELECT PointsScored
  FROM Stats
  WHERE SchoolID = 77 )

Optionally, a table subquery can have an alias.

Example:

SELECT r.LastName
FROM
( SELECT * FROM Roster) AS r;

Aliases

See Aliases for information on syntax and visibility for FROM clause aliases.

JOIN types

Syntax

join:
    from_item [ join_type ] JOIN from_item
    ON bool_expression

join_type:
    { INNER | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }

The JOIN clause merges two from_items so that the SELECT clause can query them as one source. The join_type and ON clause (a "join condition") specify how to combine and discard rows from the two from_items to form a single source.

All JOIN clauses require a join_type.

[INNER] JOIN

An INNER JOIN, or simply JOIN, effectively calculates the Cartesian product of the two from_items and discards all rows that do not meet the join condition. "Effectively" means that it is possible to implement an INNER JOIN without actually calculating the Cartesian product.

FULL [OUTER] JOIN

A FULL OUTER JOIN (or simply FULL JOIN) returns all fields for all rows in both from_items that meet the join condition.

FULL indicates that all rows from both from_items are returned, even if they do not meet the join condition.

OUTER indicates that if a given row from one from_item does not join to any row in the other from_item, the row will return with NULLs for all columns from the other from_item.

LEFT [OUTER] JOIN

The result of a LEFT OUTER JOIN (or simply LEFT JOIN) for two from_items always retains all rows of the left from_item in the JOIN clause, even if no rows in the right from_item satisfy the join predicate.

LEFT indicates that all rows from the left from_item are returned; if a given row from the left from_item does not join to any row in the right from_item, the row will return with NULLs for all columns from the right from_item. Rows from the right from_item that do not join to any row in the left from_item are discarded.

RIGHT [OUTER] JOIN

The result of a RIGHT OUTER JOIN (or simply RIGHT JOIN) is similar and symmetric to that of LEFT OUTER JOIN.

ON clause

The ON clause contains a bool_expression. A combined row (the result of joining two rows) meets the join condition if bool_expression returns TRUE.

Example:

SELECT * FROM Roster INNER JOIN PlayerStats
ON Roster.LastName = PlayerStats.LastName;

Sequences of JOINs

The FROM clause can contain multiple JOIN clauses in sequence.

Example:

SELECT * FROM a LEFT JOIN b ON TRUE LEFT JOIN c ON TRUE;

where a, b, and c are any from_items. JOINs are bound from left to right, but you can insert parentheses to group them in a different order.

WHERE clause

Syntax

WHERE bool_expression

The WHERE clause filters out rows by evaluating each row against bool_expression, and discards all rows that do not return TRUE (that is, rows that return FALSE or NULL).

Example:

SELECT * FROM Roster
WHERE SchoolID = 52;

The bool_expression can contain multiple sub-conditions.

Example:

SELECT * FROM Roster
WHERE STARTS_WITH(LastName, "Mc") OR STARTS_WITH(LastName, "Mac");

You cannot reference column aliases from the SELECT list in the WHERE clause.

GROUP BY clause

Syntax

GROUP BY expression [, ...]

The GROUP BY clause groups together rows in a table with non-distinct values for the expression in the GROUP BY clause. For multiple rows in the source table with non-distinct values for expression, the GROUP BY clause produces a single combined row. GROUP BY is commonly used when aggregate functions are present in the SELECT list, or to eliminate redundancy in the output. The data type of expression must be groupable.

Example:

SELECT SUM(PointsScored), LastName
FROM PlayerStats
GROUP BY LastName;

The GROUP BY clause can refer to expression names in the SELECT list. The GROUP BY clause also allows ordinal references to expressions in the SELECT list using integer values. 1 refers to the first expression in the SELECT list, 2 the second, and so forth. The expression list can combine ordinals and expression names.

Example:

SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY LastName, FirstName;

The query above is equivalent to:

SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY 2, FirstName;

GROUP BY clauses may also refer to aliases. If a query contains aliases in the SELECT clause, those aliases override names in the corresponding FROM clause.

Example:

SELECT SUM(PointsScored), LastName as last_name
FROM PlayerStats
GROUP BY last_name;

HAVING clause

Syntax

HAVING bool_expression

The HAVING clause is similar to the WHERE clause: it filters out rows that do not return TRUE when they are evaluated against the bool_expression.

As with the WHERE clause, the bool_expression can be any expression that returns a boolean, and can contain multiple sub-conditions.

The HAVING clause differs from the WHERE clause in that:

  • The HAVING clause requires GROUP BY or aggregation to be present in the query.
  • The HAVING clause occurs after GROUP BY and aggregation, and before ORDER BY. This means that the HAVING clause is evaluated once for every aggregated row in the result set. This differs from the WHERE clause, which is evaluated before GROUP BY and aggregation.

The HAVING clause can reference columns available via the FROM clause, as well as SELECT list aliases. Expressions referenced in the HAVING clause must either appear in the GROUP BY clause or they must be the result of an aggregate function:

SELECT LastName
FROM Roster
GROUP BY LastName
HAVING SUM(PointsScored) > 15;

If a query contains aliases in the SELECT clause, those aliases override names in a FROM clause.

SELECT LastName, SUM(PointsScored) AS ps
FROM Roster
GROUP BY LastName
HAVING ps > 0;

Mandatory aggregation

Aggregation does not have to be present in the HAVING clause itself, but aggregation must be present in at least one of the following forms:

Aggregation function in the SELECT list.

SELECT LastName, SUM(PointsScored) AS total
FROM PlayerStats
GROUP BY LastName
HAVING total > 15;

Aggregation function in the 'HAVING' clause.

SELECT LastName
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;

Aggregation in both the SELECT list and HAVING clause.

When aggregation functions are present in both the SELECT list and HAVING clause, the aggregation functions and the columns they reference do not need to be the same. In the example below, the two aggregation functions, COUNT() and SUM(), are different and also use different columns.

SELECT LastName, COUNT(*)
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;

LIMIT clause and OFFSET clause

Syntax

[ ORDER BY expression [{ASC | DESC}] [,...] ] LIMIT count [ OFFSET skip_rows ]

The ORDER BY clause specifies a column or expression as the sort criterion for the result set. If an ORDER BY clause is not present, the order of the results of a query is not defined. The default sort direction is ASC, which sorts the results in ascending order of expression values. DESC sorts the results in descending order. Column aliases from a FROM clause or SELECT list are allowed. If a query contains aliases in the SELECT clause, those aliases override names in the corresponding FROM clause.

It is possible to order by multiple columns.

The following rules apply when ordering values:

  • NULLs: In the context of the ORDER BY clause, NULLs are the minimum possible value; that is, NULLs appear first in ASC sorts and last in DESC sorts.

LIMIT specifies a non-negative count of type INT64, and no more than count rows will be returned. LIMIT 0 returns 0 rows. If there is a set operation, LIMIT is applied after the set operation is evaluated.

OFFSET specifies a non-negative skip_rows of type INT64, and only rows from that offset in the table will be considered.

These clauses accept only literal or parameter values.

The rows that are returned by LIMIT and OFFSET is unspecified unless these operators are used after ORDER BY.

WITH clause

The WITH clause binds the results of one or more named subqueries to temporary table names. Each introduced table name is visible in subsequent SELECT expressions within the same query expression. This includes the following kinds of SELECT expressions:

  • Any SELECT expressions in subsequent WITH bindings
  • Top level SELECT expressions in the query expression on both sides of a set operator such as UNION
  • SELECT expressions inside subqueries within the same query expression

Example:

WITH subQ1 AS (SELECT SchoolID FROM Roster),
     subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;

The following are scoping rules for WITH clauses:

  • Aliases are scoped so that the aliases introduced in a WITH clause are visible only in the later subqueries in the same WITH clause, and in the query under the WITH clause.
  • Aliases introduced in the same WITH clause must be unique, but the same alias can be used in multiple WITH clauses in the same query. The local alias overrides any outer aliases anywhere that the local alias is visible.

Cloud Dataflow SQL does not support WITH RECURSIVE.

Aliases

An alias is a temporary name given to a table, column, or expression present in a query. You can introduce explicit aliases in the SELECT list or FROM clause.

Explicit alias syntax

You can introduce explicit aliases in either the FROM clause or the SELECT list.

In a FROM clause, you can introduce explicit aliases for any item, including tables, arrays and subqueries, using [AS] alias. The AS keyword is optional.

Example:

SELECT s.FirstName, s2.SongName
FROM Singers AS s, (SELECT * FROM Songs) AS s2;

You can introduce explicit aliases for any expression in the SELECT list using [AS] alias. The AS keyword is optional.

Example:

SELECT s.FirstName AS name, LOWER(s.FirstName) AS lname
FROM Singers s;

Explicit alias visibility

After you introduce an explicit alias in a query, there are restrictions on where else in the query you can reference that alias. These restrictions on alias visibility are the result of Cloud Dataflow SQL's name scoping rules.

FROM clause aliases

Cloud Dataflow SQL processes aliases in a FROM clause from left to right, and aliases are visible only to subsequent path expressions in a FROM clause.

Example:

Assume the Singers table had a Concerts column of ARRAY type.

SELECT FirstName
FROM Singers AS s, s.Concerts;

Invalid:

SELECT FirstName
FROM s.Concerts, Singers AS s;  // INVALID.

FROM clause aliases are not visible to subqueries in the same FROM clause. Subqueries in a FROM clause cannot contain correlated references to other tables in the same FROM clause.

Invalid:

SELECT FirstName
FROM Singers AS s, (SELECT (2020 - ReleaseDate) FROM s)  // INVALID.

You can use any column name from a table in the FROM as an alias anywhere in the query, with or without qualification with the table name.

Example:

SELECT FirstName, s.ReleaseDate
FROM Singers s WHERE ReleaseDate = 1975;

SELECT list aliases

Aliases in the SELECT list are visible only to the following clauses:

  • GROUP BY clause
  • ORDER BY clause
  • HAVING clause

Example:

SELECT LastName AS last, SingerID
FROM Singers
ORDER BY last;

Explicit aliases in GROUP BY, ORDER BY, and HAVING clauses

These three clauses, GROUP BY, ORDER BY, and HAVING, can refer to only the following values:

  • Tables in the FROM clause and any of their columns.
  • Aliases from the SELECT list.

GROUP BY and ORDER BY can also refer to a third group:

  • Integer literals, which refer to items in the SELECT list. The integer 1 refers to the first item in the SELECT list, 2 refers to the second item, etc.

Example:

SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;

The query above is equivalent to:

SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY sid
ORDER BY s2id DESC LIMIT 10;

Ambiguous aliases

Cloud Dataflow SQL provides an error if a name is ambiguous, meaning it can resolve to more than one unique object.

Examples:

This query contains column names that conflict between tables, since both Singers and Songs have a column named SingerID:

SELECT SingerID
FROM Singers, Songs;

This query contains aliases that are ambiguous in the GROUP BY clause because they are duplicated in the SELECT list:

SELECT FirstName AS name, LastName AS name,
FROM Singers
GROUP BY name;

Ambiguity between a FROM clause column name and a SELECT list alias in GROUP BY:

SELECT UPPER(LastName) AS LastName
FROM Singers
GROUP BY LastName;

The query above is ambiguous and will produce an error because LastName in the GROUP BY clause could refer to the original column LastName in Singers, or it could refer to the alias AS LastName, whose value is UPPER(LastName).

The same rules for ambiguity apply to path expressions. Consider the following query where table has columns x and y, and column z is of type STRUCT and has fields v, w, and x.

Example:

SELECT x, z AS T
FROM table T
GROUP BY T.x;

The alias T is ambiguous and will produce an error because T.x in the GROUP BY clause could refer to either table.x or table.z.x.

A name is not ambiguous in GROUP BY, ORDER BY or HAVING if it is both a column name and a SELECT list alias, as long as the name resolves to the same underlying object.

Example:

SELECT LastName, BirthYear AS BirthYear
FROM Singers
GROUP BY BirthYear;

The alias BirthYear is not ambiguous because it resolves to the same underlying column, Singers.BirthYear.

Appendix A: examples with sample data

Sample tables

The following three tables contain sample data about athletes, their schools, and the points they score during the season. These tables will be used to illustrate the behavior of different query clauses.

Table Roster:

LastName SchoolID
Adams 50
Buchanan 52
Coolidge 52
Davis 51
Eisenhower 77

The Roster table includes a list of player names (LastName) and the unique ID assigned to their school (SchoolID).

Table PlayerStats:

LastName OpponentID PointsScored
Adams 51 3
Buchanan 77 0
Coolidge 77 1
Adams 52 4
Buchanan 50 13

The PlayerStats table includes a list of player names (LastName) and the unique ID assigned to the opponent they played in a given game (OpponentID) and the number of points scored by the athlete in that game (PointsScored).

Table TeamMascot:

SchoolId Mascot
50 Jaguars
51 Knights
52 Lakers
53 Mustangs

The TeamMascot table includes a list of unique school IDs (SchoolID) and the mascot for that school (Mascot).

JOIN types

1) [INNER] JOIN

Example:

SELECT * FROM Roster JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;

Results:

LastName Roster.SchoolId TeamMascot.SchoolId Mascot
Adams 50 50 Jaguars
Buchanan 52 52 Lakers
Coolidge 52 52 Lakers
Davis 51 51 Knights

2) FULL [OUTER] JOIN

Example:

SELECT * FROM Roster FULL JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
LastName Roster.SchoolId TeamMascot.SchoolId Mascot
Adams 50 50 Jaguars
Buchanan 52 52 Lakers
Coolidge 52 52 Lakers
Davis 51 51 Knights
Eisenhower 77 NULL NULL
NULL NULL 53 Mustangs

3) LEFT [OUTER] JOIN

Example:

SELECT * FROM Roster LEFT JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;

Results:

LastName Roster.SchoolId TeamMascot.SchoolId Mascot
Adams 50 50 Jaguars
Buchanan 52 52 Lakers
Coolidge 52 52 Lakers
Davis 51 51 Knights
Eisenhower 77 NULL NULL

4) RIGHT [OUTER] JOIN

Example:

SELECT * FROM Roster RIGHT JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;

Results:

LastName Roster.SchoolId TeamMascot.SchoolId Mascot
Adams 50 50 Jaguars
Davis 51 51 Knights
Coolidge 52 52 Lakers
Buchanan 52 52 Lakers
NULL NULL 53 Mustangs

GROUP BY clause

Example:

SELECT LastName, SUM(PointsScored)
FROM PlayerStats
GROUP BY LastName;
LastName SUM
Adams 7
Buchanan 13
Coolidge 1

Set operators

UNION

The UNION operator combines the result sets of two or more SELECT statements by pairing columns from the result set of each SELECT statement and vertically concatenating them.

Example:

SELECT Mascot AS X, SchoolID AS Y
FROM TeamMascot
UNION ALL
SELECT LastName, PointsScored
FROM PlayerStats;

Results:

X Y
Jaguars 50
Knights 51
Lakers 52
Mustangs 53
Adams 3
Buchanan 0
Coolidge 1
Adams 4
Buchanan 13

INTERSECT

This query returns the last names that are present in both Roster and PlayerStats.

SELECT LastName
FROM Roster
INTERSECT ALL
SELECT LastName
FROM PlayerStats;

Results:

LastName
Adams
Coolidge
Buchanan

EXCEPT

The query below returns last names in Roster that are not present in PlayerStats.

SELECT LastName
FROM Roster
EXCEPT DISTINCT
SELECT LastName
FROM PlayerStats;

Results:

LastName
Eisenhower
Davis

Reversing the order of the SELECT statements will return last names in PlayerStats that are not present in Roster:

SELECT LastName
FROM PlayerStats
EXCEPT DISTINCT
SELECT LastName
FROM Roster;

Results:

(empty)

Appendix B: unnest ARRAY

Because Dataflow SQL doesn't support ARRAY functions (see Beam ZetaSQL support ), you can't use the [OFFSET] or [CARDINAL] syntax to access a STRUCT in an ARRAY.

For example, this won't work:


WITH records AS (
    SELECT [
        STRUCT("A" as name, 1 as age),
        STRUCT("B" as name, 2 as age),
        STRUCT("C" as name, 3 as age)
    ] as record,
    '1' as id
)
SELECT * from records
where record[offset(1)].name = "B";

It will print errors similar to Element access using [offset()] is not supported on values of type ARRAY>.

To make the query work with Dataflow SQL, you can use UNNEST in the FROM clause to join the records with the STRUCTs in the ARRAY.


WITH records AS (
    SELECT [
        STRUCT("A" as name, 1 as age),
        STRUCT("B" as name, 2 as age),
        STRUCT("C" as name, 3 as age)
    ] as record,
    '1' as id
)
SELECT id, name, age FROM records, UNNEST(record)
WHERE name='B';

Note you have to explicitly specify the column names in the SELECT clause when using UNNEST(ARRAY_COLUMN) in the FROM clause.