Dataflow SQL supports a variant of the ZetaSQL query syntax and includes additional streaming extensions for running Dataflow streaming jobs. Not all ZetaSQL language features are supported; however, additional constructs are under development.
Query statements scan one or more tables, streams, or expressions and return the computed result rows. This topic describes the syntax for SQL queries in Cloud Dataflow SQL.
SQL Syntax
query_statement: query_expr query_expr: [ WITH with_query_name AS ( query_expr ) [, ...] ] { select | ( query_expr ) | query_expr set_op query_expr } [ [ ORDER BY expression [{ ASC | DESC }] [, ...] ] LIMIT count [ OFFSET skip_rows ] ] select: SELECT [ ALL | DISTINCT ] { * | expression [ [ AS ] alias ] } [, ...] [ FROM from_item ] [ WHERE bool_expression ] [ GROUP BY expression [, ...] ] [ HAVING bool_expression ] set_op: UNION { ALL | DISTINCT } | INTERSECT { ALL | DISTINCT } | EXCEPT { ALL | DISTINCT } from_item: { table_name [ [ AS ] alias ] | join | ( query_expr ) [ [ AS ] alias ] | with_query_name [ [ AS ] alias ] } table_name: identifier [ . identifier ...] join: from_item [ join_type ] JOIN from_item ON bool_expression join_type: { INNER | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
Notation:
- Square brackets "[ ]" indicate optional clauses.
- Parentheses "( )" indicate literal parentheses.
- The vertical bar "|" indicates a logical OR.
- Curly braces "{ }" enclose a set of options.
- A comma followed by an ellipsis within square brackets "[, ... ]" indicates that the preceding item can repeat in a comma-separated list.
SELECT list
Syntax:
SELECT [ ALL ] { * | expression [ [ AS ] alias ] } [, ...]
The SELECT
list defines the columns that the query will return. Expressions in
the SELECT
list can refer to columns in any of the from_item
s in its
corresponding FROM
clause.
Each item in the SELECT
list is one of:
- *
expression
SELECT *
SELECT *
, often referred to as select star, produces one output column for
each column that is visible after executing the full query.
SELECT * FROM (SELECT "apple" AS fruit, "carrot" AS vegetable);
+-------+-----------+
| fruit | vegetable |
+-------+-----------+
| apple | carrot |
+-------+-----------+
SELECT expression
Items in a SELECT
list can be expressions. These expressions evaluate to a
single value and produce one output column, with an optional explicit alias
.
If the expression does not have an explicit alias, it receives an implicit alias according to the rules for implicit aliases, if possible. Otherwise, the column is anonymous and you cannot refer to it by name elsewhere in the query.
SELECT modifiers
You can modify the results returned from a SELECT
query, as follows.
SELECT ALL
A SELECT ALL
statement returns all rows, including duplicate rows.
SELECT ALL
is the default behavior of SELECT
.
Aliases
See Aliases for information on syntax and visibility for
SELECT
list aliases.
FROM clause
The FROM
clause indicates the tables or streams from which to retrieve rows, and
specifies how to join those rows together to produce a single stream of
rows for processing in the rest of the query.
Syntax
from_item: { table_name [ [ AS ] alias ] | join | ( query_expr ) [ [ AS ] alias ] | with_query_name [ [ AS ] alias ] }
table_name
The fully-qualified SQL name of a data source queryable by Cloud Dataflow SQL, specified by a dot-separated list of identifiers using Standard SQL lexical structure. You must use backticks to enclose identifiers that contain characters which are not letters, numbers, or underscores.
SELECT * FROM bigquery.table.`my-project`.baseball.roster;
SELECT * FROM pubsub.topic.`my-project`.incoming_events;
join
See JOIN Types below.
select
( select ) [ [ AS ] alias ]
is a table subquery.
with_query_name
The query names in a WITH
clause (see WITH Clause) act like names of temporary tables that you
can reference anywhere in the FROM
clause. In the example below,
subQ1
and subQ2
are with_query_names
.
Example:
WITH subQ1 AS (SELECT * FROM Roster WHERE SchoolID = 52), subQ2 AS (SELECT SchoolID FROM subQ1) SELECT DISTINCT * FROM subQ2;
The WITH
clause hides any permanent tables with the same name
for the duration of the query, unless you qualify the table name, e.g.
db.Roster
.
Subqueries
A subquery is a query that appears inside another statement, and is written
inside parentheses. These are also referred to as "sub-SELECTs" or
"nested SELECTs". The full SELECT
syntax is valid in
subqueries.
There are two types of subquery:
- Expression subqueries, which you can use in a query wherever expressions are valid. Expression subqueries return a single value.
- Table subqueries, which you can use only in a
FROM
clause. The outer query treats the result of the subquery as a table.
Note that there must be parentheses around both types of subqueries.
Example:
SELECT AVG ( PointsScored )
FROM
( SELECT PointsScored
FROM Stats
WHERE SchoolID = 77 )
Optionally, a table subquery can have an alias.
Example:
SELECT r.LastName
FROM
( SELECT * FROM Roster) AS r;
Aliases
See Aliases for information on syntax and visibility for
FROM
clause aliases.
JOIN types
Syntax
join: from_item [ join_type ] JOIN from_item ON bool_expression join_type: { INNER | FULL [OUTER] | LEFT [OUTER] | RIGHT [OUTER] }
The JOIN
clause merges two from_item
s so that the SELECT
clause can
query them as one source. The join_type
and ON
clause (a
"join condition") specify how to combine and discard rows from the two
from_item
s to form a single source.
All JOIN
clauses require a join_type
.
[INNER] JOIN
An INNER JOIN
, or simply JOIN
, effectively calculates the Cartesian product
of the two from_item
s and discards all rows that do not meet the join
condition. "Effectively" means that it is possible to implement an INNER JOIN
without actually calculating the Cartesian product.
FULL [OUTER] JOIN
A FULL OUTER JOIN
(or simply FULL JOIN
) returns all fields for all rows in
both from_item
s that meet the join condition.
FULL
indicates that all rows from both from_item
s are
returned, even if they do not meet the join condition.
OUTER
indicates that if a given row from one from_item
does not
join to any row in the other from_item
, the row will return with NULLs
for all columns from the other from_item
.
LEFT [OUTER] JOIN
The result of a LEFT OUTER JOIN
(or simply LEFT JOIN
) for two
from_item
s always retains all rows of the left from_item
in the
JOIN
clause, even if no rows in the right from_item
satisfy the join
predicate.
LEFT
indicates that all rows from the left from_item
are
returned; if a given row from the left from_item
does not join to any row
in the right from_item
, the row will return with NULLs for all
columns from the right from_item
. Rows from the right from_item
that
do not join to any row in the left from_item
are discarded.
RIGHT [OUTER] JOIN
The result of a RIGHT OUTER JOIN
(or simply RIGHT JOIN
) is similar and
symmetric to that of LEFT OUTER JOIN
.
ON clause
The ON
clause contains a bool_expression
. A combined row (the result of
joining two rows) meets the join condition if bool_expression
returns
TRUE.
Example:
SELECT * FROM Roster INNER JOIN PlayerStats
ON Roster.LastName = PlayerStats.LastName;
Sequences of JOINs
The FROM
clause can contain multiple JOIN
clauses in sequence.
Example:
SELECT * FROM a LEFT JOIN b ON TRUE LEFT JOIN c ON TRUE;
where a
, b
, and c
are any from_item
s. JOINs are bound from left to
right, but you can insert parentheses to group them in a different order.
WHERE clause
Syntax
WHERE bool_expression
The WHERE
clause filters out rows by evaluating each row against
bool_expression
, and discards all rows that do not return TRUE (that is,
rows that return FALSE or NULL).
Example:
SELECT * FROM Roster
WHERE SchoolID = 52;
The bool_expression
can contain multiple sub-conditions.
Example:
SELECT * FROM Roster
WHERE STARTS_WITH(LastName, "Mc") OR STARTS_WITH(LastName, "Mac");
You cannot reference column aliases from the SELECT
list in the WHERE
clause.
GROUP BY clause
Syntax
GROUP BY expression [, ...]
The GROUP BY
clause groups together rows in a table with non-distinct values
for the expression
in the GROUP BY
clause. For multiple rows in the
source table with non-distinct values for expression
, the
GROUP BY
clause produces a single combined row. GROUP BY
is commonly used
when aggregate functions are present in the SELECT
list, or to eliminate
redundancy in the output. The data type of expression
must be groupable.
Example:
SELECT SUM(PointsScored), LastName
FROM PlayerStats
GROUP BY LastName;
The GROUP BY
clause can refer to expression names in the SELECT
list. The
GROUP BY
clause also allows ordinal references to expressions in the SELECT
list using integer values. 1
refers to the first expression in the
SELECT
list, 2
the second, and so forth. The expression list can combine
ordinals and expression names.
Example:
SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY LastName, FirstName;
The query above is equivalent to:
SELECT SUM(PointsScored), LastName, FirstName
FROM PlayerStats
GROUP BY 2, FirstName;
GROUP BY
clauses may also refer to aliases. If a query contains aliases in
the SELECT
clause, those aliases override names in the corresponding FROM
clause.
Example:
SELECT SUM(PointsScored), LastName as last_name
FROM PlayerStats
GROUP BY last_name;
HAVING clause
Syntax
HAVING bool_expression
The HAVING
clause is similar to the WHERE
clause: it filters out rows that
do not return TRUE when they are evaluated against the bool_expression
.
As with the WHERE
clause, the bool_expression
can be any expression
that returns a boolean, and can contain multiple sub-conditions.
The HAVING
clause differs from the WHERE
clause in that:
- The
HAVING
clause requiresGROUP BY
or aggregation to be present in the query. - The
HAVING
clause occurs afterGROUP BY
and aggregation, and beforeORDER BY
. This means that theHAVING
clause is evaluated once for every aggregated row in the result set. This differs from theWHERE
clause, which is evaluated beforeGROUP BY
and aggregation.
The HAVING
clause can reference columns available via the FROM
clause, as
well as SELECT
list aliases. Expressions referenced in the HAVING
clause
must either appear in the GROUP BY
clause or they must be the result of an
aggregate function:
SELECT LastName
FROM Roster
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
If a query contains aliases in the SELECT
clause, those aliases override names
in a FROM
clause.
SELECT LastName, SUM(PointsScored) AS ps
FROM Roster
GROUP BY LastName
HAVING ps > 0;
Mandatory aggregation
Aggregation does not have to be present in the HAVING
clause itself, but
aggregation must be present in at least one of the following forms:
Aggregation function in the SELECT
list.
SELECT LastName, SUM(PointsScored) AS total
FROM PlayerStats
GROUP BY LastName
HAVING total > 15;
Aggregation function in the 'HAVING' clause.
SELECT LastName
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
Aggregation in both the SELECT
list and HAVING
clause.
When aggregation functions are present in both the SELECT
list and HAVING
clause, the aggregation functions and the columns they reference do not need
to be the same. In the example below, the two aggregation functions,
COUNT()
and SUM()
, are different and also use different columns.
SELECT LastName, COUNT(*)
FROM PlayerStats
GROUP BY LastName
HAVING SUM(PointsScored) > 15;
LIMIT clause and OFFSET clause
Syntax
[ ORDER BY expression [{ASC | DESC}] [,...] ] LIMIT count [ OFFSET skip_rows ]
The ORDER BY
clause specifies a column or expression as the sort criterion for
the result set. If an ORDER BY clause is not present, the order of the results
of a query is not defined. The default sort direction is ASC
, which sorts the
results in ascending order of expression
values. DESC
sorts the results in
descending order. Column aliases from a FROM
clause or SELECT
list are
allowed. If a query contains aliases in the SELECT
clause, those aliases
override names in the corresponding FROM
clause.
It is possible to order by multiple columns.
The following rules apply when ordering values:
- NULLs: In the context of the
ORDER BY
clause, NULLs are the minimum possible value; that is, NULLs appear first inASC
sorts and last inDESC
sorts.
LIMIT
specifies a non-negative count
of type INT64,
and no more than count
rows will be returned. LIMIT
0
returns 0 rows. If
there is a set
operation, LIMIT
is applied after the
set operation
is evaluated.
OFFSET
specifies a non-negative skip_rows
of type
INT64, and only rows from
that offset in the table will be considered.
These clauses accept only literal or parameter values.
The rows that are returned by LIMIT
and OFFSET
is unspecified unless these
operators are used after ORDER BY
.
WITH clause
The WITH
clause binds the results of one or more named subqueries to temporary
table names. Each introduced table name is visible in subsequent SELECT
expressions within the same query expression. This includes the following kinds
of SELECT
expressions:
- Any
SELECT
expressions in subsequentWITH
bindings - Top level
SELECT
expressions in the query expression on both sides of a set operator such asUNION
SELECT
expressions inside subqueries within the same query expression
Example:
WITH subQ1 AS (SELECT SchoolID FROM Roster),
subQ2 AS (SELECT OpponentID FROM PlayerStats)
SELECT * FROM subQ1
UNION ALL
SELECT * FROM subQ2;
The following are scoping rules for WITH
clauses:
- Aliases are scoped so that the aliases introduced in a
WITH
clause are visible only in the later subqueries in the sameWITH
clause, and in the query under theWITH
clause. - Aliases introduced in the same
WITH
clause must be unique, but the same alias can be used in multipleWITH
clauses in the same query. The local alias overrides any outer aliases anywhere that the local alias is visible.
Cloud Dataflow SQL does not support WITH RECURSIVE
.
Aliases
An alias is a temporary name given to a table, column, or expression present in
a query. You can introduce explicit aliases in the SELECT
list or FROM
clause.
Explicit alias syntax
You can introduce explicit aliases in either the FROM
clause or the SELECT
list.
In a FROM
clause, you can introduce explicit aliases for any item, including
tables, arrays and subqueries, using [AS] alias
. The AS
keyword is optional.
Example:
SELECT s.FirstName, s2.SongName
FROM Singers AS s, (SELECT * FROM Songs) AS s2;
You can introduce explicit aliases for any expression in the SELECT
list using
[AS] alias
. The AS
keyword is optional.
Example:
SELECT s.FirstName AS name, LOWER(s.FirstName) AS lname
FROM Singers s;
Explicit alias visibility
After you introduce an explicit alias in a query, there are restrictions on where else in the query you can reference that alias. These restrictions on alias visibility are the result of Cloud Dataflow SQL's name scoping rules.
FROM clause aliases
Cloud Dataflow SQL processes aliases in a FROM
clause from left to right,
and aliases are visible only to subsequent path expressions in a FROM
clause.
Example:
Assume the Singers
table had a Concerts
column of ARRAY
type.
SELECT FirstName
FROM Singers AS s, s.Concerts;
Invalid:
SELECT FirstName
FROM s.Concerts, Singers AS s; // INVALID.
FROM
clause aliases are not visible to subqueries in the same FROM
clause. Subqueries in a FROM
clause cannot contain correlated references to
other tables in the same FROM
clause.
Invalid:
SELECT FirstName
FROM Singers AS s, (SELECT (2020 - ReleaseDate) FROM s) // INVALID.
You can use any column name from a table in the FROM
as an alias anywhere in
the query, with or without qualification with the table name.
Example:
SELECT FirstName, s.ReleaseDate
FROM Singers s WHERE ReleaseDate = 1975;
SELECT list aliases
Aliases in the SELECT
list are visible only to the following clauses:
GROUP BY
clauseORDER BY
clauseHAVING
clause
Example:
SELECT LastName AS last, SingerID
FROM Singers
ORDER BY last;
Explicit aliases in GROUP BY, ORDER BY, and HAVING clauses
These three clauses, GROUP BY
, ORDER BY
, and HAVING
, can refer to only the
following values:
- Tables in the
FROM
clause and any of their columns. - Aliases from the
SELECT
list.
GROUP BY
and ORDER BY
can also refer to a third group:
- Integer literals, which refer to items in the
SELECT
list. The integer1
refers to the first item in theSELECT
list,2
refers to the second item, etc.
Example:
SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;
The query above is equivalent to:
SELECT SingerID AS sid, COUNT(Songid) AS s2id
FROM Songs
GROUP BY sid
ORDER BY s2id DESC LIMIT 10;
Ambiguous aliases
Cloud Dataflow SQL provides an error if a name is ambiguous, meaning it can resolve to more than one unique object.
Examples:
This query contains column names that conflict between tables, since both
Singers
and Songs
have a column named SingerID
:
SELECT SingerID
FROM Singers, Songs;
This query contains aliases that are ambiguous in the GROUP BY
clause because
they are duplicated in the SELECT
list:
SELECT FirstName AS name, LastName AS name,
FROM Singers
GROUP BY name;
Ambiguity between a FROM
clause column name and a SELECT
list alias in
GROUP BY
:
SELECT UPPER(LastName) AS LastName
FROM Singers
GROUP BY LastName;
The query above is ambiguous and will produce an error because LastName
in the
GROUP BY
clause could refer to the original column LastName
in Singers
, or
it could refer to the alias AS LastName
, whose value is UPPER(LastName)
.
The same rules for ambiguity apply to path expressions. Consider the following
query where table
has columns x
and y
, and column z
is of type STRUCT
and has fields v
, w
, and x
.
Example:
SELECT x, z AS T
FROM table T
GROUP BY T.x;
The alias T
is ambiguous and will produce an error because T.x
in the GROUP
BY
clause could refer to either table.x
or table.z.x
.
A name is not ambiguous in GROUP BY
, ORDER BY
or HAVING
if it is both
a column name and a SELECT
list alias, as long as the name resolves to the
same underlying object.
Example:
SELECT LastName, BirthYear AS BirthYear
FROM Singers
GROUP BY BirthYear;
The alias BirthYear
is not ambiguous because it resolves to the same
underlying column, Singers.BirthYear
.
Appendix A: examples with sample data
Sample tables
The following three tables contain sample data about athletes, their schools, and the points they score during the season. These tables will be used to illustrate the behavior of different query clauses.
Table Roster:
LastName | SchoolID |
---|---|
Adams | 50 |
Buchanan | 52 |
Coolidge | 52 |
Davis | 51 |
Eisenhower | 77 |
The Roster table includes a list of player names (LastName) and the unique ID assigned to their school (SchoolID).
Table PlayerStats:
LastName | OpponentID | PointsScored |
---|---|---|
Adams | 51 | 3 |
Buchanan | 77 | 0 |
Coolidge | 77 | 1 |
Adams | 52 | 4 |
Buchanan | 50 | 13 |
The PlayerStats table includes a list of player names (LastName) and the unique ID assigned to the opponent they played in a given game (OpponentID) and the number of points scored by the athlete in that game (PointsScored).
Table TeamMascot:
SchoolId | Mascot |
---|---|
50 | Jaguars |
51 | Knights |
52 | Lakers |
53 | Mustangs |
The TeamMascot table includes a list of unique school IDs (SchoolID) and the mascot for that school (Mascot).
JOIN types
1) [INNER] JOIN
Example:
SELECT * FROM Roster JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
Results:
LastName | Roster.SchoolId | TeamMascot.SchoolId | Mascot |
---|---|---|---|
Adams | 50 | 50 | Jaguars |
Buchanan | 52 | 52 | Lakers |
Coolidge | 52 | 52 | Lakers |
Davis | 51 | 51 | Knights |
2) FULL [OUTER] JOIN
Example:
SELECT * FROM Roster FULL JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
LastName | Roster.SchoolId | TeamMascot.SchoolId | Mascot |
---|---|---|---|
Adams | 50 | 50 | Jaguars |
Buchanan | 52 | 52 | Lakers |
Coolidge | 52 | 52 | Lakers |
Davis | 51 | 51 | Knights |
Eisenhower | 77 | NULL | NULL |
NULL | NULL | 53 | Mustangs |
3) LEFT [OUTER] JOIN
Example:
SELECT * FROM Roster LEFT JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
Results:
LastName | Roster.SchoolId | TeamMascot.SchoolId | Mascot |
---|---|---|---|
Adams | 50 | 50 | Jaguars |
Buchanan | 52 | 52 | Lakers |
Coolidge | 52 | 52 | Lakers |
Davis | 51 | 51 | Knights |
Eisenhower | 77 | NULL | NULL |
4) RIGHT [OUTER] JOIN
Example:
SELECT * FROM Roster RIGHT JOIN TeamMascot
ON Roster.SchoolID = TeamMascot.SchoolID;
Results:
LastName | Roster.SchoolId | TeamMascot.SchoolId | Mascot |
---|---|---|---|
Adams | 50 | 50 | Jaguars |
Davis | 51 | 51 | Knights |
Coolidge | 52 | 52 | Lakers |
Buchanan | 52 | 52 | Lakers |
NULL | NULL | 53 | Mustangs |
GROUP BY clause
Example:
SELECT LastName, SUM(PointsScored)
FROM PlayerStats
GROUP BY LastName;
LastName | SUM |
---|---|
Adams | 7 |
Buchanan | 13 |
Coolidge | 1 |
Set operators
UNION
The UNION
operator combines the result sets of two or more SELECT
statements
by pairing columns from the result set of each SELECT
statement and vertically
concatenating them.
Example:
SELECT Mascot AS X, SchoolID AS Y
FROM TeamMascot
UNION ALL
SELECT LastName, PointsScored
FROM PlayerStats;
Results:
X | Y |
---|---|
Jaguars | 50 |
Knights | 51 |
Lakers | 52 |
Mustangs | 53 |
Adams | 3 |
Buchanan | 0 |
Coolidge | 1 |
Adams | 4 |
Buchanan | 13 |
INTERSECT
This query returns the last names that are present in both Roster and PlayerStats.
SELECT LastName
FROM Roster
INTERSECT ALL
SELECT LastName
FROM PlayerStats;
Results:
LastName |
---|
Adams |
Coolidge |
Buchanan |
EXCEPT
The query below returns last names in Roster that are not present in PlayerStats.
SELECT LastName
FROM Roster
EXCEPT DISTINCT
SELECT LastName
FROM PlayerStats;
Results:
LastName |
---|
Eisenhower |
Davis |
Reversing the order of the SELECT
statements will return last names in
PlayerStats that are not present in Roster:
SELECT LastName
FROM PlayerStats
EXCEPT DISTINCT
SELECT LastName
FROM Roster;
Results:
(empty)
Appendix B: unnest ARRAY
Because Dataflow SQL doesn't support ARRAY functions (see
Beam ZetaSQL support
), you can't use the [OFFSET]
or [CARDINAL]
syntax to
access a STRUCT in an ARRAY.
For example, this won't work:
WITH records AS (
SELECT [
STRUCT("A" as name, 1 as age),
STRUCT("B" as name, 2 as age),
STRUCT("C" as name, 3 as age)
] as record,
'1' as id
)
SELECT * from records
where record[offset(1)].name = "B";
It will print errors similar to Element access using [offset()] is not
supported on values of type ARRAY
.
To make the query work with Dataflow SQL, you can use UNNEST
in the
FROM
clause to join the records with the STRUCTs in the ARRAY.
WITH records AS (
SELECT [
STRUCT("A" as name, 1 as age),
STRUCT("B" as name, 2 as age),
STRUCT("C" as name, 3 as age)
] as record,
'1' as id
)
SELECT id, name, age FROM records, UNNEST(record)
WHERE name='B';
Note you have to explicitly specify the column names in the
SELECT
clause when using UNNEST(ARRAY_COLUMN)
in the
FROM
clause.