Dataflow SQL user-defined functions

Dataflow SQL supports user-defined functions (UDFs) written in Java or SQL. These functions accept columns of input and perform actions, returning the result of those actions as a value.

Java UDFs

A Java UDF takes zero or more scalar parameters and returns a single scalar value.

Syntax

CREATE FUNCTION
    function_name
    ([named_parameter[, ...]])
  RETURNS data_type
  LANGUAGE java
  OPTIONS (path='jar_file_uri')
named_parameter:
  param_name param_type

This syntax consists of the following components:

  • function_name. Specifies the name of the function.

  • named_parameter. Consists of a comma-separated param_name and param_type pair. The value of param_type is a Dataflow SQL data type.

  • RETURNS data_type. Specifies the data type that the function returns.

  • OPTIONS (path = jar_file_uri). Specifies the Cloud Storage URI of the JAR file that implements the function.

The CREATE FUNCTION statement defines the signature of the function. The implementation is provided as a JAR file. Upload the JAR file to Cloud Storage and use the path option to specify the Cloud Storage URI of the JAR file.

To invoke the function, include a SELECT statement after the CREATE FUNCTION statement. Separate multiple statements in the query using a semicolon. The last statement must be a SELECT statement, and you can use only one SELECT statement.

Java UDF example

The following example defines a Java UDF and calls the function in a SELECT statement:

CREATE FUNCTION fun (a INT64)
  RETURNS INT64
  LANGUAGE java
  OPTIONS (
    path='gs://my-bucket/udf.jar'
  );

SELECT fun(1) as b;

Implementing a Java UDF

To create a Java UDF, extend the ScalarFn class. Your subclass must contain exactly one method annotated with @ApplyMethod. The method parameters and return type must match the data types that you declared in the CREATE FUNCTION statement. For more information, see Type mappings in this document.

The following example implements a UDF that increments a number and returns the result.

public class IncrementFn extends ScalarFn {
  @ApplyMethod
  public Long increment(Long i) {
    return i + 1;
  }
}

UDF provider

In your JAR file, include exactly one public class that implements the userDefinedScalarFunctions method in the UdfProvider interface. This method maps UDF function names to Java UDF functions. At runtime, Dataflow calls this method to discover which function to invoke.

The following example shows an implementation of UdfProvider for a function named helloWorld.

@AutoService(UdfProvider.class)
public class ExampleUdfProvider implements UdfProvider {
  @Override
  public Map<String, ScalarFn> userDefinedScalarFunctions() {
    return ImmutableMap.of("helloWorld", new HelloWorldFn());
  }

  public static class HelloWorldFn extends ScalarFn {
    @ApplyMethod
    public String helloWorld() {
      return "Hello world!";
    }
  }
}

Type mappings

When a Java UDF is invoked, Dataflow converts the function parameters to Java types and converts the return value back to a Dataflow SQL type. The following table shows how the types are mapped.

SQL data type Java class
INT64 java.lang.Long
FLOAT64 java.lang.Double
STRING java.lang.String
BOOL java.lang.Boolean

SQL UDFs

A SQL UDF lets you create a function by using another SQL expression.

Syntax

CREATE FUNCTION
    function_name
    ([named_parameter[, ...]])
  [RETURNS data_type]
  LANGUAGE sql
  AS (sql_expression)
named_parameter:
  param_name param_type

This syntax consists of the following components:

  • CREATE FUNCTION. Creates a function.

  • function_name. Specifies the name of the function.

  • named_parameter. Consists of a comma-separated param_name and param_type pair. The value of param_type is a Dataflow SQL data type.

  • [RETURNS data_type]. Specifies the data type that the function returns. For SQL UDFs, the RETURNS clause is optional.

  • sql_expression. Specifies the SQL expression that defines the function.

To invoke the function, include a SELECT statement after the CREATE FUNCTION statement. Separate multiple statements in the query using a semicolon. The last statement must be a SELECT statement, and you can use only one SELECT statement.

SQL UDF example

The following example creates a function named addFourAndMultiply and calls the function on a list of numbers.

CREATE FUNCTION addFourAndMultiply(x INT64, y INT64) AS ((x + 4) * y);

WITH numbers AS
  (SELECT 1 as val
  UNION ALL
  SELECT 3 as val
  UNION ALL
  SELECT 4 as val
  UNION ALL
  SELECT 5 as val)
SELECT val, addFourAndMultiply(val, 2) AS result
FROM numbers;