Working with Remote Functions

A BigQuery remote function lets you implement your function in languages other than SQL and JavaScript, or with libraries and services that are not allowed in BigQuery user-defined functions.

Overview

A BigQuery remote function lets you incorporate BigQuery SQL functionality with software outside of BigQuery by providing a direct integration with Cloud Functions and Cloud Run. With BigQuery remote functions, you can implement functions in Node.js, Python, Go, Java, .NET, Ruby, or PHP. You can then execute these functions in Cloud Functions or Cloud Run on columns passed in from a BigQuery SQL query.

Workflow Overview

  1. Create the HTTP endpoint in Cloud Functions or Cloud Run.
  2. Create a remote function in BigQuery.
    1. Create a connection of type CLOUD_RESOURCE.
    2. Create a remote function.
  3. Use the remote function in a query just like any other user-defined function.

Limitations

  • Remote functions support only the following data types as argument types or return types:

    • Boolean
    • Bytes
    • Numeric
    • String
    • Date
    • Datetime
    • Time
    • Timestamp

    Remote functions do not support the ARRAY, STRUCT, INTERVAL, GEOGRAPHY, or JSON types.

  • Remote functions do not currently support VPC Service Controls.

  • You cannot create temporary remote functions.

  • You cannot create table-valued remote functions.

  • You cannot use remote functions when creating materialized views.

  • The return value of a remote function is always assumed to be non-deterministic, so the result of a query that calls a remote function is not cached.

  • For better performance, consider SQL or JavaScript user-defined functions if your use case supports them.

  • Remote functions might have degraded performance when used with JavaScript user-defined functions in the same query.

  • Queries using a remote function might fail if they run for more than an hour.

Create an HTTP endpoint in Cloud Functions or Cloud Run

You need to implement the business logic for a remote function in a Cloud Functions or Cloud Run HTTP endpoint. The endpoint must be able to process a batch of rows in a single HTTP POST request and return the results for the batch as an HTTP response.

See the Cloud Functions tutorial and other Cloud Functions documentation on how to write, deploy, test and maintain a Cloud Function.

See the Cloud Run quick start and other Cloud Run documentation on how to write, deploy, test and maintain a Cloud Run service.

It’s recommended that you keep the default authentication instead of allowing unauthenticated invocation of your Cloud Function or Cloud Run service.

Input Format

BigQuery sends HTTP POST requests with a JSON body in the following format:

  • requestId: ID of the request, unique over multiple requests sent to this endpoint in a BigQuery SQL query. Always provided. Type: String.

  • caller: Full resource name of the BigQuery job running the SQL query that calls the remote function. Always provided. Type: String.

  • sessionUser: Email of the user executing the BigQuery SQL query. Always provided. Type: String.

  • userDefinedContext: The user-defined context that was used when creating the remote function in BigQuery. Optional. Type: A JSON object with key-value pairs.

  • calls: A batch of input data. Each element is itself a JSON array: the JSON-encoded argument list of one remote function call. Always provided. Type: A JSON array.

An example of a request:

{
 "requestId": "124ab1c",
 "caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
 "sessionUser": "test-user@test-company.com",
 "userDefinedContext": {
  "key1": "value1",
  "key2": "v2"
 },
 "calls": [
  [null, 1, "", "abc"],
  ["abc", "9007199254740993", null, null]
 ]
}

Output Format

BigQuery expects the endpoint to return an HTTP response in the following format; otherwise, BigQuery cannot consume the response and fails the query that calls the remote function.

  • replies: A batch of return values. Required for a successful response. Type: A JSON array.

    Each element corresponds to the JSON-encoded return value of one external function call.

    The size of the array must match the size of the calls JSON array in the HTTP request. For example, if the calls array has 4 elements, this array must also have 4 elements.

  • errorMessage: Error message returned when an HTTP response code other than 200 is used. For non-retryable errors, BigQuery returns this message to the user as part of the BigQuery job's error message. Optional. Type: String; should be less than 1KB.

An example of a successful response:

{
  "replies": [
    1,
    0
  ]
}

An example of a failed response:

{
  "errorMessage": "Received but not expected that the argument 0 be null".
}

JSON encoding of SQL data types

JSON encoding in the HTTP request and response follows the existing BigQuery JSON encoding used by the TO_JSON_STRING function.
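
For example, under this encoding an INT64 value outside the lossless double range [-2^53, 2^53] is encoded as a JSON string rather than a number, and a BYTES value is encoded as a base64 string. The calls array for a hypothetical function taking (INT64, BYTES) arguments might therefore look like the following sketch:

"calls": [
  [1, "SGVsbG8="],
  ["9007199254740993", null]
]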

Sample Cloud Function code

The following sample Python code sums the integer arguments of the remote function. It handles a request containing the arguments for a batch of invocations and returns all of the results in a single response.

import json

# INT64 values outside [-2^53, 2^53] are JSON-encoded as strings.
_MAX_LOSSLESS = 9007199254740992

def batch_add(request):
  try:
    return_value = []
    request_json = request.get_json()
    calls = request_json['calls']
    for call in calls:
      # Sum the non-NULL arguments of one call; string-encoded INT64
      # values are converted back to integers first.
      return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
    # Return results outside the lossless range as strings.
    replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
    return json.dumps({"replies": replies})
  except Exception:
    return json.dumps({"errorMessage": 'something unexpected in input'}), 400

Assuming that the function is deployed in the project my_gcf_project in the region us-east1 with the function name remote_add, it can be accessed via the endpoint https://us-east1-my_gcf_project.cloudfunctions.net/remote_add.

Sample Cloud Run code

The following sample Python code implements the same functionality as a web service that can be built and deployed to Cloud Run.

import os

from flask import Flask, request, jsonify

# INT64 values outside [-2^53, 2^53] are JSON-encoded as strings.
_MAX_LOSSLESS = 9007199254740992

app = Flask(__name__)

@app.route("/", methods=['POST'])
def batch_add():
  try:
    return_value = []
    request_json = request.get_json()
    calls = request_json['calls']
    for call in calls:
      # Sum the non-NULL arguments of one call; string-encoded INT64
      # values are converted back to integers first.
      return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
    # Return results outside the lossless range as strings.
    replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
    return jsonify({"replies": replies})
  except Exception:
    return jsonify({"errorMessage": 'something unexpected in input'}), 400

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

See the guide on how to build and deploy the code.

Assuming that the Cloud Run service is deployed in the project my_gcf_project in the region us-east1 with the service name remote_add, it can be accessed via the endpoint https://remote_add-<project_id_hash>-ue.a.run.app.

Create a remote function

BigQuery uses a CLOUD_RESOURCE connection to interact with your Cloud Function or Cloud Run service. To create a remote function, you must first create a CLOUD_RESOURCE connection.

Create a CLOUD_RESOURCE connection

You can use the bq mk command to create a CLOUD_RESOURCE connection. The following example creates a CLOUD_RESOURCE connection, my-connection, in the project my-project-id in the location US.

bq mk --connection --display_name='friendly name' --connection_type=CLOUD_RESOURCE \
      --project_id=my-project-id --location=US my-connection

When you create a CLOUD_RESOURCE connection, BigQuery creates a unique system service account and associates it with the connection. You can retrieve the service account identifier with the bq command-line tool, as shown in the following example (some output columns are omitted for clarity).

bq show --location=US --connection  my-connection

Connection my-project-id.US.my-connection
           name                                    properties
 -------------------------    ------------------------------------------------
  1234.us.my-connection       {"serviceAccountId": "connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}

For more information about how to manage connections, see how to work with connections to external data sources.

Grant the permission on your Cloud Function or Cloud Run service

Unless you choose to allow unauthenticated invocation of your Cloud Function or Cloud Run service, which is not encouraged, the service account associated with the connection must be granted permission to invoke it.

Cloud Run

The service account associated with the connection must be granted the run.routes.invoke permission (for example, through the Cloud Run Invoker role) on your Cloud Run service.
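
As a sketch, a grant along the following lines would assign that role, reusing the example service account from the bq show output above and the remote_add service name from the earlier Cloud Run example (substitute your own names):

gcloud run services add-iam-policy-binding remote_add \
    --member='serviceAccount:connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
    --role='roles/run.invoker' \
    --region=us-east1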

Cloud Function

The service account associated with the connection must be granted the cloudfunctions.functions.invoke permission (for example, through the Cloud Functions Invoker role) on your 1st-gen Cloud Function. For a 2nd-gen Cloud Function, the service account must instead be granted the run.routes.invoke permission (for example, through the Cloud Run Invoker role) on the function's underlying Cloud Run service.
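
For a 1st-gen Cloud Function, a similar grant might look like the following sketch, again reusing the example service account and the remote_add function name from earlier:

gcloud functions add-iam-policy-binding remote_add \
    --member='serviceAccount:connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
    --role='roles/cloudfunctions.invoker' \
    --region=us-east1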

Create a remote function

To create a remote function, use the CREATE FUNCTION statement.

The following example creates a remote function named my_bq_project.my_dataset.remote_add using the connection my-project-id.us.my-connection and referencing an endpoint in Cloud Functions.

CREATE FUNCTION my_bq_project.my_dataset.remote_add(x INT64, y INT64) RETURNS INT64
REMOTE WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS (endpoint = 'https://us-east1-my_gcf_project.cloudfunctions.net/remote_add')

You must have the bigquery.routines.create permission on the dataset where you create the remote function, and the bigquery.connections.delegate permission (for example, through the BigQuery Connection Admin role) on the connection used by the remote function.
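
As one way to obtain the latter, the BigQuery Connection Admin role could be granted at the project level as in the following sketch (the user email is a placeholder):

gcloud projects add-iam-policy-binding my-project-id \
    --member='user:test-user@test-company.com' \
    --role='roles/bigquery.connectionAdmin'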

Providing user defined context

You can specify user_defined_context in OPTIONS as a set of key-value pairs that become part of every HTTP request to the endpoint. With user-defined context, you can create multiple remote functions that reuse a single endpoint, which provides different behavior based on the context passed to it.

The following examples create two remote functions to encrypt and decrypt BYTES data using the same endpoint.

CREATE FUNCTION my_bq_project.my_dataset.encrypt(x BYTES) RETURNS BYTES
REMOTE WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS (endpoint = 'https://us-east1-my_gcf_project.cloudfunctions.net/crypto_service',
         user_defined_context = [("mode", "encryption")])

CREATE FUNCTION my_bq_project.my_dataset.decrypt(x BYTES) RETURNS BYTES
REMOTE WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS (endpoint = 'https://us-east1-my_gcf_project.cloudfunctions.net/crypto_service',
         user_defined_context = [("mode", "decryption")])

Limiting number of rows in a batch request

You can specify max_batching_rows in OPTIONS as the maximum number of rows in each HTTP request, to avoid Cloud Functions timeouts. If not specified, BigQuery decides how many rows to include in a batch.
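
For example, the remote_add function from earlier could be created with a batch cap; the limit of 10 used here is only illustrative:

CREATE FUNCTION my_bq_project.my_dataset.remote_add(x INT64, y INT64) RETURNS INT64
REMOTE WITH CONNECTION `my-project-id.us.my-connection`
OPTIONS (endpoint = 'https://us-east1-my_gcf_project.cloudfunctions.net/remote_add',
         max_batching_rows = 10)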

Use a remote function in a query

Make sure you have granted permission on your Cloud Function or Cloud Run service, so that it is accessible to the BigQuery service account associated with the remote function's connection.

You also need the bigquery.routines.get permission on the dataset that contains the remote function, and the bigquery.connections.use permission (for example, through the BigQuery Connection User role) on the connection used by the remote function.

You can use a remote function in a query just like a user-defined function.

For example, you can use the remote_add function in the following query:

SELECT val, my_bq_project.my_dataset.remote_add(val, 2)
  FROM UNNEST([NULL,2,3,5,8]) AS val;

This example produces the following output:

+------+-----+
|  val | f0_ |
+------+-----+
| NULL |   2 |
|    2 |   4 |
|    3 |   5 |
|    5 |   7 |
|    8 |  10 |
+------+-----+

Supported regions

There are two types of locations in BigQuery:

  • A region is a specific geographic place, such as London.

  • A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.

Single regions

In a BigQuery single region dataset, you can only create a remote function that uses a Cloud Function deployed in the same region. For example:

  • A remote function in BigQuery single region us-east4 can only use a Cloud Function in us-east4.

So for single regions, remote functions are only supported in regions that support both Cloud Functions and BigQuery.

Multi-regions

In a BigQuery multi-region (US, EU) dataset, you can only create a remote function that uses a Cloud Function deployed in a region within the same large geographic area (US, EU). For example:

  • A remote function in BigQuery US multi-region can only use a Cloud Function deployed in any single region in the US geographic area, such as us-central1, us-east4, us-west2, etc.
  • A remote function in BigQuery EU multi-region can only use a Cloud Function deployed in any single region in member states of the European Union, such as europe-north1, europe-west3, etc.

For more information about BigQuery regions and multi-regions, see the Dataset Locations page. For more information about Cloud Functions regions, see the Cloud Functions Locations page.

Connections

For either a single-region location or multi-region location, you can only create a remote function in the same location as the connection you use. For example, to create a remote function in the US multi-region, use a connection located in the US multi-region.

Pricing

  • Standard BigQuery pricing applies.

  • In addition, using this feature can incur costs for Cloud Functions and Cloud Run. Review the Cloud Functions and Cloud Run pricing pages for details.

Best practices for remote functions

  • Prefilter your input: If your input can be easily filtered down before being passed to a remote function, your query will likely be faster and cheaper (see the sketch after this list).

  • Keep your Cloud Function scalable: Use the default value for your Cloud Function's maximum number of instances.

  • Follow other Cloud Functions tips for better performance. Remote function queries interacting with a high-latency Cloud Function might fail due to timeout.
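
As a sketch of prefiltering, reusing the remote_add function from the earlier examples: filtering in a subquery lets BigQuery drop rows before they are passed to the remote function, although the exact execution plan is up to the query optimizer.

SELECT val, my_bq_project.my_dataset.remote_add(val, 2)
  FROM (
    SELECT val
      FROM UNNEST([NULL, 2, 3, 5, 8]) AS val
     WHERE val IS NOT NULL
  );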

Quotas

For information about remote function quotas, see Quotas and limits.