Jumpstart Your BigQuery Remote Function development today
Carlos Augusto
Principal Architect
Christopher Crosbie
Product Manager, Data Analytics
Last year BigQuery introduced Remote Functions, a feature that allows users to extend BigQuery SQL with their own custom code, written and hosted in Cloud Functions or Cloud Run. With Remote Functions, custom SQL functions can be written in languages like Node.js, Python, Go, Java, .NET, Ruby, or PHP, enabling a personalized BigQuery experience for each organization while keeping BigQuery's standard management and permission models.
We’ve seen an amazing number of use cases enabled by Remote Functions. Inspired by our customers’ success, we decided to document the art of the possible on this blog, providing a few examples, sample code and video instructions to jumpstart your Remote Function development.
Dynamic language translation with SQL
Imagine a multinational organization storing, for example, customer feedback in various languages inside a common BigQuery table. The Cloud Translation API could be used to translate all content into a common language, making it easier to act on the data.
For this specific example, we’ve created an end to end tutorial for extending BigQuery with the Cloud Translation API. You can get all the instructions at https://cloud.google.com/bigquery/docs/remote-functions-translation-tutorial.
BigQuery unstructured data analysis
Analyzing unstructured data can be a daunting task. The combination of a Remote Function and the Cloud Vision API can help organizations derive insights from images stored in Google Cloud via SQL, without leaving the BigQuery prompt.
Imagine if organizations could assign labels to images, quickly classify them into millions of predefined categories, detect objects, read printed and handwritten text, and build valuable metadata into their image catalog stored in BigQuery, all via BigQuery SQL. That is what this example is all about.
We’ve created an end to end, easy to follow tutorial for this use case as well. You can get all instructions at https://cloud.google.com/bigquery/docs/remote-function-tutorial.
Natural language insights with SQL
The Cloud Natural Language API lets you derive insights from unstructured text with machine learning. With Remote Functions, this text processing can be combined with BigQuery SQL.
This example focuses on delivering insights from unstructured text stored in BigQuery tables using Google machine learning and SQL. A simple use case could be an application that gathers social media comments and stores them in BigQuery, while sentiment analysis is performed on each comment via SQL.
The sample code (main.py and requirements.txt) can be found in this repo.
Once the Python code is deployed as a Cloud Function, you can create the BigQuery Remote Function using the syntax below:
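As a minimal sketch of that syntax, a remote function definition follows the standard remote function DDL. The dataset (remote_udf), connection (my_connection), function name, and endpoint URL below are placeholders you would replace with your own values:

```sql
CREATE OR REPLACE FUNCTION `your-project.remote_udf.get_sentiment`(comment STRING)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (
  -- URL of the deployed Cloud Function (placeholder)
  endpoint = 'https://us-central1-your-project.cloudfunctions.net/nlp-sentiment'
);
```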
For more information on how to create a remote function, please see this documentation.
A screenshot of the working function can be seen below.
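As an illustration of how such a function is called (the table and column names here are hypothetical), each comment can be scored directly in a query:

```sql
SELECT
  comment,
  `your-project.remote_udf.get_sentiment`(comment) AS sentiment
FROM `your-project.remote_udf.social_media_comments`;
```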
Security and compliance
Protection of sensitive data, like personally identifiable information (PII), is critical to every business.
With Remote Functions, a SQL call can be made to integrate functionality provided by the Cloud Data Loss Prevention (DLP) API, without the need to export data out of BigQuery. Since remote function calls are made inline with SQL, even DML statements can be performed on the fly, using the outcome of the function as an input value for the data manipulation.
This example focuses on the ability to perform deterministic encryption and decryption of data stored in BigQuery tables using Remote Functions along with DLP.
The sample code (main.py and requirements.txt) can be found here. Please note:
References to <change-me> in main.py will need to be adjusted according to your GCP environment
The code inspects data for the following info_types: PHONE_NUMBER, EMAIL_ADDRESS, and IP_ADDRESS. Feel free to adjust as needed
Cloud Key Management Service (KMS) and Data Loss Prevention APIs will need to be enabled on the GCP project
A DLP Keyring and Key will be required. For directions, click here
The key will need to be wrapped (instructions)
The DLP User role will need to be assigned to the service account executing the Cloud Function (by default, the Compute Engine service account)
Once the Python code is deployed as a Cloud Function, you can create BigQuery Remote Functions using the syntax below:
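A minimal sketch of those statements, assuming the sample exposes separate Cloud Function endpoints for encryption and decryption (the project, dataset, connection, and endpoint names are placeholders):

```sql
CREATE OR REPLACE FUNCTION `your-project.remote_udf.dlp_encrypt`(value STRING)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-your-project.cloudfunctions.net/dlp-encrypt');

CREATE OR REPLACE FUNCTION `your-project.remote_udf.dlp_decrypt`(value STRING)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-your-project.cloudfunctions.net/dlp-decrypt');
```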
For more information on how to create a remote function, take a look at the documentation.
We are using the deterministic encryption/decryption functions on phone numbers. The picture below demonstrates a phone number being encrypted by the dlp_encrypt function:
Since deterministic encryption and decryption techniques are being used, the picture below demonstrates that the phone number can be decrypted back to its original value by calling the dlp_decrypt function with the encrypted value created by the dlp_encrypt function.
Below is an example of a BigQuery table creation that selects data from an existing table while encrypting any phone number, email address, or IP address values found inside the call_details column:
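A sketch of such a statement, with hypothetical table names:

```sql
CREATE TABLE `your-project.remote_udf.calls_encrypted` AS
SELECT
  call_id,
  -- DLP inspects the text and encrypts any findings inside it
  `your-project.remote_udf.dlp_encrypt`(call_details) AS call_details
FROM `your-project.remote_udf.calls`;
```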
Check the full demo video here.
ELT and data catalog updates
Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a target server such as BigQuery and then preparing the information for downstream uses. With ELT, the raw data is loaded into the data warehouse or data lake and transformations occur on the stored data.
When working with BigQuery, it’s common to see transformations being done with SQL and called via stored procedures. In this scenario, the transformation logic is self-contained, running inside BigQuery. But what if you need to keep external systems like Google Data Catalog updated while running the SQL transformation jobs?
That is what this example is all about: it demonstrates the ability to update Data Catalog inline with BigQuery stored procedures using the catalog's APIs and Remote Functions.
The sample code (main.py and requirements.txt) can be found here. Please note:
References to <your-tag-template-id> and <your-project-id> in main.py will need to be adjusted according to your GCP environment
The Data Catalog Admin role (or similar) will need to be assigned to the service account executing the Cloud Function (by default, the Compute Engine service account), since tag template values will be updated
A tag template with the structure shown below must already exist
Once the Python code is deployed as a Cloud Function, you can create a BigQuery Remote Function using the syntax below:
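A sketch of that statement follows. The five-parameter signature shown here is hypothetical, for illustration only; the actual parameters are defined by the sample code in main.py:

```sql
CREATE OR REPLACE FUNCTION `your-project.remote_udf.update_data_catalog`(
  project_id STRING, dataset_id STRING, table_id STRING,
  updated_rows INT64, total_rows INT64)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-your-project.cloudfunctions.net/update-data-catalog');
```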
See a remote function being used to update Data Catalog. The picture below demonstrates how to call the function, passing five parameters to it.
Below you can see how the tag template “BQ Remote Functions Demo Tag Template” gets updated after the function execution.
You can now use this function inside a BigQuery stored procedure that performs a full ELT job. In the example below, the remote_udf.test_tag table is updated by the stored procedure, and the number of updated rows plus the total number of rows in remote_udf.test_tag are stored in Data Catalog:
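A minimal sketch of such a procedure, reusing the hypothetical function signature above (the status column and filter are illustrative):

```sql
CREATE OR REPLACE PROCEDURE `your-project.remote_udf.run_elt_job`()
BEGIN
  DECLARE updated_rows INT64;
  DECLARE total_rows INT64;

  -- Transformation step of the ELT job (illustrative)
  UPDATE `your-project.remote_udf.test_tag`
  SET status = 'processed'
  WHERE status = 'new';
  SET updated_rows = @@row_count;

  SET total_rows = (SELECT COUNT(*) FROM `your-project.remote_udf.test_tag`);

  -- Push the row counts to Data Catalog via the remote function
  SELECT `your-project.remote_udf.update_data_catalog`(
    'your-project', 'remote_udf', 'test_tag', updated_rows, total_rows);
END;
```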
Check the full demo video here.
Event-driven architectures with Pub/Sub updates from SQL
Pub/Sub is used for streaming analytics and data integration pipelines to ingest and distribute data. It's equally effective as a messaging-oriented middleware for service integration or as a queue to parallelize tasks.
What if you need to trigger an event by posting a message into a Pub/Sub topic via BigQuery SQL? Here is an example:
The sample code (main.py and requirements.txt) can be found here. Please note:
References to <change-me> in main.py will need to be adjusted according to your GCP environment to reflect your project_id and topic_id
The service account executing the Cloud Function (by default the compute engine service account) will need to have permissions to post a message into a Pub/Sub topic
Once the Python code is deployed as a Cloud Function, you can create a BigQuery Remote Function using the syntax below:
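A minimal sketch of that statement (project, dataset, connection, and endpoint names are placeholders):

```sql
CREATE OR REPLACE FUNCTION `your-project.remote_udf.publish_message`(message STRING)
RETURNS STRING
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-your-project.cloudfunctions.net/pubsub-publish');
```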
A few screenshots of a remote function being used to post a message into a Pub/Sub topic can be found below:
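For illustration, posting a message from SQL is then a single function call (the message text is arbitrary):

```sql
SELECT `your-project.remote_udf.publish_message`('order_123 is ready for processing');
```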
Data science with Vertex AI models called from SQL
Vertex AI brings together the Google Cloud services for building ML under one unified UI and API.
What if you need to call online predictions from Vertex AI models via BigQuery SQL? Here is an example:
The sample code (main.py and requirements.txt) can be found here. Please note:
References to <change-me> in main.py will need to be adjusted according to your GCP environment to reflect your project_id, location, and model_endpoint
The service account executing the Cloud Function (by default, the Compute Engine service account) will need permissions to execute Vertex AI models. The "AI Platform Developer" role should be enough
Once the Python code is deployed as a Cloud Function, you can create a BigQuery Remote Function using the syntax below:
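A sketch of that statement follows. The input parameters and return type are hypothetical, chosen to match the penguin-weight example described below; the actual signature is defined by the sample code in main.py:

```sql
CREATE OR REPLACE FUNCTION `your-project.remote_udf.predict_penguin_weight`(
  species STRING, island STRING, sex STRING, culmen_length FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION `your-project.us.my_connection`
OPTIONS (endpoint = 'https://us-central1-your-project.cloudfunctions.net/vertex-predict');
```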
The example function above will predict penguin weights based on inputs such as species, island, sex, length and other parameters.
A few screenshots of a remote function being used can be found below:
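As an illustration (table and column names are hypothetical), predictions can be generated for an entire table in one query:

```sql
SELECT
  species, island, sex, culmen_length,
  `your-project.remote_udf.predict_penguin_weight`(species, island, sex, culmen_length)
    AS predicted_weight
FROM `your-project.remote_udf.penguins`;
```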
Calling external APIs in real-time
Another common use case is BigQuery data enrichment using external APIs to obtain the latest stock price data, weather updates, or geocoding information. Depending on the external service in use, you can deploy the client code as a Cloud Function and integrate it with Remote Functions using the same methodology as in the examples covered before.
Here is a screenshot of a remote function example calling an external/public API to retrieve Brazil’s currency information:
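For illustration only, with a hypothetical function name wrapping whichever public API you choose, such a call could look like:

```sql
SELECT `your-project.remote_udf.get_country_currency`('Brazil') AS currency_info;
```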
In summary
A BigQuery remote function lets you incorporate GoogleSQL functionality with software outside of BigQuery by providing a direct integration with Cloud Functions and Cloud Run.
Hopefully this blog post sparked some ideas on how to leverage this super powerful BigQuery feature and enrich your BigQuery data.