Scheduling SQL scripts and notebooks lets you operationalize the assets created in the Data exploration workbench (Explore).
Scheduling a SQL script or a notebook creates a Dataplex task that runs at a predetermined time, either once or on a recurring basis. Each run creates a job that captures the execution metadata, such as the start and end time, the execution status, logs, and generated outputs.
When a run is triggered for a scheduled SQL script or notebook, the current version of that SQL script or notebook is executed. This means that if you update the content after creating the schedule, the update is reflected in subsequent scheduled runs.
Costs
Scheduled SQL scripts and scheduled notebooks trigger job execution using Dataproc Serverless. Usage is charged according to the Dataproc pricing model, and the charges appear under Dataproc instead of Dataplex.
Before you begin
Review and complete the prerequisites described in the following documents:
- The Before you begin section of Use data exploration workbench
- The Before you begin section of Schedule custom Spark and SparkSQL tasks

If a SQL script or notebook schedule uses the BigQuery API, then you must grant the service account the Service Usage Consumer role (roles/serviceusage.serviceUsageConsumer).
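For example, you can grant this role on the project with the following command, where PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders for your own values:

```
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/serviceusage.serviceUsageConsumer"
```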
Known limitations
- When scheduling a notebook using the Google Cloud console, additional Python packages configured in the selected environment won't be available at runtime for the scheduled notebook. To make them available at runtime, you must create a notebook schedule referencing a container image with the required additional packages. For more information, see Schedule notebooks with custom packages.
- Using the Google Cloud console to reference custom container images when scheduling notebooks is not supported. Instead, use the gcloud CLI.
- You cannot edit a schedule once created. To update the schedule parameters, you must recreate the schedule with new parameters.
- You cannot define a specific version of a SQL script or notebook to be used in a schedule. When the schedule is triggered, the current version of the SQL script or notebook is used.
- When scheduling SQL scripts and notebooks using the Google Cloud console, the default subnetwork is used. To specify another network or a subnetwork, use the gcloud CLI.
Additionally, review the known limitations of Explore.
Create and manage schedules for SQL scripts
Runtime configuration for scheduled SQL scripts
When a SQL script is scheduled, it runs as a Dataplex task. When a schedule is created using the Google Cloud console, the corresponding task runtime parameters for scheduled SQL scripts are inherited from the environment of that SQL script based on the following mapping:
| Environment configuration parameter | Scheduled SQL script task configuration parameter |
|---|---|
| Primary disk size | Container image / Properties: spark.dataproc.driver.disk.size, spark.dataproc.executor.disk.size |
| Number of nodes | Batch / Executors count |
| Maximum number of nodes | Batch / Max executors count |
| Image version | Not inherited |
| JAR files | Container image / Java JARs |
| Python packages | Not inherited |
| Additional properties | Container image / Properties |
Schedule a SQL script
Console
1. In the Google Cloud console, go to the Dataplex Explore page.
2. Select the SQL script you want to schedule, either by selecting a single SQL script from the Spark SQL Scripts list or by opening a SQL script in the Spark SQL Editor.
3. In the Schedule menu, click Create Schedule.
4. In the Schedule name field, enter a name for the scheduled instance.
5. In the Schedule Options section, select whether you want to run the SQL script only once or on a recurring basis:
   - If you select Run Once, specify whether you want to run the SQL script immediately or at a scheduled time, using the options in the Start menu. For a scheduled run, specify the start date and run time.
   - If you select Repeat, specify whether you want to run the SQL script daily, weekly, monthly, or on a custom schedule specified using a cron expression. Also specify the start date and run time for the initial scheduled run.
6. In the Destination for results section, do the following:
   1. In the Output folder name field, click Browse and select a folder in a Cloud Storage bucket where you want the script outputs to be stored.
   2. Click Select.
   3. Select the Script output format. The supported formats are CSV, JSON, ORC, and Parquet.
7. In the Scheduled script credentials section, select a service account from the User service account menu.
8. Click Schedule.
gcloud
For information about scheduling SQL scripts using the gcloud CLI, see Schedule a SparkSQL task.
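As an illustration only, a recurring schedule for a Spark SQL script stored in Cloud Storage might look like the following sketch. The project, lake, cron expression, and gs:// paths are placeholders, and the flag set shown here is a subset of what the linked page documents:

```
gcloud dataplex tasks create sample-sql-task \
    --project=sample-project \
    --location=us-central1 \
    --lake=my-lake \
    --trigger-type=RECURRING \
    --trigger-schedule="0 6 * * *" \
    --execution-service-account=associated-service-account@sample-project.iam.gserviceaccount.com \
    --spark-sql-script-file="gs://my-bucket/queries/my-script.sql" \
    --execution-args=^::^TASK_ARGS="--output_location,gs://my-bucket/Demo"
```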
REST
Use the API Explorer to create a task.
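As a rough sketch of the equivalent REST request (the JSON field names follow the tasks.create resource shape, and all resource names, paths, and the schedule are placeholders):

```
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks?taskId=sample-sql-task" \
    -d '{
      "triggerSpec": {"type": "RECURRING", "schedule": "0 6 * * *"},
      "executionSpec": {
        "serviceAccount": "associated-service-account@sample-project.iam.gserviceaccount.com",
        "args": {"TASK_ARGS": "--output_location,gs://my-bucket/Demo"}
      },
      "spark": {"sqlScriptFile": "gs://my-bucket/queries/my-script.sql"}
    }'
```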
Manage SQL script schedules
Console
Open the list of all SQL script schedules
In the Google Cloud console, go to the Dataplex Process page.
Click the Scheduled queries tab to view the list of SQL script schedules.
Open the list of all schedules for a specific SQL script
In the Google Cloud console, go to the Dataplex Explore page.
Select the required SQL script.
In the Schedule menu, click View Schedules.
The Scheduled queries tab opens with a list of SQL script schedules filtered by the selected SQL script.
View the details of a SQL script schedule
Go to the Scheduled queries tab and select the required SQL script schedule.
Click the Details tab to view the details of the SQL script schedule and a preview of the scheduled SQL script content.
Activate and deactivate a SQL script schedule
Go to the Scheduled queries tab and select the required SQL script schedule.
Click Disable to deactivate an active SQL script schedule.
Click Enable to activate an inactive SQL script schedule.
Delete an existing SQL script schedule
Go to the Scheduled queries tab and select the required SQL script schedule.
Click Delete to permanently delete an existing SQL script schedule.
View logs and manage a SQL script schedule
Go to the Scheduled queries tab and select the required SQL script schedule.
Click the Jobs tab, and then click the Job ID of the scheduled SQL script run you want to view.
Click View logs to view the logs related to the selected scheduled SQL script run in Cloud Logging.
Click the Dataproc Batch ID to open the Dataproc page in the Google Cloud console. Access the details of the corresponding Dataproc Serverless session.
Click Manage in Cloud Storage next to the Output label to open the Cloud Storage page in the Google Cloud console. Access the details of the corresponding Cloud Storage bucket containing the SQL script execution output.
gcloud
For information about monitoring scheduled SQL scripts using the gcloud CLI, see the gcloud tab in Monitor your task.
For information about managing scheduled SQL scripts using the gcloud CLI, see the gcloud tab in Manage the schedule.
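For a quick look from the command line, and assuming the placeholder project, lake, and task names used earlier on this page, you can list and inspect scheduled tasks with the standard gcloud dataplex commands; the linked pages remain the authoritative reference:

```
# List the Dataplex tasks (including scheduled SQL scripts) in a lake
gcloud dataplex tasks list \
    --project=sample-project --location=us-central1 --lake=my-lake

# Describe one scheduled task to view its trigger, execution spec, and status
gcloud dataplex tasks describe sample-sql-task \
    --project=sample-project --location=us-central1 --lake=my-lake
```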
REST
For information about monitoring scheduled SQL scripts using REST, see the REST tab in Monitor your task.
For information about managing scheduled SQL scripts using REST, see the REST tab in Manage the schedule.
Output of a scheduled SQL script
Output from each execution of a scheduled SQL script is stored in the Cloud Storage folder that you specified, in the following structure:
CLOUD_STORAGE_FOLDER_LOCATION/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks/TASK_ID/JOB_ID/QUERY_NUMBER
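For example, assuming the placeholder output folder and resource IDs used earlier on this page, you can list the generated output files with the gcloud storage command:

```
# List all outputs written by the scheduled SQL script task
gcloud storage ls --recursive \
    "gs://my-bucket/Demo/projects/sample-project/locations/us-central1/lakes/my-lake/tasks/sample-sql-task/"
```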
Create and manage schedules for notebooks
Runtime configuration for scheduled notebooks
When a notebook is scheduled, it runs as a Dataplex task. When a schedule is created using the Google Cloud console, the corresponding task runtime parameters for scheduled notebooks are inherited from the environment of that notebook based on the following mapping:
| Environment configuration parameter | Scheduled notebook task configuration parameter |
|---|---|
| Primary disk size | Container image / Properties: spark.dataproc.driver.disk.size, spark.dataproc.executor.disk.size |
| Number of nodes | Batch / Executors count |
| Maximum number of nodes | Batch / Max executors count |
| Image version | Not inherited |
| JAR files | Container image / Java JARs |
| Python packages | Not inherited |
| Additional properties | Container image / Properties |
Schedule notebooks with custom packages
For interactive notebooks, when configuring an environment, Dataplex lets you specify custom Python packages to be installed in user sessions provisioned for the environment. When such a notebook is scheduled, the custom Python packages configured in its environment are not available at runtime for the scheduled notebook, and the default runtime environment only has the components of Dataproc Serverless Spark runtime version 1.0. For custom Python packages to be available at runtime for scheduled notebooks, provide a custom container image with necessary custom packages.
Specifying a custom container image when scheduling a notebook using the Google Cloud console is not supported. Instead, use the gcloud CLI. For more information, see Schedule a notebook.
If you schedule a notebook with an environment that has custom packages configured, you get the following warning:
Selected environment has additional Python packages configured. These packages will not be available in the default runtime for the scheduled notebook when scheduling in the Console. To make the required additional Python packages available in the runtime, please create Notebook Schedule using gcloud command instead, referencing a container image with required additional packages.
Build a custom container image for scheduled notebooks
For information about building a container image for scheduled notebooks, and installing the required custom Python packages in the container image Dockerfile, see Build a custom container image for Dataproc Serverless for Spark.
Make sure that you have installed the following conda packages in the container image:

- gcsfs
- google-cloud-bigquery
- google-cloud-storage
- google-auth
- ipykernel
- papermill
To install the gcloud components in a container image, open the container image Dockerfile and complete the following steps:

1. Add the google-cloud-sdk package to the list of conda packages to be installed.
2. Add the following lines after the conda packages installation step:

   ```
   # (Optional) Installing gcloud components
   RUN gcloud components install alpha beta bq
   ```
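After the Dockerfile is ready, build the image and push it to a registry that the scheduled notebook task can pull from. The Artifact Registry repository and tag below are placeholder values, shown only as a sketch:

```
# Build the custom container image (assumes the Dockerfile is in the current directory)
docker build -t "us-central1-docker.pkg.dev/sample-project/my-repo/scheduled-notebook-image:1.0" .

# Push the image so that it can be referenced with --notebook-container-image
docker push "us-central1-docker.pkg.dev/sample-project/my-repo/scheduled-notebook-image:1.0"
```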
Schedule a notebook
Console
1. In the Google Cloud console, go to the Dataplex Explore page.
2. Select the notebook you want to schedule, either by selecting a single notebook in the Notebooks view or by opening a notebook in the notebook details page.
3. In the Schedule menu, click Create Schedule.
4. In the Schedule name field, enter a name for the scheduled instance.
5. In the Schedule Options section, select whether you want to run the notebook only once or on a recurring basis:
   - If you select Run Once, specify whether you want to run the notebook immediately or at a scheduled time, using the options in the Start menu. For a scheduled run, specify the start date and run time.
   - If you select Repeat, specify whether you want to run the notebook daily, weekly, monthly, or on a custom schedule specified using a cron expression. Also specify the start date and run time for the initial scheduled run.
6. In the Destination for results section, choose a location where you want the notebook output to be stored:
   1. In the Output folder name field, click Browse and select a folder in a Cloud Storage bucket where you want the notebook output to be stored.
   2. Click Select.
   3. Select the Script output format. The supported formats are CSV, JSON, ORC, and Parquet.
7. In the Scheduled notebook credentials section, select a service account from the User service account menu.
8. In the Parameters section, add execution parameters for the notebook as key-value pairs by clicking Add new.
9. Click Schedule.
gcloud
Run the gcloud CLI command described in Schedule a Spark (Java or Python) task with the following additional parameters:
| Parameter | Description |
|---|---|
| --notebook | Path to the notebook content for the input notebook. The execution args are accessible as environment variables. For example, TASK_key=value. Note: The gcloud reference documentation for scheduling notebooks as tasks mentions that the value for the --notebook parameter can be the Cloud Storage URI of the notebook file. This is not supported. You must only specify the path to the notebook content for the --notebook parameter. |
| --notebook-archive-uris | Cloud Storage URIs of archives to be extracted into the working directory of each executor. The supported file types are JAR, tar, tar.gz, tgz, and zip. |
| --notebook-file-uris | Cloud Storage URIs of files to be placed in the working directory of each executor. |
| Compute resources needed for a task when using Dataproc Serverless | |
| --notebook-batch-executors-count | Number of job executors. |
| --notebook-batch-max-executors-count | Maximum configurable executors. If notebook-batch-max-executors-count is greater than notebook-batch-executors-count, then autoscaling is enabled. |
| Container image runtime configuration | |
| --notebook-container-image | Optional: Custom container image for the job. |
| --notebook-container-image-java-jars | A list of Java JARs to add to the classpath. Valid input includes Cloud Storage URIs to JAR binaries. For example, gs://BUCKET_NAME/my/path/to/file.jar. |
| --notebook-container-image-properties | The properties to set on daemon configuration files. Property keys are specified in the format prefix:property. For example: core:hadoop.tmp.dir. For more information, see Cluster properties. |
| Cloud VPC network used to run the infrastructure | |
| --notebook-vpc-network-tags | List of network tags to apply to the job. |
| The Cloud VPC network identifier. You can specify at most one of the following. | |
| --notebook-vpc-network-name | The Cloud VPC network in which the job runs. By default, the Cloud VPC network named Default within the project is used. |
| --notebook-vpc-sub-network-name | The Cloud VPC sub-network in which the job runs. |
| Location for notebook outputs | |
| --execution-args | For notebook tasks, the following argument is mandatory and must be passed as TASK_ARGS: --execution-args=^::^TASK_ARGS="--output_location,CLOUD_STORAGE_URI_OF_OUTPUT_DIRECTORY" |
The following is an example of the gcloud command used for scheduling notebooks:

```
gcloud dataplex tasks create sample-task \
    --project=sample-project \
    --location=us-central1 \
    --lake=my-lake \
    --trigger-type=ON_DEMAND \
    --notebook="projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb" \
    --execution-service-account=associated-service-account@sample-project.iam.gserviceaccount.com \
    --execution-args=^::^TASK_ARGS="--output_location,gs://my-bucket/Demo" \
    --notebook-batch-executors-count=2 \
    --notebook-batch-max-executors-count=5 \
    --notebook-container-image=container-image \
    --notebook-container-image-java-jars=java-jar1,java-jar-2
```
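The preceding example uses an on-demand trigger. For a recurring schedule, pass a cron expression through the trigger flags. The following is a sketch, assuming the same placeholder project, lake, and notebook, that runs the notebook daily at 06:00 UTC:

```
gcloud dataplex tasks create sample-scheduled-notebook \
    --project=sample-project \
    --location=us-central1 \
    --lake=my-lake \
    --trigger-type=RECURRING \
    --trigger-schedule="0 6 * * *" \
    --notebook="projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb" \
    --execution-service-account=associated-service-account@sample-project.iam.gserviceaccount.com \
    --execution-args=^::^TASK_ARGS="--output_location,gs://my-bucket/Demo"
```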
REST
Use the API Explorer to create a task.
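As with SQL scripts, the request body follows the tasks.create resource shape. The following is a rough sketch for a notebook task, with placeholder names throughout:

```
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks?taskId=sample-notebook-task" \
    -d '{
      "triggerSpec": {"type": "RECURRING", "schedule": "0 6 * * *"},
      "executionSpec": {
        "serviceAccount": "associated-service-account@sample-project.iam.gserviceaccount.com",
        "args": {"TASK_ARGS": "--output_location,gs://my-bucket/Demo"}
      },
      "notebook": {"notebook": "projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb"}
    }'
```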
Manage notebook schedules
Console
Open the list of all notebook schedules
In the Google Cloud console, go to the Dataplex Process page.
Click the Scheduled notebooks tab to view the list of notebook schedules.
Open the list of all schedules for a specific notebook
In the Google Cloud console, go to the Dataplex Explore page.
Select the required notebook.
In the Schedule menu, click View Schedules.
The Scheduled notebooks tab opens with a list of notebook schedules filtered by the selected notebook.
View the details of a notebook schedule
Go to the Scheduled notebooks tab and select the required notebook schedule.
Click the Details tab to view the details of the notebook schedule and a preview of the scheduled notebook content.
Activate and deactivate a notebook schedule
Go to the Scheduled notebooks tab and select the required notebook schedule.
Click Disable to deactivate an active notebook schedule.
Click Enable to activate an inactive notebook schedule.
Delete an existing notebook schedule
Go to the Scheduled notebooks tab and select the required notebook schedule.
Click Delete to permanently delete an existing notebook schedule.
View logs and manage a notebook schedule
Go to the Scheduled notebooks tab and select the required notebook schedule.
Click the Jobs tab, and then click the Job ID of the scheduled notebook run you want to view.
In the Preview Output section, review the output of the notebook run.
Click View logs to view the logs related to the selected scheduled notebook run in Cloud Logging.
Click the Dataproc Batch ID to open the Dataproc page in the Google Cloud console. Access the details of the corresponding Dataproc Serverless session.
Click Manage in Cloud Storage next to the Output label to open the Cloud Storage page in the Google Cloud console. Access the details of the corresponding Cloud Storage bucket containing the notebook execution output.
gcloud
For information about monitoring scheduled notebooks using the gcloud CLI, see the gcloud tab in Monitor your task.
For information about managing scheduled notebooks using the gcloud CLI, see the gcloud tab in Manage the schedule.
REST
For information about monitoring scheduled notebooks using REST, see the REST tab in Monitor your task.
For information about managing scheduled notebooks using REST, see the REST tab in Manage the schedule.
Output of a scheduled notebook
Output from each run of a scheduled notebook is stored in the Cloud Storage folder that you specified, in the following structure:
CLOUD_STORAGE_FOLDER_LOCATION/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks/TASK_ID/JOB_ID/QUERY_NUMBER
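For example, assuming the placeholder bucket and resource IDs used earlier on this page, you can copy the outputs of a specific job run locally for inspection (JOB_ID is the ID shown on the Jobs tab):

```
# Download the outputs of one scheduled notebook run
gcloud storage cp --recursive \
    "gs://my-bucket/Demo/projects/sample-project/locations/us-central1/lakes/my-lake/tasks/sample-task/JOB_ID/" .
```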