Schedule SQL scripts and notebooks

Scheduling SQL scripts and notebooks lets you operationalize the assets created in the Data exploration workbench (Explore).

Scheduling a SQL script or a notebook creates a Dataplex task that runs at a predetermined time, either once or on a recurring basis. Each run creates a job that captures execution metadata, such as the execution start and end time, execution status, logs, and generated outputs.

When a run is triggered for a scheduled SQL script or notebook, the current version of that SQL script or notebook is executed. This means that if you schedule a notebook or SQL script and later update its content, the update is reflected in subsequent scheduled runs.

Costs

Scheduled SQL scripts and scheduled notebooks trigger job execution on Dataproc Serverless. Usage is charged according to the Dataproc pricing model, and charges appear under Dataproc instead of Dataplex.

Before you begin

Review and complete the prerequisites described in the following documents:

If a SQL script or notebook schedule uses the BigQuery API, then you must grant the service account the Service Usage Consumer role (roles/serviceusage.serviceUsageConsumer).
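
For example, a minimal sketch of granting that role with the gcloud CLI, where PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders for your project and the service account used by the schedule:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
      --role="roles/serviceusage.serviceUsageConsumer"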

Known limitations

  • When scheduling a notebook using the Google Cloud console, additional Python packages configured in the selected environment will not be available at runtime for the scheduled notebook. To make them available at runtime, you must create a notebook schedule referencing a container image with the required additional packages. For more information, see Schedule notebooks with custom packages.

  • Using the Google Cloud console to reference custom container images when scheduling notebooks is not supported. Instead, use the gcloud CLI.

  • You cannot edit a schedule once created. To update the schedule parameters, you must recreate the schedule with new parameters.

  • You cannot define a specific version of a SQL script or notebook to be used in a schedule. When the schedule is triggered, the current version of the SQL script or notebook is used.

  • When scheduling SQL scripts and notebooks using the Google Cloud console, the default subnetwork is used. To specify another network or a subnetwork, use the gcloud CLI.

Additionally, review the known limitations of Explore.

Create and manage schedules for SQL scripts

Runtime configuration for scheduled SQL scripts

When a SQL script is scheduled, it runs as a Dataplex task. When a schedule is created using the Google Cloud console, the corresponding task runtime parameters for scheduled SQL scripts are inherited from the environment of that SQL script based on the following mapping:

| Environment configuration parameter | Scheduled SQL script task configuration parameter |
| --- | --- |
| Primary disk size | Container image / Properties: spark.dataproc.driver.disk.size, spark.dataproc.executor.disk.size |
| Number of nodes | Batch / Executors count |
| Maximum number of nodes | Batch / Max executors count |
| Image version | Not inherited |
| JAR files | Container image / Java JARs |
| Python packages | Not inherited |
| Additional properties | Container image / Properties |

Schedule a SQL script

Console

  1. In the Google Cloud console, go to the Dataplex Explore page.

    Go to Dataplex Explore

  2. Select a SQL script you want to schedule, either by selecting a single SQL script from the Spark SQL Scripts list or by opening a SQL script in Spark SQL Editor.

  3. In the Schedule menu, click Create Schedule.

  4. In the Schedule name field, enter a name for the scheduled instance.

  5. In the Schedule Options section, select whether you want to run the SQL script only once or on a recurring basis:

    1. If you select Run Once, specify whether you want to run the SQL script immediately or at a scheduled time, using the options in the Start menu. For a scheduled run, specify the start date and run time.

    2. If you select Repeat, specify whether you want to run the SQL script daily, weekly, monthly, or on a custom schedule specified using a cron expression. Additionally, specify the start date and run time for the initial scheduled run.

  6. In the Destination for results section, do the following:

    1. In the Output folder name field, click Browse and select a folder in a Cloud Storage bucket where you want the script outputs to be stored.

    2. Click Select.

    3. Select the Script output format. The supported formats are CSV, JSON, ORC, and Parquet.

  7. In the Scheduled script credentials section, select a service account from the User service account menu.

  8. Click Schedule.

gcloud

For information about scheduling SQL scripts using the gcloud CLI, see Schedule a SparkSQL task.
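
As a rough sketch only, a recurring Spark SQL task created with the gcloud CLI might look like the following; the --spark-sql-script-file flag name and all values here are assumptions, so verify the exact flags in that guide:

    # Hedged sketch: flag names and values are assumptions; see the linked guide.
    gcloud dataplex tasks create my-sql-task \
      --project=my-project-id \
      --location=us-central1 \
      --lake=my-lake \
      --trigger-type=RECURRING \
      --trigger-schedule="0 5 * * *" \
      --execution-service-account=my-service-account@my-project-id.iam.gserviceaccount.com \
      --spark-sql-script-file=gs://my-bucket/scripts/my-script.sql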

REST

Use the API Explorer to create a task.

Manage SQL script schedules

Console

Open the list of all SQL script schedules

  1. In the Google Cloud console, go to the Dataplex Process page.

  2. Click the Scheduled queries tab to view the list of SQL script schedules.

Open the list of all schedules for a specific SQL script

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. Select the required SQL script.

  3. In the Schedule menu, click View Schedules.

    The Scheduled queries tab opens with a list of SQL script schedules filtered by the selected SQL script.

View the details of a SQL script schedule

  1. Go to the Scheduled queries tab and select the required SQL script schedule.

  2. Click the Details tab to view the details of the SQL script schedule and a preview of the scheduled SQL script content.

Activate and deactivate a SQL script schedule

  1. Go to the Scheduled queries tab and select the required SQL script schedule.

  2. Click Disable to deactivate an active SQL script schedule.

  3. Click Enable to activate an inactive SQL script schedule.

Delete an existing SQL script schedule

  1. Go to the Scheduled queries tab and select the required SQL script schedule.

  2. Click Delete to permanently delete an existing SQL script schedule.

View logs and manage a SQL script schedule

  1. Go to the Scheduled queries tab and select the required SQL script schedule.

  2. Click the Jobs tab, and then click the Job ID of the scheduled SQL script run you want to view.

  3. Click View logs to view the logs related to the selected scheduled SQL script run in Cloud Logging.

  4. Click the Dataproc Batch ID to open the Dataproc page in the Google Cloud console. Access the details of the corresponding Dataproc Serverless session.

  5. Click Manage in Cloud Storage next to the Output label to open the Cloud Storage page in the Google Cloud console. Access the details of the corresponding Cloud Storage bucket containing the SQL script execution output.

gcloud

For information about monitoring scheduled SQL scripts using the gcloud CLI, see the gcloud tab in Monitor your task.

For information about managing scheduled SQL scripts using the gcloud CLI, see the gcloud tab in Manage the schedule.
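
For example, a minimal sketch of inspecting the underlying Dataplex tasks with the gcloud CLI; the project, location, lake, and task names are placeholders:

    # List the Dataplex tasks (including scheduled SQL scripts) in a lake.
    gcloud dataplex tasks list \
        --project=my-project-id --location=us-central1 --lake=my-lake

    # View the details of one scheduled task.
    gcloud dataplex tasks describe my-sql-task \
        --project=my-project-id --location=us-central1 --lake=my-lake

    # Delete the task, which removes its schedule.
    gcloud dataplex tasks delete my-sql-task \
        --project=my-project-id --location=us-central1 --lake=my-lake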

REST

For information about monitoring scheduled SQL scripts using REST, see the REST tab in Monitor your task.

For information about managing scheduled SQL scripts using REST, see the REST tab in Manage the schedule.

Output of a scheduled SQL script

Output from each execution of a scheduled SQL script is stored in the Cloud Storage folder location that you specify, in the following structure:

CLOUD_STORAGE_FOLDER_LOCATION/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks/TASK_ID/JOB_ID/QUERY_NUMBER

Create and manage schedules for notebooks

Runtime configuration for scheduled notebooks

When a notebook is scheduled, it runs as a Dataplex task. When a schedule is created using the Google Cloud console, the corresponding task runtime parameters for scheduled notebooks are inherited from the environment of that notebook based on the following mapping:

| Environment configuration parameter | Scheduled notebook task configuration parameter |
| --- | --- |
| Primary disk size | Container image / Properties: spark.dataproc.driver.disk.size, spark.dataproc.executor.disk.size |
| Number of nodes | Batch / Executors count |
| Maximum number of nodes | Batch / Max executors count |
| Image version | Not inherited |
| JAR files | Container image / Java JARs |
| Python packages | Not inherited |
| Additional properties | Container image / Properties |

Schedule notebooks with custom packages

For interactive notebooks, when configuring an environment, Dataplex lets you specify custom Python packages to be installed in the user sessions provisioned for that environment. When such a notebook is scheduled, however, those custom Python packages are not available at runtime for the scheduled notebook; the default runtime environment has only the components of the Dataproc Serverless Spark runtime version 1.0. For custom Python packages to be available at runtime for scheduled notebooks, provide a custom container image with the necessary packages.

Specifying a custom container image when scheduling a notebook using the Google Cloud console is not supported. Instead, use the gcloud CLI. For more information, see Schedule a notebook.

If you schedule a notebook with an environment that has custom packages configured, you get the following warning:

Selected environment  has additional Python packages configured. 
These packages will not be available in the default runtime for the scheduled notebook 
when scheduling in the Console. To make the required additional Python packages 
available in the runtime, please create Notebook Schedule using gcloud command 
instead, referencing a container image with required additional packages.

Build a custom container image for scheduled notebooks

For information about building a container image for scheduled notebooks, and installing the required custom Python packages in the container image Dockerfile, see Build a custom container image for Dataproc Serverless for Spark.

Make sure that you have installed the following conda packages in the container image:

  • gcsfs
  • google-cloud-bigquery
  • google-cloud-storage
  • google-auth
  • ipykernel
  • papermill

To install the gcloud CLI components in the container image, open the container image Dockerfile and complete the following steps (a combined Dockerfile sketch follows the list):

  1. Add the google-cloud-sdk package to the list of conda packages to be installed.

  2. Add the following lines after the conda packages installation step:

    # (Optional) Installing gcloud components
    RUN gcloud components install alpha beta bq gsutil
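
Putting these steps together, the following is a minimal, hedged Dockerfile sketch. The base image name and the Miniconda location (/opt/miniconda3) are assumptions for illustration; adapt them to the image you build by following the Dataproc Serverless guide referenced earlier.

    # Hypothetical base image with Miniconda preinstalled at /opt/miniconda3 (an assumption);
    # replace it with the image you build by following the Dataproc Serverless guide.
    FROM us-central1-docker.pkg.dev/my-project-id/my-repo/serverless-spark-base:latest

    ENV CONDA_HOME=/opt/miniconda3
    ENV PATH=${CONDA_HOME}/bin:${PATH}

    # Conda packages required for scheduled notebooks, plus google-cloud-sdk for the gcloud CLI.
    RUN ${CONDA_HOME}/bin/conda install -y -c conda-forge \
          gcsfs \
          google-cloud-bigquery \
          google-cloud-storage \
          google-auth \
          ipykernel \
          papermill \
          google-cloud-sdk

    # (Optional) Installing gcloud components
    RUN gcloud components install alpha beta bq gsutil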
    

Schedule a notebook

Console

  1. In the Google Cloud console, go to the Dataplex Explore page.

    Go to Dataplex Explore

  2. Select a notebook you want to schedule, either by selecting a single notebook in the Notebooks view or by opening a notebook in the notebook details page.

  3. In the Schedule menu, click Create Schedule.

  4. In the Schedule name field, enter a name for the scheduled instance.

  5. In the Schedule Options section, select whether you want to run the notebook only once or on a recurring basis:

    1. If you select Run Once, specify whether you want to run the notebook immediately or at a scheduled time, using the options in the Start menu. For a scheduled run, specify the start date and run time.

    2. If you select Repeat, specify whether you want to run the notebook daily, weekly, monthly, or on a custom schedule specified using a cron expression. Additionally, specify the start date and run time for the initial scheduled run.

  6. In the Destination for results section, choose a location where you want the notebook output to be stored:

    1. In the Output folder name field, click Browse and select a folder in a Cloud Storage bucket where you want the notebook output to be stored.

    2. Click Select.

    3. Select the Script output format. The supported formats are CSV, JSON, ORC, and Parquet.

  7. In the Scheduled notebook credentials section, select a service account from the User service account menu.

  8. In the Parameters section, add execution parameters for the notebook as key-value pairs by clicking Add new.

  9. Click Schedule.

gcloud

Run the gcloud CLI command described in Schedule a Spark (Java or Python) task with the following additional parameters:

| Parameter | Description |
| --- | --- |
| --notebook | Path to the notebook content for the input notebook. The execution arguments are accessible as environment variables, for example, TASK_key=value. Note: The gcloud reference documentation for scheduling notebooks as tasks mentions that the value of the --notebook parameter can be the Cloud Storage URI of the notebook file. This is not supported. You must specify only the path to the notebook content for the --notebook parameter. |
| --notebook-archive-uris | Cloud Storage URIs of archives to be extracted into the working directory of each executor. The supported file types are JAR, tar, tar.gz, tgz, and zip. |
| --notebook-file-uris | Cloud Storage URIs of files to be placed in the working directory of each executor. |
| Compute resources needed for a task when using Dataproc Serverless | |
| --notebook-batch-executors-count | Number of job executors. |
| --notebook-batch-max-executors-count | Maximum number of configurable executors. If notebook-batch-max-executors-count is greater than notebook-batch-executors-count, then autoscaling is enabled. |
| Container image runtime configuration | |
| --notebook-container-image | Optional: Custom container image for the job. |
| --notebook-container-image-java-jars | A list of Java JARs to add to the classpath. Valid input includes Cloud Storage URIs to JAR binaries, for example, gs://BUCKET_NAME/my/path/to/file.jar. |
| --notebook-container-image-properties | The properties to set on daemon configuration files. Property keys are specified in the format prefix:property, for example, core:hadoop.tmp.dir. For more information, see Cluster properties. |
| Cloud VPC network used to run the infrastructure | |
| --notebook-vpc-network-tags | List of network tags to apply to the job. |
| The Cloud VPC network identifier. You can specify at most one of the following. | |
| --notebook-vpc-network-name | The Cloud VPC network in which the job runs. By default, the Cloud VPC network named Default within the project is used. |
| --notebook-vpc-sub-network-name | The Cloud VPC subnetwork in which the job runs. |
| Location for notebook outputs | |
| --execution-args | For notebook tasks, the following argument is mandatory and must be passed as TASK_ARGS: --execution-args=^::^TASK_ARGS="--output_location,CLOUD_STORAGE_URI_OF_OUTPUT_DIRECTORY" |

The following is an example of the gcloud command used for scheduling notebooks:

    gcloud dataplex tasks create sample-task \
      --project=sample-project \
      --location=us-central1 \
      --lake=my-lake \
      --trigger-type=ON_DEMAND \
      --notebook="projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb" \
      --execution-service-account=associated-service-account@sample-project.iam.gserviceaccount.com \
      --execution-args=^::^TASK_ARGS="--output_location,gs://my-bucket/Demo" \
      --notebook-batch-executors-count=2 \
      --notebook-batch-max-executors-count=5 \
      --notebook-container-image=container-image \
      --notebook-container-image-java-jars=java-jar1,java-jar-2
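
A hedged variant for a recurring schedule that runs on a specific subnetwork might look like the following; the cron expression and subnetwork name are placeholders, and the --trigger-schedule flag is assumed from the Spark task scheduling guide, so verify it there:

    # Hedged sketch: --trigger-schedule and my-subnetwork are assumptions to verify.
    gcloud dataplex tasks create sample-task-recurring \
      --project=sample-project \
      --location=us-central1 \
      --lake=my-lake \
      --trigger-type=RECURRING \
      --trigger-schedule="0 5 * * *" \
      --notebook="projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb" \
      --execution-service-account=associated-service-account@sample-project.iam.gserviceaccount.com \
      --execution-args=^::^TASK_ARGS="--output_location,gs://my-bucket/Demo" \
      --notebook-vpc-sub-network-name=my-subnetwork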

REST

Use the API Explorer to create a task.
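
As a rough, hedged sketch of such a request (not an authoritative payload), a curl call to the tasks.create method might resemble the following; verify the field names against the Dataplex v1 Task resource reference before using it:

    # Hedged sketch: verify field names against the Dataplex v1 tasks.create reference.
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://dataplex.googleapis.com/v1/projects/my-project-id/locations/us-central1/lakes/my-lake/tasks?taskId=my-notebook-task" \
      -d '{
        "triggerSpec": {"type": "RECURRING", "schedule": "0 5 * * *"},
        "executionSpec": {
          "serviceAccount": "my-service-account@my-project-id.iam.gserviceaccount.com",
          "args": {"TASK_ARGS": "--output_location,gs://my-bucket/Demo"}
        },
        "notebook": {
          "notebook": "projects/my-project-id/locations/us-central1/lakes/my-lake/content/my-notebook.ipynb"
        }
      }'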

Manage notebook schedules

Console

Open the list of all notebook schedules

  1. In the Google Cloud console, go to the Dataplex Process page.

  2. Click the Scheduled notebooks tab to view the list of notebook schedules.

Open the list of all schedules for a specific notebook

  1. In the Google Cloud console, go to the Dataplex Explore page.

  2. Select the required notebook.

  3. In the Schedule menu, click View Schedules.

    The Scheduled notebooks tab opens with a list of notebook schedules filtered by the selected notebook.

View the details of a notebook schedule

  1. Go to the Scheduled notebooks tab and select the required notebook schedule.

  2. Click the Details tab to view the details of the notebook schedule and a preview of the scheduled notebook content.

Activate and deactivate a notebook schedule

  1. Go to the Scheduled notebooks tab and select the required notebook schedule.

  2. Click Disable to deactivate an active notebook schedule.

  3. Click Enable to activate an inactive notebook schedule.

Delete an existing notebook schedule

  1. Go to the Scheduled notebooks tab and select the required notebook schedule.

  2. Click Delete to permanently delete an existing notebook schedule.

View logs and manage a notebook schedule

  1. Go to the Scheduled notebooks tab and select the required notebook schedule.

  2. Click the Jobs tab, and then click the Job ID of the scheduled notebook run you want to view.

  3. In the Preview Output section, review the output of the notebook run.

  4. Click View logs to view the logs related to the selected scheduled notebook run in Cloud Logging.

  5. Click the Dataproc Batch ID to open the Dataproc page in the Google Cloud console. Access the details of the corresponding Dataproc Serverless session.

  6. Click Manage in Cloud Storage next to the Output label to open the Cloud Storage page in the Google Cloud console. Access the details of the corresponding Cloud Storage bucket containing the notebook execution output.

gcloud

For information about monitoring scheduled notebooks using the gcloud CLI, see the gcloud tab in Monitor your task.

For information about managing scheduled notebooks using the gcloud CLI, see the gcloud tab in Manage the schedule.

REST

For information about monitoring scheduled notebooks using REST, see the REST tab in Monitor your task.

For information about managing scheduled notebooks using REST, see the REST tab in Manage the schedule.

Output of a scheduled notebook

Output from each run of a scheduled notebook is stored in the Cloud Storage folder location that you specify, in the following structure:

CLOUD_STORAGE_FOLDER_LOCATION/projects/PROJECT_ID/locations/LOCATION_ID/lakes/LAKE_ID/tasks/TASK_ID/JOB_ID/QUERY_NUMBER
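
For example, assuming the values from the earlier gcloud command (output folder gs://my-bucket/Demo, project sample-project, lake my-lake, task sample-task), you could browse a run's output with gsutil; the job ID and query number segments are generated for each run:

    # Illustrative only: the trailing JOB_ID/QUERY_NUMBER segments are generated per run.
    gsutil ls -r gs://my-bucket/Demo/projects/sample-project/locations/us-central1/lakes/my-lake/tasks/sample-task/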

What's next