Schedule Dataflow batch jobs with Cloud Scheduler
Contributed by Google employees.
Dataflow is a managed service for handling streaming jobs and batch jobs. You can typically launch a streaming job and not worry about operating it afterwards. However, for your batch jobs, you often need to trigger them based on certain conditions.
In this tutorial, you learn how to set up a Cloud Scheduler job to trigger to Dataflow batch jobs.
You can find the code for this tutorial in the associated GitHub repository.
Dataflow templates
To be able to run your Dataflow jobs on a regular basis, you need to build your Dataflow templates.
Follow the instructions in the Dataflow documentation to create your templates and save them in a Cloud Storage bucket.
Cloud Scheduler jobs
When you have your templates ready, you can set up Cloud Scheduler jobs to trigger Dataflow templates.
Here's one example that defines a Cloud Scheduler job using Terraform:
resource "google_cloud_scheduler_job" "scheduler" {
name = "scheduler-demo"
schedule = "0 0 * * *"
# This needs to be us-central1 even if App Engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"
http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://${var.bucket}/templates/dataflow-demo-template"
oauth_token {
service_account_email = google_service_account.cloud-scheduler-demo.email
}
# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "test-cloud-scheduler",
"parameters": {
"region": "${var.region}",
"autoscalingAlgorithm": "THROUGHPUT_BASED",
},
"environment": {
"maxWorkers": "10",
"tempLocation": "gs://${var.bucket}/temp",
"zone": "${var.region}-a"
}
}
EOT
)
}
}
Cloud Scheduler jobs need to be created in the region in which you have set up App engine. In your
Terraform script, be sure to assign the right value for the region field.
You need to use us-central1
if you have set up App Engine in us-central
.
Use the regional endpoint to specify the region of the Dataflow job. If you don't explicitly set the location in the request, the jobs will be created in the default region (us-central).
Create a Dataflow pipeline with Cloud Build
Follow these instructions to create a sample Dataflow pipeline with Cloud Build.
Open Cloud Shell and clone the repository:
git clone https://github.com/GoogleCloudPlatform/community cd community/tutorials/schedule-dataflow-jobs-with-cloud-scheduler/scheduler-dataflow-demo/
Create a bucket in Cloud Storage, which will store Terraform states and Dataflow templates:
export BUCKET=[YOUR_BUCKET_NAME] gsutil mb -p ${GOOGLE_CLOUD_PROJECT} gs://${BUCKET}
Replace
[YOUR_BUCKET_NAME]
with your own choice.${GOOGLE_CLOUD_PROJECT}
is predefined in Cloud Shell for the project ID.Create a backend for Terraform to store the states of Google Cloud resources:
cd terraform cat > backend.tf << EOF terraform { backend "gcs" { bucket = "${BUCKET}" prefix = "terraform/state" } } EOF
Follow these instructions to set up App Engine, which is needed to set up Cloud Scheduler jobs.
Cloud Scheduler jobs must be created in the same region as App engine.
Set the region:
export REGION=us-central1
You need to set the region to be
us-central1
, even though the region is shown asus-central
in some parts of the Cloud Console interface.Follow these instructions to give the Cloud Build service account the following roles:
- Cloud Scheduler Admin
- Service Account Admin
- Service Account User
Project IAM Admin
Verify in Cloud Console that all the roles are enabled.
Submit a Cloud Build job to create the resources:
cd .. gcloud builds submit --config=cloudbuild.yaml \ --substitutions=_BUCKET=${BUCKET},_REGION=${REGION},_PROJECT_ID=${GOOGLE_CLOUD_PROJECT} .
The job will run based on the schedule you defined in the Terraform script.
You can manually run the Cloud Scheduler job by using the Cloud Console interface and watch it trigger your Dataflow batch job.
You can check the status of jobs in the Cloud Console.
Cleaning up
Because this tutorial uses multiple Google Cloud components, be sure to delete the associated resources when you are done.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies. Java is a registered trademark of Oracle and/or its affiliates.