Workflow using Cloud Scheduler

In this document, you use the following billable components of Google Cloud:

  • Dataproc
  • Compute Engine
  • Cloud Scheduler

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Set up your project

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc, Compute Engine, and Cloud Scheduler APIs.

    Enable the APIs

  5. Install the Google Cloud CLI.
  6. To initialize the gcloud CLI, run the following command:

    gcloud init
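
If you prefer to work entirely from the command line, the required APIs can also be enabled with the gcloud CLI. A minimal sketch, assuming the gcloud CLI is already initialized and pointed at your project:

gcloud services enable \
    dataproc.googleapis.com \
    compute.googleapis.com \
    cloudscheduler.googleapis.com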

Create a custom role

  1. Open the IAM & Admin → Roles page in the Google Cloud console.
    1. Click CREATE ROLE to open the Create Role page.
    2. Complete the Title, Description, ID, and Launch stage fields. Suggestion: Use "Dataproc Workflow Template Create" as the role title.
    3. Click ADD PERMISSIONS.
      1. In the Add Permissions form, click Filter, then select "Permission". Complete the filter to read "Permission: dataproc.workflowTemplates.instantiate".
      2. Click the checkbox to the left of the listed permission, then click ADD.
    4. On the Create Role page, click ADD PERMISSIONS again and repeat the previous sub-steps to add the "iam.serviceAccounts.actAs" permission to the custom role. The Create Role page now lists two permissions.
    5. Click CREATE on the Create Role page. The custom role is listed on the Roles page.
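
As an alternative to the console steps above, you can create a similar custom role with the gcloud CLI. This is a minimal sketch, not part of the console flow: the role ID dataprocWorkflowTemplateCreate is an arbitrary example, and your-project-id is a placeholder for your project ID.

gcloud iam roles create dataprocWorkflowTemplateCreate \
    --project=your-project-id \
    --title="Dataproc Workflow Template Create" \
    --permissions=dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs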

Create a service account

  1. In the Google Cloud console, go to the Service Accounts page.

    Go to Service Accounts

  2. Select your project.

  3. Click Create Service Account.

  4. In the Service account name field, enter the name workflow-scheduler. The Google Cloud console fills in the Service account ID field based on this name.

  5. Optional: In the Service account description field, enter a description for the service account.

  6. Click Create and continue.

  7. Click the Select a role field and choose the Dataproc Workflow Template Create custom role that you created in the previous step.

  8. Click Continue.

  9. In the Service account admins role field, enter your Google account email address.

  10. Click Done to finish creating the service account.
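
If you prefer the command line, a roughly equivalent service account setup is sketched below. It assumes the custom role ID dataprocWorkflowTemplateCreate from the sketch in the previous section; your-project-id and your-email@example.com are placeholders.

# Create the service account.
gcloud iam service-accounts create workflow-scheduler \
    --display-name="workflow-scheduler"

# Grant the custom role to the service account on the project.
gcloud projects add-iam-policy-binding your-project-id \
    --member="serviceAccount:workflow-scheduler@your-project-id.iam.gserviceaccount.com" \
    --role="projects/your-project-id/roles/dataprocWorkflowTemplateCreate"

# Let your Google account administer the service account.
gcloud iam service-accounts add-iam-policy-binding \
    workflow-scheduler@your-project-id.iam.gserviceaccount.com \
    --member="user:your-email@example.com" \
    --role="roles/iam.serviceAccountAdmin"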

Create a workflow template

Copy and run the commands listed below in a local terminal window or in Cloud Shell to create and define a workflow template.

Notes:

  • The commands specify the "us-central1" region. You can specify a different region, or delete the --region flag if you have previously run gcloud config set dataproc/region to set the default region property.
  • The "-- " (dash dash space) sequence in the add-job command passes the 1000 argument to the SparkPi job, which specifies the number of samples to use to estimate the value of Pi.

  1. Create the workflow template.

    gcloud dataproc workflow-templates create sparkpi \
        --region=us-central1
    
  2. Add the Spark job to the sparkpi workflow template. The "compute" step ID is required and identifies the added SparkPi job.

    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=sparkpi \
        --step-id=compute \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --region=us-central1 \
        -- 1000
    

  3. Use a managed, single-node cluster to run the workflow. Dataproc will create the cluster, run the workflow on it, then delete the cluster when the workflow completes.

    gcloud dataproc workflow-templates set-managed-cluster sparkpi \
        --cluster-name=sparkpi \
        --single-node \
        --region=us-central1
    

  4. Click on the sparkpi name on the Dataproc Workflows page in the Google Cloud console to open the Workflow template details page. Confirm the sparkpi template attributes.
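
You can also confirm the template from the command line. A quick check, assuming the us-central1 region used above:

gcloud dataproc workflow-templates describe sparkpi \
    --region=us-central1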

Create a Cloud Scheduler job

  1. Open the Cloud Scheduler page in the Google Cloud console (you may need to select your project to open the page). Click CREATE JOB.

  2. Enter or select the following job information:

    1. Select a region: "us-central" or other region where you created your workflow template.
    2. Name: "sparkpi"
    3. Frequency: "* * * * *" selects every minute; "0 9 * * 1" selects every Monday at 9 AM. See Defining the Job Schedule for other unix-cron values. Note: Regardless of the frequency you set, you can click the RUN NOW button on the Cloud Scheduler Jobs page in the Google Cloud console to run and test your job.
    4. Timezone: Select your timezone. Type "United States" to list U.S. timezones.
    5. Target: "HTTP"
    6. URL: Enter the following URL, replacing your-project-id with your project ID. Replace "us-central1" if you created your workflow template in a different region. This URL calls the Dataproc workflowTemplates.instantiate API to run your sparkpi workflow template.
      https://dataproc.googleapis.com/v1/projects/your-project-id/regions/us-central1/workflowTemplates/sparkpi:instantiate?alt=json
      
    7. HTTP method:
      1. "POST"
      2. Body: "{}"
    8. Auth header:
      1. "Add OAuth token"
      2. Service account: Enter the email address of the service account that you created for this tutorial. You can use the following address after inserting your-project-id:
        workflow-scheduler@your-project-id.iam.gserviceaccount.com
        
      3. Scope: You can ignore this item.
    9. Click CREATE.
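
As a command-line alternative to the console steps above, a sketch of the equivalent gcloud command is shown below. It assumes the us-central1 location, a one-minute schedule, and your-project-id as a placeholder.

gcloud scheduler jobs create http sparkpi \
    --location=us-central1 \
    --schedule="* * * * *" \
    --uri="https://dataproc.googleapis.com/v1/projects/your-project-id/regions/us-central1/workflowTemplates/sparkpi:instantiate?alt=json" \
    --http-method=POST \
    --message-body="{}" \
    --oauth-service-account-email="workflow-scheduler@your-project-id.iam.gserviceaccount.com"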

Test your scheduled workflow job

  1. On the sparkpi job row on the Cloud Scheduler Jobs page, click RUN NOW.

  2. Wait a few minutes, then open the Dataproc Workflows page to verify that the sparkpi workflow completed.

  3. After the workflow deletes the managed cluster, job details persist in the Google Cloud console. Click the compute... job listed on the Dataproc Jobs page to view workflow job details.
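
You can also trigger and check the job from the command line. A sketch, assuming the us-central1 location and region used above:

# Trigger the Cloud Scheduler job immediately.
gcloud scheduler jobs run sparkpi \
    --location=us-central1

# List Dataproc jobs to confirm that the SparkPi workflow job ran.
gcloud dataproc jobs list \
    --region=us-central1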

Cleaning up

The workflow in this tutorial deletes its managed cluster when the workflow completes. Keeping the workflow template does not incur charges, and it lets you rerun the workflow later. You can delete the other resources created in this tutorial to avoid recurring costs.

Deleting a project

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting your workflow template

gcloud dataproc workflow-templates delete sparkpi \
    --region=us-central1

Deleting your Cloud Scheduler job

Open the Cloud Scheduler Jobs page in the Google Cloud console, select the box to the left of the sparkpi job, then click DELETE.
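
Alternatively, a gcloud sketch, assuming the job was created in the us-central1 location:

gcloud scheduler jobs delete sparkpi \
    --location=us-central1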

Deleting your service account

Open the IAM & Admin → Service Accounts page in the Google Cloud console, select the box to the left of the workflow-scheduler... service account, then click DELETE.
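
Or from the command line (a sketch; your-project-id is a placeholder):

gcloud iam service-accounts delete \
    workflow-scheduler@your-project-id.iam.gserviceaccount.com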

What's next