Workflow using Cloud Scheduler

This tutorial uses the following billable components of Google Cloud:

  • Dataproc
  • Compute Engine
  • Cloud Scheduler

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Set up your project

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Dataproc, Compute Engine, and Cloud Scheduler APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.
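
    If you prefer to work from the command line, the project setup can also be done with gcloud once the Cloud SDK is installed. A minimal sketch, using "your-project-id" as a placeholder for your project ID:

      # Initialize the SDK and set the active project.
      gcloud init
      gcloud config set project your-project-id

      # Enable the three APIs this tutorial uses.
      gcloud services enable dataproc.googleapis.com \
          compute.googleapis.com \
          cloudscheduler.googleapis.com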

Create a custom role

  1. Open the IAM & Admin → Roles page in the Cloud Console.
  2. Click CREATE ROLE.
  3. Complete the Title, Description, ID, and Launch stage fields. Suggestion: Use "Dataproc Workflow Template Create" as the role title.
  4. Click ADD PERMISSIONS.
  5. In the Add Permissions form, in the Filter table text box, select "Permission", then complete the filter to read "Permission: dataproc.workflowTemplates.instantiate".
  6. Select the checkbox to the left of the listed permission, then click ADD.
  7. Click CREATE on the Custom Role page. The custom role is listed on the Roles page.
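If you prefer gcloud, a custom role with the same single permission can be created from the command line. This is a sketch, not the Console flow above; the role ID "dataproc_workflow_template_create" is an example name of your choosing:

    # Create a project-level custom role with the single permission
    # needed to instantiate Dataproc workflow templates.
    gcloud iam roles create dataproc_workflow_template_create \
        --project=your-project-id \
        --title="Dataproc Workflow Template Create" \
        --permissions=dataproc.workflowTemplates.instantiate \
        --stage=GA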

Create a service account

  1. Open the IAM & Admin → Service Accounts page in the Cloud Console.

  2. Click CREATE SERVICE ACCOUNT, then complete the following items:

      1. Service account name: "workflow-scheduler"
      2. Service account ID (this field should auto-complete): "workflow-scheduler@your-project-id.iam.gserviceaccount.com"
      3. Service account description: Optional description
      4. Click CREATE.
      5. Role: Select your custom role with the dataproc.workflowTemplates.instantiate permission.
      6. Grant users access to this service account:
        1. Service account users role: You can ignore this item.
        2. Service account admins role: Insert your Google Account email address.
      7. Create key: You can ignore this item.
      8. Click DONE.
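Alternatively, the service account and its role binding can be created with gcloud. A sketch, assuming the example custom role ID "dataproc_workflow_template_create" from earlier and "your-project-id" as a placeholder:

    # Create the service account.
    gcloud iam service-accounts create workflow-scheduler \
        --display-name="workflow-scheduler"

    # Grant it the custom role that allows instantiating workflow templates.
    gcloud projects add-iam-policy-binding your-project-id \
        --member="serviceAccount:workflow-scheduler@your-project-id.iam.gserviceaccount.com" \
        --role="projects/your-project-id/roles/dataproc_workflow_template_create"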

Create a workflow template

Copy and run the commands listed below in a local terminal window or in Cloud Shell to create and define a workflow template.

Notes:

  • The commands specify the "us-central1" region. You can specify a different region, or delete the --region flag if you have previously run gcloud config set compute/region to set the region property.
  • The "-- " (dash dash space) sequence in the add-job command passes the 1000 argument to the SparkPi job, which specifies the number of samples to use to estimate the value of Pi.

  1. Create the workflow template.

      gcloud dataproc workflow-templates create sparkpi \
          --region=us-central1

  2. Add the Spark job to the sparkpi workflow template. The "compute" step ID is required and identifies the added SparkPi job.

      gcloud dataproc workflow-templates add-job spark \
          --workflow-template=sparkpi \
          --step-id=compute \
          --class=org.apache.spark.examples.SparkPi \
          --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
          --region=us-central1 \
          -- 1000

  3. Use a managed, single-node cluster to run the workflow. Dataproc will create the cluster, run the workflow on it, then delete the cluster when the workflow completes.

      gcloud dataproc workflow-templates set-managed-cluster sparkpi \
          --cluster-name=sparkpi \
          --single-node \
          --region=us-central1

  4. Click the sparkpi name on the Dataproc Workflows page in the Cloud Console to open the Workflow template details page. Confirm the sparkpi template attributes.
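You can also confirm the template from the command line; this prints the template definition, including the managed cluster and the attached Spark job:

    gcloud dataproc workflow-templates describe sparkpi \
        --region=us-central1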

Create a Cloud Scheduler job

  1. Open the Cloud Scheduler page in the Cloud Console, then click CREATE JOB.

  2. Enter or select the following job information:

      1. Select a region: "us-central" or another region where you created your workflow template.
      2. Name: "sparkpi"
      3. Frequency: "* * * * *" selects every minute; "0 9 * * 1" selects every Monday at 9 AM. See Defining the Job Schedule for other unix-cron values. Note: Regardless of the frequency you set for your job, you can click a RUN NOW button on the Cloud Scheduler Jobs page in the Cloud Console to run and test it.
      4. Timezone: Select your timezone. Type "United States" to list U.S. timezones.
      5. Target: "HTTP"
      6. URL: Insert the following URL after replacing your-project-id with your project ID. Replace "us-central1" if you created your workflow template in a different region. This URL calls the Dataproc workflowTemplates.instantiate API to run your sparkpi workflow template.

          https://dataproc.googleapis.com/v1/projects/your-project-id/regions/us-central1/workflowTemplates/sparkpi:instantiate?alt=json

      7. HTTP method: "POST"
      8. Body: "{}"
      9. Auth header:
        1. Select "Add OAuth token".
        2. Service account: Insert the address of the service account that you created for this tutorial. You can use the following account address after replacing your-project-id:

          workflow-scheduler@your-project-id.iam.gserviceaccount.com

        3. Scope: You can ignore this item.
      10. Click CREATE.
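The same job can be created with gcloud instead of the Console. A sketch, using the example schedule "0 9 * * 1" (every Monday at 9 AM) and "your-project-id" as a placeholder:

    # Create an HTTP-target Cloud Scheduler job that instantiates the
    # sparkpi workflow template, authenticating as the service account.
    gcloud scheduler jobs create http sparkpi \
        --location=us-central1 \
        --schedule="0 9 * * 1" \
        --http-method=POST \
        --message-body="{}" \
        --uri="https://dataproc.googleapis.com/v1/projects/your-project-id/regions/us-central1/workflowTemplates/sparkpi:instantiate?alt=json" \
        --oauth-service-account-email="workflow-scheduler@your-project-id.iam.gserviceaccount.com"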

Test your scheduled workflow job

  1. On the sparkpi job row on the Cloud Scheduler Jobs page, click RUN NOW.

  2. Wait a few minutes, then open the Dataproc Workflows page to verify that the sparkpi workflow completed.

  3. After the workflow deletes the managed cluster, job details persist in the Cloud Console. Click the compute... job listed on the Dataproc Jobs page to view workflow job details.
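You can also trigger and inspect the job from the command line:

    # Run the Cloud Scheduler job immediately, regardless of its schedule.
    gcloud scheduler jobs run sparkpi --location=us-central1

    # After a few minutes, list Dataproc jobs to find the workflow's
    # SparkPi job (its name begins with the "compute" step ID).
    gcloud dataproc jobs list --region=us-central1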

Cleaning up

The workflow in this tutorial deletes its managed cluster when it completes. Keeping the workflow template does not incur charges, and you can rerun the workflow at any time. You can delete the other resources created in this tutorial to avoid recurring costs.

Deleting a project

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Deleting your workflow template

    gcloud dataproc workflow-templates delete sparkpi \
        --region=us-central1

Deleting your Cloud Scheduler job

Open the Cloud Scheduler Jobs page in the Cloud Console, select the box to the left of the sparkpi job, then click DELETE.
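Alternatively, delete the job with gcloud:

    gcloud scheduler jobs delete sparkpi --location=us-central1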

Deleting your service account

Open the IAM & Admin → Service Accounts page in the Cloud Console, select the box to the left of the workflow-scheduler... service account, then click DELETE.
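Alternatively, delete the service account with gcloud (after replacing your-project-id):

    gcloud iam service-accounts delete \
        workflow-scheduler@your-project-id.iam.gserviceaccount.com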

What's next