Running a pipeline against an existing Dataproc cluster

This page describes how to run a pipeline in Cloud Data Fusion against an existing Dataproc cluster.

By default, Cloud Data Fusion creates an ephemeral cluster for each pipeline run: it creates the cluster at the beginning of the run, and then deletes it after the run completes. This behavior saves costs by ensuring that resources exist only when required, but it might not be desirable in the following scenarios:

  • If the time it takes to create a new cluster for every pipeline is prohibitive for your use case.

  • If your organization requires cluster creation to be managed centrally; for example, when you want to enforce certain policies for all Dataproc clusters.

For these scenarios, you can instead run pipelines against an existing cluster by following the steps on this page. The steps are simplified in versions 6.2.1 and later.

Before you begin

You need the following:

  • A Cloud Data Fusion instance.

  • An existing Dataproc cluster.

Versions 6.2.1 and above

Connecting to the existing cluster

In Cloud Data Fusion versions 6.2.1 and later, you can connect to an existing Dataproc cluster when you create a new Compute Engine profile.

  1. In the Google Cloud Console, go to the Cloud Data Fusion Instances page.

  2. Click View Instance.

  3. Click System Admin.

  4. Click the Configuration tab.

  5. Expand the System Compute Profiles box.

  6. Click Create New Profile. A page of provisioners opens.

  7. Click Existing Dataproc.

  8. Enter your desired profile, cluster, and monitoring information.

  9. Click Create.
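
Before you fill in the cluster fields, it can help to confirm the cluster's name and region and that the cluster is running. The following is a minimal sketch that assumes the gcloud CLI is installed and authenticated for the cluster's project. The function name cluster_state is hypothetical, and CLUSTER_NAME and REGION are placeholders for your own values.

```shell
# Hypothetical helper: print the state of a Dataproc cluster (for example, RUNNING).
# Assumes the gcloud CLI is installed and authenticated for the cluster's project.
cluster_state() {
  cluster_name="$1"
  region="$2"
  gcloud dataproc clusters describe "$cluster_name" \
    --region="$region" \
    --format="value(status.state)"
}

# Example usage (replace the placeholders with your own values):
#   cluster_state CLUSTER_NAME REGION
```

If the command reports a state other than RUNNING, the pipeline is unlikely to start on that cluster, so it is worth resolving that before you create the profile.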

Configuring your pipeline to use the custom profile

  1. In the Pipeline Studio, click Configure.

  2. Click Compute config.

  3. Click the profile that you created.

  4. Run the pipeline. It now runs against your existing Dataproc cluster.

Versions before 6.2.1

Setting up SSH on a Dataproc cluster

  1. In the Google Cloud Console, go to the Dataproc Clusters page.

  2. Click the cluster name. A page with cluster details opens.

  3. Click the VM Instances tab, and then click the SSH button to connect to the Dataproc master VM.

  4. Create a new SSH key by running the following command:

    ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/KEY_FILENAME -C USERNAME
    

    The following two files are created:

    • ~/.ssh/KEY_FILENAME (Private key)
    • ~/.ssh/KEY_FILENAME.pub (Public key)

  5. Copy the entire SSH public key. To display the key so that you can copy it, run the following command:

    cat ~/.ssh/KEY_FILENAME.pub
    

  6. Open the Compute Engine Metadata page, select the SSH Keys tab, and then click Edit.

  7. Click Add item.

  8. In the text box that appears, paste the public key that you previously copied.

  9. Click Save.
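
Because the whole public key must be pasted, a quick check of the .pub file before copying can catch a truncated or wrapped key. The following is a minimal sketch, not part of the official procedure. The function name check_public_key is hypothetical. It assumes that the key was generated with -t rsa and -C USERNAME as shown above, so the file holds a single line of the form "ssh-rsa KEY_DATA USERNAME".

```shell
# Hypothetical helper: check that a public key file holds one complete
# "TYPE KEY_DATA COMMENT" line and that the comment matches the expected username.
check_public_key() {
  pub_file="$1"
  expected_user="$2"
  awk -v user="$expected_user" '
    NR == 1 {
      seen = 1
      if ($1 != "ssh-rsa")     { msg = "unexpected key type" }
      else if (NF < 3)         { msg = "missing field" }
      else if ($3 != user)     { msg = "comment does not match user" }
    }
    # Any further non-empty line means the key was wrapped or duplicated.
    NR > 1 && NF > 0 { msg = "more than one line" }
    END {
      if (!seen) msg = "empty file"
      if (msg != "") { print msg; exit 1 }
      print "ok"
    }
  ' "$pub_file"
}

# Example usage (replace the placeholders with your own values):
#   check_public_key ~/.ssh/KEY_FILENAME.pub USERNAME
```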

Creating a custom system compute profile for the instance

  1. In the Google Cloud Console, go to the Cloud Data Fusion Instances page.

  2. Click View Instance.

  3. Click System Admin.

  4. Click the Configuration tab.

  5. Expand the System Compute Profiles box.

  6. Click Create New Profile. A page of provisioners opens.

  7. Click Remote Hadoop Provisioner.

  8. On the Create a profile for Remote Hadoop Provisioner page, enter the profile information, including the SSH information:

    • Host: The SSH host IP address of the cluster's master node. You can find it on the Compute Engine VM instance details page.
    • User: The username that you specified when creating the SSH keys.
    • SSH private key: Paste the SSH private key that you created previously. To display the key content so that you can copy it, run the following command:

        cat ~/.ssh/KEY_FILENAME

    Include the BEGIN and END marker lines in the text that you copy.

  9. Click Create.
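
Because the pasted private key must include its first and last marker lines, it can be worth checking the key file before you copy it. The following is a minimal sketch. The function name has_key_markers is hypothetical, and it only checks the first and last lines of the file, not the key data in between.

```shell
# Hypothetical helper: confirm that a private key file starts with a BEGIN
# marker line and ends with an END marker line.
has_key_markers() {
  key_file="$1"
  first_line="$(head -n 1 "$key_file")"
  last_line="$(tail -n 1 "$key_file")"
  case "$first_line" in
    "-----BEGIN "*" PRIVATE KEY-----") ;;
    *) echo "missing BEGIN line"; return 1 ;;
  esac
  case "$last_line" in
    "-----END "*" PRIVATE KEY-----") ;;
    *) echo "missing END line"; return 1 ;;
  esac
  echo "ok"
}

# Example usage (replace the placeholder with your own value):
#   has_key_markers ~/.ssh/KEY_FILENAME
```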

Configuring your pipeline to use the custom profile

  1. In the Pipeline Studio, click Configure.

  2. Click Compute config.

  3. Click the profile that you created.

  4. Run the pipeline. It now runs against your existing Dataproc cluster.

Troubleshooting

  • If the pipeline fails with a connection timeout, check that the SSH key and the firewall rules are configured correctly.

  • If you get an invalid privatekey error while running the pipeline, check whether the first line of your private key is -----BEGIN OPENSSH PRIVATE KEY-----. If it is, generate a new key pair in PEM format with the RSA type:

    ssh-keygen -m PEM -t rsa -b 4096 -f ~/.ssh/KEY_FILENAME -C USERNAME
    
  • If your pipeline fails with the error java.io.IOException: com.jcraft.jsch.JSchException: Auth fail, follow these steps:

    • Validate the SSH key by using it to connect manually to the target Dataproc node.
    • If the private key works when you connect to the VM manually over SSH from the command line, but the same setup results in an Auth fail exception from JSch, verify that OS Login is not enabled. In the Compute Engine UI, click Metadata in the menu on the left, and then click the Metadata tab. Either delete the OS Login key (enable-oslogin), or set it to FALSE.
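
The private key format check described above can be sketched as a small shell function. This is an illustration only. The function name key_format and the HOST_IP placeholder are hypothetical. It assumes that a key generated with -m PEM -t rsa begins with the line -----BEGIN RSA PRIVATE KEY-----.

```shell
# Hypothetical helper: classify a private key by its first line.
key_format() {
  case "$(head -n 1 "$1")" in
    "-----BEGIN RSA PRIVATE KEY-----")     echo "pem-rsa" ;;
    "-----BEGIN OPENSSH PRIVATE KEY-----") echo "openssh" ;;
    *)                                     echo "unknown" ;;
  esac
}

# Example usage (replace the placeholders with your own values):
#   key_format ~/.ssh/KEY_FILENAME
#   ssh -i ~/.ssh/KEY_FILENAME USERNAME@HOST_IP   # manual connection test
```

A result of openssh suggests regenerating the key pair with the ssh-keygen command shown above. A result of pem-rsa together with a successful manual connection points toward the OS Login setting instead.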