Upgrade your Cloud Data Fusion environment

Upgrade your Cloud Data Fusion instances and batch pipelines to the latest platform and plugin versions for the latest features, bug fixes, and performance improvements. The upgrade process involves instance and pipeline downtime (see Before you start).

Before you start

  • Schedule downtime for the upgrade. The process takes up to an hour.

  • Recommended: Before you upgrade, stop any running pipelines and disable any upstream triggers, such as Cloud Composer triggers. When the upgrade begins, all running pipelines stop. If you upgrade to version 6.3 or later, Cloud Data Fusion doesn't restart pipelines that were running before the upgrade. In earlier versions, Cloud Data Fusion attempts to restart them.

  • Install Google Cloud CLI.

  • Install curl.
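As a quick check of the last two prerequisites, the following sketch tests whether gcloud and curl are available on your PATH. The function names check_tool and check_prerequisites are hypothetical helpers introduced for illustration; they are not part of Cloud Data Fusion or the gcloud CLI.

```shell
# Hypothetical helper: succeeds if the named command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Hypothetical helper: checks both tools that this guide relies on.
check_prerequisites() {
  missing=0
  for tool in gcloud curl; do
    check_tool "$tool" || missing=1
  done
  return "$missing"
}
```

Run check_prerequisites before you begin. A non-zero exit status means at least one tool still needs to be installed.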

Upgrade Cloud Data Fusion instances

To upgrade a Cloud Data Fusion instance to a new Cloud Data Fusion version, go to the Instance details page:

  1. In the Google Cloud console, go to the Cloud Data Fusion page.

  2. Click Instances, and then click the instance's name to go to the Instance details page.

    Go to Instances

Then perform the upgrade using either the Google Cloud console or Google Cloud CLI:

Console

  1. Click Upgrade for a list of available versions.

  2. Select a version.

  3. Click Upgrade.

  4. Click View instance to access the upgraded instance.

  5. Verify that the upgrade was successful by reloading the Instance details page, and then clicking System admin in the menu bar. The new version number appears at the top of the page.

  6. To prevent your pipelines from getting stuck when you run them in the new version:

    1. Grant the required roles in your upgraded instance.

    2. If you have upgraded to version 6.2.0 or later and your Dataproc cluster gets stuck in the provisioning state, see Add network tags.

gcloud

  1. To upgrade to a new Cloud Data Fusion version, run the following gcloud CLI command from a local terminal or a Cloud Shell session. Add the --enable_stackdriver_logging, --enable_stackdriver_monitoring, and --labels flags if they apply to your instance.

    gcloud beta data-fusion instances update \
        --project=PROJECT_ID \
        --location=REGION \
        --version=NEW_VERSION_NUMBER INSTANCE_ID
    

  2. After the command completes, verify that the upgrade was successful. From the Google Cloud console, reload the Instance details page, and then click System admin in the menu bar. The new version number appears at the top of the page.

  3. To prevent your pipelines from getting stuck when you run them in the new version:

    1. Grant the required roles in your upgraded instance.

    2. If you have upgraded to version 6.2.0 or later and your Dataproc cluster gets stuck in the provisioning state, see Add network tags.
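The verification step above uses the Google Cloud console. As an alternative, the following sketch reads the version from the command line. It assumes that the instance resource returned by gcloud beta data-fusion instances describe exposes a version field. The function names instance_version and verify_upgrade are hypothetical helpers introduced for illustration.

```shell
# Hypothetical helper: prints the version that the instance reports.
# Usage: instance_version PROJECT_ID REGION INSTANCE_ID
instance_version() {
  gcloud beta data-fusion instances describe "$3" \
      --project="$1" \
      --location="$2" \
      --format='value(version)'
}

# Hypothetical helper: compares the reported version with the expected one.
# Usage: verify_upgrade PROJECT_ID REGION INSTANCE_ID EXPECTED_VERSION
verify_upgrade() {
  actual="$(instance_version "$1" "$2" "$3")" || return 1
  if [ "$actual" = "$4" ]; then
    echo "Instance $3 is running version $actual"
  else
    echo "Expected version $4 but found '$actual'" >&2
    return 1
  fi
}
```

For example, verify_upgrade PROJECT_ID REGION INSTANCE_ID NEW_VERSION_NUMBER exits with a non-zero status if the instance still reports a different version.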

Upgrade batch pipelines

To upgrade your Cloud Data Fusion batch pipelines to use the latest plugin versions:

  1. Set environment variables.

  2. Recommended: Back up all pipelines.

    1. To trigger the zip file download, run the following command, and then paste the URL that it prints into your browser.

      echo $CDAP_ENDPOINT/v3/export/apps
      

    2. Extract the downloaded file, then confirm that all pipelines were exported. The pipelines are organized by namespace.

  3. Upgrade pipelines.

    1. Create a variable that points to the pipeline_upgrade.json file, which you create in the next step to hold the list of pipelines. Replace PATH with the path to the directory for the file.

      export PIPELINE_LIST=PATH/pipeline_upgrade.json
      

    2. Create a list of all pipelines for an instance and namespace using the following command. The result is stored in the $PIPELINE_LIST file in JSON format. You can edit the list to remove pipelines that don't need to be upgraded. Replace NAMESPACE_ID with the namespace where you want the upgrade to happen.

      curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" "${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/apps" -o "$PIPELINE_LIST"
      

    3. Upgrade the pipelines listed in pipeline_upgrade.json. Replace NAMESPACE_ID with the namespace of the pipelines to upgrade. The command displays a list of upgraded pipelines with their upgrade status.

      curl -N -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" "${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/upgrade" --data @"$PIPELINE_LIST"
      

  4. Prevent your pipelines from getting stuck when you run them in the new version:

    1. Grant the required roles in your upgraded instance.

    2. If you have upgraded to version 6.2.0 or later and your Dataproc cluster gets stuck in the provisioning state, see Add network tags.
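The commands in this section read the CDAP_ENDPOINT variable, which the first step sets. The following sketch shows one way to populate it. It assumes that the instance resource exposes its API endpoint in an apiEndpoint field. The function name set_cdap_endpoint is a hypothetical helper, and LOCATION and INSTANCE_ID are placeholders for your own values.

```shell
# Hypothetical helper: looks up the instance's API endpoint and exports it
# as CDAP_ENDPOINT for the curl commands in this section.
# Usage: set_cdap_endpoint LOCATION INSTANCE_ID
set_cdap_endpoint() {
  endpoint="$(gcloud beta data-fusion instances describe "$2" \
      --location="$1" \
      --format='value(apiEndpoint)')" || return 1
  if [ -z "$endpoint" ]; then
    echo "No API endpoint returned for instance $2" >&2
    return 1
  fi
  export CDAP_ENDPOINT="$endpoint"
}
```

After set_cdap_endpoint LOCATION INSTANCE_ID succeeds, echo "$CDAP_ENDPOINT" prints the base URL that the export, list, and upgrade commands build on.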

Upgrade to enable Replication

Replication can be enabled in Cloud Data Fusion environments in version 6.3.0 or later. If you have version 6.2.3, upgrade to 6.3.0, and then enable Replication.

Grant roles for upgraded instances

If you upgrade an instance from Cloud Data Fusion version 6.1.x to version 6.2.0 or later, after the upgrade completes, grant the Cloud Data Fusion Runner role (roles/datafusion.runner) and the Cloud Storage Admin role (roles/storage.admin) to the Dataproc service account in your project.
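The following sketch grants both roles with gcloud projects add-iam-policy-binding. The function name grant_dataproc_roles is a hypothetical helper, and SERVICE_ACCOUNT_EMAIL is a placeholder. Which account acts as the Dataproc service account depends on your project configuration, so confirm the address before you run it.

```shell
# Hypothetical helper: grants the two roles named above to a service account.
# Usage: grant_dataproc_roles PROJECT_ID SERVICE_ACCOUNT_EMAIL
grant_dataproc_roles() {
  for role in roles/datafusion.runner roles/storage.admin; do
    gcloud projects add-iam-policy-binding "$1" \
        --member="serviceAccount:$2" \
        --role="$role" || return 1
  done
}
```

The function stops at the first binding that fails, so a non-zero exit status means at least one role was not granted.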

Add network tags

Network tags are preserved in your compute profiles when you upgrade from Cloud Data Fusion versions 6.2.x or later to a higher version.

If you upgrade from version 6.1.x to version 6.2.0 or later, network tags are not preserved. This might cause your Dataproc cluster to get stuck in the provisioning state, especially if your environment has restrictive networking and security policies.

To avoid this, in each upgraded instance, manually add your network tags to each of the compute profiles that it uses.

To add the network tags to a compute profile:

  1. In the Google Cloud console, open the Cloud Data Fusion Instances page.

  2. Click View Instance.

  3. Click System Admin.

  4. Click the Configuration tab.

  5. Expand the System Compute Profiles box.

  6. Click Create New Profile. A page of provisioners opens.

  7. Click Dataproc.

  8. Enter your desired profile information, including your network tags.

  9. Click Create.

After you add the tags, use the updated profile in your pipeline. The new tags are preserved in future releases.

Available versions for your upgrade

When you upgrade, use the latest version of Cloud Data Fusion so that your instances run in a supported environment as long as possible. For more information, see the Version support policy. Depending on your original version, upgrades to some versions might not be available. In those cases, upgrade to a version that supports upgrades to your desired version.

Cloud Data Fusion supports the following version upgrades:

Your Cloud Data Fusion version    Available upgrades
6.8.2                             6.8.3 (latest)
6.8.1                             6.8.3
6.8.0                             6.8.3
6.7.3                             6.8.3
6.7.2                             6.7.3
6.7.1                             6.7.3
6.7.0                             6.7.3
6.6.0                             6.7.3, 6.8.3
6.5.1                             6.6.0, 6.7.3, 6.8.3
6.5.0                             6.5.1
6.4.1                             6.5.1, 6.6.0, 6.7.3, 6.8.3
6.4.0                             6.4.1
6.3.1                             6.5.1, 6.6.0, 6.7.3, 6.8.3
6.3.0                             6.3.1
6.2.3                             6.5.1, 6.6.0, 6.7.3, 6.8.3
6.2.2                             6.2.3
6.2.1                             6.2.3
6.2.0                             6.2.3
6.1.4                             6.5.1, 6.6.0, 6.7.3, 6.8.3
6.1.3                             6.1.4, 6.3.1
6.1.2                             6.1.4
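
To plan an upgrade that needs an intermediate version, you can encode the table above as a lookup. The following sketch does so. The function names available_upgrades and can_upgrade_directly are hypothetical helpers, and the data only mirrors the table as shown here, so it goes out of date when new versions are released.

```shell
# Hypothetical helper: prints the upgrade targets that the table lists
# for the given version, separated by spaces.
available_upgrades() {
  case "$1" in
    6.8.0|6.8.1|6.8.2|6.7.3) echo "6.8.3" ;;
    6.7.0|6.7.1|6.7.2)       echo "6.7.3" ;;
    6.6.0)                   echo "6.7.3 6.8.3" ;;
    6.5.1)                   echo "6.6.0 6.7.3 6.8.3" ;;
    6.5.0)                   echo "6.5.1" ;;
    6.4.1|6.3.1|6.2.3|6.1.4) echo "6.5.1 6.6.0 6.7.3 6.8.3" ;;
    6.4.0)                   echo "6.4.1" ;;
    6.3.0)                   echo "6.3.1" ;;
    6.2.0|6.2.1|6.2.2)       echo "6.2.3" ;;
    6.1.3)                   echo "6.1.4 6.3.1" ;;
    6.1.2)                   echo "6.1.4" ;;
    *) echo "No upgrade listed for version $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeeds if TARGET is listed as a direct upgrade.
# Usage: can_upgrade_directly CURRENT TARGET
can_upgrade_directly() {
  for candidate in $(available_upgrades "$1" 2>/dev/null); do
    if [ "$candidate" = "$2" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, can_upgrade_directly 6.7.0 6.8.3 fails because 6.7.0 lists only 6.7.3. Reaching 6.8.3 from 6.7.0 therefore takes two upgrades: first to 6.7.3, and then to 6.8.3.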

Troubleshooting

After you upgrade to version 6.4, a known issue with the Joiner plugin prevents you from seeing join conditions. For more information, see the Troubleshooting page.