This page describes how to change the Dataproc image version used by your Cloud Data Fusion instance.
Before you begin
Stop all real-time pipelines and Replication jobs in the Cloud Data Fusion instance. If a real-time pipeline or Replication job is running when you change the Dataproc image version, the change is not applied to that pipeline's execution.
For real-time pipelines with checkpointing enabled, stopping the pipelines does not cause data loss. For Replication jobs, as long as the database logs are available, stopping and restarting the job does not cause data loss.
Console
Go to the Cloud Data Fusion Instances page (in CDAP, click View Instances) and open the instance where you need to stop a pipeline.
Open each real-time pipeline in the Pipeline Studio and click Stop.
Open each Replication job on the Replicate page and click Stop.
REST API
To retrieve all pipelines, use the following REST API call:
curl -X GET -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/apps"
Replace NAMESPACE_ID with the name of your namespace.
To stop a real-time pipeline, use the following REST API call:
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/apps/PIPELINE_NAME/spark/DataStreamsSparkStreaming/stop"
Replace NAMESPACE_ID with the name of your namespace and PIPELINE_NAME with the name of the real-time pipeline.
To stop a Replication job, use the following REST API call:
curl -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/NAMESPACE_ID/apps/REPLICATION_JOB_NAME/workers/DeltaWorker/stop"
Replace NAMESPACE_ID with the name of your namespace and REPLICATION_JOB_NAME with the name of the Replication job.
For more information, see stopping real-time pipelines and stopping Replication jobs.
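The two calls above can be combined into one sketch that stops every real-time pipeline in a namespace. This assumes AUTH_TOKEN and CDAP_ENDPOINT are set as in the earlier calls, that jq is installed, and that real-time pipelines can be identified by the cdap-data-streams artifact name (an assumption about how such pipelines are tagged):

```shell
# Hypothetical helper: list all apps in a namespace and stop those built
# from the real-time (cdap-data-streams) artifact.
stop_realtime_pipelines() {
  local ns="$1"
  curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
      "${CDAP_ENDPOINT}/v3/namespaces/${ns}/apps" \
    | jq -r '.[] | select(.artifact.name == "cdap-data-streams") | .name' \
    | while read -r pipeline; do
        echo "Stopping real-time pipeline: ${pipeline}"
        curl -s -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" \
          "${CDAP_ENDPOINT}/v3/namespaces/${ns}/apps/${pipeline}/spark/DataStreamsSparkStreaming/stop"
      done
}

# Example: stop_realtime_pipelines default
```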
Check and override the default version of Dataproc in Cloud Data Fusion
In the Google Cloud console, go to the Instances page (in CDAP, click View Instances) and open the instance.
Click System Admin > Configuration > System Preferences.
If a Dataproc image is not specified in System Preferences, or to change the preference, click Edit System Preferences.
Enter the following text in the Key field:
system.profile.properties.imageVersion
Enter the desired Dataproc image version in the Value field, such as 1.5-debian10.
Click Save & Close.
This change affects the entire Cloud Data Fusion instance, including all its Namespaces and pipeline runs, unless the image version property is overridden in a Namespace, pipeline, or Runtime Argument in your instance.
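As a hedged alternative to the console steps above, the same instance-wide preference can be set through the CDAP preferences REST endpoint. AUTH_TOKEN and CDAP_ENDPOINT are assumed set as in the earlier calls; 1.5-debian10 is an example value:

```shell
# Sketch: set the instance-wide image version preference via the CDAP
# REST API (PUT /v3/preferences with a JSON body of key/value pairs).
set_image_version_preference() {
  local image_version="$1"
  curl -s -X PUT \
    -H "Authorization: Bearer ${AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"system.profile.properties.imageVersion\": \"${image_version}\"}" \
    "${CDAP_ENDPOINT}/v3/preferences"
}

# Example: set_image_version_preference 1.5-debian10
```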
Change Dataproc image version in a Namespace or Pipeline Runtime Argument
If you have not overridden the Dataproc image version in Namespace Preferences or in Pipeline Runtime Arguments, you can skip these steps.
Namespace Preferences
If you have overridden the image version in your Namespace properties, follow these steps:
Open your instance in the Cloud Data Fusion UI.
Click System Admin > Configuration > Namespaces.
Open each namespace and click Preferences.
Make sure that there is no override with the key system.profile.properties.imageVersion and an incorrect image version value.
Click Finish.
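The same check can be scripted against the CDAP namespaces and preferences endpoints. This is a sketch, assuming AUTH_TOKEN, CDAP_ENDPOINT, and jq as in the earlier calls:

```shell
# Sketch: print each namespace's imageVersion override, or "none" if the
# namespace has no override for that key.
list_image_version_overrides() {
  curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
      "${CDAP_ENDPOINT}/v3/namespaces" | jq -r '.[].name' \
    | while read -r ns; do
        override=$(curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
            "${CDAP_ENDPOINT}/v3/namespaces/${ns}/preferences" \
          | jq -r '."system.profile.properties.imageVersion" // "none"')
        echo "${ns}: ${override}"
      done
}
```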
Pipeline Runtime Arguments
If you have overridden the image version with a property in your pipeline's Runtime Arguments, follow these steps:
Open your instance in the Cloud Data Fusion UI.
Click Pipeline > List and select the desired pipeline.
The pipeline opens on the Pipeline Studio page.
Click the dropdown menu next to Run.
The Runtime Arguments window opens.
Make sure that there is no override with the key system.profile.properties.imageVersion and an incorrect image version value.
Click Save.
Recreate static Dataproc clusters used by Cloud Data Fusion with desired image version
If you use existing Dataproc clusters with Cloud Data Fusion, follow the Dataproc guide to recreate the clusters with the desired Dataproc image version for your Cloud Data Fusion version.
If there are any pipelines running when the cluster is being recreated, the pipelines will fail. Subsequent runs should run on the recreated cluster.
Alternatively, you can create a new Dataproc cluster with the desired Dataproc image version and delete and recreate the compute profile in Cloud Data Fusion with the same compute profile name and updated Dataproc cluster name. This way, running batch pipelines can complete execution on the existing cluster and new pipeline runs will take place on the new Dataproc cluster. You can delete the old Dataproc cluster after you have confirmed that all pipeline runs have completed.
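For the recreation step, a minimal gcloud sketch is shown below. The cluster name and region are placeholders, and 1.5-debian10 is an example version; choose the image version supported by your Cloud Data Fusion version:

```shell
# Illustrative only: create a replacement static cluster with the target
# Dataproc image version (name, region, and version are placeholders).
gcloud dataproc clusters create replacement-cluster \
  --region=us-central1 \
  --image-version=1.5-debian10
```

After the new cluster is running, point the Cloud Data Fusion compute profile at it as described above, then delete the old cluster once all pipeline runs on it have completed.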
Check that the Dataproc image version is updated
Console
In the Google Cloud console, go to the Dataproc Clusters page.
Open the Cluster details page for the new cluster that Cloud Data Fusion created when you specified the new version.
The Image version field has the new value that you specified in Cloud Data Fusion.
REST API
Get the list of clusters with their metadata:
curl -X GET -H "Authorization: Bearer ${AUTH_TOKEN}" \
"https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION_ID/clusters"
Replace the following:
PROJECT_ID: the ID of your Google Cloud project
REGION_ID: the name of the region where your clusters are located
Search for the name of your pipeline (cluster name).
Under that JSON object, see the image version in config > softwareConfig > imageVersion.
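The lookup above can be scripted with jq to print just the cluster names and their image versions. This is a sketch, assuming AUTH_TOKEN, PROJECT_ID, and REGION_ID are set and jq is installed:

```shell
# Sketch: extract clusterName and config.softwareConfig.imageVersion from
# the Dataproc clusters.list response.
list_cluster_image_versions() {
  curl -s -H "Authorization: Bearer ${AUTH_TOKEN}" \
      "https://dataproc.googleapis.com/v1/projects/${PROJECT_ID}/regions/${REGION_ID}/clusters" \
    | jq -r '.clusters[]? | "\(.clusterName): \(.config.softwareConfig.imageVersion)"'
}
```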