To reduce the start time for pipelines, Cloud Data Fusion versions 6.8.0 and
6.8.1 instances cache the artifacts that are required to start a pipeline in a
Dataproc cluster inside a Cloud Storage bucket.
One of these cached artifacts is application.jar. Depending on the order in
which you run your pipelines, some pipelines might fail with the following
error:
Unsupported program type: Spark
For example, after you create a new 6.8.1 instance (or upgrade to 6.8.1), the first time that you run a pipeline that only contains actions, it succeeds. However, subsequent runs of pipelines that include sources or sinks might fail with this error.
Recommendation
To resolve this issue, do either of the following:
- Recommended: Upgrade the instance to Cloud Data Fusion version 6.8.2 or later.
- Disable Cloud Storage caching by setting a preference or runtime argument.
You can disable caching for any of the following:
- For all pipelines in an instance.
- For a given namespace.
- For the specific Dataproc profiles that contain the failing pipelines.
- For only the failing pipelines.
Disable Cloud Storage caching for all pipelines in an instance
To disable Cloud Storage caching for all pipelines in an instance, follow these steps:
Console
- Go to your instance:
In the Google Cloud console, go to the Cloud Data Fusion page.
To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
- Click System Admin > System Preferences and set the value for system.profile.properties.gcsCacheEnabled to false.
REST API
To set system.profile.properties.gcsCacheEnabled to false, see Set preferences.
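As a rough illustration of the REST call, the following Python sketch builds (but doesn't send) a request that sets the preference for the whole instance. It assumes that the instance exposes the CDAP preferences endpoint at `/v3/preferences` under its API endpoint, and that you authenticate with an OAuth access token. The function name, the example endpoint, and the token are placeholders; confirm the path and body format against Set preferences before you use it.

```python
import json
import urllib.request

# Preference key from this page; the value is passed as a string.
CACHE_KEY = "system.profile.properties.gcsCacheEnabled"


def build_instance_preferences_request(api_endpoint, access_token):
    """Build a PUT request that disables Cloud Storage caching instance-wide.

    api_endpoint and access_token are placeholders supplied by the caller.
    The /v3/preferences path is an assumption based on the CDAP REST API.
    """
    body = json.dumps({CACHE_KEY: "false"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_endpoint.rstrip('/')}/v3/preferences",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. Check the Set preferences reference for whether the call replaces or merges any preferences that already exist at this level.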
Disable Cloud Storage caching for a given namespace
To disable Cloud Storage caching for a given namespace, follow these steps:
Console
- Go to your instance:
In the Google Cloud console, go to the Cloud Data Fusion page.
To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
- Click System Admin > Namespaces and select your namespace.
- Click Preferences > Edit and set the value for system.profile.properties.gcsCacheEnabled to false.
REST API
To set this through the REST API, see Set preferences.
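For a single namespace, the request might look like the following Python sketch, which only builds the request object. It assumes the CDAP namespace-level preferences path `/v3/namespaces/NAMESPACE/preferences`; the function name, endpoint, namespace, and token are placeholders, so verify the details against Set preferences.

```python
import json
import urllib.parse
import urllib.request

# Preference key from this page; the value is passed as a string.
CACHE_KEY = "system.profile.properties.gcsCacheEnabled"


def build_namespace_preferences_request(api_endpoint, namespace, access_token):
    """Build a PUT request that disables Cloud Storage caching for one namespace.

    The URL layout is an assumption based on the CDAP REST API.
    """
    # Percent-encode the namespace so it is safe to embed in the URL path.
    ns = urllib.parse.quote(namespace, safe="")
    body = json.dumps({CACHE_KEY: "false"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_endpoint.rstrip('/')}/v3/namespaces/{ns}/preferences",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Pipelines in other namespaces keep their existing behavior, because the preference is scoped to the namespace in the URL.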
Disable Cloud Storage caching for a Dataproc profile
To disable Cloud Storage caching for the specific Dataproc profiles that contain the failing pipelines, follow these steps:
Console
- Set gcsCacheEnabled to false in the Dataproc profile.
Disable Cloud Storage caching for only the failing pipelines
To disable Cloud Storage caching for only the failing pipelines, follow these steps:
Console
- Go to your instance:
In the Google Cloud console, go to the Cloud Data Fusion page.
To open the instance in the Cloud Data Fusion Studio, click Instances, and then click View instance.
- Click List and select the failing pipeline.
- Next to Run, click Expand, and then set the runtime argument system.profile.properties.gcsCacheEnabled to false.
- Repeat for any other failing pipelines.
REST API
To disable Cloud Storage caching when you start a pipeline through the REST API, specify the runtime argument in a JSON map in the request body. For more information, see Start a program.
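The following Python sketch shows one way such a start request could be built, with the runtime argument in the JSON body. It assumes the CDAP program start path and that a batch pipeline runs as the workflow `DataPipelineWorkflow`; other pipeline types use a different program type and ID. The function name, endpoint, namespace, pipeline name, and token are placeholders; confirm the exact path in Start a program.

```python
import json
import urllib.parse
import urllib.request

# Runtime argument from this page; the value is passed as a string.
CACHE_KEY = "system.profile.properties.gcsCacheEnabled"


def build_start_request(api_endpoint, namespace, pipeline_name, access_token,
                        program_type="workflows",
                        program_id="DataPipelineWorkflow"):
    """Build a POST request that starts a pipeline with caching disabled.

    The defaults for program_type and program_id are assumptions that
    apply to batch pipelines only.
    """
    ns = urllib.parse.quote(namespace, safe="")
    app = urllib.parse.quote(pipeline_name, safe="")
    url = (f"{api_endpoint.rstrip('/')}/v3/namespaces/{ns}/apps/{app}"
           f"/{program_type}/{program_id}/start")
    # Runtime arguments are sent as a JSON map in the request body.
    body = json.dumps({CACHE_KEY: "false"}).encode("utf-8")
    return urllib.request.Request(
        url=url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Because the argument is passed per run, it affects only the pipeline run that this request starts; repeat it for each failing pipeline.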