Troubleshoot Cloud Data Fusion

This page shows you how to resolve issues with Cloud Data Fusion.

Troubleshoot batch pipelines

The following advice is for batch pipelines.

Concurrent pipelines are stuck

In Cloud Data Fusion, running many concurrent batch pipelines can put a strain on the instance, causing jobs to get stuck in the Starting, Provisioning, or Running states. As a result, pipelines cannot be stopped through the web interface or API calls. When you run many pipelines concurrently, the web interface can become slow or unresponsive. This issue occurs because of the many UI requests made to the HTTP handler in the backend.

Recommendation

To resolve this issue, control the number of new requests by using Cloud Data Fusion flow control, which is available in instances running version 6.6 and later.

SSH connection times out while running a pipeline

The following error occurs when you run a batch pipeline:

`java.io.IOException: com.jcraft.jsch.JSchException:
java.net.ConnectException: Connection timed out (Connection timed out)`

Recommendation

To resolve the error, check for the following issues:

  • Check for a missing firewall rule (typically port 22). To create a new firewall rule, see Dataproc cluster network configuration, or adapt the sketch after this list.
  • Check that the Compute Engine enforcer allows the connection between your Cloud Data Fusion instance and the Dataproc cluster.
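
If the firewall rule is missing, the following is a minimal sketch using the google-cloud-compute client library. The rule name, network, and source range are placeholder assumptions; adjust them to match the network that your Dataproc cluster runs in.

```python
from google.cloud import compute_v1

project_id = "PROJECT_ID"   # project that hosts the Dataproc cluster
network = "NETWORK_NAME"    # VPC network used by the cluster

# Allow inbound SSH (TCP port 22) on the cluster's network.
firewall = compute_v1.Firewall(
    name="allow-ssh-for-data-fusion",  # placeholder name
    network=f"projects/{project_id}/global/networks/{network}",
    direction="INGRESS",
    allowed=[compute_v1.Allowed(I_p_protocol="tcp", ports=["22"])],
    # Placeholder range; restrict it to the range that your
    # Cloud Data Fusion instance connects from.
    source_ranges=["10.0.0.0/8"],
)

client = compute_v1.FirewallsClient()
client.insert(project=project_id, firewall_resource=firewall).result()
```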

Response code: 401. Error: unknown error

The following error occurs when you run a batch pipeline:

`java.io.IOException: Failed to send message for program run program_run:
Response code: 401. Error: unknown error`

Recommendation

To resolve this error, grant the Cloud Data Fusion Runner role (roles/datafusion.runner) to the service account that Dataproc uses.
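
One way to add this binding outside the console is with the Resource Manager API. The following is a minimal sketch, assuming the google-api-python-client library; PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders for the project where Dataproc runs and the service account that runs the cluster.

```python
from googleapiclient import discovery

project_id = "PROJECT_ID"                         # project where Dataproc runs
member = "serviceAccount:SERVICE_ACCOUNT_EMAIL"   # Dataproc service account
role = "roles/datafusion.runner"

crm = discovery.build("cloudresourcemanager", "v1")

# Read the current IAM policy, add the binding if it is missing, and write it back.
policy = crm.projects().getIamPolicy(resource=project_id, body={}).execute()
bindings = policy.setdefault("bindings", [])
binding = next((b for b in bindings if b["role"] == role), None)
if binding is None:
    bindings.append({"role": role, "members": [member]})
elif member not in binding["members"]:
    binding["members"].append(member)

crm.projects().setIamPolicy(resource=project_id, body={"policy": policy}).execute()
```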

Pipeline with BigQuery plugin fails with Access Denied error

There is a known issue where a pipeline fails with an Access Denied error when running BigQuery jobs. This impacts pipelines that use the following plugins:

  • BigQuery sources
  • BigQuery sinks
  • BigQuery Multi Table sinks
  • Transformation Pushdown

Example error in the logs (might differ depending on the plugin you are using):

POST https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/jobs
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Project xxxx: User does not have bigquery.jobs.create permission in project PROJECT_ID",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Project PROJECT_ID: User does not have bigquery.jobs.create permission in project PROJECT_ID.",
  "status" : "PERMISSION_DENIED"
}

In this example, PROJECT_ID is the project ID that you specified in the plugin. The service account for the project specified in the plugin does not have permission to do at least one of the following:

  • Run a BigQuery job
  • Read a BigQuery dataset
  • Create a temporary bucket
  • Create a BigQuery dataset
  • Create the BigQuery table

Recommendation

To resolve this issue, grant the missing roles to the service account on the project (PROJECT_ID) that you specified in the plugin.

For more information, see the plugin's troubleshooting documentation (Google BigQuery Multi Table Sink Troubleshooting).
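
To confirm which permissions are missing, you can check the service account's access on the project that the plugin targets. The following is a minimal sketch, assuming the google-api-python-client library and that you run it with that service account's credentials; the permission list shown is illustrative, not exhaustive.

```python
from googleapiclient import discovery

project_id = "PROJECT_ID"  # the project ID that you specified in the plugin
needed = [
    "bigquery.jobs.create",
    "bigquery.datasets.create",
    "bigquery.tables.create",
]

crm = discovery.build("cloudresourcemanager", "v1")
result = crm.projects().testIamPermissions(
    resource=project_id, body={"permissions": needed}
).execute()

granted = set(result.get("permissions", []))
print("Missing permissions:", [p for p in needed if p not in granted] or "none")
```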

Pipeline doesn't stop at the error threshold

A pipeline might not stop after multiple errors, even if you set the error threshold to 1.

The error threshold applies to exceptions raised from the directive in the event of a failure that is not otherwise handled. If the directive already uses the emitError API, the error threshold is not activated.

Recommendation

To design a pipeline that fails when a certain threshold is met, use the FAIL directive.

Whenever the condition passed to the FAIL directive is satisfied, it counts against the error threshold and the pipeline fails after the threshold is reached.

Oracle batch source plugin converts NUMBER to string

In Oracle batch source versions 1.9.0, 1.8.3, and earlier, the Oracle NUMBER data type, with undefined precision and scale, is mapped to the CDAP decimal(38,0) data type.

Plugin versions 1.9.1, 1.8.5, and 1.8.4 are backward incompatible. Because the output schema has changed, pipelines that use earlier versions might not work after you upgrade if a downstream stage in the pipeline relies on the output schema of the source. If an output schema was defined for an Oracle NUMBER data type without precision and scale in the previous plugin version, then after you upgrade to versions 1.9.1, 1.8.5, or 1.8.4, the Oracle batch source plugin throws the following schema mismatch error: Schema field '<field name>' is expected to have type 'decimal with precision <precision> and scale <scale> but found 'string'. Change the data type of field <field name> to string.

Versions 1.9.1, 1.8.5, and 1.8.4 work with an output schema of the CDAP string data type for Oracle NUMBER data types defined without precision and scale. If any Oracle NUMBER data type defined without precision and scale is present in the Oracle source output schema, using the older version of the plugin isn't recommended because it can lead to rounding errors.
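
The following minimal Python sketch (not part of the plugin) illustrates why the decimal(38,0) mapping can round fractional values while the string mapping preserves them.

```python
from decimal import Decimal, ROUND_HALF_UP

# An Oracle NUMBER with undefined precision and scale can hold fractional values.
value = Decimal("123.45")

# decimal(38,0) keeps up to 38 digits of precision but zero scale,
# so the fractional part is rounded away.
print(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))  # 123

# Mapping to string preserves the value exactly.
print(str(value))  # 123.45
```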

A special case occurs when you use a macro for the database name, schema name, or table name and you haven't manually specified an output schema. The schema is detected and mapped at runtime. The older version of the Oracle batch source plugin maps the Oracle NUMBER data type defined without precision and scale to the CDAP decimal(38,0) data type, while versions 1.9.1, 1.8.5, and 1.8.4 and later map it to string at runtime.

Recommendation

To resolve the possible precision loss issue while working with Oracle NUMBER data types with undefined precision and scale, upgrade your pipelines to use Oracle batch source plugin versions 1.9.1, 1.8.5, or 1.8.4.

After the upgrade, the Oracle NUMBER data type defined without precision and scale is mapped to the CDAP string data type at runtime. If you have a downstream stage or sink that consumes the original CDAP decimal data type (to which the Oracle NUMBER data type defined without precision and scale was mapped), either update it or expect it to consume string data.

If you understand the risk of possible data loss due to rounding errors, but choose to use Oracle NUMBER data type defined without precision and scale as CDAP decimal(38,0) data type, then deploy the Oracle plugin version 1.8.6 (for Cloud Data Fusion 6.7.3) or 1.9.2 (for Cloud Data Fusion 6.8.1) from the Hub, and update the pipelines to use them instead.

For more information, see the Oracle Batch Source reference.

Delete an ephemeral Dataproc cluster

When Cloud Data Fusion creates an ephemeral Dataproc cluster during pipeline run provisioning, the cluster gets deleted after the pipeline run is finished. In rare cases, the cluster deletion fails.

Strongly recommended: Upgrade to the most recent Cloud Data Fusion version to ensure proper cluster maintenance.

Set Max Idle Time

To resolve this issue, configure the Max Idle Time option. This lets Dataproc delete clusters automatically, even if the explicit delete call when the pipeline finishes fails.

Max Idle Time is available in Cloud Data Fusion versions 6.4 and later.

Recommended: For versions before 6.6, set Max Idle Time manually to 30 minutes or greater.

Delete clusters manually

If you cannot upgrade your version or configure the Max Idle Time option, delete stale clusters manually with the following steps (a scripted sketch follows the steps):

  1. Get each project ID where the clusters were created:

    1. In the pipeline's runtime arguments, check if the Dataproc project ID is customized for the run.

    2. If a Dataproc project ID is not specified explicitly, determine which provisioner is used, and then check for a project ID:

      1. In the pipeline runtime arguments, check the system.profile.name value.

      2. Open the provisioner settings and check if the Dataproc project ID is set. If the setting is not present or the field is empty, the project that the Cloud Data Fusion instance is running in is used.

  2. For each project:

    1. Open the project in the Google Cloud console and go to the Dataproc Clusters page.

    2. Sort the clusters by the date that they were created, from oldest to newest.

    3. If the info panel is hidden, click Show info panel and go to the Labels tab.

    4. For every cluster that is not in use (for example, more than a day has elapsed), check whether it has a Cloud Data Fusion version label. The label indicates that the cluster was created by Cloud Data Fusion.

    5. Select the checkbox by the cluster name and click Delete.
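
If you need to repeat this cleanup regularly, the following sketch automates the same checks with the google-cloud-dataproc client library. It assumes that clusters created by Cloud Data Fusion carry a label whose key contains datafusion and that clusters older than one day are stale; review the output before deleting anything in your environment.

```python
from datetime import datetime, timedelta, timezone

from google.cloud import dataproc_v1

project_id = "PROJECT_ID"  # project where the clusters were created
region = "REGION"          # for example, us-central1
max_age = timedelta(days=1)

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

now = datetime.now(timezone.utc)
for cluster in client.list_clusters(project_id=project_id, region=region):
    # A Cloud Data Fusion version label indicates that the cluster was
    # created by Cloud Data Fusion (assumed label key; verify in the console).
    created_by_data_fusion = any("datafusion" in key for key in cluster.labels)

    # status_history[0] records when the cluster entered its first state.
    created = (
        cluster.status_history[0].state_start_time
        if cluster.status_history
        else cluster.status.state_start_time
    )

    if created_by_data_fusion and now - created > max_age:
        print(f"Deleting stale cluster {cluster.cluster_name}")
        client.delete_cluster(
            project_id=project_id,
            region=region,
            cluster_name=cluster.cluster_name,
        ).result()
```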

Unable to create Cloud Data Fusion instance

While creating a Cloud Data Fusion instance, you might encounter the following error:

Read access to project PROJECT_NAME was denied.

Recommendation

To resolve this issue, disable and re-enable the Cloud Data Fusion API. Then, create the instance.
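
If you prefer to cycle the API from a script instead of the console, a minimal sketch using the Service Usage API might look like the following; PROJECT_ID is a placeholder, and both calls return long-running operations that can take a few minutes to complete.

```python
from googleapiclient import discovery

project_id = "PROJECT_ID"  # project that hosts the Cloud Data Fusion instance
service = f"projects/{project_id}/services/datafusion.googleapis.com"

serviceusage = discovery.build("serviceusage", "v1")

# Disable the Cloud Data Fusion API, then enable it again.
serviceusage.services().disable(
    name=service, body={"disableDependentServices": False}
).execute()
serviceusage.services().enable(name=service, body={}).execute()
```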

Pipelines fail when run on Dataproc clusters with secondary workers

In Cloud Data Fusion versions 6.8 and 6.9, an issue causes pipelines to fail when they run on Dataproc clusters that have secondary workers enabled:

ERROR [provisioning-task-2:i.c.c.i.p.t.ProvisioningTask@161] - PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.APP_NAME.UUID.workflow.DataPipelineWorkflow.RUN_ID due to
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
Caused by: com.google.protobuf.GeneratedMessageV3$Builder.parseUnknownField(Lcom/google/protobuf/CodedInputStream;Lcom/google/protobuf/ExtensionRegistryLite;I)Z.

Recommendation

To resolve the issue, upgrade to patch revision 6.8.3.1, 6.9.2.1, or later. If you cannot upgrade, remove the secondary worker nodes in one of the following ways.

If you use an ephemeral Dataproc provisioner, resolve the error with these steps:

  1. Go to your pipeline in the Cloud Data Fusion web interface.
  2. In the pipeline runtime arguments, set system.profile.properties.secondaryWorkerNumNodes to 0.
  3. Click Save.
  4. If you use a namespace, disable secondary workers in the namespace (or use the scripted sketch after these steps):
    1. Click System Admin > Namespaces and select the namespace.
    2. Click Preferences > Edit.
    3. Set the value for system.profile.properties.secondaryWorkerNumNodes to 0.
    4. Click Save and Close.
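
As an alternative to setting the namespace preference in the web interface, you can set it through the CDAP Preferences REST API exposed on your instance's API endpoint. The following is a minimal sketch; CDAP_ENDPOINT and NAMESPACE are placeholders, and the instance's apiEndpoint value is shown, for example, by `gcloud beta data-fusion instances describe`.

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

cdap_endpoint = "CDAP_ENDPOINT"  # the instance's apiEndpoint value
namespace = "NAMESPACE"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# PUT replaces all namespace preferences, so merge the new value into the
# current set instead of overwriting it.
url = f"{cdap_endpoint}/v3/namespaces/{namespace}/preferences"
preferences = session.get(url).json()
preferences["system.profile.properties.secondaryWorkerNumNodes"] = "0"
session.put(url, json=preferences).raise_for_status()
```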

If you use an existing Dataproc provisioner, resolve the error with these steps (a scripted alternative follows the steps):

  1. In the Google Cloud console, go to the Dataproc Clusters page.

  2. Select the cluster and click Edit.

  3. In the Secondary worker nodes field, enter 0.

  4. Click Save.
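
If you prefer to make the same change from a script, a minimal sketch using the google-cloud-dataproc client library might look like the following; PROJECT_ID, REGION, and CLUSTER_NAME are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "PROJECT_ID"
region = "REGION"
cluster_name = "CLUSTER_NAME"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Scale the secondary worker group down to zero nodes.
operation = client.update_cluster(
    project_id=project_id,
    region=region,
    cluster_name=cluster_name,
    cluster=dataproc_v1.Cluster(
        project_id=project_id,
        cluster_name=cluster_name,
        config=dataproc_v1.ClusterConfig(
            secondary_worker_config=dataproc_v1.InstanceGroupConfig(num_instances=0)
        ),
    ),
    update_mask={"paths": ["config.secondary_worker_config.num_instances"]},
)
operation.result()  # wait for the update to finish
```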