Troubleshooting

This page shows you how to resolve issues with Cloud Data Fusion.

Pipeline is stuck

In Cloud Data Fusion versions before 6.2, there is a known issue where pipelines get stuck in the Starting or Running state. Stopping the pipeline results in the following error: Malformed reply from SOCKS server. This error occurs because the Dataproc master node does not have adequate memory.

Recommendations

To prevent your pipeline from getting stuck on the next run, delete the Dataproc cluster (an example command follows the list below). Then, update the master node memory in the Compute Engine profile:

  • Required: Increase the Dataproc master node to at least 2 CPUs and 8 GB of memory.
  • Optional: Migrate to Cloud Data Fusion 6.2. Starting in version 6.2, pipeline executions are submitted through the Dataproc Job API and do not impose heavy memory usage on the master node. However, it is still recommended that you use master nodes with at least 2 CPUs and 8 GB of memory for production jobs.
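
For example, you can delete the cluster with a gcloud command like the following, where CLUSTER_NAME and REGION are placeholders for your Dataproc cluster's name and region:

gcloud dataproc clusters delete CLUSTER_NAME --region=REGION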

Changing cluster size

REST API

To change the cluster size, export the Compute Engine profile, and then use the REST API to update the memory settings:

  1. Export the Compute Engine profile. It is saved locally as a JSON file (see the example GET request after this list).
  2. Edit the following memory settings in the JSON file: set masterCPUs to at least 2 and masterMemoryMB to at least 8192 (8 GB).

    {
      "name": "masterCPUs",
      "value": "2",
      "isEditable": true
    },
    {
      "name": "masterMemoryMB",
      "value": "8192",
      "isEditable": true
    },
    
  3. Use the REST API to update the Compute Engine profile. You can use either cURL or the HTTP executor in the UI.

    For cURL, use the following command:

    curl -H "Authorization: Bearer $(gcloud auth print-access-token)" https://<data-fusion-instance-url>/api/v3/profiles/<profile-name> -X PUT -d @<path-to-json-file>

Recovering the pipeline

To recover the stuck pipeline, restart the instance. You can restart the instance using the REST API or the gcloud command-line tool.

REST API

To restart your instance, use the restart() method.
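
For example, the following cURL sketch calls the restart() method, where PROJECT_ID, LOCATION, and INSTANCE_NAME are placeholders for your project, region, and instance:

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://datafusion.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/instances/INSTANCE_NAME:restart"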

gcloud

To restart your instance, run the following command, replacing INSTANCE_NAME with your instance's name and LOCATION with its region:

gcloud beta data-fusion instances restart INSTANCE_NAME --location=LOCATION

Joiner plugin does not show join conditions

The following issue occurs in Cloud Data Fusion version 6.4.0 when you use the Joiner plugin, which lets you toggle between basic and advanced join conditions. After you upgrade or import a pipeline from a previous version and open the Joiner properties page, the basic join condition for the configured pipeline does not appear. This issue doesn't affect how the pipeline runs; the join condition still exists.

Recommendation

To resolve this issue:

  1. Click System Admin > Configuration > Make HTTP Calls.
  2. In the HTTP calls executor fields, enter:

    PUT namespaces/system/artifacts/core-plugins/versions/CORE_PLUGIN_VERSION/properties/widgets.Joiner-batchjoiner?scope=SYSTEM

    For the CORE_PLUGIN_VERSION, use the latest core plugin version.

  3. Paste the widgets JSON for the Joiner plugin in the Body field:

    [widgets.Joiner-batchjoiner JSON]

  4. Click Send.

If your Pipeline page is open in another window, you might need to refresh the page to see the join conditions.
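
To send the same request outside the UI, you can use cURL against the instance's API endpoint. The following sketch assumes you saved the JSON body to a local file; widgets.json is a hypothetical file name:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -X PUT "https://<data-fusion-instance-url>/api/v3/namespaces/system/artifacts/core-plugins/versions/CORE_PLUGIN_VERSION/properties/widgets.Joiner-batchjoiner?scope=SYSTEM" -d @widgets.json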

Replication for SQL Server does not replicate all columns for changed tables

The following issue occurs in Replication jobs that replicate data from a table in SQL Server. When you add a new column to the source table, it is not automatically added to the change data capture (CDC) table. You must manually add it to the underlying CDC table.

Recommendation

To resolve this issue:

  1. Disable the CDC instance:

    EXEC sp_cdc_disable_table
    @source_schema = N'dbo',
    @source_name = N'myTable',
    @capture_instance = 'dbo_myTable'
    GO
    
  2. Enable the CDC instance again:

    EXEC sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'myTable',
    @role_name = NULL,
    @capture_instance = 'dbo_myTable'
    GO
    
  3. Create a new Replication job.
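
To confirm that the new column is now captured, you can query the CDC system tables, for example with sqlcmd. In the following sketch, SERVER and DATABASE are placeholders, and dbo_myTable matches the capture instance from the steps above:

sqlcmd -S SERVER -d DATABASE -Q "SELECT c.column_name FROM cdc.captured_columns c JOIN cdc.change_tables t ON c.object_id = t.object_id WHERE t.capture_instance = 'dbo_myTable'"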

For more information, see Handling changes to source tables.

Pipeline fails or gives incorrect results with BigQuery sink version 0.17.0

There is a known issue in which data pipelines that include the BigQuery sink plugin version 0.17.0 fail or give incorrect results. This issue is resolved in version 0.17.1.

Recommendation

To resolve this issue, update your Google Cloud plugin versions:

  1. Get Google Cloud plugin version 0.17.1 or later.
    1. In the Cloud Data Fusion web UI, click HUB.
    2. Select Google Cloud version 0.17.1 or later and click Deploy.
  2. Change all Google Cloud plugins that your pipeline uses to the same version with one of the following options:
    • To update all plugin versions at once, export your existing pipeline and then import it again. When you import it, select the option to replace all plugins with the latest version.
    • To update plugins manually:
      1. Open the Pipeline Studio page.
      2. In the Sink menu, hold the pointer over BigQuery, and then click Change.
      3. Select version 0.17.1 or later.
      4. Repeat for any other Google Cloud plugins that you use, such as the BigQuery source plugin.
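
To confirm which version is deployed, you can list the artifact's versions through the REST API. The following sketch assumes that the Google Cloud plugins are packaged in an artifact named google-cloud in the default namespace's USER scope:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://<data-fusion-instance-url>/api/v3/namespaces/default/artifacts/google-cloud?scope=USER"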