Transfer from HDFS to Cloud Storage

Storage Transfer Service supports transfers from cloud and on-premises Hadoop Distributed File System (HDFS) sources.

Transfers from HDFS must use Cloud Storage as the destination.

Use cases include migrating from on-premises storage to Cloud Storage, archiving data to free up on-premises storage space, replicating data to Google Cloud for business continuity, and transferring data to Google Cloud for analysis and processing.

Supported features

Transfers from HDFS sources support a subset of Storage Transfer Service features. In particular, note the following limitation:

Files transferred from HDFS do not retain their metadata.

Overview

To transfer data from an HDFS source to Cloud Storage, follow these steps:

  • Configure IAM permissions
  • Create an agent pool and install transfer agents
  • Create a transfer job

Configure IAM permissions

To complete a transfer from HDFS to Cloud Storage, you must configure IAM roles for the following principals:

  • User account
  • Google-managed service account
  • Transfer agent account

Follow the instructions in Agent-based transfer permissions to assign the required roles.
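
For example, assuming the transfer agents run as a user-managed service account and you use the roles described in Agent-based transfer permissions, the bindings can be granted with the gcloud CLI. The service account name and user email below are illustrative.

# Allow the agent service account to run transfer agents (account name is illustrative).
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:transfer-agent@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storagetransfer.transferAgent"

# Allow the user account to create and manage transfer jobs.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" \
  --role="roles/storagetransfer.user"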

Create an agent pool

Don't include sensitive information such as personally identifiable information (PII) or security data in your agent ID prefix or agent pool name. Resource names may be propagated to the names of other Google Cloud resources and may be exposed to Google-internal systems outside of your project.

To create an agent pool:

Google Cloud console

  1. In the Google Cloud console, go to the Agent pools page.

    Go to Agent pools

    The Agent pools page is displayed, listing your existing agent pools.

  2. Click Create another pool.

  3. Name your pool, and optionally describe it.

  4. Optionally, set a bandwidth limit that applies to the pool as a whole. The specified bandwidth, in MB/s, is split among all of the agents in the pool. See Manage network bandwidth for more information.

  5. Click Create.

REST API

Use projects.agentPools.create:

POST https://storagetransfer.googleapis.com/v1/projects/PROJECT_ID/agentPools?agent_pool_id=AGENT_POOL_ID

Where:

  • PROJECT_ID: The project ID that you're creating the agent pool in.
  • AGENT_POOL_ID: The agent pool ID that you are creating.
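
The request can also include a body that sets the pool's display name and bandwidth limit. The following body is a minimal sketch based on the AgentPool resource; the values are illustrative.

{
  "displayName": "HDFS migration pool",
  "bandwidthLimit": {
    "limitMbps": "100"
  }
}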

If an agent pool is stuck in the Creating state for more than 30 minutes, we recommend deleting the agent pool and creating it again.

Revoking required Storage Transfer Service permissions from a project while an agent pool is in the Creating state leads to incorrect service behavior.

gcloud CLI

To create an agent pool with the gcloud command line tool, run gcloud transfer agent-pools create:

gcloud transfer agent-pools create NAME \
  [--no-async] \
  [--bandwidth-limit=BANDWIDTH_LIMIT] \
  [--display-name=DISPLAY_NAME]

Where the following options are available:

  • NAME is a unique, permanent identifier for this pool.

  • --no-async blocks other tasks in your terminal until the pool has been created. If not included, pool creation runs asynchronously.

  • --bandwidth-limit defines how much of your bandwidth, in MB/s, to make available to this pool's agents. A bandwidth limit applies to all agents in a pool and can help prevent the pool's transfer workload from disrupting other operations that share your bandwidth. For example, enter '50' to set a bandwidth limit of 50 MB/s. If you leave this flag unspecified, this pool's agents use all bandwidth available to them.

  • --display-name is a modifiable name to help you identify this pool. You can include details that might not fit in the pool's unique full resource name.
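
For example, the following command creates a pool and limits its agents to a combined 100 MB/s. The pool name, display name, and limit are illustrative.

gcloud transfer agent-pools create hdfs-migration-pool \
  --display-name="On-premises HDFS migration" \
  --bandwidth-limit=100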

Install transfer agents

Install one or more transfer agents on one or more machines with access to your file system. Agents must have access to the namenode, datanodes, Hadoop KMS (Key Management Server), and KDC (Kerberos Key Distribution Center).
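
Before installing agents, you can optionally confirm that each agent machine can reach these services using standard networking tools. The hostnames and ports below are illustrative; use the values from your cluster configuration.

# Check that the namenode RPC and WebHDFS ports are reachable (hostnames and ports are illustrative).
nc -vz my-namenode 8020
nc -vz my-namenode 9870

# If you use Kerberos and Hadoop KMS, check that the KDC and KMS are also reachable.
nc -vz my-kdc 88
nc -vz my-kms 9600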

See Manage transfer agents for more details and requirements related to transfer agents.

Transfer agents work together in an agent pool. Increasing the number of agents can increase overall job performance, but the benefit depends on several factors.

  • Adding more agents helps, up to about half the number of nodes in your HDFS cluster. For example, with a 30-node cluster, increasing from 5 to 15 agents should improve performance, but going beyond 15 is unlikely to make much difference.

  • For a small HDFS cluster, one agent may be sufficient.

  • Additional agents tend to have a larger impact on performance when a transfer includes a large number of small files. Storage Transfer Service achieves high throughput by parallelizing transfer tasks among multiple agents. The more files in the workload, the more benefit there is to adding more agents.

Using Kerberos

To authenticate to your file system using Kerberos, use the following command:

sudo docker run -d --ulimit memlock=64000000 --rm \
  --network=host \
  -v /:/transfer_root \
  gcr.io/cloud-ingest/tsop-agent:latest \
  --enable-mount-directory \
  --project-id=${PROJECT_ID} \
  --hostname=$(hostname) \
  --creds-file="service_account.json" \
  --agent-pool=${AGENT_POOL_NAME} \
  --hdfs-namenode-uri=kerberos-cluster-namenode \
  --kerberos-config-file=/etc/krb5.conf \
  --kerberos-user-principal=user \
  --kerberos-keytab-file=/path/to/folder.keytab

Where:

  • --network=host should be omitted if you're running more than one agent on this machine.
  • --hdfs-namenode-uri: The scheme, namenode, and port, in URI format, representing an HDFS cluster. For example:

    • rpc://my-namenode:8020
    • http://my-namenode:9870

    Use the HTTP or HTTPS scheme for WebHDFS. If no scheme is provided, we assume native RPC.

  • --kerberos-config-file: Path to a Kerberos configuration file. Default is /etc/krb5.conf.

  • --kerberos-user-principal: The Kerberos user principal to use.

  • --kerberos-keytab-file: Path to a Keytab file containing the user principal specified in kerberos-user-principal.

  • --kerberos-service-principal: Kerberos service principal to use, of the form 'service/instance'. Realm is mapped from your Kerberos configuration file. Any supplied realm is ignored. If this flag is not specified, the default is hdfs/namenode_fqdn.
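
Before starting the agent, you can verify that the keytab file contains the expected user principal with the standard klist utility. The keytab path shown is the illustrative one used in the command above.

klist -k /path/to/folder.keytab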

Using simple auth

To authenticate to your file system using simple auth, use the following command:

sudo docker run -d --ulimit memlock=64000000 --rm \
  --network=host \
  -v /:/transfer_root \
  gcr.io/cloud-ingest/tsop-agent:latest \
  --enable-mount-directory \
  --project-id=${PROJECT_ID} \
  --hostname=$(hostname) \
  --creds-file="${CREDS_FILE}" \
  --agent-pool="${AGENT_POOL_NAME}" \
  --hdfs-namenode-uri=cluster-namenode \
  --hdfs-username="${USERNAME}"

Where:

  • --hdfs-username: Username to use when connecting to an HDFS cluster using simple auth.
  • --hdfs-namenode-uri: The scheme, namenode, and port, in URI format, representing an HDFS cluster. For example:

    • rpc://my-namenode:8020
    • http://my-namenode:9870

    Use the HTTP or HTTPS scheme for WebHDFS. If no scheme is provided, we assume native RPC.
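
In either authentication mode, you can confirm that the agent container is running and inspect its output with standard Docker commands. The container ID comes from the docker ps output.

# List running containers to find the agent's container ID.
sudo docker ps

# Review the agent's logs for startup or authentication errors.
sudo docker logs CONTAINER_ID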

Create a transfer job

Follow these steps to create a transfer from an HDFS source to a Cloud Storage bucket.

Google Cloud console

  1. Go to the Storage Transfer Service page in the Google Cloud console.

    Go to Storage Transfer Service

  2. Click Create transfer job. The Create a transfer job page is displayed.

  3. Select Hadoop Distributed File System as the Source type. The destination must be Google Cloud Storage.

    Click Next step.

Configure your source

  1. Specify the required information for this transfer:

    1. Select the agent pool you configured for this transfer.

    2. Enter the Path to transfer from, relative to the root directory.

  2. Optionally, specify any filters to apply to the source data.

  3. Click Next step.

Configure your sink

  1. In the Bucket or folder field, enter the destination bucket and (optionally) folder name, or click Browse to select a bucket from a list of existing buckets in your current project. To create a new bucket, click Create new bucket.

  2. Click Next step.

Schedule the transfer

You can schedule your transfer to run one time only, or configure a recurring transfer.

Click Next step.

Choose transfer settings

  1. In the Description field, enter a description of the transfer. As a best practice, enter a description that is meaningful and unique so that you can tell jobs apart.

  2. Under Metadata options, select your Cloud Storage storage class, and whether to save each object's creation time. See Metadata preservation for details.

  3. Under When to overwrite, select one of the following:

    • Never: Do not overwrite destination files. If a file exists with the same name, it will not be transferred.

    • If different: Overwrites destination files if the source file with the same name has a different ETag or checksum value.

    • Always: Always overwrites destination files when the source file has the same name, even if they're identical.

  4. Under When to delete, select one of the following:

    • Never: Never delete files from either the source or destination.

    • Delete files from destination if they're not also at source: If files in the destination Cloud Storage bucket aren't also in the source, then delete the files from the Cloud Storage bucket.

      This option ensures that the destination Cloud Storage bucket exactly matches your source.

  5. Select whether to enable transfer logging and/or Pub/Sub notifications.

Click Create to create the transfer job.

REST API

To create a transfer from an HDFS source using the REST API, create a JSON object similar to the following example.

POST https://storagetransfer.googleapis.com/v1/transferJobs
{
  ...
  "transferSpec": {
    "source_agent_pool_name":"POOL_NAME",
    "hdfsDataSource": {
      "path": "/mount"
    },
    "gcsDataSink": {
      "bucketName": "SINK_NAME"
    },
    "transferOptions": {
      "deleteObjectsFromSourceAfterTransfer": false
    }
  }
}
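
To submit the request, you can save the JSON body to a file and call the API with curl, authenticating with an access token from the gcloud CLI. The file name request.json is illustrative.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request.json \
  "https://storagetransfer.googleapis.com/v1/transferJobs"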