Transfer data between file systems

This page shows you how to transfer data between two POSIX file systems. Common use cases include:

  • Burst to cloud and Hybrid HPC: Quickly transfer large data sets from on-premises to the cloud for processing.
  • Migration and sync to Filestore: Migrate or sync data from an on-premises file system to Filestore.
  • Managed file transfer: Securely and reliably transfer data between data centers or between two in-cloud file systems.

Before you begin

Before you can perform the tasks described on this page, complete the prerequisite steps.

Create agent pools and install agents

For file system to file system transfers, you need to create agent pools and agents for both the source and destination file systems. Agents for the source agent pool need to be installed on machines or VMs that have access to the source file system. Agents for the destination agent pool need to be installed on machines or VMs that have access to the destination file system.

Create a source agent pool

Create a source agent pool using one of the following methods:

gcloud CLI

Create a source agent pool by running:

gcloud transfer agent-pools create SOURCE_AGENT_POOL

Replace SOURCE_AGENT_POOL with the name that you want to give to the source agent pool.

Google Cloud console

  1. In the Google Cloud console, go to the Agent pools page.

    Go to Agent pools

    The Agent pools page is displayed, listing your existing agent pools.

  2. Click Create another pool.

  3. Enter a name for the pool.

  4. Click Create.

Install agents for the source agent pool

Install agents for the source agent pool on a machine or VM that has access to the source file system:

gcloud CLI

Install agents for the source agent pool by running:

gcloud transfer agents install --pool=SOURCE_AGENT_POOL --count=NUMBER_OF_AGENTS

Replace the following:

  • SOURCE_AGENT_POOL with the name of the source agent pool.
  • NUMBER_OF_AGENTS with the number of agents that you want to install for the source agent pool.

To determine the optimal number of agents for your environment, see Agent requirements and best practices.

Google Cloud console

  1. In the Google Cloud console, go to the Agent pools page.

    Go to Agent pools

    The Agent pools page is displayed, listing your existing agent pools.

  2. Click the name of the source agent pool that you just created.

  3. Under the Agents tab, click Install agent.

  4. Follow the instructions in the Google Cloud console to create the Pub/Sub resource, install Docker, and start the agent.

Create a destination agent pool and install agents

Repeat the preceding steps to create a destination agent pool and install agents.

Create a Cloud Storage bucket as an intermediary

File system to file system transfers require a Cloud Storage bucket as an intermediary for the data transfer.

  1. Create a Cloud Storage Standard class bucket with the following settings:

    • Encryption: You can specify a customer-managed encryption key (CMEK). Otherwise, a Google-managed encryption key is used.
    • Object Versioning, Bucket Lock, and default object holds: Keep these features disabled.
    • Lifecycle policy for object deletion: Use the age condition to control the retention period of the transferred data. We recommend specifying a period longer than the longest transfer job that uses the bucket as an intermediary.
  2. Grant permissions and roles using one of the following methods:

    • Grant the Storage Transfer Service service account the Storage Admin role (roles/storage.admin) for the bucket.
    • Use gcloud transfer authorize to authorize your account for all Storage Transfer Service features. This command grants project-wide Storage Admin permissions:

      gcloud transfer authorize --add-missing
      

Manage intermediary buckets

After a transfer job completes, the data in the intermediary bucket is cleaned up and a transfer log is created in the bucket. If the cleanup process does not delete all of the data in the bucket, you can remove the leftover data either by setting a lifecycle policy on the intermediary bucket or by cleaning up the bucket manually:

Lifecycle policy

Use the age condition to control the retention period of bucket data. Specify a period longer than the longest transfer job that uses the bucket as an intermediary. If the specified age condition is shorter than the time required to download the file from the intermediary bucket to the destination, the file transfer fails.
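
As a minimal sketch of the age condition described above, the following Python snippet builds a lifecycle configuration that deletes objects older than a given number of days. The function name `build_lifecycle_config` and the 14-day value are illustrative assumptions, not values taken from this page; choose a period longer than your longest transfer job.

```python
import json


def build_lifecycle_config(retention_days: int) -> dict:
    """Return a lifecycle configuration that deletes objects older than retention_days.

    The helper name and its validation are hypothetical; the JSON layout follows
    the Cloud Storage lifecycle configuration format (a Delete action with an
    age condition expressed in days).
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": retention_days},
            }
        ]
    }


if __name__ == "__main__":
    # 14 days is an illustrative value only.
    print(json.dumps(build_lifecycle_config(14), indent=2))
```

You could save the printed JSON to a file and apply it to the intermediary bucket, for example with `gsutil lifecycle set FILE gs://BUCKET`.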

Manual cleanup

Delete bucket data by running:

gsutil -m rm gs://BUCKET/PREFIX**

Replace the following:

  • BUCKET with the name of the Cloud Storage bucket.
  • PREFIX with the prefix of the objects to delete.

Create a transfer job

gcloud CLI

To create a transfer from the source file system to the destination file system, run:

gcloud transfer jobs create SOURCE_DIRECTORY DESTINATION_DIRECTORY \
    --source-agent-pool=SOURCE_AGENT_POOL \
    --destination-agent-pool=DESTINATION_AGENT_POOL \
    --intermediate-storage-path=STORAGE_BUCKET

Replace the following:

  • SOURCE_DIRECTORY with the path of the source directory, for example posix:///tmp/source.
  • DESTINATION_DIRECTORY with the path of the destination directory, for example posix:///tmp/destination.
  • SOURCE_AGENT_POOL with the name of the source agent pool.
  • DESTINATION_AGENT_POOL with the name of the destination agent pool.
  • STORAGE_BUCKET with the path of the intermediary Cloud Storage bucket, for example gs://my-intermediary-bucket.

When you start a transfer job, the system first compares the data in the source and destination to determine which source data is new or updated since the previous transfer. Only the new or updated data is transferred.
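
If you script transfers, the gcloud invocation above can be assembled programmatically. The following is a minimal sketch that uses only the arguments and flags shown in the command above; the helper name `build_transfer_command` is hypothetical, and the sample values are the ones used in the example later on this page.

```python
import shlex


def build_transfer_command(source_dir, destination_dir,
                           source_pool, destination_pool,
                           intermediate_path):
    """Return the argument list for `gcloud transfer jobs create`.

    Hypothetical helper: it only arranges the positional arguments and the
    three flags shown on this page and does not run anything.
    """
    return [
        "gcloud", "transfer", "jobs", "create",
        source_dir, destination_dir,
        f"--source-agent-pool={source_pool}",
        f"--destination-agent-pool={destination_pool}",
        f"--intermediate-storage-path={intermediate_path}",
    ]


if __name__ == "__main__":
    cmd = build_transfer_command(
        "posix:///tmp/source",
        "posix:///tmp/destination",
        "source_agent_pool",
        "destination_agent_pool",
        "gs://my-intermediary-bucket",
    )
    # Print a shell-safe rendering of the command for review.
    print(shlex.join(cmd))
```

To execute the command instead of printing it, you could pass the list to `subprocess.run(cmd, check=True)` on a machine where the gcloud CLI is installed and authorized.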

Client Libraries

Go


import (
	"context"
	"fmt"
	"io"

	storagetransfer "cloud.google.com/go/storagetransfer/apiv1"
	storagetransferpb "google.golang.org/genproto/googleapis/storagetransfer/v1"
)

func transferFromPosix(w io.Writer, projectID string, sourceAgentPoolName string, rootDirectory string, gcsSinkBucket string) (*storagetransferpb.TransferJob, error) {
	// Your project ID
	// projectID := "myproject-id"

	// The agent pool associated with the POSIX data source. If not provided, defaults to the default agent pool
	// sourceAgentPoolName := "projects/my-project/agentPools/transfer_service_default"

	// The root directory path on the source filesystem
	// rootDirectory := "/directory/to/transfer/source"

	// The ID of the GCS bucket to transfer data to
	// gcsSinkBucket := "my-sink-bucket"

	ctx := context.Background()
	client, err := storagetransfer.NewClient(ctx)
	if err != nil {
		return nil, fmt.Errorf("storagetransfer.NewClient: %v", err)
	}
	defer client.Close()

	req := &storagetransferpb.CreateTransferJobRequest{
		TransferJob: &storagetransferpb.TransferJob{
			ProjectId: projectID,
			TransferSpec: &storagetransferpb.TransferSpec{
				SourceAgentPoolName: sourceAgentPoolName,
				DataSource: &storagetransferpb.TransferSpec_PosixDataSource{
					PosixDataSource: &storagetransferpb.PosixFilesystem{RootDirectory: rootDirectory},
				},
				DataSink: &storagetransferpb.TransferSpec_GcsDataSink{
					GcsDataSink: &storagetransferpb.GcsData{BucketName: gcsSinkBucket},
				},
			},
			Status: storagetransferpb.TransferJob_ENABLED,
		},
	}

	resp, err := client.CreateTransferJob(ctx, req)
	if err != nil {
		return nil, fmt.Errorf("failed to create transfer job: %v", err)
	}
	if _, err = client.RunTransferJob(ctx, &storagetransferpb.RunTransferJobRequest{
		ProjectId: projectID,
		JobName:   resp.Name,
	}); err != nil {
		return nil, fmt.Errorf("failed to run transfer job: %v", err)
	}
	fmt.Fprintf(w, "Created and ran transfer job from %v to %v with name %v", rootDirectory, gcsSinkBucket, resp.Name)
	return resp, nil
}

Java

import com.google.storagetransfer.v1.proto.StorageTransferServiceClient;
import com.google.storagetransfer.v1.proto.TransferProto;
import com.google.storagetransfer.v1.proto.TransferTypes.GcsData;
import com.google.storagetransfer.v1.proto.TransferTypes.PosixFilesystem;
import com.google.storagetransfer.v1.proto.TransferTypes.TransferJob;
import com.google.storagetransfer.v1.proto.TransferTypes.TransferSpec;
import java.io.IOException;

public class TransferFromPosix {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.

    // Your project id
    String projectId = "my-project-id";

    // The agent pool associated with the POSIX data source. If not provided, defaults to the
    // default agent pool
    String sourceAgentPoolName = "projects/my-project-id/agentPools/transfer_service_default";

    // The root directory path on the source filesystem
    String rootDirectory = "/directory/to/transfer/source";

    // The ID of the GCS bucket to transfer data to
    String gcsSinkBucket = "my-sink-bucket";

    transferFromPosix(projectId, sourceAgentPoolName, rootDirectory, gcsSinkBucket);
  }

  public static void transferFromPosix(
      String projectId, String sourceAgentPoolName, String rootDirectory, String gcsSinkBucket)
      throws IOException {
    TransferJob transferJob =
        TransferJob.newBuilder()
            .setProjectId(projectId)
            .setTransferSpec(
                TransferSpec.newBuilder()
                    .setSourceAgentPoolName(sourceAgentPoolName)
                    .setPosixDataSource(
                        PosixFilesystem.newBuilder().setRootDirectory(rootDirectory).build())
                    .setGcsDataSink(GcsData.newBuilder().setBucketName(gcsSinkBucket).build()))
            .setStatus(TransferJob.Status.ENABLED)
            .build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources,
    // or use a "try-with-resources" statement to do this automatically.
    try (StorageTransferServiceClient storageTransfer = StorageTransferServiceClient.create()) {

      // Create the transfer job
      TransferJob response =
          storageTransfer.createTransferJob(
              TransferProto.CreateTransferJobRequest.newBuilder()
                  .setTransferJob(transferJob)
                  .build());

      System.out.println(
          "Created a transfer job from "
              + rootDirectory
              + " to "
              + gcsSinkBucket
              + " with "
              + "name "
              + response.getName());
    }
  }
}

Node.js


// Imports the Google Cloud client library
const {
  StorageTransferServiceClient,
} = require('@google-cloud/storage-transfer');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// Your project id
// const projectId = 'my-project'

// The agent pool associated with the POSIX data source. Defaults to the default agent pool
// const sourceAgentPoolName = 'projects/my-project/agentPools/transfer_service_default'

// The root directory path on the source filesystem
// const rootDirectory = '/directory/to/transfer/source'

// The ID of the GCS bucket to transfer data to
// const gcsSinkBucket = 'my-sink-bucket'

// Creates a client
const client = new StorageTransferServiceClient();

/**
 * Creates a request to transfer from the local file system to the sink bucket
 */
async function transferDirectory() {
  const createRequest = {
    transferJob: {
      projectId,
      transferSpec: {
        sourceAgentPoolName,
        posixDataSource: {
          rootDirectory,
        },
        gcsDataSink: {bucketName: gcsSinkBucket},
      },
      status: 'ENABLED',
    },
  };

  // Runs the request and creates the job
  const [transferJob] = await client.createTransferJob(createRequest);

  const runRequest = {
    jobName: transferJob.name,
    projectId: projectId,
  };

  await client.runTransferJob(runRequest);

  console.log(
    `Created and ran a transfer job from '${rootDirectory}' to '${gcsSinkBucket}' with name ${transferJob.name}`
  );
}

transferDirectory();

Python

from google.cloud import storage_transfer


def transfer_to_gcs(
        project_id: str, description: str, source_agent_pool_name: str,
        root_directory: str, sink_bucket: str):
    """Create a transfer from a POSIX file system to a GCS bucket."""

    client = storage_transfer.StorageTransferServiceClient()

    # The ID of the Google Cloud Platform Project that owns the job
    # project_id = 'my-project-id'

    # A useful description for your transfer job
    # description = 'My transfer job'

    # The agent pool associated with the POSIX data source.
    # Defaults to 'projects/{project_id}/agentPools/transfer_service_default'
    # source_agent_pool_name = 'projects/my-project/agentPools/my-agent'

    # The root directory path on the source filesystem
    # root_directory = '/directory/to/transfer/source'

    # Google Cloud Storage sink bucket name
    # sink_bucket = 'my-gcs-sink-bucket'

    transfer_job_request = storage_transfer.CreateTransferJobRequest({
        'transfer_job': {
            'project_id': project_id,
            'description': description,
            'status': storage_transfer.TransferJob.Status.ENABLED,
            'transfer_spec': {
                'source_agent_pool_name': source_agent_pool_name,
                'posix_data_source': {
                    'root_directory': root_directory,
                },
                'gcs_data_sink': {
                    'bucket_name': sink_bucket
                },
            }
        }
    })

    result = client.create_transfer_job(transfer_job_request)
    print(f'Created transferJob: {result.name}')
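
The client library samples above transfer data from a POSIX file system to a Cloud Storage bucket. For a file system to file system transfer, the transfer spec would instead name a POSIX sink, a sink agent pool, and the intermediary bucket. The following is a hedged sketch of such a job body as a plain dictionary. It assumes the `posix_data_sink`, `sink_agent_pool_name`, and `gcs_intermediate_data_location` fields of the v1 `TransferSpec` message; verify these against the API reference before relying on them. The helper name `build_posix_to_posix_job` is hypothetical.

```python
def build_posix_to_posix_job(project_id, source_pool, sink_pool,
                             source_dir, sink_dir, intermediate_bucket):
    """Return a transfer job body for a file system to file system transfer.

    Hypothetical helper. Field names are assumed from the v1 TransferSpec
    message and should be checked against the API reference.
    """
    return {
        "project_id": project_id,
        # The Python sample above uses storage_transfer.TransferJob.Status.ENABLED;
        # the enum name is used as a string here so the sketch has no dependencies.
        "status": "ENABLED",
        "transfer_spec": {
            "source_agent_pool_name": source_pool,
            "sink_agent_pool_name": sink_pool,
            "posix_data_source": {"root_directory": source_dir},
            "posix_data_sink": {"root_directory": sink_dir},
            "gcs_intermediate_data_location": {
                "bucket_name": intermediate_bucket,
            },
        },
    }


if __name__ == "__main__":
    job = build_posix_to_posix_job(
        "my-project-id",
        "projects/my-project-id/agentPools/source_agent_pool",
        "projects/my-project-id/agentPools/destination_agent_pool",
        "/tmp/source",
        "/tmp/destination",
        "my-intermediary-bucket",
    )
    print(job)
```

Under these assumptions, the resulting dictionary would take the place of the `transfer_job` value in the Python sample above.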

Preserving file metadata

To preserve file metadata, including numeric UID, GID, MODE, and symbolic links:

gcloud CLI

Use the --preserve-metadata flag to specify the preservation behavior for this transfer. Options that apply to file system transfers are: acl, gid, mode, symlink, uid.

REST API

Specify the appropriate options in a metadataOptions object.

See Preserving optional POSIX attributes for more information.
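
As a rough illustration of the REST option, the following sketch builds a metadataOptions object that preserves or skips each POSIX attribute. The enum names (`UID_NUMBER`, `GID_NUMBER`, `MODE_PRESERVE`, `SYMLINK_PRESERVE`, and their `_SKIP` counterparts) and the nesting under `transferSpec.transferOptions` are assumptions based on the v1 REST reference; confirm them there before use. The helper name `build_metadata_options` is hypothetical.

```python
import json


def build_metadata_options(uid=True, gid=True, mode=True, symlink=True):
    """Return a metadataOptions object for the selected POSIX attributes.

    Hypothetical helper; the enum values are assumed from the v1 REST API
    reference and are not confirmed by this page.
    """
    return {
        "uid": "UID_NUMBER" if uid else "UID_SKIP",
        "gid": "GID_NUMBER" if gid else "GID_SKIP",
        "mode": "MODE_PRESERVE" if mode else "MODE_SKIP",
        "symlink": "SYMLINK_PRESERVE" if symlink else "SYMLINK_SKIP",
    }


if __name__ == "__main__":
    # Assumed placement of the object within a transfer spec.
    transfer_options = {"metadataOptions": build_metadata_options()}
    print(json.dumps({"transferOptions": transfer_options}, indent=2))
```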

Example transfer using the gcloud CLI

In this example, we transfer data from the /tmp/source directory on VM1 to the /tmp/destination directory on VM2.

  1. Set up the source of the transfer.

    1. Create the source agent pool:

      gcloud transfer agent-pools create source_agent_pool
      
    2. On VM1, install agents for source_agent_pool by running:

      gcloud transfer agents install --pool=source_agent_pool \
          --count=1
      
  2. Set up the destination of the transfer.

    1. Create the destination agent pool:

      gcloud transfer agent-pools create destination_agent_pool
      
    2. On VM2, install agents for destination_agent_pool by running:

      gcloud transfer agents install --pool=destination_agent_pool \
          --count=3
      
  3. Create an intermediary Cloud Storage bucket.

    1. Create a bucket named my-intermediary-bucket:

      gsutil mb gs://my-intermediary-bucket
      
    2. Authorize your account for all Storage Transfer Service features by running:

      gcloud transfer authorize --add-missing
      
  4. Create a transfer job by running:

    gcloud transfer jobs create posix:///tmp/source posix:///tmp/destination \
        --source-agent-pool=source_agent_pool \
        --destination-agent-pool=destination_agent_pool \
        --intermediate-storage-path=gs://my-intermediary-bucket
    

What's next