Managing Transfer for on-premises jobs

Before you can start a transfer, you must create a transfer job and have one or more agents installed and connected to it. This document describes first-time setup and how to create a transfer job, install transfer agents, and manage your transfer jobs.

Prerequisites

To use Transfer for on-premises, you need:

  • A POSIX-compliant source.

  • A network connection of 300 Mbps or faster.

  • A Docker-supported 64-bit Linux server or virtual machine that can access the data you plan to transfer.

    Docker Community Edition supports the CentOS, Debian, Fedora, and Ubuntu operating systems.

    To use other Linux operating systems, see Docker Enterprise.

  • A completed Transfer for on-premises first-time setup.

Before you start a transfer, verify that:

  • TCP ports 80 (HTTP) and 443 (HTTPS) are open for outbound connections.
  • All agent processes within a single Google Cloud project have the same filesystem mounted at the same mount point.
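The port requirement above can be checked from the agent host before you install anything. The following is a minimal sketch, not part of the product: the function name check_outbound and the injectable connect parameter are assumptions made for illustration, and you supply whichever endpoint your agents must reach.

```python
import socket


def check_outbound(host, ports=(80, 443), timeout=5.0,
                   connect=socket.create_connection):
    """Return {port: True/False} for outbound TCP connections to host.

    `connect` is injectable so the check can be exercised without a network.
    """
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results
```

For example, check_outbound("storage.googleapis.com") probes both ports against a Cloud Storage endpoint; the choice of that host is also an assumption, and a False value for either port suggests that a firewall rule needs attention.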

Scaling restrictions on jobs and agents

Transfer for on-premises has the following scale restrictions on transfer jobs and agents:

  • Fewer than one billion files per job
  • 100 agents or fewer per transfer project
  • A bandwidth cap greater than 1 MB/s
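The restrictions above can be expressed as a small pre-flight check for a planned transfer. This is a sketch only: the function and constant names are hypothetical, and treating a missing bandwidth cap as valid is an assumption rather than documented behavior.

```python
MAX_FILES_PER_JOB = 1_000_000_000     # a job must have fewer than one billion files
MAX_AGENTS_PER_PROJECT = 100          # 100 agents or fewer per transfer project
MIN_BANDWIDTH_CAP_MB_PER_S = 1.0      # a cap must be over 1 MB/s


def scale_violations(file_count, agent_count, bandwidth_cap_mb_per_s=None):
    """Return a list of human-readable violations (empty if the plan fits)."""
    problems = []
    if file_count >= MAX_FILES_PER_JOB:
        problems.append(f"{file_count} files: a job must have fewer than one billion")
    if agent_count > MAX_AGENTS_PER_PROJECT:
        problems.append(f"{agent_count} agents: a project allows 100 or fewer")
    # Assumption: None means no bandwidth cap is configured.
    if (bandwidth_cap_mb_per_s is not None
            and bandwidth_cap_mb_per_s <= MIN_BANDWIDTH_CAP_MB_PER_S):
        problems.append(f"{bandwidth_cap_mb_per_s} MB/s cap: must be over 1 MB/s")
    return problems
```

A job that exceeds the file limit would need to be split into several jobs, for example by source subdirectory.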

First-time setup

The first time you create a Transfer service for on-premises data job, you'll need to enable required APIs and ensure correct permissions are granted.

If you receive errors while performing first-time setup, confirm that the user you logged in with has the permissions required to perform the setup steps. In many cases, these permissions are not available to all users, and you may need to contact a project administrator for assistance.

To do the first-time setup:

  1. Enable the Pub/Sub API:

    1. Go to the API Library page in the Google Cloud Console.

    2. In the Search box, enter Pub/Sub API.

    3. Select Pub/Sub API.

      The Pub/Sub API page is displayed.

    4. Click Enable.

      The Pub/Sub API overview is displayed.

  2. As a Google Cloud project administrator (a user with the resourcemanager.projects.setIamPolicy permission), grant Identity and Access Management permissions or roles to:
    • Transfer for on-premises administrator (admin) accounts: superuser accounts that support colleagues who perform transfers. Admins manage Transfer for on-premises agents and set bandwidth usage limits.
    • Transfer for on-premises user accounts: accounts used to create and execute transfers. These accounts typically don't have permission to delete transfer jobs.
    • The Transfer for on-premises service account: the service account used by Transfer for on-premises to perform transfers.
    • The Transfer for on-premises agent identity: the identity used to run the Transfer for on-premises agent. This can be either a service account or a user account that sets up the on-premises agents.

    The Google Cloud project administrator account is necessary only to set up transfer users and grant the Transfer for on-premises service account required permissions. It isn't required to start transfer jobs.

    For more information on granting Identity and Access Management permissions, see Granting, changing, and revoking access to resources.

    1. To set up a Transfer for on-premises admin account, assign the following IAM permissions and roles to the account:
      Role / Permission | What it does | Notes
      resourcemanager.projects.getIamPolicy | This permission is used to confirm that the Transfer for on-premises service account has the required permissions for a transfer. | (none)
      roles/storagetransfer.admin | Enables administrative actions in the transfer project, such as project setup and agent monitoring. | For a detailed listing of permissions granted, see Predefined roles.
    2. To set up a Transfer for on-premises user account, assign the following permissions and roles to the account:
      Role / Permission | What it does | Notes
      resourcemanager.projects.getIamPolicy | Used to confirm that the Transfer for on-premises service account has the required Pub/Sub permissions for a transfer. | (none)
      roles/storagetransfer.user | Enables the user to create, get, update, and list transfers. | For a detailed listing of permissions granted, see Predefined roles.
      roles/storage.objectAdmin | Enables the user to create, update, and delete Cloud Storage objects as part of a transfer. | Must be granted for every Cloud Storage bucket this account will use in transfers. For a detailed listing of permissions granted, see Predefined roles.
    3. To allow the Transfer for on-premises service account to access resources needed to complete transfers, assign the following roles, or equivalent permissions, to the Transfer for on-premises service account cloud-ingest-dcp@cloud-ingest-prod.iam.gserviceaccount.com:
      Role / Permission | What it does | Notes
      roles/storage.objectCreator | Enables Transfer for on-premises to create transfer logs in the destination Cloud Storage bucket. | Grant to all Cloud Storage buckets used in a transfer. If appropriate for your situation, you can grant the role at the project level to the project that Transfer for on-premises is running from. For a detailed listing of the permissions these roles grant, see Predefined roles.
      roles/storage.objectViewer | Enables Transfer for on-premises to determine if a file has already been uploaded to Cloud Storage. | Same as for roles/storage.objectCreator.
      roles/pubsub.editor | Enables Transfer for on-premises to automatically create and modify Pub/Sub topics to communicate from Google Cloud to Transfer for on-premises agents. | Apply the role at the project level to the project that Transfer for on-premises is running from. For a detailed listing of the permissions this role grants, see Roles.
      storage.buckets.get | This permission enables reading Cloud Storage bucket metadata. | (none)
    4. To set up a Transfer for on-premises agent service account or user account that will run the Transfer for on-premises agents, assign the following permissions and roles:
      Role / Permission | What it does | Notes
      roles/storage.objectAdmin | Enables Transfer for on-premises agents to create, update, and delete Cloud Storage objects as part of a transfer. | Grant to all Cloud Storage buckets used in a transfer. If appropriate for your situation, you can grant the role at the project level to the project that Transfer for on-premises is running from. For a detailed listing of the permissions this role grants, see Roles.
      roles/pubsub.publisher | Enables Transfer for on-premises agents to share information with Google Cloud using Pub/Sub topics. | For a detailed listing of the permissions this role grants, see Roles.
      roles/pubsub.subscriber | Enables Google Cloud to share information with Transfer for on-premises agents using Pub/Sub topics. | For a detailed listing of the permissions this role grants, see Roles.
      pubsub.subscriptions.create | This permission enables Transfer for on-premises agents to create Pub/Sub subscriptions to the Pub/Sub topic used to communicate between Google Cloud and Transfer for on-premises agents. | (none)
      pubsub.subscriptions.delete | This permission enables Transfer for on-premises agents that exit gracefully to clean up any Pub/Sub subscriptions they create. | (none)
  3. Install and run on-premises agents on each of your machines.
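If you choose to grant the service account's roles at the project level, the grants from step 2 can be scripted rather than clicked through. The sketch below only builds gcloud projects add-iam-policy-binding command lines as strings; it does not run them. The helper name binding_commands and the project ID my-project are hypothetical. It covers roles only: the storage.buckets.get permission is not a role and would need to be granted through a role that contains it.

```python
SERVICE_ACCOUNT = "cloud-ingest-dcp@cloud-ingest-prod.iam.gserviceaccount.com"

# Roles listed for the Transfer for on-premises service account in step 2.
SERVICE_ACCOUNT_ROLES = (
    "roles/storage.objectCreator",
    "roles/storage.objectViewer",
    "roles/pubsub.editor",
)


def binding_commands(project_id,
                     member="serviceAccount:" + SERVICE_ACCOUNT,
                     roles=SERVICE_ACCOUNT_ROLES):
    """Build one project-level IAM binding command per role (not executed)."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in roles
    ]


# "my-project" is a placeholder project ID.
for command in binding_commands("my-project"):
    print(command)
```

Review the printed commands before running them. Granting the storage roles on individual buckets instead of the whole project is the narrower option described in the table notes.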

Creating a transfer job

Before you can start a transfer, you must create a transfer job. The transfer job coordinates and controls your on-premises agents as they move your data.

To create a transfer job:

  1. Go to the Transfer service for on-premises data Web Console page in the Google Cloud Console.


  2. Click Create Transfer Job.

    The Create a transfer job page is displayed.

  3. Describe the transfer job. Enter a short description of your transfer that will help you track it.

  4. Specify a source by entering the fully qualified path of the source file system directory.

  5. Specify a Cloud Storage destination bucket. You can enter a Cloud Storage bucket name, or you can create a new bucket.

    To create and select a new bucket:

    1. Click Browse.

    2. Click New bucket.

      The Create a bucket form is displayed.

    3. Complete the form, click Create, and then click Select.

  6. Optional: Enter an Object prefix. Without an object prefix, each object is named in Cloud Storage with the file's path relative to the source root path. For example, if you have the following files:

    • /source_root_path/file1.txt
    • /source_root_path/dirA/file2.txt
    • /source_root_path/dirA/dirB/file3.txt
    Then the object names in Cloud Storage are:
    • file1.txt
    • dirA/file2.txt
    • dirA/dirB/file3.txt
    The object prefix is inserted at the start of the object's destination name in Cloud Storage: after the / character that follows the destination bucket name, and before the file's path relative to the source root path. A prefix can help you distinguish these objects from objects transferred by other transfer jobs.

    The following table shows several object prefixes and the resulting object names in Cloud Storage for a source object whose path is /source_root_path/sub_folder_name/object_name:
    Prefix | Destination object name
    None | /destination_bucket/sub_folder_name/object_name
    prefix | /destination_bucket/prefixsub_folder_name/object_name
    prefix- | /destination_bucket/prefix-sub_folder_name/object_name
    prefix/ | /destination_bucket/prefix/sub_folder_name/object_name
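    The naming rule in the table can be reproduced with a few lines of Python. This is a sketch of the rule as described here, not the service's implementation; the function name destination_object_name is hypothetical, and it returns the object name only, without the leading /destination_bucket/ part.

```python
import posixpath


def destination_object_name(source_root, source_path, prefix=""):
    """Return the object name for a source file: prefix + path relative to root."""
    relative = posixpath.relpath(source_path, start=source_root)
    if relative.startswith(".."):
        raise ValueError(f"{source_path} is not under {source_root}")
    # The prefix is concatenated as-is; no separator is added automatically.
    return prefix + relative


source = "/source_root_path/sub_folder_name/object_name"
for prefix in ("", "prefix", "prefix-", "prefix/"):
    print(repr(prefix), "->", destination_object_name("/source_root_path", source, prefix))
```

    Because the prefix is plain string concatenation, include a trailing / in the prefix if you want it to behave like a folder.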

  7. Optional: Create a schedule for your job.

  8. Click Create.

If you haven't already done so, install and run on-premises transfer agents on each of your machines.

Controlling bandwidth usage for Transfer service for on-premises data

Bandwidth limits are helpful if you need to limit how much bandwidth Transfer service for on-premises data uses to transfer data to Cloud Storage. Using a bandwidth limit helps ensure that:

  • Your network up-links are not saturated as a result of using Transfer service for on-premises data.

  • Your organization's existing application behavior doesn't degrade during the transfer.

  • You don't cause a sudden price increase if your network connection is billed by peak bandwidth usage.

Bandwidth limits apply to an entire project.

Setting a bandwidth limit

To set a bandwidth limit:

  1. Go to the Transfer service for on-premises data Connection Settings page in Google Cloud Console.


  2. Click Set Bandwidth Limit.

    The Set this project's bandwidth limit pane is displayed.

  3. In the Bandwidth limit text box, enter the desired network limit in megabytes per second (MB/s), and then click Set Bandwidth Limit.

    The bandwidth limit is displayed for the project.
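Network links are usually rated in megabits per second (Mbps), while the limit is entered in megabytes per second (MB/s), so a conversion is needed before filling in the text box. The helper below is a hypothetical sketch of that arithmetic: it divides by 8 to convert bits to bytes and applies the share of the link you want to reserve for transfers.

```python
def link_share_to_limit_mb_per_s(link_mbps, fraction):
    """Convert a share of a link rated in megabits/s into megabytes/s."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    # 8 bits per byte: 300 Mbps is 37.5 MB/s at full utilization.
    return link_mbps * fraction / 8


# Half of a 300 Mbps link.
print(link_share_to_limit_mb_per_s(300, 0.5))  # 18.75
```

The result must still exceed the 1 MB/s minimum listed under the scaling restrictions.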

Editing a bandwidth limit

To edit an existing bandwidth limit:

  1. Go to the Transfer service for on-premises data Connection Settings page in Google Cloud Console.


  2. In the displayed bandwidth limit, click Edit.

  3. In the Bandwidth limit text box, enter the desired network limit in megabytes per second (MB/s), and then click Set Bandwidth Limit.

    The bandwidth limit is displayed for the project.

Removing a bandwidth limit

To remove an existing bandwidth limit:

  1. Go to the Transfer service for on-premises data Connection Settings page in Google Cloud Console.


  2. In the displayed bandwidth limit, click Use All Bandwidth.

  3. To confirm that you wish to remove the existing limit, click Confirm.

Monitoring jobs

You can monitor your Transfer service for on-premises data jobs to ensure they're working as expected.

To monitor your transfer jobs:

  1. Go to the Transfer service for on-premises data Transfer Jobs page in Google Cloud Console.


    A list of jobs is displayed. This list includes both running and completed jobs.

  2. To display detailed information on a transfer job, click the Job description for the job you're interested in.

    The Job details page is displayed.

The Job details page displays the following:

  • How much data has been transferred.

  • Configuration information about the transfer job.

  • Scheduled or recurring job information.

  • Details of the most recent job run.

  • History of all past job runs.

Filtering jobs

If you have many jobs and wish to monitor a subset of them, consider using filters to sort and display only the jobs you are interested in.

To filter your transfer jobs:

  1. Click Filter List.

  2. Select the filters you wish to apply.

Editing job configurations

You can edit the following items for an existing transfer job:

  • The job description
  • Sync option
  • Schedule

To edit a job configuration:

  1. Go to the Transfer service for on-premises data Transfer Jobs page in Google Cloud Console.


  2. Click the Job description for the job you're editing.

    The Job details page is displayed.

  3. Click Configuration.

  4. Click the edit icon beside the configuration item you wish to edit.

Re-running jobs

Transfer service for on-premises data supports running a completed job again as a single additional run. This can be helpful if you have some additional data to move and you'd like to reuse an existing job configuration.

To re-run a job:

  1. Go to the Transfer service for on-premises data Transfer Jobs page in Google Cloud Console.


  2. Click the Job description for the job you want to re-run.

    The Job details page is displayed.

  3. Click Run again.

    The job starts.

Viewing errors

To view a sample of errors encountered during the transfer:

  1. Go to the Transfer service for on-premises data Transfer Jobs page in Google Cloud Console.


  2. Click the Job description for the job whose errors you want to view.

    The Job details page is displayed.

  3. Click View error details.

    The Error details page is displayed, which shows a sample of errors encountered during the transfer.

Viewing transfer logs

Transfer service for on-premises data produces detailed transfer logs that you can use to verify the results of your transfer job. Each job produces a collection of transfer logs that are stored in the destination Cloud Storage bucket.

Logs are produced while the transfer job is running. The complete logs are typically available within 15 minutes of job completion.

You can view logs in either of the following ways: within the Google Cloud Console, or directly in the destination bucket.

Viewing errors within the Google Cloud Console

To display all errors encountered during the transfer within the Google Cloud Console:

  1. Click View transfer logs.

    The Bucket details page is displayed. This is a destination in your Cloud Storage bucket.

  2. Click on the transfer log you are interested in.

    The transfer logs are displayed. For more information, see transfer log format.

Viewing logs in the destination bucket

Transfer logs are stored in the destination bucket at the following path:

destination-bucket-name/storage-transfer/logs/transferJobs/job-name/transferOperations/operation-name

where:

  • destination-bucket-name is the name of the job's destination Cloud Storage bucket.
  • job-name is the job name, as displayed in the job list.
  • operation-name is the name of the individual transfer operation, composed of an ISO 8601 timestamp and a generated ID.

Logs are aggregated and stored as objects. Each batch of logs is named by its creation time. For example:

my-bucket/storage-transfer/logs/transferOperations/job1/2019-10-19T10_52_56.519081644-07_00.log

For more information about the contents of each log, see transfer log format.
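If you script against the logs, the path template given earlier in this section can be filled in programmatically. This is a minimal sketch: the function name transfer_log_prefix and the sample values are hypothetical, and the layout is taken from the template above rather than from any API.

```python
def transfer_log_prefix(destination_bucket, job_name, operation_name):
    """Fill in the documented log path template for one transfer operation."""
    return (
        f"{destination_bucket}/storage-transfer/logs/"
        f"transferJobs/{job_name}/transferOperations/{operation_name}"
    )


# Placeholder values for illustration only.
print(transfer_log_prefix("my-bucket", "job1", "operation1"))
```

Listing the objects under the returned prefix, with whichever Cloud Storage tool you already use, should show the log batches for that operation.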

Running BigQuery queries on transfer logs

To run BigQuery queries on your transfer logs:

  1. Load the CSV log data into BigQuery.

  2. Run your BigQuery query.

Example queries

Display the number of files attempted for transfer, grouped by failed or succeeded status

SELECT ActionStatus, COUNT(*) AS num_files
FROM big-query-table
WHERE Action = "TRANSFER"
GROUP BY ActionStatus;

Where big-query-table is the name of the BigQuery table that contains the transfer log.

Display all files that failed to transfer

SELECT Src_File_Path
FROM big-query-table
WHERE Action = "TRANSFER" AND ActionStatus = "FAILED";

Where big-query-table is the name of the BigQuery table that contains the transfer log.

Display checksum and timestamp for each file that successfully transferred

SELECT Timestamp, Action, ActionStatus, Src_File_Path, Src_File_Size,
  Src_File_Crc32C, Dst_Gcs_BucketName, Dst_Gcs_ObjectName, Dst_Gcs_Size,
  Dst_Gcs_Crc32C, Dst_Gcs_Md5
FROM big-query-table
WHERE Action = "TRANSFER" AND ActionStatus = "SUCCEEDED";

Where big-query-table is the name of the BigQuery table that contains the transfer log.

Display all error information for directories that failed to transfer

SELECT FailureDetails_ErrorType, FailureDetails_GrpcCode, FailureDetails_Message
FROM big-query-table
WHERE Action = "FIND" AND ActionStatus = "FAILED";

Where big-query-table is the name of the BigQuery table that contains the transfer log.
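For a quick look at a single downloaded log file, the first two queries can be approximated locally without BigQuery. The sketch below uses only the Python standard library. It assumes the log is CSV with a header row that uses the same column names as the queries (Action, ActionStatus, Src_File_Path); that header layout is an assumption, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_transfers(log_file):
    """Count TRANSFER rows by status and collect the paths that failed."""
    status_counts = Counter()
    failed_paths = []
    for row in csv.DictReader(log_file):
        if row["Action"] != "TRANSFER":
            continue  # skip other actions such as FIND
        status_counts[row["ActionStatus"]] += 1
        if row["ActionStatus"] == "FAILED":
            failed_paths.append(row["Src_File_Path"])
    return status_counts, failed_paths


# Hypothetical sample rows, not real log output.
SAMPLE = """Action,ActionStatus,Src_File_Path
TRANSFER,SUCCEEDED,/data/a.txt
TRANSFER,FAILED,/data/b.txt
FIND,SUCCEEDED,/data
"""

counts, failed = summarize_transfers(io.StringIO(SAMPLE))
print(dict(counts))
print(failed)
```

To run it against a real log, pass an open file object instead of the in-memory sample. BigQuery remains the better fit once logs span many files.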