Transferring big data sets to Google Cloud Platform

This article describes two ways to transfer large amounts of on-premises, unstructured data to Google Cloud Platform (GCP), primarily focusing on transfers to Google Cloud Storage. Then, the article details best practices for digital network transfers.

To follow along with these recommendations, have the following information about your use case at hand:

  • How much data do you need to transfer? For example, 1 TB.
  • Where is your data located? For example, a data center or a cloud provider.
  • How much network bandwidth is available from this location?
  • Do you need to transfer your data once, or periodically?

Determining how to transfer your data

You can transfer your data to GCP digitally over a network, or physically by sending hard disk drives or other storage media through a GCP partner. Digital transfers are often more convenient because they can be automated with software, but there are two reasons why you might need to choose a physical transfer:

  • Cost. It would be too expensive to transfer the data over the network.
  • Time. It would take too long to transfer the data over the network.

Cost considerations

As the GCP network pricing page describes, there are no ingress traffic charges, but you might have egress charges depending on where you're transferring data from. Contact your service provider to determine these potential charges. There is also a per-operation charge for writing objects to Cloud Storage.

Use the GCP Pricing Calculator to estimate your total charges.

Time considerations

There are two factors that determine how long it will take to transfer your data:

  • How much data you plan to transfer.
  • Your network’s available bandwidth for outgoing transfers to GCP.

The following diagram illustrates approximately how long it might take to transfer your data, given the data size and the network’s available bandwidth for transferring data to GCP. The time estimates are based on network bandwidth with 80% efficiency.

GCP transfer

For example, if you have 1 TB of data to upload, it might take approximately 12 days over a 10 Mbps connection, but only about 2 minutes over a 100 Gbps connection.
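
As a rough check, you can estimate the transfer time yourself: divide the data size in bits by 80% of the link's bit rate. For example, the following shell one-liner reproduces the 1 TB over 10 Mbps estimate, assuming 1 TB = 10^12 bytes:

# Days to transfer 1 TB at 10 Mbps with 80% efficiency
echo "scale=1; (10^12 * 8) / (10 * 10^6 * 0.8) / 86400" | bc

The result is about 11.5 days, consistent with the 12-day figure above.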

Digitally transferring your data

Transferring data to Cloud Storage

Cloud Storage is ideal for storing large amounts of data. It is a highly available and durable object storage service with no limit on the number of files, and you pay only for the storage you use. The service is optimized to work with other GCP services such as Google BigQuery and Google Cloud Dataflow, making it easier for you to analyze your data. Cloud Storage offers several storage classes. The API is identical for each class, so choose the class that is most appropriate for your use case.
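
For example, using the gsutil command-line tool discussed in the next section, you can create a bucket in a specific storage class with the mb command; the nearline class here is only illustrative. Replace [BUCKET_NAME] with your Cloud Storage bucket name.

gsutil mb -c nearline gs://[BUCKET_NAME]

Because the API is the same across classes, the transfer commands in the rest of this article work regardless of the class you choose.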

We recommend using the gsutil tool for transferring large amounts of data to GCP. The tool is open source and offers features such as multi-threaded parallel uploads, automatic synchronization of local directories, resumable uploads for large files, and the ability to break large files into smaller parts for parallel uploads. These features reduce your upload time and utilize as much of the network connection as possible.

You can improve transfer performance further by using multi-threaded uploads, parallel composite uploads, and the other optimizations described in the following sections.

The gsutil command-line tool encrypts traffic in transit using transport-layer encryption (HTTPS). Cloud Storage stores data at rest in encrypted form, and offers the option to provide your own encryption keys. For detailed recommendations, see security and privacy considerations.
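
For example, one way to supply your own encryption key for a single gsutil invocation is a boto configuration override on the command line. The key value below is a placeholder for a base64-encoded AES-256 key; see the customer-supplied encryption keys documentation for details.

gsutil -o "GSUtil:encryption_key=[BASE64_ENCODED_AES256_KEY]" cp [FILE] gs://[BUCKET_NAME]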

Multi-threading the data transfer

When transferring multiple files over a network using a single thread, the transfer might not utilize all available bandwidth. For example, the following diagram shows a single-threaded transfer of four files. Each file must wait for the previous file transfer to complete.

Single-threaded transfer

You can utilize more available bandwidth and speed up the transfer by copying files in parallel. The following diagram illustrates a multi-threaded transfer of four files.

Multi-threaded transfer

By default, the gsutil tool transfers multiple files using a single thread. To enable a multi-threaded copy, use the -m flag when executing the cp command.

The following command copies all files from a source directory into a Cloud Storage bucket. Replace [SOURCE_DIRECTORY] with your directory, and [BUCKET_NAME] with your Cloud Storage bucket name.

gsutil -m cp -r [SOURCE_DIRECTORY] gs://[BUCKET_NAME]
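
If you transfer data periodically rather than once, one option is the gsutil rsync command, which copies only the files that are missing or changed at the destination; the -m flag parallelizes it in the same way:

gsutil -m rsync -r [SOURCE_DIRECTORY] gs://[BUCKET_NAME]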

Parallel composite uploads

If you plan to upload large files, the gsutil tool offers parallel composite uploads. This feature splits each file into several smaller components, and uploads the components in parallel. The following diagrams show the difference between uploading one large file and a parallel composite upload.

Uploading one large file

Parallel composite upload

There are benefits and tradeoffs to using parallel composite uploads, so it is important to read the documentation.
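
For example, you can enable the feature for files above a size threshold by overriding the parallel_composite_upload_threshold option; the 150 MB threshold below is illustrative:

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp [LARGE_FILE] gs://[BUCKET_NAME]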

Tuning TCP parameters

You can improve TCP transfer performance by tuning the following TCP parameters. Before changing these settings, read your operating system’s documentation, or consult an expert.

  • TCP Window Scaling (RFC 1323): Enabling this setting allows the TCP window size to surpass 16 bits by using a scaling factor. The setting potentially allows data transfers to utilize more of the available bandwidth. Both the sender and receiver must support TCP window scaling for this to work.

  • TCP Timestamps (RFC 1323): Enabling this setting allows accurate measurement of the round trip time to aid in smoother TCP performance.

  • TCP Selective Acknowledgment (RFC 2018): Enabling this setting lets the receiver tell the sender exactly which segments arrived, so the sender re-transmits only the data that is missing.

  • Send and Receive Buffer Sizes: These settings determine how much data each side of the connection can buffer before an acknowledgement is required, which caps the amount of data in flight. You can try increasing these settings if you believe they are limiting your bandwidth utilization.
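
On Linux, for example, you can inspect and adjust these parameters with sysctl. The buffer sizes below are illustrative only; consult your operating system's documentation for values appropriate to your bandwidth and round-trip time.

# Check the current settings (1 means enabled)
sysctl net.ipv4.tcp_window_scaling net.ipv4.tcp_timestamps net.ipv4.tcp_sack

# Illustrative example: raise the maximum socket buffer sizes to 16 MiB
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216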

Optimizing your data and bandwidth for digital transfers

If it takes too long to transfer your data, there are two ways you can optimize the transfer:

  • Increasing network bandwidth.
  • Decreasing the amount of data to transfer.

Determining network bandwidth

By increasing your effective network bandwidth, you can potentially increase your data transfer throughput. To measure the latency and throughput of your connection to Cloud Storage, run the following gsutil performance diagnostic command. Replace [BUCKET_NAME] with your Cloud Storage bucket name.

gsutil perfdiag gs://[BUCKET_NAME]

You can use this tool to experiment with different combinations of OS processes, thread counts, and object sizes. The results help you find the optimal configuration for your network and workload, for example, whether you transfer many small files or a few large files.

You can use the following options to help determine your network throughput.

  • The -c option sets the number of processes to use.
  • The -k option sets the number of threads per process to use.
  • The -n option sets the number of objects to use.
  • The -s option sets the size of each object.
  • The -t wthru_file option runs the write throughput test by reading the test files from the local disk, so the measurement includes your local disk's read performance.

For example, the following command uploads 100 files that are 10 MB each, using 2 processes and 10 threads per process. The command includes the -t wthru_file option for the file-based write throughput test, and the -p both option, which exercises both parallelism strategies: uploading multiple objects in parallel and parallel composite (sliced) uploads. Replace [BUCKET_NAME] with your Cloud Storage bucket name.

gsutil perfdiag -c 2 -p both -t wthru_file -s 10M -n 100 -k 10 gs://[BUCKET_NAME]

The following sample diagnostic output includes the measured write throughput in megabits per second (Mbit/s).

------------------------------------------------------------------------------
                        Write Throughput With File I/O
------------------------------------------------------------------------------
Copied 100 10 MiB file(s) for a total transfer size of 1000 MiB.
Write throughput: 135.15 Mbit/s.
Parallelism strategy: both

To check how many hops there are between your network and Google's network, you can use the traceroute command-line tool with Autonomous System (AS) number lookups enabled. For example, the following command works on macOS and BSD systems.

traceroute -a test.storage-upload.googleapis.com
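
On many Linux distributions, the flag for AS lookups is an uppercase -A instead; flag support varies, so check your system's traceroute man page:

traceroute -A test.storage-upload.googleapis.com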

Look for AS15169, the AS number for most Google services, including GCP. The following sample output shows that it takes 6 hops to enter Google’s network.

traceroute to storage.l.googleusercontent.com (74.125.68.128), 64 hops max, 52 byte packets
 1  [AS0] XXXX.XXXXX.XXX (192.168.2.1)  1.374 ms  1.094 ms  0.982 ms
 2  [AS0] XXXX.XXXXX.XXX (192.168.1.1)  1.582 ms  1.932 ms  1.858 ms
   ...
 6  [AS15169] 108.XXX.XXX.XXX (108.XXX.XXX.XXX)  17.281 ms
   ...

For a full list of performance diagnostic tool options, see the gsutil tool documentation.

Increasing network bandwidth

How you can increase network bandwidth depends on how you choose to connect to GCP.

You can connect to GCP in three main ways:

  • Public Internet connection
  • Direct Peering
  • Cloud Interconnect

Connecting with a public Internet connection

When you use a public Internet connection, network throughput is unpredictable because you're limited by the Internet service provider's (ISP) capacity and routing. The ISP might also offer a limited service level agreement (SLA) or none at all. On the other hand, these connections offer relatively low costs and, thanks to Google's extensive peering arrangements, your ISP may route you onto Google's global network within a few network hops.

Connecting with direct peering

You can use direct peering to exchange Internet traffic directly between your network and Google's edge points of presence (PoPs), which reduces the number of hops between your network and Google's network.

Connecting with Google Cloud Interconnect

Finally, Cloud Interconnect offers a direct connection to GCP through one of the Cloud Interconnect service providers. This service can provide more consistent throughput for large data transfers, and providers typically offer service level agreements covering the availability and performance of their network. Contact a service provider directly to learn more.

Decreasing data quantity

Compressing your data reduces the amount you need to transfer over the network, which reduces both the time and the cost of the transfer. The tradeoff is that compression can be CPU intensive and slow. If you're storing files for archival purposes, consider compressing them before storing them in Cloud Storage. If you plan to use the files in an application, you'll need to decompress the data after the transfer, which adds a step.
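
One middle ground, for example, is gsutil's -z option, which gzip-compresses files with the listed extensions during upload and stores them with gzip content encoding; the extensions below are illustrative:

gsutil cp -r -z txt,csv [SOURCE_DIRECTORY] gs://[BUCKET_NAME]

Because the objects carry gzip content encoding, Cloud Storage can decompress them on the fly when they are served to your application.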

As a general guide, compressing text data can yield around a 4:1 compression ratio. Multimedia and other binary data is often already compressed; for images, audio, and video, a lossy compression algorithm offers the best chance to reduce size, provided some quality loss is acceptable.

Physically transferring your data

If digitally transferring your data doesn't work for your use case, you can physically ship your storage media to a third-party service provider that offers Offline Media Import / Export services. These companies upload the data to your Cloud Storage buckets on your behalf. The available service providers and their supported media types change over time, so check the Offline Media Import / Export documentation for the latest information.

What's next

  • Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.
