This article describes two ways to transfer large amounts of on-premises, unstructured data to Google Cloud Platform (GCP), primarily focusing on transfers to Google Cloud Storage. Then, the article details best practices for digital network transfers.
To follow along with these recommendations, have the following information about your use case at hand:
- How much data do you need to transfer? For example, 1 TB.
- Where is your data located? For example, a data center or a cloud provider.
- How much network bandwidth is available from this location?
- Do you need to transfer your data once, or periodically?
Determining how to transfer your data
You can transfer your data to GCP digitally over a network, or physically by shipping hard disk drives or other media through a GCP partner. Digital transfers are often more convenient because they can be automated with software, but there are two reasons why you might need to choose a physical transfer:
- Cost. It would be too expensive to transfer the data over the network.
- Time. It would take too long to transfer the data over the network.
As the GCP network pricing page describes, there are no ingress traffic charges, but you might have egress charges depending on where you're transferring data from. Contact your service provider to determine these potential charges. There is also a per-operation charge for writing objects to Cloud Storage.
Use the GCP Pricing Calculator to estimate your total charges.
There are two factors that determine how long it will take to transfer your data:
- How much data you plan to transfer.
- Your network’s available bandwidth for outgoing transfers to GCP.
The following diagram illustrates approximately how long it might take to transfer your data, given the data size and the network’s available bandwidth for transferring data to GCP. The time estimates are based on network bandwidth with 80% efficiency.
For example, if you have 1 TB of data to upload, it might take approximately 12 days with a 10 Mbps connection. Or, it might take 2 minutes with a 100 Gbps connection.
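You can reproduce these estimates with simple arithmetic. The following sketch (plain shell, with the 1 TB / 10 Mbps example values from above) estimates transfer time at 80% network efficiency:

```shell
# Back-of-envelope transfer time at 80% network efficiency.
# Example values: 1 TB of data over a 10 Mbps link.
data_tb=1
bandwidth_mbps=10

data_bits=$(( data_tb * 8 * 1000 * 1000 * 1000 * 1000 ))   # 1 TB = 8e12 bits
effective_bps=$(( bandwidth_mbps * 1000 * 1000 * 8 / 10 )) # 80% of the nominal rate
seconds=$(( data_bits / effective_bps ))

echo "~$(( seconds / 86400 )) days"   # prints ~11 days, matching the ~12-day estimate
```

The same arithmetic with a 100 Gbps link yields 100 seconds, or roughly 2 minutes.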
Digitally transferring your data
Transferring data to Cloud Storage
Cloud Storage is ideal for storing large amounts of data. Cloud Storage is a highly available and durable object store service with no limitations on the number of files, and you pay only for the storage you use. The service is optimized to work with other Cloud Platform services such as Google BigQuery and Google Cloud Dataflow, making it easier for you to analyze your data. Cloud Storage offers several storage classes. The API interface is identical for each class, so choose the class that is most appropriate for your use case.
We recommend using the gsutil tool for transferring large amounts of data to GCP. The tool is open source and offers features such as multi-threaded parallel uploads, automatic synchronization of local directories, resumable uploads for large files, and the ability to break large files into smaller parts for parallel uploads. These features reduce your upload time and use as much of the network connection as possible.
You can tune the tool further with multi-threaded uploads, parallel composite uploads, and other options, as described in the following sections.
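For example, the directory-synchronization feature mentioned above can be combined with multi-threading. The following command is a sketch; [SOURCE_DIRECTORY] and [BUCKET_NAME] are placeholders for your own directory and bucket name, and the -d flag is deliberately omitted because it deletes remote objects not present locally.

```
# Multi-threaded (-m), recursive (-r) sync of a local directory to a bucket.
gsutil -m rsync -r [SOURCE_DIRECTORY] gs://[BUCKET_NAME]
```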
The gsutil command-line tool encrypts traffic in transit using transport-layer encryption (HTTPS). Cloud Storage stores data at rest in encrypted form, and offers the option to provide your own encryption keys. For details, see the Cloud Storage documentation on security and privacy considerations.
Multi-threading the data transfer
When transferring multiple files over a network using a single thread, the transfer might not utilize all available bandwidth. For example, the following diagram shows a single-threaded transfer of four files. Each file must wait for the previous file transfer to complete.
You can utilize more available bandwidth and speed up the transfer by copying files in parallel. The following diagram illustrates a multi-threaded transfer of four files.
By default, the gsutil tool transfers multiple files using a single thread. To enable a multi-threaded copy, use the -m option when executing the cp command.
The following command copies all files from a source directory into a Cloud Storage bucket. Replace [SOURCE_DIRECTORY] with your directory, and [BUCKET_NAME] with your Cloud Storage bucket name.
gsutil -m cp -r [SOURCE_DIRECTORY] gs://[BUCKET_NAME]
Parallel composite uploads
If you plan to upload large files, the gsutil tool offers parallel composite uploads.
This feature splits each file into several smaller components, and uploads the
components in parallel. The following diagrams show the difference between
uploading one large file and a parallel composite upload.
There are benefits and tradeoffs to using parallel composite uploads, so it is important to read the documentation.
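For example, you can enable parallel composite uploads for files above a size threshold by setting a gsutil option on the command line. This is a sketch; the 150M threshold is an arbitrary example, and [LARGE_FILE] and [BUCKET_NAME] are placeholders.

```
# Upload files larger than 150 MB as parallel composite uploads.
gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
  cp [LARGE_FILE] gs://[BUCKET_NAME]
```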
Tuning TCP parameters
You can improve TCP transfer performance by tuning the following TCP parameters. Before changing these settings, read your operating system’s documentation, or consult an expert.
- TCP Window Scaling (RFC 1323): Enabling this setting allows the TCP window size to surpass 16 bits by using a scaling factor, potentially allowing data transfers to utilize more of the available bandwidth. Both the sender and receiver must support TCP window scaling for this to work.
- TCP Timestamps (RFC 1323): Enabling this setting allows accurate measurement of the round-trip time, which aids smoother TCP performance.
- TCP Selective Acknowledgment (RFC 2018): Enabling this setting allows the sender to retransmit only the data that the receiver is missing, rather than all data sent after a loss.
- Send and Receive Buffer Sizes: These settings determine how much data a host can send or receive before requiring an acknowledgement from the other party. You can try increasing these settings if you believe they are limiting your bandwidth utilization.
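On Linux, these parameters map to sysctl keys. The following read-only sketch shows where to inspect them; the key names assume a recent Linux kernel, and, as noted above, you should consult your OS documentation before changing any of them.

```
# Inspect current TCP tuning values on Linux (changes nothing).
sysctl net.ipv4.tcp_window_scaling   # RFC 1323 window scaling (1 = enabled)
sysctl net.ipv4.tcp_timestamps       # RFC 1323 timestamps
sysctl net.ipv4.tcp_sack             # RFC 2018 selective acknowledgment
sysctl net.ipv4.tcp_rmem             # min/default/max receive buffer (bytes)
sysctl net.ipv4.tcp_wmem             # min/default/max send buffer (bytes)
```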
Optimizing your data and bandwidth for digital transfers
If it takes too long to transfer your data, there are two ways you can optimize:
- Increasing network bandwidth.
- Decreasing the amount of data to transfer.
Determining network bandwidth
By increasing your effective network bandwidth, you can potentially increase your data transfer throughput rate. You can test your network's performance by running the gsutil performance diagnostic tool. Replace [BUCKET_NAME] with your Cloud Storage bucket name.
gsutil perfdiag gs://[BUCKET_NAME]
You can use this tool to experiment with different combinations of OS processes, thread counts, and other settings. These experiments help you find the optimal configuration for your network and workload, for example, whether you are transferring many small files or a few large files.
You can use the following options to help determine your network throughput.
- The -c option sets the number of processes to use.
- The -k option sets the number of threads per process to use.
- The -n option sets the number of objects to use.
- The -s option sets the size of each object.
- The -t wthru_file option reads files from the local disk to gauge the local disk's read performance.
For example, the following command uploads 100 files that are 10 MB each, using 2 processes and 10 threads. The command includes the -p both option, which tests both the multi-threaded (fan) and parallel composite (slice) upload strategies, and the -t wthru_file option to test write throughput with file I/O. Replace [BUCKET_NAME] with your Cloud Storage bucket name.
gsutil perfdiag -c 2 -p both -t wthru_file -s 10M -n 100 -k 10 gs://[BUCKET_NAME]
The following code shows sample diagnostic output, including a write throughput value in Mbit per second.
------------------------------------------------------------------------------
                        Write Throughput With File I/O
------------------------------------------------------------------------------
Copied 100 10 MiB file(s) for a total transfer size of 1000 MiB.
Write throughput: 135.15 Mbit/s.
Parallelism strategy: both
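You can translate a measured throughput figure like the one above into an expected upload time. The following sketch uses the sample value of 135 Mbit/s (rounded) and a hypothetical 1 TB data set:

```shell
# Estimated upload time for 1 TB at a measured 135 Mbit/s.
throughput_mbit=135
data_megabits=$(( 1 * 8 * 1000 * 1000 ))       # 1 TB = 8,000,000 Mbit
seconds=$(( data_megabits / throughput_mbit ))
echo "~$(( seconds / 3600 )) hours"            # prints ~16 hours
```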
To check how many hops there are between your network and Google's network, you can use the traceroute command-line tool with the Autonomous System (AS) number flag set. For example, the following command works in a Linux environment.
traceroute -a test.storage-upload.googleapis.com
Look for AS15169, the AS number for most Google services, including GCP. The following sample output shows that it takes 6 hops to enter Google’s network.
traceroute to storage.l.googleusercontent.com (18.104.22.168), 64 hops max, 52 byte packets
 1  [AS0] XXXX.XXXXX.XXX (192.168.2.1) 1.374 ms 1.094 ms 0.982 ms
 2  [AS0] XXXX.XXXXX.XXX (192.168.1.1) 1.582 ms 1.932 ms 1.858 ms
 ...
 6  [AS15169] 108.XXX.XXX.XXX (108.XXX.XXX.XXX) 17.281 ms
 ...
For a full list of performance diagnostic tool options, see the gsutil tool documentation.
Increasing network bandwidth
How you can increase network bandwidth depends on how you choose to connect to GCP.
You can connect to GCP in three main ways:
- Public Internet connection
- Direct Peering
- Cloud Interconnect
Connecting with a public Internet connection
When you use a public Internet connection, network throughput is unpredictable because you're limited by the Internet service provider's (ISP) capacity and routing. The ISP might also offer a limited service level agreement (SLA) or none at all. On the other hand, these connections offer relatively low costs and, thanks to Google's extensive peering arrangements, your ISP may route you onto Google's global network within a few network hops.
Connecting with direct peering
You can use direct peering to access the Google network in fewer network hops. By using this option, you can exchange Internet traffic between your network and Google's edge points of presence (PoPs). Doing so reduces the number of hops between your network and Google's network.
Connecting with Google Cloud Interconnect
Finally, Cloud Interconnect offers a direct connection to GCP through one of the Cloud Interconnect service providers. This service can provide more consistent throughput for large data transfers, and providers typically offer service-level agreements covering the availability and performance of their network. Contact a service provider directly to learn more.
Decreasing data quantity
Compressing your data reduces the amount you need to transfer over the network, which reduces both the time and the cost of the transfer. The tradeoff is that compression can be CPU intensive and takes time. If you're storing files for archival purposes, consider compressing the files before storing them in Cloud Storage. If you plan to use the files in an application, you need the additional step of decompressing the data after it's transferred.
As a general guide, compressing text data can result in a 4:1 compression ratio. For multimedia data, a lossy compression algorithm, where quality loss is acceptable, gives the best chance to reduce size and cost.
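The ratio you actually get depends heavily on the data. The following sketch measures the gzip compression ratio for a highly repetitive text file; real results on varied text will be closer to the 4:1 guideline.

```shell
# Measure the gzip compression ratio of a repetitive text file.
yes "2026-01-01 INFO request handled status=200" | head -n 100000 > sample.txt
gzip -c sample.txt > sample.txt.gz

orig=$(wc -c < sample.txt)
comp=$(wc -c < sample.txt.gz)
echo "compression ratio: $(( orig / comp )):1"

rm -f sample.txt sample.txt.gz
```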
Physically transferring your data
If digitally transferring your data doesn't work for your use case, you can physically send your storage media to third-party service providers who offer Offline Media Import / Export services. These companies upload the data on your behalf to your Cloud Storage buckets. The available service providers and their supported media types change over time, so check the Offline Media Import / Export documentation for the latest information.
- Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.