This page provides guidelines for efficiently bulk loading large amounts of data into Cloud Spanner.
Before you can efficiently bulk upload data to Cloud Spanner, you need:
- A schema for your tables in your database, but without any secondary indexes.
- A set of data that you want to upload, such as data from another database, a set of static files, programmatically generated data, or other source.
- An understanding of Cloud Spanner best practices.
Cloud Spanner is designed to scale with high read and write throughput and high volume data storage (hundreds of terabytes, or even more). However, optimal performance is not guaranteed unless certain best practices are followed. The common theme for optimal bulk loading performance is to minimize the number of machines that are involved in each write, because aggregate write throughput is maximized when fewer machines are involved.
Cloud Spanner uses load-based splitting to evenly distribute your data load across nodes: after a few minutes of high load, Cloud Spanner introduces split boundaries between rows of non-interleaved tables and assigns each split to a different server. As a rule of thumb, if your data load is well distributed and you follow best practices for schema design and bulk loading, your write throughput should double every few minutes until you saturate the resources in your instance.
Partition your data by primary key
To get optimal write throughput for bulk loads, partition your data by primary key with this pattern:
- Each partition contains a range of consecutive rows.
- Each commit contains data for only a single partition.
A good rule of thumb for your number of partitions is 10 times the number of nodes in your Cloud Spanner instance. So if you have N nodes, with a total of 10*N partitions, you can assign rows to partitions by:
- Sorting your data by primary key.
- Dividing it into 10*N separate sections.
- Creating a set of worker tasks that upload the data.
Each worker will write to a single partition. Within the partition, it is recommended that your worker write the rows sequentially. However, writing data randomly within a partition should also provide reasonably high throughput.
As more of your data is uploaded, Cloud Spanner automatically splits and rebalances your data to balance load on the nodes in your instance. During this process, you may experience temporary drops in throughput.
Following this pattern, you should see a maximum overall bulk write throughput of 10-20 MiB per second per node.
Commit between 1 MiB to 5 MiB mutations at a time
Each write to Cloud Spanner contains some overhead. To maximize bulk write throughput, maximize the amount of data stored per write. A good technique is for each commit to mutate hundreds of rows. Commits with the number of mutations in the range of 1 MiB - 5 MiB rows usually provide the best performance.
Commits with more than 5 MiB mutations don't provide extra benefit, and they risk exceeding the Cloud Spanner limits on commit size and mutations per commit.
Upload data before creating secondary indexes
Before you create secondary indexes, bulk upload your data. Splits for a secondary index typically live on different machines from the table itself. Because multiple machines are involved when data is written to a table with a secondary index, commits use a more expensive multi-machine commit protocol that adds extra overhead and latency. If your tables do not have secondary indexes when you bulk load your data, fewer machines are involved and bulk write throughput is maximized.
After you bulk load your data, create the secondary indexes. Cloud Spanner will automatically backfill (i.e., populate) the indexes via a background process. For more information, see Adding an index.
Periodic bulk uploads to an existing database
If you are updating an existing database that contains data but does not have any secondary indexes, then the instructions above still apply.
If you do have secondary indexes, the instructions may still yield reasonable performance. It will depend on how many splits, on average, are involved in your transactions. If aggregate throughput becomes too low, you could:
- Include a smaller number of mutations in each commit, as this may increase aggregate throughput.
- If your upload is larger than the total current size of the table being updated, delete your secondary indexes and then re-add them once your upload is complete. This step is usually not necessary, but available as an option.
Avoid inefficient practices
Here are several bad practices to avoid when bulk loading data, because they lead to lower write throughput.
Don't write rows one a time
Each write to Cloud Spanner contains some overhead, whether the write is big or small. Writing one row at a time lowers throughput because it requires a high ratio of overhead per write. An example overhead cost is each write must be written to and acknowledged by a majority of the machines that own the split where the write occurs. The write and acknowledgement process require network round trips. Performing network round trips for each individual row increases the time required for bulk loading.
Don't package N random rows into a commit with N mutations
Writing random rows in a commit where each mutation inserts a single row may be worse than writing one row at a time. Multiple machines are likely involved, because each random row could belong to a different machine. In a worst case scenario, each write will involve every machine in your Cloud Spanner instance. As mentioned above, write throughput is lowered when more machines are involved for a write.
Don't sequentially add all rows in primary key order
Sequentially adding rows means your writes get added to the end of your key range. This causes a hotspot, which is a concentration of operations on a single node. This lowers the write throughput to the capacity of a single node instead of benefiting from load-balanced nodes. For more information about hot- spotting, see Choosing a primary key.