Launch checklist for Spanner

This launch checklist describes the considerations to address before launching a production application on Spanner. It isn't intended to be exhaustive, but it highlights key considerations that minimize risk, optimize performance, and align the deployment with business and operational goals, offering a systematic approach to a reliable Spanner launch.

This checklist is broken down into the following sections:

- Design, development, testing, and optimization
- Migration (optional)
- Deployment
- Disaster recovery
- Query optimizer and statistics management
- Security
- Logging and monitoring
- Client library
- Support
- Cost management

Design, development, testing, and optimization

Optimizing schema design, transactions, and queries is essential to take advantage of Spanner's distributed architecture for high performance and scalability. Rigorous at-production-scale and end-to-end testing ensures the system can handle real-world workloads, peak loads, and concurrent operations while minimizing the risk of bottlenecks or failures in production.

Checkbox Activity
❑  
Design the schema with scalability and Spanner's distributed architecture in mind. Follow best practices such as selecting appropriate primary keys and indexes to avoid hotspots and consider optimizations like table interleaving for related data. Review Schema design best practices to ensure the schema supports both high performance and scalability under expected workloads.
❑  
Optimize transactions and queries for minimal locking and maximum performance. Use Spanner's transaction modes, such as locking read-write, strong read-only, and partitioned DML statements, to balance consistency, throughput, and latency. Minimize locking scope by using read-only transactions for queries, batched DML for maximum write throughput, and partitioned DML statements for large-scale updates and deletes. When migrating from systems with different isolation semantics (for example, PostgreSQL or MySQL), choose transaction modes deliberately to avoid performance bottlenecks. For more information, see Transactions. A sketch of these schema and transaction practices appears after this checklist.
❑  
Conduct rigorous at-scale load testing to validate schema design, transaction behavior, and query performance. Simulate peak and high-concurrency scenarios that mimic real-world application loads, including diverse transaction shapes and query patterns. Evaluate latency and throughput under these conditions to confirm that the database design and instance topology meet performance requirements. Use load testing iteratively during development to optimize and refine the implementation.
❑  
Extend load testing to encompass all interacting services, not just isolated applications. Simulate comprehensive user journeys alongside parallel processes, such as batch loads or administration tasks that access the database. Run tests on the production Spanner instance configuration, ensuring load test drivers and services are geographically aligned with the intended production deployment topology. This holistic approach identifies potential conflicts in advance and ensures smooth database performance during real-world operations.
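
The following sketch uses the Python client library with hypothetical instance, database, and table names to illustrate the schema and transaction practices above: an interleaved child table keyed to avoid hotspots, a strong read-only query, and a partitioned DML statement for a large-scale update. Treat it as a starting point rather than a definitive implementation.

```python
from google.cloud import spanner

# Hypothetical instance and database names for illustration.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Interleave Orders in Customers so related rows are stored together, and
# lead the key with a non-monotonic CustomerId to avoid write hotspots.
operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerId STRING(36) NOT NULL,
         Name       STRING(MAX)
       ) PRIMARY KEY (CustomerId)""",
    """CREATE TABLE Orders (
         CustomerId STRING(36) NOT NULL,
         OrderId    STRING(36) NOT NULL,
         OrderTotal FLOAT64
       ) PRIMARY KEY (CustomerId, OrderId),
       INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result()  # Wait for the schema change to finish.

# Strong read-only transaction: queries take no locks.
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql(
        "SELECT CustomerId, Name FROM Customers LIMIT 10"
    ):
        print(row)

# Partitioned DML for a large-scale update; runs in independent partitions.
row_count = database.execute_partitioned_dml(
    "UPDATE Orders SET OrderTotal = 0 WHERE OrderTotal IS NULL"
)
print(f"Rows updated: {row_count}")
```

Interleaving keeps each customer's orders physically colocated with the customer row, which reduces cross-split work for the most common access pattern in this hypothetical schema.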

Migration (optional)

Database migration is a comprehensive process that requires a deep dive into the specifics of each individual migration journey. Consider the following in your migration strategy:

Checkbox Activity
❑  
Develop a detailed standard operating procedure (SOP) for the migration cutover. This includes steps for application rollout, database switchover, and automation to minimize manual intervention. Identify and communicate potential downtime windows to stakeholders well in advance. Implement robust monitoring and alerting to track the migration in real time and detect anomalies promptly. Ensure the switchover process includes validation checks to confirm data integrity and application functionality post-migration.
❑  
Prepare a detailed fallback plan to revert to the source system if critical issues arise during the migration. Test the fallback procedures in a staging environment to ensure that they are reliable and can be executed with minimal downtime. Clearly define the conditions that would trigger a fallback, and ensure the team is trained to execute this plan swiftly and efficiently.

Deployment

Proper deployment planning ensures that Spanner configurations meet workload requirements for availability, latency, and scalability, while accounting for geographic and operational considerations. Aligning sizing, resource management, failover scenarios, and automation minimizes risks, ensures optimal performance, and prevents resource constraints or outages during critical operations.

Checkbox Activity
❑  
Ensure your Spanner instance configuration (whether regional, dual-region, or multi-regional) aligns with your application's workload availability and latency requirements, while also taking geographic considerations into account. Calculate the target compute capacity based on expected storage sizes, traffic patterns, and recommended utilization limits, ensuring sufficient capacity for zonal or regional outages. Plan for traffic peaks by enabling autoscaling. You can set an upper limit for compute capacity to establish cost safeguards. For more information, see Compute capacity, nodes, and processing units.
❑  
If you're using a dual-region or multi-region instance configuration, choose a leader region that minimizes latency for application writes from services deployed in your most latency-sensitive locations. Test the implications of different leader regions on operation latency, and adjust to optimize application performance. Plan for failover scenarios by ensuring that the application topology can adapt to leader region changes during regional outages. For more information, see Modify the leader region of a database. A sketch of changing the default leader appears after this checklist.
❑  
Configure tags and labels appropriately for operational clarity and Google Cloud resource tracking. Use tags for conditional access control and to group instances by environment or workload type. Use labels for metadata that aids in cost analysis and reporting. For more information, see Control access and organize instances with tags.
❑  
Evaluate whether a Spanner warm-up is necessary, especially for services expecting sudden, high traffic at launch. Testing latency under high initial loads might reveal the need for a pre-launch warm-up to ensure optimal performance. If a warm-up is required, generate artificial load beforehand. For more information, see Warm up the database before application launch.
❑  
Review Spanner limits and quotas before deployment. If necessary, request quota increases in the Google Cloud console to avoid constraints during peak periods. Be mindful of hard limits (for example, maximum tables per database) to prevent issues post-deployment. For more information, see Quotas and limits.
❑  
Use automation tools like Terraform to provision and manage your Spanner instances, ensuring configurations are consistent and repeatable. For schema management, consider using tools like Liquibase to avoid accidental schema drops during updates. For more information, see Use Terraform with Spanner.
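
To illustrate the leader-region item above, the following sketch changes a database's default leader with a schema statement through the Python client library. The instance and database names are hypothetical, and the set of valid leader regions depends on your instance configuration.

```python
from google.cloud import spanner

# Hypothetical resource names for illustration.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Move the default leader closer to the most latency-sensitive writers.
# The region must be a valid leader option for this instance configuration.
operation = database.update_ddl([
    "ALTER DATABASE `my-database` SET OPTIONS (default_leader = 'us-east1')"
])
operation.result()  # Wait for the change to take effect.
```

Because the default leader is a database option set through DDL, you can manage it with the same schema migration tooling you use for other schema changes.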

Disaster recovery

Establishing a robust disaster recovery (DR) strategy is essential to protect data, minimize downtime, and ensure business continuity during unexpected failures. Regular testing of restore procedures and automated backups helps ensure operational readiness, compliance with recovery objectives, and reliable data protection tailored to organizational needs.

Checkbox Activity
❑  
Define a comprehensive disaster recovery strategy for Spanner that covers data protection, recovery objectives, and failure scenarios. Establish clear recovery time objectives (RTO) and recovery point objectives (RPO) that align with business continuity requirements. Specify backup frequency and retention policies, and use point-in-time recovery (PITR) to minimize data loss in case of failures. Review the Disaster recovery overview to identify the right tools and techniques to meet your application's availability, reliability, and security requirements. For more information, see the Data protection and recovery solutions in Spanner whitepaper.
❑  
Create detailed documentation for backup and restore procedures, including step-by-step guides for various recovery scenarios. Regularly test these procedures to ensure operational readiness and validate RTO and RPO requirements. Testing should simulate real-world failure conditions and scenarios to identify gaps and improve the recovery process. For more information, see Restore overview. An on-demand backup sketch appears after this checklist.
❑  
Implement automated backup schedules to ensure consistent and reliable data protection. Configure frequency and retention settings to match business needs and regulatory obligations. Use Spanner's backup scheduling features to automate the creation, management, and monitoring of backups. For more information, see Create and manage backup schedules.
❑  
Align failover procedures with your application's instance configuration topology to minimize latency impacts during an outage. Test disaster recovery scenarios to ensure the application can operate efficiently when the leader region moves to a failover region. For more information, see Modify the leader region of a database.
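
As a starting point for the backup items above, the following sketch creates an on-demand backup with the Python client library; resource and backup names are hypothetical, and recurring or incremental backups are configured separately through backup schedules.

```python
import datetime

from google.cloud import spanner

# Hypothetical resource names for illustration.
client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-database")

# On-demand backup that expires in 14 days; align expiration and frequency
# with your RPO, retention policy, and compliance requirements.
expire_time = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=14)
backup = instance.backup(
    "pre-launch-backup", database=database, expire_time=expire_time
)
operation = backup.create()
operation.result()  # Wait for the backup to complete.
```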

Query optimizer and statistics management

Managing query optimizer versions and statistics is important for maintaining predictable and efficient query performance. Using tested versions and keeping statistics up-to-date ensures stability, prevents unexpected performance changes, and optimizes query execution plans, especially during significant data or schema modifications.

Checkbox Activity
❑  
By default, Spanner databases use the latest query optimizer version. To ensure predictable query performance, use the optimizer version on which the workload has been tested. Regularly evaluate new optimizer versions in a controlled environment, and update the default version only after confirming compatibility and performance improvements. For more information, see the Query optimizer overview.
❑  
Ensure that query optimizer statistics are up-to-date to support efficient query execution plans. Although statistics are updated automatically, consider manually constructing a new statistics package in scenarios such as large-scale data modifications (for example, bulk inserts, updates or deletes), addition of new indexes, or schema changes. Keeping the query optimizer statistics current is critical for maintaining optimal query performance.
❑  
In certain scenarios, such as after bulk deletes or when new statistics generation might unpredictably impact query performance, pinning a specific statistics package is advisable. This provides consistent query performance until a new package can be generated and tested. Regularly review the need to pin statistics and unpin once updated packages are validated. For more information, see Query optimizer statistics packages.
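
The following sketch, with hypothetical resource names and a placeholder statistics package name, shows one way to apply the three items above from the Python client library: run a query against a pinned optimizer version, build a fresh statistics package with ANALYZE, and pin a named package at the database level.

```python
from google.cloud import spanner
from google.cloud.spanner_v1 import ExecuteSqlRequest

# Hypothetical resource names for illustration.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Run a query against the optimizer version the workload was tested on.
query_options = ExecuteSqlRequest.QueryOptions(optimizer_version="7")
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql(
        "SELECT CustomerId FROM Customers LIMIT 10",
        query_options=query_options,
    ):
        print(row)

# Build a new statistics package after large-scale data or schema changes.
database.update_ddl(["ANALYZE"]).result()

# Pin a specific statistics package (placeholder name) until a newer one is
# validated; the package typically must be exempt from garbage collection
# (ALTER STATISTICS ... SET OPTIONS (allow_gc = false)) before pinning.
database.update_ddl([
    "ALTER DATABASE `my-database` SET OPTIONS "
    "(optimizer_statistics_package = 'auto_20240101_00_00_00UTC')"
]).result()
```

To unpin, run the same ALTER DATABASE statement with the option set to NULL once the newer package has been validated.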

Security

Implementing access control measures is essential to protect sensitive data and prevent unauthorized access in Spanner. By enforcing least-privilege access, fine-grained access control (FGAC), and database deletion protection, you can minimize risk, ensure compliance, and safeguard critical assets against accidental or malicious actions.

Checkbox Activity
❑  
Review and implement Identity and Access Management (IAM) policies following the least-privilege principle for all users and service accounts accessing your database. Assign only the necessary permissions required to perform specific tasks and regularly audit access control permissions to ensure adherence to this model. Use service accounts with minimal privileges for automated processes to reduce the risk of unauthorized access. For more information, see the IAM overview.
❑  
If the application requires restricted access to specific rows, columns, or cells within a table, implement fine-grained access control (FGAC). Design and apply conditional access policies based on user attributes or data values to enforce granular access rules. Regularly review and update these policies to align with evolving security and compliance requirements. For more information, see the Fine-grained access control overview. An FGAC sketch appears after this checklist.
❑  
Enable database deletion protection to prevent accidental or unauthorized deletions. Combine this with strict IAM controls to limit deletion privileges to a small, trusted set of users or service accounts. Additionally, configure infrastructure automation tools like Terraform to include safeguards against unintentional deletion of your databases. This layered approach minimizes risks to critical data assets. For more information, see Prevent accidental database deletion.
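
To illustrate the fine-grained access control item above, the following Python sketch creates a hypothetical database role, grants it SELECT on one table, and opens a connection that acts as that role. Names and grants are assumptions for illustration only.

```python
from google.cloud import spanner

# Hypothetical resource, role, and table names.
client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-database")

# Define a database role with read-only access to a single table.
database.update_ddl([
    "CREATE ROLE orders_reader",
    "GRANT SELECT ON TABLE Orders TO ROLE orders_reader",
]).result()

# Access the database as that role; IAM must also allow the caller to use
# the role (for example, through roles/spanner.fineGrainedAccessUser).
scoped_database = instance.database("my-database", database_role="orders_reader")
with scoped_database.snapshot() as snapshot:
    for row in snapshot.execute_sql("SELECT OrderId FROM Orders LIMIT 10"):
        print(row)
```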

Logging and monitoring

Effective logging and monitoring are critical for maintaining visibility into database operations, detecting anomalies, and ensuring system health. By using audit logs, distributed tracing, dashboards, and proactive alerts, you can quickly identify and resolve issues, optimize performance, and meet compliance requirements.

Checkbox Activity
❑  
Enable audit logging to capture detailed information about database activities. Configure audit log levels appropriately based on compliance and operational requirements to monitor access patterns and detect anomalies effectively. Be aware that audit logs can grow large, especially for DATA_READ and DATA_WRITE requests, because all SQL and DML statements are logged for those request types. For more information, see Spanner audit logging.

Routing these logs to a user-defined log bucket lets you optimize log retention costs (the first 30 days aren't charged) and control log access granularly using log views.
❑  
Collect client-side metrics by instrumenting your application logic with OpenTelemetry for distributed tracing and observability. Set up OpenTelemetry instrumentation to capture traces and metrics from Spanner, ensuring end-to-end visibility into application performance and database interactions. For more information, see Capture custom client-side metrics using OpenTelemetry. A tracing sketch appears after this checklist.
❑  
Create and configure monitoring metrics to visualize query performance, latency, CPU utilization, and storage usage. Use these metrics for real-time tracking and historical analysis of database performance. For more information, see Monitor instances with Cloud Monitoring.
❑  
Define threshold-based monitoring alerts for critical metrics to proactively detect and address issues. Configure alerts for conditions like high query latency, low storage availability, or unexpected spikes in traffic. Integrate these alerts with incident response tools for prompt action. For more information, see Create alerts for Spanner metrics.
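
As a minimal sketch of the OpenTelemetry item above, the following Python code registers a tracer provider with a console exporter (swap in a Cloud Trace or OTLP exporter in practice) and wraps a Spanner query in an application-level span. Resource names are hypothetical.

```python
from google.cloud import spanner
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a global tracer provider; replace ConsoleSpanExporter with a
# Cloud Trace or OTLP exporter when sending traces to a real backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-app")

# Hypothetical resource names for illustration.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Wrap the database call in an application span so client-side latency is
# visible end to end alongside any spans emitted by the client library.
with tracer.start_as_current_span("list_customers"):
    with database.snapshot() as snapshot:
        for row in snapshot.execute_sql("SELECT CustomerId FROM Customers LIMIT 10"):
            print(row)
```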

Client library

Configuring operation tagging, session pools, and retry policies is vital for optimizing performance, debugging issues, and maintaining resilience in Spanner. These measures enhance observability, reduce latency, and ensure efficient handling of workload demands and transient errors, aligning system behavior with application requirements.

Checkbox Activity
❑  
Configure the client library to use meaningful request and transaction tags. You can use request and transaction tags to develop an understanding of your queries, reads, and transactions. As a best practice, include contextual metadata, such as application component, request type, or user context, in your tags to enable enhanced debugging and introspection. Ensure tags are visible in query statistics and logs to facilitate performance analysis and troubleshooting. For more information, see Troubleshoot with request tags and transaction tags. A tagging and retry sketch appears after this checklist.
❑  
Optimize session management by enabling session pooling in the client library. Configure pool settings, such as minimum and maximum sessions, to match workload demands while minimizing latency. Regularly monitor session usage to fine-tune these parameters and ensure that the session pool provides consistent performance benefits. For more information, see Sessions.
❑  
In rare scenarios, the client library's default retry parameters, such as maximum attempts and exponential backoff intervals, might need to be tuned to balance resilience with performance. Test these policies thoroughly to ensure that they align with application needs. For more information, see Configure custom timeouts and retries.
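
The following sketch, using hypothetical resource names and tag values, shows request and transaction tags plus a custom retry and timeout on a single query with the Python client library. Default retry settings are usually sufficient, so treat the overrides as illustrative.

```python
from google.api_core import exceptions as core_exceptions
from google.api_core import retry as retries
from google.cloud import spanner

# Hypothetical resource names for illustration.
client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Tag a read-write transaction and the statement inside it so they can be
# correlated in query statistics and logs.
def update_totals(transaction):
    transaction.execute_update(
        "UPDATE Orders SET OrderTotal = OrderTotal * 1.1 WHERE CustomerId = 'c-1'",
        request_options={"request_tag": "app=shop,component=billing,action=adjust"},
    )

database.run_in_transaction(update_totals, transaction_tag="app=shop,component=billing")

# Tag a standalone query and override its retry and timeout behavior.
# Older google-api-core versions name the timeout parameter "deadline".
custom_retry = retries.Retry(
    initial=0.25,
    maximum=8.0,
    multiplier=2.0,
    timeout=30.0,
    predicate=retries.if_exception_type(core_exceptions.ServiceUnavailable),
)
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT OrderId FROM Orders WHERE CustomerId = 'c-1'",
        request_options={"request_tag": "app=shop,component=billing,action=list"},
        retry=custom_retry,
        timeout=30,
    )
    for row in rows:
        print(row)
```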

Support

To minimize downtime and impact, define clear incident roles and responsibilities to ensure prompt and coordinated responses to Spanner-related issues. For more information, see Get support.

Checkbox Activity
❑  
Establish a clear incident response framework, defining roles and responsibilities for all team members involved in managing Spanner-related incidents. Designate incident roles such as Incident Commander, Communications Lead, and Subject Matter Experts (SMEs) to ensure efficient coordination and communication during incidents. Develop and document processes for identifying, escalating, mitigating, and resolving issues. Follow best practices outlined in the Google SRE Workbook on Incident Response and Managing Incidents. Conduct regular incident response training and simulations to ensure readiness and improve the team's ability to manage high-pressure scenarios effectively.

Cost management

Implementing cost management strategies like committed use discounts (CUDs), autoscaling, and incremental backups ensures efficient resource utilization and significant cost savings. Aligning resource provisioning with workload demands and optimizing non-production environments further reduces expenses while maintaining performance and flexibility.

Checkbox Activity
❑  
Evaluate and purchase CUDs for Spanner to lower costs on predictable workloads. These commitments might provide significant savings compared to on-demand pricing. Analyze historical usage patterns to determine optimal CUD commitments. For more information, see Committed use discounts and Spanner pricing.
❑  
Monitor compute capacity utilization and adjust provisioned resources to maintain recommended CPU utilization levels. Over-provisioning compute resources might lead to unnecessary costs, while under-provisioning might impact performance. Follow the recommended maximum Spanner CPU utilization guidelines to ensure cost-effective resource alignment. A utilization monitoring sketch appears after this checklist.
❑  
Enable autoscaling to dynamically adjust compute capacity based on workload demands. This ensures optimal performance during peak loads while reducing costs during periods of low activity. Configure scaling policies with upper and lower limits to control cost and avoid over-scaling. For more information, see Autoscaling overview.
❑  
Use incremental backups to reduce backup storage costs. Incremental backups only store data changes since the last backup. This significantly lowers storage requirements compared to full backups. Incorporate incremental backups into your backup strategy. For more information, see Incremental backups.
❑  
Optimize costs for non-production environments by selecting the most cost-effective instance configuration and deprovisioning resources when environments aren't in use. For example, downsize non-critical environments after hours or automate resource scaling for development and testing scenarios. This approach minimizes costs while maintaining operational flexibility.
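
To support the utilization item above, the following sketch uses the Cloud Monitoring Python client to read the last hour of high-priority CPU utilization for a hypothetical instance; compare the values against the recommended maximums before resizing capacity or adjusting autoscaling limits.

```python
import time

from google.cloud import monitoring_v3

# Hypothetical project and instance IDs.
project_name = "projects/my-project"
instance_id = "my-instance"

client = monitoring_v3.MetricServiceClient()

# Query the last hour of high-priority CPU utilization for the instance.
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},
    }
)
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type="spanner.googleapis.com/instance/cpu/utilization_by_priority" '
            'AND metric.labels.priority="high" '
            f'AND resource.labels.instance_id="{instance_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```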