Diagnosing issues with Cloud SQL instances

This page lists the most frequent issues you might run into when working with Cloud SQL instances and steps you can take to address them. Also review the Known issues, Troubleshooting, and Support pages.

Viewing logs

To see information about recent operations, you can view the Cloud SQL instance operation logs or the PostgreSQL error logs.

Connection issues

Verify that your application is closing connections properly

If you see errors containing "Aborted connection nnnn to db:", it usually indicates that your application is not closing connections properly. Network issues can also cause this error. The error does not mean that there are problems with your Cloud SQL instance.

For examples of best practices for connection management, see Managing connections.

Verify that your certificates have not expired

If your instance is configured to use SSL, go to the Cloud SQL Instances page in the Cloud Console and open the instance. Open its Connections page and make sure that your server certificate is valid. If it has expired, you must add a new certificate and rotate to it. Learn more.
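
If you rotate the server CA certificate with gcloud, the flow might look like the following sketch. The instance name is a placeholder, and the sketch assumes your Cloud SDK version includes the sql ssl server-ca-certs commands; verify against the current gcloud reference.

# Create the new server CA certificate, update clients to trust it, then rotate.
gcloud sql ssl server-ca-certs create --instance=INSTANCE_NAME
gcloud sql ssl server-ca-certs rotate --instance=INSTANCE_NAME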

Verify that you are authorized to connect

If your connections are failing, check that you are authorized to connect:

  • If you are having trouble connecting using an IP address, for example, you are connecting from your on-premises environment with the psql client, then make sure that the IP address you are connecting from is authorized to connect to the Cloud SQL instance.

    Connections to a Cloud SQL instance using a private IP address are automatically authorized for RFC 1918 address ranges. This way, all private clients can access the database without going through the proxy. Non-RFC 1918 address ranges must be configured as authorized networks.

    Cloud SQL doesn't learn non-RFC 1918 subnet routes from your VPC by default. You need to update the network peering to Cloud SQL to export any non-RFC 1918 routes. For example:

    gcloud compute networks peerings update cloudsql-postgres-googleapis-com --network=NETWORK --export-subnet-routes-with-public-ip --project=PROJECT

  • Check the public IP address that your client is currently connecting from; this is the address that must be authorized on the instance.

  • Try the gcloud sql connect command to connect to your instance (see the example after this list). This command authorizes your IP address for a short time. You can run this command in an environment with the Cloud SDK and psql client installed. You can also run this command in Cloud Shell, which is available in the Google Cloud Console and has the Cloud SDK and the psql client pre-installed. Cloud Shell provides a Compute Engine instance that you can use to connect to Cloud SQL.
  • Temporarily allow all IP addresses to connect to an instance by authorizing 0.0.0.0/0.
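
For example, a connection attempt from Cloud Shell might look like the following sketch (the instance name is a placeholder; postgres is the default administrative user for PostgreSQL instances):

gcloud sql connect INSTANCE_NAME --user=postgres

The command temporarily authorizes your current IP address, then opens a psql session and prompts for the user's password.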

Determining how connections are being initiated

You can see information about your current connections by running the following command:

SELECT * FROM pg_stat_activity;

Connections that show an IP address, such as 1.2.3.4, are connecting using IP. Connections with cloudsqlproxy~1.2.3.4 are using the Cloud SQL Proxy, or else they originated from App Engine. Connections from localhost may be used by some internal Cloud SQL processes.
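
To summarize where connections are coming from, a query along the following lines can help (a sketch; adjust it as needed for your environment):

SELECT client_addr, count(*) AS connection_count
FROM pg_stat_activity
GROUP BY client_addr
ORDER BY connection_count DESC;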

Understand connection limits

There are no QPS limits for Cloud SQL instances. However, there are connection, size, and App Engine specific limits in place. See Quotas and Limits.

Database connections consume resources on the server and the connecting application. Always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. For more information, see Managing database connections.
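
For example, you can compare the instance's connection limit with the number of connections currently open by using standard PostgreSQL statements:

SHOW max_connections;
SELECT count(*) AS open_connections FROM pg_stat_activity;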

Show connections and threads

To see the processes that are running on your database, query the pg_stat_activity view:

SELECT * FROM pg_stat_activity;

Connections from Compute Engine

Connections with a Compute Engine instance time out after 10 minutes of inactivity, which can affect long-lived unused connections between your Compute Engine instance and your Cloud SQL instance. For more information, see Networking and Firewalls in the Compute Engine documentation.

To keep long-lived unused connections alive, you can set the TCP keepalive. The following commands set the TCP keepalive value to one minute and make the configuration permanent across instance reboots.

Display the current tcp_keepalive_time value.

cat /proc/sys/net/ipv4/tcp_keepalive_time

Set tcp_keepalive_time to 60 seconds and make it permanent across reboots.

echo 'net.ipv4.tcp_keepalive_time = 60' | sudo tee -a /etc/sysctl.conf

Apply the change.

sudo /sbin/sysctl --load=/etc/sysctl.conf

Display the tcp_keepalive_time value to verify the change was applied.

cat /proc/sys/net/ipv4/tcp_keepalive_time

Connection times out

If a client cannot connect to the Cloud SQL instance using private IP, check to see if the client is using any IP in the range 172.17.0.0/16. Connections fail from any IP within the 172.17.0.0/16 range to Cloud SQL instances using private IP. Similarly, Cloud SQL instances created with an IP in that range are unreachable. This range is reserved for the Docker bridge network.

Occasional connection failures

When Cloud SQL restarts an instance due to maintenance events, connections might be routed to the failover replica. When connecting to the failover replica:

  • Read requests from clients using unencrypted connections succeed as normal. However, write requests fail and return an error message, such as 'Error 1290: The MySQL server is running with the --read-only option so it cannot execute this statement.'

  • Read and write requests from clients using encrypted connections fail and return an error message, such as 'x509: certificate is valid for master-instance, not failover-instance.'

After the event is over, Cloud SQL resets the connection. Retry the connection. We recommend that you design your applications to handle occasional connection failures by implementing an error handling strategy like exponential backoff. See Application implementation for more information.
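
As a minimal illustration of exponential backoff at the client level, a shell loop like the following retries a connection with an increasing delay. The connection parameters are placeholders; real applications usually implement backoff in the application layer or through their connection library.

# Retry up to 5 times, doubling the wait between attempts (2, 4, 8, 16, 32 seconds).
for attempt in 1 2 3 4 5; do
  psql "host=INSTANCE_IP user=postgres dbname=postgres" -c 'SELECT 1;' && break
  sleep $((2 ** attempt))
done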

Instance issues

Import and export

Imports into and exports out of Cloud SQL using the import and export functionality (with a Cloud Storage bucket) can take a long time to complete, depending on the size of the database. This can have the following impacts:

  • You cannot stop a long-running operation.
  • You can perform only one import or export operation at a time for each instance.

You can decrease the amount of time it takes to complete each operation by using the Cloud SQL import or export functionality with smaller batches of data.

For exports, you can use serverless export to minimize the impact on database performance and allow other operations to run on your instance while an export is running.
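
For example, a serverless export of a single database to a Cloud Storage bucket might look like the following sketch. The instance, bucket, and database names are placeholders; the --offload flag requests a serverless export.

gcloud sql export sql INSTANCE_NAME gs://BUCKET_NAME/export.sql.gz \
  --database=DATABASE_NAME --offload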

Disk space

If your instance reaches the maximum storage amount allowed, writes to the database fail. If you delete data, for example, by dropping a table, the space is freed, but it is not reflected in the reported Storage Used of the instance. You can run the VACUUM FULL command to recover unused space; note that write operations are blocked while the vacuum command is running. Learn more.
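
For example, to reclaim space from a single table you could run the following (my_table is a placeholder table name):

-- my_table is a placeholder; VACUUM FULL rewrites the table and returns unused
-- space to the operating system. Writes to the table are blocked while it runs.
VACUUM FULL my_table;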

Suspended state

There are various reasons why Cloud SQL may suspend an instance, including:

  • Billing issues

    For example, if the credit card for the project's billing account has expired, the instance may be suspended. You can check the billing information for a project by going to the Google Cloud Console billing page, selecting the project, and viewing the billing account information used for the project. After you resolve the billing issue, the instance returns to runnable status within a few hours.

  • KMS key issues

    For example, if the KMS key version used to encrypt the user data in the Cloud SQL instance is not present, or if it has been disabled or destroyed. See Using customer-managed encryption keys (CMEK).

  • Legal issues

    For example, a violation of the Google Cloud Acceptable Use Policy may cause the instance to be suspended. For more information, see "Suspensions and Removals" in the Google Cloud Terms of Service.

  • Operational issues

    For example, if an instance is stuck in a crash loop (it crashes while starting or just after starting), Cloud SQL may suspend it.

While an instance is suspended, you can continue to view information about it, or, if billing issues triggered the suspension, you can delete it.

Cloud SQL users with Platinum, Gold, or Silver support packages can contact our support team directly about suspended instances. All users can use the earlier guidance along with the google-cloud-sql forum.

Performance

Keep a reasonable number of database schemas and tables

Database schemas and tables consume system resources. A large number can affect instance performance.

General performance tips

Make sure that your instance is not constrained on memory or CPU. For performance-intensive workloads, ensure your instance has at least 60 GB of memory.

For slow database inserts, updates, or deletes, check the locations of the writer and database; sending data a long distance introduces latency.

For slow database selects, consider the following:

  • Caching is important for read performance. Check the various blks_hit / (blks_hit + blks_read) ratios from the PostgreSQL Statistics Collector (see the example query after this list). Ideally, the ratio is above 99%. If not, consider increasing the size of your instance's RAM.
  • If your workload consists of CPU intensive queries (sorting, regular expressions, other complex functions), your instance might be throttled; add vCPUs.
  • Check the location of the reader and database; latency affects read performance even more than write performance.
  • Investigate non-Cloud SQL specific performance improvements, such as adding appropriate indexing, reducing data scanned, and avoiding extra round trips.

If you observe poor performance executing queries, use EXPLAIN to identify where to add indexes to tables to improve query performance. For example, make sure every field that you use as a JOIN key has an index on both tables.
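
For example, the following sketch checks the overall cache hit ratio per database and then inspects the plan of a hypothetical slow query. The orders and customers tables and their columns are placeholders.

-- Cache hit ratio per database; ideally above 0.99.
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database;

-- Inspect the plan of a slow query; orders and customers are placeholder tables.
EXPLAIN ANALYZE
SELECT o.id
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EU';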

Customer-managed encryption keys (CMEK)

Cloud SQL administrator operations, such as create, clone, or update, might fail due to Cloud KMS errors, and missing roles or permissions. Common reasons for failure include a missing Cloud KMS key version, a disabled or destroyed Cloud KMS key version, insufficient IAM permissions to access the Cloud KMS key version, or the Cloud KMS key version is in a different region than the Cloud SQL instance. Use the following troubleshooting table to diagnose and resolve common problems.

Customer-managed encryption keys troubleshooting table

For this error... The issue might be... Try this...
Per-product, per-project service account not found. The service account name is incorrect. Make sure you created a service account for the correct user project (see the Service Accounts page).
Cannot grant access to the service account. The user account does not have permission to grant access to this key version. Add the Organization Administrator role on your user or service account (see the IAM accounts page).
Cloud KMS key version is destroyed. The key version is destroyed. If the key version is destroyed, you cannot use it to encrypt or decrypt data.
Cloud KMS key version is disabled. The key version is disabled. Re-enable the Cloud KMS key version (see the Crypto Keys page).
Insufficient permission to use the Cloud KMS key. The cloudkms.cryptoKeyEncrypterDecrypter role is missing on the user or service account you are using to run operations on Cloud SQL instances, or the Cloud KMS key version doesn't exist. Add the cloudkms.cryptoKeyEncrypterDecrypter role on your user or service account (see the IAM accounts page). If the role is already on your account, see Creating a key to learn how to create a new key version. See note.
Cloud KMS key is not found. The key version does not exist. Create a new key version. See Creating a key. See note.
Cloud SQL instance and Cloud KMS key version are in different regions. The Cloud KMS key version and Cloud SQL instance must be in the same region. It does not work if the Cloud KMS key version is in a global region or multi-region. Create a key version in the same region where you want to create instances. See Creating a key. See note.
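
For example, granting the encrypter/decrypter role on a key to the Cloud SQL service account might look like the following sketch. All names are placeholders; verify the service account email, key ring, and key names for your project.

gcloud kms keys add-iam-policy-binding KEY_NAME \
  --keyring=KEY_RING_NAME --location=REGION \
  --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
  --role=roles/cloudkms.cryptoKeyEncrypterDecrypter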

Troubleshooting Cloud SQL instances

Click the links in the table for details:

For this problem... The issue might be... Try this...
You want to find out what queries are running now. N/A Try this database query.
You want to find out what units are being used for a specific field. N/A Try this database query.
You want to find the current value of a database setting. N/A Try these database queries.
Temporary storage increased automatic storage. Automatic storage is enabled. Restarting deletes the temporary files but does not reduce the storage. Only customer support can reset the instance size. Learn more.
Data is being automatically deleted. There is a script running somewhere doing this. Try to find the script.
The instance cannot be deleted. There can be more than one root cause. There can be more than one solution.
The instance is stuck due to large temporary data size. Too many temporary tables were created at one time. Restart the instance, and try this mitigation option.
Fatal error during upgrade. There could be many causes. Logs may reveal more. You may need to contact customer support to force a restart.
Instance is stuck on restart after running out of disk space. Automatic storage increase capability is not enabled. Enable automatic storage increase.
On-premises primary instance is stuck. N/A Cloud SQL customer support cannot help with instances that are not in Cloud SQL.
Slow shutdown on restart. Outstanding connections that don't end within 60 seconds can cause unclean shutdowns. Only have connections that last less than 60 seconds.
Access denied for user. User authentication, or possibly expired SSL/TLS certificates. Check user and cert statuses.
A user cannot be deleted. It could be that the user owns objects in the database. You may need to drop or reassign objects.
Cannot assign a private IP address to an existing instance in a Shared VPC. Instance addresses are tied to their projects upon creation. Create a new Cloud SQL instance to replace the existing one.
Particular queries are running slow. Database-specific issues or network latency. Check these suggestions.
Out of memory is indicated but monitoring charts don't show that. Some RAM may be in use by internal overhead processes. Ensure the instance has enough overhead for your workload.


You want to find out which queries are running now

You want to find out which queries are running right now.

The issue might be

N/A

Things to try

Connect to the database and run this query:

SELECT datname, usename, application_name AS appname, client_addr, state,
       now() - backend_start AS conn_age, now() - xact_start AS xact_age,
       now() - query_start AS query_age, now() - state_change AS last_activity_age,
       wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY 8 DESC
LIMIT 20;


Want to find out what units are being used for a field

You want to find out what units are being used for a particular field.

The issue might be

N/A

Things to try

Connect to the database and run this query (using your own FIELD_NAME):

SELECT name, setting, unit FROM pg_settings WHERE name = '[FIELD_NAME]';


Want to find the current value for a setting

You want to find the current value for a particular setting.

The issue might be

N/A

Things to try

Connect to the database and run this query (using your own SETTING_NAME): SHOW [SETTING_NAME]; or SHOW ALL; to see all settings.


Temporary storage increased automatic storage

Temporary tables increased storage usage, and automatic storage was increased.

The issue might be

Automatic storage is enabled.

Things to try

Restarting to delete temporary tables does not reduce the automatically increased storage size.


Data is being automatically deleted

You notice that data is being automatically deleted at a regular interval.

The issue might be

Most likely a script is running somewhere in your environment.

Things to try

Look in the logs around the time of the deletion and see if there is a rogue script running from a dashboard or another automated process.


The instance cannot be deleted

You see the error message ERROR: (gcloud.sql.instances.delete) HTTP Error 409: The instance or operation is not in an appropriate state to handle the request, or the instance may have an INSTANCE_RISKY_FLAG_CONFIG flag status.

The issue might be

  1. Another operation is in progress.
  2. The INSTANCE_RISKY_FLAG_CONFIG warning is triggered whenever at least one beta flag is being used.

Things to try

  1. Cloud SQL operations do not run concurrently. Wait for the other operation to complete; you can check for in-progress operations as shown in the example after this list.
  2. Remove the risky flag settings and restart the instance.
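
For example, to check whether another operation is still in progress on the instance (the instance name is a placeholder):

gcloud sql operations list --instance=INSTANCE_NAME --limit=10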

The system is stuck due to large temporary data size

The system is stuck due to large temporary data size.

The issue might be

The system can create many temporary tables at one time, depending on the queries and the load.

Things to try

Unfortunately, you cannot shrink the ibtmp1 file by any method other than restarting the service.

One mitigation option is to create the temporary table with ROW_FORMAT=COMPRESSED, so it is stored in file-per-table tablespaces in the temporary file directory. However, the downside is performance costs associated with creating and removing a file-per-table tablespace for each temporary table.


Fatal error during upgrade

You see the error message ERROR_INTERNAL_FATAL when upgrading resources on an instance.

The issue might be

There could be many causes.

Things to try

Logs may reveal more, but in any case you may need to contact customer support to force re-creation of the instance.


Instance is stuck on restart after running out of disk space

The instance is stuck on restart after running out of disk space.

The issue might be

Automatic storage increase capability is not enabled.

Things to try

If your instance runs out of storage, and the automatic storage increase capability is not enabled, your instance goes offline. To avoid this issue, you can edit the instance to enable automatic storage increase.
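
For example, enabling automatic storage increase might look like the following sketch (the instance name is a placeholder; verify the flag name against the current gcloud reference):

gcloud sql instances patch INSTANCE_NAME --storage-auto-increase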


On-premises primary instance is stuck

You want to know if Cloud SQL customer support can help when an on-premises primary instance is stuck.

The issue might be

The instance is not in Cloud SQL.

Things to try

Cloud SQL customer support cannot help with instances that are not in Cloud SQL.


Slow shutdown on restart

Slow shutdown on restart.

The issue might be

When an instance shuts down, any outstanding connections that don't end within 60 seconds make the shutdown unclean.

Things to try

Most unclean shutdowns can be avoided by having only connections that last less than 60 seconds, including connections from the database command prompt. If you keep these connections open for hours or days, shutdowns can be unclean.


Access denied for user

You see the error message Access denied for user 'XXX'@'XXX' (using password: XXX).

The issue might be

There could be several root causes, including:

  • The username (or password) is incorrect.
  • The user is connecting from something other than @XXX.
  • The user doesn't have the correct privileges for the database to which they are trying to connect.

Things to try

  • Verify the username and corresponding password.
  • Check the origin of the connection to see whether it matches where the user has been granted access privileges.
  • Check the user's grant privileges in the database, as shown in the example after this list.
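
For example, the following sketch lists table-level privileges granted to a user (the user name is a placeholder):

SELECT grantee, table_schema, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee = 'USER_NAME';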

Cannot delete user

You cannot delete a database user.

The issue might be

The user has objects in the database that depend on it. You first need to drop those objects or reassign them to another user.

Things to try

Find out which objects are dependent on the user, then drop or reassign those objects to a different user. This thread on Stack Exchange discusses how to find the objects owned by the user.
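
For example, once you know which objects the user owns, a cleanup might look like the following sketch. The role names are placeholders, and REASSIGN OWNED and DROP OWNED must be run in each database that contains objects owned by the user.

-- old_user and new_user are placeholder role names.
REASSIGN OWNED BY old_user TO new_user;  -- transfer ownership of objects
DROP OWNED BY old_user;                  -- drop remaining objects and revoke privileges
DROP ROLE old_user;                      -- finally remove the role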


Cannot assign a private IP address to an existing instance in a Shared VPC

You cannot assign a private IP address to an existing instance in a Shared VPC.

The issue might be

When a Cloud SQL instance is created, it is automatically attached to a tenant project, as are all the Cloud SQL instances in that same project. However, when an instance is created using private IP in a Shared VPC, it is attached to the tenant project associated with the Shared VPC host project.

Things to try

You can create a new Cloud SQL instance to replace the existing one.


Particular queries are running slow

Particular queries are running slow, and CPU usage is consistently high.

The issue might be

Queries can be slow for many reasons, mostly due to specific database aspects. One reason that can involve Cloud SQL is network latency, when the source (writer or reader) resource and the destination (Cloud SQL) resource are in different regions.

Things to try

Refer to general performance tips, in particular:

For slow database inserts, updates, or deletes, consider the following actions:

  • Check the locations of the writer and database; sending data a long distance introduces latency.
  • Check the location of the reader and database; latency affects read performance even more than write performance.

To reduce latency, locate both the source and destination resources in the same region.
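
For example, you can confirm the instance's region with a quick describe call (the instance name is a placeholder):

gcloud sql instances describe INSTANCE_NAME --format='value(region)'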


Out of memory is indicated but monitoring charts don't show that

An instance fails and reports Out of memory, but the Console or Cloud Monitoring charts seem to show there is still memory remaining.

The issue might be

There are other factors besides your workload that can affect memory usage, such as the number of active connections and internal overhead processes. These are not always reflected in the monitoring charts.

Things to try

Ensure that the instance has enough memory to account for your workload, plus some additional overhead.
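
For example, the connection limit and per-connection memory settings together give a rough upper bound on connection-related memory use; you can inspect them with standard PostgreSQL statements:

SHOW max_connections;
SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;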