Troubleshooting Cloud SQL for PostgreSQL

Topics in this page include:

  • Backup and recovery
  • Cloning
  • Connectivity
  • Creating instances
  • Flags
  • High availability
  • Import
  • Export
  • Logging
  • Managing instances
  • Replication

Backup and recovery

Click the links in the table for details:

For this problem... The issue might be... Try this...
Can't see current operation status. The user interface only shows success or failure. Use these database commands to find out more.
Can't find the operation originator. The user interface doesn't show who started an operation. Use audit logging to find out.
Out of disk space during automated backup. Instance reached hard disk space limits. Check the file system size and quota.
Can't do backup after instance deleted. Instance was deleted. Recreate from an export, or contact customer support if within the grace period.
Automated backup seems stuck. Backup time is correlated with database size. Contact customer support if you really need to cancel the operation.
Restore fails. Dump file may contain database users who do not yet exist. Create the database users before restoring.
Operation isn't valid for this instance. Destination instance size is smaller than the source. Increase the destination instance size.
Increase number of days to keep automated backups. Only seven automated backups are retained. Manage your own manual backups.
Unknown error in backup failure. Backup might have timed out. Check these flags.
No notification about backup failure. Notifications are not supported for backup failures. Use REST API or gcloud commands to check the status of a backup.
Instance is stuck cycling between failure and backup restore states. Too much traffic or too many open connections. Verify autovacuum settings and retry logic.
Data is missing in a backup. Unlogged tables are not restored from backup. See these notes about unlogged tables.

Can't see current operation status

You can't see the status of an operation in the Google Cloud Console.

The issue might be

The Google Cloud Console reports only success or failure when done, and is not designed to return warnings.

Things to try

Connect to the database and run SHOW WARNINGS.


Can't find the operation originator

You want to find out who issued an on-demand backup operation.

The issue might be

The instance operations page in the Google Cloud Console does not show who initiated an operation.

Things to try

Look in the logs and filter by text to find the user. You may need to use audit logs for private information. Relevant log files include:

  • cloudsql.googleapis.com/postgres.log
  • cloudaudit.googleapis.com/activity may also be available, if Cloud Audit Logs is enabled.
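For example, a minimal gcloud logging read sketch for listing recent Admin Activity audit log entries, assuming Cloud Audit Logs is enabled; PROJECT_ID and the freshness window are placeholders:

gcloud logging read "resource.type=cloudsql_database AND logName=projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity" --project=PROJECT_ID --freshness=7d --format=json

The returned entries typically include the caller identity in protoPayload.authenticationInfo.principalEmail.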


Out of disk space during automated backup

You see the error message [ERROR] InnoDB: Write to file ./ibtmp1 failed at offset XXXX, YYYY bytes should have been written, only 0 were written.

The issue might be

The instance reached a hard limit during an automated backup. Temporary files can expand beyond available disk space during a backup.

Things to try

Check that the disk is not full or out of disk quota. You can either manually increase the disk size or enable auto storage increase.
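For example, a minimal sketch of enabling automatic storage increase with gcloud; INSTANCE_NAME is a placeholder:

gcloud sql instances patch INSTANCE_NAME --storage-auto-increase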


Can't do backup after instance deleted

You can't do a backup after deleting the instance.

The issue might be

Instance was deleted.

Things to try

  • The grace period for a Cloud SQL instance purge is four days. During this time, customer support can recreate the instance. After instances are purged, no data recovery is possible.
  • If you have done an export, you can create a new instance and then do an import to recreate the database. Exports are written to Cloud Storage and imports are read from there.
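For example, a minimal sketch of this flow with gcloud; INSTANCE_NAME, NEW_INSTANCE_NAME, BUCKET_NAME, and DATABASE_NAME are placeholders, and the export must have been taken before the original instance was deleted:

gcloud sql export sql INSTANCE_NAME gs://BUCKET_NAME/sqldumpfile.gz --database=DATABASE_NAME
gcloud sql import sql NEW_INSTANCE_NAME gs://BUCKET_NAME/sqldumpfile.gz --database=DATABASE_NAME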

Automated backup is stuck

Automated backup is stuck for many hours and can't be canceled.

The issue might be

Backups can take a long time depending on the database size.

Things to try

If you really need to cancel the operation, you can ask customer support to force restart the instance.


Restore from backup fails

A restore operation can fail when one or more users referenced in the SQL dump file do not exist.

The issue might be

Before restoring a SQL dump, all the database users who own objects or were granted permissions on objects in the dumped database must exist. If they do not, the restore fails to recreate the objects with the original ownership and/or permissions.

Things to try

Create the database users before restoring from the SQL dump.
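For example, a minimal sketch of creating a missing role before the restore; app_user and the password are hypothetical placeholders:

CREATE ROLE app_user LOGIN PASSWORD 'CHANGE_ME';

Repeat this for every user that owns objects or holds grants in the dump.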


Operation isn't valid for this instance

You see the error message HTTP Error 400: This operation isn't valid for this instance from an API call to instances.restoreBackup.

The issue might be

You are trying to restore a backup to a destination instance whose storage size (XX GB) is smaller than the backup size (YY GB).

Things to try

Edit the target instance to increase its storage size.


Instance is stuck cycling between failure and backup restore states

An instance is repeatedly failing because it is cycling between the failure and backup restore states. Attempts to connect to and use the database following restore fail.

The issue might be

  • There could be too many open connections. Too many connections can result from errors that occur in the middle of a connection where there are no autovacuum settings to clean up dead connections.
  • Cycling can occur if any custom code is using retry logic that doesn't stop after a few failures.
  • There could be too much traffic. Use connection pooling and other best practices for connectivity.

Things to try

  1. Verify that the database is set up for autovacuum (see the query sketch after this list).
  2. Check if there is any connection retry logic set up in custom code.
  3. Turn down traffic until the database recovers and then slowly turn traffic back up.
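For step 1, a minimal query sketch for checking the autovacuum configuration:

SHOW autovacuum;
SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';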

Data is missing in a backup

You find you are missing data when performing a backup/restore operation.

The issue might be

Tables were created as unlogged. For example:

CREATE UNLOGGED TABLE ....

These tables are not included in a restore from a backup:

  • The contents of unlogged tables don't survive failover on an HA instance.
  • Unlogged tables don't survive PostgreSQL crashes.
  • Unlogged tables are not replicated to read replicas.
  • Unlogged tables are automatically wiped during backup restore.

Things to try

The solution is to avoid unlogged tables if you want those tables restored from a backup. If you're restoring from a database that already contains unlogged tables, dump the database to a file, modify the dump file to apply ALTER TABLE ... SET LOGGED to those tables, and then reload the data.
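For example, the statement to add to the dump file (or to run after reloading) for each affected table; my_table is a hypothetical table name:

ALTER TABLE my_table SET LOGGED;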


Increase number of days to keep automated backups

You want to increase the number of days that you can keep automatic backups from seven to 30 days, or longer.

The issue might be

Only seven backups are retained. Backups get pruned regularly due to the cost and size of retaining backups. Unfortunately, this means that the currently visible backups are the only automated backups you can restore from.

Things to try

To keep backups indefinitely, you can create an on-demand backup; on-demand backups are not deleted the same way as automated backups. They remain until they are deleted or until the instance they belong to is deleted. Because this type of backup is not deleted automatically, it can affect billing.
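For example, a minimal sketch of creating an on-demand backup with gcloud; INSTANCE_NAME is a placeholder:

gcloud sql backups create --instance=INSTANCE_NAME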


Unknown error in backup failure

Backup failed and you see Unknown error.

The issue might be

The backup creation is reaching the timeout.

Things to try

There are two flags that influence backup creation: checkpoint_timeout and checkpoint_completion_target. At the start of the backup, a slow checkpoint runs, and it takes checkpoint_timeout multiplied by checkpoint_completion_target.

For example, 900 sec * 0.9 = 810 sec = 13.5 min. Because of this, a timeout can occur. Decreasing the value of checkpoint_completion_target fixes the issue in this case.
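For example, a sketch of lowering checkpoint_completion_target with gcloud, assuming the flag can be set on your instance. Note that --database-flags replaces the full set of flags currently configured, so include any other flags you rely on:

gcloud sql instances patch INSTANCE_NAME --database-flags=checkpoint_completion_target=0.5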

No notification about backup failure

An automated backup failed and you didn't receive an email notification.

The issue might be

Notifications are not supported for backup failures.

When an automated backup fails, an Operation error message appears in the Cloud SQL instance's Details page.

Things to try

You can find the status of a backup through either the REST API or gcloud commands. For example, first list the backups for an instance, and then describe a specific backup by its ID:

gcloud sql --project=PROJECT_ID backups list --instance=INSTANCE_ID
gcloud sql --project=PROJECT_ID backups describe BACKUP-ID --instance=INSTANCE_ID

Cloning

Click the links in the table for details:

For this problem... The issue might be... Try this...
Cloning fails with constraints/sql.restrictAuthorizedNetworks error. Blocked by Authorized Networks configuration. Try one of these options.

Cloning fails with a constraints/sql.restrictAuthorizedNetworks error

Cloning fails with a constraints/sql.restrictAuthorizedNetworks error.

The issue might be

The cloning operation is blocked by the Authorized Networks configuration. Authorized Networks are configured for public IP addresses in the Connectivity section of the Google Cloud Console, and cloning is not permitted due to security considerations.

Things to try

Remove all Authorized Networks entries from the Cloud SQL instance if you can. Otherwise, create a replica without any Authorized Networks entries.

Connectivity

Click the links in the table for details:

For this problem... The issue might be... Try this...
FATAL: database 'user' does not exist. gcloud sql connect --user only works with the default `postgres` user. Connect with the default user, then change users.
SSL error: invalid padding. Server certificate error. Create new server certificate and rotate.
You want to find out who is connected. N/A See these things to try.
Unauthorized to connect errors. There can be many root causes. See these things to try.
Network association failed. Service Networking API is not enabled in the project. Enable the Service Networking API in the project.
Remaining connection slots are reserved. The maximum number of connections has been reached. Increase the max_connections flag.
Set Service Networking service account as servicenetworking.serviceAgent role on consumer project. The Service Networking service account is not bound to the servicenetworking.serviceAgent role. Bind the Service Networking service account to the servicenetworking.serviceAgent role.
error x509: certificate is not valid for any names, but wanted to match project-name:db-name. Known issue: The Cloud SQL Proxy Dialer is not compatible with Go 1.15 at this time. Until fixed, see this discussion on GitHub, which includes a workaround.
Cannot parse certificates in some operating systems. Clients using x509 libraries from macOS 11.0 (Big Sur) may fail to parse some certificates of PostgreSQL instances. This may be surfaced to the client as a generic error, like "cancelled". The workaround is to rotate the server certificate and recreate client certificates.
Cannot modify allocated ranges in CreateConnection. Please use UpdateConnection. VPC peerings were not updated after an allocated range was modified or removed. See Things to try for VPC peering update details.
Allocated IP range not found in network. VPC peerings were not updated after an allocated range was modified or removed. See Things to try for VPC peering update details.
ERROR: (gcloud.sql.connect) It seems your client does not have ipv6 connectivity and the database instance does not have an ipv4 address. Please request an ipv4 address for this database instance.. You're trying to connect to your private IP instance using Cloud Shell. Connecting from Cloud Shell to an instance with only a private IP address is not currently supported.

Aborted connection

You see the error message Got an error reading communication packets, or Aborted connection xxx to db: DB_NAME.

The issue might be

  • Networking instability.
  • No response to TCP keep-alive commands (either the client or the server is not responsive, possibly overloaded).
  • The database engine connection lifetime was exceeded and the server ends the connection.

Things to try

Applications must tolerate network failures and follow best practices such as connection pooling and retrying. Most connection poolers catch these errors where possible. Otherwise the application must either retry or fail gracefully.

For connection retry, we recommend the following methods:

  1. Exponential backoff. Increase the time interval between retries exponentially.
  2. Also add randomized backoff (jitter).
Combining these methods helps reduce throttling.


FATAL: database 'user' does not exist

You see the error message FATAL: database 'user' does not exist when trying to connect to a PostgreSQL instance using gcloud sql connect --user.

The issue might be

The gcloud sql connect --user command only works with the default user (postgres).

Things to try

The workaround is to connect using the default user, then use the "\c" psql command to reconnect as the different user.
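For example, a minimal sketch of the sequence; DATABASE_NAME and USER_NAME are placeholders:

gcloud sql connect INSTANCE_NAME --user=postgres
postgres=> \c DATABASE_NAME USER_NAME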


SSL Error: Invalid padding

You see the error message SSL error: invalid padding when trying to connect to a PostgreSQL instance using SSL.

The issue might be

Something might be wrong with the server-ca certificate.

Things to try

Create a new server-ca certificate and rotate the server certs.


Want to find out who is connected

You want to find out who is connected and for how long.

The issue might be

N/A

Things to try

Log into the database and run this command: SELECT datname, usename, application_name as appname, client_addr, state, now() - backend_start as conn_age, now() - state_change as last_activity_age FROM pg_stat_activity WHERE backend_type = 'client backend' ORDER BY 6 DESC LIMIT 20;


Unauthorized to connect

You see the error message Unauthorized to connect.

The issue might be

There can be many causes, as authorization occurs at many levels.

  • At the database level, the database user must exist and its password must match.
  • At the project level, the user may lack the correct IAM permissions.
  • At the Cloud SQL level, the root cause can depend on how you connect to your instance. If you are connecting directly to an instance through the public IP, the connection's source IP must be in the authorized network of the instance.

    Private IP connectivity is allowed by default, except when you are connecting from a non-RFC 1918 address. Non-RFC 1918 client addresses must be configured as authorized networks.

    Cloud SQL doesn't learn Non-RFC 1918 subnet routes from your VPC by default. You need to update the network peering to Cloud SQL to export any Non-RFC 1918 routes. For example:

    gcloud compute networks peerings update cloudsql-postgres-googleapis-com --network=NETWORK --export-subnet-routes-with-public-ip --project=PROJECT
    

    If you are connecting through the Cloud SQL Auth proxy, ensure that the IAM permissions are set up correctly.

  • At the network level, if the Cloud SQL instance is using public IP, the connection's source IP must be in an authorized network.

Things to try

  • Check the username and password.
  • Check the user's IAM roles and permissions.
  • If using public IP, make sure the source is in the authorized networks.

Network association failed

You see the error message Error: Network association failed due to the following error: set Service Networking service account as servicenetworking.serviceAgent role on consumer project.

The issue might be

The Service Networking API is not enabled in the project.

Things to try

Enable the Service Networking API in your project. If you see this error when you are trying to assign a private IP address to a Cloud SQL instance, and you are using a shared VPC, you also need to enable the Service Networking API for the host project.
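For example, a sketch of enabling the API with gcloud; PROJECT_ID is a placeholder, and with a Shared VPC you would run it against the host project as well:

gcloud services enable servicenetworking.googleapis.com --project=PROJECT_ID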


Remaining connection slots are reserved

You see the error message FATAL: remaining connection slots are reserved for non-replication superuser connections.

The issue might be

The maximum number of connections has been reached.

Things to try

Edit the max_connections flag value.
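For example, a sketch of raising max_connections with gcloud; the value is a placeholder. Note that --database-flags replaces every flag currently set on the instance, and changing this flag may restart the instance:

gcloud sql instances patch INSTANCE_NAME --database-flags=max_connections=250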


Set Service Networking service account as servicenetworking.serviceAgent role on consumer project

You see the error message set Service Networking service account as servicenetworking.serviceAgent role on consumer project.

The issue might be

Service Networking service account is not bound to the servicenetworking.serviceAgent role.

Things to try

To mitigate this issue, try using these gcloud commands to bind the Service Networking service account to the servicenetworking.serviceAgent role.

gcloud beta services identity create --service=servicenetworking.googleapis.com --project=PROJECT_ID
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:service-PROJECT_NUMBER@service-networking.iam.gserviceaccount.com" --role="roles/servicenetworking.serviceAgent"

Error x509: certificate is not valid for any names

You see the error message error x509: certificate is not valid for any names, but wanted to match project-name:db-name

The issue might be ...

Known issue: The Cloud SQL Proxy Dialer is not compatible with Go 1.15 at this time.

Things to try

Until the bug is fixed, see this discussion on GitHub, which includes a workaround.


Cannot parse certificates in some operating systems

When you use x509 libraries from macOS 11.0 (Big Sur), you may fail to parse the certificates of PostgreSQL instances. This may be surfaced as a generic error, like "cancelled."

Things to try

The bug is fixed and the new instances won't hit this problem. For old instances hitting this problem, rotate the server certificate and recreate client certificates.


Cannot modify allocated ranges in CreateConnection. Please use UpdateConnection

You see the error message Cannot modify allocated ranges in CreateConnection. Please use UpdateConnection, or The operation "operations/1234" resulted in a failure "Allocated IP range 'xyz' not found in network.

The issue might be ...

You get the first error when you attempt to make a connection again using a different reserved range.

You see the second error when the allocated range was modified but vpc-peerings was not updated.

Things to try

You need to modify the private connection. Use the following command, and make sure to use the --force argument:

gcloud services vpc-peerings update --network=VPC_NETWORK --ranges=ALLOCATED_RANGES --service=servicenetworking.googleapis.com --force

Creating instances

Click the links in the table for details:

For this problem... The issue might be... Try this...
Internal error. Missing service networking service account. Disable and re-enable the Service Networking API.
Terraform instance creation fails. Terraform configuration error. Inspect and repair the Terraform configuration file.
HTTP Error 409 in Terraform script. Another operation is already in progress. Fix the Terraform script to wait for each operation to finish.
Unknown error. The Service Networking API might not be enabled; you may be trying to create an instance with the same name as one recently deleted; you may be trying to create multiple instances simultaneously; or your subnet creation may have failed because there were no more available addresses in the IP range. Enable the Service Networking API; use a different name for the instance, or wait until a week has passed since the instance was deleted; create instances consecutively; or allocate new ranges. See other Unknown error messages if this doesn't match your case.
Failed to create subnetwork. No more available addresses in the IP range. Allocate new ranges.

Internal error

You see the error message {"ResourceType":"sqladmin.v1beta4.instance", "ResourceErrorCode":"INTERNAL_ERROR","ResourceErrorMessage":null}.

The issue might be

The service project is likely missing the service networking service account required for this feature.

Things to try

To repair service permissions, disable the Service Networking API, wait five minutes, and then re-enable it.
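For example, a minimal sketch with gcloud; PROJECT_ID is a placeholder, and you wait about five minutes between the two commands:

gcloud services disable servicenetworking.googleapis.com --project=PROJECT_ID
gcloud services enable servicenetworking.googleapis.com --project=PROJECT_ID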


Terraform instance creation fails

Terraform instance creation fails.

The issue might be

This is usually an issue within the Terraform script itself.

Things to try

Inspect and repair the Terraform configuration file.


409 error in Terraform script

You see the error message HTTP Error 409 in Terraform scripts.

The issue might be

Operation failed because another operation was already in progress.

Things to try

Revise the script to halt execution until each instance operation is completed. Have the script poll and wait until a 200 is returned for the previous operation ID before continuing to the next step.


Unknown error

When trying to create an instance, you see an error message like Cloud SQL creation failed, error UNKNOWN.

The issue might be

  • The Service Networking API might not be enabled.
  • You may be trying to re-use the name of an instance you recently deleted. Instance names cannot be re-used for one week after deletion.
  • You may be trying to create multiple instances simultaneously. In this case, only the first instance is created and the rest fail with Unknown error. You can only run one create operation at a time.
  • Your subnet creation may have failed if there were no more available addresses in the IP range.

Things to try

  • Enable the Service Networking API.
  • Use a different name for the instance, or wait one week to create a new one using that name.
  • Create multiple instances consecutively instead of simultaneously.
  • See the Failed to create subnetwork section below.

Failed to create subnetwork

You get the error message: Failed to create subnetwork. Couldn't find free blocks in allocated IP ranges. Please allocate new ranges for this service provider.

The issue might be

There are no more available addresses in the allocated IP range.

If you encounter this error when attempting to create a Cloud SQL instance with private IP against a Shared VPC network using private service connections, there are several possible scenarios:

  • The size of the allocated IP range for the private service connection is smaller than /24.
  • The size of the allocated IP range for the private service connection is too small for the number of Cloud SQL instances.
  • You're attempting to create both MySQL or SQL Server and PostgreSQL instances on the same private service connection in the VPC host project. MySQL and SQL Server can share the same service connection. PostgreSQL requires its own service connection.
  • You're attempting to create instances on the same private service connection in different regions, which is not supported.

Things to try

For each of the above scenarios, you can either expand the existing IP range or allocate an additional IP range to the private service connection.

If you're allocating a new range, take care to not create an allocation that overlaps with any existing allocations.

After creating a new IP range, update the vpc peering with the following command:

gcloud services vpc-peerings update --service=servicenetworking.googleapis.com \
--ranges=[OLD_RESERVED_RANGE_NAME],[NEW_RESERVED_RANGE_NAME] --network=[VPC_NETWORK] \
--project=[PROJECT_ID] --force

If you're expanding an existing allocation, take care to only increase the allocation range and not decrease it. For example, if the original allocation was 10.0.10.0/24 then make the new allocation at least 10.0.10.0/23.

In general, if starting from a /24 allocation, decrementing the /mask by 1 for each condition (additional instance type group, additional region) is a good rule of thumb. For example, if trying to create both instance type groups on the same allocation, going from /24 to /23 is enough.

After expanding an existing IP range, update the vpc peering with following command:

gcloud services vpc-peerings update --service=servicenetworking.googleapis.com \
--ranges=[RESERVED_RANGE_NAME] --network=[VPC_NETWORK] --project=[PROJECT_ID]

Flags

Click the links in the table for details:

For this problem... The issue might be... Try this...
There's no flag to set the time zone. A time zone flag is not supported. There are some workarounds.

There's no flag to set the time zone

Cloud SQL for PostgreSQL and Cloud SQL for SQL Server do not support a time zone flag for adjusting the time zone based on user needs.

The issue might be

A time zone flag is not supported.

Things to try

You can set the time zone per session, but this will expire when you log off. A better solution is to connect to the database and set the database timezone to the desired one either per user or per database:

ALTER DATABASE dbname SET TIMEZONE TO 'timezone';
ALTER USER username SET TIMEZONE TO 'timezone';

These settings remain even after the session is closed, mimicking a .conf configuration.

High availability

Click the links in the table for details:

For this problem... The issue might be... Try this...
Can't find metrics for manual failover. Only automatic failovers go into the metrics. N/A
CPU and RAM are near 100% usage Instance machine size is too small for the load. Upgrade the instance machine size.

Can't find metrics for manual failover

You performed a manual failover and can't find a corresponding entry in Metrics Explorer in the automatic failover metrics.

The issue might be

Only automatic failovers go into the metrics. Manually initiated failovers do not.

Things to try

N/A


CPU and RAM are near 100% usage

Cloud SQL instance resources (CPU and RAM) are near 100% usage, causing the high availability instance to go down.

The issue might be

The instance machine size is too small for the load.

Things to try

Edit the instance to upgrade to a larger machine size to get more CPUs and memory.

Import

Click the links in the table for details:

For this problem... The issue might be... Try this...
Import is taking too long. Too many active connections can interfere with import operations. Close unused connections, or restart the Cloud SQL instance before beginning an import operation.
Import fails. Exported file may contain database users who do not yet exist. Clean up the failed database before retrying the import. Create the database users before doing the import.

Import is taking too long

Import is taking too long, blocking other operations.

The issue might be

Too many active connections can interfere with import operations. Connections consume CPU and memory, limiting the resources available.

Things to try

Close unused connections. Check CPU and memory usage to make sure there are plenty of resources available. The best way to ensure maximum resources for the import operation is to restart the instance before beginning the operation. A restart:

  • Closes all connections.
  • Ends any tasks that may be consuming resources.
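For example, a restart can be triggered with gcloud; INSTANCE_NAME is a placeholder:

gcloud sql instances restart INSTANCE_NAME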


Import fails

Import fails when one or more users referenced in the exported SQL dump file do not exist.

The issue might be

Before importing a SQL dump, all the database users who own objects or were granted permissions on objects in the dumped database must exist. If they do not, the restore fails to recreate the objects with the original ownership and/or permissions.

Things to try

Clean up the failed database before retrying the import. Create the database users before importing the SQL dump.


Export

Click the links in the table for details:

For this problem... The issue might be... Try this...
Can't see the operation status. The user interface only shows success or failure. Use these database commands to find out more.
408 Error (Timeout) during export. SQL export can take a long time depending on database size and export content. Use multiple CSV exports to reduce the size of each operation.
CSV export worked but SQL export failed. SQL export is more likely to encounter compatibility issues with Cloud SQL. Use CSV exports to export only what you need.
Export is taking too long. Cloud SQL does not support concurrent synchronous operations. Use export offloading. Learn more.
Create Extension error. The dump file contains references to unsupported extension. Edit the dump file to remove the references.
Error using pg_dumpall. The tool requires superuser role. The superuser role is not supported.
Export operation times out before exporting anything. Query must produce data within first seven minutes. Try a manual export using the pg_dump tool.
Connection closed during the export operation. Query must produce data within first seven minutes. Test the query manually. Learn more.
Unknown error during export. Possible bandwidth issue. Ensure that both the instance and the Cloud Storage bucket are in the same region.
You want to automate exports. Cloud SQL does not provide a way to automate exports. Build your own pipeline to perform this functionality. Learn more.

Can't see the operation status

You can't see the status of an ongoing operation.

The issue might be

The Google Cloud Console reports only success or failure when done, and is not designed to return warnings.

Things to try

Connect to the database and run SHOW WARNINGS.


408 Error (Timeout) during export

You see the error message 408 Error (Timeout) while performing an export job in Cloud SQL.

The issue might be

CSV and SQL formats export differently. The SQL format exports the entire database, and likely takes longer to complete. The CSV format lets you define which elements of the database to include in the export.

Things to try

Use the CSV format, and run multiple, smaller export jobs to reduce the size and length of each operation.


CSV export worked but SQL export failed

CSV export worked but SQL export failed.

The issue might be

CSV and SQL formats export differently. The SQL format exports the entire database, and likely takes longer to complete. The CSV format lets you define which elements of the database to include in the export.

Things to try

Use CSV exports to export only what you need.


Export is taking too long

Export is taking too long, blocking other operations.

The issue might be

Cloud SQL does not support concurrent synchronous operations.

Things to try

Try exporting smaller datasets at a time, or use export offloading so the export doesn't compete with other operations on the instance.
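For example, a sketch of an export with offloading, assuming the --offload flag is available in your gcloud version; the instance, bucket, and database names are placeholders:

gcloud sql export sql INSTANCE_NAME gs://BUCKET_NAME/sqldumpfile.gz --database=DATABASE_NAME --offload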



Error using pg_dumpall

You get an error when trying to use the external pg_dumpall command-line tool.

The issue might be

This tool requires the superuser role.

Things to try

Cloud SQL is a managed service and does not give users the superuser roles or permissions.


Connection reset by peer

The export operation times out before anything is exported. You see the error message Could not receive data from client: Connection reset by peer.

The issue might be

If Cloud Storage does not receive any data within a certain time frame, the connection resets.

Things to try

Do a manual export using the pg_dump tool.
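For example, a minimal pg_dump sketch; the host address, user, and database name are placeholders:

pg_dump --host=INSTANCE_IP_ADDRESS --username=postgres --format=custom --file=DATABASE_NAME.dmp DATABASE_NAME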


Connection closed during the export operation

Connection closed during the export operation.

The issue might be

The connection to Cloud Storage may be timing out because the query running in the export is not producing any data within the first seven minutes after the export is initiated.

Things to try

Test the query manually by connecting from any client and sending the output of your query to STDOUT with the command below:

COPY (INSERT_YOUR_QUERY_HERE) TO STDOUT WITH ( FORMAT csv, DELIMITER ',', ENCODING 'UTF8', QUOTE '"', ESCAPE '"' ).

This is expected behavior: when the export is initiated, the client is expected to start sending data right away. If the connection is kept open with no data sent, it eventually breaks, the export fails, and the operation is left in an uncertain state. This is also what the error message from gcloud indicates:

operation is taking longer than expected.


Unknown error during export

You see the error message Unknown error while trying to export a database to a Cloud Storage bucket.

The issue might be

The transfer might be failing due to a bandwidth issue.

Things to try

The Cloud SQL instance may be located in a different region from the Cloud Storage bucket. Reading and writing data from one continent to another involves a lot of network usage, and can cause intermittent issues like this. Check the regions of your instance and bucket.


Want to automate exports

You want to automate exports.

The issue might be

Cloud SQL does not provide a way to automate exports.

Things to try

You could build your own automated export system using Google Cloud products such as Cloud Scheduler, Pub/Sub, and Cloud Functions.


ERROR_RDBMS system error occurred

You see the error message [ERROR_RDBMS] system error occurred.

The issue might be

  • The user might not have all the Cloud Storage permissions it needs.
  • The database table might not exist.

Things to try

  1. Check that you have at least WRITER permissions on the bucket and READER permissions on the export file. For more information on configuring access control in Cloud Storage, see Create and Manage Access Control Lists.
  2. Ensure the table exists. If the table exists, confirm that you have the correct permissions on the bucket.

Logging

Click the links in the table for details:

For this problem... The issue might be... Try this...
Logging is using a lot of CPU and memory. Logging needs to be tuned. Try tuning logging resource usage.
Audit logs are not found. User authentication. Check user roles and permissions.
Operations information not found in logs. Audit logs are not enabled. Enable audit logging.
Log files are hard to read. You'd rather view the logs as json or text. Use gcloud logging commands.
Query logs are not found in PostgreSQL logs. You need to enable the pgaudit flags. Use these gcloud sql commands.

Logging is using a lot of CPU and memory

Logging is using a lot of CPU and memory.

The issue might be

Logging usage needs to be tuned.

Things to try

The log_statement flag can be set to none and the logging_collector flag can be set to off. If logging is still occurring, there may be other log-related flags that can be tuned. You can edit the instance to modify these flags.


Audit logging

You turned on audit logging for Cloud SQL but are unable to find any audit logs in Cloud Logging.

The issue might be

Data-Access logs are only written if the operation is an authenticated user-driven API call that creates, modifies, or reads user-created data, or if the operation accesses configuration files or metadata of resources.

Things to try

Check the roles and permissions of the user performing the operations.


Operation information not found in logs

You want to find more information about an operation. For example, a user was deleted but you can't find out who did it. The logs show the operation started but do not provide any more information.

The issue might be

You must enable audit logging for detailed and personal identifying information (PII) like this to be logged.

Things to try

Enable audit logging in your project.


Log files are hard to read

You find it hard to read the logs in the Logs Explorer.

The issue might be

You'd rather download the logs locally in either JSON or text format.

Things to try:

You can use the gcloud logging read command along with linux post-processing commands to download the logs.

To download as JSON:

gcloud logging read "resource.type=cloudsql_database AND logName=projects/PROJECT-ID/logs/cloudsql.googleapis.com%2FLOGFILE-NAME" --format json --project=PROJECT-ID --freshness="1d" > downloaded-log.json

To download as TEXT:

gcloud logging read "resource.type=cloudsql_database AND logName=projects/PROJECT-ID/logs/cloudsql.googleapis.com%2FLOGFILE-NAME" --format json --project=PROJECT-ID --freshness="1d" | jq -rnc --stream 'fromstream(1|truncate_stream(inputs)) | .textPayload' > downloaded-log.txt


Query logs are not found

You're not finding query logs in Cloud Logging for PostgreSQL.

The issue might be

You need to enable the pgaudit flags.

Things to try

  1. From a terminal, connect to your database:

    gcloud sql connect $INSTANCE_NAME
    

  2. Once inside the database, run this command to create the extension:

    CREATE EXTENSION pgaudit;
    

  3. Exit the database, and from a terminal run the following command:

    gcloud sql instances patch $INSTANCE_NAME --database-flags cloudsql.enable_pgaudit=on,pgaudit.log=all
    

  4. Finally, restart the Cloud SQL instance.

Managing instances

Click the links in the table for details:

For this problem... The issue might be... Try this...
You want to find out what queries are running now. N/A Try this database query.
You want to find out what units are being used for a specific field. N/A Try this database query.
You want to find the current value of a database setting. N/A Try these database queries.
You want to stop a blocked background process. The user needs to have the pg_signal_backend role. Grant the role and issue these commands.
Instance is nearing 100% consumption of transaction IDs. The autovacuum job might be blocked, or might not be reclaiming the transaction IDs fast enough to keep up with the workload. See these self-service tips.
Temporary storage increased automatic storage. Automatic storage is enabled. Restarting deletes the temporary files but does not reduce the storage. Only customer support can reset the instance size. Learn more.
Data is being automatically deleted. There is a script running somewhere doing this. Try to find the script.
The instance cannot be deleted. There can be more than one root cause. There can be more than one solution.
The instance is stuck due to large temporary data size. Too many temporary tables were created at one time. Restart the instance, and try this mitigation option.
Fatal error during upgrade. There could be many causes. Logs may reveal more. You may need to contact customer support to force a restart.
Instance is stuck on restart after running out of disk space. Automatic storage increase capability is not enabled. Enable automatic storage increase.
On-premises primary instance is stuck. N/A Cloud SQL customer support cannot help with instances that are not in Cloud SQL.
Slow shutdown on restart. Outstanding connections that don't end after 60 seconds can cause unclean shutdowns. Only have connections that last less than 60 seconds.
Access denied for user. User authentication, or possibly expired SSL/TLS certificates. Check user and cert statuses.
A user cannot be deleted. It could be that the user owns objects in the database. You may need to drop or reassign objects.
Cannot assign a private IP address to an existing instance in a Shared VPC. Instance addresses are tied to their projects upon creation. Create a new Cloud SQL instance to replace the existing one.
Particular queries are running slow. Database-specific issues or network latency. Check these suggestions.
Out of memory is indicated but monitoring charts don't show that. Some RAM may be in use by internal overhead processes. Ensure the instance has enough overhead for your workload.
Recovering a deleted instance. All data on an instance, including backups, is permanently lost when an instance is deleted. Prevent data loss by exporting to Cloud Storage.
You want to rename an existing Cloud SQL instance. Renaming an existing instance is not supported. There are a couple of ways to accomplish the goal by creating a new instance.


You want to find out which queries are running now

You want to find out which queries are running right now in PostgreSQL.

The issue might be

N/A

Things to try

Connect to the database and run this query: SELECT datname, usename, application_name as appname, client_addr, state, now() - backend_start as conn_age, now() - xact_start as xact_age, now() - query_start as query_age, now() - state_change as last_activity_age, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY 8 DESC LIMIT 20;


Want to find out what units are being used for a field

You want to find out what units are being used for a particular field.

The issue might be

N/A

Things to try

Connect to the database and run this query (using your own FIELD_NAME): SELECT name, setting, unit FROM pg_settings WHERE name = '[FIELD_NAME]'.


You want to find the current value for a setting

You want to find the current value for a particular setting.

The issue might be

N/A

Things to try

Connect to the database and run this query (using your own SETTING_NAME): SHOW SETTING_NAME; or SHOW ALL; to see all settings.

You want to stop a blocked background process

You want to stop a blocked or stuck background process, like autovacuum, for example.

The issue might be

Your user doesn't have the appropriate role. It needs to have the pg_signal_backend role.

Things to try

  1. Connect to the database and run this command:
      GRANT pg_signal_backend TO USERNAME;
      
  2. Find the process ID of the blocked or stuck process:
      SELECT pid, usename, state, query FROM pg_stat_activity;
      
  3. Stop a running or idle process using one of these commands (pg_cancel_backend cancels the current query; pg_terminate_backend terminates the whole connection):
      SELECT pg_cancel_backend(pid)
      FROM pg_stat_activity
      WHERE usename = 'USERNAME';

      SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE usename = 'USERNAME';

Instance is nearing 100% consumption of transaction IDs

Your internal monitoring warns that the instance is nearing 100% consumption of transaction IDs. You want to avoid transaction wraparound, which can block writes.

The issue might be

The autovacuum job might be blocked, or might not be reclaiming the transaction IDs fast enough to keep up with the workload.

Things to try

To avoid an outage due to transaction ID wraparound, review these self-service tips for dealing with TXID wraparound.

For general tuning advice, see Optimizing, monitoring, and troubleshooting vacuum operations in PostgreSQL.
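For example, a minimal query sketch for checking how close each database is to wraparound (the age is measured in transactions):

SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;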


Temporary storage increased automatic storage

Temporary tables increased storage usage, and automatic storage was increased.

The issue might be

Automatic storage is enabled.

Things to try

Restarting to delete temporary tables does not reduce the automatically increased storage size.


Data is being automatically deleted

You notice that data is being automatically deleted at a regular interval.

The issue might be

Most likely a script is running somewhere in your environment.

Things to try

Look in the logs around the time of the deletion and see if there is a rogue script running from a dashboard or another automated process.


The instance cannot be deleted

You see the error message ERROR: (gcloud.sql.instances.delete) HTTP Error 409: The instance or operation is not in an appropriate state to handle the request, or the instance may have a INSTANCE_RISKY_FLAG_CONFIG flag status.

The issue might be

  1. Another operation is in progress.
  2. The INSTANCE_RISKY_FLAG_CONFIG warning is triggered whenever at least one beta flag is being used.

Things to try

  1. Cloud SQL operations do not run concurrently. Wait for the other operation to complete.
  2. Remove the risky flag settings and restart the instance.

The system is stuck due to large temporary data size

The system is stuck due to large temporary data size.

The issue might be

The system can create many temporary tables at one time, depending on the queries and the load.

Things to try

Unfortunately, you cannot shrink the ibtmp1 file by any method other than restarting the service.

One mitigation option is to create the temporary table with ROW_FORMAT=COMPRESSED, so it is stored in file-per-table tablespaces in the temporary file directory. However, the downside is performance costs associated with creating and removing a file-per-table tablespace for each temporary table.


Fatal error during upgrade

You see the error message ERROR_INTERNAL_FATAL when upgrading resources on instance.

The issue might be

There could be many causes.

Things to try

Logs may reveal more, but in any case customer support may be needed to force re-create the instance.


Instance is stuck on restart after running out of disk space

The instance is stuck on restart after running out of disk space.

The issue might be

Automatic storage increase capability is not enabled.

Things to try

If your instance runs out of storage, and the automatic storage increase capability is not enabled, your instance goes offline. To avoid this issue, you can edit the instance to enable automatic storage increase.


On-premises primary instance is stuck

You want to know if Cloud SQL customer support can help when an on-premises primary instance is stuck.

The issue might be

The instance is not in Cloud SQL.

Things to try

Cloud SQL customer support cannot help with instances that are not in Cloud SQL.


Slow shutdown on restart

Slow shutdown on restart.

The issue might be

When an instance shuts down, any outstanding connections that don't end within 60 seconds make the shutdown unclean.

Things to try

By only having connections that last less than 60 seconds, most unclean shutdowns can be avoided, including connections from the database command prompt. If you keep these connections open for hours or days, shutdowns can be unclean.


Access denied for user

You see the error message Access denied for user 'XXX'@'XXX' (using password: XXX).

The issue might be

There could be several root causes, including:

  • The username (or password) is incorrect.
  • The user is connecting from something other than @XXX.
  • The user doesn't have the correct privileges for the database to which they are trying to connect.

Things to try

  • Verify the username and corresponding password.
  • Check the origin of the connection to see whether it matches where the user has been granted access privileges.
  • Check the user's grant privileges in the database.

Cannot delete user

You cannot delete a database user.

The issue might be

The user has objects in the database that depend on it. You first need to drop those objects or reassign them to another user.

Things to try

Find out which objects are dependent on the user, then drop or reassign those objects to a different user. This thread on Stack Exchange discusses how to find the objects owned by the user.
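For example, a minimal sketch of handing the user's objects to another owner before dropping the role; USERNAME is a placeholder and postgres is used as the new owner for illustration. REASSIGN OWNED and DROP OWNED only affect the current database, so repeat them in every database where the user owns objects:

-- Reassign ownership of objects you want to keep:
REASSIGN OWNED BY USERNAME TO postgres;
-- Drop whatever the user still owns, plus its privileges, then drop the role:
DROP OWNED BY USERNAME;
DROP ROLE USERNAME;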


Cannot assign a private IP address to an existing instance in a Shared VPC

You cannot assign a private IP address to an existing instance in a Shared VPC.

The issue might be

When a Cloud SQL instance is created, it is automatically attached to a tenant project, as are all the other Cloud SQL instances in the same project. However, when an instance is created using private IP in a Shared VPC, it is attached to the tenant project associated with the Shared VPC host project.

Things to try

You can create a new Cloud SQL instance to replace the existing one.


Particular queries are running slow

CPU usage is consistently high.

The issue might be

Queries can be slow for many reasons, mostly due to specific database aspects. One reason that can involve Cloud SQL is network latency, when the source (writer or reader) resource and the destination (Cloud SQL) resource are in different regions.

Things to try

Refer to general performance tips, in particular:

For slow database inserts, updates, or deletes, consider the following actions:

  • Check the locations of the writer and database; sending data a long distance introduces latency.
  • Check the location of the reader and database; latency affects read performance even more than write performance.
To reduce latency, locate both the source and destination resources in the same region.


Out of memory is indicated but monitoring charts don't show that

An instance fails and reports Out of memory, but the Console or Cloud Monitoring charts seem to show there is still memory remaining.

The issue might be

There are other factors beside your workload that can impact memory usage, such as the number of active connections and internal overhead processes. These are not always reflected in the monitoring charts.

Things to try

Ensure that the instance has enough memory for your workload, plus some additional overhead.


Recovering a deleted instance

Whether an instance is deleted on purpose or accidentally, you cannot undelete it.

The issue might be

All data on an instance, including backups, is permanently lost when that instance is deleted.

Things to try

To preserve your data, export it to Cloud Storage before you delete an instance.

The Cloud SQL Admin role includes the permission to delete the instance. To prevent accidental deletion, grant this role only as needed.


You want to rename an existing Cloud SQL instance

For business or other reasons, you want to rename an existing Cloud SQL instance.

The issue might be

Renaming an instance is not supported either in the Google Cloud Console or through an API.

Things to try

There are other ways to accomplish the goal by creating a new instance.

  1. You can clone the instance you want to rename and set a new name for the cloned instance. This allows you to create the new instance without having to import data manually. Just as when creating a new instance, the cloned instance has a new IP address.

  2. You can export data from your instance into a Cloud Storage bucket, create a new instance with the new name you want, and then import the data into the new instance.

In both cases, you can delete your old instance after the operation is done. We recommend going with the cloning route since it has no impact on performance and doesn't require you to redo any instance configuration settings (such as flags, machine type, storage size and memory).

Replication

Click the links in the table for details:

For this problem... The issue might be... Try this...
Read replica did not start replicating on creation. There could be many root causes. Check the logs to find more information.
Unable to create read replica - invalidFlagValue error. One of the flags provided explicitly or by default is invalid. Check flag values and logs to find more information.
Unable to create read replica - unknown error. There could be many root causes. Check the logs to find more information.
Disk is full. The primary instance disk size can become full during replica creation. Upgrade the primary instance to a larger disk size.
Replica instance is using too much memory. Replicas can cache often-requested read operations. Restart the replica instance to reclaim the temporary memory space.
Replication stopped. Max storage space was reached and automatic storage increase is not enabled. Enable automatic storage increase.
Replication lag is consistently high. There can be many different root causes. Here are a few things to try.

Read replica did not start replicating on creation

The read replica did not start replicating on creation.

The issue might be

There is probably a more specific error in the log files.

Things to try

Inspect the logs in Cloud Logging to find the actual error.


Unable to create read replica - invalidFlagValue error

Unable to create read replica - invalidFlagValue

The issue might be

One of the flags in the request is invalid. It could be a flag you provided explicitly or one that was set to a default value.

Things to try

First, check that the value of the max_connections flag is greater than or equal to the value on the primary.

If the max_connections flag is set appropriately, inspect the logs in Cloud Logging to find the actual error.


Unable to create read replica - unknown error

Unable to create read replica - unknown error.

The issue might be

There is probably a more specific error in the log files.

Things to try

Inspect the logs in Cloud Logging to find the actual error.

If the error is: set Service Networking service account as servicenetworking.serviceAgent role on consumer project, then disable and re-enable the Service Networking API. This action creates the service account necessary to continue with the process.


Disk is full

error: disk is full

The issue might be

The primary instance disk size can become full during replica creation.

Things to try

Edit the primary instance to upgrade it to a larger disk size.


Replica instance is using too much memory

The replica instance is using too much memory.

The issue might be

The replica uses temporary memory to cache often-requested read operations, which can lead it to use more memory than the primary instance.

Things to try

Restart the replica instance to reclaim the temporary memory space.


Replication stopped

Replication stopped.

The issue might be

The maximum storage limit was reached and automatic storage increase is disabled.

Things to try

Edit the instance to enable automatic storage increase.


Replication lag is consistently high

Replication lag is consistently high.

The issue might be

The write load is too high for the replica to handle. Replication lag takes place when the SQL thread on a replica is unable to keep up with the IO thread. Some kinds of queries or workloads can cause temporary or permanent high replication lag for a given schema. Some of the typical causes of replication lag are:

  • Slow queries on the replica. Find and fix them.
  • All tables must have a unique or primary key. Every update on a table without a unique or primary key causes full table scans on the replica.
  • Queries like DELETE ... WHERE field < 50000000 cause replication lag with row-based replication since a huge number of updates are piled up on the replica.

Things to try

Some possible solutions:

  • Edit the instance to increase the size of the replica.
  • Reduce the load on the database.
  • Index the tables.
  • Identify and fix slow queries.
  • Recreate the replica.