In this tutorial, you create a demonstration database and run an application workload. Then, you configure the archive and backup processes. Next, you learn how to verify the backup, archive, and recovery processes. Finally, you learn how to recover the database to a specific point in time.
This tutorial is intended for database administrators, system operators, DevOps professionals, and cloud architects interested in configuring a backup and recovery strategy for PostgreSQL databases.
This tutorial assumes that you are familiar with Docker containers and that you are comfortable with Linux commands, PostgreSQL database engines, and Compute Engine.
Objectives
- Set up a backup and archiving process.
- Perform a PITR.
- Monitor your backup.
- Verify a recovery.
Costs
In this document, you use the following billable components of Google Cloud: Compute Engine and Cloud Storage.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Compute Engine and Cloud Storage APIs.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- In the Google Cloud console, activate Cloud Shell.
  At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Concepts
Before you start the tutorial, review the following PostgreSQL concepts:
- Continuous archiving. The process of continuously saving sequential transactions to a file as they occur.
- Write-Ahead Log (WAL). Changes to data files are recorded in the WAL before they are applied to the data files themselves.
- WAL Record. Each transaction applied to the database is formatted and stored as a WAL record.
- Segment files. Segment files have monotonically increasing filenames and contain as many WAL records as possible. The file size is configurable, with a default of 16 MiB. You might choose a larger size if you expect transactions that are large in size or high in count, to lower the aggregate number of generated segment files and to decrease the file-management burden.
For more information, see Reliability and the Write-Ahead Log.
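The segment size can only be set when a database cluster is initialized. As an illustration only (not a step in this tutorial, and the data directory path is a placeholder), the following sketch shows how a PostgreSQL 11 or later cluster could be initialized with 64 MiB WAL segments:
# Illustration only: create a new cluster with 64 MiB WAL segment files.
# The --wal-segsize flag is available in PostgreSQL 11 and later; the data
# directory path here is a placeholder, not one used by this tutorial.
initdb --wal-segsize=64 -D /path/to/new/data/directory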
The following diagram shows how WALs are persisted in two stages.
In the preceding diagram, the first stage of persisting WALs consists of the database engine recording write transactions in the WAL buffer concurrently with the write to a table. When the transaction is committed, the WAL buffer is written (flushed) to disk during the second stage with an append to the WAL segment file.
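If you want to observe this behavior on a running PostgreSQL 10 or later server, the following optional queries (not required for this tutorial) report the current WAL insert and flush positions, and the segment file that the current position maps to:
-- Optional illustration: run these statements in psql on any PostgreSQL 10+ server.
-- Position of the most recent WAL insert in the in-memory WAL buffer.
SELECT pg_current_wal_insert_lsn();
-- Position that has been durably flushed to the WAL segment files on disk.
SELECT pg_current_wal_flush_lsn();
-- Name of the segment file that contains the current WAL write position.
SELECT pg_walfile_name(pg_current_wal_lsn());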
Choosing a PITR
A PITR is appropriate for the following scenarios:
- Minimize the recovery point objective (RPO). The RPO is the maximum time of data loss that is tolerated before it significantly impacts business processes. Saving all transactions in the WALs between backup snapshots drastically decreases the amount of lost data because you have the transactions since the last full backup to apply to the database.
- Minimize the recovery time objective (RTO). The RTO is the amount of time required to recover a database if a destructive event occurs. After you set up binary backups and log archiving, the time required to recover the database can be minimal.
- Remediation for a data corruption bug, or an administrative misstep. If a code release causes catastrophic data corruption, or an unrecoverable mistake is made during routine maintenance, you can recover to before that moment.
In some application architectures, such as a microservices architecture, there might be parallel databases that could require independent recoveries. For example, a retail application might have customer data in one database and retail order details and inventory information in other databases. Depending on the overall state of data, one, two, or all databases might need to be recovered in parallel.
A PITR is not appropriate in the following scenarios:
- RPO is large. If your disaster recovery policy can tolerate the loss of any transactions received after the most recent snapshot, you can avoid additional steps and focus on reducing the time to recover your data.
- A complete database recovery is needed. If your goal is to recover to the most recent transaction, your recovery target is the timestamp of the last persisted transaction. This scenario is a specific case of PITR but semantically this goal is referred to as a full recovery.
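For comparison, a full recovery in PostgreSQL 11 uses the same mechanics as a PITR but omits the recovery target so that all archived WAL is replayed. The following recovery.conf sketch is an illustration only; it reuses the archive path that appears later in this tutorial and is not used by the tutorial scripts:
# Illustration only: recover to the end of the archived WAL (full recovery).
# Omitting recovery_target_time causes PostgreSQL to replay every available
# WAL segment that restore_command returns.
restore_command = '(test -d /var/lib/postgresql/wal/pg_wal_recover && cp /var/lib/postgresql/wal/pg_wal_recover/%f %p )'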
Performance considerations
The archiving process puts additional I/O load on your database server. The additional load depends on the characteristics of your workload because it is proportional to the volume of write, update, and delete transactions.
If you want to reduce the I/O impact that the WAL archive activity might incur on your primary database, you can perform the periodic WAL archives using a read-only replica.
This configuration isolates the primary database from the batch-oriented I/O activities related to the transfer of the WAL files. Transactions destined for the read-only replica are transmitted in a constant stream from the primary database, which has a much lower impact on steady-state throughput.
Further, if your production database topology already includes a read-only replica, this configuration adds no additional burden in management, price, or otherwise.
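If you adopt the replica-based approach, one way to configure it in PostgreSQL 9.5 and later is to set archive_mode to always on the standby so that the standby, rather than the primary, runs the archive command. The following postgresql.conf sketch reuses the archive command from this tutorial and is an illustration, not a step you perform here:
# Illustration only: on the read-only replica, archive the WAL received from
# the primary. archive_mode = always enables archiving on a standby server.
archive_mode = always
archive_command = '( ARCHIVE_PATH=/var/lib/postgresql/wal/pg_wal_archive; test ! -f $ARCHIVE_PATH/%f && cp %p $ARCHIVE_PATH/%f.cp && mv $ARCHIVE_PATH/%f.cp $ARCHIVE_PATH/%f ) '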
Reference architecture
The following diagram illustrates the architecture that you implement in this tutorial.
In this tutorial, you create cloud infrastructure to observe a PITR that uses the following components:
- A PostgreSQL database server running on Compute Engine.
- Cloud Storage for storage of snapshots and transaction logs.
The following diagram shows the two Docker containers that are launched on the PostgreSQL database virtual machine (VM). As a separation of concerns, the database server is running in one of the containers, and the WAL archiver is running in the other container.
This diagram shows how the Docker volumes in each container are mapped to Persistent Disk mount points on the host VM.
Setting up environment variables
The scripts and commands used in this tutorial rely on shell environment variables.
In Cloud Shell, set environment variables for your project, the instance name, and the demonstration PostgreSQL database.
export PROJECT_ID=your-gcp-project
export PG_INSTANCE_NAME=instance-pg-pitr
export POSTGRES_PASSWORD=PasswordIsThis
export POSTGRES_PITR_DEMO_DBNAME=pitr_demo
Replace the following:
- your-gcp-project: the name of the project that you created for this tutorial.
- PasswordIsThis: a secure password for the PostgreSQL database.
Set the environment variable for the Google Cloud zone. Replace choose-an-appropriate-zone with a Google Cloud zone.
export ZONE=choose-an-appropriate-zone
export REGION=${ZONE%-[a-z]}
Set the environment variable for the default Virtual Private Cloud (VPC) subnet for the region of your zone:
export SUBNETWORK_URI=$(gcloud compute networks subnets \
    describe default --format=json --region=$REGION | \
    jq --raw-output '.ipCidrRange')
Set the environment variable for the Cloud Storage bucket. Replace archive-bucket with a unique name for the Cloud Storage bucket where WALs are saved.
export ARCHIVE_BUCKET=archive-bucket
Creating a Cloud Storage bucket
Create a Cloud Storage bucket to archive the WAL files from the PostgreSQL database:
gcloud storage buckets create gs://${ARCHIVE_BUCKET}
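Optionally, confirm that the bucket was created. This check is a suggestion and isn't required by the tutorial:
# Optional: show the bucket's metadata, including its location.
gcloud storage buckets describe gs://${ARCHIVE_BUCKET}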
Allowing access to instances with private IP addresses
For the instances used in this tutorial, as in many production use cases, there is no need for the VM instances to obtain public IP addresses. However, the instances require access to the public internet to pull the example container images, and you require access in order to connect by using a secure shell. You configure a network address translation (NAT) gateway and configure Identity-Aware Proxy (IAP) for TCP forwarding.
Create a NAT gateway
Because the VM instances that you create don't have public IP addresses, you create a NAT gateway so that the instances can pull container images from the Docker Hub.
In Cloud Shell, create a Cloud Router:
export CLOUD_ROUTER_NAME=${PROJECT_ID}-nat-router
gcloud compute routers create $CLOUD_ROUTER_NAME \
    --network=default --region=$REGION
Create the NAT gateway:
gcloud compute routers nats create ${PROJECT_ID}-nat-gateway \
    --region=$REGION \
    --router=$CLOUD_ROUTER_NAME \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
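Optionally, confirm that the NAT gateway is attached to the Cloud Router. This check is a suggestion and isn't part of the required steps:
# Optional: describe the NAT gateway configuration on the Cloud Router.
gcloud compute routers nats describe ${PROJECT_ID}-nat-gateway \
    --router=$CLOUD_ROUTER_NAME \
    --region=$REGION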
Configure IAP for TCP forwarding
IAP controls access to your cloud applications and VMs running on Google Cloud. IAP verifies the user identity and context of the request to determine whether a user is allowed to access a VM.
In Cloud Shell, allow traffic from the TCP forwarding net block to the instances in your project:
export IAP_FORWARDING_CIDR=35.235.240.0/20
gcloud compute --project=$PROJECT_ID firewall-rules create \
    cloud-iap-tcp-forwarding --direction=INGRESS \
    --priority=1000 --network=default \
    --action=ALLOW --rules=all \
    --source-ranges=$IAP_FORWARDING_CIDR
To connect by using a TCP forwarding tunnel, add an Identity and Access Management (IAM) policy binding. Replace your-email-address with the email address that you use to log in to the Google Cloud console.
export GRANT_EMAIL_ADDRESS=your-email-address
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=user:$GRANT_EMAIL_ADDRESS \
    --role=roles/iap.tunnelResourceAccessor
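Optionally, verify that the binding was added by filtering the project's IAM policy for the IAP tunnel role. This command is a suggested check, not a required step:
# Optional: list the members that hold the IAP tunnel role on the project.
gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.role:roles/iap.tunnelResourceAccessor" \
    --format="table(bindings.members)"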
Creating the PostgreSQL database infrastructure
In Cloud Shell, clone the source repository that contains the configuration scripts, and change the shell context to the local repository:
git clone https://github.com/GoogleCloudPlatform/gcs-postgresql-recovery-tutorial
cd gcs-postgresql-recovery-tutorial
To create and configure the database VM instance, run the following script:
cd bin
./create_postgres_instance.sh
For this tutorial, this script starts a VM instance in your chosen zone that runs Container-Optimized OS and has two new attached persistent disks. In this case, you can ignore the warning message that the API returns about poor I/O performance, because the script creates small persistent disks.
Reviewing the cloud-init configuration
Cloud-init is a multi-distribution package that initializes a cloud instance.
Review the cloud-init configuration that is used to create the database instance.
For this tutorial, cloud-init is used to do the following:
- Create two Persistent Disk block storage devices.
- Create the file systems on the two devices: one for the data and one for the archive logs.
- Mount the devices at logical mount points on the VM instance which are shared with the Docker containers.
- Create and then start a systemd service (postgres.service), which starts a PostgreSQL Docker container with the following:
  - The persistent disks mounted as volumes.
  - The PostgreSQL port (5432) published to the VM host.
- Create a /var/tmp/docker-entrypoint-initdb.d/init-pitr-demo-db.sql file to create a simple set of tables in a demonstration database and schema.
- Create and start a second systemd service (wal_archive.service) that runs a Google Cloud CLI Docker container with the WAL disks mounted as a volume. This service backs up archived WAL files to Cloud Storage.
- Create, enable, and then start a systemd timer (wal_archive.timer) that periodically runs the wal_archive.service.
- Ensure that the PostgreSQL port (5432) is open for the VPC subnet so that the transaction generator can reach the database port.
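If you want to confirm that cloud-init created and started these units, you can inspect them from an SSH session on the VM after it boots. The following commands are a suggested check based on the unit names listed above; they aren't part of the tutorial's required steps:
# Optional: show the state of the database container service, the WAL archive
# service, and the timer that triggers it.
sudo systemctl status postgres.service wal_archive.service wal_archive.timer
# Optional: review recent log output from the database container service.
sudo journalctl -u postgres.service --no-pager | tail -n 20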
Modify the database instance configuration
The database server is running, but you need to configure network access and start the WAL archiving process.
Connect to the database VM instance
In the Google Cloud console, go to the VM instances page.
To open a terminal shell, next to the instance-pg-pitr instance that you created, click SSH.
In the terminal shell, check that the Docker container started:
docker ps
The output is similar to the following:
CONTAINER ID   IMAGE                COMMAND                  CREATED              STATUS              PORTS   NAMES
8bb65d8c1197   postgres:11-alpine   "docker-entrypoint.s…"   About a minute ago   Up About a minute           postgres-db
If the container is not yet running, wait a moment, and then use the same command to check again.
Allow inbound network connections to database
In the terminal shell of the instance-pg-pitr instance, open the PostgreSQL host-based authentication configuration file for editing:
sudoedit /mnt/disks/data/pgdata/pg_hba.conf
To remove the default access to the database from all IP addresses, comment out the following line from the end of the file by adding # at the beginning of the line. After you comment it out, the line looks similar to the following:
#host all all all md5
To allow password-protected connections from hosts in the 10.0.0.0/8 CIDR block, add the following line to the end of the file:
host all all 10.0.0.0/8 md5
This entry enables connectivity from the VPC subnet where the transaction generator is later created.
Save and then close the file.
Configure WAL archiving
In the terminal shell of the instance-pg-pitr instance, edit the postgresql.conf file:
sudoedit /mnt/disks/data/pgdata/postgresql.conf
Replace the existing commented-out archive_mode, archive_command, and archive_timeout lines with the following:
archive_mode=on
archive_command = '( ARCHIVE_PATH=/var/lib/postgresql/wal/pg_wal_archive; test ! -f $ARCHIVE_PATH/%f && cp %p $ARCHIVE_PATH/%f.cp && mv $ARCHIVE_PATH/%f.cp $ARCHIVE_PATH/%f ) '
archive_timeout = 120
Save and then close the file.
Apply and verify the configuration changes
In the terminal shell of the instance-pg-pitr instance, restart the container to apply the changes:
sudo systemctl restart postgres
Check for the WAL segment files:
sudo ls -l /mnt/disks/wal/pg_wal
The output is similar to the following:
total 16388
-rw------- 1 postgres 70 16777216 Sep  5 23:07 000000010000000000000001
drwx------ 2 postgres 70     4096 Sep  5 23:05 archive_status
Check for network connectivity to the database:
export LOCAL_IP=127.0.0.1
docker exec postgres-db psql -w --host=$LOCAL_IP \
    --command='SELECT 1'
The output is similar to the following:
 ?column?
----------
        1
(1 row)
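Optionally, you can also ask PostgreSQL how archiving is progressing. The pg_stat_archiver view reports the number of segment files archived so far and the most recently archived file; this check is a suggestion and isn't required:
# Optional: report WAL archiving progress from the running container.
docker exec postgres-db psql -w --host=$LOCAL_IP \
    --command='SELECT archived_count, last_archived_wal, last_archived_time FROM pg_stat_archiver;'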
Close the SSH connection to the instance.
Starting the transaction generator to populate database
The following steps start a Go program that generates transactions for this tutorial. The program runs inside a container on a VM instance.
The image for the container is already built and hosted in a project with a public Container Registry.
In Cloud Shell, change to the transaction generator directory:
cd ~/gcs-postgresql-recovery-tutorial/bin
Set the environment variables:
export TRANS_GEN_INSTANCE_NAME=instance-trans-gen
export POSTGRES_HOST_IP=$(gcloud compute instances describe \
    --format=json --zone=${ZONE} ${PG_INSTANCE_NAME} | \
    jq --raw-output '.networkInterfaces[0].networkIP')
To run the transaction generator, start the instance:
./run_trans_gen_instance.sh
Ignore the warning message about poor I/O performance.
Wait a few moments, and check that transactions are reaching the PostgreSQL database:
gcloud compute ssh $PG_INSTANCE_NAME \
    --tunnel-through-iap \
    --zone=$ZONE \
    --command="docker exec postgres-db psql \
    --dbname=$POSTGRES_PITR_DEMO_DBNAME \
    --command='SELECT COUNT(*) FROM pitr_db_schema.customer;'"
The output contains a count greater than 0 when records are added to the database by the transaction generator:
 count
-------
   413
(1 row)
Configuring the binary snapshot backup schedule
You can back up persistent disks according to a schedule and retain the snapshots for a period that is defined in the resource policy.
Create the snapshot schedule
In Cloud Shell, set environment variables:
export ZONE=zone-of-your-instance
export SNAPSHOT_SCHEDULE_NAME=pg-disk-schedule
export REGION=${ZONE%-[a-z]}
export SNAPSHOT_WINDOW_START=$(TZ=":GMT" date "+%H:00")
export SNAPSHOT_RETENTION_DAYS=2
export SNAPSHOT_FREQUENCY_HOURS=1
Replace zone-of-your-instance with the Google Cloud zone where you previously started the database VM.
Create the snapshot schedule:
gcloud compute resource-policies create snapshot-schedule \
    $SNAPSHOT_SCHEDULE_NAME \
    --region=$REGION \
    --max-retention-days=$SNAPSHOT_RETENTION_DAYS \
    --on-source-disk-delete=apply-retention-policy \
    --hourly-schedule=$SNAPSHOT_FREQUENCY_HOURS \
    --start-time=$SNAPSHOT_WINDOW_START \
    --storage-location=$REGION
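Optionally, confirm the schedule by describing the resource policy that you created. This is a suggested check, not a required step:
# Optional: review the snapshot schedule, retention, and start time.
gcloud compute resource-policies describe $SNAPSHOT_SCHEDULE_NAME \
    --region=$REGION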
Attach the snapshot schedule to the disks
When you ran the script to create an instance, the data and WAL volumes were created as two independent persistent disks. To create persistent disk snapshots according to a defined schedule, you associate a resource policy with each persistent disk. In this case, you want the disk snapshots to occur concurrently, so you use the same policy for both persistent disks attached to the Compute Engine VM.
In Cloud Shell, set environment variables:
export SNAPSHOT_SCHEDULE_NAME=pg-disk-schedule
export PG_INSTANCE_NAME=instance-pg-pitr
export ZONE=zone-of-your-instance
Attach the schedule policy to the persistent data disk:
gcloud beta compute disks add-resource-policies ${PG_INSTANCE_NAME}-data \
    --resource-policies $SNAPSHOT_SCHEDULE_NAME \
    --zone $ZONE
Attach the schedule policy to the persistent WAL disk:
gcloud beta compute disks add-resource-policies ${PG_INSTANCE_NAME}-wal \
    --resource-policies $SNAPSHOT_SCHEDULE_NAME \
    --zone $ZONE
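To confirm that the same policy is now attached to both disks, you can optionally inspect the resourcePolicies field of each disk. This check is a suggestion, not a required step:
# Optional: show the resource policies attached to each persistent disk.
gcloud compute disks describe ${PG_INSTANCE_NAME}-data \
    --zone=$ZONE --format="value(resourcePolicies)"
gcloud compute disks describe ${PG_INSTANCE_NAME}-wal \
    --zone=$ZONE --format="value(resourcePolicies)"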
Manually run a snapshot
(Optional) Scheduled snapshots happen within the schedule window, so it's unlikely that a snapshot is taken immediately after you create the schedule. If you don't want to wait for the scheduled snapshot, you can manually run the initial snapshot.
In Cloud Shell, set environment variables:
export ZONE=zone-of-your-instance
export PG_INSTANCE_NAME=instance-pg-pitr
export REGION=${ZONE%-[a-z]}
Create a snapshot of the two PostgreSQL instance persistent disks:
gcloud compute disks snapshot \
    ${PG_INSTANCE_NAME}-data ${PG_INSTANCE_NAME}-wal \
    --snapshot-names=${PG_INSTANCE_NAME}-data-`date +%s`,${PG_INSTANCE_NAME}-wal-`date +%s` \
    --zone=$ZONE --storage-location=$REGION
View the snapshots that you created:
gcloud compute snapshots list
The output is similar to the following:
NAME                               DISK_SIZE_GB  SRC_DISK                                   STATUS
instance-pg-pitr-data-1578339767   200           us-central1-f/disks/instance-pg-pitr-data  READY
instance-pg-pitr-wal-1578339767    100           us-central1-f/disks/instance-pg-pitr-wal   READY
Performing a PITR
A PITR is often performed to recover data that was lost due to an operational or programmatic misstep.
In this section of the tutorial, you perform a database update that simulates a catastrophic loss of data. You then perform a recovery to the moment before the erroneous command was issued.
Ensure that a PITR can be performed
Before performing a PITR, you have to allow sufficient time for the following to run:
- The binary backups (disk snapshots)
- WAL archiving
For this tutorial, the archive_timeout
was set to an atypically low 120
seconds to force frequent WAL file rotation. You also have to wait until at
least one scheduled disk snapshot is performed, or you need to
snapshot the disk manually.
Check that at least one snapshot was taken:
In the Google Cloud console, go to the Snapshots page.
Verify that there are at least two snapshots: one for the data volume and one for the WAL volume, for example, instance-pg-pitr--us-central1-a-20190805023535-i3hpw7kn.
Check that the segment files are archived to Cloud Storage:
In the Google Cloud console, go to the Cloud Storage Browser page.
Click archive-bucket.
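If you prefer to check from Cloud Shell instead of the console, you can list the bucket's contents with the Google Cloud CLI. This alternative check assumes that the ARCHIVE_BUCKET variable is still set in your session:
# Optional: list the archived WAL segment files in the bucket.
gcloud storage ls --recursive gs://${ARCHIVE_BUCKET}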
Damage the data
To simulate a catastrophic loss of data, open a command-line shell to the PostgreSQL database, and damage the data in the table populated by the transaction generator.
In the Google Cloud console, go to the VM instances page.
For the instance-pg-pitr instance, click SSH.
In the SSH terminal, run the PostgreSQL terminal-based frontend in the Docker container:
docker exec -it postgres-db psql --dbname=pitr_demo
To modify a row in the customer table, submit a SQL DML statement with an intentional typo in the PostgreSQL shell:
UPDATE pitr_db_schema.customer
SET name = 'New Name for customer id=1';
WHERE id = 1;
The output is similar to the following:
UPDATE 999
pitr_demo=# WHERE id = 1;
ERROR:  syntax error at or near "WHERE"
LINE 1: WHERE id = 1;
        ^
The error was generated because an extra semicolon was inserted before the WHERE clause. All rows in the table were updated. You can now perform a PITR to recover the rows that the incorrect statement modified.
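Before you start the recovery, you can optionally confirm the extent of the damage from the same psql session. The following query is a suggested check; it counts how many rows now carry the erroneous name:
-- Optional: count the rows affected by the erroneous UPDATE statement.
SELECT COUNT(*)
  FROM pitr_db_schema.customer
  WHERE name = 'New Name for customer id=1';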
Determine target time for the recovery
The first step in a PITR is to determine the target time for the recovery. This time is determined by examining your data to identify a point slightly before the data-damaging event.
In the terminal shell of the instance-pg-pitr instance, obtain the maximum timestamp of the damaged rows:
SELECT MAX(create_timestamp)::timestamptz
  FROM pitr_db_schema.customer
  WHERE name = 'New Name for customer id=1';
The output is similar to the following:
              max
-------------------------------
 2019-08-05 18:14:58.226511+00
(1 row)
In a production database, the query to determine the recovery target is more complex, especially in cases where the affected table is large, and the indicative column is unindexed.
Copy the result; you use the value returned by this query in the next step.
Recover the database
For this tutorial, a recovery script automates the PITR. We recommend that you have an automated process to recover the database and that you periodically test this process.
In Cloud Shell, change the current working directory to the location of the recovery script:
cd ~/gcs-postgresql-recovery-tutorial/bin
Set the required environment variables for the script. Replace YYYY-MM-DD HH:MM:SS.999999+00 with the query output that you previously copied.
export PROJECT_ID=$(gcloud config get-value project)
export PG_INSTANCE_NAME=instance-pg-pitr
export POSTGRES_PASSWORD=PasswordIsThis
export RECOVER_BUCKET=archive-bucket
export PIT_RECOVERY_TARGET="YYYY-MM-DD HH:MM:SS.999999+00"
export ZONE=zone-of-your-instance
Run the recovery script:
./recover_to_point_in_time.sh
Understand the recovery script
This section provides some details about the input parameters and the steps taken by the script.
The script requires that the following environment variables are set:
- PIT_RECOVERY_TARGET: the recovery target time.
- PROJECT_ID: the project where the PG_INSTANCE_NAME instance is located.
- ZONE: the zone where the PG_INSTANCE_NAME instance is located.
- PG_INSTANCE_NAME: the instance where the production PostgreSQL instance is running.
- RECOVER_BUCKET: the Cloud Storage bucket where WAL segment files are archived.
- POSTGRES_PASSWORD: the password used for the PostgreSQL database user.
The script performs the following steps:
- Determines the most recent disk snapshots based on the recovery target date and time.
- Creates a cloud-init.yaml file that is provided to a container-optimized storage VM that runs the PITR database. The cloud-init.yaml file creates configuration files and runs several system commands to establish the following environment:
  - A gcsfuse container that mounts the WAL segment file archive bucket as a volume, which is then exposed to the host with a Docker bind mount.
  - A postgres-db container where the database engine runs, with the following mounted as volumes:
    - The host file system directories where the persistent disks are attached.
    - The host file system directory where the Cloud Storage bucket is attached.
  - A recovery.conf file in the PostgreSQL data directory with the following information:
    - The target date.
    - The restore command: a parameterized copy command that the database uses to copy WAL segment files as needed from the archive file system. %f is the segment file and %p is the path that the database uses for processing files during the recovery.
  - Commented-out archive_ settings in the postgresql.conf settings file to avoid corrupting the WAL archive directory.
- Starts the PITR instance with the following information:
  - A name created by combining the $PG_INSTANCE_NAME environment variable and the alphanumeric values from the $PIT_RECOVERY_TARGET environment variable.
  - Persistent disks created from the previously identified disk snapshots.
The following is an example recovery.conf file:
restore_command = '(test -d /var/lib/postgresql/wal/pg_wal_recover && cp /var/lib/postgresql/wal/pg_wal_recover/%f %p ) '
recovery_target_time='YYYY-MM-DD HH:MM:SS UTC'
recovery_target_inclusive=true
Validate the recovery
In the Google Cloud console, go to the VM instances page.
For the instance-pg-pitr-YYYYMMDDHHMMSS instance, click SSH.
In the SSH terminal, run the PostgreSQL terminal-based frontend in the Docker container:
docker exec -it postgres-db psql --dbname=pitr_demo
If you get the following error, wait a moment for the PostgreSQL container to start, and rerun the command:
Error: No such container: postgres-db
Check the data in the customer table:
SELECT * FROM pitr_db_schema.customer
  WHERE id > (SELECT MAX(id)-10 FROM pitr_db_schema.customer);
The output is similar to the following:
  id  |           name            |      create_timestamp
------+---------------------------+----------------------------
  711 | customer_name_from_golang | 2019-12-06 18:03:51.229444
  712 | customer_name_from_golang | 2019-12-06 18:03:52.531755
  713 | customer_name_from_golang | 2019-12-06 18:03:53.555441
  714 | customer_name_from_golang | 2019-12-06 18:03:54.581872
  715 | customer_name_from_golang | 2019-12-06 18:03:55.607459
  716 | customer_name_from_golang | 2019-12-06 18:03:56.633362
  717 | customer_name_from_golang | 2019-12-06 18:03:57.658523
  718 | customer_name_from_golang | 2019-12-06 18:03:58.685469
  719 | customer_name_from_golang | 2019-12-06 18:03:59.706939
The name shows the value created by the transaction generator. The final row has a timestamp that is earlier than the recovery target (which you provided to the recovery script in an environment variable). Depending on the number of records that you need to recover, you might have to wait a few moments for the recovery to finish applying all of the rows.
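As an additional optional check in the same psql session, you can confirm that the recovered database contains none of the rows that the erroneous UPDATE statement modified and that no row is newer than the recovery target:
-- Optional: no rows should carry the erroneous name after the recovery.
SELECT COUNT(*) FROM pitr_db_schema.customer
  WHERE name = 'New Name for customer id=1';
-- Optional: the most recent row should not be later than the recovery target.
SELECT MAX(create_timestamp) FROM pitr_db_schema.customer;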
Clean up
The easiest way to eliminate billing is to delete the Google Cloud project you created for the tutorial. Alternatively, you can delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn more about Container-Optimized OS.
- Learn more about cloud-init.
- Learn more about Cloud Storage FUSE.
- Explore reference architectures, diagrams, and best practices about Google Cloud in the Cloud Architecture Center.