Monitor pglogical replication in AlloyDB Omni

This page provides provides information about how to monitor and troubleshoot pglogical implementations by checking and validating the provider and subscriber databases.

Before you begin

Before you begin to monitor and troubleshoot pglogical implementations, check the provider and subscriber databases, understand the pglogical implementation, and validate how it is configured.

For more information about the pglogical extension, see About pglogical.

For information about data replication using pglogical, see Replicate data between Google Cloud AlloyDB and AlloyDB Omni and Replicate data between AlloyDB Omni and other databases.

Check pglogical, replication, and AlloyDB Omni parameter settings

A number of configuration parameters affect the operation of the pglogical extension, and you can check that in the provider and subscriber databases. Note that the parameter values might vary.

  1. Show the current setting of pglogical-specific parameters:

    SELECT name,
    FROM pg_catalog.pg_settings
    WHERE name LIKE '%pglogical%' AND name NOT LIKE '%alloydb%'
    ORDER BY category, name;
  2. Show other logical replication-related parameters:

    SELECT name,
    FROM pg_catalog.pg_settings
    WHERE name IN ('wal_level',
    ORDER BY name;
  3. Show AlloyDB Omni-specific parameters:

    SELECT name,
    FROM pg_catalog.pg_settings
    WHERE name LIKE '%alloydb%'
    ORDER BY category, name;

List nodes in the configuration

  1. List both local and remote nodes in the pglogical replication configuration:

    SELECT node_id,
        if_nodeid AS node_id,
        if_name AS node_name,
        if_dsn AS dsn
    FROM pglogical.node_interface
    LEFT JOIN pglogical.local_node ON (node_id = if_nodeid AND node_local_interface = if_id)
    ORDER BY node_name;

    If the node_id column is NOT NULL, then that's the local node.

  2. Review the dsn information in detail. Any incorrect or outdated connection string information can result in replication failures. For information about dsn troubleshooting, see Troubleshoot subscription replication.

Check the subscription status and table replication point

The subscription status is always verified from the subscriber database. The subscription shows a status of initializing or replicating". It also shows a status of down. For more information about the down status, see Troubleshoot subscription replication.

  1. List the subscriptions, their current status and settings, in the current database:

    SELECT s.sub_name AS subscription_name,
        n1.node_name AS origin_name,
        n2.node_name AS target_name,
        sub_enabled AS enabled
    FROM pglogical.subscription s,
        (SELECT subscription_name, status FROM pglogical.show_subscription_status()) AS x,
        pglogical.node n1,
        pglogical.node n2
    WHERE s.sub_origin = n1.node_id
    AND s.sub_target = n2.node_id
    AND s.sub_name = x.subscription_name
    ORDER BY s.sub_name;

    The output is similar to the following:

    -[ RECORD 1 ]-----------+--------------------------------------
    subscription_id         | 3072625608
    subscription_name       | test_sub_1
    origin_name             | provider
    target_name             | subscriber
    status                  | replicating
    sub_slot_name           | pgl_my_test_db_provider_test_sub_1
    sub_replication_sets    | {default,default_insert_only,ddl_sql}
    sub_forward_origins     | {all}
    sub_apply_delay         | 00:00:00
    sub_force_text_transfer | f
    enabled                 | t
  2. List that tables that are currently replicated and their current log sequence number (LSN) by the subscription:

    SELECT sync_nspname||'.'||sync_relname AS table_name,
    FROM pglogical.local_sync_status
    WHERE sync_relname IS NOT NULL
    ORDER BY table_name;

    The output is similar to the following:

      table_name      | sync_status | sync_statuslsn 
    public.test_table_1 | r           | 0/B891BC0
    (1 row)

    The sync_statuslsn column shows to which LSN the table is synchronized. You can compare this to the LSN at the provider database to gauge replication lag.

  3. Check the replication status for a specific table:

    SELECT * FROM pglogical.show_subscription_table('test_sub_1','test_table_1');

Verify replication set details on the provider

  1. List the current replication sets in the provider database and check the items that are replicated:

    SELECT set_name,
    FROM pglogical.replication_set
    JOIN pglogical.node ON set_nodeid = node_id
    ORDER BY set_name, node_name;
  2. List tables and sequences that are currently replicated:

    -- Table details:
    SELECT set_name,
        set_reloid AS table_name,
    FROM pglogical.replication_set
    NATURAL JOIN pglogical.replication_set_table
    ORDER BY set_name, table_name;
    -- Sequence details:
    SELECT set_name,
        set_seqoid AS sequence_name
    FROM pglogical.replication_set
    NATURAL JOIN pglogical.replication_set_seq
    ORDER BY set_name, sequence_name;

Check the replication information and slot lag on the provider

  1. Check the status of each subscriber by generating the pg_stat_replication view on the provider database:

    SELECT application_name,
        pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn) AS sent_lag,
        pg_wal_lsn_diff(sent_lsn,flush_lsn) AS receiving_lag,
        pg_wal_lsn_diff(flush_lsn,replay_lsn) AS replay_lag,
        pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) AS total_lag,
        now()-reply_time AS reply_delay
    FROM pg_stat_replication
    ORDER BY client_hostname;

    The output is similar to the following:

    -[ RECORD 1 ]----+------------------------------
    application_name | test_sub_1
    state            | streaming
    sync_state       | async
    client_addr      |
    client_hostname  | 
    sent_lag         | 0
    receiving_lag    | 0
    replay_lag       | 0
    total_lag        | 0
    reply_delay      | 00:00:26.203433
  2. Take note of the reply_delay column, which shows the time when it received the last update from the subscriber database.

  3. Monitor replication lag of the replication slot on the provider because pglogical creates replication slots on the provider database:

    SELECT slot_name,
        COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(),restart_lsn),0) AS restart_lag,
        COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(),confirmed_flush_lsn),0) AS confirmed_flush_lag
    FROM pg_replication_slots
    WHERE plugin like '%pglogical%'
    ORDER BY slot_name;

    The output is similar to the following:

    -[ RECORD 1 ]-------+-----------------------------------
    slot_name           | pgl_my_test_db_provider_test_sub_1
    slot_type           | logical
    database            | my_test_db
    active              | t
    restart_lag         | 56
    confirmed_flush_lag | 0

Troubleshooting subscription replication

The subscription that's checked on the subscriber database must show a status of replicating or initializing if the subscription is recently created. If the status is down, then a problem has occurred.

The down status is usually shown after replication has attempted to initiate, but has failed. This is due to connectivity issues caused by the dsn setting, or missing database permissions, which is either at the provider or subscriber.

Use Log Explorer and inspect the PostgreSQL log files in Google Cloud when Google Cloud AlloyDB is one of the endpoints, for additional information that might indicate the cause of the problem. The log files provide details of the problem, including specific details on missing permissions.

  1. Check the PostgreSQL log on your AlloyDB Omni server:

    docker exec pg-service tail -20 /mnt/disks/pgsql/data/log/postgres
  2. Troubleshoot the dsn setting and ensure that network connectivity isn't the source of the problem:

    1. Copy the dsn connection string and try a manual connection using psql and the same string. If the psql session cannot connect, it indicates the following:
      • A networking issue.
      • An incorrect IP address, username, or password.
      • A blocking firewall.
      • The other cluster's pg_hba.conf file is not properly configured.
  3. Resynchronize a table if you don't want to drop and re-create the subscription after taking corrective actions:

    SELECT pglogical.alter_subscription_resynchronize_table(subscription_name := 'test_sub_1',relation := 'table_name');
  4. Alternatively, drop your subscription and re-create it:

    SELECT pglogical.drop_subscription(subscription_name := 'test_sub_1');

What's next