Announcement: Linux kernel incompatibility issue

Last update: April 13, 2023

What you need to know

Recent versions of the Linux kernel introduced an incompatibility with Google Cloud Backup and DR. This could impact your production and backup operations.

What is affected?

This issue applies only to configurations where both of the following are true:

  • Linux servers have the Backup and DR agent installed, and
  • One or more of the following databases, running on Linux, are protected using LVM snapshots with Changed Block Tracking (which runs incremental backups by comparing the latest backup with the current workload data and writing only the changed blocks to a new backup):

    • IBM Db2
    • MariaDB
    • MySQL
    • PostgreSQL
    • SAP ASE
    • SAP HANA
    • SAP IQ
    • SAP MaxDB

Impact

For servers that meet the above conditions, the impact is as follows:

  • RHEL kernel version higher than 4.18.0-425.3.1:
    • Production servers fail to start after a reboot.
    • The system may freeze when it reboots.
    • The system may freeze if Changed Block Tracking is enabled on an incompatible kernel.
  • SLES kernel version higher than 5.14.21-150400.22.1:
    • Backup jobs fail.

How do I know if my servers are impacted?

Is the Backup and DR agent installed?

To determine which of your Linux application VMs have the Backup and DR agent installed, log in to each Linux VM and run the following command:

   sudo systemctl status udsagent

If the agent is installed and running, its output contains the following:

   active (running)
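
If you need to repeat this check across many servers, it can be scripted. The following is a minimal sketch, assuming the agent's systemd unit is named udsagent as in the command above. The function names are illustrative and are not part of the Backup and DR agent.

```shell
#!/bin/sh
# Sketch: report whether the Backup and DR agent service is active.
# Assumes the systemd unit is named "udsagent", as in the command above.
# backup_agent_running and report_backup_agent are hypothetical helper names.

backup_agent_running() {
    # "is-active --quiet" prints nothing and exits 0 only when the unit is active.
    systemctl is-active --quiet udsagent
}

report_backup_agent() {
    if backup_agent_running; then
        echo "udsagent: active"
    else
        echo "udsagent: not active or not installed"
    fi
}
```

After loading these functions on a VM, call report_backup_agent to print a one-line result for that server.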

Is my kernel an affected version?

RHEL

Check the post-reboot kernel version. You will be impacted if the kernel version is higher than 4.18.0-425.3.1.

To check the post-reboot kernel version, run the following command in the shell:

sudo grubby --grub2 --default-title

The output is similar to the following:

Red Hat Enterprise Linux (4.18.0-425.13.1.el8.x86_64) 8.7 (Ootpa)

This indicates that your post-reboot kernel version is 4.18.0-425.13.1.el8.x86_64.
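
Comparing kernel versions by eye is error-prone, because 425.13.1 is higher than 425.3.1 even though it sorts lower as plain text. The following is a minimal sketch of a version comparison, assuming GNU sort with the -V (version sort) option is available. The function names are illustrative only.

```shell
#!/bin/sh
# Sketch: decide whether a kernel release is strictly higher than a threshold.
# Assumes GNU sort with -V (version sort). Function names are hypothetical.

# Print the leading "x.y.z-a.b.c" part of a kernel string, dropping
# suffixes such as ".el8.x86_64" or "-default".
kernel_numeric_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)*-[0-9]+(\.[0-9]+)*' | head -n 1
}

# Exit 0 if kernel string $1 is strictly higher than threshold $2,
# 1 if it is equal or lower, 2 if either version cannot be parsed.
kernel_is_higher_than() {
    candidate=$(kernel_numeric_version "$1")
    threshold=$(kernel_numeric_version "$2")
    [ -n "$candidate" ] && [ -n "$threshold" ] || return 2
    [ "$candidate" != "$threshold" ] || return 1
    highest=$(printf '%s\n%s\n' "$candidate" "$threshold" | sort -V | tail -n 1)
    [ "$highest" = "$candidate" ]
}
```

For example, kernel_is_higher_than "$(sudo grubby --grub2 --default-title)" 4.18.0-425.3.1 succeeds when the post-reboot RHEL kernel is above the threshold given in this announcement.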

SLES

Check the post-reboot kernel version. You will be impacted if the kernel version is higher than 5.14.21-150400.22.1.

To check the post-reboot kernel version, run the following command in the shell:

sudo grep -e "menuentry " -e submenu -e linux /boot/grub2/grub.cfg

The output is similar to the following:

menuentry 'SLES15-SP4'  --class sles15_sp4 --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-9d30aeb5-d035-4732-b8c2-145c907808ff' {
      $linux  /boot/vmlinuz-5.14.21-150400.24.11-default root=LABEL=ROOT  console=ttyS0,38400n8 net.ifnames=0 dis_ucode_ldr multipath=off

This indicates that your post-reboot kernel version is 5.14.21-150400.24.11.
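
The grep output can be long when several kernels are installed. The following is a minimal sketch that prints only the kernel versions referenced by the linux lines of the GRUB configuration, in file order. The function name is illustrative, and treating the first entry as the one that boots by default is an assumption that depends on your GRUB settings.

```shell
#!/bin/sh
# Sketch: list the kernel versions referenced by "linux" lines in a GRUB
# configuration file, in file order. grub_kernel_versions is a hypothetical
# helper name; the default path is the one used in the command above.
grub_kernel_versions() {
    cfg=${1:-/boot/grub2/grub.cfg}
    # Match lines such as "$linux /boot/vmlinuz-5.14.21-150400.24.11-default ..."
    # and print the part after "vmlinuz-", up to the next whitespace.
    sed -nE 's/^[[:space:]]*\$?linux[^[:space:]]*[[:space:]]+[^[:space:]]*vmlinuz-([^[:space:]]+).*$/\1/p' "$cfg"
}
```

The printed versions keep their flavor suffix (for example -default). Reading /boot/grub2/grub.cfg may require root privileges.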

When will my servers freeze?

Due to the kernel incompatibility, if Changed Block Tracking is enabled on your system, the system may freeze in the following cases:

  1. If your system is running an affected kernel but has not been rebooted yet, it may freeze if you:
    • Reboot.
    • Run sudo /opt/act/cbt/bin/cbt_deactivate.sh
  2. If your system is running an affected kernel and has already been rebooted, it may freeze at any time.
  3. If your system is running an older kernel that is not affected, it may freeze if the kernel is upgraded to an affected version and the system is rebooted.

What you should do

If you have systems that may be impacted, including those that have OS auto-updates turned on, take these immediate proactive measures to mitigate the issue:

If your system is running an affected kernel

  • Avoid rebooting these servers before making these changes.
  • Do not run /opt/act/cbt/bin/cbt_deactivate.sh.
  • If you can, disable OS automatic updates on RHEL instances.
  • Disable backup jobs for all workloads on these servers to prevent production impact.
  • Denylist the Changed Block Tracking kernel module and enable Degraded Capture Mode:
    • Log in to the Linux VMs.
    • Follow How to denylist the Changed Block Tracking kernel module to denylist the Changed Block Tracking module.
    • Reboot the system (if the system freezes while shutting down, forcibly stop it).
    • Set Enable Degraded Capture Mode to YES in applications by changing the policy settings from the Manage Backup Plan page. For more details, see Policy Settings.
  • Avoid changing the backup method through any other approach before the Changed Block Tracking module is denylisted.

If your system is running an unaffected kernel

Disable OS automatic updates on RHEL instances. If you cannot turn off auto-update, consider the following options to disable Changed Block Tracking:

  • Manually disable the Changed Block Tracking functionality and enable Degraded Capture Mode:

    • Log in to the Linux VMs.
    • Run sudo /opt/act/cbt/bin/cbt_deactivate.sh.
    • Set Enable Degraded Capture Mode to YES in applications by changing the policy settings from the Manage Backup Plan page. For more details, see Policy Settings.
  • From the management console for applications using Linux Changed Block Tracking:

What if you encounter a restart failure or system freeze?

If no mitigation steps are taken and a server is restarted, it could enter a soft lockup (the system may freeze). Reach out to the Google Cloud Support Center for help with resolving this issue.

To recover from a soft lockup state

You can denylist the Changed Block Tracking kernel module so that your systems do not experience the freeze issue. Because a frozen system is unavailable for login, you must create a new boot disk to rescue the crashed boot disk. You can use gce-rescue to do this.

  1. Use Cloud Shell or another Linux shell to install gce-rescue.
  2. Run gce-rescue with the command below:

    sudo $(which gce-rescue) --zone instance-zone --project instance-project --name instance-name
    
  3. When the command finishes, connect to the same instance by using the SSH tool in the Cloud Console.

  4. After connecting to the instance, make sure the old boot disk is mounted at /mnt/sysroot.

  5. Run sudo chroot /mnt/sysroot.

  6. Run mount -a.

  7. Follow How to denylist the Changed Block Tracking kernel module to denylist the Changed Block Tracking module.

  8. Restore the original boot disk. Run the same rescue command again. This automatically detects if the instance is in rescue, and restores it to the original boot disk.

    sudo $(which gce-rescue) --zone instance-zone --project instance-project --name instance-name
    

Wait for the instance to reboot, then connect to the instance with SSH. Make sure that lsmod | grep act outputs nothing.
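
Because grep act can also match unrelated module names, a narrower check may be easier to read. The following is a minimal sketch, assuming the Changed Block Tracking modules use the act_cbt_* names listed in the denylist rules in this announcement. The function names are illustrative only.

```shell
#!/bin/sh
# Sketch: list loaded kernel modules whose names start with "act_cbt".
# Assumes the Changed Block Tracking modules use the act_cbt_* names from the
# denylist rules in this announcement. Function names are hypothetical.

loaded_cbt_modules() {
    modules_file=${1:-/proc/modules}
    # The first field of each /proc/modules line is the module name.
    awk '$1 ~ /^act_cbt/ { print $1 }' "$modules_file"
}

# Exit 0 when no act_cbt module is loaded.
cbt_modules_absent() {
    [ -z "$(loaded_cbt_modules "$@")" ]
}
```

After the instance reboots, cbt_modules_absent should succeed. If it fails, loaded_cbt_modules prints the module names that are still loaded.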

How to denylist the Changed Block Tracking kernel module

Edit or create the file /etc/modprobe.d/blacklist.conf and add the following denylist rules:

blacklist act_cbt_1_14
blacklist act_cbt_1_15
blacklist act_cbt_1_14_0
blacklist act_cbt_1_15_0

Then run the following commands:

sudo sh -c "echo 'install act_cbt_1_14 /bin/true' > /etc/modprobe.d/act_cbt_1_14.conf"
sudo sh -c "echo 'install act_cbt_1_15 /bin/true' > /etc/modprobe.d/act_cbt_1_15.conf"
sudo sh -c "echo 'install act_cbt_1_14_0 /bin/true' > /etc/modprobe.d/act_cbt_1_14_0.conf"
sudo sh -c "echo 'install act_cbt_1_15_0 /bin/true' > /etc/modprobe.d/act_cbt_1_15_0.conf"
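
The rules above can also be written in one pass. The following is a minimal sketch that uses the same module names and file paths and skips blacklist lines that are already present. The function name and the optional directory argument are illustrative assumptions, and the function must run as root to write under /etc/modprobe.d.

```shell
#!/bin/sh
# Sketch: write the denylist rules above for all four module names.
# write_cbt_denylist and its optional directory argument are hypothetical;
# the module names and file paths are the ones listed above. Run as root.
write_cbt_denylist() {
    dir=${1:-/etc/modprobe.d}
    for mod in act_cbt_1_14 act_cbt_1_15 act_cbt_1_14_0 act_cbt_1_15_0; do
        # Append the blacklist rule only if that exact line is not already present.
        grep -qxF "blacklist $mod" "$dir/blacklist.conf" 2>/dev/null ||
            echo "blacklist $mod" >> "$dir/blacklist.conf"
        # Make any explicit attempt to load the module a no-op.
        echo "install $mod /bin/true" > "$dir/$mod.conf"
    done
}
```

Passing a scratch directory as the first argument lets you inspect the generated files before writing them to /etc/modprobe.d.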

When is this issue getting resolved?

Backup and DR Service will provide an update to mitigate the issue and prevent the soft lockup by April 18th. Follow the instructions on the Backup and DR Service page in Cloud Console and perform the update at the earliest opportunity.

Backup and DR Service will follow up with a more detailed update in May that makes the Backup and DR Service kernel module compatible with the newer versions of the Red Hat and SLES kernel packages.