Rollback framework documentation
balenaCloud and balenaOS support host OS updates (HUP). Rollbacks is a framework designed to roll back an OS update in case something goes wrong.
There are two rollback mechanisms in the OS, covering different update failure modes: rollback-health, which is based on health checks, and rollback-altboot, which recognizes that the new system is unbootable for some reason. Their operation is explained in detail below.
rollback-health
The new OS gets to userspace but something is unhealthy. Userspace is functional and we can use systemd services and bash scripts in this case.
- This state is checked by a systemd service: rollback-health.service.
- During a HUP, a flag file rollback-health-breadcrumb is left in the state partition to enable the rollback-health systemd service on next boot.
- rollback-health.service runs rollback-health, which runs rollback-tests. Two things are checked to establish whether balenaOS is healthy:
  - balenaEngine is not working. The balenaEngine healthcheck is run.
  - The VPN is not connecting but it used to connect in the previous OS.
- These tests are run once every minute for 15 minutes, which is the default value of the ROLLBACK_HEALTH_TIMEOUT variable.
- If the OS is considered healthy, rollback-health clears the flag files left in the state partition. This service won't run again.
- If a rollback due to a failed healthcheck is triggered, the previous OS boot hooks are run to restore the previous boot files, resin_root_part is updated in resinOS_uEnv.txt in the boot partition to point to the previous OS partition, a flag file rollback-health-triggered is left in the state partition, and a reboot is triggered.
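
The timing of the health checks above can be sketched as a simple polling loop. This is an illustrative Python sketch, not the actual rollback-health script (which the text describes as a bash script). The function name wait_for_healthy and the run_tests callback are hypothetical, and the sketch assumes that a single passing run of the tests is enough to consider the OS healthy.

```python
import time
from typing import Callable

# 15 minutes is the documented default of ROLLBACK_HEALTH_TIMEOUT.
ROLLBACK_HEALTH_TIMEOUT = 15 * 60  # seconds
# The tests are run once every minute.
CHECK_INTERVAL = 60  # seconds


def wait_for_healthy(
    run_tests: Callable[[], bool],
    timeout: int = ROLLBACK_HEALTH_TIMEOUT,
    interval: int = CHECK_INTERVAL,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as run_tests() passes, False once the timeout expires.

    Assumption (not confirmed by the text): one passing run means "healthy".
    run_tests stands in for rollback-tests (engine healthcheck and VPN check).
    """
    elapsed = 0
    while elapsed < timeout:
        if run_tests():
            return True
        sleep(interval)
        elapsed += interval
    return False
```

In this sketch, a True result corresponds to clearing the flag files in the state partition, and a False result corresponds to triggering the rollback described in the last bullet.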
rollback-altboot
The new OS is unbootable and does not get to Linux userspace (for example, a kernel panic, or something crashes before the OS reaches userspace and is able to run systemd). This requires the bootloader and userspace to work together. The bootloader needs to count the number of boots, and userspace needs to reset the bootcount if the OS is functional.
- During a HUP, the variable upgrade_available is set in resinOS_uEnv.txt in the boot partition.
- resinOS_uEnv.txt is read by the bootloader, and bootcount is incremented if upgrade_available=1.
- Bootcount is saved in the boot partition: grubenv for grub and bootcount.env for u-boot.
- During a boot, the bootloader checks the value of the bootcount variable. If it is higher than 1, this means nothing in the OS cleared the bootcount. It is assumed that the new OS failed to reach userspace, and the bootloader is supposed to boot the previous rootfs. For example, if resin_root_part=3 in resinOS_uEnv.txt, the bootloader will try to boot assuming resin_root_part=2.
- The bootloader has done its job and booted the previous OS. However, the boot files (e.g. dt overlay files) in the boot partition are still those of the new, broken rootfs, as we don't have multiple copies of them in the boot partition.
- We need to copy the previous boot files into the boot partition. These files are available in the root partition in the resin-boot folder.
- During a HUP, a flag file rollback-altboot-breadcrumb is left in the state partition.
- rollback-altboot.service is the systemd service that runs if rollback-altboot-breadcrumb is present.
- rollback-altboot.service checks if we are running the previous root, i.e. resin_root_part=3 in resinOS_uEnv.txt, but the current OS is actually mounted and running from resin_root_part=2.
  - If rollback-altboot detects that the bootloader has booted the previous rootfs, it runs boot hooks and copies the boot files of the currently running rootfs from resin-boot into the boot partition.
    - If rollback-altboot fails to clear the state and reboot the board for whatever reason, rollback-health will attempt to clear the rollback state and reboot the board after 15 minutes.
  - If rollback-altboot.service detects that the bootloader has booted the correct rootfs, this script does nothing and lets rollback-health.service function. The rollback-altboot-breadcrumb file is cleared by rollback-health.service.
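
The two halves of the altboot decision, the bootloader choosing a root partition and userspace detecting that it is running from the previous root, can be illustrated with a short Python sketch. This is not the bootloader or the rollback-altboot script itself. The function names are hypothetical, and the sketch assumes that resinOS_uEnv.txt holds plain KEY=VALUE lines and that there are exactly two root partitions numbered 2 and 3, as in the example above.

```python
from __future__ import annotations


def parse_uenv(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines (assumed file format), skipping blanks and comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def select_root_part(env: dict[str, str], bootcount: int) -> int:
    """Bootloader side: choose the root partition to boot.

    bootcount is the value after the bootloader has incremented it.
    A value higher than 1 while upgrade_available=1 means nothing in
    userspace cleared it, so the previous root partition is chosen.
    """
    configured = int(env["resin_root_part"])
    if env.get("upgrade_available") == "1" and bootcount > 1:
        # Assumption: two root partitions, numbered 2 and 3.
        return 2 if configured == 3 else 3
    return configured


def booted_previous_root(env: dict[str, str], running_root_part: int) -> bool:
    """Userspace side: True if the running root differs from the configured one."""
    return int(env["resin_root_part"]) != running_root_part
```

For example, with resin_root_part=3 and upgrade_available=1, a bootcount of 2 makes select_root_part return 2, and booted_previous_root(env, 2) then reports that the previous root is running, which is the case where the boot files from resin-boot need to be copied back.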