First, you might not reload everything in memory, so it will be patched on disk but not in process.
Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?
And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.
Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.
When someone who's been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machines was often considered a better solution than going all-in on one 7- to 8-figure VAX 'mainframe'.
often require a service contract that includes a permanent on site tech.
I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).
Executing hardware hot-swap typically means telling the system that a component is going down. Then the system moves those resources to the other component to gracefully allow you to remove it without a restart.
Like it's not a case where you just yank out a CPU as you like as though it were a spindle in a RAID-6 array. Especially if there's only one CPU. The state machine can't maintain state if the only component that tracks and maintains state goes missing.
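To make the "tell the system first, then pull the part" sequence concrete, here is a toy sketch in Python. It is not how any real firmware or OS does it; `Component` and `drain_and_offline` are made-up names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a hot-swappable part (CPU board, disk, ...)."""
    name: str
    online: bool = True
    tasks: list = field(default_factory=list)


def drain_and_offline(target, pool):
    """Move work off `target` onto the other online components, then mark it offline.

    Raises RuntimeError if nothing else is online to take over.
    """
    others = [c for c in pool if c is not target and c.online]
    if not others:
        raise RuntimeError(f"cannot offline {target.name}: nothing left to take over")
    # Spread the target's tasks round-robin across the survivors.
    for i, task in enumerate(target.tasks):
        others[i % len(others)].tasks.append(task)
    target.tasks = []
    target.online = False  # only now is it safe to physically remove the part


cpu0 = Component("cpu0", tasks=["a", "b"])
cpu1 = Component("cpu1", tasks=["c"])
drain_and_offline(cpu0, [cpu0, cpu1])
print(cpu1.tasks)  # ['c', 'a', 'b']
```

With only one component in the pool the call raises instead of draining, which is the single-CPU case: there is nowhere for the state to go.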
Had an accidental reboot, and it could not boot. Had redundancy, but the other server had failed silently days prior. Solved it with three-way redundancy and extra monitoring. Systems fail in many ways at the same time. If you do not test it, there is a chance it won't work. A controlled failure, like rebooting once in a while just to make sure it works, is preferable to an unknown one.
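The "extra monitoring" part can be as small as a staleness check on heartbeats, so a standby that dies silently gets noticed before you need it. A minimal sketch, assuming each node records a Unix timestamp somewhere you can read it; `silent_failures` and the node names are hypothetical.

```python
import time


def silent_failures(last_heartbeat, max_age_s=60.0, now=None):
    """Return the names of nodes whose last heartbeat is older than max_age_s seconds."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_heartbeat.items() if now - ts > max_age_s)


# Timestamps are made up for the example.
beats = {"primary": 1000.0, "standby-a": 990.0, "standby-b": 400.0}
print(silent_failures(beats, max_age_s=60.0, now=1010.0))  # ['standby-b']
```

This only tells you a node stopped reporting. It says nothing about whether the node can actually take over, which still needs a real failover or reboot test.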
Not sure I'm following honestly. Your primary goes down and it fails over to the secondary (which becomes the primary), but if you can't boot how do you then get another secondary ready to fail over to again when the new primary inevitably fails?
Ah, spoken with the confidence of a freshly minted qualified worker :). Anything you don’t test is a wish, not a production system. You either know that your systems work end to end because you tested periodically, or you pray they will.
How do you know the automatic failover works? How do you know the standby system works?
I’ve seen many a “qualified worker” get sent packing because they never fully tested the prod system, since they just knew everything would work, and never tested the backup systems, since qualified workers do the job right the first time and need no backup.
You patch it in memory and on disk. What you put on disk is the patch though, so when you restart, the original unpatched version is booted, and then the same live patch is applied. This is how Ksplice worked. It has the advantage that there isn't a config file in /etc to get changed out from under it, so the second problem did not apply.
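A toy model of that scheme, in Python: the base image on disk is never touched, the patches are stored next to it, and every boot replays them. This is only an illustration of the idea described above, not Ksplice's actual mechanism; `boot`, `live_patch`, and the dict-based "image" are invented for the example.

```python
def boot(base_image, stored_patches):
    """Simulate a boot: load the unmodified base image, then re-apply each stored patch in order."""
    running = dict(base_image)  # in-memory copy; the base "on disk" is never modified
    for patch in stored_patches:
        running.update(patch)
    return running


def live_patch(running, stored_patches, patch):
    """Patch the running system and persist the patch so the next boot re-applies it."""
    running.update(patch)               # patched in memory
    stored_patches.append(dict(patch))  # patch (not a new base image) saved "on disk"


base = {"parse": "v1", "sched": "v1"}
patches = []
running = boot(base, patches)
live_patch(running, patches, {"parse": "v1-fix"})

# A fresh boot ends up in the same state as the live-patched system.
assert boot(base, patches) == running
assert base == {"parse": "v1", "sched": "v1"}
```

The final assertions are the point: the rebooted state and the live state are built from the same inputs, so they cannot drift apart the way a hand-edited config can.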
Ksplice can do that because the kernel is only in memory in one place and it never sleeps. It has to orchestrate a process that's always running, which is complex, but it's never more than one.
Now try patching glibc like that. Not only does almost every thread have it in memory, several of them will have it in process, and some of them will have it swapped to disk while the thread sleeps. You're going to quickly decide that you actually just want a little bit of downtime or else you want to stand up a redundant system. There's a reason that some live patching systems explicitly exclude glibc and similar libraries.
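You can at least see the scale of the problem on Linux: after a library file is replaced on disk, processes that still map the old copy show its path with a ` (deleted)` suffix in `/proc/<pid>/maps`. A rough sketch that lists them; the function names and the default `needle` are my own choices, and it needs enough privileges to read other users' maps.

```python
import os

DELETED = " (deleted)"


def stale_mappings(maps_text, needle="libc.so"):
    """Return paths of mapped files containing `needle` that were deleted or replaced on disk."""
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # anonymous mapping, no pathname
        path = parts[5]
        if path.endswith(DELETED) and needle in path:
            stale.add(path[: -len(DELETED)])
    return stale


def processes_with_stale_lib(needle="libc.so"):
    """Scan /proc (Linux only) for processes still mapping an old copy of a library."""
    result = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/maps") as f:
                stale = stale_mappings(f.read(), needle)
        except OSError:
            continue  # process exited, or we lack permission to read its maps
        if stale:
            result[int(pid)] = stale
    return result
```

Calling `processes_with_stale_lib()` right after a glibc update typically returns most of the long-running processes on the box, every one of which would need a restart (or an in-place patch) to pick up the fix.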