The problem with live patching is twofold. First, you might not reload everythin...

kjs3 · 2026-05-22T15:00:57 1779462057

>And, yeah, everything is hot swappable on VAX.

Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.

When someone who's been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machines was often considered a better solution than going all-in on one 7- to 8-figure VAX 'mainframe'.

>often require a service contract that includes a permanent on site tech.

I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).

Squeeeez · 2026-05-22T19:36:15 1779478575

How would you implement no-downtime hot swap with only one item?

kjs3 · 2026-05-22T21:51:57 1779486717

By implementing hot-swap into the one item? Am I missing something in this question?

da_chicken · 2026-05-23T09:15:27 1779527727

Executing hardware hot-swap typically means telling the system that a component is going down. Then the system moves those resources to the other component to gracefully allow you to remove it without a restart.

Like it's not a case where you just yank out a CPU as you like as though it were a spindle in a RAID-6 array. Especially if there's only one CPU. The state machine can't maintain state if the only component that tracks and maintains state goes missing.
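To make that concrete, here is a rough sketch of the "announce, drain, then remove" sequence. The `System` class, `request_removal`, and the round-robin drain are all made up for illustration and are not any real platform's hot-plug API; the point is only that removal fails when nothing is left to take over the state.

```python
class HotSwapError(Exception):
    """Raised when a component cannot be removed without losing state."""


class System:
    """Hypothetical model: each component holds a list of tasks (its state)."""

    def __init__(self, components):
        self.load = {name: [] for name in components}

    def assign(self, component, task):
        self.load[component].append(task)

    def request_removal(self, component):
        """Drain `component` onto the remaining ones, then release it."""
        others = [c for c in self.load if c != component]
        if not others:
            # The single-component case: nothing can carry the state over.
            raise HotSwapError("no remaining component to take over")
        for i, task in enumerate(self.load[component]):
            # Spread the drained tasks round-robin across the survivors.
            self.load[others[i % len(others)]].append(task)
        del self.load[component]
        return others
```

With two CPUs, `request_removal("cpu0")` moves its tasks to `cpu1`; asking to remove `cpu1` afterwards raises `HotSwapError`, which is the one-item case from the question above.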

coldtea · 2026-05-22T09:28:43 1779442123

>First, you might not reload everything in memory, so it will be patched on disk but not in process.

You design for this with generational tagged objects or something similar.
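One possible reading of that, as a minimal sketch: every live object carries the generation of the code it was bound to, and rebinds lazily when it notices the registry has moved on. `CodeRegistry`, `TaggedObject`, and `patch` are hypothetical names for illustration, not an existing library.

```python
class CodeRegistry:
    """Holds the current implementation and a generation counter."""

    def __init__(self, impl):
        self.generation = 1
        self.impl = impl

    def patch(self, new_impl):
        # Swapping the code bumps the generation; live objects are not touched here.
        self.impl = new_impl
        self.generation += 1


class TaggedObject:
    """A live object tagged with the generation of the code it is bound to."""

    def __init__(self, registry):
        self.registry = registry
        self._bind()

    def _bind(self):
        self.generation = self.registry.generation
        self.impl = self.registry.impl

    def call(self, *args):
        if self.generation != self.registry.generation:
            # Stale tag: pick up the patched code before running.
            self._bind()
        return self.impl(*args)
```

An object created before `patch()` keeps its old tag until its next `call()`, at which point it rebinds, so nothing in the process keeps running the unpatched code indefinitely as long as every entry point goes through the tag check.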

mx7zysuj4xew · 2026-05-22T12:54:25 1779454465

Which is moot, because if the system is important enough, you'll have an automatic failover to another system running on standby.

All this "we must reboot to test" is bullshit excuses by unqualified workers

z3t4 · 2026-05-22T13:50:50 1779457850

Had an accidental reboot, and it could not boot. Had redundancy, but the other server had failed silently days prior. Solved it with three-way redundancy and extra monitoring. Systems fail in many ways at the same time. If you do not test it, there is a chance it won't work. Controlled failure is preferred over unknowns, like rebooting once in a while just to make sure it works.
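The "extra monitoring" part can be as simple as flagging any node whose heartbeat has gone quiet. A minimal sketch, assuming each node reports a timestamp in seconds; the function names and the `max_age_s` threshold are illustrative, not from any particular monitoring tool.

```python
def stale_nodes(last_heartbeat, now, max_age_s):
    """Return nodes whose most recent heartbeat is older than max_age_s seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > max_age_s
    )


def healthy_standbys(last_heartbeat, primary, now, max_age_s):
    """Return the non-primary nodes that are still reporting in."""
    stale = set(stale_nodes(last_heartbeat, now, max_age_s))
    return sorted(n for n in last_heartbeat if n != primary and n not in stale)
```

A check like this only catches a standby that has stopped reporting. It does not prove the standby can actually take over, which is why the periodic controlled failover is still needed.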

X0Refraction · 2026-05-22T14:43:27 1779461007

Not sure I'm following honestly. Your primary goes down and it fails over to the secondary (which becomes the primary), but if you can't boot how do you then get another secondary ready to fail over to again when the new primary inevitably fails?

close04 · 2026-05-22T17:23:16 1779470596

Ah, spoken with the confidence of a freshly minted qualified worker :). Anything you don’t test is a wish, not a production system. You either know that your systems work end to end because you tested periodically, or you pray they will.

How do you know the automatic failover works? How do you know the standby system works?

I’ve seen many “qualified workers” get sent packing because they never fully tested the prod system, since they just knew everything would work, and never tested the backup systems, since qualified workers do the job right the first time and need no backup.

pjmlp · 2026-05-22T10:25:29 1779445529

Yes, some things actually cost money, especially if they aren't easy to implement.

fragmede · 2026-05-22T21:11:23 1779484283

You patch it in memory and on disk. What you put on disk is the patch though, so when you restart, the original unpatched version is booted, and then the same live patch is applied. This is how Ksplice worked. It has the advantage that there isn't a config file in /etc to get changed out from under it, so the second problem did not apply.
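Roughly, the flow looks like the toy model below. This is not Ksplice's actual mechanism or API: the dict standing in for the kernel image, the JSON patch store, and the names `live_patch` and `boot` are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the unpatched image that always gets booted.
ORIGINAL_IMAGE = {"tcp_retry_limit": 3, "scheduler": "v1"}


def apply_patch(image, patch):
    """Apply one patch to the in-memory image."""
    image.update(patch)


def live_patch(image, patch, store: Path):
    """Patch the running image and record the patch (not a new image) on disk."""
    apply_patch(image, patch)
    patches = json.loads(store.read_text()) if store.exists() else []
    patches.append(patch)
    store.write_text(json.dumps(patches))


def boot(store: Path):
    """Boot the original image, then replay every stored patch in order."""
    image = dict(ORIGINAL_IMAGE)
    if store.exists():
        for patch in json.loads(store.read_text()):
            apply_patch(image, patch)
    return image


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = Path(d) / "patches.json"
        running = boot(store)
        live_patch(running, {"scheduler": "v2"}, store)
        # A reboot replays the patch, so it matches the running system.
        print(boot(store) == running)
```

The original image on disk is never modified, so the state after a reboot is reached by the same path as the live state: original plus the recorded patches.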

da_chicken · 2026-05-23T09:01:01 1779526861

Ksplice can do that because the kernel is only in memory in one place and it never sleeps. It has to orchestrate a process that's always running, which is complex, but it's never more than one.

Now try patching glibc like that. Not only does almost every thread have it in memory, several of them will have it in process, and some of them will have it swapped to disk while the thread sleeps. You're going to quickly decide that you actually just want a little bit of downtime or else you want to stand up a redundant system. There's a reason that some live patching systems explicitly exclude glibc and similar libraries.