The thing we are still missing is the distributed OS. Kubernetes only exists because Linux lacks the abstractions to do computation, discovery, message passing/IO, and instrumentation over multiple nodes. If you could do ps -A and see all processes on all nodes, or run a program and have it automatically execute on a random node, or if (grumble grumble) systemd unit files would schedule a minimum of X processes on N nodes, most of the K8s ecosystem would become redundant. A lot of other components, like unified AuthZ for Linux, already exist, as does networking (WireGuard, anyone?).
Plan 9 was designed in this way, but never took off.
Rob Pike:
> This is 2012 and we're still stitching together little microcomputers with HTTPS and ssh and calling it revolutionary. I sorely miss the unified system view of the world we had at Bell Labs, and the way things are going that seems unlikely to come back any time soon.
I think Rob is right to call out the problem, but is being a bit rose-colored about Plan 9.
Plan 9 was definitely ahead of its time, but it's also a far cry from the sort of distributed OS we need today. "Everything is a remote posix file" ends up being a really bad abstraction for distributed computing. What people are doing today with warehouse-scale clusters indeed has a ton of layers of crap in there, and I think it's obvious to yearn for sweeping that away. But there's no chance you could do that with P9 as it was designed.
"Everything is a file" originally referred to read and write as universal object interfaces. It's similar to Smalltalk's send/receive as an idealized model for object-based programming. Hierarchical filesystem namespaces for object enumeration and acquisition are tangential, though they often work well because most namespaces (DNS, etc.) tend to be hierarchical. (POSIX filesystem semantics don't really figure into Plan 9 except, perhaps, incidentally.) Filesystem namespacing isn't quite as abstract, though (open, readdir, etc. are much more concrete interfaces), making impedance mismatch more likely.
The abstraction is sound. We ended up with TCP and HTTP instead of IL and 9P (and at scale, URLs instead of file descriptors), because of trust issues, but that's not surprising. Ultimately the interface of read/write sits squarely in the middle of all of them, and most others. To build a distributed system with different primitives at the core, for example, send/receive, requires creating significantly stronger constraints on usage and implementation environments. People do that all the time, but in practice they do so by building atop the file interface model. That's what makes the "everything is a file" model so powerful--it's an interoperability sweet spot; an axis around which you can expect most large-scale architectures to revolve at their core, even if the read/write abstraction isn't visible at the point where users (e.g. application developers) interact with the architecture.
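To make the "read/write as the universal interface" point concrete, here is a minimal sketch in Python (not Plan 9 code). The helper names `pump` and `demo_pump` are made up for illustration. The only thing `pump` assumes about either end is that it has `read()` and `write()`, so the same loop works for an in-memory buffer and for a kernel pipe.

```python
import io
import os


def pump(src, dst, chunk_size=4096):
    """Copy bytes from src to dst using nothing but read() and write()."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def demo_pump():
    # Case 1: both ends are in-memory buffers.
    mem_out = io.BytesIO()
    pump(io.BytesIO(b"hello from memory\n"), mem_out)

    # Case 2: the source is a kernel pipe wrapped in a file object.
    r_fd, w_fd = os.pipe()
    with os.fdopen(w_fd, "wb") as writer:
        writer.write(b"hello from a pipe\n")  # closing the writer signals EOF
    pipe_out = io.BytesIO()
    with os.fdopen(r_fd, "rb") as reader:
        pump(reader, pipe_out)

    return mem_out.getvalue(), pipe_out.getvalue()
```

Nothing in `pump` knows or cares what is on the other side, which is the interoperability property being argued for here. It also shows the limit the replies point at: a bare byte stream says nothing about ordering across clients, partial failure, or consistency.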
A hierarchical namespace is fine, but the open/read/write/sync/close protocol on byte based files is definitely inadequate. The constraints on usage you decry are in fact fundamental constraints of distributed computing that are at odds with the filesystem abstraction. And this is exactly what I was getting at in talking about rose colored glasses with P9. It in no way is a replacement for something like Colossus or Spanner.
> P9 ... in no way is a replacement for something like Colossus or Spanner.
Colossus and Spanner are both proprietary so there's very limited info on them, but both seem to be built for very specialized goals and constraints. So, not really on the same level as a general system interface like 9P, which is most readily comparable to, e.g., HTTP. In Plan 9, 9P servers are routinely used to locally wrap connections to such exotic systems. You can even require the file system interface locally exposed by 9P to be endowed with extra semantics, e.g. via special messages written to a 'control' file. So any level of compatibility or lack thereof with simple *nix bytestreams can be supported.
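A rough sketch of the 'control' file idea, in Python rather than as a real 9P server. Everything here is hypothetical: the `ToyDevice` class, the two file names `data` and `ctl`, and the `mode raw` / `mode upper` message format are invented for illustration and are not taken from any Plan 9 device.

```python
class ToyDevice:
    """Hypothetical service exposing two 'files': data and ctl."""

    def __init__(self):
        self.mode = "raw"
        self.buffer = bytearray()

    def write(self, name, payload):
        if name == "ctl":
            # Control messages are plain-text commands such as b"mode upper".
            verb, _, arg = payload.decode().strip().partition(" ")
            if verb != "mode" or arg not in ("raw", "upper"):
                raise ValueError(f"bad control message: {payload!r}")
            self.mode = arg
        elif name == "data":
            # The current mode changes what a plain data write means.
            self.buffer += payload.upper() if self.mode == "upper" else payload
        else:
            raise FileNotFoundError(name)
        return len(payload)

    def read(self, name):
        if name == "ctl":
            return f"mode {self.mode}\n".encode()
        if name == "data":
            out, self.buffer = bytes(self.buffer), bytearray()
            return out
        raise FileNotFoundError(name)
```

The byte-stream side stays dumb; the extra semantics live in a second file that clients can ignore if they do not need them.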
This describes more of a Single System Image [0] to me (WPD includes Plan 9 as one, but considering it does not support process migration I find it moot).
LinuxPMI [1] seems to be a good idea, but it seems to be based on Linux 2.6, so you would have to heavily patch a newer kernel.
The only things that seem to support process migration with current software and are still active are CRIU [2] (which doesn't support graphical/Wayland programs) and DragonFly BSD [3] (in their own words, very basic).
I don't really see any reason to consider process migration a required feature of either a distributed OS or a single system image. Even on a single computer this isn't always practical or desirable (i.e. you can't 'migrate' a program running on your GPU to your CPU, and you can't trivially migrate a thread from one process to another either).
Not all units of computation are interchangeable, and a system that recognizes this and doesn't try to shoehorn everything down to the lowest common denominator actually gains some expressive power over a uniform system (else we would not have threads).
> This describes more of a Single System Image [0] to me
No, Plan 9 is not a SSI OS. The idea is all resources are exposed via a single unified file oriented protocol: 9p. All devices are files which means all communication happens over fd's meaning you look at your computer like a patch bay of resources, all communicated with via read() and write(). e.g.:
Looking above it looks like Unix but with MAJOR differences. First off the disk is a directory containing partitions which are just files whose size is the partition's size. You can read or write those files as you please. Since the kernel only cares about exposing hardware as files, the file system on a partition needs to be translated to 9p. We do this with a program that is a file server which interprets e.g. a fat32 fs and serves it via 9p (dossrv(4)). Your disk based file system is just a user-space program.
And since files are the interface you can bind over them to replace them with a different service like mixfs(4). /dev/audio is like the old linux oss where only one program could open a sound card at a time. To remedy this on plan 9 you run mixfs which opens /dev/audio and then binds itself over /dev replacing /dev/audio in that namespace with a multiplexed /dev/audio from mixfs. Now you start your window manager and the children programs will see mixfs's /dev/audio instead of the kernel /dev/audio. Your programs can now play audio simultaneously without changing ANYTHING. Now compare that simplicity to the trash fire linux audio has been and continues to be with yet another audio subsystem.
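A toy model of the bind trick, in Python and deliberately not real Plan 9 code. The class names (`Namespace`, `KernelAudio`, `Mixer`) and the sample-summing "mix" are assumptions made up for this sketch. The point is only that a child namespace can have `/dev/audio` resolve to a different server than the parent's, without the client programs changing.

```python
class Namespace:
    """Per-process table mapping paths to whatever serves them."""

    def __init__(self, parent=None):
        # A child starts from a copy of the parent's bindings, so a later
        # bind in the child does not leak back into the parent.
        self.table = dict(parent.table) if parent else {}

    def bind(self, path, server):
        self.table[path] = server

    def open(self, path):
        return self.table[path]


class KernelAudio:
    """Stand-in for an exclusive device: only one writer at a time."""

    def __init__(self):
        self.owner = None
        self.played = []

    def write(self, who, samples):
        if self.owner not in (None, who):
            raise BlockingIOError("device busy")
        self.owner = who
        self.played.append(samples)


class Mixer:
    """Stand-in for a mixing file server layered over the real device."""

    def __init__(self, device):
        self.device = device
        self.pending = {}

    def write(self, who, samples):
        self.pending[who] = samples

    def flush(self):
        # Sum the clients' samples element-wise and forward one stream.
        mixed = [sum(vals) for vals in zip(*self.pending.values())]
        self.device.write("mixer", mixed)
        self.pending.clear()
        return mixed


def demo_bind():
    root = Namespace()
    root.bind("/dev/audio", KernelAudio())
    session = Namespace(root)                # e.g. the window manager's view
    mixer = Mixer(root.open("/dev/audio"))
    session.bind("/dev/audio", mixer)        # replaces it in this view only
    session.open("/dev/audio").write("player-a", [1, 2])
    session.open("/dev/audio").write("player-b", [10, 20])
    return mixer.flush(), root.open("/dev/audio").played
```

In the root namespace `/dev/audio` is still the exclusive device; only processes started under `session` see the mixer.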
Keyboard keymaps are a filter program sitting between /dev/kbd and your program. All it does is read in key codes and maps key presses according to a key map which is just a file with key->mapping lines. Again, keyboards are files so a user space file server can be a keyboard such as a GUI keyboard that binds itself over /dev/kbd.
Now all those files can be exported or imported to other machines, regardless of CPU architecture.
Unix is an OS built on top of a single machine. Plan 9 is a Unix built on top of a network. It's the closest I can get to computing nirvana where all my resources are available from any machine with simple commands that are part of the base OS which is tiny compared to the rest.
Does Plan 9 have an equivalent to alsa or jackaudio or pipewire, where you can pick a (usually shared memory) ring buffer size, have the kernel or daemon alert you (usually through a poll fd) when there's room in the buffer, and only then perform the necessary calculations (instead of performing a blocking write which adds latency)? What about synchronously receiving blocks of input audio as soon as available, and outputting equally-sized aligned blocks of output audio played back exactly two periods of buffering after the matching input audio? Or chaining three or more apps in a pipeline, all woken in sequence whenever input data is available, and still have only two periods of latency?
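For a sense of scale on the "two periods" figure: buffering latency is just frames buffered divided by sample rate. A small sketch follows; the 256-frame period and 48 kHz rate are illustrative assumptions, not numbers from ALSA, JACK, PipeWire or Plan 9.

```python
def buffer_latency_ms(period_frames, periods, sample_rate_hz):
    """Latency added by `periods` buffers of `period_frames` frames each."""
    return 1000.0 * period_frames * periods / sample_rate_hz


# Assumed example: two periods of 256 frames at 48 kHz is 512 frames,
# i.e. a little under 11 ms.
example_ms = buffer_latency_ms(period_frames=256, periods=2, sample_rate_hz=48_000)
```

Every extra stage that adds its own period of buffering adds another `period_frames / sample_rate_hz` on top, which is why chaining apps without growing the period count matters.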
Graphical programs could be checkpointed and restored as long as they don't directly connect to the hardware. (Because the checkpoint/restore system has no idea how to grab the hardware's relevant state or replicate it on restore.) This means running those apps in a hardware-independent way (e.g. using a separate Wayland instance that connects to the system one), but aside from that it ought to be usable.
It has been done "virtually" by going through e.g. VNC https://criu.org/VNC . Alternately, CRIU apps could be required to use virt-* devices, which CRIU might checkpoint and restore similar to VM's.
For what it's worth, the HPC-standard way of checkpointing/migrating distributed execution (in userspace, unlike CRIU) is https://dmtcp.sourceforge.io/
It supports X via VNC -- I've never tried -- but I guess you could use xpra.
Are there any good walkthroughs of what a good, distributed Plan 9 setup looks like from either a development or an administration perspective? Particularly with an emphasis on many distributed compute nodes (or cpu servers in Plan 9 parlance).
The entire point of UNIX philosophy (Which seems to be something they aren’t teaching in software development these days) is to do one thing and do it well. We don’t need Linux operating as a big declarative distributed system with a distributed scheduling system and a million half-baked APIs to interact with it, the way K8s works. If you want that you should build something to your specific requirements, not shove more things into the kernel.
The UNIX philosophy made more sense as an abstraction for a computer when computers were simpler. Computers nowadays (well at least since 2006-ish) have multiple cores executing simultaneously with complicated amounts of background logic, interrupt-driven logic, shared caches, etc. The UNIX philosophy doesn't map to this reality at all. Right now there's no set of abstractions except machine code that exposes the machine's distributed systems in a coherent way. Nothing is stopping someone else from writing a UNIX abstraction atop this though.
The idea of doing one thing, and doing it well, isn't dependent on the simplicity of the underlying system (I imagine that PDP-11 systems seemed impressively complicated in their time, too). The UNIX philosophy is a paradigm for managing complexity. To me, that seems more relevant with modern computers, not less.
> “A program is generally exponentially complicated by the number of notions that it invents for itself. To reduce this complication to a minimum, you have to make the number of notions zero or one, which are two numbers that can be raised to any power without disturbing this concept. Since you cannot achieve much with zero notions, it is my belief that you should base systems on a single notion.” - Ken Thompson
All of this was possible with QNX literally decades ago, and it didn't need whatever strawman argument you're making up in your head in order to accomplish it. QNX was small, fast, lean, real-time, distributed, and very powerful for the time. Don't worry, it even had POSIX support. A modern QNX would be very well received, I think, precisely because taking a distributed-first approach would dramatically simplify the whole system design versus tacking on a distributed layer on top of one designed for single computers.
> Which seems to be something they aren’t teaching in software development these days
This is funny. Perhaps the thing you should have been taught instead is history, my friend.
You mean QNet [0]? That is still alive... It is for LAN use ("Qnet is intended for a network of trusted machines that are all running QNX Neutrino and that all use the same endianness."), so extra care is needed to secure this group of machines when exposed to the internet.
Correct. Though QNet itself is only one possible implementation, in a sense (but obviously the one shipped with QNX.) And the more important part of the whole thing is the message-passing API design built into the system, which enables said networking transparency, because it means your programs are abstracted over the underlying transport mechanism.
"LAN use" I think would qualify roughly 95% of the need for a "distributed OS," including a lot of usage of K8s, frankly. Systems with WAN latency impose a different set of challenges for efficient comms at the OS layer. But even then you also have to design your apps themselves to handle WAN-scale latencies, failover, etc too. So it isn't like QNX is going to make your single-executable app magic or whatever bullshit. But it exposes a set of primitives that are much more tightly woven into the core system design and much more flexible for IPC. Which is what a distributed system is; a large chattery IPC system.
The RECON PDF is a very good illustration of where such a design needs to go, though. It doesn't surprise me QNX is simply behind modern OS's exploit mitigations. But on top of that, a modern take on this would have to blend in a better security model. You'd really just need to throw out the whole UNIX permission model frankly, it's simply terrible as far as modern security design is concerned. QNet would obviously have to change as well. You'd at minimum want something like a capability-based RPC layer I'd think. Every "application server" is like an addressable object you can refer to, invoke methods on, etc. (Cap'n Proto is a good way to get a "feel" for this kind of object-based server design without abandoning Linux, if you use its RPC layer.)
I desperately wish someone would reinvent QNX but with all the nice trappings and avoiding the missteps we've accumulated over the past 10 to 15 years. Alas, it's much more profitable to simply re-invent its features poorly every couple of years and sell that instead.
This overview of the QNX architecture (from 1992!) is one of my favorite papers for its simplicity and straightforward prose. Worth a read for anyone who likes OS design.
Moreover when folks talk about doing only one thing and one thing well, they are referring to command line utilities and pipes. And pipes were an invention of Doug McIlroy, not Ritchie or Thompson.
And the unix command-pipe philosophy is realized much better as ordinary functions in a functional programming language.
The Unix philosophy was a reasonably good model decades ago. But I think it is over romanticized.
Its binary blob design is no good for security, as opposed to a bytecode design like Forth. Its user security model was poor and doesn't help with modern devices like phones. Its multiprocess model was ham-fisted into a multithreading model to compete with Windows NT. Its asynchronous I/O model has always been a train wreck, even compared to NT. Its design creates performance issues, especially in multiproc networking code with a needless number of memory copies. Now folks are rewriting the networking stack in user space. Its software abstraction layer was some simple scheme from the 70s which has fragmented into a crazy number of implementations now. Open source developers still complain about how much easier it is to build a package for Windows, as opposed to Linux. It was never meant to be a distributed system either. Modern enterprise compute cannot scale by treating and managing each individual VM as its own thing with clusters held together by some sysadmin's batch scripts.
A good paper giving a concrete example of all this is "A fork() in the road", where you can see how an API just like fork(2) has an absolutely massive amount of ramifications on the overall design of the system, to the point "POSIX compliance" resulted in some substantial perversions of the authors' non-traditional OS design, all of which did nothing but add complexity and failure modes ("oh, but I thought UNIX magically gave you simplicity and made everything easy?") It also has significantly diverged from its "simple" original incarnation in the PDP-11 to a massive complex beast. So you can add "CreateProcess(), not fork()" on the list of things NT did better, IMO.
And that's just a single system call, albeit a very important one. People simply vastly overestimate how rose-tinted their glasses are and all the devils in the details, until they actually get into the nitty gritty of it all.
I agree that fork (a Unix implementation detail) creates issues like overcommit and complicating memory management (took a look at the paper and I won't dispute the issues it points out). I don't agree that farming out in-app functionality into a herd of daemons (d-bus, desktop portals, pipewire, pipewire-pulse, wireplumber, system and user systemd with daemon-level environment variables) is beneficial for system functionality and doesn't create added complexity (each daemon's state, which daemons are running or crashed, reliance on IPC instead of being able to trace each process in isolation) and new failure modes (apps can't find D-Bus to load the desktop portal, and hang instead, if you login to Wayfire unless you login to Xfce first without killing systemd --user).
Ah, a bit meh. Haiku OS/Be did similar queries with BFS and yet it can be much faster on SSDs.
OK, no proper permissions/ACLs, but NTFS is on par with ext3 in performance, with some additions.
It needs two new filesystems, one for desktops and another for the enterprise. Linux's should be F2FS for flash media and bcachefs for professional storage needs.
It looks like you have zero familiarity with NTFS. NTFS had a far more fine grained ACL model since version 1. Perhaps linux caught up several decades later. I am not really sure.
I am also not sure why you insist on arguing about a topic you are not familiar with at all.
Eh I can’t see Linux getting a built-in distributed kv store (etcd) any time soon. Same goes for distributed filesystems. All you have out of the box is nfs which gives you the worst of both worlds: Every nfs server is a SPOF yet these servers don’t take advantage of their position to guarantee even basic consistency (atomic appends) that you get for free everywhere else.
And besides, how would you even implement all those features you listed without recreating k8s? A distributed “ps -A” that just runs “for s in $servers; do ssh user@$s ps; done” and sorts the output would be trivial, but anything more complex (e.g. keeping at least 5 instances of an app running as machines die) requires distributed and consistent state.
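To illustrate why the "keep at least 5 running" case is the hard one, here is a minimal sketch of the reconcile step in Python. The function name `reconcile`, the integer instance IDs and the least-loaded placement rule are all assumptions for this example, not how k8s does it. Note what the sketch takes for granted: a single, correct view of `placements` and `live_nodes`. Obtaining that view while machines die is exactly the distributed, consistent state in question.

```python
def reconcile(desired, placements, live_nodes):
    """Return a new {instance_id: node} map with `desired` instances,
    all placed on nodes that are currently alive."""
    if not live_nodes:
        raise RuntimeError("no live nodes to schedule on")

    # Keep only the instances whose node is still alive.
    kept = {i: n for i, n in placements.items() if n in live_nodes}

    # Track load per node so replacements go to the least-loaded node.
    load = {n: 0 for n in live_nodes}
    for node in kept.values():
        load[node] += 1

    next_id = max(placements, default=-1) + 1
    while len(kept) < desired:
        # sorted() makes ties deterministic: the first name wins.
        target = min(sorted(load), key=load.get)
        kept[next_id] = target
        load[target] += 1
        next_id += 1
    return kept
```

For example, with five instances spread over nodes a, b and c, and c then dying, `reconcile(5, placements, {"a", "b"})` drops the instance on c and starts one replacement on a or b. If two controllers ran this at once with different views of who is alive, they would each start replacements, which is where the consistency requirement comes from.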
Fwiw those features existed in Mosix (a Linux SSI patch) 2 decades ago... I feel like we could probably do it again
In terms of CAP, yeah it might not have been technically as reliable. But there's different levels of reliability for different applications; we could implement a lot of it in userland and tailor as needed
I call BS. I can’t find any details about how mosix handled storage, but what I did find suggests nfs semantics. That’s totally unusable which is probably why the project died decades ago. (And apparently you had to recompile every app because they changed the syscall ABI to add a node ID to every inode or something? Guess they were speedrunning obsolescence)
> we could implement a lot of it in userland
Yeah that’s k8s, etcd, ceph, and the distributed database of the week.
You didn't need to recompile programs, that was the whole idea. Distribute any app's compute over many nodes. But shared memory and threading were very hard to distribute and I/O was not distributed except for mfs (a distributed layer on NFS) which did work fine. But obviously NFS is not suitable for all applications, in which case you could use any other form of distributed I/O.
It worked great for forking apps. Trouble was hell would freeze over before the patches got merged and most people thought it wouldn't be widely adopted without shared memory and threads.
But the point is, it did run arbitrary apps across distributed nodes, you could see any node's processes and instrument them, you could see the filesystem of any node. This isn't some advanced mystic sorcery, it was there two decades ago. Clearly we could implement these features again in some new way - not as an SSI, but at least allowing an assortment of system-level RPC and some sort of distributed pluggable VFS.
And also my point is: sure, we have all these 3rd party userland solutions, and that is bad. It means nothing is supported until it's been "integrated". It means we have miles and miles of plumbing that schmucks like me are paid to set up before a JavaScript developer can run their piddly web app across 3 nodes. It should just be baked into the OS, batteries included. A lot less annoying bullshit, a lot more standardization, and the ability to get more shit done with less effort. That is the entire point of operating systems, to make it easier to run programs. Not to make it necessary to add 15 million new abstractions before you can run your programs.
Distributed yes, but not necessarily consistent. You can use CRDTs to manage "partial, flexible" consistency requirements. This might mean, e.g. sometimes having more than 5 instances running, but should come with increased flexibility overall.
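A minimal sketch of what that looks like with the simplest CRDT, a grow-only set, in Python. The `GSet` class, the `top_up` helper and the instance naming are assumptions for illustration. Two replicas that cannot see each other each top their own view up to 5; after merging, the system holds 6 instances, which is the "sometimes more than 5" trade-off.

```python
class GSet:
    """Grow-only set CRDT. Merge is set union, which is commutative,
    associative and idempotent, so replicas converge in any merge order."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        return GSet(self.items | {item})

    def merge(self, other):
        return GSet(self.items | other.items)


def top_up(replica, node, target):
    """Start instances locally until this replica's view reaches `target`."""
    n = 0
    while len(replica.items) < target:
        replica = replica.add(f"{node}-{n}")
        n += 1
    return replica


def demo_partition():
    # Four instances are known to everyone before the partition.
    shared = GSet({"seed-0", "seed-1", "seed-2", "seed-3"})
    a = top_up(shared, "node-a", 5)  # node-a sees 4, starts one
    b = top_up(shared, "node-b", 5)  # node-b also sees 4, starts one
    return a, b, a.merge(b)
```

A grow-only set cannot express stopping an instance; scaling back down to 5 would need a CRDT with removal (and a rule for who stops what), which this sketch leaves out.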
The abstractions are there in Linux, largely imported from plan 9. And work is ongoing to support further abstractions, such as easy checkpoint/restore of whole containers. Kubernetes is a very new framework intended to support large-scale orchestration and deployment in a mostly automated way, driven by 'declarative' configuration; at some point, these features will be rewritten in a way that's easier to understand and perhaps extend further.
> to be able to do computation, discovery, message passing/IO, instrumentation over multiple nodes.
Kernel namespaces are the building blocks for this, because an app that accesses all kernel-managed resources via separate namespaces is insulated from the specifics of any single node, and can thus be transparently migrated elsewhere. It enables the kind of location independence that OP is arguing for here.
Linux namespaces don't actually do any of those things though? Like, not even a single one of them are made possible because of namespaces. They're all possible or not possible precisely as much with or without namespaces.
The thing is when comparing plan9 and linux here, you have to recognize that linux has it backwards. On plan9 namespaces are emergent from the distributed structure of the system. On linux they form useful tools to build a distributed system.
But what's possible on plan9 is possible because it really does do "everything is a file," so your namespace is made up of io devices (files) and you can construct or reconstruct that namespace as you need.
Like, this[1] is a description of how to configure plan9's cpu service so you run programs on another node.
Nothing in there makes any sense from a linux containers perspective. You can't namespace the cpu. You can't namespace the gui terminal. All you can namespace is relatively superficial things, and even then opening up that namespacing to unprivileged users has resulted in several linux CVEs over the last year because it's just not built with the right assumptions.
Doesn't Linux create device files in userspace these days, anyway? I thought that's what that udev stuff was all about. So I'm not sure that the Plan9 workflow is inherently unfeasible, there's just no idiomatic support for it just yet.
device nodes are managed in userspace nowadays yes, but they're just special files that identify a particular device id pair and then the OS acts on them in a special way. udev is just the userspace part of things that manages adding and removing them in response to hotplug events. Everything that matters about them is still controlled by the kernel.
That’s not at all what Linux namespaces permit. It’s a side effect of using them that could be leveraged using something like CRIU, sure, but it’s not what they’re for and they’re not a building block for anything mentioned in the portion of their comment you quoted.
Namespaces simply make the kernel lie when asked about sockets and users and such. It’s intended for isolation on a single server. They’re next to useless in distributed work, particularly the kind being discussed here (Plan 9ish). You actually want the opposite: to accomplish that, you want the kernel to lie even harder and make things up in the context of those interfaces, rather than hide things. Namespaces don’t really get you there in their current form.
Isolating processes from the specifics of the system they're running on is a key feature of the namespace-based model; it seems weird to call it a "side effect only". We should keep in mind that CRIU itself is still a fairly new feature that's only entered mainline recently, and the kernel already has plenty of ways to "make up" more virtual resources that are effectively controlled by userspace. While it may be true that these things are largely ad hoc for now, it's not clear that this will be an obstacle in the future.
I can talk about namespaces in HPC distributed systems, and they don't look anything like Plan 9 to me. They make life harder in various respects, and even dangerous with Linux features that don't take them into account (like at least one of the "zero-copy" add-on modules used by MPI shared memory implementations).
There were older attempts at this stuff, in the 90s with "Beowulf" clusters that had cross-machine process management and whatnot. It's a lot harder than it seems to make this approach make sense in the real world, as the abstraction hides important operational details. The explicit container + orchestration abstraction is probably closer to the ideal than trying to stretch linux/systemd/cgroups across the network "seamlessly". It's clearer what's going on and what the operational trade-offs are.
In case of any confusion, that sort of thing wasn't a generic Beowulf feature, but it sounds like Bproc. I don't know if it's still used. (The Sourceforge version is ancient.)
Actually, at some point in the 2.4 kernel it was possible to do that with single system image projects such as openMosix, which handled process discovery, computation and much more. But underneath the simple user interface it was complex and kinda insecure, and so it was abandoned and never ported to newer versions.
Distributed computation with message passing (and RDMA) is the essence of HPC systems. SGI systems supported multi-node Linux single system images up to ~1024 cores a fair few years ago, but they depend on a coherent interconnect (NUMAlink, originally from the MIPS-based systems under IRIX).
However, you don't ignore the distributed nature of even single HPC nodes unless you want to risk perhaps an order of magnitude performance loss. SMP these days doesn't stand for Symmetric Multi-Processing.
Distributed shared memory is feasible in theory even via being provided in-software by the OS. You're right that this would not change the physical reality of message passing, but it would allow a single multi-processor application code to operate seamlessly using either shared memory on a single node, or distributed memory on a large cluster.
I talk about the practice in HPC, not theory, and this stuff is literally standard (remote memory of various types and the same thing running the same, modulo performance and resources, on a 32-core node as on one core each of 32 nodes). However, you still need to consider network non-uniformity at levels from at least NUMA nodes up, at least if you want performance in general.
I think you're looking at the wrong abstraction level. You're thinking on a node (computer) basis. Even on a single computer, many of the things that happen are distributed. DMA controllers, input interrupts, kernel-forced context switches, there's a lot going on there but we still pretend that our computers are just executing sequential code. I agree with the OP and think it's high time we treat the computer as the distributed system it is. Fuchsia and GenodeOS are both making developments in this direction.
I built my startup in elixir and the erlang VM provides all of these. It's kind of amazing. Things we've been able to have out of the box:
* a metrics dashboard that gives you a ps -A for all our nodes
* intermachine pubsub without having to setup a third party message queue
* auto failover for all our microservices
* spinning up a microservice takes about as much effort as adding a controller in rails
* cronjobs where one node can trigger a job on another node in the network. Hell, we have crons scheduled where we don't even know which machine it'll run on. it just gets done
I was thinking elixir/erlang too. I've only been using it for a couple of months but I've quickly come to the conclusion that their claims of robustness are not in line with what I want/need. For example, the pubsub lacks persistence and if the node that carries that info dies, you lose that data. There is no built in consensus that tries to maintain state in the face of failure, so you bring in Oban. I've yet to experience the advantages of elixir. Sure, I can hot reload a module, but I spend probably an hour a day waiting for things to compile. I prefer k8s and Go but figure that may be because I'm still new to the ecosystem.
So far we've had little need for persistence in pubsub; for the few places we do, we have used oban the same way you have. It would be easy to pull in a library like Yggdrasil and abstract it away. For the most part, we just haven't needed it enough to justify setting up rabbitmq or kafka. k8s is indeed useful but the benefit of elixir here is that I can set up the supervision tree in pure elixir. By keeping things simple, we've been able to focus on pushing out features instead of worrying about infrastructure.
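For readers who haven't used OTP, the core idea behind a supervisor is small enough to sketch. This is a Python analogy, not Elixir and not OTP's actual API: the `Supervisor` class, its `max_restarts` budget and the `flaky` worker are invented for illustration, and real supervisors also handle restart strategies, time windows and trees of children.

```python
class Supervisor:
    """Restart a failing task until it succeeds or the budget is spent."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.failures = {}  # task name -> number of crashes seen

    def run(self, name, task):
        attempt = 0
        while True:
            try:
                return task(attempt)
            except Exception:
                attempt += 1
                self.failures[name] = attempt
                if attempt > self.max_restarts:
                    raise  # give up and let the caller (a parent) decide


def flaky(attempt):
    # Hypothetical worker that crashes on its first two runs.
    if attempt < 2:
        raise RuntimeError("crash")
    return "ok"
```

`Supervisor().run("worker", flaky)` returns `"ok"` after two restarts. The "let it crash, restart from a known state" policy lives in one place instead of being scattered through the worker as error handling.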
Abstracting a fleet of machines as a single supercomputer sounds nice. But how about partial failures? It's something that a real stateful distributed system would have to deal with all the time but a single host machine almost never deals with (do you worry about a single cacheline failure when writing a program?).
There is a huge amount of research about distributed OSes (really, they were very fashionable in the '90s and early '00s). Plenty of people worked on this problem, and it's basically solved (as in, we don't have any optimal solution, but it won't be a problem on a real system).
K8s is doing distributed OS's on easy mode, supporting basically ephemeral 'webscale' workloads for pure horizontal scaling. Even then it introduces legendary amounts of non-essential complexity in pursuit of this goal. It gets used because "Worse is better" is a thing, not because anyone thinks it's an unusually effective way to address these problems.
I see K8S as Application Servers for everyone, with containers replacing EARs, it certainly gets some WebLogic/WebSphere vibes when looking at those yml files and how we used to setup an Application Server cluster.
I very much agree with this and while Kubernetes is better than a poke in the eye, I look forward to the day when there is a true distributed OS available in the way you describe. It's possible Kubernetes could even grow into that somehow.
Kubernetes only exists because people wanted to do Application Servers in any language, and now they are rediscovering them trying to sell us on Kubernetes + WebAssembly, the irony.