I’m no kernel hacker, but it seems like io_uring is almost being undersold despi...

rektide · on April 12, 2023

Weirdly the low barrier batched sys calls & eBPF user-programmable-kernel sort of feel like a comeback-ish story for un-monolithic kernels/OSes.

After a decade+ of it feeling like the Linus/Tanenbaum debate was a de jure settled debate. The debate being anchored to either/or certainly didn't help, confined the dimensionality of what we might consider, eliminated vaster exploration of the possible. But the idea that it's all in the kernel seems improbable from the modern outlook (with some of the GPL IP copy-left awesomeness as notable sacrifice, which has yet to really manifest, but IMO hugely looms).

Again and again I keep thinking we mistake popularity & rapid adoption for success. There's so many brilliant wonderful possibilities. People & attitudes coalesce to the negative enormously fast, with much more emotion than belief spreads. Finding ways to give long time for tech to slowly grow & quietly build & thrive is a huge huge challenge. Uring is definitely a nice example of a small conceptual extension of what is that can snowball & grow & metastasize, until it far outshines everything we did before.

Now if only Deno, Node & others would start getting more of this shipped! That said, slow & right is what I'm defending here, soooo, OK.

softirq · on April 12, 2023

This really has nothing to do with the mono vs. micro kernel argument. Also micro kernels were very popular in the research community at the time that Linux was in its infancy, mach and minix are both examples of this. Linux did win on merit and the future work of micro kernels was forked and past micro kernels are only vestigially micro.

Edit: Not sure why this is being downvoted, micro kernels are the antithesis of what eBPF is trying to accomplish. I say this as a university researcher that has built and worked on many kernels with different tradeoffs. If you disagree please explain in a comment.

Micro kernels increase security and decrease complexity by running a very small resource manager in the processor's highest privilege level. eBPF runs code at the highest privilege level by actually extending the kernel, it actually supplants doing kernel work in userspace. It greatly increases kernel complexity with a new subsystem and decreases security in order to allow extensibility in the face of a hard to change interface (syscalls in a mature operating system are difficult to change).

hinkley · on April 13, 2023

One of the bigger shortcomings of microkernels is the IPC delay. io_uring amortizes those delays, making it less painful to implement a chatty protocol, like the conversation between ukernel processes.

EBNF is sending code instead of data, which is another common way to solve IPC problems, but with different trade offs. Being harder to prove is an important one.

KMag · on April 14, 2023

Presumably you mean eBPF, not EBNF.

kqr · on April 13, 2023

I appreciate what you're saying, but I feel like your argument (including downthread) is something like saying "fuel injectors have nothing to do with carburetors" or "vacuum tubes have nothing to do with transistors". It's true, but the user may end up solving similar problems with them.

rektide · on April 14, 2023

Heeyyyy much more cleanly said than my attempts. Shit yeah nicely made.

never_inline · on April 13, 2023

> eBPF runs code at the highest privilege level by actually extending the kernel

Yeah this was my thought when reading the above argument too.

Maybe they're thinking that in future, more and more kernel functionality will be moved to eBPF programs, and eBPF has a restricted execution environment? I can follow both lines of logic.

rektide · on April 13, 2023

Too easy a dismissal, doesn't reach to explore. This very much enables a much more interesting future of smaller interoperating pieces.

For example the DPDK work created a small alternative micro-monolith for high speed networking. If someone needs to make a new fast protocol like quic or something further afield conventionally they might have split between building a similar application-owning hardware view, building a kernel module, or facing extreme performance loss. Io-uring opens up much greater fields of extensibility by removing huge barriers, allowing work to spread out far wider than it could have before.

I agree on the surface this seems dissimilar. But architecturally what is enabled is a far greater range of considerations for how we might interlace responsibilities on systems. Io-uring on the surface is for apps to communicate with the kernel, but if systems like fuse or uhid or other user land devices also prosper, it will disrupt the monolith's lock.

tptacek · on April 13, 2023

Isn't DPDK sort of an exception that proves the rule? Almost nothing gets implemented in DPDK precisely because pulling the TCP/IP stack out of the kernel is a huge logistical pain in the butt for development, and far more energy is put into making the kernel's TCP/IP interface more useful for developers.

(I'm not pushing back on you, just exploring the space, so to speak.)

I'm not sure I see how eBPF is moving us away from monolithic OS's; the whole point of eBPF is running more stuff inside the monolithic kernel. :)

rektide · on April 13, 2023

Good point about eBPF. :) It is running in kernel space, but it's not code from the kernel, it's user code, which expands the model, that was kind of my thought.

DPDK did seem to be limited to pretty niche/exceptional applications. There's like a really fast Open vSwitch backend (and some other very network-centric projects). Practically a partial-unikernel system... trying to almost never ever use the kernel, just stay in user land with your own network stack forever. Just... In Linux.

softirq · on April 13, 2023

It's still running code inside of the kernel. The point of a microkernel is to run as little in ring 0 as possible and have a very thin resource manager that can talk to system servers. The point of eBPF is to move even more stuff dynamically into the kernel to avoid expanding the kernel's static interfaces (syscalls).

rektide · on April 13, 2023

Which brings us back to my previous point,

> The debate being anchored to either/or certainly didn't help, confined the dimensionality of what we might consider, eliminated vaster exploration of the possible.

eBPF doesn't seem to fit squarely in either mold. It has some characteristics of monolithic kernels, in that code runs in ring 0. But now it's also not the kernel's code that's running there. It's user code. Which doesn't fit either pattern.

I do think io_uring could potentially spawn more microkernel-y outcomes. I don't know how exactly eBPF comes into the picture but it's another tool. Just to whip up a nonsense example, maybe we use eBPF to build a secure network switch in ring0 that io_uring based processes communicate to each other with. Io_uring is still kernel based, isn't processes directly communicating with each other (not sure how many microkernels could do that), but again my point is less super specifically about microservices & more that this is a rebalancing possibility, where the kernel code might become less the focus of the system, where other non-kernel-code platform substrates have a higher chance of emerging/performing well. There's more possible than monolithic chunks of code allowed. What new dimensions we might find out there excites me.

softirq · on April 13, 2023

I think you really need to understand that the term micro kernel refers to a very specific thing. Please go read the wikipedia definition because it does a good job of summarizing it. A microkernel handles only the basic kernel primitives in the highest privilege level and critically provides an rpc like communication mechanism for authenticating and delegating everything else.

eBPF does not do that, it does not enable doing that. It does the inverse. It adds code to the kernel in the exact same way a kernel module adds code to the kernel, but in very specific places to allow extending the kernel at points of policy. It does not run kernel code in userspace. It does not run userspace code in the kernel. It loads code from userspace into the kernel and runs it as kernel code.

rektide · on April 13, 2023

No, sorry, I am pretty well versed I think.

On the topic of eBPF, you keep talking about eBPF adding code to the kernel. Yes: that is NOT the microkernel model. But you keep not acknowledging that it's not code shipped by the kernel. If we make the kernel programmable, yes, it means running some code in the kernel. But tons and tons of people are using this to make really really exceptional userland tools that take over kernel-like responsibilities, often running faster than or with more/newer/interesting features the kernel never could do.

Adding kernel user-programmability lets us move stuff out of the kernel. And that keeps being what we see happening.

It does allow other models too! It allows a whole raft of different things to happen. Some of the eBPF uses are to push more responsibility into the kernel, to make more stuff happen in ring0 or to take what would have been userland & make it ring0. It's a very general capability that breaks us out of the low-dimensionality world we've lived in. It makes much more exploration possible. Some eBPF possibilities will in some ways resemble microkernel-y things. Some will look exactly the opposite.

Right now there's no clear plan for moving most driver code into userland (the microkernel model), but I also would not rule it out & I think there's a number of clearly visible shifts (and my tea leaves tell me increasingly ongoingly) already well under way in that direction.

An example of this is some of the Fuse + eBPF work, extfuse. File Systems in Userland is a recognizably microkernely idea, I hope we can kind of agree: let's have a driver for file-systems that runs in userland. But typically FUSE has a pretty big performance hit that can make it unviable, just too much cost of syscall'ing to the kernel. Well, what did some intrepid blokes do? They wrote an eBPF powered Fuse system that hyper-optimized the IPC layer, ignoring a lot of the normal communication bottlenecks & inventing their own new eBPF powered control/communication system. Which let them both expand speed, but also grow a raft of new features for their userland driver, some of which are utterly unique and new. It allows custom permissions checks, differing caching strategies, and creating io redirection to underlying FSes. All from the userland. So, userland greatly greatly grew, & assumed many roles the kernel either did (& some it couldn't do), by having a more malleable eBPF interface. (But yes, that still required talking to the kernel/having some new code in the kernel.) https://extfuse.github.io/

It's a bit less micro-kernel-y an example, but certainly it's an amazing story watching networking grow via eBPF. Yes the kernel is doing the work, but it's such a different situation that it's not kernel code: it's user-code, running in ring0. We can focus on that code. But notably, that code while heavily trafficked, is often much smaller than the kernel code it replaces. The simple data-plane eBPF code replaces dozens of fairly complex kernel modules making a lot of the decisions, figuring out how to prioritize or open tunnels or do any of a dozen other deeds. And that eBPF code is all built by a new Master Control Process, a ring -1, the userland process that wrote & injected it all. This userland code doesn't have privileged access to other processes (so still a micro-kernel-y win!), but it sure as heck has a very powerful position over the kernel, is calling the shots/making the decisions, & programming the (again much smaller) kernel according to its userland desires. Is there still a kernel? Yes. But it's clear the power dynamic is greatly different in this case. There's so many more isolation barriers & complex who-controls-who power relationships here that all seem so much more interesting than the monolithic kernel world, so many specific assignments of responsibilities, done with so much more isolation than before. Seems like a win. I'd still be more hesitant to push this as an overtly micro-service-y case, whereas extfuse I think is more obviously in-the-mold, but reducing kernel responsibilities and giving them to userland at such huge scale with such critical success should make most Microkernelites cheer & feel vindicated. I hope, even if it's not exactly 100% the purist outcome they'd hoped for.

We can keep trying to say, but oh! Gotcha! There's stuff happening in the kernel. I think the topic deserves a much broader view than that. We probably are not going to get rid of the kernel (maybe Fuchsia's Zircon microkernel, maybe Genode, maybe whomever will some day make a real break-out win though! nothing's decided!). Giving processes much faster ways to batch talk to others/the kernel (io_uring) and giving processes the ability to program/modify the kernel & inject some tendrils there are, I would agree, not inherently micro-kernel-y. But they open up so many possible spaces that before had felt settled. And in quite a number of examples we're seeing core responsibilities move from kernel to userland. The idea that a more IPC centric network-of-processes userland-driver world emerges seems much more possible to me, thanks to io_uring and eBPF.

softirq · on April 13, 2023

Micro kernel-ish behavior would be features that allow for drivers/modules in userspace, which Linux does provide (and I do agree that Fuse is an example of this). eBPF is not sort of, or kind of, or remotely related to micro kernels and I'm not being pedantic. It really, really isn't related to microkernels because it is doing the exact opposite of what a microkernel does. Pluggable policy in kernelspace is not the same as running something in userspace that's implementing a system function like a filesystem, a networking stack, etc. which would be analogous to FUSE, DPDK, etc.

eBPF is no different from runtime loadable kernel modules, which has never made Linux a microkernel. Modules allow for runtime extension, but they are still running in the kernel. The point of eBPF is that kernel modules are dangerous and error prone to write, and eBPF allows non-kernel programmers to write reasonably safe modules that can also be ported to different kernels via CO-RE. They are pluggable policy points or probes.

The Fuse developers just figured out the same thing microkernel developers figured out in the late 80s/early 90s. Performance is much better if you move more processing to the kernel.

rektide · on April 13, 2023

I don't feel like you read my post at all or are replying to anything I've said. You're repeating the same assertions without moving the argument along or talking details. Rather than recognize some similarity, rather than taking any kind of parallax view, rather than making friends, I feel like you've adopted a combative and quite limited stance, and I think it's a shame you're not willing to chalk up some wins where a monolithic kernel has loosened up & allowed some new microkernel or microkernel-ish either offloading or sharing of responsibilities.

I provided two pretty amazing examples already of how really amazing new userland capabilities were made by having a configurable kernel. You keep ignoring that eBPF may be code running in the kernel, but it also enables not running a dump-truck load of kernel modules that a user would need, while being more flexible, while letting a non-privileged isolated process either have control (ring -1) as the networking Calico/Cilium example shows off, or even more radically by moving the entire task into userland (ExtFUSE) by using eBPF as an io-port with the kernel to let the external process do more of the tasks than it could have before.

You seem to be very rooted in a very precise & narrow minded view of what's afoot here. And fixated on any kernel code being disqualifying, even though it keeps being examples of less & less kernel code running. I think I've really given you a ton of great evidence already, & tried to help you break free of such a narrow conception. My examples really show how eBPF has helped move a ton of work from kernel into userland. I don't think it should be so hard to communicate this, but I'll say it yet again:

programmable kernels sometimes will be programmed to do less in the kernel. That's what we've seen.

(But they can also be used to add more to the kernel too. To repeat again: eBPF is not inherently MK nor inherently monolithic; it depends on what it's used for!)

You talk about Fuse's downside, again boosterizing microkernels, but again, it's like you haven't read my post: where I mention how eBPF was used to greatly speed up FUSE & to add brand new novel capabilities neither kernel nor FUSE could do before, to enhance & extend FUSE. That's what I keep trying to emphasize. If you have a fixed all-encompassing monolithic kernel, the boundaries are fixed. By introducing eBPF programmability, there's far more flexibility to calve off tasks, to pull them out of the kernel. eBPF is sometimes just a communication tool, not a processing tool, to get the data for the task out of the kernel, into userland, and that 110% qualifies as a microkernel-y thing.

Again, to your points, I said I'd be less interested in trying to claim the Cilium/Calico model is microkernel-ish, because as you say, the data-plane remains in the kernel (although I think it's not hard to recognize that even though it's a different architectural pattern than MK it's still a revolution in moving huge responsibilities out of the kernel & into userland, & getting the lion's share of MK benefits). But the ExtFUSE model shows that that's not the only thing one can do with eBPF (allow a "policy agent", which still seems like a colossal MK-ish win to me!), it shows how it can also help to move work itself outside the kernel in new novel ways.

Rather than focusing on eBPF being something "in the kernel", I think everyone needs to step back & re-ask themselves what eBPF is. It's a malleability system. It's a way to make the kernel programmable. As I've said again and again, that isn't inherently microkernely. But in a huge number of examples, users do exactly that: take things that the kernel would be doing and they make a small eBPF shim to replace complex kernel code with a small port, that sends the task out to userland, where userland does the work. This is quite obviously microkernely.

kortilla · on April 13, 2023

You’re arguing about score and the person you are replying to is not talking about points for or against different approaches.

The person you are replying to is being very specific about ebpf not being a microkernel approach. By definition it is the opposite.

You’re looking to debate someone on the pros/cons of dogs and cats but you’re doing it by insisting that a golden retriever is actually a cat. Then when people call you out saying it’s very much not a cat, you lash out about the great benefits of a golden retriever as if that refutes something.

sfink · on April 13, 2023

I'm kinda with rektide on this. It is true that eBPF is not itself a microkernel approach. But nobody is asserting that it is. The claim is that eBPF is a building block that allows emulating some microkernel characteristics on a monolithic kernel.

The final result is not what anyone would describe as a microkernel. But it does allow implementing things on top of a different API boundary. And microkernel vs monolithic kernel is all about where the API boundary lies. (And I don't mean the ring 0 boundary! If ring 0 is the only relevant definitional characteristic of "microkernel-ish" for you, then I agree to disagree.)

Said another way: if you consider driver-on-API-implemented-with-eBPF-on-monolithic-kernel, then looking from the bottom it will look nothing like a microkernel. Looking from the top, it will.

(rektide is not insisting that a golden retriever is actually a cat, they're saying that a golden retriever can keep the mice under control similar to how a cat might.)

But at this point, I doubt it really matters who agrees with what perspective!

softirq · on April 14, 2023

As I said I have done research work and written a PhD thesis on operating systems, and I've worked on several production kernels including Linux and I find it hard to fathom this hill you all are trying to die on. You can't just redefine important concepts however you like as the arguer is trying to do. Names mean things. If you want to pretend that x is y and blue is red, then we simply can't discuss the topic because you're not able to agree on the basic particles the community uses to have discussions. It's just contrarianism.

Microkernels are defined by privilege levels. It's not debatable. You wouldn't say that something is true microkernel because it's modular. We could have more interesting discussions about the actual topic if you and the original poster could accept common definitions like any sane technologist.

sfink · on April 14, 2023

I agree that eBPF running on a monolithic kernel is not a microkernel. I agree that microkernels are defined by privilege levels. I would not say that something is true microkernel simply because it is modular. I would agree that I am not a fully sane technologist.

I disagree that it is impossible to emulate any characteristics of running on a microkernel when one is, in fact, running on a monolithic kernel.

rektide · on April 14, 2023

Using one single narrow criteria to completely disregard any & all other similarity is a shame & a sham. This is a huge disservice & actively harmful to understanding the world.

I've said again and again microkernel-y microkernel-ish. But every single time I try to build a bridge & explain how it's similar but different, you burn it down. I think you have been poisoned by being too close to the subject matter & lack objective sensibility about the subject, are unable to adequately step back to actually understand. And you don't seem to have any interest in learning or trying to see, which is a crying shame.

I feel like you as a deep practitioner/academic of this area should be best able to help explain relationships & similarities, to see connections. But you focus only on making distance & setting things apart. That's just not good enough. It's not sufficient a viewpoint.

kortilla · on April 14, 2023

> Using one single narrow criteria to completely disregard any & all other similarity is a shame & a sham. This is a huge disservice & actively harmful to understanding the world.

Just pick a different fucking word. Why do you want to use the term microkernel?

It’s like insisting that a rock is actually an airplane because they both fly through the air. It’s a pointless redefinition and everyone qualified to talk about things that fly won’t call it an airplane and will be confused about what you’re talking about.

Dylan16807 · on April 14, 2023

It's not a redefinition when you add "-like" at the end. If I say a particular airplane flies like a rock, nobody needs to clarify that airplanes are not rocks.

But please refer back to the original comment. rektide said it was "un-monolithic", which it is. A virtual machine is less monolithic than baked-in code.

Softirq is the one who brought up microkernels specifically. Softirq is the one that repeatedly steered the conversation back to monokernel versus microkernel.

If you put words in someone's mouth repeatedly until they start using them, you lose the right to complain that they're using the wrong words!

"Why do you want to use the term microkernel?" is the exact opposite of what happened.

ksec · on April 15, 2023

Not sure that is the right analogy. rektide is arguing on a conceptual level, while softirq is arguing on an academic level with very strict definition.

If I had to use your analogy, it's that the new breed of Golden Retriever somehow has many of the same looks as a cat and acts like a cat, but is biologically still defined as a dog.

rektide · on April 13, 2023

I've said repeatedly I think eBPF is neutral, not inherently one way or another.

But it certainly creates a lot of new possibilities for microkernel-like things (and other things) to be built.

pcl · on April 13, 2023

It’s worth looking at the work that Martin Thompson has done with similar techniques in the JVM space with Aeron and Disruptor:

https://aeron.io/

https://lmax-exchange.github.io/disruptor/

flohofwoe · on April 13, 2023

It seems to be the same evolution that we've seen in 3D rendering APIs (or more generally: GPU APIs) over the last two decades, and roughly for the same reasons (but it goes beyond just reducing syscall overhead, it's also establishing a pipeline for sending batches of work prepared by the CPU for 'decoupled' execution on the GPU, e.g. trading latency for better throughput).

fwsgonzo · on April 13, 2023

Yep, command queues are basically how drivers have been for decades now anyway. It just happens to be more widely exposed in user-facing APIs now. If you can call them user-facing APIs. They are hard to use and get right, and nobody writes directly against them. Instead it's best done through a higher level API (eg. an asynchronous system call API) or even higher level through not even knowing about it (eg. modern game engines).

Modern caches are heavily pipelined in order to saturate links and well, as you probably know, games are 1-2 frames behind but do tens of thousands of GPU operations per frame. I guess we can even add modern CPUs to the list? Eg. certain operations cause a pipeline flush just like some rendering operations can cause the same thing, under certain circumstances.

I just found this article about the graphics pipeline from 2011: https://fgiesen.wordpress.com/2011/07/02/a-trip-through-the-...

ddevault · on April 13, 2023

I am a kernel hacker, and I have worked with io_uring, and I can safely judge that it is very good -- but the main issue is that it represents a totally different approach to I/O (and syscalls generally, which are just I/O in other words), which is going to take the ecosystem a while to reform around. Note that the sample code in the OP's article is much more complex than the traditional approach. It's also very Linux-specific, so any software which takes advantage of it will be less portable or will have to write multiple I/O backends. It's also nontrivial to understand and use effectively, so adding good io_uring support to a project is an effort.

another2another · on April 13, 2023

My hope is that Linux ports of kqueue or libdispatch can use io_uring where possible and we would automatically get the benefit, but I'm not familiar enough with the guts of these projects to know how timely or even feasible that is.

skavi · on April 13, 2023

interesting whitequark thread on that idea: https://twitter.com/whitequark/status/1521146654535172100

epage · on April 13, 2023

Whats old becomes new again. I worked on some Windows drivers that batched calls into the kernel because of the overhead "back in the day". At the time, we were re-working the drivers to not do batching anymore because the overhead of syscalls had dropped (as part of a larger clean up of our driver design, not a reason in and of itself).

kps · on April 13, 2023

> Whats old becomes new again.

The polled mode of io_uring resembles I/O on CDC mainframes, except that the ‘kernel’ ran on dedicated coprocessors (which incidentally ran multiple tasks in a way similar to what is now called hyperthreading).

fulafel · on April 13, 2023

Before that in the 60s io command batching was (like most things) done by IBM: https://en.wikipedia.org/wiki/Channel_I/O#Channel_program

These days Linux supports VFIO pass-through for native IBM/360 channel programs if you happen to be running Linux on an IBM mainframe: https://docs.kernel.org/6.1/s390/vfio-ccw.html

espoal · on April 13, 2023

io_uring is not about batching. io_uring is about a new concurrency model based on ring buffers

api · on April 13, 2023

If it batches syscalls could libc use it in place of the normal syscall method? You’d have to kind of deliberately nerf it by waiting on calls in spite of its async nature but this might still bring a boost due to reduced kernel user mode switching and other overhead.

bfrog · on April 13, 2023

It absolutely is a batch interface for syscalls, with the ability to sort of create little I/O specific programs.

philosopher1234 · on April 12, 2023

You can't leave me hanging!! What's the genius idea?

benlwalker · on April 13, 2023

Imagine you have a piece of software that runs in an event loop (as many things do). On each loop, queue up all system calls you'd like to perform. At the end of the loop, do one syscall to execute the batch. At the start of the loop, check if anything has completed and continue the operation.

If you're processing a set of sockets and on any given loop N are ready, then with epoll you do N+1 syscalls. With io_uring you do 1. It's independent of N.
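
To make the N+1 versus 1 count concrete, here is a toy Python sketch. It makes no real epoll or io_uring calls: `ToyKernel`, `readiness_loop` and `batched_loop` are made-up names, and the "kernel" only counts how often user code crosses into it, assuming one crossing per call.

```python
from dataclasses import dataclass


@dataclass
class ToyKernel:
    """Hypothetical stand-in for the kernel; it only counts crossings."""
    crossings: int = 0

    def epoll_wait(self, ready):
        # One crossing to learn which sockets are ready.
        self.crossings += 1
        return list(ready)

    def read(self, sock):
        # One crossing per individual read.
        self.crossings += 1
        return f"data from {sock}"

    def submit_and_wait(self, queued_reads):
        # One crossing executes every queued read, whatever the batch size.
        self.crossings += 1
        return [f"data from {sock}" for sock in queued_reads]


def readiness_loop(kernel, ready):
    """epoll-style iteration: 1 wait + N reads = N + 1 crossings."""
    return [kernel.read(sock) for sock in kernel.epoll_wait(ready)]


def batched_loop(kernel, socks):
    """Batched iteration: queue in user space, then cross once."""
    queue = list(socks)  # filling the queue costs no crossing in this model
    return kernel.submit_and_wait(queue)


if __name__ == "__main__":
    socks = ["sock-a", "sock-b", "sock-c"]
    k1, k2 = ToyKernel(), ToyKernel()
    readiness_loop(k1, socks)
    batched_loop(k2, socks)
    print(k1.crossings, k2.crossings)
```

The model ignores everything that makes the real thing hard (completion ordering, partial reads, errors); it only shows why the second count does not depend on N.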

klabb3 · on April 13, 2023

And the potential impact is huge! Individual syscalls are expensive on their own, and increasingly so (afaik) with Spectre and security issues. On top of that, you need a thread to execute the call, which costs several KB of memory and compounds context switches, and (often overlooked) creating and destroying threads also comes with syscall overhead.

Now, we’ve had epoll etc so it’s not novel in that respect. However, what’s truly novel is that it’s universal across syscalls, which makes it almost mechanical to port to a new platform. A lot of intricate questions of high-level API design simply go away and become more simple data layout questions. (I’m sure there are little devils hiding in the details, but still)

shorodei · on April 12, 2023

You could use the shared ring buffers scheme to replace essentially all syscalls and similar things in any context. Think gpu drivers, memory controllers, etc. It could be a universal interface for all communication that has to go through some kind of expensive security barrier.
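
A rough sketch of that scheme in Python, under loud assumptions: `SpscRing` and `service` are hypothetical names, this is ordinary single-process code with no shared memory or memory barriers, and it is not the io_uring layout. It only illustrates the shape: one ring carries requests across the barrier, a second ring carries results back, and each index is advanced by exactly one side.

```python
class SpscRing:
    """Toy single-producer/single-consumer ring (hypothetical, not io_uring)."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.slots = [None] * capacity
        self.mask = capacity - 1
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        """Producer side: returns False when the ring is full."""
        if self.tail - self.head == len(self.slots):
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        """Consumer side: returns None when empty (so None is not a valid item)."""
        if self.head == self.tail:
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def service(sq, cq, handler):
    """The 'other side' of the barrier: drain requests, post completions."""
    done = 0
    while cq.tail - cq.head < len(cq.slots):  # stop if completions are full
        request = sq.pop()
        if request is None:
            break
        cq.push(handler(request))
        done += 1
    return done


if __name__ == "__main__":
    sq, cq = SpscRing(4), SpscRing(4)
    for n in (1, 2, 3):
        sq.push(n)                      # user side queues work
    service(sq, cq, lambda n: n * 2)    # privileged side processes the batch
    print([cq.pop() for _ in range(3)])
```

Because each index has a single writer, the two sides never contend on the same variable, which is the property that lets a real shared-memory version avoid locks.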

vluft · on April 13, 2023

With cross-core interrupts and user-mode interrupt handlers (as in some new Intel CPUs), you could even do something without polling (interrupt for submission) where the core user-mode code is running on _never_ context switches (obviously except for scheduling) and you just have a dedicated kernel core or cores off doing kernel things.

benlwalker · on April 13, 2023

You can also use umwait on the next completion entry in the ring

vluft · on April 13, 2023

yup, though that means you're wasting that core's compute; something with green threads where language runtime does a cross-core interrupt to submit syscall then continues execing other green threads until it gets a user interrupt for syscall completion would be pretty neat.

(ETA: and indeed looks like they're considering support for that in io_uring! https://lwn.net/Articles/869140/)

mattclarkdotnet · on April 12, 2023

“General purpose syscall batching”, which drastically cuts down on context switches