Faster zlib/DEFLATE decompression on the Apple M1 (and x86)

miohtama · on Aug 20, 2022

Awesome work from Dougall.

Also super interesting to find out that Deflate was still so suboptimal after all these years.

This work alone might create megawatthours of energy savings across all computers worldwide, because zlib is very, very, widespread.

twotwotwo · on Aug 21, 2022

> Also super interesting to find out that Deflate was still so suboptimal after all these years.

Optimization can be a moving target!

Many of these changes only help on newer processors, rather than being speedups that would have worked since zlib was written. Some are about using instruction-level parallelism that just wasn't available years back, and a few changes relate to other specific changes in CPUs, like using bigger tables to better exploit bigger caches or using specific new instructions.

You could say something related about gains from more modern compression formats: their design decisions make sense in today's hardware context, where DEFLATE's made sense in the 90s. For example, a 4MB history window is reasonable today but wouldn't have been when RAM was much more expensive and therefore limited. Folks were making smart decisions both then and now!

userbinator · on Aug 21, 2022

Also super interesting to find out that Deflate was still so suboptimal after all these years.

It's not surprising to me. If I rewrote it in handwritten Asm I could probably get a similar or even better speedup. People think compilers are magic and can't be beaten, but that's simply not true if you have any experience at all with Asm. Or perhaps they just want to believe in that notion because the alternative takes more effort, and the speed has been more than adequate.

To expand on the above point: the amount of wasted computing power is probably at least one or two orders of magnitude, and these improvements to zlib are a tiny fraction of that. All of the JS-where-JS-isn't-needed web stuff, "desktop" applications that are actually web apps in a browser, etc. contributes far more waste.

brigade · on Aug 21, 2022

I recently did some similar optimization of something that had already been written in hand-tuned asm. I wasn't able to get any real improvement on the CPUs the asm was originally written for (5+ year old Cortex CPUs), but I did get a 40% improvement on M1 with basically the strategy he used to optimize bitparsing. And that was before writing it in asm; writing the newly optimized algorithm in full asm didn't really help.

M1 is so wide that changing algorithms for lower critical-path latency can be more essential than writing good asm, since you are in no danger of saturating 8 instructions per cycle. And changing algorithms isn't something a compiler can do (or really even should do given that 40% faster on M1 translated to 5-10% slower on the old CPUs...)

bitwize · on Aug 21, 2022

> To expand on the above point: the amount of wasted computing power is probably at least one or two orders of magnitude, and these improvements to zlib are a tiny fraction of that. All of the JS-where-JS-isn't-needed web stuff, "desktop" applications that are actually web apps in a browser, etc. contributes far more waste.

That "waste" may be a small price to pay that is more than made up for by a combination of the value delivered to the customer and the shortened time to market. For example, there is no faster or easier way to deploy a desktop app that works on Windows, Mac, and Linux than Electron -- and you can do so with a minimum of cross-platform testing. Therefore, it is incumbent upon you to justify, specifically, why not to use Electron when shipping a desktop app. "It's too slow" or "It wastes memory" are not valid answers; you must provide a specific business goal that Electron precludes you from achieving -- for example, you are writing a high-performance game or maintaining a pre-existing application -- because you are asking the business to spend more time and money, up front, to develop the non-Electron solution and not get anything in return (again unless you can justify specifically why Electron is not up to your particular task).

Always put the business first. The fact that JavaScript frameworks for web or desktop eat a lot of memory is a technical issue, but unless you can show that the costs of these frameworks are much greater than the value they deliver to your business or your customers, it makes sense to use them.

JdeBP · on Aug 21, 2022

I am reminded of this quotation from The Kickstart Guide to the Amiga:

> The only other alternative would have been to write the whole thing in assembler -- this would have resulted in a typical assembler system, which is fast, efficient, streamlined, sexy, and not quite finished yet, sorry.

Page 74.

asveikau · on Aug 21, 2022

In fairness, the code that worked reasonably well decoding in 1995 is going to run super fast today, relatively speaking. Tweaking it to take advantage of larger CPU caches is just going to make it even more dramatically so.

thrwawy74 · on Aug 20, 2022

Totally agree. We don't say it enough, but this has real-world impact. :-)

dundarious · on Aug 21, 2022

> Reading Bits: Most of the speedup just comes from reading and applying Fabian Giesen’s posts on Huffman decoding

Strongly recommend reading Fabian's blog, he has a wealth of great performance knowledge on display there.

keepquestioning · on Aug 21, 2022

Well, is it going to be merged, or just stay in an obscure private fork?

dougall · on Aug 21, 2022

The non-forked zlib hasn't been accepting optimisations: https://github.com/madler/zlib/issues/346

Hopefully changes will be merged from my obscure private fork into four other obscure private forks (zlib-chromium (https://crbug.com/1354990), zlib-ng, zlib-cloudflare, and the one I personally care about, which doesn't take pull requests, zlib-apple), and possibly incorporated into libdeflate. That's the point of the blog post - to help maintainers understand the changes.

brigade · on Aug 21, 2022

Out of curiosity, have you measured performance on M1's E cores as well?

dougall · on Aug 21, 2022

I hadn't, but results are here: https://twitter.com/dougallj/status/1561255753339781120

keepquestioning · on Aug 21, 2022

This seems poor on the part of the maintainer

whatzlib · on Aug 21, 2022

I always thought the system zlib in Apple MacOS was the same as zlib-madler that's on github. Is that not the case? Which zlib does Apple ship then and does anyone know where that source resides?

dougall · on Aug 21, 2022

Yeah, I haven't looked at the exact changes, but I believe this is the source:

https://github.com/apple-oss-distributions/zlib