I think that especially now, with SOTA AI able to optimize kernels, more people should try their hand at building better inference for their specific hardware.
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% faster than the best llama.cpp numbers for Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
In the same boat with a 7900 XTX. 24GB VRAM, decent performance on paper; in reality most things don't run. Only llama.cpp is consistent in that it can run most models, even if maybe not at top performance (AFAIK lacking MTP, and with cache-invalidation problems on hybrid models). At least with llama.cpp I know what runs. With various Python-based inferencers, between their uv/venv, my venv, system envs/pythons/libs, yada yada - I need an agent to get to the bottom of what's actually running. :-) Yeah, I know, skill issue/user error - but I don't have seconds left in the day to spend on that.
Doing the same for Apple M-series with fused wgsl shaders specifically targeting Qwen3/3.5.
My effort is called shady-thinker and is on github at github.com/tmzt/shady-thinker.
This was inspired in part by Antirez's earlier work with C kernels as well as other efforts to support in-browser LLMs. I've adapted them to Rust and the wgpu library.
Gemma 4 is also the next likely target (with the MTP work) as I'm experimenting with local AI agents.
I'd love to see what you've done to improve prefill and decode even if it's not directly applicable.
One difference: I'm using MLX and GPTQ 4-bit quants (including AutoRound) with safetensors, since my shader pipeline is pretty much fixed for each model and ggml would just add unnecessary complexity.
I think llama.cpp could have done a much better job supporting PCs. Sure, some of it is due to bad vendor support, but with so many users I am surprised we don't see more optimized inference on standard PCs.
When it's in a good state I'll open source it. I am keeping track of which optimizations make the most impact - stuff like this:
### Diagnosing parallelism pathologies (L1)
*Grid occupancy:*
- Is `(Grid_Size / Workgroup_Size) / CU_count >= 1`? (W7900 has 96 CUs, Strix Halo 40.)
- Ratio < 0.3 = massively undersubscribed. Fix the grid FIRST; micro-optimization will NOT help. (E.g., a single block on a 96-CU W7900 is ~0.01.)
- 0.3-1.0 = partially utilized; depends on VGPR/LDS pressure.
- 1.0-4.0 = healthy; micro-optimization can help.
*Within-block distribution:*
- Does the kernel do useful work across all threads, or is there an
`if (threadIdx.x == 0)` gate around a serial top-k, reduction, or
scan? For c=1 decode, many kernels can't grow the grid, but they can
always parallelize inside the block.
- `Scratch_Size > 0` from dynamically-indexed per-thread arrays is a
strong secondary signal of the within-block pathology.
*Router top-k (within-block fix)*:
- Kernel: `qwen35_router_select_kernel` @ c=1 decode
- Before: grid=1 (can't help; num_tokens=1), blockDim=512, `if (threadIdx.x == 0)`
gated 2048 serial compares. Scratch=144 B from spilled per-thread arrays.
- Fix: warp-shuffle parallel argmax across the whole block + `__shared__`
top_vals buffer eliminating the spill.
- Result: 5.7× kernel speedup, +6.6% on 4K/D4K E2E.
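To make the within-block fix concrete, here's a minimal standalone sketch of the pattern - hypothetical names, not my actual qwen35_router_select_kernel, and argmax-only rather than the full top-k. Written as CUDA; on HIP/RDNA3 the `__shfl_down_sync` calls become plain `__shfl_down`.

```
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level argmax via register shuffles: log2(32) = 5 steps, no memory.
__device__ inline void warp_argmax(float &val, int &idx) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        float v = __shfl_down_sync(0xffffffffu, val, offset);
        int   i = __shfl_down_sync(0xffffffffu, idx, offset);
        if (v > val) { val = v; idx = i; }
    }
}

// Block-wide argmax over num_experts router logits at grid=1 (c=1 decode):
// every thread does useful work, instead of one thread doing 2048 serial
// compares behind `if (threadIdx.x == 0)`.
__global__ void router_argmax(const float *__restrict__ logits,
                              int num_experts, int *__restrict__ out) {
    float best = -INFINITY;
    int   arg  = -1;
    // Strided scan: no dynamically-indexed per-thread arrays, so no scratch.
    for (int i = threadIdx.x; i < num_experts; i += blockDim.x)
        if (logits[i] > best) { best = logits[i]; arg = i; }
    warp_argmax(best, arg);

    // Stage one partial result per warp in shared memory (the `top_vals` role).
    __shared__ float wval[32];
    __shared__ int   warg[32];
    const int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) { wval[warp] = best; warg[warp] = arg; }
    __syncthreads();

    // The first warp reduces the per-warp partials to the final answer.
    if (warp == 0) {
        const int nwarps = (blockDim.x + 31) / 32;
        best = lane < nwarps ? wval[lane] : -INFINITY;
        arg  = lane < nwarps ? warg[lane] : -1;
        warp_argmax(best, arg);
        if (lane == 0) *out = arg;
    }
}

int main() {
    const int N = 2048;            // router logit count for one token
    static float h[N];             // zero-initialized...
    h[1234] = 1.0f;                // ...with one planted winner
    float *d_logits; int *d_out, result;
    cudaMalloc(&d_logits, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_logits, h, N * sizeof(float), cudaMemcpyHostToDevice);
    router_argmax<<<1, 512>>>(d_logits, N, d_out);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("argmax = %d (expect 1234)\n", result);
    cudaFree(d_logits); cudaFree(d_out);
    return 0;
}
```

Extending this to top-k can be as simple as k passes that mask out the previous winner - still a handful of short parallel passes instead of thousands of serial compares behind a single-thread gate.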
Oh, is this actually out now? If so, great, but I took a quick look and didn't spot any third-party reviews yet. For those interested in this laptop, I'd personally still wait for some reviews from real-world users.
So, 3.5 years later, the chassis is still neat, and good on them for plugging away I guess, but for anyone that actually needs a new computer, there's no shortage of higher-end Linux-centric laptops with a better shipping track record (Framework, Tuxedo Computers, Slimbook, etc).
Hi, Katie from Star Labs here. Yes, that's a totally fair comment; the StarFighter took way longer to produce than we expected. It was a combination of component supply problems, being a small manufacturer (and so understandably a lower priority at the factory), and then firmware development.
It's also completely reasonable to want reviews. We have sent a StarFighter out for review, and I know the reviewer is currently testing it before publishing on YouTube, so we are hoping it will be out soon.
There are also some completely independent reviews on Reddit if they are of interest to anyone:
For those that don't want their data trained on, OpenRouter allows you to have account-wide or per-request routing with either provider.data_collection: "deny" or zdr: true (zero data retention).
Also, you can use HuggingFace Inference for DeepSeek V4 or Kimi K2.6, both of which work quite well and route through providers that you can enable/disable (like Together AI, DeepInfra, etc) - you'll have to check their policies but I think most of those commercial inference providers claim to not train on your data either.
I wonder why the question about data security and training often comes up with DeepSeek, Kimi, and GLM models but never with Anthropic, OpenAI, and Google ones.
Why is that?
IIRC, US data protection law protects the data of US citizens only; foreigners' data is not protected, and companies are not even allowed to disclose when they collect that data.
> US data protection law protects the data of US citizens only; foreigners' data is not protected
HN is an American site. If you look at the US government, it is going to fearmonger about anything China-related, because they haven't had a genuine competitor for decades and they're scared and lashing out. Most US news outlets just parrot the government line, sometimes more so than state TV would, and that gets reflected here.
I also feel comfortable saying that many Americans don't care one bit what happens to foreigners, be it by action of their government or companies.
> I also feel comfortable saying that many Americans don't care one bit what happens to foreigners, be it by action of their government or companies.
This is true. There are also many of us who do care.
This brings to mind something I heard recently about the so-called "Rule of 10". There will always be 3 people who support you, 3 people who are against you, and 4 people who have no idea what's going on and don't care.
Don't just focus on the 3 people who are being negative.
Wolf Warrior diplomacy isn't even 10 years dead. The HK treaty was violated and continues to be. Taiwan gets threatened every other week.
People can have problems with America and I'm fine with that. But pretending China isn't subsidizing industry (land, education, transportation) in a predatory fashion is silly. Too many companies have gone out of business because of it. We can all have our friends in China without pretending the CCP is playing the ballgame fairly. The government doesn't need to point it out. That doesn't even get into influence operations (which are especially easy on platforms like this.)
Seriously - there may be a day in the future where Western nations and China get along but it really can't/won't happen while it's holding all the industry and trying to take the Services income as well.
The US assisted a genocide, literally kidnapped the president of a sovereign country so it could take its oil, threatened its own allies with invasion, and started a war of aggression against another country so that it could take their oil, all in the span of a few months.
No, it means that perhaps the US should finally start looking at itself instead of just asserting that it doesn't need to because China.
That doesn't mean China should not be criticized. But to me it's clear that the China blame game is not about genuine concern for the Chinese people or their neighbors; it's about trying to keep China down because it should never have dared to rise in the first place.
Anglo-Saxons and maybe the French should be in charge, and the rest should be resource colonies. It very much feels like that Western mentality is still there.
> No it means that perhaps the US should finally start looking at itself instead of just asserting that it doesn't need to because China.
Agreed, the US definitely needs to do some introspection to sort out its own shit (and stop spraying it on everyone else).
However, that does not mean that China gets a pass. Fundamentally, the Chinese model of governance does not protect the individual. For all its faults, the US model is based upon the idea of individual liberty, which acts as a touchstone and allows it to self-correct whenever it goes too far in the wrong direction. That's something the Chinese model does not do, and it means that, short of a revolution, China will continue to be an authoritarian state with all of the malignant features that entails.
> Fundamentally, the Chinese model of governance does not protect the individual. For all its faults, the US model is based upon the idea of individual liberty
Look, I'm not here to defend the Chinese model, but I find it interesting how convinced you seem that individualism is the right model for everyone.
While I would generally agree with you, I have spoken to many from poorer countries who say that they prefer to trade some individualism for a steady hand of economic development and lifting the population from poverty. That is the Chinese model.
These people would argue that they can reclaim more and more individual freedom as the country gets richer and more self confident.
I am not saying they are right, but looking at a nominal democracy like India and a nominal autocracy like China, I know which government has done better at raising the living standards of its population, and it's not the Indian one.
My hope is that China will continue to liberalize on its own. Forcing it will likely only reverse the gains.
Individualism also leads to the sort of healthcare system the US had or Skid Row. So it's not all roses.
> I also feel comfortable saying that many Americans don't care one bit what happens to foreigners, be it by action of their government or companies
What's the point of this kind of statement for you? Does it help you understand others, or just continue to drive the wedge in? Where are you from? Ask yourself: can the statement
"many {of my country} don't care one bit what happens to foreigners, be it by action of the government or companies" not be read as true?
There are self-absorbed, disinterested, uncompassionate people in every country who will satisfy your "many" qualifier.
I am from Europe. I feel comfortable saying that many in Europe do not care about what their governments or companies do to foreigners (at least not enough to inform themselves about it).
However, looking at the polls in the US gives you a fairly good idea that there's a sizable chunk of people who seem to get off on violence towards non-Americans. Why do you think ICE went with the violent tactics it did?
As to
> What's the point of this kind of statement for you? Does this help you understand others or just continue to drive the wedge in?
The point is to maybe make some Americans ask what they can do to reform the government they have the most direct influence over (their own), instead of trying to reassure themselves that theirs is still better than country X's.
In those cases, OpenRouter just chooses providers that agree not to train / offer ZDR. Which sometimes means you start off without access to the model until some other providers start offering it.
In a sense, it's working as intended. If you set zdr to true, you currently can't use DeepSeek v4. However, once other providers offer it (it is an open model, after all), some may allow zdr.
RDNA is a whole different (and much poorer supported) animal than CDNA. As someone with extensive experience in both, if you're asking the question, then, no.
(If you're just looking to learn, use the free Kaggle/Google Colab T4s/TPUs to get started.)
Current "TurboQuant" implementations are about 3.8X-4.9X on compression (w/ the higher end taking some significant hits of GSM8K performance) and with about 80-100% baseline speed (no improvement, regression): https://github.com/vllm-project/vllm/pull/38479
For those not paying attention, it's probably worth sending this and the ongoing discussions for vLLM (https://github.com/vllm-project/vllm/issues/38171) and llama.cpp through your summarizer of choice - TurboQuant is fine, but not a magic bullet. Personally, I've been experimenting with DMS, and I think it has a lot more promise and can be stacked with various quantization schemes.
The biggest kvcache savings, though, come from improved model architecture. Gemma 4's SWA/global hybrid saves up to 10X on kvcache; MLA/DSA does as well (the latter also helps with global-attention compute); and using linear/SSM layers saves even more.
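Back-of-envelope on where the SWA number comes from (my own arithmetic, assuming a Gemma-3-style 5:1 local:global layer ratio and a 1,024-token sliding window, not confirmed Gemma 4 specs): at 128K context, an all-global model caches 131,072 positions per layer, while the hybrid averages (5 × 1,024 + 131,072) / 6 ≈ 22.7K, about a 5.8X saving. For an n:1 mix the saving approaches (n+1)X as context grows, so a more local-heavy interleave is how you get toward the 10X figure.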
None of these reduce memory demand, though (Jevons paradox, etc). Looking at my coding tools, I'm using about 10-15B cached tokens/mo currently (up from 5-8B a couple months ago), and while I'm probably above average on the curve, I don't consider myself to be doing anything especially crazy. This year, between mainstream developers and more and more agents, I don't think there's really any limit to the number of tokens people will want to consume.
As some other people mentioned, using both/multiple is the way to go if it's within your means.
I've been working on a wide range of projects, and I find that the latest GPT-5.2+ models seem to be generally better coders than Opus 4.6; however, the latter tends to be better at big-picture thinking, structuring, and communicating, so I tend to iterate through Opus 4.6 max -> GPT-5.2 xhigh -> GPT-5.3-Codex xhigh -> GPT-5.4 xhigh. I've found GPT-5.3-Codex is the most detail-oriented, but not necessarily the best coder. One interesting thing: for my high-stakes project, I have one coder lane but use all the models to do independent review, and they tend to catch different subsets of implementation bugs. I also notice huge behavioral changes based on changing AGENTS.md.
In terms of the apps, while Claude Code was ahead for a long while, I'd say Codex has largely caught up in ergonomics, and in some things, like the way it lets you inline or append steering, I like it better now (and in some it's far, far ahead - compaction is night-and-day better in Codex).
(These observations are based on about 10-20B/mo combined cached tokens, human-in-the-loop - heavy usage, to the point that I no longer eyeball most code, but not dark-factory/slop-cannon levels. I haven't found (or built) a multi-agent control plane I really like yet.)
Codex won me over with one simple thing. Reliability. It crashed less, had less load shedding and its configuration is well designed.
I do regular evaluation of both Codex and Claude (though not to statistical significance), and I'm of the opinion there is more within-group variance in outcome performance than between them.
Like others have mentioned, I think the premise of looking at the most popular few projects (pypi.org currently lists 771,120 projects) on pypi as any sort of proxy for AI coding is terribly misguided/unrepresentative and that almost no one is going to be packaging up their vibe-coded projects for distribution on pypi.
That being said, I've personally put 3 up recently (more than I've published in total before). I'm sure they have close to zero downloads (why would they? they're brand new, solve my own problems, and I'm not interested in marketing or supporting them; they're just shared because they might be useful to others), so they wouldn't show up in their review. 2 of these are pretty meaty projects that would have taken weeks if not months of work but instead were largely built over a weekend or a few days. I'd say it's not just the speed: without the lowered effort, these projects would never have crossed the effort/need bar to be started at all.
I've probably written 50-100X more AI-assisted code that will never go to pypi, even as someone who has released pypi packages before (which already puts me in a tiny minority of programmers, much less regular people who would even think about uploading a pypi project).
For those interested in the scope of the recent projects:
https://pypi.org/project/realitycheck/ - first pypi: Jan 21 - 57K SLoC - "weekend" project that kept growing. It's a framework that leverages agentic coding tools like Codex/Claude Code to do rigorous, systematic analysis of claims, sources, predictions, and argument chains. It has 400+ tests and does basically everything I want it to do now. The repo has 20 stars and I'd estimate only a handful of people are using it.
https://pypi.org/project/tweetxvault/ - first pypi: Mar 16 - 29K SLoC - another weekend project (followed up on a second weekend). This project is a tool for archiving your Twitter/X bookmarks, likes, and tweets into a local db, with support for importing from archives and letting you search through them. I actually found 3 or 4 other AI-coded projects that didn't do quite what I wanted, so I built my own. This repo has 4 stars, although a friend submitted a PR and mentioned it solved exactly their problem and saved them from having to build it themselves, so that was nice and justifies publishing for me.
https://pypi.org/project/batterylog/ - first pypi: Mar 22 - 857 SLoC - this one is actually something I wrote (and have been using daily) for 3-4 years, but never bothered to properly package up. It tracks how much battery your laptop drains while asleep, and it's basically the bare-minimum script/installer to be useful. I never packaged it because, quite frankly, manual pypi releases are enough of a PITA not to bother, but LLMs now basically make it a matter of saying "cut a release," so when I wanted to add a new feature, I packaged it up as well, which I would never have done otherwise. This repo has 42 stars and a few forks, although probably 0 downloads from pypi.
(I've spent the past couple years heavily using AI-assisted workflows, and only in the past few months (post Opus 4.6, GPT-5.2) would I have even considered AI tools reliable enough to consider trusting them to push new packages to pypi.)
Funny that you mention multi-monitor, since it's one of the reasons I eventually moved to Wayland. The only way to support different-DPI monitors in X was to do janky scaling or run even jankier multiple X servers.
I don't use KDE (or GNOME anymore) but while I had to deal with a lot of initial speedbumps a couple years ago, these days instead of a full DE, I'm using a Niri setup and it's worked out great for me.
For my laptop, I have my own monitor-detection/wl-mirror script, for example, that is faster and more reliable for plugging into projectors/meeting-room HDMI than even my old Macs were.
The funny thing about this myth is that Wayland does not even try to support mixed-DPI setups; the only thing it supports is, as you put it, janky scaling. Not that X is any better in the end, but at least it has the data available if any application wants to try to do correct mixed DPI (nobody does).
So, in yet another case of worse is better, Wayland has the reputation of supporting mixed-DPI environments, not because it has any support for actual mixed DPI but because it is better at faking it (fractional scaling).
Myth or not, it is absolutely much better on Wayland. I really don't care to (or know how to) tweak Linux, so I've been using straight-install Fedora for years. I also have 4 screens. When Fedora switched to Wayland it got much better, and it keeps getting better.
I use a docked ThinkPad with the lid closed and two external monitors. Here are my config bits.
set $laptop eDP-1
set $landscape 'Hewlett Packard HP ZR24w CNT037144C'
set $portrait 'Hewlett Packard HP ZR24w CNT03512JN'
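# output names above are the make/model/serial strings from `swaymsg -t get_outputs`;
# the lid switch below toggles the internal panel, for docked use with the lid closed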
bindswitch --reload --locked lid:on output $laptop disable
bindswitch --reload --locked lid:off output $laptop enable
### Output configuration
output $laptop bg $HOME/pictures/wallpaper/1529004448340.jpg fill
output $landscape bg $HOME/pictures/wallpaper/1529004448340.jpg fill
output $portrait bg $HOME/pictures/wallpaper/portrait/DYabJ0FV4AACG69.jpg fill
# pos args are x coords and y coords, transform is degrees of rotation counter-clockwise
# set $portrait as left monitor and rotate it counterclockwise
output $portrait pos 0 1200 transform 270
I am not a theoretical CS or math expert by any means, but I have been wrangling coding agents for a while, and after reading the paper and about the problems Stapper had dealing with Claude (context management, instruction following, etc), I decided to see if I could replicate it with a slightly better harness. The results were pretty interesting: https://github.com/lhl/claudecycles-revisited
- My original setup left traces of the PDF paper, and after GPT 5.3-Codex xhigh reached an impasse, it went looking for the paper and found it!
- I went and did cleanroom (basically one-shot) passes for GPT 5.2 xhigh, GPT 5.3-Codex xhigh, and Claude Opus 4.6 ultrathink; 5.2/5.3 found alternate solutions for odd m >= 5, while Opus 4.6 did not find any proofs but tried more approaches to solving.
I've also included the session traces and analysis in the repo branches. Also, the AGENTS.md was pretty simple, but that harness produced consistent process outcomes across all three models:
I was a bit interested in doing a replication to see if a better harness could avoid some of the problems they ran into with context management, poor instruction following, etc., and it looks like yes, it's definitely possible.
I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md - I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.
As described in the readme of your repo (did you read it?) your agent found the Knuth paper located one directory level above its working directory.
So, you didn't produce a replication in 47 minutes, it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.
I wonder how common of a problem this will be in the future. The experiment will fail due to improper setup, the human will at best glance over the logs and declare victory, and everyone just believes.
Yes, I read it and specifically pointed it out (that's why there are 3 hours of interactive logs). There are 4 other runs pushed now so you can see what actual clean room runs for 5.2 xhigh, 5.3-Codex xhigh, 5.4 xhigh, and Opus 4.6 ultrathink look like: https://github.com/lhl/claudecycles-revisited/blob/main/COMP... as well as the baseline.
omg this is so cool.
Because I'm writing my own harness and I need some cognitive benchmarks. I have a bunch of harness-level infra around LLM interactions that seems to help with reasoning, but I don't have a structured way to evaluate things.
Thanks for sharing your test setup, I really appreciate the time you took. This will help me so much.