WanderPanda's comments

It highly depends on the task. For math and coding, sure. But for knowledge tasks GPT-4 is way better than even SOTA ~100B models. For my knowledge test cases the lines get blurry at >400B.


I applaud that you recently started providing the KL divergence plots that really help understand how different quantizations compare. But how well does this correlate with closed loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?


Hey! Sorry for not replying sooner - yes we'll keep publishing more KLD - sadly some are saying we are "optimizing" for KLD now since we posted so many haha - but the whole purpose of quantization is to match the BF16 logits as much as possible whilst reducing disk space (ie reduce KLD).

In general this is a funny quirk of quantization - sometimes 8bit, 4bit models do BETTER on downstream benchmarks (SWE Bench for eg), since sometimes rounding can actually act as a "regularization" method (this is just my hunch).

So KLD isn't that expensive, since we leverage the trick of causal attention - since causal attention is lower triangular, we can do 1 forward pass on the entire text (say 2048 tokens), and you get logits for the prediction at every token's position - so this is O(N^2).
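To make the KLD idea concrete, here is a minimal sketch in plain Python. It assumes you already have the per-position logits from one forward pass of the BF16 model and one of the quantized model; the function names and the tiny example logits are made up for illustration and are not Unsloth's actual code.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def mean_token_kld(ref_logits, quant_logits):
    # ref_logits, quant_logits: one row of logits per token position,
    # e.g. both taken from a single forward pass over the same text.
    # Returns KL(reference || quantized) averaged over positions, in nats.
    total = 0.0
    for ref_row, quant_row in zip(ref_logits, quant_logits):
        ref_logp = log_softmax(ref_row)
        quant_logp = log_softmax(quant_row)
        total += sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))
    return total / len(ref_logits)

# Hypothetical logits for 2 positions and a 3-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.2, 0.0], [0.4, 0.7, 2.8]]
kld = mean_token_kld(ref, quant)
```

A value of 0 means the quantized model reproduces the reference distribution exactly at every position; larger values mean more drift.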

However coding benchmarks require actual inference, and cannot use the causal attention trick, and it's best to run them 10 times since temperature = 1.0 is not deterministic - and take an average. We plan to maybe do something like https://marginlab.ai/trackers/claude-code/, which takes a random sample and does it over time.
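The "run it several times and average" step could be sketched like this. The scores below are hypothetical pass rates, not real benchmark results, and `summarize_runs` is just an illustrative helper name.

```python
import math
import statistics

def summarize_runs(scores):
    # Mean and standard error of the mean over repeated benchmark runs.
    mean = statistics.mean(scores)
    if len(scores) > 1:
        stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    else:
        stderr = 0.0
    return mean, stderr

# Hypothetical pass rates from 10 runs of one quant at temperature = 1.0.
runs = [0.61, 0.58, 0.63, 0.60, 0.59, 0.62, 0.57, 0.64, 0.60, 0.61]
mean, stderr = summarize_runs(runs)
```

Reporting the standard error alongside the mean helps show whether a gap between two quants is larger than the run-to-run noise.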


I would be really interested in a podcast with the CEO where he goes a bit into the trade-offs of backwards and forwards compatibility. I cannot imagine that their planning was so immaculate that there aren't any regressions that a clean slate design could have cleaned up. Nevertheless, amazing job putting this together, it looks like a phenomenal product!


This is so true! Shows a lack of care that usually doesn’t stop at just the naming


They are heavily post-trained on code and math these days. I don't think we can infer that much about their behavior from just the pre-training dataset anymore.


Amazing work and people should really appreciate that the opportunity costs of your work are immense (given the hype).

On another note: I'm a bit paranoid about quantization. I know people are not good at discerning model quality at these levels of "intelligence" anymore, I don't think a vibe check really catches the nuances. How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

I was recently trying Qwen 3 Coder Next and there are benchmark numbers in your article but they seem to be for the official checkpoint, not the quantized ones. But it is not even really clear (and chatbots confuse them for benchmarks of the quantized versions btw.)

I think systematic/automated benchmarks would really bring the whole effort to the next level. Basically something like the bar chart from the Dynamic Quantization 2.0 article but always updated with all kinds of recent models.


Thanks! Yes we actually did think about that - it can get quite expensive sadly - perplexity benchmarks over short context lengths with small datasets are doable, but it's not an accurate measure sadly. We're actually investigating currently what would be the best efficient course of action on evaluating quants - will keep you posted!


> How hard would it be to systematically evaluate the different quantizations? E.g. on the Aider benchmark that you used in the past?

Very hard. $$$

The benchmarks are not cheap to run. It'll cost a lot to run them for each quant of each model.


Yes sadly very expensive :( Maybe a select few quants could happen - we're still figuring out what is the most economical and most efficient way to benchmark!


Roughly how much does it cost to run one of the popular benchmarks? Are we talking $1,000, $10,000, or $100k?


Oh it's more time that's the issue - each benchmark takes 1-3 hours ish to run on 8 GPUs, so running on all quants per model release can be quite painful.

Assume AWS spot at say $20/hr for 8 B200 GPUs, then $20 ish per quant for a 1 hour run. Assuming the benchmark is on BF16, 8bit, 6, 5, 4, 3, 2 bits, that's 7 ish tests, so $140 ish per model at 1 hour per run up to $420 ish at 3 hours per run. Time wise 7 hours to 1 day ish.
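Spelled out as a quick calculation, using only the rough numbers from this comment ($20/hr for the 8 GPU node, 1-3 hours per run, 7 precisions); the helper name is made up for illustration:

```python
def benchmark_cost(rate_per_hour, hours_per_run, num_quants):
    # Runs are assumed to happen one after another on a single 8 GPU node.
    node_hours = hours_per_run * num_quants
    return rate_per_hour * node_hours, node_hours

quants = ["BF16", "8bit", "6bit", "5bit", "4bit", "3bit", "2bit"]

low_cost, low_hours = benchmark_cost(20, 1, len(quants))    # 1 hour per run
high_cost, high_hours = benchmark_cost(20, 3, len(quants))  # 3 hours per run

print(f"${low_cost} to ${high_cost} per model, {low_hours} to {high_hours} hours")
```

This covers one benchmark only, so each extra benchmark multiplies both the cost and the wall-clock time.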

We could run them after a model release which might work as well.

This is also on 1 benchmark.


This would be amazing


Working on it! :)


I find it hard to trust post training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out because it should be the easiest thing to automatically run a suite of benchmarks


Unsloth doesn't seem to do this for every new model, but they did publish a report on their quant methods and the performance loss it causes.

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

It isn't much until you get down to very small quants.


Wait but the one you linked seems to be pneumatically driven, while the op one is an actual combustion engine, right?


That’s true! Sorry for not mentioning that.


Small feedback if any of the Antigravity people read here: "Fast" is not a great name for the "eager" option (vs. "Planning") because "Fast" is associated with "dumb" in LLMs (fast/flash/mini). Probably "Eager" would be a more descriptive name


SWIFT is Belgian, though?


It’s just a detail, the international financial market/banking system is basically under active US control, just look at what happened to Wegelin & Co. (at that point the oldest bank in Switzerland) when they thought that that was not the case.

