Hacker Newsnew | past | comments | ask | show | jobs | submit | msp26's commentslogin

It triggered for me when I asked "Web search for your own model card (released today) and pick out your favourite highlights from the pdf"

>Pricing for both models is $10 per million input tokens and $50 per million output tokens.

Basically double from Opus 4.8 IIRC

hell will freeze over before anthropic release anything meaningful to the public


Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.

However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon. My build doesn't support any more GPUs and I believe I would want another 4090 (overpriced) for best performance or otherwise just replace it altogether.


You could keep multimodal projector (understanding of audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. Of course, then it is not accelerated with GPU, but you save its VRAM.


Interesting, I might try that, thanks!


Qwen is still better that Gemma though. Also you can tune it more for different tasks, which means that you can prioritize thinking and accuracy versus inference speed.


Qwen is better at some things (code, in particular), but Gemma has better prose and better vision. At least, it feels that way to me.


gemma is also just way faster. i dont wanna wait 10min to get a 5-10% better answer (and sometimes, actually worse answer).

best is to use your own model router atm, depending on the task


I'm pretty sure Qwen is faster? The MoE version of Qwen is 3B active, while Gemma 4 is 4B active. Similarly, the dense Qwen is 27B while Gemma is 31B. All else being equal (though I know all else isn't equal), Qwen should be faster in both cases. I haven't actually measured with any precision, but on my AMD hardware (Strix Halo or dual Radeon Pro V620) they seem quite similar in both cases...both MoE models are fast enough for interactive use, both dense models are notably smarter but much slower, long time to first response and single-digit tokens per second once it starts talking.


qwen-3.6 is really interesting. The dense 27B model is pretty slow for me whereas the sparse 31B is blazingly fast but it also needs to be since it's so chatty. It produces pages and pages of stream of consciousness stuff. 27B does this to a lesser extent but slow enough that I can actually read it whereas 31B just blasts by.

I haven't yet compared either to Gemma 4. I tried that out the day after it came out with the patched llama.cpp that added support for it but I couldn't make tool calling work and so it was kind of useless. I should try again to see if things have changed but judging by what people say, qwen-3.6 seems stronger for coding anyway.


I had the same experience with 31B. Runs well on 4090 too!


I'm using both incessantly and having a great time.


Qwen without thinking is just as fast. I have 4 parameter settings based on recommendation. If you want a good coding problem, the thinking coding mode works well, but takes a while to arrive at an answer. If you want faster turn around time, instruction mode works without thinking.


Genuine question: how do you tune it?

I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)


You can fine-tune relatively easily in Unsloth Studio.


Parameter settings are here. https://huggingface.co/Qwen/Qwen3.6-35B-A3B

Most clients that support ollama support passing extra body options where you can set those.


It’s a heck of a lot faster too.


Yes I would just go with qwen.


I like starting most of my projects on marimo notebooks now and slowly moving parts of it to the main codebase + db.

By the end of it I might remove the notebook entirely but usually I keep it for some visualisation + running stuff as a cli tool.


session usage limits this week feel like ass. Even when being careful to not break prefix caching.


I've been seeing much higher session limits late at night (US time). Workday usage struggles though.

I'm looking into how to structure my work to run some autonomous-safe jobs overnight to take advantage of it.


Not necessarily with speculative decoding. Whitespace would be trivial to predict and they would petty much keep using the same amount of compute as before.

I don't think that's their primary motive for doing this but it is a side effect.


They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.


I've read so many conflicting things about Mythos that it's become impossible to make any real assumptions about it. I don't think it's vaporware necessarily, but the whole "we can't release it for safety reasons" feels like the next level of "POC or STFU".


They don't have demand for the price it would require for inference.

They are definitely distilling it into a much smaller model and ~98% as good, like everybody does.


Some people are speculating that Opus 4.7 is distilled from Mythos due to the new tokenizer (it means Opus 4.7 is a new base model, not just an improved Opus 4.6)


The new tokenizer is interesting, but it definitely is possible to adapt a base model to a new tokenizer without too much additional training, especially if you're distilling from a model that uses the new tokenizer. (see, e.g., https://openreview.net/pdf?id=DxKP2E0xK2).


Not impossible, but you have to be at least a little bit mad to deploy tokenizer replacement surgery at this scale.

They also changed the image encoder, so I'm thinking "new base model". Whatever base that was powering 4.5/4.6 didn't last long then.


Yes, I was thinking that. But it could as well be the other way around. Using the pretrained 4.7 (1T?) to speed up ~70% Mythos (10T?) pretraining.

It's just speculative decoding but for training. If they did at this scale it's quite an achievement because training is very fragile when doing these kinds of tricks.


Reverse distillation. Using small models to bootstrap large models. Get richer signal early in the run when gradients are hectic, get the large model past the early training instability hell. Mad but it does work somewhat.

Not really similar to speculative decoding?

I don't think that's what they've done here though. It's still black magic, I'm not sure if any lab does it for frontier runs, let alone 10T scale runs.


> They don't have demand for the price it would require for inference.

citation needed. I find it hard to believe; I think there are more than enough people willing to spend $100/Mtok for frontier capabilities to dedicate a couple racks or aisles.


> First, Opus 4.7 uses an updated tokenizer that improves how the model processes text

wow can I see it and run it locally please? Making API calls to check token counts is retarded.


> Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer.

Wrong. There can be a lot of subjectivity and pretending that some golden answer exists does more harm and narrows down the scope of what you can build.

My other main problem with data extraction tasks and why I'm not satisfied with any of the existing eval tools is that the schemas I write change can drastically as my understanding of the problem increases. And nothing really seems to handle that well, I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.

Also there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. And to actually improve your schema/prompt you have to be mindful of the entire problem stack and how you might separate things that are handled through post processing rather than by the LLM directly.

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And in practice random limitations like structured output API schema limits between providers can make this non-trivial. God I hate the Gemini API.


This is very true! I could have been more careful/precise in how I worded this. I was really trying to just get across that it's in a sense easier than some tasks that can be much more open ended.

I'll think about how to word this better, thanks for the feedback!


This is extremely true. In fact, from what we see many/most of the problems to be solved with LLMs do not have ground-truth values; even hand-labeled data tends to be mostly subjective.


I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: