irthomasthomas · 2026-06-10T21:54:04 1781128444

Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium. The context window is relatively small, so to make it work with larger consortiums I construct a recursive, sort-of-meta consortium like this:

  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-kimi -m k2.6 -n 5 --arbiter mercury-2 --judging-method rank

  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --arbiter mercury-2 --judging-method synthesis

Now when I prompt cns-meta-glm-kimi, it picks the best of five from each of kimi and glm, then creates a synthesis from the two winners.
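A rough Python sketch of that two-level flow, in case it helps to see the shape of it. This is not llm-consortium's actual code: `make_consortium`, `rank_arbiter`, `synthesis_arbiter` and `fake_model` are hypothetical stand-ins, and the "judging" here is a toy rule rather than a model call.

```python
from typing import Callable, List

# Hypothetical stand-ins: a "model" maps a prompt to a response,
# an "arbiter" maps a list of candidate responses to a single response.
Model = Callable[[str], str]
Arbiter = Callable[[List[str]], str]


def make_consortium(members: List[Model], arbiter: Arbiter) -> Model:
    """Wrap members plus an arbiter so the group behaves like one model."""
    def run(prompt: str) -> str:
        candidates = [member(prompt) for member in members]
        return arbiter(candidates)
    return run


def rank_arbiter(candidates: List[str]) -> str:
    # Toy "rank" judging: prefer the longest candidate.
    return max(candidates, key=len)


def synthesis_arbiter(candidates: List[str]) -> str:
    # Toy "synthesis" judging: merge the finalists into one answer.
    return " | ".join(candidates)


def fake_model(name: str, variants: List[str]) -> Model:
    # Returns a different canned answer on each call to mimic sampling.
    calls = iter(variants)
    return lambda prompt: f"{name}:{next(calls)}"


glm = fake_model("glm", ["a", "bbb", "cc", "d", "e"])
kimi = fake_model("kimi", ["x", "y", "zzzz", "w", "v"])

cns_glm = make_consortium([glm] * 5, rank_arbiter)    # best of five
cns_kimi = make_consortium([kimi] * 5, rank_arbiter)  # best of five
cns_meta = make_consortium([cns_glm, cns_kimi], synthesis_arbiter)

print(cns_meta("some prompt"))
```

The point of the nesting is that the top-level arbiter only ever sees two candidates instead of ten, which is what keeps a small-context arbiter workable.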

SwellJoe · 2026-06-10T22:07:51 1781129271

I've found the average output of many suboptimal models is still suboptimal, especially when it comes to judging the accuracy/correctness of the work of other models.

I did some benchmarks recently of how well various models find security vulnerabilities, and then follow-up testing of the judging process: whether the models found the right bug, and whether the other bugs they reported were false positives or legitimate other bugs. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.

It's an interesting area for research. And a model that's very fast can make a lot more attempts at a solution, so in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seem like a sweet spot for a model like Mercury.
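To make the "many fast attempts plus a static check" idea concrete, here is a minimal sketch. The generator and verifier are toy placeholders (a random guesser and a square-root check), not anything Mercury-specific; `first_verified` is a hypothetical helper name.

```python
import random
from typing import Callable, Optional


def first_verified(
    generate: Callable[[], str],
    verify: Callable[[str], bool],
    max_attempts: int,
) -> Optional[str]:
    """Sample until an attempt passes the static check or the budget runs out."""
    for _ in range(max_attempts):
        attempt = generate()
        if verify(attempt):
            return attempt
    return None


# Toy stand-ins: the "model" guesses integers, the "rule" checks a square.
rng = random.Random(0)
guess = lambda: str(rng.randint(1, 100))
is_root_of_1369 = lambda s: int(s) ** 2 == 1369

print(first_verified(guess, is_root_of_1369, max_attempts=10_000))
```

The faster the generator, the larger `max_attempts` can be for the same wall-clock budget, which only pays off when `verify` is cheap and trustworthy.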

irthomasthomas · 2026-06-10T22:25:27 1781130327

I have had a better experience with my own use. I use it every day and it rarely fails to improve tasks. Perhaps the prompts and rubrics make a difference. And finding bugs is one of the better use cases because it is essentially a search problem. As long as models are non-deterministic and there is some diversity in training data, an ensemble that iterates on the problem is more likely to cover the ground needed to solve a problem.
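A back-of-the-envelope version of the coverage argument, assuming each sample independently finds the bug with probability p. That independence is an assumption for illustration: models trained on similar data make correlated mistakes, so real gains will be smaller.

```python
def coverage_probability(p_single: float, n_samples: int) -> float:
    """P(at least one of n independent samples succeeds) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_single) ** n_samples


# Illustrative numbers only: a 20% per-sample hit rate.
for n in (1, 5, 20):
    print(n, round(coverage_probability(0.2, n), 3))
```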

Some tasks benefit from this approach more than others. There was a paper from Google on a version they made which was very similar and achieved SOTA at the time on planning and pathfinding benchmarks.

edit:

Mind Evolution paper https://deepmind.google/research/publications/122391/

(That was a month after I published llm-consortium :) https://xcancel.com/karpathy/status/1870692546969735361

irthomasthomas · 2026-06-10T11:54:10 1781092450

Is it a larger model or just better trained? Anthropic does not actually claim it is a larger model anywhere that I can see.

ChrisLTD · 2026-06-10T12:27:26 1781094446

If it’s not larger, it’d be tough to justify the massive price increase for using it.

brookst · 2026-06-10T13:00:34 1781096434

Price is based on perceived value, not cost to produce. There is no international court of price justifications; if customers are willing to pay $X you can charge $X.

pixl97 · 2026-06-10T15:21:39 1781104899

That, and a model can be the same size yet use a lot more compute. I guess think of it as intelligence per watt used, or something like that.

BoorishBears · 2026-06-10T12:43:07 1781095387

Opus 4.7 was smaller and people still paid 4.6 prices.

gpt-5.5 isn't larger than gpt-5.4 but costs double.

irthomasthomas · 2026-06-09T19:20:32 1781032832

"we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."

altcognito · 2026-06-09T19:28:13 1781033293

Where is this text coming from?

[edit] -- I see that this comes from the system card -- dang merged the comments from the other discussion so that explains the confusion.

irthomasthomas · 2026-06-09T18:27:52 1781029672

This is just the sales team doing their thing, applying the Law of Scarcity to drive demand.

It's the same exact speed as opus >=4.5, sonnet 4.5, and twice the speed of opus <=4.1

It must have about the same active parameters, or else it's a larger model running in turbo mode (smaller batches) and being heavily subsidized for some reason. But given most of the benchmarks are within 5%, I doubt it is a much larger model. Most perplexing.
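The reasoning from speed to active parameters can be sketched with a first-order estimate: at small batch sizes, decoding is roughly memory-bandwidth bound, so each token costs about one read of the active weights. The function name and every number below are illustrative assumptions (no real hardware or model is being described), and the estimate ignores KV-cache traffic, batching and speculative decoding.

```python
def decode_tokens_per_second(
    active_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return mem_bandwidth_bytes_per_s / (active_params * bytes_per_param)


# Hypothetical accelerator with 3 TB/s of memory bandwidth, 1 byte per weight.
for active in (27e9, 100e9, 400e9):
    rate = decode_tokens_per_second(active, 1.0, 3e12)
    print(f"{active:.0e} active params -> {rate:.1f} tok/s")
```

Under this model, total parameter count does not appear at all; only the active parameters read per token do.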

m00x · 2026-06-09T20:55:09 1781038509

It could be a much bigger MoE model

irthomasthomas · 2026-06-09T20:57:54 1781038674

Then it would be slower.

irthomasthomas · 2026-06-09T17:42:45 1781026965

Anthropic has again changed the set of benchmarks they use[0]. This time they have also moved all benchmark scores to the PDF. At a glance it looks like it gains about 5-10% over other models. The speed is about the same as opus >=4.5, sonnet 4.5, and double the speed of opus <=4.1

                                Mythos 5    Fable 5     MythosPrev  Opus 4.8    GPT-5.5     Gemini 3.1 Pro
  SWE-bench Pro                 80.3        80          77.8        69.2        58.6        54.2
  SWE-bench Ver                 95.5        95          93.9        88.6        -           80.6
  Terminal-Bench                88.0        84.3        -           82.7        83.4        -
  BrowseComp (Single-Agent)     88.0        -           87.9        84.3        84.4        85.9
  BrowseComp (Multi-Agent)      93.3        -           -           88.5        -           -
  HLE (No tools)                59.0        -           56.8        49.8        41.4        44.4
  HLE (Tools)                   64.5        -           64.7        57.9        52.2        51.4
  CharXiv Reasoning (No tools)  88.9        -           86.2        80.5        -           -
  CharXiv Reasoning (Tools)     93.5        -           92.5        89.9        -           -
  BioMystery Bench (Human)      83.9        -           82.6        80.4        -           -
  BioMystery Bench (Hard)       46.1        -           29.6        40.0        -           -
  OSWorld-Verified              85.0        85.0        85.4        83.4        78.7        76.2*
  CritPt                        28.6        -           20.9        27.1        17.7        -
  ArxivMath                     78.5        68.7        71.8        71.5        64.0        -

[0] https://news.ycombinator.com/item?id=48312633

Edit: Also in the system card... "we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."

charles_f · 2026-06-09T18:44:05 1781030645

It's announced as a revolution but when you look at those benchmarks it surely looks like an iteration.

irthomasthomas · 2026-06-08T20:57:42 1780952262

llm-consortium: prompts multiple models in parallel, loops until confidence_threshold, and iteratively refines a response.
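For anyone who wants the shape of that loop without reading the plugin, here is a minimal sketch. It is not the plugin's implementation: `run_consortium`, `Verdict` and the refinement prompt wording are hypothetical, and the models and arbiter are plain callables you would supply.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    answer: str
    confidence: float  # arbiter's self-reported score in [0, 1]


def run_consortium(
    prompt: str,
    models: List[Callable[[str], str]],
    arbiter: Callable[[str, List[str]], Verdict],
    confidence_threshold: float = 0.8,
    max_iterations: int = 3,
) -> Verdict:
    """Fan out to all models, let the arbiter judge, refine until confident."""
    verdict = Verdict(answer="", confidence=0.0)
    current_prompt = prompt
    for _ in range(max_iterations):
        # Query every member in parallel with the current prompt.
        with ThreadPoolExecutor(max_workers=len(models)) as pool:
            responses = list(pool.map(lambda m: m(current_prompt), models))
        verdict = arbiter(prompt, responses)
        if verdict.confidence >= confidence_threshold:
            break
        # Feed the best attempt back in for the next round.
        current_prompt = (
            f"{prompt}\n\nPrevious best attempt:\n{verdict.answer}\nImprove on it."
        )
    return verdict
```

The `max_iterations` cap is there so a never-satisfied arbiter cannot loop forever.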

This was inspired by a karpathy tweet [0], and the prototype was created using another tool of mine: the LLM Plugin Generator plugin (essentially a curated collection of plugins for simonw's llm CLI, used as a few-shot prompt).

The llm-model-gateway companion plugin lets you serve models from the LLM CLI as an openai API. This allows you to use saved consortiums in your various clients as if they were a regular model, bringing massive parallel reasoning to any workflow.

It occurred to me at some point that a collection of parallel LLMs was not really a consortium. A consortium is a group of organizations. A group of groups. To rectify this I added support for actual consortiums, where each member of an llm-consortium can itself be a consortium of models. e.g.

llm consortium save cns-glm-n3 -m glm-5.1 -n 3 --arbiter mercury-2

llm consortium save cns-k2-n3 -m kimi-k2.6:3 --arbiter mercury-2

llm consortium save cns-meta-glm-k2 -m cns-k2-n3 -m cns-glm-n3 --arbiter cns-k2-n3

Yes, even the arbiter/judge can be comprised of a consortium of models, bringing parallel reasoning to the task of judging parallel reasoning chains.

Consortiums can also now contain groups of specialists. These custom user-defined expert characters address the prompt from a different perspective. And a Westworld style Attribute matrix can be randomized to inject some more entropy into the process.

[0]https://xcancel.com/karpathy/status/1870692546969735361

Some other llm plugins I vibe coded:

classifai generates labels with approximate confidence derived from logprobs

llm-alias-options saves inference parameters such as reasoning effort with a model alias. (good for setting the provider in openrouter or creating a consortium of high temperature models)

llm-prompt-json adds a --json flag to return the llm logs object (good for getting conversation_id, or reasoning output in scripts)

llm-jina adds support for all jina AI specialised models and tools like web fetching, embedding and reranking.
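On the classifai point above, one common way to turn logprobs into an approximate confidence looks like the sketch below. This is a generic illustration, not classifai's code: `label_confidence` is a hypothetical helper, and it assumes you already have one log-probability per candidate label.

```python
import math
from typing import Dict, Tuple


def label_confidence(label_logprobs: Dict[str, float]) -> Tuple[str, float]:
    """Pick the most likely label and renormalise its probability over the candidates."""
    probs = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / total


# Illustrative logprobs for two candidate labels.
label, confidence = label_confidence({"spam": math.log(0.6), "ham": math.log(0.2)})
print(label, round(confidence, 2))
```

Renormalising over the candidate labels drops the probability mass the model put on tokens that are not valid labels, which is one reason the result is only approximate.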

mattjoyce · 2026-06-08T21:28:48 1780954128

I'm quite curious about this.

I think this is similar. Unfinished. https://github.com/mattjoyce/roundtable-consensus

notesinthefield · 2026-06-08T21:02:46 1780952566

Great project! I often check the opinion of one model against others when doing research and a sort of consensus process would save many a c/p

irthomasthomas · 2026-06-08T20:17:14 1780949834

No one is bitter lesson pilled anymore. Everyone is pivoting to neurosymbolic systems. It looks like Gary Marcus was right.

nl · 2026-06-08T23:49:05 1780962545

> No one is bitter lesson pilled anymore.

Will the 10T parameter Mythos model be released this month or next month?

They'd better release it soon, because it is generally accepted that one of the reasons GPT 5.5 is better at hard tasks than Opus is its parameter size, and that Opus 4.8 remains competitive only by scaling test-time compute (see how many more tokens it uses than GPT 5.5)

https://www.reddit.com/r/LLM/comments/1sz8bjz/parameter_esti...

irthomasthomas · 2026-06-09T08:49:37 1780994977

Why ask me? Anyway, Mythos is not 10T. Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.
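The arithmetic behind that claim, using the usual rules of thumb: training compute of about 6 FLOPs per parameter per token, and roughly 20 tokens per parameter for Chinchilla-optimal training. Both constants are approximations for dense models; for a sparse MoE the relevant count would be active parameters, so treat this as a sketch.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb (approximation)
FLOPS_PER_PARAM_TOKEN = 6  # forward + backward pass, dense transformer


def chinchilla_flops(n_params: float) -> float:
    """Approximate training FLOPs for a compute-optimal dense model."""
    n_tokens = TOKENS_PER_PARAM * n_params
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def max_params_under(budget_flops: float) -> float:
    """Largest compute-optimal parameter count that fits the budget."""
    return math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))


print(f"{chinchilla_flops(10e12):.2e}")  # 1.20e+28
print(f"{max_params_under(1e26):.2e}")   # 9.13e+11
```

So under these assumptions a dense 10T model trained to Chinchilla would need on the order of 10^28 FLOPs, and a 10^26 budget tops out a little under 1T parameters.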

Anthropic also confirmed they will not release Mythos, only a "Mythos-class" model, whatever that means.

nl · 2026-06-09T11:23:07 1781004187

> Anthropic confirmed the training run was under 10^26 flops. You can't train 10T to Chinchilla and stay under 10^26.

I don't think Anthropic have said anything of the sort.

Microsoft published it as 6.1*10^27 FLOPs[1]

Elon has claimed they are also training a 10T model because "Some catching up to do"[2]

[1] https://x.com/scaling01/status/2061897540161728791

[2] https://x.com/elonmusk/status/2041754402239975479

irthomasthomas · 2026-06-09T14:13:17 1781014397

I must have confused mythos with opus 4.7. One of their recent model cards confirmed that training flops was under the EO reporting requirement of 10^26 flops.

wild_egg · 2026-06-09T00:06:50 1780963610

How is neurosymbolic not aligned with the bitter lesson? The bitter lesson is completely agnostic to architecture.

irthomasthomas · 2026-06-09T08:40:35 1780994435

I should have stressed the symbolic part. Everyone has pivoted to symbolic systems like claude code and codex. They would not invest so heavily in such systems if they thought llms would deliver agi soon.

jubilanti · 2026-06-09T15:20:29 1781018429

That's not what symbolic means.

irthomasthomas · 2026-06-08T15:58:29 1780934309

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.

gekoxyz · 2026-06-08T16:12:45 1780935165

Maybe they don't have enough racks. The news indicates that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.

jdthedisciple · 2026-06-08T16:23:42 1780935822

Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?

throwa356262 · 2026-06-08T19:56:59 1780948619

The TileRT approach swaps throughput for latency, which also means less overall efficiency

Given the export restrictions this could mean they need to prioritise how to best use their limited hardware. But they could also be moving to Huawei GPUs like deepseek did and simply not have stable hardware or software for a large scale deployment yet.

This is just speculation based on the MXFP4 support on Huawei GPUs that is lacking on some nvidia GPUs.

slaw · 2026-06-08T16:37:56 1780936676

Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.

ilaksh · 2026-06-08T18:08:41 1780942121

It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.

boutell · 2026-06-08T16:46:13 1780937173

I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-v.25-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?

I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
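A toy model of that tradeoff, for illustration only. It assumes each decode step costs a fixed overhead plus a small per-sequence cost, and the millisecond figures are made up rather than measured on any real system; `decode_rates` is a hypothetical name.

```python
def decode_rates(batch_size: int, fixed_ms: float = 10.0, per_seq_ms: float = 0.5):
    """Return (tokens/s per user, tokens/s for the whole server) for one decode step."""
    step_ms = fixed_ms + per_seq_ms * batch_size
    per_user = 1000.0 / step_ms         # each user gets one token per step
    aggregate = batch_size * per_user   # the server emits batch_size tokens per step
    return per_user, aggregate


for batch in (1, 8, 64):
    per_user, aggregate = decode_rates(batch)
    print(f"batch={batch:3d}  per-user={per_user:7.1f} tok/s  total={aggregate:8.1f} tok/s")
```

In this model a tiny batch gives each user the fastest stream but wastes most of the hardware's total throughput, which would be one way a "high speed" tier ends up tying up more resources per request.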

HarHarVeryFunny · 2026-06-09T12:36:35 1781008595

> and doesn't require different hardware

But it may well do. They mention TileRT in the announcement, so this speed comes from low level optimization for some specific GPU target.

With availability of SOTA western GPUs being scarce in China, they may well have a mishmash of different GPUs.

boutell · 2026-06-09T13:19:07 1781011147

They specifically said it's stock hardware, but... yeah, maybe highly specific stock hardware.

HarHarVeryFunny · 2026-06-08T16:36:02 1780936562

Maybe they only have a finite number of racks ;-)

irthomasthomas · 2026-06-08T15:54:38 1780934078

Actually, simonw has started saying that after qwen 27B beat Opus 4.7

https://news.ycombinator.com/item?id=48446348

irthomasthomas · 2026-06-08T15:06:39 1780931199

Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that new opus is a very sparse MoE with close to 27B active params.

  "there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
 
  Today, even that loose connection to utility has been broken..."

https://simonwillison.net/2026/Apr/16/qwen-beats-opus/