ssivark · 2026-06-10T02:03:51 1781057031

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?

taneq · 2026-06-10T02:45:03 1781059503

You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.
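
To make the distinction concrete, here is a rough sketch. The batch size and step latency are made-up numbers, not figures from this thread, and it assumes one token per sequence per decode step. Under those assumptions, aggregate throughput is batch size divided by step latency, so 100k tok/s does not require a 10 us step.

```python
def aggregate_throughput(batch_size: int, step_latency_s: float) -> float:
    """Tokens per second across the whole batch (one token per sequence per step)."""
    return batch_size / step_latency_s

# Hypothetical numbers, chosen only for illustration.
batch_size = 1000          # concurrent sequences decoded together
step_latency_s = 0.010     # 10 ms per decode step

throughput = aggregate_throughput(batch_size, step_latency_s)
per_sequence_rate = 1.0 / step_latency_s   # tokens/s seen by a single sequence

print(f"aggregate: {throughput:.0f} tok/s, per sequence: {per_sequence_rate:.0f} tok/s")
```

Each sequence only sees about 100 tok/s here, so microsecond-scale launch overheads are small next to the 10 ms step.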

ssivark · 2026-06-10T11:14:31 1781090071

When doing autoregressive inference, how often do you launch a CUDA kernel? What is the main bottleneck at the throughputs you're operating at?

ssivark · 2026-06-09T11:58:21 1781006301

I don't specifically care about Claude -vs- GPT, but comparing models at different amounts of test time compute is a gaping hole. It also means that any unreasonably-expensive token guzzling white-elephant model can top all the benchmarks and still be useless.

What we actually have is like a scaling law for test time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!

Noam Brown also raised this issue recently: https://x.com/polynoamial/status/2064210146558136827
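
A minimal sketch of what "characterize the slope" could look like, assuming the curve is roughly a power law so that it is a straight line on log-log axes. The data is synthetic (generated from error = 0.5 * tokens^-0.3), not real benchmark numbers, and `loglog_slope` is a helper name made up for this example.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic data from error = 0.5 * tokens**-0.3 (not real benchmark results).
tokens = [1_000, 4_000, 16_000, 64_000]
error_rate = [0.5 * t ** -0.3 for t in tokens]

slope = loglog_slope(tokens, error_rate)
print(f"fitted exponent: {slope:.3f}")
```

Reporting that exponent per model (or just the full curve) says more than a single score at whatever token budget happened to be the default.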

swyx · 2026-06-09T12:53:30 1781009610

ok i mean i agree, but how is it a gaping hole when it's literally the second (and third and fourth..) chart on the post? yes token cost and reasoning efficiency are important, hence the 2D pareto charts

ssivark · 2026-06-09T17:19:11 1781025551

My apologies... I was responding to the above comment / ranting about the general trend and got carried away. It wasn't directed specifically at your post.

I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.

swyx · 2026-06-09T18:51:39 1781031099

1 dimension is unfortunately all the mental bandwidth that talking heads have.

ssivark · 2026-06-08T20:34:49 1780950889

Isn't the ovum supposed to be a single cell? Eggs of various species can be substantially larger than this.

lmm · 2026-06-08T23:50:56 1780962656

Yes. I remember reading that Ostrich eggs are the largest single cells (in terms of mass/volume; Blue Whale nerve cells are longer).

ssivark · 2026-06-07T22:12:10 1780870330

Interesting categorical framework. It helps make precise the distinctions between interpolation (retrieval), extrapolation (composition/search), and discovery.

ssivark · 2026-06-07T11:18:37 1780831117

How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's low-latency RAM usage?

marcosdumay · 2026-06-07T16:17:12 1780849032

As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.

Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.

trumpdong · 2026-06-07T17:01:24 1780851684

We might even decide to put 32GB of high-latency cache on the system board and then 12GB of throughput-optimized main memory close to the GPU. ;)

marcosdumay · 2026-06-07T17:45:43 1780854343

You meant 128GB (instead of 12GB)?

And yes, an L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.

It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.

trumpdong · 2026-06-07T22:33:55 1780871635

I mean 12GB, an amount that is typical in such a system today, which you can buy at any computer store.

saagarjha · 2026-06-08T08:17:31 1780906651

Yeah but unfortunately I hear trying to get more than that is quite hard

marcosdumay · 2026-06-08T14:19:09 1780928349

Oh, I entirely misunderstood your comment :)

Melatonic · 2026-06-07T19:19:14 1780859954

I think that's basically what Cerebras is doing?

ssivark · 2026-06-07T11:14:52 1780830892

1. Do you expose this dependency graph so folks can play with it / build interesting things on top? An interesting example would be to understand whether/how a version bump on one of your dependencies might affect your code.

2. What would it take to add a new language? I'm interested in using this with Julia.

ssivark · 2026-06-07T11:04:44 1780830284

Note that any cache (e.g. LRU eviction) is just a specific speculative model for future usage :-)

The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.
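
A small sketch of that idea, with made-up names (`LRUCache`, `compute`): the eviction policy is the "prediction" that recently used keys will be used again, and the hit/miss counters show how often that guess was right. The backing store here is a cheap function, which is the blurry function/data line mentioned above.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache in front of a compute function (illustrative only)."""

    def __init__(self, capacity, compute):
        self.capacity = capacity
        self.compute = compute      # fallback when the "prediction" misses
        self.store = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.compute(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return value

cache = LRUCache(capacity=2, compute=lambda n: n * n)
for key in [1, 2, 1, 3, 2]:
    cache.get(key)

print(f"hits={cache.hits} misses={cache.misses} cached={list(cache.store)}")
```

On this access pattern the recency guess is right once and wrong for key 2, which was evicted just before it was needed again.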

mycall · 2026-06-07T12:28:45 1780835325

Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?

ssivark · 2026-06-05T05:36:28 1780637788

Would it have killed them to use a comma instead?!

ssivark · 2026-06-04T10:46:01 1780569961

Maybe have a second model that is configured to nudge the first model in the direction of exploration, and have the two of them work in tandem?

ssivark · 2026-06-03T19:13:25 1780514005

Even if companies decided to move away from expensive models from the major labs, it's probably much more economical to pay a cloud provider to host some open-weights model. That cost could be amortized across all (internal) users, with inference done at a substantial batch size. Giving everyone their own hardware means the company would need to provision for peak usage and run inference at a batch size of one.
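
Back-of-the-envelope version, with entirely hypothetical numbers (user count, peak concurrency, and batch capacity per device are assumptions picked for illustration, and `devices_needed_shared` is a made-up helper):

```python
import math

def devices_needed_shared(peak_concurrent_requests: int, batch_per_device: int) -> int:
    """Devices needed when requests are pooled and batched on shared hardware."""
    return math.ceil(peak_concurrent_requests / batch_per_device)

# Hypothetical company, numbers chosen only for illustration.
users = 1000
peak_concurrent_requests = 200   # assume 20% of users are active at peak
batch_per_device = 32            # assumed batch size one device can serve

shared = devices_needed_shared(peak_concurrent_requests, batch_per_device)
dedicated = users                # one device per user, each at batch size 1

print(f"shared: {shared} devices, dedicated: {dedicated} devices")
```

The exact ratio depends heavily on the assumed numbers, but the gap comes from the two effects named above: pooling peak demand and batching.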