Hacker Newsnew | past | comments | ask | show | jobs | submit | ssivark's commentslogin

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?


You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.

When doing auto regressive inference, how often do you do a CUDA kernel call? What is the main bottleneck at the throughputs you're operating?

I don't specifically care about Claude -vs- GPT, but comparing models at different amounts of test time compute is a gaping hole. It also means that any unreasonably-expensive token guzzling white-elephant model can top all the benchmarks and still be useless.

What we actually have is like a scaling law for test time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!

Noam Brown also raised this issue recently: https://x.com/polynoamial/status/2064210146558136827


ok i mean i agree, how is it a gaping hole when its literally the second (and third and fourth..) chart on the post? yes token cost and reasoning efficiency is important, hence the 2D pareto charts

My apologies... I was responding to the above comment / ranting about the general trend and got carried away. Wasn't directed at specifically at your post.

I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.


1 dimension is unfortunately all the mental bandwidth that talking heads have.

Isn't the ovum supposed to be a single cell? Eggs of various species can be substantially larger than this.

Yes. I remember reading that Ostrich eggs are the largest single cells (in terms of mass/volume; Blue Whale nerve cells are longer).

Interesting categorical framework. It helps make precise the distinctions between interpolation (retrieval), extrapolation (composition/search), and discovery.

How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's the low-latency RAM usage?

As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.

Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.


We might even decide to put 32GB of high-latency cache on the system board and then 12GB of throughput-optimized main memory close to the GPU. ;)

You meant a 128GB (instead of 12GB)?

And yes, a L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.

It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.


I mean 12GB, an amount that is typical in such a system today, which you can buy at any computer store.

Yeah but unfortunately I hear trying to get more than that is quite hard

Oh, I entirely misunderstood your comment :)

I think that's basically what Cerebras doing ?

1. Do you expose this dependency graph so folks can play with it / build interesting things on top? An interesting example would be to understand whether/how a version bump on one of your dependencies might affect your code.

2. What would it take to add a new language? I'm interested in using this with Julia.


Note that any cache (eg LRU-eviction) is just a specific speculative model for future usage :-)

The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.


Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?

Would it have killed them to use a comma instead?!

Maybe have a second model that is configured to nudge the first model in the direction of exploration, and have the two of them work in tandem?

Even if companies decided to move away from expensive models from the major labs, it probably much more economical to pay a cloud provider to host some open weights model which could then be amortized across all (internal) users and do inference at a substantial batch size, rather than giving everyone their own hardware -- which means the company would need to provision for peak usage and inference at batch size of one.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: