When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?
EDIT: Oh, on second read, do you mean you're running the model on an FPGA?
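To put rough numbers on the overhead concern, here is a back-of-the-envelope sketch. It assumes the 100k tok/s is a single sequential stream rather than an aggregate over a batch. The launch count and per-launch overhead are illustrative guesses, not measurements.

```python
# Per-token time budget vs. kernel launch overhead.
# All inputs are illustrative assumptions, not measured values.

def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available per token, in microseconds, for one sequential stream."""
    return 1e6 / tokens_per_second


def launch_overhead_fraction(tokens_per_second: float,
                             launches_per_token: int,
                             overhead_us_per_launch: float) -> float:
    """Fraction of the per-token budget consumed by launch overhead alone."""
    budget = per_token_budget_us(tokens_per_second)
    return launches_per_token * overhead_us_per_launch / budget


budget = per_token_budget_us(100_000)
# Hypothetical: 4 kernel launches per token at 5 microseconds each.
fraction = launch_overhead_fraction(100_000, launches_per_token=4,
                                    overhead_us_per_launch=5.0)
print(f"budget per token: {budget:.1f} us, overhead fraction: {fraction:.1f}x")
```

Under these assumptions the overhead alone is a multiple of the whole per-token budget, which is why the question matters even with an otherwise ideal architecture.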
I don't specifically care about Claude -vs- GPT, but comparing models at different amounts of test time compute is a gaping hole. It also means that any unreasonably-expensive token guzzling white-elephant model can top all the benchmarks and still be useless.
What we actually have is like a scaling law for test time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!
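A minimal sketch of what "characterize the slope" could look like: fit y = a * x^b by least squares in log-log space and report the exponent b per model. The data points are synthetic, and a pure power law is itself an assumption, since real benchmark scores saturate.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space -> prefactor
    return a, b


# Synthetic (tokens, score) points for one hypothetical model.
tokens = [1_000, 4_000, 16_000, 64_000]
scores = [2.0 * t ** 0.25 for t in tokens]

a, b = fit_power_law(tokens, scores)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

Comparing b across models says how well each one converts extra tokens into score, which a single bar per model cannot show.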
ok i mean i agree, but how is it a gaping hole when it's literally the second (and third and fourth..) chart on the post?
yes token cost and reasoning efficiency is important, hence the 2D pareto charts
My apologies... I was responding to the above comment / ranting about the general trend and got carried away. It wasn't directed specifically at your post.
I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.
Interesting categorical framework. It helps make precise the distinctions between interpolation (retrieval), extrapolation (composition/search), and discovery.
How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's low-latency RAM usage?
As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.
Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.
And yes, an L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.
It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.
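The trade-off can be illustrated with the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty. The latencies and miss rates below are made up purely to show the shape of the problem, not taken from any real CPU.

```python
def amat_ns(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average memory access time for one cache level backed by main memory."""
    return hit_ns + miss_rate * miss_penalty_ns


# Hypothetical baseline: smaller L3, conventional main memory.
small_l3 = amat_ns(hit_ns=10.0, miss_rate=0.20, miss_penalty_ns=80.0)

# Hypothetical alternative: larger L3 (fewer misses, slower hits)
# in front of higher-latency unified memory.
big_l3 = amat_ns(hit_ns=14.0, miss_rate=0.15, miss_penalty_ns=100.0)

print(f"small L3: {small_l3:.1f} ns, big L3: {big_l3:.1f} ns")
```

With these invented numbers the bigger cache ends up slower on average, because the lower miss rate does not pay for the slower hits and the slower memory behind it. Different numbers flip the result, which is the point: it is not something a simple decision settles.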
1. Do you expose this dependency graph so folks can play with it / build interesting things on top? An interesting example would be to understand whether/how a version bump on one of your dependencies might affect your code.
2. What would it take to add a new language? I'm interested in using this with Julia.
Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?
Even if companies decided to move away from expensive models from the major labs, it's probably much more economical to pay a cloud provider to host an open-weights model. That cost could be amortized across all (internal) users, with inference running at a substantial batch size. Giving everyone their own hardware would mean provisioning for peak usage and running inference at a batch size of one.