Hahaha this hits home too hard. Back in the early 2000s people would moan whenever they spotted a hint of autotune; in 2026 it's the industry standard.
I think it really speaks to people's incredible ability to be stuck in the past, rather than the new technology being "bad".
This is an amazing comment. I'm old. I was born in the 70s, grew up in the 80s and 90s and miss those times so much. But that is because I was young, immortal, the world was mine to discover.
In 20 years people will be missing the 2020s too. It is just human nature to complain.
That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.
TL;DR it's very effective because it tests models directly on REAL codebases: "The benchmark is constructed from GPL-style copyleft repositories and private proprietary codebases". The use case is very real.
It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I see in the paper, they do not evaluate the architectural quality of a task completion, only whether the model is capable of completing it at all.
Could you elaborate? I hear some people say a big model should be driving a smaller model, and others say a small model should be driving a bigger model.
When I have an expensive task that is clearly defined, I will get opus to write an LLM workflow for it, and then I will execute it with a smaller model. (Starting with the smallest one, and then upgrading if the task fails.)
But this is a single well defined task, designed by me and Opus in concert. If I need ongoing agentic work, Opus would be too expensive. I'm not sure if Haiku is big enough to be the driver yet. And Sonnet is probably too big! Haha.
(Grok looks promising, optics aside... Grok 4 Fast was almost there but not quite. Great for interactive / realtime (steered) work though.)
But I'm thinking you need a smallish model which can delegate both up and down. I'm not exactly sure what that looks like though. Because the model needs to be big enough to know when it's struggling, instead of pattern matching to something stupid and getting stuck in a loop trying to solve it the wrong way.
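A minimal sketch of the "start with the smallest model, upgrade if the task fails" pattern described above. The model names and the `run_task` function are placeholders I made up for illustration, not a real SDK call:

```python
# Hypothetical escalation loop: try the cheapest model tier first,
# upgrade only when the task fails. run_task() is a stand-in for
# actually executing an LLM workflow with a given model.

MODEL_TIERS = ["haiku", "sonnet", "opus"]  # cheapest first

def run_task(model: str, task: str) -> bool:
    """Stand-in for running the task with a model and checking the result.
    For the demo we pretend only 'sonnet' and above succeed."""
    return MODEL_TIERS.index(model) >= MODEL_TIERS.index("sonnet")

def run_with_escalation(task: str) -> str:
    """Walk the tiers from cheapest to most expensive; return the
    first model that completes the task."""
    for model in MODEL_TIERS:
        if run_task(model, task):
            return model
    raise RuntimeError("task failed on every tier")

print(run_with_escalation("well-defined task"))  # -> sonnet
```

The real work, of course, is defining "task failed" well enough that the check is cheap and reliable.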
All of the major models' memory is handled by smaller, more specific models.
I do not know about the future, but I believe that, like the human brain (the amygdala + cerebral cortex), AGI will have smaller but more specific submodels running in parallel to craft a compelling heuristic.
The thing about Spotify is that it is NOT driven by record labels; it is a platform for the individual, meaning an individual can upload their music in a laissez-faire situation.
If they disallow AI artists tomorrow, they are going against what they created the company for.