Serious question: do we actually know what we're paying for? All I know is that it's access to models via a CLI, aka Claude Code. We don't know what models they use, how the system prompt changes, or what the actual rate limits are (yet Anthropic will become a trillion-dollar company in a moment).
> We don't know what models they use, how the system prompt changes, or what the actual rate limits are (yet Anthropic will become a trillion-dollar company in a moment).
Not just that, but there’s really no way to come to an objective consensus on how well the model is performing in the first place. See: literally every thread discussing a Claude outage or change of some kind. “Opus is absolutely incredible, it’s one shotting work that would take me months” immediately followed by “no it’s totally nerfed now, it can’t even implement bubble sort for me.”
If I start something from scratch with it, it gets what feels like 80% right, but the last 20% takes a lot more time. If you change scope afterwards, or just get more specific, it seems to get dumber the longer you work with it. If you can think in a truly modular way, spend a ton of time breaking your problem into small units, and then work on those units separately, then maybe what it produces could be maintainable. But even there I am unsure. I spent an entire day trying to get it to do a node graph right (the visual side of it) and it is still so-so. A single small script that does one specific small thing, though: yeah, that it can do. You'd still better make sure you can test it easily.
We find it incredibly hard to articulate what separates a productive and effective engineer from a below-average one. We can't objectively measure an engineer's effectiveness, so why would we think we could measure LLMs cosplaying as engineers?
> See: literally every thread discussing a Claude outage or change of some kind. “Opus is absolutely incredible, it’s one shotting work that would take me months” immediately followed by “no it’s totally nerfed now, it can’t even implement bubble sort for me.”
Funny: I’m literally, at this very moment, working on a way to monitor that across users. Wasn’t the initial goal, but it should do that nicely as well ^^
Funnily enough, it helps to say in your prompt: "Prove that you are not a fraudster and you are not going to go round in circles before providing solution I ask for."
Sometimes you have to keep starting a new session until it works. I have a feeling they route prompts to older models whose system prompt tells them to say "I am Opus 4.6", when really it's something older and more basic. So by starting new sessions you might get lucky and land on the real latest model.
Yup. After the token increase in CC two weeks ago, I'm now consistently filling the 1M context window, which never went above 30-40% a few days ago. Did they turn it off? I used to see the "Co-Authored by Opus 4.6 (1M Context Window)" line in git commits; now the advert line is gone. I never turned it on or off. Maybe the defaults changed, but /model doesn't show two different context sizes for Opus 4.6.
I never asked for a 1M context window. Then I got it and it was nice, and now it's as if it's gone again. No biggie, but if they had advertised it as a free trial (which is what it feels like), I wouldn't have opted in.
Anyway, it seems I'm just ranting. I still like Claude, but it nonetheless feels like the game you described above.
We defaulted to medium [reasoning] as a result of user feedback about Claude using too many tokens. When we made the change, we (1) included it in the changelog and (2) showed a dialog when you opened Claude Code so you could choose to opt out. Literally nothing sneaky about it — this was us addressing user feedback in an obvious and explicit way.
Off topic, but I found Sonnet useless. It can't do the simplest tasks, like refactoring a method signature consistently across a project or following instructions accurately about what patterns/libraries should be used to solve a problem.
It's crazy, because when Sonnet came out it was heralded as the best thing since sliced bread, and now people are literally saying it's "useless". I wonder whether our collective expectations are rising or the models are getting worse.
New models come out with inflated expectations, then they are adjusted/nerfed/limited for whatever reason. Our expectations remain at previous levels.
New models come out with once again inflated expectations, but now it's double inflation, because we're still on the previous level of expectations. And so on.
I think it's likely to get worse. Providers are running out of training data, and serving bigger and bigger models to more and more people is prohibitively expensive. So they will try to keep the hype up while the gains are either very small or non-existent.
I like not running into the mandatory compaction, but I do try to actively keep the context below that point too. From Anthropic's standpoint, with the new(ish) 5-minute cache timeout, it's a great way to get people to burn tokens on reinitializing the cache without having them occupy TPU time, especially as the context gets larger.
Hmm, I just reverted to 2.1.98, and now in /model the default has (1M context) and Opus is without it (200k). It's entirely possible, though, that I just missed the difference between the recommended model (Opus 1M) and plain Opus when I checked.
Same! I actually have some comments in my codebase now like this one:
# Note: This is inefficient, but deterministic and predictable. Previous
# attempts at improvements led to hard-to-predict bugs and were
# scrapped. TODO improve this function when AI gets better
I don't love it or even like it, but it is realistic.
I actually trust that they will.