We rolled it out across ~1k engineers, and the biggest issue wasn't model quality but observability. Nobody could tell me whether the agent was stuck in a loop, which sessions were expensive, or what the cache hit rate looked like. Without that visibility you can't distinguish "the model is bad" from "my setup is bad." Most of the complaints we got early on turned out to be config problems.
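
To make those three questions concrete, here is a minimal sketch of a per-session rollup. Everything in it is an assumption for illustration: the `AgentEvent` record, its field names, and the `summarize_sessions` helper are hypothetical and not taken from any particular tool. It also assumes cached tokens are counted as a subset of input tokens, uses total tokens as a stand-in for cost, and treats "the same tool call with the same arguments repeated N times in a session" as a crude loop signal.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentEvent:
    """One model/tool step in an agent session (hypothetical schema)."""
    session_id: str
    tool: str
    args: str
    input_tokens: int    # all prompt tokens for this step
    cached_tokens: int   # assumed to be the cached subset of input_tokens
    output_tokens: int


def summarize_sessions(events, loop_threshold=3):
    """Roll events up into per-session token totals, cache hit rate, and a loop flag."""
    sessions = {}
    for e in events:
        s = sessions.setdefault(
            e.session_id,
            {"input": 0, "cached": 0, "output": 0, "calls": Counter()},
        )
        s["input"] += e.input_tokens
        s["cached"] += e.cached_tokens
        s["output"] += e.output_tokens
        s["calls"][(e.tool, e.args)] += 1  # count identical tool invocations

    report = {}
    for sid, s in sessions.items():
        most_repeated = max(s["calls"].values())
        report[sid] = {
            # Proxy for "expensive": total tokens, not a dollar figure.
            "total_tokens": s["input"] + s["output"],
            "cache_hit_rate": s["cached"] / s["input"] if s["input"] else 0.0,
            # Heuristic only: repeats need not be consecutive to trip this.
            "possible_loop": most_repeated >= loop_threshold,
        }
    return report


# Made-up sample data: session "a" rereads the same file three times.
events = [
    AgentEvent("a", "read_file", "foo.py", 1000, 800, 50),
    AgentEvent("a", "read_file", "foo.py", 1000, 800, 50),
    AgentEvent("a", "read_file", "foo.py", 1000, 800, 50),
    AgentEvent("b", "grep", "TODO", 500, 0, 20),
]
report = summarize_sessions(events)
for sid, row in sorted(report.items(), key=lambda kv: -kv[1]["total_tokens"]):
    print(sid, row)
```

Sorting by total tokens puts the expensive sessions at the top, and a session with a low cache hit rate or a tripped loop flag is more likely a setup problem than a model problem. The threshold of 3 is arbitrary and would need tuning against real traffic.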