If you read the Mythos report, in which they discuss and account for contamination substantially, it still suggests that performance on SWE-bench Verified is meaningful. Benchmarks, including SWE-bench, can absolutely be gamed, but if you're not explicitly benchmaxxing, gains on SWE-bench still reflect real model improvements, at least up to the level of Mythos.