Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

If you read the mythos report, in which they discuss and account for contamination substantially, it still suggests that performance on SWE-bench verified is meaningful. Benchmarks, including SWE-bench can absolutely be gamed, but if you're not explicitly benchmaxxing, improving on SWE-bench still measures model improvements, at least up to the level of Mythos.
 help



Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: