
fulafel 3 days ago \| on: SWE-bench Verified no longer measures frontier cod...

There are more details under the "Too narrow and too wide tests" heading. It would be interesting to see a deeper investigation into how the models are dealing with this, and whether the successful ones seem to have been trained on the benchmark.
