These so-called “naughty strings” expose implementation errors in code that processes widely used formal languages such as numeric literals, URLs, HTML, JSON, SQL, etc. These languages are so widely used that it's criminal not to have formal specifications for them. And the mathematical techniques for constructing programs that meet their formal specifications are very well known.
Fine, imagine we have a formal specification for SQL. Now how do I make sure my parser is compliant with the spec without testing it? Formal verification is a very active research area; I don't think this is quite as easy as you're implying. How do I avoid "implementation errors" without exposing them?
(0) Design languages so that implementation errors are harder to make and easier to detect. For example, avoid context-sensitive grammars like the plague.
(1) Design implementations (parsers, code generators, etc.) with the logical argument for their correctness in mind. That is, don't attempt to verify an existing possibly incorrect program - write it to be correct right from the beginning! This is greatly aided by designs that meet criterion (0).
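The context-sensitivity trap in (0) can be sketched with the classic C example: `a * b;` parses as a declaration if `a` names a typedef, and as a multiplication otherwise, so the grammar alone can't decide. A toy illustration (hypothetical `parse_stmt` helper, not real C parsing):

```python
# Toy model of C's declaration/expression ambiguity: the same four
# tokens parse differently depending on a symbol table, i.e. context.
def parse_stmt(tokens, typedef_names):
    a, star, b, semi = tokens
    if a in typedef_names:
        # "a * b;" declares b as a pointer to type a
        return ("declare", b, "pointer to " + a)
    # otherwise it's the expression a times b
    return ("multiply", a, b)

tokens = ["a", "*", "b", ";"]
assert parse_stmt(tokens, typedef_names={"a"}) == ("declare", "b", "pointer to a")
assert parse_stmt(tokens, typedef_names=set()) == ("multiply", "a", "b")
```

A context-free grammar would make `parse_stmt` a pure function of the tokens; here correctness depends on state threaded in from elsewhere, which is exactly where implementation errors hide.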
These are obviously useful ideas, but "write it to be correct from the beginning"? Are you serious? This is the oldest joke in software engineering. "Don't worry about testing it, I don't make mistakes." No matter how idiot-proof your languages and frameworks are, it is grossly irresponsible to not test work that a human has done. Until developers are themselves replaced by formally verified programs, testing is an absolute necessity.
I doubt human programmers can be fully replaced, and I'm not saying testing is completely useless. But the sheer number of “naughty strings” in that list is an indictment of our languages: they have way too many corner cases, way too many traps for us to fall into.
I still don't understand your logic. Are you saying once a program passes a test, we should stop using that test? The point of this list is to cover all classes of input in general, not just ones that a specific framework has issues with.
These are corner cases in the concept of user input, not just corner cases of any specific parser. What if it's a number, what if it's not? What if it's the same alphabet as the code, what if it's not? What if it is valid code? What if it's empty, what if it's not? etc. Even if you've written the perfect parser in the perfect language, you still need to have unit tests for all of this stuff. They are traps caused by human definitions of "input" and "string", which cannot be formally verified.
> Are you saying once a program passes a test, we should stop using that test?
No. I'm saying that programs have to be proven correct. Then you can use tests to rule out other pesky problems that have nothing to do with your design being incorrect. (For example, you could prove a program correct on paper, then transcribe it incorrectly to a computer. It has happened to me before.)
> These are corner cases in the concept of user input
“undefined” and “null” aren't special cases in the concept of user input - they're special cases in languages that happen to have “undefined” and “null”.
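A minimal illustration using Python's json module: JSON has a null value but no "undefined", and the string "null" is just a string. The trap only exists in layers that conflate a value with its spelling.

```python
import json

# A conforming JSON parser keeps the null value and the string "null"
# strictly apart; a user literally named "null" is legitimate input.
assert json.loads("null") is None
assert json.loads('"null"') == "null"

# Round-tripping preserves the distinction, too.
assert json.dumps(None) == "null"
assert json.dumps("null") == '"null"'
```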
Octal numeric literals aren't special cases in the concept of number - they're special cases in languages where octal literals begin with the prefix “0”, rather than something more sensible like “0o”.
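Python 3 is one language that fixed this: the bare `0` prefix for octal literals was removed precisely because it was a trap, and the explicit `0o` prefix is required instead. A quick sketch:

```python
# C-style "0"-prefixed octal literals are a SyntaxError in Python 3;
# the unambiguous 0o prefix is required: 0o10 is eight, not ten.
assert 0o10 == 8

# int() with base 0 follows literal syntax, so the ambiguous form is
# rejected instead of being silently reinterpreted as octal.
assert int("0o10", 0) == 8
try:
    int("010", 0)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```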
Failing to distinguish between escaped and unescaped strings is also a language problem - they should have different types!
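Even in a dynamically typed language you can approximate this with a marker type (a sketch; the `Escaped` wrapper here is hypothetical, in the spirit of libraries like MarkupSafe):

```python
import html

class Escaped(str):
    """Marker type: the payload is already HTML-escaped."""

def escape(raw: str) -> Escaped:
    return Escaped(html.escape(raw))

def render(body: Escaped) -> str:
    # By accepting only Escaped, a type checker can flag forgotten
    # escaping (passing a raw str) before the code ever runs, and
    # the type makes accidental double-escaping visible in review.
    return "<p>" + body + "</p>"

assert render(escape("<b>user input</b>")) == "<p>&lt;b&gt;user input&lt;/b&gt;</p>"
```

In a statically typed language the same idea becomes a compile-time guarantee rather than a convention.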