Right, htmlspecialchars() works perfectly as long as you use it for an attribute value surrounded by quotes, or if you escape a whole string. The only way you can get into trouble is if you use it on an attribute value that doesn't have quotes.
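To make the unquoted-attribute case concrete, here is a minimal sketch. The input string and the `<input>` markup are made up for the example and do not come from a real application.

```php
<?php
// Hypothetical attacker-supplied value: it contains none of < > & " '
$input = 'x onmouseover=alert(1)';

// Nothing here needs HTML-escaping, so the string comes back unchanged
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Unquoted: the space ends the value, and onmouseover becomes a new attribute
$unquoted = '<input value=' . $escaped . '>';

// Quoted: the whole string stays inside the value
$quoted = '<input value="' . $escaped . '">';

echo $unquoted, "\n", $quoted, "\n";
```

The escaping function did nothing wrong in either case; only the missing quotes change how the browser splits the tag into attributes.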
If you are using PHP and want to accept any user input that should be interpreted as HTML, you basically need to be using http://htmlpurifier.org/. If you are going to be accepting text input, clean the input value to ensure proper character encoding is being used (important for multi-byte encoding such as UTF-8) and then use htmlspecialchars(), and make sure to specify your encoding as the third parameter.
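    // Specify your encoding to the browser
    header('Content-Type: text/html; charset=utf-8');

    // Return the named GET parameter, stripped of invalid UTF-8 and HTML-escaped
    function get_escape($field) {
        // Treat missing or non-string values (e.g. ?title[]=x) as empty
        $raw = (isset($_GET[$field]) && is_string($_GET[$field])) ? $_GET[$field] : '';
        // Drop byte sequences that are not valid UTF-8
        $value = iconv('UTF-8', 'UTF-8//IGNORE', $raw);
        // ENT_COMPAT escapes double quotes, matching the double-quoted attribute below
        return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
    }

    // Safely output text user input
    echo '<html><body><p title="' . get_escape('title') . '">' . get_escape('content') . '</p></body></html>';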
That's actually not true, and htmlspecialchars() will automatically do the UTF-8 validation for you as long as you have set your charset correctly. But even if you do that, you are still vulnerable.
Inside on* handlers and style attributes, the rules are different. Take something like this:
htmlspecialchars() does its job here. With ENT_QUOTES it turns a single quote into &#039;. However, inside on* and style attributes the &#039; entity is treated as a raw single quote. You need to double-escape in this particular case to be safe, or better yet, don't use raw on* handlers and style attributes. There are much cleaner ways to do those, but if you have to, don't ever put user data in them because you will mess up the escaping.
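A minimal sketch of that failure mode. The greet() handler and the input string are hypothetical, and html_entity_decode() only stands in, roughly, for the entity decoding a browser applies to an attribute value before handing it to the script engine.

```php
<?php
// Hypothetical attacker-controlled value
$input = "'); alert(document.cookie); //";

// HTML-level escaping: the single quote becomes &#039;
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// The markup looks harmless when you read the source
$html = '<a href="#" onclick="greet(\'' . $escaped . '\')">hi</a>';

// Approximation of what the script engine receives once the
// attribute value has had its entities decoded
$handlerSource = html_entity_decode("greet('" . $escaped . "')", ENT_QUOTES, 'UTF-8');

echo $html, "\n", $handlerSource, "\n";
```

The decoded handler source contains the raw quote again, so the attacker's text closes the string literal and runs as code.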
It is not possible to write a single generic HTML escaping function that will work in all contexts. If it were, I would have written htmlspecialchars() differently.
There are more examples of how you can mess up, even if you always quote your attributes, if your escaping function isn't smart. The UTF-7 hack was mentioned, which is good, but the invalid UTF-8 hack wasn't explained. That is, if you send an invalid UTF-8 sequence, like %E0, then certain browsers (well, just IE) will lose their minds unless you make sure you don't display that invalid UTF-8 sequence back to the user. So htmlspecialchars() does more than just escape the set of chars you mentioned; it also validates the characters and makes sure it never outputs an invalid UTF-8 byte sequence.
0xE0 by itself is the first byte of a 3-byte UTF-8 char, and IE will simply eat the following 2 bytes to make up the char. So if you output: "<e0>"> then, even though the byte is inside quotes, IE will eat the following "> and replace those 3 bytes with the dreaded (?) char. More disastrously, it will think it is still inside the quoted attribute, so the next raw quote it sees will end the attribute and you have yourself another quoted XSS hole.
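A quick way to see that validation in action, as a sketch that assumes PHP 5.4 or later (for ENT_SUBSTITUTE) and passes the flags explicitly. The input bytes are made up to match the %E0 case above.

```php
<?php
// Hypothetical input: a lone 0xE0 lead byte followed by "> (an incomplete UTF-8 sequence)
$input = "\xE0\">";

// Without ENT_SUBSTITUTE or ENT_IGNORE, invalid input yields an empty string
$strict = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
var_dump($strict === '');

// With ENT_SUBSTITUTE, the bad byte is replaced by U+FFFD rather than passed through
$substituted = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
var_dump(strpos($substituted, "\xE0") === false);
```

Either way the stray lead byte never reaches the browser, so it cannot swallow the closing quote.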
I'm not trying to be overly pedantic with this reply, but I think when it comes to security related topics it is good to be clear and so I think the conversation should continue until it is clear to readers what is being discussed.
For instance, the original article said:
Please don’t assume that, having read this post, you now know
everything there is to know about HTML escaping. I can
guarantee that you don’t, because I don’t.
Here I can say that using PHP's htmlspecialchars() with a proper encoding and clean data is all you need to know about escaping HTML. In the on* attribute example, you are now discussing the escaping of JavaScript. The reason htmlspecialchars() fails there is not that it is broken, but that you are using it on the wrong language.
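A minimal sketch of escaping each language separately, assuming the value is meant to be a JavaScript string argument inside a double-quoted on* attribute. The greet() function and the input are hypothetical, and json_encode() with the JSON_HEX_* flags is just one common way to handle the JavaScript layer, not something proposed in this thread.

```php
<?php
// Hypothetical attacker-controlled value
$input = "'); alert(1); //";

// Layer 1 (JavaScript): encode as a JS string literal; the JSON_HEX_* flags
// turn < > & ' " inside the string into \uXXXX escapes
$js = json_encode($input, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);

// Layer 2 (HTML): escape the literal's surrounding double quotes for the attribute
$attr = htmlspecialchars($js, ENT_QUOTES, 'UTF-8');

echo '<a href="#" onclick="greet(' . $attr . ')">hi</a>', "\n";
```

After the browser decodes the attribute, the script engine sees one string literal with the quote written as \u0027, so the input stays data. Keeping user data out of inline handlers altogether remains the simpler option.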
In terms of the UTF-7 issue with IE, the example code I posted properly handles those situations since it uses iconv() to clean the UTF-8. Really we've now changed from talking about escaping HTML output to cleaning user input.
I think the most important thing for people to understand is that there is a lot of knowledge required to write a secure PHP application. You don't just need to worry about escaping HTML and SQL injection.
If anyone is interested in learning more about the various aspects, I gave a talk at BostonPHP and the Boston Security Meetup last year that was a survey of PHP security. You can find the slides at http://wbond.net/security. They are HTML-based and contain links to learn even more.
> Inside on* handlers and style attributes, the rules are different.
That may be so, but I'm not sure I'd call that an “HTML” escaping problem per se. An attribute always has an additional syntax, and you have to account for the subsyntaxes of whatever attributes are in place—but those aren't at the HTML layer proper, only implied by it. E.g., a's href attribute takes a URI. onfoo attributes take JavaScript code. style attributes take CSS. So in order to make HTML “safe” (FSVO “safe”) when it contains those attributes, you have to make those values “safe” recursively according to their subsyntaxes—e.g., if you want to allow CSS with url(), then you have a URI inside CSS inside an HTML attribute inside an HTML document inside (for instance) a UTF-8 string, and you have to take all the layers into account.
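As a small illustration of handling the layers one at a time, here is a sketch for the simplest nesting mentioned above: a URI inside an HTML attribute. The /search path, the q parameter, and the input are all hypothetical.

```php
<?php
// Hypothetical user-supplied search term
$term = 'a&b "c" <d>';

// Innermost layer (URI): percent-encode the value as a query component
$url = '/search?q=' . rawurlencode($term);

// Outer layer (HTML attribute): escape the finished URI for a quoted attribute
$link = '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">search</a>';

echo $link, "\n";
```

Each layer is escaped with the function that belongs to its own syntax, working from the inside out. Deeper nestings such as url() inside CSS inside a style attribute would need one more step per layer.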
It may be that a lot of people don't realize there are several potential layers of syntax involved, think of it as a monolithic and simple thing, and then get confused when it is not. Things like PHP's htmlspecialchars() can inadvertently encourage this kind of inaccurate view.
(I'm not disagreeing with you exactly, just describing from a variant perspective. A bit of redundancy in discussion can create an antialiasing-like effect.)
That certainly wasn't my intention specifically. I mentioned htmlspecialchars because it was the example being used upthread, but any API with analogous functionality is potentially subject to similar provisos, and if any surrounding cultural element encourages its use without thinking through the syntax layers, it can have a similar effect. I assumed this was implicit.