Sunday, April 29, 2012

Cleanse Multi-Lingual User-Input in PHP (A Secure, Working Version of HTMLEntities)

Unfortunately, the official PHP htmlentities() function can mangle non-Latin characters. For instance, そもそも我々は出 often gets turned into stuff like îáóâÃ. This happens when the function is called without a charset on PHP versions before 5.4, where it assumes ISO-8859-1 and treats every byte of a UTF-8 string as a separate character. And if you try to stop the mangling by specifying a charset, the multi-byte characters pass through unencoded, which in my view opens the floodgate to characters that can be used in XSS, phishing, and injection tactics. (Thanks, PHP guys. Brilliant move.) If you want to retain that data, use the mb_encode_numericentity function, which encodes ranges of characters as numeric HTML entities (for example, そ becomes &#12381;). Below is an implementation that encodes only special characters, so that a plain "a" won't be converted to its numeric entity.

The Politics: PHP.net briefly published this function, and then removed it. Probably because it's both more functional and more secure than their own htmlentities() function. (This function catches bit-variable injection hacking tactics, which htmlentities() easily trips over.)

<?php

        // Input:
        //    Hello! <=> そもそも我々は出!
    $user_input = 'Hello! <=> そもそも我々は出!';

        // Conversion map: each group of four integers is
        //    range start, range end, offset, mask.
        // Note: &, " and ' are not covered by this map and pass through unchanged.
    $format_array = array(

            // Handle Chars 0x00 to 0x1f = Control Chars (Tab, Newline, etc.)
        0x0000, 0x001f, 0, 0xffff,

            // Handle Char 0x3c = Less-Than Sign
        0x003c, 0x003c, 0, 0xffff,

            // Handle Char 0x3e = Greater-Than Sign
        0x003e, 0x003e, 0, 0xffff,

            // Handle Chars 0x007f to 0xffff = DEL and All Non-ASCII Chars up to U+FFFF
        0x007f, 0xffff, 0, 0xffff

    );

    $formatted_user_input = mb_encode_numericentity($user_input, $format_array, 'UTF-8');

        // Output:
        //    Hello! &#60;=&#62; &#12381;&#12418;&#12381;&#12418;&#25105;&#12293;&#12399;&#20986;!

?>
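The map in the listing above leaves the ampersand and both quote characters untouched, and it stops at U+FFFF, so characters beyond that (emoji, for instance) are not encoded either. If the output may end up inside an HTML attribute, it is probably worth covering those as well. What follows is a minimal sketch of one way to do that, not the original function: the name encode_user_input is hypothetical, the extra ranges are my own assumption about what should be encoded, and the string type declarations assume PHP 7 or later.

```php
<?php

// Hypothetical helper (not a PHP built-in): wraps mb_encode_numericentity
// with a wider conversion map than the listing above.
function encode_user_input(string $input): string
{
    // Each group of four integers is: range start, range end, offset, mask.
    $convmap = array(
        0x0000, 0x001f,   0, 0xffff,    // control characters
        0x0022, 0x0022,   0, 0xffff,    // double quote
        0x0026, 0x0027,   0, 0xffff,    // ampersand and single quote
        0x003c, 0x003c,   0, 0xffff,    // less-than sign
        0x003e, 0x003e,   0, 0xffff,    // greater-than sign
        0x007f, 0x10ffff, 0, 0x1fffff   // DEL and everything above ASCII
    );

    // Input is assumed to be valid UTF-8; validate it beforehand if unsure.
    return mb_encode_numericentity($input, $convmap, 'UTF-8');
}

echo encode_user_input('<a href="x">Tom & Jerry\'s</a>'), "\n";
    // Output:
    //    &#60;a href=&#34;x&#34;&#62;Tom &#38; Jerry&#39;s&#60;/a&#62;
```

Plain ASCII letters, digits, spaces, and most punctuation still pass through unchanged, so the encoded text stays readable in the page source. Each input character is examined once, so the ampersands that the function itself emits as part of an entity are not encoded a second time.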

Official Function Page: http://php.net/manual/en/function.htmlentities.php

// Note: All code appearing on the PHP Revolution blog by the blog owner is released under the Hacktivismo Enhanced-Source Software License Agreement (HESSLA), unless otherwise noted.  http://www.hacktivismo.com/about/hessla.php
