I work for international clients who all use very different alphabets, so I am finally trying to get an overview of a complete workflow between PHP and MySQL that ensures text in every character encoding is inserted correctly. I have read a bunch of tutorials on this but still have questions (there is much to learn), so I thought I would put it all together here and ask.


header('Content-Type:text/html; charset=UTF-8');


<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>

(though the latter is optional and only a suggestion to the browser, I believe I would rather suggest it than do nothing)


CREATE DATABASE database_name DEFAULT CHARACTER SET utf8; or ALTER DATABASE database_name DEFAULT CHARACTER SET utf8; and/or use utf8_general_ci as the MySQL connection collation.

(it is important to note here that this will increase the database size if it uses varchar)


mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");

Business logic

detect whether the input is not UTF-8 with mb_detect_encoding() and convert it with iconv().
validate overly long sequences of UTF-8 and UTF-16

$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);
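The detect-and-convert step above could be sketched roughly as follows. This is only a minimal illustration, assuming the mbstring extension is loaded; the helper name to_utf8() is hypothetical, and the ISO-8859-1 fallback is an assumption about what non-UTF-8 input would be, not something that can be detected reliably.

```php
<?php
// Hypothetical helper (not part of PHP): return $input as well-formed UTF-8.
// Assumption: anything that is not valid UTF-8 is treated as $fallback.
function to_utf8($input, $fallback = 'ISO-8859-1')
{
    // mb_check_encoding() returns false for byte sequences that are
    // not well-formed in the given encoding.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Re-encode from the assumed source encoding instead of stripping bytes.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$alreadyUtf8 = to_utf8("caf\xC3\xA9"); // valid UTF-8, returned unchanged
$fromLatin1  = to_utf8("caf\xE9");     // lone 0xE9 byte, converted from ISO-8859-1
```

Checking validity first and converting only on failure avoids double-encoding text that is already UTF-8, which is a common side effect of converting unconditionally.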


Is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher, and if so, does this mean I have to use the multibyte functions instead of the core functions, e.g. mb_substr() instead of substr()?
Is it still necessary to check for malformed input strings, and if so, what is a reliable function/class to do so? I would rather not strip bad data, and I don't know enough about transliteration.
Should it really be utf8_general_ci, or rather utf8_bin?
Is there something missing in the above workflow?
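To illustrate what the mb_substr() versus substr() question is about, here is a small sketch, assuming the mbstring extension is loaded. The variable names are illustrative only. It shows how the byte-oriented core functions and the multibyte functions differ on the same UTF-8 string.

```php
<?php
// "été" encoded as UTF-8: 3 characters stored in 5 bytes.
$word = "\xC3\xA9t\xC3\xA9";

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters

$byteCut = substr($word, 0, 1);             // first byte only, i.e. half of "é"
$charCut = mb_substr($word, 0, 1, 'UTF-8'); // the whole first character

// The byte-based cut is no longer well-formed UTF-8; the character-based cut is.
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutValid = mb_check_encoding($charCut, 'UTF-8');
```

The encoding is passed explicitly to each mb_* call here, so the sketch does not depend on whether mb_internal_encoding('UTF-8') has been set.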



asked Sep 7, 2015 by rajesh
