/usr/portage

What to do? 0

During my daily work I often heard discussions about how to handle charset properly. What a server must provide to handle charsets correctly, which configuration for Apache is needed, what options must be set in php.ini to make PHP correctly working, which functions should be avoided when using PHP, which locales must be used and so on. So I want to give a short overview how to sail around common problems in a LAMP-setup.

Kernel

Just to make sure the option CONFIG_NLS_UTF8 is set to y.

Environment

To make sure, newly created filenames are there in UTF-8 and in general VT-input is handled correctly, you have to choose a charset, which comes with an .UTF-8-suffix. For german feel free to choose de_DE.UTF-8. Make sure your glibc is provides this locales. To convert current names of files you can just use convmv. For a desktop you must also adjust the font and set a correct TTY-font but this could be ignored for a server which is just administrated via remote shell.

Webservers in general – focus on Apache

To make sure, the users input is UTF-8, the server has to deliver the correct Content-Type-header. Take a look at the output of wget -S http://usrportage.de, my weblog, which is hosted on Schokokeks.org, a properly configured server (sure!):
wget -S usrportage.de
—21:00:33— http://usrportage.de/ => `index.html’
Resolving usrportage.de… 87.106.4.7
Connecting to usrportage.de|87.106.4.7|:80… connected.
HTTP request sent, awaiting response… HTTP/1.1 200 OK Date: Thu, 13 Jul 2006 19:00:27 GMT Server: Apache X-Powered-By: PHP/5.1.4-pl0-gentoo with Hardening-Patch X-Blog: Serendipity Set-Cookie: PHPSESSID=9da2ded6522851ef8ddc3ebe7590b354; path=/ Expires: Thu, 19 Nov 1981 08:52:00 GMT Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache X-Serendipity-InterfaceLang: de X-FreeTag-Count: Array Connection: close Content-Type: text/html; charset=UTF-8
Length: unspecified [text/html]

[ <=> ] 65,557 250.71K/s

21:00:33 (250.19 KB/s) – `index.html’ saved [65557]


You see the header Content-Type: text/html; charset=UTF-8. (You can also the a bug in S9Y, which poorly casts an array, but anyway.) So your browser is notified, that it should send UTF-8 encoded data. That’s the whole secret. To configure Apache properly, make sure the directive AddDefaultCharset is set to UTF-8.

One thing at last: if you’re using AJAX-functions from Prototype for JavaScript-purposes, you have to reencode the string delivered by the AJAX-call. In PHP the following would work:
$string = utf8_encode( $_POST[‘key’] );

MySQL

Before transacting any data, make sure your connection charset is set to UTF-8:
SET NAMES utf8;
By the way: have I ever mentioned you should ever use mysql_real_escape_string() instead of mysql_escape_string()?

PHP

Just two rules: use mb_string-functions whereever it is possible, set the INI-setting default_charset to UTF-8 and – anyway – don’t use functions from the ereg-family also they have an mb_-Prefix. They aren’t binary-safe, that’s all you need to know.
Also make sure, your sources are UTF-8 encoded. Use iconv to correct those who are not.

Update


I forgot to mention, that the functions htmlentities(), html_entity_decode() and htmlspecialchars() does not reflect PHPs default_charset-directive but assumes iso-8859-15 as the default charset, which is pretty annoying and should be considered as a bug, from my point of view. So you need to pass UTF-8 as the third parameter to the function to make sure it will work properly with Unicode.

Filed under , , , , , , , & no comments & no trackbacks