UTF-8 support

From TYPO3Wiki

Note The option $TYPO3_CONF_VARS['BE']['forceCharset'] has been removed in TYPO3 4.7. If you use TYPO3 4.7 or newer, please configure your system correctly to set your installation to native UTF-8. This information still has to be included below. See the talk page.

Introduction

This page collects information about UTF-8 support in TYPO3, an old topic that is still relevant. There are many options to set and check.

A good start is to make sure that everything in the chain is using UTF-8 encoding, starting with apache.conf, php.ini, my.cnf and ending with the TYPO3 settings.

In some cases not all settings are necessary and everything will run fine without certain changes.

But at least you have a checklist about possible character encoding problems and fixes. :)

General Settings

File system

When setting [BE][forceCharset], all files in the TYPO3 root folder and below are handled as UTF-8 files by TYPO3; so you should make sure that they really are. You should check your HTML-Templates and CSS files for special chars like umlauts. If they are displayed incorrectly, you should fix that by saving the file in UTF-8 format.

When editing such files only use editors which can save files in UTF-8 format.

Attention: Do not save the files in UTF-8 format with Byte Order Mark (BOM). Saving them as UTF-8 with BOM can cause different problems, e.g. thumbnails in the BE will no longer be shown. Save the files as UTF-8 without BOM instead.
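
One way to find files that were saved with a BOM is to look at their first three bytes, which are EF BB BF for a UTF-8 BOM. The following is only a minimal sketch: the function name has_bom and the sample file names are made up for this illustration, and it assumes that head, od and tr are available on your system.

```shell
# has_bom FILE: exit status 0 if FILE starts with the UTF-8 BOM (bytes EF BB BF).
has_bom() {
    # Dump the first three bytes as hex and strip od's spacing: "efbbbf" for a BOM.
    first_bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first_bytes" = "efbbbf" ]
}

# Demo on two throw-away files; \357\273\277 is the BOM written as octal escapes.
tmpdir=$(mktemp -d)
printf '\357\273\277body { color: red; }\n' > "$tmpdir/with_bom.css"
printf 'body { color: red; }\n' > "$tmpdir/without_bom.css"

for f in "$tmpdir"/*.css; do
    if has_bom "$f"; then
        echo "BOM found: ${f##*/}"
    else
        echo "no BOM:    ${f##*/}"
    fi
done
rm -r "$tmpdir"
```

To check a real installation, you could call has_bom in a loop over your HTML templates and CSS files instead of the two demo files.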


Apache: vhost.conf

 AddDefaultCharset utf-8

According to the official Apache docs, this setting specifies the charset a browser should use when displaying a page. You can set it in vhost.conf or in .htaccess; the former is faster. It should override the value of the meta tag in the page (although not all browsers respect that).

In current versions of TYPO3 (at least version 4.3+), setting this explicitly is no longer needed. It will be set automatically, when you set [BE][forceCharset] in the Install Tool (which you must do, see below).

You can check that by inspecting the HTTP header data, for example using the Firefox extension Firebug. You should see a line saying Content-Type: text/html; charset=utf-8.
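
If you prefer the command line over a browser extension, the charset can also be read from the response headers. The following is a sketch under assumptions: charset_from_headers is a hypothetical helper name, the sample headers are made up, and the real-world call shown in the comment assumes that curl is installed and uses example.com as a placeholder host.

```shell
# charset_from_headers: read HTTP response headers on stdin and print the
# charset parameter of the Content-Type header in lower case.
# Prints nothing if the header carries no charset.
charset_from_headers() {
    tr -d '\r' |
        grep -i '^content-type:' |
        sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([^;[:space:]]*\).*/\1/p' |
        tr '[:upper:]' '[:lower:]'
}

# Demo with made-up headers (HTTP separates header lines with CR LF).
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\n' |
    charset_from_headers

# Against a live site (placeholder URL):
#   curl -sI http://example.com/ | charset_from_headers
```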

PHP: php.ini

 default_charset = "utf-8"

With that setting, stand-alone scripts will use this charset too. (By default, this value is empty.)

PHP extensions that should be enabled
 extension=mbstring.so

You can choose either iconv or mbstring to do charset conversions. They are much faster than the PHP implementation that comes with TYPO3.

Comparisons of these methods suggest that mbstring is the best choice. Note that mbstring is not compiled into PHP by default (iconv is, as of PHP 5), so you may have to enable the extension explicitly.

Warning: Do not enable mbstring.func_overload. While it's generally useful in UTF-8 setups, it conflicts with TYPO3's internal character set handling in t3lib_cs.

No matter which of the two extensions you use, you should make sure that it's configured to use UTF-8. You can check that in phpinfo() and correct the settings in php.ini or .htaccess, if needed.

To use the extension, you also need to modify localconf.php (see below).

MySQL: my.cnf

The following sets the system variables for character set and collation for the whole MySQL server. Be careful with this setting! It also affects existing databases, which may use something other than UTF-8 (for example latin1). Set it only if every database on the server is supposed to use UTF-8. You don't need this when you set ['SYS']['setDBinit'] in the Install Tool (see below). If you don't set this option but still want to use the SQL command line client, you should pass --default-character-set=utf8 when connecting to a UTF-8 database.

 [mysqld]
 default_character_set = utf8

Note Since MySQL 5.5.3 default_character_set is no longer supported. Use the following instead:

 [mysqld]
 character-set-server = utf8

Info on dropping support of default_character_set in MySQL 5.5.3

TYPO3 settings

localconf.php

PHP script:
 // For backend charset
 $TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';
 $TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;'; 
 
 // For GIFBUILDER support
 // Set it to 'iconv' or 'mbstring'
 $TYPO3_CONF_VARS['SYS']['t3lib_cs_convMethod'] = 'mbstring';
 // For 'iconv' support you need at least PHP 5.
 $TYPO3_CONF_VARS['SYS']['t3lib_cs_utils'] = 'mbstring';

Note: You can also use iconv instead of mbstring. Though mbstring isn't compiled into PHP by default (whereas iconv is), mbstring is much faster than iconv.

Note $TYPO3_CONF_VARS['SYS']['multiplyDBfieldSize'] has been removed in TYPO3 6.0. In older versions, if your database's encoding was UTF-8, do not set $TYPO3_CONF_VARS['SYS']['multiplyDBfieldSize']. It was only needed if your database was latin1-encoded but the content was UTF-8. When using UTF-8 database encoding, it was not needed and only wasted space.

Issues with [setDBinit]

[setDBinit] contains commands, separated by newlines, that are sent to the database right after connecting. The DBAL extension ignores this option, except for the 'native' type. Please note that each command in [setDBinit] is an SQL statement and thus needs to be terminated with a semicolon.

SET NAMES utf8;

In most cases it is sufficient to add just this one directive to localconf.php:

PHP script:
 $TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;';

SET NAMES utf8; is equivalent to the following three statements:

SQL:
 SET character_set_client = utf8; 
 SET character_set_results = utf8; 
 SET character_set_connection = utf8;

More information in the official MySQL docs.

Without SET NAMES utf8; your TYPO3 UTF-8 setup might work, but chances are that database content entered after the conversion to UTF-8 has each international character stored as two separate, garbled latin1 chars.

If you check your database using phpMyAdmin and find umlauts in new content being shown as two garbled characters, this is the case. If this happens to you, you cannot just add the above statement any more. Your output for the new content will be broken. Instead you have to correct the newly added special chars first. This is done most easily by just deleting the content, setting the option as described above and re-entering it.

SET NAMES utf8; and SET SESSION character_set_server=utf8;

In some configurations a setting for the session is needed, too:

SQL:
 SET NAMES utf8;
 SET SESSION character_set_server=utf8;

It seems like the setting for character_set_server is only needed to create the DB with the right character set. So you don't need it at all, if you already created your DB and if it already uses UTF-8 as character set.

Don't use SET CHARACTER SET utf8;

Warning! The following can create character set problems in TYPO3 that are hard to solve. Avoid using this directive:

SQL:
 SET CHARACTER SET utf8;

According to the official docs this sets the same variables as SET NAMES, but possibly to other values:

SQL:
 SET character_set_client = utf8;
 SET character_set_results = utf8;
 SET collation_connection = @@collation_database;

That way character_set_connection is set to the value of character_set_database, too, causing problems:

If character_set_connection is not "utf8", your transferred UTF-8 encoded data will be UTF-8-encoded again. Together with the data you already had in the database before, you will get a mix of old correctly encoded data and new incorrectly double-encoded data.
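
The effect can be reproduced outside of MySQL with iconv. This is only an illustration under assumptions: it treats latin1 as ISO-8859-1, the helper names hex, double_encode and undo_double_encode are made up, and it uses the letter "ö" (UTF-8 bytes c3 b6) as sample data.

```shell
# hex: print stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# double_encode: read UTF-8 bytes as if they were latin1 and encode them as
# UTF-8 once more. This mimics a connection that is not set to utf8.
double_encode() { iconv -f ISO-8859-1 -t UTF-8; }

# undo_double_encode: reverse exactly one superfluous encoding step.
undo_double_encode() { iconv -f UTF-8 -t ISO-8859-1; }

# \303\266 is "ö" in correctly encoded UTF-8, written as octal escapes.
printf '\303\266' | hex                                        # c3b6
printf '\303\266' | double_encode | hex                        # c383c2b6
printf '\303\266' | double_encode | undo_double_encode | hex   # c3b6
```

The four bytes c3 83 c2 b6 are what a UTF-8 aware tool displays as two garbled characters in place of a single "ö".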

The textual data in your database should be displayed correctly in tools like phpMyAdmin. If you use SET CHARACTER SET utf8;, see wrong characters inside TYPO3 and then "correct" these errors from inside TYPO3, you will destroy the characters in your database and end up with garbled text. More information on that problem.

In short: Just use SET NAMES utf8;.

TypoScript setup

When [BE][forceCharset] is "utf-8" (see above), then config.renderCharset and config.metaCharset will default to "utf-8", too. So if you want UTF-8 output, you don't need to use these options.

Note: When you set config.renderCharset, config.metaCharset will be set to the same value by default. When you set both values, TYPO3 will use renderCharset internally and convert the generated page right before delivering it to the browser.

More information in the TypoScript Reference.

To avoid problems with accented characters in date strings generated by PHP, configure your locale, for example for German or French:

TS TypoScript:
config.locale_all = de_DE.utf-8
TS TypoScript:
config.locale_all = fr_FR.utf-8

Extensions

Collect extensions related information here.

Lowercasing/uppercasing text in extensions

To work with strings in TYPO3 extensions, use the methods in t3lib_cs:

  • UPPERCASING a string:
    $value = $GLOBALS['LANG']->csConvObj->conv_case($GLOBALS['LANG']->charSet, $value, 'toUpper');
  • lowercasing a string:
    $value = $GLOBALS['LANG']->csConvObj->conv_case($GLOBALS['LANG']->charSet, $value, 'toLower');
  • string length:
    $length = $GLOBALS['LANG']->csConvObj->strlen($GLOBALS['LANG']->charSet, $string);


RealURL

One problem is that RealURL might not be able to handle a page title written in non-Latin characters. For example, with a page title in Japanese, the title was not interpreted and the page was rendered as jp.html. Using the Navigation title solves this problem: continuing the example, after setting "home" as the Navigation title, the page was rendered as jp/home.html.

TemplaVoila

Make sure that your templates are saved in UTF-8. It is possible that you have to map them again.

Further information

Database

Database charset

It is highly recommended (although not strictly necessary) to use UTF-8 in the database. Otherwise database sorting functions will not work correctly.

Problem with indexes

You might encounter this error:

 SQL=Specified key was too long; max key length is 1000 bytes:

This particular problem might occur when you are using UTF-8 encoding. MySQL's utf8 character set uses up to 3 bytes per character, and the maximum index length is 1000 bytes, so the effective maximum index length is 1000/3 = 333 characters.

If this error occurs, you should check which part of TYPO3 added the index. If it was added by the TYPO3 Core itself, report the bug at forge.typo3.org. If it was set by an extension, report it to the extension author at forge.typo3.org or wherever their bug tracker is located. If there is no bug tracker for the extension, sending a mail to the extension author may help.

You can work around this issue temporarily by simply removing the index from the field.

Note: Using indexes that big is not recommended anyway and indicates bad DB design.

GIFBUILDER: Use Unicode font files

If you use GIFBUILDER to create text (e.g. in a menu), make sure to use a Unicode font file.

If there still are problems with broken special chars in these images, you should make sure that the configuration for mbstring or iconv (the one which you have chosen in the Install Tool) is set to UTF-8. You can check that in phpinfo() and correct the settings in php.ini or in your web server settings, if needed.

HTML Tidy

If HTML entities like &nbsp; show up as ? in the browser, add the -utf8 option to the [tidy_path] variable in the Install Tool, e.g.:

PHP script:
$TYPO3_CONF_VARS['FE']['tidy_path'] = 'tidy -i --quiet true --tidy-mark true -wrap 0 -raw --output-xhtml true -utf8';


Convert an already existing database to UTF-8

Possibility 1

Note This has been tested and works.

Jigal van Hemert wrote a script to convert a MySQL database to UTF-8. This script converts all columns, tables and settings for the whole database to UTF-8.

Jigal writes:

Read the following very carefully, because you have to make a few adjustments depending on the situation:

  • Always backup your database.
  • The script was intended for the situation in which UTF-8 encoded data is stored in Latin-1 (or other charsets) tables, as was common in 2008. You can recognize this by looking into phpMyAdmin. Watch for characters with accents (diacritics) that are shown as weird double-character combinations; for example, instead of "Ali Gökgöz and Gültekin Tarcan", text shows as "Ali GÃ¶kgÃ¶z and GÃ¼ltekin Tarcan". If this doesn't apply to your situation, comment out lines 108 - 123 (line numbers for the file with the date "26-10-2011" in one of the first lines). If you use a version of the script that does not have a change date in one of the first lines, the script is most probably an older version, in which case the lines to be commented out are 97 - 107.
  • In line 19, the constant SIMULATE is set to true. This activates "dry-run" mode, that is, the tables are not really converted, it's only printed what *would* happen. After you executed the script at least once and there are no errors, you can set this constant to false.
  • Save the script into a subdirectory of the TYPO3 installation, for example inside fileadmin/. It is designed to run from a subdirectory so it can pick up the database connection data from localconf.php.
  • Run the script from your browser: http://example.com/fileadmin/db_utf8_fix.php. It shows each found table and prints a dot after the table name for each column it has converted.

Columns/tables already in UTF-8 encoding won't be touched.
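
If you have a UTF-8 dump file at hand, you can also look for such double-character combinations without phpMyAdmin. The following sketch is an assumption-laden heuristic, not part of Jigal's script: count_suspect_lines is a made-up name, and it simply counts lines that contain the bytes c3 83, which is the character "Ã" that starts most garbled accented letters. Legitimate "Ã" characters (for example in Portuguese text) match as well, so treat hits as candidates only.

```shell
# count_suspect_lines FILE: print the number of lines in FILE that contain
# the byte pair c3 83 (an "Ã" in UTF-8), typical for double-encoded text.
count_suspect_lines() {
    pattern=$(printf '\303\203')
    # LC_ALL=C makes grep compare raw bytes; "|| true" keeps the exit status
    # at 0 when grep finds nothing (it still prints the count 0).
    LC_ALL=C grep -c "$pattern" "$1" || true
}

# Demo: line 1 is double-encoded, line 2 is ASCII, line 3 is correct UTF-8.
sample=$(mktemp)
printf 'G\303\203\302\266kg\303\203\302\266z\nplain text\nG\303\266kg\303\266z\n' > "$sample"
count_suspect_lines "$sample"    # 1
rm -f "$sample"
```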

Settings in TYPO3

When you're done, use the following settings in the Install Tool. You should then have a UTF-8 installation:

PHP script:
$TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;';
$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';

Possibility 2

Dump your database, modify the dumped file and import it again.

Note If you do it that way, setting ['BE']['forceCharset'] might cause broken special chars inside TYPO3. See below for more information.


Requirements:

  • Shell access to your Unix server
  • sed installed on the server

For this example we assume:

  • hostname: domain.com
  • database: typo3

This example is for *nix users. If you're working on a Windows PC, you can do the same using PuTTY. Enter the hostname in the field "Host Name (or IP address)" and click on "Open". Then enter your ssh username, press Enter, enter the password (which will not be displayed) and press Enter again. You should now be connected to the server.

Linux users connect to the server via ssh typing

shell script:
ssh -l (user) domain.com

Create a backup of the database (if things go wrong...)

shell script:
mysqldump -u (user) -p(pass) --max_allowed_packet=10000000 typo3 > typo3_backup.sql

Dump the database without the table typo3.sys_refindex. This prevents the following error: "SQL=Specified key was too long; max key length is 1000 bytes". You have to rebuild the reference index afterwards!

shell script:
mysqldump -u (user) -p(pass) --max_allowed_packet=10000000 --ignore-table=typo3.sys_refindex  typo3  > typo3_utf8.sql

Now modify the dump:

Newer versions of MySQL (at least 5.0) also save the collation for each column separately. You have to convert all of them:

First convert all occurrences of "DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci" (use the character set and collation that actually appear in your file) in typo3_utf8.sql to "DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci". The sed commands below use the BSD form -i "" for in-place editing; with GNU sed, which most Linux systems use, write -i without the empty argument:

shell script:
 sed  -e 's/DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci/DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci/g' -i "" typo3_utf8.sql

Then convert all occurrences of "COLLATE latin1_german1_ci" (use the collation that appears in your file) to "COLLATE utf8_general_ci":

shell script:
 sed  -e 's/COLLATE latin1_german1_ci/COLLATE utf8_general_ci/g' -i "" typo3_utf8.sql
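
Both replacements can also be combined into a single sed call that writes to a new file, which sidesteps the different -i syntax of GNU and BSD sed. This is a sketch only: convert_dump is a hypothetical helper name, the sample dump is made up, and the default charset and collation are the ones from the example above.

```shell
# convert_dump SRC DST [OLD_CHARSET] [OLD_COLLATION]
# Rewrite table defaults and per-column collations to utf8, leaving SRC untouched.
convert_dump() {
    src=$1
    dst=$2
    old_charset=${3:-latin1}
    old_collation=${4:-latin1_german1_ci}
    sed -e "s/DEFAULT CHARSET=$old_charset COLLATE=$old_collation/DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci/g" \
        -e "s/COLLATE $old_collation/COLLATE utf8_general_ci/g" \
        "$src" > "$dst"
}

# Demo on a tiny, made-up dump fragment.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.sql" <<'EOF'
CREATE TABLE `pages` (
  `title` varchar(255) COLLATE latin1_german1_ci NOT NULL DEFAULT ''
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
EOF
convert_dump "$tmpdir/sample.sql" "$tmpdir/sample_utf8.sql"
cat "$tmpdir/sample_utf8.sql"
rm -r "$tmpdir"
```

Writing to a second file also keeps the unmodified dump around in case a replacement pattern turns out to be wrong.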

Import database:

shell script:
 mysql -u (user) -p(pass) --default-character-set=utf8  typo3 < typo3_utf8.sql

Alter character set and collation for the whole database:

shell script:
 mysql -u (user) -p(pass) -e "ALTER DATABASE typo3 DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci"

Broken special chars?

If the result of the above mentioned is that special chars are displayed incorrectly in TYPO3 (a small black box with a question mark in it instead of the special char), the following might help:

Create a new database. Make sure that it uses UTF-8 as default charset and utf8_general_ci as collation:

shell script:
mysql -u [username] -p[password] -e "ALTER DATABASE [newdb] DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;"

Then import the dump into that database without using sed to replace the occurrences of latin1 (or what you have) with UTF-8.

The result will be that the tables and columns in your database still use latin1 (or what you had before).

This might be a problem, e.g. when you now add new tables to this database, they will use UTF-8 as charset, because the database is set to UTF-8. This will lead to a mix of both charsets in your DB.

Possibility 3

This might be the way to go for German-speaking users with a Unix server:

A way similar to possibility 2 is recommended by t3n (German).

Basically they make the dump and replace the charset and collation statements.

Then they use iconv on the dumped file to convert the characters inside:

shell script:
 iconv -f iso-8859-1 -t utf8 dump.sql > dump-iconv.sql

Hint: The names of the charsets may differ from platform to platform. Use this to find supported charset names:

shell script:
iconv -l

After that they import the file using the switch --default-character-set=utf8:

shell script:
 mysql -u USER -pPASSWORD -h HOST --default-character-set=utf8 DB < dump-iconv.sql

Note 1

If you did that and get umlauts displayed correctly, but ß (sz-lig) and € (euro) displayed wrongly inside TYPO3, you should specify CP1252 as the origin charset to the iconv command like this:

shell script:
iconv -f CP1252 -t utf8 dump.sql > dump-iconv.sql
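
Why the origin charset matters can be seen on a single byte. The following sketch assumes an iconv that knows the name CP1252 (check with iconv -l); the helper name hex is made up. It converts the byte 0x80, which is the euro sign in CP1252 but an invisible control character in ISO-8859-1.

```shell
# hex: print stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# \200 is the byte 0x80 written as an octal escape.
printf '\200' | iconv -f ISO-8859-1 -t UTF-8 | hex   # c280   (control character U+0080)
printf '\200' | iconv -f CP1252 -t UTF-8 | hex       # e282ac (euro sign U+20AC)
```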

Note 2

If DB collations are set to utf8_general_ci, and TYPO3 is configured like this:

PHP script:
$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';
// not set:
// $TYPO3_CONF_VARS['SYS']['setDBinit']


It is possible that your data gets double UTF-8-encoded, as TYPO3 sends UTF-8 encoded data to the DB server, but the DB server has no additional information on the connection and defaults to Latin1, so it converts the data again. To solve this, run the following command on a dump of the data. Converting from UTF-8 to ISO-8859-1 may look backwards, but it undoes exactly one of the two encoding steps, so the resulting file is correctly encoded UTF-8:

shell script:
iconv -f UTF-8 -t ISO-8859-1 dump.sql > dump-iconv.sql
# if an error occurs, try:
# iconv -f UTF-8 -t ISO-8859-1//TRANSLIT dump.sql > dump-iconv.sql
# or even:
# iconv -f UTF-8 -t ISO-8859-1//TRANSLIT//IGNORE dump.sql > dump-iconv.sql

Note 3

If you tried to use iconv and it threw an error like "cannot convert", try this command, which attempts to transliterate characters that have no representation in the target charset. Note that the //TRANSLIT and //IGNORE suffixes belong to the target charset given with -t:

shell script:
iconv -f iso-8859-1 -t utf8//TRANSLIT dump.sql > dump-iconv.sql

If this still doesn't work, as a workaround there is the possibility to ignore these characters silently:

shell script:
iconv -f iso-8859-1 -t utf8//TRANSLIT//IGNORE dump.sql > dump-iconv.sql

Possibility 4

Source (in German). Tested on Debian Lenny, MySQL 5.0.51, TYPO3 4.5.

Convert database to utf8

For the database do:

shell script:
echo "ALTER DATABASE mydb CHARACTER SET utf8 COLLATE utf8_general_ci;" | mysql
mysqldump --default-character-set=latin1 --databases mydb > a.sql
cp a.sql b.sql
sed -i 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g' b.sql
sed -i 's/CHARACTER SET latin1/CHARACTER SET utf8/g' b.sql
<*>
grep -v character_set_client <b.sql > c.sql
mysql --default-character-set=utf8 < c.sql

Your data should display correctly when you use a MySQL console.

Troubleshooting:

If errors occur while loading data try to display the corresponding statement by setting MySQL to very verbose:

shell script:
mysql -v -v --default-character-set=utf8 < c.sql

For errors like this:

shell script:
CREATE TABLE `sys_registry` (
  `uid` int(11) unsigned NOT NULL auto_increment,
  `entry_namespace` text NOT NULL,
  `entry_key` text NOT NULL,
  `entry_value` blob,
  PRIMARY KEY  (`uid`),
  UNIQUE KEY `entry_identifier` (`entry_namespace`(256),`entry_key`(127))
) ENGINE=InnoDb DEFAULT CHARSET=utf8
 
ERROR 1071 (42000) at line 1333: Specified key was too long; max key length is 767 bytes
Bye

In this case, shorten the index prefix by adding another sed line at the position marked <*> above:

shell script:
sed -i 's/`entry_namespace`(256)/`entry_namespace`(127)/g' b.sql
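
Note

The maximum key length with latin1 (1 byte per character) is

  • for InnoDB: 767 bytes = 767 characters
  • for MyISAM: 1000 bytes = 1000 characters

The maximum key length with utf8 (up to 3 bytes per character) is

  • for InnoDB: 767/3 = 255 characters
  • for MyISAM: 1000/3 = 333 characters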
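
The arithmetic behind these limits, and behind the failing index prefix in the error above, can be checked quickly in the shell. This is a sketch with made-up helper names (max_index_chars, prefix_fits); it assumes the byte limit applies to each index column prefix, as the sed fix above suggests.

```shell
# max_index_chars LIMIT_BYTES BYTES_PER_CHAR: whole characters that fit into the limit.
max_index_chars() {
    echo $(( $1 / $2 ))
}

# prefix_fits LIMIT_BYTES BYTES_PER_CHAR PREFIX_CHARS: succeeds if the prefix fits.
prefix_fits() {
    [ $(( $3 * $2 )) -le "$1" ]
}

max_index_chars 767 3     # InnoDB with utf8: 255
max_index_chars 1000 3    # MyISAM with utf8: 333

# entry_namespace(256) needs 256 * 3 = 768 bytes, one more than the 767 allowed.
if prefix_fits 767 3 256; then echo "(256) fits"; else echo "(256) is too long"; fi
# After the sed fix, entry_namespace(127) needs 127 * 3 = 381 bytes.
if prefix_fits 767 3 127; then echo "(127) fits"; else echo "(127) is too long"; fi
```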

Possibility 5

Try the extension toolbox_utf8 and give feedback.

Documentation can be found in the Forge wiki of the project.
