One customer parses a substantial amount of (somewhat poorly formed) XML on a daily basis. (I say poorly formed because the XML declaration does not specify a character encoding, i.e. the file starts with <?xml version="1.0"?> and not <?xml version="1.0" encoding="UTF-8"?>, and different files contain either UTF-8 or ISO-8859-1 text depending on the data supplier.)
Recently with PHP we've been seeing errors like:
PHP Notice: iconv_strlen(): Detected an illegal character in input string in ..../vendor/zendframework/zendframework1/library/Zend/Validate/StringLength.php on line 246
PHP Stack trace:
....
Zend_Form_Element->isValid() ...../vendor/zendframework/zendframework1/library/Zend/Form.php:2300
Zend_Validate_StringLength->isValid() ...../vendor/zendframework/zendframework1/library/Zend/Form/Element.php:1443
iconv_strlen() ..../vendor/zendframework/zendframework1/library/Zend/Validate/StringLength.php:246
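After some investigation, we found out that Zend_Validate_StringLength defaults to using the 'iconv.internal_encoding' php.ini setting if an encoding is not specified when creating the validator (e.g. $options = array('encoding' => 'UTF-8'); $validator = new Zend_Validate_StringLength($options);). When that setting does not match the encoding of the text actually being validated, iconv_strlen() hits a byte sequence it considers illegal and raises the notice above.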
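To see why the encoding matters, here is a minimal sketch of the same failure outside Zend. The sample strings are made up for illustration; the only assumption is that the iconv extension is loaded (it must be, given the stack trace above).

```php
<?php
// "café" encoded as ISO-8859-1: é is the single byte 0xE9,
// which is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9";
// The same word encoded as UTF-8: é is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

// @ hides the "Detected an illegal character" notice for this demo.
$wrongEncoding = @iconv_strlen($latin1, 'UTF-8');
$rightUtf8     = iconv_strlen($utf8, 'UTF-8');
$rightLatin1   = iconv_strlen($latin1, 'ISO-8859-1');

var_dump($wrongEncoding); // bool(false)
var_dump($rightUtf8);     // int(4)
var_dump($rightLatin1);   // int(4)
```

The validator runs into the first case whenever the encoding it falls back to disagrees with the bytes it is handed.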
So, perhaps the lessons learnt are:
- Set php.ini to have a default_charset of UTF-8
- Set php.ini to have a default iconv.internal_encoding of UTF-8
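A quick way to check what a given server is actually using is to ask PHP at runtime. This is only a sketch: effective_encodings() is a hypothetical helper name, not part of PHP or Zend.

```php
<?php
/**
 * Hypothetical helper: report the encodings PHP falls back to
 * when a function is not given one explicitly.
 */
function effective_encodings()
{
    return array(
        'default_charset' => ini_get('default_charset'),
        'iconv_internal'  => iconv_get_encoding('internal_encoding'),
    );
}

print_r(effective_encodings());
```

Running this both from the command line and through the web server is worthwhile, as the two can load different php.ini files.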
Alternatively, I suspect passing the XML file through xmllint before it's used would help.
At that point an é should be turned into a numeric character reference such as &#xE9;, which would solve the problem.
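In the same spirit, the text could be normalised inside PHP before it reaches the validator. The following is a rough sketch, not something we have deployed: to_utf8() is a hypothetical helper, it assumes the mbstring extension is available, and it assumes (as with our suppliers) that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8, assuming it is either
 * already valid UTF-8 or otherwise ISO-8859-1.
 */
function to_utf8($text)
{
    // Valid UTF-8 is passed through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Anything else is treated as ISO-8859-1 (an assumption).
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

var_dump(to_utf8("caf\xE9") === "caf\xC3\xA9");     // bool(true)
var_dump(to_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true)
```

The guess can be wrong for ISO-8859-1 text that happens to also be valid UTF-8, but that is rare in practice for ordinary accented text.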