Hi Derick,
Derick Rethans wrote:
>>In this case UTF-8 should be used. UTF-8 ("Modified UTF-8") encoding is an
>>internal encoding of Zend_Search_Lucene.
>
> Just wondering... what are the differences between UTF-8 and "Modified
> UTF-8" ?
http://en.wikipedia.org/wiki/UTF-8#Java
To put it briefly:
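a) Representation of the null character (U+0000).
UTF-8 encodes U+0000 as the single byte 0x00.
Modified UTF-8 encodes U+0000 as the two bytes 0xC0 0x80 (11000000 10000000),
so an encoded string never contains an embedded 0x00 byte.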
b) Encoding of supplementary characters (characters whose code points are
greater than 0xFFFF).
UTF-8 uses four bytes for such characters.
Modified UTF-8 represents each of these characters as a pair of char (16-bit)
values, the first from the high-surrogates range (0xD800-0xDBFF), the
second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then
encoded as an ordinary three-byte UTF-8 sequence, giving six bytes in total.
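To make the two differences concrete, here is a minimal Python sketch of a Modified UTF-8 encoder. It is only an illustration of the rules above, not code from Zend_Search_Lucene or Java Lucene, and the function name encode_modified_utf8 is made up for this example.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Modified UTF-8 (illustrative sketch, hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference a): U+0000 becomes the two-byte form 0xC0 0x80.
            out += b"\xc0\x80"
        elif cp <= 0xFFFF:
            # BMP characters are encoded exactly as in standard UTF-8.
            # "surrogatepass" lets a lone surrogate through unchanged.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Difference b): split into a surrogate pair, then encode
            # each surrogate as its own three-byte sequence.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    print(encode_modified_utf8("\x00").hex())        # c080
    print(encode_modified_utf8("\U0001F600").hex())  # eda0bdedb880
    print("\U0001F600".encode("utf-8").hex())        # f09f9880
```

For U+1F600 the sketch produces six bytes (ED A0 BD ED B8 80), where standard UTF-8 produces four (F0 9F 98 80).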
Zend_Search_Lucene automatically translates U+0000 (to stay binary
compatible with Java Lucene), but it supports only the Basic Multilingual
Plane (supplementary characters are not supported).
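As a rough illustration of what "translate U+0000, BMP only" can look like on the reading side, here is a short Python sketch. It is an assumption about the general approach, not the actual Zend_Search_Lucene code, and decode_bmp_only is a hypothetical name.

```python
def decode_bmp_only(data: bytes) -> str:
    """Decode Modified UTF-8 that contains only BMP characters (sketch)."""
    # 0xC0 is never a valid byte in standard UTF-8, so the pair C0 80
    # can only be the Modified UTF-8 form of U+0000.
    standard = data.replace(b"\xc0\x80", b"\x00")
    # Strict decoding rejects encoded surrogates, so six-byte
    # supplementary characters raise UnicodeDecodeError here.
    return standard.decode("utf-8")


if __name__ == "__main__":
    print(repr(decode_bmp_only(b"a\xc0\x80b")))  # 'a\x00b'
```

With this approach, input containing a supplementary character (for example the six bytes ED A0 BD ED B8 80) fails to decode instead of being silently misread.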
Supporting supplementary characters would not be very useful, and it would
slow down string read/write operations.
With best regards,
Alexander Veremyev.