In this case UTF-8 should be used. UTF-8 (more precisely, "Modified UTF-8") is
the internal encoding of Zend_Search_Lucene.
However, the query should be constructed manually (through the API) instead of
being parsed from a string:
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
/* required term */
$query->addTerm(new Zend_Search_Lucene_Index_Term('word1'), true);
/* a searched, but not required term would be added with null as the second argument */
/* prohibited term */
$query->addTerm(new Zend_Search_Lucene_Index_Term('für'), false);
$hits = $index->find($query);
The other option is to upgrade the query parser
(the Zend_Search_Lucene_Search_QueryTokenizer class) to support UTF-8.
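
To give a rough idea of what UTF-8-aware tokenization involves, here is a minimal sketch in plain PHP. It is not the actual QueryTokenizer code: tokenizeUtf8Words() is a made-up helper, and it only extracts words, ignoring query operators such as "+" and "-".

```php
<?php
// tokenizeUtf8Words() is a hypothetical helper for illustration only;
// it is not the Zend_Search_Lucene_Search_QueryTokenizer implementation.
function tokenizeUtf8Words($query)
{
    // The /u modifier makes PCRE read the subject as UTF-8, so \p{L} (letters)
    // and \p{N} (numbers) match whole characters such as "ü", not single bytes.
    $count = preg_match_all('/[\p{L}\p{N}]+/u', $query, $matches);
    if ($count === false) {
        // preg_match_all() returns false when the subject is not valid UTF-8
        throw new InvalidArgumentException('Query is not valid UTF-8');
    }
    return $matches[0];
}

// "f\xC3\xBCr" is "für" written out as explicit UTF-8 bytes
print_r(tokenizeUtf8Words("word1 -f\xC3\xBCr"));
```

Each returned word could then be wrapped in a Zend_Search_Lucene_Index_Term and added to a MultiTerm query as shown above.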
Derick Rethans wrote:
>>In this case UTF-8 should be used. UTF-8 ("Modified UTF-8") encoding is an
>>internal encoding of Zend_Search_Lucene.
> Just wondering... what are the differences between UTF-8 and "Modified
> UTF-8" ?
a) Representation of the 0x00 symbol.
UTF-8 encodes U+0000 as 0x00.
Modified UTF-8 encodes U+0000 as 0xC0 0x80 (11000000 10000000).
b) Encoding of supplementary characters (characters whose code points are
greater than 0xFFFF).
UTF-8 uses four bytes for such characters.
Modified UTF-8 represents these characters as a pair of char (16-bit)
values, the first from the high-surrogates range (0xD800-0xDBFF), the
second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then
encoded as an ordinary three-byte UTF-8 sequence, giving six bytes in total.
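
The two differences can be sketched in plain PHP as follows. This is only an illustration of the byte layouts described above, not code from Zend_Search_Lucene; all function names are made up for the example.

```php
<?php
// Illustrative helpers with hypothetical names (not part of Zend_Search_Lucene).

/** Encode one code point below 0x10000 as a 1- to 3-byte UTF-8 sequence. */
function encodeBmpCodePoint($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

/** Standard UTF-8: supplementary characters take four bytes. */
function encodeUtf8($cp)
{
    if ($cp < 0x10000) {
        return encodeBmpCodePoint($cp);
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

/** Modified UTF-8: U+0000 becomes C0 80; supplementary characters become two encoded surrogates. */
function encodeModifiedUtf8($cp)
{
    if ($cp === 0) {
        return "\xC0\x80";
    }
    if ($cp < 0x10000) {
        return encodeBmpCodePoint($cp);
    }
    // Split the code point into a UTF-16 surrogate pair
    $v    = $cp - 0x10000;
    $high = 0xD800 | ($v >> 10);
    $low  = 0xDC00 | ($v & 0x3FF);
    // Each surrogate is encoded as its own three-byte sequence
    return encodeBmpCodePoint($high) . encodeBmpCodePoint($low);
}

echo bin2hex(encodeUtf8(0x1F600)), "\n";         // f09f9880 (four bytes)
echo bin2hex(encodeModifiedUtf8(0x1F600)), "\n"; // eda0bdedb880 (six bytes)
echo bin2hex(encodeModifiedUtf8(0)), "\n";       // c080
```

For every other character in the Basic Multilingual Plane, such as "ü" (U+00FC), both encodings produce the same bytes.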
Zend_Search_Lucene automatically translates U+0000 (to be binary
compatible with Java Lucene), but it supports only the Basic Multilingual
Plane (it does not support supplementary characters).
Supporting supplementary characters would not be very useful, and it would
slow down string read/write operations.
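
If you want to know in advance whether a string stays within the Basic Multilingual Plane, a check along these lines should work. It is a sketch in plain PHP: isBmpOnly() is a hypothetical helper, not a Zend_Search_Lucene function, and it assumes the input is already UTF-8.

```php
<?php
/**
 * Return true when a UTF-8 string holds only Basic Multilingual Plane
 * characters. isBmpOnly() is a hypothetical helper for illustration.
 */
function isBmpOnly($utf8)
{
    // With the /u modifier PCRE treats the subject as UTF-8, so the class
    // below matches any supplementary character (U+10000 and above).
    $result = preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8);
    if ($result === false) {
        // preg_match() returns false when the subject is not valid UTF-8
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result === 0;
}

var_dump(isBmpOnly("f\xC3\xBCr"));        // "für": BMP only
var_dump(isBmpOnly("\xF0\x9F\x98\x80"));  // U+1F600: supplementary character
```

Text that fails this check contains characters outside the range described above, so it would need to be filtered or replaced before indexing.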