Re: Lucene search problems with umlauts

CustomSoft
That doesn't help, because the index was not built with Zend.
I converted my query string to ISO-8859-1, UTF-8, and ASCII, and none of them changed the results.

With kind regards,
Eric Bartels


Alexander Veremyev wrote:

> Hi all,
>
> Moreover, it's described in the documentation now:
> http://framework.zend.com/manual/en/zend.search.charset.html#zend.search.charset.description
>
> ;)
>
> With best regards,
>    Alexander Veremyev.
>
> Daniel Andersson wrote:
>>> What's wrong? Does the problem lie with the umlauts?
>>
>>
>> see http://www.zend.com/lists/fw-general/200603/msg00460.html
>>
>> / d
>>
>>
>


--

With kind regards, Eric Bartels

Re: Lucene search problems with umlauts

Alexander Veremyev
Hi Eric,

In this case UTF-8 should be used. UTF-8 (strictly, "Modified UTF-8") is
the internal encoding of Zend_Search_Lucene.
However, the query should be constructed manually (through the API) instead of
being parsed from a string:
-------------
$query = new Zend_Search_Lucene_Search_Query_MultiTerm();
/* required term */
$query->addTerm(new Zend_Search_Lucene_Index_Term('word1'), true);
/* searched, but not required term */
$query->addTerm(new Zend_Search_Lucene_Index_Term('fröbel'));
/* prohibited term */
$query->addTerm(new Zend_Search_Lucene_Index_Term('für'), false);

$hits = $index->find($query);
-------------

The other option is to upgrade the query parser
(Zend_Search_Lucene_Search_QueryTokenizer class) to support UTF-8.
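As a rough illustration of what UTF-8-aware tokenizing could look like, here is a minimal sketch. It is not the Zend_Search_Lucene_Search_QueryTokenizer implementation; the function name tokenizeUtf8Query and the "+"/"-" prefix convention are assumptions made for this example, and it assumes PHP 7+ with the mbstring extension.

```php
<?php
// Hypothetical helper, NOT part of Zend_Search_Lucene.
// Splits a query string into [termText, sign] pairs, where sign is
// true (required), null (optional) or false (prohibited).
function tokenizeUtf8Query(string $query, string $inputEncoding = 'UTF-8'): array
{
    // Normalise the input to UTF-8 first (e.g. form data sent as ISO-8859-1).
    if (strcasecmp($inputEncoding, 'UTF-8') !== 0) {
        $query = mb_convert_encoding($query, 'UTF-8', $inputEncoding);
    }

    // The /u modifier makes PCRE treat the pattern and subject as UTF-8, so
    // \p{L} matches letters such as "ö" and "ü" as single characters.
    // The lookbehind keeps a hyphen inside a word from being read as a sign.
    $pattern = '/(?<![\p{L}\p{N}])([+-]?)([\p{L}\p{N}]+)/u';
    preg_match_all($pattern, $query, $matches, PREG_SET_ORDER);

    $terms = [];
    foreach ($matches as $m) {
        $sign = null;
        if ($m[1] === '+') {
            $sign = true;
        } elseif ($m[1] === '-') {
            $sign = false;
        }
        // No case folding here: that has to match the analyzer used at index time.
        $terms[] = [$m[2], $sign];
    }
    return $terms;
}
```

Each resulting pair could then be passed to addTerm() as in the example above, with the term text as the Zend_Search_Lucene_Index_Term argument; for an optional term (sign is null) the second argument would simply be omitted.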

With best regards,
    Alexander Veremyev.


Re: Lucene search problems with umlauts

Derick Rethans-4
On Tue, 16 May 2006, Alexander Veremyev wrote:

> Hi Eric,
>
> In this case UTF-8 should be used. UTF-8 ("Modified UTF-8") is the
> internal encoding of Zend_Search_Lucene.

Just wondering... what are the differences between UTF-8 and "Modified
UTF-8"?

regards,
Derick

Re: Lucene search problems with umlauts

Alexander Veremyev
Hi Derick,

Derick Rethans wrote:
>>In this case UTF-8 should be used. UTF-8 ("Modified UTF-8") is the
>>internal encoding of Zend_Search_Lucene.
>
> Just wondering... what are the differences between UTF-8 and "Modified
> UTF-8"?

http://en.wikipedia.org/wiki/UTF-8#Java


To put it briefly:

a) Representation of the null character.
UTF-8 encodes U+0000 as 0x00.
Modified UTF-8 encodes U+0000 as 0xC0 0x80 (11000000 10000000).

b) Encoding of supplementary characters (characters whose code points
are greater than 0xFFFF).
UTF-8 uses four bytes for such characters.
Modified UTF-8 represents each of these characters as a pair of 16-bit
char values, the first from the high-surrogates range (0xD800-0xDBFF),
the second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate
is then encoded as an ordinary three-byte UTF-8 sequence, giving six
bytes in total.
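The two differences can be sketched in a few lines of PHP. This is only an illustration of the encoding rules described above, not code taken from Zend_Search_Lucene; the function names encodeUtf8Unit and encodeModifiedUtf8 are made up for this example (PHP 7+ assumed).

```php
<?php
// Hypothetical helpers for illustration only.

// Encodes a single 16-bit value (0x0000-0xFFFF) as a 1-3 byte UTF-8 sequence.
function encodeUtf8Unit(int $u): string
{
    if ($u < 0x80) {
        return chr($u);
    }
    if ($u < 0x800) {
        return chr(0xC0 | ($u >> 6)) . chr(0x80 | ($u & 0x3F));
    }
    return chr(0xE0 | ($u >> 12))
         . chr(0x80 | (($u >> 6) & 0x3F))
         . chr(0x80 | ($u & 0x3F));
}

// Encodes one Unicode code point as Modified UTF-8.
function encodeModifiedUtf8(int $codePoint): string
{
    // (a) U+0000 becomes the two-byte form 0xC0 0x80 instead of 0x00.
    if ($codePoint === 0) {
        return "\xC0\x80";
    }
    // BMP characters are encoded exactly as in standard UTF-8.
    if ($codePoint <= 0xFFFF) {
        return encodeUtf8Unit($codePoint);
    }
    // (b) Supplementary characters: split into a surrogate pair and
    // encode each surrogate as its own three-byte sequence (six bytes).
    $v    = $codePoint - 0x10000;
    $high = 0xD800 | ($v >> 10);
    $low  = 0xDC00 | ($v & 0x3FF);
    return encodeUtf8Unit($high) . encodeUtf8Unit($low);
}
```

For example, encodeModifiedUtf8(0xF6) gives the same two bytes as standard UTF-8 for "ö" (0xC3 0xB6), while encodeModifiedUtf8(0x1D11E) gives the six bytes ED A0 B4 ED B4 9E rather than the four-byte standard form F0 9D 84 9E.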


Zend_Search_Lucene automatically translates U+0000 (to be binary
compatible with Java Lucene), but it supports only the Basic Multilingual
Plane (it doesn't support supplementary characters).
Supporting supplementary characters would not be very useful, and it
would slow down string read/write operations.
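Since only the Basic Multilingual Plane is supported, it may be worth checking text for supplementary characters before indexing or searching it. The following is a minimal sketch; the function name containsSupplementaryChars is an assumption for this example and is not part of Zend_Search_Lucene.

```php
<?php
// Hypothetical helper for illustration only.
// Returns true if the UTF-8 string contains at least one character
// outside the Basic Multilingual Plane (code point above U+FFFF).
function containsSupplementaryChars(string $utf8Text): bool
{
    // With the /u modifier, \x{...} refers to Unicode code points.
    // preg_match() returns false for invalid UTF-8 input, which this
    // sketch treats the same as "no supplementary characters found".
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8Text) === 1;
}
```

Umlauts such as "ö" and "ü" are well inside the BMP, so a term like "fröbel" passes this check; a string containing a four-byte UTF-8 character does not.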

With best regards,
    Alexander Veremyev.
