Zend_Lucene + UTF8 search problem... Help!(8EB-F5F)

Maxim Savenko
Hi everybody,

I have a problem searching for UTF-8 encoded Russian strings with
Zend_Search_Lucene. Here is a short code sample:

<?php
require_once 'ZendInit.php';
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';

// Create index
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield',
    'русский текст; english text', 'utf-8'));
$index->addDocument($doc);
$index->commit();

// Open index and search:
$index = Zend_Search_Lucene::open('data/index');
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
Zend_Search_Lucene::setDefaultSearchField('samplefield');

// Query the index:
$queryStr = 'english';
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'utf-8');
$hits = $index->find($query);
foreach ($hits as $hit) {
   /* @var $hit Zend_Search_Lucene_Search_QueryHit */
   $doc = $hit->getDocument();
   echo $doc->getField('samplefield')->value, PHP_EOL;
}

The 'samplefield' field of the document contains a string in two
languages, Russian and English (see the code). Searching for 'english'
works fine and finds the document, but searching for the Russian part
of the field (setting $queryStr to 'русский') finds no documents at
all.

What is the problem with my code? Please help me find a solution.

Thank you guys

Maxim Savenko
[hidden email]

RE: Zend_Lucene + UTF8 search problem... Help!(8EB-F5F)

Alexander Veremyev
Hi Maxim,

The problem is that the default analyzer works only with ASCII text - http://framework.zend.com/manual/en/zend.search.lucene.charset.html#zend.search.lucene.charset.default_analyzer

This is because the mbstring PHP extension is not included in PHP installations by default, and iconv() doesn't provide the necessary functionality.

To work with non-ASCII text that can't be transliterated by iconv(), you should use the special UTF-8 analyzers - http://framework.zend.com/manual/en/zend.search.lucene.charset.html#zend.search.lucene.charset.utf_analyzer
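
To make the difference concrete, here is a minimal, self-contained sketch of ASCII-only versus UTF-8-aware tokenization. The functions asciiTokens() and utf8Tokens() are hypothetical names made up for this illustration and are not part of Zend_Search_Lucene; the regular expressions are an assumption about the general approach, not the library's actual code. The file must be saved as UTF-8.

```php
<?php
// Toy tokenizers for illustration only; these are NOT the Zend analyzer classes.

// ASCII-only: a token is a run of the letters a-z / A-Z.
// The bytes of UTF-8 encoded Cyrillic never match, so those words vanish.
function asciiTokens($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// UTF-8 aware: a token is a run of Unicode letters.
// Needs PCRE built with Unicode support (the /u modifier and \p{L}).
function utf8Tokens($text)
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

$text = 'русский текст; english text';

print_r(asciiTokens($text)); // only the English words: english, text
print_r(utf8Tokens($text));  // all four words, including русский and текст
```

With the ASCII-only tokenizer the Russian words never become index terms, which would explain why a query for 'русский' has nothing to match against.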


---------------------------------
<?php
require_once 'ZendInit.php';
require_once 'Zend/Search/Lucene.php';


Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive ());

// Create index
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield',
                                              'русский текст; english text',
                                              'utf-8'));
$index->addDocument($doc);
$index->commit();

...
-----------------

Don't forget to set the same analyzer as default before searching:
---------------------------------
<?php
require_once 'ZendInit.php';
require_once 'Zend/Search/Lucene.php';


Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive ());

// Open index
$index = Zend_Search_Lucene::open('data/index');
...

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
foreach ($index->find($query) as $hit) {
    echo $hit->samplefield, PHP_EOL;
}
...
-----------------
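
Why the analyzer has to match at indexing and at search time can be shown with a toy inverted index. This is only a sketch under stated assumptions: buildIndex(), search(), tokenizeAscii() and tokenizeUtf8() are hypothetical helpers invented for this example, the index is a plain PHP array, and none of it reflects how Zend_Search_Lucene stores or scores data. Unlike the CaseInsensitive analyzer, this toy version is case-sensitive.

```php
<?php
// Toy inverted index for illustration only; NOT Zend_Search_Lucene internals.

function tokenizeAscii($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $m);
    return $m[0];
}

function tokenizeUtf8($text)
{
    preg_match_all('/\p{L}+/u', $text, $m);
    return $m[0];
}

// Build a map of term => set of document ids, using the given tokenizer.
function buildIndex(array $docs, $tokenizer)
{
    $index = array();
    foreach ($docs as $id => $text) {
        foreach (call_user_func($tokenizer, $text) as $term) {
            $index[$term][$id] = true;
        }
    }
    return $index;
}

// Tokenize the query with the given tokenizer and collect matching ids.
function search(array $index, $query, $tokenizer)
{
    $hits = array();
    foreach (call_user_func($tokenizer, $query) as $term) {
        if (isset($index[$term])) {
            $hits += $index[$term];
        }
    }
    return array_keys($hits);
}

$docs  = array(1 => 'русский текст; english text');
$index = buildIndex($docs, 'tokenizeUtf8');

var_dump(search($index, 'русский', 'tokenizeUtf8'));  // finds document 1
var_dump(search($index, 'русский', 'tokenizeAscii')); // finds nothing
```

Even though the index contains the Russian terms, a query analyzed with the ASCII-only tokenizer produces no terms at all, so nothing is looked up.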


With best regards,
   Alexander Veremyev.


> -----Original Message-----
> From: Maxim Savenko [mailto:[hidden email]]
> Sent: Thursday, July 24, 2008 3:58 PM
> To: [hidden email]
> Subject: [fw-formats] Zend_Lucene + UTF8 search problem... Help!(8EB-F5F)

Re: Zend_Lucene + UTF8 search problem... Help!(8EB-F5F)

Maxim Savenko
Hi,

Thank you, Alexander.
I understand the problem now, and my script works fine.


2008/7/25 Alexander Veremyev <[hidden email]>:

--
Good Luck.

Maxim Savenko
EMail: [hidden email]