Zend_Search_Lucene - Large amount of data

Zend_Search_Lucene - Large amount of data

Sam Davey
Hi,

I am really impressed with the performance of Zend_Search_Lucene and am trying to shift my SQL-based search to an index-based search, for the obvious advantage of relieving stress on my MySQL server.

However, I have a problem: there is a massive amount of data I need to index, and when I try to index it all, the script either runs out of memory or exceeds the allowed execution time.  Of course I can use ini_set to increase these limits, but I have already raised them to high values and can still only index about half of my data.

Does anyone know of a good strategy to minimise or control the memory/time required by a script indexing this amount of data?

Cheers,

Sam

Re: Zend_Search_Lucene - Large amount of data

Alexander Veremyev
Hi Sam,

I am preparing the Zend_Search_Lucene Best Practices documentation section
right now, and it will include recommendations for the different indexing
modes (see below) :)

Hope it helps.


To get a quick result:
1. Don't limit batch indexing execution time.
2. Choose MaxBufferedDocs according to your memory limit (start with 128
and halve it each time you get an 'out of memory' error).
3. Skip MergeFactor tuning.
4. Set MaxMergeDocs to floor(NumberOfDocuments/64).
(See the sketch just below for these settings in code.)
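
As an illustration of these settings in code ($indexPath and
$numberOfDocuments are placeholders for your own index location and
document count):

----
set_time_limit(0);                // 1. no execution time limit

$index = Zend_Search_Lucene::create($indexPath);
$index->setMaxBufferedDocs(128);  // 2. halve this on 'out of memory'
                                  // 3. MergeFactor left at its default
$index->setMaxMergeDocs((int)floor($numberOfDocuments / 64)); // 4.
----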



-- Indexing performance -------------
Indexing performance is a trade-off between the resources used, indexing
time, and index quality.


Index quality is completely determined by the number of index segments.

Each index segment is an entirely independent portion of data, so an index
containing more segments needs more memory and more time to search.

Index optimization is the process of merging several segments into a new
one. A fully optimized index contains only one segment.

Full index optimization may be performed with the 'optimize()' method:
----
// Open an existing index and merge all of its segments into one
$index = Zend_Search_Lucene::open($indexPath);

$index->optimize();
----

Index optimization works with data streams and doesn't take a lot of
memory, but it does consume processor time.


Lucene index segments are not updatable by nature (an update operation
requires the segment file to be completely rewritten), so adding new
documents to the index always generates a new segment, which decreases
index quality.

The index auto-optimization process runs after each segment generation
and consists of partially merging segments.


There are three options that control the behavior of auto-optimization
(see the example below):
1. MaxBufferedDocs is the number of documents buffered in memory before a
new segment is generated and written to disk.
2. MaxMergeDocs is the maximum number of documents merged by the
auto-optimization process into a new segment.
3. MergeFactor determines how often auto-optimization is performed.
* All these options are Zend_Search_Lucene object properties, not index
properties, so they affect only the current Zend_Search_Lucene object's
behavior and may vary between scripts.
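
For reference, each of these options has a corresponding setter on the
Zend_Search_Lucene object; here is a minimal sketch (the values are
placeholders, not recommendations):

----
$index = Zend_Search_Lucene::open($indexPath);

$index->setMaxBufferedDocs(128);  // documents buffered before a new segment
$index->setMaxMergeDocs(100000);  // cap on documents per auto-merged segment
$index->setMergeFactor(10);       // how often auto-optimization runs
----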

MaxBufferedDocs doesn't matter if you index only one document per script
execution. For batch indexing, however, it's very important: a greater
value increases indexing performance, but also needs more memory.

There is no way to calculate the best value for the MaxBufferedDocs
parameter, because it depends on document size, the analyzer used, and
the allowed memory.

A good way to find the right value is to perform several tests with the
largest document you expect to be added to the index ('memory_get_usage()'
and 'memory_get_peak_usage()' may be used to monitor memory consumption).
It's a good idea not to use more than half of the allowed memory.
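
Such a test could look something like the following sketch, where
$indexPath and $largestText stand in for your own index location and
worst-case document body:

----
$index = Zend_Search_Lucene::create($indexPath);
$index->setMaxBufferedDocs(128);

// Buffer a full batch of worst-case documents, then check peak memory.
for ($i = 0; $i < 128; $i++) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('contents', $largestText));
    $index->addDocument($doc);
}
echo memory_get_peak_usage(), "\n"; // aim for under half of memory_limit
----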


MaxMergeDocs limits segment size (in terms of documents), so it also limits
auto-optimization time. This guarantees that the addDocument() method never
runs longer than a certain time, which is important for interactive
applications.

Decreasing the MaxMergeDocs parameter may also improve batch indexing
performance. Index auto-optimization is an iterative process performed step
by step: small segments are merged into larger ones, which at some point
are merged into even larger ones, and so on. Full index optimization is
much more effective.

On the other hand, smaller segments decrease index quality and may result
in too many segments. This can trigger the 'Too many open files' error
imposed by OS limits (Zend_Search_Lucene keeps each segment file open to
improve search performance).

So background index optimization should be performed for the interactive
indexing mode, and MaxMergeDocs shouldn't be too low for batch indexing;
a sketch of such a background job follows.
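
For example, the background pass could be a separate maintenance script run
periodically (say, from cron), leaving the interactive script free to just
call addDocument():

----
// Periodic maintenance script, run outside the interactive request path.
$index = Zend_Search_Lucene::open($indexPath);
$index->optimize(); // merges all segments into one
----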


MergeFactor affects auto-optimization frequency. Lower values increase the
quality of the unoptimized index; larger values increase indexing
performance, but also increase the number of segments, which again may
cause the 'Too many open files' error.

MergeFactor groups index segments by their size (in documents):
1. Not greater than MaxBufferedDocs.
2. Greater than MaxBufferedDocs, but not greater than
MaxBufferedDocs*MergeFactor.
3. Greater than MaxBufferedDocs*MergeFactor, but not greater than
MaxBufferedDocs*MergeFactor*MergeFactor.
...
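
For example, with MaxBufferedDocs = 128 and MergeFactor = 10, the first
group holds segments of up to 128 documents, the second segments of up to
1,280 documents, the third up to 12,800, and so on.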

At each addDocument() call, Zend_Search_Lucene checks whether merging any
group of segments would move the newly created segment into the next group;
if so, the merge is performed.

So an index with N groups may contain MaxBufferedDocs + (N-1)*MergeFactor
segments, and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.

This gives a good approximation of the number of segments in the index:

NumberOfSegments <= MaxBufferedDocs +
    MergeFactor * ln(NumberOfDocuments/MaxBufferedDocs) / ln(MergeFactor)
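
As a rough illustration (the numbers are chosen only for this example):
with MaxBufferedDocs = 128, MergeFactor = 10, and 1,000,000 documents, this
gives about 128 + 10*ln(1000000/128)/ln(10) = 128 + 39, i.e. roughly 167
segments.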

MaxBufferedDocs is determined by the allowed memory. That makes it
possible to choose an appropriate MergeFactor to get a reasonable number
of segments.


Tuning the MergeFactor parameter affects batch indexing performance more
than MaxMergeDocs does, but it's also a coarser control. So use the
estimation above to tune MergeFactor, then play with MaxMergeDocs to get
the best batch indexing performance (a combined sketch follows).
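
Putting it all together, a batch indexing script might look like this
sketch ($rows and the field layout are placeholders; all values are
illustrative, not recommendations):

----
set_time_limit(0);

$index = Zend_Search_Lucene::create($indexPath);
$index->setMaxBufferedDocs(128); // from the memory tests above
$index->setMergeFactor(10);      // from the segment-count estimation
$index->setMaxMergeDocs(64000);  // then tune experimentally

foreach ($rows as $row) {        // $rows: your own data source
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('contents', $row['text']));
    $index->addDocument($doc);
}

$index->optimize(); // full optimization once the batch is done
----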
---------------

With best regards,
    Alexander Veremyev.

