Re: Search not working

From Jeff Schnitzer Sep 15, 2006 10:45 AM

The current design is quite deliberate, for several reasons:

1) Pull-based batching works in a cluster.  If you are load balancing 
across two machines, only one of the machines will process the incoming 
email.  Pull-batching means that every machine will update its index 
eventually.  Push-based indexing would require some sort of reliable 
broadcast mechanism like durable JMS topics, because other machines in 
the cluster might be unavailable when the mail arrives.

2) Lucene strongly prefers batch updates.  Open index, add 50 messages, 
close index is way way way faster than doing open index, add message, 
close index 50 times.

3) To the extent that Lucene does caching, it does so in an object that 
you must throw away every time you update the index.  Frequent updates 
mean constantly trashing the cache, hurting performance.
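
To make points 1 and 2 concrete, here is a minimal sketch of the pull 
model.  It is not SubEtha's actual code: MessageStore, PullIndexer and 
the in-memory "index" list are hypothetical stand-ins for the shared 
database and a per-machine Lucene index.  Each node remembers the last 
message it indexed and, on every run, pulls everything newer in a single 
batch.

```java
import java.util.ArrayList;
import java.util.List;

class PullIndexSketch {

    /** Hypothetical stand-in for the shared message store (e.g. a database table). */
    static final class MessageStore {
        private final List<String> messages = new ArrayList<>();

        /** Appends a message and returns its id (ids are 1, 2, 3, ...). */
        synchronized long add(String body) {
            messages.add(body);
            return messages.size();
        }

        /** Returns all messages with an id greater than lastId, in id order. */
        synchronized List<String> after(long lastId) {
            return new ArrayList<>(messages.subList((int) lastId, messages.size()));
        }
    }

    /** One per cluster node: a private index plus a high-water mark. */
    static final class PullIndexer {
        private final MessageStore store;
        private final List<String> index = new ArrayList<>(); // stand-in for a Lucene index
        private long lastIndexedId = 0;
        private int batchesRun = 0;

        PullIndexer(MessageStore store) {
            this.store = store;
        }

        /** Pulls everything added since the last run and indexes it as one batch. */
        int runOnce() {
            List<String> pending = store.after(lastIndexedId);
            if (pending.isEmpty()) {
                return 0;
            }
            // "Open index" once, add the whole batch, "close index" once.
            index.addAll(pending);
            lastIndexedId += pending.size();
            batchesRun++;
            return pending.size();
        }

        int indexedCount() { return index.size(); }

        int batchesRun() { return batchesRun; }
    }

    public static void main(String[] args) {
        MessageStore store = new MessageStore();
        PullIndexer nodeA = new PullIndexer(store);
        PullIndexer nodeB = new PullIndexer(store); // imagine this node was down for a while

        store.add("first");
        store.add("second");
        System.out.println("nodeA indexed " + nodeA.runOnce());
        store.add("third");
        System.out.println("nodeB indexed " + nodeB.runOnce());
        System.out.println("nodeA indexed " + nodeA.runOnce());
    }
}
```

A node that was unavailable when mail arrived simply catches up on its 
next run, which is why no reliable broadcast is needed.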

Honestly I would probably be willing to live with the performance hits 
of immediate updates for the convenience of push-updates, but the poor 
clustering behavior is fatal.  Emmanuel is even building push-based 
Lucene integration directly into Hibernate, which would make it 
trivial.  If he ever gets it to work in a clustered environment, I would 
be willing to consider a change.  Unfortunately the only option 
currently available is "nfs mount the index directory".

The one big downside of the current system is that deleted messages 
(deletion is a new feature) aren't removed from the index, so when you 
search the total counts might be wrong.  I've tried to program around this case as 
gracefully as possible but if you watch carefully the total count may 
appear to change slightly as you paginate.  It goes away if you rebuild 
the index, of course.
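
For illustration only, here is one way the shifting count could come 
about.  This is an assumed approach, not necessarily what SubEtha does: 
the class, method and parameter names are all hypothetical.  Hits for 
the requested page are checked against the live message ids, stale ones 
are dropped, and the reported total is reduced by the number of stale 
hits seen on that page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StaleHitsSketch {

    /** One page of live message ids plus the total as currently estimated. */
    static final class Page {
        final List<Long> ids;
        final int estimatedTotal;

        Page(List<Long> ids, int estimatedTotal) {
            this.ids = ids;
            this.estimatedTotal = estimatedTotal;
        }
    }

    /**
     * hits:    message ids returned by the (possibly stale) index, in rank order
     * liveIds: ids that still exist in the message store
     */
    static Page page(List<Long> hits, Set<Long> liveIds, int offset, int size) {
        int end = Math.min(offset + size, hits.size());
        List<Long> live = new ArrayList<>();
        int stale = 0;
        for (int i = Math.min(offset, end); i < end; i++) {
            Long id = hits.get(i);
            if (liveIds.contains(id)) {
                live.add(id);
            } else {
                stale++; // indexed, but deleted since the last index rebuild
            }
        }
        // Only stale hits on this page are known, so the estimate can differ per page.
        return new Page(live, hits.size() - stale);
    }

    public static void main(String[] args) {
        List<Long> hits = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
        Set<Long> liveIds = new HashSet<>(Arrays.asList(1L, 2L, 4L, 5L, 6L)); // 3 was deleted

        Page first = page(hits, liveIds, 0, 3);
        Page second = page(hits, liveIds, 3, 3);
        System.out.println(first.ids + " of " + first.estimatedTotal);
        System.out.println(second.ids + " of " + second.estimatedTotal);
    }
}
```

In this sketch the first page reports a total of 5 and the second a 
total of 6, which is the kind of drift described above.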

Btw if you have deleted messages, grab trunk as of just now - the 
handling of deleted messages is better.

There are some open questions as to how best to use Lucene, but 
unfortunately a one-size-fits-all solution isn't ideal.  On a single 
box, push works fine.  On a single box or a cluster of a few machines, 
having a pull-based index replicated on each machine is fine.  If you 
have a cluster of dozens of machines (think sourceforge), it's probably 
a lot better to have only a couple pull-based indexers in the cluster 
rather than trying to maintain a separate index on every appserver.  
This is something that may evolve over time - I've considered 
pulling the indexer out into a separate EAR that administrators can 
deploy separately from the main application.

Jeff


Corey Puffalt wrote:
> Jeff,
>
> I will try another snapshot.  I'm curious though, as to why the 
> decision was made to have the indexing service run as a periodic 
> batch-type process rather than having it integrated directly into the 
> incoming mail pipeline?  It seems there are a lot of disadvantages to 
> this approach as emails aren't immediately searchable.  Is there any 
> plan to change this?
>
> Thanks,
> Corey
>
> On 9/15/06, Jeff Schnitzer <jeff@infohazard.org> wrote:
>
>     Ok, I'm an idiot.  For future reference, if you call:
>
>     Timer.scheduleAtFixedRate(task, System.currentTimeMillis() + DELAY, DELAY);
>
>     ...you will not get a task that starts DELAY milliseconds from
>     now.  You will get a task that starts DELAY milliseconds plus,
>     oh, about 36 years from now :-)
>
>     Grab trunk, it will solve the search problem.
>
>     Postfix configuration is unfortunately complicated, and apparently
>     varies with Postfix version.  I think the docs will get a lot better
>     when we have a wiki.  If you can get away with it, the easiest
>     thing to do is run SubEtha directly on port 25, but that doesn't
>     work for everyone... not even me :-(
>
>     Jeff
>
>
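
For reference, a small sketch of the Timer pitfall from the quoted 
message.  java.util.Timer has two scheduleAtFixedRate overloads: 
(TimerTask, long delay, long period) takes a delay relative to now, and 
(TimerTask, Date firstTime, long period) takes an absolute time.  Passing 
System.currentTimeMillis() + DELAY as a long selects the first one, so 
the whole epoch timestamp becomes the delay.  The class name, helper 
names and the DELAY value below are illustrative assumptions.

```java
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class TimerDelaySketch {
    static final long DELAY = 50; // milliseconds; illustrative value only

    static final long MILLIS_PER_YEAR = 365L * 24 * 60 * 60 * 1000;

    /** Whole years a Timer would wait if an epoch timestamp were used as the relative delay. */
    static long yearsOfAccidentalDelay(long epochMillis) {
        return epochMillis / MILLIS_PER_YEAR;
    }

    /** Correct form 1: the second argument is a delay relative to now. */
    static boolean firesWithRelativeDelay(long timeoutMillis) {
        return fires(timeoutMillis, false);
    }

    /** Correct form 2: the second argument is an absolute Date. */
    static boolean firesWithAbsoluteDate(long timeoutMillis) {
        return fires(timeoutMillis, true);
    }

    private static boolean fires(long timeoutMillis, boolean useDate) {
        Timer timer = new Timer(true); // daemon thread, so it never blocks JVM exit
        final CountDownLatch ran = new CountDownLatch(1);
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                ran.countDown();
            }
        };
        try {
            if (useDate) {
                timer.scheduleAtFixedRate(task, new Date(System.currentTimeMillis() + DELAY), DELAY);
            } else {
                timer.scheduleAtFixedRate(task, DELAY, DELAY);
            }
            // True only if the task ran at least once before the timeout.
            return ran.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        // 1158278400000L is midnight UTC on 15 September 2006.
        System.out.println("accidental delay in years: " + yearsOfAccidentalDelay(1158278400000L));
        System.out.println("relative delay fired: " + firesWithRelativeDelay(5000));
        System.out.println("absolute date fired: " + firesWithAbsoluteDate(5000));
    }
}
```

A timestamp from mid-September 2006 works out to 36 whole years of 
delay, which matches the figure in the quoted message.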