From Jeff Schnitzer Sep 15, 2006 10:45 AM
The current design is quite deliberate, for several reasons:

1) Pull-based batching works in a cluster. If you are load balancing across two machines, only one of the machines will process the incoming email. Pull-batching means that every machine will update its index eventually. Push-based indexing would require some sort of reliable broadcast mechanism like durable JMS topics, because other machines in the cluster might be unavailable when the mail arrives.

2) Lucene strongly prefers batch updates. Open index, add 50 messages, close index is way, way, way faster than doing open index, add message, close index 50 times.

3) To the extent that Lucene does caching, it does so in an object that you must throw away every time you update the index. Frequent updates mean constantly trashing the cache, hurting performance.

Honestly I would probably be willing to live with the performance hits of immediate updates for the convenience of push-updates, but the poor clustering behavior is fatal. Emmanuel is even building push-based Lucene integration directly into Hibernate, which would make it trivial. If he ever gets it to work in a clustered environment, I would be willing to consider a change. Unfortunately the only option currently available is "nfs mount the index directory".

The one big downside of the current system is that deleted messages (deletion is a new feature) aren't removed from the index, so when you search, the total counts might be wrong. I've tried to program around this case as gracefully as possible, but if you watch carefully the total count may appear to change slightly as you paginate. It goes away if you rebuild the index, of course. Btw if you have deleted messages, grab trunk as of just now - the handling of deleted messages is better.

There are some open questions as to how best to use Lucene, but unfortunately a one-size-fits-all solution isn't ideal. On a single box, push works fine.
On a single box or a cluster of a few machines, having a pull-based index replicated on each machine is fine. If you have a cluster of dozens of machines (think sourceforge), it's probably a lot better to have only a couple pull-based indexers in the cluster rather than trying to maintain a separate index on every appserver. This is something that may be evolved over time - I've considered pulling the indexer out into a separate EAR that administrators can deploy separately from the main application.

Jeff

Corey Puffalt wrote:
> Jeff,
>
> I will try another snapshot. I'm curious though, as to why the
> decision was made to have the indexing service run as a periodic
> batch-type process rather than having it integrated directly into the
> incoming mail pipeline? It seems there are a lot of disadvantages to
> this approach as emails aren't immediately searchable. Is there any
> plan to change this?
>
> Thanks,
> Corey
>
> On 9/15/06, Jeff Schnitzer <jeff@infohazard.org> wrote:
>
> Ok, I'm an idiot. For future reference, if you call:
>
> Timer.scheduleAtFixedRate(task, System.currentTimeMillis() +
> DELAY, DELAY);
>
> ...you will not get a task that starts DELAY milliseconds from
> now. You will get a task that starts DELAY milliseconds plus, oh,
> about 36 years from now :-)
>
> Grab trunk, it will solve the search problem.
>
> Postfix configuration is unfortunately complicated, and apparently
> varies with Postfix version. I think the docs will get a lot better
> when we have a wiki. If you can get away with it, the easiest
> thing to do is run SubEtha directly on port 25, but that doesn't
> work for everyone... not even me :-(
>
> Jeff
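
For anyone who wants to see the Timer pitfall quoted above for themselves, here is a minimal sketch that uses only the standard java.util.Timer API. The class name TimerDelayDemo, the helper runsWithin, and the 200 ms DELAY are made up for illustration and are not SubEtha code. The point is that the second argument of scheduleAtFixedRate(TimerTask, long, long) is a delay relative to now, so passing an absolute timestamp pushes the first run decades into the future.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class TimerDelayDemo {
    // Illustrative value only; not taken from SubEtha.
    static final long DELAY = 200; // milliseconds

    /**
     * Schedules a no-op task using delayArg as the (relative) initial delay
     * and reports whether the task fired within timeoutMillis.
     */
    static boolean runsWithin(long delayArg, long timeoutMillis) {
        Timer timer = new Timer(true); // daemon thread, so the JVM can still exit
        final CountDownLatch ran = new CountDownLatch(1);
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                ran.countDown();
            }
        }, delayArg, DELAY);
        try {
            return ran.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel(); // drop the pending task either way
        }
    }

    public static void main(String[] args) {
        // Buggy form: an absolute timestamp passed where a relative delay is expected.
        long absolute = System.currentTimeMillis() + DELAY;
        System.out.println("absolute timestamp as delay, fired: " + runsWithin(absolute, 1000));

        // Fixed form: just the relative delay.
        System.out.println("relative delay, fired: " + runsWithin(DELAY, 1000));
    }
}
```

Running main should report that the first form did not fire within the timeout and the second did. If an absolute start time is what you actually want, Timer also has an overload that takes a java.util.Date for the first execution instead of a long delay.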
