17 """Provides search functionality for TurboGears_ using PyLucene_.
18
19 This module uses PyLucene to do all the heavy lifting, but as a result this
20 module does some fancy things with threads.
21
22 PyLucene requires that all threads that use it must inherit from
23 ``PythonThread``. This means either patching CherryPy_ and/or TurboGears, or
24 having the CherryPy thread hand off the request to a ``PythonThread`` and, in
25 the case of searching, wait for the result. The second method was chosen so
26 that a patched CherryPy or TurboGears does not have to be maintained.
27
28 The other advantage to the chosen method is that indexing happens in a separate
29 thread so the web request can return more quickly by not waiting for the
30 results.
31
32 The main disadvantage with PyLucene and CherryPy, however, is that *autoreload*
33 does not work with it. You **must** disable it by adding
34 ``autoreload.on = False`` to your ``dev.cfg``.
35
36 Configuration options
37 =====================
38
39 TurboLucene_ uses the following configuration options:
40
41 **turbolucene.search_fields**:
42 The list of fields that should be searched by default when a specific field
43 is not specified. (e.g. ``['id', 'title', 'text', 'categories']``)
44 (Default: ``['id']``)
45 **turbolucene.default_language**:
46 The default language to use if a language is not given calling
47 `add`/`update`/`search`/etc. (Default: ``'en'``)
48 **turbolucene.languages**:
49 The list of languages to support. This is a list of ISO language codes
50 that you want to support in your application. The languages must be
51 supported by PyLucene and must be configured in the languages
52 configuration file. Currently the choice of languages that are possible
53 out-of-the-box are : *Czech (cs)*, *Danish (da)*, *German (de)*, *Greek
54 (el)*, *English (en)*, *Spanish (es)*, *Finnish (fi)*, *French (fr)*,
55 *Italian (it)*, *Japanese (ja)*, *Korean (ko)*, *Dutch (nl)*, *Norwegian
56 (no)*, *Portuguese (pt)*, *Brazilian (pt-br)*, *Russian (ru)*, *Swedish
57 (sv)*, and *Chinese (zh)*. (Default: ``[<default_language>]``)
58 **turbolucene.default_operator**:
59 The default search operator to use between search terms when non is
60 specified. (Default: ``'AND'``) This must be a valid operator object from
61 the ``PyLucene.MultiFieldQueryParser.Operator`` namespace.
62 **turbolucene.optimize_days**:
63 The list of days to schedule index optimization. Index optimization cleans
64 up and compacts the indexes so that searches happen faster. This is a list
65 of day numbers (Sunday = 1). Optimization of all indexes will occur on
66 those days. (Default: ``[1, 2, 3, 4, 5, 6, 7]``, i.e. every day)
67 **turbolucene.optimize_time**:
68 A tuple containing the hour (24 hour format) and minute of the time to run
69 the scheduled index optimizations. (Default: ``(00, 00)``, i.e. midnight)
70 **turbolucene.index_root**:
71 The base path in which to store the indexes. There is one index per
72 supported language. Each index is a directory. Those directories will be
73 sub-directories of this base path. If the path is relative, it is
74 relative to your project's root. Normally you should not need to override
75 this unless you specifically need the indexes to be located somewhere else.
76 (Default: ``u'index'``)
77 **turbolucene.languages_file**:
78 The path to the languages configuration file. The languages configuration
79 file provides the configuration information for all the languages that
80 *TurboLucene* supports. Normally you should not need to override this.
81 (Default: the ``u'languages.cfg'`` file in the `turbolucene` package)
82 **turbolucene.languages_file_encoding**:
83 The encoding of the languages file. (Default: ``'utf-8'``)
84 **turbolucene.stopwords_root**:
85 The languages file can specify files that contain stopwords. If a
86 stopwords file path is relative, this path with be prepended to it. This
87 allows for all stopword files to be customized without needing to specify
88 full paths for every one. Normally you should not need to override this.
89 (Default: the ``stopwords`` directory in the `turbolucene` package)
90 **turbolucene.force_lock_release**:
91 If this is set to True, then if TurboLucne has troubles opening an index,
92 it will try to force the release of any write lock that may exist and try
93 again. The write lock is to prevent multiple processes writing to the
94 same index at the same time, but if the TurboLucne-based project is killed,
95 the lock gets left behind. This setting let you override the default
96 behaviour. (Default: ``True`` in development and ``False`` in production)
97
98 All fields are optional, but at the minimum, you will likely want to specify
99 ``turbolucene.search_fields``.
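
For example, a ``dev.cfg`` might contain entries like the following (the
values shown here are only illustrative, not recommendations)::

    # PyLucene and autoreload do not mix (see the warning below).
    autoreload.on = False

    turbolucene.search_fields = ['id', 'title', 'text', 'categories']
    turbolucene.languages = ['en', 'fr', 'de']
    turbolucene.default_language = 'en'
    turbolucene.default_operator = 'OR'
    turbolucene.optimize_days = [1, 4]
    turbolucene.optimize_time = (3, 30)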

:See: `_load_language_data` for details about the languages configuration file.

:Warning: Do not forget to turn off *autoreload* in ``dev.cfg``.

:Requires: TurboGears_ and PyLucene_

.. _TurboGears: http://turbogears.org/
.. _PyLucene: http://pylucene.osafoundation.org/
.. _CherryPy: http://cherrypy.org/
.. _TurboLucene: http://dev.krys.ca/turbolucene/

:newfield api_version: API Version
:newfield revision: Revision

:group Objects to use in make_document: Document, Field, STORE, COMPRESS,
    TOKENIZED, UN_TOKENIZED
:group Public API: start, add, update, remove, search

"""

__author__ = 'Krys Wilken'
__contact__ = 'krys AT krys DOT ca'
__copyright__ = '(c) 2007 Krys Wilken'
__license__ = 'MIT'
__version__ = '0.2.1'
__api_version__ = '2.0'
__revision__ = '$Id: __init__.py 60 2007-05-02 03:29:02Z krys $'
__docformat__ = 'restructuredtext en'
__all__ = ['start', 'add', 'update', 'remove', 'search', 'Document', 'Field',
    'STORE', 'COMPRESS', 'TOKENIZED', 'UN_TOKENIZED']


from Queue import Queue
from os.path import exists, join, isabs
from logging import getLogger
from atexit import register
import codecs

from turbogears import scheduler, config
from configobj import ConfigObj
from pkg_resources import resource_stream

import PyLucene
from PyLucene import (PythonThread, IndexModifier, JavaError, Term,
    IndexSearcher, MultiFieldQueryParser, FSDirectory)

# Re-exported for use in make_document functions (see the module docstring).
from PyLucene import Document, Field


_DEFAULT_LANGUAGE = 'en'

_log = getLogger('turbolucene')

_language_data = None

_indexer = None

_searcher_factory = None


# Field construction constants, re-exported for use in make_document
# functions (see the module docstring).
STORE = Field.Store.YES

COMPRESS = Field.Store.COMPRESS

TOKENIZED = Field.Index.TOKENIZED

UN_TOKENIZED = Field.Index.UN_TOKENIZED

185 """Load all the language data from the configured languages file.
186
187 The languages configuration file can be set with the
188 ``turbolucene.languages_file`` configuration option and it's encoding is
189 set with ``turbolucene.languages_file_encoding``.
190
191 Configuration file format
192 =========================
193
194 The languages file is an INI-type (ConfigObj_) file. Each section is
195 defined by an ISO language code (``en``, ``de``, ``el``, ``pt-br``, etc.).
196 In each section the following keys are possible:
197
198 **analyzer_class**:
199 The PyLucene analyzer class to use for this language. (e.g.
200 ``SnowballAnalyzer``) (Required)
201 **analyzer_class_args**:
202 Any arguments that should be passed to the analyzer class. (e.g.
203 ``Danish``) (Optional)
204 **stopwords**:
205 A list of stopwords (words that do not get indexed) to pass to the
206 analyzer class. This is not normally used as ``stopwords_file`` is
207 generally preferred. (Optional)
208 **stopwords_file**:
209 The path to the file that contains the list of stopwords to pass to the
210 analyzer class. (e.g. ``stopwords_da.txt``) (Optional)
211 **stopwords_file_encoding**:
212 The encoding of the stopwords file. (e.g. ``windows-1252``)
213
214 If neither ``stopwords`` or ``stopwords_file`` is defined for a language,
215 then any stopwords that are used are determined automatically by the
216 analyzer class' constructor.
217
218 Example
219 -------
220
221 ::
222
223 # German
224 [de]
225 analyzer_class = SnowballAnalyzer
226 analyzer_class_args = German2
227 stopwords_file = stopwords_de.txt
228 stopwords_file_encoding = windows-1252
229
230 :Exceptions:
231 - `IOError`: Raised of the languages configuration file could not be
232 opened.
233 - `configobj.ParseError`: Raised if the languages configuration file is
234 contains errors.
235
236 :See:
237 - `turbolucene` (module docstring) for details about configuration
238 settings.
239 - `_read_stopwords` for details about stopwords files.
240
241 .. _ConfigObj: http://www.voidspace.org.uk/python/configobj.html
242
243 """
244
245 global _language_data
246 languages_file = config.get('turbolucene.languages_file', None)
247 languages_file_encoding = config.get('turbolucene.languages_file_encoding',
248 'utf-8')
249 if languages_file:
250 _log.info(u'Loading custom language data from "%s"' % languages_file)
251 else:
252 _log.info(u'Loading default language data')
253 languages_file = resource_stream(__name__, u'languages.cfg')
254 _language_data = ConfigObj(languages_file,
255 encoding=languages_file_encoding, file_error=True, raise_errors=True)
256
257
259 """Schedule index optimization using the TurboGears scheduler.
260
261 This function reads it's configuration data from
262 ``turbolucene.optimize_days`` and ``turbolucene.optimize_time``.
263
264 :Exceptions:
265 - `TypeError`: Raised if ``turbolucene.optimize_time`` is invalid.
266
267 :See: `turbolucene` (module docstring) for details about configuration
268 settings.
269
270 """
271 optimize_days = config.get('turbolucene.optimize_days', range(1, 8))
272 optimize_time = config.get('turbolucene.optimize_time', (00, 00))
273 scheduler.add_weekday_task(_optimize, optimize_days, optimize_time)
274 _log.info(u'Index optimization scheduled on %s at %s' % (unicode(
275 optimize_days), unicode(optimize_time)))
276
277
279 """Return the path to the index for the given language.
280
281 This function gets it's configuration data from ``turbolucene.index_root``.
282
283 :Parameters:
284 language : `str`
285 An ISO language code. (e.g. ``en``, ``pt-br``, etc.)
286
287 :Returns: The path to the index for the given language.
288 :rtype: `unicode`
289
290 :See: `turbolucene` (module docstring) for details about configuration
291 settings.
292
293 """
294 index_base_path = config.get('turbolucene.index_root', u'index')
295 return join(index_base_path, language)
296
297
299 """Read the stopwords from the given a stopwords file path.
300
301 Stopwords are words that should not be indexed because they are too common
302 or have no significant meaning (e.g. *the*, *in*, *with*, etc.) They are
303 language dependent.
304
305 This function gets it's configuration data from
306 ``turbolucene.stopwords_root``.
307
308 If `file_path` is not an absolute path, then it will be appended to the
309 path configured in ``turbolucene.stopwords_root``.
310
311 Stopwords files are text files (in the given encoding), with one stopword
312 per line. Comments are marked by a ``|`` character. This is for
313 compatibility with the stopwords files found at
314 http://snowball.tartarus.org/.
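
    For example, the first few lines of a Danish stopwords file might look
    something like this (an illustrative excerpt, not an exact copy of any
    particular file)::

        og           | and
        i            | in
        jeg          | I
        det          | that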

    :Parameters:
        file_path : `unicode`
            The path to the stopwords file to read.
        encoding : `str`
            The encoding of the stopwords file.

    :Returns: The list of stopwords from the file.
    :rtype: `list` of `unicode` strings

    :Exceptions:
        - `IOError`: Raised if the stopwords file could not be opened.

    :See: `turbolucene` (module docstring) for details about configuration
        settings.

    """
    stopwords_base_path = config.get('turbolucene.stopwords_root', None)
    if isabs(file_path) or stopwords_base_path:
        if not isabs(file_path):
            file_path = join(stopwords_base_path, file_path)
        _log.info(u'Reading custom stopwords file "%s"' % file_path)
        stopwords_file = codecs.open(file_path, 'r', encoding)
    else:
        _log.info(u'Reading default stopwords file "%s"' % file_path)
        stopwords_file = codecs.getreader(encoding)(resource_stream(__name__,
            join(u'stopwords', file_path)))
    stopwords = []
    for line in stopwords_file:
        # Anything after a '|' is a comment, so keep only the part before it,
        # strip whitespace and skip lines that end up empty.
        stopword = line.split(u'|')[0].strip()
        if stopword:
            stopwords.append(stopword)
    stopwords_file.close()
    return stopwords

355 """Produce an analyzer object appropriate for the given language.
356
357 This function uses the data that was read in from the languages
358 configuration file to determine and instantiate the analyzer object.
359
360 :Parameters:
361 language : `str` or `unicode`
362 An ISO language code that is configured in the languages configuration
363 file.
364
365 :Returns: An instance of the configured analyser class for given language.
366 :rtype: ``PyLucene.Analyzer`` sub-class
367
368 :Exceptions:
369 - `KeyError`: Raised if the given language is not configured or if the
370 configuration for that language does not have a *analyzer_class* key.
371 - `PyLucene.InvalidArgsError`: Raised if any of the parameters passed to
372 the analyzer class are invalid.
373
374 :See: `_load_language_data` for details about the language configuration
375 file.
376
377 """
378 ldata = _language_data[language]
379 args = (u'analyzer_class_args' in ldata and ldata[u'analyzer_class_args']
380 or [])
381 if not isinstance(args, list):
382 args = [args]
383
384
385
386 stopwords = []
387 if u'stopwords' in ldata and ldata[u'stopwords']:
388 stopwords = [ldata.stopwords]
389 elif u'stopwords_file' in ldata and u'stopwords_file_encoding' in ldata:
390 stopwords = [_read_stopwords(ldata[u'stopwords_file'],
391 ldata[u'stopwords_file_encoding'])]
392
393
394
395 args += stopwords
396
397 return getattr(PyLucene, ldata[
398 u'analyzer_class'])(*args)

def _stop():
    """Shut down the indexer and searcher factory threads at exit."""
    _indexer('stop')
    _searcher_factory.stop()

409 """Tell the search engine to optimize it's index."""
410 _indexer('optimize')
411
412
413
414

def start(make_document, results_formatter=None):
    """Initialize and start the search engine threads.

    This function loads the language configuration information, starts the
    search engine threads, makes sure the search engine will be shut down upon
    shutdown of TurboGears and starts the optimization scheduler to run at the
    configured times.

    The `make_document` and `results_formatter` parameters are
    callables. Here are examples of how they should be defined:

    Example `make_document` function:
    ===================================

    .. python::

        def make_document(entry):
            '''Make a new PyLucene Document instance from an Entry instance.'''
            document = Document()
            # An 'id' string field is required.
            document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED))
            document.add(Field('posted_on', entry.rendered_posted_on, STORE,
                TOKENIZED))
            document.add(Field('title', entry.title, STORE, TOKENIZED))
            document.add(Field('text', strip_tags(entry.etree), COMPRESS,
                TOKENIZED))
            categories = ' '.join([unicode(category) for category in
                entry.categories])
            document.add(Field('category', categories, STORE, TOKENIZED))
            return document

    Example `results_formatter` function:
    =======================================

    .. python::

        def results_formatter(results):
            '''Return the results as SQLObject instances.

            Returns either an empty list or a SelectResults object.

            '''
            if results:
                return Entry.select_with_identity(IN(Entry.q.id, [int(id) for
                    id in results]))
            return []

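    Example of starting the engine:
    ===============================

    `start` itself is normally called just once, early during application
    start-up; exactly where is up to your application (the snippet below is
    only a sketch):

    .. python::

        import turbolucene

        turbolucene.start(make_document, results_formatter)
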
    :Parameters:
        make_document : callable
            `make_document` is a callable that will return a PyLucene
            `Document` object based on the object passed in to `add`, `update`
            or `remove`. The `Document` object must have at least a field
            called ``id`` that is a string. This function operates inside a
            PyLucene ``PythonThread``.
        results_formatter : callable
            `results_formatter`, if provided, is a callable that will return
            a formatted version of the search results that are passed to it by
            `_Searcher.__call__`. Generally the `results_formatter` will take
            the list of ``id`` strings that is passed to it and return a list
            of application-specific objects (like SQLObject_ instances, for
            example.) This function operates outside of any PyLucene
            ``PythonThread`` objects (like in the CherryPy thread, for
            example). (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `_load_language_data` for details about the language configuration
          file.

    .. _SQLObject: http://sqlobject.org/

    """
    _load_language_data()

    global _indexer, _searcher_factory
    _indexer = _Indexer(make_document)
    _searcher_factory = _SearcherFactory(results_formatter)

    register(_stop)
    _schedule_optimization()
    _log.info(u'Search engine started.')


def add(object_, language=None):
    """Tell the search engine to add the given object to the index.

    This function returns immediately. It does not wait for the indexer to be
    finished.

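    For example, a controller could index a newly created entry right after
    saving it (``Entry`` here is just an illustrative model class, as in the
    `start` examples):

    .. python::

        entry = Entry(title=title, text=text)
        add(entry)
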
    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('add', object_, language)


def update(object_, language=None):
    """Tell the search engine to update the index for the given object.

    This function returns immediately. It does not wait for the indexer to be
    finished.

    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('update', object_, language)


def remove(object_, language=None):
    """Tell the search engine to remove the given object from the index.

    This function returns immediately. It does not wait for the indexer to be
    finished.

    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('remove', object_, language)


def search(query, language=None):
    """Return results from the search engine that match the query.

    If a ``results_formatter`` function was passed to `start` then the results
    will be passed through the formatter before returning. If not, the
    returned value is a list of strings that are the ``id`` fields of matching
    objects.

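    For example (the field names must be ones that your ``make_document``
    function actually indexes; the ones below are just illustrative):

    .. python::

        results = search(u'title:pylucene OR text:threads')
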
    :Parameters:
        query : `str` or `unicode`
            This is the search query to give to PyLucene. All of Lucene's
            query syntax (field identifiers, wild cards, etc.) is available.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :Returns: The results of the search.
    :rtype: iterable

    :See:
        - `start` for details about ``results_formatter``.
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - http://lucene.apache.org/java/docs/queryparsersyntax.html for
          details about Lucene's query syntax.

    """
    return _searcher_factory()(query, language)


class _Indexer(PythonThread):

    """Responsible for updating and maintaining the search engine index.

    A single `_Indexer` thread is created to handle all index modifications.

    Once the thread is started, messages are sent to it by calling the instance
    with a task and an object, where the task is one of the following strings:

    - ``add``: Adds the object to the index.
    - ``remove``: Removes the object from the index.
    - ``update``: Updates the index of an object.

    and the object is any object that ``make_document`` knows how to handle.

    To properly shut down the thread, send the ``stop`` task with `None` as the
    object. (This is normally handled by the `turbolucene._stop` function.)

    To optimize the index, which can take a while, pass the ``optimize``
    task with `None` for the object. (This is normally handled by the
    TurboGears scheduler as set up by `_schedule_optimization`.)

    :See: `turbolucene.start` for details about ``make_document``.

    :group Public API: __init__, __call__
    :group Threaded methods: run, _add, _remove, _update, _optimize, _stop

    """

639 """Initialize the message queue and the PyLucene indexes.
640
641 One PyLucene index is created/opened for each of the configured
642 supported languages.
643
644 This method uses the ``turbolucene.default_language``,
645 ``turbolucene.languages`` and ``turbolucene.force_lock_release``
646 configuration settings.
647
648 :Parameters:
649 make_document : callable
650 A callable that takes the object to index as a parameter and
651 returns an appropriate `Document` object.
652
653 :Note: Instantiating this class starts the thread automatically.
654
655 :See:
656 - `turbolucene` (module docstring) for details about configuration
657 settings.
658 - `turbolucene.start` for details about ``make_document``.
659 - `_get_index_path` for details about the directory location of each
660 index.
661 - `_analyzer_factory` for details about the analyzer used for each
662 index.
663
664 """
665 PythonThread.__init__(self)
666 self._make_document = make_document
667 self._task_queue = Queue()
668 self._indexes = {}
669 default_language = config.get('turbolucene.default_language',
670 _DEFAULT_LANGUAGE)
671 default_force_lock_release = config.get('server.environment',
672 'development').lower() == 'development' and True or False
673 force_lock_release = config.get('turbolucene.force_lock_release',
674 default_force_lock_release)
675
676 languages = config.get('turbolucene.languages', [default_language])
677 for language in languages:
678 index_path = _get_index_path(language)
679 analyzer = _analyzer_factory(language)
680 create_path = not exists(index_path) and True or False
681 try:
682 self._indexes[language] = IndexModifier(index_path, analyzer,
683 create_path)
684 except JavaError, error:
685 if not error.getJavaException().getClass().getName(
686 ) == 'java.io.IOException' or not force_lock_release:
687 raise
688 _log.warn('Error opening index "%s". '
689 'turbolucene.force_lock_release is True, trying to force '
690 'lock release.' % index_path)
691 FSDirectory.getDirectory(index_path, False).makeLock(
692 'write.lock').release()
693 self._indexes[language] = IndexModifier(index_path, analyzer,
694 create_path)
695 self.start()
696
    def __call__(self, task, object_=None, language=None):
        """Pass `task`, `object_` and `language` to the thread for processing.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        If `task` is ``stop``, then the `_Indexer` thread is shut down and this
        method will wait until the shutdown is complete.

        :Parameters:
            task : `str`
                The task to perform.
            `object_`
                Any object that ``make_document`` knows how to handle.
                (Default: `None`)
            language : `str`
                The ISO language code of the language of the object. This
                specifies which PyLucene index to use.

        :See:
            - `turbolucene` (module docstring) for details about configuration
              settings.
            - `turbolucene.start` for details about ``make_document``.

        """
        if not language:
            language = config.get('turbolucene.default_language',
                _DEFAULT_LANGUAGE)
        self._task_queue.put((task, object_, language))
        if task == 'stop':
            self.join()

732 """Main thread loop to do dispatching based on messages in the queue.
733
734 This method expects that the queue will contain 3-tuples in the form of
735 (task, object, language), where task is one of ``add``, ``update``,
736 ``remove``, ``optimize`` or ``stop``, entry is any object that
737 ``make_document`` can handle or `None` in the case of ``optimize`` and
738 ``stop``, and language is the ISO language code of the indexer.
739
740 If the task is ``stop``, then the thread shuts down.
741
742 :Note: This method is run in the thread.
743
744 :See:
745 - `_add`, `_update`, `_remove`, `_optimize` and `_stop` for details
746 about each respective task.
747 - `turbolucene.start` for details about ``make_document``.
748
749 """
750 while True:
751 task, object_, language = self._task_queue.get()
752 method = getattr(self, '_' + task)
753 if task in ('optimize', 'stop'):
754 method()
755 else:
756 method(object_, language)
757 if task == 'stop':
758 break
759 self._indexes[language].flush()
760
    def _add(self, object_, language, document=None):
        """Add a new object to the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then indexes the resulting `Document` object.
        Otherwise it just indexes the `document` object.

        :Parameters:
            `object_`
                The object to be indexed. It will be passed to
                ``make_document`` (unless `document` is provided).
            language : `str`
                The ISO language code of the indexer to use.
            document : `Document`
                A pre-built `Document` object for the given object, if it
                exists. This is used internally by `_update`. (Default:
                `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Adding object "%s" (id %s) to the %s index.' % (unicode(
            object_), document['id'], language))
        self._indexes[language].addDocument(document)

    def _remove(self, object_, language, document=None):
        """Remove an object from the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then removes the resulting `Document` object
        from the index. Otherwise it just removes the `document` object.

        :Parameters:
            `object_`
                The object to be removed from the index. It will be passed to
                ``make_document`` (unless `document` is provided).
            language : `str`
                The ISO language code of the indexer to use.
            document : `Document`
                A pre-built `Document` object for the given object, if it
                exists. This is used internally by `_update`. (Default:
                `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Removing object "%s" (id %s) from %s index.' % (unicode(
            object_), document['id'], language))
        self._indexes[language].deleteDocuments(Term('id', document['id']))

    def _update(self, object_, language):
        """Update an object in the index by replacing it.

        This method updates the index by removing and then re-adding the
        object.

        :Parameters:
            `object_`
                The object to update in the index. It will be passed to
                ``make_document`` and the resulting `Document` object will be
                updated.
            language : `str`
                The ISO language code of the indexer to use.

        :Note: This method is run in the thread.

        :See:
            - `_remove` and `_add` for details about the removal and
              re-addition.
            - `turbolucene.start` for details about ``make_document``.

        """
        document = self._make_document(object_)
        self._remove(object_, language, document)
        self._add(object_, language, document)

844 """Optimize all of the indexes. This can take a while.
845
846 :Note: This method is run in the thread.
847
848 """
849 _log.info(u'Optimizing indexes.')
850 for index in self._indexes.values():
851 index.optimize()
852 _log.info(u'Indexes optimized.')
853
855 """Shutdown all of the indexes.
856
857 :Note: This method is run in the thread.
858
859 """
860 for index in self._indexes.values():
861 index.close()
862
863

class _Searcher(PythonThread):

    """Responsible for searching an index and returning results.

    `_Searcher` threads are created for each search that is requested. After
    the search is completed, the thread dies.

    To search, a `_Searcher` class is instantiated and then called with the
    query and the ISO language code for the index to search. It returns the
    results as a list of object id strings unless ``results_formatter`` was
    provided. If it was, then the list of id strings is passed to
    ``results_formatter`` to process and its results are returned.

    The thread is garbage collected when it goes out of scope.

    The catch to all this is that a CherryPy thread cannot directly instantiate
    a `_Searcher` thread because of PyLucene restrictions. So to get around
    that, see the `_SearcherFactory` class.

    :See: `turbolucene.start` for details about ``results_formatter``.

    :group Public API: __init__, __call__
    :group Threaded methods: run

    """

893 """Initialize message queues and start the thread.
894
895 :Note: The thread is started as soon as the class is instantiated.
896
897 """
898 PythonThread.__init__(self)
899 self._results_formatter = results_formatter
900 self._query_queue = Queue()
901 self._results_queue = Queue()
902 self.start()
903
    def __call__(self, query, language=None):
        """Send `query` and `language` to the thread, wait and return results.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        :Parameters:
            query : `str` or `unicode`
                The search query to give to PyLucene. All of Lucene's query
                syntax (field identifiers, wild cards, etc.) is available.
            language : `str`
                The ISO language code of the indexer to use.

        :Returns: An iterable of id field strings that match the query or the
            results produced by ``results_formatter`` if it was provided.
        :rtype: iterable

        :See:
            - `turbolucene` (module docstring) for details about configuration
              settings.
            - `turbolucene.start` for details about ``results_formatter``.
            - http://lucene.apache.org/java/docs/queryparsersyntax.html for
              details about Lucene's query syntax.

        """
        if not language:
            language = config.get('turbolucene.default_language',
                _DEFAULT_LANGUAGE)
        self._query_queue.put((query, language))
        results = self._results_queue.get()

        # The formatter (if any) runs here, in the calling thread (e.g. the
        # CherryPy thread), not inside the PythonThread.
        if self._results_formatter:
            return self._results_formatter(results)
        return results

945 """Search the language index for the query and send back the results.
946
947 The results is an iterable of id field strings that match the query.
948
949 This method uses the ``turbolucene.search_fields`` configuration
950 setting for the default fields to search if none are specified in the
951 query itself, and ``turbolucene.default_operator`` for the default
952 operator to use when joining terms.
953
954 :Exceptions:
955 - `AttributeError`: Raised when the configured default operator is
956 not valid.
957
958 :Note: This method is run in the thread.
959
960 :Note: The thread dies after one search.
961
962 :See:
963 - `turbolucene` (module docstring) for details about configuration
964 settings.
965 - `_get_index_path` for details about the directory location of the
966 index.
967 - `_analyzer_factory` for details about the analyzer used for the
968 index.
969 - http://lucene.apache.org/java/docs/queryparsersyntax.html for
970 details about Lucene's query syntax.
971
972 """
973 query, language = self._query_queue.get()
974 searcher = IndexSearcher(_get_index_path(language))
975 search_fields = config.get('turbolucene.search_fields', ['id'])
976 parser = MultiFieldQueryParser(search_fields, _analyzer_factory(
977 language))
978 default_operator = getattr(parser.Operator, config.get(
979 'turbolucene.default_operator', 'AND').upper())
980 parser.setDefaultOperator(default_operator)
981 try:
982 hits = searcher.search(parser.parse(query))
983 results = [document['id'] for _, document in hits]
984 except JavaError:
985 results = []
986 self._results_queue.put(results)
987 searcher.close()
988
989

class _SearcherFactory(PythonThread):

    """Produces running `_Searcher` threads.

    ``PythonThread`` threads can only be started by the main program or other
    ``PythonThread`` threads, so this ``PythonThread``-based class creates and
    starts single-use `_Searcher` threads. This thread is created and started
    by the main program during TurboGears initialization as a singleton.

    To get a `_Searcher` thread, call the `_SearcherFactory` instance. Then
    pass the query to the `_Searcher` thread that was returned.

    :group Public API: __init__, __call__, stop
    :group Threaded methods: run

    """

    def __init__(self, *searcher_args, **searcher_kwargs):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self)
        self._searcher_args = searcher_args
        self._searcher_kwargs = searcher_kwargs
        self._request_queue = Queue()
        self._searcher_queue = Queue()
        self.start()

1023 """Send a request for a running `_Searcher` class, then return it.
1024
1025 :Returns: A running instance of the `_Searcher` class.
1026 :rtype: `_Searcher`
1027
1028 """
1029 self._request_queue.put('request')
1030 return self._searcher_queue.get()
1031
1033 """Stop the `_SearcherFactory` thread."""
1034 self._request_queue.put('stop')
1035 self.join()
1036
1037
1038
1040 """Listen for requests and create `_Searcher` classes.
1041
1042 If the request message is ``stop``, then the thread will be shutdown.
1043
1044 :Note: This method is run in the thread.
1045
1046 """
1047 while True:
1048 request = self._request_queue.get()
1049 if request == 'stop':
1050 break
1051
1052 self._searcher_queue.put(_Searcher(
1053 *self._searcher_args, **self._searcher_kwargs))
1054