
Source Code for Package turbolucene

# -*- coding: utf-8 -*-

#---Header----------------------------------------------------------------------

#==============================================================================
# turbolucene/__init__.py
#
# This is part of the TurboLucene project (http://dev.krys.ca/turbolucene/).
#
# Copyright (c) 2007 Krys Wilken <krys AT krys DOT ca>
#
# This software is licensed under the MIT license.  See the LICENSE file for
# licensing information.
#
#==============================================================================

"""Provides search functionality for TurboGears_ using PyLucene_.

This module uses PyLucene to do all the heavy lifting, but as a result this
module does some fancy things with threads.

PyLucene requires that all threads that use it inherit from
``PythonThread``.  This means either patching CherryPy_ and/or TurboGears, or
having the CherryPy thread hand off the request to a ``PythonThread`` and, in
the case of searching, wait for the result.  The second method was chosen so
that a patched CherryPy or TurboGears does not have to be maintained.

The other advantage to the chosen method is that indexing happens in a separate
thread, so the web request can return more quickly by not waiting for the
results.

The main disadvantage with PyLucene and CherryPy, however, is that *autoreload*
does not work with it.  You **must** disable it by adding
``autoreload.on = False`` to your ``dev.cfg``.
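
Here is a minimal usage sketch.  The ``Entry`` model and its fields are only
illustrative assumptions; adapt the example to your own project:

.. python::

  import turbolucene
  from turbolucene import Document, Field, STORE, TOKENIZED, UN_TOKENIZED

  def make_document(entry):
      '''Turn a (hypothetical) Entry object into a PyLucene Document.'''
      document = Document()
      # TurboLucene requires an 'id' field that is a string.
      document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED))
      document.add(Field('title', entry.title, STORE, TOKENIZED))
      return document

  # During application start-up:
  turbolucene.start(make_document)
  # After creating or changing an Entry (from any CherryPy thread):
  turbolucene.add(entry)
  # In a controller; returns a list of matching 'id' strings by default:
  results = turbolucene.search(u'title:something')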

Configuration options
=====================

TurboLucene_ uses the following configuration options:

  **turbolucene.search_fields**:
    The list of fields that should be searched by default when a specific field
    is not specified.  (e.g. ``['id', 'title', 'text', 'categories']``)
    (Default: ``['id']``)
  **turbolucene.default_language**:
    The default language to use if a language is not given when calling
    `add`/`update`/`search`/etc.  (Default: ``'en'``)
  **turbolucene.languages**:
    The list of languages to support.  This is a list of ISO language codes
    that you want to support in your application.  The languages must be
    supported by PyLucene and must be configured in the languages
    configuration file.  Currently the languages that are possible
    out-of-the-box are: *Czech (cs)*, *Danish (da)*, *German (de)*, *Greek
    (el)*, *English (en)*, *Spanish (es)*, *Finnish (fi)*, *French (fr)*,
    *Italian (it)*, *Japanese (ja)*, *Korean (ko)*, *Dutch (nl)*, *Norwegian
    (no)*, *Portuguese (pt)*, *Brazilian (pt-br)*, *Russian (ru)*, *Swedish
    (sv)*, and *Chinese (zh)*.  (Default: ``[<default_language>]``)
  **turbolucene.default_operator**:
    The default search operator to use between search terms when none is
    specified.  (Default: ``'AND'``)  This must name a valid operator in the
    ``PyLucene.MultiFieldQueryParser.Operator`` namespace.
  **turbolucene.optimize_days**:
    The list of days to schedule index optimization.  Index optimization cleans
    up and compacts the indexes so that searches happen faster.  This is a list
    of day numbers (Sunday = 1).  Optimization of all indexes will occur on
    those days.  (Default: ``[1, 2, 3, 4, 5, 6, 7]``, i.e. every day)
  **turbolucene.optimize_time**:
    A tuple containing the hour (24 hour format) and minute of the time to run
    the scheduled index optimizations.  (Default: ``(00, 00)``, i.e. midnight)
  **turbolucene.index_root**:
    The base path in which to store the indexes.  There is one index per
    supported language.  Each index is a directory.  Those directories will be
    sub-directories of this base path.  If the path is relative, it is
    relative to your project's root.  Normally you should not need to override
    this unless you specifically need the indexes to be located somewhere else.
    (Default: ``u'index'``)
  **turbolucene.languages_file**:
    The path to the languages configuration file.  The languages configuration
    file provides the configuration information for all the languages that
    *TurboLucene* supports.  Normally you should not need to override this.
    (Default: the ``u'languages.cfg'`` file in the `turbolucene` package)
  **turbolucene.languages_file_encoding**:
    The encoding of the languages file.  (Default: ``'utf-8'``)
  **turbolucene.stopwords_root**:
    The languages file can specify files that contain stopwords.  If a
    stopwords file path is relative, this path will be prepended to it.  This
    allows for all stopword files to be customized without needing to specify
    full paths for every one.  Normally you should not need to override this.
    (Default: the ``stopwords`` directory in the `turbolucene` package)
  **turbolucene.force_lock_release**:
    If this is set to True, then if TurboLucene has trouble opening an index,
    it will try to force the release of any write lock that may exist and try
    again.  The write lock is to prevent multiple processes writing to the
    same index at the same time, but if the TurboLucene-based project is
    killed, the lock gets left behind.  This setting lets you override the
    default behaviour.  (Default: ``True`` in development and ``False`` in
    production)

All of these options are optional, but at the minimum, you will likely want to
specify ``turbolucene.search_fields``.
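
For illustration, the options might look like this in your project's
``app.cfg`` or ``dev.cfg`` (the values below are examples, not the defaults)::

  # Remember: autoreload must be off for PyLucene to work.
  autoreload.on = False

  turbolucene.search_fields = ['id', 'title', 'text', 'categories']
  turbolucene.default_language = 'en'
  turbolucene.languages = ['en', 'fr', 'de']
  turbolucene.default_operator = 'AND'
  turbolucene.optimize_days = [1]
  turbolucene.index_root = 'index'
  turbolucene.force_lock_release = False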

:See: `_load_language_data` for details about the languages configuration file.

:Warning: Do not forget to turn off *autoreload* in ``dev.cfg``.

:Requires: TurboGears_ and PyLucene_

.. _TurboGears: http://turbogears.org/
.. _PyLucene: http://pylucene.osafoundation.org/
.. _CherryPy: http://cherrypy.org/
.. _TurboLucene: http://dev.krys.ca/turbolucene/

:newfield api_version: API Version
:newfield revision: Revision

:group Objects to use in make_document: Document, Field, STORE, COMPRESS,
  TOKENIZED, UN_TOKENIZED
:group Public API: start, add, update, remove, search

"""

__author__ = 'Krys Wilken'
__contact__ = 'krys AT krys DOT ca'
__copyright__ = '(c) 2007 Krys Wilken'
__license__ = 'MIT'
__version__ = '0.2.1'
__api_version__ = '2.0'
__revision__ = '$Id: __init__.py 60 2007-05-02 03:29:02Z krys $'
__docformat__ = 'restructuredtext en'
__all__ = ['start', 'add', 'update', 'remove', 'search', 'Document', 'Field',
  'STORE', 'COMPRESS', 'TOKENIZED', 'UN_TOKENIZED']


#---Imports----------------------------------------------------------------------

#---  Standard library imports
from Queue import Queue
from os.path import exists, join, isabs
from logging import getLogger
from atexit import register
import codecs

#---  Framework imports
from turbogears import scheduler, config
from configobj import ConfigObj
# PyLint does not like this setuptools voodoo, but it works.
from pkg_resources import resource_stream # pylint: disable-msg=E0611

#---  Third-party imports
import PyLucene
from PyLucene import (PythonThread, IndexModifier, JavaError, Term,
  IndexSearcher, MultiFieldQueryParser, FSDirectory)
# For use in make_document
from PyLucene import Document, Field


#---Globals----------------------------------------------------------------------

#: Default language to use if none is specified in `config`.
_DEFAULT_LANGUAGE = 'en'
# These are intentionally module-level globals, so C0103 does not apply.
#: Logger for this module
_log = getLogger('turbolucene') # pylint: disable-msg=C0103
#: This will hold the language support data read from file.
_language_data = None # pylint: disable-msg=C0103
#: This will hold the `_Indexer` singleton instance.
_indexer = None # pylint: disable-msg=C0103
#: This will hold the `_SearcherFactory` singleton instance.
_searcher_factory = None # pylint: disable-msg=C0103

#---  Convenience constants

#: Tells `Field` to store the field data without compression
STORE = Field.Store.YES
#: Tells `Field` to compress the field data
COMPRESS = Field.Store.COMPRESS
#: Tells `Field` to tokenize and do stemming on the field data
TOKENIZED = Field.Index.TOKENIZED
#: Tells `Field` not to tokenize or do stemming on the field data
UN_TOKENIZED = Field.Index.UN_TOKENIZED


#---Functions--------------------------------------------------------------------

def _load_language_data():
    """Load all the language data from the configured languages file.

    The languages configuration file can be set with the
    ``turbolucene.languages_file`` configuration option and its encoding is
    set with ``turbolucene.languages_file_encoding``.

    Configuration file format
    =========================

    The languages file is an INI-type (ConfigObj_) file.  Each section is
    defined by an ISO language code (``en``, ``de``, ``el``, ``pt-br``, etc.).
    In each section the following keys are possible:

      **analyzer_class**:
        The PyLucene analyzer class to use for this language.  (e.g.
        ``SnowballAnalyzer``)  (Required)
      **analyzer_class_args**:
        Any arguments that should be passed to the analyzer class.  (e.g.
        ``Danish``)  (Optional)
      **stopwords**:
        A list of stopwords (words that do not get indexed) to pass to the
        analyzer class.  This is not normally used as ``stopwords_file`` is
        generally preferred.  (Optional)
      **stopwords_file**:
        The path to the file that contains the list of stopwords to pass to
        the analyzer class.  (e.g. ``stopwords_da.txt``)  (Optional)
      **stopwords_file_encoding**:
        The encoding of the stopwords file.  (e.g. ``windows-1252``)

    If neither ``stopwords`` nor ``stopwords_file`` is defined for a language,
    then any stopwords that are used are determined automatically by the
    analyzer class' constructor.

    Example
    -------

    ::

      # German
      [de]
      analyzer_class = SnowballAnalyzer
      analyzer_class_args = German2
      stopwords_file = stopwords_de.txt
      stopwords_file_encoding = windows-1252

    :Exceptions:
      - `IOError`: Raised if the languages configuration file could not be
        opened.
      - `configobj.ParseError`: Raised if the languages configuration file
        contains errors.

    :See:
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - `_read_stopwords` for details about stopwords files.

    .. _ConfigObj: http://www.voidspace.org.uk/python/configobj.html

    """
    # Use of global here is intentional and necessary.  W0603 does not apply.
    global _language_data # pylint: disable-msg=W0603
    languages_file = config.get('turbolucene.languages_file', None)
    languages_file_encoding = config.get('turbolucene.languages_file_encoding',
      'utf-8')
    if languages_file:
        _log.info(u'Loading custom language data from "%s"' % languages_file)
    else:
        _log.info(u'Loading default language data')
        languages_file = resource_stream(__name__, u'languages.cfg')
    _language_data = ConfigObj(languages_file,
      encoding=languages_file_encoding, file_error=True, raise_errors=True)


def _schedule_optimization():
    """Schedule index optimization using the TurboGears scheduler.

    This function reads its configuration data from
    ``turbolucene.optimize_days`` and ``turbolucene.optimize_time``.

    :Exceptions:
      - `TypeError`: Raised if ``turbolucene.optimize_time`` is invalid.

    :See: `turbolucene` (module docstring) for details about configuration
      settings.

    """
    optimize_days = config.get('turbolucene.optimize_days', range(1, 8))
    optimize_time = config.get('turbolucene.optimize_time', (00, 00))
    scheduler.add_weekday_task(_optimize, optimize_days, optimize_time)
    _log.info(u'Index optimization scheduled on %s at %s' % (unicode(
      optimize_days), unicode(optimize_time)))


def _get_index_path(language):
    """Return the path to the index for the given language.

    This function gets its configuration data from
    ``turbolucene.index_root``.

    :Parameters:
      language : `str`
        An ISO language code.  (e.g. ``en``, ``pt-br``, etc.)

    :Returns: The path to the index for the given language.
    :rtype: `unicode`

    :See: `turbolucene` (module docstring) for details about configuration
      settings.

    """
    index_base_path = config.get('turbolucene.index_root', u'index')
    return join(index_base_path, language)


def _read_stopwords(file_path, encoding):
    """Read the stopwords from the given stopwords file path.

    Stopwords are words that should not be indexed because they are too common
    or have no significant meaning (e.g. *the*, *in*, *with*, etc.)  They are
    language dependent.

    This function gets its configuration data from
    ``turbolucene.stopwords_root``.

    If `file_path` is not an absolute path, then it will be appended to the
    path configured in ``turbolucene.stopwords_root``.

    Stopwords files are text files (in the given encoding), with one stopword
    per line.  Comments are marked by a ``|`` character.  This is for
    compatibility with the stopwords files found at
    http://snowball.tartarus.org/.

    :Parameters:
      file_path : `unicode`
        The path to the stopwords file to read.
      encoding : `str`
        The encoding of the stopwords file.

    :Returns: The list of stopwords from the file.
    :rtype: `list` of `unicode` strings

    :Exceptions:
      - `IOError`: Raised if the stopwords file could not be opened.

    :See: `turbolucene` (module docstring) for details about configuration
      settings.

    """
    stopwords_base_path = config.get('turbolucene.stopwords_root', None)
    if isabs(file_path) or stopwords_base_path:
        if not isabs(file_path):
            file_path = join(stopwords_base_path, file_path)
        _log.info(u'Reading custom stopwords file "%s"' % file_path)
        stopwords_file = codecs.open(file_path, 'r', encoding)
    else:
        _log.info(u'Reading default stopwords file "%s"' % file_path)
        stopwords_file = codecs.getreader(encoding)(resource_stream(__name__,
          join(u'stopwords', file_path)))
    stopwords = []
    for line in stopwords_file:
        # Stopword files can have comments after a '|' character on each line.
        # This is to support the stopword files that come from
        # http://snowball.tartarus.org/
        stopword = line.split(u'|')[0].strip()
        if stopword:
            stopwords.append(stopword)
    stopwords_file.close()
    return stopwords


def _analyzer_factory(language):
    """Produce an analyzer object appropriate for the given language.

    This function uses the data that was read in from the languages
    configuration file to determine and instantiate the analyzer object.

    :Parameters:
      language : `str` or `unicode`
        An ISO language code that is configured in the languages configuration
        file.

    :Returns: An instance of the configured analyzer class for the given
      language.
    :rtype: ``PyLucene.Analyzer`` sub-class

    :Exceptions:
      - `KeyError`: Raised if the given language is not configured or if the
        configuration for that language does not have an *analyzer_class* key.
      - `PyLucene.InvalidArgsError`: Raised if any of the parameters passed to
        the analyzer class are invalid.

    :See: `_load_language_data` for details about the language configuration
      file.

    """
    ldata = _language_data[language]
    args = (u'analyzer_class_args' in ldata and ldata[u'analyzer_class_args']
      or [])
    if not isinstance(args, list):
        args = [args]
    # Note: It seems that the <LANGUAGE>_STOP_WORDS class variables are not
    # exposed very often in PyLucene.  They are also not very complete anyway,
    # so I use stopwords from other sources.
    stopwords = []
    if u'stopwords' in ldata and ldata[u'stopwords']:
        stopwords = [ldata[u'stopwords']]
    elif u'stopwords_file' in ldata and u'stopwords_file_encoding' in ldata:
        stopwords = [_read_stopwords(ldata[u'stopwords_file'],
          ldata[u'stopwords_file_encoding'])]
    # This function assumes that the stopwords parameter is always the last
    # argument to the analyzer constructor.  According to the Lucene docs,
    # this is true in all cases so far.
    args += stopwords
    # Use of *args here is deliberate and necessary, so W0142 does not apply.
    return getattr(PyLucene, ldata[ #pylint: disable-msg=W0142
      u'analyzer_class'])(*args)


def _stop():
    """Shut down the search engine threads."""
    _searcher_factory.stop()
    _indexer('stop')
    _log.info(u'Search engine stopped.')


def _optimize():
    """Tell the search engine to optimize its indexes."""
    _indexer('optimize')


#---  Public API

def start(make_document, results_formatter=None):
    """Initialize and start the search engine threads.

    This function loads the language configuration information, starts the
    search engine threads, makes sure the search engine will be shut down when
    TurboGears shuts down, and starts the optimization scheduler to run at the
    configured times.

    The `make_document` and `results_formatter` parameters are callables.
    Here are examples of how they should be defined:

    Example `make_document` function:
    =================================

    .. python::

      def make_document(entry):
          '''Make a new PyLucene Document instance from an Entry instance.'''
          document = Document()
          # An 'id' string field is required.
          document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED))
          document.add(Field('posted_on', entry.rendered_posted_on, STORE,
            TOKENIZED))
          document.add(Field('title', entry.title, STORE, TOKENIZED))
          document.add(Field('text', strip_tags(entry.etree), COMPRESS,
            TOKENIZED))
          categories = ' '.join([unicode(category) for category in
            entry.categories])
          document.add(Field('category', categories, STORE, TOKENIZED))
          return document

    Example `results_formatter` function:
    =====================================

    .. python::

      def results_formatter(results):
          '''Return the results as SQLObject instances.

          Returns either an empty list or a SelectResults object.

          '''
          if results:
              return Entry.select_with_identity(IN(Entry.q.id, [int(id) for id
                in results]))

    :Parameters:
      make_document : callable
        `make_document` is a callable that will return a PyLucene `Document`
        object based on the object passed in to `add`, `update` or `remove`.
        The `Document` object must have at least a field called ``id`` that is
        a string.  This function operates inside a PyLucene ``PythonThread``.
      results_formatter : callable
        `results_formatter`, if provided, is a callable that will return
        a formatted version of the search results that are passed to it by
        `_Searcher.__call__`.  Generally the `results_formatter` will take the
        list of ``id`` strings that is passed to it and return a list of
        application-specific objects (like SQLObject_ instances, for example.)
        This function operates outside of any PyLucene ``PythonThread`` objects
        (like in the CherryPy thread, for example).  (Optional)

    :See:
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - `_load_language_data` for details about the language configuration
        file.

    .. _SQLObject: http://sqlobject.org/

    """
    _load_language_data()
    # Use of global here is deliberate.  W0603 does not apply.
    global _indexer, _searcher_factory #pylint: disable-msg=W0603
    _indexer = _Indexer(make_document)
    _searcher_factory = _SearcherFactory(results_formatter)
    # Using atexit instead of call_on_shutdown so that tg-admin shell will
    # also shutdown properly.
    register(_stop)
    _schedule_optimization()
    _log.info(u'Search engine started.')


def add(object_, language=None):
    """Tell the search engine to add the given object to the index.

    This function returns immediately.  It does not wait for the indexer to be
    finished.

    :Parameters:
      `object_`
        This can be any object that ``make_document`` knows how to handle.
      language : `str`
        This is the ISO language code of the language of the object.  If
        `language` is given, then it must be one that was previously
        configured in ``turbolucene.languages``.  If `language` is not given,
        then the language configured in ``turbolucene.default_language`` will
        be used.  (Optional)

    :See:
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - `start` for details about ``make_document``.

    """
    _indexer('add', object_, language)


def update(object_, language=None):
    """Tell the search engine to update the index for the given object.

    This function returns immediately.  It does not wait for the indexer to be
    finished.

    :Parameters:
      `object_`
        This can be any object that ``make_document`` knows how to handle.
      language : `str`
        This is the ISO language code of the language of the object.  If
        `language` is given, then it must be one that was previously
        configured in ``turbolucene.languages``.  If `language` is not given,
        then the language configured in ``turbolucene.default_language`` will
        be used.  (Optional)

    :See:
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - `start` for details about ``make_document``.

    """
    _indexer('update', object_, language)


def remove(object_, language=None):
    """Tell the search engine to remove the given object from the index.

    This function returns immediately.  It does not wait for the indexer to be
    finished.

    :Parameters:
      `object_`
        This can be any object that ``make_document`` knows how to handle.
      language : `str`
        This is the ISO language code of the language of the object.  If
        `language` is given, then it must be one that was previously
        configured in ``turbolucene.languages``.  If `language` is not given,
        then the language configured in ``turbolucene.default_language`` will
        be used.  (Optional)

    :See:
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - `start` for details about ``make_document``.

    """
    _indexer('remove', object_, language)


def search(query, language=None):
    """Return results from the search engine that match the query.

    If a ``results_formatter`` function was passed to `start` then the results
    will be passed through the formatter before returning.  If not, the
    returned value is a list of strings that are the ``id`` fields of matching
    objects.

    :Parameters:
      query : `str` or `unicode`
        This is the search query to give to PyLucene.  All of Lucene's query
        syntax (field identifiers, wild cards, etc.) is available.
      language : `str`
        This is the ISO language code of the language to search in.  If
        `language` is given, then it must be one that was previously
        configured in ``turbolucene.languages``.  If `language` is not given,
        then the language configured in ``turbolucene.default_language`` will
        be used.  (Optional)

    :Returns: The results of the search.
    :rtype: iterable

    :See:
      - `start` for details about ``results_formatter``.
      - `turbolucene` (module docstring) for details about configuration
        settings.
      - http://lucene.apache.org/java/docs/queryparsersyntax.html for details
        about Lucene's query syntax.

    """
    return _searcher_factory()(query, language)
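

#---  Example: exposing search from a TurboGears controller.  This is only an
#---  illustrative sketch (not part of the TurboLucene API); the controller,
#---  template and parameter names below are assumptions.
#
#   from turbogears import controllers, expose
#   import turbolucene
#
#   class Root(controllers.RootController):
#
#       @expose(template='myapp.templates.search_results')
#       def search(self, query=u''):
#           # search() hands the query to a _Searcher PythonThread and blocks
#           # in this CherryPy thread until the results come back.
#           return dict(results=turbolucene.search(query))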


#---Classes----------------------------------------------------------------------

class _Indexer(PythonThread):

    """Responsible for updating and maintaining the search engine index.

    A single `_Indexer` thread is created to handle all index modifications.

    Once the thread is started, messages are sent to it by calling the instance
    with a task and an object, where the task is one of the following strings:

      - ``add``: Adds the object to the index.
      - ``remove``: Removes the object from the index.
      - ``update``: Updates the index of an object.

    and the object is any object that ``make_document`` knows how to handle.

    To properly shut down the thread, send the ``stop`` task with `None` as the
    object.  (This is normally handled by the `turbolucene._stop` function.)

    To optimize the index, which can take a while, pass the ``optimize``
    task with `None` for the object.  (This is normally handled by the
    TurboGears scheduler as set up by `_schedule_optimization`.)

    :See: `turbolucene.start` for details about ``make_document``.

    :group Public API: __init__, __call__
    :group Threaded methods: run, _add, _remove, _update, _optimize, _stop

    """

    #---Public API

    def __init__(self, make_document):
        """Initialize the message queue and the PyLucene indexes.

        One PyLucene index is created/opened for each of the configured
        supported languages.

        This method uses the ``turbolucene.default_language``,
        ``turbolucene.languages`` and ``turbolucene.force_lock_release``
        configuration settings.

        :Parameters:
          make_document : callable
            A callable that takes the object to index as a parameter and
            returns an appropriate `Document` object.

        :Note: Instantiating this class starts the thread automatically.

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `turbolucene.start` for details about ``make_document``.
          - `_get_index_path` for details about the directory location of each
            index.
          - `_analyzer_factory` for details about the analyzer used for each
            index.

        """
        PythonThread.__init__(self) # PythonThread is an old-style class
        self._make_document = make_document
        self._task_queue = Queue()
        self._indexes = {}
        default_language = config.get('turbolucene.default_language',
          _DEFAULT_LANGUAGE)
        default_force_lock_release = config.get('server.environment',
          'development').lower() == 'development' and True or False
        force_lock_release = config.get('turbolucene.force_lock_release',
          default_force_lock_release)
        # Create indexes
        languages = config.get('turbolucene.languages', [default_language])
        for language in languages:
            index_path = _get_index_path(language)
            analyzer = _analyzer_factory(language)
            create_path = not exists(index_path) and True or False
            try:
                self._indexes[language] = IndexModifier(index_path, analyzer,
                  create_path)
            except JavaError, error:
                if not error.getJavaException().getClass().getName(
                  ) == 'java.io.IOException' or not force_lock_release:
                    raise
                _log.warn('Error opening index "%s".  '
                  'turbolucene.force_lock_release is True, trying to force '
                  'lock release.' % index_path)
                FSDirectory.getDirectory(index_path, False).makeLock(
                  'write.lock').release()
                self._indexes[language] = IndexModifier(index_path, analyzer,
                  create_path)
        self.start()

    def __call__(self, task, object_=None, language=None):
        """Pass `task`, `object_` and `language` to the thread for processing.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        If `task` is ``stop``, then the `_Indexer` thread is shut down and this
        method will wait until the shutdown is complete.

        :Parameters:
          task : `str`
            The task to perform.
          `object_`
            Any object that ``make_document`` knows how to handle.  (Default:
            `None`)
          language : `str`
            The ISO language code of the language of the object.  This
            specifies which PyLucene index to use.

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `turbolucene.start` for details about ``make_document``.

        """
        if not language:
            language = config.get('turbolucene.default_language',
              _DEFAULT_LANGUAGE)
        self._task_queue.put((task, object_, language))
        if task == 'stop':
            self.join()

    #---Threaded methods

    def run(self):
        """Main thread loop to do dispatching based on messages in the queue.

        This method expects that the queue will contain 3-tuples in the form
        of (task, object, language), where task is one of ``add``, ``update``,
        ``remove``, ``optimize`` or ``stop``, object is any object that
        ``make_document`` can handle or `None` in the case of ``optimize`` and
        ``stop``, and language is the ISO language code of the indexer.

        If the task is ``stop``, then the thread shuts down.

        :Note: This method is run in the thread.

        :See:
          - `_add`, `_update`, `_remove`, `_optimize` and `_stop` for details
            about each respective task.
          - `turbolucene.start` for details about ``make_document``.

        """
        while True:
            task, object_, language = self._task_queue.get()
            method = getattr(self, '_' + task)
            if task in ('optimize', 'stop'):
                method()
            else:
                method(object_, language)
            if task == 'stop':
                break
            self._indexes[language].flush() # This is essential.

    def _add(self, object_, language, document=None):
        """Add a new object to the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then indexes the resulting `Document` object.
        Otherwise it just indexes the `document` object.

        :Parameters:
          `object_`
            The object to be indexed.  It will be passed to ``make_document``
            (unless `document` is provided).
          language : `str`
            The ISO language code of the indexer to use.
          document : `Document`
            A pre-built `Document` object for the given object, if it exists.
            This is used internally by `_update`.  (Default: `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Adding object "%s" (id %s) to the %s index.' % (unicode(
          object_), document['id'], language))
        self._indexes[language].addDocument(document)

    def _remove(self, object_, language, document=None):
        """Remove an object from the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then removes the resulting `Document` object
        from the index.  Otherwise it just removes the `document` object.

        :Parameters:
          `object_`
            The object to be removed from the index.  It will be passed to
            ``make_document`` (unless `document` is provided).
          language : `str`
            The ISO language code of the indexer to use.
          document : `Document`
            A pre-built `Document` object for the given object, if it exists.
            This is used internally by `_update`.  (Default: `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Removing object "%s" (id %s) from %s index.' % (unicode(
          object_), document['id'], language))
        self._indexes[language].deleteDocuments(Term('id', document['id']))

    def _update(self, object_, language):
        """Update an object in the index by replacing it.

        This method updates the index by removing and then re-adding the
        object.

        :Parameters:
          `object_`
            The object to update in the index.  It will be passed to
            ``make_document`` and the resulting `Document` object will be
            updated.
          language : `str`
            The ISO language code of the indexer to use.

        :Note: This method is run in the thread.

        :See:
          - `_remove` and `_add` for details about the removal and
            re-addition.
          - `turbolucene.start` for details about ``make_document``.

        """
        document = self._make_document(object_)
        self._remove(object_, language, document)
        self._add(object_, language, document)

    def _optimize(self):
        """Optimize all of the indexes.  This can take a while.

        :Note: This method is run in the thread.

        """
        _log.info(u'Optimizing indexes.')
        for index in self._indexes.values():
            index.optimize()
        _log.info(u'Indexes optimized.')

    def _stop(self):
        """Shut down all of the indexes.

        :Note: This method is run in the thread.

        """
        for index in self._indexes.values():
            index.close()


class _Searcher(PythonThread):

    """Responsible for searching an index and returning results.

    `_Searcher` threads are created for each search that is requested.  After
    the search is completed, the thread dies.

    To search, a `_Searcher` class is instantiated and then called with the
    query and the ISO language code for the index to search.  It returns the
    results as a list of object id strings unless ``results_formatter`` was
    provided.  If it was, then the list of id strings is passed to
    ``results_formatter`` to process and its results are returned.

    The thread is garbage collected when it goes out of scope.

    The catch to all this is that a CherryPy thread cannot directly instantiate
    a `_Searcher` thread because of PyLucene restrictions.  So to get around
    that, see the `_SearcherFactory` class.

    :See: `turbolucene.start` for details about ``results_formatter``.

    :group Public API: __init__, __call__
    :group Threaded methods: run

    """

    #---Public API

    def __init__(self, results_formatter):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self) # PythonThread is an old-style class
        self._results_formatter = results_formatter
        self._query_queue = Queue()
        self._results_queue = Queue()
        self.start()

    def __call__(self, query, language=None):
        """Send `query` and `language` to the thread, wait and return results.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        :Parameters:
          query : `str` or `unicode`
            The search query to give to PyLucene.  All of Lucene's query
            syntax (field identifiers, wild cards, etc.) is available.
          language : `str`
            The ISO language code of the indexer to use.

        :Returns: An iterable of id field strings that match the query or the
          results produced by ``results_formatter`` if it was provided.
        :rtype: iterable

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `turbolucene.start` for details about ``results_formatter``.
          - http://lucene.apache.org/java/docs/queryparsersyntax.html for
            details about Lucene's query syntax.

        """
        if not language:
            language = config.get('turbolucene.default_language',
              _DEFAULT_LANGUAGE)
        self._query_queue.put((query, language))
        results = self._results_queue.get()
        # The join is causing a segfault and I don't know why.  In theory the
        # join should not be necessary, but I thought it good practice to
        # include it.  Apparently I am wrong.
        ## self.join()
        if self._results_formatter:
            return self._results_formatter(results)
        return results

    #---Threaded methods

    def run(self):
        """Search the language index for the query and send back the results.

        The results are an iterable of id field strings that match the query.

        This method uses the ``turbolucene.search_fields`` configuration
        setting for the default fields to search if none are specified in the
        query itself, and ``turbolucene.default_operator`` for the default
        operator to use when joining terms.

        :Exceptions:
          - `AttributeError`: Raised when the configured default operator is
            not valid.

        :Note: This method is run in the thread.

        :Note: The thread dies after one search.

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `_get_index_path` for details about the directory location of the
            index.
          - `_analyzer_factory` for details about the analyzer used for the
            index.
          - http://lucene.apache.org/java/docs/queryparsersyntax.html for
            details about Lucene's query syntax.

        """
        query, language = self._query_queue.get()
        searcher = IndexSearcher(_get_index_path(language))
        search_fields = config.get('turbolucene.search_fields', ['id'])
        parser = MultiFieldQueryParser(search_fields, _analyzer_factory(
          language))
        default_operator = getattr(parser.Operator, config.get(
          'turbolucene.default_operator', 'AND').upper())
        parser.setDefaultOperator(default_operator)
        try:
            hits = searcher.search(parser.parse(query))
            results = [document['id'] for _, document in hits]
        except JavaError:
            results = []
        self._results_queue.put(results)
        searcher.close()


class _SearcherFactory(PythonThread):

    """Produces running `_Searcher` threads.

    ``PythonThread`` threads can only be started by the main program or other
    ``PythonThread`` threads, so this ``PythonThread``-based class creates and
    starts single-use `_Searcher` threads.  This thread is created and started
    by the main program during TurboGears initialization as a singleton.

    To get a `_Searcher` thread, call the `_SearcherFactory` instance.  Then
    pass the query to the `_Searcher` thread that was returned.

    :group Public API: __init__, __call__, stop
    :group Threaded methods: run

    """

    #---Public API

    def __init__(self, *searcher_args, **searcher_kwargs):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self) # PythonThread is an old-style class
        self._searcher_args = searcher_args
        self._searcher_kwargs = searcher_kwargs
        self._request_queue = Queue()
        self._searcher_queue = Queue()
        self.start()

    def __call__(self):
        """Send a request for a running `_Searcher` instance, then return it.

        :Returns: A running instance of the `_Searcher` class.
        :rtype: `_Searcher`

        """
        self._request_queue.put('request')
        return self._searcher_queue.get()

    def stop(self):
        """Stop the `_SearcherFactory` thread."""
        self._request_queue.put('stop')
        self.join()

    #---Threaded methods

    def run(self):
        """Listen for requests and create `_Searcher` instances.

        If the request message is ``stop``, then the thread will be shut down.

        :Note: This method is run in the thread.

        """
        while True:
            request = self._request_queue.get()
            if request == 'stop':
                break
            # * and ** are used here for simplicity and transparency.
            self._searcher_queue.put(_Searcher( # pylint: disable-msg=W0142
              *self._searcher_args, **self._searcher_kwargs))