Package turbolucene

Provides search functionality for TurboGears using PyLucene.

This module uses PyLucene to do all the heavy lifting, but as a result it has to do some fancy things with threads.

PyLucene requires that every thread that uses it inherit from PythonThread. This means either patching CherryPy and/or TurboGears, or having the CherryPy thread hand off the request to a PythonThread and, in the case of searching, wait for the result. The second method was chosen so that a patched CherryPy or TurboGears does not have to be maintained.

The other advantage of the chosen method is that indexing happens in a separate thread, so the web request can return more quickly by not waiting for the results.
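
Example of the hand-off pattern (a rough sketch only; the queue, the _Worker class and the do_search helper are invented for illustration and are not part of TurboLucene's actual internals):

from Queue import Queue

from PyLucene import PythonThread

_requests = Queue()

class _Worker(PythonThread):
    '''PythonThread that does the PyLucene work on behalf of CherryPy threads.'''

    def run(self):
        while True:
            query, results = _requests.get()
            if query is None:  # Shutdown sentinel.
                break
            # do_search stands in for the real PyLucene searching code.
            results.put(do_search(query))

def search(query):
    '''Called from a CherryPy thread: hand off the query and wait for the answer.'''
    results = Queue()
    _requests.put((query, results))
    return results.get()  # Blocks until the PythonThread replies.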

The main disadvantage of using PyLucene with CherryPy, however, is that autoreload does not work. You must disable it by adding autoreload.on = False to your dev.cfg.

Configuration options

TurboLucene uses the following configuration options:

turbolucene.search_fields:
The list of fields that should be searched by default when a specific field is not specified. (e.g. ['id', 'title', 'text', 'categories']) (Default: ['id'])
turbolucene.default_language:
The default language to use if a language is not given when calling add/update/search/etc. (Default: 'en')
turbolucene.languages:
The list of ISO language codes that you want to support in your application. The languages must be supported by PyLucene and must be configured in the languages configuration file. The languages available out of the box are: Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Finnish (fi), French (fr), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Norwegian (no), Portuguese (pt), Brazilian (pt-br), Russian (ru), Swedish (sv), and Chinese (zh). (Default: [<default_language>])
turbolucene.default_operator:
The default search operator to use between search terms when none is specified. This must be a valid operator object from the PyLucene.MultiFieldQueryParser.Operator namespace. (Default: 'AND')
turbolucene.optimize_days:
The list of days to schedule index optimization. Index optimization cleans up and compacts the indexes so that searches happen faster. This is a list of day numbers (Sunday = 1). Optimization of all indexes will occur on those days. (Default: [1, 2, 3, 4, 5, 6, 7], i.e. every day)
turbolucene.optimize_time:
A tuple containing the hour (24 hour format) and minute of the time to run the scheduled index optimizations. (Default: (00, 00), i.e. midnight)
turbolucene.index_root:
The base path in which to store the indexes. There is one index per supported language. Each index is a directory. Those directories will be sub-directories of this base path. If the path is relative, it is relative to your project's root. Normally you should not need to override this unless you specifically need the indexes to be located somewhere else. (Default: u'index')
turbolucene.languages_file:
The path to the languages configuration file. The languages configuration file provides the configuration information for all the languages that TurboLucene supports. Normally you should not need to override this. (Default: the u'languages.cfg' file in the turbolucene package)
turbolucene.languages_file_encoding:
The encoding of the languages file. (Default: 'utf-8')
turbolucene.stopwords_root:
The languages file can specify files that contain stopwords. If a stopwords file path is relative, this path will be prepended to it. This allows all stopword files to be customized without needing to specify full paths for every one. Normally you should not need to override this. (Default: the stopwords directory in the turbolucene package)
turbolucene.force_lock_release:
If this is set to True and TurboLucene has trouble opening an index, it will try to force the release of any write lock that may exist and then try again. The write lock prevents multiple processes from writing to the same index at the same time, but if the TurboLucene-based project is killed, the lock gets left behind. This setting lets you override the default behaviour. (Default: True in development and False in production)

All options are optional, but at a minimum you will likely want to specify turbolucene.search_fields.
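
Example configuration (in dev.cfg or app.cfg; the values below are purely illustrative):

[global]
# TurboLucene settings
turbolucene.search_fields = ['title', 'text', 'categories']
turbolucene.default_language = 'en'
turbolucene.languages = ['en', 'fr', 'de']
turbolucene.optimize_days = [1, 4]
turbolucene.optimize_time = (3, 30)

# PyLucene does not work with autoreload, so turn it off.
autoreload.on = False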




See Also: _load_language_data for details about the languages configuration file.

Warning: Do not forget to turn off autoreload in dev.cfg.

Requires: TurboGears and PyLucene

Version: 0.2.1

Author: Krys Wilken

Contact: krys AT krys DOT ca

Copyright: (c) 2007 Krys Wilken

License: MIT

API Version: 2.0

Revision: $Id: __init__.py 60 2007-05-02 03:29:02Z krys $

Classes

  _Indexer
      Responsible for updating and maintaining the search engine index.
  _Searcher
      Responsible for searching an index and returning results.
  _SearcherFactory
      Produces running _Searcher threads.

  Objects to use in make_document

  Document
      j_document objects
  Field
      j_field objects
Functions

  _load_language_data()
      Load all the language data from the configured languages file.
  _schedule_optimization()
      Schedule index optimization using the TurboGears scheduler.
  _get_index_path(language)
      Return the path to the index for the given language. (Returns: unicode)
  _read_stopwords(file_path, encoding)
      Read the stopwords from the given stopwords file path. (Returns: list of unicode strings)
  _analyzer_factory(language)
      Produce an analyzer object appropriate for the given language. (Returns: PyLucene.Analyzer sub-class)
  _stop()
      Shut down the search engine threads.
  _optimize()
      Tell the search engine to optimize its index.

  Public API

  start(make_document, results_formatter=None)
      Initialize and start the search engine threads.
  add(object_, language=None)
      Tell the search engine to add the given object to the index.
  update(object_, language=None)
      Tell the search engine to update the index for the given object.
  remove(object_, language=None)
      Tell the search engine to remove the given object from the index.
  search(query, language=None)
      Return results from the search engine that match the query. (Returns: iterable)
Variables

  _DEFAULT_LANGUAGE = 'en'
      Default language to use if none is specified in the config.
  _log = getLogger('turbolucene')
      Logger for this module.
  _language_data = None
      This will hold the language support data read from file.
  _indexer = None
      This will hold the _Indexer singleton.
  _searcher_factory = None
      This will hold the _SearcherFactory singleton.

  Objects to use in make_document

  STORE = <Field_Store: YES>
      Tells Field to store the field data without compressing it.
  COMPRESS = <Field_Store: COMPRESS>
      Tells Field to compress the field data.
  TOKENIZED = <Field_Index: TOKENIZED>
      Tells Field to tokenize and stem the field data.
  UN_TOKENIZED = <Field_Index: UN_TOKENIZED>
      Tells Field not to tokenize or stem the field data.
Function Details

_load_language_data()

Load all the language data from the configured languages file.

The languages configuration file can be set with the turbolucene.languages_file configuration option, and its encoding is set with turbolucene.languages_file_encoding.

Configuration file format

The languages file is an INI-type (ConfigObj) file. Each section is defined by an ISO language code (en, de, el, pt-br, etc.). In each section the following keys are possible:

analyzer_class:
The PyLucene analyzer class to use for this language. (e.g. SnowballAnalyzer) (Required)
analyzer_class_args:
Any arguments that should be passed to the analyzer class. (e.g. Danish) (Optional)
stopwords:
A list of stopwords (words that do not get indexed) to pass to the analyzer class. This is not normally used as stopwords_file is generally preferred. (Optional)
stopwords_file:
The path to the file that contains the list of stopwords to pass to the analyzer class. (e.g. stopwords_da.txt) (Optional)
stopwords_file_encoding:
The encoding of the stopwords file. (e.g. windows-1252)

If neither stopwords nor stopwords_file is defined for a language, then any stopwords that are used are determined automatically by the analyzer class' constructor.

Example

# German
[de]
analyzer_class = SnowballAnalyzer
analyzer_class_args = German2
stopwords_file = stopwords_de.txt
stopwords_file_encoding = windows-1252
Raises:
  • IOError - Raised if the languages configuration file could not be opened.
  • configobj.ParseError - Raised if the languages configuration file contains errors.


_schedule_optimization()

Schedule index optimization using the TurboGears scheduler.

This function reads its configuration data from turbolucene.optimize_days and turbolucene.optimize_time.

Raises:
  • TypeError - Raised if turbolucene.optimize_time is invalid.

See Also: turbolucene (module docstring) for details about configuration settings.
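
Example of how this scheduling might be wired up (a sketch only; it assumes the Kronos-style turbogears.scheduler.add_weekday_task API and may differ from the actual implementation):

from turbogears import config, scheduler

def _schedule_optimization():
    '''Schedule _optimize to run on the configured days at the configured time.'''
    days = config.get('turbolucene.optimize_days', range(1, 8))
    time_of_day = tuple(config.get('turbolucene.optimize_time', (0, 0)))
    if len(time_of_day) != 2:
        raise TypeError('turbolucene.optimize_time must be an (hour, minute) tuple')
    # Run _optimize in the scheduler's thread on each configured weekday.
    scheduler.add_weekday_task(_optimize, days, time_of_day)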

_get_index_path(language)

Return the path to the index for the given language.

This function gets its configuration data from turbolucene.index_root.

Parameters:
  • language (str) - An ISO language code. (e.g. en, pt-br, etc.)
Returns: unicode
The path to the index for the given language.

See Also: turbolucene (module docstring) for details about configuration settings.
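
Example of what this amounts to (an illustrative sketch, not the actual source; resolution of relative paths against the project root is omitted):

import os

from turbogears import config

def _get_index_path(language):
    '''Return the per-language index path, e.g. u'index/en'.'''
    index_root = config.get('turbolucene.index_root', u'index')
    return os.path.join(index_root, language)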

_read_stopwords(file_path, encoding)

Read the stopwords from the given stopwords file path.

Stopwords are words that should not be indexed because they are too common or have no significant meaning (e.g. the, in, with, etc.). They are language-dependent.

This function gets its configuration data from turbolucene.stopwords_root.

If file_path is not an absolute path, then it will be appended to the path configured in turbolucene.stopwords_root.

Stopwords files are text files (in the given encoding), with one stopword per line. Comments are marked by a | character. This is for compatibility with the stopwords files found at http://snowball.tartarus.org/.
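
Example of reading such a file (an illustrative sketch only; the real implementation may differ):

import codecs
import os

from turbogears import config

def _read_stopwords(file_path, encoding):
    '''Return the stopwords from file_path, one per line, ignoring | comments.'''
    if not os.path.isabs(file_path):
        file_path = os.path.join(config.get('turbolucene.stopwords_root'),
                                 file_path)
    stopwords = []
    for line in codecs.open(file_path, 'r', encoding):
        word = line.split(u'|', 1)[0].strip()  # e.g. u'the | article' -> u'the'
        if word:
            stopwords.append(word)
    return stopwords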

Parameters:
  • file_path (unicode) - The path to the stopwords file to read.
  • encoding (str) - The encoding of the stopwords file.
Returns: list of unicode strings
The list of stopwords from the file.
Raises:
  • IOError - Raised if the stopwords file could not be opened.

See Also: turbolucene (module docstring) for details about configuration settings.

_analyzer_factory(language)

Produce an analyzer object appropriate for the given language.

This function uses the data that was read in from the languages configuration file to determine and instantiate the analyzer object.

Parameters:
  • language (str or unicode) - An ISO language code that is configured in the languages configuration file.
Returns: PyLucene.Analyzer sub-class
An instance of the configured analyzer class for the given language.
Raises:
  • KeyError - Raised if the given language is not configured or if the configuration for that language does not have an analyzer_class key.
  • PyLucene.InvalidArgsError - Raised if any of the parameters passed to the analyzer class are invalid.

See Also: _load_language_data for details about the language configuration file.
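
Example of what such a factory might look like (a hypothetical sketch based on the languages file format described in _load_language_data; the real implementation may differ):

import PyLucene

def _analyzer_factory(language):
    '''Instantiate the configured analyzer class for the given language.'''
    settings = _language_data[language]  # Raises KeyError if not configured.
    analyzer_class = getattr(PyLucene, settings['analyzer_class'])
    args = list(settings.get('analyzer_class_args', []))
    stopwords = settings.get('stopwords')
    if stopwords is not None:
        args.append(stopwords)  # Stopwords are passed as the last argument here.
    return analyzer_class(*args)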

start(make_document, results_formatter=None)

Initialize and start the search engine threads.

This function loads the language configuration information, starts the search engine threads, makes sure the search engine will be shut down when TurboGears shuts down, and starts the optimization scheduler to run at the configured times.

The make_document and results_formatter parameters are callables. Here are examples of how they should be defined:

Example make_document function:

def make_document(entry):
    '''Make a new PyLucene Document instance from an Entry instance.'''
    document = Document()
    # An 'id' string field is required.
    document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED))
    document.add(Field('posted_on', entry.rendered_posted_on, STORE,
                       TOKENIZED))
    document.add(Field('title', entry.title, STORE, TOKENIZED))
    document.add(Field('text', strip_tags(entry.etree), COMPRESS,
                       TOKENIZED))
    categories = ' '.join([unicode(category) for category in
                           entry.categories])
    document.add(Field('category', categories, STORE, TOKENIZED))
    return document

Example results_formatter function:

def results_formatter(results):
    '''Return the results as SQLObject instances.

    Returns either an empty list or a SelectResults object.

    '''
    if results:
        return Entry.select_with_identity(IN(Entry.q.id,
                                              [int(id) for id in results]))
    return []
Parameters:
  • make_document (callable) - make_document is a callable that will return a PyLucene Document object based on the object passed in to add, update or remove. The Document object must have at least a field called id that is a string. This function operates inside a PyLucene PythonThread.
  • results_formatter (callable) - results_formatter, if provided, is a callable that will return a formatted version of the search results that are passed to it by _Searcher.__call__. Generally the results_formatter will take the list of id strings that is passed to it and return a list of application-specific objects (like SQLObject instances, for example). This function operates outside of any PyLucene PythonThread objects (in the CherryPy thread, for example). (Optional)


add(object_, language=None)

Tell the search engine to add the given object to the index.

This function returns immediately. It does not wait for the indexer to be finished.

Parameters:
  • object_ - This can be any object that make_document knows how to handle.
  • language (str) - This is the ISO language code of the language of the object. If language is given, then it must be one that was previously configured in turbolucene.languages. If language is not given, then the language configured in turbolucene.default_language will be used. (Optional)

See Also:

  • turbolucene (module docstring) for details about configuration settings.
  • start for details about make_document.

update(object_, language=None)

Tell the search engine to update the index for the given object.

This function returns immediately. It does not wait for the indexer to be finished.

Parameters:
  • object_ - This can be any object that make_document knows how to handle.
  • language (str) - This is the ISO language code of the language of the object. If language is given, then it must be one that was previously configured in turbolucene.languages. If language is not given, then the language configured in turbolucene.default_language will be used. (Optional)

See Also:

  • turbolucene (module docstring) for details about configuration settings.
  • start for details about make_document.

remove(object_, language=None)

Tell the search engine to remove the given object from the index.

This function returns immediately. It does not wait for the indexer to be finished.

Parameters:
  • object_ - This can be any object that make_document knows how to handle.
  • language (str) - This is the ISO language code of the language of the object. If language is given, then it must be one that was previously configured in turbolucene.languages. If language is not given, then the language configured in turbolucene.default_language will be used. (Optional)

See Also:

  • turbolucene (module docstring) for details about configuration settings.
  • start for details about make_document.

search(query, language=None)

Return results from the search engine that match the query.

If a results_formatter function was passed to start, then the results will be passed through the formatter before being returned. If not, the returned value is a list of strings that are the id fields of matching objects.

Parameters:
  • query (str or unicode) - This is the search query to give to PyLucene. All of Lucene's query syntax (field identifiers, wildcards, etc.) is available.
  • language (str) - This is the ISO language code of the language to search in. If language is given, then it must be one that was previously configured in turbolucene.languages. If language is not given, then the language configured in turbolucene.default_language will be used. (Optional)
Returns: iterable
The results of the search.

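Example usage of the public API in a TurboGears 1.x controller (a hypothetical sketch; Entry, make_document and results_formatter are the application-specific pieces from the start example above, and the template name is made up):

import turbolucene
from turbogears import controllers, expose

# Start the search engine threads once, at start-up, before any requests
# arrive.  make_document and results_formatter are defined by the application.
turbolucene.start(make_document, results_formatter)

class Root(controllers.RootController):

    @expose('myproject.templates.results')
    def search(self, query, language=None):
        # Blocks until the _Searcher PythonThread returns the formatted results.
        return dict(results=turbolucene.search(query, language))

    @expose()
    def save(self, title, text):
        entry = Entry(title=title, text=text)  # Persist with SQLObject.
        turbolucene.add(entry)  # Returns immediately; indexing is asynchronous.
        return 'Saved.'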