Python library for indexing and similarity-search.
Download from GitHub
This library is primarily for pedagogical purposes, not production use. The code here is meant to illustrate the basic workings of similarity and indexing engines, without worrying much about optimization and efficiency. Although efficiency down to the byte is necessary for any production indexing engine, the additional complexity that introduces often obscures the simple concepts that drive modern information retrieval systems. Certain patterns used for scaling indexes (e.g., distributed indexes) are included, although not optimized nearly to the extent necessary for large-scale production use.
Although the code is currently for Python 2.7 series, we use __future__ imports to match Python 3 as closely as possible.
If you are interested in learning more about search and information retrieval, I highly recommend the following two books:
Quick sample:
>>> from pprint import pprint
>>> from pysimsearch import sim_index, doc_reader
>>> index = sim_index.MemorySimIndex()
>>> index.index_filenames('http://www.stanford.edu/',
'http://www.berkeley.edu/',
'http://www.ucla.edu',
'http://www.mit.edu')
>>> pprint(index.postings_list('university'))
[(0, 3), (1, 1), (2, 1)]
>>> pprint(list(index.docnames_with_terms('university', 'california')))
['http://www.stanford.edu/', 'http://www.ucla.edu']
>>> index.set_query_scorer('tfidf')
>>> pprint(list(index.query_by_string("stanford university")))
[('http://www.stanford.edu/', 0.5827172819606118),
('http://www.ucla.edu', 0.05801461340864149),
('http://www.berkeley.edu/', 0.025725104682131295)]
View a larger Example