SimIndexCollection
Sample usage:
from pprint import pprint
from pysimsearch.sim_index import MemorySimIndex, SimIndexCollection
indexes = (MemorySimIndex(), MemorySimIndex())
index_coll = SimIndexCollection()
index_coll.add_shards(*indexes)
index_coll.set_query_scorer('tfidf')
index_coll.index_urls('http://www.stanford.edu/',
                      'http://www.berkeley.edu',
                      'http://www.ucla.edu',
                      'http://www.mit.edu')
pprint(index_coll.query_by_string('stanford university'))
Inherits from pysimsearch.sim_index.SimIndex.
Provides a SimIndex view over a sharded collection of SimIndexes.
Useful with collections of remote SimIndexes to provide a distributed indexing and serving architecture.
Assumes document-level sharding:
- query() requests are routed to all shards in the collection.
- index_files() requests are routed according to a sharding function.
With query-sharding, the roles would be reversed: queries would be routed using a sharding function, and index requests would be routed to all shards. The two sharding approaches correspond to partitioning the postings matrix by columns (doc-sharding) or by rows (query-sharding).
The sharding function is only used for index_*() operations. A read-only collection does not need a sharding function.
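The routing rules above can be sketched with a toy collection. This is a minimal illustration, not the pysimsearch implementation: `ToyShard`, `ToyCollection`, and the CRC32-based `shard_for()` are hypothetical names and choices made only for this example.

```python
import zlib
from collections import defaultdict


class ToyShard:
    """Hypothetical in-memory shard: maps each term to the doc names containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, name, text):
        for term in text.lower().split():
            self.postings[term].add(name)

    def query(self, term):
        return set(self.postings.get(term.lower(), ()))


class ToyCollection:
    """Document-sharded view: each document lives on exactly one shard."""

    def __init__(self, shards):
        self.shards = list(shards)

    def shard_for(self, name):
        # Assumed sharding function: stable CRC32 hash of the name, modulo shard count.
        return self.shards[zlib.crc32(name.encode("utf-8")) % len(self.shards)]

    def index(self, name, text):
        # Index requests are routed to a single shard.
        self.shard_for(name).index(name, text)

    def query(self, term):
        # Query requests are routed to every shard and the results are unioned.
        hits = set()
        for shard in self.shards:
            hits |= shard.query(term)
        return hits


coll = ToyCollection([ToyShard(), ToyShard()])
coll.index("doc_a", "stanford university")
coll.index("doc_b", "berkeley university")
coll.index("doc_c", "mit")
print(sorted(coll.query("university")))
```

Only `index()` consults `shard_for()`, which is why a read-only collection can do without a sharding function.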
Implements the default sharding function
Translates global docid to name
Returns a list of docids of docs containing all terms
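A conjunctive lookup of this kind amounts to intersecting the postings of each term. The sketch below assumes postings are available as a plain dict from term to docids; the function name `docids_with_all_terms` is hypothetical and not part of the library.

```python
def docids_with_all_terms(postings, terms):
    """Return sorted docids that appear in the postings of every term.

    postings: dict mapping term -> iterable of docids (assumed layout).
    """
    docid_sets = [set(postings.get(term, ())) for term in terms]
    if not docid_sets:
        return []
    return sorted(set.intersection(*docid_sets))


postings = {"stanford": [1, 4], "university": [1, 2, 4, 7]}
print(docids_with_all_terms(postings, ["stanford", "university"]))
```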
Returns an iterable of docnames containing terms
Returns the local number of documents
Build a similarity index over files given by filenames
Convenience method that wraps index_files()
Translates name to global docid
Returns aggregated postings list in terms of global docids
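One way to express per-shard postings in terms of global docids is to fold the shard number into the docid. The encoding below (`local_docid * num_shards + shard_id`) is an assumption chosen for illustration; the library may map docids differently, and all function names here are hypothetical.

```python
def to_global_docid(shard_id, local_docid, num_shards):
    # Assumed encoding: interleave shards so every (shard, local) pair is unique.
    return local_docid * num_shards + shard_id


def to_local_docid(global_docid, num_shards):
    # Inverse of to_global_docid: returns (shard_id, local_docid).
    return global_docid % num_shards, global_docid // num_shards


def aggregate_postings(shard_postings):
    """Merge per-shard postings lists of (local_docid, freq) into global docids.

    shard_postings: list indexed by shard_id.
    """
    num_shards = len(shard_postings)
    merged = []
    for shard_id, postings in enumerate(shard_postings):
        for local_docid, freq in postings:
            merged.append((to_global_docid(shard_id, local_docid, num_shards), freq))
    return sorted(merged)


shard_postings = [
    [(0, 2), (1, 1)],  # postings for one term on shard 0
    [(0, 5)],          # postings for the same term on shard 1
]
global_postings = aggregate_postings(shard_postings)
print(global_postings)
```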
Issues query to collection and returns merged results
TODO: use a merge algorithm (heapq.merge has no key= argument before Python 3.5).
TODO: add support for rank aggregation in the case of heterogeneous collections, where IR scores are not directly comparable.
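On Python 3.5 and later, `heapq.merge` accepts `key=` and `reverse=`, so the merge mentioned in the first TODO can be sketched as follows. This assumes each shard returns `(name, score)` pairs already sorted by descending score and that scores are comparable across shards, which the second TODO notes is not always the case. `merge_results` is a hypothetical helper.

```python
import heapq
from itertools import islice
from operator import itemgetter


def merge_results(per_shard_results, limit=None):
    """Lazily merge per-shard result lists sorted by descending score."""
    merged = heapq.merge(*per_shard_results, key=itemgetter(1), reverse=True)
    # islice with limit=None yields every merged result.
    return list(islice(merged, limit))


shard_a = [("stanford.edu", 0.9), ("ucla.edu", 0.2)]
shard_b = [("berkeley.edu", 0.5), ("mit.edu", 0.1)]
top3 = merge_results([shard_a, shard_b], limit=3)
print(top3)
```

Because the merge is lazy, taking only the top few results does not require consuming every shard's full result list.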
Finds documents similar to query_string.
Convenience method that calls self.query()
Update config var for shards
Passes set_query_scorer() request to all shards.
Update config for shards
Fetches local stats from all shards, aggregates them, and rebroadcasts global stats back to shards. Currently uses a brute-force approach: incremental updating (in either direction) is not supported.
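The brute-force gather-and-rebroadcast cycle can be sketched as below. The choice of stats (document count and a document-frequency map) and the names `StatsShard`, `get_local_stats`, and `set_global_stats` are assumptions for illustration, not the library's API.

```python
from collections import Counter


class StatsShard:
    """Hypothetical shard holding local stats and a copy of the global stats."""

    def __init__(self, num_docs, df):
        self.local_num_docs = num_docs
        self.local_df = Counter(df)
        self.global_num_docs = None
        self.global_df = None

    def get_local_stats(self):
        return self.local_num_docs, self.local_df

    def set_global_stats(self, num_docs, df):
        self.global_num_docs = num_docs
        self.global_df = dict(df)


def update_global_stats(shards):
    # Gather: sum document counts and document frequencies over all shards.
    total_docs = 0
    total_df = Counter()
    for shard in shards:
        num_docs, df = shard.get_local_stats()
        total_docs += num_docs
        total_df.update(df)
    # Rebroadcast: every shard receives the same global view.
    for shard in shards:
        shard.set_global_stats(total_docs, total_df)
    return total_docs, total_df


stats_shards = [
    StatsShard(2, {"university": 2, "stanford": 1}),
    StatsShard(1, {"university": 1}),
]
total_docs, total_df = update_global_stats(stats_shards)
print(total_docs, dict(total_df))
```

Every call recomputes the totals from scratch, which matches the non-incremental behavior described above.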
Decorator for methods that update the index. Used as a post-update trigger that gathers new term stats and propagates them back down (only if this is the root node).
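A post-update trigger of this kind might look like the following sketch. The decorator name `update_trigger`, the `is_root` attribute, and the `ToyIndex` class are hypothetical and exist only to show the pattern.

```python
import functools


def update_trigger(method):
    """Run the wrapped update, then refresh global stats if this node is the root."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        if self.is_root:  # assumed flag marking the root of the index tree
            self.update_global_stats()
        return result

    return wrapper


class ToyIndex:
    def __init__(self, is_root=True):
        self.is_root = is_root
        self.docs = []
        self.stats_refreshes = 0

    def update_global_stats(self):
        # Stand-in for the gather-and-rebroadcast step.
        self.stats_refreshes += 1

    @update_trigger
    def index_docs(self, *docs):
        self.docs.extend(docs)
        return len(self.docs)


root = ToyIndex(is_root=True)
leaf = ToyIndex(is_root=False)
root_count = root.index_docs("a", "b")
leaf_count = leaf.index_docs("c")
print(root.stats_refreshes, leaf.stats_refreshes)
```

Restricting the refresh to the root avoids each intermediate node recomputing stats that the root will overwrite anyway.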