The SimIndexCollection Class

Sample usage:

from pprint import pprint
from pysimsearch.sim_index import MemorySimIndex, SimIndexCollection

indexes = (MemorySimIndex(), MemorySimIndex())
index_coll = SimIndexCollection()
index_coll.add_shards(*indexes)
index_coll.set_query_scorer('tfidf')
index_coll.index_urls('http://www.stanford.edu/',
                      'http://www.berkeley.edu',
                      'http://www.ucla.edu',
                      'http://www.mit.edu')

pprint(index_coll.query_by_string('stanford university'))

class pysimsearch.sim_index.SimIndexCollection(root=True)

Inherits from pysimsearch.sim_index.SimIndex.

Provides a SimIndex view over a sharded collection of SimIndexes.

Useful with collections of remote SimIndexes to provide a distributed indexing and serving architecture.

Assumes document-level sharding:

  • query() requests are routed to all shards in the collection.
  • index_files() requests are routed to a single shard chosen by a sharding function.

With query-sharding the routing would be reversed: queries would be routed using a sharding function, and index requests would be routed to all shards. The two approaches correspond to partitioning the postings matrix by columns (doc-sharding) or by rows (query-sharding).
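
The routing rule for document-level sharding can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's implementation: ToyShard, ToyDocShardedCollection, and the use of len as a sharding function are all hypothetical names chosen for the example.

```python
from collections import defaultdict


class ToyShard:
    """Hypothetical in-memory shard: maps each term to the doc names containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, name, text):
        for term in text.lower().split():
            self.postings[term].add(name)

    def query(self, term):
        return set(self.postings.get(term.lower(), ()))


class ToyDocShardedCollection:
    """Document-level sharding: each document lives on exactly one shard."""

    def __init__(self, shards, shard_func):
        self.shards = list(shards)
        self.shard_func = shard_func  # assumed: maps a document name to an integer

    def index(self, name, text):
        # Index requests go to the single shard chosen by the sharding function.
        shard = self.shards[self.shard_func(name) % len(self.shards)]
        shard.index(name, text)

    def query(self, term):
        # Query requests fan out to every shard; the per-shard hits are unioned.
        hits = set()
        for shard in self.shards:
            hits |= shard.query(term)
        return hits


coll = ToyDocShardedCollection([ToyShard(), ToyShard()], shard_func=len)
coll.index("a", "stanford university")    # len("a") == 1, so shard 1
coll.index("bb", "berkeley university")   # len("bb") == 2, so shard 0
print(sorted(coll.query("university")))   # ['a', 'bb']
```

Each shard holds a disjoint set of documents (columns of the postings matrix), which is why a query has to visit all of them.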

The shard-function is only used for index_*() operations. If you have a read-only collection, you don’t need a sharding function.

default_shard_func(shard_key)

Implements the default sharding function
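
The source does not say how the default sharding function maps keys to shards. As an illustration only, one common approach is a stable hash of the key taken modulo the number of shards. The function name toy_shard_func and its num_shards parameter are assumptions for this sketch and are not part of the documented signature.

```python
import zlib


def toy_shard_func(shard_key, num_shards):
    """Map a string key to a shard number in range(num_shards).

    zlib.crc32 is used instead of the built-in hash() because hash() of a
    str is randomized per process, which would route the same key to
    different shards across restarts.
    """
    return zlib.crc32(shard_key.encode("utf-8")) % num_shards


shard_no = toy_shard_func("http://www.stanford.edu/", 2)
```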

docid_to_name(docid)

Translates global docid to name

docids_with_terms(terms)

Returns a list of docids of docs containing all terms

docnames_with_terms(*terms)

Returns an iterable of docnames containing terms

get_local_N()

Return local number of documents

index_filenames(*filenames)

Build a similarity index over files given by filenames

Convenience method that wraps index_files()

Params:
*filenames: list of filenames to add to the index.

name_to_docid(name)

Translates name to global docid

postings_list(term)

Returns aggregated postings list in terms of global docids
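
How global docids are derived from per-shard docids is not described here. The sketch below shows one possible scheme, purely as an assumption: interleave local docids so that a global docid encodes both the shard number and the local docid. All function names and the (docid, frequency) postings format are hypothetical.

```python
def to_global_docid(shard_id, local_docid, num_shards):
    # Assumed encoding: shard i owns global docids i, i + num_shards, ...
    return local_docid * num_shards + shard_id


def from_global_docid(global_docid, num_shards):
    local_docid, shard_id = divmod(global_docid, num_shards)
    return shard_id, local_docid


def aggregate_postings(per_shard_postings):
    """per_shard_postings[i] is shard i's list of (local_docid, freq) pairs."""
    num_shards = len(per_shard_postings)
    merged = []
    for shard_id, postings in enumerate(per_shard_postings):
        for local_docid, freq in postings:
            merged.append((to_global_docid(shard_id, local_docid, num_shards), freq))
    return sorted(merged)


postings = aggregate_postings([[(0, 2), (1, 1)], [(0, 5)]])
print(postings)  # [(0, 2), (1, 5), (2, 1)]
```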

query(query_vec)

Issues query to collection and returns merged results

TODO: use a merge alg. (heapq.merge doesn’t have a key= arg yet)
TODO: add support for rank-aggregation in the case of heterogeneous collections where IR scores are not directly comparable

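
On Python 3.5 and later, heapq.merge accepts key= and reverse= arguments, so the merge step mentioned in the first TODO can be sketched as follows. This is an illustration, not the library's code: merge_ranked is a hypothetical helper, and the (name, score) hit format is an assumption. It also assumes scores from different shards are directly comparable, which is exactly the limitation the second TODO describes.

```python
import heapq
from itertools import islice


def merge_ranked(per_shard_results, limit=None):
    """Merge per-shard hit lists, each already sorted by descending score."""
    merged = heapq.merge(*per_shard_results, key=lambda hit: hit[1], reverse=True)
    # islice with limit=None yields every merged hit.
    return list(islice(merged, limit))


shard_a = [("doc1", 0.9), ("doc3", 0.2)]
shard_b = [("doc2", 0.5)]
print(merge_ranked([shard_a, shard_b]))
# [('doc1', 0.9), ('doc2', 0.5), ('doc3', 0.2)]
```

Because heapq.merge is lazy, passing a limit avoids consuming more of each shard's results than needed.
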
query_by_string(query_string)

Finds documents similar to query_string.

Convenience method that calls self.query()

Params:
query_string: the query given as a string

set_config(key, value, passthrough=True)

Update config var for shards

set_query_scorer(query_scorer)

Passes set_query_scorer() request to all shards.

Params:
query_scorer: scorer object or name. If any backends are remote, query_scorer needs to be a scorer name, rather than a scorer object (which we currently don’t serialize for rpcs)

update_config(passthrough=True, **d)

Update config for shards

update_node_stats()

Fetches local stats from all shards, aggregates them, and rebroadcasts global stats back to shards. Currently uses a “brute-force” approach that recomputes the global stats from scratch on every call; incremental updating (in either direction) is not supported.
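
The fetch, aggregate, and rebroadcast cycle can be sketched as below. This is a minimal illustration under stated assumptions: ToyStatsShard, its get_local_stats() and set_global_stats() methods, and the choice of document count plus per-term document frequencies as the stats are all hypothetical, not the library's actual interface.

```python
from collections import Counter


class ToyStatsShard:
    """Hypothetical shard that tracks local and global collection stats."""

    def __init__(self, local_n, local_df):
        self.local_n = local_n             # number of documents on this shard
        self.local_df = Counter(local_df)  # term -> local document frequency
        self.global_n = None
        self.global_df = None

    def get_local_stats(self):
        return self.local_n, self.local_df

    def set_global_stats(self, global_n, global_df):
        self.global_n = global_n
        self.global_df = dict(global_df)


def update_stats(shards):
    """Brute force: recompute the totals from every shard, then push them back."""
    total_n = 0
    total_df = Counter()
    for shard in shards:
        n, df = shard.get_local_stats()
        total_n += n
        total_df.update(df)  # Counter.update adds counts term by term
    for shard in shards:
        shard.set_global_stats(total_n, total_df)
    return total_n, total_df


shards = [
    ToyStatsShard(2, {"university": 2, "stanford": 1}),
    ToyStatsShard(3, {"university": 1}),
]
total_n, total_df = update_stats(shards)
```

Global stats matter for scorers such as tf-idf, where a term's weight depends on its document frequency across the whole collection rather than within one shard.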

update_trigger(method)

Decorator for methods that update the index. Used as a post-update trigger that gathers new term stats and propagates them back down (if we’re the root node).
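
A post-update trigger of this kind can be written as an ordinary method decorator. The sketch below is illustrative only: toy_update_trigger, ToyNode, the is_root attribute, and refresh_stats() are hypothetical stand-ins, and the real decorator may differ in detail.

```python
import functools


def toy_update_trigger(method):
    """Run the wrapped update, then refresh stats if this node is the root."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        if self.is_root:
            self.refresh_stats()
        return result

    return wrapper


class ToyNode:
    def __init__(self, is_root=True):
        self.is_root = is_root
        self.docs = []
        self.refresh_count = 0

    def refresh_stats(self):
        # Stand-in for gathering term stats and pushing them to the shards.
        self.refresh_count += 1

    @toy_update_trigger
    def add_doc(self, doc):
        self.docs.append(doc)


root = ToyNode(is_root=True)
root.add_doc("x")
leaf = ToyNode(is_root=False)
leaf.add_doc("y")
```

Restricting the refresh to the root avoids redundant stat updates when collections are nested, since only the top-level node sees the whole collection.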
