Scoring algorithms for finding similar documents
Interface for query scorers which score similarity search results
QueryScorers are used by the SimIndex.query() method to handle the scoring of similarity search results.
Returns a new scorer object
Scores documents’ similarities to query
Scans postings_lists to compute a similarity score for each document against the query term vector
QueryScorer that uses simple term frequencies for scoring.
Scores query-document similarity using the number of occurrences of query terms in the document. Multiple occurrences of a term in the query are ignored.
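As a minimal sketch of the term-frequency scoring described above, not the library's actual code: the function name `score_by_term_frequency` and the layout of `postings_lists` (a dict mapping each term to a list of `(doc_id, d_tf)` pairs) are assumptions made for illustration.

```python
from collections import defaultdict


def score_by_term_frequency(query_terms, postings_lists):
    """Sum document-side occurrence counts for each distinct query term.

    postings_lists is assumed (hypothetically) to map
    term -> list of (doc_id, d_tf) pairs.
    """
    scores = defaultdict(int)
    # set() drops repeated query terms, so each term is counted once.
    for term in set(query_terms):
        for doc_id, d_tf in postings_lists.get(term, []):
            scores[doc_id] += d_tf
    return dict(scores)


# Hypothetical two-term index used only for this example.
postings = {
    "cat": [("d1", 2), ("d2", 1)],
    "dog": [("d2", 3)],
}
tf_scores = score_by_term_frequency(["cat", "cat", "dog"], postings)
```

Here the repeated "cat" in the query contributes only once, so d1 scores 2 and d2 scores 1 + 3 = 4.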
QueryScorer that uses TF-IDF weighting with the cosine similarity measure.
This implementation is an approximation to the true cosine because of the way we normalize by document length. When computing document length, we assume a term weight of 1 for each document term. That is, we do not factor in term weights when computing the “document length”, since that would require choosing the weighting strategy at index time.
Query length is ignored, as it scales every document's score by the same factor and so has no effect on relative ordering.
Returns idf weight
Scores documents’ similarities to query using cosine similarity in a vector space model. Uses TF-IDF weighting.
An individual term hit is scored as:
idf * self.tf_weight(q_tf) * self.tf_weight(d_tf)
The overall score for a doc is given by the sum of the term-hit scores
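The term-hit formula and the summation can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: the idf formula `log(N / df)`, the `doc_lengths` input, the square-root length normalizer, and the `postings_lists` layout (term mapped to `(doc_id, d_tf)` pairs) are all hypothetical choices that the source does not specify.

```python
import math
from collections import defaultdict


def tf_weight(tf):
    # Sublinear scaling, 1 + log(tf); natural log assumed, tf >= 1.
    return 1.0 + math.log(tf)


def score_tfidf(query_tfs, postings_lists, doc_lengths, num_docs):
    """Sum idf * tf_weight(q_tf) * tf_weight(d_tf) over term hits.

    query_tfs:      term -> q_tf
    postings_lists: term -> list of (doc_id, d_tf)   (assumed layout)
    doc_lengths:    doc_id -> number of terms         (assumed input)
    """
    scores = defaultdict(float)
    for term, q_tf in query_tfs.items():
        postings = postings_lists.get(term, [])
        if not postings:
            continue
        # Assumed idf: log(N / df). The source does not give a formula.
        idf = math.log(num_docs / len(postings))
        for doc_id, d_tf in postings:
            scores[doc_id] += idf * tf_weight(q_tf) * tf_weight(d_tf)
    # Approximate normalization: with every term weight taken as 1, the
    # vector length is sqrt(term count). Query length is not applied.
    return {d: s / math.sqrt(doc_lengths[d]) for d, s in scores.items()}


# Hypothetical index: 4 documents in total, two of which match.
postings = {
    "cat": [("d1", 2), ("d2", 1)],
    "dog": [("d2", 3)],
}
tfidf_scores = score_tfidf(
    {"cat": 1, "dog": 1}, postings, {"d1": 4, "d2": 9}, num_docs=4
)
```

In this example d2 outranks d1 because it matches the rarer term "dog" as well as "cat", even after dividing by its larger length.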
Returns sublinear scaling of tf: 1+log(tf)
Returns unscaled tf
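The two tf weighting variants can be compared with a short sketch. The function names `sublinear_tf` and `raw_tf`, the use of the natural logarithm, and the handling of `tf == 0` are assumptions for illustration; the source only gives the form 1+log(tf).

```python
import math


def sublinear_tf(tf):
    # 1 + log(tf) dampens the effect of very frequent terms.
    # Returning 0.0 for tf == 0 is an assumption, since log(0) is undefined.
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def raw_tf(tf):
    # Unscaled variant: the weight is the raw count itself.
    return float(tf)


# Compare both weightings across a few term frequencies.
comparison = [(tf, sublinear_tf(tf), raw_tf(tf)) for tf in (1, 10, 100)]
```

Both variants agree at tf = 1, but the sublinear weight grows only logarithmically, so a term that occurs 100 times counts for far less than 100 times a single occurrence.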