The similarity Module

Sample usage as a script:

$ python similarity.py http://www.stanford.edu/ http://www.berkeley.edu/ http://www.mit.edu/
Comparing files ['http://www.stanford.edu/', 'http://www.berkeley.edu/', 'http://www.mit.edu/']
sim(http://www.stanford.edu/,http://www.berkeley.edu/)=0.322771960247
sim(http://www.stanford.edu/,http://www.mit.edu/)=0.142787018368
sim(http://www.berkeley.edu/,http://www.mit.edu/)=0.248877629741

pysimsearch.similarity.cosine_sim(u, v)

Returns the cosine similarity of u and v: <u,v>/(|u||v|), where <u,v> is the dot product of u and v and |u| is the L2 norm of u.
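
The formula can be illustrated with a minimal, self-contained sketch. It is not the pysimsearch implementation: the function name cosine_sim_sketch, the representation of vectors as {term: weight} dicts, and the choice to return 0.0 for an empty vector are all assumptions made for illustration.

```python
import math


def cosine_sim_sketch(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts.

    Hypothetical helper for illustration; not part of pysimsearch.
    """
    # Dot product <u,v>: only terms present in both vectors contribute.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    # L2 norms |u| and |v|.
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Assumption: similarity with an all-zero vector is defined as 0.
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, identical vectors score 1 (up to floating-point rounding) and vectors with no terms in common score 0.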

pysimsearch.similarity.jaccard_sim(A, B)

Returns the Jaccard similarity of A and B: |A \cap B| / |A \cup B|. A and B are treated as multi-sets. (The Jaccard coefficient is technically defined for sets, although it extends easily to multi-sets.)
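
One common way to extend the coefficient to multi-sets is to take the minimum count of each element for the intersection and the maximum count for the union. The sketch below follows that convention using collections.Counter. Whether pysimsearch uses the same convention is an assumption, and jaccard_sim_sketch is a hypothetical name.

```python
from collections import Counter


def jaccard_sim_sketch(A, B):
    """Multi-set Jaccard similarity of two iterables of hashable items.

    Hypothetical helper for illustration; not part of pysimsearch.
    """
    a, b = Counter(A), Counter(B)
    # Counter '&' keeps the minimum count per element (multi-set intersection).
    intersection_size = sum((a & b).values())
    # Counter '|' keeps the maximum count per element (multi-set union).
    union_size = sum((a | b).values())
    if union_size == 0:
        # Assumption: two empty multi-sets have similarity 0.
        return 0.0
    return intersection_size / union_size
```

For example, A = ['a', 'a', 'b'] and B = ['a', 'b', 'b'] share one 'a' and one 'b' (intersection size 2), while the union holds two of each (size 4), giving 0.5.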

pysimsearch.similarity.main()

Command-line interface for measuring pairwise similarities of files.

pysimsearch.similarity.measure_similarity(file_a, file_b, sim_func=None)

Returns the textual similarity of file_a and file_b using the chosen similarity metric.

‘sim_func’ defaults to cosine_sim if not specified.
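
The default-metric behaviour can be sketched as follows. This is an illustration only. It takes text strings rather than file objects, builds term vectors by lower-casing and splitting on whitespace, and uses its own small cosine function. All of those choices, and every name in the block, are assumptions rather than the library's actual behaviour.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector; assumes lower-casing and whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    norms = norm_u * norm_v
    return dot / norms if norms else 0.0


def measure_similarity_sketch(text_a, text_b, sim_func=None):
    """Hypothetical analogue of measure_similarity operating on strings."""
    if sim_func is None:
        # Fall back to cosine similarity when no metric is supplied.
        sim_func = cosine
    return sim_func(term_vector(text_a), term_vector(text_b))
```

Passing a different callable as sim_func swaps the metric without changing how the term vectors are built.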

pysimsearch.similarity.pairwise_compare_filenames(*filenames)

Performs a pairwise comparison of the documents specified by ‘filenames’ and returns their pairwise similarities.
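
A pairwise comparison visits each unordered pair of documents once. The sketch below shows that pattern with itertools.combinations, which yields pairs in the same order as the sample output at the top of this page. The input format (a list of (name, terms) tuples), the return format, the toy overlap metric, and all function names are assumptions for illustration, not the pysimsearch API.

```python
from itertools import combinations


def overlap(a, b):
    """Toy similarity: fraction of distinct terms shared by two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def pairwise_compare_sketch(named_docs, sim_func=overlap):
    """Return (name_a, name_b, similarity) for every unordered pair of documents.

    Hypothetical helper for illustration; not part of pysimsearch.
    """
    results = []
    for (name_a, doc_a), (name_b, doc_b) in combinations(named_docs, 2):
        results.append((name_a, name_b, sim_func(doc_a, doc_b)))
    return results


docs = [("d1", ["a", "b"]), ("d2", ["b", "c"]), ("d3", ["c", "d"])]
for name_a, name_b, score in pairwise_compare_sketch(docs):
    print("sim({},{})={}".format(name_a, name_b, score))
```

For n documents this produces n(n-1)/2 comparisons, so three inputs yield three pairs.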

pysimsearch.similarity.pairwise_compare_files(*named_files)

Performs a pairwise comparison of the ‘named_files’ and prints their pairwise similarities.
