Sample usage as a script:
$ python similarity.py http://www.stanford.edu/ http://www.berkeley.edu/ http://www.mit.edu/
Comparing files ['http://www.stanford.edu/', 'http://www.berkeley.edu/', 'http://www.mit.edu/']
sim(http://www.stanford.edu/,http://www.berkeley.edu/)=0.322771960247
sim(http://www.stanford.edu/,http://www.mit.edu/)=0.142787018368
sim(http://www.berkeley.edu/,http://www.mit.edu/)=0.248877629741
Returns the cosine similarity of u,v: <u,v>/(|u||v|) where <u,v> is the dot product and |u| is the L2 norm of u
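
The formula above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes term vectors are dicts mapping term to weight, and the name cosine_sim_sketch and the choice to return 0.0 for a zero-length vector are assumptions made here.

```python
import math


def cosine_sim_sketch(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts.

    Hypothetical helper for illustration; the dict representation is assumed.
    """
    # Dot product <u,v>: only terms present in both vectors contribute.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    # L2 norms |u| and |v|.
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # Assumption: similarity with an all-zero vector is defined as 0.0
    # to avoid dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that share no terms score 0.0, and a vector compared with a positive multiple of itself scores 1.0 up to floating-point rounding.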
Returns the Jaccard similarity of A,B: |A \cap B| / |A \cup B|. We treat A and B as multisets. (The Jaccard coefficient is technically defined for sets, although it is easily extended to multisets.)
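
One way to extend the Jaccard coefficient to multisets is to take the per-element minimum count as the intersection and the per-element maximum count as the union. The sketch below uses collections.Counter for this; the name jaccard_sim_sketch and the 0.0 result for two empty inputs are assumptions, and the module itself may define the multiset extension differently.

```python
from collections import Counter


def jaccard_sim_sketch(a, b):
    """Multiset Jaccard similarity of two iterables of hashable items.

    Hypothetical helper for illustration only.
    """
    count_a, count_b = Counter(a), Counter(b)
    # Counter '&' keeps the minimum count per element (multiset intersection).
    intersection = sum((count_a & count_b).values())
    # Counter '|' keeps the maximum count per element (multiset union).
    union = sum((count_a | count_b).values())
    # Assumption: two empty multisets are given similarity 0.0.
    if union == 0:
        return 0.0
    return intersection / union
```

For example, jaccard_sim_sketch("aab", "abb") has intersection {a: 1, b: 1} and union {a: 2, b: 2}, so it returns 2 / 4 = 0.5.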
Command-line interface for measuring pairwise similarities of files
Returns the textual similarity of term_vec_a and term_vec_b using the chosen similarity metric
‘sim_func’ defaults to cosine_sim if not specified
Do a pairwise comparison of the documents specified by ‘filenames’ and return their pairwise similarities
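
A pairwise comparison of this kind can be sketched with itertools.combinations, which yields each unordered pair once, in the same order as the sample output at the top. The function name pairwise_sims_sketch, the (name, vector) input format, and the toy overlap metric are all hypothetical; loading and tokenizing the documents is left out.

```python
from itertools import combinations


def overlap_sketch(vec_a, vec_b):
    """Toy metric for the demo: number of shared terms (not a real similarity)."""
    return len(set(vec_a) & set(vec_b))


def pairwise_sims_sketch(named_vecs, sim_func):
    """Return (name_a, name_b, similarity) for every unordered pair.

    named_vecs: list of (name, term_vector) tuples (assumed format).
    sim_func:   any two-argument similarity function.
    """
    return [
        (name_a, name_b, sim_func(vec_a, vec_b))
        for (name_a, vec_a), (name_b, vec_b) in combinations(named_vecs, 2)
    ]


docs = [
    ("a.txt", {"cat": 1, "dog": 1}),
    ("b.txt", {"dog": 2, "fish": 1}),
    ("c.txt", {"bird": 1}),
]
for name_a, name_b, score in pairwise_sims_sketch(docs, overlap_sketch):
    print("sim(%s,%s)=%s" % (name_a, name_b, score))
```

With n documents this makes n(n-1)/2 calls to sim_func, so the cost grows quadratically with the number of files.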
Do a pairwise comparison of the ‘named_files’ and print their pairwise similarities