API - Semantic Toolbox
semantic.load_dict([path]) | Load a native dictionary object from a pkl file.
semantic.generate_swords() | Generate the sensitive-word dictionary.
semantic.check_swords(sentence) | Check whether a sentence contains sensitive words.
semantic.synonym_cut(sentence[, pattern]) | Cut the sentence into a synonym vector tag.
semantic.get_tag(sentence, config) | Get semantic tag of sentence.
semantic.get_tags(word) | Get the set of semantic tags for a word.
semantic.sim_tag(tag1, tag2) | Compute the similarity of two semantic tags; the score lies in [0, 1].
semantic.max_sim_tag(word1, word2) | Compute the maximum similarity between the semantic tag sets of two words; the score lies in [0, 1].
semantic.sum_cosine(matrix, threshold) | Calculate the parameters of the semantic Jaccard model based on the Cosine similarity matrix of semantic word segmentation.
semantic.jaccard_basic(synonym_vector1, ...) | Similarity score between two vectors with basic Jaccard.
semantic.jaccard(synonym_vector1, ...[, ...]) | Similarity score between two vectors with semantic Jaccard.
semantic.jaccard2(sv1, sv2[, threshold]) | Similarity score between two vectors with semantic Jaccard.
semantic.edit_distance(synonym_vector1, ...) | Similarity score between two vectors with edit distance.
semantic.similarity(synonym_vector1, ...[, ...]) | Similarity score between two sentences.
semantic.similarity2(s1, s2) | Similarity score between two sentences.
semantic.get_location(sentence) | Get location in sentence.
semantic.get_musicinfo(sentence) | Get music info in sentence.
Custom word segmentation (with punctuation and modal-particle filtering)
Custom word segmentation (cuts a sentence into a synonym vector tag)
semantic.synonym_cut(sentence, pattern='wf')
Cut the sentence into a synonym vector tag. If a word in the sentence is not found in the synonym dictionary, it is tagged with the default part-of-speech tag of the word segmentation tool.
- Args:
- pattern: 'w' - segmentation, 'k' - unique keyword, 't' - keyword list, 'wf' - segmented words with tags, 'tf' - keywords with tags.
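The pattern modes can be illustrated with a toy sketch. The tag dictionary, the invented Cilin-style codes, and the pre-segmented input below are all hypothetical, not the library's data; only the 'w' and 'wf' modes are modeled.

```python
# Hypothetical mini tag dictionary (codes invented for illustration).
TAGS = {"我": "Aa01", "喜欢": "Fa21", "音乐": "Ba07"}

def synonym_cut_sketch(words, pattern="wf"):
    """Toy model of the 'w' and 'wf' modes over an already-segmented sentence."""
    if pattern == "w":   # plain segmentation
        return list(words)
    if pattern == "wf":  # each word paired with its semantic tag, or a default POS
        return [(w, TAGS.get(w, "n")) for w in words]
    raise ValueError("only 'w' and 'wf' are modeled in this sketch")

print(synonym_cut_sketch(["我", "喜欢", "猫"], "wf"))
# [('我', 'Aa01'), ('喜欢', 'Fa21'), ('猫', 'n')]
```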
Get the set of semantic tags for a word
Maximum similarity between the semantic tag sets of two words; the score lies in [0, 1]
Compute the parameters of the semantic Jaccard model from the Cosine similarity matrix of semantic word segmentation
semantic.sum_cosine(matrix, threshold)
Calculate the parameters of the semantic Jaccard model based on the Cosine similarity matrix of semantic word segmentation.
- Args:
- matrix: Semantic Cosine similarity matrix.
- threshold: Threshold a fragment pair must reach to count as a semantic match.
- Returns:
- total: The semantic intersection of the two sentences' language fragments.
- num_not_match: The number of fragments in the two sets that fail the semantic matching criterion (controlled by threshold), taking the larger of the two counts.
- total_dif: The degree of semantic difference between the two sets.
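The three return values can be pictured with a small sketch. The matching rule below (match each fragment of sentence 1 to its best counterpart in sentence 2, then split the similarity mass by the threshold) is one plausible reading, not the library's actual computation.

```python
def sum_cosine_sketch(matrix, threshold):
    """Assumed reading of sum_cosine over a rows-by-cols cosine matrix."""
    best = [max(row) for row in matrix]                 # best match per fragment
    total = sum(s for s in best if s >= threshold)      # semantic intersection mass
    matched = sum(1 for s in best if s >= threshold)
    rows, cols = len(matrix), len(matrix[0])
    num_not_match = max(rows, cols) - matched           # fragments left unmatched
    total_dif = sum(s for s in best if s < threshold)   # weak-match mass
    return {"total": total, "num_not_match": num_not_match, "total_dif": total_dif}

m = [[0.9, 0.2],
     [0.3, 0.1],
     [0.4, 0.95]]
print(sum_cosine_sketch(m, 0.8))
```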
Vector similarity - basic Jaccard model
semantic.jaccard_basic(synonym_vector1, synonym_vector2)
Similarity score between two vectors with basic Jaccard. The similarity is computed with the basic Jaccard model; the score for any two vectors lies in [0, 1].
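The basic Jaccard model is the standard set formula |intersection| / |union|; a minimal sketch, assuming the vectors are plain token lists and that two empty vectors score 1.0:

```python
def jaccard_basic_sketch(v1, v2):
    """Basic Jaccard: |intersection| / |union| of the two token sets."""
    s1, s2 = set(v1), set(v2)
    if not (s1 | s2):        # two empty vectors: define the score as 1.0
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

print(jaccard_basic_sketch(["我", "喜欢", "音乐"], ["我", "爱", "音乐"]))  # 0.5
```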
Vector similarity - semantic Jaccard model
semantic.jaccard(synonym_vector1, synonym_vector2, threshold=0.8)
Similarity score between two vectors with semantic Jaccard. The similarity is computed with the semantic Jaccard model; the score for any two vectors lies in [0, 1].
Segmentation: semantic tag dictionary + custom dictionary. Word similarity: based on how many leading letters two tags share. Algorithm: word-vector similarity matrix + vector cosine.
Implementation: compute the semantic-tag similarity matrix and compare the similarity of each pair of words.
1. Threshold: 0.8. The similarity of any two semantic tags lies in [0, 1]; if a word has no tag, the similarity score of the original words is used instead.
2. Tag similarity score: determined by how many leading letters the two tags share.
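The leading-letter rule in step 2 can be sketched as a prefix comparison. The exact scoring formula is not given above, so the normalization by the longer code below is an assumption:

```python
def sim_tag_sketch(tag1, tag2):
    """Score two tags by the number of matching leading characters,
    normalized by the longer code (assumed formula)."""
    n = 0
    for a, b in zip(tag1, tag2):
        if a != b:
            break
        n += 1
    return n / max(len(tag1), len(tag2), 1)

print(sim_tag_sketch("Aa01", "Aa07"))  # 3 of 4 leading characters match -> 0.75
```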
Vector similarity - semantic jaccard2 model
semantic.jaccard2(sv1, sv2, threshold=0.8)
Similarity score between two vectors with semantic Jaccard. The similarity is computed with the semantic Jaccard model; the score for any two vectors lies in [0, 1].
Segmentation: custom dictionary. Word similarity: obtain the semantic tag sets of the two words from the semantic tag tree and compute their distance within its hierarchical coding. Algorithm: word-vector similarity matrix + vector cosine.
Implementation: compute the semantic-tag similarity matrix and compare the similarity of each pair of words.
1. Threshold: 0.8. The similarity of any two semantic tags lies in [0, 1]; if a word has no tag, the similarity score of the original words is used instead.
2. Tag similarity score: Cilin (词林) codes words at five levels: the first level (major class) is an uppercase letter, the second level (middle class) a lowercase letter, the third level (minor class) a two-digit decimal integer, the fourth level (word group) an uppercase letter, and the fifth level (atomic word group) a two-digit decimal integer. The eighth position carries one of three marks: "=" means equal/synonymous, "#" means unequal but of the same class, and "@" means self-contained/independent, i.e. the word has neither synonyms nor related words in the dictionary.
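The five-level coding above suggests a tree-distance comparison: two codes are more similar the more leading levels they share. A minimal sketch, assuming 8-character codes like "Aa01A01=" and an equal weight per level (the real weighting is not specified above):

```python
# Slices for the five Cilin coding levels in an 8-character code like "Aa01A01=".
LEVELS = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]

def cilin_sim_sketch(code1, code2):
    """Fraction of leading Cilin levels on which two codes agree; stops at the
    first level that differs, mirroring a distance in the coding tree."""
    matched = 0
    for start, end in LEVELS:
        if code1[start:end] != code2[start:end]:
            break
        matched += 1
    return matched / len(LEVELS)

print(cilin_sim_sketch("Aa01A01=", "Aa01A02#"))  # 4 of 5 levels agree -> 0.8
```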