Clusters a set of WordNet synsets using k-means. The pairwise WordNet distance (semantic relatedness) between word senses, computed with the edge-counting method of Wu & Palmer (1994), is mapped to Euclidean distance so that k-means can converge while preserving the original pairwise relationships.
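
One way to picture the mapping from pairwise similarity to Euclidean distance is classical multidimensional scaling (MDS) followed by a plain k-means loop. The sketch below is an illustration under assumptions, not necessarily what this repository does: the function names, the `1 - similarity` conversion, the farthest-point initialisation, and the similarity values in the usage example are all hypothetical.

```python
import numpy as np


def similarity_to_coordinates(sim, dims=2):
    """Embed items with classical MDS so that Euclidean distances
    approximate (1 - similarity). The conversion is an assumption."""
    sim = np.asarray(sim, dtype=float)
    dist = 1.0 - sim
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]
    # Clip small negative eigenvalues caused by non-Euclidean input.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def kmeans(points, k, iterations=50):
    """Minimal k-means with deterministic farthest-point initialisation."""
    centres = [points[0]]
    for _ in range(1, k):
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centres], axis=0
        )
        centres.append(points[np.argmax(nearest)])
    centres = np.array(centres)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels


# Made-up similarities for four senses (two animals, two vehicles);
# these are illustrative numbers, not real Wu & Palmer scores.
example_sim = [
    [1.00, 0.86, 0.30, 0.30],
    [0.86, 1.00, 0.30, 0.30],
    [0.30, 0.30, 1.00, 0.80],
    [0.30, 0.30, 0.80, 1.00],
]
example_labels = kmeans(similarity_to_coordinates(example_sim), k=2)
```

With these made-up values the first two items end up in one cluster and the last two in the other, because the embedding keeps highly similar senses close together.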
Changing `use_wordnet` from True to False switches the distance metric between words to the word2vec similarity value from a GloVe model, glove.6B.300d_word2vec.txt (this file must be in the word2vec format).
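
For word vectors, the similarity value is usually the cosine of the angle between two vectors. The sketch below shows that computation together with a threshold check; the function names and the three-dimensional toy vectors are hypothetical stand-ins for the 300-dimensional GloVe vectors, which are not loaded here.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def is_similar(a, b, threshold=0.7):
    """Treat two words as similar when their score reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors, invented for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
```

Here `is_similar(toy_vectors["dog"], toy_vectors["puppy"])` holds, while the same check for "dog" and "car" does not.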
The extras folder contains proof-of-concept code and experiments.
- Create a newline-delimited file with a list of `wordnet` senses (e.g. data/example_tags.txt).
- To use `wordnet`, set `use_wordnet=True`; to use `word2vec`, set `use_wordnet=False`.
- Run `python generate-tag-clusters.py data/example_tags.txt 25 0.7`.
- 25 is the number of clusters to segment the list of `wordnet` senses into.
- 0.7 is the similarity threshold; below this the words are considered not similar.
- Results are placed into the `results` folder as a JSON file.
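
The input and output steps above can be pictured roughly as follows. This is a sketch under assumptions: `read_tags`, `write_clusters`, and the JSON layout (cluster label mapped to a list of tags) are hypothetical and may differ from what the script actually writes.

```python
import json
from pathlib import Path


def read_tags(path):
    """Return the non-empty, stripped lines of a newline-delimited tag file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def write_clusters(tags, labels, out_path):
    """Group tags by cluster label and save them as JSON.
    The {label: [tags]} layout is an assumed format."""
    clusters = {}
    for tag, label in zip(tags, labels):
        clusters.setdefault(str(int(label)), []).append(tag)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(clusters, indent=2), encoding="utf-8")
    return clusters
```

Blank lines in the input file are skipped, and the output directory is created if it does not exist yet.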