Text clusterization overview

This is more of a cheat sheet than a serious project with high goals. The data is one year of Reddit posts with the #wot hashtag, stored in posts_Reddit_wot_en.csv. Text vectorization was done with four main methods: BoW (bag of words), TF-IDF, PV-DM, and PV-DBOW (the last two are Doc2Vec paragraph-vector models).
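To make the difference between the first two vectorizers concrete, here is a minimal standard-library sketch. It is not the repository's code: the helper names (`tokenize`, `bow_vectors`, `tfidf_vectors`) and the three toy documents are made up for illustration, and the TF-IDF formula is the plain textbook one (term frequency times `log(N / df)`), which differs from the smoothed variants that libraries usually apply.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real Reddit posts would need more cleaning.
    return text.lower().split()


def bow_vectors(docs):
    # Bag of words: one raw count per vocabulary term, per document.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    # TF-IDF: term frequency scaled down for terms that occur in many documents.
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    vectors = []
    for row in counts:
        total = sum(row)
        vectors.append([(c / total) * w if total else 0.0 for c, w in zip(row, idf)])
    return vocab, vectors


# Hypothetical toy corpus, not taken from posts_Reddit_wot_en.csv.
docs = ["tank battle tank", "tank map", "map update"]
vocab, bow = bow_vectors(docs)
_, tfidf = tfidf_vectors(docs)

for name, vectors in (("BoW", bow), ("TF-IDF", tfidf)):
    print(name)
    for row in vectors:
        print("  ", dict(zip(vocab, (round(v, 3) for v in row))))
```

In the first document "tank" has the highest raw count, yet under TF-IDF the rarer word "battle" gets the larger weight, because "tank" also appears in another document.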

The clustering method is always K-means++, because I believe that changing it has little impact compared with changing the vectorization technique. Visualization is performed with MDS and PCA.
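For readers who have not met K-means++ before, the idea can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used in this repository: the function names, the `seed` and `n_iter` parameters, and the 2-D toy points are all assumptions. K-means++ only changes how the initial centers are chosen (each new center is sampled with probability proportional to its squared distance from the nearest center picked so far); the iterations afterwards are ordinary K-means.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    # K-means++ seeding: the first center is uniform at random, each later
    # center is sampled with probability proportional to D(x)^2.
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing centers; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def assign(points, centers):
    # Label each point with the index of its nearest center.
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=20, seed=0):
    centers = kmeans_pp_init(points, k, seed)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers


# Hypothetical 2-D points standing in for document vectors.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, 2)
print(labels)
print(centers)
```

With real document vectors the points would be the rows produced by one of the vectorizers above, and the number of clusters `k` would have to be chosen by the user.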

Install

git clone git@github.com:bluella/Text-clusterization-overview.git
cd Text-clusterization-overview
virtualenv -p /usr/bin/python3.7 tco_env
source ./tco_env/bin/activate
pip install -r requirements.txt

You are good to go!

Results

TF-IDF showed the best results among the vectorization methods. BoW is a bit less accurate. PV-DM and PV-DBOW deliver really weird results, perhaps because the dataset is too small to train the models properly. PCA visualization seems to match the real outcome better than MDS.

Further development

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Much of the code was taken from the following resources: