Text clusterization overview

This is more of a cheat sheet than a serious project with high goals. The data is one year of Reddit posts with the #wot hashtag (posts_Reddit_wot_en.csv). Text vectorization was done with four main methods: BoW, TF-IDF, PV-DM, and PV-DBOW.
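To make the difference between the two count-based methods concrete, here is a minimal sketch in plain Python (standard library only, so it assumes nothing about what requirements.txt installs). The toy posts are made up for illustration, and the IDF formula is the textbook log(N / df) without smoothing, which may differ from what the notebooks in this repo actually use.

```python
import math
from collections import Counter

# Hypothetical toy posts; the real data lives in posts_Reddit_wot_en.csv.
posts = [
    "new tank is great",
    "tank battle was great fun",
    "server lag ruined the battle",
]

tokenized = [p.split() for p in posts]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many posts each word appears.
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(tokenized)


def bow_vector(tokens):
    """BoW: raw term counts over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tfidf_vector(tokens):
    """TF-IDF: term count scaled by log(N / df), so rare words weigh more."""
    counts = Counter(tokens)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]


bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]
```

In BoW, "tank" and "new" both count as 1 in the first post. In TF-IDF, "new" gets a higher weight because it appears in only one post, while "tank" appears in two.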

The clusterization method is always K-means++, simply because I believe that modifying it has little impact compared to changing the vectorization technique. Visualization is performed via MDS and PCA.
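For reference, the idea behind K-means++ can be sketched in a few lines of plain Python. This is an illustration only, not the code used in this repo: the function names, the toy 2D points, and the fixed iteration count are all assumptions, and real text vectors would have far more dimensions. The sketch also assumes there are at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: far-away points are more likely to become centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations started from K-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data: two well-separated blobs.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centers = kmeans(points, k=2)
```

The only difference from plain K-means is the seeding step, which is part of why swapping the clustering variant matters less here than swapping the vectors that are fed into it.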

Install

git clone git@github.com:bluella/Text-clusterization-overview.git
cd Text-clusterization-overview
virtualenv -p /usr/bin/python3.7 tco_env
source ./tco_env/bin/activate
pip install -r requirements.txt

You are good to go!

Results

TF-IDF showed the best results among the vectorization methods. BoW is a bit less accurate. PV-DM and PV-DBOW deliver really weird results, perhaps because the dataset is too small to train the models properly. The PCA visualization seems to match the actual outcome better than MDS.

Further development

Proper clusterization evaluation
Use a pretrained model for PV-DM, with the help of fastText or something similar
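One possible starting point for the evaluation item is the silhouette coefficient, which needs no ground-truth labels. The sketch below is plain Python and is not part of the current project; the function name, the Euclidean metric, and the toy points are assumptions, and it expects at least two clusters.

```python
import math


def euclid(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over all points; single-member clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclid(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated blobs.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])
```

A score close to 1 means tight, well-separated clusters, so a labeling that follows the blobs scores higher than one that mixes them. The same comparison could be run per vectorization method instead of judging the plots by eye.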

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Heavy loads of code were taken from the following resources:
