NLP pipeline for discovering themes and emerging signals in large Polish-language text corpora. Transforms 248k raw documents into structured clusters through text preprocessing, vectorization, and unsupervised learning.
Dataset: WiktorS/polish-news (HuggingFace), 248,123 Polish news documents · fields: title, headline, content, link
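As a rough illustration of how the fields listed above could feed the pipeline, the sketch below loads the dataset with the Hugging Face `datasets` library and joins `title`, `headline`, and `content` into one string per document. The `train` split name, the helper names `build_text` and `load_corpus`, and the choice to leave out `link` are assumptions for illustration, not details confirmed by this project.

```python
from typing import Iterator, Mapping

# `link` is left out because it is a URL rather than article text (assumption).
TEXT_FIELDS = ("title", "headline", "content")


def build_text(record: Mapping[str, object]) -> str:
    """Join the textual fields of one record, skipping missing or empty values."""
    parts = []
    for field in TEXT_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return "\n".join(parts)


def load_corpus(split: str = "train") -> Iterator[str]:
    """Yield one combined text per document (requires `pip install datasets`)."""
    # Imported lazily so build_text can be used without the dependency.
    from datasets import load_dataset

    # The split name is an assumption; check the dataset card before relying on it.
    dataset = load_dataset("WiktorS/polish-news", split=split)
    for record in dataset:
        yield build_text(record)
```

Keeping `build_text` separate from the loader makes the field-joining logic easy to check on a single hand-written record before downloading the full corpus.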
🌐 Live demo: sygnaly-miner.streamlit.app
248,123 documents → cleaning & filtering → 238,890-document high-quality subset (~96.3% retained)
↓
CountVectorizer
↓
KMeans clustering (k=8)
↓
PCA projection · cluster distribution · top terms
| Metric | Value |
|---|---|
| Silhouette Score | 0.07 |
| Inertia | tracked |
| Clusters | 8 (KMeans) |
A low silhouette score is expected with Bag-of-Words features on high-dimensional sparse text. It suggests that semantic embeddings are needed for meaningful cluster separation.
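For reference, both metrics in the table can be computed with scikit-learn along these lines. The sketch runs on a hypothetical toy corpus with `k=3`, so its numbers are unrelated to the 0.07 reported above; the function name `evaluate_kmeans` and the seed are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score


def evaluate_kmeans(docs, k, seed=42):
    """Fit KMeans on Bag-of-Words counts and report silhouette and inertia."""
    X = CountVectorizer().fit_transform(docs)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return {
        # Mean silhouette over all documents, in the range [-1, 1].
        "silhouette": float(silhouette_score(X, kmeans.labels_)),
        # Sum of squared distances from each document to its centroid.
        "inertia": float(kmeans.inertia_),
    }


# Hypothetical toy corpus, used only to make the sketch runnable.
toy_docs = [
    "rząd przyjął projekt budżetu państwa",
    "sejm głosuje nad budżetem państwa",
    "minister finansów przedstawił budżet",
    "piłkarze wygrali mecz ligowy",
    "trener chwali piłkarzy po meczu",
    "kibice świętują wygrany mecz",
    "naukowcy opracowali nową szczepionkę",
    "badania kliniczne szczepionki trwają",
]
metrics = evaluate_kmeans(toy_docs, k=3)
print(metrics)
```

Silhouette values near 0 indicate overlapping clusters and values near 1 indicate well-separated ones. On a corpus of this project's size, the pairwise distance computation becomes expensive, so passing `sample_size` to `silhouette_score` is one way to keep it tractable.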
Python · scikit-learn · NLTK · Sentence Embeddings (planned) · Streamlit · Docker
The end-to-end baseline pipeline is complete: data cleaning, vectorization, clustering, evaluation, and visualization are implemented and deployed.
Next phase (not yet started): TF-IDF and semantic embeddings, improved clustering with HDBSCAN, and an expanded Streamlit interface.