cleberfc23/sygnaly-miner

Sygnały — Public Signals Mining

NLP pipeline for discovering themes and emerging signals in large Polish-language text corpora. Transforms 248k raw documents into structured clusters through text preprocessing, vectorization, and unsupervised learning.

Dataset: WiktorS/polish-news (HuggingFace) — 248,123 Polish news documents · fields: title, headline, content, link

🌐 Live demo: sygnaly-miner.streamlit.app


Pipeline

248,123 documents → cleaning & filtering → 2,890 high-quality subset (~96.3% retained)
                                                    ↓
                                           CountVectorizer
                                                    ↓
                                         KMeans clustering (k=8)
                                                    ↓
                          PCA projection · cluster distribution · top terms

Baseline Results

| Metric | Value |
| --- | --- |
| Silhouette Score | 0.07 |
| Inertia | tracked |
| Clusters | 8 (KMeans) |

A low silhouette score is expected with Bag-of-Words features on high-dimensional sparse text. It suggests that count-based vectors do not separate the clusters well, which motivates the move to semantic embeddings planned for the next phase.
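For reference, the silhouette value of one sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The score is the mean over all samples and lies between -1 and 1, so values near 0 indicate overlapping clusters. The sketch below shows one way both table metrics could be computed with scikit-learn; the function name `evaluate_kmeans` and the toy documents are assumptions, not code from the repository.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score


def evaluate_kmeans(docs, k, seed=42):
    """Fit KMeans on Bag-of-Words counts and report silhouette and inertia."""
    X = CountVectorizer().fit_transform(docs)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return {
        "k": k,
        # Mean silhouette over all samples (Euclidean distance by default).
        "silhouette": float(silhouette_score(X, km.labels_)),
        # Sum of squared distances from each sample to its closest centroid.
        "inertia": float(km.inertia_),
    }


eval_docs = [
    "rząd ogłosił nowy budżet państwa",
    "sejm przyjął ustawę o budżecie państwa",
    "minister finansów przedstawił projekt budżetu",
    "reprezentacja polski wygrała mecz piłki nożnej",
    "piłkarze wygrali mecz w lidze mistrzów",
    "trener ogłosił skład na mecz reprezentacji",
]
metrics = evaluate_kmeans(eval_docs, k=2)
print(metrics)
```

Inertia always decreases as k grows, so it is mainly useful for comparing runs at different k (the elbow heuristic), while the silhouette score can be compared across feature representations.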


Tech Stack

Python · scikit-learn · NLTK · Sentence Embeddings · Streamlit · Docker


Baseline complete — next phase planned

The end-to-end baseline pipeline is complete: data cleaning, vectorization, clustering, evaluation, and visualization are implemented and deployed.

Next phase (not yet started): TF-IDF and semantic embeddings, improved clustering with HDBSCAN, and an expanded Streamlit interface.
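As a rough idea of what the first planned step could look like, the sketch below swaps raw counts for TF-IDF weights while keeping KMeans, so the silhouette score stays comparable with the baseline. It is only an illustration under assumptions: the function name `tfidf_kmeans`, the `sublinear_tf` setting, and the toy documents are not from the repository, and HDBSCAN and embeddings are left out.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def tfidf_kmeans(docs, k, seed=42):
    """Cluster TF-IDF vectors with KMeans and return labels plus silhouette."""
    # TfidfVectorizer L2-normalizes each row by default, which reduces the
    # influence of document length compared with raw counts.
    vectorizer = TfidfVectorizer(sublinear_tf=True)
    X = vectorizer.fit_transform(docs)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.labels_, float(silhouette_score(X, km.labels_))


tfidf_docs = [
    "rząd ogłosił nowy budżet państwa",
    "sejm przyjął ustawę o budżecie państwa",
    "minister finansów przedstawił projekt budżetu",
    "reprezentacja polski wygrała mecz piłki nożnej",
    "piłkarze wygrali mecz w lidze mistrzów",
    "trener ogłosił skład na mecz reprezentacji",
]
tfidf_labels, tfidf_score = tfidf_kmeans(tfidf_docs, k=2)
```

Whether TF-IDF actually improves separation on the full corpus would have to be measured; the toy run only checks that the pipeline executes.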


Author

Cleber F. Carvalho

About

Exploring how machine learning can help uncover patterns in large datasets of citizen feedback. This project transforms raw public reports into meaningful signals through text analysis and clustering, focusing on building a clear and reproducible engineering workflow.
