anvay

anvay is a web-based tool for interpretive topic modelling of text corpora. Built for digital humanities research, literary analysis, and linguistic inquiry, it offers fine-grained control over preprocessing and presents results through an interactive interface designed for researchers and students. The application is modular, interpretable, and lightweight, with no reliance on neural networks, transformer embeddings, or LLMs — every transformation is documented and controlled by the user.

anvay takes plain-text .txt files, performs preprocessing (tokenisation, normalisation, stopword removal, frequency filtering, n-gram construction), builds a Latent Dirichlet Allocation (LDA) topic model using Gensim, and presents results through visualisations and a structured interpretive report.

anvay_200_demo.mp4

Language Support

Bengali is the primary and most thoroughly validated language in anvay. English, Hindi, and Tamil are supported but should be considered experimental — functional pipelines, less extensively tested.

Language	Script	Tokeniser	Normalisation	Status
Bengali	Bengali script	Custom (zero-width aware)	Dictionary-based stemming (~75,000 pairs)	Primary
English	Latin	NLTK word_tokenize	WordNet lemmatisation	Experimental
Hindi	Devanagari	Custom Devanagari	simplemma lemmatisation	Experimental
Tamil	Tamil script	Custom Tamil	Dictionary-based stemming	Experimental

All language pipelines are modular and isolated, so future refinements to experimental languages do not affect the core Bengali pipeline.

Features

Preprocessing

Upload up to 800 UTF-8 encoded .txt files (max 100 MB total)
Tokenisation and normalisation tailored to each supported language
Standard NLTK stopword lists (Bengali and English) plus custom stopword upload
Configurable stemming and filtering order: Stem → Filter or Filter → Stem
no_below and no_above frequency thresholds
Top-N% most frequent token removal
N-gram selection: unigrams, bigrams, trigrams

Topic Modelling

Gensim LdaMulticore with tunable parameters: number of topics, passes, iterations, alpha, eta, chunk size, minimum probability threshold
Multicore processing option for large corpora

Visualisations

Topic similarity scatterplot (PCA)
Topic × word heatmap
Top word frequency bar chart
Topic prevalence pie chart
Topic–word network graph
Topic evolution across documents
Topic distribution across documents
Hierarchical clustering dendrogram with merged-cluster keyword tooltips
Consistent global colour scheme across all charts (golden-ratio hue jumping)
Top-word hover tooltips on all visualisations

Report

Training overview: corpus statistics, token filtering log, perplexity estimates
Top words per topic (downloadable as CSV and TXT)
Topic weight matrix per document (downloadable as CSV and TXT)
Top terms in corpus
Topic prevalence table
Representative sentences per topic with ±2 sentence context window and confidence indicator

Documentation

Fully integrated documentation panel covering corpus preparation, parameter guidance, visualisation explanations, and a complete walkthrough using a sample Bengali news corpus

What's New in v2.0

Multilingual expansion

Added experimental support for English, Hindi, and Tamil. Each language has an isolated preprocessing pipeline with language-appropriate tokenisation and normalisation. See Language Support above.

Expanded Bengali stemming dictionary

The Bengali stemming dictionary has grown from approximately 7,000 to over 75,000 word pairs. This significantly improves coverage of inflectional and derivational variants, producing more stable and interpretable topic-word distributions — particularly for journalistic and literary corpora.

Configurable normalisation order

Users can now choose between Stem → Filter and Filter → Stem preprocessing orders. This makes the preprocessing pipeline transparent and reproducible, and enables comparative experimentation with token reduction strategies.

Enhanced representative sentence display

Each topic now shows its most representative sentence with two sentences of context on either side, alongside the topic weight. This situates topic salience within local textual context rather than presenting isolated extracts.

Screenshots

Upload interface

Documentation panel

Visualisations

Report

Technical Stack

Backend: Python (Flask), Gensim, NLTK, simplemma, NetworkX, Scikit-learn
Frontend: Bootstrap, Plotly, Bokeh, Seaborn
Deployment: Gunicorn on a university or personal server; Docker supported

Installation

anvay has been tested with Python 3.9–3.11 and Gensim 4.3.x.

Option 1: Virtual environment

git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay

python -m venv venv
source venv/bin/activate      # macOS/Linux
venv\Scripts\activate         # Windows

pip install -r requirements.txt
python app.py

Open http://localhost:5000 in your browser.

Option 2: Docker

git clone https://github.com/vinayakdasgupta/anvay.git
cd anvay

docker build -t anvay .
docker run -p 5000:5000 anvay

Open http://localhost:5000 in your browser.

Citation

If you use anvay in academic work, please cite:

Das Gupta, V., (2026). anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali. Journal of Open Source Software, 11(118), 8641, https://doi.org/10.21105/joss.08641

@article{Das Gupta2026,
  doi       = {10.21105/joss.08641},
  url       = {https://doi.org/10.21105/joss.08641},
  year      = {2026},
  publisher = {The Open Journal},
  volume    = {11},
  number    = {118},
  pages     = {8641},
  author    = {Das Gupta, Vinayak},
  title     = {anvay: A Web-based Tool for Interpretive Topic Modelling in Bengali},
  journal   = {Journal of Open Source Software}
}

Referenced Libraries and Datasets

Bengali lemmatisation dataset

@inproceedings{chakrabarty-etal-2017-context,
  title     = {Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks},
  author    = {Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
  pages     = {1481--1491},
  year      = {2017},
  publisher = {Association for Computational Linguistics},
  doi       = {10.18653/v1/P17-1136}
}

@article{alam2021review,
  title   = {A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author  = {Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal = {arXiv preprint arXiv:2107.03844},
  year    = {2021}
}

@article{islam2025banglalem,
  title     = {BanglaLem: A Transformer-based Bangla Lemmatizer with an Enhanced Dataset},
  author    = {Islam, Md Fuadul and Hasan, Jakir and Islam, Md Ashikul and Dewan, Prato and Rahman, M Shahidur},
  journal   = {Systems and Soft Computing},
  pages     = {200244},
  year      = {2025},
  publisher = {Elsevier}
}

Gensim

@inproceedings{rehurek_lrec,
  author    = {Řehůřek, Radim and Sojka, Petr},
  title     = {Software Framework for Topic Modelling with Large Corpora},
  booktitle = {Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks},
  pages     = {45--50},
  year      = {2010},
  publisher = {ELRA}
}

NLTK

@book{bird2009natural,
  author    = {Bird, Steven and Klein, Ewan and Loper, Edward},
  title     = {Natural Language Processing with Python},
  year      = {2009},
  publisher = {O'Reilly Media}
}

NetworkX

@inproceedings{hagberg2008exploring,
  author    = {Hagberg, Aric A. and Schult, Daniel A. and Swart, Pieter J.},
  title     = {Exploring Network Structure, Dynamics, and Function using NetworkX},
  booktitle = {Proceedings of the 7th Python in Science Conference},
  pages     = {11--15},
  year      = {2008}
}

Scikit-learn

@article{pedregosa2011scikit,
  author  = {Pedregosa, Fabian et al.},
  title   = {Scikit-learn: Machine Learning in Python},
  journal = {Journal of Machine Learning Research},
  volume  = {12},
  pages   = {2825--2830},
  year    = {2011}
}

Plotly

Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC. https://plot.ly

Acknowledgements

anvay draws on Gensim (topic modelling), NLTK (tokenisation and stopwords), simplemma (Hindi lemmatisation), Plotly, Seaborn, and Matplotlib (visualisation), NetworkX (topic-word graph), Scikit-learn (PCA and clustering), and Flask (web framework).

Licence

MIT Licence

Contact

Vinayak Das Gupta — Shiv Nadar University
https://vinayakdasgupta.com
For questions, suggestions, or collaborations, please open an issue on GitHub.

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
analysis		analysis
config		config
data		data
docs		docs
lda		lda
preprocessing		preprocessing
results		results
sample_corpus		sample_corpus
static		static
stopwords		stopwords
templates		templates
tests		tests
uploads		uploads
venv_refactor		venv_refactor
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
RELEASE_NOTES.md		RELEASE_NOTES.md
app.py		app.py
paper.bib		paper.bib
paper.md		paper.md
requirements.txt		requirements.txt
utils.py		utils.py
viz.py		viz.py

Folders and files

Latest commit

History

Repository files navigation

anvay

Language Support

Features

Preprocessing

Topic Modelling

Visualisations

Report

Documentation

What's New in v2.0

Multilingual expansion

Expanded Bengali stemming dictionary

Configurable normalisation order

Enhanced representative sentence display

Screenshots

Technical Stack

Installation

Option 1: Virtual environment

Option 2: Docker

Citation

Referenced Libraries and Datasets

Bengali lemmatisation dataset

Gensim

NLTK

NetworkX

Scikit-learn

Plotly

Acknowledgements

Licence

Contact

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages