Modern Multimodal Corpus Research Software
Meta-Lingo is a comprehensive desktop application designed for corpus linguistics research. Built with modern technologies (Electron + React + Python FastAPI), it provides powerful tools for multimodal corpus management, linguistic analysis, and annotation.
- Multimodal Support: Text, audio, and video files with drag-and-drop upload
- Audio Transcription: Whisper Large V3 Turbo with word-level timestamps
- Forced Alignment: Wav2Vec2 word-level alignment for English audio (automatic)
- Pitch Extraction: TorchCrepe F0 extraction for English audio (automatic)
- Video Analysis: YOLOv8 object detection and CLIP semantic classification
- Automatic Annotation: SpaCy NLP (POS/NER/Dependency), USAS semantic domains, MIPVU metaphor identification
- Metadata Management: Language, author, source, text type with tag system
| Module | Description |
|---|---|
| Word Frequency | Frequency analysis with POS filtering, lemma/word form selection, visualization |
| N-gram Analysis | 2-6 gram support, nested grouping, Sankey diagrams |
| Keyword Extraction | TF-IDF, TextRank, YAKE!, RAKE, and 9 keyness statistics methods |
| Collocation | KWIC search with 6 modes, CQL query language, CQL Builder |
| Synonym Analysis | WordNet integration with network visualization |
| Semantic Domain | USAS-based analysis with dual view (by domain/by word) |
| Sentiment Analysis | NRC-EmoLex polarity + emotion dimension analysis (pie/radar) |
| Metaphor Analysis | MIPVU-based detection; 3-step pipeline (word filter → rules → Clause model); source color-coding by POS |
| Word Sketch | Grammar pattern analysis (50 relations), logDice scoring, difference comparison |
| Topic Modeling | BERTopic, LDA, LSA, NMF with dynamic topic analysis |
| Bibliography | Refworks parsing (WOS/CNKI), shadow corpus for abstracts, network visualization, burst detection; analysis modules support corpus/literature toggle and library selection (all / by keyword / manual). |
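
The Word Sketch row above mentions logDice scoring. For orientation, the widely used logDice definition is 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the two words and f_x, f_y are their individual frequencies. The snippet below is a minimal sketch of that formula; the function name and arguments are illustrative assumptions and are not taken from Meta-Lingo's source.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of the node word and the collocate
    f_x, f_y: individual frequencies of the two words
    (hypothetical helper for illustration only)
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

The score has a theoretical maximum of 14, reached when the two words only ever occur together, and it does not depend on corpus size, which makes scores comparable across corpora.
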
- Text Annotation: Sentence-level display, intelligent segmentation, batch annotation
- Multimodal Annotation: Video frame tracking, DAW-style timeline, YOLO overlay
- Audio Waveform Annotation: Wavesurfer.js waveform visualization with word alignment, pitch curve overlay, box drawing annotation (English audio only)
- Framework Management: 49 preset frameworks (SFL, UAM, etc.), custom framework support
- Inter-coder Reliability: Fleiss' Kappa, Cohen's Kappa, Krippendorff's Alpha, Gold Standard support (plain text archives only)
- Syntax Visualization: Constituency and dependency parsing
- Dictionary Lookup: Macmillan, Longman Collocations with fuzzy search
- Bilingual Interface: Chinese and English with real-time switching
- Custom Wallpaper: Personalized application background
- Export Options: CSV, PNG, SVG for all visualizations
- AI Assistant (optional): context-aware assistant for analysis modules (Ollama or OpenAI-compatible API)
- MCP Server Integration (optional): enable an MCP server so external AI assistants (e.g., Claude Desktop/Cursor) can call Meta-Lingo tools directly
- OpenAI-Compatible API Support (optional): configure compatible endpoints/keys for AI-assisted features and better model naming
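
Of the inter-coder reliability measures listed above, Cohen's Kappa is the simplest to illustrate: it compares the observed agreement between two coders with the agreement expected by chance. The sketch below uses the textbook definition; the function name and the example labels are hypothetical, and it is not a description of how Meta-Lingo computes the statistic internally.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Proportion of items on which the coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label distribution.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: "M" = metaphor, "L" = literal.
kappa = cohens_kappa(["M", "M", "L", "L"], ["M", "L", "L", "L"])
```

In this example the coders agree on three of four items (0.75) and chance agreement is 0.5, so kappa is 0.5.
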
+----------------------------------------------------------+
| Meta-Lingo |
+----------------------------------------------------------+
| Frontend (Electron + React + TypeScript) |
| - Material-UI components |
| - Zustand state management |
| - D3.js / Plotly.js visualizations |
| - i18next internationalization |
+----------------------------------------------------------+
| HTTP REST API |
+----------------------------------------------------------+
| Backend (Python FastAPI) |
| - SpaCy NLP processing |
| - USAS semantic tagging (PyMUSAS) |
| - MIPVU metaphor detection (DeBERTa) |
| - BERTopic / LDA / LSA / NMF topic modeling |
| - Whisper / YOLO / CLIP multimodal analysis |
+----------------------------------------------------------+
| Data Storage |
| - SQLite database (metadata) |
| - File system (corpora, annotations) |
+----------------------------------------------------------+
| Technology | Purpose |
|---|---|
| Electron 28+ | Desktop application framework |
| React 18 | UI framework |
| TypeScript 5 | Type safety |
| Material-UI 5 | Component library |
| D3.js 7 | Data visualization |
| Plotly.js | Interactive charts |
| Technology | Purpose |
|---|---|
| Python 3.12 | Runtime environment |
| FastAPI | Web framework |
| SpaCy 3.8+ | NLP processing |
| PyMUSAS | Semantic tagging |
| BERTopic | Topic modeling |
| Transformers | Whisper/CLIP models |
| Ultralytics | YOLOv8 |
Visit our official website to download the latest version:
https://tltanium.github.io/meta-lingo-website/
Source code in this repository is provided for reference and academic verification only. Please use the official distribution above to run Meta-Lingo.
After installing from the website, launch the application and follow the in-app guidance. For documentation, use the Help module inside the application.
- In-app Help: Access via the Help module with bilingual documentation
- API Documentation: http://localhost:8000/docs (when backend is running)
| Category | Endpoints |
|---|---|
| Corpus | /api/corpus/* - CRUD, upload, annotation |
| Analysis | /api/analysis/* - Word frequency, N-gram, keywords, etc. |
| Collocation | /api/collocation/* - KWIC search, CQL parsing |
| Topic Modeling | /api/topic-modeling/* - BERTopic, LDA, LSA, NMF |
| Annotation | /api/annotation/*, /api/framework/* |
| Word Sketch | /api/sketch/* - Grammar patterns, difference |
| Bibliography | /api/biblio/* - Libraries, visualization |
Full API documentation is available at the /docs endpoint while the backend is running.
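
To explore the endpoint categories programmatically, one option is to read the OpenAPI schema that FastAPI serves alongside its interactive docs. The sketch below assumes the backend keeps FastAPI's default schema location (`/openapi.json`) and the `http://localhost:8000` address given above; both function names are hypothetical helpers, not part of Meta-Lingo.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address from this README; backend must be running


def fetch_openapi(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json)."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
        return json.load(resp)


def group_paths(schema: dict) -> dict[str, list[str]]:
    """Group route paths by the segment that follows /api/."""
    groups: dict[str, list[str]] = {}
    for path in sorted(schema.get("paths", {})):
        parts = path.strip("/").split("/")
        # Routes outside /api/ (for example /docs) go into a catch-all bucket.
        key = parts[1] if len(parts) > 1 and parts[0] == "api" else "(other)"
        groups.setdefault(key, []).append(path)
    return groups
```

With the backend running, `group_paths(fetch_openapi())` returns a dictionary keyed by category (such as `corpus` or `sketch`), which can be compared against the table above.
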
Meta-Lingo integrates several pre-trained models:
| Model | Purpose | Source |
|---|---|---|
| Whisper Large V3 Turbo | Audio transcription | OpenAI |
| Wav2Vec2-base-960h | Forced alignment (English) | ModelScope — facebook/wav2vec2-base-960h |
| TorchCrepe Full | Pitch extraction (F0) | maxrmorrison/torchcrepe |
| YOLOv8 | Object detection | Ultralytics |
| CLIP ViT-Large-Patch14 | Image classification | OpenAI |
| SpaCy en/zh_core_web_sm | NLP processing (no static word vectors) | Explosion |
| DeBERTa-v3-large-clause-metaphor | MIPVU metaphor detection (F1 75.83) | tommyleo2077 |
| Sentence-BERT | Text embeddings | sentence-transformers |
This project is currently maintained for academic research purposes. For bug reports or feature requests, please open an issue.
Recent releases below mirror PROJECT.md (abbreviated). For the full version history, see PROJECT.md at the repository root or the Git commit log.
- Wav2Vec2 (multimodal alignment): ModelScope download id `facebook/wav2vec2-base-960h` in `model_manifest_constants.py`; docs/help/README updated. Model Management dialog does not show an extra ModelScope link line (to avoid redundancy). See `PROJECT.md`.
- Video transcript auto-annotation (MIPVU): `corpus.py` video upload path now passes SpaCy token `start`/`end` into MIPVU merge (parity with audio). Re-run MIPVU or re-upload to refresh old transcripts.
- BERTopic dynamic / topics over time: Embeddings now save `{id}_docs.json` so chunk texts with newlines cannot desync document rows from vectors (fixes missing evolution chart when dates exist). Recreate embeddings if load fails; visualization tab also keys off `topics_over_time` data.
- Help (Corpus SpaCy table): In `help/zh.md` and `help/en.md`, the language/model table no longer includes a “common ISO / aliases” column; it keeps only UI language name, corpus language code, and SpaCy package name.
- Help (Corpus SpaCy): Corpus Management section adds tables for supported languages vs. SpaCy packages (11 languages, `SPACY_MODEL_MAP`) and for annotation output; `src/pages/CorpusManagement/mldoc.md` links to the full help tables.
- SpaCy EN/ZH (lg → sm): Defaults switched from `en_core_web_lg`/`zh_core_web_lg` to `en_core_web_sm`/`zh_core_web_sm`; updates across `spacy_service`, `backend.spec`, BERTopic PartOfSpeech UI, preprocess, Benepar, `build.sh`/`build.bat`, help text, etc.
- Corpus Management (text type display): Fixes the text-type dropdown briefly showing codes (e.g. `GEN`) before `/api/usas/text-types` loads; adds `usasTextTypeLabel` and `corpus.textTypeCodes` i18n fallbacks.
- Corpus resource dialog (refresh robustness): Fixes parsing when `api.get()` returns `{ success, data }`; increases timeout for requests with `refresh=1` so refresh does not fail when rebuilds exceed ~60s.
- Corpus resource dialog (refresh fix): Corrects frontend parsing for `/api/corpus-resource/list` and `/tags` so the list reloads reliably after Refresh.
- Keyness (default reference corpus): Default reference corpus changed from `OANC` to `AmE06` (full corpus: `ame06_total`).
- Corpus resource cache (startup): Persistent cache for `corpus_resource_service`; rebuild only when Refresh is clicked, avoiding heavy CSV work on every app launch.
- Corpus resource dialog (dialog cache): Refresh button added; unless it is clicked, the dialog uses cached data to avoid rebuilding the resource list every time it opens.
- Keyness (NOW card): `NOW` (News on the Web) country breakdown time range set to `2010–2024`; help examples completed for COCA/COHA/GloWbE/Coronavirus/iWeb/TV/Movies/SOAP/Wikipedia.
- Corpus resource intro (NOW): `NOW` description updated to `2010–2024`; resource name and tags unchanged.
- Corpus resource colors: Distinct color per corpus prefix; fixes TV/SOAP vs. Brown being too similar.
- Metaphor Analysis (Clause-only pipeline): Removed HiTZ model entirely. All tokens now annotated by a single `deberta-v3-large-clause-metaphor` model using full-sentence context (max_length=192). 3-step pipeline: word-form filter → SpaCy rule filter → Clause model. Function words (IN/DT/RB/RP) keep orange tag (finetuned); other words use green tag (clause). Legacy `hitz` source in existing annotations treated as `clause` (green). Help docs updated with Clause model accuracy (Precision 78.08%, Recall 73.69%, F1 75.83; DT F1 90.87, IN F1 87.87).
- Sentiment Analysis (USAS mode): Search panel adds "USAS Semantic Domain" mode; results aggregate sentiment scores by domain code with full domain name tooltip; word cloud uses domain names; CSV export adds `domain_name` column.
- Bibliography Visualization: PDF export rewritten via Electron IPC (`printToPDF`) to fix blank-page issue on large documents. Paper column with PDF upload and first-page thumbnail. 11 AI-generated fields per entry (research goal, questions, design, conclusions, mechanism, contribution, limitations, value, dialogue, future work, summary). Batch AI generation for multiple entries. Column visibility control. Export to styled PDF report.
- Sentiment Analysis (NRC): Full NRC-EmoLex annotation added to corpus pipeline after MIPVU. New analysis page with polarity (pie chart + word cloud) and emotion dimensions (radar chart + word cloud). Result table cross-links to collocation/word sketch/N-gram/semantic domain. Backend: `nrc_service.py`, `sentiment_analysis_service.py`, `POST /api/analysis/sentiment`.
- Cross-module links default to case-insensitive search. Collocation wordlist search mode (multi-word input, one per line).
- Bibliography: Bulk delete for selected entries. Relevance rating (0–5 stars), tags, and notes columns added to entry table and detail dialog. CSV export.
- Metaphor Analysis: Added Clause model (`deberta-v3-large-clause-metaphor`) to MIPVU pipeline for function-word annotation. POS-group statistics (IN/DT/RB/RP/OTHER metaphor rates) shown in results table header.
- Cross-module corpus selection sync across all analysis modules. Topic modeling bibliography mode with publication year for dynamic analysis.
- AI Assistant: Robot icon in all analysis modules' left panel (requires Ollama or OpenAI-compatible API); sends current page state as context. OpenAI-compatible API support in Settings (address / key / model). Cross-module library-mode link sync fixes.
- Semantic domain analysis: CQL cross-link, word cloud, domain name display. Collocation network expand on click, MinSense fix, Word Sketch Difference word-form/lemma mode. Topic modeling: N-gram preprocessing mode, LDA/LSA/NMF dynamic topic analysis.
- Praat acoustic analysis: Spectrogram, formants (F1–F5), intensity, HNR, jitter, shimmer. Chinese audio full visualization support. Corpus building script (`saves/corpus/corpus_building.py`) for 13 English corpora.
- Ridge plot SVG/PNG full export, CQL top-level OR operator and template auto-fill. Collocation search mode (lemma/word form). Result table search fix across all modules. Unified UI spacing and labeling. Cross-module N-gram link.
- Audio waveform annotation (Wavesurfer.js + TorchCrepe pitch + box drawing). Full annotation pipeline for audio/video transcripts. Inter-coder reliability gold standard fix. CQL distance selector fix.
- LLM topic naming (Ollama). USAS annotation modes (rule / neural / hybrid). Stopword removal (20+ languages). Custom wallpaper. Keyword extraction enhancements. Theme/Rheme auto-annotation. Dark theme for all topic modeling visualizations.
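
The BERTopic entry in the release notes above says that chunk texts containing newlines could desync document rows from their vectors, and that saving the texts as JSON avoids this. The sketch below illustrates the general problem only. It assumes, purely for contrast, a one-document-per-line text format; the release notes do not state the earlier format, and all function names here are hypothetical.

```python
import json


def save_docs_lines(docs: list[str]) -> str:
    """Line-per-document storage (assumed format, for contrast only)."""
    return "\n".join(docs)


def load_docs_lines(blob: str) -> list[str]:
    return blob.split("\n")


def save_docs_json(docs: list[str]) -> str:
    """JSON array storage: embedded newlines are escaped inside each string."""
    return json.dumps(docs, ensure_ascii=False)


def load_docs_json(blob: str) -> list[str]:
    return json.loads(blob)


docs = ["first chunk", "second chunk\nwith a line break", "third chunk"]

# The embedded newline splits one document into two rows, so row i no
# longer corresponds to embedding vector i.
rows_from_lines = load_docs_lines(save_docs_lines(docs))

# The JSON round trip returns exactly the original documents.
rows_from_json = load_docs_json(save_docs_json(docs))
```

Here `rows_from_lines` has four rows for three documents, while `rows_from_json` equals the original list, so document indices stay aligned with the stored vectors.
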
Meta-Lingo Software License (Non-Commercial)
Meta-Lingo is an independently developed corpus research software by Tommy Leo, protected under the Copyright Law of the People's Republic of China.
This software is licensed only for:
- Personal learning
- Academic research
- Non-commercial corpus analysis and linguistic research
Commercial use is prohibited without written permission.
See LICENSE_CN.txt (Chinese) or LICENSE_EN.txt (English) for full terms.
Copyright 2026 Tommy Leo. All rights reserved.
