Author: Kameldip Singh Basra · kameldipbasra@gmail.com
Current paper: paper_v5.md · DOI 10.5281/zenodo.20138182
Concept DOI (always resolves to latest): 10.5281/zenodo.18598229
I am a software engineer and AI architect, not a historian, linguist, or medical historian. My background is in machine learning, NLP, and computational systems. I came to the Voynich Manuscript the same way I approach any pattern-recognition problem: build a decoder, measure it against external corpora, stress-test it against hostile alternatives, and document everything so someone else can reproduce or refute it.
This repository is a live research log, not a polished academic paper uploaded after the fact. I started in February 2026 and have been publishing each stage of the work as it happened: hypothesis, test, revision, new finding, repeat. The five versions in this repo represent that evolution in real time. V1 introduced the core decipherment; V5 adds roughly three months of refinement, adversarial testing, and evidence accumulation on top of that. Each version is preserved unchanged so the progression is auditable.
What I used: Python for all analysis and testing, SQLite for the corpus database, and Anthropic Claude (Sonnet 4.6 / Claude Code) as a collaborative AI assistant throughout, for coding, statistical testing, cross-referencing medical texts, and drafting. I am naming this explicitly because it is part of the method and it would be dishonest not to. The AI did not generate the hypothesis or the evidence; it helped me execute tests quickly, catch errors in my reasoning, and work through large corpora I could not have processed manually.
What I am not claiming: I am not a Sinhala linguist. I am not a Sri Lankan medical historian. I am not a trained botanist. The paper is explicit about each of these gaps and identifies the specific specialists whose review it needs. The computational and structural evidence is solid; the linguistic and botanical interpretation layers need expert review before the identification should be treated as confirmed.
The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404–1438 CE) is a 15th-century Sri Lankan Elu-Sinhala pharmaceutical text — a working pharmacist's compressed reference recording Ayurvedic drug preparations in a bespoke phonetic script.
Confidence summary:
| Claim | Confidence |
|---|---|
| South Asian pharmaceutical text | ~97% |
| Sri Lankan provenance | ~90% |
| Sinhala/Elu specifically (vs Pali/Sanskrit sister) | ~90% |
| Pre-12c Elu chronolect | ~83% |
| Working-pharmacist register | ~98% |
| P(overall identification wrong) | ~7–10% |
- Rival-language tournament (V5): 27 control corpora tested across 11 tradition families. Sārārtha Saṃgrahaya (Sri Lankan Sinhala medical text) scores 66.67% repeated locked-anchor overlap. All Unani, Tamil/Siddha, and European controls score 0–15%. No tested rival tradition explains the same evidence bundle.
- Non-circular structure tests: Section classifier 61% accuracy vs 32.3% chance baseline (p=0.0099). KALPANA preparation-marker enrichment OR=8.11 (p=2×10⁻²²²). q-/ch- phonological allomorph distribution OR=32.81. These tests do not depend on any English glosses.
- Parallel-recipe template matches: f75r L38 decoded output `q-keda q-keda q-keda q-keda q-keda lada` is structurally identical to the Bodleian Sinhala recipe enumeration pattern (Bodleian MS Sinh.a.2(R)).
- Single-token chapter anchor: `pesaca` (piśāca = evil spirit) is the sole occurrence of that word in 36,633 corpus tokens and opens f113r line 1. It is the direct Elu reflex of the diagnostic term in AH Bhūta-pratishedha.
- External state-marker grounding: `leda` (disease) is attested 60× in Somadasa's Wellcome manuscript catalogue (33 distinct disease-stems); `seda` (fomentation) matches Caraka's 9 sveda compounds; 5 of 21 VPNS state-markers are independently grounded.
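Odds ratios such as the enrichment figures quoted above are conventionally computed from a 2×2 contingency table. As a minimal sketch of that arithmetic, the snippet below derives an odds ratio and a Wald 95% confidence interval using only the standard library. The counts are hypothetical placeholders chosen for illustration; they are not the counts behind the reported OR=8.11 or OR=32.81, and the repository's own scripts may use a different test.

```python
import math


def odds_ratio(a, b, c, d):
    """Return (OR, lower, upper) for the 2x2 table [[a, b], [c, d]].

    a: marker present, inside the section of interest
    b: marker absent, inside the section
    c: marker present, outside the section
    d: marker absent, outside the section
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.96  # two-sided 95% interval
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper


# Hypothetical counts, for illustration only (not taken from the corpus).
or_, lo, hi = odds_ratio(120, 880, 40, 2960)
print(f"OR={or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes 1 indicates enrichment under this approximation; for small cell counts an exact test is usually the safer choice.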
| Version | Date | DOI | What changed |
|---|---|---|---|
| V1 | Feb 2026 | — | Initial decipherment hypothesis, primary statistical tests |
| V2 | 2026-05-04 | 10.5281/zenodo.20023733 | V17 decoder, Bowern engagement, hostile-reviewer test, falsification probes |
| V3 | 2026-05-07 | 10.5281/zenodo.20072618 | Full corpus expansion, VPNS 21 states, 25 BM formula clusters, COSMO architecture |
| V4 | 2026-05-09 | 10.5281/zenodo.20098162 | V21 meaning corrections, 81-folio plant table, 23-chapter BM/AH mapping, blind botanical review, Team B suite |
| V5 | 2026-05-12 | 10.5281/zenodo.20138182 | 27-corpus rival tournament, BALNEO recharacterised, pharmacopoeia architecture, pesaca anchor, senna backbone, iron-eye cross-reference, canonical plant IDs |
All versions are preserved in this repository and on Zenodo. Read any version to see the state of the evidence at that point.
```
paper_v5.md                        ← current paper (1,135 lines)
paper_v4.md / v3.md / v2.md        ← preserved earlier versions
canonical_plant_test.py            ← DB-level plant ID verification (17 checks)
run_all.sh                         ← full validation gate (24 tests, all pass)
scripts/                           ← decoders, statistical tests, corpus analysis
translation/
    voynich_v20_corpus.db          ← canonical corpus DB (36,633 tokens, V17+V21)
supplementary/                     ← extended analysis writeups
    PHARMACOPOEIA_STRUCTURE_ANALYSIS.md
    CANONICAL_PLANT_IDENTIFICATIONS.md
    RECIPE_AH_MAPPING.md
    PAPER_ADDITION_NOTES_2026-05-12.md
    … (60+ files)
teamb/
    scripts/                       ← Team B validation scripts (16 tests)
    reports/                       ← audit reports, rival scorecard, recipe order audit
    outputs/                       ← v5_close_gap_package.json (evidence lockbox)
references/medical_corpus/         ← comparison corpora (Sarartha, BM, Caraka, AH, …)
results/                           ← validation outputs from run_all.sh
```
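Claims about the corpus DB, such as a token occurring exactly once, can be checked with a plain SQL count. The sketch below builds a tiny in-memory database so that it stays self-contained. The table name `tokens` and its columns `folio`, `line_no`, and `decoded` are hypothetical, and the rows are toy data, so adapt both to the real schema of `voynich_v20_corpus.db`.

```python
import sqlite3


def occurrences(conn, token):
    """Return (folio, line_no) for every row whose decoded form equals token."""
    cur = conn.execute(
        "SELECT folio, line_no FROM tokens "
        "WHERE decoded = ? ORDER BY folio, line_no",
        (token,),
    )
    return cur.fetchall()


# Toy in-memory stand-in for the corpus DB; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (folio TEXT, line_no INTEGER, decoded TEXT)")
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [
        ("f113r", 1, "pesaca"),
        ("f75r", 38, "q-keda"),
        ("f75r", 38, "q-keda"),
        ("f75r", 38, "lada"),
    ],
)

hits = occurrences(conn, "pesaca")
is_hapax = len(hits) == 1  # True when the token occurs exactly once
print(hits, is_hapax)
```

Against the real database, the same query would use `sqlite3.connect("translation/voynich_v20_corpus.db")` with whatever table and column names the schema actually defines.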
All analysis runs on Python 3.8+ with no unusual dependencies (sqlite3, scipy, numpy). Clone the repo and:
```bash
# Full validation gate (should produce: 24 passed, 0 failed)
./run_all.sh

# DB-level canonical plant ID checks
python3 canonical_plant_test.py

# Rival-language tournament (27 corpora)
python3 teamb/scripts/genre_anchor_control_tests.py

# Non-circular structure tests
python3 teamb/scripts/statistical_stress_tests.py
```

The frozen corpus DB is `translation/voynich_v20_corpus.db` (SHA256 in `teamb/outputs/v5_close_gap_package.json`). All test outputs are in `teamb/outputs/` and `results/`.
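A quick way to confirm that a local clone holds the frozen corpus is to hash the DB file and compare the result with the recorded value. The sketch below assumes the lockbox JSON stores the digest as a hex string at the top level; the key name `corpus_sha256` is a hypothetical placeholder, so check the actual file for the real key.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lockbox(db_path, lockbox_path, key="corpus_sha256"):
    """Compare the DB digest with the value stored under `key` (assumed name)."""
    recorded = json.loads(Path(lockbox_path).read_text(encoding="utf-8"))[key]
    return sha256_of(db_path) == recorded.strip().lower()


if __name__ == "__main__":
    db = Path("translation/voynich_v20_corpus.db")
    lockbox = Path("teamb/outputs/v5_close_gap_package.json")
    # Only runs the comparison when both files are present in the clone.
    if db.exists() and lockbox.exists():
        print("match" if matches_lockbox(db, lockbox) else "MISMATCH")
```

Reading the file in chunks keeps memory use flat regardless of the DB size.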
- No Sinhala/Elu specialist review yet. The linguistic interpretation needs a philologist. Outreach materials are in `supplementary/SPECIALIST_OUTREACH_PACKAGE.md`.
- No trained botanist review yet. Plant identifications are candidate-level. Dossier at `v15_work/BOTANIST_DOSSIER.md`.
- Sister-language question remains open. Pali and Sinhala/Elu are closely related; the corpus discriminates well statistically but specialist review is the definitive test.
- Three rival corpora not yet acquired. Arabic/Persian aqrabadhin prose, Tibetan Sowa Rigpa formulary text, and historical Tamil/Siddha prose are missing controls. A definitive rival-family closure claim is not made pending those.
- Plant section species IDs are candidates. One-picture-one-label species matching failed as a general model. Direct text-token IDs (where a decoded word names the plant) are strong; visual-only species IDs remain hypotheses.
```bibtex
@misc{basra2026voynich,
  title  = {A Candidate Decipherment of the Voynich Manuscript:
            Evidence for a Spoken Elu-Sinhala Pharmaceutical Register (V5)},
  author = {Basra, Kameldip Singh},
  year   = {2026},
  month  = {May},
  doi    = {10.5281/zenodo.20138182},
  url    = {https://doi.org/10.5281/zenodo.20138182},
  note   = {Concept DOI 10.5281/zenodo.18598229 always resolves to latest version}
}
```

Thanks to the Beinecke Rare Book and Manuscript Library for digital access to MS 408; to the EVA transcription community (Stolfi, Takahashi, and contributors) for the foundational transcription, without which none of this analysis would be possible; to Daniel Gaskell for open-sourcing the random-forest Voynich classifier used in §4.13; and to the Buddhist medical traditions of Sri Lanka, whose pharmacopoeial literature forms the comparative backbone of this work. Anthropic Claude (Sonnet 4.6 / Claude Code) was used throughout as a collaborative AI assistant for coding, statistical testing, corpus cross-referencing, and drafting; this is stated explicitly as a matter of transparency, not as a limitation.