This project is designed so new programming languages can be added mainly by updating data files, not parser/codegen logic.
Language onboarding follows a controlled-language policy: add deterministic, testable surface forms only. See cnl_scope.md.
Enable a new language code (for example xx) across:
- lexing and parsing (keyword recognition)
- semantic analysis error reporting
- runtime builtins and execution
- REPL command/help localization
File: multilingualprogramming/resources/usm/keywords.json
- Add the new code to `languages`.
- For every concept in every category, add a translation key for the new language.
Important:
- All concepts must have a translation to keep validation complete.
- Prefer unique tokens per language to avoid ambiguity.
- Keep tokens identifier-safe (letters/underscores, no spaces).
Why this is enough:
- `KeywordRegistry` loads this file dynamically.
- `Lexer` recognizes keywords through `KeywordRegistry`.
- `Parser` consumes concept tokens, so syntax support follows automatically.
- `RuntimeBuiltins` maps builtins from concept IDs, so execution picks up new language keywords automatically.
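To make the "all concepts must have a translation" rule concrete, here is a small completeness check. The catalog shape below (categories mapping concept IDs to per-language tokens) and the sample entries are assumptions for illustration; the real keywords.json schema may differ.

```python
# Hypothetical keywords.json-style catalog: category -> concept -> per-language tokens.
KEYWORDS = {
    "control_flow": {
        "LOOP_FOR": {"en": "for", "es": "para", "xx": "boucle"},
        "COND_IF": {"en": "if", "es": "si", "xx": "si_x"},
    },
}

def missing_translations(catalog, lang):
    """Return concept IDs that lack a token for `lang`."""
    return [
        concept
        for category in catalog.values()
        for concept, tokens in category.items()
        if lang not in tokens
    ]

print(missing_translations(KEYWORDS, "xx"))  # [] when the language pack is complete
```

A check like this is what keeps validation green after adding a language: any concept missing the new code shows up immediately.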
File: multilingualprogramming/resources/parser/error_messages.json
For each message key under `messages`, add the new language translation (same placeholders, e.g. `{token}`, `{line}`).
Why:
- `ErrorMessageRegistry.format()` reads this file dynamically, and the parser/semantic analyzer use it for diagnostics.
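A minimal sketch of placeholder-preserving formatting with an English fallback. The message key, translations, and function below are invented for illustration; the real `ErrorMessageRegistry` API may differ.

```python
# Hypothetical messages table: key -> per-language templates sharing placeholders.
MESSAGES = {
    "unexpected_token": {
        "en": "Unexpected token '{token}' at line {line}",
        "xx": "Jeton inattendu '{token}' a la ligne {line}",  # same {token}/{line} placeholders
    },
}

def format_message(key, lang, **fields):
    translations = MESSAGES[key]
    template = translations.get(lang, translations["en"])  # fall back to English
    return template.format(**fields)

print(format_message("unexpected_token", "xx", token="si_x", line=3))
# Jeton inattendu 'si_x' a la ligne 3
```

Keeping placeholders identical across languages is what makes the fallback safe: the same field set formats every translation.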
File: multilingualprogramming/resources/repl/commands.json
Update:
- `help_titles` for the language.
- `messages` keys (`keywords_title`, `symbols_title`, `unsupported_language`).
- `commands.<name>.aliases` for command words.
- `commands.<name>.descriptions` for help text.
Why:
- REPL command parsing/help is fully catalog-driven from this JSON.
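To illustrate what catalog-driven means here, this toy resolver maps any alias from an invented commands.json fragment back to one canonical command name. The command names and aliases are placeholders, not the project's real catalog.

```python
# Hypothetical commands.json fragment: canonical name -> per-language alias lists.
COMMANDS = {
    "keywords": {"aliases": {"en": [":keywords"], "xx": [":motscles"]}},
    "quit": {"aliases": {"en": [":quit", ":q"], "xx": [":quitter"]}},
}

def resolve_command(word):
    """Return the canonical command name for any alias, or None."""
    for name, spec in COMMANDS.items():
        for alias_list in spec["aliases"].values():
            if word in alias_list:
                return name
    return None

print(resolve_command(":motscles"))  # keywords -- localized alias resolves
```

Because resolution walks the catalog, adding a language never touches REPL code: new aliases become valid command words as soon as the JSON lists them.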
File: multilingualprogramming/resources/usm/operators.json
Add the new language under `description` where available.
Why:
- REPL `:symbols` uses these descriptions when present; otherwise it falls back to English.
File: multilingualprogramming/resources/usm/builtins_aliases.json
Add localized aliases for selected universal builtins (for example `range`, `len`, `sum`).
The universal English built-in name remains available; aliases are additive.
Why:
- `RuntimeBuiltins` loads this file dynamically.
- Users can write either universal names or localized aliases in programs/REPL.
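A sketch of the additive-alias idea: the alias table rewrites a localized name to its universal one, so both resolve to the same callable. The alias names below are invented, and the real `RuntimeBuiltins` loading logic may differ.

```python
# Universal builtins keep their English names; aliases are purely additive.
UNIVERSAL = {"len": len, "sum": sum, "range": range}
ALIASES = {"xx": {"longueur": "len", "somme": "sum"}}  # hypothetical aliases

def resolve_builtin(name, lang):
    name = ALIASES.get(lang, {}).get(name, name)  # alias -> universal, if any
    return UNIVERSAL.get(name)

assert resolve_builtin("longueur", "xx") is len  # localized alias works
assert resolve_builtin("len", "xx") is len       # universal name still works
```

The key property is that the alias lookup falls through to the original name, so adding aliases can never shadow or remove a universal builtin.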
File: multilingualprogramming/resources/usm/surface_patterns.json
Use this file when keyword translation alone is not enough for natural phrasing. Rules are declarative and normalize alternate surface token order into canonical concept order before parser grammar runs.
Validation is enforced at load time by `validate_surface_patterns_config` (`multilingualprogramming/parser/surface_normalizer.py`), including:
- rule/template schema shape
- language support checks
- exactly one of `normalize_to` / `normalize_template`
- slot-reference consistency between `pattern` captures and output rewrite
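Two of these checks can be sketched as follows. This is a simplified illustration; the real `validate_surface_patterns_config` enforces more (schema shape, language support, template references) and may differ in detail.

```python
def check_rule(rule):
    """Toy version of two load-time checks: exactly-one output form,
    and no output slot that was never captured in the pattern."""
    has_to = "normalize_to" in rule
    has_tpl = "normalize_template" in rule
    if has_to == has_tpl:  # both present, or neither
        raise ValueError("exactly one of normalize_to/normalize_template required")
    captured = {step["slot"] for step in rule["pattern"] if "slot" in step}
    if has_to:
        used = {step["slot"] for step in rule["normalize_to"] if "slot" in step}
        if not used <= captured:
            raise ValueError(f"output uses uncaptured slots: {used - captured}")

good = {
    "pattern": [{"kind": "expr", "slot": "iterable"}],
    "normalize_to": [{"kind": "expr_slot", "slot": "iterable"}],
}
check_rule(good)  # passes silently
```

Failing fast at load time with `ValueError` keeps broken rules out of the normalizer entirely, which matches the project's "schema-valid or rejected" approach.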
Typical use:
- iterable-first `for` headers
- language-specific particles around loop/condition clauses
- alternate phrase forms that still map to one core AST
Keep rules narrow and test-backed. Prefer additive normalization over parser forks.
- `Lexer` tokenizes source and resolves known keywords to concepts.
- `SurfaceNormalizer` matches token-level surface rules.
- Matched rules rewrite tokens into canonical concept order.
- `Parser` consumes the rewritten tokens with the existing grammar.
Important: surface patterns do not replace lexing. They operate on lexer output.
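As a toy illustration of token-level rewriting on lexer output: the function below matches one hardcoded surface shape and emits canonical concept order. Real tokens carry types and positions, and expression slots may span multiple tokens, so this is only a mental model.

```python
def normalize(tokens):
    """Rewrite one iterable-first surface form into canonical concept order.
    Surface shape: <expr> 内の 各 <ident> に対して :"""
    if len(tokens) == 6 and tokens[1:3] == ["内の", "各"] and tokens[4] == "に対して":
        iterable, target = tokens[0], tokens[3]
        return ["LOOP_FOR", target, "IN", iterable, ":"]
    return tokens  # unmatched input passes through unchanged

print(normalize(["範囲(4)", "内の", "各", "i", "に対して", ":"]))
# ['LOOP_FOR', 'i', 'IN', '範囲(4)', ':']
```

Note the pass-through default: normalization only rewrites sequences it recognizes, so unrelated statements reach the parser untouched.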
surface_patterns.json has two top-level sections:
- `templates`: reusable canonical rewrites
- `patterns`: language-scoped matching rules
Each pattern must include:
- `name`
- `language`
- `pattern` (what to match)
- exactly one of:
  - `normalize_template` (reference a template)
  - `normalize_to` (inline rewrite)
Allowed pattern kinds:
- `expr`: capture an expression span into a slot (for example `iterable`)
- `identifier`: capture one identifier token into a slot (for example `target`)
- `keyword`: require a specific concept token (for example `LOOP_FOR`)
- `delimiter`: require a delimiter token (for example `:`)
- `literal`: require a literal token value (for particles like `内の`, `ضمن`)
Allowed output (normalize_to/template) kinds:
- `keyword`: emit a concept keyword token in the target language
- `delimiter`: emit a delimiter token
- `identifier_slot`: emit the captured identifier slot
- `expr_slot`: emit the captured expression slot
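One way to picture the matching side of these kinds, assuming tokens are simple `(type, value)` pairs. That token shape is an assumption for illustration; the project's real token objects differ, and `expr` spans need multi-token handling that is omitted here.

```python
def matches(step, token):
    """Check one single-token pattern step against one (type, value) token."""
    ttype, value = token
    kind = step["kind"]
    if kind == "keyword":
        return ttype == "KEYWORD" and value == step["concept"]
    if kind == "identifier":
        return ttype == "IDENT"
    if kind == "delimiter":
        return ttype == "DELIM" and value == step["value"]
    if kind == "literal":
        return value == step["value"]
    return False  # "expr" captures a span, not a single token

assert matches({"kind": "keyword", "concept": "LOOP_FOR"}, ("KEYWORD", "LOOP_FOR"))
assert not matches({"kind": "delimiter", "value": ":"}, ("DELIM", ","))
```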
Use a template when multiple languages share one canonical rewrite target.
{
"templates": {
"for_iterable_first": [
{ "kind": "keyword", "concept": "LOOP_FOR" },
{ "kind": "identifier_slot", "slot": "target" },
{ "kind": "keyword", "concept": "IN" },
{ "kind": "expr_slot", "slot": "iterable" },
{ "kind": "delimiter", "value": ":" }
]
},
"patterns": [
{
"name": "ja_for_iterable_first",
"language": "ja",
"normalize_template": "for_iterable_first",
"pattern": [
{ "kind": "expr", "slot": "iterable" },
{ "kind": "literal", "value": "内の" },
{ "kind": "literal", "value": "各" },
{ "kind": "identifier", "slot": "target" },
{ "kind": "literal", "value": "に対して" },
{ "kind": "delimiter", "value": ":" }
]
}
]
}

Surface input:
範囲(4) 内の 各 i に対して:
パス
Normalized parse shape:
毎 i 中 範囲(4):
パス
Use inline output for a one-off rule that is not worth templating.
{
"name": "xx_for_custom",
"language": "xx",
"pattern": [
{ "kind": "expr", "slot": "iterable" },
{ "kind": "literal", "value": "particleA" },
{ "kind": "identifier", "slot": "target" },
{ "kind": "literal", "value": "particleB" },
{ "kind": "delimiter", "value": ":" }
],
"normalize_to": [
{ "kind": "keyword", "concept": "LOOP_FOR" },
{ "kind": "identifier_slot", "slot": "target" },
{ "kind": "keyword", "concept": "IN" },
{ "kind": "expr_slot", "slot": "iterable" },
{ "kind": "delimiter", "value": ":" }
]
}

- Write 2-3 real source examples from native speakers.
- Tokenize with lexer tests to confirm surface particles are tokenized as expected.
- Add the narrowest possible `pattern` that matches those forms.
- Rewrite to one canonical concept order via template or inline output.
- Add parser + executor tests before adding more variants.
- Repeat with additional rules rather than broad/fragile mega-rules.
- Capturing a slot in output that was never captured in `pattern`.
- Defining both `normalize_to` and `normalize_template` in one rule.
- Using an unsupported language code in `language`.
- Overly broad `expr` patterns that unintentionally match unrelated lines.
- Trying to encode full natural-language grammar in one rule.
If a surface form does not parse:
- Confirm lexer tokenization first (`tests/lexer_test.py` patterns are good references).
- Add a parser unit test for just the failing statement.
- Check that slot names are consistent (`target` vs `iterator`, etc.).
- Confirm the template name exists and is spelled exactly.
- Ensure the final normalized sequence is compatible with the existing parser grammar.
Minimum recommended tests:
`tests/keyword_registry_test.py`
- language appears in `get_supported_languages()`
- concept lookups for representative keywords

`tests/executor_test.py`
- one end-to-end program using new language keywords (`ProgramExecutor`)
`tests/error_messages_test.py`
- new language included in "all messages have all languages" coverage

`tests/runtime_builtins_test.py`
- localized aliases map to the expected Python built-ins

`tests/surface_normalizer_test.py` (when adding surface rules)
- config stays schema-valid
- invalid rule shapes fail with `ValueError`
`tests/parser_test.py` + `tests/executor_test.py` (when adding surface rules)
- parser accepts new surface form
- end-to-end execution still works
This validates lexer -> parser -> semantic -> codegen/runtime in one path.
At minimum:
- `README.md` supported languages list
- `docs/reference.md` supported languages list
- link this onboarding guide where relevant
python -m pytest -q
python -m pylint $(git ls-files '*.py')

For focused checks while iterating:

python -m pytest -q tests/keyword_registry_test.py tests/error_messages_test.py tests/executor_test.py tests/repl_test.py

Surface-pattern focused checks:

python -m pytest -q tests/surface_normalizer_test.py tests/parser_test.py tests/executor_test.py

Language-pack smoke checks:

python -m multilingualprogramming smoke --lang xx
python -m multilingualprogramming smoke --all

Use this template when opening a PR for a new language pack:
docs/templates/language_pack_checklist.md