34 changes: 34 additions & 0 deletions README.md
@@ -4,6 +4,10 @@

Note - NTTT will work on Windows, macOS and Linux.

## Documentation

For maintainers, [doc/transformations.md](doc/transformations.md) describes what NTTT changes in `meta.yml` and Markdown files (sections, HTML, formatting, URLs, and related behaviour).

## Prerequisites

The tool requires Python 3.7 or newer.
@@ -61,6 +65,14 @@ pip3 install . --upgrade

![install nttt](images/install_nttt.png)

You can also install NTTT with `pipx` (the instructions below are for macOS with Homebrew):

```bash
brew install pipx
pipx install /path/to/project/nttt
nttt --help
```

You can uninstall nttt using:

```bash
@@ -102,6 +114,28 @@ You can specify different directories for the input and output folder using the
nttt --input c:\path\to\project\de-DE --output c:\path\to\project\de-DE-tidy
```

### Crowdin marker stripping and restoring

NTTT has three processing modes:

- `tidy` (default): restore stripped Markdown markers for non-English locale folders, then run the existing tidy-up transforms.
- `strip`: remove non-translatable Markdown markers before uploading English source files to Crowdin.
- `restore`: reinsert stripped Markdown markers into translated files after downloading from Crowdin.

Use `strip` on the English source folder before Crowdin upload:

```bash
nttt --mode strip -i en -o en -Y on
```

Use `restore` on a translated locale folder after Crowdin download:

```bash
nttt --mode restore -i de-DE -e en -o de-DE -Y on
```

Modern bare markers such as `> [!TASK]` are removed entirely, along with their paired empty `>` line. Modern labelled markers such as `> [!ACCORDION] Where are my voice recordings stored?` keep the label available for translation by becoming `> Where are my voice recordings stored?`; restore reinserts `[!ACCORDION]` before the translated label. Legacy markers such as `--- task ---` and `--- /task ---` are also removed and restored by line alignment against `en/`.

### Help

To bring up full usage information use the `-h`/`--help` option.
134 changes: 134 additions & 0 deletions doc/transformations.md
@@ -0,0 +1,134 @@
# NTTT: transformations reference

This document describes what **Nina's Translation Tidy-up Tool (NTTT)** changes on disk, so maintainers know what to expect and where to look in code.

## Scope

- **Inputs:** Files under the chosen **input** directory. The tool collects every `meta.yml` and every `*.md` (see `find_files` in [`nttt/utilities.py`](../nttt/utilities.py)).
- **English reference:** A parallel tree (default: `INPUT/../en`) used for `meta.yml` sync and optional section-tag revert.
- **Outputs:** Corresponding paths under the **output** directory (created as needed). After processing, **missing** files/folders can be copied from input and English (`add_missing_entries`).

NTTT does **not** process standalone `.html` files. HTML-related steps run on **HTML inside Markdown**.

---

## High-level pipeline (`fix_md_step`)

For each `.md` file, [`nttt/tidyup.py`](../nttt/tidyup.py) applies, in order:

1. **`restore_tree`** — for non-English locale folders, restore Markdown markers stripped before Crowdin upload.
2. **`fix_sections`** — normalise `---` section lines (Crowdin quirks).
3. **`revert_section_translation`** — optional; restore English section tag lines when structure matches.
4. **`trim_md_tags`** — strip padding inside paired Markdown delimiters (outside ` ``` ` fences).
5. **`trim_html_tags`** — strip padding inside simple inline HTML tags (outside single `` ` `` spans).
6. **`trim_formatting_tags`** — normalise `{ … }` attribute blocks after a word (Scratch/Pico-style).
7. **URL rewrite:** replace `/en/` with `/<language>/` everywhere in the file body.

Steps 1–5 can be skipped with **`-D`/`--Disable`** (the long option is spelled with a capital `D`; see [`nttt/arguments.py`](../nttt/arguments.py)).

`meta.yml` is handled separately by **`fix_meta`** (YAML round-trip, revert non-translatable keys from English). This doc focuses on Markdown/HTML-style transforms.

---

## Crowdin marker strip/restore (`nttt/strip.py`, `nttt/restore.py`)

**Modes:** `--mode strip`, `--mode restore`, and default `--mode tidy`.

| Mode | Behaviour |
|------|-----------|
| `strip` | Runs on `en/` before Crowdin upload. Removes structural-only markers and keeps labelled marker text translatable. |
| `restore` | Runs on a locale folder after Crowdin download. Rebuilds markers from the matching English file. |
| `tidy` | For non-English locale folders, runs restore first, then the existing tidy transforms. |

**Marker classification (`nttt/markers.py`):**

| Kind | Pattern | Strip output | Restore output |
|------|---------|--------------|----------------|
| Modern bare | `> [!TASK]`, `> [!SAVE]`, nested forms like `> > [!HINT]` | Dropped. A following empty blockquote line (`>`, `> >`) is also dropped. | Copied back from `en/`. |
| Modern labelled | `> [!ACCORDION] Where are my voice recordings stored?` | Rewritten to `> Where are my voice recordings stored?`. | Rewritten to `> [!ACCORDION] <translated label>`. |
| Legacy bare | `--- task ---`, `--- /task ---`, `--- print-only ---`, `--- feedback ---` | Dropped. | Copied back from `en/`. |
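
The strip column of the table above can be sketched as follows. This is a minimal illustration, not the code in `nttt/strip.py`: the regular expressions are copied from `nttt/markers.py`, while `strip_markers_sketch` is a hypothetical helper that ignores fenced code blocks and end-of-line handling.

```python
import re

# Patterns copied from nttt/markers.py.
RFM_BARE = re.compile(r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$')
RFM_LABELLED = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$')
LEGACY_BARE = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_BLOCKQUOTE = re.compile(r'^\s*(?:>\s*)+$')


def strip_markers_sketch(lines):
    """Hypothetical helper: drop bare markers, keep labels translatable."""
    stripped = []
    drop_next_empty_quote = False
    for line in lines:
        # An empty blockquote line directly after a modern bare marker is dropped too.
        if drop_next_empty_quote and EMPTY_BLOCKQUOTE.match(line):
            drop_next_empty_quote = False
            continue
        drop_next_empty_quote = False
        labelled = RFM_LABELLED.match(line)
        if labelled:
            # Keep the blockquote prefix and the label, drop the [!TAG] part.
            stripped.append(labelled.group("prefix") + labelled.group("label"))
        elif RFM_BARE.match(line):
            drop_next_empty_quote = True
        elif not LEGACY_BARE.match(line):
            stripped.append(line)
    return stripped


english = [
    "--- task ---",
    "> [!TASK]",
    ">",
    "> Open the editor.",
    "> [!ACCORDION] Where are my voice recordings stored?",
    "--- /task ---",
]
print(strip_markers_sketch(english))
```

Only the two translatable lines survive: the plain blockquote line and the accordion label without its `[!ACCORDION]` tag.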

Restore uses line-index alignment against the stripped English file. If the translated file has a different number of lines from the stripped English reference, NTTT logs a warning and leaves that file unchanged for this step.
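
One way to picture the alignment is to walk the English file and consume one translated line for every English line that survives stripping. The sketch below makes that assumption explicit; `restore_markers_sketch` is a hypothetical helper, not the function in `nttt/restore.py`, and it returns the translation unchanged on a mismatch instead of logging a warning.

```python
import re

# Patterns copied from nttt/markers.py; QUOTE_PREFIX mirrors FENCE_LINE_PREFIX_PATTERN.
RFM_BARE = re.compile(r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$')
RFM_LABELLED = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$')
LEGACY_BARE = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_BLOCKQUOTE = re.compile(r'^\s*(?:>\s*)+$')
QUOTE_PREFIX = re.compile(r'^\s*(?:>\s*)*')


def restore_markers_sketch(english_lines, translated_lines):
    """Hypothetical helper: rebuild markers by walking the English file."""
    restored = []
    remaining = list(translated_lines)
    after_bare_marker = False
    for line in english_lines:
        labelled = RFM_LABELLED.match(line)
        is_bare = bool(RFM_BARE.match(line))
        if is_bare or LEGACY_BARE.match(line) or (
                after_bare_marker and EMPTY_BLOCKQUOTE.match(line)):
            restored.append(line)          # structural line: copy back from en/
        elif not remaining:
            return list(translated_lines)  # too few translated lines: leave unchanged
        else:
            translated = remaining.pop(0)
            if labelled:
                # Reinsert [!TAG] between the blockquote prefix and the translated label.
                prefix = QUOTE_PREFIX.match(translated).group(0)
                translated = "{}[!{}] {}".format(
                    prefix, labelled.group("tag"), translated[len(prefix):])
            restored.append(translated)
        after_bare_marker = is_bare
    if remaining:
        return list(translated_lines)      # too many translated lines: leave unchanged
    return restored


english = [
    "--- task ---",
    "> [!TASK]",
    ">",
    "> Open the editor.",
    "> [!ACCORDION] Where are my voice recordings stored?",
    "--- /task ---",
]
translated = ["> Editor oeffnen.", "> Wo werden meine Sprachaufnahmen gespeichert?"]
print(restore_markers_sketch(english, translated))
```

A translator who adds or removes a line breaks the alignment, which is why the mismatch case leaves the file untouched rather than guessing.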

Lines inside fenced code blocks (delimited by ` ``` `) are never stripped.
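
The fence tracking can be illustrated with a condensed copy of `iter_lines_with_fence_state` from `nttt/markers.py`. The function names below are shortened for the example, and the fence string is built at runtime only so that the snippet can itself sit inside a fenced block.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime
QUOTE_PREFIX = re.compile(r'^\s*(?:>\s*)*')
SAME_LINE_FENCE = re.compile("^" + FENCE + "[^`]*" + FENCE + "$")


def count_fence_markers(line):
    """Return 0, 1 or 2 fence markers for a line (blockquote prefix ignored)."""
    body = line.rstrip("\r\n")
    body = body[QUOTE_PREFIX.match(body).end():]
    if not body.startswith(FENCE):
        return 0
    return 2 if SAME_LINE_FENCE.match(body) else 1


def lines_with_fence_state(content):
    """Yield (line, inside_fence); the state flips after an odd fence count."""
    inside = False
    for line in content.splitlines(keepends=True):
        yield line, inside
        if count_fence_markers(line) % 2 == 1:
            inside = not inside


sample = "\n".join([
    "> [!TASK]",          # real marker
    FENCE + "markdown",   # opening fence
    "> [!TASK]",          # looks like a marker, but is code
    FENCE,                # closing fence
    "> [!SAVE]",          # real marker
]) + "\n"
states = [inside for _, inside in lines_with_fence_state(sample)]
print(states)
```

Note that the state is reported before the toggle, so an opening fence line is reported as outside and a closing fence line as inside.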

## 1. Section markers (`nttt/cleanup_sections.py`)

**Function:** `fix_sections`

| Behaviour | Purpose |
|-----------|---------|
| Replace `\---` with `---` | Crowdin sometimes escapes section markers. |
| Normalise `--` / `---` wrappers around section names | Fix missing dash or inconsistent spacing; target form **`--- <tag> ---`**. Tags allow word chars, digits, hyphens, and certain Unicode space characters inside the name. |
| Normalise closing sections | **`--- /tag ---`** — removes extra spaces between `/` and the tag name. |
| Split jammed section lines | Restore newline between adjacent **`--- … ---`** lines when Crowdin merges them (e.g. hints/hint); regex also tolerates some translator edits. |
| Repair broken collapse/title blocks | Restore **`--- collapse ---`** plus YAML-style **`title:`** block when Crowdin breaks the structure; colons may be ASCII (`:`) or full-width (`：`). |
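
The first three rows of the table can be approximated in a few lines. This is a simplified sketch under stated assumptions: `fix_sections_sketch` and its regex are hypothetical, handle ASCII spaces only, and do not attempt the line-splitting or collapse/title repairs.

```python
import re

# Hypothetical pattern: two or three dashes, optional "/", a tag, two or three dashes.
SECTION_LINE = re.compile(
    r'^[ \t]*-{2,3}[ \t]*(/?)[ \t]*(\w(?:[\w-]*\w)?)[ \t]*-{2,3}[ \t]*$',
    re.MULTILINE)


def fix_sections_sketch(text):
    """Unescape section markers and normalise them to '--- tag ---'."""
    text = text.replace("\\---", "---")
    return SECTION_LINE.sub(
        lambda m: "--- {}{} ---".format(m.group(1), m.group(2)), text)


broken = "\\--- task ---\n-- / hint --\nText -- not a tag\n"
print(fix_sections_sketch(broken))
```

Lines that do not start with dashes, such as the last one in the example, are left alone.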

**Function:** `revert_section_translation` (requires English `.md`)

- Collects lines matching **`--- <anything> ---`** in translation and English.
- If **counts match**, replaces each translated section line with the **English** line at the same index (keeps English tag names, e.g. `task` vs translated word).
- If counts differ, logs a **warning** to stderr and leaves the file unchanged for this step.
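
The count-and-replace logic above can be sketched as follows. The helper name and the exact section regex are assumptions for illustration; only the behaviour (replace by index when counts match, otherwise warn and leave unchanged) comes from the description.

```python
import re
import sys

SECTION_LINE = re.compile(r'^--- .+ ---$')  # assumed shape of a section line


def revert_section_translation_sketch(translated, english):
    """Replace translated section lines with the English ones at the same index."""
    translated_lines = translated.split("\n")
    english_sections = [l for l in english.split("\n") if SECTION_LINE.match(l)]
    positions = [i for i, l in enumerate(translated_lines) if SECTION_LINE.match(l)]
    if len(positions) != len(english_sections):
        print("warning: section counts differ, leaving file unchanged", file=sys.stderr)
        return translated
    for position, english_line in zip(positions, english_sections):
        translated_lines[position] = english_line
    return "\n".join(translated_lines)


english = "--- task ---\nDo it.\n--- /task ---\n"
translated = "--- Aufgabe ---\nMach es.\n--- /Aufgabe ---\n"
print(revert_section_translation_sketch(translated, english))
```

Because the replacement is purely positional, it is only safe when the translation keeps the same number of section lines in the same order.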

---

## 2. Markdown delimiters (`nttt/cleanup_markdown.py`)

**Function:** `trim_md_tags`

- Splits content on **` ``` `** (triple backtick). **`apply_to_every_other_part`** runs trimming only on segments **outside** fenced blocks (indices 0, 2, 4, …); fence interiors are untouched.
- Per line outside fences:
  - **List lines:** if the line contains an odd number of `*` and starts with `*` after `lstrip`, only the substring **after the first `*`** is trimmed (this preserves the bullet marker).
  - Otherwise the **whole line** is trimmed.
- **Trim rule:** a regex finds content wrapped in paired delimiters (**`` ` ``**, **`_` … `___`**, or **`*` … `***`**); the inner content is passed through **`.strip()`** and the delimiters are left unchanged.

Logging can record each replacement (`log_replacement`).
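
A rough sketch of the fence splitting and the trim rule is shown below. The regex is a deliberately simplified stand-in for the real one in `nttt/cleanup_markdown.py` (it refuses to cross another delimiter character), and the inline loop stands in for `apply_to_every_other_part`; all names are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime
# Simplified: a delimiter, content without delimiter characters, the same delimiter.
PAIRED = re.compile(r'(?P<delim>`|_{1,3}|\*{1,3})(?P<inner>[^`_*]+?)(?P=delim)')


def trim_line(line):
    def strip_inner(match):
        delim = match.group("delim")
        return delim + match.group("inner").strip() + delim
    return PAIRED.sub(strip_inner, line)


def trim_md_tags_sketch(text):
    parts = text.split(FENCE)
    for index in range(0, len(parts), 2):  # even indices lie outside fences
        lines = []
        for line in parts[index].split("\n"):
            bullet_at = line.find("*")
            if line.lstrip().startswith("*") and line.count("*") % 2 == 1:
                # List line: keep the bullet, trim only what follows it.
                lines.append(line[:bullet_at + 1] + trim_line(line[bullet_at + 1:]))
            else:
                lines.append(trim_line(line))
        parts[index] = "\n".join(lines)
    return FENCE.join(parts)


source = ("Press ** Enter ** now\n* item with ` code `\n"
          + FENCE + "\n** keep **\n" + FENCE + "\n")
print(trim_md_tags_sketch(source))
```

The padded text inside the fence is left as written, while the emphasis and inline code outside it lose their inner padding.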

---

## 3. Inline HTML (`nttt/cleanup_html.py`)

**Function:** `trim_html_tags`

- Splits on **single** `` ` ``. Only **even-index** segments are processed; **inline code** segments are preserved.
- Matches **paired** tags: `<tagName>…</tagName>` where `tagName` is **word characters + digits only** (no hyphenated custom elements in the pattern). Inner HTML is **`.strip()`**.
- Does **not** handle attributes on the opening tag, self-closing tags, or arbitrary XML namespaces.
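
The same "every other segment" idea applies here, this time split on single backticks. The pattern below is an assumed approximation of the one in `nttt/cleanup_html.py`, and `trim_html_tags_sketch` is a hypothetical name.

```python
import re

# Assumed approximation: <tag>inner</tag> with no attributes on the opening tag.
PAIRED_TAG = re.compile(r'<(?P<tag>\w+)>(?P<inner>.*?)</(?P=tag)>')


def trim_html_tags_sketch(text):
    parts = text.split("`")
    for index in range(0, len(parts), 2):  # even indices lie outside inline code
        parts[index] = PAIRED_TAG.sub(
            lambda m: "<{0}>{1}</{0}>".format(m.group("tag"), m.group("inner").strip()),
            parts[index])
    return "`".join(parts)


print(trim_html_tags_sketch("Press <kbd> Return </kbd> or type `<kbd> x </kbd>`."))
```

The first `<kbd>` pair is trimmed; the one inside the inline code span is preserved verbatim.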

---

## 4. Formatting braces (`nttt/cleanup_formatting.py`)

**Function:** `trim_formatting_tags`

- Single-pass regex over the **entire** file (no code-fence splitting).
- Targets patterns like **`word { … key = "value" … }`** with flexible Unicode spaces, colons, and quotes (see [`nttt/constants.py`](../nttt/constants.py) `RegexConstants`).
- **Lowercases** the attribute name and value.
- Normalises "blank" link targets: values matching **`_` + spaces + `blank`** → **`_blank`**.
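
The lowercasing and `_blank` normalisation can be illustrated with a much narrower pattern than the real one. Everything in this sketch is an assumption: the regexes handle ASCII spaces, `=` and straight double quotes only, the output spacing is a guess, and `RegexConstants` in `nttt/constants.py` remains the authoritative definition.

```python
import re

BRACE_BLOCK = re.compile(r'\{(?P<body>[^{}\n]*)\}')
ATTRIBUTE = re.compile(r'(?P<key>\w+)\s*=\s*"(?P<value>[^"]*)"')
SPLIT_BLANK = re.compile(r'^_\s+blank$')


def _normalise_attribute(match):
    key = match.group("key").lower()
    value = match.group("value").strip().lower()
    if SPLIT_BLANK.match(value):
        value = "_blank"  # "_ blank" -> "_blank"
    return '{}="{}"'.format(key, value)


def trim_formatting_tags_sketch(text):
    def fix_block(match):
        body = match.group("body")
        if not ATTRIBUTE.search(body):
            return match.group(0)  # not an attribute block: leave untouched
        return "{" + ATTRIBUTE.sub(_normalise_attribute, body).strip() + "}"
    return BRACE_BLOCK.sub(fix_block, text)


print(trim_formatting_tags_sketch('Open the link { Target = "_ Blank" } now'))
```

Because the pass is not fence-aware, a brace block inside a code sample would be rewritten as well, which matches the "entire file" note above.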

---

## 5. Locale URLs (`nttt/tidyup.py`)

After cleanup: **replace every `/en/` with `/<language>/`** in the Markdown file (`language` from resolved CLI args, defaulting from input folder basename).
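
As described, this step is a plain string replacement. The helper name below is hypothetical; the point of the sketch is that the replacement is not URL-aware, so any `/en/` in the text is affected.

```python
def localise_urls_sketch(content, language):
    """Replace every '/en/' path segment with the target language code."""
    return content.replace("/en/", "/{}/".format(language))


print(localise_urls_sketch("[guide](https://example.org/en/guide)", "de-DE"))
print(localise_urls_sketch("Use and/en/or as a test", "de-DE"))
```

The second call shows the side effect: text that merely contains `/en/` is rewritten too.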

---

## Operational notes

- **Confirmation:** Unless **`-Y on`** is passed, the tool lists files and waits for **`y`** before writing.
- **Volunteer acknowledgements / missing files:** Separate from Markdown transforms; see `add_volunteer_acknowledgement` and `add_missing_entries` in [`nttt/tidyup.py`](../nttt/tidyup.py).
- **Logging:** Several modules accept a `logging` object for replacement traces (`nttt_logging`).

---

## Quick code map

| Concern | Module |
|---------|--------|
| Orchestration | `nttt/tidyup.py`, `nttt/__init__.py` |
| CLI / disable flags | `nttt/arguments.py` |
| Sections | `nttt/cleanup_sections.py` |
| Markdown emphasis / code delimiters | `nttt/cleanup_markdown.py` |
| Inline HTML | `nttt/cleanup_html.py` |
| Brace attributes | `nttt/cleanup_formatting.py` |
| Split "every other segment" | `nttt/utilities.py` → `apply_to_every_other_part` |
16 changes: 15 additions & 1 deletion nttt/__init__.py
@@ -1,4 +1,7 @@
from .arguments import parse_command_line, resolve_arguments, check_arguments, show_arguments
from .constants import ArgumentKeyConstants, Modes
from .restore import restore_tree
from .strip import strip_tree
from .tidyup import tidyup_translations
from ._version import __version__

@@ -7,4 +10,15 @@ def main():
    resolved_arguments = resolve_arguments(command_line_args)
    show_arguments(resolved_arguments)
    if (check_arguments(resolved_arguments)):
        tidyup_translations(resolved_arguments)
        mode = resolved_arguments[ArgumentKeyConstants.MODE]
        if mode == Modes.STRIP:
            strip_tree(
                resolved_arguments[ArgumentKeyConstants.INPUT],
                resolved_arguments[ArgumentKeyConstants.OUTPUT])
        elif mode == Modes.RESTORE:
            restore_tree(
                resolved_arguments[ArgumentKeyConstants.INPUT],
                resolved_arguments[ArgumentKeyConstants.ENGLISH],
                resolved_arguments[ArgumentKeyConstants.OUTPUT])
        else:
            tidyup_translations(resolved_arguments)
13 changes: 12 additions & 1 deletion nttt/arguments.py
@@ -1,4 +1,4 @@
from .constants import ArgumentKeyConstants
from .constants import ArgumentKeyConstants, Modes
import os
from pathlib import Path
from argparse import ArgumentParser
@@ -51,6 +51,11 @@ def parse_command_line(version):
    parser.add_argument("-l", "--language", help="The language of the content to be tidied up, defaults to basename(INPUT).")
    parser.add_argument("-v", "--volunteers", help="The list of volunteers as a comma separated list, defaults to an empty list.")
    parser.add_argument("-f", "--final", help="The number of the final step file, defaults to the step file with the highest number.")
    parser.add_argument("-m", "--mode", choices=[Modes.TIDY, Modes.STRIP, Modes.RESTORE],
                        help="The processing mode. Options are: tidy (default cleanup), "
                             "strip (remove non-translatable structural markers before Crowdin upload), "
                             "restore (restore stripped structural markers after Crowdin download). "
                             "Default is tidy.")
    parser.add_argument("-D", "--Disable", help="The risky features to be disabled, separated by commas. "
                                                "Options are: fix_md (fix common markdown-related issues), "
                                                "fix_html (fix common issues in HTML-like tags (<kbd>Return</kbd>)), "
@@ -120,6 +125,11 @@ def resolve_arguments(command_line_args):
    else:
        arguments[ArgumentKeyConstants.YES] = "off"

    if hasattr(command_line_args, "mode") and command_line_args.mode:
        arguments[ArgumentKeyConstants.MODE] = command_line_args.mode
    else:
        arguments[ArgumentKeyConstants.MODE] = Modes.TIDY

    return arguments


@@ -138,6 +148,7 @@ def show_arguments(arguments):
    print("Disabled functions - '{}'".format(arguments[ArgumentKeyConstants.DISABLE]))
    print("Logging - '{}'".format(arguments[ArgumentKeyConstants.LOGGING]))
    print("Yes - '{}'".format(arguments[ArgumentKeyConstants.YES]))
    print("Mode - '{}'".format(arguments[ArgumentKeyConstants.MODE]))


def check_folder(folder):
7 changes: 7 additions & 0 deletions nttt/constants.py
@@ -17,6 +17,13 @@ class ArgumentKeyConstants:
    DISABLE = 'DISABLE'
    LOGGING = 'LOGGING'
    YES = 'YES'
    MODE = 'MODE'


class Modes:
    TIDY = "tidy"
    STRIP = "strip"
    RESTORE = "restore"


class RegexConstants:
96 changes: 96 additions & 0 deletions nttt/markers.py
@@ -0,0 +1,96 @@
import re


LINE_KIND_BARE_MARKER = "bare"
LINE_KIND_LABELLED_MARKER = "labelled"
LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE = "paired_empty_blockquote"
LINE_KIND_REGULAR = "regular"


RFM_BARE_MARKER_PATTERN = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$'
)

RFM_LABELLED_MARKER_PATTERN = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$'
)

LEGACY_BARE_MARKER_PATTERN = re.compile(
    r'^\s*---\s+/?[\w-]+\s+---\s*$'
)

EMPTY_BLOCKQUOTE_PATTERN = re.compile(r'^\s*(?:>\s*)+$')
FENCE_LINE_PREFIX_PATTERN = re.compile(r'^\s*(?:>\s*)*')
SAME_LINE_FENCE_PATTERN = re.compile(r'^```[^`]*```$')


def remove_eol(line):
    return line.rstrip("\r\n")


def get_eol(line):
    if line.endswith("\r\n"):
        return "\r\n"
    if line.endswith("\n"):
        return "\n"
    if line.endswith("\r"):
        return "\r"
    return ""


def classify_line(line):
    line_without_eol = remove_eol(line)

    match = RFM_LABELLED_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_LABELLED_MARKER, match

    match = RFM_BARE_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_BARE_MARKER, match

    match = LEGACY_BARE_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_BARE_MARKER, match

    match = EMPTY_BLOCKQUOTE_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE, match

    return LINE_KIND_REGULAR, None


def is_marker_line(line):
    line_kind, _ = classify_line(line)
    return line_kind in (LINE_KIND_BARE_MARKER, LINE_KIND_LABELLED_MARKER)


def is_rfm_bare_marker_line(line):
    return RFM_BARE_MARKER_PATTERN.match(remove_eol(line)) is not None


def is_paired_empty_blockquote(line):
    line_kind, _ = classify_line(line)
    return line_kind == LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE


def iter_lines_with_fence_state(content):
    inside_fenced_code = False

    for line in content.splitlines(keepends=True):
        yield line, inside_fenced_code
        if _count_fence_markers(line) % 2 == 1:
            inside_fenced_code = not inside_fenced_code


def _count_fence_markers(line):
    content = remove_eol(line)
    content_without_prefix = content[FENCE_LINE_PREFIX_PATTERN.match(content).end():]

    if not content_without_prefix.startswith("```"):
        return 0

    if SAME_LINE_FENCE_PATTERN.match(content_without_prefix):
        return 2

    return 1