Skip to content

Proj/language selection#749

Open
danhdoan wants to merge 25 commits into
mainfrom
proj/language-selection
Open

Proj/language selection#749
danhdoan wants to merge 25 commits into
mainfrom
proj/language-selection

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

@danhdoan danhdoan commented Jun 1, 2026

Summary

Brings the language-selection feature stack to main and resolves the divergence between this branch and main's recently-landed tool-mode response-file surface (ENG-3002) and <img> first-class inline content (ENG-3021).

After this lands:

  • Curate kickoff and correction prompts emit a language clause — either "Match the user's input language…" (auto, the default) or "…in <language>" (fixed). The calling agent's LLM is told which language to write the topic body in.
  • The preference is stored in global daemon settings (language.mode / language.code) since ENG-2974. A per-project .brv/config.json language field is still honored as a fallback for users who configured before the migration.
  • The MCP curate tool description carries the auto clause so the same instruction reaches non-CLI callers.

The recently-merged main features (--response-file /--delete-response-file and <img> inline support) are preserved
unchanged — the language thread composes on top, not in place of.

What's included

Feature work (history-only — these landed on the branch already):

  • ENG-2687 — BrvConfig.language schema + language-clause builder
  • ENG-2688 — Threaded language through kickoffSession and continueSession; auto clause appended to MCP TOOL_DESCRIPTION
  • ENG-2689 — CJK bigram tokenizer wired into MiniSearch (INDEX_SCHEMA_VERSION bumped to 7)
  • ENG-2690 — brv query helper + cross-language curate→query roundtrip coverage in the auto-test harness
  • ENG-2691 — brv config set/get language.* CLI (now intercepted and redirected to brv settings set per ENG-2974)
  • ENG-2974 — Language preference moved from per-project config to global daemon settings; legacy brv config set language.* kept as a deprecation gate that names the new command

Merge-resolution work (new in this PR):

  • src/agent/infra/tools/implementations/search-knowledge-service.ts — UNION of ENG-3021's parsed.imageContent indexer wiring (from main) and ENG-2689's tokenizeWithCjk in MINISEARCH_OPTIONS (from this branch). INDEX_SCHEMA_VERSION happened to be at 7 on both sides — coincidence, both changes needed an index invalidation.
  • src/oclif/commands/curate/index.ts — Kept main's richer handleContinuation body (mutex checks + loadResponseFile + local pre-dispatch validation + dispatchContinuation helper + post-success --delete-response-file cleanup) and grafted the language thread on top:
    • resolveLanguagePreference(projectRoot) is called once after
      projectRoot is resolved
    • language is passed into dispatchContinuation(...)
    • dispatchContinuation's args widened to accept
      language?: BrvConfigLanguage and forwarded into
      continueSession({...})
      Imports merged as a clean disjoint union — BrvConfigLanguage,
      ProjectConfigStore, SettingsEvents, SETTINGS_KEYS from this
      branch + CurateSessionEnvelope and the full curate-session
      helper set from main.
  • Lib-layer signatures (curate-session.ts) already accept
    language? on both KickoffOptions and ContinueOptions, so no
    lib changes were required.

Harness migration (local-auto-test/language-selection-e2e):

  • Migrated 11 cases from the deprecated brv config set language.*
    to brv settings set language.* + settings get verification.
    Settings are GLOBAL since ENG-2974, so each migrated case wraps
    its mutation in try/finally with resetLanguageSettings() to
    keep cases isolated.
  • Case 6 was repurposed from the obsolete cross-field "fixed-
    without-code" validation (no longer applicable — language.mode
    and language.code are independent settings keys) into a
    deprecation-gate test that confirms brv config set language.*
    rejects with code: 'deprecated-key' and names the new
    brv settings set ... command.
  • Pre-suite reset added in restartDaemon so a prior aborted run
    cannot leak global state into the next run.
  • README updated to reflect the new surface and renumbered to 22
    cases.

Conflict-resolution summary (for future archaeology)

File Strategy
search-knowledge-service.ts UNION — both contributions preserved verbatim
curate/index.ts (imports) Disjoint UNION
curate/index.ts (handleContinuation) Kept main's body, grafted language thread via dispatchContinuation widening
curate-session.ts No change required — already a superset

Test plan

  • npm run lint — 0 errors (280 pre-existing warnings on unrelated test files, none on merged files)
  • npm test — 9006 / 9006 passing
  • local-auto-test/curate-tool-mode-e2e — 20 / 20 (all --response-file / --delete-response-file cases + all language-clause cases pass)
  • local-auto-test/language-selection-e2e — 22 / 22 after the brv settings set migration
  • Manual: brv settings set/get language.mode|code round-trips; brv config set language.* deprecation message names the new brv settings set command

Reviewer call-outs

  • The dispatchContinuation widening (added language?: BrvConfigLanguage to its args type) is the only behavioural change to main's response-file surface — all other main-side contracts are byte-equivalent. Worth a focused read on src/oclif/commands/curate/index.ts.
  • The harness migration is intentionally scoped to language settings only. The tool-mode harness's language smoke cases pass without migration because they use the .brv/config.json project-config fallback tier that resolveLanguagePreference still honors.
  • INDEX_SCHEMA_VERSION was bumped to 7 by both branches independently (CJK tokenizer on this branch, <img> indexer on main). Existing cached indexes were already invalidated by whichever side landed first; no additional bump needed.

Out of scope

  • Sunset of the project-config language field (the resolveLanguagePreference fallback). Tracked separately — needs raw-overrides exposure on the settings transport so mode: 'auto' from daemon can short-circuit the fallback cleanly.
  • Sunset of brv config set/get entirely — the surface still hosts the language.* deprecation gate and the applyConfigSet dispatcher for the next project-config key that needs to land.

danhdoan and others added 24 commits May 26, 2026 17:10
… language-clause module

Scaffolding for the language-preservation feature (issue #616). Adds
`BrvConfig.language` and a shared clause module; no surface is wired yet.
Commits 02–05 do the actual injection (tool-mode prompts, tokenizer,
validation, CLI).

- `BrvConfig.language?: { mode: 'auto' | 'fixed'; code? }` round-trips
  through toJson/fromJson and every with* method. `mode: 'fixed'` without
  `code` is rejected at load so silent fallback to English is structurally
  impossible.
- `language-clause.ts` exports `buildLanguageClause(language?)` and an
  inline `LANGUAGE_NAMES` ISO-639-1 → English map (~24 entries, no
  iso-639-1 dependency). Auto / fixed-known / fixed-unknown all return a
  clause that carves out tag names, attribute names, enum values, and
  `path` from translation so Zod validation can't fire on a localized
  schema key.

Tests: 13 new BrvConfig tests + 14 new language-clause tests. All
existing tests pass; clause module has zero src/ consumers (commit 02).
…tion

[ENG-2687] Language selection 1/5 — foundation (BrvConfig.language + language-clause module)
…rection prompts + MCP tool description

Wires the language-preservation clause from ENG-2687 into the three
tool-mode prompt surfaces a calling agent's LLM sees during curate. After
this commit a tool-mode user under `language: { mode: 'fixed', code: 'ru' }`
gets `<bv-*>` body text authored in Russian; schema (tag names, attribute
names, enum values, `path`) stays English because the clause carves it out.

Closes #616 for Cyrillic / Vietnamese / European Latin users. CJK users
need commit 03 (tokenizer fix) before queries find matches.

Injection points:

- `buildGeneratePrompt(options)` and `buildCorrectionPrompt(options)` —
  add `language?: BrvConfigLanguage` to options; emit a `# Language`
  section. In the kickoff: between Path format and Element vocabulary
  (part of the byterover-controlled framing that commits BEFORE the
  user-intent block). In correction: between Output contract and
  Errors-to-fix (reasserts the contract on every retry).
- `brv-curate-tool.ts:TOOL_DESCRIPTION` — append the auto clause
  unconditionally. The MCP tool description is built once at server-boot
  and cannot read live config; per-call fixed-mode is honored via the
  oclif kickoff prompt (which IS dynamic).

Threading:

- `kickoffSession` / `continueSession` in curate-session.ts now accept
  `language?` and pass it to the builders.
- The oclif `brv curate` command loads `BrvConfig.language` via
  `ProjectConfigStore.read()` (with a try/catch that degrades a corrupt
  config to `undefined` → auto, never blocking curate) and threads it
  into both kickoff and continuation.
- `language` is re-read on every continuation so a mid-session
  `brv config set language.code <iso>` is honored on the next retry.

Tests (10 new):

- 4 cases for buildGeneratePrompt (section ordering, auto-default,
  auto-mode, fixed-RU-emits-Russian-and-not-auto).
- 3 cases for buildCorrectionPrompt (section ordering, auto-default,
  fixed-RU).
- 2 integration cases on kickoffSession (auto-default thread, fixed-RU
  thread) confirming end-to-end orchestrator → builder.
- 1 case on TOOL_DESCRIPTION self-containment asserting the auto clause
  + schema-key carve-out are present.

175/175 tests green across the affected surfaces. Typecheck + lint clean;
the pre-existing `max-params` warning on `registerBrvCurateTool` is
unchanged (signature was already 5 params).
…tion

[ENG-2688] Language selection 2/5 — tool-mode injection (kickoff + correction + MCP tool description)
…e / Korean queries

Fixes a confirmed CJK-search bug in `search-knowledge-service.ts`'s
BM25 index. MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}`
(whitespace + punctuation). CJK scripts have no whitespace, so
`'认证系统使用JWT令牌'` tokenizes as a single token and a query for
`'认证'` returns zero matches against indexed content.

Empirical confirmation pre-fix:

  const ms = new MiniSearch({fields: ['t'], idField: 'id'})
  ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}])
  ms.search('认证')          // → [] — broken
  ms.search('Привет мир')  // → matches as expected

Without this fix, language preservation from ENG-2687 / ENG-2688 is
invisible to CJK users: their curate output is correctly authored in
Chinese / Japanese / Korean but their queries return zero matches.

Approach:

- New `cjk-tokenizer.ts` module. Algorithm:
  1. Split on Unicode whitespace + punctuation (same as MiniSearch default).
  2. For each token, split at CJK ↔ non-CJK script boundaries.
  3. Non-CJK segments emit as-is (Latin / Cyrillic / Vietnamese / European
     behave byte-identical to the default).
  4. CJK segments emit overlapping bigrams; single-char fallback to unigram.
- CJK ranges: U+4E00–9FFF (Unified Ideographs), U+3040–309F (Hiragana),
  U+30A0–30FF (Katakana), U+AC00–D7AF (Hangul Syllables).
- Bigrams are the standard CJK IR compromise — unigrams are too noisy
  (common chars like `的` dominate scoring), trigrams too sparse.
- Wired via the top-level `tokenize` option on `MINISEARCH_OPTIONS`. Per
  the MiniSearch source (line 1564-1566), that single option applies at
  both index and query time when `searchOptions.tokenize` is unset.
- No new runtime deps — `Intl.Segmenter` is available in Node 16+ but its
  CJK quality varies by ICU version; inline bigrams are deterministic
  across Node versions / platforms.

INDEX_SCHEMA_VERSION bumped 6 → 7 so cached indexes built with the
default tokenizer invalidate and rebuild on first daemon start.

Tests (18 new):

- 4 cases for non-CJK scripts (English, Russian, Vietnamese, punctuation)
  asserting byte-identical behavior to the MiniSearch default.
- 5 cases for CJK scripts (Chinese 4-char bigrams, Chinese 2-char single
  bigram, Japanese kanji+kana, Korean Hangul, single-char unigram fallback).
- 3 cases for mixed Latin+CJK tokens (whitespace-separated, no-whitespace
  split, multiple alternating runs).
- 6 MiniSearch integration cases — the actual production wiring contract:
  Chinese / Japanese / Korean queries return matches; Russian regression;
  English regression; English query does NOT match CJK content
  (cross-script isolation).

251/251 across affected surfaces (18 new tokenizer + 58 existing
search-knowledge + 175 prior PRs from ENG-2687/2688). Typecheck + lint
clean; 9 pre-existing warnings on unrelated functions unchanged.
[ENG-2689] Language selection 3/5 — CJK-aware MiniSearch tokenizer
… integration

Validates the language-selection feature (ENG-2687 → ENG-2689) end-to-end
across the four target non-English scripts (Russian / Vietnamese /
Chinese / Japanese) plus the auto-mode default.

Adds `test/integration/scenarios/language-roundtrip.test.ts` (8 cases) —
walks the full pipeline a real user hits: `.brv/config.json` on disk →
`ProjectConfigStore.read()` → `BrvConfig.language` → `kickoffSession()` →
the kickoff prompt envelope the calling agent's LLM consumes. Proves no
layer in the threading regresses for any target language.

Coverage:
- Auto mode (default, no language field) → auto clause present
- Auto mode (explicit `mode: auto`) → auto clause present
- Fixed-RU → "in Russian" in the prompt; auto clause absent
- Fixed-VI → "in Vietnamese"
- Fixed-ZH → "in Chinese"
- Fixed-JA → "in Japanese"
- Schema rejection at load: fixed mode without code throws via fromJson
- Forward-compat: unmapped code (`xx`) → "in \"xx\"" fallback

Out of scope and documented in the validation report (research repo,
features/language-selection/validation/04-validation.md): LLM-honoring
of the clause requires a real consumer (Claude Code, Cursor) and is
verified manually pre-release. The auto-test harness in local-auto-test
also exercises curate + query roundtrip across all four scripts via
real `brv` CLI invocations — 13/13 cases green; covered in the report.

Test counts after this commit: 326 total green across the feature
(54 BrvConfig + 175 prompt/tool + 18 tokenizer + 58 search-knowledge
regression + 8 new language-roundtrip + 13 auto-test harness).
[ENG-2690] Language selection 4/5 — validation (fixtures + cross-language E2E + English regression)
Final commit of the language-selection initiative (#616). Ships the
user-facing CLI for language preference and the release notes that close
the issue end-to-end.

New `brv config set <key> <value>` and `brv config get <key>` commands —
generic project-config infrastructure (not a one-off `brv language`
command). Today's keys: `language.mode` and `language.code`; future
project-config keys plug into the `SETTERS` / `GETTERS` dispatch with no
new oclif surface area.

CLI behavior:

- `brv config set language.mode auto | fixed` — reject `fixed` when no
  `language.code` is set, with a redirect message pointing at
  `brv config set language.code <iso>`. Prevents writing a config that
  `isBrvConfigJson` would refuse on next load.
- `brv config set language.code <iso>` — validates against the
  `LANGUAGE_NAMES` map (24 entries, no `iso-639-1` dependency). Unknown
  codes rejected with a sorted supported-list error.
- `brv config get language.<key>` — symmetric reader; returns "(not set)"
  for absent fields, or the value (in text or JSON mode).

Other:

- New `BrvConfig.withLanguage(language?)` method following the existing
  `with*` pattern. Used by the set command's setters; previously a future
  caller would have to spread BrvConfig fields by hand.
- Pure dispatcher functions (`applyConfigSet`, `applyConfigGet`) separated
  from the oclif Command class so the validation contract is testable in
  isolation.

Release notes under [Unreleased] in CHANGELOG.md crediting Dmitriy K and
including the **restoration recipe** for users who prefer the prior
implicit-English behavior:

    brv config set language.code en
    brv config set language.mode fixed

Tests (22 new):

- 12 cases on `applyConfigSet` covering: auto / fixed / mode transitions,
  code update preserves mode, English restoration recipe, rejection paths
  (fixed-without-code, unknown mode value, unknown ISO code, unknown
  config key, totally unrelated key).
- 5 cases on `applyConfigGet`: unset → undefined, both modes, both keys,
  unknown-key rejection.
- 5 cases on `withLanguage`: replace, set-when-unset, clear via undefined,
  no mutation of original, all-other-fields-preserved.

Ship gate:

- Typecheck + lint clean (pre-existing complexity warning unchanged).
- 242 mocha tests green across the affected surfaces (BrvConfig + config
  CLI + prompt builders + clause module + curate-session + brv-curate-tool
  + CJK tokenizer + language roundtrip integration + validate-brv-config
  init hook).
- Auto-test harness 13/13 green after rebuild — including the 4
  cross-language curate→query roundtrips and the 7 original English
  cases (zero structural drift).
- Manual CLI smoke test: full restoration recipe roundtrip + unknown-
  code rejection.

Post-merge action: post on #616 with feature summary, link to release
notes, and pointer to backlog.md so Dmitriy can comment on what to
prioritize next (per-curate --lang flag, TUI panel, per-domain overrides).
Don't auto-close — let Dmitriy close after confirming the feature works
for him.
[ENG-2691] Language selection 5/5 — brv config set/get + release notes + ship gate
… in SETTINGS_REGISTRY

Adds EnumSettingDescriptor to the settings type system and registers the
two language settings (mode, code) under category 'language'. The wire
layer (CLI parser, transport DTO, validator, store, handler) now accepts
and round-trips string-valued enum settings end-to-end; TUI and WebUI
renderers will pick them up in follow-up commits.

- SettingDescriptor union grows with EnumSettingDescriptor (options field)
- SettingItem.current / default widen to boolean | number | string
- SettingsItemDTO exposes 'options' on enum-typed items
- brv settings set <enum-key> rejects invalid values with the allowed list
- SettingsValidator.validate routes enum values through validateEnum
- SETTINGS_KEYS gains LANGUAGE_MODE + LANGUAGE_CODE
- registry test: enum narrowing, options shape, language category
- handler test: LIST exposes options, SET rejects non-string to enum keys
Surfaces language.mode / language.code via the existing settings UIs.

- shared/language/language-names.ts: extracted single source of truth for
  the ISO-639-1 → English-name map (was inline in language-clause.ts);
  reused by settings registry, WebUI display labels, and language-clause
- shared SettingsRow / format-settings: adds enum row builder with
  options + display formatting "[ auto ]"
- TUI settings-page: Left/Right cycles options in edit mode; LANGUAGE
  group appears below UPDATES per CATEGORY_ORDER
- WebUI enum-settings-row.tsx: select dropdown driven by descriptor
  options; language.code rows render "ja — Japanese" via LANGUAGE_NAMES
- WebUI LanguagePanel wires into Configuration > General between
  TaskHistoryPanel and UpdatesPanel
- tests: enum row builder + parseRowInput + category ordering
…onfig set language.*

Switches the curate kickoff/continuation language read site from project
config to the daemon settings store, and points users at the new
brv settings set surface for any future change. Project config remains
a fallback for users who still have a per-project override from
ENG-2691 so no one loses their setting on upgrade.

- curate/index.ts: resolveLanguagePreference now calls FileSettingsStore
  first; falls back to ProjectConfigStore for backward compat
- config/set.ts: brv config set language.{mode,code} now fails with a
  clear "moved to brv settings set" message; pure dispatcher unchanged
  so the validation path still covers other project-config keys
…ange + integer parsed.value

Pre-push tsc rejected:
- enum-settings-row: Select.onValueChange signature is (string | null) under
  the component lib's types; the null branch is unreachable for a non-
  cleared single-select but the type system needs an explicit guard.
- settings-row IntegerSettingsRow: parsed.value widened to (number | string)
  after enum support landed; narrow to number before passing to
  toastValue/setMutation so the integer branch keeps compiling under strict.
The oclif settings command kept a duplicate local CATEGORY_ORDER + HEADERS
that wasn't extended when 'language' was added to the canonical shared
registry, so language rows silently dropped from text output even though
the JSON form exposed them. Promote CATEGORY_HEADERS plus a
CATEGORY_ORDER-derived toRowCategory helper into the shared types module
so the oclif and shared-utils formatters consume a single canonical
source. Add a regression test that iterates every category in
CATEGORY_ORDER, gating the broader "new category added but normalizer not
updated" bug class.
`oclif/curate/index.ts` was importing FileSettingsStore + SETTINGS_KEYS
directly from server/ to resolve the language preference, violating
CLAUDE.md's "oclif and tui never import from server" boundary. Route
through SettingsEvents.LIST via withDaemonRetry — same pattern used by
`brv settings get/list`. Add an intent comment on resolveLanguagePreference
explaining the daemon-vs-project-config precedence semantics for the
ENG-2691 → ENG-2974 migration window so a future reader doesn't try to
"fix" the deliberate fallback. Distinguishing "user explicitly set
mode=auto" from "user never touched settings" requires raw-overrides
access the transport doesn't expose today, and the case can't manifest
in production until project-config language ships.
The round-3 transport refactor inlined `'language.mode'` and `'language.code'`
as local string-literal consts in `oclif/commands/curate/index.ts` to avoid the
oclif → server boundary slip. That workaround reproduced the brittleness the
original `SETTINGS_KEYS` constant was designed to prevent — a rename in the
registry would no longer be a typecheck error at the curate call site (and
the same trap existed in two WebUI files).

Move `SETTINGS_KEYS` to `src/shared/types/settings-keys.ts` so every consumer
(server registry / validator / bootstrap, agent, oclif curate, WebUI labels +
enum-settings-row) imports one canonical map. Rename safety is now machine
checked. WebUI consumers that previously inlined the wire keys also switch
to the constant.

Also export `readLanguageFromSettings` with an optional `DaemonClientOptions`
parameter and add three unit tests covering: daemon returns fixed → returns
the language; daemon returns auto → undefined; daemon throws → undefined.
The function was the surface that just changed and silently swallows
daemon errors, so direct coverage is warranted.
…ings PR

Five inline-comment items on PR #724's auto-review against feat/ENG-2974:

1. Tight retry budget on readLanguageFromSettings — kickoff was inheriting
   withDaemonRetry's MAX_RETRIES=10 × 1s default, making every `brv curate`
   block ~9s when the daemon was unreachable before falling back to project
   config. Now passes {maxRetries:1, retryDelayMs:0, ...options} so a
   missing daemon trips the fallback immediately. Test asserts call count
   stays at most 2 (was 10 with the old default; test would have run for
   ~9s, now completes in 6ms).

2. Surface enum options on `brv settings get` — text mode now prints an
   `allowed: a, b, c` line for enum keys; JSON mode includes the `options`
   field. Mirrors how integer keys surface `min`/`max`/`range`. Scripted
   consumers can discover the legal value set without grepping source.

3. Clean stale `brv config set` surface — help text, args description, and
   examples no longer advertise the now-deprecated `language.mode` /
   `language.code` (the command-level interceptor catches those upstream
   with a deprecation-redirect). Examples point users at `brv settings set`
   instead. SETTERS dispatcher emptied with a comment that future
   project-config keys wire in by adding one entry; unreachable
   setLanguageMode / setLanguageCode functions removed; their dispatcher
   tests collapsed to skeleton coverage.

4. Collapse language-clause re-export — replaced the
   `export {LANGUAGE_NAMES} from ...` + `import {... as LANGUAGE_NAMES_LOCAL}`
   double-binding with a single `import {LANGUAGE_NAMES}` directly from the
   canonical shared module. `oclif/commands/config/set.ts` and the
   language-clause test repoint to the same canonical path.

5. file-settings-store — source `restartRequired` from the registry
   descriptor instead of hardcoding `true`. The hardcoded value was dead in
   the transport-routed path (the handler already overrides via the
   descriptor) but misrepresented boolean/enum keys whose descriptors
   carry `restartRequired: false` (update.checkForUpdates, language.mode,
   language.code) for any in-process caller reading the store directly.
   Also refresh the `SettingsFile.values` doc that still said "set writes
   only validated numeric values" — now mentions booleans and enums.

Verification: 8871 npm test passing; ESLint + typecheck + build clean;
3 sub-agents PASS on settings-get / curate-pipeline / config-deprecation
surfaces. Codex round-5 review APPROVED.
The onboarding tour now asks which written language to save the user's
memory in, surfacing the ENG-2974 language settings during the guided
intro instead of leaving users to discover them later.

- Reuse the existing global settings (`language.code` / `language.mode`)
  via `brv settings set` — no new infrastructure.
- Ask it as Message 1's first question, on its own turn (one question
  per turn), before the persona interview. Set the language before the
  first `brv curate` so the saved artifact renders in the chosen language.
- Default path stays zero-setup: English / "match me" / blank leaves
  `auto` mode and runs nothing (auto already matches the user's input).
- Set `language.code` first, then `language.mode fixed` — a rejected or
  mis-mapped code then leaves mode at its safe `auto` default rather than
  forcing English. Unsupported language or `invalid_value` degrades to
  auto; never dumps the supported list or surfaces raw error JSON.
- Amend onboarding guardrails: carve the narrow config exception, extend
  the no-form/menu rule to the language line, and forbid asking the
  language and persona questions in the same message.
- Add a content test asserting onboarding.md documents the language step,
  the code-before-mode ordering, and the trust-opener-first sequence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat: [ENG-3002] add language selection to onboarding tour
Conflict resolution:

Two files had 3-stage index entries. The full conflicted file list and
the resolution decision per file:

1. src/agent/infra/tools/implementations/search-knowledge-service.ts
   — UNION resolution.
     * From main (ENG-3021): `parsed.imageContent` added to the
       indexedContent array so BM25 sees <img alt + src> tokens.
     * From HEAD  (ENG-2689): `tokenizeWithCjk` import + wired into
       MINISEARCH_OPTIONS.tokenize for CJK bigram support.
     * Both sides bumped INDEX_SCHEMA_VERSION to 7 — coincidence, both
       changes needed an index invalidation. No behavioural overlap;
       both contributions preserved verbatim.

2. src/oclif/commands/curate/index.ts — manual resolution, two regions.

   Region 1 (imports) — disjoint UNION:
     * HEAD added BrvConfigLanguage, ProjectConfigStore, SettingsEvents,
       SETTINGS_KEYS for the language-preference resolution path; took
       a narrow {continueSession, kickoffSession, resolveProjectRoot}
       from curate-session.
     * main added CurateSessionEnvelope and the full curate-session
       import set including deleteCurateResponseFile,
       InvalidResponseFileError, InvalidResponseFormatError,
       loadCurateResponseFile, parseCurateResponse, peekCurateSession,
       unknownSessionEnvelope.
     * Resolution: merged both import-type lines + combined named-
       import list.

   Region 2 (handleContinuation body) — STRUCTURAL choice + graft.
     The two sides described different shapes of the same function:
       * HEAD: old shape — inline withDaemonRetry(continueSession(...))
         with `language` passed in. No file IO, no envelope validation,
         no peek-session.
       * main: new shape — mutex checks, response resolution from
         inline OR file via loadResponseFile, local pre-dispatch
         validation (empty/format/session-peek), dispatchContinuation
         helper, post-success --delete-response-file cleanup with
         append-to-errors[] semantics on cleanup failure.

     Decision: kept main's full structure (strictly subsumes HEAD's
     old shape), then grafted language threading on top:
       * resolveLanguagePreference(projectRoot) called once after the
         projectRoot is resolved
       * Passed language into dispatchContinuation(...)
       * Widened dispatchContinuation args to accept
         language?: BrvConfigLanguage
       * Threaded language into the inner continueSession({...}) call

     This preserves every behaviour from both sides: file IO +
     validation + cleanup from main, language threading from HEAD.

   handleKickoff already called kickoffSession({content, language,
   projectRoot}) outside the conflict markers and merged cleanly — no
   action needed there.

Lib-layer signatures (curate-session.ts) were already a superset on
main: both ContinueOptions and KickoffOptions accept
`language?: BrvConfigLanguage`, so no lib changes were needed.
@danhdoan danhdoan requested a review from RyanNg1403 as a code owner June 1, 2026 04:44
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

Claude finished @danhdoan's task in 4m 49s —— View job


Claude Code Review

  • Read CLAUDE.md and gather context
  • Review merge-resolution changes (search-knowledge-service.ts, curate/index.ts)
  • Review language-clause & curate-prompt-builder
  • Review settings surface migration (events, validator, format-settings)
  • Review CJK tokenizer
  • Review WebUI/TUI surface
  • Post inline comments for issues found
  • Post summary with score

Summary

Score: 4.5 / 5 — solid, well-tested, conservative merge. The merge-resolution is clean and additive; the language-clause threading composes on top of main's response-file / <img> surface without breaking either contract. The deferred concerns (precedence quirk, project-config sunset) are correctly scoped out and documented.

Strengths

  • Single source of truth for the language clause (language-clause.ts) — every emission site (kickoff, correction, MCP tool description) routes through one helper. Wording revisions are a one-file change.
  • CJK tokenizer is well-engineered — script-boundary splits, BMP-correct for…of iteration, single-char unigram fallback, comprehensive test coverage including cross-script isolation regressions. Empirical baseline ("MiniSearch 7.2.0 returned []") in the docstring is exactly the kind of forensic note that ages well.
  • Schema defends against silent fallback to EnglishisBrvConfigJson rejects {mode: 'fixed'} without code at load time, structurally closing the "silently writes in English" failure mode.
  • Strict TDD discipline — unit tests for language-clause, brv-config, curate-prompt-builder, curate-session, cjk-tokenizer PLUS a full-pipeline integration test (language-roundtrip.test.ts) covering Russian / Vietnamese / Chinese / Japanese end-to-end against a real on-disk config.
  • Deprecation gate is cleanbrv config set language.* rejects with a structured code: 'deprecated-key' that names the new command, instead of silently succeeding or silently no-op'ing.
  • Daemon read uses a tight retry budget (maxRetries: 1, retryDelayMs: 0) — prevents a ~9s curate stall when the daemon is unreachable, and is asserted with a unit test that caps connector.callCount.

Findings (all non-blocking — see inline comments)

  1. readLanguageFromSettings precedence: explicit daemon mode: 'auto' can be overridden by stale project-config fixed. Documented but worth surfacing during sunset.
  2. Surface asymmetry: brv config get language.* still reads the legacy fallback while brv config set is gated — could mislead a mid-migration user.
  3. Enum invalid-value error surfaces as code: 'invalid_value' rather than a more specific 'invalid_enum_option' or 'invalid_value_type' with expected: 'enum'.
  4. CJK Extension A (0x3400-0x4DBF) is excluded — fine, but suggested as a cheap follow-up for zh-TW rare-char coverage.
  5. WebUI general.tsx panel ordering doesn't match CATEGORY_ORDER (LanguagePanel rendered before Updates, but listed after).
  6. Minor enum-settings-row.tsx if (next === null) return looks like dead code unless the Select callback type is string | null.
  7. buildLanguageClause silently degrades {mode: 'fixed', code: undefined} → auto, with no logger breadcrumb — fine today, worth instrumenting if future call sites bypass BrvConfig.fromJson.

Risk profile: low. Merge-resolution is byte-equivalent for main-side contracts (per the PR's own claim); the only behavioral widening is dispatchContinuation's language?: BrvConfigLanguage param, which is opt-in and fully covered by curate-prompt-builder.test.ts. Test plan checks out — full suite passes 9006/9006, both relevant local-auto-test harnesses green.

Recommendation: ship after addressing finding #1's UX surfacing (the "I clicked auto but it's still Russian" failure mode is the most likely user-visible footgun during the migration window). Other findings can fold into follow-ups.
· Branch: proj/language-selection

Comment on lines +586 to +596
private async resolveLanguagePreference(projectRoot: string): Promise<BrvConfigLanguage | undefined> {
const fromSettings = await readLanguageFromSettings()
if (fromSettings !== undefined) return fromSettings

try {
const config = await new ProjectConfigStore().read(projectRoot)
return config?.language
} catch {
return undefined
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking): Precedence bug: an explicit daemon mode: 'auto' cannot override a stale per-project {mode: 'fixed', code: …} because readLanguageFromSettings returns undefined for any non-fixed daemon mode and resolveLanguagePreference then falls through to project config. This is documented in the docstring as intentional during the migration window, but the comment buries the lede — the function name suggests "no opinion" while the behavior is "no opinion only when daemon says fixed."

Worth at least logging the override edge-case for forensics (the docstring already says "revisit once project-config language is fully sunset"), and tracking the project-config sunset as a follow-up in the PR's "Out of scope" list (already noted). Confirm the WebUI Settings → Language → "auto" toggle messaging makes it clear that a project-config override may still apply mid-migration, or users will report "I clicked auto but it's still writing in Russian" until the sunset lands.

Comment on lines +140 to +149
if (descriptor.type === 'enum' && got !== 'string') {
return {
code: 'invalid_value_type',
expected: 'enum',
got,
key,
message: `expected string for '${key}', got ${got}`,
value,
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (minor): Enum value membership is enforced at the store layer (validateEnum in settings-validator.ts), which surfaces as code: 'invalid_value' — not 'invalid_value_type'. That's inconsistent with integer/boolean where a runtime type mismatch yields 'invalid_value_type'. For an enum with options: ['auto', 'fixed'], calling set('language.mode', 'bogus') returns invalid_value (range-style) rather than something more semantically aligned with "not in the enum set."

Not a behavior bug — both surface in error.message — but consumers switching on code lose the ability to distinguish "string but not in options" from "actual range/coupling violations." If you ever want to enable richer UI states (e.g. "please pick from the dropdown" vs "value out of allowed range"), this is the place. Consider extending SettingsErrorDTO.code with 'invalid_enum_option' (or just route the descriptor-known-options case through the type-check helper here so it emits invalid_value_type with expected: 'enum').

Comment thread src/oclif/commands/config/get.ts Outdated
Comment thread src/agent/infra/tools/implementations/cjk-tokenizer.ts
Comment thread src/webui/features/settings/components/enum-settings-row.tsx
Comment thread src/server/core/domain/render/language-clause.ts
…anguage surface

ENG-2974 left `brv config set/get language.*` as a deprecation gate
pointing at `brv settings set/get`. With the feature stack settled and
no other user-settable project-config keys planned, the dispatcher
itself is dead weight: `SETTERS` was empty for writes, `GETTERS` only
leaked the legacy fallback (the surface asymmetry the auto-reviewer
flagged on PR #749).

Removed:
- src/oclif/commands/config/{set,get}.ts + their tests + empty dir
- BrvConfig.withLanguage() — added in ENG-2691 specifically to back
  applyConfigSet's language setter; no other caller after this change
- describe('withLanguage', ...) test block in brv-config.test.ts
  (5 cases)

Retained:
- BrvConfig.language field + serialization. resolveLanguagePreference
  still reads it as the second-tier fallback for upgrade-compat with
  users who configured language before ENG-2974. The harness's
  per-project fixed-X kickoff cases still exercise this tier via
  setLanguageConfigDirect.
- No per-project language surface added. Auto-mode (the default)
  already matches the user's input language each curate, so
  multilingual users get correct behavior without a knob.

After this commit, `brv settings set/get language.mode|code` is the
ONLY canonical write/read surface. `brv config` no longer exists.

Verification:
- lint: 0 errors (280 pre-existing warnings, none on changed files)
- unit tests: 8994 passing / 0 failing
- curate-tool-mode-e2e: 20 / 20
- language-selection-e2e: 20 / 20 (was 22 — deprecation-gate and
  config-get-unknown cases gone with the command; harness migrated
  separately in this branch's prior work)
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

Claude finished @danhdoan's task in 5m 21s —— View job


Claude finished @danhdoan's task —— View job


Claude Code Review

  • Read CLAUDE.md and PR diff
  • Review merge-resolution changes (search-knowledge-service.ts, curate/index.ts)
  • Review language-clause & curate-prompt-builder
  • Review settings surface (events, validator, format-settings, handlers)
  • Review CJK tokenizer
  • Review WebUI/TUI settings surfaces
  • Post inline comments and final score

Summary

Score: 3.5 / 5 — solid feature work and a clean merge resolution, but one ship-blocking documentation defect and a state-trap created by the brv config removal that need attention before merge. The underlying code is well-designed and well-tested; the issues are in reaching users with correct instructions and giving them a way out of pre-migration state.

Strengths

  • Single source of truth for the language clauselanguage-clause.ts is the only file that emits clause text; kickoff, correction prompt, and MCP tool description all import from it. Wording revisions are a one-file change.
  • Defense-in-depth schema validationisBrvConfigJson rejects {mode: 'fixed'} without code at load time, structurally closing the "silently writes in English" failure mode. buildLanguageClause defends against the same shape anyway.
  • CJK tokenizer is well-engineered — script-boundary splits, BMP-correct iteration, single-char unigram fallback, and an empirical baseline in the docstring ("MiniSearch 7.2.0 returned []") that ages well. Tests cover cross-script isolation and corpus invariance.
  • Tight retry budget for the daemon settings read (maxRetries: 1, retryDelayMs: 0) prevents a ~9s curate kickoff stall when the daemon is unreachable; asserted with a connector.callCount cap so a future loosening can't slip through.
  • Strict TDD discipline visible across 13 changed/added test fileslanguage-clause, cjk-tokenizer, read-language-from-settings, full-pipeline language-roundtrip.test.ts covering Russian / Vietnamese / Chinese / Japanese end-to-end.
  • Merge resolution is cleansearch-knowledge-service.ts is a verbatim UNION of CJK tokenizer + <img> indexer (both invalidate via the same INDEX_SCHEMA_VERSION = 7 bump); curate/index.ts widens dispatchContinuation to thread language without disturbing main's response-file / overwrite contract.

Findings (inline comments on each)

  1. Ship-blocker — CHANGELOG.md references commands that no longer exist. The Unreleased entry describes brv config set/get language.* and the restoration recipe (brv config set language.code en + brv config set language.mode fixed) as the user-facing surface. Commit d7f3bf6 removed brv config entirely in this PR; the canonical surface is brv settings set/get language.mode|code. Users following the release notes will hit Command not found. Code suggestion attached.
  2. WebUI panel ordering doesn't match CATEGORY_ORDERgeneral.tsx renders LanguagePanel before UpdatesPanel; TUI and brv settings text both sort by the shared constant. The docstring on CATEGORY_ORDER warns about exactly this split-source drift. Cheapest fix is to swap the two JSX lines.
  3. Project-config language fallback is now a one-way trap. With brv config removed, a user with a stale .brv/config.json {language: fixed, code: 'ru'} who toggles daemon language.mode to auto cannot make auto take effect — readLanguageFromSettings returns undefined for non-fixed daemon mode and the fallback wins. Two paths: (a) have readLanguageFromSettings return {mode: 'auto'} explicitly so a daemon auto short-circuits the fallback, or (b) drop the project-config fallback entirely as a follow-up. (a) is small, well-tested, and unblocks (b).
  4. MCP tool description is auto-onlybuildLanguageClause() (no arg) builds once at server boot, so a user with mode: 'fixed', code: 'ru' who curates via the MCP brv-curate tool will see saved memory in English even though oclif brv curate honors fixed. Code comment documents the limitation but no UX breadcrumb. Suggest either a docs note in brv connectors install README or threading the clause through per-call responses.
  5. CJK tokenizer excludes Extension A (0x3400-0x4DBF) — ~6,500 codepoints in active zh-TW / zh-HK use. BMP-safe, single-line fix, but INDEX_SCHEMA_VERSION would need a bump. Non-blocking; flag as zh-TW backlog item.
  6. Dead-code if (next === null) return in enum-settings-row.tsx — Radix Select.onValueChange emits string, not string | null. Drop the check or surface the edge case explicitly with a toast.

Out-of-scope items the PR correctly defers

  • Sunset of the project-config language field (tracked).
  • MCP per-call language clause (tracked above).

Risk profile: medium, driven mostly by finding #1 (user-facing docs lie about the surface). Code-level correctness is high; merge resolution is byte-equivalent for main-side contracts as claimed.

Recommendation: address finding #1 before merge — it's a 4-line CHANGELOG edit and prevents every reader from running broken commands. Finding #3 is the next-most-important since it's the "I clicked auto but it's still Russian" footgun for migrating users and the PR's own "Out of scope" doesn't fully cover it. The others fold cleanly into follow-ups.
· Branch: proj/language-selection

Comment thread CHANGELOG.md
Comment on lines +8 to +13
- **ByteRover preserves your input language by default.** When you curate context in Russian, Chinese, Japanese, Vietnamese, or any other language, the calling agent's LLM is now instructed to author body text in the same language (the schema — tag names, attribute names, enum values, paths — stays English so tooling is unaffected). Configure with the new `brv config set` command:
- `brv config set language.mode auto` — match the user's input language (default).
- `brv config set language.mode fixed` + `brv config set language.code <iso>` — force a specific language. ISO 639-1 codes accepted: `ar`, `de`, `el`, `en`, `es`, `fi`, `fr`, `he`, `hi`, `id`, `it`, `ja`, `ko`, `nl`, `no`, `pl`, `pt`, `ru`, `sv`, `th`, `tr`, `uk`, `vi`, `zh`.
- `brv config get language.mode` / `brv config get language.code` — read back the current setting.

CJK queries (Chinese, Japanese, Korean) are now searchable in BM25 — the tokenizer was previously whitespace-only and treated entire CJK sentences as one token. **Restoration recipe** for users who prefer the prior implicit-English behavior: `brv config set language.code en` then `brv config set language.mode fixed`. Reported by Dmitriy K — thanks for the thorough reproduction in [#616](https://github.com/campfirein/byterover-cli/issues/616).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking — user-facing): The release notes describe brv config set/get language.* as the public surface, but commit d7f3bf6 ("remove brv config CLI — settings is the only language surface") deleted the entire brv config command in this same PR. A user who reads these notes and runs brv config set language.mode auto will hit Command not found — and so will the restoration recipe at the bottom (brv config set language.code en then brv config set language.mode fixed), which is the headline "don't lose your prior behavior" path for the people most likely to care.

The canonical surface is now brv settings set language.mode|code (and brv settings get … to read back). The supported-codes list and the restoration recipe also need updating.

Suggested edit:

Suggested change
- **ByteRover preserves your input language by default.** When you curate context in Russian, Chinese, Japanese, Vietnamese, or any other language, the calling agent's LLM is now instructed to author body text in the same language (the schema — tag names, attribute names, enum values, paths — stays English so tooling is unaffected). Configure with the new `brv config set` command:
- `brv config set language.mode auto` — match the user's input language (default).
- `brv config set language.mode fixed` + `brv config set language.code <iso>` — force a specific language. ISO 639-1 codes accepted: `ar`, `de`, `el`, `en`, `es`, `fi`, `fr`, `he`, `hi`, `id`, `it`, `ja`, `ko`, `nl`, `no`, `pl`, `pt`, `ru`, `sv`, `th`, `tr`, `uk`, `vi`, `zh`.
- `brv config get language.mode` / `brv config get language.code` — read back the current setting.
CJK queries (Chinese, Japanese, Korean) are now searchable in BM25 — the tokenizer was previously whitespace-only and treated entire CJK sentences as one token. **Restoration recipe** for users who prefer the prior implicit-English behavior: `brv config set language.code en` then `brv config set language.mode fixed`. Reported by Dmitriy K — thanks for the thorough reproduction in [#616](https://github.com/campfirein/byterover-cli/issues/616).
- **ByteRover preserves your input language by default.** When you curate context in Russian, Chinese, Japanese, Vietnamese, or any other language, the calling agent's LLM is now instructed to author body text in the same language (the schema — tag names, attribute names, enum values, paths — stays English so tooling is unaffected). Configure with `brv settings`:
- `brv settings set language.mode auto` — match the user's input language (default).
- `brv settings set language.mode fixed` + `brv settings set language.code <iso>` — force a specific language. ISO 639-1 codes accepted: `ar`, `de`, `el`, `en`, `es`, `fi`, `fr`, `he`, `hi`, `id`, `it`, `ja`, `ko`, `nl`, `no`, `pl`, `pt`, `ru`, `sv`, `th`, `tr`, `uk`, `vi`, `zh`.
- `brv settings get language.mode` / `brv settings get language.code` — read back the current setting.
CJK queries (Chinese, Japanese, Korean) are now searchable in BM25 — the tokenizer was previously whitespace-only and treated entire CJK sentences as one token. **Restoration recipe** for users who prefer the prior implicit-English behavior: `brv settings set language.code en` then `brv settings set language.mode fixed`. Reported by Dmitriy K — thanks for the thorough reproduction in [#616](https://github.com/campfirein/byterover-cli/issues/616).

Also note that the PR body still references the deprecation-gate cases (Case 6) as part of the harness migration; per the d7f3bf6 commit message those were also removed (language-selection-e2e went 22 → 20). Worth a quick sweep on the PR description so future archaeology lines up with the final commit.

<ConcurrencyPanel />
<LlmPanel />
<TaskHistoryPanel />
<LanguagePanel />
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (consistency): This places LanguagePanel before UpdatesPanel, but the canonical CATEGORY_ORDER in src/shared/types/settings-row.ts:37-44 is ['concurrency', 'llm', 'task-history', 'updates', 'language', 'other']. The TUI (tui/features/settings/utils/format-settings.ts) and brv settings text output (oclif/commands/settings/index.ts:71) both sort by that constant, so this is the only surface where Language renders before Updates.

That's the exact split-source drift CATEGORY_ORDER's docstring warns against:

Adding a new category here … is the only edit needed for both consumers to render it — oclif/commands/settings/index.ts and tui/features/settings/utils/format-settings.ts both import from this module, so the surfaces never drift.

The WebUI is now a third surface and it doesn't subscribe. Cheapest fix is to swap these two lines:

Suggested change
<LanguagePanel />
<UpdatesPanel />
<LanguagePanel />

Lower-priority follow-up: drive the panel list from CATEGORY_ORDER so future additions only need an entry in the shared constant.

Comment on lines +52 to +60
// Auto clause unconditional: the MCP tool description is built once at
// server-boot, so it cannot read live config. Per-call fixed-mode is
// honored via the oclif `brv curate` kickoff prompt (which IS dynamic).
// MCP-only consumers under `language: { mode: 'fixed' }` see the auto
// clause here; their input language still gets preserved because auto
// says "match the input language".
'# Language',
'',
buildLanguageClause(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (UX gap, non-blocking): The MCP tool description is materialized once at server-boot, so buildLanguageClause() (no argument) always emits the auto clause regardless of the user's language.mode / language.code setting. The comment correctly documents this and reasons that auto says "match the input language" so multilingual users get correct behavior — and that's true.

But here's the user-visible gap: someone who explicitly chose mode: 'fixed', code: 'ru' because their input is English but they want output in Russian (e.g. a non-native English speaker writing prompts in English but wanting saved memory in their native language) will see the MCP-driven curate save in English. The oclif brv curate kickoff prompt does honor fixed mode (via resolveLanguagePreference), so the two surfaces diverge — and there's no breadcrumb in the MCP response that says "your fixed-language preference was ignored on this path."

Two ways to close the loop, in order of cost:

  1. Document the limitation in the MCP tool's user-facing surface (the response footer or brv connectors install … README) so MCP users know to use the oclif kickoff when fixed-language matters.
  2. Plumb language into MCP either by rebuilding the description per-call (expensive — registerTool happens once) or by injecting the clause into the per-call response wrapper (cheaper — the daemon already knows the settings).

(1) is a docs change; (2) is a follow-up ticket. Leaning toward (1) for now plus an "Out of scope" line in this PR, since the auto-fallback degrades safely.

disabled={setMutation.isPending}
onValueChange={(next) => {
if (next === null) return
choose(next).catch(noop)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if (next === null) return is dead-code-shaped — Select.onValueChange from Radix (which @campfirein/byterover-packages/components/select wraps) emits string, never null. TypeScript should have flagged this if the callback signature was tight; it likely passed because the callback param is implicitly typed via useCallback inference.

Two cleaner alternatives:

  1. Drop the check if the signature is (value: string) => void:
Suggested change
choose(next).catch(noop)
onValueChange={(next) => {
choose(next).catch(noop)
}}
  1. Surface the edge case if the signature is genuinely (value: string | null) => void (clear-selection support) — silently returning hides a state the user just produced. A short toast or no-op acknowledgment beats an invisible drop.

The cheapest correctness win is to confirm the upstream type and pick (1) or (2) explicitly. Today's behavior is fine — this is housekeeping.

[0x30_40, 0x30_9F],
[0x30_A0, 0x30_FF],
[0xAC_00, 0xD7_AF],
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking, zh-TW coverage): The ranges deliberately exclude CJK Extension A (0x3400-0x4DBF), which covers ~6,500 codepoints in active use across Traditional Chinese — proper nouns, place names, corporate names, academic and historical text in zh-TW and zh-HK corpora. The comment defends the exclusion as "rarely in user content," which holds on aggregate but optimistically for Taiwan/HK knowledge bases.

Extension A is BMP, so for...of handles it without surrogate-pair work, and the bigram-index cost is proportional to occurrences (effectively zero on non-zh-TW corpora). Adding it is one line:

Suggested change
]
const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
[0x34_00, 0x4D_BF],
[0x4E_00, 0x9F_FF],
[0x30_40, 0x30_9F],
[0x30_A0, 0x30_FF],
[0xAC_00, 0xD7_AF],
]

The maintainer note already points at INDEX_SCHEMA_VERSION = 7 for invalidation — that's the cost. If you'd rather stay conservative, leaving as-is is fine; log a follow-up the first time a zh-TW user reports search misses on rare characters.

Comment on lines +570 to +596
/**
* Resolve the language preference. Daemon settings (the source of
* truth) take precedence; a per-project `.brv/config.json language`
* field acts as a fallback for users who configured language before
* it moved to global settings.
*
* Note on precedence: only daemon `mode: 'fixed'` short-circuits the
* fallback. An explicit daemon `mode: 'auto'` reads as "no opinion"
* and falls through to project config, so a stale project-config
* `fixed/X` will still win. This is intentional for the migration
* window — distinguishing "user explicitly chose auto" from "user
* never touched settings" needs raw-overrides access that the
* transport doesn't expose today, and the bug only manifests for
* users with a pre-existing per-project `language` field. Revisit
* once project-config language is fully sunset.
*/
private async resolveLanguagePreference(projectRoot: string): Promise<BrvConfigLanguage | undefined> {
const fromSettings = await readLanguageFromSettings()
if (fromSettings !== undefined) return fromSettings

try {
const config = await new ProjectConfigStore().read(projectRoot)
return config?.language
} catch {
return undefined
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking — but worth tracking): With brv config now removed (per commit d7f3bf6), this fallback's job has shifted from "honor a current-but-deprecated surface" to "honor stale state nobody can write any more." A user mid-migration whose .brv/config.json still has {language: {mode: 'fixed', code: 'ru'}} and who then flips daemon language.mode to auto will not see auto take effect — readLanguageFromSettings returns undefined for any non-fixed daemon mode, and the project-config fixed wins via this fallthrough.

That's the same precedence quirk the prior auto-reviewer flagged (the docstring at L575-585 calls it out and defers it to the project-config sunset). The new wrinkle is that the user has no way to clear the stale state — there's no brv config set ... to write auto over it, and editing .brv/config.json by hand is undocumented. So the fallback now acts as a one-way migration trap rather than a graceful bridge.

Two ways forward:

  1. Honor explicit auto — change readLanguageFromSettings to return {mode: 'auto'} when the daemon explicitly says auto (instead of undefined), and treat that as "user has opinions, do not consult fallback." Closes the trap cleanly and matches what a user expects after toggling the WebUI Language → auto control.
  2. Drop the project-config fallback entirely in a follow-up PR — the PR description's "Out of scope" already names this, but with brv config gone the fallback no longer serves a live surface; it only matters for users upgrading from a pre-ENG-2974 build.

(1) is a small, surgical change behind a unit test; (2) needs a migration sweep but cleans up the legacy code path. (1) makes (2) safer to land later. Worth at least surfacing on the WebUI Language → auto control with a one-line "if you previously configured a per-project language, it may still apply" hint until (1) lands, to avoid the "I clicked auto but it's still writing in Russian" report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants