campfirein · danhdoan · May 26, 2026
@@ -2,6 +2,15 @@
 
 All notable user-facing changes to ByteRover CLI will be documented in this file.
 
+## [Unreleased]
+
+### Added
+- **ByteRover preserves your input language by default.** When you curate context in Russian, Chinese, Japanese, Vietnamese, or any other language, the calling agent's LLM is now instructed to author body text in the same language (the schema — tag names, attribute names, enum values, paths — stays English so tooling is unaffected). Configure with `brv settings`:
+  - `brv settings set language.mode auto` — match the user's input language (default).
+  - `brv settings set language.mode fixed` + `brv settings set language.code <iso>` — force a specific language. ISO 639-1 codes accepted: `ar`, `de`, `el`, `en`, `es`, `fi`, `fr`, `he`, `hi`, `id`, `it`, `ja`, `ko`, `nl`, `no`, `pl`, `pt`, `ru`, `sv`, `th`, `tr`, `uk`, `vi`, `zh`.
+  - `brv settings get language.mode` / `brv settings get language.code` — read back the current setting.
+
+  CJK queries (Chinese, Japanese, Korean) are now searchable in BM25 — the tokenizer was previously whitespace-only and treated entire CJK sentences as one token. **Restoration recipe** for users who prefer the prior implicit-English behavior: `brv settings set language.code en` then `brv settings set language.mode fixed`. Reported by Dmitriy K — thanks for the thorough reproduction in [#616](https://github.com/campfirein/byterover-cli/issues/616).
 ## [3.16.1]
 
 ### Fixed
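
The `language.mode` / `language.code` pairing described in the changelog entry can be sketched as a small validation step. This is a minimal illustration, not ByteRover's implementation: `ACCEPTED_CODES`, `LanguageSetting`, and `parseLanguageSetting` are hypothetical names, and only the mode values and the code list are taken from the entry above.

```typescript
// Hypothetical sketch: none of these names come from the ByteRover codebase.
// The code list is copied from the changelog entry.
const ACCEPTED_CODES = [
  'ar', 'de', 'el', 'en', 'es', 'fi', 'fr', 'he', 'hi', 'id', 'it', 'ja',
  'ko', 'nl', 'no', 'pl', 'pt', 'ru', 'sv', 'th', 'tr', 'uk', 'vi', 'zh',
] as const

type LanguageCode = (typeof ACCEPTED_CODES)[number]
type LanguageSetting = {mode: 'auto'} | {code: LanguageCode; mode: 'fixed'}

function isAcceptedCode(value: string): value is LanguageCode {
  return (ACCEPTED_CODES as readonly string[]).includes(value)
}

// `auto` needs no code; `fixed` is only meaningful with an accepted code.
function parseLanguageSetting(mode: string, code?: string): LanguageSetting {
  if (mode === 'auto') return {mode: 'auto'}
  if (mode !== 'fixed') throw new Error(`Unknown language.mode: ${mode}`)
  if (code === undefined || !isAcceptedCode(code)) {
    throw new Error(`language.code must be one of: ${ACCEPTED_CODES.join(', ')}`)
  }

  return {code, mode: 'fixed'}
}
```

Under these assumptions, `parseLanguageSetting('fixed', 'en')` corresponds to the restoration recipe, while `parseLanguageSetting('fixed', 'xx')` is rejected.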

@@ -21,9 +21,9 @@ import type {ToolProvider} from '../tools/tool-provider.js'
 import type {AgentConfig} from './agent-schemas.js'
 import type {ProviderUpdateConfig} from './provider-update-config.js'
 
-import {SETTINGS_KEYS} from '../../../server/core/domain/entities/settings.js'
 import {TransportStateEventNames} from '../../../server/core/domain/transport/schemas.js'
 import {agentLog} from '../../../server/utils/process-logger.js'
+import {SETTINGS_KEYS} from '../../../shared/types/settings-keys.js'
 import {getEffectiveMaxInputTokens, resolveRegistryProvider} from '../../core/domain/llm/index.js'
 import {STREAMING_EVENT_NAMES} from '../../core/domain/streaming/types.js'
 import {ToolName} from '../../core/domain/tools/constants.js'

@@ -0,0 +1,159 @@
+/**
+ * BM25 tokenizer with CJK bigram segmentation.
+ *
+ * MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}` (Unicode
+ * whitespace + punctuation). Latin / Cyrillic / Vietnamese / European
+ * scripts use whitespace between words and tokenize correctly. CJK scripts
+ * do not — a sentence like `认证系统使用JWT令牌` becomes a single token,
+ * so a query for `认证` against indexed CJK content returns zero matches.
+ *
+ * Empirical confirmation before this fix (MiniSearch 7.2.0):
+ *
+ *   const ms = new MiniSearch({fields: ['t'], idField: 'id'})
+ *   ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}])
+ *   ms.search('认证')           // → [] — broken
+ *   ms.search('Привет мир')   // → matches as expected
+ *
+ * This tokenizer preserves the default behavior for whitespace-separated
+ * scripts and adds overlapping-bigram segmentation for CJK runs. Mixed
+ * Latin+CJK tokens (e.g. `JWT令牌`) split at the script boundary so the
+ * Latin portion stays a real word token.
+ *
+ * Wired via the top-level `tokenize` option on MiniSearch — per the
+ * library docs and source (`MiniSearch.js:1564-1566`), that single option
+ * applies at both index and query time unless `searchOptions.tokenize`
+ * is set, which we leave unset.
+ */
+
+/**
+ * Unicode ranges treated as CJK for the purposes of bigram segmentation.
+ * Anything outside these ranges is "non-CJK" and is split on whitespace
+ * and punctuation only.
+ *
+ * - `0x3400–0x4DBF`: CJK Unified Ideographs Extension A
+ * - `0x4E00–0x9FFF`: CJK Unified Ideographs (Chinese, Japanese kanji)
+ * - `0x3040–0x309F`: Hiragana
+ * - `0x30A0–0x30FF`: Katakana
+ * - `0xAC00–0xD7AF`: Hangul Syllables (Korean)
+ *
+ * CJK Extensions B/C/… (supplementary planes) are deliberately excluded:
+ * they appear in academic / historical text but rarely in user content.
+ * If a user's corpus needs them, extend this list and bump
+ * `INDEX_SCHEMA_VERSION` in `search-knowledge-service.ts` so cached
+ * indexes invalidate.
+ */
+const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [
+  [0x34_00, 0x4D_BF],
+  [0x4E_00, 0x9F_FF],
+  [0x30_40, 0x30_9F],
+  [0x30_A0, 0x30_FF],
+  [0xAC_00, 0xD7_AF],
+]
+
+function isCjkCodePoint(cp: number): boolean {
+  for (const [lo, hi] of CJK_RANGES) {
+    if (cp >= lo && cp <= hi) return true
+  }
+
+  return false
+}
+
+/**
+ * Whitespace + punctuation split, matching MiniSearch's default
+ * `SPACE_OR_PUNCTUATION` regex, including its explicit `\n` and `\r`
+ * (line breaks are control characters, not `\p{Z}`). Kept verbatim so a
+ * future upstream tweak is easy to spot via diff.
+ */
+const SPACE_OR_PUNCTUATION = /[\n\r\p{Z}\p{P}]+/u
+
+/**
+ * Split a token at boundaries between CJK and non-CJK runs.
+ *
+ * - `'JWT令牌'`  → `['JWT', '令牌']` (script boundary at index 3)
+ * - `'认证'`     → `['认证']`        (single CJK run)
+ * - `'JWT'`      → `['JWT']`          (single non-CJK run)
+ */
+function splitAtCjkBoundary(token: string): string[] {
+  const segments: string[] = []
+  let current = ''
+  let currentIsCjk: boolean | undefined
+
+  // Iterate by code point so any future range extension into the
+  // supplementary plane handles surrogate pairs correctly. The current
+  // ranges are all BMP, so `for...of` is equivalent to char-by-char
+  // here, but it is cheap to be correct.
+  for (const ch of token) {
+    const cp = ch.codePointAt(0)
+    if (cp === undefined) continue
+    const charIsCjk = isCjkCodePoint(cp)
+
+    if (currentIsCjk === undefined) {
+      current = ch
+      currentIsCjk = charIsCjk
+    } else if (charIsCjk === currentIsCjk) {
+      current += ch
+    } else {
+      segments.push(current)
+      current = ch
+      currentIsCjk = charIsCjk
+    }
+  }
+
+  if (current.length > 0) segments.push(current)
+
+  return segments
+}
+
+/**
+ * Emit overlapping bigrams for a CJK run.
+ *
+ * - `'认证系统'` (4 chars) → `['认证', '证系', '系统']`
+ * - `'认证'`     (2 chars) → `['认证']`
+ * - `'认'`       (1 char)  → `['认']` (unigram fallback so single-char tokens are searchable)
+ *
+ * Bigrams are the standard CJK IR compromise: unigrams are too noisy
+ * (common chars like `的` dominate scoring), trigrams are too sparse
+ * (miss 2-character compound matches).
+ */
+function cjkBigrams(run: string): string[] {
+  const chars = [...run]
+  if (chars.length <= 1) return chars
+
+  const grams: string[] = []
+  for (let i = 0; i < chars.length - 1; i++) {
+    grams.push(chars[i] + chars[i + 1])
+  }
+
+  return grams
+}
+
+/**
+ * Tokenize text for BM25 indexing and querying.
+ *
+ * Algorithm:
+ *   1. Split on Unicode whitespace + punctuation (matches MiniSearch default).
+ *   2. For each resulting token, split at CJK ↔ non-CJK script boundaries.
+ *   3. For non-CJK segments, emit the segment as-is.
+ *   4. For CJK segments, emit overlapping bigrams.
+ *
+ * The result is the union — Latin / Cyrillic / Vietnamese behave exactly
+ * as the MiniSearch default, while CJK runs become searchable.
+ */
+export function tokenizeWithCjk(text: string): string[] {
+  const out: string[] = []
+
+  for (const wsToken of text.split(SPACE_OR_PUNCTUATION)) {
+    if (wsToken.length === 0) continue
+
+    for (const segment of splitAtCjkBoundary(wsToken)) {
+      if (segment.length === 0) continue
+
+      // `splitAtCjkBoundary` returns single-script segments, so the
+      // first code point's classification applies to the whole segment.
+      const firstCp = segment.codePointAt(0)
+      if (firstCp !== undefined && isCjkCodePoint(firstCp)) {
+        out.push(...cjkBigrams(segment))
+      } else {
+        out.push(segment)
+      }
+    }
+  }
+
+  return out
+}
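
The four-step algorithm documented on `tokenizeWithCjk` can be traced on the sample sentence from the file header. The block below is a condensed restatement for illustration only, not the module itself: `RANGES`, `isCjk`, `runs`, `bigrams`, and `tokenize` are local stand-in names, and the separator class is an assumption about what counts as whitespace and punctuation.

```typescript
// Condensed, illustrative restatement of the pipeline; names are local stand-ins.
const RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0xac00, 0xd7af], // Hangul Syllables
]

// Classifies a string by its first code point.
const isCjk = (text: string): boolean => {
  const cp = text.codePointAt(0) ?? -1
  return RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)
}

// Step 2: split one token into single-script runs.
function runs(token: string): string[] {
  const out: string[] = []
  for (const ch of token) {
    const last: string | undefined = out[out.length - 1]
    if (last !== undefined && isCjk(last) === isCjk(ch)) out[out.length - 1] = last + ch
    else out.push(ch)
  }

  return out
}

// Step 4: overlapping bigrams, with a unigram fallback for 1-char runs.
function bigrams(run: string): string[] {
  const chars = Array.from(run)
  if (chars.length <= 1) return chars
  return chars.slice(0, -1).map((ch, i) => ch + chars[i + 1])
}

function tokenize(text: string): string[] {
  return text
    .split(/[\n\r\p{Z}\p{P}]+/u) // Step 1 (separator class assumed here)
    .filter((token) => token.length > 0)
    .flatMap((token) => runs(token))
    .flatMap((segment) => (isCjk(segment) ? bigrams(segment) : [segment])) // Steps 3 and 4
}

console.log(tokenize('认证系统使用JWT令牌'))
console.log(tokenize('Привет мир'))
```

The first call yields `['认证', '证系', '系统', '统使', '使用', 'JWT', '令牌']`: the six-character CJK run becomes five overlapping bigrams, and `JWT` survives as a whole word. The second yields `['Привет', 'мир']`, unchanged from a plain whitespace split.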
@@ -36,6 +36,7 @@ import {
 import {getFormatForRead} from '../../../../server/infra/render/format/format-detector.js'
 import {ElementAxisIndex} from '../../../../server/infra/render/reader/element-axis-index.js'
 import {readHtmlTopicSync} from '../../../../server/infra/render/reader/html-reader.js'
+import {tokenizeWithCjk} from './cjk-tokenizer.js'
 import {isPathLikeQuery, matchMemoryPath, parseSymbolicQuery} from './memory-path-matcher.js'
 import {
   buildReferenceIndex,
@@ -52,10 +53,7 @@ import {
 const MAX_CONTEXT_TREE_FILES = 10_000
 const DEFAULT_CACHE_TTL_MS = 5000
 
-/**
- * Bump when MINISEARCH_OPTIONS fields/boost change to invalidate cached indexes.
- *  v7 (ENG-3021): include `<img>` alt + src in HTML topic indexed content.
- */
+/** Bump when MINISEARCH_OPTIONS fields/boost change to invalidate cached indexes */
 const INDEX_SCHEMA_VERSION = 7
 
 /** Only include results whose normalized score is at least this fraction of the top result's score */
@@ -174,6 +172,12 @@ const MINISEARCH_OPTIONS = {
     prefix: true,
   },
   storeFields: ['title', 'path'] as string[],
+  // Custom tokenizer adds CJK bigram segmentation alongside the default
+  // whitespace split. Without it, queries against Chinese / Japanese /
+  // Korean content return zero matches even when the content is curated
+  // correctly — see `cjk-tokenizer.ts`. Top-level `tokenize` applies to
+  // both indexing and querying per MiniSearch's API.
+  tokenize: tokenizeWithCjk,
 }
 
 interface IndexedDocument {
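
Why the tokenizer has to run at both index and query time can be seen with a toy inverted index. This sketch does not use MiniSearch: `ToyIndex`, `splitOnly`, and `withBigrams` are hypothetical, the index uses AND semantics with no scoring, and `withBigrams` is a simplified stand-in that segments only Han-script runs via `\p{Script=Han}` rather than the explicit ranges in `cjk-tokenizer.ts`.

```typescript
type Tokenizer = (text: string) => string[]

// Whitespace/punctuation-only split, standing in for the pre-fix behavior.
const splitOnly: Tokenizer = (text) =>
  text.split(/[\n\r\p{Z}\p{P}]+/u).filter((token) => token.length > 0)

// Simplified stand-in for the CJK-aware tokenizer (Han script only).
const withBigrams: Tokenizer = (text) =>
  splitOnly(text).flatMap((token) =>
    (token.match(/\p{Script=Han}+|\P{Script=Han}+/gu) ?? []).flatMap((run) => {
      if (!/^\p{Script=Han}/u.test(run)) return [run]
      const chars = Array.from(run)
      return chars.length <= 1 ? chars : chars.slice(0, -1).map((ch, i) => ch + chars[i + 1])
    }),
  )

// Toy inverted index: the same tokenizer is applied to documents and queries.
class ToyIndex {
  private readonly postings = new Map<string, Set<number>>()
  private readonly tokenize: Tokenizer

  constructor(tokenize: Tokenizer) {
    this.tokenize = tokenize
  }

  add(id: number, text: string): void {
    for (const term of this.tokenize(text)) {
      const ids = this.postings.get(term) ?? new Set<number>()
      ids.add(id)
      this.postings.set(term, ids)
    }
  }

  // Returns the ids of documents that contain every query term.
  search(query: string): number[] {
    const terms = this.tokenize(query)
    if (terms.length === 0) return []
    const sets = terms.map((term) => this.postings.get(term) ?? new Set<number>())
    return Array.from(sets[0]).filter((id) => sets.every((ids) => ids.has(id)))
  }
}

const before = new ToyIndex(splitOnly)
const after = new ToyIndex(withBigrams)
for (const index of [before, after]) index.add(1, '认证系统使用JWT令牌')

console.log(before.search('认证')) // no match: the whole sentence was indexed as one term
console.log(after.search('认证')) // matches document 1 via the shared bigram
```

With the split-only tokenizer the sentence is stored as a single term, so the two-character query never matches. With bigrams on both sides, the query term `认证` is also an indexed term.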

@@ -1,7 +1,11 @@
 import {Args, Command, Flags} from '@oclif/core'
 
+import type {BrvConfigLanguage} from '../../../server/core/domain/entities/brv-config.js'
 import type {CurateSessionEnvelope} from '../../lib/curate-session.js'
 
+import {ProjectConfigStore} from '../../../server/infra/config/file-config-store.js'
+import {SettingsEvents, type SettingsListResponse} from '../../../shared/transport/events/settings-events.js'
+import {SETTINGS_KEYS} from '../../../shared/types/settings-keys.js'
 import {
   continueSession,
   deleteCurateResponseFile,
@@ -122,17 +126,19 @@ Bad examples:
   protected async dispatchContinuation(args: {
     confirmOverwrite: boolean
     format: 'json' | 'text'
+    language?: BrvConfigLanguage
     projectRoot: string
     response: string
     sessionId: string
   }): Promise<CurateSessionEnvelope> {
-    const {confirmOverwrite, format, projectRoot, response, sessionId} = args
+    const {confirmOverwrite, format, language, projectRoot, response, sessionId} = args
     let envelope: CurateSessionEnvelope | undefined
     await withDaemonRetry(async (client) => {
       envelope = await continueSession({
         client,
         confirmOverwrite,
         format,
+        language,
         projectRoot,
         response,
         sessionId,
@@ -466,11 +472,16 @@ Bad examples:
     // this path so the agent retains the source it already paid an
     // LLM call to produce.
     const confirmOverwrite = flags.overwrite ?? false
+    // Read fresh per continuation — mirrors kickoff so a mid-session
+    // language change (rare) is honored on the next correction prompt.
+    // The same fallback chain applies (daemon settings → project config).
+    const language = await this.resolveLanguagePreference(projectRoot)
     let dispatchEnvelope: CurateSessionEnvelope
     try {
       dispatchEnvelope = await this.dispatchContinuation({
         confirmOverwrite,
         format,
+        language,
         projectRoot,
         response,
         sessionId,
@@ -550,9 +561,74 @@ Bad examples:
       return
     }
 
-    const envelope = await kickoffSession({content, projectRoot: resolveProjectRoot()})
+    const projectRoot = resolveProjectRoot()
+    const language = await this.resolveLanguagePreference(projectRoot)
+    const envelope = await kickoffSession({content, language, projectRoot})
     this.emitToolModeEnvelope(envelope, format)
   }
+
+  /**
+   * Resolve the language preference. Daemon settings (the source of
+   * truth) take precedence; a per-project `.brv/config.json language`
+   * field acts as a fallback for users who configured language before
+   * it moved to global settings.
+   *
+   * Note on precedence: only daemon `mode: 'fixed'` short-circuits the
+   * fallback. An explicit daemon `mode: 'auto'` reads as "no opinion"
+   * and falls through to project config, so a stale project-config
+   * `fixed/X` will still win. This is intentional for the migration
+   * window: distinguishing "user explicitly chose auto" from "user
+   * never touched settings" needs raw-overrides access that the
+   * transport doesn't expose today, and the stale-config override only
+   * affects users with a pre-existing per-project `language` field.
+   * Revisit once project-config language is fully sunset.
+   */
+  private async resolveLanguagePreference(projectRoot: string): Promise<BrvConfigLanguage | undefined> {
+    const fromSettings = await readLanguageFromSettings()
+    if (fromSettings !== undefined) return fromSettings
+
+    try {
+      const config = await new ProjectConfigStore().read(projectRoot)
+      return config?.language
+    } catch {
+      return undefined
+    }
+  }
+}
+
+/**
+ * Reads the language preference from daemon settings via the same
+ * `SettingsEvents.LIST` transport every other settings consumer uses.
+ *
+ * Exported (and accepts a `DaemonClientOptions`) so tests can drive
+ * `withDaemonRetry` with a stubbed transport client. Returns `undefined`
+ * on any non-fixed mode, missing/non-string code, or daemon error —
+ * callers should treat `undefined` as "no opinion" and fall back to
+ * project config / the auto clause.
+ *
+ * Uses a tight retry budget by default (1 retry, 0ms delay) because this
+ * runs on every `brv curate` kickoff: `withDaemonRetry`'s 10× retries ×
+ * 1s default would block kickoff for ~9s when the daemon is unreachable
+ * before the catch trips the project-config fallback. The caller can
+ * override either field by passing it in `options`.
+ */
+export async function readLanguageFromSettings(
+  options?: DaemonClientOptions,
+): Promise<BrvConfigLanguage | undefined> {
+  try {
+    const response = await withDaemonRetry<SettingsListResponse>(
+      async (client) => client.requestWithAck<SettingsListResponse>(SettingsEvents.LIST),
+      {maxRetries: 1, retryDelayMs: 0, ...options},
+    )
+    const byKey = new Map(response.items.map((item) => [item.key, item.current]))
+    const mode = byKey.get(SETTINGS_KEYS.LANGUAGE_MODE)
+    const code = byKey.get(SETTINGS_KEYS.LANGUAGE_CODE)
+    if (mode !== 'fixed') return undefined
+    if (typeof code !== 'string') return undefined
+    return {code, mode: 'fixed'}
+  } catch {
+    return undefined
+  }
 }
 
 /**
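
The precedence rules documented on `resolveLanguagePreference` and `readLanguageFromSettings` can be restated as a pure function, which makes the "daemon `auto` falls through" case easy to see. This is a sketch under assumptions: `LanguagePref` approximates `BrvConfigLanguage` (whose real shape is not shown in this diff), and `fromDaemon` and `resolve` are hypothetical names with the daemon and file I/O stripped out.

```typescript
// Hypothetical approximation of BrvConfigLanguage; the real type is not shown here.
type LanguagePref = {code: string; mode: 'fixed'} | {mode: 'auto'}

// Same checks as readLanguageFromSettings: anything other than a usable
// fixed setting is treated as "no opinion".
function fromDaemon(mode: unknown, code: unknown): LanguagePref | undefined {
  if (mode !== 'fixed') return undefined
  if (typeof code !== 'string') return undefined
  return {code, mode: 'fixed'}
}

// Daemon settings first, project config as the fallback.
// `daemon === undefined` models an unreachable daemon.
function resolve(
  daemon: {code?: unknown; mode?: unknown} | undefined,
  project: LanguagePref | undefined,
): LanguagePref | undefined {
  const fromSettings = daemon === undefined ? undefined : fromDaemon(daemon.mode, daemon.code)
  return fromSettings ?? project
}

// Daemon `fixed` short-circuits the fallback.
console.log(resolve({code: 'ja', mode: 'fixed'}, {code: 'ru', mode: 'fixed'}))
// Daemon `auto` is "no opinion", so a stale project-config value still wins.
console.log(resolve({mode: 'auto'}, {code: 'ru', mode: 'fixed'}))
// Nothing configured anywhere: the caller falls back to the auto clause.
console.log(resolve(undefined, undefined))
```

The second call is the migration-window behavior described in the comment: the result is the project's `ru` setting even though the daemon explicitly says `auto`.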

@@ -82,6 +82,10 @@ export default class SettingsGet extends Command {
       this.log(`  range:   ${range}`)
     }
 
+    if (item.type === 'enum' && item.options !== undefined && item.options.length > 0) {
+      this.log(`  allowed: ${item.options.join(', ')}`)
+    }
+
     this.log(`  scope:   ${item.scope ?? 'global'}`)
   }
 
@@ -99,12 +103,14 @@ export default class SettingsGet extends Command {
     if (item.category !== undefined) payload.category = item.category
     if (item.unit !== undefined) payload.unit = item.unit
     if (item.scope !== undefined) payload.scope = item.scope
+    if (item.type === 'enum' && item.options !== undefined) payload.options = item.options
     return payload
   }
 }
 
-function renderValue(item: SettingsItemDTO, value: boolean | number): string {
+function renderValue(item: SettingsItemDTO, value: boolean | number | string): string {
   if (typeof value === 'boolean') return value ? 'true' : 'false'
+  if (typeof value === 'string') return value
   return renderInteger(item, value)
 }