-
Notifications
You must be signed in to change notification settings - Fork 453
Proj/language selection #749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
4311254
25b7a70
31c7f19
10984fd
91e76be
c19c5bf
dbe72c0
4341ed7
0163bbc
f52cc29
19c9b6b
f40e188
fe56936
aa7282d
7ee702d
dbf2509
2df0c85
f467c29
0f723c8
c68322c
a45bf98
773db4b
88164f5
c148c92
d7f3bf6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,159 @@ | ||||||||||||||||||
| /** | ||||||||||||||||||
| * BM25 tokenizer with CJK bigram segmentation. | ||||||||||||||||||
| * | ||||||||||||||||||
| * MiniSearch 7.2.0's default tokenizer splits on `\p{Z}\p{P}` (Unicode | ||||||||||||||||||
| * whitespace + punctuation). Latin / Cyrillic / Vietnamese / European | ||||||||||||||||||
| * scripts use whitespace between words and tokenize correctly. CJK scripts | ||||||||||||||||||
| * do not — a sentence like `认证系统使用JWT令牌` becomes a single token, | ||||||||||||||||||
| * so a query for `认证` against indexed CJK content returns zero matches. | ||||||||||||||||||
| * | ||||||||||||||||||
| * Empirical confirmation before this fix (MiniSearch 7.2.0): | ||||||||||||||||||
| * | ||||||||||||||||||
| * const ms = new MiniSearch({fields: ['t'], idField: 'id'}) | ||||||||||||||||||
| * ms.addAll([{id: 1, t: '认证系统使用JWT令牌'}]) | ||||||||||||||||||
| * ms.search('认证') // → [] — broken | ||||||||||||||||||
| * ms.search('Привет мир') // → matches as expected | ||||||||||||||||||
| * | ||||||||||||||||||
| * This tokenizer preserves the default behavior for whitespace-separated | ||||||||||||||||||
| * scripts and adds overlapping-bigram segmentation for CJK runs. Mixed | ||||||||||||||||||
| * Latin+CJK tokens (e.g. `JWT令牌`) split at the script boundary so the | ||||||||||||||||||
| * Latin portion stays a real word token. | ||||||||||||||||||
| * | ||||||||||||||||||
| * Wired via the top-level `tokenize` option on MiniSearch — per the | ||||||||||||||||||
| * library docs and source (`MiniSearch.js:1564-1566`), that single option | ||||||||||||||||||
| * applies at both index and query time unless `searchOptions.tokenize` | ||||||||||||||||||
| * is set, which we leave unset. | ||||||||||||||||||
| */ | ||||||||||||||||||
|
|
||||||||||||||||||
| /** | ||||||||||||||||||
| * Unicode ranges treated as CJK for the purposes of bigram segmentation. | ||||||||||||||||||
| * Anything outside these ranges is "non-CJK" and tokenizes by whitespace | ||||||||||||||||||
| * boundaries only. | ||||||||||||||||||
| * | ||||||||||||||||||
| * - `0x4E00–0x9FFF`: CJK Unified Ideographs (Chinese, Japanese kanji) | ||||||||||||||||||
| * - `0x3040–0x309F`: Hiragana | ||||||||||||||||||
| * - `0x30A0–0x30FF`: Katakana | ||||||||||||||||||
| * - `0xAC00–0xD7AF`: Hangul Syllables (Korean) | ||||||||||||||||||
| * | ||||||||||||||||||
| * CJK Extension A/B/C/… are deliberately excluded — they appear in academic | ||||||||||||||||||
| * / historical text but rarely in user content. If a user's corpus needs | ||||||||||||||||||
| * them, extend this list and bump `INDEX_SCHEMA_VERSION` in | ||||||||||||||||||
| * `search-knowledge-service.ts` so cached indexes invalidate. | ||||||||||||||||||
| */ | ||||||||||||||||||
| const CJK_RANGES: ReadonlyArray<readonly [number, number]> = [ | ||||||||||||||||||
| [0x4E_00, 0x9F_FF], | ||||||||||||||||||
| [0x30_40, 0x30_9F], | ||||||||||||||||||
| [0x30_A0, 0x30_FF], | ||||||||||||||||||
| [0xAC_00, 0xD7_AF], | ||||||||||||||||||
| ] | ||||||||||||||||||
|
danhdoan marked this conversation as resolved.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. suggestion (non-blocking, zh-TW coverage): The ranges deliberately exclude CJK Extension A ( Extension A is BMP, so
Suggested change
The maintainer note already points at |
||||||||||||||||||
|
|
||||||||||||||||||
| function isCjkCodePoint(cp: number): boolean { | ||||||||||||||||||
| for (const [lo, hi] of CJK_RANGES) { | ||||||||||||||||||
| if (cp >= lo && cp <= hi) return true | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| return false | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| /** | ||||||||||||||||||
| * Whitespace + punctuation split, matching MiniSearch's default | ||||||||||||||||||
| * `SPACE_OR_PUNCTUATION` regex. Kept verbatim so a future upstream tweak | ||||||||||||||||||
| * is easy to spot via diff. | ||||||||||||||||||
| */ | ||||||||||||||||||
| const SPACE_OR_PUNCTUATION = /[\p{Z}\p{P}]+/u | ||||||||||||||||||
|
|
||||||||||||||||||
| /** | ||||||||||||||||||
| * Split a token at boundaries between CJK and non-CJK runs. | ||||||||||||||||||
| * | ||||||||||||||||||
| * - `'JWT令牌'` → `['JWT', '令牌']` (script boundary at index 3) | ||||||||||||||||||
| * - `'认证'` → `['认证']` (single CJK run) | ||||||||||||||||||
| * - `'JWT'` → `['JWT']` (single non-CJK run) | ||||||||||||||||||
| */ | ||||||||||||||||||
| function splitAtCjkBoundary(token: string): string[] { | ||||||||||||||||||
| const segments: string[] = [] | ||||||||||||||||||
| let current = '' | ||||||||||||||||||
| let currentIsCjk: boolean | undefined | ||||||||||||||||||
|
|
||||||||||||||||||
| // Iterate by code point so any future range extension into the | ||||||||||||||||||
| // supplementary plane handles surrogate pairs correctly. The current | ||||||||||||||||||
| // four ranges are all BMP, so `for...of` is equivalent to char-by-char | ||||||||||||||||||
| // here — but cheap to be correct. | ||||||||||||||||||
| for (const ch of token) { | ||||||||||||||||||
| const cp = ch.codePointAt(0) | ||||||||||||||||||
| if (cp === undefined) continue | ||||||||||||||||||
| const charIsCjk = isCjkCodePoint(cp) | ||||||||||||||||||
|
|
||||||||||||||||||
| if (currentIsCjk === undefined) { | ||||||||||||||||||
| current = ch | ||||||||||||||||||
| currentIsCjk = charIsCjk | ||||||||||||||||||
| } else if (charIsCjk === currentIsCjk) { | ||||||||||||||||||
| current += ch | ||||||||||||||||||
| } else { | ||||||||||||||||||
| segments.push(current) | ||||||||||||||||||
| current = ch | ||||||||||||||||||
| currentIsCjk = charIsCjk | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| if (current.length > 0) segments.push(current) | ||||||||||||||||||
|
|
||||||||||||||||||
| return segments | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| /** | ||||||||||||||||||
| * Emit overlapping bigrams for a CJK run. | ||||||||||||||||||
| * | ||||||||||||||||||
| * - `'认证系统'` (4 chars) → `['认证', '证系', '系统']` | ||||||||||||||||||
| * - `'认证'` (2 chars) → `['认证']` | ||||||||||||||||||
| * - `'认'` (1 char) → `['认']` (unigram fallback so single-char tokens are searchable) | ||||||||||||||||||
| * | ||||||||||||||||||
| * Bigrams are the standard CJK IR compromise: unigrams are too noisy | ||||||||||||||||||
| * (common chars like `的` dominate scoring), trigrams are too sparse | ||||||||||||||||||
| * (miss 2-character compound matches). | ||||||||||||||||||
| */ | ||||||||||||||||||
| function cjkBigrams(run: string): string[] { | ||||||||||||||||||
| const chars = [...run] | ||||||||||||||||||
| if (chars.length <= 1) return chars | ||||||||||||||||||
|
|
||||||||||||||||||
| const grams: string[] = [] | ||||||||||||||||||
| for (let i = 0; i < chars.length - 1; i++) { | ||||||||||||||||||
| grams.push(chars[i] + chars[i + 1]) | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| return grams | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| /** | ||||||||||||||||||
| * Tokenize text for BM25 indexing and querying. | ||||||||||||||||||
| * | ||||||||||||||||||
| * Algorithm: | ||||||||||||||||||
| * 1. Split on Unicode whitespace + punctuation (matches MiniSearch default). | ||||||||||||||||||
| * 2. For each resulting token, split at CJK ↔ non-CJK script boundaries. | ||||||||||||||||||
| * 3. For non-CJK segments, emit the segment as-is. | ||||||||||||||||||
| * 4. For CJK segments, emit overlapping bigrams. | ||||||||||||||||||
| * | ||||||||||||||||||
| * The result is the union — Latin / Cyrillic / Vietnamese behave exactly | ||||||||||||||||||
| * as the MiniSearch default, while CJK runs become searchable. | ||||||||||||||||||
| */ | ||||||||||||||||||
| export function tokenizeWithCjk(text: string): string[] { | ||||||||||||||||||
| const out: string[] = [] | ||||||||||||||||||
|
|
||||||||||||||||||
| for (const wsToken of text.split(SPACE_OR_PUNCTUATION)) { | ||||||||||||||||||
| if (wsToken.length === 0) continue | ||||||||||||||||||
|
|
||||||||||||||||||
| for (const segment of splitAtCjkBoundary(wsToken)) { | ||||||||||||||||||
| if (segment.length === 0) continue | ||||||||||||||||||
|
|
||||||||||||||||||
| // `splitAtCjkBoundary` returns single-script segments, so the | ||||||||||||||||||
| // first code point's classification applies to the whole segment. | ||||||||||||||||||
| const firstCp = segment.codePointAt(0) | ||||||||||||||||||
| if (firstCp !== undefined && isCjkCodePoint(firstCp)) { | ||||||||||||||||||
| out.push(...cjkBigrams(segment)) | ||||||||||||||||||
| } else { | ||||||||||||||||||
| out.push(segment) | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| return out | ||||||||||||||||||
| } | ||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,11 @@ | ||
| import {Args, Command, Flags} from '@oclif/core' | ||
|
|
||
| import type {BrvConfigLanguage} from '../../../server/core/domain/entities/brv-config.js' | ||
| import type {CurateSessionEnvelope} from '../../lib/curate-session.js' | ||
|
|
||
| import {ProjectConfigStore} from '../../../server/infra/config/file-config-store.js' | ||
| import {SettingsEvents, type SettingsListResponse} from '../../../shared/transport/events/settings-events.js' | ||
| import {SETTINGS_KEYS} from '../../../shared/types/settings-keys.js' | ||
| import { | ||
| continueSession, | ||
| deleteCurateResponseFile, | ||
|
|
@@ -122,17 +126,19 @@ Bad examples: | |
| protected async dispatchContinuation(args: { | ||
| confirmOverwrite: boolean | ||
| format: 'json' | 'text' | ||
| language?: BrvConfigLanguage | ||
| projectRoot: string | ||
| response: string | ||
| sessionId: string | ||
| }): Promise<CurateSessionEnvelope> { | ||
| const {confirmOverwrite, format, projectRoot, response, sessionId} = args | ||
| const {confirmOverwrite, format, language, projectRoot, response, sessionId} = args | ||
| let envelope: CurateSessionEnvelope | undefined | ||
| await withDaemonRetry(async (client) => { | ||
| envelope = await continueSession({ | ||
| client, | ||
| confirmOverwrite, | ||
| format, | ||
| language, | ||
| projectRoot, | ||
| response, | ||
| sessionId, | ||
|
|
@@ -466,11 +472,16 @@ Bad examples: | |
| // this path so the agent retains the source it already paid an | ||
| // LLM call to produce. | ||
| const confirmOverwrite = flags.overwrite ?? false | ||
| // Read fresh per continuation — mirrors kickoff so a mid-session | ||
| // language change (rare) is honored on the next correction prompt. | ||
| // The same fallback chain applies (daemon settings → project config). | ||
| const language = await this.resolveLanguagePreference(projectRoot) | ||
| let dispatchEnvelope: CurateSessionEnvelope | ||
| try { | ||
| dispatchEnvelope = await this.dispatchContinuation({ | ||
| confirmOverwrite, | ||
| format, | ||
| language, | ||
| projectRoot, | ||
| response, | ||
| sessionId, | ||
|
|
@@ -550,9 +561,74 @@ Bad examples: | |
| return | ||
| } | ||
|
|
||
| const envelope = await kickoffSession({content, projectRoot: resolveProjectRoot()}) | ||
| const projectRoot = resolveProjectRoot() | ||
| const language = await this.resolveLanguagePreference(projectRoot) | ||
| const envelope = await kickoffSession({content, language, projectRoot}) | ||
| this.emitToolModeEnvelope(envelope, format) | ||
| } | ||
|
|
||
| /** | ||
| * Resolve the language preference. Daemon settings (the source of | ||
| * truth) take precedence; a per-project `.brv/config.json language` | ||
| * field acts as a fallback for users who configured language before | ||
| * it moved to global settings. | ||
| * | ||
| * Note on precedence: only daemon `mode: 'fixed'` short-circuits the | ||
| * fallback. An explicit daemon `mode: 'auto'` reads as "no opinion" | ||
| * and falls through to project config, so a stale project-config | ||
| * `fixed/X` will still win. This is intentional for the migration | ||
| * window — distinguishing "user explicitly chose auto" from "user | ||
| * never touched settings" needs raw-overrides access that the | ||
| * transport doesn't expose today, and the bug only manifests for | ||
| * users with a pre-existing per-project `language` field. Revisit | ||
| * once project-config language is fully sunset. | ||
| */ | ||
| private async resolveLanguagePreference(projectRoot: string): Promise<BrvConfigLanguage | undefined> { | ||
| const fromSettings = await readLanguageFromSettings() | ||
| if (fromSettings !== undefined) return fromSettings | ||
|
|
||
| try { | ||
| const config = await new ProjectConfigStore().read(projectRoot) | ||
| return config?.language | ||
| } catch { | ||
| return undefined | ||
| } | ||
| } | ||
|
Comment on lines
+586
to
+596
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. issue (non-blocking): Precedence bug: an explicit daemon Worth at least logging the override edge-case for forensics (the docstring already says "revisit once project-config language is fully sunset"), and tracking the project-config sunset as a follow-up in the PR's "Out of scope" list (already noted). Confirm the WebUI Settings → Language → "auto" toggle messaging makes it clear that a project-config override may still apply mid-migration, or users will report "I clicked auto but it's still writing in Russian" until the sunset lands.
Comment on lines
+570
to
+596
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. issue (non-blocking — but worth tracking): With That's the same precedence quirk the prior auto-reviewer flagged (the docstring at L575-585 calls it out and defers it to the project-config sunset). The new wrinkle is that the user has no way to clear the stale state — there's no Two ways forward:
(1) is a small, surgical change behind a unit test; (2) needs a migration sweep but cleans up the legacy code path. (1) makes (2) safer to land later. Worth at least surfacing on the WebUI Language → auto control with a one-line "if you previously configured a per-project language, it may still apply" hint until (1) lands, to avoid the "I clicked auto but it's still writing in Russian" report. |
||
| } | ||
|
|
||
| /** | ||
| * Reads the language preference from daemon settings via the same | ||
| * `SettingsEvents.LIST` transport every other settings consumer uses. | ||
| * | ||
| * Exported (and accepts a `DaemonClientOptions`) so tests can drive | ||
| * `withDaemonRetry` with a stubbed transport client. Returns `undefined` | ||
| * on any non-fixed mode, missing/non-string code, or daemon error — | ||
| * callers should treat `undefined` as "no opinion" and fall back to | ||
| * project config / the auto clause. | ||
| * | ||
| * Uses a tight retry budget by default (1 retry, 0ms delay) because this | ||
| * runs on every `brv curate` kickoff: `withDaemonRetry`'s 10× retries × | ||
| * 1s default would block kickoff for ~9s when the daemon is unreachable | ||
| * before the catch trips the project-config fallback. The caller can | ||
| * override either field by passing it in `options`. | ||
| */ | ||
| export async function readLanguageFromSettings( | ||
| options?: DaemonClientOptions, | ||
| ): Promise<BrvConfigLanguage | undefined> { | ||
| try { | ||
| const response = await withDaemonRetry<SettingsListResponse>( | ||
| async (client) => client.requestWithAck<SettingsListResponse>(SettingsEvents.LIST), | ||
| {maxRetries: 1, retryDelayMs: 0, ...options}, | ||
| ) | ||
| const byKey = new Map(response.items.map((item) => [item.key, item.current])) | ||
| const mode = byKey.get(SETTINGS_KEYS.LANGUAGE_MODE) | ||
| const code = byKey.get(SETTINGS_KEYS.LANGUAGE_CODE) | ||
| if (mode !== 'fixed') return undefined | ||
| if (typeof code !== 'string') return undefined | ||
| return {code, mode: 'fixed'} | ||
| } catch { | ||
| return undefined | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (blocking — user-facing): The release notes describe
brv config set/get language.*as the public surface, but commitd7f3bf6("removebrv configCLI — settings is the only language surface") deleted the entirebrv configcommand in this same PR. A user who reads these notes and runsbrv config set language.mode autowill hitCommand not found— and so will the restoration recipe at the bottom (brv config set language.code enthenbrv config set language.mode fixed), which is the headline "don't lose your prior behavior" path for the people most likely to care.The canonical surface is now
brv settings set language.mode|code(andbrv settings get …to read back). The supported-codes list and the restoration recipe also need updating.Suggested edit:
Also note that the PR body still references the deprecation-gate cases (Case 6) as part of the harness migration; per the
d7f3bf6commit message those were also removed (language-selection-e2e went 22 → 20). Worth a quick sweep on the PR description so future archaeology lines up with the final commit.