2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -51,7 +51,7 @@ jobs:
run: pnpm smoke

- name: Verify generated dist is committed
run: git diff --exit-code -- dist
run: git diff --exit-code -- packages/mimir/dist packages/mimir-tts/dist

- name: Verify npm package metadata
run: pnpm package:check
14 changes: 10 additions & 4 deletions .github/workflows/npm-publish.yml
@@ -47,7 +47,8 @@ jobs:

- name: Verify version input
run: |
test "$(node -p "require('./package.json').version")" = "${{ inputs.version }}"
test "$(node -p "require('./packages/mimir/package.json').version")" = "${{ inputs.version }}"
test "$(node -p "require('./packages/mimir-tts/package.json').version")" = "${{ inputs.version }}"

- name: Install dependencies
run: pnpm install --frozen-lockfile
@@ -68,7 +69,7 @@ jobs:
run: pnpm smoke

- name: Verify generated dist is committed
run: git diff --exit-code -- dist
run: git diff --exit-code -- packages/mimir/dist packages/mimir-tts/dist

- name: Verify npm package metadata
run: pnpm package:check
@@ -82,7 +83,12 @@ jobs:
name: mimir-release-${{ inputs.version }}
path: release-artifacts/

- name: Publish
run: npm publish --access public --provenance
- name: Publish Mimir TTS
run: pnpm --dir packages/mimir-tts publish --access public --provenance --no-git-checks
env:
NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

- name: Publish Mimir
run: pnpm --dir packages/mimir publish --access public --provenance --no-git-checks
env:
NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
11 changes: 11 additions & 0 deletions .gitignore
@@ -1,9 +1,20 @@
node_modules/
coverage/
.env
.env.*
.DS_Store
private/**
.kb/
.mimir/
*.tgz
release-artifacts/

# Tracked synthetic examples. Keep generated example runtime state ignored.
!packages/mimir/examples/
!packages/mimir/examples/**/
!packages/mimir/examples/**/.kb/
!packages/mimir/examples/**/.kb/config.json
!packages/mimir/examples/**/.kb/sources.txt
packages/mimir/examples/**/.kb/storage/
packages/mimir/examples/**/.kb/access.log
packages/mimir/examples/**/.mimir/
74 changes: 62 additions & 12 deletions AGENTS.md
@@ -14,28 +14,78 @@
- `kb init` and `kb install-skill` must keep generated local Mimir state ignored in target
repositories. By default, add `.kb/`, `.mimir/`, and private raw-document paths to the
target repository `.gitignore`.
- Keep confidentiality features low-friction: local-only network policy, redaction before
indexing, metadata-only access logs, bounded MCP retrieval, and `security-audit` should work
from default config.
- Keep confidentiality features low-friction: local-hash retrieval by default, optional
Transformers.js embeddings with remote model loading disabled by default, redaction before
indexing, metadata-only access logs, bounded MCP retrieval, configurable text-extension ingestion,
and `security-audit` should work from default config.
- Keep public positioning focused on sovereign local RAG for confidential datasets and AI agents.
Avoid claiming universal binary-file support; unsupported proprietary formats need extraction or
dedicated parsers.
- Keep optional audio summaries separate from core ingestion/query behavior. The
`mimir-audio-summary` skill must prefer `kb audio` / `@jcode.labs/mimir-tts`, support offline
model loading, and keep generated audio under ignored local Mimir state.
- Keep the repository as a simple pnpm workspace monorepo. Add Turbo only if multiple packages or
apps start needing task caching/orchestration beyond `pnpm --filter`.
- Keep Mimir core free of Ollama. `embeddingProvider: "local-hash"` supports ingestion, search, MCP,
and cited retrieval without a model server, but it must not be described as equivalent to semantic
retrieval. `embeddingProvider: "transformers"` is the optional semantic embedding path.
- Keep `packages/mimir/examples/sovereign-rag-demo` synthetic and safe to commit. It exists for
package/user testing only; never place real confidential documents there.
- Use Context7 before changing dependencies or public APIs that rely on external libraries.
- Run `pnpm validate` before opening a release pull request or publishing. It covers
Biome, TypeScript, Vitest, build output, production CLI/MCP smoke tests, and npm package
metadata.
- Do not publish from a local machine or direct push to `main`. npm releases must go through
the protected manual `Publish npm` GitHub Actions workflow after `main` has green CI.
the protected manual `Publish npm` GitHub Actions workflow after `main` has green CI. The workflow
publishes `@jcode.labs/mimir-tts` first, then `@jcode.labs/mimir`.

## Coding Conventions

General principles (KISS, DRY, YAGNI, SOLID) as applied in this codebase. Match the surrounding style.

- One responsibility per module. The ingest pipeline is split on purpose: `files` discovers,
`parsing` extracts, `redaction` strips, `chunking` splits, `embeddings` vectorizes, `store`
persists, `query` retrieves. Add logic to the module that owns the concern, or a new small module.
- No duplicated logic. Reuse existing helpers (`loadConfig`, `embedText`/`embedTexts`,
`openRowsTable`, `redactText`, `supportedExtensions`, `recordAccess`); extract instead of copying.
`embedText` delegating to `embedTexts` is the reference pattern.
- No dead or obsolete code. Delete replaced code, unused exports, and commented-out blocks in the
same change; a deletion must cover both source and the regenerated package `dist/`.
- No magic strings or numbers. Name meaningful literals as constants, and put shared paths, provider
defaults, and ignore constants in `packages/mimir/src/defaults.ts` rather than copying them across
modules.
- Validate at the boundary, narrow inside. Use Zod at external edges (config in `config.ts`, MCP
inputs in `mcp.ts`) and CLI parsers (`parsePositiveInt`); trust the types past that point.
- Type-guard instead of casting. Prefer runtime guards over `as`/`!` (`hasToList`, `isNumberArray`,
`isNumberMatrix`); LanceDB row casts at the `store`/`query` driver boundary are the only exception.
- Named exports only; keep the public surface explicit in `index.ts`. Functions stay small and pure;
private helpers sit below the exported function in the same file.
- Comments explain why, not what; the codebase is near comment-free. Only the CLI (`cli.ts`) writes
to stdout/stderr — library, MCP, and pipeline code return data, never log.
- YAGNI: no options, providers, or abstractions ahead of a real need.
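
The "type-guard instead of casting" rule above can be sketched as follows. The guard names `isNumberArray` and `isNumberMatrix` come from the list; their bodies and the `toVector` helper are assumptions made for illustration, not the package's actual code.

```typescript
// Hypothetical sketch: guard names are from AGENTS.md, the bodies are assumed.

/** True when `value` is an array whose elements are all finite numbers. */
function isNumberArray(value: unknown): value is number[] {
  return (
    Array.isArray(value) &&
    value.every((item) => typeof item === "number" && Number.isFinite(item))
  )
}

/** True when `value` is an array of number arrays (one vector per row). */
function isNumberMatrix(value: unknown): value is number[][] {
  return Array.isArray(value) && value.every((row) => isNumberArray(row))
}

/** Hypothetical boundary helper: narrow an untyped result or fail loudly. */
function toVector(raw: unknown): number[] {
  if (!isNumberArray(raw)) {
    throw new TypeError("Expected an embedding vector of finite numbers")
  }
  // No `as number[]` needed: the guard already narrowed `raw`.
  return raw
}
```

Past the guard, callers work with `number[]` and need neither `as` nor `!`, which is the point of narrowing once at the boundary.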

## Architecture

- `src/cli.ts` exposes the `kb` CLI.
- `src/config.ts` resolves `.kb/config.json` from the target repository.
- `src/ingest.ts` parses supported files, chunks text, embeds chunks, and rebuilds the
- `packages/mimir` is the core package published as `@jcode.labs/mimir`.
- `packages/mimir/src/cli.ts` exposes the `kb` CLI.
- `packages/mimir/src/config.ts` resolves `.kb/config.json` from the target repository.
- `packages/mimir/src/defaults.ts` owns shared default paths, provider defaults, and generated-state ignore
constants. Keep config/init/security/gitignore aligned through this module instead of copying
literals.
- `packages/mimir/src/ingest.ts` parses supported files, chunks text, embeds chunks, and rebuilds the
local LanceDB table.
- `src/query.ts` performs vector search and local Ollama answer synthesis.
- `src/mcp.ts` exposes Mimir as an MCP stdio server for agents.
- `src/gitignore.ts` owns target-repository `.gitignore` entries for local generated Mimir
- `packages/mimir/src/query.ts` performs vector search and returns cited retrieval context; LLM synthesis belongs
outside Mimir core.
- `packages/mimir/src/mcp.ts` exposes Mimir as an MCP stdio server for agents.
- `packages/mimir-tts` is the standalone JS/ONNX TTS package used by `kb audio`.
- `packages/mimir/src/gitignore.ts` owns target-repository `.gitignore` entries for local generated Mimir
state.
- `src/security.ts`, `src/network.ts`, `src/redaction.ts`, and `src/access-log.ts` own the
- `packages/mimir/src/security.ts`, `packages/mimir/src/redaction.ts`, and
`packages/mimir/src/access-log.ts` own the
privacy and confidentiality hardening layer.
- `skills/mimir/SKILL.md` is the bundled portable agent skill.
- `packages/mimir/skills/mimir/SKILL.md` is the bundled portable agent skill.
- `packages/mimir/skills/mimir-audio-summary/SKILL.md` is the optional bundled audio-summary skill.
- `packages/mimir/examples/sovereign-rag-demo` is the tracked synthetic test workspace for manual
and package validation.
- `.kb/`, `.mimir/`, and project `private/` folders are local user data or generated agent
state in target repositories and must not be committed.
25 changes: 22 additions & 3 deletions CHANGELOG.md
@@ -1,10 +1,29 @@
# Changelog

## 0.4.0 - 2026-06-28

- Reposition Mimir as sovereign local RAG for confidential datasets and AI agents.
- Expand default ingestion to common text, Office/OpenDocument, data, config, log, and source-code
file types.
- Add `includeExtensions` / `KB_INCLUDE_EXTENSIONS` for custom UTF-8 text file extensions.
- Add the optional `mimir-audio-summary` bundled skill for confidential audio summaries.
- Install both the main Mimir skill and optional audio-summary skill with `kb install-skill`.
- Improve agent guidance for deep multi-query retrieval before synthesis.
- Make Mimir core retrieval-only: `kb ask` now returns cited context for external agents or LLMs
instead of generating answers internally.
- Add optional Transformers.js semantic embeddings through `embeddingProvider: "transformers"`.
- Remove Ollama providers and keep `embeddingProvider: "local-hash"` as the no-model default.
- Move the repository to a simple pnpm workspace monorepo without adding Turbo.
- Move the core `@jcode.labs/mimir` package into `packages/mimir`.
- Add `@jcode.labs/mimir-tts` for plug-and-play JS/ONNX WAV rendering without Python or ffmpeg.
- Add `kb audio` and update the audio-summary skill to use Mimir TTS before advanced fallback
engines.

## 0.3.0 - 2026-06-28

- Add confidentiality hardening defaults: local-only Ollama network policy, built-in
redaction before indexing, metadata-only access logs, and bounded MCP retrieval.
- Add `kb security-audit` for zero-telemetry, network, redaction, gitignore, storage, and
- Add confidentiality hardening defaults: built-in redaction before indexing, metadata-only access
logs, and bounded MCP retrieval.
- Add `kb security-audit` for zero-telemetry, provider, redaction, gitignore, storage, and
MCP posture checks.
- Add `kb destroy-index --yes` to remove generated vector indexes.
- Add release verification artifacts: npm tarball, SHA256 checksums, SBOM, and manifest.
97 changes: 97 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,97 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

`AGENTS.md` is the authoritative source for shared rules — working rules, coding conventions, and
high-level architecture. Read it first. This file adds only the Claude Code operational details and
non-obvious traps that matter when editing here, without duplicating `AGENTS.md`.

## Commands

```bash
pnpm build # builds packages/mimir-tts, then packages/mimir; package dist is committed
pnpm check # typecheck only (tsc --noEmit)
pnpm lint # Biome CI (format + lint check, no writes)
pnpm lint:fix # Biome auto-fix
pnpm format # Biome format --write
pnpm test # vitest run for packages/mimir-tts, then packages/mimir
pnpm smoke # build production CLI + MCP smoke test (scripts/smoke.mjs)
pnpm validate # full release gate: lint + check + test + build + smoke + package:check + release:artifacts
```

Run a single core test file: `pnpm --filter @jcode.labs/mimir exec vitest run src/config.test.ts`
Run a single core test by name: `pnpm --filter @jcode.labs/mimir exec vitest run -t "applies env overrides"`
Run only the TTS package tests: `pnpm --filter @jcode.labs/mimir-tts test`

Tests are colocated as `packages/*/src/*.test.ts` and run on the TypeScript sources.

## Committed `dist/` — critical

`packages/mimir/dist/` and `packages/mimir-tts/dist/` are checked into Git. CI enforces
`git diff --exit-code -- packages/mimir/dist packages/mimir-tts/dist`. After any change under
`packages/mimir/src/` or `packages/mimir-tts/src/`, run `pnpm build` and commit the regenerated
output in the same commit, or CI fails. This is the single easiest mistake to make in this repo.

## Naming map (the package has several names on purpose)

- Product / core package: **Mimir**, published as `@jcode.labs/mimir` from `packages/mimir`.
- TTS package: **Mimir TTS**, published as `@jcode.labs/mimir-tts`.
- CLI binary: **`kb`** (`packages/mimir/bin.kb` -> `packages/mimir/dist/cli.js`). Commands: `init`,
`ingest`, `search`, `ask`, `audit`, `status`, `security-audit`, `destroy-index`, `audio`,
`serve-mcp`, `skill-path`, `install-skill`.
- TTS CLI binary: **`mimir-tts`** (`packages/mimir-tts/dist/cli.js`). Commands: `doctor`, `render`.
- Project config/state in the target repo: **`.kb/`** (`config.json`, `sources.txt`, `access.log`,
`storage/`), raw documents in **`private/`**, agent kit in **`.mimir/`**.
- Environment overrides: **`KB_*`** (e.g. `KB_EMBEDDING_PROVIDER`, `KB_CHUNK_SIZE`).
- MCP tools exposed to agents: **`mimir_*`** (`mimir_status`, `mimir_search`, `mimir_ask`,
`mimir_audit`, `mimir_security_audit`).

## Architecture and data flow

This is a pnpm workspace monorepo with the core package in `packages/mimir` and TTS in
`packages/mimir-tts`. Do not add Turbo unless `pnpm --filter` stops being enough.

The core package is an ESM-only TypeScript library + CLI + MCP server. Same core, three entry
points: `packages/mimir/src/cli.ts` (commander), `packages/mimir/src/index.ts` (public library
exports), `packages/mimir/src/mcp.ts` (stdio MCP server).

The ingest pipeline (`packages/mimir/src/ingest.ts`) chains single-responsibility modules:
`files.ts` (discover supported files via fast-glob, with sha256 checksums) →
`parsing.ts` (extract text per format: PDF/Office/HTML/etc.) →
`redaction.ts` (strip secrets/PII *before* anything is embedded) →
`chunking.ts` (split into overlapping chunks) →
`embeddings.ts` (vectorize) → `store.ts` (LanceDB). `query.ts` embeds the query and runs vector
search; `ask` returns cited passages only (no LLM synthesis in core).
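
The stage order above can be sketched in memory as follows. This is a minimal illustration under stated assumptions: `redactText` and `embedTexts` reuse helper names mentioned in `AGENTS.md`, but every body here, the `chunkText` helper, the email-only redaction rule, and the one-number "embedding" are hypothetical stand-ins rather than Mimir's code.

```typescript
// Hypothetical in-memory sketch of the stage order; none of these bodies are Mimir's code.

type SourceFile = { path: string; text: string }
type Chunk = { path: string; index: number; text: string }
type Row = Chunk & { vector: number[] }

// Assumed redaction rule for illustration: mask anything that looks like an email address.
function redactText(text: string): string {
  return text.replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[REDACTED_EMAIL]")
}

// Split into fixed-size chunks that overlap by `overlap` characters.
function chunkText(file: SourceFile, size: number, overlap: number): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("Require size > 0 and 0 <= overlap < size")
  }
  const chunks: Chunk[] = []
  for (let start = 0; start < file.text.length; start += size - overlap) {
    const text = file.text.slice(start, start + size)
    chunks.push({ path: file.path, index: chunks.length, text })
    // Stop once this chunk reaches the end, so no overlap-only tail is emitted.
    if (start + size >= file.text.length) break
  }
  return chunks
}

// Stand-in embedding: a real provider would return a fixed-width vector per text.
function embedTexts(texts: string[]): number[][] {
  return texts.map((text) => [text.length])
}

// Redaction runs before chunking and embedding, so raw secrets never reach the rows.
function ingest(files: SourceFile[], size = 40, overlap = 10): Row[] {
  const redacted = files.map((file) => ({ ...file, text: redactText(file.text) }))
  const chunks = redacted.flatMap((file) => chunkText(file, size, overlap))
  const vectors = embedTexts(chunks.map((chunk) => chunk.text))
  return chunks.map((chunk, i) => ({ ...chunk, vector: vectors[i] ?? [] }))
}
```

Keeping each stage a small function over plain data mirrors the single-responsibility split described above and makes the redact-before-embed ordering easy to test in isolation.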

`packages/mimir-tts` is a separate ESM package that uses Transformers.js text-to-speech to render
WAV files without Python or ffmpeg. Core `kb audio` imports it dynamically.

Key behaviors to keep in mind before editing:

- **Config resolution is caller-relative.** `loadConfig` walks up from `cwd` looking for
`.kb/config.json` (`findProjectRoot`). The package must resolve project data from the caller's
working directory, never from its own install path. Zod validates config; `KB_*` env vars override.
- **Two embedding providers, not interchangeable at runtime.** `local-hash` (default) is a 384-dim
sha256 lexical embedding — fully offline, no model, *not semantic*. `transformers` lazily
`import()`s `@huggingface/transformers` with `allowRemoteModels` off by default. The two produce
different vectors, so **switching providers requires a full re-ingest**.
- **Ingest always full-rebuilds** the LanceDB table (`mode: "overwrite"`). The `--rebuild` flag is a
no-op kept for compatibility. There is no incremental indexing; `audit` only *reports* missing/stale
files against the current index.
- **Privacy is a feature, not a side effect.** Redaction runs before embedding, the access log stores
query hashes/metadata only (`access-log.ts`), MCP top-K is clamped to `mcpMaxTopK`, and
`gitignore.ts` keeps `.kb/`, `.mimir/`, `private/**` ignored in target repos. `security-audit`
reports this posture and `--strict` exits non-zero on warnings. Preserve these guarantees.
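
To make the "lexical, not semantic" distinction concrete, here is a minimal feature-hashing sketch. Only the 384 dimensions and the use of sha256 come from the description above; the tokenizer, the bucket selection, the normalization, and the names `hashEmbed` and `dot` are assumptions for illustration and are not taken from `embeddings.ts`.

```typescript
import { createHash } from "node:crypto"

// Assumption: 384 matches the dimension stated above; the tokenizer, bucketing, and
// normalization below are illustrative and not taken from Mimir's source.
const DIMENSIONS = 384

function hashEmbed(text: string, dimensions: number = DIMENSIONS): number[] {
  const vector = new Array<number>(dimensions).fill(0)
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) ?? []
  for (const token of tokens) {
    // The first four digest bytes pick a bucket; identical tokens always share one.
    const bucket = createHash("sha256").update(token).digest().readUInt32BE(0) % dimensions
    vector[bucket] = (vector[bucket] ?? 0) + 1
  }
  // L2-normalize so the dot product of two vectors is their cosine similarity.
  const norm = Math.hypot(...vector)
  return norm === 0 ? vector : vector.map((value) => value / norm)
}

// Dot product; equals cosine similarity for the unit vectors returned by hashEmbed.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0)
}
```

In a scheme like this, texts that share tokens score high and synonyms score no better than unrelated words, which is why such vectors should not be described as semantic. Vectors from a learned model live in a different space altogether, so an index built with one provider cannot be queried with the other without re-ingesting.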

Coding conventions (KISS, DRY, YAGNI, SOLID as applied here) live in `AGENTS.md`.

## Toolchain constraints

- Strict TypeScript with `noUncheckedIndexedAccess` and `exactOptionalPropertyTypes`; module mode is
`NodeNext`, so relative imports use `.js` extensions even from `.ts` sources.
- Biome is the formatter and linter (not ESLint/Prettier): 2-space indent, width 100, double quotes,
semicolons as-needed, trailing commas all.
- Conventional Commits are enforced by commitlint in CI.
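
As a rough illustration of what the two strict flags above ask of everyday code, consider the sketch below. The names `firstOrThrow`, `Options`, and `buildOptions` are hypothetical and do not exist in the repository.

```typescript
// Under noUncheckedIndexedAccess, `items[0]` has type `T | undefined`,
// so the missing-element case must be handled before returning `T`.
function firstOrThrow<T>(items: readonly T[]): T {
  const first = items[0]
  if (first === undefined) {
    throw new RangeError("Expected at least one item")
  }
  return first
}

// Under exactOptionalPropertyTypes, `label?: string` means "may be absent",
// not "may be explicitly undefined".
type Options = { label?: string }

function buildOptions(label: string | undefined): Options {
  // Omit the key entirely rather than writing `{ label }` with an undefined value.
  return label === undefined ? {} : { label }
}
```

Both functions also compile with the flags off; the flags only reject the shortcut versions (returning `items[0]` directly, or assigning `{ label: undefined }`).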

Release policy (no local publish, no direct push to `main`, protected `Publish npm` workflow) lives
in `AGENTS.md`. The workflow publishes `@jcode.labs/mimir-tts` before `@jcode.labs/mimir`.