11 changes: 11 additions & 0 deletions AGENTS.md
@@ -32,6 +32,12 @@
Transformers.js WAV path for offline/confidential rendering, use the Edge MP3 path for global
Voice Forge quality only when online TTS is explicitly acceptable, and keep generated audio under
ignored local Mimir state.
- Keep report generation separate from core retrieval. The `mimir-markdown-report` skill writes cited
Markdown reports under ignored `.mimir/reports/` by default and must distinguish evidence,
inference, uncertainty, missing documents, and professional-review items.
- Ingestion must be explicit about files it did not index. Preserve `kb audit --unsupported`,
unsupported-extension summaries, secret-like file skipping, max file size limits, and checksum-based
stale detection.
- Keep the repository as a simple pnpm workspace monorepo. Add Turbo only if multiple packages or
apps start needing task caching/orchestration beyond `pnpm --filter`.
- Keep Mimir core free of Ollama. `embeddingProvider: "local-hash"` supports ingestion, search, MCP,
@@ -95,6 +101,11 @@ General principles (KISS, DRY, YAGNI, SOLID) as applied in this codebase. Match
privacy and confidentiality hardening layer.
- `packages/mimir/skills/mimir/SKILL.md` is the bundled portable agent skill.
- `packages/mimir/skills/mimir-audio-summary/SKILL.md` is the optional bundled audio-summary skill.
- `packages/mimir/skills/mimir-markdown-report/SKILL.md` is the optional bundled Markdown-report
skill.
- `kb setup` must keep generating agent-specific MCP helpers for easy local use:
`.mimir/claude-mcp-server.json` for `claude mcp add-json` and `.mimir/codex-mcp.toml` for Codex
config layers.
- `packages/mimir/examples/sovereign-rag-demo` is the tracked synthetic test workspace for manual
and package validation.
- `.kb/`, `.mimir/`, and project `private/` folders are local user data or generated agent
121 changes: 109 additions & 12 deletions README.md
@@ -64,8 +64,7 @@ Early public package. APIs may evolve before `1.0.0`.
- Give Claude, Codex, Cursor, internal assistants, or other MCP-compatible tools the same private
retrieval layer.
- Retrieve grounded local evidence through CLI, library calls, MCP tools, or bundled agent skills.
- Optionally create listenable MP3 or WAV summaries with `kb audio`, `@jcode.labs/mimir-tts`, and
the bundled `mimir-audio-summary` skill.
- Optionally create listenable MP3/WAV summaries or cited Markdown reports with bundled skills.

Mimir is not a hosted SaaS, not a remote vector database, and not a certified high-assurance system.
For regulated or state-grade environments, pair it with encrypted disks, controlled machines,
@@ -88,6 +87,7 @@ context.
| Prepare meetings or decisions | "Give me a one-page briefing.", "What is missing before deciding?", "List action items and evidence." |
| Ask questions over offline documents | "Which files mention local-only operation?", "What evidence supports this claim?" |
| Generate audio briefings | "Create a listenable high-quality or offline summary of the current dossier." |
| Generate Markdown reports | "Write a cited local report with findings, risks, next actions, and sources." |

## Requirements

@@ -103,6 +103,8 @@ context.
external `edge-tts` CLI and render with `--engine edge`. For confidential or air-gapped content,
use the Transformers.js WAV path with `--engine transformers --offline`; it does not require
Python, ffmpeg, Piper, XTTS, or a local server.
- Optional Markdown reports use the bundled `mimir-markdown-report` skill and should stay under
ignored `.mimir/reports/` unless explicitly sanitized for sharing.

## Install

@@ -145,12 +147,15 @@ private/ # raw documents to ingest
.kb/sources.txt # optional extra source paths
.mimir/skills/mimir/SKILL.md # portable agent skill
.mimir/skills/mimir-audio-summary/SKILL.md
.mimir/mcp.json # MCP server config snippet
.mimir/skills/mimir-markdown-report/SKILL.md
.mimir/mcp.json # generic MCP server config snippet
.mimir/claude-mcp-server.json # Claude Code add-json payload
.mimir/codex-mcp.toml # Codex config.toml snippet
.gitignore # ignores private/**, .kb/, and .mimir/
```

It detects the repository package manager and writes `.mimir/mcp.json` with the right command, such
as `pnpm exec kb serve-mcp`, `npx kb serve-mcp`, `yarn exec kb serve-mcp`, or `bunx kb serve-mcp`.
It detects the repository package manager and writes the MCP helper files with the right command:
`pnpm exec kb serve-mcp`, `npx kb serve-mcp`, `yarn exec kb serve-mcp`, or `bunx kb serve-mcp`.
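
As a rough illustration of the mapping described above, the following TypeScript sketch picks a launch command from the lockfiles found at the repository root. Lockfile-based detection, the npm fallback, and the names `detectPackageManager` and `mcpCommand` are assumptions made for illustration; this diff does not show how Mimir actually detects the package manager.

```typescript
// Hypothetical sketch: choose the MCP launch command for a repository.
type PackageManager = "pnpm" | "npm" | "yarn" | "bun";

// Assumption: lockfile names are the detection signal, checked in this order.
const LOCKFILES: Array<[string, PackageManager]> = [
  ["pnpm-lock.yaml", "pnpm"],
  ["yarn.lock", "yarn"],
  ["bun.lockb", "bun"],
  ["bun.lock", "bun"],
  ["package-lock.json", "npm"],
];

// Commands exactly as listed in the README text above.
const MCP_COMMANDS: Record<PackageManager, string> = {
  pnpm: "pnpm exec kb serve-mcp",
  npm: "npx kb serve-mcp",
  yarn: "yarn exec kb serve-mcp",
  bun: "bunx kb serve-mcp",
};

function detectPackageManager(rootFiles: string[]): PackageManager {
  for (const [lockfile, manager] of LOCKFILES) {
    if (rootFiles.includes(lockfile)) return manager;
  }
  return "npm"; // assumed fallback when no lockfile is present
}

function mcpCommand(rootFiles: string[]): string {
  return MCP_COMMANDS[detectPackageManager(rootFiles)];
}
```

Passing the file names in as an array keeps the sketch free of filesystem access, so the mapping can be checked without touching a real repository.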

Check readiness at any time:

@@ -192,7 +197,15 @@ pnpm exec kb ingest
pnpm exec kb doctor
```

When the index is ready, `kb doctor` prints `ready=true`.
When the index is ready, `kb doctor` prints `ready=true`. `kb ingest` and `kb audit` also report
files that were discovered but not indexed because the type is unsupported, the file is too large,
or the file name looks like a secret/private key.
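
The three skip reasons can be pictured with a small TypeScript sketch. The function name `skipReason`, the order of the checks, and the lowercase extension matching are assumptions for illustration; the secret-like extensions and filenames are the ones this README lists under supported file types, and 50000000 is the documented `maxFileBytes` default.

```typescript
// Hypothetical sketch of how a discovered file could be classified as skipped.
type SkipReason = "secret-like" | "too-large" | "unsupported";

// Lists taken from the README's secret-file note.
const SECRET_EXTENSIONS = [".pem", ".key", ".p12", ".pfx", ".jks", ".gpg"];
const SECRET_FILENAMES = [".env", ".npmrc", ".netrc", ".pgpass"];

function skipReason(
  path: string,
  sizeBytes: number,
  supportedExtensions: string[],
  maxFileBytes: number = 50000000,
): SkipReason | null {
  const name = path.split("/").pop() ?? path;
  const dot = name.lastIndexOf(".");
  // A leading dot (as in ".env") is part of the name, not an extension.
  const ext = dot > 0 ? name.slice(dot).toLowerCase() : "";

  // Assumed order: secrets first, then size, then extension support.
  if (SECRET_FILENAMES.includes(name) || SECRET_EXTENSIONS.includes(ext)) {
    return "secret-like";
  }
  if (sizeBytes > maxFileBytes) return "too-large";
  if (!supportedExtensions.includes(ext)) return "unsupported";
  return null; // the file would be indexed
}
```

Checking for secrets before size or extension means a large private key is still reported as secret-like rather than too large; that ordering is a design choice of this sketch, not something the diff confirms.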

List skipped paths explicitly:

```bash
pnpm exec kb audit --unsupported
```

Retrieve exact passages:

@@ -286,13 +299,18 @@ This creates:
```plain text
.mimir/skills/mimir/SKILL.md
.mimir/skills/mimir-audio-summary/SKILL.md
.mimir/skills/mimir-markdown-report/SKILL.md
.mimir/mcp.json
.mimir/claude-mcp-server.json
.mimir/codex-mcp.toml
.mimir/README.md
```

Agents that support skill folders can load `.mimir/skills/mimir/` for deep local RAG usage. Load
`.mimir/skills/mimir-audio-summary/` only when an optional spoken summary is needed. Other agents can
read the generated `.mimir/README.md` and use the MCP config snippet.
`.mimir/skills/mimir-audio-summary/` only when an optional spoken summary is needed. Load
`.mimir/skills/mimir-markdown-report/` when the user asks for a cited Markdown report, dossier,
audit memo, or planning note. Other agents can read the generated `.mimir/README.md` and use the MCP
config snippet.

Start the MCP server from the repository root:

@@ -312,6 +330,55 @@ This MCP layer is the recommended way to let any compatible LLM or agent query t
knowledge base. The LLM does not need to know about LanceDB or the raw file layout; it asks Mimir for
ranked passages or cited context and uses the returned citations.

### Claude Code

From the target repository root:

```bash
pnpm exec kb setup
claude mcp add-json --scope local mimir "$(cat .mimir/claude-mcp-server.json)"
```

Claude Code provides the active project path to MCP servers through `CLAUDE_PROJECT_DIR`; Mimir uses
that value when serving MCP, so the same installed npm package can work inside each repository where
`kb setup` was run. Keep the MCP scope local unless you intentionally want to share the server
config.

### Codex

From the target repository root:

```bash
pnpm exec kb setup
cat .mimir/codex-mcp.toml
```

Copy the printed TOML into `~/.codex/config.toml` or another trusted Codex config layer. The snippet
contains the repository `cwd`, so Codex can launch the Mimir MCP server from the right project.

For other MCP clients that cannot set `cwd`, set `MIMIR_PROJECT_ROOT=/absolute/path/to/repository`
when launching `kb serve-mcp`.
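
The project-root lookup described in the Claude Code and Codex sections could look roughly like the TypeScript sketch below. The function name `resolveProjectRoot` and the precedence order (explicit `MIMIR_PROJECT_ROOT`, then `CLAUDE_PROJECT_DIR`, then the working directory) are assumptions; the diff names both variables but does not state which one wins.

```typescript
// Hypothetical sketch of project-root resolution for `kb serve-mcp`.
type Env = Record<string, string | undefined>;

function resolveProjectRoot(env: Env, cwd: string): string {
  // Assumed precedence: explicit override first.
  const explicit = env["MIMIR_PROJECT_ROOT"]?.trim();
  if (explicit) return explicit;

  // Then the project path that Claude Code provides to MCP servers.
  const claude = env["CLAUDE_PROJECT_DIR"]?.trim();
  if (claude) return claude;

  // Otherwise rely on the client having launched the server with the right cwd.
  return cwd;
}
```

In a real process the arguments would be `process.env` and `process.cwd()`; taking them as parameters keeps the sketch easy to test.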

### Agent Demo

From a repository that already ran `kb setup` and has Mimir wired into the current agent, ask:

```plain text
Use Mimir to audit the local evidence. First run mimir_status and mimir_audit. Then search for
"offline retrieval approval" and produce a cited Markdown report. Do not rely on memory if Mimir
does not contain enough evidence.
```

Agents that support skill folders should also load:

```plain text
.mimir/skills/mimir/
.mimir/skills/mimir-markdown-report/
```

The Markdown report skill writes reports under `.mimir/reports/` by default, which stays ignored by
Git.

Print the bundled skill path from the installed package:

```bash
@@ -419,14 +486,20 @@ Mimir supports common text, document, data, config, log, and source-code files o
- YAML: `.yaml`, `.yml`
- CSV/TSV: `.csv`, `.tsv`
- HTML: `.html`, `.htm`
- EPUB: `.epub`
- PDF: `.pdf`
- Office/OpenDocument: `.docx`, `.pptx`, `.xlsx`, `.odt`, `.ods`, `.odp`
- Rich text: `.rtf`
- Notebook: `.ipynb`
- Subtitles/calendars/mail: `.vtt`, `.srt`, `.ics`, `.eml`
- Line data and logs: `.jsonl`, `.ndjson`, `.log`
- XML feeds and documents: `.xml`, `.rss`, `.atom`
- XML feeds and documents: `.xml`, `.rss`, `.atom`, `.svg`
- Config and data files: `.toml`, `.ini`, `.conf`, `.cfg`, `.properties`, `.sql`
- Source code: `.ts`, `.tsx`, `.js`, `.jsx`, `.py`, `.go`, `.rs`, `.java`, `.rb`, `.php`, `.cs`,
`.c`, `.cpp`, `.h`, `.css`
- Source code: `.ts`, `.tsx`, `.mts`, `.cts`, `.js`, `.jsx`, `.mjs`, `.cjs`, `.py`, `.go`, `.rs`,
`.java`, `.rb`, `.php`, `.cs`, `.c`, `.cpp`, `.h`, `.hpp`, `.css`, `.scss`, `.vue`, `.svelte`,
`.astro`, `.sh`, `.bash`, `.ps1`
- Documentation/code review text: `.rst`, `.adoc`, `.tex`, `.diff`, `.patch`, `.markdown`,
`.mdown`

Custom UTF-8 text extensions can be enabled without changing code:

@@ -447,6 +520,13 @@ that are not listed should be OCRed, transcribed, converted, or exported to text
Mimir intentionally avoids pretending that every binary format can be indexed safely without
extraction logic.

Secret-like files are skipped by default, even if they sit under a source directory. This covers
key and certificate files such as `.pem`, `.key`, `.p12`, `.pfx`, `.jks`, and `.gpg`, and common
secret filenames such as `.env`, `.npmrc`, `.netrc`, and `.pgpass`. Convert safe examples to a
normal text format before ingestion.

## Configuration Reference

Default `.kb/config.json`:
@@ -472,6 +552,9 @@ Default `.kb/config.json`:
"topK": 5,
"chunkSize": 1200,
"chunkOverlap": 150,
"maxFileBytes": 50000000,
"ingestConcurrency": 4,
"embeddingBatchSize": 32,
"includeExtensions": []
}
```
@@ -493,6 +576,9 @@ Environment overrides:
- `KB_TOP_K`
- `KB_CHUNK_SIZE`
- `KB_CHUNK_OVERLAP`
- `KB_MAX_FILE_BYTES`
- `KB_INGEST_CONCURRENCY`
- `KB_EMBEDDING_BATCH_SIZE`
- `KB_INCLUDE_EXTENSIONS`
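
To show how the three new overrides might combine with the config defaults, here is a minimal TypeScript sketch. The names `IngestLimits`, `positiveInt`, and `applyEnvOverrides` are hypothetical, and falling back to the configured value when a variable is missing or not a positive integer is an assumption; the defaults are the ones shown in the default `.kb/config.json`.

```typescript
// Hypothetical sketch: apply KB_* environment overrides to ingestion limits.
interface IngestLimits {
  maxFileBytes: number;
  ingestConcurrency: number;
  embeddingBatchSize: number;
}

// Defaults as shown in the default .kb/config.json.
const DEFAULT_LIMITS: IngestLimits = {
  maxFileBytes: 50000000,
  ingestConcurrency: 4,
  embeddingBatchSize: 32,
};

// Assumption: invalid or non-positive values are ignored rather than rejected.
function positiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function applyEnvOverrides(
  env: Record<string, string | undefined>,
  base: IngestLimits = DEFAULT_LIMITS,
): IngestLimits {
  return {
    maxFileBytes: positiveInt(env["KB_MAX_FILE_BYTES"], base.maxFileBytes),
    ingestConcurrency: positiveInt(env["KB_INGEST_CONCURRENCY"], base.ingestConcurrency),
    embeddingBatchSize: positiveInt(env["KB_EMBEDDING_BATCH_SIZE"], base.embeddingBatchSize),
  };
}
```

For example, launching with `KB_MAX_FILE_BYTES=1000` would lower only the per-file cap in this sketch and leave the other two limits at their configured values.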

## CLI Reference
@@ -512,6 +598,7 @@ Mimir ships two CLIs:
| `kb doctor --fix` | Create missing scaffolding, install skills/MCP config, and rebuild stale indexes when safe. |
| `kb ingest` | Parse source files, redact, chunk, embed, and rebuild the local LanceDB index. |
| `kb audit` | Check whether supported source files are missing from or stale in the index. |
| `kb audit --unsupported` | List files skipped because they are unsupported, too large, or secret-like. |
| `kb search "<query>"` | Retrieve ranked passages without asking an LLM to write an answer. |
| `kb ask "<question>"` | Return cited retrieval context for an agent or trusted model runtime. |
| `kb security-audit` | Inspect privacy posture: telemetry, providers, redaction, Git ignore, MCP. |
@@ -547,7 +634,8 @@ Mimir ships two CLIs:
| Option | Applies to | Meaning |
| --- | --- | --- |
| `--top-k <number>` | `search`, `ask` | Number of passages to return. |
| `--json` | `doctor`, `security-audit`, `audio --doctor`, `mimir-tts doctor` | Print machine-readable JSON. |
| `--json` | `doctor`, `audit`, `security-audit`, `audio --doctor`, `mimir-tts doctor` | Print machine-readable JSON. |
| `--unsupported` | `audit` | List skipped file paths and reasons. |
| `--strict` | `security-audit` | Exit non-zero when warnings exist. |
| `--offline` | `audio`, `mimir-tts render` | Disable remote model downloads and force the local Transformers.js path. |
| `--allow-remote-models` | `audio`, `mimir-tts render` | Explicitly allow model downloads for Transformers.js. |
@@ -602,6 +690,15 @@ pnpm exec kb doctor
If documents live elsewhere, add one path per line to `.kb/sources.txt`. Relative paths resolve from
the project root.

If files exist but are not supported yet, inspect the skipped inventory:

```bash
pnpm exec kb audit --unsupported
```

Then either convert them to a supported format, OCR/transcribe them, or add a safe custom UTF-8 text
extension with `includeExtensions` / `KB_INCLUDE_EXTENSIONS`.

### Search Returns Weak Results

The default `local-hash` provider is dependency-light and offline, but it is lexical/hash retrieval,
44 changes: 42 additions & 2 deletions SECURITY-HARDENING.md
@@ -13,6 +13,10 @@ built to minimize data movement, but it is not a certified high-assurance system
remote model loading disabled by default through `transformersAllowRemoteModels: false`.
- Redaction before indexing: built-in DLP patterns redact common secrets and identifiers before
chunks are embedded and stored.
- Secret-like files are skipped by default: common private-key, certificate, and credential
filenames/extensions are not indexed even when they appear under a source directory.
- Ingestion has a default per-file size cap through `maxFileBytes` and reports unsupported,
oversized, and secret-like skipped files.
- Metadata-only access logs: access logs contain action metadata and query hashes, not raw
queries or retrieved text.
- Generated local state is ignored by Git: `.kb/`, `.mimir/`, and `private/**` are ignored by
@@ -22,6 +26,8 @@ built to minimize data movement, but it is not a certified high-assurance system
- Optional audio summaries use `kb audio` / `@jcode.labs/mimir-tts`. Transformers.js WAV is the
default offline/confidential path and does not require Python, ffmpeg, Piper, XTTS, or a local TTS
server. Edge MP3 gives the highest quality but is appropriate only when online TTS is explicitly acceptable.
- Optional Markdown reports use the bundled `mimir-markdown-report` skill and should be written
under `.mimir/reports/` by default.
- npm releases are published with provenance from the protected GitHub Actions workflow.
- Release artifacts include a package tarball, SHA256 checksums, SBOM, and manifest.

@@ -61,8 +67,9 @@ Move the generated tarballs from `release-artifacts/` into the offline environme

```bash
pnpm add -D ./jcode.labs-mimir-tts-<version>.tgz ./jcode.labs-mimir-<version>.tgz
pnpm exec kb init
pnpm exec kb ingest
pnpm exec kb setup
pnpm exec kb doctor --fix
pnpm exec kb audit --unsupported
```

For semantic embeddings, preload the Transformers.js-compatible embedding model files inside the
@@ -104,6 +111,16 @@ Run:
pnpm exec kb security-audit --strict
```

Also run:

```bash
pnpm exec kb audit --unsupported
```

This exposes local relative paths for files that were skipped because the extension is unsupported,
the file exceeds `maxFileBytes`, or the filename looks like a secret/key artifact. Use it before
assuming a dossier was fully indexed.

## DLP Redaction

Built-in redaction is enabled by default for common secret and identifier shapes: private keys,
@@ -129,6 +146,23 @@ Custom patterns can be added in `.kb/config.json`:

Redaction changes the indexed text, not the raw files under `private/`.

## Ingestion Boundaries

Mimir indexes many text, document, Office/OpenDocument, PDF, EPUB, subtitle, notebook, mail, config,
and source-code formats. It does not silently ingest every binary file. Unsupported images, scans,
audio/video, old proprietary Office binaries, and unknown formats must be converted, OCRed, or
transcribed first.

Default ingestion guardrails:

- `maxFileBytes`: 50 MB per file by default;
- `ingestConcurrency`: four parse/chunk workers by default;
- `embeddingBatchSize`: 32 chunks per embedding batch by default;
- checksum-based stale detection for supported files;
- unsupported/skipped file reporting through `kb ingest`, `kb audit`, and `kb audit --unsupported`.

These limits are configurable, but raising them increases local memory use and parsing risk.
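
Checksum-based stale detection can be sketched in TypeScript as a comparison between the checksums of the files currently on disk and the checksums stored in the index. The function name `auditIndex`, the plain path-to-checksum records, and the `missing`/`stale` result shape are assumptions for illustration; the diff does not show Mimir's index schema or which hash it uses.

```typescript
// Hypothetical sketch: compare current file checksums with stored ones.
interface AuditResult {
  missing: string[]; // supported files that are not in the index at all
  stale: string[];   // indexed files whose content changed since ingestion
}

function auditIndex(
  current: Record<string, string>,             // path -> checksum of file on disk
  indexed: Record<string, string | undefined>, // path -> checksum stored at ingest
): AuditResult {
  const missing: string[] = [];
  const stale: string[] = [];
  for (const path of Object.keys(current).sort()) {
    const stored = indexed[path];
    if (stored === undefined) {
      missing.push(path);
    } else if (stored !== current[path]) {
      stale.push(path);
    }
  }
  return { missing, stale };
}
```

A path-only comparison would report nothing for an edited file, which is why the stored checksum is needed to flag changed content.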

## Optional Audio Summaries

`kb install-skill` installs an optional `mimir-audio-summary` skill. It is designed for listenable
@@ -151,6 +185,12 @@ Confidentiality defaults:
Generated audio can still contain sensitive information. Treat it like a derived confidential
document.

## Optional Markdown Reports

`kb install-skill` also installs `mimir-markdown-report`. Reports generated from private evidence are
derived confidential documents. Keep them under `.mimir/reports/` by default, cite source paths and
chunk numbers, and do not commit them unless the user explicitly asks for a sanitized tracked report.

## MCP Hardening

MCP gives an agent access to retrieved private context. Use it only for agents running under the
11 changes: 8 additions & 3 deletions docs/ux-dx-audit.md
@@ -23,7 +23,10 @@ developer and agent workflow around installation, indexing, querying, safety, au
| Generated helper files | `private/README.md` was indexed and could pollute retrieval results. | Fixed: generated private README is skipped by source discovery. |
| Audio confidentiality | `auto` could select online Edge TTS when installed. | Fixed: default path is Transformers.js WAV; Edge MP3 requires `--engine edge`. |
| Documentation shape | The package README had too much tutorial, reference, and explanation mixed together. | Fixed: the root README is canonical; package README files are minimal npm entrypoints. |
| Agent onboarding | `install-skill` installed files but gave limited operational guidance. | Fixed: command output now prints agent next steps. |
| Agent onboarding | `install-skill` installed files but gave limited operational guidance. | Fixed: command output now prints agent next steps and Claude Code/Codex MCP snippets. |
| Ingestion visibility | Unsupported files were ignored silently, which made users overestimate coverage. | Fixed: `ingest`, `audit`, and `audit --unsupported` report skipped files by reason. |
| Report generation | Users had audio summaries but no dedicated Markdown-report workflow. | Fixed: `mimir-markdown-report` skill writes cited reports under ignored local state. |
| Stale detection | Audit compared paths but did not detect changed file content. | Fixed: audit now uses stored checksums to flag stale indexed content. |

## DX Findings

@@ -32,7 +35,7 @@ developer and agent workflow around installation, indexing, querying, safety, au
| Local validation | `pnpm validate` already covers lint, typecheck, tests, build, smoke, package checks, and artifacts. | Good. |
| Release safety | npm publish is protected by CI, environment approval, provenance, and explicit version input. | Good. |
| API clarity | Core exports are small and named, but the README only shows a minimal API snippet. | Partially improved by CLI docs; deeper API docs remain future work. |
| MCP reference | Tool names are documented, but tool schemas are not deeply documented. | Future work. |
| MCP reference | Tool names and an agent demo prompt are documented, but tool schemas are not deeply documented. | Partially improved. |
| Error guidance | Common setup and audio errors were not centralized. | Fixed in the root README troubleshooting section. |
| Dist workflow | `dist/` is committed and documented in `CLAUDE.md`; this is unusual but CI-enforced. | Good for this repo, but keep documenting it. |

@@ -44,11 +47,13 @@ developer and agent workflow around installation, indexing, querying, safety, au
fully air-gapped operation requires a documented model-preload workflow.
- MCP access is read-focused but still exposes private retrieved passages to the connected agent.
Team/RBAC support remains out of scope.
- `audit --unsupported` intentionally lists relative paths only; users still need to avoid pasting
sensitive path names into public issue reports.
- The library API is usable, but a dedicated API reference page would help external developers.

## Recommended Next Pass

1. Add API reference docs for exported functions and result types.
2. Add MCP tool schema examples for agent developers.
3. Add a model-preload guide for semantic embeddings and offline TTS.
4. Add a recorded or scripted demo workspace flow for release QA.
4. Add deeper API reference docs for external library consumers once the public API grows.
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "jcode-mimir",
"version": "0.4.6",
"version": "0.4.7",
"private": true,
"description": "Monorepo for the Mimir open-source local RAG packages.",
"type": "module",