`memory scan`

memory scan is a repository bootstrap command.

Its job is to inspect an existing codebase, ask an LLM for durable project knowledge, validate the result, and then write that knowledge into Memory Layer through the normal capture and curate pipeline.

This is not a generic "summarize my repo" command. It is specifically trying to extract project memory that should still be useful later.

scan now sits on top of a local repository index. You can build that index explicitly with memory repo index and inspect it with memory repo status.

What It Does
What It Reads
Repository Index
What It Reads From Git
What It Sends To The LLM
What It Accepts From The LLM
How It Writes Memory
Idempotency
Dry Run Mode
Scan Reports
Configuration Requirements
Current Limits And Defaults
What scan Is Good At
What scan Is Not Good At
Practical Workflow
Troubleshooting

What It Does

At a high level, scan does this:

load the current repo and project context
read a curated subset of repository files
read a bounded amount of recent git history
build a structured dossier from that material
send the dossier to an OpenAI-compatible chat model
require strict JSON back
validate and deduplicate the returned candidates
write them as a normal Memory Layer capture
run curation so the resulting memories become searchable

So scan is really:

local repository indexing
repository sampling
LLM extraction
strict validation
normal Memory Layer ingestion

It does not bypass the existing backend or write directly to PostgreSQL tables.

Repository Index

Before scan asks the LLM anything, it now works from a local repository index stored under:

.mem/runtime/index/

You can manage that index directly:

memory repo index --project my-project
memory repo status --project my-project

The current index includes:

the selected repository files used for scan
the selected git commits used for scan
the current HEAD
a simple language-coverage summary
parser-backed analyzer summaries for enabled languages
extracted symbol, import, reference, call, and test-link facts
evidence-bundle counts for debugging and future scan quality work

Analyzer enablement comes from .agents/memory-layer.toml:

[analysis]
analyzers = ["rust", "typescript", "python"]

When you run memory scan, it reuses the existing index if it still matches the current repo HEAD and the same --since window. Otherwise it rebuilds the index first.

If you want to force a fresh index before scanning, use:

memory scan --project my-project --rebuild-index

What It Reads

scan does not read every file in the repository.

It chooses a bounded set of high-value files using a scoring heuristic.

The current implementation prefers:

README*
files under docs/
top-level manifests such as Cargo.toml, package.json, pyproject.toml, go.mod
main Rust entrypoints like crates/*/src/main.rs and crates/*/src/lib.rs
files under src/
files under scripts/
files under packaging/
files under .agents/skills/
common config and service files such as .toml, .md, .yaml, .yml, .json, .sh, .service

It skips obvious low-value or noisy paths such as:

.git/
target/
.mem/
node_modules/

Implementation limits:

up to 18 repository files
up to 8_000 bytes per file after normalization
file content budget is roughly 70% of the configured LLM input budget

What It Reads From Git

scan also reads recent git history because important architecture and workflow knowledge often lives in commit history rather than only in current files.

The current implementation:

reads up to 20 non-merge commits
captures commit hash
captures commit timestamp
captures subject
captures trimmed body
captures up to 12 changed paths per commit

You can bound this with:

memory scan --since "2 weeks ago"

or:

memory scan --since "2026-03-01"

What It Sends To The LLM

The CLI builds a structured dossier with:

project slug
canonical repo root
current HEAD commit if available
selected file contents
selected git commits

It then sends that dossier to the configured OpenAI-compatible chat endpoint with a system prompt that tells the model to:

extract durable repository memory
return strict JSON
keep candidates concise and repo-specific
avoid speculative claims
avoid transient task notes
attach provenance through files and/or commits

The requested JSON shape is:

summary
candidates[]

Each candidate is expected to include:

canonical_text
summary
memory_type
confidence
importance
tags
provenance_files
provenance_commits
rationale

What It Accepts From The LLM

The LLM output is not trusted blindly.

The current validation step rejects candidates when:

canonical_text is empty
summary is empty
both file provenance and commit provenance are missing
the candidate is a duplicate of an earlier candidate in the same scan

It also normalizes:

candidate text
tags
confidence
importance

The current hard cap is:

at most 12 accepted candidates per scan

If validation produces zero acceptable candidates, scan fails instead of writing low-quality memory.

How It Writes Memory

Accepted candidates are converted into a normal CaptureTaskRequest.

That request contains:

task_title = "Repository scan for <project>"
a scan-specific user_prompt
the LLM-generated summary as agent_summary
the selected repo files as files_changed
a condensed git summary in git_diff_summary
the validated candidates as structured_candidates

Then scan does exactly what a normal high-level write should do:

call memory capture task
call curate

That means scan output goes through the same:

backend validation
provenance rules
curation rules
search chunk generation
activity streaming

Idempotency

scan generates an idempotency key so rerunning it on the same repo state does not create uncontrolled duplicate raw captures.

The key is currently based on:

prompt version
project slug
current HEAD
selected file paths and contents
selected commit hashes

This means:

rerunning an unchanged scan tends to collapse to the same raw capture
changing important files or commits produces a new scan capture

Dry Run Mode

Use this first if you want to inspect what scan is going to do:

memory scan --project my-project --dry-run

In dry-run mode, scan still:

builds or reuses the repository index
calls the LLM
validates candidates
prints a concise preview of the accepted candidate memories

But it does not:

write a scan report file
emit a scan activity event
create a capture
run curation
write project memory

Scan Reports

Non-dry-run scans write a local report under:

.mem/runtime/scan/

The report includes:

prompt version
project
whether it was a dry run
whether the repository index was reused or rebuilt
the repository index path
summary
how many files were considered
how many commits were considered
language coverage
the dossier that was sent
the accepted candidates

This is useful for debugging why a scan produced the memory it did.

Configuration Requirements

scan requires working LLM configuration.

Today that means:

[llm].provider = "openai_compatible"
[llm].base_url
[llm].model
[llm].api_key_env
the API key available in:
- process environment
- .mem/memory-layer.env
- shared memory-layer.env

If these are not present, scan fails before doing any repository work.

Current Limits And Defaults

Important implementation details:

only openai_compatible providers are supported today
the request goes to POST /chat/completions
response_format is forced to JSON object
temperature is sent first, then omitted on retry if the model rejects it
max_completion_tokens comes from [llm].max_output_tokens

Current fixed limits:

MAX_FILES = 18
MAX_COMMITS = 20
MAX_FILE_BYTES = 8_000
MAX_CANDIDATES = 12

These are implementation limits, not user-facing flags.

What `scan` Is Good At

scan works best for extracting:

architecture facts
major functionality
durable conventions
setup and environment facts
repo-specific workflow knowledge

It is especially useful when onboarding Memory Layer to an existing project that already has a lot of knowledge spread across README files, docs, and git history.

What `scan` Is Not Good At

scan is not currently a full semantic repository model.

It does not:

read every file
execute code
run tests
infer runtime behavior from actual execution
inspect issue trackers or external systems
build the code graph yet
guarantee that every accepted candidate is correct just because the model produced it

It is only as good as:

the selected file set
the selected git history
the configured model
the validation and curation pipeline

Practical Workflow

Recommended usage:

initialize the repo with memory wizard or memory init
make sure [llm] is configured
run a dry run first
inspect the generated report
run the real scan
open the TUI and inspect the resulting memories

Example:

memory scan --project my-project --dry-run
memory scan --project my-project
memory tui --project my-project

Troubleshooting

Common failure cases:

missing [llm].model
missing API key
wrong model name
unsupported model parameters
no valid durable candidates returned

Useful checks:

memory doctor
memory scan --project my-project --dry-run

If you are debugging a specific scan result, the most useful artifact is the JSON report in .mem/runtime/scan/.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`memory scan`

Table of Contents

What It Does

Repository Index

What It Reads

What It Reads From Git

What It Sends To The LLM

What It Accepts From The LLM

How It Writes Memory

Idempotency

Dry Run Mode

Scan Reports

Configuration Requirements

Current Limits And Defaults

What `scan` Is Good At

What `scan` Is Not Good At

Practical Workflow

Troubleshooting

Related Docs

FilesExpand file tree

scan.md

Latest commit

History

scan.md

File metadata and controls

memory scan

Table of Contents

What It Does

Repository Index

What It Reads

What It Reads From Git

What It Sends To The LLM

What It Accepts From The LLM

How It Writes Memory

Idempotency

Dry Run Mode

Scan Reports

Configuration Requirements

Current Limits And Defaults

What scan Is Good At

What scan Is Not Good At

Practical Workflow

Troubleshooting

Related Docs

`memory scan`

What `scan` Is Good At

What `scan` Is Not Good At