Merged
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -24,7 +24,7 @@ PUBLISHING PROCEDURE:
5. After publishing, the next PR author will add a new "## Unreleased" section
-->

-## Unreleased
+## 0.6.1 (2026-05-20)

### Changed

2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "monodex"
-version = "0.6.0"
+version = "0.6.1"
edition = "2024"
rust-version = "1.93"
description = "Fast, accurate code search for large Rush monorepos"
10 changes: 5 additions & 5 deletions README.md
@@ -361,7 +361,7 @@ monodex dump-chunks --file ./src/JsonFile.ts --debug
monodex dump-chunks --file ./src/JsonFile.ts --target-size 4000

# Audit chunking quality across multiple files (AST-only mode)
-monodex audit-chunks --count 20 --dir /path/to/project
+monodex audit-chunks --count 20 --folder /path/to/project
```

**Chunk Quality Score**: 0-100%, higher is better. Scores below 95% may indicate chunking issues. Note: `dump-chunks` and `audit-chunks` use AST-only mode (fallback disabled) to accurately measure partitioner quality.
@@ -460,7 +460,7 @@ RUST_LOG=debug ./target/release/monodex crawl --catalog sparo --label main --com

The crawl behavior (which files to index and how to chunk them) can be customized via configuration files.

-For the full inventory of files Monodex reads or writes (config-folder state, the database directory layout, repo-local config files), see [docs/design/monodex_files.md](https://github.com/microsoft/monodex/blob/main/docs/design/monodex_files.md).
+For the full inventory of files Monodex reads or writes (config-folder state, the database folder layout, repo-local config files), see [docs/design/monodex_files.md](https://github.com/microsoft/monodex/blob/main/docs/design/monodex_files.md).

### Config Discovery

@@ -474,7 +474,7 @@ No merging occurs. Exactly one config is used.

### Config Schema

-JSON schemas are available in the `schemas/` directory for IDE autocomplete and validation. Reference the appropriate schema in your config file via the `$schema` field:
+JSON schemas are available in the `schemas/` folder for IDE autocomplete and validation. Reference the appropriate schema in your config file via the `$schema` field:

| Config File | Schema File |
| ------------------------- | ----------------------------- |
@@ -530,12 +530,12 @@ shouldCrawl = matchesFileType && (matchesPatternsToKeep || !matchesPatternsToExc

- `fileTypes` is the primary filter. Unsupported file types are never crawled.
- `patternsToKeep` overrides `patternsToExclude` (useful for keeping test files in `src/`)
-- Directory patterns (ending in `/`) match anywhere in the path
+- Folder patterns (ending in `/`) match anywhere in the path

**Pattern syntax:**

- Glob patterns use the standard syntax: `**` for recursive, `*` for wildcard
-- Directory patterns end with `/` (e.g., `node_modules/`)
+- Folder patterns end with `/` (e.g., `node_modules/`)
- Example: `**/*.test.ts` matches test files at any depth
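
As a rough illustration of the filter rule in this hunk, the sketch below combines the three conditions named in the (truncated) hunk header and shows one possible reading of "folder patterns match anywhere in the path". The function names and the component-equality interpretation are assumptions made for illustration, not Monodex's actual implementation.

```rust
/// Hypothetical helper: treats a folder pattern such as `node_modules/`
/// as matching when any folder component of a `/`-separated relative
/// path equals the pattern's name. This interpretation is an assumption.
fn matches_folder_pattern(pattern: &str, path: &str) -> bool {
    // Only patterns ending in `/` are folder patterns.
    let Some(folder_name) = pattern.strip_suffix('/') else {
        return false;
    };
    let mut components: Vec<&str> = path.split('/').collect();
    // The last component is the file name, so it is not a folder.
    components.pop();
    components.iter().any(|component| *component == folder_name)
}

/// The crawl decision: the file type is the primary filter, and a
/// "keep" match overrides an "exclude" match.
fn should_crawl(matches_file_type: bool, matches_keep: bool, matches_exclude: bool) -> bool {
    matches_file_type && (matches_keep || !matches_exclude)
}
```

Under this reading, `src/node_modules.ts` is not excluded by `node_modules/`, because the match is against folder components only.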

## Status
8 changes: 4 additions & 4 deletions docs/backlog.md
@@ -44,7 +44,7 @@ For official feature requests, create a GitHub issue. If an issue needs higher p

<a id="BL51"></a>

-**BL51 `monodex init` command, with `examples/` rename.** Generate `<config-folder>/monodex-config.json`, `<config-folder>/monodex-crawl-config.json`, and `<config-folder>/monodex-state.json` from the templates currently under `examples/`, with `$schema` URLs set to the published locations. Removes a setup step for new users. Implementation: `include_bytes!` to embed templates at compile time, plus a small command handler with the standard "file already exists" handling. Depends on the templates being embedded (trivial) and ideally on schema publication (otherwise `$schema` URLs are placeholders). The directory should be renamed from `examples/` to `config-templates/` as part of this work, since the current name is a misnomer.
+**BL51 `monodex init` command, with `examples/` rename.** Generate `<config-folder>/monodex-config.json`, `<config-folder>/monodex-crawl-config.json`, and `<config-folder>/monodex-state.json` from the templates currently under `examples/`, with `$schema` URLs set to the published locations. Removes a setup step for new users. Implementation: `include_bytes!` to embed templates at compile time, plus a small command handler with the standard "file already exists" handling. Depends on the templates being embedded (trivial) and ideally on schema publication (otherwise `$schema` URLs are placeholders). The folder should be renamed from `examples/` to `config-templates/` as part of this work, since the current name is a misnomer.

(severity=feature, work=small)
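
A minimal sketch of the "file already exists" handling BL51 mentions, using only the standard library. The helper name and the placeholder template are hypothetical; in the real command the bytes would presumably come from `include_bytes!` on a template file, which is replaced by a literal here so the sketch compiles on its own.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

// Stand-in for an embedded template. The real bytes would presumably be
// produced by `include_bytes!` at compile time (assumption).
const CONFIG_TEMPLATE: &[u8] = b"{ \"note\": \"placeholder template\" }\n";

/// Hypothetical helper: writes `contents` to `path` only if no file exists
/// there yet. Returns `Ok(true)` if the file was written and `Ok(false)` if
/// an existing file was left untouched.
fn write_template_if_absent(path: &Path, contents: &[u8]) -> io::Result<bool> {
    // `create_new(true)` makes the existence check and the creation atomic,
    // which avoids a check-then-write race.
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(contents)?;
            Ok(true)
        }
        Err(err) if err.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(err) => Err(err),
    }
}
```

Returning a boolean instead of an error lets the caller decide whether an existing file is a failure or just a message to print.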

@@ -90,7 +90,7 @@ Items with at least one non-obvious insight worth recording, but no commitment t

<a id="BL52"></a>

-**BL52 Orphan reclamation garbage collection.** Three orphan kinds, swept by one `monodex gc` command: chunk-row orphans (rows in `chunks` with `active_label_ids = []`, typically from interrupted crawls; reclaimed by deleting the row), vector-payload orphans (non-NULL `vector` on a row no in-selection vector method points at; reclaimed by setting `vector = NULL`, row stays), Tantivy-directory orphans (a directory under `<db>/fts/<catalog>/<label>/` for a label whose selection no longer includes FTS, or no longer exists; reclaimed by deleting the directory). All three share the same conceptual structure (content unreferenced by any in-selection label state) and operational constraint (requires the database to be quiescent for a full scan). One feature, offline command, not continuous background work. Workaround until the verb exists: `purge` and rebuild from scratch. Revisit once databases live long enough that orphan accumulation matters in practice. Implementation note: an internal `null_vectors_for_row_ids` primitive already exists, which nulls vector columns while preserving the rows. It may be the right mechanism for vector-only invalidation or orphan cleanup.
+**BL52 Orphan reclamation garbage collection.** Three orphan kinds, swept by one `monodex gc` command: chunk-row orphans (rows in `chunks` with `active_label_ids = []`, typically from interrupted crawls; reclaimed by deleting the row), vector-payload orphans (non-NULL `vector` on a row no in-selection vector method points at; reclaimed by setting `vector = NULL`, row stays), Tantivy-folder orphans (a folder under `<db>/fts/<catalog>/<label>/` for a label whose selection no longer includes FTS, or no longer exists; reclaimed by deleting the folder). All three share the same conceptual structure (content unreferenced by any in-selection label state) and operational constraint (requires the database to be quiescent for a full scan). One feature, offline command, not continuous background work. Workaround until the verb exists: `purge` and rebuild from scratch. Revisit once databases live long enough that orphan accumulation matters in practice. Implementation note: an internal `null_vectors_for_row_ids` primitive already exists, which nulls vector columns while preserving the rows. It may be the right mechanism for vector-only invalidation or orphan cleanup.

(severity=feature, work=large)

@@ -126,7 +126,7 @@ Items with at least one non-obvious insight worth recording, but no commitment t

<a id="BL104"></a>

-**BL104 Batch the per-row writes in `remove_label_from_chunks`.** The label-reassignment cleanup phase at the end of every successful crawl does one LanceDB write per orphaned chunk (`src/engine/storage/chunks/storage.rs:706` and `:710`) while holding the commit mutex for the whole loop. Fine for typical crawls; pathological for large refactors (directory renames, package moves, mass file deletions) where orphans run into the thousands and the held mutex blocks other writers for the duration. The work splits cleanly: bulk-delete rows that go to zero `active_label_ids` via `delete` with a `row_id IN (...)` predicate in `UPSERT_BATCH_SIZE` chunks; apply non-empty label-list shrinks via `merge_insert` (LanceDB's `update` is per-predicate, not vectorized over different per-row values, so `merge_insert` is the natural batched primitive). Adjacent to but distinct from BL52 (orphan GC): BL52 reclaims rows whose `active_label_ids` is already empty; this gets them to empty more efficiently.
+**BL104 Batch the per-row writes in `remove_label_from_chunks`.** The label-reassignment cleanup phase at the end of every successful crawl does one LanceDB write per orphaned chunk (`src/engine/storage/chunks/storage.rs:706` and `:710`) while holding the commit mutex for the whole loop. Fine for typical crawls; pathological for large refactors (folder renames, package moves, mass file deletions) where orphans run into the thousands and the held mutex blocks other writers for the duration. The work splits cleanly: bulk-delete rows that go to zero `active_label_ids` via `delete` with a `row_id IN (...)` predicate in `UPSERT_BATCH_SIZE` chunks; apply non-empty label-list shrinks via `merge_insert` (LanceDB's `update` is per-predicate, not vectorized over different per-row values, so `merge_insert` is the natural batched primitive). Adjacent to but distinct from BL52 (orphan GC): BL52 reclaims rows whose `active_label_ids` is already empty; this gets them to empty more efficiently.

(severity=performance, work=medium)
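
The bulk-delete half of BL104 could be sketched as follows. The sketch only builds the `row_id IN (...)` predicate strings in fixed-size batches and deliberately stops short of calling LanceDB. The function name, the `u64` row-id type, and passing the batch size as a parameter (rather than reading `UPSERT_BATCH_SIZE`) are assumptions made for illustration.

```rust
/// Hypothetical helper: splits `row_ids` into batches of at most
/// `batch_size` and renders one `row_id IN (...)` predicate per batch.
/// Each predicate would then be passed to a single bulk delete, so the
/// number of writes drops from one per row to one per batch.
fn delete_predicates(row_ids: &[u64], batch_size: usize) -> Vec<String> {
    assert!(batch_size > 0, "batch_size must be positive");
    row_ids
        .chunks(batch_size)
        .map(|batch| {
            let id_list = batch
                .iter()
                .map(|id| id.to_string())
                .collect::<Vec<_>>()
                .join(", ");
            format!("row_id IN ({id_list})")
        })
        .collect()
}
```

For five row ids and a batch size of two, this yields three predicates, the last covering a single id.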

@@ -150,7 +150,7 @@ Items with at least one non-obvious insight worth recording, but no commitment t

<a id="BL68"></a>

-**BL68 Orphaned per-catalog lockfile cleanup command.** Per-catalog lockfiles get created lazily and never deleted; the lockfile directory grows monotonically as catalogs come and go. Bounded and tiny per the design's framing in `concurrency.md:134` and `:168`, but a real loose end with no current owner. A future maintenance command can sweep orphaned per-catalog lockfiles for catalogs no longer in `monodex-config.json`.
+**BL68 Orphaned per-catalog lockfile cleanup command.** Per-catalog lockfiles get created lazily and never deleted; the lockfile folder grows monotonically as catalogs come and go. Bounded and tiny per the design's framing in `concurrency.md:134` and `:168`, but a real loose end with no current owner. A future maintenance command can sweep orphaned per-catalog lockfiles for catalogs no longer in `monodex-config.json`.

(severity=hygiene, work=small)

17 changes: 16 additions & 1 deletion docs/code_organization_policy.md
@@ -145,14 +145,29 @@ There is no fixed threshold. The judgment is relative: tag the largest contribut

## Naming

### File and directory names

- Command handlers: named after the CLI subcommand (`purge.rs`, `search.rs`). Use `use_cmd.rs` for `use` (reserved keyword).
- Engine submodule directories: named after the concept (`partitioner/`, `storage/`).
- Type-only files: `types.rs` or `models.rs`.
- Test files: `tests.rs` (singular).
- No semantically vapid filenames. `utilities.rs`, `helpers.rs`, `common.rs`, `misc.rs` are free to write and tell the next reader nothing; half the codebase is "utilities" of some sort. The work of naming is finding what the functions actually have in common, and that shared trait is usually a better name: `formatting.rs` if the trait is formatting, `test_mocks.rs` or `test_fixtures.rs` if the trait is test setup. `test_helpers.rs` is acceptable only when no narrower trait is visible. Pick the narrowest accurate name today; rename when contents change.

### Folder vs directory

Prefer "folder" in identifiers, prose, doc comments, error messages, and clap help text.

The cases that stay "directory":

- **Established compounds**, in their established meaning: "working directory" when it means the Git enlistment folder or `std::env::current_dir()`; "current directory" / "current working directory" for `std::env::current_dir()`; "root directory" for a filesystem root; "home directory" for `$HOME`.
- **Vendor and standard-library API surface**: type names, trait names, function names, error variants, and terms-of-art from third-party documentation. `std::fs::read_dir`, `std::fs::create_dir_all`, Tantivy's `Directory` trait, `MmapDirectory`, `OpenDirectoryError`, LanceDB's "directory-based table format" all stay as-is; renaming them would prevent readers from finding the underlying documentation.

The two cases above identify objects that keep the word "directory". Prose specifically describing one of those objects inherits the word, so the sentence agrees with the symbol it names. A doc comment on `MmapDirectory::open` says "opens the directory at the given path"; a sentence about a function called `parse_working_directory_arg` says "parses the working directory argument." This is a derived rule, not a third independent criterion.

Counter-examples: a loop variable iterating folders is `current_folder`, not `current_directory` (the first rule requires the literal `current_dir()` meaning). A test fixture holding a `TempDir` may keep `_tmp_dir`; it mirrors the crate's type, not a Monodex concept.
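
A small sketch of how the naming rule might look in code. The function is hypothetical and not taken from the Monodex sources: identifiers that name Monodex concepts say "folder", while standard-library calls such as `read_dir` and `is_dir` keep their own spelling.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists the immediate subfolders of `parent_folder`, sorted by path.
fn list_subfolders(parent_folder: &Path) -> io::Result<Vec<PathBuf>> {
    let mut subfolders = Vec::new();
    // `read_dir` is standard-library API surface, so its name stays as-is.
    for entry in fs::read_dir(parent_folder)? {
        // A loop variable iterating folders is `current_folder`,
        // not `current_directory`.
        let current_folder = entry?.path();
        if current_folder.is_dir() {
            subfolders.push(current_folder);
        }
    }
    subfolders.sort();
    Ok(subfolders)
}
```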

## Banned patterns

- No semantically vapid filenames. `utilities.rs`, `helpers.rs`, `common.rs`, `misc.rs` are free to write and tell the next reader nothing; half the codebase is "utilities" of some sort. The work of naming is finding what the functions actually have in common, and that shared trait is usually a better name: `formatting.rs` if the trait is formatting, `test_mocks.rs` or `test_fixtures.rs` if the trait is test setup. `test_helpers.rs` is acceptable only when no narrower trait is visible. Pick the narrowest accurate name today; rename when contents change.
- No wildcard re-exports (`pub use submodule::*`). List re-exports explicitly.
- No putting unrelated items together just because they're small.
- No structural splits in the same change as feature or fix work. Splits are their own change unless explicitly authorized by the maintainer or the planned reorganization being applied.