From 7e41f3fc0acde12623c623cbd663109663d38c95 Mon Sep 17 00:00:00 2001
From: Pete Gonzalez <4673363+octogonz@users.noreply.github.com>
Date: Tue, 19 May 2026 18:34:42 -0700
Subject: [PATCH 1/6] Add policy about folder vs directory

---
 docs/code_organization_policy.md | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/docs/code_organization_policy.md b/docs/code_organization_policy.md
index 44dcf6c..19b96c4 100644
--- a/docs/code_organization_policy.md
+++ b/docs/code_organization_policy.md
@@ -145,14 +145,29 @@ There is no fixed threshold. The judgment is relative: tag the largest contribut
 
 ## Naming
 
+### File and directory names
+
 - Command handlers: named after the CLI subcommand (`purge.rs`, `search.rs`). Use `use_cmd.rs` for `use` (reserved keyword).
 - Engine submodule directories: named after the concept (`partitioner/`, `storage/`).
 - Type-only files: `types.rs` or `models.rs`.
 - Test files: `tests.rs` (singular).
+- No semantically vapid filenames. `utilities.rs`, `helpers.rs`, `common.rs`, `misc.rs` are free to write and tell the next reader nothing; half the codebase is "utilities" of some sort. The work of naming is finding what the functions actually have in common, and that shared trait is usually a better name: `formatting.rs` if the trait is formatting, `test_mocks.rs` or `test_fixtures.rs` if the trait is test setup. `test_helpers.rs` is acceptable only when no narrower trait is visible. Pick the narrowest accurate name today; rename when contents change.
+
+### Folder vs directory
+
+Prefer "folder" in identifiers, prose, doc comments, error messages, and clap help text.
+
+The cases that stay "directory":
+
+- **Established compounds**, in their established meaning: "working directory" when it means the Git enlistment folder or `std::env::current_dir()`; "current directory" / "current working directory" for `std::env::current_dir()`; "root directory" for a filesystem root; "home directory" for `$HOME`.
+- **Vendor and standard-library API surface**: type names, trait names, function names, error variants, and terms-of-art from third-party documentation. `std::fs::read_dir`, `std::fs::create_dir_all`, Tantivy's `Directory` trait, `MmapDirectory`, `OpenDirectoryError`, LanceDB's "directory-based table format" all stay as-is; renaming them would prevent readers from finding the underlying documentation.
+
+The two cases above identify objects that keep the word "directory". Prose specifically describing one of those objects inherits the word, so the sentence agrees with the symbol it names. A doc comment on `MmapDirectory::open` says "opens the directory at the given path"; a sentence about a function called `parse_working_directory_arg` says "parses the working directory argument." This is a derived rule, not a third independent criterion.
+
+Counter-examples: a loop variable iterating folders is `current_folder`, not `current_directory` (the first rule requires the literal `current_dir()` meaning). A test fixture holding a `TempDir` may keep `_tmp_dir`; it mirrors the crate's type, not a Monodex concept.
 
 ## Banned patterns
 
-- No semantically vapid filenames. `utilities.rs`, `helpers.rs`, `common.rs`, `misc.rs` are free to write and tell the next reader nothing; half the codebase is "utilities" of some sort. The work of naming is finding what the functions actually have in common, and that shared trait is usually a better name: `formatting.rs` if the trait is formatting, `test_mocks.rs` or `test_fixtures.rs` if the trait is test setup. `test_helpers.rs` is acceptable only when no narrower trait is visible. Pick the narrowest accurate name today; rename when contents change.
 - No wildcard re-exports (`pub use submodule::*`). List re-exports explicitly.
 - No putting unrelated items together just because they're small.
 - No structural splits in the same change as feature or fix work. Splits are their own change unless explicitly authorized by the maintainer or the planned reorganization being applied.

From 3c76db93bdecec32615f5e6396d4fa94e6919ea0 Mon Sep 17 00:00:00 2001
From: Pete Gonzalez <4673363+octogonz@users.noreply.github.com>
Date: Tue, 19 May 2026 18:56:15 -0700
Subject: [PATCH 2/6] docs: apply folder-vs-directory policy to documentation (Phase 1)

- Rename placeholder to throughout
- Change 'directory' to 'folder' in prose, comments, descriptions
- Update audit-chunks --dir to --folder in examples
- Update JSON schema description fields
- Keep established compounds: working-directory, working-dir
- Keep vendor terms: LanceDB directory-based, Tantivy Directory
---
 README.md                    | 10 ++++-----
 docs/backlog.md              |  8 ++++----
 docs/design/architecture.md  | 20 +++++++++---------
 docs/design/chunker.md       |  2 +-
 docs/design/concurrency.md   | 22 ++++++++++----------
 docs/design/crawl.md         | 14 ++++++-------
 docs/design/monodex_files.md | 40 ++++++++++++++++++------------------
 docs/design/search.md        | 10 ++++-----
 docs/smoke_test.md           |  4 ++--
 schemas/config.schema.json   |  4 ++--
 schemas/editing.md           |  2 +-
 11 files changed, 68 insertions(+), 68 deletions(-)

diff --git a/README.md b/README.md
index 21f517d..cfdb281 100644
--- a/README.md
+++ b/README.md
@@ -361,7 +361,7 @@ monodex dump-chunks --file ./src/JsonFile.ts --debug
 monodex dump-chunks --file ./src/JsonFile.ts --target-size 4000
 
 # Audit chunking quality across multiple files (AST-only mode)
-monodex audit-chunks --count 20 --dir /path/to/project
+monodex audit-chunks --count 20 --folder /path/to/project
 ```
 
 **Chunk Quality Score**: 0-100%, higher is better. Scores below 95% may indicate chunking issues. Note: `dump-chunks` and `audit-chunks` use AST-only mode (fallback disabled) to accurately measure partitioner quality.
@@ -460,7 +460,7 @@ RUST_LOG=debug ./target/release/monodex crawl --catalog sparo --label main --com
 
 The crawl behavior (which files to index and how to chunk them) can be customized via configuration files.
 
-For the full inventory of files Monodex reads or writes (config-folder state, the database directory layout, repo-local config files), see [docs/design/monodex_files.md](https://github.com/microsoft/monodex/blob/main/docs/design/monodex_files.md).
+For the full inventory of files Monodex reads or writes (config-folder state, the database folder layout, repo-local config files), see [docs/design/monodex_files.md](https://github.com/microsoft/monodex/blob/main/docs/design/monodex_files.md).
 
 ### Config Discovery
 
@@ -474,7 +474,7 @@ No merging occurs. Exactly one config is used.
 
 ### Config Schema
 
-JSON schemas are available in the `schemas/` directory for IDE autocomplete and validation. Reference the appropriate schema in your config file via the `$schema` field:
+JSON schemas are available in the `schemas/` folder for IDE autocomplete and validation. Reference the appropriate schema in your config file via the `$schema` field:
 
 | Config File | Schema File |
 | ------------------------- | ----------------------------- |
@@ -530,12 +530,12 @@ shouldCrawl = matchesFileType && (matchesPatternsToKeep || !matchesPatternsToExc
 
 - `fileTypes` is the primary filter. Unsupported file types are never crawled.
 - `patternsToKeep` overrides `patternsToExclude` (useful for keeping test files in `src/`)
-- Directory patterns (ending in `/`) match anywhere in the path
+- Folder patterns (ending in `/`) match anywhere in the path
 
 **Pattern syntax:**
 
 - Glob patterns use the standard syntax: `**` for recursive, `*` for wildcard
-- Directory patterns end with `/` (e.g., `node_modules/`)
+- Folder patterns end with `/` (e.g., `node_modules/`)
 - Example: `**/*.test.ts` matches test files at any depth
 
 ## Status
diff --git a/docs/backlog.md b/docs/backlog.md
index aeff9f8..3486408 100644
--- a/docs/backlog.md
+++ b/docs/backlog.md
@@ -44,7 +44,7 @@ For official feature requests, create a GitHub issue. If an issue needs higher p
 
-**BL51 `monodex init` command, with `examples/` rename.** Generate `/monodex-config.json`, `/monodex-crawl-config.json`, and `/monodex-state.json` from the templates currently under `examples/`, with `$schema` URLs set to the published locations. Removes a setup step for new users. Implementation: `include_bytes!` to embed templates at compile time, plus a small command handler with the standard "file already exists" handling. Depends on the templates being embedded (trivial) and ideally on schema publication (otherwise `$schema` URLs are placeholders). The directory should be renamed from `examples/` to `config-templates/` as part of this work, since the current name is a misnomer.
+**BL51 `monodex init` command, with `examples/` rename.** Generate `/monodex-config.json`, `/monodex-crawl-config.json`, and `/monodex-state.json` from the templates currently under `examples/`, with `$schema` URLs set to the published locations. Removes a setup step for new users. Implementation: `include_bytes!` to embed templates at compile time, plus a small command handler with the standard "file already exists" handling. Depends on the templates being embedded (trivial) and ideally on schema publication (otherwise `$schema` URLs are placeholders). The folder should be renamed from `examples/` to `config-templates/` as part of this work, since the current name is a misnomer.
 
 (severity=feature, work=small)
 
@@ -90,7 +90,7 @@ Items with at least one non-obvious insight worth recording, but no commitment t
 
-**BL52 Orphan reclamation garbage collection.** Three orphan kinds, swept by one `monodex gc` command: chunk-row orphans (rows in `chunks` with `active_label_ids = []`, typically from interrupted crawls; reclaimed by deleting the row), vector-payload orphans (non-NULL `vector` on a row no in-selection vector method points at; reclaimed by setting `vector = NULL`, row stays), Tantivy-directory orphans (a directory under `/fts//