
Add subpatch for catalog id fixes #30 (#33)

Merged
spoorcc merged 12 commits into main from claude/fix-monorepo-catalog-id-VUrvY
Mar 7, 2026

Conversation

@spoorcc
Contributor

@spoorcc spoorcc commented Mar 4, 2026

Fixes #30

Summary by CodeRabbit

  • New Features

    • Monorepo support: per-component subpaths produce distinct IDs, per-component detail files, and tailored README output.
    • Richer catalog metadata: persistent per-project details (tags, branches, URLs, readme/license) and improved merging/labeling behavior.
    • Better homepage discovery for vcpkg ports via portfile inspection; updated monorepo examples and catalog entries.
  • Tests

    • Added comprehensive tests for subpath handling, ID generation, README generation, tag/source merging, vcpkg portfile parsing, and catalog/detail persistence.

@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

Replaces procedural catalog writing with Catalog and CatalogWriter classes; adds CatalogDetail, CatalogEntry, and catalog model types; extends BaseManifest with public subpath and in_project_repo plus sanitize_subpath/sanitized_subpath; makes catalog IDs, detail paths, and README/detail generation subpath-aware; CLI assigns manifest.subpath for repo subfolders.

Changes

Cohort / File(s) Summary
Manifest model
dfetch_hub/catalog/sources/__init__.py
Adds public `subpath: str | None` and `in_project_repo` fields plus `sanitize_subpath`/`sanitized_subpath` helpers to `BaseManifest`.
Catalog writer & CLI
dfetch_hub/catalog/writer.py, dfetch_hub/commands/update.py
Replaces functional writer with Catalog and CatalogWriter classes; write flow loads/dumps catalog.json, is subpath-aware, computes subpath-aware IDs, removes root entries when needed, and delegates per-project detail writes to CatalogDetail; CLI now constructs CatalogWriter and assigns manifest.subpath when using fallback homepages.
Catalog detail & models
dfetch_hub/catalog/detail.py, dfetch_hub/catalog/model.py, dfetch_hub/catalog/entry.py
Adds CatalogDetail (serialize/merge, README generation, tag fetching), data models (VCSLocation, Tag, GitRefs, PackageContent, FetchMetadata, CatalogSource), and CatalogEntry with catalog_id/subpath-aware logic and merge/update behaviors.
Source parsers
dfetch_hub/catalog/sources/*.py
Propagates in_project_repo into manifests (clib.py, conan.py, readme.py, vcpkg.py); vcpkg adds _github_url_from_portfile to extract upstream GitHub URL; README parsing and raw fetch validate schemes.
Catalog data & config
dfetch_hub/data/*, dfetch-hub.toml
Adds example catalog entries and monorepo example files; uncomments/enables example blocks in dfetch-hub.toml for monorepo usage and adds three new data JSONs for example repos.
Tests
test/*, test/test_catalog_*.py
Adds/updates tests for Catalog, CatalogWriter, CatalogDetail, CatalogEntry, model types, vcpkg portfile logic, BaseManifest.sanitize_subpath, and CLI subpath assignment; asserts subpath/in_project_repo handling and monorepo detail paths.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor CLI
    participant Parser as Manifest Parser
    participant Writer as CatalogWriter
    participant Catalog as Catalog (in-memory)
    participant Detail as CatalogDetail
    participant FS as Filesystem / Registry

    CLI->>Parser: scan entry directories
    Parser->>Parser: parse manifests, set manifest.subpath / in_project_repo
    Parser->>Writer: pass manifests list
    Writer->>Catalog: Catalog.load(registry_path)
    Writer->>Writer: parse VCS slug -> (vcs_host, org, repo)
    Writer->>Writer: compute catalog_id (host, org, repo, subpath)
    Writer->>Catalog: get_or_create_entry(manifest, vcs_host, org, repo, label)
    Writer->>Detail: CatalogDetail.from_manifest / update_from_manifest(...)
    Detail->>FS: dump detail JSON to data_dir/.../repo[/subpath].json
    Writer->>Catalog: remove root entry if subpath replaces it
    Writer->>Catalog: Catalog.dump(registry_path)
    Catalog->>FS: write catalog.json

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 I hop through subpaths, sniff each trail,
New IDs now include the tiniest rail.
Details tucked per-project, neat and snug,
Monorepo morsels get their own rug.
Hooray — catalogs snug, carrots in a jug! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Out of Scope Changes check | ⚠️ Warning | Multiple out-of-scope changes detected. The PR introduces extensive refactoring including new CatalogDetail and CatalogEntry classes, a CatalogWriter class redesign, comprehensive test suites, and data model files that significantly exceed the scope of issue #30's requirement to include subpath in catalog ID generation. | Consider splitting this PR into focused changes: (1) minimal subpath support for catalog IDs addressing #30, and (2) separate PRs for the CatalogDetail/CatalogEntry refactoring, the CatalogWriter redesign, and test infrastructure additions.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title is partially related to the changeset. It mentions 'subpatch for catalog id', which aligns with adding subpath support to catalog IDs, but it is imprecise ('subpatch' vs 'subpath') and doesn't fully capture the extensive refactoring of catalog writing, entry management, and detail handling that dominates the changes.
Linked Issues check | ✅ Passed | The PR addresses issue #30 by updating catalog ID generation to include subpath/subfolder information. Changes to CatalogEntry.catalog_id(), CatalogWriter.write_manifest(), and the BaseManifest.subpath field directly fulfill the requirement that subfolders in monorepos generate distinct catalog IDs.
Docstring Coverage | ✅ Passed | Docstring coverage is 93.23%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@spoorcc spoorcc changed the title from "fold hierarchical tags with '/' in Versions tab" to "Add subpatch for catalog id fixes #30" on Mar 4, 2026

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
dfetch_hub/catalog/writer.py (1)

359-381: ⚠️ Potential issue | 🟠 Major

Normalize subpath consistently for both catalog key and detail filename.

Line [360] canonicalizes subpath via _catalog_id(...subpath.lower()), but Line [379] writes detail JSON using raw subpath. With subpath="MyLib", catalog ID becomes .../mylib while file can be MyLib.json, causing detail fetch misses on case-sensitive systems.

Suggested fix
-        subpath = manifest.subpath
-        cat_id = _catalog_id(vcs_host, org, repo, subpath)
+        subpath = manifest.subpath
+        normalized_subpath = subpath.lower() if subpath else None
+        cat_id = _catalog_id(vcs_host, org, repo, normalized_subpath)
@@
-        if subpath:
-            detail_path = data_dir / vcs_host / org / repo / f"{subpath}.json"
+        if normalized_subpath:
+            detail_path = data_dir / vcs_host / org / repo / f"{normalized_subpath}.json"
         else:
             detail_path = data_dir / vcs_host / org / f"{repo}.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/writer.py` around lines 359 - 381, The code canonicalizes
subpath when building the catalog key via _catalog_id but writes the detail JSON
using the raw subpath, causing case-mismatch on case-sensitive filesystems; fix
by normalizing subpath the same way before constructing both cat_id and
detail_path (e.g., derive a canonical_subpath used for _catalog_id and for
building detail_path) so that variables subpath, cat_id, _catalog_id,
detail_path and data_dir all use the same normalized subpath.
dfetch_hub/catalog/sources/__init__.py (1)

126-159: ⚠️ Potential issue | 🟠 Major

CI blocker: BaseManifest now exceeds pylint instance-attribute limit (9/8).

Line [159] adds a ninth instance field, and your pipeline is already failing with R0902. This must be handled before merge.

Suggested fix
 @dataclass
-class BaseManifest:
+class BaseManifest:  # pylint: disable=too-many-instance-attributes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/sources/__init__.py` around lines 126 - 159, BaseManifest
currently defines nine instance attributes triggering pylint R0902; collapse
related fields into a single nested attribute (e.g., create a new dataclass like
ManifestMetadata or a dict) and replace grouped fields (homepage, license,
version, readme_content) with that one attribute on BaseManifest to drop the
instance count below the limit; update all call sites that construct or read
BaseManifest to use the new nested attribute (e.g.,
BaseManifest.metadata.homepage) and run tests/linter to confirm R0902 is
resolved.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7949da and 4158414.

📒 Files selected for processing (4)
  • dfetch_hub/catalog/sources/__init__.py
  • dfetch_hub/catalog/writer.py
  • dfetch_hub/commands/update.py
  • test/test_catalog_writer.py

claude added 2 commits March 6, 2026 23:59
When a single repository contains multiple components (e.g. packages in
subdirectories), each manifest previously received the same catalog ID
(vcs_host/org/repo), causing later entries to silently overwrite earlier
ones in catalog.json and the per-project detail files.

Fix:
- Add `subpath: str | None` field to `BaseManifest`.  Set by
  `_process_subfolders_source` when the manifest has no individual
  homepage (i.e. the component's URL is inherited from the monorepo
  root via `source.url`).
- `_catalog_id` accepts an optional `subpath` and produces
  `vcs_host/org/repo/subpath` when set, leaving non-monorepo IDs
  unchanged.
- `write_catalog` routes monorepo detail files to
  `data/<host>/<org>/<repo>/<subpath>.json` so each component has its
  own file; `subfolder_path` in the detail JSON is populated accordingly.
- 5 new tests covering distinct catalog IDs, distinct file paths,
  `subfolder_path` population, and backward-compatibility for packages
  without a subpath.

Closes #30

https://claude.ai/code/session_01TM31h9vfpQCdpBw4TyNC8T
…mponents

_generate_readme now emits a `src: <subpath>` line when the manifest
has a non-None subpath, matching the frontend's existing behaviour in
buildSnipHtml/buildEntry.  The local checkout name (ext/<name>) also
uses the subpath instead of the repo name, so the snippet is correct
for monorepo components.

Three new tests cover src-line presence, correct local name selection,
and absence of src when subpath is None.

https://claude.ai/code/session_01TM31h9vfpQCdpBw4TyNC8T
@spoorcc spoorcc force-pushed the claude/fix-monorepo-catalog-id-VUrvY branch from 4158414 to 9426078 on March 6, 2026 23:00

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch_hub/catalog/writer.py`:
- Around line 420-424: Catalog entries created at repo root are not pruned when
a manifest gains a subpath, leaving stale duplicates; before calling
_merge_catalog_entry and _write_detail_json, detect the legacy repo-level key by
building a legacy_id via _catalog_id(vcs_host, org, repo, "") (i.e., empty
subpath) and if catalog.get(legacy_id) exists compare its
registry_path/entry_name against the new manifest's registry_path/entry_name
(from manifest and registry_path) and only when they match remove
catalog[legacy_id] and delete the legacy detail file (the repo.json under the
repo detail path) to avoid removing unrelated repo-root packages, then proceed
to write the new subpath record with _merge_catalog_entry and
_write_detail_json.

In `@dfetch_hub/commands/update.py`:
- Around line 112-114: The manifest subpath is only set when m.homepage is None,
causing sibling packages that already have a repo-root homepage to keep
m.subpath as None and collide in write_catalog(); update the logic around
m.homepage/fallback_homepage to set m.subpath = entry_dir.name whenever the
resolved homepage (either m.homepage or fallback_homepage) points to the repo
root — i.e., determine the effective_homepage = m.homepage or fallback_homepage
and if that equals the repo-root URL then assign m.subpath = entry_dir.name
(leave m.homepage as-is if present); adjust the block around m.homepage,
fallback_homepage, and m.subpath to use this effective_homepage before calling
write_catalog().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de42edae-3f09-45bb-b298-d207a0db49c6

📥 Commits

Reviewing files that changed from the base of the PR and between 4158414 and 9426078.

📒 Files selected for processing (4)
  • dfetch_hub/catalog/sources/__init__.py
  • dfetch_hub/catalog/writer.py
  • dfetch_hub/commands/update.py
  • test/test_catalog_writer.py

Comment on lines +112 to +114
if m.homepage is None and fallback_homepage is not None:
    m.homepage = fallback_homepage
    m.subpath = entry_dir.name
Copy link
Contributor


⚠️ Potential issue | 🟠 Major

Populate subpath whenever the manifest resolves to the source repo root.

This only sets subpath when homepage is missing. If a parser already returns the same repo-root URL in m.homepage, sibling packages from that monorepo still reach write_catalog() with subpath=None and continue to collide on the same github/org/repo entry/detail file.

Suggested change
-            if m.homepage is None and fallback_homepage is not None:
-                m.homepage = fallback_homepage
-                m.subpath = entry_dir.name
+            if fallback_homepage is not None:
+                if m.homepage is None:
+                    m.homepage = fallback_homepage
+                if m.homepage == fallback_homepage and m.subpath is None:
+                    m.subpath = entry_dir.name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/commands/update.py` around lines 112 - 114, The manifest subpath
is only set when m.homepage is None, causing sibling packages that already have
a repo-root homepage to keep m.subpath as None and collide in write_catalog();
update the logic around m.homepage/fallback_homepage to set m.subpath =
entry_dir.name whenever the resolved homepage (either m.homepage or
fallback_homepage) points to the repo root — i.e., determine the
effective_homepage = m.homepage or fallback_homepage and if that equals the
repo-root URL then assign m.subpath = entry_dir.name (leave m.homepage as-is if
present); adjust the block around m.homepage, fallback_homepage, and m.subpath
to use this effective_homepage before calling write_catalog().


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
dfetch_hub/commands/update.py (1)

112-115: ⚠️ Potential issue | 🟠 Major

Subpath not set when manifest already has repo-root homepage.

The current logic only assigns subpath when m.homepage is None. If the manifest parser already populates m.homepage with the same repo-root URL (e.g., from a portfile that explicitly sets the homepage), sibling packages from that monorepo will still have subpath=None and may collide on catalog IDs.

🐛 Proposed fix to always set subpath for subfolder packages
-            if m.homepage is None and fallback_homepage is not None:
-                m.homepage = fallback_homepage
+            if fallback_homepage is not None:
+                if m.homepage is None:
+                    m.homepage = fallback_homepage
+                # Always set subpath for subfolder packages so monorepo components
+                # get distinct catalog IDs regardless of homepage source
                 m.subpath = entry_dir.name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/commands/update.py` around lines 112 - 115, The code only sets
m.subpath when m.homepage is None, causing sibling packages with an
already-populated repo-root m.homepage to miss subpaths and collide; update the
logic in update.py so that after determining fallback_homepage you always set
m.subpath = entry_dir.name for packages that live in a subfolder (i.e., when
entry_dir is not the repo root or when fallback_homepage is not None), and
ensure m.homepage remains unchanged if already set (use fallback_homepage only
when m.homepage is None); modify the block referencing m.homepage,
fallback_homepage, m.subpath and manifests to assign subpath unconditionally for
subfolder packages while preserving existing homepage values.
dfetch_hub/catalog/writer.py (1)

163-166: ⚠️ Potential issue | 🟠 Major

Prune the old repo-root record when a package gains a subpath.

On the first write after this change, a package that used to live at github/org/repo will now also be written at github/org/repo/<subpath>, but the old catalog key and repo-level detail JSON are left behind. That keeps stale duplicate packages in existing catalog data until someone manually cleans them up. Remove the legacy pair before writing the new record, but only when it matches the same package identity.
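The pruning this comment asks for could be sketched as below. This is a minimal illustration, assuming a plain-dict catalog keyed by ID and a `"name"` field for package identity; the real Catalog API and identity field may differ:

```python
def prune_legacy_root_entry(catalog: dict, vcs_host: str, org: str,
                            repo: str, package_name: str) -> bool:
    """Drop a stale repo-root entry once the same package is keyed by subpath."""
    legacy_id = f"{vcs_host}/{org}/{repo}"
    entry = catalog.get(legacy_id)
    # Only prune when the root entry is the same package, so an unrelated
    # package that genuinely lives at the repo root is left untouched.
    if entry is not None and entry.get("name") == package_name:
        del catalog[legacy_id]
        return True
    return False
```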

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/writer.py` around lines 163 - 166, Before writing the new
subpath record, detect and remove any legacy repo-root record that represents
the same package identity: after calling catalog.get_or_create_entry(manifest,
vcs_host, org, repo, self.label) and before self._write_detail(...), check
whether the manifest now has a subpath (manifest.package or manifest.identity
indicates a non-root path) and whether a repo-root entry exists for the same
package identity; if so, call the catalog removal API (e.g.,
catalog.delete_entry / catalog.remove_pair on the repo-level key for vcs_host,
org, repo and self.label) to delete the old repo-root pair, then proceed to
write the new detail with self._write_detail(vcs_host, org, repo, manifest).
Ensure the identity comparison uses the package identity field
(manifest.identity or manifest.package name+version) so only matching-package
root records are pruned.
🧹 Nitpick comments (6)
test/test_catalog_entry.py (1)

5-8: Unused Path import in TYPE_CHECKING block.

The Path type is imported but never used in this test file.

🧹 Remove unused import
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from pathlib import Path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_catalog_entry.py` around lines 5 - 8, Remove the unused Path import
from the TYPE_CHECKING block: delete the "from pathlib import Path" line since
Path is not referenced anywhere in this test file; keep the TYPE_CHECKING import
if other conditional imports remain, otherwise remove the entire TYPE_CHECKING
block.
dfetch_hub/catalog/model.py (1)

79-170: Consider using @dataclass for Tag and CatalogSource.

These classes hold structured data with simple attributes, similar to VCSLocation and GitRefs above. Converting them to dataclasses would reduce boilerplate and align with the coding guidelines.

♻️ Proposed refactor for Tag
+@dataclass
 class Tag:
     """Represents a Git tag or branch reference from a remote repository.

     Tags identify specific versions (e.g., "v1.2.3") while branches track lines
     of development. Both are fetched via ``git ls-remote`` when updating the catalog.
+
+    Attributes:
+        name: The tag or branch name.
+        is_tag: True for tags, False for branches.
+        commit_sha: Full 40-character SHA of the commit.
+        date: ISO-formatted date string.
     """

-    def __init__(
-        self,
-        name: str,
-        is_tag: bool = True,
-        commit_sha: str | None = None,
-        date: str | None = None,
-    ) -> None:
-        """Initialize a Tag.
-
-        Args:
-            name: The tag or branch name.
-            is_tag: True for tags, False for branches. Defaults to True.
-            commit_sha: Full 40-character SHA of the commit. Defaults to None.
-            date: ISO-formatted date string. Defaults to None.
-        """
-        self.name = name
-        self.is_tag = is_tag
-        self.commit_sha = commit_sha
-        self.date = date
+    name: str
+    is_tag: bool = True
+    commit_sha: str | None = None
+    date: str | None = None

As per coding guidelines: "Use @dataclass decorator for all structured data in Python".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/model.py` around lines 79 - 170, Replace the manual
boilerplate in Tag and CatalogSource with Python dataclasses: add "from
dataclasses import dataclass, field" and annotate the classes Tag and
CatalogSource with `@dataclass`, move the existing __init__ attributes into
class-level type-annotated fields (preserving defaults and Optional/Union types
for commit_sha/date/registry_version), and convert to_dict/from_dict to either
use dataclasses.asdict or simple classmethods that construct the dataclass from
a dict (keep the same keys and default fallbacks). Ensure method names
Tag.to_dict, Tag.from_dict, CatalogSource.to_dict and CatalogSource.from_dict
remain available with the same behavior and types.
test/test_catalog_detail.py (1)

13-14: Unused Path import in TYPE_CHECKING block.

The Path type is imported but never used in this test file.

🧹 Remove unused import
-if TYPE_CHECKING:
-    from pathlib import Path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_catalog_detail.py` around lines 13 - 14, Remove the unused Path
import from the TYPE_CHECKING block: the import "from pathlib import Path" is
not referenced in this test, so delete that line (leave the TYPE_CHECKING
conditional if other imports remain) to avoid an unused import warning; search
for the symbols TYPE_CHECKING and Path in the file to locate and remove the
specific import statement.
dfetch_hub/catalog/writer.py (1)

39-99: Bring the new public API docstrings up to the repo format.

These public classes and methods mostly use summary-only docstrings. The repo standard for public Python APIs requires Google-style Args:, Returns:, and Raises: sections where relevant.

As per coding guidelines, "Add docstrings to every public Python module, class, and function in Google-style format with Args:, Returns:, and Raises: sections where relevant".

Also applies to: 107-176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/writer.py` around lines 39 - 99, The public API docstrings
(class Catalog and its methods: __init__, to_dict, from_dict, load, dump,
get_or_create_entry) use summary-only style; update each to Google-style
docstrings including Args:, Returns:, and Raises: sections where applicable
(e.g., __init__ should list entries param, from_dict/load should document the
data/path and return Catalog, dump should document path and any IO errors in
Raises, get_or_create_entry should document manifest, vcs_host, org, repo, label
parameters and specify the returned (CatalogEntry, bool) tuple). Keep the
current summaries but expand them to conform to the repo standard for public
classes and methods.
test/test_catalog_writer.py (2)

224-271: Add one nested-subpath regression case.

These tests prove distinct IDs for foo and bar, but the linked issue is about preserving full subfolder paths. A regression that collapsed libs/foo to just foo would still pass here. Add at least one case with a nested subpath and assert both the catalog ID and detail path keep the full path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_catalog_writer.py` around lines 224 - 271, Add a regression test
that uses _monorepo_manifest with a nested subpath (e.g., "libs/foo") and assert
the CatalogWriter preserves the full subfolder path: check the catalog key and
the entry's "id" equal "github/myorg/mymonorepo/libs/foo", assert the detail
file exists at the nested path under tmp_path (e.g.,
.../mymonorepo/libs/foo.json), and assert the detail JSON's "subfolder_path" ==
"libs/foo"; place this alongside the existing monorepo tests and reuse
CatalogWriter and _monorepo_manifest to keep consistency.

15-34: Add explicit BaseManifest return types to the test helpers.

Both helper factories are unannotated today, which conflicts with the repo's mypy --strict rule for explicit function return types. These should return BaseManifest explicitly, ideally via a TYPE_CHECKING import.

As per coding guidelines, "Ensure mypy --strict passes with full type annotations on every function, return type, and non-obvious variables".

Also applies to: 37-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_catalog_writer.py` around lines 15 - 34, Add an explicit
BaseManifest return type to the test helper(s): import TYPE_CHECKING from typing
and, inside an if TYPE_CHECKING: block, import BaseManifest from
dfetch_hub.catalog.sources; then annotate the helper function def _manifest(...)
to return BaseManifest (use a string forward reference if needed, e.g. ->
"BaseManifest") and do the same for the other helper factory referenced in the
file so mypy --strict passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch_hub/catalog/writer.py`:
- Around line 177-189: Validate and sanitize manifest.subpath before using it to
build detail_path or passing it to CatalogDetail.dump: ensure it's not absolute
and does not contain path-traversal segments (e.g., '..'); normalize with
pathlib.Path(manifest.subpath).resolve() or .parts and verify the resulting path
stays inside self.data_dir (compare resolved paths), or reject/clean invalid
values (strip leading slashes and reject any '..' segments) and raise/log an
error if invalid; update the code paths around detail_path construction and the
call sites CatalogDetail.load, CatalogDetail.from_manifest,
detail.update_from_manifest, and detail.dump to use the validated/sanitized
subpath.
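The validation that prompt describes could look like the sketch below. The name echoes the `sanitize_subpath` helper mentioned in the walkthrough, but the exact behaviour (strip leading/trailing slashes, reject traversal segments) is an assumption:

```python
from pathlib import PurePosixPath

def sanitize_subpath(subpath: str) -> str:
    """Reject absolute or traversal-prone subpaths before building file paths."""
    cleaned = subpath.strip("/")
    parts = PurePosixPath(cleaned).parts
    # An empty subpath or any "." / ".." segment could escape data_dir.
    if not parts or any(p in (".", "..") for p in parts):
        raise ValueError(f"invalid subpath: {subpath!r}")
    return cleaned
```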

---

Duplicate comments:
In `@dfetch_hub/catalog/writer.py`:
- Around line 163-166: Before writing the new subpath record, detect and remove
any legacy repo-root record that represents the same package identity: after
calling catalog.get_or_create_entry(manifest, vcs_host, org, repo, self.label)
and before self._write_detail(...), check whether the manifest now has a subpath
(manifest.package or manifest.identity indicates a non-root path) and whether a
repo-root entry exists for the same package identity; if so, call the catalog
removal API (e.g., catalog.delete_entry / catalog.remove_pair on the repo-level
key for vcs_host, org, repo and self.label) to delete the old repo-root pair,
then proceed to write the new detail with self._write_detail(vcs_host, org,
repo, manifest). Ensure the identity comparison uses the package identity field
(manifest.identity or manifest.package name+version) so only matching-package
root records are pruned.

In `@dfetch_hub/commands/update.py`:
- Around line 112-115: The code only sets m.subpath when m.homepage is None,
causing sibling packages with an already-populated repo-root m.homepage to miss
subpaths and collide; update the logic in update.py so that after determining
fallback_homepage you always set m.subpath = entry_dir.name for packages that
live in a subfolder (i.e., when entry_dir is not the repo root or when
fallback_homepage is not None), and ensure m.homepage remains unchanged if
already set (use fallback_homepage only when m.homepage is None); modify the
block referencing m.homepage, fallback_homepage, m.subpath and manifests to
assign subpath unconditionally for subfolder packages while preserving existing
homepage values.

---

Nitpick comments:
In `@dfetch_hub/catalog/model.py`:
- Around line 79-170: Replace the manual boilerplate in Tag and CatalogSource
with Python dataclasses: add "from dataclasses import dataclass, field" and
annotate the classes Tag and CatalogSource with `@dataclass`, move the existing
__init__ attributes into class-level type-annotated fields (preserving defaults
and Optional/Union types for commit_sha/date/registry_version), and convert
to_dict/from_dict to either use dataclasses.asdict or simple classmethods that
construct the dataclass from a dict (keep the same keys and default fallbacks).
Ensure method names Tag.to_dict, Tag.from_dict, CatalogSource.to_dict and
CatalogSource.from_dict remain available with the same behavior and types.

In `@dfetch_hub/catalog/writer.py`:
- Around line 39-99: The public API docstrings (class Catalog and its methods:
__init__, to_dict, from_dict, load, dump, get_or_create_entry) use summary-only
style; update each to Google-style docstrings including Args:, Returns:, and
Raises: sections where applicable (e.g., __init__ should list entries param,
from_dict/load should document the data/path and return Catalog, dump should
document path and any IO errors in Raises, get_or_create_entry should document
manifest, vcs_host, org, repo, label parameters and specify the returned
(CatalogEntry, bool) tuple). Keep the current summaries but expand them to
conform to the repo standard for public classes and methods.
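As an illustration of the requested style, a hypothetical standalone version of the signature (parameter names follow the review; the real method is bound to `Catalog`) might carry a docstring like:

```python
def get_or_create_entry(catalog, manifest, vcs_host, org, repo, label):
    """Return the catalog entry for *manifest*, creating it if absent.

    Args:
        catalog: Catalog whose ``entries`` mapping is consulted.
        manifest: Parsed package manifest (name, version, subpath).
        vcs_host: Hosting service identifier such as ``github``.
        org: Owner or organization name.
        repo: Repository name.
        label: Source label to record on the new or existing entry.

    Returns:
        A ``(CatalogEntry, bool)`` tuple; the flag is ``True`` when the
        entry was newly created.
    """
    raise NotImplementedError  # docstring illustration only
```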

In `@test/test_catalog_detail.py`:
- Around line 13-14: Remove the unused Path import from the TYPE_CHECKING block:
the import "from pathlib import Path" is not referenced in this test, so delete
that line (leave the TYPE_CHECKING conditional if other imports remain) to avoid
an unused import warning; search for the symbols TYPE_CHECKING and Path in the
file to locate and remove the specific import statement.

In `@test/test_catalog_entry.py`:
- Around line 5-8: Remove the unused Path import from the TYPE_CHECKING block:
delete the "from pathlib import Path" line since Path is not referenced anywhere
in this test file; keep the TYPE_CHECKING import if other conditional imports
remain, otherwise remove the entire TYPE_CHECKING block.

In `@test/test_catalog_writer.py`:
- Around line 224-271: Add a regression test that uses _monorepo_manifest with a
nested subpath (e.g., "libs/foo") and assert the CatalogWriter preserves the
full subfolder path: check the catalog key and the entry's "id" equal
"github/myorg/mymonorepo/libs/foo", assert the detail file exists at the nested
path under tmp_path (e.g., .../mymonorepo/libs/foo.json), and assert the detail
JSON's "subfolder_path" == "libs/foo"; place this alongside the existing
monorepo tests and reuse CatalogWriter and _monorepo_manifest to keep
consistency.
- Around line 15-34: Add an explicit BaseManifest return type to the test
helper(s): import TYPE_CHECKING from typing and, inside an if TYPE_CHECKING:
block, import BaseManifest from dfetch_hub.catalog.sources; then annotate the
helper function def _manifest(...) to return BaseManifest (use a string forward
reference if needed, e.g. -> "BaseManifest") and do the same for the other
helper factory referenced in the file so mypy --strict passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f363e9db-a755-4daa-99bf-ac44e1c0d307

📥 Commits

Reviewing files that changed from the base of the PR and between 9426078 and 187171c.

📒 Files selected for processing (10)
  • dfetch-hub.toml
  • dfetch_hub/catalog/detail.py
  • dfetch_hub/catalog/entry.py
  • dfetch_hub/catalog/model.py
  • dfetch_hub/catalog/writer.py
  • dfetch_hub/commands/update.py
  • test/test_catalog_detail.py
  • test/test_catalog_entry.py
  • test/test_catalog_model.py
  • test/test_catalog_writer.py

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
dfetch_hub/commands/update.py (1)

112-115: ⚠️ Potential issue | 🔴 Critical

Don't overwrite subpath for every parsed entry.

This now assigns entry_dir.name whenever the source has a fallback homepage, even if the manifest already points at its own upstream repo or already carries a more specific subpath. That will misclassify normal packages as monorepo components, and it also collapses parser-provided nested subpaths like libs/foo to just foo.

Suggested fix
         else:
+            effective_homepage = m.homepage or fallback_homepage
             if m.homepage is None and fallback_homepage is not None:
                 m.homepage = fallback_homepage
-            if fallback_homepage is not None:
+            if (
+                fallback_homepage is not None
+                and effective_homepage == fallback_homepage
+                and m.subpath is None
+            ):
                 m.subpath = entry_dir.name
             manifests.append(m)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/commands/update.py` around lines 112 - 115, The code
unconditionally sets m.subpath to entry_dir.name whenever fallback_homepage is
present, overwriting parser-provided subpaths and misclassifying packages;
change the logic in the update code that handles fallback_homepage (check where
m.homepage and fallback_homepage are processed) so you only set m.subpath =
entry_dir.name if m.subpath is None or empty (i.e., preserve an existing
m.subpath), and do not overwrite when m.homepage already points to an upstream
repo; keep the existing assignment that fills m.homepage from fallback_homepage
but guard the subpath assignment behind a conditional that respects existing
m.subpath and upstream-homepage presence.
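A standalone sketch of that guard (hypothetical helper; the real code operates inline on `m` inside update.py):

```python
def assign_homepage_and_subpath(m, fallback_homepage, entry_dir_name):
    """Fill homepage from the fallback and set subpath only when safe."""
    effective_homepage = m.homepage or fallback_homepage
    if m.homepage is None and fallback_homepage is not None:
        m.homepage = fallback_homepage
    # Only tag as a monorepo component when the manifest really lives at
    # the catalog source's repo and carries no more specific subpath.
    if (
        fallback_homepage is not None
        and effective_homepage == fallback_homepage
        and m.subpath is None
    ):
        m.subpath = entry_dir_name
```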
dfetch_hub/catalog/sources/__init__.py (1)

193-202: ⚠️ Potential issue | 🟠 Major

Return the normalized subpath, not the raw input.

This validates subpath.replace("\\", "/"), but then returns the original subpath. Inputs like libs\\foo therefore stay unnormalized, and on Windows rooted/drive-prefixed values can still be treated as filesystem roots when CatalogDetail.dump() joins them into a path.

Suggested fix
-        subpath = subpath.strip("/")
+        subpath = subpath.strip("/").replace("\\", "/")

         if not subpath or subpath.startswith("."):
             return None

-        parts = subpath.replace("\\", "/").split("/")
+        parts = subpath.split("/")
         if any(part == ".." for part in parts):
             return None

-        return subpath
+        if ":" in parts[0]:
+            return None
+
+        return "/".join(parts)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/sources/__init__.py` around lines 193 - 202, The function
currently validates a normalized form but returns the original raw subpath;
change the code to operate on and return the normalized subpath (e.g., perform
subpath = subpath.replace("\\", "/").strip("/") or assign normalized =
subpath.replace("\\", "/").strip("/") and use that for empty/leading-dot/..
checks), then return that normalized value so callers like CatalogDetail.dump()
receive a safe, normalized path. Ensure all checks (startswith("."), any part ==
"..") use the normalized variable before returning it.
dfetch_hub/catalog/writer.py (1)

211-215: ⚠️ Potential issue | 🟠 Major

Don't delete the repo-root record just because a subpath exists.

This prunes github/org/repo for every subpath manifest. If the same upstream repository legitimately has both a repo-root package and a subcomponent, the root package disappears from catalog.json, and the old <repo>.json detail file is left behind anyway. Only remove the root record when you can prove it is the stale legacy alias of this same manifest.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/writer.py` around lines 211 - 215, The current logic
removes the repo-root catalog entry whenever manifest.sanitized_subpath is
truthy, which wrongly deletes valid root packages; instead, when sanitized =
manifest.sanitized_subpath is set, first fetch the existing root entry (use
catalog.get_entry or catalog.get_or_create_entry outcome for vcs_host, org,
repo) and only call catalog.remove_entry(vcs_host, org, repo) if that root entry
is provably the stale legacy alias of this same manifest (for example by
comparing a unique manifest identifier such as
manifest.digest/manifest.source_url/manifest.version or another stable key on
the existing entry to the current manifest); otherwise leave the root record
intact and proceed to call catalog.get_or_create_entry(manifest, vcs_host, org,
repo, self.label) as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d217c0a6-5838-41ac-a0f1-47f321b890e6

📥 Commits

Reviewing files that changed from the base of the PR and between 187171c and 6a7bba4.

📒 Files selected for processing (8)
  • dfetch_hub/catalog/detail.py
  • dfetch_hub/catalog/model.py
  • dfetch_hub/catalog/sources/__init__.py
  • dfetch_hub/catalog/writer.py
  • dfetch_hub/commands/update.py
  • test/test_catalog_detail.py
  • test/test_catalog_entry.py
  • test/test_catalog_writer.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dfetch_hub/catalog/model.py

@spoorcc spoorcc linked an issue Mar 7, 2026 that may be closed by this pull request
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (1)
dfetch_hub/commands/update.py (1)

251-256: Consider extracting the error message to a constant or custom exception.

Static analysis flags TRY003: specifying long messages outside the exception class. This is a minor style nit.

Optional refactor
+class LimitValueError(ValueError):
+    """Raised when --limit is negative."""
+    def __init__(self) -> None:
+        super().__init__("--limit must be >= 0")
+
+
 def _non_negative_int(value: str) -> int:
     """Parse *value* as a non-negative integer for ``--limit``."""
     parsed = int(value)
     if parsed < 0:
-        raise argparse.ArgumentTypeError("--limit must be >= 0")
+        raise argparse.ArgumentTypeError(str(LimitValueError()))
     return parsed

Alternatively, this is fine as-is since argparse.ArgumentTypeError is inherently a user-facing message container.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/commands/update.py` around lines 251 - 256, The error message
string in _non_negative_int is a long literal flagged by static analysis;
extract it into a named constant (e.g., LIMIT_ERROR_MSG = "--limit must be >=
0") or define a small custom exception class (e.g., class
NonNegativeIntError(argparse.ArgumentTypeError): pass) and raise that with the
constant message instead of embedding the literal directly in the raise
argparse.ArgumentTypeError call; update the raise in _non_negative_int to use
the constant or the new custom exception to satisfy TRY003.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dfetch_hub/catalog/sources/__init__.py`:
- Around line 22-47: The current parse_vcs_slug (and _VCS_URL_RE) greedily
treats the last path segment as the repo which breaks on component URLs like
/tree/...; change parse_vcs_slug to parse the URL path with
urllib.parse.urlparse, split path into segments, detect common
repository-trailing markers (e.g. "tree", "blob", "src", maybe "-") and if one
exists use the segment immediately before that as the repo, otherwise use the
last segment, strip an optional ".git", and return (netloc.lower(),
"/".join(owner_segments).lower(), repo.lower()) or None for invalid paths;
update/remove use of _VCS_URL_RE in parse_vcs_slug accordingly so nested owners
(group/subgroup) are preserved and trailing component paths are dropped.
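A hedged sketch of that parsing strategy (the marker set and return shape are assumptions drawn from the prompt, not the repo's actual implementation):

```python
from urllib.parse import urlparse

_TRAILING_MARKERS = ("tree", "blob", "src", "-")


def parse_vcs_slug(url):
    """Split a VCS URL into (host, owner, repo), dropping component paths."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Cut everything from the first repository-trailing marker onwards.
    for index, segment in enumerate(segments):
        if segment in _TRAILING_MARKERS:
            segments = segments[:index]
            break
    if len(segments) < 2:
        return None
    *owner_parts, repo = segments
    if repo.endswith(".git"):
        repo = repo[:-4]
    return parsed.netloc.lower(), "/".join(owner_parts).lower(), repo.lower()
```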
- Around line 74-82: The fetch_raw function accepts non-HTTP schemes via
urlopen; validate the URL scheme at the top of fetch_raw (use
urllib.parse.urlparse) and only allow "http" or "https"; if the scheme is
missing/invalid, log a debug message with logger (similar format to the existing
exception log) and return None instead of calling urlopen, keeping the existing
behavior for network errors; update imports to bring in urlparse so you can
perform scheme checking before using urlopen and preserve use of _HEADERS and
the current timeout/exception handling.
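A sketch of the scheme guard (timeout value and header handling simplified from the real `fetch_raw`):

```python
import logging
from urllib.parse import urlparse
from urllib.request import Request, urlopen

logger = logging.getLogger(__name__)


def fetch_raw(url, timeout=10):
    """Fetch a URL body, refusing anything that is not http(s)."""
    if urlparse(url).scheme not in ("http", "https"):
        logger.debug("Refusing non-HTTP url: %s", url)
        return None
    try:
        with urlopen(Request(url), timeout=timeout) as response:  # noqa: S310
            return response.read()
    except OSError as exc:
        logger.debug("Failed to fetch %s: %s", url, exc)
        return None
```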

In `@dfetch_hub/catalog/sources/clib.py`:
- Around line 180-182: The manifest is being marked in_project_repo
unconditionally; change that to be conditional (e.g., replace
in_project_repo=True with in_project_repo=is_github or better:
in_project_repo=from_project_repo where from_project_repo is a boolean you
compute earlier based on whether the manifest truly came from the project repo —
for example derive from is_github, presence of a VCS URL, or whether
fetch_readme(owner, repo) succeeded — so only manifests pulled from the actual
project repo are flagged in_project_repo. Use the existing symbols
readme_content, is_github, vcs_url (or create from_project_repo) to implement
this.

In
`@dfetch_hub/data/github/bakeruk/modern-typescript-monorepo-example/typescript-example-2.json`:
- Around line 5-10: The manifest's subfolder_path currently contains only the
leaf directory ("typescript-example-2") which causes
CatalogEntry.catalog_id(..., manifest.subpath) collisions for nested monorepo
components; update the JSON so subfolder_path stores the full repo-relative
subpath ("packages/typescript-example-2") to match index_path, then regenerate
the detail record and re-run the catalog index generation to refresh entries and
avoid id collisions. Ensure you update the file field "subfolder_path" to the
full relative path and rebuild the catalog index.

In `@test/test_catalog_sources.py`:
- Around line 247-257: The test uses BaseManifest(...) with description=None
which violates the manifest's type (description: str); update the fixture in
test_sanitize_subpath_property_returns_normalized_backslash_path to provide a
type-correct value (e.g. description="") instead of None (and similarly replace
any other str-typed fields set to None such as homepage/license/version if
present) so the test remains focused on sanitized_subpath while keeping mypy
--strict satisfied; locate the constructor call to BaseManifest in that test and
make the replacement.

In `@test/test_commands_update.py`:
- Around line 29-38: The test constructs a BaseManifest with description=None
which conflicts with BaseManifest.description being typed as str; update the
call that creates BaseManifest (the test helper returning BaseManifest) to pass
description="" (empty string) instead of None so the type matches and mypy will
pass; locate the factory/return expression that calls BaseManifest(...) and
replace description=None with description="".


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a5f9cb60-a8d8-4490-a582-328295fb85cb

📥 Commits

Reviewing files that changed from the base of the PR and between 6a7bba4 and 397d6f1.

📒 Files selected for processing (18)
  • dfetch_hub/catalog/sources/__init__.py
  • dfetch_hub/catalog/sources/clib.py
  • dfetch_hub/catalog/sources/conan.py
  • dfetch_hub/catalog/sources/readme.py
  • dfetch_hub/catalog/sources/vcpkg.py
  • dfetch_hub/catalog/writer.py
  • dfetch_hub/commands/update.py
  • dfetch_hub/data/catalog.json
  • dfetch_hub/data/github/bakeruk/modern-typescript-monorepo-example/typescript-example-1.json
  • dfetch_hub/data/github/bakeruk/modern-typescript-monorepo-example/typescript-example-2.json
  • dfetch_hub/data/github/faburaya/3fd.json
  • test/test_catalog_clib.py
  • test/test_catalog_conan.py
  • test/test_catalog_readme.py
  • test/test_catalog_sources.py
  • test/test_catalog_vcpkg.py
  • test/test_catalog_writer.py
  • test/test_commands_update.py
✅ Files skipped from review due to trivial changes (3)
  • dfetch_hub/data/github/bakeruk/modern-typescript-monorepo-example/typescript-example-1.json
  • dfetch_hub/data/github/faburaya/3fd.json
  • test/test_catalog_conan.py

Comment on lines +211 to +222
        sanitized = manifest.sanitized_subpath
        if sanitized:
            root_id = CatalogEntry.catalog_id(vcs_host, org, repo, None)
            existing_root = catalog.entries.get(root_id)
            if existing_root and self.label in existing_root.source_labels:
                catalog.remove_entry(vcs_host, org, repo)

        _, is_new = catalog.get_or_create_entry(manifest, vcs_host, org, repo, self.label)

        self._write_detail(vcs_host, org, repo, manifest)

        return is_new, not is_new

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's validation that would skip manifests with invalid subpaths earlier in the flow
rg -n "sanitize_subpath|sanitized_subpath" --type=py -C3

Repository: dfetch-org/dfetch-hub

Length of output: 7409


🏁 Script executed:

#!/bin/bash
# Find where write_catalog is called and any upstream validation
rg -n "write_catalog" --type=py -B5 -A2 | head -50

Repository: dfetch-org/dfetch-hub

Length of output: 3465


🏁 Script executed:

#!/bin/bash
# Check the get_or_create_entry implementation to see how it uses manifest
rg -n "def get_or_create_entry" --type=py -A10

Repository: dfetch-org/dfetch-hub

Length of output: 760


🏁 Script executed:

#!/bin/bash
# Check the full _write_detail implementation to see what happens when sanitized is None
sed -n '228,240p' dfetch_hub/catalog/writer.py

Repository: dfetch-org/dfetch-hub

Length of output: 493


🏁 Script executed:

#!/bin/bash
# Check entry.py to see how catalog_id is constructed from manifest
rg -n "def catalog_id" --type=py -A10

Repository: dfetch-org/dfetch-hub

Length of output: 1616


🏁 Script executed:

#!/bin/bash
# Get full get_or_create_entry implementation
sed -n '99,130p' dfetch_hub/catalog/writer.py

Repository: dfetch-org/dfetch-hub

Length of output: 1293


🏁 Script executed:

#!/bin/bash
# Check the write method to see if there's any filtering of manifests
sed -n '185,222p' dfetch_hub/catalog/writer.py

Repository: dfetch-org/dfetch-hub

Length of output: 1573


🏁 Script executed:

#!/bin/bash
# Check if there are tests for manifests with invalid subpaths
rg -n "test.*invalid.*subpath|test.*dotdot" test/ --type=py -B2 -A5

Repository: dfetch-org/dfetch-hub

Length of output: 1167


🏁 Script executed:

#!/bin/bash
# Check if there's any specific handling or tests for manifests with invalid subpaths in write_manifest
rg -n "write_manifest" test/ --type=py -A3 | grep -A3 "subpath"

Repository: dfetch-org/dfetch-hub

Length of output: 862


🏁 Script executed:

#!/bin/bash
# Check what happens when a manifest with invalid subpath is written
sed -n '1,50p' test/test_catalog_writer.py | head -40

Repository: dfetch-org/dfetch-hub

Length of output: 1215


🏁 Script executed:

#!/bin/bash
# Check if there's a test for write_manifest with invalid subpath
rg -n "def test.*write.*manifest" test/test_catalog_writer.py -A1

Repository: dfetch-org/dfetch-hub

Length of output: 611


🏁 Script executed:

#!/bin/bash
# Look for any tests that use an invalid subpath (with ..)
rg -n "_manifest.*subpath|_monorepo_manifest" test/test_catalog_writer.py -B1 -A3

Repository: dfetch-org/dfetch-hub

Length of output: 2166


🏁 Script executed:

#!/bin/bash
# Check the full code path for a manifest with invalid subpath - what's the actual behavior?
# Let's verify: when manifest.subpath = "..", what happens?
python3 << 'EOF'
# Simulate the logic
manifest_subpath = ".."
sanitized_subpath_result = None  # sanitize_subpath would return None for ".."

# What catalog ID would be created?
catalog_id = f"github/org/repo/{manifest_subpath.lower()}" if manifest_subpath else "github/org/repo"
print(f"Catalog ID with subpath='..': {catalog_id}")

# What detail path would be used?
if sanitized_subpath_result:
    detail_path = f"data_dir/vcs_host/org/repo/{sanitized_subpath_result}.json"
else:
    detail_path = f"data_dir/vcs_host/org/repo.json"
print(f"Detail path when sanitized=None: {detail_path}")

print("\nInconsistency confirmed:")
print(f"  Catalog ID references: {catalog_id}")
print(f"  Detail file location: {detail_path}")
EOF

Repository: dfetch-org/dfetch-hub

Length of output: 304


🏁 Script executed:

#!/bin/bash
# Check if there's any validation before write is called that filters manifests
rg -n "\.write\(" dfetch_hub/ --type=py -B5 | head -50

Repository: dfetch-org/dfetch-hub

Length of output: 1836


🏁 Script executed:

#!/bin/bash
# Check if there's a test that explicitly verifies behavior with invalid subpath
rg -n "test.*invalid|test.*\\.\\." test/test_catalog_writer.py -C3

Repository: dfetch-org/dfetch-hub

Length of output: 47


Handle manifests with invalid subpaths to prevent orphaned catalog entries.

When a manifest has an invalid subpath (e.g., containing ..), sanitized_subpath returns None. In this case, line 218 creates a catalog entry using the raw manifest.subpath via get_or_create_entry, generating an ID like `github/org/repo/..`. However, the detail file is written to <repo>.json (lines 232-237) since the sanitized path is invalid. This creates an orphaned catalog entry with an invalid ID that doesn't match the detail file location.

Either skip manifests with invalid subpaths before processing, or use sanitized_subpath consistently for both the catalog ID and detail path to prevent this mismatch.
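One way to keep the two in lockstep, sketched with hypothetical helpers (the real `CatalogEntry.catalog_id` and `CatalogDetail` signatures may differ), is to derive both from the same sanitized value:

```python
def _sanitize(subpath):
    """Mirror of sanitized_subpath: None for unsafe inputs like '..'."""
    if not subpath:
        return None
    normalized = subpath.replace("\\", "/").strip("/")
    if not normalized or normalized.startswith("."):
        return None
    if any(part == ".." for part in normalized.split("/")):
        return None
    return normalized


def id_and_detail_path(vcs_host, org, repo, raw_subpath):
    """Return (catalog_id, detail_relpath) derived from one sanitized value."""
    sanitized = _sanitize(raw_subpath)
    base = f"{vcs_host}/{org}/{repo}"
    if sanitized:
        return f"{base}/{sanitized.lower()}", f"{base}/{sanitized}.json"
    # Invalid or absent subpath: fall back to the repo-root pair together.
    return base, f"{base}.json"
```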

Comment on lines +5 to +10
  "subfolder_path": "typescript-example-2",
  "catalog_sources": [
    {
      "source_name": "ts-monorepo",
      "label": "ts-monorepo",
      "index_path": "packages/typescript-example-2",

⚠️ Potential issue | 🟠 Major

Persist the full repo-relative subpath here.

index_path says this component lives under packages/typescript-example-2, but subfolder_path keeps only the leaf name. CatalogEntry.catalog_id(..., manifest.subpath) uses that value verbatim, so nested monorepo components with the same leaf directory would still collide. Regenerate this detail record and the catalog index with the full relative path.

Expected shape
-  "subfolder_path": "typescript-example-2",
+  "subfolder_path": "packages/typescript-example-2",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  "subfolder_path": "typescript-example-2",
+  "subfolder_path": "packages/typescript-example-2",
   "catalog_sources": [
     {
       "source_name": "ts-monorepo",
       "label": "ts-monorepo",
       "index_path": "packages/typescript-example-2",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@dfetch_hub/data/github/bakeruk/modern-typescript-monorepo-example/typescript-example-2.json`
around lines 5 - 10, The manifest's subfolder_path currently contains only the
leaf directory ("typescript-example-2") which causes
CatalogEntry.catalog_id(..., manifest.subpath) collisions for nested monorepo
components; update the JSON so subfolder_path stores the full repo-relative
subpath ("packages/typescript-example-2") to match index_path, then regenerate
the detail record and re-run the catalog index generation to refresh entries and
avoid id collisions. Ensure you update the file field "subfolder_path" to the
full relative path and rebuild the catalog index.

@spoorcc (Contributor, Author) commented Mar 7, 2026

@coderabbitai resume

coderabbitai bot commented Mar 7, 2026

✅ Actions performed

Reviews resumed.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
dfetch_hub/catalog/sources/__init__.py (1)

226-233: Consider normalizing single-dot segments for path consistency.

The current implementation rejects .. traversal and leading dots but allows redundant single-dot segments like foo/./bar. While not a security issue, this could lead to inconsistent catalog paths.

Optional: Normalize single-dot segments
         subpath = subpath.replace("\\", "/").strip("/")
 
         if not subpath or subpath.startswith("."):
             return None
 
-        if any(part == ".." for part in subpath.split("/")):
+        parts = [p for p in subpath.split("/") if p and p != "."]
+        if any(part == ".." for part in parts):
             return None
 
-        return subpath
+        return "/".join(parts) if parts else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dfetch_hub/catalog/sources/__init__.py` around lines 226 - 233, The code
currently rejects parent-directory segments but leaves redundant single-dot
segments like "foo/./bar"; update the normalization of the subpath variable by
removing all single-dot segments: after the existing backslash-to-slash and
strip call (the subpath = subpath.replace("\\", "/").strip("/") line) and after
checking for any(part == ".." for part in subpath.split("/")), rebuild subpath
by filtering out parts equal to "." (e.g., new_subpath = "/".join(p for p in
subpath.split("/") if p != ".")), then re-check emptiness or leading dot
conditions on that normalized subpath and return None if invalid; keep using the
same subpath identifier and the existing checks (any(part == "..") and
startswith(".")) to locate where to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a3b362a-2708-4b70-9651-3a1f61223a28

📥 Commits

Reviewing files that changed from the base of the PR and between 397d6f1 and e83615d.

📒 Files selected for processing (5)
  • dfetch_hub/catalog/sources/__init__.py
  • dfetch_hub/catalog/sources/clib.py
  • dfetch_hub/commands/update.py
  • test/test_catalog_sources.py
  • test/test_commands_update.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • dfetch_hub/catalog/sources/clib.py

@spoorcc spoorcc merged commit 1d446dd into main Mar 7, 2026
2 checks passed


Development

Successfully merging this pull request may close these issues.

  • Catalog id does not consider the subfolder
  • Subfolder strategy never fills the "subfolder_path" in a details.json

2 participants