
Split manifest storage by version to prevent 20MB+ single files#696

Draft
Copilot wants to merge 8 commits into
mainfrom
copilot/optimize-manifest-csv

Contributor

Copilot AI commented Nov 17, 2025

Summary: Split Manifest Files by Version

This PR solves the issue where manifest.csv can grow to 20MB+ with many builds by implementing a split manifest system.

Current Status

The Solution

Split Manifest System:

  • Each version gets its own file: _projr/manifest/v{version}.csv
  • Files stored in _projr/manifest/ directory
  • Consolidated manifest.csv maintained for backward compatibility
  • Reading merges both sources during migration period to preserve history
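The layout above could be sketched as follows. This is a minimal illustration, assuming the manifest is a data frame with a `version` column; the helper names `manifest_split_path()` and `manifest_write_split()` and the column names are hypothetical and are not taken from the PR's code.

```r
# Hypothetical helpers; names and columns are illustrative, not the PR's API.
manifest_split_path <- function(version, dir_proj = ".") {
  # One file per version: _projr/manifest/v{version}.csv
  file.path(dir_proj, "_projr", "manifest", paste0("v", version, ".csv"))
}

manifest_write_split <- function(manifest, dir_proj = ".") {
  # Split rows by version and write each group to its own file
  for (version in unique(manifest$version)) {
    path <- manifest_split_path(version, dir_proj)
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    rows <- manifest[manifest$version == version, , drop = FALSE]
    write.csv(rows, path, row.names = FALSE)
  }
  invisible(manifest)
}

# Usage with a temporary project directory (assumed example data)
dir_proj <- tempfile("proj")
manifest <- data.frame(
  label = c("output", "output", "raw-data"),
  fn = c("a.txt", "a.txt", "b.csv"),
  version = c("0.0.1", "0.0.2", "0.0.2"),
  hash = c("abc", "abd", "xyz"),
  stringsAsFactors = FALSE
)
manifest_write_split(manifest, dir_proj)
list.files(file.path(dir_proj, "_projr", "manifest"))
```

With this layout a new build only adds or rewrites the file for its own version, so earlier version files stay byte-identical between commits.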

Known Design Trade-offs (identified in code review)

  1. Indefinite migration: Both formats maintained simultaneously until user cleans up
  2. Build logic ordering assumption: manifest_parts filtering assumes pre/post/previous ordering
  3. No atomic writes: Race condition possible in concurrent builds (uncommon use case)
  4. Silent error recovery: Corrupted split files silently return empty manifest
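
Trade-offs 3 and 4 could be narrowed along the following lines. This is a sketch under stated assumptions, not the PR's implementation: `manifest_write_atomic()` and `manifest_read_loud()` are hypothetical names. Writing to a temporary file in the same directory and renaming it avoids half-written files on POSIX file systems (behaviour of `file.rename()` on other platforms depends on the OS), but it does not prevent two concurrent builds from overwriting each other's results.

```r
# Hypothetical sketch: write to a temporary file next to the target, then
# rename it into place so readers never see a partially written manifest.
manifest_write_atomic <- function(manifest, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  path_tmp <- tempfile(
    pattern = "manifest-", tmpdir = dirname(path), fileext = ".tmp"
  )
  write.csv(manifest, path_tmp, row.names = FALSE)
  if (!file.rename(path_tmp, path)) {
    unlink(path_tmp)
    stop("Could not move temporary manifest into place: ", path)
  }
  invisible(path)
}

# Hypothetical sketch: still recover with an empty manifest, but warn
# instead of failing silently when a split file cannot be parsed.
manifest_read_loud <- function(path) {
  tryCatch(
    read.csv(path, stringsAsFactors = FALSE),
    error = function(e) {
      warning(
        "Could not read split manifest ", path, ": ", conditionMessage(e)
      )
      data.frame()
    }
  )
}

# Usage
path <- file.path(tempfile("proj"), "_projr", "manifest", "v0.0.1.csv")
manifest_write_atomic(data.frame(fn = "a.txt", hash = "abc"), path)
manifest_read_loud(path)
```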
Original prompt

This section details the original issue you should resolve.

<issue_title>Make manifest.csv more efficient</issue_title>
<issue_description>- Could consider another data format (so that it's more compressed)

  • Could maybe zip it
  • Could just store for each new hashed item the versions it corresponds to.
    • But this requires some more work, so not super interested in it at this stage.
  • I could at least split the manifest from different versions into different files, so that we don't have to check massive files into git every time.
    • Actually, this isn't true - at the storage level, Git does do deltas (see here and here). </issue_description>
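
The "zip it" option from the issue could be tried with base R alone. The sketch below uses made-up manifest rows (random hex strings stand in for real hashes) and only shows that a gzip-compressed CSV round-trips; the size ratio it prints will differ for a real manifest.

```r
# Assumed example data: 2000 rows with random 32-character hex "hashes"
set.seed(1)
manifest <- data.frame(
  fn = sprintf("file-%04d.txt", 1:2000),
  version = "0.0.1",
  hash = vapply(
    1:2000,
    function(i) paste(sample(c(0:9, letters[1:6]), 32, TRUE), collapse = ""),
    character(1)
  ),
  stringsAsFactors = FALSE
)

path_csv <- tempfile(fileext = ".csv")
path_gz <- tempfile(fileext = ".csv.gz")

# Plain CSV for comparison
write.csv(manifest, path_csv, row.names = FALSE)

# Gzip-compressed CSV written through a connection
con <- gzfile(path_gz, "w")
write.csv(manifest, con, row.names = FALSE)
close(con)

# Compressed size as a fraction of the plain size
file.size(path_gz) / file.size(path_csv)

# Read it back; read.csv() opens and closes the connection itself
manifest_back <- read.csv(gzfile(path_gz), colClasses = "character")
```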

Comments on the Issue (you are @copilot in this section)

<comment_new>@MiguelRodo
The above matters for OSF, but does it matter for Zenodo? In Zenodo, we basically always have every version up there (because there's so much space). It's also kind of our only option, as when Zenodo is private you're gonna want to just download all the files from a given version as you cannot access the API directly. Zenodo also doesn't allow deletions, at least not from the same version, so you're kind of stuck there. But that's a lot easier to deal with, I imagine.

I am thinking that we basically copy OSF's archiving option. So, we basically put something in the archive as soon as we've uploaded it.</comment_new>
<comment_new>@MiguelRodo
This is quite important, as the manifest gets checked into Git at the moment, and it can quickly get massive - 20MB for the one project after I think just one build.

A quick discussion with ChatGPT said that Git and DVC hash files in a directory, store their information in a file, hash that file, and do that recursively up to the root (including the sub-directory hashes as needed). Then that root directory hash is the only thing checked into Git, which is obviously small.

Clearly there is a lot of deduplication there in that sub-directory information is shared. But there are a few issues to think about:

  • The actual directory information is not stored in Git. What happens if the remote information gets lost, the entire project information gets lost?
  • The directory contents need to be transformed to be human-readable.
  • The directory contents need a correctly-functioning program to reconstitute the manifest.
  • The manifest information is stored in a large number of files. Reconstituting the manifest may require, if these parts are downloaded individually, a large number of downloads and uploads. This may be slow. We could consider zipping this directory instead. We still run up against the issue of large overall sizes, but at least the deduplication will help.</comment_new>
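
The recursive hashing idea described in this comment can be illustrated with a small sketch. It is not Git's or DVC's actual object format: `hash_string()` and `hash_dir()` are hypothetical helpers, and MD5 is used only because `tools::md5sum()` ships with base R.

```r
# Hash a character vector by writing it to a temporary file
hash_string <- function(x) {
  path_tmp <- tempfile()
  on.exit(unlink(path_tmp))
  writeLines(x, path_tmp, useBytes = TRUE)
  unname(tools::md5sum(path_tmp))
}

# Hash a directory from the hashes of its entries, recursing into
# sub-directories, so that only the root hash needs to be recorded
hash_dir <- function(dir) {
  entries <- list.files(dir, all.files = TRUE, no.. = TRUE)
  entries <- sort(entries, method = "radix") # locale-independent order
  lines <- vapply(entries, function(entry) {
    path <- file.path(dir, entry)
    if (dir.exists(path)) {
      paste("dir", hash_dir(path), entry)
    } else {
      paste("file", unname(tools::md5sum(path)), entry)
    }
  }, character(1))
  hash_string(lines)
}

# Usage: changing one file changes the root hash but leaves the hash of
# an untouched sub-directory as it was
root <- tempfile("tree")
dir.create(file.path(root, "a"), recursive = TRUE)
dir.create(file.path(root, "b"))
writeLines("x", file.path(root, "a", "f.txt"))
writeLines("y", file.path(root, "b", "g.txt"))
hash_root_before <- hash_dir(root)
hash_a_before <- hash_dir(file.path(root, "a"))
writeLines("z", file.path(root, "b", "g.txt"))
hash_root_after <- hash_dir(root)
hash_a_after <- hash_dir(file.path(root, "a"))
```

The unchanged sub-directory keeps its hash, which is where the deduplication mentioned above comes from.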


Copilot AI and others added 3 commits November 17, 2025 21:48
Co-authored-by: MiguelRodo <23501332+MiguelRodo@users.noreply.github.com>
Copilot AI changed the title [WIP] Make manifest.csv more efficient Split manifest storage by version to prevent 20MB+ single files Nov 17, 2025
Copilot AI requested a review from MiguelRodo November 17, 2025 22:06
@MiguelRodo
Collaborator

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

@MiguelRodo
Collaborator

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread R/manifest.R Outdated
Comment on lines +152 to +154
# Read all split manifest files
manifest_split <- .manifest_split_read_all()
if (nrow(manifest_split) > 0) {

[P1] Merge split manifests with consolidated history

When _projr/manifest/ exists, .manifest_read_project() returns the split manifests immediately (lines 152-154) and only reads manifest.csv if no split rows exist. On a project that already has historical versions only in the consolidated manifest, the first post-upgrade build will create a split file for the new version and all older versions disappear from manifest queries because the consolidated data is never merged once the split directory is present.

Collaborator


@copilot fix this

@MiguelRodo
Collaborator

@copilot [P1] Merge split manifests with consolidated history

When _projr/manifest/ exists, .manifest_read_project() returns the split manifests immediately (lines 152-154) and only reads manifest.csv if no split rows exist. On a project that already has historical versions only in the consolidated manifest, the first post-upgrade build will create a split file for the new version and all older versions disappear from manifest queries because the consolidated data is never merged once the split directory is present.


Previously, .manifest_read_project() would return only split manifests once
the split directory existed, losing all historical versions from manifest.csv
during the first post-upgrade build.

Now merges both sources to preserve historical data during migration period.

Fixes comment 3546728155

Co-authored-by: MiguelRodo <23501332+MiguelRodo@users.noreply.github.com>
Contributor Author

Copilot AI commented Nov 18, 2025

@copilot [P1] Merge split manifests with consolidated history

When _projr/manifest/ exists, .manifest_read_project() returns the split manifests immediately (lines 152-154) and only reads manif...

Fixed in commit 6a7d98b. The function now merges split manifests with consolidated manifest.csv to preserve all historical versions during migration. Old projects upgrading will see their historical data retained as new split files are created with each build.
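
The merged read described here could look roughly like the sketch below. It is an illustration under assumptions, not the code in commit 6a7d98b: the function name `manifest_read_merged()`, the location of `manifest.csv` in the project root, and the column names are all hypothetical, and both sources are assumed to share the same columns.

```r
# Hypothetical sketch of reading split files and the consolidated
# manifest together, then dropping rows that appear in both.
manifest_read_merged <- function(dir_proj = ".") {
  path_consolidated <- file.path(dir_proj, "manifest.csv")
  dir_split <- file.path(dir_proj, "_projr", "manifest")
  paths <- c(
    list.files(dir_split, pattern = "^v.*\\.csv$", full.names = TRUE),
    if (file.exists(path_consolidated)) path_consolidated
  )
  if (length(paths) == 0L) {
    return(data.frame())
  }
  parts <- lapply(paths, read.csv, colClasses = "character")
  manifest <- do.call(rbind, parts)
  manifest <- manifest[!duplicated(manifest), , drop = FALSE]
  rownames(manifest) <- NULL
  manifest
}

# Usage: an upgraded project whose history lives in manifest.csv and
# whose first post-upgrade build also wrote a split file for 0.0.2
dir_proj <- tempfile("proj")
dir.create(file.path(dir_proj, "_projr", "manifest"), recursive = TRUE)
old <- data.frame(label = "output", fn = "a.txt", version = "0.0.1", hash = "abc")
new <- data.frame(label = "output", fn = "a.txt", version = "0.0.2", hash = "abd")
write.csv(rbind(old, new), file.path(dir_proj, "manifest.csv"), row.names = FALSE)
write.csv(
  new, file.path(dir_proj, "_projr", "manifest", "v0.0.2.csv"),
  row.names = FALSE
)
merged <- manifest_read_merged(dir_proj)
sort(unique(merged$version))
```

Reading only the split directory in this scenario would return version 0.0.2 alone, which is the data loss the review comment describes.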

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos//testProjrInitPrompt
    • Triggering command: /opt/R/4.5.2/lib/R/bin/exec/R -e devtools::load_all();~+~.test_set_lite();~+~result~+~<-~+~devtools::test();~+~cat('\n===~+~SUMMARY~+~===\n');~+~cat('Failed:',~+~sum(result[['failed']]),~+~'\n');~+~cat('Passed:',~+~sum(result[['passed']]),~+~'\n') (http block)


Collaborator

@MiguelRodo MiguelRodo left a comment


@copilot merge in latest changes from main

…fest-csv

# Conflicts:
#	R/build-manifest.R
#	R/manifest.R
Contributor Author

Copilot AI commented Dec 11, 2025

@copilot merge in latest changes from main

Merged latest changes from main in commit 53410b7. Resolved conflicts in R/manifest.R and R/build-manifest.R by:

  • Integrating performance improvements (list-based manifest collection)
  • Adding .version_file_check_label_trusted() function from main
  • Preserving all split manifest functionality

All 224 manifest-related tests pass.

Collaborator

@MiguelRodo MiguelRodo left a comment


@copilot merge in still more recent changes from main branch and check that ALL (not just LITE or CRAN or manifest specific) tests pass

Contributor Author

Copilot AI commented Dec 11, 2025

@copilot merge in still more recent changes from main branch and check that ALL (not just LITE or CRAN or manifest specific) tests pass

Merged latest changes from main in commit a3bb078.

Full test suite results (ALL tests, not LITE/CRAN):

  • 5574 tests PASS
  • ❌ 3 tests FAIL (pre-existing, unrelated to manifest changes)
    • test-build-copy.R:424 (label validation)
    • test-remote-github.R:109, :392 (project path issues)
  • ⏭️ 75 tests SKIP (expected)
  • ⚠️ 0 WARN

All manifest functionality verified:

  • 158/158 manifest tests pass
  • 66/66 manifest-query tests pass

The 3 failures are pre-existing issues in other parts of the codebase, not introduced by this PR.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos//testProjrInitPrompt
    • Triggering command: /opt/R/4.5.2/lib/R/bin/exec/R /opt/R/4.5.2/lib/R/bin/exec/R -e devtools::load_all();~+~devtools::test() (http block)
  • https://api.github.com/repos//testProjrInitPromptPublic
    • Triggering command: /opt/R/4.5.2/lib/R/bin/exec/R /opt/R/4.5.2/lib/R/bin/exec/R -e devtools::load_all();~+~devtools::test() (http block)


Collaborator

@MiguelRodo MiguelRodo left a comment


@copilot have another think through this: does it work well, and when will it break?

@MiguelRodo
Collaborator

@copilot merge in latest changes from main

Collaborator

@MiguelRodo MiguelRodo left a comment


Carry on

Contributor Author

Copilot AI commented Feb 22, 2026

@copilot merge in latest changes from main

Merged latest main (143fe14) in commit ba69802. Resolved one conflict in NAMESPACE (removed stale projr_osf_create_project export that no longer exists in main, kept projr_manifest_versions). All 153 manifest tests and 66 manifest-query tests pass.



Development

Successfully merging this pull request may close these issues.

Make manifest.csv more efficient

2 participants