[codex] Add attachment media backup support#66
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a6054f039
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := os.Rename(tmpPath, target); err != nil { | ||
| _ = os.Remove(tmpPath) |
There was a problem hiding this comment.
Remove existing media file before renaming temp download
fetchURL can enter the write path when a cached attachment already exists but is corrupt (or when --force is used), yet it calls os.Rename(tmpPath, target) without deleting an existing regular target first. On Windows, rename-to-existing fails, so forced refetch and corruption repair will consistently fail with a fetch error instead of updating the cache.
Useful? React with 👍 / 👎.
| if err := os.Rename(tmpPath, target); err != nil { | ||
| _ = os.Remove(tmpPath) |
There was a problem hiding this comment.
Replace existing cache target before renaming copied media
copyFile uses the same temp-file rename pattern and also renames onto target without removing an existing file. importMedia calls this helper when a local cached media file exists but hash-mismatches, so on Windows update/subscribe media restoration will fail whenever a cached file needs replacement.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93ec2463e0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| List: store.AttachmentListOptions{ | ||
| GuildIDs: opts.GuildIDs, | ||
| ExcludeGuildIDs: []string{store.DirectMessageGuildID}, | ||
| ChannelIDs: channelIDs, | ||
| Since: opts.Since, |
There was a problem hiding this comment.
Limit media fetch to the latest-only sync window
When sync runs in its default latest-only mode (LatestOnly true with no --since), fetchSyncMedia passes only opts.Since into AttachmentListOptions, which is zero in that mode, so media.Fetch scans all attachments in the targeted guilds instead of the latest delta. Because media.Fetch verifies reusable files by hashing them, enabling --with-media (or [sync].attachment_media=true) can turn routine syncs into full-archive media verification/download passes on every run, causing severe performance regressions and unexpected backlog fetching.
Useful? React with 👍 / 👎.
Summary
This adds first-class Discord attachment media support to Discrawl. Attachments were already tracked as searchable metadata and small text snippets, but the binary media itself was only reachable through Discord CDN URLs. This change lets Discrawl cache attachment bytes locally, keep that cache verified by content hash, and include non-DM media in Git snapshot backups so subscribers can restore the same files offline.
What changed
discrawl attachmentsto list attachment rows with filters for channel, author, message, filename, content type, time range, guild scope, DM scope, and missing cached media.discrawl attachments fetchto download attachment media intocache_dir/media, with missing-only fetches, forced refetch, byte limits, cache verification, long filename handling, and corrupt-cache repair.discrawl sync --with-media, backed by[sync] attachment_mediaand[sync] max_attachment_bytes, so sync can fetch media for the same Discord or wiretap scope it just updated.publishexports cached non-DM media intomedia/, andupdateorsubscriberestore that media into the subscriber cache by default.--no-mediaforpublish,update, andsubscribe;subscribe --no-mediapersists[share] media = false.@meDM media local-only. DM rows and media are excluded from shared snapshots, andattachments --dm/attachments fetch --dmskip Git share auto-update.Design notes
The media logic lives in Discrawl rather than Crawlkit. Crawlkit remains the generic snapshot and mirror layer; the new behavior depends on Discord-specific attachment URLs, proxy URLs, message and guild scoping, the
@meprivacy boundary, and Discrawl's attachment table. If another crawler needs sidecar binary artifact support later, Crawlkit can grow a generic artifact API from this shape, but moving this first implementation there would force Discord semantics into the shared package too early.Media files are content-addressed under
cache_dir/media/attachments/<sha-prefix>/<sha>-<filename>. The database stores the relative path and hash, not the bytes. Git snapshots copy only verified regular files and reject symlink escapes during import. Export also rejects overlapping repo/cache media roots so a misconfigured share path cannot delete or overwrite the local cache.Validation
GOWORK=off go test ./...git diff --checknode scripts/build-docs-site.mjscodex-review --parallel-tests "GOWORK=off go test ./..."The nested review sandbox could not bind
httptestports during its own direct media package run, but the helper's parallel test invocation and the local full suite both passed outside that sandbox limitation.