1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -4,6 +4,7 @@

### Changes

- Added attachment media caching with `discrawl attachments`, `attachments fetch`, `sync --with-media`, and Git snapshot backup/restore for cached non-DM media files.
- Docker: add a local image with `/data` persistence and CI smoke coverage.

### Fixes
25 changes: 22 additions & 3 deletions README.md
@@ -18,6 +18,7 @@ Wiretap DMs stay local and are never exported to the Git-backed snapshot mirror.
- maintains FTS5 search indexes for fast local text search
- builds an offline member directory from archived profile payloads
- extracts small text-like attachments into the local search index
- downloads and backs up cached attachment media when requested
- records structured user and role mentions for direct querying
- tails Gateway events for live updates, with periodic repair syncs
- imports classifiable Discord Desktop cache messages with `wiretap`, including proven DMs under `@me`
@@ -268,6 +269,7 @@ Bot sync modes:
`--all` ignores `default_guild_id` and fans out across every discovered guild the bot can access.
`--skip-members` refreshes guild/channel/message data without crawling the full member list, which is useful for frequent Git snapshot publishers that only need latest messages.
`--latest-only` is still accepted for explicit latest-only runs; it is now the default for untargeted `sync`. Use `--all-channels` to opt out of the fast default without doing a full historical crawl.
`--with-media` downloads missing attachment media into `cache_dir/media` after the message sync/import phase.
When `--channels` includes a forum channel id, `discrawl` expands that forum's threads and syncs their messages as part of the targeted run.
`--since` limits initial history/bootstrap and full-history backfill to messages at or after the given RFC3339 timestamp. It does not mark older history as complete, so a later `sync --full` without `--since` can continue the backfill.
Long runs now emit periodic progress logs to stderr so large backfills and Git snapshot imports do not look hung.
@@ -372,6 +374,19 @@ Notes:
- at least one filter is required
- `--dm` is shorthand for `--guild @me`, so DM searches and message slices do not need raw SQL

### `attachments`

Lists attachment metadata and downloads media into the local cache when requested.

```bash
discrawl attachments --channel general --days 7
discrawl attachments --filename crash --type image --all
discrawl attachments fetch --channel general --days 7
discrawl attachments fetch --missing --max-bytes 104857600
```

Media bytes are stored under `cache_dir/media`, not in SQLite. SQLite stores attachment metadata, content hash, relative media path, fetch status, and fetch error. Cached non-DM media is included in Git snapshots by default; `publish --no-media` omits it.

### `dms`

Lists local wiretap DM conversations or reads one DM thread.
@@ -491,6 +506,7 @@ Publisher:
discrawl publish --remote https://github.com/example/discord-archive.git --push
discrawl publish --readme path/to/discord-backup/README.md --push
discrawl publish --public-only --include-channels 1458141495701012561 --push
discrawl publish --no-media --push
```

Subscriber:
@@ -514,8 +530,8 @@ Once `share.remote` is configured, read commands auto-fetch and import when the

Hybrid mode is supported too: keep normal Discord credentials configured and set `share.remote`. `discrawl sync --update=auto` and `discrawl messages --sync` import the Git snapshot first, usually as a changed-shard delta, then use live Discord for latest-message deltas. Use `sync --all-channels` or `sync --full` when you intentionally want a broader live repair/backfill pass.

-Git snapshots publish non-DM archive tables by default. DMs, desktop wiretap
-rows, and local secrets are never exported.
+Git snapshots publish non-DM archive tables and cached non-DM attachment media by default. DMs, desktop wiretap rows, DM media, and local secrets are never exported. Use `publish --no-media` to omit cached media files.
+Subscribers can use `subscribe --no-media` or `update --no-media` to import only SQLite rows and skip restoring cached files.

Publish filters narrow only the Git snapshot. The publisher can still sync and
keep a richer local SQLite archive, then publish a smaller view for Git-only
@@ -666,6 +682,8 @@ concurrency = 16
repair_every = "6h"
full_history = true
attachment_text = true
attachment_media = false
max_attachment_bytes = 104857600

[desktop]
path = "~/.config/discord" # macOS default: "~/Library/Application Support/discord"
@@ -688,6 +706,7 @@ repo_path = "~/.local/share/discrawl/share" # macOS: "~/Library/Application Supp
branch = "main"
auto_update = true
stale_after = "15m"
media = true
```

The value above is an example. `init` writes an auto-sized default based on the host: `min(32, max(8, GOMAXPROCS*2))`.
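
As an illustration of the auto-sizing formula above, here is a minimal Go sketch of the clamp. The function name `autoConcurrency` is hypothetical and is not taken from the discrawl source; only the formula `min(32, max(8, GOMAXPROCS*2))` comes from the README.

```go
package main

import (
	"fmt"
	"runtime"
)

// autoConcurrency applies the documented clamp: min(32, max(8, procs*2)).
// The name is an assumption made for this example.
func autoConcurrency(procs int) int {
	n := procs * 2
	if n < 8 {
		n = 8 // floor
	}
	if n > 32 {
		n = 32 // cap
	}
	return n
}

func main() {
	// runtime.GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Println(autoConcurrency(runtime.GOMAXPROCS(0)))
}
```

Under this sketch a 2-core host gets the floor of 8, an 8-core host gets 16, and anything above 16 cores is capped at 32.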
@@ -757,7 +776,7 @@ Proven DMs use `@me` as their guild id. Unclassifiable desktop-cache payloads ar

SQLite schema migrations are versioned with `PRAGMA user_version`. Startup now fails fast when a local DB schema is newer than the supported binary.

-Attachment binaries are not stored in SQLite.
+Attachment binaries are not stored in SQLite. `attachments fetch` and `sync --with-media` write media bytes under `cache_dir/media`; SQLite stores the relative media path, content hash, size, and fetch status.

Set `sync.attachment_text = false` if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.

47 changes: 47 additions & 0 deletions docs/commands/attachments.md
@@ -0,0 +1,47 @@
# `attachments`

Lists attachment metadata and optionally downloads attachment media into the local cache.

## Usage

```bash
discrawl attachments --channel general --days 7
discrawl attachments --filename crash --type image --all
discrawl attachments --message 1456744319972282449
discrawl attachments fetch --channel general --days 7
discrawl attachments fetch --missing --max-bytes 104857600
discrawl --json attachments --missing --all
```

## Flags

- `--channel <id|name|#name>` - id, exact name, `#name`, or partial name match
- `--guild <id>` / `--guilds <id,id>` / `--dm` - restrict the guild scope (`--dm` is shorthand for `--guild @me`)
- `--author <name>` - restrict to one author
- `--message <id>` - restrict to one message
- `--filename <text>` - filename substring match
- `--type <text>` - content-type substring match, such as `image` or `application/pdf`
- `--hours <n>` - shorthand for "since now minus N hours"
- `--days <n>` - shorthand for "since now minus N days"
- `--since <RFC3339>` / `--before <RFC3339>` - explicit time window
- `--limit <n>` - safety limit (default 200; `--all` removes it)
- `--all` - removes the safety limit
- `--missing` - only attachments whose cached media file is absent

`attachments fetch` also accepts:

- `--force` - re-download already cached attachments
- `--max-bytes <n>` - per-attachment download cap (defaults to `[sync].max_attachment_bytes`)

## Notes

- media bytes are stored under `cache_dir/media`, not in SQLite
- SQLite stores attachment metadata, content hash, cached media path, fetch status, and errors
- `publish` backs up cached non-DM media files by default; use `publish --no-media` to omit them
- `@me` DM media is local-only and is not published to Git snapshots

## See also

- [`messages`](messages.html)
- [`publish`](publish.html)
- [`sync`](sync.html)
4 changes: 4 additions & 0 deletions docs/commands/publish.md
@@ -9,6 +9,7 @@ discrawl publish --remote https://github.com/example/discord-archive.git --push
discrawl publish --readme path/to/discord-backup/README.md --push
discrawl publish --message "sync: discord archive" --push
discrawl publish --with-embeddings --push
discrawl publish --no-media --push
discrawl publish --public-only --include-channels 1458141495701012561 --push
```

@@ -25,6 +26,7 @@ discrawl publish --public-only --include-channels 1458141495701012561 --push
- `--include-channels <ids>` - comma-separated channel ids to export; forum parents include their allowed public threads
- `--exclude-channels <ids>` - comma-separated channel ids to omit; exclusions win over includes
- `--with-embeddings` - also export stored `message_embeddings` rows
- `--no-media` - omit cached attachment media files from the snapshot

Filters narrow only the published snapshot. The local SQLite archive can still
be synced from a richer bot-visible dataset. Git-only readers see the filtered
@@ -58,6 +60,7 @@ README files without Discrawl report markers are left alone.
## What is published

- non-DM archive tables (DM `@me` rows are always excluded)
- cached non-DM attachment media files under `media/` unless `--no-media` is used
- when filters are enabled: only matching guilds, channels, messages, events,
attachments, mentions, channel-scoped sync-state rows, member rows referenced
by matching messages, and matching embedding rows
@@ -68,6 +71,7 @@ README files without Discrawl report markers are left alone.
## What is not published

- `@me` DM guilds, channels, messages, events, attachments, mentions, wiretap sync state
- `@me` DM media files
- when filters are enabled: share manifest state and guild-level member
freshness markers, because they describe the full archive
- `embedding_jobs`
2 changes: 2 additions & 0 deletions docs/commands/subscribe.md
@@ -15,6 +15,7 @@ discrawl subscribe --stale-after 15m https://github.com/example/discord-archive.
discrawl subscribe --no-auto-update https://github.com/example/discord-archive.git
discrawl subscribe --no-import https://github.com/example/discord-archive.git
discrawl subscribe --with-embeddings https://github.com/example/discord-archive.git
discrawl subscribe --no-media https://github.com/example/discord-archive.git
```

## What it does
@@ -31,6 +32,7 @@ discrawl subscribe --with-embeddings https://github.com/example/discord-archive.
- `--no-auto-update` - disable auto-refresh (use [`update`](update.html) manually)
- `--no-import` - write config only; skip the initial pull/import
- `--with-embeddings` - import vectors that match your local `[search.embeddings]` identity
- `--no-media` - skip restoring cached attachment media files into `cache_dir/media`

## Disabled in this mode

2 changes: 2 additions & 0 deletions docs/commands/sync.md
@@ -28,6 +28,7 @@ discrawl sync --source wiretap # desktop cache only; aliases: desktop, cache
discrawl sync --guild 123456789012345678 --all-channels
discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z
discrawl sync --with-embeddings
discrawl sync --with-media
```

## Sources
@@ -61,6 +62,7 @@ discrawl sync --with-embeddings
- `--concurrency <n>` - override worker count (default auto-sized: floor 8, cap 32)
- `--skip-members` - refresh guild/channel/message data without crawling members
- `--with-embeddings` - also enqueue changed messages into `embedding_jobs`
- `--with-media` - after sync, download missing attachment media into `cache_dir/media`

## Notes

2 changes: 2 additions & 0 deletions docs/commands/update.md
@@ -12,6 +12,7 @@ discrawl update \
--repo ~/.local/share/discrawl/share \
--remote https://github.com/example/discord-archive.git
discrawl update --with-embeddings
discrawl update --no-media
```

## Flags
@@ -20,6 +21,7 @@ discrawl update --with-embeddings
- `--remote <url>` - target Git remote (defaults to `[share].remote`)
- `--branch <name>` - snapshot branch (defaults to `[share].branch`)
- `--with-embeddings` - also import vectors that match your local `[search.embeddings]` identity
- `--no-media` - skip restoring cached attachment media files into `cache_dir/media`

## When to use it

5 changes: 5 additions & 0 deletions docs/configuration.md
@@ -62,6 +62,8 @@ concurrency = 16
repair_every = "6h"
full_history = true
attachment_text = true
attachment_media = false
max_attachment_bytes = 104857600

[desktop]
path = "~/.config/discord" # macOS default: "~/Library/Application Support/discord"
@@ -84,6 +86,7 @@ repo_path = "~/.local/share/discrawl/share" # macOS: "~/Library/Application Supp
branch = "main"
auto_update = true
stale_after = "15m"
media = true

[share.filter]
public_only = false
@@ -118,6 +121,8 @@ Set `discord.token_source = "keyring"` if you want to require keyring lookup and
- `guild_ids` is reserved for explicit multi-guild fan-out; usually you do not set this directly
- changing `[search.embeddings]` provider/model/input version retargets pending jobs and resets prior attempts; existing vectors for another identity remain in SQLite but are not used for semantic search
- changing `db_path` does not migrate existing data; copy the file yourself if you want to keep history
- `sync.attachment_media = true` makes `sync` behave like `sync --with-media`; media bytes are cached under `cache_dir/media`
- `share.media = false` makes `publish` omit cached media and makes `update` and auto-update skip restoring it; `subscribe --no-media` writes this setting for Git-only readers
- `[share.filter]` narrows only `publish` output; sync can still keep a richer local archive
- `share.filter.public_only` exports only channels visible to the guild
`@everyone` role after category/channel permission overwrites; private
4 changes: 3 additions & 1 deletion docs/guides/data-storage.md
@@ -28,7 +28,9 @@ Proven DMs use the synthetic guild id `@me`. Unclassifiable desktop-cache payloa

## Attachments

-Attachment binaries are not stored in SQLite. Only attachment metadata, filenames, and (optionally) extracted text.
+Attachment binaries are not stored in SQLite. SQLite stores attachment metadata, filenames, optional extracted text, and media cache bookkeeping.

`discrawl attachments fetch` and `discrawl sync --with-media` download media into `cache_dir/media` and record the relative media path, SHA-256, byte size, fetch time, and fetch status on the attachment row.

Set `sync.attachment_text = false` if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.

4 changes: 3 additions & 1 deletion docs/security.md
@@ -38,10 +38,12 @@ CI runs secret scanning with `gitleaks` against git history and the working tree
- FTS index rows
- optional local embedding queue metadata and vectors

-Attachment binaries are not stored in SQLite. Only attachment metadata and (optionally) extracted text.
+Attachment binaries are not stored in SQLite. Only attachment metadata, optional extracted text, and media cache bookkeeping are stored there. Cached files live under `cache_dir/media`.

Set `sync.attachment_text = false` if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.

Git snapshots include cached non-DM media files by default. Use `publish --no-media` to omit them. DM media under `@me` stays local-only.

## What is sent over the wire

With remote embedding providers, message text is sent during `discrawl embed`, and search query text is sent when using `--mode semantic` or `--mode hybrid`. Stored message text is not sent during local vector scoring.
81 changes: 76 additions & 5 deletions internal/cli/admin_commands.go
@@ -17,6 +17,7 @@ import (
	"github.com/openclaw/discrawl/internal/config"
	"github.com/openclaw/discrawl/internal/discord"
	"github.com/openclaw/discrawl/internal/discorddesktop"
	"github.com/openclaw/discrawl/internal/media"
	"github.com/openclaw/discrawl/internal/share"
	"github.com/openclaw/discrawl/internal/store"
	"github.com/openclaw/discrawl/internal/syncer"
@@ -32,6 +33,7 @@ type syncRunStats struct {
	Source  string                `json:"source"`
	Discord *syncer.SyncStats     `json:"discord,omitempty"`
	Wiretap *discorddesktop.Stats `json:"wiretap,omitempty"`
	Media   *media.FetchStats     `json:"media,omitempty"`
}

func (r *runtime) runInit(args []string) error {
@@ -110,6 +112,7 @@ func (r *runtime) runSync(args []string) error {
	concurrency := fs.Int("concurrency", r.cfg.Sync.Concurrency, "")
	source := fs.String("source", r.cfg.Sync.Source, "")
	withEmbeddings := fs.Bool("with-embeddings", false, "")
	withMedia := fs.Bool("with-media", r.cfg.AttachmentMediaEnabled(), "")
	skipMembers := fs.Bool("skip-members", false, "")
	latestOnly := fs.Bool("latest-only", false, "")
	guildsFlag := fs.String("guilds", "", "")
@@ -155,11 +158,11 @@ func (r *runtime) runSync(args []string) error {
		LatestOnly: syncLatestOnly(*latestOnly, defaultLatest),
	}
	return r.withSyncLock(func() error {
-		return r.runSyncLocked(sources, opts)
+		return r.runSyncLocked(sources, opts, *withMedia)
	})
}

-func (r *runtime) runSyncLocked(sources syncSources, opts syncer.SyncOptions) error {
+func (r *runtime) runSyncLocked(sources syncSources, opts syncer.SyncOptions, withMedia bool) error {
	var apiStats *syncer.SyncStats
	if sources.discord {
		r.setSyncLockPhase("discord sync")
@@ -190,13 +193,81 @@ func (r *runtime) runSyncLocked(sources syncSources, opts syncer.SyncOptions) er
		}
		wiretapStats = &stats
	}
-	if sources.discord && !sources.wiretap {
	var mediaStats *media.FetchStats
	if withMedia {
		r.setSyncLockPhase("attachment media fetch")
		cacheDir, err := config.ExpandPath(r.cfg.CacheDir)
		if err != nil {
			return configErr(err)
		}
		channelIDs := opts.ChannelIDs
		if len(channelIDs) > 0 {
			channelIDs, err = r.store.ExpandAttachmentChannelIDs(r.ctx, channelIDs)
			if err != nil {
				return err
			}
		}
		stats, err := r.fetchSyncMedia(sources, opts, cacheDir, channelIDs)
		if err != nil {
			return err
		}
		mediaStats = stats
	}
+	if sources.discord && !sources.wiretap && mediaStats == nil {
		return r.print(*apiStats)
	}
-	if sources.wiretap && !sources.discord {
+	if sources.wiretap && !sources.discord && mediaStats == nil {
		return r.print(*wiretapStats)
	}
-	return r.print(syncRunStats{Source: sources.name, Discord: apiStats, Wiretap: wiretapStats})
+	return r.print(syncRunStats{Source: sources.name, Discord: apiStats, Wiretap: wiretapStats, Media: mediaStats})
}

func (r *runtime) fetchSyncMedia(sources syncSources, opts syncer.SyncOptions, cacheDir string, channelIDs []string) (*media.FetchStats, error) {
	total := media.FetchStats{}
	if sources.discord {
		stats, err := media.Fetch(r.ctx, r.store, media.FetchOptions{
			CacheDir: cacheDir,
			MaxBytes: r.cfg.Sync.MaxAttachmentBytes,
			List: store.AttachmentListOptions{
				GuildIDs:        opts.GuildIDs,
				ExcludeGuildIDs: []string{store.DirectMessageGuildID},
				ChannelIDs:      channelIDs,
				Since:           opts.Since,
Comment on lines +231 to +235

P1: Limit media fetch to the latest-only sync window

When `sync` runs in its default latest-only mode (`LatestOnly` true with no `--since`), `fetchSyncMedia` passes only `opts.Since` into `AttachmentListOptions`, which is zero in that mode, so `media.Fetch` scans all attachments in the targeted guilds instead of the latest delta. Because `media.Fetch` verifies reusable files by hashing them, enabling `--with-media` (or `[sync].attachment_media=true`) can turn routine syncs into full-archive media verification/download passes on every run, causing severe performance regressions and unexpected backlog fetching.

			},
			StatusUpdate: true,
			Now:          r.now,
		})
		if err != nil {
			return nil, err
		}
		total = addFetchStats(total, stats)
	}
	if sources.wiretap {
		stats, err := media.Fetch(r.ctx, r.store, media.FetchOptions{
			CacheDir: cacheDir,
			MaxBytes: r.cfg.Sync.MaxAttachmentBytes,
			List: store.AttachmentListOptions{
				GuildIDs: []string{store.DirectMessageGuildID},
			},
			StatusUpdate: true,
			Now:          r.now,
		})
		if err != nil {
			return nil, err
		}
		total = addFetchStats(total, stats)
	}
	return &total, nil
}

func addFetchStats(a, b media.FetchStats) media.FetchStats {
	a.Attachments += b.Attachments
	a.Fetched += b.Fetched
	a.Reused += b.Reused
	a.Skipped += b.Skipped
	a.Failed += b.Failed
	a.Bytes += b.Bytes
	return a
}

func defaultLatestSyncMode(full bool, allChannels bool, since string, channels string) bool {