Skip to content

Add file deduplication support#644

Open
volker-fr wants to merge 1 commit intorusq:masterfrom
volker-fr:dedup-downloads
Open

Add file deduplication support#644
volker-fr wants to merge 1 commit intorusq:masterfrom
volker-fr:dedup-downloads

Conversation

@volker-fr
Copy link
Contributor

@volker-fr volker-fr commented Mar 17, 2026

On slackdump resume files get downloaded over and over again even if they exist already on disk.

This PR adds the filesize to sqlite to compare it in the future with new downloads. The Slack API returns the file size itself and not a checksum. The ID should also be unique based on upload, therefore the filesize is more optional but could be used in the future to compare files on disk with the DB.

For real world testing I used the -v flag on resume and it worked fine

...
2026-03-16 22:32:17 DEBUG skipping duplicate file
                      ├ file_id: XXXXXXXXX
                      └ size: 123456

@rusq rusq requested a review from Copilot March 19, 2026 10:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds file-download deduplication for slackdump resume by persisting Slack file sizes in the SQLite archive and using (file_id, size) to detect already-recorded files before downloading again.

Changes:

  • Add SIZE column (and index) to the FILE table via a new goose migration.
  • Extend the DB file model/repository with Size and a GetByIDAndSize lookup method.
  • Add a DeduplicatingFileProcessor wrapper and wire it into the resume controller path.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql Adds FILE.SIZE column and an (ID, SIZE) index to support dedup lookups.
internal/chunk/backend/dbase/repository/dbfile.go Persists slack.File.Size, adds repository method for dedup lookup.
internal/chunk/backend/dbase/repository/mock_repository/mock_file.go Regenerates/extends mock to include GetByIDAndSize.
internal/convert/transform/fileproc/dedup.go Implements DB-backed deduplicating filer wrapper.
internal/convert/transform/fileproc/dedup_test.go Adds placeholder tests; currently skips the main behavior test.
cmd/slackdump/internal/archive/archive.go Wraps the filer with dedup logic for resume.
internal/fixtures/assets/source_database.db Updates/adds a DB fixture reflecting the new schema (binary).

// GetByIDAndSize returns a file by its ID and size.
// If a file with the same ID and size exists, we assume it hasn't changed.
func (r fileRepository) GetByIDAndSize(ctx context.Context, conn sqlx.QueryerContext, fileID string, size int) (*DBFile, error) {
const stmt = `SELECT ID, CHUNK_ID, CHANNEL_ID, MESSAGE_ID, THREAD_ID, IDX, MODE, FILENAME, URL, DATA, SIZE
Comment on lines +54 to +79

for i := range ff {
f := &ff[i]
if !IsValid(f) {
continue
}

// Check if file already exists with same ID and size
existing, err := fr.GetByIDAndSize(ctx, d.db, f.ID, f.Size)
if err != nil {
d.lg.WarnContext(ctx, "error checking file existence", "error", err, "file_id", f.ID)
// Continue with download on error
}

if existing != nil {
d.lg.DebugContext(ctx, "skipping duplicate file", "file_id", f.ID, "size", f.Size)
continue
}

// File doesn't exist or size differs - download it
if err := d.inner.Files(ctx, channel, parent, []slack.File{*f}); err != nil {
return err
}
}

return nil
Comment on lines +23 to +27
func TestDeduplicatingFileProcessor_Files(t *testing.T) {
// This test would need a real database connection
// For now, just verify the logic compiles
t.Skip("Requires database connection - manual test only")
}
Comment on lines +216 to +220
// Wrap file processor with deduplication for resume operations
var filer processor.Filer = fileproc.New(dl)
if cmdName == "resume" {
filer = fileproc.NewDeduplicatingFileProcessor(filer, conn, lg)
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nitpick if I ever saw one

-- +goose Up
-- +goose StatementBegin
-- Add SIZE column to FILE table for deduplication
ALTER TABLE FILE ADD COLUMN SIZE INTEGER;
Comment on lines 30 to 42
type DBFile struct {
ID string `db:"ID"`
ChunkID int64 `db:"CHUNK_ID"`
ChannelID string `db:"CHANNEL_ID"`
MessageID *int64 `db:"MESSAGE_ID"`
ThreadID *int64 `db:"THREAD_ID,omitempty"`
Index int `db:"IDX"`
Mode string `db:"MODE"`
Filename *string `db:"FILENAME"`
URL *string `db:"URL"`
Data []byte `db:"DATA"`
Size int `db:"SIZE"` // File size in bytes from Slack API
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair call, I'd go with *int64 for simplicity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants