Add file deduplication support by volker-fr · Pull Request #644 · rusq/slackdump

volker-fr · 2026-03-17T03:41:46Z

When running slackdump resume, files are downloaded again and again even if they already exist on disk.

This PR stores the file size in the SQLite database so that it can be compared against new downloads. The Slack API returns only the file size, not a checksum. The file ID should already be unique per upload, so the size check is largely optional, but it could be used later to compare files on disk with the DB.

For real-world testing I ran resume with the -v flag and it worked fine:

...
2026-03-16 22:32:17 DEBUG skipping duplicate file
                      ├ file_id: XXXXXXXXX
                      └ size: 123456

Copilot

Pull request overview

This PR adds file-download deduplication for slackdump resume by persisting Slack file sizes in the SQLite archive and using (file_id, size) to detect already-recorded files before downloading again.

Changes:

- Add SIZE column (and index) to the FILE table via a new goose migration.
- Extend the DB file model/repository with Size and a GetByIDAndSize lookup method.
- Add a DeduplicatingFileProcessor wrapper and wire it into the resume controller path.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql | Adds `FILE.SIZE` column and an `(ID, SIZE)` index to support dedup lookups. |
| internal/chunk/backend/dbase/repository/dbfile.go | Persists `slack.File.Size`, adds repository method for dedup lookup. |
| internal/chunk/backend/dbase/repository/mock_repository/mock_file.go | Regenerates/extends mock to include `GetByIDAndSize`. |
| internal/convert/transform/fileproc/dedup.go | Implements DB-backed deduplicating filer wrapper. |
| internal/convert/transform/fileproc/dedup_test.go | Adds placeholder tests; currently skips the main behavior test. |
| cmd/slackdump/internal/archive/archive.go | Wraps the filer with dedup logic for `resume`. |
| internal/fixtures/assets/source_database.db | Updates/adds a DB fixture reflecting the new schema (binary). |

internal/chunk/backend/dbase/repository/dbfile.go

+// GetByIDAndSize returns a file by its ID and size.
+// If a file with the same ID and size exists, we assume it hasn't changed.
+func (r fileRepository) GetByIDAndSize(ctx context.Context, conn sqlx.QueryerContext, fileID string, size int) (*DBFile, error) {
+	const stmt = `SELECT ID, CHUNK_ID, CHANNEL_ID, MESSAGE_ID, THREAD_ID, IDX, MODE, FILENAME, URL, DATA, SIZE 


internal/convert/transform/fileproc/dedup.go

+
+	for i := range ff {
+		f := &ff[i]
+		if !IsValid(f) {
+			continue
+		}
+
+		// Check if file already exists with same ID and size
+		existing, err := fr.GetByIDAndSize(ctx, d.db, f.ID, f.Size)
+		if err != nil {
+			d.lg.WarnContext(ctx, "error checking file existence", "error", err, "file_id", f.ID)
+			// Continue with download on error
+		}
+
+		if existing != nil {
+			d.lg.DebugContext(ctx, "skipping duplicate file", "file_id", f.ID, "size", f.Size)
+			continue
+		}
+
+		// File doesn't exist or size differs - download it
+		if err := d.inner.Files(ctx, channel, parent, []slack.File{*f}); err != nil {
+			return err
+		}
+	}
+
+	return nil


internal/convert/transform/fileproc/dedup_test.go

+func TestDeduplicatingFileProcessor_Files(t *testing.T) {
+	// This test would need a real database connection
+	// For now, just verify the logic compiles
+	t.Skip("Requires database connection - manual test only")
+}


rusq · 2026-03-19T10:15:19Z

cmd/slackdump/internal/archive/archive.go

+	// Wrap file processor with deduplication for resume operations
+	var filer processor.Filer = fileproc.New(dl)
+	if cmdName == "resume" {
+		filer = fileproc.NewDeduplicatingFileProcessor(filer, conn, lg)
+	}


That's a nitpick if I ever saw one

internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql

+-- +goose Up
+-- +goose StatementBegin
+-- Add SIZE column to FILE table for deduplication
+ALTER TABLE FILE ADD COLUMN SIZE INTEGER;


rusq · 2026-03-19T10:17:45Z

internal/chunk/backend/dbase/repository/dbfile.go

 type DBFile struct {
 	ID        string  `db:"ID"`
 	ChunkID   int64   `db:"CHUNK_ID"`
 	ChannelID string  `db:"CHANNEL_ID"`
 	MessageID *int64  `db:"MESSAGE_ID"`
 	ThreadID  *int64  `db:"THREAD_ID,omitempty"`
 	Index     int     `db:"IDX"`
 	Mode      string  `db:"MODE"`
 	Filename  *string `db:"FILENAME"`
 	URL       *string `db:"URL"`
 	Data      []byte  `db:"DATA"`
+	Size      int     `db:"SIZE"` // File size in bytes from Slack API
 }


Fair call, I'd go with *int64 for simplicity
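
A rough sketch of what the pointer type buys, under the assumption that rows written before the migration carry NULL in SIZE (which is what adding a column without a default gives existing rows in SQLite). A nil pointer can then stand for "size unknown", whereas scanning NULL into a plain int through database/sql fails. `fileRow`, `sizePtr` and `sameSize` are hypothetical names used only for illustration, not identifiers from the PR.

```go
package main

import "fmt"

// fileRow is a hypothetical, trimmed-down stand-in for DBFile.
type fileRow struct {
	ID   string
	Size *int64 // nil when the row predates the SIZE column
}

// sizePtr converts an int size (as slack.File reports it) into *int64.
func sizePtr(n int) *int64 {
	v := int64(n)
	return &v
}

// sameSize reports whether the stored row has a known size equal to n.
// A nil Size never matches, so such files would be downloaded again.
func sameSize(row fileRow, n int) bool {
	return row.Size != nil && *row.Size == int64(n)
}

func main() {
	old := fileRow{ID: "F1"}                        // pre-migration row, SIZE is NULL
	cur := fileRow{ID: "F2", Size: sizePtr(123456)} // row written after the migration
	fmt.Println(sameSize(old, 123456), sameSize(cur, 123456)) // prints false true
}
```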

Add file deduplication support

4f3b5e6

rusq requested a review from Copilot March 19, 2026 10:01

Copilot AI reviewed Mar 19, 2026

volker-fr wants to merge 1 commit into rusq:master from volker-fr:dedup-downloads

3 participants
