Contributor
Pull request overview
This PR adds file-download deduplication for `slackdump resume` by persisting Slack file sizes in the SQLite archive and using the `(file_id, size)` pair to detect already-recorded files before downloading them again.
Changes:
- Add a `SIZE` column (and index) to the `FILE` table via a new goose migration.
- Extend the DB file model/repository with `Size` and a `GetByIDAndSize` lookup method.
- Add a `DeduplicatingFileProcessor` wrapper and wire it into the resume controller path.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql | Adds FILE.SIZE column and an (ID, SIZE) index to support dedup lookups. |
| internal/chunk/backend/dbase/repository/dbfile.go | Persists slack.File.Size, adds repository method for dedup lookup. |
| internal/chunk/backend/dbase/repository/mock_repository/mock_file.go | Regenerates/extends mock to include GetByIDAndSize. |
| internal/convert/transform/fileproc/dedup.go | Implements DB-backed deduplicating filer wrapper. |
| internal/convert/transform/fileproc/dedup_test.go | Adds placeholder tests; currently skips the main behavior test. |
| cmd/slackdump/internal/archive/archive.go | Wraps the filer with dedup logic for resume. |
| internal/fixtures/assets/source_database.db | Updates/adds a DB fixture reflecting the new schema (binary). |
```go
// GetByIDAndSize returns a file by its ID and size.
// If a file with the same ID and size exists, we assume it hasn't changed.
func (r fileRepository) GetByIDAndSize(ctx context.Context, conn sqlx.QueryerContext, fileID string, size int) (*DBFile, error) {
	const stmt = `SELECT ID, CHUNK_ID, CHANNEL_ID, MESSAGE_ID, THREAD_ID, IDX, MODE, FILENAME, URL, DATA, SIZE
```
Comment on lines +54 to +79
```go
	for i := range ff {
		f := &ff[i]
		if !IsValid(f) {
			continue
		}

		// Check if file already exists with same ID and size
		existing, err := fr.GetByIDAndSize(ctx, d.db, f.ID, f.Size)
		if err != nil {
			d.lg.WarnContext(ctx, "error checking file existence", "error", err, "file_id", f.ID)
			// Continue with download on error
		}

		if existing != nil {
			d.lg.DebugContext(ctx, "skipping duplicate file", "file_id", f.ID, "size", f.Size)
			continue
		}

		// File doesn't exist or size differs - download it
		if err := d.inner.Files(ctx, channel, parent, []slack.File{*f}); err != nil {
			return err
		}
	}

	return nil
```
Comment on lines +23 to +27
```go
func TestDeduplicatingFileProcessor_Files(t *testing.T) {
	// This test would need a real database connection
	// For now, just verify the logic compiles
	t.Skip("Requires database connection - manual test only")
}
```
Comment on lines +216 to +220
```go
// Wrap file processor with deduplication for resume operations
var filer processor.Filer = fileproc.New(dl)
if cmdName == "resume" {
	filer = fileproc.NewDeduplicatingFileProcessor(filer, conn, lg)
}
```
```sql
-- +goose Up
-- +goose StatementBegin
-- Add SIZE column to FILE table for deduplication
ALTER TABLE FILE ADD COLUMN SIZE INTEGER;
```
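The snippet ends at the `ALTER TABLE`, but the file summary above also mentions an `(ID, SIZE)` index. A hedged sketch of how the rest of the migration might look under goose conventions (the index name and the `Down` section are assumptions, not taken from the PR; note that `DROP COLUMN` requires SQLite 3.35+):

```sql
CREATE INDEX IF NOT EXISTS IDX_FILE_ID_SIZE ON FILE (ID, SIZE);
-- +goose StatementEnd

-- +goose Down
-- +goose StatementBegin
DROP INDEX IF EXISTS IDX_FILE_ID_SIZE;
ALTER TABLE FILE DROP COLUMN SIZE;
-- +goose StatementEnd
```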
Comment on lines 30 to 42
```go
type DBFile struct {
	ID        string  `db:"ID"`
	ChunkID   int64   `db:"CHUNK_ID"`
	ChannelID string  `db:"CHANNEL_ID"`
	MessageID *int64  `db:"MESSAGE_ID"`
	ThreadID  *int64  `db:"THREAD_ID,omitempty"`
	Index     int     `db:"IDX"`
	Mode      string  `db:"MODE"`
	Filename  *string `db:"FILENAME"`
	URL       *string `db:"URL"`
	Data      []byte  `db:"DATA"`
	Size      int     `db:"SIZE"` // File size in bytes from Slack API
}
```
Owner
Fair call, I'd go with `*int64` for simplicity.
On `slackdump resume`, files get downloaded over and over again even if they already exist on disk. This PR adds the file size to SQLite so it can be compared with future downloads. The Slack API returns the file size itself, not a checksum. The ID should also be unique per upload, so the file size is somewhat optional, but it could be used in the future to compare files on disk with the DB.

For real-world testing I used the `-v` flag on resume and it worked fine.