Prevent duplicate commit insertion on facade re-processing #3790

@MoralCode


The commits table has a primary key that is derived only from a sequence (i.e. an auto-incrementing integer).

When facade's analyze_commits_in_parallel runs (to insert all new commit change information into the commits table), it does not use Augur's existing upsert logic (bulk_insert_dicts or similar). Instead, it always performs a plain insert, so a new ID is generated and a new row is added every time, even when the same data is already present.

If the commit analysis is a rerun (i.e. the repo was previously fully collected, but the admin reset the last collection date to force recollection, so many of the commits are already in the table), it simply generates duplicate rows, adding to the growth of one of the largest tables in Augur.
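The duplication described above can be reproduced in miniature. The following is only a sketch using Python's built-in sqlite3 module, not Augur's actual code or schema: the helper name analyze_commits, the cmt_id column name, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def analyze_commits(conn, rows):
    """Hypothetical stand-in for the facade insert path: always a plain INSERT."""
    conn.executemany(
        "INSERT INTO commits (repo_id, cmt_commit_hash, cmt_filename) "
        "VALUES (?, ?, ?)",
        rows,
    )


conn = sqlite3.connect(":memory:")
# Simplified table: the only key is a sequence-generated integer
# (cmt_id is an assumed column name).
conn.execute(
    "CREATE TABLE commits ("
    " cmt_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " repo_id INTEGER, cmt_commit_hash TEXT, cmt_filename TEXT)"
)

rows = [(1, "abc123", "README.md"), (1, "abc123", "setup.py")]
analyze_commits(conn, rows)  # initial collection
analyze_commits(conn, rows)  # forced recollection of the same commits

row_count = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
print(row_count)  # 4: every change row is now stored twice
```

Because nothing in the table ties a row to the data it holds, the database has no way to reject or merge the second batch.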

To use upserts for the commits table, we need a compound key based on the actual data. Given that this table is more accurately described as commit_changes (#3682), I propose this constraint: UniqueConstraint("repo_id", "cmt_commit_hash", "cmt_filename", name="commit-changes-unique").
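A minimal sketch of how the proposed constraint would make re-processing idempotent, again using sqlite3 rather than Augur's PostgreSQL schema or its bulk_insert_dicts helper. The helper name upsert_commit_changes and the cmt_id, cmt_added and cmt_removed columns are assumptions for illustration, the surrogate key is kept alongside the unique constraint, and the ON CONFLICT clause assumes SQLite 3.24 or newer.

```python
import sqlite3

# Upsert keyed on the proposed unique columns; on a conflict the existing
# row is updated in place instead of a new row being added.
UPSERT = """
INSERT INTO commits
    (repo_id, cmt_commit_hash, cmt_filename, cmt_added, cmt_removed)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (repo_id, cmt_commit_hash, cmt_filename)
DO UPDATE SET cmt_added = excluded.cmt_added,
              cmt_removed = excluded.cmt_removed
"""


def upsert_commit_changes(conn, rows):
    """Hypothetical replacement for the always-insert path."""
    conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE commits (
    cmt_id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo_id INTEGER NOT NULL,
    cmt_commit_hash TEXT NOT NULL,
    cmt_filename TEXT NOT NULL,
    cmt_added INTEGER,
    cmt_removed INTEGER,
    CONSTRAINT "commit-changes-unique"
        UNIQUE (repo_id, cmt_commit_hash, cmt_filename)
)""")

rows = [(1, "abc123", "README.md", 10, 2), (1, "abc123", "setup.py", 3, 0)]
upsert_commit_changes(conn, rows)  # initial collection
upsert_commit_changes(conn, rows)  # forced recollection

upserted_count = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
print(upserted_count)  # 2: the rerun touched existing rows only
```

Whether a conflict should update the existing row or be skipped entirely is a design choice the issue leaves open; the sketch updates the row so that a recollection can still correct stale values.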

Labels: theoretical (Problems that might be possible in theory but still require confirmation of actual impact)