
Tracking symlink changes #695

@jgarvin


Summary 💡

I have working code that uses git_repository to walk backwards from a branch tip and find every commit that affects files whose names match a user-supplied regex. This works great, and combined with rayon it is about 1,728x faster than my original git2 code (even accounting for running on 128 cores, that is a >10x per-core speedup :D).

However, I would like to add tracking through symlinks. Git models a symlink as a blob that contains the target path, with a mode bit indicating that it is intended to be a symlink. This means that if commit 1 creates symlink A pointing to file B, and commit 2 then modifies B, the diff for commit 2 shows only a modification of B, not a modification of A. But I want to count it as a modification of A when determining whether the user's regex matches the commit.
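To make the desired behaviour concrete, here is a minimal sketch using only the standard library. It assumes a table from symlink path to already-resolved target path is available for the commit in question. The function name `effective_changes` and the plain string paths are hypothetical, and symlinks to directories and relative-target resolution are left out.

```rust
use std::collections::{BTreeSet, HashMap};

/// Expand the paths changed by a commit with every symlink that points,
/// directly or through other symlinks, at one of those paths.
/// `links` maps symlink path -> resolved target path (hypothetical layout).
fn effective_changes(changed: &[&str], links: &HashMap<String, String>) -> BTreeSet<String> {
    // Invert the table: target path -> symlinks pointing at it.
    let mut by_target: HashMap<&str, Vec<&str>> = HashMap::new();
    for (link, target) in links {
        by_target.entry(target.as_str()).or_default().push(link.as_str());
    }

    let mut out = BTreeSet::new();
    let mut stack: Vec<&str> = changed.to_vec();
    while let Some(path) = stack.pop() {
        // `insert` returns false for paths already handled, which also
        // protects against symlink cycles.
        if out.insert(path.to_string()) {
            if let Some(pointing_here) = by_target.get(path) {
                stack.extend(pointing_here.iter().copied());
            }
        }
    }
    out
}
```

With a table containing A -> B, a commit that changes only B yields the set {A, B}, so a regex that matches A would also match that commit.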

Further complicating things, branches point to the tip, and the natural way to iterate is backwards through history, but processing symlink state incrementally from scratch would mean starting at the beginning of the repo's history. So my plan instead is:

  • Examine all files in the checkout of the branch tip, and build up a hash table of symlink hash to target.
  • Iterate backwards up the commit chain, altering the hash table in response to diff contents as I go.
  • Use a persistent hash table implementation so that "forking" the hash table to deal with commits with multiple parents is a cheap operation, and forks can be iterated in parallel.

The vast majority of commits do not modify symlinks, so the number of distinct hash table states should be small enough to fit in memory. At the end, every commit would have one of these tables associated with it, so that when I compute diffs I can look up whether any of the changed files have symlinks pointing at them that match the regex.
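The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `Arc<HashMap>` with copy-on-write from the standard library as a stand-in for a truly persistent map (a crate such as `im` would share structure instead of cloning), it keys the table by symlink path, and the `LinkChange` type and `parent_table` function are hypothetical names for information that would really come from the tree diff.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Symlink path -> target path. Commits that do not touch symlinks share
/// one allocation through the `Arc`.
type LinkTable = Arc<HashMap<String, String>>;

/// How one symlink differs between a parent and its child (hypothetical).
enum LinkChange {
    /// The child added the symlink, so the parent does not have it.
    Added { path: String },
    /// The child removed the symlink; the parent had it with `old_target`.
    Removed { path: String, old_target: String },
    /// The child retargeted the symlink; the parent pointed at `old_target`.
    Retargeted { path: String, old_target: String },
}

/// Derive the parent's table from the child's by undoing the child's changes.
fn parent_table(child: &LinkTable, changes: &[LinkChange]) -> LinkTable {
    let mut table = Arc::clone(child);
    if changes.is_empty() {
        // Common case: no symlink changes, so parent and child share the table.
        return table;
    }
    // `child` still holds a reference, so `make_mut` clones the map here
    // and leaves the child's table untouched.
    let map = Arc::make_mut(&mut table);
    for change in changes {
        match change {
            LinkChange::Added { path } => {
                map.remove(path);
            }
            LinkChange::Removed { path, old_target }
            | LinkChange::Retargeted { path, old_target } => {
                map.insert(path.clone(), old_target.clone());
            }
        }
    }
    table
}
```

For a merge commit, `parent_table` would be called once per parent with that parent's diff, which gives each fork its own table while unchanged histories keep sharing one. The full clone on every symlink-changing commit is the cost of this simplification.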

Questions:

  1. Any thoughts on whether this is a good general strategy? Am I missing something built in that could already help me do this?
  2. One challenge is using rev_walk for this -- when a commit has multiple parents I need to visit each (parent, child) pair in order to compute all the diffs, but Walk only has the one "current" commit. I'm not sure whether there is a way to get this effect without reimplementing Walk.
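For question 2, the effect being asked for can be illustrated with a hand-rolled traversal over a toy graph. This is a sketch only and does not use or describe the git_repository API: the `Graph` alias, the integer commit ids, and `parent_child_edges` are hypothetical, and in real code the parent ids would be read from each commit object.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy commit graph: commit id -> ids of its parents.
type Graph = HashMap<u32, Vec<u32>>;

/// Breadth-first walk from `tip` that yields every (parent, child) edge
/// exactly once, including all edges of merge commits.
fn parent_child_edges(graph: &Graph, tip: u32) -> Vec<(u32, u32)> {
    let mut seen = HashSet::from([tip]);
    let mut queue = VecDeque::from([tip]);
    let mut edges = Vec::new();
    while let Some(child) = queue.pop_front() {
        // Commits missing from the map are treated as having no parents.
        for &parent in graph.get(&child).into_iter().flatten() {
            edges.push((parent, child));
            // Enqueue each commit once, even if several children reach it.
            if seen.insert(parent) {
                queue.push_back(parent);
            }
        }
    }
    edges
}
```

Each edge is one diff to compute. Unlike a typical revision walk, this ordering is plain breadth-first and not sorted by commit time.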

Motivation 🔦

The code base I'm running this on switched, in the middle of its history, from files that are edited in place to a crazy rat's nest of symlinks :(


Labels: enhancement (New feature or request)
