Skip to content

Generate unified diff only when producing output#307

Merged
disinvite merged 22 commits intoisledecomp:masterfrom
disinvite:retain-diff
Mar 29, 2026
Merged

Generate unified diff only when producing output#307
disinvite merged 22 commits intoisledecomp:masterfrom
disinvite:retain-diff

Conversation

@disinvite
Copy link
Copy Markdown
Collaborator

@disinvite disinvite commented Jan 24, 2026

The process of generating a diff for a function or vtable is:

  1. Get the ReccmpMatch object from the database. Assuming that exists:
  2. Get the raw list of "items" for the entity. For a function this is each address and instruction. For a vtable this is the offset and function of each entry in the table.
  3. Use difflib.SequenceMatcher to compare the raw text (i.e. not the addresses or offsets) and produce diff opcodes. These are instructions from the set (equal, insert, delete, replace) that describe how to turn sequence A (orig) into sequence B (recomp).
  4. Generate a unified diff with a certain number of context lines. This breaks the diff into groups and eliminates long sequences of matched items. For functions, a group will begin or end with up to 10 lines of matched lines for context. For vtables, we retain all matched lines but still use the udiff structure, so it is one big group.
  5. Set the result in a DiffReport object.
  6. Output the diff. If we are creating a JSON or HTML report, convert the DiffReport to a ReccmpComparedEntity object first, then follow the serialization logic.

Some data is lost between steps 3 and 4. We don't need to generate the unified diff until step 6 when it is time to output to a file or to the screen.

I changed stackcmp to use the entire diff report instead of the grouped version. Is that useful at all? Or not needed?

My end goal is to create report format version 2 that will store this data more efficiently. (#98) JSON deserialization becomes a performance problem if you are aggregating a lot of entropy runs. For the 1024 sample set, IIRC the bulk of the time is spent in JSON parsing.

We also do not store the entity type in the JSON report. If we had this, it would unblock #93.

This is a draft for now because I'm still trying to figure out what kind of automated tests would help verify this change. Ideally we would commit the tests as a separate PR first. The manual testing is straightforward but tedious. (Diffing JSON files from many decomp projects.)

I would appreciate any and all preliminary feedback!

@jonschz
Copy link
Copy Markdown
Collaborator

jonschz commented Jan 25, 2026

I changed stackcmp to use the entire diff report instead of the grouped version. Is that useful at all? Or not needed?

Yes, that is actually useful. Let's say we have variable a at 0x10 in orig and recomp, and variable b at 0x14 in orig and recomp. Now assume we use the wrong variable in one location, leading to a diff like

-mov ebp+0x10, 1
+mov ebp+0x14, 1

If there are no matched, correct usages of ebp+0x10 or ebp+0x14 in the 10 lines of context around the error, stackcmp currently doesn't know about that and mistakenly assumes that the line is correct, but the stack is reordered.

@jonschz
Copy link
Copy Markdown
Collaborator

jonschz commented Jan 25, 2026

My end goal is to create report format version 2 that will store this data more efficiently.

There is definitely a point to that. Do you want to discuss some options before starting to implement? I am wondering if there is some existing format so we don't have to go fully custom. The first thing to come to my mind is something like diff/patch files, which could be stored as strings in JSON. Most markdown renderers have support for that already. Maybe there's something out of the box we could use for that?

@disinvite
Copy link
Copy Markdown
Collaborator Author

There is definitely a point to that. Do you want to discuss some options before starting to implement?

Sure, I haven't started anything yet. Rough ideas so far:

  • Store the difflib opcodes for each function in the file. We can just reimplement get_grouped_opcodes in JS for the HTML view.
  • For functions with a diff, retain all matching lines so the user can specify how much context they want. They might want to dump the entire function, for example.
  • This will result in a lot of extra data, so we could break out the instruction text into its own global list and store only the indices for each function's diff. For example:
{
  "instructions": ["push ebp", "mov ebp, esp", "push ebx", "..."]
  "diffs": { "0x1234": { "orig": [1, 2, 3, "..."], "recomp": [1, 2, 3, "..."] } }
  // missing addrs and opcodes, but you get the idea
}

We could get fancy and do huffman coding so the most common instructions get the smallest indices. Reducing file size isn't the main concern here, though. It's about reducing the complexity of the JSON so we can parse it more quickly.

The argument against doing this is that it makes the JSON not very human readable or diffable in a useful way. You could counter by saying that we chose JSON for convenience (and JS compatibility) and it is not intended to be human readable outside of our tools.

@disinvite disinvite marked this pull request as ready for review March 7, 2026 19:13
@disinvite disinvite requested review from jonschz and madebr March 18, 2026 02:45
@disinvite
Copy link
Copy Markdown
Collaborator Author

I just did some manual testing with this and (as intended) the output is the same for vtables and function entity varieties: stubs, diffs, matches, effective matches.

Copy link
Copy Markdown
Collaborator

@jonschz jonschz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very solid overall! I have faith in your manual testing, I didn't do any.

Comment thread reccmp/compare/report.py
Comment thread reccmp/compare/report.py
Comment thread reccmp/tools/stackcmp.py
Comment thread reccmp/compare/report.py
Comment thread reccmp/compare/report.py
Comment thread reccmp/compare/core.py Outdated
@disinvite disinvite merged commit 5483544 into isledecomp:master Mar 29, 2026
18 checks passed
@disinvite disinvite deleted the retain-diff branch March 29, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants