Generate unified diff only when producing output by disinvite · Pull Request #307 · isledecomp/reccmp

disinvite · 2026-01-24T22:34:21Z

The process of generating a diff for a function or vtable is:

1. Get the ReccmpMatch object from the database. Assuming that exists:
2. Get the raw list of "items" for the entity. For a function this is each address and instruction. For a vtable this is the offset and function of each entry in the table.
3. Use difflib.SequenceMatcher to compare the raw text (i.e. not the addresses or offsets) and produce diff opcodes. These are instructions from the set (equal, insert, delete, replace) that describe how to turn sequence A (orig) into sequence B (recomp).
4. Generate a unified diff with a certain number of context lines. This breaks the diff into groups and eliminates long sequences of matched items. For functions, a group will begin or end with up to 10 matched lines for context. For vtables, we retain all matched lines but still use the udiff structure, so it is one big group.
5. Set the result in a DiffReport object.
6. Output the diff. If we are creating a JSON or HTML report, convert the DiffReport to a ReccmpComparedEntity object first, then follow the serialization logic.

Some data is lost between steps 3 and 4. We don't need to generate the unified diff until step 6 when it is time to output to a file or to the screen.

I changed stackcmp to use the entire diff report instead of the grouped version. Is that useful at all? Or not needed?

My end goal is to create report format version 2 that will store this data more efficiently. (#98) JSON deserialization becomes a performance problem if you are aggregating a lot of entropy runs. For the 1024 sample set, IIRC the bulk of the time is spent in JSON parsing.

We also do not store the entity type in the JSON report. If we had this, it would unblock #93.

This is a draft for now because I'm still trying to figure out what kind of automated tests would help verify this change. Ideally we would commit the tests as a separate PR first. The manual testing is straightforward but tedious. (Diffing JSON files from many decomp projects.)

I would appreciate any and all preliminary feedback!

jonschz · 2026-01-25T07:39:27Z

> I changed stackcmp to use the entire diff report instead of the grouped version. Is that useful at all? Or not needed?

Yes, that is actually useful. Let's say we have variable a at 0x10 in both orig and recomp, and variable b at 0x14 in both orig and recomp. Now assume we use the wrong variable in one location, leading to a diff like:

-mov ebp+0x10, 1
+mov ebp+0x14, 1

If there are no matched, correct usages of ebp+0x10 or ebp+0x14 within the 10 lines of context around the error, stackcmp currently never sees them and mistakenly concludes that the line is correct and the stack is merely reordered.
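
A small sketch of that failure mode, using only difflib. This is not stackcmp's real logic: the listing, the regular expression, and the helper `matched_stack_refs` are hypothetical, and the filler instructions exist only to push the correct usages out of the 10-line context window.

```python
import difflib
import re

STACK_REF = re.compile(r"ebp\+0x[0-9a-f]+")

# Hypothetical listing: one correct use of each variable, filler, then the bad line.
prefix = ["mov ebp+0x10, 0", "mov ebp+0x14, 0"]
filler = [f"add eax, {i}" for i in range(12)]
orig = prefix + filler + ["mov ebp+0x10, 1"]
recomp = prefix + filler + ["mov ebp+0x14, 1"]

matcher = difflib.SequenceMatcher(None, orig, recomp)


def matched_stack_refs(opcodes):
    """Stack references found on lines that are identical in orig and recomp."""
    refs = set()
    for tag, i1, i2, _j1, _j2 in opcodes:
        if tag == "equal":
            for line in orig[i1:i2]:
                refs.update(STACK_REF.findall(line))
    return refs


# The full diff sees both correctly matched variables.
full_refs = matched_stack_refs(matcher.get_opcodes())

# The grouped diff (10 lines of context) only sees the filler lines.
grouped_refs = set()
for group in matcher.get_grouped_opcodes(n=10):
    grouped_refs |= matched_stack_refs(group)
```

With the full diff, both offsets show up as correctly matched elsewhere in the function, which is the evidence needed to flag the changed line as a wrong variable rather than a reordered stack. With the grouped diff, that evidence is gone.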

jonschz · 2026-01-25T07:46:17Z

> My end goal is to create report format version 2 that will store this data more efficiently.

There is definitely a point to that. Do you want to discuss some options before starting to implement? I am wondering if there is some existing format so we don't have to go fully custom. The first thing to come to my mind is something like diff/patch files, which could be stored as strings in JSON. Most markdown renderers have support for that already. Maybe there's something out of the box we could use for that?
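
One way to picture the diff-as-string idea is the sketch below. It is an assumption-laden illustration, not a proposal for the actual report schema: the `"diffs"` key, the address `0x1234`, and the instruction lists are placeholders, and the context size of 1 is chosen only to keep the output small.

```python
import difflib
import json

# Hypothetical instruction text for one function.
orig = ["push ebp", "mov ebp, esp", "mov eax, 1", "pop ebp", "ret"]
recomp = ["push ebp", "mov ebp, esp", "mov eax, 2", "pop ebp", "ret"]

# Build a standard unified diff and join it into a single string.
udiff_text = "\n".join(
    difflib.unified_diff(orig, recomp, fromfile="orig", tofile="recomp", lineterm="", n=1)
)

# Store the diff text as a plain JSON string, keyed by a placeholder address.
report = {"diffs": {"0x1234": udiff_text}}
encoded = json.dumps(report)
decoded = json.loads(encoded)
print(decoded["diffs"]["0x1234"])
```

The stored string is ordinary unified diff text, so any tool or renderer that understands that format could display it without custom parsing.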

disinvite · 2026-01-25T18:25:23Z

> There is definitely a point to that. Do you want to discuss some options before starting to implement?

Sure, I haven't started anything yet. Rough ideas so far:

- Store the difflib opcodes for each function in the file. We can just reimplement get_grouped_opcodes in JS for the HTML view.
- For functions with a diff, retain all matching lines so the user can specify how much context they want. They might want to dump the entire function, for example.
- This will result in a lot of extra data, so we could break out the instruction text into its own global list and store only the indices for each function's diff. For example:

{
  "instructions": ["push ebp", "mov ebp, esp", "push ebx", "..."],
  "diffs": { "0x1234": { "orig": [1, 2, 3, "..."], "recomp": [1, 2, 3, "..."] } }
  // missing addrs and opcodes, but you get the idea
}
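
A rough sketch of how such a file could be built and read back. The function names `build_report` and `expand` are hypothetical, and like the JSON above this leaves out addresses and opcodes.

```python
import json


def build_report(functions):
    """functions maps an address string to a pair (orig lines, recomp lines)."""
    instructions = []  # global list of unique instruction text
    index_of = {}      # instruction text -> position in `instructions`

    def index_for(line):
        # Assign the next free index the first time a line is seen.
        if line not in index_of:
            index_of[line] = len(instructions)
            instructions.append(line)
        return index_of[line]

    diffs = {
        addr: {
            "orig": [index_for(line) for line in orig],
            "recomp": [index_for(line) for line in recomp],
        }
        for addr, (orig, recomp) in functions.items()
    }
    return {"instructions": instructions, "diffs": diffs}


def expand(report, addr, side):
    """Turn the stored indices for one side back into instruction text."""
    table = report["instructions"]
    return [table[i] for i in report["diffs"][addr][side]]


functions = {
    "0x1234": (
        ["push ebp", "mov ebp, esp", "push ebx"],
        ["push ebp", "mov ebp, esp", "push esi"],
    )
}
report = json.loads(json.dumps(build_report(functions)))
```

After the JSON round trip, `expand(report, "0x1234", "orig")` gives back the original lines, and each distinct instruction string appears in the file only once.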

We could get fancy and do Huffman coding so the most common instructions get the smallest indices. Reducing file size isn't the main concern here, though. It's about reducing the complexity of the JSON so we can parse it more quickly.
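
For illustration, a simpler stand-in for that idea is to sort the instruction table by frequency so that common instructions get the smallest indices. This is not real Huffman coding, just frequency-ordered numbering, and `frequency_ordered_table` is a hypothetical helper name.

```python
from collections import Counter


def frequency_ordered_table(listings):
    """Return (table, index_of) with the most common lines first."""
    counts = Counter(line for listing in listings for line in listing)
    # most_common() sorts by descending count; ties keep first-seen order.
    table = [line for line, _count in counts.most_common()]
    index_of = {line: i for i, line in enumerate(table)}
    return table, index_of


# Hypothetical listings from three functions.
listings = [
    ["push ebp", "mov ebp, esp", "ret"],
    ["push ebp", "ret"],
    ["push ebp"],
]
table, index_of = frequency_ordered_table(listings)
```

Smaller indices take fewer digits in the JSON text, so this only trims file size. It does not change how many values the parser has to read.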

The argument against doing this is that it makes the JSON not very human readable or diffable in a useful way. You could counter by saying that we chose JSON for convenience (and JS compatibility) and it is not intended to be human readable outside of our tools.

disinvite · 2026-03-18T02:47:50Z

I just did some manual testing with this and (as intended) the output is the same for vtables and function entity varieties: stubs, diffs, matches, effective matches.

jonschz

Looks very solid overall! I have faith in your manual testing; I didn't do any myself.

disinvite added 7 commits January 11, 2026 18:13

Delay generating unified diff until output

8eb08a8

Restore previous behavior

6ab1b4f

Establish 'raw diff' name. Move udiff conversion to report code.

ed7190d

Create udiff for effective match

1010892

Merge branch 'isledecomp:master' into retain-diff

f400fef

Minimize diff in report.py

dda3c03

Fix comment

a4095c6

disinvite added 5 commits January 25, 2026 18:23

Merge remote-tracking branch 'origin/master' into retain-diff

73dc64c

Fix missed item from merge

6357d8f

IsleCompare fix

465e64a

Merge branch 'master' into retain-diff

829620c

Merge branch 'isledecomp:master' into retain-diff

d3e59d0

disinvite mentioned this pull request Feb 26, 2026

Tests for reccmp report output and unified diff #326

Merged

disinvite added 2 commits March 7, 2026 12:19

Merge branch 'master' into retain-diff

58d8e43

Tweak tests to use new member names. Stub bugfix

64fd65d

disinvite marked this pull request as ready for review March 7, 2026 19:13

disinvite added 4 commits March 8, 2026 20:32

Merge branch 'master' into retain-diff

90b806f

Pylint fix

9e856c5

Merge branch 'master' into retain-diff

71ff11c

Fix typing TODO

233ac72

disinvite requested review from jonschz and madebr March 18, 2026 02:45

jonschz approved these changes Mar 28, 2026

Comment thread reccmp/compare/report.py

Comment thread reccmp/compare/report.py

Comment thread reccmp/tools/stackcmp.py

Comment thread reccmp/compare/report.py

Comment thread reccmp/compare/report.py

Comment thread reccmp/compare/core.py Outdated

disinvite added 3 commits March 28, 2026 11:41

Docstring for udiff function. Test for reccmp-aggregate scenario.

557bd82

Easier to just rename the dataclass to cover both entity types

5084a8b

Merge branch 'master' into retain-diff

7c17b85

Add stackcmp comment

6ea95cb

disinvite merged commit 5483544 into isledecomp:master Mar 29, 2026
18 checks passed

disinvite deleted the retain-diff branch March 29, 2026 15:48

disinvite merged 22 commits into isledecomp:master from disinvite:retain-diff
