Generate unified diff only when producing output #307
disinvite merged 22 commits into isledecomp:master
Yes, that is actually useful. Let's say we have variable
-mov ebp+0x10, 1
+mov ebp+0x14, 1
If there are no matched, correct usages of
There is definitely a point to that. Do you want to discuss some options before starting to implement? I am wondering if there is some existing format so we don't have to go fully custom. The first thing to come to my mind is something like diff/patch files, which could be stored as strings in JSON. Most markdown renderers have support for that already. Maybe there's something out of the box we could use for that?
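To make the "diff stored as a string in JSON" idea concrete, here is a minimal sketch using only the Python standard library. The instruction listings and the `"diff"` key are made up for illustration and are not part of any existing report format.

```python
import difflib
import json

# Hypothetical instruction listings; a real report would hold actual disassembly.
orig = ["push ebp", "mov ebp, esp", "mov [ebp+0x10], 1", "ret"]
recomp = ["push ebp", "mov ebp, esp", "mov [ebp+0x14], 1", "ret"]


def diff_as_json(orig_lines, recomp_lines):
    """Render a unified diff and store it in a JSON document as one string."""
    diff_text = "\n".join(
        difflib.unified_diff(
            orig_lines, recomp_lines, fromfile="orig", tofile="recomp", lineterm=""
        )
    )
    # "diff" is an assumed key name chosen for this sketch.
    return json.dumps({"diff": diff_text})


encoded = diff_as_json(orig, recomp)
decoded = json.loads(encoded)["diff"]
print(decoded)
```

The decoded string is ordinary unified diff text, so anything that already renders diff/patch content could display it without a custom parser.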
Sure, I haven't started anything yet. Rough ideas so far:
{
  "instructions": ["push ebp", "mov ebp, esp", "push ebx", "..."],
  "diffs": { "0x1234": { "orig": [1, 2, 3, "..."], "recomp": [1, 2, 3, "..."] } }
  // missing addrs and opcodes, but you get the idea
}
We could get fancy and do Huffman coding so the most common instructions get the smallest indices. Reducing file size isn't the main concern here, though. It's about reducing the complexity of the JSON so we can parse it more quickly. The argument against doing this is that it makes the JSON not very human readable or diffable in a useful way. You could counter by saying that we chose JSON for convenience (and JS compatibility) and it is not intended to be human readable outside of our tools.
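A rough sketch of the shared instruction table idea, in Python. The function names (`intern_instructions`, `expand`) and the sample listings are hypothetical. Sorting by frequency gives the most common instructions the smallest indices, which is simpler than real Huffman coding and is only a stand-in for it here.

```python
import json
from collections import Counter


def intern_instructions(listings):
    """Build one shared instruction table and rewrite each listing as indices.

    The table is ordered by descending frequency, so common instructions get
    small indices. This is a frequency-sorted table, not true Huffman coding.
    """
    counts = Counter(line for listing in listings.values() for line in listing)
    table = [text for text, _ in counts.most_common()]
    index = {text: i for i, text in enumerate(table)}
    encoded = {
        name: [index[line] for line in listing] for name, listing in listings.items()
    }
    return table, encoded


def expand(table, indices):
    """Turn a list of indices back into instruction text."""
    return [table[i] for i in indices]


# Hypothetical data for a single function at a made-up address.
orig = ["push ebp", "mov ebp, esp", "push ebx", "pop ebx", "pop ebp", "ret"]
recomp = ["push ebp", "mov ebp, esp", "push esi", "pop esi", "pop ebp", "ret"]

table, encoded = intern_instructions({"orig": orig, "recomp": recomp})
report = json.dumps({"instructions": table, "diffs": {"0x1234": encoded}})

loaded = json.loads(report)
restored = expand(loaded["instructions"], loaded["diffs"]["0x1234"]["orig"])
```

Each distinct instruction string is stored once, and every listing becomes a flat list of integers. The trade-off raised above still applies: the file is no longer readable without the table.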
I just did some manual testing with this and (as intended) the output is the same for vtables and function entity varieties: stubs, diffs, matches, effective matches.
jonschz left a comment
Looks very solid overall! I have faith in your manual testing; I didn't do any.
The process of generating a diff for a function or vtable is:
ReccmpMatch object from the database. Assuming that exists:
difflib.SequenceMatcher to compare the raw text (i.e. not the addresses or offsets) and produce diff opcodes. These are instructions from the set (equal, insert, delete, replace) that describe how to turn sequence A (orig) into sequence B (recomp).
DiffReport object.
DiffReport to a ReccmpComparedEntity object first, then follow the serialization logic.
Some data is lost between steps 3 and 4. We don't need to generate the unified diff until step 6 when it is time to output to a file or to the screen.
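The opcode step and the "render only at output time" idea can be sketched as follows. `DiffSketch`, `compare`, and `render_unified` are hypothetical names and do not reflect the project's actual `DiffReport` class. The renderer emits unified-style `+`/`-` lines without hunk headers to keep the example short.

```python
import difflib
from dataclasses import dataclass


@dataclass
class DiffSketch:
    """Hypothetical stand-in for a diff report: both sequences plus opcodes."""

    orig: list
    recomp: list
    opcodes: list  # (tag, i1, i2, j1, j2) tuples from SequenceMatcher


def compare(orig, recomp):
    """Compare raw instruction text and keep the opcodes, not rendered text."""
    matcher = difflib.SequenceMatcher(None, orig, recomp, autojunk=False)
    return DiffSketch(orig, recomp, matcher.get_opcodes())


def render_unified(report):
    """Produce unified-style text only when output is actually needed."""
    lines = []
    for tag, i1, i2, j1, j2 in report.opcodes:
        if tag == "equal":
            lines += [" " + text for text in report.orig[i1:i2]]
        else:
            # replace, delete, insert: one of the two slices may be empty.
            lines += ["-" + text for text in report.orig[i1:i2]]
            lines += ["+" + text for text in report.recomp[j1:j2]]
    return "\n".join(lines)


# Hypothetical instruction listings.
orig = ["push ebp", "mov ebp, esp", "mov [ebp+0x10], 1", "ret"]
recomp = ["push ebp", "mov ebp, esp", "mov [ebp+0x14], 1", "ret"]

report = compare(orig, recomp)
tags = [op[0] for op in report.opcodes]
text = render_unified(report)
```

Because the opcodes index into both sequences, no information is dropped until `render_unified` is called, so other consumers can work from the same object without parsing diff text.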
I changed stackcmp to use the entire diff report instead of the grouped version. Is that useful at all? Or not needed?
My end goal is to create report format version 2 that will store this data more efficiently. (#98) JSON deserialization becomes a performance problem if you are aggregating a lot of entropy runs. For the 1024 sample set, IIRC the bulk of the time is spent in JSON parsing.
We also do not store the entity type in the JSON report. If we had this, it would unblock #93.
This is a draft for now because I'm still trying to figure out what kind of automated tests would help verify this change. Ideally we would commit the tests as a separate PR first. The manual testing is straightforward but tedious. (Diffing JSON files from many decomp projects.)
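One possible starting point for automating the manual check is a small helper that lists the paths where two decoded JSON reports differ. This is only a sketch: `json_differences` is a hypothetical helper, and the keys in the sample documents are invented, not the real report schema.

```python
import json


def json_differences(a, b, path="$"):
    """Yield the paths at which two decoded JSON values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(a.keys() | b.keys()):
            if key not in a or key not in b:
                yield f"{path}.{key}"
            else:
                yield from json_differences(a[key], b[key], f"{path}.{key}")
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield path
        else:
            for i, (x, y) in enumerate(zip(a, b)):
                yield from json_differences(x, y, f"{path}[{i}]")
    elif a != b:
        yield path


# Made-up report fragments standing in for output before and after a change.
before = json.loads('{"data": [{"address": "0x1000", "matching": 1.0}]}')
after = json.loads('{"data": [{"address": "0x1000", "matching": 0.5}]}')

changed = list(json_differences(before, after))
unchanged = list(json_differences(before, before))
```

An automated test could assert that the list of differences is empty for reports generated before and after the refactor.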
I would appreciate any and all preliminary feedback!