Skip to content

gtsrelf: include LOCAL globals, apply atEnd, and add flag to throw on mismatch#512

Merged
ailrst merged 11 commits intomainfrom
gtsrelf-fix-locals-and-atend-again
Jul 21, 2025
Merged

gtsrelf: include LOCAL globals, apply atEnd, and add flag to throw on mismatch#512
ailrst merged 11 commits intomainfrom
gtsrelf-fix-locals-and-atend-again

Conversation

@katrinafyi
Copy link
Member

@katrinafyi katrinafyi commented Jul 16, 2025

works towards #509

@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 16, 2025

With the new stricter tests, we're seeing the known nestedifglobal problem and some others.

We're observing a mismatch in these cases:

  • extraspec_correct/malloc_memcpy_strlen_memset_free/gcc_O2:GTIRB *** FAILED *** (22 milliseconds)
  • extraspec_correct/malloc_memcpy_strlen_memset_free/clang_O2:GTIRB *** FAILED *** (12 milliseconds)
  • extraspec_incorrect/malloc_memcpy_strlen_memset_free/gcc_O2:GTIRB *** FAILED *** (13 milliseconds)
  • extraspec_incorrect/malloc_memcpy_strlen_memset_free/clang_O2:GTIRB *** FAILED *** (13 milliseconds)

These appear to be fixed by re-lifting? But @ l-kent says re-lifting didn't help. Maybe worth looking into it.

In tests which use examples/cntlm-noduk/cntlm-noduk.gts (e.g. IntervalDSATest), we're also seeing the deliberate incompatibility where oldrelf includes external variables as a ExternalFunction, but gtsrelf does not.

[INFO]   [!] Using ELF data from relf: examples/cntlm-noduk/cntlm-noduk.relf
[WARN]   PLEASE REPORT THIS ISSUE! (https://github.com/UQ-PAC/BASIL/issues/509) include the gts and relf files.
         gtirb relf discrepancy, external functions differ:
         gtirb - relf = HashSet()
         relf - gtirb = HashSet(ExternalFunction(optind,4391784), ExternalFunction(stdout,4391792), ExternalFunction(stderr,4391760), ExternalFunction(stdin,4391808), ExternalFunction(optarg,4391776)) [checkReadELFCompatibility@GTIRBReadELF.scala:281]

Do we want to precisely mimic the (probably wrong) oldrelf? Or, should we strike out and I'll modify the compat check?

@ailrst
Copy link
Contributor

ailrst commented Jul 17, 2025

Do we want to precisely mimic the (probably wrong) oldrelf? Or, should we strike out and I'll modify the compat check?

We probably want them somewhere, but should implement it correctly rather than just matching the previous implementation.

If they are just mapped as variables into the this processes' address space then it might make sense to treat them like regular globals.

@katrinafyi
Copy link
Member Author

Yeah, so the oldrelf line

    ExternalFunction("optind", BigInt("4391784")),

also has a global variables line

    SpecGlobal("optind", 32, None, BigInt("4391784")),

and the SpecGlobal is also present in the gtsrelf. So maybe I'll remove them from ExternalFunctions.

The only problem might be if two libraries are linked together, I think we lose the information that the underlying variables are the same. But this shouldn't be a problem as long as we look at whole ELFs.

oldrelf will generate both SpecGlobal and ExternalFunction entries for
external global variables (e.g., errno, optind), but gtsrelf does not.
the SpecGlobal entry is more semantically correct and contains strictly
more information, so we allow this incompatibility
@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

But @ l-kent says re-lifting didn't help. Maybe worth looking into it.

This is likely compiler/OS etc. version differences, or potentially a ddisasm version difference given that it seems to be related to a warning about relocating the BSS section that I get from ddisasm

Do we want to precisely mimic the (probably wrong) oldrelf?

Including those external variables as ExternalFunctions is an instance where the existing ReadELFLoader is imprecise - external functions can appear in the .rela.dyn section but aren't readily distinguishable from external variables, so they're currently all included as ExternalFunctions. I think the way to improve this is by matching the names to the .dynsym table entries but having those extraneous ExternalFunctions hadn't previously caused any problems so no one had looked any further at it.

@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

The optind external variable in cntlm-noduk seems to just be a rarer case that we haven't given any particular thought to before - it's type R_AARCH64_COPY in .rela.dyn and appears in .dynsym and .symtab too, all with the same address, so the .symtab entry causes it to be included as a SpecGlobal and the .rela.dyn entry causes it to be (incorrectly) included as an ExternalFunction.

@katrinafyi
Copy link
Member Author

To clarify, when you say "none of it is solved by re-lifting" in the other thread, are you using main branch or this #512 branch to do the gtsrelf/oldrelf comparison? In #509 (comment), I was using the #512 branch. Can you confirm which you used? Sorry for the confusion.

@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

That was with the main branch. On this branch, the only discrepancies for the malloc_memcpy_strlen_memset_free tests involve __bss_start, which re-lifting doesn't change.

   + [WARN]   PLEASE REPORT THIS ISSUE! (https://github.com/UQ-PAC/BASIL/issues/509) include the gts and relf files.
          gtirb relf discrepancy, symbol tables differ:
          gtirb - relf = HashSet(ELFSymbol(70,69664,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start__), ELFSymbol(92,69664,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start))
          relf - gtirb = HashSet(ELFSymbol(92,69660,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start), ELFSymbol(70,69660,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start__)) [checkReadELFCompatibility@GTIRBReadELF.scala:290]
   + [WARN]   PLEASE REPORT THIS ISSUE! (https://github.com/UQ-PAC/BASIL/issues/509) include the gts and relf files.
          gtirb relf discrepancy, symbol tables differ:
          gtirb - relf = HashSet(ELFSymbol(70,69664,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start__), ELFSymbol(92,69664,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start))                                                                                                        
          relf - gtirb = HashSet(ELFSymbol(92,69660,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start), ELFSymbol(70,69660,0,NOTYPE,GLOBAL,DEFAULT,Section(23),__bss_start__)) [checkReadELFCompatibility@GTIRBReadELF.scala:290

@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

To be clear, this issue was not caused by mismatched .relf and .gts files, they were derived from the same binaries and it still happens with freshly compiled and lifted binaries. The only test case where that's an issue is nestedifglobal and I have included a fix for that in #510.

The issue is likely related to the ddisasm warning I previously highlighted: transform WARNING: Moving symbol to first block of section: __bss_start. I don't know why ddisasm does that only for these examples and no others. I've confirmed that my ddisasm installation isn't out of date with the one in the nix repo.

@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 17, 2025

I understand, I'll change the comment.

Which ddisasm version do you have? If it's from nix, do you know which commit it's from (nix profile list)? When I lift using ddisasm-0-unstable-2024-10-31 from katrinafyi/pac-nix@569afdf, I don't see the warnings. Edit: I've just tried the latest ddisasm from pac-nix and there's no warning and no mismatch.

@katrinafyi katrinafyi marked this pull request as ready for review July 17, 2025 05:17
@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 17, 2025

Anyway, this PR can be reviewed now. It fixes some bugs which affected a lot of the system tests. We can fix these last few test cases in another PR.

Also, be assured that the CI passes. I just didn't want to re-run it after the comment-only change.

@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

It's ddisasm-0-unstable-2024-10-31 from katrinafyi/pac-nix@2a579b1 because I just ran update earlier to see if that fixed anything. Previously I believe it was commit katrinafyi/pac-nix@55e0c10

Here's one of the binaries where ddisasm produces that warning when lifting:

incorrect_malloc_memcpy_strlen_memset_free_clang_O2.zip

@katrinafyi
Copy link
Member Author

I also see the ddisasm warning with your binary. But honestly, idk where to go from this. I'm using the Docker-based infrastructure in #288 and that produces a binary that lifts with no warnings and seems to work. I'm happy enough with that, and I think it would be really hard to dig into why your system's compiler has this problem.

Anyway, do you want to review this PR?

@l-kent
Copy link
Contributor

l-kent commented Jul 17, 2025

I think a reasonable solution would just be to ignore it if the offsets for symbols named __bss_start__ or __bss_start don't match given that this seems to be some weird quirk of either ddisasm or the compiler.

Looking at the .relf file, it seems that ddisasm is just modifying the address of those symbols to match the address of the .bss section - I'm not sure why those would be different but they are in this case.

@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 17, 2025

I don't really want to special case certain symbol entries by name, especially if it only happens with certain test cases on certain compilers.

I've lessened the error message so it doesn't say "PLEASE REPORT THIS ISSUE" for these particular test cases. Eventually, we should move to a repeatable lifting environment which doesn't have the mismatch

@l-kent
Copy link
Contributor

l-kent commented Jul 18, 2025

I think it makes a lot of sense to have a special case where we ignore a known inconsistency in behaviour between two different tools (readelf and DDisasm) that doesn't seem to actually matter. The point of these checks is to just to make sure that the behaviour of BASIL is consistent in reading the symbol table etc. data from readelf vs. DDisasm + GTIRB, isn't it, to ensure that the new DDisasm + GTIRB symbol table pipeline is working correctly? In this case, we've determined that the inconsistency is due to DDisasm, not due to an error in BASIL.

It's possible to be more precise about what DDisasm is doing - it is changing the address of the __bss_start and __bss_start__ symbols to match the start of section 24, the .bss section. If this is the only difference between the symbols we get from readelf and DDisasm/GTIRB, then it seems fine to ignore it. If there's other inconsistencies with the addresses of those symbols, then that would be a new problem we'd have to look into.

@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 18, 2025

To tweak the compatibility check, we have two knobs: (1) the check itself, which applies to all inputs, and (2) the reporting level of particular test cases (exception, warning, or silent). We should choose which knob to turn based on whether a problem is systemic or isolated.

At the moment, this issue has only been observed in that extraspec test, so I have adjusted the reporting level of that test case only. This way, if the same issue were to appear in other test cases, this would raise an error and we would notice it (which is good!). Yes, this does have the side-effect of silencing all mismatches in this test case, but I think this is an acceptable compromise. In the alternative, it would be impossible to tell if more binaries started having the bss mismatch.

Also, in the current infrastructure, it is not possible to special case the checking method based on a particular test case. I'm also not interested in adding that level of customisation, because I think it will be tremendously hacky. I also think that this issue will go away if we use the Docker container for compiling, so I'm not interested in building an exception which will only be necessary for a short period of time.

@l-kent
Copy link
Contributor

l-kent commented Jul 18, 2025

In the alternative, we would not notice if more binaries started having the bss mismatch.

Why do we need to notice this? This isn't an issue with anything other than DDisasm and readelf handling certain binaries (where the .bss section doesn't exactly match the __bss_start symbol) differently.

@katrinafyi
Copy link
Member Author

katrinafyi commented Jul 18, 2025

Guh, I yield. I'll log on and change it.

UNFORTUNATELY, the bss_start correction can't be put in normaliseRelf
alone, because i want to be able to detect when it happens. the mismatch
occurs in the gtsrelf, but its only detectable by considering both the
oldrelf and gtsrelf. if it was in normaliseRelf, the differences would
be smoothed over and undetectable.
@ailrst ailrst self-requested a review July 21, 2025 05:51
Copy link
Contributor

@ailrst ailrst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@l-kent assuming you aren't interested in reviewing

This change set looks fine. The only difference in cntlm is now the @GLIBC tag dropped and the extern globals only appearing in the globals section which is what we want.

@ailrst ailrst merged commit 929f442 into main Jul 21, 2025
12 checks passed
@ailrst ailrst deleted the gtsrelf-fix-locals-and-atend-again branch July 21, 2025 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants