@rovarma
Contributor

@rovarma rovarma commented Nov 23, 2025

Hi @davea42,

For our use case, we're interested in extracting all unwind rows for a given binary. To do so, we're currently using the dwarf_get_fde_info_for_all_regs3_b function by repeatedly calling it in a loop until all rows have been consumed. This looks something like this:

Dwarf_Addr currentRowRVA = fdeStartAddress;
Dwarf_Bool hasMoreRows = false;

// Iterate through all rows for this FDE while there are more rows available
do
{
	Dwarf_Addr rowRVA = 0;

	// Get the row for the current address, which also outputs the address of the next
	// available row
	Dwarf_Addr nextRowRVA = 0;
	int res = dwarf_get_fde_info_for_all_regs3_b(fde, currentRowRVA, &registerTable, &rowRVA, &hasMoreRows, &nextRowRVA, &dwarfError);
	if (res != DW_DLV_OK)
		break; // stop on error or no-entry instead of reading stale outputs

	// do something with register table

	// Move to the next available row
	currentRowRVA = nextRowRVA;
} while (hasMoreRows);

For a binary we're testing against, this takes ~1 second to extract all unwind rows for all FDEs in the binary. The binary has:

  • 154,361 FDEs
  • 1,548,145 rows

We analyzed this, and the root cause is that this is essentially a somewhat hidden quadratic loop: dwarf_get_fde_info_for_all_regs3_b internally uses _dwarf_exec_frame_instr (via _dwarf_get_fde_info_for_a_pc_row) to execute all instructions for the given FDE until it reaches the search_pc_val passed in, but it starts from the first instruction for each call.

This means that if you're iterating through the rows for an FDE like this, the loop essentially looks like this in pseudo code:

for each row in function:
	for each instruction in function:
		execute instruction
		if current location > search_pc_val:
			break
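To make the cost concrete, here is a minimal toy model of the two strategies. It is not libdwarf code: every `toy_` name is hypothetical, and it assumes for simplicity that each frame instruction produces exactly one row.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not libdwarf code: assume one frame instruction per row. */

/* Restart-from-the-beginning lookup: reaching row `target` (0-based)
   means executing instructions 0..target. Returns instructions executed. */
static size_t toy_lookup_cost(size_t target, size_t row_count)
{
    assert(target < row_count);
    return target + 1;
}

/* Total cost of visiting every row via repeated lookups: 1 + 2 + ... + n. */
size_t toy_cost_repeated_lookup(size_t row_count)
{
    size_t total = 0;
    for (size_t row = 0; row < row_count; row++) {
        total += toy_lookup_cost(row, row_count);
    }
    return total;
}

/* Total cost of a single pass that reports each row as it is produced. */
size_t toy_cost_single_pass(size_t row_count)
{
    return row_count;
}
```

Under these assumptions an FDE with 10 rows costs 55 instruction executions with repeated lookups and 10 with a single pass, and the gap widens as the row count per FDE grows.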

This PR implements a new function dwarf_iterate_fde_info_for_all_regs3 that fixes this. The new function uses a new (internal) helper function _dwarf_iterate_frame_instr that executes all instructions for the FDE, invoking a callback for each row. This turns the quadratic loop into a linear loop over the instructions, which results in a significant speedup: iteration for this binary goes from 1007ms to 83ms, which is ~12x faster.
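The general shape of a callback-driven single pass can be sketched as follows. This is an illustration only, not the code in this PR: the row type, the callback signature, and all `toy_` names are made up, and the "rows" are a plain array rather than the result of executing frame instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row type for illustration; libdwarf's real tables differ. */
struct toy_row {
    unsigned long pc;      /* first address this row applies to */
    long          cfa_off; /* made-up stand-in for the register rules */
};

/* Callback: return nonzero to stop the iteration early. */
typedef int (*toy_row_callback)(const struct toy_row *row,
    int is_last_row, void *user_data);

/* Walk the rows once, handing each one to the callback.
   Returns the number of rows delivered. */
size_t toy_iterate_rows(const struct toy_row *rows, size_t count,
    toy_row_callback cb, void *user_data)
{
    size_t delivered = 0;
    assert(cb != NULL);
    for (size_t i = 0; i < count; i++) {
        delivered++;
        if (cb(&rows[i], i + 1 == count, user_data)) {
            break; /* callback asked to stop */
        }
    }
    return delivered;
}

/* Example callback: count rows and remember the highest pc seen. */
struct toy_stats {
    size_t        rows;
    unsigned long max_pc;
};

int toy_collect_stats(const struct toy_row *row, int is_last_row,
    void *user_data)
{
    struct toy_stats *stats = user_data;
    (void)is_last_row;
    stats->rows++;
    if (row->pc > stats->max_pc) {
        stats->max_pc = row->pc;
    }
    return 0; /* keep going */
}
```

The `user_data` pointer is what lets the caller accumulate results without globals, and the return value gives the caller a way to stop once it has what it needs.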

Open questions

I'm sending this PR as a RFC/draft, because there are some questions around the change as implemented that I'm not sure about:

First of all, the new _dwarf_iterate_frame_instr helper is a copy/paste of the existing _dwarf_exec_frame_instr with some minor modifications to invoke the callback instead. This results in significant code duplication that is probably not desired. It is technically possible to implement _dwarf_exec_frame_instr in terms of the new _dwarf_iterate_frame_instr to prevent this code duplication.

I haven't included that as part of the PR, but that could look something like this:

struct Dwarf_Exec_Frame_Callback_Info {
    Dwarf_Bool  search_pc;
    Dwarf_Addr  search_pc_val;
    Dwarf_Frame output_table;
};

Dwarf_Bool _dwarf_exec_frame_instr_callback(Dwarf_Frame table,
    Dwarf_Addr subsequent_pc, Dwarf_Bool is_last_row, void* user_data)
{
    struct Dwarf_Exec_Frame_Callback_Info* exec_data =
        (struct Dwarf_Exec_Frame_Callback_Info*)user_data;

    Dwarf_Bool done = exec_data->search_pc &&
        (subsequent_pc > exec_data->search_pc_val);

    if ((done || is_last_row) && exec_data->output_table) {

        struct Dwarf_Reg_Rule_s* t2reg = exec_data->output_table->fr_reg;
        struct Dwarf_Reg_Rule_s* t3reg = table->fr_reg;
        unsigned minregcount = (unsigned)MIN(exec_data->output_table->fr_reg_count,
            table->fr_reg_count);
        unsigned curreg = 0;

        exec_data->output_table->fr_loc = table->fr_loc;
        for (; curreg < minregcount; curreg++, t3reg++, t2reg++) {
            *t2reg = *t3reg;
        }

        exec_data->output_table->fr_cfa_rule = table->fr_cfa_rule;
    }

    return done;
}

int
_dwarf_exec_frame_instr(Dwarf_Bool make_instr,
    Dwarf_Bool search_pc,
    Dwarf_Addr search_pc_val,
    Dwarf_Addr initial_loc,
    Dwarf_Small* start_instr_ptr,
    Dwarf_Small* final_instr_ptr,
    Dwarf_Frame table,
    Dwarf_Cie cie,
    Dwarf_Debug dbg,
    Dwarf_Unsigned reg_num_of_cfa,
    Dwarf_Bool* has_more_rows,
    Dwarf_Addr* subsequent_pc,
    Dwarf_Frame_Instr_Head* ret_frame_instr_head,
    Dwarf_Unsigned* returned_frame_instr_count,
    Dwarf_Error* error)
{
    struct Dwarf_Exec_Frame_Callback_Info user_data;
    user_data.search_pc = search_pc;
    user_data.search_pc_val = search_pc_val;
    user_data.output_table = table;

    return _dwarf_iterate_frame_instr(&_dwarf_exec_frame_instr_callback,
        &user_data, make_instr, initial_loc, start_instr_ptr,
        final_instr_ptr, cie, dbg, reg_num_of_cfa, has_more_rows,
        subsequent_pc, ret_frame_instr_head, returned_frame_instr_count, error);
}

But this is not a risk-free change.

Secondly, the new dwarf_iterate_fde_info_for_all_regs3 has to make use of an internal callback function _dwarf_iterate_fde_info_for_all_regs3_callback that's passed to the new _dwarf_iterate_frame_instr helper. The reason for this is that _dwarf_iterate_frame_instr (and _dwarf_exec_frame_instr) work with an internal struct Dwarf_Frame that's not currently exposed in libdwarf.h. This means we need to copy from Dwarf_Frame to Dwarf_Regtable3, which is a bit wasteful.

It would be nice if Dwarf_Frame (and related structs) could be exposed in the API to avoid this copy step. As I understand it, the difference between Dwarf_Frame and Dwarf_Regtable3 was introduced to avoid breaking the API for existing functions like dwarf_get_fde_info_for_all_regs3_b, but since there is a new function being introduced here, there is no risk of that.

I realize this is quite a lot taken together, but it would be great to get your thoughts/feedback on all of this to see if we can get this PR in a shape where it could be upstreamed (or if you have any ideas around how a similar optimization could be implemented in a way that fits a bit better in libdwarf's current architecture).

…e_info_for_all_regs3 API function

This greatly optimizes iteration through all FDE unwind rows. The new helper function invokes a callback for each row in the FDE, instead of having to repeatedly call _dwarf_exec_frame_instr, which performs a lot of redundant work per row and has quadratic behavior. The new helper is mostly a copy of _dwarf_exec_frame_instr with some minor modifications.

The new dwarf_iterate_fde_info_for_all_regs3() function uses the new helper function internally
@davea42
Owner

davea42 commented Nov 25, 2025

I have not looked at this yet, still working on harmless errors/dwarfdump.

The problem with public structs in the API is that when DWARF changes something
affecting those structs it is sort of impossible to preserve source compatibility
for library users. It also causes code bloat.

A function interface, on the other hand, even with N return values via pointers,
can easily deal with new versions by adding
new functions, and there are several examples in the library.
Basically painless.
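A minimal sketch of that pattern, assuming a hypothetical library (none of the `toy_` names exist in libdwarf): callers hold only an opaque handle, results come back through pointer arguments, and a later need is met by adding a function with a `_b` suffix rather than by changing a public struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical library, for illustration only. Callers see just an opaque
   pointer type; the struct layout is private and can change freely. */
typedef struct toy_row_s *toy_row_handle;

struct toy_row_s {              /* would live in a private header */
    unsigned long pc;
    long          cfa_offset;
    int           cfa_register; /* field added in a later "version" */
};

#define TOY_OK    0
#define TOY_ERROR 1

/* Original accessor: results are returned through pointer arguments. */
int toy_row_info(toy_row_handle row, unsigned long *pc, long *cfa_offset)
{
    if (!row || !pc || !cfa_offset) {
        return TOY_ERROR;
    }
    *pc = row->pc;
    *cfa_offset = row->cfa_offset;
    return TOY_OK;
}

/* Later addition: a new function with one more out-parameter.
   Existing callers of toy_row_info() keep compiling unchanged. */
int toy_row_info_b(toy_row_handle row, unsigned long *pc,
    long *cfa_offset, int *cfa_register)
{
    int res = 0;
    if (!cfa_register) {
        return TOY_ERROR;
    }
    res = toy_row_info(row, pc, cfa_offset);
    if (res != TOY_OK) {
        return res;
    }
    assert(row != NULL); /* guaranteed by toy_row_info() succeeding */
    *cfa_register = row->cfa_register;
    return TOY_OK;
}
```

The price is a copy per call, which is the transformation cost discussed in this thread.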

Your code in this pull request looks pretty sensible. I need to look again, but it looks promising.

@davea42 davea42 marked this pull request as ready for review November 25, 2025 21:22
@rovarma
Contributor Author

rovarma commented Nov 25, 2025

Thanks! No rush -- happy to discuss when you have the room for it.

@davea42
Owner

davea42 commented Nov 26, 2025

> It would be nice if Dwarf_Frame (and related structs) could be exposed in the API to avoid this copy step. As I understand it, the difference between Dwarf_Frame and Dwarf_Regtable3 was introduced to avoid breaking the API for existing functions like dwarf_get_fde_info_for_all_regs3_b, but since there is a new function being introduced here, there is no risk of that.

Well, that assumes yet more change won't be required in the future.
But I assume change will be required. So not introducing more public structs.
The transformation is not expensive for anyone, but breaking the API
is inevitably expensive for some people (including me).

Anyway, the overall idea looks good.

I'm about to write my own dwarf_iterate_fde_info_for_all_regs3() to verify various details.
No code duplication will be involved. Existing thorough tests will ensure changes
break nothing that already works. I don't do new code in my head (wish I could though).

I will also write a new dwarfexample/frame2.c using the new approach. Maybe with
an automated comparison of the important output of frame1.c with frame2.c

You are not the first to request data on all the fde pc value rows, so this will help
others.

There are only three calls to _dwarf_exec_frame_instr (before changes)
so it's not difficult to get the entire picture. Thanks for pushing me
to look at this!

It would be nice to see your iterate versions as a cross-check on my efforts.

@davea42 davea42 marked this pull request as draft November 26, 2025 21:42
@davea42
Owner

davea42 commented Nov 26, 2025

Changing draft to pull request was a mistake on my part. So back to draft now.
DavidA

@davea42
Owner

davea42 commented Nov 28, 2025

I have to refactor the 1600 lines of dwarf_exec_frame_instr()
into two functions to separate out the long switch doing DW_OP processing in dwarf_frame.c.
Then I have to add the ability to remember where we were in that function so
it can be restarted when appropriate.

iterate.c
I have what I hope is a sensible approach in files outside the library.

But first: refactor.
frame2.c

@davea42
Owner

davea42 commented Nov 29, 2025

I moved _dwarf_exec_frame_instr() into a new source file: dwarf_cfa_read.c

It's no smaller, but it's easier to work with (it already seems so).
Regression tests are running, but of course they will pass.

I think I see where and how to place the callbacks. More later.

@rovarma
Contributor Author

rovarma commented Dec 3, 2025

Hey Dave,

Sorry for the late response, I've been away for a bit.

> Well, that assumes yet more change won't be required in the future.
> But I assume change will be required. So not introducing more public structs.
> The transformation is not expensive for anyone, but breaking the API
> is inevitably expensive for some people (including me).

Totally understand the reasoning, but just to respond to the expensive part: in my profiles, about 25% of the remaining time in the new dwarf_iterate_fde_info_for_all_regs3 is spent on this transformation. It's fast enough in general, so it's fine to keep it like that, just wanted to give some context.

> It would be nice to see your iterate versions as a cross-check on my efforts.

Which iterate versions do you mean here? The inner iteration version is part of the commits for this PR. If you mean a version of _dwarf_exec_frame_instr implemented in terms of the new iteration function, that's in the original post:

struct Dwarf_Exec_Frame_Callback_Info {
    Dwarf_Bool  search_pc;
    Dwarf_Addr  search_pc_val;
    Dwarf_Frame output_table;
};

Dwarf_Bool _dwarf_exec_frame_instr_callback(Dwarf_Frame table,
    Dwarf_Addr subsequent_pc, Dwarf_Bool is_last_row, void* user_data)
{
    struct Dwarf_Exec_Frame_Callback_Info* exec_data =
        (struct Dwarf_Exec_Frame_Callback_Info*)user_data;

    Dwarf_Bool done = exec_data->search_pc &&
        (subsequent_pc > exec_data->search_pc_val);

    if ((done || is_last_row) && exec_data->output_table) {

        struct Dwarf_Reg_Rule_s* t2reg = exec_data->output_table->fr_reg;
        struct Dwarf_Reg_Rule_s* t3reg = table->fr_reg;
        unsigned minregcount = (unsigned)MIN(exec_data->output_table->fr_reg_count,
            table->fr_reg_count);
        unsigned curreg = 0;

        exec_data->output_table->fr_loc = table->fr_loc;
        for (; curreg < minregcount; curreg++, t3reg++, t2reg++) {
            *t2reg = *t3reg;
        }

        exec_data->output_table->fr_cfa_rule = table->fr_cfa_rule;
    }

    return done;
}

int
_dwarf_exec_frame_instr(Dwarf_Bool make_instr,
    Dwarf_Bool search_pc,
    Dwarf_Addr search_pc_val,
    Dwarf_Addr initial_loc,
    Dwarf_Small* start_instr_ptr,
    Dwarf_Small* final_instr_ptr,
    Dwarf_Frame table,
    Dwarf_Cie cie,
    Dwarf_Debug dbg,
    Dwarf_Unsigned reg_num_of_cfa,
    Dwarf_Bool* has_more_rows,
    Dwarf_Addr* subsequent_pc,
    Dwarf_Frame_Instr_Head* ret_frame_instr_head,
    Dwarf_Unsigned* returned_frame_instr_count,
    Dwarf_Error* error)
{
    struct Dwarf_Exec_Frame_Callback_Info user_data;
    user_data.search_pc = search_pc;
    user_data.search_pc_val = search_pc_val;
    user_data.output_table = table;

    return _dwarf_iterate_frame_instr(&_dwarf_exec_frame_instr_callback,
        &user_data, make_instr, initial_loc, start_instr_ptr,
        final_instr_ptr, cie, dbg, reg_num_of_cfa, has_more_rows,
        subsequent_pc, ret_frame_instr_head, returned_frame_instr_count, error);
}

> I have to refactor the 1600 lines of dwarf_exec_frame_instr()
> into two functions to separate out the long switch doing DW_OP processing in dwarf_frame.c
> Then I have to add the ability to remember where we were in that function so
> it can be restarted when appropriate.

That seems like a sensible approach indeed. I considered doing that, but I didn't feel confident enough in my understanding of dwarf_exec_frame_instr to do so.

> I think I see where and how to place the callbacks. More later.

Great! Let me know if I can provide any input; will respond more promptly now that I'm back in town :)

@davea42
Owner

davea42 commented Dec 6, 2025

I've been proceeding with your idea in a branch. It compiles but nothing more to report.

It also occurred to me that having the Dwarf_Debug instance involved
in recording the frame data in _dwarf_exec_frame_instr (effectively a closure) would enable
restarting an iteration (when a request and the closure data matched up properly).
I suppose that's the only other means available in C/C++.
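One way to picture the "closure" idea is a context that remembers the position reached by the previous request and resumes from it only when the next request matches. The sketch below is a toy under stated assumptions: rows are a sorted array of start addresses rather than executed frame instructions, and `toy_context`, `toy_fde`, and `toy_find_row` are hypothetical names, not libdwarf internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an FDE: ascending start pc of each row. */
struct toy_fde {
    const unsigned long *row_start;
    size_t               row_count;
};

/* State remembered between calls, held by the library-side context. */
struct toy_context {
    const struct toy_fde *saved_fde; /* FDE the saved position belongs to */
    size_t                saved_row; /* row index reached by the last call */
    size_t                steps;     /* rows scanned so far, for measuring */
};

/* Return the index of the row covering pc. Resume from the saved position
   when the request is for the same FDE at an address at or past the saved
   row; otherwise start again from row 0. */
size_t toy_find_row(struct toy_context *ctx, const struct toy_fde *fde,
    unsigned long pc)
{
    size_t row = 0;
    assert(ctx && fde && fde->row_count > 0);
    assert(pc >= fde->row_start[0]);
    if (ctx->saved_fde == fde && pc >= fde->row_start[ctx->saved_row]) {
        row = ctx->saved_row; /* request matches the saved state */
    }
    while (row + 1 < fde->row_count && fde->row_start[row + 1] <= pc) {
        row++;
        ctx->steps++;
    }
    ctx->saved_fde = fde;
    ctx->saved_row = row;
    return row;
}
```

Walking four rows in ascending order takes 3 scan steps in total with the saved state, versus 0 + 1 + 2 + 3 = 6 when every request restarts from row 0. A request that goes backwards simply falls back to a restart.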

@davea42
Owner

davea42 commented Dec 10, 2025

So I have a working version of the callback method. I need to do a bit more in
dwarfexample/frame2.c (new file) to show data reasonably similar in format to
what dwarfexample/frame1.c shows. That is necessary to really have confidence things are OK.
Lots of debug printf to remove ...
No timing tests, but it has to be good since it is a single pass (no backtracking)
reading the frame instructions for an entire fde.

I do have some really large executables with lots of code and many
fdes, so getting an idea of timings should not be too tough. I suppose I really only
need to build frame1.c and frame2.c specially as the timings should treat libdwarf as a black box.
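A black-box timing harness along those lines might look like the sketch below. It is an assumption-laden illustration: `toy_time_cpu_seconds` and `toy_busy_work` are hypothetical names, the workload is a placeholder for "extract all rows from one object", and `clock()` reports CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <time.h>

/* Run work(arg) `repeats` times and return the CPU seconds consumed.
   The harness knows nothing about what the workload does internally. */
double toy_time_cpu_seconds(void (*work)(void *), void *arg, int repeats)
{
    clock_t begin;
    clock_t end;
    assert(work != NULL && repeats > 0);
    begin = clock();
    for (int i = 0; i < repeats; i++) {
        work(arg);
    }
    end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}

/* Placeholder workload: bumps a caller-owned counter 1000 times.
   A real comparison would call the row-extraction code here instead. */
void toy_busy_work(void *arg)
{
    unsigned long *counter = arg;
    for (int i = 0; i < 1000; i++) {
        (*counter)++;
    }
}
```

Timing the old and new extraction paths through the same harness, on the same input, keeps the comparison independent of library internals.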

@rovarma
Contributor Author

rovarma commented Dec 11, 2025

Sounds great! Did you change the iteration code so that it can remember the state of where it was and continue from there? If so, you might not need the callback version at all; you could conceivably offer a similar interface as dwarf_get_fde_info_for_all_regs3_b where you loop through it, though the user would need to provide storage to store the state.

I initially investigated in that direction, but I wasn't confident about which state would need to be saved between invocations to _dwarf_exec_frame_instr for it to be able to continue from a previous call.
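For illustration, a loop-style interface with caller-provided state could look roughly like this. It is a toy, not a proposal for the actual libdwarf signature: the "instruction stream" is just an array of pc advances, and `toy_cursor` and its functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not libdwarf code. The caller owns this state and
   passes it to every call, so the walk resumes where it stopped. */
struct toy_cursor {
    const unsigned *advances; /* toy instruction stream: pc advances */
    size_t          count;    /* number of instructions */
    size_t          next;     /* index of the next instruction to execute */
    unsigned long   pc;       /* pc of the most recently produced row */
    int             started;  /* nonzero once the initial row was returned */
};

void toy_cursor_init(struct toy_cursor *c, const unsigned *advances,
    size_t count, unsigned long start_pc)
{
    assert(c != NULL);
    c->advances = advances;
    c->count = count;
    c->next = 0;
    c->pc = start_pc;
    c->started = 0;
}

/* Produce the next row's pc. Returns 1 and sets *row_pc when a row was
   produced, 0 when the stream is exhausted. A full walk executes each
   instruction exactly once. */
int toy_cursor_next(struct toy_cursor *c, unsigned long *row_pc)
{
    assert(c != NULL && row_pc != NULL);
    if (!c->started) {
        c->started = 1;      /* the initial row exists before any advance */
        *row_pc = c->pc;
        return 1;
    }
    if (c->next >= c->count) {
        return 0;
    }
    c->pc += c->advances[c->next++];
    *row_pc = c->pc;
    return 1;
}
```

The hard part in the real library is deciding what belongs in that state struct, which is exactly the uncertainty described above.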

Re: test binaries, I'm not sure how big the binaries you mention are, but I can provide you with the binary with 1,548,145 rows if you'd like another test case.

@davea42
Owner

davea42 commented Dec 18, 2025

If you would like to try it, there is a new non-released libdwarf:

libdwarf-special-2.3.0.tar.gz

While it works correctly given the tests I've applied, I have been unable to verify
that it's really faster for your case of interest. But I don't have any FDEs with really long lists of rows,
and that is where the n*n cost of getting frame data hurts.
Let me know if you have a chance to try it.

The new function is dwarf_iterate_fde_all_regs3()
and the function callback type is dwarf_iterate_fde_callback_function_type
See libdwarf.h for details.

I look forward to hearing from you.
David Anderson

@davea42
Owner

davea42 commented Dec 18, 2025

Just FYI:

Running frame1.c and frame2.c (with options not in the tar.gz referenced above)
on the largest object I have, which has just 33K rows across 800 or so FDEs:

t=/var/tmp/dwtest/debugfissionb/ld-new
time -f "\t%E real,\t%U user,\t%S sys" /tmp/frame1 --skip-all-printf --just-print-selected-regs $t  >/dev/null
time -f "\t%E real,\t%U user,\t%S sys" /tmp/frame2 --skip-all-printf $t  >/dev/null
0:03.83 real,	2.33 user,	1.50 sys frame1

0:02.46 real,	2.44 user,	0.02 sys frame2

The discrepancy in sys-time has been consistent across many runs, some with
all printf output.

@davea42
Owner

davea42 commented Dec 21, 2025

I need to mention that the special tar.gz I posted above has a memory leak. oops.

Here is the correction of that:

diff --git a/src/lib/libdwarf/dwarf_frame.c b/src/lib/libdwarf/dwarf_frame.c
index 0909ccda..b51b2e32 100644
--- a/src/lib/libdwarf/dwarf_frame.c
+++ b/src/lib/libdwarf/dwarf_frame.c
@@ -875,6 +875,7 @@ dwarf_get_fde_info_for_all_regs3_b(Dwarf_Fde fde,
         output_table_real_data_size,
         error);
     if (res != DW_DLV_OK) {
+        _dwarf_empty_fde_table(&fde_table);
         return res;
     }
     /* Allocate array of internal structs to match,
@@ -905,6 +906,7 @@ dwarf_get_fde_info_for_all_regs3_b(Dwarf_Fde fde,
     }
     _dwarf_rule_copy(dbg,&fde_table, reg_table,
         &reg_table_i, output_table_real_data_size,row_pc);
+    _dwarf_empty_fde_table(&fde_table);
     free(reg_table_i.rt3_rules);
     reg_table_i.rt3_rules = 0;
     reg_table_i.rt3_reg_table_size = 0;
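The fix above releases the table on both the error path and the success path. The same rule can be sketched in a self-contained toy, here using a single cleanup label instead of a call on each path (a different style from the diff, chosen only to keep the example short). All `toy_` names are hypothetical, and the allocation counter exists only so the example can check itself for leaks.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_table {
    int    *rules;
    size_t  count;
};

static int toy_live_allocations = 0; /* leak counter for the demo */

static int toy_table_init(struct toy_table *t, size_t count)
{
    t->rules = calloc(count, sizeof *t->rules);
    if (!t->rules) {
        return 1;
    }
    t->count = count;
    toy_live_allocations++;
    return 0;
}

static void toy_table_empty(struct toy_table *t)
{
    if (t->rules) {
        free(t->rules);
        t->rules = NULL;
        toy_live_allocations--;
    }
    t->count = 0;
}

/* fail_midway simulates the error branch taken after allocation. */
int toy_process(size_t count, int fail_midway)
{
    struct toy_table table = {NULL, 0};
    int res = 0;
    assert(count > 0);
    res = toy_table_init(&table, count);
    if (res != 0) {
        return res; /* nothing was allocated, nothing to free */
    }
    if (fail_midway) {
        res = 2;
        goto cleanup;
    }
    table.rules[0] = 42; /* stand-in for the real work */
cleanup:
    toy_table_empty(&table); /* runs on success and failure alike */
    return res;
}

int toy_leak_count(void)
{
    return toy_live_allocations;
}
```

Whichever style is used, the check is the same: every exit taken after the allocation has to pass through the release call.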
