Use a left join when adding splining stats to avoid row deletion by tobyallwood · Pull Request #1310 · AFM-SPM/TopoStats

tobyallwood · 2026-03-05T13:47:10Z

When splining is enabled and some grains do not have molecules these grains would be omitted from the grain_stats `.csv` in error.

This was because grain_stats_additions was merged with grain_stats_all with an inner join with the keys image and grain_number. If a grain did not have a molecule then the image and grain_number keys did not exist in grain_stats_additions causing that row to be removed from the merged DataFrame. Switching this join to a left join rather than an inner solves this issue.
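For illustration, a minimal sketch of the difference between the two joins. The DataFrame contents and the `area` column are made up for the example; only the key names `image` and `grain_number` and the frame names come from the description above.

```python
import pandas as pd

# Hypothetical per-grain statistics: three grains in one image.
grain_stats_all = pd.DataFrame(
    {
        "image": ["img_a", "img_a", "img_a"],
        "grain_number": [0, 1, 2],
        "area": [10.0, 12.5, 8.0],
    }
)

# Hypothetical splining additions: grain 1 has no molecule, so it has no row here.
grain_stats_additions = pd.DataFrame(
    {
        "image": ["img_a", "img_a"],
        "grain_number": [0, 2],
        "total_contour_length": [55.0, 41.0],
    }
)

keys = ["image", "grain_number"]

# Inner join: only keys present in both frames survive, so grain 1 is dropped.
inner = grain_stats_all.merge(grain_stats_additions, on=keys, how="inner")

# Left join: every row of grain_stats_all is kept; missing additions become NaN.
left = grain_stats_all.merge(grain_stats_additions, on=keys, how="left")

print(inner["grain_number"].tolist())  # [0, 2]
print(left["grain_number"].tolist())  # [0, 1, 2]
print(left["total_contour_length"].isna().tolist())  # [False, True, False]
```

With the left join the grain without a molecule stays in the output and its splining columns are simply empty.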

I also discovered an issue in this process where not running splining causes a `KeyError`, which in turn means `grain_stats_additions` never gets defined (its existence is required further down the process).

This was because the grain's molecule_stats_all entry did not contain the keys contour_length or end_to_end_distance even though their existence is then assumed by run_modules.py process(). I have added a catch for KeyErrors which sets grain_stats_additions to None which lets the program run smoothly after this point.

As far as I can see these changes won't cause any problems, the program runs as expected whether splining is turned on or not and all existing tests pass, but if anyone can see potential issues with these changes please let me know.

Side note:
In grain_statistics.csv when splining is enabled and some grains don't have molecules the total_contour_length and mean_end_to_end_distance columns exist, grains with molecules have these fields filled and grains without molecules just have these fields empty. Is this the right way to go about it or would we specifically want to assign some sort of n/a indicator to these empty fields to avoid user confusion?
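If an explicit marker were wanted, one possible option (a sketch only, not what the PR does) is the `na_rep` argument of pandas' `DataFrame.to_csv()`. The data below is invented for the example.

```python
import io

import pandas as pd

# Hypothetical merged statistics: grain 1 has no molecule, hence NaN.
stats = pd.DataFrame(
    {
        "grain_number": [0, 1],
        "total_contour_length": [55.0, float("nan")],
    }
)

# Default behaviour: missing values are written as empty fields.
default_csv = io.StringIO()
stats.to_csv(default_csv, index=False)

# With na_rep the same fields carry an explicit marker instead.
marked_csv = io.StringIO()
stats.to_csv(marked_csv, index=False, na_rep="NA")

print(default_csv.getvalue().splitlines()[2])  # 1,
print(marked_csv.getvalue().splitlines()[2])  # 1,NA
```

By default `pandas.read_csv()` parses both an empty field and the string `NA` as missing, so either form reads back the same way in pandas. The marker mainly helps people opening the file by hand.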

Before submitting a Pull Request please check the following.

Existing tests pass.
Pre-commit checks pass.

ns-rse

Minor suggestion to log an error message so that users know what has happened (splining either hasn't run or failed).

Left join makes sense 👍

It probably hasn't cropped up before because in the tests we don't have that situation. In such cases what we should do is construct a test that covers such a scenario; however, because the code is wrapped inside the process() function, that is a little tricky. It looks like I left a note about a common pattern of combining dataframes and to investigate abstracting this out to a function of its own, but alas I don't have time to work on that.

ns-rse · 2026-03-06T16:05:33Z

+    # Set additions to none if splining was not run
+    except KeyError:
+        grain_stats_additions = None


Perhaps include an error in the logs to indicate what has happened. Not sure what the message should be as I can't think which key is missing. Quite possibly coming from within Pandas during one of the .groupby() methods but would be indicative of splining not having been run.

Suggested change

-    # Set additions to none if splining was not run
-    except KeyError:
-        grain_stats_additions = None
+    # Set additions to none if splining was not run
+    except KeyError:
+        LOGGER.error("<some message>")
+        grain_stats_additions = None

Would it be more suitable to log a warning rather than an error, seeing as the program will continue and valid inputs can cause this exception (i.e. splining turned off in config)? The specific KeyError comes from the 'contour_length' and 'end_to_end_distance' keys not existing, both of which are defined only when splining is run.

I also feel like rather than relying on a KeyError catch a specific check could be added to where it attempts to access contour_length and end_to_end_distance and the contents of the except block can just be an else statement (including a logged warning still).

LOGGER.warning() would be fine, as long as we tell users what has happened so they can understand why the output is as it is.

As for `if: ... else: ...` vs `try: ... except: ...` to check if keys exist, well it depends.

Try-Except vs. If-Else in Python: What’s the Difference? | by Radithya Zuhayr Fasya | Medium

try-except vs If in Python - GeeksforGeeks

The second has some timings and explains why it takes longer when an exception is raised. How often these exceptions are likely to be raised might therefore be worth considering.
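A small sketch for measuring this locally. The function names and the example dictionary are hypothetical, and the timings depend on the machine and Python version, so no figures are quoted here.

```python
import timeit

# Hypothetical molecule entry produced when splining has not run.
stats_without_splining = {"grain_number": 0}


def contour_length_try(stats):
    """Return the contour length, or None, using try/except."""
    try:
        return stats["contour_length"]
    except KeyError:
        return None


def contour_length_if(stats):
    """Return the contour length, or None, using an explicit membership check."""
    if "contour_length" in stats:
        return stats["contour_length"]
    return None


if __name__ == "__main__":
    # Time the missing-key path, which is the case that raises in the try version.
    for func in (contour_length_try, contour_length_if):
        seconds = timeit.timeit(lambda: func(stats_without_splining), number=100_000)
        print(f"{func.__name__}: {seconds:.4f} s for 100,000 missing-key lookups")
```

Both functions return the same result. Only the cost of the missing-key path differs, which is why the expected frequency of that path matters.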

Here though there are two exceptions being caught so you would have to construct if: ... else: ... to cover both of those.

tobyallwood · 2026-03-10T15:21:43Z

I've added a couple of checks to ensure no uncaught errors occur through the process, including automatically turning off curvature if it's enabled and splining is not. (However, curvature is disabled from the splining method so that it can still run independently without automatically getting disabled every time).

Also added an if statement in io.py extract_height_profiles() to ensure the grain_crops dict has values before trying to loop through them. (This issue came up when testing my fixes in this commit)

As far as the if vs try issue went, I decided that the else case would be common enough to justify using an if statement. This is because the else block will run whenever splining is disabled. However, I kept the previously implemented try block to handle the possible ValueError from no molecules being available, as this should not usually happen.
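A rough sketch of that split, not the code in `run_modules.py`: an explicit check for the common "splining disabled" case and a `try` block for the rarer "no molecules" case. The helper name, its input shape (a list of per-molecule dictionaries) and the log messages are assumptions for illustration.

```python
import logging
import statistics

LOGGER = logging.getLogger(__name__)

SPLINING_KEYS = ("contour_length", "end_to_end_distance")


def summarise_molecules(molecule_stats):
    """Hypothetical helper: summarise splining stats for one grain, or return None."""
    # Common case when splining is disabled: the keys were never created.
    if not all(key in stats for stats in molecule_stats for key in SPLINING_KEYS):
        LOGGER.warning("Splining statistics not found; was splining run?")
        return None
    try:
        # statistics.mean raises StatisticsError (a ValueError subclass) on empty input.
        mean_distance = statistics.mean(
            [stats["end_to_end_distance"] for stats in molecule_stats]
        )
    except ValueError:
        LOGGER.warning("No molecules available to summarise.")
        return None
    total_length = sum(stats["contour_length"] for stats in molecule_stats)
    return {
        "total_contour_length": total_length,
        "mean_end_to_end_distance": mean_distance,
    }
```

Either way the caller receives `None` and a logged warning rather than an undefined variable.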

I have not looked into making a function for combining DataFrames except for a quick sweep over the places it'd be implemented. While there are obvious similarities in each section, there are a number of small differences from case to case which could potentially make the function quite messy with parameters and checks to ensure correct function, but it could still be worth creating and using just for simplifying creating the missing test cases.

ns-rse

Good catch on needing to have topostats_object.grain_crops, @tobyallwood. It jogged my memory that it is possible for a GrainCrop to not have a height_profile attribute itself, so I've suggested a check on each of those.
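The two guards could look roughly like this. `CropSketch` and `collect_height_profiles()` are hypothetical stand-ins written for this example, not the TopoStats classes or the body of `extract_height_profiles()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CropSketch:
    """Hypothetical stand-in for a grain crop; not the TopoStats class."""

    height_profile: Optional[list] = None


def collect_height_profiles(grain_crops):
    """Gather height profiles, tolerating missing crops and missing profiles."""
    profiles = {}
    # Covers both None and an empty dict, so the loop below is skipped safely.
    if not grain_crops:
        return profiles
    for grain_number, crop in grain_crops.items():
        # getattr with a default also covers objects lacking the attribute entirely.
        profile = getattr(crop, "height_profile", None)
        if profile is not None:
            profiles[grain_number] = profile
    return profiles
```

For example, `collect_height_profiles({0: CropSketch([1.0, 2.0]), 1: CropSketch()})` keeps only grain 0.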

With regards to running, or not, curvature if splining isn't run the additions are great but there is also the processing.check_run_steps() function and associated tests where this logic could perhaps be incorporated.

For each step that is requested a check is made that all preceding steps that are required are also configured to be run. This check of the configuration, which is performed early before processing begins, means that users won't have to wait until the end to find that the configuration doesn't include what they wanted and then have to run everything a second time after correcting the configuration.

It's structured in reverse order, so adding curvature should come before the check for splining (although this isn't essential really), and adding a test to ensure it does what it should would of course be sensible.
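As a sketch of the idea of validating the configuration before any processing starts: the dependency table, function name and return value below are assumptions for illustration and do not reproduce `processing.check_run_steps()`.

```python
import logging

LOGGER = logging.getLogger(__name__)

# Hypothetical dependency table: each step lists the steps it needs enabled.
REQUIRES = {
    "curvature": ["splining"],
    "splining": ["ordered_tracing"],
    "ordered_tracing": ["disordered_tracing"],
}


def missing_prerequisites(run_flags):
    """Return (step, prerequisite) pairs where an enabled step lacks a prerequisite."""
    problems = []
    for step, needed in REQUIRES.items():
        # Steps that are not requested impose no requirements.
        if not run_flags.get(step, False):
            continue
        for prerequisite in needed:
            if not run_flags.get(prerequisite, False):
                LOGGER.error(
                    "%s enabled but %s disabled. Please check your configuration file.",
                    step,
                    prerequisite,
                )
                problems.append((step, prerequisite))
    return problems
```

Calling this once at start-up, for instance `missing_prerequisites({"curvature": True, "splining": False})`, surfaces the mismatch before any image is processed.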

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

tobyallwood · 2026-03-24T12:10:04Z

Am I right in thinking the failing pre-commit test is nothing to do with this PR? If so I guess we can consider all checks passed

ns-rse · 2026-03-24T12:12:59Z

> Am I right in thinking the failing pre-commit test is nothing to do with this PR? If so I guess we can consider all checks passed

Fixed in #1315

@tobyallwood has implemented some changes to ensure output is consistent when some grains don't have molecules. Similar problems can arise if curvature doesn't run, perhaps because splining has not been configured to run. We have the `processing.check_run_steps()` function, which checks that the configuration options for which steps to run are consistent. It should help capture some of these problems prior to processing, saving users from running analyses and not getting the results they expected. However, this function did not have a `curvature_run` parameter with which to check that all steps required to process images are enabled when curvature is requested (`if curvature_run:`). This commit/pull request adds that functionality and a couple of basic tests.

ns-rse

I still think it would be prudent to add the necessary logic to processing.check_run_steps() to capture whether options mean this is going to run in the first place. I've addressed this in #1317.

feature(processing): Adds curvature checks to run_check_steps()

ubdbra001

Just one question about the code, other than that this looks good to me.

ubdbra001 · 2026-04-01T12:54:18Z

+        if splining_run is False:
+            LOGGER.error("Curvature enabled but Splining disabled. Please check your configuration file.")
+        if ordered_tracing_run is False:
+            LOGGER.error("Curvature enabled but Ordered Tracing disabled. Please check your configuration file.")
+        if nodestats_run is False:
+            LOGGER.error("Curvature enabled but NodeStats disabled. Tracing will use the 'old' method.")
+        if disordered_tracing_run is False:
+            LOGGER.error("Curvature enabled but Disordered Tracing disabled. Please check your configuration file.")
+        elif grainstats_run is False:
+            LOGGER.error("Curvature enabled but Grainstats disabled. Please check your configuration file.")
+        elif grains_run is False:
+            LOGGER.error("Curvature enabled but Grains disabled. Please check your configuration file.")
+        elif filter_run is False:
+            LOGGER.error("Curvature enabled but Filters disabled. Please check your configuration file.")
+        else:
+            LOGGER.info("Configuration run options are consistent, processing can proceed.")


Out of curiosity, why the switch from multiple ifs to elifs in this block?

This is based on a quick review of the code as I've not looked at these sections since #1317, but I think this is why...

Using elif means only the first error in the configuration would be reported, but there are instances where it is useful for users to know if there is more than one processing stage which is incorrectly disabled.

This can be seen if you switch everything to `elif:`, which results in the second parametrised test (lines 440-452 of tests/test_processing.py) and some others failing, as it's checking whether grainstats being disabled is captured (logging will also have noted that nodestats is not enabled).
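A stripped-down illustration of the difference, using two made-up checks that collect messages in a list rather than logging them:

```python
def report_all(splining_run, nodestats_run):
    """Independent `if` checks: every disabled stage is reported."""
    messages = []
    if splining_run is False:
        messages.append("Splining disabled")
    if nodestats_run is False:
        messages.append("NodeStats disabled")
    return messages


def report_first(splining_run, nodestats_run):
    """`if`/`elif` chain: evaluation stops at the first disabled stage."""
    messages = []
    if splining_run is False:
        messages.append("Splining disabled")
    elif nodestats_run is False:
        messages.append("NodeStats disabled")
    return messages


print(report_all(False, False))  # ['Splining disabled', 'NodeStats disabled']
print(report_first(False, False))  # ['Splining disabled']
```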

Why not use `if: ...` throughout? From memory it is because there is an on-going task to allow entry points at any stage of the processing, because we save data to .topostats and can now read in files that do actually have the necessary data from earlier intermediary processing steps. That is possible thanks to the big refactoring I undertook to get the internal TopoStats and nested dataclasses consistent with the HDF5 structure. Thus the checking of the various *_run options becomes less relevant/almost redundant in this scenario. I think I got up to introducing an entry point for everything up to grainstats, and so those checks are `elif:`, but didn't complete them all because the refactoring to dataclasses was such a 🦣 task and took ages.

Once there is an entry point for each stage in processing this check_run would need reworking and only calling when running the pipeline in its entirety. It would probably be worth noting this as something to think about carefully in the future and adjust if necessary.

Switch to left join when adding splining stats

b54147f

tobyallwood requested review from SylviaWhittle and ns-rse March 5, 2026 13:47

ns-rse reviewed Mar 6, 2026

Add checks for if splining was run

cd19baf

tobyallwood requested a review from ns-rse March 10, 2026 15:03

ns-rse requested changes Mar 11, 2026

Comment thread topostats/io.py Outdated

tobyallwood and others added 2 commits March 12, 2026 13:56

Update topostats/io.py

566795e

Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>

[pre-commit.ci] Fixing issues with pre-commit

5ffaed9

tobyallwood requested a review from ns-rse March 24, 2026 12:09

ns-rse mentioned this pull request Mar 24, 2026

feature(processing): Adds curvature checks to run_check_steps() #1317

Merged

4 tasks

ns-rse reviewed Mar 24, 2026

Merge pull request #1317 from AFM-SPM/ns-rse/check-run-steps-curvature

ee8629c

feature(processing): Adds curvature checks to run_check_steps()

tobyallwood requested a review from ubdbra001 April 1, 2026 12:16

ubdbra001 approved these changes Apr 1, 2026

SylviaWhittle enabled auto-merge April 8, 2026 14:24

SylviaWhittle approved these changes Apr 15, 2026

tobyallwood wants to merge 6 commits into main from tobyallwood/missing-grain-stats
