Skip to content

Use a left join when adding splining stats to avoid row deletion#1310

Open
tobyallwood wants to merge 6 commits intomainfrom
tobyallwood/missing-grain-stats
Open

Use a left join when adding splining stats to avoid row deletion#1310
tobyallwood wants to merge 6 commits intomainfrom
tobyallwood/missing-grain-stats

Conversation

@tobyallwood
Copy link
Copy Markdown
Collaborator

@tobyallwood tobyallwood commented Mar 5, 2026

Closes #1304

When splining is enabled and some grains do not have molecules these grains would be omitted from the grain_stats .csv in error.

  • This was because grain_stats_additions was merged with grain_stats_all with an inner join with the keys image and grain_number. If a grain did not have a molecule then the image and grain_number keys did not exist in grain_stats_additions causing that row to be removed from the merged DataFrame. Switching this join to a left join rather than an inner solves this issue.

I also discovered an issue in this process where not running splining causes a KeyError which in turn means grain_stats_additions never get defined (the existance of which is required further down the process).

  • This was because the grain's molecule_stats_all entry did not contain the keys contour_length or end_to_end_distance even though their existence is then assumed by run_modules.py process(). I have added a catch for KeyErrors which sets grain_stats_additions to None which lets the program run smoothly after this point.

As far as I can see these changes won't cause any problems, the program runs as expected whether splining is turned on or not and all existing tests pass, but if anyone can see potential issues with these changes please let me know.


Side note:
In grain_statistics.csv when splining is enabled and some grains don't have molecules the total_contour_length and mean_end_to_end_distance columns exist, grains with molecules have these fields filled and grains without molecules just have these fields empty. Is this the right way to go about it or would we specifically want to assign some sort of n/a indicator to these empty fields to avoid user confusion?


Before submitting a Pull Request please check the following.

  • Existing tests pass.
  • Pre-commit checks pass.

Copy link
Copy Markdown
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion to log and error message so that users know what has happened (splining either hasn't run or failed).

Left join makes sense 👍

It probably hasn't cropped up before because in the tests we don't have that situation. In such cases what we should do is construct a test that covers such a scenario, however because the code is wrapped inside the process() function that is a little tricky. It looks like a I left a note about a common pattern of combining dataframes and to investigate abstracting this out to a function of its own, but alas don't have time to work on that.

Comment thread topostats/run_modules.py Outdated
Comment on lines +271 to +273
# Set additions to none if splining was not run
except KeyError:
grain_stats_additions = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps include an error in the logs to indicate what has happened. Not sure what the message should be as I can't think which key is missing. Quite possibly coming from within Pandas during one of the .groupby() methods but would be indicative of splining not having been run.

Suggested change
# Set additions to none if splining was not run
except KeyError:
grain_stats_additions = None
# Set additions to none if splining was not run
except KeyError:
LOGGER.error("<some message>")
grain_stats_additions = None

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more suitable to log a warning rather than an error? Seeing as the program will continue and valid inputs can cause this exception (i.e. splining turned off in config). The specific key error is 'contour_length' and 'end_to_end_distance' keys not existing, both of which are defined only when splining is run.

I also feel like rather than relying on a KeyError catch a specific check could be added to where it attempts to access contour_length and end_to_end_distance and the contents of the except block can just be an else statement (including a logged warning still).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOGGER.warning() would be fine, as long as we tell users what has happened so they can understand why the output is as it is.

As if if: ... else: ... v try: ... except: to check if keys exist, well it depends.

Second has some timings and explains why when an exception is raised it takes longer. How often are these exceptions likely to be raised might therefore be worth considering.

Here though there are two exceptions being caught so you would have to construct if: ... else: ... to cover both of those.

@tobyallwood tobyallwood requested a review from ns-rse March 10, 2026 15:03
@tobyallwood
Copy link
Copy Markdown
Collaborator Author

tobyallwood commented Mar 10, 2026

I've added a couple of checks to ensure no uncaught errors occur through the process, including automatically turning off curvature if it's enabled and splining is not. (However, curvature is disabled from the splining method so that it can still run independently without automatically getting disabled every time).

Also added an if statement in io.py extract_height_profiles() to ensure the grain_crops dict has values before trying to loop through them. (This issue came up when testing my fixes in this commit)

As far as the if vs try issue went I decided that the else case would be common enough to justify using an if statement; This is because the else block will run whenever splining is disabled. However I kept the previously implemented try block to handle the possible ValueError from no molecules being available as this should not usually happen.

I have not looked into making a function for combining DataFrames except for a quick sweep over the places it'd be implemented.. while there are obvious similarities in each section there's a number of small differences from case to case which could potentially make the function quite messy with parameters and checks to ensure correct function, but it could still be worth creating and using just for simplifying creating the missing test cases.

Copy link
Copy Markdown
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on needing to have topostats_object.grain_crops @tobyallwood jogged my memory that it is possible for a GrainCrop to not have height_profile attribute itself so I've suggested a check on each of those.

With regards to running, or not, curvature if splining isn't run the additions are great but there is also the processing.check_run_steps() function and associated tests where this logic could perhaps be incorporated.

For each step that is requested a check is made that all preceding steps that are required are also configured to be run. This check of the configuration, which is performed early before processing begins, means that users won't have to wait until the end to find that the configuration doesn't include what they wanted and then have to run everything a second time after correcting the configuration.

It's structured in reverse order, so adding curvature should come before the check for splining (although this isn't essential really), and adding a test to ensure it does what it should would of course be sensible.

Comment thread topostats/io.py Outdated
tobyallwood and others added 2 commits March 12, 2026 13:56
Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>
@tobyallwood tobyallwood requested a review from ns-rse March 24, 2026 12:09
@tobyallwood
Copy link
Copy Markdown
Collaborator Author

Am I right in thinking the failing pre-commit test is nothing to do with this PR? If so I guess we can consider all checks passed

@ns-rse
Copy link
Copy Markdown
Collaborator

ns-rse commented Mar 24, 2026

Am I right in thinking the failing pre-commit test is nothing to do with this PR? If so I guess we can consider all checks passed

Fixed in #1315

@tobyallwood has implemented some changes to ensure output is consistent when some grains don't have molecules.

Similar problems can arise if curvature doesn't run, perhaps because splining has not been configured to run. We have
`processing.check_run_steps()` function which checks that configuration options for which steps to run are consistent
and should help capture some of these problems prior to processing, saving users from running analyses and not getting
the results they expected.

Regardless this function did not have a parameter for `curvature_run` which checks that all steps require to process
images are enabled `if curvature_run:`.

This commit/pull request adds that functionality and a couple of basic tests.
Copy link
Copy Markdown
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be prudent to add the necessary logic to processing.check_run_steps() to capture whether options mean this is going to run in the first place. I've addressed in #1317.

feature(processing): Adds curvature checks to run_check_steps()
@tobyallwood tobyallwood requested a review from ubdbra001 April 1, 2026 12:16
Copy link
Copy Markdown
Collaborator

@ubdbra001 ubdbra001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question about the code, other than that this looks good to me.

Comment thread topostats/processing.py
Comment on lines +1543 to +1558
if splining_run is False:
LOGGER.error("Curvature enabled but Splining disabled. Please check your configuration file.")
if ordered_tracing_run is False:
LOGGER.error("Curvature enabled but Ordered Tracing disabled. Please check your configuration file.")
if nodestats_run is False:
LOGGER.error("Curvature enabled but NodeStats disabled. Tracing will use the 'old' method.")
if disordered_tracing_run is False:
LOGGER.error("Curvature enabled but Disordered Tracing disabled. Please check your configuration file.")
elif grainstats_run is False:
LOGGER.error("Curvature enabled but Grainstats disabled. Please check your configuration file.")
elif grains_run is False:
LOGGER.error("Curvature enabled but Grains disabled. Please check your configuration file.")
elif filter_run is False:
LOGGER.error("Curvature enabled but Filters disabled. Please check your configuration file.")
else:
LOGGER.info("Configuration run options are consistent, processing can proceed.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why the switch from multiple ifs to elifs in this block?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on a quick review of the code as I've not looked at these sections since #1317, but I think this is why...

Using elif means only the first error in the configuration would be reported, but there are instances where it is useful for users to know if there is more than one processing stage which is incorrectly disabled.

This can be seen if you switch everything to elif: for everything which results in the second paramterised test (lines 440-452 tests/test_processing.py) and some others failing as its checking whether grainstats being disabled is captured (logging will also have noted that nodestats is not enabled).

Why not use if:... throughout? From memory it is because there is an on-going task to allow entry points at any stage of the processing because we save data to .topostats and can now, thanks to the big-refactoring I undertook to get the internal TopoStats and nested dataclasses consistent with the HDF5 structure, read in files that do actually have the necessary data from earlier intermediary processing steps. Thus the checking of the various *_run options becomes less relevant/almost redundant in this scenario. I think I got upto introducing an entry point for everything upto grainstats and so those checks are elif: but didn't complete them all because the refactoring to dataclasses was such a 🦣 task and took ages.

Once there is an entry point for each stage in processing this check_run would need reworking and only calling when running the pipeline in its entirety. It would probably be worth noting this as something to think about carefully in the future and adjust if necessary.

@SylviaWhittle SylviaWhittle enabled auto-merge April 8, 2026 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: Grain_statistics.csv doesn't populate grains that have no spline

4 participants