Use a left join when adding splining stats to avoid row deletion#1310
tobyallwood wants to merge 6 commits into main
Conversation
ns-rse
left a comment
Minor suggestion to add a log/error message so that users know what has happened (splining either hasn't run or has failed).
Left join makes sense 👍
It probably hasn't cropped up before because the tests don't cover that situation. In such cases what we should do is construct a test that covers the scenario, however because the code is wrapped inside the process() function that is a little tricky. It looks like I left a note about a common pattern of combining dataframes and to investigate abstracting this out to a function of its own, but alas I don't have time to work on that.
```python
# Set additions to none if splining was not run
except KeyError:
    grain_stats_additions = None
```
Perhaps include an error in the logs to indicate what has happened. Not sure what the message should be as I can't think which key is missing. Quite possibly coming from within Pandas during one of the .groupby() methods but would be indicative of splining not having been run.
```diff
  # Set additions to none if splining was not run
  except KeyError:
+     LOGGER.error("<some message>")
      grain_stats_additions = None
```
Would it be more suitable to log a warning rather than an error? Seeing as the program will continue and valid inputs can cause this exception (i.e. splining turned off in the config). The specific `KeyError` comes from the `contour_length` and `end_to_end_distance` keys not existing, both of which are defined only when splining is run.
I also feel that, rather than relying on catching a KeyError, a specific check could be added where it attempts to access `contour_length` and `end_to_end_distance`, and the contents of the except block could become an else statement (still including a logged warning).
LOGGER.warning() would be fine, as long as we tell users what has happened so they can understand why the output is as it is.
As for if: ... else: ... vs try: ... except: to check if keys exist, well it depends.
- Try-Except vs. If-Else in Python: What’s the Difference? | by Radithya Zuhayr Fasya | Medium
- try-except vs If in Python - GeeksforGeeks
The second has some timings and explains why, when an exception is raised, it takes longer. How often these exceptions are likely to be raised might therefore be worth considering.
Here though there are two exceptions being caught so you would have to construct if: ... else: ... to cover both of those.
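For illustration, the two styles side by side (the dictionary contents are made up; both keys must be covered either way):

```python
stats = {"image": "img1"}  # splining not run: no contour_length / end_to_end_distance

# EAFP (try/except): cheap when the keys exist, slower when the exception is raised
try:
    additions = (stats["contour_length"], stats["end_to_end_distance"])
except KeyError:
    additions = None

# LBYL (if/else): must explicitly check both keys that could raise
if "contour_length" in stats and "end_to_end_distance" in stats:
    additions_lbyl = (stats["contour_length"], stats["end_to_end_distance"])
else:
    additions_lbyl = None

assert additions == additions_lbyl  # both styles agree on the result
```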
I've added a couple of checks to ensure no uncaught errors occur through the process, including automatically turning off curvature if it's enabled and splining is not. (However, curvature is disabled from the splining method so that it can still run independently without automatically getting disabled every time.) I've also added an if statement where `contour_length` and `end_to_end_distance` are accessed, as suggested. I have not yet looked into abstracting the combining of dataframes into a function of its own.
ns-rse
left a comment
Good catch on needing to have `topostats_object.grain_crops`. @tobyallwood jogged my memory that it is possible for a `GrainCrop` to not have a `height_profile` attribute itself, so I've suggested a check on each of those.
With regards to running, or not running, curvature when splining isn't run: the additions are great, but there is also the `processing.check_run_steps()` function and associated tests where this logic could perhaps be incorporated.
For each step that is requested a check is made that all preceding steps that are required are also configured to be run. This check of the configuration, which is performed early before processing begins, means that users won't have to wait until the end to find that the configuration doesn't include what they wanted and then have to run everything a second time after correcting the configuration.
It's structured in reverse order, so adding curvature should come before the check for splining (although this isn't essential really), and adding a test to ensure it does what it should would of course be sensible.
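A simplified sketch of how such a reverse-ordered dependency check might look (the real `processing.check_run_steps()` takes more parameters and logs its findings; this sketch returns the problems so they are easy to inspect, and the step names are illustrative):

```python
def check_run_steps(filter_run: bool, grains_run: bool, splining_run: bool, curvature_run: bool) -> list[str]:
    """Collect configuration problems, checking the latest step (curvature) first."""
    problems = []
    # Each step requires all preceding steps; checked in reverse order
    if curvature_run and not splining_run:
        problems.append("Curvature enabled but Splining disabled.")
    if splining_run and not grains_run:
        problems.append("Splining enabled but Grains disabled.")
    if grains_run and not filter_run:
        problems.append("Grains enabled but Filters disabled.")
    return problems
```

Performing this check before processing begins means a misconfiguration is reported immediately rather than after a long run.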
Co-authored-by: Neil Shephard <n.shephard@sheffield.ac.uk>
Am I right in thinking the failing pre-commit test is nothing to do with this PR? If so I guess we can consider all checks passed.
Fixed in #1315
@tobyallwood has implemented some changes to ensure output is consistent when some grains don't have molecules. Similar problems can arise if curvature doesn't run, perhaps because splining has not been configured to run. We have the `processing.check_run_steps()` function which checks that the configuration options for which steps to run are consistent, and this should help capture some of these problems prior to processing, saving users from running analyses and not getting the results they expected. However, this function did not have a `curvature_run` parameter, nor an `if curvature_run:` branch checking that all steps required before curvature are enabled. This commit/pull request adds that functionality and a couple of basic tests.
ns-rse
left a comment
I still think it would be prudent to add the necessary logic to processing.check_run_steps() to capture whether the options mean this is going to run in the first place. I've addressed this in #1317.
feature(processing): Adds curvature checks to check_run_steps()
ubdbra001
left a comment
Just one question about the code, other than that this looks good to me.
```python
if splining_run is False:
    LOGGER.error("Curvature enabled but Splining disabled. Please check your configuration file.")
if ordered_tracing_run is False:
    LOGGER.error("Curvature enabled but Ordered Tracing disabled. Please check your configuration file.")
if nodestats_run is False:
    LOGGER.error("Curvature enabled but NodeStats disabled. Tracing will use the 'old' method.")
if disordered_tracing_run is False:
    LOGGER.error("Curvature enabled but Disordered Tracing disabled. Please check your configuration file.")
elif grainstats_run is False:
    LOGGER.error("Curvature enabled but Grainstats disabled. Please check your configuration file.")
elif grains_run is False:
    LOGGER.error("Curvature enabled but Grains disabled. Please check your configuration file.")
elif filter_run is False:
    LOGGER.error("Curvature enabled but Filters disabled. Please check your configuration file.")
else:
    LOGGER.info("Configuration run options are consistent, processing can proceed.")
```
Out of curiosity, why the switch from multiple ifs to elifs in this block?
This is based on a quick review of the code as I've not looked at these sections since #1317, but I think this is why...
Using elif means only the first error in the configuration would be reported, but there are instances where it is useful for users to know if there is more than one processing stage which is incorrectly disabled.
This can be seen if you switch to elif: for everything, which results in the second parameterised test (lines 440-452 of tests/test_processing.py) and some others failing, as it checks whether grainstats being disabled is captured (logging will also have noted that nodestats is not enabled).
Why not use if: ... throughout? From memory it is because there is an on-going task to allow entry points at any stage of the processing. Because we save data to .topostats we can now, thanks to the big refactoring I undertook to get the internal TopoStats and nested dataclasses consistent with the HDF5 structure, read in files that do actually have the necessary data from earlier intermediary processing steps. Thus the checking of the various *_run options becomes less relevant/almost redundant in this scenario. I think I got up to introducing an entry point for everything up to grainstats, and so those checks are elif:, but didn't complete them all because the refactoring to dataclasses was such a 🦣 task and took ages.
Once there is an entry point for each stage in processing, this `check_run_steps()` check would need reworking and should only be called when running the pipeline in its entirety. It would probably be worth noting this as something to think about carefully in the future and adjusting if necessary.
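The behavioural difference between the two styles can be shown with a small sketch (illustrative flags and messages, not the real function):

```python
def report_all(nodestats_run: bool, grainstats_run: bool) -> list[str]:
    """Independent ifs: every misconfigured step is reported."""
    errors = []
    if not nodestats_run:
        errors.append("NodeStats disabled")
    if not grainstats_run:
        errors.append("Grainstats disabled")
    return errors


def report_first(nodestats_run: bool, grainstats_run: bool) -> list[str]:
    """elif chain: only the first misconfigured step is reported."""
    if not nodestats_run:
        return ["NodeStats disabled"]
    elif not grainstats_run:
        return ["Grainstats disabled"]
    return []


# With both steps disabled, only the if-based version reports grainstats too
assert report_all(False, False) == ["NodeStats disabled", "Grainstats disabled"]
assert report_first(False, False) == ["NodeStats disabled"]
```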
Closes #1304
When splining is enabled and some grains do not have molecules, these grains would be omitted from the grain stats `.csv` in error. `grain_stats_additions` was merged with `grain_stats_all` using an inner join on the keys `image` and `grain_number`. If a grain did not have a molecule then the `image` and `grain_number` keys did not exist in `grain_stats_additions`, causing that row to be removed from the merged DataFrame. Switching this join to a left join rather than an inner join solves this issue.

I also discovered an issue in this process where not running splining causes a `KeyError`, which in turn means `grain_stats_additions` never gets defined (the existence of which is required further down the process). The `molecule_stats_all` entry did not contain the keys `contour_length` or `end_to_end_distance` even though their existence is then assumed by `run_modules.py process()`. I have added a catch for `KeyError`s which sets `grain_stats_additions` to None, which lets the program run smoothly after this point.

As far as I can see these changes won't cause any problems; the program runs as expected whether splining is turned on or not and all existing tests pass, but if anyone can see potential issues with these changes please let me know.
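The join behaviour can be demonstrated with a small pandas example (column names follow the description above; the data is made up):

```python
import pandas as pd

grain_stats_all = pd.DataFrame(
    {"image": ["img1", "img1"], "grain_number": [0, 1], "area": [10.0, 12.0]}
)
# Grain 1 has no molecule, so it is absent from the splining additions
grain_stats_additions = pd.DataFrame(
    {"image": ["img1"], "grain_number": [0], "total_contour_length": [5.0]}
)

# Inner join: rows without a match in grain_stats_additions are dropped
inner = grain_stats_all.merge(grain_stats_additions, on=["image", "grain_number"], how="inner")
# Left join: all rows of grain_stats_all are kept, splining columns become NaN
left = grain_stats_all.merge(grain_stats_additions, on=["image", "grain_number"], how="left")

print(len(inner))  # 1 - grain 1 dropped
print(len(left))   # 2 - grain 1 kept with NaN splining stats
```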
Side note:
In `grain_statistics.csv`, when splining is enabled and some grains don't have molecules, the `total_contour_length` and `mean_end_to_end_distance` columns exist; grains with molecules have these fields filled and grains without molecules just have them empty. Is this the right way to go about it, or would we specifically want to assign some sort of `n/a` indicator to these empty fields to avoid user confusion?
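If an explicit indicator were preferred, pandas can substitute one for missing values at write time via `to_csv(na_rep=...)` (a sketch; whether `n/a` is the right sentinel is the open question above):

```python
import io

import pandas as pd

# Grain 1 has no molecule, so its splining stat is missing
df = pd.DataFrame({"grain_number": [0, 1], "total_contour_length": [5.0, float("nan")]})

buffer = io.StringIO()
# na_rep writes the chosen indicator in place of NaN, leaving the DataFrame untouched
df.to_csv(buffer, index=False, na_rep="n/a")
print(buffer.getvalue())
```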