CORE-000783 appears to be holding up a validation for a stakeholder. The logs indicate the cause is a very large SUPP-- dataset.
"
Since multiple processes are running simultaneously, in the logs the completion does not always immediately follow the start; log records from other rules can appear in between.
Validation runs up to 99%. After that I compared:
all rules that should run
all rules for which validation was started
all rules for which validation was completed
Conclusion: all rules are started, but one rule gets stuck, namely CORE-000783
(Raise an error when SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM is equal to a variable name defined in the corresponding SDTM version.)
If I start the validation without this rule, the validation does complete successfully.
Check:
  all:
    - name: QNAM
      operator: is_contained_by
      value: $model_variables
Core:
  Id: CORE-000783
  Status: Published
  Version: "1"
Description: Raise an error when SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM is equal to a variable name defined in the corresponding SDTM version.
Executability: Fully Executable
Operations:
  - id: $model_variables
    operator: get_parent_model_column_order
Outcome:
  Message: SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM equals a variable name defined in the corresponding SDTM version.
  Output Variables:
    - QNAM
    - $model_variables
Rule Type: Record Data
Scope:
If I remove SUPPLB, or if I ensure that SUPPLB contains only 10 records, validation also completes successfully.
I suspect this is related to the more than 64,000 records in SUPPLB → it probably performs the check for each QNAM, so I suspect that a distinct on QNAM should be applied first.
"
Pilot SUPPLB data: https://github.com/cdisc-org/sdtm-adam-pilot-project/tree/master/updated-pilot-submission-package/900172/m5/datasets/cdiscpilot01/tabulations/sdtm
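The "distinct on QNAM first" idea from the quoted report could look roughly like the sketch below. It is not the engine's code: `flag_qnam_in_model` is a hypothetical helper, and it assumes the model variable names are available as a plain list. Membership is evaluated once per distinct QNAM value and the verdict is then mapped back onto every record.

```python
import pandas as pd


def flag_qnam_in_model(supp: pd.DataFrame, model_variables: list) -> pd.Series:
    """Hypothetical helper: True where QNAM equals a model variable name."""
    model_set = set(model_variables)
    # Evaluate membership once per distinct, non-null QNAM value.
    distinct_qnams = supp["QNAM"].dropna().unique()
    verdict = {qnam: qnam in model_set for qnam in distinct_qnams}
    # Broadcast the per-value verdict back to every record; nulls map to False.
    return supp["QNAM"].map(lambda qnam: verdict.get(qnam, False)).astype(bool)


supp = pd.DataFrame({"QNAM": ["LBSEQ", "XYZ", "LBSEQ", None]})
flags = flag_qnam_in_model(supp, ["LBSEQ", "LBTESTCD"])
```

With 64,000 records but only a handful of distinct QNAM values, the membership test itself runs a handful of times. Whether that is where the time actually goes still needs to be measured.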
We will need to find the bottleneck.
I believe it is the is_contained_by operator calling iloc() inside a for loop: with about 64k rows, the per-call indexing overhead adds up. We could convert the columns to lists once via .tolist() and use enumerate to iterate over plain Python values, which pays the conversion cost upfront and removes the repeated pandas indexing. This assumes the loop is the bottleneck, which is not yet confirmed, but it does appear to be the most memory-intensive step.
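To make the proposal concrete, here is a minimal sketch of the two loop shapes. Both functions are hypothetical stand-ins, not the actual is_contained_by implementation, and the sketch assumes the comparator column holds a list of variable names in each row.

```python
import pandas as pd


def contained_by_iloc(df: pd.DataFrame, target: str, comparator: str) -> pd.Series:
    """Suspected current pattern: positional pandas indexing on every iteration."""
    results = []
    for i in range(len(df)):
        # Each .iloc[i] call goes through pandas indexing machinery.
        results.append(df[target].iloc[i] in df[comparator].iloc[i])
    return pd.Series(results, index=df.index, dtype=bool)


def contained_by_lists(df: pd.DataFrame, target: str, comparator: str) -> pd.Series:
    """Proposed pattern: convert once with .tolist(), then loop over plain values."""
    values = df[target].tolist()
    candidates = df[comparator].tolist()
    results = [value in candidates[i] for i, value in enumerate(values)]
    return pd.Series(results, index=df.index, dtype=bool)


# Small illustrative frame; the column contents are made up for the example.
df = pd.DataFrame(
    {
        "QNAM": ["LBSEQ", "XYZ"],
        "$model_variables": [["LBSEQ", "LBTEST"], ["LBSEQ", "LBTEST"]],
    }
)
slow = contained_by_iloc(df, "QNAM", "$model_variables")
fast = contained_by_lists(df, "QNAM", "$model_variables")
```

Both versions return the same result, so timing them against the pilot SUPPLB data (for example with timeit) would show whether the indexing overhead accounts for the hang before the operator is changed.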