CORE-000783

CORE-000783 appears to be holding up a validation for a stakeholder. The logs made it clear that this is because they have a very large SUPP-- dataset.
"
Since multiple processes are running simultaneously, in the logs the completion does not always immediately follow the start; log records from other rules can appear in between.

Validation runs up to 99%. After that I compared:
- all rules that should run
- all rules for which validation was started
- all rules for which validation was completed
Conclusion: all rules are started, but one rule gets stuck, namely CORE-000783
(Raise an error when SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM is equal to a variable name defined in the corresponding SDTM version.)
If I start the validation without this rule, the validation does complete successfully.

Check:
  all:
    - name: QNAM
      operator: is_contained_by
      value: $model_variables

Core:
  Id: CORE-000783
  Status: Published
  Version: "1"
Description: Raise an error when SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM is equal to a variable name defined in the corresponding SDTM version.
Executability: Fully Executable
Operations:
  - id: $model_variables
    operator: get_parent_model_column_order

Outcome:
  Message: SUPP--.QNAM is present in the dataset, but the value of SUPP--.QNAM equals a variable name defined in the corresponding SDTM version.
  Output Variables:
    - QNAM
    - $model_variables
Rule Type: Record Data
Scope:

If I remove SUPPLB, or if I ensure that SUPPLB contains only 10 records, validation also completes successfully.
I suspect this is related to the more than 64,000 records in SUPPLB → it probably performs the check for each QNAM, so I suspect that a distinct on QNAM should be applied first.
"
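
The "distinct on QNAM first" idea from the report can be sketched as below. This is only an illustration under stated assumptions, not the engine's code: `flag_qnam_in_model` is a hypothetical helper, and the QNAM values and model variable list are made-up sample data. The membership check runs once per distinct QNAM, and the result is mapped back onto every record.

```python
import pandas as pd


def flag_qnam_in_model(qnam: pd.Series, model_variables: list) -> pd.Series:
    """Hypothetical helper: check each distinct QNAM once, then map back to all rows."""
    model_set = set(model_variables)
    # One membership test per distinct value instead of one per record.
    verdict_by_value = {value: value in model_set for value in qnam.unique()}
    return qnam.map(verdict_by_value)


# Illustrative data only; the real SUPPLB has more than 64,000 records.
supplb = pd.DataFrame({"QNAM": ["LBSEQ", "LBCLSIG", "LBSEQ", "LBCLSIG"]})
model_variables = ["STUDYID", "DOMAIN", "USUBJID", "LBSEQ"]

flags = flag_qnam_in_model(supplb["QNAM"], model_variables)
print(flags.tolist())
```

With only a handful of distinct QNAM values in a typical SUPP-- dataset, the number of membership tests no longer grows with the record count. Whether this matches how the operator receives `$model_variables` would need to be checked against the engine.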

Pilot SUPPLB data: https://github.com/cdisc-org/sdtm-adam-pilot-project/tree/master/updated-pilot-submission-package/900172/m5/datasets/cdiscpilot01/tabulations/sdtm


We will need to find the bottleneck.
I believe it is the is_contained_by operator calling iloc() in a for loop: at 64k records, the per-call indexing overhead adds up. We could convert the columns to lists and use enumerate to iterate over plain Python values instead. This would eliminate the repeated pandas indexing overhead by paying the conversion cost once upfront via .tolist(). (This assumes the operator is the bottleneck, which still needs to be confirmed, but it does appear to be the most memory-intensive step.)
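
A rough way to test this hypothesis before touching the operator is sketched below. It is not the actual is_contained_by implementation: `contained_iloc` and `contained_tolist` are hypothetical stand-ins for the two loop styles, and the data is synthetic (64,000 rows, with the same short variable list repeated in every row of the comparison column).

```python
import time

import pandas as pd


def contained_iloc(target: pd.Series, comparator: pd.Series) -> list:
    """Per-row membership check using positional pandas indexing on every iteration."""
    return [target.iloc[i] in comparator.iloc[i] for i in range(len(target))]


def contained_tolist(target: pd.Series, comparator: pd.Series) -> list:
    """Same check, but both columns are converted to plain Python lists once upfront."""
    comparator_values = comparator.tolist()
    return [value in comparator_values[i] for i, value in enumerate(target.tolist())]


# Synthetic stand-in for a large SUPPLB: 64,000 QNAM values.
model_variables = ["STUDYID", "DOMAIN", "USUBJID", "LBSEQ"]
qnam = pd.Series(["LBSEQ", "LBCLSIG"] * 32_000)
comparator = pd.Series([model_variables] * len(qnam))

for check in (contained_iloc, contained_tolist):
    started = time.perf_counter()
    result = check(qnam, comparator)
    elapsed = time.perf_counter() - started
    print(f"{check.__name__}: {elapsed:.3f}s for {len(result)} rows")
```

Both functions return the same list of booleans, so the comparison isolates the indexing cost. If the two timings are close on real data, the bottleneck is probably elsewhere, for example in how `$model_variables` is built, and profiling the full rule run would be the next step.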



CORE-000783 #1699
