Skip to content

Fix incorrect loop variable handling in substring removal step#29

Open
gloria-trivitt wants to merge 1 commit into
mainfrom
stg_fix_loop_variable_detection
Open

Fix incorrect loop variable handling in substring removal step#29
gloria-trivitt wants to merge 1 commit into
mainfrom
stg_fix_loop_variable_detection

Conversation

@gloria-trivitt
Copy link
Copy Markdown
Contributor

@gloria-trivitt gloria-trivitt commented Apr 17, 2026

Issue
In module2_v2, during the column transformation step, loop variables were not processed correctly. As a result, the final table retained duplicated loop suffixes (_N_N) instead of being normalized to a single loop index.

Example:

D_779052408_19_19 AS d_779052408_19_19,
D_133297530_18_18 AS d_133297530_18_18,
D_771528780_2_2 AS d_771528780_2_2

Expected:

D_779052408_19_19 AS d_779052408_19,
D_133297530_18_18 AS d_133297530_18,
D_771528780_2_2 AS d_771528780_2

Refer to nih-nci-dceg-connect-stg-5519.ForTestingOnly.module2_v2_with_cleaned_columns.sql in sql_stage_2026-04-16.zip. This file contains the SQL output before the fix.

Root Cause
During the substring removal step (_build_substring_removal_clauses):

  • The function iterated over all columns, not just those requiring substring removal
  • Loop variables (e.g., _N_N) were unintentionally included
  • These columns were marked as processed early
  • As a result, they were skipped in the loop variable processing step

Resolution
Updated _build_substring_removal_clauses to:

  • Restrict processing to subset_columns (columns that actually contain removable substrings)
  • Explicitly skip loop variables using extract_loop_number(). This ensures loop variables are preserved for Step 4 and correctly normalized.

Additional Behavior Corrected
This change also corrected unintended handling of non-loop columns during the substring removal step.

These examples are specific to the participants table and can be reviewed in:
nih-nci-dceg-connect-stg-5519.ForTestingOnly.participants_with_cleaned_columns.sql

Before fix:

  • Step 2 iterated over all columns and could mark non-loop columns as processed early
  • These columns bypassed intended validation and exclusion logic
  • As a result, some invalid or non-pure variables were included in the final output

After fix:

  • Step 2 only processes columns requiring substring cleanup (subset_columns)
  • Non-loop columns are preserved until later steps
  • Step 4 (_build_loop_variable_clauses) now correctly evaluates them:
    • Non-pure variables are identified via is_pure_variable() and excluded
    • Forbidden and datatype-conflict columns are no longer prematurely passed through
# FORBIDDEN_NON_CID_VARIABLE_NAMES
firstSurveyCompletedSeen
state_studyId
token
updatesSeen

# SUBSTRINGS_DATATYPE_CONFLICT
d_837244890_string
d_837244890_integer

sql_stage_2026-04-16.zip: Contains SQL statements to create tables in CleanConnect for 4/16/2026. This is before the fix and has the issue present.
sql_stage_2026-04-16.zip

sql_stage_2026-04-17.zip: Contains SQL statements to create tables in CleanConnect for 4/17/2026. This is after the fix.
sql_stage_2026-04-17.zip

…to prevent them from being marked as processed before loop handling.
@gloria-trivitt gloria-trivitt changed the title Restrict substring removal to subset_columns and skip loop variables … Fix incorrect loop variable handling in substring removal step Apr 21, 2026
@gloria-trivitt gloria-trivitt marked this pull request as ready for review April 21, 2026 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant