Skip to content

Fixing errors in the repo setup#3

Open
TheVidz wants to merge 4 commits into
genxnetwork:mainfrom
TheVidz:main
Open

Fixing errors in the repo setup#3
TheVidz wants to merge 4 commits into
genxnetwork:mainfrom
TheVidz:main

Conversation

@TheVidz
Copy link
Copy Markdown

@TheVidz TheVidz commented Mar 31, 2026

fix naming issue deep_ancestry, redirecting to flan

error in qc.py file causing issue on "prepare"

fix num folds error in sample_splitter

Fixing Error: --read-freq variant ID '.' appears multiple times

Summary by Sourcery

Fix QC preprocessing and sample splitting to consistently operate on QC-processed genotype data and add the deep_ancestry package marker module.

Bug Fixes:

  • Ensure QC preprocessing writes to a separate *_qc pfile and updates the cache to use this QC output for downstream steps.
  • Update PLINK missing variant ID handling in QC to avoid errors when variant ID '.' appears multiple times.
  • Default the number of folds to a minimum of 5 in the sample splitter when not provided to prevent invalid split configurations.
  • Force genotype splitting to consume QC-processed pfiles rather than raw inputs, ensuring consistency between QC and downstream splits.
  • Ensure the sample splitter uses the QC output pfile in its fit_transform workflow to avoid path-related errors.
  • Add an init module for the deep_ancestry package to fix import and naming issues.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Mar 31, 2026

Reviewer's Guide

Refactors QC and sample splitting to correctly produce and consume QC’d PLINK pfiles, adds safer defaults for fold counts, and introduces a basic deep_ancestry package initializer to fix repo import/setup issues.

Sequence diagram for QC and sample splitting pipeline after fixes

sequenceDiagram
    actor User
    participant QCProcessor
    participant SampleSplitter
    participant FileCache
    participant Plink

    User->>QCProcessor: fit_transform(cache)
    QCProcessor->>FileCache: pfile_path()
    QCProcessor->>Plink: run_plink(--pfile original_pfile --make-pgen --out original_pfile_qc --set-missing-var-ids @:#:$r:$a qc_config)
    Plink-->>QCProcessor: QC pfile generated
    QCProcessor->>FileCache: set _pfile_path = original_pfile_qc

    User->>SampleSplitter: fit_transform(cache)
    SampleSplitter->>FileCache: pfile_path()
    SampleSplitter->>FileCache: ids_path()
    SampleSplitter->>SampleSplitter: _split_ids(cache)
    SampleSplitter->>SampleSplitter: _split_genotypes(cache)
    SampleSplitter->>FileCache: pfile_path()
    SampleSplitter->>Plink: run_plink(--pfile qc_pfile --keep fold_ids --make-pgen --out fold_pfile)
    SampleSplitter->>SampleSplitter: _split_phenotypes(cache)
    SampleSplitter->>Plink: run_plink for phenotypes
    Plink-->>SampleSplitter: split pfiles and phenotype files ready
    SampleSplitter-->>User: cache with QC based splits
Loading

Updated class diagram for QC, sample splitting, and cache handling

classDiagram
    class QCProcessor {
        - qc_config Dict
        + __init__(qc_config Dict) None
        + fit_transform(cache FileCache) None
        + transform(source_path str, dest_path str) None
    }

    class SampleSplitter {
        - args Any
        + __init__(args Any) None
        + _split_ids(cache FileCache, y Any, random_state int) None
        + _split_genotypes(cache FileCache) None
        + _split_phenotypes(cache FileCache) None
        + fit_transform(cache FileCache) None
    }

    class FileCache {
        - _pfile_path str
        + pfile_path() str
        + ids_path(fold_index int, part str) str
        + ids_path() str
        + num_folds int
    }

    class PlinkRunner {
        + run_plink(args_list List, args_dict Dict) None
    }

    QCProcessor --> FileCache : uses
    QCProcessor --> PlinkRunner : calls
    SampleSplitter --> FileCache : uses
    SampleSplitter --> PlinkRunner : calls
    SampleSplitter --> QCProcessor : consumes QC output via cache
    FileCache o-- QCProcessor : stores QC pfile_path
    FileCache o-- SampleSplitter : provides QC pfile_path and ids_path
Loading

Flow diagram for QC to fold-split genotype processing

flowchart TD
    A[Raw PLINK pfile in FileCache] --> B[QCProcessor fit_transform]
    B --> C[run_plink with --set-missing-var-ids @:#:$r:$a]
    C --> D[QC pfile with _qc suffix]
    D --> E[Update FileCache _pfile_path to QC pfile]
    E --> F[SampleSplitter fit_transform]
    F --> G[_split_ids uses ids_path and num_folds default 5]
    G --> H[_split_genotypes uses QC pfile base_path ending with _qc]
    H --> I[run_plink per fold and split train val test]
    F --> J[_split_phenotypes uses QC based folds]
    I --> K[Per fold genotype pfiles]
    J --> L[Per fold phenotype files]
    K --> M[Downstream training uses QC genotypes]
    L --> M
Loading

File-Level Changes

Change Details Files
QC pipeline now writes to a dedicated *_qc pfile and updates downstream cache usage accordingly.
  • Change QC fit_transform to output PLINK pfiles to a new *_qc path instead of overwriting the original pfile.
  • Update PLINK arguments to remove the stray 'vzs' token and use an expanded --set-missing-var-ids pattern '@:#:$r:$a' to avoid duplicate '.' variant IDs.
  • Mutate the FileCache instance after QC to point its internal pfile path to the new *_qc output so later stages operate on QC’d data.
flan/preprocess/qc.py
Sample splitting now has a safe default for num_folds and consistently uses QC-processed genotypes.
  • Ensure args.num_folds has a minimum default of 5 when not provided, preventing errors from missing configuration.
  • Force genotype splitting to read from the QC pfile by deriving a *_qc base path for --pfile instead of using the raw cache.pfile_path.
  • In fit_transform, normalize the cache pfile path to *_qc if necessary so all splitting steps are aligned on QC output.
flan/preprocess/sample_splitter.py
Adds a minimal deep_ancestry package initializer to resolve naming/import issues and properly redirect to flan.
  • Create an empty or lightweight init.py for the deep_ancestry package so it can be imported and linked into the main flan repo structure.
deep_ancestry/__init__.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues, and left some high level feedback:

  • Directly mutating the private attribute cache._pfile_path in multiple places couples this logic tightly to the current FileCache implementation; consider exposing a public setter or method on FileCache to update the pfile path instead.
  • The _qc suffix handling is currently duplicated and inconsistent (qc.fit_transform mutates the cache path, _split_genotypes recomputes a _qc path, and fit_transform in sample_splitter may append _qc again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
  • The comment in _split_ids says 'adding min 5 folds' but the code only provides a default of 5 when num_folds is missing; if a true minimum is desired, add an explicit check to clamp or validate self.args.num_folds.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Directly mutating the private attribute `cache._pfile_path` in multiple places couples this logic tightly to the current `FileCache` implementation; consider exposing a public setter or method on `FileCache` to update the pfile path instead.
- The `_qc` suffix handling is currently duplicated and inconsistent (`qc.fit_transform` mutates the cache path, `_split_genotypes` recomputes a `_qc` path, and `fit_transform` in `sample_splitter` may append `_qc` again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
- The comment in `_split_ids` says 'adding min 5 folds' but the code only provides a default of 5 when `num_folds` is missing; if a true minimum is desired, add an explicit check to clamp or validate `self.args.num_folds`.

## Individual Comments

### Comment 1
<location path="flan/preprocess/qc.py" line_range="35" />
<code_context>
+        )
+
+        # ✅ VERY IMPORTANT: update cache to point to QC output
+        cache._pfile_path = qc_path

     def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion:** Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache

Directly setting `cache._pfile_path` tightly couples this code to `FileCache`’s internals and may break if `FileCache` later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. `set_qc_pfile_path`/`update_pfile_path`) on `FileCache` and calling that instead.

Suggested implementation:

```python
        )

        # ✅ VERY IMPORTANT: update cache to point to QC output
        cache.update_pfile_path(qc_path)

```

To fully implement this change you also need to:
1. Add a public method to the `FileCache` class (where it is defined), for example:
   ```python
   class FileCache:
       ...
       def update_pfile_path(self, new_path: Path | str) -> None:
           # Perform any validation/normalization here if needed
           self._pfile_path = Path(new_path)
   ```
2. Replace any other direct writes to `cache._pfile_path` in the codebase with calls to `update_pfile_path(...)` to keep the abstraction consistent.
</issue_to_address>

### Comment 2
<location path="flan/preprocess/sample_splitter.py" line_range="32-36" />
<code_context>
             y: y can be passed to trigger StratifiedKFold instead of KFold
             random_state (int): Fixed random_state for train_test_split sklearn function
         """
+        # adding min 5 folds
+        num_folds = getattr(self.args, "num_folds", 5)
+        self.args.num_folds = num_folds
+
         ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**suggestion:** Comment and implementation for num_folds are slightly misleading and may surprise callers

The comment describes a minimum of 5 folds, but the code only sets a default of 5 when `num_folds` is missing; it doesn’t enforce a minimum when a smaller value is provided. It also always reassigns `self.args.num_folds` even when already set. Consider updating the comment to match the behavior and/or only setting `self.args.num_folds` when the attribute is absent, e.g. `if not hasattr(self.args, "num_folds"): self.args.num_folds = 5`.

```suggestion
        # default to 5 folds if not specified
        if not hasattr(self.args, "num_folds"):
            self.args.num_folds = 5

        ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
```
</issue_to_address>

### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="101-103" />
<code_context>

     def fit_transform(self, cache: FileCache) -> None:
-        
+        # Force splitter to use QC output
+        if not str(cache.pfile_path()).endswith("_qc"):
+            cache._pfile_path = str(cache.pfile_path()) + "_qc"
         self._split_ids(cache)
         self._split_genotypes(cache)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Duplicated logic for enforcing QC `_qc` suffix and direct private attribute mutation may lead to inconsistencies

This re-derives the QC path via string operations and directly assigns `cache._pfile_path`, duplicating the logic already present in `QC.fit_transform` / `_split_genotypes`. To avoid divergent suffix rules or paths like `_qc_qc` on repeated calls, consider exposing a `FileCache` API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the `_qc` convention.

Suggested implementation:

```python
    def fit_transform(self, cache: FileCache) -> None:
        # Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
        cache.use_qc_pfiles()
        self._split_ids(cache)
        self._split_genotypes(cache)
        self._split_phenotypes(cache)

```

To fully implement this suggestion, you should also:
1. Add a method like `use_qc_pfiles()` (or similar) to the `FileCache` class. That method should:
   - Encapsulate the logic for switching to QC paths (e.g., appending `_qc` only once, or following whatever convention is used in `QC.fit_transform` / `_split_genotypes`).
   - Avoid re-adding the `_qc` suffix if it's already present.
2. Update any other call sites (e.g., in `QC.fit_transform` / `_split_genotypes`) that currently:
   - Manually append `"_qc"` to `pfile_path`, or
   - Mutate `cache._pfile_path` directly,
   so that they call `cache.use_qc_pfiles()` instead. This centralizes the QC path rule and prevents inconsistencies like `_qc_qc`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread flan/preprocess/qc.py
)

# ✅ VERY IMPORTANT: update cache to point to QC output
cache._pfile_path = qc_path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache

Directly setting cache._pfile_path tightly couples this code to FileCache’s internals and may break if FileCache later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. set_qc_pfile_path/update_pfile_path) on FileCache and calling that instead.

Suggested implementation:

        )

        # ✅ VERY IMPORTANT: update cache to point to QC output
        cache.update_pfile_path(qc_path)

To fully implement this change you also need to:

  1. Add a public method to the FileCache class (where it is defined), for example:
    class FileCache:
        ...
        def update_pfile_path(self, new_path: Path | str) -> None:
            # Perform any validation/normalization here if needed
            self._pfile_path = Path(new_path)
  2. Replace any other direct writes to cache._pfile_path in the codebase with calls to update_pfile_path(...) to keep the abstraction consistent.

Comment thread flan/preprocess/sample_splitter.py
Comment on lines +101 to +103
# Force splitter to use QC output
if not str(cache.pfile_path()).endswith("_qc"):
cache._pfile_path = str(cache.pfile_path()) + "_qc"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Duplicated logic for enforcing QC _qc suffix and direct private attribute mutation may lead to inconsistencies

This re-derives the QC path via string operations and directly assigns cache._pfile_path, duplicating the logic already present in QC.fit_transform / _split_genotypes. To avoid divergent suffix rules or paths like _qc_qc on repeated calls, consider exposing a FileCache API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the _qc convention.

Suggested implementation:

    def fit_transform(self, cache: FileCache) -> None:
        # Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
        cache.use_qc_pfiles()
        self._split_ids(cache)
        self._split_genotypes(cache)
        self._split_phenotypes(cache)

To fully implement this suggestion, you should also:

  1. Add a method like use_qc_pfiles() (or similar) to the FileCache class. That method should:
    • Encapsulate the logic for switching to QC paths (e.g., appending _qc only once, or following whatever convention is used in QC.fit_transform / _split_genotypes).
    • Avoid re-adding the _qc suffix if it's already present.
  2. Update any other call sites (e.g., in QC.fit_transform / _split_genotypes) that currently:
    • Manually append "_qc" to pfile_path, or
    • Mutate cache._pfile_path directly,
      so that they call cache.use_qc_pfiles() instead. This centralizes the QC path rule and prevents inconsistencies like _qc_qc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant