Fixing errors in the repo setup#3
Conversation
Reviewer's GuideRefactors QC and sample splitting to correctly produce and consume QC’d PLINK pfiles, adds safer defaults for fold counts, and introduces a basic deep_ancestry package initializer to fix repo import/setup issues. Sequence diagram for QC and sample splitting pipeline after fixessequenceDiagram
actor User
participant QCProcessor
participant SampleSplitter
participant FileCache
participant Plink
User->>QCProcessor: fit_transform(cache)
QCProcessor->>FileCache: pfile_path()
QCProcessor->>Plink: run_plink(--pfile original_pfile --make-pgen --out original_pfile_qc --set-missing-var-ids @:#:$r:$a qc_config)
Plink-->>QCProcessor: QC pfile generated
QCProcessor->>FileCache: set _pfile_path = original_pfile_qc
User->>SampleSplitter: fit_transform(cache)
SampleSplitter->>FileCache: pfile_path()
SampleSplitter->>FileCache: ids_path()
SampleSplitter->>SampleSplitter: _split_ids(cache)
SampleSplitter->>SampleSplitter: _split_genotypes(cache)
SampleSplitter->>FileCache: pfile_path()
SampleSplitter->>Plink: run_plink(--pfile qc_pfile --keep fold_ids --make-pgen --out fold_pfile)
SampleSplitter->>SampleSplitter: _split_phenotypes(cache)
SampleSplitter->>Plink: run_plink for phenotypes
Plink-->>SampleSplitter: split pfiles and phenotype files ready
SampleSplitter-->>User: cache with QC based splits
Updated class diagram for QC, sample splitting, and cache handlingclassDiagram
class QCProcessor {
- qc_config Dict
+ __init__(qc_config Dict) None
+ fit_transform(cache FileCache) None
+ transform(source_path str, dest_path str) None
}
class SampleSplitter {
- args Any
+ __init__(args Any) None
+ _split_ids(cache FileCache, y Any, random_state int) None
+ _split_genotypes(cache FileCache) None
+ _split_phenotypes(cache FileCache) None
+ fit_transform(cache FileCache) None
}
class FileCache {
- _pfile_path str
+ pfile_path() str
+ ids_path(fold_index int, part str) str
+ ids_path() str
+ num_folds int
}
class PlinkRunner {
+ run_plink(args_list List, args_dict Dict) None
}
QCProcessor --> FileCache : uses
QCProcessor --> PlinkRunner : calls
SampleSplitter --> FileCache : uses
SampleSplitter --> PlinkRunner : calls
SampleSplitter --> QCProcessor : consumes QC output via cache
FileCache o-- QCProcessor : stores QC pfile_path
FileCache o-- SampleSplitter : provides QC pfile_path and ids_path
Flow diagram for QC to fold-split genotype processingflowchart TD
A[Raw PLINK pfile in FileCache] --> B[QCProcessor fit_transform]
B --> C[run_plink with --set-missing-var-ids @:#:$r:$a]
C --> D[QC pfile with _qc suffix]
D --> E[Update FileCache _pfile_path to QC pfile]
E --> F[SampleSplitter fit_transform]
F --> G[_split_ids uses ids_path and num_folds default 5]
G --> H[_split_genotypes uses QC pfile base_path ending with _qc]
H --> I[run_plink per fold and split train val test]
F --> J[_split_phenotypes uses QC based folds]
I --> K[Per fold genotype pfiles]
J --> L[Per fold phenotype files]
K --> M[Downstream training uses QC genotypes]
L --> M
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- Directly mutating the private attribute
cache._pfile_pathin multiple places couples this logic tightly to the currentFileCacheimplementation; consider exposing a public setter or method onFileCacheto update the pfile path instead. - The
_qcsuffix handling is currently duplicated and inconsistent (qc.fit_transformmutates the cache path,_split_genotypesrecomputes a_qcpath, andfit_transforminsample_splittermay append_qcagain); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path. - The comment in
_split_idssays 'adding min 5 folds' but the code only provides a default of 5 whennum_foldsis missing; if a true minimum is desired, add an explicit check to clamp or validateself.args.num_folds.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Directly mutating the private attribute `cache._pfile_path` in multiple places couples this logic tightly to the current `FileCache` implementation; consider exposing a public setter or method on `FileCache` to update the pfile path instead.
- The `_qc` suffix handling is currently duplicated and inconsistent (`qc.fit_transform` mutates the cache path, `_split_genotypes` recomputes a `_qc` path, and `fit_transform` in `sample_splitter` may append `_qc` again); it would be safer to centralize this logic in one place or add a helper that guarantees a single, well-formed QC path.
- The comment in `_split_ids` says 'adding min 5 folds' but the code only provides a default of 5 when `num_folds` is missing; if a true minimum is desired, add an explicit check to clamp or validate `self.args.num_folds`.
## Individual Comments
### Comment 1
<location path="flan/preprocess/qc.py" line_range="35" />
<code_context>
+ )
+
+ # ✅ VERY IMPORTANT: update cache to point to QC output
+ cache._pfile_path = qc_path
def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion:** Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache
Directly setting `cache._pfile_path` tightly couples this code to `FileCache`’s internals and may break if `FileCache` later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. `set_qc_pfile_path`/`update_pfile_path`) on `FileCache` and calling that instead.
Suggested implementation:
```python
)
# ✅ VERY IMPORTANT: update cache to point to QC output
cache.update_pfile_path(qc_path)
```
To fully implement this change you also need to:
1. Add a public method to the `FileCache` class (where it is defined), for example:
```python
class FileCache:
...
def update_pfile_path(self, new_path: Path | str) -> None:
# Perform any validation/normalization here if needed
self._pfile_path = Path(new_path)
```
2. Replace any other direct writes to `cache._pfile_path` in the codebase with calls to `update_pfile_path(...)` to keep the abstraction consistent.
</issue_to_address>
### Comment 2
<location path="flan/preprocess/sample_splitter.py" line_range="32-36" />
<code_context>
y: y can be passed to trigger StratifiedKFold instead of KFold
random_state (int): Fixed random_state for train_test_split sklearn function
"""
+ # adding min 5 folds
+ num_folds = getattr(self.args, "num_folds", 5)
+ self.args.num_folds = num_folds
+
ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**suggestion:** Comment and implementation for num_folds are slightly misleading and may surprise callers
The comment describes a minimum of 5 folds, but the code only sets a default of 5 when `num_folds` is missing; it doesn’t enforce a minimum when a smaller value is provided. It also always reassigns `self.args.num_folds` even when already set. Consider updating the comment to match the behavior and/or only setting `self.args.num_folds` when the attribute is absent, e.g. `if not hasattr(self.args, "num_folds"): self.args.num_folds = 5`.
```suggestion
# default to 5 folds if not specified
if not hasattr(self.args, "num_folds"):
self.args.num_folds = 5
ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
```
</issue_to_address>
### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="101-103" />
<code_context>
def fit_transform(self, cache: FileCache) -> None:
-
+ # Force splitter to use QC output
+ if not str(cache.pfile_path()).endswith("_qc"):
+ cache._pfile_path = str(cache.pfile_path()) + "_qc"
self._split_ids(cache)
self._split_genotypes(cache)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Duplicated logic for enforcing QC `_qc` suffix and direct private attribute mutation may lead to inconsistencies
This re-derives the QC path via string operations and directly assigns `cache._pfile_path`, duplicating the logic already present in `QC.fit_transform` / `_split_genotypes`. To avoid divergent suffix rules or paths like `_qc_qc` on repeated calls, consider exposing a `FileCache` API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the `_qc` convention.
Suggested implementation:
```python
def fit_transform(self, cache: FileCache) -> None:
# Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
cache.use_qc_pfiles()
self._split_ids(cache)
self._split_genotypes(cache)
self._split_phenotypes(cache)
```
To fully implement this suggestion, you should also:
1. Add a method like `use_qc_pfiles()` (or similar) to the `FileCache` class. That method should:
- Encapsulate the logic for switching to QC paths (e.g., appending `_qc` only once, or following whatever convention is used in `QC.fit_transform` / `_split_genotypes`).
- Avoid re-adding the `_qc` suffix if it's already present.
2. Update any other call sites (e.g., in `QC.fit_transform` / `_split_genotypes`) that currently:
- Manually append `"_qc"` to `pfile_path`, or
- Mutate `cache._pfile_path` directly,
so that they call `cache.use_qc_pfiles()` instead. This centralizes the QC path rule and prevents inconsistencies like `_qc_qc`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| ) | ||
|
|
||
| # ✅ VERY IMPORTANT: update cache to point to QC output | ||
| cache._pfile_path = qc_path |
There was a problem hiding this comment.
suggestion: Avoid mutating a cache private attribute directly; prefer a dedicated API on FileCache
Directly setting cache._pfile_path tightly couples this code to FileCache’s internals and may break if FileCache later adds validation or derives related paths from this value. Prefer exposing a public setter or a dedicated method (e.g. set_qc_pfile_path/update_pfile_path) on FileCache and calling that instead.
Suggested implementation:
)
# ✅ VERY IMPORTANT: update cache to point to QC output
cache.update_pfile_path(qc_path)To fully implement this change you also need to:
- Add a public method to the
FileCacheclass (where it is defined), for example:class FileCache: ... def update_pfile_path(self, new_path: Path | str) -> None: # Perform any validation/normalization here if needed self._pfile_path = Path(new_path)
- Replace any other direct writes to
cache._pfile_pathin the codebase with calls toupdate_pfile_path(...)to keep the abstraction consistent.
| # Force splitter to use QC output | ||
| if not str(cache.pfile_path()).endswith("_qc"): | ||
| cache._pfile_path = str(cache.pfile_path()) + "_qc" |
There was a problem hiding this comment.
suggestion (bug_risk): Duplicated logic for enforcing QC _qc suffix and direct private attribute mutation may lead to inconsistencies
This re-derives the QC path via string operations and directly assigns cache._pfile_path, duplicating the logic already present in QC.fit_transform / _split_genotypes. To avoid divergent suffix rules or paths like _qc_qc on repeated calls, consider exposing a FileCache API (e.g., a method to switch to QC data) and using that here instead of mutating the private attribute and re-encoding the _qc convention.
Suggested implementation:
def fit_transform(self, cache: FileCache) -> None:
# Ensure splitter uses QC output via FileCache API (avoids direct private attribute mutation)
cache.use_qc_pfiles()
self._split_ids(cache)
self._split_genotypes(cache)
self._split_phenotypes(cache)To fully implement this suggestion, you should also:
- Add a method like
use_qc_pfiles()(or similar) to theFileCacheclass. That method should:- Encapsulate the logic for switching to QC paths (e.g., appending
_qconly once, or following whatever convention is used inQC.fit_transform/_split_genotypes). - Avoid re-adding the
_qcsuffix if it's already present.
- Encapsulate the logic for switching to QC paths (e.g., appending
- Update any other call sites (e.g., in
QC.fit_transform/_split_genotypes) that currently:- Manually append
"_qc"topfile_path, or - Mutate
cache._pfile_pathdirectly,
so that they callcache.use_qc_pfiles()instead. This centralizes the QC path rule and prevents inconsistencies like_qc_qc.
- Manually append
fix naming issue deep_ancestry, redirecting to flan
error in qc.py file causing issue on "prepare"
fix num folds error in sample_splitter
Fixing Error: --read-freq variant ID '.' appears multiple times
Summary by Sourcery
Fix QC preprocessing and sample splitting to consistently operate on QC-processed genotype data and add the deep_ancestry package marker module.
Bug Fixes: