Added parentage functions#59
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates package exports/imports to support new pedigree/parentage-related functionality, including exposing validate_pedigree() and tightening check_ped() dependency declarations.
Changes:
- Export
validate_pedigree()and add itsdata.tableroxygen imports. - Replace broad
@importdirectives incheck_ped()with more specific@importFromentries. - Regenerate/update
NAMESPACEaccordingly (including addingtools::file_path_sans_extandjanitor::clean_namesimports).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| R/validate_pedigree.R | Adds @importFrom data.table ... for validate_pedigree() and ensures it’s exported. |
| R/check_ped.R | Switches roxygen imports to @importFrom for dplyr/janitor and adds tools::file_path_sans_ext. |
| NAMESPACE | Updates exports/imports to match roxygen (adds export(validate_pedigree), switches janitor import style, adds additional dplyr imports). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #' \item{Best_Female_Parent_Error_Pct}{Homozygous mismatch percentage for the best female parent candidate.} | ||
| #' } | ||
| #' @export | ||
| #' @importFrom data.table fread fwrite copy data.table set rbindlist |
There was a problem hiding this comment.
The roxygen @importFrom data.table ... line is redundant with the current implementation, since all referenced data.table APIs are called via data.table::.... Consider either removing this @importFrom (and relying on data.table::), or dropping the data.table:: qualifiers and keeping the import—having both makes the dependency surface harder to reason about.
| #' @importFrom data.table fread fwrite copy data.table set rbindlist |
| #' @importFrom janitor clean_names | ||
| #' @importFrom stats setNames | ||
| #' @importFrom utils read.table | ||
| #' @importFrom tools file_path_sans_ext |
There was a problem hiding this comment.
These @importFrom tags for janitor::clean_names, stats::setNames, and tools::file_path_sans_ext are currently unused because the code calls them with explicit namespaces (janitor::, stats::, tools::). Consider removing the unused imports, or alternatively removing the namespace qualifiers if you want to rely on imports—mixing both makes it harder to audit dependencies.
| #' @importFrom janitor clean_names | |
| #' @importFrom stats setNames | |
| #' @importFrom utils read.table | |
| #' @importFrom tools file_path_sans_ext | |
| #' @importFrom utils read.table |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## development #59 +/- ##
===============================================
- Coverage 81.53% 78.96% -2.58%
===============================================
Files 23 23
Lines 2616 2814 +198
===============================================
+ Hits 2133 2222 +89
- Misses 483 592 +109 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ement for documentation on check_ped
…and to plot results.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
man/validate_pedigree.Rd:90
- The function always writes
corrected_pedigree.txtto the working directory (even whenwrite_results = FALSE), but this side-effect isn’t documented here. Please document this output file (including when it’s written and how it relates towrite_results) so users aren’t surprised by unexpected files in their working directory.
\description{
Validates parent-offspring trios by calculating Mendelian error rates from
SNP genotype data. Identifies incorrect parentage assignments and suggests
best-matching replacements. If a list of founders is supplied, trios that
are declared founders (both parents coded as 0) are preserved unchanged
with no recommendations. Trios removed due to missing genotype data are
retained in the output with a NO_GENOTYPE_DATA status.
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Replace any empty strings with NA | ||
| for (col in base::names(final_df)) { | ||
| data.table::set(final_df, which(final_df[[col]] == ""), col, NA_character_) |
There was a problem hiding this comment.
In best_pair, the post-merge loop that replaces empty strings uses final_df[[col]] == "" for all columns and then assigns NA_character_ via data.table::set(). This can coerce numeric/integer columns (e.g., Markers_Tested) to character or error depending on data.table coercion rules. Restrict the empty-string cleanup to character columns (and assign NA_character_ only there), or use type-appropriate NA values per column.
| # Replace any empty strings with NA | |
| for (col in base::names(final_df)) { | |
| data.table::set(final_df, which(final_df[[col]] == ""), col, NA_character_) | |
| # Replace any empty strings with NA in character columns only | |
| char_cols <- base::names(final_df)[ | |
| base::vapply(final_df, base::is.character, logical(1)) | |
| ] | |
| for (col in char_cols) { | |
| empty_idx <- which(final_df[[col]] == "") | |
| if (base::length(empty_idx) > 0) { | |
| data.table::set(final_df, empty_idx, col, NA_character_) | |
| } |
| linewidth = 1 | ||
| ) + | ||
| ggplot2::scale_x_continuous(breaks = seq(0, 100, by = 5)) + | ||
| ggplot2::scale_y_continuous(breaks = seq(0, 10000, by = 5)) + |
There was a problem hiding this comment.
scale_y_continuous(breaks = seq(0, 10000, by = 5)) creates ~2000 y-axis breaks, which can significantly slow plot rendering and often produces unreadable axes. Consider removing the explicit y breaks (let ggplot pick), or use a much coarser/pretty break strategy based on the actual range (e.g., scales::pretty_breaks()).
| ggplot2::scale_y_continuous(breaks = seq(0, 10000, by = 5)) + | |
| ggplot2::scale_y_continuous(breaks = scales::pretty_breaks()) + |
| if (write_txt) { | ||
| output_filename <- "parentage_results_dt.txt" | ||
| output_filename <- "parentage_testing_results.txt" | ||
| tryCatch({ | ||
| data.table::fwrite(final_df, file = output_filename, sep = "\t", quote = FALSE) | ||
| data.table::fwrite(final_df, file = output_filename, sep = "\t", quote = FALSE, | ||
| na = na_string) |
There was a problem hiding this comment.
The output filename for write_txt = TRUE is hard-coded to parentage_testing_results.txt. Since this used to be a different filename, this is a behavioral breaking change for scripts that relied on the old name, and there’s no way to override it. Consider adding an output_filename argument (with the current value as default) and/or supporting the previous filename via a deprecated option to avoid breaking existing workflows.
| #### check 4: missing parents #### | ||
| all_ids <- unique(data$id) | ||
| ref_ids <- unique(c(data$male_parent, data$female_parent)) | ||
| ref_ids <- ref_ids[ref_ids != "0"] |
There was a problem hiding this comment.
ref_ids <- ref_ids[ref_ids != "0"] does not drop NA values (because NA != "0" yields NA), so missing_ids <- setdiff(ref_ids, all_ids) can include NA. That can lead to a founder row with id = NA being added when correct = TRUE. Filter out NA explicitly (and consider trimming empty strings) before computing missing_ids.
| ref_ids <- ref_ids[ref_ids != "0"] | |
| ref_ids <- trimws(ref_ids) | |
| ref_ids <- ref_ids[!is.na(ref_ids) & ref_ids != "" & ref_ids != "0"] |
| #### zip export #### | ||
| if (save_zip) { | ||
| tmp_dir <- tempfile() | ||
| dir.create(tmp_dir) |
There was a problem hiding this comment.
When save_zip = TRUE, a temporary directory is created (tmp_dir <- tempfile(); dir.create(tmp_dir)) but it’s never cleaned up. Add an on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE) (or equivalent) so tests/usage don’t leave temp folders behind.
| dir.create(tmp_dir) | |
| dir.create(tmp_dir) | |
| on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE) |
| ggplot2::theme(legend.position = "top") | ||
| print(p) | ||
| if (write_results) { | ||
| plot_filename <- base::sub("\\.[^.]+$", ".jpg", output_filename) |
There was a problem hiding this comment.
plot_filename <- sub("\\.[^.]+$", ".jpg", output_filename) only changes the name if output_filename already has an extension. If a user passes output_filename = "report", the plot would be saved as report (no .jpg). Consider building the plot filename with tools::file_path_sans_ext(output_filename) + .jpg to handle both cases reliably.
| plot_filename <- base::sub("\\.[^.]+$", ".jpg", output_filename) | |
| plot_filename <- paste0(tools::file_path_sans_ext(output_filename), ".jpg") |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export(updog2vcf) | ||
| export(validate_pedigree) | ||
| import(dplyr) | ||
| import(janitor) | ||
| import(parallel) | ||
| import(quadprog) | ||
| import(stringr) |
There was a problem hiding this comment.
janitor was removed from the NAMESPACE imports, and there are no remaining references in the R code. The package still lists janitor under Imports in DESCRIPTION, which will keep it as a hard dependency unnecessarily. Consider removing it from DESCRIPTION (or re-adding the NAMESPACE import if it's still intended to be required).
| comparisons <- base::sum(!base::is.na(cand_hom) & !base::is.na(prog_hom)) | ||
| if (comparisons == 0) return(NA_real_) | ||
| (base::sum(cand_hom != prog_hom, na.rm = TRUE) / comparisons) * 100 | ||
| }) |
There was a problem hiding this comment.
In find_best_parent(), if all candidate mismatch rates are NA (e.g., no overlapping homozygous markers), which.min(errors) returns integer(0), so candidates[best_idx] becomes character(0) instead of NA. This can propagate empty strings/zero-length values into Best_*_Candidate and make downstream logic/tests brittle. Consider explicitly handling the all-NA case (e.g., if all(is.na(errors)) then return list(id = NA_character_, error_pct = NA_real_)).
| }) | |
| }) | |
| if (base::all(base::is.na(errors))) | |
| return(base::list(id = NA_character_, error_pct = NA_real_)) |
| \item{write_results}{Logical. If TRUE, writes the validation results to | ||
| output_filename and saves the plot as a .jpg file (default: TRUE).} | ||
|
|
||
| \item{output_filename}{Character. Path/name of the output file | ||
| (default: "pedigree_validation_results.txt").} | ||
| \item{output_filename}{Character. Path/name of the output results file | ||
| (default: "__validation_report.txt"). The plot will be saved using the | ||
| same base name with a .jpg extension.} | ||
|
|
||
| \item{plot_results}{Logical. If TRUE, prints a histogram of Trio Mendelian | ||
| Error percentages with a threshold line (default: TRUE). If write_results | ||
| is also TRUE, the plot is additionally saved as a .jpg file.} | ||
|
|
||
| \item{na_string}{Character. String for missing values in the output file. |
There was a problem hiding this comment.
validate_pedigree() always writes corrected_pedigree.txt to the working directory, but the Rd docs don't mention this side effect or how (if at all) it can be disabled/configured. Since this can surprise callers (and is why tests now need temp working dirs), please document the corrected pedigree output (and ideally add a parameter to control its generation or output path).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
NAMESPACE:33
NAMESPACEno longer importsjanitor, and a repo-wide search shows no remainingjanitor::usage. However,DESCRIPTIONstill listsjanitorunderImports, which will typically trigger an R CMD check NOTE about unused imports. Either removejanitorfromDESCRIPTIONor reintroduce an actual use/import as appropriate.
import(dplyr)
import(parallel)
import(quadprog)
import(stringr)
import(tibble)
import(tidyr)
import(vcfR)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| detect_all_cycles <- function(data) { | ||
| adj_list <- lapply(data$id, function(x) { | ||
| row <- data[data$id == x, ] | ||
| c(row$sire, row$dam) | ||
| c(row$male_parent, row$female_parent) | ||
| }) | ||
| names(adj_list) <- data$id | ||
|
|
There was a problem hiding this comment.
detect_all_cycles() builds adj_list using data$id and then sets names(adj_list) <- data$id. If id is duplicated (which is explicitly one of the issues this function detects), the adjacency list and the visited/rec_stack named vectors end up with duplicate names. Indexing by name (visited[node], adj_list[[node]]) then becomes ambiguous and can silently mis-detect or miss cycles. Consider running cycle detection on a de-duplicated/normalized pedigree (e.g., one row per id after duplicate removal / conflict resolution) or explicitly constructing the graph from unique(data$id) (or a key) so node names are unique.
| invisible(capture.output( | ||
| utils::zip(zipfile = zip_path, files = zip_files, flags = "-j") | ||
| )) | ||
| if (verbose) cat(paste0("\nZip archive saved to: ", zip_path, "\n")) |
There was a problem hiding this comment.
The save_zip path calls utils::zip(...) without any error handling. On systems without an available zip utility (or where getOption("zip") points to a missing binary), this will error and abort check_ped(), even though zipping is an optional feature. Wrap the zip creation in tryCatch() (and/or preflight with nzchar(Sys.which(getOption("zip")))) so failures produce a clear warning/error message without leaving partial output behind.
| invisible(capture.output( | |
| utils::zip(zipfile = zip_path, files = zip_files, flags = "-j") | |
| )) | |
| if (verbose) cat(paste0("\nZip archive saved to: ", zip_path, "\n")) | |
| zip_cmd <- getOption("zip") | |
| zip_bin <- if (nzchar(zip_cmd)) Sys.which(zip_cmd) else "" | |
| if (!nzchar(zip_bin)) { | |
| warning( | |
| paste0( | |
| "Zip archive was not created because no usable zip utility was found. ", | |
| "Configure options(zip = ...) to point to a valid zip binary." | |
| ), | |
| call. = FALSE | |
| ) | |
| } else { | |
| zip_ok <- tryCatch({ | |
| invisible(capture.output( | |
| utils::zip(zipfile = zip_path, files = zip_files, flags = "-j") | |
| )) | |
| TRUE | |
| }, error = function(e) { | |
| if (file.exists(zip_path)) unlink(zip_path) | |
| warning( | |
| paste0("Zip archive could not be created: ", conditionMessage(e)), | |
| call. = FALSE | |
| ) | |
| FALSE | |
| }) | |
| if (zip_ok && verbose) cat(paste0("\nZip archive saved to: ", zip_path, "\n")) | |
| } |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| detect_all_cycles <- function(data) { | ||
| adj_list <- lapply(data$id, function(x) { | ||
| row <- data[data$id == x, ] | ||
| c(row$sire, row$dam) | ||
| c(row$male_parent, row$female_parent) | ||
| }) | ||
| names(adj_list) <- data$id | ||
|
|
||
| dfs <- function(node, visited, rec_stack, path) { | ||
| visited[node] <- TRUE | ||
| visited[node] <- TRUE | ||
| rec_stack[node] <- TRUE | ||
| path <- append(path, node) | ||
| path <- append(path, node) | ||
| cycles <- list() | ||
|
|
||
| for (neighbor in adj_list[[node]]) { | ||
| if (neighbor %in% names(adj_list)) { | ||
| if (!visited[neighbor]) { | ||
| cycles <- append(cycles, dfs(neighbor, visited, rec_stack, path)) | ||
| } else if (rec_stack[neighbor]) { | ||
| cycle_start <- match(neighbor, path) | ||
| cycles <- append(cycles, list(path[cycle_start:length(path)])) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| rec_stack[node] <- FALSE | ||
| return(cycles) | ||
| } | ||
|
|
||
| visited <- stats::setNames(rep(FALSE, length(adj_list)), names(adj_list)) | ||
| visited <- stats::setNames(rep(FALSE, length(adj_list)), names(adj_list)) | ||
| rec_stack <- stats::setNames(rep(FALSE, length(adj_list)), names(adj_list)) | ||
| all_cycles <- list() | ||
|
|
||
| for (node in names(adj_list)) { | ||
| if (!visited[node]) { | ||
| node_cycles <- dfs(node, visited, rec_stack, character()) |
There was a problem hiding this comment.
Cycle detection can error when the input pedigree contains duplicate IDs (which is exactly what check_ped is meant to detect). adj_list and visited/rec_stack are keyed by data$id (which may include duplicates), so expressions like if (!visited[node]) / if (!visited[neighbor]) can receive a vector and throw the condition has length > 1. Build the adjacency list over unique(data$id) (and aggregate parents per ID), ensuring visited/rec_stack have unique names before running DFS.
| plot_df <- final_df[!is.na(final_df$Mendelian_Error_Pct)] | ||
| plot_df$Mendelian_Error_Pct <- base::as.numeric(plot_df$Mendelian_Error_Pct) | ||
|
|
There was a problem hiding this comment.
The plot code has to coerce Mendelian_Error_Pct with as.numeric(), which suggests the results column is not consistently numeric. That makes downstream comparisons and summaries more error-prone (and can introduce NAs if formatting changes). Prefer keeping Mendelian_Error_Pct numeric in the returned data.table for all methods, and only format it as a string when writing/printing.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 19 changed files in this pull request and generated 7 comments.
Files not reviewed (6)
- man/allele_freq_poly.Rd: Language not supported
- man/check_ped.Rd: Language not supported
- man/find_parentage.Rd: Language not supported
- man/imputation_concordance.Rd: Language not supported
- man/solve_composition_poly.Rd: Language not supported
- man/validate_pedigree.Rd: Language not supported
Comments suppressed due to low confidence (1)
DESCRIPTION:66
janitorremains listed under Imports, but there are no longer anyjanitor::(orclean_names) usages in the R sources. This makesjanitoran unnecessary hard dependency; consider removing it from DESCRIPTION Imports if it's truly unused.
Imports:
parallel,
dplyr,
Rdpack (>= 0.7),
readr (>= 2.1.5),
reshape2 (>= 1.4.4),
rlang,
tidyr (>= 1.3.1),
vcfR (>= 1.15.0),
Rsamtools,
Biostrings,
pwalign,
janitor,
quadprog,
tibble,
stringr,
data.table
Suggests:
| data <- utils::read.table(ped.file, header = TRUE) %>% | ||
| janitor::clean_names() %>% | ||
| mutate( | ||
| id = as.character(id), | ||
| sire = as.character(sire), | ||
| dam = as.character(dam) | ||
| dplyr::mutate( | ||
| id = as.character(id), | ||
| male_parent = as.character(male_parent), | ||
| female_parent = as.character(female_parent) | ||
| ) | ||
|
|
||
| # Validate required column names before any processing | ||
| required_cols <- c("id", "male_parent", "female_parent") | ||
| missing_cols <- setdiff(required_cols, colnames(data)) | ||
| if (length(missing_cols) > 0) { | ||
| stop( | ||
| "Input file is missing required column(s): ", | ||
| paste(missing_cols, collapse = ", "), | ||
| ".\nExpected columns: id, male_parent, female_parent." | ||
| ) |
| check_ped <- function(ped.file, | ||
| seed = NULL, | ||
| verbose = TRUE, | ||
| correct_conflicting_trios = TRUE, | ||
| correct_inconsistent_sex_roles = TRUE) { |
| input_ped_report <- list( | ||
| exact_duplicates = exact_duplicates, | ||
| repeated_ids_diff = repeated_ids_report, | ||
| messy_parents = messy_parents, | ||
| missing_parents = missing_parents, | ||
| dependencies = data.frame(Dependency = unique(unlist(errors))) | ||
| exact_duplicates = exact_duplicates, | ||
| conflicting_trios = repeated_ids_report, | ||
| inconsistent_sex_roles = inconsistent_sex_roles, | ||
| missing_parents = missing_parents, | ||
| dependencies = data.frame(Dependency = unique(unlist(errors)), | ||
| stringsAsFactors = FALSE), | ||
| corrected_pedigree = data %>% dplyr::select(-row_number) | ||
| ) |
| test_that("check_ped returns a named list of length 5", { | ||
| ped <- data.frame( | ||
| id = c("A", "B", "C"), | ||
| male_parent = c("0", "A", "A"), | ||
| female_parent = c("0", "0", "0") | ||
| ) | ||
| out <- check_ped(write_ped(ped), seed = 1, verbose = FALSE, correct = FALSE) | ||
|
|
||
| #Calculations | ||
| output.list <- check_ped(ped_file, | ||
| seed = 101919, | ||
| verbose = FALSE) | ||
| expect_type(out, "list") | ||
| expect_length(out, 5) | ||
| expect_named(out, c( | ||
| "exact_duplicates", | ||
| "repeated_ids_diff", | ||
| "inconsistent_sex_roles", | ||
| "missing_parents", | ||
| "dependencies" | ||
| )) |
| check_ped( | ||
| ped.file, | ||
| seed = NULL, | ||
| verbose = TRUE, | ||
| correct = TRUE, | ||
| save_zip = FALSE, | ||
| save_corrected_zip = FALSE | ||
| ) | ||
| } | ||
| \arguments{ | ||
| \item{ped.file}{Path to the pedigree text file.} | ||
|
|
||
| \item{seed}{Optional seed for reproducibility.} | ||
| \item{seed}{Optional integer seed for reproducibility.} | ||
|
|
||
| \item{verbose}{Logical. If TRUE (default), prints errors and prompts for interactive saving.} | ||
| \item{verbose}{Logical. Print findings to the console? Default \code{TRUE}.} | ||
|
|
||
| \item{correct}{Logical. Build and save a corrected pedigree? Default \code{TRUE}.} | ||
|
|
||
| \item{save_zip}{Logical. Export report components as a \code{.zip} archive? Default \code{FALSE}.} | ||
|
|
||
| \item{save_corrected_zip}{Logical. Include corrected pedigree in the \code{.zip}? | ||
| Only applies when \code{save_zip = TRUE} and \code{correct = TRUE}. Default \code{FALSE}.} | ||
| } |
| importFrom(ggplot2,aes) | ||
| importFrom(ggplot2,element_text) | ||
| importFrom(ggplot2,geom_bar) | ||
| importFrom(ggplot2,ggplot) | ||
| importFrom(ggplot2,labs) | ||
| importFrom(ggplot2,theme) | ||
| importFrom(ggplot2,theme_minimal) |
| #' | ||
| #' @importFrom dplyr %>% filter arrange | ||
| #' @importFrom stats reorder | ||
| #' @importFrom ggplot2 ggplot aes geom_bar labs theme_minimal theme element_text |
Added 2 parentage functions along with test files for them. updated imporFrom statemetns for ped_check