164 changes: 164 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,164 @@
# Copilot Instructions for CCBR Repositories

## Reviewer guidance (what to look for in PRs)

- Reviewers must validate enforcement rules: no secrets, container specified, and reproducibility pins.
- If code is AI-generated, reviewers must ensure the author documents what was changed and why, and that the PR is labeled `generated-by-AI`.
- Reviewers should verify license headers and ownership metadata (for example, `CODEOWNERS`) are present.
- Reviewers must read the code and verify that it adheres to the project's coding standards, guidelines, and software engineering best practices.

## CI & enforcement suggestions (automatable)

1. **PR template**: include optional AI-assistance disclosure fields (model used, high-level prompt intent, manual review confirmation).
2. **Pre-merge check (GitHub Action)**: verify `.github/copilot-instructions.md` is present in the repository and that new pipeline files include a `# CRAFT:` header.
3. **Lint jobs**: `ruff` for Python, `shellcheck` for shell, `lintr` for R, and `nf-core lint` or Snakemake lint checks where applicable.
4. **Secrets scan**: run `TruffleHog` or `Gitleaks` on PRs to detect accidental credentials.
5. **AI usage label**: if AI usage is declared, an Action should add the `generated-by-AI` label (creating the label if it does not exist). The PR body should end with the italicized Markdown line _Generated using AI_, and any associated commit messages should end with the plain footer line `Generated using AI`.

_Sample GH Action check (concept): if AI usage is declared, require an AI-assistance disclosure field in the PR body._
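A minimal sketch of such a check, assuming the PR body is available as a string (the disclosure field name and function name are hypothetical; the footer text follows the rules above):

```python
AI_FOOTER = "_Generated using AI_"


def check_ai_disclosure(pr_body: str, ai_declared: bool) -> list[str]:
    """Return a list of problems with the PR's AI disclosure (empty list = pass)."""
    problems: list[str] = []
    if ai_declared:
        # The disclosure field name is an assumption based on the PR template above.
        if "model used" not in pr_body.lower():
            problems.append("missing 'Model used' disclosure field")
        if not pr_body.rstrip().endswith(AI_FOOTER):
            problems.append(f"PR body must end with the line: {AI_FOOTER}")
    return problems
```

In a real Action this would run against the event payload's `pull_request.body` and fail the check when the returned list is non-empty.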

## Security & compliance (mandatory)

- Developers must not send PHI or sensitive NIH internal identifiers to unapproved external AI services; use synthetic examples.
- Repository content must only be sent to model providers approved by NCI/NIH policy (for example, Copilot for Business or approved internal proxies).
- For AI-assisted actions, teams must keep an auditable record including: user, repository, action, timestamp, model name, and endpoint.
- If using a server wrapper (Option C), logs must include the minimum metadata above and follow institutional retention policy.
- If policy forbids external model use for internal code, teams must use approved local/internal LLM workflows.

## Operational notes (practical)

- `copilot-instructions.md` should remain concise and prescriptive; keep only high-value rules and edge-case examples.
- Developers should include the CRAFT block in edited files when requesting substantial generated code to improve context quality.
- Copilot must ask the user for permission before deleting any file, unless the file was created by Copilot for a temporary run or test.
- Copilot must not edit any files outside of the current open workspace.

## Code authoring guidance

- Code must not include hard-coded secrets, credentials, or sensitive absolute paths on disk.
- Code should be designed for modularity, reusability, and maintainability. It should ideally be platform-agnostic, with special support for running on the Biowulf HPC.
- Use pre-commit to enforce code style and linting during the commit process.
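As an illustration of the first rule, a pre-commit-style scan might look like the sketch below. The patterns are illustrative only; a dedicated scanner such as Gitleaks or TruffleHog should remain the source of truth.

```python
import re

# Illustrative patterns: obvious credential assignments and user-specific absolute paths.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*['\"][^'\"]+['\"]"),
    re.compile(r"/(?:home|Users|data)/[A-Za-z0-9_.-]+/"),
]


def find_violations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like secrets or absolute paths."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```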

### Pipelines

- Authors must review existing CCBR pipelines first: <https://github.com/CCBR>.
- New pipelines should follow established CCBR conventions for folder layout, rule/process naming, config structure, and test patterns.
- Pipelines must define container images and pin tool/image versions for reproducibility.
- Contributions should include a test dataset and a documented example command.

#### Snakemake

- In general, new pipelines should be created with Nextflow rather than Snakemake, unless there is a compelling reason to use Snakemake.
- Generate new pipelines from the CCBR_SnakemakeTemplate repo: <https://github.com/CCBR/CCBR_SnakemakeTemplate>
- For Snakemake, run `snakemake --lint` and a dry-run before PR submission.

#### Nextflow

- Generate new pipelines from the CCBR_NextflowTemplate repo: <https://github.com/CCBR/CCBR_NextflowTemplate>
- For Nextflow pipelines, authors must follow nf-core patterns and references: <https://nf-co.re>.
- Nextflow code must use DSL2 only (DSL1 is not allowed).
- For Nextflow, run `nf-core lint` (or equivalent checks) before PR submission.
- Where possible, reuse modules and subworkflows from CCBR/nf-modules or nf-core/modules.
- New modules and subworkflows should be tested with `nf-test`.

### Python scripts and packages

- Python scripts must include module and function/class docstrings.
- Where a standard CLI framework is adopted, Python CLIs should use `click` or `typer` for consistency with existing components.
- Scripts must support `--help` and document required/optional arguments.
- Python code must follow [PEP 8](https://peps.python.org/pep-0008/), use `snake_case`, and include type hints for public functions.
- Scripts must raise descriptive errors on failure and emit warnings when applicable. Prefer raising an exception over printing an error message or returning an error code.
- Python code should pass `ruff`.
- Each script must include a documented example usage in comments or README.
- Tests should be written with `pytest`. Other testing frameworks may be used if justified.
- Do not catch bare exceptions. The exception type must always be specified.
- Include at most one return statement, placed at the end of the function.

### R scripts and packages

- R scripts must include function and class docstrings via roxygen2.
- CLIs must be defined using the `argparse` package.
- CLIs must support `--help` and document required/optional arguments.
- R code should pass `lintr` and `air`.
- Tests should be written with `testthat`.
- Packages should pass `devtools::check()`.
- R code should adhere to the [tidyverse style guide](https://style.tidyverse.org/).
- If a return statement is used at all, include at most one, placed at the end of the function. Explicit returns are preferred but not required for R functions.

## AI-generated commit messages (Conventional Commits)

- Commit messages must follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) (as enforced in `CONTRIBUTING.md`).
- Generate messages from staged changes only (`git diff --staged`); do not include unrelated work.
- Commits should be atomic: one logical change per commit.
- If mixed changes are present, split into multiple logical commits; the number of commits does not need to equal the number of files changed.
- Subject format must be: `<type>(optional-scope): short imperative summary` (<=72 chars), e.g., `fix(profile): update release table parser`.
- Add a body only when needed to explain **why** and notable impact; never include secrets, tokens, PHI, or large diffs.
- For AI-assisted commits, add this final italicized footer line in the commit message body: _commit message is ai-generated_
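The subject rules can be checked mechanically. A sketch follows; the regex covers only the types listed in the suggested prompt below, not the full Conventional Commits grammar.

```python
import re

# Type, optional scope, optional breaking-change marker, then ": <summary>".
SUBJECT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([a-z0-9-]+\))?(!)?: \S.*$"
)


def is_valid_subject(subject: str) -> bool:
    """Check the type/scope prefix, the 72-char limit, and no trailing period."""
    ok = (
        len(subject) <= 72
        and not subject.endswith(".")
        and SUBJECT_RE.match(subject) is not None
    )
    return ok
```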

Suggested prompt for AI tools:

```text
Create a Conventional Commit message from this staged diff.
Rules:
1) Use one of: feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert.
2) Keep subject <= 72 chars, imperative mood, no trailing period.
3) Include optional scope when clear.
4) Add a short body only if needed (why/impact), wrapped at ~72 chars.
5) Output only the final commit message.
```

## Pull Requests

When opening a pull request, use the repository's pull request template (usually `.github/PULL_REQUEST_TEMPLATE.md`).
Different repos have different PR templates depending on their needs.
Ensure that the pull request follows the repository's PR template and includes all required information.
Do not allow the developer to proceed with opening a PR that does not fill out all sections of the template.
Before a PR can be moved from draft to "ready for review", all of the relevant checklist items must be checked, and any
irrelevant checklist items should be crossed out.

When new features, bug fixes, or other behavioral changes are introduced to the code,
unit tests must be added or updated to cover the new or changed functionality.

If there are any API or other user-facing changes, the documentation must be updated both inline via docstrings and in the long-form docs in the `docs/` or `vignettes/` directory.

When a repo contains a build workflow (i.e., a workflow file in `.github/workflows/` whose name starts with `build` or is `R-CMD-check`),
the build workflow must pass before the PR can be approved.
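A sketch of the draft-to-ready checklist gate, assuming irrelevant items are crossed out with `~~strikethrough~~` (the function name and regex are illustrative):

```python
import re

# Matches Markdown task-list items: "- [ ] text" or "- [x] text".
CHECKBOX_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<state>[ xX])\]\s*(?P<text>.+)$", re.MULTILINE
)


def unchecked_items(pr_body: str) -> list[str]:
    """Return checklist items that are neither checked nor crossed out."""
    pending: list[str] = []
    for match in CHECKBOX_RE.finditer(pr_body):
        text = match.group("text").strip()
        crossed_out = text.startswith("~~") and text.endswith("~~")
        if match.group("state") == " " and not crossed_out:
            pending.append(text)
    return pending
```

A PR would stay in draft while `unchecked_items` returns anything.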

### Changelog

The changelog for the repository should be maintained in a `CHANGELOG.md` file
(or `NEWS.md` for R packages) at the root of the repository. Each pull request
that introduces user-facing changes must include a concise entry with the PR
number and author username tagged. Developer-only changes (e.g., updates to CI
workflows, development notes) should never be included in the changelog.
Example:

```markdown
## development version

- Fix bug in `detect_absolute_paths()` to ignore comments. (#123, @username)
```
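This entry format can be validated with a simple pattern. The check below is an illustrative convention check, not an official format:

```python
import re

# Each bullet must end with "(#<PR number>, @<username>)".
ENTRY_RE = re.compile(r"^- .+ \(#\d+, @[A-Za-z0-9-]+\)$")


def entry_is_tagged(line: str) -> bool:
    """Return True if a changelog bullet carries a PR number and author tag."""
    return ENTRY_RE.match(line.strip()) is not None
```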

## Onboarding checklist for new developers

- [ ] Read `.github/CONTRIBUTING.md` and `.github/copilot-instructions.md`.
- [ ] Configure VSCode workspace to open `copilot-instructions.md` by default (so Copilot Chat sees it).
- [ ] Install pre-commit and run `pre-commit install`.

## Appendix: VSCode snippet (drop into `.vscode/snippets/craft.code-snippets`)

```json
{
  "Insert CRAFT prompt": {
    "prefix": "craft",
    "body": [
      "/* C: Context: Repo=${workspaceFolderBasename}; bioinformatics pipelines; NIH HPC (Biowulf/Helix); containers: quay.io/ccbr */",
      "/* R: Rules: no PHI, no secrets, containerize, pin versions, follow style */",
      "/* F: Flow: inputs/ -> results/, conf/, tests/ */",
      "/* T: Tests: provide a one-line TEST_CMD and expected output */",
      "",
      "A: $1"
    ],
    "description": "Insert CRAFT prompt and place cursor at Actions"
  }
}
```
1 change: 0 additions & 1 deletion .tests/README.md
@@ -3,4 +3,3 @@
These input files are used for continuous integration purposes, specifically to dry run the pipeline whenever commits have been made to the main, master, or unified branches.

**Please Note:** Each of the provided FastQ files and BAM files have only headers and will not work for the LOGAN pipeline

2 changes: 1 addition & 1 deletion .tests/pairs.tsv
@@ -1,2 +1,2 @@
Tumor Normal
WGS_NC_T WGS_NC_N
WGS_NC_T WGS_NC_N
2 changes: 1 addition & 1 deletion README.md
@@ -154,7 +154,7 @@ Example of Tumor_Normal calling mode
# Step 0: Set up

sinteractive --mem=8g -N 1 -n 4
module load ccbrpipeliner # v8
module load ccbrpipeliner # v8

# set up directories

10 changes: 5 additions & 5 deletions bin/ascat.R
@@ -51,14 +51,14 @@ ascat.prepareHTS(
normalBAF_file = sprintf("%s_BAF.txt",normal_name),
BED_file=bed)

ascat.bc = ascat.loadData(Tumor_LogR_file = sprintf("%s_LogR.txt",tumor_name),
Tumor_BAF_file = sprintf("%s_BAF.txt",tumor_name),
Germline_LogR_file = sprintf("%s_LogR.txt",normal_name), Germline_BAF_file = sprintf("%s_BAF.txt",normal_name),
ascat.bc = ascat.loadData(Tumor_LogR_file = sprintf("%s_LogR.txt",tumor_name),
Tumor_BAF_file = sprintf("%s_BAF.txt",tumor_name),
Germline_LogR_file = sprintf("%s_LogR.txt",normal_name), Germline_BAF_file = sprintf("%s_BAF.txt",normal_name),
gender = gender, genomeVersion = genome)

ascat.plotRawData(ascat.bc, img.prefix = "Before_correction_")
ascat.bc = ascat.correctLogR(ascat.bc,
GCcontentfile = sprintf("%s/GC_G1000/GC_G1000_%s.txt",genomebasedir,genome),
ascat.bc = ascat.correctLogR(ascat.bc,
GCcontentfile = sprintf("%s/GC_G1000/GC_G1000_%s.txt",genomebasedir,genome),
replictimingfile = sprintf("%s/RT_G1000/RT_G1000_%s.txt",genomebasedir,genome))
ascat.plotRawData(ascat.bc, img.prefix = "After_correction_")
ascat.bc = ascat.aspcf(ascat.bc)
8 changes: 4 additions & 4 deletions bin/assess_significance.R
@@ -12,7 +12,7 @@ cnvs<- data.frame(dataTable)

ratio$Ratio[which(ratio$Ratio==-1)]=NA

cnvs.bed=GRanges(cnvs[,1],IRanges(cnvs[,2],cnvs[,3]))
cnvs.bed=GRanges(cnvs[,1],IRanges(cnvs[,2],cnvs[,3]))
ratio.bed=GRanges(ratio$Chromosome,IRanges(ratio$Start,ratio$Start),score=ratio$Ratio)

overlaps <- subsetByOverlaps(ratio.bed,cnvs.bed)
@@ -46,13 +46,13 @@ ifelse(resultks == "try-error",kscore <- c(kscore, "NA"),kscore <- c(kscore, ks.
cnvs = cbind(cnvs, as.numeric(wscore), as.numeric(kscore))

if (numberOfCol==7) {
names(cnvs)=c("chr","start","end","copy number","status","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
if (numberOfCol==9) {
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
if (numberOfCol==11) {
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","somatic/germline","precentageOfGermline","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","somatic/germline","precentageOfGermline","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
write.table(cnvs, file=paste(args[4],".p.value.txt",sep=""),sep="\t",quote=F,row.names=F)

1 change: 0 additions & 1 deletion bin/combineAllSampleCompareResults.R
@@ -78,4 +78,3 @@ colnames(finalPredPairs)<-c("Sample1","Sample2","Som:relatedness","Som:hom_conco
#mergedDF<-merge(x=finalPredPairs,y=finalpredictedPairsVerifyBAMID,by = "Sample1",all = TRUE)
#write.table(mergedDF[,c(1:4,6)],file = user.input.3,sep = "\t",quote = FALSE,row.names = FALSE)
write.table(finalPredPairs,file = user.input.3,sep = "\t",quote = FALSE,row.names = FALSE)

76 changes: 47 additions & 29 deletions bin/flowcell_lane.py
@@ -29,12 +29,17 @@
# +SRR6755966.1 1 length=101
# CC@FFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIJJJJI

def usage(message = '', exitcode = 0):

def usage(message="", exitcode=0):
"""Displays help and usage information. If provided invalid usage
returns non-zero exit-code. Additional message can be displayed with
the 'message' parameter.
"""
print('Usage: python {} sampleName.R1.fastq.gz sampleName > sampleName.flowcell_lanes.txt'.format(sys.argv[0]))
print(
"Usage: python {} sampleName.R1.fastq.gz sampleName > sampleName.flowcell_lanes.txt".format(
sys.argv[0]
)
)
if message:
print(message)
sys.exit(exitcode)
@@ -49,11 +54,11 @@ def get_flowcell_lane(sequence_identifer):
IDs in its sequence indentifer.
For more information visit: https://en.wikipedia.org/wiki/FASTQ_format
"""
id_list = sequence_identifer.strip().split(':')
id_list = sequence_identifer.strip().split(":")
if len(id_list) < 7:
# No Flowcell IDs in this format
# Return next instrument id instead (next best thing)
if sequence_identifer.startswith('@SRR'):
if sequence_identifer.startswith("@SRR"):
# SRA format or downloaded SRA FastQ file
# SRA format 1: contains machine and lane information
# @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@@ -66,20 +71,20 @@ def get_flowcell_lane(sequence_identifer):
except IndexError:
# SRA format 2
id1 = id_list[0].split()[0].split(".")[0]
id2 = id1.lstrip('@')
return id1,id2
id2 = id1.lstrip("@")
return id1, id2
else:
# Casava < 1.8 (fastq format)
# @HWUSI-EAS100R:6:73:941:1973#0/1
return id_list[0],id_list[1]
return id_list[0], id_list[1]
else:
# Casava >= 1.8
# Normal FastQ format
# @J00170:88:HNYVJBBXX:8:1101:6390:1244 1:N:0:ACTTGA
return id_list[2],id_list[3]
return id_list[2], id_list[3]


def md5sum(filename, blocksize = 65536):
def md5sum(filename, blocksize=65536):
"""Gets md5checksum of a file in memory-safe manner.
The file is read in blocks defined by the blocksize parameter. This is a safer
option to reading the entire file into memory if the file is very large.
@@ -93,7 +98,7 @@ def md5sum(filename, blocksize = 65536):
import hashlib

hasher = hashlib.md5()
with open(filename, 'rb') as fh:
with open(filename, "rb") as fh:
buf = fh.read(blocksize)
while len(buf) > 0:
hasher.update(buf)
@@ -102,13 +107,15 @@ def md5sum(filename, blocksize = 65536):
return hasher.hexdigest()


if __name__ == '__main__':

if __name__ == "__main__":
# Check Usage
if '-h' in sys.argv or '--help' in sys.argv or '-help' in sys.argv:
usage(exitcode = 0)
if "-h" in sys.argv or "--help" in sys.argv or "-help" in sys.argv:
usage(exitcode=0)
elif len(sys.argv) != 3:
usage(message = 'Error: failed to provide all required positional arguments!', exitcode = 1)
usage(
message="Error: failed to provide all required positional arguments!",
exitcode=1,
)

# Get file name and sample name prefix
filename = sys.argv[1]
@@ -117,23 +124,34 @@ def md5sum(filename, blocksize = 65536):
md5 = md5sum(filename)

# Get Flowcell and Lane information
handle = gzip.open if filename.endswith('.gz') else open
meta = {'flowcell': [], 'lane': [], 'flowcell_lane': []}
handle = gzip.open if filename.endswith(".gz") else open
meta = {"flowcell": [], "lane": [], "flowcell_lane": []}
i = 0 # keeps track of line number
with handle(filename, 'rt') as file:
print('sample_name\ttotal_read_pairs\tflowcell_ids\tlanes\tflowcell_lanes\tmd5_checksum')
with handle(filename, "rt") as file:
print(
"sample_name\ttotal_read_pairs\tflowcell_ids\tlanes\tflowcell_lanes\tmd5_checksum"
)
for line in file:
line = line.strip()
if i%4 == 0: # read id or sequence identifer
if i % 4 == 0: # read id or sequence identifer
fc, lane = get_flowcell_lane(line)
fc = fc.lstrip('@')
fc_lane = "{}_{}".format(fc,lane)
if fc not in meta['flowcell']:
meta['flowcell'].append(fc)
if lane not in meta['lane']:
meta['lane'].append(lane)
if fc_lane not in meta['flowcell_lane']:
meta['flowcell_lane'].append(fc_lane)
fc = fc.lstrip("@")
fc_lane = "{}_{}".format(fc, lane)
if fc not in meta["flowcell"]:
meta["flowcell"].append(fc)
if lane not in meta["lane"]:
meta["lane"].append(lane)
if fc_lane not in meta["flowcell_lane"]:
meta["flowcell_lane"].append(fc_lane)
i += 1

print("{}\t{}\t{}\t{}\t{}\t{}".format(sample, int(i/4),",".join(sorted(meta['flowcell'])),",".join(sorted(meta['lane'])),",".join(sorted(meta['flowcell_lane'])), md5))
print(
"{}\t{}\t{}\t{}\t{}\t{}".format(
sample,
int(i / 4),
",".join(sorted(meta["flowcell"])),
",".join(sorted(meta["lane"])),
",".join(sorted(meta["flowcell_lane"])),
md5,
)
)
2 changes: 1 addition & 1 deletion bin/lofreq_convert.sh
@@ -29,4 +29,4 @@ zcat "${INPUT_FILE}" \
my @data = map { chomp; [ split /=|;/ ] } $_;
$NEW_ROW = "$_\tDP:DP4\t$data[0][1]:$data[0][7]\n";
print $NEW_ROW;
}'
}'