164 changes: 164 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,164 @@
# Copilot Instructions for CCBR Repositories

## Reviewer guidance (what to look for in PRs)

- Reviewers must validate enforcement rules: no secrets, container specified, and reproducibility pins.
- If code is AI-generated, reviewers must ensure the author documents what was changed and why, and that the PR is labeled `generated-by-AI`.
- Reviewers should verify license headers and ownership metadata (for example, `CODEOWNERS`) are present.
- Reviewers must read the code and verify that it adheres to the project's coding standards, guidelines, and software engineering best practices.

## CI & enforcement suggestions (automatable)

1. **PR template**: include optional AI-assistance disclosure fields (model used, high-level prompt intent, manual review confirmation).
2. **Pre-merge check (GitHub Action)**: verify `.github/copilot-instructions.md` is present in the repository and that new pipeline files include a `# CRAFT:` header.
3. **Lint jobs**: `ruff` for Python, `shellcheck` for shell, `lintr` for R, and `nf-core lint` or Snakemake lint checks where applicable.
4. **Secrets scan**: run `TruffleHog` or `Gitleaks` on PRs to detect accidental credentials.
5. **AI usage label**: if AI usage is declared, an Action should add the `generated-by-AI` label (creating the label if it does not exist). The PR body should end with the italicized Markdown line _Generated using AI_, and any associated commit messages should end with the plain footer line `Generated using AI`.

_Sample GH Action check (concept): if AI usage is declared, require an AI-assistance disclosure field in the PR body._
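A minimal sketch of such a check, assuming the PR body is available as a string (the disclosure field name and function name are hypothetical; the footer text follows the rules above):

```python
AI_FOOTER = "_Generated using AI_"


def check_ai_disclosure(pr_body: str, ai_declared: bool) -> list[str]:
    """Return a list of problems with the PR's AI disclosure (empty list = pass)."""
    problems: list[str] = []
    if ai_declared:
        # The disclosure field name is an assumption based on the PR template above.
        if "model used" not in pr_body.lower():
            problems.append("missing 'Model used' disclosure field")
        if not pr_body.rstrip().endswith(AI_FOOTER):
            problems.append(f"PR body must end with the line: {AI_FOOTER}")
    return problems
```

In a real Action this would run against the event payload's `pull_request.body` and fail the check when the returned list is non-empty.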

## Security & compliance (mandatory)

- Developers must not send PHI or sensitive NIH internal identifiers to unapproved external AI services; use synthetic examples.
- Repository content must only be sent to model providers approved by NCI/NIH policy (for example, Copilot for Business or approved internal proxies).
- For AI-assisted actions, teams must keep an auditable record including: user, repository, action, timestamp, model name, and endpoint.
- If using a server wrapper (Option C), logs must include the minimum metadata above and follow institutional retention policy.
- If policy forbids external model use for internal code, teams must use approved local/internal LLM workflows.

## Operational notes (practical)

- `copilot-instructions.md` should remain concise and prescriptive; keep only high-value rules and edge-case examples.
- Developers should include the CRAFT block in edited files when requesting substantial generated code to improve context quality.
- Copilot must ask the user for permission before deleting any file, unless the file was created by Copilot for a temporary run or test.
- Copilot must not edit any files outside of the current open workspace.

## Code authoring guidance

- Code must not include hard-coded secrets, credentials, or sensitive absolute paths on disk.
- Code should be designed for modularity, reusability, and maintainability. It should ideally be platform-agnostic, with special support for running on the Biowulf HPC.
- Use pre-commit to enforce code style and linting during the commit process.
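As an illustration of the first rule, a pre-commit-style scan might look like the sketch below. The patterns are illustrative only; a dedicated scanner such as Gitleaks or TruffleHog should remain the source of truth.

```python
import re

# Illustrative patterns: obvious credential assignments and user-specific absolute paths.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*['\"][^'\"]+['\"]"),
    re.compile(r"/(?:home|Users|data)/[A-Za-z0-9_.-]+/"),
]


def find_violations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like secrets or absolute paths."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```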

### Pipelines

- Authors must review existing CCBR pipelines first: <https://github.com/CCBR>.
- New pipelines should follow established CCBR conventions for folder layout, rule/process naming, config structure, and test patterns.
- Pipelines must define container images and pin tool/image versions for reproducibility.
- Contributions should include a test dataset and a documented example command.

#### Snakemake

- In general, new pipelines should be created with Nextflow rather than Snakemake, unless there is a compelling reason to use Snakemake.
- Generate new pipelines from the CCBR_SnakemakeTemplate repo: <https://github.com/CCBR/CCBR_SnakemakeTemplate>
- For Snakemake, run `snakemake --lint` and a dry-run before PR submission.

#### Nextflow

- Generate new pipelines from the CCBR_NextflowTemplate repo: <https://github.com/CCBR/CCBR_NextflowTemplate>
- For Nextflow pipelines, authors must follow nf-core patterns and references: <https://nf-co.re>.
- Nextflow code must use DSL2 only (DSL1 is not allowed).
- For Nextflow, run `nf-core lint` (or equivalent checks) before PR submission.
- Where possible, reuse modules and subworkflows from CCBR/nf-modules or nf-core/modules.
- New modules and subworkflows should be tested with `nf-test`.

### Python scripts and packages

- Python scripts must include module and function/class docstrings.
- Where a standard CLI framework is adopted, Python CLIs should use `click` or `typer` for consistency with existing components.
- Scripts must support `--help` and document required/optional arguments.
- Python code must follow [PEP 8](https://peps.python.org/pep-0008/), use `snake_case`, and include type hints for public functions.
- Scripts must raise descriptive errors on failure and emit warnings when applicable. Prefer raising an exception over printing an error message or returning an error code.
- Python code should pass `ruff`.
- Each script must include a documented example usage in comments or README.
- Tests should be written with `pytest`. Other testing frameworks may be used if justified.
- Do not catch bare exceptions. The exception type must always be specified.
- Include at most one return statement, placed at the end of the function.

### R scripts and packages

- R scripts must include function and class docstrings via roxygen2.
- CLIs must be defined using the `argparse` package.
- CLIs must support `--help` and document required/optional arguments.
- R code should pass `lintr` and `air`.
- Tests should be written with `testthat`.
- Packages should pass `devtools::check()`.
- R code should adhere to the [tidyverse style guide](https://style.tidyverse.org/).
- If a return statement is used at all, include at most one, placed at the end of the function. Explicit returns are preferred but not required for R functions.

## AI-generated commit messages (Conventional Commits)

- Commit messages must follow [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) (as enforced in `CONTRIBUTING.md`).
- Generate messages from staged changes only (`git diff --staged`); do not include unrelated work.
- Commits should be atomic: one logical change per commit.
- If mixed changes are present, split into multiple logical commits; the number of commits does not need to equal the number of files changed.
- Subject format must be: `<type>(optional-scope): short imperative summary` (<=72 chars), e.g., `fix(profile): update release table parser`.
- Add a body only when needed to explain **why** and notable impact; never include secrets, tokens, PHI, or large diffs.
- For AI-assisted commits, add this final italicized footer line in the commit message body: _commit message is ai-generated_
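The subject rules can be checked mechanically. A sketch follows; the regex covers only the types listed in the suggested prompt below, not the full Conventional Commits grammar.

```python
import re

# Type, optional scope, optional breaking-change marker, then ": <summary>".
SUBJECT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([a-z0-9-]+\))?(!)?: \S.*$"
)


def is_valid_subject(subject: str) -> bool:
    """Check the type/scope prefix, the 72-char limit, and no trailing period."""
    ok = (
        len(subject) <= 72
        and not subject.endswith(".")
        and SUBJECT_RE.match(subject) is not None
    )
    return ok
```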

Suggested prompt for AI tools:

```text
Create a Conventional Commit message from this staged diff.
Rules:
1) Use one of: feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert.
2) Keep subject <= 72 chars, imperative mood, no trailing period.
3) Include optional scope when clear.
4) Add a short body only if needed (why/impact), wrapped at ~72 chars.
5) Output only the final commit message.
```

## Pull Requests

When opening a pull request, use the repository's pull request template (usually `.github/PULL_REQUEST_TEMPLATE.md`).
Different repos have different PR templates depending on their needs.
Ensure that the pull request follows the repository's PR template and includes all required information.
Do not allow the developer to proceed with opening a PR that does not fill out all sections of the template.
Before a PR can be moved from draft to "ready for review", all of the relevant checklist items must be checked, and any
irrelevant checklist items should be crossed out.

When new features, bug fixes, or other behavioral changes are introduced to the code,
unit tests must be added or updated to cover the new or changed functionality.

If there are any API or other user-facing changes, the documentation must be updated both inline via docstrings and in the long-form docs in the `docs/` or `vignettes/` directory.

When a repo contains a build workflow (i.e., a workflow file in `.github/workflows/` whose name starts with `build` or is `R-CMD-check`),
the build workflow must pass before the PR can be approved.
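A sketch of the draft-to-ready checklist gate, assuming irrelevant items are crossed out with `~~strikethrough~~` (the function name and regex are illustrative):

```python
import re

# Matches Markdown task-list items: "- [ ] text" or "- [x] text".
CHECKBOX_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<state>[ xX])\]\s*(?P<text>.+)$", re.MULTILINE
)


def unchecked_items(pr_body: str) -> list[str]:
    """Return checklist items that are neither checked nor crossed out."""
    pending: list[str] = []
    for match in CHECKBOX_RE.finditer(pr_body):
        text = match.group("text").strip()
        crossed_out = text.startswith("~~") and text.endswith("~~")
        if match.group("state") == " " and not crossed_out:
            pending.append(text)
    return pending
```

A PR would stay in draft while `unchecked_items` returns anything.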

### Changelog

The changelog for the repository should be maintained in a `CHANGELOG.md` file
(or `NEWS.md` for R packages) at the root of the repository. Each pull request
that introduces user-facing changes must include a concise entry with the PR
number and author username tagged. Developer-only changes (e.g., updates to CI
workflows, development notes) should never be included in the changelog.
Example:

```markdown
## development version

- Fix bug in `detect_absolute_paths()` to ignore comments. (#123, @username)
```
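This entry format can be validated with a simple pattern. The check below is an illustrative convention check, not an official format:

```python
import re

# Each bullet must end with "(#<PR number>, @<username>)".
ENTRY_RE = re.compile(r"^- .+ \(#\d+, @[A-Za-z0-9-]+\)$")


def entry_is_tagged(line: str) -> bool:
    """Return True if a changelog bullet carries a PR number and author tag."""
    return ENTRY_RE.match(line.strip()) is not None
```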

## Onboarding checklist for new developers

- [ ] Read `.github/CONTRIBUTING.md` and `.github/copilot-instructions.md`.
- [ ] Configure VSCode workspace to open `copilot-instructions.md` by default (so Copilot Chat sees it).
- [ ] Install pre-commit and run `pre-commit install`.

## Appendix: VSCode snippet (drop into `.vscode/snippets/craft.code-snippets`)

```json
{
  "Insert CRAFT prompt": {
    "prefix": "craft",
    "body": [
      "/* C: Context: Repo=${workspaceFolderBasename}; bioinformatics pipelines; NIH HPC (Biowulf/Helix); containers: quay.io/ccbr */",
      "/* R: Rules: no PHI, no secrets, containerize, pin versions, follow style */",
      "/* F: Flow: inputs/ -> results/, conf/, tests/ */",
      "/* T: Tests: provide a one-line TEST_CMD and expected output */",
      "",
      "A: $1"
    ],
    "description": "Insert CRAFT prompt and place cursor at Actions"
  }
}
```
1 change: 0 additions & 1 deletion .tests/README.md
@@ -3,4 +3,3 @@
These input files are used for continuous integration purposes, specifically to dry run the pipeline whenever commits have been made to the main, master, or unified branches.

**Please Note:** Each of the provided FastQ files and BAM files have only headers and will not work for the LOGAN pipeline

2 changes: 1 addition & 1 deletion .tests/pairs.tsv
@@ -1,2 +1,2 @@
Tumor Normal
WGS_NC_T WGS_NC_N
WGS_NC_T WGS_NC_N
2 changes: 1 addition & 1 deletion README.md
@@ -154,7 +154,7 @@ Example of Tumor_Normal calling mode
# Step 0: Set up

sinteractive --mem=8g -N 1 -n 4
module load ccbrpipeliner # v8
module load ccbrpipeliner # v8

# set up directories

10 changes: 5 additions & 5 deletions bin/ascat.R
@@ -51,14 +51,14 @@ ascat.prepareHTS(
normalBAF_file = sprintf("%s_BAF.txt",normal_name),
BED_file=bed)

ascat.bc = ascat.loadData(Tumor_LogR_file = sprintf("%s_LogR.txt",tumor_name),
Tumor_BAF_file = sprintf("%s_BAF.txt",tumor_name),
Germline_LogR_file = sprintf("%s_LogR.txt",normal_name), Germline_BAF_file = sprintf("%s_BAF.txt",normal_name),
ascat.bc = ascat.loadData(Tumor_LogR_file = sprintf("%s_LogR.txt",tumor_name),
Tumor_BAF_file = sprintf("%s_BAF.txt",tumor_name),
Germline_LogR_file = sprintf("%s_LogR.txt",normal_name), Germline_BAF_file = sprintf("%s_BAF.txt",normal_name),
gender = gender, genomeVersion = genome)

ascat.plotRawData(ascat.bc, img.prefix = "Before_correction_")
ascat.bc = ascat.correctLogR(ascat.bc,
GCcontentfile = sprintf("%s/GC_G1000/GC_G1000_%s.txt",genomebasedir,genome),
ascat.bc = ascat.correctLogR(ascat.bc,
GCcontentfile = sprintf("%s/GC_G1000/GC_G1000_%s.txt",genomebasedir,genome),
replictimingfile = sprintf("%s/RT_G1000/RT_G1000_%s.txt",genomebasedir,genome))
ascat.plotRawData(ascat.bc, img.prefix = "After_correction_")
ascat.bc = ascat.aspcf(ascat.bc)
8 changes: 4 additions & 4 deletions bin/assess_significance.R
@@ -12,7 +12,7 @@ cnvs<- data.frame(dataTable)

ratio$Ratio[which(ratio$Ratio==-1)]=NA

cnvs.bed=GRanges(cnvs[,1],IRanges(cnvs[,2],cnvs[,3]))
cnvs.bed=GRanges(cnvs[,1],IRanges(cnvs[,2],cnvs[,3]))
ratio.bed=GRanges(ratio$Chromosome,IRanges(ratio$Start,ratio$Start),score=ratio$Ratio)

overlaps <- subsetByOverlaps(ratio.bed,cnvs.bed)
@@ -46,13 +46,13 @@ ifelse(resultks == "try-error",kscore <- c(kscore, "NA"),kscore <- c(kscore, ks.
cnvs = cbind(cnvs, as.numeric(wscore), as.numeric(kscore))

if (numberOfCol==7) {
names(cnvs)=c("chr","start","end","copy number","status","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
if (numberOfCol==9) {
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
if (numberOfCol==11) {
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","somatic/germline","precentageOfGermline","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
names(cnvs)=c("chr","start","end","copy number","status","genotype","uncertainty","somatic/germline","precentageOfGermline","WilcoxonRankSumTestPvalue","KolmogorovSmirnovPvalue")
}
write.table(cnvs, file=paste(args[4],".p.value.txt",sep=""),sep="\t",quote=F,row.names=F)

1 change: 0 additions & 1 deletion bin/combineAllSampleCompareResults.R
@@ -78,4 +78,3 @@ colnames(finalPredPairs)<-c("Sample1","Sample2","Som:relatedness","Som:hom_conco
#mergedDF<-merge(x=finalPredPairs,y=finalpredictedPairsVerifyBAMID,by = "Sample1",all = TRUE)
#write.table(mergedDF[,c(1:4,6)],file = user.input.3,sep = "\t",quote = FALSE,row.names = FALSE)
write.table(finalPredPairs,file = user.input.3,sep = "\t",quote = FALSE,row.names = FALSE)

76 changes: 47 additions & 29 deletions bin/flowcell_lane.py
@@ -29,12 +29,17 @@
# +SRR6755966.1 1 length=101
# CC@FFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHIJJJJI

def usage(message = '', exitcode = 0):

def usage(message="", exitcode=0):
"""Displays help and usage information. If provided invalid usage
returns non-zero exit-code. Additional message can be displayed with
the 'message' parameter.
"""
print('Usage: python {} sampleName.R1.fastq.gz sampleName > sampleName.flowcell_lanes.txt'.format(sys.argv[0]))
print(
"Usage: python {} sampleName.R1.fastq.gz sampleName > sampleName.flowcell_lanes.txt".format(
sys.argv[0]
)
)
if message:
print(message)
sys.exit(exitcode)
@@ -49,11 +54,11 @@ def get_flowcell_lane(sequence_identifer):
IDs in its sequence indentifer.
For more information visit: https://en.wikipedia.org/wiki/FASTQ_format
"""
id_list = sequence_identifer.strip().split(':')
id_list = sequence_identifer.strip().split(":")
if len(id_list) < 7:
# No Flowcell IDs in this format
# Return next instrument id instead (next best thing)
if sequence_identifer.startswith('@SRR'):
if sequence_identifer.startswith("@SRR"):
# SRA format or downloaded SRA FastQ file
# SRA format 1: contains machine and lane information
# @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@@ -66,20 +71,20 @@ def get_flowcell_lane(sequence_identifer):
except IndexError:
# SRA format 2
id1 = id_list[0].split()[0].split(".")[0]
id2 = id1.lstrip('@')
return id1,id2
id2 = id1.lstrip("@")
return id1, id2
else:
# Casava < 1.8 (fastq format)
# @HWUSI-EAS100R:6:73:941:1973#0/1
return id_list[0],id_list[1]
return id_list[0], id_list[1]
else:
# Casava >= 1.8
# Normal FastQ format
# @J00170:88:HNYVJBBXX:8:1101:6390:1244 1:N:0:ACTTGA
return id_list[2],id_list[3]
return id_list[2], id_list[3]


def md5sum(filename, blocksize = 65536):
def md5sum(filename, blocksize=65536):
"""Gets md5checksum of a file in memory-safe manner.
The file is read in blocks defined by the blocksize parameter. This is a safer
option to reading the entire file into memory if the file is very large.
@@ -93,7 +98,7 @@ def md5sum(filename, blocksize = 65536):
import hashlib

hasher = hashlib.md5()
with open(filename, 'rb') as fh:
with open(filename, "rb") as fh:
buf = fh.read(blocksize)
while len(buf) > 0:
hasher.update(buf)
@@ -102,13 +107,15 @@ def md5sum(filename, blocksize = 65536):
return hasher.hexdigest()


if __name__ == '__main__':

if __name__ == "__main__":
# Check Usage
if '-h' in sys.argv or '--help' in sys.argv or '-help' in sys.argv:
usage(exitcode = 0)
if "-h" in sys.argv or "--help" in sys.argv or "-help" in sys.argv:
usage(exitcode=0)
elif len(sys.argv) != 3:
usage(message = 'Error: failed to provide all required positional arguments!', exitcode = 1)
usage(
message="Error: failed to provide all required positional arguments!",
exitcode=1,
)

# Get file name and sample name prefix
filename = sys.argv[1]
@@ -117,23 +124,34 @@ def md5sum(filename, blocksize = 65536):
md5 = md5sum(filename)

# Get Flowcell and Lane information
handle = gzip.open if filename.endswith('.gz') else open
meta = {'flowcell': [], 'lane': [], 'flowcell_lane': []}
handle = gzip.open if filename.endswith(".gz") else open
meta = {"flowcell": [], "lane": [], "flowcell_lane": []}
i = 0 # keeps track of line number
with handle(filename, 'rt') as file:
print('sample_name\ttotal_read_pairs\tflowcell_ids\tlanes\tflowcell_lanes\tmd5_checksum')
with handle(filename, "rt") as file:
print(
"sample_name\ttotal_read_pairs\tflowcell_ids\tlanes\tflowcell_lanes\tmd5_checksum"
)
for line in file:
line = line.strip()
if i%4 == 0: # read id or sequence identifer
if i % 4 == 0: # read id or sequence identifer
fc, lane = get_flowcell_lane(line)
fc = fc.lstrip('@')
fc_lane = "{}_{}".format(fc,lane)
if fc not in meta['flowcell']:
meta['flowcell'].append(fc)
if lane not in meta['lane']:
meta['lane'].append(lane)
if fc_lane not in meta['flowcell_lane']:
meta['flowcell_lane'].append(fc_lane)
fc = fc.lstrip("@")
fc_lane = "{}_{}".format(fc, lane)
if fc not in meta["flowcell"]:
meta["flowcell"].append(fc)
if lane not in meta["lane"]:
meta["lane"].append(lane)
if fc_lane not in meta["flowcell_lane"]:
meta["flowcell_lane"].append(fc_lane)
i += 1

print("{}\t{}\t{}\t{}\t{}\t{}".format(sample, int(i/4),",".join(sorted(meta['flowcell'])),",".join(sorted(meta['lane'])),",".join(sorted(meta['flowcell_lane'])), md5))
print(
"{}\t{}\t{}\t{}\t{}\t{}".format(
sample,
int(i / 4),
",".join(sorted(meta["flowcell"])),
",".join(sorted(meta["lane"])),
",".join(sorted(meta["flowcell_lane"])),
md5,
)
)
2 changes: 1 addition & 1 deletion bin/lofreq_convert.sh
@@ -29,4 +29,4 @@ zcat "${INPUT_FILE}" \
my @data = map { chomp; [ split /=|;/ ] } $_;
$NEW_ROW = "$_\tDP:DP4\t$data[0][1]:$data[0][7]\n";
print $NEW_ROW;
}'
}'