-
Notifications
You must be signed in to change notification settings - Fork 5
Take in multiple inputs with different assemblies; pull chr info from bam instead of fasta #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ser controlled input filtering
…s from input bam instead of fasta
…ent assembly inputs
mrvollger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two comments that apply in lots of places.
- Delete old unused code.
- We should use only the bam header instead of fai_df and the bam header.
workflow/Snakefile
Outdated
| FAI = get_fai() | ||
| REF_NAME = config["ref_name"] | ||
| EXCLUDES = get_excludes() | ||
| #REF = get_ref() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete completely instead of commenting
workflow/rules/common.smk
Outdated
| raise ValueError(f"FIRE: reference file {ref} does not exist") | ||
| return os.path.abspath(ref) | ||
|
|
||
| def get_ref_old(wc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the functions you have added with _old or _orig are not used. if so they should be removed.
workflow/rules/common.smk
Outdated
| return fai | ||
|
|
||
| def get_fai(wc): | ||
| ref = MANIFEST.loc[wc.sm, "ref"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to get ref you should call get_ref. And then generally it is good to use f strings in situations like this:
ref = get_ref(wc)
fai = f"{ref}.fai"
...
workflow/rules/common.smk
Outdated
|
|
||
| bam_chr_list=[] | ||
| input_bam_path=get_input_bam(wc) | ||
| input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need extra threads here, I would drop that. Also the "rc" part can be infered and not all inputs will be cram so we dont want it. basically delete the last two args.
workflow/rules/common.smk
Outdated
| input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS) | ||
| bam_header_dict = input_bam.header.to_dict() | ||
|
|
||
| for line in bam_header_dict['SQ']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length of each chromosome is also available in the bam header. So we should use the bam header only instead of mixing fai_df and this method.
workflow/rules/track-hub.smk
Outdated
| suffix=get_hap_col_suffix, | ||
| nzooms=NZOOMS, | ||
| chrom=get_chroms()[0], | ||
| #chroms=get_chroms, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete comments
…d removed commented unused code
This PR address two main issues:
(1) Previously, a single assembly was defined in the config file and if multiple samples were submitted in the input file, all had to be associated with that assembly. This PR makes it possible to structure an input file to have 4 columns: sample, bam, ref, ref_name. See example: config/config_4col.tbl. The yaml file can omit ref and ref_name entirely or leave them blank (example: config/config_4col.yaml). If a single reference can be used with all samples, the original format where the ref and ref_name are only indicated in config.yaml will still work.
(2) Previously, a list of chrs was pulled from the input fasta. This can be problematic if not all of these are present in the input bam. This PR modifies these names to be pulled directly from the bam instead.