Skip to content

Conversation

@minkinaa
Copy link
Collaborator

This PR address two main issues:

(1) Previously, a single assembly was defined in the config file and if multiple samples were submitted in the input file, all had to be associated with that assembly. This PR makes it possible to structure an input file to have 4 columns: sample, bam, ref, ref_name. See example: config/config_4col.tbl. The yaml file can omit ref and ref_name entirely or leave them blank (example: config/config_4col.yaml). If a single reference can be used with all samples, the original format where the ref and ref_name are only indicated in config.yaml will still work.

(2) Previously, a list of chrs was pulled from the input fasta. This can be problematic if not all of these are present in the input bam. This PR modifies these names to be pulled directly from the bam instead.

Copy link
Member

@mrvollger mrvollger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments that apply in lots of places.

  • Delete old unused code.
  • We should use only the bam header instead of fai_df and the bam header.

FAI = get_fai()
REF_NAME = config["ref_name"]
EXCLUDES = get_excludes()
#REF = get_ref()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete completely instead of commenting

raise ValueError(f"FIRE: reference file {ref} does not exist")
return os.path.abspath(ref)

def get_ref_old(wc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the functions you have added with _old or _orig are not used. if so they should be removed.

return fai

def get_fai(wc):
ref = MANIFEST.loc[wc.sm, "ref"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to get ref you should call get_ref. And then generally it is good to use f strings in situations like this:

ref = get_ref(wc)
fai = f"{ref}.fai"
...


bam_chr_list=[]
input_bam_path=get_input_bam(wc)
input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need extra threads here, I would drop that. Also the "rc" part can be infered and not all inputs will be cram so we dont want it. basically delete the last two args.

input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS)
bam_header_dict = input_bam.header.to_dict()

for line in bam_header_dict['SQ']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length of each chromosome is also available in the bam header. So we should use the bam header only instead of mixing fai_df and this method.

suffix=get_hap_col_suffix,
nzooms=NZOOMS,
chrom=get_chroms()[0],
#chroms=get_chroms,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants