diff --git a/README.md b/README.md
index 96a72fbc..3072a34b 100644
--- a/README.md
+++ b/README.md
@@ -120,6 +120,10 @@ To load your own data into the Sequence Tube Map, see the guide to [Adding Your
 
 Previously we provided a Docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach, either using the [online version](#online-version) or a full installation of the [local version](#local-version). However, if you would like to Dockerize the Sequence Tube Map, the repository includes a `Dockerfile`.
 
+## Using tabix-based index files
+
+More information about using this faster alternative is available in [README.tabix.md](README.tabix.md).
+
 ## Contributing
 
 For information on how to develop on the Sequence Tube Map codebase, please see the [Development Guide](doc/development.md).
diff --git a/README.tabix.md b/README.tabix.md
new file mode 100644
index 00000000..cd6ac8c8
--- /dev/null
+++ b/README.tabix.md
@@ -0,0 +1,120 @@
+## Tabix-based index files
+
+Three files are used, each one indexed with tabix (with an additional `.tbi` file):
+
+1. `nodes.tsv.gz` contains the sequence of each node.
+2. `pos.bed.gz` contains the position (as node intervals) of regions on each haplotype.
+3. `haps.gaf.gz` contains the path followed by each haplotype (split into pieces).
+
+Briefly, these three index files can be quickly queried to extract a subgraph covering a region of interest: the `pos.bed.gz` index first tells us which nodes are covered, then the `nodes.tsv.gz` index gives us the sequence of these nodes, and finally we stitch together the haplotype pieces going through those nodes from the `haps.gaf.gz` index.
+This approach is implemented in the [`chunkix.py`](scripts/chunkix.py) script, which can produce a GFA file or the files used by the sequenceTubeMap.
+The sequenceTubeMap uses this script internally when given tabix-based index files.
+
+## Using tabix-based index files in the sequenceTubeMap
+
+The version on this `tabix` branch can use these index files, for example when they are provided as mounted files:
+
+- the `pos.bed.gz` index in the *graph* field
+- the `nodes.tsv.gz` index in the *node* field
+- the `haps.gaf.gz` index in the *haplotype* field
+
+---
+
+![](images/mount.tabix.index.png)
+
+---
+
+Once the index files are mounted, one can query any region on any haplotype in the form *HAPNAME_CONTIG:START-END*.
+
+Other tracks, for example reads or annotations in bgzipped/indexed GAF files, can be added as *reads* in the menu.
+
+---
+
+![](images/mount.tabix.index.annot.png)
+
+---
+
+Of note, you can set a color for each track using the existing palettes or by picking a specific color.
+
+---
+
+![](images/mount.tabix.index.annot.color.png)
+
+---
+
+## Using the Docker container
+
+A Docker container with this new sequenceTubeMap version and all the necessary dependencies is available at `quay.io/jmonlong/sequencetubemap:tabix_dev`.
+
+To use it, run:
+
+```sh
+docker run -it -p 3210:3000 -v `pwd`:/data quay.io/jmonlong/sequencetubemap:tabix_dev
+```
+
+Of note, the `-p` option maps port 3000 inside the container to port 3210 on the host.
+In practice, pick any unused port.
+
+Then open: http://localhost:3210/
+
+Note: For mounted files, this assumes all files (pangenomes, reads, annotations) are in the current working directory or in subdirectories.
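+
+When given tabix-based index files, the sequenceTubeMap runs the `chunkix.py` script internally, but you can also run it by hand to extract a chunk.
+Below is a minimal sketch, assuming the HPRC index files listed in the next section have been downloaded and that `tabix` is on your PATH; the region is the one used as the default in the Docker config:
+
+```sh
+python3 scripts/chunkix.py \
+    -p hprc.pos.bed.gz \
+    -n hprc.nodes.tsv.gz \
+    -g hprc.haps.gaf.gz \
+    -r GRCh38#0#chr17:7674450-7675333 \
+    -o chunk -j
+# writes chunk.gfa; with -j it also writes chunk.graph.json,
+# chunk.annotate.txt, and chunk.regions.tsv for the tube map
+```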
+
+To test with files that are already prepared, download all the files (see below).
+Then, either use them as *custom* data, adding the tracks with the *Configure Tracks* button, or use the prepared data set "HPRC Minigraph-Cactus v1.1".
+For reference, the files for this data set are defined in the [config.json file](docker/config.json) used to build the Docker image.
+
+## Available tabix-based index files for the Minigraph-Cactus v1.1 pangenome
+
+Index files and some annotations have been deposited at https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/
+
+To download them all:
+
+```sh
+# pangenome index files
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz.tbi
+
+# annotation files
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz.tbi
+```
+
+## Building tabix-based index files from a GFA
+
+### Optional: make a GFA from a GBZ file
+
+In some cases, you will want to use exactly the same pangenome space as a specific GBZ file, for example to visualize reads or annotations on that pangenome.
+The GFA provided in the HPRC repo might not match exactly because some nodes may have been split when making the GBZ file.
+You can convert a GBZ to a GFA (without translating the nodes back to the original GFA IDs) with:
+
+```sh
+vg convert --no-translation -f -t 4 hprc-v1.1-mc-grch38.gbz | gzip > hprc-v1.1-mc-grch38.gfa.gz
+```
+
+### Run the `pgtabix.py` Python script
+
+The `pgtabix.py` script can be found in the [`scripts` directory](scripts).
+It is also present in the `/build/sequenceTubeMap/scripts` directory of the Docker container `quay.io/jmonlong/sequencetubemap:tabix_dev`.
+
+```sh
+python3 pgtabix.py -g hprc-v1.1-mc-grch38.gfa.gz -o output.prefix
+```
+
+It takes about 1.5 to 2 hours to build the index files for the Minigraph-Cactus v1.1 pangenome.
+This process should scale linearly with the number of haplotypes.
+
+## Making your own annotation files
+
+To make your own annotation files, we have developed a pipeline that projects haplotype-level annotation files (e.g. BED, GFF) onto a pangenome (e.g. GBZ).
+Once the projected GAF files are sorted, bgzipped, and indexed, they can be queried quickly, for example by the sequenceTubeMap.
+
+The pipeline is described in the [manuscript](https://jmonlong.github.io/manu-vggafannot/) and its scripts and documentation are deposited in [the GitHub repository](https://github.com/jmonlong/manu-vggafannot?tab=readme-ov-file).
+In particular, examples of how annotation files were projected for this manuscript are described in [this section](https://github.com/jmonlong/manu-vggafannot/tree/main/analysis/annotate).
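+
+As a rough sketch, sorting, bgzipping, and indexing a projected GAF could look like the following (the file name is a placeholder; this mirrors how `pgtabix.py` prepares the haplotype GAF and assumes your vg and htslib versions support sorting and indexing GAF files):
+
+```sh
+# sort the projected annotations by node ID, then bgzip and index them
+vg gamsort -G annotations.gaf | bgzip > annotations.sorted.gaf.gz
+tabix -p gaf annotations.sorted.gaf.gz
+```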
diff --git a/docker/Dockerfile b/docker/Dockerfile index bea3e7a2..ddf743ca 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -6,22 +6,37 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true ENV TZ=America/Los_Angeles - # install basic apt dependencies # note: most vg apt dependencies are installed by "make get-deps" below RUN apt-get -qq update && apt-get -qq install -y \ - git \ - wget \ - less \ - npm \ - nano - + git \ + wget \ + less \ + npm \ + nano \ + make \ + g++ \ + gcc \ + zlib1g-dev \ + libbz2-dev \ + liblzma-dev \ + python3 \ + build-essential + +# install tabix/bgzip +RUN wget --quiet --no-check-certificate https://github.com/samtools/htslib/releases/download/1.21/htslib-1.21.tar.bz2 && \ + tar -xjvf htslib-1.21.tar.bz2 && \ + cd htslib-1.21 && \ + ./configure && \ + make && make install + +# install node RUN npm cache clean -f RUN npm install -g n && n stable # download vg binary -RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.59.0/vg \ +RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.64.1/vg \ && mv vg /bin/vg && chmod +x /bin/vg WORKDIR /build diff --git a/docker/config.json b/docker/config.json index 0bde79ad..389e9451 100644 --- a/docker/config.json +++ b/docker/config.json @@ -13,6 +13,22 @@ "simplify": false, "removeSequences": false }, + { + "name": "HPRC Minigraph-Cactus v1.1", + "tracks": [ + {"trackFile": "/data/hprc.pos.bed.gz", "trackType": "graph", "trackColorSettings": {"mainPalette": "ygreys", "auxPalette": "greys"}}, + {"trackFile": "/data/hprc.nodes.tsv.gz", "trackType": "node"}, + {"trackFile": "/data/hprc.haps.gaf.gz", "trackType": "haplotype"}, + {"trackFile": "/data/gene_exon.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "reds", "auxPalette": "reds"}}, + {"trackFile": "/data/rm.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "blues", "auxPalette": "blues"}}, + {"trackFile": "/data/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz", + "trackType": "read", "trackColorSettings": {"mainPalette": "plainColors", "auxPalette": "plainColors"}} + ], + "region": "GRCh38#0#chr17:7674450-7675333", + "dataType": "built-in", + "simplify": false, + "removeSequences": false + }, { "name": "Lancet example", "tracks": [ @@ -48,6 +64,7 @@ } ], "vgPath": [""], + "chunkixPath": ["/data", "scripts"], "dataPath": "/data", "internalDataPath": "exampleData/internal/", "tempDirPath": "temp", @@ -57,19 +74,22 @@ "defaultGraphColorPalette" : { "mainPalette": "#000000", "auxPalette": "greys", - "colorReadsByMappingQuality": false + "colorReadsByMappingQuality": false, + "alphaReadsByMappingQuality": false }, "defaultHaplotypeColorPalette" : { "mainPalette": "plainColors", "auxPalette": "lightColors", - "colorReadsByMappingQuality": false + "colorReadsByMappingQuality": false, + "alphaReadsByMappingQuality": false }, "defaultReadColorPalette" : { "mainPalette": "blues", "auxPalette": "reds", - "colorReadsByMappingQuality": false + "colorReadsByMappingQuality": false, + "alphaReadsByMappingQuality": false }, "defaultTrackProps" : { @@ -77,7 +97,8 @@ "trackColorSettings": { "mainPalette": "#000000", "auxPalette": "greys", - "colorReadsByMappingQuality": false + "colorReadsByMappingQuality": false, + "alphaReadsByMappingQuality": false } }, diff --git a/images/mount.tabix.index.annot.color.png b/images/mount.tabix.index.annot.color.png new file mode 100644 index 00000000..ac1ddf1a Binary files /dev/null and b/images/mount.tabix.index.annot.color.png differ diff --git 
a/images/mount.tabix.index.annot.png b/images/mount.tabix.index.annot.png new file mode 100644 index 00000000..bef93375 Binary files /dev/null and b/images/mount.tabix.index.annot.png differ diff --git a/images/mount.tabix.index.png b/images/mount.tabix.index.png new file mode 100644 index 00000000..891fc1c8 Binary files /dev/null and b/images/mount.tabix.index.png differ diff --git a/scripts/chunkix.py b/scripts/chunkix.py new file mode 100755 index 00000000..bfe7d2dc --- /dev/null +++ b/scripts/chunkix.py @@ -0,0 +1,628 @@ +import argparse +import subprocess +import json +import sys + + +def parsePath(path_raw): + cur_node = '' + cur_orient = '' + path = [] + fwd_diff = 0 + for ii in range(len(path_raw)): + if path_raw[ii] == '<' or path_raw[ii] == '>': + if cur_node != '': + path.append([cur_node, cur_orient]) + cur_orient = path_raw[ii] == '>' + if cur_orient: + fwd_diff += 1 + else: + fwd_diff += -1 + cur_node = '' + else: + cur_node += path_raw[ii] + if cur_node != '': + path.append([cur_node, cur_orient]) + return (path) + + +def parseCigar(cs_tag, cs_type='cs'): + cs = [] + val = '' + edit = '' + for ii in range(len(cs_tag)): + if cs_type == 'cs': + # ex: cs:Z:+ATG:360*AT-G:1*CG:137 + if cs_tag[ii] in [':', '*', '+', '-']: + if val != '': + cs.append([edit, val]) + val = '' + edit = cs_tag[ii] + else: + val += cs_tag[ii] + elif cs_type == 'cg': + # ex: cg:Z:24=1D185=1X1151=1X87=1I157= + if cs_tag[ii] in ['=', 'X', 'I', 'D']: + edit = cs_tag[ii] + if edit == '=': + edit = ':' + elif edit == 'X': + edit = '*' + val = 'N' * int(val) + val = val + val + elif edit == 'I': + edit = '+' + val = 'N' * int(val) + elif edit == 'D': + edit = '-' + val = 'N' * int(val) + cs.append([edit, val]) + val = '' + else: + val += cs_tag[ii] + # for cs type, don't forget to insert the last 'ongoing' one + if cs_type == 'cs' and val != '': + cs.append([edit, val]) + return (cs) + + +def prepareJsonMapOnNode(node_orient, offset, cigar, nodes_seq): + json_map = {} + json_map['edit'] = [] + json_map['position'] = {} + if not node_orient[1]: + json_map['position']['is_reverse'] = True + json_map['position']['node_id'] = node_orient[0] + if node_orient[0] in nodes_seq and len(cigar) > 0: + bp_remaining = len(nodes_seq[node_orient[0]]) - offset + while bp_remaining > 0 and len(cigar) > 0: + # get the next cigar operation + next_co = cigar.pop(0) + cig_bp = 0 + # update edits, remaining bases, and cigar + if next_co[0] == ':': + # matches + cig_bp = int(next_co[1]) + if cig_bp > bp_remaining: + json_map['edit'].append({'from_length': bp_remaining, + 'to_length': bp_remaining}) + # add back remaining cigar operation + next_co[1] = str(cig_bp - bp_remaining) + cigar.insert(0, next_co) + # no more bases remaining in this node + bp_remaining = 0 + else: + json_map['edit'].append({'from_length': cig_bp, + 'to_length': cig_bp}) + bp_remaining += - cig_bp + elif next_co[0] == '*': + # substitution + json_map['edit'].append({'from_length': 1, + 'sequence': next_co[1][1], + 'to_length': 1}) + bp_remaining += -1 + elif next_co[0] == '+': + # insertion in read + json_map['edit'].append({'sequence': next_co[1], + 'to_length': len(next_co[1])}) + elif next_co[0] == '-': + # deletion in read + cig_bp = len(next_co[1]) + if cig_bp > bp_remaining: + json_map['edit'].append({'from_length': bp_remaining}) + # update cigar and add it back at the front + next_co[1] = next_co[1][bp_remaining:] + cigar.insert(0, next_co) + # no more bases remaining in this node + bp_remaining = 0 + else: + json_map['edit'].append({'from_length': 
cig_bp}) + bp_remaining += - cig_bp + # DEBUG test: if the next operation is an insert, add it to this node + if len(cigar) > 0 and cigar[0][0] == '+' and node_orient[1]: + next_co = cigar.pop(0) + # insertion in read + json_map['edit'].append({'sequence': next_co[1], + 'to_length': len(next_co[1])}) + # don't forget to specify offset + if offset > 0: + json_map['position']['offset'] = str(offset) + elif node_orient[0] in nodes_seq: + # we're here if we know about this node but a node before was missing + # we can't trust the cigar anymore so we just write full node overlap + nlen = len(nodes_seq[node_orient[0]]) + json_map['edit'].append({'from_length': nlen, + 'to_length': nlen}) + else: + # if we don't know this node, assumes it's 1bp long and forget + # about the cigar/offset + json_map['edit'].append({'from_length': 1, + 'to_length': 1}) + # to remember that we had a missing node (and can't trust the cigar for + # the other nodes) we clear the cigar info + cigar.clear() + return (json_map) + + +parser = argparse.ArgumentParser() +parser.add_argument('-n', help='indexed nodes', required=True) +parser.add_argument('-p', help='indexed positions BED', required=True) +parser.add_argument('-g', help='indexed haplotypes GAF', required=True) +parser.add_argument('-r', help='region to query', required=True) +parser.add_argument('-a', default=[], action='append', + help='indexed annotations (optional). Can repeat.') +parser.add_argument('-o', help='output prefix', default='chunk') +parser.add_argument('-j', help='also make JSON output (e.g. for tubemap)', + action='store_true') +parser.add_argument('-s', help='simplify haplotypes by merging identical ones' + ' (for the JSON output)', action='store_true') +args = parser.parse_args() + +# args = parser.parse_args(['-n', 'pg.nodes.tsv.bgz', +# '-g', 'pg.haps.gaf.gz', +# '-p', 'pg.pos.bed.gz', +# '-a', 'pg.haps.gaf.gz', +# '-r', 'GRCh38#0#chr6:3000000-3001000']) + +# parse queried region +ref_path_name = args.r.split(':')[0] +qint = args.r.split(':')[1] +qstart = int(qint.split('-')[0]) +qend = int(qint.split('-')[1]) + +# extract nodes in region +cmd = ['tabix', args.p, args.r] +try: + cmd_o = subprocess.run(cmd, check=True, capture_output=True) +except subprocess.CalledProcessError as e: + sys.exit(e.stderr.decode()) +nodes_bed = cmd_o.stdout.decode().rstrip().split('\n') +nodes_bed = [line.split('\t') for line in nodes_bed] + +# stop if no nodes were found in that region +if len(nodes_bed) == 1 and len(nodes_bed[0]) == 1 and nodes_bed[0][0] == '': + sys.exit('No nodes overlap this region.') + +# find minimum and maximum node IDs and reference positions +min_node = int(nodes_bed[0][3]) +max_node = int(nodes_bed[0][3]) +# loop over all extracted nodes +for nbed in nodes_bed: + node_s = int(nbed[3]) + node_e = int(nbed[4]) + min_node = min(min_node, node_s, node_e) + max_node = max(max_node, node_s, node_e) +max_node += 1 + +# extract haplotypes in region +cmd = ['tabix', args.g, '{{node}}:{}-{}'.format(min_node, max_node)] +try: + cmd_o = subprocess.run(cmd, check=True, capture_output=True) +except subprocess.CalledProcessError as e: + sys.exit(e.stderr.decode()) +haps_gaf = cmd_o.stdout.decode().rstrip().split('\n') +haps_gaf = [line.split('\t') for line in haps_gaf] + +# organize haplotype subpaths +subhaps_path = {} +for hgaf in haps_gaf: + # extract contig name and position from rc tag + rc_ii = 12 + while 'rc:Z:' not in hgaf[rc_ii]: + rc_ii += 1 + rc = hgaf[rc_ii].replace('rc:Z:', '').split(':') + hname = rc[0] + hstart = int(rc[1]) + hend = 
int(rc[1]) + int(hgaf[1]) + # if queried path, make sure it overlaps the queried range + if hname == ref_path_name and (hend < qstart or hstart > qend): + # skip, if not + continue + # parse path + path = parsePath(hgaf[5]) + # make sure at least one node is in the node range + overlap_range = False + for no in path: + if int(no[0]) >= min_node and int(no[0]) <= max_node: + overlap_range = True + break + if not overlap_range: + continue + # save all in dict + if hname not in subhaps_path: + subhaps_path[hname] = {'paths': {}, 'pos': [], 'end': {}} + # save path + subhaps_path[hname]['paths'][rc[1]] = path + # save position in haplotype + subhaps_path[hname]['pos'].append(hstart) + subhaps_path[hname]['end'][rc[1]] = hend + +# stitch into one path per haplotype +haps_full_paths = {} +min_ref_pos = 0 +all_perfectly_stitched = True +for hname in subhaps_path: + # sort the positions + pos = subhaps_path[hname]['pos'] + pos.sort() + # remember the start position of ref path + if hname == ref_path_name: + min_ref_pos = pos[0] + # stitch paths + paths = [[]] + prev_end = -1 + for pp in pos: + if prev_end != -1 and prev_end != pp: + # end position of the previous chunk doesn't match current position + # we shouldn't stich and instead create another sub-path + paths.append([]) + all_perfectly_stitched = False + paths[-1] += subhaps_path[hname]['paths'][str(pp)] + prev_end = subhaps_path[hname]['end'][str(pp)] + # save the haplotype path or each pieces (if not stitched in one piece) + if len(paths) == 1: + haps_full_paths[hname] = paths[0] + else: + for ii, path in enumerate(paths): + haps_full_paths['{}_p{}'.format(hname, ii)] = path + +if all_perfectly_stitched: + print('All haplotypes could be stitched back in one piece') +else: + print('Some haplotypes could not be stitched back in one piece') + +# to save the "default" reference path +nodes_ref = {} +for no in haps_full_paths[ref_path_name]: + nodes_ref[no[0]] = True +# this is used for a first round of "trimming" of the haplotypes +# another round is done later to match the queried range better + +# prepare paths before the actual round of trimming +haps_paths_pretrim = {} +# this is to know which nodes to include (and get information later) +nodes_inc = {} +# process each haplotype path +for hname in haps_full_paths: + path = haps_full_paths[hname] + # trim ends of path until they touch a node on ref (queried) path + path_beg = 0 + while path[path_beg][0] not in nodes_ref and path_beg < len(path) - 1: + path_beg += 1 + path_end = len(path) - 1 + while path[path_end][0] not in nodes_ref and path_end > 0: + path_end += -1 + path_end += 1 + path = path[path_beg:path_end] + # skip empty path + if len(path) == 0: + continue + # otherwise save path + haps_paths_pretrim[hname] = path + for no in path: + nodes_inc[no[0]] = True + +# optional extract and write GAF +annot_forjsons_l = [] +# eventually other nodes that we'll need to include (just in case) +nodes_inc_others = {} +if args.a != '': + for annot_idx, annot_fn in enumerate(args.a): + # extract with tabix + cmd = ['tabix', annot_fn, '{{node}}:{}-{}'.format(min_node, max_node)] + try: + cmd_o = subprocess.run(cmd, check=True, capture_output=True) + except subprocess.CalledProcessError as e: + sys.exit(e.stderr.decode()) + annot_gaf = cmd_o.stdout.decode().rstrip().split('\n') + agaf_of = open(args.o + '.{}.annot.gaf'.format(annot_idx), 'wt') + annot_forjson = [] + for gaf_line_r in annot_gaf: + # prepare the json output, just in case + if gaf_line_r == '': + continue + gaf_line = 
gaf_line_r.split('\t') + path = parsePath(gaf_line[5]) + # check if at least one node is in subgraph of interest + no_overlap = True + for no in path: + if no[0] in nodes_inc: + no_overlap = False + break + if no_overlap: + continue + # it touches the subgraph, we can write it + agaf_of.write(gaf_line_r + '\n') + # prepare JSON information + apath = {'name': gaf_line[0]} + apath['mapq'] = int(gaf_line[11]) + apath['pathl_remain'] = int(gaf_line[1]) + apath['offset'] = int(gaf_line[7]) + apath['path'] = path + # check is nodes are missing + for nidx, no in enumerate(path): + if no[0] not in nodes_inc: + nodes_inc_others[no[0]] = True + # add optional tags as "sample_name" + if len(gaf_line) > 12: + # TODO: don't include base qualities + tags = {} + for tag in gaf_line[12:]: + pos_col = tag.index(':') + tagn = tag[0:pos_col] + tagv = tag[(tag.index(':', pos_col+1)+1):] + tags[tagn] = tagv + apath['tags'] = tags + # save path info + annot_forjson.append(apath) + agaf_of.close() + annot_forjsons_l.append(annot_forjson) + +# update nodes to include with the potentially new ones +for node in nodes_inc_others: + nodes_inc[node] = True + +# find sequence for nodes to include +nodes_seq = {} +# first, cluster them to avoid too many queries +# to cluster, sort by node ID +sorted_nodes = [int(node) for node in nodes_inc] +sorted_nodes.sort() +# then group if next ID is not too far +node_cls = [] +max_node_jump = 1000 +for node in sorted_nodes: + if len(node_cls) == 0: + node_cls.append([node]) + else: + if node_cls[-1][-1] + max_node_jump < node: + # print(node - node_cls[-1][-1] + max_node_jump) + # we need a new node cluster + node_cls.append([node]) + else: + node_cls[-1].append(node) +# query node information for each cluster +for ncl in node_cls: + print('Searching for {}-{} node sequences.'.format(ncl[0], ncl[-1])) + cmd = ['tabix', args.n, 'n:{}-{}'.format(ncl[0], ncl[-1])] + try: + cmd_o = subprocess.run(cmd, check=True, capture_output=True) + except subprocess.CalledProcessError as e: + sys.exit(e.stderr.decode()) + pos_tsv = cmd_o.stdout.decode().rstrip().split('\n') + for posr in pos_tsv: + posr = posr.split('\t') + if posr[1] in nodes_inc: + nodes_seq[posr[1]] = posr[2] + +# second round of trimming to match the queried range exactly +# trim the "reference" path first +cur_pos = min_ref_pos +max_ref_pos = 0 +nodes_ref = {} +within_range = False +for no in haps_full_paths[ref_path_name]: + if cur_pos + len(nodes_seq[no[0]]) >= qstart and len(nodes_ref) == 0: + # if we weren't within range, now we are + within_range = True + min_ref_pos = cur_pos + if within_range: + nodes_ref[no[0]] = True + if cur_pos + len(nodes_seq[no[0]]) >= qend and within_range: + # we'll be out of range next node + within_range = False + max_ref_pos = cur_pos + len(nodes_seq[no[0]]) + cur_pos += len(nodes_seq[no[0]]) +assert max_ref_pos > 0, 'no end position found, pb with haplotype stitching?' 
+# now trim all the haplotypes until they touch these nodes +haps_paths = {} +nodes_inc = {} +# also simplify haplotypes and tally hap frequency +haps_freq = {} +haps_hash = {} +haps_names = {} +# and save edges +edges_l = {} +lfmt = 'L\t{}\t{}\t{}\t{}\t0M' +orient_v = ['-', '+'] +# process each path +for hname in haps_paths_pretrim: + path = haps_paths_pretrim[hname] + # trim ends of path until they touch a node on ref (queried) path + path_beg = 0 + while path[path_beg][0] not in nodes_ref and path_beg < len(path) - 1: + path_beg += 1 + path_end = len(path) - 1 + while path[path_end][0] not in nodes_ref and path_end > 0: + path_end += -1 + path_end += 1 + path = path[path_beg:path_end] + # skip empty path + if len(path) == 0: + continue + # otherwise save path + # maybe check if we want to simplify it first + if args.s: + # make a string ID of the path and check if identical already present + path_hash = '_'.join(['{}_{}'.format(ro[0], ro[1]) for ro in path]) + if hname == ref_path_name: + # keep the reference path separate + haps_paths[hname] = path + haps_freq[hname] = 1 + haps_names[hname] = hname + elif path_hash in haps_hash: + # if present, increment frequency + haps_freq[haps_hash[path_hash]] += 1 + haps_names[haps_hash[path_hash]] += ', ' + hname + else: + # otherwise, save and init the hash dict + haps_paths[hname] = path + haps_freq[hname] = 1 + haps_hash[path_hash] = hname + haps_names[hname] = hname + else: + haps_paths[hname] = path + haps_names[hname] = hname + haps_freq[hname] = 1 + # save edges and nodes to include + prev_no = [] + for no in path: + nodes_inc[no[0]] = True + if len(prev_no) != 0: + ename = '{}_{}-{}_{}'.format(prev_no[0], prev_no[1], + no[0], no[1]) + if ename not in edges_l: + edges_l[ename] = lfmt.format(prev_no[0], orient_v[prev_no[1]], + no[0], orient_v[no[1]]) + prev_no = no + +print('{} nodes, {} edges, {} paths in subgraph'.format(len(nodes_inc), + len(edges_l), + len(haps_paths_pretrim))) + +# write GFA +gfa_of = open(args.o + '.gfa', 'wt') +# write header +gfa_of.write('H\tVN:Z:1.1\n') +# write nodes +for node in nodes_seq: + # currently we keep nodes in that new subgraph of interest and nodes that + # might be touched by the annotations. Could be too much but that might + # avoid some errors later? Of note those extra nodes won't be on any + # haplotype or edges. 
So it's only useful is the sequenceTubeMap wants + # the node information/sequence to represent the reads/annotations + if node in nodes_inc or node in nodes_inc_others: + gfa_of.write('S\t{}\t{}\n'.format(node, nodes_seq[node])) +# write paths +for hname in haps_paths: + # guess sample/haplotype/contig names + hap = 0 + contig = hname + sample = hname + hname_v = hname.split('#') + if len(hname_v) == 2: + # reference path as "sample#contig" + sample = hname_v[0] + contig = hname_v[1] + elif len(hname_v) == 3: + # reference path as "sample#contig" + sample = hname_v[0] + hap = hname_v[1] + contig = hname_v[2] + # compute path size and string + psize = 0 + pstring = '' + for no in haps_paths[hname]: + psize += len(nodes_seq[no[0]]) + pstring += ['<', '>'][no[1]] + no[0] + # prepare W line + wline = 'W\t{}\t{}\t{}\t0'.format(sample, hap, contig) + wline += '\t{}\t{}\n'.format(psize, pstring) + gfa_of.write(wline) +# write edges +for ename in edges_l: + gfa_of.write(edges_l[ename] + '\n') +gfa_of.close() + +# write JSON output of graph +nodes_in_graph = {} +if args.j: + json_pg = {} + # nodes + json_pg['node'] = [] + for node in nodes_seq: + if node in nodes_inc or node in nodes_inc_others: + json_pg['node'].append({'id': node, + 'sequence': nodes_seq[node]}) + # paths + json_pg['path'] = [] + # the reference path should be first + path_names = list(haps_paths.keys()) + path_names.remove(ref_path_name) + path_names.insert(0, ref_path_name) + for hname in path_names: + json_path = {} + json_path['name'] = haps_names[hname] + json_path['mapping'] = [] + for nidx, no in enumerate(haps_paths[hname]): + json_map = {} + json_map['edit'] = {} + node_size = len(nodes_seq[no[0]]) + json_map['edit']['from_length'] = str(node_size) + json_map['edit']['to_length'] = str(node_size) + json_map['position'] = {} + if not no[1]: + json_map['position']['is_reverse'] = True + json_map['position']['node_id'] = no[0] + json_map['rank'] = str(nidx + 1) + json_path['mapping'].append(json_map) + nodes_in_graph[no[0]] = True + json_pg['path'].append(json_path) + json_of = open(args.o + '.graph.json', 'wt') + json.dump(json_pg, json_of) + json_of.close() + # also make an annotate.txt file with frequency of each haplotype + annot_of = open(args.o + '.annotate.txt', 'wt') + for hname in path_names: + annot_of.write('{}\t{}\n'.format(haps_names[hname], haps_freq[hname])) + annot_of.close() + # also a region.tsv file with the reference path name and region offset + reg_of = open(args.o + '.regions.tsv', 'wt') + reg_of.write('{}\t{}\t{}\n'.format(ref_path_name, + min_ref_pos, max_ref_pos)) + reg_of.close() + +# write JSON output for the annotations +if args.a != '' and args.j: + for annot_idx, annot_forjson in enumerate(annot_forjsons_l): + annot_json = [] + for apath in annot_forjson: + path = apath['path'] + json_read = {'name': apath['name']} + json_read['mapping_quality'] = apath['mapq'] + json_mapping = [] + pathl_remain = apath['pathl_remain'] + offset = apath['offset'] + cigar = [] + if 'cs' in apath['tags']: + cigar = parseCigar(apath['tags']['cs']) + elif 'cg' in apath['tags']: + cigar = parseCigar(apath['tags']['cg'], cs_type='cg') + no_overlap = True + for nidx, no in enumerate(path): + if nidx != 0: + offset = 0 + json_map = prepareJsonMapOnNode(no, offset, cigar, nodes_seq) + json_map['rank'] = str(nidx + 1) + json_mapping.append(json_map) + if json_map['position']['node_id'] in nodes_in_graph: + no_overlap = False + # skip paths that don't overlap with subgraph + if no_overlap: + continue + # trim ends + 
while json_mapping[0]['position']['node_id'] not in nodes_in_graph: + json_mapping.pop(0) + while json_mapping[-1]['position']['node_id'] not in nodes_in_graph: + json_mapping.pop() + json_read['path'] = {'mapping': json_mapping} + # add optional tags as "sample_name" + if 'tags' in apath: + opts = [] + for tagn in apath['tags']: + tagv = apath['tags'][tagn] + # filter keep some tags + if tagn in ['bq']: + continue + # shorten long tag values + if len(tagv) > 20: + tagv = tagv[:20] + '...' + opts.append('{}={}'.format(tagn, tagv)) + opts = ' '.join(opts) + # json_read['sample_name'] = opts.replace(',', ', ') + annot_json.append(json_read) + # write JSON + ajson_of = open(args.o + '.{}.annot.json'.format(annot_idx), 'wt') + for ajson in annot_json: + ajson_of.write(json.dumps(ajson) + "\n") + ajson_of.close() diff --git a/scripts/pgtabix.py b/scripts/pgtabix.py new file mode 100644 index 00000000..7ad9882c --- /dev/null +++ b/scripts/pgtabix.py @@ -0,0 +1,218 @@ +import gzip +import sys +import argparse +import datetime +# from subprocess import Popen,PIPE +import subprocess + +parser = argparse.ArgumentParser(description='Build tabix index files for a pangenome.') +parser.add_argument('-g', help='pangenome in GFA', required=True) +parser.add_argument('-o', help='output prefix', default='pg') +parser.add_argument('-s', help='size of haplotype blocks to save', default=100) +args = parser.parse_args() + +nodes = {} +orients = ['<', '>'] + + +# TODO try without auto_flip. Don't remember if it's really necessary. +def parsePath(path_raw, auto_flip=True): + cur_node = '' + cur_orient = '' + path = [] + fwd_diff = 0 + for ii in range(len(path_raw)): + if path_raw[ii] == '<' or path_raw[ii] == '>': + if cur_node != '': + cur_node = int(cur_node) + path.append([cur_node, cur_orient]) + cur_orient = path_raw[ii] == '>' + if cur_orient: + fwd_diff += 1 + else: + fwd_diff += -1 + cur_node = '' + else: + cur_node += path_raw[ii] + if cur_node != '': + cur_node = int(cur_node) + path.append([cur_node, cur_orient]) + # flip path if mostly in reverse? 
+ if auto_flip and fwd_diff < 0: + path.reverse() + for ii in range(len(path)): + path[ii][1] = not path[ii][1] + return (path) + + +def parsePathP(path_raw): + cur_node = '' + path = [] + for ii in range(len(path_raw)): + if path_raw[ii] == '+' or path_raw[ii] == '-': + cur_node = int(cur_node) + path.append([cur_node, path_raw[ii] == '+']) + cur_node = '' + elif path_raw[ii] == ',': + continue + else: + cur_node += path_raw[ii] + return (path) + + +def prepareHapChunk(path, nodes, name='ref', coord=''): + gaf_v = name + # path length/start/end/strand + path_len = sum([nodes[no[0]][0] for no in path]) + gaf_v += '\t{}\t0\t{}\t+'.format(path_len, path_len) + # path information: string representation and length + path_string = '' + min_node = path[0][0] + max_node = path[0][0] + for no in path: + path_string += orients[no[1]] + str(no[0]) + min_node = min(min_node, no[0]) + max_node = max(max_node, no[0]) + gaf_v += '\t{}\t{}'.format(''.join(path_string), path_len) + gaf_v += '\t{}\t{}'.format(0, path_len) + # residues matching, alignment block size, and mapping quality + fake_mapq = 60 + gaf_v += '\t{}\t{}\t{}'.format(path_len, path_len, fake_mapq) + # cigar string + gaf_v += '\tcs:Z::{}'.format(path_len) + # coordinate tag + if coord != '': + gaf_v += '\trc:Z:{}'.format(coord) + return ({'gaf': gaf_v, 'path_len': path_len, + 'min_node': min_node, 'max_node': max_node}) + + +# To only do one pass, let's assume that (all) the nodes are defined +# before (any) paths in the GFA +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Reading {}...'.format(args.g), file=sys.stderr) +if args.g.endswith('gz'): + gfa_inf = gzip.open(args.g, 'rt') +else: + gfa_inf = open(args.g, 'rt') + +# start processes to sort and bgzip output +h_gz_cmd = "vg gamsort -G - | bgzip > " + args.o + '.haps.gaf.gz' +h_gz_p = subprocess.Popen(h_gz_cmd, stdin=subprocess.PIPE, shell=True) +p_gz_cmd = "sort -k1V -k2n -k3n | bgzip > " + args.o + '.pos.bed.gz' +p_gz_p = subprocess.Popen(p_gz_cmd, stdin=subprocess.PIPE, shell=True) + +pos_fmt = '{}\t{}\t{}\t{}\t{}\n' +nodes_written = False +npaths = 0 +for line in gfa_inf: + line = line.rstrip().split('\t') + if line[0] == 'S': + # saving both the size and sequence (maybe faster than recomputing the + # size every time it's needed later) + nodes[int(line[1])] = [len(line[2]), line[2]] + else: + if len(nodes) > 0 and not nodes_written: + # we were reading nodes and we've just finished + # write them, in ascending order in a file + print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Writing node file...', file=sys.stderr) + n_gz_p = subprocess.Popen("bgzip > " + args.o + ".nodes.tsv.gz", + stdin=subprocess.PIPE, shell=True) + nodes_ord = list(nodes.keys()) + nodes_ord.sort() + for node in nodes_ord: + tow = 'n\t{}\t{}\n'.format(node, nodes[node][1]) + n_gz_p.stdin.write(tow.encode()) + n_gz_p.stdin.close() + n_gz_p.wait() + print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Node information written to ' + '{}.'.format(args.o + ".nodes.tsv.gz"), file=sys.stderr) + nodes_written = True + # create tabix index + print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Indexing nodes', file=sys.stderr) + idx_cmd = ['tabix', '-f', '-s', '1', '-b', '2', '-e', '2', + args.o + ".nodes.tsv.gz"] + subprocess.run(idx_cmd, check=True) + print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Nodes indexed.', file=sys.stderr) + # now potentially prepare the path information + path = [] + if line[0] == 'W': + path = 
parsePath(line[6]) + sampn = line[1] + hapn = '#' + line[2] + pathn = '#' + line[3] + startpos = int(line[4]) + elif line[0] == 'P' and line[1].startswith(args.r): + path = parsePathP(line[2]) + sampn = line[1] + pathn = '' + hapn = '' + startpos = 0 + if len(path) > 0: + # we've parsed a path + # save node positions + # save haplotype in GAF chunks + path_pos = 0 + hap_pos = startpos + while path_pos < len(path): + path_end = min(len(path), path_pos + args.s) + seqn = '{}{}{}'.format(sampn, hapn, pathn) + coordn = '{}:{}'.format(seqn, hap_pos) + hapchunk = prepareHapChunk(path[path_pos:path_end], + nodes, name=sampn, + coord=coordn) + # write GAF record + tow = hapchunk['gaf'] + '\n' + h_gz_p.stdin.write(tow.encode()) + # write position BED record + tow = pos_fmt.format(seqn, hap_pos, + hap_pos + hapchunk['path_len'], + hapchunk['min_node'], + hapchunk['max_node']) + p_gz_p.stdin.write(tow.encode()) + # update position on haplotype + hap_pos += hapchunk['path_len'] + # go to next chunk + path_pos = path_end + npaths += 1 + if npaths % 10000 == 0: + print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Wrote info for {} paths'.format(npaths), + file=sys.stderr) +gfa_inf.close() + +# close position BED bgzip process +p_gz_p.stdin.close() + +# close haps GAF bgzip process +h_gz_p.stdin.close() + +# wait for sorting/compressing processes to finish +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Waiting for sort/bgzip processes to finish.', file=sys.stderr) +p_gz_p.wait() +h_gz_p.wait() +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Output files sorted and bgzipped.', file=sys.stderr) + +# create tabix index +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Indexing position BED', file=sys.stderr) +idx_cmd = ['tabix', '-p', 'bed', args.o + ".pos.bed.gz"] +subprocess.run(idx_cmd, check=True) +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Position BED indexed.', file=sys.stderr) + +# create tabix index +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Indexing haplotype GAF', file=sys.stderr) +idx_cmd = ['tabix', '-p', 'gaf', args.o + ".haps.gaf.gz"] +subprocess.run(idx_cmd, check=True) +print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + '- Haplotype GAF indexed.', file=sys.stderr) + +exit(0) diff --git a/src/App.js b/src/App.js index 271741fd..24d25228 100644 --- a/src/App.js +++ b/src/App.js @@ -65,6 +65,7 @@ class App extends Component { showReads: true, showSoftClips: true, colorReadsByMappingQuality: false, + alphaReadsByMappingQuality: false, colorSchemes: getColorSchemesFromTracks(this.defaultViewTarget.tracks), mappingQualityCutoff: 0, }, diff --git a/src/Types.ts b/src/Types.ts index b33faf48..cd67c989 100644 --- a/src/Types.ts +++ b/src/Types.ts @@ -1,9 +1,10 @@ // Type file for documentation // (not actually enforced by Typescript) -// Possible filestypes taken from the request -// Files like GBZ contains graph and maybe haplotype and so can be either -type filetype = "graph" | "haplotype" | "read" | "bed"; +// Possible filestypes taken from the request. +// Files like GBZ contains graph and maybe haplotype and so can be either. +// Tabix-indexed TSV/BED needs a "node" and "graph" file pair to produce graph data. 
+type filetype = "graph" | "node" | "haplotype" | "read" | "bed"; // The basic concept of a track type BaseTrack = { @@ -82,7 +83,8 @@ type ReactColor = { type ColorScheme = { mainPalette: ColorPalette | ColorHex , auxPalette: ColorPalette | ColorHex , - colorReadsByMappingQuality: boolean + colorReadsByMappingQuality: boolean, + alphaReadsByMappingQuality: boolean } // Stores the assigned colorschemes of all tracks diff --git a/src/common.mjs b/src/common.mjs index e350fa2a..08884b9e 100644 --- a/src/common.mjs +++ b/src/common.mjs @@ -107,6 +107,8 @@ export function defaultTrackColors(trackType) { return config.defaultReadColorPalette; } else if (trackType === "haplotype") { return config.defaultHaplotypeColorPalette; + } else if (trackType === "node") { + return config.defaultHaplotypeColorPalette; } else { throw new Error("Invalid track type: " + trackType); } diff --git a/src/components/CustomizationAccordion.js b/src/components/CustomizationAccordion.js index 22219024..7dffbf99 100644 --- a/src/components/CustomizationAccordion.js +++ b/src/components/CustomizationAccordion.js @@ -80,6 +80,8 @@ class VisualizationOptions extends Component { ); } else if (type === "haplotype") { // TODO: Do nothing for now. Haplotypes get assigned to the graph as their source track right now. + } else if (type === "node") { + // TODO: Do nothing for now. Haplotypes get assigned to the graph as their source track right now. } else if (type === "read") { if (visOptions.showReads) { trackSettingsList.push( @@ -197,6 +199,18 @@ class VisualizationOptions extends Component { Color reads by mapping quality + + +