diff --git a/README.md b/README.md
index 96a72fbc..3072a34b 100644
--- a/README.md
+++ b/README.md
@@ -120,6 +120,10 @@ To load your own data into the Sequence Tube Map, see the guide to [Adding Your
Previously we provided a Docker image at [https://hub.docker.com/r/wolfib/sequencetubemap/](https://hub.docker.com/r/wolfib/sequencetubemap/), which contained the build of this repo as well as a vg executable for data preprocessing and extraction. We now recommend a different installation approach, either using the [online version](#online-version) or a full installation of the [local version](#local-version). However, if you would like to Dockerize the Sequence Tube Map, the repository includes a `Dockerfile`.
+## Using tabix-based index files
+
+See [README.tabix.md](README.tabix.md) for more information about using this faster alternative.
+
## Contributing
For information on how to develop on the Sequence Tube Map codebase, please see the [Development Guide](doc/development.md).
diff --git a/README.tabix.md b/README.tabix.md
new file mode 100644
index 00000000..cd6ac8c8
--- /dev/null
+++ b/README.tabix.md
@@ -0,0 +1,120 @@
+## Tabix-based index files
+
+Three files are used, each indexed with tabix (i.e. each comes with a companion `.tbi` file):
+
+1. `nodes.tsv.gz` contains the sequence of each node.
+2. `pos.bed.gz` contains the position (as node intervals) of regions on each haplotype.
+3. `haps.gaf.gz` contains the path followed by each haplotype (split into pieces).
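+
+For reference, a sketch of each file's layout (tab-separated; column descriptions only, not real records):
+
+```
+nodes.tsv.gz   "n" <TAB> node ID <TAB> node sequence
+pos.bed.gz     haplotype name <TAB> start <TAB> end <TAB> min node ID <TAB> max node ID of the block
+haps.gaf.gz    one GAF record per haplotype block, with an rc:Z:NAME:OFFSET tag giving
+               the haplotype name and the block's offset on that haplotype
+```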
+
+Briefly, these three index files can be quickly queried to extract a subgraph covering a region of interest: the `pos.bed.gz` index first tells us which nodes are covered, the `nodes.tsv.gz` index then gives us the sequences of those nodes, and finally the haplotype pieces traversing those nodes are stitched together from the `haps.gaf.gz` index.
+This approach is implemented in the [`chunkix.py`](scripts/chunkix.py) script, which can produce a GFA file or the files used by the sequenceTubeMap.
+The sequenceTubeMap uses this script internally when given tabix-based index files.
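+
+For illustration, here is a minimal sketch of the queries `chunkix.py` runs under the hood, followed by a direct `chunkix.py` call (the file names and region match the HPRC example further below; the node ID range is a placeholder):
+
+```sh
+# which haplotype blocks (and node intervals) cover the region of interest?
+tabix hprc.pos.bed.gz GRCh38#0#chr17:7674450-7675333
+# sequences for a node ID range (the node index uses the pseudo-sequence name "n")
+tabix hprc.nodes.tsv.gz n:123450-123499
+# haplotype pieces traversing that node range (the GAF index is queried by node ID)
+tabix hprc.haps.gaf.gz "{node}:123450-123499"
+
+# or let chunkix.py run these queries and stitch the result into a GFA (plus JSON with -j)
+python3 scripts/chunkix.py -p hprc.pos.bed.gz -n hprc.nodes.tsv.gz -g hprc.haps.gaf.gz \
+    -r GRCh38#0#chr17:7674450-7675333 -j -o chunk
+```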
+
+## Using tabix-based index files in the sequenceTubeMap
+
+The version on this `tabix` branch can use these index files. For example, when providing mounted files, specify:
+
+- the `pos.bed.gz` index in the *graph* field
+- the `nodes.tsv.gz` index in the *node* field
+- the `haps.gaf.gz` index in the *haplotype* field
+
+---
+
+![Mounting the tabix-based index files in the sequenceTubeMap](images/mount.tabix.index.png)
+
+---
+
+Once the index files are mounted, one can query any region on any haplotype in the form *HAPNAME_CONTIG:START-END*, for example `GRCh38#0#chr17:7674450-7675333`.
+
+Other tracks, for example reads or annotations in bgzipped/indexed GAF files, can be added as *reads* in the menu.
+
+---
+
+![Annotation tracks added as reads on top of the tabix-based indexes](images/mount.tabix.index.annot.png)
+
+---
+
+Of note, you can set a color for each track using the existing palettes or by picking a specific color.
+
+---
+
+![Picking colors for the annotation tracks](images/mount.tabix.index.annot.color.png)
+
+---
+
+## Using the Docker container
+
+A Docker container with this new sequenceTubeMap version and all the necessary dependencies is available at `quay.io/jmonlong/sequencetubemap:tabix_dev`.
+
+To use it, run:
+
+```sh
+docker run -it -p 3210:3000 -v `pwd`:/data quay.io/jmonlong/sequencetubemap:tabix_dev
+```
+
+Of note, the `-p` option maps the container's port 3000 to port 3210 on the host.
+In practice, pick any unused port on your host.
+
+Then open: http://localhost:3210/
+
+Note: for mounted files, this assumes all files (pangenomes, reads, annotations) are in the current working directory or in subdirectories.
+To test with files that have already been prepared, download them all (see below).
+Then, either use them as *custom* data, adding the tracks with the *Configure Tracks* button, or use the pre-configured "HPRC Minigraph-Cactus v1.1" data set.
+For reference, the files for this data set are defined in the [config.json file](docker/config.json) used to build the Docker image.
+
+## Available tabix-based index files for the Minigraph-Cactus v1.1 pangenome
+
+Index files and some annotations have been deposited at https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/
+
+To download them all:
+
+```sh
+# pangenome index files
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.haps.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.nodes.tsv.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/hprc.pos.bed.gz.tbi
+
+# annotation files
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gene_exon.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz.tbi
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz
+wget https://public.gi.ucsc.edu/~jmonlong/sequencetubemap_tabix/rm.gaf.gz.tbi
+```
+
+## Building tabix-based index files from a GFA
+
+### Optional. Make a GFA from a GBZ file
+
+In some cases, you will want to use exactly the same pangenome space as a specific GBZ file, for example to visualize reads or annotations on that pangenome.
+The GFA provided in the HPRC repo might not match exactly because some nodes may have been split when making the GBZ file.
+You can convert a GBZ to a GFA (and not translate the nodes back to the original GFA) with:
+
+```sh
+vg convert --no-translation -f -t 4 hprc-v1.1-mc-grch38.gbz | gzip > hprc-v1.1-mc-grch38.gfa.gz
+```
+
+### Run the `pgtabix.py` python script
+
+The `pgtabix.py` script can be found in the [`scripts` directory](scripts).
+It's also present in the `/build/sequenceTubeMap/scripts` directory of the Docker container `quay.io/jmonlong/sequencetubemap:tabix_dev`.
+
+```sh
+python3 pgtabix.py -g hprc-v1.1-mc-grch38.gfa.gz -o output.prefix
+```
+
+It takes about 1h30-2h to build index files for the Minigraph-Cactus v1.1 pangenome.
+This process should scale linearly with the number of haplotypes.
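+
+Of note, the script shells out to `bgzip`, `tabix`, `sort`, and `vg gamsort`, so these need to be on your `PATH` (they are in the Docker container above). With `-o output.prefix`, it writes the three bgzipped files and their tabix indexes:
+
+```sh
+ls output.prefix.*
+# output.prefix.haps.gaf.gz    output.prefix.haps.gaf.gz.tbi
+# output.prefix.nodes.tsv.gz   output.prefix.nodes.tsv.gz.tbi
+# output.prefix.pos.bed.gz     output.prefix.pos.bed.gz.tbi
+```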
+
+## Making your own annotation files
+
+To make your own annotation files, we have developed a pipeline to project haplotype-level annotation files (e.g. BED, GFF) onto a pangenome (e.g. GBZ).
+Once the projected GAF files are sorted, bgzipped, and indexed, they can be queried quickly, for example by the sequenceTubeMap.
+
+The pipeline is described in the [manuscript](https://jmonlong.github.io/manu-vggafannot/), and its scripts and documentation are available in [the GitHub repository](https://github.com/jmonlong/manu-vggafannot?tab=readme-ov-file).
+In particular, examples of how annotation files were projected for that manuscript are described in [this section](https://github.com/jmonlong/manu-vggafannot/tree/main/analysis/annotate).
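+
+As a rough sketch, assuming a projected GAF named `annotations.gaf` (a placeholder name), the sorting/bgzipping/indexing step can reuse the same commands as `pgtabix.py` above:
+
+```sh
+# sort the GAF by node ID, bgzip it, and build the tabix index
+vg gamsort -G - < annotations.gaf | bgzip > annotations.sorted.gaf.gz
+tabix -p gaf annotations.sorted.gaf.gz
+```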
diff --git a/docker/Dockerfile b/docker/Dockerfile
index bea3e7a2..ddf743ca 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -6,22 +6,37 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
ENV TZ=America/Los_Angeles
-
# install basic apt dependencies
# note: most vg apt dependencies are installed by "make get-deps" below
RUN apt-get -qq update && apt-get -qq install -y \
- git \
- wget \
- less \
- npm \
- nano
-
+ git \
+ wget \
+ less \
+ npm \
+ nano \
+ make \
+ g++ \
+ gcc \
+ zlib1g-dev \
+ libbz2-dev \
+ liblzma-dev \
+ python3 \
+ build-essential
+
+# install tabix/bgzip
+RUN wget --quiet --no-check-certificate https://github.com/samtools/htslib/releases/download/1.21/htslib-1.21.tar.bz2 && \
+ tar -xjvf htslib-1.21.tar.bz2 && \
+ cd htslib-1.21 && \
+ ./configure && \
+ make && make install
+
+# install node
RUN npm cache clean -f
RUN npm install -g n && n stable
# download vg binary
-RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.59.0/vg \
+RUN wget --quiet --no-check-certificate https://github.com/vgteam/vg/releases/download/v1.64.1/vg \
&& mv vg /bin/vg && chmod +x /bin/vg
WORKDIR /build
diff --git a/docker/config.json b/docker/config.json
index 0bde79ad..389e9451 100644
--- a/docker/config.json
+++ b/docker/config.json
@@ -13,6 +13,22 @@
"simplify": false,
"removeSequences": false
},
+ {
+ "name": "HPRC Minigraph-Cactus v1.1",
+ "tracks": [
+ {"trackFile": "/data/hprc.pos.bed.gz", "trackType": "graph", "trackColorSettings": {"mainPalette": "ygreys", "auxPalette": "greys"}},
+ {"trackFile": "/data/hprc.nodes.tsv.gz", "trackType": "node"},
+ {"trackFile": "/data/hprc.haps.gaf.gz", "trackType": "haplotype"},
+ {"trackFile": "/data/gene_exon.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "reds", "auxPalette": "reds"}},
+ {"trackFile": "/data/rm.gaf.gz", "trackType": "read", "trackColorSettings": {"mainPalette": "blues", "auxPalette": "blues"}},
+ {"trackFile": "/data/gwasCatalog.hprc-v1.1-mc-grch38.sorted.gaf.gz",
+ "trackType": "read", "trackColorSettings": {"mainPalette": "plainColors", "auxPalette": "plainColors"}}
+ ],
+ "region": "GRCh38#0#chr17:7674450-7675333",
+ "dataType": "built-in",
+ "simplify": false,
+ "removeSequences": false
+ },
{
"name": "Lancet example",
"tracks": [
@@ -48,6 +64,7 @@
}
],
"vgPath": [""],
+ "chunkixPath": ["/data", "scripts"],
"dataPath": "/data",
"internalDataPath": "exampleData/internal/",
"tempDirPath": "temp",
@@ -57,19 +74,22 @@
"defaultGraphColorPalette" : {
"mainPalette": "#000000",
"auxPalette": "greys",
- "colorReadsByMappingQuality": false
+ "colorReadsByMappingQuality": false,
+ "alphaReadsByMappingQuality": false
},
"defaultHaplotypeColorPalette" : {
"mainPalette": "plainColors",
"auxPalette": "lightColors",
- "colorReadsByMappingQuality": false
+ "colorReadsByMappingQuality": false,
+ "alphaReadsByMappingQuality": false
},
"defaultReadColorPalette" : {
"mainPalette": "blues",
"auxPalette": "reds",
- "colorReadsByMappingQuality": false
+ "colorReadsByMappingQuality": false,
+ "alphaReadsByMappingQuality": false
},
"defaultTrackProps" : {
@@ -77,7 +97,8 @@
"trackColorSettings": {
"mainPalette": "#000000",
"auxPalette": "greys",
- "colorReadsByMappingQuality": false
+ "colorReadsByMappingQuality": false,
+ "alphaReadsByMappingQuality": false
}
},
diff --git a/images/mount.tabix.index.annot.color.png b/images/mount.tabix.index.annot.color.png
new file mode 100644
index 00000000..ac1ddf1a
Binary files /dev/null and b/images/mount.tabix.index.annot.color.png differ
diff --git a/images/mount.tabix.index.annot.png b/images/mount.tabix.index.annot.png
new file mode 100644
index 00000000..bef93375
Binary files /dev/null and b/images/mount.tabix.index.annot.png differ
diff --git a/images/mount.tabix.index.png b/images/mount.tabix.index.png
new file mode 100644
index 00000000..891fc1c8
Binary files /dev/null and b/images/mount.tabix.index.png differ
diff --git a/scripts/chunkix.py b/scripts/chunkix.py
new file mode 100755
index 00000000..bfe7d2dc
--- /dev/null
+++ b/scripts/chunkix.py
@@ -0,0 +1,628 @@
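+"""Extract a subgraph around a region from tabix-based pangenome index files.
+
+Queries the positions BED, nodes TSV and haplotypes GAF indexes with tabix,
+stitches the haplotype pieces back together, and writes a GFA of the
+subgraph. With -j, it also writes JSON files for the sequenceTubeMap.
+Annotation GAFs given with -a are subset to the same subgraph.
+"""
+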
+import argparse
+import subprocess
+import json
+import sys
+
+
+def parsePath(path_raw):
+ cur_node = ''
+ cur_orient = ''
+ path = []
+ fwd_diff = 0
+ for ii in range(len(path_raw)):
+ if path_raw[ii] == '<' or path_raw[ii] == '>':
+ if cur_node != '':
+ path.append([cur_node, cur_orient])
+ cur_orient = path_raw[ii] == '>'
+ if cur_orient:
+ fwd_diff += 1
+ else:
+ fwd_diff += -1
+ cur_node = ''
+ else:
+ cur_node += path_raw[ii]
+ if cur_node != '':
+ path.append([cur_node, cur_orient])
+ return (path)
+
+
+def parseCigar(cs_tag, cs_type='cs'):
+ cs = []
+ val = ''
+ edit = ''
+ for ii in range(len(cs_tag)):
+ if cs_type == 'cs':
+ # ex: cs:Z:+ATG:360*AT-G:1*CG:137
+ if cs_tag[ii] in [':', '*', '+', '-']:
+ if val != '':
+ cs.append([edit, val])
+ val = ''
+ edit = cs_tag[ii]
+ else:
+ val += cs_tag[ii]
+ elif cs_type == 'cg':
+ # ex: cg:Z:24=1D185=1X1151=1X87=1I157=
+ if cs_tag[ii] in ['=', 'X', 'I', 'D']:
+ edit = cs_tag[ii]
+ if edit == '=':
+ edit = ':'
+ elif edit == 'X':
+ edit = '*'
+ val = 'N' * int(val)
+ val = val + val
+ elif edit == 'I':
+ edit = '+'
+ val = 'N' * int(val)
+ elif edit == 'D':
+ edit = '-'
+ val = 'N' * int(val)
+ cs.append([edit, val])
+ val = ''
+ else:
+ val += cs_tag[ii]
+ # for cs type, don't forget to insert the last 'ongoing' one
+ if cs_type == 'cs' and val != '':
+ cs.append([edit, val])
+ return (cs)
+
+
+def prepareJsonMapOnNode(node_orient, offset, cigar, nodes_seq):
+ json_map = {}
+ json_map['edit'] = []
+ json_map['position'] = {}
+ if not node_orient[1]:
+ json_map['position']['is_reverse'] = True
+ json_map['position']['node_id'] = node_orient[0]
+ if node_orient[0] in nodes_seq and len(cigar) > 0:
+ bp_remaining = len(nodes_seq[node_orient[0]]) - offset
+ while bp_remaining > 0 and len(cigar) > 0:
+ # get the next cigar operation
+ next_co = cigar.pop(0)
+ cig_bp = 0
+ # update edits, remaining bases, and cigar
+ if next_co[0] == ':':
+ # matches
+ cig_bp = int(next_co[1])
+ if cig_bp > bp_remaining:
+ json_map['edit'].append({'from_length': bp_remaining,
+ 'to_length': bp_remaining})
+ # add back remaining cigar operation
+ next_co[1] = str(cig_bp - bp_remaining)
+ cigar.insert(0, next_co)
+ # no more bases remaining in this node
+ bp_remaining = 0
+ else:
+ json_map['edit'].append({'from_length': cig_bp,
+ 'to_length': cig_bp})
+ bp_remaining += - cig_bp
+ elif next_co[0] == '*':
+ # substitution
+ json_map['edit'].append({'from_length': 1,
+ 'sequence': next_co[1][1],
+ 'to_length': 1})
+ bp_remaining += -1
+ elif next_co[0] == '+':
+ # insertion in read
+ json_map['edit'].append({'sequence': next_co[1],
+ 'to_length': len(next_co[1])})
+ elif next_co[0] == '-':
+ # deletion in read
+ cig_bp = len(next_co[1])
+ if cig_bp > bp_remaining:
+ json_map['edit'].append({'from_length': bp_remaining})
+ # update cigar and add it back at the front
+ next_co[1] = next_co[1][bp_remaining:]
+ cigar.insert(0, next_co)
+ # no more bases remaining in this node
+ bp_remaining = 0
+ else:
+ json_map['edit'].append({'from_length': cig_bp})
+ bp_remaining += - cig_bp
+ # DEBUG test: if the next operation is an insert, add it to this node
+ if len(cigar) > 0 and cigar[0][0] == '+' and node_orient[1]:
+ next_co = cigar.pop(0)
+ # insertion in read
+ json_map['edit'].append({'sequence': next_co[1],
+ 'to_length': len(next_co[1])})
+ # don't forget to specify offset
+ if offset > 0:
+ json_map['position']['offset'] = str(offset)
+ elif node_orient[0] in nodes_seq:
+ # we're here if we know about this node but a node before was missing
+ # we can't trust the cigar anymore so we just write full node overlap
+ nlen = len(nodes_seq[node_orient[0]])
+ json_map['edit'].append({'from_length': nlen,
+ 'to_length': nlen})
+ else:
+        # if we don't know this node, assume it's 1bp long and forget
+ # about the cigar/offset
+ json_map['edit'].append({'from_length': 1,
+ 'to_length': 1})
+ # to remember that we had a missing node (and can't trust the cigar for
+ # the other nodes) we clear the cigar info
+ cigar.clear()
+ return (json_map)
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument('-n', help='indexed nodes', required=True)
+parser.add_argument('-p', help='indexed positions BED', required=True)
+parser.add_argument('-g', help='indexed haplotypes GAF', required=True)
+parser.add_argument('-r', help='region to query', required=True)
+parser.add_argument('-a', default=[], action='append',
+ help='indexed annotations (optional). Can repeat.')
+parser.add_argument('-o', help='output prefix', default='chunk')
+parser.add_argument('-j', help='also make JSON output (e.g. for tubemap)',
+ action='store_true')
+parser.add_argument('-s', help='simplify haplotypes by merging identical ones'
+ ' (for the JSON output)', action='store_true')
+args = parser.parse_args()
+
+# args = parser.parse_args(['-n', 'pg.nodes.tsv.bgz',
+# '-g', 'pg.haps.gaf.gz',
+# '-p', 'pg.pos.bed.gz',
+# '-a', 'pg.haps.gaf.gz',
+# '-r', 'GRCh38#0#chr6:3000000-3001000'])
+
+# parse queried region
+ref_path_name = args.r.split(':')[0]
+qint = args.r.split(':')[1]
+qstart = int(qint.split('-')[0])
+qend = int(qint.split('-')[1])
+
+# extract nodes in region
+cmd = ['tabix', args.p, args.r]
+try:
+ cmd_o = subprocess.run(cmd, check=True, capture_output=True)
+except subprocess.CalledProcessError as e:
+ sys.exit(e.stderr.decode())
+nodes_bed = cmd_o.stdout.decode().rstrip().split('\n')
+nodes_bed = [line.split('\t') for line in nodes_bed]
+
+# stop if no nodes were found in that region
+if len(nodes_bed) == 1 and len(nodes_bed[0]) == 1 and nodes_bed[0][0] == '':
+ sys.exit('No nodes overlap this region.')
+
+# find minimum and maximum node IDs and reference positions
+min_node = int(nodes_bed[0][3])
+max_node = int(nodes_bed[0][3])
+# loop over all extracted nodes
+for nbed in nodes_bed:
+ node_s = int(nbed[3])
+ node_e = int(nbed[4])
+ min_node = min(min_node, node_s, node_e)
+ max_node = max(max_node, node_s, node_e)
+max_node += 1
+
+# extract haplotypes in region
+cmd = ['tabix', args.g, '{{node}}:{}-{}'.format(min_node, max_node)]
+try:
+ cmd_o = subprocess.run(cmd, check=True, capture_output=True)
+except subprocess.CalledProcessError as e:
+ sys.exit(e.stderr.decode())
+haps_gaf = cmd_o.stdout.decode().rstrip().split('\n')
+haps_gaf = [line.split('\t') for line in haps_gaf]
+
+# organize haplotype subpaths
+subhaps_path = {}
+for hgaf in haps_gaf:
+ # extract contig name and position from rc tag
+ rc_ii = 12
+ while 'rc:Z:' not in hgaf[rc_ii]:
+ rc_ii += 1
+ rc = hgaf[rc_ii].replace('rc:Z:', '').split(':')
+ hname = rc[0]
+ hstart = int(rc[1])
+ hend = int(rc[1]) + int(hgaf[1])
+ # if queried path, make sure it overlaps the queried range
+ if hname == ref_path_name and (hend < qstart or hstart > qend):
+ # skip, if not
+ continue
+ # parse path
+ path = parsePath(hgaf[5])
+ # make sure at least one node is in the node range
+ overlap_range = False
+ for no in path:
+ if int(no[0]) >= min_node and int(no[0]) <= max_node:
+ overlap_range = True
+ break
+ if not overlap_range:
+ continue
+ # save all in dict
+ if hname not in subhaps_path:
+ subhaps_path[hname] = {'paths': {}, 'pos': [], 'end': {}}
+ # save path
+ subhaps_path[hname]['paths'][rc[1]] = path
+ # save position in haplotype
+ subhaps_path[hname]['pos'].append(hstart)
+ subhaps_path[hname]['end'][rc[1]] = hend
+
+# stitch into one path per haplotype
+haps_full_paths = {}
+min_ref_pos = 0
+all_perfectly_stitched = True
+for hname in subhaps_path:
+ # sort the positions
+ pos = subhaps_path[hname]['pos']
+ pos.sort()
+ # remember the start position of ref path
+ if hname == ref_path_name:
+ min_ref_pos = pos[0]
+ # stitch paths
+ paths = [[]]
+ prev_end = -1
+ for pp in pos:
+ if prev_end != -1 and prev_end != pp:
+ # end position of the previous chunk doesn't match current position
+            # we shouldn't stitch and instead create another sub-path
+ paths.append([])
+ all_perfectly_stitched = False
+ paths[-1] += subhaps_path[hname]['paths'][str(pp)]
+ prev_end = subhaps_path[hname]['end'][str(pp)]
+    # save the haplotype path, or each piece (if not stitched in one piece)
+ if len(paths) == 1:
+ haps_full_paths[hname] = paths[0]
+ else:
+ for ii, path in enumerate(paths):
+ haps_full_paths['{}_p{}'.format(hname, ii)] = path
+
+if all_perfectly_stitched:
+ print('All haplotypes could be stitched back in one piece')
+else:
+ print('Some haplotypes could not be stitched back in one piece')
+
+# to save the "default" reference path
+nodes_ref = {}
+for no in haps_full_paths[ref_path_name]:
+ nodes_ref[no[0]] = True
+# this is used for a first round of "trimming" of the haplotypes
+# another round is done later to match the queried range better
+
+# prepare paths before the actual round of trimming
+haps_paths_pretrim = {}
+# this is to know which nodes to include (and get information later)
+nodes_inc = {}
+# process each haplotype path
+for hname in haps_full_paths:
+ path = haps_full_paths[hname]
+ # trim ends of path until they touch a node on ref (queried) path
+ path_beg = 0
+ while path[path_beg][0] not in nodes_ref and path_beg < len(path) - 1:
+ path_beg += 1
+ path_end = len(path) - 1
+ while path[path_end][0] not in nodes_ref and path_end > 0:
+ path_end += -1
+ path_end += 1
+ path = path[path_beg:path_end]
+ # skip empty path
+ if len(path) == 0:
+ continue
+ # otherwise save path
+ haps_paths_pretrim[hname] = path
+ for no in path:
+ nodes_inc[no[0]] = True
+
+# optional extract and write GAF
+annot_forjsons_l = []
+# eventually other nodes that we'll need to include (just in case)
+nodes_inc_others = {}
+if len(args.a) > 0:
+ for annot_idx, annot_fn in enumerate(args.a):
+ # extract with tabix
+ cmd = ['tabix', annot_fn, '{{node}}:{}-{}'.format(min_node, max_node)]
+ try:
+ cmd_o = subprocess.run(cmd, check=True, capture_output=True)
+ except subprocess.CalledProcessError as e:
+ sys.exit(e.stderr.decode())
+ annot_gaf = cmd_o.stdout.decode().rstrip().split('\n')
+ agaf_of = open(args.o + '.{}.annot.gaf'.format(annot_idx), 'wt')
+ annot_forjson = []
+ for gaf_line_r in annot_gaf:
+ # prepare the json output, just in case
+ if gaf_line_r == '':
+ continue
+ gaf_line = gaf_line_r.split('\t')
+ path = parsePath(gaf_line[5])
+ # check if at least one node is in subgraph of interest
+ no_overlap = True
+ for no in path:
+ if no[0] in nodes_inc:
+ no_overlap = False
+ break
+ if no_overlap:
+ continue
+ # it touches the subgraph, we can write it
+ agaf_of.write(gaf_line_r + '\n')
+ # prepare JSON information
+ apath = {'name': gaf_line[0]}
+ apath['mapq'] = int(gaf_line[11])
+ apath['pathl_remain'] = int(gaf_line[1])
+ apath['offset'] = int(gaf_line[7])
+ apath['path'] = path
+        # check if any nodes are missing
+ for nidx, no in enumerate(path):
+ if no[0] not in nodes_inc:
+ nodes_inc_others[no[0]] = True
+ # add optional tags as "sample_name"
+ if len(gaf_line) > 12:
+ # TODO: don't include base qualities
+ tags = {}
+ for tag in gaf_line[12:]:
+ pos_col = tag.index(':')
+ tagn = tag[0:pos_col]
+ tagv = tag[(tag.index(':', pos_col+1)+1):]
+ tags[tagn] = tagv
+ apath['tags'] = tags
+ # save path info
+ annot_forjson.append(apath)
+ agaf_of.close()
+ annot_forjsons_l.append(annot_forjson)
+
+# update nodes to include with the potentially new ones
+for node in nodes_inc_others:
+ nodes_inc[node] = True
+
+# find sequence for nodes to include
+nodes_seq = {}
+# first, cluster them to avoid too many queries
+# to cluster, sort by node ID
+sorted_nodes = [int(node) for node in nodes_inc]
+sorted_nodes.sort()
+# then group if next ID is not too far
+node_cls = []
+max_node_jump = 1000
+for node in sorted_nodes:
+ if len(node_cls) == 0:
+ node_cls.append([node])
+ else:
+ if node_cls[-1][-1] + max_node_jump < node:
+ # print(node - node_cls[-1][-1] + max_node_jump)
+ # we need a new node cluster
+ node_cls.append([node])
+ else:
+ node_cls[-1].append(node)
+# query node information for each cluster
+for ncl in node_cls:
+ print('Searching for {}-{} node sequences.'.format(ncl[0], ncl[-1]))
+ cmd = ['tabix', args.n, 'n:{}-{}'.format(ncl[0], ncl[-1])]
+ try:
+ cmd_o = subprocess.run(cmd, check=True, capture_output=True)
+ except subprocess.CalledProcessError as e:
+ sys.exit(e.stderr.decode())
+ pos_tsv = cmd_o.stdout.decode().rstrip().split('\n')
+ for posr in pos_tsv:
+ posr = posr.split('\t')
+ if posr[1] in nodes_inc:
+ nodes_seq[posr[1]] = posr[2]
+
+# second round of trimming to match the queried range exactly
+# trim the "reference" path first
+cur_pos = min_ref_pos
+max_ref_pos = 0
+nodes_ref = {}
+within_range = False
+for no in haps_full_paths[ref_path_name]:
+ if cur_pos + len(nodes_seq[no[0]]) >= qstart and len(nodes_ref) == 0:
+ # if we weren't within range, now we are
+ within_range = True
+ min_ref_pos = cur_pos
+ if within_range:
+ nodes_ref[no[0]] = True
+ if cur_pos + len(nodes_seq[no[0]]) >= qend and within_range:
+ # we'll be out of range next node
+ within_range = False
+ max_ref_pos = cur_pos + len(nodes_seq[no[0]])
+ cur_pos += len(nodes_seq[no[0]])
+assert max_ref_pos > 0, 'no end position found, problem with haplotype stitching?'
+# now trim all the haplotypes until they touch these nodes
+haps_paths = {}
+nodes_inc = {}
+# also simplify haplotypes and tally hap frequency
+haps_freq = {}
+haps_hash = {}
+haps_names = {}
+# and save edges
+edges_l = {}
+lfmt = 'L\t{}\t{}\t{}\t{}\t0M'
+orient_v = ['-', '+']
+# process each path
+for hname in haps_paths_pretrim:
+ path = haps_paths_pretrim[hname]
+ # trim ends of path until they touch a node on ref (queried) path
+ path_beg = 0
+ while path[path_beg][0] not in nodes_ref and path_beg < len(path) - 1:
+ path_beg += 1
+ path_end = len(path) - 1
+ while path[path_end][0] not in nodes_ref and path_end > 0:
+ path_end += -1
+ path_end += 1
+ path = path[path_beg:path_end]
+ # skip empty path
+ if len(path) == 0:
+ continue
+ # otherwise save path
+ # maybe check if we want to simplify it first
+ if args.s:
+ # make a string ID of the path and check if identical already present
+ path_hash = '_'.join(['{}_{}'.format(ro[0], ro[1]) for ro in path])
+ if hname == ref_path_name:
+ # keep the reference path separate
+ haps_paths[hname] = path
+ haps_freq[hname] = 1
+ haps_names[hname] = hname
+ elif path_hash in haps_hash:
+ # if present, increment frequency
+ haps_freq[haps_hash[path_hash]] += 1
+ haps_names[haps_hash[path_hash]] += ', ' + hname
+ else:
+ # otherwise, save and init the hash dict
+ haps_paths[hname] = path
+ haps_freq[hname] = 1
+ haps_hash[path_hash] = hname
+ haps_names[hname] = hname
+ else:
+ haps_paths[hname] = path
+ haps_names[hname] = hname
+ haps_freq[hname] = 1
+ # save edges and nodes to include
+ prev_no = []
+ for no in path:
+ nodes_inc[no[0]] = True
+ if len(prev_no) != 0:
+ ename = '{}_{}-{}_{}'.format(prev_no[0], prev_no[1],
+ no[0], no[1])
+ if ename not in edges_l:
+ edges_l[ename] = lfmt.format(prev_no[0], orient_v[prev_no[1]],
+ no[0], orient_v[no[1]])
+ prev_no = no
+
+print('{} nodes, {} edges, {} paths in subgraph'.format(len(nodes_inc),
+ len(edges_l),
+ len(haps_paths_pretrim)))
+
+# write GFA
+gfa_of = open(args.o + '.gfa', 'wt')
+# write header
+gfa_of.write('H\tVN:Z:1.1\n')
+# write nodes
+for node in nodes_seq:
+ # currently we keep nodes in that new subgraph of interest and nodes that
+ # might be touched by the annotations. Could be too much but that might
+ # avoid some errors later? Of note those extra nodes won't be on any
+    # haplotype or edges. So it's only useful if the sequenceTubeMap wants
+ # the node information/sequence to represent the reads/annotations
+ if node in nodes_inc or node in nodes_inc_others:
+ gfa_of.write('S\t{}\t{}\n'.format(node, nodes_seq[node]))
+# write paths
+for hname in haps_paths:
+ # guess sample/haplotype/contig names
+ hap = 0
+ contig = hname
+ sample = hname
+ hname_v = hname.split('#')
+ if len(hname_v) == 2:
+ # reference path as "sample#contig"
+ sample = hname_v[0]
+ contig = hname_v[1]
+ elif len(hname_v) == 3:
+ # reference path as "sample#contig"
+ sample = hname_v[0]
+ hap = hname_v[1]
+ contig = hname_v[2]
+ # compute path size and string
+ psize = 0
+ pstring = ''
+ for no in haps_paths[hname]:
+ psize += len(nodes_seq[no[0]])
+ pstring += ['<', '>'][no[1]] + no[0]
+ # prepare W line
+ wline = 'W\t{}\t{}\t{}\t0'.format(sample, hap, contig)
+ wline += '\t{}\t{}\n'.format(psize, pstring)
+ gfa_of.write(wline)
+# write edges
+for ename in edges_l:
+ gfa_of.write(edges_l[ename] + '\n')
+gfa_of.close()
+
+# write JSON output of graph
+nodes_in_graph = {}
+if args.j:
+ json_pg = {}
+ # nodes
+ json_pg['node'] = []
+ for node in nodes_seq:
+ if node in nodes_inc or node in nodes_inc_others:
+ json_pg['node'].append({'id': node,
+ 'sequence': nodes_seq[node]})
+ # paths
+ json_pg['path'] = []
+ # the reference path should be first
+ path_names = list(haps_paths.keys())
+ path_names.remove(ref_path_name)
+ path_names.insert(0, ref_path_name)
+ for hname in path_names:
+ json_path = {}
+ json_path['name'] = haps_names[hname]
+ json_path['mapping'] = []
+ for nidx, no in enumerate(haps_paths[hname]):
+ json_map = {}
+ json_map['edit'] = {}
+ node_size = len(nodes_seq[no[0]])
+ json_map['edit']['from_length'] = str(node_size)
+ json_map['edit']['to_length'] = str(node_size)
+ json_map['position'] = {}
+ if not no[1]:
+ json_map['position']['is_reverse'] = True
+ json_map['position']['node_id'] = no[0]
+ json_map['rank'] = str(nidx + 1)
+ json_path['mapping'].append(json_map)
+ nodes_in_graph[no[0]] = True
+ json_pg['path'].append(json_path)
+ json_of = open(args.o + '.graph.json', 'wt')
+ json.dump(json_pg, json_of)
+ json_of.close()
+ # also make an annotate.txt file with frequency of each haplotype
+ annot_of = open(args.o + '.annotate.txt', 'wt')
+ for hname in path_names:
+ annot_of.write('{}\t{}\n'.format(haps_names[hname], haps_freq[hname]))
+ annot_of.close()
+ # also a region.tsv file with the reference path name and region offset
+ reg_of = open(args.o + '.regions.tsv', 'wt')
+ reg_of.write('{}\t{}\t{}\n'.format(ref_path_name,
+ min_ref_pos, max_ref_pos))
+ reg_of.close()
+
+# write JSON output for the annotations
+if len(args.a) > 0 and args.j:
+ for annot_idx, annot_forjson in enumerate(annot_forjsons_l):
+ annot_json = []
+ for apath in annot_forjson:
+ path = apath['path']
+ json_read = {'name': apath['name']}
+ json_read['mapping_quality'] = apath['mapq']
+ json_mapping = []
+ pathl_remain = apath['pathl_remain']
+ offset = apath['offset']
+            cigar = []
+            if 'tags' in apath:
+                if 'cs' in apath['tags']:
+                    cigar = parseCigar(apath['tags']['cs'])
+                elif 'cg' in apath['tags']:
+                    cigar = parseCigar(apath['tags']['cg'], cs_type='cg')
+ no_overlap = True
+ for nidx, no in enumerate(path):
+ if nidx != 0:
+ offset = 0
+ json_map = prepareJsonMapOnNode(no, offset, cigar, nodes_seq)
+ json_map['rank'] = str(nidx + 1)
+ json_mapping.append(json_map)
+ if json_map['position']['node_id'] in nodes_in_graph:
+ no_overlap = False
+ # skip paths that don't overlap with subgraph
+ if no_overlap:
+ continue
+ # trim ends
+ while json_mapping[0]['position']['node_id'] not in nodes_in_graph:
+ json_mapping.pop(0)
+ while json_mapping[-1]['position']['node_id'] not in nodes_in_graph:
+ json_mapping.pop()
+ json_read['path'] = {'mapping': json_mapping}
+ # add optional tags as "sample_name"
+ if 'tags' in apath:
+ opts = []
+ for tagn in apath['tags']:
+ tagv = apath['tags'][tagn]
+ # filter keep some tags
+ if tagn in ['bq']:
+ continue
+ # shorten long tag values
+ if len(tagv) > 20:
+ tagv = tagv[:20] + '...'
+ opts.append('{}={}'.format(tagn, tagv))
+ opts = ' '.join(opts)
+ # json_read['sample_name'] = opts.replace(',', ', ')
+ annot_json.append(json_read)
+ # write JSON
+ ajson_of = open(args.o + '.{}.annot.json'.format(annot_idx), 'wt')
+ for ajson in annot_json:
+ ajson_of.write(json.dumps(ajson) + "\n")
+ ajson_of.close()
diff --git a/scripts/pgtabix.py b/scripts/pgtabix.py
new file mode 100644
index 00000000..7ad9882c
--- /dev/null
+++ b/scripts/pgtabix.py
@@ -0,0 +1,218 @@
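+"""Build tabix-based index files for a pangenome from its GFA.
+
+Writes <prefix>.nodes.tsv.gz, <prefix>.pos.bed.gz and <prefix>.haps.gaf.gz,
+each bgzipped and tabix-indexed. Shells out to bgzip, tabix, sort and
+`vg gamsort`, which must be on the PATH. Assumes all S lines come before
+the W/P lines in the GFA (one-pass parsing).
+"""
+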
+import gzip
+import sys
+import argparse
+import datetime
+# from subprocess import Popen,PIPE
+import subprocess
+
+parser = argparse.ArgumentParser(description='Build tabix index files for a pangenome.')
+parser.add_argument('-g', help='pangenome in GFA', required=True)
+parser.add_argument('-o', help='output prefix', default='pg')
+parser.add_argument('-s', help='size of haplotype blocks to save', default=100,
+                    type=int)
+parser.add_argument('-r', default='', help='only index P lines whose path name '
+                    'starts with this prefix (all P lines by default)')
+args = parser.parse_args()
+
+nodes = {}
+orients = ['<', '>']
+
+
+# TODO try without auto_flip. Don't remember if it's really necessary.
+def parsePath(path_raw, auto_flip=True):
+ cur_node = ''
+ cur_orient = ''
+ path = []
+ fwd_diff = 0
+ for ii in range(len(path_raw)):
+ if path_raw[ii] == '<' or path_raw[ii] == '>':
+ if cur_node != '':
+ cur_node = int(cur_node)
+ path.append([cur_node, cur_orient])
+ cur_orient = path_raw[ii] == '>'
+ if cur_orient:
+ fwd_diff += 1
+ else:
+ fwd_diff += -1
+ cur_node = ''
+ else:
+ cur_node += path_raw[ii]
+ if cur_node != '':
+ cur_node = int(cur_node)
+ path.append([cur_node, cur_orient])
+ # flip path if mostly in reverse?
+ if auto_flip and fwd_diff < 0:
+ path.reverse()
+ for ii in range(len(path)):
+ path[ii][1] = not path[ii][1]
+ return (path)
+
+
+def parsePathP(path_raw):
+ cur_node = ''
+ path = []
+ for ii in range(len(path_raw)):
+ if path_raw[ii] == '+' or path_raw[ii] == '-':
+ cur_node = int(cur_node)
+ path.append([cur_node, path_raw[ii] == '+'])
+ cur_node = ''
+ elif path_raw[ii] == ',':
+ continue
+ else:
+ cur_node += path_raw[ii]
+ return (path)
+
+
+def prepareHapChunk(path, nodes, name='ref', coord=''):
+ gaf_v = name
+ # path length/start/end/strand
+ path_len = sum([nodes[no[0]][0] for no in path])
+ gaf_v += '\t{}\t0\t{}\t+'.format(path_len, path_len)
+ # path information: string representation and length
+ path_string = ''
+ min_node = path[0][0]
+ max_node = path[0][0]
+ for no in path:
+ path_string += orients[no[1]] + str(no[0])
+ min_node = min(min_node, no[0])
+ max_node = max(max_node, no[0])
+ gaf_v += '\t{}\t{}'.format(''.join(path_string), path_len)
+ gaf_v += '\t{}\t{}'.format(0, path_len)
+ # residues matching, alignment block size, and mapping quality
+ fake_mapq = 60
+ gaf_v += '\t{}\t{}\t{}'.format(path_len, path_len, fake_mapq)
+ # cigar string
+ gaf_v += '\tcs:Z::{}'.format(path_len)
+ # coordinate tag
+ if coord != '':
+ gaf_v += '\trc:Z:{}'.format(coord)
+ return ({'gaf': gaf_v, 'path_len': path_len,
+ 'min_node': min_node, 'max_node': max_node})
+
+
+# To only do one pass, let's assume that (all) the nodes are defined
+# before (any) paths in the GFA
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Reading {}...'.format(args.g), file=sys.stderr)
+if args.g.endswith('gz'):
+ gfa_inf = gzip.open(args.g, 'rt')
+else:
+ gfa_inf = open(args.g, 'rt')
+
+# start processes to sort and bgzip output
+h_gz_cmd = "vg gamsort -G - | bgzip > " + args.o + '.haps.gaf.gz'
+h_gz_p = subprocess.Popen(h_gz_cmd, stdin=subprocess.PIPE, shell=True)
+p_gz_cmd = "sort -k1V -k2n -k3n | bgzip > " + args.o + '.pos.bed.gz'
+p_gz_p = subprocess.Popen(p_gz_cmd, stdin=subprocess.PIPE, shell=True)
+
+pos_fmt = '{}\t{}\t{}\t{}\t{}\n'
+nodes_written = False
+npaths = 0
+for line in gfa_inf:
+ line = line.rstrip().split('\t')
+ if line[0] == 'S':
+ # saving both the size and sequence (maybe faster than recomputing the
+ # size every time it's needed later)
+ nodes[int(line[1])] = [len(line[2]), line[2]]
+ else:
+ if len(nodes) > 0 and not nodes_written:
+ # we were reading nodes and we've just finished
+ # write them, in ascending order in a file
+ print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Writing node file...', file=sys.stderr)
+ n_gz_p = subprocess.Popen("bgzip > " + args.o + ".nodes.tsv.gz",
+ stdin=subprocess.PIPE, shell=True)
+ nodes_ord = list(nodes.keys())
+ nodes_ord.sort()
+ for node in nodes_ord:
+ tow = 'n\t{}\t{}\n'.format(node, nodes[node][1])
+ n_gz_p.stdin.write(tow.encode())
+ n_gz_p.stdin.close()
+ n_gz_p.wait()
+ print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Node information written to '
+ '{}.'.format(args.o + ".nodes.tsv.gz"), file=sys.stderr)
+ nodes_written = True
+ # create tabix index
+ print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Indexing nodes', file=sys.stderr)
+ idx_cmd = ['tabix', '-f', '-s', '1', '-b', '2', '-e', '2',
+ args.o + ".nodes.tsv.gz"]
+ subprocess.run(idx_cmd, check=True)
+ print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Nodes indexed.', file=sys.stderr)
+ # now potentially prepare the path information
+ path = []
+ if line[0] == 'W':
+ path = parsePath(line[6])
+ sampn = line[1]
+ hapn = '#' + line[2]
+ pathn = '#' + line[3]
+ startpos = int(line[4])
+ elif line[0] == 'P' and line[1].startswith(args.r):
+ path = parsePathP(line[2])
+ sampn = line[1]
+ pathn = ''
+ hapn = ''
+ startpos = 0
+ if len(path) > 0:
+ # we've parsed a path
+ # save node positions
+ # save haplotype in GAF chunks
+ path_pos = 0
+ hap_pos = startpos
+ while path_pos < len(path):
+ path_end = min(len(path), path_pos + args.s)
+ seqn = '{}{}{}'.format(sampn, hapn, pathn)
+ coordn = '{}:{}'.format(seqn, hap_pos)
+ hapchunk = prepareHapChunk(path[path_pos:path_end],
+ nodes, name=sampn,
+ coord=coordn)
+ # write GAF record
+ tow = hapchunk['gaf'] + '\n'
+ h_gz_p.stdin.write(tow.encode())
+ # write position BED record
+ tow = pos_fmt.format(seqn, hap_pos,
+ hap_pos + hapchunk['path_len'],
+ hapchunk['min_node'],
+ hapchunk['max_node'])
+ p_gz_p.stdin.write(tow.encode())
+ # update position on haplotype
+ hap_pos += hapchunk['path_len']
+ # go to next chunk
+ path_pos = path_end
+ npaths += 1
+ if npaths % 10000 == 0:
+ print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Wrote info for {} paths'.format(npaths),
+ file=sys.stderr)
+gfa_inf.close()
+
+# close position BED bgzip process
+p_gz_p.stdin.close()
+
+# close haps GAF bgzip process
+h_gz_p.stdin.close()
+
+# wait for sorting/compressing processes to finish
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Waiting for sort/bgzip processes to finish.', file=sys.stderr)
+p_gz_p.wait()
+h_gz_p.wait()
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Output files sorted and bgzipped.', file=sys.stderr)
+
+# create tabix index
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Indexing position BED', file=sys.stderr)
+idx_cmd = ['tabix', '-p', 'bed', args.o + ".pos.bed.gz"]
+subprocess.run(idx_cmd, check=True)
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Position BED indexed.', file=sys.stderr)
+
+# create tabix index
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Indexing haplotype GAF', file=sys.stderr)
+idx_cmd = ['tabix', '-p', 'gaf', args.o + ".haps.gaf.gz"]
+subprocess.run(idx_cmd, check=True)
+print(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+ '- Haplotype GAF indexed.', file=sys.stderr)
+
+exit(0)
diff --git a/src/App.js b/src/App.js
index 271741fd..24d25228 100644
--- a/src/App.js
+++ b/src/App.js
@@ -65,6 +65,7 @@ class App extends Component {
showReads: true,
showSoftClips: true,
colorReadsByMappingQuality: false,
+ alphaReadsByMappingQuality: false,
colorSchemes: getColorSchemesFromTracks(this.defaultViewTarget.tracks),
mappingQualityCutoff: 0,
},
diff --git a/src/Types.ts b/src/Types.ts
index b33faf48..cd67c989 100644
--- a/src/Types.ts
+++ b/src/Types.ts
@@ -1,9 +1,10 @@
// Type file for documentation
// (not actually enforced by Typescript)
-// Possible filestypes taken from the request
-// Files like GBZ contains graph and maybe haplotype and so can be either
-type filetype = "graph" | "haplotype" | "read" | "bed";
+// Possible filetypes taken from the request.
+// Files like GBZ contain graph and possibly haplotype data, and so can be either.
+// Tabix-indexed TSV/BED needs a "node" and "graph" file pair to produce graph data.
+type filetype = "graph" | "node" | "haplotype" | "read" | "bed";
// The basic concept of a track
type BaseTrack = {
@@ -82,7 +83,8 @@ type ReactColor = {
type ColorScheme = {
mainPalette: ColorPalette | ColorHex ,
auxPalette: ColorPalette | ColorHex ,
- colorReadsByMappingQuality: boolean
+ colorReadsByMappingQuality: boolean,
+ alphaReadsByMappingQuality: boolean
}
// Stores the assigned colorschemes of all tracks
diff --git a/src/common.mjs b/src/common.mjs
index e350fa2a..08884b9e 100644
--- a/src/common.mjs
+++ b/src/common.mjs
@@ -107,6 +107,8 @@ export function defaultTrackColors(trackType) {
return config.defaultReadColorPalette;
} else if (trackType === "haplotype") {
return config.defaultHaplotypeColorPalette;
+ } else if (trackType === "node") {
+ return config.defaultHaplotypeColorPalette;
} else {
throw new Error("Invalid track type: " + trackType);
}
diff --git a/src/components/CustomizationAccordion.js b/src/components/CustomizationAccordion.js
index 22219024..7dffbf99 100644
--- a/src/components/CustomizationAccordion.js
+++ b/src/components/CustomizationAccordion.js
@@ -80,6 +80,8 @@ class VisualizationOptions extends Component {
);
} else if (type === "haplotype") {
// TODO: Do nothing for now. Haplotypes get assigned to the graph as their source track right now.
+ } else if (type === "node") {
+      // TODO: Do nothing for now. Node tracks only provide node sequences, so there is nothing to configure here.
} else if (type === "read") {
if (visOptions.showReads) {
trackSettingsList.push(
@@ -197,6 +199,18 @@ class VisualizationOptions extends Component {
Color reads by mapping quality
+