Integrating tabix-based index files for faster subgraph extraction #484
@jmonlong I did the back-merge from It also looks like the tabix-backed codepath can't mix at all with anything on the vg-backed codepath, so I can't use a tabix-backed graph to view a GAM or a small unindexed GFA, and I can't use the simplify switch that calls I also don't think this will work right with the file upload feature, since we can only upload one file per track but Tabix needs two (data and index). We might need to adjust the way that works to allow uploading multiple files per track. It should be possible to make this work great with remote data URLs, by doing range reads on them directly, but I'm not sure if the
Instead of a "node track" and a "graph track", it might make more sense to present this feature as a graph track that consists of four files: the positions and nodes files and their indexes. Is it reasonable to imagine drawing a view with only those four files, and no haplotypes? It looks like that is locked out right now, probably because without the haplotypes file there are no paths/edges at all, and the tube map needs those to draw anything. To support that we might need to get rid of the one-to-one connection between having a haplotypes track file and displaying the haplotype paths. Someone might want to look at a tabix-backed graph but only see the reference paths in a particular view, but the haplotype database still needs to be included to see anything. GBZ is a little like this because it has the haplotype data in the same file as the graph, and we fake it by having the GBZ file provide the graph track and also having it separately as the source of a haplotype track that can turn on and off. Maybe we need to change the track model so that tracks come from databases, and databases are sets of n files that offer m tracks that you can toggle on and off.
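To make the proposed track model concrete, here is a minimal sketch of "databases are sets of n files that offer m tracks". It is only an illustration of the idea, not code from sequenceTubeMap: the class names, the track names, and every file name below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Track:
    """One displayable layer offered by a database."""
    name: str
    enabled: bool = True


@dataclass
class Database:
    """A set of n files that together offer m toggleable tracks."""
    name: str
    files: List[str]
    tracks: Dict[str, Track] = field(default_factory=dict)

    def offer(self, track_name: str, enabled: bool = True) -> None:
        # Register a track that this database can provide.
        self.tracks[track_name] = Track(track_name, enabled)

    def toggle(self, track_name: str, enabled: bool) -> None:
        # Only the display flag changes; the backing files stay attached.
        self.tracks[track_name].enabled = enabled

    def visible_tracks(self) -> List[str]:
        return [t.name for t in self.tracks.values() if t.enabled]


# Hypothetical file names, for illustration only.
tabix_db = Database(
    name="tabix-backed graph",
    files=[
        "example.nodes.tsv.gz", "example.nodes.tsv.gz.tbi",
        "example.pos.bed.gz", "example.pos.bed.gz.tbi",
        "example.haps.gaf.gz", "example.haps.gaf.gz.tbi",
    ],
)
tabix_db.offer("reference paths")
tabix_db.offer("haplotypes")

# Hide the haplotype paths without detaching the haplotype files.
tabix_db.toggle("haplotypes", False)
print(tabix_db.visible_tracks())  # ['reference paths']
```

The point of the split is that hiding a track does not remove the files it depends on, so a view could show only reference paths while the haplotype data stays available to build the graph.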
With this, sequenceTubeMap could work with tabix-indexed files representing the pangenome and use a Python script to query a subgraph quickly. For the HPRC MC v1.1 pangenome, it takes, on average, less than a second to query a region, versus ~30 s currently with vg chunk. All haplotypes in the pangenome can be queried.
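The speedup comes from indexed region lookup: instead of scanning the whole file, the query jumps to the records that overlap the requested interval. The sketch below shows that idea with an in-memory sorted list and binary search. It is a toy stand-in, not the tabix format and not the script from this branch; the record layout (sequence name, start, end, payload) and the node names are assumptions.

```python
from bisect import bisect_left
from collections import defaultdict


class RegionIndex:
    """Toy region index: records sorted by start coordinate, per sequence."""

    def __init__(self, records):
        # records: iterable of (seq_name, start, end, payload), 0-based half-open.
        self._by_seq = defaultdict(list)
        for seq, start, end, payload in records:
            self._by_seq[seq].append((start, end, payload))
        self._starts = {}
        self._max_len = {}
        for seq, recs in self._by_seq.items():
            recs.sort(key=lambda r: r[0])
            self._starts[seq] = [r[0] for r in recs]
            # The longest record bounds how far left an overlap can begin.
            self._max_len[seq] = max(r[1] - r[0] for r in recs)

    def fetch(self, seq, start, end):
        """Return payloads of records overlapping [start, end) on seq."""
        recs = self._by_seq.get(seq)
        if not recs:
            return []
        # Binary search to the first record that could still overlap.
        i = bisect_left(self._starts[seq], start - self._max_len[seq] + 1)
        hits = []
        while i < len(recs) and recs[i][0] < end:
            if recs[i][1] > start:
                hits.append(recs[i][2])
            i += 1
        return hits


# Hypothetical records: intervals on a reference path mapped to node names.
index = RegionIndex([
    ("chr1", 0, 100, "n1"),
    ("chr1", 100, 250, "n2"),
    ("chr1", 250, 300, "n3"),
    ("chr2", 0, 50, "n4"),
])
print(index.fetch("chr1", 90, 120))  # ['n1', 'n2']
```

A real tabix index works on a bgzip-compressed file and stores file offsets per genomic bin, but the access pattern is the same: seek to the candidate records, then filter by overlap.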
More details on the tabix indexing and the subgraph extraction are in https://github.com/jmonlong/manu-vggafannot
I've tried to document what the new index files are, how to use them, and how to make them in a new README.tabix.md file. This branch also contained other minor changes, like