Skip to content

ts: Create haplotypes from a tree sequence for downstream AlphaSimR work (but subsample sites!) #4

@gregorgorjanc

Description

@gregorgorjanc

@LynxJinyangii @hannesbecher We can import tree sequence into R or simulate it via reticulate. For AlphaSimR, we however need base population haplotypes/genomes. Usually we get these via:

founderGenomes = quickHaplo(nInd = 10, nChr = 2, segSites = 3)
founderGenomes = runMacs(nInd = 10, nChr = 2, segSites = 3)

key point here is that we preset the number of sites we will work with segSites (we limit this to a manageable number to avoid working with large haplotype matrix that has quadratic storage and compute complexity) and that the resulting object is of type:

> is(founderGenomes)
[1] "MapPop" "RawPop"

To swap quickHaplo or runMacs with imported tree sequence, we should create a function that:

  1. converts a ts object to haplotypes (but only for requested number of sites)
  2. creates a MapPop object

Re 1) We should be mindful of sub-sampling sites to make the haplotype matrix manageable. Looking at https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.genotype_matrix I don't see a way to subsample, but https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.variants seems to be a viable alternative

Re 2) look at AlphaSimR::quickHaplo() or AlphaSimR:runMacs() on how to construct a MapPop object.

How shall we call this function? Suggestions: asMapPop() would follow the usual casting function names in R, so the call would look like founderGenomes = asMapPop(ts). An alternative would be ts2MapPop(), so the call would look like founderGenomes = ts2MapPop(ts). Since we already have isMapPop(), then asMapPop() feels like a natural choice. Keen to hear suggestions!

Hmm, we should think about handling one vs multiple ts objects for multiple chromosomes! How will we present these on R side? As a vector/list of multiple ts objects? How would one do this on the Python side? As [ts1, ts2, ...]?

Edit: Once we have the asMapPop() functionality, we could then run:

# founderTs is obtained in an appropriate way
founderGenomes = asMapPop(founderTs, segSites = 3)
# asMapPop should recognise how many chromosomes (how many separate tree sequences we have in founderTs)
SP = SimParam$new(founderGenomes)
# Continue with AlphaSimR as usually with the "box of haplotype" approach and
# ignoring tree sequence entirely from this point onwards
# (it was used only to get base population haplotypes stored in founderGenomes)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions