Compatibility with compositional data #10

@ThirstyGeo


Sometimes we know that a set of variables should add up to a given total. Measurements of proportions, percentages, probabilities, or concentrations are compositional data of this kind. Such data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.

The complication with compositional data is that the features are inherently mathematically related, which leads to spurious correlation coefficients when conventional statistical or ML approaches are applied (e.g., calculating Euclidean distance metrics). However, using K-L distance is potentially a way to avoid this issue, so MIDAS might offer a nice deep learning solution to imputation problems involving compositional data.
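To illustrate the spurious-correlation point, here is a minimal sketch on synthetic data (not the Kola dataset; the variable names and distribution are assumptions chosen for illustration). Three independent positive variables are essentially uncorrelated, but once each row is rescaled to sum to 1, the parts become negatively correlated purely because of the sum constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent positive "raw" abundances (synthetic, illustrative only).
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(5000, 3))

# Closure: divide each row by its total so the parts sum to 1.
comp = raw / raw.sum(axis=1, keepdims=True)

# Pearson correlation matrices before and after closure.
raw_corr = np.corrcoef(raw, rowvar=False)
comp_corr = np.corrcoef(comp, rowvar=False)

print("raw correlation, parts 0 and 1:   ", round(raw_corr[0, 1], 3))
print("closed correlation, parts 0 and 1:", round(comp_corr[0, 1], 3))
```

The raw variables were generated independently, so any strong correlation in the closed data is an artefact of the closure step rather than a real relationship.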

However, in some preliminary experiments on classic compositional data imputation datasets, MIDASpy hasn't performed as well as I might have expected, and I was wondering if you'd be able to comment?

For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset and compared the known values against the imputed ones. You can see a marked linear trend in the imputed values.
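For anyone wanting to reproduce this kind of check, the following is a rough sketch of the masking-and-comparison procedure, not the actual script used for the experiment. It runs on synthetic compositions rather than the Kola data, and a column-mean fill stands in for the imputer (an assumption made only so the example is self-contained; MIDASpy output would replace `imputed`).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a compositional table (rows sum to 1); not the Kola data.
raw = rng.lognormal(mean=0.0, sigma=0.7, size=(500, 4))
known = raw / raw.sum(axis=1, keepdims=True)

# Impose roughly 30% missingness, independently for each cell.
mask = rng.random(known.shape) < 0.30
observed = known.copy()
observed[mask] = np.nan

# Placeholder imputer (assumption): fill each gap with the observed column mean.
col_means = np.nanmean(observed, axis=0)
imputed = np.where(mask, col_means, observed)

# Compare known vs imputed values on the masked cells only.
rmse = np.sqrt(np.mean((known[mask] - imputed[mask]) ** 2))

# Check how far imputed rows drift from the sum-to-1 constraint.
row_sum_error = np.abs(imputed.sum(axis=1) - 1.0).max()

print("RMSE on masked cells:", rmse)
print("largest row-sum deviation from 1:", row_sum_error)
```

Plotting `known[mask]` against `imputed[mask]` gives the known-vs-imputed comparison described above, and the row-sum check shows whether an imputer respects the compositional constraint.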

If you are interested to take a look, here is a recent paper which references the Kola datasets, along with a copy of the data:
Paper and two datasets

Labels: enhancement, help wanted