Description
Sometimes we know that a set of variables should add up to a given total. Measurements such as proportions, percentages, probabilities, and concentrations are compositional data of this kind. Such data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.
The complication with compositional data is that the features are inherently mathematically related, which leads to spurious correlation coefficients when conventional statistical or ML approaches are applied (e.g., anything that relies on Euclidean distance metrics). However, using the K-L (Kullback-Leibler) distance is potentially a way to avoid this issue, so MIDAS might offer a nice deep learning solution to imputation problems involving compositional data.
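To illustrate the spurious-correlation point, here is a minimal sketch using synthetic data (not the Kola data). It assumes three independent positive variables, closes them to sum to 1, and compares the correlation matrices before and after. The `clr` helper is my own illustrative implementation of the centred log-ratio transform commonly used for compositional data; it is not part of MIDASpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent positive "raw" amounts (synthetic, for illustration only).
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(5000, 3))

# Closure: rescale each row so its parts sum to 1, as in compositional data.
comp = raw / raw.sum(axis=1, keepdims=True)

# Correlations between columns before and after closure.
raw_corr = np.corrcoef(raw, rowvar=False)
comp_corr = np.corrcoef(comp, rowvar=False)


def clr(x):
    """Centred log-ratio transform (hypothetical helper, not a MIDASpy function)."""
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)


clr_comp = clr(comp)

print("raw correlations:\n", raw_corr.round(2))
print("closed correlations:\n", comp_corr.round(2))
```

The raw columns are independent, so their off-diagonal correlations should be close to zero, whereas the closed columns become negatively correlated purely because of the sum constraint. The CLR transform is invariant to the closure step, which is one reason log-ratio methods are often applied before conventional distance-based analysis.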
However, in some preliminary experiments on classic compositional data imputation datasets, MIDASpy hasn't performed as well as I might have expected, and I was wondering if you'd be able to comment?
For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset, and compared the known values against the imputed values. You can see a marked linear trend in the imputed values.
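For reference, the evaluation setup described above can be sketched roughly as follows. This is not the exact script used: the data here are a synthetic stand-in for the Kola dataset, and a column-mean fill is used as a placeholder where the MIDASpy imputations would go. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the real compositional table (rows sum to 1).
raw = rng.lognormal(size=(200, 5))
X_true = raw / raw.sum(axis=1, keepdims=True)

# Impose roughly 30% missingness completely at random, cell by cell.
mask = rng.random(X_true.shape) < 0.30
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Placeholder imputer (assumption): fill each missing cell with its column mean.
# In the real experiment this step would be replaced by the MIDASpy output.
col_means = np.nanmean(X_missing, axis=0)
X_imputed = np.where(mask, col_means, X_missing)

# Compare known vs imputed values on the masked cells only.
known = X_true[mask]
imputed = X_imputed[mask]
rmse = np.sqrt(np.mean((known - imputed) ** 2))

# Check whether the imputed rows still respect the sum constraint.
row_sums = X_imputed.sum(axis=1)

print(f"masked fraction: {mask.mean():.3f}")
print(f"RMSE on masked cells: {rmse:.4f}")
print(f"row sums range: {row_sums.min():.3f} to {row_sums.max():.3f}")
```

Plotting `known` against `imputed` gives the kind of scatter plot referred to above. Inspecting `row_sums` is also worthwhile, since an imputer that ignores the compositional constraint will generally return rows that no longer add up to the original total.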
If you are interested in taking a look, here is a recent paper that references the Kola datasets, along with a copy of the data:
Paper and two datasets