Treffer: Advances in Machine Learning for Compositional Data

Title:
Advances in Machine Learning for Compositional Data
Publication Year:
2022
Collection:
Columbia University: Academic Commons
Document Type:
Dissertation thesis
Language:
English
DOI:
10.7916/vztk-yc59
Accession Number:
edsbas.3C1E1436
Database:
BASE

Weitere Informationen

Compositional data refers to simplex-valued data, or equivalently, nonnegative vectors whose totals are uninformative. This data modality is of relevance across several scientific domains. A classical example of compositional data is the chemical composition of geological samples, e.g., major-oxide concentrations. A more modern example arises from the microbial populations recorded using high-throughput genetic sequencing technologies, e.g., the gut microbiome. This dissertation presents a set of methodological and theoretical contributions that advance the state of the art in the analysis of compositional data. Our work can be divided along two categories: problems in which compositional data represents the input to a predictive model, and problems in which it represents the output of the model. For the first class of problems, we build on the popular log-ratio framework to develop an efficient learning algorithm for high-dimensional compositional data. Our algorithm runs orders of magnitude faster than competing alternatives, without sacrificing model quality. For the second class of problems, we define a novel exponential family of probability distributions supported on the simplex. This distribution enjoys attractive mathematical properties and provides a performant probability model for simplex-valued outcomes. Taken together, our results constitute a broad contribution to the toolkit of researchers and practitioners studying compositional data.