Treffer: Statistical analysis of RNA sequencing count data

Title:
Statistical analysis of RNA sequencing count data
Authors:
Contributors:
Di, Yanming, Schafer, Daniel W., Emerson, Sarah C., Chang, Jeff H., Jiang, Yuan, Statistics, Oregon State University. Graduate School
Publisher Information:
Oregon State University
Collection:
ScholarsArchive@OSU (Oregon State University)
Document Type:
Dissertation doctoral or postdoctoral thesis
Language:
English
unknown
Rights:
All rights reserved
Accession Number:
edsbas.460A6612
Database:
BASE

Weitere Informationen

RNA-Sequencing (RNA-Seq) has rapidly become the de facto technique in transcriptome studies. However, established statistical methods for analyzing experimental and observational microarray studies need to be revised or completely re-invented to accommodate RNA-Seq data's unique characteristics. In this dissertation, we focus on statistical analyses performed at two particular stages in the RNA-Seq pipeline, namely, regression analysis of gene expression levels including tests for differential expression (DE) and the downstream Gene Ontology (GO) enrichment analysis. The negative binomial (NB) distribution has been widely adopted to model RNA-Seq read counts for its flexibility in accounting for any extra-Poisson variability. Because of the relatively small number of samples in a typical RNA-Seq experiment, power-saving strategies include assuming some commonalities of the NB dispersion parameters across genes, via simple models relating them to mean expression rates. Many such NB dispersion models have been proposed, but there is limited research on evaluating model adequacy. We propose a simulation-based goodness-of- t (GOF) test with diagnostic graphics to assess the NB assumption for a single gene via parametric bootstrap and empirical probability plots, and assess the adequacy of NB dispersion models by combining individual GOF test p-values from all genes. Our simulation studies and real data analyses suggest the NB assumption is valid for modeling a gene's read counts, and provide evidence on how NB dispersion models differ in capturing the variation in the dispersion. It is not well understood to what degree a dispersion-modeling approach can still be useful when a fitted dispersion model captures a significant part, but not all, of the variation in the dispersion. As a further step towards understanding the power-robustness trade-offs of NB dispersion models, we propose a simple statistic to quantify the inadequacy of a fitted NB dispersion model. Subsequent power-robustness analyses are guided by this ...