Computational Methods for Characterizing Gene Expression Profiles
Advances in next-generation sequencing (NGS) technologies have revolutionized genomics and transcriptomics research in cancer. High-throughput transcriptome sequencing (RNA-seq) is the most commonly used NGS technology for investigating aberrant mRNA expression under different biological conditions. Novel computational methods need to be developed to elucidate complex gene alterations and interactions while adapting to the accumulation of genomic data. This thesis investigates computational means of characterizing gene expression profiles in cancer, to pave the way for understanding the molecular events associated with tumor initiation, promotion, and progression. Specifically, we proposed three bioinformatics methods that characterize gene expression profiles in different phases. 1) Precision and quantification of differentially expressed genes (DEGs). We proposed a novel statistical model that estimates the actual expression level from the observable and noise measurements in RNA-seq data, yielding much-improved detection of differentially expressed genes. 2) Quantification of variably expressed genes (VEGs). We designed a gene expression variation model to characterize single-cell RNA-seq data. We used the relation between the coefficient of variation and the average expression level to address the over-dispersion of single-cell data, and the corresponding statistical significance to quantify the variably expressed genes that are associated with tumor heterogeneity. 3) Deep learning machines and their regularization with functional gene sets to characterize pan-cancer data. Because deep learning can discover knowledge from big data, we employed a deep learning architecture to characterize pan-cancer data. We developed a multi-layer autoencoder model that incorporates a priori defined gene sets that retain the crucial biological features in the latent layer. We introduced the gene superset, an unbiased combination of gene sets, which provides high reproducibility in survival analysis and accurate prediction of cancer subtypes.
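
As an illustration of the relation between coefficient of variation and average expression level mentioned in 2), the following is a minimal sketch and not the model developed in the thesis. It assumes a genes-by-cells count matrix, fits a straight line to log10(CV) against log10(mean), and treats the residual from that line as a crude measure of excess variability. The function name `cv_mean_residuals`, the linear trend, and the simulated data are all assumptions made for this example. No statistical significance is computed here.

```python
import numpy as np

def cv_mean_residuals(counts):
    """Residual of log10(CV) from a linear trend against log10(mean).

    counts: genes x cells array of non-negative expression values.
    Returns (mean, residual, slope); residual is NaN for genes with
    zero mean or zero standard deviation.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    sd = counts.std(axis=1, ddof=1)
    keep = (mean > 0) & (sd > 0)

    log_mean = np.log10(mean[keep])
    log_cv = np.log10(sd[keep] / mean[keep])

    # Hypothetical choice: a first-degree polynomial as the CV-mean trend.
    slope, intercept = np.polyfit(log_mean, log_cv, deg=1)

    resid = np.full(counts.shape[0], np.nan)
    resid[keep] = log_cv - (slope * log_mean + intercept)
    return mean, resid, slope

# Simulated data (illustrative only): 200 Poisson genes across 500 cells,
# with gene 0 replaced by an over-dispersed negative binomial gene
# whose expected mean is 100.
rng = np.random.default_rng(0)
gene_means = np.logspace(0.5, 3, 200)
counts = rng.poisson(lam=gene_means[:, None], size=(200, 500))
counts[0] = rng.negative_binomial(n=1, p=1.0 / 101.0, size=500)

mean, resid, slope = cv_mean_residuals(counts)
top_gene = int(np.nanargmax(resid))  # gene with the largest excess variability
```

For Poisson-distributed counts the CV equals one over the square root of the mean, so the fitted slope in log-log space is expected to lie near -0.5. Genes that sit well above the trend, such as the simulated over-dispersed gene, are candidates for variably expressed genes under this simplified view.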