Machine learning and graph theoretical approaches toward precise disease classification, prognosis and treatment
High-throughput profiling techniques such as microarrays and next-generation sequencing have revolutionized modern biology and enabled the study of disease at genome scale. Because these technologies produce very large volumes of data, extracting key biological insights from them requires an enormous scientific effort. There is therefore an urgent need for novel computational methodologies and tools that integrate different types of data to deepen system-level understanding of disease and advance personalized treatment. From this perspective, this dissertation addresses three critical challenges in computational biology and bioinformatics: (1) identifying key genes/biomarkers related to disease, (2) designing better classifiers of disease status to improve the overall treatment process, and (3) designing better models for predicting drug efficacy at the early stage of drug development. The dissertation contributes to disease classification, prognosis and treatment on each of these challenges. First, we developed a graph-theoretical approach that combines microarray gene expression profiles with a protein-protein interaction (PPI) network for biomarker discovery in different cancers. Identifying the key genes (biomarkers) involved in different diseases is a central problem in systems biology and an important step toward constructing better models for disease prognosis and treatment. By using the PPI network to capture pathway-level gene-gene relationships, our method has the potential to identify true biomarkers that are reproducible across different patient cohorts and that can increase the accuracy of disease diagnosis and prognosis. Next, we designed a personalized committee-based approach to predict metastatic status for cancer patients from their gene expression profiles.
The key idea of our method is to construct personalized models that address both the heterogeneity of disease, which is normally overlooked by existing methods, and the ambiguous and stereotypical cancer subtypes defined in previous studies. Results showed that our method significantly improves cancer metastasis prediction compared to other popular methods. Finally, we developed an ensemble classification method that combines multiple types of data for better prediction of drug side-effects. The data were obtained from different public sources and include drug chemical structure information and drug side-effect information. Applying our method to a large set of characterized and uncharacterized small-molecule drugs in the DrugBank database, we found that it significantly increases the accuracy of drug side-effect prediction compared to other methods and performs better on 'hard to predict' rare side-effect cases. Taken together, the results for these sub-problems demonstrate the feasibility of applying and enhancing machine learning and graph-based approaches to solve complex biological problems.