Biological network reconstruction, denoising, and applications in cancer classification
Abstract
Recent advances in high-throughput technology have dramatically increased the amount of experimental data available in biological research, such as complete genome sequences, transcriptomic data under diverse conditions, and interaction networks among different components in the cell. However, this exponentially growing volume of data challenges the conventional gene-based paradigm for understanding biology. Efficient and effective computational methods are needed to clean, analyze, and model the data from a whole-systems perspective. To achieve these goals, this research attempts to address several key challenges in bioinformatics associated with constructing functional gene networks and using those networks to better understand and predict cancer development and progression. Specifically, this dissertation makes contributions in three relatively independent but highly related sub-areas of bioinformatics. First, an optimization algorithm based on particle swarm intelligence has been developed to efficiently identify transcription factor binding site (TFBS) motifs, which often consist of two short DNA sequence patterns separated by a variable-length gap. This work can help decipher complex gene regulatory networks and understand gene functions. Second, a novel random walk based algorithm has been proposed to remove spurious protein-protein interactions and predict new interactions based solely on the topological properties of proteins in an existing protein-protein interaction network. Experimental results showed that the method can significantly improve the quality of existing protein-protein interaction networks in yeast and human, which in turn resulted in much better accuracy of protein complex prediction. Finally, a new method has been developed to improve cancer prognosis by combining gene expression microarray data and protein-protein interaction networks.
Utilizing a random walk algorithm, our method was able to identify novel biomarker genes that can significantly improve the accuracy of prognosis for breast cancer metastasis. Importantly, these individual biomarkers are not differentially expressed and therefore would not be detectable by conventional classification methods that treat individual genes as independent features. Taken together, the results achieved in these diverse sub-areas demonstrate the feasibility of using machine learning approaches to assist biological research at a systems level.
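
To make the idea of scoring proteins by network topology more concrete, the following is a minimal sketch of a random walk with restart on a toy undirected interaction network. It is not the algorithm developed in the dissertation, whose details the abstract does not give; random walk with restart is simply one common way to derive topology-based proximity scores. The function name `random_walk_with_restart`, the `restart` value, and the five-node `toy_network` are all hypothetical choices made for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of each node.

    adj     -- dict mapping each node to the set of its neighbours (undirected)
    seed    -- node the walker jumps back to with probability `restart`
    restart -- restart probability (an assumed value, not taken from the source)
    """
    nodes = sorted(adj)
    # Start with all probability mass on the seed node.
    prob = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        # Mass contributed by the restart jump.
        nxt = {n: restart if n == seed else 0.0 for n in nodes}
        for u in nodes:
            neighbours = adj[u]
            if not neighbours:
                # An isolated node has nowhere to go, so return its mass to the seed.
                nxt[seed] += (1.0 - restart) * prob[u]
                continue
            # Otherwise spread the remaining mass evenly over the neighbours.
            share = (1.0 - restart) * prob[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

scores = random_walk_with_restart(toy_network, seed="A")
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

In this sketch, nodes that are topologically close to the seed receive higher scores than distant ones. Scores of this kind could, under the stated assumptions, be used to rank candidate interactions or genes, but how the dissertation actually uses its random walk is not specified here.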