Machine learning approaches for genome-wide cis-regulatory element discovery and transcription factor binding sites modeling
Advances in experimental technologies in biology, including complete genome sequencing and high-density microarrays, have enabled biologists to collect molecular biology data at an unprecedented pace and scale. However, because these high-throughput experiments produce data of diverse types and in enormous volume, more sophisticated computational methods are urgently needed to analyze them and extract useful biological insight. In this dissertation, we identify several critical challenges in modeling gene transcriptional regulatory networks and develop machine-learning-based algorithms to address them. First, we propose a genome-wide cis-regulatory motif discovery approach that combines promoter sequences with gene co-expression networks to predict the cis-regulatory motif of each individual gene, thereby overcoming a limitation of current clustering-based methods, which often fail to provide gene-specific or species-specific predictions. Second, we develop a multi-instance-learning-based method to model the physical interactions between transcription factors (TFs) and DNA. By better handling regional information in DNA sequences, this method significantly outperforms traditional single-instance-learning-based methods in predicting both in vivo and in vitro TF-DNA interactions. Finally, we propose a novel TF-DNA interaction model that combines structural features with multi-instance learning, which further improves the accuracy of modeling in vitro TF-DNA interactions. This research demonstrates the advantage of machine learning methods in modeling transcriptional regulatory networks and reveals several promising directions for the future development of computational methods in this area.
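The multi-instance view of TF-DNA binding described above can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes only the standard multi-instance setup: a DNA sequence is treated as a bag of overlapping windows (instances), and the bag is scored by its best-scoring instance. The function names and the toy scoring matrix `TOY_PWM` are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

# Hypothetical per-position scores for a 4-bp motif "ACGT".
# Illustrative values only; not taken from the dissertation.
TOY_PWM: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]


def make_bag(sequence: str, width: int) -> List[str]:
    """Split a sequence into all overlapping windows of the given width."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def score_instance(window: str, pwm: List[Dict[str, float]]) -> float:
    """Sum the per-position scores of one window (one instance)."""
    return sum(column[base] for column, base in zip(pwm, window))


def score_bag(sequence: str, pwm: List[Dict[str, float]]) -> float:
    """Score a whole sequence as the maximum over its instances.

    This is the standard multi-instance assumption: a bag is positive
    if at least one of its instances is positive.
    """
    bag = make_bag(sequence, len(pwm))
    if not bag:
        raise ValueError("sequence is shorter than the motif width")
    return max(score_instance(window, pwm) for window in bag)


if __name__ == "__main__":
    print(score_bag("TTACGTTT", TOY_PWM))  # contains the toy motif
    print(score_bag("TTTTTTTT", TOY_PWM))  # does not contain it
```

A single-instance model, by contrast, would assign one feature vector and one label to the whole sequence, so it cannot attribute the binding signal to a particular region. Replacing the maximum with another aggregation, such as a soft maximum, is a common variation.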