Network-based unsupervised machine learning for single cell data analysis

Date

2021

Authors

Zand, Maryam

Abstract

Genome-wide sequencing at single-cell resolution is a recent, groundbreaking technology that promises researchers critical and unprecedented insights into complex biological systems, enabling the identification of heterogeneous cell types and the inference of developmental trajectories. However, the potential of this technology cannot be fully realized until effective computational algorithms are developed that address the unique and challenging characteristics of single-cell datasets. Traditional computational methods fall short on several of these challenges, hindering cell- and gene-level characterization such as cell type identification and downstream analysis. The focus of this thesis is therefore to develop specialized algorithms capable of handling single-cell data. In this dissertation, we tackle three impediments to the advancement of single-cell omics data analysis: i) the widespread presence of missing values (dropouts), ii) the difficulty of unsupervised knowledge discovery due to noise and sparsity in the data, and iii) the loss of spatial information about cells during experiments.

We address these challenges by designing network-based unsupervised machine learning methods. Using networks as a prior is motivated by the nature of single-cell data, which is constrained by the structure of the underlying cell system. To alleviate the first issue, we introduce netImpute, which leverages the hidden information in gene co-expression networks through a Random Walk with Restart strategy that aims to recover real signals. Our results show that netImpute substantially enhances clustering accuracy and data visualization clarity, thanks to its effective treatment of dropouts. To address the second issue, we propose scQcut, a machine learning method that is robust to noise and sparsity and improves cell type identification. scQcut is an enhanced graph-based, parameter-free clustering method that uses a topology-based criterion to guide the selection of the optimal parameters for a k-nearest neighbor graph in order to predict intrinsic clustering structures in single-cell data. Results from a comprehensive study of experimental and synthetic datasets demonstrate that scQcut outperforms several state-of-the-art clustering methods in both clustering accuracy and the ability to correctly identify rare cell types. Finally, to help recover the spatial information of single cells, we propose an unsupervised feature selection method that optimizes two biologically motivated metrics based on the consistency between gene expression similarity and cell proximity. Results show that this method can successfully recover landmark gene patterns and cell locations. This finding encourages us to explore spatial mapping with fewer landmark genes, or even none. Given these encouraging results, this dissertation provides biologists with a set of versatile and easy-to-use tools for unsupervised knowledge discovery from large-scale single-cell data.
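
The Random Walk with Restart idea mentioned above can be illustrated with a minimal sketch. This is not the netImpute implementation: the function name `rwr_smooth`, the `restart` parameter, the column normalization, and the self-loop treatment of isolated genes are all assumptions made for illustration. The sketch only shows how an expression signal might be diffused over a gene-gene network so that a zero (possible dropout) picks up signal from connected genes.

```python
import numpy as np

def rwr_smooth(expr, adj, restart=0.5, n_iter=100, tol=1e-8):
    """Diffuse expression over a gene network with Random Walk with Restart.

    expr : (cells x genes) expression matrix.
    adj  : (genes x genes) non-negative adjacency matrix (hypothetical
           co-expression network; how it is built is not specified here).
    """
    expr = np.asarray(expr, dtype=float)
    adj = np.asarray(adj, dtype=float)

    # Assumption: give isolated genes a self-loop so their signal is kept.
    col_sums = adj.sum(axis=0)
    adj = adj + np.diag((col_sums == 0).astype(float))

    # Column-normalize so each column of W sums to 1 (transition matrix).
    W = adj / adj.sum(axis=0)

    P0 = expr.T          # genes x cells; each column is one cell's start signal
    P = P0.copy()
    for _ in range(n_iter):
        # One RWR step: walk along edges, then restart at the observed values.
        P_next = (1.0 - restart) * (W @ P) + restart * P0
        converged = np.abs(P_next - P).max() < tol
        P = P_next
        if converged:
            break
    return P.T           # back to cells x genes


# Toy usage: genes 0 and 1 are linked, gene 2 is isolated.
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]])
expr = np.array([[4.0, 0.0, 2.0]])   # gene 1 reads zero in this cell
smoothed = rwr_smooth(expr, adj)
```

In this toy example the zero entry for gene 1 receives part of the signal from its neighbor gene 0, while the isolated gene 2 is left unchanged. A higher `restart` value keeps the result closer to the observed data.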

Description

This item is available only to currently enrolled UTSA students, faculty, or staff.

Keywords

Clustering, Graph theory, Imputation, Machine learning, Network, Single cell Sequencing

Department

Computer Science