Signal processing and machine learning for bioinformatics applications
The pipeline of a bioinformatics -omics application includes low-level and high-level data processing. After data acquisition with instruments such as Liquid Chromatography Mass Spectrometry/Tandem Mass Spectrometry (LC-MS/MS), low-level processing cleans the data using techniques from signal processing, statistics, computer science, and applied mathematics, and quantifies the features (proteins, genes, metabolites, etc.). High-level processing is then applied to find biomarkers, that is, the most informative features, which can be used to build a classifier that distinguishes cancer from normal samples or to model the genetic pathway of a disease.
In this work, we broadly study existing methods for both the low-level and high-level processing steps and develop our own algorithms to improve on existing results. We developed the Peak-Link (PL) algorithm and the MZDASoft software specifically for proteomics data. We also propose a feature selection method, called PSI, to extract the informative genes/proteins in biological samples. PL uses information in both the time and frequency domains as inputs to a non-linear support vector machine (SVM) classifier. The PL algorithm first applies a threshold on retention time to remove candidate corresponding peaks with excessively large elution time shifts, then calculates the correlation between a pair of candidate peaks after removing noise through wavelet transformation. After retention time and peak shape correlation are converted to statistical scores, an SVM classifier is trained and applied to differentiate corresponding from non-corresponding peptide peaks. PL is tested in two challenging cases, in which LC-MS/MS samples are collected from different disease states and from different labs. Test results show a significant improvement in linking accuracy compared to other algorithms.
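The first two PL stages described above (retention-time gating, then peak-shape correlation after wavelet denoising) can be illustrated with a minimal sketch. This is not the PL implementation: the single-level Haar wavelet, the soft threshold, the dictionary layout of a peak, and the names `haar_denoise`, `pearson`, and `candidate_features` are all assumptions made for illustration. The conversion to statistical scores and the SVM stage are omitted.

```python
import math


def haar_denoise(signal, threshold):
    """Single-level Haar transform with soft thresholding of detail coefficients.

    Assumption: the source does not name the wavelet; Haar is used for brevity.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    # Soft threshold: shrink each detail coefficient toward zero.
    detail = [math.copysign(max(abs(d) - threshold, 0.0), d) for d in detail]
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])  # inverse Haar step
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def candidate_features(reference, candidates, max_rt_shift, threshold):
    """Gate candidates by retention-time shift, then score peak-shape similarity.

    Returns (candidate index, RT shift, shape correlation) for surviving candidates.
    """
    ref_shape = haar_denoise(reference["shape"], threshold)
    features = []
    for index, peak in enumerate(candidates):
        rt_shift = abs(peak["rt"] - reference["rt"])
        if rt_shift > max_rt_shift:
            continue  # elution time shift too large: not a plausible match
        corr = pearson(ref_shape, haar_denoise(peak["shape"], threshold))
        features.append((index, rt_shift, corr))
    return features


# Hypothetical toy peaks: "rt" in minutes, "shape" as sampled intensities.
reference = {"rt": 30.0, "shape": [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]}
candidates = [
    {"rt": 30.5, "shape": [0.2, 0.9, 4.1, 8.8, 4.2, 0.9, 0.1, 0.0]},
    {"rt": 45.0, "shape": [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]},
    {"rt": 29.8, "shape": [9.0, 8.0, 6.0, 5.0, 3.0, 2.0, 1.0, 0.0]},
]
features = candidate_features(reference, candidates, max_rt_shift=2.0, threshold=0.5)
```

In this sketch the second candidate is dropped by the retention-time gate even though its shape is identical to the reference, while the third passes the gate but receives a low shape correlation. The two surviving feature pairs are the kind of input a classifier would then separate.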
MZDASoft is a new architecture based on parallel processing that extracts LC-MS peak features and saves them in database files to enable the implementation of PL for multiple samples. The software has been deployed in High Performance Computing (HPC) environments. The core part of the software, the MZDASoft Parallel Peak Extractor (PPE), is publicly available and can be downloaded together with user's and developer's guides. It can be run directly at HPC centers. The quantification applications, MZDASoft TandemQuant and MZDASoft PeakLink, are written in MATLAB and compiled with the MATLAB runtime compiler. A sample script that incorporates all necessary processing steps of MZDASoft for LC-MS/MS quantification in a parallel processing environment is available. The project webpage is http://compgenomics.utsa.edu/zgroup/MZDASoft. In test cases, significantly more (100%--500%) proteins can be compared over multiple samples, with better quantification accuracy.
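The general pattern described above, extracting peak features per sample in parallel and saving them to per-sample database files, can be sketched as follows. This is not MZDASoft code: the local-maximum peak picker, the SQLite table layout, and the names `extract_peaks`, `process_sample`, and `run_all` are hypothetical. A thread pool is used only to keep the sketch short; an HPC deployment would distribute samples across processes or nodes.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def extract_peaks(trace, min_intensity):
    """Return (index, intensity) for local maxima above min_intensity."""
    return [
        (i, trace[i])
        for i in range(1, len(trace) - 1)
        if trace[i] > min_intensity
        and trace[i] > trace[i - 1]
        and trace[i] >= trace[i + 1]
    ]


def process_sample(name, trace, out_dir, min_intensity=5.0):
    """Extract peaks for one sample and store them in its own database file."""
    peaks = extract_peaks(trace, min_intensity)
    db_path = os.path.join(out_dir, name + ".sqlite")
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE peaks (scan_index INTEGER, intensity REAL)")
        conn.executemany("INSERT INTO peaks VALUES (?, ?)", peaks)
        conn.commit()
    finally:
        conn.close()
    return db_path


def run_all(samples, out_dir, workers=4):
    """Process every sample concurrently; return {sample name: database path}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(process_sample, name, trace, out_dir)
            for name, trace in samples.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical toy intensity traces, one per sample.
samples = {
    "sample_a": [0, 2, 8, 3, 1, 6, 12, 7, 0],
    "sample_b": [0, 1, 2, 1, 0, 9, 4, 0, 0],
}

peak_counts = {}
with tempfile.TemporaryDirectory() as out_dir:
    for name, db_path in run_all(samples, out_dir).items():
        conn = sqlite3.connect(db_path)
        try:
            peak_counts[name] = conn.execute("SELECT COUNT(*) FROM peaks").fetchone()[0]
        finally:
            conn.close()
```

Writing one database file per sample keeps the workers independent of each other, so a later linking step can open only the samples it needs to compare.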
After the data are acquired and cleaned, biologists are interested in capturing the genes that are important for differentiating between normal and cancer patients. Despite the wide range of feature selection methods for microarray datasets, researchers are still far from having a single method that works for all kinds of biological samples. The lack of a robust algorithm for identifying the features relevant to a disease and removing irrelevant and redundant features is a major challenge in high-throughput -omics experiments. We report the performance of 13 commonly used methods on a large collection of 22 binary-class microarray gene expression datasets. We also present an algorithm, the Positive Synergy Index (PSI), and compare it with the 13 selected methods. Specifically, we compare the classification accuracy and running time of these well-known feature selection methods and PSI. Each algorithm is applied to each dataset separately to rank the features according to its decision rules; the top 50 features reported by that method are then passed to SVM and KNN classifiers to separate cancer and normal samples. Comparing the average results across all methods shows that PSI on average performs slightly better than the other methods, while having lower time complexity than the cooperative index and k-TSP methods, which, like PSI, rank features using a similar concept.
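The evaluation protocol above (rank features, keep the top-ranked ones, classify with KNN) can be sketched in a few lines. This sketch does not implement PSI or any of the 13 compared methods: a simple t-like score stands in for the ranking rule, the data are a hypothetical toy matrix, and `top_k=1` replaces the top 50 features because the toy set has only four. Ranking is repeated inside each leave-one-out fold as one reasonable way to avoid selection bias; the text does not state how the original evaluation was split.

```python
import math


def t_like_score(values, labels):
    """Stand-in ranking score: |difference of class means| / pooled std. deviation."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    pooled_var = (
        sum((v - mean_pos) ** 2 for v in pos) + sum((v - mean_neg) ** 2 for v in neg)
    ) / (len(values) - 2)
    return abs(mean_pos - mean_neg) / math.sqrt(pooled_var + 1e-12)


def rank_features(X, labels):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(X[0])
    scores = [t_like_score([row[j] for row in X], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def loo_accuracy(X, labels, top_k, k=3):
    """Leave-one-out accuracy of KNN on the top_k features, re-ranked per fold."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        selected = rank_features(train_X, train_y)[:top_k]
        reduced_train = [[row[j] for j in selected] for row in train_X]
        reduced_test = [X[i][j] for j in selected]
        correct += knn_predict(reduced_train, train_y, reduced_test, k) == labels[i]
    return correct / len(X)


# Hypothetical toy expression matrix: rows are samples, columns are genes.
# Only gene 0 separates the classes (1 = cancer, 0 = normal).
X = [
    [5.1, 0.3, 1.2, 0.9],
    [4.8, 0.8, 0.7, 1.1],
    [5.3, 0.5, 1.0, 0.2],
    [4.9, 0.1, 0.4, 0.6],
    [1.0, 0.4, 1.1, 0.8],
    [1.2, 0.7, 0.6, 0.3],
    [0.9, 0.2, 0.9, 1.0],
    [1.1, 0.6, 0.5, 0.5],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ranking = rank_features(X, labels)
accuracy = loo_accuracy(X, labels, top_k=1)
```

Swapping `t_like_score` for another ranking rule leaves the rest of the protocol unchanged, which is what makes a side-by-side comparison of feature selection methods on the same datasets straightforward.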