Local parametric density-based outlier detection and ensemble learning with applications to malware detection
Abstract
Local density-based outlier detection has been shown to be a powerful tool for detecting outliers in the unsupervised setting. However, most methods fail to exploit useful information, such as the local covariance structure, in the density estimates. In addition, local density-based techniques present computational challenges for large high-dimensional data sets, which make it difficult to classify test samples quickly.
In this dissertation, a new local density-based outlier detection method is presented for ranking outliers in the unsupervised setting using a multivariate t-distribution with robust locally weighted location and scatter parameter estimates. A simple updating procedure is introduced to re-estimate location and scatter using a weight function based on the initial density estimates. The robustness of the weight function is explored through M-estimation and simulation. Three outlier scores are proposed: the local parametric density estimate (LPDE), the local parametric density factor (LPDF), and the local parametric density ratio (LPDR). Extensive experimental results demonstrate the effectiveness of the proposed method on simulated and real data and show significant improvement in performance compared to other local density-based methods. Performance results are reported over a wide range of parameter settings, including full and diagonal parameterizations of the covariance matrices, the neighborhood size, k, and the number of iterations, t, of the updating procedure.
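As a rough illustration of the idea of a local parametric density score, the following Python sketch scores each point by the log density of a multivariate t-distribution fitted to its k nearest neighbors. It is a simplification and should not be read as the dissertation's method: it assumes a diagonal scatter matrix, uses plain unweighted neighborhood means and variances rather than the robust locally weighted estimates, performs no updating iterations, and picks an arbitrary degrees-of-freedom value `nu`. All function names are hypothetical.

```python
import math

def knn_indices(data, i, k):
    """Indices of the k nearest neighbors of data[i] (Euclidean), excluding i itself."""
    dists = sorted((math.dist(data[i], data[j]), j)
                   for j in range(len(data)) if j != i)
    return [j for _, j in dists[:k]]

def log_t_density_diag(x, loc, var, nu):
    """Log density of a multivariate t with diagonal scatter `var` and `nu` degrees of freedom."""
    d = len(x)
    maha = sum((xi - m) ** 2 / v for xi, m, v in zip(x, loc, var))
    log_norm = (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
                - 0.5 * d * math.log(nu * math.pi)
                - 0.5 * sum(math.log(v) for v in var))
    return log_norm - 0.5 * (nu + d) * math.log1p(maha / nu)

def local_log_density(data, k=5, nu=3.0, eps=1e-6):
    """Score each point by the t log density fitted to its k nearest neighbors.

    Assumption: unweighted mean/variance per coordinate (not the robust
    weighted estimates described in the text). Lower score = more outlying.
    """
    scores = []
    for i, x in enumerate(data):
        nbrs = [data[j] for j in knn_indices(data, i, k)]
        d = len(x)
        loc = [sum(p[c] for p in nbrs) / k for c in range(d)]
        # eps keeps the variance strictly positive for degenerate neighborhoods
        var = [sum((p[c] - loc[c]) ** 2 for p in nbrs) / k + eps for c in range(d)]
        scores.append(log_t_density_diag(x, loc, var, nu))
    return scores

# Toy data: a 5x5 grid of inliers plus one distant point.
data = [[float(a), float(b)] for a in range(5) for b in range(5)] + [[10.0, 10.0]]
scores = local_log_density(data, k=5, nu=3.0)
most_outlying = scores.index(min(scores))
```

In this sketch a lower log density means a more outlying point, so ranking the scores in ascending order gives an outlier ranking. The factor and ratio scores named above would compare a point's density with those of its neighbors; their exact definitions are not given here, so they are left out.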
To address the scalability issues of local density-based methods, a two-stage training approach is proposed. First, an ensemble of local density-based outlier scores is computed in low-dimensional subspaces of the training data and converted to inlier/outlier class labels. Then, the class labels are learned using an ensemble of classification trees that can assign class probabilities to test samples. The ensemble of supervised classifiers allows for fast computation of outlier scores in the testing phase and provides the benefit of a probabilistic interpretation of outliers. The proposed ensemble technique is applied to a large high-dimensional malware detection data set, and results show the method is capable of detecting malware under various conditions, including the proportion of malware in training, the type of local density-based method, the number of ensemble members, and the threshold parameter alpha. Results also indicate that detection rates differ significantly between malware types.
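The two-stage idea can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not the dissertation's implementation: the outlier score is a stand-in (mean distance to the k nearest neighbors) rather than a local density-based score, `alpha` is assumed to be the fraction of highest-scoring training samples labeled as outliers, and each ensemble member is a depth-1 classification tree restricted to its own subspace features. All function names are hypothetical.

```python
import math
import random

def knn_score(points, k):
    """Stand-in outlier score: mean distance to the k nearest neighbors (higher = more outlying)."""
    scores = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores

def label_outliers(scores, alpha):
    """Label roughly the top `alpha` fraction of scores as outliers (1), the rest as inliers (0)."""
    n_out = max(1, int(round(alpha * len(scores))))
    cutoff = sorted(scores, reverse=True)[n_out - 1]
    return [1 if s >= cutoff else 0 for s in scores]

def fit_stump(X, y, features):
    """Depth-1 classification tree: the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            errors = (min(sum(left), len(left) - sum(left))
                      + min(sum(right), len(right) - sum(right)))
            if best is None or errors < best[0]:
                best = (errors, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no valid split: fall back to the overall outlier fraction
        p = sum(y) / len(y)
        return (features[0], math.inf, p, p)
    return best[1:]

def train_two_stage(X, n_members=10, subspace_dim=2, k=5, alpha=0.05, seed=0):
    """Stage 1: score and label in random subspaces. Stage 2: fit one stump per label set."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        feats = rng.sample(range(len(X[0])), subspace_dim)
        sub = [[row[f] for f in feats] for row in X]
        labels = label_outliers(knn_score(sub, k), alpha)
        members.append(fit_stump(X, labels, feats))
    return members

def predict_outlier_proba(members, x):
    """Average the leaf outlier fractions over all ensemble members."""
    probs = [(p_left if x[f] <= t else p_right) for f, t, p_left, p_right in members]
    return sum(probs) / len(probs)

# Toy data: 40 inliers in [-1, 1]^4 plus two distant points.
X = [[math.sin(i * (c + 1)) for c in range(4)] for i in range(40)] + [[8.0] * 4, [9.0] * 4]
members = train_two_stage(X, n_members=5, subspace_dim=2, k=5, alpha=0.05, seed=1)
p_far = predict_outlier_proba(members, [8.5] * 4)
p_near = predict_outlier_proba(members, [0.0] * 4)
```

The point of the design is visible in `predict_outlier_proba`: scoring a test sample needs only a handful of threshold comparisons and no nearest-neighbor search over the training data, which is where the speedup in the testing phase comes from.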
Lastly, an R package called ldbod is developed for local density-based outlier detection. It implements the proposed local parametric density-based outlier detection method along with all competing techniques in this dissertation.