Inferring Malware Detector Metrics in the Absence of Ground-Truth
Measuring the accuracy metrics of file-scanning malware detectors is an important and intriguing problem. The problem would be relatively straightforward to solve if malware ground-truth labels were readily available. However, ground-truth labels are extremely costly to obtain in the real world, especially when a large number of malware examples are involved, as is typically the case in practice. The real-world problem is therefore: How can we measure metrics related to file-scanning malware detectors in the absence of ground-truth labels? This problem has recently started to receive attention from the research community, but our understanding remains superficial because many questions are still unaddressed. The present dissertation takes a solid step towards tackling this problem by making four contributions. The first contribution is to introduce and investigate the notion of the relative accuracy of malware detectors in the absence of ground-truth, where relative accuracy is measured on an ordinal scale. We propose an algorithm to estimate the relative accuracy of malware detectors, and we use synthetic data with known ground-truth to characterize when the algorithm produces accurate estimates. We then apply the algorithm to measure the relative accuracy of real-world malware detectors based on a real data set consisting of 10.7 million files and 62 malware detectors. The second contribution is to use the relative accuracy rankings to infer the accuracy of malware detectors, including the true positive rate and the true negative rate. To this end, we introduce the novel idea of the Bellwether detector. We validate the algorithm via synthetic data with known ground-truth and then apply it to the real-world data set. The third contribution is to enable the recovery of metrics for individual file types, as scanned by the detectors.
We use these per-file-type metrics to analyze how well various detectors classify each file type, and to analyze when individual detectors provide enough detailed data on a given type for accuracy metrics to be recovered. The fourth contribution is to combine the recovered detector metrics with the similarity matrix and to provide a method for selecting a heterogeneous set of accurate detectors. We use this selection method to illustrate how lower-rated detectors can be removed from the data set in a rigorous, well-defined fashion.