Detection and classification of obfuscated malware
The number of malicious files produced daily outpaces the current capacity of malware analysis and detection. For example, Intel Security Labs reports that during each hour in the third quarter of 2015, more than 3.5M infected files were exposed to its customers' networks, and an additional 7.4M potentially unwanted programs attempted to install or launch. The damage caused by malware attacks is also increasingly devastating, as witnessed by the recent CryptoWall malware, which has reportedly generated more than $325M in ransom payments to its perpetrators. In terms of defense, it is widely accepted that the traditional approach based on byte-string signatures is increasingly ineffective, especially against new malware samples and sophisticated variants of existing ones. New techniques are therefore needed for effective defense against malware. Motivated by this problem, the dissertation investigates three new defense techniques against malware. The first technique aims at the automatic detection of program obfuscation, which malware writers have abused as an attack strategy to help their malware evade defenses. The key idea is to extract and exploit useful information from the Control Flow Graphs (CFGs) of malware programs. Experimental results show that the new technique can detect a variety of obfuscation methods (e.g., packing, encryption, and instruction overlapping). This patent-pending technique paves the way for the two other techniques presented in the dissertation. The second technique aims at automatically determining whether a suspicious file is malicious. The suspicious file may have been identified as obfuscated by the first technique (or any technique of its kind). Machine learning methods are used to learn detection models, which are shown to be effective against both plain and obfuscated malware samples.
A key contribution of this technique is the definition and use of over 32,000 file features, covering file structure, runtime behavior, and instructions. To the best of our knowledge, this is the first effort that defines and uses such a comprehensive feature set. The third technique builds on the first to automatically identify the packers that were used to obfuscate malware programs. Signatures of malware packers and obfuscators are extracted from the CFGs of malware samples. Unlike conventional signatures, which can be evaded by simply modifying one or more bytes in a malware sample, these signatures are more difficult to evade. For example, CFG-based signatures are shown to be resilient against instruction modifications and shuffling, as a single signature is sufficient for detecting mildly polymorphic versions of the same malware. Finally, the process for extracting CFG-based signatures is also made automatic.
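
To illustrate why a signature derived from CFG structure can survive instruction modifications and block shuffling, the following is a minimal sketch and not the dissertation's actual extraction procedure, which the abstract does not specify. It assumes a CFG given as a plain dictionary from basic-block identifiers to successor lists, and it substitutes a generic Weisfeiler-Lehman-style relabeling for whatever features the dissertation uses. The function name `cfg_signature`, the `rounds` parameter, and the sample graphs are all hypothetical.

```python
import hashlib


def cfg_signature(cfg, rounds=3):
    """Hash the shape of a CFG, ignoring block names and instruction bytes.

    cfg: dict mapping a block id to a list of successor block ids
         (an assumed representation, chosen for this sketch only).
    """
    # Collect every block, including ones that appear only as successors.
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)

    # Build the predecessor lists.
    preds = {n: [] for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].append(src)

    # Initial label: in-degree and out-degree of each block.
    labels = {n: f"{len(preds[n])}:{len(cfg.get(n, []))}" for n in nodes}

    # Refine each label with the sorted labels of its neighbours.
    for _ in range(rounds):
        refined = {}
        for n in nodes:
            out_labels = sorted(labels[s] for s in cfg.get(n, []))
            in_labels = sorted(labels[p] for p in preds[n])
            raw = labels[n] + "|" + ",".join(out_labels) + "|" + ",".join(in_labels)
            refined[n] = hashlib.sha256(raw.encode()).hexdigest()[:16]
        labels = refined

    # The signature depends only on the multiset of final labels.
    joined = "".join(sorted(labels.values()))
    return hashlib.sha256(joined.encode()).hexdigest()


# A diamond-shaped CFG (if/else that rejoins).
original = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
# Same shape with blocks renamed and listed in a different order.
shuffled = {"x9": [], "x1": ["x3", "x2"], "x2": ["x9"], "x3": ["x9"]}
# A different shape: a straight line of four blocks.
linear = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}

same_shape = cfg_signature(original) == cfg_signature(shuffled)
different_shape = cfg_signature(original) != cfg_signature(linear)
```

Because the hash is computed only from how blocks connect, rewriting the instructions inside a block or reordering blocks in the file leaves it unchanged, whereas a byte-string signature would break. A structural hash of this kind can still collide for distinct graphs, so it is best read as an intuition aid rather than a usable detector.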