A sublexical unit based hash model approach for spam detection




Zhang, Like

Journal Title

Journal ISSN

Volume Title



This research introduces an original anomaly detection approach based on a sublexical unit hash model for application level content. This approach is an advance over previous arbitrarily defined payload keyword and 1-gram frequency analysis approaches. Based on the split fovea theory in human recognition, this new approach uses a special hash function to identify groups of neighboring words. The hash frequency distribution is calculated to build the profile for a specific content type. Examples of utilizing the algorithm for detecting spam and phishing emails are illustrated in this dissertation. A brief review of network intrusion and anomaly detection will first be presented, followed by a discussion of recent research initiatives on application level anomaly detection. Previous research results for payload keyword and byte frequency based anomaly detection will also be presented. The drawback in using N-gram analysis, which has been applied in most related research efforts, is discussed at the end of chapter 2. The importance of text content analysis to application level anomaly detection will also be explained. After a background introduction of the split fovea theory in psychological research, the proposed sublexical unit hash frequency distribution based method will be presented. How human recognition theory is applied as the fundamental element for a proposed hashing algorithm will be examined followed by a demonstration of how the hashing algorithm is applied to anomaly detection. Spam email is used as the major example in this discussion. The reason spam and phishing emails are used in our experiments includes the availability of detailed experimental data and the possibility of conducting an in-depth analysis of the test data. An interesting comparison between the proposed algorithm and several popular commercial spam email filters used by Google and Yahoo is also presented. The outcome shows the benefits of the proposed approach. The last chapter provides a review of the research and explains how the previous payload keyword approach evolved into the hash model solution. The last chapter discusses the possibility of extending the hash model based anomaly detection to other areas including Unicode applications.


This item is available only to currently enrolled UTSA students, faculty or staff. To download, navigate to Log In in the top right-hand corner of this screen, then select Log in with my UTSA ID.


anomaly detection, Hash, spam detection, Split fovea theory



Computer Science