Code Element Vector Representations Through the Application of Natural Language Processing Techniques for Automation of Software Privacy Analysis

Heaps, John

The increasing popularity of mobile and web apps has prompted an increase in the collection and storage of personally identifiable information of app users, placing users at ever greater privacy risk if that information is misused or mishandled. To reduce this risk, software must comply with privacy policies, which detail how privacy information is collected, stored, and maintained. The current state of the art in determining privacy compliance relies on static analysis techniques such as model checking or pattern-based detection, but these techniques lack both automation and generalizability. They suffer from limitations such as the state explosion problem, conservatism, and the manual definition of models, specifications, or patterns. Deep learning models and approaches have been shown to help overcome similar limitations in other problem domains and may be adaptable to privacy policy and software analysis. Because privacy policies are written in natural language, deep learning natural language processing can be applied to them directly. Further, it has been asserted that code can be treated and processed like a natural language, as it exhibits behaviours and properties analogous to those of natural language. However, there are many obstacles to applying deep learning to code, such as the complex syntactic structures of code, the constant definition of new code elements, and the increased severity of the data sparsity problem. This work proposes a novel semantic learning approach that can help overcome such obstacles by taking advantage of the equivalence relationship between code elements and their declarations and definitions. The approach is implemented by creating a plugin that performs textual pre-processing and preparation unique to code, and by constructing deep learning models that produce code element vector representations for use in deep learning tasks. The models are shown to produce quality vector representations for code elements and are applied to a deep learning task to predict access of privacy information by software.

This item is available only to currently enrolled UTSA students, faculty or staff.
Deep Learning, Natural Language Processing, Privacy, Privacy compliance, Semantic learning approach
Computer Science