System Call Anomaly Detection in Multi-Threaded Programs
Intrusion detection systems (IDSs) are already prevalent, and support for them continues to grow as many experts recognize that intrusion prevention alone is an unrealistic goal. System calls, or syscalls, have been a popular data source for IDSs because collecting them has a low impact on system performance and they carry strong security semantics: apart from some denial-of-service (DoS) attacks, malware must use syscalls to provide any utility to the attacker. However, proposed solutions fall short in modeling, and thus protecting, complex real-world programs. In particular, they struggle with highly multi-threaded programs, especially those whose threads exhibit diverse behaviors. Motivated by this problem, this thesis takes a holistic approach by 1) improving the quality of syscall datasets, 2) refining the modeling of program behavior, and 3) introducing new anomaly detection logic that leverages the tailored model produced by the first two steps.
The first contribution is a syscall dataset collector that enables the production of custom datasets for syscall host intrusion detection system (HIDS) research and development. Because the available datasets are aging, current syscall HIDS solutions are confined to the limited characteristics those datasets provide, which limits their effectiveness on real-world programs and systems. We provide an extensible syscall dataset collector that records structural and contextual information about syscalls and lets researchers easily add their own features, so that they can develop and evaluate their systems more quickly. This dataset collector can help researchers widen the solution space for syscall HIDS.
The second contribution is a methodology for identifying thread behaviors in a complex program, which enables the construction of more tailored models in legacy syscall HIDS approaches and can also be used directly in anomaly detection. Because existing datasets contain flat, interleaved syscall patterns from simple programs, the problem of effectively modeling, and thus monitoring, complex multi-threaded programs remains largely unaddressed. Providing thread-wise sequences from more complex, multi-threaded programs is a step in the right direction. However, threads are often anonymous and do not lend themselves to easy identification. Therefore, we represent thread behaviors as graphs and cluster them as a preprocessing step, which serves as a means of thread behavior classification. Consequently, prior techniques can model the syscall patterns of multi-threaded programs more accurately, since they are now tasked with modeling more cohesive subsets of the training data. This work has implications for both real-time and offline anomaly detection.
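As a rough illustration of this preprocessing idea, the following minimal Python sketch splits an interleaved trace into thread-wise sequences, represents each thread as a syscall transition graph, and groups threads with similar graphs. The abstract does not specify the graph representation, distance measure, or clustering algorithm, so the bigram edge counts, the Jaccard distance, the greedy clustering, the `max_dist` parameter, and all function names here are assumptions chosen only for illustration.

```python
from collections import Counter, defaultdict


def split_by_thread(trace):
    """Group an interleaved list of (thread_id, syscall) pairs by thread."""
    per_thread = defaultdict(list)
    for tid, name in trace:
        per_thread[tid].append(name)
    return dict(per_thread)


def transition_graph(seq):
    """Directed graph as a Counter: edge (a, b) -> times a was followed by b."""
    return Counter(zip(seq, seq[1:]))


def jaccard_distance(g1, g2):
    """Distance between two graphs based on their edge sets (assumed metric)."""
    e1, e2 = set(g1), set(g2)
    union = e1 | e2
    if not union:
        return 0.0
    return 1.0 - len(e1 & e2) / len(union)


def cluster_threads(graphs, max_dist=0.5):
    """Greedy clustering: a thread joins the first cluster whose
    representative graph is within max_dist, else it starts a new cluster."""
    clusters = []  # list of (representative graph, member thread ids)
    for tid, graph in graphs.items():
        for rep, members in clusters:
            if jaccard_distance(rep, graph) <= max_dist:
                members.append(tid)
                break
        else:
            clusters.append((graph, [tid]))
    return [members for _, members in clusters]


# Hypothetical interleaved trace: threads 1 and 3 serve connections,
# thread 2 copies data between descriptors.
trace = [
    (1, "accept"), (2, "read"), (1, "recv"), (3, "accept"), (2, "write"),
    (1, "send"), (3, "recv"), (2, "read"), (3, "send"), (2, "write"),
    (1, "close"), (3, "close"),
]
graphs = {tid: transition_graph(seq)
          for tid, seq in split_by_thread(trace).items()}
groups = cluster_threads(graphs)
print(groups)
```

In this toy trace the two connection-handling threads end up in one group and the copying thread in another, so a downstream model would be trained on more cohesive subsets rather than on the flat, interleaved stream.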
The third contribution is an anomaly detection technique that leverages the groups of behaviors identified by the second contribution. As noted above, modeling and monitoring complex multi-threaded programs in syscall HIDS is extremely challenging because threads may exhibit different behaviors, each emitting a distinct syscall pattern. The monolithic, "one size fits all" models of previous approaches are therefore confounded by this diversity of behaviors. We present detection logic that uses the clusters capturing behavior groups, together with their respective boundary members, to automatically determine thresholds between normal and anomalous behavior. The result is an accurate, tailored detection model that is effective in monitoring multi-threaded programs and has fast training and testing times.
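One way to picture boundary-derived thresholds is the minimal Python sketch below. It is not the detection logic of the thesis, whose details the abstract does not give. It assumes each cluster is summarized by a medoid, that the threshold is the distance from the medoid to the farthest training member (the boundary member), and that a thread is flagged only if it falls outside every cluster. The edge-set representation, the Jaccard distance, and all names are hypothetical.

```python
def edges(seq):
    """Edge set of a thread's syscall transition graph (assumed representation)."""
    return frozenset(zip(seq, seq[1:]))


def distance(a, b):
    """Jaccard distance between two edge sets (assumed metric)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def fit_cluster(member_graphs):
    """Return (medoid, threshold) for one behavior cluster.

    The medoid minimizes total distance to the other members; the threshold
    is the distance from the medoid to the farthest (boundary) member.
    """
    medoid = min(member_graphs,
                 key=lambda g: sum(distance(g, o) for o in member_graphs))
    threshold = max(distance(medoid, g) for g in member_graphs)
    return medoid, threshold


def is_anomalous(graph, models):
    """Flag a thread whose graph lies outside the boundary of every cluster."""
    return all(distance(graph, medoid) > threshold
               for medoid, threshold in models)


# Hypothetical training clusters: connection workers and a logger thread.
workers = [
    edges(["accept", "recv", "send", "close"]),
    edges(["accept", "recv", "recv", "send", "close"]),
    edges(["accept", "recv", "send", "send", "close"]),
]
loggers = [
    edges(["open", "write", "write", "close"]),
    edges(["open", "write", "close"]),
]
models = [fit_cluster(workers), fit_cluster(loggers)]

normal = edges(["accept", "recv", "send", "close"])
suspicious = edges(["accept", "recv", "execve", "connect", "send"])
print(is_anomalous(normal, models), is_anomalous(suspicious, models))
```

Because each threshold comes from the training members themselves, no manual tuning per cluster is needed in this sketch, and testing a thread costs one distance computation per cluster.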