Addressing Privacy Challenges in Modern Audio-Video Communication Systems and Applications
Abstract
Catalyzed by advances in communication and Internet technology, web-based audio-video calling has become a mainstream method of remote communication. The trend received a further boost from the COVID-19 pandemic, during which audio-video calls through applications such as Skype and Zoom became the default medium for professionals to confer remotely and for students to attend lectures from home. In addition, modern Virtual Reality (VR) devices and systems take this one step further by enabling applications that allow users to remotely co-locate and communicate in the same virtual space or world. Despite their popularity and utility, the audio, video and sensor data made available by these modern communication systems and applications can contain sensitive information about the participants, their surroundings, or their current activity and context, and can present significant user-privacy challenges if not appropriately protected. This dissertation studies and demonstrates the feasibility of several novel user-privacy threats in popular web-based video-calling (e.g., Skype and Zoom) and virtual reality (e.g., VR Chat) applications, and proposes novel mitigation and protection measures against these threats. Specifically, this dissertation has made the following three significant contributions: (i) First, the dissertation investigated whether an adversary at one end of an online video call can infer potentially sensitive information about the participant at the other end that is not trivially visible or audible in the call. More specifically, the dissertation evaluated the feasibility of inferring a target user's keystrokes on a traditional QWERTY keyboard solely by observing their video feed in a video-calling application. This was accomplished by modeling commonly observed typing behaviors during a video call and using them to construct a novel video-based keystroke and typing detection framework. 
A text inference framework then uses the keystrokes detected from the video to predict the words most likely typed by the target user. The proposed keystroke/typing detection and text inference frameworks were then empirically evaluated using data collected from a large number of human subject participants in several practical settings and scenarios. Finally, multiple techniques to mitigate such keystroke inference attacks from video calls were also proposed and evaluated. (ii) Second, a popular privacy feature in online video calls, the virtual background or background filter, was extensively studied to understand how effectively it protects users' actual backgrounds and the sensitive information therein. For that, a novel background reconstruction framework, which reconstructs the real background in a video call that has a virtual background blended in, was first designed. Then, a thorough investigative analysis of the virtual background feature was carried out by using the real background (partially) reconstructed by this framework to mount four different privacy attacks, namely location inference, specific object tracking, generic object inference, and text inference. Finally, the performance of the proposed inference frameworks was empirically verified using video call data collected from real human subject participants (in a variety of settings and parameters) and prerecorded videos collected in the wild. As before, several mitigation strategies were also proposed and evaluated. (iii) The third aspect of this dissertation research focuses on investigating the privacy of user identities in virtual reality (VR) applications, where users may be recorded by an adversary while using worn, out-of-band motion sensors, such as smartwatches and smartphones. To address this issue, we record video of the users' virtual avatars to capture their movements in the VR application. 
We then use existing machine learning activity classifiers to classify the video and motion data, and correlate the two data streams using the Hamming distance and the Spearman rank correlation coefficient. However, we found that the time complexity of our correlation algorithm is impractical for large-scale applications. To overcome this challenge, this dissertation presents an optimized correlation algorithm that balances speed and accuracy while remaining feasible for large-scale data sets.
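
As a rough illustration of the two correlation measures named above, the following Python sketch compares a stream derived from avatar video with a stream derived from a motion sensor. It is not the dissertation's algorithm: the abstract does not say which quantities are compared, so the sketch assumes that the Hamming distance is taken over per-window activity labels and that the Spearman coefficient is taken over a per-window numeric motion score. The function names and the sample data are hypothetical.

```python
from math import sqrt


def hamming_distance(labels_a, labels_b):
    """Count the positions at which two equal-length label sequences differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b))


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if std_x == 0 or std_y == 0:
        # The coefficient is undefined for a constant sequence; returning 0.0
        # here is a choice made for this sketch only.
        return 0.0
    return cov / (std_x * std_y)


# Hypothetical per-window activity labels predicted from each stream.
video_labels = ["walk", "walk", "wave", "idle", "wave"]
sensor_labels = ["walk", "wave", "wave", "idle", "wave"]

# Hypothetical per-window motion scores from each stream.
video_motion = [0.1, 0.4, 0.9, 0.2, 0.7]
sensor_motion = [0.2, 0.5, 0.8, 0.1, 0.9]

print(hamming_distance(video_labels, sensor_labels))
print(round(spearman(video_motion, sensor_motion), 3))
```

Under these assumptions, a small Hamming distance together with a high Spearman coefficient would suggest that the two streams belong to the same user. Comparing every video stream against every sensor stream in this way grows with the product of the two collection sizes, which is one plausible reading of the scalability concern raised above.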