Exploring the Vulnerability of the State-of-the-Art Content Moderation Image Classifiers Against Adversarial Attacks
Abstract
The goal of this research is to assess and characterize the vulnerabilities of deep-learning-based image classifiers in the context of content moderation. While prior assessments have covered adversarial attacks that occlude the offending parts of images to bypass computer-vision-based content moderation, no work has assessed the effectiveness of more sophisticated adversarial attacks that do not alter the content of the images. To this end, I study the effect of various adversarial attacks, applied at different strengths and in combination, on the classification accuracy of several state-of-the-art content moderation APIs that online social media platforms employ to classify pornographic images. The discovered weaknesses have been shared with the respective platforms so that they can address them.
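The evaluation loop described above can be sketched minimally as follows. This is an illustration only, not the thesis's actual experimental code: `classify` stands in for any content moderation API returning a score, the uniform-noise perturbation is one hypothetical attack, and all names are assumptions.

```python
import numpy as np

def perturb(image, epsilon, rng):
    """Apply additive uniform noise of magnitude epsilon (a stand-in
    for one adversarial attack), keeping pixels in the valid [0, 1] range."""
    noise = rng.uniform(-epsilon, epsilon, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def sweep_attack_strengths(image, classify, epsilons, seed=0):
    """Record the classifier's output on the same image as the
    attack strength increases, mirroring the accuracy-vs-strength
    study described in the abstract."""
    rng = np.random.default_rng(seed)
    return {eps: classify(perturb(image, eps, rng)) for eps in epsilons}

# Toy stand-in classifier: flags an image if its mean brightness exceeds 0.5.
toy_classify = lambda img: float(img.mean())

scores = sweep_attack_strengths(np.zeros((8, 8)), toy_classify, [0.0, 0.25, 0.5])
```

Combinations of attacks would compose several `perturb`-style transforms before the single `classify` call; the dictionary of scores then shows at which strength the classifier's decision flips.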