Attention-based Audio Driven Facial Animation




Zand, Neda




In the virtual world, a human digital twin is the digital representation of its real-world counterpart. Using these digital equivalents, the performance of products can be predicted in advance, allowing them to be designed and manufactured more efficiently. Compared to other digital twins, generating a human digital twin is more demanding because it must be believable; consequently, numerous elements must be estimated carefully in order to produce convincing facial movements. Further difficulty arises when facial animation is driven by an input signal such as audio, which turns the task into a multimodal problem spanning audio and imagery. Synthesizing facial movement is widely used in a variety of applications, such as healthcare (surgical planning, facial tissue surgical simulation, facial therapy, and prosthetics), the game industry (facial animation, real-time sequencing, and face-audio synchronization), video teleconferencing (mapping individual photographs to canonical representations of the face), and social robots (facial animation in interactive robots). This diversity of applications adds further difficulty to facial animation in computer graphics and computer vision. Modeling and animating convincing characters (2-D or 3-D, human or non-human) is the key to all of these applications. Many theoretical methods have been investigated and put into practice to produce precise animations that successfully represent human facial motion. While many of these techniques, such as key-framing or performance capture, can produce realistic facial animation, they are either time-consuming or difficult to alter. Novel methods that make use of machine learning have shown large improvements in both accuracy and time efficiency. My thesis provides a thorough review of the existing literature in this area, with a special focus on deep learning methods.
Additionally, I propose a novel attention-based deep learning model to generate audio-driven facial animation. I use an encoder-decoder architecture to encode the audio features and map them to 3-D facial movements. To this end, the encoder is built from convolutional neural networks augmented with spatial and channel attention modules. The key idea behind adding attention is to focus on the most relevant features: both spatial and channel attention help the convolutional layers emphasize informative features while suppressing less important ones. The proposed model was trained on the VOCASET dataset, and the 3-D results were visualized in Omniverse. As expected, the attention module improved lip synchronization. The results are included as a supplementary video.
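To illustrate the kind of module the abstract describes, the following is a minimal PyTorch sketch of channel and spatial attention applied to 1-D audio feature maps. This is a generic CBAM-style formulation, not the author's exact architecture; the layer sizes, reduction ratio, and 1-D layout are illustrative assumptions.

```python
# Hypothetical sketch: CBAM-style channel and spatial attention for
# (batch, channels, frames) audio feature maps. Sizes are assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Shared MLP scores each channel from pooled summaries.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (B, C, T)
        avg = self.mlp(x.mean(dim=-1))         # global average pooling
        mx = self.mlp(x.amax(dim=-1))          # global max pooling
        scale = torch.sigmoid(avg + mx)        # per-channel weights in (0, 1)
        return x * scale.unsqueeze(-1)         # reweight channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # Conv over the two pooled maps yields one weight per time step.
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, C, T)
        avg = x.mean(dim=1, keepdim=True)      # (B, 1, T)
        mx = x.amax(dim=1, keepdim=True)       # (B, 1, T)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                       # reweight time steps

# Example: attend over a batch of audio feature maps.
x = torch.randn(2, 64, 100)                    # (batch, channels, frames)
out = SpatialAttention()(ChannelAttention(64)(x))
print(out.shape)                               # torch.Size([2, 64, 100])
```

Both modules leave the feature-map shape unchanged, so they can be dropped between existing convolutional layers of an encoder without altering the rest of the network.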


This item is available only to currently enrolled UTSA students, faculty or staff.
The full text of this item is not available at this time because the author has placed this item under an embargo until August 15, 2024.


3D facial animation, computer vision, data science, deep learning, facial animation, lip synchronization



Computer Science