Soundwise: Synthetic Acoustic Signals from Video Streams for Augmented Human Perception with Deep Learning




Ghose, Sanchita




Vision and hearing are the two most essential perceptual modalities of humans. In everyday life, people must analyze enormous amounts of audio and visual information to deal with multiple multisensory events, which motivates research in audiovisual learning (AVL) using powerful artificial intelligence technologies. Learning the coherence between audio and visual signals is a challenging task; nevertheless, researchers are addressing these correlation challenges by leveraging the two modalities to improve performance on tasks previously handled with a single modality. Because sound plays a crucial role in perceiving the inherent action information of most real-world visual scenarios, auditory guidance can help a person or a device analyze surrounding events more effectively. This research focuses on synthesizing sound from natural videos that is both content-matched and temporally aligned. We propose novel visual-to-sound deep learning systems capable of serving diverse multimodal applications and enabling interactive intelligence. This work also addresses prevailing gaps in multisensory research that our proposed automatic sound generation techniques can help close across different multimodal learning application domains.
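The core idea of generating temporally aligned audio from video can be illustrated with a toy sketch: encode each video frame to an embedding, then decode one spectrogram column per frame, so the audio representation stays aligned with the visual timeline. This is a minimal illustration only; the actual systems proposed in this dissertation are deep neural networks, and all names, shapes, and the single-linear-layer stages below are illustrative assumptions, not the author's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W = 8, 16, 16   # frames, height, width (tiny illustrative video)
EMBED, MELS = 32, 20  # visual embedding size, mel bands per audio frame

# A random "video": T grayscale frames of size H x W.
video = rng.random((T, H, W))

# "Encoder" (assumed): flatten each frame and project it to an embedding.
W_enc = rng.standard_normal((H * W, EMBED)) * 0.01
frame_embeddings = video.reshape(T, -1) @ W_enc   # shape (T, EMBED)

# "Decoder" (assumed): map each per-frame embedding to one spectrogram
# column, so generated audio is temporally aligned with the frames.
W_dec = rng.standard_normal((EMBED, MELS)) * 0.01
spectrogram = frame_embeddings @ W_dec            # shape (T, MELS)

print(spectrogram.shape)  # → (8, 20): one spectrogram frame per video frame
```

Because the decoder produces exactly one audio frame per video frame, temporal alignment falls out of the architecture by construction; real systems additionally learn content correspondence from paired audiovisual training data.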




Artificial Intelligence, Computer Vision, Deep Learning, IoT (Internet of Things), Multimedia Application, Sound Generation



Electrical and Computer Engineering