Efficient and Lightweight Encoder-Decoder Architectures for Semantic Segmentation
Computer Vision has benefited significantly from Deep Learning architectures such as Convolutional Neural Networks (CNNs). Semantic Segmentation, which is pixel-wise image classification, is a key area of research within Computer Vision that has likewise advanced considerably through CNNs. As stand-alone embedded computer hardware has become more capable and prominent, more applications of Semantic Segmentation have arisen. Existing technologies, namely autonomous vehicles, have been augmented, and new technologies, including augmented reality (AR) and virtual reality (VR), have emerged in recent years. The effectiveness and performance of these technologies depend on object detection and, as a result, Semantic Segmentation has become a key aspect of research and development. CNNs designed for Semantic Segmentation tasks have shown significant performance in scene understanding; however, most CNNs for these tasks require copious amounts of data as well as extensive computational resources. Consequently, current CNN-based methods for Semantic Segmentation lack the real-time processing capabilities that these technologies require when deployed on embedded systems. In addition, the amount of annotated data readily available for AR/VR technologies is limited, rendering conventional training methods ineffective. In this work, a Minimized Efficient Network (MinENet) architecture is first shown to improve the accuracy of Semantic Segmentation for an embedded AR/VR system. Second, this research presents EyeSeg, an encoder-decoder architecture designed for accurate Semantic Segmentation with sparsely annotated data and applied to similar AR/VR problems. Lastly, CitySeg, an encoder-decoder architecture augmented with convolutional long short-term memory (ConvLSTM) units, is presented as an extension of this work.
CitySeg's preliminary results showcase the capabilities of efficient lightweight architectures, which maintain accuracy on data that has significantly higher dimensionality as well as much larger class feature size. This research reports results on the OpenEDS2019 and OpenEDS2020 datasets and compares them against the related state-of-the-art approaches. Preliminary results on the CityScapes dataset are shown for the extended work of CitySeg using two different supervised training scenarios: utilizing CityScapes finely annotated training data. This work demonstrates real-time inference capabilities and accuracy, measured in terms of mean Intersection over Union (mIoU), for embedded systems with limited memory and in scenarios that provide only sparsely annotated data.
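Since accuracy is reported throughout as mean Intersection over Union (mIoU), a minimal sketch of that metric may help fix the definition. It is an illustration only, not the evaluation code used in this work: the function name `mean_iou`, the toy label maps, and the choice to skip classes that appear in neither the prediction nor the ground truth are all assumptions made for this example.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over flattened integer label maps (hypothetical helper).

    For each class c, IoU_c = |pred==c AND target==c| / |pred==c OR target==c|.
    Assumption: classes absent from both pred and target are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel agrees: it counts once in both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel disagrees: it belongs to the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: a flattened 2x3 image with three illustrative classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

Here the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. Benchmark protocols differ on how absent classes and ignored labels are handled, so reported numbers should follow the convention of the dataset in question.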