Video representation for fine-grained action recognition




Zhou, Yang




Recently, fine-grained action analysis has attracted considerable research interest due to its potential applications in smart homes, medical surveillance, daily-living assistance, and child/elderly care, where action videos are captured indoors with fixed cameras. Although background motion (one of the main challenges for general action recognition) is more controlled in this setting, fine-grained action recognition remains very challenging due to large intra-class variability, small inter-class variability, a large variety of action categories, complex motions, and complicated interactions. Fine-grained actions, especially manipulation sequences, involve a large number of interactions between hands and objects; therefore, modeling the interactions between human hands and objects (i.e., context) plays an important role in action representation and recognition. We propose to discover the objects manipulated by humans by modeling which objects are being manipulated and how they are being operated. First, we propose a representation and classification pipeline that seamlessly incorporates localized semantic information into every processing step of fine-grained action recognition. In the feature extraction stage, we explore the geometric relations between local motion features and the surrounding objects. In the feature encoding stage, we develop a semantic-grouped locality-constrained linear coding (SG-LLC) method that captures the joint distributions between motion and object-in-use information. Finally, we propose a semantic-aware multiple kernel learning framework (SA-MKL) that utilizes the empirical joint distribution between action and object type for more discriminative action classification. This approach can discover and model the interactions between humans and objects. However, it relies on detailed knowledge of pre-detected objects (e.g., drawers and refrigerators).
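SG-LLC builds on standard locality-constrained linear coding, which reconstructs each local descriptor from its few nearest codebook atoms under a sum-to-one constraint. The sketch below shows only the base LLC analytic solution; the semantic grouping of the codebook, along with the parameter names, is left out here and the regularizer value is an assumption.

```python
import numpy as np

def llc_encode(x, codebook, k=5, reg=1e-4):
    """Encode descriptor x over its k nearest codebook atoms (plain LLC).

    SG-LLC (the thesis's variant) would additionally restrict the
    neighbourhood search to a semantic group of atoms; this sketch
    covers only the standard analytic LLC step.
    """
    # pick the k nearest atoms as the local basis
    d = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(d)[:k]
    B = codebook[idx]                       # (k, dim) local basis
    z = B - x                               # shift basis to the descriptor
    C = z @ z.T + reg * np.eye(k)           # regularised local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                            # enforce sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w                           # sparse code over full codebook
    return code
```

Because only the k selected entries are non-zero, the resulting code is sparse and can be max-pooled over a video region in the usual bag-of-features manner.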
Thus, the performance of action recognition is constrained by that of object recognition, not to mention that object detection requires tedious human labor for annotation. Second, we propose a mid-level video representation suited to fine-grained action classification. Given an input video sequence, we densely sample a large number of spatio-temporal motion parts by combined temporal and spatial segmentation, and represent them with local motion features. The dense mid-level candidate parts are rich in localized motion information, which is crucial to fine-grained action recognition. From these candidate spatio-temporal parts, we apply an unsupervised approach to discover and learn representative part detectors for the final video representation. By utilizing the dense spatio-temporal motion parts, we highlight the human-object interactions and the delicate localized motion in local spatio-temporal sub-volumes of the video. Third, we propose a novel fine-grained action recognition pipeline based on interaction part proposal and discriminative mid-level part mining. We first generate a large number of candidate object regions using an off-the-shelf object proposal tool, e.g., BING. These object regions are then matched and tracked across frames to form a large spatio-temporal graph, based on appearance matching and the dense motion trajectories passing through them. We then propose an efficient approximate graph segmentation algorithm that partitions and filters the graph into consistent locally dense sub-graphs. These sub-graphs, which correspond to spatio-temporal sub-volumes, constitute our candidate interaction parts. Finally, we mine discriminative mid-level part detectors from features computed over the candidate interaction parts, and compute bag-of-detection scores based on a novel Max-N pooling scheme as the action representation for a video sample.
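One common reading of Max-N pooling is to aggregate, for each part detector, its top-N responses over a video's candidate parts (rather than the single maximum), which is more robust to a lone spurious detection. The sketch below assumes that reading and a simple mean over the top N; the exact scheme and the value of N in the thesis may differ.

```python
import numpy as np

def max_n_pool(scores, n=5):
    """Pool per-detector scores over a video's candidate parts.

    scores: (num_parts, num_detectors) array of detection scores.
    Returns a (num_detectors,) vector holding the mean of each
    detector's top-n responses. With n=1 this reduces to ordinary
    max pooling; averaging over the top n is an assumed variant.
    """
    n = min(n, scores.shape[0])
    # sort each column ascending, keep the last n rows (top-n scores)
    top = np.sort(scores, axis=0)[-n:]
    return top.mean(axis=0)
```

Concatenating these pooled scores over all mined detectors yields the fixed-length bag-of-detection-scores vector fed to the action classifier.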
Finally, we address the first-person (egocentric) action recognition problem, which involves numerous hand-object interactions. On one hand, we propose a novel end-to-end trainable semantic parsing network for hand segmentation. On the other hand, we propose a second end-to-end deep convolutional network that maximally exploits the contextual information among hand, foreground object, and motion for interactional foreground object detection.




Action recognition, Action videos, Fine-grained actions, Semantic information



Computer Science