Attribute-assisted learning for web image applications
Abstract
Content-based Image Retrieval (CBIR), a technique for retrieving images from a large database of digital images based on visual content, has been studied extensively since the early 1990s. In spite of the remarkable progress made in the last two decades, CBIR remains challenging due to the semantic gap between low-level visual features and high-level semantic concepts. Recent studies have shown that visual attributes, a kind of human-nameable mid-level image concept, provide a promising route towards narrowing this gap. In this thesis proposal, we aim to (1) identify open problems and challenges in CBIR, such as image reranking, dictionary learning, and semantic-visual indexing; (2) develop effective and efficient solutions, assisted by semantic attributes, to tackle these problems; and (3) evaluate the proposed approaches on real-world multimedia benchmarks.

Firstly, we propose a novel attribute-assisted retrieval model for reranking images, which serves to boost the performance of text-based image search engines for general queries. Using classifiers trained for all the predefined attributes, we represent each image by an attribute feature consisting of the responses of these classifiers. A hypergraph is then constructed to model the relationships among all images by integrating low-level visual features with the semantic attribute features, and hypergraph ranking is performed to reorder the images; its basic principle is that visually similar images should receive similar ranking scores. A visual-attribute joint hypergraph learning approach is proposed to explore the two information sources simultaneously (a simplified sketch of the ranking step is given below). We conduct extensive experiments on 1,000 queries from the MSRA-MM V2.0 dataset, and the results demonstrate the effectiveness of the proposed attribute-assisted web image reranking method.

We also study the problem of semantic information loss in the conventional Bag-of-Visual-Words (BoW) model. Although a myriad of approaches have been investigated to narrow the semantic gap between low-level features and high-level semantics, the loss largely remains. We introduce an Attribute-aware Dictionary Learning (AttrDL) scheme that learns multiple sub-dictionaries with specific semantic meanings: the training images are divided into sets, each representing a specific attribute, and an attribute-aware sub-vocabulary is learned for each set, so the resulting sub-vocabularies are semantically more discriminative than traditional vocabularies. To obtain a semantic-aware and discriminative BoW representation from the learned sub-vocabularies, we adopt l2,1-norm regularized sparse coding and re-encode the resulting sparse representation of each image (also sketched below). Experimental results show that the proposed scheme outperforms existing algorithms in both image classification and search tasks.
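To make the hypergraph ranking step concrete, the following is a minimal sketch of score propagation on a k-nearest-neighbour hypergraph with a Zhou-style normalized propagation matrix. The function `hypergraph_rank`, the uniform hyperedge weights, and the toy data are illustrative assumptions rather than the thesis implementation, and a single feature matrix stands in for the joint visual-attribute construction.

```python
import numpy as np

def hypergraph_rank(features, y, k=3, alpha=0.9, n_iter=50):
    """Propagate initial scores y over a kNN hypergraph (Zhou-style
    normalisation): visually similar images end up with similar scores."""
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    H = np.zeros((n, n))                         # vertex-hyperedge incidence
    for e in range(n):                           # edge e = image e + its kNN
        H[np.argsort(d2[e])[:k + 1], e] = 1.0
    w = np.ones(n)                               # uniform hyperedge weights
    Dv, De = H @ w, H.sum(axis=0)                # vertex / hyperedge degrees
    Dv_isqrt = np.diag(1.0 / np.sqrt(Dv))
    Theta = Dv_isqrt @ H @ np.diag(w / De) @ H.T @ Dv_isqrt
    f = y.astype(float)
    for _ in range(n_iter):                      # f <- a*Theta*f + (1-a)*y
        f = alpha * Theta @ f + (1 - alpha) * y
    return f

# Toy usage: six images in two visual clusters; the top text-search
# result (image 0) seeds the ranking, and its cluster-mates rise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (3, 4)), rng.normal(-2, 1, (3, 4))])
y = np.array([1.0, 0, 0, 0, 0, 0])
print(np.argsort(-hypergraph_rank(X, y, k=2)))
```

In the joint setting described above, visual and attribute similarities would each contribute hyperedges to the same incidence structure, with edge weights learned rather than uniform.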
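The l2,1-regularized coding step can likewise be sketched. Below is a minimal ISTA-style solver with block soft-thresholding, assuming one coefficient group per attribute sub-vocabulary; `group_sparse_code`, the toy dictionary, and the parameter values are hypothetical stand-ins for the AttrDL pipeline.

```python
import numpy as np

def group_sparse_code(x, D, groups, lam=0.1, n_iter=200):
    """ISTA for min_z 0.5*||x - Dz||^2 + lam * sum_g ||z_g||_2: the l2,1
    penalty activates whole sub-dictionaries, so an image is encoded by a
    few attribute-specific vocabularies rather than by scattered atoms."""
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz const. of gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - D.T @ (D @ z - x) / L            # gradient step on the fit term
        for g in groups:                         # block soft-thresholding
            ng = np.linalg.norm(z[g])
            z[g] *= max(0.0, 1.0 - lam / (L * ng)) if ng > 0 else 0.0
    return z

# Toy usage: three attribute sub-vocabularies of four atoms each; a signal
# built from sub-vocabulary 0 should carry most of its energy in group 0.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 12))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
x = D[:, 1] + 0.5 * D[:, 3]
z = group_sparse_code(x, D, groups)
print([round(float(np.linalg.norm(z[g])), 3) for g in groups])
```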
Further, we investigate a semantic-visual indexing framework: existing image search pipelines are built upon either the visual space or the attribute space alone, yet both visual content and semantic attributes are rich sources of information, and the full power of mining and processing algorithms can be realized only by integrating the two. We therefore aim to mine a novel joint semantic-visual space that integrates the visual content of images with their semantic attributes. We propose a novel indexing strategy, termed Coherent Semantic-visual Indexing (CSI), in which spectral hashing produces compact binary codes of the features to boost retrieval performance (a simplified binary-coding sketch appears below). We evaluate the proposed algorithms on two datasets, where our strategy obtains competitive results. We also move one step beyond the conventional BoW model and propose an indexing strategy that uses binary descriptors directly as codebook indexes (addresses), without explicit codeword training; multiple index tables are built and checked concurrently for collisions of the same binary features (see the multi-table sketch below). The evaluation is performed on two public image datasets, DupImage and Holidays, and the experimental results demonstrate the indexing efficiency and retrieval accuracy of our approach.

Finally, we aim to transfer the success of semantic recognition in static images to action recognition in video by leveraging both static and motion-based descriptors at different stages of the semantic ladder. We examine the effects of three types of features: low-level dynamic descriptors, intermediate-level outputs of static deep architectures, and high-level static semantics. To combine such heterogeneous sources of information, we employ a scalable, discriminative SVM-based fusion strategy. Our investigation implies that static deep and semantic features are largely complementary to low-level dynamic trajectory features, and that SVM-based fusion provides the best framework for combining heterogeneous static and dynamic features at different semantic levels.
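As a rough illustration of the binary-coding stage of CSI: spectral hashing proper binarises analytical Laplacian eigenfunctions along the principal directions of the data, which the short sketch below replaces with plain sign-thresholded PCA projections. This is a simplified stand-in rather than the proposed method, and all names and data are illustrative.

```python
import numpy as np

def pca_sign_codes(X, n_bits=16):
    """Compact binary codes via sign-thresholded top PCA projections,
    a simplified stand-in for spectral hashing that still yields short,
    Hamming-rankable codes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ Vt[:n_bits].T > 0).astype(np.uint8)

# Toy usage: rank 99 database images against a query by Hamming distance.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))               # stand-in joint features
codes = pca_sign_codes(X)
q, db = codes[0], codes[1:]
print(np.argsort((q != db).sum(axis=1))[:5] + 1)  # 5 nearest neighbours
```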
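The codeword-free binary indexing idea can be sketched as follows, assuming 32-bit descriptors split into four 8-bit chunks that serve directly as table addresses; the names and parameters are illustrative assumptions, not the thesis implementation.

```python
from collections import defaultdict

BITS, TABLES = 32, 4
CHUNK = BITS // TABLES                            # 8-bit direct addresses

def build_tables(descriptors):
    """Split each 32-bit descriptor into 4 disjoint chunks; each chunk is
    used directly as a table address, so no codeword training is needed."""
    tables = [defaultdict(list) for _ in range(TABLES)]
    for img_id, code in descriptors:
        for t in range(TABLES):
            chunk = (code >> (t * CHUNK)) & ((1 << CHUNK) - 1)
            tables[t][chunk].append((img_id, code))
    return tables

def query(tables, code, radius=2):
    """By the pigeonhole principle, a descriptor within Hamming distance
    radius < TABLES of the query matches it exactly on some chunk, so
    concurrent per-table collision checks find every true neighbour."""
    hits = set()
    for t, table in enumerate(tables):
        chunk = (code >> (t * CHUNK)) & ((1 << CHUNK) - 1)
        for img_id, c in table.get(chunk, []):
            if bin(code ^ c).count("1") <= radius:  # exact Hamming filter
                hits.add(img_id)
    return sorted(hits)

# Toy usage: a one-bit-corrupted copy of image 0's descriptor still hits.
db = [(0, 0b10110010101100101011001010110010),
      (1, 0b01011101010111010101110101011101)]
tables = build_tables(db)
print(query(tables, db[0][1] ^ 0b100))            # -> [0]
```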
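The SVM-based fusion strategy can be sketched as two-stage stacking with scikit-learn: one SVM per feature channel, then a second SVM over the stacked per-channel decision values. The channel construction and data below are synthetic, and a proper setup would derive the fusion inputs by cross-validation rather than from the same training fit.

```python
import numpy as np
from sklearn.svm import SVC

def svm_late_fusion(train_channels, y_train, test_channels):
    """One linear SVM per feature channel; a second SVM is trained on the
    stacked per-channel decision values and makes the final prediction."""
    tr_scores, te_scores = [], []
    for Xtr, Xte in zip(train_channels, test_channels):
        clf = SVC(kernel="linear").fit(Xtr, y_train)
        tr_scores.append(clf.decision_function(Xtr))
        te_scores.append(clf.decision_function(Xte))
    fuser = SVC(kernel="linear").fit(np.column_stack(tr_scores), y_train)
    return fuser.predict(np.column_stack(te_scores))

# Toy usage: a "dynamic" and a "static" channel for a binary action task.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
dyn = rng.standard_normal((40, 8)) + y[:, None]        # trajectory-like
sta = rng.standard_normal((40, 16)) + 0.5 * y[:, None]  # deep/semantic-like
pred = svm_late_fusion([dyn[::2], sta[::2]], y[::2], [dyn[1::2], sta[1::2]])
print((pred == y[1::2]).mean())                   # held-out accuracy
```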