Hybrid multi-layer CNN/Aggregator feature


Introduction

This chapter addresses the problem of image classification with Part-Based Models (PBMs). Decomposing images into salient parts and aggregating them to form discriminative representations is a central topic in the computer vision literature. It raises several important questions: How to find discriminative features? How to detect them? How to organize them into a coherent model? How to model the variation in their appearance and spatial organization? Even though works such as the pictorial structure [33], the constellation model [128], object fragments [118], the Deformable Part Model [31] or the Discriminative Mode Seeking approach of [22] brought interesting contributions, as did those in [107, 53, 23], the automatic discovery and use of discriminative parts for image classification remains a difficult and open question.
Recent PBMs for image classification, e.g., [22, 107, 53, 85], rely on five key components: (i) the generation of a large pool of candidate regions per image from (annotated) training data; (ii) the mining of the most discriminative and representative regions from the pool of candidate parts; (iii) the learning of part classifiers using the mined parts; (iv) the definition of a part-based image model aggregating (independently) the learnt parts across a pool of candidate parts per image; (v) the learning of final image classifiers over part-based representations of training images.
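To make the chaining of these five components concrete, the toy sketch below wires them together in Python. The descriptors, the mining rule and the classifiers are deliberately simplistic stand-ins (random features, a norm-based selection, cosine similarities and a least-squares classifier) and do not reproduce any of the cited methods; only the structure of the pipeline is meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def candidate_regions(n_regions=50, dim=64):
    """(i) A pool of candidate region descriptors for one image (toy: random)."""
    return rng.normal(size=(n_regions, dim))

def mine_parts(pools, n_parts=5):
    """(ii) Toy 'mining': keep the regions with the largest norm as part prototypes."""
    all_regions = np.vstack(pools)
    idx = np.argsort(np.linalg.norm(all_regions, axis=1))[-n_parts:]
    return all_regions[idx]

def part_responses(prototypes, pool):
    """(iii) Toy part classifiers: cosine similarity of each region to each prototype."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    r = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return r @ p.T                               # shape (n_regions, n_parts)

def aggregate(responses):
    """(iv) Part-based image model: average pooling of responses over regions."""
    return responses.mean(axis=0)                # shape (n_parts,)

# (v) Final classifier: least squares on the part-based representations.
pools  = [candidate_regions() for _ in range(100)]
labels = rng.integers(0, 2, size=100) * 2 - 1    # toy binary labels in {-1, +1}
protos = mine_parts(pools)
X = np.stack([aggregate(part_responses(protos, p)) for p in pools])
w, *_ = np.linalg.lstsq(X, labels, rcond=None)
print("training accuracy on toy data:", np.mean(np.sign(X @ w) == labels))
```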
One key challenge in the second and third components of PBMs lies in the selection of discriminative regions and the learning of interdependent part classifiers: one cannot learn the part classifiers before knowing the discriminative regions, and vice versa. Extensive work has been done to alleviate the problem of identifying discriminative regions within a huge pool of candidate regions, e.g., [53, 23, 22].
Once the discriminative regions are discovered and the part classifiers are subsequently trained, the fourth component of a PBM, i.e., the construction of the image model based on per-image part presence, is basically obtained by average or sum pooling of the part classifier responses across the pool of candidate regions in the image. The final classifiers are then learnt on top of this part-based image representation. Although the aforementioned methods address one of the key components of PBMs, i.e., mining discriminative regions with heuristics designed to improve the final classification, they fail to leverage the advantage of jointly learning all the components together.
Jointly learning all components of a PBM is indeed particularly appealing, since the discriminative regions are then explicitly optimized for the targeted task. But intertwining all components makes the problem highly non-convex and initialization-critical. The recent works of Lobel et al. [73] and Parizi et al. [85] showed that the joint learning of a PBM is possible. However, these approaches suffer from several limitations. First, their intermediate part classifiers are simple linear classifiers, whose expressive power is limited when it comes to capturing complex patterns in regions. Furthermore, they pool the part classifier responses over candidate regions per image using max pooling, which is suboptimal [47]. Finally, as the objective function is non-convex, they rely on a strong initialization of the parts.
In the present work, we propose a novel framework, coined “Soft Pooling of Learned Parts” (SPLeaP), to jointly optimize all five components of the proposed PBM. A first contribution is that we describe each part classifier as a linear combination of weak non-linear classifiers, learned greedily and resulting in a strong classifier which is non-linear. This greedy approach is inspired by [77, 34], where gradient descent is used to choose linear combinations of weak classifiers. The complexity of the part detector is increased along with the construction of the image model, so that the classifier is eventually able to better capture the complex patterns in regions. A second contribution is that we softly aggregate the computed part classifier responses over all the candidate regions per image. We introduce a parameter, referred to as the “pooling parameter”, for each part classifier independently inside the optimization process. The value of this pooling parameter determines the softness level of the aggregation performed over all candidate regions, with higher softness levels approaching sum pooling and lower softness levels resembling max pooling. This permits leveraging different pooling regimes for different part classifiers. It also offers an interesting way to relax the assignment between regions and parts and lessens the need for a strong initialization of the parts. The outputs of all part classifiers are fed to the final classifiers, driven by the classifier loss objective.
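As a minimal sketch of the kind of per-part soft aggregation described above, the function below pools one part classifier's per-region scores with a softmax weighting whose temperature plays the role of the pooling parameter. The exact parameterization used in SPLeaP may differ, and the region scores are made up for illustration.

```python
import numpy as np

def soft_pool(scores, beta):
    """Softmax-weighted pooling of one part classifier's responses
    over the candidate regions of an image.

    beta -> 0   : uniform weights, i.e. average (sum-like) pooling
    beta -> inf : the best-scoring region dominates, i.e. max pooling
    """
    s = np.asarray(scores, dtype=float)
    w = np.exp(beta * (s - s.max()))   # numerically stabilized softmax weights
    w /= w.sum()
    return float(w @ s)

responses = [0.1, 0.4, 2.0, 0.3]        # made-up per-region scores for one part
print(soft_pool(responses, beta=0.0))   # 0.7, the plain average
print(soft_pool(responses, beta=50.0))  # ~2.0, essentially the max
```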
The proposed PBM can be applied to various visual recognition problems, such as the classification of objects, scenes or actions in still images. In addition, our approach is agnostic to the low-level description of image regions and can easily benefit from the powerful features delivered by modern Convolutional Neural Networks (CNNs). By relying on such representations, and outperforming [79, 10], the proposed approach can also be seen as a low-cost adaptation mechanism: pre-trained CNN features are fed to a mid-to-high-level model that is trained for a new target task. To validate this adaptation scheme we use the pre-trained CNNs of [10]. Note that this network is not fine-tuned on the target datasets.
We validated our method on three challenging datasets: Pascal-VOC-2007 (objects), MIT-Indoor-67 (scenes) and Willow (actions). We improve over state-of-the-art PBMs on all three of them.

The remainder of this chapter is organized as follows: the next section reviews related works, the following one describes the algorithm proposed to jointly optimize the parameters, while the last section contains the experimental validation of our work.

Related works

Most of the recent advances in image classification have concentrated on the development of novel Convolutional Neural Networks (CNNs), motivated by the excellent performance obtained by Krizhevsky et al. [59]. As CNNs require huge amounts of training data (e.g., ImageNet) and are expensive to train, authors such as Razavian et al. [90] showed that the descriptors produced by CNNs pre-trained on a large dataset are generic enough to perform well on many classification tasks over diverse small datasets, at a reduced training cost. Oquab et al. [79] and Chatfield et al. [10] were the first to leverage the benefits of fine-tuning pre-trained CNNs on new datasets such as Pascal-VOC-2007 [27]. Oquab et al. [79] reused the weights of the initial layers of a CNN pre-trained on ImageNet and added two new adaptation layers. They trained these two new layers using multi-scale overlapping regions from Pascal-VOC-2007 training images, using the provided bounding box annotations. Chatfield et al. [10], on the other hand, fine-tuned the whole network on the new datasets, which involved intensive computations due to the large number of network parameters to be estimated. They reported state-of-the-art performance on Pascal-VOC-2007 by fine-tuning a pre-trained CNN architecture.
In line with many other authors, [101, 10] utilized the penultimate layer of CNNs to obtain global descriptors of images. However, it has been observed that computing and aggregating local descriptors on multiple regions described by pre-trained CNNs provides an even better image representation and improves classification performance.
Methods such as Gong et al. [40], Kulkarni et al. [60] and Cimpoi et al. [13] relied on such aggregation using standard pooling techniques, namely VLAD, Bag-of-Words and Fisher vectors, respectively.
On the other hand, the Part-Based Models (PBMs) proposed in the recent literature, e.g., [107, 22, 53, 23], can be seen as more powerful aggregators than [40, 50, 60]. PBMs attempt to select a few relevant patterns or discriminative regions and focus on them during aggregation, making the image representation more robust to occlusions or to frequent non-discriminative background regions.
PBMs differ in the way they discover discriminative parts and combine them into a unique description of the image. The Deformable Part Model proposed by Felzenszwalb et al. [31] addresses these problems by selecting discriminative regions that have significant overlap with the bounding box location. The association between regions and parts is done through the estimation of latent variables, i.e., the positions of the regions w.r.t. the position of the root part of the model. Differently, Singh et al. [107] aimed at discovering a set of relevant patches by considering patches that are representative and frequent enough while also being discriminative w.r.t. the rest of the visual world. The problem is formulated as an unsupervised discriminative clustering problem on a huge dataset of image patches, optimized by an iterative procedure alternating between clustering and training discriminative classifiers. More recently, Juneja et al. [53] also aimed at discovering distinctive parts for an object or scene class by first identifying likely discriminative regions using low-level segmentation cues, and then, in a second step, learning part classifiers on top of these regions. The two steps are alternated iteratively until a convergence criterion based on Entropy-Rank is satisfied. Doersch et al. [22] used density-based mean-shift algorithms to discover discriminative regions. Starting from a weakly-labeled image collection, coherent patch clusters that are maximally discriminative with respect to the labels are produced, requiring a single pass through the data.
Contrasting with previous approaches, Li et al. [70] were among the first to rely on CNN activations as region descriptors. Their approach discovers the discriminative regions using association rule mining techniques, well known in the data mining community. Sicre et al. [104] also build on CNN-encoded regions, introducing an algorithm that models image categories as collections of automatically discovered distinctive parts. Parts are matched across images while their visual models are learned, and are finally pooled to provide image signatures.
One common characteristic of the aforementioned approaches is that they discover the discriminative parts first and then combine them into a model of the classes to recognize. There is therefore no guarantee that the parts learned in this way are optimal for the classification task. Lobel et al. [73] showed that the joint learning of part and category models is possible. More recently, Parizi et al. [85] built on the same idea, using max pooling and l1/l2 regularization.
Various authors have likewise studied learned soft-pooling mechanisms. Gulcehre et al. [43] investigate the effect of using generalized soft pooling as a non-linear activation unit, bearing some similarity to the maxout non-linear unit of [42]. In contrast, our method uses a generalized soft-pooling strategy as a downsampling layer. Our method is close to that of Lee et al. [66], who use a linear interpolation of max and average pooling. Our approach, on the other hand, uses a non-linear interpolation between these two extremes.
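To make the contrast concrete, here is a small, self-contained comparison between a linear mix of max and average pooling (in the spirit of Lee et al. [66]) and a softmax-weighted pooling whose temperature interpolates non-linearly between the same two extremes. The parameter names and the scores are illustrative, not taken from either paper.

```python
import numpy as np

def mixed_pool(scores, a):
    """Linear interpolation between max and average pooling, a in [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return a * s.max() + (1.0 - a) * s.mean()

def soft_pool(scores, beta):
    """Non-linear interpolation: softmax-weighted average, beta >= 0."""
    s = np.asarray(scores, dtype=float)
    w = np.exp(beta * (s - s.max()))
    w /= w.sum()
    return float(w @ s)

s = [0.1, 0.4, 2.0, 0.3]        # toy per-region scores
print(mixed_pool(s, a=0.5))     # 1.35: exactly halfway between mean and max
print(soft_pool(s, beta=1.0))   # ~1.40: intermediate regime, but weighted
                                # toward high-scoring regions rather than a fixed blend
```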

Table of contents

0.1 Acknowledgement
0.2 Résumé
0.3 Summary in English
1 General Introduction
1.1 Context
1.2 Objectives of this thesis
1.3 Contributions
1.4 Flow of thesis
2 Review of the related work
2.1 Basic setup of supervised image classification
2.2 Traditional image representation
2.3 Convolutional Neural Networks (CNNs)
2.4 Discovering discriminative regions
2.5 Linear classifiers
2.6 Dataset used in this thesis
3 Transfer Learning via Attributes
3.1 Introduction
3.2 Related Works
3.3 Approach
3.4 Experimental results
3.5 Discussion and conclusion
4 Hybrid multi-layer CNN/Aggregator feature
4.1 Introduction
4.2 Background
4.3 A hybrid CNN/Aggregator feature
4.4 Results
4.5 Conclusion
5 Max-Margin, Single-Layer Adaptation
5.1 Introduction
5.2 Proposed approach
5.3 Results
6 Learning the Structure of Deep Architectures
6.1 Introduction
6.2 Background
6.3 Learning the structure of deep architectures
6.4 Results
6.5 Conclusion
7 SPLeaP: Soft Pooling of Learned Parts
7.1 Introduction
7.2 Related works
7.3 Proposed Approach
7.4 Optimization specific details
7.5 Results
7.6 Qualitative Analysis
7.7 Conclusions
8 SPLeaP with Per-Part Latent Scale Selection
8.1 Introduction
8.2 Related Work
8.3 Proposed Approach
8.4 Optimization specific details
8.5 Results
9 Summary and Conclusion
9.1 Summary
9.2 Conclusion
A Annexes
A.1 Appendix for Chapter 8
A.2 Publications and Patents
