The Proposed Approach – A Synthetic plus Variational Model

Still-to-Video Face Recognition Systems

In a still-to-video FR scenario, one or more still images are typically available to enroll an individual in the system, while a set of video frames is available for recognition. Given one or a few reference stills, a still-to-video FR system seeks to detect the presence of target individuals enrolled in the system over a network of surveillance cameras. In recent years, a few specialized approaches for still-to-video FR have been proposed in the literature. Bashbaghi et al. (2017b) proposed a robust still-to-video FR system based on multiple face representations. In this work, various feature extraction techniques are applied to face patches isolated in the single reference sample, generating multiple face representations that make the system robust to nuisance factors commonly found in video surveillance applications.
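To make the idea concrete, the following minimal sketch (in Python, using scikit-image) splits a single reference ROI into patches and computes two descriptors per patch. The choice of HOG and LBP descriptors, the patch size, and the random stand-in image are illustrative assumptions, not necessarily the exact representations used by Bashbaghi et al. (2017b).

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

rng = np.random.default_rng(0)
still = (rng.random((96, 96)) * 255).astype(np.uint8)  # stand-in reference still ROI

def multiple_representations(roi, patch=48):
    """Split the reference ROI into patches and describe each patch with
    two descriptors (HOG and a uniform-LBP histogram), yielding one face
    representation per patch/descriptor pair."""
    reps = {}
    for i in range(0, roi.shape[0], patch):
        for j in range(0, roi.shape[1], patch):
            p = roi[i:i + patch, j:j + patch]
            lbp = local_binary_pattern(p, P=8, R=1, method="uniform")
            hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
            reps[(i, j)] = {"hog": hog(p), "lbp": hist}
    return reps

reps = multiple_representations(still)
print(f"{len(reps)} patches x {len(next(iter(reps.values())))} descriptors each")
```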

An individual-specific ensemble of exemplar-SVM classifiers was proposed by Bashbaghi et al. (2017a) to develop a domain-adaptive still-to-video FR system that is more robust to intra-class variations. Parchami et al. (2017c) developed an accurate still-to-video FR approach from a SSPP based on a deep supervised autoencoder that can represent the divergence between the source and target domains. The autoencoder network is trained using a novel weighted pixel-wise loss function that is specialized for SSPP problems and allows the reconstruction of high-quality canonical ROIs for matching. Parchami et al. (2017a) presented an efficient network for still-to-video FR from a single reference still based on cross-correlation matching and triplet-loss optimization that provides discriminant face representations. The matching pipeline exploits a matrix Hadamard product followed by a fully connected layer, inspired by adaptive weighted cross-correlation.
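The sketch below illustrates this matching step under simple assumptions: two face embeddings are combined by an element-wise (Hadamard) product, and a fully connected layer maps the correlated features to a similarity score. The embedding size and the untrained weights are placeholders rather than the trained network of Parchami et al. (2017a).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                    # hypothetical embedding dimension

def match_score(feat_still, feat_video, W, b):
    """Element-wise (Hadamard) product correlates the still and video
    embeddings; a fully connected layer then maps the correlated
    features to a single match probability via a sigmoid."""
    h = feat_still * feat_video            # adaptive weighted cross-correlation idea
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))

W, b = 0.01 * rng.normal(size=d), 0.0      # untrained placeholder FC weights
score = match_score(rng.normal(size=d), rng.normal(size=d), W, b)
print(f"match probability: {score:.3f}")
```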

Parchami et al. (2017b) introduced an ensemble of CNNs named HaarNet for still-to-video FR, where a trunk network first extracts features from the global appearance of the facial ROIs. Then, three branch networks effectively embed asymmetrical and complex facial features based on Haar-like features. Dewan et al. (2016) exploited adaptive appearance model tracking for still-to-video FR to gradually learn a track-face model for each individual appearing in the scene. The models are matched over successive frames against the reference still images of each target individual enrolled in the system, and the matching scores are accumulated over several frames for robust spatio-temporal recognition. Migneault et al. (2018) considered adaptive visual trackers for still-to-video FR to regroup faces (based on appearance and temporal coherency) that correspond to the same individual captured along a trajectory, and thereby learn diverse appearance models on-line. Mokhayeri et al. (2015) designed a practical still-to-video FR system for video surveillance applications that benefits from face synthesis, where the synthetic images are produced based on camera-specific capture conditions.
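As a simple illustration of the spatio-temporal accumulation used in such tracking-based systems (e.g., Dewan et al. (2016)), the sketch below averages hypothetical per-frame matching scores along a face trajectory and accepts the target identity only once the accumulated evidence exceeds a decision threshold; the scores and threshold are illustrative.

```python
import numpy as np

def accumulate_scores(frame_scores, threshold=0.5):
    """Average per-frame matching scores along a face trajectory and
    accept the target identity once the accumulated evidence exceeds
    the decision threshold."""
    acc = np.cumsum(frame_scores) / np.arange(1, len(frame_scores) + 1)
    return acc, bool(acc[-1] >= threshold)

frame_scores = [0.42, 0.61, 0.55, 0.70, 0.66]   # hypothetical per-frame scores
acc, accepted = accumulate_scores(frame_scores)
print(np.round(acc, 2), accepted)               # running means, final decision
```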

3D Morphable Model

A common approach to synthetic face generation is to reconstruct the 3D model of a face from its 2D face image. As a classic statistical model of 3D facial shape and texture, the 3D Morphable Model (3DMM) is widely used to reconstruct a 3D face from a single 2D face image and thereby synthesize new face images (Blanz & Vetter (2003)). The algorithm builds a morphable model from 3D scans and fits the model to 2D images for 3D shape and texture reconstruction. The 3DMM rests on two key ideas. First, all faces are in dense point-to-point correspondence, which is usually established on a set of example faces in a registration procedure and then maintained throughout any further processing steps. Second, facial shape and color are separated from each other and disentangled from external factors such as illumination and camera parameters. The morphable model may involve a statistical model of the distribution of faces, which was a principal component analysis in the original work and has been replaced by other learning techniques in subsequent work. Over the past decade, several extensions of the 3DMM have been presented for 3D face reconstruction.
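The linear 3DMM can be summarized compactly: a new face shape is generated as the mean shape plus a linear combination of principal components, with coefficients drawn from a Gaussian prior (and similarly for texture). The sketch below illustrates this with a randomly initialized basis standing in for components learned from registered 3D scans; all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; a real 3DMM uses tens of thousands of registered vertices.
n_vertices, k = 1000, 80                 # vertices in dense correspondence, PCA components

s_mean = rng.normal(size=3 * n_vertices)           # mean shape (x, y, z stacked)
U_s = rng.normal(size=(3 * n_vertices, k))         # stand-in shape principal components
sigma = np.linspace(3.0, 0.1, k)                   # per-component standard deviations

def sample_shape():
    """Draw coefficients from the Gaussian PCA prior and reconstruct a
    3D face shape: s = s_mean + U_s @ (sigma * alpha), alpha ~ N(0, I)."""
    alpha = rng.normal(size=k)
    return s_mean + U_s @ (sigma * alpha)

face = sample_shape().reshape(n_vertices, 3)       # one synthetic 3D face shape
print(face.shape)                                  # (1000, 3)
```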

Zhang & Samaras (2006) proposed a 3D spherical harmonic basis morphable model that integrates spherical harmonics into the 3DMM framework. More recently, Koppen et al. (2018) extended the 3DMM by adopting a shared covariance structure to mitigate the small-sample estimation problems associated with high-dimensional data. It models the global population as a mixture of Gaussian sub-populations, each with its own mean. Gecer et al. (2019) revisited the original 3DMM fitting, using non-linear optimization to find the optimal latent parameters that best reconstruct the test image. They optimized the parameters under the supervision of pre-trained deep identity features through an end-to-end differentiable framework. Despite the significant success of 3DMM-based techniques for 3D face modeling, they often fail to represent small details, since these are not spanned by the principal components. An alternative line of work combines CNNs with the 3DMM for 3D face modeling. Embedding 3D morphable basis functions into deep neural networks opens great potential for models with better representational power, which is superior in capturing a higher level of detail. Tran et al. (2019) improved the nonlinear 3DMM in both learning objective and architecture by resolving the conflicting-objective problem, learning shape and albedo proxies with proper regularization.

Their novel pairing scheme allows learning both detailed shape and albedo without sacrificing one for the other. Tran et al. (2017a) employed a CNN to regress 3DMM shape and texture parameters directly from an input image, without an optimization process that renders the face and compares it to the image. Richardson et al. (2017) presented a face reconstruction technique from a single image by introducing an end-to-end CNN framework with a novel rendering layer that allows back-propagation from a rendered depth map to the 3DMM model. In the same line, Tewari et al. (2017) proposed a CNN regression-based approach for face reconstruction, where a single forward pass of the network estimates a much more complete face model, including pose, shape, expression, and illumination, at high quality. Due to the type and amount of training data, as well as the linear bases, the representational power of the 3DMM can be limited.

To address these problems, Tran & Liu (2018) proposed an innovative framework to learn a nonlinear 3DMM from a large set of in-the-wild face images, without collecting 3D face scans. Specifically, given a face image as input, a network encoder estimates the projection, lighting, shape, and albedo parameters. Two decoders serve as the nonlinear 3DMM, mapping the shape and albedo parameters to the 3D shape and albedo, respectively (a minimal sketch of this layout is given below). Although their results are encouraging, the synthetic face images may not be realistic enough to represent the intra-class variations of target-domain capture conditions. The synthetic images generated in this way are highly correlated with the original facial stills from enrolment, and there is typically a domain shift between the distribution of synthetic faces and that of faces captured in the target domain, which poses a domain adaptation problem. FR models naively trained on such synthetic images often fail to generalize well when matched against real images captured in the target domain. Producing realistic synthetic face images while preserving their identity information remains an ill-posed problem.
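The sketch below gives a toy version of the encoder/two-decoder layout described above. The 6-parameter projection, the 27 spherical-harmonics lighting coefficients, the code sizes, and the MLP layers are illustrative assumptions in place of the CNN architecture of Tran & Liu (2018).

```python
import torch
import torch.nn as nn

class Nonlinear3DMM(nn.Module):
    """Toy encoder/two-decoder layout: the encoder regresses projection,
    lighting, shape, and albedo parameters from an image; two decoders
    map the shape and albedo codes to a dense 3D shape and albedo."""

    def __init__(self, d_shape=160, d_albedo=160, n_vertices=1000):
        super().__init__()
        self.d_shape = d_shape
        self.encoder = nn.Sequential(                       # MLP stand-in for the CNN encoder
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, 6 + 27 + d_shape + d_albedo))    # pose + SH lighting + codes
        self.shape_dec = nn.Sequential(nn.Linear(d_shape, 512), nn.ReLU(),
                                       nn.Linear(512, 3 * n_vertices))
        self.albedo_dec = nn.Sequential(nn.Linear(d_albedo, 512), nn.ReLU(),
                                        nn.Linear(512, 3 * n_vertices))

    def forward(self, img):
        z = self.encoder(img)
        pose, light = z[:, :6], z[:, 6:33]                  # 6-DoF pose, 27 SH coefficients
        f_shape = z[:, 33:33 + self.d_shape]
        f_albedo = z[:, 33 + self.d_shape:]
        return pose, light, self.shape_dec(f_shape), self.albedo_dec(f_albedo)

model = Nonlinear3DMM()
pose, light, shape, albedo = model(torch.randn(1, 3, 64, 64))
print(shape.shape, albedo.shape)                            # torch.Size([1, 3000]) each
```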

Generative Adversarial Network

Recently, Generative Adversarial Networks (GANs) have shown promising performance in generating realistic images (Gonzalez-Garcia et al. (2018); Choi et al. (2018); Chen & Koltun (2017)). A GAN is a framework for producing a model distribution that mimics a given target distribution; it consists of a generator that produces the model distribution and a discriminator that distinguishes the model distribution from the target. The idea is to train the generator and the discriminator in turn, with the goal of reducing the difference between the model distribution and the target distribution as measured by the best discriminator possible at each step of the training (Goodfellow et al. (2014)).
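For reference, this adversarial game corresponds to the minimax objective of Goodfellow et al. (2014):

$$\min_G \max_D \; V(D,G) \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] \;+\; \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D(G(z))\right)\right],$$

where the discriminator D is trained to tell real samples x from generated samples G(z), and the generator G is trained to fool it.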

Benefiting from GANs, Shen et al. (2018) proposed FaceID-GAN, which generates photorealistic, identity-preserving faces; a classifier competes with the generator by distinguishing the identities of real and synthesized faces, so as to preserve the identity of the original images. Gecer et al. (2018) proposed a novel end-to-end semi-supervised adversarial framework to generate photorealistic face images of new identities with a wide range of expressions, poses, and illuminations, conditioned on synthetic images sampled from a 3DMM. Huang et al. (2017a) proposed TP-GAN for photorealistic frontal view synthesis that simultaneously perceives global structures and local details. They made the problem well constrained by introducing a combination of adversarial, symmetry, and identity-preserving losses. The combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity-preserving inference of frontal views from profiles. Wang et al. (2018b) proposed a GAN variant for face aging in which a conditional GAN module generates a face that looks realistic and matches the target age, an identity-preserving module preserves the identity information, and an age classifier forces the generated face to exhibit the target age. Tewari et al. (2017) proposed a novel model-based deep convolutional autoencoder for 3D face reconstruction from a single in-the-wild color image that combines a convolutional encoder network with a model-based face reconstruction model. In this way, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. WGAN is a more recent technique that employs an integral probability metric based on the Earth Mover's distance, rather than the Jensen–Shannon divergence used by the original GAN (Arjovsky et al. (2017)).
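In its dual (Kantorovich–Rubinstein) form, the Earth Mover's (Wasserstein-1) distance optimized by WGAN is

$$W(p_{\mathrm{data}}, p_g) \;=\; \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[f(x)\right] \;-\; \mathbb{E}_{x \sim p_g}\!\left[f(x)\right],$$

where the supremum is taken over 1-Lipschitz critics f; in practice the critic is a neural network whose Lipschitz constant is controlled, e.g., by weight clipping in the original WGAN.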

BEGAN builds upon WGAN, using an autoencoder-based equilibrium-enforcing technique alongside the Wasserstein distance to stabilize the training of the discriminator (Berthelot et al. (2017)). Difficulty in controlling the output of the generator is a challenging issue in GAN-based face synthesis models. To reduce this gap, conditional GANs have been proposed, which leverage conditional information in the generative and discriminative networks for conditional image synthesis. Tran et al. (2018) used pose codes in conjunction with random noise vectors as inputs to the generator, with the goal of generating a face of the same identity with the target pose in order to fool the discriminator. Hu et al. (2018) introduced a coupled-agent discriminator that forms a mask image to guide the generator during the learning process. Mokhayeri et al. (2019b) proposed a controllable GAN that introduces an additional adversarial game as a third player, competing with the generator to preserve specific attributes, and thereby providing control over the face generation process. Despite the success of GANs in generating realistic images, they still struggle to learn the complex underlying modalities of a given dataset, which can result in poor-quality generated images.
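A minimal sketch of this conditioning mechanism is given below: a one-hot pose code is concatenated with the noise vector at the generator input, so the condition steers the generated face toward the target pose. The number of pose bins, the layer sizes, and the MLP generator are illustrative assumptions, not the architecture of Tran et al. (2018).

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: the pose code c is concatenated with
    the noise vector z, so the condition steers the generated image."""

    def __init__(self, d_noise=100, d_cond=13, d_out=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_noise + d_cond, 512), nn.ReLU(),
            nn.Linear(512, d_out), nn.Tanh())               # image values in [-1, 1]

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=1))

G = ConditionalGenerator()
z = torch.randn(8, 100)                                     # random noise vectors
c = torch.eye(13)[torch.randint(0, 13, (8,))]               # one-hot pose codes (13 bins)
fake = G(z, c).view(8, 3, 64, 64)                           # a batch of conditioned faces
print(fake.shape)
```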

Table of Contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Still-to-Video Face Recognition Systems
1.1.1 Challenges
1.2 Data Augmentation
1.2.1 Face Synthesis
1.2.1.1 3D Morphable Model
1.2.1.2 Generative Adversarial Network
1.2.2 Generic Learning
1.3 Deep Face Recognition
1.4 Summary
CHAPTER 2 DOMAIN-SPECIFIC FACE SYNTHESIS FOR VIDEO FACE RECOGNITION FROM A SINGLE SAMPLE PER PERSON
2.1 Introduction
2.2 Related Work
2.2.1 Multiple Face Representations
2.2.2 Generic Learning
2.2.3 Synthetic Face Generation
2.3 Domain-Specific Face Synthesis
2.3.1 Characterizing the Capture Conditions
2.3.1.1 Estimation of Head Pose
2.3.1.2 Luminance-Contrast Distortion
2.3.1.3 Representative Selection
2.3.2 Face Synthesis
2.3.2.1 Intrinsic Image Decomposition
2.3.2.2 3D Face Reconstruction
2.3.2.3 Illumination Transferring
2.4 Domain-invariant Face Recognition with DSFS
2.5 Experimental Methodology
2.5.1 Databases
2.5.2 Experimental protocol
2.5.3 Performance Measures
2.6 Results and Discussion
2.6.1 Face Synthesis
2.6.1.1 Frontal View
2.6.1.2 Profile View
2.6.2 Face Recognition
2.6.2.1 Pose Variations
2.6.2.2 Mixed Pose and Illumination Variations
2.6.2.3 Impact of Representative Selection
2.6.3 Comparison with Reference Techniques
2.6.3.1 Generic Set Dimension
2.7 Conclusions
CHAPTER 3 A PAIRED SPARSE REPRESENTATION MODEL FOR ROBUST FACE RECOGNITION FROM A SINGLE SAMPLE
3.1 Introduction
3.2 Background on Sparse Coding
3.2.1 Sparse Representation-based Classification
3.2.2 SRC through Generic Learning
3.3 The Proposed Approach – A Synthetic plus Variational Model
3.3.1 Dictionary Design
3.3.2 Synthetic Plus Variational Encoding
3.4 Face Recognition with the S+V Model
3.5 Experimental Methodology
3.5.1 Datasets
3.5.2 Protocol and Performance Measures
3.6 Results and Discussion
3.6.1 Synthetic Face Generation
3.6.2 Impact of Number of Synthetic Images
3.6.3 Impact of Camera Viewpoint
3.6.4 Impact of Feature Representations
3.6.5 Comparison with State-of-the-Art Methods
3.6.6 Ablation Study
3.6.7 Complexity Analysis
3.7 Conclusion
CHAPTER 4 VIDEO FACE RECOGNITION USING SIAMESE NETWORKS WITH BLOCK-SPARSITY MATCHING
4.1 Introduction
4.2 Related Work
4.2.1 Deep Siamese Networks for Face Recognition
4.2.2 Face Synthesis
4.2.3 Sparse Representation-based Classification
4.3 The SiamSRC Network
4.3.1 Notation
4.3.2 Representative Selection
4.3.3 Face Synthesis
4.3.4 Block-Sparsity Matching
4.4 Experimental Methodology
4.4.1 Datasets
4.4.2 Protocol and Performance Measures
4.5 Results and Discussion
4.5.1 Synthetic Face Generation
4.5.2 Impact of Number of Synthetic Images
4.5.3 Comparison with State-of-the-Art Methods
4.5.4 Ablation Study
4.6 Conclusion
CONCLUSION AND RECOMMENDATIONS
LIST OF PUBLICATIONS
APPENDIX I CROSS-DOMAIN FACE SYNTHESIS USING A CONTROLLABLE GENERATIVE ADVERSARIAL NETWORK
BIBLIOGRAPHY

 
