Facial Expression Recognition in Videos

Facial expression recognition is a challenging task that has drawn increasing attention from computer vision researchers due to its variety of applications, including human-computer interaction, medical and psychological assistance, and marketing. Facial expression is one of the most meaningful ways for human beings to express their feelings, emotions, and intentions, and it plays a significant role in human communication. Facial expression recognition also has vital applications in a variety of fields, including security, where it can help reduce crime and enhance safety, as well as in assisting psychologists and behavioural analysts, improving advertising techniques, and enhancing human-robot interaction. Although it has been actively investigated by the computer vision community over the past few decades, it remains a challenging problem.

The problem of facial expression recognition was first addressed by Darwin's studies and experiments (1872), which demonstrated that movements of facial components and the tone of speech are the two major ways human beings express common emotions when communicating. In addition, Mehrabian (1968) indicated that the facial expression of the speaker contributes 55% to the effect of the spoken message, more than the verbal part (7%) and the vocal part (38%). The face therefore tends to be the most visible channel of emotional communication. These facts make facial expression recognition a widely used scheme for measuring the emotional state of human beings.

The first attempts of the computer vision community to define the problem of facial expression recognition naturally drew on psychology, adopting its theories and conventions to design recognition systems. Although human beings use a far wider range of expressions in everyday communication than the six basic ones, Darwin was the first scientist to theorize and define the basic expressions. Humans have a universal way of expressing and understanding a set of feelings and emotions, and this set of basic emotions is divided into six: anger, disgust, fear, happiness, sadness, and surprise.

Facial expressions can be coded and defined using facial Action Units (AUs) and the Facial Action Coding System (FACS), first introduced by Ekman & Rosenberg (1997). Typically, facial AU analysis is accomplished in several steps: (i) face detection and tracking; (ii) alignment and registration; (iii) feature extraction and representation; and (iv) AU detection and expression analysis. Because of recent advances in the face tracking and alignment steps, most approaches focus on feature extraction and classification methods (Abbasnejad, Sridharan, Denman, Fookes & Lucey, 2015; Martinez & Du, 2012).
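To make the pipeline concrete, the following is a minimal sketch of how these four stages might be chained in Python; the detector, aligner, extractor, and classifier callables are hypothetical placeholders rather than any specific library's API.

```python
# Hypothetical skeleton of the four-stage AU/expression pipeline described
# above; the callables passed in are placeholders, not a real library's API.
def recognize_expressions(frames, detector, aligner, extractor, classifier):
    predictions = []
    for frame in frames:
        face = detector(frame)            # (i) face detection and tracking
        if face is None:
            continue                      # skip frames with no detected face
        aligned = aligner(frame, face)    # (ii) alignment and registration
        features = extractor(aligned)     # (iii) feature extraction/representation
        predictions.append(classifier(features))  # (iv) expression analysis
    return predictions
```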

Although expression recognition is a well-studied topic in both academia and industry, and impressive progress has been made over time, the problem has not been fully addressed in all its aspects. Expressions are complex movements of muscles and are correlated with other objects and actions in videos. Existing models therefore fail in many scenarios, owing to the insufficiency of the features extracted from video frames and to a lack of robustness. For example, many existing techniques fail to correctly distinguish the expressions "happiness" and "surprise" because of the difficulty of determining their starting and ending points in video frames. In addition, due to the complexity of video frames, most current classifiers in the field fail to adequately model the temporal dynamics among frames. Furthermore, expression recognition relies heavily on data, which requires human labour for labeling videos and generating data for better event analysis. These challenges call for the development of novel methods to address these issues in FER systems.

Automatic Facial Expression Recognition System

The following section provides information about how to recognise facial expressions. To solve the problem of facial expression recognition, two general approaches can be followed: (i) statistical FER systems and (ii) deep learning models. The statistical approach consists of three stages: face acquisition, facial expression extraction and representation, and classification. We briefly summarize the major aims and challenges of each stage in the next section.

In order to achieve a robust FER system, an understanding is required of how a typical system is designed and what steps must be taken to go from raw data (image sequences) to expressions (the desired output). The general architecture of such systems consists of stages that are widely used in almost all computer vision systems.

Darwin was the first to make assumptions about expressions and the ways humans communicate; he observed that humans share the same way of expressing and understanding a set of basic or prototypical emotions. Later, Ekman (Ekman & Rosenberg, 1997) extended the set of basic emotions to six expressions: anger, disgust, fear, happiness, sadness, and surprise. In the computer vision community, the majority of researchers model facial expressions through either categorical approaches or the Facial Action Coding System (FACS). Before reviewing research related to FER, the most important terminology can be briefly summarized as follows:

1. The Facial Action Coding System (FACS) Ekman & Rosenberg (1997): This system was developed to encode crucial parts of the face according to facial muscle movements and can characterize the facial actions that reveal individual human emotions. FACS encodes the micro-movements of specific facial muscles, called action units (AUs).

2. Basic expressions: Human expressions are categorized into seven classes: happiness, surprise, anger, sadness, fear, disgust, and neutral.

3. Facial Landmarks (FLs) Koestinger, Wohlhart, Roth & Bischof (2011): Facial landmarks are visually salient points in facial regions, such as the corners and endpoints of the nose, eyebrows, and mouth; a localization sketch is given below.
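As a sketch of how such landmarks can be localized in practice, the snippet below uses dlib's standard 68-point shape predictor; the model file must be downloaded separately, and the surrounding code is illustrative rather than part of any method discussed in this chapter.

```python
import dlib

# Illustrative landmark localization with dlib's 68-point shape predictor.
# The .dat model is dlib's standard pre-trained predictor (downloaded separately).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def get_landmarks(gray_image):
    faces = detector(gray_image, 1)          # upsample once to catch small faces
    if not faces:
        return None
    shape = predictor(gray_image, faces[0])  # 68 (x, y) points on the first face
    return [(p.x, p.y) for p in shape.parts()]
```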

Conventional FER Systems

The most distinctive attribute of conventional FER methods is that the whole system is highly dependent on manual feature engineering Huang, Chen, Lv & Wang (2019). To obtain the desired output, sequences go through several pre-processing steps, after which the feature extraction and classification methods must be chosen for the target dataset. Typically, the conventional FER procedure can be divided into three major stages.

Pre-processing

The purpose of this step is to remove from each sequence information unrelated to facial expressions, such as backgrounds, pose variations, and illumination, and overall to enhance the detection of relevant information. Pre-processing of sequences directly influences feature extraction and the performance of expression classification Huang et al. (2019). In addition, datasets differ in the number of high-quality images they contain, and some comprise colour images while others include grayscale images. The main steps of sequence pre-processing are introduced as follows:

– Face detection: The initial step in the expression recognition task is face detection. A face detector finds the position of the faces in an image and returns the coordinates of a bounding box for each of them. Sometimes algorithms are applied to detect and extract only specific facial regions, for instance the components of the face that play a significant role in expression, such as the mouth and eyes;

– Face alignment: During the face alignment process, faces are scaled, cropped, and most of the time compared with template reference points located at fixed positions in the image. Typically this process requires finding a set of facial landmarks using a landmark detection algorithm and determining the best transformation that fits the reference points, for instance to bring a face to a frontal pose. A sketch of the detection and normalization steps follows this list.
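The following is a minimal sketch of the detection and normalization steps using OpenCV's bundled Haar cascade; the detector parameters and crop size are common default choices, not tuned settings from any particular system.

```python
import cv2

# Illustrative face detection and crop normalization with OpenCV's
# bundled frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(frame, size=(96, 96)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]              # take the first detected face
    face = gray[y:y + h, x:x + w]
    return cv2.resize(face, size)      # fixed-size crop for the later stages
```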

Feature extraction and representation

The most vital step in modeling the system is face representation. Feature extraction is the process of deriving the desired information from regions of interest, and it directly influences the performance of the algorithms; it is usually the backbone of an FER system. The purpose of this step is to transform the desired regions of the face (the pixel values of a face image) into a compact and discriminative feature vector. Since one of the contributions of this work is to compare a conventional feature extraction method with a more recently developed approach, we discuss it further in this section.

Effective feature extraction is a crucial stage in facial expression recognition. In general, existing expression features can be categorized into two groups: appearance features Fasel & Luettin (2003); Zhang, Lyons, Schuster & Akamatsu (1998) and geometric features Baraniuk & Wakin (2009); Shan, Gong & McOwan (2009). Appearance features model the appearance changes of faces, such as wrinkles and furrows, by directly using pixel values. Geometric features, on the other hand, exploit the structure of shapes and the locations of facial components (e.g., eyes and mouth) to represent the face geometry.
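As an illustration of an appearance feature, the sketch below computes a uniform LBP histogram over a grayscale face crop with scikit-image; the neighbourhood parameters P and R are common choices, not prescriptions from the works cited above.

```python
import numpy as np
from skimage.feature import local_binary_pattern

# Illustrative appearance descriptor: a normalized histogram of uniform
# LBP codes computed over a grayscale face crop.
def lbp_histogram(face, P=8, R=1):
    codes = local_binary_pattern(face, P, R, method="uniform")
    n_bins = P + 2                     # P+1 uniform patterns plus one non-uniform bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist                        # compact appearance feature vector
```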

Optical Flow Method
Optical flow captures the motion of patterns on objects, surfaces, and edges between two consecutive frames of a visual scene, caused by the relative motion between the observer (camera) and the scene Kass, Witkin & Terzopoulos (1988). In Horn & Schunck (1981), the authors combine the two-dimensional velocity field with the gray scale to capture the maximum temporal dependencies. An efficient procedure for analysing temporal facial variations is presented in Yacoob & Davis (1996): the optical flow induced by facial expressions is used to identify the direction of motions, and a classifier is then applied for expression recognition. In Cohn, Zlochower, Lien & Kanade (1998), an optical flow-based approach is designed and implemented to capture emotional expressions by automatically recognising subtle changes in facial appearance. In Sánchez, Ruiz, Moreno, Montemayor, Hernández & Pantrigo (2011), the authors compare two optical flow-based facial expression recognition methods.
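As a concrete illustration, the sketch below computes dense optical flow between two consecutive grayscale frames with OpenCV's Farneback method (a standard dense-flow algorithm, not necessarily the one used in the works cited above) and reduces it to crude motion statistics.

```python
import cv2

# Illustrative dense optical flow between two consecutive grayscale frames,
# using OpenCV's Farneback method with its commonly used parameter values.
def flow_features(prev_gray, gray):
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    # Convert per-pixel (dx, dy) vectors to magnitude and direction.
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return magnitude.mean(), angle.mean()  # crude summary of motion strength/direction
```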

Table of Contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW AND APPLICATIONS OF FACIAL EXPRESSION RECOGNITION
1.1 Automatic Facial Expression Recognition System
1.2 Conventional FER Systems
1.2.1 Pre-processing
1.2.2 Feature extraction and representation
1.2.2.1 Optical Flow Method
1.2.2.2 Haar-like Feature Extraction
1.2.2.3 Gabor Feature Extraction
1.2.2.4 Local Binary Pattern Family
1.2.3 Classification
1.2.3.1 Support Vector Machine Hearst, Dumais, Osuna, Platt & Scholkopf (1998)
1.2.3.2 Naive Bayes Classifier
1.3 Deep Learning-Based FER Systems
1.3.1 Spatio-temporal Neural Network
1.3.2 Hybrid Models
1.3.3 3D CNN
1.3.4 GAN-Based Models
1.4 Datasets
1.5 Performance Metrics
1.5.1 Evaluation Methods
1.5.2 Evaluation Metrics
1.6 Chapter Summary
CHAPTER 2 STATISTICAL MODEL
2.1 Introduction
2.2 Approach
2.2.1 Image Pre-processing
2.2.2 Feature Extraction
2.2.2.1 LBP in time domain
2.2.3 Classification
2.2.3.1 Support Vector Machines Hearst et al. (1998)
2.3 Implementation
2.4 Results
2.5 Conclusion
CHAPTER 3 DEEP LEARNING FER MODEL WITH SYNTHETIC DATA
3.1 Introduction
3.2 Convolutional Neural Network
3.2.1 Convolution layer and activation function
3.2.2 Downsampling
3.2.3 Recurrent Neural Networks
3.2.4 Transfer learning
3.3 Generating Synthetic Method
3.3.1 Modeling of Faces and Expressions
3.3.2 Expression model
3.4 Conclusion
CHAPTER 4 PROPOSED MODELS AND ARCHITECTURES
4.1 Introduction
4.2 Network architectures and training process
4.3 Evaluation
4.3.1 Dataset
4.3.2 Evaluation Setting
4.3.3 Results
4.3.4 Within-dataset Evaluation
4.3.5 Cross-dataset Evaluation
4.3.6 Synthetic Model
4.3.7 Real dataset
4.3.8 Comparisons and discussions
CONCLUSION
