Augmenting ensemble of early and late fusion using context gating

Emotion Representation

Models of emotion are typically divided into two main groups, namely discrete (or categorical) and dimensional (continuous) models (Stevenson et al. (2007)). Discrete models are built around particular sets of emotional categories deemed fundamental and universal. Ekman (1992) proposes to classify the human facial expressions resulting from emotion into six basic classes (happiness, sadness, anger, disgust, surprise and fear), selected because they have unambiguous meaning across cultures. In contrast, dimensional models consider emotions to be composed of a small number of continuous dimensions (typically two or three). The two most commonly used dimensions are valence (how positive or negative the subject appears) and arousal (the level of excitation). Mehrabian (1996) adds a third dimension, dominance, which reflects the degree of control exerted by a stimulus. Russell (1980) suggests that all of Ekman's emotions (Ekman (1992)), as well as compound emotions, can be mapped onto the circumplex model of affect. Furthermore, this two-dimensional approach allows a more precise specification of the emotional state, in particular by taking its intensity into account. This relationship is shown visually in Figure 1.1.

Several large databases of face images have been collected and annotated according to the emotional state of the person. The RECOLA database (Ringeval et al. (2013)) was recorded to study socio-affective behaviors from multimodal data in the context of remote collaborative work, for the development of computer-mediated communication tools. In addition to the recordings, six annotators measured emotion continuously along two dimensions, arousal and valence, as well as social-behavior labels along five dimensions. SFEW (Dhall et al. (2011)), FER-2013 (Goodfellow et al. (2013)) and RAF (Li et al. (2017)) provide in-the-wild images annotated with basic emotions; AFEW (Dhall et al. (2012)) is a dynamic temporal facial-expression corpus of clips extracted from movies in close-to-real-world conditions and annotated with discrete emotions.

Problem Statement

To define the emotion recognition problem, we first define the pattern recognition problem in general. A pattern is a representative signature of the data by which we can take actions such as classifying the data into different categories (Bishop (2006)). Pattern recognition refers to the automatic discovery of regularities in data through the use of computer algorithms. A pattern is often represented by a vector of data features. Given this general definition, we can define the task of recognizing emotion as discovering regularities in the psychological and physical state of the human. In general, an emotion recognition system can be described by three fundamental steps, namely Pre-Processing, Feature Extraction, and Decision. Figure 1.2 provides a general scheme for pattern recognition. In emotion recognition tasks, we usually deal with raw data such as raw video inputs or static images. Since the emotional state of the human mind is expressed in different modes, including facial expression, voice, gesture, posture, and biopotential signals, the raw data also carries unnecessary information. This extra information not only confuses the model but can also lead to suboptimal results. In the Pre-Processing step, we therefore extract useful cues from the raw data before applying the subsequent steps.

For example, facial expression–based emotion recognition requires the extraction of a bounding box around the face. The Feature Extraction process transforms the raw features into a new space of variables in which the pattern is expected to be easier to recognize. The main goal of the Decision component is to map the Feature Extraction results to the designated output space. For the emotion recognition task, we saw in the previous section that there is more than one representation of emotion. For the categorical representation, which contains a finite number of discrete classes, the Decision module simply classifies the extracted features into one of those classes, whereas for the dimensional representation the Decision module performs a regression task and outputs a continuous value. Next, we define the task of emotion recognition in a learning-algorithm framework. The focus of the emotion recognition task is the problem of prediction: given a sample of training examples $(x_1, y_1), \dots, (x_n, y_n)$ drawn from $\mathbb{R}^d \times \{1, \dots, K\}$, the model learns a predictor $h_\theta : \mathbb{R}^d \to \{1, \dots, K\}$, defined by parameters $\theta$, that is used to predict the label $y$ of a new point $x$ unseen during training. Each input feature $x_i$ has $d$ dimensions and each label $y_i$ belongs to one of $K$ classes.
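As a minimal illustration of this three-stage decomposition, the sketch below wires a Pre-Processing step (cropping the face region), a Feature Extraction backbone, and a Decision head playing the role of the predictor $h_\theta$. The crop size, backbone, and number of classes are hypothetical choices for illustration only, not the models proposed later in this thesis.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Toy pipeline: Pre-Processing -> Feature Extraction -> Decision."""

    def __init__(self, feature_dim=512, num_classes=7):
        super().__init__()
        # Feature Extraction: any backbone mapping a face crop to a feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feature_dim), nn.ReLU(),
        )
        # Decision: classification head over K discrete emotion categories.
        self.classifier = nn.Linear(feature_dim, num_classes)

    def preprocess(self, frame, box):
        # Pre-Processing: keep only the face region (box = (x1, y1, x2, y2)),
        # discarding background pixels that carry unnecessary information.
        x1, y1, x2, y2 = box
        crop = frame[:, :, y1:y2, x1:x2]
        return nn.functional.interpolate(crop, size=(64, 64))

    def forward(self, frame, box):
        x = self.preprocess(frame, box)
        features = self.backbone(x)       # new representation of the input
        return self.classifier(features)  # scores for each emotion class
```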

The predictor $h_\theta$ is commonly chosen from some function class $\mathcal{H}$, such as neural networks with a certain architecture, and is optimized with empirical risk minimization (ERM) or one of its variants. In ERM, the predictor is a function $h_\theta \in \mathcal{H}$ that minimizes the empirical (or training) risk $\frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;\theta), y_i) + \Omega(\theta)$, where $\Omega(\theta)$ is a regularizer over the model parameters and $\ell$ is a loss function; in the case of classification, $\ell$ is the cross-entropy loss $\ell(\hat{y}, y) = -y \log \hat{y}$, where $\hat{y}$ is the value predicted by the model. The goal of machine learning is to find a predictor $h_\theta$ that performs well on new data, unseen during training. To study performance on new data (known as generalization), we typically assume the training examples are sampled randomly from a probability distribution $P$ over $\mathbb{R}^d \times \{1, \dots, K\}$ and evaluate $h_\theta$ on a new test example $(x, y)$ drawn independently from $P$. The challenge stems from the mismatch between minimizing the empirical risk $\frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i;\theta), y_i) + \Omega(\theta)$ (the explicit goal of ERM algorithms, i.e., optimization) and minimizing the true (or test) risk $\mathbb{E}_{(x,y)\sim P}\left[\ell(h(x;\theta), y)\right]$ (the goal of machine learning). Intuitively, when the number of samples in the empirical risk is large, it approximates the true risk well. So far, we have defined the task of emotion recognition as a classification problem. As described in Section 1.1, emotion can also be modeled on a continuous scale; in such cases, we usually cast the task as a regression problem. The regression task has two major differences compared to the classification task. First, the learning algorithm outputs a predictor $h_\theta : \mathbb{R}^d \to \mathbb{R}$, defined by parameters $\theta$, that predicts a numerical value for a given input. Second, the loss function that is usually minimized is the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$. The same architectures and learning procedures apply to both tasks.
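As a concrete sketch of the two objectives, the snippet below computes the empirical risk with the cross-entropy loss for the categorical formulation and with the squared loss for the dimensional (regression) formulation. The linear predictors, dimensions, and weight-decay regularizer $\Omega(\theta)$ are illustrative assumptions, not the models used in later chapters.

```python
import torch
import torch.nn as nn

d, K, n = 32, 7, 128                 # input dim, number of classes, number of samples
model_cls = nn.Linear(d, K)          # h_theta for discrete emotion classification
model_reg = nn.Linear(d, 1)          # h_theta for valence/arousal regression

x = torch.randn(n, d)                # toy inputs
y_cls = torch.randint(0, K, (n,))    # discrete emotion labels
y_reg = torch.rand(n, 1) * 2 - 1     # continuous labels in [-1, 1]

# Empirical risk: (1/n) * sum of per-sample losses, plus a regularizer Omega(theta)
# (here an L2 penalty on the classifier weights).
ce = nn.CrossEntropyLoss()(model_cls(x), y_cls)
mse = nn.MSELoss()(model_reg(x), y_reg)
omega = 1e-4 * sum(p.pow(2).sum() for p in model_cls.parameters())

risk_classification = ce + omega     # objective minimized by ERM for discrete emotions
risk_regression = mse                # squared loss for dimensional emotions
```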

Deep Learning

In this work, we use a special class of machine learning models, called deep learning methods, to approximate a predictor for the emotion recognition task. We begin by describing deep feedforward networks, also known as feedforward neural networks or multilayer perceptrons (MLPs) (Goodfellow et al. (2016)). These models are called feedforward because there are no feedback connections through which outputs of the model are fed back into itself; information flows directly from the input through the intermediate layers to the output. The goal of a feedforward network is to approximate some unknown function $f^*$. A feedforward network learns a mapping $y = f(x;\theta)$ by optimizing the values of the parameters $\theta$ so that $f$ best describes the true underlying function $f^*$. A deep neural network is usually represented as the composition of multiple nonlinear functions, so $f(x)$ can be expressed in the form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. Each $f^{(i)}$ is a layer of the network; in this example, $f^{(1)}$ is called the first layer of the network, and so on. The depth of the network is given by the overall length of the chain. Consider some input $x \in \mathbb{R}^N$ and its corresponding output $h \in \mathbb{R}^M$; the equation for one layer of the network is

$h = g(W^T x + b) \qquad (1.1)$

where the layer first linearly transforms the input to a new representation using the weight matrix $W \in \mathbb{R}^{N \times M}$ and bias vector $b \in \mathbb{R}^M$, and then applies a nonlinear function $g$, often the rectified linear unit (ReLU).
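A minimal numerical sketch of Equation 1.1, assuming small, arbitrary layer sizes chosen purely for illustration:

```python
import torch

N, M = 4, 3                    # input and output sizes of one layer (illustrative)
x = torch.randn(N)             # input x in R^N
W = torch.randn(N, M)          # weight matrix W in R^{N x M}
b = torch.randn(M)             # bias vector b in R^M

# Equation 1.1: linearly transform the input, then apply the nonlinearity g (ReLU).
h = torch.relu(W.T @ x + b)
print(h.shape)                 # torch.Size([3]), i.e. h in R^M
```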

The premise of a feedforward network is that, at the end of the learning process, each layer captures a meaningful aspect of the data, and by combining these aspects the model can make a decision. Figure 1.3 provides a general view of one layer of a neural network. Among the many neural network architectures, two specialized architectures are most common in the field: Convolutional Neural Networks (CNNs) (LeCun et al. (1998)) and Recurrent Neural Networks (RNNs) (Rumelhart et al. (1988)). Convolutional neural networks are a specialized kind of neural network suited to data with a grid-like topology. Examples include images, which can be seen as a 2-D grid of pixels, and time series, which can be thought of as a 1-D grid of values acquired over time. A convolutional network uses a convolution operation in place of the affine transformation in the network layers. Instead of a full weight matrix for each layer, a convolutional network takes advantage of the grid-like topology of the data and defines a set of weights as a filter (or kernel) that is convolved with the input. The way the kernels are defined is especially important: the number of parameters can be dramatically reduced by making two reasonable assumptions. The first is parameter sharing, which states that a feature detector (such as a vertical-edge detector in the case of image data) that is useful in one part of an image is probably useful in other parts of the image as well. The second is sparse connectivity, which is achieved by making the kernel smaller than the input; as a result, each output value in a hidden layer depends only on a few input values. Reducing the number of parameters helps when training with smaller training sets and makes the model less prone to overfitting. Figure 1.4 shows the general architecture of a convolutional neural network for an image input.
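To make the parameter-reduction argument concrete, the comparison below counts the parameters of a fully connected layer and of a convolutional layer applied to the same input; the image size, channel counts, and kernel size are arbitrary illustrative values.

```python
import torch.nn as nn

H = W = 64                                   # illustrative image size
in_ch, out_ch, k = 3, 16, 3                  # channels and kernel size

# Fully connected: every output unit is connected to every input pixel.
fc = nn.Linear(in_ch * H * W, out_ch * H * W)

# Convolutional: one small kernel is shared across all spatial positions
# (parameter sharing), and each output depends only on a k x k neighbourhood
# of the input (sparse connectivity).
conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # ~2.4 billion parameters (12288 * 196608 weights + 196608 biases)
print(count(conv))  # 448 parameters (3 * 16 * 3 * 3 weights + 16 biases)
```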

Table of Contents

INTRODUCTION
CHAPTER 1 BACKGROUND
1.1 Emotion Representation
1.2 Problem Statement
1.3 Deep Learning
1.4 Network Training: Optimization
1.5 Related Work
1.5.1 Emotion Recognition
1.5.2 Attention and Sequence Modeling
1.5.3 Multimodal Learning
CHAPTER 2 EMOTION RECOGNITION WITH SPATIAL ATTENTION AND TEMPORAL SOFTMAX POOLING
2.1 Introduction
2.2 Proposed Model
2.2.1 Local Feature Extraction
2.2.2 Spatial Attention
2.2.3 Temporal Pooling
2.3 Experiments
2.3.1 Data Preparation
2.3.2 Training Details
2.3.3 Spatial Attention
2.3.4 Temporal Pooling
2.4 Conclusion
CHAPTER 3 AUGMENTING ENSEMBLE OF EARLY AND LATE FUSION USING CONTEXT GATING
3.1 Introduction
3.2 Prerequisite
3.2.1 Notations
3.2.2 Early Fusion
3.2.3 Late Fusion
3.3 Proposed Model
3.3.1 Augmented Ensemble Network
3.3.2 Context Gating
3.4 Experiments
3.4.1 Youtube-8M v2 Dataset
3.4.2 RECOLA Dataset
3.4.3 Visualization
3.5 Conclusion
CONCLUSION AND RECOMMENDATIONS
LIST OF REFERENCES
