Facial Image Analysis Techniques & Related Processing Fundamentals

Many video encoders perform motion analysis over video sequences in search of motion information that aids compression. The concept of motion vectors, conceived alongside the first video coding techniques, is intimately related to motion analysis. These early analysis techniques help to regenerate video sequences as exact or approximate reproductions of the original frames by using motion compensation from neighboring pictures. They can compensate for, but not understand, the actions of the objects moving in the video; therefore they cannot restore an object's movements from a different point of view, or place them in a three-dimensional scenario. Current research trends focus on new ways of communicating through visual tools that permit more human interaction, for instance the use of 3D to create virtual teleconference rooms. As noted above, traditional motion analysis techniques are not sufficient to provide the information needed for these applications.

Faces play an essential role in human communication. Consequently, they were among the first objects whose motion was studied, either to recreate animation on synthesized models or to interpret motion for later use.

Each of the modules may be more or less complex depending on the purpose of the analysis (i.e., from the understanding of general behavior to exact 3D-motion extraction). If the analysis is intended for later facial expression animation, the type of Facial Animation synthesis often determines the methodology used during expression analysis. Some systems skip the first or the last stage, while others blend these stages into the main motion & expression image analysis. Systems lacking the pre-motion analysis step are most likely to be limited by environmental constraints such as special lighting conditions or a pre-determined head pose. Systems that do not perform motion interpretation do not aim to deliver any specific information for subsequent face animation synthesis. A system meant to analyze video and generate face animation data in a robust and efficient manner comprises all three modules: pre-motion analysis, motion & expression image analysis, and motion interpretation. The approaches currently under research, which are reviewed in this section, clearly perform the facial motion and expression image analysis and, to some extent, the motion interpretation needed for animation. Nevertheless, many of them lack a strong pre-motion analysis step that would ensure robustness during the subsequent analysis.

Pre-processing techniques

The conditions under which the user is recorded can change from one system to another, and from one moment to the next. Some changes come from the hardware, for instance the camera or the lighting environment. Furthermore, even if only one camera is used, we cannot presuppose that the speaker's head will remain motionless and facing that camera at all times. Pre-processing techniques must therefore help to homogenize the analysis conditions before studying non-rigid face motion; for this reason we also include head detection and pose determination techniques in this group.

Camera calibration
Accurate motion retrieval is highly dependent on the precision of the image data we analyze. Images recorded by a camera undergo different visual deformations due to the nature of the acquisition equipment. Camera calibration can be seen as the starting point of a precise analysis. If we want to express motion in real space, we must relate the motion measured in pixel coordinates to real/virtual world coordinates; that is, we need to relate the image reference frame to the world reference frame. Simply knowing the separation between pixels in an image does not allow us to determine the distance between those points in the real world. We must derive equations that link the world reference frame to the image reference frame in order to find the relationship between the coordinates of points in 3D space and the coordinates of their projections in the image. In Appendix I-A we describe the basics of camera calibration. The developed methods can be classified into two groups: photogrammetric calibration and self-calibration. We refer the reader to (Zhang, 2000) and (Luong & Faugeras, 1997) for examples and further details about these approaches. Although camera calibration is mostly used in Shape From Motion systems, above all when accurate 3D data is used to generate 3D mesh models from video sequences of static objects, it is a desirable step for face analysis techniques that aim at motion accuracy.
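As a concrete illustration of the world-to-image relationship described above, the short sketch below projects a 3D world point onto the image plane with a pinhole camera model, using an intrinsic matrix K and extrinsic rotation/translation (R, t). The numerical values are placeholder assumptions chosen for clarity, not calibration results from any cited work.

```python
import numpy as np

# Pinhole projection: a world point X_w maps to pixel coordinates via
# x ~ K [R | t] X_w (homogeneous coordinates). All values are illustrative.

K = np.array([[800.0,   0.0, 320.0],   # fx, skew, cx (pixels)
              [  0.0, 800.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])  # intrinsic matrix

R = np.eye(3)                          # camera aligned with the world frame
t = np.array([0.0, 0.0, 1000.0])       # camera 1 m in front of the origin (mm)

def project(X_w):
    """Project a 3D world point (mm) to pixel coordinates."""
    X_c = R @ X_w + t                  # world frame -> camera frame
    x, y, z = X_c
    return np.array([K[0, 0] * x / z + K[0, 2],
                     K[1, 1] * y / z + K[1, 2]])

# A point 50 mm to the right of the world origin:
print(project(np.array([50.0, 0.0, 0.0])))   # -> [360. 240.]
```

Calibration is precisely the problem of estimating K (and the lens distortion) so that this mapping matches what the real camera produces.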

Illumination analysis and compensation
Other unknown parameters during face analysis are the lighting characteristics of the environment in which the user is being filmed. The number, origin, nature and intensity of the light sources in the scene can easily transform the appearance of a face. Face reflectance is not uniform over the face and is thus very difficult to model. Appendix I-B contains information about the nature of light and one of the most commonly used surface reflectance models. Because of the difficulty of deducing the large number of parameters and variables that lighting models involve, some assumptions must be made. One common hypothesis is to consider faces as Lambertian surfaces (reflecting only diffuse light), so as to reduce the complexity of the illumination model. Using this hypothesis, Luong, Fua and Leclerc (2002) study the lighting conditions of faces in order to obtain texture images for realistic head synthesis from video sequences. Other reflectance models are also used (Debevec et al., 2000), although they focus more on reproducing natural lighting on synthetic surfaces than on understanding the consequences of the lighting on the surface itself. In most cases, the analysis of motion and expressions on faces is more concerned with the effect of illumination on the facial surface under study than with an overall understanding of the lighting characteristics. A fairly widespread way to assess the effect of lighting on a face is to try to reproduce it synthetically on a realistic 3D model of the user's head. Whether it is used to compensate the 3D model texture (Eisert & Girod, 2002) or to light the 3D model that assists the analysis (Valente & Dugelay, 2001), it proves to be a reasonable way to control how the lighting modifies the appearance of the face in the image.
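To make the Lambertian hypothesis concrete, the minimal sketch below shades a surface point from its albedo, its normal, and a single directional light. It is a generic illustration of diffuse-only reflectance, not the specific model used in the cited works.

```python
import numpy as np

def lambertian_shading(albedo, normal, light_dir, light_intensity=1.0):
    """Diffuse-only (Lambertian) shading: I = albedo * intensity * max(0, n . l)."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return albedo * light_intensity * max(0.0, float(n @ l))

# A skin patch facing the camera, lit from the upper right (illustrative values):
print(lambertian_shading(albedo=0.6,
                         normal=np.array([0.0, 0.0, 1.0]),
                         light_dir=np.array([1.0, 1.0, 1.0])))
```

Because the reflected intensity depends only on the angle between the normal and the light, and not on the viewing direction, this assumption greatly reduces the number of unknowns when analyzing or compensating face illumination.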

Head detection and pose determination
If we intend to perform robust expression and face motion analysis, it is important to control the location of the face on the image plane, and it is also crucial to determine the orientation of the face with respect to the camera. The find-a-face problem is generally reduced to the detection of skin in the image. The most widespread methods for skin detection use a probabilistic approach in which the colorimetric characteristics of human skin are taken into account. First, a probability density function P(rgb | skin) is usually built for a given color space (RGB, YUV, HSV, or others); it indicates the probability that a given color belongs to a skin surface. It is difficult both to build this function and to decide which threshold to use to determine whether the current pixel belongs to the skin or not. Some approaches (Jones & Rehg, 1999) study the color models in detail and also provide a probability function for the pixels that do not belong to the skin, P(rgb | ¬skin). Others, like the one presented by Sahbi, Geman and Boujemaa (2002), perform their detection in several stages, refining the result at each step. More complex algorithms (Garcia & Tziritas, 1999) allow regions with non-homogeneous skin color characteristics to be found.
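A minimal sketch of this kind of probabilistic skin detection is given below: each pixel is classified by comparing the likelihood ratio P(rgb | skin) / P(rgb | ¬skin) against a threshold, with the two likelihoods looked up in pre-computed color histograms. The histogram construction, the bin count and the threshold value are assumptions made for illustration, not the exact models of the cited papers.

```python
import numpy as np

BINS = 32                      # histogram bins per RGB channel (illustrative)
THRESHOLD = 1.0                # likelihood-ratio threshold (illustrative)

def build_histogram(pixels):
    """Normalized 3D color histogram, used here as P(rgb | class)."""
    hist, _ = np.histogramdd(pixels, bins=(BINS,) * 3, range=((0, 256),) * 3)
    return hist / max(hist.sum(), 1)

def skin_mask(image, skin_hist, nonskin_hist, eps=1e-9):
    """Per-pixel skin classification by likelihood ratio."""
    idx = (image // (256 // BINS)).astype(int)            # H x W x 3 bin indices
    p_skin = skin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    p_nonskin = nonskin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    return (p_skin / (p_nonskin + eps)) > THRESHOLD       # boolean skin mask

# Toy usage with random "training" pixels and a random test image:
skin_hist = build_histogram(np.random.randint(0, 256, (1000, 3)))
nonskin_hist = build_histogram(np.random.randint(0, 256, (1000, 3)))
image = np.random.randint(0, 256, (120, 160, 3))
print(skin_mask(image, skin_hist, nonskin_hist).mean())   # fraction labeled skin
```

In practice the histograms are trained on large sets of labeled skin and non-skin pixels, and the threshold trades off false detections against missed skin regions.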

Determining the exact orientation of the head is a more complicated task. In general, there are two ways to derive the head pose: static methods and dynamic approaches. Static methods search for specific features of the face (eyes, lip corners, nostrils, etc.) on a frame-by-frame basis, and determine the user's head orientation by finding the correspondences between the projected coordinates of these features and their real-world coordinates. They may use template-matching techniques to find the specific features, as Nikolaidis and Pitas (2000) do. This method works well, although it requires very accurate spotting of the relevant features; unfortunately, this has to be redone at each frame, which is tedious and imprecise. Another possibility is to use 3D data, for instance from a generic 3D head model, to accurately determine the pose of the head in the image. This is the solution given by Shimizu, Zhang, Akamatsu and Deguchi (1998).
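When calibrated intrinsics and a few 2D-3D feature correspondences are available, this static pose estimation can be framed as a Perspective-n-Point problem. The sketch below uses OpenCV's generic solvePnP solver for illustration; the model points, image detections and intrinsic matrix are made-up values, and this is not the specific method of the cited papers.

```python
import numpy as np
import cv2

# 3D feature positions on a generic head model (mm, model frame) -- illustrative.
model_points = np.array([
    [  0.0,   0.0,   0.0],   # nose tip
    [  0.0, -63.0, -12.0],   # chin
    [-34.0,  32.0, -26.0],   # right eye outer corner
    [ 34.0,  32.0, -26.0],   # left eye outer corner
    [-26.0, -30.0, -20.0],   # right mouth corner
    [ 26.0, -30.0, -20.0],   # left mouth corner
], dtype=np.float64)

# Corresponding 2D detections in the current frame (pixels) -- illustrative.
image_points = np.array([
    [320.0, 240.0], [318.0, 300.0], [285.0, 210.0],
    [352.0, 208.0], [295.0, 268.0], [344.0, 266.0],
], dtype=np.float64)

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist_coeffs = np.zeros(4)            # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(model_points, image_points, K, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)           # rotation matrix = head orientation
print(ok, R, tvec)
```

The recovered rotation and translation place the generic head model in the camera frame, which is exactly the pose information the static methods above try to obtain at every frame.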

To introduce temporal considerations and take advantage of previous results, dynamic methods have been developed. These methods perform face tracking by treating video sequences as more or less smooth successions of frames, and they use the pose information retrieved from one frame to analyze and derive the pose information of the next one. One of the most widespread techniques involves the use of Kalman filters to predict some of the analysis data as well as the pose parameters themselves. We refer the reader to (Ström, Jebara, Basu & Pentland, 1999; Valente & Dugelay, 2001; Cordea, E. M. Petriu, Georganas, D. C. Petriu & Whalen, 2001) for related algorithmic details.
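As a minimal illustration of this prediction step, the sketch below runs a constant-velocity Kalman filter on a single pose parameter (e.g., the head yaw angle). The state model, noise levels and measurements are assumptions chosen for clarity, not the values used by the cited trackers.

```python
import numpy as np

# Constant-velocity Kalman filter for one pose parameter (e.g., yaw, in degrees).
dt = 1.0 / 25.0                             # frame period (25 fps assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])       # state transition (angle, angular rate)
H = np.array([[1.0, 0.0]])                  # only the angle is measured
Q = np.diag([1e-4, 1e-2])                   # process noise (illustrative)
R = np.array([[0.5]])                       # measurement noise (illustrative)

x = np.zeros(2)                             # initial state
P = np.eye(2)                               # initial covariance

def kalman_step(x, P, z):
    """One predict/update cycle given the measured angle z."""
    # Predict: where do we expect the head to be in the next frame?
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ (np.array([z]) - H @ x_pred)).ravel()
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [0.0, 1.2, 2.1, 3.3, 4.0]:         # noisy yaw measurements per frame
    x, P = kalman_step(x, P, z)
    print(f"filtered yaw: {x[0]:.2f}, rate: {x[1]:.2f}")
```

The prediction gives the tracker a good starting point in the next frame, which is what allows dynamic methods to remain stable where frame-by-frame static estimation becomes tedious and imprecise.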

Other approaches, like the one presented by Huang and Chen (2000), are able to find and track more than one face in a video sequence, but they do not provide any head pose information. Other techniques (Zhenyun, Wei, Luhong, Guangyou & Hongjian, 2001; Spors & Rabenstein, 2001) simply look for the features they are interested in. They find the features' rough location but do not deduce any pose from this information, because their procedure is not accurate enough.

Table of Contents

Introduction
1 Motivation
2 Contribution
3 Outline of the thesis report
I Facial Image Analysis Techniques & Related Processing Fundamentals
I.1 Introduction
I.2 Processing Fundamentals
I.2.1 Pre-processing techniques
I.2.2 Image processing algorithms
I.2.3 Post-processing techniques and their related mathematical tools
I.3 Face Motion and Expression Analysis Techniques: a State of the Art
I.3.1 Methods that retrieve emotion information
I.3.2 Methods that obtain parameters related to the Facial Animation synthesis used
I.3.3 Methods that use explicit face synthesis during the image analysis
II Realistic Facial Animation & Face Cloning
II.1 Understanding the Concept of Realism in Facial Animation
II.2 The Semantics of Facial Animation
II.3 Animating Realism
II.4 Privacy and Security Issues about Face Cloning: Watermarking Possibilities
III Investigated FA Framework for Telecommunications
III.1 Introduction
III.2 Framework Overview
III.3 Our FA Framework from a Telecom Perspective
III.3.1 Coding face models and facial animation parameters: an MPEG-4 perspective
III.3.2 Facial animation parameters transmission
III.4 Facial Motion Analysis: Coupling Expression and Pose
IV Facial Non-rigid Motion Analysis from a Frontal Perspective
IV.1 Introduction
IV.2 Eye State Analysis Algorithm
IV.2.1 Analysis description
IV.2.2 Experimental evaluation and conclusions
IV.3 Introducing Color Information for Eye Motion Analysis
IV.3.1 Eye opening detection
IV.3.2 Gaze detection simplification
IV.3.3 Analysis interpretation for parametric description
IV.3.4 Experimental evaluation and conclusions
IV.4 Eyebrow Motion Analysis Algorithm
IV.4.1 Anatomical-mathematical eyebrow movement modeling
IV.4.2 Image analysis algorithm: deducing model parameters
IV.4.3 Experimental evaluation and conclusions
IV.5 Eye-Eyebrow Spatial Correlation: Studying Extreme Expressions
IV.5.1 Experimental evaluation and conclusions
IV.6 Analysis of mouth and lip motion
IV.6.1 Introduction
IV.6.2 Modeling lip motion with complete mouth action
IV.6.3 Image analysis of the mouth area: Color and intensity-based segmentation
Conclusion
