Automatic Eye Localization for Hospitalized Infants and Children Using Convolutional Neural Networks


Consciousness, that is, the patient’s wakefulness and awareness of his or her surroundings, is an important indicator of health and neurological state. Altered consciousness, ranging from confusion and lethargy to loss of consciousness or even coma, may indicate many problems, including neurological disorders, poisoning, or brain injury (Avner, 2006). When consciousness is purposefully altered by sedation or anesthesia, it is important to track the level of consciousness and ensure that patients are neither oversedated nor undersedated. Oversedation “will often result in prolonged mechanical ventilation and hemodynamic instability” (Johansson & Kokinsky, 2009), while undersedation may cause patients undue stress and pain.

It is therefore common practice to evaluate patient consciousness and distress regularly in intensive care units, using well-known tools like the Glasgow Coma Scale (Teasdale & Jennett, 1974). In pediatric units, tools like the COMFORT behavioural (COMFORT-B) scale (Van Dijk, Peters, Van Deventer & Tibboel, 2005) and the Face Legs Activity Cry Consolability (FLACC) scale (Merkel, Voepel-Lewis, Shayevitz & Malviya, 1997) attempt to quantify consciousness and distress in children by observing their behaviour. These scales are typically evaluated by nursing staff using visual observation and take less than 5 minutes to complete, but their parameters are not always easy to score. For example, when determining the facial tension score for the COMFORT-B scale, it may be difficult for a nurse to distinguish between “facial muscle totally relaxed” and “facial muscle tone normal”, especially for a child he or she has not seen before. As a result, different observers may assign different scores to the same patient.

The assessment of consciousness and distress in children under two years of age presents a particular challenge because these patients cannot respond to commands or explain what they are feeling. A study by Johansson & Kokinsky (2009) demonstrated that nurses’ bedside assessments of the level of sedation of young patients disagreed in 20% of cases, highlighting the necessity of pediatric assessment tools to support their work.

Cascade classifiers, as first proposed in Viola & Jones (2001), are a widely used method for face and eye localization due to the availability of pre-trained models and the ease of training new models using limited data. Cascade classifiers are trained using manually defined features, such as Haar features or a local binary pattern (LBP), and use a sliding window approach to locate objects in new images.
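The two ingredients named above, hand-defined features and a sliding window, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the single fixed window size, and the choice of a vertical two-rectangle Haar-like feature are hypothetical and are not taken from the cited work or from any particular library.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle (x, y, w, h) using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_vertical(ii, x, y, w, h):
    """Haar-like feature: bottom half of the rectangle minus its top half."""
    half = h // 2
    top = rect_sum(ii, x, y, w, half)
    bottom = rect_sum(ii, x, y + half, w, half)
    return bottom - top


def sliding_windows(img_w, img_h, win, step):
    """Yield the top-left corner of every win x win window, row by row."""
    for y in range(0, img_h - win + 1, step):
        for x in range(0, img_w - win + 1, step):
            yield x, y


# Toy image: dark band on top, bright band below.
image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
ii = integral_image(image)
feature = haar_two_rect_vertical(ii, 0, 0, 4, 4)
```

Because every rectangle sum costs four table lookups regardless of its size, a feature like this can be evaluated cheaply at each window position.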

The “cascade” refers to the narrowing of the search region as the classifier is applied to an image. Rather than calculate all relevant features for each window, a window is evaluated with a subset of features. If the object is not detected using those features, the window is discarded and not evaluated further. If, on the other hand, the results are promising, another subset of features is applied, and so on until the window is classified. This narrowing of the search region allows a classifier trained with a large number of features to run efficiently. Cascade classifiers are much faster than convolutional neural networks, making it easy to achieve real-time object localization.
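The early-rejection logic described in this paragraph can be sketched as follows. The two stages here are hypothetical stand-ins chosen only to make the example runnable; a trained cascade would use learned feature subsets and thresholds instead.

```python
from typing import Callable, Sequence, Tuple

Window = Sequence[Sequence[float]]   # small grayscale patch, rows of pixels
Stage = Callable[[Window], bool]     # True means "could still be the object"


def cascade_accepts(window: Window, stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Run the stages in order and stop at the first one that rejects.

    Returns (accepted, number_of_stages_evaluated).
    """
    evaluated = 0
    for stage in stages:
        evaluated += 1
        if not stage(window):
            return False, evaluated   # window discarded, later stages skipped
    return True, evaluated


def mean(window: Window) -> float:
    values = [v for row in window for v in row]
    return sum(values) / len(values)


# Hypothetical stages (assumptions for illustration only).
def stage_dark_enough(window: Window) -> bool:
    return mean(window) < 128


def stage_top_darker_than_bottom(window: Window) -> bool:
    half = len(window) // 2
    return mean(window[:half]) < mean(window[half:])


stages = [stage_dark_enough, stage_top_darker_than_bottom]
bright_patch = [[200, 200], [200, 200]]
eye_like_patch = [[10, 10], [90, 90]]
```

With these stand-ins, `bright_patch` is rejected after a single stage, while `eye_like_patch` passes both. Most background windows exit early in the same way, which is where the speed of the approach comes from.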

The use of cascade classifiers is illustrated in a study to detect fatigue in bus drivers (Mandal, Li, Wang & Lin, 2017), whose constraints resemble those of our project: images are recorded using a camera placed off-centre and above the subject, who may move freely and even face away from the camera. A chain of cascade detectors is used to identify first the upper body, then the face and its orientation, and finally the eye region. This gradual narrowing of the region of interest improves performance and accuracy by shrinking the candidate search region and eliminating false positives in the background.
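The chaining idea can be sketched independently of any particular detector. In the Python outline below, each detector is a hypothetical callable that returns a box relative to the crop it was given, or `None` if it finds nothing; the names and the box convention are assumptions for illustration, not the implementation used in the cited study.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]              # (x, y, width, height)
Image = List[List[int]]                      # grayscale image as rows of pixels
Detector = Callable[[Image], Optional[Box]]  # box is relative to its input


def crop(image: Image, box: Box) -> Image:
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def detect_chain(image: Image, detectors: Sequence[Detector]) -> Optional[Box]:
    """Apply each detector inside the box found by the previous one.

    Returns the last box in full-image coordinates, or None as soon as
    any detector in the chain fails.
    """
    offset_x, offset_y = 0, 0
    region = image
    box = None
    for detect in detectors:
        found = detect(region)
        if found is None:
            return None
        x, y, w, h = found
        offset_x += x                    # accumulate offsets so the result
        offset_y += y                    # refers to the original image
        box = (offset_x, offset_y, w, h)
        region = crop(region, found)     # next detector sees only this crop
    return box
```

Each later detector searches a smaller crop, so it does less work and cannot fire on background regions that an earlier stage already excluded.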

Mingxin, Yingzi & Xiangzhou (2016) also use a trained classifier to search for eyes but add some rules: the search is limited to the upper half of the face, and this upper region is further divided into left and right halves (which should have one eye each). A variance filter is used to eliminate non-eye regions, then a support vector machine (SVM) is used to check the remaining candidate regions for eyes. El Kaddouhi, Saaidi & Abarkan (2017) eschew the usual classifier and instead detect corner points using the Shi-Tomasi corner detector and group these points (by k-means) to produce candidate eye regions, which are then analyzed by template matching.
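The geometric rules and the variance filter from the first approach can be sketched in Python as below. The function names and the `min_variance` parameter are hypothetical, and the sketch assumes that the filter rejects low-variance (flat) patches; the source does not state the direction or value of the threshold.

```python
def eye_search_regions(face_box):
    """Split the upper half of a face box into left and right search regions.

    face_box and the returned boxes are (x, y, width, height).
    """
    x, y, w, h = face_box
    upper_h = h // 2
    half_w = w // 2
    left = (x, y, half_w, upper_h)
    right = (x + half_w, y, w - half_w, upper_h)
    return left, right


def variance(patch):
    """Population variance of the pixel intensities in a patch."""
    values = [v for row in patch for v in row]
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def passes_variance_filter(patch, min_variance):
    """Assumed rule: keep only patches with enough intensity variation."""
    return variance(patch) >= min_variance


left, right = eye_search_regions((10, 20, 101, 80))
flat_patch = [[50, 50], [50, 50]]
textured_patch = [[0, 100], [100, 0]]
```

Only the patches that survive this cheap filter would be passed on to the more expensive SVM check.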

One approach not based on machine learning is proposed by Chen & Liu (2011). The facial image is converted to YUV colour space and the U component is thresholded to highlight the eye, which has a lower intensity than the surrounding region. Projection functions are then used on this binary image to find the eye boundaries. Chun-Ning, Tai-Ning, Pin & Sheng-Jiang (2012) describe another similar approach, without the thresholding, where rapid changes of intensity around the eyes are detected. However, both these approaches work best if the eyes are open, limiting their usefulness in practice.
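A minimal Python sketch of the threshold-and-project idea follows. It assumes the standard BT.601 definition of the U component, and the threshold value and `min_count` parameter are illustrative choices rather than values from the cited papers.

```python
def u_component(r, g, b):
    """U (blue-difference) component of YUV, BT.601 coefficients."""
    return -0.14713 * r - 0.28886 * g + 0.436 * b


def threshold_low(channel, threshold):
    """Binary mask: 1 where the channel value is below the threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in channel]


def projections(mask):
    """Integral projections: count of set pixels per row and per column."""
    rows = [sum(row) for row in mask]
    cols = [sum(col) for col in zip(*mask)]
    return rows, cols


def extent(projection, min_count=1):
    """First and last index whose projection reaches min_count, or None."""
    idx = [i for i, count in enumerate(projection) if count >= min_count]
    return (idx[0], idx[-1]) if idx else None


def bounding_box(mask):
    """Box (x, y, width, height) spanned by the set pixels, or None."""
    rows, cols = projections(mask)
    r, c = extent(rows), extent(cols)
    if r is None or c is None:
        return None
    return (c[0], r[0], c[1] - c[0] + 1, r[1] - r[0] + 1)


mask = [[0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0]]
box = bounding_box(mask)
```

Raising `min_count` makes the boundaries less sensitive to isolated noisy pixels, at the cost of trimming thin parts of the eye region.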

Nevertheless, colour can be useful for narrowing the search region for other face and eye localization algorithms. For example, masking parts of an image based on skin colour can be used as a pre-processing step before a cascade classifier or other machine-learning-based approach, as in Mutneja & Singh (2017). It can also be used after face localization to discard face candidates that do not contain a sufficient number of skin colour pixels, as suggested by Ge, Han & Quan (2015). The accuracy of skin localization has been demonstrated on adults using the HSV and YCbCr colour spaces (Shaik, Ganesan, Kalist, Sathish & Jenitha, 2015), the RGB, YCbCr, and HSV colour spaces (Kolkur, Kalbande, Shimpi, Bapat & Jatakia, 2017), and the YCbCr colour space alone (Emmanuel & Ibiyemi, 2017; Ge et al., 2015).
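Both uses of skin colour can be sketched with a simple per-pixel rule in the YCbCr colour space. In the Python example below, the conversion uses the full-range BT.601 formulas, while the Cb and Cr bounds and the `min_fraction` default are commonly quoted illustrative values assumed here; they are not taken from the studies cited above and would need tuning for a given camera and population.

```python
def rgb_to_cbcr(r, g, b):
    """Chroma components of full-range BT.601 YCbCr for an 8-bit RGB pixel."""
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Assumed rule: a pixel is skin if both chroma values fall in range."""
    cb, cr = rgb_to_cbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_mask(pixels):
    """Pre-processing use: 1 for skin-coloured pixels, 0 elsewhere."""
    return [[1 if is_skin(*p) else 0 for p in row] for row in pixels]


def skin_fraction(pixels):
    """Share of pixels in a region classified as skin."""
    mask = skin_mask(pixels)
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def has_enough_skin(face_pixels, min_fraction=0.4):
    """Post-processing use: discard face candidates with too little skin."""
    return skin_fraction(face_pixels) >= min_fraction


skin_tone = (224, 172, 150)
blue = (0, 0, 255)
white = (255, 255, 255)
candidate = [[skin_tone, blue], [skin_tone, white]]
```

Here half of the pixels in `candidate` are classified as skin, so it would be kept with the assumed 0.4 threshold and rejected with a stricter one.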

Convolutional neural networks (CNNs), a class of deep neural network particularly well suited to computer vision tasks, grew in popularity through the 2010s as computation power increased to permit faster training and predictions.

CNNs can be trained to sort images of objects into different classes, but there is an additional problem to solve in real-world situations: how to identify which objects are of interest in a crowded scene. Many algorithms have been proposed to solve this problem of object detection, among them Faster R-CNN (Ren, He, Girshick & Sun, 2015), which sacrifices the speed of other algorithms like YOLO (Redmon, Divvala, Girshick & Farhadi, 2016) for greater accuracy. Faster R-CNN builds on previous work in the R-CNN and Fast R-CNN algorithms, which used slow selective search algorithms to identify regions of interest in an image. Faster R-CNN speeds up this procedure by training a “region proposal network” to identify these regions of interest. Once identified, these subsections of a larger image can be classified by a conventional CNN.

CNNs underlie the most powerful facial and eye detection systems now available, and recent work has applied them in the hospital setting to detect adult patients exiting their beds (Chwyl, Chung, Shafiee, Fu & Wong, 2017), to identify the pose of adult patients in hospital beds (Liu, Yin & Ostadabbas, 2019a), and to detect infants in bed and segment their skin region (Chaichulee, Villarroel, Jorge, Arteta, Green, McCormick, Zisserman & Tarassenko, 2017). Nevertheless, the application of CNNs to real-world problems remains limited by the need for large quantities of training data and the computational cost of analyzing images through complex, many-layered networks.

Table of contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Object detection
1.2 Eye detection for pediatrics
1.3 The eyes as a measure of mental activity
1.4 Machine learning from eye movements
1.5 Eye features
1.6 Consciousness in pediatrics
1.7 Pediatric pain recognition
CHAPTER 2 ORGANIZATION OF THE DOCUMENT
CHAPTER 3 METHODOLOGY
3.1 Training Datasets
3.2 Algorithms
3.2.1 Cascade classifiers
3.2.2 Convolutional neural networks
3.2.3 Evaluation
CHAPTER 4 AUTOMATIC EYE LOCALIZATION FOR HOSPITALIZED INFANTS AND CHILDREN USING CONVOLUTIONAL NEURAL NETWORKS
4.1 Introduction
4.2 Technical challenges
4.3 Existing Work
4.4 Methodology
4.4.1 Test data
4.4.2 Training data
4.4.3 Cascade classifiers
4.4.4 Convolutional neural networks
4.5 Results
4.6 Discussion
4.6.1 Task-specific training dataset
4.6.2 Model weaknesses
4.6.3 Convolutional neural networks vs. cascade classifiers
4.6.4 Limitations
4.7 Conclusion
CHAPTER 5 DISCUSSION OF THE RESULTS
CONCLUSION