Identification of emotion from face expressions is a relatively well understood problem where state-of-the-art solutions perform almost as well as humans. However, in many practical applications, disrupting factors still make identification of face expressions a very challenging problem. Within the project DREAM - Development of Robot Enhanced Therapy for Children with Autism Spectrum Disorder (ASD), we are identifying face expressions from children with ASD during therapy. Identified face expressions are used both in the online system, to guide the behavior of the robot, and off-line, to automatically annotate video for measurements of clinical outcomes.
This setup puts several new challenges on the face expression technology. First of all, in contrast to most open databases of face expressions, which comprise adult faces, we are recognizing emotions from children between the ages of 4 and 7 years. Secondly, children with ASD may show emotions differently compared to typically developed children. Thirdly, the children move freely during the intervention and, despite the use of several cameras tracking the face of the child from different angles, we rarely have a full frontal view of the face. Fourthly, and finally, the amount of native data is very limited.
Although we have access to extensive video material recorded from therapy sessions with children with ASD, potentially constituting a very valuable dataset for both training and testing of face expression implementations, this data proved difficult to use. A 10-minute video session may contain only a few instances of expressions, e.g. smiling. As such, although we have many hours of video in total, the data is very sparse, and the number of clear face expressions is too small for it to be used as training data in most machine learning (ML) techniques.
We therefore focused on the use of synthetic datasets for transfer learning, trying to overcome the challenges mentioned above. Three techniques were evaluated: (1) convolutional neural networks for image classification, analyzing separate video frames; (2) recurrent neural networks for sequence classification, capturing facial dynamics; and (3) ML algorithms classifying pre-extracted facial landmarks.
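As a rough illustration of technique (1), the sketch below fine-tunes an ImageNet-pretrained CNN to classify single video frames. The backbone (ResNet-18), the number of expression classes, and the preprocessing are assumptions made for illustration only, not the configuration used in the project.

```python
# Minimal transfer-learning sketch for per-frame expression classification.
# Assumptions (not from the project): ResNet-18 backbone, 6 expression classes,
# standard ImageNet preprocessing, recent torchvision.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_EXPRESSIONS = 6  # assumed number of target expression classes

# Start from an ImageNet-pretrained backbone and replace the final layer
# with an expression classifier (transfer learning).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_EXPRESSIONS)

# Typical per-frame preprocessing for such a backbone.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Freeze the pretrained layers and train only the new classification head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```

Training on frames from a synthetic dataset would then proceed with a standard supervised loop over batches of preprocessed frames and expression labels.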
The performance of all three models was unsatisfactory. Although the proposed models achieved high accuracy, approximately 98%, on a held-out test set, they performed poorly on the real-world data. This was due to the use of a synthetic dataset containing mostly frontal views of faces; models that had not seen similar examples before failed to classify them correctly. The accuracy decreased drastically when the child rotated her head or covered a part of her face. Even when the frame clearly captured a facial expression, the ML algorithms were not able to provide a stable positive classification rate. Thus, more elaborate training datasets and more robust ML models are required. Another option is to incorporate the voice and gestures of the child into the model, classifying emotional state as a complex concept.