Authors
Valentina Zhang, Phillips Exeter Academy, USA
Abstract
While facial expression is a complex and individualized behavior, all facial emotion recognition (FER) systems known to us rely on a single facial representation and are trained on universal data. We conjecture that (i) different facial representations can provide complementary views of emotions, and (ii) when employed collectively in a discussion-group setting, they enable accurate FER, which is highly desirable in autism care and other error-sensitive applications. In this paper, we first study FER using pixel-based deep learning (DL) versus semantics-based DL in the context of deepfake videos. The study confirms both conjectures. Building on these findings, we construct an adaptive FER system that learns from both types of models for dyadic and small interacting groups, and that further leverages the synthesized group emotions as the ground truth for individualized FER training. Using a collection of group conversation videos, we demonstrate that both FER accuracy and personalization benefit from this approach.
Keywords
Emotion recognition, facial representations, adaptive algorithm, training data ground truth.