Title:
Vision Transformer-Based Facial Emotion Recognition.
Authors:
Ezzameli, Kaouther (kaouther.ezzameli@fsb.ucar.tn); Mahersia, Hela (hela.mahersia@fsb.ucar.tn)
Source:
IAENG International Journal of Computer Science. Jan 2026, Vol. 53, Issue 1, p410-423. 14p.
Database:
Supplemental Index

Emotion recognition plays a key role in the development of intelligent systems that must understand and respond to human emotional states. Such systems can transform the quality and responsiveness of services we encounter in daily life, particularly in online education, mental health monitoring, remote medical diagnosis, and intelligent surveillance. Although emotions can be recognized through numerous channels, Facial Emotion Recognition (FER) remains one of the most significant sources of emotional information. Despite the large variety of face-image datasets that support FER, high accuracy is still difficult to achieve due to occlusions such as glasses and masks, illumination conditions, and variations in head pose and orientation. To overcome these issues, we present a new emotion recognition model based on the Vision Transformer (ViT) architecture, which has recently shown promising results in computer vision tasks. Compared to traditional Convolutional Neural Networks (CNNs), ViTs better capture long-range dependencies and global context, providing greater robustness and accuracy under varying conditions. The proposed model is extensively trained and tested on two widely used FER benchmark datasets: Extended Cohn-Kanade (CK+) and Japanese Female Facial Expression (JAFFE). Experiments show that our approach achieves classification rates of 98.98% on CK+ and 97.67% on JAFFE, surpassing several recent deep learning models. Beyond these quantitative gains, the model generalizes well across different facial characteristics and environmental variations. Overall, our work demonstrates the potential of ViT architectures for facial expression recognition. By overcoming severe limitations of current systems, our approach points toward natural, intuitive, and context-aware human-machine interaction systems capable of functioning in dynamic, real-world environments. [ABSTRACT FROM AUTHOR]
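
Editor's note: the record above does not include the authors' implementation details. The following minimal Python sketch only illustrates the general pattern the abstract describes, i.e., fine-tuning a pretrained ViT as an FER classifier; the choice of torchvision's vit_b_16, the 7-class head, the batch size, and the learning rate are all assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 7  # assumption: CK+ is commonly labeled with 7 basic expressions

    # Load an ImageNet-pretrained ViT-B/16 and swap in a new classification head.
    model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # illustrative LR
    criterion = nn.CrossEntropyLoss()

    # One fine-tuning step on a placeholder batch of 224x224 face crops.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, NUM_CLASSES, (8,))

    model.train()
    optimizer.zero_grad()
    logits = model(images)           # shape: (8, NUM_CLASSES)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()

In this pattern, the transformer's self-attention layers supply the long-range, global context the abstract contrasts with CNNs' local receptive fields; only the final linear head is task-specific.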