Auto-Masked Audio Spectrogram Transformer for depression detection from speech.
Background: Depression is a psychological disorder characterized by altered self-referential cognition and impaired emotional expression. Traditional diagnostic methods can be costly or intrusive, while speech-based analysis offers an accessible alternative for early detection.
Method: This study introduces the Auto-Masked Audio Spectrogram Transformer (AMAST), a deep learning framework that extracts depression-related features from speech spectrograms. AMAST incorporates sliding window segmentation, auto-masked training to enhance contextual learning, and a time-frequency attention mechanism to capture both time and frequency information.
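Illustration: The segmentation and masking steps named above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation (their code is at the repository linked in the Conclusion). The function names `sliding_windows` and `random_patch_mask`, the window length, hop size, patch size, and mask ratio are all hypothetical, and uniform random patch masking is used here only as a stand-in, since the abstract does not specify how the auto-masking is chosen.

```python
import numpy as np

def sliding_windows(spec, win, hop):
    """Split a (freq, time) spectrogram into overlapping (freq, win) segments."""
    n_frames = spec.shape[1]
    if n_frames < win:
        raise ValueError("spectrogram is shorter than one window")
    starts = range(0, n_frames - win + 1, hop)
    return np.stack([spec[:, s:s + win] for s in starts])

def random_patch_mask(segment, patch, ratio, rng):
    """Zero out a random subset of non-overlapping patch x patch tiles.

    Assumption: uniform random masking, used as a stand-in for the
    paper's auto-masking strategy, which the abstract does not describe.
    """
    n_f = segment.shape[0] // patch   # tiles along the frequency axis
    n_t = segment.shape[1] // patch   # tiles along the time axis
    n_masked = int(round(ratio * n_f * n_t))
    chosen = rng.choice(n_f * n_t, size=n_masked, replace=False)
    mask = np.zeros((n_f, n_t), dtype=bool)
    mask.flat[chosen] = True
    masked = segment.copy()
    for i, j in zip(*np.nonzero(mask)):
        masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return masked, mask

# Synthetic stand-in for a log-mel spectrogram: 128 bins x 1000 frames.
rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 1000))

segments = sliding_windows(spec, win=256, hop=128)   # hypothetical sizes
masked, mask = random_patch_mask(segments[0], patch=16, ratio=0.5, rng=rng)
```

In a masked-training setup of this kind, a model would typically be asked to reconstruct the zeroed tiles from the visible ones, which is one way such training can encourage contextual learning.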
Result: AMAST achieved F1 scores of 0.92 on the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset and 0.91 on the Multi-modal Open Dataset for Mental disorder Analysis (MODMA), outperforming baseline models. Emotionally evocative tasks such as word reading and interviews significantly improved classification performance. The model demonstrated robustness in detecting subtle depressive speech markers across various speaking conditions.
Conclusion: AMAST provides a promising tool for non-invasive depression screening. Its effectiveness across diverse tasks and datasets supports its potential use in clinical and remote mental health assessments. Our code is available at https://github.com/zmc314/AMAST.
(Copyright © 2025 Elsevier B.V. All rights reserved.)
Declaration of competing interest: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Chunxue Wang reports financial support was provided by Beijing Tiantan Hospital Affiliated to Capital Medical University. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.