Result: Auto-Masked Audio Spectrogram Transformer for depression detection from speech.

Title:
Auto-Masked Audio Spectrogram Transformer for depression detection from speech.
Authors:
Zhang M; College of Computer Science and Technology, Beijing University of Technology, No. 100, Pingleyuan, Beijing 100124, China. Electronic address: zmc16384@emails.bjut.edu.cn.
He J; College of Computer Science and Technology, Beijing University of Technology, No. 100, Pingleyuan, Beijing 100124, China; Beijing Engineering Research Center for IOT Software and Systems, Beijing University of Technology, No. 100, Pingleyuan, Beijing 100124, China. Electronic address: jianhee@bjut.edu.cn.
Peng X; Institute of Software Chinese Academy of Sciences, No. 4 South Fourth Street, Zhong Guan Cun, Beijing 100190, China.
Huang J; Institute of Software Chinese Academy of Sciences, No. 4 South Fourth Street, Zhong Guan Cun, Beijing 100190, China.
Zhang N; Beijing Tiantan Hospital, Capital Medical University, Beijing, China.
Wang C; Beijing Tiantan Hospital, Capital Medical University, Beijing, China.
Jiang D; Beijing Tiantan Hospital, Capital Medical University, Beijing, China.
Source:
Journal of affective disorders [J Affect Disord] 2026 Jan 15; Vol. 393 (Pt A), pp. 120295. Date of Electronic Publication: 2025 Sep 16.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Elsevier/North-Holland Biomedical Press Country of Publication: Netherlands NLM ID: 7906073 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1573-2517 (Electronic) Linking ISSN: 01650327 NLM ISO Abbreviation: J Affect Disord Subsets: MEDLINE
Imprint Name(s):
Original Publication: Amsterdam, Elsevier/North-Holland Biomedical Press.
Contributed Indexing:
Keywords: Audio spectrogram transformer; Auto-masked; Major depression disorder; Sliding window; Time–frequency attention
Entry Date(s):
Date Created: 20250918 Date Completed: 20251104 Latest Revision: 20251104
Update Code:
20251104
DOI:
10.1016/j.jad.2025.120295
PMID:
40967413
Database:
MEDLINE

Further Information

Background: Depression is a psychological disorder characterized by altered self-referential cognition and impaired emotional expression. Traditional diagnostic methods can be costly or intrusive, while speech-based analysis offers an accessible alternative for early detection.
Method: This study introduces the Auto-Masked Audio Spectrogram Transformer (AMAST), a deep learning framework that extracts depression-related features from speech spectrograms. AMAST incorporates sliding window segmentation, auto-masked training to enhance contextual learning, and a time-frequency attention mechanism to capture both time and frequency information.
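As an illustration of the sliding window segmentation step named in the Method, the following is a minimal sketch that splits a spectrogram into overlapping segments along the time axis. The function name `sliding_windows`, the list-of-lists representation, and the `window` and `hop` values are assumptions made for this example; the abstract does not state the settings AMAST uses, and the masking and attention components are not shown.

```python
from typing import List

# Rows are time frames, columns are frequency bins (an assumed layout).
Spectrogram = List[List[float]]


def sliding_windows(spec: Spectrogram, window: int, hop: int) -> List[Spectrogram]:
    """Split a spectrogram into overlapping segments along the time axis.

    `window` and `hop` are measured in frames. Both are illustrative
    parameters, not settings taken from the paper.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    segments = []
    # Keep only full-length windows; a trailing partial window is dropped.
    for start in range(0, len(spec) - window + 1, hop):
        segments.append(spec[start:start + window])
    return segments


# Toy spectrogram: 10 time frames x 4 frequency bins (hypothetical data).
toy = [[float(t * 4 + f) for f in range(4)] for t in range(10)]
segments = sliding_windows(toy, window=4, hop=2)
print(len(segments))  # 4 segments, starting at frames 0, 2, 4, 6
```

Overlapping windows (hop smaller than window) let each stretch of speech appear in more than one segment, at the cost of more segments per recording.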
Result: AMAST achieved F1 scores of 0.92 on the Distress Analysis Interview Corpus-Wizard of Oz dataset and 0.91 on the Multi-modal Open Dataset for Mental disorder Analysis dataset, outperforming baseline models. Emotionally evocative tasks such as word reading and interviews significantly improved classification performance. The model demonstrated robustness in detecting subtle depressive speech markers across various speaking conditions.
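For readers unfamiliar with the metric reported in the Result, the F1 score is the harmonic mean of precision and recall, which simplifies to 2TP / (2TP + FP + FN). The sketch below computes it from confusion counts; the counts used are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2*TP / (2*TP + FP + FN), or 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for illustration only, not results from the study.
example = f1_score(tp=8, fp=2, fn=2)
print(example)  # 0.8
```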
Conclusion: AMAST provides a promising tool for non-invasive depression screening. Its effectiveness across diverse tasks and datasets supports its potential use in clinical and remote mental health assessments. Our code is available at https://github.com/zmc314/AMAST.
(Copyright © 2025 Elsevier B.V. All rights reserved.)

Declaration of competing interest: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Chunxue Wang reports financial support was provided by Beijing Tiantan Hospital Affiliated to Capital Medical University. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.