Improving Speaker Diarization for Overlapped Speech with Texture-Aware Feature Fusion
Speaker diarization (SD), which addresses the "who spoke when" problem, is a key technology in speech processing. Although end-to-end neural speaker diarization methods have simplified the traditional multi-stage pipeline, their ability to extract discriminative speaker-specific features remains limited, particularly in overlapping speech segments. To address this limitation, we propose EEND-ECB-CGA, an enhanced neural network built upon the EEND-VC framework. Our approach introduces a texture-aware fusion module that integrates an Edge-oriented Convolution Block (ECB) with Content-Guided Attention (CGA). The ECB extracts complementary texture and edge features from spectrograms, capturing speaker-specific structural patterns that energy-based features often miss, thereby improving the detection of speaker change points. The CGA module then dynamically weights the texture-enhanced features according to their importance, emphasizing speaker-dominant regions while suppressing noise and overlap interference. Evaluations on the LibriSpeech_mini and LibriSpeech datasets demonstrate that EEND-ECB-CGA significantly reduces the diarization error rate (DER) compared to the baseline and outperforms several mainstream end-to-end and clustering-based approaches. These results confirm the robustness of our method in complex, multi-speaker environments, particularly in challenging scenarios with overlapping speech.
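To make the fusion idea concrete, the following is a minimal PyTorch sketch of a texture-aware fusion block in the spirit described above: an edge-oriented convolution branch extracts spectro-temporal texture and edge cues, and a content-guided attention gate re-weights them before fusion. All layer choices, channel sizes, and the fixed Sobel kernels are illustrative assumptions, not the authors' exact EEND-ECB-CGA implementation.

```python
# Illustrative sketch only: module names, kernel choices, and gating design are
# assumptions made for clarity, not the published EEND-ECB-CGA architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeOrientedConvBlock(nn.Module):
    """Extracts texture/edge cues from spectrogram-like features by combining a
    learnable 3x3 convolution with fixed Sobel filters that respond to
    spectro-temporal edges (e.g., speaker change boundaries)."""

    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        # Two fixed edge filters, applied depthwise to every input channel.
        self.register_buffer("edge_kernel", torch.stack([sobel_x, sobel_y]).unsqueeze(1))
        self.fuse = nn.Conv2d(channels + 2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        learned = self.conv(x)
        kernel = self.edge_kernel.repeat(self.channels, 1, 1, 1)  # (2C, 1, 3, 3)
        edges = F.conv2d(x, kernel, padding=1, groups=self.channels)
        return self.fuse(torch.cat([learned, edges], dim=1))


class ContentGuidedAttention(nn.Module):
    """Re-weights texture-enhanced features with channel and spatial gates
    derived from the original content, emphasizing speaker-dominant regions."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor, content: torch.Tensor) -> torch.Tensor:
        # Attention weights come from the content features and are applied to
        # the texture-enhanced ones; a residual path keeps the original signal.
        w = self.channel_gate(content) * self.spatial_gate(content)
        return features * w + content


class TextureAwareFusion(nn.Module):
    """ECB followed by CGA, applied to front-end spectrogram features."""

    def __init__(self, channels: int = 1):
        super().__init__()
        self.ecb = EdgeOrientedConvBlock(channels)
        self.cga = ContentGuidedAttention(channels)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        texture = self.ecb(spec)
        return self.cga(texture, spec)


if __name__ == "__main__":
    spec = torch.randn(4, 1, 500, 80)  # (batch, channel, frames, mel bins)
    fused = TextureAwareFusion(channels=1)(spec)
    print(fused.shape)  # torch.Size([4, 1, 500, 80])
```

In this sketch the fused output keeps the input shape, so it could in principle be passed to an EEND-style encoder in place of the raw spectrogram features; how the paper actually inserts the module into the EEND-VC pipeline is not specified in the abstract.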