Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement
(Oral presentation)
Santhan Kumar Reddy Nareddula (IIT Tirupati, India), Subrahmanyam Gorthi (IIT Tirupati, India), Rama Krishna Sai S. Gorthi (IIT Tirupati, India)
This paper proposes a deep learning-based, densely connected Y-Net as an effective network architecture for fusing time- and frequency-domain loss functions for speech enhancement. The proposed architecture performs speech enhancement in the time domain while fusing information from the frequency domain. The Y-network consists of an encoder branch followed by two decoder branches, where the first and second decoder loss functions enforce speech enhancement in the time and frequency domains, respectively. Each layer of the proposed network is formed from densely connected blocks comprising dilated and causal convolutions for effective feature extraction and error backpropagation. The proposed model is trained on a publicly available dataset of 28 speakers with 40 different noise conditions, and evaluations are performed on an independent, unseen test set of 2 speakers and 20 different noise conditions. The results of the proposed method are compared with five state-of-the-art methods using various metrics. The proposed method achieves an overall perceptual evaluation of speech quality (PESQ) score of 3.4 and outperforms the existing methods by a significant margin on all evaluation metrics.
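To make the described architecture concrete, below is a minimal PyTorch sketch of a Y-shaped network with a shared encoder and two decoder branches, where one branch is supervised with a time-domain loss and the other with a frequency-domain (STFT-magnitude) loss. The channel widths, dilation rates, block depth, and loss weighting here are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only: hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDenseBlock(nn.Module):
    """Densely connected 1-D block built from dilated, causal convolutions."""

    def __init__(self, channels, growth=16, layers=3, kernel=3):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = channels
        for i in range(layers):
            self.convs.append(nn.Conv1d(ch, growth, kernel, dilation=2 ** i))
            ch += growth  # dense connectivity: each layer sees all earlier outputs
        self.out = nn.Conv1d(ch, channels, 1)
        self.kernel = kernel

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            inp = torch.cat(feats, dim=1)
            pad = (self.kernel - 1) * (2 ** i)  # left-pad only => causal
            feats.append(F.relu(conv(F.pad(inp, (pad, 0)))))
        return self.out(torch.cat(feats, dim=1))


class YNet(nn.Module):
    """Shared encoder followed by two decoder branches (the 'Y' shape)."""

    def __init__(self, channels=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv1d(1, channels, 1), CausalDenseBlock(channels))
        self.decode_time = nn.Sequential(
            CausalDenseBlock(channels), nn.Conv1d(channels, 1, 1))
        self.decode_freq = nn.Sequential(
            CausalDenseBlock(channels), nn.Conv1d(channels, 1, 1))

    def forward(self, noisy):
        z = self.encode(noisy)
        return self.decode_time(z), self.decode_freq(z)


def fused_loss(est_time, est_freq, clean, alpha=0.5, n_fft=512):
    """Time-domain L1 on one branch, STFT-magnitude L1 on the other."""
    loss_t = F.l1_loss(est_time, clean)
    spec = lambda x: torch.stft(x.squeeze(1), n_fft, return_complex=True).abs()
    loss_f = F.l1_loss(spec(est_freq), spec(clean))
    return alpha * loss_t + (1 - alpha) * loss_f


model = YNet()
noisy = torch.randn(4, 1, 16000)   # batch of 1-second, 16 kHz waveforms
clean = torch.randn(4, 1, 16000)
est_t, est_f = model(noisy)
fused_loss(est_t, est_f, clean).backward()
```

The key design point, as described in the abstract, is that gradients from both the time-domain and frequency-domain losses flow back through the shared encoder, so the time-domain enhancement branch benefits from spectral supervision.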