End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain
Kai Wang (Xinjiang University, China), Hao Huang (Xinjiang University, China), Ying Hu (Xinjiang University, China), Zhihua Huang (Xinjiang University, China), Sheng Li (NICT, Japan)
Traditional single-channel speech separation in the time-frequency (T-F) domain often faces the problem of phase reconstruction. Because real-valued networks are ill-suited to complex-valued representations, T-F domain speech separation methods are often prevented from reaching state-of-the-art performance. In this paper, we propose improved speech separation methods in both the complex and the real T-F domain using orthogonal representation. For the complex-valued case, we combine the deep complex network (DCN) and Conv-TasNet to design an end-to-end complex-valued model. Specifically, we incorporate the short-time Fourier transform (STFT) and learnable complex layers to build a hybrid encoder-decoder structure, and use a DCN-based separator. We then demonstrate the importance of weight orthogonality in the T-F domain transformation and propose a multi-segment orthogonality (MSO) architecture for further improvement. For the real-valued case, we perform separation in the real T-F domain by introducing the short-time discrete cosine transform (STDCT), which likewise provides an orthogonal representation. Experimental results show that the proposed complex-valued model outperforms the baseline Conv-TasNet by 1.8 dB with a comparable parameter size, and that the STDCT-based real-valued T-F model outperforms it by 1.2 dB, demonstrating the advantages of speech separation in the T-F domain.
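Two ingredients named in the abstract lend themselves to a small illustration: the orthogonality of the STDCT filterbank and an orthogonality constraint on transform weights. The following is a minimal PyTorch sketch, assuming the MSO constraint reduces to a Frobenius-norm penalty ||W W^T - I||^2 applied to each segment of the weight matrix; the abstract does not give the actual MSO formulation, and the helper names (`dct_ii_matrix`, `orthogonality_penalty`, `multi_segment_orthogonality`) are hypothetical, not taken from the paper.

```python
import math

import torch


def dct_ii_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis: rows are mutually orthogonal unit vectors.

    This is the frame transform underlying a short-time DCT (STDCT)
    analysis filterbank, i.e. a real-valued orthogonal representation.
    """
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # frequency index
    t = torch.arange(n, dtype=torch.float32).unsqueeze(0)  # time index
    basis = torch.cos(math.pi * (t + 0.5) * k / n) * math.sqrt(2.0 / n)
    basis[0] /= math.sqrt(2.0)  # DC row rescaled for orthonormality
    return basis


def orthogonality_penalty(w: torch.Tensor) -> torch.Tensor:
    """Frobenius penalty ||W W^T - I||_F^2; zero iff rows of W are orthonormal."""
    gram = w @ w.t()
    eye = torch.eye(gram.shape[0], device=w.device, dtype=w.dtype)
    return ((gram - eye) ** 2).sum()


def multi_segment_orthogonality(w: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Hypothetical MSO-style loss: apply the penalty to each weight segment."""
    return sum(orthogonality_penalty(s) for s in w.chunk(num_segments, dim=0))


if __name__ == "__main__":
    basis = dct_ii_matrix(64)                     # one-frame STDCT filterbank
    print(orthogonality_penalty(basis).item())    # ~0: DCT basis is orthogonal
    rand = torch.randn(64, 64)                    # unconstrained learnable weights
    print(multi_segment_orthogonality(rand, 4).item())  # large for random weights
```

Running the script prints a penalty near zero for the DCT basis, illustrating why STDCT yields an orthogonal real-valued T-F representation, and a large value for random weights, which is the quantity such a regularizer would drive toward zero during training.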