Deep Neural Networks and Hidden Markov Models in i-vector-based Text-Dependent Speaker Verification
Hossein Zeinali, Lukas Burget, Hossein Sameti, Ondrej Glembek and Oldrich Plchot
Techniques based on Deep Neural Networks (DNNs) have recently brought large improvements in text-independent speaker recognition. In this paper, we verify that DNN-based methods also yield excellent performance in text-dependent speaker verification. We build our system on the previously introduced HMM-based i-vector approach, in which phone models provide the frame-level alignment used to collect sufficient statistics for i-vector extraction. For comparison, we experiment with an alternative alignment obtained directly from the output of a DNN trained for phone classification. We also experiment with DNN-based bottleneck features and their combination with standard cepstral features. Although the i-vector approach is generally considered unsuitable for text-dependent speaker verification, we show that our HMM-based approach combined with bottleneck features provides state-of-the-art performance on the RSR2015 data.
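To make the role of the frame-level alignment concrete, the following minimal sketch shows how zero- and first-order sufficient statistics could be accumulated from per-frame class posteriors. It illustrates the general technique only and is not the authors' implementation. The function names `collect_sufficient_stats` and `hard_alignment_to_posteriors` are hypothetical, and the sketch assumes the alignment arrives either as soft posteriors (for example, DNN outputs) or as one hard class index per frame (for example, a Viterbi path through HMM states). The abstract does not specify which form the HMM alignment takes.

```python
def collect_sufficient_stats(features, posteriors):
    """Accumulate zero- and first-order statistics per alignment class.

    features:   list of T frames, each a list of D floats
    posteriors: list of T rows, each a list of C class posteriors
    Returns (zero, first): zero[c] is the summed posterior of class c,
    first[c] is the posterior-weighted sum of the frames for class c.
    """
    if not features or len(features) != len(posteriors):
        raise ValueError("need the same non-zero number of frames in both inputs")
    num_classes = len(posteriors[0])
    dim = len(features[0])
    zero = [0.0] * num_classes
    first = [[0.0] * dim for _ in range(num_classes)]
    for frame, gamma in zip(features, posteriors):
        for c, weight in enumerate(gamma):
            if weight == 0.0:
                continue  # this class takes no responsibility for the frame
            zero[c] += weight
            for d, value in enumerate(frame):
                first[c][d] += weight * value
    return zero, first


def hard_alignment_to_posteriors(class_ids, num_classes):
    """Turn one class index per frame into one-hot posterior rows."""
    return [[1.0 if c == s else 0.0 for c in range(num_classes)]
            for s in class_ids]


# Toy example: three 2-dimensional frames, two alignment classes.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hard alignment (assumed here to come from an HMM decoding pass).
hard = hard_alignment_to_posteriors([0, 1, 0], num_classes=2)
zero_hard, first_hard = collect_sufficient_stats(frames, hard)

# Soft alignment (assumed here to come from DNN output posteriors).
soft = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
zero_soft, first_soft = collect_sufficient_stats(frames, soft)
```

Under these assumptions, switching between the two alignment sources changes only the posterior matrix. The accumulation step, and the i-vector extractor that would consume the statistics, stay the same.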