Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance
Takanori Ashihara (NTT, Japan), Takafumi Moriya (NTT, Japan), Makio Kashino (NTT, Japan)
Humans have a sophisticated capability to robustly handle incomplete sensory input, as often occurs in real environments. Earlier studies qualitatively observed the robustness of human speech perception using spectrally and temporally degraded stimuli. The current study investigates whether machine speech recognition, especially end-to-end automatic speech recognition (E2E-ASR), exhibits similar robustness against distorted acoustic cues. To evaluate E2E-ASR performance, we employ four types of distorted speech based on previous studies: locally time-reversed speech, noise-vocoded speech, phonemic restoration, and modulation-filtered speech. These stimuli are synthesized by spectral and/or temporal manipulation of original speech samples for which human speech intelligibility scores have been well reported. Experiments were conducted on the TED-LIUM2 corpus for English and the Corpus of Spontaneous Japanese (CSJ) for Japanese. We found that, although E2E-ASR tends to exhibit similar robustness in some experiments, it does not fully recover from the harmful effects of severe spectral degradation.
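As an illustration of one of these manipulations, the sketch below generates locally time-reversed speech in the usual way: the waveform is divided into fixed-length segments and each segment is reversed in place while the segment order is preserved, so intelligibility typically degrades as the segment duration grows. This is a minimal sketch of the general technique, not the paper's exact synthesis pipeline; the function name `locally_time_reverse` and the 50 ms segment length in the usage example are illustrative assumptions.

```python
import numpy as np

def locally_time_reverse(signal: np.ndarray, sample_rate: int,
                         segment_ms: float) -> np.ndarray:
    """Reverse each fixed-length segment of the waveform in place while
    keeping the segments in their original order (locally time-reversed
    speech). `segment_ms` is the segment duration in milliseconds."""
    seg_len = max(1, int(sample_rate * segment_ms / 1000.0))
    out = np.empty_like(signal)
    for start in range(0, len(signal), seg_len):
        end = min(start + seg_len, len(signal))
        out[start:end] = signal[start:end][::-1]
    return out

# Illustrative usage: a 1-second, 16 kHz synthetic sweep reversed in
# 50 ms segments (the segment length is an assumption for illustration,
# not the paper's experimental setting).
sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
speech = np.sin(2 * np.pi * (200.0 + 400.0 * t) * t).astype(np.float32)
distorted = locally_time_reverse(speech, sr, segment_ms=50.0)
```

With short segments the output remains largely intelligible to listeners, whereas longer segments scramble the temporal fine structure that both humans and ASR systems rely on, which is what makes this manipulation a useful probe of temporal robustness.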