Sitemap
- InterSpeech 2021
- Keynotes (4)
- Survey talks (4)
- Acoustic event detection and acoustic scene classification (5)
- SpecMix : A Mixed Sample Data Augmentation method for Training with Time-Frequency Domain Features (3 minutes introduction)
- Acoustic Scene Classification using Kervolution-Based SubSpectralNet (3 minutes introduction)
- Event Specific Attention for Polyphonic Sound Event Detection (3 minutes introduction)
- AST: Audio Spectrogram Transformer (3 minutes introduction)
- Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene (3 minutes introduction)
- Applications in transcription, education and learning (8)
- Weakly-supervised word-level pronunciation error detection in non-native English speech (longer introduction)
- End-to-End Speaker-Attributed ASR with Transformer (3 minutes introduction)
- Explore Wav2vec 2.0 for Mispronunciation Detection (3 minutes introduction)
- Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings (3 minutes introduction)
- Deep feature transfer learning for automatic pronunciation assessment (3 minutes introduction)
- "You don't understand me!": Comparing ASR results for L1 and L2 speakers of Swedish (3 minutes introduction)
- NeMo Inverse Text Normalization: From Development To Production (3 minutes introduction)
- Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability (3 minutes introduction)
- ASR Technologies and systems (1)
- Assessment of pathological speech and language I (4)
- Automatic extraction of speech rhythm descriptors for speech intelligibility assessment in the context of head and neck cancers (Oral presentation)
- Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-encoders (Oral presentation)
- The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation (Oral presentation)
- Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces (Oral presentation)
- Assessment of pathological speech and language II (13)
- Speech intelligibility of dysarthric speech: human scores and acoustic-phonetic features (3 minutes introduction)
- Analyzing short term dynamic speech features for understanding behavioral traits of children with autism spectrum disorder (3 minutes introduction)
- Analyzing short term dynamic speech features for understanding behavioral traits of children with autism spectrum disorder (longer introduction)
- Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms (3 minutes introduction)
- Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding (3 minutes introduction)
- Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding (longer introduction)
- Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions (3 minutes introduction)
- Identifying cognitive impairment using sentence representation vectors (3 minutes introduction)
- Parental spoken scaffolding and narrative skills in crowd-sourced storytelling samples of young children (3 minutes introduction)
- Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data (3 minutes introduction)
- Source and Vocal Tract Cues for Speech-based Classification of Patients with Parkinson’s Disease and Healthy Subjects (3 minutes introduction)
- Source and Vocal Tract Cues for Speech-based Classification of Patients with Parkinson’s Disease and Healthy Subjects (longer introduction)
- CLAC: A Speech Corpus Of Healthy English Speakers (3 minutes introduction)
- Automatic Speech Recognition in Air Traffic Management (4)
- Detecting English Speech in the Air Traffic Control Voice Communication (Oral presentation)
- Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems (Oral presentation)
- Boosting of contextual information in ASR for air-traffic call-sign recognition (Oral presentation)
- Modeling the effect of military oxygen masks on speech characteristics (Oral presentation)
- Communication and interaction, multimodality (8)
- A Psychology-Driven Computational Analysis of Political Interviews (3 minutes introduction)
- Speech Emotion Recognition based on Attention Weight Correction Using Word-level Confidence Measure (3 minutes introduction)
- Speech Emotion Recognition based on Attention Weight Correction Using Word-level Confidence Measure (longer introduction)
- Effects of voice type and task on L2 learners’ awareness of pronunciation errors (3 minutes introduction)
- Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues (3 minutes introduction)
- Detecting Alzheimer's Disease using Interactional and Acoustic features from spontaneous speech (3 minutes introduction)
- Investigating the interplay between affective, phonatory and motoric subsystems in Autism Spectrum Disorder using an audiovisual dialogue agent (3 minutes introduction)
- Analysis of eye gaze reasons and gaze aversions during three-party conversations (3 minutes introduction)
- ConferencingSpeech 2021 challenge: Far-field Multi-Channel Speech Enhancement for Video Conferencing (5)
- A Causal U-net based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement (Oral presentation)
- A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation (Oral presentation)
- Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model (Oral presentation)
- Improving Channel Decorrelation for Multi-Channel Target Speech Extraction (Oral presentation)
- Inplace Gated Convolutional Recurrent Neural Network For Dual-channel Speech Enhancement (Oral presentation)
- Cross/multi-lingual and code-switched ASR (7)
- Bootstrap an End-to-end ASR System by Multilingual Training, Transfer Learning, Text-to-text Mapping and Synthetic Audio (3 minutes introduction)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition (3 minutes introduction)
- Multilingual and code-switching ASR challenges for low resource Indian languages (3 minutes introduction)
- SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages (3 minutes introduction)
- Hierarchical Phone Recognition with Compositional Phonetics (3 minutes introduction)
- Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR (3 minutes introduction)
- Differentiable Allophone Graphs for Language Universal Speech Recognition (3 minutes introduction)
- Disordered speech (3)
- Diverse modes of speech acquisition and processing (10)
- Segment and Tone Production in Continuous Speech of Hearing and Hearing-impaired Children (3 minutes introduction)
- Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-acoustic Hearing (3 minutes introduction)
- Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-acoustic Hearing (longer introduction)
- A Comparative Study Of Different EMG Features For Acoustic-to-EMG Mapping (3 minutes introduction)
- A Comparative Study Of Different EMG Features For Acoustic-to-EMG Mapping (longer introduction)
- An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech (3 minutes introduction)
- An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech (longer introduction)
- Remote smartphone-based speech collection: acceptance and barriers in individuals with major depressive disorder (3 minutes introduction)
- Remote smartphone-based speech collection: acceptance and barriers in individuals with major depressive disorder (longer introduction)
- Silent versus modal multi-speaker speech recognition from ultrasound and video (3 minutes introduction)
- Embedding and Network Architecture for Speaker Recognition (3)
- A Thousand Words are Worth More Than One Recording: Word-Embedding Based Speaker Change Detection (Oral presentation)
- Leveraging speaker attribute information using multi task learning for speaker verification and diarization (Oral presentation)
- Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition (Oral presentation)
- Emotion and Sentiment Analysis I (2)
- Emotion and Sentiment Analysis II (9)
- Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit (3 minutes introduction)
- Multimodal Sentiment Analysis with Temporal Modality Attention (3 minutes introduction)
- Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition (3 minutes introduction)
- Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition (longer introduction)
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (3 minutes introduction)
- Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition (3 minutes introduction)
- Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition (longer introduction)
- Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech (3 minutes introduction)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis (3 minutes introduction)
- Emotion and Sentiment Analysis III (4)
- Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes (3 minutes introduction)
- Metric Learning Based Feature Representation With Gated Fusion Model For Speech Emotion Recognition (3 minutes introduction)
- Speech Emotion Recognition with Multi-task Learning (3 minutes introduction)
- Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables (3 minutes introduction)
- Feature, Embedding and Neural Architecture for Speaker Recognition (8)
- Bidirectional Multiscale Feature Aggregation for Speaker Verification (3 minutes introduction)
- Improving Time Delay Neural Network Based Speaker Recognition With Convolutional Block And Feature Aggregation Methods (3 minutes introduction)
- Improving Time Delay Neural Network Based Speaker Recognition With Convolutional Block And Feature Aggregation Methods (longer introduction)
- Binary Neural Network for Speaker Verification (3 minutes introduction)
- Mutual Information Enhanced Training for Speaker Embedding (3 minutes introduction)
- Y-Vector: Multiscale Waveform Encoder for Speaker Embedding (3 minutes introduction)
- Phoneme-aware and Channel-wise Attentive Learning for Text Dependent Speaker Verification (3 minutes introduction)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding (3 minutes introduction)
- Graph and End-to-End Learning for Speaker Recognition (1)
- Health and Affect I (3)
- Health and Affect II (9)
- Automatic Speech Recognition systems errors for objective sleepiness detection through voice (3 minutes introduction)
- Robust Laughter Detection in Noisy Environments (3 minutes introduction)
- Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech (3 minutes introduction)
- Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children (3 minutes introduction)
- Emotion Carrier Recognition from Personal Narratives (3 minutes introduction)
- Non-verbal Vocalisation and Laughter Detection using Sequence-to-sequence Models and Multi-label Training (3 minutes introduction)
- Visual Speech for Obstructive Sleep Apnea Detection (3 minutes introduction)
- Analysis of Contextual Voice Changes in Remote Meetings (3 minutes introduction)
- Speech based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model (3 minutes introduction)
- INTERSPEECH 2021 Acoustic Echo Cancellation Challenge (3)
- INTERSPEECH 2021 Acoustic Echo Cancellation Challenge (Oral presentation)
- F-T-LSTM based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement (Oral presentation)
- Acoustic Echo Cancellation using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information (Oral presentation)
- INTERSPEECH 2021 Deep Noise Suppression Challenge (2)
- Keyword search and spoken language processing (3)
- Language and Accent Recognition (3)
- Language and Lexical Modeling for ASR (8)
- Incorporating External POS Tagger for Punctuation Restoration (3 minutes introduction)
- Phonetically Induced Subwords for End-to-End Speech Recognition (3 minutes introduction)
- Lookup-Table Recurrent Language Models for Long Tail Speech Recognition (3 minutes introduction)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems (3 minutes introduction)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems (longer introduction)
- Token-Level Supervised Contrastive Learning for Punctuation Restoration (3 minutes introduction)
- Class-Based Neural Network Language Model For Second-Pass Rescoring In ASR (3 minutes introduction)
- Correcting Automated and Manual Speech Transcription Errors using Warped Language Models (3 minutes introduction)
- Language Modeling and Text-based Innovations for ASR (3)
- Linguistic Components in end-to-end ASR (5)
- The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages (Oral presentation)
- Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition (Oral presentation)
- Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept (Oral presentation)
- Modeling Dialectal Variation for Swiss German Automatic Speech Recognition (Oral presentation)
- Out-of-vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System (Oral presentation)
- Low-resource speech recognition (7)
- Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks (3 minutes introduction)
- Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning (3 minutes introduction)
- Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning (longer introduction)
- Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language (3 minutes introduction)
- Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing (3 minutes introduction)
- The Zero Resource Speech Challenge 2021: Spoken language modelling (3 minutes introduction)
- Zero-Shot Federated Learning with New Classes for Audio Classification (3 minutes introduction)
- Miscellaneous topics in ASR (3)
- Multi- and cross-lingual ASR, other topics in ASR (8)
- Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching (3 minutes introduction)
- Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone (3 minutes introduction)
- Reducing Streaming ASR Model Delay with Self Alignment (3 minutes introduction)
- Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages (3 minutes introduction)
- Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages (longer introduction)
- Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End (3 minutes introduction)
- Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models (3 minutes introduction)
- Earnings-21: A Practical Benchmark for ASR in the Wild (3 minutes introduction)
- Multi-channel speech enhancement and hearing aids (9)
- LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement (3 minutes introduction)
- Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks (3 minutes introduction)
- Cancellation of Local Competing Speaker with Near-field Localization for Distributed Ad-Hoc Sensor Network (3 minutes introduction)
- Cancellation of Local Competing Speaker with Near-field Localization for Distributed Ad-Hoc Sensor Network (longer introduction)
- A Deep Learning Method to Multi-Channel Active Noise Control (3 minutes introduction)
- Clarity-2021 challenges: Machine learning challenges for advancing hearing aid processing (3 minutes introduction)
- Clarity-2021 challenges: Machine learning challenges for advancing hearing aid processing (longer introduction)
- Explaining deep learning models for speech enhancement (3 minutes introduction)
- Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones (3 minutes introduction)
- Multimodal systems (10)
- Direct multimodal few-shot learning of speech and images (3 minutes introduction)
- Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval (3 minutes introduction)
- Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition (3 minutes introduction)
- Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition (longer introduction)
- Attention-Based Keyword Localisation in Speech using Visual Grounding (3 minutes introduction)
- Automatic Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries (3 minutes introduction)
- Automatic Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries (longer introduction)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision (3 minutes introduction)
- End-to-end audio-visual speech recognition for overlapping speech (3 minutes introduction)
- Audio-Visual Multi-Talker Speech Recognition in A Cocktail Party (3 minutes introduction)
- Neural Network Training Methods and Architectures for ASR (4)
- Self-paced ensemble learning for speech and audio classification (Oral presentation)
- Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition (Oral presentation)
- Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning (Oral presentation)
- Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models (Oral presentation)
- Neural network training methods for ASR (9)
- Towards Lifelong Learning of End-to-end ASR (3 minutes introduction)
- Towards Lifelong Learning of End-to-end ASR (longer introduction)
- Regularizing Word Segmentation by Creating Misspellings (3 minutes introduction)
- Multitask Training with Text Data for End-to-End Speech Recognition (3 minutes introduction)
- Emitting Word Timings with HMM-free End-to-End System in Automatic Speech Recognition (3 minutes introduction)
- Leveraging non-target language resources to improve ASR performance in a target language (3 minutes introduction)
- Leveraging non-target language resources to improve ASR performance in a target language (longer introduction)
- 4-bit Quantization of LSTM-based Speech Recognition Models (3 minutes introduction)
- 4-bit Quantization of LSTM-based Speech Recognition Models (longer introduction)
- Non-Autoregressive Sequential Modeling for Speech Processing (7)
- Pushing the Limits of Non-Autoregressive Speech Recognition (Oral presentation)
- Layer Pruning on Demand with Intermediate CTC (Oral presentation)
- Real-time End-to-End Monaural Multi-speaker Speech Recognition (Oral presentation)
- TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis (Oral presentation)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis (Oral presentation)
- Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition (Oral presentation)
- VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis (Oral presentation)
- Non-native speech (5)
- Cross-linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared (3 minutes introduction)
- Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention (longer introduction)
- Acquisition of prosodic focus marking by three- to six-year-old children learning Mandarin Chinese (3 minutes introduction)
- A neural network-based noise compensation method for pronunciation assessment (3 minutes introduction)
- A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives (3 minutes introduction)
- Novel neural network architectures for ASR (8)
- Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency (3 minutes introduction)
- Librispeech Transducer Model with Internal Language Model Prior Correction (3 minutes introduction)
- A Deliberation-based Joint Acoustic and Text Decoder (3 minutes introduction)
- On the limit of English conversational speech recognition (3 minutes introduction)
- SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts (3 minutes introduction)
- Online Compressive Transformer for End-to-End Speech Recognition (3 minutes introduction)
- End to end transformer-based contextual speech recognition based on pointer network (3 minutes introduction)
- A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition (3 minutes introduction)
- OpenASR20 and Low Resource ASR Development (3)
- Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges (Oral presentation)
- The TNT Team System Descriptions of Cantonese and Mongolian for IARPA OpenASR20 (Oral presentation)
- Combining Hybrid and End-to-end Approaches for the OpenASR20 Challenge (Oral presentation)
- Oriental Language Recognition (3)
- Phonation and voicing (4)
- Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing (Oral presentation)
- Glottal Stops in Upper Sorbian: a Data-Driven Approach (Oral presentation)
- Glottal Sounds in Korebaju (Oral presentation)
- Automatic classification of phonation types in spontaneous speech: towards a new workflow for the characterization of speakers' voice quality (Oral presentation)
- Phonetics I (1)
- Phonetics II (11)
- Leveraging Real-time MRI for Illuminating Linguistic Velum Action (3 minutes introduction)
- Segmental Alignment of English Syllables with Singleton and Cluster Onsets (3 minutes introduction)
- Exploration of Welsh English Pre-aspiration: How Wide-Spread is it? (3 minutes introduction)
- Revisiting recall effects of filler particles in German and English (3 minutes introduction)
- How reliable are phonetic data collected remotely? Comparison of recording devices and environments on acoustic measurements (3 minutes introduction)
- How reliable are phonetic data collected remotely? Comparison of recording devices and environments on acoustic measurements (longer introduction)
- Quantifying vocal tract shape variation and its acoustic impact: a geometric morphometric approach (3 minutes introduction)
- Speakers coarticulate less when facing real and imagined communicative difficulties: An analysis of read and spontaneous speech from the LUCID corpus (3 minutes introduction)
- Developmental changes of vowel acoustics in adolescents (3 minutes introduction)
- A New Vowel Normalization for Sociophonetics (3 minutes introduction)
- The Pacific Expansion: Optimizing phonetic transcription of archival corpora (3 minutes introduction)
- Privacy-preserving Machine Learning for Audio & Speech Processing (9)
- Privacy-preserving voice anti-spoofing using secure multi-party computation (3 minutes introduction)
- Configurable Privacy-Preserving Automatic Speech Recognition (3 minutes introduction)
- Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation (3 minutes introduction)
- Communication-Efficient Agnostic Federated Averaging (3 minutes introduction)
- Communication-Efficient Agnostic Federated Averaging (longer introduction)
- Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification (3 minutes introduction)
- PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification (3 minutes introduction)
- PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification (longer introduction)
- SynthASR: Unlocking Synthetic Data for Speech Recognition (3 minutes introduction)
- Prosodic features and structure (8)
- An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus (3 minutes introduction)
- An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus (longer introduction)
- ProsoBeast Prosody Annotation Tool (3 minutes introduction)
- Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts (3 minutes introduction)
- In-group advantage in the perception of emotions: Evidence from three varieties of German (3 minutes introduction)
- The LF Model in the Frequency Domain for Glottal Airflow Modelling without Aliasing Distortion (3 minutes introduction)
- Parsing speech for grouping and prominence and the typology of rhythm (3 minutes introduction)
- Leveraging the uniformity framework to examine crosslinguistic similarity for long-lag stops in spontaneous Cantonese-English bilingual speech (3 minutes introduction)
- Resource-constrained ASR (8)
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices (3 minutes introduction)
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices (longer introduction)
- Weakly Supervised Construction of ASR Systems from Massive Video Data (3 minutes introduction)
- Weakly Supervised Construction of ASR Systems from Massive Video Data (longer introduction)
- Extremely Low Footprint End-to-End ASR System for Smart Device (3 minutes introduction)
- Tied & Reduced RNN-T Decoder (3 minutes introduction)
- PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation (3 minutes introduction)
- Collaborative Training of Acoustic Encoders for Speech Recognition (3 minutes introduction)
- Robust and Far-field ASR (3)
- Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition (Oral presentation)
- ETLT 2021: SHARED TASK ON AUTOMATIC SPEECH RECOGNITION FOR NON-NATIVE CHILDREN’S SPEECH (Oral presentation)
- Learning to Rank Microphones for Distant Speech Recognition (Oral presentation)
- Robust Speaker Recognition (8)
- Unsupervised Bayesian Adaptation of PLDA for Speaker Verification (3 minutes introduction)
- Variational Information Bottleneck based Regularization for Speaker Recognition (3 minutes introduction)
- SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System (3 minutes introduction)
- ANTVOICE NEURAL SPEAKER EMBEDDING SYSTEM FOR FFSVC 2020 (3 minutes introduction)
- ANTVOICE NEURAL SPEAKER EMBEDDING SYSTEM FOR FFSVC 2020 (longer introduction)
- Deep Feature CycleGANs: Speaker Identity Preserving Non-parallel Microphone-Telephone Domain Adaptation for Speaker Verification (3 minutes introduction)
- Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network (3 minutes introduction)
- Speaker anonymisation using the McAdams coefficient (3 minutes introduction)
- SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification (2)
- Search/decoding techniques and confidence measures for ASR (6)
- LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring (3 minutes introduction)
- Deep neural network calibration for E2E speech recognition system (3 minutes introduction)
- Deep neural network calibration for E2E speech recognition system (longer introduction)
- Residual Energy-Based Models for End-to-End Speech Recognition (3 minutes introduction)
- Insights on Neural Representations for End-to-End Speech Recognition (3 minutes introduction)
- Sequence-level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models (3 minutes introduction)
- Self-supervision and semi-supervision for neural ASR training (5)
- On the Learning Dynamics of Semi-Supervised Training for ASR
(3 minutes introduction) - Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
(3 minutes introduction) - Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
(3 minutes introduction) - Phonetically Motivated Self-Supervised Speech Representation Learning
(3 minutes introduction) - Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS
(3 minutes introduction)
- Show and Tell 1 (5)
- Application for detecting depression, Parkinson’s disease and dysphonic speech
(3 minutes introduction) - Beey: More than a Speech to Text Editor
(3 minutes introduction) - Downsizing of vocal-tract models to line up variations and reduce manufacturing costs
(3 minutes introduction) - ROXANNE Research Platform: Automate criminal investigations and leverage multimodal fusion
(longer introduction) - Advanced semi-blind speaker extraction and tracking implemented in experimental device with revolving dense microphone array
(3 minutes introduction)
- Show and Tell 2 (5)
- Multi-speaker Emotional Text-to-speech Synthesizer
(3 minutes introduction) - Autonomous Robot for Measuring Room Impulse Responses
(3 minutes introduction) - ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction
(3 minutes introduction) - ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction
(longer introduction) - Audio Segmentation based Conversational Silence Detection for Contact Call Centers
(3 minutes introduction)
- Show and Tell 3 (7)
- MoM: Minutes of Meeting Bot
(3 minutes introduction) - Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording
(3 minutes introduction) - The INGENIOUS Multilingual Operations App
(3 minutes introduction) - Digital Einstein Experience: Fast Text-to-Speech for Conversational AI
(3 minutes introduction) - Live Subtitling for BigBlueButton with Open-Source Software
(3 minutes introduction) - Expressive Latvian Speech Synthesis for Dialog Systems
(3 minutes introduction) - ViSTAFAE: A Visual Speech Training Aid with Feedback of Articulatory Efforts
(3 minutes introduction)
- Show and Tell 4 (7)
- Interactive and real-time acoustic measurement tools for speech data acquisition and presentation: Application of an extended member of time stretched pulses
(3 minutes introduction) - Save your Voice: Voice Banking and TTS for Anyone
(3 minutes introduction) - NeMo (Inverse) Text Normalization: From Development To Production
(3 minutes introduction) - NeMo (Inverse) Text Normalization: From Development To Production
(longer introduction) - Automatic Radiology Report Editing through Voice
(3 minutes introduction) - WittyKiddy: Multilingual Spoken Language Learning for Kids
(3 minutes introduction) - Web Interface for estimating articulatory movements in speech production from acoustics and text
(3 minutes introduction)
- Single-channel speech enhancement (7)
- Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification
(3 minutes introduction) - Speech Denoising with Auditory Models
(3 minutes introduction) - Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement
(3 minutes introduction) - A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement
(3 minutes introduction) - WHISPER SPEECH ENHANCEMENT USING JOINT VARIATIONAL AUTOENCODER FOR IMPROVED SPEECH RECOGNITION
(3 minutes introduction) - Speech Denoising without Clean Training Data: a Noise2Noise Approach
(3 minutes introduction) - Speech Enhancement with Topology-enhanced Generative Adversarial Networks (GANs)
(3 minutes introduction)
- Source Separation I (2)
- Source Separation II (10)
- Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers
(3 minutes introduction) - TEACHER-STUDENT MIXIT FOR UNSUPERVISED AND SEMI-SUPERVISED SPEECH SEPARATION
(3 minutes introduction) - Few shot-learning of new sound classes for target sound extraction
(3 minutes introduction) - AvaTr: One-Shot Speaker Extraction with Transformers
(3 minutes introduction) - Vocal Harmony Separation using Time-domain Neural Networks
(3 minutes introduction) - Vocal Harmony Separation using Time-domain Neural Networks
(longer introduction) - Speaker Verification-Based Evaluation of Single-Channel Speech Separation
(3 minutes introduction) - IMPROVED SPEECH SEPARATION WITH TIME-AND-FREQUENCY CROSS-DOMAIN FEATURE SELECTION
(3 minutes introduction) - Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
(3 minutes introduction) - Deep audio-visual speech separation based on facial motion
(3 minutes introduction)
- Source Separation III (3)
- Combating Reverberation in NTF-based Speech Separation using a Sub-Source Weighted Multichannel Wiener Filter and Linear Prediction
(Oral presentation) - A Hands-on Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation
(Oral presentation) - GlobalPhone Mix-to-Separate out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation
(Oral presentation)
- Source separation, dereverberation and echo cancellation (3)
- Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
(3 minutes introduction) - Residual Echo and Noise Cancellation with Feature Attention Module and Multi-domain Loss Function
(3 minutes introduction) - A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
(3 minutes introduction)
- Speaker Diarization I (3)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network
(3 minutes introduction) - Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference
(3 minutes introduction) - End-to-end speaker segmentation for overlap-aware resegmentation
(3 minutes introduction)
- Speaker Diarization II (9)
- LEAP Submission for the Third DIHARD Diarization Challenge
(3 minutes introduction) - LEAP Submission for the Third DIHARD Diarization Challenge
(longer introduction) - Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings
(3 minutes introduction) - Target-Speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker
(3 minutes introduction) - ECAPA-TDNN Embeddings for Speaker Diarization
(3 minutes introduction) - Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech
(3 minutes introduction) - Anonymous speaker clusters: Making distinctions between anonymised speech recordings with clustering interface
(3 minutes introduction) - Anonymous speaker clusters: Making distinctions between anonymised speech recordings with clustering interface
(longer introduction) - Speaker Diarization using Two-pass GMM PLDA Clustering of DNN Embeddings
(3 minutes introduction)
- Speaker Recognition: Applications (9)
- Graph-based Label Propagation for Semi-Supervised Speaker Identification
(3 minutes introduction) - Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition
(3 minutes introduction) - Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition
(3 minutes introduction) - Multi-Channel Speaker Verification for Single and Multi-talker Speech
(3 minutes introduction) - Chronological Self-Training for Real-Time Speaker Diarization
(3 minutes introduction) - Presentation matters: Evaluating speaker identification tasks
(3 minutes introduction) - Presentation matters: Evaluating speaker identification tasks
(longer introduction) - Automatic Error Correction for Speaker Embedding Learning with Noisy Labels
(3 minutes introduction) - An Integrated Framework for Two-pass Personalized Voice Trigger
(3 minutes introduction)
- Speaker, Language, and Privacy (3)
- Using Games to Augment Corpora for Language Recognition and Confusability
(Oral presentation) - Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition
(Oral presentation) - Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
(Oral presentation)
- Speech and audio analysis (4)
- Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations
(Oral presentation) - ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification
(Oral presentation) - Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-specific Scaling
(Oral presentation) - Audio Retrieval with Natural Language Queries
(Oral presentation)
- Speech coding and privacy (9)
- NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
(3 minutes introduction) - NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
(longer introduction) - WSRGlow: A Glow-based Waveform Generative Model for Audio Super-Resolution
(3 minutes introduction) - Multi-channel Opus compression for far-field automatic speech recognition with a fixed bitrate budget
(longer introduction) - Effects of Prosodic Variations on Accidental Triggers of a commercial Voice Assistant
(3 minutes introduction) - Improving the expressiveness of neural vocoding with non-affine Normalizing Flows
(3 minutes introduction) - A Two-stage Approach to Speech Bandwidth Extension
(3 minutes introduction) - Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder
(3 minutes introduction) - Protecting gender and identity with disentangled speech representations
(3 minutes introduction)
- Speech enhancement and coding (2)
- Speech enhancement and intelligibility (12)
- Funnel Deep Complex U-net for Phase-Aware Speech Enhancement
(3 minutes introduction) - Perceptual Contributions of Vowels and Consonant-vowel Transitions in Understanding Time-compressed Mandarin Sentences
(3 minutes introduction) - Speech Enhancement with Weakly Labelled Data from AudioSet
(3 minutes introduction) - Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement
(3 minutes introduction) - MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement
(3 minutes introduction) - A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction
(3 minutes introduction) - A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction
(longer introduction) - Self-Supervised Learning Based Phone-Fortified Speech Enhancement
(3 minutes introduction) - Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement
(3 minutes introduction) - Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement
(longer introduction) - Restoring degraded speech via a modified diffusion model
(3 minutes introduction) - Restoring degraded speech via a modified diffusion model
(longer introduction)
- Speech Localization, Enhancement, and Quality Assessment (4)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization
(3 minutes introduction) - Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization
(3 minutes introduction) - Far-field Speaker Localization and Adaptive GLMB Tracking
(3 minutes introduction) - Far-field Speaker Localization and Adaptive GLMB Tracking
(longer introduction)
- Speech perception I (2)
- Speech perception II (9)
- Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors
(3 minutes introduction) - Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors
(longer introduction) - VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification
(3 minutes introduction) - Relationships between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication
(3 minutes introduction) - Human spoofing detection performance on degraded speech
(3 minutes introduction) - Human spoofing detection performance on degraded speech
(longer introduction) - Reliable estimates of interpretable cue effects with Active Learning in psycholinguistic research
(3 minutes introduction) - Towards the explainability of Multimodal Speech Emotion Recognition
(3 minutes introduction) - Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance
(3 minutes introduction)
- Speech production I (4)
- Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated
(Oral presentation) - Comparison of the finite element method, the multimodal method and the transmission-line model for the computation of vocal tract transfer functions
(Oral presentation) - Importance of Parasagittal Sensor Information in Tongue Motion Capture through a Diphonic Analysis
(Oral presentation) - Changes in glottal source parameter values with light to moderate physical load
(Oral presentation)
- Speech production II (6)
- A Simplified Model for the Vocal Tract of [s] with Inclined Incisors
(3 minutes introduction) - Vocal-tract models to visualize the airstream of human breath and droplets while producing speech
(3 minutes introduction) - Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data
(3 minutes introduction) - Inhalations in speech: acoustic and physiological characteristics
(3 minutes introduction) - Take a breath: Respiratory sounds improve recollection in synthetic speech
(3 minutes introduction) - Mixture of orthogonal sequences made from extended time-stretched pulses enables measurement of involuntary voice fundamental frequency response to pitch perturbation
(3 minutes introduction)
- Speech Recognition of Atypical Speech (11)
- Automatic Speech Recognition of Disordered Speech: Personalized models outperforming human listeners on short phrases
(Oral presentation) - Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale
(Oral presentation) - Handling acoustic variation in dysarthric speech recognition systems through model combination
(Oral presentation) - Adversarial Data Augmentation for Disordered Speech Recognition
(Oral presentation) - Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition
(Oral presentation) - Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition
(Oral presentation) - A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks
(Oral presentation) - Conformer Parrotron: a Faster and Stronger End-to-end Speech Conversion and Recognition Model for Atypical Speech
(Oral presentation) - Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia
(Oral presentation) - Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases
(Oral presentation) - Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
(Oral presentation)
- Speech signal analysis and representation I (12)
- Estimating articulatory movements in speech production with transformer networks
(3 minutes introduction) - Estimating articulatory movements in speech production with transformer networks
(longer introduction) - Speech Decomposition based on a Hybrid Speech Model and Optimal Segmentation
(3 minutes introduction) - Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation
(3 minutes introduction) - Noise robust pitch stylization using minimum mean absolute error criterion
(3 minutes introduction) - An Attribute-Aligned Strategy for Learning Speech Representation
(3 minutes introduction) - Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation
(3 minutes introduction) - Unsupervised Training of a DNN-based Formant Tracker
(3 minutes introduction) - Unsupervised Training of a DNN-based Formant Tracker
(longer introduction) - Synchronising speech segments with musical beats in Mandarin and English singing
(3 minutes introduction) - Pitch contour separation from overlapping speech
(3 minutes introduction) - Pitch contour separation from overlapping speech
(longer introduction)
- Speech signal analysis and representation II (4)
- A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling
(Oral presentation) - Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods
(Oral presentation) - Identification of F1 and F2 in speech using modified zero frequency filtering
(Oral presentation) - Phoneme-to-audio alignment with recurrent neural networks for speaking and singing voice
(Oral presentation)
- Speech Synthesis: Linguistic processing, paradigms and other topics (8)
- UNSUPERVISED LEARNING OF DISENTANGLED SPEECH CONTENT AND STYLE REPRESENTATION
(3 minutes introduction) - Label Embedding for Chinese Grapheme-to-Phoneme Conversion
(3 minutes introduction) - Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-pooling Strategy and Window-based Attention
(3 minutes introduction) - Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-pooling Strategy and Window-based Attention
(longer introduction) - Polyphone Disambiguition in Mandarin Chinese with Semi-Supervised Learning
(3 minutes introduction) - A Neural-Network-Based Approach to Identifying Speakers in Novels
(3 minutes introduction) - Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder
(3 minutes introduction) - Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder
(longer introduction)
- Speech Synthesis: Neural Waveform Generation (6)
- Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis
(3 minutes introduction) - Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator
(3 minutes introduction) - Fre-GAN: Adversarial Frequency-consistent Audio Synthesis
(3 minutes introduction) - GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
(3 minutes introduction) - Continuous Wavelet Vocoder-based Decomposition of Parametric Speech Waveform Synthesis
(3 minutes introduction) - High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model
(3 minutes introduction)
- Speech Synthesis: Other topics I (4)
- Conversion of airborne to bone-conducted speech with deep neural networks
(Oral presentation) - T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion
(Oral presentation) - Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values
(Oral presentation) - A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages
(Oral presentation)
- Speech Synthesis: Prosody Modeling I (6)
- Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
(3 minutes introduction) - Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
(3 minutes introduction) - Fine-grained Prosody Modeling in Neural Speech Synthesis using ToBI Representation
(3 minutes introduction) - Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing
(3 minutes introduction) - Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing
(longer introduction) - Applying the Information Bottleneck Principle to Prosodic Representation Learning
(3 minutes introduction)
- Speech Synthesis: Prosody Modeling II (3)
- Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis (8)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features
(3 minutes introduction) - Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information
(3 minutes introduction) - EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder
(3 minutes introduction) - Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
(3 minutes introduction) - Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
(longer introduction) - Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation
(3 minutes introduction) - Speech2Video: Cross-Modal Distillation for Speech to Video Generation
(3 minutes introduction) - Speech2Video: Cross-Modal Distillation for Speech to Video Generation
(longer introduction)
- Speech Synthesis: Speaking Style and Emotion (7)
- STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech
(3 minutes introduction) - Controllable Context-Aware Conversational Speech Synthesis
(3 minutes introduction) - Expressive Text-to-Speech using Style Tag
(3 minutes introduction) - SponSpeech: Adaptive Text to Speech for Spontaneous Style
(3 minutes introduction) - Towards Multi-Scale Style Control for Expressive Speech Synthesis
(3 minutes introduction) - Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
(3 minutes introduction) - Synthesis of expressive speaking styles with limited training data in a multi-speaker, prosody-controllable sequence-to-sequence architecture
(3 minutes introduction)
- Speech Synthesis: tools, data, evaluation (8)
- Spectral and Latent Speech Representation Distortion for TTS Evaluation
(3 minutes introduction) - RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
(3 minutes introduction) - Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis
(3 minutes introduction) - Perception of social speaker characteristics in synthetic speech
(3 minutes introduction) - Hi-Fi Multi-Speaker English TTS Dataset
(3 minutes introduction) - Utilizing Self-supervised Representations for MOS Prediction
(3 minutes introduction) - Utilizing Self-supervised Representations for MOS Prediction
(longer introduction) - KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
(3 minutes introduction)
- Speech Synthesis: Toward End-to-End Synthesis I (7)
- Federated Learning with Dynamic Transformer on Text to Speech
(3 minutes introduction) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
(3 minutes introduction) - Diff-TTS: A Denoising Diffusion Model for Text-to-Speech
(3 minutes introduction) - A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning based on Rényi Divergence Minimization
(3 minutes introduction) - Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech
(3 minutes introduction) - Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet
(3 minutes introduction) - SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
(3 minutes introduction)
- Speech Synthesis: Toward End-to-End Synthesis II (8)
- TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions
(3 minutes introduction) - Phonetic and Prosodic Information Estimation Using Neural Machine Translation for Genuine Japanese End-to-End Text-to-Speech
(3 minutes introduction) - Phonetic and Prosodic Information Estimation Using Neural Machine Translation for Genuine Japanese End-to-End Text-to-Speech
(longer introduction) - Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
(3 minutes introduction) - Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
(longer introduction) - PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
(3 minutes introduction) - Speed up training with variable length inputs by efficient batching strategies
(3 minutes introduction) - Speed up training with variable length inputs by efficient batching strategies
(longer introduction)
- Speech type classification and diagnosis (8)
- A Multi-Branch Deep Learning Network for Automated Detection of COVID-19
(3 minutes introduction) - RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform
(3 minutes introduction) - Fake Audio Detection in Resource-constrained Settings using Microfeatures
(3 minutes introduction) - Coughing-based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks
(3 minutes introduction) - Coughing-based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks
(longer introduction) - Knowledge Distillation for Singing Voice Detection
(3 minutes introduction) - Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification
(3 minutes introduction) - Automatic Detection of Shouted Speech Segments in Indian News Debates
(3 minutes introduction)
- Spoken Dialogue Systems I (2)
- Spoken Dialogue Systems II (5)
- Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering
(3 minutes introduction) - Injecting Descriptive Meta-information into Pre-trained Language Models with Hypernetworks
(3 minutes introduction) - Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy
(3 minutes introduction) - Timing Generating Networks: Neural Network based Precise Turn-taking Timing Prediction in Multiparty Conversation
(3 minutes introduction) - Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript
(3 minutes introduction)
- Spoken Language Processing I (7)
- SmallER: Scaling Neural Entity Resolution for Edge Devices
(3 minutes introduction) - Disfluency Detection with Unlabeled Data and Small BERT Models
(3 minutes introduction) - Discriminative Self-training for Punctuation Prediction
(3 minutes introduction) - A noise robust method for word-level pronunciation assessment
(3 minutes introduction) - Multimodal Speech Summarization through Semantic Concept Learning
(3 minutes introduction) - Enhancing Semantic Understanding with Self-supervised Methods for Abstractive Dialogue Summarization
(3 minutes introduction) - Enhancing Semantic Understanding with Self-supervised Methods for Abstractive Dialogue Summarization
(longer introduction)
- Spoken Language Processing II (2)
- Spoken Language Understanding I (8)
- Sequential End-to-End Intent and Slot Label Classification and Localization
(3 minutes introduction) - DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants
(3 minutes introduction) - DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants
(longer introduction) - A Context-Aware Hierarchical BERT Fusion Network for Multi-turn Dialog Act Detection
(3 minutes introduction) - Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
(3 minutes introduction) - Integrating Dialog History into End-to-End Spoken Language Understanding Systems
(3 minutes introduction) - Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking
(3 minutes introduction) - Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
(3 minutes introduction)
- Spoken Language Understanding II (3)
- Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models
(3 minutes introduction) - END-to-END Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining
(3 minutes introduction) - Factorization-Aware Training of Transformers for Natural Language Understanding On the Edge
(3 minutes introduction)
- Spoken machine translation (12)
- Subtitle Translation as Markup Translation (3 minutes introduction)
- Large-Scale Self- and Semi-Supervised Learning for Speech Translation (3 minutes introduction)
- CoVoST 2 and Massively Multilingual Speech Translation (3 minutes introduction)
- AlloST: Low-resource Speech Translation without Source Transcription (3 minutes introduction)
- Weakly-supervised Speech-to-text Mapping with Visually Connected Non-parallel Speech-text Data using Cyclic Partially-aligned Transformer (3 minutes introduction)
- Weakly-supervised Speech-to-text Mapping with Visually Connected Non-parallel Speech-text Data using Cyclic Partially-aligned Transformer (longer introduction)
- End-to-end Speech Translation via Cross-modal Progressive Training (3 minutes introduction)
- End-to-end Speech Translation via Cross-modal Progressive Training (longer introduction)
- Towards simultaneous machine interpretation (3 minutes introduction)
- Lexical Modeling of ASR Errors for Robust Speech Translation (3 minutes introduction)
- Lexical Modeling of ASR Errors for Robust Speech Translation (longer introduction)
- Effects of Feature Scaling and Fusion on Sign Language Translation (3 minutes introduction)
- Spoken Term Detection & Voice Search (9)
- Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study (3 minutes introduction)
- Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding (3 minutes introduction)
- Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding (longer introduction)
- Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation (3 minutes introduction)
- Few-Shot Keyword Spotting in Any Language (3 minutes introduction)
- Text Anchor Based Metric Learning for Small-footprint Keyword Spotting (3 minutes introduction)
- A meta-learning approach for user-defined spoken term classification with varying classes and examples (3 minutes introduction)
- Auxiliary Sequence Labeling Tasks for Disfluency Detection (3 minutes introduction)
- Keyword Transformer: A Self-Attention Model for Keyword Spotting (3 minutes introduction)
- Streaming for ASR/RNN Transducers (7)
- Super-Human Performance in Online Low-latency Recognition of Conversational Speech (3 minutes introduction)
- Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems (3 minutes introduction)
- An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling (3 minutes introduction)
- Reducing Exposure Bias in Training Recurrent Neural Network Transducers (3 minutes introduction)
- Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models (3 minutes introduction)
- Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models (longer introduction)
- Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition (3 minutes introduction)
- Target speaker detection, localization and separation (5)
- Auxiliary loss function for target speech extraction and recognition with weak supervision based on speaker characteristics (Oral presentation)
- Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers (Oral presentation)
- Using X-vectors for Speech Activity Detection in Broadcast Streams (Oral presentation)
- Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features (Oral presentation)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network (Oral presentation)
- The ADReSSo Challenge: Detecting cognitive decline using speech only (7)
- Influence of the Interviewer on the Automatic Assessment of Alzheimer's Disease in the Context of the ADReSSo Challenge (3 minutes introduction)
- WavBERT: Exploiting Semantic and Non-semantic Speech using Wav2vec and BERT for Dementia Detection (3 minutes introduction)
- Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models (3 minutes introduction)
- Exploring using the outputs of different automatic speech recognition paradigms for acoustic- and BERT-based Alzheimer's Dementia detection through spontaneous speech (3 minutes introduction)
- Automatic Detection of Alzheimer's Disease Using Spontaneous Speech Only (3 minutes introduction)
- Modular Multi-Modal Attention Network for Alzheimer's Disease Detection Using Patient Audio and Language Data (3 minutes introduction)
- Modular Multi-Modal Attention Network for Alzheimer's Disease Detection Using Patient Audio and Language Data (longer introduction)
- The First DiCOVA Challenge: Diagnosis of COVid-19 using Acoustics (6)
- Diagnosis of COVID-19 using Auditory Acoustic Cues (Oral presentation)
- Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation (Oral presentation)
- The DiCOVA 2021 Challenge -- An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio (Oral presentation)
- COVID-19 Detection from Spectral features on the DiCOVA Dataset (Oral presentation)
- Cough-based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information (Oral presentation)
- Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis (Oral presentation)
- The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) - COVID-19 Cough, COVID-19 Speech, Escalation & Primates (8)
- Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19 (Oral presentation)
- Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021 (Oral presentation)
- Vision Transformer for Audio-based Primates Classification and COVID Detection (Oral presentation)
- Deep-learning-based central African primate species classification with MixUp and SpecAugment (Oral presentation)
- Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification (Oral presentation)
- Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild (Oral presentation)
- Identifying Conflict Escalation and Primates by Using Ensemble X-vectors and Fisher Vector Features (Oral presentation)
- Ensemble-within-ensemble classification for escalation prediction from speech (Oral presentation)
- Tools, corpora and resources (11)
- The Multilingual TEDx Corpus for Speech Recognition and Translation (3 minutes introduction)
- GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio (3 minutes introduction)
- AusKidTalk: An Auditory-Visual Corpus of 3- to 12-year-old Australian Children’s Speech (3 minutes introduction)
- Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson (3 minutes introduction)
- Annotation Confidence vs. Training Sample Size: Trade-off Solution for Partially-Continuous Categorical Emotion Recognition (3 minutes introduction)
- Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization (3 minutes introduction)
- Towards Automatic Speech to Sign Language Generation (3 minutes introduction)
- Towards Automatic Speech to Sign Language Generation (longer introduction)
- kosp2e: Korean Speech to English Translation Corpus (3 minutes introduction)
- speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment (3 minutes introduction)
- speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment (longer introduction)
- Topics in ASR: Adaptation, transfer learning, children's speech, and low-resource settings (9)
- Semantic Data Augmentation for End-to-End Mandarin Speech Recognition (3 minutes introduction)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children - INTERSPEECH 2021 Shared Task SPAPL System (3 minutes introduction)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children - INTERSPEECH 2021 Shared Task SPAPL System (longer introduction)
- Speaker normalization using Joint Variational Autoencoder (3 minutes introduction)
- The TAL system for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech (3 minutes introduction)
- The TAL system for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech (longer introduction)
- Zero-shot Cross-Lingual Phonetic Recognition with External Language Embedding (3 minutes introduction)
- Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning (3 minutes introduction)
- Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII's System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge (3 minutes introduction)
- Topics in ASR: Robustness, feature extraction, and far-field ASR (8)
- End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-switching Speech Recognition (3 minutes introduction)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties (3 minutes introduction)
- Speech Acoustic Modelling using Raw Source and Filter Components (3 minutes introduction)
- IR-GAN: Room impulse response generator for far-field speech recognition (3 minutes introduction)
- Multi-Channel Transformer Transducer for Speech Recognition (3 minutes introduction)
- Multi-Channel Transformer Transducer for Speech Recognition (longer introduction)
- Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition (3 minutes introduction)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition (3 minutes introduction)
- Tutorials (8)
- Intonation Transcription and Modelling in Research and Speech Technology Applications
- Neural target speech extraction
- Speech Recognition with Next-Generation Kaldi (K2, Lhotse, Icefall)
- An Introduction to Automatic Differentiation with Weighted Finite-State Automata
- SpeechBrain: Unifying Speech Technologies and Deep Learning With an Open Source Toolkit
- SpeechBrain: Unifying Speech Technologies and Deep Learning With an Open Source Toolkit
- SpeechBrain: Unifying Speech Technologies and Deep Learning With an Open Source Toolkit
- Concept to Code: Semi-Supervised End-To-End Approaches For Speech Recognition
- Voice activity detection (5)
- Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021 (Oral presentation)
- The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge (Oral presentation)
- Speech Activity Detection Based on Multilingual Speech Recognition System (Oral presentation)
- Voice Activity Detection With Teacher-Student Domain Emulation (Oral presentation)
- EML Online Speech Activity Detection for Fearless Steps Challenge Phase-III (Oral presentation)
- Voice activity detection and keyword spotting (10)
- Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams (3 minutes introduction)
- Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection (3 minutes introduction)
- Multi-Channel VAD for Transcription of Group Discussion (3 minutes introduction)
- Multi-Channel VAD for Transcription of Group Discussion (longer introduction)
- Audio-Visual Information Fusion Using Cross-modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments (3 minutes introduction)
- Enrollment-less training for personalized voice activity detection (3 minutes introduction)
- FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications (3 minutes introduction)
- End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention (3 minutes introduction)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation (3 minutes introduction)
- A Lightweight Framework for Online Voice Activity Detection in the Wild (3 minutes introduction)
- Voice and voicing (6)
- System performance as a function of calibration methods, sample size and sampling variability in likelihood ratio-based forensic voice comparison (3 minutes introduction)
- System performance as a function of calibration methods, sample size and sampling variability in likelihood ratio-based forensic voice comparison (longer introduction)
- The Four-way Classification of Stops with Voicing and Aspiration for Non-native Speech Evaluation (3 minutes introduction)
- A comparison of the accuracy of Dissen and Keshets’s (2016) DeepFormants and traditional LPC methods for semi-automatic speaker recognition (3 minutes introduction)
- Sound change in spontaneous bilingual speech: A corpus study on the Cantonese n-l merger in Cantonese-English bilinguals (3 minutes introduction)
- Sound change in spontaneous bilingual speech: A corpus study on the Cantonese n-l merger in Cantonese-English bilinguals (longer introduction)
- Voice Anti-Spoofing and Countermeasure (11)
- A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection (3 minutes introduction)
- An Initial Investigation for Detecting Partially Spoofed Audio (3 minutes introduction)
- Cross-database replay detection in terminal-dependent speaker verification (3 minutes introduction)
- Pairing Weak with Strong: Twin Models for Defending against Adversarial Attack on Speaker Verification (3 minutes introduction)
- Pairing Weak with Strong: Twin Models for Defending against Adversarial Attack on Speaker Verification (longer introduction)
- Voting for the right answer: Adversarial defense for speaker verification (3 minutes introduction)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing (3 minutes introduction)
- Representation Learning to Classify and Detect Adversarial Attacks against Speaker and Speech Recognition Systems (3 minutes introduction)
- An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems (3 minutes introduction)
- An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems (longer introduction)
- Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks (3 minutes introduction)
- Voice Conversion and Adaptation I (7)
- CVC: Contrastive Learning for Non-parallel Voice Conversion (3 minutes introduction)
- A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion (3 minutes introduction)
- Fine-tuning pre-trained voice conversion model for adding new target speakers with limited data (3 minutes introduction)
- Normalization Driven Zero-shot Multi-Speaker Speech Synthesis (3 minutes introduction)
- StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition (3 minutes introduction)
- Two-Pathway Style Embedding for Arbitrary Voice Conversion (3 minutes introduction)
- Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics (3 minutes introduction)
- Voice Conversion and Adaptation II (4)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training (3 minutes introduction)
- Adversarial Voice Conversion against Neural Spoofing Detectors (3 minutes introduction)
- Adversarially Learning Disentangled Speech Representations for Robust Multi-factor Voice Conversion (3 minutes introduction)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder (3 minutes introduction)
- Voice quality characterization for clinical voice assessment: Voice production, acoustics, and auditory perception (4)
- Optimizing an automatic creaky voice detection method for Australian English-speaking females (Oral presentation)
- Investigating voice function characteristics of Greek speakers with hearing loss using automatic glottal source feature extraction (Oral presentation)
- Automated Detection of Voice Disorder in the Saarbruecken Voice Database: Effects of Pathology Subset and Audio Materials (Oral presentation)
- Articulatory Coordination for Speech Motor Tracking in Huntington Disease (Oral presentation)
- Opening (1)
- Closing (3)