0:00:15hello everyone
0:00:17i am then used to but often
0:00:18i am a research scientist at pindrop security
0:00:22based in atlanta usa
0:00:25i'm here to present our paper
0:00:28titled speech bandwidth expansion for speaker recognition on telephony audio
0:00:38this is the overview of my talk
0:00:41i will start by giving a motivation as to why we need bandwidth expansion
0:00:45followed by explaining the problem statement
0:00:48and then i will describe some prior research in this area
0:00:53we will then explain
0:00:55the bandwidth expansion system that we propose in this paper
0:00:58and show some results of bandwidth expansion performance
0:01:03finally
0:01:04i will show you some speaker verification experiments that we performed
0:01:10and the results that we obtained with the bandwidth expansion system
0:01:17in this paper we refer to
0:01:20audio that is sampled at sixteen khz
0:01:24as wideband audio
0:01:26typically the audio that is sampled at sixteen khz and has frequency content
0:01:33between zero and eight khz
0:01:35is called wideband audio in this paper
0:01:39and traditional telephony audio
0:01:41which is band limited to three hundred to three thousand four hundred
0:01:46hz
0:01:47and sampled at eight khz is referred to as
0:01:51narrowband audio in this paper
0:01:55speaker verification systems
0:01:57typically work well on wideband audio
0:02:02this is because
0:02:04the higher frequency content between four and eight khz
0:02:10in wideband audio
0:02:12is helpful in speaker discrimination
0:02:17the wideband trained systems
0:02:20the wideband audio trained systems perform poorly
0:02:24on narrowband audio due to the mismatch in the training and testing conditions
0:02:29so the lack of higher frequency information in the narrowband speech leads to
0:02:35the degraded performance
0:02:37so the question that we raise in this paper is can we estimate the
0:02:42higher frequency content that is missing in narrowband audio
0:02:47in such a way that it improves the performance
0:02:50on wideband trained
0:02:53speaker verification systems
0:02:55so this is the problem statement
0:02:57the narrowband audio
0:03:00which is band limited to four khz
0:03:04is shown on the left
0:03:06in this figure we have shown the spectrogram
0:03:10showing the frequencies between zero and
0:03:13eight khz
0:03:15you see that there is no information or frequency content in between four to eight khz
0:03:20and the objective of the bandwidth expansion system is to use the lower frequency content
0:03:25of the narrowband audio to estimate the missing higher frequency content that is typically present
0:03:32in the wideband audio
0:03:36the objective is to estimate the higher frequency content in such a way
0:03:41that it improves the performance of speaker verification systems
0:03:49there has been a lot of research conducted in bandwidth expansion
0:03:54the earliest approaches to bandwidth expansion divide the problem into two parts
0:03:59estimating the envelope of the spectrum and the excitation signal of the
0:04:06speech
0:04:07the envelope estimation is typically
0:04:10done using spline fitting cubic spline fitting or gaussian mixture model based approaches
0:04:17and
0:04:18spectral folding is used for
0:04:21extending the excitation signal
0:04:25so these are the earliest approaches in bandwidth expansion
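A minimal sketch of spectral folding, assuming it is realised by inserting a zero after every narrowband sample. The function names are illustrative and do not come from the paper. Zero insertion doubles the sampling rate and repeats the original spectrum, so the new upper band becomes a mirror image of the lower band.

```python
import cmath


def spectral_fold(narrowband):
    """Double the sampling rate by inserting a zero after every sample."""
    wideband = []
    for sample in narrowband:
        wideband.append(sample)
        wideband.append(0.0)
    return wideband


def dft_magnitudes(signal):
    """Magnitude of a plain O(n^2) DFT, enough for a small demonstration."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


# Hypothetical four-sample excitation, used only to show the folding effect.
excitation = [1.0, 0.5, -0.25, 0.75]
folded = spectral_fold(excitation)
narrow_mags = dft_magnitudes(excitation)
folded_mags = dft_magnitudes(folded)
# Bin k of the folded signal carries the magnitude of bin (k mod 4) of the input.
```

The folded spectrum has no new information in it, which is why these early systems pair it with a separately estimated spectral envelope.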
0:04:30more recent approaches use neural network based bandwidth expansion and
0:04:35these kind of deep neural network based systems have shown improvement in the performance of
0:04:41asr systems
0:04:42that are trained on wideband speech
0:04:46more recent work in speaker verification related to bandwidth expansion
0:04:51has used
0:04:53deep residual networks
0:04:55and bidirectional lstm network architectures for performing bandwidth expansion
0:05:01this work has also shown significant improvement in the performance of speaker verification systems
0:05:11in this
0:05:12paper we propose a novel bandwidth expansion system
0:05:15that is lightweight compared to all the systems proposed in the literature
0:05:22in this system the bandwidth expansion is performed using a cnn
0:05:27dnn network architecture
0:05:30a feed-forward cnn dnn architecture wherein there is a
0:05:34single convolutional layer
0:05:36which performs one d convolution along the time axis
0:05:41followed by three feed-forward layers
0:05:44containing
0:05:45one thousand twenty four nodes in each layer
0:05:50there are sixty four filters in the convolutional layer
0:05:54and after the convolution operation the feature maps are flattened and fed to the feed-forward
0:06:00layers
0:06:01this is the architecture of the dnn
0:06:04that performs the bandwidth expansion
0:06:06the input
0:06:07to the deep neural network is
0:06:11the narrowband log spectrum
0:06:15so we extract the spectrogram
0:06:20from the eight khz telephony audio
0:06:23and we perform
0:06:25the mean and variance normalization of the spectrum
0:06:30and compute the logarithm
0:06:32of
0:06:34the spectrum and feed it as input to the network
0:06:39the network tries to estimate the complete
0:06:44spectrum of the
0:06:46corresponding wideband signal
0:06:49the input to the network
0:06:51is a fixed window of
0:06:52eleven frames
0:06:54of one twenty eight dimensional narrowband log spectrum
0:06:59the features are computed at twenty millisecond frame size and ten millisecond frame shift
0:07:06the network output is a two fifty seven dimensional wideband log spectrum
0:07:12the network is trained with the mean squared error loss and adam optimiser
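A minimal sketch of how the network input could be assembled, assuming the eleven frames are a symmetric window of five frames on each side of the current frame and that utterance edges are padded by repeating the first or last frame. Neither detail is stated in the talk, and the names below are illustrative.

```python
SAMPLE_RATE_HZ = 8000                       # narrowband telephony audio
FRAME_SIZE = SAMPLE_RATE_HZ * 20 // 1000    # 20 ms frame, in samples
FRAME_SHIFT = SAMPLE_RATE_HZ * 10 // 1000   # 10 ms shift, in samples
INPUT_DIM = 128                             # narrowband log spectrum bins
OUTPUT_DIM = 257                            # wideband log spectrum bins


def stack_context(frames, context=5):
    """Return one window of 2 * context + 1 frames per input frame.

    Edge frames are repeated as padding (an assumption of this sketch).
    """
    last = len(frames) - 1
    windows = []
    for centre in range(len(frames)):
        window = [
            frames[min(max(index, 0), last)]
            for index in range(centre - context, centre + context + 1)
        ]
        windows.append(window)
    return windows


# Hypothetical utterance of 20 frames; frame t is filled with the value t.
frames = [[float(t)] * INPUT_DIM for t in range(20)]
windows = stack_context(frames)
```

Each window is an 11 by 128 block that the convolutional layer would scan along the time axis, and the network emits one 257 dimensional frame per window.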
0:07:19after the network output is obtained
0:07:23the mean and variance computed from the input narrowband spectrum are added back
0:07:30to the wideband spectrum
0:07:33after adding back the mean and variance
0:07:35an inverse
0:07:37low-pass filter or
0:07:40an inverse filtering is applied
0:07:42to bring up the energy content in the higher frequencies
0:07:47this is mainly done
0:07:49in order to compensate for the mean values of the energies
0:07:54in the higher frequency region
0:07:59the output of this system
0:08:01is the wideband log spectrum which is further processed
0:08:05for
0:08:06speaker verification
0:08:08this bandwidth expansion
0:08:10dnn system is trained on the librispeech dataset
0:08:16and the vctk dataset
0:08:20this is the inverse filtering that is used to reverse the low-pass filtering effect
0:08:25the mean and variance of the narrowband log spectrum are added back to the
0:08:29estimated wideband log spectrum which is the output of the dnn
0:08:34the higher frequency energies of the narrowband audio are attenuated by low-pass filtering
0:08:40to reverse this low-pass filtering we add
0:08:44this filter that is we add this
0:08:49inverse filter in the log domain to
0:08:53the estimated wideband spectrum
0:08:55thereby getting back the properly normalized wideband spectrum estimate
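A minimal sketch of the post-processing described here. It assumes a single mean and standard deviation per utterance (scalar statistics, so that values computed on the narrowband input can be applied to the wider output) and an inverse filter expressed as one additive log-domain gain per frequency bin. Both are assumptions of this sketch, and the function names are illustrative.

```python
import math


def normalise(log_spec):
    """Mean and variance normalise a list of frames with utterance-level stats."""
    values = [v for frame in log_spec for v in frame]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0   # guard against a constant input
    normalised = [[(v - mean) / std for v in frame] for frame in log_spec]
    return normalised, mean, std


def denormalise(normalised, mean, std):
    """Add the stored mean and variance back to a normalised spectrum."""
    return [[v * std + mean for v in frame] for frame in normalised]


def apply_inverse_filter(log_spec, inverse_filter_log):
    """Add a per-bin log-domain gain, which multiplies the linear spectrum."""
    return [[v + g for v, g in zip(frame, inverse_filter_log)] for frame in log_spec]


# Hypothetical two-frame, two-bin spectrum; the second bin is boosted.
spectrum = [[1.0, 2.0], [3.0, 6.0]]
normalised, mean, std = normalise(spectrum)
restored = denormalise(normalised, mean, std)
boosted = apply_inverse_filter(spectrum, [0.0, 1.5])
```

Working in the log domain turns the inverse filter into a simple addition, which matches the description of adding the filter to the estimated spectrum.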
0:09:03the data for training the bandwidth expansion system is simulated using our telephony
0:09:13codec simulation software
0:09:17the librispeech and vctk datasets
0:09:20are both
0:09:22wideband audio datasets librispeech has a sampling rate of sixteen khz and
0:09:27vctk is originally forty eight khz audio which we bring down by
0:09:33downsampling to sixteen khz
0:09:36both these datasets are clean speech wideband data at sixteen khz
0:09:44in order to simulate telephony artifacts in the wideband speech we perform
0:09:53coding and decoding using three different
0:09:57audio codecs the three audio codecs that we used for simulating the telephony data are
0:10:03the adaptive multi-rate narrowband amr-nb
0:10:07the opus narrowband codec and the silk narrowband codec
0:10:12so these three codecs cover a wide range of telephony applications that are commonly
0:10:17used as you can see from this table amr-nb is typically used
0:10:22in mobile telephony
0:10:23opus is used in voip applications like whatsapp playstation four etcetera and silk is also
0:10:30used in voip applications
0:10:35so given a sixteen khz wideband audio from the librispeech
0:10:42dataset or the vctk dataset we
0:10:48pass the audio through the
0:10:50audio coding application
0:10:53which converts it into a coded signal
0:10:58and then we pass it through the audio codec decoder to get back the telephony
0:11:03distorted narrowband signal so this is how the sixteen khz audio
0:11:11is converted to eight khz telephony distorted audio
0:11:16this is how we simulate the dataset for bandwidth expansion training
0:11:22the bandwidth expansion system is trained on a hundred hours of librispeech and
0:11:28the vctk dataset
0:11:29the performance of the bandwidth expansion is measured by the log spectral distortion measure which
0:11:38is basically the mean squared error between the estimated wideband spectrum and the
0:11:46actual wideband spectrum
0:11:48in the log domain
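A minimal sketch of a log spectral distortion measure. The talk only calls it a mean squared error in the log domain; this sketch uses the common form, which takes the root of the mean squared log difference in each frame and averages over frames. Treat that exact form, the function name and the band arguments as assumptions.

```python
import math


def log_spectral_distortion(estimated, reference, first_bin=0, last_bin=None):
    """Average per-frame RMS difference between two log spectra.

    first_bin and last_bin select a frequency band, so the same function
    can score the lower band and the higher band separately.
    """
    per_frame = []
    for est_frame, ref_frame in zip(estimated, reference):
        est_band = est_frame[first_bin:last_bin]
        ref_band = ref_frame[first_bin:last_bin]
        squared = [(e - r) ** 2 for e, r in zip(est_band, ref_band)]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


# Hypothetical log spectra: the estimate is wrong only in the upper two bins.
reference = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
estimate = [[0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 3.0, 3.0]]
low_band = log_spectral_distortion(estimate, reference, last_bin=2)
high_band = log_spectral_distortion(estimate, reference, first_bin=2)
```

Scoring the two bands separately mirrors how the talk reports one figure for the higher frequencies and another for the zero to four khz band.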
0:11:50so the
0:11:51log spectral distortion
0:11:54results are shown here
0:11:57the simple upsampling of
0:12:00narrowband audio gives a log spectral distortion of one point seven nine
0:12:05three in the higher frequency band by doing simple upsampling we are not adding
0:12:10any new information to the audio all that simple upsampling does is
0:12:17perform
0:12:19interpolation between samples
0:12:21followed by
0:12:24low-pass filtering so interpolation followed by smoothing that is
0:12:31simple upsampling
0:12:32and simple upsampling gives a log spectral distortion of one point seven nine three
0:12:38the bandwidth expanded system output
0:12:43has
0:12:45an lsd value of one point two nine one which is a significant reduction
0:12:48compared to
0:12:51the simple upsampled signal
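A minimal sketch of simple upsampling by a factor of two, written as zero insertion followed by a short smoothing filter. The three-tap kernel used here is equivalent to linear interpolation and is only an illustration; a practical resampler would use a much longer low-pass filter. The function name is hypothetical.

```python
def upsample_by_two(samples):
    """Insert zeros between samples, then smooth with a three-tap kernel."""
    stuffed = []
    for sample in samples:
        stuffed.extend([sample, 0.0])

    kernel = [0.5, 1.0, 0.5]            # triangular kernel, linear interpolation
    padded = [0.0] + stuffed + [0.0]    # zero padding keeps the output length
    return [
        sum(weight * padded[n + offset] for offset, weight in enumerate(kernel))
        for n in range(len(stuffed))
    ]


# Hypothetical ramp signal: the new samples land halfway between neighbours.
upsampled = upsample_by_two([0.0, 2.0, 4.0])
```

Every output value is a weighted sum of existing samples, so nothing above the original four khz band is created, which is the point made in the talk.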
0:12:54note that when we do bandwidth expansion we estimate the complete
0:13:00spectrum of the audio
0:13:02that is the spectrum from zero
0:13:04to eight khz of the
0:13:06wideband audio
0:13:07so we also compute the log spectral distortion of
0:13:11the lower band
0:13:13as a result
0:13:15so in the lower frequency band zero to four khz which is already
0:13:19present
0:13:21in the narrowband spectrum
0:13:23the simple upsampling gives a
0:13:26log spectral distortion of point nine three four
0:13:29whereas the bandwidth expanded system output
0:13:32has
0:13:33a distortion of one point zero two nine
0:13:36so this means that the bandwidth expansion system
0:13:41introduces a minor amount of distortion
0:13:43in the lower frequencies
0:13:45compared to simple upsampling
0:13:49and also remember
0:13:50that
0:13:51the audio codecs that we applied
0:13:54to simulate the telephony audio would have introduced more distortions in the lower frequencies
0:14:00that is why
0:14:02there is
0:14:03a significant amount of log spectral distortion
0:14:06even in the lower frequencies
0:14:08for the simple upsampled signal
0:14:12finally this table shows that the bandwidth expansion system clearly performs
0:14:18spectral estimation of higher frequency content
0:14:22this is an example plot of the output of the bandwidth expansion system on top
0:14:27is the eight khz narrowband telephony audio
0:14:31we have
0:14:32performed simple upsampling of the telephony audio
0:14:35to show the spectrogram
0:14:37you see no frequency content in
0:14:41between four to eight khz
0:14:43the bandwidth expanded output
0:14:46is shown in the
0:14:48middle pane
0:14:49you see that the higher frequencies are estimated
0:14:55by the
0:14:56pretty well by the bandwidth expansion system
0:14:59and the bottom pane
0:15:01shows the sixteen khz reference
0:15:07next we will move to the speaker verification experiments
0:15:13the speaker verification experiments in this paper are performed
0:15:17on a speaker verification system
0:15:19as shown in this figure
0:15:22our speaker verification system is a convolutional neural network based speaker embedding system
0:15:29it consists of five convolutional layers
0:15:32followed by
0:15:33a statistics pooling layer
0:15:36followed by two fully connected layers
0:15:40and the output is a softmax
0:15:43layer predicting speaker labels
0:15:46the input to the speaker embedding system is thirty dimensional
0:15:50mfcc features
0:15:52the training of the speaker recognition system
0:15:57the speaker embedding system is performed in two stages
0:16:00in the first stage we use
0:16:04a softmax output
0:16:05of speaker labels
0:16:07and train the network with
0:16:09a cross entropy loss
0:16:11training
0:16:12in stage two
0:16:13we remove the second fully connected layer and the output layer
0:16:18and attach a different fully connected layer
0:16:23at the output
0:16:25we freeze all the layers
0:16:28before that
0:16:29and train the network with large margin cosine loss
0:16:33this is how the speaker embedding system is trained in two stages
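A minimal sketch of a large margin cosine loss for one training example, assuming the loss used in stage two is the CosFace-style formulation. The scale of 30 and margin of 0.35 are illustrative defaults rather than values from the talk, and the function names are hypothetical.

```python
import math


def _unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def large_margin_cosine_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Cross entropy over scaled cosines, with a margin on the target class."""
    unit_embedding = _unit(embedding)
    logits = []
    for index, weights in enumerate(class_weights):
        cosine = sum(a * b for a, b in zip(unit_embedding, _unit(weights)))
        if index == target:
            cosine -= margin            # make the true class harder to satisfy
        logits.append(scale * cosine)

    # Numerically stable log-sum-exp for the softmax normaliser.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target]


# Hypothetical two-speaker case where the embedding points at the wrong speaker.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_with_margin = large_margin_cosine_loss([1.0, 0.0], weights, target=1)
loss_without_margin = large_margin_cosine_loss([1.0, 0.0], weights, target=1, margin=0.0)
```

The margin raises the loss for the same embedding, which pushes embeddings of one speaker closer together in angle than a plain softmax would.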
0:16:38this system is trained
0:16:41completely on the voxceleb two dataset which consists of
0:16:45sixteen khz clean audio
0:16:48so this is a wideband trained system so we train
0:16:51two different speaker embedding systems
0:16:54using the same architecture
0:16:56one system is trained only on the voxceleb two
0:17:01sixteen khz audio
0:17:03the second system we train with mixed training
0:17:07where we use both wideband audio
0:17:09and the downsampled and bandwidth expanded
0:17:14audio
0:17:15so we pass the voxceleb two dataset
0:17:19through codec
0:17:21distortion
0:17:21followed by
0:17:23the bandwidth expansion
0:17:25using the
0:17:26bandwidth expansion system that we proposed in this paper
0:17:30and then we combine the two datasets the original wideband
0:17:34voxceleb
0:17:35and the bandwidth expanded
0:17:37downsampled voxceleb
0:17:40to train
0:17:41the speaker recognition system
0:17:43the speaker embedding system
0:17:45we train the speaker embedding
0:17:47so note that
0:17:49both of these systems are trained on wideband audio
0:17:53one is trained on the original wideband data
0:17:56the second wb plus bwe is trained on wideband
0:18:00plus
0:18:01bandwidth expanded data
0:18:05these are the results
0:18:06here
0:18:07the speaker verification results are shown in this bar graph
0:18:13this bar graph shows the speaker verification equal error rates that we obtain
0:18:19using the wideband only trained system
0:18:24so the system is trained only on voxceleb sixteen khz wideband audio
0:18:28you see that we perform speaker verification tests
0:18:33on
0:18:34four different test sets
0:18:37voxceleb one
0:18:38e subset
0:18:40the sitw
0:18:42dev dataset
0:18:44the speakers in the wild eval dataset
0:18:46and the nist sre two thousand ten ten second ten second test set
0:18:54so these are the four test sets that we computed the results on
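A minimal sketch of how an equal error rate can be computed from verification scores, assuming higher scores mean the two recordings are more likely the same speaker. It sweeps every observed score as a threshold and reports the point where false acceptance and false rejection rates are closest. The function name and the scores are illustrative, not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by a threshold sweep over all observed scores."""
    best_gap = None
    eer = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Hypothetical trial scores: one genuine trial scores below one impostor trial.
overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
separable = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

Evaluation toolkits usually interpolate between thresholds, so their figures can differ slightly from this sweep on small trial lists.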
0:19:00the green
0:19:02you can see
0:19:04that performing bandwidth expansion
0:19:07reduces the equal error rate
0:19:10compared to the simply upsampled signal
0:19:13so the bar in orange shows the equal error rate obtained using
0:19:19simple upsampling
0:19:22and the other bar shows
0:19:26the results
0:19:27after bandwidth expansion
0:19:29note that
0:19:30the voxceleb one
0:19:31e set
0:19:32the sitw
0:19:34dev set and the sitw eval set
0:19:36are all
0:19:37sixteen khz
0:19:40audio
0:19:41we pass these test sets
0:19:44through the codec distortion
0:19:47simulation that we developed in this paper to simulate telephony audio
0:19:53and then we apply speaker verification on top of this telephony distorted
0:19:58audio the telephony distorted test sets those are
0:20:03the results that you see in the orange bar
0:20:08the other bar is the output of the bandwidth expansion system
0:20:13when you pass the telephony distorted test sets through the bandwidth expansion
0:20:19system
0:20:20also note that the nist sre two thousand ten test set
0:20:25is
0:20:26telephony it consists of only telephony audio
0:20:29so it
0:20:31is inherently a narrowband speech signal
0:20:34recorded using real telephony audio
0:20:37so we don't have
0:20:42we don't have the results for
0:20:47the nist sre dataset
0:20:51for
0:20:52the original wideband audio because there is no wideband audio in this dataset
0:20:57so we have only the
0:21:00results
0:21:01for simple upsampling and bandwidth expansion
0:21:05so you see that even in the unseen case of the nist sre dataset which consists of
0:21:10real telephony audio
0:21:12there is a significant improvement in the equal error rate
0:21:16finally we show the results on the mixed trained system even on the mixed trained
0:21:21system there is a significant improvement in the equal error rates
0:21:25across all the test sets
0:21:28a particular point to note here is that the equal error rates
0:21:34obtained
0:21:34on the original wideband test audio sets
0:21:40are lower
0:21:42than what we obtained with the wideband trained system
0:21:45so initially for example for the voxceleb one e set
0:21:50we obtained four point one two eer
0:21:55but after the mixed training the eer reduces to three point two percent
0:22:01that means that the bandwidth expansion has helped improve the performance even on the original
0:22:05sixteen khz audio
0:22:09so these are the conclusions of our paper the bandwidth expansion system that we proposed
0:22:14in this paper performs significantly better
0:22:19than simple upsampling
0:22:21we obtain a relative equal error rate reduction of four point four percent
0:22:25on the nist sre two thousand ten ten second test set
0:22:29and a nine point ninety percent improvement on the sitw eval
0:22:34set and eleven point one percent improvement on the sitw dev test set
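A minimal sketch of the arithmetic behind a relative equal error rate reduction, assuming it is the drop in error rate expressed as a percentage of the baseline error rate. The function name and the input values are hypothetical and are not results from the paper.

```python
def relative_reduction_percent(baseline_eer, new_eer):
    """Reduction in EER as a percentage of the baseline EER."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical values: a baseline EER of 10 percent that falls to 9 percent.
example = relative_reduction_percent(10.0, 9.0)
```

A relative figure depends on the baseline, so the same absolute drop counts for more on a test set that was already easy.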
0:22:39the bandwidth expansion also improved the accuracy on the original sixteen khz data
0:22:47across all
0:22:48the protocols across all the test sets
0:22:51which means that the bandwidth expansion system is helping as an augmentation mechanism for training
0:22:59speaker verification or for training the speaker recognition system
0:23:03so the proposed bandwidth expansion system is also a significantly lightweight system
0:23:08compared to other systems that were recently proposed
0:23:12and the system can be
0:23:14deployed and used even in a real time scenario
0:23:20these are some references that i have referred to in this presentation
0:23:26please refer to the paper for further details and results
0:23:31and
0:23:33i will be glad to answer your questions
0:23:36thank you for listening to my talk
0:23:39i look forward to your
0:23:40questions and discussions regarding this paper
0:23:43thank you