0:00:15 | hello everyone |
---|
0:00:17 | i am then used to but often |
---|
0:00:18 | i am a research scientist at pindrop security |
---|
0:00:22 | based in atlanta usa |
---|
0:00:25 | i'm here to present our paper |
---|
0:00:28 | titled speech bandwidth expansion for speaker recognition on telephony audio |
---|
0:00:38 | this is the overview of my talk |
---|
0:00:41 | i will start by giving a motivation as to why we need bandwidth expansion |
---|
0:00:45 | followed by explaining the problem statement |
---|
0:00:48 | and then i will describe some prior research in this area |
---|
0:00:53 | we will then explain |
---|
0:00:55 | the bandwidth expansion system that we propose in this paper |
---|
0:00:58 | and show some results of bandwidth expansion performance |
---|
0:01:03 | finally |
---|
0:01:04 | i will show you some speaker verification experiments that we performed |
---|
0:01:10 | and the results that we obtained with the bandwidth expansion system |
---|
0:01:17 | in this paper we refer to |
---|
0:01:20 | audio that is sampled at sixteen khz |
---|
0:01:24 | as wideband audio |
---|
0:01:26 | typically the audio that is sampled at sixteen khz and has frequency content |
---|
0:01:33 | between zero to eight khz |
---|
0:01:35 | is called wideband audio in this paper |
---|
0:01:39 | and traditional telephony audio |
---|
0:01:41 | which is typically band limited to three hundred to three thousand four hundred |
---|
0:01:46 | hz |
---|
0:01:47 | and sampled at eight khz is referred to as |
---|
0:01:51 | narrowband audio in this paper |
---|
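The wideband and narrowband definitions above follow from the sampling rate: a signal sampled at a given rate can only carry frequencies up to half that rate. A minimal sketch, assuming nothing beyond the figures quoted in the talk; the function and constant names are illustrative and not from the paper.

```python
# Illustrative names only; the rates and band edges are the ones quoted in the talk.
WIDEBAND_RATE_HZ = 16_000
NARROWBAND_RATE_HZ = 8_000
TELEPHONY_BAND_HZ = (300, 3_400)


def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate_hz / 2.0


def fits_in_band(band_hz, sample_rate_hz):
    """True if the (low, high) band lies below the Nyquist frequency."""
    low, high = band_hz
    return 0 <= low < high <= nyquist_hz(sample_rate_hz)


if __name__ == "__main__":
    print(nyquist_hz(WIDEBAND_RATE_HZ))    # wideband upper edge in Hz
    print(nyquist_hz(NARROWBAND_RATE_HZ))  # narrowband upper edge in Hz
    print(fits_in_band(TELEPHONY_BAND_HZ, NARROWBAND_RATE_HZ))
```

This is why the four to eight khz region is simply absent from eight khz telephony audio and has to be estimated.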
0:01:55 | speaker verification systems |
---|
0:01:57 | typically work well on wideband audio |
---|
0:02:02 | this is because |
---|
0:02:04 | the higher frequency content between four and eight khz |
---|
0:02:10 | in wideband audio |
---|
0:02:12 | is helpful in speaker discrimination |
---|
0:02:17 | the wideband trained systems |
---|
0:02:20 | the systems trained on wideband audio perform poorly |
---|
0:02:24 | on narrowband audio due to the mismatch in the training and testing conditions |
---|
0:02:29 | so the lack of higher frequency information in the narrowband speech leads to |
---|
0:02:35 | the degraded performance |
---|
0:02:37 | so the question that we raise in this paper is can we estimate the |
---|
0:02:42 | higher frequency content that is missing in narrowband audio |
---|
0:02:47 | in such a way that it improves the performance |
---|
0:02:50 | on wideband trained |
---|
0:02:53 | speaker verification systems |
---|
0:02:55 | so this is the problem statement |
---|
0:02:57 | the narrowband audio |
---|
0:03:00 | which is band limited to four khz |
---|
0:03:04 | is shown on the left |
---|
0:03:06 | in this figure we have shown the spectrogram |
---|
0:03:10 | showing the frequencies between zero and |
---|
0:03:13 | eight kilohertz |
---|
0:03:15 | you see that there is no information or frequency content between four and eight kilohertz |
---|
0:03:20 | and the objective of the bandwidth expansion system is to use the lower frequency content |
---|
0:03:25 | of the narrowband audio to estimate the missing higher frequency content that is typically present |
---|
0:03:32 | in the wideband audio |
---|
0:03:36 | the objective is to estimate the higher frequency content in such a way |
---|
0:03:41 | that it improves the performance of speaker verification systems |
---|
0:03:49 | there has been a lot of research conducted in bandwidth expansion |
---|
0:03:54 | the earliest approaches to bandwidth expansion divided the problem into two parts |
---|
0:03:59 | estimating the envelope of the spectrum and the excitation signal of the |
---|
0:04:06 | speech |
---|
0:04:07 | the envelope estimation is typically |
---|
0:04:10 | performed using spline fitting, cubic spline fitting, or gaussian mixture model based approaches |
---|
0:04:17 | and |
---|
0:04:18 | spectral folding is used for |
---|
0:04:21 | estimating or extending the excitation signal |
---|
0:04:25 | so these are the earliest approaches in bandwidth expansion |
---|
0:04:30 | more recent approaches use deep neural network based bandwidth expansion and |
---|
0:04:35 | these kind of deep neural network based systems have shown improvement in the performance of |
---|
0:04:41 | asr systems |
---|
0:04:42 | trained on wideband speech |
---|
0:04:46 | more recent work in speaker verification related to bandwidth expansion |
---|
0:04:51 | has used |
---|
0:04:53 | deep residual networks |
---|
0:04:55 | and bidirectional lstm network architectures for performing bandwidth expansion |
---|
0:05:01 | this work has also shown significant improvement in the performance of speaker verification systems |
---|
0:05:11 | in this |
---|
0:05:12 | paper we propose a novel bandwidth expansion system |
---|
0:05:15 | that is lightweight compared to all the systems proposed in the literature |
---|
0:05:22 | in this system the bandwidth expansion is performed using a cnn |
---|
0:05:27 | dnn network architecture |
---|
0:05:30 | a feed-forward cnn dnn architecture wherein there is a |
---|
0:05:34 | single convolutional layer |
---|
0:05:36 | which performs one-dimensional convolution along the time axis |
---|
0:05:41 | followed by three feed-forward layers |
---|
0:05:44 | containing |
---|
0:05:45 | one thousand twenty four nodes in each layer |
---|
0:05:50 | there are sixty four filters in the convolutional layer |
---|
0:05:54 | and after the convolution operation the feature maps are flattened and fed to the feed-forward |
---|
0:06:00 | layers |
---|
0:06:01 | this is the architecture of the dnn |
---|
0:06:04 | that performs the bandwidth expansion |
---|
0:06:06 | the input |
---|
0:06:07 | to the deep neural network is |
---|
0:06:11 | the narrowband log spectrum |
---|
0:06:15 | so we extract the spectrogram |
---|
0:06:20 | from the eight khz telephony audio |
---|
0:06:23 | and we perform |
---|
0:06:25 | the mean and variance normalization of the spectrum |
---|
0:06:30 | and compute the logarithm |
---|
0:06:32 | of |
---|
0:06:34 | of the spectrum and feed it as input to the network |
---|
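The normalization step just described can be sketched as follows. This is a minimal illustration, assuming a single scalar mean and standard deviation per utterance computed on the log spectrum; the talk does not say whether the statistics are per utterance, per frequency bin, or global, nor the exact order of the log and the normalization, so those choices are assumptions. The inverse function matters because the talk later describes adding the same statistics back to the network output. All names are illustrative.

```python
import math


def utterance_stats(log_spec):
    """Scalar mean and standard deviation over all frames and bins (assumed granularity)."""
    values = [v for frame in log_spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalize(log_spec, mean, std, eps=1e-8):
    """Zero-mean, unit-variance version of the log spectrum."""
    return [[(v - mean) / (std + eps) for v in frame] for frame in log_spec]


def denormalize(norm_spec, mean, std, eps=1e-8):
    """Inverse of normalize: puts the saved mean and variance back."""
    return [[v * (std + eps) + mean for v in frame] for frame in norm_spec]


if __name__ == "__main__":
    spec = [[1.0, 2.0], [3.0, 4.0]]  # toy 2-frame, 2-bin log spectrum
    mean, std = utterance_stats(spec)
    print(denormalize(normalize(spec, mean, std), mean, std))
```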
0:06:39 | the output of the network tries to estimate the complete |
---|
0:06:44 | spectrum of the |
---|
0:06:46 | corresponding wideband audio |
---|
0:06:49 | the input to the network |
---|
0:06:51 | is a fixed |
---|
0:06:52 | eleven frames |
---|
0:06:54 | of one twenty eight dimensional narrowband log spectrum |
---|
0:06:59 | the features are computed at twenty millisecond frame size and ten millisecond frame rate |
---|
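The framing figures above translate into sample counts as sketched here. It assumes the eight khz narrowband input mentioned earlier and a context window that slides one frame at a time; the talk does not state the stride of the eleven-frame window or which frame of the window the network predicts, so that part is an assumption. Function names are illustrative.

```python
def frame_params(sample_rate_hz, frame_ms=20, hop_ms=10):
    """Frame length and hop length in samples for the given rate."""
    frame_len = sample_rate_hz * frame_ms // 1000
    hop_len = sample_rate_hz * hop_ms // 1000
    return frame_len, hop_len


def num_frames(num_samples, frame_len, hop_len):
    """Number of full frames that fit in the signal (no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


def context_windows(frames, context=11):
    """Group consecutive frames into overlapping windows of `context` frames."""
    return [frames[i:i + context] for i in range(len(frames) - context + 1)]


if __name__ == "__main__":
    frame_len, hop_len = frame_params(8000)
    n = num_frames(8000, frame_len, hop_len)  # one second of 8 kHz audio
    print(frame_len, hop_len, n, len(context_windows(list(range(n)))))
```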
0:07:06 | the network output is a two hundred fifty seven dimensional wideband log spectrum |
---|
0:07:12 | the network is trained with the mean squared error loss and adam optimiser |
---|
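To give a rough sense of how lightweight the described network is, the layer sizes can be tallied as below. This is only a sketch under stated assumptions: the kernel width (3), the absence of padding, treating the 128 frequency bins as input channels of the time-axis convolution, and the presence of bias terms are all guesses, since the talk gives only the filter count, the hidden layer width, the context length, and the input and output dimensions. The names are hypothetical.

```python
def conv1d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a 1-D convolution."""
    return in_channels * kernel * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def bwe_layer_sizes(context=11, in_bins=128, filters=64, kernel=3,
                    hidden=1024, n_hidden=3, out_bins=257):
    """Return (flattened conv output size, list of (layer name, parameter count))."""
    time_out = context - kernel + 1        # no padding, stride 1 (assumed)
    flat = filters * time_out              # feature maps flattened for the dense layers
    layers = [("conv1d", conv1d_params(in_bins, filters, kernel))]
    prev = flat
    for i in range(n_hidden):
        layers.append(("dense%d" % (i + 1), dense_params(prev, hidden)))
        prev = hidden
    layers.append(("output", dense_params(prev, out_bins)))
    return flat, layers


if __name__ == "__main__":
    flat, layers = bwe_layer_sizes()
    for name, count in layers:
        print(name, count)
    print("total", sum(count for _, count in layers))
```

Under these assumptions the model comes to roughly three million parameters; a different kernel width changes only the first two layers.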
0:07:19 | after the network output is obtained |
---|
0:07:23 | the mean and variance computed from the input narrowband spectrum are added back |
---|
0:07:30 | to the wideband spectrum |
---|
0:07:33 | after adding back the mean and variance |
---|
0:07:35 | an inverse |
---|
0:07:37 | low-pass filter |
---|
0:07:40 | or inverse filtering is applied |
---|
0:07:42 | to bring up the energy content in the higher frequencies |
---|
0:07:47 | this is mainly done |
---|
0:07:49 | to compensate for the mean values of the energies |
---|
0:07:54 | in the higher frequencies |
---|
0:07:59 | the output of this system |
---|
0:08:01 | is the wideband log spectrum which is further processed |
---|
0:08:05 | for |
---|
0:08:06 | speaker verification |
---|
0:08:08 | this bandwidth expansion |
---|
0:08:10 | dnn system is trained on the librispeech dataset |
---|
0:08:16 | and the vctk dataset |
---|
0:08:20 | this is the inverse filtering that is used to reverse the low-pass filtering effect |
---|
0:08:25 | the mean and variance of the narrowband log spectrum are added back to the |
---|
0:08:29 | estimated wideband log spectrum which is the output of the dnn |
---|
0:08:34 | the higher frequency energies of the narrowband audio are attenuated by the low-pass filtering |
---|
0:08:40 | so to reverse this low-pass filtering we add |
---|
0:08:44 | this filter, the |
---|
0:08:49 | inverse filter, in the log domain to |
---|
0:08:53 | the estimated wideband spectrum |
---|
0:08:55 | thereby getting back the properly normalized wideband spectrum estimate |
---|
0:09:03 | the data for training the bandwidth expansion system is simulated using our telephony |
---|
0:09:13 | codec simulation software |
---|
0:09:17 | the librispeech and vctk datasets |
---|
0:09:20 | are both |
---|
0:09:22 | wideband audio datasets librispeech has a sampling rate of sixteen khz and |
---|
0:09:27 | vctk is originally forty eight khz audio which we bring down by |
---|
0:09:33 | downsampling to sixteen khz |
---|
0:09:36 | both these datasets are clean speech wideband data at sixteen khz |
---|
0:09:44 | in order to simulate telephony artifacts in the wideband speech we perform |
---|
0:09:53 | coding and decoding using three different |
---|
0:09:57 | audio codecs the three audio codecs that we used for simulating the telephony data are |
---|
0:10:03 | the adaptive multi-rate narrowband codec amr-nb |
---|
0:10:07 | the opus narrowband codec and the silk narrowband codec |
---|
0:10:12 | so these three codecs cover a wide range of telephony applications that are commonly |
---|
0:10:17 | used as you can see from this table amr-nb is typically used |
---|
0:10:22 | in mobile telephony |
---|
0:10:23 | opus is used in voip applications like whatsapp, playstation 4, etc. and silk is also |
---|
0:10:30 | used in voip applications |
---|
0:10:35 | so we take sixteen khz wideband audio from the |
---|
0:10:42 | librispeech dataset or the vctk dataset |
---|
0:10:48 | pass the audio through the |
---|
0:10:50 | audio coding application |
---|
0:10:53 | which converts it into a coded signal |
---|
0:10:58 | and then we pass it through the audio codec decoder to get back the telephony |
---|
0:11:03 | distorted narrowband signal so this is how the sixteen khz audio |
---|
0:11:11 | is converted to eight khz telephony distorted audio |
---|
0:11:16 | and this is how we simulate the dataset for bandwidth expansion training |
---|
0:11:22 | the bandwidth expansion system is trained on a hundred hours of librispeech and |
---|
0:11:28 | vctk data |
---|
0:11:29 | the performance of the bandwidth expansion is measured by the log spectral distortion measure which |
---|
0:11:38 | is basically the mean squared error between the estimated wideband spectrum and the |
---|
0:11:46 | actual wideband spectrum |
---|
0:11:48 | in the log domain |
---|
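The measure described above can be sketched as follows. This uses one common definition of log spectral distortion, the root of the mean squared log-spectral difference per frame, averaged over frames; the talk only calls it "basically the mean squared error in the log domain", so the exact form, the log base, and any scaling used in the paper are assumptions here. The optional `bin_range` argument is an illustrative addition for scoring only the lower or only the higher band.

```python
import math


def log_spectral_distortion(ref_frames, est_frames, bin_range=None):
    """Average over frames of the RMS difference between two log spectra.

    ref_frames, est_frames: lists of frames, each a list of log-spectral values.
    bin_range: optional (start, stop) pair selecting a band of frequency bins.
    """
    if not ref_frames or len(ref_frames) != len(est_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")
    per_frame = []
    for ref, est in zip(ref_frames, est_frames):
        start, stop = bin_range if bin_range is not None else (0, len(ref))
        squared = [(ref[k] - est[k]) ** 2 for k in range(start, stop)]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    reference = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
    estimate = [[0.0, 1.0, 5.0, 6.0], [0.0, 1.0, 5.0, 6.0]]
    print(log_spectral_distortion(reference, estimate))             # full band
    print(log_spectral_distortion(reference, estimate, (2, 4)))     # upper band only
```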
0:11:50 | so the lsd |
---|
0:11:51 | the log spectral distortion |
---|
0:11:54 | results are shown here |
---|
0:11:57 | simple upsampling of |
---|
0:12:00 | narrowband audio gives a log spectral distortion of one point seven nine |
---|
0:12:05 | three in the higher frequency band by doing simple upsampling we are not adding |
---|
0:12:10 | any new information to the audio all that simple upsampling does is |
---|
0:12:17 | perform |
---|
0:12:19 | interpolation between samples |
---|
0:12:21 | followed by |
---|
0:12:24 | low-pass filtering so interpolation followed by smoothing that is |
---|
0:12:31 | simple upsampling |
---|
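The idea of "interpolation followed by smoothing" can be illustrated with a toy two-times upsampler. This is a deliberately simplified sketch, assuming linear interpolation and a three-tap smoothing filter; real resamplers typically use longer windowed-sinc or polyphase filters, and the talk does not say which resampler was used. Function names are illustrative.

```python
def upsample_2x(samples):
    """Double the sampling rate by inserting the midpoint between neighbours."""
    if not samples:
        return []
    out = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) / 2.0)
    out.append(samples[-1])
    out.append(samples[-1])  # hold the last value so the length is exactly doubled
    return out


def smooth(samples, taps=(0.25, 0.5, 0.25)):
    """Very short low-pass FIR filter; edge samples are repeated as padding."""
    if not samples:
        return []
    padded = [samples[0]] + list(samples) + [samples[-1]]
    return [
        taps[0] * padded[i] + taps[1] * padded[i + 1] + taps[2] * padded[i + 2]
        for i in range(len(samples))
    ]


if __name__ == "__main__":
    narrowband = [0.0, 2.0, 4.0]
    print(smooth(upsample_2x(narrowband)))
```

Neither step creates content above the original four khz limit, which is why the upper band of a simply upsampled signal stays empty.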
0:12:32 | and simple upsampling gives a log spectral distortion of one point seven nine three |
---|
0:12:38 | the bandwidth expanded system output |
---|
0:12:43 | has |
---|
0:12:45 | an lsd value of one point two nine one which is a significant reduction |
---|
0:12:48 | compared to |
---|
0:12:51 | the simple upsampled signal |
---|
0:12:54 | also when we do bandwidth expansion we estimate the complete |
---|
0:13:00 | spectrum of the audio |
---|
0:13:02 | that is the spectrum from zero |
---|
0:13:04 | to eight khz of the |
---|
0:13:06 | wideband audio |
---|
0:13:07 | we also compute the log spectral distortion of |
---|
0:13:11 | the lower band |
---|
0:13:13 | as a result |
---|
0:13:15 | so in the lower frequency band zero to four khz which is already |
---|
0:13:19 | present |
---|
0:13:21 | in the narrowband spectrum |
---|
0:13:23 | simple upsampling gives a |
---|
0:13:26 | log spectral distortion of point nine three four |
---|
0:13:29 | whereas the bandwidth expanded system output |
---|
0:13:32 | has |
---|
0:13:33 | a distortion of one point zero two nine |
---|
0:13:36 | so this means that the bandwidth expansion system |
---|
0:13:41 | introduces a minor amount of distortion |
---|
0:13:43 | in the lower frequencies |
---|
0:13:45 | compared to simple upsampling |
---|
0:13:49 | and also remember |
---|
0:13:50 | that |
---|
0:13:51 | the audio codecs that we applied |
---|
0:13:54 | to simulate the telephony audio would have introduced more distortions in the lower frequencies |
---|
0:14:00 | that is why |
---|
0:14:02 | there is |
---|
0:14:03 | a significant amount of log spectral distortion |
---|
0:14:06 | even in the lower frequencies |
---|
0:14:08 | for the simple upsampled signal |
---|
0:14:12 | finally this table shows that the bandwidth expansion system clearly performs |
---|
0:14:18 | spectral estimation of higher frequency content |
---|
0:14:22 | this is an example plot of the output of the bandwidth expansion system on top |
---|
0:14:27 | is the eight khz narrowband telephony audio |
---|
0:14:31 | we have |
---|
0:14:32 | performed simple upsampling of the telephony audio |
---|
0:14:35 | to show the spectrogram |
---|
0:14:37 | you see no frequency content in |
---|
0:14:41 | between four and eight kilohertz |
---|
0:14:43 | the bandwidth expanded output |
---|
0:14:46 | is shown in the |
---|
0:14:48 | middle pane |
---|
0:14:49 | you see that the higher frequencies are estimated |
---|
0:14:55 | pretty well |
---|
0:14:56 | by the bandwidth expansion system |
---|
0:14:59 | and the bottom pane |
---|
0:15:01 | shows the sixteen khz reference |
---|
0:15:07 | next we will move to the speaker verification experiments |
---|
0:15:13 | the speaker verification experiments in this paper are performed |
---|
0:15:17 | on a speaker verification system |
---|
0:15:19 | as shown in this figure |
---|
0:15:22 | our speaker verification system is a convolutional neural network based speaker embedding system |
---|
0:15:29 | it consists of five convolutional layers |
---|
0:15:32 | followed by |
---|
0:15:33 | a statistics pooling layer |
---|
0:15:36 | followed by two fully connected layers |
---|
0:15:40 | and the output is a softmax |
---|
0:15:43 | layer predicting speaker labels |
---|
0:15:46 | the input to the speaker embedding system is thirty dimensional |
---|
0:15:50 | mfcc features |
---|
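The pooling step in the embedding network described above turns a variable number of frame-level vectors into one fixed-length vector. A minimal sketch of the usual form of statistics pooling, the per-dimension mean concatenated with the per-dimension standard deviation, follows; the talk does not spell out which statistics are pooled, so this follows the common convention and is an assumption. The function name is illustrative.

```python
import math


def statistics_pooling(frames):
    """Concatenate per-dimension mean and standard deviation over time.

    frames: list of frame-level feature vectors, all of the same length.
    Returns a single vector of twice that length.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return means + stds


if __name__ == "__main__":
    # Two frames of 2-dimensional features in, one 4-dimensional vector out.
    print(statistics_pooling([[1.0, 10.0], [3.0, 10.0]]))
```

The output length depends only on the feature dimension, not on the utterance duration, which is what lets the fully connected layers follow.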
0:15:52 | the training of this speaker recognition system |
---|
0:15:57 | the speaker embedding system is performed in two stages |
---|
0:16:00 | in the first stage we use |
---|
0:16:04 | a softmax output |
---|
0:16:05 | of speaker labels |
---|
0:16:07 | and train the network with |
---|
0:16:09 | a cross entropy loss |
---|
0:16:11 | training |
---|
0:16:12 | in stage two |
---|
0:16:13 | we remove the second fully connected layer and the output layer |
---|
0:16:18 | and add a different fully connected layer |
---|
0:16:23 | at the output |
---|
0:16:25 | freeze all the layers |
---|
0:16:28 | before that |
---|
0:16:29 | and train the network with large margin cosine loss |
---|
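The second-stage loss mentioned above can be sketched for a single training example. In the usual formulation of large margin cosine loss, the cosine between the embedding and the target speaker's class weight is reduced by a margin before a scaled softmax cross entropy is applied. The scale and margin values below are illustrative placeholders, not values from the talk, and the function name is hypothetical.

```python
import math


def lmcl_loss(cosines, target, scale=30.0, margin=0.35):
    """Large margin cosine loss for one example.

    cosines: cosine similarity between the embedding and each class weight vector.
    target: index of the true speaker.
    scale, margin: illustrative hyperparameters (assumed, not from the talk).
    """
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    cos = [0.9, 0.1, -0.2]
    print(lmcl_loss(cos, target=0))
    print(lmcl_loss(cos, target=0, margin=0.0))  # plain scaled softmax cross entropy
```

The margin makes the target class harder to satisfy, so the network is pushed to separate speakers by a larger angular gap than plain softmax would require.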
0:16:33 | this is how the speaker embedding system is trained in two stages |
---|
0:16:38 | this system is trained |
---|
0:16:41 | completely on the voxceleb2 dataset which consists of |
---|
0:16:45 | sixteen khz clean audio |
---|
0:16:48 | so this is a wideband trained system so we train |
---|
0:16:51 | two different speaker embedding systems |
---|
0:16:54 | using the same architecture |
---|
0:16:56 | one system is trained only on the voxceleb2 |
---|
0:17:01 | sixteen khz audio |
---|
0:17:03 | the second system we train with mixed training |
---|
0:17:07 | where we use both the wideband audio |
---|
0:17:09 | and the downsampled and bandwidth expanded |
---|
0:17:14 | audio |
---|
0:17:15 | so we pass the voxceleb2 dataset |
---|
0:17:19 | through codec |
---|
0:17:21 | distortion |
---|
0:17:21 | followed by |
---|
0:17:23 | the bandwidth expansion |
---|
0:17:25 | using the |
---|
0:17:26 | bandwidth expansion system that we proposed in this paper |
---|
0:17:30 | and then we combine the two datasets the original wideband |
---|
0:17:34 | voxceleb |
---|
0:17:35 | and the bandwidth expanded |
---|
0:17:37 | downsampled voxceleb |
---|
0:17:40 | to train |
---|
0:17:41 | the speaker recognition system |
---|
0:17:43 | the speaker embedding system |
---|
0:17:47 | so note that |
---|
0:17:49 | both of these systems are trained on wideband audio |
---|
0:17:53 | one is trained on the original wideband data |
---|
0:17:56 | the second wb plus bwe is trained on wideband |
---|
0:18:00 | plus |
---|
0:18:01 | bandwidth expanded data |
---|
0:18:05 | these are the results |
---|
0:18:06 | here |
---|
0:18:07 | the speaker verification results are shown in this bar graph |
---|
0:18:13 | this bar graph shows the speaker verification equal error rates that we obtain |
---|
0:18:19 | using the wideband only trained system |
---|
0:18:24 | so the system is trained only on voxceleb sixteen khz wideband audio |
---|
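Since the results here are reported as equal error rates, a minimal sketch of how that number can be computed from verification scores may help. It sweeps a threshold over the observed scores and reports the point where the false acceptance and false rejection rates are closest; evaluation toolkits usually interpolate between thresholds, so this simple version is an approximation, and the function name is illustrative.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: error rate where false accepts and false rejects balance.

    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    A trial is accepted when its score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for threshold in thresholds:
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (false_accept + false_reject) / 2.0
    return best_eer


if __name__ == "__main__":
    targets = [0.4, 0.6, 0.8, 0.9]
    nontargets = [0.1, 0.2, 0.3, 0.5]
    print(equal_error_rate(targets, nontargets))
```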
0:18:28 | you see that we perform speaker verification test |
---|
0:18:33 | on |
---|
0:18:34 | four different test sets |
---|
0:18:37 | the voxceleb1 |
---|
0:18:38 | e subset |
---|
0:18:40 | the sitw |
---|
0:18:42 | dev dataset |
---|
0:18:44 | the speakers in the wild eval dataset |
---|
0:18:46 | and the nist sre two thousand ten ten second test set |
---|
0:18:54 | so these are the four test sets that we compute the results on |
---|
0:19:00 | the green |
---|
0:19:02 | you can see |
---|
0:19:04 | that performing bandwidth expansion |
---|
0:19:07 | reduces the equal error rate |
---|
0:19:10 | compared to the simply upsampled signal |
---|
0:19:13 | so the plot in orange shows the equal error rate obtained using |
---|
0:19:19 | simple upsampling |
---|
0:19:22 | and the bottom plot shows |
---|
0:19:26 | the result |
---|
0:19:27 | after bandwidth expansion |
---|
0:19:29 | note that |
---|
0:19:30 | the voxceleb1 |
---|
0:19:31 | e set |
---|
0:19:32 | the sitw |
---|
0:19:34 | dev set and the sitw eval set |
---|
0:19:36 | are all |
---|
0:19:37 | sixteen khz |
---|
0:19:40 | audio |
---|
0:19:41 | we pass these test sets |
---|
0:19:44 | through the codec distortion |
---|
0:19:47 | simulation that we developed in this paper to simulate telephony audio |
---|
0:19:53 | and then we apply speaker verification on top of the telephony distorted |
---|
0:19:58 | audio and those are |
---|
0:20:03 | the results shown in the orange plot |
---|
0:20:08 | the other plot is the output of the bandwidth expansion system |
---|
0:20:13 | when we pass the telephony distorted test sets through the bandwidth expansion |
---|
0:20:19 | system |
---|
0:20:20 | note that the nist sre two thousand ten test set |
---|
0:20:25 | yes |
---|
0:20:26 | consists of only telephony audio |
---|
0:20:29 | so it |
---|
0:20:31 | is inherently a narrowband speech signal |
---|
0:20:34 | recorded using real telephony |
---|
0:20:37 | so we don't have |
---|
0:20:42 | we don't have the results for |
---|
0:20:47 | the nist sre dataset |
---|
0:20:51 | for |
---|
0:20:52 | the wideband original because there is no wideband audio in this dataset |
---|
0:20:57 | so we have only the |
---|
0:21:00 | results |
---|
0:21:01 | for simple upsampling and bandwidth expansion |
---|
0:21:05 | so you see that even in the unseen case of the nist sre dataset which consists of |
---|
0:21:10 | real telephony audio |
---|
0:21:12 | there is a significant improvement in the equal error rate |
---|
0:21:16 | finally we show the results on the mixed trained system even on the mixed trained |
---|
0:21:21 | system there is a significant improvement in the equal error rates |
---|
0:21:25 | across all the test sets |
---|
0:21:28 | a particular point to note here is that the equal error rates |
---|
0:21:34 | obtained |
---|
0:21:34 | on the original wideband test audio sets |
---|
0:21:40 | are lower |
---|
0:21:42 | than what we obtained with the wideband trained system |
---|
0:21:45 | so initially for example for the voxceleb1 e set |
---|
0:21:50 | we obtained four point one two percent eer |
---|
0:21:55 | but after the mixed training the eer reduces to three point two percent |
---|
0:22:01 | that means that the bandwidth expansion has helped improve the performance even on the original |
---|
0:22:05 | sixteen khz audio |
---|
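As a quick arithmetic check of the comparison just made, the relative reduction between two equal error rates can be computed as below. The two input figures are the ones quoted in the talk, read here as 4.12 and 3.2 percent; the function name is illustrative.

```python
def relative_reduction_percent(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # EER before and after mixed training, as quoted in the talk.
    print(round(relative_reduction_percent(4.12, 3.2), 2))
```

Under that reading, the drop from 4.12 to 3.2 percent is an absolute reduction of 0.92 points, which is a relative reduction of roughly 22 percent.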
0:22:09 | so these are the conclusions of our paper the bandwidth expansion system that we proposed |
---|
0:22:14 | in this paper performs significantly better |
---|
0:22:19 | than simple upsampling |
---|
0:22:21 | we obtain a relative equal error rate reduction of four point four percent |
---|
0:22:25 | on the nist sre two thousand ten ten second test set |
---|
0:22:29 | and a nine point ninety percent improvement on the sitw eval |
---|
0:22:34 | set and eleven point one percent improvement on the sitw dev test set |
---|
0:22:39 | the bandwidth expansion also improved the accuracy on the original sixteen khz data |
---|
0:22:47 | across all |
---|
0:22:48 | the protocols across all the test sets |
---|
0:22:51 | which means that the bandwidth expansion system is helping as an augmentation mechanism for training |
---|
0:22:59 | speaker verification or for training the speaker recognition system |
---|
0:23:03 | so the proposed bandwidth expansion system is also a significantly lightweight system |
---|
0:23:08 | compared to other systems that were recently proposed |
---|
0:23:12 | and the system can be |
---|
0:23:14 | deployed and used even in a real-time scenario |
---|
0:23:20 | these are some references that i have referred to in this presentation |
---|
0:23:26 | please refer to the paper for further details and results |
---|
0:23:31 | and |
---|
0:23:33 | i will be glad to answer your questions |
---|
0:23:36 | thank you for listening to my talk |
---|
0:23:39 | i look forward to your |
---|
0:23:40 | questions and discussions regarding this paper |
---|
0:23:43 | thank you |
---|