0:00:14 | ... in computer science from Avignon University, France. |
---|
0:00:26 | Our work is on denoising x-vectors for robust speaker recognition. |
---|
0:00:33 | In this work we focus on additive noise, and on compensating additive noise for speaker recognition systems that use x-vector embeddings. |
---|
0:00:51 | First, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework. |
---|
0:01:04 | After that, we give an overview of previous works that are known to compensate for additive noise at different levels. |
---|
0:01:16 | Then we discuss the different denoising techniques that we used to compensate for additive noise in the x-vector domain. |
---|
0:01:30 | Here you can see the names of these denoising techniques: i-MAP and denoising autoencoders, which are existing techniques, and the Gaussian denoising autoencoder and the stacked denoising autoencoder, which are new architectures that we introduce in this paper. |
---|
0:01:54 | After that, I will talk about the experimental protocol and the results achieved by the denoising autoencoders in noisy environments. |
---|
0:02:08 | Here you can see the problem of additive noise. |
---|
0:02:14 | There are new techniques used for speaker modeling, like deep learning techniques, that use data augmentation to add information about noise, reverberation, and so on, in order to create a system that is robust in noisy environments. |
---|
0:02:36 | But even when we use a state-of-the-art speaker modeling system like the x-vector extractor, if we encounter new noises that were not seen during data augmentation, the results degrade dramatically. |
---|
0:02:56 | This problem motivates us to do compensation for additive noise in the x-vector framework. |
---|
0:03:06 | In a speaker recognition system we are not looking for a clean signal; we just want to have good performance in recognizing speakers. |
---|
0:03:21 | We can do additive noise compensation at different levels: at low levels like the signal or the features, for example doing noise compensation on MFCCs, or at higher levels like the x-vector, i.e. the speaker modeling level. |
---|
0:03:43 | In our research we try to do compensation at the x-vector level, because at this level the vectors have a Gaussian distribution, the dimensionality is lower, and working at this level is easier. |
---|
0:04:03 | In previous works we can see that some researchers worked at the signal level. In the first row you can see a paper from 2019 in which different techniques, one convolutional and one BLSTM, are used to denoise features like the log-magnitude spectrum and the STFT. |
---|
0:04:33 | In another paper, the denoising is done on raw speech. |
---|
0:04:43 | In previous research done in the i-vector domain, several statistical and neural techniques were proposed for denoising; for example, one earlier work proposed i-MAP to map from noisy to clean i-vectors. |
---|
0:05:08 | There are also some other techniques, such as variants of i-MAP and denoising autoencoders, that act in the same manner and try to map from noisy i-vectors to clean i-vectors. |
---|
0:05:25 | Based on that, and because these denoising techniques gave good results in the i-vector domain, we can apply the previous techniques in the x-vector space as well, or propose new techniques to denoise in the x-vector space. |
---|
0:05:48 | The first technique that is used for denoising is i-MAP, a statistical technique that was used for denoising in the i-vector space. |
---|
0:05:59 | In i-MAP, we assume that clean and noisy x-vectors follow Gaussian distributions, and the noise random variable is defined as the difference between the clean and the noisy vectors. |
---|
0:06:20 | Here you can see the posterior probability involved, where x0 is the noisy x-vector and x is the clean x-vector. |
---|
0:06:34 | We use the MAP (maximum a posteriori) criterion to estimate, from x0, the clean, i.e. denoised, version of the x-vector. |
---|
0:06:51 | Here you can see the final solution of this formula. |
---|
0:06:57 | Sigma_n is the covariance of the noise estimated on the training vectors, and mu_n is the average of this noise; Sigma_x is the covariance of the clean x-vectors used for training, and mu_x is the average of the clean x-vectors. |
---|
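As an illustration of the kind of closed-form estimate involved, here is a minimal numpy sketch of a Gaussian MAP denoiser under the additive model described above (noisy = clean + noise, both Gaussian). The function and variable names (imap_denoise, mu_x, cov_x, mu_n, cov_n) are illustrative and the exact parameterization in the paper may differ; in practice mu_n and cov_n would be estimated from the differences between paired noisy and clean training x-vectors.

```python
import numpy as np

def imap_denoise(x0, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of a clean x-vector from a noisy one, assuming
    x0 = x + n with x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n)."""
    # Posterior precision is the sum of the clean and noise precisions.
    prec_x = np.linalg.inv(cov_x)
    prec_n = np.linalg.inv(cov_n)
    post_cov = np.linalg.inv(prec_x + prec_n)
    # Combine the prior mean with the noise-corrected observation.
    return post_cov @ (prec_x @ mu_x + prec_n @ (x0 - mu_n))
```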
0:07:25 | The second technique used in our paper for denoising is the conventional denoising autoencoder. |
---|
0:07:33 | A conventional denoising autoencoder tries to minimize L(x, f(y)), where L is the loss function, y is the distorted, i.e. noisy, x-vector, and f(y) is the output of the denoising autoencoder on the noisy x-vector. |
---|
0:07:57 | Briefly, a denoising autoencoder is trained to minimize the distance between denoised noisy x-vectors and clean x-vectors. |
---|
0:08:04 | We use this architecture in our research. Here you can see that in the input and output layers we use five hundred and twelve nodes, with a linear activation function. |
---|
0:08:16 | The number of nodes in these layers is the same because we want to map the noisy x-vectors exactly; we want the output layer to have exactly the same dimension as the original x-vectors. |
---|
0:08:37 | In the hidden layer we use one thousand and twenty-four nodes, with a nonlinear hyperbolic tangent activation function. |
---|
0:08:50 | The loss function that is used for denoising in this paper is the mean squared error. |
---|
0:09:00 | Our denoising autoencoder is trained with stochastic gradient descent. |
---|
0:09:04 | It should be mentioned that we used one thousand and twenty-four nodes in the hidden layer because if you use a small number of nodes in this layer you may lose information, and it is better to use a larger number of nodes in the hidden layer. |
---|
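As a rough sketch of the architecture just described (512-dimensional input and output with a linear output layer, a 1024-node tanh hidden layer, MSE loss, SGD training), a denoising autoencoder of this kind could look as follows in PyTorch; class and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """512 -> 1024 (tanh) -> 512 mapping from noisy to clean x-vectors."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),                 # non-linear hidden layer
            nn.Linear(hidden, dim),    # linear output, same dim as the x-vector
        )

    def forward(self, noisy_xvec):
        return self.net(noisy_xvec)

# Training-loop sketch: minimize MSE between denoised and clean x-vectors.
dae = DenoisingAutoencoder()
optimizer = torch.optim.SGD(dae.parameters(), lr=0.01)
criterion = nn.MSELoss()

def train_step(noisy_batch, clean_batch):
    optimizer.zero_grad()
    loss = criterion(dae(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```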
0:09:27 | Another technique used in our paper is the combination of the denoising autoencoder and i-MAP; we still call it i-MAP here because it was originally used for the i-vector system. |
---|
0:09:42 | In this architecture we have noisy x-vectors; we first try to denoise these vectors with the denoising autoencoder, and then we feed the output of the denoising autoencoder to the i-MAP. |
---|
0:09:55 | By doing this step we push our system to produce x-vectors that have a known statistical distribution. |
---|
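A minimal sketch of this cascade, reusing the illustrative DenoisingAutoencoder and imap_denoise from the sketches above: the DAE output is treated as the observation that i-MAP then maps toward the clean Gaussian distribution. Whether the noise statistics are re-estimated on DAE outputs is an assumption here, not something stated in the talk.

```python
import torch

def denoise_cascade(noisy_xvec, dae, mu_x, cov_x, mu_n, cov_n):
    """Pass the noisy x-vector through the DAE, then refine with i-MAP."""
    with torch.no_grad():
        dae_out = dae(torch.as_tensor(noisy_xvec, dtype=torch.float32)).numpy()
    # mu_n and cov_n would presumably describe the residual error of the DAE
    # outputs with respect to clean x-vectors (an assumption for this sketch).
    return imap_denoise(dae_out, mu_x, cov_x, mu_n, cov_n)
```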
0:10:07 | In another technique that we introduce, called the Gaussian denoising autoencoder, we give noisy x-vectors as input and we push the denoising autoencoder to give a Gaussian distribution to the denoised x-vectors. |
---|
0:10:26 | Here you can see the loss function that imposes this restriction on the output of the denoising autoencoder. |
---|
0:10:38 | Here you can see again the statistics involved: the mean and the covariance of the denoised x-vectors, and mu_x and Sigma_x, the average and the covariance of the clean x-vectors. |
---|
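One plausible way to write such a constrained loss is sketched below: the usual reconstruction error is combined with penalties that pull the batch mean and covariance of the denoised x-vectors toward the clean statistics mu_x and cov_x. The penalty form and the weights alpha and beta are assumptions for illustration; the paper's exact loss may differ.

```python
import torch

def gaussian_dae_loss(denoised, clean, mu_x, cov_x, alpha=1.0, beta=1.0):
    """MSE plus penalties matching batch statistics to the clean x-vector
    Gaussian (mu_x, cov_x). `denoised` and `clean` are (batch, dim) tensors."""
    mse = torch.mean((denoised - clean) ** 2)
    # Batch statistics of the denoised outputs.
    batch_mu = denoised.mean(dim=0)
    centered = denoised - batch_mu
    batch_cov = centered.T @ centered / (denoised.shape[0] - 1)
    # Penalize deviation from the clean mean and covariance.
    mean_pen = torch.sum((batch_mu - mu_x) ** 2)
    cov_pen = torch.sum((batch_cov - cov_x) ** 2)
    return mse + alpha * mean_pen + beta * cov_pen
```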
0:11:04 | The final technique that we used is the stacked denoising autoencoder. |
---|
0:11:10 | This type of denoising autoencoder tries to find an estimation of the noise; by estimating the noise we can obtain better results, because in an experiment where we gave the exact information about the noise we achieved very good results, close to the clean environment. |
---|
0:11:37 | We use this architecture: first we feed the noisy x-vectors to the first denoising autoencoder and we obtain a first estimation of the denoised x-vector. |
---|
0:11:49 | By calculating the difference between the noisy x-vectors and the output of the first block, we try to find an estimation of the noise, and we give this information to the second block. |
---|
0:12:03 | We repeat this in the same manner to have a better estimation of the noise and to use this information in the next block, which yields better results at the output. We train all these blocks jointly. |
---|
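The PyTorch sketch below illustrates this idea under the assumptions above: each block is a small denoising autoencoder, the difference between the noisy input and the current denoised estimate acts as a noise estimate that is concatenated to the input of the next block, and a single loss on the final output trains all blocks jointly. The block architecture and the way the noise estimate is injected are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class StackedDenoisingAutoencoder(nn.Module):
    """Stack of DAE blocks; later blocks also receive a noise estimate."""
    def __init__(self, dim=512, hidden=1024, n_blocks=3):
        super().__init__()
        def block(in_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))
        # First block sees only the noisy x-vector; later blocks see the
        # noisy x-vector concatenated with the current noise estimate.
        self.blocks = nn.ModuleList(
            [block(dim)] + [block(2 * dim) for _ in range(n_blocks - 1)])

    def forward(self, noisy):
        denoised = self.blocks[0](noisy)
        for blk in self.blocks[1:]:
            noise_est = noisy - denoised          # rough estimate of the noise
            denoised = blk(torch.cat([noisy, noise_est], dim=-1))
        return denoised

# Joint training: one MSE loss on the final output updates all blocks.
# The random tensors stand in for batches of noisy and clean x-vectors.
model = StackedDenoisingAutoencoder()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = nn.MSELoss()(model(torch.randn(8, 512)), torch.randn(8, 512))
loss.backward()
opt.step()
```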
0:12:27 | We have several datasets in our paper. MUSAN is used for data augmentation to train the x-vector extraction network, and additional noise data is used to create the noisy x-vectors for training and test that are used by the denoising techniques. |
---|
0:12:47 | VoxCeleb is also used: VoxCeleb 1 with data augmentation is used to train the x-vector network, and a combination of VoxCeleb 1 and VoxCeleb 2 is used to create noisy x-vectors to train the denoising techniques. |
---|
0:13:09 | FABIOLE is a French corpus that is used for test and enrollment in our experiments. |
---|
0:13:15 | We divide this corpus into subsets based on the duration of the files, in order to calculate the results for different durations. |
---|
0:13:33 | Here you can see the steps that we followed in our experiments. First, we trained the x-vector extractor network using a standard recipe; to train this network we used VoxCeleb 1 with augmentation. |
---|
0:13:50 | Then we used this network to extract x-vectors and to create the training data for the denoising techniques: about four million noisy-clean pairs of x-vectors from VoxCeleb 1 and VoxCeleb 2. |
---|
0:14:10 | We also extract the enrollment and test x-vectors from the FABIOLE speech corpus, and we add noise to our test data to create a noisy version. |
---|
0:14:26 | We used a separate set of noises because we want to make our system robust against unseen noises: we used MUSAN to perform data augmentation when training the x-vector network, but in this step we use a different noise collection. |
---|
0:14:45 | The noise files that are used to create the noisy x-vectors for training the denoising techniques are also different from the noises that are used for the test, so the noises used in the test are unseen. |
---|
0:15:10 | After that we train PLDA and we do the scoring; PLDA is used as the backend scoring technique. |
---|
0:15:20 | But before scoring we do denoising, in order to reduce the noise in our test x-vectors. |
---|
0:15:31 | Here you can see the results; we use the equal error rate metric for the different experiments. In the first row you can see the results for different durations. |
---|
0:15:44 | For example, in the first column, for utterances shorter than two seconds we have 11.59% equal error rate when we don't have noise, and for utterances longer than twelve seconds we have 0.8%. |
---|
0:16:05 | In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from about eleven to fifteen, and for utterances longer than twelve seconds it increases from 0.8 to 5.1. |
---|
0:16:25 | These results show that it is important to do denoising before scoring: the system is not robust against unseen noise, so our assumption is true, and using a denoising component before scoring is very important. |
---|
0:16:42 | Here you can see the results obtained by the statistical i-MAP technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6. |
---|
0:16:58 | The next row shows the results obtained after applying the denoising autoencoder on the x-vectors. |
---|
0:17:05 | In the next row we see the results obtained by the combination of the denoising autoencoder and i-MAP. |
---|
0:17:19 | In the last row you see the results obtained with the Gaussian constraint: the loss function that we used in our experiments to train the denoising autoencoder imposes that the denoised x-vectors belong to a Gaussian distribution. |
---|
0:17:43 | Here you see the results for the stacked denoising autoencoder. |
---|
0:17:49 | In the fourth row you can see the results when we use just two blocks, the first and the second block. |
---|
0:17:57 | As you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds. |
---|
0:18:11 | In the last row you see the results for the case where we use the stacked denoising autoencoder with exactly the same architecture that was shown earlier. |
---|
0:18:25 | In this case, in almost all conditions, we have better results than the previous techniques. |
---|
0:18:38 | In our paper we showed that data augmentation and deep learning techniques are important to achieve a noise-robust speaker recognition system, but when we are in the x-vector space we can obtain better results if we also use denoising techniques. |
---|
0:19:01 | We showed that simple statistical methods like i-MAP, which were used in the i-vector space, can be used in the x-vector space as well. |
---|
0:19:11 | After that, we showed that merging the advantages of the statistical method and the denoising autoencoder can give better results. |
---|
0:19:20 | Finally, we introduced a new technique called the stacked denoising autoencoder, which tries to find information about the noise and uses this information in the deeper blocks of the stacked denoising autoencoder. |
---|
0:19:38 | With this technique, in almost all cases we achieve better results than the statistical i-MAP technique or conventional denoising autoencoders. |
---|
0:19:55 | Thanks for your attention. |
---|