0:00:14 | Hello everyone. |
0:00:15 | This is a presentation from the LEAP lab |
0:00:18 | at the Indian Institute of Science, Bangalore. |
0:00:22 | I will be presenting our work |
0:00:24 | on NPLDA, a neural PLDA backend model for speaker verification. |
0:00:30 | This work was done jointly with my co-authors. |
0:00:36 | Let's look at the roadmap of this presentation. |
0:00:40 | We first introduce what a speaker verification task consists of, |
0:00:46 | move on to the motivation behind our work, |
0:00:50 | then talk about the front-end model that we used, |
0:00:53 | and discuss past approaches to backend modeling, |
0:00:57 | before describing |
0:00:59 | the proposed neural PLDA, or NPLDA, model, |
0:01:03 | and then |
0:01:04 | some experiments and results before concluding the presentation. |
0:01:11 | Let's look at what a speaker verification task consists of. |
0:01:16 | We are given |
0:01:17 | a segment of an enrollment recording of a particular target speaker and a test segment. |
0:01:25 | The main objective of the speaker verification system |
0:01:29 | is to determine |
0:01:30 | whether the target speaker is speaking in the test segment, which is the alternative hypothesis, |
0:01:36 | or |
0:01:38 | is not speaking, |
0:01:39 | which is the null hypothesis. |
0:01:42 | As you can see here, |
0:01:44 | there is an enrollment recording denoted by x_e |
0:01:48 | and a test recording denoted by x_t. |
0:01:52 | These are given |
0:01:53 | as input to the speaker verification system. |
0:01:57 | The system outputs a log-likelihood ratio score. |
0:02:02 | This score is used |
0:02:04 | to decide |
0:02:05 | whether |
0:02:06 | the test segment belongs to the target speaker or a non-target speaker. |
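For reference, the hypothesis test described above can be written out as follows. This is the standard formulation, with x_e and x_t denoting the enrollment and test recordings as on the slide.

```latex
% Speaker verification as a two-hypothesis test on the pair (x_e, x_t):
%   H_1 (alternative): x_e and x_t come from the same speaker
%   H_0 (null):        x_e and x_t come from different speakers
\[
  s(x_e, x_t) = \log \frac{p(x_e, x_t \mid H_1)}{p(x_e, x_t \mid H_0)},
  \qquad \text{decide } H_1 \text{ if } s(x_e, x_t) \ge \theta .
\]
```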
0:02:14 | The most popular state-of-the-art systems for speaker verification |
0:02:19 | consist |
0:02:19 | of a neural embedding extractor; the most popular ones in the last few |
0:02:25 | years have been the x-vector models. |
0:02:28 | This is followed by a backend generative model such as probabilistic linear discriminant |
0:02:35 | analysis, or PLDA. |
0:02:38 | There are some discriminative backend approaches |
0:02:41 | like the discriminative PLDA and SVMs. |
0:02:45 | What we propose |
0:02:47 | is |
0:02:47 | a neural network approach, |
0:02:49 | which is discriminative as opposed to generative, |
0:02:53 | for backend modeling in speaker recognition and speaker verification tasks. |
0:03:01 | Let's look at the front-end model that we used. |
0:03:05 | As I mentioned, |
0:03:06 | the most popular models in the last few years have been the x-vector extractors. |
0:03:12 | We trained our x-vector extractor |
0:03:15 | on the VoxCeleb datasets, |
0:03:17 | which consisted |
0:03:19 | of seven thousand three hundred and twenty three speakers. |
0:03:23 | This was trained |
0:03:24 | using |
0:03:25 | thirteen-dimensional MFCCs |
0:03:27 | from twenty-five millisecond frames shifted every ten milliseconds, using a twenty-channel mel-scale filterbank. |
0:03:35 | This spanned the frequency range of twenty hertz to seven thousand six hundred hertz. |
0:03:42 | We applied a five-fold augmentation strategy, |
0:03:44 | which included |
0:03:46 | augmenting the data using things |
0:03:48 | like babble, noise, and music, |
0:03:51 | to generate around six point three million training segments. |
0:03:56 | The architecture that we used to train our x-vector model was the extended TDNN, |
0:04:02 | or ETDNN, architecture. |
0:04:04 | This consists of twelve hidden layers |
0:04:08 | and ReLU nonlinearities. |
0:04:10 | The model is trained to discriminate among the speakers. |
0:04:15 | The first ten hidden layers |
0:04:17 | operate at the frame level, while the last two layers operate at the segment level. |
0:04:23 | After training, the embeddings are extracted from the five hundred twelve dimensional affine component |
0:04:31 | of the eleventh layer, |
0:04:33 | that is, the first segment-level layer after the statistics pooling layer. |
0:04:38 | The embeddings extracted here are the x-vectors. |
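As a rough illustration of where the embedding comes from, here is a minimal PyTorch-style sketch of statistics pooling followed by the segment-level affine layer whose output is read off as the x-vector. The layer sizes and names are assumptions for illustration, not the exact ETDNN recipe.

```python
import torch
import torch.nn as nn

class StatsPoolingEmbedder(nn.Module):
    """Sketch of the segment-level part of an x-vector network.

    Frame-level TDNN layers (not shown) produce a sequence of frame
    features; statistics pooling collapses them into one segment-level
    vector, and the embedding (the x-vector) is read from the affine
    component of the first segment-level layer, before its ReLU.
    """

    def __init__(self, frame_dim=1500, embed_dim=512):
        super().__init__()
        # Affine component whose 512-dimensional output is the x-vector.
        self.embedding = nn.Linear(2 * frame_dim, embed_dim)

    def forward(self, frames):
        # frames: (batch, num_frames, frame_dim) from the frame-level layers
        mu = frames.mean(dim=1)                  # per-segment mean
        sigma = frames.std(dim=1)                # per-segment standard deviation
        stats = torch.cat([mu, sigma], dim=1)    # statistics pooling
        return self.embedding(stats)             # extract before the ReLU

# Usage: a 3-second segment at 100 frames per second.
xvector = StatsPoolingEmbedder()(torch.randn(1, 300, 1500))
print(xvector.shape)  # torch.Size([1, 512])
```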
0:04:44 | Let's look at a few approaches |
0:04:47 | to backend modeling. |
0:04:48 | The most popular one |
0:04:50 | in speaker verification systems is the generative Gaussian PLDA, also called GPLDA. |
0:04:58 | Once the x-vectors are extracted, |
0:05:00 | there are a few steps of processing done on them: |
0:05:04 | they are centered, that is, the mean is removed, |
0:05:08 | then transformed using LDA, |
0:05:11 | and then unit length normalized. |
0:05:14 | The PLDA model on this processed x-vector for a particular recording |
0:05:19 | is given in equation (1). |
0:05:22 | Here eta_r |
0:05:23 | is the x-vector for the particular recording r, |
0:05:27 | omega describes a latent speaker factor, which has a Gaussian prior, |
0:05:32 | Phi |
0:05:33 | characterizes the speaker subspace matrix, |
0:05:36 | and epsilon_r is a Gaussian noise term. |
0:05:40 | Now, |
0:05:41 | for scoring, |
0:05:43 | a pair of these x-vectors, |
0:05:44 | one from the enrollment recording, |
0:05:47 | denoted by eta_e, |
0:05:49 | and one from the test recording, |
0:05:52 | denoted by eta_t, |
0:05:54 | is used with the PLDA model in order to compute |
0:05:58 | the log-likelihood ratio score given in equation (2). |
0:06:03 | Equation (2) |
0:06:04 | is quadratic in the x-vector pair, |
0:06:06 | and P and Q are matrices derived from the PLDA model parameters. |
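A plausible reconstruction of equations (1) and (2) as described above, following the standard Gaussian PLDA formulation; the exact notation for P, Q, and the constant term in the paper may differ.

```latex
% Equation (1): PLDA model for the processed x-vector \eta_r of recording r
\[
  \eta_r = \Phi\,\omega + \epsilon_r ,
  \qquad \omega \sim \mathcal{N}(0, I), \quad \epsilon_r \sim \mathcal{N}(0, \Sigma)
\]

% Equation (2): log-likelihood ratio score for an enrollment/test pair
\[
  s(\eta_e, \eta_t) =
    \eta_e^{\top} Q\, \eta_e + \eta_t^{\top} Q\, \eta_t
    + 2\,\eta_e^{\top} P\, \eta_t + \text{const},
\]
% where P and Q are derived from the total and across-speaker covariances
% \Sigma_{tot} = \Phi\Phi^{\top} + \Sigma and \Sigma_{ac} = \Phi\Phi^{\top}.
```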
0:06:11 | Two other approaches |
0:06:13 | to backend modeling are the discriminative PLDA |
0:06:18 | and the pairwise Gaussian backend. |
0:06:20 | The discriminative PLDA, or DPLDA, |
0:06:24 | uses |
0:06:24 | an expanded vector |
0:06:26 | of the enrollment and test x-vector pair in order to compute the log-likelihood ratio. |
0:06:32 | Here phi represents a trial. |
0:06:35 | It is computed using a quadratic kernel, |
0:06:39 | which is given in equation (3). |
0:06:42 | The final DPLDA log-likelihood ratio score is computed |
0:06:47 | as the dot product of a weight vector |
0:06:50 | and this expanded vector, |
0:06:52 | phi(eta_e, eta_t). |
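A minimal sketch of this kind of quadratic expansion and linear scoring, loosely following the discriminative PLDA formulation of Burget et al.; the exact expansion in equation (3) may differ in detail.

```python
import numpy as np

def quadratic_expansion(eta_e, eta_t):
    """Expand an (enrollment, test) pair into a single trial vector.

    The expansion collects all terms that appear in the quadratic
    PLDA score, so the LLR becomes linear in this vector.
    """
    return np.concatenate([
        np.outer(eta_e, eta_t).ravel() + np.outer(eta_t, eta_e).ravel(),  # cross terms
        np.outer(eta_e, eta_e).ravel() + np.outer(eta_t, eta_t).ravel(),  # self terms
        eta_e + eta_t,                                                    # linear terms
        [1.0],                                                            # bias term
    ])

def dplda_score(w, eta_e, eta_t):
    # The DPLDA LLR is the dot product of a learned weight vector w
    # with the expanded trial vector phi(eta_e, eta_t).
    return w @ quadratic_expansion(eta_e, eta_t)

# Usage with random vectors of embedding dimension d = 4:
d = 4
w = np.random.randn(2 * d * d + d + 1)
print(dplda_score(w, np.random.randn(d), np.random.randn(d)))
```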
0:06:56 | The pairwise Gaussian backend |
0:06:58 | models the pairs of enrollment and test x-vectors |
0:07:02 | using Gaussian distributions. The parameters |
0:07:06 | of these models |
0:07:08 | are estimated |
0:07:09 | by computing |
0:07:10 | the sample means and covariance matrices of the target and non-target trials |
0:07:16 | in the training data. |
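A small sketch of that idea, assuming the trial is represented by concatenating the two x-vectors (a common choice; the paper's exact parameterization may differ).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_pairwise_gaussian(pairs_tar, pairs_non):
    """Fit one Gaussian to target trials and one to non-target trials.

    Each row of pairs_* is a concatenated [eta_e, eta_t] vector.
    """
    params = {}
    for name, pairs in (("tar", pairs_tar), ("non", pairs_non)):
        params[name] = (pairs.mean(axis=0), np.cov(pairs, rowvar=False))
    return params

def pairwise_gaussian_llr(params, eta_e, eta_t):
    # LLR between the target and non-target Gaussian models of the pair.
    z = np.concatenate([eta_e, eta_t])
    return (multivariate_normal.logpdf(z, *params["tar"])
            - multivariate_normal.logpdf(z, *params["non"]))

# Usage with toy 3-dimensional embeddings:
rng = np.random.default_rng(0)
params = fit_pairwise_gaussian(rng.normal(size=(500, 6)),
                               rng.normal(size=(500, 6)) + 1.0)
print(pairwise_gaussian_llr(params, rng.normal(size=3), rng.normal(size=3)))
```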
0:07:18 | Along with |
0:07:20 | the NPLDA model that we propose, |
0:07:22 | we report results on the generative Gaussian PLDA, the DPLDA, and the pairwise Gaussian |
0:07:29 | backend. |
0:07:32 | Now on to the proposed neural PLDA, or NPLDA, backend architecture. |
0:07:38 | What we have here |
0:07:39 | is a pairwise, siamese-style discriminative network. |
0:07:44 | As you can see, the green portion of the network corresponds |
0:07:48 | to the enrollment embeddings, |
0:07:51 | and the pink portion of the network corresponds to the test embeddings. |
0:07:57 | We construct the preprocessing steps |
0:08:00 | of the generative model as layers |
0:08:03 | in |
0:08:04 | the neural network: |
0:08:07 | the LDA |
0:08:08 | as the first affine layer, |
0:08:11 | unit length normalization as a nonlinear activation, |
0:08:15 | and then the PLDA centering and diagonalization as |
0:08:20 | another affine transformation. |
0:08:23 | The final PLDA pairwise scoring, |
0:08:26 | which is given in equation (2), |
0:08:30 | is implemented as a quadratic layer. |
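A compact PyTorch sketch of the NPLDA layer structure as described; the dimensions and initialization here are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLDA(nn.Module):
    """Siamese NPLDA backend: the generative PLDA preprocessing unrolled
    into differentiable layers, ending in a quadratic scoring layer."""

    def __init__(self, xvec_dim=512, lda_dim=170):
        super().__init__()
        self.lda = nn.Linear(xvec_dim, lda_dim)    # LDA as an affine layer
        self.diag = nn.Linear(lda_dim, lda_dim)    # PLDA centering + diagonalization
        self.P = nn.Parameter(torch.eye(lda_dim))  # quadratic scoring parameters
        self.Q = nn.Parameter(torch.eye(lda_dim))

    def preprocess(self, x):
        x = self.lda(x)
        x = F.normalize(x, p=2, dim=-1)  # unit length normalization as activation
        return self.diag(x)

    def forward(self, x_e, x_t):
        e, t = self.preprocess(x_e), self.preprocess(x_t)
        # Equation (2) as a quadratic layer on the processed pair.
        return ((e * (e @ self.Q)).sum(-1)
                + (t * (t @ self.Q)).sum(-1)
                + 2 * (e * (t @ self.P)).sum(-1))

# Usage: score a batch of 8 enrollment/test x-vector pairs.
model = NPLDA()
print(model(torch.randn(8, 512), torch.randn(8, 512)).shape)  # torch.Size([8])
```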
0:08:34 | The parameters of this model are optimized |
0:08:38 | using |
0:08:38 | an approximation |
0:08:40 | of the minimum detection cost function, which is known as the minDCF or C_min. |
0:08:47 | As the model |
0:08:49 | is optimized to minimize the detection cost function, |
0:08:53 | we report results |
0:08:55 | on the minDCF metric |
0:08:57 | and the EER metric. |
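For completeness, a small sketch of how the EER (equal error rate) is typically computed from trial scores; this is the standard computation, not code from the paper.

```python
import numpy as np

def equal_error_rate(scores, is_target):
    """EER: the error rate at the threshold where the miss rate
    equals the false-alarm rate (labels: True for target trials)."""
    thresholds = np.sort(scores)
    p_miss = np.array([np.mean(scores[is_target] < t) for t in thresholds])
    p_fa = np.array([np.mean(scores[~is_target] >= t) for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))  # closest crossing point
    return (p_miss[idx] + p_fa[idx]) / 2

# Usage with toy scores: target trials score higher on average.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2, 1, 500), rng.normal(0, 1, 500)])
labels = np.array([True] * 500 + [False] * 500)
print(equal_error_rate(scores, labels))  # roughly 0.16 for these Gaussians
```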
0:09:03 | The normalized detection cost function, or DCF, |
0:09:07 | is defined as C_Norm of beta and theta, |
0:09:11 | which is equal to P_Miss of theta plus beta times P |
0:09:16 | FA of theta, |
0:09:18 | where beta |
0:09:19 | is an application-based weight, and |
0:09:22 | P_Miss and P_FA |
0:09:24 | are the probabilities of miss |
0:09:26 | and false alarm respectively. |
0:09:29 | A miss |
0:09:30 | is when the model predicts a target trial to be a non-target one, |
0:09:36 | that is, the model believes |
0:09:38 | that the enrollment and test come from different speakers, |
0:09:42 | whereas a false alarm is when a non-target trial is wrongly predicted as a target one. |
0:09:51 | P_Miss and P_FA are computed by applying a detection threshold theta to |
0:09:57 | the log-likelihood ratios. |
0:09:59 | How P_Miss and P_FA are computed is |
0:10:02 | given in equation (5). |
0:10:05 | Here, |
0:10:06 | s_i is the score, or the log-likelihood ratio, output by the model, |
0:10:13 | t_i |
0:10:14 | is the ground truth variable for trial i, |
0:10:18 | that is, t_i is equal to zero if the trial i is a target trial, |
0:10:23 | and t_i is equal to one |
0:10:25 | if it is a non-target trial, |
0:10:28 | and 1(.) |
0:10:30 | is the indicator function. |
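Written out, with the label convention stated above (t_i = 0 for target trials, t_i = 1 for non-target trials), the cost and its components would read:

```latex
% Normalized detection cost at threshold \theta:
\[
  C_{\mathrm{Norm}}(\beta, \theta)
    = P_{\mathrm{Miss}}(\theta) + \beta \, P_{\mathrm{FA}}(\theta)
\]

% Equation (5): miss and false-alarm rates over N trials with scores s_i,
% using labels t_i = 0 (target) and t_i = 1 (non-target):
\[
  P_{\mathrm{Miss}}(\theta)
    = \frac{\sum_{i=1}^{N} (1 - t_i)\, \mathbf{1}(s_i < \theta)}
           {\sum_{i=1}^{N} (1 - t_i)},
  \qquad
  P_{\mathrm{FA}}(\theta)
    = \frac{\sum_{i=1}^{N} t_i\, \mathbf{1}(s_i \ge \theta)}
           {\sum_{i=1}^{N} t_i}
\]
```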
0:10:34 | The normalized detection cost function is not differentiable as a function of the threshold |
0:10:40 | due to the discontinuous nature of the indicator function, |
0:10:45 | and hence |
0:10:46 | it cannot be used as an objective function in a neural network. |
0:10:51 | What we do in our work to address this is propose |
0:10:54 | a differentiable approximation |
0:10:56 | of the normalized detection cost by approximating the indicator function with a sigmoid function. |
0:11:04 | This is shown in equation (6). |
0:11:07 | Here |
0:11:08 | the |
0:11:09 | approximations of the normalized detection cost are |
0:11:12 | given by P_Miss-soft and P_FA-soft, the soft detection costs. |
0:11:19 | Here t_i is the ground truth for trial i, s_i is the system |
0:11:25 | output score, or the log-likelihood ratio, |
0:11:29 | and sigma denotes the sigmoid function. |
0:11:32 | By choosing a large enough value for the warping factor alpha, |
0:11:37 | the approximation |
0:11:38 | can be made arbitrarily close |
0:11:41 | to the actual detection cost function over a wide range of thresholds. |
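A sketch of the soft detection cost as a PyTorch loss, under the same label convention (t = 0 for target, t = 1 for non-target); the alpha and beta values here are placeholders, not the settings used in the paper.

```python
import torch

def soft_detection_cost(scores, labels, theta, alpha=15.0, beta=99.0):
    """Differentiable approximation of the normalized detection cost.

    The indicator 1(s < theta) is replaced by a steep sigmoid
    sigma(-alpha * (s - theta)); labels are 0 for target trials
    and 1 for non-target trials.
    """
    tar = 1.0 - labels
    non = labels
    # Soft miss rate: target trials whose score falls below the threshold.
    p_miss = (tar * torch.sigmoid(-alpha * (scores - theta))).sum() / tar.sum()
    # Soft false-alarm rate: non-target trials scoring above the threshold.
    p_fa = (non * torch.sigmoid(alpha * (scores - theta))).sum() / non.sum()
    return p_miss + beta * p_fa

# Usage: gradients flow through the soft cost to the model scores
# (and the threshold itself can be made a learnable parameter).
scores = torch.randn(1000, requires_grad=True)
labels = (torch.rand(1000) > 0.5).float()
loss = soft_detection_cost(scores, labels, theta=0.0)
loss.backward()
```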
0:11:47 | Before we dive into the results, |
0:11:51 | let's look at the datasets used in training and testing the backend model. |
0:11:57 | We sampled about six point six million trials from the clean VoxCeleb set, |
0:12:03 | and an additional set of trials from the augmented VoxCeleb set. |
0:12:08 | For testing, we report results |
0:12:11 | on three datasets: |
0:12:13 | the Speakers in the Wild (SITW) eval core test condition, which consists |
0:12:18 | of around eight hundred thousand trials, |
0:12:21 | the VOiCES development set, which consists of about four million trials, |
0:12:26 | and the VOiCES evaluation set, |
0:12:29 | which consisted of roughly three and a half million trials. |
0:12:36 | The table here shows |
0:12:38 | the results on the SITW eval, VOiCES development and VOiCES evaluation sets for |
0:12:45 | various models, |
0:12:46 | like the Gaussian PLDA backend, the DPLDA, the pairwise Gaussian backend, and the proposed NPLDA. |
0:12:54 | Along with the soft detection cost, |
0:12:56 | we also ran our experiments |
0:12:58 | with binary cross entropy as the loss, which is denoted in the table as the |
0:13:03 | BCE loss. |
0:13:05 | We observe relative improvements in terms of minDCF |
0:13:09 | of around thirty-one percent, twenty percent and eleven percent |
0:13:13 | for SITW, VOiCES development and VOiCES evaluation respectively. |
0:13:20 | The best score for SITW eval core is an EER of two point zero five |
0:13:27 | percent |
0:13:27 | and a minDCF of zero point two. |
0:13:30 | For the VOiCES development set, we get a best EER of one point nine one percent |
0:13:35 | and zero point two as the best minDCF. |
0:13:39 | For the VOiCES evaluation set, |
0:13:41 | we get six point zero one percent EER as the best |
0:13:44 | EER score, and zero point four nine as the minDCF. |
0:13:49 | The improvements observed with the NPLDA model are consistent |
0:13:54 | with data augmentation |
0:13:56 | as well as |
0:13:57 | on the EER metric, |
0:13:59 | and optimizing the soft detection cost performs even better than |
0:14:04 | the binary cross entropy, or BCE, loss. |
0:14:10 | To summarize, |
0:14:12 | the proposed model is a step in exploring a discriminative neural network model for |
0:14:17 | the task of speaker verification. |
0:14:21 | Using a single elegant backend model that is targeted to optimize the speaker verification |
0:14:27 | loss, the NPLDA model uses |
0:14:30 | the extracted x-vector embeddings directly to generate the speaker verification score. |
0:14:36 | This model shows significant performance gains |
0:14:40 | on the SITW and VOiCES datasets. |
0:14:44 | We have also observed considerable improvements on other datasets, |
0:14:49 | like the NIST SRE datasets. |
0:14:52 | We have extended this as well |
0:14:54 | to an end-to-end model, |
0:14:56 | where the model is optimized not just from the extracted embeddings but directly |
0:15:01 | from acoustic features like MFCCs. |
0:15:04 | This work was accepted at Interspeech 2020. |
0:15:10 | These are some of the references. |
0:15:16 | Thank you. |