0:00:15 | My name is Omid Ghahabi, from the TALP Research Center, UPC.
0:00:20 | The topic of my talk is i-vector modeling using deep belief networks for multi-session speaker recognition.
0:00:32 | As you know, acoustic modeling using deep belief networks has been shown to be effective in the speech recognition area and is getting popular nowadays,
0:00:43 | but very few works using only RBMs (restricted Boltzmann machines) or generative DBNs have been carried out in the speaker recognition area.
0:00:54 | In our previous work, published in ICASSP 2014, we used both generative and discriminative DBNs.
0:01:07 | In that work we used only single-session target i-vectors as the inputs to the networks.
0:01:15 | In this paper we extend our previous work from a single-session to a multi-session task.
0:01:22 | We have used the NIST i-vector challenge database in these experiments,
0:01:28 | and we have also modified our proposed impostor selection method to be more accurate and more robust against its parameters.
0:01:41 | First I will give a short background on deep belief networks,
0:01:48 | then I will describe our DBN-based system and go into more detail on our proposed impostor selection method,
0:01:58 | and then I will show the experimental results and, at the end, the conclusions.
0:02:07 | Deep belief networks are originally probabilistic generative models
0:02:14 | in which every two adjacent layers are treated as a restricted Boltzmann machine.
0:02:21 | The outputs of each RBM are the inputs to the RBM above it, and the network is trained layer by layer.
0:02:33 | However, by adding a top label layer, this generative DBN can be converted to a discriminative one by doing standard backpropagation.
0:02:48 | On this slide I have some information about how the RBM is trained
0:02:56 | and how well it fits as pre-training for neural networks,
0:03:04 | but I think I can skip this; it is better to focus on our method.
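For readers who want the skipped detail, here is a minimal sketch of the greedy layer-wise RBM training behind a DBN, assuming plain NumPy, binary units, and CD-1 updates. The layer sizes, learning rate, and epoch counts are illustrative only; real-valued i-vectors would normally use a Gaussian-Bernoulli RBM for the first layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.01, epochs=10, seed=0):
    """Train one RBM with 1-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        h_prob = sigmoid(data @ W + b_h)                  # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob) * 1.0
        v_recon = sigmoid(h_sample @ W.T + b_v)           # one Gibbs step down
        h_recon = sigmoid(v_recon @ W + b_h)              # and back up
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_v, b_h

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise stacking: each RBM's hidden output feeds the next."""
    params, x = [], data
    for n_hidden in layer_sizes:
        W, b_v, b_h = train_rbm(x, n_hidden)
        params.append((W, b_v, b_h))
        x = sigmoid(x @ W + b_h)      # inputs for the RBM above
    return params
```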
0:03:17 | Let me remind you of the problem.
0:03:19 | The problem is to model each target speaker given the available i-vectors: we have five i-vectors per target speaker and a large amount of background i-vectors from the development set.
0:03:37 | Our proposal is to use deep belief networks for two main reasons:
0:03:42 | first, to take advantage of unsupervised learning using the unlabeled background data of the development set,
0:03:55 | and second, to take advantage of supervised learning to train each target model discriminatively.
0:04:04 | This is the whole block diagram of our proposed method; let's go through it in three main steps.
0:04:15 | The first step is balanced training.
0:04:19 | What is the problem of imbalanced training here? In this case we have a large amount of background i-vectors as negative samples and a few target i-vectors as positive samples.
0:04:33 | As we are going to model each target speaker discriminatively, training the network with such unbalanced data will lead to overfitting.
0:04:51 | So the solution we have proposed here is to decrease the number of background i-vectors as much as possible in an effective way.
0:05:02 | We do this in three main steps: first, we select only those background i-vectors that are more informative;
0:05:14 | then we cluster the selected impostors with the k-means algorithm using the cosine distance criterion and use the cluster centroids as the negative samples;
0:05:33 | and finally we distribute the positive and negative samples equally in the mini-batches.
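A minimal sketch of the last two of these steps, under stated assumptions: spherical k-means stands in for "k-means with the cosine distance criterion", and the equal-split mini-batch layout and all names are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def cosine_kmeans(X, k, iters=20, seed=0):
    """k-means under cosine distance: cluster unit-normalized i-vectors."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn[rng.choice(len(Xn), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(Xn @ C.T, axis=1)      # nearest centroid by cosine
        for j in range(k):
            members = Xn[labels == j]
            if len(members):
                c = members.mean(axis=0)
                C[j] = c / np.linalg.norm(c)
    return C                                       # centroids = negative samples

def balanced_batches(pos, neg, batch_size, seed=0):
    """Mini-batches with equal shares of target (pos) and impostor (neg) rows."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    for start in range(0, len(neg) - half + 1, half):
        p = pos[rng.choice(len(pos), size=half)]   # resample the few targets
        X = np.vstack([p, neg[start:start + half]])
        y = np.concatenate([np.ones(half), np.zeros(half)])
        yield X, y
```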
0:05:47 | The second step is the adaptation process that we proposed in our previous work.
0:05:54 | Using all the background i-vectors, we pre-train a deep network unsupervised, without labels,
0:06:07 | and we call the trained model the universal deep belief network (UDBN).
0:06:12 | Then each target speaker network will be adapted from this universal DBN.
0:06:21 | But how does the adaptation work?
0:06:26 | In adaptation, we initialize the network with the UDBN parameters instead of randomly,
0:06:35 | and then do unsupervised learning on the balanced data from step one for only a few iterations.
0:06:50 | In our previous work we have shown that pre-training in this case works better than random initialization,
0:06:58 | and the proposed adaptation works better than pre-training.
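A rough sketch of the adaptation step as described, assuming the UDBN is stored as a list of (weights, visible-bias, hidden-bias) tuples as in the earlier RBM sketch; the learning rate and iteration count are placeholders for the "few iterations" mentioned.

```python
import copy
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adapt_from_udbn(udbn_params, balanced_x, lr=0.005, n_iters=3):
    """Speaker-specific adaptation: start from UDBN weights, not random ones,
    then run only a few unsupervised CD-1 epochs on the balanced data."""
    params = copy.deepcopy(udbn_params)          # initialize from the UDBN
    x = balanced_x
    for W, b_v, b_h in params:
        for _ in range(n_iters):                 # only a few iterations
            h = sigmoid(x @ W + b_h)
            v = sigmoid(h @ W.T + b_v)           # reconstruct the input
            h2 = sigmoid(v @ W + b_h)
            W += lr * (x.T @ h - v.T @ h2) / len(x)
            b_v += lr * (x - v).mean(axis=0)
            b_h += lr * (h - h2).mean(axis=0)
        x = sigmoid(x @ W + b_h)                 # feed the next layer up
    return params
```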
0:07:05 | The last step is fine-tuning, which is actually backpropagation of the neural network using the label layer.
0:07:17 | But we have changed something here in comparison to standard backpropagation:
0:07:25 | we do only top-layer backpropagation for a few iterations before the full backpropagation is carried out.
0:07:35 | The experimental results in our previous work have shown that this works better:
0:07:41 | adapting only the top label layer first is something like pre-training the top layer as well,
0:07:50 | and it works better than running the whole backpropagation without doing this.
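A sketch of this modified fine-tuning for the single-hidden-layer case used in these experiments, assuming a sigmoid output unit and plain batch gradient descent; `top_iters` and `full_iters` are illustrative placeholders for the "few iterations" mentioned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fine_tune(W1, b1, X, y, lr=0.1, top_iters=5, full_iters=50, seed=0):
    """Backprop with a warm-up phase that updates only the top label layer."""
    rng = np.random.default_rng(seed)
    W2 = 0.01 * rng.standard_normal(W1.shape[1])   # new top (label) layer
    b2 = 0.0
    for step in range(top_iters + full_iters):
        h = sigmoid(X @ W1 + b1)                   # hidden activations
        p = sigmoid(h @ W2 + b2)                   # target/non-target posterior
        err = p - y                                # cross-entropy gradient
        W2 -= lr * h.T @ err / len(X)              # label layer: always updated
        b2 -= lr * err.mean()
        if step >= top_iters:                      # afterwards: full backprop
            dh = np.outer(err, W2) * h * (1.0 - h)
            W1 -= lr * X.T @ dh / len(X)
            b1 -= lr * dh.mean(axis=0)
    return W1, b1, W2, b2
```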
0:08:03 | On the other hand, we can divide our block diagram into two main phases: the first phase is target-independent and the second is target-dependent.
0:08:19 | In the target-independent phase, using the whole set of background i-vectors, we train the universal deep belief network
0:08:27 | and compute the impostor centroids.
0:08:32 | This process is carried out only once for all the target speakers we have.
0:08:40 | In the second phase, using the UDBN, the impostor centroids, and the available target i-vectors, we train our networks discriminatively.
0:09:00 | Let's go into more detail on the proposed impostor selection method.
0:09:04 | This method is similar to the support-vector-based approach proposed by Mitchell McLaren,
0:09:13 | but we have used the cosine distance criterion here, and we have changed some other things.
0:09:28 | It is composed of four main steps.
0:09:31 | Assume we have the whole set of background i-vectors on one hand and the client i-vectors on the other,
0:09:39 | where each client i-vector in this case is the average of the five i-vectors per client.
0:09:47 | We compare each client i-vector with all the background i-vectors using the cosine distance criterion,
0:09:53 | and the top n closest background i-vectors to each client are kept in a storage at this step.
0:10:05 | We do the same for all the client i-vectors.
0:10:10 | Given the collected i-vectors for all the clients we have,
0:10:15 | we compute the impostor frequencies at this stage and normalize them by n, the number of top i-vectors kept for each client, and the whole number of client i-vectors.
0:10:33 | We have seen that with this normalization the impostor frequencies are more robust against the threshold that we will define on these frequencies.
0:10:48 | Then we set a threshold on these normalized impostor frequencies, and those impostors with higher frequencies than this threshold will be selected as the most informative impostors.
0:11:05 | So in the end, for each background i-vector we will have one impostor frequency, and those impostors whose frequencies exceed the defined threshold will be selected.
0:11:27 | The threshold and the parameter n will be determined experimentally in the experiments section.
0:11:41 | If we order the impostor frequencies of the impostors,
0:11:46 | we will see that many impostors have the same frequency.
0:11:55 | That is why we have defined a threshold on the impostor frequencies rather than just selecting a fixed number of top impostors.
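The four steps can be summarized in a short sketch. One caveat: the normalization is stated only loosely here ("by n and the number of client i-vectors"), so dividing the counts by n times the number of clients is my reading, not a confirmed formula; the threshold and n are the experimentally tuned parameters.

```python
import numpy as np

def select_impostors(background, clients, n, threshold):
    """background: (B, d) dev i-vectors; clients: (C, d) averaged client i-vectors."""
    Bn = background / np.linalg.norm(background, axis=1, keepdims=True)
    Cn = clients / np.linalg.norm(clients, axis=1, keepdims=True)
    counts = np.zeros(len(Bn))
    for q in Cn:
        sims = Bn @ q                        # cosine score vs. every impostor
        counts[np.argsort(sims)[-n:]] += 1   # keep the n closest, count hits
    freq = counts / (n * len(Cn))            # normalized impostor frequency
    return background[freq > threshold]      # the most informative impostors
```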
0:12:12 | Now the experimental section. The dataset we have used is the NIST 2014 i-vector challenge, where the i-vector size, as you know, is 600.
0:12:25 | The post-processing we have applied on the i-vectors is global mean normalization plus whitening.
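A brief sketch of that post-processing, assuming (as is common practice, though not stated explicitly here) that both the global mean and the whitening transform are estimated on the development set:

```python
import numpy as np

def fit_postprocess(dev_ivecs):
    """Estimate global mean and a whitening transform on the development set."""
    mu = dev_ivecs.mean(axis=0)
    cov = np.cov(dev_ivecs - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-10)) @ vecs.T   # ZCA whitening
    return mu, W

def apply_postprocess(ivecs, mu, W):
    """Center, then decorrelate and scale toward identity covariance."""
    return (ivecs - mu) @ W
```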
0:12:33 | One hidden layer is used in these experiments, and the hidden layer size is 400.
0:12:42 | For tuning the two parameters of the impostor selection method, the threshold and the parameter n,
0:12:49 | we plot the minimum DCF versus the threshold for different values of n.
0:13:01 | We will see that if n is too small the results are not good,
0:13:11 | and if n is too high the performance of the system will not change considerably when changing the threshold.
0:13:18 | The best choice according to our experiments is n equal to 100,
0:13:28 | and by setting the threshold at this value, with n equal to 100, we obtain the minimum value of the minimum DCF.
0:13:44 | Now the experimental results. In this challenge we had one baseline system, and everyone knows the baseline.
0:13:55 | Our proposed DBN-based system uses target-independent impostors, that is, global impostors that are the same for all the target speakers.
0:14:06 | If we run this experiment, we get these results:
0:14:11 | there is a big difference between the baseline system and our system.
0:14:18 | If we then add target-dependent impostors to the target-independent ones, where n is again 100 and the additional pool is target-dependent, we will have better performance, as you can see here.
0:14:41 | But in this case, with the target-dependent impostors, the complexity of the system is higher than in the first case,
0:14:47 | because for each target speaker we need to do the clustering separately,
0:15:01 | while in the first case we compute the impostor centroids only once for all the speakers.
0:15:11 | Now consider z-norm score normalization on our baseline and on our DBN-based system. Without z-norm, the results are these.
0:15:24 | If we add z-norm using the whole impostor database we have, the development set, we will have worse results.
0:15:33 | If we select only the top 1000 closest i-vectors as impostors, it will be a bit better, but it is still worse than without using z-norm.
0:15:47 | But if we use the same impostor selection method for the z-norm cohort,
0:15:52 | setting the parameters, the threshold and n, again for this z-norm,
0:16:07 | we will see that we have a big improvement here.
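For reference, a sketch of z-norm with the selected cohort; `model_score_fn` is a hypothetical stand-in for scoring one i-vector against a given target model:

```python
import numpy as np

def znorm(raw_score, model_score_fn, cohort_ivecs):
    """Normalize a trial score by the model's score statistics on the cohort."""
    cohort = np.array([model_score_fn(x) for x in cohort_ivecs])
    return (raw_score - cohort.mean()) / cohort.std()
```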
0:16:15 | In comparison to the baseline system, we will see that we have more than a twenty percent improvement.
0:16:31 | Actually, this twenty percent improvement is in comparison with these results; against the other results the improvement is more than this.
0:16:44 | But in these experiments, for the impostor selection method, we have used the client i-vectors.
0:16:53 | Our new experimental results have shown that if we do not use the client i-vectors,
0:17:03 | and instead just select the same number n of i-vectors from only the development set,
0:17:14 | we will have almost the same results; they are very similar.
0:17:20 | So actually, for our proposed system it does not matter whether we use the client i-vectors in the impostor selection method
0:17:28 | or just randomly choose the same number n of i-vectors from only the background i-vectors.
0:17:44 | Now the main conclusions.
0:17:46 | In this paper we have improved the proposed impostor selection method,
0:17:51 | and we have shown that it helps the system to achieve a good performance in the multi-session task.
0:18:07 | Moreover, having more i-vectors available per target speaker helped the DBN system to capture more speaker and session variabilities in comparison to the single-session task.
0:18:25 | Also, the final discriminative DBN-based approach showed a considerable performance improvement in comparison to the conventional baseline system proposed by NIST in this challenge.
0:18:42 | Thank you.
0:18:51 | We have time for questions.
0:18:58 | Thanks for the talk; I liked the extension of the background dataset selection that you did.
0:19:03 | One question that comes to mind: when you are doing the selection, you are looking at all the clients that are going to be enrolled in the system, right?
0:19:12 | So when you are doing this dataset selection, you are looking not just at what is statistically important but at the clients that are going to be enrolled in the system,
0:19:23 | so the system itself contains information about who you are going to test on.
0:19:27 | Why wouldn't you just do closed-set speaker ID then?
0:19:33 | So, restating it: when you are choosing your impostors, before your DBN training or z-norm,
0:19:41 | that selection process itself is aware of all your target speakers?
0:19:47 | Yes, that's correct.
0:19:48 | So why not take it further and just do closed-set speaker ID for the i-vector challenge?
0:19:53 | Yes, and that's why, in the experimental results, I told you
0:20:00 | that if we don't use the client i-vectors and instead randomly select the same number of i-vectors only from the development set,
0:20:14 | using an iterative process (for instance, choose one thousand three hundred i-vectors randomly from the development set and do the same process of computing the impostor frequencies,
0:20:30 | then again choose n random i-vectors, do the same, and average over all the impostor frequencies), with the same threshold and parameter settings,
0:20:46 | we had almost the very same results as these results obtained using the client i-vectors.
0:20:56 | So that's a selection that is not aware of the clients, in that sense.
0:21:01 | Very nice.
0:21:03 | Well, yes, technically looking at the other clients was against the rules of the i-vector challenge, but he has a solution that doesn't.
0:21:10 | The other thing is that the closed-set scoring you suggest wouldn't actually work here, because they are all different speakers.