0:00:00 | okay |
---|
0:00:01 | hi everyone |
---|
0:00:04 | and i work at the |
---|
0:00:06 | brno university of technology, and i will be giving this tutorial |
---|
0:00:11 | about |
---|
0:00:13 | end-to-end speaker verification |
---|
0:00:20 | so the topics |
---|
0:00:22 | to discuss in this tutorial: we'll start with some background and the definition of |
---|
0:00:28 | end-to-end training |
---|
0:00:30 | and then discuss some alternative training procedures which are often used |
---|
0:00:36 | and then talk about the motivation for end-to-end training |
---|
0:00:40 | and continue with some difficulties of end-to-end training |
---|
0:00:45 | and then |
---|
0:00:46 | review some of the |
---|
0:00:50 | existing work on end-to-end speaker recognition, though not in great |
---|
0:00:55 | detail |
---|
0:01:01 | and then we will wrap up with a summary and open up for |
---|
0:01:01 | questions |
---|
0:01:03 | i would also like to give some acknowledgements and thanks to my colleagues from but, |
---|
0:01:10 | with whom i have |
---|
0:01:11 | discussed these topics a lot |
---|
0:01:15 | so let's start with recognition |
---|
0:01:19 | this is |
---|
0:00:21 | a kind of typical |
---|
0:00:23 | pattern recognition scenario and, |
---|
0:00:26 | as is customary, we assume we have some features x and some labels y |
---|
0:01:30 | and we wish to find some function which is parameterized by |
---|
0:01:35 | theta, let's say, |
---|
0:01:37 | and which, |
---|
0:01:38 | given the |
---|
0:01:41 | features, predicts |
---|
0:01:42 | some label |
---|
0:01:46 | which should be close or equal to the true |
---|
0:01:50 | label |
---|
0:01:53 | to be more precise |
---|
0:01:54 | we would like the prediction to be such that some loss function which compares |
---|
0:02:00 | the |
---|
0:02:00 | predicted label with the true label is as small as possible on unseen data |
---|
0:02:06 | and the loss functions for example if we do call a classification it can be |
---|
0:02:11 | something that is |
---|
0:02:13 | zero if the predicted label is the same as the true label and one |
---|
0:02:16 | otherwise; this is a kind of error count, |
---|
0:02:18 | or zero-one loss |
---|
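The zero-one loss just described is simple to write down; here is a minimal sketch in Python (an illustration, with function names of my own, not something from the talk):

```python
# Zero-one loss: 0 when the predicted label equals the true label, 1 otherwise.
# Its average over a dataset is exactly the classification error rate.
def zero_one_loss(y_pred, y_true):
    return 0 if y_pred == y_true else 1

def error_rate(predictions, labels):
    return sum(zero_one_loss(p, y) for p, y in zip(predictions, labels)) / len(labels)

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # one mistake out of four -> 0.25
```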
0:02:23 | of course ideally what we want to do is to |
---|
0:02:28 | minimize the expected loss on unseen test data, which we could calculate like this |
---|
0:02:36 | and here we use capital x and y to denote that they are unseen random |
---|
0:02:40 | variables |
---|
0:02:41 | but since we don't know the probability distribution of |
---|
0:02:44 | x and y we cannot do this |
---|
0:02:46 | exactly or explicitly |
---|
0:02:49 | so |
---|
0:02:51 | in the supervised learning problem we have access to some training data which would be |
---|
0:02:56 | many examples of features and labels |
---|
0:03:02 | and |
---|
0:03:02 | we compute the average loss on the training data and we are trying to minimize |
---|
0:03:07 | that |
---|
0:03:08 | and then we hope that |
---|
0:03:11 | this procedure here means that we will also get a low loss on |
---|
0:03:15 | unseen test data |
---|
0:03:18 | and this is called empirical risk minimisation |
---|
0:03:21 | and it is expected to work if |
---|
0:03:25 | the classifier that we use is not too powerful |
---|
0:03:29 | (to |
---|
0:03:30 | be precise, its vc dimension should be finite) and it |
---|
0:03:34 | also requires that the distribution of the loss |
---|
0:03:37 | does |
---|
0:03:38 | not have heavy tails, but for typical scenarios this |
---|
0:03:42 | empirical risk minimisation procedure is expected to work |
---|
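Empirical risk minimisation as just described can be sketched in a few lines; a toy illustration with made-up one-dimensional data and a single threshold parameter (all values here are my own, chosen for the example):

```python
# Empirical risk minimisation: choose the model parameter (here a 1-D decision
# threshold) that minimises the average zero-one loss on the training data,
# hoping this generalises to unseen data.
train_x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
train_y = [0,   0,   0,   1,   1,   1  ]

def empirical_risk(theta):
    preds = [1 if x > theta else 0 for x in train_x]
    return sum(p != y for p, y in zip(preds, train_y)) / len(train_y)

# grid search over candidate thresholds
best_theta = min([i / 10 for i in range(20)], key=empirical_risk)
print(best_theta, empirical_risk(best_theta))
```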
0:03:49 | so then let's talk about speaker recognition |
---|
0:03:53 | as probably most |
---|
0:03:55 | in the audience here knows, we have these three subtasks of speaker recognition |
---|
0:04:01 | it's speaker identification |
---|
0:04:04 | which basically is to classify within a closed set of speakers, so this is a very |
---|
0:04:09 | standard |
---|
0:04:11 | pattern recognition |
---|
0:04:14 | scenario, and then we have speaker verification, where we deal with |
---|
0:04:17 | open set as we say |
---|
0:04:19 | so the speakers that we may see in testing |
---|
0:04:22 | are not the same as those we have access to in training when building the model |
---|
0:04:27 | and our task is typically to say whether two segments or utterances are from the same |
---|
0:04:32 | speaker or not |
---|
0:04:34 | and then there's also speaker diarization which is |
---|
0:04:38 | to assign, in a long recording, each time region |
---|
0:04:43 | to a speaker |
---|
0:04:47 | so here i will focus on speaker verification because the speaker identification task is |
---|
0:04:53 | quite easy you know at least conceptually |
---|
0:04:57 | and speaker diarization is hard, and end-to-end approaches are still at a very early stage |
---|
0:05:03 | although some great |
---|
0:05:05 | work has been done |
---|
0:05:07 | it's maybe too early to focus on that in a tutorial |
---|
0:05:14 | so |
---|
0:05:15 | generally |
---|
0:05:17 | it's |
---|
0:05:19 | preferable |
---|
0:05:20 | if a classifier |
---|
0:05:22 | outputs |
---|
0:05:23 | not a hard prediction, like it is this class or that class, but |
---|
0:05:28 | rather probabilities of the different classes |
---|
0:05:31 | so we would like some |
---|
0:05:34 | classifier that gives an estimate of the probability of some label given the data |
---|
0:05:39 | in the case of speaker verification we would rather prefer it to output log-likelihood |
---|
0:05:45 | ratios |
---|
0:05:46 | because from that we can |
---|
0:05:49 | obtain |
---|
0:05:51 | the probability of a class given the data (the classes here are just target versus |
---|
0:05:55 | non-target) |
---|
0:05:57 | and we can |
---|
0:05:59 | do this based on a specified prior probability |
---|
0:06:03 | so it gives a bit more flexibility in how to use the |
---|
0:06:07 | system |
---|
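That flexibility can be sketched concretely; given an LLR and any assumed prior probability of a target trial, Bayes' rule yields the posterior (the function name is my own, for illustration):

```python
import math

# Given a log-likelihood ratio and a specified prior of a target trial,
# Bayes' rule gives the posterior: posterior log-odds = LLR + prior log-odds.
def posterior_target(llr, p_target):
    prior_log_odds = math.log(p_target / (1.0 - p_target))
    return 1.0 / (1.0 + math.exp(-(llr + prior_log_odds)))

# The same system output serves applications with different priors:
print(posterior_target(2.0, 0.5))   # equal priors
print(posterior_target(2.0, 0.01))  # rare-target application
```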
0:06:11 | so |
---|
0:06:13 | let's now talk about end-to-end training |
---|
0:06:16 | and my impression is that it's not completely well defined in the literature |
---|
0:06:23 | but it seems to entail |
---|
0:06:26 | these two |
---|
0:06:29 | aspects |
---|
0:06:30 | first all parameters of the system |
---|
0:06:34 | should be trained jointly and that could be anything from feature extraction to producing some |
---|
0:06:38 | speaker embedding |
---|
0:06:40 | to the backend, the comparison of speaker embeddings, and producing the score |
---|
0:06:46 | the second aspect is that |
---|
0:06:48 | an end-to-end system should be trained specifically for the |
---|
0:06:51 | intended task, which in our case would be verification |
---|
0:06:58 | one could be even stricter and say that it should match the exact evaluation metric |
---|
0:07:02 | that we are interested in, for example the error rate |
---|
0:07:06 | so |
---|
0:07:07 | in this tutorial i will try to |
---|
0:07:11 | discuss |
---|
0:07:17 | how |
---|
0:07:19 | important |
---|
0:07:20 | these criteria are, or how difficult it can be |
---|
0:07:24 | to impose these criteria, or what it means if we don't do it |
---|
0:07:31 | so |
---|
0:07:33 | first |
---|
0:07:33 | let's look at what would |
---|
0:07:37 | typical end-to-end speaker verification architecture |
---|
0:07:41 | look like and |
---|
0:07:43 | as far as i know, this was first attempted for speaker verification in two |
---|
0:07:47 | thousand sixteen |
---|
0:07:49 | in the paper mentioned here |
---|
0:07:53 | and |
---|
0:07:54 | so we start with some |
---|
0:07:57 | enrollment utterances, |
---|
0:07:59 | here it's three, and we have some test utterance |
---|
0:08:02 | all of these go through some embedding-extracting neural network |
---|
0:08:06 | which can be many different architectures |
---|
0:08:09 | this produces embeddings, which are fixed-size |
---|
0:08:12 | so |
---|
0:08:14 | utterance representations |
---|
0:08:16 | one for each utterance, so three enrollment embeddings and one test embedding |
---|
0:08:22 | and then we will create one enrollment model by some kind of pooling, for |
---|
0:08:26 | example taking the mean |
---|
0:08:28 | of the enrollment embeddings |
---|
0:08:31 | and then we have some similarity measure, and in the end |
---|
0:08:35 | a score comes out that gives |
---|
0:08:38 | the log-likelihood ratio for |
---|
0:08:41 | the hypothesis that this |
---|
0:08:43 | test segment |
---|
0:08:44 | is from the same speaker as these enrollment segments |
---|
0:08:49 | and |
---|
0:08:50 | and all these parts of the model should be |
---|
0:08:55 | trained |
---|
0:08:56 | jointly |
---|
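The pipeline just described can be sketched roughly like this, with a stand-in embedding extractor in place of a trained neural network (all names, dimensions, and the cosine similarity here are assumptions for illustration, not the architecture of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_embedding(utterance_features):
    # placeholder: average the frame features; a real extractor is a deep network
    return utterance_features.mean(axis=0)

def cosine_score(enroll_model, test_emb):
    # one simple similarity measure; an end-to-end system would learn this part too
    return float(enroll_model @ test_emb /
                 (np.linalg.norm(enroll_model) * np.linalg.norm(test_emb)))

# three enrollment utterances and one test utterance (frames x feature dim)
enroll_utts = [rng.normal(size=(100, 40)) for _ in range(3)]
test_utt = rng.normal(size=(80, 40))

# pool the enrollment embeddings into one enrollment model, then score
enroll_model = np.mean([extract_embedding(u) for u in enroll_utts], axis=0)
score = cosine_score(enroll_model, extract_embedding(test_utt))
print(score)
```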
0:09:02 | to be a bit fair, and maybe for historical interest, we should say that |
---|
0:09:08 | this is |
---|
0:09:10 | not a |
---|
0:09:12 | new idea |
---|
0:09:15 | it existed already in nineteen ninety three; that's the earliest i'm aware of |
---|
0:09:21 | at least |
---|
0:09:22 | and one paper at the time was about |
---|
0:09:26 | handwritten signature recognition and another paper was about fingerprint recognition |
---|
0:09:33 | but they used exactly this idea |
---|
0:09:39 | and |
---|
0:09:40 | okay, so we talked about end-to-end |
---|
0:09:43 | training and modeling |
---|
0:09:46 | so what would be the alternative |
---|
0:09:49 | one thing would be |
---|
0:09:51 | generative modeling so we train a generative model |
---|
0:09:54 | that |
---|
0:09:55 | means a model that can generate the data, both the observations x and |
---|
0:10:02 | labels y |
---|
0:10:08 | it can also give us |
---|
0:10:10 | the probability, or probability density, of such observations |
---|
0:10:16 | we typically train with maximum likelihood, and if the model is correctly specified, for example |
---|
0:10:22 | if the data really comes from a normal distribution and we have assumed that |
---|
0:10:26 | in our model are then |
---|
0:10:29 | with enough training data we will find the correct parameters, but |
---|
0:10:33 | that is rarely the case |
---|
0:10:35 | and it's may be worth pointing out that |
---|
0:10:37 | the llrs from such a model are the best |
---|
0:10:40 | we can have |
---|
0:10:43 | if we have access to the log-likelihood ratios |
---|
0:10:47 | from the model that really generated the data |
---|
0:10:51 | then we can make the optimal decision for classification or verification; |
---|
0:10:56 | no |
---|
0:10:57 | other |
---|
0:10:58 | classifier would perform better |
---|
0:11:04 | the problem with this is that when the |
---|
0:11:07 | model |
---|
0:11:09 | assumptions are not correct then the parameters we find with maximum likelihood may not be |
---|
0:11:14 | optimal for classification |
---|
0:11:17 | and sometimes maximum likelihood training is also difficult |
---|
0:11:25 | other approaches would be some type of discriminative training; end-to-end training can be |
---|
0:11:30 | seen as one type of discriminative training, but in other discriminative |
---|
0:11:36 | approaches we can try to train the neural network, i.e. the embedding extractor, for speaker |
---|
0:11:41 | identification, which seems to be the most |
---|
0:11:45 | popular approach right now |
---|
0:11:48 | and then we will use the output of some intermediate layer as the embedding and train |
---|
0:11:54 | another |
---|
0:11:55 | backend on top of that |
---|
0:11:58 | then there is, of course, metric learning, which |
---|
0:12:05 | will |
---|
0:12:07 | kind of train the embedding extractor together with a distance metric, which sometimes can |
---|
0:12:12 | be simple |
---|
0:12:14 | so in principle the embedding and the kind of distance metric or backend are |
---|
0:12:19 | trained jointly |
---|
0:12:21 | but typically not for the speaker verification task |
---|
0:12:24 | so this is kind of end-to-end training according to the first criterion but not |
---|
0:12:28 | according to the second |
---|
0:12:32 | so |
---|
0:12:34 | now |
---|
0:12:36 | we will |
---|
0:12:38 | discuss |
---|
0:12:40 | why end-to-end training would be preferable |
---|
0:12:44 | so |
---|
0:12:45 | we had two things: one is that we should train modules jointly, and the other |
---|
0:12:48 | thing is that we should train for the |
---|
0:12:50 | intended task |
---|
0:12:52 | so |
---|
0:12:54 | mm |
---|
0:12:56 | the case for joint training is actually quite obvious; let's consider a |
---|
0:13:01 | system consisting of two modules a and b, and we have theta a, which |
---|
0:13:05 | are the parameters of module a, and theta b, which are the |
---|
0:13:08 | parameters of module b; if we just first train module a and then |
---|
0:13:14 | module b |
---|
0:13:15 | it is essentially like doing |
---|
0:13:18 | one iteration of |
---|
0:13:20 | coordinate descent or block coordinate descent |
---|
0:13:22 | so we train model |
---|
0:13:24 | a |
---|
0:13:25 | and we get here, then we train module b and we get here |
---|
0:13:29 | but we will not get further than that, not to the optimum, which would be here |
---|
0:13:34 | so of course we could continue |
---|
0:13:38 | to |
---|
0:13:39 | do a few more iterations |
---|
0:13:40 | and we might end up in the |
---|
0:13:43 | optimum, and this is actually, in principle, equivalent to joint optimization |
---|
0:13:51 | when we have a non-convex model, as we often do, we may not actually |
---|
0:13:55 | reach the same |
---|
0:13:57 | optimum as if we trained |
---|
0:14:00 | all the parameters in one go; what happens also depends on which optimizer |
---|
0:14:05 | we use, so |
---|
0:14:06 | in principle |
---|
0:14:08 | this is |
---|
0:14:12 | why joint training would |
---|
0:14:16 | really make sure that you find the optimum |
---|
0:14:19 | of both |
---|
0:14:20 | modules, and that's clearly better than just training |
---|
0:14:25 | first one and then the other |
---|
0:14:28 | so i think there is no real argument here: |
---|
0:14:31 | this part of end-to-end training, |
---|
0:14:36 | the joint training of modules, is justified |
---|
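The block-coordinate-descent picture can be illustrated numerically on a small coupled quadratic objective (my own toy example, not from the talk): one pass of train-module-a-then-module-b falls short of the joint optimum, while repeated alternation approaches it.

```python
# A coupled quadratic loss over the parameters of two "modules" a and b.
def loss(a, b):
    return a * a + b * b + 1.8 * a * b  # the cross term couples the modules

a, b = 2.0, 2.0
a = -0.9 * b          # exact minimiser of loss over a with b fixed
b = -0.9 * a          # exact minimiser of loss over b with a fixed
one_pass_loss = loss(a, b)
for _ in range(49):   # keep alternating (block coordinate descent)
    a = -0.9 * b
    b = -0.9 * a
print(one_pass_loss, loss(a, b))  # joint optimum is 0 at (0, 0)
```

One pass leaves a clearly non-zero loss; many alternations drive it essentially to the joint optimum, which matches the convex case sketched on the slide.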
0:14:42 | then the task-specific training: the idea that we should train for |
---|
0:14:48 | the |
---|
0:14:51 | intended task; so if in |
---|
0:14:55 | our application we want to do speaker verification, why should we train for verification |
---|
0:15:00 | and not for identification, for example |
---|
0:15:04 | well |
---|
0:15:05 | first we should say that |
---|
0:15:10 | we have some guarantee that this idea of minimizing loss on training data |
---|
0:15:14 | leads to good performance on test data, the empirical risk minimisation idea |
---|
0:15:20 | and the only guarantee we have there |
---|
0:15:26 | holds only if we are training for |
---|
0:15:30 | the metric that we are interested in, for the task that we are interested in |
---|
0:15:35 | if we |
---|
0:15:36 | train for one task and |
---|
0:15:39 | evaluate |
---|
0:15:40 | on another, then we don't really have any guarantee that |
---|
0:15:45 | we find the optimal model parameters for that task |
---|
0:15:49 | but one can of course ask: shouldn't it really work anyway to train for |
---|
0:15:54 | identification |
---|
0:15:55 | and use the model for verification, "'cause" it's kind of a similar task? |
---|
0:16:00 | it does, as we know |
---|
0:16:02 | but let's just discuss a little bit what could |
---|
0:16:05 | go wrong |
---|
0:16:08 | or why it wouldn't be optimal |
---|
0:16:16 | so here is a kind of toy example |
---|
0:16:20 | we are looking at one-dimensional embeddings, so we imagine that these are |
---|
0:16:25 | the distributions of one-dimensional embeddings |
---|
0:16:31 | so the embedding space is here, and each of these colours represents the |
---|
0:16:38 | distribution of embeddings for some speaker: blue is one speaker, green is |
---|
0:16:42 | another speaker, and so on |
---|
0:16:46 | of course, the particular |
---|
0:16:49 | shape of the distributions i show here is chosen kind of for simplicity |
---|
0:16:54 | so in this example we assume that the means of the |
---|
0:16:59 | speakers are equidistant, with equal distance like this |
---|
0:17:06 | so |
---|
0:17:09 | what would be the identification error in this case |
---|
0:17:13 | so whenever we observe an embedding we will assign it to the closest speaker |
---|
0:17:19 | so |
---|
0:17:20 | if we |
---|
0:17:22 | observe an embedding in this region, we will assign it to that speaker |
---|
0:17:26 | if we observe it here |
---|
0:17:28 | we will assign it to |
---|
0:17:31 | this, |
---|
0:17:32 | the green, |
---|
0:17:34 | speaker |
---|
0:17:36 | and of course it means that sometimes an embedding |
---|
0:17:43 | sampled from the blue speaker will be here, but we will assign it |
---|
0:17:47 | to the |
---|
0:17:48 | green |
---|
0:17:48 | speaker's area |
---|
0:17:50 | so we will have some error in this situation |
---|
0:17:54 | and |
---|
0:17:55 | if we consider only the neighboring speakers the error rate will be |
---|
0:17:59 | twelve point two percent in this example |
---|
0:18:10 | what would be the verification error rate |
---|
0:18:14 | so |
---|
0:18:15 | if we consider |
---|
0:18:16 | for this type of data |
---|
0:18:18 | so |
---|
0:18:19 | we will assume that we |
---|
0:18:21 | have speakers |
---|
0:18:23 | whose means are equidistantly distributed |
---|
0:18:26 | like |
---|
0:18:27 | these stars |
---|
0:18:29 | and |
---|
0:18:30 | now for a target trial we will sample two |
---|
0:18:35 | embeddings from one speaker |
---|
0:18:37 | and see if they are closer to each other than some threshold |
---|
0:18:41 | which happens to be the optimal threshold for this distribution |
---|
0:18:46 | and if |
---|
0:18:48 | they are further apart than the threshold, we will count that as an error |
---|
0:19:02 | for the target trials, |
---|
0:19:05 | and correspondingly for non-target trials |
---|
0:19:08 | so |
---|
0:19:11 | here, in this example, we can see |
---|
0:19:14 | we would have an error rate of fourteen percent |
---|
0:19:17 | again i'm only actually considering that the non-target trials are from neighboring speakers |
---|
0:19:26 | that's why the error rate is high |
---|
0:19:33 | so |
---|
0:19:34 | now |
---|
0:19:35 | i'm only changing the distributions a little bit, |
---|
0:19:39 | the within-speaker distributions, so |
---|
0:19:43 | as before |
---|
0:19:45 | the speaker means are at the same distance |
---|
0:19:48 | like this |
---|
0:19:50 | and |
---|
0:19:50 | we have made the within-speaker distribution a little bit more narrow here and a little |
---|
0:19:55 | bit more broad here |
---|
0:19:57 | the overall within-speaker variance is the same, but it has a little bit different |
---|
0:20:01 | shape |
---|
0:20:02 | and we will see that identification error has increased to thirteen point seven percent |
---|
0:20:09 | whereas the verification error is better |
---|
0:20:15 | well |
---|
0:20:16 | in a more extreme situation, we have made |
---|
0:20:19 | the distributions equally peaked, or broad, |
---|
0:20:23 | like these two-mode mixtures |
---|
0:20:26 | now the speaker means are still all at the same distance |
---|
0:20:31 | like this |
---|
0:20:32 | and the within-speaker variance is |
---|
0:20:35 | well, the within-speaker variance is also the same as before |
---|
0:20:40 | and here it would actually get |
---|
0:20:42 | zero |
---|
0:20:44 | identification error |
---|
0:20:46 | but we will have worse |
---|
0:20:48 | verification error than in any of the other examples, and that's because |
---|
0:20:53 | if we sample a target trial we will very often have |
---|
0:20:57 | embeddings that are far from each other, and similarly |
---|
0:21:01 | for non-target trials we will very often have embeddings that are close to each other |
---|
0:21:07 | so this |
---|
0:21:08 | example |
---|
0:21:10 | should illustrate that |
---|
0:21:14 | the within-speaker distribution that is optimal for identification is not |
---|
0:21:20 | necessarily the distribution that is optimal for verification |
---|
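A small Monte-Carlo sketch of these toy examples (with my own assumed means, variances, and threshold, not the exact numbers from the slides) shows the same effect: a bimodal within-speaker distribution can have lower identification error but higher verification error than a Gaussian one of the same variance.

```python
import random

random.seed(0)
SPACING = 4.0  # assumed distance between neighbouring speaker means

def sample(speaker_mean, bimodal):
    if bimodal:
        # two narrow modes at mean +/- 1: overall within-speaker variance ~1
        return speaker_mean + random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05)
    return speaker_mean + random.gauss(0.0, 1.0)  # Gaussian, variance 1

def identification_error(bimodal, n=20000):
    # nearest-mean decision between speakers at 0 and SPACING: an embedding
    # from the speaker at 0 is misclassified beyond the midpoint
    return sum(sample(0.0, bimodal) > SPACING / 2 for _ in range(n)) / n

def verification_error(bimodal, threshold, n=20000):
    # target trials: two embeddings of one speaker, rejected if far apart
    tgt = sum(abs(sample(0.0, bimodal) - sample(0.0, bimodal)) > threshold
              for _ in range(n)) / n
    # non-target trials: embeddings of neighbouring speakers, accepted if close
    non = sum(abs(sample(0.0, bimodal) - sample(SPACING, bimodal)) < threshold
              for _ in range(n)) / n
    return (tgt + non) / 2

for bimodal in (False, True):
    print(bimodal, identification_error(bimodal), verification_error(bimodal, 2.0))
```

With these assumed parameters the bimodal case gets near-zero identification error yet clearly worse verification error, mirroring the argument above.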
0:21:27 | okay so |
---|
0:21:29 | as another example |
---|
0:21:31 | let us consider triplet loss, which is another popular |
---|
0:21:38 | loss |
---|
0:21:38 | function |
---|
0:21:42 | so it works like this: in |
---|
0:21:44 | each training example you have |
---|
0:21:48 | an embedding for some speaker, which we call the anchor embedding |
---|
0:21:52 | and then you have an embedding from the same speaker, which we call the positive |
---|
0:21:55 | example, and an embedding from another speaker, which we call the |
---|
0:21:59 | negative example |
---|
0:22:00 | and basically we want the distance between the anchor and the positive example to be |
---|
0:22:06 | small |
---|
0:22:07 | and the distance between the anchor and the negative example |
---|
0:22:11 | to be big |
---|
0:22:14 | so |
---|
0:22:15 | if this distance is bigger than |
---|
0:22:18 | this one plus some margin |
---|
0:22:20 | then the loss is going to be zero |
---|
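A common formulation of the triplet loss just described (the exact variant differs between papers, and the margin value here is an arbitrary choice):

```python
import numpy as np

# Triplet loss: the anchor-positive distance plus a margin should be smaller
# than the anchor-negative distance; the loss is zero once that holds.
def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])     # same speaker, close to the anchor
n = np.array([1.0, 0.0])     # different speaker, far away
print(triplet_loss(a, p, n))  # 0.1 - 1.0 + 0.2 < 0 -> loss is 0.0
```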
0:22:26 | however |
---|
0:22:27 | this is not |
---|
0:22:29 | an ideal criterion for speaker verification, and to show this i have a |
---|
0:22:34 | rather complicated figure here that illustrates |
---|
0:22:40 | three speakers |
---|
0:22:41 | and the embeddings of three speakers in a |
---|
0:22:45 | two dimensional space |
---|
0:22:47 | so we have |
---|
0:22:48 | speaker a |
---|
0:22:50 | with embeddings |
---|
0:22:52 | distributed in this area |
---|
0:22:55 | speaker b with embeddings in this area and speaker c with embeddings in |
---|
0:22:59 | this area |
---|
0:23:01 | and |
---|
0:23:03 | if |
---|
0:23:04 | we are using some anchor from speaker a, the worst case would be |
---|
0:23:08 | to have it here on the border |
---|
0:23:10 | and then the biggest distance to a positive example would be to have it |
---|
0:23:15 | here on the other side |
---|
0:23:17 | and the smallest distance to a negative example would be to take |
---|
0:23:21 | something here |
---|
0:23:23 | so simply we want this |
---|
0:23:27 | distance, to the positive example |
---|
0:23:30 | here, plus some margin, to be smaller than the distance to the |
---|
0:23:35 | negative example from the anchor |
---|
0:23:37 | so it's okay |
---|
0:23:38 | in this situation |
---|
0:23:41 | consider then speaker c, which has a |
---|
0:23:46 | wider |
---|
0:23:46 | distribution of data; now if we have an anchor here |
---|
0:23:51 | we need |
---|
0:23:52 | the |
---|
0:23:53 | distance to the next speaker, the closest speaker, to be |
---|
0:23:58 | bigger than the internal distance |
---|
0:24:00 | plus some margin |
---|
0:24:02 | so |
---|
0:24:03 | and that's the case in this figure, so the triplet loss is completely fine with |
---|
0:24:08 | this situation |
---|
0:24:10 | but if we want to |
---|
0:24:12 | do |
---|
0:24:13 | verification on data that is distributed in this way, then, |
---|
0:24:19 | well, |
---|
0:24:21 | if we want to have good |
---|
0:24:24 | performance on target trials from speaker c |
---|
0:24:27 | we need to accept |
---|
0:24:30 | trials as target trials whenever we have a smaller distance than this, otherwise we will |
---|
0:24:34 | have some errors for target trials of speaker c |
---|
0:24:38 | but this means that if we have a threshold like this here, we will have |
---|
0:24:42 | confusion between |
---|
0:24:45 | speakers a and b |
---|
0:24:47 | so |
---|
0:24:48 | this |
---|
0:24:49 | again, of course, there could be ways to compensate for this in one way or another, but |
---|
0:24:53 | it's just to show that optimizing |
---|
0:24:55 | this |
---|
0:24:57 | metric is not |
---|
0:24:58 | going to lead to optimal |
---|
0:25:00 | performance for |
---|
0:25:03 | verification |
---|
0:25:06 | so if we try to summarise a little bit about the idea of task specific |
---|
0:25:10 | training |
---|
0:25:12 | minimizing identification error won't necessarily minimize verification error |
---|
0:25:18 | but of course i was showing these on kind of toy examples and the reality |
---|
0:25:22 | is much more complicated |
---|
0:25:24 | we |
---|
0:25:25 | usually don't optimize classification error directly but rather the cross-entropy |
---|
0:25:29 | or something like that |
---|
0:25:31 | and we may use some loss to encourage margin |
---|
0:25:36 | between the speaker embeddings |
---|
0:25:39 | and maybe the assumptions that i made about the |
---|
0:25:42 | distributions here are |
---|
0:25:44 | not completely realistic at all |
---|
0:25:48 | and |
---|
0:25:50 | mm |
---|
0:25:53 | so it's maybe not completely clear |
---|
0:25:56 | what would happen with new test speakers that were not in the training set as |
---|
0:26:00 | well |
---|
0:26:01 | so what i want to say is that this should not be interpreted as |
---|
0:26:05 | some kind of proof that other objectives would fail; maybe they would even be |
---|
0:26:09 | really good |
---|
0:26:11 | but |
---|
0:26:12 | just that it's not really |
---|
0:26:17 | completely justified to use them |
---|
0:26:20 | and this is of course something that ideally should be studied much more |
---|
0:26:24 | in future |
---|
0:26:27 | but |
---|
0:26:31 | so we discussed that end-to-end training has some good motivation |
---|
0:26:39 | but still it's not really the most popular strategy for building speaker recognition systems today |
---|
0:26:46 | at least, my impression is that multiclass training is |
---|
0:26:50 | still the most popular |
---|
0:26:52 | and |
---|
0:26:54 | so |
---|
0:26:55 | why is that? well, there are many difficulties with end-to-end training |
---|
0:26:59 | it seems |
---|
0:27:01 | it |
---|
0:27:02 | is more prone to overfitting |
---|
0:27:05 | we have issues with statistical dependence of training |
---|
0:27:08 | trials, which we will go into in more detail in |
---|
0:27:12 | a few slides |
---|
0:27:15 | and |
---|
0:27:16 | it is also maybe questionable how the system should be trained |
---|
0:27:21 | when we want to |
---|
0:27:23 | use many enrollment utterances; this will also be mentioned |
---|
0:27:28 | later on |
---|
0:27:30 | so |
---|
0:27:35 | the issue |
---|
0:27:36 | one of the issues with using a kind of verification objective, let's call it that, |
---|
0:27:41 | where we are comparing |
---|
0:27:43 | two utterances and want to say whether it's the same speaker or not |
---|
0:27:48 | is that |
---|
0:27:52 | the data |
---|
0:27:54 | violate |
---|
0:27:57 | statistical independence, as i will |
---|
0:27:59 | explain in a minute |
---|
0:27:59 | so this is |
---|
0:28:01 | generally, this idea of minimizing some training loss assumes that |
---|
0:28:07 | the training data |
---|
0:28:09 | are independent samples from whatever distribution the data comes from |
---|
0:28:14 | and this is often the case i mean we have data that has been independently |
---|
0:28:19 | selected |
---|
0:28:21 | but |
---|
0:28:21 | in speaker verification |
---|
0:28:23 | the data, |
---|
0:28:25 | x, |
---|
0:28:26 | the observation, |
---|
0:28:27 | is |
---|
0:28:28 | a pair of utterances, the enrollment utterance and the test utterance, and the label |
---|
0:28:34 | indicates whether it's a target trial or a non-target trial |
---|
0:28:38 | so as notation i will use |
---|
0:28:41 | y equal one for target trials and y equal minus one for non-target trials |
---|
0:28:46 | the issue here is that |
---|
0:28:49 | typically, at least if we have a limited amount of training data |
---|
0:28:53 | we create |
---|
0:28:54 | many trials |
---|
0:28:56 | from the same speakers and the same utterances, so the speakers and utterances |
---|
0:29:01 | are used in many different trials, and then |
---|
0:29:05 | the |
---|
0:29:06 | trials, which are the training data, |
---|
0:29:10 | are not |
---|
0:29:12 | statistically independent |
---|
0:29:14 | which is something that the training procedure assumes they are |
---|
0:29:19 | so |
---|
0:29:22 | this can be a problem; exactly how big the problem is, |
---|
0:29:25 | i think, is still something that needs to be investigated more, but let's elaborate a little |
---|
0:29:30 | bit on what happens |
---|
0:29:35 | so |
---|
0:29:38 | here i wrote down the training objective that we would use for a |
---|
0:29:43 | kind of verification loss when we train the system end-to-end for verification |
---|
0:29:48 | so it may look |
---|
0:29:49 | complicated, but it's not really anything special; it's just the average training loss |
---|
0:29:55 | of |
---|
0:29:56 | target trials here and the average training loss of |
---|
0:30:00 | non-target trials here, and they are weighted with factors, the |
---|
0:30:05 | probability of target trials and probability of non-target trials which are |
---|
0:30:10 | some parameters that we use to |
---|
0:30:14 | steer the system to fit |
---|
0:30:15 | better for the application that we are interested in |
---|
0:30:19 | and again |
---|
0:30:22 | what we hope is that this would minimize the expected loss |
---|
0:30:26 | of |
---|
0:30:28 | target trials and non-target trials |
---|
0:30:32 | weighted with these |
---|
0:30:33 | same |
---|
0:30:34 | probabilities of target trials and non-target trials |
---|
0:30:38 | on some unseen data |
---|
0:30:40 | this loss function here is often the cross entropy but could be other things |
---|
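The weighted objective just described can be sketched as follows, using binary cross-entropy on LLR scores as one possible choice of the loss function (the prior enters both the trial weighting and the posterior; names and values here are my own illustration):

```python
import math

def cross_entropy(llr, is_target, p_target):
    # cross-entropy of the correct hypothesis given the LLR and the prior
    prior_log_odds = math.log(p_target / (1 - p_target))
    log_post_target = -math.log1p(math.exp(-(llr + prior_log_odds)))
    if is_target:
        return -log_post_target
    return -math.log1p(-math.exp(log_post_target))

def verification_objective(target_scores, nontarget_scores, p_target=0.5):
    # average loss over target trials and over non-target trials,
    # weighted by the chosen target / non-target probabilities
    avg_tgt = sum(cross_entropy(s, True, p_target)
                  for s in target_scores) / len(target_scores)
    avg_non = sum(cross_entropy(s, False, p_target)
                  for s in nontarget_scores) / len(nontarget_scores)
    return p_target * avg_tgt + (1 - p_target) * avg_non

print(verification_objective([2.0, 3.0], [-2.5, -1.0], p_target=0.1))
```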
0:30:49 | so what are the desirable properties of a training objective |
---|
0:30:56 | so |
---|
0:30:57 | here |
---|
0:30:59 | we have |
---|
0:31:00 | r-hat, which is the |
---|
0:31:03 | objective function, the average training loss |
---|
0:31:06 | and |
---|
0:31:07 | since the training data |
---|
0:31:09 | can |
---|
0:31:09 | be assumed to be generated from some probability distribution, this r-hat is also |
---|
0:31:14 | a random variable |
---|
0:31:18 | and we want it |
---|
0:31:20 | to be close |
---|
0:31:21 | to the |
---|
0:31:23 | expected |
---|
0:31:24 | loss |
---|
0:31:29 | where the expectation is calculated according for the true probability distribution of the data |
---|
0:31:35 | and for every value of |
---|
0:31:37 | theta, because |
---|
0:31:39 | in that case |
---|
0:31:43 | and |
---|
0:31:46 | if |
---|
0:31:47 | the expected loss is this black line here |
---|
0:31:54 | then |
---|
0:31:56 | well, |
---|
0:31:57 | let's say |
---|
0:31:59 | we have some training set, the blue one, |
---|
0:32:02 | and we check the average loss as a function of theta |
---|
0:32:06 | it may look like this |
---|
0:32:09 | another training set, it may look like this, the red line, and a third one |
---|
0:32:13 | would be |
---|
0:32:14 | the purple one; so the point is that it is a little bit random and |
---|
0:32:17 | it's not going to be exactly like the expected loss |
---|
0:32:22 | but ideally it should be close to this one because if we find a filter |
---|
0:32:26 | that minimize the training loss for example here for the in the case of the |
---|
0:32:29 | red training set |
---|
0:32:31 | then |
---|
0:32:32 | we know that it will also be a good value for the
---|
0:32:38 | expected loss, which means the loss on unseen test data
---|
0:32:43 | so we want |
---|
0:32:46 | the |
---|
0:32:47 | training loss |
---|
0:32:48 | as a function of the model parameters
---|
0:32:53 | to be close to the expected loss for all values of the
---|
0:32:57 | parameters
---|
0:33:02 | so |
---|
0:33:03 | in order to study the effect of |
---|
0:33:08 | statistical dependencies in the training data in this context
---|
0:33:11 | we |
---|
0:33:12 | write the
---|
0:33:14 | training objective in a slightly more general form than before
---|
0:33:19 | so |
---|
0:33:20 | it's the same as before, except that for each trial
---|
0:33:23 | we have a weight beta
---|
0:33:25 | and if we set beta to one over N then it would be the
---|
0:33:30 | same as before, but now we consider that we can choose some other values of
---|
0:33:35 | these
---|
0:33:37 | trial weights
---|
0:33:38 | for the
---|
0:33:39 | training trials
---|
0:33:44 | we want
---|
0:33:45 | the training objective, so the average training loss, to have an expected value which is the
---|
0:33:52 | same as the expected value
---|
0:33:56 | of the loss on test data, so it should be an unbiased estimator of
---|
0:34:03 | the test loss, or the expected loss
---|
0:34:07 | and we also want it to be good in the sense that it has
---|
0:34:10 | a small variance |
---|
0:34:18 | well the expected value of the training loss is just calculated like this so we |
---|
0:34:23 | end up with the expected value of a loss |
---|
0:34:26 | and this is exactly
---|
0:34:28 | what we usually denote R
---|
0:34:30 | so in order for this to be
---|
0:34:32 | unbiased, we simply want the sum of the weights to be one
---|
0:34:39 | and of course this would be the case when we use the standard choice of |
---|
0:34:45 | beta, which is one over N, the number of
---|
0:34:48 | trials |
---|
0:34:49 | in the training data |
---|
0:34:53 | the variance |
---|
0:34:55 | of this empirical loss |
---|
0:34:58 | is gonna look like this |
---|
0:34:59 | it's the
---|
0:35:00 | weight vector for all the trials
---|
0:35:03 | times a matrix
---|
0:35:06 | times the weight vector
---|
0:35:09 | and this matrix is the covariance matrix for the losses of all trials with
---|
0:35:14 | this little t, where t is one for the target trials or
---|
0:35:18 | minus one for the non-target trials
---|
0:35:21 | and one could derive that |
---|
0:35:23 | the optimal |
---|
0:35:24 | choice of |
---|
0:35:26 | betas that would minimize this variance
---|
0:35:29 | is |
---|
0:35:29 | gonna look like this
---|
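To make the slide's result concrete, here is a small NumPy sketch (my own illustration, not code from the talk) of the variance-minimizing weights: minimizing beta' C beta subject to the weights summing to one gives beta proportional to C^-1 times the all-ones vector.

```python
import numpy as np

def blue_weights(C):
    """Weights minimizing beta' C beta subject to sum(beta) = 1.

    C is the covariance (or correlation) matrix of the per-trial losses;
    the Lagrangian solution is beta = C^{-1} 1 / (1' C^{-1} 1).
    """
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)
    return w / w.sum()

# Three trials: the first two share an utterance (loss correlation 0.8),
# the third is independent of both.
C = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
beta = blue_weights(C)
# The two correlated trials get down-weighted relative to the independent one.
```

Note that with equal correlations everywhere the weights stay uniform, so this reweighting only matters when the dependency structure is uneven.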
0:35:36 | so this is what we can call the BLUE training objective
---|
0:35:40 | a best linear |
---|
0:35:42 | unbiased estimate |
---|
0:35:44 | that's the meaning of BLUE; so this is the best linear unbiased estimate of
---|
0:35:48 | the |
---|
0:35:50 | test loss |
---|
0:35:51 | using the training data to estimate what |
---|
0:35:53 | well the test loss would be |
---|
0:35:59 | some
---|
0:36:00 | details about this: we don't really need the covariances between the losses,
---|
0:36:05 | only the correlations
---|
0:36:07 | because
---|
0:36:08 | if we assume the diagonal elements of this matrix are
---|
0:36:12 | equal
---|
0:36:14 | then it turns out like this
---|
0:36:18 | and in practice we would assume that |
---|
0:36:21 | such
---|
0:36:22 | elements in this covariance matrix do not depend on theta, which
---|
0:36:26 | could be questioned |
---|
0:36:31 | so |
---|
0:36:32 | the objective that we discussed is not really specific to speaker verification, in the sense
---|
0:36:37 | that whenever you have
---|
0:36:39 | dependencies in the training data you could
---|
0:36:42 | use this idea
---|
0:36:43 | but
---|
0:36:45 | the structure of this covariance matrix
---|
0:36:49 | which describes the covariances of the losses of the training data
---|
0:36:54 | depends on the specific problem that you're studying
---|
0:36:58 | so now we will look into how to |
---|
0:37:01 | create such a matrix for speaker verification
---|
0:37:06 | so here |
---|
0:37:07 | we will use |
---|
0:37:08 | x
---|
0:37:09 | i to denote the
---|
0:37:12 | i-th utterance of speaker x
---|
0:37:16 | so we will assume that |
---|
0:37:19 | correlation coefficients
---|
0:37:21 | depend on what the trials have in common, so for example
---|
0:37:24 | here we have a
---|
0:37:26 | trial of speaker a utterance one versus speaker a utterance two, and some loss of
---|
0:37:31 | that, and we also have speaker a utterance one versus speaker a
---|
0:37:36 | utterance three, and some loss of that
---|
0:37:38 | and they have some correlation
---|
0:37:40 | because
---|
0:37:42 | they involve the same speaker
---|
0:37:45 | so we assume there is a correlation
---|
0:37:48 | coefficient, denoted c,
---|
0:37:50 | listed here
---|
0:37:52 | so in total we have these kinds of situations in verification if we consider target
---|
0:37:57 | trials |
---|
0:38:00 | there you could have the situation that
---|
0:38:02 | well, okay, let's look here
---|
0:38:05 | at
---|
0:38:05 | two target trials which have one utterance in common: this is a target trial
---|
0:38:10 | of speaker a
---|
0:38:11 | and here we have utterances one and two, and here you have utterance one and
---|
0:38:15 | utterance three, so utterance one is used in both
---|
0:38:17 | trials; there is some correlation between these trials
---|
0:38:21 | here |
---|
0:38:22 | there is no common utterance but the speaker is still the same, and this is as
---|
0:38:26 | opposed to this situation where
---|
0:38:28 | you have
---|
0:38:30 | a
---|
0:38:30 | trial of speaker a and a trial of speaker b that have nothing in common
---|
0:38:34 | so we assume here the correlation is zero
---|
0:38:37 | for such trials |
---|
0:38:39 | for the non-target trials you have a more complicated situation, but all possible situations are listed
---|
0:38:46 | here |
---|
0:38:47 | for example |
---|
0:38:48 | you may have that |
---|
0:38:50 | okay
---|
0:38:50 | the trials have one
---|
0:38:54 | utterance in common
---|
0:38:58 | so we have this utterance in common and in addition to that
---|
0:39:02 | this speaker is in common; that's what we mean with this notation here
---|
0:39:08 | and so on
---|
0:39:14 | and if we have such correlations, one can derive
---|
0:39:18 | that
---|
0:39:18 | in other words, given such correlation coefficients, the optimal weights for
---|
0:39:24 | a speaker with this many utterances
---|
0:39:27 | are gonna look like this
---|
0:39:32 | the exact form is maybe not so important but just |
---|
0:39:34 | we should note that one can
---|
0:39:37 | derive
---|
0:39:38 | how to
---|
0:39:39 | give weight to each speaker, and it depends on how many utterances
---|
0:39:44 | the speaker has
---|
0:39:47 | for the non-target trials the formulas are more complex
---|
0:39:51 | if the trial involves speakers a and b, it
---|
0:39:55 | depends on how many
---|
0:39:56 | utterances each of the two speakers has
---|
0:40:02 | so |
---|
0:40:03 | then comes the issue of how to estimate the correlation coefficients: one could look at score correlations
---|
0:40:09 | of some trained model
---|
0:40:12 | or we could
---|
0:40:14 | learn them somehow
---|
0:40:16 | which we will mention briefly later, or we can just make some assumption and
---|
0:40:21 | tune it, so for example one simple assumption is that
---|
0:40:25 | this correlation coefficient of target trials is alpha, and this one, which we assume
---|
0:40:30 | should be smaller, is alpha squared
---|
0:40:32 | and then
---|
0:40:35 | tune alpha in this range, and similarly for the non-target trials
---|
0:40:44 | just to get some idea of how we would change the weight for the target |
---|
0:40:47 | trials |
---|
0:40:48 | well |
---|
0:40:49 | for target trials |
---|
0:40:51 | we see here that this is the number of utterances for the speaker |
---|
0:40:56 | on the y-axis here we have their corresponding weights |
---|
0:41:01 | so |
---|
0:41:02 | and for different values of these correlations so if the correlation is |
---|
0:41:07 | small
---|
0:41:09 | then |
---|
0:41:11 | even when we have many utterances up to twenty here we will still give reasonable |
---|
0:41:16 | weight to each utterance
---|
0:41:19 | but if the correlation is large
---|
0:41:22 | then we will not give so much weight to
---|
0:41:25 | each utterance when a speaker has many utterances
---|
0:41:29 | which means that the total |
---|
0:41:31 | and |
---|
0:41:32 | weight for this speaker is not gonna increase much even if it has a
---|
0:41:35 | lot of |
---|
0:41:36 | utterances |
---|
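The saturation effect described here is easy to reproduce under a deliberately simplified assumption (my own sketch, cruder than the formulas on the slide): take all pairwise correlations between one speaker's target-trial losses to be a single constant c, so the variance-optimal relative per-trial weight is 1/(1 + (m-1)c) for m trials.

```python
def speaker_trial_weight(n_utts, c):
    """Relative per-trial weight for one speaker's target trials, assuming
    all pairwise correlations between that speaker's trial losses equal c
    (a simplification; the talk's actual formulas distinguish more cases).
    """
    m = n_utts * (n_utts - 1) // 2   # number of target trials for the speaker
    return 1.0 / (1.0 + (m - 1) * c)

# Total weight a speaker contributes is m times the per-trial weight.
# For c near 0 it grows like m; for larger c it saturates near 1/c,
# so speakers with many utterances stop gaining influence.
for n in (2, 5, 20):
    m = n * (n - 1) // 2
    print(n, m * speaker_trial_weight(n, 0.3))
```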
0:41:45 | and |
---|
0:41:48 | in the past i was exploring a little bit how big
---|
0:41:52 | these kinds of correlations really are
---|
0:41:55 | this was on an i-vector system with PLDA, and the scores
---|
0:42:01 | here in the first |
---|
0:42:05 | in this
---|
0:42:07 | column here
---|
0:42:08 | it's a
---|
0:42:09 | PLDA model trained with the EM algorithm, and then the scores of this generatively trained system
---|
0:42:14 | after calibration
---|
0:42:18 | and the other column here is for discriminatively trained PLDA
---|
0:42:22 | so the main thing to observe here is that we
---|
0:42:25 | do have
---|
0:42:26 | correlations between trials that have, for example, an utterance in common and so on
---|
0:42:32 | and the correlations can be quite large in some situations
---|
0:42:38 | so these |
---|
0:42:40 | problems seem to exist |
---|
0:42:44 | and doing this kind of correlation compensation, again on the
---|
0:42:49 | kind of discriminative
---|
0:42:50 | PLDA
---|
0:42:55 | and
---|
0:42:57 | it does help a bit
---|
0:43:05 | so it's something
---|
0:43:08 | to
---|
0:43:11 | possibly take into account
---|
0:43:13 | of course this was for discriminative PLDA, where we train a
---|
0:43:17 | PLDA model
---|
0:43:18 | using all the trials
---|
0:43:21 | that can be constructed from the training set, but of course the same
---|
0:43:25 | problem with the dependencies exists also in end-to-end systems
---|
0:43:37 | so |
---|
0:43:40 | now, some problems that we could encounter if we try to do this
---|
0:43:45 | well, first, the
---|
0:43:47 | results or the
---|
0:43:50 | compensation formulas that we derived
---|
0:43:52 | were assuming that
---|
0:43:54 | all trials
---|
0:43:55 | that can be created from the training set are used equally often, which is the
---|
0:43:58 | case if you train a backend like PLDA
---|
0:44:02 | discriminatively and you use all the trials
---|
0:44:05 | but when
---|
0:44:07 | well
---|
0:44:08 | we train a kind of end-to-end system involving neural networks
---|
0:44:14 | we use mini-batches, so one could achieve this situation by
---|
0:44:20 | making a |
---|
0:44:21 | list of trials |
---|
0:44:24 | and |
---|
0:44:25 | then we just sample trials from it: okay, here is a trial, this speaker
---|
0:44:29 | compared to this one; the next trial is this speaker compared to this one, and so on
---|
0:44:33 | and this is a
---|
0:44:34 | long list of all trials that can be formed, and then we just
---|
0:44:41 | select some of them into the mini-batch
---|
0:44:44 | the point is of course that if we have these speakers like this |
---|
0:44:47 | in the mini batch and we compare this one with this one |
---|
0:44:50 | this one with this one, and so on
---|
0:44:53 | we are not using all the trials that we have |
---|
0:44:56 | we have for example not comparing this one with this one in the mini batch |
---|
0:45:01 | and that's maybe a bit of a waste, because we are anyway using this deep
---|
0:45:06 | neural network to produce the embeddings, and so we can just as well
---|
0:45:12 | produce the embeddings and use all of them in the scoring part
---|
0:45:15 | as well
---|
0:45:17 | well then |
---|
0:45:17 | we will have a little bit different |
---|
0:45:20 | balance
---|
0:45:22 | of the trials |
---|
0:45:24 | globally compared to what we had before |
---|
0:45:27 | so the formulas that we derived wouldn't be exactly valid in this situation
---|
0:45:33 | so |
---|
0:45:34 | the |
---|
0:45:36 | question then is: if we do decide that for all the segments
---|
0:45:40 | that
---|
0:45:42 | we have in the mini-batch
---|
0:45:44 | we extract embeddings, and we want to use all of them
---|
0:45:48 | in the scoring part, how are we gonna select
---|
0:45:52 | the data for the mini-batch
---|
0:45:54 | there can be different strategies here
---|
0:45:57 | we could consider for example |
---|
0:45:59 | with
---|
0:46:00 | strategy a
---|
0:46:02 | we
---|
0:46:03 | select some speakers
---|
0:46:05 | and then for each speaker we take all the segments that they have
---|
0:46:08 | let's say that this red speaker has
---|
0:46:11 | three segments and this yellow speaker has
---|
0:46:14 | four segments
---|
0:46:17 | and then
---|
0:46:21 | we can consider all pairs, so we can have
---|
0:46:26 | segment one of the red speaker scored against segment two, segment one scored against segment
---|
0:46:30 | three, and so on
---|
0:46:33 | we don't use the diagonal because we don't consider
---|
0:46:39 | segments scored against themselves
---|
0:46:42 | and the score here is just the same as here
---|
0:46:46 | scoring segment two
---|
0:46:48 | against segment one
---|
0:46:50 | so |
---|
0:46:52 | this would be one way; another way would be strategy b:
---|
0:46:57 | to
---|
0:47:00 | select speakers but then just select two utterances for each speaker in the mini-batch
---|
0:47:08 | so
---|
0:47:10 | you will have just one target trial for each speaker
---|
0:47:14 | the difference here is that
---|
0:47:16 | we have
---|
0:47:17 | we are gonna have
---|
0:47:19 | fewer target trials
---|
0:47:21 | overall in the mini-batch, but each of them will be from a different speaker, so
---|
0:47:24 | we will have target trials from more speakers
---|
0:47:28 | typically |
---|
0:47:29 | so |
---|
0:47:30 | it's
---|
0:47:31 | not exactly clear what would be the right thing, but some informal experiments
---|
0:47:36 | we have done
---|
0:47:37 | suggest that this strategy b is better
---|
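Strategy B can be sketched like this (my own illustration with a hypothetical `spk2utts` data layout; the talk does not prescribe an implementation): sample speakers, take two utterances each, embed all of them once, and score every pair within the batch.

```python
import random

def sample_minibatch(spk2utts, n_speakers):
    """Strategy B: pick speakers, two utterances each, then form all
    pairwise trials within the batch (one target trial per speaker).
    `spk2utts` maps speaker id -> list of utterance ids.
    """
    speakers = random.sample(list(spk2utts), n_speakers)
    batch = [(s, u) for s in speakers for u in random.sample(spk2utts[s], 2)]
    trials = []
    for i in range(len(batch)):
        for j in range(i + 1, len(batch)):
            (s1, u1), (s2, u2) = batch[i], batch[j]
            trials.append((u1, u2, s1 == s2))  # True marks a target trial
    return batch, trials

spk2utts = {f"spk{k}": [f"spk{k}-utt{i}" for i in range(5)] for k in range(8)}
batch, trials = sample_minibatch(spk2utts, 4)
n_target = sum(t for _, _, t in trials)
# 4 speakers x 2 utterances -> 8 embeddings, C(8,2) = 28 trials,
# of which exactly 4 are target trials (one per speaker).
```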
0:47:46 | then again, the formulas that we derived before for how to weight trials are not completely
---|
0:47:51 | right; they were not derived under the assumption that we are doing it like this
---|
0:47:55 | so they are not
---|
0:47:56 | valid
---|
0:47:58 | and
---|
0:48:00 | they need to be modified a bit, and i will come to that
---|
0:48:03 | in a minute
---|
0:48:07 | the second problem that can occur in end-to-end training,
---|
0:48:12 | irrespective of these issues, is that
---|
0:48:17 | we do want
---|
0:48:19 | to use
---|
0:48:20 | well, we do want to have a system that can deal with multi-session enrollment
---|
0:48:24 | and
---|
0:48:26 | of course multi-session trials can be incorporated
---|
0:48:30 | they can be handled with an end-to-end system, as we discussed in the initial
---|
0:48:34 | slides
---|
0:48:36 | by having some pooling over the enrollment utterances
---|
0:48:40 | but how to create the training data is again a little bit
---|
0:48:46 | complicated
---|
0:48:47 | because |
---|
0:48:48 | already in the case of single-session trials we had a complicated situation with many
---|
0:48:54 | different kinds of dependencies that can occur, and in the multi-session case
---|
0:48:59 | it's gonna be even more |
---|
0:49:01 | complicated because you can have situations like |
---|
0:49:04 | these |
---|
0:49:06 | trial |
---|
0:49:08 | for example these two could be the enrollment and this is the test and another |
---|
0:49:12 | trial where |
---|
0:49:13 | these two are the enrollment |
---|
0:49:15 | and |
---|
0:49:15 | this is the test; then you have one utterance in common here
---|
0:49:19 | or we're gonna have a more extreme situation where both enrollment utterances
---|
0:49:24 | in the two trials are the same but the test utterance is different
---|
0:49:27 | so the number of possible dependencies that can occur is way larger
---|
0:49:32 | and i think it's
---|
0:49:33 | very difficult to derive some kind of formula for how the trials should be weighted
---|
0:49:41 | so to deal both with the fact that we're using mini-batches
---|
0:49:46 | and with multi-session trials, and to estimate proper trial weights
---|
0:49:52 | maybe one strategy can be to learn them, and this is not something
---|
0:49:56 | i tried; i just think it's
---|
0:49:57 | something that maybe should be tried |
---|
0:49:59 | well |
---|
0:50:01 | so we can define |
---|
0:50:02 | a training loss
---|
0:50:04 | again as an average of losses over the training data with some weights
---|
0:50:09 | and we also define a development loss
---|
0:50:14 | which is an average over
---|
0:50:16 | another set, an average of losses over the development set
---|
0:50:22 | and these weights here should depend only on the number of utterances of the speaker
---|
0:50:31 | or speakers involved in that trial
---|
0:50:35 | then one can imagine some scheme like this
---|
0:50:38 | mm
---|
0:50:39 | we send both training and development data through the neural
---|
0:50:44 | network and we get some
---|
0:50:47 | training loss and some
---|
0:50:49 | development loss
---|
0:50:53 | as usual we estimate the
---|
0:50:56 | gradient; here we take the gradient with respect to the model parameters
---|
0:51:03 | of the training loss
---|
0:51:05 | and
---|
0:51:06 | this
---|
0:51:06 | gradient is now a function of the trial weights
---|
0:51:11 | and we can update
---|
0:51:13 | the model parameters, keeping in mind that the updated value is a function
---|
0:51:18 | of the
---|
0:51:21 | trial weights
---|
0:51:23 | the training trial weights
---|
0:51:25 | and then |
---|
0:51:27 | we can |
---|
0:51:28 | on the development sets |
---|
0:51:30 | calculate |
---|
0:51:31 | the gradient |
---|
0:51:33 | with respect to these training weights |
---|
0:51:36 | and then |
---|
0:51:37 | use this to update |
---|
0:51:40 | the training trial weights
---|
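The learning-the-weights idea can be sketched on a toy problem. Everything below is my own illustration (not from the talk) with quadratic per-trial losses, where the inner update of theta and the outer gradient of the development loss with respect to the trial weights are available in closed form.

```python
import numpy as np

# Inner step:  theta' = theta - lr * sum_i w_i * grad_i(theta)
# Outer step:  w <- w - lr_w * d l_dev(theta') / d w
# Per-trial train losses l_i(t) = 0.5*(t - a_i)^2, dev loss 0.5*(t - a_dev)^2.
a = np.array([0.0, 0.0, 2.0])   # two redundant trials plus one distinct trial
a_dev = 1.0
theta, w = 0.5, np.ones(3) / 3
lr, lr_w = 0.5, 0.5

for _ in range(100):
    g = theta - a                          # per-trial gradients at current theta
    theta_new = theta - lr * (w @ g)       # weighted inner update
    # Chain rule: d l_dev/d w_i = (theta_new - a_dev) * d theta_new/d w_i
    grad_w = (theta_new - a_dev) * (-lr * g)
    w = np.clip(w - lr_w * grad_w, 0.0, None)
    w /= w.sum()                           # keep weights normalized (unbiasedness)
    theta = theta_new

# The redundant trials end up down-weighted and theta approaches the dev optimum.
```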
0:51:46 | a second |
---|
0:51:47 | thing |
---|
0:51:49 | to explore |
---|
0:51:51 | or like a final note on these |
---|
0:51:56 | this
---|
0:51:57 | statistical dependency issue, is that
---|
0:52:00 | we just |
---|
0:52:02 | discussed some ideas for balancing the training data the training trials for better optimization |
---|
0:52:08 | but for example in the case when all speakers have the same |
---|
0:52:12 | number of utterances |
---|
0:52:14 | this rebalancing has no effect |
---|
0:52:17 | still, of course, the dependencies are there, so one would think: shouldn't we
---|
0:52:20 | do something more than just rebalance the training data
---|
0:52:24 | and one possibility that i think would be worth
---|
0:52:28 | trying
---|
0:52:29 | is to
---|
0:52:32 | well
---|
0:52:34 | we assume the following
---|
0:52:35 | that
---|
0:52:37 | the covariance of |
---|
0:52:39 | let's say the losses of two trials
---|
0:52:42 | of a speaker
---|
0:52:45 | which have
---|
0:52:45 | one utterance
---|
0:52:47 | in common should be bigger than
---|
0:52:49 | the covariance between two trials
---|
0:52:51 | of this
---|
0:52:52 | speaker which have
---|
0:52:54 | no utterances in common
---|
0:52:56 | which should be bigger than the covariance between
---|
0:53:01 | two
---|
0:53:02 | target trials of different speakers; this should be zero actually
---|
0:53:06 | so one could consider regularizing the model to behave in that way
---|
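One way to encode this ordering as a regularizer is a hinge-style penalty on estimated loss covariances; this is my own sketch, the talk only states the desired ordering.

```python
def ordering_penalty(cov_shared_utt, cov_same_spk, cov_diff_spk):
    """Hinge penalty encouraging
    cov(shared utterance) >= cov(same speaker) >= cov(different speakers) ~ 0.
    The three covariances would be estimated from per-trial losses in a batch;
    this helper only scores a given triple.
    """
    return (max(0.0, cov_same_spk - cov_shared_utt)
            + max(0.0, cov_diff_spk - cov_same_spk)
            + cov_diff_spk ** 2)   # push different-speaker covariance to zero

# A well-ordered triple incurs only the small last term:
p = ordering_penalty(0.6, 0.3, 0.01)
```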
0:53:14 | so now |
---|
0:53:17 | after discussing the issues with
---|
0:53:21 | end-to-end training
---|
0:53:23 | then i will briefly mention some of the
---|
0:53:27 | papers
---|
0:53:29 | or some papers
---|
0:53:32 | on end-to-end
---|
0:53:33 | training, and this should not be considered as a kind of literature review or
---|
0:53:38 | describing the best architectures or anything like that
---|
0:53:42 | it is
---|
0:53:43 | more
---|
0:53:45 | just a few selected papers that illustrate some points and so on
---|
0:53:53 | some of which give some good take-away messages about end-to-end training
---|
0:53:59 | so this paper, called end-to-end text-dependent speaker verification, as far as i know was
---|
0:54:04 | the first paper on end-to-end training in speaker verification
---|
0:54:09 | and it uses a network like this, or some architecture like this: features go in
---|
0:54:14 | through
---|
0:54:15 | a neural network, and in the end
---|
0:54:21 | this network is gonna say
---|
0:54:24 | is it the same
---|
0:54:26 | speaker or not |
---|
0:54:28 | the important thing here is that |
---|
0:54:30 | the |
---|
0:54:32 | input size is fixed
---|
0:54:36 | so the input to the neural network is the feature dimension times the number of
---|
0:54:41 | frames
---|
0:54:45 | the duration, that is
---|
0:54:48 | and there was no temporal pooling, which is
---|
0:54:52 | done in many other situations
---|
0:54:55 | and this is suitable |
---|
0:54:56 | when |
---|
0:54:58 | when you do text dependent speaker verification as they did in this paper |
---|
0:55:02 | so because this means that |
---|
0:55:05 | the network is kind of aware of the word and phoneme order |
---|
0:55:10 | and |
---|
0:55:11 | i would say that the main conclusion from this paper is that |
---|
0:55:15 | the verification loss was better than the identification loss
---|
0:55:19 | especially when you have big amounts of training data; for small amounts of training
---|
0:55:24 | data there was
---|
0:55:25 | not as big a difference
---|
0:55:28 | and one can also say that t-norm could
---|
0:55:32 | to a large extent make
---|
0:55:35 | the models trained with these two losses more similar
---|
0:55:42 | but i would still say that this kind of suggests the verification loss is beneficial
---|
0:55:48 | if you have large amounts of training data |
---|
0:55:55 | so this is another paper |
---|
0:55:59 | there wasn't doing in |
---|
0:56:01 | text-independent speaker verification and here |
---|
0:56:05 | different from the other is that they do have a temporal pooling layer |
---|
0:56:11 | so |
---|
0:56:12 | that would kind of remove the dependence on the order of the input
---|
0:56:17 | to some extent at least, and is maybe a more suitable architecture for text
---|
0:56:22 | independent speaker verification |
---|
0:56:25 | and this was compared to an i-vector PLDA baseline, down here, and it was found
---|
0:56:30 | that a really large amount of training data is needed even to beat something like an
---|
0:56:34 | i-vector
---|
0:56:36 | PLDA system
---|
0:56:44 | and this is |
---|
0:56:46 | some study that we did and |
---|
0:56:51 | it was |
---|
0:56:52 | also again on text-independent speaker recognition or verification
---|
0:56:58 | but trained on a smaller amount of data, and to make it work we instead constrained
---|
0:57:04 | this neural network here, this big end-to-end system, to behave
---|
0:57:08 | something like
---|
0:57:10 | an i-vector and PLDA baseline, so we kind of constrained it not to be too
---|
0:57:16 | different from the
---|
0:57:18 | i-vector PLDA baseline
---|
0:57:21 | and |
---|
0:57:23 | we found there that training all blocks jointly with the verification loss was improving
---|
0:57:33 | as can be seen here
---|
0:57:36 | but
---|
0:57:36 | a little bit regrettably, we didn't separate out
---|
0:57:40 | clearly whether that improvement came from the fact that we were doing joint training
---|
0:57:45 | or the fact that we were |
---|
0:57:50 | using the verification loss |
---|
0:57:55 | another interesting thing here is that |
---|
0:57:59 | we found that |
---|
0:58:00 | training with the verification loss requires very large batches
---|
0:58:05 | and this was an experiment done only on the |
---|
0:58:09 | scoring part, on discriminatively trained PLDA
---|
0:58:12 | so if we train discriminative PLDA with
---|
0:58:16 | BFGS using full batches
---|
0:58:19 | that is
---|
0:58:21 | not a mini-batch
---|
0:58:24 | training scheme |
---|
0:58:26 | you achieve some |
---|
0:58:27 | loss |
---|
0:58:28 | like this on the development set |
---|
0:58:31 | and this dashed
---|
0:58:33 | blue line
---|
0:58:34 | whereas if we trained with adam with mini-batches of different sizes
---|
0:58:39 | up to five thousand
---|
0:58:41 | we see that we need really big batches to actually
---|
0:58:45 | get close to the BFGS
---|
0:58:48 | trained model, which was trained on full batches
---|
0:58:50 | so that kind of suggests that you really need to have many trials
---|
0:58:55 | within the mini-batch in order for
---|
0:58:59 | training these kinds of
---|
0:59:02 | systems with a verification loss to work, which is a bit of a problem and maybe a
---|
0:59:06 | challenge to deal with |
---|
0:59:07 | in future |
---|
0:59:12 | this is some more recent paper, and the interesting point of this paper was that
---|
0:59:17 | they did train the whole system
---|
0:59:20 | all the way from the waveform instead of from features as the others
---|
0:59:27 | did
---|
0:59:29 | but
---|
0:59:31 | i couldn't
---|
0:59:33 | understand completely whether the improvement came from the fact that they were
---|
0:59:37 | training from the waveform or if it was because of
---|
0:59:41 | the choice of architecture and so on
---|
0:59:45 | but it's interesting that
---|
0:59:48 | systems going
---|
0:59:49 | all the way from waveform to the end
---|
0:59:53 | can work well |
---|
0:59:58 | and this is a paper
---|
1:00:00 | from this year's
---|
1:00:02 | interspeech; it's interesting because
---|
1:00:08 | it's one of the more recent studies that really proposed or showed some good
---|
1:00:13 | performance of using the verification loss
---|
1:00:17 | here it was a joint
---|
1:00:19 | or
---|
1:00:20 | i guess more precisely multitask training, so they were training using both the identification loss
---|
1:00:24 | and the verification loss
---|
1:00:28 | and that's actually something i have tried too and never got any
---|
1:00:32 | benefit from, but one thing they did here was to
---|
1:00:36 | start with a large weight for the identification loss and gradually
---|
1:00:40 | increase the weight for the verification loss, and this is interesting and maybe
---|
1:00:47 | actually the right way to go
---|
1:00:49 | i'm curious about it |
---|
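The gradual shift between the two losses can be sketched with a simple linear interpolation; the exact schedule shape in the paper may differ, so treat this as an assumption.

```python
def combined_loss(loss_id, loss_ver, step, total_steps):
    """Interpolate identification and verification losses, starting with a
    large weight on identification and gradually shifting to verification
    (a linear schedule of my own choosing for illustration).
    """
    alpha = min(1.0, step / total_steps)   # goes from 0 to 1 over training
    return (1.0 - alpha) * loss_id + alpha * loss_ver

# Early training is dominated by the identification loss, late training
# by the verification loss:
early = combined_loss(2.0, 1.0, step=0, total_steps=1000)
late = combined_loss(2.0, 1.0, step=1000, total_steps=1000)
```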
1:00:54 | so |
---|
1:00:55 | now comes just little bits summary of this talk |
---|
1:00:59 | we discussed the motivation for end-to-end
---|
1:01:04 | training |
---|
1:01:05 | and |
---|
1:01:06 | we said that it has some good motivation |
---|
1:01:09 | and |
---|
1:01:10 | we showed, or
---|
1:01:13 | we referred to some
---|
1:01:16 | experimental results, also of other authors
---|
1:01:19 | which show that it seems to work quite well for text-dependent tasks with large amounts
---|
1:01:24 | of training data |
---|
1:01:27 | in such a case it's probably preferable to preserve the temporal structure and avoid
---|
1:01:33 | the temporal pooling |
---|
1:01:35 | in text-independent benchmarks one would need strong regularization or a mixed
---|
1:01:42 | training objective in order to benefit from
---|
1:01:45 | end-to-end training, and typically we would want to do some temporal pooling there
---|
1:01:54 | one could guess that end-to-end training would be the preferable choice in scenarios where we
---|
1:02:00 | have many training speakers with few utterances, where we have less of the statistical dependency
---|
1:02:05 | problem
---|
1:02:09 | something that to me seems to be an open question, and which would be
---|
1:02:14 | great if someone explored
---|
1:02:17 | is |
---|
1:02:18 | okay |
---|
1:02:19 | it is difficult actually to train end-to-end systems, especially for the text-independent
---|
1:02:24 | task
---|
1:02:25 | is this because of overfitting, or training convergence, or this dependency issue we discussed?
---|
1:02:32 | it's not really clear, i would say
---|
1:02:34 | and |
---|
1:02:36 | a practical question is how to adapt such systems, because with the more blockwise systems we
---|
1:02:43 | would often adapt the backend
---|
1:02:45 | or could we train the system in a way that we don't need adaptation
---|
1:02:53 | and also, how could we input some human knowledge about speech into this training, if
---|
1:02:58 | we need it
---|
1:03:00 | something we know about the data distribution or number of phonemes or |
---|
1:03:04 | whatever |
---|
1:03:07 | and we discussed that maybe
---|
1:03:12 | training a model for speaker identification is not ideal for speaker verification but is there |
---|
1:03:18 | some way to |
---|
1:03:21 | to find embeddings that are good for all these tasks
---|
1:03:27 | another interesting quick question is |
---|
1:03:32 | how well |
---|
1:03:34 | the llrs that come from
---|
1:03:36 | end-to-end
---|
1:03:38 | architectures
---|
1:03:39 | actually could approximate the true llr
---|
1:03:44 | so in other words what kind of |
---|
1:03:47 | and |
---|
1:03:49 | distributions could be |
---|
1:03:51 | arbitrarily accurately simulated or modeled by these architectures
---|
1:03:57 | it's not completely clear either
---|
1:04:00 | okay so |
---|
1:04:02 | thank you for your attention |
---|
1:04:05 | bye bye
---|
1:04:10 | hello, this is johan, and now i will present the hands-on session for the
---|
1:04:19 | end-to-end speaker verification tutorial
---|
1:04:26 | if this format does not work well, i don't know
---|
1:04:31 | well, i'm not really gonna run code, we will just read it, let's see
---|
1:04:37 | i will
---|
1:04:40 | talk about
---|
1:04:43 | two things: first
---|
1:04:44 | i will go through the code that i am using in
---|
1:04:48 | most of my experiments
---|
1:04:51 | and
---|
1:04:53 | after that i will show a few tricks to solve the various
---|
1:04:58 | implementation issues
---|
1:05:02 | that i have used
---|
1:05:06 | okay so |
---|
1:05:10 | first |
---|
1:05:11 | the code for the end-to-end system: so this is code that i started working
---|
1:05:17 | on during my postdoc, from around two thousand sixteen
---|
1:05:23 | initially it was in theano, but that is now considered old
---|
1:05:30 | and i guess it is
---|
1:05:32 | time to switch to, let's say, tensorflow 2 or pytorch or something
---|
1:05:38 | else
---|
1:05:41 | the links of the repository is here |
---|
1:05:44 | and most of the stuff in this repository — most scripts there — are |
---|
1:05:52 | for multiclass training, mostly used for embedding training, maybe in |
---|
1:05:59 | combination with other stuff |
---|
1:06:01 | but the |
---|
1:06:03 | demo we will look at |
---|
1:06:05 | instead uses |
---|
1:06:07 | end-to-end training with the verification loss |
---|
1:06:11 | the papers that we have published were actually based on code close to |
---|
1:06:15 | this, though an older version — I think there is not so much point in |
---|
1:06:19 | maintaining that anymore |
---|
1:06:23 | but I do have one script here that uses the verification loss in |
---|
1:06:29 | combination with the identification loss, and that is the script we will look at |
---|
1:06:37 | and generally |
---|
1:06:41 | well, let's say this first: I will try to point out things in this code that |
---|
1:06:46 | I think are well known and that worked well, and also mention |
---|
1:06:51 | what I would |
---|
1:06:52 | do differently today |
---|
1:06:54 | to maybe give some |
---|
1:06:57 | advice — at least what I can say from my own experience of spending a lot of time building |
---|
1:07:02 | some |
---|
1:07:03 | small toolkit for speaker verification |
---|
1:07:09 | note that I didn't see any benefit here from adding the verification |
---|
1:07:15 | loss to the identification loss |
---|
1:07:18 | and, contrary to the paper I mentioned in the tutorial |
---|
1:07:23 | it could be that the quite complicated scheme there for changing the balance between the losses |
---|
1:07:30 | throughout training is really needed; this is maybe something I will look at at some |
---|
1:07:37 | point |
---|
1:07:41 | this script you |
---|
1:07:45 | would normally run in the normal way |
---|
1:07:49 | locally, but I won't run it here — I modified it |
---|
1:07:54 | a little bit for the presentation, because |
---|
1:07:58 | it was |
---|
1:07:59 | written |
---|
1:08:00 | originally in such a way that it |
---|
1:08:02 | assumes our own environment |
---|
1:08:07 | so some small adjustments might be needed if you actually want to run it yourself |
---|
1:08:16 | so |
---|
1:08:17 | next |
---|
1:08:18 | I tried, when organising my experiments, to arrange things in such a way that |
---|
1:08:25 | there is one script where everything that is specific to the experiment is set, so that |
---|
1:08:32 | includes which data to use, the configuration of the model, and so on |
---|
1:08:38 | I never really liked |
---|
1:08:42 | and never found it efficient to have |
---|
1:08:44 | input arguments to these scripts, such as which data to use, because anyway you |
---|
1:08:50 | will almost always have to change something in the script |
---|
1:08:56 | for a new experiment — then you can just keep one script per experiment, and |
---|
1:09:02 | so on |
---|
1:09:06 | but other things that are a little bit more stable from experiment to experiment are |
---|
1:09:12 | just loaded from this script |
---|
1:09:15 | such as models of different architectures, and so on |
---|
1:09:25 | so usually I use an underscore suffix to denote symbolic (tensor) variables and `_v` for placeholder values |
---|
1:09:34 | and so on |
---|
1:09:36 | the kind of |
---|
1:09:40 | models here are |
---|
1:09:43 | similar to Keras models, just maybe a little bit less |
---|
1:09:49 | fancy in their |
---|
1:09:54 | features |
---|
1:09:58 | I didn't use Keras initially, because when I started with this years ago |
---|
1:10:03 | Keras was not flexible enough — or at least I could not get it to work |
---|
1:10:11 | neatly with this — but now it is definitely flexible enough |
---|
1:10:25 | so |
---|
1:10:31 | for example, here is the settings file, where we define things that |
---|
1:10:37 | maybe some would think should rather be input arguments, as I mentioned |
---|
1:10:43 | before |
---|
1:10:45 | but since it is anyway necessary to change things in this file for every experiment, I prefer to |
---|
1:10:51 | define things here |
---|
1:10:53 | so here, somewhere, is the training data |
---|
1:10:56 | how long the shortest and the longest segments we train on are |
---|
1:11:03 | some other parameters related to training: batch size |
---|
1:11:08 | maximum number of epochs |
---|
1:11:10 | and |
---|
1:11:12 | number of batches in an epoch — so I don't really define |
---|
1:11:18 | an epoch as one pass over the data, but rather as a fixed number of batches, |
---|
1:11:23 | as I will explain in a minute |
---|
1:11:29 | also patience — probably most of you are familiar with it, but it is worth mentioning: |
---|
1:11:34 | it controls for how long we keep training when |
---|
1:11:35 | the validation score does not improve |
---|
1:11:37 | so the next part of the script is the part for defining how to load |
---|
1:11:46 | and prepare data |
---|
1:11:48 | and here one important point is that |
---|
1:11:53 | the batches we will |
---|
1:11:56 | build consist of chunks of features from different utterances — so randomly selected segments |
---|
1:12:05 | if you load them from a normal hard disk, then randomly selecting different segments from |
---|
1:12:13 | different utterances |
---|
1:12:15 | will be too slow, I would say |
---|
1:12:21 | so often |
---|
1:12:23 | you cannot build them at training time; in that case, as the Kaldi recipe does, you |
---|
1:12:28 | prepare |
---|
1:12:29 | many batches in advance |
---|
1:12:32 | well |
---|
1:12:33 | that is one way; in my setup, I instead store the |
---|
1:12:40 | data on SSDs, and then it can be loaded as you wish — feature chunks can |
---|
1:12:47 | be loaded randomly, fast enough, that way |
---|
1:12:50 | so this is |
---|
1:12:53 | good, because it allows for much more flexibility in experiments; for example, sometimes |
---|
1:12:59 | you may want to load two segments from the same utterance that overlap by some |
---|
1:13:04 | proportion |
---|
1:13:06 | for some experiments |
---|
1:13:10 | or sometimes you just want to change the duration of the segments |
---|
1:13:16 | if you |
---|
1:13:18 | use Kaldi egs, then you have to prepare new egs for this |
---|
1:13:22 | so I would say that |
---|
1:13:24 | using SSDs |
---|
1:13:28 | and then just loading features as training goes is a |
---|
1:13:33 | very good thing — SSDs are really worth the investment if |
---|
1:13:38 | you want to |
---|
1:13:39 | do this kind of experiments |
---|
1:13:44 | I define some functions, for example to load the training features, given some |
---|
1:13:52 | list of files — this one will load the data, and so on |
---|
1:13:59 | so if you want to load batches in some specific way, again |
---|
1:14:05 | it is defined here; but if you want to do, for example, the thing |
---|
1:14:08 | I mentioned — load two overlapping segments from the same utterance — then you would |
---|
1:14:13 | have to change the function here |
---|
1:14:15 | so this was quite a |
---|
1:14:18 | useful way of organising things, for me at least, in my experiments |
---|
1:14:26 | another important thing this script does is create dictionaries with |
---|
1:14:32 | mappings — for example from each class (speaker) to its utterances |
---|
1:14:38 | and from utterances to files, and so on |
---|
1:14:43 | and that's |
---|
1:14:46 | created here |
---|
1:14:51 | and these |
---|
1:14:55 | mappings are used to create the minibatches |
---|
1:14:59 | a little bit later, down here, I create a generator for minibatches, and |
---|
1:15:04 | it takes this dictionary of |
---|
1:15:09 | mappings — such as the utterance-to-speaker mapping — and I have different generators depending |
---|
1:15:15 | on what kind of minibatches I want; for example, do you want |
---|
1:15:19 | randomly selected speakers with all their data, or do you want randomly selected speakers |
---|
1:15:24 | and, for example, two utterances each, or something like that |
---|
1:15:30 | so that is changed by changing the generator |
---|
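As an aside, the kind of minibatch generator described here can be sketched in a few lines of plain Python. The mapping and parameter names below are made up for illustration; the actual repository code is organised differently:

```python
import random

# Hypothetical speaker-to-utterance mapping of the kind the script builds;
# the real code also maps utterances to feature files on the SSD.
spk2utts = {"spk1": ["u1", "u2", "u3"], "spk2": ["u4", "u5"], "spk3": ["u6", "u7"]}

def batch_generator(spk2utts, n_spk=2, n_utt_per_spk=2, seed=0):
    """Endlessly yield minibatches: n_spk randomly chosen speakers,
    n_utt_per_spk randomly chosen utterances for each of them."""
    rng = random.Random(seed)
    speakers = sorted(spk2utts)
    while True:
        batch = []
        for s in rng.sample(speakers, n_spk):
            for _ in range(n_utt_per_spk):
                batch.append((s, rng.choice(spk2utts[s])))
        yield batch

first = next(batch_generator(spk2utts))   # e.g. [("spk2", "u5"), ("spk2", "u4"), ...]
```

Switching to "all data of the selected speakers" or some other batch composition then only means swapping in a different generator function, as described above.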
1:15:39 | then the next step is to |
---|
1:15:42 | set up the model |
---|
1:15:44 | and here I'm using a |
---|
1:15:47 | TDNN — essentially the usual x-vector architecture |
---|
1:15:52 | and I also add a PLDA model |
---|
1:15:57 | on top of the embeddings from this |
---|
1:16:03 | x-vector extractor, so to speak |
---|
1:16:09 | which |
---|
1:16:11 | does the actual verification |
---|
1:16:19 | one minor difference from the Kaldi architecture is that I found it necessary |
---|
1:16:24 | to have some kind of normalization layer after the temporal pooling — batch norm, or just |
---|
1:16:32 | standardization estimated on the data once at the beginning, works fine |
---|
1:16:36 | as well |
---|
1:16:38 | I guess the reason it is needed here could be that we use a simpler optimizer — |
---|
1:16:44 | we use just stochastic gradient descent, as compared to Kaldi, which uses something |
---|
1:16:48 | much more elaborate |
---|
1:16:54 | so in this part of the code comes the |
---|
1:17:00 | definition of the architecture, like the number of layers, their sizes |
---|
1:17:07 | activation functions |
---|
1:17:09 | and so on |
---|
1:17:11 | whether we should have normalization of the features, normalization after the pooling layer |
---|
1:17:18 | and whether these |
---|
1:17:22 | normalizations |
---|
1:17:25 | are |
---|
1:17:27 | updated during training, or fixed after being initialized |
---|
1:17:37 | then options for regularization, and so on |
---|
1:17:41 | we initialize the model here, and we provide, |
---|
1:17:47 | when we do this, an iterator — the generator for |
---|
1:17:52 | the training data — and this is used to initialize the model's normalization layers |
---|
1:17:58 | this is something that creates a bit of a mess, and I would probably do it |
---|
1:18:04 | differently if I rewrote it now — |
---|
1:18:09 | maybe just some normal initialization, and then run a few |
---|
1:18:16 | iterations |
---|
1:18:18 | before starting the training, just to |
---|
1:18:21 | initialize the normalization layers |
---|
1:18:30 | then we apply the model to the data, which is in these placeholders |
---|
1:18:35 | here |
---|
1:18:36 | and |
---|
1:18:39 | then |
---|
1:18:41 | what comes out will be the embeddings and the classifications |
---|
1:18:46 | and |
---|
1:18:47 | the |
---|
1:18:49 | embeddings — in this particular script — we will send them to |
---|
1:18:55 | the PLDA model |
---|
1:18:58 | basically, here |
---|
1:18:59 | we make some settings for it |
---|
1:19:03 | and |
---|
1:19:05 | from the probabilistic LDA model we can get the |
---|
1:19:09 | scores |
---|
1:19:10 | for all pairwise comparisons |
---|
1:19:14 | in the minibatch, and also a loss for that, if we provide |
---|
1:19:18 | labels for it |
---|
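The "score all pairs, then attach a loss" step can be sketched like this in numpy. The dot-product scoring below is only a stand-in for the PLDA layer, and all names are illustrative rather than taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))              # minibatch of embeddings
spk = np.array([0, 0, 1, 1, 2, 2])       # speaker label of each embedding

S = E @ E.T                              # trial scores for every pair (PLDA stand-in)
same = (spk[:, None] == spk[None, :]).astype(float)

# Binary cross-entropy over all distinct trials (upper triangle, no self-trials)
iu = np.triu_indices(len(spk), k=1)
p = 1.0 / (1.0 + np.exp(-S[iu]))         # sigmoid turns each score into P(same speaker)
ver_loss = -np.mean(same[iu] * np.log(p) + (1 - same[iu]) * np.log(1 - p))
```

With 6 embeddings this gives 15 trials per minibatch, which is why a single batch already provides a usable verification training signal.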
1:19:22 | so the next part is to define the losses and the train functions, and so on |
---|
1:19:31 | we have the loss as a weighted combination of the classification loss and, as the second part, |
---|
1:19:36 | the verification loss |
---|
1:19:38 | which is binary; their respective weights in this example are 0.25 |
---|
1:19:44 | and 0.75 |
---|
1:19:47 | and maybe one important thing here is that these losses are normalized; in the |
---|
1:19:54 | multiclass case we divide by |
---|
1:19:56 | minus |
---|
1:19:58 | the log probability of a random guess — so log of the number of speakers |
---|
1:20:05 | I mean, the loss of random classification, random guessing |
---|
1:20:09 | and the reason to do this is that |
---|
1:20:13 | if the model is just initialized with random parameters, the loss |
---|
1:20:18 | will be one, or approximately so |
---|
1:20:21 | and we do the same thing for the verification loss |
---|
1:20:27 | so |
---|
1:20:28 | this means that both losses start out at a similar value, and it |
---|
1:20:33 | becomes easier to choose how to interpolate between them |
---|
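The normalization described above can be sketched as follows. The function names are mine, and the verification part assumes a plain binary cross-entropy; the repository may weight target and non-target trials differently:

```python
import numpy as np

n_spk = 1000

def norm_xent(logprob_true_class, n_spk=n_spk):
    """Multiclass cross-entropy divided by the loss of uniform random
    guessing, -log(1/n_spk) = log(n_spk)."""
    return -logprob_true_class / np.log(n_spk)

def norm_bce(p_target, is_target):
    """Binary verification loss divided by the loss of always answering
    p = 0.5, which costs log(2) per trial."""
    bce = -(is_target * np.log(p_target) + (1 - is_target) * np.log(1 - p_target))
    return bce / np.log(2)

# At random initialization both normalized losses come out near 1,
# which makes interpolation weights between them easier to choose.
xe0 = norm_xent(np.log(1.0 / n_spk))
bce0 = norm_bce(0.5, 1)
```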
1:20:40 | and at the end of this script we define a training function, which takes |
---|
1:20:45 | the data for a batch, computes the loss, and updates |
---|
1:20:50 | the model |
---|
1:20:53 | the next part is for |
---|
1:20:59 | defining functions for setting parameters and getting parameters of the model |
---|
1:21:04 | and |
---|
1:21:06 | defining |
---|
1:21:07 | a function to |
---|
1:21:10 | check some kind of validation loss after each epoch |
---|
1:21:21 | so this part is just for setting parameters and getting parameters |
---|
1:21:26 | and |
---|
1:21:28 | is maybe not so important |
---|
1:21:31 | you can find the |
---|
1:21:34 | function for checking the validation loss here |
---|
1:21:39 | finally, the training is done by this |
---|
1:21:41 | function here, which takes the |
---|
1:21:45 | function for |
---|
1:21:47 | checking the validation loss; it takes many other parameters too |
---|
1:21:51 | and things that we defined earlier |
---|
1:21:55 | okay |
---|
1:22:03 | for example the function for training, and so on |
---|
1:22:07 | so the way we train here is basically to |
---|
1:22:12 | iterate over epochs, where an epoch was defined as |
---|
1:22:16 | a certain number of batches |
---|
1:22:20 | and this is because we don't really have epochs in the usual sense — we just continuously |
---|
1:22:24 | sample random segments |
---|
1:22:26 | so there is |
---|
1:22:29 | no clear notion of what one pass over the data |
---|
1:22:33 | would |
---|
1:22:35 | be |
---|
1:22:35 | but anyway |
---|
1:22:38 | we do the training, and if an epoch doesn't improve the validation loss, we |
---|
1:22:45 | retry — |
---|
1:22:46 | we try a few more times, up to "patience" number of times, and if it still doesn't improve, we |
---|
1:22:51 | go back — |
---|
1:22:54 | reset the parameters to the best ones so far and halve the learning rate, and so on; okay, |
---|
1:23:00 | I don't know if this is the best |
---|
1:23:02 | scheme, but it has worked well enough for me |
---|
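A minimal sketch of that scheme, with illustrative names (the repository version takes many more arguments):

```python
# If the validation loss does not improve for `patience` epochs, revert to
# the best parameters seen so far and halve the learning rate.
def train(run_epoch, val_loss, get_params, set_params,
          lr=0.1, patience=3, max_epochs=100, min_lr=1e-5):
    best = val_loss()
    best_params = get_params()
    bad = 0
    for _ in range(max_epochs):
        run_epoch(lr)                    # one "epoch" = a fixed number of batches
        cur = val_loss()
        if cur < best:
            best, best_params, bad = cur, get_params(), 0
        else:
            bad += 1
            if bad > patience:
                set_params(best_params)  # rewind to the best model so far
                lr /= 2                  # and halve the learning rate
                bad = 0
                if lr < min_lr:
                    break
    set_params(best_params)
    return best
```

The `run_epoch`, `val_loss`, `get_params`, and `set_params` callables stand in for the compiled Theano functions described earlier.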
1:23:13 | so |
---|
1:23:14 | that's it for the code part |
---|
1:23:16 | and |
---|
1:23:18 | moving on, I would like to mention a few tricks |
---|
1:23:23 | not very complicated things |
---|
1:23:25 | but they were maybe slightly difficult for me to figure out |
---|
1:23:30 | and |
---|
1:23:31 | they are related to backpropagation, and the things I wanted to modify there |
---|
1:23:41 | so let's just first briefly review the backpropagation algorithm |
---|
1:23:48 | basically |
---|
1:23:52 | you know that a neural network is just |
---|
1:23:55 | a series of affine transformations, each followed by a nonlinearity, then again an affine transformation, and again a |
---|
1:24:01 | nonlinearity, and so on |
---|
1:24:03 | so at each layer we apply an affine transformation — |
---|
1:24:09 | let's call its output z — and then we apply some nonlinearity f and |
---|
1:24:14 | obtain the activation a; and we do that over and over |
---|
1:24:18 | until we get the final |
---|
1:24:22 | output, and then we have some cost function |
---|
1:24:26 | C on that, for example cross-entropy |
---|
1:24:29 | and if we denote function composition with the ring here — which basically means |
---|
1:24:34 | that the composition of g and h is just |
---|
1:24:39 | applying h on the data and then g on the result — then we know that we can |
---|
1:24:43 | write the whole neural network as |
---|
1:24:46 | applying the first affine transformation to the input |
---|
1:24:50 | then the first nonlinearity |
---|
1:24:54 | and so on, all the way |
---|
1:24:55 | to the output |
---|
1:24:57 | it can be written like this |
---|
1:24:59 | and it is also easy to write the |
---|
1:25:04 | gradient of the |
---|
1:25:06 | loss with respect to the input, using the chain rule |
---|
1:25:11 | so it's just |
---|
1:25:13 | basically, the derivative of C with respect to the input is just |
---|
1:25:19 | a chain like this: the derivative of C with respect to the last activation, times |
---|
1:25:23 | the derivative of that with respect to its pre-activation, and so on |
---|
1:25:25 | and I have these |
---|
1:25:27 | square brackets here just to denote that these are |
---|
1:25:32 | Jacobians — the multivariate chain rule looks the |
---|
1:25:37 | same as the scalar one, except that we need to use Jacobians and matrix products instead of |
---|
1:25:43 | ordinary products |
---|
1:25:48 | so, first of all |
---|
1:25:55 | the |
---|
1:25:56 | derivative of a with respect to z — |
---|
1:26:01 | this is a Jacobian, as I said, because a is a vector, so |
---|
1:26:07 | it contains all the elements like this here |
---|
1:26:12 | the derivative of a with respect to z |
---|
1:26:15 | is just going to be a diagonal matrix, because f is applied |
---|
1:26:20 | elementwise |
---|
1:26:22 | and the other entries are zero |
---|
1:26:26 | and, more interestingly — |
---|
1:26:28 | if we look at this term here, we will see, maybe a little surprisingly, that |
---|
1:26:33 | it is just the weight matrix |
---|
1:26:36 | so then backpropagation goes like this: |
---|
1:26:39 | okay, we start by calculating the |
---|
1:26:42 | derivative of C |
---|
1:26:45 | with respect to the last activation |
---|
1:26:48 | and then the corresponding delta, which is just these two factors |
---|
1:26:51 | and then |
---|
1:26:52 | we can |
---|
1:26:54 | continue with |
---|
1:26:56 | the derivative of C with respect to some earlier z, by just taking what |
---|
1:27:02 | we already have and multiplying by, for example, these two factors; then we get one |
---|
1:27:06 | layer further down, and so on |
---|
1:27:08 | so it's |
---|
1:27:10 | a recursive process like that. Of course, we don't want the derivative of the loss with |
---|
1:27:16 | respect to the input — what we want is, of course, the derivative with respect to the |
---|
1:27:19 | model parameters, which are the |
---|
1:27:22 | biases and the weights |
---|
1:27:24 | which we have |
---|
1:27:26 | here and here |
---|
1:27:29 | those are given by these expressions here |
---|
1:27:33 | so |
---|
1:27:34 | for the biases it is just this |
---|
1:27:38 | delta, computed down here |
---|
1:27:40 | for an element of the weight matrix, we multiply the corresponding element of delta |
---|
1:27:49 | by the corresponding element of the |
---|
1:27:55 | previous activation; and if we are interested in the derivative with respect to a whole part, |
---|
1:28:00 | we multiply the corresponding parts of these |
---|
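The whole derivation fits in a few lines of numpy. This toy version (two layers, made-up sizes, and a stand-in cost C = ½‖a₂‖²) just instantiates the formulas above:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=3)                         # network input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass: z = W a + b, then a = f(z), layer by layer (f = ReLU here).
z1 = W1 @ x + b1;  a1 = np.maximum(z1, 0)
z2 = W2 @ a1 + b2; a2 = np.maximum(z2, 0)
C = 0.5 * np.sum(a2 ** 2)                      # toy cost, so dC/da2 = a2

# Backward pass: delta_l = dC/dz_l, pushed down through the two Jacobians
# diag(f'(z_l)) and W_l, exactly as in the derivation.
d2 = a2 * (z2 > 0)                             # dC/dz2
dW2, db2 = np.outer(d2, a1), d2                # parameter gradients, layer 2
d1 = (W2.T @ d2) * (z1 > 0)                    # dC/dz1
dW1, db1 = np.outer(d1, x), d1                 # parameter gradients, layer 1
```

Note how `db` is delta itself, and `dW` is delta times the previous activation — the two facts the last trick below relies on.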
1:28:07 | okay, so that was the refresher |
---|
1:28:13 | here are also two really good references on this, if you want to go |
---|
1:28:19 | further into it |
---|
1:28:24 | now that we have |
---|
1:28:27 | reviewed this |
---|
1:28:28 | I will describe |
---|
1:28:30 | a couple of different issues that I ran into that required some |
---|
1:28:36 | little bit of thinking in relation to this |
---|
1:28:40 | the first thing is that, as you see here, in order to calculate the derivatives with respect to the |
---|
1:28:47 | weights, you need the outputs of each layer — the a's here |
---|
1:28:53 | and that means that we need to keep all of those in memory from the |
---|
1:28:58 | forward pass — we need them in memory when we do the backward pass — and if you |
---|
1:29:03 | have big batches, many utterances, or long utterances, this can become too much |
---|
1:29:10 | it can go up to many gigabytes — several gigabytes, for example |
---|
1:29:17 | for larger batches |
---|
1:29:20 | so |
---|
1:29:23 | both |
---|
1:29:24 | Theano and TensorFlow have, as far as I know, a way of getting around this |
---|
1:29:31 | and that is that you |
---|
1:29:37 | loop over the data |
---|
1:29:39 | they have the option — in the case of TensorFlow, and in the case of |
---|
1:29:43 | Theano — to discard the |
---|
1:29:48 | intermediate outputs from the forward pass; then, in the backward pass, they |
---|
1:29:52 | recalculate them when needed, so you basically just keep the |
---|
1:29:59 | memory for one utterance, or one chunk, at a time |
---|
1:30:04 | TensorFlow, I think, handles the same thing a little bit better, because it has the option |
---|
1:30:09 | to discard the outputs to the CPU memory, which is generally bigger |
---|
1:30:15 | than the GPU memory |
---|
1:30:17 | so |
---|
1:30:19 | in that case |
---|
1:30:21 | in order to use this, we can |
---|
1:30:25 | loop over the inputs up until the pooling layer, and after the pooling layer we |
---|
1:30:31 | put all the |
---|
1:30:33 | outputs together, so that we now have a kind of |
---|
1:30:40 | tensor with all the embeddings stored |
---|
1:30:44 | and then that can be processed normally |
---|
1:30:50 | and then you would just calculate the loss and do the backward pass — or at |
---|
1:30:55 | least that is roughly the idea; you have to think it through carefully |
---|
1:31:02 | this of course also has the advantage that we can have segments of different durations |
---|
1:31:08 | within a batch, which may otherwise be |
---|
1:31:11 | a bit complicated, or maybe not even possible |
---|
1:31:18 | I'm not showing the code for this |
---|
1:31:21 | because it sits among so many other things in the scripts that it would be very difficult |
---|
1:31:27 | to see what's going on |
---|
1:31:30 | it is there, though, in the |
---|
1:31:32 | repository |
---|
1:31:34 | scripts |
---|
1:31:36 | I was hoping to write some small example of it, but I didn't manage |
---|
1:31:42 | to do it in time |
---|
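In lieu of that missing example, here is a numpy sketch of the idea. A one-layer tanh "encoder" and a toy loss stand in for the real network and scoring; the point is only that the forward pass keeps nothing but the pooled embeddings, and the backward pass recomputes each utterance's activations one at a time, so peak memory is one utterance rather than the whole batch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                    # one-layer "encoder" (no bias, for brevity)
utts = [rng.normal(size=(50, 3)) for _ in range(8)]

def encode(X, W):                              # frames (T, 3) -> pooled embedding (4,)
    return np.tanh(X @ W.T).mean(axis=0)

# Forward: keep ONLY the embeddings; per-frame activations are discarded.
E = np.stack([encode(X, W) for X in utts])     # (8, 4)
loss = 0.5 * (E ** 2).sum()                    # toy stand-in for the scoring loss
dE = E                                         # dloss/dE for this toy loss

# Backward: redo each utterance's forward pass on demand, one at a time.
dW = np.zeros_like(W)
for X, g in zip(utts, dE):
    A = np.tanh(X @ W.T)                       # recomputed activations (T, 4)
    dZ = (g / len(X)) * (1 - A ** 2)           # grad through mean-pooling and tanh
    dW += dZ.T @ X
```

In Theano or TensorFlow the loop and the recomputation are handled by the framework's scan/recompute option, as described above; this sketch only shows why the result is the same gradient.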
1:31:47 | okay, so that's one trick |
---|
1:31:50 | the second |
---|
1:31:53 | trick is related to parallelization |
---|
1:32:03 | so |
---|
1:32:06 | suppose that we have some architecture like this: first the feature processing, and then the |
---|
1:32:11 | pooling |
---|
1:32:13 | and then some processing of the embeddings, and finally the scoring |
---|
1:32:19 | now, if we want to — |
---|
1:32:22 | well, normally, if we want to do parallelization when training with some multiclass objective, |
---|
1:32:27 | it isn't really a problem, because we just distribute the data over different |
---|
1:32:31 | workers, each of them calculates some gradients, and we can average the gradients |
---|
1:32:35 | or we can average the updated models |
---|
1:32:39 | but in this case, in the scoring part — when we use the verification loss |
---|
1:32:45 | in the scoring — we would like to have a comparison of all trials, all |
---|
1:32:50 | possible trials |
---|
1:32:51 | so we need to do |
---|
1:32:53 | the embedding extraction on the individual workers, then send all the embeddings |
---|
1:32:59 | to the master, which does the scoring |
---|
1:33:03 | then we do backpropagation down to the embeddings, and then we send those |
---|
1:33:10 | gradients back to each worker |
---|
1:33:13 | and they can continue the |
---|
1:33:17 | backpropagation |
---|
1:33:20 | the thing is, this is not exactly the normal situation: normally, when you |
---|
1:33:24 | have |
---|
1:33:26 | calculated the loss, you |
---|
1:33:30 | backpropagate it all the way to the input; but here, what the worker receives is not a loss — |
---|
1:33:37 | it is basically the derivative of the loss with respect to its embeddings |
---|
1:33:42 | so how do you use that to |
---|
1:33:47 | continue the backpropagation on |
---|
1:33:50 | the individual nodes? |
---|
1:33:56 | one simple trick to do this is to define, on each worker, a surrogate loss |
---|
1:34:01 | like this here: I define a new loss, which is just |
---|
1:34:05 | the derivative of |
---|
1:34:08 | the cost |
---|
1:34:11 | with respect to the embedding elements — which is what we received |
---|
1:34:16 | from the master node — |
---|
1:34:18 | times the embedding elements, summed over all elements, like this |
---|
1:34:23 | and if we now |
---|
1:34:26 | optimize this loss, we will get |
---|
1:34:28 | what we want, because — |
---|
1:34:31 | let's consider the derivative of this loss with respect to some |
---|
1:34:36 | model parameter of the neural network |
---|
1:34:39 | okay, we apply the chain rule here — |
---|
1:34:41 | we just take the derivative here |
---|
1:34:45 | here is the term that depends on the |
---|
1:34:48 | parameter, so we differentiate it, and this is then exactly the |
---|
1:34:53 | derivative |
---|
1:34:54 | that we are |
---|
1:34:57 | looking for; so |
---|
1:35:01 | the derivative |
---|
1:35:03 | of this loss with respect to a model parameter will be exactly the same as that |
---|
1:35:08 | of the loss we are interested in |
---|
1:35:10 | it is possible that some newer toolkits have |
---|
1:35:15 | ways to just do this directly, without such a trick — I'm not sure — but |
---|
1:35:21 | this was a simple way to achieve it |
---|
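In code, the trick is just this (a one-layer "network" and a toy master-side loss stand in for the real architecture; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))        # worker-side "network": one linear layer for brevity
x = rng.normal(size=3)

def embed(Wm):                     # worker forward pass -> embedding
    return Wm @ x

# Master side: scores the trials and backpropagates down to the embeddings.
# The scoring loss is a stand-in here: C = 0.5 * ||e||^2, so dC/de = e.
e = embed(W)
g = e.copy()                       # the gradient each worker receives from the master

# Worker side: surrogate loss = <g, e>, with g treated as a constant.
def surrogate(Wm):
    return g @ embed(Wm)

# Its gradient w.r.t. any network parameter equals the true dC/dparameter,
# checked here for W[1, 2] by finite differences:
eps = 1e-6
P = np.zeros_like(W); P[1, 2] = eps
surr_grad = (surrogate(W + P) - surrogate(W - P)) / (2 * eps)
true_grad = np.outer(e, x)[1, 2]   # analytic dC/dW for this toy C
```

In a real toolkit, each worker would simply call its autodiff backward pass on `surrogate` and feed the result to the optimizer.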
1:35:25 | the |
---|
1:35:29 | final trick |
---|
1:35:30 | is |
---|
1:35:34 | related to |
---|
1:35:36 | something that Kaldi calls repairing saturated ReLU units |
---|
1:35:42 | so |
---|
1:35:45 | the ReLU is the max operation; so let us remember, we have an affine transformation |
---|
1:35:51 | followed by an |
---|
1:35:54 | activation function, and if it's the ReLU, one potential |
---|
1:36:01 | problem is then: |
---|
1:36:03 | whenever the input is below zero, the ReLU will output zero — so when everything |
---|
1:36:09 | that goes into the ReLU is below zero, so all its inputs are |
---|
1:36:14 | below zero, then this ReLU is basically never outputting anything, and it's |
---|
1:36:20 | effectively |
---|
1:36:22 | useless |
---|
1:36:23 | and there is also the opposite problem: if the input is always above zero |
---|
1:36:27 | then the ReLU is just a linear unit, so it adds no modeling power |
---|
1:36:34 | only when the input tends to be |
---|
1:36:37 | you know |
---|
1:36:40 | sometimes |
---|
1:36:42 | positive and sometimes negative is the ReLU really acting as a |
---|
1:36:48 | nonlinearity, and |
---|
1:36:51 | the network is doing something interesting |
---|
1:36:53 | so what Kaldi does is regularly check if a ReLU unit has a problem |
---|
1:37:00 | like this, and in that case |
---|
1:37:03 | it will add |
---|
1:37:04 | some small offset |
---|
1:37:07 | to the delta — |
---|
1:37:09 | that is, to the derivative of C with respect to z |
---|
1:37:14 | the problem with doing this in some of the standard neural network toolkits is that we |
---|
1:37:19 | don't really — we can't really — we don't have an easy way to manipulate this delta |
---|
1:37:24 | term |
---|
1:37:26 | which is used inside the backpropagation |
---|
1:37:28 | so we will instead have to manipulate the derivatives with respect to the model parameters directly |
---|
1:37:35 | and |
---|
1:37:38 | seeing |
---|
1:37:41 | how |
---|
1:37:43 | these derivatives look |
---|
1:37:46 | — we know this from the derivation — |
---|
1:37:49 | the derivative with respect to the bias is just |
---|
1:37:52 | the delta itself; so whatever we want to add to delta, we can just |
---|
1:37:56 | add it directly to this |
---|
1:37:58 | gradient here, which we get from the toolkit |
---|
1:38:02 | and similarly for the weights — it is just that we also need to multiply by the |
---|
1:38:08 | activations of the previous layer, because that's how that gradient is calculated |
---|
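A numpy sketch of this repair follows. The detection thresholds and the 0.01 nudge magnitude are my own illustrative choices, not Kaldi's actual self-repair constants:

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(256, 8))            # activations feeding the layer (frames, in)
W = rng.normal(size=(5, 8))
b = np.array([-10.0, 10.0, 0.0, 0.0, 0.0])    # unit 0 is dead, unit 1 is always-on
Z = A_prev @ W.T + b                          # pre-activations (frames, units)

# Fraction of frames on which each ReLU fires; healthy units sit in between.
active = (Z > 0).mean(axis=0)
dead = active < 0.05
always_on = active > 0.95

# Nudge delta = dC/dZ so that gradient DESCENT (b -= lr * grad) raises the
# bias of dead units and lowers it for always-on ones. Toolkits rarely
# expose delta, but since dC/db = sum_t delta_t and dC/dW = delta^T A_prev,
# the same nudge can be applied to the parameter gradients directly.
repair = 0.01 * (always_on.astype(float) - dead.astype(float))
delta_nudge = np.broadcast_to(repair, Z.shape)

grad_b_extra = delta_nudge.sum(axis=0)        # add this to the toolkit's bias gradient
grad_W_extra = delta_nudge.T @ A_prev         # and this to the weight gradient
```

In the training loop, `grad_b_extra` and `grad_W_extra` would simply be added to the gradients the toolkit returns, before the optimizer update.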
1:38:13 | so those were some small tricks, and as a summary maybe I can say |
---|
1:38:18 | that it is quite helpful, when you work with neural networks, to |
---|
1:38:23 | understand the backpropagation properly, so that you know what's going on |
---|
1:38:29 | and then you can easily do small fixes like these |
---|
1:38:34 | so that's |
---|
1:38:35 | all from the hands-on session — thank you for your attention, and bye |
---|