0:00:15 | All right, I'm talking about multiclass discriminative training of i-vector language recognition this morning. I'm |
---|
0:00:21 | Alan McCree, from Johns Hopkins University, |
---|
0:00:23 | and I'd like to acknowledge some interesting discussions during this work with |
---|
0:00:27 | my current colleague Daniel, my previous colleagues Doug and Elliot, |
---|
0:00:31 | and Pedro, and more recently with Niko. |
---|
0:00:37 | So, |
---|
0:00:39 | as an introduction, |
---|
0:00:41 | you guys know, I think we had one discussion this morning, that |
---|
0:00:45 | language ID using i-vectors is a state-of-the-art system. |
---|
0:00:49 | What I want to talk about are some particular aspects of it. It's typically done as |
---|
0:00:54 | a two-stage process where we use a classifier as the first thing: even after we've |
---|
0:00:58 | got the i-vectors, first we build a classifier, |
---|
0:01:01 | and then we separately build a backend which does the calibration, and perhaps fusion as |
---|
0:01:06 | well. So I want to talk about two aspects that are a little different from that. |
---|
0:01:10 | First, I want to talk about what happens if we try to have one system that does |
---|
0:01:14 | the discrimination, |
---|
0:01:15 | the classification, and the calibration all at once, using discriminative training. Nobody ever said we have |
---|
0:01:21 | to use two systems back to back; what if we do it all together? |
---|
0:01:24 | And then secondly, I want to talk about an open-set extension to what is |
---|
0:01:28 | usually a closed-set language recognition task. |
---|
0:01:34 | So in the talk I will start with a description of the Gaussian model in |
---|
0:01:38 | the i-vector space. It's something that many of you have seen before, but I need to talk |
---|
0:01:42 | about some particular aspects of it in order to get into the details here. |
---|
0:01:47 | I'll also talk about how that relates to the open-set case; in that case I'll |
---|
0:01:50 | go into some of the Bayesian stuff that we do in speaker recognition, and how |
---|
0:01:54 | that could or couldn't be relevant in language recognition, and what the differences are. |
---|
0:01:59 | Then I will talk about the two key things here, which are the discriminative training |
---|
0:02:03 | that I'm using in particular, which is based on MMI, and then I'll talk about |
---|
0:02:07 | how I do the out-of-set model. |
---|
0:02:12 | So as a signal processing guy I like to think of |
---|
0:02:15 | this as an additive Gaussian noise model; in signal processing this is one of the |
---|
0:02:19 | most basic things that we see. |
---|
0:02:21 | So, |
---|
0:02:22 | in this context what we're talking about is that the observed i-vector you see |
---|
0:02:26 | was generated from a language, so it should look like the language mean vector, but |
---|
0:02:31 | it's corrupted by additive Gaussian noise, |
---|
0:02:34 | which we typically call a channel, for lack of a better word. |
---|
0:02:38 | So in this model, from a pattern recognition point of view, we have an unknown |
---|
0:02:43 | mean for each of our classes, |
---|
0:02:45 | we have a channel which is Gaussian and looks the same for all of the classes; |
---|
0:02:49 | that means that our classifier is a shared-covariance Gaussian model, |
---|
0:02:54 | and each language model is described by its mean, |
---|
0:02:58 | and that shared covariance is a channel or within-class covariance. |
---|
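As a minimal sketch of the model just described (assuming x is an observed i-vector, mu_l the mean of language l, and W the shared channel covariance):

```latex
% Additive Gaussian noise model in i-vector space (sketch of the model described above)
x = \mu_\ell + n, \qquad n \sim \mathcal{N}(0, W)
\quad\Longrightarrow\quad
p(x \mid \ell) = \mathcal{N}(x;\, \mu_\ell,\, W)
```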
0:03:06 | So to build a language recognition system we then need a training process and a scoring |
---|
0:03:11 | process. |
---|
0:03:12 | Training means we need to learn this shared within-class covariance, and then for each |
---|
0:03:16 | language we need to learn what its mean looks like, |
---|
0:03:19 | and testing again is this Gaussian scoring. |
---|
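A small sketch of that train/score recipe under the shared-covariance Gaussian model (illustrative only; the function and variable names are mine, not from the paper):

```python
import numpy as np

def train_gaussian_lid(X, y, n_classes):
    """ML training: per-language sample means plus one pooled within-class covariance."""
    d = X.shape[1]
    means = np.vstack([X[y == c].mean(axis=0) for c in range(n_classes)])
    W = np.zeros((d, d))
    for c in range(n_classes):
        Xc = X[y == c] - means[c]
        W += Xc.T @ Xc
    W /= len(X)                       # pooled (shared) channel covariance
    return means, W

def loglikes(x, means, W):
    """Gaussian scoring: log N(x; mu_c, W) for every language, up to a shared constant."""
    Winv = np.linalg.inv(W)
    diffs = means - x
    return -0.5 * np.einsum('cd,de,ce->c', diffs, Winv, diffs)
```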
0:03:22 | And I guess, unlike some people in this room, I'm not particularly uncomfortable with closed-set |
---|
0:03:27 | detection. |
---|
0:03:28 | And that gives you a sort of funny-looking form of Bayes' rule: the target, if |
---|
0:03:33 | it is this class, then that's just the likelihood of this class; |
---|
0:03:37 | that's easy. |
---|
0:03:38 | But the non-target means that it's one of the other classes, and then you need |
---|
0:03:41 | some implicit prior distribution over the other classes, |
---|
0:03:45 | which for the LRE design means you can use a flat prior, |
---|
0:03:49 | given that it is not the target. |
---|
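With a flat prior over the K-1 non-target languages, the closed-set detection likelihood ratio being described is, as a sketch:

```latex
% Closed-set detection score with a flat prior over the non-target languages (sketch)
\mathrm{LR}_\ell(x) \;=\;
\frac{p(x \mid \ell)}
     {\frac{1}{K-1}\sum_{m \neq \ell} p(x \mid m)}
```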
0:03:55 | So the key question then for building a language model is how do we estimate |
---|
0:04:00 | the mean. Estimating the mean of a Gaussian is not one of the most complicated |
---|
0:04:04 | things in statistics, but there are multiple ways to do it. Of course the simplest |
---|
0:04:08 | thing to do is just take the sample mean, maximum likelihood, |
---|
0:04:11 | and that's mainly what I'm gonna end up using here in this work, but I |
---|
0:04:14 | wanna emphasize there are other things you could do, and in speaker recognition we do |
---|
0:04:19 | not do that, we do something more complicated. |
---|
0:04:22 | The next, more sophisticated thing is MAP adaptation, which we all know from GMM- |
---|
0:04:26 | UBMs and Doug's work, |
---|
0:04:28 | but you can do that in this context as well. It's a very simple formula that |
---|
0:04:33 | requires, however, that you have a second covariance matrix, which we can call the across-class |
---|
0:04:37 | covariance, which is the prior distribution of what all models could look like, |
---|
0:04:42 | or in this case the distribution that the means are drawn from. |
---|
0:04:47 | And then finally, from there, instead of taking a point estimate you can |
---|
0:04:51 | go to a Bayesian approach where you don't actually estimate the mean for each class, |
---|
0:04:56 | you estimate the posterior distribution of the mean of each class given the training data |
---|
0:05:00 | for that class. |
---|
0:05:02 | And in that case |
---|
0:05:04 | you keep that posterior distribution, and then you do the scoring with what's called |
---|
0:05:08 | the predictive distribution, which is |
---|
0:05:10 | a bigger Gaussian, a fatter Gaussian: it includes the within-class covariance |
---|
0:05:15 | but also has an additional term, which is the uncertainty from how much data you saw |
---|
0:05:18 | for that particular class. |
---|
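For reference, a sketch of the standard two-covariance MAP/Bayesian estimate being described, assuming a prior mean m, across-class covariance B, within-class covariance W, and N_l training i-vectors with sample mean \bar{x}_l for language l:

```latex
% Posterior of the language mean (sketch of the MAP / Bayesian estimate described above)
\Sigma_\ell = \bigl(B^{-1} + N_\ell W^{-1}\bigr)^{-1}, \qquad
\hat{\mu}_\ell = \Sigma_\ell \bigl(B^{-1} m + N_\ell W^{-1} \bar{x}_\ell\bigr)

% Predictive distribution used at test time: the "fatter" Gaussian
p(x \mid \mathrm{train}_\ell) = \mathcal{N}\!\bigl(x;\ \hat{\mu}_\ell,\ W + \Sigma_\ell\bigr)
```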
0:05:25 | One little trick that I only learned recently, though I wish I'd learned it a lot sooner, |
---|
0:05:28 | is something |
---|
0:05:30 | developed many years ago. I have a reference in the paper, but it's really handy for |
---|
0:05:33 | all these kinds of systems: |
---|
0:05:36 | everybody knows you can diagonalize one covariance matrix of your data, such that the covariance |
---|
0:05:40 | becomes identity, by putting a linear transform on the data. Well, in fact you |
---|
0:05:45 | can do it for two at once, |
---|
0:05:46 | and since we have two, this is really helpful. |
---|
0:05:49 | I have the formulas in the paper; it's actually not very hard. |
---|
0:05:54 | And you end up with a linear transform where the within-class covariance is identity, which we're often |
---|
0:05:58 | used to, WCCN for example accomplishes that, |
---|
0:06:01 | but the across-class is also diagonal, and it's sorted in order so the most important |
---|
0:06:05 | dimensions are first. |
---|
0:06:07 | And it's a beautiful global transformation; |
---|
0:06:10 | it means that you can do linear discriminant analysis, you can do dimension reduction easily |
---|
0:06:16 | in the space just by picking the most interesting dimensions first. |
---|
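A small sketch of that simultaneous diagonalization of the two covariances (illustrative code; names are mine, not from the paper):

```python
import numpy as np

def joint_diagonalize(W, B):
    """Find T such that T W T^T = I and T B T^T is diagonal, sorted descending.

    W: within-class (channel) covariance, B: across-class covariance.
    """
    # Whiten the within-class covariance.
    d, U = np.linalg.eigh(W)
    Wh = U @ np.diag(d ** -0.5) @ U.T          # Wh W Wh^T = I
    # Diagonalize the across-class covariance in the whitened space.
    lam, V = np.linalg.eigh(Wh @ B @ Wh.T)
    order = np.argsort(lam)[::-1]              # most informative dimensions first
    T = V[:, order].T @ Wh
    return T, lam[order]
```

Dimension reduction is then just keeping the first rows of T, as described above.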
0:06:20 | And it's also a reminder that when you say you do LDA in your system, |
---|
0:06:26 | be a little careful, because LDA, |
---|
0:06:29 | there's a number of ways to formulate LDA; they all give the same subspace, but |
---|
0:06:33 | they don't give the same transformation within that subspace, |
---|
0:06:37 | because that's not part of the criterion. |
---|
0:06:40 | And that's what this does: it gives the same subspace, but it's not the same linear |
---|
0:06:43 | transformation. |
---|
0:06:48 | So I'm gonna show some experiments here. I'll start with some simple ones and |
---|
0:06:51 | move on to the discriminative training next, |
---|
0:06:54 | where I'm using acoustic i-vectors. I think maybe it was mentioned here, |
---|
0:06:58 | the main thing for a LID system is you need to do shifted delta |
---|
0:07:01 | cepstra, and you need to do vocal tract length normalization, which you might not do for speaker. |
---|
0:07:07 | I'm gonna present LRE11 because it's the most recent LRE, but as I kind |
---|
0:07:12 | of hinted, I'm not gonna use pair detection because I'm not a big fan of |
---|
0:07:15 | pair detection. |
---|
0:07:18 | So I'm using the older metric, C average, |
---|
0:07:21 | but you get similar performance rankings |
---|
0:07:24 | when you use pair |
---|
0:07:25 | detection as well. |
---|
0:07:27 | And |
---|
0:07:30 | within LRE |
---|
0:07:31 | you build your own train and dev sets; these are the Lincoln training data sets |
---|
0:07:35 | that are used here, |
---|
0:07:36 | centered to zero mean. |
---|
0:07:40 | So, just as generative Gaussian models, I mentioned that you can do ML |
---|
0:07:45 | and you can do these other things; I mentioned ML, MAP, and Bayesian. |
---|
0:07:48 | I have a nice slide here with just three things, but it's actually not those three |
---|
0:07:50 | things, |
---|
0:07:52 | so you have to pay attention while I describe what this is. |
---|
0:07:57 | For ML, what I'm doing here is there is no backend, there is |
---|
0:08:00 | just Bayes' rule applied, because that's the formula that I showed you, to the generative |
---|
0:08:04 | Gaussian model. |
---|
0:08:06 | And these numbers, for people who do LREs, these are not very good numbers, |
---|
0:08:09 | but this is what happens straight out of the generative model. |
---|
0:08:13 | And what I'm showing is C average and min C average; |
---|
0:08:17 | the actual C average means you had to make hard decisions on the detection. |
---|
0:08:23 | So the ML system |
---|
0:08:24 | is the baseline. |
---|
0:08:27 | This one is the Bayesian system, where you make the Bayesian estimation |
---|
0:08:31 | of the mean; then in the end you don't actually have the same covariance |
---|
0:08:34 | for every class, because they had different counts, and that gives a different predictive uncertainty. |
---|
0:08:39 | But in fact they are very similar, because in language recognition |
---|
0:08:42 | you have many instances per class, so it almost degenerates to the same thing. |
---|
0:08:47 | The reason I didn't show MAP is because it's in between those two, and there's |
---|
0:08:50 | not much space in between those two, so it's not a very interesting thing. |
---|
0:08:54 | This last one is kind of interesting in that |
---|
0:08:59 | it's not right, but it actually works better |
---|
0:09:01 | from a calibration point of view — |
---|
0:09:03 | well, when I say calibration, I mean that it works better with Bayes' |
---|
0:09:07 | rule. |
---|
0:09:08 | What I've done here is what we typically do in speaker recognition, where you use |
---|
0:09:10 | the right math but you pretend that there's only one cut, instead of keeping the |
---|
0:09:14 | correct count of the number of cuts, |
---|
0:09:16 | and in terms of the predictive distribution that gives you a greater |
---|
0:09:20 | uncertainty and a wider covariance, |
---|
0:09:23 | and it so happens that actually works a little better in this case. |
---|
0:09:29 | But |
---|
0:09:31 | once you put a backend into the system, which is what everybody's usually |
---|
0:09:34 | showing, then these differences really disappear, so I'm gonna use ML systems for |
---|
0:09:40 | the rest of the discriminative training work. |
---|
0:09:43 | As I said, these numbers are not very good; they're about three times as bad as |
---|
0:09:46 | state-of-the-art. |
---|
0:09:48 | What's usually done is an additionally trained backend. The simplest one, I think |
---|
0:09:52 | John showed, was the FoCal scalar multiclass thing that was coded before; |
---|
0:09:58 | that's logistic regression. |
---|
0:09:59 | You can do a full logistic regression with a matrix instead of with a |
---|
0:10:02 | scalar, you can put a Gaussian backend in front |
---|
0:10:09 | of a logistic regression, which is something that we've tried, or you can use a |
---|
0:10:13 | discriminatively trained Gaussian as the backend, which is something we were doing at |
---|
0:10:17 | Lincoln for quite a while. |
---|
0:10:19 | And these systems all work much better, and pretty similar to each other. |
---|
0:10:24 | You can also build the classifier to be discriminative; one of the more common things |
---|
0:10:28 | to do is an SVM, one-versus-rest. |
---|
0:10:31 | That still doesn't solve the final task, but it can help, |
---|
0:10:35 | and if you do one-versus-rest logistic regression you also still need a back- |
---|
0:10:39 | end. Or, as Niko has been doing recently, a multiclass |
---|
0:10:44 | training of the classifier itself followed by a multiclass backend. |
---|
0:10:47 | But what I wanna talk about is trying to do everything together: one training of |
---|
0:10:51 | the multiclass system that won't need its own separate backend, ready to apply Bayes' |
---|
0:10:56 | rule straight out. |
---|
0:10:57 | And |
---|
0:10:59 | it's not commonly used in backends, but in our field MMI is a very common |
---|
0:11:03 | thing in the GMM world, in speech recognition work. |
---|
0:11:07 | The criterion, if you're not familiar with it, |
---|
0:11:10 | is another name for the cross entropy, which is the same metric that logistic |
---|
0:11:14 | regression uses; |
---|
0:11:15 | it is a multiclass, are-your-probabilities-correct kind of metric, |
---|
0:11:21 | and it is a closed-set |
---|
0:11:24 | discriminative training of classes against each other. |
---|
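As a sketch, the multiclass cross-entropy (MMI) criterion over training i-vectors x_n with true labels l_n, using the Gaussian class likelihoods p(x|m) and priors pi_m, has the form:

```latex
% Multiclass cross-entropy / MMI objective (sketch)
\mathcal{F} \;=\; \sum_{n} \log
\frac{\pi_{\ell_n}\, p(x_n \mid \ell_n)}
     {\sum_{m} \pi_m\, p(x_n \mid m)}
```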
0:11:28 | The update equations, |
---|
0:11:30 | if you haven't seen them, are kind of cool, and they're kind of different; |
---|
0:11:33 | it's a little bit of a weird derivation compared to the gradient descent that everybody's used to. |
---|
0:11:38 | It can be interpreted like a gradient descent with kind of a magical step |
---|
0:11:42 | size. |
---|
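For reference, the classic extended Baum-Welch style mean update commonly used for Gaussian MMI has this flavor, with a smoothing constant D playing the role of that "magical step size" (a generic sketch; the exact update used in this work may differ):

```latex
% Extended Baum-Welch style MMI mean update (sketch)
\mu_\ell \;\leftarrow\;
\frac{\theta_\ell^{\mathrm{num}} - \theta_\ell^{\mathrm{den}} + D\,\mu_\ell^{\mathrm{old}}}
     {\gamma_\ell^{\mathrm{num}} - \gamma_\ell^{\mathrm{den}} + D}
```

Here the gammas are posterior counts and the thetas are first-order statistics accumulated under the numerator (true-label) and denominator (all-class) terms of the criterion.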
0:11:44 | But it's quite effective, and the way it's always done in speech recognition is, |
---|
0:11:49 | since you're doing this to a Gaussian system, you start with an ML version |
---|
0:11:52 | of the Gaussian and then you discriminatively update it, so to speak. |
---|
0:11:56 | That makes the convergence much easier; |
---|
0:11:59 | it gives a natural regularization, because you're starting with something that is already a reasonable |
---|
0:12:03 | solution, and in fact the simplest form of regularization is just to not let it |
---|
0:12:07 | run very long, which is also a lot cheaper. |
---|
0:12:10 | And it also gives you something you can tie back to and put a penalty function on |
---|
0:12:14 | that says don't be too different from the ML solution. |
---|
0:12:17 | So regularization is a straightforward thing to do in MMI. |
---|
0:12:22 | And this diagonal covariance transformation that I was talking about is really helpful here, |
---|
0:12:27 | because |
---|
0:12:28 | then we only discriminatively update these diagonal covariances instead of full covariances, |
---|
0:12:33 | so we have fewer parameters than a full-matrix logistic regression but more parameters than the |
---|
0:12:38 | scalar logistic regression. |
---|
0:12:45 | So now, these are pretty much state-of-the-art numbers; remember, the previous numbers |
---|
0:12:50 | were up here, essentially. |
---|
0:12:54 | So this is the ML Gaussian followed by an MMI Gaussian backend in the score |
---|
0:12:59 | space, which was kind of our default way of doing things when I was at |
---|
0:13:03 | Lincoln. |
---|
0:13:05 | This next score is kind of a disappointment, which is what happens if you take the |
---|
0:13:08 | training set and you discriminatively train it with MMI and don't have |
---|
0:13:12 | a backend here. |
---|
0:13:13 | It is in fact |
---|
0:13:15 | considerably better than the ML system, its equivalent, which I started with, |
---|
0:13:20 | but it is nowhere near where we wanna be, obviously. |
---|
0:13:23 | So |
---|
0:13:24 | why not? |
---|
0:13:25 | Well, |
---|
0:13:28 | one of the quirks of LRE, |
---|
0:13:30 | which |
---|
0:13:32 | is more data-dependent, I think, than realistic, |
---|
0:13:36 | is that the dev set actually looks different from the training set. |
---|
0:13:39 | So this is only done on the training set, it's not using any dev set |
---|
0:13:43 | at all. |
---|
0:13:44 | The most obvious thing is that the dev set |
---|
0:13:47 | and the test set are all thirty seconds, approximately; the training set is whatever sizes |
---|
0:13:52 | of conversations there happened to be, so that's an obvious mismatch. So I took the training set |
---|
0:13:57 | and truncated everything to be thirty seconds instead of the entire conversation. |
---|
0:14:00 | Throwing away data in that way turned out to be very helpful, because it's now |
---|
0:14:03 | a lot better matched to what the test data looks like, |
---|
0:14:06 | but not everything I wanted. So then I took the thirty-second training set, |
---|
0:14:10 | concatenated it together with the dev set, which is a thirty-second set, |
---|
0:14:14 | and used the entire set at once |
---|
0:14:17 | for training the system, and that in fact works as well as, and slightly |
---|
0:14:22 | better |
---|
0:14:22 | than, the two separate stages of a system followed by |
---|
0:14:26 | a discriminatively trained backend. |
---|
0:14:32 | So I looked at a number of different |
---|
0:14:37 | permutations of this MMI system; anybody who's done GMM MMI knows |
---|
0:14:41 | you can |
---|
0:14:42 | train this, that, or the other, and various things like that. |
---|
0:14:45 | The simplest thing to do is just to do the means only, and |
---|
0:14:48 | that is fairly effective. |
---|
0:14:52 | You can train the mean and the within-class covariance; |
---|
0:14:56 | of course in the closed-set system the across-class covariance is not coming |
---|
0:15:00 | into play, it's only the within-class covariance which is having an effect. |
---|
0:15:05 | One thing that I found kind of interesting is, instead of training the entire |
---|
0:15:08 | covariance matrix, to train a scale factor which scales the covariance; that's a little |
---|
0:15:14 | bit simpler system with fewer parameters. |
---|
0:15:16 | And you can also play with a sequential system, |
---|
0:15:20 | and in particular I found it interesting to do the scale factor first and then the |
---|
0:15:24 | means. It's really that, |
---|
0:15:29 | in the end, they would all give the same solution, but |
---|
0:15:32 | when you only do a limited number of iterations, the starting point in the sequence |
---|
0:15:36 | does affect what you get. |
---|
0:15:39 | So, |
---|
0:15:42 | again the same sorts of plots. This is what happens — this |
---|
0:15:47 | is now purely no backend, just the discriminatively trained classifier itself. If you do means |
---|
0:15:51 | only, |
---|
0:15:53 | your actual C average is not terribly good, but your min C average is pretty close. |
---|
0:15:59 | So that is an indication. |
---|
0:16:01 | What calibration means in a multiclass detection |
---|
0:16:05 | task is kind of controversial, but |
---|
0:16:08 | one thing that I think I can say comfortably is, whenever you see this happen, |
---|
0:16:13 | it means that you're not calibrated. |
---|
0:16:15 | The fact that they match doesn't necessarily mean that you are calibrated, because Bayes' |
---|
0:16:18 | rule is more complicated than that, but |
---|
0:16:20 | this means that it is clearly not calibrated. |
---|
0:16:23 | So once we do something to the variance — this is doing the mean and the |
---|
0:16:26 | entire variance, this is doing the mean and the scale factor of the variance at the same |
---|
0:16:30 | time, |
---|
0:16:31 | and this is doing a two-stage process of the scale factor of |
---|
0:16:34 | the variance followed by the mean — |
---|
0:16:36 | all of those |
---|
0:16:37 | work much better. So in order to get calibration you need to actually adjust the |
---|
0:16:41 | covariance matrix, which kinda makes sense; you need a scale factor or something. |
---|
0:16:45 | And |
---|
0:16:46 | once you fine-tune the numbers, as we typically do when we're actually working |
---|
0:16:50 | on these kinds of tasks, |
---|
0:16:52 | you actually see that the two-stage process is in fact the best one, |
---|
0:16:56 | and it is better than |
---|
0:16:58 | our two-step process that we used to have before of a separate system followed by |
---|
0:17:03 | a backend. |
---|
0:17:06 | Okay, so that's the discriminative training part. The other thing I want to talk about |
---|
0:17:09 | is the out-of-set problem that was mentioned in a question earlier, |
---|
0:17:15 | because oftentimes we're interested in something where it could be another language that is not |
---|
0:17:20 | one of the closed set. |
---|
0:17:23 | The nice thing about the two-covariance mathematics that we've been using for speaker recognition |
---|
0:17:28 | is that it has built into it a model for what out-of-set is |
---|
0:17:31 | supposed to be. |
---|
0:17:33 | I already mentioned, essentially, that if you have a Gaussian |
---|
0:17:35 | distribution of what all models look like, then an out-of-set language is a randomly |
---|
0:17:40 | drawn language from that pool, |
---|
0:17:42 | and that's represented by the Gaussian distribution. |
---|
0:17:46 | Then at test time |
---|
0:17:48 | you have again an even bigger Gaussian, because the uncertainty is both the channel plus |
---|
0:17:55 | which language it was. |
---|
0:17:56 | So now |
---|
0:17:59 | the out-of-set is also a Gaussian, but it has the bigger covariance, and all |
---|
0:18:03 | the others have a shared covariance which is smaller, so you no longer have |
---|
0:18:07 | a linear system |
---|
0:18:09 | when you make a comparison. |
---|
0:18:11 | This is the most general formula, when you have |
---|
0:18:15 | an open-set problem which is both out-of-set and closed-set; |
---|
0:18:19 | this is how you would combine them. This is what I had before, the sort |
---|
0:18:22 | of Bayes' rule competition of all the other closed-set classes; this is the |
---|
0:18:26 | new distribution, the out-of-set distribution. |
---|
0:18:29 | If you want a pure out-of-set problem, which is what I'm gonna talk about |
---|
0:18:32 | here, you just take the probability of being out-of-set to be one, but in |
---|
0:18:35 | fact you could make a mixed distribution as well. |
---|
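A sketch of that combination for target language l, assuming an out-of-set prior P_oos, across-class mean m and covariance B, and within-class covariance W (setting P_oos = 1 gives the pure out-of-set case):

```latex
% Open-set detection score: out-of-set Gaussian mixed with the closed-set competition (sketch)
\mathrm{LR}_\ell(x) \;=\;
\frac{\mathcal{N}(x;\, \mu_\ell,\, W)}
     {P_{\mathrm{oos}}\,\mathcal{N}(x;\, m,\, W + B)
      \;+\; (1 - P_{\mathrm{oos}})\,\frac{1}{K-1}\sum_{m' \neq \ell} \mathcal{N}(x;\, \mu_{m'},\, W)}
```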
0:18:38 | Okay, so I want to talk about the out-of-set. |
---|
0:18:42 | Just to touch on what I have now: |
---|
0:18:44 | if I were to do the Bayesian numerator for each class that I mentioned before, |
---|
0:18:48 | and then this denominator, |
---|
0:18:51 | then I have what we'd like to call |
---|
0:18:53 | Bayesian speaker comparison; |
---|
0:18:55 | there's a nice paper about that. |
---|
0:18:59 | It is the same answer as PLDA or the two-covariance model, |
---|
0:19:04 | and I'd like to |
---|
0:19:06 | emphasize that |
---|
0:19:07 | they're set up differently, so the numerator and denominator are different in these two formulations, |
---|
0:19:12 | but the ratio is the same thing, because it's the same model and it's the same |
---|
0:19:16 | correct answer. |
---|
0:19:18 | I think, you know, with a formalism like I'm talking about here, I find it much easier |
---|
0:19:21 | to understand, in this context, |
---|
0:19:23 | the philosophy. |
---|
0:19:25 | And |
---|
0:19:27 | Daniel and I have spent a lot of time on this; you can see that only a few |
---|
0:19:30 | of us look at it from this perspective point of view. |
---|
0:19:33 | But in this terminology we say that we have a model for each class, and |
---|
0:19:38 | the covariances are hyperparameters; in the other terminology, you guys like to say that there |
---|
0:19:43 | is no model |
---|
0:19:44 | and the parameters of the system are the covariance matrices. Again, it's the same |
---|
0:19:49 | system, the same answer, just a different perspective; but when we're talking about closed |
---|
0:19:53 | set and ML models, |
---|
0:19:55 | I know how to say that in this context, and I don't know so well |
---|
0:19:58 | how to say that |
---|
0:19:59 | in the PLDA one. |
---|
0:20:02 | So, discriminative training of the out-of-set: I described the out-of-set |
---|
0:20:06 | model, but as I've said, now I have this MMI hammer in my toolbox, |
---|
0:20:10 | and this is just one more covariance that I can train, so I've got an |
---|
0:20:14 | across-class mean and covariance. |
---|
0:20:17 | The ML out-of-set system just takes all of these to be the sample |
---|
0:20:22 | covariance matrices. |
---|
0:20:24 | But I can |
---|
0:20:26 | do an MMI update of this out-of-set class as well. The simplest way for me |
---|
0:20:30 | to do this is to take the |
---|
0:20:32 | closed-set system I already presented, then |
---|
0:20:36 | freeze the closed-set models, and then separately update the out-of-set model given |
---|
0:20:40 | the closed-set models. |
---|
0:20:42 | I can do that by scoring with one-versus-rest instead of scoring |
---|
0:20:47 | with Bayes' rule, and doing a round robin on the same training set. |
---|
0:20:52 | The advantage of this is I can actually build a system without ever actually having |
---|
0:20:56 | any out-of-class data; I'd probably do better if I really did have out-of-class data, but in |
---|
0:21:00 | this case I don't, and I can still build a perfectly legitimate system. |
---|
0:21:03 | So, |
---|
0:21:06 | the performance of this system: what I've done here |
---|
0:21:10 | is scored this LRE, even though there is no out-of-set data, scoring without |
---|
0:21:14 | Bayes' rule, where the system is not allowed to know what the other classes were, |
---|
0:21:21 | and so that's a simulation of an open-set |
---|
0:21:23 | scoring function. |
---|
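A minimal sketch of that open-set scoring, where each language competes only against the bigger out-of-set Gaussian rather than the other closed-set classes (illustrative only; names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def open_set_llr(x, mu_l, W, m, B):
    """Open-set detection log-LR for language l: in-set Gaussian N(mu_l, W)
    versus the out-of-set Gaussian with the bigger covariance W + B."""
    return mvn.logpdf(x, mean=mu_l, cov=W) - mvn.logpdf(x, mean=m, cov=W + B)
```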
0:21:25 | The ML version of this — |
---|
0:21:27 | the actual C average is actually off the chart; it's like those kind of bad |
---|
0:21:31 | numbers that I started with. |
---|
0:21:33 | With the MMI training of the closed-set system, |
---|
0:21:37 | and then the |
---|
0:21:39 | ML version of the across-class covariance, it in fact is already a lot better, so |
---|
0:21:43 | whatever's happening in the closed-set discriminative training is actually helping the |
---|
0:21:47 | open-set scoring as well. But explicitly retraining |
---|
0:21:51 | the out-of-set covariance matrix |
---|
0:21:53 | with the same mechanism, scale factor then the mean, |
---|
0:21:58 | in fact behaves pretty reasonably, |
---|
0:22:01 | and gives a system which is not obviously uncalibrated, |
---|
0:22:05 | and it's pretty reasonable performance. |
---|
0:22:07 | The closed-set scoring performance is still down here, but this has gotten a lot |
---|
0:22:11 | better and it's perfectly feasible. |
---|
0:22:14 | So |
---|
0:22:16 | the two contributions here were the single-system concept, that we don't have to do |
---|
0:22:20 | system design and then a backend, we can discriminatively train the system to already be calibrated, |
---|
0:22:26 | and that we can model out-of-set using the same mathematics that we have in |
---|
0:22:31 | speaker recognition, |
---|
0:22:33 | but a simpler version, because we don't need to be Bayesian in this case, |
---|
0:22:36 | and that can also be discriminatively updated so that we can be reasonably |
---|
0:22:40 | calibrated for the open-set |
---|
0:22:42 | task as well. |
---|
0:23:06 | So thanks, Alan. |
---|
0:23:08 | It's very nice to see that you unified those two parts of the system; |
---|
0:23:13 | I wish we could do that in speaker recognition. |
---|
0:23:17 | So my question is about your |
---|
0:23:21 | maximum likelihood |
---|
0:23:23 | across-class covariance. You've got twenty-four languages to work with in six-hundred- |
---|
0:23:28 | dimensional |
---|
0:23:30 | i-vectors, so |
---|
0:23:31 | how did you estimate or assign that |
---|
0:23:34 | parameter? |
---|
0:23:36 | It is the sample covariance, so everything here was done with the dimension reduction in |
---|
0:23:41 | the front, |
---|
0:23:42 | to twenty-three dimensions. |
---|
0:23:45 | I'm sorry, that's why my illustration |
---|
0:23:48 | already specified that there would be twenty-three dimensions, |
---|
0:23:52 | and anything that has a prior is limited to twenty-three dimensions. |
---|
0:23:58 | Okay, in this case I just took the sample covariance matrix; if I'd regularized |
---|
0:24:02 | it somehow you could make it |
---|
0:24:05 | appear to be bigger, to be full size. |
---|
0:24:09 | Okay, so |
---|
0:24:10 | those formulas you showed with the covariances, that happens in twenty-three-dimensional space? |
---|
0:24:15 | Yes. |
---|
0:24:25 | So in this case you're doing LDA, and then, if I'm not mistaken, isn't that |
---|
0:24:31 | the same as doing a Gaussian backend and another calibration? |
---|
0:24:36 | Like LDA and a regression backend, as was done in this evaluation? |
---|
0:24:41 | Here you're computing the sample covariances once in the full space, |
---|
0:24:46 | but |
---|
0:24:47 | the across-class is only rank twenty-three. |
---|
0:24:51 | So you take the six-hundred-dimensional within-class and map it down to twenty-three? |
---|
0:24:56 | Yes. |
---|
0:24:59 | So if you do LDA and a regression or Gaussian backend, it's the same subspace |
---|
0:25:04 | as LDA. |
---|
0:25:06 | Yes, if you take the output of LDA |
---|
0:25:09 | in twenty-three dimensions, or you get the Gaussians and you get twenty-four scores, |
---|
0:25:13 | it's almost the same thing, so you're still doing two steps. |
---|
0:25:20 | It's still just two steps, |
---|
0:25:22 | in my view: the ML estimation, which in this case forces you to be |
---|
0:25:26 | twenty-three dimensional, |
---|
0:25:28 | and then |
---|
0:25:29 | the update of those equations. |
---|
0:25:32 | But LDA and a Gaussian |
---|
0:25:34 | backend, there is a similarity, it's very close. |
---|
0:25:38 | Well, the way we would have done a system before would be LDA, and |
---|
0:25:43 | then a Gaussian in that space, and then |
---|
0:25:47 | MMI training in the score space, |
---|
0:25:51 | the likelihood ratios of the first thing. This is MMI training in the i-vector space |
---|
0:25:55 | directly. |
---|
0:25:57 | But |
---|
0:25:58 | these are not very complicated mathematics, so the things are pretty closely related, yes. |
---|
0:26:05 | So when you did the joint diagonalization |
---|
0:26:08 | there, and then you |
---|
0:26:10 | work with diagonal covariance matrices, but then you're also updating the covariance matrices in training, |
---|
0:26:16 | is that diagonalization still valid then? |
---|
0:26:18 | I mean, you do the one static projection; what does that mean then when you force |
---|
0:26:22 | it to be diagonal? — It's sort of like saying, I mean, |
---|
0:26:25 | the entire thing can be mapped back, |
---|
0:26:28 | by undoing the diagonalization, into a full covariance; so in some sense you are |
---|
0:26:33 | still updating a full covariance, but you're only updating it in a constrained way. |
---|
0:26:38 | So the matrix is still full size, but the number of parameters that you |
---|
0:26:42 | discriminatively updated is not the full set. |
---|
0:26:58 | So, if I remember correctly, you're actually doing closed-set, |
---|
0:27:03 | twenty-three or twenty-four languages, is that correct? Twenty-four languages, right. So |
---|
0:27:08 | is it possible, I mean I don't want to change your problem, but if you were |
---|
0:27:12 | to look at a subset, so you pick twelve, say, and take the others |
---|
0:27:16 | as completely open-set data, so you do the training only on a |
---|
0:27:20 | portion, as if we didn't have access to the out-of-set data — |
---|
0:27:23 | do you have some sense of how strong your solution would be |
---|
0:27:28 | if you didn't have access to those similar-sounding languages that you want to |
---|
0:27:32 | reject? |
---|
0:27:34 | I think it's an interesting thought that |
---|
0:27:38 | you could more extensively test this out-of-set hypothesis by doing a hold-one- |
---|
0:27:43 | out or something and round robin on that, and I think that is an interesting |
---|
0:27:47 | idea, but I haven't |
---|
0:27:48 | done it. |
---|