0:00:06 | The title of my talk is "Bayesian speaker verification with heavy-tailed priors". |
---|
0:00:30 | In a nutshell, it is about applying joint factor analysis with i-vectors as features. |
---|
0:00:41 | So I'll be assuming that you have some familiarity with joint factor analysis, i-vectors, and cosine distance scoring. |
---|
0:00:54 | The key fact about i-vectors is that they provide a representation of speech segments of arbitrary durations by vectors of fixed dimension. |
---|
0:01:08 | These vectors seem to contain most of the information needed to distinguish between speakers, and as a bonus they are of relatively low dimension: typically four hundred, rather than a hundred thousand as in the case of GMM supervectors. |
---|
0:01:29 | This means that it's possible to apply modern Bayesian methods of pattern recognition to the speaker recognition problem. |
---|
0:01:41 | We've banished the time dimension altogether, and we're in a situation which is quite analogous to other pattern recognition problems. |
---|
0:01:57 | I think I should explain at the outset what I mean by Bayesian, because it's open to several interpretations. |
---|
0:02:06 | What I intend is that, in my mind, the terms Bayesian and probabilistic are synonymous with each other. |
---|
0:02:16 | The idea is, as far as possible, to do everything within the framework of the calculus of probability. |
---|
0:02:27 | It doesn't really matter whether you prefer to interpret probabilities in frequentist terms or in subjective terms. |
---|
0:02:38 | The rules of probability are the same: there are only two, the sum rule and the product rule, and they give you the same results in both cases. |
---|
0:02:51 | The advantage of this is that you have a logically coherent way of reasoning in the face of uncertainty. |
---|
0:03:01 | The disadvantage is that in practice you usually run into a computational brick wall in pretty short order if you try to follow these rules consistently. |
---|
0:03:16 | So in fact it's really only been in the past ten years that this field of Bayesian pattern recognition has really taken off, and that's thanks to the introduction of fast approximate methods of Bayesian inference, in particular variational Bayes. |
---|
0:03:43 | These make it possible to treat probabilistic models which are far more sophisticated than was possible in the case of traditional statistics. |
---|
0:03:55 | So the unifying theme in my talk will be the application of variational Bayes methods to the speaker recognition problem. |
---|
0:04:07 | I start out with the traditional assumptions of joint factor analysis, that speaker and channel effects are statistically independent and Gaussian distributed. |
---|
0:04:23 | In the first part of my talk I will simply aim to show how joint factor analysis can be done under these assumptions, using i-vectors as features, in a Bayesian way. |
---|
0:04:42 | This already works very well; in my experience it gives better results than joint factor analysis. |
---|
0:04:49 | The second part of my talk will be concerned with how variational Bayes can be used to model non-Gaussian behaviour in the data. |
---|
0:05:03 | I found that this leads to a substantial improvement in performance, and as an added bonus it seems to be possible to do away with the need for score normalisation altogether. |
---|
0:05:22 | The final part of my talk is speculative; it's concerned with the problem of how to integrate the assumptions of joint factor analysis and cosine distance scoring in a coherent framework. |
---|
0:05:40 | On the face of it this looks like a hopeless exercise, because the assumptions appear to be completely different. |
---|
0:05:47 | However, it is possible to do something about this, thanks to the flexibility provided by variational Bayes. So even though this is speculative, I think it is worth talking about, because it's a real object lesson in how powerful these Bayesian methods are, at least potentially. |
---|
0:06:10 | Before getting down to business, let me just say something about the way I've organised this presentation. |
---|
0:06:16 | In preparing the slides I tried to ensure that they were reasonably complete and self-contained; the idea I have in mind is that if anyone is interested in reading through the slides afterwards, they should tell a fairly complete story. |
---|
0:06:31 | But because of time constraints I'm going to have to gloss over some points in the oral presentation, and for the same reason there are going to be some places in the slides where I have to do some hand waving. |
---|
0:06:49 | I found that by focusing on the Gaussian and statistical independence assumptions I could explain the variational Bayes ideas with a minimal amount of technicalities, so I will spend almost half my time on the first part of the talk. |
---|
0:07:11 | On the other hand, the last part of the talk is technical; it is addressed primarily to members of the audience who will have read, say, the chapter on variational Bayes in Bishop's book. |
---|
0:07:30 | okay |
---|
0:07:35 | Okay, so here are the basic assumptions of factor analysis with i-vectors as features. |
---|
0:07:45 | We use D for data, s for speaker, and c for channel or recording; we have a collection of recordings per speaker. |
---|
0:07:56 | We assume that the data can be decomposed into two statistically independent parts, a speaker part and a channel part. These assumptions are questionable, but I'm going to stick with them for the first part of the talk. |
---|
0:08:16 | This model, in which we have replaced the hidden supervector by an observable i-vector, already has a name: it's known in face recognition as probabilistic linear discriminant analysis. |
---|
0:08:36 | To my mind this is the true covariance model, but the other formulation is the one that you will find in the literature. |
---|
0:08:49 | It's not perhaps quite as straightforward as it appears, because if you're dealing with high dimensional features, for example MLLR features, you can't treat these covariance matrices as being of full rank. |
---|
0:09:04 | So you need a hidden variable representation of the model, which is exactly analogous to the hidden variable description of joint factor analysis. |
---|
0:09:19 | So here on the left-hand side, D is an observable i-vector, not a hidden supervector. |
---|
0:09:26 | It turns out to be convenient for the heavy-tailed stuff to refer to the eigenvoice matrix and the eigenchannel matrix using subscripts, U1 and U2, rather than the traditional names. |
---|
0:09:41 | Same thing for the hidden variables: the speaker factors are labelled x1, and the channel factors are labelled x2r, where r indicates the dependence on the recording, or the channel. |
---|
0:09:55 | There's one difference here from the conventional formulation of joint factor analysis: in PLDA the residual term, the epsilon, which in general is modelled by a diagonal covariance or precision matrix, is associated traditionally with the channel rather than with the speaker. |
---|
0:10:20 | In JFA I formulated it slightly differently, but I'm just going to follow this model in this presentation. |
---|
0:10:30 | So because the residual epsilon is associated with the channel, there are two noise terms: the contribution of the eigenchannels to the channel variance, and the contribution of the residual. |
---|
0:10:48 | Lambda is a precision matrix, that is to say the inverse of a covariance matrix, and the two contributions add because you have statistical independence. |
---|
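As an aside for readers of the transcript: the model just described can be written D = m + U1 x1 + U2 x2 + epsilon. Here is a minimal generative sketch of it, assuming NumPy; the dimensions, variable names and values are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k1, k2 = 400, 120, 50                         # illustrative sizes
m = np.zeros(dim)                                  # global mean
U1 = 0.1 * rng.standard_normal((dim, k1))          # eigenvoice matrix (speaker)
U2 = 0.1 * rng.standard_normal((dim, k2))          # eigenchannel matrix (channel)
lam = np.ones(dim)                                 # diagonal precision of the residual

def sample_ivectors(n_recordings):
    """Sample i-vectors for one speaker under the Gaussian PLDA assumptions."""
    x1 = rng.standard_normal(k1)                   # speaker factors: one draw per speaker
    ivecs = []
    for _ in range(n_recordings):
        x2 = rng.standard_normal(k2)               # channel factors: one draw per recording
        eps = rng.normal(0.0, 1.0 / np.sqrt(lam))  # residual with precision lam
        ivecs.append(m + U1 @ x1 + U2 @ x2 + eps)  # D = m + U1 x1 + U2 x2 + eps
    return np.array(ivecs)

ivectors = sample_ivectors(3)                      # three recordings of the same speaker
```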
0:11:04 | This is the graphical model that goes with that equation. If you're not familiar with these, let me just take a minute to explain how to read these diagrams. |
---|
0:11:19 | A shaded node like that indicates an observable variable, the blank nodes indicate hidden variables, the dots indicate model parameters, and the arrows indicate conditional dependencies. |
---|
0:11:40 | So the i-vector is assumed to depend on the speaker factors, the channel factors, and the residual. |
---|
0:11:50 | The plate notation indicates that something is replicated several times: there are several sets of channel factors, one for each recording, but there's only one set of speaker factors, so that's outside of the plate. |
---|
0:12:07 | Here I've specified, say, the parameter lambda, but I didn't bother specifying the distribution of the speaker factors because it's understood to be standard normal. |
---|
0:12:24 | So, as I mentioned, including the channel factors enables this decomposition here, but it's not always necessary: if you have i-vectors of dimension four hundred it's actually possible to model full rank, rather than diagonal, precision matrices, and in that case this term doesn't actually contribute anything. |
---|
0:12:51 | um |
---|
0:12:52 | i have found it useful well |
---|
0:12:54 | in experimental work to use this term |
---|
0:12:56 | to estimate |
---|
0:12:57 | eigenchannels on microphone data |
---|
0:12:59 | so it's useful to people |
---|
0:13:02 | and in fact it turns out that so these channel factors can always be eliminated at recognition time that's a |
---|
0:13:07 | technical point i come back to it later |
---|
0:13:09 | if i |
---|
0:13:15 | Okay, so how do you do speaker recognition with the PLDA model? I'm going to make some provisional assumptions here. One is that you've already succeeded in estimating the model parameters, the eigenvoices, the eigenchannels, et cetera. |
---|
0:13:30 | The other is that you know how to evaluate this thing known as the evidence integral: you have a collection of i-vectors associated with each speaker, you also have a collection of hidden variables, and to evaluate the marginal likelihood you have to integrate over the hidden variables. |
---|
0:13:48 | So assume that we've tackled these two problems. It turns out that the key to solving both problems in general is to evaluate the posterior distribution of the hidden variables. |
---|
0:14:02 | I'll return to that in a minute, but first I just want to show you how to do speaker recognition. |
---|
0:14:10 | Okay, take the simplest case, the core condition in the NIST evaluation: you have one recording which is usually designated as test, another designated as train, and you're interested in the question whether the two speakers are the same or different. |
---|
0:14:30 | If the two speakers are the same (I think it's natural to call that the alternative hypothesis, but there doesn't seem to be universal agreement about that), then the likelihood of the data is calculated on the assumption that there is a common set of speaker factors but different channel factors for the two recordings. |
---|
0:14:58 | On the other hand, if the two speakers are different, then the calculation of the two likelihoods can be done independently, because the speaker factors and the channel factors are untied across the two recordings. |
---|
0:15:11 | So the point is that everything here is an evidence integral: if you can evaluate the evidence integral, you're in business. |
---|
0:15:22 | A few things to note. Unlike traditional likelihood ratios, this is symmetric in D1 and D2. It also has an unusual denominator here; you don't see anything like this in joint factor analysis. |
---|
0:15:42 | This is something that comes out of following the Bayesian party line, and, as we'll see later, it's actually potentially an effective method of score normalisation. |
---|
0:16:01 | The other point I would like to stress is that you can write down the likelihood ratio for any type of speaker recognition problem in the same way. |
---|
0:16:10 | For instance, you might have eight conversations in training and one conversation in test, or three conversations in train and two conversations in test. In all cases it's just a matter of following the rules of probability consistently, and you can write down the likelihood ratio, or Bayes factor as it is usually called in this field. |
---|
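Written out, the Bayes factor for the simplest case takes the following standard form (D1 and D2 are the two collections of observations, h collects the hidden variables, and each term is an evidence integral; the notation here is an editorial addition):

$$ \mathrm{LR}(\mathcal{D}_1,\mathcal{D}_2)=\frac{P(\mathcal{D}_1,\mathcal{D}_2\mid \text{same speaker})}{P(\mathcal{D}_1)\,P(\mathcal{D}_2)},\qquad P(\mathcal{D})=\int P(\mathcal{D}\mid h)\,P(h)\,dh. $$

In the numerator the speaker factors are tied across the two recordings; in the denominator each recording gets its own hidden variables, which is why the ratio is symmetric in D1 and D2.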
0:16:36 | The evidence integral can be evaluated exactly under Gaussian assumptions, but the calculation is rather involved, and if you relax the Gaussian assumptions you can't do it at all. |
---|
0:16:50 | I believe that even in the Gaussian case you're better off using variational Bayes; not everyone agrees with me on this, but I decided to let it stand, and we can go into it later if there's time. |
---|
0:17:07 | The key insight here is this inequality: you can always find a lower bound on the evidence using any distribution over the hidden factors. |
---|
0:17:22 | I grant you it's not obvious just by looking at it, but the derivation turns out to be just a consequence of the fact that Kullback-Leibler divergences are non-negative. |
---|
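For reference, the inequality is the standard evidence lower bound: for any distribution q over the hidden variables h,

$$ \ln P(\mathcal{D}) \;=\; \mathbb{E}_{q}\!\left[\ln\frac{P(\mathcal{D},h)}{q(h)}\right] \;+\; \mathrm{KL}\!\left(q(h)\,\big\|\,P(h\mid\mathcal{D})\right) \;\geq\; \mathbb{E}_{q}\!\left[\ln\frac{P(\mathcal{D},h)}{q(h)}\right], $$

with equality exactly when q is the true posterior, since the Kullback-Leibler divergence is non-negative.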
0:17:37 | What I'll be focusing on is the use of the variational Bayes method to find a principled approximation to the true posterior. |
---|
0:17:56 | Let me just digress a minute to explain why posteriors are the bottleneck. There's nothing mysterious about this posterior distribution: you just apply Bayes' rule and this is what you get. You can read off this term here from the graphical model; this is the prior; this is the evidence. |
---|
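In symbols (a standard statement of Bayes' rule for the hidden variables, added here for reference):

$$ P(h\mid\mathcal{D}) \;=\; \frac{P(\mathcal{D}\mid h)\,P(h)}{P(\mathcal{D})}, \qquad P(\mathcal{D}) \;=\; \int P(\mathcal{D}\mid h)\,P(h)\,dh. $$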
0:18:16 | It's perfectly straightforward; the only problem in practice is that you can't evaluate it exactly. Evaluating the evidence and evaluating the posterior are two sides of the same problem, and you can't do it just by numerical integration because these integrals are in hundreds of dimensions. |
---|
0:18:39 | Another way of stating the difficulty, which I think is a useful way of thinking about it, is that whatever factorisations you have in the prior get destroyed when you multiply by the likelihood. Factorisations in the prior are statistical independence assumptions, and statistical independence assumptions get destroyed in the posterior. |
---|
0:19:01 | It's easy to see why this is the case in terms of the graphical model, but as I said, I'm going to gloss over a few things. |
---|
0:19:14 | To return to variational Bayes: the idea in the variational Bayes approximation is that you acknowledge that independence has been destroyed in the posterior, but you go ahead and impose it on the posterior anyway. |
---|
0:19:33 | You look for what's called a variational approximation of the posterior; it's called variational because it's actually free-form, as in the calculus of variations: you don't impose any restriction on the functional form of q. |
---|
0:19:49 | And there's a standard set of coupled update formulas that you can apply here. They are coupled because this expectation is calculated with the posterior on x2, and this expectation is calculated with the posterior on x1, so you have to iterate between the two. |
---|
0:20:10 | The nice thing is that this iteration comes with EM-like convergence guarantees, and it avoids altogether the need to invert large sparse block matrices, which is the only way you can evaluate the evidence exactly, and then only in the Gaussian case. |
---|
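For readers following along in Bishop, the coupled updates referred to here are the standard mean-field formulas: under the factorisation q(x1, x2) = q(x1) q(x2),

$$ \ln q^{\ast}(x_1) = \mathbb{E}_{q(x_2)}\!\left[\ln P(\mathcal{D},x_1,x_2)\right] + \text{const}, \qquad \ln q^{\ast}(x_2) = \mathbb{E}_{q(x_1)}\!\left[\ln P(\mathcal{D},x_1,x_2)\right] + \text{const}, $$

and iterating between them can only increase the lower bound.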
0:20:35 | This posterior distribution, or rather the variational approximation of the posterior distribution, is also the key to estimating the model parameters: you use the lower bound as a proxy for the likelihood of the evidence, and you seek to optimise the lower bound calculated over a collection of training speakers. |
---|
0:21:01 | Here I've just taken the definition and rewritten it this way. It's convenient to do this because this term here doesn't involve the model parameters at all, so the first approach to the problem would be just to optimise this term here, the contribution of a given speaker to the evidence criterion, by summing it over all speakers. |
---|
0:21:32 | When you work it out, this turns out to be formally identical to probabilistic principal components analysis; it's just a least squares problem. |
---|
0:21:51 | In fact it's the EM auxiliary function for probabilistic principal components analysis; the only difference is that you have to use the variational posterior rather than the exact posterior. |
---|
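To make the "formally identical to probabilistic PCA" remark concrete, here is a minimal sketch of the least-squares update for a factor loading matrix, assuming the variational posterior first and second moments have already been computed; the function and array names are assumptions for the sketch, not notation from the talk.

```python
import numpy as np

def update_loading_matrix(data, post_mean, post_second_moment):
    """PPCA-style M-step: least-squares update of a factor loading matrix.

    data:               list of centred observations d_i, each of shape (dim,)
    post_mean:          list of posterior means E[x_i], each of shape (k,)
    post_second_moment: list of posterior second moments E[x_i x_i^T], each (k, k)
    """
    dim, k = data[0].shape[0], post_mean[0].shape[0]
    A = np.zeros((dim, k))              # accumulates sum_i d_i E[x_i]^T
    B = np.zeros((k, k))                # accumulates sum_i E[x_i x_i^T]
    for d, ex, exx in zip(data, post_mean, post_second_moment):
        A += np.outer(d, ex)
        B += exx
    return A @ np.linalg.inv(B)         # loading matrix maximising the auxiliary function
```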
0:22:07 | There is another way of improving the estimation, which I call minimum divergence estimation. This is liable to give rise to a bit of confusion, so I'll try to explain it briefly. |
---|
0:22:23 | Concentrate on this term here: it's independent of the model parameters. But you can make changes of variable here which minimise the divergence and are constrained in such a way as to preserve the value of the EM auxiliary function. |
---|
0:22:46 | If you minimise these divergences while keeping this thing fixed, you will then increase the value of the evidence criterion. |
---|
0:23:00 | The way this works, say in the case of the speaker factors: to minimise the divergence, you look for an affine transformation of the speaker factors such that the first and second order moments of the speaker factors agree on average, over the speakers in the training set, with the first and second order moments of the prior. |
---|
0:23:27 | That's just a matter of finding an affine transformation that satisfies this condition; you then apply the inverse transformation to update the model parameters in such a way as to keep the value of the EM auxiliary function fixed. |
---|
0:23:46 | And it turns out that if you interleave these two steps you will be able to accelerate the convergence. |
---|
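A minimal sketch of the minimum divergence step for the speaker factors, under the assumptions that the prior is standard normal and that the posterior mean and covariance of x1 are available for every training speaker; the names and shapes are illustrative.

```python
import numpy as np

def minimum_divergence_step(V, post_means, post_covs):
    """Re-standardise the speaker factor posteriors and absorb the change into V.

    post_means: (n_speakers, k) posterior means of the speaker factors
    post_covs:  (n_speakers, k, k) posterior covariances of the speaker factors
    """
    mu = post_means.mean(axis=0)                           # average first moment
    second = (post_covs + np.einsum('ni,nj->nij', post_means, post_means)).mean(axis=0)
    cov = second - np.outer(mu, mu)                        # average central second moment
    T = np.linalg.cholesky(cov)                            # x = T z + mu makes z standard normal on average
    return V @ T                                           # inverse change of variable applied to V
```

In a full implementation the shift V mu would also be absorbed into the global mean; the sketch keeps only the scaling step.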
0:23:59 | Just one comment about this: what I've set out to do here is to produce point estimates of the eigenvoice matrix and the eigenchannel matrix. If you are a really hardcore Bayesian, you don't allow point estimates into your model; you have to do everything in terms of prior probabilities and posterior probabilities. |
---|
0:24:29 | So a true blue Bayesian approach would put a prior on the eigenvoices and calculate the posterior, again by variational Bayes. Even the number of speaker factors could be treated as a hidden random variable, and its posterior distribution could be calculated, again by Bayesian methods. |
---|
0:24:49 | There is an extensive literature on this subject. I'd say that if there's one problem with variational Bayes, it's that it provides too much flexibility: you have to exercise good judgement as to which things you should try and which things are probably not going to help. In other words, don't lose sight of your engineering objective. |
---|
0:25:15 | The particular thing I chose to focus on was the Gaussian assumption. As far as I can see, the Gaussian assumption is just not realistic for the i-vector data that we're dealing with. |
---|
0:25:34 | What I set out to do, using variational Bayes, was to replace the Gaussian assumption, with its exponentially decreasing tails, by a power law distribution, which allows for outliers: exceptional speaker effects or severe channel distortions in the data. |
---|
0:25:57 | This term 'black swan' is amusing. The Romans had a phrase, 'a rare bird, much like a black swan', intended to convey the notion of something impossible or inconceivable; they were in no position to know that black swans actually do exist, in Australia. |
---|
0:26:21 | A financial forecaster by the name of Taleb wrote a polemic a few years ago against the Gaussian distribution, called The Black Swan. It actually appeared just before the crash in two thousand and eight, which of course is the mother of all black swans, and as a result it made quite a big media splash. |
---|
0:26:50 | Okay, it turns out that the textbook definition of the Student's t distribution, the one which I'm going to use in place of the Gaussian distribution, is not the one that is workable with variational Bayes. |
---|
0:27:06 | There is another construction that represents the Student's t distribution as a continuous mixture of normal random variables. It's based on the gamma distribution, a unimodal distribution on the positive reals which has two parameters that enable you to adjust the mean and the variance independently of each other. |
---|
0:27:31 | The way it works is this: in order to sample from a Student's t distribution, you start with a Gaussian distribution with precision matrix lambda; you then scale the covariance matrix by a random scale factor drawn from the gamma distribution, and you sample from the normal distribution with the modified covariance matrix. |
---|
0:28:00 | It's that random scale factor that introduces the heavy-tailed behaviour. |
---|
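In symbols, this is the standard scale-mixture-of-Gaussians construction of the Student's t: with n degrees of freedom, mean mu and precision matrix Lambda,

$$ u \sim \mathrm{Gamma}\!\left(\tfrac{n}{2},\tfrac{n}{2}\right), \qquad x \mid u \sim \mathcal{N}\!\left(\mu,\,(u\Lambda)^{-1}\right) \;\Longrightarrow\; x \sim \mathrm{St}\!\left(\mu,\Lambda,n\right). $$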
0:28:09 | The parameters of the gamma distribution determine the extent to which this thing is heavy-tailed: you have the Gaussian at one extreme, and at the other extreme you have something called the Cauchy distribution, which is so heavy-tailed that the variance is infinite. |
---|
0:28:29 | This term 'degrees of freedom' comes from classical statistics, but it doesn't have any particular meaning in this context. |
---|
0:28:40 | So, for example, suppose you want to make the channel factors heavy-tailed in order to model outlying channel distortions. |
---|
0:28:53 | What you do is this: remember there's one set of channel factors for each recording, so this is inside the plate. You associate a random scale factor with that hidden random variable, and that random scale factor is sampled from a gamma distribution, with a number of degrees of freedom I'll call n2. |
---|
0:29:19 | Heavy-tailed PLDA does this for all of the hidden variables in the Gaussian PLDA model: the speaker factors have an associated random scale factor, the channel factors have an associated random scale factor, and the residual has an associated random scale factor. |
---|
0:29:44 | So in fact all I've added here are just three extra parameters, three extra degrees of freedom, in order to model the heavy-tailed behaviour. |
---|
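A minimal sketch of the heavy-tailed generative process just described: one random scale factor for the speaker factors, and one per recording for the channel factors and for the residual. The degrees-of-freedom values, shapes and the use of NumPy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k1, k2 = 400, 120, 50
m, lam = np.zeros(dim), np.ones(dim)                # global mean, diagonal residual precision
U1 = 0.1 * rng.standard_normal((dim, k1))           # eigenvoices
U2 = 0.1 * rng.standard_normal((dim, k2))           # eigenchannels
n1, n2, n_eps = 5.0, 5.0, 10.0                      # the three extra degrees of freedom

def heavy_tailed_normal(size, dof, precision=1.0):
    """Student's t draw via the gamma scale-mixture construction."""
    u = rng.gamma(dof / 2.0, 2.0 / dof)             # random scale factor with mean 1
    return rng.normal(0.0, 1.0 / np.sqrt(u * precision), size)

def sample_speaker(n_recordings):
    x1 = heavy_tailed_normal(k1, n1)                # heavy-tailed speaker factors (per speaker)
    ivecs = []
    for _ in range(n_recordings):
        x2 = heavy_tailed_normal(k2, n2)            # heavy-tailed channel factors (per recording)
        eps = heavy_tailed_normal(dim, n_eps, lam)  # heavy-tailed residual (per recording)
        ivecs.append(m + U1 @ x1 + U2 @ x2 + eps)
    return np.array(ivecs)
```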
0:29:59 | These are some technical points about how you can carry over variational Bayes from the Gaussian case to the heavy-tailed case and do so in a computationally efficient way; I refer you to the paper for these. |
---|
0:30:18 | The key point that I would like to draw your attention to is that these numbers of degrees of freedom can actually be estimated using the same evidence criterion as the eigenvoices and the eigenchannels. |
---|
0:30:38 | Okay, here are some results. This is a comparison of Gaussian PLDA and heavy-tailed PLDA on several conditions of the NIST two thousand and eight evaluation. |
---|
0:30:55 | This is the equal error rate and the two thousand and eight detection cost function. It's clear that in all three conditions there's a very dramatic reduction in errors, both at the DCF operating point and at the equal error rate. |
---|
0:31:15 | This was done without score normalisation. If you do add score normalisation, what happens is this: you get a uniform improvement in all cases with Gaussian PLDA, and a uniform degradation with the Student's t distribution. So not only does score normalisation not help you, it's a nuisance in the Student's t case. |
---|
0:31:46 | Let me just say a word about score normalisation. It's usually needed in order to set the decision threshold in speaker verification in a trial-dependent way. |
---|
0:32:01 | It's typically very computationally expensive, and it complicates life if you ever have to do cross-gender trials. |
---|
0:32:11 | On the other hand, if you have a good generative model for speech, in other words if you insist on the probabilistic way of thinking, there's no room for score normalisation, and there should be no need for calibration either, but we're not there yet. |
---|
0:32:31 | In practice it's needed because of outlying recordings, which tend to produce exceptionally low scores for all of the trials in which they are involved. |
---|
0:32:43 | What the Student's t distribution appears to be doing is that the extra hidden variables, these scale factors that I introduced, appear to be capable of modelling this outlier behaviour adequately, thus doing away with the need for score normalisation. |
---|
0:33:08 | I should say a word about microphone speech. The situation with telephone speech seems to be quite clear: Gaussian PLDA with score normalisation gives results which are comparable to cosine distance scoring, and you get better results with heavy-tailed PLDA, at least on the two thousand and eight data; in general they're about twenty-five percent better than traditional joint factor analysis. |
---|
0:33:36 | But it turns out to break down, in an interesting way, on microphone speech. |
---|
0:33:47 | Najim yesterday described an i-vector extractor of dimension six hundred which could be used for recognition on both microphone and telephone speech. |
---|
0:33:59 | So we started out by training a model using only telephone speech for the speaker factors, with the residual modelled by a full precision matrix; we then augmented that with eigenchannels, and everything was treated in the heavy-tailed way. |
---|
0:34:17 | What turned out, unfortunately, is that we ran straight into the Cauchy distribution for the microphone transducer effects. What that means is that the variance of the channel effects, of the microphone effects, is infinite. |
---|
0:34:39 | It's a short step to realise that if you have infinite variance for channel effects, you're not going to be able to do speaker recognition. |
---|
0:34:46 | I haven't been able to fix this. At present the best strategy would seem to be to project away the troublesome dimensions using some type of LDA, and that's the kind of strategy which I believe we'll be hearing about in the next presentation. |
---|
0:35:10 | Okay, now I come to the third part of my talk, which concerns the question of how it would be possible to integrate joint factor analysis, or PLDA, and cosine distance scoring, or something resembling it, in a coherent probabilistic framework. |
---|
0:35:36 | If you haven't seen these types of scatter plots, they're very interesting. Each colour here represents a speaker, and each point represents an utterance by the speaker. |
---|
0:35:56 | This is a plot of supervectors projected onto what are essentially the first two i-vector components. |
---|
0:36:07 | You can see what's going on here; this is the real motivation for cosine distance scoring. Cosine distance scoring ignores the magnitude of the vectors and uses only the angle between them as the similarity measure. |
---|
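For completeness, the score in question is just the cosine of the angle between two i-vectors; a one-function sketch (NumPy assumed):

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine distance score: uses only the angle between the i-vectors, not their magnitudes."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```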
0:36:27 | And this is completely inconsistent with the assumptions of joint factor analysis, because there seems to be, for each speaker, a principal axis of variability that passes through the speaker's mean: the session variability for a speaker is augmented in a particular direction, the direction of the speaker's mean vector. |
---|
0:36:48 | Whereas JFA, or PLDA, assumes that you can model session variability for all speakers in the same way; that's the statistical independence assumption in JFA. |
---|
0:37:11 | A note of caution is necessary here: in interpreting these plots you have to be careful that they're not an artifact of the way you estimate supervectors and so on. We do find these plots with i-vectors, but we had to cherry-pick the results in order to get nice pictures like the one I showed you. |
---|
0:37:34 | But the principal evidence for this type of behaviour, which I call directional scattering, is the effectiveness of the cosine distance measure in speaker recognition. |
---|
0:37:51 | I don't know how to account for it, and I'm not concerned with that question; the only question I would like to answer is how to model this type of behaviour probabilistically. |
---|
0:38:05 | As I said, this part is going to get a bit technical; it's addressed to people who have read the chapter on variational Bayes in Bishop's book. |
---|
0:38:18 | In order to get a handle on this problem there seems to be a natural strategy: instead of representing each speaker by a single point, x1, in the speaker factor space, represent each speaker by a distribution, specified by a mean vector mu and a precision matrix lambda. |
---|
0:38:42 | The i-vectors are then generated by sampling speaker factors from this distribution. I put 'speaker factors' in inverted commas because the speaker factors vary from one recording to another, just as the channel factors do, but the mechanism by which they are generated is quite different, as we'll see in a moment. |
---|
0:39:04 | The trick is to choose the prior on the mean and precision matrix of each speaker in such a way that mu and lambda are not statistically independent, because what you want is a precision matrix for each speaker which varies with the location of the speaker's mean vector. |
---|
0:39:28 | And of course, once you set this up, you're immediately going to run into problems: you do not want to be doing point estimation of the precision matrix if you only have one or two observations of the speaker. You have to follow the rules of probability consistently and integrate over the prior, and the way to do that, of course, is with variational Bayes. |
---|
0:39:56 | Okay, so here is how it goes. There seems to be only one natural prior on precision matrices, namely the Wishart prior. |
---|
0:40:08 | I won't talk about this; I've just put it down there so that if you're interested you'll be able to recognise that it's a generalisation of the gamma distribution: if you take the dimension equal to one, it reduces to the gamma distribution, and in higher dimensions it's concentrated on positive definite matrices. |
---|
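For reference, the Wishart density over a d-by-d positive definite precision matrix Lambda, with scale matrix W and n degrees of freedom, is, up to its normalising constant,

$$ \mathcal{W}(\Lambda \mid W, n) \;\propto\; |\Lambda|^{(n-d-1)/2}\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left(W^{-1}\Lambda\right)\right), $$

and for d = 1 it reduces to a gamma density.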
0:40:30 | um |
---|
0:40:32 | there is a parameter call the the number of degrees of freedom again |
---|
0:40:35 | okay that |
---|
0:40:36 | so determines how P |
---|
0:40:38 | uh this uh distribution is |
---|
0:40:41 | uh also |
---|
0:40:42 | this point i think is worth mentioning there's no loss of generality in assuming that W |
---|
0:40:47 | which would matrix here |
---|
0:40:48 | is it good to be identity |
---|
0:40:51 | The reason this is worth mentioning is that this turns out to correspond exactly to something that Najim does in his processing. If you're familiar with his work, you know that he estimates a WCCN matrix in the speaker space and then whitens the data with that matrix before evaluating the cosine distance. |
---|
0:41:24 | Okay, the first step, then: we have generated the precision matrix for the speaker; the next step is to generate the mean vector for the speaker, and you do that using a Student's t distribution. |
---|
0:41:39 | Once you have the precision matrix, that's all you need: if you just add in the gamma distribution, you can sample the mean vector according to a Student's t distribution. I explain in the paper why you need to use the Student's t distribution. |
---|
0:41:59 | The point I would just like to draw your attention to at this stage is that because the distribution of mu depends on lambda, the conditional distribution of lambda depends on mu. |
---|
0:42:14 | So that means that the precision matrix for a speaker depends on the location of the speaker in the speaker factor space, which means that you have some hope of modelling this directional scatter. |
---|
0:42:35 | I'll skip that and go to the graphical model. |
---|
0:42:42 | I think it's clear from this; remember, when you're confronted with something like this, that everything inside the plate is replicated for each of the recordings of the speaker, and everything outside of the plate is done once per speaker. |
---|
0:42:58 | So the first step is to generate the precision matrix. You then generate the mean for the speaker by sampling from a Student's t distribution; I call the hidden scale factor w, and the parameters of the gamma distribution alpha and beta. |
---|
0:43:16 | Once you have the mean and the precision matrix, you generate the speaker factors for each recording of the speaker (remember we're making the speaker factors depend on the recording) by sampling from another Student's t distribution. |
---|
0:43:34 | The interesting thing is that these three parameters, alpha, beta and the number of degrees of freedom, determine whether or not this distribution is going to exhibit directional scatter. |
---|
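A minimal sketch of the generative process described by this graphical model, using the same scale-mixture constructions as before. The dimension, the hyperparameter values and the use of SciPy's Wishart sampler are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
k = 120                                              # speaker factor dimension (illustrative)
n_w, alpha, beta, n_x = k + 2, 3.0, 3.0, 5.0         # illustrative hyperparameters

def sample_speaker_factors(n_recordings):
    # 1. precision matrix for the speaker, drawn once per speaker (Wishart, scale matrix = identity)
    Lam = wishart.rvs(df=n_w, scale=np.eye(k) / n_w)
    cov = np.linalg.inv(Lam)
    # 2. speaker mean: Student's t via a hidden scale factor w ~ Gamma(alpha, beta)
    w = rng.gamma(alpha, 1.0 / beta)
    mu = np.linalg.cholesky(cov / w) @ rng.standard_normal(k)
    # 3. speaker factors, one set per recording, again Student's t around mu
    factors = []
    for _ in range(n_recordings):
        u = rng.gamma(n_x / 2.0, 2.0 / n_x)
        factors.append(mu + np.linalg.cholesky(cov / u) @ rng.standard_normal(k))
    return np.array(factors)
```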
0:43:51 | Okay, sorry, this can't be explained without a little calculation. Remember lambda is the precision matrix, so lambda inverse is the covariance matrix, and what I'm comparing here is the distribution of the covariance matrix given the speaker dependent parameters and the prior distribution of the covariance. |
---|
0:44:16 | You see that what you have is a weighted average of the prior expectation and another term. Now, this second term here depends on the speaker's mean; it's a rank-one covariance matrix, so the only variability that it allows is in the direction of the mean vector, which is exactly what the doctor ordered for directional scatter. |
---|
0:44:48 | I'd draw your attention to the fact that this term here is multiplied by this, so it depends on the number of degrees of freedom and on this random scale factor. So the extent of the directional scattering is going to depend on the behaviour of this scale factor. |
---|
0:45:17 | It depends, in fact, on the parameters which govern the distribution of the random scale factor w. If w has a large mean and a small variance, you can say that this term boosts the variability in the direction of the mean vector, so in that case directional scatter would be present to a large extent for most speakers in the data. |
---|
0:45:50 | On the other hand, there's another limiting case where you can show that the model reduces to heavy-tailed PLDA and there's no directional scattering at all. |
---|
0:46:00 | So the key question would be to see how this model trains; to be frank, that is going to take a couple of months, so I don't have any results to report yet. |
---|
0:46:13 | Okay, so in conclusion. Gaussian PLDA is an effective model for speaker recognition, and it's just joint factor analysis with i-vectors as features. My experience has been that it works better than traditional joint factor analysis, even though the basic assumptions are open to question. |
---|
0:46:36 | okay |
---|
0:46:37 | variational bayes |
---|
0:46:39 | allows you to go a long way |
---|
0:46:41 | in relaxing these assumptions you can model outliers by adding these |
---|
0:46:45 | hidden |
---|
0:46:46 | variables |
---|
0:46:47 | you can model directional scattering by having |
---|
0:46:50 | these variables |
---|
0:46:54 | The derivation of the variational Bayes update formulas is mechanical. I'm not saying it's always easy, but it is mechanical, and it comes with EM-like convergence guarantees, so that you have some hope of debugging your implementation. |
---|
0:47:15 | One caveat is that in practice you have to stay inside the exponential family in order to make this work; I can come back to that later. |
---|
0:47:23 | I'm also personally of the opinion that in order to get the full benefit of these methods we need what are called informative priors, that is to say, prior distributions on the hidden variables whose parameters can be learned (I use that word because 'estimated' isn't really appropriate here) from large training sets. |
---|
0:47:48 | The example is that all of the hidden variables that I've just described are controlled by a handful of scalar degrees of freedom, and these can all be estimated, using the evidence criterion, from training data. |
---|
0:48:09 | Now, to sum up: the advantage of probabilistic methods is that you have a logically coherent way of reasoning in the face of uncertainty; the disadvantage is that it takes time and effort to master the techniques and to program them. |
---|
0:48:29 | if your principal concern is to get a good system up and running quickly, i would recommend something like cosine distance scoring
---|
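A minimal sketch of cosine distance scoring between two i-vectors; in practice channel compensation (for example LDA or WCCN) is normally applied before the length normalisation, but that step is omitted here, and the 400-dimensional toy vectors are purely illustrative:

```python
import numpy as np

def cosine_score(enroll_ivector, test_ivector):
    """Cosine distance score: the inner product of the two length-normalised i-vectors."""
    e = enroll_ivector / np.linalg.norm(enroll_ivector)
    t = test_ivector / np.linalg.norm(test_ivector)
    return float(e @ t)

# toy usage: accept the trial if the score exceeds a threshold set on development data
rng = np.random.default_rng(1)
enroll, test = rng.standard_normal(400), rng.standard_normal(400)
print(cosine_score(enroll, test) > 0.1)
```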
0:48:42 | on the other hand, if you're interested in mastering this family of methods, i think there are really only three things you need to look at
---|
0:48:51 | okay, there's the original paper by prince and elder on probabilistic linear discriminant analysis in face recognition; that's the gaussian case
---|
0:49:04 | everything you need to know about variational bayes is in bishop's book, which i highly recommend; it's very well written and it starts from first principles
---|
0:49:15 | and the third is this paper; i don't believe the paper has actually found its way into the proceedings, but it is available online
---|
0:49:28 | okay, thank you very much
---|
0:49:43 | right, this is the question period
---|
0:49:54 | no |
---|
0:49:56 | yeah |
---|
0:49:57 | but it |
---|
0:50:01 | no |
---|
0:50:02 | and of course, thanks for the presentation, which was very illuminating
---|
0:50:07 | you encouraged us, as you said: if you want a quick solution you can do it that way
---|
0:50:15 | if you want a more principled solution |
---|
0:50:18 | but the point i just want to note is that your algorithm is based on a point estimate
---|
0:50:28 | so you have a speech utterance, you use your factor analysis to summarise it as an i-vector, you completely ignore the uncertainty of that estimation process, and then from that point on you say we should keep track of the uncertainty
---|
0:50:44 | so how do you justify that? it's an entirely empirical decision
---|
0:50:48 | based on the effectiveness of cosine distance scoring
---|
0:50:53 | it just works really well; attempts, so far at least, to incorporate the uncertainty in the i-vector estimation procedure don't seem to help, they complicate life
---|
0:51:08 | it's really empirical rather than dictated by the model
---|
0:51:26 | one question regarding the results you presented: one category was the conversation sides and one was the ten-second data
---|
0:51:37 | when you were training your i-vector setup, and when you did the scoring, did you use the ten-second data?
---|
0:51:56 | well the best results were obtained without score normalisation |
---|
0:52:00 | okay, so there was no question of introducing a cohort; perhaps your question is, in the gaussian case, should we have used that?
---|
0:52:10 | oh no |
---|
0:52:10 | a what you need |
---|
0:52:11 | yeah |
---|
0:52:12 | to me |
---|
0:52:13 | distribution |
---|
0:52:14 | i |
---|
0:52:15 | right so |
---|
0:52:16 | you see i |
---|
0:52:17 | yeah |
---|
0:52:19 | when you open |
---|
0:52:20 | we estimate |
---|
0:52:22 | you do |
---|
0:52:23 | these |
---|
0:52:23 | particular i picked |
---|
0:52:24 | right |
---|
0:52:25 | maybe |
---|
0:52:25 | oh |
---|
0:52:27 | and second |
---|
0:52:30 | but my experience has been, and this is not entirely black and white, that it's better not to use the ten-second data
---|
0:52:37 | right |
---|
0:52:39 | uh |
---|
0:52:40 | in that case, an interesting aspect of i-vectors is that they perform very well on the ten-second test segments
---|
0:52:52 | okay |
---|
0:52:53 | in other words the estimation procedure for i-vectors is much less sensitive to short durations than relevance map is
---|
0:53:11 | i have one question about the impact of the assumptions: you assume that some of the latent variables somehow exhibit gaussian behaviour
---|
0:53:25 | is there a way, i mean a nonparametric way, to relax these assumptions?
---|
0:53:32 | so i think i was careful to use student's t distributions everywhere rather than gaussians, and it's that which gives me the flexibility to model outliers and directional scattering
---|
0:53:44 | does that answer your question?
---|
0:53:46 | yeah, you essentially used it to model some heavy tails, but is a parametric form required at all?
---|
0:53:55 | variational bayes does require that, and in fact there's an extra restriction: you have to stay inside the exponential family, unfortunately
---|
0:54:07 | the art consists in achieving what you want to do subject to those constraints
---|
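To make the exponential-family remark concrete: with conjugate choices the variational posterior of a hidden variable keeps the same functional form as its prior, so the updates stay in closed form. A minimal sketch under that assumption, for a hidden precision scale factor with a Gamma prior under a Gaussian likelihood; all symbols and values are illustrative:

```python
import numpy as np

def gamma_scale_update(x, mean, precision, a0, b0):
    """Closed-form posterior for a hidden precision scale u with Gamma(a0, b0) prior.

    Assuming x | u ~ N(mean, (u * precision)^-1) for a d-dimensional x, conjugacy
    gives u | x ~ Gamma(a0 + d/2, b0 + q/2), where q is the quadratic form
    (x - mean)' precision (x - mean); because the posterior stays in the same
    exponential family as the prior, the expectations needed by the next
    variational update are available in closed form.
    """
    diff = np.asarray(x) - np.asarray(mean)
    q = float(diff @ precision @ diff)
    a = a0 + 0.5 * len(diff)
    b = b0 + 0.5 * q
    return a, b, a / b   # posterior shape, rate, and posterior mean of u

# toy usage with a 3-dimensional observation
print(gamma_scale_update([2.0, -1.0, 0.5], np.zeros(3), np.eye(3), a0=2.0, b0=2.0))
```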
0:54:15 | is that an adequate response?
---|
0:54:18 | yeah |
---|
0:54:34 | about the degrees of freedom: how did you set them, was any manual tuning involved?
---|
0:54:48 | well, in fact we used the evidence criterion, which is exactly the same criterion for estimating the numbers of degrees of freedom as we did for estimating the eigenvoices and the eigenchannels
---|
0:55:02 | so it's completely consistent; there was no manual tuning
---|
0:55:07 | thank you |
---|
0:55:21 | so |
---|
0:55:22 | there was a question |
---|
0:55:23 | let me think |
---|
0:55:24 | but okay |
---|
0:55:32 | because |
---|