0:00:15 | the next presentation is on factor analysis of acoustic features using a mixture of probabilistic |
---|
0:00:20 | principal component analyzers |
---|
0:00:22 | for robust speaker verification |
---|
0:00:48 | and that is |
---|
0:00:53 | factor analysis of acoustic features using a mixture of probabilistic |
---|
0:00:57 | principal component analyzers |
---|
0:00:59 | for robust speaker verification |
---|
0:01:05 | so in the introduction what i want to say is |
---|
0:01:09 | so factor analysis is a very popular technique when applied to gmm supervectors |
---|
0:01:14 | and the main assumption there is |
---|
0:01:17 | that for a randomly chosen speaker the gmm supervector lies in a low-dimensional subspace |
---|
0:01:24 | but actually it's kind of known that the acoustic features also lie in a low |
---|
0:01:30 | dimensional subspace |
---|
0:01:32 | and this phenomenon is not really |
---|
0:01:35 | taken into consideration in gmm supervector based factor analysis |
---|
0:01:40 | so we propose to try to see |
---|
0:01:44 | what happens if we do factor analysis on the acoustic features |
---|
0:01:48 | in addition to the i-vector based approach |
---|
0:01:53 | so just to say more about the motivation |
---|
0:01:57 | we know that speech spectral components are highly correlated, so in the |
---|
0:02:03 | mfcc features |
---|
0:02:06 | we have the pca or the dct to decorrelate these |
---|
0:02:10 | there has been a lot of work on trying to decorrelate features |
---|
0:02:14 | it has been shown that the first few eigen directions of the feature covariance matrix |
---|
0:02:19 | are more speaker-dependent |
---|
0:02:22 | so by maximizing |
---|
0:02:25 | back into the |
---|
0:02:26 | so what we believe is that retaining all the |
---|
0:02:33 | eigen directions |
---|
0:02:34 | of the features might actually be harmful, there might be some |
---|
0:02:38 | directions that are not beneficial |
---|
0:02:41 | we also get evidence from the full covariance based i-vector system |
---|
0:02:45 | which often works better than the diagonal covariance system |
---|
0:02:49 | and that motivates us to investigate this further |
---|
0:02:54 | so if you look at a full covariance matrix |
---|
0:02:58 | the covariance matrix of a full covariance ubm this is how it kind of looks |
---|
0:03:03 | and if you look at the eigenvalue distribution, you see most of the energy is compacted |
---|
0:03:09 | in the first |
---|
0:03:10 | thirty two eigenvalues or so in this case |
---|
0:03:12 | so they're pretty much compact |
---|
0:03:14 | so i kind of thought okay, there might be |
---|
0:03:19 | reason to believe that there are some components in there |
---|
0:03:26 | which are not really useful |
---|
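As a rough illustration of the eigenvalue check described above, here is a minimal numpy sketch; the covariance matrix below is a random stand-in, whereas in the talk it would be one component of the full covariance ubm, and all names are illustrative rather than taken from the authors' code.

```python
import numpy as np

# Stand-in for one 60 x 60 covariance matrix of a full covariance UBM component.
D = 60
rng = np.random.default_rng(0)
A = rng.standard_normal((D, D))
cov = A @ A.T / D

# Sorted eigenvalues of the covariance matrix (largest first).
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Cumulative fraction of the total variance ("energy") captured by the
# leading eigen directions, e.g. the first 32 mentioned in the talk.
energy = np.cumsum(eigvals) / eigvals.sum()
print(f"energy in the first 32 eigen directions: {energy[31]:.3f}")
```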
0:03:28 | so we use factor analysis |
---|
0:03:33 | on acoustic features so this is the basic formulation very simple |
---|
0:03:37 | so you have a feature vector x, this is the factor loading matrix |
---|
0:03:42 | y is the acoustic factors, which are basically the |
---|
0:03:45 | hidden variables |
---|
0:03:47 | mu is the mean vector and |
---|
0:03:49 | epsilon is the isotropic noise |
---|
0:03:51 | so this is basically a ppca |
---|
0:03:54 | and the interpretation is that the covariance |
---|
0:04:00 | of the acoustic features is now modeled by the hidden variables |
---|
0:04:03 | and the residual variance is modeled by the noise model |
---|
0:04:09 | so this is the pdf of the model |
---|
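As a minimal sketch of this ppca formulation (not the authors' code; W, mu and sigma2 are just the notation used here for the factor loading matrix, the mean vector and the isotropic noise variance):

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_logpdf(x, W, mu, sigma2):
    """Marginal log density of a feature vector x under the ppca model
    x = W y + mu + eps, with y ~ N(0, I) and eps ~ N(0, sigma2 * I),
    so that marginally x ~ N(mu, W W^T + sigma2 * I)."""
    D = mu.shape[0]
    cov = W @ W.T + sigma2 * np.eye(D)
    return multivariate_normal.logpdf(x, mean=mu, cov=cov)
```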
0:04:13 | and so what we try to do here is we want to replace the acoustic |
---|
0:04:17 | features by the acoustic factors, basically the estimates of the acoustic factors |
---|
0:04:23 | and try to use them as the features |
---|
0:04:26 | believing that these acoustic factors |
---|
0:04:29 | have more speaker-dependent information and the full feature vector might have some nuisance components |
---|
0:04:35 | so a transformation matrix is derived |
---|
0:04:38 | so it's also coming from the tipping and bishop paper, you can see first you have |
---|
0:04:42 | to select the number of |
---|
0:04:44 | coefficients you want to keep |
---|
0:04:46 | suppose we have sixty features and i want to keep |
---|
0:04:49 | forty |
---|
0:04:50 | so Q would be equal to forty |
---|
0:04:52 | and the noise variance estimation is done by this, so that's the average of the remaining components of the |
---|
0:05:02 | sorted eigenvalues |
---|
0:05:04 | so this is the i-th eigenvalue of the covariance matrix of x |
---|
0:05:07 | and this is the maximum likelihood estimate of the factor loading matrix |
---|
0:05:12 | and it's also from the tipping and bishop paper |
---|
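A minimal sketch of these closed-form estimates in the spirit of Tipping and Bishop's solution, assuming numpy; the rotation matrix in the general solution is taken as identity, and the names are this sketch's own rather than the talk's notation.

```python
import numpy as np

def ppca_ml_estimates(cov, Q):
    """Closed-form ppca estimates from a D x D feature covariance matrix:
    sigma2 is the average of the discarded (smallest D - Q) eigenvalues,
    and W spans the leading Q eigen directions, scaled by sqrt(lambda_i - sigma2)."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    sigma2 = eigvals[Q:].mean()                  # noise variance estimate
    W = eigvecs[:, :Q] * np.sqrt(np.maximum(eigvals[:Q] - sigma2, 0.0))
    return W, sigma2
```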
0:05:17 | so this is how we estimate the acoustic factors which is basically |
---|
0:05:23 | the expected value, or the posterior mean, of the acoustic factors |
---|
0:05:27 | and it can be shown to be given by the |
---|
0:05:30 | expression here, so it's basically removal of the mean and then transformation by this matrix |
---|
0:05:36 | which is given by this |
---|
0:05:38 | and so it's just a linear transformation |
---|
0:05:42 | and this is the transformed feature vector, as we would like to |
---|
0:05:47 | call it |
---|
0:05:47 | and if you look at the mean and covariance matrix of this quantity, it's a |
---|
0:05:51 | zero-mean gaussian with |
---|
0:05:54 | a diagonal covariance matrix given by this |
---|
0:06:00 | the details are |
---|
0:06:01 | in the paper |
---|
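A small sketch of this estimation step, reusing the W, mu, sigma2 notation from the sketches above (again only an illustration, not the authors' implementation):

```python
import numpy as np

def acoustic_factors(x, W, mu, sigma2):
    """Posterior mean of the acoustic factors under ppca:
    E[y | x] = (W^T W + sigma2 I)^{-1} W^T (x - mu),
    i.e. mean removal followed by a fixed linear transformation."""
    Q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(Q)
    return np.linalg.solve(M, W.T @ (x - mu))
```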
0:06:05 | so then we do a mixture of these models, which is basically a mixture |
---|
0:06:09 | of ppca |
---|
0:06:11 | so it's basically like a gaussian mixture model, the same thing |
---|
0:06:16 | but the good thing about this is you can |
---|
0:06:18 | directly compute the parameters, the |
---|
0:06:22 | fa parameters |
---|
0:06:23 | from the full covariance ubm |
---|
0:06:25 | and that becomes really handy, you see |
---|
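A sketch of how the per-mixture fa parameters could be read off a full covariance ubm, reusing ppca_ml_estimates from the earlier sketch; the array names and the value of Q are illustrative assumptions, not the authors' code.

```python
def mixture_ppca_from_ubm(ubm_means, ubm_covariances, Q):
    """Derive one set of ppca parameters per UBM mixture directly from its
    full covariance matrix, with no additional EM training: the mean is kept
    and (W, sigma2) come from the closed-form estimates, mixture by mixture."""
    params = []
    for mu, cov in zip(ubm_means, ubm_covariances):
        W, sigma2 = ppca_ml_estimates(cov, Q)   # closed-form, per mixture
        params.append((mu, W, sigma2))
    return params
```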
0:06:30 | next i'd like to talk about how we want to use the |
---|
0:06:34 | the transformation, so you have a thousand and twenty four mixtures and each mixture has |
---|
0:06:38 | a transformation, so what you could do is you take a feature vector and |
---|
0:06:42 | you find the most likely mixture and you transform the feature and then |
---|
0:06:47 | you know replace the original vector right |
---|
0:06:49 | but what we saw is |
---|
0:06:52 | actually it's kind of not the optimal way of doing it because |
---|
0:06:57 | so if you find the top scoring mixture posteriors of, say, your development data across all |
---|
0:07:02 | the frames |
---|
0:07:02 | so this is kind of the distribution |
---|
0:07:05 | so what this tells you is |
---|
0:07:07 | it's very rare that the acoustic feature is unquestionably aligned |
---|
0:07:11 | to one mixture, most of the time the top posterior you get is like zero |
---|
0:07:16 | point four or point five |
---|
0:07:17 | so what that kind of means is |
---|
0:07:20 | you can't really say that this feature vector comes from this mixture, it kind of |
---|
0:07:24 | belongs to a lot of mixtures |
---|
0:07:26 | maybe more than one, so what we want to do is keep all |
---|
0:07:30 | the transformations |
---|
0:07:32 | that are done by all of the mixtures |
---|
0:07:35 | so this is how we do it |
---|
0:07:36 | basically |
---|
0:07:38 | integrating the process within the total variability model |
---|
0:07:43 | so with the i-vector system |
---|
0:07:45 | so first we train the full covariance ubm |
---|
0:07:48 | and then we compute the parameters, like we set the value of Q to, let's just say |
---|
0:07:53 | fifty |
---|
0:07:54 | for example |
---|
0:07:55 | then we find the noise variance, these are all different for each mixture |
---|
0:07:59 | for each mixture you find a |
---|
0:08:01 | factor loading matrix and the transformation |
---|
0:08:03 | so how it is applied is basically |
---|
0:08:06 | directly on the first order statistics, you don't actually have to do it frame by frame, so |
---|
0:08:13 | you compute the statistics and you can just take a transformation of that summation |
---|
0:08:17 | so it becomes very simple, you just transform the first order statistics |
---|
0:08:22 | and actually now the transformation is completely integrated within the system |
---|
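A rough sketch of this integration step under the notation of the earlier sketches: gammas are the per-frame mixture posteriors from the full covariance ubm, X is the T x D matrix of frames for one utterance, and params is the per-mixture (mu, W, sigma2) list; all names are illustrative.

```python
import numpy as np

def transformed_statistics(X, gammas, params):
    """Zero and first order statistics with soft (probabilistic) alignment,
    where each mixture's ppca transformation is applied to its own first
    order statistic, rather than hard-assigning every frame to one mixture."""
    stats = []
    for c, (mu, W, sigma2) in enumerate(params):
        g = gammas[:, c]                           # posterior of mixture c per frame
        n_c = g.sum()                              # zero order statistic
        f_c = (g[:, None] * (X - mu)).sum(axis=0)  # centered first order statistic
        Q = W.shape[1]
        M = W.T @ W + sigma2 * np.eye(Q)
        f_tilde = np.linalg.solve(M, W.T @ f_c)    # transformed first order stat
        stats.append((n_c, f_tilde))
    return stats
```

Because the transformation is linear, transforming the accumulated first order statistic gives the same result as transforming every frame and then accumulating, which is why the frame-by-frame step can be skipped.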
0:08:29 | so these are the differences with the conventional t-matrix training |
---|
0:08:34 | so the feature size becomes Q instead of D |
---|
0:08:36 | the supervector size becomes C times Q |
---|
0:08:39 | and the tv matrix size becomes smaller |
---|
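(As a worked example using the numbers mentioned in this talk: with a 1024-mixture ubm, original feature size D = 60 and Q = 48, the supervector dimension drops from 1024 x 60 = 61440 to 1024 x 48 = 49152, and the total variability matrix shrinks accordingly.)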
0:08:41 | and most importantly the ubm gets replaced by the distribution of the transformed features so |
---|
0:08:48 | since we are not using the original features in the subsequent processing, we will use this instead, |
---|
0:08:53 | this is not really the ubm, this is basically how the ubm parameters get replaced |
---|
0:08:59 | and the i-vector extraction |
---|
0:09:01 | procedure is similar |
---|
0:09:05 | in our system we have a phone recognizer based speech activity detection, sixty-dimensional |
---|
0:09:10 | features with |
---|
0:09:12 | cepstral mean normalization |
---|
0:09:13 | we have a gender dependent ubm with a thousand and twenty four mixtures |
---|
0:09:18 | we train the full covariance ubm with |
---|
0:09:22 | variance flooring, it's an investigated parameter that restricts the |
---|
0:09:27 | minimum value of the covariance matrix to |
---|
0:09:30 | a fixed value |
---|
0:09:32 | and the i-vector size was four hundred |
---|
0:09:36 | and we used five iterations |
---|
0:09:38 | so we have the plda backend where we have a full covariance noise model |
---|
0:09:45 | and the only free parameter is the eigenvoice size |
---|
0:09:49 | next we have the fa system which i just talked about, we derive all |
---|
0:09:53 | the parameters from the ubm directly |
---|
0:09:55 | and we performed experiments on sre twenty ten basically |
---|
0:10:00 | conditions one to five, and we use the male trials |
---|
0:10:05 | so these are the initial results, as we can see |
---|
0:10:08 | we varied the |
---|
0:10:10 | plda eigenvoice size, from fifteen |
---|
0:10:14 | then we used Q equals fifty four, forty eight and forty two |
---|
0:10:18 | our feature size is sixty, so you can see we are |
---|
0:10:21 | taking off six components and so on |
---|
0:10:25 | so overall we can get a nice improvement using the proposed technique |
---|
0:10:31 | so here's |
---|
0:10:33 | a table showing you some of the systems |
---|
0:10:36 | that we fused |
---|
0:10:38 | so the baseline is sitting here |
---|
0:10:39 | and we are getting nice improvements in all three metrics for a couple of the tested Q values |
---|
0:10:45 | it's kind of hard to say which Q value would work best |
---|
0:10:47 | that's a challenge |
---|
0:10:50 | and also |
---|
0:10:52 | this Q value kind of |
---|
0:10:53 | may not be globally optimal, it can have a different value in each mixture depending on |
---|
0:10:58 | how the covariance structure is in that mixture |
---|
0:11:02 | i also did some work on that and |
---|
0:11:05 | you will probably |
---|
0:11:06 | see it at interspeech |
---|
0:11:10 | so anyway |
---|
0:11:12 | when we fuse the systems, it's a late fusion, and we can see still we |
---|
0:11:17 | can get a pretty nice improvement |
---|
0:11:20 | by fusing |
---|
0:11:21 | and different combinations |
---|
0:11:23 | so these systems do have complementary information |
---|
0:11:27 | so these are actually extra experiments that we performed after |
---|
0:11:31 | this paper was submitted, so they are shown here |
---|
0:11:33 | so in the other conditions, in condition one |
---|
0:11:37 | maybe Q equals forty eight works nicely, in condition two Q equals forty two works |
---|
0:11:43 | well |
---|
0:11:44 | in condition |
---|
0:11:45 | three |
---|
0:11:47 | Q equals forty eight and fifty four |
---|
0:11:50 | but in condition four we have |
---|
0:11:54 | maybe the dcf |
---|
0:11:56 | the new dcf didn't improve |
---|
0:11:59 | for a few of the conditions |
---|
0:12:02 | but you can see clearly that |
---|
0:12:04 | the proposed |
---|
0:12:06 | technique works well, it reduces |
---|
0:12:09 | all three |
---|
0:12:10 | performance metrics in these cases |
---|
0:12:12 | and after fusion you can actually see a nice |
---|
0:12:15 | improvement in all three of the metrics |
---|
0:12:22 | so here is the det curve, it's on conditions one to five |
---|
0:12:28 | and we just picked the Q equals forty two system |
---|
0:12:32 | and you can see it's |
---|
0:12:34 | almost always the case that |
---|
0:12:36 | the fa system is |
---|
0:12:37 | better than the baseline |
---|
0:12:39 | and with fusion we get |
---|
0:12:41 | further gains |
---|
0:12:45 | so |
---|
0:12:47 | we have proposed a factor analysis framework for acoustic features, a mixture-dependent feature transformation |
---|
0:12:56 | and a compact representation as well |
---|
0:12:59 | and we proposed a probabilistic feature alignment method |
---|
0:13:04 | instead of hard-clustering a feature vector to a mixture |
---|
0:13:08 | and so we showed that |
---|
0:13:10 | it provides better performance |
---|
0:13:12 | when we integrate it with the i-vector system |
---|
0:13:15 | and as a kind of |
---|
0:13:18 | nice side effect it kind of makes the system faster because |
---|
0:13:22 | you know, you're reducing the feature vector dimensionality, which actually in turn reduces the super |
---|
0:13:27 | vector size and the tv matrix size |
---|
0:13:29 | and as |
---|
0:13:30 | is discussed in this paper |
---|
0:13:34 | the computational complexity is proportional to the |
---|
0:13:37 | supervector size |
---|
0:13:39 | so in future work |
---|
0:13:41 | the value of Q does not have to be global |
---|
0:13:45 | it can be mixture dependent basically, so |
---|
0:13:47 | here we kept a common feature dimension, like say |
---|
0:13:51 | forty eight, for all the mixtures |
---|
0:13:53 | but it can be different, so one of my papers that was submitted to interspeech |
---|
0:13:58 | deals with trying to |
---|
0:13:59 | optimize the parameter in each mixture |
---|
0:14:03 | and also |
---|
0:14:05 | some of the future work will be |
---|
0:14:07 | using the iterative techniques proposed in tipping and bishop's method |
---|
0:14:12 | for the mixture of ppca |
---|
0:14:16 | most of all, actually |
---|
0:14:18 | this opens up the possibility of |
---|
0:14:22 | using other transformations mixture-wise as well, which might also be interesting to |
---|
0:14:26 | people who actually apply conventional transformations such as |
---|
0:14:33 | nap or other techniques |
---|
0:14:34 | which actually sort of take |
---|
0:14:36 | transformations in each mixture and then |
---|
0:14:39 | yeah so |
---|
0:14:40 | and then basically integrate them with the i-vectors |
---|
0:14:45 | so |
---|
0:14:46 | that is all i have |
---|
0:15:15 | sorry, can you go back to the acoustic features slide |
---|
0:15:20 | i |
---|
0:15:23 | yeah |
---|
0:15:28 | yeah |
---|
0:15:29 | i |
---|
0:15:35 | what we need to train the ubm from scratch |
---|
0:15:40 | oh yeah, i did, i tried, i've seen some papers too |
---|
0:15:44 | i didn't think i think the way i did i thought |
---|
0:15:50 | or |
---|
0:15:52 | sure |
---|
0:15:53 | you can |
---|
0:16:01 | so |
---|
0:16:02 | i |
---|
0:16:03 | to cluster a feature vector to a mixture you have to have some kind of measurement |
---|
0:16:07 | usually you can find the mixture by |
---|
0:16:12 | taking the mixture that gives you the highest posterior probability |
---|
0:16:15 | but in this distribution i'm showing that |
---|
0:16:18 | it's not always one feature to one mixture, because sometimes the maximum value |
---|
0:16:23 | of |
---|
0:16:23 | the posterior probability over the mixtures is only point two, and that means |
---|
0:16:27 | there are other mixtures |
---|
0:16:29 | also at |
---|
0:16:30 | point something, so |
---|
0:16:32 | if you take the point two mixture as the maximum and use only that mixture's transformation it |
---|
0:16:36 | will be suboptimal |
---|
0:16:38 | so |
---|
0:16:42 | yeah, we could do it that way |
---|
0:16:45 | but i tried this because i had just seen this and i thought |
---|
0:16:45 | it would be nicer generate things that make things |
---|
0:16:48 | are |
---|
0:16:51 | together what is |
---|
0:17:05 | i |
---|
0:17:17 | oh |
---|
0:17:20 | so a number of trials |
---|
0:17:23 | i |
---|
0:17:24 | yeah |
---|
0:17:26 | yeah |
---|
0:17:27 | i |
---|
0:17:36 | i think i normalized in a binary invariance |
---|
0:17:40 | oh |
---|
0:17:56 | although i |
---|
0:18:10 | right yes |
---|
0:18:11 | oh maybe what you're saying is true |
---|
0:18:14 | since i get |
---|
0:18:15 | maybe |
---|
0:18:17 | conditions |
---|
0:18:21 | maybe i don't know if i the folding problem |
---|
0:18:25 | i believe |
---|
0:18:26 | just to |
---|
0:18:28 | well |
---|
0:18:43 | yeah i think that |
---|