0:00:14 | Hello, my name is Ville Vestman, and in this video I describe our work entitled "Neural i-vectors". |
0:00:22 | This work was co-authored with Kong Aik Lee and Tomi Kinnunen. |
0:00:27 | Tomi and I are from the University of Eastern Finland, and Kong Aik was with NEC at the time of writing. |
0:00:37 | Our study proposes a new way of combining Gaussian mixture model based generative i-vector models with discriminatively trained deep neural network based x-vector speaker embeddings for the speaker verification task. |
0:00:51 | Our aim is to improve upon existing i-vector systems, |
0:00:56 | and we also hope to gain some insights into what causes the performance differences between i-vector speaker embeddings and discriminatively trained speaker embeddings. |
0:01:09 | Our study also establishes a connection between the standard Gaussian mixture models and some of the existing DNN pooling layers. |
0:01:21 | As a background for our work, four different constructs are considered. |
0:01:26 | The last three constructs listed here combine ideas from both i-vectors and DNNs. |
0:01:36 | We pay special attention to the roles of the universal background models and the i-vector extractors in all of these constructs. |
0:01:46 | Let's start with the standard i-vector. |
0:01:51 | The key components here are the two generative models: the Gaussian mixture model based universal background model and the i-vector extractor. |
0:02:04 | The UBM is used together with the acoustic features to compute the sufficient statistics that the i-vector extractor needs to extract the i-vectors. |
0:02:18 | So here the features are rule-based, and the rest of the components are generatively trained. |
0:02:28 | In the DNN i-vectors construct, the universal background model is replaced by a DNN that takes acoustic features as an input and produces the senone posteriors as an output. |
0:02:44 | These posteriors are used together with the acoustic features to compute the sufficient statistics for the i-vector extractor. |
0:02:55 | So this construct differs from the standard i-vector in that the universal background model is replaced with a discriminatively trained one, the DNN. |
0:03:06 | The third system is the end-to-end i-vector system. |
0:03:09 | This system combines three modules into one neural network. |
0:03:16 | The first module maps features to statistics, the second module maps statistics to i-vectors, and the third module is responsible for scoring pairs of i-vectors. |
0:03:29 | Training of this kind of network goes as follows. |
0:03:34 | The corresponding generative models are first used to train these individual modules, so that the modules can benefit from the knowledge of the generative models. |
0:03:49 | After these modules have been trained separately, they can be combined and then trained jointly. |
0:03:59 | So, this construct utilizes the generative models in the initialization stage, while the final model uses discriminative training of the whole network. |
0:04:14 | The fourth and the last background construct is a DNN with a mixture factor analysis pooling layer. |
0:04:23 | In this work, the authors used an x-vector style DNN to extract speaker embeddings. |
0:04:31 | What is special about this work is that they use their own pooling layer. |
0:04:38 | This pooling layer is basically an i-vector extractor implemented inside the DNN. |
0:04:45 | The MFA layer is related to the learned dictionary encoder pooling layer, which we will discuss later in this talk. |
0:05:00 | Note that all the components of this last construct are discriminatively trained with speaker targets. |
0:05:12 | Okay, next we move on to the proposed neural i-vectors. |
0:05:18 | Before explaining the construct itself, we will need to go through some prerequisites for our model. |
0:05:27 | These are the NetVLAD and the LDE pooling layers. I will describe these two pooling layers and how they relate to the standard GMM, |
0:05:38 | so that the next slides will be easier to follow. |
0:05:45 | So, first the NetVLAD. |
0:05:48 | We will start with the posterior computation formula of a standard GMM, and we can see how we get the NetVLAD formulation out of this equation. |
0:06:04 | Here, C is the number of Gaussian components, and each Gaussian component has a covariance matrix, a mean vector, and an associated weight. |
0:06:18 | NetVLAD assumes shared covariance matrices for all Gaussian components. |
0:06:27 | We can rewrite this formula in this form by expanding the normal distributions. |
0:06:38 | Then, if we denote this inverse covariance times mean vector term by w, |
0:06:48 | and the log-weight term minus this other term by b, |
0:06:55 | we get this softmax expression. |
0:06:59 | And this happens to be exactly the formulation used in the NetVLAD paper from 2016. |
0:07:09 | So basically, by assuming shared covariance matrices in the GMM, we get the same formulation as in NetVLAD. |
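The derivation just described can be written out explicitly. The following is a reconstruction from the standard GMM definitions (matching the w and b terms mentioned above), not a copy of the slide contents:

```latex
% GMM posterior (responsibility) of component c for frame x_t:
\gamma_c(\mathbf{x}_t)
  = \frac{\pi_c\,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)}
         {\sum_{k=1}^{C}\pi_k\,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}.
% With shared covariances (Sigma_c = Sigma for all c), expanding the Gaussians
% makes the quadratic term in x_t cancel between numerator and denominator,
% leaving a softmax over affine functions of x_t:
\gamma_c(\mathbf{x}_t)
  = \frac{\exp(\mathbf{w}_c^{\top}\mathbf{x}_t + b_c)}
         {\sum_{k=1}^{C}\exp(\mathbf{w}_k^{\top}\mathbf{x}_t + b_k)},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,
\qquad
b_c = \log\pi_c - \tfrac{1}{2}\,\boldsymbol{\mu}_c^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c.
```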
0:07:23 | Okay, but what are the learnable parameters in NetVLAD? In NetVLAD, they are these vectors w, the biases b, and the means. |
0:07:36 | Estimating these w's and b's is decoupled from estimating the means, |
0:07:43 | and we see from the posterior computation formula that the posterior does not depend on the mean vectors, which is quite an interesting difference to the standard GMMs. |
0:07:57 | But anyway, after we have computed the posteriors for the input feature vectors, |
0:08:09 | we can compute the component-wise outputs of the NetVLAD layer, whose formulation is on the right side of the screen. |
0:08:22 | In the numerator, we have the first order centered sufficient statistics, |
0:08:32 | and the denominator just length-normalizes them. |
0:08:37 | So, for each Gaussian component we get one output vector, |
0:08:42 | and finally, the NetVLAD layer concatenates these component-wise outputs to form a supervector. |
0:08:53 | So this is very similar to the standard GMM supervectors and how they are formed. |
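To make the pooling concrete, here is a minimal NumPy sketch of NetVLAD-style pooling as described above (softmax assignments, centered first order statistics, length normalization, concatenation). The function name and array shapes are my own choices, not the paper's code:

```python
import numpy as np

def netvlad_pool(X, W, b, mu):
    """NetVLAD-style pooling over one utterance.

    X  : (T, D) frame-level features
    W  : (C, D) assignment weights (playing the role of Sigma^{-1} mu_c)
    b  : (C,)   assignment biases
    mu : (C, D) learnable means (anchor points)
    Returns a (C * D,) supervector.
    """
    # Soft assignments: a softmax over the C components for every frame.
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)    # (T, C) posteriors

    # Numerator: first order centered statistics per component,
    # sum_t gamma[t, c] * (x_t - mu_c), computed without an explicit loop.
    F = gamma.T @ X - gamma.sum(axis=0)[:, None] * mu  # (C, D)

    # Denominator: length-normalize each component's vector ...
    F /= np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    # ... and concatenate the component-wise outputs into a supervector.
    return F.reshape(-1)
```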
0:09:05 | Okay, next let's do the same for the learned dictionary encoder pooling layer. |
0:09:12 | So, we start with the GMM posterior computation formula. |
0:09:18 | Okay, this time we rewrite it with this one scalar term; we get this by expanding the normal distributions. |
0:09:34 | Now, if we assume isotropic, or spherical, covariance matrices, |
0:09:44 | this formula will simplify to this form. |
0:09:52 | And this is the formulation used in the learned dictionary encoder pooling layer, |
0:09:59 | although in the original publication of the LDE, this bias term was not included; it was added later on by other authors. |
0:10:14 | So, the key point here was that by assuming isotropic covariance matrices, the LDE formulation follows from the standard GMM formulation. |
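Written out, the isotropic-covariance assumption gives the following form; this is again a reconstruction from the GMM definitions rather than the slide itself, with s denoting the scaling factors and b the bias terms of the talk:

```latex
% Isotropic covariances: Sigma_c = sigma_c^2 I. The GMM posterior becomes
\gamma_c(\mathbf{x}_t)
  = \frac{\exp\!\left(-s_c\,\lVert \mathbf{x}_t - \boldsymbol{\mu}_c \rVert^2 + b_c\right)}
         {\sum_{k=1}^{C}\exp\!\left(-s_k\,\lVert \mathbf{x}_t - \boldsymbol{\mu}_k \rVert^2 + b_k\right)},
\qquad
s_c = \frac{1}{2\sigma_c^2},
\qquad
b_c = \log\pi_c - \frac{D}{2}\log\!\left(2\pi\sigma_c^2\right),
% where D is the feature dimensionality. The original LDE layer used this
% form without the bias term b_c.
```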
0:10:28 | The learnable parameters of this LDE layer are the scaling factors for the covariances, the mean vectors, and the bias terms. |
0:10:44 | Similarly as with NetVLAD, we can then move on to the component-wise outputs of the LDE layer. |
0:10:51 | So again, in the numerator, we directly have the first order sufficient statistics, |
0:10:58 | but unlike in NetVLAD, the denominator here is different: it is the sum of the posteriors over the frames for each component. |
0:11:09 | So this output resembles the traditional maximum likelihood estimation of the component-wise means. |
0:11:21 | And then the component-wise outputs are concatenated to form a supervector. |
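A matching NumPy sketch of LDE-style pooling follows; as described above, it differs from the NetVLAD sketch in how the assignments are computed and in the denominator (zeroth order statistics instead of length normalization). Names and shapes are again my own illustrative choices:

```python
import numpy as np

def lde_pool(X, mu, s, b):
    """Learned dictionary encoder style pooling over one utterance.

    X  : (T, D) frame-level features
    mu : (C, D) dictionary components (means)
    s  : (C,)   positive scaling factors (isotropic precisions)
    b  : (C,)   bias terms (not present in the original LDE formulation)
    Returns a (C * D,) supervector of component-wise mean offsets.
    """
    # Posteriors from negative scaled squared distances, softmax-normalized.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (T, C)
    logits = -s * d2 + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # (T, C)

    # Numerator: first order centered statistics, as in NetVLAD.
    F = gamma.T @ X - gamma.sum(axis=0)[:, None] * mu         # (C, D)
    # Denominator: the zeroth order statistics, i.e. the soft count of
    # frames assigned to each component -- this is where LDE differs
    # from NetVLAD's length normalization.
    N = gamma.sum(axis=0)                                     # (C,)
    return (F / np.maximum(N, 1e-12)[:, None]).reshape(-1)
```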
0:11:29 | Okay, so now we have the necessary constructs explained to present the proposed neural i-vectors. |
0:11:41 | So, we start with a standard x-vector extractor architecture, |
0:11:50 | and we replace the standard pooling layer with either the NetVLAD or the LDE encoder. |
0:12:00 | And as we saw from the previous slides, we can use these pooling layers to extract the sufficient statistics. |
0:12:09 | So we do that, and by using these sufficient statistics we can train a regular i-vector extractor, and we can also then extract the i-vectors from these statistics. |
0:12:25 | So, that's the idea. |
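The pipeline just described (pooling-layer posteriors, to sufficient statistics, to a regular i-vector extractor) can be sketched as below. This uses the textbook i-vector posterior-mean formula with a hypothetical, already-trained total variability matrix `T_mat`; it is an illustration of the idea under those assumptions, not the authors' implementation, and training `T_mat` itself via EM is not shown:

```python
import numpy as np

def baum_welch_stats(H, gamma, mu):
    """Zeroth and centered first order statistics from deep features.

    H     : (T, D) deep features from the last layer before pooling
    gamma : (T, C) component posteriors taken from the pooling layer
    mu    : (C, D) component means of the pooling layer
    """
    N = gamma.sum(axis=0)               # (C,)   zeroth order (soft counts)
    F = gamma.T @ H - N[:, None] * mu   # (C, D) centered first order
    return N, F

def extract_ivector(N, F, T_mat, sigma2):
    """Classical i-vector point estimate (posterior mean of the latent factor).

    T_mat  : (C*D, R) total variability matrix (assumed already trained)
    sigma2 : (C*D,)   flattened diagonal covariances of the components
    """
    R = T_mat.shape[1]
    C, D = F.shape
    n = np.repeat(N, D)                          # repeat N_c over the D dims
    Tp = T_mat / sigma2[:, None]                 # Sigma^{-1} T
    L = np.eye(R) + Tp.T @ (n[:, None] * T_mat)  # posterior precision
    return np.linalg.solve(L, Tp.T @ F.reshape(-1))
```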
0:12:32 | So, now we can compare the constructs in a table. |
0:12:37 | How our proposed construct differs from the end-to-end i-vector system is that our i-vector extractor is generative; otherwise the construct is the same. |
0:12:55 | If we compare our proposed neural i-vectors with the DNN i-vectors, we can see that the i-vector part is the same, but ours uses a DNN that has been trained with speaker utterances, |
0:13:13 | and also the features are obtained from the last layer before the pooling layer. |
0:13:22 | Next, let's move on to the experiments and results. |
0:13:27 | So, we conducted speaker verification experiments on the Speakers in the Wild evaluation. |
0:13:33 | First, we compare our neural i-vector results with other i-vector systems. |
0:13:39 | These we found from the literature, and these are some of the best ones. |
0:13:46 | On the first line, we have a standard i-vector system, |
0:13:51 | and on the second one, we have an i-vector system that uses perceptual linear prediction features together with additional features, |
0:14:00 | and this WPE is a dereverberation method. |
0:14:06 | So, we can see from these results that the neural i-vectors perform the best. |
0:14:13 | Okay, so that was the first comparison; let's next compare our results with the DNN speaker embeddings. |
0:14:23 | So, we can use the same DNN to extract either the sufficient statistics for the neural i-vectors, or we can extract the speaker embeddings directly from the DNN. |
0:14:38 | So, here are our results. |
0:14:45 | On the first line, we have the DNN with the learned dictionary encoder pooling and its equal error rate. |
0:14:57 | Then, the corresponding neural i-vectors: that is, we used the same DNN to extract the sufficient statistics, and then trained the generative i-vector extractor. |
0:15:08 | With the neural i-vectors, we get a slightly higher equal error rate, but it is not bad either. |
0:15:17 | On the third line, we have a modification of the learned dictionary encoder: this uses diagonal covariance matrices instead of the isotropic covariance matrices. |
0:15:33 | So, we obtained improvements by doing this modification. |
0:15:38 | The last two lines show the results for the NetVLAD pooling. |
0:15:47 | So, the interesting thing here is to wonder what causes the performance difference between the neural i-vectors and the DNN embeddings, |
0:16:02 | because these are using the same DNN, but in different ways. |
0:16:09 | So, there are two possible sources for this difference. |
0:16:14 | The first possible source is the difference between the generatively trained i-vector extractor and the discriminatively trained layers after the pooling layer. |
0:16:29 | Because after the pooling layer there is only one layer, even for the DNN embeddings, only this small part seems to explain the difference in the equal error rates. |
0:16:45 | So, it seems that the discriminative training objective is better. |
0:16:52 | Okay, then there is another possible reason for this performance difference. |
0:17:00 | So, there is a slight mismatch between how we train the DNN pooling layer and how we use it in the i-vector approach. |
0:17:14 | We can see that in the DNN we explicitly form a supervector, |
0:17:20 | but in the i-vector approach, it is not formed explicitly; instead, the i-vector extraction also utilizes the zeroth order statistics, which count how many frames are aligned to each of the Gaussian components. |
0:17:39 | So, this alignment information is missing from the DNN-based supervector approach. |
0:17:44 | So, one of the future works is to modify the DNN pooling layer so that it will resemble more the i-vector approach, so that this mismatch will be reduced. |
0:18:03 | Another idea for future work is explained here. |
0:18:09 | So, instead of sufficient statistics extraction, we could discard the layers after the pooling layer and use the DNN as a universal background model, by taking the posteriors from this pooling layer. |
0:18:24 | And by using these posteriors, we would then have a neural GMM-UBM system with frame-based scoring. |
0:18:36 | So, this might be useful for some special applications of speaker verification. |
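The neural GMM-UBM idea would plug into classic frame-based scoring. As a minimal sketch, assuming the per-frame component log-likelihoods come from the (speaker-adapted) pooling-layer parameters, and with `frame_llr_score` a hypothetical helper name:

```python
import numpy as np

def frame_llr_score(logp_spk, logp_ubm):
    """Average frame-level log-likelihood ratio, as in classic GMM-UBM scoring.

    logp_spk : (T, C) per-frame, per-component log-likelihoods (including the
               log weights) under the speaker-adapted model
    logp_ubm : (T, C) the same under the universal background model
    """
    def log_mixture(logp):
        # Stable log-sum-exp over the mixture components, one value per frame.
        m = logp.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True))).ravel()

    # Verification score: mean over frames of log p(x|spk) - log p(x|ubm).
    return float(np.mean(log_mixture(logp_spk) - log_mixture(logp_ubm)))
```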
0:18:50 | Before I finish, I have two related announcements. The first one is that the program codes are available. |
0:18:57 | So, we have the i-vector extractor and PLDA systems, and in addition to the Speakers in the Wild recipes, we have also the recipes for other data sets. |
0:19:06 | The code is Python and PyTorch based, |
0:19:09 | and we hope that it can benefit other researchers as well. |
0:19:16 | The second announcement is that this study was also included in my dissertation, |
0:19:23 | and for the dissertation, I have written an extensive introductory part during the past weeks. |
0:19:32 | So, anyone who wants to read it is free to do so, and it can be found here. |
0:19:41 | So, see you! |