0:00:08 | Okay, good. Right, okay. |
---|
0:00:11 | So today I'm going to start talking a little bit about unsupervised adaptation, with respect to the use of total variability and cosine scoring that Najim previously discussed. |
---|
0:00:22 | But first off: I've only known Najim for a couple of months, and I think a lot of you probably know a lot more about Najim, but I just want to summarize maybe a couple of things. These are, in quotes, things that he's said over the last couple of weeks now. |
---|
0:00:40 | I had to get a little bit of prior approval with this, because I wasn't sure this introduction slide was going to be appropriate at all, but it seemed okay, so we're going to go with that. |
---|
0:00:50 | So this is what happened during the SRE, and then upon arrival in Brno... |
---|
0:01:11 | But his mind changes really quickly, because I only put this together a couple of hours ago, and this morning he was really up for the city. |
---|
0:01:25 | And so I decided to predict what's going to happen in a few days, and this is kind of what I'm hoping for. As you may know, he doesn't drink that much of anything here, but I'm thinking that maybe we should help him out with a beer, and we'll give that a go anyway. |
---|
0:01:49 | Down to business. So the whole idea of my talk is, again, unsupervised adaptation, and the whole motivation behind it is that capturing and characterizing every source of variability is pretty difficult, especially with only one enrollment session. |
---|
0:02:10 | Now, if we were able to have multiple enrollments of the same speaker, this would help average out more sources of the inconsistencies and provide a better representation of the speaker in a speaker model. |
---|
0:02:24 | And so this brings us to the problem of unsupervised adaptation in general. In the problem of unsupervised speaker adaptation, we are updating our speaker models without a priori knowledge that the utterance we're updating a model with actually belongs to the target speaker. |
---|
0:02:43 | And we do so based on utterances processed during testing. This was incorporated in, I think, the NIST SRE in two thousand four or five or so. |
---|
0:02:54 | Now, in previous work using joint factor analysis, before we began our work with total variability, what we noticed was that there were indeed highly variable scores produced by JFA, and they required normalization, in particular zt-norm. |
---|
0:03:14 | And applying these score normalizations in the unsupervised adaptation domain requires a significant amount of additional computation with each adaptation update. I'll go a little bit more into that in just a little bit. |
---|
0:03:28 | Now, when we began this work with total variability, we were hoping for a certain number of improvements, where we could do unsupervised adaptation with less computation. We take advantage of total variability's use of low-dimensional total factor vectors, or i-vectors; we can debate on who wants to name what later. |
---|
0:03:54 | And then there's the cosine similarity scoring, which is very quick. And there's also a set of new score normalization strategies that we wanted to play with, namely the symmetric normalization, s-norm, and the normalized cosine distance that Najim just talked about. |
---|
0:04:13 | So, a little bit about the outline for this talk. I'm going to go over really quickly what total variability was, since Najim did a very good job explaining some of the ideas behind it. Then I'll go into the unsupervised adaptation algorithm that we came up with, which has gotten decent results, and then we can proceed onward with the score normalization experiments, and a little bit of further discussion. |
---|
0:04:47 | So total variability, et cetera, has all the components that were shown in the previous talk and that we've probably all seen in the past: we're using factor analysis as a feature extractor, you have a speaker- and channel-dependent supervector, there's intersession compensation with LDA and WCCN, and cosine scoring. |
---|
0:05:11 | At the end of the day, we're just going to use w prime, which is the vector after everything has been applied, such that the scoring is actually really just the inner product between the two. |
---|
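To make the scoring step concrete: in the total variability model the speaker- and channel-dependent supervector is M = m + Tw, where w is the low-dimensional total factor vector (i-vector). Below is a minimal sketch of the compensation-plus-cosine-scoring step; the `lda` and `wccn` projection matrices are placeholders I've assumed for whatever intersession compensation the system actually trains, and the function names are my own.

```python
import numpy as np

def compensate(w, lda, wccn):
    """Apply intersession compensation (LDA projection, then WCCN) to an i-vector."""
    return wccn.T @ (lda.T @ w)

def cosine_score(w_target, w_test):
    """Cosine similarity between two compensated i-vectors: the inner
    product of the unit-normalized vectors."""
    return float(w_target @ w_test
                 / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))
```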
0:05:26 | So, previous work in joint factor analysis on the topic of unsupervised adaptation was done by Yin, Rose, and Kenny. What they did, given some new adaptation data, was compute the posterior distribution of the speaker-dependent hyperparameters using the current ones as a prior. They also set a fixed and predefined adaptation threshold and used log likelihood ratio scoring. |
---|
0:05:59 | In that paper, they also introduced an adaptive t-norm score normalization technique, because what they had observed was a drift in the distribution of normalized scores as more adaptation data was used. In order to be able to use a fixed decision threshold in their decision process, they had to do a new type of normalization. |
---|
0:06:26 | Now, that was met with a good amount of success, and the results were very promising. That said, implementing the adaptive t-norm requires a good bit of computation: calculating the posterior distributions is not easy, and there was also a computationally expensive inference required after every adaptation update. And then lastly, I guess, success was also dependent on the choice of adaptation threshold, which was tuned; we'll get to that in a minute. |
---|
0:07:04 | Now, for us then, in order to try to improve upon this work in the context of total variability, what we wanted was to satisfy the following criteria. First, we wanted a simple and robust method for setting an adaptation threshold theta, and what we decided was to set it to be the same as the optimal decision threshold on development data; what we're going to use was the NIST 2006 SRE data. |
---|
0:07:34 | Basically, what we would do is carry out a run without adaptation, find the point that minimizes the DCF, a posteriori, and we'd set that as the threshold for the test. I'll get into details about that in a little bit; a sketch of this threshold-setting step appears just after these criteria. |
---|
0:07:57 | Next, we wanted to minimize the amount of computation that's carried out during each unsupervised adaptation update, and this is helped already in total variability: we are able to use low-dimensional total factor vectors, and we are able to use cosine similarity scoring. And lastly, our hope was to simplify the score normalization procedures wherever possible. |
---|
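Here is a minimal sketch of that threshold-setting step: picking the a posteriori min-DCF threshold on development scores. The cost parameters below are the standard NIST values of that era (C_miss = 10, C_fa = 1, P_target = 0.01) and are my assumption, not something stated in the talk.

```python
import numpy as np

def min_dcf_threshold(target_scores, impostor_scores,
                      c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep candidate thresholds on development data and return the one
    minimizing the detection cost function; this value is then reused as
    the fixed decision and adaptation threshold theta."""
    candidates = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_theta, best_dcf = None, np.inf
    for theta in candidates:
        p_miss = np.mean(target_scores < theta)   # targets rejected
        p_fa = np.mean(impostor_scores >= theta)  # impostors accepted
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        if dcf < best_dcf:
            best_theta, best_dcf = theta, dcf
    return best_theta
```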
0:08:25 | So, really, basically: if we're able to use our total factor vectors, or i-vectors, as point estimates in our speaker space, then, given limited training data, we obviously might not have a perfect estimate of where our speaker really lies. |
---|
0:08:46 | And so suppose, and this is just a cartoon, so it's not anything rigorous at all, we had our true speaker identity s, which can be given by the little circle there, and our estimated one-utterance speaker identity w. This might not be right on the spot where the true speaker identity is, but if we had a good number of these utterances, it would make sense that we should converge towards a better representation of the speaker. |
---|
0:09:18 | Now, this is also assuming a priori that the additional data we actually have is from speaker s. |
---|
0:09:30 | As such, we decided to propose this algorithm here. In an effort to match the technical rigor of the previous two presentations, I decided to pack as much math as I could into this one slide. |
---|
0:09:47 | Basically, we're saying that we have a set W_s of total factor vectors that are assumed to pertain to the identity of a known speaker s, and then we have total factor vectors w_t, extracted from test utterances, each of which we test. |
---|
0:10:04 | With a defined decision threshold theta, we have an equation for the score: since this notation is just the cardinality, score(W_s, w_t) = (1 / |W_s|) * sum over w in W_s of score(w, w_t), i.e., just the mean of all the individual scores. |
---|
0:10:17 | We then compare that to the threshold, and if the score exceeds your threshold theta, you decide that the current utterance w_t belongs to the identity of the speaker s: (a) you say yes to that trial, and (b) you admit that new utterance w_t into the speaker's set W_s. |
---|
0:10:48 | Now, what we have is a symmetry that allows for this structure, and later we will have more discussion on the design of this combination function, as it can conceivably be done better. |
---|
0:11:03 | But to reiterate what I just said, it's actually quite easy. Say you had an initial enrollment utterance with estimated speaker identity w_1, and you have an incoming test utterance w_t1; this is assuming that this single utterance is all you have for the speaker's identity. |
---|
0:11:22 | You compute a score, and if your single score s_1 is greater than theta, then you just take the test utterance and place it in with the set. What you have now is your test utterance w_t1 serving as a training utterance, one that was just a test. And that's how you simply admit more training vectors into your set. |
---|
0:11:50 | And so now, say you had a second test utterance. Then you can compute two scores, right: one against your initially estimated speaker identity and one against the newly admitted training utterance w_t1. If the function of those two scores, their mean, is again greater than your fixed threshold theta, then you do the same thing, and you put in another training utterance. And that's all there is. |
---|
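Putting the pieces together, here is a minimal sketch of the adaptation loop just described: the speaker is represented by a growing list of i-vectors, the combined score is the plain mean of cosine scores, and an utterance is admitted only when that mean clears the fixed threshold theta. Function and variable names are my own, not from the talk.

```python
import numpy as np

def cosine_score(w1, w2):
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def combined_score(speaker_set, w_test):
    """Mean of cosine scores between the test i-vector and every
    i-vector currently attributed to the speaker."""
    return np.mean([cosine_score(w, w_test) for w in speaker_set])

def adapt(w_enroll, test_ivectors, theta):
    """Unsupervised adaptation: admit each test utterance whose combined
    score exceeds the fixed threshold theta; vectors are never modified."""
    speaker_set = [w_enroll]
    decisions = []
    for w_t in test_ivectors:
        accept = combined_score(speaker_set, w_t) >= theta
        decisions.append(accept)
        if accept:
            speaker_set.append(w_t)  # admit the utterance into the set
    return speaker_set, decisions
```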
0:12:21 | And so the emphasis behind this approach, again, is that we do not need to change the decision threshold theta. In related work in the past, with more adapted utterances, albeit in the text-dependent setting, what some work did was to increase the decision threshold with each adaptation utterance. For us, we wanted to keep things as simple as possible, so we decided that we did not want to change the decision threshold theta. |
---|
0:12:55 | Also, right now there's simply no modification of the total factor vectors; all we're doing is actually combining scores. |
---|
0:13:02 | And so that summarizes total variability and unsupervised adaptation. I'll go really quickly into score normalization, which, I am aware, is a very well-known topic, so I'm just going to give it a pretty brief review, with a couple of emphases in the wording. |
---|
0:13:27 | So, the idea behind score normalization is that we are assuming that the distributions of target-speaker and impostor scores follow two distinct normal distributions. However, the parameters of these two distributions are not speaker-independent but target-speaker-dependent; as such, we need to normalize to allow for a universal decision threshold. |
---|
0:13:48 | And so zero normalization, z-norm, is well known: we scale the distribution of scores produced by a target speaker model and a set of impostor utterances to a standard normal distribution. |
---|
0:14:00 | Now, in test normalization, t-norm, it's essentially the same thing, except that, in order to adjust for intersession variability, we scale the distribution of scores produced by a test utterance and a set of impostor models to a standard normal distribution. The idea here is to keep in mind that I italicized the words "utterance" and "model"; I'll discuss how they are related in the context of total variability in just a little bit. |
---|
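As a reminder of the mechanics, both normalizations are affine rescalings of a raw score by impostor-cohort statistics: z-norm uses the mean and standard deviation of the target model scored against impostor utterances, t-norm those of the test utterance scored against impostor models. A minimal sketch, with helper names of my own choosing:

```python
import numpy as np

def norm_params(w, impostors, score_fn):
    """Mean/std of the scores between w and a cohort of impostor i-vectors.
    For z-norm, w is the target model and the cohort holds impostor utterances;
    for t-norm, w is the test utterance and the cohort holds impostor models."""
    scores = np.array([score_fn(w, imp) for imp in impostors])
    return scores.mean(), scores.std()

def z_norm(raw, mu_z, sigma_z):
    return (raw - mu_z) / sigma_z

def t_norm(raw, mu_t, sigma_t):
    return (raw - mu_t) / sigma_t
```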
0:14:26 | With zt-norm, we've already seen that it achieves the best results in a factor analysis based system, and that's what's currently being used as the state of the art. |
---|
0:14:38 | So now, what we have here with, sorry, the zt-norm parameter updates during this model adaptation is the lack of any need for more normalization parameters. Because the training utterances we admit, w_t1 and w_t2, were previously test utterances, they already have a t-norm associated with them. |
---|
0:15:06 | What we need additionally is obviously the z-norm to be computed, but that's actually it. That means we can simply precompute the z-norm parameters for each test utterance the same way we do it for each target speaker utterance, or target speaker model, as well. And that's all we need to do. |
---|
0:15:26 | In the past, we had to compute adapted t-norm parameters after each adaptation update; here it's very simple, and it should be much quicker. |
---|
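Concretely, since every admitted vector was once a test utterance, its t-norm statistics already exist; the only extra step is computing z-norm statistics once per utterance, which can all be done up front. A sketch of such a precomputation, reusing the `norm_params` helper from the sketch above (the cache layout is my own assumption):

```python
def precompute_norm_stats(ivectors, z_cohort, t_cohort, score_fn):
    """Compute z-norm and t-norm statistics once per utterance, up front,
    so no new normalization parameters are needed after an adaptation update."""
    stats = {}
    for uid, w in ivectors.items():
        stats[uid] = {
            "z": norm_params(w, z_cohort, score_fn),  # vs impostor utterances
            "t": norm_params(w, t_cohort, score_fn),  # vs impostor models
        }
    return stats
```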
0:15:39 | Now, the next thing was total variability in the context of the difference between utterances and models. Well, total variability uses factor analysis as the front end, and so the extraction of total factors from an enrollment or test utterance follows exactly the same process. As such, there's really no difference between an utterance and a model. |
---|
0:16:09 | And, also with the cosine similarity and the symmetry behind all that, there is no distinction to be made, so we can think of them all as the same thing. |
---|
0:16:19 | Which brings us to an even more simplified method of score normalization, the s-norm, something that Patrick Kenny had introduced. Really, all we do in implementing the s-norm here is to define a new set of impostors, which is simply the union of the z-norm impostors and the t-norm impostors. |
---|
0:16:42 | And what we get is a new scoring function that looks pretty similar to any other normalization function we have, except that we simply add the two normalized scores, and this becomes our s-norm. |
---|
0:16:57 | The first term there refers to the use of w_s: the normalization parameters are those associated with your model w_s. And then you have your score with mu and sigma sub t, which are simply those of the test utterance. |
---|
0:17:20 | So what this gives us now is a universal procedure for extracting normalization parameters, and a correspondingly simple method for score normalization. |
---|
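A minimal sketch of the s-norm as described, assuming a single pooled impostor cohort (the union of the z-norm and t-norm cohorts) and reusing the `norm_params` helper from the earlier sketch:

```python
def s_norm(w_model, w_test, impostors, score_fn):
    """Symmetric normalization: normalize the raw score by the impostor
    statistics of each side and add the two normalized scores."""
    raw = score_fn(w_model, w_test)
    mu_s, sigma_s = norm_params(w_model, impostors, score_fn)  # model side
    mu_t, sigma_t = norm_params(w_test, impostors, score_fn)   # test side
    return (raw - mu_s) / sigma_s + (raw - mu_t) / sigma_t
```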
0:17:28 | And then the last score normalization that we explored here, the normalized cosine distance, was previously discussed in Najim's talk, and that's sort of what it looks like, just as a quick reminder. |
---|
0:17:42 | So now we have some of the experiments that we ran. The system that I used here was really the same system that Najim previously had, given that we're working together, so these are your standard parameters for the system setup, and then the table of the corpora that we used, which we can take a more detailed look at at some other time. |
---|
0:18:09 | The protocol of our experiments was this: our results are based on the female part of the 2008 NIST SRE data set, and we fixed our decision and adaptation threshold theta as the optimal min-DCF a posteriori decision threshold on development data, the NIST 2006 SRE. |
---|
0:18:34 | And so these are the ten-second condition results; we cared mostly about the ten-second results here. And these are the results we see: in terms of minimizing the detection cost function, the adaptation-based zt-norm achieves the best min DCF, whereas the normalized cosine distance that Najim previously discussed also did very well on everything else: the English trials for equal error rate, and also the min DCF over all trials. |
---|
0:19:17 | But at the same time, we can also notice that the s-norm also achieves good results, very competitive, and in some cases even better than zt-norm, at least for English. |
---|
0:19:39 | So, in order to validate our results, we also tried our work on, I guess, the core condition, the longer utterances, and in this case Najim's normalization actually swept our results across the board, achieving the best results here. |
---|
0:20:00 | But at the same time, there are a couple of things to take note of. What we can see here is that our proposed adaptation algorithm is successful in improving performance regardless of the normalization procedure. This is obviously consistent with the notion that unsupervised adaptation, with, of course, an appropriately chosen threshold, should be at least as good as, and hopefully better than, the baseline method without adaptation. |
---|
0:20:35 | The next thing is that our simplified s-norm approach performs competitively with the more complicated, traditional zt-norm. And of course, what we've seen is that the best result is ultimately obtained using the normalized cosine distance. |
---|
0:20:50 | And as a result, I think one of the cooler things is that we seem to have come full circle with the story of score normalization techniques. In the beginning, we needed normalization techniques in order to better calibrate our scores for a fixed decision threshold. However, once we got to the most complicated, zt-norm, we actually started going backwards and simplifying things, into an s-norm, which is much easier to calculate. |
---|
0:21:18 | And now it's almost as if the parameters we need are not speaker-dependent at all: there is no need, in the normalized cosine distance, to have the parameters of each distribution calculated for each speaker; it's a pretty universal set of parameters that's needed. |
---|
0:21:44 | And now, there's the bit of work that I brought up earlier, where we decided that maybe there's a better way to improve our score combination function. These are some basic ideas I'll go over that we're currently working on, but we don't have any significant improvement in our results just yet. |
---|
0:22:11 | The idea is weighted averaging, because our currently proposed method for combining scores really treats every vector in the set of total factors as equally important. However, at the end of the day, the only vector that unequivocally belongs to the speaker s is the initial enrollment vector. As such, maybe it makes more sense to weight that vector a little bit higher than the rest of the training utterances that we admit, because the presence of false-alarm adaptation updates, in which utterances are incorrectly admitted, will have an adverse effect on all subsequent tests. |
---|
0:22:50 | So maybe our score combination function should take the following into account, where we weight each score by a coefficient alpha. The score here is the unnormalized score, something like the cosine similarity, which ranges between negative one and one, so it can be seen as a weight for each score. |
---|
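The talk leaves the exact weighting open, so the following is only one way it could look, under my own assumptions: each admitted vector carries a weight derived from the raw cosine score it was admitted with, mapped from [-1, 1] to [0, 1], and the enrollment vector is pinned to the maximum weight of 1.

```python
import numpy as np

def admission_weight(raw_cosine):
    """Map a raw cosine score in [-1, 1] to a weight in [0, 1].
    This particular mapping is an assumption, not specified in the talk."""
    return (raw_cosine + 1.0) / 2.0

def weighted_combined_score(speaker_set, weights, w_test, score_fn):
    """Weighted mean of cosine scores: the enrollment vector keeps weight 1.0,
    admitted vectors carry the weight from their admission score."""
    scores = np.array([score_fn(w, w_test) for w in speaker_set])
    weights = np.array(weights)
    return float((weights * scores).sum() / weights.sum())
```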
0:23:11 | And then, in a quick visualization, as I'm low on time right now, we can simply look at it this way: your initial identity speaker vector w_1 is the most important, and the next few vectors might be erroneous, because, based on your threshold, you're only allowing in the region in the green circle. So if your true speaker identity is s, then you may incorrectly allow some vectors in. |
---|
0:23:43 | However, as you get more and more vectors, we can be more and more certain that they belong to the speaker's identity. |
---|
0:23:56 | So what we would have is: you add a training vector, and it shifts the space a little bit, and then maybe you add in an incorrect one, but then you add more correct ones. As a result, after a couple more of these utterances, you can finally see that maybe you got the right ones; for that false-alarm one, though, maybe you should take it out, or something like that. |
---|
0:24:17 | And so that's sort of what we're looking at working on in the future; we're still looking to improve the score combination function, and I'm open to any ideas. One of the ideas, also, though it's not allowed in this protocol, say, is that since we're most error-prone in the beginning, after a while maybe we can go take another look at the training vectors and correct any errors from the beginning. This is easy to do because we don't actually modify the vectors at all. |
---|
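As one illustration of that revisiting idea, and purely my own sketch rather than anything evaluated in the talk: because admitted vectors are never modified, each one can later be re-scored against the rest of the now-larger set and dropped if it no longer clears the threshold.

```python
import numpy as np

def prune_set(speaker_set, theta, score_fn):
    """Re-score each admitted vector against the rest of the (now larger) set
    and drop those that no longer clear the threshold; possible because the
    vectors themselves are never modified."""
    kept = [speaker_set[0]]  # always keep the enrollment vector
    for i, w in enumerate(speaker_set[1:], start=1):
        others = speaker_set[:i] + speaker_set[i + 1:]
        score = np.mean([score_fn(w, o) for o in others])
        if score >= theta:
            kept.append(w)
    return kept
```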
0:24:47 | And so we have a final summary: we have proposed a method for unsupervised speaker adaptation in the use of total variability and cosine scoring. We have a simple and efficient method for score combination, with a fixed a priori decision threshold, and this method can also easily accommodate all score normalization procedures. |
---|
0:25:12 | And with respect to score normalization, we discussed some of the newer, non-zt-norm ideas, like the s-norm and the normalized cosine distance that Najim talked about. So, thanks. |
---|
0:25:37 | Are there questions for Steven? |
---|
0:25:48 | In a different paper, the last one, we looked at this problem and tried something similar. If you are using a fixed threshold when adapting the speaker model, have you been thinking about the new cost function proposed by NIST? |
---|
0:26:23 | What we proposed in that paper was a way to do the adaptation like you are doing with the scores, but continuous, so that instead of a hard decision you just try to evaluate a confidence for each trial, weight the trial with that confidence, and use it to increase the weight gradually between updates. |
---|
0:26:48 | So the question is whether or not I tried using something other than a fixed threshold. If I'm hearing the question correctly: I think that the use of a fixed decision threshold, at the end of the day, makes things simple. Using a fixed decision threshold, but at the same time using a varying score combination function that weights the scores of each training utterance that we have, I think that's a pretty good way to do it, but I guess we could talk more about it later. |
---|
0:27:56 | Yeah, just one simple related question: when you compute the weighted score, do you compute it with respect to all the trials, just applying your adaptation threshold on this score? Do you think it would be robust? |
---|
0:28:19 | Computing the score with all the... every single trial? So, I'm not... okay. |
---|
0:28:30 | I also have an answer to your question about the prior; I have a remark and a comment on this in my presentation tomorrow. |
---|
0:28:56 | Might there be a last question for Steven? Okay, it's time to speak again. |
---|