0:00:06 | Good morning everyone, I'm Mitchell McLaren and I'll be presenting some work that was done while I was at QUT — I've since relocated to another institution, in case anyone's wondering. |
---|
0:00:19 | I'm presenting on behalf of my co-authors as well: Robbie Vogt, Brendan Baker and Sridha Sridharan. |
---|
0:00:25 | The work today is basically an experimental study of how SVMs perform when you decrease the amount of speech that is available to them for speaker verification. |
---|
0:00:36 | A brief outline: first the motivation for why we did this study, |
---|
0:00:40 | and then some experiments looking at how each of the components of a standard GMM-SVM system responds to a reduced amount of speech being available to it. |
---|
0:00:53 | This includes the background dataset and session compensation, particularly NAP, |
---|
0:00:57 | then a bit of an analysis of the variation in the kernel space with short utterances, and the score normalisation dataset. |
---|
0:01:06 | Then I'll present some conclusions. |
---|
0:01:09 | So, the motivation. |
---|
0:01:11 | It's quite well known that as you reduce the amount of speech available to a system, you're going to have a reduction in performance. |
---|
0:01:18 | Now, there have been some previous studies, which generally focus on the GMM-UBM approach and, more recently, on joint factor analysis, but nothing really targeted at the SVM case, and this is why we're doing this work here. |
---|
0:01:34 | One of the things to mention here is that QUT participated in an evaluation — almost a miniature NIST evaluation, I guess you'd say — and some of the observations we got from that evaluation |
---|
0:01:48 | were that the SVM outperformed the JFA system when we had an ample amount of speech, around six minutes, |
---|
0:01:55 | whereas the opposite was true for the twenty-second condition, where the JFA performed better. |
---|
0:02:01 | So there was a distinct difference between the generative and discriminative approaches that depended on the duration of speech available. |
---|
0:02:10 | Another observation was that the JFA was more effective when estimating the session and speaker subspaces on a duration of speech that was similar to the evaluation condition, so we're going to look at that a bit throughout this work as well. |
---|
0:02:28 | Of course, SVMs are quite widespread in the speaker verification community — we just have to look at the presentations last week for the NIST 2010 evaluation, where almost all submissions included the GMM-SVM configuration somehow. |
---|
0:02:41 | So we're looking now at how best to select and utilise the development data when we have mismatched training and test segment durations in the SVM configuration. |
---|
0:02:56 | The main questions here for the SVM systems are: to what degree does limited speech affect SVM-based classification, and which system components are most sensitive to the speech quantity? |
---|
0:03:09 | So we're presenting these results with the hope of pointing out directions which might counteract these effects, I should say. |
---|
0:03:19 | Most of you know about the GMM-SVM system, I would suppose, where we use the stacked GMM component means of each utterance as the features for SVM classification. |
---|
0:03:28 | We know we can get good performance when you have plenty of speech available. |
---|
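As a rough illustration of the GMM-SVM front end being described — relevance-MAP adaptation of the UBM means to an utterance, followed by stacking those means into a supervector — here is a minimal sketch. The function names, the diagonal-covariance assumption and the default relevance factor are illustrative assumptions, not details taken from the system in the talk.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covs, ubm_weights, frames, tau=16.0):
    """Relevance-MAP adaptation of diagonal-covariance UBM means to one utterance.

    ubm_means, ubm_covs: (C, D) component means and diagonal covariances
    ubm_weights: (C,) mixture weights; frames: (T, D) MFCC frames; tau: relevance factor.
    """
    # Component posteriors gamma_c(t) for every frame.
    log_like = -0.5 * (((frames[:, None, :] - ubm_means) ** 2 / ubm_covs)
                       + np.log(2.0 * np.pi * ubm_covs)).sum(axis=2)
    log_post = np.log(ubm_weights) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                # (T, C)

    n = post.sum(axis=0)                                   # zero-order statistics
    f = post.T @ frames                                    # first-order statistics (C, D)
    alpha = (n / (n + tau))[:, None]                       # adaptation coefficients
    return alpha * (f / np.maximum(n, 1e-10)[:, None]) + (1.0 - alpha) * ubm_means

def gmm_supervector(adapted_means):
    """Stack the adapted component means into a single SVM feature vector."""
    return adapted_means.reshape(-1)
```

The stacked vector is what the SVM classifies; the amount of speech enters through the per-component occupancies, which is why the later discussion of the relevance factor matters.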
0:03:32 | In this work we're looking at the importance of matching the development datasets to the evaluation conditions for each of the individual components. |
---|
0:03:43 | Let's take a look at a flow diagram of the system. Basically we have three main datasets that go into development. |
---|
0:03:52 | First of all, we want to train a transform matrix for session compensation, particularly NAP, so we have a transform training dataset. |
---|
0:04:00 | We also have a background dataset to provide the negative information during SVM training. |
---|
0:04:08 | And lastly we have a score normalisation dataset, should we choose to apply score normalisation. |
---|
0:04:15 | The approach for this study is to start from a baseline SVM system — one without score normalisation and without session compensation — and build onto that progressively, looking at how each of the additional components is affected by the duration of speech. |
---|
0:04:33 | So these three sets, as I mentioned, are the background dataset, the transform training dataset for session compensation, and lastly the score normalisation dataset. |
---|
0:04:42 | A quick look at the system we're working with: it's a GMM-SVM system with a 512-component UBM and 12-dimensional MFCCs with appended deltas. |
---|
0:04:53 | The impostor data was selected from the SRE '04 corpus, and we use this for both the background dataset and the score normalisation cohorts. |
---|
0:05:02 | For NAP we use only the dimensions of greatest session variation, with the transform trained on data from the SRE and Switchboard corpora. |
---|
0:05:12 | The evaluations we performed here are from the NIST 2008 SRE corpus, particularly the short2-short3 condition, which usually has two and a half minutes of conversational speech per utterance. |
---|
0:05:24 | The way we're looking at reduced durations is through two focus conditions: the full-short condition and the short-short condition. |
---|
0:05:32 | For the full-short condition we leave the training segment as it is — full — and we progressively truncate the test utterance to the desired duration. |
---|
0:05:43 | In the short-short case we truncate both train and test to the same duration, so there's essentially no duration mismatch in this evaluation. |
---|
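The truncation protocol just described can be pictured with a small sketch. The 100 frames-per-second rate (a 10 ms shift) and the choice to keep the leading frames are assumptions made for illustration; the talk only states that utterances are truncated and, per the later question session, the remainder discarded.

```python
FRAMES_PER_SECOND = 100  # assumes a 10 ms frame shift; illustrative only


def truncate_utterance(frames, seconds):
    """Keep only the first `seconds` worth of feature frames; the rest is discarded."""
    return frames[: int(seconds * FRAMES_PER_SECOND)]


def make_trial(train_frames, test_frames, condition, seconds):
    """Build train/test feature pairs for the two duration conditions in the talk."""
    if condition == "full-short":
        # full ~2.5 minute training segment, truncated test segment
        return train_frames, truncate_utterance(test_frames, seconds)
    if condition == "short-short":
        # both sides truncated to the same duration, so no train/test mismatch
        return (truncate_utterance(train_frames, seconds),
                truncate_utterance(test_frames, seconds))
    raise ValueError("unknown condition: %s" % condition)
```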
0:05:53 | So let's look at the baseline SVM performance, and in particular — before we go into detail later — how it compares to the GMM, just as a point of reference for what we'll see. |
---|
0:06:07 | Here we're using a baseline and what we're terming a state-of-the-art configuration — which is now not so true, with the i-vector work coming out — and we're looking at the baseline and state-of-the-art for both GMM and SVM systems, |
---|
0:06:24 | for systems that were developed using the full two and a half minutes of speech in training and testing. So we're not yet explicitly dealing with the reduced durations in this figure. |
---|
0:06:36 | The first thing we notice here, with the solid lines being the baseline approaches, is that the baseline SVM gives us better performance than the GMM baseline — |
---|
0:06:48 | keeping in mind that, like the GMM baseline, it has no session compensation and no score normalisation, so it's quite a conservative configuration. |
---|
0:06:58 | But as we reduce the duration of speech, the SVM quickly deteriorates in performance compared to the GMM system. |
---|
0:07:08 | It's not quite as noticeable in the state-of-the-art systems, but the GMM is in front of the SVM the whole way. |
---|
0:07:15 | Now, if we look at the short-short conditions — this is where both train and test have been reduced — we actually see that the SVM baseline levels out once we reduce below about the eighty-second mark. |
---|
0:07:31 | Having developed the system on the full two and a half minutes of speech might be the reason for this, but we'll have to look into that. |
---|
0:07:40 | In the case of the GMM system, however, at less than ten seconds we see the baseline jump in front of the state-of-the-art system. |
---|
0:07:50 | So there are some significant differences and issues we need to look into here, and hopefully the development datasets that we examine will help us out with that. |
---|
0:08:00 | Let's start with the background dataset. Here we're going to look at the SVM system and how changing the speech duration in the background dataset affects performance, without score normalisation and without session compensation. |
---|
0:08:15 | As we know, the background dataset gives us the negative information in SVM training. We generally have many more negative examples than positive examples in the NIST SREs, |
---|
0:08:26 | and we previously found that the choice of this dataset greatly affects model quality. |
---|
0:08:32 | The real question that comes up for the SVMs is how we select this dataset under mismatched train and test durations: should we be matching the duration to the training utterance, the test utterance, or the shorter of the two? |
---|
0:08:48 | So there are three plots here to present. Firstly, we've got the short-short condition with matched training and testing durations, and it's quite obvious that it's better to match the background to the evaluation conditions here. |
---|
0:09:02 | In the full-short — that's full training and short testing — it's actually better to match the background dataset to the shorter test utterance. |
---|
0:09:15 | In the last condition, which we have introduced, the short-full — short training and full testing — we don't see as large a discrepancy at the shorter durations, but matching to the shorter training utterance gives us a little bit of an improvement at the longer durations. |
---|
0:09:37 | So what conclusions can we draw from this? Let's look at the equal error rate as well, to give us a bit more information, particularly focusing on the ten-second condition here. |
---|
0:09:49 | The first thing we can see is that matching the background dataset to the training segment does not always maximise performance. |
---|
0:09:58 | However, if we match to the test segment, in our results we always get the best DCF performance. In contrast, if we want the best equal error rate performance, we match to the shorter utterance. |
---|
0:10:11 | So there's a bit of a choice to be made, depending on which operating point you want to address. |
---|
0:10:20 | So in the following experiments we've chosen to use the shorter test utterance as the duration that we match the background dataset to. |
---|
0:10:31 | Let's look now at session compensation: nuisance attribute projection. |
---|
0:10:37 | NAP aims to remove the directions of greatest session variation; there's a small formula here showing that the dimensions captured in the U transform matrix are projected out of the kernel space. |
---|
0:10:50 | This transform U has to be learned from a training dataset. |
---|
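In code, the projection step described here, together with one common way of estimating U (PCA on speaker-centred supervectors, which is also how it comes up in the question session later), looks roughly like the following. It is a sketch under the assumption of an orthonormal U and ignores any kernel weighting the actual system may apply.

```python
import numpy as np

def nap_project(supervectors, U):
    """Project the nuisance directions in U out of the kernel space: x - U (U^T x).

    supervectors: (N, D) utterance supervectors; U: (D, K) orthonormal nuisance basis.
    """
    return supervectors - (supervectors @ U) @ U.T

def train_nap_transform(supervectors, speaker_ids, k):
    """Estimate U as the top-k directions of within-speaker (session) variation."""
    centred = np.asarray(supervectors, dtype=float).copy()
    speaker_ids = np.asarray(speaker_ids)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        centred[idx] -= centred[idx].mean(axis=0)    # remove each speaker's mean
    # Top-k right singular vectors of the centred data = leading eigenvectors of
    # the within-speaker scatter matrix.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:k].T                                   # (D, k)
```

The duration question the talk raises is simply which utterances (full or truncated) populate `supervectors` when U is trained.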
0:11:00 | what is what |
---|
0:11:01 | train and test speech of minutes |
---|
0:11:05 | on this board first are we looking at the whole short condition |
---|
0:11:09 | uh |
---|
0:11:10 | L system he has no score normalisation but the background as being that's to the shorter test |
---|
0:11:15 | abhorrence in each of these cases |
---|
0:11:18 | and it's quite clear that using match |
---|
0:11:20 | not training in this |
---|
0:11:22 | that's matching to the short test after |
---|
0:11:25 | gives us the best |
---|
0:11:26 | phone |
---|
0:11:27 | and in fact if we use |
---|
0:11:29 | full net |
---|
0:11:30 | training |
---|
0:11:31 | the referent |
---|
0:11:31 | system that's one without nap |
---|
0:11:33 | jumps in front in the longer duration |
---|
0:11:35 | sorry |
---|
0:11:36 | here we really wanna match to the net |
---|
0:11:38 | uh to the |
---|
0:11:39 | shorter |
---|
0:11:40 | test duration in than that trance |
---|
0:11:45 | and in that i was tied to the mice |
---|
0:11:47 | challenging trust |
---|
0:11:48 | so the short |
---|
0:11:52 | Now let's look at the short-short. This is an interesting case, because we observe that even though we match the NAP training dataset to the ten-second duration, we still find that the best performance comes from the baseline system, the one without NAP. |
---|
0:12:09 | So why is this? We were expecting the full NAP training to degrade performance, and it does, quite significantly, but matched NAP just isn't jumping in front of the baseline. |
---|
0:12:19 | So at some point NAP fails to provide a benefit under limited training and testing speech. |
---|
0:12:29 | So at what point? Here's a plot where we've matched the NAP training to the evaluation duration in the short-short condition — remember, this is the short-short condition, whereas in the full-short we actually got more of a benefit out of NAP. |
---|
0:12:43 | We see that just below the forty-second mark is where the reference system jumps in front of the NAP-compensated one. |
---|
0:12:53 | So why is this happening? Let's look at the variability in the kernel space. |
---|
0:13:00 | We found that NAP wasn't quite robust to limited training and testing speech. In the context of JFA systems, the session subspace variation was found to increase as the length of the training and testing utterances is reduced. |
---|
0:13:18 | So we're going to see if the same happens in the SVM kernel space. |
---|
0:13:25 | On this slide we have a table with a number of durations for the short-short condition, using the standard MAP relevance factor, |
---|
0:13:39 | and we're presenting the total variability in the speaker space and the session space of the SVM kernel. |
---|
0:13:49 | We actually see that, in contrast to what was observed with JFA, we're getting a reduction in both of these spaces as the duration is decreased. |
---|
0:13:58 | Now, why is this the case? What is the difference here? |
---|
0:14:02 | What we did was take an inconsequential tau — a relevance factor close to zero — so that the supervectors have more room to move. |
---|
0:14:11 | We then find that we do in fact agree with the JFA observations, in that we get a greater magnitude of variation in each of these cases once we change the relevance factor to close to zero. |
---|
0:14:27 | So here we can see that the MAP adaptation relevance factor has a significant influence on the observable variation in the SVM kernel space; that's just something to be aware of. |
---|
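The dependence on the relevance factor follows directly from the relevance-MAP mean update; written out (the usual formulation is assumed here, since the exact variant isn't given in the talk):

```latex
\hat{\mu}_c = \alpha_c \, \frac{\sum_t \gamma_c(t)\, x_t}{n_c} + (1 - \alpha_c)\, \mu_c^{\mathrm{UBM}},
\qquad
\alpha_c = \frac{n_c}{n_c + \tau}, \qquad n_c = \sum_t \gamma_c(t)
```

With only a few seconds of speech the occupancies n_c are small, so a conventional tau keeps alpha_c near zero and pins the adapted means to the UBM, compressing the variation observed in the supervector space; as tau approaches zero, alpha_c approaches one for any component with non-zero occupancy and the supervectors are free to move, consistent with the behaviour just described.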
0:14:37 | Now, what's interesting is that irrespective of the tau we use, we get a very similar session-to-speaker ratio, so the session variation that comes out is more dominant as the duration is reduced — and of course this is why speaker verification is more difficult with shorter speech segments. |
---|
0:15:01 | So why, then, if we're getting more session variation, is NAP struggling to estimate it as we reduce the duration? |
---|
0:15:10 | Let's look at this figure. We have the magnitude of session variability and speaker variability in the top one hundred eigenvectors estimated by NAP, for durations of eighty seconds and ten seconds. |
---|
0:15:26 | The solid lines are eighty seconds, the dotted ones are ten seconds, and session variability is the black line. |
---|
0:15:33 | The first thing we notice is that when we have longer durations of speech, a large portion of the session variation is concentrated in the leading eigenvectors, so we have more session variation that can be represented in a low number of dimensions. |
---|
0:15:48 | Whereas as the duration reduces, the curve flattens out and the session variation becomes a bit more isotropic. In contrast, the speaker variation slope is actually quite similar. |
---|
0:16:03 | This aligns with the table we just saw, where the session variation was the dominant component. |
---|
0:16:12 | NAP was developed on the assumption that the majority of session variation lies in a low-dimensional space. |
---|
0:16:19 | So our understanding is that, because of the more isotropic session variation that comes about with these reduced utterances, that assumption no longer holds, and this is why NAP is unable to offer a benefit in the short-short condition. |
---|
0:16:38 | So how do we overcome this problem? We're still working on that. |
---|
0:16:45 | Next we move on to score normalisation. I'll go through this quite quickly because everyone knows what score normalisation is — think of the last few presentations. |
---|
0:16:55 | It basically corrects statistical variation in classification scores, and it attempts to scale the scores from a given trial against an impostor score distribution, |
---|
0:17:04 | using Z-norm and T-norm, the train-centric and test-centric approaches respectively. And again we use an impostor cohort, which is something we need to select wisely. |
---|
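As a reminder of the two normalisations named here, a minimal sketch follows; the cohort scoring itself, and the choice of cohort that the talk is concerned with, are assumed to happen elsewhere.

```python
import numpy as np

def z_norm(raw_score, model_vs_impostor_utterance_scores):
    """Z-norm (model/train-centric): normalise by how the claimant model scores
    against a cohort of impostor test utterances."""
    mu = np.mean(model_vs_impostor_utterance_scores)
    sigma = np.std(model_vs_impostor_utterance_scores)
    return (raw_score - mu) / sigma

def t_norm(raw_score, test_vs_impostor_model_scores):
    """T-norm (test-centric): normalise by how the test utterance scores
    against a cohort of impostor models."""
    mu = np.mean(test_vs_impostor_model_scores)
    sigma = np.std(test_vs_impostor_model_scores)
    return (raw_score - mu) / sigma
```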
0:17:17 | score normalisation cohorts should match the evaluation conditions |
---|
0:17:21 | the context the |
---|
0:17:22 | S P Ns we want an R |
---|
0:17:24 | how important is it to match these |
---|
0:17:26 | uh conditions |
---|
0:17:27 | and how much to score normalisation X |
---|
0:17:29 | benefit us when we have limited space |
---|
0:17:34 | In this table we've got the full-short condition in the second row and the short-short condition down the bottom, looking at the ten-second condition in particular. |
---|
0:17:44 | We have three different cohort selection methods: 'none', in which no scores are normalised; 'full', which means that both the Z-norm and T-norm cohorts use the full two and a half minutes of speech; and then 'matched'. |
---|
0:17:57 | In the case of the full-ten-second condition, matched simply means that the Z-norm data are truncated to ten seconds, whereas in the ten-second-ten-second case both the Z-norm and T-norm data are truncated. |
---|
0:18:12 | It's quite obvious that the full cohorts give us the worst performance, as we can see, and that matched cohorts offer the best. |
---|
0:18:22 | That's quite elementary, but the interesting observation here is that the relative performance gain from applying score normalisation seems quite minimal. |
---|
0:18:34 | So the question is: at what point is it worth going about choosing a score normalisation set to try and help performance? |
---|
0:18:45 | To try and help answer that question, we looked at the relative gain in min DCF that score normalisation provides as we reduce the duration of speech. |
---|
0:18:56 | We see that with the full and eighty-second conditions we gain around ten percent, which is quite reasonable. |
---|
0:19:01 | It's in the lower durations of speech, five and ten seconds, where we've got less than two percent relative gain. |
---|
0:19:07 | Is that really worth the effort of trying to choose a good normalisation set, with the risk that the normalisation set might not actually be chosen well and might reduce performance? That's another question as well. |
---|
0:19:22 | In conclusion, we've investigated the sensitivity of the popular SVM system to reduced training and testing segments. |
---|
0:19:29 | We found the best performance came from selecting a background dataset matched to the test utterance or to the shortest utterance duration, depending on whether you want to optimise the DCF or the equal error rate. |
---|
0:19:40 | NAP transforms trained on data matching the shortest utterance duration gave the best performance, and score normalisation cohorts matched to the evaluation conditions were also the best. |
---|
0:19:51 | We highlighted an issue with NAP when dealing with limited speech, and this is due to the session variability becoming more isotropic as the speech duration is reduced. |
---|
0:20:00 | And score normalisation provided very little gain in the shortest conditions. Thank you. |
---|
0:20:17 | Thank you for that — a systematic investigation into the effects of duration. |
---|
0:20:27 | As far as I can see, there was Patrick's presentation this morning, which I'm not sure whether you were here for? |
---|
0:20:37 | Not for that one, no. |
---|
0:20:39 | I think Patrick's observations this morning give a nice explanation of what you see. The short explanation is that if you're using relevance MAP, then you're introducing speaker-dependent within-speaker variability — that's what Patrick referred to in the original presentation. So do you agree with me that that perhaps explains what you see? |
---|
0:21:24 | I'll have to look further into that presentation before I can comment — I wasn't at that one. |
---|
0:21:29 | Are there any other questions? |
---|
0:21:37 | My question is about the U matrix for NAP — you do relevance MAP and then, maybe, PCA on that information? |
---|
0:21:49 | Sorry, I'm not quite following. |
---|
0:21:52 | Sorry — my question is regarding how you learn the U matrix that you use to project away. You're doing relevance MAP — MAP adaptation — and then you're computing PCA on your centred supervectors, something like that? |
---|
0:22:13 | I know that to estimate the U matrix we do some kind of PCA, going to a lower-dimensional space for computational reasons, but then we go back to the original space. |
---|
0:22:24 | Right, so my question is this: what you may be missing when you learn that U matrix, if you're just doing a regular PCA — computing a low-dimensional approximation of your pooled adapted supervectors, a low-rank approximation of that matrix, which is basically what PCA does — is that you're not taking the counts into account. When you do a proper factor analysis, you use the counts somehow to weight the amounts of information in different parts of the supervector. |
---|
0:23:04 | So my question is mostly: are you somehow incorporating the information that when you have a lot of Gaussians and very few data points, not all the Gaussians get assigned points, and then when you train your subspace, the subspace does not know that? Maybe that accounts for a lot of these observations. Are you aware of that? |
---|
0:23:28 | I understand your point, actually. I don't believe we're explicitly taking into account the fact that some Gaussians might miss out on adaptation, and yes, I can understand what you're saying — it might have an effect on that. |
---|
0:24:01 | Um, I'm a little unsure about this. I mean, you do these studies because you want to see what works best, but you also want to understand why it works best. |
---|
0:24:16 | So what you're doing, as I understand it, is this MAP process to get Gaussians, and then you're comparing the means of some training Gaussians you got with MAP against some test Gaussians you got with MAP, using the same UBM, and if it's not the same amount of data, things go wrong, basically. And the solution you're applying is to simply make them the same length. |
---|
0:24:45 | It would seem like — well, you did that study without normalisation. Of course, one of the things normalisation is there for, as you said, is to deal with differences, like duration differences. I'm wondering whether, by doing it without normalisation, you were sort of creating the worst possible condition — one that would otherwise be fixed, or reduced — and your solution ended up being to discard data. |
---|
0:25:15 | So the first question, I guess, is: when you truncated the training samples, did you literally just discard the rest of the data, or did you create additional short training utterances out of it? |
---|
0:25:26 | We just discarded it. |
---|
0:25:28 | Okay. So one obvious thing is that if you take a thirty-second utterance and truncate it to ten seconds, it would be wasteful not to use the other twenty seconds as two more ten-second utterances. But besides that observation, I'm worried that if you had used normalisation you might have fixed the problem to begin with. |
---|
0:25:50 | We did actually run these experiments with score normalisation as well, but we found basically similar trends. We wanted to get back to a very basic system just to help, I guess you'd say, the reader's understanding and the flow of the paper. |
---|
0:26:06 | I'm hearing in many papers, especially today, a strong desire on everyone's part to find a way to do things without normalisation, as if normalisation were somehow a bad thing. |
---|
0:26:20 | It seems to me that, beyond the obvious thing that you have to model the speech, normalisation is almost the only other thing there is, in a very high-level sense. After all, we're doing some kind of hypothesis test — verification — and that inherently requires knowing how to set a threshold, which requires some kind of normalisation. |
---|
0:26:47 | To the extent that we try to get away from that, we're tying our hands behind our backs. I mean, it's good to look for methods that are inherently better, but I guess I would say we should still do normalisation whenever it can be done properly. |
---|
0:27:18 | Well, my claim was that it's good to look for better models, but I don't understand the desire to do away with normalisation. It seems like normalisation is at the crux of the problem, and ultimately fixes whatever else you do wrong. |
---|
0:27:53 | Yes, normalisation does exactly that. What we are unhappy with is that we did do something wrong in the first place, so we're trying to do that a bit better, and if we then find it's still not perfect, then I'm sure we will keep on normalising. |
---|
0:28:14 | The other way to look at it is that normalisation is just another modelling stage. Extracting the MFCC features is modelling the acoustic signal, then the GMMs model the MFCCs, then — as with the i-vectors again this morning — the GMM supervectors are modelled, and at the end there's a score modelling stage. |
---|
0:28:43 | So in the end you're just adding more modelling stages; it might be nice to reduce the number of stages, but probably we might just go on normalising forever. |
---|
0:28:59 | Can we have the next speaker, please? |
---|