0:00:35 | well, this is work on source normalization |
0:00:39 | and |
0:00:40 | you can observe |
0:00:42 | there are three authors here, and the first author did the actual work |
0:00:47 | yeah, he developed the general idea of source normalization |
0:00:52 | that was, i think, right after |
0:00:55 | the previous evaluation |
0:00:57 | now |
0:01:01 | if after this talk you think "this is fantastic, i'm going to implement this |
0:01:06 | tomorrow or next week" |
0:01:08 | then all you need to know is: he has done it already |
0:01:13 | and he made the slides as well |
0:01:17 | yeah |
0:01:18 | i'm very happy with that. if, on the other hand, you afterwards think "i didn't |
0:01:23 | get it" or "why didn't i think of this before", then this is probably due to |
0:01:27 | me not being able to convey the message |
0:01:31 | and if you beforehand thought the same thing |
0:01:36 | then you're sort of even with what we have |
0:01:40 | right |
0:01:41 | anyway |
0:01:42 | so |
0:01:45 | this is a sort of automatically generated summary of my presentation today |
0:01:52 | which i think is kind of pointless in this particular case because |
0:01:55 | it contains lots of ironies that can only be explained later, so let's skip it |
0:02:00 | moving to |
0:02:01 | the motivation for this work. the idea is that in speaker recognition |
0:02:07 | we all know that in the evaluations, where things change from year to year, we |
0:02:12 | often |
0:02:13 | get into the situation where we get new data we haven't seen before. sitting here, |
0:02:18 | well, yeah |
0:02:20 | noisy data, who knows what kind of noise; maybe some people know |
0:02:24 | but most of us don't |
0:02:27 | and |
0:02:29 | well |
0:02:30 | how are we going to deal with that? |
0:02:33 | i don't know |
0:02:35 | i sometimes have to make |
0:02:37 | speculations and say what i'm going to talk about |
0:02:40 | every once in a while that actually turns out to be rubbish, i guess, because |
0:02:44 | i haven't seen the data it's going to come from |
0:02:46 | but anyway, the basic idea is that you get conditions, definitely, where train |
0:02:52 | and test |
0:02:53 | are of a different kind. you would have liked to see... |
0:02:57 | you would like to have seen this before. but what do you do |
0:03:01 | i don't know |
0:03:02 | if you're... |
0:03:03 | if you know that you won't have seen it? |
0:03:06 | and one way of dealing with that is this idea of source normalization |
0:03:12 | i'll try to explain |
0:03:13 | the basic idea of source normalization |
0:03:17 | oh, here are some slides about i-vectors; i think i'll skip |
0:03:21 | these two, since you probably understand them much better than i do |
0:03:25 | the basic idea, the way we view the i-vector in this particular presentation, is that it's a |
0:03:31 | very low-dimensional representation |
0:03:33 | of the entire utterance |
0:03:35 | containing |
0:03:36 | apart from speaker information, other information as well |
0:03:41 | essential to the idea of source normalization is that one of the steps we do |
0:03:46 | in the standard |
0:03:48 | approach is this whitening by the within-class covariance: |
0:03:52 | within-class covariance normalization |
0:03:56 | for the PLDA |
0:04:00 | and that's what needs to be changed |
0:04:02 | with the data in the training |
0:04:06 | the within-class and between-class |
0:04:08 | scatter matrices |
0:04:11 | are computed |
0:04:12 | and that's where the source normalization takes place |
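For reference, the standard WCCN step mentioned here can be sketched as follows. This is an illustrative numpy version, not the authors' code; the function and variable names are mine.

```python
import numpy as np

def wccn(ivectors, speaker_ids):
    """Within-class covariance normalization (WCCN).

    Estimates the within-speaker covariance W from labelled training
    i-vectors and returns a transform B (from the Cholesky factor of
    W^-1) such that the transformed vectors x @ B have within-class
    covariance approximately equal to the identity.
    """
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    n = 0
    for spk in np.unique(speaker_ids):
        x = ivectors[speaker_ids == spk]
        xc = x - x.mean(axis=0)            # centre per speaker
        W += xc.T @ xc
        n += len(x)
    W /= n                                 # average within-speaker covariance
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B                               # apply as: x_norm = x @ B
```

The transform is estimated once on labelled development data and then applied to every i-vector before scoring, which is exactly where the source labels enter in this talk.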
0:04:18 | so here i note that we actually need to estimate those scatter matrices |
0:04:25 | so this is the mathematics, just to stay in line with the previous talks, so as |
0:04:29 | to have at least some mathematics up on the screen |
0:04:34 | this is the expression for the within- |
0:04:37 | speaker scatter matrix |
0:04:40 | and this is what the source normalization is going to |
0:04:44 | try to estimate in a better way |
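The expression on the slide is presumably the usual within-speaker scatter estimate (notation mine: $\mathbf{w}_i^{s}$ is the $i$-th i-vector of speaker $s$, $\bar{\mathbf{w}}_s$ the speaker mean):

```latex
S_w = \sum_{s=1}^{S} \sum_{i=1}^{n_s}
      \left(\mathbf{w}_i^{s} - \bar{\mathbf{w}}_s\right)
      \left(\mathbf{w}_i^{s} - \bar{\mathbf{w}}_s\right)^{\mathsf{T}}
```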
0:04:48 | because what is the... what is the |
0:04:50 | problem with WCCN in this particular |
0:04:54 | matter? the issue is that not all |
0:04:58 | relevant kinds of variation are observed in the training data |
0:05:05 | and this is more often true if you don't have enough |
0:05:09 | data |
0:05:11 | so here is another graphical representation of what typically happens. here we look at a |
0:05:17 | specific |
0:05:19 | kind of data, where the label of the data in mind, say, is |
0:05:25 | the language. so you have lots of english-language data, and every once in a while |
0:05:30 | we get some tests |
0:05:33 | where the language is not english |
0:05:35 | we had that, i think, in two thousand |
0:05:37 | six |
0:05:38 | and before |
0:05:40 | and also two thousand eight contained some |
0:05:43 | so who knows what you'll get in two thousand twelve |
0:05:46 | so maybe language itself is not so relevant for the current evaluation, but it |
0:05:51 | is a good example of where things |
0:05:53 | change |
0:05:55 | an important point |
0:05:56 | here is that even if we have some training data for this |
0:06:00 | we will not have, for all speakers, |
0:06:05 | the different languages. so typically |
0:06:09 | the speakers are decoupled from the language: for some language you have some speakers |
0:06:13 | and for another language you have other speakers |
0:06:15 | so you have the problem that, in the end, in your |
0:06:19 | recognition you have to compare one segment in one language |
0:06:23 | with a segment in the other language, where the case might be that it's actually the same speaker |
0:06:30 | so what i've shown here is why this kind of |
0:06:36 | difference in language labels is going to |
0:06:39 | influence this |
0:06:41 | within- |
0:06:43 | speaker, within-class scatter matrix |
0:06:46 | so this is one way of viewing how the |
0:06:49 | i-vectors might be distributed in this... |
0:06:52 | in this way |
0:06:55 | and |
0:06:56 | how it's used |
0:06:59 | so |
0:07:00 | these |
0:07:01 | three big circles denote the different sources; in this case a source |
0:07:06 | might be a language |
0:07:08 | each with its own mean, and there's a global mean, which would be the mean i-vector, |
0:07:14 | i guess |
0:07:15 | and then we have some speakers. so for one speaker you have a little bit of |
0:07:17 | variability, and he comes from one source |
0:07:20 | and another speaker, she, comes from another source, and we have |
0:07:25 | also a few speakers in the last source |
0:07:28 | you can imagine that if you're going to compute the between-speaker variation, you actually |
0:07:35 | add in a lot of between-source variation, and that's probably not a good thing |
0:07:40 | because what you want to |
0:07:41 | know is the difference between speakers, not between sources |
0:07:46 | so |
0:07:47 | the WCCN is going to |
0:07:51 | do this normalization |
0:07:52 | based on this information |
0:07:57 | and related to this |
0:07:59 | is, i'd say, that the source variance |
0:08:03 | is not correctly |
0:08:06 | observed. the variance between sources |
0:08:09 | is not explicitly |
0:08:12 | modelled |
0:08:14 | so that's another problem for WCCN |
0:08:20 | so |
0:08:21 | this slide summarises again |
0:08:25 | what the problems are |
0:08:27 | now let's move to the solution. i think this is much more interesting: to see |
0:08:32 | how we tackle this problem. these sources that hang around |
0:08:37 | have |
0:08:39 | globally different means in this i-vector space. the solution is very simple: compute |
0:08:45 | these means |
0:08:46 | for every source |
0:08:50 | so here you look at the |
0:08:52 | scatter matrix |
0:08:56 | conditioned on the source |
0:08:58 | we simply compute the mean for every source |
0:09:02 | and before computing the scatter matrix |
0:09:04 | we subtract these means |
0:09:06 | so the effect basically is that you shift |
0:09:09 | all these three |
0:09:11 | sources... this idea originally comes from, like, the difference between microphone |
0:09:16 | and telephone data |
0:09:18 | but we apply it to languages |
0:09:20 | and more |
0:09:21 | you subtract the mean per |
0:09:24 | label, per language |
0:09:26 | and then this scatter matrix will be estimated better. the mathematics then will say, |
0:09:32 | okay |
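A sketch of this per-source mean removal, in illustrative numpy (names are mine, and the exact estimator in the published work may differ in normalisation details):

```python
import numpy as np

def source_normalized_scatter(ivectors, speaker_ids, source_ids):
    """Source-normalized scatter estimation (sketch).

    Per-source means (e.g. per-language means) are removed before
    accumulating the between-speaker scatter, so that differences
    between sources do not inflate it.  The remaining class scatter
    is then taken as the difference from the total variability.
    """
    X = np.asarray(ivectors, dtype=float)
    spk = np.asarray(speaker_ids)
    src = np.asarray(source_ids)
    dim = X.shape[1]

    # Centre each source (language) on its own mean.
    Xc = X.copy()
    for s in np.unique(src):
        Xc[src == s] -= X[src == s].mean(axis=0)

    # Between-speaker scatter computed on source-centred data.
    S_b = np.zeros((dim, dim))
    for p in np.unique(spk):
        m = Xc[spk == p].mean(axis=0)
        S_b += (spk == p).sum() * np.outer(m, m)

    # Total scatter about the global mean; the within-class scatter
    # then follows as the difference from total variability.
    Xg = X - X.mean(axis=0)
    S_tot = Xg.T @ Xg
    S_w = S_tot - S_b
    return S_w, S_b
```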
0:09:34 | that's very nice for the within-... |
0:09:36 | the within-class variation |
0:09:40 | but we still have the between-class variation |
0:09:46 | well, we'll just estimate that as the difference from the total variability |
0:09:52 | or maybe it was |
0:09:52 | the other way around |
0:09:54 | but it doesn't matter: the idea is that you can compensate one |
0:09:57 | scatter matrix, and because you have the total variability |
0:10:01 | you can compute the other as the difference from |
0:10:03 | the total variability |
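In symbols (notation mine): since the total scatter is fixed, the two class scatters trade off, so whichever one is estimated with source-normalized means, the other follows as the difference:

```latex
S_{\mathrm{tot}} = S_w + S_b
\qquad\Longrightarrow\qquad
S_w = S_{\mathrm{tot}} - S_b^{\mathrm{sn}}
```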
0:10:08 | one thing i'd like to stress |
0:10:10 | is that you only need the language labels (the source labels; here applied to language) |
0:10:16 | for the development set |
0:10:19 | so you need the labels for your development data |
0:10:22 | when you're training your system you have all kinds of labels in your data; in |
0:10:25 | this case we consider the |
0:10:26 | language label |
0:10:28 | but in applying the system you do not need the language labels |
0:10:33 | because they are only used to make a better |
0:10:37 | transform for this WCCN |
0:10:42 | how can you actually see that it works? well, one way of doing that is |
0:10:48 | to look at the distribution of i-vectors |
0:10:51 | after WCCN |
0:10:54 | when you |
0:10:54 | do not apply this source normalization technique; that's shown on the left |
0:10:59 | and here, in different colours, you see encoded the label that we want to |
0:11:04 | normalize away, in this case language; you see one colour per language |
0:11:11 | these |
0:11:13 | languages might be familiar to some of the people here |
0:11:18 | and what you see is that |
0:11:19 | the languages seem to occupy different places |
0:11:23 | this is, by the way, after a dimension reduction |
0:11:27 | to two dimensions |
0:11:28 | just for viewing purposes |
0:11:32 | and you see that with this language normalization, this |
0:11:35 | source normalization by language |
0:11:39 | all these different labels get to be much more similar |
0:11:43 | so the basic assumptions that |
0:11:47 | i-vector systems are based on |
0:11:51 | should hold a little better |
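To reproduce that kind of plot, one can project i-vectors to two dimensions with a supervised reduction such as Fisher LDA; the talk does not say which method was used, so this is only a minimal numpy sketch of one plausible choice:

```python
import numpy as np

def lda_2d(X, labels):
    """Project vectors to 2-D with Fisher LDA, for visualising how
    label groups (e.g. languages) separate.  Illustrative only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Leading eigenvectors of Sw^-1 Sb give the most discriminant axes.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs[:, order[:2]].real
    return X @ W
```

Colouring the two returned coordinates by language label would give a picture like the one described here, before and after source normalization.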
0:11:53 | okay, now the system and results, because |
0:11:56 | we need to have tables |
0:11:57 | in the presentation |
0:12:00 | first, what kind of experiment can we do? |
0:12:06 | we use |
0:12:07 | the usual NIST databases for |
0:12:11 | the training |
0:12:13 | that the i-vector machinery makes use of |
0:12:15 | but we added one specific database, CallFriend |
0:12:19 | a very old database, used |
0:12:21 | at the start of |
0:12:23 | the first language recognition evaluations. so it contains |
0:12:26 | a variation of languages, twelve languages certainly |
0:12:32 | right |
0:12:36 | as for the evaluation data, we chose two data sets, from the NIST two thousand |
0:12:42 | ten |
0:12:42 | dataset and two thousand |
0:12:44 | eight. for two thousand ten you might think: why would you do that? there wasn't |
0:12:49 | actually much different-language |
0:12:52 | data apart from english in that one. but we use that for two purposes: one, for training |
0:12:58 | the calibration |
0:12:59 | ... as calibration data |
0:13:01 | and another reason is to see that what we do doesn't hurt |
0:13:06 | the basic english performance too much |
0:13:09 | the oh-eight data, of course, is going to be used as the test data |
0:13:14 | where there are trials from different languages |
0:13:19 | and there is also the standard |
0:13:21 | condition, english only, so that |
0:13:23 | we can compare |
0:13:25 | whether we actually hurt ourselves |
0:13:27 | this is... |
0:13:28 | the durations are all simple and standard |
0:13:30 | you |
0:13:31 | have seen these kinds of numbers before, i'd say, so there's nothing new here |
0:13:38 | these, then, are the breakdown numbers |
0:13:44 | per language |
0:13:46 | for the training data |
0:13:49 | then, finally, the results. now here |
0:13:52 | i'll try to explain: |
0:13:55 | if a number is |
0:13:56 | red |
0:13:57 | it means it is new |
0:13:59 | it doesn't mean it is better |
0:14:02 | bold figures mean better. and the first condition |
0:14:08 | shows |
0:14:13 | the performance on all trials |
0:14:15 | for sre oh-eight |
0:14:18 | measured in error rate |
0:14:22 | there's no calibration here yet |
0:14:26 | and you see these numbers go down, so for oh-eight it works when we |
0:14:31 | see some languages, i believe |
0:14:33 | okay |
0:14:34 | of course this also includes english |
0:14:38 | if we |
0:14:40 | look at english only, then the numbers do go up a little bit, so it does |
0:14:44 | hurt our system, but it doesn't hurt it |
0:14:47 | much |
0:14:50 | and the same for |
0:14:51 | ... as we can see |
0:14:54 | the system gets hurt a bit |
0:14:56 | but not much |
0:14:58 | that's the basic conclusion there |
0:15:01 | here we have a breakdown where we look at the non-english languages |
0:15:06 | from sre oh-eight |
0:15:09 | where we look at different conditions: are the two sides in the trials |
0:15:14 | the same language or a different language |
0:15:17 | and when is english involved |
0:15:20 | so the top row, which has to show the best performance because |
0:15:24 | it still contains |
0:15:26 | many english trials |
0:15:27 | is where the system works best |
0:15:31 | so that's the baseline |
0:15:34 | but this includes |
0:15:36 | both english and non-english. so if you break it down |
0:15:40 | for instance where you say, okay, i want a different language in the trial, suppose |
0:15:44 | that the target trials have a language |
0:15:46 | difference |
0:15:49 | then we see that the new figures, on the right |
0:15:53 | are slightly better than |
0:15:55 | the red ones |
0:15:56 | on the left |
0:15:59 | and the same holds for |
0:16:01 | other conditions. so you can specifically look at the non-english trials |
0:16:05 | where there's no other restriction |
0:16:09 | there it helps |
0:16:10 | and for the different-language |
0:16:12 | trials, where you actually restrict trials to be the same language, but non-english |
0:16:19 | it still helps. but there's one condition where, for whatever reason, it does not help |
0:16:26 | and that's a big difference |
0:16:28 | this is something we don't |
0:16:30 | understand, i |
0:16:31 | suppose |
0:16:32 | and that's for the non-english trials |
0:16:34 | where you specify that the trials are |
0:16:39 | different-language trials |
0:16:41 | so usually |
0:16:42 | it seems to work |
0:16:43 | except for one particular |
0:16:46 | place |
0:16:47 | where it doesn't |
0:16:50 | but i must say there are actually not too many trials there |
0:16:53 | it does not show in the graph very nicely |
0:16:58 | so i don't know how |
0:17:00 | accurate this measure is |
0:17:04 | now i'll move on to another aspect: calibration |
0:17:10 | our goal, also, with this kind of experiment, is |
0:17:15 | looking at |
0:17:16 | making the system |
0:17:16 | more robust |
0:17:18 | for languages |
0:17:19 | and we use a different measure |
0:17:21 | the measure used by the keynote speaker today |
0:17:26 | the cllr. and one way of looking at how |
0:17:32 | good, or how poor, your calibration is, is to look at the difference between the cllr and |
0:17:37 | the minimum attainable cllr that you can get |
0:17:42 | the minimum cllr, so to say |
0:17:44 | obtained by |
0:17:48 | a sort of |
0:17:50 | optimal recalibration |
0:17:56 | of the scores |
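Cllr itself is easy to state; a small sketch follows. The calibration loss discussed here is then Cllr minus the minimum Cllr, the latter obtained after an optimal monotonic rescaling of the scores (e.g. via the PAV algorithm, not shown here):

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr: average proper-scoring cost of log-likelihood-ratio
    scores, in bits.  A perfectly calibrated, perfectly
    discriminating system gives 0; llr = 0 everywhere gives 1."""
    t = np.asarray(target_llrs, dtype=float)
    n = np.asarray(nontarget_llrs, dtype=float)
    c_t = np.mean(np.log2(1.0 + np.exp(-t)))   # cost on target trials
    c_n = np.mean(np.log2(1.0 + np.exp(n)))    # cost on non-target trials
    return 0.5 * (c_t + c_n)
```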
0:17:59 | alright. so you have these two columns; the different names mean |
0:18:04 | mismatched and matched |
0:18:08 | i was actually thinking |
0:18:11 | we might have called the set "mismatched" |
0:18:16 | and "matched" differently |
0:18:18 | but that might be too hard for you guys |
0:18:22 | anyway |
0:18:23 | red is the |
0:18:26 | new thing that we try to promote here |
0:18:30 | and black is the old approach |
0:18:33 | and bold means better figures |
0:18:37 | we show these separately for female and |
0:18:42 | also |
0:18:43 | male |
0:18:49 | generally |
0:18:51 | for this mismatched condition (the big mismatch being that we calibrate on english only, to be |
0:18:56 | straight, on |
0:18:58 | sre ten for calibration, and we apply that to |
0:19:00 | sre eight) |
0:19:03 | it should perhaps be the other way around, but we did it that way in order to be |
0:19:06 | able to calibrate on english and test on other |
0:19:11 | languages as well |
0:19:13 | so in this particular |
0:19:15 | condition it works |
0:19:17 | always |
0:19:19 | in the matched condition, that is, only looking at english scores, where you have really |
0:19:24 | well-calibrated english scores |
0:19:26 | then |
0:19:28 | you see that it doesn't always help. there is one condition where it helps, though |
0:19:35 | the miscalibration itself |
0:19:37 | so the amount of |
0:19:39 | miscalibration |
0:19:40 | becomes less |
0:19:42 | you see that for calibration there is still some help |
0:19:46 | for english only |
0:19:48 | but for the discrimination figures it doesn't |
0:19:51 | help, however |
0:19:51 | however |
---|
0:19:53 | alright i hope that |
---|
0:19:55 | i |
---|
0:19:56 | explains the numbers well enough |
---|
0:19:59 | your first for the managers amongst |
---|
0:20:02 | yeah |
---|
0:20:03 | i just easier to draw at this |
---|
0:20:06 | the same time |
---|
0:20:07 | dataset |
---|
0:20:09 | calibration this is just miscalibration so this is just the amount of information by |
---|
0:20:14 | by not be able to |
---|
0:20:16 | produce proper likelihood ratios |
---|
0:20:20 | increases |
---|
0:20:22 | for |
---|
0:20:23 | the conditions where we applied is the language normalization |
---|
0:20:27 | oh |
---|
0:20:28 | but for english only trials you don't notice the difference |
---|
0:20:35 | so, my last slide |
0:20:38 | the conclusions are here |
0:20:40 | we used source normalization, which is a general framework, and i have to say it has been |
0:20:45 | applied before |
0:20:47 | you should be able to find |
0:20:50 | three or four |
0:20:54 | conference proceedings |
0:20:56 | papers |
0:20:57 | about this technique, applied to the |
0:21:02 | definition of source being microphone or interview or telephone |
0:21:07 | and we even applied it |
0:21:12 | i should say, to be |
0:21:13 | fair |
0:21:13 | to source being the sex of the speaker. so even though speakers generally |
0:21:19 | don't change sex |
0:21:21 | at least not within these evaluations |
0:21:24 | you can use this approach |
0:21:27 | to compensate for situations where you might not have enough data |
0:21:32 | so for telephone conditions this |
0:21:35 | didn't make much difference, but for conditions |
0:21:39 | where there wasn't really much data it did help to pool the male and female |
0:21:45 | i-vectors and make a single gender-independent |
0:21:48 | recognition system |
0:21:51 | and apply source normalization |
0:21:53 | where we said speaker sex is the label of the i-vector, and we normalized that way |
0:21:58 | and then in your recognition |
0:22:01 | you don't need the gender label anymore, so you can basically ignore the second column of your trial list |
0:22:10 | okay |
0:22:10 | but applied to languages it seems to work |
0:22:14 | decently |
0:22:17 | and it doesn't hurt the english trials too much, first of all |
0:22:23 | which |
0:22:23 | is also basically |
0:22:27 | the goal |
0:22:54 | yes, we stuck to the speaker case |
0:22:56 | and we did not try to use language as a means of discriminating speakers |
0:23:02 | in this |
0:23:02 | research. of course you could do that very well |
0:23:05 | but we think you |
0:23:07 | should take it as a challenge that you should be able to recognize speakers even if |
0:23:11 | the speaker speaks a different language than seen before |
0:23:15 | in the training |
0:23:19 | of course |
0:23:21 | you could make it easier by saying they are different speakers |
0:23:24 | if one is a he |
0:23:26 | and one is a she |
0:24:36 | yeah, and i remember calibration was one of the major problems |
0:24:41 | in two thousand six |
0:24:43 | where, you know, if you have non-english |
0:24:46 | data, the discrimination performance could actually be reasonable, but the calibration |
0:24:52 | was |
0:24:53 | poor |
0:24:54 | i'm not so sure that |
0:24:55 | that even holds for the systems |
0:24:58 | nowadays, though; the systems nowadays are |
0:25:01 | generally behaving better |
0:25:34 | no, i don't think that |
0:25:37 | that is what we want |
0:25:38 | to say. i think |
0:25:40 | what we want to say is that |
0:25:43 | with the channel... |
0:25:45 | the between-channel |
0:25:47 | variation |
0:25:49 | as estimated, part of the |
0:25:52 | variance, of the total variance |
0:25:55 | is due to the fact that |
0:25:57 | things have a different language |
0:26:01 | and you don't observe that in the within-speaker |
0:26:04 | variability |
0:26:06 | so it attributes only the within-language variability |
0:26:09 | to the |
0:26:11 | channel variability |
0:26:13 | and that is not adequate for |
0:26:16 | this case |
0:26:20 | of different languages for the same speaker |