0:00:15 | so as Niko mentioned, this is work mainly from last summer, and continuing on after |
---|
0:00:20 | the end of last summer |
---|
0:00:22 | primarily by Daniel, my colleague at Johns Hopkins |
---|
0:00:26 | but also with Stephen Shum, who will be talking next about a little bit different flavor |
---|
0:00:30 | and with Niko and Carlos |
---|
0:00:35 | so bear with me — Daniel, you'll like this — he put a lot of animation into the slides |
---|
0:00:40 | and usually i take it out, but there's so much in these that i |
---|
0:00:42 | really couldn't remodel them |
---|
0:00:44 | so i will try and do it with an animation style, which is not natural |
---|
0:00:47 | to me. so, we're trying to build a speaker recognition system which is state-of-the-art; how |
---|
0:00:51 | are we going to do that? well, it depends what kind of an evaluation we're going to |
---|
0:00:55 | run; we want to know what the data looks like that we're actually going to be working |
---|
0:00:58 | on |
---|
0:00:59 | and since normally, for example in an SRE, we know what that data is going |
---|
0:01:04 | to look like, we go to our big pile of previous data that the LDC |
---|
0:01:08 | has kindly generated for us |
---|
0:01:11 | we use this development data — we typically use very many speakers in very many labeled |
---|
0:01:16 | cuts — |
---|
0:01:17 | to learn our system parameters |
---|
0:01:20 | in particular what we call the across-class and within-class covariance matrices, which are the key |
---|
0:01:25 | things we need to make the PLDA work correctly |
---|
0:01:29 | and then we are ready |
---|
0:01:31 | to score our system and see what happens |
---|
0:01:36 | so the thought here for this workshop was |
---|
0:01:39 | what if we have this state-of-the-art system, which we have built for our SRE10 |
---|
0:01:43 | or SRE12 |
---|
0:01:45 | and someone comes to us with a pile of data which doesn't look like an |
---|
0:01:48 | SRE; what are we going to do? |
---|
0:01:53 | and the first thing — in this corpus that Doug put together, which is available, |
---|
0:01:57 | there are links to the lists from the JHU website — |
---|
0:02:03 | we found that there is in fact a big performance gap with the PLDA |
---|
0:02:07 | system, even with what seems like a fairly simple mismatch: namely, you train your parameters |
---|
0:02:12 | on Switchboard and you test on Mixer, or SRE10 |
---|
0:02:17 | and you can see that the green line |
---|
0:02:20 | which is a pure SRE system designed for SRE10, works extremely well, and |
---|
0:02:26 | the same algorithm trained only on Switchboard has three times the error rate |
---|
0:02:34 | so in the supervised domain adaptation that we attacked first, which Daniel presented at ICASSP |
---|
0:02:39 | we are given an additional data set: we have the out-of-domain Switchboard |
---|
0:02:46 | data, we have an in-domain Mixer set, and it's labeled, but it may |
---|
0:02:49 | not be very big; so how can we combine these two datasets |
---|
0:02:54 | to accomplish good performance on SRE data? |
---|
0:02:57 | the setup that we have used for these experiments is a typical i-vector system |
---|
0:03:05 | i think some people may do different things in this back part, but |
---|
0:03:09 | Daniel has convinced me that length norm with total covariance whitening is in |
---|
0:03:15 | fact the best, most consistent way to do it |
---|
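the whitening-plus-length-norm step described above can be sketched in a few lines of NumPy. this is a minimal illustration only, not the actual JHU code; the function names and the eigendecomposition route to the inverse square root of the total covariance are my own choices.

```python
import numpy as np

def train_whitener(ivectors):
    """Learn a whitening transform from an unlabeled pile of i-vectors,
    using the total covariance of the data (no speaker labels needed)."""
    mean = ivectors.mean(axis=0)
    cov = np.cov(ivectors - mean, rowvar=False)
    # Inverse square root of the total covariance via eigendecomposition.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mean, W

def length_norm(ivectors, mean, W):
    """Whiten, then project every i-vector onto the unit sphere."""
    x = (ivectors - mean) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```

because the whitener only needs an unlabeled pile, it can be retrained per domain, exactly as the talk describes a moment later.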
0:03:20 | typical system parameters: |
---|
0:03:24 | the LDA is typically four hundred or six hundred dimensions in our experiments |
---|
0:03:29 | and the important point to emphasize here is that the i-vector extractor doesn't |
---|
0:03:34 | need any labeled data, so we call that unsupervised training |
---|
0:03:38 | the |
---|
0:03:39 | length norm also is unsupervised |
---|
0:03:42 | and the PLDA parameters are the ones where we need the speaker labels; that's |
---|
0:03:46 | the harder data to find |
---|
0:03:56 | and |
---|
0:03:57 | in these experiments we found that we can always use Switchboard for the i-vector extractor |
---|
0:04:01 | itself; we don't need to retrain that every time we go to a new domain |
---|
0:04:05 | which is a tremendous practical advantage |
---|
0:04:08 | the whitening parameters can be trained specifically for whatever domain you're working in, which is |
---|
0:04:13 | not so hard to do either, because you only need an unlabeled pile of data |
---|
0:04:17 | to accomplish that |
---|
0:04:18 | and then i want to focus on the adaptation part, the covariance matrices; |
---|
0:04:23 | that was the biggest challenge for us |
---|
0:04:28 | in principle |
---|
0:04:29 | at least with a little bit of simplistic math, if we have known |
---|
0:04:33 | covariance matrices, we can do the MAP adaptation that Doug has been doing in GMMs |
---|
0:04:39 | for a long time |
---|
0:04:40 | the original MAP derivation behind that is a conjugate prior for a covariance matrix |
---|
0:04:45 | and you end up with a sort of count-based regularization; if you configure your |
---|
0:04:49 | prior in a certain tricky way |
---|
0:04:51 | you end up with a very simple formula, which is a count-based regularization back to |
---|
0:04:56 | an initial matrix and towards the new-data sample covariance matrix; so that's what's |
---|
0:05:02 | shown here |
---|
0:05:03 | this is the in-domain covariance matrix |
---|
0:05:05 | and we're smoothing it back towards the out-of-domain |
---|
0:05:08 | covariance |
---|
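a sketch of that count-based smoothing, under the simple interpolation reading of the formula described above. the `relevance` parameter name and its default value are illustrative assumptions, not values from the talk.

```python
import numpy as np

def adapt_covariance(C_out, C_in, n_in, relevance=100.0):
    """Count-based MAP smoothing: interpolate the in-domain sample
    covariance back toward the out-of-domain one. With few in-domain
    observations we trust the out-of-domain matrix; with many we trust
    the in-domain estimate."""
    alpha = n_in / (n_in + relevance)  # interpolation weight in [0, 1)
    return alpha * C_in + (1.0 - alpha) * C_out
```

the alpha-sweep graph discussed next corresponds to varying this interpolation weight directly instead of deriving it from counts.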
0:05:10 | and what we showed earlier, in this first supervised adaptation, is that we can get very good performance; |
---|
0:05:15 | let's get used to this graph, i'm going to show a couple more: the red line |
---|
0:05:18 | at the top is the out-of-domain system, which has the bad performance, which is trained |
---|
0:05:22 | purely on Switchboard; the green line at the bottom |
---|
0:05:25 | is the matched in-domain system; that's our target if we had all of the |
---|
0:05:28 | in-domain data |
---|
0:05:29 | and what we're doing is taking various amounts |
---|
0:05:33 | of in-domain data |
---|
0:05:34 | to see how well we can exploit it; and even with a hundred speakers we |
---|
0:05:39 | can cut seventy percent of that gap |
---|
0:05:41 | with this adaptation process, and if we use the entire data we get the same |
---|
0:05:46 | performance, actually slightly better, by using both sets rather than just the in-domain set |
---|
0:05:53 | one of the questions with this is how do we set this alpha parameter; i |
---|
0:05:57 | mean, in theory, if we knew the prior exactly, it would tell |
---|
0:06:00 | us theoretically what it should be, but empirically |
---|
0:06:03 | the main point of this graph is that we're not very sensitive to it: if alpha |
---|
0:06:06 | is zero |
---|
0:06:07 | we're entirely the out-of-domain system, and it's always pretty bad |
---|
0:06:11 | if alpha is one, we're entirely trying to do an in-domain system; if we |
---|
0:06:15 | have almost no data in domain, we have very bad performance |
---|
0:06:18 | but as soon as we start to have data, that system is pretty good; but |
---|
0:06:21 | we're always better by staying in the middle somewhere and using both datasets |
---|
0:06:26 | using a combination |
---|
0:06:29 | now, this work — the theme is unsupervised adaptation; what that means is we no |
---|
0:06:34 | longer have labels for this pile of in-domain data |
---|
0:06:37 | so it's the same setup |
---|
0:06:39 | but now we don't have labels |
---|
0:06:46 | this means we want to do some kind of clustering |
---|
0:06:49 | and we found empirically, as i think people in the i-vector challenge seem to have |
---|
0:06:53 | found as well, that AHC, agglomerative hierarchical clustering, |
---|
0:06:56 | is a particularly good algorithm for this task, for whatever reason |
---|
0:07:00 | and you can measure clustering performance |
---|
0:07:02 | with — |
---|
0:07:03 | if you actually have the truth labels, you can evaluate a clustering algorithm by purity |
---|
0:07:08 | and fragmentation: purity being how pure your clusters are, and fragmentation being how much a speaker |
---|
0:07:13 | was accidentally distributed into other clusters |
---|
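one plausible way to compute those two numbers, given the truth labels and the hypothesized cluster labels. the talk does not spell out exact formulas, so these particular definitions are assumptions.

```python
from collections import Counter

def purity(pred, truth):
    """Fraction of items belonging to the majority speaker of their
    cluster (weighted average over clusters)."""
    clusters = {}
    for c, t in zip(pred, truth):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return correct / len(pred)

def fragmentation(pred, truth):
    """Average number of *extra* clusters each speaker was accidentally
    spread across (0 means every speaker landed in a single cluster)."""
    speakers = {}
    for c, t in zip(pred, truth):
        speakers.setdefault(t, set()).add(c)
    return sum(len(s) - 1 for s in speakers.values()) / len(speakers)
```

note the trade-off these two capture: splitting a speaker across clusters leaves purity perfect but raises fragmentation, while merging two speakers does the reverse.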
0:07:20 | one of the things we spent quite a bit of time on — in fact Daniel |
---|
0:07:23 | spent a lot of time making an i-vector averaging system — |
---|
0:07:26 | is what metric to use for the clustering: you're going to do hierarchical clustering, you're going to |
---|
0:07:30 | work your way up from the bottom, but what's the definition of whether |
---|
0:07:34 | two |
---|
0:07:35 | clusters should be merged? |
---|
0:07:38 | PLDA theory gives |
---|
0:07:41 | an answer, for |
---|
0:07:42 | a speaker hypothesis test that these two are the same speaker |
---|
0:07:46 | that's something that we've worked with in the past |
---|
0:07:49 | and then, you know, as soon as we started off this year, we saw that really |
---|
0:07:52 | doesn't work well at all, which is a little disappointing from a theoretical point of |
---|
0:07:56 | view; but we found that in the SREs as well, when we have multiple cuts, |
---|
0:08:00 | using the correct formula doesn't always work as well as we would like |
---|
0:08:05 | what we traditionally do in an SRE is i-vector averaging, which is pretending we have a |
---|
0:08:08 | single cut |
---|
0:08:10 | Daniel spent a lot of time on that this summer; then we found out that |
---|
0:08:14 | in fact the simplest thing to do, which is to compute the score between |
---|
0:08:17 | every pair of cuts, get a matrix of scores, and then never recompute any metrics, |
---|
0:08:23 | just average the scores, is in fact |
---|
0:08:26 | the best performing system; and it's also much easier because you don't have to get |
---|
0:08:30 | into your algorithm at all: you just pre-compute this distance matrix and feed it into |
---|
0:08:33 | off-the-shelf |
---|
0:08:35 | clustering software |
---|
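the score-averaging AHC described above can be sketched directly: precompute the pairwise PLDA score matrix once, then repeatedly merge the pair of clusters with the highest average raw score, never recomputing any i-vector statistics. this is a minimal brute-force illustration; as the talk says, the real system would feed the precomputed matrix into off-the-shelf clustering software.

```python
import numpy as np

def ahc_average_score(S, num_clusters):
    """Agglomerative clustering on a precomputed pairwise score matrix S
    (higher score = more similar), with average-score linkage."""
    n = S.shape[0]
    clusters = [[i] for i in range(n)]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the highest average pairwise score.
        best, bi, bj = -np.inf, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = np.mean([S[a, b] for a in clusters[i] for b in clusters[j]])
                if s > best:
                    best, bi, bj = s, i, j
        # Merge the winning pair; the raw scores are never recomputed.
        clusters[bi].extend(clusters.pop(bj))
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

for illustration it takes the number of clusters as an argument; the stopping criterion discussed next removes that requirement.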
0:08:38 | so just as a baseline we compared against k-means for clustering, with this |
---|
0:08:42 | purity and fragmentation, and the main point is this AHC with this scoring metric |
---|
0:08:48 | was in fact quite a bit better than k-means; so we're comfortable that it seems to |
---|
0:08:52 | be clustering in an intelligent way |
---|
0:08:56 | now we want to move towards doing it for adaptation, but the other thing we need |
---|
0:08:59 | to know is how do we decide when to stop clustering; how do we decide |
---|
0:09:03 | how many speakers are really there, because nobody has told us |
---|
0:09:06 | to do this, you have to eventually make a hard decision that you're going to stop |
---|
0:09:09 | merging, and basically you look at the two most similar clusters and you |
---|
0:09:14 | have to decide: are these from a different speaker, or are they the same? and you |
---|
0:09:17 | can make a hard decision |
---|
0:09:19 | and this is one of the |
---|
0:09:20 | nice contributions of this work, that was really done after the summer i think, |
---|
0:09:24 | where |
---|
0:09:25 | we just treat these scores as speaker recognition scores: we do calibration in |
---|
0:09:31 | the way that we do, and in particular |
---|
0:09:33 | this unsupervised calibration method that Daniel presented at ICASSP |
---|
0:09:38 | can be used exactly in this situation: we can take |
---|
0:09:42 | our unlabeled pile of data and look at all the scores across it to learn a |
---|
0:09:45 | calibration from that; we can actually know the threshold, and we can make a decision |
---|
0:09:49 | about when to stop |
---|
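the stopping rule can be sketched as follows: calibrate the raw score of the best remaining pair with an affine map learned without labels (the unsupervised calibration method the talk references; here `a` and `b` are simply assumed inputs), and stop merging once that pair looks like a different-speaker trial.

```python
def estimate_num_speakers(n_items, merge_scores, a, b):
    """merge_scores: raw score of the best remaining cluster pair at each
    successive AHC merge step. (a, b): affine calibration mapping raw
    scores to log-likelihood ratios, assumed learned without labels.
    Returns the implied number of clusters (speakers)."""
    clusters = n_items
    for s in merge_scores:
        if a * s + b < 0.0:  # best pair looks like a different-speaker trial
            break            # stop clustering here
        clusters -= 1        # accept the merge: same speaker
    return clusters
```

the zero threshold is the usual equal-prior log-likelihood-ratio decision point; the graph discussed next shows how close this estimate gets to the true speaker count.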
0:09:53 | so how well does that work? |
---|
0:09:56 | this is across our unlabeled pile, as we introduce bigger and bigger piles |
---|
0:10:02 | the dashed line is the correct number of clusters |
---|
0:10:07 | this is five random draws, where we draw random subsets and we average the |
---|
0:10:11 | performance; the blue is the average, which is the easiest one to see, and |
---|
0:10:15 | you can see in general this technique works pretty well: it always underestimates, typically by about |
---|
0:10:21 | twenty percent, so you think there are fewer speakers than there really are, but |
---|
0:10:25 | you're pretty close; and getting |
---|
0:10:27 | an automated and reliable way to actually figure out how many speakers there are — |
---|
0:10:31 | actually, we're — |
---|
0:10:32 | we're pretty excited to even do this well at what's a very hard task |
---|
0:10:36 | so, to actually do the adaptation then |
---|
0:10:40 | the recipe is: we use our out-of-domain PLDA |
---|
0:10:44 | to compute the similarity matrix of all pairs |
---|
0:10:49 | we then cluster the data using that distance metric |
---|
0:10:54 | estimate from this how many speakers there are, and the speaker labels |
---|
0:10:58 | generate another set of covariance matrices from this self-labeled data |
---|
0:11:02 | and then we apply our adaptation formulas |
---|
0:11:05 | on this data |
---|
0:11:11 | so here's a similar curve as before: the red here is |
---|
0:11:16 | the out-of-domain system, and the in-domain system is in green at the bottom |
---|
0:11:20 | and |
---|
0:11:21 | what we're showing here |
---|
0:11:23 | is the AHC |
---|
0:11:25 | adaptation |
---|
0:11:28 | performance, and the supervised |
---|
0:11:30 | adaptation |
---|
0:11:32 | where the x-axis is the number of speakers — |
---|
0:11:35 | no, sorry, the supervised adaptation is the one i showed before |
---|
0:11:38 | excuse me |
---|
0:11:39 | so that's if you have true labels |
---|
0:11:41 | for all of the data; that's what i showed you the first time; now, by |
---|
0:11:44 | self-labeling |
---|
0:11:46 | of course we're not as good |
---|
0:11:47 | but we are in fact much better than we ever thought we could be, because |
---|
0:11:50 | when we first set up this task we really didn't think — |
---|
0:11:53 | in fact Daniel and i had a little bet, and he was convinced that this was |
---|
0:11:56 | never going to work, because how are you going to learn your parameters from a system that |
---|
0:11:59 | doesn't know what your parameters are? but in fact you can |
---|
0:12:02 | so we've done surprisingly well with self-labeling |
---|
0:12:05 | and we're still able to close eighty-five percent of the performance gap: if |
---|
0:12:08 | we have all the data, but it's unlabeled, we're still able to recover |
---|
0:12:12 | almost all the performance |
---|
0:12:16 | now, what if we did know the number of clusters? |
---|
0:12:19 | so if we had an oracle that told us it is exactly this many speakers |
---|
0:12:23 | would that make our system perform better? so that's the additional |
---|
0:12:27 | bar here, and in fact |
---|
0:12:30 | our estimation of the number of speakers is good enough, because even had we known |
---|
0:12:34 | it exactly, we're going to get |
---|
0:12:36 | almost the same performance |
---|
0:12:38 | so even though we didn't get exactly the correct number of speakers, the hyperparameters that |
---|
0:12:42 | we have estimated still work just as well |
---|
0:12:48 | and that's illustrated in this way |
---|
0:12:50 | which is the sensitivity to knowing the number of clusters: so here we're using all |
---|
0:12:54 | the data; the actual number of speakers is here, and this is what we estimated |
---|
0:12:58 | with our stopping criterion |
---|
0:13:00 | and you can see, as we sweep across — if we had |
---|
0:13:03 | stopped at all of these different points and decided that was how many speakers there |
---|
0:13:07 | were — |
---|
0:13:08 | there's not a tremendous sensitivity: if we massively over-cluster then we take a big |
---|
0:13:12 | hit in performance, and if we massively under-cluster it is bad, but there's a |
---|
0:13:16 | pretty big flat region |
---|
0:13:18 | where we get almost the same performance with our hyperparameters if we had |
---|
0:13:23 | stopped our clustering at that point |
---|
0:13:28 | so, in conclusion then |
---|
0:13:31 | domain mismatch can be a surprisingly difficult problem in state-of-the-art systems using PLDA |
---|
0:13:38 | and |
---|
0:13:39 | we demonstrated supervised adaptation could work quite well, but in fact |
---|
0:13:42 | unsupervised adaptation also works extremely well |
---|
0:13:47 | we can close eighty-five percent of the performance gap due to the domain mismatch |
---|
0:13:51 | in order to do that we need to do this adaptation: we need to use |
---|
0:13:54 | both the out-of-domain parameters and the in-domain parameters, not just the self-labeled in- |
---|
0:13:59 | domain data |
---|
0:14:00 | and this unsupervised calibration trick |
---|
0:14:04 | in fact gives us a useful and meaningful stopping criterion for figuring out how many |
---|
0:14:08 | speakers are in our data |
---|
0:14:10 | thank you |
---|
0:14:21 | time for questions |
---|
0:14:30 | i was wondering — i can imagine that the distribution of speakers |
---|
0:14:35 | or basically the number of segments per speaker |
---|
0:14:40 | of your unsupervised set |
---|
0:14:43 | will make a difference, right? i guess you get this from the SRE or |
---|
0:14:48 | whatever, Switchboard data, so it |
---|
0:14:51 | will be relatively homogeneous; is that correct, or — |
---|
0:14:56 | i think — yes — actually, these are not homogeneous, but this is a |
---|
0:15:00 | good pile of unlabeled data, because in fact it's the same pile that we used |
---|
0:15:04 | as a labeled data set |
---|
0:15:06 | so it's pretty much everything we could find from these speakers: some of them have |
---|
0:15:10 | very many phone calls, some of them have fewer |
---|
0:15:13 | but all of them have quite a few in order to be in this pile |
---|
0:15:16 | obviously, for example, you couldn't learn any within-class covariance if you only had one |
---|
0:15:20 | example from each speaker |
---|
0:15:22 | hidden in that pile; so you're absolutely right, it's not just that we do the |
---|
0:15:27 | labeling, it's also that the pile itself has some richness in order for us to |
---|
0:15:30 | discover it |
---|
0:15:34 | before we pass on the microphone, i have a related question |
---|
0:15:40 | when you train the i-vector extractor, the nice thing is that you can do it |
---|
0:15:44 | unsupervised |
---|
0:15:47 | but again, how many cuts per speaker? if we had only one speaker with |
---|
0:15:51 | many cuts, obviously that's not good, because we don't get the speaker variability |
---|
0:15:56 | in the converse situation, where you have every speaker only once |
---|
0:16:01 | with enough duration, would that give a good i-vector extractor? |
---|
0:16:07 | i don't think that's something we looked at, but i — |
---|
0:16:10 | i completely agree that would make me uncomfortable; as i said, in this effort we |
---|
0:16:15 | just were able to show that the out-of-domain data — which we assume we do have, |
---|
0:16:19 | a good labeled set somewhere in some domain that we can use — we were able |
---|
0:16:23 | to use that the whole time, so we're comfortable where it came |
---|
0:16:26 | from; i don't think i've ever run an experiment with |
---|
0:16:29 | what you say, and that is interesting; i suspect it would not work so |
---|
0:16:33 | well |
---|
0:16:34 | you would get both kinds of variability, since it comes from a variety of channels — |
---|
0:16:39 | speaker variability and channel variability — just not in quite the same proportions as you |
---|
0:16:44 | get in this data |
---|
0:16:47 | if you collect data in the wild |
---|
0:16:50 | in a situation where there are very many speakers |
---|
0:16:54 | you might have data like that; so i think that's an interesting question too |
---|
0:16:59 | thank you |
---|
0:17:01 | very impressive work and a nice set of results; so thank you for that |
---|
0:17:05 | so the question i have is: this is all telephone speech, and it works very |
---|
0:17:11 | well with that; have you considered what would happen if the out-of-domain data was |
---|
0:17:16 | a different channel, such as microphone? |
---|
0:17:18 | and is that even a realistic case? would you have a pre-trained microphone system |
---|
0:17:23 | that you try and adapt? |
---|
0:17:25 | right, so yes, we have looked at microphone; the very first work a few |
---|
0:17:30 | years ago on this task was adapting from telephone to microphone, and Daniel revisited it early |
---|
0:17:36 | in the summer, when we were debating, working with Doug on this dataset, whether we |
---|
0:17:40 | trusted it; he did a similar experiment |
---|
0:17:43 | with the SRE telephone and microphone data and actually got similar results |
---|
0:17:47 | it is — |
---|
0:17:49 | that does sound a bit surprising, but we have seen in the SREs that |
---|
0:17:52 | telephone to microphone is not nearly as hard as it ought to be; i don't |
---|
0:17:55 | know the reason for that, but yes, we have done work with telephone-microphone SREs |
---|
0:17:59 | and it's not shockingly different in this regard |
---|
0:18:06 | i'll just add a short answer to the previous question |
---|
0:18:08 | we trained an i-vector extractor on a database in which there is only one speaker per |
---|
0:18:14 | utterance |
---|
0:18:16 | and it works about the same as SRE 2004 and 2005 or so |
---|
0:18:23 | first |
---|
0:18:25 | okay thank you |
---|
0:18:29 | good to know, i think |
---|
0:18:31 | thank you |
---|
0:18:32 | so, a really stupid question: yesterday people were mentioning how the mean shift clustering |
---|
0:18:40 | algorithm is working well |
---|
0:18:43 | i mean, you don't seem to use that; you use |
---|
0:18:47 | agglomerative clustering, so |
---|
0:18:50 | is there a reason why? |
---|
0:18:52 | i believe over the course of the summer we and other people looked at quite a |
---|
0:18:57 | few different algorithms; i know it's used in diarization, and we have looked at |
---|
0:19:00 | it in diarization; i cannot remember if we looked at it for this task |
---|
0:19:05 | we did look at others; Stephen is going to talk about some other clustering algorithms |
---|
0:19:08 | but i don't think he's going to talk about mean shift, so i'm |
---|
0:19:13 | not sure — i don't have that comparison; it clearly is also useful |
---|
0:19:28 | i just want to know if these splits and protocols are available |
---|
0:19:33 | yes, they are on the JHU website; the link is in the paper — okay, |
---|
0:19:38 | thanks — you don't get the speech data itself, but the lists are there |
---|
0:19:43 | i encourage you to work on this task |
---|
0:19:49 | we have |
---|
0:19:50 | one question |
---|
0:19:52 | let's suppose that you are asked to do not speaker clustering but gender clustering |
---|
0:19:57 | and you don't have any prior on how many genders there are |
---|
0:20:02 | as input |
---|
0:20:04 | would the stopping criterion be the same? i mean, would you find signs of genders? |
---|
0:20:10 | i'm not sure — or if the clustering would accidentally find gender — well, |
---|
0:20:14 | let me say one thing first: we did — i think i forgot to mention — |
---|
0:20:17 | this is a gender-independent system |
---|
0:20:20 | well, suppose the classes it tends to cluster are not the |
---|
0:20:24 | speakers, and at the end the clusters aren't, kind of, |
---|
0:20:27 | correct — well, this is why Daniel thought this wouldn't work: |
---|
0:20:31 | who knows what you're going to cluster by? we're just using the metric; we are hoping |
---|
0:20:34 | that the out-of-domain PLDA metric is encouraging the clustering to focus on |
---|
0:20:40 | speaker differences |
---|
0:20:42 | but we cannot guarantee that, except |
---|
0:20:44 | with the results |
---|
0:20:49 | i think more so than gender — if, for example, languages were different, and there might |
---|
0:20:53 | be some different-language data, you might think you would cluster by that; the same speaker |
---|
0:20:57 | speaking multiple languages — you might think that would confuse our clustering |
---|
0:21:02 | so |
---|
0:21:03 | i would like to point to one aspect i think is very important, especially in the |
---|
0:21:08 | forensic framework; could you |
---|
0:21:12 | show slide five |
---|
0:21:15 | perhaps? |
---|
0:21:24 | so — |
---|
0:21:26 | what you've neglected here is the decision threshold |
---|
0:21:30 | yes, we have neglected calibration of the final task, that's — |
---|
0:21:34 | so it could possibly be that a factor of three becomes a factor of |
---|
0:21:41 | one hundred |
---|
0:21:43 | the factor of three degradation could actually — |
---|
0:21:46 | it could, yes — which we simply neglected |
---|
0:21:50 | you are right, George; when you put it like that, we would think that with the |
---|
0:21:53 | unsupervised calibration we could accomplish calibration — what i would like you to do |
---|
0:21:59 | when you get home — yes? — |
---|
0:22:02 | is to |
---|
0:22:05 | annotate this slide with the |
---|
0:22:08 | decision |
---|
0:22:09 | points |
---|
0:22:10 | well, these systems are not even calibrated, so |
---|
0:22:14 | we always have to run a separate calibration process to get onto a single scale — somewhat |
---|
0:22:18 | easily, too — when you go home |
---|
0:22:21 | go ahead and do that work |
---|
0:22:23 | and you're only going to have to do this for the in-domain system |
---|
0:22:31 | and then you |
---|
0:22:31 | apply a threshold |
---|
0:22:33 | put the dots on those two curves and send me a copy |
---|
0:22:38 | thank you very much for your assignment, then |
---|
0:22:42 | that question was already partially answered by our unsupervised score calibration paper, which was |
---|
0:22:50 | published at ICASSP; so it's true |
---|
0:22:53 | that — |
---|
0:22:56 | okay, so we thank the speaker |
---|