0:00:15 | This is kind of the transition from the systems in the previous session into the DNNs. |
0:00:25 | No, I don't think so; we all could have presented in both. But I think this is a good transition, because we did have some new things that I want to talk about. |
0:00:37 | This is work with my colleagues Greg Sell and Daniel Garcia-Romero from Johns Hopkins, both of whom were unfortunately unable to get spousal permission to attend. But they have good excuses: Greg's wife had their second child two weeks ago, and Daniel's is due in about two weeks. So they have a reason. |
0:00:54 | So I'm going to present an overview of the DNN i-vector system that we submitted to LRE 15. |
0:01:04 | I want to give a shout-out to NIST here for introducing this fixed training data condition, which actually allowed us to make a very competitive system with only three people, which is not very common in our field historically. |
0:01:20 | The approach that we used algorithmically (I'll go into more detail) uses DNNs. Unlike some of the previous presentations you've seen, we were able to get good performance not just with the bottleneck features but also with the DNN senone labels; I'll talk about that. |
0:01:38 | We used three different kinds of i-vectors, which I'll explain more. Everyone had acoustic systems, and those are very good. We were able to do quite well with the phonotactic i-vector system as well, and here we're trying for the first time a joint i-vector, which does both things at once. |
0:01:56 | Because we had a fairly powerful system that we were comfortable with, and we didn't trust that we had enough development data, we used, I think, the simplest and most naive fusion of anybody. It seemed to work for us, because we actually got a gain from fusion, which I think also made us one of the few. That was just to sum the scores together and then scale them with the duration model that I'll talk about. |
0:02:21 | And lastly, as I think has been mentioned, but I want to go into it a little bit more: because this was a limited-data task, data augmentation turned out to be very helpful for us. |
0:02:33 | So at the top I'll go through our basic i-vector system design, talk about the two ways that we used the DNNs, both of which have been touched on previously today, and talk about the alternate i-vectors we experimented with. |
0:02:50 | Then I'll talk more specifically about the LRE 15 task, how we used the data, and what we learned later about how we could have used the data. |
0:02:59 | And finally I'll talk about the results that we had in the submission, and some interesting things we've learned since, both about what other systems could have done and how we could have done better with the systems that we used. |
0:03:13 | So here's a block diagram of our LID system. |
0:03:21 | It's a little i-vector system, so it can be split into two parts. The first uses the unlabeled data to do the UBM and the T matrix learning. Then the supervised system is basically the two-covariance model: within-class and across-class covariances that are first used in LDA to reduce the dimension, and then the same matrices are used for the Gaussian scoring that follows. |
0:03:45 | As we've done for a while, rather than having a separate backend do the work, we do a discriminative refinement of these Gaussian parameters to produce a system that not only performs a little bit better but also produces naturally calibrated scores. |
0:04:00 | We do that in a two-step process. First we learn a scale factor on the within-class covariance, and then we go into all the class means and adjust them to provide better discriminative power. For that we're using the MMI algorithm from GMM training in a really simplified mode, and of course that's the same criterion as the multiclass cross-entropy that everybody uses every day. |
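The scoring described here, class means with a single shared within-class covariance, reduces to a linear classifier. A minimal numpy sketch of that idea; the names (`gaussian_backend_scores`, the `scale` argument standing in for the discriminatively learned covariance scale factor) are illustrative assumptions, not the authors' code:

```python
import numpy as np

def gaussian_backend_scores(X, means, W, scale=1.0):
    """Gaussian scoring with a shared within-class covariance W.

    Returns log N(x; m_k, W/scale) up to a class-independent constant,
    which is linear in x: m_k' P x - 0.5 m_k' P m_k with P = scale * W^-1.
    `scale` plays the role of the learned covariance scale factor
    (an assumption of this sketch).
    """
    P = scale * np.linalg.inv(W)                       # shared precision
    quad = np.einsum('kd,de,ke->k', means, P, means)   # m_k' P m_k per class
    return X @ P @ means.T - 0.5 * quad

def log_posteriors(scores, log_priors):
    """Closed-set class log-posteriors from Gaussian scores and priors."""
    z = scores + log_priors
    return z - np.logaddexp.reduce(z, axis=-1, keepdims=True)
```

Discriminative refinement would then adjust `scale` and `means` to optimize the multiclass cross-entropy of these posteriors.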
0:04:28 | Before laying out the data, let me talk more about how we used the DNN. Other people have mentioned it, but let me show some pictures so you can see better what we're doing. Normally you use the GMM to do the alignment and then compute the stats from that. |
0:04:43 | We're splitting that out in two ways using the DNNs. The first is simply to replace the MFCCs with bottleneck features from the DNN; we're just using a straightforward bottleneck, nothing fancy. |
0:04:56 | The second system is a little bit more complicated: we use the DNN to generate the frame posteriors for the senones, the clustered states. That is used to label the data and do the alignment, and then you use the UBM after that. |
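The swap being described only changes where the frame alignments come from; the sufficient statistics for i-vector training are computed the same way. A small sketch, with illustrative names, assuming one cut of features and per-frame posteriors:

```python
import numpy as np

def sufficient_stats(feats, posteriors):
    """Zeroth- and first-order Baum-Welch stats for i-vector training.

    feats:      (T, D) acoustic features for one cut
    posteriors: (T, C) frame alignments; in the classic recipe these come
                from the UBM, in the DNN variant from senone posteriors.
    """
    n = posteriors.sum(axis=0)   # zeroth order, shape (C,)
    f = posteriors.T @ feats     # first order, shape (C, D)
    return n, f
```

The acoustic i-vector uses both `n` and `f`; the phonotactic variant discussed later needs only the soft counts `n`.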
0:05:15 | I didn't have time to draw a DNN, but this is Daniel's best rendition of one. A couple of things that are perhaps particular about our system, or about the Kaldi way of doing things (which, by the way, we do highly recommend): it uses this p-norm nonlinearity, which is kind of like max pooling, so there's an expansion and a contraction at each layer; that's how the nonlinearity comes in. |
0:05:40 | What else? I think probably nobody does this these days, but we're not using fMLLR, which I think is common, for our purposes. |
0:05:48 | You can see we basically use the same architecture, either for the senone posteriors, or, for the one that's just going to be the bottleneck, we introduce the bottleneck: that's the little linear layer before the middle, that one there. |
0:06:06 | We have about nine thousand output states, so it is a pretty big UBM that we get out of this. And of course it's trained using Switchboard-1, because that's what we were given for the fixed data condition. |
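The p-norm nonlinearity mentioned above can be sketched in a few lines: each group of activations is collapsed to its p-norm, so the layer expands and then contracts, much like max pooling. The group size and p here are illustrative, not the submission's settings:

```python
import numpy as np

def pnorm(x, group_size, p=2):
    """Kaldi-style p-norm nonlinearity over the last axis.

    Each consecutive group of `group_size` activations is reduced to
    a single output, (sum_i |x_i|^p)^(1/p), giving a dimension-reducing
    nonlinearity similar in spirit to max pooling (which is p = inf).
    """
    grouped = x.reshape(*x.shape[:-1], -1, group_size)
    return (np.abs(grouped) ** p).sum(axis=-1) ** (1.0 / p)
```

With `p=2` and group size 2, the inputs `[3, 4]` and `[0, 5]` both map to 5.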
0:06:24 | So let me talk about the designs a little bit. The one that we're all familiar with, which we'll call the acoustic i-vector, is based on a Gaussian probability model, and I've written in little parentheses "given gamma" because the alignments are already known; otherwise it would be much more complicated. |
0:06:44 | Because of that, it's a big Gaussian supervector problem: there's a closed-form solution for the MAP estimate of the i-vector, and there's an EM algorithm for the T matrix estimation. |
0:06:55 | The second approach is the phonotactic one; I think it was mentioned that we've used it for a number of years before. I'll talk about the details later, but the key thing is we can still have sort of a Gaussian model for an i-vector, but the output of the latent model we're talking about is now the weights of the GMM instead of the means. |
0:07:19 | Those are naturally going to be count-based, so we need a multinomial probability model on the output, not a Gaussian probability model. The way we do that is to go from log space through the softmax into the probability domain. |
0:07:33 | Even though it's a fairly simple formula, unfortunately there's not a closed-form solution for the optimal i-vector, so there's a Newton's method iteration. And similarly there's not an EM algorithm for the T matrix that we know of yet, so there is an alternating maximization algorithm. |
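The phonotactic model just described, GMM weights generated through a softmax of a subspace, can be sketched directly. The function names and the use of plain gradient ascent (in place of the Newton iteration mentioned in the talk) are illustrative assumptions:

```python
import numpy as np

def weights(T, y, b):
    """GMM weights generated by latent vector y: w = softmax(b + T y)."""
    z = b + T @ y
    z -= z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def loglik_and_grad(T, y, b, counts):
    """Multinomial log-likelihood of soft counts under w(y), and its
    gradient in y. The objective is concave in y, but has no closed-form
    maximizer, hence the iterative optimization described in the talk."""
    w = weights(T, y, b)
    ll = counts @ np.log(w)
    grad = T.T @ (counts - counts.sum() * w)
    return ll, grad
```

A small gradient step from any point with nonzero gradient increases the log-likelihood, which is the basis of the iterative i-vector estimate.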
0:07:53 | So we had presented this phonotactic thing for LID before. In the meantime we thought: okay, we have two systems, an acoustic and a phonotactic; how are we going to combine them? |
0:08:05 | Obviously the first thing is score fusion, and yes, we did that, and yes, that works. |
0:08:09 | Then we were a little more adventurous: these two i-vector systems are doing the same thing, so why don't I stack the i-vectors together, get one big i-vector, and then run one i-vector system? Does that work? And yes, that works too. |
0:08:22 | Then we thought about it some more and said: well, why do I want two independent i-vector extractors? Why can't I make one latent variable that models both the means of the latent GMM that generated the cut and the weights of the GMM that generated the cut? |
0:08:38 | In fact, the math says that you can. I'll go into a little more detail, but basically this is a permutation of the subspace GMM that Dan Povey was talking about in 2008 and 2009 at the JHU workshop and since. So there are algorithms for doing this; we had to manipulate them a little bit for our purposes. |
0:09:02 | So, a couple of details on how to do this; we have some references in the paper. |
0:09:08 | Some things in particular that we're doing differently than if you just took it out of that prior work: first, it did everything with sort of ML estimates, so there wasn't any prior to back off to. Obviously for the acoustic we don't want to use ML i-vectors, we want to use MAP i-vectors. We've actually shown previously that for a phonotactic system MAP is also beneficial, and if we're going to do it jointly, it's critical that it be the same criterion for both things, because it is a joint optimization of the MAP objective: the overall likelihood plus the prior. |
0:09:44 | A nice trick we can do with this joint i-vector: since there's a closed-form solution for the acoustic part, we can initialize the Newton's method with the acoustic solution and then refine it using the phonotactic part as well. That gets us to a starting point pretty easily, where we can then do a greatly simplified Newton descent, in particular by pretending everything is independent of everything else. That is a huge speed improvement, because doing the full Hessian in this update, as anybody who's ever looked at it knows, is pretty tedious. |
0:10:15 | So once we do that, rather than being much slower than an acoustic i-vector system, it's essentially the same order; it's very simple. |
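The simplification described, pretending coordinates are independent so only the diagonal of the Hessian is used, can be illustrated generically. The toy quadratic objective and the cheap diagonal initializer below stand in for the real joint objective and the closed-form acoustic solution; they are assumptions of this sketch, not the paper's equations:

```python
import numpy as np

def diagonal_newton_step(y, grad, hess_diag):
    """One Newton step using only the Hessian diagonal: the full solve
    is replaced by elementwise division, which is the speed trick in
    the talk."""
    return y - grad / hess_diag

# Toy illustration: minimize 0.5 y'Ay - b'y (A not diagonal), starting
# from a cheap diagonal-only guess and refining with diagonal steps.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
y = b / np.diag(A)                    # cheap initial guess
for _ in range(50):
    grad = A @ y - b                  # gradient of the toy objective
    y = diagonal_newton_step(y, grad, np.diag(A))
y_star = np.linalg.solve(A, b)        # exact minimizer, for comparison
```

For a diagonally dominant curvature this iteration converges to the full solution; in the joint i-vector case the good acoustic initialization is what makes the approximate steps adequate.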
0:10:33 | So now to the LRE 15 task, which has been discussed; I guess this isn't news here. There is telephone and broadcast narrowband speech, with twenty languages in six confusable clusters. |
0:10:48 | The limited training condition is a very important element of what we were able to get away with. Of course that means both that you have limited data for the twenty languages, and also that you can only train your supervised DNN on the Switchboard English, because that's the only thing that had transcripts. |
0:11:06 | That's not our favorite thing to do; it was kind of limiting, but it allows NIST to exercise the technology. And because the languages didn't have much data, that was also key. |
0:11:20 | So, all of our systems: basically, because we had a small team, we didn't build too much complicated stuff; I've described really everything that we did. We had two different ways of using the DNN, and three different kinds of i-vectors that we could have built out of each of the two DNN systems. Out of that we could have done six things; I'll talk about a few that were interesting and the ones that we actually ran. But everything used the same classifier. |
0:11:48 | As I mentioned, because the systems are already calibrated by this MMI process, we didn't have to use a complicated backend. |
0:11:57 | The thing we did introduce, because we knew there was a range of durations that had to be exercised: I think the simplest way we could get there was to reuse some work that we had done previously on making a duration-dependent backend, where there's a continuous function which maps duration into a scale factor on the score, between the raw score and the true log-likelihood estimate that you're trying to make. |
0:12:25 | There's a justification for that function, but for our purposes the important thing is that it's very simply trainable, because it's just got two free parameters. So then you can use this cross-entropy criterion and figure out the best parameters. |
0:12:39 | And then, because we have a very simple system, we just add all the scores together, assume that they were independent estimates, and then rescale the whole thing to bring it back into range. We found that to be helpful for us. |
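The fusion just described, sum the system scores and rescale by a two-parameter function of duration, can be sketched as follows. The particular saturating form `a * d / (d + tau)` is an assumption for illustration; the paper's actual function may differ, but it is likewise a smooth two-parameter map trained with cross-entropy:

```python
import numpy as np

def duration_scale(dur, a, tau):
    """Illustrative two-parameter duration-to-scale map: short cuts are
    shrunk toward zero (less confident scores), long cuts approach the
    full scale `a`. `a` and `tau` are the two free parameters that
    would be trained with the multiclass cross-entropy criterion."""
    dur = np.asarray(dur, dtype=float)
    return a * dur / (dur + tau)

def fuse(score_list, dur, a, tau):
    """Naive fusion from the talk: sum the (assumed-independent) system
    scores, then rescale the total with the duration model."""
    return duration_scale(dur, a, tau) * np.sum(score_list, axis=0)
```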
0:12:58 | Another thing about LRE 15, which was mentioned, but maybe went past quickly for those less familiar with the task, is very important. NIST proposed this somewhat odd task of closed-set detection within each of the clusters. |
0:13:13 | What we did is generate for each cluster an ID score, which means that each cluster had ID posteriors that summed to one. Since there are six clusters, we gave NIST scores from the six, which means that if NIST had wanted to evaluate across-cluster performance, it would have been meaningless. |
0:13:32 | And we had to convert these ID posteriors to detection log-likelihood ratios, which is something we all know how to do here. |
0:13:39 | One thing I want to mention about our system is that we didn't do anything cluster-specific anywhere: we just trained a twenty-language LID system and then spun out the scores for each of the clusters, because that's what NIST wanted. I think we would like, in the future, a more generic LID task. |
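The posterior-to-detection-LLR conversion referred to above is the standard one under flat priors: for each language, compare its posterior against the average of the others. A minimal sketch (function name illustrative):

```python
import numpy as np

def id_posteriors_to_detection_llrs(log_post):
    """Convert closed-set ID log-posteriors (summing to one in
    probability over the last axis) to per-language detection LLRs:

        llr_k = log p_k - log( mean of the other classes' posteriors ),

    the usual conversion assuming flat priors within the set."""
    log_post = np.asarray(log_post)
    K = log_post.shape[-1]
    total = np.logaddexp.reduce(log_post, axis=-1, keepdims=True)
    # log of the summed "rest" mass, computed stably in the log domain
    log_rest = total + np.log1p(-np.exp(log_post - total))
    return log_post - (log_rest - np.log(K - 1))
```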
0:14:01 | Now, the key element that I mentioned is dealing with the limited training data, so we had to figure out what to do with that. |
0:14:11 | As I mentioned, we have the unsupervised and supervised parts. We took the theory, which was later proven not quite right, that we would use everything we could for the unsupervised part, which included Switchboard, which is English only, and English was not one of the languages. In fact, we could have done better than that; I'll talk about it. |
0:14:30 | Then for the classifier design, we did find it helpful to do augmentation and duration modeling of the cuts. So we could use all sides; we used segments whose durations were appropriate for the LID task. And we used augmentation to transform the limited clean data, to try to give us more examples of what i-vectors would look like. |
0:14:55 | To go into the augmentation a little bit more: many of these are standard things; the big thing in DNNs now is to do augmentation. So, sample-rate perturbation and additive noise; reverb is kind of a form of additive noise, but maybe more interesting, and we did throw that in. And multi-band compression is the kind of signal-processing thing that you might see applied to an audio signal. |
0:15:20 | But the thing I want to mention, which we actually don't have in the slides, but you can look in the paper: the most effective single augmentation for us in the task was to run a speech coder encoder-decoder over the data, which kind of makes sense as a thing to do, and as a former speech coding person I find it fairly attractive. |
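Of the augmentations listed, sample-rate (speed) perturbation is easy to sketch: resample the waveform by a factor near one. This is a minimal linear-interpolation version for illustration; a real system would use a proper polyphase resampler, and the factors below simply echo the plus-or-minus ten percent mentioned later in the Q&A:

```python
import numpy as np

def speed_perturb(signal, factor):
    """Speed/sample-rate perturbation by linear-interpolation resampling.

    factor > 1 speeds the signal up (fewer samples), factor < 1 slows
    it down; e.g. factor in {0.9, 1.1} for +/- ten percent perturbation.
    """
    n_out = int(round(len(signal) / factor))
    t = np.arange(n_out) * factor          # fractional source positions
    return np.interp(t, np.arange(len(signal)), signal)
```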
0:15:42 | So, our submission performance. These are the four things that we submitted; our primary was in fact the one at the bottom, which looks like it was a pretty good choice out of the ones available to us. |
0:15:54 | So we did a joint i-vector on the bottleneck features; I'll show more later, though I don't remember the exact dimensionalities in this submission. |
0:16:04 | Our senone-based system was actually slightly better than our bottleneck system, and again, that makes it the best sort of phonotactic system, I think, that anybody saw, because everyone else found the bottlenecks to be the only really good thing to do. |
0:16:18 | And fusion provided a gain, partly because we have simple fusion and partly because we have two systems which are pretty good. |
0:16:28 | We learned a couple of things post-eval that we found educational. The first one I won't go into in much detail here (it's in the paper), but within the family of Gaussian scoring there's a question of whether you count trials as independent or not, which in speaker recognition typically pertains when you only have one trial for enrollment. |
0:16:50 | The variant we submitted, which we usually see as slightly better, turned out for this eval to be slightly worse. I have no idea why. |
0:16:57 | The other thing, which might be a little bit more interesting, is the data usage. We spent quite a bit of time, even with the metadata, trying to decide what to do with the UBM and T. |
0:17:08 | But the thing that turned out to work best we didn't try, because we thought it was a dumb idea: just use only the LID data, and only the full cuts, which, I forget exactly, but I think is only three or four thousand cuts or something. That ought to be nowhere near enough to train a T matrix, we thought. But it was better. |
0:17:30 | So here again there are more numbers splitting things out. The first thing, which was kind of interesting for us: we went and ran this acoustic baseline, what we would have done with previous technology, and we are definitely better with all the stuff we have. I don't know if we're astoundingly better, but we're better. |
0:17:51 | Next, we split out, with the senone system, the three different kinds of i-vectors. The first thing is that the phonotactic system by itself is actually better than the acoustic system, which is what we had seen before. |
0:18:04 | A linguist might argue about whether it's really a phonotactic system, looking at counts of frame posteriors, but that aside, it's, I think, the best-performing phonotactic system that's out there for LID right now. And then you see also that the joint i-vector does give a noticeable gain over the acoustic. |
0:18:44 | Okay, and the fusion still works. So let me just conclude. We were able to get pretty good performance in this evaluation with a small team and a relatively straightforward system. |
0:18:58 | We think that there is still value in the senone-count system; it doesn't have to be just bottlenecks, and we were able to show that. We think that the phonotactic and the joint i-vectors (the joint i-vector especially) are a nice, simple way to capture that information, and that's one of the things that enables the senone system to be competitive. |
0:19:20 | We think it is helpful to use a really simple fusion if you have a discriminatively trained classifier to start with. And we find that data augmentation can be a very valuable thing for the management of limited data. |
0:19:35 | Thank you. |
0:19:43 | (Chair) We have time for some questions. |
0:19:55 | (Question, partly unintelligible) Thank you for the talk. For the counts you proposed to collect ... do you use the same classifier tools for the phonotactic counts as for the other i-vectors? |
0:20:15 | Yes, we always use the same backend, the Gaussian classifiers, no matter what kind of i-vectors. |
0:20:22 | (Q) Because the distribution is not Gaussian? |
0:20:24 | No, the intention is that the i-vector can still live in a Gaussian space; that's why we like this kind of subspace. There are other count-subspace algorithms, like latent Dirichlet allocation or non-negative matrix factorization, and some of those have been compared, where the subspace is in the linear probability space. I don't think that would be well modeled by a Gaussian; in fact I know it wouldn't be, I'm pretty comfortable, because it's positive. But by going into the log space, I think it does become reasonable. |
0:20:57 | (Q) So it really is analogous to LDA with the right tools. Okay, thank you. |
0:21:20 | (Q) I very much liked the additional processing that you're doing to kind of augment the data; you had, say, sample-rate perturbation and the speech coder versions. If you had to go back again, which ones do you think actually would help? |
0:21:35 | I think you mean in hindsight. There is a table in the paper; many of them are helpful, but the speech coder is the most helpful on its own. |
0:21:45 | (Q) So in the sample-rate conversion, was it a really big variation? |
0:21:51 | We did things like plus or minus ten percent and plus or minus five percent, but I think, I would say, that's big. |
0:22:02 | (Q) So was there a big difference, maybe, between the CTS and the broadcast-news parts? What would you be guessing? |
0:22:12 | We didn't break them apart. |
0:22:24 | (Q) Did you try other nonlinearities, not just the p-norm? |
0:22:30 | We have since, and, a little bit, it seems like for this particular task the sigmoids that some other people use are a little bit better. I'm not sure we think that's a universal statement. |
0:22:46 | Excuse me: the sigmoids are better for training the bottlenecks; for the senones, maybe not. So we have looked a little bit; there is more to explore. |
0:23:07 | (Chair) So if there are no more questions, we'll assume everybody here knows everything about language recognition. Coming up next: the same speaker again. |