0:00:16 | Okay. So, in this work we basically test and evaluate our end-to-end LSTM language recognition system in different scenarios. This is a system that we already presented, and the objective here is to test how it is affected by different scenarios. |
---|
0:00:44 | So, first of all, the motivation: why we started using this architecture and how. Then we will give a brief overview of the field; you are probably already quite aware of this, but I guess it is nice to have some context here. |
---|
0:01:05 | Then we will go into the details of this work: we will detail the system description, the reference i-vector system that we will compare our proposed system with, the different scenarios we are going to test, and the results. And finally we will conclude the work. |
---|
0:01:25 | So, we all know what LID is: the process of automatically identifying the language of a given spoken utterance. Typically, for many years, this has been done relying on acoustic models, so these systems basically have two stages: first some i-vector extraction, and then some classification stage. |
---|
0:01:48 | In the last years we are seeing a really strong new line, which is deep neural networks. It can be more or less divided into three different approaches. One is the end-to-end systems: we have seen that it is a very nice solution, but we are not achieving the best results with it so far. Then we have the bottleneck features: after computing bottlenecks, we go back to the i-vector extraction and keep the full pipeline. And then we have the senones; sorry for the typo. |
---|
0:02:20 | In this paper we want to focus on the end-to-end approach, and we want to improve it. This would be a very standard DNN for language recognition when we try to use an end-to-end approach: basically we have some acoustic parameters as input, then one or several hidden layers with some nonlinearity, and in the last layer we compute the probability of each of the languages we are going to test; for this we use a softmax, which gives us probabilities. |
---|
0:03:02 | One of the main drawbacks of this system is that we need some context: if we try to get an output frame by frame, we are not going to get any good result, so this system relies on stacking several acoustic frames in order to model the time context. And that has many problems: for one, we have a fixed length that will probably not work best for all the different conditions, and it is more of a patch than a principled solution. |
---|
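As an illustration of the frame-stacking trick just described, here is a minimal sketch of how a fixed context window is built for a frame-level DNN. The helper name and the toy 2-dimensional features are hypothetical; the fixed window width is exactly the limitation mentioned above.

```python
# Sketch of the fixed-context trick: a frame-level DNN cannot see time
# on its own, so each input is built by stacking the current acoustic
# frame with its neighbours. The window size is fixed (here +/-2 frames),
# which is the limitation discussed in the talk: one length cannot fit
# all conditions.

def stack_frames(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames.
    Edges are padded by repeating the first/last frame."""
    stacked = []
    n = len(frames)
    for t in range(n):
        window = []
        for off in range(-left, right + 1):
            idx = min(max(t + off, 0), n - 1)  # clamp at the edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Toy example: five 2-dimensional "MFCC" frames.
feats = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_frames(feats)
```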
0:03:36 | So how can we model this in a better way? The theoretical answer is recurrent neural networks: basically we have the same structure as before, but this time we have recurrent connections; all the rest is the same. What is the problem with these? We have the vanishing gradient problem. In theory it is a very nice model, but when we try to train these networks, because of these recurrent connections we end up having all the weights go either to zero or to something really high. There are ways to avoid this, but usually they are very tricky, depending a lot on the task and on the data, so they are not really useful. |
---|
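The vanishing and exploding behaviour just described can be seen with a toy one-dimensional linear recurrence: backpropagating through T steps multiplies the gradient by the recurrent weight T times. This is only an illustrative sketch, not the actual training setup.

```python
# Toy illustration of the vanishing/exploding gradient problem for a
# one-dimensional linear recurrence h_t = w * h_{t-1}: backpropagation
# through T steps multiplies the gradient by w at every step, so it
# shrinks toward 0 for |w| < 1 and blows up for |w| > 1.

def gradient_through_time(w, steps):
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per unrolled time step
    return grad

vanished = gradient_through_time(0.9, 100)   # ~2.7e-5
exploded = gradient_through_time(1.1, 100)   # ~1.4e4
```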
0:04:22 | And here is where the LSTM comes in. Basically, for an LSTM we first take a standard DNN, and we replace all the hidden nodes with this LSTM block that we have here. |
---|
0:04:38 | So let's go to the theory of this block. It seems kind of scary when you first see it, but it is pretty simple after you look at it for a while. We have a flow of information that goes from the bottom to the top, and, as in any standard neuron, we have a nonlinear function, this one here. The special thing about the LSTM is that it has a memory cell, this one. |
---|
0:05:15 | Besides that, what we have there are three different gates, and what they do is let information go through, or block it. Here we have the input gate: if it is activated, it lets the input at a given time step move forward; if it is not, it won't. |
---|
0:05:38 | We have the forget gate, which basically resets the memory: if it fires, it sets the cell to zero; otherwise the cell keeps its state from the previous time step. And the output gate decides whether the computed output here goes out to the network or not. |
---|
0:06:08 | And then, of course, we have recurrent connections, so the output at one time step becomes the input at the next time step; it is basically trying to mimic the RNN model. |
---|
0:06:25 | But in this case we avoid the vanishing gradient problem, because the gates work not only in the forward pass but also across time: when we are doing backpropagation and some error would destabilize the weights, the forget gate or the input gate can block that error from propagating across many time steps. So we avoid the problem. |
---|
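The gating mechanism described above can be sketched as a minimal one-dimensional LSTM step. This is a simplified illustration with hypothetical parameter names: the system in the talk uses full vector-valued cells and adds peephole connections, which are omitted here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a minimal one-dimensional LSTM cell.
    p maps gate name -> (w_x, w_h, bias). No peepholes here."""
    gate = lambda name: sigmoid(p[name][0] * x + p[name][1] * h_prev + p[name][2])
    i = gate("input")    # let the new input in, or not
    f = gate("forget")   # keep the previous cell state, or reset it
    o = gate("output")   # expose the cell content to the network, or not
    c_tilde = math.tanh(p["cell"][0] * x + p["cell"][1] * h_prev + p["cell"][2])
    c = f * c_prev + i * c_tilde   # the memory cell update
    h = o * math.tanh(c)           # the block output
    return h, c

# With the forget gate saturated open (large positive bias) and the input
# gate shut (large negative bias), the cell state is carried through almost
# unchanged; this is how the LSTM preserves information (and gradients)
# over many time steps.
params = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
          "output": (0.0, 0.0, 20.0), "cell": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```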
0:06:51 | The system that we used for language recognition does not rely on stacking acoustic frames, so it receives only one frame at a time. We have one or two hidden layers, and the hidden layer is a unidirectional LSTM. We also include these peephole connections that we have here, which basically allow the network to make decisions depending on time, so they are supposed to improve the performance of the memory cell. |
---|
0:07:28 | At the output we use a softmax, just like in the DNN, with a cross-entropy error function. For training, in the first scenario we will have a very balanced, nice dataset, so we do not need to do any resampling; but in the more difficult scenarios we will have somewhat imbalanced data. So, in order to avoid problems with the imbalanced data, we just oversample: we take random excerpts of two seconds so that we have six hours of every language in each iteration, and the data is different for every iteration. |
---|
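The per-iteration oversampling just described can be sketched as follows. The function and data names are hypothetical; only the two-second excerpts and the equal-hours-per-language idea come from the talk, and the toy amounts below are much smaller than the six hours mentioned.

```python
import random

def balanced_iteration(utterances_by_lang, hours_per_lang=6.0, excerpt_s=2.0):
    """Build one training iteration: random 2-second excerpts per language
    until every language contributes `hours_per_lang` hours. Languages with
    little audio are simply sampled (with replacement) more often."""
    n_excerpts = int(hours_per_lang * 3600 / excerpt_s)
    batch = []
    for lang, utts in utterances_by_lang.items():
        for _ in range(n_excerpts):
            utt_id, dur = random.choice(utts)  # sample with replacement
            start = random.uniform(0.0, max(dur - excerpt_s, 0.0))
            batch.append((lang, utt_id, start, start + excerpt_s))
    random.shuffle(batch)
    return batch

# Toy corpus: (utterance id, duration in seconds) per language.
data = {"eng": [("eng_001", 30.0), ("eng_002", 12.5)],
        "spa": [("spa_001", 45.0)]}
batch = balanced_iteration(data, hours_per_lang=0.01)  # tiny toy amount
```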
0:08:05 | Then, to compute the final score of an utterance, we average the softmax outputs, taking into account only the last ten percent of the scores; I will explain why later. And finally we use a multiclass linear logistic regression calibration, a simple one. |
---|
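A sketch of the utterance-level scoring just described: average the per-frame softmax outputs over only the last ten percent of frames. The function name and the toy posteriors are hypothetical.

```python
def utterance_score(frame_posteriors, keep_fraction=0.10):
    """Average per-frame softmax outputs over the last `keep_fraction`
    of frames only: with a unidirectional LSTM the late outputs have
    seen (almost) the whole utterance, so they are the reliable ones."""
    n = len(frame_posteriors)
    n_keep = max(int(n * keep_fraction), 1)  # keep at least one frame
    tail = frame_posteriors[n - n_keep:]
    n_lang = len(tail[0])
    return [sum(frame[k] for frame in tail) / len(tail) for k in range(n_lang)]

# Toy run: 20 frames, 3 languages; the posterior drifts toward language 0
# as the LSTM accumulates evidence.
posts = [[0.2 + 0.03 * t, 0.5 - 0.02 * t, 0.3 - 0.01 * t] for t in range(20)]
scores = utterance_score(posts)
```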
0:08:27 | We will compare the system with a reference i-vector system. It is very straightforward: it uses MFCC features, exactly the same features that we used for the LSTM; 1024 Gaussian components for the UBM; i-vectors of size 400; and it is based on cosine distance scoring. |
---|
0:08:49 | It turns out that, depending on how many languages you have, this was working better than doing LDA or PLDA, so that is why we decided to take cosine distance scoring. If we had more languages it would be better to use LDA, but the difference was small enough not to matter much here. And the i-vector system is always trained with exactly the same data. |
---|
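The cosine distance scoring back-end mentioned above can be sketched as follows. In this toy version each language is represented by a mean i-vector and the test i-vector is assigned by cosine similarity; the names and the 4-dimensional vectors are illustrative, while the real i-vectors are 400-dimensional.

```python
import math

def cosine_score(test_ivec, lang_mean_ivec):
    """Cosine similarity between a test i-vector and a language's mean
    i-vector: the simple back-end used instead of LDA/PLDA."""
    dot = sum(a * b for a, b in zip(test_ivec, lang_mean_ivec))
    norm = math.sqrt(sum(a * a for a in test_ivec)) * \
           math.sqrt(sum(b * b for b in lang_mean_ivec))
    return dot / norm

# Toy 4-dimensional "i-vectors" standing in for language models.
lang_models = {"eng": [1.0, 0.0, 0.5, 0.0], "spa": [0.0, 1.0, 0.0, 0.5]}
test = [0.9, 0.1, 0.4, 0.0]
best = max(lang_models, key=lambda l: cosine_score(test, lang_models[l]))
```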
0:09:19 | So these are the three scenarios that we are going to use to compare and test this network. The first one is a subset of the NIST 2009 Language Recognition Evaluation. The data we use comes from the three-second task. This is a pretty benign subset, let's say the easiest one, where the LSTM works best. |
---|
0:09:46 | So it is a fairly easy subset. In the 2009 evaluation, what we did first: there is an imbalanced mix of CTS and Voice of America data, so we dropped all the CTS data; that way we avoid the imbalanced mix and also the channel mismatch in training, so we have only one dataset. |
---|
0:10:09 | For the languages, we also wanted a high amount of data, so we took only those languages that had at least two hundred hours or more. And we also did not want imbalanced data, so we cut all the datasets to two hundred hours per language available for training. |
---|
0:10:28 | And that led to the subset we have here. It was not chosen to be the easiest or the most difficult, as we said before; it is just those languages that had these two hundred hours of Voice of America data. |
---|
0:10:45 | And we use only the three-second task because, historically, we saw that short utterances are where neural networks outperform i-vectors the most, so we wanted to be in that scenario. |
---|
0:11:00 | The second scenario we want to test is the dev set of the NIST Language Recognition Evaluation 2015. Here we do not avoid any of the difficulties: we have a mix of CTS and broadcast data, and we keep everything. We have seen the details of this already: it is twenty languages grouped in six clusters according to similarity, so it is supposed to be more challenging because the languages within a cluster are closer. |
---|
0:11:32 | The amount of training data is also going to be challenging: we have some languages with less than an hour and some languages with more than a hundred hours. And the split that we did is eighty-five percent for training and fifteen percent for testing. |
---|
0:11:46 | That split is something we would not do again if we reran the experiments; it is what we did at the time, before the evaluation. We thought it would be nice to have more data for training, but afterwards we ran some experiments and found that having a little less training data but more development data would have helped. Still, we keep exactly what we used in the evaluation. |
---|
0:12:14 | For the test, with that fifteen percent we took chunks of three seconds, ten seconds, and thirty seconds, to mimic a little bit the conditions of the evaluation. |
---|
0:12:28 | And then the third scenario will be the test set of the NIST Language Recognition Evaluation 2015. Here we have a broad range of speech durations, not fixed bins anymore, and we have big mismatches between training and evaluation, as we saw before. |
---|
0:12:47 | So, the results. First, this is kind of a side result and not that important, but as we are using a unidirectional LSTM, the output at a given time step depends not only on the input at that time step but also on all the previous inputs, so the last output is always more reliable than the ones before. |
---|
0:13:13 | We thought that maybe we were hurting the performance by taking the first outputs, which are less reliable, so we simply started dropping the first outputs and seeing how that affected the performance. |
---|
0:13:28 | For this plot we do not really care about the model we have here, only about how the performance improves; the absolute equal error rate does not matter, only the relative difference. And we found that taking into account only the last ten percent was very close to the optimal point. |
---|
0:13:49 | We also saw that taking only the very last score, a single softmax output, was as good as taking the last ten percent, but we kept averaging the last ten percent as the more robust choice. |
---|
0:14:03 | So these are the results on the first scenario; remember, this is the one with only Voice of America languages and two hundred hours per language for training. These are the different architectures we used: we tried both one hidden layer and two layers, with different hidden layer sizes, from the smallest with 256 units up to the biggest with 1024. And this is the size, in number of parameters, of all the models. |
---|
0:14:44 | And these are the results that we obtained. The reference i-vector system gets an equal error rate of almost seventeen percent, and we see that pretty much all of the LSTM approaches clearly outperform that, many of them with a much smaller number of parameters. |
---|
0:15:09 | So those are really good results, but we are in this balanced, easy scenario. As we can see, the best system has about fifteen percent better error rate and is about eighty-five percent smaller. |
---|
0:15:27 | We also wanted to check how complementary the information extracted by the LSTM and the i-vector systems was, so we fused the best LSTM system with the reference i-vector system, and the result was even better: twelve percent equal error rate, which is about fifteen percent better than the best single system. |
---|
0:15:50 | This is the confusion matrix; it does not carry much information, but it lets us see how the system performs on this subset per language, not only in terms of overall accuracy. |
---|
0:16:03 | These are the results on the dev set of the Language Recognition Evaluation 2015. For this one we did not experiment with different architectures, as we were a little bit in a hurry; we used only the best system from the previous scenario, which was two hidden layers of size 512. And what we can see here is that the LSTM performs much better than the i-vector on three seconds. |
---|
0:16:36 | While on thirty seconds, in this scenario where we have mismatches between the databases and imbalances in the datasets, the end-to-end system is not as strong: we still see results where the i-vector outperforms it. This end-to-end approach is able to extract more information from short utterances, but not as much from longer ones. |
---|
0:17:03 | The good thing we see here is that even though the result for longer utterances is much worse than that of the i-vector, the fusion is pretty much always better than any single system. So even when the LSTM is working worse than the i-vector, it is able to extract different information that helps the final fused system. So we were quite happy with these results as well. |
---|
0:17:31 | This is the DET curve that we have for three seconds, where we can see that the LSTM outperforms the i-vector with over twenty percent relative improvement, and we also see that the fusion always works better than any single system. |
---|
0:17:50 | And now we go on to the results on the test set of the Language Recognition Evaluation 2015, where things get much harder. First of all, the first column is the LSTM, the second column is the i-vector, and the third one is the fusion of both, the non-cheating one, the one we used for the submission. |
---|
0:18:13 | The fourth one is exactly the same but using a "cheating" fusion: a two-fold cross-validation, where we use one half of the test set to train the fusion that is applied to the other half. Of course that is not allowed in the evaluation, but we wanted to know whether the systems were learning complementary information or not; that is, whether with a well-trained fusion we could extract that complementary information. |
---|
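The two-fold "cheating" fusion protocol just described can be sketched as follows. The fusion trainer here is a hypothetical stand-in (an equal-weight average); the talk actually uses multiclass linear logistic regression, so this only illustrates the cross-validation protocol, not the fusion model itself.

```python
# Two-fold "cheating" fusion: split the test set in half, train the
# fusion on one half, apply it to the other, then swap the folds.

def train_fusion(scores_a, scores_b, labels):
    # Placeholder trainer: a real one would fit fusion weights to the
    # labels (e.g. multiclass logistic regression). Here: plain average.
    return lambda sa, sb: [(x + y) / 2.0 for x, y in zip(sa, sb)]

def two_fold_fusion(trials):
    """trials: list of (lstm_scores, ivector_scores, label) tuples."""
    half = len(trials) // 2
    fold1, fold2 = trials[:half], trials[half:]
    fused = []
    for train, test in ((fold1, fold2), (fold2, fold1)):
        fuse = train_fusion([t[0] for t in train], [t[1] for t in train],
                            [t[2] for t in train])
        fused.extend(fuse(sa, sb) for sa, sb, _ in test)
    return fused

# Toy scores for two trials over two languages.
trials = [([0.8, 0.2], [0.6, 0.4], "eng"), ([0.1, 0.9], [0.3, 0.7], "spa")]
fused = two_fold_fusion(trials)
```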
0:18:51 | So the messages to take from here are: first, end-to-end learning in this very hard scenario is able to get results comparable with the i-vector, but it gets much worse as the duration increases, because the i-vector keeps extracting better results while the LSTM stays at the same performance. |
---|
0:19:18 | But the good thing is that the mismatch is not so big that we cannot do a good fusion: even on the unseen test set, we can improve on the performance of the i-vector with the fusion. |
---|
0:19:36 | So, to conclude the work, the main take-away messages are: first of all, on a controlled, balanced scenario we have very promising results; it is a very simple system, with eighty-five percent fewer parameters, that is able to get a fifteen percent relative improvement. |
---|
0:19:56 | The problem is that once it goes to an imbalanced, more realistic evaluation, the results are not as good. And finally, we saw that under strong mismatches and harder scenarios we are not able to extract the information as well, so there is a need for variability compensation. But we still think it is a really promising approach that leads to simpler systems that can get quite good results. |
---|
0:20:38 | Time for questions. |
---|
0:20:50 | Just a small comment: you said that you are averaging the outputs over the last ten percent of the frames, and you always use ten percent, both for the three-second test and for the thirty-second test. Did you try to just average over, say, the thirty last frames, independently of the duration of the utterance? |
---|
0:21:11 | Actually, not for these scenarios, but for the LRE ones we tried a lot of things: not only averaging, but also taking the mean, selecting only the best output, or just dropping the outputs that are outliers. And we found that it did not really change things much there; but maybe in a more challenging scenario it would be worth it. We have not tried. |
---|
0:21:51 | Is it possible to go back to slide twenty-four? Sorry. I noticed you are always getting an improvement with the fusion of the LSTM versus the i-vector, but when you look at the English cluster, the fusion actually did worse than the i-vector system: 3.86 or 3.87, while the i-vector had 1.9. That is the only case where you did not get an improvement. Is there a reason why? Maybe it happened because your LSTM actually had worse performance there, so you got a worse LSTM system into the fusion? |
---|
0:22:36 | So, I am not completely sure, but we have some idea of why that happened. The idea is that, for training the systems, what we did is this oversampling, and in the English cluster there was one language, I think it was British English, that had only half an hour of data for training the LSTM. That of course hurt it, so it has worse performance, but I think it also hurt the fusion. |
---|
0:23:03 | When you have one language with less data, for the DNN or the LSTM you can more or less fix it with oversampling; but the fusion usually needs much less data in general, so in all the other clusters that was not a problem: even if they are imbalanced, for calibration you still have enough of all of them. But for the English one, I think we did not have enough data for calibrating, that is, for training the fusion. So I think that was the issue: the fusion is not well trained because we do not have enough data for one of the languages. |
---|
0:23:54 | I have a question. I found it quite interesting that your LSTM has fewer parameters than the i-vector system, and I am wondering about the time complexity: how long does it take at training time and at test time, compared to the i-vector system? |
---|
0:24:16 | The training time is much longer, because we run a lot of iterations. I think that is also because of the way we train, with a different subset per iteration: we need a lot of them. Actually, I think the numbers we have here are not even the best we could get, because this was done for an evaluation, and some of the networks were still improving when we had to stop them and run them as they were. |
---|
0:24:44 | So training time is longer even though the LSTM has much fewer parameters, but testing time is way faster. And of course, one thing is that once you have the network trained you only need to do the forward pass, while with the i-vector, whenever you have new data you always have to extract an i-vector before doing the scoring. |
---|
0:25:10 | Any more questions? Then there is lots of time for coffee, I guess; we will be back at five o'clock for the special session, which is on target speakers. |
---|