0:00:24 | right |
---|
0:00:25 | right |
---|
0:00:26 | short term memory |
---|
0:00:28 | no |
---|
0:00:29 | oh |
---|
0:00:29 | i |
---|
0:00:30 | i |
---|
0:00:31 | i explain feature extraction by non-negative matrix factorization |
---|
0:00:34 | i and |
---|
0:00:35 | and i think that's the case but um so long short-term memory |
---|
0:00:38 | and and i like |
---|
0:00:41 | one thing speech |
---|
0:00:46 | yeah so |
---|
0:00:48 | as probably most of you know |
---|
0:00:51 | the localisation of non-linguistic events in spontaneous speech can have multiple applications; one of them is of course to gain |
---|
0:00:58 | some paralinguistic information |
---|
0:01:00 | for example if you recognise laughter there |
---|
0:01:03 | or sighs or other |
---|
0:01:06 | yeah vocalisations that have a |
---|
0:01:08 | semantic meaning |
---|
0:01:09 | but |
---|
0:01:10 | another application would be to increase the word accuracy of an ASR system |
---|
0:01:16 | so for example if you know where there are lexical items in the speech and where there are no |
---|
0:01:21 | lexical items |
---|
0:01:22 | we can perform decoding only on the lexical items and |
---|
0:01:25 | maybe this can increase your word accuracy |
---|
0:01:28 | so the crucial question here of course is whether to do this inside or outside the ASR framework |
---|
0:01:35 | because if you think |
---|
0:01:36 | of doing it inside the ASR framework you can just add some more models to your recognizer |
---|
0:01:42 | for example models for laughter or other vocal noises |
---|
0:01:46 | and include them in the language model and do the standard acoustic modelling for them |
---|
0:01:52 | but another approach would be to do this outside the ASR framework in a |
---|
0:01:56 | different classifier and this is actually the |
---|
0:01:58 | approach that i pursue here |
---|
0:02:01 | so i do a frame-wise context-sensitive classification |
---|
0:02:05 | of the speech into yeah |
---|
0:02:07 | lexical speech and non-linguistic segments |
---|
0:02:11 | and i do it in a purely data-based way here |
---|
0:02:15 | which means i just train on different non-linguistic segments |
---|
0:02:19 | and speech and try to discriminate them |
---|
0:02:25 | so |
---|
0:02:27 | why i'm confident that this should work is because we already did some work on static classification of speech and |
---|
0:02:34 | non-linguistic vocalisations |
---|
0:02:36 | using NMF features |
---|
0:02:38 | um and an SVM classifier |
---|
0:02:42 | and we could show that NMF features |
---|
0:02:45 | together with the SVM |
---|
0:02:47 | outperform the MFCC classification here |
---|
0:02:51 | but of course static classification means that you already have a pre-segmentation |
---|
0:02:57 | into speech and non-linguistic segments |
---|
0:02:59 | so this is not a realistic application |
---|
0:03:02 | which is why in this study we now include the segmentation part |
---|
0:03:06 | and |
---|
0:03:08 | as classifier we used a long short-term memory recurrent neural network |
---|
0:03:11 | which has been widely and successfully used for phoneme recognition in speech |
---|
0:03:16 | also in spontaneous speech |
---|
0:03:21 | so i don't know how many of you are familiar with non-negative matrix factorization |
---|
0:03:26 | there is just |
---|
0:03:53 | the W matrix |
---|
0:03:56 | as a basis |
---|
0:03:58 | i think that |
---|
0:03:59 | anyway |
---|
0:04:00 | and the H matrix |
---|
0:04:01 | gives you the time activations of those spectra |
---|
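As a rough sketch of the factorisation just described (the variable names and sizes are illustrative, not taken from any particular toolkit): a non-negative spectrogram V is approximated by the product of a basis matrix W and an activation matrix H.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_frames, n_components = 257, 500, 20

# V: non-negative magnitude spectrogram (frequency bins x time frames)
V = np.abs(rng.standard_normal((n_bins, n_frames)))

# W: basis spectra, one column per spectral "building block"
# H: time activations saying how strongly each basis spectrum is active per frame
W = rng.random((n_bins, n_components))
H = rng.random((n_components, n_frames))

V_hat = W @ H   # the NMF model: V is approximated by W times H
```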
0:04:05 | i |
---|
0:04:06 | and here's the place for a little advertisement here |
---|
0:04:10 | yeah because we have an open-source toolkit for NMF |
---|
0:04:13 | which we will also present |
---|
0:04:16 | in the evening poster session |
---|
0:04:18 | so all of our experiments can |
---|
0:04:20 | be redone very easily |
---|
0:04:26 | so the NMF algorithm that we apply |
---|
0:04:29 | is just the multiplicative update, i think it's pretty standard |
---|
0:04:33 | so it's an iterative minimisation of a cost function |
---|
0:04:37 | between the original |
---|
0:04:40 | spectrogram V and the product of W and H |
---|
0:04:44 | and in our previous study we could show that the euclidean distance |
---|
0:04:48 | is not a good measure to minimize here |
---|
0:04:51 | um so we |
---|
0:04:53 | on the one hand we evaluate the Kullback-Leibler divergence |
---|
0:04:56 | and on the other hand uh yeah |
---|
0:04:59 | let's say a new cost function that has been proposed |
---|
0:05:02 | especially for music processing |
---|
0:05:04 | which is the Itakura-Saito divergence |
---|
0:05:07 | and the main difference of those is |
---|
0:05:09 | that the Itakura-Saito divergence is scale-invariant |
---|
0:05:13 | so um low energy components |
---|
0:05:16 | are weighted the same way as high energy components basically |
---|
0:05:20 | in the calculation of the error |
---|
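A rough sketch of such multiplicative updates for the Kullback-Leibler case (the standard Lee-Seung-style rules; the Itakura-Saito variant uses different update factors, and all names here are illustrative rather than the toolkit's actual code):

```python
import numpy as np

def nmf_kl(V, n_components, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimising the (generalised) KL divergence
    between V and W @ H. Returns basis spectra W and activations H."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        # W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H
```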
0:05:26 | so now we move on to the feature extraction by NMF |
---|
0:05:30 | and the idea is to follow a supervised NMF approach |
---|
0:05:34 | which means that the W matrix |
---|
0:05:37 | is predefined |
---|
0:05:38 | um so |
---|
0:05:40 | yeah that is actually |
---|
0:05:41 | an approach that is often pursued in |
---|
0:05:44 | source separation |
---|
0:05:46 | so if you have different sources like speech and noise you pre-initialize the W matrix and can reconstruct |
---|
0:05:52 | the sources afterwards |
---|
0:05:55 | and what we did here is |
---|
0:05:56 | we predefined |
---|
0:05:58 | the W matrix with |
---|
0:06:00 | yeah spectra from different classes |
---|
0:06:03 | which are on the one hand |
---|
0:06:05 | normal speech so to say |
---|
0:06:07 | so with |
---|
0:06:08 | words |
---|
0:06:09 | and then laughter here |
---|
0:06:11 | and other vocal noise |
---|
0:06:13 | and |
---|
0:06:14 | other noise |
---|
0:06:15 | which is mostly environmental noise or microphone noise |
---|
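A hedged sketch of the supervised variant just described: W is stacked from class-wise spectra (speech, laughter, other vocal noise, other noise) obtained beforehand and is kept fixed, so only the activations H are estimated for a new recording. The function and variable names are illustrative.

```python
import numpy as np

def supervised_nmf_activations(V, W_fixed, n_iter=100, eps=1e-9, seed=0):
    """Estimate activations H for a predefined, fixed basis matrix W_fixed
    (columns = class-wise spectra, e.g. [speech | laughter | vocal noise | other noise])
    using KL-divergence multiplicative updates on H only; an Itakura-Saito
    version would change the update factor."""
    rng = np.random.default_rng(seed)
    H = rng.random((W_fixed.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W_fixed @ H + eps
        H *= (W_fixed.T @ (V / WH)) / (W_fixed.T @ np.ones_like(V) + eps)
    return H

# e.g. W_fixed = np.hstack([W_speech, W_laughter, W_vocal_noise, W_other_noise])
```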
0:06:20 | yeah |
---|
0:06:21 | so |
---|
0:06:22 | in an ideal world |
---|
0:06:24 | if you do this decomposition |
---|
0:06:27 | we can just look |
---|
0:06:41 | yeah |
---|
0:06:45 | and then the |
---|
0:06:46 | the activation matrix would exactly give us the temporal location |
---|
0:06:51 | of those segments but of course this |
---|
0:06:53 | is not possible or this does not work like that |
---|
0:06:56 | because of the large spectral overlap between the different spectra |
---|
0:07:00 | from the different classes |
---|
0:07:04 | so |
---|
0:07:08 | the real case |
---|
0:07:11 | is |
---|
0:07:20 | so our approach is uh just to normalize each column of the H matrix |
---|
0:07:26 | to get something like a likelihood |
---|
0:07:28 | that uh a given spectrum was active at a given time frame |
---|
0:07:33 | and because those likelihood features do not contain energy information as opposed to the normal H matrix |
---|
0:07:40 | we also add the energy |
---|
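A minimal sketch of this feature extraction step, assuming the activation matrix H and spectrogram V from the supervised decomposition above; normalising columns to sum to one and appending a log frame energy are the illustrative choices made here.

```python
import numpy as np

def nmf_likelihood_features(H, V, eps=1e-9):
    """Per-frame features: each column of H normalised to sum to one
    (a likelihood-like score that a given basis spectrum was active in
    that frame), plus a frame energy term, since the normalisation
    discards the energy information contained in the raw H matrix."""
    act = H / (H.sum(axis=0, keepdims=True) + eps)        # (n_components x n_frames)
    energy = np.log(V.sum(axis=0, keepdims=True) + eps)   # (1 x n_frames) log frame energy
    return np.vstack([act, energy]).T                     # (n_frames x (n_components + 1))
```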
0:07:45 | okay so now i come to the classification with long short-term memory |
---|
0:07:50 | so my colleague is afterwards presenting |
---|
0:07:54 | another talk on long short-term memory which is why i |
---|
0:07:56 | don't explain it here in more detail |
---|
0:07:59 | so the |
---|
0:08:01 | yeah the drawback of a conventional recurrent network is |
---|
0:08:04 | that the context range is quite limited |
---|
0:08:07 | because the weight of a single input |
---|
0:08:11 | on the output calculation decreases exponentially over time |
---|
0:08:15 | and this is also known as the vanishing gradient problem |
---|
0:08:19 | so the solution |
---|
0:08:21 | or one solution for this is to use long short-term memory cells |
---|
0:08:25 | instead of the standard cells for the neural network |
---|
0:08:29 | which have an internal state that is maintained by a |
---|
0:08:32 | well a connection with a recurrent weight which is |
---|
0:08:35 | constantly at |
---|
0:08:36 | one point zero |
---|
0:08:39 | so this means that the network can actually store information over yeah an arbitrarily long time |
---|
0:08:46 | and of course |
---|
0:08:47 | to also access that information and to update it and maybe to delete it you need some other units |
---|
0:08:53 | that control the state of this cell |
---|
0:08:56 | and these are known as the gate units for input, output and memory |
---|
0:09:01 | and yeah the great advantage of this architecture is that it automatically learns the required amount of context |
---|
0:09:08 | so all those weights for those gate units that control input, output and memory are learned during training by |
---|
0:09:14 | resilient propagation for example |
---|
0:09:18 | so you don't have to specify the required amount of context as you would have to do for |
---|
0:09:23 | example when you |
---|
0:09:24 | just do feature frame stacking |
---|
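As a rough illustration (not the exact formulation used in this work), a single LSTM step with input, output and forget ("memory") gates controlling an internal cell state that is carried from frame to frame with a constant weight of one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step. p holds weight matrices W* (input), U* (recurrent)
    and biases b* for the input (i), forget/'memory' (f) and output (o) gates
    and the candidate cell input (g)."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate: write to the state?
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate: keep or erase the state?
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate: read the state out?
    g = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])   # candidate cell input
    # The cell state is carried over with a constant recurrent weight of 1.0,
    # so information can in principle be stored over arbitrarily long spans.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```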
0:09:28 | so of course you can ask does this give us any advantage |
---|
0:09:31 | over just a |
---|
0:09:33 | normal recurrent network |
---|
0:09:35 | which is why we investigated several architectures in this study |
---|
0:09:40 | so |
---|
0:09:43 | so to just speak in |
---|
0:09:45 | it's here |
---|
0:09:52 | oh |
---|
0:09:54 | so |
---|
0:09:54 | bidirectional actually means that the network processes the input forward and backward |
---|
0:10:00 | and |
---|
0:10:00 | yeah to this end it has |
---|
0:10:02 | two input layers |
---|
0:10:03 | and also two hidden layers |
---|
0:10:06 | and |
---|
0:10:08 | yeah the dimensionality of the input layer is just the number of input features |
---|
0:10:13 | which is three for our NMF configuration |
---|
0:10:16 | or thirty-nine if you just use normal PLP features plus deltas |
---|
0:10:21 | yeah and the size of the hidden layer was evaluated at one hundred and twenty |
---|
0:10:25 | and the output layer |
---|
0:10:27 | just gives the posterior probabilities of the four different classes that i want to discriminate |
---|
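A minimal sketch of such a bidirectional LSTM frame classifier, here written with PyTorch; the framework choice and the concrete sizes (39 PLP-style input features, 120 hidden units per direction, four output classes) are illustrative assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

class BLSTMFrameClassifier(nn.Module):
    """Frame-wise 4-class classifier: speech, laughter, vocal noise, other noise."""
    def __init__(self, n_features=39, n_hidden=120, n_classes=4):
        super().__init__()
        # bidirectional=True gives one forward and one backward hidden layer
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)  # both directions feed the output layer

    def forward(self, x):                 # x: (batch, n_frames, n_features)
        h, _ = self.blstm(x)              # h: (batch, n_frames, 2 * n_hidden)
        return self.out(h)                # per-frame class scores (softmax via the loss)

# usage sketch with random data
model = BLSTMFrameClassifier()
logits = model(torch.randn(8, 300, 39))          # 8 utterances, 300 frames each
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4), torch.randint(0, 4, (8 * 300,)))
```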
0:10:38 | so our evaluation was done on the Buckeye corpus of spontaneous speech |
---|
0:10:42 | i don't know how many of you know it |
---|
0:10:44 | so we took only the subject turns |
---|
0:10:47 | so there remained about twenty five hours of spontaneous speech |
---|
0:10:52 | it's spontaneous in the sense that it's interview speech so there is one interviewer |
---|
0:10:57 | and a test subject and |
---|
0:10:59 | they follow a free conversation |
---|
0:11:01 | without any specific protocol |
---|
0:11:04 | there are forty speakers, twenty male and twenty female |
---|
0:11:07 | and we subdivided the corpus in a speaker-independent manner |
---|
0:11:11 | which means we divided it into a training, validation and test set |
---|
0:11:15 | all stratified by age and gender |
---|
0:11:18 | so the percentages were around eighty percent for training ten percent for validation and ten percent for test |
---|
0:11:26 | and yeah to make it more reproducible we did this subdivision in ascending order of speaker id |
---|
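A small sketch of the kind of speaker-level split described, walking through speaker IDs in ascending order; the helper name is made up here and the additional stratification by age and gender is only hinted at in the comment:

```python
def speaker_independent_split(speaker_ids, train=0.8, valid=0.1):
    """Split speakers (not utterances) into train/validation/test sets,
    walking through speaker IDs in ascending order for reproducibility.
    Stratification by age and gender would additionally group speakers
    by those attributes before splitting."""
    ids = sorted(speaker_ids)
    n_train = round(train * len(ids))
    n_valid = round(valid * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

train_spk, valid_spk, test_spk = speaker_independent_split(range(1, 41))
```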
0:11:33 | and the corpus also comes with an automatic alignment |
---|
0:11:36 | of |
---|
0:11:37 | speech and phonemes |
---|
0:11:39 | and laughter, vocal noise and other noise |
---|
0:11:41 | and this automatic alignment was used on the training data |
---|
0:11:45 | to train the NMF |
---|
0:11:49 | as well as to train the neural network |
---|
0:11:55 | here is just a short summary |
---|
0:11:57 | of the different sizes of the test set for the different classes |
---|
0:12:02 | so as you would expect the speech class is predominant |
---|
0:12:07 | and |
---|
0:12:08 | yeah especially the laughter and the other noise class are quite |
---|
0:12:13 | sparse |
---|
0:12:14 | especially in the test set |
---|
0:12:18 | so |
---|
0:12:20 | yeah the evaluation that we did |
---|
0:12:23 | was |
---|
0:12:24 | motivated by the question whether it is better to model the non-linguistic vocalisations inside the ASR system |
---|
0:12:31 | or outside the ASR system |
---|
0:12:34 | which is why we set up and |
---|
0:12:35 | yeah produced an ASR system on the Buckeye corpus |
---|
0:12:40 | as a reference |
---|
0:12:42 | so i'm going |
---|
0:12:43 | yeah quite fast here because it's all pretty standard |
---|
0:12:47 | PLP coefficients plus deltas and |
---|
0:12:50 | a bigram language model trained on the Buckeye training set |
---|
0:12:53 | we also experimented with other language models but it didn't increase the word accuracy |
---|
0:12:58 | in addition to the thirty-nine monophones |
---|
0:13:01 | we had three models for non-linguistic vocalisations: laughter, vocal noise and other noise |
---|
0:13:06 | but with twice as many states as the phoneme models |
---|
0:13:10 | and we estimated state-clustered triphones with |
---|
0:13:13 | sixteen to thirty-two mixtures |
---|
0:13:15 | and yeah as you can see the word accuracy of the system is quite low |
---|
0:13:23 | which is quite common actually for spontaneous speech |
---|
0:13:28 | on this slide i have the comparison of the discriminability of the different classes by the different types of |
---|
0:13:35 | RNNs |
---|
0:13:36 | and the general trend that you can see is that the |
---|
0:13:39 | normal RNN has the lowest frame-wise F1 measure |
---|
0:13:44 | which is the primary evaluation measure here |
---|
0:13:47 | and UA and WA stand for unweighted average over the four classes and weighted average |
---|
0:13:53 | weighted is just weighted by the prior class probability |
---|
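In scikit-learn terms, the two summary scores correspond roughly to macro-averaged (unweighted) and support-weighted per-class F1; the frame labels below are placeholders:

```python
from sklearn.metrics import f1_score

# y_true, y_pred: one label per frame from {speech, laughter, vocal_noise, other_noise}
y_true = ["speech", "speech", "laughter", "other_noise", "vocal_noise", "speech"]
y_pred = ["speech", "laughter", "laughter", "speech", "vocal_noise", "speech"]

ua_f1 = f1_score(y_true, y_pred, average="macro")     # unweighted average over the 4 classes
wa_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class prior (support)
```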
0:13:59 | um what you also can see |
---|
0:14:01 | is that the LSTM concept doesn't |
---|
0:14:04 | give that much gain over the normal RNN |
---|
0:14:07 | but the |
---|
0:14:09 | bidirectional LSTM, the BLSTM, |
---|
0:14:11 | delivers a gain for |
---|
0:14:13 | almost all the classes over |
---|
0:14:15 | all the other networks |
---|
0:14:17 | the only class where this is not the case is the other noise class |
---|
0:14:21 | but this might also be caused just by the sparsity |
---|
0:14:24 | as i've indicated before |
---|
0:14:31 | so according to the BLSTM size and features |
---|
0:14:35 | we can actually conclude that the NMF |
---|
0:14:37 | features computed with the Kullback-Leibler divergence |
---|
0:14:41 | outperform the PLP features |
---|
0:14:43 | and also the NMF features |
---|
0:14:46 | generated by the Itakura-Saito divergence |
---|
0:14:50 | and the improvements are |
---|
0:14:52 | especially |
---|
0:14:54 | observable for the |
---|
0:14:56 | well you know |
---|
0:14:58 | the other noise class |
---|
0:15:00 | but as i said |
---|
0:15:02 | yeah there is not that much data on that |
---|
0:15:05 | but um in sum we can see that the unweighted average |
---|
0:15:08 | increases by about two percent absolute |
---|
0:15:12 | from the PLP features to the KL-based NMF features |
---|
0:15:17 | and yeah the weighted |
---|
0:15:19 | average is of course dominated by the performance on speech |
---|
0:15:23 | which is |
---|
0:15:24 | also increased |
---|
0:15:29 | so now to come to a conclusion whether it is better to model the |
---|
0:15:34 | vocalisations inside ASR or outside ASR |
---|
0:15:38 | we can see that |
---|
0:15:40 | actually except for the laughter class it is always better |
---|
0:15:44 | in terms of frame-wise F1 measure |
---|
0:15:46 | to model it with the BLSTM approach |
---|
0:15:50 | instead of direct modelling in the ASR |
---|
0:15:54 | so |
---|
0:15:55 | there are yeah |
---|
0:15:57 | quite a few differences |
---|
0:15:58 | according to recall and precision |
---|
0:16:00 | because we are actually not talking about detection here |
---|
0:16:03 | but about classification so of course |
---|
0:16:06 | we could also |
---|
0:16:08 | uh yeah |
---|
0:16:10 | reduce it to a binary detection task and |
---|
0:16:12 | calculate AUC measures but we have not done that here |
---|
0:16:16 | um |
---|
0:16:19 | yeah so overall the |
---|
0:16:21 | unweighted average accuracy, or unweighted average recall, |
---|
0:16:25 | is increased from |
---|
0:16:27 | you ninety one but |
---|
0:16:28 | three five |
---|
0:16:29 | a nine point one percent |
---|
0:16:32 | and this improvement is also significant |
---|
0:16:35 | as was said previously |
---|
0:16:37 | it's not a |
---|
0:16:38 | real significance test here but |
---|
0:16:40 | just a heuristic measure; the p-value is actually smaller than |
---|
0:16:44 | ten to the minus three here |
---|
0:16:49 | so concluding we can say that the |
---|
0:16:51 | BLSTM approach |
---|
0:16:53 | delivered a |
---|
0:16:54 | quite high |
---|
0:16:55 | reduction of the frame-wise error rate |
---|
0:16:58 | by thirty |
---|
0:16:59 | seven point five percent relative |
---|
0:17:01 | and best results have been obtained with the |
---|
0:17:04 | KL divergence as NMF cost function |
---|
0:17:07 | and of course future work will deal with |
---|
0:17:10 | how to integrate this BLSTM classifier actually into the ASR system |
---|
0:17:16 | and for this we have a quite promising approach |
---|
0:17:18 | a multi-stream HMM ASR system which is |
---|
0:17:21 | currently also using a BLSTM in phoneme prediction |
---|
0:17:25 | and it could also use the |
---|
0:17:27 | prediction of whether there is speech or non-linguistic vocalisations |
---|
0:17:31 | and other improvements relate to the NMF and there we could |
---|
0:17:36 | include context-sensitive NMF features |
---|
0:17:39 | like features obtained by non-negative matrix deconvolution |
---|
0:17:43 | or also use sparsity constraints |
---|
0:17:45 | in the supervised NMF |
---|
0:17:47 | to improve the discrimination |
---|
0:17:50 | so this concludes the talk from my part |
---|
0:17:52 | and i'm looking forward to your questions |
---|
0:17:55 | thank you very much |
---|
0:17:56 | i |
---|
0:18:00 | somehow we moved from slightly behind schedule to being ahead of schedule so there's quite some |
---|
0:18:05 | time for questions |
---|
0:18:11 | and |
---|
0:18:17 | have you done some experiments |
---|
0:18:20 | whether the ASR modelling outperforms this one or whether they both together |
---|
0:18:27 | outperform each single one |
---|
0:18:37 | channel |
---|
0:18:43 | so how about the results on the mixed speech |
---|
0:18:47 | i mean the speech that is mixed up with laughs |
---|
0:18:53 | yeah yeah yeah |
---|
0:19:15 | i have a question |
---|
0:19:17 | can |
---|
0:19:30 | i |
---|
0:19:33 | a short reply |
---|
0:19:34 | so |
---|
0:19:36 | thing |
---|
0:19:37 | let's thank Felix again |
---|