0:00:15 | The next presentation is "Exemplar-Based Sparse Representation and Sparse Discrimination for Noise-Robust Speaker Identification".
0:00:43 | Thank you. This is joint work between our group and another university.
0:00:56 | As far as we know, this is the first time this kind of approach has been tried for speaker recognition.
0:01:15 | So, speaker recognition in noisy conditions has received a lot of attention recently.
0:01:37 | What motivated us is that recent studies done in our group showed that the effect of additive noise is quite harsh on state-of-the-art speaker recognition, including i-vector based systems. So there needs to be some way to deal with the effect of additive noise in speaker recognition systems.
0:02:07 | Let me first say something about how the effect of noise is dealt with in speaker recognition, especially in i-vector based systems.
0:02:20 | In the recent literature, one line of work tries multi-condition training to deal with different types of noise in speaker recognition. One work trained several different models based on the different noises; another work pooled the different noisy features together in the modelling phase, so that the only training material is multi-condition speech.
0:03:02 | The other direction, besides multi-condition training, is missing features. Multi-condition training means that features contaminated by noise are pooled together in the modelling phase; in the missing-feature approach, the features that are affected by noise are detected and discounted instead.
0:03:31 | There is also work on auditory features and on separation. GFCCs, the cepstral coefficients derived from a gammatone filterbank, have been shown to be quite efficient compared to MFCCs, because the gammatone filterbank is a more robust model of the auditory system. And separation systems based on computational auditory scene analysis try to separate the speech and the noise and build a binary mask marking where the speech is reliable, trying to get clean speech out of it; after that, missing-feature methods such as marginalisation or reconstruction can be applied.
0:04:13 | So there have been several recent efforts to make speaker recognition robust against noise.
0:04:31 | What I am presenting here are preliminary results of our ongoing research towards noise-robust speaker recognition. It is quite different from the approaches you have seen so far, because there the message inside the speech is somehow a disturbance for speaker recognition, whereas what we are doing is essentially speaker recognition jointly with the speech content: it is important what is being said.
0:05:01 | This work is an exemplar-based approach. It means that we keep examples of the data, or clustered examples of the data, in a dictionary, and then we build each observation from what we have in the dictionary.
0:05:17 | We consider a long temporal context of the spectrum. For each frame we compute the mel-band amplitude spectrum; unlike MFCCs, this is just the output of the mel filterbank, the mel-band magnitude spectrum. Then we form a sort of superframe: for every frame we gather all the frames in a context of typically twenty-five frames, which is in the order of two hundred and fifty milliseconds, into one vector, and consider that one building block.
0:06:15 | A sliding window with a shift of one frame is used, so these overlapping superframes cover the whole utterance.
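A minimal sketch of this superframe stacking, assuming a numpy mel spectrogram with frames as rows; the 25-frame context follows the talk, the function and variable names are illustrative:

```python
import numpy as np

def stack_superframes(mel_spec, context=25):
    """Stack `context` consecutive mel-band magnitude frames into one
    long superframe per window position (sliding window, shift of one
    frame), covering the whole utterance.

    mel_spec : (n_frames, n_bands) non-negative mel magnitude spectrogram
    returns  : (n_frames - context + 1, context * n_bands) superframes
    """
    n_frames, _ = mel_spec.shape
    return np.stack([mel_spec[t:t + context].reshape(-1)
                     for t in range(n_frames - context + 1)])
```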
0:06:30 | Next, let me give an example. What we need to do next is build the dictionary: we have these superframes, and we need to build a dictionary that is representative of the speaker.
0:06:49 | In this work we had a small vocabulary, so we were able to run forced alignment with an HMM and assign a label to each of the frames. For example, we have HMM word models, and for each word model we have several states, so each frame can be associated with one of the HMM states. Once we have associated states with frames, we take the long temporal contexts that are labelled as belonging to the same HMM state, all of them representing the same phonetic event, if you like. To make just one representative of this event, we take the element-wise median over all of these long temporal contexts, giving one representative per state. In this particular task we had about two hundred and fifty HMM states per speaker model, so we end up with two hundred and fifty long temporal context vectors, each put into one atom and serving as the representative of one phonetic event.
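A sketch of the atom construction under those assumptions (state labels coming from a forced alignment, one element-wise median per state; names and signatures are hypothetical):

```python
import numpy as np

def build_speaker_dictionary(superframes, state_labels, n_states=250):
    """One atom per HMM state: the element-wise median of all superframes
    force-aligned to that state, giving one representative per phonetic
    event for this speaker.

    superframes  : (n_windows, dim) stacked superframes for one speaker
    state_labels : (n_windows,) HMM state index of each window
    returns      : (n_states, dim) dictionary atoms, one row per state
    """
    atoms = np.zeros((n_states, superframes.shape[1]))
    for s in range(n_states):
        members = superframes[state_labels == s]
        if len(members):
            atoms[s] = np.median(members, axis=0)
    return atoms
```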
0:08:22 | This is done per speaker: for each speaker we have an HMM trained on that speaker's data, and the resulting atoms are stored in the dictionary.
0:08:34 | In addition, we append a part to the dictionary to model the noise. For the noise exemplars we exploit the fact that each utterance is embedded in a larger recording: we know at what time the speaker is going to start, so we sample the noise from the beginning of the recording up to the time the speech starts. So this is content-aware dictionary building: normally in sparse representation people do all kinds of data-driven dictionary learning, but here we know exactly what we are putting into the dictionary, and that is a strength of this approach.
0:09:27 | Then there is the factorisation. In the factorisation we approximate the observation from the dictionary and the activations, y ≈ Ax, with A holding the atoms of the dictionary and x the activation vector. Here is a pictorial representation, adapted from an ICASSP 2012 paper, because we were doing the same kind of decomposition there: these are three atoms from the dictionary, each a long temporal context of the spectrum, and the observation.
0:10:08 | In the observation we have these events coming one after the other, and in decomposing the observation frames we have to minimise the distance between the observation and a linear combination of atoms. So if, for example, three atoms are used, the activation vector has three non-zero elements, and the linear combination of those atoms builds the observation.
0:10:42 | We have non-negative matrix factorisation and also non-negative matrix deconvolution. What is done in both is that a distance function is minimised to make the approximation similar to what we observe. The distance function is actually not the Euclidean distance; what we are using is a scaled divergence function, presented in the references of the paper. In addition, we have a penalty term to make the activations sparse.
0:11:15 | The sparsity penalty controls how sparse the estimate is. It means that the observation should be estimated from only a few of the atoms of the dictionary; we should not use a combination of all of them with optimally tuned weights. The reason is that these atoms are themselves events of speech that we have seen before, so we should not need to combine many of them to represent the current context.
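The talk does not spell out the optimiser; as one standard instance of this kind of objective, a generalised KL divergence with an L1 penalty on the activations can be minimised with multiplicative updates, sketched below (plain NMF on independent windows, not the deconvolutive variant actually used in this work):

```python
import numpy as np

def sparse_nmf_activations(Y, A, lam=0.1, n_iter=200, eps=1e-12):
    """Non-negative activations X with Y ≈ A @ X, minimising a generalised
    KL divergence plus lam * sum(X) (sparsity), via the standard
    multiplicative updates with the penalty added to the denominator.

    Y : (dim, n_windows) observed superframes as columns
    A : (dim, n_atoms) dictionary, speaker atoms then noise atoms
    """
    X = np.random.rand(A.shape[1], Y.shape[1])
    denom = A.sum(axis=0)[:, None] + lam          # A^T 1 + lambda
    for _ in range(n_iter):
        X *= (A.T @ (Y / (A @ X + eps))) / denom  # keeps X non-negative
    return X
```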
0:11:48 | The non-negative matrix deconvolution that is employed here takes care of the overlap between the events. If a window cannot be built well from the atoms existing in the dictionary, all of its activations can stay at zero, because the observation can still be explained by the atoms activated for the windows before and after it. Plain NMF works window by window: it decomposes each window separately and tries to build each one as closely as possible, minimising the cost over one long temporal context at a time. NMD, in contrast, takes all of the windows into account and minimises the distance over the whole utterance, and that is why it is utilised in this study.
0:12:49 | This framework was developed a few years ago; the background is that these tools were originally used for noise-robust speech recognition, not for speaker recognition.
0:13:04 | To use it for speaker recognition, we build dictionaries of long temporal atoms for each speaker: for example, the two hundred and fifty atoms from each speaker's dictionary are concatenated here, and the noise exemplars are appended. So we have a representation of all the speakers that exist in the task; it is closed-set speaker identification.
0:13:44 | When we decompose, or factorise, the observation against this dictionary, the activation vector that you get is in itself a representative of the speaker identity, because each of the atoms belongs to one of the speakers. When we decompose, we can see which components get activated; because we also have sparsity, not all of them get activated, only a few, usually in the order of fifty.
0:14:22 | The first thing we tried is what we called simple manipulation: we go over the peaks in the activations and see which speaker is talking. But if we just concentrate on one frame, this could be noisy, because we have similarities between the speakers, and for some of the events it can happen that an atom from another speaker gets activated.
0:14:54 | To increase the reliability, we concentrate on the atoms that are activated over time and average over these activations: for each frame we have an activation vector, so for example for two seconds we have two hundred activations. Averaging over the activations somehow de-emphasises the content: the content becomes less important, because the frame-by-frame detail is averaged out, but the information about which speaker is detected is still present.
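A sketch of that scoring, assuming the dictionary is laid out as contiguous per-speaker blocks followed by the noise atoms (the layout and names are assumptions for illustration):

```python
import numpy as np

def identify_speaker(X, n_speakers=34, atoms_per_spk=250):
    """Average activations over frames, sum each speaker's block, and
    pick the speaker whose atoms carry the most averaged activation.

    X : (n_atoms, n_frames) activation matrix from the decomposition;
        rows 0 .. n_speakers*atoms_per_spk-1 are the speaker blocks,
        any remaining rows are noise atoms (ignored for scoring).
    """
    mean_act = X.mean(axis=1)                 # de-emphasises the content
    scores = mean_act[:n_speakers * atoms_per_spk] \
        .reshape(n_speakers, atoms_per_spk).sum(axis=1)
    return int(np.argmax(scores))
```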
0:15:40 | So this averaged activation vector is, in a sense, a feature representing the speaker. In the conventional approach we have MFCCs, and on top of those we have i-vectors as secondary features, and you do classification on the i-vectors; here we have a spectrogram, then the sparse representation, and the averaged activation vector as the representative of the identity.
0:16:09 | What people do with the i-vector to do the classification is to go for LDA or PLDA; some people apply LDA and then PLDA to classify the i-vectors.
0:16:26 | I am not describing those slides in detail now; the point is that the features we have here are sparse. So what can we do better, given that the features are sparse? That was the question, and in the recent literature it has been proposed to use sparse discriminant analysis when the data are sparse.
0:16:53 | Sparse discriminant analysis is a sort of extension of linear discriminant analysis. In ordinary discriminant analysis we need to estimate the within-class scatter matrix; because the data is sparse, this scatter matrix cannot be estimated reliably, so a diagonal term, normally an identity matrix, is added to the within-class scatter matrix, giving a biased but usable estimate.
0:17:24 | Beyond that, the dimensionality of our sparse representation was around eight thousand five hundred, so we also want to make the projection itself sparse: the eigendirections of the between-class scatter matrix are penalised with the L1 norm, which makes the discriminant directions sparse, and those sparse directions are what is utilised.
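A minimal sketch of one such sparse discriminant direction, assuming the identity-loaded within-class scatter described above and a soft-threshold (L1) step on a power-type iteration; this illustrates the idea, not the exact algorithm cited in the paper:

```python
import numpy as np

def sparse_lda_direction(Sb, Sw, lam=0.05, n_iter=100):
    """One sparse discriminant direction d: power-type iteration on the
    between-class scatter Sb, soft-thresholded by lam (the L1 penalty),
    and normalised in the metric of the identity-loaded within-class
    scatter, so dead dimensions of the sparse data stay harmless."""
    Sw_reg = Sw + np.eye(Sw.shape[0])         # biased but invertible
    d = np.random.randn(Sb.shape[0])
    for _ in range(n_iter):
        d = Sb @ d                            # power step on Sb
        d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
        scale = np.sqrt(d @ Sw_reg @ d)
        if scale < 1e-12:
            break                             # penalty zeroed everything
        d /= scale
    return d
```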
0:18:01 | So, moving on to the description of the corpus. People in this community have paid a lot of attention to the CHiME corpus. CHiME stands for computational hearing in multisource environments, and it was the challenge at Interspeech 2012 for noise-robust speech recognition.
0:18:19 | The data was collected in the UK. There are thirty-four speakers, with five hundred segments per speaker in training, and six SNR levels in test, with six hundred files per SNR level.
0:18:37 | The noises were collected in a real domestic living-room environment, so the noise varies very widely in this data. At the lower SNRs we have truly nonstationary noises: the TV is running, the washing machine is working, many things are happening at the same time, and kids are screaming. So it is quite challenging, especially as the SNRs go from nine dB down to minus six dB; it is a very challenging database for speech. The vocabulary, and hence the dictionary, is limited, and each segment is about two seconds long.
0:19:20 | So now let me present some of the results that we have. The first baseline is speaker-dependent HMM training: we decode each test utterance with each speaker's HMMs, so with thirty-four speakers we decode each test segment with thirty-four HMM sets and see which HMM wins. That is the baseline.
0:19:57 | This is the result of that system, using the HMMs for speaker identification. For clean speech it is quite good, since the conditions match, but towards the lower SNRs the HMM likelihoods are not really robust when we look at them from a pure speaker-identification point of view. For reference, the speech recognition results at minus six dB for these HMMs were in the order of thirty-six percent or so.
0:20:37 | Then, going for a GMM system as a further baseline: we wanted to see the results of a speaker-independent, sorry, text-independent system, one which does not care about the content the way the HMM system does, and just does GMM-UBM modelling. What do you gain compared to the HMMs? Since this system is actually designed for speaker recognition, it gives a really large margin of improvement in the noisy environments. But these were not the systems we were proposing; they were just included as baselines.
0:21:23 | So, for example, the simple manipulation result. Remember, simple manipulation is just going to the peaks of the activations, doing a simple averaging of all the activations, and seeing which speaker's atoms get activated, that is, which speaker is present in this trial. It was still in the same range as the GMM-UBM and the HMM, and in noisy conditions it was quite robust. The reason is that neither of those two baselines models the noise, whereas in the exemplar-based approach the noise has always been included inside the dictionary, so it inherently deals with the noise.
0:22:07 | The next one also uses the exemplar features, but with cosine scoring: it is like the simple manipulation, but using the cosine distance between the averaged activation of the test utterance and a representative vector for each speaker. It was better, because now it is the distance between these two vectors that matters; in the simple manipulation nothing is compared against a model, we just manipulate the activations of the test utterance itself.
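A sketch of the cosine scoring under the same assumed layout, with one averaged-activation template per enrolled speaker:

```python
import numpy as np

def cosine_scores(test_mean_act, speaker_templates, eps=1e-12):
    """Cosine similarity between the test utterance's averaged activation
    vector and each speaker's enrolled template (one row per speaker)."""
    t = test_mean_act / (np.linalg.norm(test_mean_act) + eps)
    T = speaker_templates / (
        np.linalg.norm(speaker_templates, axis=1, keepdims=True) + eps)
    return T @ t            # one score per enrolled speaker
```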
0:22:40 | Then we said we can also have training; for the training we used part of the data. This training brought improvements, but mostly close to the clean conditions rather than the noisy ones; the reason was that the training material was clean, so the exemplars seen in training corresponded to clean speech.
0:23:08 | The final result is the one where the training method used is the sparse discriminant analysis. The difference between the two trained systems shows the effect of having sparse features at the input of the training: it really is important that when the features are sparse, the modelling technique should also be sparse to deal better with the data.
0:23:42 | We have actually since improved this further by including group sparsity on top of the L1-norm sparsity. That paper is going to appear at a speech recognition venue, which most of you are probably not going to see, so I am mentioning it here. Group sparsity means that, in addition to imposing plain sparsity, which says you should select only a few atoms, we add more penalty if the activations are spread across groups belonging to different speakers. So it essentially forces the activations to stay inside one speaker's part of the dictionary. It brought improvements on both the development and the test set.
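One common way to express such a group penalty is the sum of per-group L2 norms (an L2,1 norm) over the speaker blocks, sketched here; the exact penalty in the follow-up paper may differ:

```python
import numpy as np

def group_sparsity_penalty(x, n_speakers=34, atoms_per_spk=250):
    """Sum of L2 norms of each speaker's block of the activation vector.
    Penalising this drives whole speaker groups to zero, while the plain
    L1 term keeps the activations sparse inside the surviving group."""
    blocks = x[:n_speakers * atoms_per_spk].reshape(n_speakers,
                                                    atoms_per_spk)
    return np.linalg.norm(blocks, axis=1).sum()
```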
0:24:34 | So, this work is continuing, and there is a lot we are working on to extend it. The limitations are that the task is closed set, and that we are allowed to use the speaker information in the training.
0:24:57 | There are also some issues about the channel effect and the dictionary size if we go to something like NIST data. So far we have modelled the noise; for the channel, the idea is to estimate the channel difference. The activations are different for each frame, but if we consider the channel as constant over an utterance, we can estimate the channel difference between what has been observed in training, when building the dictionary, and in the test.
0:25:47 | Thank you.
0:25:57 | [Question inaudible]
0:26:01 | It is not really different here, because the data was provided such that each two-second utterance occurs somewhere inside a longer recording, and we were able to sample the noise before the speech happening inside it.
0:26:23 | [Question inaudible]
0:26:35 | A strength of this method is that it does not care about SNR estimation or anything like that when it is combining the speech and noise atoms, whether it is the TV inside the room or something else.
0:26:48 | As for the different noise types, the idea we are working on right now is that we do not really need the noise dictionary as such; what is needed is an initialisation for the noise dictionary, which is then adapted during decoding. For the portions where we see that there is no speech activation, we take those observations as adaptation data for the noise dictionary.
0:27:27 | [Question inaudible]
0:28:03 | [Answer partly inaudible] We are also estimating that, but not within the non-negative matrix factorisation itself; there is a linearity assumption in this decomposition, and these are not simply linear features.
0:28:56 | [Question inaudible]
0:29:01 | [Answer partly inaudible] For each frame we test against the dictionary whether the activations come out as all zero; where we are not able to find speech activations, we take it that there is no speech inside that part. [Remainder inaudible]