0:00:15 The next presentation is "Exemplar-Based Sparse Representation and Sparse Discrimination for Speaker Identification."
0:00:43 Thank you. This is joint work with colleagues at another university.
0:00:58 The name of the method may be unfamiliar, but as far as we know this is the first time this sort of approach has been tried for speaker recognition.
0:01:15 So, this talk is about speaker identification in noisy conditions.
0:01:41 What motivated us were recent studies done in our group showing that the effect of noise is quite harsh on state-of-the-art speaker recognition, that is, i-vector based systems, and that there needs to be some way to deal with the effect of additive noise in speaker recognition systems.
0:02:07 As you are aware, there is already something in the recent literature about how to deal with the effect of noise in speaker recognition, especially for i-vector based systems.
0:02:29 One line of work applies multi-condition training to deal with different types of noise in speaker recognition: it trains different models based on different noises.
0:02:44 Related work pools different noisy features together in the modelling phase, so that the models are trained on multi-condition speech.
0:03:02 The other direction combines multi-condition training with missing-feature techniques, auditory features, and separation.
0:03:37 For example, GFCCs, cepstral coefficients computed from a gammatone filterbank, have been shown to be quite effective compared to MFCCs, because the gammatone filterbank is a more robust model of the auditory system.
0:03:52 And separation systems based on computational auditory scene analysis try to separate the speech and the noise and build a mask that indicates where the speech can be relied on, so that the clean speech can be recovered; after that, missing-feature methods such as marginalisation or reconstruction can be applied.
0:04:14 These are the recent efforts to make speaker recognition robust against noise.
0:04:31 What I am presenting here are the preliminary results of our research towards noise-robust speaker recognition, and it is quite different from the approaches you have just seen.
0:04:49 The message inside the speech is, in a sense, disturbing the speaker recognition, so you can think of it as speaker recognition with the speech content as a nuisance: what we want to recognise is who is speaking, and it is not important what is being said.
0:05:01 This works with an exemplar-based approach: we have examples of the data, or clustered examples of the data, in a dictionary, and then we rebuild each observation from what we have in the dictionary.
0:05:17 Here we are considering a long temporal context of the spectrum.
0:05:37 For each frame we do not use standard features like MFCCs; it is just the mel-band amplitude spectrum.
0:05:56 Then we form a sort of superframe: we stack all the frames of a window into one vector. Each frame is typically 25 ms, so the window covers on the order of 250 ms, and we consider that one vector as one building block.
0:06:16 A sliding window is applied so that we cover the whole utterance.
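As a rough illustration of the superframe construction just described (a sketch, assuming a NumPy mel-band amplitude spectrogram; the window length and shift are illustrative, not the exact values from the talk):

```python
import numpy as np

def build_superframes(S, context=10, shift=1):
    """Stack `context` consecutive spectral frames into one long vector.

    S: mel-band amplitude spectrogram, shape (n_mels, n_frames).
    With 25 ms frames, a context of ten frames covers a few hundred
    milliseconds, in the spirit of the long temporal context used here.
    """
    n_mels, n_frames = S.shape
    windows = []
    for start in range(0, n_frames - context + 1, shift):
        # Flatten the window frame by frame into one observation vector.
        windows.append(S[:, start:start + context].flatten(order="F"))
    return np.array(windows).T  # shape: (n_mels * context, n_windows)
```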
0:06:24 Let me give an example of what that looks like.
0:06:39 What we need to do next is to build the dictionary, so that it is representative of the speaker.
0:06:50 In this work we had a small vocabulary, so we were able to run forced alignment with an HMM and label each of the frames.
0:07:03 For example, we have HMM word models, and for each word model we have several states, so each of the frames can be associated with one of the HMM states.
0:07:22 Once we have associated states with frames, we take the context around each frame, the long temporal context, and label it as belonging to the same HMM state.
0:07:38 After that, all the contexts labelled with one state represent the same sort of phonetic event, if you can call it that.
0:07:45 What we do to make just one representative of this event is to take the element-wise median over all of these long temporal contexts, keeping just one representative per state.
0:07:57 In the specific task that we performed, we had some 250 HMM states in, let me say, the HMMs for one speaker, so we end up with 250 long temporal contexts, each put into one vector, as the representatives of these phonetic events.
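A minimal sketch of this median-based atom construction, assuming the superframes from the previous sketch and a forced-alignment state label per superframe (hypothetical variable names):

```python
import numpy as np

def state_atom(superframes, frame_states, state_id):
    """One dictionary atom: the element-wise median over all long temporal
    contexts whose frames were aligned to HMM state `state_id`."""
    selected = superframes[:, frame_states == state_id]
    return np.median(selected, axis=1)

def build_speaker_dictionary(superframes, frame_states, n_states):
    # One atom per HMM state, e.g. ~250 atoms per speaker in this task.
    return np.column_stack([state_atom(superframes, frame_states, s)
                            for s in range(n_states)])
```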
0:08:22 This is done per speaker: per speaker we have HMMs trained on that speaker's data, and these atoms are stored in the dictionary.
0:08:38 In addition, we also have a noise part in the dictionary to model the noise.
0:08:49 For the noise dictionary we exploit the fact that the utterances are embedded in larger recordings: we know when the speaker is going to start, so we sample the noise from the background before the time the utterance starts.
0:09:10 So the dictionary is built in a context-aware way. This is not the normal way people do sparse representation, where there is a lot of random sampling in dictionary building; here the dictionary is built with the task in mind and we know what we are building, and that is one of the strengths of this approach.
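A small sketch of how such noise exemplars could be cut from the background context preceding the utterance (illustrative only; the number of exemplars and the window length are assumptions):

```python
import numpy as np

def sample_noise_exemplars(S_background, context=10, n_exemplars=50, seed=0):
    """Cut noise exemplars from the background spectrogram recorded just
    before the utterance starts, in the same superframe format."""
    rng = np.random.default_rng(seed)
    n_mels, n_frames = S_background.shape
    starts = rng.integers(0, n_frames - context, size=n_exemplars)
    return np.column_stack([S_background[:, s:s + context].flatten(order="F")
                            for s in starts])
```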
0:09:27 Then there is the factorisation. In the factorisation we approximate the observation based on the dictionary and the activations: the dictionary A holds the atoms, and X holds the activations of those atoms.
0:09:45 Here is a pictorial representation, adapted from our ICASSP 2012 paper, where we were doing the same kind of decomposition.
0:09:54 These are three atoms from the dictionary, each one a long temporal context of the spectrum, and this is an observation: a sequence of such events coming one after another.
0:10:13 In decomposing the observation into these frames, we need to minimise the distance between the observation and a linear combination of the atoms.
0:10:28 So here, for example, we have three atoms, and in the activations we have three elements giving the linear combination of atoms that builds the observation.
0:10:42 This can be done with non-negative matrix factorisation, and also with non-negative matrix deconvolution.
0:10:49 What is done in both is that a distance function is minimised to make the reconstruction as similar as possible to what we observe.
0:10:56 The distance function is actually not a Euclidean distance; what we are using is the generalised Kullback-Leibler divergence, as presented in the references of the paper. In addition we have a penalty term to make the activations sparse.
0:11:15 The sparsity penalty is there so that what is being estimated is sparse: if we want to approximate the observation, it should be approximated from a few of the atoms of the dictionary; we should not use a combination of all of them with optimally tuned weights.
0:11:33 The reason is that these atoms are, in a sense, events of the speech that we have seen before, and we should not need to combine many of the observed events to represent the current context.
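As a concrete (non-convolutive) sketch of this objective, here are the standard multiplicative updates for activations minimising the generalised KL divergence plus an L1 penalty; the talk itself uses the convolutive variant (NMD), which additionally couples atoms across time shifts, and the sparsity weight below is an assumed, illustrative value:

```python
import numpy as np

def sparse_activations(Y, A, n_iter=200, lam=1.0, eps=1e-12):
    """Multiplicative updates for min_X d_KL(Y || A @ X) + lam * sum(X).

    Y: non-negative observations (features x windows); A: dictionary.
    """
    X = np.full((A.shape[1], Y.shape[1]), 0.1)
    denom = A.sum(axis=0, keepdims=True).T + lam  # A^T 1 + lambda
    for _ in range(n_iter):
        ratio = Y / (A @ X + eps)
        X *= (A.T @ ratio) / denom  # update keeps X non-negative
    return X
```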
0:11:48 The non-negative matrix deconvolution that is employed here takes care of the overlap between the events across the window shifts: if an observation can already be built from the atoms activated at the neighbouring shifts, then all of the activations stay at zero here, because it can be built from the windows before and after; that is the way it is represented.
0:12:17 And rather than working just one window at a time, where each decomposition tries to build one window and the cost function is minimised over one long temporal context, it takes all of them into account and minimises the distance over the whole utterance jointly.
0:12:41 This technique has some years of background: it was developed and proven for exemplar-based noise-robust speech recognition, and here it is utilised for speakers.
0:13:04 So much for the decomposition; now we need to use it for speaker identification.
0:13:16 We are building dictionaries of long temporal contexts, one for each speaker.
0:13:27 For example, the 250-atom dictionaries from each speaker are concatenated here, together with the noise exemplars.
0:13:37 So we have a representation of all the speakers that exist in the task; it is closed-set speaker identification.
0:13:46 When we are decomposing, or factorising, an observation against this dictionary, the activation vector that we obtain is in itself a representative of the speaker identity, because each of the atoms belongs to one of the speakers.
0:14:10 When we decompose, we can trace which components are activated; because we also have the sparsity penalty, not all of them get activated, only a few, usually on the order of fifty.
0:14:22 The first thing we did is what we called simple manipulation of the activations: we go over the activations and see which speaker is talking.
0:14:40 If we just concentrated on one frame, this could be noisy, because there are similarities between the speakers, and for some of the events it can happen that an atom from another speaker gets activated.
0:14:54 To improve the reliability, we therefore look at everything that is activated and average over the activations of the whole utterance.
0:15:07 For each window we have one activation vector, so for example for two seconds we have about two hundred activations, and we average over them.
0:15:15 This somehow de-emphasises the content: because of the averaging, the phonetic content becomes less important, but the information about which speaker is detected is still present.
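A minimal sketch of this decision rule, assuming activations from the earlier sketch and a hypothetical per-atom speaker label:

```python
import numpy as np

def identify_speaker(X, atom_speaker, n_speakers):
    """Average activations over all windows of the utterance, sum the mass
    each speaker's atoms received, and pick the largest.

    X: activations (n_atoms x n_windows); atom_speaker: speaker index per
    dictionary atom (noise atoms can be marked -1 and are then ignored).
    """
    mean_act = X.mean(axis=1)  # averaging de-emphasises the content
    scores = np.array([mean_act[atom_speaker == spk].sum()
                       for spk in range(n_speakers)])
    return int(np.argmax(scores))
```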
0:15:40 So you can think of this averaged activation vector as a feature representing the speaker.
0:15:46 In the normal approach we have MFCCs, and on top of these we have i-vectors as secondary features, and then you do classification on the i-vectors. Here we have a spectrogram, then the sparse representation, and then the average activation vector as the representative of identity.
0:16:10 So what we are able to do for the classification is to go for LDA or PLDA on top of it, just as people do LDA and then PLDA to classify i-vectors.
0:16:26 I am not describing those slides in detail here.
0:16:28 But here is the point: the features that we have are sparse.
0:16:36 So what can be a better classifier when the features are sparse? That was a question I took to the literature, and recently sparse discriminant analysis has been proposed for when the data are sparse.
0:16:53 Sparse discriminant analysis works as a sort of extension to linear discriminant analysis.
0:16:59 In sparse discriminant analysis we need to take care of the within-class covariance, the within-class scatter estimation, because when the data is sparse this scatter matrix cannot be estimated reliably.
0:17:11 So a term is added to the within-class scatter matrix, normally a scaled identity matrix, giving a biased but usable estimate.
0:17:24 Then, to make the projection itself sparse (the sparse representation had on the order of 8,500 dimensions), the eigen-directions of the between-class scatter matrix are penalised with an L1 norm. In this way it is possible to make the discriminant directions sparse, and these sparse directions are what is utilised.
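A rough sketch of the two ingredients just named: a regularised (biased) within-class scatter, and an L1-style shrinkage of the discriminant directions. Real sparse discriminant analysis solves a penalised optimisation problem; soft-thresholding ordinary LDA directions, as below, only illustrates the idea:

```python
import numpy as np
from scipy.linalg import eigh

def sparse_discriminant_directions(X, y, alpha=1.0, l1=0.01, n_dirs=10):
    """X: (n_samples, n_features) sparse activation features; y: labels."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between-class scatter
    # Biased estimate: add a scaled identity so Sw is invertible even though
    # the sparse data cannot support a full-rank estimate.
    evals, evecs = eigh(Sb, Sw + alpha * np.eye(d))
    W = evecs[:, np.argsort(evals)[::-1][:n_dirs]]  # leading directions
    # L1-style shrinkage: soft-threshold to obtain sparse directions.
    return np.sign(W) * np.maximum(np.abs(W) - l1, 0.0)
```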
0:18:01 Going to the description of the corpus: people in this community have paid a lot of attention to the CHiME corpus. CHiME stands for computational hearing in multisource environments, and it was the basis of a challenge on noise-robust speech recognition around Interspeech 2012.
0:18:19 The data was collected in the UK. There are thirty-four speakers, with five hundred clean segments per speaker in training, and six SNR levels in the test, with six hundred files per SNR level.
0:18:39 The noises were collected in a real room environment, a living room, so the noise varies very widely. At the lower SNRs we have really non-stationary noises: the TV is running, the washing machine is working, kids are screaming, many things happening at the same time.
0:19:05 At the lowest SNRs, from 0 dB down to minus 6 dB, it is a very challenging database.
0:19:12 The dictionary material is limited: all the segments are about two seconds long.
0:19:20 So now I present some of the results that we have.
0:19:35 The first baseline is speaker-dependent HMM training: we decode each test utterance with each speaker's HMMs. We have thirty-four sets of HMMs, we decode each test segment with all thirty-four, and whichever gives the highest likelihood determines the baseline decision.
0:19:57 This is the result of that, considering only the speaker decision.
0:20:02 For clean speech it is quite good, since the condition is matched, but towards the lower SNRs the HMM likelihood is not really robust when we look at it from the pure speaker recognition point of view.
0:20:23 For reference, the speech recognition results at minus 6 dB for these HMMs were something on the order of thirty-six percent or so.
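This baseline reduces to picking the speaker whose models best explain the utterance; a sketch, assuming hmmlearn-style models that expose a `score` (log-likelihood) method:

```python
import numpy as np

def hmm_baseline_identify(utterance_feats, speaker_hmms):
    """Decode the test utterance with every speaker's HMM set and pick the
    speaker whose models give the highest log-likelihood."""
    scores = [hmm.score(utterance_feats) for hmm in speaker_hmms]
    return int(np.argmax(scores))
```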
0:20:37 Going to a GMM system, a very basic baseline: we just wanted to see the results of a speaker-independent, sorry, text-independent system, which does not care about the content the way the HMM system does; we just do GMM-UBM modelling.
0:20:59 What gain can you have compared to the HMM? Since this is in some sense designed for speaker recognition, it gives us a really large margin of improvement in the noisy environments.
0:21:16 But this was not the approach we are proposing; these two were just the baselines included here.
0:21:23 So, the results, for example, of the simple manipulation.
0:21:27 Remember, simple manipulation is just going to the activations and doing a simple averaging over all of them, to see which atoms get activated and which speaker is present in this trial.
0:21:43 It was still in a comparable range, and compared to the GMM-UBM and the HMM it was quite robust in noisy conditions. The reason is that neither of those baselines models the noise, whereas in the exemplar-based approach we have the noise included inside the dictionary, so it is dealing with the noise by having the noise inside.
0:22:09 The next one is cosine scoring on these exemplar-based features. It is like the simple manipulation, but using the cosine distance between the average activation vector of the test utterance and a per-speaker representative.
0:22:23 It was also a bit better, because now the distance between these two vectors is what matters; in the simple manipulation nothing was compared to anything, we just did the averaging on the activations of the test utterance itself.
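For completeness, the scoring step is just the cosine similarity between two average activation vectors (a trivial sketch; `x_model` stands for a per-speaker representative built from training data):

```python
import numpy as np

def cosine_score(x_test, x_model, eps=1e-12):
    """Cosine similarity between average activation vectors."""
    return float(x_test @ x_model /
                 (np.linalg.norm(x_test) * np.linalg.norm(x_model) + eps))
```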
0:22:38 Going beyond that, we said that we can also have training; for the training we used LDA.
0:22:48 This training brought improvements, let us say, in the conditions closer to clean. At the noisier end it helped less, and the reason is that the training data was clean, so the exemplars and the training utterances were just clean speech.
0:23:08 And the final result is when the training method used is sparse discriminant analysis; the difference between the two of them shows the effect of having sparse features at the input of the training.
0:23:31 It really is important that when the features are sparse, the modelling technique should also be sparse, to deal better with the data.
0:23:42 We have actually since improved this average by including group sparsity on top of the norm-based sparsity.
0:23:51 That work is going to appear on the speech recognition side, which most of you are probably not going to see, so I am taking the chance to present it here.
0:24:01 Group sparsity means that, besides imposing the norm sparsity which says you should select few activations, we also put more penalty on activations coming from different groups of speakers.
0:24:20 So it sort of forces the activations to concentrate inside one of the speakers.
0:24:27 And it brought an improvement on both the development and the test set.
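A sketch of what such a group penalty looks like: an L2,1 ("group lasso") term over per-speaker blocks of the activation vector, whose gradient can be added to the plain sparsity term in the optimisation (illustrative only; the cited work uses NMD multiplicative updates):

```python
import numpy as np

def group_sparsity_grad(x, groups, lam=1.0, eps=1e-12):
    """Gradient of lam * sum_g ||x_g||_2, where each group g holds one
    speaker's atoms: spreading activation mass over many speakers is
    penalised more than concentrating it inside one group.
    """
    grad = np.zeros_like(x)
    for g in np.unique(groups):
        idx = groups == g
        grad[idx] = lam * x[idx] / (np.linalg.norm(x[idx]) + eps)
    return grad
```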
0:24:34 So, this work is continuing, and a lot is being worked on to fit it to more realistic settings.
0:24:48 As we have posed it here, it is a closed-set task, and we are allowed to use the speaker information in the training.
0:24:57 And there are some issues about the channel effect and the dictionary size when you go to NIST data and the like.
0:25:07 So far we have dealt only with the noise, not the channel. For the channel, the plan is to estimate the channel difference using the fact that, while the activations are different for each frame, the channel is constant over the whole utterance.
0:25:34 So we can estimate the channel difference between what has been observed in the training and what we observe in the test when making the detection.
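In the log-spectral domain a constant channel shows up as a fixed per-band offset, so one conceivable estimate (purely a sketch of the stated idea, not the authors' method) is the average difference between test and training log-spectra:

```python
import numpy as np

def estimate_channel_offset(logspec_test, logspec_train_mean):
    """Per-band channel offset: the channel is assumed constant over the
    utterance, so averaging over frames isolates it from the activations."""
    return logspec_test.mean(axis=1) - logspec_train_mean
```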
0:25:47 Thank you.
0:25:57 [Question from the audience, inaudible.]
0:26:01 It is not really a problem here, because the data was provided such that each two-second utterance happens somewhere inside a longer recording, and we were able to take the noise from what happens before the utterance starts.
0:26:23 [Follow-up, inaudible.]
0:26:35 The strength of this method is that it does not care about the SNR or anything like that when it is combining the speech and noise atoms, whether it is the TV or whatever else is in there.
0:26:48 And about the different noise types: the idea we are working on right now is that we do not really need the noise dictionary; what is needed is just an initialisation for the noise dictionary, and it is adapted during decoding.
0:27:05 Wherever we see frames in which there is no speech activation, we take them as adaptation data for the noise dictionary.
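A small sketch of that adaptation idea (hypothetical names and threshold; the real scheme is ongoing work): windows whose speech activations are near zero are treated as noise-only and used to refresh the noise atoms:

```python
import numpy as np

def adapt_noise_dictionary(Y, X, speech_atoms, noise_dict, thr=1e-3):
    """Y: observations (features x windows); X: activations; speech_atoms:
    boolean mask of the speech atoms; noise_dict: current noise atoms."""
    speech_mass = X[speech_atoms].sum(axis=0)   # per-window speech activation
    noise_only = Y[:, speech_mass < thr]        # windows with no speech
    k = min(noise_only.shape[1], noise_dict.shape[1])
    if k > 0:
        noise_dict[:, :k] = noise_only[:, :k]   # e.g. replace oldest atoms
    return noise_dict
```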
0:27:27 [Question from the audience, inaudible.]
0:28:03 So we are also estimating that with the non-negative matrix factorisation; there is a linearity assumption in this, so these are not simply [inaudible].
0:28:56 [Question from the audience, inaudible.]
0:29:01 Yeah, so for each frame, if the activations of the speech atoms in the dictionary are essentially all zero for the test, we are able to say that there is no speech inside it.
0:29:21 [Remainder of the exchange inaudible.]