So, next to present is exemplar-based sparse representation and sparse discrimination for robust speaker identification. This is joint work with another university. To the best of my knowledge, this is the first time that this sort of approach has been tried for speaker recognition.
Speaker recognition in noisy conditions has received a lot of attention recently, and this is what motivated us. Recent studies done in our group show the effect of noise, and how harsh it is, on state-of-the-art speaker recognition, the i-vector based systems in particular. There needs to be some way to deal with the effect of additive noise in speaker recognition.
So, about how the effect of noise in speaker recognition is dealt with in the recent literature, especially for i-vector based systems. First, people have tried multi-condition training to deal with the different types of noise in speaker recognition. Some of that work was about training several different models, or clean models, on different noises; other work was about extracting features from noisy speech and pooling all of them together in the modelling phase, so that the models only ever see multi-condition speech. The other direction, besides multi-condition training, is missing features. Multi-condition training means that the features are contaminated by noise and pooled together in the modelling phase; missing-feature approaches instead take into account which features are affected by noise, in the so-called missing-feature framework. And the rest is about auditory features and separation. GFCCs, the cepstral coefficients derived from a gammatone filterbank, have been shown to be quite efficient compared to MFCCs, because they come from a more robust model of the auditory system. And there are separation systems based on computational auditory scene analysis that try to separate the speech and the noise and build a binary mask telling where one can rely on the speech, trying to get clean speech out of it; after that, missing-feature marginalisation or reconstruction can be applied.
So the recent trend is to make speaker recognition robust against noise. What I am presenting here are the preliminary results of our research towards noise-robust speaker recognition, and it is quite different from the things you have seen so far, because here the message inside the speech is somehow disturbing the speaker recognition; you could think of it as speaker recognition with a speech recognition flavour, so what is being said plays a role in how the system works. It is an exemplar-based approach, which means that we have examples of the data, or clustered examples of the data, in a dictionary, and then we build up the observation from what we have in that dictionary.
We consider a long temporal context of the spectrum. We build the mel-band amplitude spectrum for each frame; so for each frame, instead of features like MFCCs, this is just the mel-band amplitude spectrum, before any DCT. Then we stack consecutive frames, so for each frame we have a sort of superframe holding all the information in that part; each frame is typically 25 milliseconds, so the superframe is in the order of 250 milliseconds, all in one vector, and we consider that one vector as one building block. A sliding window, with a small shift, is used so that we cover the whole utterance.
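As a rough sketch of this feature extraction, assuming librosa for the mel analysis and hypothetical values for the number of mel bands, the context length, and the window shift (the talk does not give the exact numbers):

```python
import numpy as np
import librosa  # assumed available for the mel-band analysis

def mel_superframes(y, sr, n_mels=21, context=10, hop=1):
    """Stack `context` consecutive mel-band amplitude frames into one
    'superframe' vector; a sliding window covers the whole utterance."""
    # Mel-band amplitude spectrum per frame (no DCT, unlike MFCCs).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=256, power=1.0,
                                       n_mels=n_mels)
    windows = [S[:, t:t + context].reshape(-1)      # ~250 ms per block
               for t in range(0, S.shape[1] - context + 1, hop)]
    return np.asarray(windows).T                    # (n_mels*context, T)
```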
Next, let me give an example of what we need to do to build the dictionary. So we have these superframes, and we need to build a dictionary that is representative of the speaker. In this work we had a small vocabulary, so we were able to do forced alignment with an HMM and make a label for each of the frames. For example, we have HMM word models, and for each of the word models we have several states, so each of these frames can be associated with one of the HMM states. Once we have associated states with frames, we take the context around each frame, the long temporal context, and label it as belonging to that HMM state. All the contexts labelled with the same state are then representing the same sort of phonetic event, if you can call it that. What we do to make just one representative of this event is to take an element-wise median over all of these long temporal contexts, which gives just one representative per state. In this particular task we had some 250 HMM states per speaker model, so we end up with 250 long temporal contexts, each put into one vector, as the representatives of these phonetic events. And this is done per speaker: per speaker we have an HMM trained on that speaker's data, and these atoms are stored in the dictionary.
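A minimal sketch of the atom building, assuming the forced-alignment state labels per window are already available (the alignment itself is not shown):

```python
import numpy as np

def speaker_atoms(superframes, state_labels, n_states=250):
    """One dictionary atom per HMM state: the element-wise median of
    all long temporal contexts aligned to that state."""
    atoms = []
    for s in range(n_states):
        ctx = superframes[:, state_labels == s]
        if ctx.size:                          # state seen in the data
            atoms.append(np.median(ctx, axis=1))
    return np.stack(atoms, axis=1)            # (dim, ~250) per speaker
```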
In addition, we also have a noise part in the dictionary, to model the noise; so we have speaker atoms and noise atoms in the dictionary. For the noise we use a noise dictionary, and in this task it is assumed that the segments exist embedded in large recordings, so we know at what time the speaker is going to start, and we sample the noise from the recording before the time the speech starts and put it in the dictionary. So this is in contrast to the normal way people do sparse representation, where there is a lot of random dictionary building; here it is context-aware, we know what we are building, and that is sort of the strength of this approach.
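A small sketch of this context-aware noise sampling, with a hypothetical number of noise atoms:

```python
import numpy as np

def noise_atoms(recording_superframes, speech_onset, n_atoms=20):
    """Take noise exemplars from the part of the long recording that
    precedes the known speech onset."""
    pre = recording_superframes[:, :speech_onset]   # noise-only region
    idx = np.linspace(0, pre.shape[1] - 1, n_atoms).astype(int)
    return pre[:, idx]                              # (dim, n_atoms)
```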
Then there is the factorisation. In the factorisation we normally approximate the observation based on the dictionary and the activations: the observation y is modelled as A x, with A holding the atoms of the dictionary and x as the activation vector. Here is a pictorial representation; the figure comes from our ICASSP 2012 paper, because we were doing the same thing there. So there are, say, three atoms from the dictionary, each of them a long temporal context of the spectrum, and an observation in which these events are coming one after the other. In decomposing the observation we need to somehow minimise the distance between the observation and the linear combination of the atoms: with three atoms we have three elements in the activation vector, and the linear combination of the atoms builds the observation.
We have non-negative matrix factorisation and also non-negative matrix deconvolution. What is done in both is that a distance function is minimised to make the model quite similar to what we observe. The distance function used here is actually not Euclidean distance; it is a scaled divergence function, as presented in the references of the paper. In addition we have a penalty term to make the activations sparse. The sparsity used here means that if we want to estimate the observation, it should be estimated from a few of the atoms of the dictionary; we should not use a combination of all of them with optimally tuned weights. And that is because, as we said, these atoms are events of the speech as we have seen it before: we do not need to combine the majority of all the observed atoms to represent the current context.
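As an illustration, a minimal sketch of estimating sparse activations with multiplicative updates, using the generalised Kullback-Leibler divergence plus an L1 penalty; the exact divergence, penalty weight, and update schedule in the paper may differ:

```python
import numpy as np

def sparse_activations(Y, A, lam=0.5, n_iter=200, eps=1e-12):
    """Estimate non-negative activations X so that A @ X approximates Y
    under generalised KL divergence with an L1 sparsity penalty lam."""
    X = np.ones((A.shape[1], Y.shape[1]))            # non-negative init
    denom = A.sum(axis=0)[:, None] + lam             # A^T 1 + lambda
    for _ in range(n_iter):
        ratio = Y / (A @ X + eps)
        X *= (A.T @ ratio) / denom                   # multiplicative step
    return X
```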
The non-negative matrix deconvolution that is employed here additionally takes care of the overlap between the events. With a frame-by-frame factorisation there can be a window that cannot be built from the atoms existing in the dictionary, so all of its activations would be zero, because that part of the observation can be explained from the windows before and after it. Plain factorisation works just one by one: it decomposes one window, tries to build it as closely as possible, moves on to the next one, and the cost function is minimised over one long temporal context at a time. Deconvolution instead takes all of this overlap into account and minimises the distance over the whole utterance. That is what was utilised in this study; it was developed some years ago, originally not for this task but for noise-robust speech recognition.
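A compact sketch of the deconvolutive model, in the style of convolutive NMF with multiplicative updates; the per-lag dictionary slices, the L1 weight, and the update rule are assumptions for illustration, and the paper's exact algorithm may differ:

```python
import numpy as np

def shift(M, t):
    """Shift columns of M right by t (left if t < 0), zero-padded."""
    Z = np.zeros_like(M)
    if t > 0:
        Z[:, t:] = M[:, :-t]
    elif t < 0:
        Z[:, :t] = M[:, -t:]
    else:
        Z[:] = M
    return Z

def nmd_activations(Y, A, lam=0.5, n_iter=200, eps=1e-12):
    """Y ~ sum_t A[t] @ shift(X, t): atoms span several frames, so
    overlapping events are explained jointly over the whole utterance.
    A: (T_ctx, dim, K) stack of per-lag dictionary slices."""
    T_ctx = A.shape[0]
    X = np.ones((A.shape[2], Y.shape[1]))
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        model = sum(A[t] @ shift(X, t) for t in range(T_ctx)) + eps
        num = sum(A[t].T @ shift(Y / model, -t) for t in range(T_ctx))
        den = sum(A[t].T @ shift(ones, -t) for t in range(T_ctx)) + lam
        X *= num / den
    return X
```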
So now we use this for speaker recognition. We build dictionaries per speaker, with a long temporal context for each atom; all 250 atoms of the dictionary from each speaker are concatenated here, together with the noise exemplars. So we have a representation of all the speakers that exist in the task; it is closed-set speaker identification.
When we decompose, or factorise, the observation against this dictionary, the activation vector that we obtain is by itself a sort of representative of the speaker identity, because each atom belongs to one of the speakers. When we decompose, it is actually only a few of the components that get activated, because we have the sparsity penalty; not all of them can get activated, usually just a few, in the order of fifty.
Then, in the first approach, which we called simple manipulation or something like that, we go over the per-speaker blocks of the activations and see which speaker it is that is talking. If we concentrated on just one frame, this could be noisy: there are similarities between the speakers and some of the events, so it can happen that an atom from another speaker gets activated. To make this more reliable, instead of concentrating on which atom in each single window is activated, we average over these activations over the whole utterance: for each window we have an activation vector, so for example for two seconds we have about two hundred activation vectors, and we average over them. This averaging also somehow de-emphasises the content: the content becomes less important, because its effect washes out when averaging over the utterance, while the information about which speaker is detected is still present.
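A minimal sketch of this scoring, assuming 34 speakers with 250 atoms each at the top of the activation matrix and the noise atoms below:

```python
import numpy as np

def identify_speaker(X, n_speakers=34, atoms_per_spk=250):
    """Average activations over all windows of the utterance, sum within
    each speaker's block of atoms, and pick the largest total."""
    x_mean = X[:n_speakers * atoms_per_spk].mean(axis=1)  # skip noise part
    per_spk = x_mean.reshape(n_speakers, atoms_per_spk).sum(axis=1)
    return int(np.argmax(per_spk)), per_spk
```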
You can think of this activation vector as a feature representing the speaker. In the normal approach we have MFCCs, and on top of those we have i-vectors as secondary features, and then we do classification of the i-vectors; here we have the spectrogram, then the sparse representation, and the activation vector as the representative of identity. So what we are able to do, for the classification, is to go for LDA or PLDA on top of the activation vectors, the same way people do LDA and then PLDA to classify i-vectors. I am not describing those slides in detail now.
But the features that we have here are sparse. So the question was: what can work better when the features are sparse? In the literature it has recently been proposed to use sparse discriminant analysis when the data are sparse; sparse discriminant analysis is a sort of extension of linear discriminant analysis. In sparse discriminant analysis we need to take care of the within-class covariance estimation, the scatter estimation, because the data is sparse and this scatter matrix cannot be estimated reliably. So a ridge term, normally an identity matrix, is added to the within-class scatter matrix, giving a biased estimation. In addition, the dimension of our sparse representation was around 8,500, and we want the discriminant directions themselves to be sparse, so the eigen-directions of the between-class scatter matrix are penalised with an L1 norm; in this sense it is possible to make the directions sparse, and these sparse directions are what is utilised.
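A very rough sketch of the idea, with a ridge-regularised within-class scatter and a simple soft-thresholding step standing in for the proper L1-penalised optimisation used in the sparse discriminant analysis literature; all parameter values here are hypothetical:

```python
import numpy as np
from scipy.linalg import eigh  # generalised symmetric eigenproblem

def sparse_lda_directions(Z, labels, ridge=1e-2, l1=1e-3, n_dir=10):
    """Discriminant directions for sparse, high-dimensional activations."""
    d = Z.shape[0]
    mu = Z.mean(axis=1, keepdims=True)
    Sw = ridge * np.eye(d)                     # biased within-class scatter
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        mc = Zc.mean(axis=1, keepdims=True)
        Sw += (Zc - mc) @ (Zc - mc).T
        Sb += Zc.shape[1] * (mc - mu) @ (mc - mu).T
    evals, evecs = eigh(Sb, Sw)                # solve Sb w = l Sw w
    W = evecs[:, np.argsort(evals)[::-1][:n_dir]]
    return np.sign(W) * np.maximum(np.abs(W) - l1, 0.0)  # sparsify
```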
So, going to the description of the corpus. People in this community pay a lot of attention to the CHiME corpus; CHiME stands for computational hearing in multisource environments, and there was a challenge on this data for noise-robust speech recognition. The data was collected in the UK; there are 34 speakers, with 500 segments per speaker in training, and in the test there are six SNR levels with 600 files per SNR level. The noisy data were collected in a real living-room environment, so the noises vary very widely. At the lower SNRs we have really nonstationary noises: a TV running, a washing machine working, many things happening at the same time, kids screaming. So it is quite challenging, especially since the SNR goes from 9 dB down to minus 6 dB; it is a very challenging database. The vocabulary is limited, and all the segments are about two seconds long.
So let me present some of the results that we have. As the first baseline we had speaker-dependent HMM training: we decode each test utterance with the HMMs of each speaker, so we have 34 sets of HMMs and we decode each test segment with all 34 of them, and the set of HMMs that wins gives the baseline decision. This is the result of that approach, considering the speaker whose HMMs give the highest likelihood. For clean speech it is quite good, since the conditions match, but towards the lower SNRs the HMM likelihood is not really robust when we look at it from the pure speaker identification point of view. Just to give you a number: the speech recognition accuracy at minus 6 dB for these HMMs was something in the order of thirty-six percent or so.
The second, very basic baseline is a GMM system. Here we wanted to see what the results of a speaker-independent, sorry, text-independent system are, one which does not care about the content the way the HMM system does; we just do standard GMM-UBM modelling, and see what that gives compared to the HMMs. Since this is a system actually designed for speaker recognition, it gives us a really large margin of improvement in the noisy environments. But this was not the thing we set out to study; these two were just the baselines included here.
So, for example, the results of the simple manipulation. Remember, simple manipulation means just going to the per-speaker blocks of the activations and doing a simple averaging over all activations, to see which atoms get activated and hence which speaker is present in this trial. It was still in a reasonable range, and compared to the GMM-UBM and the HMMs it was quite robust in noisy conditions. The reason is that neither of those two has any model of the noise, whereas in the exemplar-based approach the noise is always included inside the dictionary, so it deals with the noise by modelling it explicitly.
The next result is also on the exemplar-based features, but with cosine scoring: the same kind of representation, but using the cosine distance between the averaged activation vector of the test utterance and a representative vector for the speaker. It is better, because now the distance between these two vectors is what matters; in the simple manipulation the test utterance was not compared to anything, we just did the simple manipulation on its own activations.
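A one-function sketch of that scoring; the per-speaker reference vector is assumed here to be the average of the speaker's training activations:

```python
import numpy as np

def cosine_score(x_test, x_ref, eps=1e-12):
    """Cosine similarity between the averaged test activation vector and
    a per-speaker reference activation vector."""
    return float(x_test @ x_ref) / (np.linalg.norm(x_test)
                                    * np.linalg.norm(x_ref) + eps)
```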
Then we said we can also have training; for the training we used LDA. This training brought improvements, though less so towards the noisy conditions, and the reason was that the training data was clean: the exemplar activations used for training came from clean speech only.
And the final result is when we train with sparse discriminant analysis as the training method. The difference between the two training methods shows the effect of having sparse features at the input of the training: it really helps when, for sparse features, the modelling technique is also sparse, so that it deals better with the data. We then improved this average further by including group sparsity on top of the plain norm-based sparsity; this part is in a paper that is going to be presented in the speech recognition sessions, which most likely you are not going to see, so I am taking the chance to present it here.
Group sparsity means that on top of imposing the normal sparsity, where we say that only a few activations should be selected, we also put more penalty on activations that are distributed over different groups, that is, over different speakers. So it sort of encourages the activations to concentrate inside the block of one speaker. It gave an improvement on both the development and the test set, especially in the noisy conditions.
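To illustrate the idea, a sketch of such a mixed-norm penalty, with plain L1 sparsity plus an L2 term per speaker group; the exact formulation and weights in the paper may differ:

```python
import numpy as np

def group_sparse_penalty(x, groups, lam1=0.5, lam2=0.5):
    """L1 + sum-of-group-L2 penalty: the same total activation costs more
    when it is spread over many speakers' blocks than when it stays
    inside one block."""
    l1 = lam1 * np.abs(x).sum()
    l21 = lam2 * sum(np.linalg.norm(x[g]) for g in groups)
    return l1 + l21
```

Here `groups` would be one index array per speaker, for example 250 consecutive atom indices each.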
So this work is continuing, and we are now working on seeing how far it can go. As we have noted, it is a closed-set task, and we are allowed to use the speaker information in the training. And there are some open issues, mainly about the channel effect and about the dictionary size if you go to something like NIST data.
So far we only model the noise and add it to the dictionary; the channel is a different matter. If you look at it, the activations are different for each frame, but if we consider the channel to be constant over an utterance, then we can estimate the channel difference between what has been observed in the training, when we built the dictionary, and the test utterance on which we are making the detection.
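A sketch of what such an utterance-level channel estimate could look like, assuming the channel acts as a fixed per-band gain in the (log) mel domain; this is our reading of the idea, not a published recipe:

```python
import numpy as np

def channel_compensate(Y, Y_model, eps=1e-12):
    """Estimate a constant per-band log-spectral offset between the
    observation Y and its dictionary-based reconstruction Y_model,
    and remove it from the observation."""
    h = np.mean(np.log(Y + eps) - np.log(Y_model + eps), axis=1)
    return Y / np.exp(h)[:, None]          # compensated observation
```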
Thank you.

[Question inaudible.]

Not really different here, because the data was provided as long background recordings, and each of these two-second segments was happening somewhere inside them, so we were able to take the noise occurring right before the segment. And the strength of this method is that it does not care about the SNR when it is combining the speech and noise atoms; how loud, say, the TV is, is handled inside the decomposition.
And about the different noise types: the idea we are working on right now is that we do not really need the noise dictionary as such; what is needed is a sort of initialisation for the noise dictionary, which is then adapted during decoding. For the frames where we see that there is no speech activation, we take them as adaptation data for the noise dictionary.
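A minimal sketch of that adaptation, with a hypothetical activation threshold for deciding that a window contains no speech:

```python
import numpy as np

def adapt_noise_atoms(Y, X, A, n_speech_atoms, thresh=1e-3):
    """Windows whose speech activations are near zero are treated as
    noise observations and used to refresh the noise atoms."""
    speech_energy = X[:n_speech_atoms].sum(axis=0)
    noise_frames = Y[:, speech_energy < thresh]    # no speech activated
    k = A.shape[1] - n_speech_atoms                # noise slots available
    take = min(k, noise_frames.shape[1])
    if take:
        A[:, n_speech_atoms:n_speech_atoms + take] = noise_frames[:, :take]
    return A
```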
[Question inaudible.]

So we are also estimating the noise activations, not only the speech ones; and in the non-negative matrix factorisation the model is linear in this spectral domain, so these are not simple features in that sense.
So for each frame, if the activations against the speech part of the dictionary are all zero, as far as we are able to tell there is no speech inside that frame.