So, next to present is exemplar-based sparse representation and sparse discrimination for robust speaker identification. This is joint work with another university. To the best of my knowledge, this is the first time that this sort of approach has been tried for speaker recognition.
Speaker recognition in noisy conditions has received a lot of attention recently, and this is what motivated us. Recent studies done in our group show the effect of noise, and how harsh it is, on state-of-the-art speaker recognition, the i-vector based systems in particular. There needs to be some way to deal with the effect of additive noise in speaker recognition.
So, about how the effect of noise in speaker recognition is dealt with in the recent literature, especially for i-vector based systems. First, people have tried multi-condition training to deal with the different types of noise in speaker recognition. Some of that work was about training several different models, or clean models, on different noises; other work was about extracting features from noisy speech and pooling all of them together in the modelling phase, so that the models only ever see multi-condition speech. The other direction, besides multi-condition training, is missing features. Multi-condition training means that the features are contaminated by noise and pooled together in the modelling phase; missing-feature approaches instead take into account which features are affected by noise, in the so-called missing-feature framework. And the rest is about auditory features and separation. GFCCs, the cepstral coefficients derived from a gammatone filterbank, have been shown to be quite efficient compared to MFCCs, because they come from a more robust model of the auditory system. And there are separation systems based on computational auditory scene analysis that try to separate the speech and the noise and build a binary mask telling where one can rely on the speech, trying to get clean speech out of it; after that, missing-feature marginalisation or reconstruction can be applied.
So the recent trend is to make speaker recognition robust against noise. What I am presenting here are the preliminary results of our research towards noise-robust speaker recognition, and it is quite different from the things you have seen so far, because here the message inside the speech is somehow disturbing the speaker recognition; you could think of it as speaker recognition with a speech recognition flavour, so what is being said plays a role in how the system works. It is an exemplar-based approach, which means that we have examples of the data, or clustered examples of the data, in a dictionary, and then we build up the observation from what we have in that dictionary.
We consider a long temporal context of the spectrum. We build the mel-band amplitude spectrum for each frame; so for each frame, instead of features like MFCCs, this is just the mel-band amplitude spectrum, before any DCT. Then we stack consecutive frames, so for each frame we have a sort of superframe holding all the information in that part; each frame is typically 25 milliseconds, so the superframe is in the order of 250 milliseconds, all in one vector, and we consider that one vector as one building block. A sliding window, with a small shift, is used so that we cover the whole utterance.
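As a rough sketch of this feature extraction, assuming librosa for the mel analysis and hypothetical values for the number of mel bands, the context length, and the window shift (the talk does not give the exact numbers):

```python
import numpy as np
import librosa  # assumed available for the mel-band analysis

def mel_superframes(y, sr, n_mels=21, context=10, hop=1):
    """Stack `context` consecutive mel-band amplitude frames into one
    'superframe' vector; a sliding window covers the whole utterance."""
    # Mel-band amplitude spectrum per frame (no DCT, unlike MFCCs).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=256, power=1.0,
                                       n_mels=n_mels)
    windows = [S[:, t:t + context].reshape(-1)      # ~250 ms per block
               for t in range(0, S.shape[1] - context + 1, hop)]
    return np.asarray(windows).T                    # (n_mels*context, T)
```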
Next, let me give an example of what we need to do to build the dictionary. So we have these superframes, and we need to build a dictionary that is representative of the speaker. In this work we had a small vocabulary, so we were able to do forced alignment with an HMM and make a label for each of the frames. For example, we have HMM word models, and for each of the word models we have several states, so each of these frames can be associated with one of the HMM states. Once we have associated states with frames, we take the context around each frame, the long temporal context, and label it as belonging to that HMM state. All the contexts labelled with the same state are then representing the same sort of phonetic event, if you can call it that. What we do to make just one representative of this event is to take an element-wise median over all of these long temporal contexts, which gives just one representative per state. In this particular task we had some 250 HMM states per speaker model, so we end up with 250 long temporal contexts, each put into one vector, as the representatives of these phonetic events. And this is done per speaker: per speaker we have an HMM trained on that speaker's data, and these atoms are stored in the dictionary.
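A minimal sketch of the atom building, assuming the forced-alignment state labels per window are already available (the alignment itself is not shown):

```python
import numpy as np

def speaker_atoms(superframes, state_labels, n_states=250):
    """One dictionary atom per HMM state: the element-wise median of
    all long temporal contexts aligned to that state."""
    atoms = []
    for s in range(n_states):
        ctx = superframes[:, state_labels == s]
        if ctx.size:                          # state seen in the data
            atoms.append(np.median(ctx, axis=1))
    return np.stack(atoms, axis=1)            # (dim, ~250) per speaker
```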
In addition, we also have a noise part in the dictionary, to model the noise; so we have speaker atoms and noise atoms in the dictionary. For the noise we use a noise dictionary, and in this task it is assumed that the segments exist embedded in large recordings, so we know at what time the speaker is going to start, and we sample the noise from the recording before the time the speech starts and put it in the dictionary. So this is in contrast to the normal way people do sparse representation, where there is a lot of random dictionary building; here it is context-aware, we know what we are building, and that is sort of the strength of this approach.
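A small sketch of this context-aware noise sampling, with a hypothetical number of noise atoms:

```python
import numpy as np

def noise_atoms(recording_superframes, speech_onset, n_atoms=20):
    """Take noise exemplars from the part of the long recording that
    precedes the known speech onset."""
    pre = recording_superframes[:, :speech_onset]   # noise-only region
    idx = np.linspace(0, pre.shape[1] - 1, n_atoms).astype(int)
    return pre[:, idx]                              # (dim, n_atoms)
```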
Then there is the factorisation. In the factorisation we normally approximate the observation based on the dictionary and the activations: the observation y is modelled as A x, with A holding the atoms of the dictionary and x as the activation vector. Here is a pictorial representation; the figure comes from our ICASSP 2012 paper, because we were doing the same thing there. So there are, say, three atoms from the dictionary, each of them a long temporal context of the spectrum, and an observation in which these events are coming one after the other. In decomposing the observation we need to somehow minimise the distance between the observation and the linear combination of the atoms: with three atoms we have three elements in the activation vector, and the linear combination of the atoms builds the observation.
We have non-negative matrix factorisation and also non-negative matrix deconvolution. What is done in both is that a distance function is minimised to make the model quite similar to what we observe. The distance function used here is actually not Euclidean distance; it is a scaled divergence function, as presented in the references of the paper. In addition we have a penalty term to make the activations sparse. The sparsity used here means that if we want to estimate the observation, it should be estimated from a few of the atoms of the dictionary; we should not use a combination of all of them with optimally tuned weights. And that is because, as we said, these atoms are events of the speech as we have seen it before: we do not need to combine the majority of all the observed atoms to represent the current context.
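As an illustration, a minimal sketch of estimating sparse activations with multiplicative updates, using the generalised Kullback-Leibler divergence plus an L1 penalty; the exact divergence, penalty weight, and update schedule in the paper may differ:

```python
import numpy as np

def sparse_activations(Y, A, lam=0.5, n_iter=200, eps=1e-12):
    """Estimate non-negative activations X so that A @ X approximates Y
    under generalised KL divergence with an L1 sparsity penalty lam."""
    X = np.ones((A.shape[1], Y.shape[1]))            # non-negative init
    denom = A.sum(axis=0)[:, None] + lam             # A^T 1 + lambda
    for _ in range(n_iter):
        ratio = Y / (A @ X + eps)
        X *= (A.T @ ratio) / denom                   # multiplicative step
    return X
```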
The non-negative matrix deconvolution that is employed here additionally takes care of the overlap between the events. With a frame-by-frame factorisation there can be a window that cannot be built from the atoms existing in the dictionary, so all of its activations would be zero, because that part of the observation can be explained from the windows before and after it. Plain factorisation works just one by one: it decomposes one window, tries to build it as closely as possible, moves on to the next one, and the cost function is minimised over one long temporal context at a time. Deconvolution instead takes all of this overlap into account and minimises the distance over the whole utterance. That is what was utilised in this study; it was developed some years ago, originally not for this task but for noise-robust speech recognition.
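A compact sketch of the deconvolutive model, in the style of convolutive NMF with multiplicative updates; the per-lag dictionary slices, the L1 weight, and the update rule are assumptions for illustration, and the paper's exact algorithm may differ:

```python
import numpy as np

def shift(M, t):
    """Shift columns of M right by t (left if t < 0), zero-padded."""
    Z = np.zeros_like(M)
    if t > 0:
        Z[:, t:] = M[:, :-t]
    elif t < 0:
        Z[:, :t] = M[:, -t:]
    else:
        Z[:] = M
    return Z

def nmd_activations(Y, A, lam=0.5, n_iter=200, eps=1e-12):
    """Y ~ sum_t A[t] @ shift(X, t): atoms span several frames, so
    overlapping events are explained jointly over the whole utterance.
    A: (T_ctx, dim, K) stack of per-lag dictionary slices."""
    T_ctx = A.shape[0]
    X = np.ones((A.shape[2], Y.shape[1]))
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        model = sum(A[t] @ shift(X, t) for t in range(T_ctx)) + eps
        num = sum(A[t].T @ shift(Y / model, -t) for t in range(T_ctx))
        den = sum(A[t].T @ shift(ones, -t) for t in range(T_ctx)) + lam
        X *= num / den
    return X
```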
So now we use this for speaker recognition. We build dictionaries per speaker, with a long temporal context for each atom; all 250 atoms of the dictionary from each speaker are concatenated here, together with the noise exemplars. So we have a representation of all the speakers that exist in the task; it is closed-set speaker identification.
When we decompose, or factorise, the observation against this dictionary, the activation vector that we obtain is by itself a sort of representative of the speaker identity, because each atom belongs to one of the speakers. When we decompose, it is actually only a few of the components that get activated, because we have the sparsity penalty; not all of them can get activated, usually just a few, in the order of fifty.
Then, in the first approach, which we called simple manipulation or something like that, we go over the per-speaker blocks of the activations and see which speaker it is that is talking. If we concentrated on just one frame, this could be noisy: there are similarities between the speakers and some of the events, so it can happen that an atom from another speaker gets activated. To make this more reliable, instead of concentrating on which atom in each single window is activated, we average over these activations over the whole utterance: for each window we have an activation vector, so for example for two seconds we have about two hundred activation vectors, and we average over them. This averaging also somehow de-emphasises the content: the content becomes less important, because its effect washes out when averaging over the utterance, while the information about which speaker is detected is still present.
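A minimal sketch of this scoring, assuming 34 speakers with 250 atoms each at the top of the activation matrix and the noise atoms below:

```python
import numpy as np

def identify_speaker(X, n_speakers=34, atoms_per_spk=250):
    """Average activations over all windows of the utterance, sum within
    each speaker's block of atoms, and pick the largest total."""
    x_mean = X[:n_speakers * atoms_per_spk].mean(axis=1)  # skip noise part
    per_spk = x_mean.reshape(n_speakers, atoms_per_spk).sum(axis=1)
    return int(np.argmax(per_spk)), per_spk
```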
You can think of this activation vector as a feature representing the speaker. In the normal approach we have MFCCs, and on top of those we have i-vectors as secondary features, and then we do classification of the i-vectors; here we have the spectrogram, then the sparse representation, and the activation vector as the representative of identity. So what we are able to do, for the classification, is to go for LDA or PLDA on top of the activation vectors, the same way people do LDA and then PLDA to classify i-vectors. I am not describing those slides in detail now.
But the features that we have here are sparse. So the question was: what can work better when the features are sparse? In the literature it has recently been proposed to use sparse discriminant analysis when the data are sparse; sparse discriminant analysis is a sort of extension of linear discriminant analysis. In sparse discriminant analysis we need to take care of the within-class covariance estimation, the scatter estimation, because the data is sparse and this scatter matrix cannot be estimated reliably. So a ridge term, normally an identity matrix, is added to the within-class scatter matrix, giving a biased estimation. In addition, the dimension of our sparse representation was around 8,500, and we want the discriminant directions themselves to be sparse, so the eigen-directions of the between-class scatter matrix are penalised with an L1 norm; in this sense it is possible to make the directions sparse, and these sparse directions are what is utilised.
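A very rough sketch of the idea, with a ridge-regularised within-class scatter and a simple soft-thresholding step standing in for the proper L1-penalised optimisation used in the sparse discriminant analysis literature; all parameter values here are hypothetical:

```python
import numpy as np
from scipy.linalg import eigh  # generalised symmetric eigenproblem

def sparse_lda_directions(Z, labels, ridge=1e-2, l1=1e-3, n_dir=10):
    """Discriminant directions for sparse, high-dimensional activations."""
    d = Z.shape[0]
    mu = Z.mean(axis=1, keepdims=True)
    Sw = ridge * np.eye(d)                     # biased within-class scatter
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        mc = Zc.mean(axis=1, keepdims=True)
        Sw += (Zc - mc) @ (Zc - mc).T
        Sb += Zc.shape[1] * (mc - mu) @ (mc - mu).T
    evals, evecs = eigh(Sb, Sw)                # solve Sb w = l Sw w
    W = evecs[:, np.argsort(evals)[::-1][:n_dir]]
    return np.sign(W) * np.maximum(np.abs(W) - l1, 0.0)  # sparsify
```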
So, going to the description of the corpus. People in this community pay a lot of attention to the CHiME corpus; CHiME stands for computational hearing in multisource environments, and there was a challenge on this data for noise-robust speech recognition. The data was collected in the UK; there are 34 speakers, with 500 segments per speaker in training, and in the test there are six SNR levels with 600 files per SNR level. The noisy data were collected in a real living-room environment, so the noises vary very widely. At the lower SNRs we have really nonstationary noises: a TV running, a washing machine working, many things happening at the same time, kids screaming. So it is quite challenging, especially since the SNR goes from 9 dB down to minus 6 dB; it is a very challenging database. The vocabulary is limited, and all the segments are about two seconds long.
So let me present some of the results that we have. As the first baseline we had speaker-dependent HMM training: we decode each test utterance with the HMMs of each speaker, so we have 34 sets of HMMs and we decode each test segment with all 34 of them, and the set of HMMs that wins gives the baseline decision. This is the result of that approach, considering the speaker whose HMMs give the highest likelihood. For clean speech it is quite good, since the conditions match, but towards the lower SNRs the HMM likelihood is not really robust when we look at it from the pure speaker identification point of view. Just to give you a number: the speech recognition accuracy at minus 6 dB for these HMMs was something in the order of thirty-six percent or so.
The second, very basic baseline is a GMM system. Here we wanted to see what the results of a speaker-independent, sorry, text-independent system are, one which does not care about the content the way the HMM system does; we just do standard GMM-UBM modelling, and see what that gives compared to the HMMs. Since this is a system actually designed for speaker recognition, it gives us a really large margin of improvement in the noisy environments. But this was not the thing we set out to study; these two were just the baselines included here.
So, for example, the results of the simple manipulation. Remember, simple manipulation means just going to the per-speaker blocks of the activations and doing a simple averaging over all activations, to see which atoms get activated and hence which speaker is present in this trial. It was still in a reasonable range, and compared to the GMM-UBM and the HMMs it was quite robust in noisy conditions. The reason is that neither of those two has any model of the noise, whereas in the exemplar-based approach the noise is always included inside the dictionary, so it deals with the noise by modelling it explicitly.
The next result is also on the exemplar-based features, but with cosine scoring: the same kind of representation, but using the cosine distance between the averaged activation vector of the test utterance and a representative vector for the speaker. It is better, because now the distance between these two vectors is what matters; in the simple manipulation the test utterance was not compared to anything, we just did the simple manipulation on its own activations.
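A one-function sketch of that scoring; the per-speaker reference vector is assumed here to be the average of the speaker's training activations:

```python
import numpy as np

def cosine_score(x_test, x_ref, eps=1e-12):
    """Cosine similarity between the averaged test activation vector and
    a per-speaker reference activation vector."""
    return float(x_test @ x_ref) / (np.linalg.norm(x_test)
                                    * np.linalg.norm(x_ref) + eps)
```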
Then we said we can also have training; for the training we used LDA. This training brought improvements, though less so towards the noisy conditions, and the reason was that the training data was clean: the exemplar activations used for training came from clean speech only.
And the final result is when we train with sparse discriminant analysis as the training method. The difference between the two training methods shows the effect of having sparse features at the input of the training: it really helps when, for sparse features, the modelling technique is also sparse, so that it deals better with the data. We then improved this average further by including group sparsity on top of the plain norm-based sparsity; this part is in a paper that is going to be presented in the speech recognition sessions, which most likely you are not going to see, so I am taking the chance to present it here.
Group sparsity means that on top of imposing the normal sparsity, where we say that only a few activations should be selected, we also put more penalty on activations that are distributed over different groups, that is, over different speakers. So it sort of encourages the activations to concentrate inside the block of one speaker. It gave an improvement on both the development and the test set, especially in the noisy conditions.
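To illustrate the idea, a sketch of such a mixed-norm penalty, with plain L1 sparsity plus an L2 term per speaker group; the exact formulation and weights in the paper may differ:

```python
import numpy as np

def group_sparse_penalty(x, groups, lam1=0.5, lam2=0.5):
    """L1 + sum-of-group-L2 penalty: the same total activation costs more
    when it is spread over many speakers' blocks than when it stays
    inside one block."""
    l1 = lam1 * np.abs(x).sum()
    l21 = lam2 * sum(np.linalg.norm(x[g]) for g in groups)
    return l1 + l21
```

Here `groups` would be one index array per speaker, for example 250 consecutive atom indices each.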
So this work is continuing, and we are now working on seeing how far it can go. As we have noted, it is a closed-set task, and we are allowed to use the speaker information in the training. And there are some open issues, mainly about the channel effect and about the dictionary size if you go to something like NIST data.
So far we only model the noise and add it to the dictionary; the channel is a different matter. If you look at it, the activations are different for each frame, but if we consider the channel to be constant over an utterance, then we can estimate the channel difference between what has been observed in the training, when we built the dictionary, and the test utterance on which we are making the detection.
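A sketch of what such an utterance-level channel estimate could look like, assuming the channel acts as a fixed per-band gain in the (log) mel domain; this is our reading of the idea, not a published recipe:

```python
import numpy as np

def channel_compensate(Y, Y_model, eps=1e-12):
    """Estimate a constant per-band log-spectral offset between the
    observation Y and its dictionary-based reconstruction Y_model,
    and remove it from the observation."""
    h = np.mean(np.log(Y + eps) - np.log(Y_model + eps), axis=1)
    return Y / np.exp(h)[:, None]          # compensated observation
```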
Thank you.

[Question inaudible.]

Not really different here, because the data was provided as long background recordings, and each of these two-second segments was happening somewhere inside them, so we were able to take the noise occurring right before the segment. And the strength of this method is that it does not care about the SNR when it is combining the speech and noise atoms; how loud, say, the TV is, is handled inside the decomposition.
And about the different noise types: the idea we are working on right now is that we do not really need the noise dictionary as such; what is needed is a sort of initialisation for the noise dictionary, which is then adapted during decoding. For the frames where we see that there is no speech activation, we take them as adaptation data for the noise dictionary.
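A minimal sketch of that adaptation, with a hypothetical activation threshold for deciding that a window contains no speech:

```python
import numpy as np

def adapt_noise_atoms(Y, X, A, n_speech_atoms, thresh=1e-3):
    """Windows whose speech activations are near zero are treated as
    noise observations and used to refresh the noise atoms."""
    speech_energy = X[:n_speech_atoms].sum(axis=0)
    noise_frames = Y[:, speech_energy < thresh]    # no speech activated
    k = A.shape[1] - n_speech_atoms                # noise slots available
    take = min(k, noise_frames.shape[1])
    if take:
        A[:, n_speech_atoms:n_speech_atoms + take] = noise_frames[:, :take]
    return A
```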
[Question inaudible.]

So we are also estimating the noise activations, not only the speech ones; and in the non-negative matrix factorisation the model is linear in this spectral domain, so these are not simple features in that sense.
So for each frame, if the activations against the speech part of the dictionary are all zero, as far as we are able to tell there is no speech inside that frame.