Hi, I'm with Lincoln Laboratory, and I'm going to move somewhat quickly through some work on channel compensation using MAP-adapted PLDA and denoising DNNs.
This is a brief overview of our work on multichannel speaker recognition with the Mixer corpora. The baseline system is an i-vector system trained on telephone speech only. There are two approaches we're looking at: one adapts the PLDA parameters from telephone data to microphone data, and in the other approach we try to compensate the features coming into the system and retrain the system, which forms a sort of hybrid system. I'll give results along the way.
The basic idea is that we have a system trained on Switchboard data, and it works pretty well when the data we test on is also conversational telephone speech. But, as is well known, if you try to evaluate microphone trials on the same system, the performance is really bad.
There are two approaches people have taken to deal with this. One is a sort of adaptation of the PLDA; I don't think ours is exactly the same adaptation as prior work, but the idea is to bring in some of the subspace, to move the PLDA parameters toward the microphone data.
We also tried feature enhancement; other groups have done this in different places. What this second process does is use a neural network to do the compensation. It's not new in general; I should mention that for the ASpIRE challenge a lot of people used this technique, and it works very well for speech recognition, and that test had microphone data as well.
For the first of these two techniques, we take i-vectors from a telephone-trained system and adapt those to the microphone data. To do that, we take the within-class and across-class covariance parameters used in PLDA scoring, and we adapt those parameters toward the microphone data using relevance MAP, which is just a lambda interpolation.
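To make the interpolation concrete, here is a minimal sketch in Python, assuming the two-covariance view of PLDA; the function and variable names are illustrative, not from our actual code.

```python
import numpy as np

def adapt_plda_covariances(Sw_tel, Sac_tel, Sw_mic, Sac_mic, lam=0.5):
    """Relevance-MAP-style adaptation of the PLDA parameters: a simple
    lambda interpolation between the telephone-trained within-class (Sw)
    and across-class (Sac) covariances and the same statistics estimated
    on microphone i-vectors. lam = 1.0 keeps the telephone parameters;
    lam = 0.0 trusts only the microphone data."""
    Sw_adapt = lam * Sw_tel + (1.0 - lam) * Sw_mic
    Sac_adapt = lam * Sac_tel + (1.0 - lam) * Sac_mic
    return Sw_adapt, Sac_adapt
```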
We found some calibration issues with it: we do pretty well on EER, we get a nice gain at the EER level, but for minDCF we don't see much of a gain. On the other hand, it is a very simple technique, in that you don't change your system: you just retrain these two parameters with existing i-vectors, or you extract new i-vectors from the microphone data. You don't change the system itself.
The DNN approach requires a little more work, in that you have to retrain the system. The DNN is trained on parallel data that's noisy, and it tries to clean it up, to reconstruct a clean signal given a noisy representation of the same data. It's actually a very robust technique and it works quite well, but it does mean you have to retrain your system with that new front end.
For this work we're using three datasets. One is Switchboard 1 and 2; that's what we used for training the baseline system, and all the i-vector parameters are trained with just that data. Then there's Mixer 2, which is a collection from 2004; it's a multi-microphone collection. It had a clean telephone channel and then eight microphones in the room collecting the data in parallel, for around 240 speakers and up to six sessions, I think. It was collected in 2004, so it's an older dataset.
And then there's Mixer 6. For that one they did the same type of collection, but with more speakers, in different rooms, and with fourteen microphones as well as the telephone channel. For the SRE they were focusing a lot on the interview condition with that data, where you have the interviewer and the interviewee in the room and you have to separate the two. To avoid dealing with that issue, we just took the other portion of the sessions, which is a conversation the person is having over the phone, so it's the same collection, but it's conversational data.
That matches the Mixer 2 style, and these are disjoint collections, Mixer 2 and Mixer 6. We use Mixer 2 for developing the system, either for training the DNN or for adapting the parameters, and Mixer 6 we're using for testing, to see how well it works.
Just to give you an idea of what these collections comprise: Mixer 1 and 2 were collected over eight microphones, and Mixer 6 was over fourteen. We found it generated a huge dataset if we used all fourteen, so we just selected six of them based on the distance from the speaker; the Mixer 6 collection comes with documentation about where the microphones were positioned, and that's what we used here. Mixer 1 and 2 was available to us, but we've actually given it to the LDC, and they're planning on making a release if people want to work with this data, so it should probably be available fairly soon, I think.
I should also mention that we're only evaluating on same-mic trials in the Mixer 6 condition: in the trials, the target speaker and the non-target speakers are always on the same mic.
The baseline system is exactly what everybody else is doing with an i-vector system. We start with a UBM that's trained on Switchboard 1 and 2, extract zeroth- and first-order statistics to create a supervector, and then we take the MAP point estimate to get the i-vector, a 600-dimensional i-vector. The whitening is done with Switchboard 2 data as well for the DNN case. For the MAP-adapted case, we actually did the whitening using the Mixer 2 microphone data, and then the Sigma_wc and Sigma_ac covariance parameters are the ones being adapted for the PLDA adaptation.
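As a rough illustration of the whitening step, here is a sketch; the length normalization at the end is standard practice in i-vector systems and is an assumption on my part, since the talk only mentions whitening.

```python
import numpy as np

def train_whitener(dev_ivectors):
    """Estimate a whitening transform (mean and inverse square root of the
    covariance) from development i-vectors, one i-vector per row. Assumes
    the covariance is full rank."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # cov = V diag(vals) V^T
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # W = cov^{-1/2}
    return mu, W

def whiten(ivec, mu, W):
    """Whiten an i-vector, then length-normalize (project to unit sphere)."""
    y = W @ (ivec - mu)
    return y / np.linalg.norm(y)
```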
So, starting with the baseline results. The first result in the table is on SRE 10, and that's just the telephone result; this is sort of the out-of-domain task, where the system is trained on Switchboard and the eval data is the SRE 10 and Mixer data, so you don't have Mixer data as part of training the system. That's about 5.7 percent equal error rate and a 0.62 minDCF. Then you take that system and evaluate it with the Mixer 6 trials, the microphone trials, and you can see the equal error rate goes up by a factor of two or so, and minDCF really takes a hit as well.
The first number there is the average: that's just taking the EER for each channel and then averaging. That number is kind of unrealistic, because typically you'd have to pick one threshold for everything, so the pooled number, I think, is a more practical metric, and that one takes an even bigger hit, because of the calibration problem. For the remaining results I'll focus on the pooled number, since I think it's the more practical metric.
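To spell out the difference between the averaged and pooled numbers, here is a small runnable sketch; the synthetic scores and channel names are hypothetical, just to show why per-channel calibration offsets hurt the pooled number.

```python
import numpy as np

def eer(tgt_scores, non_scores):
    """Equal error rate via a simple threshold sweep over all scores."""
    thresholds = np.sort(np.concatenate([tgt_scores, non_scores]))
    fr = np.array([(tgt_scores < t).mean() for t in thresholds])   # false rejects
    fa = np.array([(non_scores >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[i] + fa[i])

rng = np.random.default_rng(0)
# Hypothetical scores: two channels whose score scales are offset from each other.
tgt = {ch: rng.normal(loc, 1.0, 500) for ch, loc in [("mic1", 2.0), ("mic2", 1.0)]}
non = {ch: rng.normal(loc, 1.0, 5000) for ch, loc in [("mic1", 0.0), ("mic2", -1.0)]}

# Average EER: a separate (optimistic) operating point per microphone.
avg_eer = np.mean([eer(tgt[ch], non[ch]) for ch in tgt])

# Pooled EER: one threshold shared by all microphones, which is what a real
# deployment faces; channel-to-channel calibration offsets make it worse.
pooled_eer = eer(np.concatenate(list(tgt.values())),
                 np.concatenate(list(non.values())))
print(avg_eer, pooled_eer)  # pooled comes out higher than the average
```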
First, the MAP-adapted results. Here you can see that the minDCF really doesn't improve very much, although you do get a pretty big improvement in EER: it goes down by about 31 percent. So that part's nice, but you'd really like to see minDCF get a little better too.
I should mention that for lambda we used 0.5, and the reason for that is that I did a sweep, and you can see there are nice curves at EER, because that's where we get the gain. 0.5 looks fairly optimal across microphones: the 3-D plot shows, for each microphone, the EERs as we sweep the lambda we use for the adaptation, and around 0.5 is where we're seeing a sweet spot. But if you look at minDCF, it doesn't really change very much; that's where we were seeing the problem with this technique.
Moving on to the enhancement idea: we're training a neural network to try to reconstruct a clean signal given a noisy version of it. We have the person talking into a telephone, and the telephone is our clean version; we also have microphones around the room collecting the microphone-corrupted versions. We just train it like a regression; it's a very simple thing. We have a windowed set of feature vectors coming into the DNN, and we have the clean version of the same vector that it's trying to reconstruct, and we just train over all the samples. One key thing, and I think this is important, is that we include the clean samples as well: we'd really like this neural network not to change the clean data, but to also improve the noisy data, to make it look more like the clean data.
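A minimal sketch of how such training pairs can be built, assuming time-aligned parallel features; the helper and its names are illustrative, not our actual code.

```python
import numpy as np

def make_training_pairs(clean, noisy_versions, context=10):
    """Build (stacked input window, clean target frame) pairs for the
    denoising regression. `clean` is [T, D] telephone-channel features and
    `noisy_versions` is a list of time-aligned [T, D] microphone versions.
    The clean channel itself is included as an input, so the network learns
    an identity-like mapping on clean data while enhancing noisy data."""
    inputs, targets = [], []
    for feats in [clean] + list(noisy_versions):
        T = feats.shape[0]
        for t in range(context, T - context):
            # 2*context + 1 = 21 frames, flattened into one input vector
            inputs.append(feats[t - context:t + context + 1].ravel())
            targets.append(clean[t])  # always regress to the clean frame
    return np.asarray(inputs), np.asarray(targets)
```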
Just to give you some idea of how this data was collected: the LDC did these parallel collections in a couple of rooms, maybe one or two, which is not really a lot of variety, but this is how it was done at the time. You have the subject come in and sit down, they have the microphones around, and they have all the equipment running. One of the problems is that if you realize later that you want one more microphone, it's really hard to come back and collect more data. So what people really do, especially on the ASR side, is generate synthetic parallel datasets, using impulse responses found online and point noise sources, and just generate tons of parallel data.
We've actually been working on that more recently; there's another paper at Interspeech on it, and that works quite well too. I think in the long term that's the way we want to do this, but we had this parallel corpus already available, and we wanted to start with that for this work.
So that's the hybrid system: you have the channel-compensating neural network at the front, and then you have the i-vector system, the same baseline as before. We just retrain this pipeline: after we retrain the denoising neural network, we retrain the i-vector system on the Switchboard data. For the DNN we're using all the Mixer 2 data for training, of course. We're also using 40 MFCCs, and that's the dimensionality of the output of the neural net: we're trying to reconstruct 40 MFCCs, and that includes 20 deltas, which may seem kind of counterintuitive, but it was actually important to include the delta coefficients. We use a five-layer neural network with 2048 nodes per layer and a 21-frame input context, mainly because that's what we used for bottleneck features before; we just adapted that system to this problem. And we have the one clean channel and the eight noisy ones coming in.
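For concreteness, a sketch of a network with these dimensions in PyTorch; the sigmoid activations are an assumption on my part, since the talk only specifies the layer sizes and the input context.

```python
import torch.nn as nn

# 21 frames of 40-dim MFCCs (statics + deltas) stacked at the input; five
# hidden layers of 2048 units; a linear 40-dim output regressing the clean
# center frame. Trained as a regression with a mean-squared-error loss.
enhancer = nn.Sequential(
    nn.Linear(21 * 40, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 40),
)
loss_fn = nn.MSELoss()
```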
You can see we get a pretty big gain in minDCF with this: almost a 30 percent gain in minDCF, which is a cool result, and about 50 percent in EER. This is really doing what we were hoping: getting an improvement at minDCF as well as EER. So that was a nice gain.
I should mention we tried a number of different things. Initially we were trying to see if we could do this with log mel-frequency filter banks; some of the work that's been done on the enhancement side tries to improve the filter banks, and then you can, as some people do, synthesize cepstra from the cleaned-up filter banks. But we found that the deltas were actually important, so going to MFCCs plus deltas gave us a bigger reduction than using filter banks.
It's also critical, and other people have mentioned this as well, that you do some type of mean and variance normalization of the data for training the neural net, just to get it to converge.
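A sketch of that global mean/variance normalization, with statistics estimated once on the training features and reused afterward; the function name is illustrative.

```python
import numpy as np

def mvn(feats, mu=None, sigma=None):
    """Mean/variance-normalize features [T, D]. Estimate the statistics on
    the training data once, then reuse them at test time; without this the
    regression network had trouble converging."""
    if mu is None:
        mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + 1e-8), mu, sigma
```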
We also found the architecture had a pretty big impact. I'm reporting results on the 2048-node DNN; you can see we take a bit of a hit as we go down to 1024 nodes, especially at DCF, and below that it takes an even bigger hit. But honestly, the 2048-node DNN takes a long time to train, weeks to train that one, and that's maybe our fault, since we don't have a parallel training mechanism. That was the problem there.
It's worth seeing what the telephone performance is: you don't just want a system that's robust to microphone data, you also want it to work well on telephone data. This was actually kind of a nice surprise: we get a small relative gain on the telephone task. That was for the DNN. The MAP-adapted PLDA falls apart when you apply it to telephone data: you've moved all those parameters toward the microphone set, and they're no longer well matched to telephone data. So there's a trade-off there.
So we see a nice gain using this DNN channel compensation technique, and it wasn't a loss on the telephone data, so you don't need to do any kind of channel detection to switch back and forth.
The MAP-adapted PLDA, unfortunately, hasn't worked as well for us so far: it does give a nice EER improvement, but the minDCF doesn't really change very much. It is really easy to implement, though: if you have an existing i-vector system, you just run it on that data to retrain the parameters. The other issue is that we've been using real parallel data for this, which is not really very practical, so the synthetic parallel corpora make a lot of sense.
And lastly, it would be interesting to look into using recurrent networks on the input. We've been doing a lot with feedforward networks, and the big context window is one way to allow for that, but I think RNNs may be the way to go, looking forward.
We have some time for questions.

Thinking about the size of the input window, you used 21 frames. Do you have some intuition about that? Do you think that for channel compensation, for example, you need a longer window than what you would use for just speaker recognition?
You know, actually, I would really recommend looking at the ASpIRE papers, from, I think it was maybe ASRU, I'm not sure, one of the speech recognition workshops. There's one that actually separately trained the denoising network, and I think they were using the FFT outputs, the power spectrum, and they had a really long window, something like 300 frames, something huge like that, and they trained a giant network. They had very impressive results, and I've been meaning to see if I can recreate that, but it would take me forever to train. So I think we want a faster training algorithm first, but I would encourage looking at those results, and in particular looking at the other ASpIRE systems.
I think there was a nice comparison there of whether you do joint training of the whole system, the way one group was doing it, where you do multi-style, sorry, multi-condition training with a whole bunch of data where your targets are always the clean signals, while some people tried to decouple it, so the ASR system was trained independently, and then they trained the denoising network and just used those features. One issue I haven't addressed here is the idea of not retraining the i-vector system: could you actually do okay if the features were coming from the denoising network but you're still using the same i-vector system? I wasn't worried about that right now, but I think it's worth testing.
Could you go back to one of your earlier slides, where you were highlighting the different microphones between Mixer 2 and Mixer 6?

Yes, sure.

I was looking at Mixer 1 and 2, and I'm a little concerned: I guess channel number five is the Jabra, the kind of Star Wars-looking cellphone earpiece, and there's also the earbud one, so you've got two of them there. You didn't, I don't think, use five and six from Mixer 1 or Mixer 2, is that correct?

For Mixer 1 and 2 we used all the data, all of it.

So I'm thinking that with some of those, when you have two mics that are both worn around the ear, each is a mic, so you're going to have some, I imagine, interference between the two.
So maybe, I don't know, that's not the main question, but did you check that? Okay. The main question I was going to ask is: when you're looking at the MAP adaptation, you had the denoising enhancement piece; when you're looking across the different mics, going from one mic to another, some mics are closer in terms of their characteristics than others. Did you see any benefit in moving from one to another? I guess what I'm asking is whether you could subset a set of unique microphones.
Right, and we haven't; that's a really good question. I think, actually, moving forward, I mean, real data is kind of nice because you can reality-check against it, but moving toward the synthetic data you can really get to very different room conditions. These were collected in only two rooms, so they're not all that diverse.

I'm just thinking about the configuration of all the mics: you could kind of look at your solutions to see whether, if you're moving from one mic to another, sometimes the solution from a closer one does better than another.

That's actually an analysis we could try to do: we could try to see which features look closer across the parallel datasets.
I'm not trying to ask you to burn more compute cycles on it; either way, it's just a question.

That's a good point. I couldn't find placement information for Mixer 1 and 2; it probably exists somewhere, but I ran out of luck trying to find it. Mixer 6 has a lot of information. Mixer 2 was, I think, three locations; I'd have to ask the LDC, but I think there were three, and then Mixer 6 I think was two, I believe that's right, although I'd have to check the documentation to be sure.
A question on the denoising network: when we applied that kind of thing, we found it was important to apply VAD first and then train the network, because if we send the silence frames to it, it learns essentially a default value, since they're just zeros, and that propagates through the rest of the network, with the network adapting to that state.

That's actually a good point. We limited it to the speech marks, that's right. I think we might have run the VAD on the clean channel for training and applied it to the other ones; for decoding, we always ran it on whatever the data was. We tried to optimize for a realistic condition, but for training I think we did the VAD on the telephone data, which matched our baseline system, and then used those speech marks across all the channels.
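A sketch of what was just described: run the VAD once on the clean telephone channel and apply the same speech marks to every parallel channel, so the frames stay time-aligned after silence removal. The mask source is whatever VAD you have; the helper is hypothetical.

```python
import numpy as np

def apply_shared_speech_marks(clean, noisy_versions, speech_mask):
    """Keep only speech frames, using one boolean mask [T] computed from the
    clean channel and applied identically to all time-aligned channels."""
    keep = np.asarray(speech_mask, dtype=bool)
    return clean[keep], [f[keep] for f in noisy_versions]
```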
Any more questions? Okay, let's thank the speaker.