Hi, I'm with Lincoln Laboratory, and I'm going to move somewhat quickly through some work on channel compensation using MAP-adapted PLDA and denoising DNNs.
This is a brief overview of our work on multichannel speaker recognition with the Mixer corpora. The baseline system is an i-vector system trained on telephone speech only. There are two approaches we're looking at: one adapts the PLDA parameters from telephone data to microphone data, and in the other approach we try to compensate the features coming into the system and retrain the system, which forms a sort of hybrid system. I'll give results along the way.
The basic idea is that we have a system trained on Switchboard data, and it works pretty well when the data we test on is also conversational telephone speech. But, as is well known, if you try to evaluate microphone trials on the same system, the performance is really bad.
There are two approaches people have taken to deal with this. One is a sort of adaptation of the PLDA; I don't think ours is exactly the same adaptation as prior work, but the idea is to bring in some of the subspace, to move the PLDA parameters toward the microphone data.
We also tried feature enhancement; other groups have done this in different places. What this second process does is use a neural network to do the compensation. It's not new in general; I should mention that for the ASpIRE challenge a lot of people used this technique, and it works very well for speech recognition, and that test had microphone data as well.
For the first of these two techniques, we take i-vectors from a telephone-trained system and adapt those to the microphone data. To do that, we take the within-class and across-class covariance parameters used in PLDA scoring, and we adapt those parameters toward the microphone data using relevance MAP, which is just a lambda interpolation.
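To make the interpolation concrete, here is a minimal sketch in Python, assuming the two-covariance view of PLDA; the function and variable names are illustrative, not from our actual code.

```python
import numpy as np

def adapt_plda_covariances(Sw_tel, Sac_tel, Sw_mic, Sac_mic, lam=0.5):
    """Relevance-MAP-style adaptation of the PLDA parameters: a simple
    lambda interpolation between the telephone-trained within-class (Sw)
    and across-class (Sac) covariances and the same statistics estimated
    on microphone i-vectors. lam = 1.0 keeps the telephone parameters;
    lam = 0.0 trusts only the microphone data."""
    Sw_adapt = lam * Sw_tel + (1.0 - lam) * Sw_mic
    Sac_adapt = lam * Sac_tel + (1.0 - lam) * Sac_mic
    return Sw_adapt, Sac_adapt
```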
We found some calibration issues with it: we do pretty well on EER, we get a nice gain at the EER level, but for minDCF we don't see much of a gain. On the other hand, it is a very simple technique, in that you don't change your system: you just retrain these two parameters with existing i-vectors, or you extract new i-vectors from the microphone data. You don't change the system itself.
The DNN approach requires a little more work, in that you have to retrain the system. The DNN is trained on parallel data that's noisy, and it tries to clean it up, to reconstruct a clean signal given a noisy representation of the same data. It's actually a very robust technique and it works quite well, but it does mean you have to retrain your system with that new front end.
For this work we're using three datasets. One is Switchboard 1 and 2; that's what we used for training the baseline system, and all the i-vector parameters are trained with just that data. Then there's Mixer 2, which is a collection from 2004; it's a multi-microphone collection. It had a clean telephone channel and then eight microphones in the room collecting the data in parallel, for around 240 speakers and up to six sessions, I think. It was collected in 2004, so it's an older dataset.
And then there's Mixer 6. For that one they did the same type of collection, but with more speakers, in different rooms, and with fourteen microphones as well as the telephone channel. For the SRE they were focusing a lot on the interview condition with that data, where you have the interviewer and the interviewee in the room and you have to separate the two. To avoid dealing with that issue, we just took the other portion of the sessions, which is a conversation the person is having over the phone, so it's the same collection, but it's conversational data.
That matches the Mixer 2 style, and these are disjoint collections, Mixer 2 and Mixer 6. We use Mixer 2 for developing the system, either for training the DNN or for adapting the parameters, and Mixer 6 we're using for testing, to see how well it works.
Just to give you an idea of what these collections comprise: Mixer 1 and 2 were collected over eight microphones, and Mixer 6 was over fourteen. We found it generated a huge dataset if we used all fourteen, so we just selected six of them based on the distance from the speaker; the Mixer 6 collection comes with documentation about where the microphones were positioned, and that's what we used here. Mixer 1 and 2 was available to us, but we've actually given it to the LDC, and they're planning on making a release if people want to work with this data, so it should probably be available fairly soon, I think.
I should also mention that we're only evaluating on same-mic trials in the Mixer 6 condition: in the trials, the target speaker and the non-target speakers are always on the same mic.
The baseline system is exactly what everybody else is doing with an i-vector system. We start with a UBM that's trained on Switchboard 1 and 2, extract zeroth- and first-order statistics to create a supervector, and then we take the MAP point estimate to get the i-vector, a 600-dimensional i-vector. The whitening is done with Switchboard 2 data as well for the DNN case. For the MAP-adapted case, we actually did the whitening using the Mixer 2 microphone data, and then the Sigma_wc and Sigma_ac covariance parameters are the ones being adapted for the PLDA adaptation.
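As a rough illustration of the whitening step, here is a sketch; the length normalization at the end is standard practice in i-vector systems and is an assumption on my part, since the talk only mentions whitening.

```python
import numpy as np

def train_whitener(dev_ivectors):
    """Estimate a whitening transform (mean and inverse square root of the
    covariance) from development i-vectors, one i-vector per row. Assumes
    the covariance is full rank."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # cov = V diag(vals) V^T
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # W = cov^{-1/2}
    return mu, W

def whiten(ivec, mu, W):
    """Whiten an i-vector, then length-normalize (project to unit sphere)."""
    y = W @ (ivec - mu)
    return y / np.linalg.norm(y)
```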
So, starting with the baseline results. The first result in the table is on SRE 10, and that's just the telephone result; this is sort of the out-of-domain task, where the system is trained on Switchboard and the eval data is the SRE 10 and Mixer data, so you don't have Mixer data as part of training the system. That's about 5.7 percent equal error rate and a 0.62 minDCF. Then you take that system and evaluate it with the Mixer 6 trials, the microphone trials, and you can see the equal error rate goes up by a factor of two or so, and minDCF really takes a hit as well.
The first number there is the average: that's just taking the EER for each channel and then averaging. That number is kind of unrealistic, because typically you'd have to pick one threshold for everything, so the pooled number, I think, is a more practical metric, and that one takes an even bigger hit, because of the calibration problem. For the remaining results I'll focus on the pooled number, since I think it's the more practical metric.
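To spell out the difference between the averaged and pooled numbers, here is a small runnable sketch; the synthetic scores and channel names are hypothetical, just to show why per-channel calibration offsets hurt the pooled number.

```python
import numpy as np

def eer(tgt_scores, non_scores):
    """Equal error rate via a simple threshold sweep over all scores."""
    thresholds = np.sort(np.concatenate([tgt_scores, non_scores]))
    fr = np.array([(tgt_scores < t).mean() for t in thresholds])   # false rejects
    fa = np.array([(non_scores >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(fr - fa))
    return 0.5 * (fr[i] + fa[i])

rng = np.random.default_rng(0)
# Hypothetical scores: two channels whose score scales are offset from each other.
tgt = {ch: rng.normal(loc, 1.0, 500) for ch, loc in [("mic1", 2.0), ("mic2", 1.0)]}
non = {ch: rng.normal(loc, 1.0, 5000) for ch, loc in [("mic1", 0.0), ("mic2", -1.0)]}

# Average EER: a separate (optimistic) operating point per microphone.
avg_eer = np.mean([eer(tgt[ch], non[ch]) for ch in tgt])

# Pooled EER: one threshold shared by all microphones, which is what a real
# deployment faces; channel-to-channel calibration offsets make it worse.
pooled_eer = eer(np.concatenate(list(tgt.values())),
                 np.concatenate(list(non.values())))
print(avg_eer, pooled_eer)  # pooled comes out higher than the average
```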
First, the MAP-adapted results. Here you can see that the minDCF really doesn't improve very much, although you do get a pretty big improvement in EER: it goes down by about 31 percent. So that part's nice, but you'd really like to see minDCF get a little better too.
I should mention that for lambda we used 0.5, and the reason for that is that I did a sweep, and you can see there are nice curves at EER, because that's where we get the gain. 0.5 looks fairly optimal across microphones: the 3-D plot shows, for each microphone, the EERs as we sweep the lambda we use for the adaptation, and around 0.5 is where we're seeing a sweet spot. But if you look at minDCF, it doesn't really change very much; that's where we were seeing the problem with this technique.
Moving on to the enhancement idea: we're training a neural network to try to reconstruct a clean signal given a noisy version of it. We have the person talking into a telephone, and the telephone is our clean version; we also have microphones around the room collecting the microphone-corrupted versions. We just train it like a regression; it's a very simple thing. We have a windowed set of feature vectors coming into the DNN, and we have the clean version of the same vector that it's trying to reconstruct, and we just train over all the samples. One key thing, and I think this is important, is that we include the clean samples as well: we'd really like this neural network not to change the clean data, but to also improve the noisy data, to make it look more like the clean data.
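A minimal sketch of how such training pairs can be built, assuming time-aligned parallel features; the helper and its names are illustrative, not our actual code.

```python
import numpy as np

def make_training_pairs(clean, noisy_versions, context=10):
    """Build (stacked input window, clean target frame) pairs for the
    denoising regression. `clean` is [T, D] telephone-channel features and
    `noisy_versions` is a list of time-aligned [T, D] microphone versions.
    The clean channel itself is included as an input, so the network learns
    an identity-like mapping on clean data while enhancing noisy data."""
    inputs, targets = [], []
    for feats in [clean] + list(noisy_versions):
        T = feats.shape[0]
        for t in range(context, T - context):
            # 2*context + 1 = 21 frames, flattened into one input vector
            inputs.append(feats[t - context:t + context + 1].ravel())
            targets.append(clean[t])  # always regress to the clean frame
    return np.asarray(inputs), np.asarray(targets)
```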
Just to give you some idea of how this data was collected: the LDC did these parallel collections in a couple of rooms, maybe one or two, which is not really a lot of variety, but this is how it was done at the time. You have the subject come in and sit down, they have the microphones around, and they have all the equipment running. One of the problems is that if you realize later that you want one more microphone, it's really hard to come back and collect more data. So what people really do, especially on the ASR side, is generate synthetic parallel datasets, using impulse responses found online and point noise sources, and just generate tons of parallel data.
We've actually been working on that more recently; there's another paper at Interspeech on it, and that works quite well too. I think in the long term that's the way we want to do this, but we had this parallel corpus already available, and we wanted to start with that for this work.
So that's the hybrid system: you have the channel-compensating neural network at the front, and then you have the i-vector system, the same baseline as before. We just retrain this pipeline: after we retrain the denoising neural network, we retrain the i-vector system on the Switchboard data. For the DNN we're using all the Mixer 2 data for training, of course. We're also using 40 MFCCs, and that's the dimensionality of the output of the neural net: we're trying to reconstruct 40 MFCCs, and that includes 20 deltas, which may seem kind of counterintuitive, but it was actually important to include the delta coefficients. We use a five-layer neural network with 2048 nodes per layer and a 21-frame input context, mainly because that's what we used for bottleneck features before; we just adapted that system to this problem. And we have the one clean channel and the eight noisy ones coming in.
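For concreteness, a sketch of a network with these dimensions in PyTorch; the sigmoid activations are an assumption on my part, since the talk only specifies the layer sizes and the input context.

```python
import torch.nn as nn

# 21 frames of 40-dim MFCCs (statics + deltas) stacked at the input; five
# hidden layers of 2048 units; a linear 40-dim output regressing the clean
# center frame. Trained as a regression with a mean-squared-error loss.
enhancer = nn.Sequential(
    nn.Linear(21 * 40, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 40),
)
loss_fn = nn.MSELoss()
```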
You can see we get a pretty big gain in minDCF with this: almost a 30 percent gain in minDCF, which is a cool result, and about 50 percent in EER. This is really doing what we were hoping: getting an improvement at minDCF as well as EER. So that was a nice gain.
I should mention we tried a number of different things. Initially we were trying to see if we could do this with log mel-frequency filter banks; some of the work that's been done on the enhancement side tries to improve the filter banks, and then you can, as some people do, synthesize cepstra from the cleaned-up filter banks. But we found that the deltas were actually important, so going to MFCCs plus deltas gave us a bigger reduction than using filter banks.
It's also critical, and other people have mentioned this as well, that you do some type of mean and variance normalization of the data for training the neural net, just to get it to converge.
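A sketch of that global mean/variance normalization, with statistics estimated once on the training features and reused afterward; the function name is illustrative.

```python
import numpy as np

def mvn(feats, mu=None, sigma=None):
    """Mean/variance-normalize features [T, D]. Estimate the statistics on
    the training data once, then reuse them at test time; without this the
    regression network had trouble converging."""
    if mu is None:
        mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + 1e-8), mu, sigma
```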
We also found the architecture had a pretty big impact. I'm reporting results on the 2048-node DNN; you can see we take a bit of a hit as we go down to 1024 nodes, especially at DCF, and below that it takes an even bigger hit. But honestly, the 2048-node DNN takes a long time to train, weeks to train that one, and that's maybe our fault, since we don't have a parallel training mechanism. That was the problem there.
It's worth seeing what the telephone performance is: you don't just want a system that's robust to microphone data, you also want it to work well on telephone data. This was actually kind of a nice surprise: we get a small relative gain on the telephone task. That was for the DNN. The MAP-adapted PLDA falls apart when you apply it to telephone data: you've moved all those parameters toward the microphone set, and they're no longer well matched to telephone data. So there's a trade-off there.
So we see a nice gain using this DNN channel compensation technique, and it wasn't a loss on the telephone data, so you don't need to do any kind of channel detection to switch back and forth.
The MAP-adapted PLDA, unfortunately, hasn't worked as well for us so far: it does give a nice EER improvement, but the minDCF doesn't really change very much. It is really easy to implement, though: if you have an existing i-vector system, you just run it on that data to retrain the parameters. The other issue is that we've been using real parallel data for this, which is not really very practical, so the synthetic parallel corpora make a lot of sense.
And lastly, it would be interesting to look into using recurrent networks on the input. We've been doing a lot with feedforward networks, and the big context window is one way to allow for that, but I think RNNs may be the way to go, looking forward.
We have some time for questions.

Thinking about the size of the input window, you used 21 frames. Do you have some intuition about that? Do you think that for channel compensation, for example, you need a longer window than what you would use for just speaker recognition?
You know, actually, I would really recommend looking at the ASpIRE papers, from, I think it was maybe ASRU, I'm not sure, one of the speech recognition workshops. There's one that actually separately trained the denoising network, and I think they were using the FFT outputs, the power spectrum, and they had a really long window, something like 300 frames, something huge like that, and they trained a giant network. They had very impressive results, and I've been meaning to see if I can recreate that, but it would take me forever to train. So I think we want a faster training algorithm first, but I would encourage looking at those results, and in particular looking at the other ASpIRE systems.
I think there was a nice comparison there of whether you do joint training of the whole system, the way one group was doing it, where you do multi-style, sorry, multi-condition training with a whole bunch of data where your targets are always the clean signals, while some people tried to decouple it, so the ASR system was trained independently, and then they trained the denoising network and just used those features. One issue I haven't addressed here is the idea of not retraining the i-vector system: could you actually do okay if the features were coming from the denoising network but you're still using the same i-vector system? I wasn't worried about that right now, but I think it's worth testing.
Could you go back to one of your earlier slides, where you were highlighting the different microphones between Mixer 2 and Mixer 6?

Yes, sure.

I was looking at Mixer 1 and 2, and I'm a little concerned: I guess channel number five is the Jabra, the kind of Star Wars-looking cellphone earpiece, and there's also the earbud one, so you've got two of them there. You didn't, I don't think, use five and six from Mixer 1 or Mixer 2, is that correct?

For Mixer 1 and 2 we used all the data, all of it.

So I'm thinking that with some of those, when you have two mics that are both worn around the ear, each is a mic, so you're going to have some, I imagine, interference between the two.
So maybe, I don't know, that's not the main question, but did you check that? Okay. The main question I was going to ask is: when you're looking at the MAP adaptation, you had the denoising enhancement piece; when you're looking across the different mics, going from one mic to another, some mics are closer in terms of their characteristics than others. Did you see any benefit in moving from one to another? I guess what I'm asking is whether you could subset a set of unique microphones.
Right, and we haven't; that's a really good question. I think, actually, moving forward, I mean, real data is kind of nice because you can reality-check against it, but moving toward the synthetic data you can really get to very different room conditions. These were collected in only two rooms, so they're not all that diverse.

I'm just thinking about the configuration of all the mics: you could kind of look at your solutions to see whether, if you're moving from one mic to another, sometimes the solution from a closer one does better than another.

That's actually an analysis we could try to do: we could try to see which features look closer across the parallel datasets.
I'm not trying to ask you to burn more compute cycles on it; either way, it's just a question.

That's a good point. I couldn't find placement information for Mixer 1 and 2; it probably exists somewhere, but I ran out of luck trying to find it. Mixer 6 has a lot of information. Mixer 2 was, I think, three locations; I'd have to ask the LDC, but I think there were three, and then Mixer 6 I think was two, I believe that's right, although I'd have to check the documentation to be sure.
A question on the denoising network: when we applied that kind of thing, we found it was important to apply VAD first and then train the network, because if we send the silence frames to it, it learns essentially a default value, since they're just zeros, and that propagates through the rest of the network, with the network adapting to that state.

That's actually a good point. We limited it to the speech marks, that's right. I think we might have run the VAD on the clean channel for training and applied it to the other ones; for decoding, we always ran it on whatever the data was. We tried to optimize for a realistic condition, but for training I think we did the VAD on the telephone data, which matched our baseline system, and then used those speech marks across all the channels.
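A sketch of what was just described: run the VAD once on the clean telephone channel and apply the same speech marks to every parallel channel, so the frames stay time-aligned after silence removal. The mask source is whatever VAD you have; the helper is hypothetical.

```python
import numpy as np

def apply_shared_speech_marks(clean, noisy_versions, speech_mask):
    """Keep only speech frames, using one boolean mask [T] computed from the
    clean channel and applied identically to all time-aligned channels."""
    keep = np.asarray(speech_mask, dtype=bool)
    return clean[keep], [f[keep] for f in noisy_versions]
```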
Any more questions? Okay, let's thank the speaker.