0:00:15 | hi i'm with lincoln laboratory and i'm going to talk about our work on channel compensation |
---|
0:00:21 | using adapted plda and a denoising neural network |
---|
0:00:26 | i'll give a brief overview we're looking at multichannel speaker recognition on the mixer corpora |
---|
0:00:32 | and the baseline system is an i-vector system trained on only telephone |
---|
0:00:36 | speech |
---|
0:00:38 | and there are two approaches we're looking at one adapts the plda parameters from |
---|
0:00:42 | telephone data to microphone data |
---|
0:00:45 | and in the other approach we try to compensate the features coming into the system and retrain |
---|
0:00:49 | our system that sort of forms a hybrid system i'll give results along |
---|
0:00:54 | the way |
---|
0:00:57 | so the basic idea is that we have a system that's trained on switchboard data and |
---|
0:01:01 | works pretty well when the data we're testing on is also conversational telephone speech |
---|
0:01:06 | but as is well known if you try to evaluate microphone trials on the same system |
---|
0:01:10 | out of the box the performance is really bad |
---|
0:01:14 | and |
---|
0:01:15 | two approaches people have used to do this one is sort of an adaptation of the plda |
---|
0:01:21 | it isn't exactly the usual adaptation the reasoning was to bring in some |
---|
0:01:25 | of the subspace to move the plda parameters toward the microphone data |
---|
0:01:31 | and we also tried feature enhancement and on that side we tried different approaches to |
---|
0:01:37 | do the |
---|
0:01:40 | i'm sorry what we do in this process is use a neural network to do this compensation |
---|
0:01:47 | and actually it's not new in general i should mention that for the aspire challenge |
---|
0:01:51 | a lot of people used this technique and it works very well for speech recognition on |
---|
0:01:55 | that test where they had microphone data as well |
---|
0:01:58 | so for these two techniques one we're taking i-vectors from a telephone-trained system |
---|
0:02:04 | and we're adapting those toward this microphone data to do that we take the |
---|
0:02:09 | within-class and across-class covariance parameters used in plda scoring |
---|
0:02:14 | and we adapt those parameters toward the microphone data using relevance map which is |
---|
0:02:19 | just a lambda interpolation |
---|
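The relevance-MAP adaptation just described is a lambda interpolation between the telephone-trained and microphone-estimated PLDA covariances. A minimal illustrative sketch, not the authors' code; the function name and toy matrices are hypothetical:

```python
import numpy as np

def adapt_plda_covariances(S_tel, S_mic, lam=0.5):
    """Relevance-MAP-style adaptation: interpolate a telephone-trained
    covariance toward one estimated on microphone data."""
    return lam * S_mic + (1.0 - lam) * S_tel

# toy 2x2 within-class covariances, purely for illustration
Sw_tel = np.array([[1.0, 0.0], [0.0, 1.0]])   # telephone-trained
Sw_mic = np.array([[2.0, 0.2], [0.2, 2.0]])   # estimated on microphone data
Sw_adapted = adapt_plda_covariances(Sw_tel, Sw_mic, lam=0.5)
```

With lam=0 the system is unchanged; with lam=1 it is fully moved to the microphone estimate, which matches the lambda sweep discussed later in the talk.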
0:02:21 | and there we found some calibration issues we do pretty well for eer we |
---|
0:02:26 | get a nice gain at the eer level but for mindcf we don't see much |
---|
0:02:29 | gain |
---|
0:02:31 | on the other hand it is a very simple technique in that you don't change |
---|
0:02:33 | your system you just retrain these two covariance parameters with existing i-vectors or you |
---|
0:02:38 | extract new i-vectors from the microphone data we don't change the system itself |
---|
0:02:42 | the dnn approach requires a little more work in that you have to train the |
---|
0:02:46 | network |
---|
0:02:47 | and the dnn is trained to take parallel data that's noisy and try |
---|
0:02:51 | to clean it up to try to reconstruct a clean signal given a noisy representation |
---|
0:02:56 | of the same data |
---|
0:02:58 | and that's actually a very robust technique it works very well but it does mean you want |
---|
0:03:02 | to retrain your system with that new front end |
---|
0:03:07 | so for this work we're using three datasets one is switchboard one and two that's |
---|
0:03:11 | what we used for training the baseline system and all the i-vector parameters are trained with |
---|
0:03:16 | just that data |
---|
0:03:17 | and then there's mixer two which is a collection from two thousand four it's a |
---|
0:03:22 | multi-microphone collection |
---|
0:03:24 | it had a clean telephone channel and then eight microphones in the room that collected |
---|
0:03:28 | the data in parallel for two hundred forty speakers and up to six sessions i think |
---|
0:03:33 | that was collected in two thousand four and this dataset actually has not been released |
---|
0:03:38 | and then this is mixer six for that one they did the same type of |
---|
0:03:42 | collection but for a different set of speakers in different rooms and with fourteen microphones as well |
---|
0:03:46 | as the telephone channel |
---|
0:03:48 | and for the sre they focused a lot on the interview condition for that where |
---|
0:03:53 | the interviewer is in the room with the interviewee and you had to separate the two so to |
---|
0:03:58 | not deal with that issue we just took the other portion of the sessions |
---|
0:04:02 | which is a conversation the person is having over the phone so it's the same room collection |
---|
0:04:07 | but it's conversational data |
---|
0:04:08 | and that matches the mixer two style so these are disjoint |
---|
0:04:12 | collections mixer two and mixer six |
---|
0:04:15 | we use mixer two for developing the system either for training the dnn or |
---|
0:04:18 | for adapting our parameters and mixer six we're using for testing to see how |
---|
0:04:22 | well it works |
---|
0:04:26 | so just to give you an idea of what these collections comprise mixer |
---|
0:04:30 | one and two was collected over eight microphones |
---|
0:04:32 | and mixer six was over fourteen |
---|
0:04:35 | we found it would generate a huge dataset if we used all fourteen so we just selected six |
---|
0:04:40 | of them based on the distance from the speaker the mixer six collection comes |
---|
0:04:43 | with documentation about where the microphones were positioned and that's what we used here |
---|
0:04:49 | mixer one and two was available to us but we've actually given this to the ldc |
---|
0:04:54 | and they are planning on making a release if people want to work with this data so |
---|
0:04:59 | it should probably be available fairly soon i think |
---|
0:05:02 | and i should mention we're only evaluating on same-mic trials in the mixer six condition |
---|
0:05:06 | so the trials always have the target speaker and the non-target speakers on |
---|
0:05:10 | the same mic |
---|
0:05:13 | the baseline system is |
---|
0:05:15 | exactly what everybody else is doing with an i-vector system |
---|
0:05:18 | we start with a ubm that's trained on switchboard one and two extract the zeroth and |
---|
0:05:23 | first order statistics to create a supervector and then we take the map point |
---|
0:05:27 | estimate to get the i-vector a six hundred dimensional i-vector |
---|
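The MAP point estimate mentioned here is the posterior mean of the total-variability latent factor. A minimal sketch with toy dimensions, assuming a diagonal UBM covariance; the variable names are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, R = 4, 3, 2                          # toy sizes: UBM components, feature dim, i-vector dim
T = rng.standard_normal((C * F, R))        # total-variability matrix (stacked per component)
Sigma = np.eye(C * F)                      # UBM covariances, assumed diagonal and stacked
N = np.repeat(rng.uniform(1, 5, C), F)     # zeroth-order stats, expanded to each feature dim
F_stats = rng.standard_normal(C * F)       # centered first-order stats, stacked supervector

# posterior precision of the latent factor: I + T' N Sigma^-1 T
precision = np.eye(R) + T.T @ (N[:, None] * np.linalg.solve(Sigma, T))
# MAP point estimate (posterior mean): precision^-1 T' Sigma^-1 F
ivector = np.linalg.solve(precision, T.T @ np.linalg.solve(Sigma, F_stats))
```

In the real system R would be 600 (the "six hundred dimensional i-vector") and T would be trained on the Switchboard statistics.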
0:05:32 | the whitening is done with switchboard two data as well for the dnn |
---|
0:05:36 | case |
---|
0:05:37 | for the microphone map adaptation for the map-adapted case we actually did the |
---|
0:05:41 | whitening using the mixer two microphone data and then the |
---|
0:05:46 | within-class and across-class covariance parameters are the ones being adapted for the plda |
---|
0:05:50 | adaptation |
---|
0:05:53 | so starting with the baseline results |
---|
0:05:55 | well the first result in the table is on sre ten and that's just the telephone |
---|
0:05:59 | results this is sort of the out-of-domain task we have the system trained on switchboard |
---|
0:06:04 | and then the eval data is this sre ten mixer data so you |
---|
0:06:08 | don't have mixer data as part of training the system |
---|
0:06:11 | that's about five point seven percent equal error rate and a point six two mindcf and if |
---|
0:06:16 | you take that system |
---|
0:06:17 | and evaluate it with the mixer six trials the microphone trials |
---|
0:06:22 | you can see the equal error rate goes up by a factor of two |
---|
0:06:25 | or so and mindcf really takes a hit as well |
---|
0:06:29 | and the first number there is the average that's just taking the eer for each channel |
---|
0:06:34 | and then averaging that number's kind of unrealistic because typically you'd have to pick one |
---|
0:06:37 | threshold for everything so the pooled number i think is the more practical metric and that |
---|
0:06:42 | one's even worse you take a bigger hit there because of the calibration problem |
---|
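The average-versus-pooled distinction can be illustrated with a toy experiment: two channels whose score distributions are offset from each other (a calibration mismatch) give a per-channel average EER that looks fine, while pooling all scores under one implicit threshold hurts. A rough sketch with synthetic Gaussian scores, not the paper's data:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep thresholds over the observed scores and
    return (miss + false-alarm) / 2 where the two rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, best_eer = np.inf, 0.5
    for t in thresholds:
        miss = np.mean(target_scores < t)       # targets rejected
        fa = np.mean(nontarget_scores >= t)     # non-targets accepted
        if abs(miss - fa) < best_gap:
            best_gap, best_eer = abs(miss - fa), (miss + fa) / 2.0
    return best_eer

rng = np.random.default_rng(1)
channels = []
for offset in (0.0, 3.0):                       # second channel is miscalibrated
    tar = rng.normal(2.0 + offset, 1.0, 500)
    non = rng.normal(0.0 + offset, 1.0, 500)
    channels.append((tar, non))

avg_eer = float(np.mean([eer(t, n) for t, n in channels]))
pooled_eer = eer(np.concatenate([t for t, _ in channels]),
                 np.concatenate([n for _, n in channels]))
```

Each channel alone has an EER near 16%, but pooling the offset scores under a single threshold roughly doubles it, which is the effect described in the talk.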
0:06:47 | and |
---|
0:06:49 | so for the remaining results i'll report the pooled number since i think that's the more practical |
---|
0:06:53 | metric |
---|
0:06:55 | so first the map-adapted results and here you can see that the mindcf |
---|
0:06:58 | really doesn't improve very much although you do get a pretty big improvement eer goes |
---|
0:07:03 | down by about thirty one percent |
---|
0:07:04 | so that part's nice but you'd really like to see mindcf get a little |
---|
0:07:08 | better |
---|
0:07:10 | and yes i should mention that for lambda we used point five and the reason |
---|
0:07:16 | for that is i did sort of a sweep and you can see |
---|
0:07:19 | there are nice curves at eer because that's where i get a gain |
---|
0:07:22 | and point five looks like it's fairly optimal across microphones the three |
---|
0:07:27 | d plot shows for each microphone the eers as we sweep the lambda we used |
---|
0:07:32 | for doing the adaptation |
---|
0:07:34 | and around point five is where we're seeing a sweet spot for that |
---|
0:07:38 | but if you look at mindcf it doesn't really change very much and that's where we were seeing |
---|
0:07:41 | the problem with this technique |
---|
0:07:44 | so moving on to the enhancement idea we're training a neural network to try |
---|
0:07:49 | to reconstruct a clean signal given a noisy version of it so we have |
---|
0:07:54 | the person talking on the telephone the telephone is our clean version and we also have |
---|
0:07:58 | microphones in the room collecting the microphone-corrupted versions |
---|
0:08:02 | and we just train it as a regression it's a very simple thing we have |
---|
0:08:04 | a windowed set of feature vectors coming into the dnn and we have the same vector we're |
---|
0:08:09 | trying to reconstruct and we just train it over all the samples |
---|
0:08:13 | one key thing and i think this is important is that we include the clean |
---|
0:08:17 | samples as well we'd really like this neural network to not change the clean data but to |
---|
0:08:21 | try to also improve the noisy data to make it more like the clean |
---|
0:08:27 | and just to give you some idea of how this data was collected |
---|
0:08:30 | the ldc did these parallel collections and they have a couple of rooms like one or |
---|
0:08:34 | two rooms which is not really that many rooms but this is how it's |
---|
0:08:37 | done |
---|
0:08:39 | and you have people come in and sit down and they have the microphones around |
---|
0:08:41 | and have all the equipment running |
---|
0:08:43 | and one of the problems is that if you realise later that you want one more microphone |
---|
0:08:47 | it's really hard to come back and collect more data so what people really do especially on the asr |
---|
0:08:53 | side is generate synthetic parallel datasets using rirs available online and point |
---|
0:08:59 | noise sources and just generating tons of parallel data |
---|
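Generating synthetic parallel data as described, by convolving clean speech with a room impulse response and adding noise at a target SNR, can be sketched as follows. Toy signals and a hypothetical RIR; real pipelines use measured RIRs and recorded noise:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 8000
clean = rng.standard_normal(sr)          # one second of toy "clean" speech

# hypothetical room impulse response: a direct path plus a few decaying echoes
rir = np.zeros(400)
rir[0] = 1.0
rir[[80, 200, 350]] = [0.6, 0.3, 0.1]

reverbed = np.convolve(clean, rir)[: len(clean)]   # reverberant copy
noise = rng.standard_normal(len(clean))

# scale the noise for a target SNR of 10 dB, then mix
target_snr_db = 10.0
scale = np.sqrt(np.mean(reverbed ** 2)
                / (np.mean(noise ** 2) * 10 ** (target_snr_db / 10)))
noisy = reverbed + scale * noise                   # synthetic "microphone" channel
```

The (clean, noisy) pair then plays the same role as the real Mixer parallel channels when training the denoising network.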
0:09:02 | and we've actually been working on that more recently there's another paper at interspeech on that |
---|
0:09:06 | and that actually works quite well as well i think that's in the long term |
---|
0:09:10 | the way we want to do it but we had this corpus already available and we wanted |
---|
0:09:13 | to start with that for this work |
---|
0:09:17 | so this is the hybrid system where you have the channel-compensating neural network at the |
---|
0:09:21 | front of it and then you have the i-vector system the baseline from |
---|
0:09:26 | before and we just retrain this pipeline after we train the denoising neural network we |
---|
0:09:30 | retrain the i-vector system on the switchboard data |
---|
0:09:35 | and for the dnn system we're using all the mixer two data for training of course |
---|
0:09:38 | and then we're also using forty mfccs and that's the dimensionality of the output of |
---|
0:09:43 | the neural net we're trying to reconstruct forty mfccs and that includes twenty deltas which |
---|
0:09:50 | may seem kind of counterintuitive but it was actually important to include the delta |
---|
0:09:54 | coefficients |
---|
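Delta coefficients of the kind appended to the static MFCCs here are usually computed with the standard regression formula over a few surrounding frames. A small sketch; the `deltas` helper is illustrative, not the authors' implementation:

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta coefficients over a +/-N frame span
    (edge frames use repeated padding), for a (frames, dims) matrix."""
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
               for n in range(1, N + 1)) / denom

# toy 10-frame, 20-dim "static MFCCs": a linear ramp over time
static = np.tile(np.arange(10.0)[:, None], (1, 20))
full = np.hstack([static, deltas(static)])   # 20 statics + 20 deltas = 40 dims
```

For the linear ramp, every interior frame's delta is exactly 1.0, which is an easy way to sanity-check the indexing.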
0:09:55 | we used a five layer neural network with two thousand forty eight nodes |
---|
0:09:59 | and a twenty one frame input context mainly because that's what we used for bottleneck features before |
---|
0:10:05 | we just adapted that system to this problem |
---|
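As a heavily scaled-down sketch of the kind of feedforward MSE regression described (the real system used a 21-frame context, 40-dim output, and five 2048-unit layers), here is a tiny one-hidden-layer denoiser trained with plain gradient descent on synthetic data. Purely illustrative, not the authors' network:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in: a windowed context of noisy frames in, one clean frame out
context, n_mfcc, hidden = 5, 8, 32
X = rng.standard_normal((256, context * n_mfcc))        # noisy windowed input
W_true = 0.1 * rng.standard_normal((context * n_mfcc, n_mfcc))
Y = X @ W_true                                          # synthetic "clean" targets

# one-hidden-layer regression net, full-batch gradient descent on MSE
W1 = 0.1 * rng.standard_normal((context * n_mfcc, hidden)); b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, n_mfcc));           b2 = np.zeros(n_mfcc)

def forward(X):
    h = np.maximum(0.0, X @ W1 + b1)                    # ReLU hidden layer
    return h, h @ W2 + b2

_, pred = forward(X)
loss_before = np.mean((pred - Y) ** 2)

lr = 0.01
for _ in range(200):
    h, pred = forward(X)
    g = 2.0 * (pred - Y) / len(X)                       # dMSE/dpred
    gW2, gb2 = h.T @ g, g.sum(axis=0)
    gh = (g @ W2.T) * (h > 0)                           # back through the ReLU
    gW1, gb1 = X.T @ gh, gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
loss_after = np.mean((pred - Y) ** 2)
```

The reconstruction loss drops as training proceeds; the real system does the same thing at much larger scale, then feeds the denoised frames into the retrained i-vector pipeline.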
0:10:08 | and then we have the one clean channel and the eight noisy ones coming in |
---|
0:10:13 | and you can see we get a pretty big gain in mindcf and everything it's almost |
---|
0:10:17 | a thirty percent gain in mindcf and that's a cool result |
---|
0:10:20 | and a fifty percent gain in eer so this is really doing what we were hoping which is to |
---|
0:10:23 | get an improvement at mindcf and at eer as well |
---|
0:10:29 | so that was actually a nice gain |
---|
0:10:31 | and |
---|
0:10:32 | i should mention we tried a number of different things initially i think at |
---|
0:10:35 | first we were trying to see if we could do this with log mel-frequency filter banks |
---|
0:10:40 | so some of the work that's been done just on the enhancement side |
---|
0:10:43 | is to try to improve the filter banks and then you can if you want |
---|
0:10:47 | synthesise cepstra from those cleaned-up filter banks |
---|
0:10:51 | but what we found is that the deltas were actually important so going to mfccs |
---|
0:10:56 | plus deltas gave us a bigger gain than using filter banks |
---|
0:10:59 | it's also critical and other people have mentioned this too that |
---|
0:11:03 | you have to do some type of mean and variance normalisation on the data |
---|
0:11:06 | for training the neural net just to get the thing to converge |
---|
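The mean and variance normalisation mentioned here is typically per-utterance CMVN over the feature matrix. A minimal sketch:

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Per-utterance cepstral mean and variance normalisation:
    zero mean and unit variance per feature dimension."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

# toy (frames, dims) feature matrix with an arbitrary offset and scale
x = np.random.default_rng(0).normal(5.0, 3.0, (100, 40))
y = cmvn(x)
```

Applied before the denoising network's input (and to its targets), this keeps the regression well-conditioned, which is the convergence point made in the talk.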
0:11:09 | and we also found the architecture had a pretty big impact so i am |
---|
0:11:12 | reporting results on the two thousand forty eight node dnn and you can see we |
---|
0:11:16 | take a bit of a hit if we go down to ten |
---|
0:11:19 | twenty four nodes especially in dcf and then if we go down further we |
---|
0:11:22 | take an even bigger hit |
---|
0:11:24 | but honestly the two thousand forty eight node dnn takes a long |
---|
0:11:27 | time to train it took us weeks to train that one and that's maybe our fault we |
---|
0:11:31 | don't have a parallel training mechanism |
---|
0:11:33 | that was the problem there |
---|
0:11:37 | it's worth seeing what the telephone performance is you don't want a system that is robust |
---|
0:11:40 | to microphone data but doesn't work well for telephone data and so this was actually |
---|
0:11:44 | kind of a nice surprise we get a small gain of a few percent relative |
---|
0:11:47 | just on the telephone task |
---|
0:11:50 | and that was for the dnn system the map-adapted |
---|
0:11:52 | plda falls apart when you go back to telephone data because you moved all those parameters |
---|
0:11:57 | toward this microphone set they are not well matched to telephone data anymore |
---|
0:12:01 | so there's a trade off there |
---|
0:12:04 | so we see a nice gain using this dnn channel compensation technique and it doesn't |
---|
0:12:10 | incur a loss on the telephone data |
---|
0:12:13 | so you don't need to do any kind of channel detection to switch back and |
---|
0:12:16 | forth |
---|
0:12:17 | the map-adapted plda unfortunately so far hasn't worked well for us it does |
---|
0:12:22 | give a gain in eer but the mindcf doesn't really change very much |
---|
0:12:26 | it is really easy to implement though if you have an existing i-vector system you just |
---|
0:12:29 | run on that data to train the parameters |
---|
0:12:32 | the other issue is that we've been using real parallel data for this which is not |
---|
0:12:37 | really very practical so the synthetic parallel corpora make a lot of sense |
---|
0:12:40 | and lastly in the future we're really looking into using recurrent networks we've been |
---|
0:12:46 | doing a lot with feed forward networks with the big context window to |
---|
0:12:49 | allow for that but i think rnns are going to be the way to go looking |
---|
0:12:54 | forward |
---|
0:12:56 | thanks very much |
---|
0:13:02 | how to the sre five |
---|
0:13:09 | think that recent training |
---|
0:13:34 | you said you didn't |
---|
0:13:37 | or did you think about the size of the input window you used twenty one frames |
---|
0:13:42 | i'm just curious about that |
---|
0:13:43 | do you have some |
---|
0:13:45 | input or some ideas do you think that for channel compensation for example you |
---|
0:13:50 | need a longer window and would that differ if you were doing only speaker |
---|
0:13:56 | recognition or |
---|
0:13:58 | you know actually i would really recommend looking at the aspire papers from i think |
---|
0:14:03 | it was from |
---|
0:14:06 | maybe asru i'm not sure it's one of the speech recognition workshops |
---|
0:14:11 | there was one where they actually similarly trained the denoising network but i |
---|
0:14:15 | think they were using the fft outputs the power spectrum |
---|
0:14:20 | outputs and they had a really long window something like three hundred frames or |
---|
0:14:25 | something huge like that and they trained a giant network |
---|
0:14:28 | and they had very impressive results and i've been meaning to see if i can |
---|
0:14:32 | recreate that but it will take me forever to train |
---|
0:14:34 | so i think we want a faster training algorithm first but i would encourage looking |
---|
0:14:38 | at those results and in particular looking at the other aspire systems |
---|
0:14:42 | and i think there was a nice comparison there of whether you |
---|
0:14:45 | do joint training of the whole system the way one group was |
---|
0:14:49 | doing it where you do a multi style sorry multi condition training with |
---|
0:14:55 | a whole bunch of data where your targets are always the clean signals while some |
---|
0:14:59 | people tried to decouple it so the asr system was trained independently |
---|
0:15:03 | and then they trained the denoising network and just used those features and one issue |
---|
0:15:07 | i haven't addressed here is the idea of not retraining the i-vector system |
---|
0:15:11 | so could you actually do okay if the features were coming from the denoising network |
---|
0:15:16 | but you're still using |
---|
0:15:18 | the same i-vector system |
---|
0:15:20 | i wasn't worried about that right now but i think it's worth testing |
---|
0:15:31 | before i start could you go back to one of your earlier slides |
---|
0:15:35 | where you were highlighting the different microphones between mixer two and mixer six |
---|
0:15:41 | yes so |
---|
0:15:43 | so i was looking at mixer one and two and i'm a little |
---|
0:15:48 | concerned i guess channel number five is the kind of the |
---|
0:15:53 | jabra or |
---|
0:15:55 | okay i'm thinking of the star wars style ear wrap cellphone mic and there's also the earbud |
---|
0:16:00 | one so you've got two there actually i mean i don't think you |
---|
0:16:04 | used five and six from mixer one or mixer two is that correct no for |
---|
0:16:10 | mixer one and two we used all the data all of it so i'm thinking that some of |
---|
0:16:15 | those when you have two mics that are actually both configured around the ear they |
---|
0:16:20 | are literally near each other you know i mean it's a mic you're gonna have some |
---|
0:16:25 | i imagine interference between the two |
---|
0:16:28 | so maybe i don't know it's an honest question did you check that okay so |
---|
0:16:34 | one of the things the main question i was gonna ask is when you're |
---|
0:16:37 | looking at the map adaptation you had the |
---|
0:16:44 | denoising enhancement piece when you're looking across the different mics going from one mic to |
---|
0:16:49 | another some mics are closer in terms of their characteristics than others did you |
---|
0:16:54 | see any benefit in moving from one to the other |
---|
0:16:59 | i guess what we're asking is whether we could subset a set of unique mics |
---|
0:17:02 | right and we haven't that's a really good question i think actually moving forward anyway |
---|
0:17:08 | i mean real data is kinda nice because you can reality check but i think |
---|
0:17:11 | actually moving towards the synthetic data you can really move to very different you |
---|
0:17:16 | know |
---|
0:17:16 | room conditions i mean this was collected in exactly two rooms so it's not really diverse and i'm just |
---|
0:17:22 | thinking given the transfer characteristics for all the mics you could kind of look at your solutions |
---|
0:17:26 | to see |
---|
0:17:27 | why if you're moving from one mic to another sometimes if it's a closer |
---|
0:17:32 | one one solution does better than another |
---|
0:17:34 | that's actually an analysis we could try to do we could try to see which features |
---|
0:17:37 | look closer across the parallel data sets |
---|
0:17:40 | i wasn't trying to ask you to burn more compute cycles on it it's just a nice question |
---|
0:17:55 | that's a good point we have to note i couldn't find placement information for |
---|
0:17:59 | mixer one and two it probably exists somewhere but i ran out of luck trying to |
---|
0:18:02 | find it mixer six has a lot of information |
---|
0:18:09 | so mixer two it was at three locations i think there's a site at the ldc |
---|
0:18:14 | and |
---|
0:18:17 | and i think i think there are three and then mixer six i |
---|
0:18:20 | think is two i believe that's right |
---|
0:18:25 | although to be sure you would have to start with reading the documentation |
---|
0:18:34 | i had a question on the denoising network so when we applied that kind of thing |
---|
0:18:39 | we found it was important to |
---|
0:18:42 | apply speech activity detection first and then train the network because if we send the |
---|
0:18:48 | silence frames to it |
---|
0:18:49 | it learns this easy but low-value mapping because it's just zeros and then |
---|
0:18:54 | it goes through the rest of the network and the network is zapping that so that |
---|
0:18:58 | is actually a good point we ran a |
---|
0:19:02 | we eliminated the silence that's right i |
---|
0:19:05 | so i think we might have run that on the clean channel for training and |
---|
0:19:09 | applied it to the other ones for decoding we always ran it on whatever the data |
---|
0:19:13 | was |
---|
0:19:13 | we tried to optimize that you know to be a realistic condition but for training i |
---|
0:19:19 | think we might have done it on the telephone data which matched our sad |
---|
0:19:22 | system best and then used those as |
---|
0:19:24 | the speech marks across channels |
---|
0:19:32 | any more questions |
---|
0:19:35 | okay let's thank the speaker |
---|