0:00:16 | hello everyone, we are from Johns Hopkins University, |
---|
0:00:20 | and |
---|
0:00:22 | our presentation and framework are on speaker verification and speech enhancement |
---|
0:00:27 | let's start with the slides |
---|
0:00:36 | the title of this presentation is "Analysis of Deep Feature Loss based Enhancement for Speaker Verification" |
---|
0:00:43 | and I will be reusing some slides from my previous work, which was called "Feature |
---|
0:00:49 | Enhancement with |
---|
0:00:50 | Deep Feature Losses for Speaker Verification" |
---|
0:00:56 | our downstream task is speaker verification |
---|
0:00:59 | and the problem refers to the |
---|
0:01:03 | task of determining if the speaker in utterance one, |
---|
0:01:06 | the enrollment utterance, is the same as |
---|
0:01:09 | the speaker in utterance two, which is the test utterance |
---|
0:01:13 | the state-of-the-art way to implement this is to use a so-called x-vector extractor network and |
---|
0:01:19 | a probabilistic linear discriminant analysis (PLDA) classifier |
---|
0:01:23 | and also to do data augmentation |
---|
0:01:27 | in conjunction |
---|
0:01:30 | speech enhancement |
---|
0:01:31 | in this problem means that you aid speaker verification |
---|
0:01:35 | by preprocessing the enrollment and test utterances at test time |
---|
0:01:42 | it has been noted that speech enhancement may only help when trained in |
---|
0:01:48 | tandem with the speaker recognition objective |
---|
0:01:52 | and we pursue a paradigm called deep feature loss training |
---|
0:01:56 | which |
---|
0:01:56 | connects the two problems, as we will see now |
---|
0:02:02 | this is the schematic of deep feature loss training; as you can see there are |
---|
0:02:08 | two networks: one, denoted by E, is the |
---|
0:02:10 | enhancement network, and the other, denoted by A, is the auxiliary |
---|
0:02:15 | network |
---|
0:02:18 | the enhancement network takes noisy features and produces enhanced features |
---|
0:02:23 | these enhanced features are not directly compared with the clean features; instead, both are forwarded |
---|
0:02:30 | through the auxiliary network, and the differences in the intermediate activations |
---|
0:02:37 | are penalized; these differences are known as the deep feature loss |
---|
0:02:43 | when we don't use the auxiliary network on the clean and enhanced features, and simply choose to |
---|
0:02:46 | compare the enhanced features with the clean features, |
---|
0:02:49 | we get the ordinary |
---|
0:02:51 | feature loss |
---|
0:02:54 | as you can imagine, |
---|
0:02:55 | this type of training does enhancement while also preserving speaker information, |
---|
0:03:02 | since that is what the auxiliary network encodes |
---|
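To make the loss concrete, here is a minimal sketch of deep feature loss training, assuming PyTorch; the names `enh_net` and `aux_net`, and the `aux_net.layers` attribute, are hypothetical stand-ins for the two networks above, and the L1 distance over activations is an assumption, not necessarily the exact distance used in the paper.

```python
# Minimal sketch of deep feature loss (DFL) training, assuming PyTorch.
# `enh_net` and `aux_net` are hypothetical stand-ins; `aux_net` is a
# pre-trained speaker network whose parameters stay frozen.
import torch

def deep_feature_loss(aux_net, enhanced, clean, num_layers=5):
    # Forward both feature sequences through the auxiliary network and
    # accumulate L1 differences between the intermediate activations.
    loss = 0.0
    x, y = enhanced, clean
    for layer in aux_net.layers[:num_layers]:  # frame-level layers only
        x, y = layer(x), layer(y)
        loss = loss + torch.mean(torch.abs(x - y))
    return loss

def train_step(enh_net, aux_net, optimizer, noisy, clean):
    enhanced = enh_net(noisy)  # enhance the noisy features
    loss = deep_feature_loss(aux_net, enhanced, clean)
    optimizer.zero_grad()
    loss.backward()    # gradients flow through the frozen aux_net into enh_net
    optimizer.step()   # only enh_net's parameters are in the optimizer
    return loss.item()
```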
0:03:08 | this is how our speaker verification pipeline looks: the enrollment and test utterances go through |
---|
0:03:15 | feature extraction independently and are also enhanced independently, then |
---|
0:03:22 | each of them |
---|
0:03:23 | goes through our embedding extractor, which in our case is an x-vector network, |
---|
0:03:30 | and |
---|
0:03:31 | then the PLDA classifier |
---|
0:03:33 | computes a log-likelihood ratio to say |
---|
0:03:38 | whether it is the |
---|
0:03:40 | same speaker or not |
---|
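As a reference for the scoring step, below is a hedged sketch of log-likelihood-ratio scoring under a two-covariance Gaussian PLDA model, assuming numpy/scipy and centered embeddings; the parameterization with between-speaker covariance `B` and within-speaker covariance `W` is a standard textbook formulation, not necessarily the exact recipe used here.

```python
# Sketch of two-covariance PLDA scoring: embedding = speaker factor (cov B)
# + residual (cov W). Same-speaker trials share the speaker factor.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x_enroll, x_test, B, W):
    d = len(x_enroll)
    z = np.concatenate([x_enroll, x_test])
    # Same-speaker hypothesis: the stacked pair has cross-covariance B.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different-speaker hypothesis: the two embeddings are independent.
    zero = np.zeros((d, d))
    cov_diff = np.block([[B + W, zero], [zero, B + W]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(z, mean=mean, cov=cov_same)
            - multivariate_normal.logpdf(z, mean=mean, cov=cov_diff))
```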
0:03:44 | now, these are the details of how the data preparation is done |
---|
0:03:49 | we use |
---|
0:03:50 | the MUSAN corpus, which consists of |
---|
0:03:53 | three noise classes: babble, |
---|
0:03:57 | general noises, |
---|
0:03:59 | and music |
---|
0:04:02 | these |
---|
0:04:03 | noise classes are used to |
---|
0:04:06 | combine |
---|
0:04:07 | with |
---|
0:04:08 | VoxCeleb2, which contains 16 kHz conversational speech |
---|
0:04:13 | we create the noisy version of VoxCeleb2 by combining it with these noises, and |
---|
0:04:19 | for the x-vector network |
---|
0:04:22 | the SNRs are sampled randomly |
---|
0:04:25 | and, at a fifty percent rate, |
---|
0:04:28 | we randomly degrade the utterances |
---|
0:04:33 | we also use |
---|
0:04:35 | an SNR filtering algorithm called WADA SNR to |
---|
0:04:39 | create a fifty percent subset of VoxCeleb2, |
---|
0:04:42 | which is supposed to preserve the highest-SNR utterances from the corpus |
---|
0:04:48 | this clean version of VoxCeleb2 is then combined with |
---|
0:04:53 | the MUSAN |
---|
0:04:54 | noises, and that serves as the noisy counterpart for our supervised enhancement training |
---|
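A minimal sketch of this pair construction, assuming numpy; `estimate_snr` stands in for a WADA SNR implementation (a hypothetical helper, not a real API), and the SNR range is an illustrative assumption.

```python
# Sketch: filter a corpus by estimated SNR, then mix clean speech with noise
# at a random SNR to form (noisy, clean) pairs for enhancement training.
import numpy as np

def snr_filter(utterances, estimate_snr, keep_fraction=0.5):
    # Keep the highest-SNR half, as in the WADA SNR filtering step.
    ranked = sorted(utterances, key=estimate_snr, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

def mix_at_snr(speech, noise, snr_db):
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]   # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise                  # mixture at the target SNR

def make_pair(clean, noise, rng):
    snr_db = rng.uniform(0, 20)  # assumed SNR range, not from the talk
    return mix_at_snr(clean, noise, snr_db), clean
```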
0:05:04 | the PLDA is trained with the original-plus-augmented combined data, the same |
---|
0:05:09 | data that the x-vector network |
---|
0:05:11 | uses |
---|
0:05:14 | to give more details: the features we use are 40-dimensional log mel filterbanks, |
---|
0:05:20 | and this is the case unless stated otherwise |
---|
0:05:22 | the evaluation is done on BabyTrain, which is a corpus containing |
---|
0:05:27 | young children's speech recorded in uncontrolled environments |
---|
0:05:32 | the complete data is 250 hours of speech, and it is divided into a detection and |
---|
0:05:37 | a diarization task |
---|
0:05:40 | we have not included |
---|
0:05:42 | the diarization component in our pipeline |
---|
0:05:46 | for the evaluation data, the numbers of speakers in enroll and test are 595 |
---|
0:05:51 | and 150, respectively |
---|
0:05:54 | and results are presented in the form of equal error rate and minimum decision cost function |
---|
0:06:00 | with a target prior probability of five percent |
---|
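For reference, here is a small sketch of computing the EER and an (unnormalized) minimum DCF with a target prior of 0.05 from trial scores, assuming numpy; this is a generic implementation, not the evaluation tooling actually used.

```python
# Sketch: sweep a threshold over sorted scores to get miss/false-alarm rates,
# then read off the EER and the minimum detection cost (c_miss = c_fa = 1).
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores, p_target=0.05):
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    p_miss = np.cumsum(labels) / n_tar            # targets at or below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non    # nontargets above threshold
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2
    dcf = p_target * p_miss + (1 - p_target) * p_fa
    return eer, dcf.min()
```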
0:06:05 | the table that you see here is from our previous work, which we want to |
---|
0:06:10 | analyze in this work |
---|
0:06:13 | if you focus on the second |
---|
0:06:17 | dataset column, which is for BabyTrain, |
---|
0:06:21 | you can see the first row |
---|
0:06:24 | is actually without enhancement, and it refers to the original version of the x- |
---|
0:06:29 | vector network |
---|
0:06:30 | the suffix is just |
---|
0:06:32 | a notation to denote |
---|
0:06:34 | the type of enhancement data used |
---|
0:06:38 | so this first row gives results |
---|
0:06:42 | without enhancement, and it is 7.6 percent EER; then we used the |
---|
0:06:47 | feature loss, the deep feature loss, and also their combination, |
---|
0:06:51 | and |
---|
0:06:52 | we can see the deep feature loss usually gave the best performance previously |
---|
0:06:57 | the final row |
---|
0:06:58 | is the comparison of how much performance we gain, and you can see |
---|
0:07:04 | that with our deep feature loss enhancement the |
---|
0:07:07 | performance improves clearly over the baseline |
---|
0:07:11 | having said that, we want to address seven questions |
---|
0:07:16 | the first is: |
---|
0:07:18 | are only the initial layers of the auxiliary network useful for deep feature loss training, |
---|
0:07:23 | and can the feature loss be additively combined with the deep feature loss? |
---|
0:07:27 | the second is: |
---|
0:07:29 | for supervised enhancement training, how clean does the data need to be? |
---|
0:07:33 | can we just use clean read speech, or |
---|
0:07:35 | does an unrelated corpus create database |
---|
0:07:38 | mismatch issues? |
---|
0:07:40 | third: x-vector extractors and auxiliary networks |
---|
0:07:44 | are often available pre-trained on low-dimensional features; can we train the enhancement |
---|
0:07:49 | network with higher-dimensional features and still get some benefit? |
---|
0:07:57 | fourth: does enhancing the PLDA and x-vector training data bring further |
---|
0:08:01 | improvements? |
---|
0:08:05 | fifth: can enhanced features be used to bootstrap the training data, doubling the amount of |
---|
0:08:10 | data and making our extractor more robust? |
---|
0:08:16 | sixth: are the noise classes that we're working with really useful |
---|
0:08:22 | during the data augmentation process, |
---|
0:08:25 | or are some of the noise classes |
---|
0:08:27 | even harmful? |
---|
0:08:30 | the final question is whether the proposed scheme works for the tasks of dereverberation and joint |
---|
0:08:35 | denoising and dereverberation |
---|
0:08:40 | first, we reproduce the baseline and see which layers are good for deep feature |
---|
0:08:45 | loss extraction |
---|
0:08:48 | this is a |
---|
0:08:49 | results table with a lot of numbers, but for this presentation it's enough |
---|
0:08:55 | to focus on the first column, which gives you the labels |
---|
0:08:59 | for the type of loss or data being used, |
---|
0:09:02 | and the final |
---|
0:09:04 | column is the main result on the |
---|
0:09:08 | BabyTrain test set |
---|
0:09:11 | the first row shows the equal error rate without enhancement |
---|
0:09:15 | and then we have DFL-5, meaning the deep feature losses are extracted from |
---|
0:09:20 | five layers |
---|
0:09:21 | note that |
---|
0:09:24 | the auxiliary network has six layers: |
---|
0:09:26 | the first five are used in this one, and the sixth is |
---|
0:09:30 | the |
---|
0:09:31 | classification layer producing the embedding, which we are not using for this particular loss |
---|
0:09:36 | it gives the best performance among the combinations, as we will |
---|
0:09:40 | see |
---|
0:09:41 | FL is the feature loss, and it gives worse |
---|
0:09:46 | performance than even the baseline |
---|
0:09:48 | this reproduces observations from previous work |
---|
0:09:52 | combining the two losses |
---|
0:09:54 | is also not good |
---|
0:09:57 | when you include the embedding |
---|
0:10:00 | layer, the last layer of the network, in the deep feature loss, |
---|
0:10:04 | it is also not helpful |
---|
0:10:07 | and then we use |
---|
0:10:09 | the deep feature loss with four layers, three layers, two layers, and |
---|
0:10:14 | finally one layer, and they are not as good as using all the layers |
---|
0:10:18 | the bottom half of the table is the minimum decision cost function; |
---|
0:10:23 | the |
---|
0:10:24 | observations are mostly the same as for the equal error rate |
---|
0:10:27 | so here we have seen that the feature loss degrades our system, |
---|
0:10:34 | and combining the losses |
---|
0:10:36 | is also not useful |
---|
0:10:38 | using more layers is the best, though it increases the computational complexity, |
---|
0:10:44 | but that's okay |
---|
0:10:48 | the main takeaway is that |
---|
0:10:50 | you need to |
---|
0:10:51 | use all frame-level layers from the auxiliary network |
---|
0:10:58 | next, if we look at the choice of training dataset for the enhancement and auxiliary |
---|
0:11:01 | networks: |
---|
0:11:03 | where we see the filtered label, it means |
---|
0:11:07 | VoxCeleb2 |
---|
0:11:08 | with the WADA SNR filtering was used for the |
---|
0:11:11 | enhancement network, and |
---|
0:11:13 | consequently |
---|
0:11:15 | also for the |
---|
0:11:18 | auxiliary network; this gives the best performance, marked by boldface |
---|
0:11:24 | one row |
---|
0:11:26 | uses VC2, which is VoxCeleb2 itself, |
---|
0:11:29 | and another uses VC2 combined, |
---|
0:11:32 | that is, VoxCeleb2 combined with |
---|
0:11:34 | the noise augmentations |
---|
0:11:37 | we also see what happens if we feed random data into the enhancement network, |
---|
0:11:43 | which is, if you recall, |
---|
0:11:44 | a random fifty percent subset of VoxCeleb2, |
---|
0:11:48 | and it is not as |
---|
0:11:51 | good as the WADA SNR filtering, so |
---|
0:11:55 | this shows that filtering or screening |
---|
0:11:59 | VoxCeleb2 based on SNR seems to be important |
---|
0:12:03 | we also use LibriSpeech, and |
---|
0:12:05 | you can see, of course, that the EER is greater than with the baseline, |
---|
0:12:11 | and so read speech, |
---|
0:12:13 | which is |
---|
0:12:14 | non-conversational and mismatched data, is bad for training, |
---|
0:12:20 | even when used only as the |
---|
0:12:22 | clean counterpart for the |
---|
0:12:24 | enhancement network |
---|
0:12:28 | we also think the auxiliary network is powerful because it is |
---|
0:12:33 | trained on the whole augmented VoxCeleb2, |
---|
0:12:38 | which means that more data is used and |
---|
0:12:40 | the data augmentation is there as well |
---|
0:12:46 | next, we see if we can mismatch the features of the enhancement network: can we use higher- |
---|
0:12:50 | dimensional features in the enhancement network? |
---|
0:12:53 | in the first row the label is |
---|
0:12:56 | LMFB- |
---|
0:12:57 | 40, which means 40-dimensional log mel filterbank features |
---|
0:13:01 | for the enhancement network |
---|
0:13:04 | recall that 40-dimensional features are used in the auxiliary and x-vector networks |
---|
0:13:10 | as well, |
---|
0:13:11 | so this is the condition where the features are matched |
---|
0:13:15 | and we don't need to learn any bridge between the networks; this case, of course, |
---|
0:13:21 | works fine |
---|
0:13:22 | if we use a higher dimension, such as a higher-dimensional mel spectrogram, |
---|
0:13:28 | then the features are mismatched and you need to learn a bridge between the networks as |
---|
0:13:33 | well, |
---|
0:13:33 | and |
---|
0:13:34 | the results are not as good as in the matched condition |
---|
0:13:38 | it seems like we cannot take advantage of high-dimensional features |
---|
0:13:43 | lastly, |
---|
0:13:44 | we also try the spectrogram, which somehow seems useful to some extent, |
---|
0:13:50 | but it is also |
---|
0:13:51 | worse than the baseline |
---|
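A minimal sketch of the "bridge" idea for mismatched feature dimensions, assuming PyTorch; the dimensions and the single linear projection are illustrative assumptions, not the exact architecture used.

```python
# Sketch: a learned projection mapping high-dimensional enhanced features
# (e.g., a spectrogram) down to the 40-dimensional space the frozen
# auxiliary/x-vector networks expect. Trained jointly with the enhancer.
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, enh_dim=257, aux_dim=40):  # dims are assumptions
        super().__init__()
        self.proj = nn.Linear(enh_dim, aux_dim)

    def forward(self, enhanced_high_dim):
        return self.proj(enhanced_high_dim)
```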
0:13:58 | next, we see the effect of enhancement on the PLDA and the x-vector extractor training data |
---|
0:14:06 | the first row is without enhancement; it is the control we tested |
---|
0:14:11 | and saw earlier |
---|
0:14:13 | in the next row we can see |
---|
0:14:17 | the PLDA and test conditions written together |
---|
0:14:20 | as the label, which means that the PLDA |
---|
0:14:23 | training data is also enhanced, |
---|
0:14:25 | and it doesn't change much: the EER stays at around seven percent |
---|
0:14:31 | for the minDCF we have |
---|
0:14:38 | not much change either, so it doesn't feel like the PLDA is |
---|
0:14:43 | benefiting; it seems rather susceptible to the enhancement processing |
---|
0:14:49 | if we enhance the x-vector training set, |
---|
0:14:52 | there is improvement over the standard baseline, |
---|
0:14:56 | which is the unenhanced system; |
---|
0:14:58 | however, it's not as good as just enhancing the test data |
---|
0:15:02 | one explanation seems to be that |
---|
0:15:05 | the robustness of the whole system is lost, so it's not working, at least |
---|
0:15:10 | for |
---|
0:15:12 | this corpus |
---|
0:15:16 | next, we combine the enhanced features with the originals to see if we can take advantage of them |
---|
0:15:22 | as complementary to the original features |
---|
0:15:25 | note that "orig" just means the original, unenhanced condition, |
---|
0:15:30 | and "enh" means the enhanced version of all the data |
---|
0:15:37 | in the column |
---|
0:15:39 | where you see "orig + enh" for the PLDA, it means |
---|
0:15:43 | that both versions |
---|
0:15:45 | are included in the |
---|
0:15:48 | PLDA training, |
---|
0:15:51 | and including the original features there, alongside the enhanced data, |
---|
0:15:57 | seems to be making things worse |
---|
0:16:01 | but |
---|
0:16:02 | when we combine these features in the x-vector training set, |
---|
0:16:06 | it actually gives much better performance; it seems like the network now sees double the data, and |
---|
0:16:12 | there is also complementary |
---|
0:16:15 | information in the |
---|
0:16:17 | enhanced features, so they |
---|
0:16:19 | can be used to bootstrap the extractor |
---|
0:16:22 | if, on top of that, we add the enhanced features to the PLDA training as well |
---|
0:16:26 | as to the x-vector training, it doesn't help |
---|
0:16:32 | so this confirms that the PLDA is not suited to enhancement processing; |
---|
0:16:37 | it is better to just put the enhanced features in the x-vector training |
---|
0:16:41 | set, not in the PLDA |
---|
0:16:46 | now we see what happens if we remove one type of noise class from the x-vector network |
---|
0:16:51 | or the |
---|
0:16:54 | enhancement data |
---|
0:16:56 | so let's focus on the part of this table which is about the |
---|
0:17:01 | music noise class |
---|
0:17:05 | looking at the last column, we see that if we |
---|
0:17:10 | skip |
---|
0:17:12 | using the music files in x-vector training, and we also don't use enhancement, we actually |
---|
0:17:18 | do better than the baseline, which means |
---|
0:17:20 | that |
---|
0:17:21 | removing music is good: this noise class actually hurts performance |
---|
0:17:29 | next, "unseen" means we use enhancement, but the |
---|
0:17:34 | enhancement network has not seen music; |
---|
0:17:36 | it's still able to improve on the previous row, which is somewhat surprising |
---|
0:17:41 | most interestingly, |
---|
0:17:43 | when we use the |
---|
0:17:44 | "seen" condition, which is |
---|
0:17:46 | when the enhancement network has seen music, it is the best |
---|
0:17:51 | so it seems like some noise classes are |
---|
0:17:55 | harmful, and rather |
---|
0:17:57 | than |
---|
0:17:59 | including them in the x-vector training, |
---|
0:18:02 | it is better to include them in the |
---|
0:18:05 | enhancement |
---|
0:18:07 | training data |
---|
0:18:11 | next, to see if we can do dereverberation with the deep feature loss, we try seven |
---|
0:18:16 | schemes: |
---|
0:18:17 | cascade schemes, where a dereverberation stage is followed by a denoising stage; schemes trying to do |
---|
0:18:25 | dereverberation and denoising in a |
---|
0:18:27 | joint fashion; |
---|
0:18:29 | and also a single-stage fashion, which is denoted by "joint one-stage" |
---|
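To clarify the difference between the schemes, a small sketch with hypothetical network callables (`dereverb_net`, `denoise_net`, `joint_net`):

```python
# Sketch contrasting cascade and joint one-stage processing.
def cascade(dereverb_net, denoise_net, noisy_reverberant):
    # Two stages: dereverberate first, then denoise the intermediate output.
    return denoise_net(dereverb_net(noisy_reverberant))

def joint_one_stage(joint_net, noisy_reverberant):
    # One network trained on (noisy + reverberant) -> clean pairs,
    # handling both distortions at once.
    return joint_net(noisy_reverberant)
```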
0:18:33 | if you view all these numbers, |
---|
0:18:37 | you can see that |
---|
0:18:39 | the dereverberation is not actually working |
---|
0:18:42 | we also suspect it's possible that we have not found the right configuration; nevertheless, it |
---|
0:18:50 | seems that |
---|
0:18:53 | using |
---|
0:18:56 | a dedicated pre-processing step for dereverberation may be a better strategy |
---|
0:19:02 | finally, to conclude: you need to use all auxiliary network |
---|
0:19:08 | layers for this type of training |
---|
0:19:12 | use WADA SNR based filtering to keep the highest-SNR utterances from |
---|
0:19:16 | the corpus |
---|
0:19:17 | to construct the clean data for enhancement network training |
---|
0:19:21 | mismatching the features between the enhancement and auxiliary networks |
---|
0:19:25 | is slightly worse; it is better to use the same features |
---|
0:19:29 | we see that the PLDA does not really |
---|
0:19:33 | benefit; it's very susceptible to enhanced data, but we can put this data in the x-vector |
---|
0:19:38 | training instead |
---|
0:19:39 | some noise types, like music, are harmful in the x-vector training data |
---|
0:19:45 | and finally, dereverberation is not working for us |
---|
0:19:50 | using this |
---|
0:19:52 | type of training scheme |
---|
0:19:54 | so that is the end of the presentation; please feel free to send questions our |
---|
0:19:58 | way — thank you |
---|