Speech Transcript - Learning Mixture Representation for Deep Speaker Embedding Using Attention

0:00:13	hello everyone, i'm Weiwei Lin
0:00:17	and i'm the author of the paper learning mixture representation for deep speaker embedding using attention
0:00:23	and the other two authors are Man Wai Mak and Lu Yi
0:00:27	we are from the Hong Kong Polytechnic University, department of electronic and information
0:00:31	engineering
0:00:36	so to preview the contributions: first, we show that a deeper model
0:00:41	specifically a one-hundred-and-twenty-one-layer DenseNet, will produce more discriminative speaker embeddings
0:00:48	than a shallower model such as the x-vector
0:00:51	with a reasonable amount of time
0:00:54	secondly, we show that mixture representation pooling will improve
0:00:58	speaker embedding performance on a noisy dataset
0:01:03	so okay so
0:01:05	okay well as five and that's well
0:01:07	as a baseline, the network takes speech features, such as MFCC or filter bank
0:01:11	features
0:01:12	and then passes them through several layers of convolution
0:01:16	and then, because
0:01:20	speech is variable-length in nature
0:01:22	we need to convert it into a single vector, so the convolution is followed
0:01:26	by a statistics pooling layer; specifically, we compute the mean and the standard deviation and concatenate
0:01:32	the mean and standard deviation, and then they go to the fully connected layers, and
0:01:37	the network ends with a softmax layer
0:01:39	so
0:01:39	so statistics pooling is really just the mean of the variable-length utterance and the
0:01:45	standard deviation
0:01:47	and researchers have found that using the mean and standard deviation is better than using
0:01:52	solely the mean; this shows that
0:01:56	a more accurate
0:01:57	description of the distribution of the frame-level features is very helpful for
0:02:02	producing discriminative speaker embeddings
0:02:06	so
0:02:07	so let's look at statistics pooling in more detail
0:02:11	it's a very easy operation: we just compute the
0:02:14	mean of the frame-level features and then we compute the standard deviation of the
0:02:18	frame-level
0:02:19	features
0:02:21	so we can see statistics pooling as a kind of
0:02:24	summary of the frame-level features
0:02:26	so we use the mean and standard deviation as a summary of the frame-level
0:02:30	features' distribution
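The statistics pooling step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the speaker's implementation; the function name `statistics_pooling` and the toy feature sizes are assumptions made for the example.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Summarize frame-level features of shape (T, D) as one vector of size 2*D.

    The mean and standard deviation are taken over the time axis, then
    concatenated, so the output size does not depend on the number of frames T.
    """
    mean = frames.mean(axis=0)   # (D,)
    std = frames.std(axis=0)     # (D,)
    return np.concatenate([mean, std])

# Two utterances of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(120, 4))   # 120 frames, 4-dim features (toy sizes)
long_utt = rng.normal(size=(950, 4))    # 950 frames
print(statistics_pooling(short_utt).shape, statistics_pooling(long_utt).shape)
```

Both calls return a vector of length 8, which is what lets a variable-length utterance feed fixed-size fully connected layers.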
0:02:32	however
0:02:32	the mean and standard deviation can only characterize well a single-mode distribution such as a Gaussian
0:02:37	distribution
0:02:38	so a multimodal distribution is a problem for statistics pooling; if the frame-level features
0:02:44	have a kind of multimodal distribution
0:02:49	then the mean and standard deviation will only kind of
0:02:52	summarize the distribution poorly
0:02:55	so to solve this, here we propose mixture representation pooling
0:02:59	so
0:02:59	the idea is inspired by the expectation
0:03:04	maximization algorithm for
0:03:06	the Gaussian mixture model
0:03:07	so the difference here is that instead of, as in the Gaussian mixture model,
0:03:12	kind of using
0:03:14	the Euclidean distance to produce alignments between the frames and the means, we
0:03:20	use the attention mechanism
0:03:22	to produce the alignment scores; specifically, we have the frame-level features and
0:03:27	we have multiple attention heads
0:03:28	and each attention head computes a set of weights
0:03:34	and these attention weights are normalized to make them sum
0:03:37	to one
0:03:38	across heads
0:03:39	and we use these attention weights
0:03:41	to compute the means and standard deviations
0:03:43	and then we have multiple means and standard deviations, not only one
0:03:48	and they are concatenated together
0:03:50	and that is used to compute the speaker embedding
0:03:53	so the important thing here is that we have multiple heads and the
0:03:58	attention weights are normalized across heads, so it is kind of
0:04:03	is just that because wishable actually
0:04:06	and how to compute the means and standard deviations is exactly the same as
0:04:10	in a Gaussian mixture model
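A rough sketch of the pooling just described: attention scores are normalized across heads for each frame, and the per-head means and standard deviations are then computed as in the M-step of a Gaussian mixture model. The single linear scoring matrix `W`, the function name, and the `eps` smoothing are assumptions for illustration; the talk does not specify the form of the attention network.

```python
import numpy as np

def mixture_representation_pooling(frames, W, eps=1e-8):
    """frames: (T, D) frame-level features; W: (D, H) hypothetical linear scorer.

    Returns the pooled vector of size 2*H*D and the (T, H) weights.
    """
    scores = frames @ W                                   # (T, H) one score per frame and head
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(scores)
    resp /= resp.sum(axis=1, keepdims=True)               # softmax ACROSS HEADS: rows sum to 1

    occupancy = resp.sum(axis=0) + eps                    # (H,) soft frame count per head
    means = (resp.T @ frames) / occupancy[:, None]        # (H, D) weighted means
    second = (resp.T @ frames ** 2) / occupancy[:, None]  # (H, D) weighted second moments
    stds = np.sqrt(np.clip(second - means ** 2, 0.0, None) + eps)

    return np.concatenate([means.ravel(), stds.ravel()]), resp

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 6))   # toy utterance: 200 frames, 6-dim features
W = rng.normal(size=(6, 4))          # 4 heads, randomly initialized for the demo
embedding_input, resp = mixture_representation_pooling(frames, W)
print(embedding_input.shape, resp.sum(axis=1)[:3])
```

With a single head every weight equals one, so under these assumptions the sketch reduces to ordinary statistics pooling.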
0:04:13	so let's now compare this with a related method, attentive pooling
0:04:19	it was proposed by other researchers
0:04:23	so ours is very close to it, but
0:04:28	we use the attention in a different way
0:04:30	so as this is a computer
0:04:33	on the other setups location away
0:04:35	in attentive pooling the attention weights are normalized across frames, so for each head we compute a score and
0:04:41	the scores are normalized across frames
0:04:43	so nist twenty that all the different arabic a in the with real state acacia
0:04:47	mechanism
0:04:47	so in our case the attention works more like the Gaussian mixture
0:04:52	model
0:04:53	while in the attentive pooling model it is more a kind of weight for each frame
0:04:58	and the weights are trained with the loss only, so it is more
0:05:01	like a soft VAD
0:05:03	two to three to fuse i'll some
0:05:06	and the contribute to three
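For contrast, here is a sketch of attentive pooling where each head's scores are normalized across frames instead of across heads. As before, the linear scorer `W` and the function name are assumptions for illustration rather than the formulation used by the other researchers mentioned in the talk.

```python
import numpy as np

def attentive_pooling(frames, W, eps=1e-8):
    """frames: (T, D); W: (D, H) hypothetical linear scorer.

    Each head distributes a total weight of 1 over the frames, so a head acts
    like a soft frame selector (similar in spirit to a soft VAD).
    """
    scores = frames @ W                                    # (T, H)
    scores = scores - scores.max(axis=0, keepdims=True)    # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)              # softmax ACROSS FRAMES: columns sum to 1

    means = alpha.T @ frames                               # (H, D) weighted means
    second = alpha.T @ frames ** 2                         # (H, D) weighted second moments
    stds = np.sqrt(np.clip(second - means ** 2, 0.0, None) + eps)
    return np.concatenate([means.ravel(), stds.ravel()]), alpha

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 6))
pooled, alpha = attentive_pooling(frames, rng.normal(size=(6, 4)))
print(pooled.shape, alpha.sum(axis=0))
```

The only structural difference from normalizing across heads is the softmax axis: here a frame may receive a low weight from every head, whereas with across-head normalization every frame is fully shared out among the heads.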
0:05:10	so that's not that the landau wasn't that we might be a teacher forty so
0:05:17	actually that's not as in them some other researcher also have okay we will where
0:05:23	is internet
0:05:25	to latin like task
0:05:27	but now map place marginally case that
0:05:31	that other method
0:05:34	also uses multiple
0:05:36	means, but unlike ours it computes the weights using Euclidean distance
0:05:42	whereas we use attention, so we can have a score function
0:05:46	that can be very general; it can be
0:05:48	any score function, and we use a
0:05:51	very general neural network
0:05:53	so it is more powerful than Euclidean distance
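The contrast between distance-based and attention-based alignment can be illustrated as follows. Both functions return per-frame weights that sum to one over the components; the first derives them from squared Euclidean distance to a set of centers, the second from a small learned scoring network. The names, the `scale` parameter, and the one-hidden-layer `tanh` scorer are assumptions for this sketch and are not taken from the talk.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_assignments(frames, centers, scale=1.0):
    """Weights from squared Euclidean distance: closer centers get more weight."""
    diff = frames[:, None, :] - centers[None, :, :]    # (T, C, D)
    sq_dist = (diff ** 2).sum(axis=2)                  # (T, C)
    return softmax(-scale * sq_dist, axis=1)

def attention_assignments(frames, W1, W2):
    """Weights from a learned scorer: any differentiable function of the frame."""
    hidden = np.tanh(frames @ W1)                      # (T, hidden_dim)
    return softmax(hidden @ W2, axis=1)                # (T, C)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))
centers = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
print(distance_assignments(frames, centers).shape,
      attention_assignments(frames, W1, W2).shape)
```

The distance-based weights are tied to the geometry of the feature space, while the learned scorer can shape the assignment in whatever way reduces the training loss.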
0:05:57	so let's take a look and other different between that and that you would disagree
0:06:00	and we shouldn't worry
0:06:02	like i talked about before
0:06:05	attentive pooling
0:06:07	computes attention weights normalized across frames, so
0:06:12	the distribution is over the frames, and each head is
0:06:17	kind of
0:06:17	independent; in our case the distribution is over the heads, so each
0:06:24	head is only one component of a kind of Gaussian mixture
0:06:29	distribution
0:06:30	so
0:06:31	so the network we use is DenseNet, specifically the
0:06:36	one-hundred-and-twenty-one-layer DenseNet
0:06:40	DenseNet was proposed in computer vision as a kind of
0:06:45	cancun oklahoma for the guy around a ski condition because the
0:06:49	okay ross given that can as the use of test condition
0:06:52	the original DenseNet uses two-dimensional convolution, and we made just a small modification
0:06:57	to make it work with our task: we use
0:07:01	one-dimensional convolution
0:07:03	instead of two-dimensional convolution
0:07:05	and then for the transition i a low precision here we use can we also
0:07:11	use clues as a sample
0:07:12	as specifically we use the kernels that a twister to lose in an example here
0:07:17	the data symbol
0:07:19	and for the last one, the loss, we use the
0:07:23	AM-softmax; we find it pretty effective
0:07:28	so the only information for always the training data that the which idea and you
0:07:34	lda we use
0:07:36	that is seven thousand and three hundred speakers always the rewind with thirty two
0:07:41	it has data a always night maybe we maybe a voice ninety
0:07:45	we of so you very that weighs thirty one has a
0:07:48	for the features, we use forty-dimensional features with mean
0:07:53	normalization, and we also use
0:07:57	energy-based voice activity detection
0:08:00	and the neon and we use your addition to the
0:08:04	is now we also use
0:08:05	us to use a as well and wise but in ways that are we double
0:08:09	up i mean
0:08:11	it double in and the channel size of the listeners that you scroll down there
0:08:15	so this is somebody else the model use a specific law me is a real
0:08:20	time operation
0:08:21	and then model parameter and use it on hold the number and in the model
0:08:25	we can see as well as well and that the work flow is quite low
0:08:28	having a is i is a low otherwise but the network because we don't models
0:08:33	i don't know the multichannel
0:08:35	solar for helpful plastic
0:08:39	a powerful
0:08:40	about referee of all time all and the models and also quite able but that
0:08:45	is that with as the although is are quite enough that will hundred and you
0:08:50	want layer
0:08:51	because the actually have a weighted loaf localities roughly is there almost every as i
0:08:56	z s where the network
0:08:57	and then i mean is also only all of the tuple
0:09:00	and but we can see that because of this as nice very even networks
0:09:04	so we've the you know device like that you will be a little bit so
0:09:11	that's right
0:09:13	so it is there is a well or in our results
0:09:17	so first let's talk about network structure
0:09:21	we find that does not for all of our last record and when wiseman than
0:09:25	the i-th user can you has if you all three data is that
0:09:29	our has never phone a fast and i and that although and why do as
0:09:34	well were used
0:09:35	a rough is a model parameter and take more time interval as
0:09:40	ieee
0:09:42	in the performance case can be our guys that obviously perform better
0:09:47	and then follows that is important maslow we found then be sure of that and
0:09:51	we
0:09:52	on the VOiCES nineteen evaluation set
0:09:55	and i've always that anyone we have been a small improvement
0:09:58	and generally speaking way out of all conditions that is we
0:10:04	so here is that was totally an application had
0:10:09	so here we to acquire it we because study
0:10:14	are we face the known ola layer after recognition so increase number of half will
0:10:20	not be sign quiz or not but i mean
0:10:22	because he to achieve increase the number of without controlling the concatenated that dimension you
0:10:28	use of where like to model so the number of the times that no i
0:10:32	can not be penetrated problem as a mechanism
0:10:35	it could be getting the benefit for like to model so as to the telly
0:10:39	aside
0:10:39	so a reasonable how will i reason and stories they will be a more fair
0:10:44	comparison
0:10:44	so as to here we see that if we present will had four one two
0:10:49	to four
0:10:51	avoid and it's to the us is probably actually that i scheme going a
0:10:54	so we show that as you one highest ask volunteers as that is only
0:10:59	overall image relevant between c reasonable huh
0:11:03	okay queries the
0:11:05	only the increase the number that young we actually going not so reason is that
0:11:09	this kind of you shape at a when the number of buttons at high rates
0:11:14	so to conclude, we introduced the concept of mixture representation pooling
0:11:19	which uses multiple attention heads, and the way the
0:11:24	pooling is computed is inspired by the expectation maximization algorithm for the Gaussian mixture model, and unlike the
0:11:30	GMM
0:11:32	images time on a given by this mechanism is that the fusion this that we
0:11:37	do nothing levels each index pieces and so i know propose a mechanism to one
0:11:42	hundred and twenty one data s now but it should be for one was that
0:11:47	everyone for on several ways night evaluation set
0:11:52	so this is all of my presentation, thank you very much for listening; if you have
0:11:58	any questions about my presentation, you are welcome to comment

Learning Mixture Representation for Deep Speaker Embedding Using Attention

Special Session: VOiCES 2020

Weiwei Lin, Man Wai Mak, Lu Yi