0:00:14 | hello everyone welcome to my presentation this is the recorded video for odyssey |
---|
0:00:20 | two thousand and twenty |
---|
0:00:22 | i am from the hong kong university of science and technology |
---|
0:00:27 | in this video i'd like to introduce our work on orthogonality regularizations for end |
---|
0:00:34 | to end speaker verification |
---|
0:00:38 | for the speaker verification tasks |
---|
0:00:40 | hybrid systems have been the dominant solutions for a long time |
---|
0:00:45 | for example the i-vector based systems |
---|
0:00:48 | or the x vector based systems |
---|
0:00:51 | a hybrid system usually consists of multiple modules |
---|
0:00:56 | the speaker embeddings can be obtained from i-vectors or deep neural networks and |
---|
0:01:02 | a separate scoring function is commonly built with a plda classifier |
---|
0:01:09 | in the hybrid systems |
---|
0:01:12 | each module is optimized with respect to its own target function |
---|
0:01:17 | which is usually not consistent |
---|
0:01:20 | moreover speaker verification is an open set problem |
---|
0:01:26 | it needs to handle unknown speakers in the evaluation stage |
---|
0:01:30 | so the generalization ability of the system will be very important |
---|
0:01:35 | recently |
---|
0:01:36 | more and more speaker verification systems are trained in an end to end manner |
---|
0:01:42 | an end to end system will map the test utterance and |
---|
0:01:46 | the enrollment utterances directly to a single score |
---|
0:01:50 | it can simplify the training pipeline |
---|
0:01:53 | and the whole system is optimized in a more consistent manner |
---|
0:01:59 | it also enables learning task specific metrics during the training stage |
---|
0:02:05 | and various loss functions have been proposed for the end to end systems |
---|
0:02:11 | for example the triplet loss |
---|
0:02:13 | and the generalized end to end loss |
---|
0:02:17 | the core idea of the loss functions in the end to end systems is to minimize |
---|
0:02:23 | the distances between embeddings from the same speaker |
---|
0:02:28 | and maximize the distances between embeddings from different speakers |
---|
0:02:33 | among these loss functions most of them will use the cosine similarity |
---|
0:02:38 | that means the cosine of the angle between two embedding vectors is used to |
---|
0:02:43 | be the distance between these two embeddings |
---|
0:02:48 | so the major underlying assumption for the effectiveness of the cosine similarity measurement |
---|
0:02:55 | is that the embedding space is orthogonal |
---|
0:02:58 | which is not guaranteed during the training |
---|
0:03:02 | in this work |
---|
0:03:03 | we aim to explore the orthogonality |
---|
0:03:06 | regularization in the end to end speaker verification systems |
---|
0:03:13 | so |
---|
0:03:14 | in this work our systems are trained with the generalized end to end loss |
---|
0:03:20 | we propose two regularizers |
---|
0:03:23 | the first one is called the soft orthogonality regularization and the second one |
---|
0:03:27 | is called the spectral restricted isometry property |
---|
0:03:31 | regularization |
---|
0:03:34 | and these |
---|
0:03:35 | two proposed regularizers are evaluated on two different neural network structures |
---|
0:03:42 | the lstm based one and the time delay neural network based one |
---|
0:03:49 | so first i'd like to briefly introduce |
---|
0:03:52 | the generalized end to end loss |
---|
0:03:57 | in our end to end system |
---|
0:03:59 | so one mini batch consists of n speakers and m utterances from each speaker |
---|
0:04:06 | that means we will have n times m utterances in total for one mini batch |
---|
0:04:13 | so x i j represents the acoustic features computed from utterance j of speaker i |
---|
0:04:20 | for each input feature x i j the network |
---|
0:04:24 | produces a corresponding embedding vector e i j |
---|
0:04:28 | so we can |
---|
0:04:30 | compute the centroid of the |
---|
0:04:32 | embedding vectors from |
---|
0:04:35 | speaker i |
---|
0:04:36 | by averaging its embedding vectors |
---|
0:04:41 | then we define the similarity matrix s i j k |
---|
0:04:46 | to be the |
---|
0:04:48 | scaled cosine distances between each embedding vector and all of the centroids |
---|
0:04:54 | s i j k means the similarity of |
---|
0:04:59 | speaker embedding e i j to the |
---|
0:05:03 | speaker centroid c k |
---|
0:05:06 | and w and |
---|
0:05:08 | b are trainable parameters |
---|
0:05:11 | and the |
---|
0:05:13 | weight w is constrained to be positive so that the similarity will be |
---|
0:05:18 | larger when the cosine distance is larger |
---|
0:05:23 | during the training |
---|
0:05:25 | we want each utterance |
---|
0:05:26 | embedding |
---|
0:05:27 | e i j to be close to its own speaker |
---|
0:05:31 | centroid while far away from other speakers centroids so we apply a softmax on s |
---|
0:05:38 | i j k |
---|
0:05:39 | for all the possible |
---|
0:05:41 | k |
---|
0:05:42 | and get this |
---|
0:05:43 | loss function |
---|
0:05:47 | and the final generalized end to end loss is the summation of the losses over |
---|
0:05:52 | all embedding vectors |
---|
0:05:55 | in brief |
---|
0:05:57 | the generalized end to end loss pushes embeddings towards the centroid of the true |
---|
0:06:03 | speaker |
---|
0:06:04 | and away from the centroid of the most similar different speaker |
---|
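To make the loss described above concrete, here is a minimal PyTorch sketch of the GE2E similarity matrix and loss; the function name is an assumption, the positivity of w is enforced with a simple clamp, and the leave-one-out centroid refinement from the original GE2E paper is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w, b):
    """embeddings: (N speakers, M utterances, D), assumed L2-normalized; w, b: scalar parameters."""
    N, M, D = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)      # c_k: per-speaker centroids (N, D)
    cos = torch.einsum('nmd,kd->nmk', embeddings, centroids)     # cosine of e_ij to every centroid
    sim = torch.clamp(w, min=1e-6) * cos + b                     # scaled similarities s_ijk
    # the target centroid for e_ij is its own speaker i
    targets = torch.arange(N, device=embeddings.device).unsqueeze(1).expand(N, M)
    # softmax over the N centroids, summed over all N*M embeddings
    return F.cross_entropy(sim.reshape(N * M, N), targets.reshape(N * M))
```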
0:06:12 | and we introduce the two regularizers to the end to end systems |
---|
0:06:16 | the first one is called soft orthogonality regularization |
---|
0:06:22 | suppose |
---|
0:06:23 | we have a fully connected layer |
---|
0:06:26 | and it has a weight matrix w |
---|
0:06:29 | the soft orthogonality regularization is defined in this way |
---|
0:06:34 | and lambda is a regularization |
---|
0:06:37 | coefficient |
---|
0:06:40 | and the norm here refers to the frobenius norm |
---|
0:06:46 | so this soft orthogonality regularization term |
---|
0:06:50 | requires the gram matrix of w to be close to identity |
---|
0:06:57 | and since the gradient of this soft orthogonality regularization term with respect to the |
---|
0:07:02 | weight |
---|
0:07:03 | matrix w can be computed in a |
---|
0:07:06 | stable form |
---|
0:07:07 | so this regularization term can be directly added to the end to end loss |
---|
0:07:13 | and optimized |
---|
0:07:15 | together |
---|
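A minimal sketch of the soft orthogonality (SO) penalty just described, lambda * ||W^T W - I||_F^2, for a fully connected layer's weight matrix; the helper name is an illustrative assumption, and depending on the shape of W the Gram matrix W W^T may be used instead.

```python
import torch

def soft_orthogonality(W, lam):
    """W: weight matrix of a fully connected layer, lam: regularization coefficient."""
    gram = W.t() @ W                                    # Gram matrix W^T W
    eye = torch.eye(gram.shape[0], device=W.device)
    return lam * torch.norm(gram - eye, p='fro') ** 2   # squared Frobenius-norm penalty

# in training, this term would simply be added to the GE2E loss before backpropagation
```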
0:07:18 | the second one is called the spectral restricted isometry property regularization |
---|
0:07:25 | the |
---|
0:07:25 | restricted isometry property characterizes |
---|
0:07:30 | matrices that are nearly |
---|
0:07:32 | orthonormal |
---|
0:07:35 | so this regularization term is derived from the |
---|
0:07:39 | restricted isometry property |
---|
0:07:43 | for a weight matrix w the s r i p regularization |
---|
0:07:47 | is formulated in this way |
---|
0:07:50 | here lambda is also a regularization coefficient |
---|
0:07:54 | and sigma is |
---|
0:07:56 | the spectral norm |
---|
0:07:58 | which equals the largest singular value of the |
---|
0:08:02 | matrix |
---|
0:08:05 | so this s r i p regularization term |
---|
0:08:08 | requires |
---|
0:08:11 | the largest singular value of the gram matrix minus identity |
---|
0:08:15 | to be close to zero |
---|
0:08:17 | which is equivalent to |
---|
0:08:19 | requiring all the singular values of w to be close to |
---|
0:08:23 | one |
---|
0:08:27 | since this |
---|
0:08:28 | s r i p regularization term |
---|
0:08:31 | requires |
---|
0:08:33 | an exact singular value computation |
---|
0:08:36 | it will result in numerically unstable gradients |
---|
0:08:40 | so |
---|
0:08:41 | in practice we use the technique called power iteration |
---|
0:08:46 | to |
---|
0:08:47 | approximate the spectral norm computation process |
---|
0:08:51 | so in our experiments we just randomly initialize the vector v |
---|
0:08:57 | and |
---|
0:08:58 | repeat the above iterative process for two times |
---|
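Here is a minimal sketch of the spectral restricted isometry property (SRIP) penalty, lambda * sigma(W^T W - I), with the spectral norm approximated by the power iteration just described; the function name and the random initialization of v follow the description above, but the details are otherwise illustrative assumptions.

```python
import torch

def srip(W, lam, n_iters=2):
    """W: weight matrix, lam: regularization coefficient."""
    gram = W.t() @ W
    A = gram - torch.eye(gram.shape[0], device=W.device)   # W^T W - I
    v = torch.randn(A.shape[1], 1, device=W.device)        # randomly initialized vector v
    for _ in range(n_iters):                                # power iteration, repeated two times
        v = A @ v
        v = v / (v.norm() + 1e-12)
    sigma = (A @ v).norm()                                  # approximate largest singular value of A
    return lam * sigma
```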
0:09:07 | lambda is the regularization coefficient for both regularization terms |
---|
0:09:12 | the choice of the regularization coefficient |
---|
0:09:15 | plays an important role in the training process as well as final system performance |
---|
0:09:22 | so we investigated two different |
---|
0:09:25 | schedules |
---|
0:09:26 | the first one keeps |
---|
0:09:30 | a constant coefficient throughout the training stage where lambda is |
---|
0:09:36 | fixed to a single value |
---|
0:09:38 | and the second schedule starts with |
---|
0:09:42 | lambda equal to zero point two and then we gradually reduce it to zero |
---|
0:09:47 | during the training stage |
---|
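A minimal sketch of the two coefficient schedules described above; the linear decay shape is an assumption, since only the start value 0.2 and the end value 0 of the decreasing schedule are stated.

```python
def lambda_schedule(epoch, total_epochs, constant=None, start=0.2, end=0.0):
    """Return the regularization coefficient lambda for the current epoch."""
    if constant is not None:                 # schedule 1: keep lambda constant
        return constant
    frac = epoch / max(total_epochs - 1, 1)  # schedule 2: decrease from 0.2 to 0
    return start + (end - start) * frac
```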
0:09:51 | we explored two different types of neural networks |
---|
0:09:55 | the first one is the lstm based system |
---|
0:10:00 | and the second one is the tdnn based system |
---|
0:10:03 | the lstm system contains three lstm layers with projection |
---|
0:10:09 | and each lstm layer has seven hundred sixty eight hidden nodes |
---|
0:10:14 | the projection size is set to two hundred fifty six |
---|
0:10:18 | after processing the whole input utterance the last frame output |
---|
0:10:24 | of the lstm |
---|
0:10:25 | will be used |
---|
0:10:27 | as the representation |
---|
0:10:29 | of the whole utterance |
---|
0:10:32 | and in the tdnn system we use the same structure as |
---|
0:10:36 | in the kaldi x vector |
---|
0:10:39 | model |
---|
0:10:41 | and all the output embedding vectors are |
---|
0:10:44 | computed as the l two normalization of the network output |
---|
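A minimal PyTorch sketch of the LSTM-based embedding network as described: three LSTM layers with 768 hidden nodes, a 256-dimensional projection, the last-frame output taken as the utterance representation, and L2 normalization of the output; the class name and the input feature dimension are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class LstmSpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=768, proj=256, layers=3):
        super().__init__()
        # three-layer LSTM with per-layer projection
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            proj_size=proj, batch_first=True)

    def forward(self, x):
        """x: (batch, frames, feat_dim) acoustic features of a whole utterance."""
        out, _ = self.lstm(x)            # (batch, frames, proj)
        emb = out[:, -1, :]              # last-frame output as the utterance representation
        return F.normalize(emb, dim=-1)  # L2-normalized speaker embedding
```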
0:10:52 | so our experiments are |
---|
0:10:57 | with |
---|
0:10:57 | the voxceleb one corpus |
---|
0:11:02 | and |
---|
0:11:04 | in each mini batch we use sixty four |
---|
0:11:08 | speakers and m segments per speaker |
---|
0:11:11 | which is chosen according to our gpu memory capacity |
---|
0:11:15 | and the segment lengths are randomly sampled from one hundred forty to |
---|
0:11:20 | one hundred eighty frames |
---|
0:11:23 | although the orthogonality regularization can be applied to all the layers |
---|
0:11:29 | in this work we only applied the orthogonality constraints on the weight |
---|
0:11:34 | matrix of the speaker embedding layer |
---|
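A minimal sketch of how the pieces above could fit together in one training step, applying the orthogonality penalty only to the weight matrix of the embedding layer as just described; it reuses the ge2e_loss and srip sketches from earlier, and the model attribute name embedding_weight as well as the choice of srip over soft_orthogonality are illustrative assumptions.

```python
def training_step(model, batch, w, b, lam):
    """batch: (N, M, frames, feat_dim) features for N speakers and M segments each."""
    N, M, T, D = batch.shape
    emb = model(batch.reshape(N * M, T, D)).reshape(N, M, -1)  # speaker embeddings
    loss = ge2e_loss(emb, w, b)                                # GE2E loss from the sketch above
    # orthogonality penalty applied only to the embedding layer's weight matrix
    loss = loss + srip(model.embedding_weight, lam)
    return loss
```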
0:11:40 | here are the results of the lstm based systems |
---|
0:11:44 | in the table |
---|
0:11:46 | no regularizer means |
---|
0:11:49 | we do not add either of the two |
---|
0:11:53 | regularization terms during the training stage and this is the baseline |
---|
0:11:58 | from all the results we can see that both regularization terms improve the system |
---|
0:12:03 | performance |
---|
0:12:06 | and the s r i p regularization |
---|
0:12:09 | outperforms the soft orthogonality regularization |
---|
0:12:13 | as well as the baseline with remarkable performance gains |
---|
0:12:18 | there are around |
---|
0:12:19 | twenty percent improvements in eer |
---|
0:12:23 | and also in the mindcf metrics |
---|
0:12:27 | and the decreasing schedule |
---|
0:12:30 | performs better than the constant schedule for both regularizers |
---|
0:12:37 | we also show the det curves for the baseline and the best lstm |
---|
0:12:43 | system |
---|
0:12:44 | trained with the s r i p regularization and the decreasing schedule |
---|
0:12:48 | in this figure we can see that the |
---|
0:12:51 | orthogonality |
---|
0:12:53 | regularization really helps to improve the system |
---|
0:12:59 | and here are the results of the tdnn based systems |
---|
0:13:04 | in this case |
---|
0:13:06 | the two regularization terms are actually comparable in performance |
---|
0:13:12 | and |
---|
0:13:14 | for the soft orthogonality regularization term |
---|
0:13:18 | it improves over the baseline |
---|
0:13:23 | in eer and mindcf |
---|
0:13:25 | only when trained with the decreasing lambda schedule |
---|
0:13:31 | that is the soft orthogonality regularization |
---|
0:13:34 | is beneficial only when trained with the decreasing schedule |
---|
0:13:38 | the best s r i p system |
---|
0:13:41 | is twelve percent better in eer and |
---|
0:13:45 | eighteen percent better in mindcf |
---|
0:13:50 | so the s r i p regularization gives |
---|
0:13:52 | consistent |
---|
0:13:57 | performance when trained with the different lambda schemes |
---|
0:14:04 | here we plot the det curves for the baseline and the |
---|
0:14:10 | best |
---|
0:14:11 | tdnn systems trained with the two regularizers and the decreasing schedule |
---|
0:14:16 | in this figure |
---|
0:14:22 | and to explore the effect of the orthogonality regularization during training we plot |
---|
0:14:28 | the validation loss curves |
---|
0:14:31 | here is an example |
---|
0:14:32 | of the validation loss curves during the training of the lstm based systems |
---|
0:14:38 | just notice that |
---|
0:14:39 | the actual number of training epochs is different for |
---|
0:14:44 | different systems |
---|
0:14:46 | this is because we set the maximum number of training epochs |
---|
0:14:50 | to one hundred and stop training if the validation loss does not |
---|
0:14:55 | decrease for six consecutive epochs |
---|
0:15:01 | from the loss curves we can see that |
---|
0:15:05 | both regularizers accelerate the training process in the early |
---|
0:15:09 | training stage |
---|
0:15:11 | and attain a smaller loss throughout the training compared to the baseline |
---|
0:15:18 | in general the s r i p regularization achieves a smaller validation loss |
---|
0:15:23 | than |
---|
0:15:23 | the soft orthogonality regularization and this finding is consistent with the system performance |
---|
0:15:31 | where in general the s r i p regularization is better than the s o regularization |
---|
0:15:37 | for both regularizers |
---|
0:15:40 | training with a constant lambda results in more training |
---|
0:15:44 | epochs and also a lower final loss |
---|
0:15:48 | but this is different from the findings |
---|
0:15:51 | in the |
---|
0:15:53 | system performance |
---|
0:15:55 | because according to the final system results |
---|
0:15:59 | training with the decreasing schedule |
---|
0:16:01 | always results in better performance |
---|
0:16:05 | so one possible reason is that |
---|
0:16:07 | in the final training stage the tuning of the model parameters is |
---|
0:16:12 | more likely around an optimal point |
---|
0:16:14 | so keeping a consistent regularization strength |
---|
0:16:18 | throughout |
---|
0:16:18 | the training |
---|
0:16:20 | may be over restrictive |
---|
0:16:23 | at this stage |
---|
0:16:24 | so by decreasing the coefficient we loosen the orthogonality constraint and the model |
---|
0:16:30 | parameters have more flexibility in the final stage |
---|
0:16:34 | thus leading to a better system performance |
---|
0:16:40 | so in conclusion |
---|
0:16:41 | we introduced two orthogonality regularizers for training |
---|
0:16:47 | end to end text independent speaker verification systems |
---|
0:16:52 | the first one is the soft orthogonality regularization |
---|
0:16:56 | it requires the gram matrix |
---|
0:16:59 | to be close to identity |
---|
0:17:01 | and the second one is |
---|
0:17:06 | the s r i p regularization |
---|
0:17:08 | which minimizes the largest |
---|
0:17:10 | singular value of the gram matrix minus identity |
---|
0:17:12 | based on the restricted isometry property |
---|
0:17:20 | two different neural network architectures |
---|
0:17:22 | the lstm and the tdnn |
---|
0:17:24 | were investigated |
---|
0:17:26 | we also tried different regularization coefficient scheduling schemes and investigated their effect |
---|
0:17:34 | on the training dynamics as well as the evaluation performance |
---|
0:17:39 | we find that the spectral restricted isometry property regularization |
---|
0:17:45 | performs the best in all the cases |
---|
0:17:49 | and achieves in the best case around twenty percent improvement on all the criteria |
---|
0:17:57 | both regularizers can be combined into the original training loss and optimized together with |
---|
0:18:05 | little computation overhead |
---|
0:18:09 | and that's all of |
---|
0:18:11 | my presentation thank you for listening and welcome to contact us about our work thank you |
---|