0:00:14 Hello everyone, welcome to my presentation. This is the recorded video for Odyssey 2020, from the Hong Kong University of Science and Technology. In this video I'd like to introduce our work on orthogonality regularization for end-to-end speaker verification.
0:00:38 For the speaker verification task, hybrid systems have been the dominant solution for a long time, for example the i-vector based systems or the x-vector based systems. A hybrid system usually consists of multiple modules: the speaker embeddings can be, for example, i-vectors or deep neural network embeddings, and a separate scoring function is commonly built with a PLDA classifier.
0:01:09 In the hybrid systems, each module is optimized with respect to its own target function, which is usually not consistent across the modules. Moreover, speaker verification is an open-set problem: the system has to handle unknown speakers in the evaluation stage, so the generalization ability of the system is very important.
0:01:35 Recently, more and more speaker verification systems are trending end-to-end. An end-to-end system maps the test utterance and the enrollment utterances directly to a single score. It simplifies the training pipeline, and the whole system is optimized in a more consistent manner. It also enables learning a task-specific metric during the training stage. Various loss functions have been proposed for end-to-end systems, for example the triplet loss and the generalized end-to-end loss.
0:02:17 The core idea of the loss functions in the end-to-end systems is to minimize the distances between embeddings from the same speaker and maximize the distances between embeddings from different speakers. Most of these loss functions use the cosine similarity, that is, the cosine of the angle between two embedding vectors, as the distance between the two embeddings. The major underlying assumption behind the effectiveness of the cosine similarity measurement is that the embedding space is orthogonal, which is not guaranteed during training.
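To make this assumption concrete, here is a short note in our own notation (the basis matrix B below is our illustration and does not appear in the talk): the coordinate-wise cosine similarity agrees with the true geometric cosine only when the embedding basis is orthonormal.

```latex
% Cosine similarity between two embedding (coordinate) vectors e_1 and e_2:
\[
  \cos(e_1, e_2) \;=\; \frac{e_1^{\top} e_2}{\lVert e_1 \rVert \,\lVert e_2 \rVert}
\]
% If the coordinates are expressed in a basis B, the true inner product is
% e_1^{\top} B^{\top} B \, e_2, which reduces to the plain dot product
% e_1^{\top} e_2 only when B^{\top} B = I, i.e. when the basis is orthonormal.
```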
0:03:02 In this work, we aim to explore orthogonality regularization in end-to-end speaker verification systems. Our systems are trained with the generalized end-to-end loss, and we propose two regularizers: the first one is called soft orthogonality regularization, and the second one is called spectral restricted isometry property regularization. The two proposed regularizers are evaluated on two different neural network structures: an LSTM-based one and a time-delay neural network (TDNN) based one.
0:03:49 First, I'd like to briefly introduce the generalized end-to-end loss. In our end-to-end system, one mini-batch consists of N speakers and M utterances from each speaker; that means we have N times M utterances in total in one mini-batch. x_ij represents the acoustic features computed from utterance j of speaker i. For each input feature x_ij, the network produces a corresponding embedding vector e_ij, and we can compute the centroid c_i of speaker i by averaging its embedding vectors.
0:04:41 Then we define the similarity matrix S_ij,k as the scaled cosine similarities between each embedding vector and all of the centroids: S_ij,k = w cos(e_ij, c_k) + b is the similarity of speaker embedding e_ij to the speaker centroid c_k, where w and b are trainable parameters. w is constrained to be positive so that the similarity is larger when the cosine similarity is larger. During training, we want each utterance embedding e_ij to be close to its own speaker's centroid while far away from the other speakers' centroids, so we apply a softmax on S_ij,k over all the possible k and obtain the loss function. The final generalized end-to-end loss is the summation of the losses over all embedding vectors.
0:05:55 In brief, the generalized end-to-end loss pushes each embedding towards the centroid of the true speaker and away from the centroid of the most similar different speaker.
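As an illustration, here is a minimal PyTorch sketch of the softmax variant of the generalized end-to-end loss, following the definitions above; the function and variable names are ours, and for brevity we omit the detail of the original GE2E paper where e_ij is excluded from its own centroid.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(emb, n_spk, n_utt, w, b):
    """emb: (n_spk * n_utt, dim) embeddings, grouped by speaker.
    w, b: trainable scale and bias of the similarity matrix."""
    e = F.normalize(emb.view(n_spk, n_utt, -1), dim=-1)   # e[i, j] = e_ij
    centroids = F.normalize(e.mean(dim=1), dim=-1)        # c_k for each speaker
    # S_{ij,k} = w * cos(e_ij, c_k) + b, with w kept positive
    cos = torch.einsum('nmd,kd->nmk', e, centroids)
    sim = w.clamp(min=1e-6) * cos + b
    # softmax over k: each e_ij should be most similar to its own centroid c_i
    labels = torch.arange(n_spk, device=emb.device).repeat_interleave(n_utt)
    return F.cross_entropy(sim.reshape(n_spk * n_utt, n_spk),
                           labels, reduction='sum')
```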
0:06:12 We introduce two regularizers into the end-to-end systems. The first one is called soft orthogonality regularization. Suppose we have a fully-connected layer with a weight matrix W. The soft orthogonality regularization term is defined as λ‖WᵀW − I‖_F², where λ is a regularization coefficient and ‖·‖_F is the Frobenius norm.
0:06:46 This soft orthogonality regularization term requires the Gram matrix of W to be close to the identity. Since the gradient of the soft orthogonality regularization term with respect to the weight W can be computed in a stable form, this regularization term can be directly added to the end-to-end loss and optimized together with it.
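A minimal sketch of this penalty, assuming the weight follows the torch.nn.Linear convention of shape (out_features, in_features); whether one forms WᵀW or WWᵀ depends on the orientation of the matrix, and the helper name is ours.

```python
import torch

def soft_orthogonality(W, lam):
    """Soft orthogonality penalty: lam * ||W^T W - I||_F^2.
    Pushes the Gram matrix of W towards the identity."""
    eye = torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    gram = W.t() @ W
    return lam * ((gram - eye) ** 2).sum()   # squared Frobenius norm
```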
0:07:18 The second one is called the spectral restricted isometry property (SRIP) regularization. The restricted isometry property characterizes matrices that are nearly orthonormal, and this regularization term is derived from it. For a weight matrix W, the SRIP regularization is formulated as λσ(WᵀW − I), where λ is again a regularization coefficient and σ denotes the spectral norm, that is, the largest singular value of the matrix.
0:08:05 So the SRIP regularization term requires the Gram matrix to be close to the identity in the spectral-norm sense, which in turn requires all the singular values of W to be close to one. Computed exactly, the SRIP regularization term requires a singular value decomposition, which would result in unstable gradients.
0:08:40 So in practice we use a technique called power iteration to approximate the spectral norm computation. In our experiments we just randomly initialize the vector v and repeat the power-iteration update two times.
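Here is a minimal sketch of the SRIP penalty with the spectral norm approximated by power iteration as just described; the two iterations and the random starting vector follow the talk, while the shapes, the epsilon, and the helper name are our assumptions.

```python
import torch

def srip(W, lam, n_iter=2):
    """SRIP penalty: lam * sigma(W^T W - I), where sigma is the spectral norm,
    estimated here by power iteration instead of a full decomposition."""
    eye = torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    A = W.t() @ W - eye                      # symmetric deviation from identity
    v = torch.randn(W.shape[1], 1, device=W.device, dtype=W.dtype)
    for _ in range(n_iter):                  # power iteration: v <- A v / ||A v||
        v = A @ v
        v = v / (v.norm() + 1e-12)
    # for a symmetric matrix, ||A v|| with unit v approximates the spectral norm
    sigma = (A @ v).norm()
    return lam * sigma
```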
0:09:07 λ is the regularization coefficient for both regularization terms. The choice of the regularization coefficient plays an important role in the training process as well as in the final system performance, so we investigated two different schedules. The first one keeps the coefficient constant throughout the training stage. The second schedule starts with λ equal to 0.2 and then gradually reduces it to zero during the training stage.
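A small sketch of the two coefficient schedules; the constant value (0.1 here) and the linear shape of the decay are illustrative assumptions, since the talk only states that the decreasing schedule starts at 0.2 and goes to zero.

```python
def lambda_schedule(epoch, total_epochs, scheme="decreasing",
                    lam_const=0.1, lam_start=0.2):
    """Return the regularization coefficient for the current epoch."""
    if scheme == "constant":
        return lam_const                     # fixed value all through training
    # "decreasing": start at lam_start and shrink towards zero by the last epoch
    return lam_start * max(0.0, 1.0 - epoch / float(total_epochs))
```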
0:09:51 We explored two different types of neural networks: the first one is an LSTM-based system and the second one is a TDNN-based system. The LSTM system contains three LSTM layers with projection; each LSTM layer has 768 hidden nodes and the projection size is set to 256. After processing the whole input utterance, the last-frame output of the LSTM is taken as the representation of the whole utterance. For the TDNN system, we use the same structure as the Kaldi x-vector model. In both systems, the embedding vectors are computed as the L2 normalization of the network output.
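A rough sketch of the LSTM-based embedding network as described, with three projected LSTM layers, 768 hidden units, projection size 256, the last-frame output as the utterance embedding, and L2 normalization; the input feature dimension and the class name are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmEmbedder(nn.Module):
    def __init__(self, feat_dim=40, hidden=768, proj=256, layers=3):
        super().__init__()
        # proj_size gives the per-layer projection of an LSTMP architecture
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            proj_size=proj, batch_first=True)

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        emb = out[:, -1, :]                  # last-frame output of the top layer
        return F.normalize(emb, p=2, dim=1)  # L2-normalized embedding
```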
0:10:52 Our experiments are conducted on the VoxCeleb 1 corpus. In each mini-batch we use 64 speakers and a number of segments per speaker chosen to fit our GPU memory capacity, and the segment lengths are randomly sampled from 140 to 180 frames. Although the orthogonality regularization can be applied to all the layers, in this work we only applied the orthogonality constraint to the weight matrix of the speaker embedding layer, as sketched below.
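Putting the pieces together, here is a self-contained toy training step (random data, tiny sizes) that applies the SRIP penalty only to a single embedding-layer weight matrix and adds it to the GE2E loss; it reuses the ge2e_loss, srip, and lambda_schedule sketches above, and the sizes and the plain linear embedding layer are purely illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_spk, n_utt, feat_dim, emb_dim = 4, 3, 20, 8           # tiny toy batch
emb_layer = torch.nn.Linear(feat_dim, emb_dim, bias=False)
w = torch.nn.Parameter(torch.tensor(10.0))              # GE2E similarity scale
b = torch.nn.Parameter(torch.tensor(-5.0))              # GE2E similarity bias
opt = torch.optim.SGD(list(emb_layer.parameters()) + [w, b], lr=0.01)

feats = torch.randn(n_spk * n_utt, feat_dim)            # stand-in acoustic features
emb = F.normalize(emb_layer(feats), dim=1)              # L2-normalized embeddings
lam = lambda_schedule(epoch=0, total_epochs=100, scheme="decreasing")
loss = ge2e_loss(emb, n_spk, n_utt, w, b) + srip(emb_layer.weight, lam)

opt.zero_grad()
loss.backward()
opt.step()
```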
0:11:40 Here are the results of the LSTM-based systems. The condition labeled "no regularizer" means that we did not add any regularization term during the training stage; this is the baseline. From the results we can see that both regularization terms improve the system performance, and the SRIP regularization outperforms the soft orthogonality regularization as well as the baseline with remarkable performance gains: around 20 percent improvement in EER and similar gains in the minDCF metrics. The decreasing schedule also outperforms the constant schedule for both regularizers.
0:12:37 We also show the DET curves for the baseline and the best LSTM system, trained with SRIP regularization and the decreasing schedule. From this figure we can see that the orthogonality regularization really helps to improve the system performance.
0:12:59 Here are the results of the TDNN-based systems. In this case, the two regularization terms are actually comparable in performance.
0:13:14 The soft orthogonality regularization gives a noticeable improvement in EER and minDCF over the baseline when trained with the decreasing λ schedule, and it is beneficial only with the decreasing schedule.
0:13:38 The best SRIP system is 12 percent better in EER and 18 percent better in minDCF than the baseline, so the SRIP regularization gives consistent improvements in performance across the different λ scheduling schemes.
0:14:04 Here we plot the DET curves for the baseline and the TDNN systems trained with the two regularizers and the decreasing schedule.
0:14:22 To explore the effect of the orthogonality regularization during training, we plot the validation loss curves. Here is an example of the validation loss curves during the training of the LSTM-based systems. Just notice that the actual number of training epochs is different for the different systems; this is because we set the maximum number of training epochs to 100 and stop training if the validation loss does not decrease for six consecutive epochs. From the loss curves we can see that both regularizers accelerate the training process in the early training stage and reach a lower loss at the end of training compared to the baseline.
0:15:18 In general, the SRIP regularization achieves a lower final loss than the soft orthogonality regularization, and this finding is consistent with the system performance, where in general the SRIP regularization is better than the SO regularization. For both regularizers, training with a constant λ leads to more training epochs and also a lower final loss. This is different from the findings on the final system performance, where training with the decreasing schedule always results in better performance.
0:16:05 One possible reason is that in the final training stage the model parameters are more likely close to an optimal point, so keeping a strict orthogonality constraint throughout the training would be too restrictive at that stage. By decreasing the coefficient we loosen the orthogonality constraint, the model parameters have more flexibility in the final stage, and this leads to better system performance.
0:16:40 In conclusion, we introduced two orthogonality regularizers into end-to-end text-independent speaker verification systems. The first one is the soft orthogonality regularization, which requires the Gram matrix of the weights to be close to the identity. The second one is the SRIP regularization, which minimizes the largest singular value of the Gram matrix minus the identity, based on the restricted isometry property.
0:17:20 Two different neural network architectures, LSTM and TDNN, were investigated. We also applied different regularization-coefficient training schedules and investigated their effect on the training process as well as on the evaluation performance. We find that the spectral restricted isometry property regularization performs the best in all the cases and achieves, in the best case, around twenty percent improvement on all the criteria. Both regularizers can be combined into the original training loss and optimized together with little computation overhead. That is all of my presentation; thank you for listening.