Hello everyone, and welcome to my presentation. This is the recorded video for Odyssey 2020. I am from the Hong Kong University of Science and Technology, and in this video I would like to introduce our work on orthogonality regularizations for end-to-end speaker verification.
For speaker verification tasks, hybrid systems have been the dominant solutions for a long time, for example the i-vector based systems or the x-vector based systems. A hybrid system usually consists of multiple modules: the speaker embeddings can be, for example, i-vectors or deep neural network embeddings, and a separate scoring function is commonly built with a PLDA classifier.
In the hybrid systems, each module is optimized with respect to its own target function, and these target functions are usually not the same. Moreover, speaker verification is an open-set problem: the system has to handle unknown speakers in the evaluation stage, so the generalization ability of the system is very important.
Recently, more and more speaker verification systems are trained in an end-to-end manner. An end-to-end system maps a test utterance and the enrollment utterances directly to a single score. This simplifies the training pipeline, and the whole system is optimized in a more consistent manner. It also enables learning a task-specific metric during the training stage.
Various loss functions have been proposed for the end-to-end systems, for example the triplet loss and the generalized end-to-end loss. The core idea of these loss functions is to minimize the distances between embeddings from the same speaker and maximize the distances between embeddings from different speakers.
Most of these loss functions use the cosine similarity, that is, the cosine of the angle between two embedding vectors, as the distance measure between the two embeddings. The major underlying assumption for the effectiveness of the cosine similarity measurement is that the embedding space is orthogonal, which is not guaranteed during training.
In this work, we explore orthogonality regularizations in end-to-end speaker verification systems. Our systems are trained with the generalized end-to-end loss, and we propose two regularizers: the first one is called soft orthogonality (SO) regularization, and the second one is called spectral restricted isometry property (SRIP) regularization.
The two proposed regularizers are evaluated on two different neural network structures: an LSTM-based one and a time-delay neural network (TDNN) based one.
First, I would like to briefly introduce the generalized end-to-end loss used in our end-to-end system. One mini-batch consists of N speakers and M utterances from each speaker, which means we have N x M utterances in total in one mini-batch. x_ij represents the acoustic features computed from utterance j of speaker i. For each input feature x_ij, the network produces a corresponding embedding vector e_ij.
We can then compute the centroid c_i of the embedding vectors from speaker i by averaging its embedding vectors.
Then we define the similarity matrix S_ij,k as the scaled cosine similarity between each embedding vector and all of the centroids: S_ij,k is the similarity between speaker embedding e_ij and speaker centroid c_k, that is, S_ij,k = w * cos(e_ij, c_k) + b, where w and b are trainable parameters. w is constrained to be positive so that the similarity is larger when the cosine similarity is larger.
During training, we want each utterance embedding e_ij to be close to its own speaker's centroid while far away from the other speakers' centroids, so we apply a softmax on S_ij,k over all possible centroids and get the per-embedding loss. The final generalized end-to-end loss is the summation of this loss over all embedding vectors.
In brief, the generalized end-to-end loss pushes each embedding towards the centroid of its own speaker and away from the centroid of the most similar different speaker. A minimal code sketch of this loss follows.
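To make the computation concrete, here is a minimal PyTorch-style sketch of the generalized end-to-end loss as described above. The function name, tensor shapes, and the simplification of using the same centroid for every term are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w, b):
    """Generalized end-to-end (GE2E) loss.

    embeddings: tensor of shape (N, M, D) -- N speakers, M utterances per speaker,
                D-dimensional embedding vectors e_ij.
    w, b:       trainable scalar tensors; w is kept positive so that a larger
                cosine similarity always gives a larger score.
    """
    N, M, D = embeddings.shape
    e = F.normalize(embeddings, dim=-1)              # L2-normalize every e_ij
    centroids = F.normalize(e.mean(dim=1), dim=-1)   # c_k: per-speaker centroids, shape (N, D)
    # Similarity matrix S_{ij,k} = w * cos(e_ij, c_k) + b, shape (N*M, N)
    sims = torch.clamp(w, min=1e-6) * (e.reshape(N * M, D) @ centroids.t()) + b
    # Each e_ij should score highest against its own speaker's centroid c_i, so the
    # softmax loss is -S_{ij,i} + log sum_k exp(S_{ij,k}), summed over all embeddings.
    targets = torch.arange(N, device=embeddings.device).repeat_interleave(M)
    return F.cross_entropy(sims, targets, reduction="sum")
```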
We introduce two regularizers to the end-to-end system. The first one is called soft orthogonality (SO) regularization. Suppose we have a fully-connected layer with a weight matrix W. The soft orthogonality regularization is defined as lambda * ||W^T W - I||_F^2, where lambda is a regularization coefficient and the norm is the Frobenius norm.
This soft orthogonality regularization term requires the Gram matrix of W to be close to the identity. Since the gradient of the soft orthogonality regularization with respect to the weight matrix can be computed in a stable closed form, the regularization term can be directly added to the end-to-end loss and optimized together with it.
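As a rough illustration, the SO penalty on a weight matrix might be implemented as follows; the function name and the lam argument are placeholders for this sketch.

```python
import torch

def soft_orthogonality_penalty(W, lam):
    """Soft orthogonality (SO) regularization: lam * ||W^T W - I||_F^2."""
    gram = W.t() @ W                                           # Gram matrix of the weight matrix
    eye = torch.eye(gram.shape[0], device=W.device, dtype=W.dtype)
    return lam * (gram - eye).pow(2).sum()                     # squared Frobenius norm of (W^T W - I)
```

In training this term is simply added to the end-to-end loss, for example `loss = ge2e_loss(emb, w, b) + soft_orthogonality_penalty(W_emb, lam)`, where `W_emb` stands for whichever weight matrix is being regularized.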
The second one is called the spectral restricted isometry property (SRIP) regularization. The restricted isometry property characterizes matrices that are nearly orthonormal, and this regularization term is derived from it. For a weight matrix W, the SRIP regularization is formulated as lambda * sigma(W^T W - I), where lambda is again a regularization coefficient and sigma denotes the spectral norm, that is, the largest singular value of the matrix.
The SRIP regularization term requires the largest singular value of (W^T W - I) to be small, that is, the Gram matrix to be close to the identity, which is in effect equivalent to requiring all the singular values of W to be close to one. At the same time, computing the SRIP regularization term exactly requires an eigenvalue decomposition, which would result in unstable gradients.
In practice, we use a technique called power iteration to approximate the spectral norm computation. In our experiments, we simply initialize the vector v randomly and repeat the power iteration update two times.
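Here is a minimal sketch, under my own naming, of the SRIP penalty with a two-step power iteration approximation of the spectral norm; the initialization and normalization details are illustrative assumptions.

```python
import torch

def srip_penalty(W, lam, num_iters=2):
    """Spectral RIP (SRIP) regularization: lam * sigma(W^T W - I), where sigma(.)
    is the spectral norm (largest singular value), approximated by power iteration."""
    A = W.t() @ W - torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    v = torch.randn(A.shape[1], 1, device=W.device, dtype=W.dtype)  # random initial vector
    for _ in range(num_iters):                                      # two iterations in our experiments
        v = A @ v
        v = v / (v.norm() + 1e-12)                                   # keep the iterate normalized
    # Since A is symmetric, ||A v|| / ||v|| estimates its largest singular value.
    sigma = (A @ v).norm() / (v.norm() + 1e-12)
    return lam * sigma
```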
Lambda is the regularization coefficient for both regularization terms. The choice of the regularization coefficient plays an important role in the training process as well as in the final system performance, so we investigated two different schedules. The first one keeps the coefficient constant throughout the training stage, set to 0.1. The second schedule starts with lambda equal to 0.2 and then gradually reduces it to zero during the training stage.
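For illustration only, the two schedules could be implemented as below; the linear decay shape and the epoch-based granularity are assumptions, since the talk only states that lambda is gradually reduced to zero.

```python
def regularization_coefficient(epoch, max_epochs, decreasing=False,
                               constant_value=0.1, start_value=0.2):
    """Two lambda schedules: a constant coefficient, or one that starts at
    start_value and is gradually reduced to zero over training."""
    if not decreasing:
        return constant_value
    # Linear decay to zero (the exact decay shape is an assumption for illustration).
    return start_value * max(0.0, 1.0 - epoch / max_epochs)
```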
We explored two different types of neural networks: the first one is an LSTM-based system and the second one is a TDNN-based system. The LSTM system contains three LSTM layers with projection; each LSTM layer has 768 hidden nodes and the projection size is set to 256. After processing the whole input utterance, the output of the last LSTM layer at the last frame is taken as the embedding of the whole utterance. For the TDNN system, we use the same structure as in Kaldi's x-vector model. In both systems, the final embedding vectors are computed as the L2 normalization of the network output. A minimal sketch of the LSTM branch is given below.
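The following PyTorch sketch of the LSTM branch assumes the built-in projected LSTM (`proj_size`); the class name and the input feature dimension are illustrative assumptions, and the TDNN branch, which follows the x-vector recipe, is omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

class LSTMSpeakerEncoder(nn.Module):
    """Three LSTM layers with projection: 768 hidden nodes, projection size 256.
    The utterance embedding is the last-frame output, L2-normalized."""

    def __init__(self, feat_dim=40):                 # feat_dim is an assumption
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=768,
                            num_layers=3, proj_size=256, batch_first=True)

    def forward(self, x):                            # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)                        # out: (batch, frames, 256)
        emb = out[:, -1, :]                          # output at the last frame
        return F.normalize(emb, dim=-1)              # L2-normalized speaker embedding
```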
Our experiments are conducted on the VoxCeleb1 corpus.
Each mini-batch uses 64 speakers and 8 segments per speaker; this setting is constrained by our GPU memory capacity.
The segment lengths are randomly sampled from 140 to 180 frames.
Although the orthogonality regularizations can be applied to all the layers, in this work we only applied the orthogonality constraints on the weight matrix of the speaker embedding layer.
Here are the results of the LSTM-based systems. In the "no regularization" condition we do not add any additional regularization term during the training stage, and this serves as the baseline. From the results we can see that both regularization terms improve the system performance, and the SRIP regularization outperforms the soft orthogonality regularization as well as the baseline with remarkable performance gains: around 20% improvement in EER.
The same trend also holds for the minDCF. The decreasing lambda schedule brings better performance than the constant schedule for both regularizers.
We also show the DET curves for the baseline and the best LSTM system, which is trained with the SRIP regularization and the decreasing schedule. From this figure we can see that the orthogonality regularization really helps to improve the detection performance.
Here are the results of the TDNN-based systems. In this case, the two regularization terms are actually comparable in performance. The soft orthogonality regularization outperforms the baseline in both EER and minDCF, but only when it is trained with the decreasing lambda schedule; in other words, the SO regularization is beneficial only with the decreasing schedule. The best SRIP system is 12% better in EER and more than 10% better in minDCF than the baseline.
So the SRIP regularization gives consistent performance improvements when trained with the different lambda schedules.
In this figure we plot the DET curves for the baseline and the TDNN systems trained with the two regularizers and the decreasing schedule.
To explore the effect of the orthogonality regularizations during training, we plot the validation loss curves. Here is an example of the validation loss curves during the training of the LSTM-based systems. Just notice that the actual number of training epochs is different for the different systems; this is because we set the maximum number of training epochs to 100 and stop training if the validation loss does not decrease for six consecutive epochs.
From the loss curves we can see that both regularizers accelerate the training process in the early training stage and attain a lower validation loss throughout training compared to the baseline. In general, the SRIP regularization achieves a lower validation loss than the soft orthogonality regularization, and this finding is consistent with the system performance, where in general the SRIP regularization is better than the SO regularization.
For both regularizers, training with a constant lambda results in more training epochs and also a lower final loss. However, this is different from the findings on system performance, where, according to the final system results, training with the decreasing schedule always gives better performance.
One possible reason is that in the final training stage, the model parameters are likely already close to a good operating point, so keeping the orthogonality constraint at the same strength throughout training would be over-restrictive at this stage. By decreasing the coefficient, we loosen the orthogonality constraint, and the model parameters have more flexibility in the final stage, thus leading to better system performance.
In conclusion, we introduced two orthogonality regularizers for training end-to-end text-independent speaker verification systems. The first one is the soft orthogonality regularization, which requires the Gram matrix of the weight matrix to be close to the identity. The second one is the SRIP regularization, which minimizes the largest singular value of the Gram matrix's deviation from the identity, based on the restricted isometry property. Two different neural network architectures, the LSTM and the TDNN, were investigated. We also tried different regularization coefficient schedules and investigated their effect on the training process as well as on the evaluation performance. We find that the spectral restricted isometry property regularization performs the best in all the cases, achieving in the best case around 20% improvement on all the criteria. Both regularizers can be combined with the original training loss and optimized together with little computation overhead.
That is all for my presentation. Thank you for listening, and feel free to ask any questions about our work.