0:00:14 | hello everyone welcome to my presentation this is the recorded video for odyssey |
---|
0:00:20 | two thousand and twenty |
---|
0:00:22 | i am from the hong kong university of science and technology |
---|
0:00:27 | in this video i'd like to introduce our work on orthogonality regularizations for end |
---|
0:00:34 | to end speaker verification |
---|
0:00:38 | for the speaker verification tasks |
---|
0:00:40 | hybrid systems have been the dominant solutions for a long time |
---|
0:00:45 | for example the i-vector based systems |
---|
0:00:48 | or the x vector based systems |
---|
0:00:51 | a hybrid system usually consists of multiple modules |
---|
0:00:56 | the speaker embeddings can be obtained from i-vectors or deep neural networks and |
---|
0:01:02 | a separate scoring function is commonly built with a plda classifier |
---|
0:01:09 | in the hybrid systems |
---|
0:01:12 | each module is optimized with respect to its own target function |
---|
0:01:17 | which is usually not consistent |
---|
0:01:20 | moreover speaker verification is an open set problem |
---|
0:01:26 | it needs to handle unknown speakers in the evaluation stage |
---|
0:01:30 | so the generalization ability of the system will be very important |
---|
0:01:35 | recently |
---|
0:01:36 | more and more speaker verification systems are trained in an end to end manner |
---|
0:01:42 | an end to end system will map the test utterance and |
---|
0:01:46 | the enrollment utterances directly to a single score |
---|
0:01:50 | it can simplify the training pipeline |
---|
0:01:53 | and the whole system is optimized in a more consistent manner |
---|
0:01:59 | it also enables learning task specific metrics during the training stage |
---|
0:02:05 | and various loss functions have been proposed for the end to end systems |
---|
0:02:11 | for example the triplet loss |
---|
0:02:13 | and the generalized end to end loss |
---|
0:02:17 | the core idea of the loss functions in the end to end systems is to minimize |
---|
0:02:23 | the distances between embeddings from the same speaker |
---|
0:02:28 | and maximize the distances between embeddings from different speakers |
---|
0:02:33 | among these loss functions most of them will use the cosine similarity |
---|
0:02:38 | that means the cosine of the angle between two embedding vectors is used to |
---|
0:02:43 | be the distance between these two embeddings |
---|
0:02:48 | so the major underlying assumption for the effectiveness of the cosine similarity measurement |
---|
0:02:55 | is that the embedding space is orthogonal |
---|
0:02:58 | which is not guaranteed during the training |
---|
0:03:02 | in this work |
---|
0:03:03 | we aim to explore the orthogonality |
---|
0:03:06 | regularization in the end to end speaker verification systems |
---|
0:03:13 | so |
---|
0:03:14 | in this work our systems are trained with the generalized end to end loss |
---|
0:03:20 | we propose two regularizers |
---|
0:03:23 | the first one is called the soft orthogonality regularization and the second one |
---|
0:03:27 | is called the spectral restricted isometry property |
---|
0:03:31 | regularization |
---|
0:03:34 | and these |
---|
0:03:35 | two proposed regularizers are evaluated on two different neural network structures |
---|
0:03:42 | the lstm based one and the time delay neural network based one |
---|
0:03:49 | so first i'd like to briefly introduce |
---|
0:03:52 | the generalized end to end loss |
---|
0:03:57 | in our end to end system |
---|
0:03:59 | so one mini batch consists of n speakers and m utterances from each speaker |
---|
0:04:06 | that means we will have n times m utterances in total for one mini batch |
---|
0:04:13 | so x i j represents the acoustic features computed from utterance j of speaker i |
---|
0:04:20 | for each input feature x i j the network |
---|
0:04:24 | produces a corresponding embedding vector e i j |
---|
0:04:28 | so we can |
---|
0:04:30 | compute the centroid of the |
---|
0:04:32 | embedding vectors from |
---|
0:04:35 | speaker i |
---|
0:04:36 | by averaging its embedding vectors |
---|
0:04:41 | then we define the similarity matrix s i j k |
---|
0:04:46 | to be the |
---|
0:04:48 | scaled cosine distances between each embedding vector and all of the centroids |
---|
0:04:54 | s i j k means the similarity of |
---|
0:04:59 | speaker embedding e i j to the |
---|
0:05:03 | speaker centroid c k |
---|
0:05:06 | and w and |
---|
0:05:08 | b are trainable parameters |
---|
0:05:11 | and the |
---|
0:05:13 | weight w is constrained to be positive so that the similarity will be |
---|
0:05:18 | larger when the cosine distance is larger |
---|
0:05:23 | during the training |
---|
0:05:25 | we want each utterance |
---|
0:05:26 | embedding |
---|
0:05:27 | e i j to be close to its own speaker |
---|
0:05:31 | centroid while far away from other speakers centroids so we apply a softmax on s |
---|
0:05:38 | i j k |
---|
0:05:39 | for all the possible |
---|
0:05:41 | k |
---|
0:05:42 | and get this |
---|
0:05:43 | loss function |
---|
0:05:47 | and the final generalized end to end loss is the summation of the losses over |
---|
0:05:52 | all embedding vectors |
---|
0:05:55 | in brief |
---|
0:05:57 | the generalized end to end loss pushes embeddings towards the centroid of the true |
---|
0:06:03 | speaker |
---|
0:06:04 | and away from the centroid of the most similar different speaker |
---|
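To make the loss described above concrete, here is a minimal PyTorch sketch of the GE2E similarity matrix and loss; the function name is an assumption, the positivity of w is enforced with a simple clamp, and the leave-one-out centroid refinement from the original GE2E paper is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w, b):
    """embeddings: (N speakers, M utterances, D), assumed L2-normalized; w, b: scalar parameters."""
    N, M, D = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)      # c_k: per-speaker centroids (N, D)
    cos = torch.einsum('nmd,kd->nmk', embeddings, centroids)     # cosine of e_ij to every centroid
    sim = torch.clamp(w, min=1e-6) * cos + b                     # scaled similarities s_ijk
    # the target centroid for e_ij is its own speaker i
    targets = torch.arange(N, device=embeddings.device).unsqueeze(1).expand(N, M)
    # softmax over the N centroids, summed over all N*M embeddings
    return F.cross_entropy(sim.reshape(N * M, N), targets.reshape(N * M))
```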
0:06:12 | and we introduce the two regularizers to the end to end systems |
---|
0:06:16 | the first one is called soft orthogonality regularization |
---|
0:06:22 | suppose |
---|
0:06:23 | we have a fully connected layer |
---|
0:06:26 | and it has a weight matrix w |
---|
0:06:29 | the soft orthogonality regularization is defined in this way |
---|
0:06:34 | and lambda is a regularization |
---|
0:06:37 | coefficient |
---|
0:06:40 | and the norm here refers to the frobenius norm |
---|
0:06:46 | so this soft orthogonality regularization term |
---|
0:06:50 | requires the gram matrix of w to be close to identity |
---|
0:06:57 | and since the gradient of this soft orthogonality regularization term with respect to the |
---|
0:07:02 | weight |
---|
0:07:03 | matrix w can be computed in a |
---|
0:07:06 | stable form |
---|
0:07:07 | so this regularization term can be directly added to the end to end loss |
---|
0:07:13 | and optimized |
---|
0:07:15 | together |
---|
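A minimal sketch of the soft orthogonality (SO) penalty just described, lambda * ||W^T W - I||_F^2, for a fully connected layer's weight matrix; the helper name is an illustrative assumption, and depending on the shape of W the Gram matrix W W^T may be used instead.

```python
import torch

def soft_orthogonality(W, lam):
    """W: weight matrix of a fully connected layer, lam: regularization coefficient."""
    gram = W.t() @ W                                    # Gram matrix W^T W
    eye = torch.eye(gram.shape[0], device=W.device)
    return lam * torch.norm(gram - eye, p='fro') ** 2   # squared Frobenius-norm penalty

# in training, this term would simply be added to the GE2E loss before backpropagation
```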
0:07:18 | the second one is called the spectral restricted isometry property regularization |
---|
0:07:25 | the |
---|
0:07:25 | restricted isometry property characterizes |
---|
0:07:30 | matrices that are nearly |
---|
0:07:32 | orthonormal |
---|
0:07:35 | so this regularization term is derived from the |
---|
0:07:39 | restricted isometry property |
---|
0:07:43 | for a weight matrix w the s r i p regularization |
---|
0:07:47 | is formulated in this way |
---|
0:07:50 | here lambda is also a regularization coefficient |
---|
0:07:54 | and sigma is |
---|
0:07:56 | the spectral norm |
---|
0:07:58 | which equals the largest singular value of the |
---|
0:08:02 | matrix |
---|
0:08:05 | so this s r i p regularization term |
---|
0:08:08 | requires |
---|
0:08:11 | the largest singular value of the gram matrix minus identity |
---|
0:08:15 | to be close to zero |
---|
0:08:17 | which is equivalent to |
---|
0:08:19 | requiring all the singular values of w to be close to |
---|
0:08:23 | one |
---|
0:08:27 | since this |
---|
0:08:28 | s r i p regularization term |
---|
0:08:31 | requires |
---|
0:08:33 | an exact singular value computation |
---|
0:08:36 | it will result in numerically unstable gradients |
---|
0:08:40 | so |
---|
0:08:41 | in practice we use the technique called power iteration |
---|
0:08:46 | to |
---|
0:08:47 | approximate the spectral norm computation process |
---|
0:08:51 | so in our experiments we just randomly initialize the vector v |
---|
0:08:57 | and |
---|
0:08:58 | repeat the above iterative process for two times |
---|
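Here is a minimal sketch of the spectral restricted isometry property (SRIP) penalty, lambda * sigma(W^T W - I), with the spectral norm approximated by the power iteration just described; the function name and the random initialization of v follow the description above, but the details are otherwise illustrative assumptions.

```python
import torch

def srip(W, lam, n_iters=2):
    """W: weight matrix, lam: regularization coefficient."""
    gram = W.t() @ W
    A = gram - torch.eye(gram.shape[0], device=W.device)   # W^T W - I
    v = torch.randn(A.shape[1], 1, device=W.device)        # randomly initialized vector v
    for _ in range(n_iters):                                # power iteration, repeated two times
        v = A @ v
        v = v / (v.norm() + 1e-12)
    sigma = (A @ v).norm()                                  # approximate largest singular value of A
    return lam * sigma
```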
0:09:07 | lambda is the regularization coefficient for both regularization terms |
---|
0:09:12 | the choice of the regularization coefficient |
---|
0:09:15 | plays an important role in the training process as well as final system performance |
---|
0:09:22 | so we investigated two different |
---|
0:09:25 | schedules |
---|
0:09:26 | the first one keeps |
---|
0:09:30 | a constant coefficient throughout the training stage where lambda is |
---|
0:09:36 | fixed to a single value |
---|
0:09:38 | and the second schedule starts with |
---|
0:09:42 | lambda equal to zero point two and then we gradually reduce it to zero |
---|
0:09:47 | during the training stage |
---|
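A minimal sketch of the two coefficient schedules described above; the linear decay shape is an assumption, since only the start value 0.2 and the end value 0 of the decreasing schedule are stated.

```python
def lambda_schedule(epoch, total_epochs, constant=None, start=0.2, end=0.0):
    """Return the regularization coefficient lambda for the current epoch."""
    if constant is not None:                 # schedule 1: keep lambda constant
        return constant
    frac = epoch / max(total_epochs - 1, 1)  # schedule 2: decrease from 0.2 to 0
    return start + (end - start) * frac
```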
0:09:51 | we explored two different types of neural networks |
---|
0:09:55 | the first one is the lstm based system |
---|
0:10:00 | and the second one is the tdnn based system |
---|
0:10:03 | the lstm system contains three lstm layers with projection |
---|
0:10:09 | and each lstm layer has seven hundred sixty eight hidden nodes |
---|
0:10:14 | the projection size is set to two hundred fifty six |
---|
0:10:18 | after processing the whole input utterance the last frame output |
---|
0:10:24 | of the lstm |
---|
0:10:25 | will be used |
---|
0:10:27 | as the representation |
---|
0:10:29 | of the whole utterance |
---|
0:10:32 | and in the tdnn system we use the same structure as |
---|
0:10:36 | in the kaldi x vector |
---|
0:10:39 | model |
---|
0:10:41 | and all the output embedding vectors are |
---|
0:10:44 | computed as the l two normalization of the network output |
---|
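A minimal PyTorch sketch of the LSTM-based embedding network as described: three LSTM layers with 768 hidden nodes, a 256-dimensional projection, the last-frame output taken as the utterance representation, and L2 normalization of the output; the class name and the input feature dimension are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class LstmSpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=768, proj=256, layers=3):
        super().__init__()
        # three-layer LSTM with per-layer projection
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            proj_size=proj, batch_first=True)

    def forward(self, x):
        """x: (batch, frames, feat_dim) acoustic features of a whole utterance."""
        out, _ = self.lstm(x)            # (batch, frames, proj)
        emb = out[:, -1, :]              # last-frame output as the utterance representation
        return F.normalize(emb, dim=-1)  # L2-normalized speaker embedding
```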
0:10:52 | so our experiments are |
---|
0:10:57 | with |
---|
0:10:57 | the voxceleb one corpus |
---|
0:11:02 | and |
---|
0:11:04 | in each mini batch we use sixty four |
---|
0:11:08 | speakers and m segments per speaker |
---|
0:11:11 | which is chosen according to our gpu memory capacity |
---|
0:11:15 | and the segment lengths are randomly sampled from one hundred forty to |
---|
0:11:20 | one hundred eighty frames |
---|
0:11:23 | although the orthogonality regularization can be applied to all the layers |
---|
0:11:29 | in this work we only applied the orthogonality constraints on the weight |
---|
0:11:34 | matrix of the speaker embedding layer |
---|
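A minimal sketch of how the pieces above could fit together in one training step, applying the orthogonality penalty only to the weight matrix of the embedding layer as just described; it reuses the ge2e_loss and srip sketches from earlier, and the model attribute name embedding_weight as well as the choice of srip over soft_orthogonality are illustrative assumptions.

```python
def training_step(model, batch, w, b, lam):
    """batch: (N, M, frames, feat_dim) features for N speakers and M segments each."""
    N, M, T, D = batch.shape
    emb = model(batch.reshape(N * M, T, D)).reshape(N, M, -1)  # speaker embeddings
    loss = ge2e_loss(emb, w, b)                                # GE2E loss from the sketch above
    # orthogonality penalty applied only to the embedding layer's weight matrix
    loss = loss + srip(model.embedding_weight, lam)
    return loss
```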
0:11:40 | here are the results of the lstm based systems |
---|
0:11:44 | in the table |
---|
0:11:46 | no regularizer means |
---|
0:11:49 | we do not add either of the two |
---|
0:11:53 | regularization terms during the training stage and this is the baseline |
---|
0:11:58 | from all the results we can see that both regularization terms improve the system |
---|
0:12:03 | performance |
---|
0:12:06 | and the s r i p regularization |
---|
0:12:09 | outperforms the soft orthogonality regularization |
---|
0:12:13 | as well as the baseline with remarkable performance gains |
---|
0:12:18 | there are around |
---|
0:12:19 | twenty percent improvements in eer |
---|
0:12:23 | and also in the mindcf metrics |
---|
0:12:27 | and the decreasing schedule |
---|
0:12:30 | performs better than the constant schedule for both regularizers |
---|
0:12:37 | we also show the det curves for the baseline and the best lstm |
---|
0:12:43 | system |
---|
0:12:44 | trained with the s r i p regularization and the decreasing schedule |
---|
0:12:48 | in this figure we can see that the |
---|
0:12:51 | orthogonality |
---|
0:12:53 | regularization really helps to improve the system |
---|
0:12:59 | and here are the results of the tdnn based systems |
---|
0:13:04 | in this case |
---|
0:13:06 | the two regularization terms are actually comparable in performance |
---|
0:13:12 | and |
---|
0:13:14 | for the soft orthogonality regularization term |
---|
0:13:18 | it improves over the baseline |
---|
0:13:23 | in eer and mindcf |
---|
0:13:25 | only when trained with the decreasing lambda schedule |
---|
0:13:31 | that is the soft orthogonality regularization |
---|
0:13:34 | is beneficial only when trained with the decreasing schedule |
---|
0:13:38 | the best s r i p system |
---|
0:13:41 | is twelve percent better in eer and |
---|
0:13:45 | eighteen percent better in mindcf |
---|
0:13:50 | so the s r i p regularization gives |
---|
0:13:52 | consistent |
---|
0:13:57 | performance when trained with the different lambda schemes |
---|
0:14:04 | here we plot the det curves for the baseline and the |
---|
0:14:10 | best |
---|
0:14:11 | tdnn systems trained with the two regularizers and the decreasing schedule |
---|
0:14:16 | in this figure |
---|
0:14:22 | and to explore the effect of the orthogonality regularization during training we plot |
---|
0:14:28 | the validation loss curves |
---|
0:14:31 | here is an example |
---|
0:14:32 | of the validation loss curves during the training of the lstm based systems |
---|
0:14:38 | just notice that |
---|
0:14:39 | the actual number of training epochs is different for |
---|
0:14:44 | different systems |
---|
0:14:46 | this is because we set the maximum number of training epochs |
---|
0:14:50 | to one hundred and stop training if the validation loss does not |
---|
0:14:55 | decrease for six consecutive epochs |
---|
0:15:01 | from the loss curves we can see that |
---|
0:15:05 | both regularizers accelerate the training process in the early |
---|
0:15:09 | training stage |
---|
0:15:11 | and attain a smaller loss throughout the training compared to the baseline |
---|
0:15:18 | in general the s r i p regularization achieves a smaller validation loss |
---|
0:15:23 | than |
---|
0:15:23 | the soft orthogonality regularization and this finding is consistent with the system performance |
---|
0:15:31 | where in general the s r i p regularization is better than the s o regularization |
---|
0:15:37 | for both regularizers |
---|
0:15:40 | training with a constant lambda results in more training |
---|
0:15:44 | epochs and also a lower final loss |
---|
0:15:48 | but this is different from the findings |
---|
0:15:51 | in the |
---|
0:15:53 | system performance |
---|
0:15:55 | because according to the final system results |
---|
0:15:59 | training with the decreasing schedule |
---|
0:16:01 | always results in better performance |
---|
0:16:05 | so one possible reason is that |
---|
0:16:07 | in the final training stage the tuning of the model parameters is |
---|
0:16:12 | more likely around an optimal point |
---|
0:16:14 | so keeping a consistent regularization strength |
---|
0:16:18 | throughout |
---|
0:16:18 | the training |
---|
0:16:20 | may be over restrictive |
---|
0:16:23 | at this stage |
---|
0:16:24 | so by decreasing the coefficient we loosen the orthogonality constraint and the model |
---|
0:16:30 | parameters have more flexibility in the final stage |
---|
0:16:34 | thus leading to a better system performance |
---|
0:16:40 | so in conclusion |
---|
0:16:41 | we introduced two orthogonality regularizers for training |
---|
0:16:47 | end to end text independent speaker verification systems |
---|
0:16:52 | the first one is the soft orthogonality regularization |
---|
0:16:56 | it requires the gram matrix |
---|
0:16:59 | to be close to identity |
---|
0:17:01 | and the second one is |
---|
0:17:06 | the s r i p regularization |
---|
0:17:08 | which minimizes the largest |
---|
0:17:10 | singular value of the gram matrix minus identity |
---|
0:17:12 | based on the restricted isometry property |
---|
0:17:20 | two different neural network architectures |
---|
0:17:22 | the lstm and the tdnn |
---|
0:17:24 | were investigated |
---|
0:17:26 | we also tried different regularization coefficient scheduling schemes and investigated their effect |
---|
0:17:34 | on the training dynamics as well as the evaluation performance |
---|
0:17:39 | we find that the spectral restricted isometry property regularization |
---|
0:17:45 | performs the best in all the cases |
---|
0:17:49 | and achieves in the best case around twenty percent improvement on all the criteria |
---|
0:17:57 | both regularizers can be combined into the original training loss and optimized together with |
---|
0:18:05 | little computation overhead |
---|
0:18:09 | and that's all of |
---|
0:18:11 | my presentation thank you for listening and welcome to contact us about our work thank you |
---|