0:00:15 | Hi. In this speech I am going to present our paper "On Autoencoders in the i-Vector Space for Speaker Recognition". |
0:00:36 | Let me start from the motivation and goals of our work. Then I would like to go into the details of our systems; in particular, I will focus on the DAE-based one, and a few words will be said about the back-end and scoring. |
0:01:08 | Well, the next section is dedicated to improvements of the denoising autoencoder, I mean dropout regularization; we tried to apply this technique, and a deep architecture will also be considered in this section. |
0:01:31 | Next, the denoising autoencoder system in the domain mismatch scenario will be presented. |
0:01:38 | And finally, I will conclude my presentation. |
0:01:44 | Okay, let me start with our motivation and goals. Last year we published our work about an implementation of a denoising autoencoder for the speaker verification task. |
0:01:58 | And this system, based on the DAE, showed some improvements compared to the commonly used baseline system, I mean PLDA on the raw i-vectors. |
0:02:15 | Well, and this motivated us to a further, more detailed investigation. |
0:02:24 | And our goals were to study the proposed solution in the i-vector space; to analyse different strategies of initialisation and training of the PLDA back-end parameters; to investigate dropout and to explore a different, deep architecture we offer; and to investigate the DAE-based system in domain mismatch conditions. |
0:02:57 | Now to the dataset and experimental setup we used in our work. As you can see, as training data we used telephone channel recordings from the NIST SRE corpora. |
0:03:15 | For evaluation we used the NIST SRE 2010 protocol, condition five, extended. |
0:03:23 | And our results are presented in terms of equal error rate and minimum detection cost function. |
0:03:33 | And now to our front end, the i-vector extractor. As you can see, we used MFCCs with their first and second derivatives. |
0:03:50 | Our extractor was based on DNN posteriors, with an eleven-frame window. We used two thousand one hundred triphone states, with twenty non-speech states. |
0:04:08 | And instead of using a hard VAD decision, we tried to use a soft one, using the DNN outputs. |
0:04:18 | Well, you can see this formula: we applied cepstral mean and variance normalization in this way, in the statistics space. |
0:04:32 | Well, and you can see that only the triphone states corresponding to the speech states are used to calculate the sufficient statistics. |
0:04:43 | Finally, four-hundred-dimensional i-vectors were extracted for our first experiments. |
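A minimal sketch of this soft-alignment statistics computation, keeping only the speech states as described in the talk; the array shapes and names are assumptions of this illustration, not the authors' code:

```python
import numpy as np

def soft_vad_stats(features, posteriors, speech_mask):
    """Zeroth- and first-order sufficient statistics for i-vector extraction,
    using DNN triphone-state posteriors as a soft alignment.

    features:    (T, D) acoustic frames (MFCCs plus deltas)
    posteriors:  (T, C) DNN posteriors, one column per triphone state
    speech_mask: (C,) boolean, False for the twenty non-speech states
    """
    gamma = posteriors[:, speech_mask]   # non-speech states are excluded
    N = gamma.sum(axis=0)                # zeroth-order stats, shape (C_speech,)
    F = gamma.T @ features               # first-order stats, shape (C_speech, D)
    return N, F
```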
0:04:54 | Well, a few words about the DAE system and the training procedure. |
0:05:03 | For the denoising transform we used RBM generative pre-training with the contrastive divergence algorithm. |
0:05:19 | To train our denoising transform we used the speaker- and session-dependent i-vectors and, as targets, the means of all i-vectors of the same speaker, I mean the speaker mean i-vectors. |
0:05:44 | Well, and we modeled the joint distribution of these i-vectors, and then, after training, we unfolded the RBM and fine-tuned it to obtain a denoising autoencoder. |
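A rough sketch of the unfold-and-fine-tune step described here: the RBM weights are reused as a tied-weights autoencoder whose denoising target is the speaker mean i-vector. The plain-SGD update, shapes, and learning rate are assumptions of this illustration, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_finetune_step(W, b_vis, b_hid, x_sess, x_mean, lr=1e-3):
    """One SGD step of denoising fine-tuning on an unfolded RBM.

    W: (D, H) RBM weights; b_vis: (D,); b_hid: (H,).
    x_sess: (B, D) session i-vectors; x_mean: (B, D) speaker-mean targets.
    """
    h = sigmoid(x_sess @ W + b_hid)     # encoder taken from the RBM
    y = h @ W.T + b_vis                 # decoder = unfolded (tied) transpose
    err = (y - x_mean) / len(x_sess)    # gradient of the mean squared error

    dh = (err @ W) * h * (1.0 - h)      # backprop into the hidden layer
    gW = err.T @ h + x_sess.T @ dh      # tied weights: decoder + encoder terms
    W -= lr * gW
    b_vis -= lr * err.sum(axis=0)
    b_hid -= lr * dh.sum(axis=0)
    return W, b_vis, b_hid
```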
0:06:10 | Well, on the next slide I have a figure to present our systems under consideration. |
0:06:19 | Well, as you can see, we used a conventional PLDA-based system as our baseline, with whitening and length normalisation as pre-processing. |
0:06:34 | The next system is based on the RBM transform, also with whitening and length normalisation as pre-processing. |
0:06:48 | And finally, our next system is the DAE-based one. It is just an autoencoder which is fine-tuned from the RBM; this dashed block depicts the fine-tuning procedure. |
0:07:12 | The arrow above depicts the parameter transmission, or substitution; I will focus on that on my next slides. It just turned out to be very important in our system. |
0:07:29 | Well, for scoring we used the two-covariance model. It can be viewed as a simple case of PLDA, and the score can be expressed in terms of the between-speaker and within-speaker covariance matrices. |
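The exact slide formula is not recoverable from the recording, but in the standard two-covariance model the speaker variable y has prior N(y; mu, B) and an i-vector w is modeled as N(w; y, W), with B and W the between- and within-speaker covariances; the verification score is then the likelihood ratio below, which reduces to a quadratic function of w1 and w2:

```latex
s(w_1, w_2) \;=\; \log
\frac{\displaystyle\int \mathcal{N}(y;\,\mu,\,B)\;
      \mathcal{N}(w_1;\,y,\,W)\;\mathcal{N}(w_2;\,y,\,W)\;dy}
     {\displaystyle\prod_{i=1}^{2}\int \mathcal{N}(y;\,\mu,\,B)\;
      \mathcal{N}(w_i;\,y,\,W)\;dy}
```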
0:07:50 | Well, a few words about the parameter substitution. During our experiments, in our work, we figured out that the best performance of the DAE-based system is obtained when we substitute the whitening and PLDA back-end parameters from the RBM system into the DAE-based system. |
0:08:27 | Well, it is an empirical finding, but it is very important for this system. |
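In code, the substitution might look roughly like this: whitening (and, analogously, the PLDA back-end) is estimated on the RBM projections but applied to the DAE projections. A hedged sketch; `rbm_out` and `dae_out` are hypothetical arrays of projected training i-vectors:

```python
import numpy as np

def estimate_whitening(X):
    """Whitening parameters (mean and ZCA-style transform) from projections X (N, D)."""
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return mu, T

def whiten_and_length_norm(X, mu, T):
    Y = (X - mu) @ T
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Parameter substitution: estimate on RBM outputs, apply to DAE outputs.
# mu, T = estimate_whitening(rbm_out)
# dae_whitened = whiten_and_length_norm(dae_out, mu, T)
```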
0:08:35 | Let me show you our first results with these systems on the NIST SRE 2010 protocol. |
0:08:45 | As you can see, we observed a gain over the baseline system when we applied our DAE-based system with parameter replacement, both for the commonly used NIST SRE 2010 protocol and for our second corpus, called RusTelecom; take a look at the results. |
0:09:14 | And some information about the RusTelecom corpus is provided on the slide. |
0:09:25 | Well, for the analysis of the DAE-based system we decided to use a cluster variability criterion; it is also called the Fukunaga criterion. It is based on the within-speaker and between-speaker covariance matrices. |
0:09:50 | And if you take a look at this figure, you can see that the autoencoder-based projections have stronger cluster variability. |
0:10:09 | Note that in this case we did not apply normalization to our RBM and DAE-based projections; by normalization I mean that no whitening was applied to the RBM and DAE outputs. |
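One common form of such a separability measure (following Fukunaga) is J = tr(Sw^-1 Sb); whether the slide uses exactly this form is an assumption. A small sketch:

```python
import numpy as np

def cluster_variability(X, labels):
    """Class separability J = tr(Sw^-1 Sb) over speaker clusters.

    X: (N, D) projected i-vectors; labels: (N,) speaker ids.
    """
    mu = X.mean(axis=0)
    D = X.shape[1]
    Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        ms = Xs.mean(axis=0)
        Sw += (Xs - ms).T @ (Xs - ms)                # within-speaker scatter
        Sb += len(Xs) * np.outer(ms - mu, ms - mu)   # between-speaker scatter
    return np.trace(np.linalg.solve(Sw, Sb))
```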
0:10:31 | Well, additionally we decided to use cosine scoring as an independent way to assess the properties of our projections. |
0:10:49 | You can see from these results that the DAE-based system achieves the best performance among all the systems. |
0:11:01 | By the way, we also tried to use a simple autoencoder, to try it in speaker recognition, but it turned out to be not as good as the DAE-based system. |
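Cosine scoring here is just the usual similarity between two projected i-vectors:

```latex
s_{\cos}(w_1, w_2) \;=\; \frac{w_1^{\top} w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}
```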
0:11:20 | Well, and now to the whitening and length normalization. When we applied these parameters to the RBM- and DAE-based projections, we obtained those results, and we can see that the lines are very similar and close to each other. |
0:11:46 | In this situation, where we applied the DAE-based whitening, the one from the DAE-based system, it turned out to be not so good for this system. |
0:12:04 | And now, on the next slide, we applied the parameter substitution: we decided to use the whitening parameters from our RBM system, and in this situation we achieved good performance of the system; as you can see, it wins over the baseline. |
0:12:30 | In the figure you also can see that the discriminative property in this case is stronger for the DAE-based projections. |
0:12:48 | To summarize it all together, I prepared a table with all the common results, and among the systems, the DAE-based system with the parameter substitution, I mean whitening, achieves the best performance. |
0:13:15 | Well, and now to the PLDA-based scoring. In this table you can see the results we obtained across different experiments, in different configurations of our system. |
0:13:32 | And again, in the last line of the table, you can see that a good improvement can be achieved by using the parameter substitution from the RBM system. |
0:13:46 | But the question of why this happens is still open for us; we did not manage to answer it. |
0:13:57 | Well, now we will discuss some improvements of the DAE-based system. First, we decided to apply dropout regularisation, both for our RBM training and for the fine-tuning. |
0:14:18 | Well, as you can see, dropout helps to improve the system when we used it in the RBM training stage: the RBM training improved. |
0:14:33 | But unfortunately, applying dropout at the stage of discriminative fine-tuning was not helpful for us. |
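A minimal sketch of inverted dropout as it could be applied to the RBM's hidden activations during contrastive-divergence training; the keep probability is an assumed value, not one reported in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.5, train=True):
    """Inverted dropout on hidden activations h; identity at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep   # rescale so the expected activation is unchanged
```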
0:14:42 | Well, for the deep architecture we tried to use two schemes. You can see the first one on the slide; it is called stacked DAEs. |
0:14:59 | After training the first DAE, its output can be used as an input for the next DAE, and then we try to fine-tune them all together, jointly. |
0:15:14 | Well, but it did not help to improve the system. |
0:15:21 | The second stacking scheme managed to obtain good results, but in this scenario we needed to substitute the whitening parameters again, from the RBM system, the generative pretrained system, and we got a little bit of improvement from that. |
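A sketch of the stacked scheme with whitening and length normalization injected between layers, as described in the Q&A at the end of the talk; the function names and per-layer parameter lists are assumptions of this illustration:

```python
import numpy as np

def stacked_dae_forward(daes, whitenings, x):
    """Pass i-vectors x (B, D) through a stack of trained DAEs,
    re-whitening and length-normalizing between the layers.

    daes:       list of callables, each a trained denoising autoencoder
    whitenings: list of (mu, T) pairs, one per layer
    """
    for dae, (mu, T) in zip(daes, whitenings):
        x = dae(x)
        x = (x - mu) @ T                                   # whitening
        x = x / np.linalg.norm(x, axis=1, keepdims=True)   # length norm
    return x
```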
0:15:52 | And the next question I would like to focus on is the domain mismatch. We investigated our DAE-based system in the domain mismatch conditions. |
0:16:10 | Well, we used the Domain Adaptation Challenge dataset and setup. As back-ends we used cosine scoring; the two-covariance model, referred to as 2cov; PLDA; and simplified PLDA with a four-hundred-dimensional speaker subspace, referred to as SPLDA. |
0:16:33 | It should be noted that in our experiments we completely ignored the labels of the in-domain data we used; we used it only to estimate the whitening parameters of the systems. |
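Concretely, the unsupervised use of the in-domain data might look like the following, reusing the whitening helpers sketched earlier; `in_domain_ivectors`, `enroll`, `test`, and the `plda_ood.score` call are all hypothetical names, not a real library API:

```python
# Unlabeled in-domain i-vectors are used only to estimate whitening;
# the back-end (PLDA / 2cov / SPLDA) itself is trained on out-of-domain data.
mu_in, T_in = estimate_whitening(in_domain_ivectors)   # no labels needed
enroll_w = whiten_and_length_norm(enroll, mu_in, T_in)
test_w = whiten_and_length_norm(test, mu_in, T_in)
scores = plda_ood.score(enroll_w, test_w)              # hypothetical back-end API
```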
0:16:49 | Well, and now to the results. You can see, for the baseline system, when we use in-domain data for training, we obtain both results, for cosine scoring, and you can see the gain in applying the DAE-based system, for both cosine and PLDA scoring. |
0:17:17 | But when we used out-of-domain data to train our systems, you can see the degradation, for both cosine and PLDA scoring. |
0:17:37 | And finally, in this table you can see the improvement when we used the whitening parameters from the in-domain data. |
0:17:51 | These are the same results, but for the simplified PLDA scoring; well, it is just a little bit better than PLDA. |
0:18:04 | And now to conclude. We presented a study of the denoising autoencoder in the i-vector space. We figured out that the best performance of the DAE-based system is achieved by employing back-end parameters obtained directly from the RBM output. The question is still open why the RBM transform provides better back-end parameters for this task. |
0:18:39 | Well, dropout helps to improve the results, but only when applied to our RBM training stage; it did not help when we implemented it in fine-tuning. |
0:18:54 | A deep architecture in the form of stacked denoising autoencoders provides a further improvement. |
0:19:03 | Well, and all our findings regarding the speaker verification system in matched conditions hold true in the mismatched condition case. |
0:19:16 | And the last one: using whitening parameters from the target domain, along with the DAE trained on the out-of-domain set, helps to avoid the significant performance gap caused by domain mismatch. That's it. |
0:19:43 | (Session chair) Time for questions. |
0:19:51 | Michael? |
0:19:57 | (Audience) On the slide where you showed the stacked denoising autoencoders: can you stack more than two layers? |
0:20:09 | Yes, but in this case we need to inject whitening and length normalisation between the layers. This one has five layers, I mean, with the whitening and length normalization injection. |
0:20:31 | (Audience) I mean, when you use two stacked denoising autoencoders you improve the results; so did you try a third one? What happens with more than two stacked on each other? |
0:20:58 | I see. Well, we decided not to go deeper in this, because we found out that this result is very similar to, you know, our first one, based on only one layer. |
0:21:27 | (Audience) We have probably discussed this issue already, about your question of why copying the PLDA and length normalization variables from the RBM, rather than from the final DAE stage, gives better performance. |
0:21:49 | The reason could be, maybe, some overfitting when you do the backpropagation, since you are using the same set: maybe, let's say, the residual matrix you are using in PLDA becomes artificially small in terms of its trace. So you could check the traces of the two matrices, the one you estimate from the RBM and the one you estimate after fine-tuning, to see whether the covariance matrices become suspiciously small; that might be a result of overfitting. |
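The questioner's suggested diagnostic could be as simple as comparing traces; `W_rbm` and `W_dae` stand for the within-speaker (residual) covariances estimated on the RBM and fine-tuned DAE projections respectively, hypothetical arrays here:

```python
import numpy as np

# A suspiciously small residual covariance after fine-tuning on the same
# data would support the overfitting explanation discussed here.
shrinkage = np.trace(W_dae) / np.trace(W_rbm)
print(f"residual covariance trace ratio (DAE / RBM): {shrinkage:.3f}")
```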
0:22:26 | Well, this was also our assumption, and we tried to check it. We calculated it after the submission; our paper was already submitted. But we figured out that it was not the reason, because we tried to split our dataset into two parts and use separate data to train the DAE-based system and the back-end parameters, and the results showed that it is not overfitting occurring while we train the system on the same data. |
0:23:12 | Well, we also tried to explain the situation by a Gaussianity assumption, I mean that after the DAE projection we can obtain a more or less Gaussian distribution with less dispersion, and that this could be the reason in this case, it seemed to us; but that also is not the answer. |
0:23:54 | (Session chair) There is time for another question. |
0:24:03 | (Audience) Just a question on the first step of your system, I think it was the front-end part: you say that you are using twenty non-speech states. I am quite amazed by this huge number; could you say something about that? |
0:24:20 | You mean the huge number of non-speech states? Well, we use a standard Kaldi recipe from our speech recognition department, and they advised us to use this configuration of the system; we trained our DNN in this way. It provides good voice activity detection for our system. |
0:24:57 | And, as I mentioned, we also used these capabilities to apply a soft VAD decision in the statistics space. I mean, we have done cepstral mean shift normalization in the statistics space by excluding non-speech; non-speech is dropped from our consideration. |
0:25:34 | (Session chair) Let's thank the speaker again. Thank you. |