0:00:15 | Hi everybody. In this talk I'm going to present an HMM-based method |
---|
0:00:23 | for |
---|
0:00:25 | aligning frames to the states and also extracting sufficient statistics in text-dependent speaker |
---|
0:00:32 | verification, |
---|
0:00:33 | and also the use of deep neural networks for improving the performance of |
---|
0:00:39 | text-dependent speaker verification. |
---|
0:00:43 | Text-dependent speaker verification is the |
---|
0:00:47 | task of verifying both the speaker and the phrase, and the phrase information |
---|
0:00:54 | can be used for improving the performance. |
---|
0:01:00 | We proposed phrase-dependent HMM models for aligning frames |
---|
0:01:06 | to the states and also to the Gaussian components. |
---|
0:01:10 | By using an HMM we can use the phrase information, and we can also |
---|
0:01:17 | take into account the frame order. |
---|
0:01:22 | Using the HMM also reduces the |
---|
0:01:26 | uncertainty in the i-vector estimation. If we |
---|
0:01:32 | interpret the average trace of the i-vector posterior covariance |
---|
0:01:37 | as the uncertainty, |
---|
0:01:39 | this method reduces the uncertainty by about twenty percent |
---|
0:01:45 | compared to the GMM. |
---|
0:01:48 | In addition to that, we tried using deep neural networks for reducing the gap |
---|
0:01:54 | between the GMM and HMM alignments |
---|
0:01:59 | and also for improving the performance of the HMM-based method. |
---|
0:02:06 | Let me first describe the general i-vector based system. |
---|
0:02:10 | In the i-vector system we model the utterance-dependent GMM |
---|
0:02:15 | supervector with |
---|
0:02:18 | this equation. |
---|
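The equation referenced here is on the slide and not quoted in the transcript; the standard i-vector model it points to is usually written as below, where m is the UBM mean supervector, T the total variability matrix, and w the i-vector (an assumed form; the slide's exact notation may differ).

```latex
% Standard i-vector model (assumed form)
s = m + T\,w, \qquad w \sim \mathcal{N}(0, I)
```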
0:02:20 | In the i-vector system we need the zero- and first-order statistics for training and |
---|
0:02:25 | extracting i-vectors. |
---|
0:02:27 | You can see them in these equations. |
---|
0:02:30 | In these equations, gamma shows the posterior probability of one frame |
---|
0:02:37 | being |
---|
0:02:38 | generated by one specific Gaussian component. |
---|
0:02:43 | The way we compute the |
---|
0:02:47 | gamma |
---|
0:02:48 | posteriors is the main difference between the GMM-UBM |
---|
0:02:52 | and our HMM-based method. |
---|
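The statistics equations themselves are only on the slide; as a rough sketch of what is described here (my own naming, not the authors' code), the zero- and first-order statistics can be computed from the frame posteriors like this:

```python
import numpy as np

def sufficient_stats(features, posteriors):
    """Zero- and first-order statistics for i-vector training/extraction.

    features:   (T, D) acoustic frames of one utterance.
    posteriors: (T, C) gamma[t, c], the posterior of frame t being generated
                by component c (from a GMM-UBM, HMM states, or a DNN).
    """
    N = posteriors.sum(axis=0)   # zero-order:  N_c = sum_t gamma_c(t)
    F = posteriors.T @ features  # first-order: F_c = sum_t gamma_c(t) * x_t
    return N, F
```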
0:02:58 | When you want to use an HMM as the UBM in text-dependent speaker verification, |
---|
0:03:04 | you have several choices. |
---|
0:03:08 | The first one is using phrase-dependent HMM models; in this |
---|
0:03:13 | case you have to train an i-vector extractor for each phrase. |
---|
0:03:19 | This is suitable for common passphrases and also for text-prompted speaker verification, |
---|
0:03:25 | but we need sufficient training data for each phrase, and so this |
---|
0:03:31 | is not practical for |
---|
0:03:33 | real applications of text-dependent speaker verification. |
---|
0:03:38 | The other choices are tied-mixture |
---|
0:03:41 | HMMs |
---|
0:03:42 | and, the last method, phrase-independent monophone models. |
---|
0:03:47 | In this method we use a monophone structure, the same as in speech |
---|
0:03:52 | recognition, |
---|
0:03:54 | and |
---|
0:03:56 | train phrase models by concatenating the |
---|
0:04:01 | corresponding phone models, |
---|
0:04:04 | then extract sufficient statistics for each phrase and convert them into the |
---|
0:04:08 | same shape for all phrases, so that we can train one i-vector extractor for all |
---|
0:04:13 | phrases. |
---|
0:04:14 | In this method we do not need a large amount of training |
---|
0:04:19 | data for each phrase, and the HMMs can be trained |
---|
0:04:22 | using any transcribed data. |
---|
0:04:27 | In this method, |
---|
0:04:30 | the first stage is training a phone recognizer and constructing |
---|
0:04:37 | a left-to-right model for each phrase, |
---|
0:04:40 | and then doing Viterbi forced alignment to align the frames to the states, |
---|
0:04:46 | and then, |
---|
0:04:49 | in |
---|
0:04:50 | each state, |
---|
0:04:51 | extracting sufficient statistics the same as for a |
---|
0:04:55 | simple GMM. |
---|
0:04:59 | For each phrase these statistics have a different shape, and you have |
---|
0:05:05 | to |
---|
0:05:07 | convert them to a unique shape to be able to train one i-vector extractor |
---|
0:05:13 | for all of the phrases. |
---|
0:05:15 | In the |
---|
0:05:17 | bottom of this |
---|
0:05:19 | figure you can see the |
---|
0:05:22 | phrase-specific zero- and first-order statistics. |
---|
0:05:28 | To convert the |
---|
0:05:30 | phrase-specific statistics to the final uniform shape, |
---|
0:05:33 | we just sum |
---|
0:05:35 | the parts of |
---|
0:05:37 | the statistics |
---|
0:05:40 | that are associated with the |
---|
0:05:43 | same state of the same phoneme. |
---|
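A minimal sketch of this pooling step, under my own assumptions about how the states are labeled (phone_state_ids and n_pooled are hypothetical names, not from the talk):

```python
import numpy as np

def pool_stats_by_phone_state(N_states, F_states, phone_state_ids, n_pooled):
    """Convert phrase-specific per-state statistics to one fixed shape.

    N_states:        (S,)   zero-order stats per HMM state of this phrase.
    F_states:        (S, D) first-order stats per HMM state.
    phone_state_ids: (S,)   the (phoneme, state) slot each HMM state maps to.
    n_pooled:        number of (phoneme, state) slots shared by all phrases.

    Statistics of states belonging to the same state of the same phoneme are
    simply summed, so every phrase ends up with the same-shaped statistics.
    """
    N = np.zeros(n_pooled)
    F = np.zeros((n_pooled, F_states.shape[1]))
    np.add.at(N, phone_state_ids, N_states)
    np.add.at(F, phone_state_ids, F_states)
    return N, F
```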
0:05:46 | After that, we |
---|
0:05:49 | train an i-vector extractor exactly as in text-independent speaker |
---|
0:05:54 | verification. |
---|
0:05:58 | For channel compensation and scoring, text-dependent speaker verification has a problem with PLDA: |
---|
0:06:07 | it has been shown that the performance of PLDA is not so good, and sometimes |
---|
0:06:12 | the performance of the baseline GMM system is better than PLDA. |
---|
0:06:19 | Also, because in text-dependent speaker verification the training data is really limited, |
---|
0:06:24 | in terms of |
---|
0:06:26 | the number of speakers and also the number of |
---|
0:06:30 | samples per phrase, |
---|
0:06:33 | we cannot use a simple LDA or |
---|
0:06:35 | WCCN, and so we propose using a regularized WCCN |
---|
0:06:42 | for |
---|
0:06:44 | reducing the effect of the small sample size. |
---|
0:06:49 | In the regularized WCCN we just add some |
---|
0:06:58 | regularization to the |
---|
0:07:01 | covariance matrix of each class, and everything else is |
---|
0:07:06 | exactly the same as the simple WCCN. |
---|
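A sketch of one common way to regularize WCCN, smoothing the within-class covariance toward the identity; the exact regularization recipe used in the talk may differ, and alpha is an illustrative parameter:

```python
import numpy as np

def regularized_wccn(ivectors, labels, alpha=0.1):
    """Regularized within-class covariance normalization (illustrative).

    ivectors: (N, D) length-normalized i-vectors.
    labels:   (N,)   class labels (e.g. speaker-plus-phrase).
    alpha:    smoothing weight toward the identity; one simple choice of
              regularization, not necessarily the one used in the talk.
    Returns B such that i-vectors are transformed as B @ w before scoring.
    """
    dim = ivectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((dim, dim))
    for c in classes:
        x = ivectors[labels == c]
        x = x - x.mean(axis=0)
        W += x.T @ x / max(len(x) - 1, 1)
    W /= len(classes)
    W = (1.0 - alpha) * W + alpha * np.eye(dim)  # the regularization step
    B = np.linalg.cholesky(np.linalg.inv(W)).T   # so that B @ W @ B.T = I
    return B
```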
0:07:11 | Also, in |
---|
0:07:15 | text-dependent speaker verification, because the duration is very |
---|
0:07:19 | short, you have to |
---|
0:07:22 | use a phrase-dependent transform and also |
---|
0:07:27 | phrase-dependent score normalization, |
---|
0:07:30 | especially when using the |
---|
0:07:32 | HMM for aligning the frames. |
---|
0:07:35 | We use cosine similarity for scoring and s-norm for score normalization. |
---|
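A short sketch of cosine scoring with symmetric score normalization (s-norm) as commonly defined; in the phrase-dependent setup described here the cohort would be restricted to the same phrase (my assumption):

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between two (already transformed) i-vectors."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

def s_norm(raw, enroll, test, cohort):
    """Symmetric score normalization using a cohort of impostor i-vectors."""
    e = np.array([cosine_score(enroll, c) for c in cohort])
    t = np.array([cosine_score(test, c) for c in cohort])
    return 0.5 * ((raw - e.mean()) / e.std() + (raw - t.mean()) / t.std())
```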
0:07:43 | For reducing the gap between the |
---|
0:07:46 | HMM and GMM alignments, we can use |
---|
0:07:50 | a DNN |
---|
0:07:51 | in two scenarios. The first one is using the DNN for calculating |
---|
0:07:57 | the posterior probabilities, which is exactly the same as what is done in |
---|
0:08:04 | text-independent speaker verification. |
---|
0:08:06 | The other choice is using |
---|
0:08:10 | the DNN for extracting bottleneck features |
---|
0:08:13 | to improve the GMM alignment. |
---|
0:08:18 | In this case, a better phoneme-related clustering is obtained and the |
---|
0:08:23 | performance of the GMM alignment is improved. |
---|
0:08:28 | For the bottleneck |
---|
0:08:31 | network we use stacked bottleneck features. |
---|
0:08:35 | In this topology |
---|
0:08:38 | there are two |
---|
0:08:41 | bottleneck networks |
---|
0:08:44 | that are connected to each other: |
---|
0:08:48 | the bottleneck layer output of the first stage constructs the input of the second stage, |
---|
0:08:55 | and we use the output of the |
---|
0:08:58 | bottleneck layer of the second stage as |
---|
0:09:01 | the final features. |
---|
0:09:04 | We used two different |
---|
0:09:06 | networks: one is used only for extracting bottleneck features and has about eight thousand |
---|
0:09:14 | senones, and another one is used for both |
---|
0:09:18 | extracting bottleneck features and calculating the posterior probabilities; |
---|
0:09:23 | it has |
---|
0:09:25 | about one thousand senones. |
---|
0:09:28 | For the input features we used log Mel-scale filter bank outputs |
---|
0:09:34 | and also three pitch features. |
---|
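A rough illustration of the "stacking" between the two networks; the context offsets below are my own illustrative choice, not necessarily the configuration used in this system:

```python
import numpy as np

def stack_with_context(bn_first_stage, offsets=(-10, -5, 0, 5, 10)):
    """Build the second-stage input of a stacked-bottleneck network.

    bn_first_stage: (T, B) bottleneck-layer outputs of the first network.
    offsets:        frame offsets whose outputs are concatenated (illustrative).

    The second network is trained on the returned (T, B * len(offsets))
    features, and its own bottleneck layer provides the final features.
    """
    T, _ = bn_first_stage.shape
    pad = max(abs(o) for o in offsets)
    padded = np.pad(bn_first_stage, ((pad, pad), (0, 0)), mode="edge")
    stacked = [padded[pad + o: pad + o + T] for o in offsets]
    return np.concatenate(stacked, axis=1)
```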
0:09:40 | For the experiments we used part one of the RSR2015 dataset. |
---|
0:09:46 | In the RSR2015 dataset there are three hundred speakers: |
---|
0:09:50 | one hundred and |
---|
0:09:52 | fifty- |
---|
0:09:53 | seven males and one hundred forty-three females, each of whom pronounces |
---|
0:09:58 | thirty |
---|
0:10:01 | different phrases from TIMIT in nine distinct sessions. Three sessions are used |
---|
0:10:08 | for enrollment, by averaging the i-vectors, and the others for testing. |
---|
0:10:12 | We just use the background set for |
---|
0:10:15 | training, and the results are reported on the evaluation set. |
---|
0:10:20 | For training the DNNs we used the Switchboard dataset. |
---|
0:10:25 | As features we used different acoustic features: |
---|
0:10:29 | thirty-nine dimensional PLP features and also |
---|
0:10:34 | MFCC features, both of them extracted from 16 kHz audio, |
---|
0:10:40 | and two versions of the bottleneck features, extracted from 8 kHz data. |
---|
0:10:48 | For |
---|
0:10:50 | VAD we used a supervised |
---|
0:10:53 | silence model, |
---|
0:10:55 | dropping |
---|
0:10:57 | only the initial and final silence, using the original transcriptions. |
---|
0:11:04 | After that we applied |
---|
0:11:05 | cepstral mean and variance normalization. |
---|
0:11:09 | We used four-hundred-dimensional i-vectors that are length-normalized before the regularized WCCN, |
---|
0:11:17 | and, as I said, we used phrase-dependent regularized WCCN and s-norm with cosine |
---|
0:11:22 | distance for scoring. |
---|
0:11:27 | In this table you can see the comparison results between different features and |
---|
0:11:33 | also alignment methods. |
---|
0:11:35 | In the first section of this table you can compare the performance of |
---|
0:11:41 | the GMM and HMM aligners, and you can see that the HMM significantly |
---|
0:11:46 | improves the performance. |
---|
0:11:51 | Comparing the DNN alignment with the HMM, you can see that the DNN |
---|
0:11:56 | alignment also improves the performance; |
---|
0:12:00 | especially for females the performance is better than the HMM alignment |
---|
0:12:05 | when we use the cepstral features. |
---|
0:12:09 | When we use bottleneck features, |
---|
0:12:12 | the performance of the GMM alignment is |
---|
0:12:16 | improved, |
---|
0:12:17 | and you can compare these two numbers and also the others. |
---|
0:12:23 | For the HMM-based alignment, for females the performance is better, while for males |
---|
0:12:29 | we got some deterioration in performance. |
---|
0:12:34 | For the DNN alignment, when we use the bottleneck features with the DNN |
---|
0:12:38 | alignment, you can see |
---|
0:12:41 | that |
---|
0:12:44 | we get |
---|
0:12:47 | some |
---|
0:12:48 | deterioration in performance |
---|
0:12:50 | when we use both of them. |
---|
0:12:52 | In the last section you can see the results of the bottleneck features |
---|
0:12:59 | concatenated |
---|
0:13:01 | with the MFCC features. |
---|
0:13:04 | In this case we got the best results. |
---|
0:13:07 | For both the HMM and the GMM cases, you can see that when we |
---|
0:13:12 | use these features the performance of the GMM |
---|
0:13:16 | is very close to the |
---|
0:13:17 | HMM one, but again for the DNN the performance is not so good. |
---|
0:13:24 | Because the performance of the HMM alignment is better than the others, we just report the |
---|
0:13:29 | results for this method in the next table. |
---|
0:13:34 | In this table, in the first |
---|
0:13:36 | section, we compare the |
---|
0:13:39 | performance of the different features: |
---|
0:13:43 | MFCC, PLP, and two bottleneck features, one of them extracted from |
---|
0:13:50 | a smaller network. |
---|
0:13:52 | You can see that MFCC and PLP perform |
---|
0:13:56 | roughly the same, |
---|
0:13:58 | and the bottleneck features are worse for males but better for females. |
---|
0:14:06 | When we reduce the size of the network, |
---|
0:14:09 | the performance of the bottleneck features is reduced, as you can see. |
---|
0:14:16 | For both PLP and MFCC, when |
---|
0:14:21 | concatenated with the bottleneck features, we get a |
---|
0:14:24 | big improvement. |
---|
0:14:28 | In the last section of this table you can see the results of |
---|
0:14:32 | fusion in the score domain. |
---|
0:14:36 | Compare it with the second section, which is fusion in the feature domain: |
---|
0:14:43 | in this case you can see that in almost all cases the performance of |
---|
0:14:49 | score-domain fusion is better than feature-domain fusion |
---|
0:14:54 | in text-dependent speaker verification, whereas in text- |
---|
0:14:58 | independent speaker verification |
---|
0:14:59 | the performance of |
---|
0:15:02 | feature concatenation is usually better than |
---|
0:15:05 | fusing the |
---|
0:15:08 | scores of the two features. |
---|
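To make the two fusion strategies being compared concrete, a minimal sketch (the weight and the shapes are assumptions for illustration, not the system's actual configuration):

```python
import numpy as np

def score_fusion(scores_a, scores_b, w=0.5):
    """Score-domain fusion: weighted sum of two systems' per-trial scores.

    The equal weight is only an assumption; weights are normally tuned
    (e.g. by logistic regression) on a development set.
    """
    return w * np.asarray(scores_a) + (1.0 - w) * np.asarray(scores_b)

def feature_fusion(feats_a, feats_b):
    """Feature-domain fusion: concatenate frame-level features
    (e.g. MFCC + bottleneck) before training the front-end and i-vector extractor."""
    return np.concatenate([feats_a, feats_b], axis=1)
```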
0:15:11 | And here |
---|
0:15:14 | the problem is the training data: the training data is very limited, and for higher |
---|
0:15:21 | feature dimensions we actually need more training data. |
---|
0:15:26 | You can see that when |
---|
0:15:29 | fusing the bottleneck features with the PLP and MFCC we get a |
---|
0:15:34 | big improvement, |
---|
0:15:37 | and the best result comes from fusing the |
---|
0:15:41 | scores of three different |
---|
0:15:44 | features. |
---|
0:15:48 | In conclusion, we showed that |
---|
0:15:51 | we can also |
---|
0:15:52 | get very good results with i-vectors in text-dependent speaker verification. |
---|
0:15:58 | We verified that in text-dependent speaker verification |
---|
0:16:02 | the performance of the DNN alignment |
---|
0:16:05 | is good |
---|
0:16:06 | and in some cases gives similar or better results than |
---|
0:16:11 | the HMM alignment. |
---|
0:16:14 | We also got |
---|
0:16:16 | excellent results using bottleneck features in text-dependent speaker verification, |
---|
0:16:23 | especially when concatenated with the other cepstral features. |
---|
0:16:29 | In text-dependent speaker recognition, |
---|
0:16:33 | score-domain fusion is better than |
---|
0:16:38 | feature-level fusion, |
---|
0:16:41 | and we got the best results from fusing three different features. |
---|
0:16:48 | Another point is that in text-dependent speaker verification you have to |
---|
0:16:54 | use phrase-dependent transforms and score normalization, because |
---|
0:16:59 | the |
---|
0:17:01 | duration is very short, and when you use the |
---|
0:17:05 | HMM for aligning frames to the states |
---|
0:17:08 | you cannot use phrase-independent transforms |
---|
0:17:12 | and |
---|
0:17:13 | score normalization. |
---|
0:17:21 | questions |
---|
0:17:36 | Okay, maybe a quick question. This i-vector work, did you |
---|
0:17:41 | try this on the RedDots data? |
---|
0:17:43 | Yes, these are the |
---|
0:17:47 | results from our RedDots experiments. |
---|
0:17:51 | You can see the results that we submitted to Interspeech; |
---|
0:17:58 | you can see a comparison between GMM-UBM, GMM i-vector, and HMM i-vector in |
---|
0:18:04 | three different non-target trial types. |
---|
0:18:08 | You can see that, especially for the target-wrong trials, where the phrase |
---|
0:18:14 | is important for us |
---|
0:18:17 | and the content is important, the performance of the |
---|
0:18:23 | HMM alignment |
---|
0:18:24 | is much better than the other two methods. |
---|
0:18:28 | Also for the impostor-correct case the performance of the HMM i-vector is |
---|
0:18:35 | better too. |
---|
0:18:40 | Time for one more question? |
---|
0:18:45 | Just a quick question about the fusion of the GMM systems: so the DNN systems were |
---|
0:18:52 | working, in contrast to the HMMs, by using CD units. Did you try using CD units, or were they only |
---|
0:18:58 | monophones? |
---|
0:19:00 | No, I didn't try that. |
---|