0:00:13 | a layer one point eight we really |
---|
0:00:17 | and i'm presenting learning mixture representation for deep speaker embedding using attention |
---|
0:00:23 | and the other two already optimum a mac and there will be |
---|
0:00:27 | no drama hong kong or it can be robustly income working s one the information |
---|
0:00:31 | engineer |
---|
0:00:36 | so to review the contributions: first, we show that the model, |
---|
0:00:41 | specifically a 121-layer DenseNet, will produce more discriminative speaker embeddings |
---|
0:00:48 | and the showroom show model as vector |
---|
0:00:51 | with the referee as an amount of time |
---|
0:00:54 | secondly, we show that mixture representation pooling will improve |
---|
0:00:58 | speaker embedding performance on noisy datasets |
---|
0:01:03 | so okay so |
---|
0:01:05 | okay well as five and that's well |
---|
0:01:07 | as a baseline: the network takes speech features, such as MFCC or filter-bank |
---|
0:01:11 | features, |
---|
0:01:12 | and then passes them through several layers of convolution |
---|
0:01:16 | and then we because is a real and then us |
---|
0:01:20 | since speech is variable-length in nature, |
---|
0:01:22 | we need to convert it into a single vector, and we do that through |
---|
0:01:26 | statistics pooling: specifically, we compute the mean and standard deviation, concatenate |
---|
0:01:32 | the mean and standard deviation, pass them through fully connected layers, and end |
---|
0:01:37 | the network with a softmax layer |
---|
0:01:39 | so |
---|
0:01:39 | this is really is really on its own mean of variable an utterance will be |
---|
0:01:45 | standard edition |
---|
0:01:47 | and researchers have found that using the mean and standard deviation is better than using |
---|
0:01:52 | the mean alone; this shows that |
---|
0:01:56 | a more accurate |
---|
0:01:57 | description of the distribution of the frame-level features is very helpful for |
---|
0:02:02 | producing discriminative speaker embeddings |
---|
0:02:06 | so |
---|
0:02:07 | so let's look at statistics pooling in more detail |
---|
0:02:11 | it's a very easy operation: we just compute the |
---|
0:02:14 | mean of the frame-level features, and then we compute the standard deviation of the |
---|
0:02:18 | frame-level features |
---|
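The pooling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the speaker's implementation: the function name `statistics_pooling`, the `(T, D)` array layout, and the random test data are assumptions made for the example.

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Concatenate the per-dimension mean and standard deviation over time.

    frames: array of shape (T, D) holding T frame-level features of size D.
    Returns a fixed-length vector of shape (2 * D,), whatever T is.
    """
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# Hypothetical utterances of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(50, 4))
long_utt = rng.normal(size=(300, 4))
pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)
```

Both pooled vectors have length `2 * D`, which is what lets a variable-length utterance feed a fixed-size fully connected layer.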
0:02:19 | have a |
---|
0:02:21 | so we can see statistics pooling as a kind of |
---|
0:02:24 | summary of the frame-level features: |
---|
0:02:26 | we use the mean and standard deviation as a summary of the frame-level |
---|
0:02:30 | feature distribution |
---|
0:02:32 | however, |
---|
0:02:32 | a mean and standard deviation can only characterize a unimodal distribution, such as a Gaussian |
---|
0:02:37 | distribution |
---|
0:02:38 | so a multimodal distribution is not described well: even if the frame-level features |
---|
0:02:44 | follow a kind of mixture distribution, such as a Gaussian mixture, |
---|
0:02:49 | a single mean and standard deviation will only |
---|
0:02:52 | summarize it as one mode |
---|
0:02:55 | so what we propose here is mixture representation pooling |
---|
0:02:59 | so |
---|
0:02:59 | it is inspired by the |
---|
0:03:04 | expectation-maximization algorithm |
---|
0:03:06 | for Gaussian mixture models |
---|
0:03:07 | the difference here is that in a Gaussian mixture model we |
---|
0:03:12 | actually use |
---|
0:03:14 | the Euclidean distance to produce the alignments, whereas here we |
---|
0:03:20 | use an attention mechanism |
---|
0:03:22 | to produce the assignment scores; specifically, we have frame-level features and |
---|
0:03:27 | we have multiple attention heads, |
---|
0:03:28 | and each attention head computes a set of weights; |
---|
0:03:34 | these weights are normalized to sum to |
---|
0:03:37 | one |
---|
0:03:38 | across heads, |
---|
0:03:39 | and we use these weights |
---|
0:03:41 | to compute the means and standard deviations |
---|
0:03:43 | and then we have multiple means and standard deviations, |
---|
0:03:48 | which are concatenated together |
---|
0:03:50 | and used to compute the speaker embedding |
---|
0:03:53 | so the important point here is that we have multiple heads and the |
---|
0:03:58 | attention weights are normalized across heads, so each weight is a kind of |
---|
0:04:03 | soft assignment, |
---|
0:04:06 | and how we compute the statistics is exactly as in |
---|
0:04:10 | a Gaussian mixture model |
---|
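The head-normalized pooling described above can be sketched as follows. This is an illustration under stated assumptions rather than the presented system: the talk only says that attention heads produce the weights, so the linear scoring matrix `score_weights`, the function names, and the variance floor `1e-8` are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_pooling(frames: np.ndarray, score_weights: np.ndarray):
    """Pool (T, D) frames into per-head means and standard deviations.

    score_weights: (D, H) matrix of a hypothetical linear attention scorer.
    Returns the concatenated statistics, shape (2 * H * D,), and the
    per-frame assignment weights, shape (T, H).
    """
    scores = frames @ score_weights               # (T, H)
    # Normalize ACROSS HEADS: each frame's weights sum to one,
    # like the responsibilities in the E-step of a GMM.
    resp = softmax(scores, axis=1)
    soft_counts = resp.sum(axis=0)                # (H,)
    # Weighted statistics per head, as in the M-step of a GMM.
    means = (resp.T @ frames) / soft_counts[:, None]
    second_moments = (resp.T @ frames**2) / soft_counts[:, None]
    stds = np.sqrt(np.maximum(second_moments - means**2, 1e-8))
    return np.concatenate([means.ravel(), stds.ravel()]), resp

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 4))
score_weights = rng.normal(size=(4, 3))           # 3 hypothetical heads
embedding_input, resp = mixture_pooling(frames, score_weights)
# With a single head every weight is one, so this is plain statistics pooling.
single_head, _ = mixture_pooling(frames, np.zeros((4, 1)))
```

With one head the softmax across heads gives every frame a weight of one, so the sketch falls back to the ordinary mean and standard deviation.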
0:04:13 | so be so still not allowed us that is supporting map plays the car content |
---|
0:04:19 | in it is only is a proposal by another researcher |
---|
0:04:23 | so is right is very close to it but its enrollment network |
---|
0:04:28 | we use several times about being a different way |
---|
0:04:30 | so as this is a computer |
---|
0:04:33 | on the other setups location away |
---|
0:04:35 | there the attention weights are normalized across frames: for each head we compute a score, and |
---|
0:04:41 | the scores are normalized across frames |
---|
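For contrast, frame-normalized attentive pooling can be sketched the same way. Again this is only an illustration: the linear scorer `score_weights` and the function names are assumptions, and the only point being shown is that the softmax runs over the time axis instead of the head axis.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def frame_normalized_pooling(frames: np.ndarray, score_weights: np.ndarray):
    """Multi-head attentive pooling with weights normalized over frames.

    frames: (T, D); score_weights: (D, H), a hypothetical linear scorer.
    Returns concatenated statistics, shape (2 * H * D,), and weights (T, H).
    """
    scores = frames @ score_weights               # (T, H)
    # Normalize ACROSS FRAMES: each head's weights sum to one over time,
    # so each head independently decides how much every frame counts.
    alpha = softmax(scores, axis=0)
    means = alpha.T @ frames                      # (H, D)
    second_moments = alpha.T @ frames**2
    stds = np.sqrt(np.maximum(second_moments - means**2, 1e-8))
    return np.concatenate([means.ravel(), stds.ravel()]), alpha

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 4))
pooled, alpha = frame_normalized_pooling(frames, rng.normal(size=(4, 3)))
# Zero scores give uniform frame weights, i.e. the plain mean and std.
uniform, _ = frame_normalized_pooling(frames, np.zeros((4, 1)))
```

Here each column of `alpha` sums to one, whereas head-normalized weights sum to one along each row.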
0:04:43 | so nist twenty that all the different arabic a in the with real state acacia |
---|
0:04:47 | mechanism |
---|
0:04:47 | so in our case the attention works like the assignment in a Gaussian mixture |
---|
0:04:52 | model, |
---|
0:04:53 | whereas in the other attention model it is more a way of weighting each frame, |
---|
0:04:58 | assigning a weight to each frame, so it works more |
---|
0:05:01 | like a soft VAD |
---|
0:05:03 | two to three to fuse i'll some |
---|
0:05:06 | and the contribute to three |
---|
0:05:10 | so that's not that the landau wasn't that we might be a teacher forty so |
---|
0:05:17 | actually that's not as in them some other researcher also have okay we will where |
---|
0:05:23 | is internet |
---|
0:05:25 | to latin like task |
---|
0:05:27 | but now map place marginally case that |
---|
0:05:31 | that's now that's undecidable was additionally could be |
---|
0:05:34 | also use a model me |
---|
0:05:36 | multiple means, but unlike ours, the assignments are computed in a different way, using Euclidean distance, |
---|
0:05:42 | whereas we use attention, so we can have a score function |
---|
0:05:46 | that can be very general; it can be |
---|
0:05:48 | i just lock scores on the we use these remote |
---|
0:05:51 | a very general neural network, |
---|
0:05:53 | so it is more powerful than Euclidean distance |
---|
0:05:57 | so let's take a look and other different between that and that you would disagree |
---|
0:06:00 | and we shouldn't worry |
---|
0:06:02 | unlike i talk about before |
---|
0:06:05 | and i can't is it is probably |
---|
0:06:07 | have a kind of computer location when normalized cost of frames so |
---|
0:06:12 | so the distribution is a is the is distributed over a state and each has |
---|
0:06:17 | kind of |
---|
0:06:17 | you is very independent of yellow case of the distribution is this will work at |
---|
0:06:24 | had is only the small where it kind of the mission recognition may be sure |
---|
0:06:29 | execution |
---|
0:06:30 | so |
---|
0:06:31 | so for the network, we use what is considered a very deep one, |
---|
0:06:36 | a 121-layer DenseNet |
---|
0:06:40 | DenseNet is used in computer vision as a kind of |
---|
0:06:45 | cancun oklahoma for the guy around a ski condition because the |
---|
0:06:49 | okay ross given that can as the use of test condition |
---|
0:06:52 | the original DenseNet uses 2-D convolution, and we just made a small modification |
---|
0:06:57 | to make it work with our task: we use |
---|
0:07:01 | 1-D convolution instead |
---|
0:07:03 | of the 2-D convolution |
---|
0:07:05 | and then for the transition i a low precision here we use can we also |
---|
0:07:11 | use clues as a sample |
---|
0:07:12 | as specifically we use the kernels that a twister to lose in an example here |
---|
0:07:17 | the data symbol |
---|
0:07:19 | and as for the loss function, we use |
---|
0:07:23 | AM-softmax; we find it pretty effective |
---|
0:07:28 | so the only information for always the training data that the which idea and you |
---|
0:07:34 | lda we use |
---|
0:07:36 | that is seven thousand and three hundred speakers always the rewind with thirty two |
---|
0:07:41 | it has data a always night maybe we maybe a voice ninety |
---|
0:07:45 | we of so you very that weighs thirty one has a |
---|
0:07:48 | the features we use are forty-dimensional, with the mean |
---|
0:07:53 | and then the weights and additive educational use where is a while use of these |
---|
0:07:57 | energy-based voice activity detection |
---|
0:08:00 | and the neon and we use your addition to the |
---|
0:08:04 | is now we also use |
---|
0:08:05 | us to use a as well and wise but in ways that are we double |
---|
0:08:09 | up i mean |
---|
0:08:11 | it double in and the channel size of the listeners that you scroll down there |
---|
0:08:15 | so this is somebody else the model use a specific law me is a real |
---|
0:08:20 | time operation |
---|
0:08:21 | and then model parameter and use it on hold the number and in the model |
---|
0:08:25 | we can see as well as well and that the work flow is quite low |
---|
0:08:28 | having a is i is a low otherwise but the network because we don't models |
---|
0:08:33 | i don't know the multichannel |
---|
0:08:35 | solar for helpful plastic |
---|
0:08:39 | a powerful |
---|
0:08:40 | about referee of all time all and the models and also quite able but that |
---|
0:08:45 | is that with as the although is are quite enough that will hundred and you |
---|
0:08:50 | want layer |
---|
0:08:51 | because the actually have a weighted loaf localities roughly is there almost every as i |
---|
0:08:56 | z s where the network |
---|
0:08:57 | and then i mean is also only all of the tuple |
---|
0:09:00 | and but we can see that because of this as nice very even networks |
---|
0:09:04 | so we've the you know device like that you will be a little bit so |
---|
0:09:11 | that's right |
---|
0:09:13 | so it is there is a well or in our results |
---|
0:09:17 | so first let's talk about network structure |
---|
0:09:21 | we find that does not for all of our last record and when wiseman than |
---|
0:09:25 | the i-th user can you has if you all three data is that |
---|
0:09:29 | our has never phone a fast and i and that although and why do as |
---|
0:09:34 | well were used |
---|
0:09:35 | a rough is a model parameter and take more time interval as |
---|
0:09:40 | ieee |
---|
0:09:42 | in the performance case can be our guys that obviously perform better |
---|
0:09:47 | and then follows that is important maslow we found then be sure of that and |
---|
0:09:51 | we |
---|
0:09:52 | improve on the VOiCES 2019 evaluation set, |
---|
0:09:55 | and on VoxCeleb1 we have a small improvement |
---|
0:09:58 | and generally speaking way out of all conditions that is we |
---|
0:10:04 | so here is that was totally an application had |
---|
0:10:09 | so here we to acquire it we because study |
---|
0:10:14 | we fix the dimension of the layer after concatenation, so increasing the number of heads will |
---|
0:10:20 | not increase the dimension, |
---|
0:10:22 | because if you increase the number of heads without controlling the concatenated dimension, you |
---|
0:10:28 | end up with a larger model, so any gain from the number of heads |
---|
0:10:32 | cannot be attributed to the attention mechanism; |
---|
0:10:35 | it could just be the benefit of a larger model, |
---|
0:10:39 | so |
---|
0:10:39 | with a fixed dimension this will be a fairer |
---|
0:10:44 | comparison |
---|
0:10:44 | so here we see what happens if we increase the number of heads from one to two |
---|
0:10:49 | to four |
---|
0:10:51 | avoid and it's to the us is probably actually that i scheme going a |
---|
0:10:54 | so we show that as you one highest ask volunteers as that is only |
---|
0:10:59 | overall image relevant between c reasonable huh |
---|
0:11:03 | okay queries the |
---|
0:11:05 | if we only increase the number of heads, performance does not keep improving, so the result is |
---|
0:11:09 | this kind of U-shape as the number of heads increases |
---|
0:11:14 | so to conclude, we introduce mixture representation pooling, |
---|
0:11:19 | a statistics pooling method that uses multiple attention heads; |
---|
0:11:24 | the method is inspired by the expectation-maximization algorithm for Gaussian mixture models, but unlike |
---|
0:11:30 | the GMM, |
---|
0:11:32 | the mixture assignments are given by an attention mechanism instead of the Euclidean distance between |
---|
0:11:37 | the frame-level features and explicit centers; we apply the proposed mechanism to a |
---|
0:11:42 | 121-layer DenseNet and evaluate it on VoxCeleb1 |
---|
0:11:47 | and on the VOiCES 2019 evaluation set |
---|
0:11:52 | so that is all for my presentation; thank you very much for listening. if you have |
---|
0:11:58 | any questions about my presentation, feel free to comment |
---|