In this work we look at speaker embeddings, and in particular at how to aggregate the frame-level information into a single embedding for each speaker using attention, so that the embedding works robustly even in noisy conditions.
So here are the contributions. First, we show that our model, specifically a 121-layer DenseNet, produces more discriminative speaker embeddings than the widely used x-vector model, with a similar amount of parameters and run time. Secondly, we show that the mixture of attention heads we propose for pooling improves the speaker embeddings on noisy data sets.
Okay, so first the baseline. The network takes speech features, MFCC or filter bank features, and passes them through several layers of convolution. Because speech is variable-length in nature, we need to convert the frame-level features into a single vector, so we do statistics pooling: specifically, we compute the mean and the standard deviation over the duration of the utterance, concatenate them, and feed the result to fully connected layers and a softmax layer.
so
The statistics pooling simply computes the mean and the standard deviation of the frame-level features over the utterance. Previous research found that using both the mean and the standard deviation is better than using the mean alone. This shows that a good description of the distribution of the frame-level features is very helpful for producing discriminative speaker embeddings.
so
This slide shows statistics pooling in more detail. It is a very simple operation: we just compute the mean of the frame-level features, then we compute the standard deviation of the frame-level features, and we concatenate them. So we can see that statistics pooling is a kind of summary of the frame-level features: the mean and standard deviation are used as a summary of the frame-level feature distribution.
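As a minimal sketch of the statistics pooling I just described (the shapes here are illustrative, not the ones from the talk):

```python
import numpy as np

def statistics_pooling(frames):
    """Map variable-length frame-level features (T, D) to a fixed-size
    utterance-level vector (2*D,): the per-dimension mean and standard
    deviation over time, concatenated."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
short = statistics_pooling(rng.normal(size=(50, 4)))    # 50-frame utterance
long_ = statistics_pooling(rng.normal(size=(300, 4)))   # 300-frame utterance
print(short.shape, long_.shape)  # both (8,): utterance length no longer matters
```

The point of the example is only that utterances of different lengths map to vectors of the same size.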
However, the mean and standard deviation can only characterize a single-mode distribution, like a Gaussian distribution. So if the frame-level features follow a multimodal distribution, the mean and standard deviation will only give a rough summary of that distribution.
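A toy one-dimensional illustration of this point (the mode locations and widths are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Frames from two well-separated modes in one feature dimension.
frames = np.concatenate([rng.normal(-3.0, 0.5, 500),
                         rng.normal(+3.0, 0.5, 500)])

mean, std = frames.mean(), frames.std()
print(mean, std)  # mean near 0, std near 3: summarized as one wide Gaussian

# Yet almost no frame actually lies near that mean.
print(np.mean(np.abs(frames - mean) < 0.5))  # tiny fraction
```

So the single mean/std pair describes a distribution that the frames themselves never really visit, which is exactly the weakness of plain statistics pooling on multimodal data.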
So what we propose here is a mixture representation for pooling. The idea is inspired by the expectation-maximization algorithm for the Gaussian mixture model. The difference from EM for a Gaussian mixture model is that, instead of using the Euclidean distance to produce the alignments, we use an attention mechanism to produce the assignment scores. Specifically, we have the frame-level features, and we have multiple attention heads. Each attention head computes a set of weights, and these weights are normalized to sum to one across the heads for each frame. We then use the attention weights to compute a weighted mean and standard deviation for each head, and the multiple means and standard deviations are concatenated together into the embedding, which is then used to compute the speaker identity. The important point here is that we have multiple heads and the attention weights are normalized across the heads, so each frame is softly assigned to the heads; the weighted means and standard deviations are computed exactly as in a Gaussian mixture model.
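The steps above can be sketched as follows. This is a minimal NumPy version, assuming a single linear layer as the attention scorer (the talk only says the scorer is a neural network, so the `W` matrix here is a placeholder for whatever small network is actually used):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_attention_pooling(frames, W):
    """frames: (T, D) frame-level features; W: (D, H), one column per head
    (a hypothetical single-linear-layer scorer).
    Scores are normalized across HEADS for each frame, so each frame is
    softly assigned to the heads, like component posteriors in GMM EM."""
    resp = softmax(frames @ W, axis=1)                 # (T, H), each row sums to 1
    weights = resp / resp.sum(axis=0, keepdims=True)   # per head, sums to 1 over time
    means, stds = [], []
    for h in range(W.shape[1]):
        w = weights[:, h:h+1]                          # (T, 1)
        mu = (w * frames).sum(axis=0)                  # weighted mean of head h
        var = (w * (frames - mu) ** 2).sum(axis=0)     # weighted variance of head h
        means.append(mu)
        stds.append(np.sqrt(var + 1e-8))
    return np.concatenate(means + stds)                # (2 * H * D,)

rng = np.random.default_rng(0)
emb = mixture_attention_pooling(rng.normal(size=(200, 8)), rng.normal(size=(8, 4)))
print(emb.shape)  # (64,) = 2 heads-and-stats blocks of H*D
```

Note the two normalizations: the softmax across heads gives the per-frame soft assignment, and the division by the per-head total makes each head's weights a proper averaging distribution over time, exactly mirroring how GMM responsibilities are turned into component means.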
Note that this is not the first pooling method to use attention: attentive pooling was already proposed by other researchers. Our method is very close to it, but the attention weights are normalized in a different way. In their setup, each attention head computes a score for every frame and the scores are normalized across the frames; in ours they are normalized across the heads. This leads to quite different behavior of the attention mechanism. When you normalize across the frames, the attention model is more a way to assign a weight to each frame, so when it is trained with the classification loss it acts more like a soft VAD, choosing which frames contribute to the embedding.
Another related work we should mention is NetVLAD, which some other researchers have also applied to similar tasks. NetVLAD is also built around a mixture model, but unlike ours it computes the assignments in a different way, using a Euclidean-distance-style score, whereas we use attention. With attention the score function can be very general, it can be any neural network, so it is more powerful than the Euclidean distance.
So let's take another look at the difference between the two normalizations. As I talked about before, when the weights are normalized across the frames, the distribution is over the frames, and each head is independent of the others; when the weights are normalized across the heads, the distribution for each frame is over the heads, so each head models one component, like a mixture component in a mixture model.
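The two normalization axes are literally one argument apart; a tiny sketch with made-up scores:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.arange(15, dtype=float).reshape(5, 3)  # toy scores: 5 frames, 3 heads

# Frame normalization: each head's weights sum to 1 over the frames,
# so each head independently decides which frames matter (soft-VAD-like).
frame_norm = softmax(scores, axis=0)

# Head normalization (ours): each frame's weights sum to 1 over the heads,
# so each frame is softly distributed among heads (mixture-responsibility-like).
head_norm = softmax(scores, axis=1)

print(frame_norm.sum(axis=0))  # [1. 1. 1.]
print(head_norm.sum(axis=1))   # [1. 1. 1. 1. 1.]
```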
so
So for the embedding network we use DenseNet, specifically the 121-layer DenseNet. DenseNet is well known in computer vision; it connects each layer to all the preceding layers through dense connections. The original DenseNet is designed for images with 2D convolutions, so we made a small modification to make it work for speech: we replace the 2D convolutions with 1D convolutions along the time axis. For the transition layers we also use 1D operations; specifically, we use a kernel size of two with a stride of two to downsample along time. And for the loss, instead of the plain softmax, we use the additive margin softmax (AM-softmax), which we find very effective.
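For reference, here is a minimal sketch of how AM-softmax logits are formed; the margin and scale values below are common choices in the literature, not necessarily the ones used in this talk:

```python
import numpy as np

def am_softmax_logits(emb, class_weights, target=None, margin=0.35, scale=30.0):
    """Additive margin softmax logits for one embedding.
    emb: (D,) speaker embedding; class_weights: (D, C), one column per speaker.
    Both are L2-normalized so the logits are cosine similarities; the margin
    is subtracted from the target class before scaling by s."""
    e = emb / np.linalg.norm(emb)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    logits = e @ w                 # cosine similarity per class, in [-1, 1]
    if target is not None:
        logits[target] -= margin   # force extra confidence on the true speaker
    return scale * logits

# Toy example: 3-dim embedding, 2 speaker classes.
emb = np.array([1.0, 2.0, -1.0])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(am_softmax_logits(emb, weights, target=0))
```

The subtracted margin makes the target-class logit harder to satisfy, which pushes embeddings of the same speaker closer together in cosine space.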
Now the experimental setup. The training data is VoxCeleb 1 and VoxCeleb 2, around 7,300 speakers in total. The evaluation data is the VoxCeleb 1 test set and the VOiCES 2019 evaluation set. The input is 40-dimensional features with mean normalization, and we use energy-based voice activity detection. The baseline is the x-vector network; we also use an enlarged x-vector, where we double the channel size of the original x-vector network.
This slide summarizes the models we use, specifically the pooling operation, the number of model parameters, and the number of operations in each model. We can see that the parameter count of DenseNet is quite low, lower than the x-vector network, because we do not use very wide layers. However, the operation count of DenseNet is quite high, roughly the same as the enlarged x-vector network, because it has 121 layers. And because DenseNet is such a deep network, on a device like a GPU it will run a little bit slower.
So now let's look at our results.
so first let's talk about network structure
We find that DenseNet performs the best of all our models: it outperforms both the x-vector and the enlarged x-vector on all three evaluation sets, with roughly half the model parameters, although it takes more time per evaluation. So in terms of performance, DenseNet clearly performs better. Then for the pooling, we found that the mixture of attention heads helps: on the VOiCES 2019 evaluation set we see a clear improvement, and on VoxCeleb 1 we see a small improvement. Generally speaking, it helps under all conditions we tested.
Here is the ablation study on the number of attention heads. Here we have to be careful: if we simply increase the number of heads without controlling the concatenated embedding dimension, the model gets larger, so any gain could come from the larger model rather than from the attention mechanism itself. So we also run a setting where the total embedding dimension is kept fixed as the number of heads increases, which gives a more fair comparison. We can see that when we increase the number of heads from one to two to four, the performance keeps improving, and under the fixed-dimension constraint the larger numbers of heads achieve the best overall results. But if we only keep increasing the number of heads, the performance does not keep going up; the reason is that the curve has a kind of U shape when the number of heads gets too high.
So to conclude: we introduced the mixture of attention heads for pooling. The key point is that we use multi-head attention with the weights normalized across the heads, so the pooling behaves like the expectation-maximization alignment of a Gaussian mixture model, but with the attention score in place of the Euclidean distance. The second point is that we proposed to use the 121-layer DenseNet as the embedding network, and we showed that it performs better than the x-vector on the VoxCeleb 1 test set and on several VOiCES 2019 evaluation conditions.
So that is all of my presentation. Thank you very much for listening. If you have any questions about my presentation, please feel free to leave a comment.