Hello, my name is Anna Silnova, and I am going to present our work on probabilistic embeddings applied to the task of speaker diarization.
This work is the result of a collaboration between me, Johan Rohdin, and Lukáš Burget from BUT, and Niko Brümmer and Themos Stafylakis from Omilia.
I want to note that even though the task we are addressing here is diarization, the model I am going to present does not necessarily have to be used for it; it can also be applied, for example, to speaker verification. But in this presentation I am considering only diarization.
First, I want to start with a short motivational slide before getting to the actual model. We are interested in doing speaker diarization by first splitting the utterance into short overlapping segments; in our case these are one and a half seconds long, or shorter, and they overlap by 0.75 seconds.
Then we extract an embedding, for example an x-vector, for each segment, and cluster the embeddings, and consequently the segments, to obtain the diarization.
Note that there is a problem with this approach, or rather a drawback: all the segments are treated as if they were the same; however, they are really not, and their quality might be different. We would like to utilize the information about how trustworthy each segment-based embedding is. Our assumption here is that the quality of a segment affects our ability to extract the embedding from it.
So, if a segment is short and noisy, we should not be too confident about the embedding we extracted from it; however, if a segment is long and clean, then its embedding can be trusted more.
So, in our model we propose to treat embeddings as hidden variables rather than as observed ones, as is usually done. In this case, we have to modify the embedding extractor so that it outputs not a single embedding vector but rather the parameters of an embedding distribution. And we also have to have a backend which can digest these embedding distributions.
Now, starting with the model: here we see a graphical model for a single utterance of N speech segments. The observed variables r are the speech segments, and each segment has an assigned speaker label; these labels are observed for the training data. Then we have two sets of hidden variables: x are the hidden embeddings, and y are the hidden speaker variables. Note that there is only one speaker variable connected to each embedding, and consequently to each segment, at a time, and the speaker label defines which one it is. We are interested in clustering these segments into speaker clusters.
To be able to do so, we have to know how to compute the clustering posterior P(L | R), where L denotes the set of all speaker labels and R is the set of all speech segments. So let's look closer at how this posterior looks. It can be expressed as a ratio, where in the numerator we have a product of two terms: one of them is the prior of a given clustering, and the other is the likelihood of the clustering. In the denominator we have the sum of the same terms, and the sum here is over all possible partitions of the segments into clusters.
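In symbols, this is, as a sketch of the formula described here (notation is mine, with L' ranging over all possible partitions of the segments):
$$ P(L \mid R) \;=\; \frac{P(L)\,P(R \mid L)}{\sum_{L'} P(L')\,P(R \mid L')} $$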
Regarding the prior: in our experiments we are using a Chinese restaurant process prior; however, it is probably neither the only option nor necessarily the optimal one, it was just convenient for us, so we stuck with it. I am not going to discuss the prior any further in this presentation; from now on, consider it given, and we are going to concentrate instead on the likelihood.
If we look closer at the likelihood, we see that it can be represented as a product of the individual likelihoods of the sets of speech segments assigned to the individual speakers. If there are no segments assigned to some specific speaker, then the corresponding term is just one, so it does not affect the product. All the segments assigned to speaker s are assumed to belong to the same speaker, that is, they share the same speaker variable.
So we can represent each of these terms as the following integral. Here the integration is over the speaker variable, and under the integral we have the product of the prior over the speaker variable and the product of the likelihood terms of the individual speech segments given the speaker variable.
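Written out, the per-speaker term is (again in my notation, with R_s denoting the segments assigned to speaker s):
$$ P(R_s) \;=\; \int p(y)\,\prod_{i \in s} p(r_i \mid y)\; dy $$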
Now I am going to discuss which assumptions and restrictions on the model we have to make to be able to compute it efficiently.
As you can see in the graphical model, a speech segment and its speaker variable are not connected directly, but through the hidden embedding, so we have to integrate the embedding out to be able to compute this likelihood.
And that's exactly what we do here. The integration is over the hidden embedding, and under the integral we have the product of two terms. The first one models the relation between the hidden embedding and the hidden speaker variable, and we propose to model it with a Gaussian PLDA model. The next term models the relation between the speech segment, or rather the features we extract from it, and the hidden embedding. If the first term is Gaussian, and the second one is also Gaussian as a function of x, then the whole integral can be computed in closed form.
So, basically, the first assumption that we make in our model concerns the exact form of the probability of the speech segment given the hidden embedding: it can be represented as the product of a Gaussian distribution that is a function of x and a non-negative normalizing function h which depends only on the speech and not on the embedding. Plugging this into the likelihood formula, we see that the likelihood can be expressed with this equation. Note that here the likelihood depends on the parameters of the PLDA, which is the within-class precision matrix W, and also on the parameters of the embedding distribution, which are x-hat and B, where x-hat is the mean of the embedding distribution and B is its precision matrix.
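As a sketch of the two Gaussian assumptions and the resulting closed form (my reconstruction from the description, so the exact parameterization in the paper may differ):
$$ p(r_i \mid x_i) = h(r_i)\,\mathcal{N}\!\big(\hat{x}_i;\, x_i,\, B_i^{-1}\big), \qquad p(x_i \mid y) = \mathcal{N}\!\big(x_i;\, y,\, W^{-1}\big) $$
$$ \Rightarrow\quad p(r_i \mid y) = \int p(r_i \mid x_i)\, p(x_i \mid y)\, dx_i = h(r_i)\,\mathcal{N}\!\big(\hat{x}_i;\, y,\, W^{-1} + B_i^{-1}\big) $$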
Now, even though we have a closed-form solution for this likelihood, it would be very impractical to use: we would have to perform one costly matrix inversion for each speech segment at test time, and that would simply be too slow for a real application.
So we propose to restrict our model to a two-covariance model instead of a general Gaussian PLDA. We do this because we know that the within-class and across-class covariances can always be simultaneously diagonalized, so if we assume the two-covariance model, we can set the loading matrix to identity and assume that the within-class covariance, and consequently its precision, is diagonal. And since we are free to choose the form of the embedding parameters, we also restrict the embedding precision matrices to be diagonal. Then the whole likelihood expression greatly simplifies, as shown on this slide.
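Concretely, the simplification as I understand it: with a diagonal within-class precision $W$ and diagonal embedding precisions $B_i$, the covariance $W^{-1} + B_i^{-1}$ stays diagonal,
$$ \big(W^{-1} + B_i^{-1}\big)_{jj} = \frac{1}{w_j} + \frac{1}{b_{ij}}, $$
so the per-segment "matrix inversion" reduces to element-wise reciprocals.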
Now, getting back to diarization: we are interested in computing the clustering posterior, and for that we need the likelihood of a set of speech segments belonging to the same speaker, given the partition. This was written as the integral shown before, and now we know the expressions for the terms under the integral, which are Gaussian. So we have a product of Gaussians under the integral; also, the prior here is the standard normal distribution, as assumed by the PLDA model. So we can compute the whole integral in closed form, and the result is given here on this slide.
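For completeness, here is a sketch of that closed form under the assumptions above, obtained with standard Gaussian identities (my notation, not necessarily the exact expression from the paper). With $\Lambda_i = \big(W^{-1}+B_i^{-1}\big)^{-1}$,
$$ P(R_s) \;\propto\; \int \mathcal{N}(y;\, 0,\, I)\, \prod_{i \in s} \mathcal{N}\!\big(\hat{x}_i;\, y,\, \Lambda_i^{-1}\big)\, dy, $$
$$ \log P(R_s) \;=\; \tfrac{1}{2}\Big( b^\top P^{-1} b \;-\; \log |P| \Big) \;+\; \text{terms that do not depend on the partition}, \quad P = I + \sum_{i \in s} \Lambda_i,\;\; b = \sum_{i \in s} \Lambda_i \hat{x}_i. $$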
Please note that even though we can compute this likelihood, or rather the log-likelihood, exactly only up to an additive constant, it does not really matter: in both our training and test recipes these constants are going to cancel, so we can just ignore them.
All right, so to compute the clustering posteriors we need the within-class precision matrix of the PLDA, and, for each segment, the embedding mean and the vector holding its diagonal precision. We propose to model these by using a standard pretrained x-vector extractor, which is shown in grey on the scheme.
This is a standard x-vector extractor, which was trained in the usual way and was not modified afterwards. Normally, the embedding would be the output of the first affine layer after the statistics pooling layer; here we just throw away the rest of the network, as we do not really need it, and add one new linear layer, and the output of this layer will be the mean of the embedding distribution.
Also, we add a sub-network which is responsible for extracting the embedding precision. This is a feed-forward network with two hidden layers, and its inputs are the output of the statistics pooling layer and also the length of the segment in frames. Its output is the vector which holds the diagonal of the embedding precision. Both the mean and the precision are then fed into the PLDA.
All of these yellow blocks can be trained together, jointly and discriminatively.
Let me note that if we just ignored this lower branch, then we would be back to the standard x-vector Gaussian PLDA recipe: the linear transformation and the within-class precision together simply define a PLDA model trained on x-vectors extracted from the original network.
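To make the architecture concrete, here is a minimal sketch of the two heads sitting on top of a frozen, pretrained x-vector trunk, written in PyTorch. The layer sizes, the use of softplus to keep precisions positive, and the way the segment length is fed in are my assumptions for illustration, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class ProbabilisticEmbeddingHead(nn.Module):
    """Maps pooled statistics from a pretrained x-vector network to the
    mean and diagonal precision of an embedding distribution (sketch)."""

    def __init__(self, stats_dim=3000, emb_dim=128, hidden_dim=256):
        super().__init__()
        # Linear layer producing the embedding mean (plays the role of the
        # usual affine layer after statistics pooling).
        self.mean_head = nn.Linear(stats_dim, emb_dim)
        # Small feed-forward net with two hidden layers producing the
        # diagonal precision; it also sees the segment length in frames.
        self.prec_head = nn.Sequential(
            nn.Linear(stats_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim), nn.Softplus(),  # keep precisions > 0
        )

    def forward(self, pooled_stats, num_frames):
        # pooled_stats: (batch, stats_dim); num_frames: (batch,)
        mean = self.mean_head(pooled_stats)
        prec_in = torch.cat([pooled_stats, num_frames.unsqueeze(1).float()], dim=1)
        precision = self.prec_head(prec_in)
        return mean, precision  # parameters of a diagonal Gaussian over the embedding
```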
So how do we train it? We propose to use a multiclass cross-entropy criterion to train the model parameters. For that, we reorganize the training set as a collection of supervised trials, each of which contains a set of eight speech segments and the corresponding speaker labels, which define the true clustering of these eight segments. We use just eight segments for a reason: for a higher number of segments, it would simply be too computationally expensive to compute the posteriors.
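In other words, as I understand the criterion, it is the average negative log posterior of the true clustering, where the normalization in the posterior runs over every possible partition of the eight segments. Since the number of partitions of n items is the Bell number, eight segments give a manageable 4140 partitions, while the count explodes quickly for larger n:
$$ \mathcal{C} \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log P\big(L_n^{\text{true}} \mid R_n\big) \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log \frac{P\big(L_n^{\text{true}}\big)\prod_{s} P\big(R_{n,s}\big)}{\sum_{L'} P(L')\prod_{s'} P\big(R_{n,s'}\big)} $$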
So, once we have trained the model with this criterion, we can use it for diarization. Now let's look at the two recipes:
our baseline approach and the one that we propose. As the baseline, we use the standard Kaldi diarization recipe, which extracts an x-vector for each short segment; the x-vectors are then preprocessed, and the processed x-vectors are fed into the PLDA, which provides a matrix of pairwise similarity scores. These scores are then used by an agglomerative hierarchical clustering (AHC) algorithm, which is a greedy algorithm starting with each segment assigned a separate speaker label and then gradually merging clusters, two at a time. The baseline uses a version of this algorithm which, after each merge, computes the similarity scores of the new cluster against all the rest by simply averaging the scores of the individual parts of this cluster. The merging stops once there are no similarity scores higher than some preset threshold.
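Here is a small sketch of that greedy clustering loop with score averaging, just to illustrate the baseline behaviour described above; the real recipe, its score definitions, and stopping details may differ.

```python
import numpy as np

def ahc_average_linkage(scores: np.ndarray, threshold: float = 0.0):
    """Greedy agglomerative clustering over a matrix of pairwise similarity
    scores: repeatedly merge the two most similar clusters, scoring a merged
    cluster against the rest by averaging its members' pairwise scores, and
    stop when no pair scores above the threshold (sketch of the baseline)."""
    n = scores.shape[0]
    clusters = [[i] for i in range(n)]             # start: one cluster per segment
    while len(clusters) > 1:
        # cluster-to-cluster score = average of pairwise segment scores
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([scores[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        if best < threshold:                        # nothing similar enough left
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]     # merge the best pair
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```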
In our recipe, we use not only the x-vector but also the output of the statistics pooling layer and the number of frames in each segment. We center and length-normalize the x-vectors, and then use them, together with the pooled statistics, to obtain the probabilistic embeddings. Finally, the PLDA similarity scores are used by AHC as before; however, in our case, after each merge we compute the log-likelihood ratio scores exactly for the new cluster against all the rest, instead of averaging.
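My understanding of that exact score is a log-likelihood ratio of the "same speaker" hypothesis for two candidate clusters, which the model above can evaluate in closed form, i.e. something like:
$$ \mathrm{LLR}(a, b) \;=\; \log \frac{P\big(R_a \cup R_b \text{ from one speaker}\big)}{P(R_a)\,P(R_b)} $$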
Now to the experimental setup. We used VoxCeleb 1 and 2 to train the x-vector extractor and the baseline PLDA. Then we used the AMI dataset to train the uncertainty extractor, which is the small network extracting the embedding precisions, and also to retrain the PLDA. Finally, we used the DIHARD 2019 development and evaluation sets to test the diarization performance.
And here are the results. First, I have to note that the results in the table here are slightly different from those in the paper. That is because, after submitting the paper, I managed to improve the baseline performance, so I have regenerated and updated the results shown here.
For each model here, we have two sets of results. One is where a zero threshold stops the agglomerative clustering, and the other is where the threshold is tuned on the development set. If a model produced correctly calibrated log-likelihood ratio scores, then zero would be the maximum-likelihood optimal threshold; if that is not the case, then with a tuned threshold we can still hope for reasonable diarization error, which is clearly what happens for all the systems listed here.
First, if we look at the baseline system, there is quite a large gap between the optimal performance and the performance when using the zero threshold. However, if we just replace the baseline version of AHC with ours, where we compute the log-likelihood ratios exactly after each merge, then we see that the calibration issue becomes even more prominent: the results with the zero threshold degrade substantially, and even the optimal results get quite a bit worse than the baseline.
Note that here we did not retrain anything; we only changed the clustering algorithm. If we train the same model, but without using the probabilistic embeddings, that is, we just train it with the plain multiclass cross-entropy as discussed before, then this calibration issue is solved to a large extent: the difference between the zero threshold and the tuned one is no longer as dramatic, and we even managed to slightly improve over the zero-threshold baseline performance.
Finally, if we add to this model the embedding precisions, so that we are using the uncertainty information, then we further improve both the zero-threshold performance and the optimal one. In the tuned-threshold setting, I cannot say that this system gives us the best performance overall: on the development data, with a tuned threshold, we can still do better with the baseline; but in this case the two are very close, and the difference between the optimal performance and the zero-threshold performance is already not as large as it is for the other models.
So, finally, to the conclusion. We proposed a scheme to jointly train the PLDA and the embedding extractor with a multiclass cross-entropy criterion, and this discriminative training helps to eliminate the calibration problem of the original baseline method. Then we added the uncertainty extractor to the training, and training it together with the PLDA further improves calibration. The main take-away message here would be that even though the model we propose does not necessarily give the best performance, it results in a better calibrated system, which is more robust. So that was it from me. Thank you for your attention, and goodbye.