0:00:15 Hello, my name is Anna Silnova, and I am going to present our work on probabilistic embeddings applied to the task of speaker diarization. This work is the result of a collaboration between me, Niko Brümmer, Johan Rohdin, Themos Stafylakis and Lukáš Burget.
0:00:39 I want to note that even though the task we are considering here is diarization, the model I am going to present does not necessarily have to be used for it; it can also be applied, for example, to speaker verification. In this presentation, however, I am considering only diarization.
0:01:00 First, I want to start with a short motivational slide describing the usual approach.

0:01:12 We are interested in doing speaker diarization by first splitting the utterance into short overlapping segments; in our case these are 1.5 seconds long and they overlap by 0.75 seconds. Then we extract an embedding, for example an x-vector, for each segment, and cluster the embeddings, and consequently the segments, to obtain the diarization.
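This segmentation step is straightforward; here is a minimal Python sketch (the handling of the trailing window is an assumption, not something stated in the talk):

```python
def split_into_segments(duration, win=1.5, step=0.75):
    """Split an utterance of `duration` seconds into short overlapping
    windows (1.5 s long, shifted by 0.75 s, as in the talk)."""
    segments, start = [], 0.0
    while start + win <= duration:
        segments.append((start, start + win))
        start += step
    # keep a (possibly shorter) trailing window so the end of the utterance is covered
    if not segments or segments[-1][1] < duration:
        segments.append((max(0.0, duration - win), duration))
    return segments

# e.g. split_into_segments(5.0) -> [(0.0, 1.5), (0.75, 2.25), ..., (3.5, 5.0)]
```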
0:01:46 Note that there is a drawback to this approach: all the segments are treated the same. However, they are really not the same, and their quality might differ, so we would like to utilize information about how trustworthy each segment is.

0:02:09 Our assumption here is that the quality of a segment affects our ability to extract its embedding. So, if a segment is short and noisy, we should not be very sure about the embedding we extracted from it; however, if a segment is long and clean, then the embedding can be trusted more.
0:02:35 So, in our model we propose to treat embeddings as hidden variables rather than as observed ones, as is usually done. In this case we have to modify the embedding extractor so that it does not output a point estimate of the embedding but rather the parameters of an embedding distribution, and we also need a backend which can digest such embedding distributions.
0:03:06 Let's start with the model. Here we see a graphical model for a single utterance of N speech segments. The r's denote the speech segments, and each segment has an assigned speaker label; the labels are observed in the training data. Then we have two sets of hidden variables: x are the hidden embeddings and y are the hidden speaker variables. Note that only one speaker variable is connected to each embedding, and consequently to each segment, at a time, and the speaker label defines which one it is.
0:04:00 So, we are interested in clustering these segments into speaker clusters, and to be able to do so we have to know how to compute the clustering posterior P(L|R), where L denotes the set of all speaker labels and R is the set of all speech segments.

0:04:23 Let's look closer at how this posterior looks. It can be expressed as this ratio, where in the numerator we have a product of two terms: one is the prior of the given clustering and the other is the likelihood of the clustering. In the denominator we have a sum of terms of the same form, and the sum here is over all possible partitions of the segments into clusters.
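In symbols, this is (a reconstruction of the slide's formula, with L ranging over partitions of the segments into speakers):

```latex
P(\mathcal{L}\mid\mathcal{R})
  \;=\; \frac{P(\mathcal{L})\,P(\mathcal{R}\mid\mathcal{L})}
             {\sum_{\mathcal{L}'} P(\mathcal{L}')\,P(\mathcal{R}\mid\mathcal{L}')}
```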
0:04:55 Regarding the prior: in our experiments we are using a Chinese restaurant process prior. However, I would like to point out that this is by no means the only option, and probably not the optimal one either; it was just convenient for us, so we stuck to it. I am not going to discuss the prior any further in this presentation; for our purposes, consider it given. Instead, we are going to concentrate on the second term, the likelihood.
0:05:29 If we look closer at the likelihood, within the model it can be represented as a product of the individual likelihoods of the sets of speech segments assigned to the individual speakers. If no segments are assigned to some specific speaker, the corresponding term is just one, so it does not affect the product. All the segments assigned to a given speaker are assumed to belong to the same speaker, that is, to share the same speaker variable.

0:06:06 So we can represent each of these terms as the following integral. Here the integration is over the speaker variable, and under the integral we have the product of the prior over the speaker variable and the product of the likelihood terms of the individual speech segments given the speaker variable.
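A reconstruction of these two expressions, where R_s denotes the set of segments that the clustering L assigns to speaker s:

```latex
P(\mathcal{R}\mid\mathcal{L}) \;=\; \prod_{s} P(\mathcal{R}_s),
\qquad
P(\mathcal{R}_s) \;=\; \int p(y)\,\prod_{t\in\mathcal{R}_s} p(r_t\mid y)\; dy
```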
0:06:28 Now we will discuss how to compute this integral, that is, what assumptions and restrictions we have to put on the model to be able to compute it efficiently. As you can see, the speech segment and the speaker variable are not connected directly but through the hidden embedding, so we have to integrate the embedding out to be able to compute this likelihood.

0:06:57 That is exactly what we do here. The integration is over the hidden embedding, and under the integral we have the product of two terms. The first one models the relation between the hidden embedding and the hidden speaker variable, and we propose to model it by a Gaussian PLDA model. The second term models the relation between the speech segment, or rather the features we extract from it, and the hidden embedding. If the first term is Gaussian, and the second one is also Gaussian as a function of x, then the whole integral can be computed in closed form.

0:07:41 So the first assumption that we make in our model is that the term connecting the speech segment and the hidden embedding can be represented as this product: a Gaussian distribution that is a function of x, and a normalizing term, the non-negative function h, which depends only on the speech and not on the embedding.
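Writing x̂_t and B_t for the mean and precision produced by the extractor for segment r_t, this assumption and the resulting per-segment likelihood can be sketched as (notation reconstructed, not quoted from the slides):

```latex
p(r_t\mid x_t) \;=\; h(r_t)\,\mathcal{N}\!\big(x_t \mid \hat{x}_t,\; B_t^{-1}\big),
\qquad
p(r_t\mid y) \;=\; \int p(r_t\mid x_t)\,p(x_t\mid y)\; dx_t
```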
0:08:10 Plugging this into the likelihood formula, we see that it can be expressed by this equation. Here the likelihood depends on the parameters of the PLDA model, which are the loading matrix V and the within-class precision W, and also on the parameters of the embedding distribution, which are x̂ and B, where x̂_t is the mean of the embedding distribution and B_t is its precision matrix.
0:08:44 However, even though we now have a closed-form solution for this likelihood, it would be very impractical to use: we would have to perform at least one costly matrix inversion for each speech segment at test time, and that would simply be too slow for a real application.
0:09:09 So we propose to restrict our model to a two-covariance model instead of a general Gaussian PLDA. We can do this because we know that the within- and across-class covariances can always be simultaneously diagonalized. If we assume the two-covariance model, we can set the loading matrix to identity and assume that the within-class covariance, and consequently its precision, is diagonal. And since we are free to choose the form of the embedding parameters as we like, we also restrict the embedding precision matrices to be diagonal. Then the whole likelihood expression greatly simplifies, as shown on this slide.
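With the loading matrix set to identity and both W and B_t diagonal, the Gaussian convolution needed for each segment reduces to elementwise operations, so no matrix inversion is required. A sketch of the resulting term:

```latex
p(r_t\mid y) \;\propto\; \mathcal{N}\!\big(\hat{x}_t \mid y,\; B_t^{-1}+W^{-1}\big),
\qquad
\big(B_t^{-1}+W^{-1}\big)^{-1}_{ii} \;=\; \frac{b_{t,i}\,w_i}{b_{t,i}+w_i}
```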
0:10:11 Getting back to what we are really interested in, computing the clustering posterior: to score a partition we need the likelihood of a set of speech segments belonging to the same speaker, and that was computed as the integral over the speaker variable. Now we know the expressions for the terms under the integral, which are Gaussian, so we have a product of Gaussians under the integral; also, the prior of the speaker variable is a standard normal distribution, as assumed by the PLDA model. Therefore we can compute the whole integral in closed form, and the result is given here on this slide.
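A minimal NumPy sketch of this closed-form term, assuming the per-segment precisions have already been combined elementwise as above and are passed in as diagonals (the names are illustrative, not taken from the slides):

```python
import numpy as np

def cluster_log_marginal(xhat, prec):
    """log of  integral N(y; 0, I) * prod_t N(xhat_t; y, diag(prec_t)^-1) dy,
    up to an additive constant that cancels in the clustering posterior.
    xhat, prec : (T, D) arrays of per-segment means / diagonal precisions."""
    P = 1.0 + prec.sum(axis=0)            # diagonal posterior precision of y
    b = (prec * xhat).sum(axis=0)         # sum_t Lambda_t * xhat_t
    return (0.5 * (b * b / P).sum() - 0.5 * np.log(P).sum()
            + 0.5 * np.log(prec).sum() - 0.5 * (prec * xhat * xhat).sum())
```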
0:10:59 Please note that even though we can compute this likelihood, or log-likelihood, only up to a constant (the h(r) factors), it does not really matter, because in both our training and test recipes these constants cancel, so we can simply ignore them.
0:11:20 So, to compute the clustering posteriors, all we need is the PLDA within-class precision matrix and, for each segment, the mean vector and the variance vector, or rather the diagonal precision, of its embedding. We propose to model them using a standard pretrained x-vector extractor, which is shown here on the scheme in grey. This is an x-vector extractor that was trained in the usual way and was not modified afterwards. Normally, one would take the output of the first layer after the statistics pooling layer as the embedding; here we simply throw away the rest of the network, as we do not really need it, and instead add one more linear layer, and the output of this layer will be the mean of the embedding distribution.

0:12:29 Also, we add a sub-network which extracts the embedding precisions. This is a feed-forward network with two hidden layers, and its inputs are the output of the statistics pooling layer and also the length of the segment in frames. Its output is a vector which serves as the diagonal of the embedding precision. Both the mean and the precision are then fed into the PLDA.
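A hypothetical PyTorch sketch of such a sub-network; the layer sizes, the use of the log segment length, and the softplus output (to keep precisions positive) are assumptions rather than details given in the talk:

```python
import torch
import torch.nn as nn

class PrecisionExtractor(nn.Module):
    """Two-hidden-layer feed-forward net mapping the statistics-pooling output
    plus the segment length (in frames) to a diagonal embedding precision."""
    def __init__(self, stats_dim=3000, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(stats_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
            nn.Softplus(),  # one possible way to keep the precisions positive
        )

    def forward(self, pooled_stats, num_frames):
        # pooled_stats: (batch, stats_dim), num_frames: (batch,)
        length = num_frames.float().log().unsqueeze(1)
        return self.net(torch.cat([pooled_stats, length], dim=1))  # (batch, emb_dim)
```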
0:13:00 All of these yellow blocks can be trained jointly and discriminatively. Note that if we were to ignore the lower block, the uncertainty extractor, which we of course do not do in this work, then we would be back to the standard x-vector plus Gaussian PLDA recipe: the linear transformation and the within-class precision together just define a PLDA model trained on x-vectors extracted from the original network.
0:13:38 So, how do we train it? We propose to use a multiclass cross-entropy criterion to train the model parameters. For that, we reorganize the training set into a collection of supervised trials, each of which contains a set of eight speech segments and the corresponding speaker labels, which define the true clustering of these eight segments. We use just eight segments for a reason: for a higher number, it would be very computationally expensive to compute the posteriors over all possible partitions.
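As an illustration of the criterion (not the authors' code), the sketch below scores one eight-segment trial by enumerating all set partitions (there are Bell(8) = 4140 of them), reusing `cluster_log_marginal` from the earlier snippet; `log_prior` is a placeholder for whatever clustering prior is used, for example the Chinese restaurant process:

```python
import numpy as np

def set_partitions(items):
    """Enumerate all partitions of `items` into non-empty blocks."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i, block in enumerate(smaller):      # put `first` into an existing block
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller                # or into a new block of its own

def trial_cross_entropy(xhat, prec, labels, log_prior):
    """-log P(true clustering | segments) for one supervised 8-segment trial."""
    labels = np.asarray(labels)
    true_part = frozenset(frozenset(np.where(labels == l)[0]) for l in set(labels))
    scores, true_idx = [], None
    for k, part in enumerate(set_partitions(list(range(len(labels))))):
        s = log_prior(part) + sum(cluster_log_marginal(xhat[b], prec[b]) for b in part)
        scores.append(s)
        if frozenset(frozenset(b) for b in part) == true_part:
            true_idx = k
    scores = np.array(scores)
    log_norm = np.max(scores) + np.log(np.exp(scores - np.max(scores)).sum())
    return -(scores[true_idx] - log_norm)
```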
0:14:20 Once we have trained the model with this criterion, we can use it for diarization.
0:14:28 Now let's discuss the baseline approach and the one that we propose. For the baseline we use essentially the Kaldi diarization recipe, which extracts an x-vector for each short segment of the utterance; the x-vectors are then preprocessed, and the processed x-vectors are fed into PLDA, which provides a matrix of pairwise similarity scores. These scores are then used by an agglomerative hierarchical clustering (AHC) algorithm, a greedy algorithm that starts by assigning each segment its own separate speaker label and then gradually merges clusters, two at a time.

0:15:17 The baseline uses a version of this algorithm in which, after each merge, the similarity scores of the new cluster against all the remaining clusters are computed by simply averaging the scores of the individual parts of this cluster. The merging stops once there are no similarity scores higher than some preset threshold.
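A simplified sketch of this greedy clustering loop (averaging over all segment pairs is used here as a stand-in for the recipe's exact averaging rule):

```python
import numpy as np

def ahc_average(scores, threshold=0.0):
    """Greedy agglomerative clustering on a symmetric (N, N) similarity matrix:
    each segment starts as its own cluster, the best-scoring pair is merged,
    and cluster-level scores are the average of the underlying segment scores.
    Merging stops when no score exceeds `threshold`."""
    clusters = [[i] for i in range(scores.shape[0])]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = scores[np.ix_(clusters[i], clusters[j])].mean()
                if s > best:
                    best, pair = s, (i, j)
        if best <= threshold:
            break
        i, j = pair
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```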
0:15:48 In our recipe, we use not only the x-vectors but also the outputs of the statistics pooling layer and the number of frames in each segment, which are needed by the uncertainty extractor. We then center and length-normalize the x-vectors and use them as the means of the probabilistic embeddings. Finally, the PLDA similarity scores are again used by AHC; however, in our case, after each merge we compute the log-likelihood ratio scores exactly for the new cluster against all the rest.
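The merge score in this variant is an exact same-speaker log-likelihood ratio between two clusters; a sketch on top of `cluster_log_marginal` from the earlier snippet (the h(r) constants cancel in the ratio):

```python
def same_speaker_llr(xhat, prec, cluster_a, cluster_b):
    """Exact log-likelihood ratio of 'A and B share one speaker' vs
    'A and B are two different speakers', for index lists cluster_a / cluster_b."""
    return (cluster_log_marginal(xhat[cluster_a + cluster_b], prec[cluster_a + cluster_b])
            - cluster_log_marginal(xhat[cluster_a], prec[cluster_a])
            - cluster_log_marginal(xhat[cluster_b], prec[cluster_b]))
```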
0:16:27 Regarding the experimental setup: we used VoxCeleb 1 and 2 to train the x-vector extractor and the baseline PLDA. Then we used the AMI dataset to train the uncertainty extractor, which is the sub-network extracting the embedding precisions, and also to retrain the PLDA. Finally, we used the DIHARD 2019 development and evaluation sets to test diarization performance.
0:17:03 And here are the results. First, I have to note that the results in the table here are slightly different from those in the paper: after the paper was submitted, I managed to improve the baseline performance, so I regenerated the updated results accordingly.
0:17:28 For each model here we have two sets of results: one where a threshold of zero is used to stop the agglomerative clustering, and another where the threshold is tuned on the development set. If all the scores were correct, well-calibrated log-likelihood ratios, then the zero threshold would be the maximum-likelihood optimal threshold. If that is not the case, then by tuning the threshold we can still improve the diarization performance, which is clearly the case for all the systems shown here.
0:18:19 First, if you look at the baseline system, there is quite a large gap between the optimal performance and the performance obtained with the zero threshold. However, if we just replace the baseline version of AHC with ours, where we compute the log-likelihood ratios after each merge, then we see that the calibration issue becomes even more pronounced: the results with the zero threshold degrade substantially, and even the optimal results get quite a bit worse than the baseline.
0:19:01 Note that here we did not retrain anything; we only changed the clustering algorithm. If we train the same model, but without using the probabilistic embeddings, that is, we just train it with the multiclass cross-entropy as discussed before, then this calibration issue is alleviated to a large extent: the difference between the zero threshold and the tuned one is no longer as dramatic, and we even managed to slightly improve over the zero-threshold baseline performance.
0:19:42 Finally, if we add the embedding precisions to this model, so that we are using the uncertainty information, then we further improve both the zero-threshold performance and the optimal one. Looking at the last line: this is the system that gives us the best zero-threshold performance. With a threshold tuned on the development data, the baseline can still achieve slightly better performance, but in that case the results are very close, and the difference between the optimal performance and the zero-threshold performance is already not as large as it is for the other models.
0:20:33 So, finally, to conclude: we proposed a scheme to jointly and discriminatively train the PLDA and the embedding extractor with a multiclass cross-entropy criterion, and this discriminative training helps to eliminate, to a large extent, the calibration problem of the original baseline method.

0:20:58 Then we added the uncertainty extractor to the model, and training it together with the PLDA further improves calibration. The main take-away message here would be that even though the model we propose does not necessarily give the best performance, it results in a better-calibrated system, which is more robust.
0:21:26 So that was it from me. Thank you, and goodbye.