Hello, my name is Anna Silnova, and I am going to present our work on probabilistic embeddings applied to the task of speaker diarization.
This work is the result of a collaboration between me, Johan Rohdin, and Lukáš Burget from BUT, and Niko Brümmer and Themos Stafylakis from Omilia.
I want to note that even though the task we are addressing here is diarization, the model I am going to present does not necessarily have to be used for it; it can also be applied, for example, to speaker verification. But in this presentation I am considering only diarization.
First, I want to start with a short motivational slide before getting to the actual model. We are interested in doing speaker diarization by first splitting the utterance into short overlapping segments; in our case these are one and a half seconds long, or shorter, and they overlap by 0.75 seconds.
Then we extract an embedding, for example an x-vector, for each segment, and cluster the embeddings, and consequently the segments, to obtain the diarization.
Note that there is a problem with this approach, or rather a drawback: all the segments are treated as if they were the same; however, they are really not, and their quality might be different. We would like to utilize the information about how trustworthy each segment-based embedding is. Our assumption here is that the quality of a segment affects our ability to extract the embedding from it.
So, if a segment is short and noisy, we should not be too confident about the embedding we extracted from it; however, if a segment is long and clean, then its embedding can be trusted more.
So, in our model we propose to treat embeddings as hidden variables rather than as observed ones, as is usually done. In this case, we have to modify the embedding extractor so that it outputs not a single embedding vector but rather the parameters of an embedding distribution. And we also have to have a backend which can digest these embedding distributions.
Now, starting with the model: here we see a graphical model for a single utterance of N speech segments. The observed variables r are the speech segments, and each segment has an assigned speaker label; these labels are observed for the training data. Then we have two sets of hidden variables: x are the hidden embeddings, and y are the hidden speaker variables. Note that there is only one speaker variable connected to each embedding, and consequently to each segment, at a time, and the speaker label defines which one it is. We are interested in clustering these segments into speaker clusters.
To be able to do so, we have to know how to compute the clustering posterior P(L | R), where L denotes the set of all speaker labels and R is the set of all speech segments. So let's look closer at how this posterior looks. It can be expressed as a ratio, where in the numerator we have a product of two terms: one of them is the prior of a given clustering, and the other is the likelihood of the clustering. In the denominator we have the sum of the same terms, and the sum here is over all possible partitions of the segments into clusters.
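In symbols, this is, as a sketch of the formula described here (notation is mine, with L' ranging over all possible partitions of the segments):
$$ P(L \mid R) \;=\; \frac{P(L)\,P(R \mid L)}{\sum_{L'} P(L')\,P(R \mid L')} $$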
Regarding the prior: in our experiments we are using a Chinese restaurant process prior; however, it is probably neither the only option nor necessarily the optimal one, it was just convenient for us, so we stuck with it. I am not going to discuss the prior any further in this presentation; from now on, consider it given, and we are going to concentrate instead on the likelihood.
If we look closer at the likelihood, we see that it can be represented as a product of the individual likelihoods of the sets of speech segments assigned to the individual speakers. If there are no segments assigned to some specific speaker, then the corresponding term is just one, so it does not affect the product. All the segments assigned to speaker s are assumed to belong to the same speaker, that is, they share the same speaker variable.
So we can represent each of these terms as the following integral. Here the integration is over the speaker variable, and under the integral we have the product of the prior over the speaker variable and the product of the likelihood terms of the individual speech segments given the speaker variable.
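Written out, the per-speaker term is (again in my notation, with R_s denoting the segments assigned to speaker s):
$$ P(R_s) \;=\; \int p(y)\,\prod_{i \in s} p(r_i \mid y)\; dy $$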
Now I am going to discuss which assumptions and restrictions on the model we have to make to be able to compute it efficiently.
As you can see in the graphical model, a speech segment and its speaker variable are not connected directly, but through the hidden embedding, so we have to integrate the embedding out to be able to compute this likelihood.
And that's exactly what we do here. The integration is over the hidden embedding, and under the integral we have the product of two terms. The first one models the relation between the hidden embedding and the hidden speaker variable, and we propose to model it with a Gaussian PLDA model. The next term models the relation between the speech segment, or rather the features we extract from it, and the hidden embedding. If the first term is Gaussian, and the second one is also Gaussian as a function of x, then the whole integral can be computed in closed form.
So, basically, the first assumption that we make in our model concerns the exact form of the probability of the speech segment given the hidden embedding: it can be represented as the product of a Gaussian distribution that is a function of x and a non-negative normalizing function h which depends only on the speech and not on the embedding. Plugging this into the likelihood formula, we see that the likelihood can be expressed with this equation. Note that here the likelihood depends on the parameters of the PLDA, which is the within-class precision matrix W, and also on the parameters of the embedding distribution, which are x-hat and B, where x-hat is the mean of the embedding distribution and B is its precision matrix.
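As a sketch of the two Gaussian assumptions and the resulting closed form (my reconstruction from the description, so the exact parameterization in the paper may differ):
$$ p(r_i \mid x_i) = h(r_i)\,\mathcal{N}\!\big(\hat{x}_i;\, x_i,\, B_i^{-1}\big), \qquad p(x_i \mid y) = \mathcal{N}\!\big(x_i;\, y,\, W^{-1}\big) $$
$$ \Rightarrow\quad p(r_i \mid y) = \int p(r_i \mid x_i)\, p(x_i \mid y)\, dx_i = h(r_i)\,\mathcal{N}\!\big(\hat{x}_i;\, y,\, W^{-1} + B_i^{-1}\big) $$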
Now, even though we have a closed-form solution for this likelihood, it would be very impractical to use: we would have to perform one costly matrix inversion for each speech segment at test time, and that would simply be too slow for a real application.
So we propose to restrict our model to a two-covariance model instead of a general Gaussian PLDA. We do this because we know that the within-class and across-class covariances can always be simultaneously diagonalized, so if we assume the two-covariance model, we can set the loading matrix to identity and assume that the within-class covariance, and consequently its precision, is diagonal. And since we are free to choose the form of the embedding parameters, we also restrict the embedding precision matrices to be diagonal. Then the whole likelihood expression greatly simplifies, as shown on this slide.
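Concretely, the simplification as I understand it: with a diagonal within-class precision $W$ and diagonal embedding precisions $B_i$, the covariance $W^{-1} + B_i^{-1}$ stays diagonal,
$$ \big(W^{-1} + B_i^{-1}\big)_{jj} = \frac{1}{w_j} + \frac{1}{b_{ij}}, $$
so the per-segment "matrix inversion" reduces to element-wise reciprocals.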
Now, getting back to diarization: we are interested in computing the clustering posterior, and for that we need the likelihood of a set of speech segments belonging to the same speaker, given the partition. This was written as the integral shown before, and now we know the expressions for the terms under the integral, which are Gaussian. So we have a product of Gaussians under the integral; also, the prior here is the standard normal distribution, as assumed by the PLDA model. So we can compute the whole integral in closed form, and the result is given here on this slide.
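For completeness, here is a sketch of that closed form under the assumptions above, obtained with standard Gaussian identities (my notation, not necessarily the exact expression from the paper). With $\Lambda_i = \big(W^{-1}+B_i^{-1}\big)^{-1}$,
$$ P(R_s) \;\propto\; \int \mathcal{N}(y;\, 0,\, I)\, \prod_{i \in s} \mathcal{N}\!\big(\hat{x}_i;\, y,\, \Lambda_i^{-1}\big)\, dy, $$
$$ \log P(R_s) \;=\; \tfrac{1}{2}\Big( b^\top P^{-1} b \;-\; \log |P| \Big) \;+\; \text{terms that do not depend on the partition}, \quad P = I + \sum_{i \in s} \Lambda_i,\;\; b = \sum_{i \in s} \Lambda_i \hat{x}_i. $$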
Please note that even though we can compute this likelihood, or rather the log-likelihood, exactly only up to an additive constant, it does not really matter: in both our training and test recipes these constants are going to cancel, so we can just ignore them.
All right, so to compute the clustering posteriors we need the within-class precision matrix of the PLDA, and, for each segment, the embedding mean and the vector holding its diagonal precision. We propose to model these by using a standard pretrained x-vector extractor, which is shown in grey on the scheme.
This is a standard x-vector extractor, which was trained in the usual way and was not modified afterwards. Normally, the embedding would be the output of the first affine layer after the statistics pooling layer; here we just throw away the rest of the network, as we do not really need it, and add one new linear layer, and the output of this layer will be the mean of the embedding distribution.
Also, we add a sub-network which is responsible for extracting the embedding precision. This is a feed-forward network with two hidden layers, and its inputs are the output of the statistics pooling layer and also the length of the segment in frames. Its output is the vector which holds the diagonal of the embedding precision. Both the mean and the precision are then fed into the PLDA.
All of these yellow blocks can be trained together, jointly and discriminatively.
Let me note that if we just ignored this lower branch, then we would be back to the standard x-vector Gaussian PLDA recipe: the linear transformation and the within-class precision together simply define a PLDA model trained on x-vectors extracted from the original network.
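To make the architecture concrete, here is a minimal sketch of the two heads sitting on top of a frozen, pretrained x-vector trunk, written in PyTorch. The layer sizes, the use of softplus to keep precisions positive, and the way the segment length is fed in are my assumptions for illustration, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class ProbabilisticEmbeddingHead(nn.Module):
    """Maps pooled statistics from a pretrained x-vector network to the
    mean and diagonal precision of an embedding distribution (sketch)."""

    def __init__(self, stats_dim=3000, emb_dim=128, hidden_dim=256):
        super().__init__()
        # Linear layer producing the embedding mean (plays the role of the
        # usual affine layer after statistics pooling).
        self.mean_head = nn.Linear(stats_dim, emb_dim)
        # Small feed-forward net with two hidden layers producing the
        # diagonal precision; it also sees the segment length in frames.
        self.prec_head = nn.Sequential(
            nn.Linear(stats_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim), nn.Softplus(),  # keep precisions > 0
        )

    def forward(self, pooled_stats, num_frames):
        # pooled_stats: (batch, stats_dim); num_frames: (batch,)
        mean = self.mean_head(pooled_stats)
        prec_in = torch.cat([pooled_stats, num_frames.unsqueeze(1).float()], dim=1)
        precision = self.prec_head(prec_in)
        return mean, precision  # parameters of a diagonal Gaussian over the embedding
```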
So how do we train it? We propose to use a multiclass cross-entropy criterion to train the model parameters. For that, we reorganize the training set as a collection of supervised trials, each of which contains a set of eight speech segments and the corresponding speaker labels, which define the true clustering of these eight segments. We use just eight segments for a reason: for a higher number of segments, it would simply be too computationally expensive to compute the posteriors.
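In other words, as I understand the criterion, it is the average negative log posterior of the true clustering, where the normalization in the posterior runs over every possible partition of the eight segments. Since the number of partitions of n items is the Bell number, eight segments give a manageable 4140 partitions, while the count explodes quickly for larger n:
$$ \mathcal{C} \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log P\big(L_n^{\text{true}} \mid R_n\big) \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log \frac{P\big(L_n^{\text{true}}\big)\prod_{s} P\big(R_{n,s}\big)}{\sum_{L'} P(L')\prod_{s'} P\big(R_{n,s'}\big)} $$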
So, once we have trained the model with this criterion, we can use it for diarization. Now let's look at the two recipes:
our baseline approach and the one that we propose. As the baseline, we use the standard Kaldi diarization recipe, which extracts an x-vector for each short segment; the x-vectors are then preprocessed, and the processed x-vectors are fed into the PLDA, which provides a matrix of pairwise similarity scores. These scores are then used by an agglomerative hierarchical clustering (AHC) algorithm, which is a greedy algorithm starting with each segment assigned a separate speaker label and then gradually merging clusters, two at a time. The baseline uses a version of this algorithm which, after each merge, computes the similarity scores of the new cluster against all the rest by simply averaging the scores of the individual parts of this cluster. The merging stops once there are no similarity scores higher than some preset threshold.
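Here is a small sketch of that greedy clustering loop with score averaging, just to illustrate the baseline behaviour described above; the real recipe, its score definitions, and stopping details may differ.

```python
import numpy as np

def ahc_average_linkage(scores: np.ndarray, threshold: float = 0.0):
    """Greedy agglomerative clustering over a matrix of pairwise similarity
    scores: repeatedly merge the two most similar clusters, scoring a merged
    cluster against the rest by averaging its members' pairwise scores, and
    stop when no pair scores above the threshold (sketch of the baseline)."""
    n = scores.shape[0]
    clusters = [[i] for i in range(n)]             # start: one cluster per segment
    while len(clusters) > 1:
        # cluster-to-cluster score = average of pairwise segment scores
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([scores[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        if best < threshold:                        # nothing similar enough left
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]     # merge the best pair
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```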
In our recipe, we use not only the x-vector but also the output of the statistics pooling layer and the number of frames in each segment. We center and length-normalize the x-vectors, and then use them, together with the pooled statistics, to obtain the probabilistic embeddings. Finally, the PLDA similarity scores are used by AHC as before; however, in our case, after each merge we compute the log-likelihood ratio scores exactly for the new cluster against all the rest, instead of averaging.
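My understanding of that exact score is a log-likelihood ratio of the "same speaker" hypothesis for two candidate clusters, which the model above can evaluate in closed form, i.e. something like:
$$ \mathrm{LLR}(a, b) \;=\; \log \frac{P\big(R_a \cup R_b \text{ from one speaker}\big)}{P(R_a)\,P(R_b)} $$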
Now to the experimental setup. We used VoxCeleb 1 and 2 to train the x-vector extractor and the baseline PLDA. Then we used the AMI dataset to train the uncertainty extractor, which is the small network extracting the embedding precisions, and also to retrain the PLDA. Finally, we used the DIHARD 2019 development and evaluation sets to test the diarization performance.
And here are the results. First, I have to note that the results in the table here are slightly different from those in the paper. That is because, after submitting the paper, I managed to improve the baseline performance, so I have regenerated and updated the results shown here.
For each model here, we have two sets of results. One is where a zero threshold stops the agglomerative clustering, and the other is where the threshold is tuned on the development set. If a model produced correctly calibrated log-likelihood ratio scores, then zero would be the maximum-likelihood optimal threshold; if that is not the case, then with a tuned threshold we can still hope for reasonable diarization error, which is clearly what happens for all the systems listed here.
First, if we look at the baseline system, there is quite a large gap between the optimal performance and the performance when using the zero threshold. However, if we just replace the baseline version of AHC with ours, where we compute the log-likelihood ratios exactly after each merge, then we see that the calibration issue becomes even more prominent: the results with the zero threshold degrade substantially, and even the optimal results get quite a bit worse than the baseline.
Note that here we did not retrain anything; we only changed the clustering algorithm. If we train the same model, but without using the probabilistic embeddings, that is, we just train it with the plain multiclass cross-entropy as discussed before, then this calibration issue is solved to a large extent: the difference between the zero threshold and the tuned one is no longer as dramatic, and we even managed to slightly improve over the zero-threshold baseline performance.
Finally, if we add to this model the embedding precisions, so that we are using the uncertainty information, then we further improve both the zero-threshold performance and the optimal one. In the tuned-threshold setting, I cannot say that this system gives us the best performance overall: on the development data, with a tuned threshold, we can still do better with the baseline; but in this case the two are very close, and the difference between the optimal performance and the zero-threshold performance is already not as large as it is for the other models.
So, finally, to the conclusion. We proposed a scheme to jointly train the PLDA and the embedding extractor with a multiclass cross-entropy criterion, and this discriminative training helps to eliminate the calibration problem of the original baseline method. Then we added the uncertainty extractor to the training, and training it together with the PLDA further improves calibration. The main take-away message here would be that even though the model we propose does not necessarily give the best performance, it results in a better calibrated system, which is more robust. So that was it from me. Thank you for your attention, and goodbye.