Speech Transcript - Influence of transition cost in the segmentation stage of speaker diarization

I'm here to present this work, the influence of transition cost in the segmentation stage of speaker diarization.

I'm Beatriz Martínez, and I work in the Speech Technology Group at the UPM, the Universidad Politécnica de Madrid.

So here is the outline I'm going to follow in this presentation.

First of all, I'm going to explain the baseline system, the baseline architecture that we have used, with all its modules and stages, focusing on the segmentation and clustering stage, which is basically the diarization stage.

That is where we have been making changes and analysing results.

What we wanted in this work was to analyze the effect of the parameters involved in the duration of the speaker turns. I mean, a turn can be very long or very short, and we wanted to know which parameters are involved in this decision and how much they are affecting our system.

Then I will present the experiments we have done with the development dataset, all the analysis, and some conclusions.

Here is the baseline system architecture. We work with multiple signals from multiple microphones, so that's our input, which we first filter to reduce noise.

Then we extract from these various signals the delay between them, that is, the time delay of arrival, and we use this information for two things.

First is the acoustic fusion, which is, as you probably know, to create a single acoustic signal by summing all the signals from the different microphones, each one delayed with respect to the others.

So at the end the sum is an enhanced acoustic signal with less noise.

We use this signal to extract the mel frequency cepstral coefficients, and also to extract information about where there is speech and where there is not, with the voice activity detector.

On the other hand, the delays, the TDOAs, are also used as an input to the last stage, that is, the segmentation and agglomerative clustering of the speech segments.

That is the stage where we actually decide who is speaking and when, so where we perform the diarization: the segmentation and clustering, I mean.
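The acoustic fusion step described above, summing the microphone signals after compensating the delay between them, can be sketched as follows. This is only a toy illustration: the function name `delay_and_sum`, the integer sample delays, and the toy signals are assumptions made for the example, not the system presented in the talk, where the delays are estimated from the recordings.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   delays[i] is how many samples channel i lags the reference
              (hypothetical integers; a real system estimates the TDOA).
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        count = 0
        for samples, d in zip(channels, delays):
            idx = t + d  # read the lagging channel further ahead
            if 0 <= idx < n:
                total += samples[idx]
                count += 1
        out.append(total / count if count else 0.0)
    return out


# Toy input: the same pulse reaches the second microphone two samples later.
mic1 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic2 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
enhanced = delay_and_sum([mic1, mic2], [0, 2])
```

After alignment the two copies of the pulse land on the same sample, so they add coherently, while uncorrelated noise on each channel would tend to average out.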

Here is a more detailed diagram of what this stage performs. We first do an initialization, an initial segmentation, that for us is uniform.

For the baseline system we segment uniformly into a number of clusters, which for us is sixteen. It could be more, but we use sixteen, because then we are iteratively going to reduce the number by merging the hypothesized clusters.

Then, after this initial segmentation, we start a loop of segmentation and training.

In this stage, with the current segmentation we create models that we then use to resegment the signal, and at the end we will have a better segmentation according to the speaker models we have trained.

After the resegmentation, we compare these clusters one to one, in pairs, and we merge those that are most similar.

We use for that the Bayesian information criterion.

We do all this iteratively until there are no more clusters to merge, because the system considers that there are no more clusters that are similar enough to be merged, and it finishes.
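The merge loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the clusters are plain lists of numbers, and `merge_score` is a toy stand-in (a threshold minus the distance between cluster means) for the real merge criterion; the retraining and resegmentation done on each iteration are only marked by a comment. All names are hypothetical.

```python
from itertools import combinations


def mean(xs):
    return sum(xs) / len(xs)


def merge_score(a, b, threshold=1.0):
    """Toy stand-in for a merge criterion: positive when the means are close."""
    return threshold - abs(mean(a) - mean(b))


def agglomerate(clusters, score=merge_score):
    """Merge the best-scoring pair until no pair scores above zero."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: score(clusters[ij[0]], clusters[ij[1]]),
        )
        if score(clusters[i], clusters[j]) <= 0:
            break  # no remaining pair is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
        # A real system would retrain the models and resegment here.
    return clusters


# Four initial clusters that really come from two sources.
initial = [[0.0, 0.2], [0.1, 0.3], [5.0, 5.2], [5.1, 4.9]]
final = agglomerate(initial)
```

The loop stops on its own when the best remaining pair no longer passes the criterion, which is how the number of speakers is decided without being given in advance.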

score

that's features

Now we are moving to the point. This is a diagram where you can see all the parameters that are actually involved in the duration of the speaker turns.

We have one parameter called minimum duration, which for us is 250 frames, 250 frames of ten milliseconds, so around 2.5 seconds.

That is like saying: okay, I want my speaker turns to be at least 2.5 seconds long, because if they are shorter I'm not very interested in them, so let's force the system to stay at least 250 frames.

Then there are these parameters, alpha and beta. We wanted to study them because of the influence they have on the duration of the speaker turns.

Alpha is the probability applied to remaining in the same speaker, to not moving to another, and beta is the one applied to moving from one cluster to another speaker.

One would set them to one, which is the simplest thing, so that they actually have no influence on the final decision.

But in the ICSI system the value used in the experiments is one over M, and we discovered that it came from the ICSI system.

The problem for us is not just that this parameter has some influence on the decision of moving from one speaker to another, but that this M is the number of active clusters, and our system iteratively reduces this number of clusters.

It goes from sixteen, sixteen for us because that is what we use, down to, well, by the final iteration it could be two or one.

One would be just the limit, but two, three, four; and in each iteration it is going to change.
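A short sketch may help to see the two quantities just mentioned: the minimum duration expressed in seconds, and how the log of a one-over-M weight drifts as the number of active clusters shrinks. The 10 ms frame shift is an assumption consistent with 250 frames being about 2.5 seconds, and the function names are hypothetical.

```python
import math

FRAME_SHIFT_S = 0.010  # assumed 10 ms per frame, so 250 frames is 2.5 s


def min_duration_seconds(frames, frame_shift=FRAME_SHIFT_S):
    """Convert a minimum duration in frames to seconds."""
    return frames * frame_shift


def baseline_log_weight(active_clusters):
    """Log of a transition weight equal to 1/M, M being the active clusters."""
    return math.log(1.0 / active_clusters)


# The weight is not fixed: it changes every time two clusters are merged.
schedule = {m: baseline_log_weight(m) for m in range(16, 1, -1)}
```

With sixteen clusters the log weight is about -2.77 and with two clusters about -0.69, so the penalty applied to a speaker change weakens as the clustering proceeds, independently of the data.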

Here you see when the system takes the decision to change. In this inequality, on one side we have the likelihood of some frames belonging to one cluster, and on the other side the likelihood of the same frames belonging to another cluster.

Those two are related to the data we have, so we are okay with this. But this other parameter, which we have called logarithm of K, is independent of the data. In the diagram of the previous slide it was the one over M.

It is independent, it has nothing to do with the data. And if K is lower than one, it is penalizing changes, because the logarithm of zero point whatever is negative.

And if it is higher, it is favouring these changes.

As I said, since K is one over M and M decreases in every iteration, K increases in every iteration, but it is always going to be lower than one. So in the baseline system we are always penalizing transitions, even though we really don't know if we want to. Right or wrong, we are doing it.

And we penalize less when we have a lower number of clusters, because K increases from one sixteenth upwards: with a lower number of clusters we have a higher probability of changes.

So we thought, well, let's set it constant. Maybe it works fine, and with this we remove this variability and take the decision based only on the data.

And of course, as we decided to do this experiment, we decided also to set this K to a fixed value.

A value lower than one maybe, so that this transition is still penalized but the penalty doesn't change over iterations.

But also maybe a value higher than one, so that we favour transitions, changes of speaker.

Those are the experiments.
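The decision just described compares two log-likelihoods with a data-independent term, log K, added on the side of the change. A minimal sketch, assuming the inequality has the form "log-likelihood of the other cluster plus log K is greater than log-likelihood of the current cluster" (the exact form used in the system is not given in the talk); the function name and the log-likelihood values are hypothetical.

```python
import math


def switches_cluster(loglik_current, loglik_other, k):
    """Return True when the other cluster wins after adding log K.

    k = 0 is treated as forbidding every transition (log 0 is minus infinity).
    """
    log_k = math.log(k) if k > 0 else float("-inf")
    return loglik_other + log_k > loglik_current


# Hypothetical log-likelihoods of the same frames under two cluster models.
ll_stay, ll_move = -100.0, -99.5
decisions = {k: switches_cluster(ll_stay, ll_move, k) for k in (1 / 16, 1.0, 3.0)}
```

With these toy values, K equal to one sixteenth blocks a change that the data alone would accept, K equal to one leaves the decision to the data, and K equal to three would accept the change even if the other cluster were slightly less likely.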

Here is the database I used.

we have the development set that is

probably switch task is to evaluate somebody eight meetings

from yes

you see two thousand two thousand and five two thousand six and seven

We have used all of that as the development dataset, and then the test set is RT09, from the two thousand nine evaluation.

Here are all the experiments we have done to study the effect of these parameters.

We wanted to check the effect on the system of this K, which we call the transition weight.

And we also wanted to evaluate its influence when we are also taking into account the minimum duration parameter I told you about, because both of them are actually influencing the duration of the speaker turns.

The value we use in the baseline is 250 frames. So there is the baseline, which is the flat line.

It is flat, of course, because the transition weight in the baseline is one over M, so it changes throughout the process.

And then we have all these other experiments here. I want you to notice that when the transition weight is one, it is like cancelling its effect, because the logarithm of one is zero, so there is no effect at all, only the data.

And if it is very low, it penalizes changes very much.

Actually, I have not plotted the value for a transition weight equal to zero, because it is like infinity, a vertical line. The error rate is very high, because the system penalizes transitions so much that at the end it assigns all the recording to one speaker, which is obviously a very high error rate.

Then we saw that with minimum duration equal to 200 we actually have a very stable area with a low error rate, and with a lower error rate than the baseline.

So maybe it's good to take this into consideration.

So let's see what happened at the end with the test set.

We chose three points out of all those points we had evaluated with the development dataset where the result was better than the baseline: one, two, three.

We also ran the system for a transition weight equal to one over M, which is the baseline, but with a minimum duration of 200.

That way we could compare the improvement due to the transition weight variation and due to the minimum duration separately, because the baseline uses minimum duration 250.

I liked very much the idea of cancelling it, because it is a parameter that is independent of the data. If we can cancel it and have better results, or at least results that are good enough, why not? It is something less to train for future experiments.

The problem is that on the test set it didn't go very well.

Not by very much, and if we compute the average of the two error rates it's good, but it was worse than the baseline.

What we found was that the best results are for a transition weight equal to three, which is actually favouring changes of speaker and reducing the duration of any speaker turn.

Conclusions. I think I have gone through these more or less during the presentation, so this is more like a summary.

This transition weight, I won't say we discovered it, because it was previously in the system that came from ICSI, which maybe some of you have worked with.

At least for us, we discovered that very small changes in it can affect the diarization very much, and that's why I would have liked at the beginning to have cancelled it.

But even if you don't want to cancel it, at least you have to know that it exists. If you want to work with the duration of the speaker turns, it's important to run experiments with both the transition weight and the minimum duration.

Also, a value of three actually gave better results, which for us is good.

And the main thing we learned from this is that if the variability of the speaker turns is very high, or can be very high, you must try to take it into account. Maybe cancelling it with this technique, setting it to one, is the best option, so you can make the system more robust for future experiments.

Well, that's more or less it, I think.

so we then proposed

thanks to multiple english

first of all

we look for this so that it's much platoons good solution from the whole circle

two six should also and so each time constant

so smooth the lasted a okay a sycamore to all the weights

we should

flirt

the phone but it's very important to train them

the show a high constant

i know not so the remote were used for training the transition probabilities

in rooms do not want to work with them or whatever how to cope with

this remote is

and it's as much as the solution

this transition

the motivation and the results

i dunno why the snow

okay well

i don't use the word

those in differences

all rates were all the routes two goals in the logo to go all this

as a constant

it's a cost and is

one two three doesn't matter at all

so in quantum o one with the home and speaker of the

because why you try

well

It is three, and yes, it is a constant value, but it is a different number, and the decision is taken when this inequality is fulfilled.

In this inequality you have the likelihood that these frames belong to this cluster, and the likelihood that these same frames belong to another, completely different cluster, and then the weight.

If it is high, it forces the change of cluster. If it is very low, it makes it harder to go to another cluster.

That's why its value matters: you either favour the changes or you penalise them.

questions

so why i of course there is also probably

you transition words so

we can use

okay for volume of english could be thing

in there as well

first for ratio between the core model on the

the

moreover maybe

become the new speaker

so it's

so do you think is sort of threshold are just

it would be dependent on the task of the database just have one still because

i haven't actually an take a nice okay i have a right

that's why i think is that for the system to be more robust in the

to be using future task or you know that databases and

well it is the speaker out of the rights and in different meetings before and

databases that have a slightly longer duration maybe i speak and a lot at all

their interface the on a sorta

and if you are in that room with four people just don't well it depends

on the basically see that that's why i tied states yes okay if a if

you can that's a similar results yes have you

is a time you don't have to train and that's always

if you it you have a similar result or something in that this would to

know what you have you will have a less work to do you know used

to let that the c sent one you that bayes and unique the menu or

and

straightforward

get rid of this problem, right. Also, this is preliminary work, and I would like to use this somehow. I really don't have any clear idea yet of what to do.

But I think that if I somehow had an idea of how long the speaker turns are going to be, or how many speakers there are, or maybe some information about the role of the speakers in the room, that could actually be used to set this parameter.

Extracting this information in unsupervised diarization could be tricky, but still I think you could then adjust this parameter to get better results.

but not

okay

show you questions

Influence of transition cost in the segmentation stage of speaker diarization

Speaker Clustering and Diarization

Beatriz Martínez-González, José M. Pardo, Rubén San-Segundo, J.M. Montero