Okay.

Thank you. My name is Abraham. I'll present the work we have carried out on extracting i-vectors from short- and long-term speech features for speaker clustering. This is joint work with Jordi Luque and Javier Hernando.

The outline of the presentation is as follows. We will describe the objectives of our research and the main long-term features used in our experiments. We will also present the baseline and the proposed speaker diarization architectures, then describe the fusion techniques carried out in speaker segmentation and speaker clustering, and finally the experimental setup and conclusions will be presented.

So, first of all, speaker diarization consists of two main tasks: speaker segmentation and speaker clustering. In speaker segmentation, a given audio stream is split into speaker-homogeneous segments, and in speaker clustering, the speech clusters that belong to a given speaker are grouped together.

The main motivation for this work is that, in our previous work, we have shown that the use of jitter and shimmer and prosodic features improves the performance of GMM-based speaker diarization systems. Based on this, we have proposed the extraction of i-vectors from these voice quality and prosodic features, and we then fuse their cosine distance scores with those of the MFCCs for the speaker clustering task.

Here, in the feature extraction, we select different sets of features from the voice quality and prosodic domains. From the voice quality we extract features called absolute jitter, absolute shimmer and shimmer APQ3, and from the prosodic ones we extract the pitch, the intensity and the first four formant frequencies. Once these features are extracted, they are stacked into the same feature vector. Then we extract two different sets of i-vectors: the first set from the MFCCs and the second set from the long-term features. The cosine similarity between these two sets of i-vectors is then used for the speaker clustering task.
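Throughout, the cosine score between two i-vectors is the standard normalized inner product (textbook background, not specific to this work):

\[
\mathrm{score}(w_1, w_2) = \frac{w_1^{\top} w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}
\]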

These are the main speech features used in our experiments: the MFCCs, the voice quality features, that is jitter and shimmer, and the prosodic ones. From the voice quality features we have selected three different measurements based on previous studies. These are absolute jitter, which measures the variation between two consecutive periods; absolute shimmer, which measures the variation of the amplitude between consecutive periods; and shimmer APQ3, which is similar to absolute shimmer except that it takes into consideration three consecutive periods. From prosody we have extracted the pitch, the intensity and the formant frequencies.
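A minimal sketch of these three measurements, assuming the pitch periods and the per-period peak amplitudes have already been estimated, and following the commonly used Praat-style definitions (an assumption; the exact formulas used here may differ in detail):

```python
import numpy as np

def absolute_jitter(periods):
    """Mean absolute difference between consecutive pitch periods (seconds)."""
    t = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(t))))

def absolute_shimmer_db(amplitudes):
    """Mean absolute dB difference between peak amplitudes of consecutive periods."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1]))))

def shimmer_apq3(amplitudes):
    """Three-point amplitude perturbation quotient: each amplitude is compared
    with the average over itself and its two neighbours, i.e. three
    consecutive periods, then normalized by the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    local_mean = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return float(np.mean(np.abs(a[1:-1] - local_mean)) / np.mean(a))
```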

When it comes to the speaker diarization architecture, I'll first describe the baseline system. Given a speech signal, we take the speech/non-speech labels from an oracle. The main reason we are using the oracle speech activity detection is that we are mainly interested in the speaker errors, rather than in the speech activity detection errors. Then we extract the MFCCs, the jitter and shimmer, and the prosodic features only for the speech frames, and the jitter and shimmer and the prosodic ones are stacked into the same feature vector.

Then, based on the size of the data, an initial number of clusters is set: if the show is longer, we have a larger number of clusters, and if it is shorter, we have a smaller number of clusters, so the initial number of clusters depends only on the duration of the audio signal. We then assign the speech segments to these initialized clusters. Next we perform the HMM decoding and training process, and we get two different log-likelihood scores: the first one for the short-time spectral features and another score for the long-term features. These two scores are fused linearly in the speaker segmentation, and the speaker segmentation then gives us a set of clusters.

For the clustering we use the classical BIC computation technique: we compute the pairwise similarity between all pairs of clusters, and at each iteration the two clusters that give the highest BIC score are merged. This process iterates until the highest BIC value among the clusters is less than a specified threshold value.
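A schematic of this agglomerative loop, as a sketch only; bic_score and merge are hypothetical helper functions standing in for the actual delta-BIC computation and the merging of two clusters:

```python
def bic_clustering(clusters, bic_score, merge, threshold):
    """Agglomerative clustering: repeatedly merge the pair of clusters with
    the highest BIC score until that score drops below the threshold."""
    while len(clusters) > 1:
        # Pairwise similarity between all pairs of clusters.
        pairs = [((i, j), bic_score(clusters[i], clusters[j]))
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        (i, j), best = max(pairs, key=lambda p: p[1])
        if best < threshold:  # stopping criterion
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```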

So this is the classical BIC computation. In our work, the initialization and the speaker segmentation are the same; the main change we introduce is in the speaker clustering, where the GMM/BIC computation is replaced by i-vector clustering.

This is our proposed architecture. Given the set of clusters that are the output of the Viterbi segmentation, we extract two different sets of i-vectors: the first set is from the MFCCs, and the second one is from the jitter/shimmer and the prosodic ones. We use two different universal background models: the first one is for the short-term spectral features, and the second one is for the long-term features. The UBMs and the T matrices are trained on the same type of data: we have selected one hundred shows, with a total duration of forty hours, to train them, and the i-vectors are extracted using the ALIZE toolkit.
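For reference, both extractors follow the standard total-variability model (textbook i-vector background, not a claim about this system's internals):

\[
M = m + T\,w
\]

where $M$ is the utterance-dependent GMM mean supervector, $m$ is the UBM mean supervector, $T$ is the low-rank total-variability matrix and $w$ is the i-vector; here a separate $(m, T)$ pair is trained for the short-term and for the long-term feature stream.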

The stopping criterion of the clustering is normally based on a specified threshold value: when the highest score falls below the specified threshold, the system stops merging. To find the optimum threshold value, we have used a semi-automatic way of selecting it.

For example, in this figure we have displayed how we have selected the lambda value, the stopping criterion, for five shows from the development set. One set of curves shows the highest cosine distance score at each iteration, and the black ones are the diarization error rates at each iteration. The horizontal dashed line is the lambda value selected as the threshold to stop the process. For example, for the first show, the system stops at the fourth iteration, because at the fourth iteration the maximum cosine distance score is less than this threshold value. We have applied this technique on the whole development set, and the lambda value obtained is applied directly on the test set.
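A sketch of one way such a lambda could be chosen automatically from the development traces (an illustration only; the actual selection was semi-automatic):

```python
import numpy as np

def select_lambda(dev_traces, candidates):
    """dev_traces: one trace per development show, each a list of
    (max_cosine_score, der) pairs, one pair per merge iteration."""
    def der_when_stopped(trace, lam):
        for max_score, der in trace:
            if max_score < lam:  # merging stops at this iteration
                return der
        return trace[-1][1]      # threshold never reached: last iteration
    avg_der = [np.mean([der_when_stopped(t, lam) for t in dev_traces])
               for lam in candidates]
    return candidates[int(np.argmin(avg_der))]
```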

We have used two different fusion techniques: one in the speaker segmentation and the other in the speaker clustering. In the segmentation, the fusion technique is based on the log-likelihood scores: for a given segment we get two different scores, one for the short-term spectral features and the other for the long-term features. The log-likelihood score of the short-term spectral features is multiplied by alpha, and similarly the log-likelihood score of the long-term features is multiplied by the complementary weight, one minus alpha. The alpha value has to be tuned on the development data set.
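A minimal sketch of this linear fusion of the two log-likelihood streams:

```python
def fused_log_likelihood(ll_short, ll_long, alpha):
    """alpha weights the short-term (MFCC) stream and (1 - alpha) the
    long-term (voice quality + prosodic) stream; alpha is tuned on the
    development set."""
    return alpha * ll_short + (1.0 - alpha) * ll_long
```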

The fusion technique in the speaker clustering is carried out as follows. We have three different sets of features: the MFCCs, the voice quality and the prosodic ones. The long-term features are stacked together, and then we extract two different sets of i-vectors, one from the MFCCs and one from the long-term features. The cosine similarities between these two sets of i-vectors are then fused by a linear weighting function: the fused score is a weighted sum of the two cosine similarities, where beta is the weight applied to the cosine distance scores extracted from the short-term spectral features, and one minus beta is the weight assigned to the cosine distance scores extracted from the long-term features.
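A minimal sketch of the fused scoring between two clusters, each represented by one short-term and one long-term i-vector:

```python
import numpy as np

def cosine(w1, w2):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def fused_cluster_score(short_a, short_b, long_a, long_b, beta):
    """Linear fusion: beta weights the MFCC i-vector score and (1 - beta)
    the long-term (voice quality + prosodic) i-vector score."""
    return (beta * cosine(short_a, short_b)
            + (1.0 - beta) * cosine(long_a, long_b))
```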

When we come to the experimental setup: we have developed and tested the experiments on the AMI corpus, which is a multi-party and spontaneous corpus of meeting recordings. Normally, in the AMI shows the number of speakers ranges from two up to five, but mostly the number of speakers is four; it is multi-channel, with a variety of recording conditions. We have selected ten shows as a development set to tune the different parameters, that is, the weight values and the threshold values.

Then we have defined two experimental setups. The first one is a single-site setup, where ten shows have been selected from Idiap; the other one is a multi-site setup, where we have selected ten shows from the Idiap, Edinburgh and TNO sites. The optimum parameters obtained from the development set are directly used on these single-site and multi-site shows. We have used two different sets of i-vectors for the short- and long-term features, also tuned on the development set, and we have used the oracle speech/non-speech references as the speech activity detection, so the diarization error rate reported in this work corresponds mainly to the speaker errors; the missed speech and the false alarm errors have a zero value.

If we look at the results: the baseline system, which is based on MFCCs and GMM/BIC clustering, represents the state of the art. When we use jitter and shimmer and prosody, both in the GMM and in the i-vector clustering techniques, the performance improves a lot compared to the baseline. If we then compare the i-vector clustering techniques with the GMM ones, the i-vector clustering techniques again provide better results than the GMM clustering technique. And we can also conclude, comparing the two i-vector clustering techniques, the one based only on short-term spectral features and the one using the two different sets of features, that the latter provides better results than using only the i-vectors from the short-term features.

We have also done some further work after the paper was submitted: we have tested PLDA scoring in the clustering stage. The PLDA clustering, as shown in the table, whether it uses only one set of i-vectors or the two sets of i-vectors, provides better diarization results than both the GMM and the cosine-scoring techniques.

One of the issues in speaker diarization is that the diarization error rate varies a lot among the different shows: one show may give us a small DER, like five percent, and another show may give us a DER of around fifty percent. For example, this box plot shows the DER variation for the multi-site and the single-site setups: this one is the DER variation for the single site, and the grey one is the DER variation for the multi-site. Here we see the highest DER and here the lowest DER, so we can see that there is a huge variation between the maximum and the minimum.

If we look here, the use of long-term features, both in the GMM and in the i-vector clustering techniques, helps us to reduce the DER variation among the different shows. The other thing we can see is that the i-vector clustering techniques based on the short-term plus long-term features give us lower errors; at least we can say they again reduce the DER variation among the different shows. And finally, this one, the i-vector clustering technique based on the short-term and long-term features, gives us the lowest variation among the different shows.

In conclusion, we have proposed the extraction of i-vectors from short- and long-term speech features for the speaker clustering task. The experiments carried out demonstrate that the i-vector clustering techniques provide better diarization error rates than the GMM clustering ones, and also that the extraction of i-vectors from the long-term features, in addition to the short-term ones, helps us to reduce the DER. So we can say that the extraction of i-vectors and the use of i-vector clustering techniques are helpful for speaker diarization systems.

Thank you.

Now it's time for questions.

I was wondering if you could explain the process you are using for calculating the jitter and shimmer, and did you find it to be a robust process across the TV shows?

Normally, our shows are in the meeting domain; it is a meeting domain, it's not a TV show.

But when we extract them frame by frame, the problem we face is that if the speech is unvoiced, we get zero values. So we compensate for them by averaging over a five-hundred-millisecond duration: we extract the features over a thirty-millisecond duration, and then we compensate the zero values of the unvoiced frames by averaging over a five-hundred-millisecond window.
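A minimal sketch of this compensation, assuming per-frame values at a 10 ms step and that the zeros are excluded from the average (both assumptions):

```python
import numpy as np

def smooth_voice_quality(values, frame_step=0.01, window=0.5):
    """Replace per-frame values (zero on unvoiced frames) by their average
    over a 500 ms sliding window, ignoring the unvoiced zeros."""
    v = np.asarray(values, dtype=float)
    half = int(round(window / frame_step / 2))
    out = np.zeros_like(v)
    for i in range(len(v)):
        seg = v[max(0, i - half): i + half + 1]
        voiced = seg[seg > 0]  # keep only voiced (non-zero) frames
        if voiced.size:
            out[i] = voiced.mean()
    return out
```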

You also said in one of your slides that there is training from the development set. How did you find it or train it, how did you find that threshold, and did you experiment with changing the threshold value?

You mean in the segmentation, I think? This one?

No, in the formula you presented for the segmentation.

This one? Or here?

So you mean the alpha values: they have been manually tuned on the development set. We tested different weight values for the two features, and these values are directly applied on the test set.

Okay, so they are fixed in the test experiments?

Yes, they are fixed.

Thank you for a very clear presentation. I just wanted to understand a little bit about the physical motivation. Do you have an explanation for why you went to jitter and shimmer and prosody? For example, we expect pitch to be quite important. How did you converge on these two: did you go through a selection process to get to them, or do you have some intuition or explanation for it?

So you are asking why we are interested in the extraction of the jitter/shimmer and prosodic features?

Yes, how did you zero in on them; what is your sort of physical intuition for using them, as opposed to other long-term features?

Because they are voice quality measurements, they are speaker-specific, so they can be used to discriminate the speech of one person from another.

So your hypothesis is that there would be a significant difference between speakers, and that this would be robust to whatever channel the speech is going through. But these measurements can be extremely delicate, if you will. If you extended this outside this dataset, for example to real-life recordings, would you worry about the sensitivity of these features you are looking at?

Okay. For example, jitter and shimmer have also been used for speaker verification and recognition on other databases, and that is the reason why we applied them to speaker diarization. We have checked jitter and shimmer on the AMI corpus, which is what I'm presenting here, and we have also extracted them on another corpus, a Catalan TV-show corpus, and there we also got some improvements.

So in both corpora it helps. Would there be any other measurements, do you think, that you could add to these two?

Note that there are different types: we have about ten or eleven types of jitter and shimmer measurements, but we have selected these three based on previous studies on speaker recognition. Maybe we can also check the others.

Do you have a question?

I do have one question. It's about the stopping criterion: you are not assuming that you know the number of speakers beforehand?

That's right, we do not know the number of speakers; only the speech/non-speech labels are given, under oracle conditions.

Any other questions? There are no more questions, so let's thank the speaker again.