Speech Transcript - Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

hello welcome everybody in this presentation i will show you some of my

work in speaker clustering

but before starting i would like to define two things the first one is the

speaker clustering problem that we want to solve we have an audio database in which

the audios belong to unknown speakers and also we have an unknown number of

speakers

and the second one is that we will talk about audio database characteristics in this presentation

when we refer to this term we are thinking of things such as the number

of audios or how many audios

each speaker has

first of all i will present you the outline of the presentation

we will start with the motivation

later i will present you the clustering algorithm that we have been using

later we will see the

variables that we have studied and we will conclude

with some experiments studying the stopping criteria

if we talk about the motivation the question is why suppose that we

receive a request from a client that is interested

in getting a clustering based solution

and one common question that we have to deal with is okay

how well is your system working

and for that purpose we will ask them to give us an audio

database as similar as possible

to the one that will be used

later in the system and with that database we will run some tests and we

will be able to say okay we expect

to have similar results to these ones but

based on our experience we have seen that a clustering task

may have

very different results depending on the database so we also must say be

careful because if the distribution of audios and speakers in the database is different

from what we have now

you may have

very different results

and then

of course they ask how can we expect

those results to change

and so that is what we tried to answer with several experiments and some

of those experiments are the ones that i am presenting here

okay so now we know what we want to do first of all i will

present the clustering algorithm that we are using

we consider an agglomerative hierarchical clustering that is

a clustering algorithm that starts from a partition in which each audio is

identified with one single cluster and then iteratively we merge the closest

clusters

to completely define our algorithm we will have to fix three things the

first one is the distance metric and for this purpose we will consider

the scores provided by the plda system so

before running the clustering algorithm

we compute all the all versus all scores for the evaluation database and we will use

those scores to build the similarity matrix

we also need to define a linkage method and we will use minimum

distance

and also we have to fix

the stopping criterion and we consider a score based threshold particularly

a maximum distance threshold that is if when two clusters are merged the distance is

above a certain threshold we will stop

and otherwise we will continue

merging clusters
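
The algorithm described so far (start from one cluster per audio, merge the closest pair with minimum-distance linkage, stop when the merge distance exceeds a threshold) can be sketched as follows. This is only an illustrative sketch, not the speaker's implementation: the function name `ahc_min_distance`, the toy matrix and the threshold are hypothetical, and the sketch assumes the pairwise scores have already been turned into distances (for example by negating them).

```python
def ahc_min_distance(dist, threshold):
    """Agglomerative clustering with minimum-distance (single) linkage.

    dist: symmetric list of lists with pairwise distances between audios
          (assumption: derived from the all versus all scores).
    threshold: stop when the closest pair of clusters is farther than this.
    Returns a list of clusters, each one a sorted list of audio indices.
    """
    # initial partition: every audio is its own cluster
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # minimum-distance linkage: distance of the closest members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # stopping criterion: next merge is above the threshold
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# toy distance matrix for four audios (made-up values)
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
result = ahc_min_distance(toy, threshold=0.5)
```

With the toy matrix the two cheap merges (0.1 and 0.2) are accepted and the third one (0.7) is rejected, so two clusters remain.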

regarding the performance measures we will use those

defined by david van leeuwen in one of his works those

are the speaker impurity and the cluster impurity speaker impurity measures how spread the

audios of each speaker are

over all the clusters

while cluster impurity measures how corrupt clusters are and when we say that one

cluster is corrupt we refer to the fact that it

has audio from many different speakers
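
As an illustration of the two measures, here is a minimal sketch under an assumed majority-based definition: an audio counts as impure when it does not carry the most frequent label of its group. The exact definition used in the cited work may differ, and the function name `impurities` is hypothetical.

```python
from collections import Counter


def impurities(speakers, clusters):
    """Return (speaker_impurity, cluster_impurity) for one partition.

    speakers[i]: true speaker of audio i
    clusters[i]: cluster assigned to audio i
    Assumption: majority-based impurity, not necessarily the cited definition.
    """
    n = len(speakers)

    def impurity(groups, labels):
        # per group, count the audios that do not carry the majority label
        per_group = {}
        for g, l in zip(groups, labels):
            per_group.setdefault(g, Counter())[l] += 1
        wrong = sum(sum(c.values()) - max(c.values()) for c in per_group.values())
        return wrong / n

    # cluster impurity: clusters that mix audios from different speakers
    cluster_impurity = impurity(clusters, speakers)
    # speaker impurity: speakers whose audios are spread over several clusters
    speaker_impurity = impurity(speakers, clusters)
    return speaker_impurity, cluster_impurity


true_speakers = ["a", "a", "a", "b", "b"]
example = impurities(true_speakers, [0, 0, 1, 1, 1])
```

In the initial partition (one cluster per audio) the cluster impurity is zero and the speaker impurity is high; each merge trades one against the other.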

if we compute

those values at each iteration of the clustering process

and we plot

all those points in a graph

we will get impurity trade-off curves such as the one that

we have here in this slide

we will use these graphs

to measure the performance of our clustering experiments during

the whole presentation

and as a reference

point we will take the equal impurity working point that is when we have

the same speaker impurity and cluster impurity
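
The reference working point can be located on a sampled curve by picking the iteration where the two impurities are closest. The sketch below assumes the curve is available as one pair of values per iteration; the function name and the curve values are made up for illustration.

```python
def equal_impurity_point(curve):
    """curve: list of (speaker_impurity, cluster_impurity), one per iteration.

    Returns (iteration, pair) for the point where both impurities are closest,
    which approximates the equal impurity working point on a sampled curve.
    """
    it = min(range(len(curve)), key=lambda k: abs(curve[k][0] - curve[k][1]))
    return it, curve[it]


# made-up curve: speaker impurity falls and cluster impurity rises as we merge
toy_curve = [(0.60, 0.00), (0.40, 0.05), (0.20, 0.15), (0.05, 0.45), (0.00, 0.80)]
point = equal_impurity_point(toy_curve)
```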

before we start with the experiments i will present you the database that we

have used we consider

audios from nist that are

telephone channel

and with a three hundred seconds duration and here in the graph you can see

the audios per speaker distribution that we have in this database

okay

to conduct the analysis of

our audio database we first need to define some

variables that are important in this part so we consider three

the first one is the size of the task

that is the number of audios we have in the database

the second one is the number of speakers that is the number of speakers that

we have in the database and the third one is the balance of speakers that measures

how many audios each speaker

has

and regarding the first variable we will perform different experiments in which

we modify the size of the task

we will start from the initial set of audios and we will split it

into

subsets of smaller size

so for example as you can see in the table we have

several subsets of each size and

for those sizes in which

we have more than one clustering task we will average the

results so that with one single curve

we can compare results between different sizes of the task

here we have the impurity trade-off curves that we obtained

in the horizontal axis we have cluster impurity and in the vertical axis we

have speaker impurity

and as we can see as we reduce

the size of the task we expect to have better results in our clustering problem

the second variable that we have analyzed is the number of speakers

and to characterize this experiment

we will use

a ratio that is defined as the number of speakers divided by the

number of audios

we can also have another interpretation of this variable

that is it allows us to know the

iteration in which we should stop since we want to stop when we have as

many clusters

as speakers
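
The link between this ratio and the stopping iteration can be made explicit. Assuming that the process starts with one cluster per audio and that every iteration merges exactly two clusters, reaching as many clusters as speakers takes the number of audios minus the number of speakers iterations. The function name below is hypothetical.

```python
def stop_iteration(num_audios, num_speakers):
    """Return (ratio, iterations) for a clustering task.

    ratio: number of speakers divided by number of audios
    iterations: merges needed to go from one cluster per audio
                down to one cluster per speaker
    """
    ratio = num_speakers / num_audios
    iterations = num_audios - num_speakers
    return ratio, iterations


few_speakers = stop_iteration(100, 5)    # stop close to the end of the tree
half_speakers = stop_iteration(100, 50)  # stop in the middle of the tree
```

A ratio of zero point five therefore corresponds to stopping exactly halfway through the clustering tree.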

we consider several groups of clustering tasks in which we vary

the number of speakers and all the tasks

have the same number of audios and given a task with a concrete number

of speakers

we will have the same number of audios per speaker

so as you can see in the table for example we will have tasks with

five speakers size one hundred and twenty audios per speaker

and

here we have the obtained results

and it is a little bit different from what we have seen in the previous experiment

but again we will have exactly the same information on the horizontal

axis

we have the ratio that we have defined before

and on the vertical axis we have the speaker impurity

and each

of the lines represents a constant cluster impurity value

so for example if we want to start with

the results suppose we would like to get

in our experiments a cluster impurity of one percent that is this curve

and we want to compare the results

between using zero point five and the other values and we see that

with

zero point five we get higher speaker impurity values

this means that

if our optimal solution

is found in the middle of the clustering tree we will

expect to have worse results

the last variable that we have studied is the balance of speakers

in the balance of speakers we try to study the scenario that we

are presenting in this slide that is

we have one speaker that has most of the audios in the

database and we have

another number of speakers that have much less audios

we also need to fix the ratio

that is the number of speakers divided by the number of audios and in

our tasks we will consider always the size of the task fixed

to forty so

giving the ratio is equivalent to

giving the number

of speakers

here we have

four scenarios in which we modify

the proportion of audios that the main speaker has

we start

from this one in which

the main speaker has

more or less the same number of audios as the others until this one in

which

the main speaker has much more audios than the others

if we

again

take a look at the results that is

the impurity trade-off curves

we see that

the first scenarios present similar results and as we increase the

proportion of audios that the main speaker has we

get better results

we can conclude that if the main speaker

has enough audios to make the difference with the rest of the

speakers we will expect better clustering results

okay

if you remember when i presented the clustering algorithm

i talked about the stopping criterion but

so far the computation of the threshold value

has been avoided

in this section we will study two different methods

and both methods require a set of labeled audios a training database

and we will run the experiments also with a mismatch

between the training

and the testing set

in the first one that we have called maximum distance

we will use

the labeled audio database to run a clustering process and

as we know

how many speakers we have we will be able to stop at the point in

which the number of speakers is equal to the number of clusters

if we

save the distance of that last iteration we will be able to use it

later

as threshold value and that threshold value is the one that is used for

testing
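
The threshold training step just described can be sketched as follows: cluster the labeled database until there are as many clusters as known speakers and keep the distance of the last merge. This is a minimal sketch assuming minimum-distance linkage on a distance matrix; the function name and the toy values are hypothetical and not taken from the talk.

```python
def train_max_distance_threshold(dist, num_speakers):
    """Run minimum-distance agglomerative clustering on a labeled training set
    until there are num_speakers clusters and return the distance of the last
    merge, to be reused later as the stopping threshold.

    dist: symmetric list of lists with pairwise distances (assumption).
    Returns None when no merge is needed.
    """
    clusters = [[i] for i in range(len(dist))]
    last_merge_distance = None
    while len(clusters) > num_speakers:
        # closest pair of clusters under minimum-distance linkage
        d, a, b = min(
            (min(dist[i][j] for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        last_merge_distance = d
    return last_merge_distance


# toy labeled set: audios 0 and 1 from one speaker, 2 and 3 from another
train_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
trained_threshold = train_max_distance_threshold(train_dist, num_speakers=2)
```

Because the value depends on the audios per speaker distribution of the training set, a threshold learned this way carries over only when the test database looks similar.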

in the second method that is called maximum distance with unsupervised score calibration what we do

is instead of feeding the clustering algorithm

with the distance metric that we get from the plda system

we will run a calibration process over the all versus all scores and

the matrix of calibrated scores is the one that will be

used later in the clustering algorithm

as the scores are calibrated we will be able to choose the threshold value that

we want depending on

how many errors

we want to let our clustering algorithm make

taking into account that if you allow

few errors you will stop at very high speaker impurity values and

we will not get the correct number

of speakers

and we consider

four groups of clustering tasks

the first one in which we will use similar training and testing

sets and other three groups in which we will have different audios

per speaker distributions in the training and the testing sets

as here we are interested in stopping when we have

as many speakers as clusters

we will define a new performance measure as the difference between the number

of speakers and the number of clusters

relative to the number of speakers
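
The measure just defined is a simple relative difference and can be written down directly. The sign convention below (speakers minus clusters) follows the wording of the talk, and the function name is hypothetical.

```python
def relative_cluster_count_error(num_speakers, num_clusters):
    """Difference between the number of speakers and the number of clusters,
    relative to the number of speakers. Zero means the process stopped with
    exactly one cluster per speaker."""
    return (num_speakers - num_clusters) / num_speakers


exact = relative_cluster_count_error(10, 10)
stopped_early = relative_cluster_count_error(10, 12)  # too many clusters left
stopped_late = relative_cluster_count_error(10, 8)    # too many merges
```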

so here we have the obtained results

we see here in the vertical axis the variable exactly the

one that i just defined

and here we have

in blue

the results obtained with the maximum distance method

and the other one is the solution provided by the calibrated scores

and

we see that

the second method performs similarly no matter

the mismatch between

training and testing set and

the first method may only be used

when we have

similar databases

in the training and testing

so to conclude with my presentation

i would like to say two things that is

we see that speaker clustering is

strongly affected by the characteristics of our audio database

and also we can use these conclusions to anticipate

possible changes but also to find possible solutions in the future for example

we see

that if we have a very large

audio dataset

we will get

much worse results than if the database is smaller so

we will propose to split our database into smaller ones and

use those smaller sets to run the clustering tasks

as

those clustering tasks

have better results than the big one

we will finally have

better results in

the whole clustering problem

and

that is all thank you it is time for questions

a question so

you mentioned you have studied some characteristics that are useful to anticipate

the accuracy of the system in a scenario

but it is based on the system i mean how dependent are these conclusions on the system

you use

is it possible to know that

well i would say you know

it is quite dependent especially

i believe that you know when you make

one decision

at the beginning of the clustering process

you will

carry that decision until the end of the process

i think the reason behind this conclusion is found in that thing

for example

we can think about why

we have

shown different results when we have different sizes of the task

and it is that as the size of the task is bigger

errors that are made at the beginning of the clustering process

are carried over the rest of the clustering tree

and

this

is

more harmful the

more iterations

we have so where the task is smaller we have

fewer iterations so it will be less harmful

and also for example

in the task in which we analyzed

the number of speakers we saw that

the results were worse when we were at the middle of

the clustering tree

and

if the solution was found

at the beginning of the tree or at the end of the tree we got

better results

that is also because

at the beginning we have

fewer possible

partitions

and

in the middle we have more but

we cannot access all of them because

of the decisions that we have previously made

the optimal one

may not be available but

that

a in doesn't happen that if we apply

we need a in just we have a more

possible option

that is because of course okay

it is due to the clustering algorithm you are using

i would say

yes i think

clustering

i believe is affected by this and the conclusions

are very influenced by the algorithm you use

for example

here there are not all the experiments that we have

made

but

if we change the

linkage method and we use for example

average

score

we saw that the difference

due to the balance of speakers where we had

better results when there is a main speaker goes away because if we use

average score instead of maximum score all the results that we obtained

were similar so

that was an example that if we change the clustering algorithm we may have

different

conclusions

some most of the completion suspect all the rebuttal for fundamental this element definitely a

clustering is our method for testing the particular scoring your you see inside what so

what is your inside of the limits once

so what would you say affects these conclusions the most

i think they are quite affected by

the

clustering algorithm that you use

thanks

sorry

i four u s i isn't

no way stance

one question about the database that you used you mentioned that you are using

only segments of three hundred seconds

okay there is the duration variability and so on

did you study the effect of this duration on

all the conclusions that you drew

yes i think we also ran some experiments in which we tested

different durations

and the results show the same conclusions

they keep similar but we have

higher cluster impurity levels

in all of our

experiments

and as we got higher levels the difference between the different databases was not so clear

Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

Speaker Clustering and Diarization

Jesús Jorrín Prieto, Carlos Vaquero, Paola García