Speech Transcript - Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

0:00:15	hello good afternoon everybody in this presentation i will show you some of my
0:00:21	work in speaker clustering
0:00:23	but before starting i would like to define two things the first one is the
0:00:28	speaker clustering problem that we want to solve we have an audio database in which
0:00:34	the audios belong to unknown speakers and also we have an unknown number of
0:00:38	speakers
0:00:39	and the second one is we will talk about audio database characteristics in this presentation
0:00:44	when we refer to this term we are thinking of things such as the number
0:00:49	of audios or how many audios
0:00:52	each speaker has
0:00:54	so
0:00:56	first of all i will present you the outline of the presentation
0:01:01	we will start with the motivation
0:01:03	later i will present you the clustering algorithm that we have been using
0:01:09	later we will see
0:01:12	the variables that we have studied and we will conclude
0:01:16	with some experiments studying the stopping criterion
0:01:21	so
0:01:23	if we talk about the motivation we suppose that we
0:01:28	receive one client that is interested
0:01:32	in getting a clustering based solution
0:01:35	and one common question that we have to deal with is okay
0:01:40	how is your system working
0:01:42	and for that purpose we will ask them to give us an audio
0:01:46	database as similar as possible
0:01:49	to the one that will be used
0:01:51	later in the system and with that database we will make some experiments and we
0:01:56	will be able to say okay we expect
0:01:59	to have similar results to these ones but
0:02:02	based on our experience we have seen that a clustering task
0:02:07	may obtain
0:02:08	very different results depending on the database so we must also say be
0:02:13	careful because if the distribution of audios and speakers in the database is different
0:02:18	from what we have now
0:02:20	you may have
0:02:21	very different results
0:02:22	and then
0:02:24	of course they will ask how can we expect
0:02:27	those results to change
0:02:29	and so that is what we wanted to answer with several experiments and some of
0:02:35	those experiments are what i am going to present here
0:02:38	okay so now we know what we want to do first of all i will
0:02:42	present the clustering algorithm that we are using
0:02:46	we consider an agglomerative hierarchical clustering that is
0:02:50	a clustering algorithm that starts from a partition in which each audio is
0:02:56	identified with one single cluster and then iteratively we merge the closest
0:03:01	clusters
0:03:02	to completely define our algorithm we will have to fix three things the
0:03:09	first one is the distance metric and for this purpose we will consider
0:03:13	the scores provided by a plda system so
0:03:17	before running the clustering algorithm
0:03:19	we compute all the all versus all scores for the evaluation database and we will use
0:03:24	those scores to build the similarity matrix
0:03:27	we also need to define a linkage method and we will use minimum
0:03:32	distance
0:03:33	and also we have to fix
0:03:36	the stopping criterion and we consider a score based threshold particularly
0:03:41	a maximum distance criterion that is if the distance between the two clusters to merge is
0:03:48	above a certain threshold we will stop
0:03:51	and otherwise we will continue
0:03:54	merging clusters
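
The procedure described above can be sketched as follows. This is a minimal pure-Python illustration, not the speakers' implementation: it assumes the pairwise scores have already been turned into a symmetric distance matrix (the talk does not say how scores are mapped to distances), and the name `agglomerative_cluster` and the toy matrix are hypothetical.

```python
def agglomerative_cluster(dist, threshold):
    """Agglomerative clustering with minimum-distance linkage.

    dist: symmetric matrix (list of lists); dist[i][j] is the distance
          between audios i and j (assumed precomputed from pairwise scores).
    threshold: merging stops once the closest pair of clusters is
               farther apart than this value.
    Returns a list of clusters, each a sorted list of audio indices.
    """
    # Start from the partition in which every audio is its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Minimum-distance linkage: the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # stopping criterion: closest clusters are too far apart
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy example: audios 0 and 1 are close, audios 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
result = agglomerative_cluster(toy_dist, threshold=0.5)
```

With a very large threshold the loop runs until a single cluster is left, and with a threshold below every distance no merge happens at all, which is why the choice of threshold matters later in the talk.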
0:03:56	regarding the performance measures we will use those
0:04:00	defined by david van leeuwen in one of his works that
0:04:05	are the speaker impurity and the cluster impurity speaker impurity measures how
0:04:11	the audios of a speaker are spread
0:04:14	over all the clusters
0:04:15	while cluster impurity measures how corrupt clusters are and when we say that one
0:04:20	cluster is corrupt we refer to the fact that it
0:04:24	has audios from many different speakers
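
As a rough illustration of the two measures just described, the sketch below uses a common purity-style definition: an audio counts as "pure" if it belongs to the dominant speaker of its cluster (cluster view) or to the dominant cluster of its speaker (speaker view). This is an assumption; the exact definitions in the work cited by the speaker may differ, and the function name `impurities` is hypothetical.

```python
from collections import Counter


def impurities(speakers, clusters):
    """Return (speaker_impurity, cluster_impurity) for a labeled clustering.

    speakers: true speaker label of each audio.
    clusters: cluster label assigned to each audio (same order).
    Assumed definition: impurity = 1 - fraction of audios that fall in the
    dominant cluster of their speaker / the dominant speaker of their cluster.
    """
    n = len(speakers)
    by_cluster = {}  # cluster label -> count of audios per speaker
    by_speaker = {}  # speaker label -> count of audios per cluster
    for s, c in zip(speakers, clusters):
        by_cluster.setdefault(c, Counter())[s] += 1
        by_speaker.setdefault(s, Counter())[c] += 1
    cluster_purity = sum(max(cnt.values()) for cnt in by_cluster.values()) / n
    speaker_purity = sum(max(cnt.values()) for cnt in by_speaker.values()) / n
    return 1 - speaker_purity, 1 - cluster_purity


# Speaker "b" is split over two clusters, and cluster 0 mixes two speakers.
spk_imp, clu_imp = impurities(["a", "a", "b", "b"], [0, 0, 0, 1])
```

Merging more clusters tends to lower speaker impurity and raise cluster impurity, so evaluating both after each merge gives one point per iteration on a trade-off curve.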
0:04:27	if we compute
0:04:29	those values at each iteration of the clustering process
0:04:34	and we plot
0:04:35	all those points in a graph
0:04:37	we will get impurity trade-off curves like the one that
0:04:42	we have here in this slide
0:04:44	we will use these graphs
0:04:46	to measure the performance of our clustering experiments during the
0:04:52	whole presentation
0:04:54	and as a reference
0:04:57	point we will use the equal impurity working point that is when we have
0:05:00	the same speaker impurity and cluster impurity
0:05:05	before we start with the presentation i will present you the database that we
0:05:10	have used we consider
0:05:12	audios from these that are
0:05:14	telephone channel
0:05:15	and with a three hundred second duration and here in a graph you can see
0:05:22	the audios per speaker distribution that we have in this database
0:05:27	okay
0:05:28	use our policies
0:05:30	to conduct the analysis of our audio database we first need to define some
0:05:34	variables that are important in this part so we consider
0:05:39	the first one size of the task
0:05:41	that is the number of audios we have in the database
0:05:44	the second one number of speakers that is the number of speakers that
0:05:49	we have in the database and the balance of speakers that measures
0:05:53	how many audios each speaker has
0:05:58	so
0:05:59	and regarding the first variable we will perform different experiments in which
0:06:05	we modify the size of the task
0:06:07	we will start from the initial set of audios and we will split it
0:06:11	into
0:06:12	subsets of smaller size
0:06:14	so for example a we have as you can see in the table six
0:06:19	and subset of side a three subsets results and
0:06:25	for those tasks in which
0:06:27	we have more than one clustering task we will average all the
0:06:31	results so with one single curve
0:06:34	we can compare results between different sizes of the task
0:06:39	here we have the impurity trade-off curves obtained
0:06:43	on the horizontal axis we have cluster impurity and on the vertical axis we
0:06:48	have speaker impurity
0:06:50	and as we can see as we reduce
0:06:53	the size of the task we expect to have better results in our clustering problem
0:07:00	the second variable that we have studied is the number of speakers
0:07:04	and to characterize this experiment
0:07:06	we will use
0:07:08	the value r that is defined as the number of speakers divided by the
0:07:13	number of audios
0:07:14	we can also have another interpretation of this variable
0:07:18	since it allows us to know the
0:07:21	iteration in which we should stop since we want to stop when we have as
0:07:25	many clusters
0:07:26	as speakers
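
The link between this ratio and the stopping iteration can be made concrete with a short sketch. It assumes the clustering starts with one cluster per audio and removes exactly one cluster per merge, so the ideal number of merges is the number of audios minus the number of speakers. The function names are hypothetical, and the numbers below are illustrative only.

```python
def speaker_ratio(num_speakers, num_audios):
    """Ratio of speakers to audios, as defined in the talk."""
    return num_speakers / num_audios


def ideal_stop_iteration(num_speakers, num_audios):
    """Number of merges after which clusters == speakers.

    Assumes one cluster per audio at the start and one merge per iteration.
    """
    return num_audios - num_speakers


# Illustrative task: 5 speakers with 20 audios each, 100 audios in total.
r = speaker_ratio(5, 100)
stop_at = ideal_stop_iteration(5, 100)
```

A small ratio therefore places the ideal stopping point near the end of the clustering tree, and a ratio close to one places it near the beginning.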
0:07:29	we consider several groups of clustering tasks in which we will modify
0:07:34	the number of speakers and all the tasks
0:07:38	have the same number of audios and given a task with a concrete number
0:07:43	of speakers
0:07:44	we will have the same number of audios per speaker
0:07:50	so as you can see in the table for example we will have tasks with
0:07:54	five speakers size one hundred and twenty audios per speaker
0:08:00	and
0:08:00	here we have the results
0:08:04	and it is a little bit different from what we have seen in the previous experiment
0:08:11	but again we will have exactly the same information on the horizontal
0:08:17	axis
0:08:17	we have our variable r that we have defined before
0:08:22	and on the vertical axis we have the speaker impurity
0:08:26	and each
0:08:28	of the lines represents a constant value of cluster impurity
0:08:32	so for example if we want to start with
0:08:35	the results suppose we would like to get
0:08:38	in our experiments a cluster impurity of one percent that is this curve
0:08:44	and we want to compare the results
0:08:46	when using o point five and one eight and we see that
0:08:52	with
0:08:54	o point five we need higher speaker impurity values
0:08:59	this means that
0:09:00	if our optimal solution
0:09:03	is found in the middle of the clustering tree we will
0:09:08	expect to obtain worse results
0:09:12	the last variable we have studied is the balance of speakers
0:09:17	in the balance of speakers we try to study the scenario that we
0:09:21	are presenting in the slide that is
0:09:23	we have one main speaker that has most of the audios in the
0:09:28	database and we have
0:09:30	a number of other speakers that have much fewer audios
0:09:34	we also need to fix r
0:09:37	that is the number of speakers divided by the number of audios and in
0:09:41	our task we will can see that always a the size of the that six
0:09:46	to forty so it's of a where
0:09:49	fixing r is equal to
0:09:51	fixing the number
0:09:53	of speakers
0:09:55	here we have
0:09:57	four scenarios in which we modify
0:10:00	the percentage of audios of the main speaker
0:10:05	so we start
0:10:06	from this one in which
0:10:09	the main speaker has
0:10:10	more or less the same number of audios as the others until this one in
0:10:15	which
0:10:15	the main speaker has much more audios than the others
0:10:21	if we
0:10:22	again
0:10:23	take a look at the results that we are getting as
0:10:27	impurity trade-off curves
0:10:28	we see that
0:10:31	the first scenarios lead to similar results and as we increase the
0:10:37	percentage of audios that the main speaker has
0:10:40	we
0:10:41	get better results
0:10:43	so
0:10:43	we can conclude that if the main speaker
0:10:46	has enough audios to make the difference with the rest of the
0:10:52	speakers we will expect better clustering results
0:10:57	okay
0:10:58	now if you remember when i presented the clustering algorithm
0:11:03	i talked about the stopping criterion but
0:11:07	so far the computation of the threshold value
0:11:12	has been avoided
0:11:13	in this section we will study two different methods
0:11:19	and both methods require a labeled audio database
0:11:27	so we will perform the experiments considering also a mismatch
0:11:32	between the training
0:11:34	and the testing set
0:11:36	so
0:11:37	the first one that we have called maximum distance with a baseball
0:11:41	we will use
0:11:42	the labeled audio database to run a clustering process and
0:11:48	as we know
0:11:49	how many speakers we have we will be able to stop at the point in
0:11:54	which the number of speakers is equal to the number of clusters
0:11:57	if we
0:11:58	save the distance of that last iteration we will be able to use it
0:12:04	later
0:12:04	as a threshold value and that threshold value is the one that is used for placement
0:12:09	for
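
The first method can be sketched as follows: cluster a labeled set until the number of clusters equals the known number of speakers, and keep the distance of the last merge as the threshold. This is a minimal illustration under the same assumptions as a plain minimum-distance agglomerative clustering over a precomputed distance matrix; the helper names `closest_pair` and `threshold_from_labeled_set` and the toy matrix are hypothetical.

```python
def closest_pair(dist, clusters):
    """Return (distance, a, b) for the two closest clusters (minimum linkage)."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    return best


def threshold_from_labeled_set(dist, num_speakers):
    """Merge until clusters == speakers and return the last merge distance.

    dist: symmetric distance matrix of the labeled (training) audios.
    num_speakers: number of speakers, known because the set is labeled.
    Returns None if no merge is needed.
    """
    clusters = [[i] for i in range(len(dist))]
    last_merge_distance = None
    while len(clusters) > max(num_speakers, 1):
        d, a, b = closest_pair(dist, clusters)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        last_merge_distance = d
    return last_merge_distance


# Toy labeled set with two speakers: audios {0, 1} and audios {2, 3}.
train_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
learned_threshold = threshold_from_labeled_set(train_dist, num_speakers=2)
```

A threshold learned this way is only as good as the match between the labeled set and the data it is later applied to, which is the mismatch the experiments go on to examine.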
0:12:11	the second method that is called maximum distance with unsupervised score calibration what we do
0:12:16	is instead of giving the clustering algorithm
0:12:21	the distance metric that we get from the plda system
0:12:27	we will make a calibration process over those scores and
0:12:31	that matrix of calibrated scores is the one that will be
0:12:35	used later in the clustering algorithm
0:12:38	as the scores are calibrated we will be able to choose the threshold value that
0:12:44	we want depending on
0:12:46	how many errors
0:12:49	we want to let our clustering algorithm make
0:12:54	keeping in mind that if you allow
0:12:57	few errors you will stop at very high speaker impurity values and
0:13:03	we will not get the correct number
0:13:06	of speakers
0:13:08	and we consider
0:13:10	four groups of clustering tasks
0:13:15	the first one in which we will use similar training and testing
0:13:22	sets and other three groups in which we will have different audios
0:13:28	per speaker distributions in the training and the testing sets
0:13:32	as here we are only interested in stopping when we have
0:13:37	as many speakers as clusters
0:13:39	we will define a new performance measure as the difference between the number
0:13:44	of speakers and the number of clusters
0:13:46	relative to the number of speakers
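
The measure just defined can be written down in a few lines. The sign convention (speakers minus clusters) is an assumption, since the talk does not say whether the difference is signed or absolute; the function name is hypothetical.

```python
def relative_cluster_count_error(num_speakers, num_clusters):
    """Difference between speakers and clusters, relative to the speakers.

    Assumed sign convention: positive when clustering stopped too late
    (fewer clusters than speakers), negative when it stopped too early.
    """
    return (num_speakers - num_clusters) / num_speakers


too_few_clusters = relative_cluster_count_error(10, 8)
exact = relative_cluster_count_error(10, 10)
```

A value of zero means the stopping criterion ended the clustering with exactly as many clusters as speakers.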
0:13:51	so here we have the obtained results
0:13:55	we see here in the vertical axis the variable that is exactly this
0:14:00	one that i just defined
0:14:02	and here we have
0:14:05	in blue
0:14:06	the difference obtained with the maximum distance with protocol
0:14:10	and the other is the solution provided by the calibrated scores
0:14:17	and
0:14:18	we see that
0:14:20	the second method performs similarly no matter
0:14:24	the mismatch between
0:14:28	training and testing sets and
0:14:31	we
0:14:32	the first method may only be used
0:14:35	when we have
0:14:36	similar databases
0:14:38	in training and testing
0:14:42	so to conclude my presentation
0:14:46	i would like to say two things that is
0:14:49	we see that speaker clustering is
0:14:53	strongly affected by the characteristics of our audio database
0:14:57	and also we can use these conclusions to anticipate
0:15:03	possible changes but also to find possible solutions in the future for example
0:15:09	we see
0:15:10	that if we have a big
0:15:13	audio dataset
0:15:14	we will get
0:15:15	much worse results than if the database is smaller so
0:15:20	we will propose to split our database into smaller ones and
0:15:26	use those smaller sets to run clustering tasks
0:15:32	as
0:15:33	those clustering tasks
0:15:35	have better results than the big one
0:15:39	we will finally have
0:15:40	better results in
0:15:42	our clustering problem
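
The splitting step of this proposal can be illustrated with a very small sketch. It only shows how a list of audio identifiers might be cut into consecutive subsets of a fixed size; how the split is chosen, and how clusters from different subsets would be combined when one speaker appears in several subsets, is not described in the talk. The function name and the subset size are hypothetical.

```python
def split_into_subsets(audio_ids, subset_size):
    """Cut a list of audio identifiers into consecutive subsets.

    The last subset may be smaller than subset_size. Each subset would
    then be clustered as an independent, smaller task.
    """
    return [
        audio_ids[start:start + subset_size]
        for start in range(0, len(audio_ids), subset_size)
    ]


subsets = split_into_subsets(list(range(5)), subset_size=2)
```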
0:15:44	and
0:15:56	the supply the need for questions so
0:16:12	i question so it's so probably
0:16:18	so you mentioned you have extracted some conclusions that are useful to anticipate
0:16:24	the accuracy of the system in a scenario
0:16:27	but it is based on the system i mean how dependent are these conclusions on the system
0:16:34	you use
0:16:37	i is at the unit is possible you know that
0:16:41	or
0:16:44	well i would say you know
0:16:47	it is used a quite spatially
0:16:52	i believe that when you make
0:16:54	one decision
0:16:56	at the beginning of the clustering process
0:16:59	you will
0:17:01	carry that decision until the end of the process
0:17:05	so
0:17:06	i think the reason behind this conclusion is found in that
0:17:14	for example a
0:17:16	we can think why
0:17:18	we have
0:17:20	shown different results when we have different sizes of the task
0:17:25	and it is that as the size of the task is bigger
0:17:29	errors that are made at the beginning of the clustering process
0:17:32	are carried along the rest of the clustering tree
0:17:38	and
0:17:38	this
0:17:39	is
0:17:39	more harmful the
0:17:42	more iterations
0:17:44	we have so when the task is smaller we have
0:17:50	fewer iterations so it will be less harmful
0:17:54	and also for example
0:17:57	the task in which we analyzed
0:18:04	the number of speakers we see that
0:18:09	the results changed when we were at the middle of the
0:18:17	clustering tree
0:18:19	and
0:18:20	if the solution was found
0:18:23	at the beginning of the tree or at the end of the tree we got
0:18:28	a
0:18:29	better result
0:18:30	that is also because
0:18:32	and
0:18:33	again at the beginning there are
0:18:38	fewer possible
0:18:39	partitions
0:18:40	and
0:18:41	in the middle we have more but
0:18:45	we cannot access all of them because
0:18:48	of the decisions that we have previously made
0:18:52	the optimal one
0:18:53	may not be available but that
0:18:55	that
0:18:56	a in doesn't happen that if we apply
0:18:59	we need a in just we have a more
0:19:03	possible option
0:19:06	that's because of course okay i
0:19:09	due to the bic clustering algorithm we are using
0:19:12	so
0:19:13	i would say
0:19:15	yes i think
0:19:18	clustering
0:19:20	is affected by these and the conclusions extracted
0:19:25	are very influenced by the algorithm you use
0:19:31	a
0:19:32	for example
0:19:33	here are not all the experiments we have
0:19:38	made
0:19:39	but
0:19:41	if we change the
0:19:44	linkage method and we use for example
0:19:49	average
0:19:50	score
0:19:51	we show that the evidence
0:19:54	a you to the finals of the big of a see what we have
0:19:58	a better results when there is a means because if we use
0:20:03	average score instead of maximum score all the results that we obtained
0:20:09	were similar so
0:20:10	that was an example that if we change the clustering algorithm we may have
0:20:17	different
0:20:23	some most of the completion suspect all the rebuttal for fundamental this element definitely a
0:20:28	clustering is our method for testing the particular scoring your you see inside what so
0:20:35	what is your inside of the limits once
0:20:40	so what would you say affects these conclusions the most
0:20:45	i think it is quite affected by
0:20:50	by the
0:20:52	like the clustering algorithm within your
0:20:57	thanks
0:21:03	sorry
0:21:05	i four u s i isn't
0:21:11	no way stance
0:21:12	one more question about the database that you used you mentioned that you are using
0:21:18	only the t a three hundred seconds of the
0:21:23	of
0:21:25	okay there is duration variability inside and so on
0:21:29	did you study the effect of this duration on
0:21:33	all the conclusions that you drew
0:21:36	yes i think we also made some experiments in which we tested different
0:21:46	different durations
0:21:48	and hey the data results channel deconvolution
0:21:53	and it keeps similar but a we have
0:21:56	hi
0:21:57	after some we that higher a clustering purity levels
0:22:02	all of our weighting
0:22:05	experiment
0:22:06	as we got higher the difference between a different databases used not show something

Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

Speaker Clustering and Diarization

Jesús Jorrín Prieto, Carlos Vaquero, Paola García