Speech transcript - LINGUISTIC INFLUENCES ON BOTTOM-UP AND TOP-DOWN CLUSTERING FOR SPEAKER DIARIZATION

0:00:13	come come from us
0:00:14	and today I would like to present our work entitled linguistic influences on
0:00:20	bottom-up and top-down clustering
0:00:23	for speaker diarization
0:00:25	so let me
0:00:26	give a short overview of the work
0:00:29	um so first
0:00:31	we give a short introduction giving the motivation of this work
0:00:34	and continue with the
0:00:36	formulation of the problem
0:00:38	to finally move on to the comparison of these two clustering systems
0:00:44	and finally illustrate our ideas with
0:00:46	some experimental work
0:00:53	so
0:00:53	um we have seen during the last NIST RT'09 evaluation that there are basically two main approaches for
0:01:00	speaker diarization: one is bottom-up, also called agglomerative hierarchical clustering
0:01:06	and the other is top-down
0:01:08	or divisive hierarchical clustering
0:01:12	um
0:01:12	we published, well, we presented actually last year, a paper
0:01:18	proposing a purification process
0:01:21	for
0:01:22	speaker diarization systems
0:01:24	and we saw that it gives some consistent improvement for the top-down system
0:01:29	but when trying to apply it to the bottom-up
0:01:32	we saw that the results were totally inconsistent
0:01:36	so that's what
0:01:38	that's the motivation of this work: to know why it does not work, and it leads us to
0:01:43	try to have a look at the influence of linguistic nuisances on bottom-up and
0:01:49	top-down
0:01:52	so let's start with the formulation of the problem
0:01:56	so here you have an audio stream
0:01:59	and we want to solve the problem of who spoke when, so we propose to call G the segmentation
0:02:04	so
0:02:06	the group of boundaries at each speaker turn
0:02:09	and
0:02:11	S
0:02:12	which is the
0:02:14	speaker sequence, so the list of the successive speakers
0:02:17	so "who" is S and "when" is G in this case
0:02:21	and so we can
0:02:23	summarise this
0:02:25	setting by the following equation: finding the optimum S and the optimum G as the argument of the
0:02:31	maximum
0:02:32	of
0:02:33	the probability of S and G given a set of observations, so in this case O would be the audio stream
0:02:40	so just using the Bayes rule
0:02:44	on this equation
0:02:46	we can get the second line you see on the screen
0:02:49	and the denominator can be dropped as it does not depend on S or
0:02:54	G, so giving equation number one there
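The derivation described in this passage can be written out as follows. This is a reconstruction from the spoken description only, not a copy of the slide; the hat notation for the optimum and the use of O for the audio observations are assumptions about the notation.

```latex
\begin{align}
(\hat{S}, \hat{G}) &= \arg\max_{S,G} \; P(S, G \mid O) \\
                   &= \arg\max_{S,G} \; \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \\
                   &= \arg\max_{S,G} \; P(O \mid S, G)\, P(S, G)
\end{align}
```

The last step drops P(O) because it is constant with respect to S and G.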
0:02:57	so with this equation we can see that
0:03:00	there are two models which are required to solve this optimization task
0:03:05	the first one, to compute
0:03:07	P of O given S and G
0:03:09	is the acoustic speaker models
0:03:11	so in this case it's often GMMs, the main approach we use
0:03:16	currently in the
0:03:18	state of the art
0:03:19	and the second
0:03:20	model, so P of
0:03:22	S and G
0:03:23	which is often omitted
0:03:25	maybe except in
0:03:27	some previous work which has just been presented now
0:03:30	uh and
0:03:32	so looking at this equation we see that we have two main difficulties
0:03:36	first
0:03:37	of course we do not know who the speakers
0:03:40	are
0:03:41	and secondly the acoustic model depends, in
0:03:45	a perfect world
0:03:46	only on the speaker, but it can depend as well
0:03:50	on other nuisances like the linguistic content
0:03:54	so for the next part of this presentation we make the following assumption
0:03:59	that the major nuisance
0:04:01	variation is only due
0:04:03	to the linguistic content
0:04:05	so that's why we are going to consider
0:04:08	the different phonemes
0:04:10	and they are going to be written Q
0:04:14	so
0:04:15	considering this assumption we can just reformulate the equation
0:04:21	to take into account
0:04:22	the speaker inventory Delta, that is all the possible speakers
0:04:28	so now we are looking for the optimum S and G plus the optimal speaker inventory Delta
0:04:35	so considering now the influence of the phonemes we can move on to the second line on
0:04:41	the screen
0:04:42	which corresponds to marginalizing the probability over all the different phonemes Q
0:04:48	and the third line is just obtained with the Bayes rule
0:04:54	um
0:04:56	and next
0:04:57	we can propose to make, first,
0:05:00	for speaker diarization, the following assumption: that all the speakers are equiprobable
0:05:05	so we can just omit the speaker turn model, so P
0:05:09	of S and G can just disappear
0:05:11	and the second assumption is
0:05:13	that
0:05:14	we can expect
0:05:15	the phoneme to be independent of the speaker and independent of G as well
0:05:19	so that's why we can just use the prior of Q
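The phoneme-aware formulation and the two simplifying assumptions can be sketched as below. This is a hedged reconstruction from the spoken description, not the slide itself; treating the speaker inventory as a symbol \Delta under the arg max, and writing the final step as an approximation, are assumptions.

```latex
\begin{align}
(\hat{S}, \hat{G}, \hat{\Delta})
  &= \arg\max_{S,G,\Delta} \; P(S, G \mid O) \\
  &= \arg\max_{S,G,\Delta} \; \sum_{Q} P(S, G, Q \mid O) \\
  &= \arg\max_{S,G,\Delta} \; \sum_{Q} P(O \mid S, G, Q)\, P(Q \mid S, G)\, P(S, G) \\
  &\approx \arg\max_{S,G,\Delta} \; \sum_{Q} P(O \mid S, G, Q)\, P(Q)
\end{align}
```

The third line applies the Bayes rule and drops P(O). The last line uses the two stated assumptions: equiprobable speakers, so that P(S, G) is constant, and phonemes independent of S and G, so that P(Q | S, G) reduces to the prior P(Q).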
0:05:24	so finally
0:05:26	we get two equations: the first for a simple approach
0:05:29	the second line for a maybe more complete approach
0:05:32	and comparing both of them, which would lead to the same results in a perfect world
0:05:37	we see that
0:05:39	the second equation is phone-normalized
0:05:42	and
0:05:44	that in the first one
0:05:46	we should have a normalized model as well
0:05:48	it means that P of O given S and G has to be trained
0:05:52	while
0:05:53	taking into account the different phonemes
0:05:57	so
0:05:58	to summarise, we
0:06:00	see from this equation that
0:06:01	the speaker inventory Delta has to be optimized together with S and G
0:06:07	and so that a an called solution for the top
0:06:10	so um that the reason why it is to the fine was try to and you are search
0:06:15	um um
0:06:17	which are mainly the bottom-up and top-down approaches
0:06:22	so if we move on now to comparing these two approaches
0:06:26	we see
0:06:29	that top-down
0:06:30	just starts with one cluster and divides it
0:06:33	iteratively in order to get the optimum number of clusters, while bottom-up is the opposite scenario: we start
0:06:39	with plenty of clusters and merge them iteratively
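The two search directions can be illustrated with a toy sketch. This is not the diarization systems discussed in the talk, which use acoustic speaker models; it assumes one-dimensional points standing in for segments, a gap between neighbouring clusters as the distance, and a hypothetical `max_gap` threshold as the stopping criterion.

```python
from statistics import mean


def bottom_up(points, max_gap):
    """Agglomerative: start with one cluster per point, merge the closest pair."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > 1:
        # Distance between neighbouring clusters: gap between their means.
        gaps = [mean(clusters[i + 1]) - mean(clusters[i])
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:  # nothing close enough left to merge
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def top_down(points, max_gap):
    """Divisive: start with a single cluster, split at the widest gap."""
    clusters = [sorted(points)]
    while True:
        best = None  # (gap, cluster index, split position)
        for ci, c in enumerate(clusters):
            for k in range(1, len(c)):
                gap = c[k] - c[k - 1]
                if best is None or gap > best[0]:
                    best = (gap, ci, k)
        if best is None or best[0] <= max_gap:  # nothing worth splitting
            break
        _, ci, k = best
        c = clusters[ci]
        clusters[ci:ci + 1] = [c[:k], c[k:]]
    return clusters


points = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
print(bottom_up(points, max_gap=2.0))
print(top_down(points, max_gap=2.0))
```

On well-separated data like this, both directions end on the same three groups. The sketch does not model the nuisance variation that makes the two strategies diverge in practice.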
0:06:44	um
0:06:45	so bottom-up is by far the more popular approach
0:06:48	it got the best results at the last
0:06:52	RT'09 evaluation
0:06:53	while top-down is
0:06:54	maybe a bit less so, but achieves competitive results
0:06:58	our work for instance
0:07:00	shows that for single distant microphone it can lead to comparable results
0:07:05	but the question is, okay, we start with an audio stream, we converge to some clusters
0:07:10	and how can we be sure that these clusters correspond to speakers
0:07:14	or to another acoustic nuisance like the phoneme
0:07:19	so
0:07:20	yeah required so that this approach converge to a local maximum
0:07:25	um in the perfect word
0:07:29	the inter-speaker variation dominates over the intra-speaker variation
0:07:33	and if
0:07:34	i mean
0:07:35	could uh M and size resize to we should say okay bottom-up and top down should lead to exactly
0:07:41	the same
0:07:42	results
0:07:44	but
0:07:45	of course nothing is perfect
0:07:47	there is as well the influence of the linguistic content, which can be very significant
0:07:53	and it means that when the speaker models are not well normalized
0:07:58	uh
0:07:59	the system can converge to a local maximum and we can
0:08:02	obtain not
0:08:04	speaker units but other acoustic units like the phoneme Q
0:08:09	so in the case of top-down
0:08:11	the new speakers are introduced from a
0:08:15	normalized general model
0:08:17	so this model is trained on all of the available speech, so we can expect the model
0:08:22	to be well normalized
0:08:24	and then the
0:08:26	speakers are iteratively introduced with a large amount of data, so
0:08:30	we can expect
0:08:33	these new models to be quite well normalized as well
0:08:38	so there is a huge risk as well
0:08:41	a a a a a a a zero sum of their
0:08:44	uh to a of the linguistic is to normalize it
0:08:48	uh uh i
0:08:49	to normalize the speaker variation as well, that's of course what we don't want to get, we want to
0:08:53	get a highly speaker-discriminative system
0:08:57	by comparison, with the bottom-up
0:08:59	we start the system with some very small clusters
0:09:02	which can
0:09:05	converge, I mean, to a local maximum and are highly discriminative
0:09:09	a so that my from this point of view
0:09:11	and the
0:09:13	nation compared to bottom-up up
0:09:15	but the problem is
0:09:16	as the clusters are very small at the beginning
0:09:19	there is a huge risk that the system converges to some
0:09:24	other acoustic unit and is less well
0:09:27	normalized
0:09:29	so finally just to sum up
0:09:31	I think both of the systems have their own drawbacks and their own advantages
0:09:37	with respect to speaker discrimination and normalization against linguistic nuisances
0:09:44	so let's illustrate this now
0:09:47	with some
0:09:49	experimental work
0:09:50	so
0:09:51	here is our experimental setup, so
0:09:53	we have a speech activity detector which is for both of the systems
0:09:59	um
0:10:01	on the left, I think it's on the left for you as well, yeah
0:10:04	you have the bottom-up system, so it's a classical
0:10:09	state-of-the-art system
0:10:10	here is the reference, as you can see; I am not going to
0:10:13	spend too much time
0:10:16	talking about this, and on the other side you have the top-down system
0:10:20	so a typical top-down system as well
0:10:23	so these are the two clustering
0:10:26	systems
0:10:27	and next we use a purification
0:10:30	as in the paper shown here
0:10:32	so this is an optional step, we will see the difference it leads to
0:10:36	and then a Viterbi and MAP-based resegmentation and, after this, a normalization of
0:10:41	the features and a final resegmentation
0:10:45	for the datasets, so
0:10:47	a training set from conference meetings
0:10:50	from the NIST RT'04, '05 and '06 evaluations
0:10:53	and for the evaluation sets we propose to use the RT'07 and RT'09
0:10:58	uh
0:10:59	datasets, and a set which is
0:11:02	of TV shows, the Grand Échiquier, which is a French TV-show corpus
0:11:07	here ah
0:11:09	no additional preference
0:11:10	so the first call um is that a can be the better
0:11:14	and
0:11:14	is the score one
0:11:16	of speech
0:11:18	of course as our system
0:11:20	the help
0:11:21	does not process the overlapped speech, we just focus on the second one
0:11:26	and
0:11:27	so a for
0:11:28	the we can see
0:11:30	and and just looking at the is that was also apply
0:11:34	occasions fixations that
0:11:36	and see that okay first sub of the system
0:11:40	a a better to an ounce for two Y for
0:11:44	top down a well there is a uh
0:11:47	um is a result a much worse for of
0:11:50	for uh um
0:11:51	the T V shows
0:11:54	yeah signal
0:11:55	it is not always the same system which provides the best results
0:12:00	according to the dataset, see for example for RT'07 top-down gives better
0:12:04	results while for RT'09 that
0:12:07	is the bottom-up
0:12:08	and
0:12:09	we can also consider
0:12:11	the results with purification
0:12:14	so there we can see that
0:12:16	purification
0:12:18	just
0:12:19	brings a degradation in performance for the bottom-up
0:12:24	while for the top-down
0:12:26	it always improves the system
0:12:30	so
0:12:31	the question is, okay, maybe purification improves the discrimination between clusters
0:12:38	i i am a as has a down
0:12:41	the propagation
0:12:43	bottom-up
0:12:44	which is less well normalized against phonemes
0:12:49	in this case the purification is useless
0:12:53	so let's explain it a bit more clearly with the cluster purity
0:12:57	so we propose to look at all the clusters output by one of the systems and compute the purity
0:13:03	for all of these clusters
0:13:06	so the purity is computed
0:13:08	like this
0:13:10	we take the dominant speaker time in seconds
0:13:13	and we divide by the total number
0:13:16	of seconds of the cluster
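The purity measure described here can be sketched as below, assuming each cluster is given as a list of hypothetical `(true_speaker, duration_in_seconds)` pairs. The unweighted average over clusters is an assumption; the talk does not make clear how the per-cluster values are combined.

```python
from collections import defaultdict


def cluster_purity(segments):
    """Seconds of the dominant speaker divided by total seconds in the cluster."""
    per_speaker = defaultdict(float)
    for speaker, duration in segments:
        per_speaker[speaker] += duration
    total = sum(per_speaker.values())
    if total == 0:
        return 0.0  # empty cluster: purity is undefined, report 0 here
    return max(per_speaker.values()) / total


def average_purity(clusters):
    """Unweighted mean of the per-cluster purities (an assumption)."""
    return sum(cluster_purity(c) for c in clusters) / len(clusters)


cluster_a = [("spk1", 30.0), ("spk2", 10.0), ("spk1", 20.0)]
cluster_b = [("spk2", 5.0)]
print(cluster_purity(cluster_a))
print(average_purity([cluster_a, cluster_b]))
```

A purity of 1.0 means the cluster contains a single speaker. It says nothing by itself about whether that speaker has been split over several clusters, which is why the talk reads purity together with the number of clusters.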
0:13:19	so there are different situations
0:13:21	if we have a high purity and a small number
0:13:24	of
0:13:25	clusters
0:13:26	yeah well i i one has a pretty is a purity of
0:13:30	cluster
0:13:31	we can expect the system
0:13:32	to be likely to have converged to some speaker units
0:13:38	and with a very high number of clusters
0:13:41	it is likely that the system converged to other acoustic units
0:13:45	we as it to as their have been
0:13:50	a
0:13:51	we don't know what happened, different scenarios are possible, and the same for the last case
0:13:56	so looking at the purity and the number of clusters
0:13:59	um the for tab and so we see him in
0:14:02	we we don't use we do the propagation process
0:14:07	a a top down as compare about a priority
0:14:10	more less with
0:14:12	but
0:14:13	the top down as the that's class
0:14:16	and C
0:14:18	um of the right the number of clusters
0:14:20	we have a as uh clusters them the bottom-up up and them about cluster it's clusters none the idea and
0:14:25	number of cluster to for the ground truth
0:14:27	so we can expect top-down to be in the first situation, so to converge
0:14:32	to some speakers
0:14:34	whereas bottom-up is probably in the fourth case
0:14:37	so
0:14:38	well
0:14:39	to see what happens
0:14:41	with
0:14:43	the purification
0:14:44	we see that
0:14:45	first, for the top-down, with the purification
0:14:48	the purity is improved
0:14:51	um
0:14:53	i cluster
0:14:54	so for sure it helps the system to converge to speakers more than without purification
0:15:01	uh
0:15:02	there is a consistent in purity
0:15:05	uh i i have a cluster them for the top that down so
0:15:09	that is not or even i have to say a in which it situation we however
0:15:14	uh
0:15:16	the last part of this experimental work is
0:15:18	looking at the phone normalization
0:15:20	for this case, so
0:15:22	we take the different clusters
0:15:24	we take all the clusters
0:15:27	for a system and for each of these clusters
0:15:30	we compute
0:15:31	a histogram of the different phonemes
0:15:34	we do this for all the clusters generated by the top-down and the same for the bottom-up
0:15:39	and for
0:15:40	all of
0:15:41	them
0:15:44	we compute the inter-cluster distance between the histograms
0:15:48	this is the Kullback-Leibler distance
0:15:52	so
0:15:53	and
0:15:54	X is the average of all
0:15:57	of these distances for each of the systems
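The phone-normalization measure described here can be sketched as follows. Several details are assumptions rather than the authors' exact method: the symmetrised form of the Kullback-Leibler divergence, the small smoothing constant `eps` used to avoid empty histogram bins, and the averaging over all cluster pairs. The phone labels in the example are made up.

```python
import math
from collections import Counter
from itertools import combinations


def phone_histogram(phones, inventory, eps=1e-6):
    """Normalised phone histogram for one cluster, smoothed to avoid zeros."""
    counts = Counter(phones)
    total = sum(counts[p] + eps for p in inventory)
    return [(counts[p] + eps) / total for p in inventory]


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrised divergence, used here as the inter-cluster distance."""
    return kl(p, q) + kl(q, p)


def average_inter_cluster_distance(cluster_phones, inventory):
    """Mean distance over all pairs of cluster phone histograms."""
    hists = [phone_histogram(c, inventory) for c in cluster_phones]
    pairs = list(combinations(hists, 2))
    if not pairs:
        return 0.0  # fewer than two clusters: no pair to compare
    return sum(symmetric_kl(p, q) for p, q in pairs) / len(pairs)


inventory = ["a", "b", "c"]
balanced = [["a", "b", "c"], ["a", "b", "c", "a", "b", "c"]]
skewed = [["a", "a", "a", "b"], ["c", "c", "c", "b"]]
print(average_inter_cluster_distance(balanced, inventory))
print(average_inter_cluster_distance(skewed, inventory))
```

Clusters that share the same phone distribution give a distance near zero, while clusters dominated by different phones give a large distance.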
0:16:00	so um
0:16:02	we can expect this average distance to be small
0:16:06	when there is a similar
0:16:08	distribution of the different phonemes in the different clusters
0:16:12	which means that the system is well normalized
0:16:16	and we can expect it to be high when there is a higher degree of convergence to phonemes
0:16:21	and so
0:16:22	the distribution is not equal in the different clusters
0:16:27	so here are the results, seen first
0:16:30	without the purification step
0:16:33	i
0:16:35	are used
0:16:36	is a bottom-up
0:16:38	which shows really that in this case the clusters are better normalized
0:16:43	and now with the purification
0:16:45	we see that there is an improvement for both of the systems
0:16:48	but bottom-up plus purification
0:16:52	is still very much higher than the top-down
0:16:55	plus purification
0:16:57	which just it's explained why the purification prove that that i
0:17:01	of the bottom up so to conclude
0:17:03	um
0:17:05	we have seen in these slides that
0:17:07	both approaches, bottom-up and top-down
0:17:09	give some comparable results but
0:17:12	they
0:17:12	have different behaviours
0:17:15	bottom-up is more discriminative
0:17:18	but
0:17:20	often
0:17:20	suffers from some clusters which are less normalized against linguistic content
0:17:26	whereas top-down
0:17:28	suffers from some clusters which are better normalized but less
0:17:33	speaker discriminative
0:17:36	so
0:17:37	I think
0:17:39	one of the conclusions of this work is that there is something good to do with a
0:17:44	combination of these two approaches
0:17:46	so we recently published a bit on that, but I think there are a lot of other
0:17:51	things to try
0:17:53	and as future work
0:17:55	we can maybe expect to design a specific purification process
0:17:58	for the bottom-up
0:18:00	taking into consideration these linguistic influences, which are quite particular
0:18:05	or or on a of this approach
0:18:08	he france's
0:18:11	and that's it
0:18:13	thanks
0:18:14	okay
0:18:20	any questions
0:18:25	okay that and if i one a quick question
0:18:30	i the can i think with
0:18:33	and are going to take a hard thing and
0:18:37	um
0:18:38	can i oh oh oh
0:18:41	for
0:18:42	right
0:18:44	as a we stick like to use
0:18:46	i have seen that is
0:18:47	the core of these two approaches, not the purification, which was just a motivation
0:18:53	which led to this work
0:18:55	but the core of the bottom-up and the core of the top-down act differently
0:19:00	with the linguistic influences, in this case the phoneme content
0:19:04	of the speech
0:19:07	so uh
0:19:11	but it
0:19:11	but to a question you
0:19:14	i and
0:19:15	you
0:19:17	i think i think

LINGUISTIC INFLUENCES ON BOTTOM-UP AND TOP-DOWN CLUSTERING FOR SPEAKER DIARIZATION

Speaker Diarization

Presenter: Simon Bozonnet, Authors: Simon Bozonnet, Dong Wang, Nicholas Evans, Raphaël Troncy, EURECOM, France