0:00:13 | hello, i'm going to talk about |
---|
0:00:17 | an utterance comparison model that we're proposing for |
---|
0:00:20 | speaker clustering using factor analysis |
---|
0:00:23 | so i'm first going to define what exactly we mean by speaker clustering, because the term is used in different |
---|
0:00:29 | contexts with subtle variations |
---|
0:00:32 | and in our study we define speaker clustering as the task of clustering a set of speaker-homogeneous speech |
---|
0:00:39 | utterances |
---|
0:00:40 | such that each cluster corresponds to a unique speaker |
---|
0:00:44 | and when i say a speaker-homogeneous speech utterance, i mean that each utterance, which is a set of |
---|
0:00:48 | speech feature vectors, contains speech from only one speaker |
---|
0:00:54 | and the number of speakers is unknown |
---|
0:00:57 | so the applications of this include |
---|
0:01:00 | speech recognition, for example, when you want to use a predefined set of speaker clusters to do robust |
---|
0:01:05 | speaker adaptation when test data is very limited |
---|
0:01:08 | it's also used in a very classical class of methods for speaker diarisation, where you want |
---|
0:01:14 | to solve |
---|
0:01:15 | the "who spoke when" |
---|
0:01:16 | problem |
---|
0:01:17 | so this is a very classical setting: speaker diarisation is when |
---|
0:01:22 | you're given an unlabeled audio recording of an unknown number of unknown speakers talking |
---|
0:01:27 | and you have to determine the parts spoken by each person. in the example here it's just a sixty- |
---|
0:01:31 | second |
---|
0:01:32 | recording of a conversation. what we can do is just divide it up into small chunks and |
---|
0:01:38 | assume each chunk is one utterance, meaning it only contains speech by one person |
---|
0:01:42 | then you do some kind of |
---|
0:01:43 | clustering of these chunks |
---|
0:01:46 | and then |
---|
0:01:47 | you end up with clusters: a first cluster here, a second one there, and so on |
---|
0:01:51 | and if the number of clusters equals the number of speakers and each cluster actually contains speech from only one |
---|
0:01:55 | person, then you have perfect speaker diarisation |
---|
0:01:57 | of course, in reality |
---|
0:01:59 | you may actually have ended up with something like this, where some utterances were grouped incorrectly |
---|
0:02:03 | sometimes there may actually be more speakers than clusters, or there may actually be fewer speakers |
---|
0:02:09 | than clusters; those are the kinds of errors that can occur |
---|
0:02:12 | so this is just the sort of classic speaker diarisation method; of course the more state- |
---|
0:02:17 | of-the-art methods don't use this procedure, and instead use, for example, variational inference |
---|
0:02:24 | but here let's look at this class of methods: we have a speech signal that is |
---|
0:02:28 | segmented into |
---|
0:02:29 | these speaker-homogeneous utterances |
---|
0:02:31 | and then you use some kind of distance measure to compute the distance between the utterances, you merge the |
---|
0:02:36 | closest two utterances, and check whether some stopping criterion is met |
---|
0:02:39 | if it's not, you loop back and continue clustering until you're done |
---|
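a minimal sketch of the bottom-up clustering loop just described, assuming a generic pairwise distance function, single-linkage cluster distances, and a simple distance-threshold stopping criterion (none of which are prescribed by the talk):

```python
import itertools

def agglomerative_cluster(utterances, distance, stop_threshold):
    """Bottom-up speaker clustering: repeatedly merge the closest pair of
    clusters until the smallest pairwise distance exceeds the threshold."""
    # each cluster starts as a single speaker-homogeneous utterance
    clusters = [[u] for u in utterances]
    while len(clusters) > 1:
        # cluster distance: minimum distance over cross-cluster utterance pairs
        (i, j), d = min(
            (((a, b), min(distance(x, y) for x in clusters[a] for y in clusters[b]))
             for a, b in itertools.combinations(range(len(clusters)), 2)),
            key=lambda pair_dist: pair_dist[1],
        )
        if d > stop_threshold:          # stopping criterion met
            break
        clusters[i] += clusters.pop(j)  # merge the two closest clusters (i < j)
    return clusters
```

any of the distance measures discussed next, or the proposed utterance comparison model with its sign flipped so that higher similarity means smaller distance, can be plugged in as the distance function.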
0:02:44 | so here are some popular distance measures for this task |
---|
0:02:49 | given two arbitrary speech utterances X_a and X_b, what is the distance between them? |
---|
0:02:54 | you have things like the generalized likelihood ratio, the cross likelihood ratio, or the |
---|
0:02:59 | bayesian information criterion |
---|
0:03:02 | and again, for both the GLR and the CLR, you have to estimate |
---|
0:03:06 | some gmm parameters from each utterance |
---|
0:03:10 | and then you compute likelihoods and use those to |
---|
0:03:14 | create some kind of ratio that determines how close these utterances |
---|
0:03:18 | are to each other |
---|
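for reference, commonly used forms of the likelihood-ratio measures mentioned here look roughly like the following; the exact variants compared in this work may differ in logs, signs, or penalty terms:

```latex
% generalized likelihood ratio: pooled model vs. separate per-utterance models,
% with \theta_a, \theta_b, \theta_{ab} estimated from X_a, X_b, and their union
\mathrm{GLR}(X_a, X_b) =
  \frac{p(X_a \cup X_b \mid \theta_{ab})}
       {p(X_a \mid \theta_a)\, p(X_b \mid \theta_b)}

% cross likelihood ratio: score each utterance against the other's model,
% normalized by a universal background model (UBM)
\mathrm{CLR}(X_a, X_b) =
  \log \frac{p(X_a \mid \theta_b)}{p(X_a \mid \theta_{\mathrm{ubm}})}
  + \log \frac{p(X_b \mid \theta_a)}{p(X_b \mid \theta_{\mathrm{ubm}})}
```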
0:03:20 | so why did we think there could be a better distance measure? i mean, for |
---|
0:03:25 | example, if you look at these, |
---|
0:03:27 | the GLR, CLR, and BIC, they're mostly just mathematical constructs; i mean |
---|
0:03:32 | you don't really have a rigorous justification that the way they compare utterances is |
---|
0:03:37 | based on |
---|
0:03:38 | a physical notion of speaker similarity |
---|
0:03:41 | and there's no real statistical training involved |
---|
0:03:45 | so in that sense they're kind of ad hoc when you just throw them |
---|
0:03:49 | into |
---|
0:03:50 | a speaker clustering task |
---|
0:03:52 | so to address these problems, trained distance metrics have been proposed, namely |
---|
0:03:57 | eigenvoice- |
---|
0:03:59 | based |
---|
0:04:00 | methods |
---|
0:04:01 | in particular, the eigenvoices and eigenchannels in factor analysis |
---|
0:04:06 | provide a very elegant framework for modeling inter-speaker |
---|
0:04:11 | and intra-speaker variability, and we |
---|
0:04:14 | wanted to try to use this to come up with something that we think is a more reasonable |
---|
0:04:18 | distance measure, or method of comparing utterances |
---|
0:04:21 | so the first thing we thought was: |
---|
0:04:24 | how do we define |
---|
0:04:27 | a way to compare utterances? what exactly are we trying to do |
---|
0:04:31 | when we cluster? if you have two speech utterances |
---|
0:04:34 | and we think that they came from the same speaker, then we should cluster them |
---|
0:04:37 | and if we don't think they came from the same speaker, then we shouldn't cluster them |
---|
0:04:41 | that's what we're trying to |
---|
0:04:42 | do, basically |
---|
0:04:44 | so we just define |
---|
0:04:47 | the probability that the two utterances were spoken by the same person |
---|
0:04:51 | and that's our similarity |
---|
0:04:53 | metric |
---|
0:04:54 | so how do we define that probability? well, |
---|
0:04:58 | if you |
---|
0:05:00 | knew perfectly the posterior probability |
---|
0:05:02 | of each speaker given an arbitrary utterance, this P of w_i given X, |
---|
0:05:07 | then you could simply write |
---|
0:05:10 | this |
---|
0:05:11 | probability of H_1, which is the probability of |
---|
0:05:14 | the hypothesis that X_a and X_b, which are two arbitrary utterances, |
---|
0:05:19 | are from the same speaker |
---|
0:05:20 | and you can simply set up the equation this way, just using basic probability: |
---|
0:05:25 | the probability, given X_a, |
---|
0:05:29 | of producing speaker w_i, |
---|
0:05:33 | or let's say, given X_a, the probability of your speaker |
---|
0:05:37 | being w_i, |
---|
0:05:38 | and then, given X_b, your |
---|
0:05:41 | probability that your speaker is w_i; you just multiply these two, and then you just sum up over |
---|
0:05:46 | all the speakers in the world, so that set of w's |
---|
0:05:49 | is basically the population of the world |
---|
0:05:53 | and we can also, in a |
---|
0:05:55 | similar way, define |
---|
0:05:57 | the null hypothesis, where X_a and X_b come from different speakers, and then |
---|
0:06:02 | you simply do this summation |
---|
0:06:03 | over the i's and j's which are different |
---|
0:06:05 | and then |
---|
0:06:06 | it's very easy to show that these two probabilities are going to add up to one |
---|
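written out, the two hypotheses just described are (assuming the posterior of each of the W speakers in the population given an utterance were known):

```latex
% probability that X_a and X_b were spoken by the same speaker
P(H_1 \mid X_a, X_b) = \sum_{i=1}^{W} P(w_i \mid X_a)\, P(w_i \mid X_b)

% probability that X_a and X_b were spoken by different speakers
P(H_0 \mid X_a, X_b) = \sum_{i=1}^{W} \sum_{j \neq i} P(w_i \mid X_a)\, P(w_j \mid X_b)
```

since the posteriors for each utterance sum to one, the two expressions add up to one, as stated above.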
0:06:10 | so these are |
---|
0:06:11 | exact; |
---|
0:06:12 | it's just very basic probability, |
---|
0:06:14 | no one can question these, |
---|
0:06:16 | of course, but |
---|
0:06:17 | they're impractical; |
---|
0:06:19 | i mean, there's no way we can really |
---|
0:06:21 | obtain these posteriors |
---|
0:06:22 | so this is where factor analysis |
---|
0:06:25 | comes in |
---|
0:06:26 | so if you have a speaker-dependent gmm mean supervector, |
---|
0:06:31 | you can model it as a ubm mean supervector plus |
---|
0:06:35 | an eigenvoice matrix multiplied by a speaker factor vector, |
---|
0:06:39 | plus an eigenchannel matrix |
---|
0:06:40 | multiplied by a channel factor |
---|
0:06:42 | vector |
---|
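in the usual factor-analysis notation this is the joint factor analysis supervector model, with standard-normal priors assumed on the factors:

```latex
% speaker- and channel-dependent GMM mean supervector
M = m + V\,y + U\,z,
\qquad y \sim \mathcal{N}(0, I), \quad z \sim \mathcal{N}(0, I)
```

here m is the ubm mean supervector, V the eigenvoice matrix, y the speaker factors, U the eigenchannel matrix, and z the channel factors.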
0:06:45 | and if we assume that each speaker in the world is mapped to a unique speaker factor vector y, |
---|
0:06:50 | then you can just change the previous equation we had; we just replace the w's |
---|
0:06:54 | with y's |
---|
0:06:56 | of course, this still doesn't have any |
---|
0:06:57 | practical value yet; |
---|
0:06:59 | what we want to do is mold it into some kind of analytical form where we can |
---|
0:07:03 | introduce the priors that we have on y |
---|
0:07:07 | and z |
---|
0:07:11 | so the first step is: we have this |
---|
0:07:13 | summation |
---|
0:07:15 | of the posterior terms, |
---|
0:07:16 | and we just turn the summation |
---|
0:07:18 | into an integral |
---|
0:07:21 | okay, so to do this, |
---|
0:07:24 | the first thing we have to realise is that the summation is over the speakers, not the y's, |
---|
0:07:28 | whereas the integral is done over the y's |
---|
0:07:31 | so you actually have to go through just some really basic calculus, and the probability |
---|
0:07:37 | basically comes down to a |
---|
0:07:39 | riemann summation form |
---|
0:07:41 | and this is actually the form you get |
---|
0:07:45 | for the probability that the two utterances are from the same speaker |
---|
0:07:50 | and this form of equation actually turns up in different contexts too, |
---|
0:07:56 | which is quite interesting. here you see that you have a W |
---|
0:08:01 | in the denominator, |
---|
0:08:02 | which means that if W goes to infinity then this probability goes to zero |
---|
0:08:06 | which intuitively makes sense: |
---|
0:08:08 | you're trying to calculate the probability that they came from the same speaker, but |
---|
0:08:12 | if you have an infinite number of speakers |
---|
0:08:13 | then that probability should go to zero |
---|
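one way to see this step is sketched below, assuming a uniform prior of 1/W over the speakers and speaker factors drawn from the prior p(y); the paper's exact expression may be arranged differently, but the 1/W factor is the point being made:

```latex
P(H_1 \mid X_a, X_b)
  = \sum_{i=1}^{W} P(w_i \mid X_a)\, P(w_i \mid X_b)
  = \frac{1}{W^2} \sum_{i=1}^{W}
      \frac{p(X_a \mid y_i)\, p(X_b \mid y_i)}{p(X_a)\, p(X_b)}
  \;\approx\;
  \frac{1}{W}\,
  \frac{\int p(X_a \mid y)\, p(X_b \mid y)\, p(y)\, dy}{p(X_a)\, p(X_b)}
```

the average over the population is treated as a riemann sum, (1/W) times the sum of f(y_i) going to the integral of f(y) p(y) dy; the remaining 1/W in front is what drives the probability to zero as W grows.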
0:08:17 | so now |
---|
0:08:18 | all we need is closed-form expressions for the prior p(X) and |
---|
0:08:23 | the conditional probability of X |
---|
0:08:25 | given y |
---|
0:08:28 | and the first thing we did was simplify the problem by ignoring the intra-speaker variability, |
---|
0:08:34 | so let's just set the channel term to zero |
---|
0:08:36 | and just use the supervector m plus V y, so we just have the eigenvoices, |
---|
0:08:40 | not the eigenchannels |
---|
0:08:42 | and the second assumption that we made, |
---|
0:08:45 | well, |
---|
0:08:47 | before i get into that, |
---|
0:08:48 | there are two identities that we have to |
---|
0:08:51 | use, |
---|
0:08:54 | so let's just go through these two identities first |
---|
0:08:57 | the first is that a gaussian of an observation |
---|
0:08:59 | can also be written as a gaussian with respect to the mean |
---|
0:09:02 | and the second identity we use is that the product of two gaussians is also a gaussian; that's all you |
---|
0:09:06 | really need to know. it won't be a normalized gaussian, there are going to be some |
---|
0:09:09 | scale factors at the beginning, but |
---|
0:09:11 | it's essentially just going to be a gaussian |
---|
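the two standard gaussian identities being referred to are:

```latex
% 1) a gaussian in the observation is also a gaussian in the mean
\mathcal{N}(x;\, \mu,\, \Sigma) = \mathcal{N}(\mu;\, x,\, \Sigma)

% 2) the product of two gaussians in x is an unnormalized gaussian in x
\mathcal{N}(x;\, a,\, A)\, \mathcal{N}(x;\, b,\, B)
  = \mathcal{N}(a;\, b,\, A + B)\;\mathcal{N}(x;\, c,\, C),
\qquad C = (A^{-1} + B^{-1})^{-1}, \quad c = C\,(A^{-1}a + B^{-1}b)
```

the leading term, the gaussian of a evaluated at b with covariance A plus B, is the scale factor mentioned above.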
0:09:15 | and then another assumption that we made |
---|
0:09:17 | to simplify the computation |
---|
0:09:20 | is that each vector in each utterance was generated by |
---|
0:09:24 | only one gaussian in the gmm, not the whole mixture, because if you use the whole |
---|
0:09:29 | mixture then the computation becomes |
---|
0:09:31 | too complicated |
---|
0:09:32 | so now you can see here that the mixture summation is just replaced by a |
---|
0:09:37 | single gaussian |
---|
0:09:39 | and how do we decide which mixture component |
---|
0:09:41 | generated each frame? |
---|
0:09:43 | well, one way is to just obtain the maximum likelihood estimate of the y |
---|
0:09:47 | for each utterance, |
---|
0:09:48 | which then fully describes the parameters of the gmm, |
---|
0:09:51 | and then |
---|
0:09:52 | for each frame you just find the gaussian with the maximum |
---|
0:09:56 | occupation probability |
---|
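a small sketch of that hard-assignment step, assuming the adapted gmm parameters (weights, means, covariances) for the utterance are already available; this is an illustration, not the paper's code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def hard_assign(frames, weights, means, covs):
    """For each frame, pick the single Gaussian with the highest occupation
    (posterior) probability, standing in for the full mixture."""
    # log responsibilities: log w_c + log N(x_t; mu_c, Sigma_c), shape (T, C)
    log_post = np.stack([
        np.log(w) + multivariate_normal.logpdf(frames, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ], axis=1)
    return np.argmax(log_post, axis=1)  # index of the dominant Gaussian per frame
```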
0:09:58 | so |
---|
0:09:59 | now you can see that this conditional is basically just a multiplication of gaussians; that's all |
---|
0:10:04 | we have, just a whole string of gaussians multiplied together |
---|
0:10:07 | and we know that when you multiply gaussians you get another gaussian, although it's not normalized |
---|
0:10:12 | so you just continuously apply that identity to pairs of gaussians |
---|
0:10:17 | and collapse the whole string of multiplications |
---|
0:10:20 | and you don't need to pay too much attention to the math up here, |
---|
0:10:25 | but the point is that if you keep going |
---|
0:10:27 | you're basically just going to get one gaussian multiplied by some complicated |
---|
0:10:33 | factor, which now depends only on your observations and your |
---|
0:10:40 | eigenvoices |
---|
0:10:41 | and your universal background model |
---|
0:10:44 | and this also allows us to obtain a closed-form solution |
---|
0:10:48 | for the prior as well |
---|
0:10:50 | and here again, everything inside the integral is just a product of gaussians; |
---|
0:10:55 | at the end you're just left with one gaussian that's integrated from negative infinity to infinity, so it just integrates to |
---|
0:11:00 | one |
---|
0:11:01 | so now |
---|
0:11:02 | you've basically gotten rid of your integral |
---|
0:11:04 | and you're just left with all these |
---|
0:11:07 | factors that are just based on your |
---|
0:11:08 | input observations and your model, |
---|
0:11:11 | and your pre-trained eigenvoices |
---|
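a sketch of what those two quantities look like under the simplifications above (each frame x_t hard-assigned to one mixture component c_t, no channel term); the paper's bookkeeping of the scale factors is more involved, but the structure is this:

```latex
% conditional likelihood: each frame generated by its assigned component
p(X \mid y) = \prod_{t=1}^{T}
  \mathcal{N}\!\big(x_t;\; m_{c_t} + V_{c_t}\, y,\; \Sigma_{c_t}\big)

% prior (marginal) likelihood: integrate the speaker factors out against
% their standard-normal prior; a gaussian integral, hence closed form
p(X) = \int p(X \mid y)\, \mathcal{N}(y;\, 0,\, I)\, dy
```

here m_{c_t}, V_{c_t}, and Sigma_{c_t} are the ubm mean, the eigenvoice rows, and the covariance belonging to component c_t.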
0:11:15 | so for everything here, again, you pretty much go through the same process, and |
---|
0:11:20 | this is actually the final form |
---|
0:11:23 | that you can get: for two arbitrary speech utterances X_a and X_b |
---|
0:11:28 | you can actually compute the probability that they came from the same speaker |
---|
0:11:33 | and it doesn't matter which speaker that is; we actually marginalize over all the speakers in |
---|
0:11:38 | the world |
---|
0:11:39 | and this is basically the closed-form solution |
---|
0:11:43 | that you end up with |
---|
0:11:45 | and if you look at this solution |
---|
0:11:48 | you can actually see that |
---|
0:11:49 | for each utterance |
---|
0:11:51 | you just need a set of sufficient statistics, |
---|
0:11:55 | the ones shown here, and these are enough |
---|
0:11:58 | to compute your |
---|
0:12:00 | utterance comparison function, this probability. so |
---|
0:12:03 | in some settings where you don't want to keep |
---|
0:12:06 | the input observation data, you can just |
---|
0:12:10 | extract the sufficient statistics |
---|
0:12:13 | and then just |
---|
0:12:14 | discard |
---|
0:12:15 | the observations, |
---|
0:12:17 | if you're working in a memory-constrained environment |
---|
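a minimal sketch of that workflow, with hypothetical sufficient_stats() and same_speaker_prob() helpers standing in for the paper's actual statistics and closed-form comparison function:

```python
def cache_statistics(utterances, sufficient_stats):
    """Reduce each utterance to its per-utterance sufficient statistics once,
    so the raw feature frames can be discarded afterwards."""
    return [sufficient_stats(frames) for frames in utterances]

def pairwise_similarity(stats, same_speaker_prob):
    """Compute the utterance comparison score for every pair of utterances
    directly from the cached statistics, without the original frames."""
    n = len(stats)
    return {(i, j): same_speaker_prob(stats[i], stats[j])
            for i in range(n) for j in range(i + 1, n)}
```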
0:12:22 | so with this as the distance measure, we just applied it to |
---|
0:12:26 | the classical clustering method of doing speaker diarisation |
---|
0:12:30 | for the callhome |
---|
0:12:32 | data set |
---|
0:12:33 | and we just used a measure of |
---|
0:12:38 | cluster purity |
---|
0:12:39 | and a measure of how accurately we estimate the number of speakers; |
---|
0:12:44 | we actually have to use both of them in conjunction, |
---|
0:12:47 | it doesn't really make sense to just use one of them |
---|
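for reference, one common way of computing average cluster purity is sketched below; the exact metric definitions used in the experiments may differ:

```python
from collections import Counter

def average_cluster_purity(cluster_ids, speaker_ids):
    """Fraction of utterances whose cluster's dominant speaker matches their
    own speaker, i.e. purity averaged over all utterances."""
    by_cluster = {}
    for c, s in zip(cluster_ids, speaker_ids):
        by_cluster.setdefault(c, []).append(s)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(speaker_ids)
```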
0:12:50 | and these are just the optimal numbers that we were able to get |
---|
0:12:53 | using |
---|
0:12:54 | the four different |
---|
0:12:56 | distance functions |
---|
0:12:57 | we used telephone conversations where the number of speakers ranged from two to seven, |
---|
0:13:02 | just twelve mfccs |
---|
0:13:04 | with energy, and |
---|
0:13:05 | we dropped the non-speech frames |
---|
0:13:08 | we used eigenvoices that were |
---|
0:13:12 | trained using, |
---|
0:13:14 | i think it was |
---|
0:13:16 | the switchboard database |
---|
0:13:20 | and here you can see that the proposed model has much better performance than |
---|
0:13:27 | the others that we tried |
---|
0:13:31 | and this isn't really in the paper, but you can actually do |
---|
0:13:36 | an extension to the model |
---|
0:13:37 | we actually dropped the eigenchannel matrix originally |
---|
0:13:42 | for simplicity, but now we can actually include it and then go through the same process; |
---|
0:13:46 | it's actually a lot more |
---|
0:13:47 | involved, but again you can actually get this kind of closed-form solution, where |
---|
0:13:53 | now it also involves the eigenchannels that model |
---|
0:13:57 | the intra-speaker variability, and you can actually easily show that this |
---|
0:14:02 | simplifies to the previous one we had if you |
---|
0:14:05 | set the eigenchannel matrix to zero |
---|
0:14:09 | so we actually tried this as an additional experiment, |
---|
0:14:13 | using eigenchannel matrices that we trained on, i think, |
---|
0:14:18 | a microphone database |
---|
0:14:20 | and that actually improved the accuracy on the callhome task by |
---|
0:14:24 | one or two percentage points |
---|
0:14:26 | and there are actually more extensions that you can do here; you can also derive this equation |
---|
0:14:31 | for the general case of N speakers instead of just two |
---|
0:14:36 | so |
---|
0:14:37 | that's |
---|
0:14:38 | pretty much it |
---|
0:14:40 | thanks very much |
---|
0:14:47 | and we have time for one or two questions |
---|
0:14:56 | so, a question about the data: how did you deal with the overlapping |
---|
0:15:02 | speech? |
---|
0:15:03 | um, there was some, but |
---|
0:15:05 | well, |
---|
0:15:07 | each channel was recorded separately, |
---|
0:15:09 | so when there was overlapping speech i basically just discarded |
---|
0:15:13 | one channel and then |
---|
0:15:15 | just used one channel, so as to ensure that there's only one speaker talking |
---|
0:15:19 | in each utterance when doing the clustering task |
---|
0:15:22 | i just used the manual transcriptions to |
---|
0:15:25 | obtain, |
---|
0:15:26 | to pre-segment the utterances, so the utterances were basically pure |
---|
0:15:31 | and so did you try to just see what happens when there is overlapping speech, just to see whether |
---|
0:15:35 | it comes out as a single |
---|
0:15:37 | new speaker or something? |
---|
0:15:39 | um, |
---|
0:15:41 | but that would be interesting to try |
---|
0:15:48 | (inaudible question from the audience) |
---|
0:16:01 | yep |
---|
0:16:07 | yeah, i did actually try it with the bic |
---|
0:16:09 | um, the performance actually wasn't too great, |
---|
0:16:12 | so |
---|
0:16:13 | i just didn't mention it |
---|
0:16:16 | yeah, for this task |
---|
0:16:19 | it just seemed like the glr gave better results |
---|
0:16:25 | than the bic, |
---|
0:16:26 | you know |
---|
0:16:32 | (inaudible comment from the audience) |
---|
0:16:40 | yeah, it actually did better |
---|
0:16:44 | yeah, i mean, i wish i had the nist database, |
---|
0:16:48 | but we don't have it |
---|
0:16:49 | so |
---|
0:16:53 | (partly inaudible) maybe it's because this is simply, |
---|
0:16:56 | it's from calls, |
---|
0:16:59 | that's from calls that were recorded too |
---|
0:17:05 | um, maybe it's because of the |
---|
0:17:08 | frequency range, considering |
---|
0:17:09 | yeah, i don't remember if it was eight k or sixteen k |
---|
0:17:15 | okay |
---|
0:17:19 | okay, thank you again |
---|