0:00:20 | Hello, in this talk we are going to present |
---|
0:00:24 | binary key speaker diarization. |
---|
0:00:26 | This is joint work with the University of Avignon, and I myself am coming from |
---|
0:00:30 | Telefonica Research. |
---|
0:00:32 | So, without further delay, |
---|
0:00:39 | here is the outline. |
---|
0:00:40 | I am first going to review what speaker diarization is, at least |
---|
0:00:44 | for those of you that do not remember it from the previous |
---|
0:00:46 | talk. |
---|
0:00:47 | Then I will talk about binary speaker modeling, |
---|
0:00:50 | and then how we turn these two things into the binary speaker diarization system that we just developed, |
---|
0:00:55 | then experiments, and then I will conclude with future work. |
---|
0:00:59 | First, speaker diarization. As most of you already know, in diarization we split |
---|
0:01:03 | the audio into its speakers: |
---|
0:01:05 | we determine who spoke when, and we do not know a priori |
---|
0:01:09 | who the speakers are or how many speakers there are. |
---|
0:01:16 | So what is the state of the art these days? |
---|
0:01:18 | Well, what have we achieved? |
---|
0:01:20 | As a community, over the last years we have gotten |
---|
0:01:23 | down to around seven to ten percent error for broadcast news, even though this is |
---|
0:01:27 | something that since two thousand four |
---|
0:01:29 | is not part of the NIST evaluations, and I bet nowadays it is |
---|
0:01:33 | even lower than that. |
---|
0:01:35 | And we have gotten to twelve to fourteen percent for meetings, even maybe nine percent now on |
---|
0:01:41 | meeting data. |
---|
0:01:43 | These are great results; they mean that diarization should be usable |
---|
0:01:47 | as a building block for other applications, like |
---|
0:01:52 | speaker ID when there are multiple speakers in the data. |
---|
0:01:55 | But we still have a problem: |
---|
0:01:56 | it is too slow. |
---|
0:01:58 | To give some example numbers: in standard systems, if you develop a diarization system and do not do |
---|
0:02:04 | anything about speed, |
---|
0:02:06 | it is most probably going to run way above one times real time. |
---|
0:02:10 | And if you try doing something about it: |
---|
0:02:14 | these are the two systems I have seen from people who were trying to do something about it. |
---|
0:02:19 | The first one, |
---|
0:02:20 | from a couple of years ago, |
---|
0:02:22 | was going down to zero point ninety-seven times real time on a single core; they were |
---|
0:02:28 | doing some tweaks |
---|
0:02:29 | to the GMM-based algorithms of a hierarchical bottom-up system, |
---|
0:02:35 | and they were getting to just under real time. |
---|
0:02:38 | And further on they said, okay, let us go to the GPU, |
---|
0:02:41 | so we can use sixteen cores or however many the architecture gives us, |
---|
0:02:45 | and they went down to zero point zero seven times real time; nowadays this is probably even |
---|
0:02:49 | faster, but it is tied to the GPU, so |
---|
0:02:52 | you cannot have it in a mobile phone, and you do not want these |
---|
0:02:56 | systems to have to work |
---|
0:02:59 | only on one particular architecture. |
---|
0:03:02 | And this is what we want: |
---|
0:03:04 | to have a system |
---|
0:03:05 | that really is very fast, where it does not matter what architecture you are running it on, |
---|
0:03:12 | and that still performs well. |
---|
0:03:14 | In this case we did it by adapting a recently proposed technique called binary speaker modeling. |
---|
0:03:21 | We also have another poster on using this for speaker ID. |
---|
0:03:27 | Here we adapted it to diarization, and I will tell you |
---|
0:03:31 | how we did it. |
---|
0:03:34 | To understand what we will do, you need to know the basics of binary speaker modeling, so |
---|
0:03:39 | I am going to explain it a little bit more now. |
---|
0:03:43 | So, |
---|
0:03:44 | this is the basic idea: we have some input acoustic |
---|
0:03:47 | data, and at the end we want |
---|
0:03:50 | a vector |
---|
0:03:52 | of J |
---|
0:03:54 | zeros and ones. |
---|
0:03:55 | The way it is done, in a very general way as explained here: we first extract some acoustic |
---|
0:04:02 | parameters, MFCC or whatever we want, |
---|
0:04:05 | and we use |
---|
0:04:07 | a binary key background model, the KBM, which is basically a UBM but trained in a different way, |
---|
0:04:13 | to fit |
---|
0:04:14 | this acoustic data; and then with this KBM we obtain |
---|
0:04:18 | these binary keys |
---|
0:04:19 | for each acoustic |
---|
0:04:21 | segment, which could be the data for one speaker or the data for a couple of seconds. |
---|
0:04:28 | Now, what is this KBM, |
---|
0:04:30 | this KBM model? You can understand it in different ways; it is basically a set of Gaussians |
---|
0:04:35 | positioned |
---|
0:04:36 | in a particular way in the acoustic space, the multi-dimensional acoustic space; |
---|
0:04:40 | here you have just one dimension shown. |
---|
0:04:43 | We can see it in the example: |
---|
0:04:44 | we first position the |
---|
0:04:46 | Gaussians in the space, and then we take input data, acoustic data of |
---|
0:04:50 | a speaker, |
---|
0:04:51 | and we see which of these Gaussians |
---|
0:04:54 | are most present, which best represent our data. |
---|
0:04:58 | From there we extract a binary fingerprint, which |
---|
0:05:03 | has zeros |
---|
0:05:05 | in the positions of the Gaussians that do not represent the data well, and ones |
---|
0:05:10 | for the Gaussians that best match our data. |
---|
0:05:14 | And that is it. |
---|
0:05:16 | So how do we do it for diarization; how do we put all of this together? |
---|
0:05:20 | Well, as we can see here, on the left side we have the input signal, |
---|
0:05:25 | from which we compute the MFCC acoustic features, and on the right side we have the |
---|
0:05:30 | KBM. |
---|
0:05:32 | And in the middle, |
---|
0:05:34 | the vertical vectors: |
---|
0:05:36 | here we have vectors |
---|
0:05:38 | whose dimensionality is N, which is the number of Gaussians we have |
---|
0:05:43 | in our KBM model. |
---|
0:05:46 | And for each input feature vector we select |
---|
0:05:49 | the best Gaussians; |
---|
0:05:51 | we could take the top one percent, the top two percent, the ten best, whatever |
---|
0:05:56 | we want to use. |
---|
0:05:57 | After processing all of the |
---|
0:06:00 | input feature vectors, |
---|
0:06:02 | from x one to x T, |
---|
0:06:06 | against our KBM model, |
---|
0:06:08 | we get down to this counting vector, |
---|
0:06:11 | the sum of the resulting vectors, which basically |
---|
0:06:15 | counts how many times |
---|
0:06:16 | each of these Gaussians |
---|
0:06:18 | has been selected as one of the best representing Gaussians for the acoustic data. |
---|
0:06:23 | And then we just say, okay, |
---|
0:06:25 | the |
---|
0:06:27 | top N percent or whatever of the Gaussians most present in the data become ones, and the rest |
---|
0:06:31 | are set to zeros. |
---|
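The count-and-binarize extraction just described (score each frame against the KBM, keep the top-scoring Gaussians per frame, accumulate counts over the segment, then binarize the most frequent) can be sketched as follows. This is a minimal illustration assuming one-dimensional features and single Gaussians; the function names and the `top_k` and `top_ratio` values are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

def log_gauss(x, mean, var):
    """Log-likelihood of a scalar feature under a single 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def binary_key(frames, kbm, top_k=1, top_ratio=0.25):
    """For each frame, select the top_k best-scoring KBM Gaussians; count
    how often each Gaussian is selected over the whole segment; then set
    to one the top_ratio fraction of most-selected Gaussians, zero the rest."""
    counts = Counter()
    for x in frames:
        ranked = sorted(range(len(kbm)),
                        key=lambda g: log_gauss(x, *kbm[g]), reverse=True)
        counts.update(ranked[:top_k])
    n_ones = max(1, round(top_ratio * len(kbm)))
    ones = {g for g, _ in counts.most_common(n_ones)}
    return [1 if g in ones else 0 for g in range(len(kbm))]
```

With a four-Gaussian KBM, for example, frames clustered around one of the means produce a key with a single one at that Gaussian's position.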
0:06:39 | So, |
---|
0:06:40 | once we have |
---|
0:06:42 | a binary vector |
---|
0:06:43 | for two speakers, or for two sets of acoustic data, |
---|
0:06:47 | it is very fast and very easy |
---|
0:06:49 | to compare them, to compute how close they are. |
---|
0:06:52 | Here is just an example; the exact type of metric to be used is somewhat free. |
---|
0:06:57 | The one shown |
---|
0:07:00 | at the top is the one we used, |
---|
0:07:01 | and it is just one possibility; |
---|
0:07:03 | there are many possibilities when working in the binary domain; you just need to find a |
---|
0:07:07 | way to compare two binary signals. |
---|
0:07:10 | In the one we used in this paper, |
---|
0:07:12 | the numerator is just a sum |
---|
0:07:16 | that adds one whenever in the two vectors we both have a |
---|
0:07:20 | one, |
---|
0:07:22 | and the denominator |
---|
0:07:24 | adds one whenever in either of the two vectors |
---|
0:07:28 | we have a one. |
---|
0:07:29 | And this gives us a score from zero to one, |
---|
0:07:32 | where zero means the vectors |
---|
0:07:33 | are not similar at all, and one means they are the same vector. |
---|
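The similarity just described (ones in common in the numerator, ones in either vector in the denominator, a Jaccard-style score) is essentially a one-liner; this sketch is only an illustration of that idea:

```python
def binary_similarity(a, b):
    """Adds one to the numerator wherever both vectors have a one, and one
    to the denominator wherever either vector has a one; returns 0..1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0
```

Because the keys are plain bit vectors, this comparison needs only AND/OR-style operations, which is why the later clustering stage is so cheap.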
0:07:41 | That is it for binary speaker modeling; |
---|
0:07:43 | as I said, we have a poster with more experiments about it, and you can go back to |
---|
0:07:48 | our earlier publication. |
---|
0:07:50 | Let us see now how we apply it |
---|
0:07:52 | to |
---|
0:07:53 | speaker diarization. |
---|
0:07:57 | So this is basically the new system that we developed. |
---|
0:08:01 | This is, |
---|
0:08:03 | even if it looks a bit different or strange, |
---|
0:08:04 | just an agglomerative bottom-up system. |
---|
0:08:10 | We can see that there is iterative clustering at the bottom, and we have |
---|
0:08:14 | a kind of a stopping criterion, or rather a cluster selection. |
---|
0:08:17 | Let us see |
---|
0:08:19 | the blocks one by one, starting from the bottom. |
---|
0:08:22 | First, the feature extraction, to extract MFCCs or whatever we want. |
---|
0:08:27 | Then the next stage: |
---|
0:08:28 | we need to train the KBM model; in this case we train it from the |
---|
0:08:32 | test data itself, we do not use external data. |
---|
0:08:35 | Then the features are binarized. |
---|
0:08:38 | Then we take the acoustic features |
---|
0:08:40 | and, |
---|
0:08:41 | like |
---|
0:08:42 | always in diarization, we need to initialize the system; as we are doing |
---|
0:08:46 | a bottom-up system, we need |
---|
0:08:47 | many more clusters than there are actual speakers, so |
---|
0:08:51 | we somehow need to create those clusters. |
---|
0:08:53 | And |
---|
0:08:55 | this part of the processing |
---|
0:08:57 | is done |
---|
0:08:59 | in a way that uses |
---|
0:09:02 | just a little bit of the computational time of the system. |
---|
0:09:05 | After that comes the agglomerative clustering, |
---|
0:09:08 | in which we keep joining together those clusters that are closest; this is |
---|
0:09:14 | all done in the binary space. |
---|
0:09:15 | And finally, and this is one difference from a standard |
---|
0:09:19 | agglomerative clustering system, we go |
---|
0:09:21 | from N clusters down to one. |
---|
0:09:23 | Once we have reached one, we |
---|
0:09:24 | use an algorithm to select how many |
---|
0:09:26 | clusters we optimally have. |
---|
0:09:30 | As I said, |
---|
0:09:31 | for the MFCCs |
---|
0:09:33 | we use a standard setup, computed every ten milliseconds over twenty-five-millisecond windows. |
---|
0:09:39 | And the KBM, as I said, is |
---|
0:09:42 | a model that is trained in a special way. |
---|
0:09:46 | Why in a special way? |
---|
0:09:48 | If you use a UBM model trained with standard EM-ML techniques, |
---|
0:09:52 | you are going to have the Gaussians positioned at the average points, |
---|
0:09:56 | modeling the data optimally on average, and it has been shown that they |
---|
0:10:01 | do not |
---|
0:10:02 | represent well the particularities, the discriminative information that the |
---|
0:10:07 | speakers in your audio have. |
---|
0:10:09 | So we try to do something different that can model that. |
---|
0:10:12 | Regarding the size N: it can be anything above five hundred Gaussians; we can |
---|
0:10:18 | go to ten thousand and the performance |
---|
0:10:20 | does not change, neither do the error rates. |
---|
0:10:24 | How do we do this? |
---|
0:10:26 | In this case, |
---|
0:10:28 | in this paper, we do it in the following way. |
---|
0:10:31 | We take the input audio and we first train |
---|
0:10:35 | one single Gaussian for, |
---|
0:10:37 | I believe it is every two seconds of speech, with some overlap. |
---|
0:10:41 | So at the end this gives us, |
---|
0:10:42 | say, |
---|
0:10:43 | a thousand Gaussians |
---|
0:10:45 | or two thousand Gaussians in total. |
---|
0:10:48 | These Gaussians model very small portions of the audio, so wherever there is a speaker they represent that speaker |
---|
0:10:54 | very discriminatively. |
---|
0:10:55 | And then we use a greedy iterative metric to adaptively |
---|
0:11:00 | choose those Gaussians that |
---|
0:11:02 | optimally |
---|
0:11:05 | model the space, |
---|
0:11:06 | Gaussians as separate from each other as possible, covering |
---|
0:11:09 | the whole acoustic space. |
---|
0:11:11 | And that is it. |
---|
0:11:12 | This is actually much faster than doing it with iterative splitting and EM-ML. |
---|
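A rough sketch of this selection idea: train one Gaussian per short window, then greedily keep Gaussians that are far from the ones already kept, so the selection spreads over the acoustic space. The symmetric-KL distance and the farthest-first rule below are plausible choices for illustration, not necessarily the exact criterion used in the paper, and the Gaussians are one-dimensional `(mean, var)` pairs for simplicity.

```python
import math

def sym_kl(g0, g1):
    """Symmetric KL divergence between two 1-D Gaussians (mean, var)."""
    def kl(a, b):
        (m0, v0), (m1, v1) = a, b
        return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)
    return kl(g0, g1) + kl(g1, g0)

def select_kbm(pool, n):
    """Greedy farthest-first selection: start from the first candidate and
    repeatedly add the pool Gaussian whose minimum distance to the current
    selection is largest, until n Gaussians cover the space."""
    selected = [pool[0]]
    remaining = list(pool[1:])
    while len(selected) < n and remaining:
        far = max(remaining, key=lambda g: min(sym_kl(g, s) for s in selected))
        remaining.remove(far)
        selected.append(far)
    return selected
```

Note how a near-duplicate Gaussian is skipped in favor of one that covers a new region of the space, which is the point of training the KBM this way rather than with EM-ML.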
0:11:18 | Next: |
---|
0:11:19 | how do we derive |
---|
0:11:20 | the binary vectors |
---|
0:11:22 | from the acoustic data? We do it in two steps. |
---|
0:11:27 | The first step, |
---|
0:11:28 | which is |
---|
0:11:30 | finding the |
---|
0:11:31 | K best |
---|
0:11:34 | Gaussians for each acoustic feature, we have to do |
---|
0:11:37 | one time only; and then, in the second step, |
---|
0:11:40 | for every subset of features that we want to compute a fingerprint from, we only need |
---|
0:11:48 | the accumulation step, |
---|
0:11:51 | run every time we need it, and that is actually very fast. |
---|
0:11:53 | So: |
---|
0:11:55 | we have the MFCC vectors |
---|
0:11:57 | at the top, |
---|
0:11:58 | and for each of them we get its |
---|
0:12:00 | best Gaussians; here we are working with the |
---|
0:12:04 | top five. |
---|
0:12:05 | That is our first part, and we can store it in memory, |
---|
0:12:09 | and that is done only one time. This is a little expensive, because we are evaluating Gaussian mixture models, |
---|
0:12:13 | but it is |
---|
0:12:15 | one time only. |
---|
0:12:16 | Then, every time we need |
---|
0:12:19 | a speaker model, we just have to get |
---|
0:12:21 | the stored indices, |
---|
0:12:23 | accumulate |
---|
0:12:24 | the counts, |
---|
0:12:25 | and from those counts get the binary vector. |
---|
0:12:27 | And this is lightning fast. |
---|
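The two-step trick just described can be sketched like this: the expensive Gaussian scoring runs once per recording, and only index counting runs during clustering. The `score` callback and the table layout are assumptions made for illustration.

```python
from collections import Counter

def precompute_top_k(frames, kbm, top_k, score):
    """Step 1, run once: for every frame, store the indices of the top_k
    best-scoring KBM Gaussians (score(frame, gaussian) -> likelihood)."""
    return [sorted(range(len(kbm)), key=lambda g: score(x, kbm[g]),
                   reverse=True)[:top_k]
            for x in frames]

def segment_key(table, start, end, n_gaussians, n_ones):
    """Step 2, run per segment: count the stored indices over the segment
    and keep the n_ones most frequent Gaussians as the ones of the key."""
    counts = Counter(g for frame in table[start:end] for g in frame)
    ones = {g for g, _ in counts.most_common(n_ones)}
    return [1 if g in ones else 0 for g in range(n_gaussians)]
```

Any segment's key then costs only a pass over stored integers, which is why the clustering iterations stay cheap no matter how many merges are evaluated.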
0:12:31 | Next, |
---|
0:12:33 | I have to talk about initialization. |
---|
0:12:36 | Here we did something simple: for simplicity we just reuse the KBM, |
---|
0:12:42 | the KBM Gaussians, |
---|
0:12:44 | and build the initial clusters from the Gaussians that were |
---|
0:12:49 | chosen first. |
---|
0:12:51 | That is, |
---|
0:12:52 | we do an initial segmentation with them, and with those we assign |
---|
0:12:58 | the data to the clusters that |
---|
0:13:00 | match it the most. |
---|
0:13:03 | Now we are in the binary domain, |
---|
0:13:05 | okay, and once we have that, |
---|
0:13:07 | this step for us is exactly the same as, |
---|
0:13:11 | for example, what the ICSI system does for the agglomerative clustering, |
---|
0:13:14 | except that now everything is in the binary domain. |
---|
0:13:17 | So, for example, |
---|
0:13:20 | the fingerprints for our clusters come |
---|
0:13:22 | from the KBM Gaussians, |
---|
0:13:25 | and the cluster comparison is completely binary: |
---|
0:13:28 | comparing between all the cluster models and just choosing the two that are closest |
---|
0:13:34 | to merge them. |
---|
0:13:36 | For |
---|
0:13:37 | the reassignment, |
---|
0:13:39 | we just take |
---|
0:13:41 | three seconds of data, |
---|
0:13:43 | in steps of one second at a time, compute a fingerprint for each of them, |
---|
0:13:47 | and assign it to the best matching speaker model. |
---|
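The clustering loop just outlined, done entirely in the binary domain, can be sketched as below. Representing a merged cluster by the element-wise OR of its members' keys is a simplification for illustration; the actual system recomputes the key from the pooled counts.

```python
def key_similarity(a, b):
    """Ones in common over ones in either vector (0..1)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def cluster_binary_keys(keys, stop_at):
    """Agglomerative clustering: repeatedly find the pair of clusters with
    the highest key similarity and merge them, until stop_at remain."""
    clusters = [list(k) for k in keys]
    while len(clusters) > stop_at:
        _, i, j = max((key_similarity(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = [a | b for a, b in zip(clusters[i], clusters[j])]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters
```

Since each comparison is a pass over two bit vectors, the full pairwise comparison at every merge stays far cheaper than the likelihood evaluations a GMM-based system would need.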
0:13:55 | Last but not least, |
---|
0:13:56 | the last part of the system. |
---|
0:13:58 | Once we get down to one cluster, we have to choose how many clusters is |
---|
0:14:02 | our optimum number of clusters. |
---|
0:14:04 | For that |
---|
0:14:06 | we adapted |
---|
0:14:09 | a metric that was presented |
---|
0:14:11 | by other people at Interspeech two thousand eight. |
---|
0:14:14 | In the interest of time, |
---|
0:14:17 | I will leave the details to the paper; we just estimate the |
---|
0:14:22 | relation between the intra- and inter-cluster distances, |
---|
0:14:26 | which allows us to select the optimal number of clusters. |
---|
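One simple way to realize this intra- versus inter-cluster idea is sketched below; it is not the exact published metric the system adapts, just an illustration of the principle: score each candidate partition by the ratio of average within-cluster to average between-cluster key similarity, and keep the partition level that maximizes it.

```python
def key_similarity(a, b):
    """Ones in common over ones in either vector (0..1)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def partition_score(clusters):
    """clusters: list of clusters, each a list of binary keys. Returns the
    mean within-cluster similarity over the mean between-cluster
    similarity; tight, well-separated partitions score high, so comparing
    this value across clustering levels suggests a number of speakers."""
    within, between = [], []
    for i, ci in enumerate(clusters):
        within += [key_similarity(a, b)
                   for k, a in enumerate(ci) for b in ci[k + 1:]]
        for cj in clusters[i + 1:]:
            between += [key_similarity(a, b) for a in ci for b in cj]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(within) / mean(between) if mean(between) else float("inf")
```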
0:14:29 | Although I have to say, |
---|
0:14:30 | this is the part of the system that I am least happy about, and we will have to improve it. |
---|
0:14:37 | About the evaluation: |
---|
0:14:38 | of course we use the diarization error rate, but we also use the real-time factor. |
---|
0:14:43 | And because |
---|
0:14:44 | diarization results can fluctuate quite a bit, we decided to use |
---|
0:14:48 | all of the NIST rich transcription evaluation data, which is about thirty-six |
---|
0:14:53 | shows. |
---|
0:14:55 | And I have |
---|
0:14:55 | to say that it |
---|
0:14:56 | runs in just about an hour on a laptop PC, so it is pretty fast. |
---|
0:15:03 | Now, |
---|
0:15:03 | the results. |
---|
0:15:05 | The first line shows the results using a baseline GMM system, just an implementation of |
---|
0:15:12 | the |
---|
0:15:13 | basic |
---|
0:15:14 | one. |
---|
0:15:15 | It gets about twenty-three percent average diarization error, with a running time of about |
---|
0:15:21 | one point one nine times real time. |
---|
0:15:24 | There is no optimization here; it is just an implementation, |
---|
0:15:28 | the standard implementation. |
---|
0:15:31 | In the last two lines |
---|
0:15:33 | we have two configurations, depending on the number of Gaussians we take |
---|
0:15:38 | for the KBM: |
---|
0:15:39 | two possible implementations of the binary system. |
---|
0:15:42 | We can see that |
---|
0:15:45 | the diarization error rate is slightly higher than the baseline system, |
---|
0:15:51 | but the real-time factor is ten times |
---|
0:15:53 | faster, |
---|
0:15:54 | so this is pretty good. |
---|
0:15:55 | And to show the importance of the training of the KBM: |
---|
0:15:59 | when instead |
---|
0:16:01 | of that we used just a standard GMM trained on the data, |
---|
0:16:05 | that is the second line of results, we see that it just breaks. |
---|
0:16:08 | I mean, unless the Gaussians are speaker-characteristic, |
---|
0:16:11 | speaker-discriminant, as shown, it just does not work. |
---|
0:16:15 | I also mentioned that |
---|
0:16:18 | the selection of the number of clusters |
---|
0:16:20 | still does not |
---|
0:16:21 | do the job well: |
---|
0:16:23 | if we select the optimal number of clusters after running the system, |
---|
0:16:26 | we actually get an error rate |
---|
0:16:30 | which is |
---|
0:16:31 | better than our baseline. |
---|
0:16:34 | This plot is just to show how the diarization error rate varies |
---|
0:16:37 | depending on the number of Gaussians; |
---|
0:16:41 | the black line is the average. |
---|
0:16:44 | We can see that, |
---|
0:16:47 | somewhere between five hundred and nine hundred Gaussians for the KBM, the results become more or |
---|
0:16:52 | less flat, |
---|
0:16:53 | so it does not matter much whether we use five hundred or six hundred; |
---|
0:16:56 | it is fine. |
---|
0:16:58 | And this last plot shows, for all of the meetings, |
---|
0:17:02 | our |
---|
0:17:02 | proposed system versus the baseline; we can see that in most cases they perform |
---|
0:17:07 | more or less the same. Of course some shows are worse, |
---|
0:17:10 | maybe by a two percent difference, |
---|
0:17:12 | but |
---|
0:17:14 | there are also a couple of shows that are better. |
---|
0:17:18 | So, |
---|
0:17:19 | we |
---|
0:17:20 | showed that diarization was kind of at a standstill: |
---|
0:17:23 | the trend was to put more and more things on top of a standard system in order to get |
---|
0:17:29 | these little gains in performance. |
---|
0:17:31 | But |
---|
0:17:33 | by starting over with a new kind of system, we can hopefully go even further. |
---|
0:17:39 | In the work coming next, we plan to improve the binary key fingerprinting, |
---|
0:17:44 | we are going to find a better stopping criterion, hopefully, |
---|
0:17:47 | and also |
---|
0:17:49 | keep the system always mono-core, and maybe get it working in cell phones as well. |
---|
0:17:55 | Thank you very much. |
---|
0:18:02 | [The session chair thanks the speaker; a question from the audience follows, largely inaudible.] |
---|
0:18:19 | No, |
---|
0:18:20 | no, this is on the MDM data. |
---|
0:18:25 | Oh, |
---|
0:18:27 | oh sorry, the channel merging and speech detection are right at the beginning, |
---|
0:18:31 | at the very beginning. |
---|
0:18:32 | So it is just like a standard |
---|
0:18:35 | speech detection system, |
---|
0:18:36 | nothing special. |
---|
0:18:42 | Let me see if the slide goes back: |
---|
0:18:46 | it is just in the acoustic feature extraction at the beginning of the system, |
---|
0:18:49 | and we used the speech detection output from an external system. |
---|
0:18:53 | Thanks for that. |
---|
0:18:59 | No, no, I just use acoustic features; |
---|
0:19:02 | I do not merge. |
---|
0:19:03 | I use MDM, that is, multiple microphones, but just beamformed and then used as a single channel. |
---|
0:19:13 | There are many ideas, but we have not worked on that yet; |
---|
0:19:16 | we have to try. |
---|
0:19:18 | Okay, since we ran out of time, let us thank the speaker again. |
---|