0:00:06 | So, as I mentioned, we are going to use MLLR sufficient statistics for the speaker identification problem. |
---|
0:00:14 | We are not building any speech recognition as such in this particular talk. |
---|
0:00:19 | The idea is that we are looking specifically at the case where there is a large number of speakers, we want to identify one of them, and we want to do it in a computationally efficient way. |
---|
0:00:31 | This work was actually done by my students. |
---|
0:00:39 | So, just to give you a brief overview of the talk: I am going to go briefly over the speaker identification problem, that is, identifying one out of a set of L speakers. |
---|
0:00:50 | I will talk about the commonly used technique, MAP adaptation followed by top-C mixture based likelihood estimation. |
---|
0:00:58 | Then we show that if you have a large number of speakers, evaluating the likelihood across all the speakers and then choosing the best one is obviously very computationally expensive, and the number of speakers in the population can be very large. |
---|
0:01:14 | So we propose to use MLLR matrices for the adaptation of the speaker models. |
---|
0:01:20 | The reason is that then we just need to store the MLLR matrices, and we show that if you have the MLLR matrices, estimating the likelihood of the different speakers is a very fast step: it is just a matrix multiplication with the MLLR row vectors. |
---|
0:01:36 | And we give some comparison with the performance of the conventional GMM-UBM based system. We show that the MLLR system gives some degradation in performance. |
---|
0:01:48 | Therefore, finally we propose a sort of cascade system, where the MLLR system reduces the search space from this huge population, and then the final GMM-UBM system can look at a small set of speakers and identify the best speaker from that set. |
---|
0:02:06 | So this is the basic flow of the talk. |
---|
0:02:10 | So, as I said, the idea is that we are doing speaker identification. There are L speakers, and it is a closed set, so we assume that there are L speakers in the population. |
---|
0:02:22 | Given a test feature, we are going to find the likelihood with respect to all the L speaker models and choose the one that maximizes the likelihood. |
---|
0:02:30 | Obviously, when the number of speakers in the population is large, I have to evaluate this for each and every speaker in the population, and therefore the computational complexity keeps growing as the number of speakers in the population becomes large. |
---|
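For reference, a standard way to write the decision rule being described (my notation, not taken from the slides): given test frames $X = \{\mathbf{x}_1, \dots, \mathbf{x}_T\}$ and speaker GMMs $\lambda_1, \dots, \lambda_L$, identification picks

$$\hat{s} \;=\; \arg\max_{1 \le s \le L} \; \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_s),$$

so the cost of the naive search grows linearly with the population size $L$ (and with the utterance length $T$).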
0:02:47 | So what would be the conventional method? The most popular method used for speaker identification, and pretty much the same thing is used for speaker verification, is that given a universal background model, for each of the speakers we basically do a MAP adaptation to get the speaker models from the universal background model. So these are our speaker-adapted models. |
---|
0:03:14 | Then, as Doug Reynolds pointed out, there is a way you can do the scoring efficiently, and that is: given the test data and such models, we first align the test data with respect to the UBM and find the top-C mixtures for that particular test data. |
---|
0:03:32 | So when you want to evaluate the likelihood, you do not have to compute each of those 2048 mixtures (assuming there are 2048 mixtures in the background model) for each of the speaker models. Instead, you do one first evaluation with respect to all the 2048 mixtures of the UBM, but then for each of the speaker models you just need C of those mixtures to be evaluated. |
---|
0:03:55 | But nevertheless, as L becomes large, there is a large increase in the computation. So it is still, as we will show, expensive, especially as L becomes large. |
---|
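A minimal sketch of the top-C scoring just described (an illustration, not the authors' code; the dict layout and all names are assumptions, with `ubm` and each speaker model holding weights `w`, means `mu`, and diagonal variances `var`). Each frame is aligned against the full UBM once, and every MAP-adapted model is then evaluated only on the C best-scoring mixture indices:

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Per-component diagonal-Gaussian log-densities for one frame x."""
    d = x - means                                  # broadcasts to (M, D)
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                   + np.sum(d * d / variances, axis=1))

def top_c_scores(X, ubm, speakers, C=5):
    """GMM-UBM top-C scoring: align each frame to the UBM once, then
    evaluate only the C best mixtures under each MAP-adapted model."""
    scores = np.zeros(len(speakers))
    for x in X:                                    # one pass over the frames
        lp_ubm = np.log(ubm["w"]) + log_gauss_diag(x, ubm["mu"], ubm["var"])
        top = np.argsort(lp_ubm)[-C:]              # indices of the top-C mixtures
        for s, spk in enumerate(speakers):         # C (not M) Gaussians per model
            lp = np.log(spk["w"][top]) + log_gauss_diag(
                x, spk["mu"][top], spk["var"][top])
            scores[s] += np.logaddexp.reduce(lp)
    return scores                                  # argmax gives the identified speaker
```

Note the per-speaker work is still proportional to the number of frames, which is why the cost keeps growing with L and with the utterance length.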
0:04:09 | So what we are proposing is again adaptation, but we are saying that instead of doing MAP adaptation, why don't we build speaker models using just MLLR adaptation. |
---|
0:04:23 | The idea is that for each speaker, given that we already have the UBM model, instead of MAP adaptation we now have a speaker model which has gone through MLLR speaker adaptation. |
---|
0:04:38 | This is where I think the confusion came from: we are using MLLR adaptation, which we have borrowed from the speech recognition literature. |
---|
0:04:47 | So for each speaker, the means of the speaker model are nothing but a matrix transformation of the means of the universal background model. The idea is that I just need this matrix, the MLLR matrix, to characterize a speaker. |
---|
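In the usual MLLR notation from the speech recognition literature (my rendering, not a formula shown here), the adapted mean of UBM mixture $m$ for speaker $s$ is

$$\boldsymbol{\mu}_m^{(s)} \;=\; \mathbf{A}_s \boldsymbol{\mu}_m + \mathbf{b}_s \;=\; \mathbf{W}_s \boldsymbol{\xi}_m, \qquad \boldsymbol{\xi}_m = [1,\; \boldsymbol{\mu}_m^\top]^\top,$$

so the single $D \times (D{+}1)$ matrix $\mathbf{W}_s = [\mathbf{b}_s \;\; \mathbf{A}_s]$ is all that needs to be stored per speaker.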
0:05:02 | So in essence we are not forming individual speaker models; instead, each speaker is now codified by his or her speaker-specific MLLR matrix. |
---|
0:05:12 | This is the stage that actually builds the speaker-specific MLLR matrix, and the identification problem then becomes one where we have L such matrices instead of L such models, and of course these matrices are what characterize the L speakers. |
---|
0:05:35 | So here the likelihood calculation essentially boils down to finding, for the test utterance, what its likelihood is with respect to the background model transformed by each of these L matrices, which are already stored, since you have done MLLR adaptation for each of these individual speakers. |
---|
0:05:53 | At this point it still looks like we need to compute all the L likelihoods, and therefore it looks like we have not solved anything as yet. But the advantage is that if I want to compute these individual likelihoods, it is now very simple: all that I need to do is some matrix multiplications to get the likelihoods for each of the individual speakers. |
---|
0:06:19 | So the idea is again borrowed from the speech recognition literature, because it is basically using the equations from MLLR matrix estimation. |
---|
0:06:32 | We make use of the auxiliary function. In conventional speech recognition, what you would do if you are doing MLLR estimation is actually try to estimate the matrix W_s given the adaptation data. |
---|
0:06:48 | The idea is: given the adaptation utterance X, what are the elements of the matrix that will maximize the likelihood, or in this case the auxiliary function that you are looking at. |
---|
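A hedged sketch of the objective being referred to, in the usual Leggetter-Woodland form (my reconstruction, constants dropped): conventional MLLR estimation maximizes the EM auxiliary function over the transform,

$$\mathbf{W}_s \;=\; \arg\max_{\mathbf{W}} \; -\tfrac{1}{2} \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_m(t)\, \bigl(\mathbf{x}_t - \mathbf{W}\boldsymbol{\xi}_m\bigr)^{\!\top} \boldsymbol{\Sigma}_m^{-1} \bigl(\mathbf{x}_t - \mathbf{W}\boldsymbol{\xi}_m\bigr),$$

where $\gamma_m(t)$ is the UBM occupancy of mixture $m$ at frame $t$. In the identification setting described next, the same auxiliary function is only evaluated at the L stored transforms, never re-maximized.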
0:06:58 | Now we pose the same problem in a speaker identification framework. The idea is that I already know the L speaker matrices: for each individual speaker I already know the MLLR matrix. The problem is now one of finding out which of those L matrices maximizes the likelihood. |
---|
0:07:21 | So in this case I am not estimating the MLLR matrices; I have already computed the MLLR matrices and stored them for each of the individual speakers, and the only thing I am doing here is finding the one of those L MLLR matrices that maximizes the likelihood. |
---|
0:07:39 | This is done very efficiently and, as I said, is again borrowed from speech recognition. We already have these L matrices, each of which is represented by the row vectors w_1 through w_D; these are row vectors. In MLLR these row vectors are what is estimated when you actually do speaker adaptation; here they are already precomputed and stored, and so we are only computing the likelihood. |
---|
0:08:07 | So why is it efficient? As I said, I do need to compute all the L likelihoods, but I can do that very efficiently. Why is it very fast? Because I just need to do one alignment of the data with respect to the UBM, and that is exactly the same thing that is normally done in MAP plus top-C likelihood estimation: I anyway have to do an alignment to find out which mixtures are dominant for that data. So that step is exactly the same as what we do with MAP plus top-C. |
---|
0:08:37 | The extra step that we do, which is again borrowed from speech recognition, is to basically compute, for the given test utterance, the corresponding sufficient statistics k_i and G_i. These are sufficient statistics computed from the alignment and the data: the gammas come from the alignment, and then the data comes in. |
---|
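For diagonal covariances, this is my reconstruction of the quantities the speaker calls k_i and G_i, in standard MLLR notation (an assumption about the exact definitions used):

$$\mathbf{k}_i \;=\; \sum_{m} \frac{\boldsymbol{\xi}_m}{\sigma_{m,i}^2} \sum_{t} \gamma_m(t)\, x_{t,i}, \qquad \mathbf{G}_i \;=\; \sum_{m} \frac{\boldsymbol{\xi}_m \boldsymbol{\xi}_m^\top}{\sigma_{m,i}^2} \sum_{t} \gamma_m(t),$$

and, dropping terms that are the same for every speaker, the auxiliary score of speaker $s$ with rows $\mathbf{w}_{s,i}$ reduces to

$$Q(\mathbf{W}_s) \;\doteq\; \sum_{i=1}^{D} \Bigl( \mathbf{w}_{s,i}\, \mathbf{k}_i \;-\; \tfrac{1}{2}\, \mathbf{w}_{s,i}\, \mathbf{G}_i\, \mathbf{w}_{s,i}^\top \Bigr),$$

so once $\mathbf{k}_i$ and $\mathbf{G}_i$ are in hand, each additional speaker costs only a few small matrix-vector products.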
0:08:58 | So for each of the L speakers, I now need just one matrix multiplication using these k_i and G_i statistics. The k_i and G_i are computed only once, irrespective of the number of speakers. |
---|
0:09:13 | The likelihood calculation now uses the individual row vectors from the corresponding speaker's MLLR matrix. There are D of these, and each row vector belongs to the model of that particular speaker, so the calculation is just a matrix multiplication. |
---|
0:09:35 | In a sense, this is the most crucial step that is happening: the likelihood can be easily computed for each of those L speakers by using the corresponding MLLR hypothesis and doing one matrix multiplication involving k_i and G_i, and that is where we get the maximum gain in performance, that is, in computation time. |
---|
0:09:57 | Just to go through the whole flow: given the feature vector, I am assuming that I have already taken each individual speaker's training data and computed the MLLR matrices for all the L speakers. |
---|
0:10:08 | Given a test feature, I first do an alignment with the background model and also compute the k_i and G_i statistics; this is done only once, using X, the test feature, and the UBM model. Then, for each of those L matrices, I just need to multiply the matrix with the statistics to get that speaker's likelihood. |
---|
0:10:32 | So this is very computationally efficient, because it only involves matrix multiplications. |
---|
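Putting the whole flow into a short sketch (my illustration under the assumptions above: diagonal-covariance UBM, precomputed per-speaker matrices W_s; all names are mine):

```python
import numpy as np

def mllr_statistics(X, ubm):
    """One pass over the test frames: UBM alignment (the gammas) plus the
    k_i and G_i sufficient statistics described above."""
    mu, var, w = ubm["mu"], ubm["var"], ubm["w"]      # (M, D), (M, D), (M,)
    M, D = mu.shape
    xi = np.hstack([np.ones((M, 1)), mu])             # extended means, (M, D+1)
    k = np.zeros((D, D + 1))
    G = np.zeros((D, D + 1, D + 1))
    for x in X:
        lp = np.log(w) - 0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                                + np.sum((x - mu) ** 2 / var, axis=1))
        gamma = np.exp(lp - np.logaddexp.reduce(lp))  # occupancies gamma_m(t)
        for i in range(D):
            g = gamma / var[:, i]                     # gamma_m(t) / sigma_{m,i}^2
            k[i] += (g * x[i]) @ xi
            G[i] += np.einsum("m,mj,ml->jl", g, xi, xi)
    return k, G

def mllr_scores(k, G, W_all):
    """Per-speaker score: small matrix products only, no pass over frames.
    Equals Q(W_s) up to a constant that is the same for every speaker."""
    # W_all: (L, D, D+1), the stacked per-speaker MLLR matrices
    lin = np.einsum("sij,ij->s", W_all, k)            # sum_i w_si . k_i
    quad = np.einsum("sij,ijl,sil->s", W_all, G, W_all)
    return lin - 0.5 * quad                           # argmax identifies the speaker
```

The point of the design is visible in the split: everything that touches the frames happens once in `mllr_statistics`, and the per-speaker loop in `mllr_scores` is independent of the utterance length.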
0:10:39 | Please stop me if you have any questions. |
---|
0:10:42 | So the proof of the pudding is basically to go through some time and complexity analysis. What we are doing now is comparing the conventional MAP plus top-C approach, that is, the GMM-UBM, with the fast MLLR system, the one that we have, with MLLR matrices that capture the speaker characteristics. |
---|
0:11:07 | What is shown on the left is again from the NIST 2004 data. We have two different test conditions: one is ten-second speech and the other is one-side speech, and there is a set of enrolled speakers in this identification task. |
---|
0:11:30 | What we are trying to do is, given the test data, to identify one of these enrolled speaker models. These are the ten-second and one-side cases. The blue is basically the conventional approach; here we have taken C to be 15, the top fifteen mixtures. |
---|
0:11:48 | You see that obviously there is a degradation in performance in the case of MLLR; we will look at that in a little more detail later. |
---|
0:12:00 | For the one-side speech, the GMM-UBM obviously does better, and there is also a corresponding improvement for the MLLR case, but again there is a gap in performance between the conventional case and the proposed approach. |
---|
0:12:16 | But the advantage comes in the right half of the figure. Here we are just using a fixed computer configuration and trying to find the average time taken to identify the optimal speaker; this is a summary of that. |
---|
0:12:36 | You can see that there is a huge gain in terms of complexity, of computation time: while the conventional system takes about 10.3 seconds on average, the proposed system takes about a second on average for the ten-second data over this population of speakers. |
---|
0:12:50 | When the test utterance becomes longer, it obviously takes much more time to compute: the conventional system takes about 44 seconds versus a few seconds for the MLLR. |
---|
0:13:00 | So the bottom line is that you get a huge gain, something like one to seven or one to ten, by using fast MLLR. This is useful if you have, say, two thousand speakers in your population and you want to identify which one of them spoke the utterance. |
---|
0:13:17 | But then there is a downside: you lose something in terms of performance. And obviously, when the test utterances are longer, the GMM-UBM takes a lot more time, and that is where you gain more. |
---|
0:13:37 | So this is a little more analysis, a little more detail of what is happening between the proposed fast MLLR and the GMM-UBM. Since the likelihood has to be computed for every speaker, the left-hand figure shows the computation time as the number of speakers in the database increases. |
---|
0:14:02 | The blue line is the conventional approach; for ten-second speech it obviously takes less time than for one-side speech. You can see that there is a sort of linear relationship with the number of speakers in the database: as the number of speakers in the database increases, the computation time essentially increases linearly. |
---|
0:14:21 | On the other hand, if you look at the MLLR system, which is the brown, sort of dark, line, it is almost flat as the number of speakers increases. That is because the main complexity comes basically from doing the alignment and things like that; the actual likelihood estimation does not vary significantly with the number of speakers, since it is just matrix multiplications with the MLLR matrices. |
---|
0:14:48 | So you can see that for a population of two thousand there is going to be a huge gain in terms of computation time. |
---|
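Rough bookkeeping of the two costs, in my own notation (T frames, M UBM mixtures, feature dimension D, population size L; these symbols are not from the slides): the conventional top-C system pays about $O(TMD)$ for the UBM pass plus $O(LTCD)$ for the speaker models, so the per-speaker cost grows with the utterance length $T$ and the total is visibly linear in $L$. The fast MLLR system pays the same $O(TMD)$ alignment plus a one-time accumulation of the $\mathbf{k}_i, \mathbf{G}_i$ statistics, after which scoring all speakers costs about $O(LD^3)$, independent of $T$; that term is tiny next to the alignment cost, which is why the measured curve looks nearly flat in $L$.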
0:14:57 | The other interesting thing is to look at the N-best performance of these two systems: that is, if I look at, say, the top forty speakers, how often does the correct speaker occur within that set. |
---|
0:15:11 | We see that as the number of speakers in the top N increases, the two systems obviously start converging; the blue is the GMM-UBM and the red, or rather the brown, is the MLLR. |
---|
0:15:24 | So the top-N performance, that is, identifying the correct speaker at least within the top hundred, is similar for the two systems. |
---|
0:15:35 | So we thought that we could sort of exploit the advantage of the GMM-UBM, which is obviously superior to MLLR in terms of performance, and still get some computational gain by using the MLLR to identify, from the population of a thousand or so, the top one or two hundred speakers, and then use only that reduced set of speakers in the final GMM-UBM system. That is what led to the cascade. |
---|
0:16:01 | So the idea is that the fast MLLR system first takes the test utterance and reduces the search space of speakers: we identify the top N most probable speakers, and the choice of N, as usual, has an impact on performance. Then we let the conventional GMM-UBM operate only on this reduced set of speakers to identify the best speaker. |
---|
0:16:26 | This is basically the same thing in implementation, which shows that we do not lose much in terms of additional cost and computation. The conventional approach would have taken the test feature, done an alignment with the UBM, found the top-C mixtures, and used the GMM-UBM based system to actually identify the speaker. |
---|
0:16:49 | Here we are doing exactly the same thing: there is an alignment step that goes on here, but we do an additional computation of the sufficient statistics; this is only done once. |
---|
0:16:59 | Then we have the MLLR system, which is done in the training phase: in the training phase we have already built the MLLR matrices for each of those individual speakers. Using the statistics, the features, and the MLLR hypotheses, we identify the N most probable speakers, and once we identify the N most probable speakers, we feed them to the GMM-UBM system to get the final identified speaker. So in both cases the alignment step is the same. |
---|
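A sketch of the cascade as described, reusing the hypothetical helpers sketched earlier (top_c_scores, mllr_statistics, mllr_scores); N is the tunable top-N that trades accuracy for speed:

```python
import numpy as np

def cascade_identify(X, ubm, speakers, W_all, N=30, C=5):
    """Stage 1: fast MLLR scoring prunes the population to the N most
    probable speakers. Stage 2: conventional GMM-UBM top-C scoring runs
    only on that short list and picks the final speaker."""
    k, G = mllr_statistics(X, ubm)                  # computed once per utterance
    shortlist = np.argsort(mllr_scores(k, G, W_all))[-N:]
    backend = [speakers[s] for s in shortlist]      # reduced model set
    best = np.argmax(top_c_scores(X, ubm, backend, C=C))
    return shortlist[best]                          # index of the identified speaker
```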
0:17:32 | So that is a compromise between complexity and performance. If I look at the N-best performance, that is, if I reduce the set of speakers to only the top N, then for the ten-second case there is a degradation in performance, but as N increases this degradation decreases, and if I look at the top-thirty performance, there is still obviously some hit in performance, but it is not very significant. |
---|
0:18:03 | On the other hand, even for the top thirty, I do get a significant gain in terms of computational complexity. As the number of speakers N increases, the backend GMM-UBM system has to work on more speakers, so obviously the computation time is going to grow, and therefore the speedup is reduced. But it is still significant: you do get about a five-times gain in terms of computation. |
---|
0:18:30 | The same sort of thing is repeated for the one-side case. The point about the one-side case is that there is a huge amount of data, about five minutes of speech. |
---|
0:18:41 | Again, if you look at the top-N performance, if I take the top ten there is obviously a huge hit in performance, about 2.5% absolute loss, but if I go to the top forty, then I get only about 0.7% degradation. |
---|
0:19:00 | But the price I pay is in the top-N based backend: since the one-side segments are long, even though I reduce the number of speakers to forty, the backend GMMs still have to operate on all these forty speakers. Therefore, compared to the ten-second case, the gains are not as significant, but we still get almost a three-times gain in computation. |
---|
0:19:22 | So this is the basic idea of our proposed method. It is a compromise: you can actually put the operating point at any of these N-best values, and you trade a small loss in performance for a gain in terms of computation. |
---|
0:19:39 | So basically we are using the idea of exploiting MLLR matrices to do fast likelihood calculation for the speaker models. But using MLLR adaptation decreases the performance, slightly or significantly depending on whether you use the ten-second or the one-side data. |
---|
0:19:59 | Therefore we combine this with the N-best scheme to reduce the search space, so that you retain accuracy but still gain in terms of computation time. |
---|
0:20:11 | For the ten-second case on this database, if you choose the top ten, you get these as the performance degradation and speedup; for the one-side case, with the top twenty, we get about a 3.1-times speedup. |
---|
0:20:25 | So this is basically it. |
---|
0:20:42 | Thank you very much. |
---|
0:20:51 | Q: How much more would you need to achieve the same result? I mean, if you want to achieve the same performance, not nearly the same performance. |
---|
0:21:26 | A: MLLR adaptation obviously has some hit compared to MAP; that is generally true, and I think that is what we noticed. So if you move to the top hundred or two hundred, you will get closer and closer to the conventional GMM-UBM, but you will never get exactly the same; you are always going to take some hit in performance. |
---|
0:21:48 | And the closer you go to the complete set, obviously, the smaller the gain in computation time becomes, if you want to get comparable performance. So what we think is that you will have some hit in performance; how much of a hit is in your hands, and depending on how much you are willing to go down in performance, you can get that much more gain in computation. |
---|
0:22:11 | So your question is: can I achieve GMM-UBM performance and still get a speedup? I am not sure about that; I think you will always lose something. |
---|
0:22:31 | Q: In speech recognition I noticed that using this kind of system you need more adaptation data than with MAP adaptation. |
---|
0:22:41 | A: The opposite is true, right? With more data, MAP is always better than MLLR. |
---|
0:22:50 | Q: But the alignment that you do for MLLR, because you are estimating... I mean, the constrained MLLR... |
---|
0:23:01 | A: No, MAP is what is used in most conventional cases; if you have enough data, obviously you should go back to MAP. |
---|
0:23:12 | Q: If I understand it well, in the case of MLLR you accumulate sufficient statistics, but in the case of the GMM-UBM you evaluate things frame by frame, right? You could use the same trick there: you could accumulate sufficient statistics even for the original MAP-adapted model. |
---|
0:23:45 | A: So what is your question? |
---|
0:23:48 | Q: I am just saying that the speedup comes from collecting sufficient statistics and evaluating the MLLR system quickly. But you could use the same trick with the MAP-adapted model: you can apply the auxiliary function instead of evaluating things frame by frame, rather than evaluating the GMM frame by frame and using that to score. |
---|
0:24:16 | A: So you are saying I could do a similar thing for MAP, I mean collect the sufficient statistics? |
---|
0:24:20 | Q: Exactly, yes. This is what we do, and it leads to something much faster; it would probably be even faster than the frame-by-frame scoring, without losing any performance. |
---|
0:24:33 | A: OK, so you think I could also do this for MAP; is that the question? |
---|
0:24:46 | Q: I just think you are basically comparing two different things. If you want to compare, you should compare both models with the sufficient statistics, and I guess that would be about the same. |
---|
0:25:05 | A: I am not very familiar with that, so maybe I should look at it: why do we always evaluate all of the top C mixtures and not do that instead? OK, maybe I should have a look. |
---|
0:25:25 | Q: Going back to your original premise: you were primarily focused on saying that you are dealing with a large population set, but I also saw the comparisons concerning durations. It was not just the large population; it was also the duration of the test utterance, right, ten seconds versus one side; that was one of the comparisons you had. |
---|
0:25:51 | So I see the MLLR approach you have is done kind of independently of the duration of the test utterance, except for the UBM statistics, right. But what other approaches have people taken? In speech recognition, why don't you look at the notion of beam pruning? That is a well-known thing: within frames you can drop a lot of hypotheses very quickly, so you do not necessarily have to go through and keep every hypothesis at any time. And if it is a speech recognizer, alternatively you can bail out. |
---|
0:26:30 | A: Yes, we actually mention in the paper that there are other methods that you can use to get a speedup, for example pruning or downsampling and things of that sort. We are not saying that this is the only way of doing fast computation; it is one of the ways we could possibly do it. |
---|
0:26:55 | Q: Right, but the question, since this is a research paper, is: you chose this method, and your baseline was full-frame scoring without the classical other ways of speeding up. Why was this the right comparison? |
---|
0:27:09 | A: Even in the case of pruning, I am sure you would take some hit in performance; I do not think you can get absolutely the same performance as the full GMM-UBM, because there is the possibility that while pruning you throw some speakers out. So the full system would be the ultimate performance that one would try to achieve. |
---|
0:27:34 | Q: The point is whether the errors introduced by your method are more than those introduced by pruning. OK. |
---|
0:27:48 | Q: I would like to know what kind of application you have in mind for this; this is obviously motivated by some kind of application. |
---|
0:28:15 | A: In this case we just thought that the need we have is computing the likelihood in an efficient manner, and I am sure there are a lot of applications, maybe in audio indexing, where you might have large populations and you might want to identify somebody in a big database. |
---|
0:28:33 | We have not specifically looked at any particular application; we just think that there are a lot of applications, at least where there are large databases and one is interested in identification, where something like this might work. That is possibly the way to see it, rather than saying we have a particular application for which we want to find a method. |
---|
0:29:00 | Q: Well, what is the application space? That is what I would like to know more about. |
---|
0:29:08 | A: Sure, right. |
---|
0:29:24 | Q: Did you try to use more than one MLLR transformation per speaker? |
---|
0:29:28 | A: We could do that; I think that is something we have been thinking of doing, but we have not yet. It should hopefully improve things, but we have not tried it. |
---|
0:29:46 | Q: It would be interesting to compare this with another type of scoring where, once you have the sufficient statistics for the test utterance, you actually estimate an MLLR transform for the test utterance as well, and then compare the MLLR transforms for the model and the test utterance, for example by taking an inner product or using an SVM. |
---|
0:30:13 | A: Yes; we are just using likelihoods. You are saying that, given the test utterance, I could use the test utterance's MLLR transform and compare it with the speaker's MLLR transform. |
---|
0:30:27 | Q: It would probably be even more efficient, because once you get the MLLR matrix, its dimension is lower than that of your sufficient statistics. You only have to note that these are just of the dimension of the feature vectors, so this is very small. |
---|