0:00:16 | Okay. So, in this work we basically test and evaluate our end-to-end LSTM language recognition system in different scenarios. This is a system that we already presented, and the objective here is to test how it is affected by different scenarios. |
---|
0:00:44 | So, first of all, the motivation: why we started using this architecture and how. Then we will give a brief overview of the field; you are probably already quite aware of this, but I guess it is nice to have some context here. |
---|
0:01:05 | Then we will go into the details of this work: we will detail the system description, the reference i-vector system that we will compare our proposed system with, the different scenarios we are going to test, and the results. And finally we will conclude the work. |
---|
0:01:25 | So, we all know what LID is: the process of automatically identifying the language of a given spoken utterance. Typically, for many years, this has been done relying on acoustic models, so these systems basically have two stages: first some i-vector extraction, and then some classification stage. |
---|
0:01:48 | In the last years we are seeing a really strong new line, which is deep neural networks. It can be more or less divided into three different approaches. One is the end-to-end systems: we have seen that it is a very nice solution, but we are not achieving the best results with it so far. Then we have the bottleneck features: after computing bottlenecks, we go back to the i-vector extraction and keep the full pipeline. And then we have the senones; sorry for the typo. |
---|
0:02:20 | In this paper we want to focus on the end-to-end approach, and we want to improve it. This would be a very standard DNN for language recognition when we try to use an end-to-end approach: basically we have some acoustic parameters as input, then one or several hidden layers with some nonlinearity, and in the last layer we compute the probability of each of the languages we are going to test; for this we use a softmax, which gives us probabilities. |
---|
0:03:02 | One of the main drawbacks of this system is that we need some context: if we try to get an output frame by frame, we are not going to get any good result, so this system relies on stacking several acoustic frames in order to model the time context. And that has many problems: for one, we have a fixed length that will probably not work best for all the different conditions, and it is more of a patch than a principled solution. |
---|
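As an illustration of the frame-stacking trick just described, here is a minimal sketch of how a fixed context window is built for a frame-level DNN. The helper name and the toy 2-dimensional features are hypothetical; the fixed window width is exactly the limitation mentioned above.

```python
# Sketch of the fixed-context trick: a frame-level DNN cannot see time
# on its own, so each input is built by stacking the current acoustic
# frame with its neighbours. The window size is fixed (here +/-2 frames),
# which is the limitation discussed in the talk: one length cannot fit
# all conditions.

def stack_frames(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames.
    Edges are padded by repeating the first/last frame."""
    stacked = []
    n = len(frames)
    for t in range(n):
        window = []
        for off in range(-left, right + 1):
            idx = min(max(t + off, 0), n - 1)  # clamp at the edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Toy example: five 2-dimensional "MFCC" frames.
feats = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_frames(feats)
```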
0:03:36 | So how can we model this in a better way? The theoretical answer is recurrent neural networks: basically we have the same structure as before, but this time we have recurrent connections; all the rest is the same. What is the problem with these? We have the vanishing gradient problem. In theory it is a very nice model, but when we try to train these networks, because of these recurrent connections we end up having all the weights go either to zero or to something really high. There are ways to avoid this, but usually they are very tricky, depending a lot on the task and on the data, so they are not really useful. |
---|
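The vanishing and exploding behaviour just described can be seen with a toy one-dimensional linear recurrence: backpropagating through T steps multiplies the gradient by the recurrent weight T times. This is only an illustrative sketch, not the actual training setup.

```python
# Toy illustration of the vanishing/exploding gradient problem for a
# one-dimensional linear recurrence h_t = w * h_{t-1}: backpropagation
# through T steps multiplies the gradient by w at every step, so it
# shrinks toward 0 for |w| < 1 and blows up for |w| > 1.

def gradient_through_time(w, steps):
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per unrolled time step
    return grad

vanished = gradient_through_time(0.9, 100)   # ~2.7e-5
exploded = gradient_through_time(1.1, 100)   # ~1.4e4
```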
0:04:22 | And here is where the LSTM comes in. Basically, for an LSTM we first take a standard DNN, and we replace all the hidden nodes with this LSTM block that we have here. |
---|
0:04:38 | So let's go to the theory of this block. It seems kind of scary when you first see it, but it is pretty simple after you look at it for a while. We have a flow of information that goes from the bottom to the top, and, as in any standard neuron, we have a nonlinear function, this one here. The special thing about the LSTM is that it has a memory cell, this one. |
---|
0:05:15 | Besides that, what we have there are three different gates, and what they do is let information go through, or block it. Here we have the input gate: if it is activated, it lets the input at a given time step move forward; if it is not, it won't. |
---|
0:05:38 | We have the forget gate, which basically resets the memory: if it fires, it sets the cell to zero; otherwise the cell keeps its state from the previous time step. And the output gate decides whether the computed output here goes out to the network or not. |
---|
0:06:08 | And then, of course, we have recurrent connections, so the output at one time step becomes the input at the next time step; it is basically trying to mimic the RNN model. |
---|
0:06:25 | But in this case we avoid the vanishing gradient problem, because the gates work not only in the forward pass but also across time: when we are doing backpropagation and some error would destabilize the weights, the forget gate or the input gate can block that error from propagating across many time steps. So we avoid the problem. |
---|
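The gating mechanism described above can be sketched as a minimal one-dimensional LSTM step. This is a simplified illustration with hypothetical parameter names: the system in the talk uses full vector-valued cells and adds peephole connections, which are omitted here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a minimal one-dimensional LSTM cell.
    p maps gate name -> (w_x, w_h, bias). No peepholes here."""
    gate = lambda name: sigmoid(p[name][0] * x + p[name][1] * h_prev + p[name][2])
    i = gate("input")    # let the new input in, or not
    f = gate("forget")   # keep the previous cell state, or reset it
    o = gate("output")   # expose the cell content to the network, or not
    c_tilde = math.tanh(p["cell"][0] * x + p["cell"][1] * h_prev + p["cell"][2])
    c = f * c_prev + i * c_tilde   # the memory cell update
    h = o * math.tanh(c)           # the block output
    return h, c

# With the forget gate saturated open (large positive bias) and the input
# gate shut (large negative bias), the cell state is carried through almost
# unchanged; this is how the LSTM preserves information (and gradients)
# over many time steps.
params = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
          "output": (0.0, 0.0, 20.0), "cell": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
```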
0:06:51 | The system that we used for language recognition does not rely on stacking acoustic frames, so it receives only one frame at a time. We have one or two hidden layers, and the hidden layer is a unidirectional LSTM. We also include these peephole connections that we have here, which basically allow the network to make decisions depending on time, so they are supposed to improve the performance of the memory cell. |
---|
0:07:28 | At the output we use a softmax, just like in the DNN, with a cross-entropy error function. For training, in the first scenario we will have a very balanced, nice dataset, so we do not need to do any resampling; but in the more difficult scenarios we will have somewhat imbalanced data. So, in order to avoid problems with the imbalanced data, we just oversample: we take random excerpts of two seconds so that we have six hours of every language in each iteration, and the data is different for every iteration. |
---|
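The per-iteration oversampling just described can be sketched as follows. The function and data names are hypothetical; only the two-second excerpts and the equal-hours-per-language idea come from the talk, and the toy amounts below are much smaller than the six hours mentioned.

```python
import random

def balanced_iteration(utterances_by_lang, hours_per_lang=6.0, excerpt_s=2.0):
    """Build one training iteration: random 2-second excerpts per language
    until every language contributes `hours_per_lang` hours. Languages with
    little audio are simply sampled (with replacement) more often."""
    n_excerpts = int(hours_per_lang * 3600 / excerpt_s)
    batch = []
    for lang, utts in utterances_by_lang.items():
        for _ in range(n_excerpts):
            utt_id, dur = random.choice(utts)  # sample with replacement
            start = random.uniform(0.0, max(dur - excerpt_s, 0.0))
            batch.append((lang, utt_id, start, start + excerpt_s))
    random.shuffle(batch)
    return batch

# Toy corpus: (utterance id, duration in seconds) per language.
data = {"eng": [("eng_001", 30.0), ("eng_002", 12.5)],
        "spa": [("spa_001", 45.0)]}
batch = balanced_iteration(data, hours_per_lang=0.01)  # tiny toy amount
```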
0:08:05 | Then, to compute the final score of an utterance, we average the softmax outputs, taking into account only the last ten percent of the scores; I will explain why later. And finally we use a multiclass linear logistic regression calibration, a simple one. |
---|
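A sketch of the utterance-level scoring just described: average the per-frame softmax outputs over only the last ten percent of frames. The function name and the toy posteriors are hypothetical.

```python
def utterance_score(frame_posteriors, keep_fraction=0.10):
    """Average per-frame softmax outputs over the last `keep_fraction`
    of frames only: with a unidirectional LSTM the late outputs have
    seen (almost) the whole utterance, so they are the reliable ones."""
    n = len(frame_posteriors)
    n_keep = max(int(n * keep_fraction), 1)  # keep at least one frame
    tail = frame_posteriors[n - n_keep:]
    n_lang = len(tail[0])
    return [sum(frame[k] for frame in tail) / len(tail) for k in range(n_lang)]

# Toy run: 20 frames, 3 languages; the posterior drifts toward language 0
# as the LSTM accumulates evidence.
posts = [[0.2 + 0.03 * t, 0.5 - 0.02 * t, 0.3 - 0.01 * t] for t in range(20)]
scores = utterance_score(posts)
```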
0:08:27 | We will compare the system with a reference i-vector system. It is very straightforward: it uses MFCC features, exactly the same features that we used for the LSTM; 1024 Gaussian components for the UBM; i-vectors of size 400; and it is based on cosine distance scoring. |
---|
0:08:49 | It turns out that, depending on how many languages you have, this was working better than doing LDA or PLDA, so that is why we decided to take cosine distance scoring. If we had more languages it would be better to use LDA, but the difference was small enough not to matter much here. And the i-vector system is always trained with exactly the same data. |
---|
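The cosine distance scoring back-end mentioned above can be sketched as follows. In this toy version each language is represented by a mean i-vector and the test i-vector is assigned by cosine similarity; the names and the 4-dimensional vectors are illustrative, while the real i-vectors are 400-dimensional.

```python
import math

def cosine_score(test_ivec, lang_mean_ivec):
    """Cosine similarity between a test i-vector and a language's mean
    i-vector: the simple back-end used instead of LDA/PLDA."""
    dot = sum(a * b for a, b in zip(test_ivec, lang_mean_ivec))
    norm = math.sqrt(sum(a * a for a in test_ivec)) * \
           math.sqrt(sum(b * b for b in lang_mean_ivec))
    return dot / norm

# Toy 4-dimensional "i-vectors" standing in for language models.
lang_models = {"eng": [1.0, 0.0, 0.5, 0.0], "spa": [0.0, 1.0, 0.0, 0.5]}
test = [0.9, 0.1, 0.4, 0.0]
best = max(lang_models, key=lambda l: cosine_score(test, lang_models[l]))
```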
0:09:19 | So these are the three scenarios that we are going to use to compare and test this network. The first one is a subset of the NIST 2009 Language Recognition Evaluation. The data we use comes from the three-second task. This is a pretty benign subset, let's say the easiest one, where the LSTM works best. |
---|
0:09:46 | So it is a fairly easy subset. In the 2009 evaluation, what we did first: there is an imbalanced mix of CTS and Voice of America data, so we dropped all the CTS data; that way we avoid the imbalanced mix and also the channel mismatch in training, so we have only one dataset. |
---|
0:10:09 | For the languages, we also wanted a high amount of data, so we took only those languages that had at least two hundred hours or more. And we also did not want imbalanced data, so we cut all the datasets to two hundred hours per language available for training. |
---|
0:10:28 | And that led to the subset we have here. It was not chosen to be the easiest or the most difficult, as we said before; it is just those languages that had these two hundred hours of Voice of America data. |
---|
0:10:45 | And we use only the three-second task because, historically, we saw that short utterances are where neural networks outperform i-vectors the most, so we wanted to be in that scenario. |
---|
0:11:00 | The second scenario we want to test is the dev set of the NIST Language Recognition Evaluation 2015. Here we do not avoid any of the difficulties: we have a mix of CTS and broadcast data, and we keep everything. We have seen the details of this already: it is twenty languages grouped in six clusters according to similarity, so it is supposed to be more challenging because the languages within a cluster are closer. |
---|
0:11:32 | The amount of training data is also going to be challenging: we have some languages with less than an hour and some languages with more than a hundred hours. And the split that we did is eighty-five percent for training and fifteen percent for testing. |
---|
0:11:46 | That split is something we would not do again if we reran the experiments; it is what we did at the time, before the evaluation. We thought it would be nice to have more data for training, but afterwards we ran some experiments and found that having a little less training data but more development data would have helped. Still, we keep exactly what we used in the evaluation. |
---|
0:12:14 | For the test, with that fifteen percent we took chunks of three seconds, ten seconds, and thirty seconds, to mimic a little bit the conditions of the evaluation. |
---|
0:12:28 | And then the third scenario will be the test set of the NIST Language Recognition Evaluation 2015. Here we have a broad range of speech durations, not fixed bins anymore, and we have big mismatches between training and evaluation, as we saw before. |
---|
0:12:47 | So, the results. First, this is kind of a side result and not that important, but as we are using a unidirectional LSTM, the output at a given time step depends not only on the input at that time step but also on all the previous inputs, so the last output is always more reliable than the ones before. |
---|
0:13:13 | We thought that maybe we were hurting the performance by taking the first outputs, which are less reliable, so we simply started dropping the first outputs and seeing how that affected the performance. |
---|
0:13:28 | For this plot we do not really care about the model we have here, only about how the performance improves; the absolute equal error rate does not matter, only the relative difference. And we found that taking into account only the last ten percent was very close to the optimal point. |
---|
0:13:49 | We also saw that taking only the very last score, a single softmax output, was as good as taking the last ten percent, but we kept averaging the last ten percent as the more robust choice. |
---|
0:14:03 | So these are the results on the first scenario; remember, this is the one with only Voice of America languages and two hundred hours per language for training. These are the different architectures we used: we tried both one hidden layer and two layers, with different hidden layer sizes, from the smallest with 256 units up to the biggest with 1024. And this is the size, in number of parameters, of all the models. |
---|
0:14:44 | And these are the results that we obtained. The reference i-vector system gets an equal error rate of almost seventeen percent, and we see that pretty much all of the LSTM approaches clearly outperform that, many of them with a much smaller number of parameters. |
---|
0:15:09 | So those are really good results, but we are in this balanced, easy scenario. As we can see, the best system has about fifteen percent better error rate and is about eighty-five percent smaller. |
---|
0:15:27 | We also wanted to check how complementary the information extracted by the LSTM and the i-vector systems was, so we fused the best LSTM system with the reference i-vector system, and the result was even better: twelve percent equal error rate, which is about fifteen percent better than the best single system. |
---|
0:15:50 | This is the confusion matrix; it does not carry much information, but it lets us see how the system performs on this subset per language, not only in terms of overall accuracy. |
---|
0:16:03 | These are the results on the dev set of the Language Recognition Evaluation 2015. For this one we did not experiment with different architectures, as we were a little bit in a hurry; we used only the best system from the previous scenario, which was two hidden layers of size 512. And what we can see here is that the LSTM performs much better than the i-vector on three seconds. |
---|
0:16:36 | While on thirty seconds, in this scenario where we have mismatches between the databases and imbalances in the datasets, the end-to-end system is not as strong: we still see results where the i-vector outperforms it. This end-to-end approach is able to extract more information from short utterances, but not as much from longer ones. |
---|
0:17:03 | The good thing we see here is that even though the result for longer utterances is much worse than that of the i-vector, the fusion is pretty much always better than any single system. So even when the LSTM is working worse than the i-vector, it is able to extract different information that helps the final fused system. So we were quite happy with these results as well. |
---|
0:17:31 | This is the DET curve that we have for three seconds, where we can see that the LSTM outperforms the i-vector with over twenty percent relative improvement, and we also see that the fusion always works better than any single system. |
---|
0:17:50 | And now we go on to the results on the test set of the Language Recognition Evaluation 2015, where things get much harder. First of all, the first column is the LSTM, the second column is the i-vector, and the third one is the fusion of both, the non-cheating one, the one we used for the submission. |
---|
0:18:13 | The fourth one is exactly the same but using a "cheating" fusion: a two-fold cross-validation, where we use one half of the test set to train the fusion that is applied to the other half. Of course that is not allowed in the evaluation, but we wanted to know whether the systems were learning complementary information or not; that is, whether with a well-trained fusion we could extract that complementary information. |
---|
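The two-fold "cheating" fusion protocol just described can be sketched as follows. The fusion trainer here is a hypothetical stand-in (an equal-weight average); the talk actually uses multiclass linear logistic regression, so this only illustrates the cross-validation protocol, not the fusion model itself.

```python
# Two-fold "cheating" fusion: split the test set in half, train the
# fusion on one half, apply it to the other, then swap the folds.

def train_fusion(scores_a, scores_b, labels):
    # Placeholder trainer: a real one would fit fusion weights to the
    # labels (e.g. multiclass logistic regression). Here: plain average.
    return lambda sa, sb: [(x + y) / 2.0 for x, y in zip(sa, sb)]

def two_fold_fusion(trials):
    """trials: list of (lstm_scores, ivector_scores, label) tuples."""
    half = len(trials) // 2
    fold1, fold2 = trials[:half], trials[half:]
    fused = []
    for train, test in ((fold1, fold2), (fold2, fold1)):
        fuse = train_fusion([t[0] for t in train], [t[1] for t in train],
                            [t[2] for t in train])
        fused.extend(fuse(sa, sb) for sa, sb, _ in test)
    return fused

# Toy scores for two trials over two languages.
trials = [([0.8, 0.2], [0.6, 0.4], "eng"), ([0.1, 0.9], [0.3, 0.7], "spa")]
fused = two_fold_fusion(trials)
```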
0:18:51 | So the messages to take from here are: first, end-to-end learning in this very hard scenario is able to get results comparable with the i-vector, but it gets much worse as the duration increases, because the i-vector keeps extracting better results while the LSTM stays at the same performance. |
---|
0:19:18 | But the good thing is that the mismatch is not so big that we cannot do a good fusion: even on the unseen test set, we can improve on the performance of the i-vector with the fusion. |
---|
0:19:36 | So, to conclude the work, the main take-away messages are: first of all, on a controlled, balanced scenario we have very promising results; it is a very simple system, with eighty-five percent fewer parameters, that is able to get a fifteen percent relative improvement. |
---|
0:19:56 | The problem is that once it goes to an imbalanced, more realistic evaluation, the results are not as good. And finally, we saw that under strong mismatches and harder scenarios we are not able to extract the information as well, so there is a need for variability compensation. But we still think it is a really promising approach that leads to simpler systems that can get quite good results. |
---|
0:20:38 | Time for questions. |
---|
0:20:50 | Just a small comment: you said that you are averaging the outputs over the last ten percent of the frames, and you always use ten percent, both for the three-second test and for the thirty-second test. Did you try to just average over, say, the thirty last frames, independently of the duration of the utterance? |
---|
0:21:11 | Actually, not for these scenarios, but for the LRE ones we tried a lot of things: not only averaging, but also taking the mean, selecting only the best output, or just dropping the outputs that are outliers. And we found that it did not really change things much there; but maybe in a more challenging scenario it would be worth it. We have not tried. |
---|
0:21:51 | Is it possible to go back to slide twenty-four? Sorry. I noticed you are always getting an improvement with the fusion of the LSTM versus the i-vector, but when you look at the English cluster, the fusion actually did worse than the i-vector system: 3.86 or 3.87, while the i-vector had 1.9. That is the only case where you did not get an improvement. Is there a reason why? Maybe it happened because your LSTM actually had worse performance there, so you got a worse LSTM system into the fusion? |
---|
0:22:36 | So, I am not completely sure, but we have some idea of why that happened. The idea is that, for training the systems, what we did is this oversampling, and in the English cluster there was one language, I think it was British English, that had only half an hour of data for training the LSTM. That of course hurt it, so it has worse performance, but I think it also hurt the fusion. |
---|
0:23:03 | When you have one language with less data, for the DNN or the LSTM you can more or less fix it with oversampling; but the fusion usually needs much less data in general, so in all the other clusters that was not a problem: even if they are imbalanced, for calibration you still have enough of all of them. But for the English one, I think we did not have enough data for calibrating, that is, for training the fusion. So I think that was the issue: the fusion is not well trained because we do not have enough data for one of the languages. |
---|
0:23:54 | I have a question. I found it quite interesting that your LSTM has fewer parameters than the i-vector system, and I am wondering about the time complexity: how long does it take at training time and at test time, compared to the i-vector system? |
---|
0:24:16 | The training time is much longer, because we run a lot of iterations. I think that is also because of the way we train, with a different subset per iteration: we need a lot of them. Actually, I think the numbers we have here are not even the best we could get, because this was done for an evaluation, and some of the networks were still improving when we had to stop them and run them as they were. |
---|
0:24:44 | So training time is longer even though the LSTM has much fewer parameters, but testing time is way faster. And of course, one thing is that once you have the network trained you only need to do the forward pass, while with the i-vector, whenever you have new data you always have to extract an i-vector before doing the scoring. |
---|
0:25:10 | Any more questions? Then there is lots of time for coffee, I guess; we will be back at five o'clock for the special session, which is on target speakers. |
---|