0:00:17 | Hello everyone, I'm Oriol Vinyals. |
---|
0:00:20 | I'm going to be talking about |
---|
0:00:22 | deep learning, and more concretely about using deep learning for tandem features, |
---|
0:00:29 | and analyzing how it performs for robust ASR, basically |
---|
0:00:35 | seeing how it deals with noise. |
---|
0:00:37 | So, a bit of related work |
---|
0:00:41 | and background. |
---|
0:00:42 | Deep learning: I'm not going to go into |
---|
0:00:45 | many details, but it's basically the idea of having many layers of computation, |
---|
0:00:49 | and typically |
---|
0:00:52 | it's just a neural network |
---|
0:00:53 | with better initialization than just random. In 2006, Hinton introduced these |
---|
0:00:59 | RBMs, |
---|
0:01:00 | which apparently helped a lot in training these deep models. |
---|
0:01:05 | And |
---|
0:01:06 | since then, many groups have been working on deep learning; you can see that by the number of publications |
---|
0:01:12 | at machine learning conferences, and also at related conferences like computer vision conferences such as CVPR. |
---|
0:01:19 | So |
---|
0:01:20 | some people have applied deep learning to speech, and it's quite recent, just the last couple of years. |
---|
0:01:26 | But estimating phone posteriors using neural networks, |
---|
0:01:32 | or |
---|
0:01:33 | deep neural networks, |
---|
0:01:34 | is not a new idea. |
---|
0:01:36 | Basically, there are |
---|
0:01:39 | two main approaches: one that uses the phone posteriors |
---|
0:01:44 | to get the acoustic |
---|
0:01:46 | model for the HMM, the so-called |
---|
0:01:48 | hybrid model, and the other that uses tandem, which means |
---|
0:01:53 | we take just the posteriors as features |
---|
0:01:56 | and then use them in an otherwise standard GMM-HMM system. |
---|
0:02:01 | The tandem approach is quite attractive because it keeps relying on an existing GMM-HMM system, so |
---|
0:02:06 | that's the kind of approach |
---|
0:02:08 | we are looking at in this work. |
---|
0:02:10 | So, just to |
---|
0:02:17 | explain briefly how |
---|
0:02:19 | tandem is used, for those who don't know: we get some sort of estimate, |
---|
0:02:24 | per frame, of the posterior probabilities for the phones, |
---|
0:02:27 | as shown on the top, and then there are several techniques and tricks that have been applied and found useful |
---|
0:02:34 | ten years ago or so. |
---|
0:02:36 | We take the log of those posterior probabilities, and then we |
---|
0:02:41 | whiten them, so that they better match the diagonal-covariance assumption of the GMMs; |
---|
0:02:47 | we do mean and variance normalization; and lastly we just concatenate them with the MFCCs or some |
---|
0:02:53 | spectral features, |
---|
0:02:55 | and we train, or decode, with this extended feature set. |
---|
0:03:00 | So it's pretty |
---|
0:03:02 | easy to understand, |
---|
0:03:04 | and easy to implement as well. |
---|
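The tandem pipeline just described (log posteriors, whitening, mean/variance normalization, concatenation with MFCCs) can be sketched as follows. This is a minimal illustration, not the exact recipe from the talk: the dimensions, the PCA-based whitening, and the random stand-in data are all assumptions.

```python
import numpy as np

# Illustrative sketch of a tandem feature pipeline: per-frame phone
# posteriors -> log -> whitening -> mean/variance normalization ->
# concatenation with MFCCs. Sizes and data are made up for the example.
rng = np.random.default_rng(0)
n_frames, n_phones, n_mfcc = 1000, 41, 39

# Fake per-frame phone posteriors (rows sum to 1) and MFCC features.
logits = rng.normal(size=(n_frames, n_phones))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mfcc = rng.normal(size=(n_frames, n_mfcc))

# 1) Log posteriors: makes the heavily skewed probabilities more Gaussian.
x = np.log(posteriors + 1e-10)

# 2) Whitening via PCA: decorrelates the dimensions so they better match
#    the diagonal-covariance assumption of the GMMs.
x0 = x - x.mean(axis=0)
cov = x0.T @ x0 / n_frames
eigval, eigvec = np.linalg.eigh(cov)
x_white = x0 @ eigvec / np.sqrt(eigval + 1e-10)

# 3) Mean and variance normalization (per dimension).
x_norm = (x_white - x_white.mean(axis=0)) / (x_white.std(axis=0) + 1e-10)

# 4) Concatenate with the spectral (MFCC) features.
tandem = np.concatenate([x_norm, mfcc], axis=1)
print(tandem.shape)  # (1000, 80)
```

The extended features would then be fed to an otherwise unchanged GMM-HMM system for training and decoding.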
0:03:07 | So, |
---|
0:03:10 | that brings me to the main points of this work. |
---|
0:03:15 | First, we want to |
---|
0:03:17 | see how |
---|
0:03:18 | phone posteriors coming from |
---|
0:03:21 | a deep neural network |
---|
0:03:22 | combine with spectral features: whether there is any gain when we add them to |
---|
0:03:27 | the MFCCs |
---|
0:03:29 | in this tandem fashion. |
---|
0:03:32 | Also, and this is probably more interesting, |
---|
0:03:36 | and I don't know if it has been analyzed yet: how does noise affect deep neural |
---|
0:03:41 | net based systems? |
---|
0:03:42 | In particular, we want to analyze, or |
---|
0:03:47 | kind of rule out, which |
---|
0:03:49 | parts of the deep |
---|
0:03:50 | architecture are helping in which situations. |
---|
0:03:53 | So for that, |
---|
0:03:55 | as I said, we have some questions regarding deep learning. For example: |
---|
0:04:00 | why does |
---|
0:04:01 | having a deep structure matter? |
---|
0:04:03 | That's the first question. Then we can also ask ourselves: what about pre-training, the RBM training that I |
---|
0:04:09 | was talking about; is it important or not? |
---|
0:04:12 | And lastly, |
---|
0:04:14 | we know that training neural networks gets tricky sometimes, especially when they are deep, so does |
---|
0:04:20 | the optimization technique used matter? |
---|
0:04:23 | In the paper, the focus was on the first two points: |
---|
0:04:28 | does |
---|
0:04:29 | depth matter, |
---|
0:04:30 | and does pre-training matter. |
---|
0:04:33 | The last one is something I've been working on; it is not in the paper, |
---|
0:04:37 | but I will talk about it in this talk. |
---|
0:04:41 | So, referring to those questions, here is |
---|
0:04:44 | one way to see neural networks. |
---|
0:04:47 | The good part of deep neural networks is that they are powerful models: they are very |
---|
0:04:52 | expressive, and they can represent very complicated nonlinear relations, which is good because we know our brain probably does that. |
---|
0:04:59 | They are also attractive because the gradient is easy to compute, and in fact now with |
---|
0:05:05 | GPU computing, |
---|
0:05:07 | since all it involves is some matrix operations, they can actually be trained pretty fast and |
---|
0:05:12 | very efficiently. |
---|
0:05:14 | There are some bad things too. It is a non-convex optimization problem, and there is the vanishing gradients |
---|
0:05:20 | problem: as we backpropagate, the gradients tend to zero, so |
---|
0:05:24 | it's kind of |
---|
0:05:25 | hard; they are not easy to train, especially when the neural network is very deep. |
---|
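The vanishing-gradient point can be seen numerically: with logistic units, the derivative sigma'(x) = sigma(x)(1 - sigma(x)) is at most 0.25, so with modest weights the backpropagated error shrinks roughly geometrically with depth. Here is a toy sketch; the layer sizes and weight scale are arbitrary choices for illustration.

```python
import numpy as np

# Toy illustration of vanishing gradients in a deep logistic network:
# backpropagate an error vector through many sigmoid layers and watch
# its norm shrink. Layer sizes and weight scale are arbitrary.
rng = np.random.default_rng(0)
depth, width = 10, 100

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping activations for backprop.
h = rng.normal(size=width)
weights, acts = [], []
for _ in range(depth):
    W = rng.normal(scale=0.1, size=(width, width))
    h = sigmoid(W @ h)
    weights.append(W)
    acts.append(h)

# Backward pass: delta_{l-1} = W_l^T (delta_l * sigma'(a_l)),
# where sigma'(a) = a * (1 - a) for the stored sigmoid activations.
delta = np.ones(width)
norms = [np.linalg.norm(delta)]
for W, a in zip(reversed(weights), reversed(acts)):
    delta = W.T @ (delta * a * (1.0 - a))
    norms.append(np.linalg.norm(delta))

# The gradient signal decays by orders of magnitude over 10 layers.
print(norms[0], norms[-1])
```

Each layer multiplies the error by at most ~0.25 through the sigmoid derivative, so ten layers in, almost nothing is left to update the early weights with.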
0:05:30 | Also, the number of parameters can grow very large; in fact, |
---|
0:05:34 | our brain has several orders of magnitude more parameters than the neural nets that we |
---|
0:05:38 | can train nowadays. |
---|
0:05:40 | So overfitting is an issue that |
---|
0:05:42 | people are worried about, of course, as in many other machine learning techniques. |
---|
0:05:46 | And something that people don't like about |
---|
0:05:49 | neural networks is that they are kind of difficult to interpret: it's hard to tell what's going on. |
---|
0:05:54 | There are some exceptions, though. |
---|
0:05:57 | There are some people in computer vision, and also in speech, who are analyzing what |
---|
0:06:02 | the neurons are actually learning, |
---|
0:06:04 | and it's impressive: in computer vision, for example, you can see that |
---|
0:06:10 | the first layer is learning basically what V1 in our brain is doing, these Gabor-like filters |
---|
0:06:15 | for computer vision, and not much else. |
---|
0:06:18 | So it is actually becoming possible, in some sense, to interpret |
---|
0:06:22 | these deep nets, and hence deep learning. |
---|
0:06:26 | So, |
---|
0:06:27 | just to |
---|
0:06:30 | be concrete on the exact experiment that we did: |
---|
0:06:33 | we trained this kind of neural network. |
---|
0:06:36 | It was deep because, as you can see, it has three hidden layers. On the |
---|
0:06:42 | left we have just the input, |
---|
0:06:44 | with the thirty-nine acoustic observations and nine frames of context, |
---|
0:06:49 | and then we have the following layers, with five hundred, a thousand, and fifteen hundred binary |
---|
0:06:54 | logistic units. |
---|
0:06:55 | And the last layer is the one where we estimate the |
---|
0:06:59 | phone posteriors, through a softmax layer. |
---|
0:07:02 | I need to say that I didn't |
---|
0:07:05 | tune or optimize this: I came up with this architecture and I've been using it as is; I |
---|
0:07:10 | haven't changed the parameters and so on, |
---|
0:07:13 | because I wanted to see the effects and compare with the same network. |
---|
0:07:17 | But better numbers could |
---|
0:07:20 | probably be found just by |
---|
0:07:21 | trying different architectures. |
---|
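A minimal forward pass matching the architecture just described (39 acoustic features times 9 frames of context in; hidden layers of 500, 1000, and 1500 logistic units; softmax phone posteriors out) might look like this. The 41-phone output size and the random weights are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Sketch of the described network: 39 acoustic features x 9 context frames
# -> 500 -> 1000 -> 1500 logistic hidden units -> softmax phone posteriors.
# Weights are random here (pre-training is covered separately in the talk),
# and the 41-phone output size is an assumption for the example.
rng = np.random.default_rng(0)
sizes = [39 * 9, 500, 1000, 1500, 41]
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phone_posteriors(x):
    h = x
    for W, b in params[:-1]:                 # logistic hidden layers
        h = sigmoid(h @ W + b)
    W, b = params[-1]                        # softmax output layer
    z = h @ W + b
    z -= z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(4, 39 * 9))             # 4 frames, each with 9-frame context
p = phone_posteriors(x)
print(p.shape)                               # (4, 41); each row sums to 1
```

In the tandem setup these per-frame posteriors are the network's only job; the GMM-HMM back-end consumes the transformed posteriors as features.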
0:07:24 | So, jumping into the experimental setup: |
---|
0:07:27 | we used Aurora2, so it's fairly small, with around one point four million samples |
---|
0:07:32 | for training, |
---|
0:07:34 | at a ten-millisecond frame rate. |
---|
0:07:38 | And, |
---|
0:07:39 | as we know, the testing conditions are with added noise |
---|
0:07:44 | at different SNR levels, |
---|
0:07:46 | with noises such as train station, airport, and so on. |
---|
0:07:50 | And, |
---|
0:07:51 | importantly, we trained our models on clean speech and then we are |
---|
0:07:55 | testing on the several noisy conditions, just to see |
---|
0:07:58 | whether the DNN is being affected by the noisy conditions, if at all. |
---|
0:08:03 | And then we just used the standard HMM model proposed in the Aurora2 setup, with |
---|
0:08:08 | the same decoding scheme. |
---|
0:08:11 | So, |
---|
0:08:13 | the first table of results; let me just explain it. |
---|
0:08:18 | In the rows, as you can see, are just the |
---|
0:08:21 | different noise conditions, starting from clean and adding more noise. |
---|
0:08:25 | And in the parentheses you see the relevant differences: if we rerun these experiments |
---|
0:08:32 | with different random seeds, |
---|
0:08:33 | we observe around point two to point four differences in word error rate, so that's kind |
---|
0:08:38 | of the significance level of these results. |
---|
0:08:41 | Then, the first |
---|
0:08:42 | column of results is just the standard MFCC model; |
---|
0:08:46 | we can see that, as we add noise, it degrades. |
---|
0:08:49 | The next two columns are the |
---|
0:08:52 | tandem MLP baseline, |
---|
0:08:53 | which we borrowed from ICSI. |
---|
0:08:55 | The first of the two would be the MLP using just the |
---|
0:09:00 | MLP features, no MFCCs, and "tandem" denotes concatenating both. |
---|
0:09:04 | So we can see that concatenating MFCCs helps, because all the numbers are basically lower, |
---|
0:09:10 | and it |
---|
0:09:11 | improves the word error rates across all the noise conditions. |
---|
0:09:17 | Now, with the deep belief network that I showed, |
---|
0:09:21 | we basically get |
---|
0:09:22 | better results on almost all conditions. |
---|
0:09:25 | In particular, |
---|
0:09:28 | on clean speech we get an improvement, which is, |
---|
0:09:31 | I guess, comparable with other people's findings on TIMIT and so on. |
---|
0:09:36 | And the improvements are consistently better than the tandem MLP |
---|
0:09:42 | approach that was proposed several years ago, |
---|
0:09:45 | which is good news. |
---|
0:09:46 | Also note, as I said, that |
---|
0:09:49 | MFCCs usually help, but when there is a lot of noise, in the five dB or zero dB |
---|
0:09:54 | cases, |
---|
0:09:56 | they actually hurt, |
---|
0:09:57 | and using just the DBN phone posteriors |
---|
0:10:02 | is better. |
---|
0:10:03 | So |
---|
0:10:05 | those were the first questions: |
---|
0:10:07 | how do phone posteriors |
---|
0:10:10 | do, |
---|
0:10:10 | combined in that tandem fashion, when we use deep-learning-based nets instead of just MLP features, |
---|
0:10:17 | and how does noise affect them? It seems that deep neural nets are also good for noisy, not |
---|
0:10:22 | only clean, speech. |
---|
0:10:24 | And now |
---|
0:10:25 | I'm going to jump to some more recent results. I was actually able to run these because I've been working |
---|
0:10:32 | on a |
---|
0:10:32 | second-order optimization method proposed at ICML 2010, |
---|
0:10:37 | which kind of |
---|
0:10:39 | suggests that maybe pre-training with these RBMs is not necessary if you use some sort of second-order |
---|
0:10:45 | optimization in the backprop step. |
---|
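The second-order idea referred to here (presumably Hessian-free optimization, as in Martens' ICML 2010 paper) never forms the Hessian explicitly: it only needs Hessian-vector products, and solves for the update direction with conjugate gradient. Here is a toy sketch on a quadratic stand-in objective; the matrix and all sizes are made up for illustration, and a real implementation adds damping, mini-batching, and a Gauss-Newton curvature matrix.

```python
import numpy as np

# Toy sketch of the "Hessian-free" second-order step: Hessian-vector
# products via finite differences of the gradient, plus conjugate gradient
# to solve H d = -g. The quadratic 0.5 w'Aw - b'w stands in for the loss.
rng = np.random.default_rng(0)
A_ = rng.normal(size=(5, 5))
A = A_ @ A_.T + 5 * np.eye(5)        # symmetric positive definite "curvature"
b = rng.normal(size=5)

def grad(w):
    return A @ w - b                 # gradient of 0.5 w'Aw - b'w

def hess_vec(w, v, eps=1e-4):
    # Hessian-vector product without ever building the Hessian.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def cg_solve(w, g, iters=50, tol=1e-10):
    # Conjugate gradient for H d = -g, using only hess_vec products.
    d = np.zeros_like(g)
    r = -g - hess_vec(w, d)
    p = r.copy()
    for _ in range(iters):
        Hp = hess_vec(w, p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

w = np.zeros(5)
w = w + cg_solve(w, grad(w))         # one (approximate) Newton step
print(np.allclose(w, np.linalg.solve(A, b), atol=1e-5))  # → True
```

On this quadratic, one such step lands on the exact minimizer; on a real network loss the step is repeated, with CG truncated early.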
0:10:47 | But let me go step by step through these questions and the columns to look at. |
---|
0:10:51 | So first: does optimization matter? |
---|
0:10:55 | What we see here, in the columns: the first two columns are the same as |
---|
0:10:59 | previously, |
---|
0:11:00 | so the tandem MLP was trained using standard techniques, let's say stochastic gradient descent, |
---|
0:11:05 | and it's stopped after seven hundred or so iterations, |
---|
0:11:10 | when we don't see an improvement in performance. |
---|
0:11:12 | The last column, the tandem MLP with the little star: |
---|
0:11:15 | there we were actually able to train a bigger model, basically with as many parameters |
---|
0:11:22 | as the DBN |
---|
0:11:23 | model that I showed at the beginning. |
---|
0:11:27 | And as we can see, at least for the |
---|
0:11:32 | low-noise region, |
---|
0:11:34 | the tandem MLP with the second-order |
---|
0:11:38 | optimization and more parameters |
---|
0:11:40 | actually outperforms |
---|
0:11:42 | the tandem MLP without this better optimization. |
---|
0:11:45 | But then, in the |
---|
0:11:48 | higher-noise conditions, it did worse. |
---|
0:11:51 | So that is kind of disappointing: maybe, because there are so many parameters, there is some sort of overfitting, |
---|
0:11:56 | and the model doesn't deal |
---|
0:11:58 | well with that. |
---|
0:11:59 | So |
---|
0:12:00 | that brings us to the next point: |
---|
0:12:02 | does depth matter? |
---|
0:12:04 | So now, |
---|
0:12:05 | let's take the parameters of the |
---|
0:12:07 | single-hidden-layer tandem MLP, |
---|
0:12:09 | which is around three million parameters, by the way, |
---|
0:12:11 | and let's use them in the deep neural network that I showed at the beginning, but with |
---|
0:12:16 | no pre-training. So it's not the deep belief network that Hinton proposed; it's just a standard neural |
---|
0:12:21 | net with many layers. |
---|
0:12:23 | And what we see here is that the performance is almost identical, |
---|
0:12:26 | but in the high-noise situations, the degradation that we saw before goes away, and we actually get |
---|
0:12:33 | a bit better. |
---|
0:12:34 | So my hypothesis here is that maybe adding the deepness |
---|
0:12:38 | has some sort of effect on being able to cancel the noise better than if you |
---|
0:12:43 | just have a shallow network. |
---|
0:12:45 | Now, |
---|
0:12:46 | obviously this |
---|
0:12:47 | is just a hypothesis, but |
---|
0:12:49 | that is what we can probably see from the results. |
---|
0:12:52 | Lastly: does pre-training matter? This is basically, from the first table, the |
---|
0:12:58 | same neural net but with the pre-training step. |
---|
0:13:01 | We see that |
---|
0:13:02 | it improves upon the deep neural net that has not been pre-trained, |
---|
0:13:06 | and it improves across all the noise conditions. So |
---|
0:13:12 | I think what this means is that pre-training |
---|
0:13:14 | basically helps generalization. |
---|
0:13:17 | We know actually that for overfitting, pre-training helps quite a lot: for the MNIST dataset, |
---|
0:13:22 | which was probably the setting of the Science paper, |
---|
0:13:25 | there is huge overfitting and pre-training helps a lot. |
---|
0:13:27 | But in this case it helped |
---|
0:13:29 | not only on the clean condition, but even when the SNR is quite low. |
---|
0:13:33 | I guess pre-training the weights somehow drives them toward some sort of |
---|
0:13:39 | generality, and not only the discriminative objective function. |
---|
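The pre-training being discussed stacks RBMs trained layer by layer on unlabeled data. One contrastive-divergence (CD-1) update for a single binary RBM layer could be sketched like this; the sizes, learning rate, and the fake binary batch are all illustrative assumptions.

```python
import numpy as np

# Sketch of one CD-1 update for a binary restricted Boltzmann machine,
# the building block of the layer-wise pre-training discussed in the talk.
# Sizes, learning rate, and the fake "data" batch are illustrative.
rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 100, 50, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    global W, b_v, b_h
    # Positive phase: hidden probabilities given the data, then a sample.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    n = v0.shape[0]
    # CD-1 gradient approximation: <v h>_data - <v h>_reconstruction.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))   # reconstruction error

v = (rng.random((20, n_visible)) < 0.3).astype(float)  # fake binary batch
errors = [cd1_update(v) for _ in range(50)]
print(errors[0], errors[-1])   # reconstruction error should decrease
```

In the full recipe, the trained RBM's hidden activations become the "data" for the next RBM up, and the stacked weights initialize the deep net before supervised backprop.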
0:13:42 | And |
---|
0:13:44 | so, |
---|
0:13:45 | to conclude this discussion about |
---|
0:13:47 | depth and pre-training and so on, |
---|
0:13:50 | I looked at the |
---|
0:13:51 | frame-level phone error rate of all these three networks, |
---|
0:13:55 | for some phones picked at random, |
---|
0:13:59 | and we can see that |
---|
0:14:00 | the phone error rates seem similar on clean speech, |
---|
0:14:03 | but then, when we add the noise, the DBNs |
---|
0:14:06 | learn more robust representations, because when we add |
---|
0:14:11 | a large amount of noise, |
---|
0:14:13 | the deep neural net does better |
---|
0:14:16 | than the shallow neural net, even when the latter is trained with the better optimization technique. |
---|
0:14:21 | So I believe that |
---|
0:14:23 | depth, maybe its hierarchy, helps to learn basically better representations of the data, as |
---|
0:14:29 | has been found also in computer vision and so on. |
---|
0:14:32 | So basically, to conclude: |
---|
0:14:35 | I think it is now clear |
---|
0:14:38 | that |
---|
0:14:38 | deep learning |
---|
0:14:41 | works also in tandem systems, not only in hybrid systems, which is good news for those who |
---|
0:14:46 | have a lot of engineering work around GMM-HMM systems. |
---|
0:14:50 | And |
---|
0:14:51 | furthermore, I think the MFCC results say |
---|
0:14:54 | that for those who are working on hybrid systems, maybe |
---|
0:14:57 | they should |
---|
0:14:58 | incorporate |
---|
0:14:59 | more spectral information somehow, especially if there's not a lot of noise. |
---|
0:15:04 | Then, |
---|
0:15:04 | pre-training: we know it helps with overfitting, but it also helps with generalization |
---|
0:15:10 | in the case where we have this quite clear mismatch |
---|
0:15:15 | between training, which was |
---|
0:15:16 | done on clean speech, and testing. |
---|
0:15:19 | And also, |
---|
0:15:21 | the deep models, given the same number of parameters, seem to be more robust |
---|
0:15:26 | in very high-noise situations, |
---|
0:15:28 | which was also found in computer vision. |
---|
0:15:31 | And obviously these conclusions are, for now, based on a fairly small task, |
---|
0:15:35 | so I think, for future work, |
---|
0:15:38 | it would be interesting to go to a larger dataset, which we are actually working on, |
---|
0:15:44 | and also to compare |
---|
0:15:45 | between the |
---|
0:15:47 | so-called deep neural net HMM hybrid and this deep tandem estimate. |
---|
0:15:52 | Thank you very much. Are there any questions? |
---|
0:15:59 | (moderator) Are there |
---|
0:16:00 | questions? |
---|
0:16:01 | We have a microphone available. |
---|
0:16:05 | (audience) I have a question, or a comment, on how the networks work. Can you go back to the slides |
---|
0:16:09 | you showed us, |
---|
0:16:11 | this one, |
---|
0:16:11 | where at the beginning you said something is a good thing and something is a bad thing? |
---|
0:16:16 | (speaker) This one, or that one? |
---|
0:16:17 | (audience) Yeah. |
---|
0:16:19 | (audience) So which one did you say was the bad thing for the deep network? |
---|
0:16:22 | Here you have the two points. |
---|
0:16:25 | (speaker) Well, for deep networks there is one problem, which is the vanishing gradients, |
---|
0:16:30 | and the other is overfitting. |
---|
0:16:32 | (audience) Yeah, |
---|
0:16:32 | and to me these two points seem |
---|
0:16:35 | contradictory, |
---|
0:16:37 | in a sense, |
---|
0:16:40 | because if you are in the vanishing-gradient regime, you are not getting help, |
---|
0:16:44 | and the weights stay the same, which means the model stays basically the same, |
---|
0:16:50 | and if you cannot change the parameters, you cannot do this |
---|
0:16:55 | overfitting. |
---|
0:16:56 | (speaker) Does this happen in some cases, or |
---|
0:16:59 | in all cases? I think the two happen in |
---|
0:17:04 | different regimes. In my experiments, actually, I did not observe a whole |
---|
0:17:08 | lot of overfitting; |
---|
0:17:10 | I was just doing L2 regularization on the weights, but I didn't have overfitting. |
---|
0:17:15 | But in other cases, like if you read the Science paper from Hinton, there is |
---|
0:17:20 | a lot of it: you have only like twenty thousand samples and they are overfitting; |
---|
0:17:25 | you basically get to zero percent training error. In those cases, obviously, |
---|
0:17:29 | the optimization method doesn't matter that much, and it's |
---|
0:17:32 | more about how you bias the weights, which is done |
---|
0:17:34 | using these RBMs. |
---|
0:17:37 | (moderator) The next question; I don't know which one is next. |
---|
0:17:42 | (audience) You said that there is some pre-training happening, so that means |
---|
0:17:46 | you must be using some |
---|
0:17:47 | portion of the data which is not used for the training? |
---|
0:17:51 | (speaker) Right, no, not in this case: the pre-training is unsupervised, so in theory you could put in a lot of |
---|
0:17:56 | data, a lot more than the data used to train. |
---|
0:17:58 | So, |
---|
0:17:59 | yeah. |
---|
0:17:59 | (audience) So my question is about a fair comparison for the neural network: |
---|
0:18:08 | you could take the deep belief net, |
---|
0:18:17 | keep the same model, |
---|
0:18:21 | the same network, |
---|
0:18:23 | but with the pre-training removed, |
---|
0:18:32 | so it comes down to a network |
---|
0:18:35 | at the same level, |
---|
0:18:39 | and lastly, |
---|
0:18:40 | train that net by |
---|
0:18:45 | taking the whole objective function of the neural net and doing backprop |
---|
0:18:49 | with random weight initialization; |
---|
0:18:51 | that would be a fair comparison. |
---|
0:18:53 | (speaker) Uh-huh, okay. |
---|
0:18:54 | (audience) Thanks. |
---|
0:18:55 | (speaker) Yeah. |
---|
0:18:57 | (moderator) Another question? |
---|
0:18:59 | (audience) Basically, |
---|
0:19:02 | when you are concatenating the outputs |
---|
0:19:04 | from the MLP to the MFCCs, you probably want to have |
---|
0:19:08 | the most different information you can from the MLP. So I wonder: |
---|
0:19:13 | when you do the backpropagation, |
---|
0:19:17 | you are basically forcing the MLP |
---|
0:19:20 | to focus on something which discriminates between the classes that you decided to use. |
---|
0:19:26 | Did you try to look at how that would work |
---|
0:19:29 | if you did concatenate the features coming from |
---|
0:19:33 | a network that is not trained, |
---|
0:19:34 | like, not doing the training? |
---|
0:19:37 | You might then get something quite different. |
---|
0:19:40 | (speaker) So actually, |
---|
0:19:41 | the deep neural net is trained before the concatenation happens. |
---|
0:19:46 | So in a sense, |
---|
0:19:47 | you just have the phone targets and you train the neural net first, and then you concatenate |
---|
0:19:53 | for the HMM system. |
---|
0:19:54 | So I'm not sure, |
---|
0:19:55 | because I... |
---|
0:20:00 | (audience) I think you are using the outputs of the... |
---|
0:20:04 | (speaker) Yeah, so, all right: |
---|
0:20:07 | you train your network first, and then, once you have it trained, you keep it fixed, and |
---|
0:20:11 | then you don't do any more backprop after that. You could do that, but... |
---|
0:20:15 | (audience) Sure, but whether it is trained before the backpropagation or not, you still have the same number |
---|
0:20:23 | of output neurons as you would have after the backpropagation, right? |
---|
0:20:27 | Yeah, so |
---|
0:20:28 | you can choose: |
---|
0:20:29 | you can concatenate to the MFCCs the outputs you have now, or the ones you would have after |
---|
0:20:33 | the backpropagation, right? |
---|
0:20:36 | Or am I missing something? |
---|
0:20:38 | (speaker) Hmm, |
---|
0:20:40 | sorry, |
---|
0:20:41 | yeah, I don't... I don't understand. |
---|
0:20:43 | (moderator) We probably need to move to the next presentation; you can discuss that offline. Okay. |
---|