0:00:15 | Right, |
---|
0:00:16 | it is my great pleasure |
---|
0:00:18 | to present right after the two best paper nominees, |
---|
0:00:23 | so I hope you will also like this talk. |
---|
0:00:27 | Alright, so, |
---|
0:00:29 | this work is about |
---|
0:00:31 | joint online spoken language understanding and language modeling |
---|
0:00:35 | with recurrent neural networks. |
---|
0:00:37 | My name is Bing Liu, |
---|
0:00:38 | and this is joint work with my advisor Ian Lane. |
---|
0:00:42 | We are from Carnegie Mellon University. |
---|
0:00:46 | This is the outline of the talk. |
---|
0:00:48 | First, I will introduce the background and the motivation of our work. |
---|
0:00:52 | Following that, we will explain in detail our proposed method, |
---|
0:00:57 | and then comes the experiment setup and the result analysis, and finally |
---|
0:01:03 | conclusions will be presented. |
---|
0:01:06 | First, the background. |
---|
0:01:09 | Spoken language understanding is one of the important components in spoken dialogue systems. |
---|
0:01:15 | In SLU, |
---|
0:01:16 | there are two major tasks: |
---|
0:01:18 | intent detection and slot filling. |
---|
0:01:20 | Given a user query, we want the SLU system to identify the user's intent |
---|
0:01:26 | and also to extract |
---|
0:01:28 | useful semantic constituents from the user query. |
---|
0:01:32 | Given an |
---|
0:01:33 | example query like |
---|
0:01:35 | "show me the flights from Seattle to ...", |
---|
0:01:38 | we want the SLU system |
---|
0:01:41 | to identify that |
---|
0:01:43 | the user is looking for flight information; that is the intent. |
---|
0:01:47 | And we also want to |
---|
0:01:49 | extract useful information such as the from location, |
---|
0:01:53 | the to location, |
---|
0:01:54 | and the departure time; this is the task of slot filling. |
---|
0:02:00 | Intent detection |
---|
0:02:02 | can be treated as a sequence classification problem, |
---|
0:02:05 | so standard classifiers |
---|
0:02:07 | like |
---|
0:02:08 | support vector machines with n-gram features, |
---|
0:02:11 | or convolutional neural networks |
---|
0:02:12 | and recursive neural networks, can be applied. |
---|
0:02:16 | On the other hand, slot filling |
---|
0:02:19 | can be treated as a sequence labeling problem, |
---|
0:02:21 | so sequence models like maximum entropy Markov models, |
---|
0:02:26 | conditional random fields, |
---|
0:02:27 | and recurrent neural networks |
---|
0:02:29 | are good candidates for sequence labeling. |
---|
0:02:34 | Intent detection and slot filling are typically processed separately |
---|
0:02:38 | in spoken language understanding systems. |
---|
0:02:41 | A joint model |
---|
0:02:42 | that can perform the two tasks |
---|
0:02:44 | at the same time simplifies |
---|
0:02:46 | the SLU system, |
---|
0:02:48 | as only one model needs to be trained and deployed. |
---|
0:02:52 | Also, |
---|
0:02:53 | by training |
---|
0:02:55 | two related tasks together, |
---|
0:02:57 | it is likely that |
---|
0:02:59 | we can improve the generalization performance of one task |
---|
0:03:02 | using the other related task. |
---|
0:03:05 | Joint models for slot filling and intent detection have been proposed in the literature, |
---|
0:03:10 | using convolutional neural networks |
---|
0:03:12 | and recursive neural networks. |
---|
0:03:17 | The limitation of these previously proposed SLU models |
---|
0:03:22 | is that the output of these models |
---|
0:03:24 | is typically conditioned |
---|
0:03:29 | on the entire word sequence, |
---|
0:03:31 | which makes those models not very suitable for online tasks. |
---|
0:03:35 | For example, in speech recognition, |
---|
0:03:37 | instead of receiving the transcribed text |
---|
0:03:40 | at the end of the speech, |
---|
0:03:42 | users typically prefer to see the ongoing live transcription |
---|
0:03:45 | while the user speaks. |
---|
0:03:47 | Similarly, in spoken language understanding, |
---|
0:03:50 | with real-time intent detection and slot filling, |
---|
0:03:53 | the downstream system will be able to start processing the query |
---|
0:03:57 | while the user is still speaking. |
---|
0:04:01 | So in this work, |
---|
0:04:02 | we want to develop a model that can perform online spoken language understanding |
---|
0:04:08 | as new words arrive from the ASR engine. |
---|
0:04:12 | Moreover, |
---|
0:04:13 | we suggest that |
---|
0:04:15 | the SLU output |
---|
0:04:16 | can provide additional context for the next word prediction |
---|
0:04:20 | in the ASR online decoding. |
---|
0:04:24 | So we want to build a model that can perform online SLU |
---|
0:04:28 | and language modeling jointly. |
---|
0:04:33 | Here is a simple visualization of our proposed idea. |
---|
0:04:37 | So given a user query like "I want a first class flight from |
---|
0:04:41 | Phoenix to Seattle", |
---|
0:04:43 | we push this query to the ASR engine for online decoding. |
---|
0:04:48 | With the arrival of the first few |
---|
0:04:50 | words, |
---|
0:04:51 | our intent model, |
---|
0:04:53 | based on the available information, |
---|
0:04:55 | provides an estimation of the user intent. |
---|
0:04:58 | And |
---|
0:04:59 | the |
---|
0:05:00 | intent model gives a very high confidence score |
---|
0:05:03 | to |
---|
0:05:04 | the intent class "airfare", and lower |
---|
0:05:07 | confidence scores for the other intent classes. |
---|
0:05:10 | Conditioned on this intent estimation, |
---|
0:05:14 | the language model |
---|
0:05:15 | adjusts its next word |
---|
0:05:17 | prediction probabilities. |
---|
0:05:19 | So here we see that |
---|
0:05:21 | the probability of "price" being the next word is pretty high, because |
---|
0:05:26 | "price" |
---|
0:05:27 | is closely related |
---|
0:05:29 | to the intent of "airfare". |
---|
0:05:32 | Then, with the arrival of another word, "flight", from the ASR engine, |
---|
0:05:37 | the intent model updates its intent estimation |
---|
0:05:41 | and increases |
---|
0:05:43 | the confidence score for the intent class "flight", |
---|
0:05:45 | and |
---|
0:05:47 | reduces the |
---|
0:05:49 | confidence score for "airfare". |
---|
0:05:51 | Accordingly, |
---|
0:05:52 | the language model |
---|
0:05:54 | adjusts its |
---|
0:05:56 | next word prediction probabilities. |
---|
0:06:00 | So here, |
---|
0:06:01 | location-related words such as Pittsburgh and Phoenix |
---|
0:06:06 | receive higher probabilities, |
---|
0:06:07 | and the probability of "price" |
---|
0:06:10 | is reduced. |
---|
0:06:13 | And with |
---|
0:06:14 | additional input from the |
---|
0:06:16 | ASR |
---|
0:06:17 | of more words, |
---|
0:06:19 | our intent model becomes more confident that what the user is looking for is |
---|
0:06:24 | flight information, |
---|
0:06:25 | and accordingly the language model |
---|
0:06:27 | adjusts the next word probabilities |
---|
0:06:30 | conditioned on the intent estimation, |
---|
0:06:35 | and |
---|
0:06:36 | so on until we complete the processing |
---|
0:06:39 | of the entire decoding. |
---|
0:06:41 | So this is a simple visualization of our |
---|
0:06:45 | proposed idea for joint online spoken language understanding and language |
---|
0:06:50 | modeling. |
---|
0:06:52 | Okay, next, |
---|
0:06:53 | our proposed method. |
---|
0:06:57 | Okay, here are the RNN, |
---|
0:07:00 | recurrent neural network, models |
---|
0:07:01 | for the three different tasks |
---|
0:07:03 | that we want to model in our work. I believe |
---|
0:07:08 | these three models are very familiar to most of us. The first one is the |
---|
0:07:12 | standard recurrent |
---|
0:07:14 | neural network language model. |
---|
0:07:16 | The second one is the RNN model for intent detection, |
---|
0:07:20 | where |
---|
0:07:20 | the last hidden state output |
---|
0:07:23 | is used to produce the intent estimation. |
---|
0:07:27 | And the third model uses a recurrent neural network for slot filling. |
---|
0:07:31 | Here, different from the RNN language model, |
---|
0:07:34 | the |
---|
0:07:36 | label output is connected back to the hidden state, so that the slot |
---|
0:07:41 | label dependencies can also be modeled |
---|
0:07:44 | in the RNN. |
---|
0:07:48 | And here is our proposed joint model. |
---|
0:07:52 | So similar to the independent training models, the input to the model |
---|
0:07:56 | is the word sequence in the given utterance. |
---|
0:08:01 | Okay, |
---|
0:08:02 | so we have the words as input, |
---|
0:08:05 | and the hidden layer output is used for the three different tasks. |
---|
0:08:10 | So here, c represents the intent class, |
---|
0:08:12 | s represents the slot label, |
---|
0:08:14 | and |
---|
0:08:15 | w represents the next word. |
---|
0:08:17 | So the output from the RNN hidden state is first |
---|
0:08:22 | used to generate |
---|
0:08:24 | the |
---|
0:08:24 | intent estimation. |
---|
0:08:26 | Once we obtain the intent, |
---|
0:08:29 | the intent class probability distribution, we draw a sample from this probability distribution |
---|
0:08:34 | as the |
---|
0:08:36 | estimated intent class at that point. |
---|
0:08:39 | Similarly, we do the same thing for the slot label. |
---|
0:08:42 | Once we have these two vectors, we concatenate these two vectors into a single |
---|
0:08:46 | one, |
---|
0:08:47 | and use this context vector |
---|
0:08:49 | for the next word prediction. |
---|
0:08:51 | Also, we connect this context vector |
---|
0:08:54 | back |
---|
0:08:55 | to the RNN hidden state, |
---|
0:08:57 | such that the intent variations along the sequence |
---|
0:09:01 | as well as the slot label dependencies can be modeled |
---|
0:09:05 | in the recurrent neural network. |
---|
0:09:09 | Well, basically, |
---|
0:09:10 | the task outputs |
---|
0:09:12 | at each time step depend on the task outputs from previous time steps, |
---|
0:09:16 | so by using the chain rule, the three |
---|
0:09:19 | models, intent detection, slot filling, and language modeling, can be factorized accordingly. |
---|
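As a rough sketch of the factorization described here (the notation is ours, not taken verbatim from the paper), with c_t the intent class, s_t the slot label, and w_t the word at step t:

```latex
% Chain-rule factorization sketch: each step's task outputs are conditioned
% on the word history and the task outputs from previous steps.
% w_{T+1} can be taken as an end-of-utterance token.
P(c_{1:T},\, s_{1:T},\, w_{2:T+1} \mid w_1)
  \;=\; \prod_{t=1}^{T} P(c_t, s_t, w_{t+1} \mid w_{1:t},\, c_{1:t-1},\, s_{1:t-1})
```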
0:09:26 | A closer look at our model: |
---|
0:09:29 | at each time step, the word input goes into the RNN hidden state, |
---|
0:09:33 | and |
---|
0:09:33 | the inputs to the hidden state |
---|
0:09:36 | are the hidden state from the previous time step, |
---|
0:09:40 | the intent class and slot label from the previous time step, |
---|
0:09:44 | and the word input from the current time step. |
---|
0:09:47 | And |
---|
0:09:48 | once we have this RNN hidden state output, |
---|
0:09:50 | we perform |
---|
0:09:52 | intent classification, |
---|
0:09:53 | slot filling, and next word prediction |
---|
0:09:57 | in sequence. |
---|
0:09:59 | So here, |
---|
0:10:00 | the intent distribution, slot label distribution, and word distribution |
---|
0:10:04 | represent the |
---|
0:10:05 | multilayer perceptrons for each of the different tasks. |
---|
0:10:09 | The reason why we apply a |
---|
0:10:10 | multilayer perceptron for each task is because |
---|
0:10:14 | we are using a shared representation, |
---|
0:10:16 | which is the RNN hidden state output, for the three different tasks. |
---|
0:10:21 | In order to further |
---|
0:10:24 | introduce additional discriminative power |
---|
0:10:27 | for the joint model, |
---|
0:10:28 | we use a multilayer perceptron |
---|
0:10:31 | for each task, |
---|
0:10:33 | instead of using a simple linear transformation. |
---|
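To make the per-step computation concrete, here is a minimal numpy sketch of one step of the joint model as described above; all dimensions, parameter names, and the plain tanh recurrence are hypothetical stand-ins (the actual model uses LSTM cells), but it shows the shared hidden state, one small MLP head per task, and the sampled intent/slot context vector that conditions next-word prediction and is fed back into the recurrent state.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical sizes: vocabulary, hidden units, intent classes, slot labels.
V, H, NI, NS = 10000, 200, 18, 127
rng = np.random.default_rng(0)
p = {  # randomly initialized parameters, for illustration only
    "W_h": rng.normal(0, 0.1, (H, H)), "W_x": rng.normal(0, 0.1, (H, H)),
    "W_c": rng.normal(0, 0.1, (H, 2 * H)),
    "E_w": rng.normal(0, 0.1, (V, H)),                                   # word embeddings
    "E_i": rng.normal(0, 0.1, (NI, H)), "E_s": rng.normal(0, 0.1, (NS, H)),
    "W_i1": rng.normal(0, 0.1, (H, H)), "W_i2": rng.normal(0, 0.1, (NI, H)),
    "W_s1": rng.normal(0, 0.1, (H, H)), "W_s2": rng.normal(0, 0.1, (NS, H)),
    "W_w1": rng.normal(0, 0.1, (H, 3 * H)), "W_w2": rng.normal(0, 0.1, (V, H)),
}

def joint_step(h_prev, ctx_prev, word_id):
    """One step: shared recurrent state, an MLP head per task, sampled context."""
    # Recurrent update conditioned on the previous hidden state, the previous
    # intent/slot context vector, and the current word embedding.
    h = np.tanh(p["W_h"] @ h_prev + p["W_c"] @ ctx_prev + p["W_x"] @ p["E_w"][word_id])
    # Separate MLP output layer for each task on the shared hidden state.
    intent_dist = softmax(p["W_i2"] @ np.tanh(p["W_i1"] @ h))
    slot_dist = softmax(p["W_s2"] @ np.tanh(p["W_s1"] @ h))
    # Sample an intent class and a slot label; their embeddings form the
    # context vector used both locally and recurrently.
    ctx = np.concatenate([p["E_i"][rng.choice(NI, p=intent_dist)],
                          p["E_s"][rng.choice(NS, p=slot_dist)]])
    # Next-word prediction conditioned on the hidden state and the context.
    word_dist = softmax(p["W_w2"] @ np.tanh(p["W_w1"] @ np.concatenate([h, ctx])))
    return h, ctx, intent_dist, slot_dist, word_dist

# Example: process one word starting from a zero state and zero context.
h0, c0 = np.zeros(H), np.zeros(2 * H)
h1, c1, intent_dist, slot_dist, word_dist = joint_step(h0, c0, word_id=42)
```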
0:10:40 | Okay, this part is about model training. |
---|
0:10:44 | As we have seen, what we do is we |
---|
0:10:48 | model the three different tasks jointly. |
---|
0:10:50 | So |
---|
0:10:52 | during model training, the errors from the three given tasks |
---|
0:10:55 | are all back-propagated |
---|
0:10:57 | to the beginning of the input sequence, |
---|
0:11:00 | and we perform a linear interpolation of the costs of the tasks. |
---|
0:11:04 | So as |
---|
0:11:06 | in this objective function, |
---|
0:11:08 | we can see that we interpolate |
---|
0:11:10 | the costs from intent classification, |
---|
0:11:14 | slot filling, and language modeling linearly, |
---|
0:11:17 | and in addition we add an L2 regularization term |
---|
0:11:23 | to this objective function. |
---|
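A minimal sketch of this kind of interpolated training objective; the weight names and values are hypothetical hyperparameters, and the per-task costs are assumed to be the usual negative log-likelihood terms over the sequence.

```python
import numpy as np

def joint_objective(intent_nll, slot_nlls, word_nlls, params,
                    alpha_intent=1.0, alpha_slot=1.0, alpha_word=1.0, l2=1e-5):
    """Linear interpolation of the three task costs plus an L2 penalty (sketch).

    intent_nll: negative log-likelihood of the true intent class.
    slot_nlls / word_nlls: per-step negative log-likelihoods for the true
    slot labels and next words. params: dict of weight matrices."""
    task_cost = (alpha_intent * intent_nll
                 + alpha_slot * float(np.sum(slot_nlls))
                 + alpha_word * float(np.sum(word_nlls)))
    l2_penalty = l2 * sum(float(np.sum(W ** 2)) for W in params.values())
    return task_cost + l2_penalty
```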
0:11:28 | As we have noticed in the previous example, |
---|
0:11:32 | the intent estimation at the beginning of the sequence |
---|
0:11:36 | may not be very stable and accurate, |
---|
0:11:39 | so |
---|
0:11:41 | when we do next word prediction, |
---|
0:11:43 | conditioning on the wrong intent class |
---|
0:11:46 | may not be desirable. |
---|
0:11:47 | To mitigate this effect, |
---|
0:11:50 | we propose a scheduled approach |
---|
0:11:52 | for adjusting the intent contribution to the context. |
---|
0:11:57 | So to be specific, |
---|
0:11:58 | during the first k steps, |
---|
0:12:01 | we disable |
---|
0:12:02 | the intent contribution to the context vector |
---|
0:12:06 | entirely, |
---|
0:12:07 | and after the k-th step, |
---|
0:12:09 | we gradually |
---|
0:12:10 | increase |
---|
0:12:11 | the intent contribution to the context vector |
---|
0:12:15 | until the end of the sequence. |
---|
0:12:17 | So here we |
---|
0:12:19 | propose to use a linear increasing function after the k-th step, and other |
---|
0:12:22 | types of increasing functions, like log functions, can also be explored. |
---|
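A minimal sketch of the scheduled intent contribution just described, assuming a linear ramp that starts after step k and reaches full strength at the end of the sequence; the function name is ours, and a log or other increasing schedule could be dropped in instead.

```python
def intent_context_weight(t, k, seq_len):
    """Weight on the intent part of the context vector at step t (0-indexed).

    The intent contribution is disabled for the first k steps, then increased
    linearly so that it reaches full strength at the end of the sequence."""
    if t < k:
        return 0.0
    if seq_len <= k + 1:
        return 1.0
    return min(1.0, (t - k) / (seq_len - 1 - k))
```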
0:12:31 | Okay, so these are some model variations of the joint model that we |
---|
0:12:36 | introduced just now. |
---|
0:12:39 | The first one is what we call |
---|
0:12:40 | the basic joint model. |
---|
0:12:42 | So here, |
---|
0:12:44 | the same shared representation from the RNN hidden state |
---|
0:12:48 | is used for the three different tasks, |
---|
0:12:50 | and there are no conditional dependencies |
---|
0:12:54 | among these three different tasks; this is what we call the basic joint |
---|
0:12:57 | model. |
---|
0:12:58 | In the second one, |
---|
0:13:01 | once we produce the |
---|
0:13:03 | intent estimation, |
---|
0:13:04 | the intent sample is connected |
---|
0:13:07 | locally |
---|
0:13:08 | to the next word prediction, |
---|
0:13:10 | without connecting it back to the RNN state. |
---|
0:13:14 | So we call this model |
---|
0:13:16 | the |
---|
0:13:17 | model with local context. |
---|
0:13:19 | In the third one, |
---|
0:13:21 | the |
---|
0:13:22 | context vector is not connected to the local next word prediction; |
---|
0:13:26 | instead, it is connected back to the RNN hidden state, |
---|
0:13:30 | so we call this model |
---|
0:13:32 | the model with recurrent context. |
---|
0:13:35 | The last variation |
---|
0:13:37 | is the one with both local and recurrent context, |
---|
0:13:40 | and this is the full model |
---|
0:13:41 | as we have seen just now. |
---|
0:13:46 | Okay, next are the experiment setup and results. |
---|
0:13:52 | So in the experiments, the dataset that we used |
---|
0:13:54 | is the Airline Travel Information System (ATIS) dataset, and in this dataset in total we have |
---|
0:13:59 | eighteen intent classes and a hundred and twenty-seven slot labels. |
---|
0:14:04 | For intent detection, we evaluate |
---|
0:14:08 | the intent model on intent classification error rate; for slot filling, |
---|
0:14:12 | we evaluate F1 score. |
---|
0:14:16 | Some details about our RNN model |
---|
0:14:20 | configuration: |
---|
0:14:21 | we use LSTM cells as the basic RNN unit for their |
---|
0:14:25 | stronger capability in terms of modeling longer-term dependencies. |
---|
0:14:29 | We perform mini-batch training using the Adam optimization method, |
---|
0:14:33 | and to improve the generalization capability of the proposed model, |
---|
0:14:38 | we use dropout and L2 regularization. |
---|
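A hypothetical PyTorch-style rendering of that training configuration (LSTM cells, mini-batch training with Adam, dropout, and L2 regularization via weight decay); the sizes, learning rate, and regularization strengths below are illustrative and not the paper's reported values.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the actual hyperparameters are not stated in the talk.
vocab_size, embed_dim, hidden_dim = 10000, 200, 200

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM cell as the RNN unit
dropout = nn.Dropout(p=0.5)                              # dropout for regularization

params = list(embedding.parameters()) + list(lstm.parameters())
# Adam with weight decay plays the role of the L2 regularization term.
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)
```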
0:14:43 | In order |
---|
0:14:45 | to evaluate the robustness of our proposed model, |
---|
0:14:49 | we not only experiment with the true text input, |
---|
0:14:53 | but also with |
---|
0:14:54 | noisy speech input. |
---|
0:14:55 | So |
---|
0:14:58 | here |
---|
0:14:59 | we use these two types of input, and there are some details on |
---|
0:15:03 | our ASR model setting, which we will see |
---|
0:15:06 | in a moment. |
---|
0:15:08 | Basically, in these experiments we report performance |
---|
0:15:12 | using these two types of input, the true text input and the speech input with |
---|
0:15:16 | simulated noise. |
---|
0:15:18 | We compare the performance of five different types of models |
---|
0:15:22 | on the three different tasks: |
---|
0:15:24 | intent detection, slot filling, and language modeling. |
---|
0:15:31 | And |
---|
0:15:32 | here is the |
---|
0:15:34 | intent detection performance |
---|
0:15:37 | using true text input. |
---|
0:15:40 | The five models, from left to right, |
---|
0:15:42 | are the independent training model for intent detection, the basic joint model |
---|
0:15:48 | as we have seen just now in the model variations, |
---|
0:15:52 | the third one is the joint model with intent context, |
---|
0:15:56 | the fourth one is the joint model with slot label context, |
---|
0:15:59 | and the last one is the joint model |
---|
0:16:02 | with both types of context. |
---|
0:16:04 | So as we can see, the joint model with both types of |
---|
0:16:08 | context |
---|
0:16:09 | performs the best, and it achieves a 26.3 percent relative error reduction |
---|
0:16:16 | over the independent training intent model. |
---|
0:16:18 | So |
---|
0:16:21 | next is the slot filling performance |
---|
0:16:25 | using the true text input. |
---|
0:16:27 | As we can see, |
---|
0:16:30 | our proposed joint model shows a slight degradation on the slot filling F1 score |
---|
0:16:36 | compared to the independent training models. |
---|
0:16:39 | This might be due to the fact that |
---|
0:16:42 | the proposed joint model |
---|
0:16:45 | lacks certain discriminative power |
---|
0:16:48 | for the multiple tasks, because we are using the shared |
---|
0:16:52 | representation from the |
---|
0:16:53 | RNN hidden state output. |
---|
0:16:56 | But this |
---|
0:16:57 | is just one aspect that can be improved further in our future work on |
---|
0:17:01 | the joint modeling. |
---|
0:17:04 | This one is the language modeling performance |
---|
0:17:07 | using the true text input. |
---|
0:17:09 | As we can see, |
---|
0:17:11 | the best performing model is the joint model with intent and slot label |
---|
0:17:15 | context, |
---|
0:17:16 | and this model achieves eleven percent relative error |
---|
0:17:20 | reduction, sorry, |
---|
0:17:21 | relative reduction of perplexity |
---|
0:17:24 | compared to the independent training language model. |
---|
0:17:27 | So one thing that we can notice from this result is that |
---|
0:17:32 | the intent context |
---|
0:17:35 | is very important |
---|
0:17:37 | in terms of producing |
---|
0:17:39 | good language modeling performance. |
---|
0:17:41 | Without intent context, |
---|
0:17:43 | the joint model with slot label context itself |
---|
0:17:46 | produces very similar performance |
---|
0:17:48 | in terms of perplexity compared to the independent training models. |
---|
0:17:53 | So |
---|
0:17:54 | here we show that intent |
---|
0:17:57 | information in the context is very important for language modeling. |
---|
0:18:04 | And lastly, some results |
---|
0:18:07 | using the speech input |
---|
0:18:08 | and ASR output to our model. |
---|
0:18:11 | These are the four ASR model settings. |
---|
0:18:13 | The first one just uses the output directly from the decoding, |
---|
0:18:17 | and the second one uses, |
---|
0:18:19 | after decoding, rescoring with a 5-gram |
---|
0:18:22 | language model. |
---|
0:18:23 | The third one uses rescoring with the independently trained RNN language model, |
---|
0:18:29 | and the last one is |
---|
0:18:30 | the model with rescoring |
---|
0:18:32 | using our proposed jointly trained model. |
---|
0:18:36 | As we can see from these results, |
---|
0:18:39 | the joint modeling, the joint training |
---|
0:18:42 | approach, |
---|
0:18:44 | produces the |
---|
0:18:45 | best performance |
---|
0:18:46 | across all three evaluation criteria here, |
---|
0:18:50 | basically the word error rate for |
---|
0:18:52 | speech recognition, intent error rate, and F1 score. |
---|
0:18:56 | So basically, this result shows that |
---|
0:18:58 | even at the word error rate introduced |
---|
0:19:03 | by the noisy speech input, |
---|
0:19:04 | our intent model and slot filling model can still produce |
---|
0:19:10 | competitive performance in intent detection and slot filling. |
---|
0:19:13 | These numbers are slightly worse than the experiments |
---|
0:19:18 | with true text input, |
---|
0:19:19 | but these results also show the robustness |
---|
0:19:23 | of our proposed |
---|
0:19:25 | model. |
---|
0:19:27 | Okay, lastly, the conclusions. |
---|
0:19:30 | In this work, |
---|
0:19:31 | we proposed an RNN model for joint online |
---|
0:19:35 | spoken language understanding and language modeling, |
---|
0:19:38 | and by modeling the three tasks jointly, |
---|
0:19:43 | our model is able to |
---|
0:19:45 | achieve improved performance on intent detection and language modeling, |
---|
0:19:50 | with slight degradation |
---|
0:19:51 | in slot filling performance. |
---|
0:19:54 | In order to show the robustness of our model, |
---|
0:19:56 | we applied our model |
---|
0:19:59 | on the ASR output of the noisy speech input, |
---|
0:20:03 | and we also observed consistent performance gains |
---|
0:20:07 | over the independent training models |
---|
0:20:10 | by using our joint model. |
---|
0:20:13 | So this is the end of the talk. |
---|
0:20:16 | Right, okay. |
---|
0:20:22 | Okay, |
---|
0:20:23 | time for a few questions. |
---|
0:20:25 | Thanks. |
---|
0:21:00 | Okay, so the question is, if I had the chance to define the corpus, what |
---|
0:21:05 | are the criteria that I would be looking for |
---|
0:21:09 | in the corpus. Yes. |
---|
0:21:10 | Right, so, |
---|
0:21:13 | basically, the thing here is, |
---|
0:21:14 | we can see that we are using recurrent neural network models, |
---|
0:21:17 | and |
---|
0:21:19 | typically such models on NLP tasks require |
---|
0:21:22 | very large datasets to show stable and robust performance. |
---|
0:21:27 | So the first criterion is, of course, if we can have a lot of data, |
---|
0:21:31 | that would be the best; |
---|
0:21:33 | the bigger the better, I would assume. |
---|
0:21:35 | And the second thing I can think of is that, |
---|
0:21:39 | for ATIS, |
---|
0:21:40 | the reason why this is a rather simple dataset is because it is very |
---|
0:21:46 | domain limited, so most of the training utterances |
---|
0:21:51 | are closely related to flights, |
---|
0:21:54 | airline travel information. |
---|
0:21:56 | So if I could, |
---|
0:21:57 | you know, redefine the corpus, |
---|
0:21:59 | I would explore the, |
---|
0:22:01 | a multi-domain |
---|
0:22:04 | scenario, |
---|
0:22:05 | to see whether our model is able to handle, |
---|
0:22:08 | you know, perform |
---|
0:22:09 | really well, not only in the domain-limited case but also in the generalized, |
---|
0:22:13 | broader multi-domain cases. |
---|
0:22:15 | So that is |
---|
0:22:17 | what I would really care about in the corpus definition. |
---|
0:22:47 | Right, I completely agree with you. I think this is |
---|
0:22:51 | a very good suggestion, because here we are doing joint modeling of SLU |
---|
0:22:56 | and language modeling, |
---|
0:22:57 | and typically language modeling, you know, helps us to make a prediction of what |
---|
0:23:02 | the user might say at the next step, and |
---|
0:23:21 | I think that would be very nice; that is a good point. |
---|
0:23:21 | If an utterance has, say, five words, maybe |
---|
0:23:23 | this is just one single training instance. |
---|
0:23:43 | So in our experiments, for the |
---|
0:23:46 | true text input, we don't have the situation |
---|
0:23:50 | that in the ASR output we may see partial, |
---|
0:23:55 | partial phrases or corrections. |
---|
0:23:59 | We |
---|
0:24:00 | did not look into this particularly in this work, |
---|
0:24:02 | but that is something |
---|
0:24:04 | we can look into in future work. |
---|
0:24:35 | Alright, okay, thanks. |
---|
0:24:39 | Just a quick question: you jointly train the language model with the SLU model, |
---|
0:24:44 | you know, but the main problem is that the corpus we have for training |
---|
0:24:48 | the SLU model is usually very small, while for training a language model you use a big |
---|
0:24:52 | corpus. But by training jointly, you know, you are essentially saying that you |
---|
0:24:57 | have to have, |
---|
0:24:59 | you know, that small corpus to determine your, |
---|
0:25:02 | to train your language model. |
---|
0:25:04 | Right, I think, |
---|
0:25:06 | I believe in this domain, |
---|
0:25:08 | data, |
---|
0:25:09 | well-labeled data, is really a limitation, because we don't have very large manually |
---|
0:25:15 | labeled datasets for these SLU tasks. So |
---|
0:25:18 | I think if we can put more effort into generating, |
---|
0:25:21 | you know, |
---|
0:25:22 | better quality corpora, that will |
---|
0:25:24 | help a lot with this SLU research. |
---|
0:25:27 | Next question? |
---|
0:25:44 | Yes, I did. |
---|
0:25:56 | Okay, so I think that is a very good question. So we have a |
---|
0:26:00 | chart in the paper, but not here in the presentation. |
---|
0:26:03 | Basically, we evaluated a number of different sizes of k. |
---|
0:26:08 | Basically, the way we use it is, |
---|
0:26:09 | starting from the k-th step, |
---|
0:26:11 | we start gradually increasing the intent contribution, |
---|
0:26:14 | and we evaluate; we show the training curve and validation curve |
---|
0:26:18 | for different k values. |
---|
0:26:20 | But basically, these values are set |
---|
0:26:23 | manually in the experiments; they are not learned |
---|
0:26:26 | in an automatic way. |
---|
0:26:32 | I think, |
---|
0:26:33 | definitely, I think this is |
---|
0:26:35 | one of the hyperparameters that can be |
---|
0:26:38 | learned through a purely data-driven approach. |
---|
0:26:41 | It is just that in the current work, we |
---|
0:26:43 | manually selected a few k values |
---|
0:26:45 | and evaluated which is the |
---|
0:26:48 | best k value. |
---|
0:26:50 | Okay, so let's thank the speaker again. Okay. |
---|