Okay, so then we move on to the next speaker.

So the paper is "Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking". This work was done together with my co-authors from Microsoft Research; I myself am from Heriot-Watt University. So let's start.
So the aim of this paper is to build a ranker that detects problematic dialogues among the normal ones without any labeled data. We use the existing seed dialogues as the normal dialogues, and then we learn a generative user simulator via a GAN setup and have it talk with the bot at different training steps. We collect the conversations from the different training steps and take them as the problematic dialogues; we call this method stepGAN. The experiment results show that stepGAN compares favorably with a ranker trained on a manually labeled dataset.
Okay, so what is log dialogue ranking? Log dialogues are conversations that happen between real users and the dialogue system, and dialogue ranking aims to identify the problematic dialogues among the normal ones.
Here are two examples, a normal dialogue and a problematic dialogue. The first one is a normal dialogue, in the restaurant search domain. First the system says hello, and then the user asks for a European restaurant. The system asks what part of town the user has in mind, and the user says the centre. After that the system asks for the price range, and the user picks the expensive one. After getting all the information, the system says "I suggest the Michaelhouse Cafe" and repeats all the requirements of the user. After that the user asks for the address of this cafe, and the system gives the correct information. Then they thank each other and the dialogue finishes. So we define a normal dialogue as a dialogue without any contextually unnatural turns that also achieves all the requirements asked by the user.
And here is the problematic dialogue. Apparently the system cannot understand the user utterances, and the conversation goes in the wrong direction. For example, the user says "I would really like Eastern European that's cheap", and the system has some problem understanding this utterance: it suggests a restaurant in the east part of town, which is not what the user was asking for. After that the user says "I want to eat at this restaurant, have you got their address?", and the system does not understand this utterance and asks "what part of town do you have in mind" again. So we define a problematic dialogue as a dialogue with either contextually unnatural turns, or unachieved user requirements, or both.
So the goal of the ranker is to pick out this type of problematic dialogue from among the normal ones.
So why do we need such a ranker? In a typical human-in-the-loop development cycle of a data-driven dialogue system, the developers first build or upgrade their dialogue system using some in-domain seed dialogues. Then the dialogue system is deployed and released to the customers, and the log conversations can be collected. The developers can then improve the performance of the system by correcting some of the mistakes the system made in the log dialogues and retraining the dialogue system model. However, going through all these dialogues is time-consuming. So we hope that this manual checking process can be replaced by a dialogue ranker that detects dialogues of lower quality automatically, to make this human-in-the-loop dialogue learning process more efficient.
So here is the structure of the ranker. The input of the ranker is a dialogue, and the output is a score between zero and one, where zero means a normal dialogue and one means a problematic dialogue. First we get the sentence embeddings with a sentence encoder, and then feed them into a multi-head self-attention layer to capture the meaning of the dialogue context. Then we have a turn-level classifier to identify the quality of each turn. For example, for a very smooth turn the score should be around 0.1, and for a problematic turn the score should be around 0.9. On top of these turn-level qualities there is a dialogue-level ranker. For this example dialogue, some parts are smooth and some are problematic, so the predicted score will probably be something like 0.8.
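To make this concrete, here is a minimal sketch of how such a ranker could be wired together; the layer sizes, the mean-pooling sentence encoder, and the aggregation over turn scores are illustrative assumptions, not the exact architecture from the paper.

```python
# Sketch of a dialogue ranker: sentence encoder -> multi-head self-attention
# over turns -> turn-level quality scores -> one dialogue-level score in [0, 1].
import torch
import torch.nn as nn

class DialogueRanker(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # token embeddings
        self.attn = nn.MultiheadAttention(emb_dim, num_heads,
                                          batch_first=True)        # context across turns
        self.turn_clf = nn.Sequential(nn.Linear(emb_dim, 1),
                                      nn.Sigmoid())                # per-turn quality
        self.dial_rank = nn.Sequential(nn.Linear(1, 1),
                                       nn.Sigmoid())               # dialogue-level score

    def forward(self, turns):
        # turns: LongTensor of shape [num_turns, max_tokens] for one dialogue
        turn_emb = self.embed(turns).mean(dim=1)                   # crude sentence encoding
        ctx, _ = self.attn(turn_emb.unsqueeze(0), turn_emb.unsqueeze(0),
                           turn_emb.unsqueeze(0))                  # self-attention over turns
        turn_scores = self.turn_clf(ctx.squeeze(0))                # [num_turns, 1]
        dial_score = self.dial_rank(turn_scores.mean(dim=0))       # aggregate to one score
        return dial_score, turn_scores
```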
However, gathering labeled data to train such a ranker is very time-consuming. Imagine the human-in-the-loop development process: whenever a significant change is made to the system, new labeled data for the ranker is required. This is not feasible for most developers, and that motivates us to explore this stepGAN approach. The general idea is that we take the seed dialogues as the normal dialogues, and at the same time we use stepGAN to simulate the problematic dialogues, and we train the ranker on top of this data.
So here is the structure of the GAN setup. We have a dialogue generator and a discriminator, and inside the dialogue generator we have the restaurant search dialogue system and an RNN-based user simulator. First we start with a pre-training process. In this process we pre-train the user simulator with full-length multi-domain dialogues. For example, one multi-domain dialogue can be pizza ordering, in which the user asks for a large pineapple pizza, and another can be temperature setting, in which the user asks to set the temperature of the room to seventy-two degrees. Then we ask the user simulator to simulate some dialogues together with the restaurant search bot. Here is an example of a simulated dialogue after pre-training. As we can see, the user simulator has some basic language abilities, but it does not know how to talk with the restaurant search bot. So when the system asks for some restaurant search requirement, the user replies with something off-domain, and of course the dialogue does not go in the right direction.
After we get these simulated problematic dialogues, we pre-train the discriminator on them together with the seed dialogues. After the pre-training process, we move on to the first step of the stepGAN training. First we initialize the user simulator and the discriminator with the pre-trained models, separately. Then, for the training of the discriminator, we ask the dialogue generator to simulate some dialogues with only one turn and take them as the problematic dialogues; we also take the seed dialogues, truncate them to the first turn, take them as the normal dialogues, and feed both into the discriminator. For the training of the simulator in step one we also use these one-turn simulated and seed dialogues. After that we run the adversarial training, alternating between updating the generator and the discriminator. After the model converges, we ask the model to simulate full-length dialogues and put them into the simulated problematic dialogue bucket. As we can see, the first turn of this example is very smooth, but after that, when the system asks what part of town the user has in mind, the user gives an answer the system cannot understand, and the dialogue goes wrong.
After the first step, we come to the second step. We again initialize the user simulator and the discriminator: we initialize the user simulator with the one trained in step one, and we initialize the discriminator with the pre-trained model. The only difference between step one and step two is that we ask the dialogue generator to simulate dialogues with two turns, and at the same time we truncate our seed dialogues to two turns, and we train the discriminator and the user simulator in the same way. After the model converges, we ask the user simulator to simulate full-length dialogues and put them into the simulated problematic dialogue bucket. As we can see, the first two turns of this example are smooth, and from the third turn on there is something wrong.
Okay, and then we just repeat this for N steps. After the N-th step of training, we have all these buckets of simulated problematic dialogues, and together with the seed dialogues we train our dialogue ranker.
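Putting the steps together, a minimal sketch of this stepwise loop could look like the following; the callable arguments are placeholders standing in for the actual pre-training, truncation, adversarial-update, and rollout components, so this only shows the control flow, not the real models.

```python
# Sketch of the stepwise adversarial loop: at step k, train on dialogues
# truncated to k turns, then roll out full-length dialogues into bucket k.
def step_gan(seed_dialogues, pretrained_simulator, new_discriminator,
             train_adversarially, truncate, simulate_full, num_steps):
    simulator = pretrained_simulator
    problematic_buckets = []                      # one bucket per training step
    for k in range(1, num_steps + 1):
        discriminator = new_discriminator()       # re-initialise from the pre-trained model
        seeds_k = [truncate(d, k) for d in seed_dialogues]    # "normal" side of the GAN
        simulator, discriminator = train_adversarially(
            simulator, discriminator, seeds_k, max_turns=k)   # generator output = "problematic"
        problematic_buckets.append(simulate_full(simulator))  # keep full-length rollouts
    # The ranker is then trained on seed dialogues (label 0) vs. all buckets (label 1).
    return problematic_buckets
```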
So here are the datasets used in this paper. Basically we use three datasets. The first one is the multi-domain dialogues, which are for the pre-training of the user simulator and the discriminator. Here we use the MetaLWOz dataset, which contains task-oriented conversations, around thirty-seven thousand dialogues over fifty-one domains. Each dialogue in this dataset is a task-oriented conversational interaction between two real speakers, one of them simulating the user and the other simulating the bot.
The second part is the seed dialogues, which are for the training of the GAN structure. Normally the seed dialogues are human-written dialogues that would be offered to the developers before the active development of the dialogue system. However, we do not have such human-written dialogues, so we create the seed dialogues by having the PyDial restaurant search bot talk with the rule-based user simulator that is also offered by PyDial.
And the third one is the manually labeled log dialogues, which are for the evaluation of this task. To collect this labeled data, we deployed our PyDial restaurant search bot via the Amazon Mechanical Turk platform. First we automatically generate some requirements for the users, for example food type, location, and price range. Then we ask Turkers to find a restaurant that satisfies those requirements by chatting with our restaurant search bot. At the end of each task the users are asked two questions: the first one is whether they found a restaurant meeting all the requirements, and in the second one we ask the users to label the contextually unnatural turns in the conversation. In total we collected one thousand six hundred normal dialogues and one thousand three hundred problematic dialogues.
Here are some experiment results. Basically we run four experiments to justify the performance of this stepGAN. In the first one, we investigate how the generated dialogues move towards the normal dialogues. Basically, we examine the dialogues generated at each step of the stepGAN in terms of three metrics. Here are two of them: the upper one is the ranking score and the lower one is the success rate. The yellow dashed line and the green dashed line stand for the average performance of the labeled normal dialogues and of the labeled problematic dialogues, respectively. As we can see, after the first step of training, the generated dialogues score much worse than the labeled problematic dialogues. After three steps of training, both metrics start growing and are better than the average performance of the labeled problematic dialogues. And after N steps of training, the success rate is almost as high as that of the labeled normal dialogues, and we can also see that the dialogues are getting smoother and more natural.
Here is the second experiment. In the second experiment we compare stepGAN with a ranker trained on the labeled dataset. First we divide the AMT-labeled data into three parts: two thousand training examples, two hundred development examples, and four hundred test examples. Then we train a dialogue ranker, which we call Supervised-2000, on this labeled training set and evaluate its performance. We evaluate with precision at k and recall at k.
For the training of stepGAN, we simulate one thousand problematic dialogues. All the datasets are balanced, with one thousand positive examples and one thousand negative examples; because the number of seed dialogues is only one hundred, we duplicate them to make the dataset balanced. Then we train our stepGAN ranker on this dataset.
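As a small illustration of that balancing step, something like the following would do it; the function and the counts here are just for illustration, not taken from the paper's code.

```python
# Build a balanced ranker training set: oversample the small seed set as the
# normal class (label 0) to match the simulated problematic class (label 1).
import itertools

def build_training_set(seed_dialogues, simulated_problematic):
    n = len(simulated_problematic)
    normals = list(itertools.islice(itertools.cycle(seed_dialogues), n))  # repeat seeds
    return ([(d, 0) for d in normals] +
            [(d, 1) for d in simulated_problematic])

# e.g. 100 seed dialogues repeated to match 1000 simulated problematic dialogues
```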
So here is the performance. As we can see, stepGAN performs even better than the supervised approach when k is lower than fifty, even though Supervised-2000 has higher performance when k gets larger. We set things up this way to keep the comparison fair.
And here is the third experiment. We basically add the simulated data to the labeled data and compare the performance of this combined dataset with that of the labeled dataset alone. Here is the result. Basically, the experiment shows that our stepGAN approach can bring some additional generalisation by simulating a wider range of dialogues that are not covered by the labeled data.
The last experiment is where we compare stepGAN with other types of user simulator. The first one, which we call Multi-Domain, is where we train the user simulator on the multi-domain dialogues only, simulate one thousand problematic dialogues, and then train the dialogue ranker on them together with the seed dialogues. The second one is the fine-tuned model: we pre-train the user simulator on the multi-domain dialogues, fine-tune it on the seed dialogues, then generate one thousand problematic dialogues, train the ranker together with the seed dialogues, and evaluate the performance. And the last one is what we call the stepwise fine-tuned model: we replace the fine-tuning on full-length seed dialogues with fine-tuning in the stepwise fashion introduced in stepGAN, just without the GAN structure. Here are the results. We also train our stepGAN on the same size of dataset, that is, one thousand simulated dialogues plus the seed dialogues. As we can see, stepGAN also performs better than all the other user simulators.
So the conclusion is that stepGAN can generate dialogues with a wide range of qualities, and it compares favorably with a ranker trained on a labeled dataset. It also brings additional generalisation by simulating a wider range of dialogues that are not covered by the labeled data. And lastly, it also outperforms the other user simulators. Thank you very much. Are there any questions?
Hi, I actually have two questions. The first one is: of course you are starting with a binary classification, problematic versus non-problematic, but some dialogues are more problematic than others, and you address some of that via the turn-level scores; however, in the end it is still a binary classification, right? (Yes.)
Then my second question is: because it is a binary classification, what does precision mean here?

Okay, in this case precision at k is a ranking metric, so it is quite relevant for evaluating the ranking process. Basically what we are doing is: we have, for example, four hundred test dialogues, and we use our dialogue ranker to give a score to each dialogue and rank them from top to bottom. The dialogues at the top of the ranking, the ones with the higher scores, are the ones the model considers problematic. Then we take, say, the first ten dialogues, count how many of them are truly problematic, and divide by ten. We can extend that to, for example, the top fifty and the top one hundred.
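In other words, something like this minimal sketch, where the example numbers are made up:

```python
# precision@k: rank dialogues by the ranker's score (high = problematic) and
# check how many of the top k are labeled problematic.
def precision_at_k(scores, labels, k):
    """scores: ranker outputs in [0, 1]; labels: 1 = problematic, 0 = normal."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

print(precision_at_k([0.9, 0.2, 0.8, 0.4], [1, 0, 0, 1], k=2))  # -> 0.5
```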
So you generate these problematic dialogues in a fashion where the beginning is all smooth and then the end is kind of rubbish. Does the test data also come from that, or is it separate? Because there could also be something wrong in the middle of a dialogue.

For the test data we basically use human-labeled data, where real humans are talking with our system, so the errors can appear in the middle of the conversation or at the end.

And you don't rank it turn by turn, you judge the whole dialogue? — Yes, we don't rank turn by turn, we just score the whole dialogue.

Are there other questions?
Hi, I have a question about how you define a problematic dialogue as a whole. I mean, there can be some errors in the middle that the system can repair, so what exactly do you mean by a problematic dialogue?

There are actually three types of problematic dialogue in our definition. The first type is dialogues that have some unnatural turns: basically they achieve their goal, but the communication is not smooth. The second type is where the communication is not smooth and at the same time they do not achieve the goal. And potentially there is a third one, where the communication is smooth but they did not achieve the goal. We just define problematic dialogues in this way.

For the case where the interaction is not smooth but the task is successful, do you have that type in your data? And, sorry, did you calculate the annotator agreement?

We didn't specifically try to find this type of data, but because of the way we gathered the data, I think this type of example is in the test dataset.

Alright, thank you.
A question over there.

Right, because the ranker outputs a continuous value?

Yes, the ranker output is continuous between zero and one, so it can be, say, 0.8 or 0.5. When it is close to one, that means it is a problematic dialogue, and when it is close to zero, that means it is a normal one; it is normalized between zero and one.
And what is the loss function?

So the loss function basically compares the score given by the ranker with the label: we label problematic dialogues as one and normal dialogues as zero, and the loss measures the difference between the ranker's score and this label, for example a binary cross-entropy.
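As a concrete illustration of a loss of that kind (the exact loss used in the paper may differ), a minimal binary cross-entropy setup looks like this:

```python
# Binary cross-entropy between ranker scores and dialogue labels
# (1 = problematic, 0 = normal); the values here are made up.
import torch
import torch.nn.functional as F

scores = torch.tensor([0.8, 0.1, 0.5])   # ranker outputs in [0, 1]
labels = torch.tensor([1.0, 0.0, 1.0])   # ground-truth dialogue labels
loss = F.binary_cross_entropy(scores, labels)
print(loss.item())
```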
I have one more question. So you generate these problematic dialogues; how do you know that they actually correspond to problematic dialogues in this corpus?

We have three metrics to evaluate that. The first one is the length: normally, if there is something wrong in the dialogue or the user did not achieve the goal, the dialogue is longer, so that is one metric. The second one is the success rate, which measures whether the user achieved their goal. And the third one is the score given by the ranker trained on the labeled data, which basically gives the probability of being problematic. Then we compare, as in this figure: we compare against the average ranking score of the labeled problematic dialogues, which is the green dashed line, and also against the yellow dashed line, which is the average performance of the labeled normal dialogues. We can see that at the beginning all these evaluation metrics are very low, and after that they get higher, which means that at the beginning the generated dialogues are mostly problematic, and towards the end they are getting better.
But if you look at this example, the user utterances look very unlikely to happen with a real user: when the system asks what part of town they have in mind, the answer is some completely unrelated place. It is like the user is breaking the system here, and the system is just reacting, without any realistic error being introduced.

Yes, but this one is only after one step of training. After three steps of training the user is saying something more plausible, for example "I'm not looking for this place, please change", which is also related to the restaurant domain, but it is an utterance whose content the system cannot understand, and that causes the failure of the dialogue. So at the beginning we generate problematic dialogues in a very crude way, but during the stepGAN training process the dialogues get into the restaurant search domain; it is just that the way the user describes their requirements is not accepted by the system. So the generated dialogues get closer to the domain and the problems become subtler.
Okay, we have room for one final question.
As you go along the steps of the stepGAN, it looks like the problems appear towards the end of the dialogue, after the k-th turn. Is that the case? Doesn't the generator also generate low-quality, problematic turns earlier on?

Actually, most of the problems do appear at the end, but there is some randomness in the generation process, because we have a random seed or something like that, so some problems can appear in between; these are just much fewer than the ones that appear at the end.
I see, okay. I mean, that matters, because problems can also occur in the middle or at the beginning.

Right. Ideally, in this paper we would like to have errors appear in all kinds of places, and indeed some of the generated dialogues, even after maybe six or seven turns, still have some problems appearing in the middle, but that is much less common.

I think maybe, as future work, it could be helpful to combine dialogues from different steps when you train the ranker.

You mean collecting the data from the different training steps? We are doing that: we combine all these dialogues when we train the ranker. — Okay.
Okay, I think that was the final question, so let's thank the speaker again.