so my name is ryuichiro higashinaka from ntt corporation and
today i'm gonna talk about role play-based question answering by real
users for building chatbots with consistent personalities
so
now we are seeing a lot of
chatbots because we are talking to them every day these days some people are talking to
these characters every day
for example microsoft's rinna in japan
it is very famous people are talking to it every day and we have products like
gatebox
where people can talk to virtual characters in this small device
and also we have
more human-like
characters as in david's work
so we are having
many chatbots and they have consistent personalities
and if we want them to be believable they need to have consistent personalities
like this
and to generate consistent responses for chatbots
it's good to have character-specific question-answer pairs
but the creation of such pairs is as you know very costly
so the motivation behind this work is that
we want to efficiently
collect
question-answer pairs for characters
and in this work we particularly use
the technique called role play-based question answering
as a technique for collecting
the
question-answer pairs
and before going into the details of this work i'm gonna explain what
role play-based question answering is
so in role play-based question answering
in the middle we have
a famous person
and users talk to this famous person
and in this case this is an image of someone who is very famous
and
at the back
of this character we have a bunch of role players who collectively play the
role of the famous person
so if this user
asks a question to this famous person like what do you like
this question is broadcast
to all the role players
and then
one of the role players answers the question by saying something like i like
sweets
then this answer is returned to the user
and
this question and answer form a pair
and this pair can be collected as a question-answer pair for this character
since role players can enjoy playing the role of their favourite character
and also the users can ask questions to their favourite character
users can get highly motivated to provide question-answer pairs so this is how it
works
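
A minimal sketch of the collection loop just described, assuming each role player can be modelled as a function that either returns an answer or returns None to skip. The class and method names here are hypothetical illustrations, not the actual website implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RolePlaySession:
    """Collects question-answer pairs for one character (hypothetical helper)."""
    character: str
    pairs: list = field(default_factory=list)

    def ask(self, question, role_players):
        """Broadcast a user question to every role player and keep each answer."""
        collected = []
        for player in role_players:
            answer = player(question)  # a role player may return None to skip
            if answer is not None:
                collected.append((question, answer))
        self.pairs.extend(collected)
        return collected


session = RolePlaySession("famous person")
role_players = [lambda q: None, lambda q: "i like sweets"]
print(session.ask("what do you like", role_players))
```

Every answered question leaves one more pair in `session.pairs`, which is the data the rest of the talk builds on.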
but there are some problems with this architecture
so that is
only a small scale experiment with paid users was performed
to test the concept of role play-based question answering
so it was not clear if this scheme would work with real users
and also another problem is that the small scale experiment
did not yield much data
to allow data driven methods to work
so the applicability of the collected data to the creation of chatbots
was not verified
so to solve these problems in this work
for the first problem we verified the
effectiveness of role play-based question answering with real users
in this study we focus on two famous characters in japan
and
we set up websites for role play-based question answering
for people to enjoy the role play
and for the second problem we created chatbots using the data
collected in this way
and
in this paper we propose a retrieval-based method
and evaluate its performance by subjective evaluation
so let me
talk about
the data collection by real
users
so we focus on these two characters
who are very famous in japan
one is max murai he is an actual person and he's a company ceo and
he's also a youtuber who specialises in live coverage of tv games
and
the other character is ayase aragaki she is a fictional character in a novel
and it does is the company this you
and her character is often referred to as a yandere
according to wikipedia a yandere character is mentally unstable and uses extreme violence or
brutality as an outlet
for their emotions so they are two very distinct
different characters one is
an actual person
a male character and the other one is a fictional female character
and we set up websites
so that people can enjoy role play-based question answering
so each character has a channel
maybe a kind of community channel
a channel for the fans on the japanese
streaming service called niconico douga
this is like youtube
and
we set up the sites
on their channels for the subscribers to enjoy role play-based question answering
so this is how the site looks for murai
people post
questions these are the posted questions
and these are the answers given by several role players
and this is how it looks for
ayase
you can post questions in the text field and this is a
question posted by the user and this is the answer posted by the role player
so this is how it looks
and we ran this kind of trial for several months
and this is what we got this table shows the statistics of
the collected data
if you look at these two
the number of users who participated and the number of question-answer pairs we obtained
we had many users
as you can see to play the roles of murai and ayase about three hundred people
participated
and over ten thousand question-answer pairs were collected for both
murai and ayase
and also the answers for ayase if you look at the average
words per answer the answers for ayase were much
longer and contained more words
so the role players for ayase were more talkative and those for murai were not as talkative
that is
just reflecting their actual characters
and this slide shows the efficiency
of the data collection process
this
table shows
how long it took to reach this number of question-answer pairs
so
for example
to reach two thousand
pairs
it took about seven days for murai and about one
day for ayase and to reach ten thousand pairs
it took about three months for murai and eighteen days for ayase
so for both characters it took just a matter of days to reach two
thousand question-answer pairs
and for ayase we collected
ten thousand question-answer pairs in just eighteen days i think that is quite
fast
and i think this confirms the efficiency of role play-based question answering for data collection
please note that the users were not paid to provide the data they just
voluntarily
provided data while enjoying the role play
and we also assessed the quality of the data and the satisfaction of the users
so
this table shows the average scores for sampled answers
and the maximum score is five and we got very reasonable scores for the
posted answers
and for the satisfaction of the users
we had three questionnaire items usability of the website willingness for future
use and enjoyment of role play and we see that users really enjoyed role playing
so we have collected more than ten k question-answer pairs
in
role play-based question answering and now it's time to create chatbots using the collected
data
so this is an overview of our proposed method
basically we employ a retrieval-based approach given an input question q
and
question-answer pairs are retrieved from the question-answer pair database that we
have collected
and we check whether
the score of each retrieved pair q prime a prime is high
or not
so
the pair with the highest score is selected and its a prime
is used as the output of the chatbot
so for example if this pair has a score of zero point nine and the other ones have
scores below
zero point nine then this pair would be selected and its a prime is used as the
output of the chatbot
and
the important thing here is
how do we calculate this score
so for this purpose we have this scoring function
it is a weighted sum of six
different
scores
the search score question type match score center word score translation score
reverse translation score and semantic similarity score and these scores are integrated to
calculate the overall score for each question-answer pair
and i'll
describe these scores one by one
describe these scores along by well
for the initial sweets course
so for the summer school
this is what is given by the scene text with you but engine conclusions of
asr service this question as a great
and reason using with default settings it uses the m twenty five as such
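
As a rough illustration of what a BM25 search score looks like, here is a textbook BM25 sketch in plain Python. It is not Lucene's code: whitespace tokenization is an assumption made for brevity (Japanese text would need a morphological analyzer), and `k1=1.2`, `b=0.75` are the commonly cited defaults.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with textbook BM25."""
    docs = [d.split() for d in documents]  # assumption: whitespace tokens
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


stored_questions = ["what do you like", "what did you eat for lunch", "hello there"]
print(bm25_scores("what did you eat", stored_questions))
```

The stored question sharing the most rare terms with the input question gets the highest score, and a stored question with no shared terms gets zero.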
and for the question type
match score
this score is calculated on the basis of whether the question type of q
matches that of q prime and the number of named entities in a prime requested by
q
and also the center word score
we first extract center words and center words mean noun phrases representing topics they are extracted
from both q and q prime and if they overlap a score of one
is given
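
The center word score can be sketched as a binary overlap check. This assumes the center words (topic noun phrases) have already been extracted by some upstream module, which is not shown; the function name is hypothetical.

```python
def center_word_score(q_center_words, q_prime_center_words):
    """Return 1.0 if the two questions share at least one center word, else 0.0."""
    shared = set(q_center_words) & set(q_prime_center_words)
    return 1.0 if shared else 0.0


# center words assumed to come from a separate noun-phrase extractor
print(center_word_score(["lunch"], ["lunch", "today"]))
print(center_word_score(["lunch"], ["game"]))
```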
for the other three scores
well for the translation score we use a neural translation model to calculate the probability of a
prime given q it is the generative probability of a prime given q that is used as the score
the model is pretrained with our in-house zero point five million question-answer pairs and
then fine-tuned with the collected question-answer pairs
and for this purpose we use opennmt
and the reverse translation score is very similar to the translation score but the probability of q
given a prime is used
as the score
finally the semantic similarity score
first sentence vectors are obtained for both q and q prime by using averaged word vectors
from word2vec
then the cosine similarity between the two sentence vectors is
used as the score
the word2vec model is trained from wikipedia articles
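
The semantic similarity score can be sketched as follows: average the word vectors of each question, then take the cosine of the two averages. The tiny two-dimensional `toy_vectors` table is made up for illustration and stands in for a real word2vec model.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(q_tokens, q_prime_tokens, word_vectors):
    """Cosine similarity between averaged word vectors of q and q prime."""
    u = sentence_vector(q_tokens, word_vectors)
    v = sentence_vector(q_prime_tokens, word_vectors)
    if u is None or v is None:
        return 0.0  # assumption: no known words means no similarity
    return cosine_similarity(u, v)


# made-up vectors, standing in for a trained word2vec model
toy_vectors = {"lunch": [1.0, 0.0], "dinner": [0.9, 0.1], "game": [0.0, 1.0]}
print(semantic_similarity(["lunch"], ["dinner"], toy_vectors))
print(semantic_similarity(["lunch"], ["game"], toy_vectors))
```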
note that all scores are normalized between zero and one before integrating the scores
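
Putting the six scores together might look like the sketch below. The talk does not say how the scores are normalized, so min-max normalization over the retrieved candidates is an assumption here, and the dictionary keys are hypothetical names for the six scores. Weights default to one, matching the setting mentioned for the experiment.

```python
SCORE_NAMES = (
    "search", "question_type", "center_word",
    "translation", "reverse_translation", "semantic_similarity",
)


def min_max_normalize(values):
    """Map values to [0, 1]; assumption: min-max over the candidate list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, weights=None):
    """Rank candidates by the weighted sum of their normalized scores.

    candidates: list of dicts holding an 'answer' and one raw value per score name.
    Returns (overall_score, answer) tuples, best first.
    """
    if not candidates:
        return []
    weights = weights or {name: 1.0 for name in SCORE_NAMES}
    normalized = {
        name: min_max_normalize([c[name] for c in candidates])
        for name in SCORE_NAMES
    }
    scored = []
    for i, cand in enumerate(candidates):
        total = sum(weights[name] * normalized[name][i] for name in SCORE_NAMES)
        scored.append((total, cand["answer"]))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    dict(answer="i had pasta", **{name: 1.0 for name in SCORE_NAMES}),
    dict(answer="ramen", **{name: 2.0 for name in SCORE_NAMES}),
]
best_score, best_answer = rank_candidates(candidates)[0]
print(best_answer, best_score)
```

The answer of the top-ranked pair is what the chatbot would output.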
so this slide shows the overall flow of the system
so a user question comes in then it goes into the document retrieval engine lucene to retrieve
question-answer pairs from the question-answer pair database
and the top n candidates are retrieved
and for each of the candidates
we calculate the scores
by using these modules
question type estimation named entity recognition the center word extraction module the neural translation
models
and the word2vec model
and we obtain the six
scores that i just explained
and
we get the final ranking of the candidates and output the
top n
in this study we use only the top one
as the chatbot's response
and
because we have only about ten k
question-answer pairs in this database the coverage of questions
is limited you know
so we additionally have another database which holds extended question-answer pairs
created from the question-answer pairs i just explained
so to extend the question-answer pairs
we first
focus
on the answer
in one particular question-answer pair
and we first search for a very similar
tweet in the twitter space
which has very similar content the normalized edit distance is below zero point
one so they should be very similar on the surface
and for this tweet we use
all the tweets
to which this was a response
and we retrieve these tweets
and
couple these tweets as questions with the answer
and these
couples are the extended question-answer pairs that's how we extend the question-answer
pairs into the extended question-answer pairs
and for murai
we obtained additional
question-answer pairs and for ayase
we obtained
about one million additional question-answer pairs
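
The extension step can be sketched like this, assuming an external corpus of `(message, reply)` pairs is available. Normalizing the edit distance by the longer string's length is an assumption, since the talk only gives the 0.1 threshold; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def extend_qa_pairs(qa_pairs, reply_corpus, threshold=0.1):
    """Pair each collected answer with messages whose reply nearly matches it."""
    extended = []
    for _question, answer in qa_pairs:
        for message, reply in reply_corpus:
            if normalized_edit_distance(answer, reply) < threshold:
                extended.append((message, answer))
    return extended


qa_pairs = [("what did you eat", "i had ramen today")]
reply_corpus = [("lunch?", "i had ramen today!"), ("how are you", "fine thanks")]
print(extend_qa_pairs(qa_pairs, reply_corpus))
```

This brute-force loop compares every answer with every reply, so at the scale of millions of pairs an index or blocking step would be needed in practice.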
so by using the proposed method
we did an experiment to verify its effectiveness
we used twenty six subjects
each for murai and ayase
and they were recruited from the subscribers so they are very picky about the quality
of the utterances they are fans of the characters
and the procedure is that each subject evaluated the answers
of the five methods for comparison i'll explain the methods later
on a five point likert scale
and
test questions which were held-out data from the collected question-answer pairs
were used as input
we have two evaluation criteria
one is naturalness
not knowing who is talking whether the answer is appropriate to the input question or not
and the other is characterness
knowing that the character in question is talking whether the answer is appropriate to the input question or not
so
i'll
i
describe the methods for comparison we have five
we have two baselines
and two proposed methods and one upper bound
as a first baseline we have one called aiml
and it uses general-purpose three hundred k handcrafted rules written in aiml
artificial intelligence markup language for response generation
and personal pronouns and sentence-end expressions
were altered to match those of the characters
so as you know this is a massive amount of
handcrafted rules that we have been developing and we are using that
for response generation in this kind of setting
and baseline two is called lucene
and it uses the answer of the highest ranking pair
which is retrieved by lucene which uses bm25 by using the input
question as the query
and this is proposed method one it is called prop
without extdb the extended database the proposed method without the extended question-answer
pairs
and all the weights in the scoring function are set
to one
for this proposed method
and proposed method two is called prop
this is the proposed method itself and all the weights are set to
one as well
and the upper bound
is called gold and it's the gold responses
provided by online users for the test questions
then we compared these five
and this shows the results
for the five methods for both murai and ayase
and as you can see the proposed methods are much better than the baselines
for murai the proposed methods significantly outperform the baselines
regardless of whether the extended database is used or not
and for ayase
the proposed method outperforms one of the baselines which is aiml
and also the proposed method is better than prop without the extended database for naturalness
the results are quite good and they are
approaching the upper bound getting close to gold which is the
gold data
i'll show you some of the examples that are more interesting so for example this
is for murai what did you eat
for lunch today and then aiml gave its answer
and it had a very high naturalness score but it was not very
much like murai
and the proposed method just returned ramen
it was short but it was just like himself
and for ayase
the user said you're cute as a question and
we had two
responses like thank you it's very embarrassing and thank you from the proposed methods and they
had very high scores
so the aiml rules may produce natural responses
but their characterness is not necessarily high
and short answers just like ramen and thank you
can lead to high scores showing that the content of the utterances
is very important
so to summarize
we successfully verified the effectiveness of role play-based question answering
by using real users
and we successfully created chatbots using the collected question-answer pairs
and as future work
we want to improve the quality
of the proposed method and also we want to try additional types of characters
as targets for role play-based question answering
actually
questions
so actually this is a kind of
how do i say people can compare different answers and that's the interesting part
of this system
people can just actually there's a kind of like button here
people can just press this button then
you know you can see that this was a much better utterance so
it was kind of you know it's not a competition but this is kind of an interesting
thing for comparing them
yes they are completely isolated
no it was just this amounts to
so we just wanted to make sure that
we are not cheating so that that's not that the point
and we could have had
users
put in their own questions and then evaluate the responses but since we had a
dataset we wanted to do kind of a survey
so we could do that
so we
have an agreement with this streaming service and they have
the rights for that
so we have the right to put our website on their fan channels
and all of the rights have been cleared
any other questions
okay so let's thank the speaker again