Good morning everyone, and welcome to day three of SIGDIAL. I am delighted to be here to introduce our third keynote speaker, Professor Helen Meng from the Chinese University of Hong Kong. Helen got her PhD from MIT, and she has been a professor at the Chinese University of Hong Kong for some time now; I won't count the number of years. In addition to her work on many aspects of speech and language processing and language learning, she is also involved in university administration, and she has given presentations at the World Economic Forum and the World Peace Conference. So she is not just doing research, but is actually trying to bring information about speech and language to the public and to help other people. So without further ado, I would like to introduce Professor Helen Meng.
Thank you very much for the kind introduction. Good morning, ladies and gentlemen. I'm really delighted to be here, and I wish to thank the organizers for the very kind invitation. As I mentioned, I've been working a lot on language learning in recent years, but upon receiving the invitation from SIGDIAL, I thought this was an excellent opportunity for me to take stock of what I've been doing, rather serendipitously, on dialogue. So I decided to choose this topic, the many facets of dialogue, for my presentation.
The different facets that I am going to cover include dialogue in teaching and learning, dialogue in e-commerce, and dialogue in cognitive assessment; these first three are more application-oriented. The next two are more research-oriented: extracting semantic patterns from dialogues, and modeling user emotion changes in dialogues.
So here we go. The first facet is dialogue in teaching and learning. This project is about investigating student discussion dialogues and learning outcomes in flipped classroom teaching. It is joint work with my PhD student and a research assistant in our team, and we also have three undergraduate student helpers on this project.
This project came about because back in 2012 there was a sweeping change in university education in Hong Kong, where the universities had to migrate from a three-year curriculum to a four-year curriculum. What happened then was that we were admitting students who were one year younger, and we had to design a curriculum for first-year engineering students that is broad-based, meaning all engineering students need to take these courses. Among these is the engineering freshman math course, and because it is broad-based admission, we have really big classes.
After a few years of teaching these big classes, we realized that we needed to serve the students better, especially the elite students. So we designed an elite freshman math course, which has a much more demanding curriculum; of course, students can opt in and opt out of this course. It is basically a freshman-year engineering math course.
For this elite course we have a very dedicated teacher, my colleague Professor Sidharth Jaggi. He is very creative and innovative, and he has been trying out many different ways to teach the elite students, and many different ways to flip his classroom. Eventually he settled upon the mode that I am going to talk about. In general, flipped classroom teaching involves having students watch online video lectures before they come into class, and class time is then all dedicated to in-class discussions.
Students are given in-class exercises and they work in teams; they discuss and try to solve these problems, and sometimes a team gets picked to go up to the front and present their solution to their classmates. This is the setting; in fact it is in a computer lab, so you can see computers. I think it would be ideal if we had reconfigurable furniture in the classroom, but hopefully that will come someday.
As I mentioned, every week the class time is spent on peer-to-peer learning and group discussions, and some groups are selected to present their solutions. We set out to record the student group discussions during class. The dots show where the computer monitors are placed in the room, and the red dots are where we put the speech recorders. You can see the students in groups; we actually got consent from most of the groups, except for the two shown here, to record their discussions.
Schematically, the contents of an audio file look like this. The lecturer would start the class by addressing the whole class, and of course also close the class, so we have lecturer speech at the beginning and at the end. At various points in time during the class, sometimes the lecturer will speak and sometimes the TA will speak, again addressing the whole class. There are also times when a student group finishes an exercise and is invited to go up to the front to present their solution. All the other times are open for the student groups to discuss within the team, trying to solve the problem at hand. So this is the content of the audio file.
We thus have two types of speech: speech that is directed at the whole class, and the student group discussions. We devised a methodology to automatically separate these two types, so that we can filter out the student group discussion speech for further processing and study. This methodology we will be presenting at Interspeech next week.
Within the student group discussions, we segment the audio. The segmentation is based on speaker change, and also on pauses: if there is a pause of more than one second in duration, we segment there.
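As a minimal sketch of this segmentation rule (assuming time-aligned, speaker-labeled speech units are already available; the function name and data layout are illustrative, not our actual pipeline code):

```python
def segment_discussion(units, max_pause=1.0):
    """Split time-aligned speech units into segments.

    units: list of (speaker, start_s, end_s) tuples, ordered in time.
    A new segment starts on a speaker change, or when the silent gap
    between consecutive units exceeds max_pause seconds.
    """
    segments = []
    current = []
    for unit in units:
        if current:
            prev = current[-1]
            speaker_changed = unit[0] != prev[0]
            long_pause = unit[1] - prev[2] > max_pause
            if speaker_changed or long_pause:
                segments.append(current)
                current = []
        current.append(unit)
    if current:
        segments.append(current)
    return segments
```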
We have a lot of student helpers helping us transcribe the speech, and a typical transcription looks like this.
Each segment includes the speaker label and the transcribed content. In fact, although we teach and lecture in English, when the students are discussing openly among themselves, some of them discuss in Putonghua and some discuss in Cantonese. So here the speech is actually in Chinese, but I have translated it for presentation. Just to walk through these segments in turn: in the first segment a male speaker says it really should be the same, and then a female speaker says no, these two are always exactly the same, and so on. I am going to play for you what the audio sounds like, starting with the first segment. [Audio: first segment, second segment, third segment, fourth segment, and the last.] It is very noisy.
So what we have been working on is the transcription. The class exercises generally take one week to solve, and each week there are three classes, so together the recordings compose a set. We have ten groups, and over a semester we were able to record over twelve weeks, so we ended up with one hundred and twenty weekly group discussion sets, which we denote by WGDS. Of these, fifty-two have been transcribed; this is from the previous offering, last year's offering, of the course. The total amount of audio is about five hundred and fifty hours, the total amount of discussion is about two hundred and eighty hours, and we have transcribed about one hundred hours.
As a beginning step, we look at the weekly group discussion sets and examine the discussions of the students to see whether they are relevant to the course topic, and also what level of activity there was in the communicative exchange. We then conduct analyses to tie these measures to the academic performance of the group in the course.
If we look first at measures of relevance to the course topic, we divide these into two components. The first is the number of matching math terms that occur in the speech. For example, here is a group audio excerpt: "If there's a circle, then usually we use polar coordinates, and I've used polar coordinates and then I've used it for integration, but the variable y has some problems." That is what he is saying, and in this segment we see the matching math terms, based on some textbooks and math dictionaries, which are the resources that we have chosen; we take note of those.
The second component is content similarity. We figured that because the discussion is there to solve the in-class exercise, the discussion content should bear similarity to the in-class exercise. To measure that, we trained a word2vec model and used it to compute a segment vector for each segment in the discussion; we also get a document vector from the in-class exercise, and we measure the cosine similarity between them. Here is an example, with a high-similarity segment on top and a low-similarity segment at the bottom. You can see at first glance that the top two segments are indeed about math, and the third one mentions which chapter, so it is probably referring to the textbook, whereas the low-similarity segments are general conversation.
That covers the relevance of the content. We also measure the level of activity in information exchange, and for that we count the number of segments in the discussion dialogue, and also the number of words in the discussion dialogue, adding Chinese characters and English words together. So for each weekly group discussion set we have four features: two pertaining to relevance to the course topic, and two serving as information exchange measures.
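A stdlib-only sketch of these four features per discussion (with a tiny hand-made term list and toy word vectors standing in for our math-term resources and the trained embedding model; all names here are illustrative):

```python
import math
from collections import Counter

MATH_TERMS = {"integral", "polar", "coordinates", "derivative"}  # toy stand-in

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vector(tokens, word_vectors):
    """Average word vectors of known tokens (stand-in for the trained model)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def wgds_features(segments, exercise_tokens, word_vectors):
    """segments: list of token lists (Chinese characters / English words)."""
    tokens = [t for seg in segments for t in seg]
    doc_vec = avg_vector(exercise_tokens, word_vectors)
    sims = []
    for seg in segments:
        seg_vec = avg_vector(seg, word_vectors)
        if seg_vec is not None and doc_vec is not None:
            sims.append(cosine(seg_vec, doc_vec))
    return {
        "matching_terms": sum(1 for t in tokens if t in MATH_TERMS),
        "content_similarity": sum(sims) / len(sims) if sims else 0.0,
        "num_segments": len(segments),
        "num_words": len(tokens),
    }
```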
The next thing we do is look at the academic performance. The learning outcome that corresponds to each week's course topic is measured through the relevant question components present in the way we set the midterm paper and the final exam paper. Basically, we have a score where the final exam counts sixty percent and the midterm counts forty percent, but we have set the questions so that the course content for each week is present in different components of the midterm and final papers respectively. Therefore, we are able to look at a group's overall performance on the course content for a particular week.
This is the way we did the analysis, and here is a quick summary. Basically, we compared the high-performing groups with the low-performing groups, and it is no surprise that the high-performing groups generally have a much higher average proportion of matching math terms in their discussions. They also have higher content similarity, so the words they use, the discussion content, are much more relevant. In terms of communicative exchange activity, the high-performing groups have many more total segments exchanged and more words. Note that for the first three measures, namely matching math terms, content similarity, and number of segments exchanged, we ran a significance test and the differences are significant; the fourth one is at about 0.08, but I think it is still a relevant and important feature.
What I have presented to you is the first step, where we collected the data and investigated the discussion dialogues in the flipped classroom setting in relation to learning outcomes. In terms of further investigation, what our team would like to understand is how the student discussion can become an effective platform for peer-to-peer learning: how the dialogue facilitates learning and then enhances learning. Furthermore, since the high-performing teams conduct a very efficient exchange in their dialogues, we want to see whether we can use that information to inform group formation. Right now the students form groups at the beginning of the semester and stick with them for the entire semester. We are thinking that since the high-performing groups, as the results show, have very effective discussions, maybe if we are able to swap the groups around and let the benefits of the dialogue exchange to learning spread, then, as a rising tide raises all boats, it may enhance learning for the whole class. That is the direction we would like to take this investigation.
That was the first section; now I will move on to the second section, which is on e-commerce. This is the JD Dialogue Challenge in the summer of 2018. I had a summer intern that year, an undergraduate student, and I said, well, maybe you would be interested in joining the JD Dialogue Challenge, though you have no background in it. Luckily, I also had a part-time postdoctoral fellow on the team, as well as a recent graduate from my group who is now working for the startup SpeechX Limited. In particular, I would like to thank Dr. Xiaodong He and his colleagues at JD AI for running the JD Dialogue Challenge, from which we have benefited a lot, especially my student, the junior undergraduate, who learned a lot.
The goal of this dialogue challenge is to develop a chatbot for e-commerce customer service using JD's very large dataset. They gave us one million Chinese customer-service conversation sessions, which amounts to twenty million conversation utterances, or turns. The data covers ten after-sales topics, and they are unlabeled. Each of these topics may have further subtopics; for example, the topic of invoice modification can have the subtopics of changing the name, changing the invoice type, asking about e-invoices, et cetera.
The task is the following: we have a context, which consists of the two previous conversation turns (the four utterances from those two turns) plus the current query from the user, the customer, and the task is to generate a response for this context. So it is basically a five-utterance group, and we need to generate a response. The response from the system is evaluated by human experts from customer service.
There are two very well-known approaches, the retrieval-based approach and the generation-based approach, and we take advantage of the training data, with its context-response pairs, in building both. Our retrieval-based approach is very standard: basically TF-IDF plus cosine similarity.
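A self-contained sketch of such a retriever (pure-Python TF-IDF; in practice one would use a proper library and Chinese word segmentation, and the tokenized toy data here are illustrative only):

```python
import math
from collections import Counter

def tfidf_index(contexts):
    """contexts: list of token lists. Returns sparse tf-idf dicts and idf."""
    n = len(contexts)
    df = Counter(t for doc in contexts for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()}
               for doc in contexts]
    return vectors, idf

def sparse_cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, contexts, responses, n=20):
    """Return responses of the n training contexts most similar to query."""
    vectors, idf = tfidf_index(contexts)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    ranked = sorted(range(len(contexts)),
                    key=lambda i: sparse_cosine(q, vectors[i]), reverse=True)
    return [responses[i] for i in ranked[:n]]
```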
Our generation-based approach is also a very standard configuration. We segmented the Chinese context, that is, the two previous dialogue turns together with the current query, and we also segmented the response. We feed those data in and model the statistical relation between the context and the response using a seq2seq model with attention. That covers both the training and the inference phases.
The system that we eventually submitted is a hybrid model, based on a very commonly used rescoring framework. What we did was to use the retrieval-based approach to generate N response alternatives, where we chose N to be twenty, so that there is enough choice but it will not take too long, and then use the generation-based approach to rescore these twenty responses. The nice thing about this is that the generation-based approach considers the relationship between the given context and the chosen response. We rescore and re-rank, and we check whether the highest-scoring response has exceeded a threshold, which was arbitrarily chosen at 0.85. If it exceeds the threshold, we output that response; otherwise, we take that as a sign that our retrieval-based model does not have enough information to choose the right response, so we just use the entire seq2seq model to generate a new response.
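The decision logic of the hybrid can be sketched as follows, under the assumption that the retriever, rescorer, and generator are available as functions (all names are illustrative, and the stubs in the test stand in for the real models):

```python
def hybrid_respond(context, retrieve, rescore, generate, n=20, threshold=0.85):
    """Retrieve n candidate responses, rescore each against the context
    with the generation model, and fall back to free generation when even
    the best candidate scores below the threshold."""
    candidates = retrieve(context, n)
    scored = [(rescore(context, c), c) for c in candidates]
    if scored:
        best_score, best = max(scored)
        if best_score >= threshold:
            return best
    return generate(context)
```

With a stub rescorer that favors one candidate, the function returns that candidate; if every candidate scores low, it falls back to the generator.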
So that is the system, and we received a technology innovation award for it. It has been a very fruitful experience, especially for my undergraduate student: after this JD Dialogue Challenge she decided to pursue a PhD, and she is actually starting her first term as a PhD student in our lab now. We also got valuable data resources from industry during that summer. Moving forward, we would like to look into flexible use of context information for different kinds of user inputs, ranging from chit-chat to one-shot information-seeking enquiries, follow-up questions, multi-intent inputs, et cetera. Yesterday I saw a professor's poster with a very comprehensive decomposition of this problem.
That was my second project; now I will move to the third project, which looks at dialogue in cognitive screening: investigating spoken language markers in neuropsychological dialogues for cognitive screening. This is a recently funded, very big project, and we have a cross-university team: there is the Chinese University team, and we also have colleagues from HKUST and the Polytechnic University. From the Chinese University, not only do we have engineers, we also have linguists, psychologists, neurologists, and geriatric education specialists on our team, so I am really excited about this team. We have our teaching hospital, which is the Prince of Wales Hospital, and we are also building the new CUHK teaching hospital, which is a private hospital, so I think we are going to be able to recruit many subjects to participate in our study.
This study focuses on neurocognitive disorder (NCD), which is another term for dementia. As is well known, the global population is ageing fast, and Hong Kong's population is ageing even faster. NCD is very prevalent among older adults. It has an insidious onset; it is chronic and progressive, and there is a general, global deterioration in memory, communication, thinking, judgement, and other cognitive functions. It is among the most incapacitating diseases. NCD manifests itself in communicative impairments such as uncoordinated articulation, as in dysarthria; the subject may lose capability in language use, as in aphasia; and they may have reduced vocabulary and grammar, and weakened listening, reading, and writing. Existing detection methods include brain scans, blood tests, and face-to-face neuropsychological (NP) assessments, which include structured, semi-structured, and free-form dialogues. One free-form dialogue is where the participant is invited to do a picture description: they are given a picture, or sometimes a photograph, and asked to describe it.
My colleagues in the teaching hospital have been recording their neuropsychological tests (we are allowed to record them), and that provides some initial data for our research. The flow of the conversation includes the MMSE, the Mini-Mental State Examination, together with the Montreal Cognitive Assessment (MoCA) test; it is a combination of both, and some overlapping components are shared. We have about two hundred hours of conversations between the clinicians and the subjects; each is a one-on-one neuropsychological test.
Here is an example. We have normal subjects and also others who are cognitively impaired, and here are some excerpts of the conversations. This one is from a normal subject who was asked about the commonality between a train and a bicycle, and this is the answer. The clinician hints that one is big, and the subject says yes, the train is larger and the bike is smaller, isn't it; the clinician then asks, okay, but what is common between them, and the subject says that both are used for transport. For the cognitively impaired subjects, this next excerpt is more typical; in fact the original dialogue is in Chinese, so we have translated it into English for presentation here, and this is the dialogue for a cognitively impaired subject. We did some very preliminary analysis based on about twenty individuals, gender-balanced.
We looked at the average number of utterances in an NP assessment. You can see that for males the total number of utterances drops as we move from the normal to the cognitively impaired subjects, and the same trend holds for the females. For the gap time, which is essentially the reaction time, there is a general small increase going from the normal to the cognitively impaired; this chart is for the males and this one is for the females. Also, the normal subjects tend to speak faster, so they produce a higher average number of characters per minute and average number of words per minute. This is very preliminary data.
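These simple per-session measures can be computed directly from time-aligned transcripts; here is a sketch (the field names and exact definitions, e.g. of speaking time, are illustrative assumptions):

```python
def session_measures(utterances):
    """utterances: time-ordered list of dicts with 'start', 'end' (seconds)
    and 'text' (transcribed characters) for one subject's utterances."""
    gaps = [nxt["start"] - cur["end"]
            for cur, nxt in zip(utterances, utterances[1:])]
    speaking_minutes = sum(u["end"] - u["start"] for u in utterances) / 60.0
    n_chars = sum(len(u["text"]) for u in utterances)
    return {
        "num_utterances": len(utterances),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "chars_per_minute": n_chars / speaking_minutes if speaking_minutes else 0.0,
    }
```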
We are looking at different linguistic features, such as grammatical quality, information density, and fluency, and also acoustic features, in addition to reaction time, duration of pauses, hesitations, pitch, prosody, et cetera; we will be looking at a whole spectrum of these features. My student has also developed an initial prototype which illustrates how interactive screening may be done, and here is a demonstration video to show you. It starts with a word recall exercise.
System: Please listen carefully. I am going to state three words that I want you to try to remember, and repeat them back to me. Please repeat the following three words to me: season, [unclear], river. Say your response after the beep.

Subject: Well... season... [unclear]... river.

System: Good. Please remember the three words that were presented, and recall them later on. Please try your best to describe what is happening in the picture above. Tap on the button below to begin or complete your response.

Subject: I see a family of four sitting in the living room. There is a father, a mother, a girl, and a boy. They are... [unclear]... I can't really see much clearly, I don't know.

System: Good. Tap on the Done button if you have completed the task. Tap on the Try Again button to redo the picture description task. Please say the three words I asked you to remember earlier; recall and say the three words to me. Say your response after the beep.

Subject: Season... river... I don't remember the last one... [unclear].
So basically the system then tallies the results of the various tasks over the data, and there are score charts relating to, for example, how many correct responses were given, the response time, the gap time, et cetera. I need to state clearly that the voice, that is, the speech, is based on real data, but the real data are in Chinese. My student translated them into English and tried to mimic the pauses, and also the way the subject liked to talk to himself; he mimicked that as well, so this demo is for illustration only. Most of our data will be in Chinese: Cantonese, or maybe Mandarin.
As a quick summary, spoken dialogue offers easy accessibility and high feature resolution, down to even millisecond resolution in terms of reaction time, pause time, et cetera, for cognitive assessment. We want to develop speech, language, and dialogue processing technologies to support holistic assessment of various cognitive functions and domains by combining dialogue interaction with other interactions, and we also want to further develop this platform as a supportive tool for cognitive screening.
That is the end of the third project. Now I will move away from the application-oriented facets to the more research-oriented facets. The fourth project is on extracting semantic patterns from user inputs in dialogues, for which we have been developing a convex polytopic model; this is work done by a postdoctoral fellow in my group, myself, and a colleague. The study uses ATIS-2 and ATIS-3, together about five thousand utterances, to support our investigation. The convex polytopic model is an unsupervised approach that is applicable to short text, and it can help us automatically identify semantic patterns from a dialogue corpus via a geometric technique.
As shown here with well-known ATIS examples, we can see the semantic pattern "show me flights", which is an intent; another semantic pattern of going from an origin to a destination; and another semantic pattern, "on a certain day". We begin with a space of M dimensions, where M is the vocabulary size. Each utterance forms a point in this space, and the coordinates of the point are equal to the sum-normalized word counts along the axes.
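As a sketch, the mapping from an utterance to a point in this vocabulary space (sum-normalized word counts) might look like:

```python
from collections import Counter

def utterance_point(tokens, vocab):
    """Map an utterance to a point in the M-dimensional vocabulary space,
    where coordinate m is the utterance's count of vocab[m], normalized
    so that the coordinates sum to one."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]
```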
There are two steps in our approach. The first is to embed the utterances into a low-dimensional affine subspace using principal component analysis; this is a very common technique, and the principal components tend to capture features that can optimally distinguish points by their semantic differences. Then, in the second step, we generate a compact convex polytope to enclose all the embedded utterance points, using the Quickhull algorithm.
As an illustration, this is what we call a normal-type convex polytope. All of these points are utterance points; they illustrate the utterances in the corpus residing in the space, namely the affine subspace. Each vertex of the compact convex polytope is a point drawn from the collection of utterance points, so each vertex also corresponds to an utterance.
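In two dimensions, the enclosing polytope is just the convex hull of the embedded points. A minimal stdlib sketch (using Andrew's monotone chain in place of Quickhull, which yields the same hull, and treating the 2-D coordinates as PCA-projected utterance points):

```python
def convex_hull(points):
    """Return the hull vertices (counter-clockwise) of a set of 2-D points.
    Every returned vertex is one of the input points, mirroring the fact
    that every polytope vertex corresponds to an actual utterance."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Interior points (non-extreme utterances) never appear as vertices, which is why the vertices pick out the extreme, pattern-defining utterances.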
We can then connect the linguistic aspects of the utterances within the corpus to the geometric aspects of the convex polytope. You can think of it this way: the utterances in the dialogue corpus become embedded points in the affine subspace; the scope of the corpus is now encompassed by the compact convex polytope delineated by the boundaries connecting the vertices; and the semantic patterns of the language of the corpus are now represented as the vertices of the compact convex polytope. Because the vertices represent extreme points of the polytope, each utterance can also be formed by a linear combination of the polytope's vertices.
Let us look at the ATIS corpora. As you know, in ATIS we have these intents, which we colour-code here, and we plot the utterances in the ATIS training corpora in the space; this is shown as a two-dimensional space so that you can see all the plots on a plane. We then ran the Quickhull algorithm and it came up with this polytope, the most compact one, and you can see that the most compact polytope has twelve vertices, V1 through V12.
Each vertex also corresponds to an utterance. If you look at vertices one to nine, they are all dark blue in colour, and in fact they all correspond to utterances with the intent class of flight. Vertex ten is light blue, and it actually corresponds to the intent of abbreviation. Vertex eleven is also dark blue, as is vertex twelve. So this is an illustration of the convex polytope.
We can then look at each vertex. V1 to V9 each correspond to an utterance, and you can see V1 to V9 over here; they are very close together, and essentially they capture the semantic pattern of going from some origin to some destination. These are all utterances with the labeled intent of flight. Vertex twelve is very close by, and its constituent utterance is "flights to baltimore", so it has just the destination. We also want to look at vertices ten and eleven, so let us go to the next page.
For vertex ten, shown here in green are the neighboring utterances, and if you look at the constituent utterances, you can see that they are all questions of the form "what is <abbreviation>". As for vertex eleven, its nearest neighbors basically all capture "show me", as in "show me some flights". So you can see that the vertices, generally together with their nearest neighbors, capture some core semantic patterns.
For the convex polytope we do not have any control over the number of vertices, which is usually unknown until you actually run the algorithm. If we want to control the number of vertices, we can use a simplex. Here, again, we want to plot in two dimensions, so we chose a simplex with three vertices; to constrain it to three vertices, we can use a sequential quadratic programming algorithm to come up with the minimum-volume simplex.
Just to recall, this is the normal-type convex polytope, and you can see it has twelve vertices. Now we want to constrain the number of vertices to three, that is, we want to generate a minimum-volume simplex, and here is the output of the algorithm. We now have the minimum-volume simplex with three vertices. If you look at this minimum-volume simplex, with vertices one, two, and three, and compare it with the previous normal-type convex polytope: vertex one of the simplex corresponds to vertex eleven of the normal-type polytope, and it also happens to coincide with an utterance. If we go to vertex three of the simplex, you can see a light blue dot here, which actually corresponds to vertex ten of the normal-type polytope; it is very close by, so vertex three of the simplex is very close to vertex ten of the normal-type polytope.
Now, what about all the vertices from one to nine, and also vertex twelve? These are all grouped in here, covered by a slight extension of vertex two. You can see that the minimum-volume simplex is now encompassing all the utterances; we are no longer guaranteed that a vertex itself is an utterance point, but we have only three vertices, and the resulting minimum-volume simplex is formed by extrapolating the three lines that join vertices of the previous normal-type bounding convex hull, including V10, V11, V12, and V8 and V9, into the three sides.
For each vertex of this minimum-volume simplex we can look further. For example, for the first vertex, you can look at its top nearest neighbors, and here is the list of the utterances corresponding to each point in the nearest-neighbor group; they all have the pattern "show me some flights from someplace to someplace", "show me flights", so that is a semantic pattern. Now let us look at vertex two: here you can see the pattern of going from an origin to a destination. Every vertex also resides in the M-dimensional space, so its coordinates can show us the top words, the strongest words, that are most representative of the vertex; you can also see the list of the ten top words from each vertex's coordinates. Now let us look at V3: its nearest neighbors are shown here, and they are mostly about what is meant by an abbreviation. So the minimum-volume simplex allows us to pick the number of vertices we want to use, and it also shows some of the semantic patterns that are captured.
and we paid three because we wanna be able to plot it
in fact and we can pick any arbitrary number of higher dimensions
so
we can examine at a higher dimensionality that semantic patterns
by analysing the nearest neighbors and also the top words of the verdict sees
So for example we ran one with sixteen dimensions, so we end up with seventeen vertices. I list the first ten here, followed by the next seven, so seventeen altogether, and here are the top words for each vertex and also the representative nearest neighbors.
You can see that, for example, vertex four is capturing the semantic patterns "show me something" and "number x from someplace to someplace". Vertex eight is "what does some abbreviation mean", and vertex nine is asking about ground transportation. We also have vertices one, two, and five, which are all related to locations, and I think that is perhaps due to data sparsity. Vertex three is about "can I get something" or "I would like something", and one vertex is really just a bunch of frequently occurring words, I guess.
Now if we look at the next set of vertices: vertex thirteen is about "flights from someplace", maybe "to someplace" as well; fourteen is "what is something"; sixteen is "list all something"; and again vertices eleven, fifteen, and seventeen are location names. Vertex twelve is an airline name, or really about either a date or an airline, so I think this is a case where we may have been limited by the number of subspace dimensions. If we ran the same experiment with more dimensions, hopefully it would separate the date from the airline.
So basically we are just playing around with this convex polytope topic model as a tool for exploratory data analysis. I like the geometric nature because it helps me interpret the semantic patterns, and my hope is to extend this from semantic pattern extraction to tracking dialogue states in the future.
So that is section four. And now section five, my last section, which is on affective design for conversational agents: modeling user emotion changes in a dialogue.
This is actually the PhD work of a student from Tsinghua University, who also interned in our lab in Hong Kong for a couple of summers, because the direct supervisor is a professor at Tsinghua University. This work was conducted in the Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, which is in Shenzhen, and it is funded by the National Natural Science Foundation of China and the Hong Kong Research Grants Council Joint Research Scheme.
Our long-term goal is to impart affect sensitivity into conversational agents, which is important for user engagement and also for supporting socially intelligent conversations.
This work looks at inferring users' emotion changes. The main assumption is that the emotive state change is related to the user's emotive state in the current dialogue turn and also the corresponding system response. So the objective is to infer the user's emotive state and also the emotive state change, which can in the future inform the generation of the system response.
We use the PAD model, the pleasure-arousal-dominance framework, for describing emotions in a three-dimensional continuous space. Pleasure is about positive versus negative emotions, arousal is about mental alertness, and dominance is more about control.
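In this framework an emotive state is just a point in the three-dimensional PAD space, and a state change is the difference between consecutive turns. A minimal sketch, with PAD values invented for illustration (not annotations from the actual corpus):

```python
# Each dialogue turn's emotion is a point (p, a, d) in a continuous space;
# the emotive state change is the per-dimension difference between turns.
def pad_change(prev, curr):
    return tuple(c - p for p, c in zip(prev, curr))

turn_2 = (-0.6, 0.2, -0.4)   # illustrative: negative pleasure, low dominance
turn_3 = (-0.2, 0.3, -0.1)   # illustrative: after a comforting system response
print(pad_change(turn_2, turn_3))
```

All three components increase here, matching the kind of shift described for the comforting response in the example dialogue.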
So this is a real dialogue, originally in Chinese, which I have translated into English here for presentation. It is a dialogue between a chatbot and the user, and we have annotated the PAD values for each dialogue turn. You can see, for example, in dialogue turn two the user said "somebody broke up with me", and the response from the system is "let it go, you deserve a better one", and you see that from that dialogue turn the values of P, A, and D all increase. Then, for example, in dialogue turn eight, the user said something and the system's reply seemed to amuse the user, and it also softened the value of the dominance.
So these are the values that we work with in the PAD space, and this is our approach towards inferring emotive state change. On the left is the speech input; on the right is the output of emotion recognition and the prediction of emotive state change.
We start by integrating the acoustic and lexical features from the speech input. This is basically a multimodal fusion problem, and it is achieved by concatenating the features and then applying a multitask learning convolutional fusion auto-encoder, so it goes through different layers of convolution and max pooling.
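The two operations just named can be sketched in a few lines. This is a toy, pure-Python illustration of concatenation followed by 1-D convolution and max pooling, not the actual multitask auto-encoder (which learns its kernels and works on much higher-dimensional features):

```python
# 1-d "valid" convolution over a feature sequence.
def conv1d(seq, kernel):
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

# Non-overlapping max pooling, which keeps the strongest activation per window.
def max_pool(seq, window=2):
    return [max(seq[i:i + window]) for i in range(0, len(seq) - window + 1, window)]

# Acoustic and lexical features are concatenated before the conv layers.
acoustic = [0.2, 0.8, 0.1]
lexical = [0.5, 0.4, 0.9, 0.3]
fused_in = acoustic + lexical

feat = max_pool(conv1d(fused_in, kernel=[0.5, 0.5]))
print(feat)
```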
Then we also capture the system response as a whole utterance. This is because the holistic message is received by the user, and the entire message plays a role in influencing the user's emotions. The system response encoding uses a long short-term memory recurrent auto-encoder, and it is trained to map the system response into a sentence-level vector representation.
Next, the user's input and the system's response are further combined using convolutional fusion, and the framework then performs emotion recognition using stacked hidden layers. The results are then further used for inferring the emotive state change, and for this we use a multitask learning structured output layer, so that the dependency between the emotive state change and the emotion recognition output is captured. In other words, the emotive state change is conditioned on the recognized emotive state of the current query.
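The conditioning idea can be sketched very simply. This toy example (my own, with made-up linear heads rather than the actual trained network) shows the structural point: the state-change head takes the recognition head's output as an extra input, so its prediction depends on the recognized emotion.

```python
def linear(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(fused_features, w_emo, b_emo, w_delta, b_delta):
    # Emotion recognition head (here: one scalar, e.g. the pleasure dimension).
    recognized_p = linear(w_emo, b_emo, fused_features)
    # The change head is conditioned on the recognized state by appending it.
    conditioned = fused_features + [recognized_p]
    delta_p = linear(w_delta, b_delta, conditioned)
    return recognized_p, delta_p

rec, delta = predict([0.5, 0.45, 0.65],
                     w_emo=[0.2, -0.1, 0.4], b_emo=0.0,
                     w_delta=[0.1, 0.1, 0.1, -0.5], b_delta=0.05)
print(rec, delta)
```

Training both heads jointly, with the dependency wired in this direction, is the essence of the multitask structured output layer.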
Now, the experimentation is done on IEMOCAP, which is a corpus very widely used in emotion recognition, and also on the Sogou voice assistant corpus. The Sogou corpus has over four million utterances in three domains, transcribed by an ASR engine with a 5.5 percent word error rate. We actually look at the chat dialogues: there are ninety-eight thousand such conversations, between four and forty-nine turns each, and we used a pre-trained emotion DNN to filter out the neutral conversations. So we ended up with about nine thousand emotive conversations, with over fifty-two thousand utterances, which were selected for labeling with the PAD values. And then we run the emotion recognition and also the emotive state change prediction.
We use a whole suite of evaluation criteria on the predicted emotive states in PAD values and also the emotive state changes in PAD values: the unweighted accuracy, the mean accuracy over the different emotion categories, the mean absolute error, and also the concordance correlation coefficient.
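The two regression-style criteria are standard and easy to state exactly. A small sketch with made-up predictions (not results from the talk): MAE is the average absolute deviation, and the concordance correlation coefficient additionally penalizes mean and variance mismatch, not just low correlation.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989)."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    vt = sum((t - mt) ** 2 for t in y_true) / n
    vp = sum((p - mp) ** 2 for p in y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (vt + vp + (mt - mp) ** 2)

# Illustrative pleasure-dimension targets and predictions.
true_p = [0.1, -0.3, 0.5, 0.2]
pred_p = [0.0, -0.2, 0.4, 0.3]
print(round(mae(true_p, pred_p), 3), round(ccc(true_p, pred_p), 3))
```

Perfect predictions give CCC = 1, while a constant predictor gives CCC near 0, which is why CCC is a stricter criterion than accuracy for continuous PAD values.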
This is a benchmark against other recent work using other methods, for IEMOCAP and also for the Sogou data sets. The proposed approach actually achieves competitive performance in emotion recognition. In emotion change prediction, our proposed approach achieves significantly better performance than the other approaches, but there is still room for improvement if you compare with human performance in human annotation.
So to sum up, this is among the first efforts to analyze user input features, both acoustic and lexical, together with the system response, in order to understand how the user's emotion changes due to the system response in the dialogue. We have achieved competitive performance in emotive state change prediction, and we believe that this is a very important step towards socially intelligent virtual assistants, with the incorporation of affect sensitivity for human-computer interaction.
So my talk was in five chunks, but this is the overall summary. Basically, when I look back at all these different projects, a very clear message arises: much can be gleaned from dialogues to understand many important phenomena, including how student group discussions may facilitate learning, how the customer experience can be shaped by chatbot responses, and also the status of an individual's cognitive health. And I guess I'm preaching to the choir here, but I truly believe that we have only seen the tip of the iceberg, and there is tremendous potential, with abundant opportunities and a lot of research to be done. So thank you very much.
Thank you very much. Do we have questions?
Thank you very much. My question is regarding topic three, cognitive impairment. We are also working on that. Severe cognitive impairment is easy to detect: with just a small conversation we can identify that a person has cognitive impairment. But I think the problem is mild cognitive impairment, MCI, which is very difficult to detect. So I think the final goal of this work is maybe to estimate the degree of cognitive impairment using features. What do you think?
So thank you very much for the question. Indeed, in our study we will be covering the range from normal adults to what is now called minor NCD, the new terminology: minor neurocognitive disorder, the mild one, and major neurocognitive disorder, the severe one.
This is what we learnt from our colleagues in neurology. For elderly people, we need to be more diligent in engaging them in these cognitive assessments, because they are really exercises, and there are subjective fluctuations going from one exercise to another. Therefore, the more frequently you can take the assessment, the better. The issue is not the exact scoring; rather, it is at the personal level: if there are any sudden changes, perhaps more drastic changes, in the scoring level of the individual, that would be an important sign. So frequent tracking is important.
Sometimes the minor NCD, the milder cognitive impairment, is harder to detect, and you also have to tease apart the natural cognitive decline due to ageing from the pathological cognitive decline. So it is a complex problem, but nevertheless, because dementia is such a big problem in the ageing global population, and there is no cure, we just have to work very hard on how to do early detection and intervention. Thank you for the question.
Thank you for this very nice talk; the many topics are really impressive. I was wondering, especially in relation to the classrooms and to the cognitive screening: at the moment, if I understood correctly, you are working on transcriptions, right? Have you made any experiments with ASR, and if so, what was your experience there? What is the likelihood of it being sufficiently good?
So, the classroom is very difficult; that is why we have no choice but to work on transcriptions. But as for the way we have recorded these neuropsychological tests, it is actually just between the clinician and the subject. We did not want the recording conditions to be intrusive in any sense, so we just put a phone there, and of course we obtain the subject's consent. Depending on the device, some of it we think is doable, but we would need robust speaker adaptive training and noise-robust speech processing; we need to throw in the kitchen sink to be able to do well.
Thanks for a great talk. On the cognitive assessment, from a discourse structure point of view: I was wondering what sort of processing you plan to do on those descriptions that the subjects provide, apart from, you know, speech processing and lexical cohesion. Any thoughts about discourse coherence and rhetorical relations among the sentences that they provide, and so on?
So thank you for that wonderful question. We must look at that; we haven't looked at it yet, but actually I have heard from our colleagues, the clinicians, that coherence in following the discourse of a dialogue oftentimes shows problems if there is cognitive impairment. So that is definitely one aspect that we must look at, and in fact we would welcome any interested collaborators to look at that together. Thank you for the question.
Thanks for the very interesting talk. I want to ask about the emotion modeling: is the PAD-space modeling just based on speech input, or are you also using nonverbal signals, like laughter or sighing, little things like that?

Right now we don't have that. It would be wonderful if we could have those features, but right now it is really the speech input, so acoustic and lexical input, and also the sentence-level encoding of the system's response.
Hi, my question is about section five. You did two prediction tasks: emotion recognition and emotive change prediction. Even though these seem similar, I think there is a subtle but important difference between the two. So my question is: do you use the same features to do both? Do you think there are features that are more important for the emotive change rather than for the emotion recognition, and what differences have you seen between the two?
Great question. So we think that for the current query, based on the current user input, we want to be able to understand the emotion of the user. But if you think about what comes next: depending on how we respond to the user, that is, the system response, the user's emotion change and the next input may be different. So for example, here the subject is talking about a breakup, and at first the system tries to comfort the subject. Then at some point the user gets inquisitive: "are you real or not? how can a robot know what I like?" And the system says "I know what you like." Then the user says something, and at this point of the dialogue the system can respond in various ways, and in the end the user says "you must be real." So I think the emotive changes depend on the system response. If we can model that, and the way we have modeled it is through multitask training, where the emotive state change is dependent on the recognized emotion, then we can capture this dependency and, in the future, utilize it to choose how to generate the system response, so that you can hopefully guide the emotion change in the dialogue in the way you want.