0:00:17 Good morning, everyone. Welcome to day three of SIGDIAL. I'm delighted to be here to introduce our third keynote speaker, Professor Helen Meng from the Chinese University of Hong Kong. Helen got her PhD from MIT, and she has been a professor at the Chinese University of Hong Kong for some time now; let's not count the number of years. In addition to her work on many aspects of speech and language processing, language learning, and so on, she is heavily involved in university service, and she has also given presentations at the World Economic Forum and the World Peace Conference. So she is not just doing research, but actually trying to bring information about speech and language out to help other people. So without further ado, I'd like to introduce Professor Helen Meng.
0:01:31 Thank you very much for the kind introduction. Good morning, ladies and gentlemen. I'm really delighted to be here, and I wish to thank the organizers for the very kind invitation. As was mentioned, I've been working a lot on language learning in recent years, but upon receiving the invitation from SIGDIAL, I thought this was an excellent opportunity for me to take stock of what I've been doing, rather serendipitously, on dialogue. So I decided to choose this topic, the many facets of dialogue, for my presentation. The different facets that I'm going to cover include dialogue in teaching and learning, dialogue in e-commerce, and dialogue in cognitive assessment; these first three are more application-oriented. The next two are more research-oriented: extracting semantic patterns from dialogues, and modeling user emotion changes in dialogues.
0:02:38 So here we go. The first facet is dialogue in teaching and learning, where the project is about investigating student discussion dialogues and learning outcomes in flipped classroom teaching. This is joint work with my PhD student and a research assistant on our team, and we also have three undergraduate student helpers on this project.
0:03:08 This project came about because back in 2012 there was actually a sweeping change in university education in Hong Kong, where all the universities had to migrate from a three-year curriculum to a four-year curriculum. What happened then was that we were admitting students who were one year younger, and we had to design a curriculum for first-year engineering students which is broad-based, meaning all engineering students need to take those courses. Among these is the engineering freshman math course, and because it's broad-based admission, we have really big classes. After a few years of teaching these big classes, we realized that we needed to serve the students better, especially the elite students. So we designed an elite freshman math course, which has a much more demanding curriculum, and of course students can opt in and opt out of this course. It's basically a freshman-year engineering math course.
0:04:22 For this elite course we have a very dedicated teacher, my colleague Professor Sidharth Jaggi. He's very creative and innovative, and he has been trying out many different ways to teach the elite students, and many different ways to flip the classroom. Eventually he settled upon a mode that I'm going to talk about. In general, flipped classroom teaching involves having students watch online video lectures before they come into class, and then class time is all dedicated to in-class discussions. Students are given in-class exercises and they work in teams; they discuss and try to solve these problems, and sometimes a team gets picked to go up to the front and present their solution to their classmates. This is the setting, and in fact it's in a computer lab, so you can see computers. I think it would be ideal if we had reconfigurable furniture in the classroom, but hopefully that will come someday.
0:05:45 As I mentioned, every week the class time is spent on peer-to-peer learning and group discussions, and some groups are selected to present their solutions. So we sent my students to record the student group discussions during class. The dots are where the computer monitors are placed in the room, and the red dots are where we put the speech recorders. You can see the students in groups, and we actually got consent from most of the groups, except for two, which are shown here, to record their discussions.
0:06:31 Schematically, the contents of an audio file look like this. The lecturer would start the class by addressing the whole class, and of course also close the class, so we have lecture speech at the beginning and at the end. At various points in time during the class, sometimes the lecturer will speak and sometimes the TA will speak, again addressing the whole class. And there are times when a student group finishes an exercise and they're invited to go up to the front to present their solution. All the other times are open for the student groups to discuss, within the team, within the group, to try to solve the problem at hand. So this is the content of the audio file: we actually have two types of speech, one which is directed at the whole class, and one which is the student group discussions. We devised a methodology to automatically separate these two types, so that we can filter out the student group discussion speech for further processing and study. This methodology we will be presenting at Interspeech next week.
0:07:55 Now, within the student group discussions, we segment the audio. This segmentation is based on speaker change, and also, if there's a pause of more than one second in duration, then we segment there.
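To make this segmentation rule concrete, here is a minimal sketch of pause-based segmentation in Python; the file name and the energy threshold are illustrative assumptions, only the one-second pause rule comes from the talk, and the speaker-change criterion is not shown.

```python
# Minimal sketch: split a recording wherever silence lasts >= 1 second.
# Only the one-second rule comes from the talk; everything else is illustrative.
import librosa

audio, sr = librosa.load("group_discussion.wav", sr=16000)  # hypothetical file
intervals = librosa.effects.split(audio, top_db=30)  # non-silent (start, end) samples

min_pause = 1.0 * sr  # pauses shorter than one second do not split segments
segments = [list(intervals[0])]
for start, end in intervals[1:]:
    if start - segments[-1][1] < min_pause:
        segments[-1][1] = end            # short pause: extend current segment
    else:
        segments.append([start, end])    # long pause: new segment boundary

for start, end in segments:
    print(f"segment: {start / sr:.2f}s - {end / sr:.2f}s")
```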
0:08:12 We have a lot of student helpers helping us transcribe the speech, and a typical transcription looks like this.
0:08:23 Each segment includes the group name; for example, this group gave themselves a nickname. And here are the segments. In fact, we teach and lecture in English, but when the students are discussing among themselves, some of them discuss in Putonghua and some of them discuss in Cantonese. So here the speech is actually in Chinese, but I've translated it for presentation. Just to play each of these segments for you in turn: basically, the first segment is a male speaker saying it really should be the same, and then a female speaker says these two are always exactly the same, and so on. I'm going to play what the audio sounds like, starting with the first segment. [plays audio] So that was the first segment, second segment, third segment, fourth segment, and the last. It's very noisy.
0:09:38 So what we have been working on is the transcription. The class exercises generally take one week to solve, and each week has three classes, so together the recordings compose a set. We have ten groups, and over a semester we are able to record over twelve weeks, so we end up with a hundred and twenty weekly group discussion sets, which we denote by WGDS. Out of these, fifty-two have been transcribed; this is from the previous offering, last year's offering, of the course. The total number of hours of audio is five hundred fifty hours, the total hours of discussion is about two hundred eighty hours, and we've transcribed about a hundred hours.
0:10:29 What we did here, as a beginning step, is to look at the weekly group discussion sets and examine the students' discussions to see whether they are relevant to the course topic, and also what level of activity there was in the communicative exchange. Then we conduct analyses to tie these with the academic performance of the group in the course.
0:10:59 Looking at these two measures: relevance to the course topic is in fact divided into two components. The first is the number of matching math terms that occur in the speech. For example, here is a group audio clip. [plays audio] Basically, he says: if there's a circle, they usually use polar coordinates, and I've used polar coordinates and then used them for integration, but the variable y has some problems. That's what he said. In this segment, we identify the matching math terms based on some textbooks and math dictionaries; these are the resources that we have chosen, and we take note of those terms.
0:12:00 The next component is content similarity. We figured that because the discussion is there to solve the in-class exercise, the discussion content should bear similarity to the in-class exercise. To measure that, we trained a doc2vec model and used it to compute a segment vector for each segment in the discussion; we also get a document vector from the in-class exercise, and we measure the cosine similarity.
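As a concrete illustration of this similarity measure, here is a minimal sketch using gensim's Doc2Vec as a stand-in for the paragraph-vector model described in the talk; the toy tokens and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: embed discussion segments and the in-class exercise with a
# doc2vec-style model, then rank segments by cosine similarity to the exercise.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

segments = [["use", "polar", "coordinates", "for", "the", "integral"],
            ["what", "time", "is", "lunch"]]                  # toy, tokenized
exercise = ["integrate", "over", "the", "circle", "using", "polar", "coordinates"]

docs = [TaggedDocument(words, [i]) for i, words in enumerate(segments + [exercise])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_vec = model.infer_vector(exercise)          # document vector of the exercise
for words in segments:
    seg_vec = model.infer_vector(words)         # segment vector
    print(round(cosine(seg_vec, doc_vec), 3), " ".join(words))
```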
0:12:33 Here's an example: a high-similarity segment is on top versus a low-similarity segment at the bottom. You can see at first glance that the top two segments are indeed about math, and the third one asks "which chapter", so it's probably referring to the textbook, whereas the low-similarity segments are general conversation. So that has to do with the relevance of the content. We also measure the level of activity in information exchange, and for that we count the number of segments in the discussion dialogue and also the number of words in the discussion dialogue, where we add both Chinese characters and English words together. So for each weekly group discussion set, we have four features: two pertaining to relevance to the course topic, and two for information exchange measures.
0:13:39 The next thing we do is to look at the academic performance. The learning outcome that corresponds to each week's course topic is measured through the relevant question components present in the way we set the midterm paper and the final exam paper. Basically, we have a score: the final exam counts sixty percent and the midterm counts forty percent, but we have set the questions so that the course content for each week is present in different components of the midterm and final papers respectively. Therefore, we are able to look at a group's overall performance according to the course content for a particular week.
0:14:29 This is the way we did the analysis, and here's a quick summary. Basically, we looked at the high-performing groups versus the low-performing groups, and it's no surprise that the high-performing groups generally have a much higher average proportion of matching math terms in their discussions, and also higher content similarity; in other words, their discussion content is much more relevant. In terms of communicative exchange activity, the high-performing groups have many more total segments exchanged, and more words. Note that for the first three measures, matching math terms, content similarity, and number of segments exchanged, we did a statistical significance test and they are significant; the fourth one is at 0.08, but I think it's still a relevant and important feature.
0:15:32 What I have presented to you is the first step, where we collected the data and investigated the discussion dialogues in the flipped classroom setting in relation to learning outcomes. In terms of further investigation, what our team would like to understand is how the student discussions can become an effective platform for peer-to-peer learning: how the dialogue facilitates learning, and then enhances learning. Furthermore, if the high-performing teams conduct very effective exchanges in their dialogues, we would like to know whether we can use that information to inform group formation. Right now, the students form a group at the beginning of the semester and they stick with it for the entire semester. We're thinking that if the high-performing groups achieve their results through very effective discussions, then maybe, if we are able to swap the groups around, the benefits of the dialogue exchange to learning can spread; you know, a rising tide raises all boats, so maybe we can enhance learning for the whole class. That's the direction we'd like to take this investigation.
0:16:55 So that was the first section. Now I will move on to the second section, which is on e-commerce. This is actually the JD Dialogue Challenge in the summer of 2018. I had a summer intern that year, an undergraduate student, and I said, well, you may be interested in joining the JD Dialogue Challenge, even though you have no background. Luckily, I also had a part-time postdoctoral fellow, and also a recent PhD graduate from my group who is now working for the startup SpeechX Limited. In particular, I'd like to thank our colleagues at JD AI for running the JD Dialogue Challenge, from which we benefited a lot; our junior undergraduate student especially learned a lot.
0:17:50 The goal of this dialogue challenge is to develop a chatbot for e-commerce customer service using JD's very large dataset. They gave us one million Chinese customer service conversation sessions, which amounts to twenty million conversation utterances, or turns. The data covers ten after-sales topics, and they are unlabeled; each of these topics may have further subtopics. For example, the topic of invoice modification can have the subtopics of changing the name, changing the invoice type, asking about e-invoices, et cetera. The task is the following: we have a context, which consists of the two previous conversation turns, that is, the four utterances from the two previous turns, plus the current query from the user, from the customer, and the task is to generate a response for this context. So it's basically a five-utterance group, and we need to generate a response, and the generated response from the system is evaluated by human experts from customer service.
0:19:07 There are two very well-known approaches: the retrieval-based approach and the generation-based approach, and we take advantage of the training data, with its context-response pairs, in building these. Our retrieval-based approach is very standard: basically it's TF-IDF plus cosine similarity.
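For concreteness, here is a minimal sketch of such a TF-IDF retrieval baseline using scikit-learn; the toy context-response pairs are illustrative assumptions.

```python
# Minimal sketch: index training contexts with TF-IDF, then return the stored
# responses whose contexts are most similar (by cosine) to a new context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_contexts = ["when will my order arrive",
                  "how do i change the name on my invoice"]
train_responses = ["your order ships within 24 hours",
                   "please send us the corrected name"]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(train_contexts)

def retrieve(context, n=1):
    sims = cosine_similarity(vectorizer.transform([context]), index)[0]
    top = sims.argsort()[::-1][:n]               # indices of top-n matches
    return [(train_responses[i], float(sims[i])) for i in top]

print(retrieve("i need to modify the invoice name"))
```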
0:19:26 Our generation-based approach is also a very standard configuration: we segment the Chinese context, the two previous dialogue turns together with the current query, and we also segment the response. We feed in those data and model the statistical relation between the context and the response using a seq2seq model with attention. And those are the training and inference phases.
0:19:58 The system that we eventually submitted is a hybrid model based on a very commonly used rescoring framework. What we did was to generate, using the retrieval-based approach, N response alternatives, where we chose N to be twenty, so that there's enough choice but it also won't take too long. Then we use the generation-based approach to rescore these twenty responses. The nice thing about that is that the generation-based approach will consider the relationship between the given context and the chosen response. We rescore and rerank the candidates, and we check whether the highest-scoring response has exceeded a threshold, which was chosen somewhat arbitrarily. If it exceeds the threshold, then we output that response; otherwise, we take it as a sign that our retrieval-based model does not have enough information to choose the right response, so we just use the seq2seq model to generate an entirely new response.
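Putting the pieces together, here is a minimal sketch of the hybrid rescoring logic as described, reusing the `retrieve` helper sketched above; `seq2seq_score` and `seq2seq_generate` are hypothetical helpers, and the threshold value is illustrative, since the exact value was not stated clearly.

```python
# Minimal sketch of the hybrid pipeline: retrieve N candidates, rescore them
# with the generation model's conditional score, and fall back to pure
# generation when even the best candidate is not convincing enough.
N = 20
THRESHOLD = 0.5  # illustrative; the talk's exact threshold is not reproduced here

def respond(context):
    candidates = retrieve(context, n=N)                        # retrieval step
    scored = [(seq2seq_score(context, response), response)     # P(response | context)
              for response, _ in candidates]
    best_score, best_response = max(scored)
    if best_score >= THRESHOLD:
        return best_response                                   # rescoring is confident
    return seq2seq_generate(context)                           # generation fallback
```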
0:21:17 So that's the system, and we got a Technology Innovation Award for it. It has been a very fruitful experience, especially for my undergraduate student: after this JD Dialogue Challenge, she decided to pursue a PhD, and she's actually starting her first term as a PhD student in our lab now. We also got valuable data resources from industry through this summer project. Moving forward, we'd like to look into flexible use of context information for different kinds of user inputs, ranging from chit-chat to one-shot information-seeking enquiries, follow-up questions, multi-intent input, et cetera. Yesterday I saw a poster with a very comprehensive decomposition of this problem.
0:22:08 That's my second project, and now I'm going to move to the third project, which looks at dialogue in cognitive screening: investigating spoken language markers in neuropsychological dialogues for cognitive screening. This is a recently funded project; it's a very big project, and we have a cross-university team. There's the Chinese University team, and we also have colleagues from HKUST and the Polytechnic University. From Chinese University, not only do we have engineers, we also have linguists, psychologists, neurologists, and geriatricians on our team, so I'm really excited about this team. We have our teaching hospital, which is the Prince of Wales Hospital, and we are also building a new CUHK teaching hospital, which is a private hospital, so I think we're going to be able to recruit many subjects to participate in our study.
0:23:10 This study focuses on neurocognitive disorder, which is another term for dementia. It is well known that the global population is ageing fast, and actually Hong Kong's population is ageing even faster. NCD, neurocognitive disorder, is very prevalent among older adults. It has an insidious onset, it's chronic and progressive, and there's a general, global deterioration in memory, communication, thinking, judgment, and other cognitive functions; it's a most incapacitating disease. NCD manifests itself in communicative impairments, such as uncoordinated articulation, as in dysarthria; the subject may lose capability in language use, as in aphasia; and they may have a reduced vocabulary and grammar, and weakened listening, reading, and writing. Existing detection methods include brain scans, blood tests, and face-to-face neuropsychological (NP) assessments, which include structured, semi-structured, and free-form dialogues. One common form of dialogue is the picture description, where the participant is given a picture and asked to describe it.
0:24:31 Now, my colleagues in the teaching hospital have been recording, or rather, we were allowed to record, their neuropsychological tests, and that provides some initial data for our research. The flow of the conversation includes the MMSE, the Mini-Mental State Examination, together with the Montreal Cognitive Assessment test; it's a combination of both, and there are some overlapping components that are shared. We have about two hundred hours of conversations between the clinicians and the subjects; these are one-on-one neuropsychological tests.
0:25:15 Here's an example. We have normal subjects and also others who were cognitively impaired, and here are some excerpts of the conversations. This one is from a normal subject who was asked about the commonality between a train and a bicycle, and this is the answer. The clinician hints that the size is big, and the subject says yes, the train is long and the bike is smaller, isn't it; then the clinician says, okay, but what's common between them, and the subject says both are used for transport. For the cognitively impaired subject, this next dialogue is more typical; the original dialogue is in Chinese, so we translated it into English for presentation here. We did a very preliminary analysis based on about twenty individuals, gender-balanced.
0:26:13 We looked at the average number of utterances in an NP assessment. You can see that for males, the total number of utterances drops as we move from the normal to the cognitively impaired, and the same trend holds for the females. Then for the gap time, which is sort of the reaction time, there's a general small increase going from the normal to the cognitively impaired; this is for the males, and this one is for the females. Also, the normal subjects tend to speak faster, so they produce a higher average number of characters per minute and words per minute. This is very preliminary data. We are looking at different linguistic features, such as grammatical quality, information density, and fluency, and also acoustic features: in addition to reaction time, the duration of pauses, hesitations, pitch, prosody, et cetera. So we will be looking at a whole spectrum of these features.
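To make these timing measures concrete, here is a minimal sketch computing utterance counts, gap (reaction) times, and speaking rate from timestamped segments; the segment format is an illustrative assumption.

```python
# Minimal sketch of the timing features mentioned: utterance count, gap time,
# and speaking rate, from (speaker, start, end, transcript) tuples.
segments = [  # illustrative data
    ("clinician", 0.0, 2.1, "what is common between a train and a bicycle"),
    ("subject",   3.4, 6.0, "both are used for transport"),
]

subject_turns = [s for s in segments if s[0] == "subject"]
num_utterances = len(subject_turns)

# Gap time: silence between the previous turn's end and the subject's reply.
gaps = [b[1] - a[2] for a, b in zip(segments, segments[1:]) if b[0] == "subject"]

# Speaking rate within the subject's own turns; characters per minute would
# be the analogue for Cantonese transcripts.
spoken_sec = sum(end - start for _, start, end, _ in subject_turns)
words = sum(len(text.split()) for _, _, _, text in subject_turns)

print(num_utterances, gaps, round(60.0 * words / spoken_sec, 1))
```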
0:27:19 Also, my student has developed an initial prototype which illustrates how interactive screening may be done, and here's a demonstration video to show you. It starts with a word recall exercise.
0:27:41 [System] Please listen carefully. I am going to state three words that I want you to try to remember and repeat back to me. Please repeat the following three words to me: season, cannon, radar. Say your response at the beep.
0:28:05 [Subject] Well... season... it should... river.
0:28:18 [System] Good. Please remember the three words that were presented, and recall them later on. Now, please try your best to describe what is happening in the picture above. Tap on the button below to begin or complete your response.
0:28:42 [Subject] I see... a family of four... sitting in the living room. There is a... mother... father... a girl... and the boy. They are... I can't really see much clearly. I don't know. That's it.
0:29:14 [System] Good. Tap on the Done button if you have completed the task; tap on the Try Again button to redo the picture description task. Now, please recall the three words I asked you to remember earlier, and say those three words to me. Say your response at the beep.
0:29:47 [Subject] Season... rumour... I don't remember the last one... summer?
0:30:07 So basically the system tries to store and chart the results of every one of the several tasks, and there are score charts relating to, for example, how many correct responses were given, the response time, the length of the gap time, et cetera. I need to state clearly that the speech here is based on real data, but the real data is in Chinese, so my student translated it into English and tried to mimic the pauses. Also, the real subject liked to say "I think that's it", sort of talking to himself, so my student mimicked that as well. This is for illustration only; most of our data will be in Chinese: Cantonese, or maybe Mandarin.
0:31:04 As a quick summary: spoken dialogue offers easy accessibility and high feature resolution, even millisecond resolution in terms of reaction time, pause time, et cetera, for cognitive assessment. We want to develop a variety of speech, language, and dialogue processing technologies to support holistic assessment of various cognitive functions and domains by combining dialogue interaction with other interactions, and we also want to further develop this platform as a supportive tool for cognitive screening.
0:31:40 That's the end of the third project, and now I'm going to move away from the application-oriented facets to the more research-oriented facets. The fourth project is on extracting semantic patterns from user inputs in dialogues, and we've been developing a convex polytopic model for that. This is work done by a postdoctoral fellow, myself, and a colleague. The study uses ATIS-2 and ATIS-3, together about five thousand utterances, to support our investigation. The convex polytopic model is really an unsupervised approach that is applicable to short text, and it can help us automatically identify semantic patterns from a dialogue corpus via a geometric technique.
0:32:32 As shown here with the well-known ATIS examples, we can see the semantic pattern of "show me flights", which is an intent; another semantic pattern of going from an origin to a destination; and another semantic pattern of traveling on a certain day. We begin with a space of M dimensions, where M is the vocabulary size; each utterance forms a point in this space, and the coordinates of the point are given by the sum-normalized word counts of that utterance. There are two steps in our approach. The first is to embed the utterances into a low-dimensional affine subspace using principal component analysis; this is a very common technique, and the principal components tend to capture features that can optimally distinguish points by their semantic differences. Then we move to the second step, where we generate a compact convex polytope to enclose all the embedded utterance points, and this is done using the Quickhull algorithm.
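Here is a minimal sketch of these two steps, using scikit-learn's PCA and scipy's ConvexHull, which wraps the Qhull implementation of Quickhull; the toy utterances are illustrative assumptions.

```python
# Minimal sketch: sum-normalized bag-of-words points, PCA embedding into a
# low-dimensional affine subspace, then a bounding convex polytope via Quickhull.
from scipy.spatial import ConvexHull
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

utterances = [
    "show me flights from boston to denver",
    "show me flights from denver to atlanta",
    "what is the abbreviation ap",
    "flights to baltimore",
    "list flights from atlanta to boston on wednesday",
]

counts = CountVectorizer().fit_transform(utterances).toarray().astype(float)
points = counts / counts.sum(axis=1, keepdims=True)   # sum-normalized coordinates

embedded = PCA(n_components=2).fit_transform(points)  # step 1: affine subspace
hull = ConvexHull(embedded)                           # step 2: Quickhull polytope

for v in hull.vertices:  # each vertex is an utterance point, hence interpretable
    print(f"V{v}: {utterances[v]}")
```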
0:33:46 As an illustration, this is what we call the normal-type convex polytope. All of these points are utterance points; they illustrate the utterances in the corpus residing in that space, the affine subspace. For the compact convex polytope, each vertex of the polytope is actually a point from the collection of utterance points, so each vertex also corresponds to an utterance. We can then connect the linguistic aspects of the utterances within the corpus to the geometric aspects of the convex polytope. You can think of it this way: the utterances in the dialogue corpus become embedded points in the affine subspace; the scope of the corpus is now encompassed by the compact convex polytope that is delineated by the boundaries connecting the vertices; and the semantic patterns of the language of the corpus are now represented as the vertices of the compact convex polytope. Moreover, because the vertices represent extreme points of the polytope, each utterance point can also be formed by a linear combination of the polytope's vertices.
0:35:20 Let's look at the ATIS corpora. As you know, in ATIS we have these intents, and we color-code them here. We plot the utterances in the ATIS training corpora in that space; we chose a two-dimensional space so that you can see all the plots on a plane. Then we ran the Quickhull algorithm, and it came up with this polytope. This is the most compact one, and you can see that it has twelve vertices: V1, V2, all the way to V12. Each vertex actually also corresponds to an utterance. If you look at vertices one to nine, they are all dark blue in color, and in fact they all correspond to an utterance with the intent class of flight. Vertex ten is light blue, and actually corresponds to the intent of abbreviation. Vertex eleven is also dark blue, as is vertex twelve. So this is an illustration of the convex polytope.
0:36:39 We can then look at each vertex. Vertices one to nine all correspond to utterances, and you can see that V1 through V9 over here are very close together; essentially, they are all capturing the semantic pattern of going from some origin to some destination, and these are all utterances with the labeled intent of flight. Vertex twelve is very close by, and for vertex twelve itself, the constituent utterance is "flights to Baltimore", so it has just the destination. We also want to look at vertices ten and eleven, so let's go to the next page. For vertex ten, here in green are its nearest-neighbor utterances, and if you look at the constituent utterances, you can see that they are all questions of the form "what is" some abbreviation. Then for vertex eleven, the nearest neighbors basically all capture "show me", "show me some flights". So you can see that the vertices, generally together with their nearest neighbors, capture some core semantic patterns.
0:38:02 Now, for the convex polytope we don't have any control over the number of vertices; it's usually unknown until you actually run the algorithm. If we want to control the number of vertices, we can use a simplex instead. Here, again, we want to plot in two dimensions, so we chose a simplex with three vertices, and to constrain it to three vertices we can use a sequential quadratic programming algorithm to come up with the minimum-volume simplex.
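Here is a minimal sketch of one way to fit such a minimum-volume enclosing simplex with scipy's SLSQP solver (a sequential-quadratic-programming-style method) for 2-D embedded points, as in the plotted example; this is an illustrative stand-in, not necessarily the exact formulation used in the work.

```python
# Minimal sketch: minimize the simplex volume subject to all points having
# non-negative barycentric coordinates (i.e., lying inside the simplex).
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import ConvexHull

def min_volume_simplex(X):
    n, d = X.shape            # here d = 2, so the simplex is a triangle
    k = d + 1
    # Initialize from k spread-out hull vertices, inflated about the centroid.
    hull = ConvexHull(X)
    idx = hull.vertices[np.linspace(0, len(hull.vertices) - 1, k, dtype=int)]
    centroid = X.mean(axis=0)
    v0 = (centroid + 2.0 * (X[idx] - centroid)).ravel()

    def volume(v):            # |det| is proportional to the simplex volume
        V = v.reshape(k, d)
        return abs(np.linalg.det(V[1:] - V[0]))

    def enclosure(v):         # min barycentric coordinate over all points
        V = v.reshape(k, d)
        T = np.vstack([V.T, np.ones(k)])
        B = np.vstack([X.T, np.ones(n)])
        return np.linalg.solve(T, B).min(axis=0)   # must be >= 0 for enclosure

    result = minimize(volume, v0, method="SLSQP",
                      constraints=[{"type": "ineq", "fun": enclosure}])
    return result.x.reshape(k, d)
```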
0:38:38 Just for you to recall, this is the normal-type convex polytope, and you can see it has twelve vertices. Now we want to constrain the number of vertices to three; that is, we want to generate a minimum-volume simplex, and here is the output of the algorithm. We now have the minimum-volume simplex with three vertices. If you look at this minimum-volume simplex, with vertices one, two, and three, and compare it with the previous normal-type convex polytope: vertex one of the simplex corresponds to vertex eleven of the normal-type polytope, and it also happens to coincide with an utterance. If we go to vertex three of the simplex, you can see the light blue dot here, which actually corresponds to vertex ten of the normal-type polytope; so vertex three of the simplex is very close to vertex ten of the normal-type polytope. Now, what about all the vertices from one to nine, and also vertex twelve? These are all grouped over here, and we cover them by extending vertex two. You can see that the minimum-volume simplex still encompasses all the utterances. We are no longer guaranteed that each vertex is itself an utterance point, but we have only three vertices, and the resulting minimum-volume simplex is formed by extrapolating the three lines joining vertices of the previous normal-type bounding convex polytope, including V10, V11, V12, and V8 and V9, into the three sides.
0:40:44 For this minimum-volume simplex, we can look at each vertex further. For example, for the first vertex, you can look at its ten nearest neighbors, and here is the list of the utterances that correspond to each point in the nearest-neighbor group; they all have the pattern of "show me some flights from someplace to someplace", "show me flights", so that is a semantic pattern. Now let's look at vertex two; this is where you can see the patterns of "from an origin to a destination". For every vertex, because it also resides in the M-dimensional space, the coordinates can actually show us the top words, the strongest words that are most representative of the vertex, so you can also see the list of the top ten words from the coordinates of each vertex. Now let's look at V3: the vertex and its nearest neighbors are shown here, and it's mostly about "what is", followed by an abbreviation. So the minimum-volume simplex allows us to pick the number of vertices we want to use, and it also shows some of the semantic patterns that are captured.
0:42:04 We picked three vertices because we wanted to be able to plot the result; in fact, we can pick an arbitrary number of higher dimensions, so we can examine the semantic patterns at a higher dimensionality by analyzing the nearest neighbors and also the top words of the vertices. For example, we ran one experiment with sixteen dimensions, so we end up with seventeen vertices; I list the first ten here, followed by the next seven, seventeen altogether, and here are the top words for each vertex and also the representative nearest neighbors. You can see that, for example, vertex four is capturing the semantic pattern "show me something"; another vertex captures "from someplace to someplace"; vertex eight, "what does some abbreviation mean"; and vertex nine asks about ground transportation. We also have vertices one, two, and five, which are really related to locations, and I think that's perhaps due to data sparsity. Also, vertex three is about "can I get something", "I would like something", and vertex seven is really a bunch of frequently occurring words, I guess. If we look at the next set of vertices: vertex thirteen is about flights from someplace, maybe to someplace as well; fourteen is "what is something"; sixteen is "list all something"; and again, vertices eleven, fifteen, and seventeen are location names. Vertex twelve is an airline name, and one vertex is about either a date or an airline, so I think this is a case where we may have been too aggressive in reducing the subspace dimensions; if we had run this same experiment with more dimensions, hopefully it would separate the date from the airline.
0:44:14 So basically we are exploring this convex polytopic model as a tool for exploratory data analysis. I like its geometric nature because it helps me interpret the semantic patterns, and my hope is to extend this from semantic pattern extraction to tracking dialogue states in the future. So that's section four.
0:44:41 Now, section five, my last section, is on affective design for conversational agents: modeling user emotion changes in dialogues. This is actually the PhD work of a student from Tsinghua University, who also interned in our lab in Hong Kong for a couple of summers; the student's direct supervisor is a professor at Tsinghua University. This work is conducted in the Tsinghua-Chinese University Joint Research Center for Media Sciences, Technologies and Systems, which is in Shenzhen, and it is funded by the National Natural Science Foundation of China and Hong Kong Research Grants Council Joint Research Scheme.
0:45:25 Our long-term goal is to impart affect sensitivity into conversational agents, which is important for user engagement and also for supporting socially intelligent conversations. This work looks at inferring users' emotion changes. The main assumption is that the emotive state change is related to the user's emotive state in the current dialogue turn and also to the corresponding system response. The objective is to infer the user's emotive state and also the emotive state change, which can, in the future, inform the generation of the system response. We use the PAD model, the pleasure-arousal-dominance framework, for describing emotions in a three-dimensional continuous space: pleasure is about positive versus negative emotions, arousal is about mental alertness, and dominance is more about control.
0:46:28 This is a real dialogue, originally in Chinese, which again I have translated into English here for presentation. It is a dialogue between a chatbot and the user, and we have annotated the PAD values for each dialogue turn. You can see, for example, in dialogue turn two, the user says "she broke up with me", and the response from the system is "let it go, you deserve a better one", and you see that from this dialogue turn, the values of P, A, and D all increase. Then, for example, in dialogue turn eight, the system's response seems to amuse the user, and it also softens the value of the dominance. So these are the values that we work with in the PAD space, and this is our approach towards inferring emotive state change: on the left is the speech input, and on the right is the output of emotion recognition and the prediction of emotive state change.
0:47:35 We start by integrating the acoustic and lexical features from the speech input. This is basically a multimodal fusion problem, and it is achieved by concatenating the features and then applying a multitask-learning convolutional fusion autoencoder, which goes through different layers of convolution and max pooling. Then we also capture the system response as a whole utterance; this is because the holistic message is received by the user, and the entire message plays a role in influencing the user's emotions. The system response encoding uses a long short-term memory recurrent autoencoder, and it is trained to map the system response into a sentence-level vector representation. Next, the user's input and the system's response are further combined using convolutional fusion, and the framework then performs emotion recognition using stacked hidden layers; the results are further used for inferring the emotive state change. For this we use a multitask-learning structured output layer, so that the dependency between the emotive state change and the emotion recognition output is captured. In other words, the emotive state change is conditioned on the recognized emotive state of the current query.
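As a concrete illustration of the structured output idea, here is a minimal PyTorch sketch in which the state-change head is conditioned on the emotion recognition output; all dimensions and layer choices are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch: shared stacked layers, an emotion head predicting the
# current PAD state, and a change head conditioned on that prediction.
import torch
import torch.nn as nn

class EmotionHeads(nn.Module):
    def __init__(self, fused_dim=256, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(              # stacked hidden layers
            nn.Linear(fused_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.emotion = nn.Linear(hidden, 3)       # current PAD state
        self.change = nn.Linear(hidden + 3, 3)    # delta-PAD, sees the PAD output

    def forward(self, fused):  # fused user-input + system-response features
        h = self.shared(fused)
        pad = self.emotion(h)
        delta = self.change(torch.cat([h, pad], dim=-1))  # structured dependency
        return pad, delta

model = EmotionHeads()
pad, delta = model(torch.randn(4, 256))  # batch of 4 fused feature vectors
loss = (nn.functional.mse_loss(pad, torch.zeros(4, 3))       # multitask loss
        + nn.functional.mse_loss(delta, torch.zeros(4, 3)))  # (dummy targets)
```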
0:49:10 The experimentation is done on IEMOCAP, which is a corpus very widely used in emotion recognition, and also on the Sogou Voice Assistant corpus. The Sogou corpus is a big corpus: it has over four million Putonghua utterances in three domains, transcribed by an ASR engine with a 5.5% word error rate. We actually look at the chat dialogues; there are ninety-eight thousand such conversations, of between four and forty-nine turns, but we used a pre-trained emotion DNN to filter out the neutral conversations, so we ended up with about nine thousand emotive conversations, with over fifty-two thousand utterances, which were selected for labeling, that is, labeling the PAD values. Then we run the emotion recognition and also the emotive state change prediction.
0:50:10 We use a whole suite of evaluation criteria on the predicted emotive states, in PAD values, and also on the emotive state changes, in PAD values: the unweighted accuracy, the mean accuracy over different emotion categories, the mean absolute error, and also the concordance correlation coefficient.
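Of these criteria, the concordance correlation coefficient is perhaps the least familiar; here is a minimal sketch of its standard definition, which rewards both correlation and agreement in mean and scale between predictions and annotations.

```python
# Minimal sketch of the concordance correlation coefficient (CCC).
import numpy as np

def ccc(pred, gold):
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    cov = ((pred - pred.mean()) * (gold - gold.mean())).mean()
    return 2 * cov / (pred.var() + gold.var() + (pred.mean() - gold.mean()) ** 2)

print(ccc([0.1, 0.4, 0.8], [0.2, 0.5, 0.7]))  # near 1.0 means high concordance
```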
0:50:31 This is a benchmark against other recent work using other methods, for IEMOCAP and also for the Sogou datasets. The proposed approach achieves competitive performance in emotion recognition. In emotion change prediction, our proposed approach actually achieves significantly better performance than the other approaches, but there is still room for improvement if you compare with human performance in human annotation. To sum up, this is among the first efforts to analyze user input features, both acoustic and lexical, together with the system response, to understand how the user's emotion changes due to the system response in the dialogue. We have achieved competitive performance in emotive state change prediction, and we believe that this is a very important step towards having socially intelligent virtual assistants, with the incorporation of affect sensitivity for human-computer interaction.
0:51:44 So, my talk was in five chunks, but this is the overall summary. When I look back at all these different projects, it really drives home the message that much can be gleaned from dialogues to understand many important phenomena, including how student group discussions may facilitate learning, how the customer experience can be shaped by chatbot responses, and also the status of an individual's cognitive health. I guess I'm preaching to the choir here, but I truly believe there's tremendous potential; we've only seen the tip of the iceberg, and there are abundant opportunities for a lot of research. So thank you very much.
0:52:38 Thank you very much. Do we have questions?
0:52:47 Thank you very much for the talk. Regarding topic three, cognitive impairment: we are also working on that, but still, heavy cognitive impairment is easy to detect; with just a small conversation we can identify that this person is cognitively impaired. But I think the problem is mild cognitive impairment, MCI, which is very difficult to detect. So I think the final goal of this is maybe how to estimate the degree of cognitive impairment using features. What do you think?
0:53:29 Thank you very much for the question. Indeed, in our study we will be covering the cognitively normal adults, and also what they now call minor NCD; that's the new terminology: minor NCD is the mild neurocognitive disorder, and major NCD is the major neurocognitive disorder. This is what we learned from our colleagues in neurology. For elderly people, we need to be more diligent in engaging them in these cognitive assessments, because they are really exercises, and there are subjective fluctuations going from one exercise to another; therefore, the more frequently you can take the assessment, the better. And the issue is not the exact score; obviously, it's more at the personal level: if there are any sudden changes, perhaps more drastic changes, in the scoring level of an individual, that would be an important sign, so tracking frequently is important. Minor NCD, the milder cognitive impairments, are harder to detect, and you also have to weigh the natural cognitive decline due to ageing against the pathological cognitive decline. So it's a complex problem, but nevertheless, because dementia is such a big problem for the ageing global population, and there is no cure, we just have to work very hard on how to do early detection and intervention. Thank you for the question.
0:55:46 Thank you for this very nice talk; the many topics are really impressive. I was wondering, especially in relation to the classrooms and to the cognitive screening: at the moment, if I understood correctly, you are working on transcriptions, right? Have you made any experiments with ASR, and if so, what was your experience? What's the likelihood of it being sufficiently good?
0:56:12 The classroom is very difficult; that's why we have no choice but to work on transcriptions. But as for the way we have recorded the neuropsychological tests, it's actually between the clinician and the subject; the clinicians, I think, don't want anything intrusive, so we just put a phone there, and we got consent from the subject, of course. Depending on the device, we think some of it is doable, but we will need speaker-adaptive training and noise-robust speech processing; we will need to throw in the kitchen sink to be able to do well.
0:57:12 Thanks for a great talk. On the cognitive assessment, from a discourse structure point of view: I was wondering what sort of processing you plan to do on those descriptions that the subjects provide, apart from, you know, speech processing and lexical cohesion. Any thoughts about discourse coherence, rhetorical relations among the sentences that they provide, and so on?
0:57:42 Thank you for that wonderful question. We must look at that; we haven't looked at it yet, but actually I have heard from our colleagues, the clinicians, that coherence in following the discourse of a dialogue oftentimes shows problems if there is cognitive impairment. So that is definitely one aspect that we must examine, and in fact we would welcome any interested collaborators to look at that together. Thank you for the question.
0:58:20 Thanks for the very interesting talk. Regarding the emotion modeling, the PAD-space modeling: is that just based on speech input, or are you also analyzing things like nonverbal signals, like laughter or sighing, little things like that?
0:58:43 Right now we don't have that; it would be wonderful if we could have those features, but right now it's really the speech input, so acoustic and lexical input, and also the sentence-level representation of the system's response.
0:59:03 Hi, my question is about section five. You did two prediction tasks: emotion recognition and emotive change prediction. Even though these seem similar, I think there is a subtle but important difference between the two. My question is: do you use the same features for both? Do you think there are features that are more important for the emotive state change rather than for emotion recognition? And what differences have you seen between the two?
0:59:36 Good question. So we think that for the current query, based on the current user input, we want to be able to understand the emotion of the user. But if you think about what comes next: depending on how we respond to the user, that is, on the system response, the user's emotion change and the next input may be different. For example, [goes back to the example slide] here, this is a subject talking about a breakup. At first the system tries to comfort the subject, and then at some point the dialogue goes: the user asks, are you real or not, how can a robot know what I like, and the system says, I know what you like. Then the user says something, and at this point of the dialogue the system could respond in various ways; with the response used here, the user seems amused, and then the user says, you must be real. So I think the emotive changes depend on the system response, and we want to model that. The way we've modeled it is through multitask training, where the emotive state change is dependent on the recognized emotion; we want to be able to capture this dependency, and to utilize it as we choose, in the future, how to generate the system response, so that we can hopefully guide the emotion change in the dialogue in the way we want.