0:00:13 Good afternoon, everyone.
0:00:16 I am from the Speech Group at Microsoft Research Asia.
0:00:20 The paper I'm going to present is "Synthesizing Visual Speech Trajectory with Minimum Generation Error".
0:00:28 This is joint work with Yi-Jian Wu from Microsoft, Xiaodan Zhuang from UIUC in the USA, and Frank Soong.
0:00:44 This work is part of the project on creating photo-real avatars at Microsoft.
0:00:51 The goal is to create an avatar that looks just like you.
0:00:58 Avatars can be roughly divided into two categories, depending on how the avatar interacts with the outside world.
0:01:07 The first kind of avatar can be used in mediated human-to-human communication, such as telepresence.
0:01:17 This morning, one of the talks mentioned the Kinect avatar, which is going to be released very soon.
0:01:30 Another kind of avatar can be used in human-computer interaction, for example as an intelligent agent.
0:01:39 For the next-generation avatar, here is a wish list.
0:01:44 We have some common expectations for the next generation of avatars.
0:01:48 First, we want it to be easy to integrate into the existing network.
0:01:55 We also want it to be high-fidelity and more realistic, closer to a human.
0:02:02 The avatar should be personalized to each unique user.
0:02:08 And the avatar should be easy to create automatically.
0:02:15 So this is the motivation for this project, and this paper focuses on photo-realistic lip movement synthesis.
0:02:32 This slide lists some related work in both visual speech synthesis and text-to-speech synthesis.
0:02:44 It is interesting to see the overlap between these two fields.
0:02:51 Generally speaking, many methods used in speech synthesis have been successfully applied to visual speech synthesis,
0:03:05 for example unit selection (concatenation-based speech synthesis), HMM-based speech synthesis, HMM-guided unit selection, et cetera.
0:03:19 Last September we presented a paper called "HMM trajectory-guided sample selection for photo-real talking head" at Interspeech.
0:03:30 Now we want to improve the system by taking advantage of recent progress in speech synthesis.
0:03:42 As a first attempt, we try to improve the statistical modeling of visual speech by applying the minimum generation error algorithm.
0:03:57 Let's do a quick review of the whole system.
0:04:05 Just as a TTS system for speech synthesis starts with a speech database, for visual speech synthesis we start with a video database:
0:04:18 a speaker talking to the camera instead of a microphone, reading some prompted scripts.
0:04:26 What we get is the video clips, that is, the audio-visual database.
0:04:31 We first do head pose normalization, since the speaker will naturally move his or her head during the recording.
0:04:42 After head pose normalization, every frame in the database is normalized to the fully frontal view, and then we can crop the mouth images using a fixed rectangular window.
0:04:58 Once we have all the mouth images, we do principal component analysis to get the visual features.
0:05:06 Then we do the audio-visual HMM training to get the HMMs.
0:05:11 That's the training part.
0:05:15um the input is some phoneme labels
0:05:18plus
0:05:18the L alignment that there's
0:05:20starting time and in time
0:05:22a first to we will use that input a passed the a well trained hmm model two
0:05:28to generate the we'd or trajectory just like a role we do in speech just census as we had to
0:05:34the speech trajectory for speech parameter trajectory
0:05:38and the all speech a trajectory
0:05:40we be been used as a god and to select a let's images from i well that's sample library
0:05:47and a amount those candidate that we have fat find a bass the ones
0:05:51and uh
0:05:52each back to the full had to
0:05:55to render the full face animation
0:05:58 Here is a more illustrative example of this HMM trajectory-guided lips image selection.
0:06:07 You can see that the top row of images is actually predicted by the HMMs: it is the trajectory.
0:06:15 These images are restored from the predicted PCA vectors.
0:06:20 Using this trajectory as the guidance, we select image candidates from the real image library,
0:06:32 and then the smoothest path can be found by running a Viterbi search over those candidates.
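The candidate selection and Viterbi search described above can be sketched as a small dynamic program. This is an illustrative sketch only: the talk does not give the cost functions, so the squared-distance target cost, the squared-distance concatenation cost, and the `concat_weight` parameter are assumptions, as is the name `select_path`.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_path(targets, candidates, concat_weight=1.0):
    """Pick one candidate per frame, minimising target cost plus concatenation cost.

    targets[t]    : predicted feature vector for frame t (the guiding trajectory)
    candidates[t] : feature vectors of the real samples proposed for frame t
    Returns the list of chosen candidate indices, one per frame.
    """
    n_frames = len(targets)
    # cost[t][k]: cheapest accumulated cost of a path ending in candidate k at frame t
    cost = [[sq_dist(c, targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, n_frames):
        row_cost, row_back = [], []
        for c in candidates[t]:
            best_prev, best_val = 0, float("inf")
            for j, p in enumerate(candidates[t - 1]):
                val = cost[t - 1][j] + concat_weight * sq_dist(p, c)
                if val < best_val:
                    best_prev, best_val = j, val
            row_cost.append(best_val + sq_dist(c, targets[t]))
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [k]
    for t in range(n_frames - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path
```

Raising `concat_weight` favours smoother transitions between consecutive samples, while lowering it makes the path follow the predicted trajectory more closely.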
0:06:46 Okay. As we can see, for either the HMM-based parametric method or this HMM-guided hybrid approach, the statistical model is very important,
0:07:03 because the visual trajectory to a large extent determines how the lips will be rendered.
0:07:11 So that part is very important.
0:07:15 In our previous work we used maximum likelihood estimation for the HMM parameters, or in short, ML-based training.
0:07:30 One notable observation is that the mouth movement is over-smoothed:
0:07:36 its dynamic range is compressed to be much smaller than the natural dynamic range.
0:07:45 This observation is actually quite similar to what has been observed in HMM-based TTS.
0:07:55 So, to improve the model, we propose to use a minimum generation error (MGE) approach for the visual HMM parameter estimation.
0:08:15 In minimum generation error training, the first important thing is to define what the error is.
0:08:26 Here we define the visual generation error for each training sentence as the Euclidean distance between the generated and the original PCA vector trajectories.
0:08:39 For the whole training set, it is the average of the errors over all training sentences.
0:08:47 The objective of the MGE criterion is to optimize the model parameters so that the total generation error is minimized.
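The error definition above can be written down in a few lines. This is a sketch under assumptions: it takes the Euclidean distance over the whole stacked trajectory, whereas the paper may use the squared distance or a per-frame normalisation, and the function names are invented for the example.

```python
import math

def sentence_error(generated, original):
    """Euclidean distance between two trajectories of PCA vectors.

    Both arguments are lists of per-frame vectors with the same number of frames.
    """
    if len(generated) != len(original):
        raise ValueError("trajectories must have the same number of frames")
    total = 0.0
    for g, o in zip(generated, original):
        total += sum((a - b) ** 2 for a, b in zip(g, o))
    return math.sqrt(total)

def total_generation_error(pairs):
    """Average sentence error over (generated, original) trajectory pairs."""
    return sum(sentence_error(g, o) for g, o in pairs) / len(pairs)
```

Training then amounts to choosing the model parameters that make `total_generation_error` as small as possible on the training set.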
0:08:59 We note that the direct solution to that equation is mathematically intractable, so here we adopt a probabilistic descent method to re-estimate the visual HMM parameters.
0:09:15 The formulas for updating the means and the variances can be found in the paper.
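To give a feel for what a probabilistic descent update looks like, here is a deliberately simplified toy. It is not the update rule from the paper: it assumes a single state whose generated trajectory is just its mean vector repeated at every frame (no dynamic features, no variances), and it uses the squared error so the gradient is easy to write. The names `descent_update` and `train`, and the step-size schedule, are assumptions.

```python
def descent_update(mean, sentence, step):
    """One sample-by-sample update of the mean for a single training sentence.

    With error = sum_t ||frame_t - mean||^2, the gradient with respect to the
    mean is -2 * sum_t (frame_t - mean); the mean moves against that gradient.
    """
    dim = len(mean)
    grad = [-2.0 * sum(frame[j] - mean[j] for frame in sentence) for j in range(dim)]
    return [mean[j] - step * grad[j] for j in range(dim)]

def train(mean, sentences, epochs=50, step0=0.1):
    """Visit the sentences one at a time, shrinking the step size as 1/n."""
    n = 0
    for _ in range(epochs):
        for sentence in sentences:
            n += 1
            mean = descent_update(mean, sentence, step0 / n)
    return mean
```

The point of the sketch is the shape of the procedure: parameters are nudged after each training sentence in the direction that lowers that sentence's generation error, with a decreasing step size.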
0:09:25 We incorporated the MGE-based training into the whole system.
0:09:34 We want to jointly refine all the visual HMMs; here is the procedure.
0:09:41 In the first step, we initialize the models and also the state alignment using the traditional baseline maximum likelihood training.
0:09:54 In the second step, we refine the state alignment in a heuristic manner: we try to shift each boundary to the left and to the right, and compare the total generation error before and after the shift.
0:10:14 This is mainly to find the optimal state boundaries under the MGE criterion.
0:10:24 In the third step, given the updated state alignment, we refine the visual HMM parameters using the probabilistic descent algorithm.
0:10:46 Then we go back to steps two and three, until there is no further improvement in the total generation error.
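The alternating procedure above can be illustrated with a toy model. The sketch assumes a one-dimensional feature, a piecewise-constant generated trajectory (each state simply emits its mean), and squared error; the parameter step uses a per-segment average as a stand-in for the probabilistic descent update. All function names are hypothetical.

```python
def generate(means, boundaries, n_frames):
    """Piecewise-constant trajectory: state i covers frames [edges[i], edges[i+1])."""
    edges = [0] + list(boundaries) + [n_frames]
    traj = []
    for i, mean in enumerate(means):
        traj.extend([mean] * (edges[i + 1] - edges[i]))
    return traj

def error(means, boundaries, observed):
    """Squared generation error of the toy model against the observed trajectory."""
    traj = generate(means, boundaries, len(observed))
    return sum((g - o) ** 2 for g, o in zip(traj, observed))

def refine_boundaries(means, boundaries, observed):
    """Shift each boundary one frame left or right; keep shifts that lower the error."""
    boundaries = list(boundaries)
    best = error(means, boundaries, observed)
    improved = True
    while improved:
        improved = False
        for i in range(len(boundaries)):
            for shift in (-1, 1):
                trial = list(boundaries)
                trial[i] += shift
                lo = trial[i - 1] if i > 0 else 0
                hi = trial[i + 1] if i + 1 < len(trial) else len(observed)
                if not (lo < trial[i] < hi):
                    continue  # every state must keep at least one frame
                e = error(means, trial, observed)
                if e < best:
                    boundaries, best, improved = trial, e, True
    return boundaries, best

def reestimate_means(boundaries, observed):
    """Stand-in parameter update: average of the frames assigned to each state."""
    edges = [0] + list(boundaries) + [len(observed)]
    return [sum(observed[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

def train(means, boundaries, observed, tol=1e-9):
    """Alternate boundary refinement and parameter update until the error stops falling."""
    prev = error(means, boundaries, observed)
    while True:
        boundaries, _ = refine_boundaries(means, boundaries, observed)
        means = reestimate_means(boundaries, observed)
        cur = error(means, boundaries, observed)
        if prev - cur <= tol:
            return means, boundaries, cur
        prev = cur
```

Both steps can only keep or lower the error in this toy setting, so the loop stops as soon as a full pass brings no further reduction.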
0:10:58 Here are the experiments to evaluate the MGE algorithm.
0:11:03 The audio-visual database we used is the LIPS Challenge 2008 and 2009 database.
0:11:13 It includes fewer than three hundred video sentences, with synchronized audio and video tracks, spoken by a single native female speaker in neutral emotion.
0:11:28 The experiment mainly compares two approaches: the baseline maximum-likelihood-based method and the proposed MGE-based method.
0:11:40 Both approaches are also compared with the ground truth, the original trajectories spoken by the real person.
0:11:49 In the objective evaluation, since the database is very small, we used twenty-fold cross-validation for the open test.
0:12:05object to the measure we used a its mean square error roll
0:12:08uh average of cross correlation and or so we
0:12:12a matter the global variance
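The three objective measures can be sketched for a single PCA dimension as below. The exact definitions used in the paper are not given in the talk, so treating the correlation as a Pearson coefficient per dimension and the global variance as the variance of one trajectory over time are assumptions.

```python
import math

def mean_square_error(a, b):
    """Mean squared difference between two equal-length 1-D trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Pearson correlation of two equal-length 1-D trajectories."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0  # a flat trajectory has no defined correlation
    return cov / math.sqrt(va * vb)

def global_variance(a):
    """Variance of one trajectory over time."""
    m = sum(a) / len(a)
    return sum((x - m) ** 2 for x in a) / len(a)

# Toy illustration: a trajectory with the right shape but a compressed range
truth = [0.0, 1.0, 2.0, 3.0]
smooth = [1.0, 1.5, 2.0, 2.5]
```

In the toy data, `smooth` correlates perfectly with `truth` yet has only a quarter of its variance, which is why the global variance is reported alongside the other two measures.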
0:12:15 We also conducted a subjective evaluation, scoring the MOS in terms of audio-visual consistency.
0:12:25 Six subjects took part in this evaluation.
0:12:30 This figure shows what the trajectories look like.
0:12:41 In this figure, the gray line is the ground truth, the red line is the ML-based approach, and the blue line is the proposed MGE-based method.
0:12:59 I highlighted the peaks and valleys. You can see that, especially at those critical parts, the proposed MGE method generates a trajectory closer to the ground-truth trajectory that the real human produced.
0:13:24 This is the evaluation of the mean square error.
0:13:29 In this figure, the first bars on the left are the MSE summed over all the PCA dimensions, and the rest of the bars are for the top four components.
0:13:55 There is roughly a five percent error reduction with the newly proposed method.
0:14:06 Later, after we submitted this paper, we tested on different corpora; the improvement is quite stable, about five to seven percent across different databases.
0:14:23 This is about the cross-correlation.
0:14:29 Especially for the first component, you can see a big improvement in correlation, which is very important because the first PCA component is closely related to the mouth opening and closing.
0:14:51 This is the result for the global variance.
0:14:57 The proposed MGE method can recover a lot of the compressed variance.
0:15:08 This is the subjective evaluation.
0:15:14 We only used the lower face for this subjective test, because we want people to focus only on the lips region.
0:15:26 We generated twelve test videos for each approach, and the participants gave a MOS score for each video.
0:15:42 This one, on the left, is the original video.
0:15:49 [Video samples play]
0:16:13 Okay, here I want to show a demo. Actually, this is an online product.
0:16:24 It is a vertical search in Bing: in Bing search we have the Bing Dictionary, an online dictionary, and we put the talking head on that site as a virtual English teacher.
0:16:38 It will help English learners learn how to pronounce each word.
0:16:46 I can play the video.
0:16:50 So we visit the Bing Dictionary.
0:16:54 [Demo video plays]
0:17:12 When you look up a word, all you need to do is find this TV icon and click it; then the talking head will pop up.
0:17:27 [Demo video plays]
0:17:38 Okay, here is my conclusion.
0:17:46 We applied the minimum generation error approach to visual speech synthesis.
0:17:52 In the objective evaluation, compared with the baseline maximum-likelihood-based approach, we get a consistent improvement: a reduction in mean square error, an increase in correlation, and recovery of the global variance.
0:18:10 In the subjective evaluation, we found that it can increase the dynamic range of the mouth movement and also make the talking head look more like a real human.
0:18:21 Thank you.
0:18:28 Any questions?
0:18:39 Thank you for the talk. Did you apply PCA to some extracted features, or to the mouth region itself?
0:18:54 Actually, after head pose normalization, you can imagine that all the face images are fully frontal, and then we just use a fixed rectangular window to crop the mouth region.
0:19:11 So the PCA is actually done on the mouth images: first you crop the mouth images and then you do PCA.
0:19:22 Yes.
0:19:24 For these mouth images, we arrange all the pixels as a sample vector, one sample vector for each frame, and then you can do PCA as on any other high-dimensional vector.
0:19:40you know the the we shouldn't be two
0:19:43you just one go to my mind
0:19:46you you do you use
0:19:49and in a the with
0:19:53a each
0:19:54we can uh
0:19:57so it really true
0:20:00you
0:20:01i
0:20:04with that you i think
0:20:06i agree
0:20:08question
0:20:15or question
0:20:16hmmm
0:20:16i
0:20:17to range
0:20:20just
0:20:22and look still
0:20:24 No, we didn't try two streams.
0:20:28 Yes, we can try.
0:20:37 Any other questions?
0:21:29 Yes, you mean the TTS part?
0:21:34 The TTS voice you heard is actually an American lady's TTS voice.
0:21:42 I think it was a good try, because at first we imagined that there might be some mismatch when we use an American lady's TTS voice with this lady's talking head.
0:22:00 But after we did it and showed it, I think it's okay; it is acceptable.
0:22:16a it doesn't sound like best
0:22:20right
0:22:21yeah
0:22:24yeah it may be K common up about that so that
0:22:32okay