0:00:13 | Good afternoon, everyone. |
---|
0:00:15 | and |
---|
0:00:16 | I'm Lijuan Wang from the speech group at Microsoft Research Asia. |
---|
0:00:20 | The paper I'm going to present is "Synthesizing Visual Speech Trajectory |
---|
0:00:26 | with Minimum Generation Error". |
---|
0:00:28 | This is joint work with |
---|
0:00:31 | Yi-Jian Wu |
---|
0:00:33 | from Microsoft, |
---|
0:00:34 | and Xiaodan Zhuang from UIUC in the USA, and also Frank Soong, |
---|
0:00:40 | my for some |
---|
0:00:44 | so |
---|
0:00:45 | this work is |
---|
0:00:46 | part of the project |
---|
0:00:48 | of creating photo-real avatars at Microsoft. |
---|
0:00:51 | The goal is to create an avatar that |
---|
0:00:55 | looks just like you. |
---|
0:00:58 | Avatars can be roughly divided into two categories, depending on how the avatar |
---|
0:01:04 | interacts with the outside world. |
---|
0:01:07 | The first kind of avatar |
---|
0:01:10 | can be used in mediated human-to-human communication, |
---|
0:01:14 | such as in telepresence. |
---|
0:01:17 | And this morning, |
---|
0:01:19 | uh |
---|
0:01:20 | oh up to field channels |
---|
0:01:22 | uh |
---|
0:01:23 | in the talk, the talking avatar was mentioned; actually it is going to be released very soon. |
---|
0:01:30 | Another kind of avatar can be used in human-computer interaction, |
---|
0:01:35 | for example, as an intelligent agent. |
---|
0:01:39 | So for the next-generation avatar, |
---|
0:01:42 | here's |
---|
0:01:43 | a wish list. |
---|
0:01:44 | We have some common expectations for the next generation of avatars. |
---|
0:01:48 | First, we want it |
---|
0:01:50 | uh |
---|
0:01:50 | to be easily integrated into the |
---|
0:01:53 | things that take a word |
---|
0:01:55 | And we also want it to be high-fidelity and more realistic, closer to a human. |
---|
0:02:02 | And the avatar |
---|
0:02:04 | should be |
---|
0:02:05 | personalized to |
---|
0:02:07 | each unique user, |
---|
0:02:08 | and |
---|
0:02:10 | the avatar |
---|
0:02:11 | can be easily and automatically created. |
---|
0:02:15 | So |
---|
0:02:17 | this is the motivation for this project, |
---|
0:02:22 | and this paper focuses on |
---|
0:02:26 | photo-realistic lips movement synthesis. |
---|
0:02:32 | So this slide lists |
---|
0:02:35 | some |
---|
0:02:36 | related work |
---|
0:02:37 | in both |
---|
0:02:39 | visual speech synthesis |
---|
0:02:41 | and text-to-speech synthesis, |
---|
0:02:44 | and it is |
---|
0:02:45 | quite |
---|
0:02:46 | interesting to see the overlap between these two fields. |
---|
0:02:51 | So, |
---|
0:02:53 | broadly speaking, many |
---|
0:02:56 | methods |
---|
0:02:57 | used in speech synthesis |
---|
0:02:59 | i |
---|
0:03:00 | have been successfully applied to the visual speech synthesis field, |
---|
0:03:05 | for example, the |
---|
0:03:07 | unit-selection, |
---|
0:03:08 | concatenation-based speech synthesis method, or HMM-based speech synthesis, |
---|
0:03:14 | or the HMM-guided unit-selection method, |
---|
0:03:17 | et cetera. |
---|
0:03:19 | So, |
---|
0:03:21 | last September, |
---|
0:03:22 | we presented a paper |
---|
0:03:24 | on HMM trajectory-guided sample selection for photo-real talking heads at Interspeech. |
---|
0:03:30 | So now we want to try, |
---|
0:03:33 | or rather we want, to improve the system |
---|
0:03:36 | by taking |
---|
0:03:37 | advantage of the |
---|
0:03:39 | recent progress in speech synthesis. |
---|
0:03:42 | So in this first attempt, |
---|
0:03:44 | we are trying to improve the visual speech |
---|
0:03:47 | statistical modeling by |
---|
0:03:50 | applying the minimum generation error algorithm. |
---|
0:03:57 | so let's uh |
---|
0:04:00 | uh |
---|
0:04:01 | that |
---|
0:04:01 | do a quick review of the whole system. |
---|
0:04:05 | So, just as |
---|
0:04:06 | when we do |
---|
0:04:08 | speech synthesis, that is, a TTS system, where we first start with a speech database, |
---|
0:04:13 | for visual speech synthesis we start with a |
---|
0:04:16 | video database. |
---|
0:04:18 | So we have a speaker |
---|
0:04:19 | um |
---|
0:04:20 | talking to the camera instead of a microphone, |
---|
0:04:24 | reading some prompted scripts. |
---|
0:04:26 | What we get are the video clips, that is, |
---|
0:04:29 | the audio-visual database. |
---|
0:04:31 | We first do head pose normalization, |
---|
0:04:34 | since the speaker will |
---|
0:04:36 | naturally change or |
---|
0:04:39 | move his head during the recording. |
---|
0:04:42 | so |
---|
0:04:43 | after the head pose normalization, |
---|
0:04:46 | every frame in the database is normalized to the fully frontal view, and then we can crop |
---|
0:04:52 | the mouth images |
---|
0:04:54 | using a fixed rectangular window. |
---|
0:04:58 | Once we get all the mouth images, we do principal |
---|
0:05:02 | component analysis |
---|
0:05:03 | to get the visual features, |
---|
0:05:06 | and then we do the audio-visual HMM training to get the HMMs. |
---|
0:05:11 | That's the training part. |
---|
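The feature extraction step just described (crop, flatten, PCA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_pca`, `project`, and `reconstruct`, the image size, and the number of components are all hypothetical, and random data stands in for cropped mouth images.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Fit PCA on cropped mouth images of shape (num_frames, height, width)."""
    # Flatten every frame into one long pixel vector.
    X = frames.reshape(len(frames), -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(frames, mean, basis):
    """Map frames to low-dimensional visual feature vectors."""
    X = frames.reshape(len(frames), -1).astype(np.float64)
    return (X - mean) @ basis.T

def reconstruct(features, mean, basis, shape):
    """Restore approximate images from PCA feature vectors."""
    return (features @ basis + mean).reshape((-1,) + shape)

# Stand-in data: 50 random "mouth images" of 8 x 12 pixels (assumed sizes).
rng = np.random.default_rng(0)
frames = rng.random((50, 8, 12))
mean, basis = fit_pca(frames, n_components=5)
features = project(frames, mean, basis)        # one 5-dim vector per frame
restored = reconstruct(features, mean, basis, (8, 12))
```

The `reconstruct` helper mirrors the later remark that displayed images are restored from predicted PCA vectors.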
0:05:12 | In the synthesis part, |
---|
0:05:15 | the input is some phoneme labels |
---|
0:05:18 | plus |
---|
0:05:18 | the alignment, that is, the |
---|
0:05:20 | starting time and ending time. |
---|
0:05:22 | First, we pass that input through the well-trained HMM model |
---|
0:05:28 | to generate the visual trajectory, just as in speech synthesis, where we generate |
---|
0:05:34 | the speech parameter trajectory. |
---|
0:05:38 | Then the visual speech trajectory |
---|
0:05:40 | is used as guidance to select lips images from our lips sample library. |
---|
0:05:47 | Among those candidates we find the best ones |
---|
0:05:51 | and |
---|
0:05:52 | stitch them back onto the full head |
---|
0:05:55 | to render the full-face animation. |
---|
0:05:58 | Here is a more illustrative example of this |
---|
0:06:03 | HMM trajectory-guided lips |
---|
0:06:06 | image selection. |
---|
0:06:07 | You can see the top line of images, which is actually predicted by the HMM |
---|
0:06:13 | visual trajectories. |
---|
0:06:15 | Those images are actually restored from the predicted PCA vectors. |
---|
0:06:20 | Using this visual trajectory as the guidance, we |
---|
0:06:24 | select the image candidates |
---|
0:06:28 | from our |
---|
0:06:29 | lips |
---|
0:06:30 | image library, |
---|
0:06:32 | and then the |
---|
0:06:34 | um |
---|
0:06:36 | best smooth path can be found |
---|
0:06:39 | by using a Viterbi search among those candidates. |
---|
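The candidate search described above can be sketched as a small dynamic program. This is an illustrative sketch only: the talk does not spell out the cost functions, so the Euclidean target cost, the Euclidean concatenation cost, the weight `w_concat`, and the function name `viterbi_select` are all assumptions.

```python
import numpy as np

def viterbi_select(targets, candidates, w_concat=1.0):
    """Pick one library sample per frame.

    targets:    (T, D) predicted trajectory (the guidance).
    candidates: list of T arrays, each (K_t, D), candidate sample features.
    Returns a list with the chosen candidate index for every frame.
    """
    # Accumulated cost of the best path ending in each candidate of frame 0.
    cost = [np.linalg.norm(candidates[0] - targets[0], axis=1)]
    back = []
    for t in range(1, len(targets)):
        target_cost = np.linalg.norm(candidates[t] - targets[t], axis=1)
        # Concatenation cost between every previous and current candidate.
        diff = candidates[t - 1][:, None, :] - candidates[t][None, :, :]
        total = cost[-1][:, None] + w_concat * np.linalg.norm(diff, axis=2)
        back.append(total.argmin(axis=0))          # best predecessor per candidate
        cost.append(total.min(axis=0) + target_cost)
    # Trace the cheapest path backwards from the last frame.
    path = [int(cost[-1].argmin())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]

targets = np.array([[0.0], [1.0], [2.0]])
library = [np.array([[0.0], [1.0], [2.0]])] * 3
follow = viterbi_select(targets, library, w_concat=0.0)    # tracks the guidance
smooth = viterbi_select(targets, library, w_concat=100.0)  # favours smoothness
```

A larger `w_concat` trades closeness to the guidance trajectory for smoother transitions between selected images.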
0:06:46 | okay so |
---|
0:06:49 | As we can see, |
---|
0:06:51 | either for the HMM-based parametric method or for this HMM-guided hybrid approach, |
---|
0:06:58 | the |
---|
0:06:59 | statistical model is actually very important, |
---|
0:07:03 | because |
---|
0:07:05 | the visual trajectory, to a large |
---|
0:07:08 | extent, determines how the lips are rendered. |
---|
0:07:11 | So that part is very important. |
---|
0:07:15 | um can really being about pretty |
---|
0:07:18 | In our previous work, we used maximum likelihood |
---|
0:07:23 | estimation for the HMM parameters, |
---|
0:07:26 | or, in short, what we call ML-based training. |
---|
0:07:30 | One notable observation is that the mouth movement is over-smoothed, |
---|
0:07:36 | and |
---|
0:07:37 | it |
---|
0:07:38 | stays within a small band, compressed to much less than the true dynamic range. |
---|
0:07:45 | This observation is actually quite similar to what we observe |
---|
0:07:51 | in HMM-based TTS. |
---|
0:07:55 | so |
---|
0:07:56 | thinking about how |
---|
0:07:57 | to improve the model, we propose to |
---|
0:08:01 | use a minimum generation error approach |
---|
0:08:04 | to improve the |
---|
0:08:07 | visual HMM |
---|
0:08:08 | parameter |
---|
0:08:10 | training, that is, |
---|
0:08:11 | parameter estimation. |
---|
0:08:15 | so |
---|
0:08:17 | In the |
---|
0:08:18 | minimum generation error criterion, |
---|
0:08:20 | the first important thing is that we need to define what the error is. |
---|
0:08:26 | So here we define |
---|
0:08:28 | the visual generation error. |
---|
0:08:30 | For each |
---|
0:08:31 | visual sentence, it is the Euclidean distance between the |
---|
0:08:36 | generated and ground-truth PCA vectors, that is, the PCA trajectories. |
---|
0:08:39 | For the whole training set, it is the average of |
---|
0:08:43 | the errors over all the training sentences. |
---|
0:08:47 | So the objective of the MGE |
---|
0:08:49 | criterion is to |
---|
0:08:51 | optimize the model parameters so that the total generation error is minimized. |
---|
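The error definition above can be written down in a few lines. This is a sketch under assumptions: the talk says "Euclidean distance", so the plain (non-squared) distance over the whole trajectory is used here, whereas the paper may use a squared or per-frame variant; the function names are hypothetical.

```python
import numpy as np

def sentence_error(generated, reference):
    """Euclidean distance between two (frames, dims) PCA trajectories."""
    return float(np.sqrt(np.sum((generated - reference) ** 2)))

def total_generation_error(generated_set, reference_set):
    """Average of the per-sentence errors over the training set."""
    errors = [sentence_error(g, r) for g, r in zip(generated_set, reference_set)]
    return sum(errors) / len(errors)

# Two toy "sentences": the first is off by a 3-4-5 triangle, the second is exact.
reference = [np.array([[3.0, 0.0], [0.0, 4.0]]), np.zeros((3, 2))]
generated = [np.zeros((2, 2)), np.zeros((3, 2))]
first = sentence_error(generated[0], reference[0])
average = total_generation_error(generated, reference)
```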
0:08:59 | We note that |
---|
0:09:01 | the direct solution to that problem is mathematically intractable, so here we adopt a probabilistic |
---|
0:09:09 | descent method to re-estimate |
---|
0:09:12 | the HMM, |
---|
0:09:13 | that is, the visual HMM parameters, |
---|
0:09:15 | and |
---|
0:09:17 | the formulas for updating the means and the variances can be |
---|
0:09:21 | found in the paper. |
---|
0:09:25 | so |
---|
0:09:26 | we |
---|
0:09:27 | we incorporated the MGE-based |
---|
0:09:31 | training into our whole system. |
---|
0:09:34 | We want to jointly refine all the parameters. |
---|
0:09:38 | Here is the whole process. |
---|
0:09:41 | so |
---|
0:09:41 | things |
---|
0:09:42 | In the first step, we initialize the model and also the state alignment |
---|
0:09:48 | using the traditional baseline, |
---|
0:09:51 | maximum likelihood training. |
---|
0:09:54 | And then, |
---|
0:09:56 | here, we refine the state alignment in a heuristic manner: we just |
---|
0:10:03 | try to shift the boundary to the left and to the right |
---|
0:10:08 | and see |
---|
0:10:09 | the |
---|
0:10:10 | total generation error before and after the shift. |
---|
0:10:14 | um |
---|
0:10:14 | That is |
---|
0:10:15 | mainly to find the optimal state boundaries |
---|
0:10:19 | um |
---|
0:10:20 | under the MGE criterion. |
---|
0:10:24 | After the alignment is refined, we re-estimate the model. |
---|
0:10:30 | i'm sorry |
---|
0:10:31 | um |
---|
0:10:32 | That step is: |
---|
0:10:33 | given the optimal state |
---|
0:10:36 | alignment, |
---|
0:10:37 | we refine the visual HMM parameters by using the probabilistic descent algorithm. |
---|
0:10:46 | Then we go back to steps |
---|
0:10:48 | two and three |
---|
0:10:50 | and iterate until there is no further decrease in the total generation error. |
---|
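The loop just described (initialize, shift boundaries, update parameters, repeat until the error stops decreasing) can be illustrated on a deliberately simplified toy model. Everything here is an assumption made for illustration: each state is reduced to a single mean vector, the generated trajectory is piecewise constant (no dynamic features or variances), and a plain step toward the segment mean stands in for the probabilistic descent update whose actual formulas are in the paper.

```python
import numpy as np

def generate(means, bounds, n_frames):
    """Piecewise-constant trajectory: each frame takes the mean of its state."""
    edges = [0, *bounds, n_frames]
    out = np.empty((n_frames, means.shape[1]))
    for s in range(len(means)):
        out[edges[s]:edges[s + 1]] = means[s]
    return out

def error(means, bounds, ref):
    """Squared generation error against the reference trajectory."""
    return float(np.sum((generate(means, bounds, len(ref)) - ref) ** 2))

def refine_bounds(means, bounds, ref):
    """Heuristic: shift each boundary one frame left or right, keep improvements."""
    bounds = list(bounds)
    for i in range(len(bounds)):
        lo = bounds[i - 1] + 1 if i > 0 else 1
        hi = bounds[i + 1] - 1 if i + 1 < len(bounds) else len(ref) - 1
        for cand in (bounds[i] - 1, bounds[i] + 1):
            if lo <= cand <= hi:
                trial = bounds[:i] + [cand] + bounds[i + 1:]
                if error(means, trial, ref) < error(means, bounds, ref):
                    bounds = trial
    return bounds

def update_means(means, bounds, ref, step=0.5):
    """Move each state mean part of the way toward its segment mean."""
    edges = [0, *bounds, len(ref)]
    new = means.copy()
    for s in range(len(means)):
        segment = ref[edges[s]:edges[s + 1]]
        new[s] -= step * (means[s] - segment.mean(axis=0))
    return new

def train(ref, n_states, max_iters=50, tol=1e-9):
    n = len(ref)
    # Step 1: uniform alignment and segment means as a stand-in for ML training.
    bounds = [round(n * s / n_states) for s in range(1, n_states)]
    edges = [0, *bounds, n]
    means = np.stack([ref[edges[s]:edges[s + 1]].mean(axis=0)
                      for s in range(n_states)])
    history = [error(means, bounds, ref)]
    for _ in range(max_iters):
        bounds = refine_bounds(means, bounds, ref)   # step 2: alignment
        means = update_means(means, bounds, ref)     # step 3: parameters
        history.append(error(means, bounds, ref))
        if history[-2] - history[-1] <= tol:         # stop: no further decrease
            break
    return means, bounds, history

# Toy reference with true state changes at frames 3 and 10.
ref = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 5, 5], dtype=float)[:, None]
means, bounds, history = train(ref, n_states=3)
```

Because a boundary shift is accepted only when it lowers the error and the mean update never raises it, the recorded `history` is non-increasing, which is what makes the stopping rule well defined.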
0:10:58 | Here is the experiment to evaluate the algorithm. |
---|
0:11:03 | The audio-visual database we used is the LIPS Challenge 2008 and 2009 |
---|
0:11:10 | challenge database. |
---|
0:11:13 | It includes |
---|
0:11:15 | a little |
---|
0:11:16 | fewer than three hundred video sentences, |
---|
0:11:19 | with synchronized audio and video tracks, spoken by a single native female speaker in a neutral emotion. |
---|
0:11:28 | so |
---|
0:11:28 | The experiment is mainly to compare two approaches: the baseline approach is the |
---|
0:11:35 | maximum likelihood based method, and the other is the proposed MGE-based |
---|
0:11:40 | method. |
---|
0:11:40 | Both approaches are compared with the ground truth, the |
---|
0:11:45 | original trajectories spoken by the real person. |
---|
0:11:49 | In the objective evaluation, since the database is very small, |
---|
0:11:53 | we used a leave-out, |
---|
0:11:55 | or, more precisely, a leave-twenty-out, |
---|
0:11:58 | uh um |
---|
0:12:00 | cross-validation for the open test. |
---|
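A leave-twenty-out split can be sketched as below. The talk does not say how sentences were assigned to folds, so the contiguous partitioning and the function name `leave_n_out_folds` are assumptions for illustration.

```python
def leave_n_out_folds(num_sentences, n=20):
    """Yield (train_ids, held_out_ids), holding out n sentences per fold."""
    ids = list(range(num_sentences))
    for start in range(0, num_sentences, n):
        held_out = ids[start:start + n]
        train = ids[:start] + ids[start + n:]
        yield train, held_out

# With 50 sentences, the last fold holds out the remaining 10.
folds = list(leave_n_out_folds(50, n=20))
fold_sizes = [len(held_out) for _, held_out in folds]
```

Each sentence is held out exactly once, so every test result comes from a model that never saw that sentence in training.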
0:12:03 | And the |
---|
0:12:05 | objective measures we used are mean square error, |
---|
0:12:08 | average cross-correlation, and also we |
---|
0:12:12 | measured the global variance. |
---|
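The three objective measures can be sketched as follows. The exact definitions used in the paper are not given in the talk, so these are common textbook forms chosen as assumptions: MSE over all frames and dimensions, Pearson correlation per PCA dimension averaged across dimensions, and per-dimension variance over time as the global variance.

```python
import numpy as np

def mean_square_error(gen, ref):
    """Mean squared difference over all frames and PCA dimensions."""
    return float(np.mean((gen - ref) ** 2))

def average_cross_correlation(gen, ref):
    """Pearson correlation per PCA dimension, averaged over dimensions."""
    g = gen - gen.mean(axis=0)
    r = ref - ref.mean(axis=0)
    corr = (g * r).sum(axis=0) / np.sqrt((g ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return float(corr.mean())

def global_variance(traj):
    """Variance of each PCA dimension over the whole trajectory."""
    return traj.var(axis=0)

# A generated trajectory with the right shape but half the dynamic range.
ref = np.array([[0.0], [1.0], [2.0], [3.0]])
gen = 0.5 * ref + 1.0
mse = mean_square_error(gen, ref)
corr = average_cross_correlation(gen, ref)
gv_ratio = global_variance(gen) / global_variance(ref)
```

The toy example shows why correlation alone is not enough: the compressed trajectory is perfectly correlated with the reference, yet its global variance is only a quarter of the reference value, which is the over-smoothing effect the talk describes.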
0:12:15 | We also conducted a subjective evaluation |
---|
0:12:18 | to |
---|
0:12:20 | collect MOS scores in terms of audio-visual consistency. |
---|
0:12:25 | Six |
---|
0:12:26 | subjects took part in this evaluation. |
---|
0:12:30 | so |
---|
0:12:31 | uh and |
---|
0:12:32 | this figure |
---|
0:12:35 | shows what the trajectory looks like. |
---|
0:12:40 | so |
---|
0:12:41 | In this figure you can see that |
---|
0:12:43 | the green, |
---|
0:12:44 | the green line, is actually the ground truth, |
---|
0:12:47 | the red one is the |
---|
0:12:50 | ML-based approach, |
---|
0:12:52 | and the blue one is the proposed MGE-based method. |
---|
0:12:59 | You can see that, |
---|
0:13:00 | um |
---|
0:13:02 | I highlighted the |
---|
0:13:04 | peak and valley parts, and you can see that, |
---|
0:13:06 | especially for those critical parts, the peaks and valleys, |
---|
0:13:10 | the proposed MGE method |
---|
0:13:13 | generates a trajectory closer to |
---|
0:13:16 | uh |
---|
0:13:17 | the ground-truth trajectory, which the real human produced. |
---|
0:13:24 | And this is the evaluation of the mean square error. |
---|
0:13:27 | and the |
---|
0:13:29 | In that figure, |
---|
0:13:30 | the first bar |
---|
0:13:32 | on the left, |
---|
0:13:33 | the leftmost one, is |
---|
0:13:34 | um |
---|
0:13:35 | the MSE |
---|
0:13:38 | ah |
---|
0:13:40 | accumulated, that is, |
---|
0:13:41 | summed over all the PCA dimensions, |
---|
0:13:44 | and the |
---|
0:13:46 | the |
---|
0:13:47 | well |
---|
0:13:47 | the rest of the bars in the chart are for the top |
---|
0:13:51 | four components. |
---|
0:13:55 | so |
---|
0:13:55 | um air |
---|
0:13:57 | There is roughly a five percent |
---|
0:14:00 | um |
---|
0:14:01 | error reduction |
---|
0:14:03 | using the newly proposed method. |
---|
0:14:06 | And actually, later, after |
---|
0:14:09 | we submitted this paper, we tested on different corpora, |
---|
0:14:14 | and the improvement is quite stable, about five to seven percent across different databases. |
---|
0:14:23 | And this is about the |
---|
0:14:26 | cross-correlation. |
---|
0:14:28 | uh |
---|
0:14:29 | Especially for the |
---|
0:14:32 | oh |
---|
0:14:33 | first component, you can see that |
---|
0:14:35 | it |
---|
0:14:37 | improves the correlation, which is very important, as the first PCA component |
---|
0:14:43 | uh |
---|
0:14:44 | is closely related to the mouth opening, |
---|
0:14:47 | the mouth opening and closing. |
---|
0:14:51 | Also, this is the result for the global variance. |
---|
0:14:56 | uh |
---|
0:14:57 | The proposed MGE method can recover |
---|
0:15:01 | a lot of the |
---|
0:15:04 | compressed variance. |
---|
0:15:08 | uh |
---|
0:15:09 | This is |
---|
0:15:11 | for the |
---|
0:15:12 | subjective evaluation. |
---|
0:15:14 | We only used the lower face |
---|
0:15:17 | to do this subjective test, |
---|
0:15:19 | because we want people to focus only on the lips region. |
---|
0:15:24 | um |
---|
0:15:26 | We generated |
---|
0:15:28 | twelve test videos for each approach, |
---|
0:15:31 | and uh this is a party to that is depends |
---|
0:15:35 | we |
---|
0:15:35 | asked for a MOS score for each video. |
---|
0:15:40 | the mel |
---|
0:15:41 | and did |
---|
0:15:42 | This one, the |
---|
0:15:45 | one on the |
---|
0:15:46 | left, |
---|
0:15:46 | is the original video. |
---|
0:15:49 | that's can to lists sure |
---|
0:15:57 | that's i two tests show |
---|
0:16:05 | lists |
---|
0:16:06 | oh |
---|
0:16:13 | okay so |
---|
0:16:14 | Here I want to show a demo. Actually this is an |
---|
0:16:19 | online |
---|
0:16:20 | service; this is an online product. |
---|
0:16:22 | It's called, |
---|
0:16:24 | well, it is a vertical search in Bing. |
---|
0:16:28 | In Bing search there is Bing Dictionary, an online dictionary, and we put the |
---|
0:16:34 | talking head there as a virtual English teacher on that |
---|
0:16:37 | site. |
---|
0:16:38 | The video will |
---|
0:16:39 | help |
---|
0:16:40 | English learners learn how to pronounce each word. |
---|
0:16:46 | I can play the video. |
---|
0:16:50 | So this is Bing Dictionary. |
---|
0:17:12 | So when you search for a |
---|
0:17:14 | word, |
---|
0:17:15 | the only thing you need to do is find this TV icon |
---|
0:17:19 | and click it, |
---|
0:17:20 | and then the talking head will pop up. |
---|
0:17:27 | i |
---|
0:17:30 | this |
---|
0:17:38 | okay |
---|
0:17:39 | so |
---|
0:17:40 | here is my conclusion |
---|
0:17:42 | so here |
---|
0:17:43 | uh |
---|
0:17:46 | we applied the minimum generation error approach to visual speech synthesis. |
---|
0:17:51 | um |
---|
0:17:52 | In the objective evaluation, compared with the baseline |
---|
0:17:56 | maximum likelihood based approach, we get a consistent improvement in |
---|
0:18:02 | mean square error reduction, and also an increase in correlation, and we also recovered the |
---|
0:18:08 | global variance. |
---|
0:18:10 | In the subjective evaluation, we found that it can increase the mouth movement dynamic range and also |
---|
0:18:16 | make the talking head |
---|
0:18:17 | look more like a real human. |
---|
0:18:21 | Thank you. |
---|
0:18:28 | Any questions? |
---|
0:18:39 | yeah |
---|
0:18:39 | Thank you for the talk. |
---|
0:18:41 | um |
---|
0:18:42 | option you know soon as to maybe most to occlusion |
---|
0:18:46 | Did you apply PCA to some feature points, |
---|
0:18:51 | or to region features? |
---|
0:18:54 | uh |
---|
0:18:56 | Actually, |
---|
0:18:58 | after head pose normalization, you can imagine all the |
---|
0:19:02 | face images are fully frontal, |
---|
0:19:04 | and then we just use |
---|
0:19:07 | a fixed rectangular window to crop the mouth region. |
---|
0:19:11 | So |
---|
0:19:12 | the PCA actually |
---|
0:19:14 | is done on the mouth |
---|
0:19:16 | images. |
---|
0:19:17 | First you crop the mouth images, and then do PCA. |
---|
0:19:21 | i |
---|
0:19:22 | Yeah. |
---|
0:19:24 | So |
---|
0:19:25 | for these mouth images, all the pixels |
---|
0:19:28 | are arranged as a single vector, |
---|
0:19:31 | so there is one single vector for each frame, and then you can do PCA |
---|
0:19:36 | as for any |
---|
0:19:37 | high-dimensional vector. |
---|
0:19:39 | which are backed |
---|
0:19:40 | you know the the we shouldn't be two |
---|
0:19:43 | you just one go to my mind |
---|
0:19:46 | you you do you use |
---|
0:19:49 | and in a the with |
---|
0:19:53 | a each |
---|
0:19:54 | we can uh |
---|
0:19:57 | so it really true |
---|
0:20:00 | you |
---|
0:20:01 | i |
---|
0:20:04 | with that you i think |
---|
0:20:06 | i agree |
---|
0:20:08 | question |
---|
0:20:15 | or question |
---|
0:20:16 | hmmm |
---|
0:20:16 | i |
---|
0:20:17 | to range |
---|
0:20:20 | just |
---|
0:20:22 | and look still |
---|
0:20:24 | No, |
---|
0:20:25 | we didn't try two streams. |
---|
0:20:28 | Yeah, we can try. |
---|
0:20:36 | i |
---|
0:20:37 | Any questions? |
---|
0:20:39 | oh |
---|
0:20:40 | okay |
---|
0:21:11 | question |
---|
0:21:13 | i |
---|
0:21:29 | uh |
---|
0:21:30 | Yeah, you mean the audio part? |
---|
0:21:34 | the the tiny girl actually |
---|
0:21:36 | the voice you heard is actually another lady's TTS, |
---|
0:21:42 | and uh |
---|
0:21:43 | I think |
---|
0:21:45 | it's uh |
---|
0:21:46 | it was a good try, because at first, in our imagination, |
---|
0:21:51 | we thought that maybe there would be some mismatch when we use another lady's TTS |
---|
0:21:57 | voice together with this lady's |
---|
0:21:59 | talking head. |
---|
0:22:00 | But after we |
---|
0:22:03 | did it and showed it, |
---|
0:22:05 | I think |
---|
0:22:06 | uh |
---|
0:22:07 | it is okay, or it is acceptable. |
---|
0:22:16 | a it doesn't sound like best |
---|
0:22:20 | right |
---|
0:22:21 | yeah |
---|
0:22:24 | yeah it may be K common up about that so that |
---|
0:22:32 | okay |
---|