0:00:15 | Hello everyone. |
0:00:23 | I would like to present my paper, named "Personalized Singing Voice Generation". |
0:00:29 | The outline is as follows. |
0:00:32 | First I will talk about the basic idea of singing voice generation, |
0:00:38 | then the related work and its limitations, |
0:00:43 | after that the proposed model and how it works, |
0:00:47 | and finally the experiments. |
0:00:50 | So singing voice generation is actually a technique to generate singing |
0:00:55 | for a user, from the user's speech and a template singing; |
0:01:05 | that is, a singing voice is generated for the user |
0:01:09 | based on the template singing. |
0:01:13 | This task is actually challenging, |
0:01:16 | because the generated singing |
0:01:19 | should be natural, and it needs to |
0:01:24 | follow the melody and rhythm of the template singing, |
0:01:27 | and it also needs to be similar to the user's voice |
0:01:30 | identity, |
0:01:34 | which is different from the template singer's. |
0:01:38 | So one way |
0:01:40 | is to perform spectral analysis, feature transformation and synthesis |
0:01:45 | for this task. |
0:01:48 | And there are some tasks related to this. |
0:01:52 | The first one is speech-to-singing conversion, which also performs |
0:01:56 | analysis, feature transformation and synthesis |
0:02:00 | as a solution. |
0:02:03 | But the difference here is that the input is speech |
0:02:09 | whose content is actually the same as the singing content. |
0:02:14 | For example, |
0:02:16 | given the speech "my heart will go on", |
0:02:19 | the output will be this person's singing |
0:02:22 | "my heart will go on". |
0:02:24 | And speech-to-singing conversion purely relies on |
0:02:29 | speech-to-singing alignment |
0:02:32 | and parallel speech and singing data, |
0:02:38 | so it does not work well for personalized |
0:02:42 | singing voice generation. |
0:02:47 | Another related task is singing voice conversion, which can also generate |
0:02:54 | singing. |
0:02:58 | This is basically to convert a source singer's voice to a target singer's voice. |
0:03:04 | There are two basic approaches. The first one relies on parallel singing data, |
0:03:11 | which means we have both the source and the target singing, |
0:03:16 | and through spectral analysis, transformation |
0:03:19 | and synthesis |
0:03:21 | we can get the converted singing. |
0:03:24 | The second one does not need parallel singing data, |
0:03:29 | but in this case the target |
0:03:32 | speaker identity needs to be learned by the conversion model, |
0:03:39 | so for |
0:03:40 | a different target speaker the model needs to be trained |
0:03:44 | repeatedly. |
0:03:46 | So the limitations here are that |
0:03:49 | the first approach |
0:03:51 | needs frame alignment, and the second approach |
0:03:54 | needs to be retrained for different target speakers. |
0:03:59 | Our personalized singing voice generation |
0:04:03 | applies an LSTM conversion model |
0:04:11 | and a WaveRNN vocoder for the whole singing voice generation. |
0:04:16 | So the training will be in |
0:04:18 | two steps. |
0:04:19 | The first one is the WaveRNN vocoder training, |
0:04:24 | which is actually to convert the i-vector, |
0:04:30 | together with the F0, aperiodicity (AP) |
0:04:32 | and MCC, into the singing waveform. |
0:04:35 | The second step of training is to |
0:04:39 | build the conversion model, converting the speech |
0:04:41 | MCC, |
0:04:42 | together with the i-vector, F0 and AP, |
0:04:44 | into the singing MCC, |
0:04:50 | with the prosody used |
0:04:53 | as conditioning. |
0:04:57 | So, assuming we have a |
0:05:01 | parallel speech and singing |
0:05:03 | training set, |
0:05:06 | one can perform these two training procedures. |
0:05:10 | The i-vector is the |
0:05:12 | feature |
0:05:13 | representing the speaker identity; |
0:05:16 | the F0 and AP, |
0:05:18 | the prosody features, are taken |
0:05:21 | from the template singing; |
0:05:24 | and the PPG is the speaker-independent linguistic feature. |
0:05:31 | So at runtime we will have |
0:05:35 | the target speech content and a template singing from a professional singer, |
0:05:46 | so hopefully the generated singing can have the professional |
0:05:53 | singing prosody. |
0:05:56 | We will have the F0 |
0:05:59 | and the aperiodicity |
0:06:01 | from the |
0:06:02 | template singing, |
0:06:03 | and again the i-vector and the MCC |
0:06:06 | from the speech, |
0:06:07 | which will all be input to the |
0:06:10 | trained conversion model, |
0:06:12 | and we will get the converted |
0:06:15 | MCC. |
0:06:18 | Then the i-vector, the converted MCC, and the F0 and AP |
0:06:23 | are input to the trained WaveRNN vocoder, |
0:06:27 | and this vocoder will generate the final |
0:06:31 | singing. |
0:06:32 | This singing is hoped to sound like |
0:06:39 | the same speaker as the speech, |
0:06:43 | with the melody of the |
0:06:45 | template singing. |
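To make the data flow of this cascade concrete, here is a minimal sketch in Python. Everything model-specific is a placeholder: the fixed linear maps stand in for the trained LSTM conversion model and WaveRNN vocoder, and the feature dimensions (25-dim MCC, 64-dim i-vector, one output sample per frame) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two trained components of the cascade.
D_MCC, D_IVEC = 25, 64
W_conv = rng.standard_normal((D_MCC + D_IVEC + 2, D_MCC)) * 0.1
W_voc = rng.standard_normal((D_MCC + D_IVEC + 2, 1)) * 0.1

def _stack(mcc, ivector, f0, ap):
    """Concatenate per-frame features with the utterance-level i-vector."""
    T = mcc.shape[0]
    return np.hstack([mcc, np.tile(ivector, (T, 1)), f0[:, None], ap[:, None]])

def conversion_model(speech_mcc, ivector, f0, ap):
    """Training step 2: speech MCC (+ identity and prosody) -> singing MCC."""
    return _stack(speech_mcc, ivector, f0, ap) @ W_conv

def vocoder(mcc, ivector, f0, ap):
    """Training step 1: acoustic features -> waveform (one sample per frame
    here, purely for illustration)."""
    return (_stack(mcc, ivector, f0, ap) @ W_voc).ravel()

def generate_singing(speech_mcc, speech_ivector, template_f0, template_ap):
    """Runtime: prosody comes from the template singing, identity from the speech."""
    converted_mcc = conversion_model(speech_mcc, speech_ivector,
                                     template_f0, template_ap)
    return vocoder(converted_mcc, speech_ivector, template_f0, template_ap)
```

The only point of the sketch is the routing: the i-vector comes from the user's speech, F0 and AP from the template singing, and the converted MCC is the bridge between the two separately trained components.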
0:06:47 | However, there is still |
0:06:49 | one problem |
0:06:50 | with this pipeline, |
0:06:52 | which is the mismatch between training and testing. |
0:06:56 | Because |
0:06:58 | during the WaveRNN vocoder training, |
0:07:00 | the input features to the vocoder |
0:07:03 | are actually |
0:07:05 | natural: |
0:07:06 | the MCC is natural, as it is extracted from actual singing. |
0:07:12 | But at runtime, after conversion, |
0:07:15 | the MCC is converted; it comes from the conversion model, and this converted |
0:07:20 | MCC |
0:07:22 | will be different |
0:07:24 | from the natural MCC. |
0:07:27 | So this |
0:07:30 | causes some distortion |
0:07:32 | in the generated singing. |
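The talk does not name a metric for this feature gap; a common way to quantify the distance between converted and natural MCC is the mel-cepstral distortion (MCD), sketched below under the usual (frames, order) layout, with the two sequences assumed to be already time-aligned.

```python
import numpy as np

def mel_cepstral_distortion(mcc_natural, mcc_converted):
    """Frame-averaged mel-cepstral distortion in dB between natural and
    converted MCC. Both arrays are (frames, order); the 0th (energy)
    coefficient is conventionally excluded."""
    diff = np.asarray(mcc_natural)[:, 1:] - np.asarray(mcc_converted)[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())
```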
0:07:37 | In order to overcome this mismatch issue, |
0:07:41 | we propose an |
0:07:44 | integrated network. |
0:07:46 | The idea of this network is to |
0:07:50 | train the |
0:07:54 | conversion and the vocoding together. |
0:07:57 | So basically |
0:07:59 | the training will be |
0:08:00 | a single step, |
0:08:04 | which is to |
0:08:05 | take |
0:08:07 | the speaker identity |
0:08:09 | from the speech, |
0:08:10 | which is the i-vector, |
0:08:11 | together with the prosody |
0:08:14 | and the linguistic representation of the template singing, |
0:08:20 | to train the WaveRNN |
0:08:23 | to generate the waveform |
0:08:25 | directly. So at runtime |
0:08:28 | we will again use |
0:08:30 | the user's speech |
0:08:32 | to extract the |
0:08:34 | i-vector, |
0:08:36 | and then we will have another person's singing, |
0:08:39 | the template singing, |
0:08:41 | and we will again have the |
0:08:44 | prosody, that is the F0 and AP, |
0:08:47 | and the PPG, from the template singing. |
0:08:49 | Then we input these three features to the trained network, |
0:08:54 | and its output will not be features, |
0:08:56 | but the generated singing directly. |
0:08:59 | And this time |
0:09:01 | we do not have the |
0:09:03 | converted-MCC versus natural-MCC mismatch problem, |
0:09:10 | so |
0:09:12 | the quality of the synthesized singing |
0:09:14 | is expected to be improved. |
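The single-step training idea can be sketched as follows. The "network" here is a toy linear model and the feature sizes are assumptions; the point is only that the inputs are the three features named above and the loss is taken on the waveform directly, so there is no intermediate converted-MCC target, and hence no converted-versus-natural mismatch at runtime.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes for the inputs named in the talk: the PPG and prosody
# (F0, AP) from the template singing, and the i-vector from the speech.
D_PPG, D_IVEC = 40, 64
W = rng.standard_normal((D_PPG + D_IVEC + 2, 1)) * 0.01  # toy "integrated network"

def assemble_inputs(ppg, ivector, f0, ap):
    """One input matrix per utterance: linguistics and prosody from the
    template singing, speaker identity from the user's speech."""
    T = ppg.shape[0]
    return np.hstack([ppg, np.tile(ivector, (T, 1)), f0[:, None], ap[:, None]])

def train_step(x, target_wave, lr=0.05):
    """One joint gradient step on a waveform-domain MSE loss: conversion and
    vocoding are optimized together rather than in two separate stages."""
    global W
    pred = x @ W
    err = pred - target_wave[:, None]
    W -= lr * (x.T @ err) / len(x)
    return float(np.mean(err ** 2))
```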
0:09:18 | Now the results. |
0:09:21 | For the experiments, |
0:09:23 | we experimented with two databases, |
0:09:26 | and the models were |
0:09:28 | trained and tested on these |
0:09:31 | speakers' voices. |
0:09:33 | The features |
0:09:34 | were extracted with |
0:09:36 | the WORLD vocoder, |
0:09:40 | and the alignment |
0:09:43 | was performed on the training data. |
0:09:47 | We compared three models. The first one is the baseline, which is the cascade |
0:09:52 | described before, |
0:09:57 | that is, the LSTM conversion model with the WORLD vocoder. |
0:09:59 | The second one is |
0:10:01 | slightly different: |
0:10:03 | we still have the LSTM conversion model like the first one, |
0:10:09 | but the WORLD vocoder is replaced with the WaveRNN vocoder, |
0:10:15 | so in the plots |
0:10:18 | the first one is labeled LSTM + WORLD and the second one LSTM + WaveRNN. |
0:10:23 | And the third one |
0:10:25 | is our proposed integrated WaveRNN model. |
0:10:28 | We had two evaluation approaches: the first one is an objective evaluation, |
0:10:35 | and the second one is a |
0:10:36 | subjective evaluation. So for the |
0:10:40 | objective evaluation, |
0:10:42 | we computed the root mean square error (RMSE); |
0:10:46 | this is to measure the |
0:10:49 | distortion between the target singing and the converted singing, |
0:10:54 | and also the LSD, |
0:10:55 | the log spectral distortion, |
0:10:58 | and for both metrics, |
0:11:00 | the lower, |
0:11:01 | the better. |
0:11:04 | We also calculated the similarity scores. |
0:11:08 | So when we compare the baseline systems, |
0:11:12 | LSTM + WORLD |
0:11:14 | and LSTM + WaveRNN, |
0:11:18 | with the third one, |
0:11:20 | which is our proposed integrated one, |
0:11:24 | we can see our integrated model outperformed the baseline models. |
0:11:31 | What this |
0:11:34 | actually means is that |
0:11:35 | our proposed model has reduced the mismatch |
0:11:41 | between the intermediate features, |
0:11:44 | the converted MCC and the natural MCC, |
0:11:48 | so that we can get better results. |
0:11:51 | Our WaveRNN-based model |
0:11:53 | also outperformed |
0:11:57 | the one with the WORLD vocoder, |
0:12:00 | and we have also found |
0:12:03 | a similar situation |
0:12:07 | in voice conversion: |
0:12:09 | the WaveRNN vocoder |
0:12:10 | can sometimes be better than the WORLD vocoder |
0:12:16 | in objective evaluations. |
0:12:22 | So for the subjective evaluation, |
0:12:25 | we evaluated |
0:12:28 | both the singing quality and the singer |
0:12:31 | similarity. |
0:12:32 | So we actually |
0:12:35 | performed listening tests: |
0:12:38 | for all of the comparison systems, |
0:12:43 | we |
0:12:44 | randomly selected utterances |
0:12:50 | and asked the listeners |
0:12:53 | to take two tests. |
0:12:55 | We performed a MOS test to evaluate the quality, |
0:13:01 | and an XAB test to evaluate the similarity. |
0:13:04 | And here, |
0:13:06 | first, |
0:13:07 | our proposed model is compared with |
0:13:10 | the baseline model, |
0:13:12 | the one with the WORLD vocoder. |
0:13:15 | So the yellow one is our proposed model, |
0:13:19 | and the other is the baseline. |
0:13:21 | We can see our proposed model outperformed the baseline in terms of |
0:13:26 | quality (this is the MOS test) |
0:13:29 | and similarity |
0:13:32 | (this is the XAB preference test). |
0:13:35 | And for |
0:13:39 | the comparison |
0:13:42 | between our proposed model and the second baseline, |
0:13:45 | we can also |
0:13:46 | observe a similar trend: our proposed model |
0:13:51 | outperformed |
0:13:52 | the other one |
0:13:55 | in terms of |
0:13:57 | quality and similarity. |
0:13:59 | So this |
0:14:00 | significant improvement indicates that our proposed model |
0:14:09 | has |
0:14:10 | benefited |
0:14:12 | from the proposed integrated |
0:14:16 | framework. |
0:14:17 | I will also play some demos here |
0:14:22 | to let you hear our proposed model. |
0:14:27 | OK, first the source speech, |
0:14:35 | and then the template singing. |
0:14:45 | This is the |
0:14:46 | baseline, |
0:14:54 | and this is the |
0:14:55 | proposed one. |
0:15:02 | For another baseline compared with our proposed model, we can hear |
0:15:11 | the baseline, |
0:15:26 | and our proposed one. |
0:15:36 | You can find more audio samples on this website. |
0:15:41 | Now I would like to conclude |
0:15:43 | this paper. Our proposed model |
0:15:47 | actually does not require the target speaker's singing data for model training, |
0:15:52 | unlike the singing voice conversion systems, |
0:15:54 | and we also do not |
0:15:55 | need to train |
0:15:57 | different models for different target speakers. |
0:16:00 | Also, there is no frame alignment needed, as in speech-to-singing conversion, |
0:16:05 | and |
0:16:05 | it also resolves the feature mismatch between training and runtime |
0:16:11 | inference, which implies better quality in the generated singing. |
0:16:17 | And the experimental results have also validated that the proposed model |
0:16:24 | improves both |
0:16:26 | the singing quality and the singer similarity. |
0:16:29 | That is all for my presentation. |
0:16:32 | Thank you. |