0:00:16 | Hello everyone. This is the presentation of our paper, "Transforming Spectrum and Prosody |
---|
0:00:22 | for Emotional Voice Conversion |
---|
0:00:24 | with Non-Parallel Training Data." |
---|
0:00:26 | We are from the National University of Singapore |
---|
0:00:29 | and the Singapore University of Technology and Design. |
---|
0:00:38 | This is the outline of this presentation. |
---|
0:00:41 | First, I will give an introduction to emotional voice conversion |
---|
0:00:45 | and the related work. |
---|
0:00:46 | Then I will talk about our contributions, |
---|
0:00:49 | the proposed framework, |
---|
0:00:51 | the experiments, |
---|
0:00:52 | and the conclusion. |
---|
0:00:55 | Emotional voice conversion is a voice conversion technique. |
---|
0:00:59 | It aims to convert the emotion in speech |
---|
0:01:02 | from a source emotion to a target emotion, |
---|
0:01:06 | while the speaker identity and the linguistic information should be retained. |
---|
0:01:13 | As you can see in this figure, |
---|
0:01:15 | the same utterance is spoken by the same speaker, |
---|
0:01:19 | but the |
---|
0:01:20 | emotion has been changed from one to another. This technique has many applications in |
---|
0:01:28 | human-computer interaction, |
---|
0:01:30 | such as personalized |
---|
0:01:32 | text-to-speech |
---|
0:01:33 | and expressive conversational agents. |
---|
0:01:39 | As we know, |
---|
0:01:40 | emotion is expressed through multiple signal attributes, which can be, for example, |
---|
0:01:46 | the spectrum |
---|
0:01:48 | and |
---|
0:01:49 | prosody. |
---|
0:01:50 | Moreover, emotion is also supra-segmental |
---|
0:01:53 | and hierarchical in nature, which makes it more difficult |
---|
0:01:57 | to convert the emotion in speech. We note that early studies only |
---|
0:02:02 | focused on spectrum conversion |
---|
0:02:05 | and did not pay much attention to prosody, |
---|
0:02:09 | which is not sufficient. |
---|
0:02:11 | Moreover, most previous work requires |
---|
0:02:14 | parallel utterances |
---|
0:02:15 | of the source and the target emotions, |
---|
0:02:19 | but in practice, parallel data is difficult to collect, |
---|
0:02:25 | and this also limits the scope of applications. |
---|
0:02:31 | To deal with this limitation |
---|
0:02:33 | and eliminate the need for parallel training data, we propose to use CycleGAN |
---|
0:02:39 | to find the mappings |
---|
0:02:40 | of spectrum and prosody. |
---|
0:02:43 | CycleGAN was proposed for |
---|
0:02:46 | image translation and has |
---|
0:02:49 | achieved remarkable performance |
---|
0:02:50 | on non-parallel tasks. |
---|
0:02:53 | Researchers |
---|
0:02:54 | have successfully applied it to voice conversion and speech synthesis. |
---|
0:02:59 | A CycleGAN has three losses: |
---|
0:03:02 | the adversarial loss, |
---|
0:03:04 | the cycle-consistency loss, |
---|
0:03:05 | and the identity-mapping loss. |
---|
0:03:08 | With these three losses, |
---|
0:03:09 | the CycleGAN learns to generate features of the target domain |
---|
0:03:13 | without relying on any parallel data. |
---|
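[Editor's note: the speaker shows no code. As a minimal NumPy sketch of the three loss terms just named, assuming a least-squares adversarial objective and L1 norms, as is common in CycleGAN voice conversion; the loss weights below are hypothetical, not from the paper:]

```python
import numpy as np

def adversarial_loss(d_scores_fake):
    # Least-squares GAN generator loss: push discriminator scores on
    # generated features toward 1 (the "real" label).
    return np.mean((d_scores_fake - 1.0) ** 2)

def cycle_consistency_loss(x, x_cycled):
    # L1 distance between an input and its round-trip reconstruction
    # G_tgt->src(G_src->tgt(x)); this keeps linguistic content intact.
    return np.mean(np.abs(x - x_cycled))

def identity_loss(y, g_of_y):
    # A target-domain sample fed to the source->target generator
    # should come back (almost) unchanged.
    return np.mean(np.abs(y - g_of_y))

# Toy total with hypothetical weights lambda_cyc = 10, lambda_id = 5:
x = np.zeros((100, 24))  # placeholder "feature" frames
total = (adversarial_loss(np.full((100, 1), 0.8))
         + 10.0 * cycle_consistency_loss(x, x)
         + 5.0 * identity_loss(x, x))
```

Because the cycle-consistency and identity terms are computed on each domain's own data, no frame alignment between source and target utterances is ever needed.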
0:03:20 | Another challenge of emotional voice conversion lies in prosody modeling. |
---|
0:03:25 | Among the prosodic features, |
---|
0:03:31 | the fundamental frequency, which is also called |
---|
0:03:34 | F0, |
---|
0:03:35 | is the |
---|
0:03:36 | main factor |
---|
0:03:37 | of intonation. |
---|
0:03:39 | Earlier studies |
---|
0:03:41 | convert F0 |
---|
0:03:42 | with a linear transformation, |
---|
0:03:44 | but as we all know, |
---|
0:03:46 | F0 varies from the micro-prosody level, such as in the vowels, |
---|
0:03:51 | and phrases, |
---|
0:03:53 | to the whole utterance as well. |
---|
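[Editor's note: a sketch of the linear F0 transformation referred to here. In voice conversion this is commonly the log-Gaussian normalized transform; the statistics in the test below are toy values, not from the paper:]

```python
import numpy as np

def linear_f0_transform(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Log-Gaussian normalized F0 transformation.

    Shifts and scales log-F0 so the source contour matches the target
    emotion's mean (mu) and standard deviation (sigma) in the log
    domain; unvoiced frames (F0 == 0) are passed through unchanged.
    """
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp((log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_conv
```

Every voiced frame is moved by the same global rule, which is why such a transform cannot capture the multi-level variations described above.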
0:03:56 | Such modeling is too simple to characterize |
---|
0:04:00 | the speech prosody hierarchy. |
---|
0:04:02 | Some researchers propose to |
---|
0:04:04 | model F0 with the |
---|
0:04:06 | continuous wavelet transform. |
---|
0:04:08 | The continuous wavelet transform |
---|
0:04:11 | is a signal processing technique |
---|
0:04:13 | which |
---|
0:04:14 | decomposes the signal |
---|
0:04:16 | into different time domains. |
---|
0:04:20 | It can describe the signal |
---|
0:04:22 | with different temporal resolutions, |
---|
0:04:25 | and we think |
---|
0:04:26 | it is suitable for modeling hierarchical signals |
---|
0:04:30 | such as F0. |
---|
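[Editor's note: a self-contained sketch of a ten-scale continuous wavelet decomposition. A Mexican-hat mother wavelet and dyadically spaced scales are assumed here, as is common for F0 modeling; the talk does not specify the exact settings:]

```python
import numpy as np

def mexican_hat(t):
    # Mexican-hat (Ricker) mother wavelet.
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt_decompose(signal, num_scales=10, base_scale=1.0):
    """Decompose a 1-D contour (e.g. an interpolated log-F0 track) into
    `num_scales` dyadically spaced scales. Row i holds the component at
    scale base_scale * 2**i: small i captures short-term detail, large i
    captures long-term (phrase/utterance level) variation."""
    n = len(signal)
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        s = base_scale * 2.0 ** i
        # Clamp the kernel support so it never exceeds the signal length.
        half = min(int(5 * s), (n - 1) // 2)
        tau = np.arange(-half, half + 1)
        kernel = mexican_hat(tau / s) / np.sqrt(s)
        components[i] = np.convolve(signal, kernel, mode="same")
    return components
```

This simplified analysis does not reconstruct the signal exactly by summation; practical systems apply a reconstruction formula with a normalization factor when converting the modified scales back to F0.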
0:04:36 | This figure shows how the |
---|
0:04:38 | continuous wavelet transform works. |
---|
0:04:41 | We use the wavelet transform |
---|
0:04:43 | to decompose F0 |
---|
0:04:45 | into ten scales. |
---|
0:04:46 | These two utterances have the same linguistic content and are spoken by the same speaker, and we assume |
---|
0:04:52 | that the lower scales |
---|
0:04:54 | can capture the short-term variations, while the higher scales can capture the long-term variations. |
---|
0:05:01 | It can be seen that these two utterances |
---|
0:05:03 | vary mostly in the long-term variations, |
---|
0:05:06 | even though they are spoken by the same speaker with the same content. |
---|
0:05:12 | These variations reflect the emotional differences |
---|
0:05:16 | at different time scales of the utterances. |
---|
0:05:22 | So in this paper, |
---|
0:05:23 | we propose a parallel-data-free emotional voice conversion framework. |
---|
0:05:28 | We also show that |
---|
0:05:32 | for emotional voice conversion, |
---|
0:05:35 | we can convert both spectrum and prosody features with CycleGAN. We |
---|
0:05:41 | also |
---|
0:05:42 | investigate |
---|
0:05:43 | different |
---|
0:05:44 | training strategies |
---|
0:05:46 | for spectrum and prosody conversion, |
---|
0:05:48 | such as |
---|
0:05:49 | separate training and joint training. |
---|
0:05:52 | Moreover, the |
---|
0:05:53 | experimental results |
---|
0:05:56 | show that we outperform the baseline approaches and achieve high-quality converted |
---|
0:06:03 | speech samples. |
---|
0:06:07 | This is the training phase of our proposed framework. |
---|
0:06:11 | In the training phase, |
---|
0:06:12 | we train two CycleGANs for spectrum and prosody separately. |
---|
0:06:17 | We use the WORLD vocoder |
---|
0:06:19 | to extract spectral features and F0 from the source and target utterances. |
---|
0:06:25 | We then |
---|
0:06:26 | encode the spectral features into 24 Mel-cepstral coefficients (MCEPs) |
---|
0:06:31 | and use the continuous wavelet transform to decompose F0 into ten different scales. |
---|
0:06:37 | We then train the two CycleGANs for spectrum and prosody respectively to learn the |
---|
0:06:42 | mappings between the source and target acoustic features. |
---|
0:06:50 | In the conversion phase, |
---|
0:06:52 | we use |
---|
0:06:53 | the trained CycleGANs |
---|
0:06:55 | to convert the spectrum and prosody, |
---|
0:06:58 | and we use the WORLD vocoder to synthesize the converted utterances. |
---|
0:07:02 | We also investigate two different training strategies in our proposed framework. |
---|
0:07:08 | The first one is |
---|
0:07:10 | CycleGAN-concat. |
---|
0:07:11 | In this framework, |
---|
0:07:12 | we concatenate the MCEPs |
---|
0:07:14 | with the CWT-based F0 features |
---|
0:07:18 | and input them to a single CycleGAN. |
---|
0:07:21 | The second one is CycleGAN-separate. |
---|
0:07:24 | In this framework, |
---|
0:07:26 | we train two separate CycleGANs for spectrum and prosody respectively. |
---|
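[Editor's note: the difference between the two strategies comes down to what each CycleGAN sees per frame. A toy sketch with the dimensions mentioned in the talk (24 MCEPs, 10 CWT coefficients of F0), using random placeholder arrays:]

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames = 200
mceps = rng.standard_normal((num_frames, 24))   # spectral features
cwt_f0 = rng.standard_normal((num_frames, 10))  # CWT-decomposed F0

# CycleGAN-concat (joint training): one model is trained on the
# joined feature stream.
joint_input = np.concatenate([mceps, cwt_f0], axis=1)  # shape (200, 34)

# CycleGAN-separate: two models, one per feature stream.
spectrum_input, prosody_input = mceps, cwt_f0
```

In the joint case a single model must capture the frame-level correlation between the two streams; in the separate case each model only learns its own stream's mapping.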
0:07:34 | In this work, we compare a total of three frameworks. |
---|
0:07:38 | In the first one, a CycleGAN converts the spectral features, |
---|
0:07:41 | and a |
---|
0:07:43 | linear transformation is used to convert the prosody. |
---|
0:07:46 | We call this framework the baseline. |
---|
0:07:49 | And |
---|
0:07:51 | CycleGAN-concat and CycleGAN-separate refer to the two different training strategies of |
---|
0:07:56 | CycleGAN |
---|
0:07:57 | that we talked about in the last slide. |
---|
0:08:00 | We use an |
---|
0:08:02 | emotional |
---|
0:08:03 | speech corpus, which is recorded by a |
---|
0:08:06 | professional American actress. |
---|
0:08:09 | We conduct experiments from neutral to angry, sad, and surprise. |
---|
0:08:14 | For each emotion combination, |
---|
0:08:16 | we use a set of non-parallel utterances, |
---|
0:08:20 | around three minutes for training, and ten utterances for evaluation. |
---|
0:08:27 | For the objective evaluation, |
---|
0:08:29 | we |
---|
0:08:30 | calculate the Mel-cepstral distortion (MCD) |
---|
0:08:34 | and the Pearson correlation coefficient (PCC) to assess |
---|
0:08:38 | the performance of the voice conversion. |
---|
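[Editor's note: the two objective metrics just mentioned can be computed as below. This is the standard MCD definition and plain Pearson correlation, which may differ in detail from the authors' exact setup:]

```python
import numpy as np

def mel_cepstral_distortion(c_conv, c_tgt):
    """Frame-averaged MCD in dB between two time-aligned MCEP
    sequences of shape (frames, dims); the 0th (energy) coefficient
    is conventionally excluded before calling this."""
    diff = c_conv - c_tgt
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_pcc(f0_conv, f0_tgt):
    # Pearson correlation coefficient between two aligned F0 contours.
    return float(np.corrcoef(f0_conv, f0_tgt)[0, 1])
```

Lower MCD means the converted spectrum is closer to the target; a PCC closer to 1 means the converted F0 contour follows the shape of the target contour.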
0:08:40 | From these two tables, |
---|
0:08:42 | we |
---|
0:08:43 | can see that our proposed |
---|
0:08:46 | CycleGAN-separate framework |
---|
0:08:48 | outperforms the baseline and the CycleGAN-concat framework for all emotion pairs. |
---|
0:08:56 | We also further conduct |
---|
0:08:59 | a subjective evaluation to assess the emotion similarity. In this experiment, we conduct |
---|
0:09:05 | a preference test. |
---|
0:09:08 | As shown in these two figures, |
---|
0:09:09 | our proposed framework consistently outperforms the baseline and CycleGAN-concat. |
---|
0:09:18 | From Figure 6, we can see that most of the listeners |
---|
0:09:22 | prefer |
---|
0:09:24 | our CycleGAN-separate |
---|
0:09:25 | framework |
---|
0:09:26 | rather than the CycleGAN-concat framework. |
---|
0:09:34 | So |
---|
0:09:34 | the results show that |
---|
0:09:35 | separate training is better. |
---|
0:09:39 | We wondered why separate training is so much better than joint training. |
---|
0:09:43 | We think it is because F0 is decomposed into different time scales, |
---|
0:09:48 | and these different time scales make the |
---|
0:09:52 | CWT coefficients quite different in nature from the spectral features, |
---|
0:09:56 | so that |
---|
0:09:57 | joint training has to |
---|
0:09:59 | estimate the wavelet transform coefficients together with the spectral |
---|
0:10:04 | features at each frame. |
---|
0:10:06 | And |
---|
0:10:07 | this training strategy assumes that the two feature streams are closely correlated. |
---|
0:10:13 | So, |
---|
0:10:13 | with a limited number of training samples, |
---|
0:10:17 | for example, three minutes of speech |
---|
0:10:19 | in our experiments, |
---|
0:10:21 | the jointly trained |
---|
0:10:22 | CycleGAN model |
---|
0:10:24 | cannot |
---|
0:10:25 | generalize very well to the emotion mapping |
---|
0:10:28 | with unseen components at |
---|
0:10:30 | run-time |
---|
0:10:31 | inference. |
---|
0:10:34 | So we think that may be the reason why separate training is much |
---|
0:10:38 | better than joint training in our experiments. |
---|
0:10:45 | To conclude, in this |
---|
0:10:47 | paper we show that |
---|
0:10:49 | separate training of spectrum and prosody |
---|
0:10:51 | can achieve |
---|
0:10:52 | better performance than joint training. |
---|
0:10:55 | The experimental results also |
---|
0:10:58 | show that our proposed emotional voice conversion framework can achieve better performance than the baseline |
---|
0:11:06 | with non-parallel training data. |
---|
0:11:09 | And |
---|
0:11:11 | this is all |
---|
0:11:12 | of our presentation. Thank you for your attention. |
---|