0:00:16 Hello everyone. This is the presentation of our paper on transforming spectrum and prosody for emotional voice conversion, a joint work of the National University of Singapore and the Singapore University of Technology and Design.
0:00:38 This is the outline of the presentation. First, I will give an introduction to emotional voice conversion and the related work. Then I will talk about our contributions, the proposed framework, the experiments, and the conclusion.
0:00:55 Emotional voice conversion is a voice conversion technique. It aims to convert the emotion in speech from a source emotion to a target emotion; in the meantime, the speaker identity and the linguistic information should be retained. As you can see in this figure, the same utterance is spoken by the same speaker, but the emotion has been changed. This technique has many applications in human-computer interaction, such as personalized text-to-speech, expressive conversational agents, and so on.
0:01:40 However, emotional voice conversion is challenging. First of all, emotion is perceptually expressed through multiple signal attributes, such as the spectrum and the prosody. Moreover, emotion is supra-segmental and hierarchical in nature, which makes it more difficult to convert the emotion in speech. We note that early studies only focused on the spectrum conversion and did not pay enough attention to the prosody, which is not sufficient.
0:02:11 Moreover, most previous work requires parallel training data of the source and target emotions. In practice, parallel data is difficult to collect, which also limits the scope of applications.
0:02:31 To deal with this limitation and eliminate the need for parallel training data, we propose to use CycleGAN to find the mappings of both spectrum and prosody. CycleGAN was proposed for image translation and has shown remarkable performance on non-parallel tasks. Researchers have successfully applied it to voice conversion and speech synthesis.
0:02:59 CycleGAN has three losses: the adversarial loss, the cycle-consistency loss, and the identity-mapping loss. With these three losses, CycleGAN learns to generate the features of the target domain without any parallel training data.
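To make the three losses concrete, here is a minimal PyTorch sketch. The tiny MLP generators and discriminator, the feature dimension, and the loss weights are illustrative assumptions, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

dim = 24  # e.g. frame-level spectral feature vectors

G_xy = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))  # source -> target
G_yx = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))  # target -> source
D_y  = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))    # target-domain critic

mse, l1 = nn.MSELoss(), nn.L1Loss()
x = torch.randn(32, dim)  # batch of source-emotion frames
y = torch.randn(32, dim)  # batch of target-emotion frames (non-parallel)

fake_y = G_xy(x)

# 1) adversarial loss: converted features should fool the target discriminator
adv_loss = mse(D_y(fake_y), torch.ones(32, 1))

# 2) cycle-consistency loss: x -> y -> x should recover the input
cycle_loss = l1(G_yx(fake_y), x)

# 3) identity-mapping loss: real target features should pass through unchanged
id_loss = l1(G_xy(y), y)

g_loss = adv_loss + 10.0 * cycle_loss + 5.0 * id_loss  # typical weightings
```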
0:03:20 Another challenge of emotional voice conversion is the prosody modeling. Prosody plays an important role in conveying the emotional information in speech. The fundamental frequency, which we also call F0, is the main factor of intonation.
0:03:39 Early studies convert F0 with a linear transformation.
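A minimal sketch of such a linear F0 conversion, assuming the commonly used logarithm Gaussian normalized transformation (the talk does not spell out the exact variant):

```python
import numpy as np

def linear_f0_convert(f0, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Shift and scale log-F0 from source-emotion statistics to
    target-emotion statistics. mu/sigma are the mean/std of log-F0
    computed over voiced frames of the training data."""
    f0_out = np.zeros_like(f0)
    voiced = f0 > 0  # unvoiced frames (F0 == 0) stay zero
    f0_out[voiced] = np.exp(
        (np.log(f0[voiced]) - mu_src) / sigma_src * sigma_tgt + mu_tgt
    )
    return f0_out
```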
0:03:44 But as we all know, F0 varies from the micro-prosody level of a voiced sound all the way up to the whole utterance, so a frame-wise linear transformation is insufficient to characterize such hierarchical variations.
0:04:02 Some researchers propose to model F0 with the continuous wavelet transform. The continuous wavelet transform is a signal processing technique which decomposes a signal into different time scales. It can describe a signal at different time resolutions, and we think it is suitable for modeling hierarchical signals such as F0.
0:04:36 This figure shows how the continuous wavelet transform works. We decompose F0 into ten different scales. The two utterances here have the same linguistic content and are spoken by the same speaker, and we assume that the low scales capture the short-term variations while the high scales capture the long-term variations. Comparing these two utterances, they vary mostly in the long-term variations. Even though they are spoken by the same speaker with the same linguistic content, these variations reflect the emotional variance at the different time scales.
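A minimal sketch of such a ten-scale decomposition. The Ricker (Mexican-hat) mother wavelet and the dyadic widths are illustrative assumptions, and scipy.signal.cwt is only available in SciPy versions before 1.15:

```python
import numpy as np
from scipy import signal  # signal.cwt / signal.ricker exist in SciPy < 1.15

def cwt_decompose(log_f0, n_scales=10):
    """Decompose an interpolated, zero-mean log-F0 contour into
    n_scales time scales with a continuous wavelet transform."""
    widths = 2.0 ** np.arange(n_scales)               # dyadic scales: 1, 2, 4, ...
    return signal.cwt(log_f0, signal.ricker, widths)  # shape: (n_scales, n_frames)

# Low scales (small widths) follow short-term, segment-level movements;
# high scales follow long-term, phrase- and utterance-level movements.
coeffs = cwt_decompose(np.sin(np.linspace(0, 10, 400)))  # toy contour
```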
0:05:22 So in this paper, we propose a parallel-data-free emotional voice conversion framework. We show that for emotional voice conversion, we can convert both spectrum and prosody features through CycleGAN training. We also investigate different training strategies for spectrum and prosody conversion, namely separate training and joint training. The experimental results show that we outperform the baseline approaches and achieve good quality in the converted speech samples.
0:06:07 This is the training phase of our proposed framework. In the training phase, we train two CycleGANs for spectrum and prosody separately. We first use the WORLD vocoder to extract the spectral features and F0 from both the source and the target utterances. We then encode the spectral features into 24-dimensional Mel-cepstral coefficients (MCEPs) and use the continuous wavelet transform to decompose F0 into ten different scales. Finally, we train the two CycleGANs on spectrum and prosody respectively, to learn the mappings between the source and target acoustic features.
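A minimal sketch of this feature-extraction step, assuming the pyworld and pysptk Python bindings; the analysis parameters, such as the warping factor alpha, are illustrative:

```python
import numpy as np
import pyworld as pw   # WORLD vocoder bindings
import pysptk          # Mel-cepstral analysis
import soundfile as sf

x, fs = sf.read("utterance.wav")               # mono waveform
x = np.ascontiguousarray(x, dtype=np.float64)

f0, timeaxis = pw.harvest(x, fs)               # F0 contour
sp = pw.cheaptrick(x, f0, timeaxis, fs)        # smoothed spectral envelope
ap = pw.d4c(x, f0, timeaxis, fs)               # aperiodicity

mcep = pysptk.sp2mc(sp, order=24, alpha=0.42)  # 24th-order MCEPs (alpha for 16 kHz)
```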
0:06:50 In the conversion phase, we use the trained CycleGANs to convert the spectral features and the prosody, and we use the WORLD vocoder to synthesize the converted utterances.
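Continuing the extraction sketch above, the conversion phase could look as follows; G_spec and convert_f0 are identity stubs standing in for the trained CycleGAN generators (plus the inverse wavelet reconstruction for F0), just to make the flow concrete:

```python
G_spec = lambda m: m                # placeholder: spectrum CycleGAN
convert_f0 = lambda f: f            # placeholder: prosody CycleGAN + inverse CWT

mcep_conv = G_spec(mcep)            # converted Mel-cepstral features
f0_conv = convert_f0(f0)            # converted F0 contour

fftlen = (sp.shape[1] - 1) * 2      # FFT size implied by the spectral envelope
sp_conv = pysptk.mc2sp(mcep_conv, alpha=0.42, fftlen=fftlen)
y = pw.synthesize(f0_conv, sp_conv, ap, fs)  # converted waveform
```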
0:07:02 We also investigate two different training strategies with our proposed framework. The first one is CycleGAN-concat, i.e., joint training: in this framework, we concatenate the MCEP features with the CWT-based F0 features and input them to a single CycleGAN. The second one is CycleGAN-separate: in this framework, we train two separate CycleGANs for spectrum and prosody respectively.
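The difference between the two strategies comes down to how the input features are assembled; a minimal sketch with illustrative shapes:

```python
import numpy as np

mcep = np.random.randn(200, 24)    # spectral features: (frames, 24)
cwt_f0 = np.random.randn(200, 10)  # CWT-based F0 features: (frames, 10)

# CycleGAN-concat (joint training): one model sees one 34-dim vector per frame
joint_input = np.concatenate([mcep, cwt_f0], axis=1)  # shape (200, 34)

# CycleGAN-separate: two models, one per feature stream
spectrum_input, prosody_input = mcep, cwt_f0
```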
0:07:34 In this work, we compare three frameworks. The first one uses CycleGAN to convert the spectral features and a linear transformation to convert F0; this framework we call the baseline. CycleGAN-concat and CycleGAN-separate refer to the two different training strategies, with one CycleGAN or two, which we talked about in the last slide.
0:08:00 We use an emotional speech database which is recorded by a professional American actress, and we conduct experiments from neutral to angry, sad, and surprise. For each emotion combination, we use non-parallel utterances, around thirteen minutes of speech, for training, and ten utterances for evaluation.
0:08:27 For the objective evaluation, we calculate the Mel-cepstral distortion (MCD) and the Pearson correlation coefficient (PCC) to assess the performance of the spectrum and prosody conversion.
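A minimal sketch of these two metrics, assuming time-aligned Mel-cepstral frames and F0 contours (the exact evaluation protocol is not detailed in the talk):

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Frame-averaged MCD in dB, excluding the energy coefficient c0."""
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]
    return np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def f0_pearson_correlation(f0_ref, f0_conv):
    """PCC between reference and converted F0 over mutually voiced frames."""
    voiced = (f0_ref > 0) & (f0_conv > 0)
    return np.corrcoef(f0_ref[voiced], f0_conv[voiced])[0, 1]
```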
0:08:40 From these two tables, we can see that our proposed CycleGAN-separate framework outperforms both the baseline and the CycleGAN-concat framework for all emotion pairs.
0:08:56 We further conduct a subjective evaluation to assess the emotion similarity; in this experiment, we conduct a preference test. Our proposed framework consistently outperforms the baseline and CycleGAN-concat, and from the figure we can see that most of the listeners prefer our CycleGAN-separate framework over CycleGAN-concat.
0:09:34 So why is the separate training so much better than the joint training? We think it is because the wavelet analysis describes F0 at different time scales, and these different time scales are related to the linguistic content and carry their own dependencies. The joint training has to estimate the wavelet transform coefficients together with the spectral features at each frame, and this training strategy assumes that the prosody is frame-wise tied to the spectrum.
0:10:13 With a limited number of training samples, for example the thirteen minutes of speech in our experiments, the jointly trained CycleGAN model cannot generalize very well over the emotion mapping and may suffer from unseen components at run-time inference. So we think that may be the reason why the separate training is much better than the joint training in our experiments.
0:10:45 To conclude, in this paper we showed that separate training of spectrum and prosody conversion achieves better performance than joint training. The experimental results also show that our proposed emotional voice conversion framework achieves better performance than the baseline, and it does not need any parallel training data. This is all of our presentation. Thank you for your attention.