Hello, my name is Rui Liu, and I received my PhD degree from Inner Mongolia University of China. Today I will give a presentation about our paper on TTS. This is a collaborative work between Inner Mongolia University, the National University of Singapore, and the Singapore University of Technology and Design. The title of the paper is "WaveTTS: Tacotron-based TTS with joint time-frequency domain loss".
Here is a quick outline of what I am going to talk about. We will now come to the first section.
Text-to-speech, or TTS, aims to convert text into human-like speech. With the development of deep learning, end-to-end TTS has many advantages over conventional TTS techniques. Tacotron-based TTS consists of two modules: the first one is the feature prediction module, and the second one is the waveform generation module. The main task of the feature prediction network is to predict frequency-domain acoustic features, while the waveform generation module converts the frequency-domain acoustic features into a time-domain waveform.
A typical Tacotron implementation with Griffin-Lim for waveform phase reconstruction only uses a loss function derived from the spectrogram in the frequency domain, that is, a loss function that doesn't take the waveform into consideration in the optimisation process. As a result, there exists a mismatch between the Tacotron optimisation and the actual generated waveform.
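As background, the Griffin-Lim phase reconstruction mentioned above can be sketched in a few lines. This is a minimal generic illustration built on SciPy's STFT routines, not the implementation used in the paper, and the window parameters are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=64, nperseg=256, noverlap=192, seed=0):
    """Estimate a waveform whose STFT magnitude matches `mag` by
    alternating between the time and frequency domains (Griffin-Lim)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        # Back to the time domain with the current phase estimate...
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        # ...then re-analyze and keep only the new phase.
        _, _, S = stft(x, nperseg=nperseg, noverlap=noverlap)
        S = S[:, : mag.shape[1]]  # guard against off-by-one frame counts
        if S.shape[1] < mag.shape[1]:
            S = np.pad(S, ((0, 0), (0, mag.shape[1] - S.shape[1])))
        phase = np.exp(1j * np.angle(S))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x
```

Note that only the phase is updated at each iteration; the measured magnitude is re-imposed every time, which is exactly why a frequency-domain-only loss says nothing about how well this reconstruction will turn out.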
In this paper, we propose to add a time-domain loss function to the Griffin-Lim based Tacotron TTS model at training time. In other words, we use both the frequency-domain loss and the time-domain loss for the training of the feature prediction model. In addition, we use SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the quality of the time-domain waveform.
Next, I would like to introduce the related work.
The overall architecture of the Tacotron model includes a feature prediction model, which contains an encoder, an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction. The encoder consists of two components: a CNN-based module that has three convolutional layers, and a bidirectional LSTM layer. The decoder consists of four components: a two-layer pre-net, two LSTM layers, a linear projection layer, and a five-convolutional-layer post-net.
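To make the data flow through these components concrete, here is a rough shape-level sketch in NumPy. Every layer is replaced by a random-projection stub, and all dimensions are illustrative rather than the model's actual hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T_in, T_out, d_model, n_mels = 20, 100, 512, 80  # illustrative sizes only

def stub(x, d_out):
    """Random linear stand-in for a trained layer (conv, LSTM, ...)."""
    return np.tanh(x @ rng.standard_normal((x.shape[-1], d_out)) * 0.01)

# Encoder: three "convolutional" stubs, then a "bidirectional LSTM" stub
# (a forward pass and a reversed pass, concatenated).
h = rng.standard_normal((T_in, d_model))      # embedded input characters
for _ in range(3):
    h = stub(h, d_model)
enc = np.concatenate([stub(h, d_model // 2),
                      stub(h[::-1], d_model // 2)[::-1]], axis=-1)

# Attention-based decoder: each of the T_out output frames attends over
# the encoder outputs; a projection then maps the context to n_mels.
attn = np.full((T_out, T_in), 1.0 / T_in)     # uniform attention, for shape only
mel = stub(attn @ enc, n_mels)
mel = mel + stub(mel, n_mels)                 # residual "post-net" stub
```

The point of the sketch is only the shapes: a length-T_in character sequence becomes a (T_in, d_model) encoder memory, which the decoder turns into a (T_out, n_mels) mel spectrogram.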
During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel spectral features and the target mel spectral features. As the loss function is calculated only over frequency-domain acoustic features, it fails to directly control the quality of the generated time-domain waveform. In other words, the frequency-domain loss function doesn't take the waveform into consideration in the optimisation process.
To address the mismatch problem, we propose a novel training scheme for Tacotron-based TTS. The main contributions of this paper are summarized as follows. First, we study the use of a time-domain loss for speech synthesis. Second, we improve the Tacotron-based TTS framework by proposing a novel training scheme based on a joint time-frequency domain loss. Third, we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform.
This section presents the framework of our proposed method. We propose a time-domain loss function for Tacotron-based TTS. By applying a novel training scheme that takes into account both time- and frequency-domain loss functions, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The proposed framework is called WaveTTS hereafter.
Next, we will discuss the proposed training scheme in detail. In WaveTTS, we define two objective functions during training. The first one is the frequency-domain loss, denoted as Loss_F, which is computed over the mel spectral features, similarly to the Tacotron model. The second one is the proposed time-domain loss, denoted as Loss_T, which is obtained at the waveform level, as the Griffin-Lim iteration predicts the time-domain signal from the mel spectral features. Loss_F ensures that the generated mel spectra are close to the natural mel spectra, while Loss_T minimizes the loss at the waveform level. We add a weighting coefficient to balance the two losses. The total loss function of the whole model is defined as shown in this equation.
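Written out, the joint objective just described takes the following form, where λ is the weighting coefficient between the two terms (the notation here is a reconstruction from the description above, not a quote from the slides):

```latex
\mathcal{L} \;=\; \mathcal{L}_{F} \;+\; \lambda \, \mathcal{L}_{T}
```

Here \(\mathcal{L}_{F}\) is the frequency-domain loss over mel spectral features and \(\mathcal{L}_{T}\) is the time-domain loss measured on the Griffin-Lim-reconstructed waveform.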
This diagram also shows the complete training process of our proposed WaveTTS. In WaveTTS, the model predicts the mel spectral features from the given input character sequence, and then converts both the produced and the target mel spectra to time-domain signals using the Griffin-Lim algorithm. Finally, the joint loss function is used to optimize the WaveTTS model.
We also utilize SI-SDR, that is, scale-invariant signal-to-distortion ratio, to measure the distance between the generated waveform and the target natural speech. We note that SI-SDR is adopted only during training and is not required at run-time inference.
Next, I would like to move on to the experiment part. We evaluated WaveTTS through experiments on the LJSpeech database. We developed four systems for a comparative study. The first one is Tacotron-GL. This system has only a frequency-domain loss function, and the Griffin-Lim algorithm is used to generate the waveform at run-time. The second one is Tacotron-WaveRNN. This system also has only a frequency-domain loss function, but the WaveRNN vocoder is used to generate the waveform at run-time. The third one is WaveTTS-GL, which means the proposed WaveTTS model is trained with the joint time-frequency domain loss, and the Griffin-Lim algorithm is used both during training and at run-time synthesis. The last one is WaveTTS-WaveRNN, which means the proposed WaveTTS model is trained with the joint time-frequency domain loss; the Griffin-Lim algorithm is used during training, and the pre-trained WaveRNN vocoder is used to synthesize speech at run-time. We also compare these systems with the ground-truth speech, denoted as GT. At run-time, Tacotron-GL and WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations.
We conducted listening experiments to evaluate the quality of the synthesized speech. We first evaluated the sound quality in terms of mean opinion score, as reported in Figure 1. We compare Tacotron-GL with WaveTTS-GL to observe the effect of the joint time-frequency domain loss. We believe that this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and at run-time. As can be seen in Figure 1(a), WaveTTS-GL outperforms Tacotron-GL. We then compare Tacotron-WaveRNN and WaveTTS-WaveRNN to investigate how well the predicted mel spectral features perform with a neural vocoder. We observe that, as WaveTTS is trained with the joint loss, it performs better than Tacotron-WaveRNN when the WaveRNN vocoder is used at run-time. We also compare WaveTTS-GL and WaveTTS-WaveRNN in terms of voice quality. We note that both frameworks are trained under the same conditions; however, WaveTTS-WaveRNN uses the WaveRNN vocoder for waveform generation at run-time, and as expected, WaveTTS-WaveRNN outperforms WaveTTS-GL.
We also conducted AB preference tests to assess the speech quality of the proposed frameworks. Figure 2 shows that our proposed WaveTTS outperforms the baseline systems when both Griffin-Lim and the WaveRNN vocoder are used at run-time. We further conducted another AB preference test to examine the effect of the number of Griffin-Lim iterations on the WaveTTS performance. For rapid turnaround, we only applied one and two Griffin-Lim iterations for phase reconstruction, and investigated the effect in terms of voice quality. We observe that a single iteration of the Griffin-Lim algorithm presents better performance than two iterations.
Finally, the conclusion of this paper. We proposed a novel Tacotron implementation called WaveTTS, and we propose to use scale-invariant signal-to-distortion ratio as the loss function. The proposed WaveTTS framework outperforms the baseline and achieves high-quality synthesized speech. That's all. Thank you so much for taking the time to listen to this presentation. If interested, please check our demo page for the speech samples. Thank you for your attention.