Hello. I received my PhD degree from Inner Mongolia University, and I will give a presentation about our paper on TTS.

This is a collaborative work between Inner Mongolia University, the National University of Singapore, and the Singapore University of Technology and Design.

The title of the paper is "WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss".

This is a quick outline of what I'm going to talk about. We will now come to the first section.

Text-to-speech (TTS) aims to convert text into human-like speech. With the development of deep learning, end-to-end TTS has many advantages over conventional TTS techniques. Tacotron-based TTS actually consists of two modules: the first one is feature prediction, and the second one is waveform generation.

The main task of the feature prediction network is to generate frequency-domain acoustic features, while the role of the waveform generation module is to convert the frequency-domain acoustic features into a time-domain waveform.

Tacotron's popular implementation with the Griffin-Lim algorithm for waveform phase reconstruction only uses a loss function derived from the spectrogram in the frequency domain; that is, the loss function doesn't take the waveform into consideration in the optimization process.

As a result, there exists a mismatch between the Tacotron optimization objective and the quality of the generated time-domain waveform.

In this paper, we propose to add a time-domain loss function to the Griffin-Lim-based Tacotron TTS model at training time. In other words, we use both the frequency-domain loss and the time-domain loss for the training of the feature prediction model. In addition, we use SI-SDR, that is, the scale-invariant signal-to-distortion ratio, to measure the quality of the time-domain waveform.

Next, I would like to introduce the related work.

The overall architecture of the Tacotron model includes a feature prediction model, which contains an encoder, an attention-based decoder, and the Griffin-Lim algorithm for waveform reconstruction.

The encoder consists of two components: a convolutional module with three convolutional layers, and a bidirectional LSTM layer.

The decoder consists of four components: a two-layer pre-net, LSTM layers, a linear projection layer, and a five-layer convolutional post-net.
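Schematically, the feature prediction network just described can be summarized as plain data. Note that the exact layer counts here follow the common Tacotron 2 recipe and should be treated as assumptions rather than this paper's exact configuration.

```python
# Summary of the feature prediction network described above.
# Layer counts follow the common Tacotron 2 recipe; treat them as
# assumptions, not this paper's exact configuration.
FEATURE_PREDICTION_NET = {
    "encoder": [
        "3 convolutional layers",
        "1 bidirectional LSTM layer",
    ],
    "decoder": [
        "2-layer fully connected pre-net",
        "LSTM layers",
        "linear projection layer",
        "5-layer convolutional post-net",
    ],
}
```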

During training, we optimize the feature prediction model to minimize the frequency-domain loss between the generated mel spectral features and the target natural mel spectral features.

As this loss function is formulated only for frequency-domain acoustic features, it fails to directly control the quality of the generated time-domain waveform. In other words, the frequency-domain loss function doesn't take the waveform into consideration in the optimization process.

To address this mismatch problem, we propose a novel training scheme for Tacotron-based TTS.

The main contributions of this paper are summarized as follows. First, we study a novel time-domain loss for speech synthesis. Second, we improve the Tacotron-based TTS framework by proposing a novel training scheme based on a joint time-frequency domain loss. Third, we propose to use the SI-SDR metric to measure the distortion of the time-domain waveform.

This section looks at the framework of our proposed method. Based on the analysis above, we propose a time-domain loss function for Tacotron-based TTS by applying a novel training scheme that takes into account both time- and frequency-domain loss functions. In this way, we effectively reduce the mismatch between the frequency-domain features and the time-domain waveform, and improve the output speech quality. The proposed framework is called WaveTTS hereafter.

Next, we will discuss the proposed training scheme of WaveTTS in detail.

We define two objective functions during training. The first one is the frequency-domain loss, denoted as Loss_F, which is computed directly over the mel spectral features, similarly to the Tacotron model. The second one is the proposed time-domain loss, denoted as Loss_T, which is obtained at the waveform level as the output of the Griffin-Lim iteration that predicts the time-domain signal from the mel spectral features.

Loss_F ensures that the generated mel spectra are close to the natural mel spectra, while Loss_T minimizes the loss at the waveform level. We add a weighting coefficient λ to balance the two losses. The total loss function of the whole model is defined as shown in this equation.
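As a rough sketch of how such a joint objective can be combined, consider the following minimal illustration. The loss forms (plain MSE stand-ins) and the weight value are placeholders, not the paper's exact formulation; in particular, the paper's actual time-domain criterion is based on SI-SDR.

```python
import numpy as np

def frequency_loss(mel_pred, mel_target):
    # Frequency-domain loss over mel spectral features (MSE as one common choice).
    return float(np.mean((mel_pred - mel_target) ** 2))

def time_loss(wav_pred, wav_target):
    # Stand-in time-domain loss at the waveform level; the paper's actual
    # time-domain criterion is SI-SDR-based.
    return float(np.mean((wav_pred - wav_target) ** 2))

def total_loss(mel_pred, mel_target, wav_pred, wav_target, lam=0.5):
    # Weighted sum Loss_F + lambda * Loss_T; the lambda value is illustrative.
    return frequency_loss(mel_pred, mel_target) + lam * time_loss(wav_pred, wav_target)
```

The weighting coefficient λ simply trades off how much the optimizer attends to the waveform-level error versus the spectral error.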

This figure also shows the complete training process of our proposed WaveTTS. In WaveTTS, the model predicts the mel spectral features from the given input character sequence, and then converts both the produced and the target mel spectra to time-domain signals using the Griffin-Lim algorithm. Finally, the joint loss function is used to optimize the WaveTTS model.

We also use SI-SDR, that is, the scale-invariant signal-to-distortion ratio, to measure the distance between the generated waveform and the target natural speech. We note that SI-SDR is adopted only during training and is not required at run-time inference.
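The SI-SDR measure can be sketched as follows; this is a minimal NumPy version written from the standard definition, and the exact formulation used in any given implementation may differ in small details such as mean removal.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = np.asarray(estimate, dtype=float) - np.mean(estimate)
    target = np.asarray(target, dtype=float) - np.mean(target)
    # Project the estimate onto the target to find the optimally scaled target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # signal component
    noise = estimate - projection        # distortion component
    return 10.0 * np.log10((np.dot(projection, projection) + eps)
                           / (np.dot(noise, noise) + eps))
```

Because of the optimal scaling step, a perfect reconstruction up to a gain factor still scores very high, while any additive distortion lowers the score.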

Now, I would like to move on to the experiment part. We evaluated WaveTTS in experiments on the LJ Speech database. We developed four systems for a comparative study.

The first one is Tacotron-GL. This system has only a frequency-domain loss function, and the Griffin-Lim algorithm is used to generate the waveform at run-time.

The second one is Tacotron-WaveRNN. This system also has only a frequency-domain loss function; differently, a pre-trained WaveRNN vocoder is used to generate the waveform at run-time.

The third one is WaveTTS-GL. It means that our proposed WaveTTS model is trained with the joint time-frequency domain loss, and the Griffin-Lim algorithm is used both during training and at run-time synthesis.

The last one is WaveTTS-WaveRNN. It means that our proposed WaveTTS model is trained with the joint time-frequency domain loss; the Griffin-Lim algorithm is used during training, and the pre-trained WaveRNN vocoder is used to synthesize speech at run-time.

We also compare these systems with the ground-truth speech, denoted as GT. At run-time, Tacotron-GL and WaveTTS-GL use the Griffin-Lim algorithm with 64 iterations.
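For reference, Griffin-Lim phase reconstruction of the kind used by these systems can be sketched as below. This is a minimal SciPy version with illustrative STFT parameters, not the exact configuration of the systems above.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=64, seed=0):
    """Estimate a waveform from a magnitude spectrogram by iterating between
    time and frequency domains, keeping the known magnitude and refining
    only the phase."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        # Back to the time domain with the current phase estimate...
        _, y = istft(mag * angles, nperseg=n_fft, noverlap=n_fft - hop)
        # ...then forward again, keeping only the phase of the result.
        _, _, spec = stft(y, nperseg=n_fft, noverlap=n_fft - hop)
        spec = spec[:, :mag.shape[1]]  # guard against off-by-one frame counts
        angles = np.exp(1j * np.angle(spec))
    _, y = istft(mag * angles, nperseg=n_fft, noverlap=n_fft - hop)
    return y
```

More iterations generally improve phase consistency at the cost of run-time, which is why the iteration count is a relevant experimental variable here.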

We conducted listening experiments to evaluate the quality of the synthesized speech. We first evaluated the sound quality of the synthesized speech in terms of mean opinion score (MOS), reported in Figure 1.

We compare Tacotron-GL with WaveTTS-GL to observe the effect of the joint time-frequency domain loss. We believe that this is a fair comparison, as both frameworks use the Griffin-Lim algorithm for waveform generation during training and at run-time. As can be seen in Figure 1, WaveTTS-GL outperforms Tacotron-GL.

We then compare Tacotron-WaveRNN and WaveTTS-WaveRNN to investigate how well the predicted mel spectral features perform with a neural vocoder. We observe that when WaveTTS is trained with the SI-SDR loss, it performs better than the Tacotron baseline when the WaveRNN vocoder is available at run-time.

We also compare WaveTTS-GL and WaveTTS-WaveRNN in terms of voice quality. We notice that both frameworks are trained under the same conditions; however, WaveTTS-WaveRNN uses the WaveRNN vocoder for waveform generation at run-time, and as expected, WaveTTS-WaveRNN outperforms WaveTTS-GL.

We also conducted an A/B preference test to assess the speech quality of the proposed frameworks. Figure 2 shows that our proposed WaveTTS framework outperforms the baseline system when either the Griffin-Lim algorithm or the WaveRNN vocoder is used at run-time.

We further conducted another A/B preference test to examine the effect of the number of Griffin-Lim iterations on WaveTTS performance. For rapid turnaround, we only applied one and two Griffin-Lim iterations for phase reconstruction, and investigated the effect in terms of voice quality.

We observe that a single iteration of the Griffin-Lim algorithm presents better performance than two iterations.

Finally, the conclusions of this paper. We proposed a novel Tacotron implementation, called WaveTTS. We propose to use the scale-invariant signal-to-distortion ratio as the loss function. The proposed WaveTTS framework outperforms the baseline and achieves high-quality synthesized speech.

Thank you so much for taking the time to listen to this presentation. If interested, please check our project page for the speech samples. Thank you for your attention.