0:00:14 | Hello everyone. My name is Rui Liu, and I received my PhD degree |
---|
0:00:18 | from Inner Mongolia University of China. I will give a presentation about our paper on TTS. |
---|
0:00:24 | This is a collaborative work between Inner Mongolia University of China, |
---|
0:00:28 | the National University of Singapore, and the Singapore University of Technology and Design. |
---|
0:00:33 | The title of the paper is "WaveTTS: Tacotron-based TTS with Joint Time-Frequency |
---|
0:00:38 | Domain Loss". |
---|
0:00:42 | This is a quick outline of what I am going to talk about. |
---|
0:00:47 | We will now come to the first section. |
---|
0:00:51 | Text-to-speech aims to convert text into human-like speech. |
---|
0:00:56 | With the advent of deep learning, |
---|
0:00:58 | end-to-end TTS has many advantages over the conventional TTS techniques. |
---|
0:01:03 | Tacotron-based TTS |
---|
0:01:05 | actually consists of two modules: |
---|
0:01:08 | the first one is feature prediction and the second one is waveform generation. |
---|
0:01:13 | The main task |
---|
0:01:15 | of the feature prediction network |
---|
0:01:17 | is to generate frequency-domain acoustic features, |
---|
0:01:20 | while that of the waveform generation module is to convert |
---|
0:01:25 | the frequency-domain acoustic features into a time-domain waveform. |
---|
0:01:29 | A typical |
---|
0:01:30 | Tacotron implementation adopts |
---|
0:01:33 | Griffin-Lim |
---|
0:01:34 | for waveform phase reconstruction, |
---|
0:01:37 | which only uses a loss function derived from the spectrogram in the frequency domain. |
---|
0:01:43 | Such a loss function does not take the waveform into consideration in the optimisation |
---|
0:01:49 | process. |
---|
0:01:51 | As a result, |
---|
0:01:52 | there exists a mismatch |
---|
0:01:54 | between the Tacotron optimisation objective and the quality of the generated waveform. |
---|
0:02:00 | In this paper, we propose to add a time-domain loss function |
---|
0:02:05 | to the Tacotron-based TTS model at training time. In other words, |
---|
0:02:10 | we use both the frequency-domain loss |
---|
0:02:13 | and the time-domain loss for the training of the feature prediction model. |
---|
0:02:17 | In addition, |
---|
0:02:19 | we use SI-SDR, that is, scale-invariant signal-to-distortion ratio, |
---|
0:02:25 | to measure the quality of the time-domain waveform. |
---|
0:02:30 | Next, |
---|
0:02:31 | I would like to introduce the related work. |
---|
0:02:36 | The overall architecture of the Tacotron model includes a feature prediction network, |
---|
0:02:41 | which contains an encoder |
---|
0:02:43 | and an attention-based decoder, and the Griffin-Lim algorithm |
---|
0:02:47 | for waveform |
---|
0:02:48 | reconstruction. |
---|
0:02:51 | The encoder consists of |
---|
0:02:53 | two components: |
---|
0:02:55 | a CNN-based module that has three convolutional layers, |
---|
0:03:00 | and an RNN-based module |
---|
0:03:02 | that has a bidirectional LSTM layer. |
---|
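As a point of reference, here is a minimal PyTorch sketch of the encoder just described (three convolutional layers followed by a bidirectional LSTM). The layer sizes, kernel width, and symbol-table size are hypothetical placeholders, not values taken from the talk or the paper.

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Sketch of a Tacotron-2-style encoder: 3 conv layers + 1 BiLSTM layer."""
    def __init__(self, n_symbols=148, emb_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        # CNN-based component: three 1-D convolutional layers.
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # RNN-based component: one bidirectional LSTM layer.
        self.bilstm = nn.LSTM(emb_dim, emb_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, text_ids):
        x = self.embedding(text_ids).transpose(1, 2)   # (batch, emb_dim, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                          # (batch, time, emb_dim)
        outputs, _ = self.bilstm(x)                    # states consumed by the attention decoder
        return outputs
```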
0:03:07 | The decoder consists of four components: |
---|
0:03:09 | a two-layer pre-net, |
---|
0:03:11 | two LSTM layers, |
---|
0:03:13 | a linear projection layer, |
---|
0:03:15 | and a five-convolutional-layer |
---|
0:03:18 | post-net. |
---|
0:03:20 | During training, |
---|
0:03:21 | we optimise the feature prediction model |
---|
0:03:24 | to minimise the |
---|
0:03:27 | frequency-domain loss |
---|
0:03:29 | between the generated mel-spectral features |
---|
0:03:33 | and the target mel-spectral features. |
---|
0:03:38 | Such a loss function is formulated only for frequency-domain acoustic features; |
---|
0:03:44 | it fails |
---|
0:03:47 | to directly control the quality of the generated time-domain waveform. In other words, |
---|
0:03:53 | the frequency-domain loss function does not take the waveform into consideration |
---|
0:03:58 | in the optimisation process. |
---|
0:04:00 | To address the mismatch problem, |
---|
0:04:03 | we propose a new training scheme for Tacotron-based TTS. |
---|
0:04:08 | The main contributions of this paper are summarised as follows. |
---|
0:04:13 | First, we study the use of a time-domain loss for speech synthesis. |
---|
0:04:18 | Second, we improve |
---|
0:04:19 | the Tacotron-based TTS framework by proposing a novel training scheme |
---|
0:04:24 | based on the joint time-frequency domain loss. |
---|
0:04:27 | Third, we propose to use the SI-SDR metric to measure the distortion of the time- |
---|
0:04:33 | domain waveform. |
---|
0:04:36 | This section looks at the framework of our proposed method. |
---|
0:04:42 | In this section, |
---|
0:04:43 | based on the analysis above, we propose a novel time-domain loss function for |
---|
0:04:48 | Tacotron-based TTS. |
---|
0:04:50 | By applying a new training scheme |
---|
0:04:52 | that takes into account both time- and frequency-domain loss functions, |
---|
0:04:57 | we effectively |
---|
0:04:59 | reduce the mismatch |
---|
0:05:01 | between the frequency-domain features |
---|
0:05:03 | and the time-domain waveform, and improve the output speech quality. |
---|
0:05:09 | The proposed framework is called WaveTTS hereafter. |
---|
0:05:15 | Next, |
---|
0:05:16 | we will discuss in detail the proposed training scheme |
---|
0:05:20 | in WaveTTS. |
---|
0:05:22 | We define two objective functions during training. |
---|
0:05:25 | The first one is the |
---|
0:05:26 | frequency-domain loss, |
---|
0:05:28 | denoted as |
---|
0:05:30 | Loss_F, |
---|
0:05:32 | which is computed over the mel-spectral features, |
---|
0:05:35 | similarly to the standard Tacotron model. |
---|
0:05:38 | The second one is the proposed time-domain loss, |
---|
0:05:41 | denoted as Loss_T, which is obtained at the waveform level after the |
---|
0:05:48 | Griffin-Lim iterations |
---|
0:05:50 | predict the |
---|
0:05:51 | time-domain signal from the mel-spectral features. |
---|
0:05:55 | Loss_F |
---|
0:05:56 | ensures that |
---|
0:05:58 | the generated mel-spectrogram is close to the reference mel-spectrogram, |
---|
0:06:03 | while Loss_T minimises the distortion |
---|
0:06:06 | at the waveform level. We add a weighting |
---|
0:06:09 | coefficient, lambda, |
---|
0:06:10 | to balance the two losses. |
---|
0:06:13 | The total loss function of the whole model |
---|
0:06:16 | is |
---|
0:06:18 | defined as this equation. |
---|
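For readers following along, the total objective just described (a weighted combination of the two losses) can plausibly be written as below; the symbols are my own shorthand, and the exact notation on the slide may differ.

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{F} + \lambda \, \mathcal{L}_{T}
```

Here \mathcal{L}_{F} is the frequency-domain (mel-spectral) loss, \mathcal{L}_{T} is the time-domain loss computed on the Griffin-Lim waveform, and \lambda is the weighting coefficient that balances the two.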
0:06:25 | The algorithm shown here also presents the |
---|
0:06:28 | complete |
---|
0:06:29 | training process of our proposed WaveTTS. |
---|
0:06:33 | In WaveTTS, the |
---|
0:06:35 | model predicts the mel-spectral features from the given input character sequence, |
---|
0:06:43 | and then converts the produced |
---|
0:06:45 | and the target mel-spectrograms to time-domain signals using the Griffin-Lim algorithm. |
---|
0:06:50 | Finally, the joint loss function |
---|
0:06:53 | is used to optimise the WaveTTS model. |
---|
0:06:58 | We also use SI-SDR, |
---|
0:07:01 | that is, scale-invariant signal-to-distortion ratio, |
---|
0:07:04 | to measure the distortion between the generated waveform and the target natural speech. |
---|
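To make the metric concrete, here is a short NumPy sketch of the standard scale-invariant SDR computation; this is my own illustration of the commonly used definition, not the authors' implementation.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    # Remove any DC offset so the measure depends only on the waveform shape.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target that best matches the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target      # target component of the estimate
    noise = estimate - projection    # residual distortion
    return 10.0 * np.log10((np.sum(projection ** 2) + eps) /
                           (np.sum(noise ** 2) + eps))
```

A higher SI-SDR means the generated waveform is closer to the target natural speech, up to a global scaling factor.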
0:07:12 | We note that SI-SDR is evaluated |
---|
0:07:16 | only during training, and is |
---|
0:07:22 | not required at runtime |
---|
0:07:25 | inference. |
---|
0:07:27 | Now I would like to move on to the experiment part. |
---|
0:07:31 | We performed the TTS |
---|
0:07:33 | experiments on the LJSpeech database. |
---|
0:07:36 | We developed four systems for a comparative study. |
---|
0:07:41 | The first one is Tacotron-GL. |
---|
0:07:44 | This system has only a frequency-domain loss function. |
---|
0:07:48 | The Griffin-Lim |
---|
0:07:49 | algorithm is used to generate the waveform at runtime. |
---|
0:07:55 | The second one is Tacotron-WaveRNN. |
---|
0:07:58 | This system |
---|
0:07:59 | also has only a frequency-domain loss function. |
---|
0:08:02 | However, |
---|
0:08:04 | a WaveRNN vocoder is used to generate the waveform at runtime. |
---|
0:08:08 | The third one is WaveTTS-GL. |
---|
0:08:11 | It means that the proposed WaveTTS model is trained with the joint time-frequency domain loss; |
---|
0:08:17 | the Griffin-Lim algorithm is used during both training and runtime synthesis. |
---|
0:08:21 | The last one is WaveTTS-WaveRNN. It means that the proposed WaveTTS |
---|
0:08:26 | model is trained with the joint time-frequency domain loss; |
---|
0:08:30 | the Griffin-Lim algorithm is used during training, and the pre-trained WaveRNN vocoder |
---|
0:08:35 | is used to synthesise speech |
---|
0:08:38 | at runtime. |
---|
0:08:41 | We also compare these systems with the ground truth speech, |
---|
0:08:44 | denoted as GT. |
---|
0:08:47 | At runtime, |
---|
0:08:48 | Tacotron-GL and WaveTTS-GL use the Griffin-Lim |
---|
0:08:53 | algorithm with sixty-four iterations. |
---|
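As a rough illustration of this reconstruction step, the sketch below inverts a mel-spectrogram with 64 Griffin-Lim iterations using librosa; the file name, sample rate, and STFT parameters are hypothetical placeholders, not values from the paper.

```python
import numpy as np
import librosa

# Hypothetical predicted mel-spectrogram of shape (n_mels, frames).
mel_spec = np.load("predicted_mel.npy")

# Invert the mel-spectrogram to a waveform via Griffin-Lim phase estimation.
waveform = librosa.feature.inverse.mel_to_audio(
    mel_spec,
    sr=22050,        # assumed sample rate
    n_fft=1024,      # assumed STFT size
    hop_length=256,  # assumed hop length
    n_iter=64,       # sixty-four Griffin-Lim iterations, as mentioned above
)
```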
0:08:57 | We conducted listening experiments |
---|
0:09:01 | to evaluate the quality of the synthesised speech. |
---|
0:09:06 | We first evaluated the sound quality of the synthesised speech in terms of mean |
---|
0:09:12 | opinion score, |
---|
0:09:14 | as depicted in Figure 1. |
---|
0:09:16 | We compare Tacotron-GL with WaveTTS-GL to observe the |
---|
0:09:22 | effect |
---|
0:09:24 | of the joint time-frequency domain loss. |
---|
0:09:26 | We believe that this is a fair comparison, |
---|
0:09:30 | as both frameworks use the Griffin-Lim algorithm |
---|
0:09:34 | for waveform generation during training and at runtime. |
---|
0:09:38 | As can be seen in Figure 1, |
---|
0:09:40 | WaveTTS-GL outperforms Tacotron-GL. |
---|
0:09:44 | We then compare |
---|
0:09:45 | Tacotron-WaveRNN |
---|
0:09:47 | and WaveTTS-WaveRNN |
---|
0:09:50 | to investigate how well |
---|
0:09:52 | the predicted mel-spectral features |
---|
0:09:54 | perform |
---|
0:09:55 | with a neural vocoder. |
---|
0:09:58 | We observe that even though WaveTTS is trained with the Griffin-Lim algorithm, |
---|
0:10:03 | it performs better |
---|
0:10:05 | than Tacotron-WaveRNN when the WaveRNN vocoder is available |
---|
0:10:08 | at runtime. |
---|
0:10:10 | We then compare |
---|
0:10:11 | WaveTTS-GL and WaveTTS-WaveRNN |
---|
0:10:15 | in terms of voice quality. |
---|
0:10:17 | We note that both frameworks |
---|
0:10:19 | are trained under the same conditions; |
---|
0:10:21 | however, |
---|
0:10:22 | WaveTTS-WaveRNN uses the WaveRNN vocoder for waveform generation at runtime. |
---|
0:10:28 | As expected, WaveTTS-WaveRNN |
---|
0:10:31 | outperforms WaveTTS-GL. |
---|
0:10:37 | We also conducted an A/B preference test |
---|
0:10:40 | to assess the speech quality of the proposed frameworks. |
---|
0:10:44 | Figure 2 shows that our proposed WaveTTS framework |
---|
0:10:49 | outperforms |
---|
0:10:50 | the baseline system |
---|
0:10:52 | with both the Griffin-Lim and the WaveRNN vocoders |
---|
0:10:57 | at runtime. |
---|
0:11:00 | We further conducted another A/B preference test |
---|
0:11:04 | to examine the effect |
---|
0:11:06 | of the number of Griffin-Lim |
---|
0:11:08 | iterations |
---|
0:11:09 | on the WaveTTS |
---|
0:11:11 | performance. |
---|
0:11:15 | For rapid turnaround, |
---|
0:11:16 | we only apply |
---|
0:11:18 | one and two Griffin-Lim iterations |
---|
0:11:22 | for phase reconstruction, |
---|
0:11:25 | and investigate the effect in terms of voice quality. |
---|
0:11:31 | We observe that |
---|
0:11:33 | a single iteration of the Griffin-Lim algorithm |
---|
0:11:36 | presents better performance |
---|
0:11:38 | than two iterations. |
---|
0:11:42 | Finally, |
---|
0:11:43 | let me conclude this paper. |
---|
0:11:48 | We proposed a novel Tacotron implementation called WaveTTS. |
---|
0:11:52 | We propose to use the scale-invariant signal-to-distortion ratio as the loss |
---|
0:11:58 | function. |
---|
0:11:59 | The proposed WaveTTS framework |
---|
0:12:02 | outperforms the baseline |
---|
0:12:03 | and achieves high-quality synthesised speech. |
---|
0:12:14 | Thank you |
---|
0:12:15 | very much for taking the time to listen to this presentation. |
---|
0:12:19 | If interested, please check our web page for the speech samples. |
---|
0:12:24 | Thank you for your attention. |
---|