0:00:14 | okay hello everybody |
---|
0:00:15 | i'm to right the a from all the image it |
---|
0:00:18 | helsinki finland |
---|
0:00:20 | and i'm gonna talk about HMM-based |
---|
0:00:22 | speech synthesis |
---|
0:00:23 | and how to improve quality |
---|
0:00:26 | by utilizing a glottal source pulse library |
---|
0:00:30 | and this is |
---|
0:00:31 | um |
---|
0:00:32 | made in collaboration with |
---|
0:00:34 | michael lakes |
---|
0:00:35 | on this only and what they one you know |
---|
0:00:37 | from the helsinki that unit was the think E and |
---|
0:00:40 | and a block of and a local |
---|
0:00:42 | um the although rest |
---|
0:00:45 | okay so here's the content |
---|
0:00:47 | of my talk |
---|
0:00:48 | so let's go straight to the background |
---|
0:00:52 | so the basics |
---|
0:00:53 | the goal of text-to-speech is to generate natural-sounding expressive speech |
---|
0:00:58 | from arbitrary text, and there are |
---|
0:01:00 | two major TTS trends |
---|
0:01:02 | one is unit selection, which is based on |
---|
0:01:04 | concatenating pre-recorded |
---|
0:01:06 | acoustic units |
---|
0:01:07 | and this yields |
---|
0:01:08 | um |
---|
0:01:10 | very good quality at its best |
---|
0:01:12 | but the adaptability |
---|
0:01:14 | is somewhat poor |
---|
0:01:16 | the other method is |
---|
0:01:17 | statistical |
---|
0:01:18 | which is based on modeling speech parameters with hidden Markov models |
---|
0:01:22 | and it has better adaptability |
---|
0:01:24 | and |
---|
0:01:25 | this work |
---|
0:01:26 | is about statistical |
---|
0:01:29 | synthesis |
---|
0:01:31 | so |
---|
0:01:33 | but the problem is that the quality is not too good |
---|
0:01:36 | so our proposal for this |
---|
0:01:39 | is |
---|
0:01:39 | to |
---|
0:01:40 | decompose the speech signal into the glottal source |
---|
0:01:43 | signal and the vocal tract transfer function |
---|
0:01:47 | and second |
---|
0:01:48 | um |
---|
0:01:49 | further decompose the glottal source into several parameters |
---|
0:01:53 | and a glottal pulse library |
---|
0:01:56 | and then we model these parameters in a normal |
---|
0:01:59 | HMM-based |
---|
0:02:00 | speech synthesis framework |
---|
0:02:01 | HTS |
---|
0:02:03 | and |
---|
0:02:03 | in the synthesis stage |
---|
0:02:05 | we reconstruct the |
---|
0:02:07 | um |
---|
0:02:08 | source signal from the pulses and the parameters |
---|
0:02:12 | and filter it with the vocal tract filter |
---|
0:02:17 | so the basics: the source of um |
---|
0:02:21 | voiced speech |
---|
0:02:22 | is the glottal excitation |
---|
0:02:25 | and then the |
---|
0:02:26 | signal goes through the vocal tract and then we have speech |
---|
0:02:29 | so we are interested in this glottal excitation very much in this work |
---|
0:02:34 | so |
---|
0:02:36 | and uh |
---|
0:02:36 | in the upper |
---|
0:02:38 | uh figure we have the speech signal |
---|
0:02:41 | and the estimated |
---|
0:02:42 | glottal flow below that |
---|
0:02:44 | and |
---|
0:02:45 | how we can |
---|
0:02:46 | estimate the signal |
---|
0:02:48 | we can for example use a method called glottal inverse filtering |
---|
0:02:52 | which estimates |
---|
0:02:53 | the glottal source signal |
---|
0:02:54 | from the speech signal itself |
---|
0:02:57 | there are several methods |
---|
0:02:58 | to perform this task |
---|
0:03:00 | i won't go further into that |
---|
0:03:03 | but we use |
---|
0:03:05 | a method that is based on iterative |
---|
0:03:07 | LPC |
---|
0:03:08 | uh use of LPC |
---|
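The inverse filtering idea described above can be illustrated with a single-pass LPC analysis. This is only a sketch under simplifying assumptions: the talk mentions an iterative LPC-based method, whereas the code below does one autocorrelation LPC fit and one inverse filter. The function names (`autocorr`, `levinson_durbin`, `inverse_filter`) are hypothetical and chosen for this example.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns the prediction-error filter A(z) as [1, a1, ..., ap].
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) to x, giving the residual (source estimate)."""
    return [sum(a[k] * x[n - k] for k in range(min(len(a), n + 1)))
            for n in range(len(x))]
```

Applied to a signal produced by an all-pole filter driven by an impulse train, the residual is close to the impulse train again, which is the sense in which the filter "undoes" the vocal tract.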
0:03:13 | okay and then to the speech synthesis system |
---|
0:03:16 | so this is |
---|
0:03:17 | very familiar to |
---|
0:03:18 | most of you |
---|
0:03:20 | but i will go through this fast so we have a |
---|
0:03:22 | speech database and then we parameterize it |
---|
0:03:25 | and train |
---|
0:03:26 | the parameters according to the labels |
---|
0:03:30 | and in the synthesis stage |
---|
0:03:31 | we input text and label it, and |
---|
0:03:35 | we can generate parameters according to the labels and so we can reconstruct |
---|
0:03:39 | speech |
---|
0:03:40 | and in this work we are interested in this |
---|
0:03:43 | parameterization and synthesis steps |
---|
0:03:46 | and |
---|
0:03:48 | by improving these we |
---|
0:03:49 | try to make the speech |
---|
0:03:51 | more natural |
---|
0:03:54 | so what we do in speech parameterization is we first |
---|
0:03:57 | window the signal of course and |
---|
0:04:00 | a a mix of tree |
---|
0:04:01 | and do |
---|
0:04:04 | the inverse filtering so we decompose the speech signal |
---|
0:04:07 | into the physiologically corresponding parts which are the |
---|
0:04:11 | glottal source |
---|
0:04:12 | and uh vocal tract |
---|
0:04:15 | we parameterize the vocal tract with LSFs |
---|
0:04:19 | and further |
---|
0:04:21 | parameterize the source with several parameters |
---|
0:04:24 | which are |
---|
0:04:25 | fundamental frequency |
---|
0:04:26 | harmonic-to-noise ratio |
---|
0:04:28 | spectral tilt with LPC and |
---|
0:04:31 | harmonic magnitudes |
---|
0:04:33 | in the lower band |
---|
0:04:35 | and finally |
---|
0:04:37 | we extract the glottal pulses to the library |
---|
0:04:41 | and link the pulses with the corresponding |
---|
0:04:43 | source parameters |
---|
0:04:46 | um |
---|
0:04:47 | so how we do that |
---|
0:04:49 | first |
---|
0:04:50 | we determine the glottal closure instants |
---|
0:04:53 | from the differentiated glottal flow signal |
---|
0:04:57 | and then we extract each complete |
---|
0:05:00 | two-period glottal source segment |
---|
0:05:02 | and window it with the hann window |
---|
0:05:06 | then we link to it |
---|
0:05:08 | the |
---|
0:05:09 | corresponding glottal source parameters which are the energy, fundamental frequency, |
---|
0:05:14 | voice source spectrum, harmonic-to-noise ratio and the harmonics |
---|
0:05:19 | and in addition we store |
---|
0:05:22 | a downsampled ten-millisecond version of the pulse waveform |
---|
0:05:26 | in order to |
---|
0:05:28 | calculate the concatenation cost |
---|
0:05:30 | in the synthesis stage |
---|
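The pulse extraction step described above can be sketched as follows. This is a minimal illustration, assuming the glottal closure instants (GCIs) have already been found and are given as sample indices; the names `hann` and `extract_pulses` are hypothetical, and taking each segment from the previous GCI to the next GCI is an assumption about how "two-period segment" is delimited.

```python
import math


def hann(n):
    """Symmetric Hann window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def extract_pulses(source, gcis):
    """Cut two-period, Hann-windowed segments from a glottal source signal.

    source: list of samples of the estimated glottal source
    gcis:   increasing list of glottal closure instants (sample indices)
    Each pulse spans from the previous GCI to the next GCI, so it is
    roughly centered on the middle GCI.
    """
    pulses = []
    for prev, _cur, nxt in zip(gcis, gcis[1:], gcis[2:]):
        segment = source[prev:nxt + 1]
        window = hann(len(segment))
        pulses.append([s * w for s, w in zip(segment, window)])
    return pulses
```

In a full library each extracted pulse would additionally be stored together with its source parameters, as the talk describes.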
0:05:35 | and the pulse library may consist of hundreds |
---|
0:05:38 | or even thousands of glottal pulses |
---|
0:05:41 | um and here is an example of some of the pulses |
---|
0:05:46 | um um |
---|
0:05:47 | from a male speaker |
---|
0:05:51 | okay |
---|
0:05:52 | and then the synthesis stage |
---|
0:05:55 | so what we do is we want to reconstruct the voiced |
---|
0:05:59 | excitation |
---|
0:06:00 | so we select |
---|
0:06:01 | the |
---|
0:06:02 | best matching pulses |
---|
0:06:04 | according to the |
---|
0:06:05 | source parameters generated by the HMMs |
---|
0:06:08 | and um |
---|
0:06:10 | from the library |
---|
0:06:12 | we scale the magnitude |
---|
0:06:14 | of the pulses |
---|
0:06:15 | and overlap-add them |
---|
0:06:17 | to create the voiced excitation |
---|
0:06:19 | for unvoiced excitation we use only white noise |
---|
0:06:23 | and finally we filter the combined excitation and get synthetic |
---|
0:06:27 | speech |
---|
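The scale-and-overlap-add step for the voiced excitation can be sketched like this. It is an illustration only: the function name `overlap_add`, the per-pulse `gains`, and the convention that pulse centers are spaced by the pitch period in samples are assumptions made for the example, not details given in the talk.

```python
def overlap_add(pulses, periods, gains):
    """Build a voiced excitation by overlap-adding scaled pulses.

    pulses:  list of pulse waveforms (lists of samples)
    periods: periods[i] is the distance in samples from the center of
             pulse i to the center of pulse i + 1 (the pitch period)
    gains:   one scale factor per pulse
    """
    # Place the first pulse so that it starts at sample 0.
    centers = []
    center = len(pulses[0]) // 2
    for period in periods:
        centers.append(center)
        center += period

    length = max(c - len(p) // 2 + len(p) for c, p in zip(centers, pulses))
    out = [0.0] * length
    for pulse, c, gain in zip(pulses, centers, gains):
        start = c - len(pulse) // 2
        for i, value in enumerate(pulse):
            if 0 <= start + i < length:
                out[start + i] += gain * value
    return out
```

For example, two three-sample pulses `[1, 2, 1]` spaced two samples apart, with the second scaled by 0.5, overlap in one sample and sum there.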
0:06:30 | and |
---|
0:06:31 | the pulses are |
---|
0:06:33 | um selected |
---|
0:06:34 | by minimizing the joint cost composed of target |
---|
0:06:37 | and concatenation costs |
---|
0:06:39 | so the target cost is the |
---|
0:06:41 | root mean square error between the voice source parameters |
---|
0:06:44 | generated by the HMMs and the ones stored |
---|
0:06:46 | for each pulse |
---|
0:06:48 | and we can of course have different weights |
---|
0:06:50 | for different parameters |
---|
0:06:52 | to tune |
---|
0:06:53 | this system |
---|
0:06:55 | and the concatenation cost is the RMS |
---|
0:06:58 | error between the downsampled versions of the pulses |
---|
0:07:01 | to be concatenated |
---|
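The two costs described above can be written down in a few lines. This is a sketch under assumptions: parameter vectors are plain lists of numbers, the weights are applied per parameter inside the squared error, and pulses are brought to a common fixed length by linear interpolation before comparison. The names `rms_error`, `resample_linear`, `target_cost` and `concat_cost`, and the default length of 16 points, are hypothetical.

```python
import math


def rms_error(a, b, weights=None):
    """Root mean square error between two equal-length vectors,
    with optional per-element weights on the squared differences."""
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
    return math.sqrt(total / len(a))


def resample_linear(x, n):
    """Resample x to n points by linear interpolation (n >= 2)."""
    out = []
    for k in range(n):
        pos = k * (len(x) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append(x[lo] * (1.0 - frac) + x[hi] * frac)
    return out


def target_cost(generated_params, stored_params, weights=None):
    """Error between HMM-generated parameters and those stored with a pulse."""
    return rms_error(generated_params, stored_params, weights)


def concat_cost(pulse_a, pulse_b, n_points=16):
    """Error between fixed-length versions of two adjacent pulses."""
    return rms_error(resample_linear(pulse_a, n_points),
                     resample_linear(pulse_b, n_points))
```

Setting a weight to zero removes that parameter from the target cost, which is one simple way to tune which source parameters drive the selection.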
0:07:08 | okay here's an example of the |
---|
0:07:11 | um |
---|
0:07:12 | how well this goes |
---|
0:07:14 | so here's |
---|
0:07:15 | the voiced excitation |
---|
0:07:25 | and |
---|
0:07:26 | and |
---|
0:07:29 | and here is the unvoiced excitation |
---|
0:07:32 | less interesting |
---|
0:07:40 | and |
---|
0:07:40 | then combines i one play |
---|
0:07:42 | that two thirty |
---|
0:07:44 | filtered with the vocal tract filter we get finally |
---|
0:07:47 | synthetic speech |
---|
0:07:50 | i |
---|
0:07:52 | a |
---|
0:07:58 | so that was in finnish so probably you didn't understand |
---|
0:08:01 | i have more samples |
---|
0:08:02 | later |
---|
0:08:05 | well first the results |
---|
0:08:07 | so |
---|
0:08:08 | earlier we had used in the same system |
---|
0:08:10 | only one glottal pulse, which was modified |
---|
0:08:13 | according to the voice source parameters and we had the result that |
---|
0:08:16 | it was preferred over the |
---|
0:08:19 | basic STRAIGHT method |
---|
0:08:22 | and |
---|
0:08:23 | we have some samples from |
---|
0:08:24 | from that system |
---|
0:08:35 | and |
---|
0:08:38 | as |
---|
0:08:41 | sky |
---|
0:08:43 | fashion |
---|
0:08:48 | true |
---|
0:08:50 | and |
---|
0:08:51 | a |
---|
0:08:55 | i |
---|
0:08:56 | so we also participated in the Blizzard Challenge |
---|
0:08:59 | two thousand ten |
---|
0:09:00 | with a |
---|
0:09:01 | more more results |
---|
0:09:04 | and here are some samples from that |
---|
0:09:10 | and |
---|
0:09:13 | that's |
---|
0:09:37 | so the quality is is uh |
---|
0:09:39 | quite good |
---|
0:09:40 | and here are samples comparing the single pulse technique and the |
---|
0:09:45 | pulse library technique |
---|
0:09:48 | the |
---|
0:09:49 | i hope you can hear the differences in this |
---|
0:09:52 | the difference is not so big |
---|
0:09:55 | so maybe i play the english versions |
---|
0:10:12 | i |
---|
0:10:24 | i |
---|
0:10:27 | yeah |
---|
0:10:28 | and |
---|
0:10:30 | i heard some |
---|
0:10:32 | some differences |
---|
0:10:33 | yes |
---|
0:10:35 | um |
---|
0:10:36 | okay here are uh spectrograms comparing |
---|
0:10:39 | the |
---|
0:10:40 | difference in quality if you don't hear it |
---|
0:10:42 | you can see for example uh here |
---|
0:10:45 | that the noise |
---|
0:10:46 | is modeled better |
---|
0:10:48 | uh |
---|
0:10:49 | with the pulse library technique compared to the single pulse technique |
---|
0:10:53 | and |
---|
0:10:54 | also voiced fricatives here |
---|
0:10:56 | are modeled better because the single pulse technique couldn't |
---|
0:10:59 | produce |
---|
0:11:00 | um |
---|
0:11:02 | soft pulses |
---|
0:11:04 | and high frequencies as well |
---|
0:11:05 | are more natural |
---|
0:11:11 | and we conducted some |
---|
0:11:14 | listening tests |
---|
0:11:16 | and we found that the |
---|
0:11:18 | pulse library method was slightly preferred over the single pulse technique |
---|
0:11:23 | but the difference was not so great |
---|
0:11:25 | but uh |
---|
0:11:27 | speaker similarity |
---|
0:11:28 | was better |
---|
0:11:31 | and |
---|
0:11:32 | very many |
---|
0:11:34 | um |
---|
0:11:35 | sounds were much more natural |
---|
0:11:38 | yeah |
---|
0:11:38 | this is |
---|
0:11:39 | kind of um |
---|
0:11:41 | concatenation of the source |
---|
0:11:43 | so there are the same problems as in concatenative synthesis |
---|
0:11:47 | so we have some discontinuities there |
---|
0:11:50 | and some more artifacts |
---|
0:11:52 | compared to the consistency of the single pulse technique |
---|
0:12:00 | okay here's the summary so we have |
---|
0:12:03 | a physiologically motivated high quality speech synthesizer |
---|
0:12:06 | and this allows |
---|
0:12:08 | for better modeling and control of the speech parameters |
---|
0:12:12 | a a speech X |
---|
0:12:13 | have take this and |
---|
0:12:15 | this pulse library generates more natural excitation |
---|
0:12:18 | because it |
---|
0:12:19 | a three like and |
---|
0:12:21 | in the three |
---|
0:12:22 | and it is slightly preferred over the single pulse technique |
---|
0:12:26 | and here are the |
---|
0:12:27 | references |
---|
0:12:28 | and i thank you for your attention |
---|
0:12:38 | time for questions |
---|
0:12:39 | try the microphones |
---|
0:12:51 | a one can can i have a question of two |
---|
0:12:54 | oh |
---|
0:12:55 | a unit selection um |
---|
0:12:57 | pitch period |
---|
0:12:58 | yeah could you say something about how large the inventory is and how complex that search is that uh |
---|
0:13:03 | that's potentially a much larger search problem than with |
---|
0:13:05 | phone-sized units. well yeah we are in the |
---|
0:13:09 | initial stage of developing this still |
---|
0:13:11 | we are not experts in concatenative synthesis but |
---|
0:13:14 | we have |
---|
0:13:15 | um tried |
---|
0:13:17 | um |
---|
0:13:18 | various sizes |
---|
0:13:19 | from for example ten pulses to twenty thousand pulses |
---|
0:13:23 | and it depends |
---|
0:13:24 | depends a lot |
---|
0:13:26 | a |
---|
0:13:26 | on the speech material sometimes |
---|
0:13:28 | even a hundred pulses might be almost as good as |
---|
0:13:32 | ten thousand pulses |
---|
0:13:33 | so it's |
---|
0:13:34 | um we are trying to make some sense of how to choose the |
---|
0:13:39 | but also also is that this in that it in that it could be |
---|
0:13:43 | uh |
---|
0:13:43 | i ask with the |
---|
0:13:45 | very few pulses |
---|
0:13:50 | and a question: could you tell me how to choose the appropriate glottal pulse |
---|
0:13:55 | from the library |
---|
0:13:56 | so a great P D S had to choose |
---|
0:13:59 | so a library |
---|
0:14:00 | i to to select very so had to choose a pulse from the and the light yeah |
---|
0:14:04 | yeah we have this uh target cost |
---|
0:14:07 | and concatenation cost |
---|
0:14:08 | and we have weights |
---|
0:14:09 | for both of these |
---|
0:14:11 | the weights are tuned by hand |
---|
0:14:13 | at this moment |
---|
0:14:14 | and the target cost is the |
---|
0:14:16 | RMS error between the source parameters |
---|
0:14:19 | of the library |
---|
0:14:21 | and the ones from the HMMs |
---|
0:14:24 | and |
---|
0:14:25 | the concatenation cost is uh |
---|
0:14:27 | in this but be D D over only over three pulses |
---|
0:14:32 | so it's it it can |
---|
0:14:33 | the similarity between the |
---|
0:14:36 | pulses |
---|
0:14:37 | and it should be as similar as possible |
---|
0:14:40 | a at least we you catch you could start this way |
---|
0:14:44 | and |
---|
0:14:45 | to a total that or is competition of this |
---|
0:14:48 | and then we use Viterbi over a voiced segment |
---|
0:14:52 | to uh |
---|
0:14:53 | optimize this procedure |
---|
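The Viterbi search over a voiced segment mentioned in this answer can be sketched generically. This is an illustration under assumptions, not the speaker's implementation: the function name `viterbi_select` is hypothetical, and the costs are passed in as callables so that any weighting of target and concatenation cost can be plugged in.

```python
def viterbi_select(n_steps, n_cands, target_cost, concat_cost):
    """Pick one library pulse per position so that the summed cost is minimal.

    n_steps:           number of pulse positions in the voiced segment
    n_cands:           number of pulses in the library
    target_cost(t, j): cost of using library pulse j at position t
    concat_cost(i, j): cost of placing pulse j right after pulse i
    Returns a list of library indices, one per position.
    """
    if n_steps == 0 or n_cands == 0:
        return []

    cost = [target_cost(0, j) for j in range(n_cands)]
    backptrs = []
    for t in range(1, n_steps):
        new_cost, ptr = [], []
        for j in range(n_cands):
            # Best predecessor for candidate j at this position.
            best_i = min(range(n_cands),
                         key=lambda i: cost[i] + concat_cost(i, j))
            ptr.append(best_i)
            new_cost.append(cost[best_i] + concat_cost(best_i, j)
                            + target_cost(t, j))
        cost = new_cost
        backptrs.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(n_cands), key=lambda k: cost[k])
    path = [j]
    for ptr in reversed(backptrs):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path
```

With a zero concatenation cost the search reduces to picking the best-matching pulse at each position independently; a large concatenation cost pushes the result toward repeating one pulse. The search is quadratic in the library size per position, which is relevant to the earlier question about inventory size.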
0:14:56 | um i i then that's that's part fifty are in uh team C between |
---|
0:15:00 | choose them go to house |
---|
0:15:02 | and |
---|
0:15:04 | and and and so what's so and the same there's |
---|
0:15:07 | which are generated from the hmms because the are models key in the hmms or |
---|
0:15:12 | since |
---|
0:15:13 | i have another question regarding the same thing |
---|
0:15:16 | i realize that you use line spectral pairs, or LSFs, |
---|
0:15:22 | for modeling the pulses |
---|
0:15:24 | and can you |
---|
0:15:25 | could you give some |
---|
0:15:26 | rationale why you want to choose that and also D |
---|
0:15:30 | still of the same high M S and the is really you are measuring the similarities similarity what contents |
---|
0:15:36 | that's a frequency spectral |
---|
0:15:38 | nature of face whatever |
---|
0:15:42 | um |
---|
0:15:43 | and |
---|
0:15:45 | you mean um |
---|
0:15:46 | here |
---|
0:15:48 | the voice source spectrum |
---|
0:15:50 | here |
---|
0:15:52 | yeah i D the parameterization |
---|
0:15:55 | yeah |
---|
0:15:56 | so the vocal tract spectrum is set up as is a that and i D yeah more or just of |
---|
0:16:01 | the spectrum of the source, yes yeah |
---|
0:16:04 | yeah that's an interesting question |
---|
0:16:07 | the |
---|
0:16:09 | this stems from the fact that firstly |
---|
0:16:12 | we have modeled |
---|
0:16:13 | the spectrum of the source by this |
---|
0:16:16 | and here we have included the |
---|
0:16:18 | parameter here |
---|
0:16:21 | actually i'm not sure |
---|
0:16:23 | not sure that it would be the |
---|
0:16:25 | most uh |
---|
0:16:27 | useful; it is |
---|
0:16:28 | a problem how to choose the best parameters for this |
---|
0:16:32 | and the research is still going on into what are the best parameters |
---|
0:16:35 | for selecting the best pulses |
---|
0:16:38 | so |
---|
0:16:39 | yeah we try to model the spectral tilt |
---|
0:16:41 | and the spectral fine structure with the |
---|
0:16:44 | LSFs |
---|
0:16:45 | of the source |
---|
0:16:47 | as it it's probably be of because they're not |
---|
0:16:49 | really like uh |
---|
0:16:51 | and the most is |
---|
0:16:53 | so the distortion is measured in the frequency domain magnitude |
---|
0:16:57 | a a man yeah L S |
---|
0:16:59 | yeah i i i S are sure that you are measuring the whole seen |
---|
0:17:02 | from the time domain or in the frequency domain or was that was of phase or what |
---|
0:17:08 | no it's only the LSFs |
---|
0:17:10 | okay |
---|
0:17:11 | so we can improve it maybe |
---|
0:17:13 | two but it the frequency domain |
---|
0:17:21 | oh |
---|
0:17:22 | oh |
---|
0:17:26 | how leave you've known you hear in you uh H main seized didn't for the vocal tract parameters |
---|
0:17:32 | so yeah i cannot remember that number |
---|
0:17:38 | okay i think of |
---|
0:17:46 | okay |
---|
0:17:47 | i i Q |
---|