0:00:14 | okay so i'm pleased to introduce the next guest speaker, Keiichi Tokuda from Nagoya Institute of Technology
---|
0:00:22 | he's extremely well known, but for those who don't know, he's the pioneer of statistical speech synthesis, in particular
---|
0:00:30 | HMM-based speech synthesis
---|
0:00:41 | okay
---|
0:00:44 | i've heard
---|
0:00:45 | that
---|
0:00:47 | most
---|
0:00:47 | speech recognition researchers
---|
0:00:50 | regard speech synthesis
---|
0:00:52 | as a messy problem
---|
0:00:55 | that's the reason why
---|
0:00:58 | i'd like to
---|
0:01:00 | talk about a statistical formulation of speech synthesis
---|
0:01:04 | in this presentation
---|
0:01:07 | okay |
---|
0:01:09 | to realise speech synthesis systems |
---|
0:01:12 | many approaches have been proposed |
---|
0:01:15 | before the nineties
---|
0:01:17 | rule-based formant synthesis had been studied
---|
0:01:21 | in this approach, speech parameters are derived by hand-crafted rules
---|
0:01:27 | after the nineties
---|
0:01:28 | the corpus-based concatenative speech synthesis
---|
0:01:31 | approach became dominant
---|
0:01:33 | state of the art |
---|
0:01:35 | speech synthesis systems |
---|
0:01:36 | based on unit selection can generate natural-sounding speech
---|
0:01:41 | in recent years |
---|
0:01:43 | the statistical parametric speech synthesis approach has gained popularity
---|
0:01:49 | it has |
---|
0:01:50 | several advantages
---|
0:01:53 | such as |
---|
0:01:54 | flexibility in voice characteristics |
---|
0:01:57 | a small footprint |
---|
0:01:59 | automatic voice building
---|
0:02:00 | and so on |
---|
0:02:02 | and i think
---|
0:02:03 | the most important
---|
0:02:05 | advantage of the statistical approach
---|
0:02:08 | is that
---|
0:02:09 | we can use mathematically well-defined models and algorithms
---|
0:02:14 | in this talk |
---|
0:02:15 | i would like to discuss how we can formulate |
---|
0:02:18 | and understand the whole speech synthesis process
---|
0:02:22 | including speech feature extraction, acoustic modeling, text processing and so on, in a unified statistical framework
---|
0:02:33 | okay |
---|
0:02:34 | the basic problem |
---|
0:02:36 | of speech synthesis
---|
0:02:38 | can be stated as shown here
---|
0:02:43 | we have a speech database |
---|
0:02:46 | that is
---|
0:02:50 | a set of texts
---|
0:02:52 | and corresponding speech waveforms
---|
0:02:57 | given a text |
---|
0:02:58 | to be synthesized
---|
0:03:00 | we generate the speech waveform corresponding to the text
---|
0:03:07 | the problem can be represented by this equation |
---|
0:03:13 | and it can be solved |
---|
0:03:15 | by estimating the |
---|
0:03:17 | predictive distribution |
---|
0:03:19 | of the speech waveform given these variables
---|
0:03:23 | and then drawing samples |
---|
0:03:26 | from the predicted distribution |
---|
0:03:29 | basically it's quite simple |
---|
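The equation itself is not legible in this transcript; a plausible reconstruction of the Bayesian formulation being described, writing x for the waveform to be synthesized, w for its text, X and W for the database waveforms and texts, and lambda for the acoustic model introduced just below, is:

```latex
p(x \mid w, \mathcal{X}, \mathcal{W})
  = \int p(x \mid w, \lambda)\, p(\lambda \mid \mathcal{X}, \mathcal{W})\, \mathrm{d}\lambda,
\qquad
\hat{x} \sim p(x \mid w, \mathcal{X}, \mathcal{W})
```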
0:03:32 | however |
---|
0:03:35 | estimating that |
---|
0:03:36 | predictive distribution |
---|
0:03:38 | is very hot |
---|
0:03:40 | so |
---|
0:03:42 | we have to introduce an acoustic model
---|
0:03:47 | lambda here denotes the acoustic model, for example an hmm
---|
0:03:52 | and this part correspond to the training part |
---|
0:03:57 | and this part corresponds
---|
0:04:00 | to the generation part |
---|
0:04:03 | first, i'd like to discuss the generation part
---|
0:04:12 | as we know modeling speech waveform |
---|
0:04:14 | directly by |
---|
0:04:17 | acoustic models is very difficult |
---|
0:04:19 | so we have to introduce |
---|
0:04:21 | a parametric representation of the speech waveform
---|
0:04:25 | o here
---|
0:04:27 | is a parametric representation of the speech waveform
---|
0:04:31 | for example cepstrum or mel-cepstrum, together with F zero
---|
0:04:38 | accordingly, the generation part
---|
0:04:43 | is decomposed into these two terms |
---|
0:04:49 | we also know |
---|
0:04:53 | that text should be converted to labels
---|
0:04:57 | because the same text
---|
0:04:59 | can have multiple pronunciations
---|
0:05:02 | parts of speech, lexical stress
---|
0:05:06 | or other information |
---|
0:05:08 | so that generation part |
---|
0:05:11 | is decomposed |
---|
0:05:12 | into these three terms
---|
0:05:17 | text processing |
---|
0:05:19 | and |
---|
0:05:20 | speech parameter generation
---|
0:05:22 | from the acoustic model
---|
0:05:25 | and speech waveform reconstruction |
---|
0:05:31 | and it is difficult to perform the integration and summation
---|
0:05:36 | over all the variables
---|
0:05:40 | so we approximate it by the joint maximization shown here
---|
0:05:46 | however |
---|
0:05:48 | joint maximization is still hard
---|
0:05:51 | so it
---|
0:05:52 | is approximated by a step-by-step maximization problem
---|
0:05:58 | this corresponds to the training part
---|
0:06:01 | and |
---|
0:06:02 | this
---|
0:06:04 | maximization with respect to the labels
---|
0:06:07 | corresponds to
---|
0:06:10 | text analysis
---|
0:06:11 | and this corresponds to
---|
0:06:14 | speech parameter generation from the acoustic model
---|
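A plausible reconstruction (assumed, not read off the slides) of the step-by-step maximizations just described, with L and O the database labels and speech parameters, l the labels of the text w to be synthesized, and o the generated speech parameters:

```latex
\hat{\lambda} = \operatorname*{arg\,max}_{\lambda}\, p(\mathcal{O} \mid \mathcal{L}, \lambda)
\;\;\text{(training)},\qquad
\hat{l} = \operatorname*{arg\,max}_{l}\, P(l \mid w),\quad
\hat{o} = \operatorname*{arg\,max}_{o}\, p(o \mid \hat{l}, \hat{\lambda}),\quad
\hat{x} = \operatorname*{arg\,max}_{x}\, p(x \mid \hat{o})
\;\;\text{(generation)}
```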
0:06:19 | i |
---|
0:06:20 | talked about the generation part |
---|
0:06:24 | but the training part |
---|
0:06:26 | also requires a parametric representation of the speech waveform and labels
---|
0:06:36 | accordingly the |
---|
0:06:38 | training part |
---|
0:06:40 | can be approximated by a step-by-step maximization problem, in a similar manner to the generation part
---|
0:06:49 | that is, labeling
---|
0:06:50 | of the speech database
---|
0:06:53 | feature extraction from the speech database
---|
0:06:56 | and acoustic model training
---|
0:07:01 | as a result |
---|
0:07:03 | the original problem |
---|
0:07:05 | is
---|
0:07:06 | decomposed into these sub-problems
---|
0:07:09 | those for the
---|
0:07:11 | training part and those
---|
0:07:14 | for the generation part
---|
0:07:16 | feature extraction |
---|
0:07:18 | of speech database |
---|
0:07:20 | labeling
---|
0:07:21 | and acoustic model training |
---|
0:07:24 | and text analysis
---|
0:07:25 | of |
---|
0:07:27 | the text |
---|
0:07:28 | to be synthesized |
---|
0:07:29 | and the speech parameter generation from acoustic model |
---|
0:07:33 | and finally we reconstruct the speech waveform
---|
0:07:37 | by sampling from this
---|
0:07:39 | distribution |
---|
0:07:44 | okay |
---|
0:07:46 | i just talked about the |
---|
0:07:48 | mathematical formulation |
---|
0:07:50 | in the following |
---|
0:07:51 | i'd like to explain
---|
0:07:53 | each component step by step
---|
0:07:56 | and then |
---|
0:07:57 | show examples to demonstrate the flexibility of the statistical approach
---|
0:08:04 | and finally give some discussion and conclusions
---|
0:08:10 | okay
---|
0:08:12 | this the overview of an hmm based speech synthesis system |
---|
0:08:17 | the training part is similar to those used in hmm based speech recognition system |
---|
0:08:23 | the essential difference |
---|
0:08:25 | is that the state output vector includes
---|
0:08:29 | not only spectrum parameters |
---|
0:08:32 | for example mel-cepstrum |
---|
0:08:34 | but also excitation parameters, that is, F zero parameters
---|
0:08:39 | on the other hand |
---|
0:08:40 | the synthesis part |
---|
0:08:43 | does the inverse operation of speech recognition |
---|
0:08:48 | that is |
---|
0:08:49 | phoneme hmms |
---|
0:08:50 | are concatenated according to the labels
---|
0:08:54 | derived from the text
---|
0:08:56 | to be synthesized |
---|
0:08:58 | then
---|
0:08:59 | a sequence of speech parameters
---|
0:09:02 | spectrum parameters and F zero parameters
---|
0:09:06 | is determined in such a way that its output probability for the hmm is maximized
---|
0:09:13 | and finally |
---|
0:09:14 | the speech waveform
---|
0:09:17 | is synthesized by using a speech synthesis filter
---|
0:09:21 | and that each part correspond to the |
---|
0:09:25 | sub-problems
---|
0:09:28 | that is
---|
0:09:30 | feature extraction |
---|
0:09:32 | and the model training |
---|
0:09:34 | and that text analysis for the text to be synthesized |
---|
0:09:37 | and speech parameter generation from the trained acoustic model
---|
0:09:43 | and speech waveform reconstruction |
---|
0:09:47 | first |
---|
0:09:48 | i'd like to talk about speech feature extraction
---|
0:09:52 | and speech waveform reconstruction, which correspond to these sub-problems
---|
0:10:03 | it's based on the source-filter model, which imitates human speech production
---|
0:10:10 | in this presentation |
---|
0:10:11 | i assume the |
---|
0:10:13 | system function |
---|
0:10:15 | H of Z is represented by mel-cepstral coefficients
---|
0:10:21 | that is |
---|
0:10:22 | frequency warped cepstral coefficients |
---|
0:10:25 | defined by this equation |
---|
0:10:28 | the frequency warping function defined by this |
---|
0:10:32 | first order allpass system function |
---|
0:10:34 | gives us a good approximation to auditory frequency scales
---|
0:10:40 | with an appropriate choice of the parameter alpha
---|
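As a side note, a minimal sketch of the warped frequency given by that first-order all-pass function; the value alpha = 0.42 for 16 kHz speech is a commonly quoted choice, given here as an assumption rather than a value from the talk:

```python
import numpy as np

def warped_frequency(omega, alpha=0.42):
    """Phase response of the all-pass function (z^-1 - alpha)/(1 - alpha z^-1)."""
    return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))

# for an appropriate alpha the warped scale approximates the mel scale
omega = np.linspace(0.0, np.pi, 9)
print(np.round(warped_frequency(omega), 3))
```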
0:10:45 | by assuming X |
---|
0:10:47 | where X is
---|
0:10:49 | a short segment of a speech waveform
---|
0:10:52 | assuming X is a gaussian process
---|
0:10:55 | we determine c
---|
0:10:57 | the mel-cepstrum
---|
0:10:58 | in such a way that |
---|
0:11:00 | it's likelihood |
---|
0:11:01 | with respect to X |
---|
0:11:04 | is maximized |
---|
0:11:05 | this gives an ML estimation of the mel-cepstral coefficients
---|
0:11:10 | because the likelihood of X
---|
0:11:12 | is convex with respect to c
---|
0:11:15 | the solution can easily be obtained by an iterative algorithm
---|
0:11:22 | okay |
---|
0:11:23 | to resynthesize speech
---|
0:11:26 | H of Z is controlled according to the estimated mel-cepstrum
---|
0:11:32 | and excited by a pulse train
---|
0:11:34 | or white noise
---|
0:11:36 | for voiced and unvoiced segments, respectively
---|
0:11:42 | here, this is the
---|
0:11:43 | pulse train
---|
0:11:47 | and this is white noise
---|
0:11:51 | and the excitation signal is generated based on voiced/unvoiced information and F zero
---|
0:11:58 | extracted from the original speech
---|
0:12:01 | this is the original speech, and this is the excitation signal
---|
0:12:12 | it has the same F zero
---|
0:12:15 | at this point
---|
0:12:17 | and by exciting the speech synthesis filter, controlled by the mel-cepstral coefficient vectors
---|
0:12:24 | with this excitation signal, we can reconstruct the speech waveform
---|
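A minimal sketch of the excitation generation just described, assuming frame-level F0 values with 0 marking unvoiced frames; the frame hop and sample rate are illustrative:

```python
import numpy as np

def make_excitation(f0, fs=16000, hop=80):
    """Pulse train for voiced frames, white noise for unvoiced ones."""
    e = np.zeros(len(f0) * hop)
    phase = 0.0
    for i, f in enumerate(f0):
        if f > 0.0:                       # voiced: pulses at the pitch period
            for n in range(hop):
                phase += f / fs
                if phase >= 1.0:
                    phase -= 1.0
                    e[i * hop + n] = np.sqrt(fs / f)   # roughly unit power
        else:                             # unvoiced: gaussian white noise
            e[i * hop:(i + 1) * hop] = np.random.randn(hop)
            phase = 0.0
    return e

# e = make_excitation(np.array([120.0] * 50 + [0.0] * 50))
# feeding e through the mel-cepstral synthesis filter reconstructs speech
```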
0:12:34 | so now the problem |
---|
0:12:36 | is |
---|
0:12:37 | how we can |
---|
0:12:38 | generate both speech parameters |
---|
0:12:42 | from the text
---|
0:12:43 | to be synthesized, using the corresponding acoustic
---|
0:12:48 | model |
---|
0:12:53 | okay |
---|
0:12:55 | next i'd like to talk about this maximization problem |
---|
0:12:58 | which correspond to acoustic modeling |
---|
0:13:03 | this is a hidden markov model, an hmm, with a left-to-right topology
---|
0:13:09 | which is used in speech recognition system |
---|
0:13:12 | we also use the same structure for speech synthesis |
---|
0:13:16 | please note that the state output probability is defined as |
---|
0:13:21 | a single gaussian, because
---|
0:13:24 | it's enough when we use a speaker-dependent model
---|
0:13:30 | for speech synthesis
---|
0:13:35 | as i explained |
---|
0:13:36 | we need to model not only spectral parameters |
---|
0:13:40 | but also F zero parameters to resynthesize the speech waveform
---|
0:13:44 | therefore the state output vector consists of
---|
0:13:48 | spectrum part |
---|
0:13:50 | and F zero part |
---|
0:13:52 | the spectrum part consists of the mel-cepstral coefficient vector
---|
0:13:58 | and its delta and delta-delta |
---|
0:14:01 | and the F zero part consists of F zero and its delta and delta-delta
---|
0:14:08 | the problem |
---|
0:14:10 | in modeling F zero by an hmm
---|
0:14:14 | is that
---|
0:14:15 | we cannot apply conventional discrete or continuous state output distributions
---|
0:14:21 | because
---|
0:14:22 | the F zero value
---|
0:14:24 | is not defined in the unvoiced region
---|
0:14:27 | that is |
---|
0:14:28 | the observation sequence of F zero is composed of |
---|
0:14:33 | one dimensional continuous values |
---|
0:14:37 | and discrete symbols which represent unvoiced frames
---|
0:14:42 | several heuristic methods have been investigated for handling the unvoiced region
---|
0:14:49 | for example |
---|
0:14:50 | interpolating the gaps
---|
0:14:52 | or substituting random values for the unvoiced regions
---|
0:14:59 | to model this kind of observation sequence in a statistically correct manner
---|
0:15:05 | we have defined a new kind of hmm |
---|
0:15:08 | yeah |
---|
0:15:10 | we refer to it as multi-space probability distribution hmms |
---|
0:15:14 | or msd hmm |
---|
0:15:16 | it includes the discrete hmm and the continuous mixture hmm |
---|
0:15:21 | as special cases |
---|
0:15:24 | and furthermore it can model sequences of
---|
0:15:28 | observation vectors with variable dimensionality, including discrete symbols
---|
0:15:35 | we show the structure of msd hmm |
---|
0:15:38 | specialised for F zero modeling |
---|
0:15:42 | each state |
---|
0:15:43 | has weights |
---|
0:15:45 | which represent |
---|
0:15:47 | the probabilities
---|
0:15:48 | of voiced
---|
0:15:50 | and unvoiced
---|
0:15:52 | and
---|
0:15:53 | a continuous distribution for voiced
---|
0:15:56 | observations
---|
0:15:58 | that is, F zero values
---|
0:16:00 | an em algorithm can easily be derived for training this type of hmm
---|
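A minimal sketch of the multi-space output probability for F zero, under the two-space (voiced/unvoiced) setup described here; the numbers are illustrative:

```python
import numpy as np

def msd_output_prob(obs, w_voiced, mean, var):
    """obs is None for an unvoiced frame, else a continuous (log) F0 value."""
    if obs is None:                  # discrete symbol: zero-dimensional unvoiced space
        return 1.0 - w_voiced
    # continuous one-dimensional voiced space: weight times a Gaussian density
    gauss = np.exp(-0.5 * (obs - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return w_voiced * gauss

print(msd_output_prob(None, 0.8, 4.8, 0.01))   # unvoiced frame
print(msd_output_prob(4.75, 0.8, 4.8, 0.01))   # voiced frame near the mean
```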
0:16:08 | okay |
---|
0:16:09 | by combining the spectrum part and F zero part, the state output distribution
---|
0:16:15 | has a
---|
0:16:16 | multi-stream structure
---|
0:16:18 | like this |
---|
0:16:23 | okay |
---|
0:16:25 | now
---|
0:16:26 | i'd like to talk about
---|
0:16:28 | the model structure
---|
0:16:30 | in speech recognition |
---|
0:16:32 | preceding and succeeding phone identities are regarded as context |
---|
0:16:39 | on the other hand |
---|
0:16:40 | in speech synthesis |
---|
0:16:43 | current phone identity can also be a context |
---|
0:16:47 | because
---|
0:16:48 | unlike in speech recognition, the phone sequence is known
---|
0:16:51 | rather than being a recognition result
---|
0:16:54 | furthermore |
---|
0:16:55 | there are |
---|
0:16:57 | many other contextual factors
---|
0:16:59 | that affect |
---|
0:17:01 | the spectrum
---|
0:17:02 | F zero
---|
0:17:03 | and duration, as shown here
---|
0:17:06 | for example, the number of
---|
0:17:09 | phones in the syllable
---|
0:17:12 | or
---|
0:17:13 | the position of the current syllable in the current word, part of speech, or other linguistic information and so on
---|
0:17:22 | since there are |
---|
0:17:24 | too many combinations |
---|
0:17:26 | it's difficult to have models for all possible combinations
---|
0:17:31 | to avoid the problem in the same manner as hmm based speech recognition |
---|
0:17:36 | we use context-dependent hmms |
---|
0:17:39 | and apply a decision-tree-based context clustering technique to them
---|
0:17:44 | in this figure
---|
0:17:47 | htk-style triphone labels are shown
---|
0:17:51 | however |
---|
0:17:52 | in the case of speech synthesis the label is very long, because it
---|
0:17:58 | includes
---|
0:18:00 | all this information
---|
0:18:02 | so we also list many other questions
---|
0:18:07 | about |
---|
0:18:09 | this information |
---|
0:18:14 | okay |
---|
0:18:15 | remember, spectrum and F zero have their own influential contextual factors, so the streams for spectrum
---|
0:18:23 | and F zero should be clustered independently
---|
0:18:27 | it results in
---|
0:18:30 | a stream-dependent context clustering structure
---|
0:18:34 | as shown here
---|
0:18:38 | in the standard hmm
---|
0:18:40 | the state duration probability is an exponential, decreasing with increasing duration
---|
0:18:49 | however
---|
0:18:50 | it's too simple to control the temporal structure of the speech parameter sequence
---|
0:18:56 | therefore |
---|
0:18:57 | we assume that the state
---|
0:19:00 | durations
---|
0:19:01 | are gaussian
---|
0:19:03 | note that an hmm with an explicit duration model is called
---|
0:19:10 | a hidden semi-markov model
---|
0:19:12 | or HSMM
---|
0:19:15 | and we need a special type of em algorithm for parameter estimation of this model
---|
0:19:23 | okay, as a result, state durations of each hmm
---|
0:19:29 | are modeled
---|
0:19:30 | by a three-dimensional
---|
0:19:33 | gaussian
---|
0:19:35 | and
---|
0:19:36 | context-dependent three-dimensional gaussians
---|
0:19:39 | are clustered by
---|
0:19:41 | a decision tree
---|
0:19:43 | so now we have
---|
0:19:45 | seven decision trees in this example
---|
0:19:48 | three for spectrum, that is, mel-cepstrum
---|
0:19:52 | and three for F zero
---|
0:19:54 | and one for duration
---|
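A minimal sketch of how the duration Gaussians can be used at synthesis time; the speaking-rate rule d = m + rho * sigma^2 is the one commonly cited for HMM-based synthesis and is an assumption here, not a formula shown in the talk:

```python
import numpy as np

def state_durations(means, variances, rho=0.0):
    """Durations in frames from per-state Gaussians; rho = 0 gives the means."""
    d = means + rho * variances
    return np.maximum(1, np.rint(d).astype(int))   # at least one frame per state

print(state_durations(np.array([3.0, 8.0, 4.0]),
                      np.array([1.0, 4.0, 1.5]), rho=0.5))
```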
0:20:00 | okay |
---|
0:20:01 | next i'd like to talk about the second maximization problem |
---|
0:20:05 | which corresponds to speech parameter generation
---|
0:20:09 | from acoustic model |
---|
0:20:12 | by concatenating context-dependent hmms
---|
0:20:16 | according to the labels derived from the text to be synthesized
---|
0:20:21 | a sentence hmm can be
---|
0:20:25 | constructed
---|
0:20:28 | for a given sentence hmm |
---|
0:20:32 | we determine the speech parameter vector sequence |
---|
0:20:35 | o
---|
0:20:37 | which maximizes
---|
0:20:38 | the output probability
---|
0:20:41 | P
---|
0:20:43 | this equation can be approximated by this one
---|
0:20:48 | with the summation replaced by maximization
---|
0:20:52 | and furthermore it can be decomposed into these two maximization problems
---|
0:20:58 | first |
---|
0:21:00 | we determine the state sequence Q hat
---|
0:21:03 | independently of o, then
---|
0:21:05 | we
---|
0:21:06 | determine the
---|
0:21:08 | speech parameter vector sequence
---|
0:21:10 | O
---|
0:21:11 | hat
---|
0:21:12 | for the
---|
0:21:13 | fixed state sequence
---|
0:21:16 | Q hat
---|
0:21:18 | the first problem |
---|
0:21:20 | can be solved
---|
0:21:21 | very easily
---|
0:21:23 | because the state durations are modeled by gaussians
---|
0:21:29 | the solution is simply given by the means of the gaussians
---|
0:21:33 | as shown here
---|
0:21:38 | unfortunately |
---|
0:21:39 | the direct solution of the
---|
0:21:42 | second problem is inappropriate for synthesizing speech
---|
0:21:48 | and this is an example of parameter generation from an hmm
---|
0:21:52 | composed by concatenating phoneme hmms
---|
0:21:58 | each vertical dotted line
---|
0:22:03 | represents a state boundary
---|
0:22:08 | we assume that the covariance matrices are diagonal
---|
0:22:12 | so each state has its means and variance |
---|
0:22:16 | for example this |
---|
0:22:19 | horizontal dotted line |
---|
0:22:21 | represents the mean of this state, and the shaded area
---|
0:22:26 | represents the
---|
0:22:27 | variance
---|
0:22:28 | of this state
---|
0:22:31 | by maximizing the output probability |
---|
0:22:34 | the parameter sequence becomes the mean vector sequence |
---|
0:22:39 | resulting in a step-wise function like this
---|
0:22:42 | because |
---|
0:22:43 | this is the most likely sequence for the sequence of state output gaussians
---|
0:22:50 | and these jumps
---|
0:22:52 | cause discontinuities in the synthetic speech
---|
0:22:59 | to avoid
---|
0:23:00 | the problem
---|
0:23:02 | we assume that each state output vector O
---|
0:23:06 | consists of the mel-cepstral coefficient
---|
0:23:09 | vector
---|
0:23:10 | and its dynamic feature vectors
---|
0:23:13 | delta and delta-delta |
---|
0:23:15 | which correspond to the first |
---|
0:23:17 | and second derivatives |
---|
0:23:19 | of the speech parameter vector c
---|
0:23:23 | and can be calculated as a linear combination of neighboring speech parameter vectors
---|
0:23:31 | most of speech recognition systems also use this type of speech parameters |
---|
0:23:35 | and |
---|
0:23:36 | the relationship
---|
0:23:39 | between
---|
0:23:41 | c, delta c, and delta-delta c can be arranged in a matrix form
---|
0:23:47 | as shown here |
---|
0:23:49 | here c is the
---|
0:23:51 | mel-cepstral coefficient vector
---|
0:23:54 | and with its delta and delta-delta it constitutes the state output vector
---|
0:24:03 | C includes all the
---|
0:24:06 | mel-cepstral coefficient vectors for the utterance
---|
0:24:09 | and W is the matrix for calculating the deltas
---|
0:24:16 | under this constraint
---|
0:24:18 | on o
---|
0:24:20 | maximizing P with respect to o
---|
0:24:24 | is equivalent to maximizing it with respect to c
---|
0:24:28 | thus, by setting the derivative equal to zero, we obtain a set of linear equations
---|
0:24:34 | which can be written in
---|
0:24:36 | matrix form
---|
0:24:38 | the dimensionality
---|
0:24:40 | of the equation is very high
---|
0:24:43 | for example tens of thousands, because C includes all the mel-cepstral coefficient vectors for the utterance
---|
0:24:52 | fortunately
---|
0:24:53 | by using the special structure of this matrix
---|
0:24:58 | it's a very sparse matrix
---|
0:25:00 | it can be solved by a
---|
0:25:02 | fast algorithm
---|
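A minimal sketch of that generation step for a one-dimensional static feature with delta and delta-delta windows; the window coefficients are common choices, assumed rather than taken from the slides:

```python
import numpy as np

def generate_trajectory(means, variances):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c.

    means, variances: (T, 3) per-frame Gaussian statistics ordered as
    [static, delta, delta-delta]; returns the (T,) static sequence.
    """
    T = len(means)
    windows = [
        (0, [1.0]),                # static
        (1, [-0.5, 0.0, 0.5]),     # delta: first derivative
        (1, [1.0, -2.0, 1.0]),     # delta-delta: second derivative
    ]
    W = np.zeros((3 * T, T))       # o = W c, stacking 3 rows per frame
    for t in range(T):
        for d, (width, coeffs) in enumerate(windows):
            for k, w in zip(range(-width, width + 1), coeffs):
                if 0 <= t + k < T:
                    W[3 * t + d, t + k] = w
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)        # diagonal Sigma^-1
    A = W.T @ (prec[:, None] * W)             # banded, sparse in practice
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)              # a banded/Cholesky solver makes this fast

T = 5
rng = np.random.default_rng(0)
print(generate_trajectory(rng.normal(size=(T, 3)), np.ones((T, 3))))
```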
0:25:06 | okay |
---|
0:25:07 | this is an example of |
---|
0:25:09 | parameter generation |
---|
0:25:12 | from a sentence hmm using dynamic feature parameters
---|
0:25:18 | this shows |
---|
0:25:21 | the trajectory |
---|
0:25:23 | of the |
---|
0:25:24 | second coefficient
---|
0:25:26 | of the generated mel-cepstrum
---|
0:25:30 | sequence |
---|
0:25:32 | and |
---|
0:25:33 | these
---|
0:25:34 | show its delta
---|
0:25:36 | and delta-delta which correspond to the first |
---|
0:25:40 | and second derivatives of the |
---|
0:25:43 | trajectory |
---|
0:25:46 | these three |
---|
0:25:47 | trajectories are constrained by each other
---|
0:25:51 | and are determined consistently
---|
0:25:54 | by maximizing
---|
0:25:56 | the total output probability
---|
0:25:59 | as a result
---|
0:26:00 | the trajectory
---|
0:26:03 | is constrained to be realistic, as defined by the statistics
---|
0:26:08 | of the static and dynamic features
---|
0:26:15 | you may have noticed that |
---|
0:26:19 | P of o
---|
0:26:21 | is improper as a distribution of c
---|
0:26:25 | because it's not normalized
---|
0:26:27 | with respect to c
---|
0:26:29 | interestingly, by normalizing
---|
0:26:32 | P with
---|
0:26:34 | respect to c, we can derive a new type of trajectory model, which we call the
---|
0:26:40 | trajectory hmm |
---|
0:26:42 | i'm sorry, but i won't go into details in this presentation
---|
0:26:50 | okay, these figures show the spectra calculated from the mel-cepstrum vectors generated
---|
0:26:56 | without dynamic feature parameters
---|
0:26:59 | and with dynamic feature parameters, respectively
---|
0:27:03 | it can be seen that by taking into account
---|
0:27:06 | the dynamic feature parameters
---|
0:27:10 | a smoothly varying sequence of spectra can be obtained
---|
0:27:16 | and these show the generated F zero contours
---|
0:27:19 | without the dynamic features
---|
0:27:22 | the generated F zero sequence becomes a step-wise function
---|
0:27:27 | on the other hand, by taking into account the dynamic features
---|
0:27:31 | we can generate F zero trajectories
---|
0:27:34 | which approximate the natural F zero contour
---|
0:27:40 | okay |
---|
0:27:41 | now i would like to play some synthesized speech samples to demonstrate the effect of dynamic features
---|
0:27:49 | in speech parameter generation |
---|
0:27:54 | this was synthesized
---|
0:27:57 | from the model trained with both
---|
0:28:00 | static and dynamic features
---|
0:28:04 | and this was synthesized
---|
0:28:06 | without the spectrum dynamic features
---|
0:28:09 | and this was
---|
0:28:11 | synthesized without the
---|
0:28:13 | F zero dynamic features
---|
0:28:15 | and this was synthesized without both the spectrum and F zero dynamic features
---|
0:28:22 | let me play this one
---|
0:28:26 | [Japanese speech sample, with both static and dynamic features]
---|
0:28:34 | and now without the
---|
0:28:37 | spectrum dynamic features; you may perceive frequent discontinuities in this sample
---|
0:28:45 | [Japanese speech sample]
---|
0:28:51 | and now without the F zero dynamic features
---|
0:28:54 | in this case you may perceive a different type of discontinuity
---|
0:28:59 | [Japanese speech sample] and without both, you may perceive serious discontinuities
---|
0:29:09 | [Japanese speech sample]
---|
0:29:18 | yep |
---|
0:29:19 | from these examples we can see the importance of dynamic features in hmm-based speech synthesis
---|
0:29:30 | okay |
---|
0:29:31 | in the next part |
---|
0:29:33 | i'd like to show some
---|
0:29:34 | examples |
---|
0:29:36 | to demonstrate the flexibility of the statistical approach |
---|
0:29:43 | first i'd like to show an example of emotional speech synthesis |
---|
0:29:50 | i'm sorry
---|
0:29:52 | this is a very old demo, so the speech quality is
---|
0:29:58 | limited
---|
0:30:01 | and |
---|
0:30:02 | this sample was synthesized from a model trained with
---|
0:30:07 | neutral speech
---|
0:30:10 | and this was synthesized from the model trained with angry
---|
0:30:16 | speech
---|
0:30:17 | in this case, again, i'm sorry that it's in japanese
---|
0:30:22 | this is english translation |
---|
0:30:25 | first, the neutral voice
---|
0:30:27 | [Japanese sample] okay, it has flat prosody
---|
0:30:34 | and from the angry model
---|
0:30:37 | [Japanese sample]
---|
0:30:41 | okay |
---|
0:30:42 | another sentence
---|
0:30:44 | neutral
---|
0:30:45 | [Japanese sample]
---|
0:30:48 | [Japanese sample, angry]
---|
0:30:52 | it sounds like he's angry
---|
0:30:56 | and we can see that by training the system with a small amount of emotional speech data
---|
0:31:01 | we can synthesize emotional speech very easily; it's not necessary
---|
0:31:05 | to hand-craft heuristic rules for emotional speech synthesis
---|
0:31:12 | next, let me show an example of speaker adaptation in speech synthesis
---|
0:31:17 | we applied the speaker adaptation technique used in speech recognition, MLLR, to the synthesis system
---|
0:31:26 | here, this is the speaker-independent model
---|
0:31:31 | and it was adapted to a target speaker
---|
0:31:35 | and this is
---|
0:31:36 | the adapted model
---|
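A minimal sketch of the MLLR idea: every Gaussian mean is passed through a shared affine transform mu' = A mu + b estimated from a little adaptation data. The least-squares estimate below ignores covariances and occupancy weighting for brevity, so it is a simplification of real MLLR:

```python
import numpy as np

def estimate_transform(means, adapted_stats):
    """Fit [A; b] minimizing || [mu, 1] W - target ||^2 over the Gaussians."""
    X = np.hstack([means, np.ones((len(means), 1))])
    W, *_ = np.linalg.lstsq(X, adapted_stats, rcond=None)
    return W                                   # shape (D + 1, D)

def adapt_mean(mu, W):
    return np.append(mu, 1.0) @ W              # mu' = A mu + b

rng = np.random.default_rng(0)
si_means = rng.normal(size=(20, 3))            # speaker-independent means
targets = si_means @ (np.eye(3) * 1.1) + 0.3   # simulated target-speaker stats
W = estimate_transform(si_means, targets)
print(adapt_mean(si_means[0], W))
```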
0:31:39 | okay this samples |
---|
0:31:41 | is |
---|
0:31:42 | since that from the |
---|
0:31:45 | speaker independent model |
---|
0:31:48 | for channel sometime recognition |
---|
0:31:51 | okay i'm sorry |
---|
0:31:53 | it's in japanese |
---|
0:31:55 | and this was synthesized
---|
0:31:58 | oh, sorry, this is natural speech
---|
0:32:00 | of the target speaker
---|
0:32:04 | [Japanese sample] and this is synthesized speech, but it has the target speaker's
---|
0:32:08 | voice characteristics
---|
0:32:11 | [Japanese sample]
---|
0:32:13 | and this was synthesized from the adapted model
---|
0:32:18 | with
---|
0:32:20 | four utterances
---|
0:32:22 | [Japanese sample] and with fifty utterances
---|
0:32:27 | [Japanese sample] let me play them again
---|
0:32:32 | speaker-independent model
---|
0:32:34 | [Japanese sample] four utterances
---|
0:32:38 | [Japanese sample]
---|
0:32:42 | fifty utterances
---|
0:32:45 | [Japanese sample]
---|
0:32:48 | if |
---|
0:32:48 | these three sound |
---|
0:32:51 | very similar, it means that the system can mimic the target speaker's voice using a very small amount of
---|
0:32:58 | adaptation data |
---|
0:33:00 | and then we have another sample |
---|
0:33:02 | mimicking a
---|
0:33:04 | famous person's voice
---|
0:33:10 | [sample] "nagoya institute of technology, NIT, was founded in nineteen O five as nagoya higher technical school, a pioneering academic institution dedicated
---|
0:33:18 | to industrial education"
---|
0:33:21 | can you tell who he is
---|
0:33:24 | yes |
---|
0:33:26 | you're right |
---|
0:33:28 | please note that |
---|
0:33:31 | this was done by Junichi Yamagishi at CSTR of the university of edinburgh
---|
0:33:38 | and
---|
0:33:39 | it was synthesized by the system adapted to that famous person's voice
---|
0:33:47 | okay |
---|
0:33:48 | next example is speaker interpolation in speech synthesis |
---|
0:33:53 | when we have several speaker dependent hmm sets |
---|
0:33:57 | by interpolating among the hmm parameters |
---|
0:34:01 | means and variances |
---|
0:34:04 | we can generate a new hmm set |
---|
0:34:07 | which corresponds to a new voice
---|
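A minimal sketch of that interpolation for a single Gaussian; interpolating variances as a weighted sum is one simple choice among several studied:

```python
import numpy as np

def interpolate(mean_a, var_a, mean_b, var_b, ratio):
    """ratio = 0 gives speaker A, ratio = 1 gives speaker B."""
    mean = (1.0 - ratio) * mean_a + ratio * mean_b
    var = (1.0 - ratio) * var_a + ratio * var_b
    return mean, var

female = (np.array([1.0, 2.0]), np.array([0.5, 0.5]))
male = (np.array([3.0, 1.0]), np.array([0.7, 0.4]))
print(interpolate(*female, *male, ratio=0.5))   # halfway between the voices
```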
0:34:10 | in this case we have two speaker-dependent models
---|
0:34:15 | one |
---|
0:34:16 | is trained by a female speaker |
---|
0:34:19 | and one |
---|
0:34:20 | is trained by a male speaker |
---|
0:34:24 | okay let me play |
---|
0:34:26 | speech samples |
---|
0:34:27 | synthesized from |
---|
0:34:29 | the female model
---|
0:34:36 | [sample]
---|
0:34:42 | and this was synthesized from the male speaker's model
---|
0:34:49 | [sample] and we can interpolate between these two models with an arbitrary interpolation ratio
---|
0:34:59 | this is the center of
---|
0:35:00 | the two
---|
0:35:02 | models
---|
0:35:03 | we cannot tell whether he or she is
---|
0:35:06 | male or female
---|
0:35:09 | [sample]
---|
0:35:14 | and
---|
0:35:15 | we can change the interpolation ratio linearly within an utterance
---|
0:35:21 | from female to male
---|
0:35:24 | [sample]
---|
0:35:37 | it sounds
---|
0:35:38 | male finally
---|
0:35:41 | and this is the same, except we have four speaker-dependent
---|
0:35:46 | models
---|
0:35:47 | the first speaker
---|
0:35:48 | [sample] the second speaker [sample] the third speaker [sample] and the fourth speaker [sample]
---|
0:36:06 | and this is the center of these four speakers
---|
0:36:14 | and then we can also change the interpolation ratio
---|
0:36:17 | [sample]
---|
0:36:33 | it is interesting
---|
0:36:34 | but
---|
0:36:35 | how could this be used
---|
0:36:42 | if we train each model
---|
0:36:45 | with a specific speaking style, we can interpolate among speaking styles, and it could be useful for spoken dialogue systems
---|
0:36:54 | in this case |
---|
0:36:55 | we have two models |
---|
0:36:57 | one trained with
---|
0:36:59 | a neutral
---|
0:37:01 | reading voice
---|
0:37:03 | and one trained with
---|
0:37:06 | a high-tension voice
---|
0:37:07 | by the same speaker |
---|
0:37:10 | okay |
---|
0:37:11 | first, the neutral voice
---|
0:37:14 | [sample]
---|
0:37:18 | and the high-tension model
---|
0:37:21 | [sample]
---|
0:37:25 | if you feel it's too much |
---|
0:37:28 | we can adjust the
---|
0:37:30 | degree of the expression by interpolating between two models |
---|
0:37:35 | for example this one |
---|
0:37:41 | and we can
---|
0:37:43 | also extrapolate the two models
---|
0:37:47 | let me replay all of them
---|
0:37:50 | in this order
---|
0:37:53 | [samples]
---|
0:38:09 | please note that |
---|
0:38:11 | it's not just changing the average F zero; the whole prosody is changed
---|
0:38:18 | okay |
---|
0:38:19 | next example is eigenvoice |
---|
0:38:22 | the eigenvoice technique was developed for very fast speaker adaptation in speech recognition
---|
0:38:30 | in speech synthesis |
---|
0:38:32 | it can be used for creating new voices |
---|
0:38:37 | let me show something more
---|
0:38:56 | okay |
---|
0:38:58 | these represent the weights
---|
0:39:01 | for the eigenvoices
---|
0:39:03 | by adjusting them we can find a favourite voice
---|
0:39:09 | each eigenvoice, the first eigenvoice, the second eigenvoice
---|
0:39:12 | and so on
---|
0:39:14 | may correspond to a specific voice characteristic
---|
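A minimal sketch of the eigenvoice construction: stack each training speaker's model means into a supervector, take principal components, and build a new voice as the mean voice plus a weighted sum of eigenvoices; all dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
supervectors = rng.normal(size=(10, 50))     # 10 speakers' stacked means

mean_voice = supervectors.mean(axis=0)
# principal components of the speaker space are the eigenvoices
_, _, vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
eigenvoices = vt[:2]                         # keep the first two

def new_voice(weights):
    """Weights on the eigenvoices pick a point in the speaker space."""
    return mean_voice + weights @ eigenvoices

print(new_voice(np.array([-1.5, 0.3]))[:5])  # first few means of a new voice
```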
0:39:20 | let me play some speech samples
---|
0:39:25 | for the
---|
0:39:27 | first eigenvoice with a negative weight
---|
0:39:33 | [sample]
---|
0:39:38 | and now with a
---|
0:39:39 | positive weight for the first eigenvoice
---|
0:39:42 | [sample]
---|
0:39:48 | i'm sorry, this is the maximum weight available
---|
0:39:52 | and the second eigenvoice with a negative weight
---|
0:40:00 | [sample] and with a
---|
0:40:01 | positive weight
---|
0:40:03 | [sample] okay, and another eigenvoice
---|
0:40:14 | with a
---|
0:40:15 | negative and a positive weight
---|
0:40:19 | [samples]
---|
0:40:24 | and by setting the weights appropriately we can generate various voices
---|
0:40:29 | and find a favourite voice
---|
0:40:35 | [sample]
---|
0:40:43 | this is better
---|
0:40:44 | okay
---|
0:40:49 | anyway, this shows the flexibility of the statistical approach to speech synthesis
---|
0:41:05 | okay |
---|
0:41:06 | similarly to other corpus based approaches |
---|
0:41:09 | the hmm-based
---|
0:41:10 | system has a
---|
0:41:12 | very compact language-dependent part
---|
0:41:16 | and can easily be applied to other languages
---|
0:41:20 | i'd like to play some of them
---|
0:41:24 | japanese [sample]
---|
0:41:29 | english
---|
0:41:30 | [sample] chinese
---|
0:41:34 | [sample]
---|
0:41:40 | korean
---|
0:41:41 | [sample] and finnish
---|
0:41:46 | [sample]
---|
0:41:53 | and this is also english, but trained with
---|
0:41:58 | a child's
---|
0:41:59 | voice
---|
0:42:04 | [sample]
---|
0:42:07 | okay |
---|
0:42:09 | and now |
---|
0:42:10 | the next example
---|
0:42:12 | shows that
---|
0:42:13 | even
---|
0:42:14 | a singing voice can be used as training data
---|
0:42:18 | as a result
---|
0:42:20 | the system can sing any piece of music
---|
0:42:23 | with his or her voice and singing style
---|
0:42:28 | and
---|
0:42:30 | this is one piece of the training data
---|
0:42:37 | [singing sample]
---|
0:42:44 | she is a semi-professional singer
---|
0:42:48 | and now |
---|
0:42:55 | this sample
---|
0:42:56 | was
---|
0:42:57 | synthesized
---|
0:42:58 | by using the trained acoustic models, although
---|
0:43:03 | she has not
---|
0:43:05 | sung this song
---|
0:43:11 | [singing sample]
---|
0:43:23 | maybe it sounds natural
---|
0:43:25 | but she has never sung this song
---|
0:43:28 | it's synthesized
---|
0:43:31 | okay |
---|
0:43:34 | this is the final part |
---|
0:43:40 | i'd like to show the
---|
0:43:42 | basic problem of speech synthesis again, this one
---|
0:43:49 | solving this problem directly
---|
0:43:53 | based on
---|
0:43:56 | this equation
---|
0:43:59 | is ideal
---|
0:44:00 | but we have to decompose it into tractable sub-problems, because the
---|
0:44:08 | direct solution is not feasible with currently available computational resources
---|
0:44:15 | however, we can relax the approximations
---|
0:44:21 | for example |
---|
0:44:23 | by marginalizing the model parameters
---|
0:44:26 | the acoustic model parameters of the hmm
---|
0:44:28 | we can derive a variational bayesian acoustic modeling technique for speech synthesis
---|
0:44:34 | or
---|
0:44:36 | by marginalizing the labels
---|
0:44:38 | we can derive
---|
0:44:40 | joint front-end and back-end model training
---|
0:44:44 | here front-end means the text processing part
---|
0:44:47 | and back-end the acoustic model
---|
0:44:50 | or by including the
---|
0:44:53 | speech waveform generation part in the statistical model, we can also derive a better waveform-level
---|
0:45:01 | statistical model
---|
0:45:03 | anyway please note that |
---|
0:45:05 | these kinds of improved techniques
---|
0:45:08 | can be derived
---|
0:45:09 | based on
---|
0:45:11 | this equation, which represents
---|
0:45:13 | the basic problem
---|
0:45:15 | of speech synthesis
---|
0:45:20 | okay |
---|
0:45:22 | to summarize this presentation
---|
0:45:25 | i have talked about the statistical formulation of the speech synthesis problem
---|
0:45:30 | the whole speech synthesis process is described in a statistical framework
---|
0:45:35 | and it gives us a unified view and reveals
---|
0:45:39 | what is correct and what is wrong |
---|
0:45:43 | another point i should |
---|
0:45:46 | emphasize is
---|
0:45:47 | the importance of the database |
---|
0:45:51 | as for future work
---|
0:45:52 | we still have many problems
---|
0:45:55 | which we should solve |
---|
0:45:57 | based on |
---|
0:46:02 | the |
---|
0:46:03 | equation which represents the speech synthesis problem
---|
0:46:11 | okay, this is the
---|
0:46:13 | final slide
---|
0:46:15 | is speech synthesis
---|
0:46:17 | a messy problem
---|
0:46:20 | no, i don't think so
---|
0:46:23 | i would be happy
---|
0:46:25 | if many speech recognition researchers joined speech synthesis research
---|
0:46:31 | it would be very helpful
---|
0:46:33 | to this research area
---|
0:46:35 | that's all thank you very much |
---|
0:46:57 | okay, thanks for such a good talk; we have some time for questions
---|
0:47:01 | michael |
---|
0:47:03 | oh thank you very much for a wonderful talk on speech synthesis at some point in the future i guess |
---|
0:47:09 | we don't even have to have our presenters make presentations anymore we could just synthesise them i would like to |
---|
0:47:16 | do that |
---|
0:47:18 | and then it's not me speaking
---|
0:47:21 | so one of the quest one of things you alluded to at the end of your talk i was wondering |
---|
0:47:25 | if you could elaborate a little bit more |
---|
0:47:28 | one of the problems you can still hear in some of the examples you played is a certain buzziness
---|
0:47:33 | dependent upon the speaker in the quality of the final waveform generation; i'm just wondering if you could say a
---|
0:47:39 | few words about some of the current
---|
0:47:42 | techniques that are being looked at in order to improve the quality of the waveform generation; the model you showed
---|
0:47:51 | at the beginning of the talk is still a relatively simple excitation-plus-spectral-envelope sort of model, and
---|
0:47:58 | i know people have looked at fancier stuff
---|
0:48:01 | i'm just wondering if you have some comments as to what you think are interesting, promising directions to improve the
---|
0:48:08 | quality of the waveform generation, yep
---|
0:48:11 | i didn't |
---|
0:48:13 | mention it, but in the newest system we are using the STRAIGHT vocoding technique, and it can improve
---|
0:48:22 | the speech quality very much |
---|
0:48:25 | however |
---|
0:48:27 | i'm afraid that |
---|
0:48:29 | it's not based on |
---|
0:48:32 | the statistical
---|
0:48:34 | framework
---|
0:48:36 | so
---|
0:48:37 | i would like to include that kind of technique
---|
0:48:42 | the vocoding part should be included in this equation
---|
0:48:47 | that must be |
---|
0:48:49 | this one |
---|
0:48:50 | but
---|
0:48:51 | currently |
---|
0:48:53 | we still |
---|
0:48:54 | use |
---|
0:48:55 | many approximations
---|
0:48:57 | for example |
---|
0:48:58 | the formulation |
---|
0:49:00 | is |
---|
0:49:02 | correct |
---|
0:49:02 | for gaussian samples
---|
0:49:05 | it is right for unvoiced segments
---|
0:49:07 | however it's
---|
0:49:08 | not appropriate for
---|
0:49:10 | periodic excitation, that is, voiced segments
---|
0:49:12 | so we need a more sophisticated speech waveform generation model
---|
0:49:17 | and i believe that |
---|
0:49:19 | that kind of
---|
0:49:21 | problem
---|
0:49:22 | can be solved by including the vocoder
---|
0:49:25 | in
---|
0:49:27 | the statistical framework
---|
0:49:32 | hi yes i have a couple of questions related to the smoothing of the cepstral coefficients we talked about |
---|
0:49:38 | so the use of the deltas and double deltas gives you the smoothed
---|
0:49:42 | cepstral coefficients
---|
0:49:45 | how is how important is that relative to say representing or generating static coefficients and then perhaps applying a moving |
---|
0:49:55 | a moving average filter some somewhere |
---|
0:49:58 | smoothing like that |
---|
0:49:59 | okay, good question
---|
0:50:04 | i had a
---|
0:50:06 | slide
---|
0:50:08 | about this one
---|
0:50:13 | wait a moment
---|
0:50:47 | oh, i'm sorry it was not in the slides, but anyway
---|
0:50:54 | the delta is very effective
---|
0:50:57 | and
---|
0:50:58 | the delta-delta is not so effective
---|
0:51:00 | and of course we can apply some heuristic smoothing, by filtering or something like that
---|
0:51:07 | it's
---|
0:51:09 | still effective
---|
0:51:11 | but it is worse
---|
0:51:13 | than using the delta and delta-delta parameters
---|
0:51:17 | in terms of MOS scores
---|
0:51:20 | i'm sorry, i can't find it. one other following question on that: when you set up those linear equations, do
---|
0:51:26 | you |
---|
0:51:27 | weight
---|
0:51:29 | all the different dimensions equally? the deltas and double deltas and the static coefficients, are those all weighted
---|
0:51:36 | equally when you |
---|
0:51:37 | solve the least squares equations |
---|
0:51:40 | yeah, there are no weights
---|
0:51:48 | we have
---|
0:51:50 | no heuristic weights or other operations
---|
0:51:54 | we just have the
---|
0:51:59 | definition of the probability
---|
0:52:02 | and just
---|
0:52:03 | maximize it
---|
0:52:08 | so i have a question |
---|
0:52:10 | so obviously we've
---|
0:52:13 | used hmms in text-to-speech and
---|
0:52:15 | speech recognition, but a
---|
0:52:18 | frequent comment, and it came up in many of the talks, is this whole question of how good
---|
0:52:23 | an hmm model of speech is; the
---|
0:52:26 | received wisdom is either that it's kind of terrible, or
---|
0:52:30 | it's terrible but so tractable and useful that we use it anyway; in speech synthesis, do you think that the success of this
---|
0:52:36 | technique
---|
0:52:37 | has
---|
0:52:38 | in fact demonstrated hmms are a good model of speech
---|
0:52:41 | because i think
---|
0:52:42 | the quality is far
---|
0:52:44 | higher than anybody would have believed
---|
0:52:46 | possible |
---|
0:52:47 | and what follows
---|
0:52:50 | for this workshop, where
---|
0:52:52 | people are trying to build models
---|
0:53:00 | yeah, that's
---|
0:53:01 | a very good question
---|
0:53:03 | and |
---|
0:53:06 | anyway |
---|
0:53:10 | to answer that
---|
0:53:13 | we have been organizing
---|
0:53:17 | an
---|
0:53:19 | evaluation campaign
---|
0:53:21 | of speech synthesis systems
---|
0:53:22 | and we have found that
---|
0:53:24 | the intelligibility of hmm-based systems
---|
0:53:29 | is almost perfect
---|
0:53:31 | almost
---|
0:53:32 | comparable to natural speech
---|
0:53:35 | but still the naturalness
---|
0:53:39 | is |
---|
0:53:40 | insufficient compared with natural speech |
---|
0:53:43 | so maybe due to
---|
0:53:47 | prosody
---|
0:53:50 | i believe that we have to improve the prosodic part
---|
0:53:58 | of the
---|
0:53:59 | statistical model
---|
0:54:01 | maybe
---|
0:54:04 | human speech conveys various non-verbal information
---|
0:54:09 | and that cannot be handled by the current speech synthesis
---|
0:54:13 | systems
---|
0:54:14 | that kind of aspect should be included
---|
0:54:19 | so |
---|
0:54:20 | your talk was very nice
---|
0:54:22 | i want to go a little bit further along paul's line of questioning, because i was thinking about your
---|
0:54:28 | final call to the asr community to join you in this stuff
---|
0:54:32 | one of the things you're seeing a lot with hmms and stuff in the speech field is that
---|
0:54:37 | everybody's moving towards discriminant models of various kinds and whatnot, and the nice thing about
---|
0:54:43 | the hmms for the synthesis problem is it really is a generative problem, right, so in some ways
---|
0:54:49 | the model matches a little better, which is sort of what paul was touching on
---|
0:54:53 | so do you
---|
0:54:54 | see
---|
0:54:55 | in moving forward in synthesis that
---|
0:54:59 | discriminant techniques are gonna be
---|
0:55:02 | playing a part in that kind of thing, or do you think that generative
---|
0:55:09 | models are definitely gonna be the right way to
---|
0:55:12 | model this kind of thing
---|
0:55:13 | yeah a good question |
---|
0:55:17 | as for discriminative training
---|
0:55:21 | in speech synthesis there are no competing classes
---|
0:55:25 | so it is not necessary to discriminate
---|
0:55:28 | and |
---|
0:55:29 | another point |
---|
0:55:31 | that |
---|
0:55:32 | we can set a specific |
---|
0:55:36 | objective function based on human perception |
---|
0:55:39 | somewhat
---|
0:55:40 | like
---|
0:55:42 | discriminative training in speech recognition
---|
0:55:46 | but anyway yeah |
---|
0:55:49 | in speech synthesis |
---|
0:55:50 | the basic problem |
---|
0:55:52 | is
---|
0:55:53 | generation |
---|
0:55:54 | so we can concentrate on
---|
0:55:58 | generative models
---|
0:55:59 | it's not necessary to tackle
---|
0:56:01 | discriminative training
---|
0:56:04 | that's a nice point of speech synthesis research
---|
0:56:09 | [inaudible audience question]
---|
0:56:42 | but
---|
0:56:43 | i want to do
---|
0:56:45 | that kind of
---|
0:56:48 | optimization in a statistical framework
---|
0:56:51 | by changing the hyperparameters or the model structure, we can do that in the statistical framework
---|
0:57:01 | and i've got a related question: so you generate the maximum likelihood sequence
---|
0:57:07 | if you have a really good |
---|
0:57:08 | generative model, we'd really like to sample stochastically
---|
0:57:18 | actually
---|
0:57:20 | we are sampling, in a sense
---|
0:57:22 | because
---|
0:57:31 | let me show this
---|
0:57:37 | okay
---|
0:57:38 | these are the given variables
---|
0:57:41 | and this is the speech waveform
---|
0:57:44 | and this is the predictive distribution, and we sample the speech waveform
---|
0:57:50 | by exciting the speech synthesis filter with gaussian
---|
0:57:55 | white noise
---|
0:57:56 | so it's
---|
0:57:57 | just sampling
---|
0:58:01 | and the speech parameters, mel-cepstrum and, in this case, F zero
---|
0:58:09 | should be marginalized in the equation
---|
0:58:13 | so as a practical approximation we generate them with the maximum
---|
0:58:21 | likelihood criterion
---|
0:58:24 | it's just an approximation
---|
0:58:26 | but
---|
0:58:27 | this criterion is
---|
0:58:29 | a kind of sampling
---|
0:58:31 | does it make sense |
---|
0:58:35 | well, i guess i'm wondering whether it's a good approximation
---|
0:58:40 | yeah, that's the reason why we want to reduce or remove
---|
0:58:45 | the approximations; we want to relax the approximations in
---|
0:58:49 | future work
---|
0:58:52 | okay, i think it's time to close, so let's thank the speaker
---|