0:00:13 | Okay, so we next have a talk by Navdeep Jaitly, who is from Geoff Hinton's group. |
0:00:22 | We're very excited to have Navdeep be part of the session, in part because there's a set of methods that have been widely used over the last five to ten years for doing things with images and machine vision, known as deep belief networks, that haven't really been used in sound very much. |
0:00:42 | Geoff's group has very recently, over the past couple of years, started applying these methods to audio, and so Navdeep is going to be talking a little bit about this. |
0:00:53 | These methods are really extensions of things that were initially developed back in the eighties and have been revived with a set of new training methods in the last ten years. |
0:01:02 | So, Navdeep. |
0:01:04 | I'd like to thank Josh and Malcolm for giving me the chance to present this work today. |
0:01:09 | So, like Josh mentioned, there's been a lot of development recently in generative models for high-dimensional data, and here are some examples of samples generated from such models. |
0:01:27 | Here are some examples of digits generated from the first deep belief network, which was published in 2006. |
0:01:36 | I'd like to point out that these are actually samples from the model, rather than reconstructions of particular data cases, so the model is quite powerful: it has very high peaks at real data points. |
0:01:50 | Here are some samples from a recently published model, a gated MRF, which was trained on natural image patches; you can see this model is really good at modeling both short-range and long-range correlations in images. |
0:02:08 | As an example with motion sequences: models have also been developed for motion sequences, and in this case the training data was joint angles from motion capture. |
0:02:24 | Okay, so it's been seen that features from these generative setups are also very good at discriminative tasks, and that makes sense on an intuitive level: what's good at generating specific types of patterns is probably good at recognizing those patterns. |
0:02:41 | Josh actually showed an example of features that were good at generating sound textures, and those same features could be used for recognizing those textures. |
0:02:52 | These models have been used widely for vision tasks, but they haven't made it quite as much into sound yet, so for this work we wanted to see if we could use these models on raw speech signals, and see whether the features learned in the generative step were useful in a discrimination task. |
0:03:13 | So our goal, specifically, is: given raw speech signals, we want to build a generative model for subsequences of these signals 6.25 milliseconds long. |
0:03:26 | We're using TIMIT, which is sampled at 16 kHz, so we have data vectors which are a hundred samples long: the vector we're actually modeling is a hundred-dimensional vector whose entries are the intensities of the raw sound samples. |
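As a quick check of the arithmetic implied here, the window length works out to exactly one hundred samples:

```latex
16\,000\ \tfrac{\text{samples}}{\text{s}} \times 0.00625\ \text{s} = 100\ \text{samples}
```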
0:03:45 | Okay, so here's a quick outline of the talk: first I'll talk about RBMs, the restricted Boltzmann machines, which are the generative model we use in this paper; then I'll show some results from the generative model; and then I'll talk about the application of the features to phone recognition on TIMIT. |
0:04:07 | So in this first section I'd like to talk about why we wanted to use the raw signals themselves. |
0:04:15 | The first reason was that we didn't want to make any assumptions about the signal, such as stationarity of the signal within a single frame. |
0:04:25 | Secondly, we were motivated by speech synthesis: in that domain, being able to model raw speech signals would eventually allow us to generate realistic signals without having to satisfy phase constraints. |
0:04:40 | We also want to be able to discover patterns and their relative onset times with our model, and we think that should probably be helpful in discriminating between certain phones. |
0:04:55 | The last reason is because we now can, and that sounds a little facetious, but it's probably the most important motivation for using raw signals: traditional encodings such as MFCCs have been around for quite some time now, and within that same span of time computational resources have grown enormously. |
0:05:17 | At the same time, a lot of data is now available to train really powerful models, and machine learning has made a lot of progress in being able to pick out features and build really good models from data alone, and so that's why we wanted to try and do this straight on the raw signal. |
0:05:41 | Here's a quick outline of restricted Boltzmann machines. A restricted Boltzmann machine, or RBM, is an undirected graphical model, and it has two layers of nodes. |
0:05:55 | The bottom one, the visible layer, corresponds to the dimensions of the data that are observed, and the top layer holds the hidden nodes, or hidden variables; these are basically latent variables that try to explain the data. |
0:06:10 | There's a set of interaction weights connecting these two layers, and the architecture is such that there's bipartite connectivity, which implies that given the visible nodes, all the hidden nodes are independent of each other, and the opposite is true of the visibles when the hidden variables are known. |
0:06:30 | And since it's an undirected graphical model, there is an energy function associated with any given configuration of the visible and hidden states, and the energy for a given state governs its probability through the Boltzmann distribution. |
0:06:55 | What I'm trying to show in this set of equations here is the exact equation for a Gaussian-binary RBM, which is the RBM to use for the scenario where we have real-valued signals and binary hidden nodes. |
0:07:13 | Let me see if I can get out of the slideshow... never mind; the equations aren't actually that meaningful anyway. |
0:07:28 | The important point to note about the equation is that there's a term in there which looks at the interaction between the configuration of the hidden variables and the visible vector as a whole. |
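The slide itself isn't legible in the recording, but the standard Gaussian-binary RBM energy function being described has the following form, with visible vector v, binary hidden vector h, weights W, biases a and b, and per-dimension noise scales sigma_i; the last term is the visible-hidden interaction term the speaker highlights, and Z is the partition function summing over all configurations:

```latex
E(\mathbf{v}, \mathbf{h}) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2}
  \;-\; \sum_j b_j h_j
  \;-\; \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j,
\qquad
p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}
```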
0:07:47 | Something really interesting about this model is that the priors are quite complicated, because they involve a sum over an exponential number of configurations, but the posteriors, on the other hand, are quite simple. |
0:08:02 | Given visible data, the hidden variables are all independent of each other, and each one turns on with a probability equal to the sigmoid of the input to that hidden node, where the input is essentially the dot product of the visible data and the set of weights connecting that hidden node to the data. |
0:08:23 | In that sense this is a very powerful model, and it's different from other generative models where the priors are independent but the posteriors are very hard to compute: the perk of this model is that it has very rich priors but very easy posteriors. |
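A minimal sketch of that posterior computation in NumPy, with hypothetical names (W for the weight matrix, b_hid for the hidden biases), assuming unit-variance visible units:

```python
import numpy as np

def hidden_posterior(v, W, b_hid):
    """p(h_j = 1 | v) for a (Gaussian-)binary RBM.

    v     : (n_visible,) observed data vector
    W     : (n_visible, n_hidden) weight matrix
    b_hid : (n_hidden,) hidden biases
    """
    # Input to each hidden unit: dot product of the data with that
    # unit's weight vector, plus the unit's bias.
    pre_activation = v @ W + b_hid
    # Bipartite structure => hidden units are conditionally independent,
    # each turning on with a sigmoid of its own input.
    return 1.0 / (1.0 + np.exp(-pre_activation))
```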
0:08:42 | "'kay" so the maximum likelihood uh |
---|
0:08:44 | estimation |
---|
0:08:45 | of of uh a models it's is this is really complicated because uh |
---|
0:08:49 | the gradient of the log probably is really hard to compute exactly |
---|
0:08:52 | um fortunately uh jeff |
---|
0:08:55 | in discovered about that that ago |
---|
0:08:57 | that an algorithm called contrastive divergence |
---|
0:09:00 | would be used to train these models |
---|
0:09:02 | um |
---|
0:09:03 | uh |
---|
0:09:04 | a pretty well |
---|
0:09:05 | and that's the model where a that's the algorithm or using the learned the parameters |
---|
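A sketch of one step of CD-1, the simplest form of contrastive divergence, for a Gaussian-binary RBM with unit-variance visibles; all names are illustrative, and this is a sketch of the general algorithm rather than the paper's exact training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=1e-3):
    """One CD-1 parameter update from a single data vector v0."""
    # Positive phase: posterior over hiddens given the data, then a sample.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Negative phase: reconstruct visibles (Gaussian mean, sigma = 1),
    # then re-infer the hidden posteriors from the reconstruction.
    v1 = b_vis + W @ h0
    p_h1 = sigmoid(v1 @ W + b_hid)

    # Approximate likelihood gradient: data stats minus reconstruction stats.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (p_h0 - p_h1)
```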
0:09:11 | One last point about the model: binary hidden units are not very good for raw speech signals, and the reason for that is that speech signals can present the same pattern over many different orders of magnitude, but binary units can only turn on at one output intensity level. |
0:09:35 | So for this paper we used an alternative type of unit, called the stepped sigmoid unit, which has the powerful property that it can recreate a given pattern at almost any intensity level. |
0:09:48 | I won't talk too much about those units, but there's more information about them in the paper that's referenced here. |
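In the literature, these stepped sigmoid units (a stack of binary units sharing weights, with biases offset in steps) are commonly approximated by noisy rectified linear units; a sketch of that approximation, as one common formulation rather than necessarily the exact one used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def stepped_sigmoid_sample(x):
    """Approximate sample from a stepped sigmoid unit.

    The total activity of the stack is commonly approximated by a
    noisy rectified linear unit:
        max(0, x + noise),  noise ~ N(0, sigmoid(x)).
    Unlike a single binary unit, the output scales with the input,
    so the same pattern can be represented at many intensity levels.
    """
    sigma2 = 1.0 / (1.0 + np.exp(-x))  # variance of the added noise
    return np.maximum(0.0, x + rng.normal(0.0, np.sqrt(sigma2)))
```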
0:09:58 | Okay, here's the experimental setup. Like I said, we were looking at 6.25 milliseconds of speech, and that corresponds to a hundred samples; for each sample we have a variable in the visible data, so our RBM has a hundred visible nodes at the bottom, and we coupled that with a hundred and twenty of these stepped sigmoid units. |
0:10:23 | In training the RBM, the signal itself was selected randomly from the TIMIT database and presented to the model, and on average the model had seen each sub-segment about thirty times. |
0:10:40 | Here are some of the features that were learned by the model. On the left side we see the actual features, and just as a reminder, these are just the weights connecting the visible data to the hidden units: for each hidden unit we have a pattern, and that hidden unit turns on maximally when the data presents the particular pattern associated with it. |
0:11:03 | You can see there's a lot of different types of patterns that are learned; let me go through a few of them very quickly. |
0:11:10 | Here is a pattern that's tuned to pick out really low frequencies, maybe like F0, or pitch. |
0:11:21 | Here are some features that pick up patterns at slightly higher frequencies, and here are others at intermediate frequencies, and then some at really high frequencies. |
0:11:42 | There are some other really interesting ones: these patterns that seem to have composite frequency characteristics, with a low-frequency component and a high-frequency component, and we think these might be picking up fricatives. |
0:12:02 | Okay, so now that we've trained the model, we can reconstruct signals from the posterior activities of the hidden units themselves. |
0:12:11 | So if we take ten frames of signal and project that signal onto the hidden units (I'm showing the activities of the hidden units here on a log scale, and only twenty of the hundred and twenty units we actually trained), you can then take these posterior activities of the hidden units and project them back to visible space to reconstruct the raw signal. |
0:12:38 | This is similar in flavor to the previous talk, except we're using a parametric model to do it. |
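A sketch of that encode-then-decode round trip under the same assumptions as before (Gaussian visibles with unit variance, hypothetical names W, b_vis, b_hid), so the reconstruction is just the linear projection of the posterior means plus the visible biases:

```python
import numpy as np

def reconstruct(v, W, b_vis, b_hid):
    """Project a frame to hidden posterior means and back to visible space."""
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_hid)))  # encode: posterior means in [0, 1]
    return b_vis + W @ p_h                        # decode: mean of p(v | h) at h = p_h
```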
0:12:47 | Okay, and if you look at the reconstruction at a much larger scale, this is 625 milliseconds of raw signal, and you can see the high-dimensional pattern of hidden activities in the heat map, and the reconstruction from the heat map as well. |
0:13:10 | Here are some samples from the model itself. There are sixteen samples here, and five of them are quite similar to each other but different from the other eleven. |
0:13:28 | So I think we have a pretty good model for at least small-scale signals. |
0:13:34 | So let me switch now to the application of these features to phone recognition. |
0:13:40 | The setup that we have is: given one hundred and twenty-five milliseconds of raw speech, we want to use the features that we learned to predict the phoneme labels that we got from an alignment model. |
0:13:56 | So we use the features that we learned to encode the signal (I'll talk about how we did that in the next slide), and then we take the encoded features, put them into a neural network, and use backpropagation to learn the mapping to the phoneme labels. |
0:14:15 | Here's the setup of how we did the encoding. We used a convolutional setup, and the way it works is: we first place the frame at the first sample of an utterance and compute the posterior means; we then move it over by one sample and do the computation again, and we do this for the entire utterance. |
0:14:40 | That makes for high-dimensional data, in fact a little too high-dimensional, because we've essentially blown up the signal by a hundred and twenty times. |
0:14:49 | So what we now do is subsample these hidden units, so that we subsample each feature over twenty-five milliseconds of signal, and the subsampling helps in smoothing out the signal as well. |
0:15:04 | I have to point out that convolutional setups have proven quite useful in vision tasks, and I think our results suggest the same for this setup. |
0:15:14 | Okay, so once we have a subsampled frame for twenty-five milliseconds, we then advance the window by ten milliseconds, and we do this for the entire utterance as well. |
0:15:26 | So for any given segment of speech of one hundred and twenty-five milliseconds, we take the eleven frames that span that signal and concatenate all of them into one vector, and that's the encoding that's put into the neural net. |
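A sketch of this convolutional encode-and-pool pipeline under stated assumptions: a 16 kHz signal, hypothetical RBM parameters W and b_hid as above, and mean-pooling as one plausible reading of the "subsampling" step (it also provides the smoothing the talk mentions):

```python
import numpy as np

def encode_utterance(signal, W, b_hid, frame=100, pool_ms=25, hop_ms=10, sr=16000):
    """Stride-1 convolutional posteriors, then pooled ('subsampled') frames."""
    posterior = lambda v: 1.0 / (1.0 + np.exp(-(v @ W + b_hid)))
    # Posterior means at every sample offset, as described in the talk.
    acts = np.stack([posterior(signal[t:t + frame])
                     for t in range(len(signal) - frame + 1)])
    pool = pool_ms * sr // 1000   # 400 offsets per 25 ms pooling window
    hop = hop_ms * sr // 1000     # advance by 160 offsets (10 ms)
    return np.stack([acts[t:t + pool].mean(axis=0)
                     for t in range(0, len(acts) - pool + 1, hop)])

# For classification, eleven consecutive pooled frames (about 125 ms of
# speech) are concatenated into a single input vector for the network.
```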
0:15:46 | A couple of final issues: the features were first log-transformed, after we created the entire set of features, and we also added deltas and accelerations of the feature vectors to the encoding. |
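The talk doesn't spell out the delta formula; a minimal sketch using the simple two-frame central difference, which is one common choice (HTK-style regression windows are another):

```python
import numpy as np

def add_deltas(frames):
    """Append delta and acceleration features to (n_frames, n_dims) data."""
    padded = np.pad(frames, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0        # first difference
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    accel = (padded_d[2:] - padded_d[:-2]) / 2.0    # second difference
    return np.concatenate([frames, delta, accel], axis=1)
```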
0:16:02 | Here's a little bit about the baseline HMM. The alignment model was just an HMM trained on MFCCs; there were sixty-one phoneme classes with three states for each class, and we used a bigram language model. |
0:16:16 | Forced alignment to the test data and the training data was used to generate the labels. |
0:16:24 | We used the standard method for decoding the posterior probabilities, similar to what's done in tandem-like approaches: we convert the posterior probability predictions into generative probabilities, which can then be decoded with a Viterbi decoder. |
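The standard conversion most likely being referred to divides each state posterior by the state prior to obtain a scaled likelihood for the Viterbi decoder:

```latex
p(x_t \mid s) \;\propto\; \frac{p(s \mid x_t)}{p(s)}
```

Here p(s | x_t) is the network's posterior for state s at frame t, and p(s) is the state prior, typically estimated from the training alignments.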
0:16:41 | Here's a summary of the results for different configurations of our setup. For this set of experiments we used a two-hidden-layer neural network. |
0:16:55 | What we found was that if we used more hidden units in the neural network, we got better results. |
0:17:03 | We found that if we used shorter subsampling windows and shorter advances of the windows, we got better results as well; also, adding the delta and acceleration parameters helped; and we found that with one hundred and twenty hidden units in the RBM we got the best results. |
0:17:22 | We combined all these four lessons and trained one neural network with two layers, with four thousand units in each layer; we used delta and acceleration parameters, a subsampling window of ten milliseconds advanced by five milliseconds, and one hundred and twenty hidden units in the RBM. |
0:17:40 | With that we got 22.8 percent phoneme error rate on the TIMIT test data. |
0:17:49 | And then with a further, deeper neural network, and with DBN pre-training, we were able to reduce that further, down to 21.8. |
0:18:00 | So here are the conclusions: for speech signals, machine learning can discover meaningful features from the data alone, and I think further work in looking for high-dimensional encodings is justified. |
0:18:17 | For future work, we aim to build better generative models. |
0:18:21 | And with that, the acknowledgements, and thank you all for coming. |
0:18:35 | Yeah, question. |
0:18:39 | Hi, I enjoyed your talk. I had a question: one of the reasons why people in speech shy away from time-domain features is their high sensitivity to noise. So how are you going to address that? |
0:18:53 | I think my answer to that is: we just need enough data, and eventually we'll be able to figure that out. |
0:19:00 | But that's the key issue, because noise comes in all sorts of forms: you have pink noise, you have white noise, you have all kinds of noise. I mean, if it were just a matter of having data, you know, we could have solved the recognition problem. |
0:19:15 | So I think it's not just the data, it's the models that go with the data: if you use some of these powerful generative models and build in further assumptions about the characteristics of noise, then hopefully it will learn to pick out the noise and separate it from the real signal. |
0:19:35 | In the case of the features we learned: if you actually look at the types of features we learned, we learned to ignore certain high-frequency components, so if you look at the reconstruction of a signal here, you'll find that some aspects of the fricative are suppressed in the reconstruction. |
0:19:55 | So it's learning to pick out more of the vocal-tract information than it's trying to capture noise; in a sense that's also speech, yes, but the point is that it's able to try and separate out what's noise from what's not. |
0:20:13 | [Audience question, largely unintelligible: the questioner refers to earlier work, some fifteen years back, on waveform-level modeling, and asks how the learned features would hold up on data from different recording conditions.] |
0:21:01 | Right, so we actually didn't try any noisy data for this setup, but I'm an advocate of multiple layers of representation, and the hope is that when you build deeper models, where the lower layers try to pick up signals and the higher layers look for more abstract patterns, the high-level features will try to suppress the noise and separate out the signals. |
0:21:29 | But for now we don't really have any experiments to back that claim. |
0:21:33 | One more real quick one. |
0:21:35 | I feel like twenty years ago I saw work on using neural nets to recognize phonemes, so I'm curious what has really changed, because the other thing I think about with that is scaling issues: when I think about digit recognition, or any recognition, all I have to do is make the sound sufficiently slower, or lower, or higher, and I can usually destroy a neural-net-based recognizer relative to its training performance. So I'm wondering: has there been some sort of advancement? Have you gotten around issues with scaling and transformations of the space? |
0:22:13 | Right, I think what's different from twenty years ago is that these sorts of generative models have made a lot of progress, and it's been seen that you can use them to seed neural networks and get much better results than you could from neural nets before; and the amount of data that's now available is much larger than twenty years ago. So that answers the question of what's different from the last twenty years. |
0:22:39 | In terms of scaling: I think the kind of units we're using are sort of scale-invariant, at least in terms of intensity, and that was the motivation for using them; but the time aspect is not at all covered, and we're actually looking at models that try to be invariant to that aspect of it. |
0:23:00 | I'd also like to mention that convolutional neural networks have been useful in vision-related tasks, and I think they have the potential to adjust for scale, and perhaps we're going to try and adapt those to this work. |
0:23:13 | One last comment: what's your definition of raw? Would this approach work for, say, mu-law encoded data? What's your own definition? |
0:23:28 | [Brief unintelligible exchange clarifying the question.] |
0:23:36 | Yeah, I think for this paper the definition was the rawest form that you could capture from the instrument. |
0:23:44 | We didn't want to make any assumptions: if you take spectral information, there's an assumption baked in that the signal is stationary within a single frame, which I think is not a very correct assumption and probably harms the detection of certain types of phonemes. |
0:24:07 | The second answer to that question is that it's just a matter of convenience; it depends on whatever the input to our system was, since that's already data. |
0:24:19 | But yeah, so that's the first definition, which was: as close as you can get to the capture device. |
0:24:26 | Okay, we need to move on. |