0:00:12 | first let me introduce Mike Seltzer from Microsoft |
---|
0:00:16 | he has been there since two thousand and three |
---|
0:00:19 | and his interests are in noise robust speech recognition, microphone array processing, acoustic model adaptation, and speech enhancement |
---|
0:00:28 | in two thousand and seven he received the best young author paper award from the ieee signal processing society |
---|
0:00:35 | and from two thousand and six to two thousand and eight he was a member of the S L |
---|
0:00:39 | T C, and he was also the editor in chief of its |
---|
0:00:43 | electronic |
---|
0:00:46 | newsletter; many of us used to receive emails from him whenever the newsletter came out |
---|
0:00:52 | and he is also an associate editor of the ieee transactions on speech and audio processing |
---|
0:00:58 | and the title of his talk is robust speech recognition: more than just a lot of noise |
---|
0:01:03 | so |
---|
0:01:04 | please welcome Mike |
---|
0:01:14 | good afternoon. thanks for that great introduction, George, and thanks to the |
---|
0:01:20 | organizing committee for inviting me to give this talk; it's really an honour to be here with all of you |
---|
0:01:28 | and i hope to keep things interesting in this session after lunch, and hope the food coma |
---|
0:01:33 | doesn't set in too badly |
---|
0:01:35 | so let's get started |
---|
0:01:38 | so |
---|
0:01:39 | i've been in the field for about, oh, ten years or so, and from where i sit it really seems |
---|
0:01:45 | like |
---|
0:01:46 | we're in what i'll call almost a golden age of speech recognition |
---|
0:01:52 | and as we all know there are a number of mainstream products that everyone obviously is |
---|
0:01:57 | using that involve speech recognition of course there's a huge proliferation of mobile phones and data plans and voice search |
---|
0:02:04 | and things like that |
---|
0:02:07 | speech is also widely deployed now in automobiles |
---|
0:02:11 | in fact i'd like to point to the Ford Sync as one example of a system that took |
---|
0:02:16 | speech in cars from a high end |
---|
0:02:18 | kind of an add-on for, you know, large automobiles to a kind of a low-end feature on |
---|
0:02:24 | standard |
---|
0:02:25 | packages for low-end Ford models and sort of moderate-sized cars |
---|
0:02:28 | that functionality. and then most recently there was the launch of the Kinect add-on for the X |
---|
0:02:35 | box |
---|
0:02:36 | which has gesture and voice input |
---|
0:02:39 | in addition to these three examples of technologies that are out there, we're also in many ways swimming |
---|
0:02:45 | in— you know, drowning in— the data that we have; there's the proverbial |
---|
0:02:51 | fire hose of data coming at us, especially with cloud based systems |
---|
0:02:56 | now, with all the data being logged on these systems, having data in many cases is not a |
---|
0:03:00 | problem; knowing what to do with the data is sort of the challenge these days |
---|
0:03:04 | and finally, on a personal note |
---|
0:03:06 | i think the one that is most meaningful for me is the fact that all this happening means i no longer |
---|
0:03:11 | have to struggle explaining to my mother what it is i do on a daily basis |
---|
0:03:16 | i'm a |
---|
0:03:19 | i'm not sure she'd be so happy that i used her face on the slide |
---|
0:03:23 | i won't tell her if you won't |
---|
0:03:25 | nevertheless, in spite of all this success, there are lots of challenges that are still out there. there are new applications |
---|
0:03:31 | and as an example here, this is the virtual receptionist by Dan Bohus at MSR, sort of a project |
---|
0:03:36 | in situated interaction and multiparty engagement |
---|
0:03:39 | there are always new devices. this is a particularly interesting one from Tanja Schultz and Stan Jou: an electromyography interface, where |
---|
0:03:46 | it sort of actually measures |
---|
0:03:48 | the muscle activity at your skin as the input. i had another colleague who i think was working on speech input using microphone |
---|
0:03:54 | arrays inside a space helmet for, |
---|
0:03:59 | you know, systems that are deployed for spacewalks |
---|
0:04:01 | and of course, as Thomas Friedman wrote, the world is becoming flatter and flatter, and there are always new languages |
---|
0:04:07 | and new cultures that our systems come in contact with |
---|
0:04:12 | so addressing these challenges takes data and data is |
---|
0:04:16 | time consuming to collect in some ways and it's also expensive to collect |
---|
0:04:20 | as an alternative i'd like to propose |
---|
0:04:23 | that we can extract additional utility from the data we already have |
---|
0:04:26 | the idea is that by reusing and recycling the existing data we have |
---|
0:04:30 | potentially we can reduce the need to actually collect new data |
---|
0:04:34 | and you can think of it informally as making better use of our resources. so what i'd like to focus |
---|
0:04:38 | my talk on today is |
---|
0:04:40 | how we can take ideas from this process to help speech recognition go green |
---|
0:04:45 | so the symbol for my talk today |
---|
0:04:48 | will be how speech recognition goes green through reduce, recycle, and reuse |
---|
0:04:53 | of information |
---|
0:04:56 | which you're going to see a lot |
---|
0:04:58 | okay, so first i'm gonna talk about one aspect of this, in the sense of reduce |
---|
0:05:03 | so |
---|
0:05:04 | we know that our systems suffer, because these are just statistical pattern classifiers, when there is mismatch between the acoustic models we |
---|
0:05:10 | have and the data that we see at runtime. one of the biggest sources of this mismatch is environmental noise |
---|
0:05:16 | so the best solution of course is to retrain with matched data and this is either expensive or impossible depending |
---|
0:05:21 | on what your definition of matched data is |
---|
0:05:23 | if matched data is, you know, 'i'm in a car on a highway', then that's reasonable to collect, it's just |
---|
0:05:27 | a little bit time consuming. if matched data is 'i'm gonna be in this make and model of car on |
---|
0:05:32 | this particular road with this noise', then of course it's impossible to do |
---|
0:05:35 | so as an alternative we have standard adaptation techniques that are tried and true and part of standard toolkits |
---|
0:05:40 | things like map or mllr adaptation. they're really great because they're generic and computationally efficient; the only downside is |
---|
0:05:47 | they need sufficient data in order to train properly |
---|
0:05:50 | the alternative i'd like to discuss here is the way we can exploit a model of the environment |
---|
0:05:54 | and by doing this |
---|
0:05:56 | the estimation in the adaptation method will get some sort of structure imposed on it, and as a result we |
---|
0:06:00 | get lots of efficiencies in terms of data |
---|
0:06:03 | so before we go into the details let's just take a quick look |
---|
0:06:06 | at the effect of noise on speech through the processing chain. i'm showing the processing chain for mfccs; for |
---|
0:06:12 | similar features like plps or lpcs it's similar. so we know that in the spectral domain, or the linear |
---|
0:06:19 | waveform domain, speech and noise are additive |
---|
0:06:22 | that's not too bad to handle |
---|
0:06:23 | when you're in the power domain it's again additive, and there's this additional cross term here, which is the kind of correlation between |
---|
0:06:29 | speech and noise. we generally assume speech and noise are uncorrelated, so we kind of ignore the term and sort of |
---|
0:06:34 | hide it away |
---|
0:06:35 | now things get a little bit trickier once you go through the mel filterbank and log operations, because after |
---|
0:06:41 | the log domain you start to get this little bit of a nasty relation, which says that the noise— |
---|
0:06:47 | the noisy features, y here, can be described as the clean speech features plus some nonlinear function of the |
---|
0:06:52 | clean speech and the noise |
---|
0:06:53 | and in the cepstra, up to a linear transform, we get a vector version of the same equation |
---|
0:06:58 | for the purpose of this talk i'm going to back up before the dct, because it's easier to visualise things in one or |
---|
0:07:02 | two dimensions rather than thirty nine dimensions |
---|
0:07:05 | and so let's talk about this equation here |
---|
0:07:07 | so because speech and noise are in a symmetric relationship— x plus n is n plus x— we can |
---|
0:07:13 | swap the positions of x and n in this equation here. if we do that, and we sort |
---|
0:07:18 | of bring common terms to each side of the equation, you get a slightly different expression |
---|
0:07:23 | and what's interesting here is that what you have in the log— |
---|
0:07:26 | because this is a log domain operation— is basically a function of two signal-to-noise ratios: something that's known in |
---|
0:07:33 | speech enhancement and signal processing as the a posteriori snr, which is the snr of the observed speech compared to the |
---|
0:07:38 | noise, and the a priori snr, which is the same thing with the unknown clean speech and the noise |
---|
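To make the relationship concrete, here is the standard log-mel mismatch model he is describing, written out (a reconstruction from the description; x is the clean speech, n the noise, and y the noisy observation, all in the log mel domain):

```latex
% power spectra add (the speech/noise cross term is assumed zero):
%   |Y|^2 \approx |X|^2 + |N|^2
% after mel filtering and the log, the noisy feature becomes
y = x + \log\!\left(1 + e^{\,n - x}\right)
% by the symmetry of x and n this can be rewritten as
y = n + \log\!\left(1 + e^{\,x - n}\right)
% and bringing n to the left relates the two SNRs just mentioned:
\underbrace{y - n}_{\text{a posteriori SNR}}
  = \log\!\left(1 + \exp\big(\underbrace{x - n}_{\text{a priori SNR}}\big)\right)
```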
0:07:45 | and so if we look at what this relationship is along this function |
---|
0:07:50 | we have a curve like this |
---|
0:07:52 | and this curve makes a lot of intuitive sense if we look at different points along the curve. and so |
---|
0:07:55 | we can say, for example, up here in the upper right of the curve |
---|
0:07:59 | we have high snr; basically noise is not much of a factor, and the noisy speech is equal |
---|
0:08:03 | to the clean speech |
---|
0:08:05 | in a similar way |
---|
0:08:07 | at the other end of the curve |
---|
0:08:09 | we have a low snr |
---|
0:08:11 | and it doesn't matter what the clean speech is; it's completely dominated by the noise |
---|
0:08:15 | so y is n in this case. and of course, you know, the million dollar question is |
---|
0:08:19 | how do we handle things in the middle, where the nonlinearity is something that needs to be dealt with |
---|
0:08:24 | there's an added complication to this, which is that earlier we sort of swept this cross correlation between speech and noise |
---|
0:08:29 | under the rug |
---|
0:08:31 | but it turns out |
---|
0:08:32 | that yes, it is zero in expectation, but it's actually not zero; you know, it's a distribution that has nonnegligible |
---|
0:08:39 | variance. if we plot data on this curve and see how well it matches the curve |
---|
0:08:43 | you see the general trend is that the data lies on that line, but there's actually significant spread around that line |
---|
0:08:49 | what that means is |
---|
0:08:51 | that even if we're given the exact value of the clean speech and the exact value of the noise, in the feature |
---|
0:08:57 | domain we still can't predict exactly what the noisy feature will be; we can just predict what the distribution |
---|
0:09:03 | will be |
---|
0:09:03 | and this additional uncertainty makes things we wanna do, like model adaptation, even more complicated |
---|
0:09:09 | so |
---|
0:09:10 | there have been a number of ways to look at transferring this equation into the model domain |
---|
0:09:16 | if we do that, again this nonlinearity poses some pretty big challenges. so if we again look at the extremes of |
---|
0:09:21 | the curve, it's quite straightforward: at high snrs, if we do an adaptation, the noisy distribution is gonna be exactly |
---|
0:09:27 | the same as the clean distribution |
---|
0:09:29 | if we go over here to the low snrs, the lower left of the curve |
---|
0:09:32 | the noisy speech distribution is just the noise distribution |
---|
0:09:36 | and the real trick is how do we handle this area in the middle, right? even if we |
---|
0:09:40 | assume that the speech and the noise are gaussian, when we put these things through this nonlinear relationship, what comes out |
---|
0:09:45 | is definitely not gaussian |
---|
0:09:47 | but of course this is speech recognition, and |
---|
0:09:51 | you know, if it's not gaussian we're just gonna assume it's gaussian anyway. and so there |
---|
0:09:56 | are various approximations that are made to do this, because |
---|
0:10:00 | we just, you know, assume gaussians |
---|
0:10:02 | so |
---|
0:10:04 | right, so the most famous example of how to do this, when you adapt to noise, is to simply take a linear |
---|
0:10:08 | approximation, to linearize around a point; this is the famous vector taylor series algorithm by Pedro Moreno |
---|
0:10:14 | the idea here is you simply have an expansion point |
---|
0:10:17 | that's given by the mean of the gaussian you're trying to adapt and the mean of your noise |
---|
0:10:21 | and you simply linearize the nonlinear function around that point. and once you have a |
---|
0:10:25 | linear function, doing adaptation is very straightforward; we know how to transform gaussians subject to a |
---|
0:10:30 | linear transformation |
---|
0:10:32 | now the trick here is that the transformation is only determined by the means, and the size |
---|
0:10:38 | of the variances of the clean speech and the noise model will determine the |
---|
0:10:42 | accuracy of the linearisation. if the variance is very broad, if it's a very wide bell, then the linearisation |
---|
0:10:48 | is not going to be very accurate, because you'll be subject to more of the nonlinearity |
---|
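As a minimal sketch of the vector Taylor series idea just described (a toy numpy reconstruction under the usual assumptions—diagonal covariances, static features, speech and noise independent—not Moreno's original implementation):

```python
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """Adapt one diagonal-covariance Gaussian (mu_x, var_x) for additive
    noise (mu_n, var_n) via a first-order Taylor expansion of
    y = x + log(1 + exp(n - x)) around the expansion point (mu_x, mu_n)."""
    # Value of the mismatch function at the expansion point.
    g = np.log1p(np.exp(mu_n - mu_x))
    # Jacobians at the expansion point: dy/dx = G, dy/dn = 1 - G.
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))
    mu_y = mu_x + g                                # adapted mean
    # For independent x and n: var(y) = G^2 var_x + (1 - G)^2 var_n.
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n
    return mu_y, var_y

# Toy example: a "clean" Gaussian pushed into a noisy environment.
mu_y, var_y = vts_adapt(mu_x=np.array([5.0]), var_x=np.array([1.0]),
                        mu_n=np.array([4.0]), var_n=np.array([0.5]))
print(mu_y, var_y)
```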
0:10:51 | so |
---|
0:10:52 | so a refinement of that idea we looked into is something called linear spline interpolation, which is sort of predicated on |
---|
0:10:59 | well, if one line works well, many lines must work better. and so the idea is to simply take this |
---|
0:11:05 | function and approximate it using a linear spline, which is the idea that you have a |
---|
0:11:10 | series of knots, which are basically the places you see as the dots in this figure |
---|
0:11:15 | and between the dots you have a linear approximation that is quite accurate |
---|
0:11:19 | and in fact, because it's a simple linear regression, you have a variance associated, an error associated with that |
---|
0:11:24 | in your model, and that'll account for that spread of the data around the curve |
---|
0:11:29 | and then when you figure out what to do at runtime, you can |
---|
0:11:36 | use all the splines, weighted based on the pdf, rather than just having to pick a single one |
---|
0:11:40 | determined by the mean. and so essentially, depending on how much mass of the probability is under each of the segments |
---|
0:11:45 | that tells you how much contribution of that linearisation you're gonna use in your final approximation |
---|
0:11:51 | so you're doing the linearisation based on the entire distribution rather than just the mean. and the nice |
---|
0:11:57 | thing is the spline parameters can be trained from stereo data, or they can also be trained in an integrated |
---|
0:12:01 | way, using sort of maximum likelihood in an hmm framework |
---|
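In equations, the spline idea sketched above looks roughly like this (my paraphrase; r is the SNR variable on the horizontal axis, and the knots, slopes, and weights are illustrative symbols):

```latex
% piecewise-linear approximation of the mismatch nonlinearity:
y \;\approx\; \sum_k \gamma_k \left( a_k r + b_k + \epsilon_k \right),
\qquad \epsilon_k \sim \mathcal{N}(0, \sigma_k^2)
% (a_k, b_k): slope and intercept of spline segment k between two knots;
% \sigma_k^2: regression error, absorbing the spread of the data around the curve;
% \gamma_k: probability mass of the model's distribution over r that falls
%           between the knots bounding segment k.
```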
0:12:06 | so those were two examples of a linearisation approach; another approach is the sampling based methods |
---|
0:12:13 | and the idea here is based on— i think the most famous example of this is the data-driven pmc |
---|
0:12:18 | work from Mark Gales |
---|
0:12:21 | in ninety six, but that method requires, you know, tens of thousands of samples for every gaussian you're trying |
---|
0:12:26 | to adapt; it's completely |
---|
0:12:27 | infeasible, but it is a good upper bound on what you can do |
---|
0:12:30 | but the unscented transform is a very elegant way to sort of do clever sampling. the idea is you just take certain |
---|
0:12:36 | sigma points, and again, because we can assume things are gaussian, there's a simple recipe for what these sampling points |
---|
0:12:41 | are. you take a small set of points, in this case typically, you know, less than a hundred points |
---|
0:12:47 | pass them through the nonlinear function, you know, according to your model, and then you can compute the |
---|
0:12:54 | moments and basically estimate p of y |
---|
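Here is a small numpy sketch of that sigma-point recipe applied to the same mismatch function (an illustrative reconstruction; the kappa weighting is a standard unscented-transform default, not something specified in the talk):

```python
import numpy as np

def unscented_adapt(mu_x, var_x, mu_n, var_n, kappa=1.0):
    """Adapt a diagonal Gaussian for additive noise by propagating sigma
    points of the joint [x; n] distribution through the mismatch function
    y = x + log(1 + exp(n - x)) and re-fitting a Gaussian to the outputs."""
    mu_z = np.concatenate([mu_x, mu_n])            # joint mean of [x; n]
    var_z = np.concatenate([var_x, var_n])         # x and n assumed independent
    L = mu_z.size
    spread = np.sqrt((L + kappa) * var_z)          # diagonal cov: no Cholesky

    # 2L + 1 sigma points: the mean, plus +/- perturbations along each axis.
    sigma = np.tile(mu_z, (2 * L + 1, 1))
    for i in range(L):
        sigma[1 + i, i] += spread[i]
        sigma[1 + L + i, i] -= spread[i]

    w = np.full(2 * L + 1, 1.0 / (2 * (L + kappa)))
    w[0] = kappa / (L + kappa)                     # weights sum to one

    d = mu_x.size
    x, n = sigma[:, :d], sigma[:, d:]
    y = x + np.log1p(np.exp(n - x))                # push each point through g

    mu_y = w @ y                                   # weighted sample mean
    var_y = w @ (y - mu_y) ** 2                    # weighted sample variance
    return mu_y, var_y

print(unscented_adapt(np.array([5.0]), np.array([1.0]),
                      np.array([4.0]), np.array([0.5])))
```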
0:12:56 | again |
---|
0:12:58 | depending on how spread out the variance of this distribution you're trying to adapt is, that will determine |
---|
0:13:04 | how accurate this adaptation is. so there's been a further refinement of this method |
---|
0:13:08 | proposed recently, called the unscented gaussian mixture filter. in this case you take a very broad gaussian, simply chop it |
---|
0:13:14 | up into a gaussian mixture, where within each gaussian |
---|
0:13:18 | the variance is small and a simple linear approximation works quite well |
---|
0:13:22 | and the sampling works quite efficiently, and then you sort of combine all the gaussians back together on the other side |
---|
0:13:26 | here |
---|
0:13:27 | so those are just four examples; there are a handful of others out there in the literature |
---|
0:13:33 | but one thing i've tried to convey here is that, in contrast to standard adaptation, you'll notice i didn't talk |
---|
0:13:37 | at all about data |
---|
0:13:39 | and observations. this was all about how to adapt the model; all we had was the hmm parameter mu |
---|
0:13:44 | x and the noise model |
---|
0:13:46 | so sort of what's nice about these systems is that |
---|
0:13:50 | excuse me |
---|
0:13:52 | is that basically all you need is an estimate of what the noise is in the signal, and given that we can |
---|
0:13:57 | actually adapt every single gaussian in the system, because of the structure that's imposed on the adaptation process |
---|
0:14:02 | and in fact, if we can sort of sniff what the environment is before we even see any speech, we |
---|
0:14:06 | can do this in the first pass, which is very nice. and of course you can refine this in the second |
---|
0:14:10 | pass by doing, you know, em type algorithms to update your noise parameters |
---|
0:14:14 | so of course |
---|
0:14:15 | under this model, the accuracy of the technique is largely due to the accuracy of the approximation you're using. so those |
---|
0:14:21 | were the four examples i showed earlier, and essentially people who work in this area are basically trying to come up with better |
---|
0:14:27 | approximations to that nonlinear function. other alternatives also focus on more explicitly modeling |
---|
0:14:33 | that uncertainty between the speech and noise, that accounts for that spread in the data that was in the earlier |
---|
0:14:39 | figure |
---|
0:14:40 | so just to give a sense of how these things work: this is aurora two, which is a standard noise robustness |
---|
0:14:45 | task; it's a noisy connected digit task |
---|
0:14:49 | for people who care, it's the complex back-end, so it's like a clean trained system |
---|
0:14:55 | this is sort of the best baseline you can create with this data. i mean, you can |
---|
0:14:58 | see that sort of doing standard things like cmn is not great; when you do cmllr, again in this |
---|
0:15:05 | one utterance case, you may not have enough data to do the adaptation correctly, so you get a |
---|
0:15:08 | small gain but not a huge win |
---|
0:15:13 | the ETSI advanced front-end shown there, the afe, is sort of, |
---|
0:15:16 | i guess, representative of the state-of-the-art in sort of front end signal processing approaches to doing this, where there's no |
---|
0:15:21 | knowledge of the models; you treat this as a noisy signal and enhance it in the front end. and |
---|
0:15:26 | if you do vts |
---|
0:15:27 | in the plain algorithm, ignoring |
---|
0:15:31 | that correlation between speech and noise, that spread of the data, you get about the same performance |
---|
0:15:36 | and now, if you actually account for that variance in the data by tuning a weight in your |
---|
0:15:42 | update, which i won't get into the details of, you get a pretty significant gain |
---|
0:15:45 | that's a really nice result. the problem with that is that the value of that weight that's actually optimal is |
---|
0:15:50 | theoretically implausible and breaks your entire model, so that part is a little bit unsatisfying |
---|
0:15:57 | in addition, the fact is that it often does not generalise across corpora. and then we |
---|
0:16:03 | see that you get about the same results if you use the spline interpolation method, where you have |
---|
0:16:07 | the linear regression model; it does account for the spread in sort of a more natural way |
---|
0:16:12 | and again, all the numbers are, you know, similar; these are first pass numbers, and they could be refined further with a second pass |
---|
0:16:18 | so |
---|
0:16:20 | so while this shows we can get, you know, nice |
---|
0:16:23 | gains by adapting with this structure, there's been a little bit of dirty laundry i was trying to cover up |
---|
0:16:29 | which is that the environmental model is completely dependent on the assumption that the hmm is trained on clean speech, and |
---|
0:16:36 | as you all know, clean speech is kind of an artificial construct; it's something we can collect in the |
---|
0:16:41 | lab but is not very generic. it also means that if we deploy a system out in the world and we |
---|
0:16:46 | collect the data that comes in, that data is incredibly valuable for updating our system and refining our system |
---|
0:16:51 | but if it's noisy and our system can only take in clean data, we can't use that data; we |
---|
0:16:55 | have a problem |
---|
0:16:56 | so |
---|
0:16:57 | a solution to that problem has been proposed, referred to as |
---|
0:17:04 | noise adaptive training, also known as joint adaptive training |
---|
0:17:07 | and the idea is basically— you can think of it as sort of a little brother or little sister to speaker adaptive training |
---|
0:17:15 | in the same way that speaker adaptive training tries to remove speaker variability in your acoustic model by having some |
---|
0:17:20 | other transform absorb the speaker variability, we wanna have the same kind of operation happen to absorb the environmental variability |
---|
0:17:27 | what this allows you to do is actually incorporate training data from different sources |
---|
0:17:32 | into a single model. this is helpful if you think about a multi-style model: we can take |
---|
0:17:36 | all kinds of data from all different conditions and mix it all together |
---|
0:17:40 | the model will model the noisy speech correctly, but it'll have a lot of variance that's just modeling the fact |
---|
0:17:43 | that it's coming from different environments |
---|
0:17:45 | that's not gonna help you with phonetic classification |
---|
0:17:48 | and if you're in a data-scarce scenario, this could become very important |
---|
0:17:53 | so again, just to make it a little bit more explicit, here's the general flow for speaker |
---|
0:17:59 | adaptive training: you have some multi-speaker data and a speaker independent hmm |
---|
0:18:03 | that then go into a process where you iteratively update your hmm and some speaker transforms |
---|
0:18:09 | most commonly using cmllr, and this process goes back and forth until convergence, and what you're left with is a |
---|
0:18:14 | speaker adapted hmm |
---|
0:18:16 | so in noise adaptive training the exact same process happens |
---|
0:18:19 | except the goal is to remove the environmental variability from multi-style, multi-environment data |
---|
0:18:25 | so what happens here is we have, again, what i guess you could call an |
---|
0:18:30 | environment independent model, but that's not really |
---|
0:18:32 | what it is; i'll call it that for parallel structure. essentially it's trained on data from lots of environments |
---|
0:18:38 | and then in your iterative process you're basically trying to model and account for the noise or channel distortion that's |
---|
0:18:43 | in all of your data |
---|
0:18:46 | with other parameters, so that the hmm is free to model the phonetic variability. in this case, typically what's most |
---|
0:18:51 | often done is the noise, that is the environmental parameters, are updated on a per utterance basis rather than a |
---|
0:18:56 | per speaker basis, because there are few parameters and so you're able to estimate them |
---|
0:19:00 | and what comes out is a noise adapted hmm. again, the nice thing here is that because you can |
---|
0:19:06 | do this potentially in the first pass, you don't need to keep the original environment independent or noise independent model |
---|
0:19:12 | around like you do in speaker adaptive training; you can directly operate all the time on the noise adapted hmm |
---|
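The flavor of that loop can be caricatured in a few lines (a deliberately tiny toy: a simple additive bias stands in for the full noise model, and a single gaussian stands in for the hmm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one underlying "phonetic" Gaussian, observed through several
# environments that each add an unknown bias (a stand-in for noise/channel).
true_mu, true_sd = 2.0, 1.0
biases = [0.0, 3.0, -2.0, 5.0]
data = [true_mu + true_sd * rng.standard_normal(200) + b for b in biases]

# Multi-style baseline: pool everything; environment spread inflates the model.
pooled = np.concatenate(data)
print("pooled model:", pooled.mean(), pooled.std())

# Adaptive training: alternate between (1) estimating a per-environment
# bias "transform" and (2) re-estimating the shared model on bias-removed data.
mu = pooled.mean()
for _ in range(10):
    b_hat = [d.mean() - mu for d in data]          # per-environment transform
    normalized = np.concatenate([d - b for d, b in zip(data, b_hat)])
    mu, sd = normalized.mean(), normalized.std()   # environment-free model
print("adapted model:", mu, sd)
```

The pooled model's variance is inflated by the spread of the environment biases; after the alternating updates the shared model's variance shrinks back toward the true phonetic variability, which is the effect being described.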
0:19:18 | here are some results with noise adaptive training |
---|
0:19:21 | this is aurora two with noisy multi-style training data; you can see this is the result for cmn, just cepstral mean normalisation |
---|
0:19:28 | now if we try to apply the vts algorithm, which assumes the model is clean— in this case that assumption |
---|
0:19:34 | is broken, and so the gains we get— |
---|
0:19:37 | we do improve over the baseline, but the results are not nearly as good. then we return |
---|
0:19:41 | to getting nice gains when we actually do this adaptive training, and we see similar performance on the aurora three |
---|
0:19:46 | task. an interesting thing there is that because that's real data collected in a car |
---|
0:19:51 | there actually is no clean data to train this on, and so you actually need an approach like this to |
---|
0:19:56 | be successful with this technique on a |
---|
0:19:59 | corpus like this |
---|
0:20:02 | so |
---|
0:20:04 | so to summarise this first |
---|
0:20:06 | prong of the triangle, reduce |
---|
0:20:10 | we've seen model adaptation, as you all know, can reduce environmental mismatch |
---|
0:20:14 | when you impose this environmental structure determined by the model, the adaptation is incredibly data efficient. if you think |
---|
0:20:20 | about it, in general you need |
---|
0:20:22 | an estimate of the noise, an estimate of your— |
---|
0:20:25 | of your noise mean, your noise variance, and potentially also your channel mean; that's basically thirty nine plus thirty nine plus thirty nine, you |
---|
0:20:30 | know, it's |
---|
0:20:31 | around a hundred and twenty parameters to estimate, which is really very little. and, you know, you could even, for example |
---|
0:20:36 | if you assume that your noise is stationary, then you can actually eliminate even the delta and delta-delta features |
---|
0:20:42 | of your noise |
---|
0:20:44 | and run with only static features; that's even fewer parameters |
---|
0:20:46 | doing the adaptation, unfortunately, is computationally quite a chore; i mean, really, adapting every gaussian in your |
---|
0:20:55 | system is probably overkill to do on an utterance-by-utterance basis, but you can improve the performance by using regression classes, as shown |
---|
0:21:03 | in, i think, Gales's work |
---|
0:21:06 | the other thing is that we can reduce the environmental variability in the final model we have |
---|
0:21:10 | by doing this noise adaptive training, and this is helpful when we're in scenarios where there's not much data to |
---|
0:21:15 | work with |
---|
0:21:15 | the other consideration to remind you of is that, although i'm showing ml systems here, these can be integrated with discriminative training |
---|
0:21:21 | and there is a huge sort of parallel literature to this where the same exact algorithms are used in the front-end |
---|
0:21:27 | where you replace the hmm with a gmm; you do this as a front-end feature enhancement scheme, and it's basically |
---|
0:21:32 | the same exact operation with the goal of generating an enhanced |
---|
0:21:35 | version of the cepstra |
---|
0:21:37 | and |
---|
0:21:38 | it follows the exact same sort of mathematical framework, and the nice thing there is that you |
---|
0:21:43 | can then, if the data that you work with is noisy, also do the same adaptive training technique |
---|
0:21:48 | on the front-end gmm and |
---|
0:21:51 | still use those techniques |
---|
0:21:54 | so well |
---|
0:21:59 | now i wanna move on from reduce to recycle |
---|
0:22:02 | and in this case what i'll talk about is |
---|
0:22:05 | changing gears from the noise to the channel |
---|
0:22:07 | and talk about how we can recycle narrowband data that we have |
---|
0:22:11 | i think it's not a very controversial statement to say |
---|
0:22:16 | that now voice over data is replacing voice over the wire |
---|
0:22:20 | and when you do this, especially in speech applications where you're speaking to a |
---|
0:22:26 | smart phone, your voice is not, you know, making a telephone call anymore; it's going over the data network |
---|
0:22:30 | to some server |
---|
0:22:31 | when you do that, you can capture the maximum bandwidth possible; you're basically not subject to, you know, |
---|
0:22:36 | bandwidth constraints or |
---|
0:22:38 | latency constraints, so you can basically capture arbitrary bandwidth. and the thing is, you know, where |
---|
0:22:43 | possible, wideband data is preferable |
---|
0:22:46 | the gains do vary, you know, when you build an equivalent system with narrowband versus wideband data, but they are consistent |
---|
0:22:52 | for example if you look at a car |
---|
0:22:56 | the gains you get are larger in that context, because a lot of the noise in cars is |
---|
0:22:59 | at low frequencies; sort of the rumble of the highway and the tires creates a lot of low |
---|
0:23:03 | frequency noise, so having the high band energy in the plosives and affricates is really helpful for discriminability |
---|
0:23:09 | and of course it's also sort of becoming the standard for just human communication with wideband codecs: |
---|
0:23:17 | AMR-WB is the european standard, and Skype now is using a wideband codec or even an ultra wideband |
---|
0:23:22 | codec. so the fact that people prefer it sort of also implies that machines would probably prefer it too |
---|
0:23:27 | well |
---|
0:23:29 | that said, there are existing stockpiles of narrowband data from all the systems we've been building over the years, and for many |
---|
0:23:35 | low resource languages in, you know, the developing world, mobile phones still are prevalent, and i don't think they're gonna go |
---|
0:23:40 | away that soon. so we want the ability to do something useful with that data |
---|
0:23:46 | so what i'd like to propose is: is there a way to use the narrowband data to help augment |
---|
0:23:53 | some wideband data we have in data scarce scenarios, to build a better wideband acoustic model? the inspiration for this |
---|
0:24:00 | came from the signal processing literature: maybe ten or fifteen years ago people proposed bandwidth extension for speech processing |
---|
0:24:06 | sort of like again it comes from the fact that we know |
---|
0:24:09 | that people prefer |
---|
0:24:11 | wideband speech. it turns out it's not any more intelligible, unless you're looking at isolated phones; actually |
---|
0:24:16 | both are equally intelligible, but things like listener fatigue and just personal preference come out |
---|
0:24:23 | much higher for wideband |
---|
0:24:26 | speech. and so the way these algorithms operated |
---|
0:24:29 | was that they basically learned correlations between the low and high frequency spectrum of the signal. so here's |
---|
0:24:36 | just |
---|
0:24:39 | a poorly drawn, first grade version |
---|
0:24:45 | of a spectrum. i'd like to say that my four year old did this, but i did it myself |
---|
0:24:48 | so this is sort of, you know— the spectrum looks about like this; i was going for that, with a |
---|
0:24:51 | couple of formants. so now if i ask you guys to predict |
---|
0:24:51 | what is sort of on the other side of the line |
---|
0:24:58 | you know, you'd maybe predict something like that. it seems pretty reasonable; probably, you know, you'd make it a |
---|
0:25:03 | bit different, the formant a little lower and maybe in a different location, but it's not, for example, gonna go up; it's not, you |
---|
0:25:05 | know— you would doubt that it would |
---|
0:25:09 | and so what we can do is basically use, like, a gaussian mixture model to learn |
---|
0:25:14 | gaussian-dependent mappings from low to high band spectra |
---|
0:25:18 | and then a simple thing we could do is to say let's just generate wideband features from narrowband features |
---|
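A sketch of that predictive mapping (my reconstruction using sklearn/scipy; the component count, feature layout, and helper names are illustrative): fit a GMM on joint [low, high] vectors, then form the MMSE estimate of the high band as a posterior-weighted sum of per-component linear regressions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(low, high, n_components=8):
    """Fit a GMM on stacked [low-band, high-band] feature vectors."""
    return GaussianMixture(n_components, covariance_type="full",
                           random_state=0).fit(np.hstack([low, high]))

def predict_high(gmm, low):
    """MMSE estimate of the missing high band given the observed low band."""
    d = low.shape[1]
    # Component posteriors computed from the low band only (marginalized GMM).
    logp = np.stack([multivariate_normal.logpdf(low, m[:d], C[:d, :d])
                     for m, C in zip(gmm.means_, gmm.covariances_)], axis=1)
    logp += np.log(gmm.weights_)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    est = np.zeros((low.shape[0], gmm.means_.shape[1] - d))
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        # E[high | low, k] = mu_h + C_hl @ inv(C_ll) @ (low - mu_l)
        reg = C[d:, :d] @ np.linalg.inv(C[:d, :d])
        est += post[:, k:k + 1] * (m[d:] + (low - m[:d]) @ reg.T)
    return est
```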
0:25:18 | and if you're familiar with the missing feature literature, this basically says— in missing features |
---|
0:25:25 | you say i have some |
---|
0:25:27 | components of my features that are too corrupted by noise, i decide to remove them, and then try to fill them |
---|
0:25:32 | in from the surrounding reliable data. this is like doing missing features with a deterministic mask given by the |
---|
0:25:37 | telephone channel |
---|
0:25:39 | you're simply taking some amount of wideband data |
---|
0:25:43 | and some potentially large amount of narrowband data; you're trying to convert that narrowband data into pseudo wideband features and go |
---|
0:25:50 | to train an acoustic model that way |
---|
0:25:53 | so this actually works okay works pretty well and here's an example |
---|
0:26:00 | this is a wideband log mel-spectrogram |
---|
0:26:03 | on the left, and this is that same speech but through a telephony channel. you can see that obviously the information below |
---|
0:26:09 | three hundred hz and above thirty four hundred hz |
---|
0:26:11 | has gone missing, so to speak. and the idea of this bandwidth extension in the feature domain is to say |
---|
0:26:16 | can we do something to fill it back in |
---|
0:26:19 | and in this particular case— it's not perfect |
---|
0:26:21 | but, you know, a lot of it— where there's red in one picture there's generally red in the other picture, so it's sort of |
---|
0:26:25 | capturing |
---|
0:26:26 | the gross features of the data, and we could use that then to train our system |
---|
0:26:31 | so this is good but the downside is that if you do it this way in the feature domain you |
---|
0:26:34 | end up with a point estimate of what your wideband feature should be, and if that estimate is poor or it's |
---|
0:26:39 | wrong or, you know |
---|
0:26:42 | things like that, you really have no way of informing the model during training to not use that data as |
---|
0:26:48 | much as maybe other estimates that may be more reliable. and so to get this to work you have to do |
---|
0:26:53 | some ad hoc things like corpus weighting, to say, okay, we have a little bit of |
---|
0:26:57 | wideband data but i'm gonna count those statistics much more heavily than the |
---|
0:27:01 | statistics of my narrowband data, which i've extended and therefore don't trust quite as much. so it's not |
---|
0:27:06 | theoretically optimal |
---|
0:27:09 | and as a result, you know, a better approach would be to incorporate |
---|
0:27:14 | this into an em algorithm directly, so you can jointly train the hmm |
---|
0:27:18 | with the state sequence as a hidden variable. so you can think of this as doing the exact same |
---|
0:27:22 | thing but you're adding additional hidden variables for all the missing frequency components that you don't have in the telephone |
---|
0:27:27 | channel |
---|
0:27:30 | so if you do this, you get something that looks like this, where the narrowband data goes directly into |
---|
0:27:34 | the training procedure with the wideband data; you have this expanded em algorithm, and what comes out is |
---|
0:27:39 | a wideband hmm. now i'm not gonna try to go into too many details, and i'm really trying to keep |
---|
0:27:43 | the equations to a minimum, but i just want to point out |
---|
0:27:46 | a few notable things. this is the variance update equation, and there are a few things that are interesting i think |
---|
0:27:51 | about this update equation |
---|
0:27:55 | first of all— sorry, i should mention the notation i've adopted here is from the |
---|
0:28:00 | missing feature literature, so o is something that you would observe and m is something that's missing. so you consider o |
---|
0:28:05 | to be the |
---|
0:28:06 | telephone band frequency components, and m to be the missing high-frequency components you're trying to model with your |
---|
0:28:12 | hmm |
---|
0:28:13 | the second thing is that the posterior computation is only computed over the |
---|
0:28:17 | low band that you have, the reliable bands; you've actually marginalised out the components you don't have over all |
---|
0:28:22 | your models, and so therefore erroneous estimates that you make in this process don't corrupt your posterior calculations, because you're |
---|
0:28:28 | only computing posteriors based on reliable information that you know |
---|
0:28:31 | is there |
---|
0:28:32 | the other interesting thing is that |
---|
0:28:34 | rather than having an estimate that's global across all your data, you actually have a state conditional estimate |
---|
0:28:41 | where the estimate of the wideband feature is determined by the observation at time t as well as the state you're |
---|
0:28:46 | in, and so this says |
---|
0:28:48 | the extended wideband feature i have here can be a function of both the data i see as well |
---|
0:28:53 | as whether i'm in a fricative or a plosive, for |
---|
0:28:57 | example |
---|
0:28:58 | and finally there's this variance piece at the end here, which says, in general, for this particular gaussian |
---|
0:29:06 | how much uncertainty overall is there in trying to do this mapping. so maybe we're in a case where |
---|
0:29:10 | doing this mapping is really hard, because there's very little correlation in the time-frequency structure, so we will have high |
---|
0:29:16 | variance there, so the model can reflect the fact |
---|
0:29:22 | that the estimates that we're using may be poor |
---|
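Written out, the three properties just highlighted follow from the standard missing-data EM quantities (a reconstruction from the description; o = observed low band, m = missing high band, s indexes the state/gaussian):

```latex
% E-step posteriors use only the observed (reliable) dimensions,
% with the missing dimensions marginalized out:
\gamma_t(s) \propto p(s)\,\mathcal{N}\!\big(o_t;\ \mu_o^{(s)},\ \Sigma_{oo}^{(s)}\big)
% state-conditional estimate of the missing components:
\hat{m}_t^{(s)} = \mathbb{E}[m \mid o_t, s]
  = \mu_m^{(s)} + \Sigma_{mo}^{(s)}\big(\Sigma_{oo}^{(s)}\big)^{-1}\big(o_t - \mu_o^{(s)}\big)
% the M-step variance update adds the conditional covariance, so the model
% carries the uncertainty of the extension for that gaussian:
\mathrm{Cov}[m \mid o_t, s]
  = \Sigma_{mm}^{(s)} - \Sigma_{mo}^{(s)}\big(\Sigma_{oo}^{(s)}\big)^{-1}\Sigma_{om}^{(s)}
```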
0:29:26 | so if we look at the performance: here we've taken a wall street journal task; we basically took the |
---|
0:29:30 | training data and partitioned it into a wideband set and a narrowband set at some proportion |
---|
0:29:36 | and so the idea is that if you look at the performance of the wideband data that's the lower line |
---|
0:29:40 | it's about ten percent |
---|
0:29:42 | and if you take the entire system and sort of telephonize it all, you end up with the upper |
---|
0:29:46 | curve, the purple curve; that's the sort of narrowband system. the goal of this is to say: given |
---|
0:29:52 | some wideband data, and extending the rest, the narrowband data, how much can we close that gap |
---|
0:29:57 | so this is comparing the results of the feature version and the model domain version, and |
---|
0:30:01 | we can see that when we have a split of eighty twenty |
---|
0:30:04 | the performance is about the same and so in that case you know why go through all the extra computation |
---|
0:30:08 | the feature |
---|
0:30:09 | version works quite well. interestingly, once you go to a more extreme case, where only ten percent of the training set |
---|
0:30:14 | is actually wideband and the rest is narrowband, doing it in the feature version actually leaves you worse off than just |
---|
0:30:19 | training an entire narrowband system |
---|
0:30:22 | because there's lots of uncertainty in the extension that you do in the front end which is not reflected in |
---|
0:30:26 | your model at all but if we do the training in this integrated framework |
---|
0:30:31 | we end up with, you know, performance that again is better than or equal to all narrowband |
---|
0:30:39 | so |
---|
0:30:44 | to talk about this second prong of the triangle here, recycle |
---|
0:30:51 | potentially, narrowband data can be recycled for use as wideband data. this may allow us to use the existing piles |
---|
0:30:58 | of legacy data we have |
---|
0:30:59 | and for initial systems, where we want to build with narrowband data— narrowband data may be easier to collect |
---|
0:31:05 | and maybe supplemented with just a small amount of wideband data |
---|
0:31:08 | you can do this in the front end, or we can come up with this sort of integrated training framework |
---|
0:31:13 | and |
---|
0:31:14 | like in noise robustness, in this case there is a front-end version, as i talked about, and there are advantages to that |
---|
0:31:19 | i should sort of mention |
---|
0:31:21 | so doing it right— it allows you, if you do this in the front end, to use |
---|
0:31:25 | whatever features you want; you can then take the output and postprocess it, use bottleneck features |
---|
0:31:29 | stack a bunch of frames and do lda, and so you have a little bit more flexibility in what you wanna |
---|
0:31:32 | do downstream from this process |
---|
0:31:34 | and the other interesting thing is that the same technology can be used in the reverse scenario where the input |
---|
0:31:45 | may be narrowband and the model is actually wideband |
---|
0:31:50 | you may wonder where this happens, but this actually happens in systems a lot: as soon as someone puts |
---|
0:31:51 | on a bluetooth headset |
---|
0:31:51 | you could have a wideband-deployed system, and somebody decides that they wanna, you know, be safe and hands-free and puts |
---|
0:31:57 | on a bluetooth headset; all of a sudden what comes in to your system is |
---|
0:31:59 | narrowband. if you don't do something about it, you're gonna get killed— |
---|
0:32:05 | sorry— you're going hands free so you don't get killed, but anyway, your performance is gonna suffer |
---|
0:32:12 | and so, you know, one option would be to maintain two models in your server; the other idea is you can |
---|
0:32:17 | actually do bandwidth extension in the front end and process that by a wideband recognizer. the nice thing there |
---|
0:32:23 | is you don't have to be as good as true wideband performance |
---|
0:32:27 | you just have to be better than or as good as what the narrowband performance would be, and |
---|
0:32:31 | then it's worth it to do that |
---|
0:32:33 | so |
---|
0:32:34 | finally i'd like to move on to the last component here, reuse |
---|
0:32:44 | and talk about the reuse of speaker transforms |
---|
0:32:49 | so |
---|
0:32:50 | one of the things that we found |
---|
0:32:52 | is that |
---|
0:32:54 | the utterances in the applications that are being deployed commercially now are really short |
---|
0:32:59 | and so |
---|
0:33:00 | you know, in seattle obviously people say 'starbucks' quite a bit |
---|
0:33:05 | or 'movie show times', or in the living room scenario 'xbox play' may be, you know, the only thing |
---|
0:33:10 | you get |
---|
0:33:11 | in addition to that, these aren't really generally rich dialogue interactive systems, so these are sort of one shot |
---|
0:33:16 | things where you speak, you get a result, and you're done |
---|
0:33:19 | so that the combination of these two things |
---|
0:33:22 | makes it really difficult to obtain sufficient data for doing conventional speaker adaptation from a single session of use |
---|
0:33:28 | so doing things like mllr or cmllr becomes quite difficult in the single utterance |
---|
0:33:33 | case and so |
---|
0:33:35 | an obvious solution to this is to say, well, let's just accumulate the data over time, across sessions. we have |
---|
0:33:39 | users, you know, making multiple queries to the system |
---|
0:33:43 | so let's aggregate it all together, and then we'll have data sufficient to build a transform |
---|
0:33:49 | the |
---|
0:33:50 | difficulty comes in because these are applications on mobile phones, which means the people are obviously mobile |
---|
0:33:57 | too |
---|
0:33:58 | and across all these different uses they're actually in different environments |
---|
0:34:02 | that creates additional variability in the data that we accumulate over time. and so |
---|
0:34:10 | by way of, i guess i would say, you know, a metaphor here: let's imagine a user |
---|
0:34:15 | calls the system, and the observation comes in as y, and that's some combination of the phonetic content, which i'm |
---|
0:34:21 | showing as a white box |
---|
0:34:22 | some speaker-specific information, shown as a blue box, and |
---|
0:34:27 | some, you know, environmental background information as the red box |
---|
0:34:31 | so the user gives us the speech, and the system says, oh okay, we'll run our speaker adaptation and store away the transform |
---|
0:34:37 | so the next time this user calls we'll know, we'll be loaded up and ready to go |
---|
0:34:41 | so sure enough, sometime later, the user calls back |
---|
0:34:45 | and the phonetic content, you know, may or may not be the same— |
---|
0:34:47 | the speaker |
---|
0:34:49 | is the same |
---|
0:34:50 | but now, you know, he or she is in a different location or different environment, and so the observation is |
---|
0:34:56 | now green instead of purple. and as a result, if we do adaptation on the model using the stored transform |
---|
0:35:01 | the mismatch persists; this is not optimal |
---|
0:35:04 | and so what we would like is |
---|
0:35:09 | a solution where |
---|
0:35:10 | the variability, when we do something like adaptation, can be separated or teased apart |
---|
0:35:16 | so that we can say let's just hold onto the part that's related to the speaker and sort of throw away |
---|
0:35:21 | the part that's the environment— or better yet, store the part that's for the environment, so that if we see a different user call |
---|
0:35:27 | back from that same environment we can actually use that as well |
---|
0:35:30 | so in order to do this sort of factorisation or separation of the different sources of variability |
---|
0:35:38 | you actually need an explicit way to do joint compensation; it's very hard to separate these things if you |
---|
0:35:43 | don't have a model that explicitly models them |
---|
0:35:46 | as |
---|
0:35:46 | individual sources of variability |
---|
0:35:49 | and so to do this there's |
---|
0:35:52 | several pieces of work that have been proposed. it's sort of like being at a diner, where it sort of |
---|
0:35:57 | goes, choose one from column a and one from column b: you can sort of take all of, you know |
---|
0:36:01 | all your favourite speaker adaptation algorithms, and you can take |
---|
0:36:04 | all the ones that apply for environmental adaptation, pick one from each column and combine them, and then you |
---|
0:36:09 | can have a usable model. and the interesting thing is that this was sort of proposed |
---|
0:36:15 | about ten years ago |
---|
0:36:17 | but as far as i can tell, with the exception of joint factor analysis in two thousand five |
---|
0:36:22 | there's not that much work on it since, and now it sort of seems to have come |
---|
0:36:25 | on the scene again, which is good, i think; it's obviously nice, the more people |
---|
0:36:31 | there are working on this, you know |
---|
0:36:33 | the better it is |
---|
0:36:33 | so |
---|
0:36:35 | all these possible combinations of methods can do this |
---|
0:36:40 | joint compensation together; i'll talk about one particular instance |
---|
0:36:45 | of using cmllr transforms, mostly because i've already talked about how vts is used, and so i'm trying to show |
---|
0:36:52 | several different |
---|
0:36:54 | ways you can go about doing compensation for noise |
---|
0:36:57 | so in this case we're gonna talk about the idea that you can use a cascade of cmllr transforms |
---|
0:37:02 | one that captures environmental variability, one to capture speaker variability |
---|
0:37:06 | a nice thing about using transforms like this is that— we give up the benefit of all the structure we |
---|
0:37:11 | had in the environmental model, using solutions like vts |
---|
0:37:14 | but we get the ability to have much more flexible use, meaning that we have no restriction on what the |
---|
0:37:20 | features we can use are, or what data it's trained from, and we don't have to do these |
---|
0:37:24 | adaptive training schemes like noise adaptive training |
---|
0:37:29 | the idea is quite simple: find the transforms that maximise the likelihood— a set of environmental transforms and a set |
---|
0:37:35 | of speaker transformations— given a sample of training or adaptation data |
---|
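One way to write that cascade down (a sketch; whether the speaker or the environment transform sits innermost is a design choice):

```latex
% cascaded CMLLR-style feature transforms:
\hat{x}_t = A^{(s)}\!\left(A^{(e)} y_t + b^{(e)}\right) + b^{(s)}
% both sets are estimated by maximizing the usual CMLLR objective,
% alternating over one set while the other is held fixed:
\sum_t \log \mathcal{N}\!\left(\hat{x}_t;\ \mu_{q_t},\ \Sigma_{q_t}\right)
  + T \log \left|\det\!\left(A^{(s)} A^{(e)}\right)\right|
```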
0:37:40 | now of course, you know, it's not hard to see that this cascade of linear transforms is itself a linear |
---|
0:37:45 | transform |
---|
0:37:46 | and as a result you can take a linear transform and factor it into two separate transforms in an arbitrary |
---|
0:37:52 | number of ways, many of which will |
---|
0:37:55 | not be meaningful. and so the way that we're gonna get around this is to borrow heavily from the key |
---|
0:38:01 | idea, i think, in joint factor analysis from speaker recognition, which is to say let's learn the transformations on partitions |
---|
0:38:08 | of the training data, where we're able to sort of isolate the variability that we're after |
---|
0:38:13 | so pictorially |
---|
0:38:14 | it's still a bit busy, so i apologise if it |
---|
0:38:17 | gives you a headache, but |
---|
0:38:19 | you can think about the idea that you're basically gonna group the data by speaker, and given those groups |
---|
0:38:25 | you can update your speaker transforms |
---|
0:38:27 | then you're gonna repartition your data by environment, keep your speaker transforms fixed and update your environment transforms, and then go |
---|
0:38:32 | back and forth in this manner |
---|
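A toy version of that alternating partition scheme (illustrative numpy only—simple additive offsets stand in for full cmllr transforms, and the offsets are recovered up to a shared constant):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: utterances indexed by (speaker, environment); each observation
# is a speaker offset plus an environment offset plus a little noise.
spk_true = {"A": 1.0, "B": -1.5}
env_true = {"car": 2.0, "home": -0.5, "street": 4.0}
utts = [(s, e, spk_true[s] + env_true[e] + 0.1 * rng.standard_normal(50))
        for s in spk_true for e in env_true]

spk = {s: 0.0 for s in spk_true}
env = {e: 0.0 for e in env_true}
for _ in range(20):
    # Group by speaker: update speaker offsets with environments held fixed.
    for s in spk:
        resid = np.concatenate([y - env[e] for s2, e, y in utts if s2 == s])
        spk[s] = resid.mean()
    # Regroup by environment: update environment offsets, speakers fixed.
    for e in env:
        resid = np.concatenate([y - spk[s] for s, e2, y in utts if e2 == e])
        env[e] = resid.mean()

print(spk, env)
```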
0:38:35 | now of course, doing this operation assumes that you have a sense of what your speaker clusters are and what your environment |
---|
0:38:41 | clusters are |
---|
0:38:48 | there are some cases where it sounds reasonable to assume the labels are given to you. so for example |
---|
0:38:49 | if it's a |
---|
0:38:55 | phone over, you know, a mobile phone data plan, there you can have a caller id or a user id |
---|
0:39:00 | or the hardware address, and so you can have high confidence that you know who the speaker is. similarly, for |
---|
0:39:00 | certain applications |
---|
0:39:06 | like the xbox in the living room, we can be reasonably certain— we can say, okay, this thing is |
---|
0:39:10 | probably not driving along in a car at sixty miles an hour; it probably is in the living room, so we can assume |
---|
0:39:16 | the environment in that case. or, if we don't have this information, you can literally run environment clustering algorithms or |
---|
0:39:16 | speaker clustering |
---|
0:39:18 | and so |
---|
0:39:19 | yeah just to show some results here |
---|
0:39:24 | the idea is you can again take |
---|
0:39:32 | the training data, let's say from a variety of environments and a variety of speakers |
---|
0:39:32 | and estimate some environment transforms on the training data |
---|
0:39:40 | to do that, of course, you have to estimate the speaker transforms as well, but in this case |
---|
0:39:45 | the speakers in training and test are distinct, and so those speaker transforms are not useful for us in the |
---|
0:39:45 | reuse scenario |
---|
0:39:50 | and so we've tried here to say let's estimate the speaker transform |
---|
0:39:54 | given data from a single environment; in this case it's the subway |
---|
0:39:56 | we can take that |
---|
0:40:00 | transform, either estimated in this way, where the sources of variability are factored |
---|
0:40:09 | or estimated using the sort of conventional cmllr approach, and apply it to data from the same speaker in six different environments |
---|
0:40:12 | three of which are environments that were seen in training, three of which are |
---|
0:40:12 | that are not seen |
---|
0:40:14 | and you can see in both cases you get a benefit by having the additional transform in there to absorb |
---|
0:40:21 | the variability from the noise, so that the speaker transform can focus on just the variability |
---|
0:40:27 | that comes from the speaker that you care about. and so you can see there's a gain over doing cmllr alone, and |
---|
0:40:32 | that comes again from the fact that this transform is presumably not |
---|
0:40:38 | learning the mapping of the environment plus the speaker; it's ideally learning the transform for just the speaker alone |
---|
0:40:46 | so |
---|
0:40:49 | in |
---|
0:40:50 | scenarios where speaker data is scarce |
---|
0:40:54 | reuse is important for adaptation |
---|
0:40:59 | now, in a case where each utterance is, you know, ten or fifteen or twenty seconds, these techniques |
---|
0:41:04 | are not nearly as important, but in the case where you only have a second or two of data, you wanna |
---|
0:41:07 | be able to aggregate all this data and build a model for that speaker |
---|
0:41:11 | but when the data comes from |
---|
0:41:14 | places |
---|
0:41:15 | where there's a high degree of variability from other sources |
---|
0:41:18 | the problem becomes a little more challenging |
---|
0:41:20 | and this can be environments, it can be devices— you have |
---|
0:41:24 | you know, some of your data that's |
---|
0:41:27 | being held up like this, then you have far field data, then you have additional data that's from four |
---|
0:41:32 | feet away on your couch |
---|
0:41:33 | all these things are all different, different microphones; all these sources are things that are basically blurring |
---|
0:41:39 | the speaker transform you're trying to learn, and you want to be able to isolate those in order to reuse the speaker transform |
---|
0:41:44 | so doing this style of adaptation allows a secondary transform to absorb this unwanted variability |
---|
0:41:50 | and |
---|
0:41:52 | and there are various ways of doing it; you know, obviously if you have sort of |
---|
0:41:57 | transforms that are specifically modeling different things explicitly, it'll be easier to get the separation. if we just have things |
---|
0:42:02 | like |
---|
0:42:03 | two generic linear transforms, then you need to sort of resort to just these data partitioning schemes |
---|
0:42:09 | which you know |
---|
0:42:10 | makes things a little bit more difficult |
---|
0:42:12 | so |
---|
0:42:14 | here i've just tried to hit a little bit on, you know, three aspects of speech recognition |
---|
0:42:19 | going green in this reduce, reuse, recycle framework. before i conclude i just wanted to briefly touch on— i think |
---|
0:42:27 | you know |
---|
0:42:28 | as someone who's worked, i guess, pretty heavily in robustness and these ideas, i sorta wanna |
---|
0:42:34 | talk about— i've sort of encountered three personalities that people sort of take on, and |
---|
0:42:39 | so i wanna sort of address them |
---|
0:42:41 | and you may find yourself thinking you're one of these personas in turn |
---|
0:42:44 | and so i wanna sort of address each of those. so i think there's people who are the believers |
---|
0:42:49 | there's people who are |
---|
0:42:50 | the sceptics |
---|
0:42:51 | and there's people who i would call the willing, which are sort of the people who say, oh well, maybe |
---|
0:42:55 | i'll give this a try. and, you know, i think— |
---|
0:43:00 | i think about sort of the resurgence in neural net acoustic modeling as a good example of |
---|
0:43:04 | this, or maybe some auditory inspired signal processing is another example, where |
---|
0:43:08 | there were true believers in sort of acoustic models using neural nets, and there were folks who said we can't beat |
---|
0:43:13 | an hmm |
---|
0:43:14 | you know, put that aside— and then, you know, results kind of improved, and people said i'll give |
---|
0:43:19 | this a try again; they moved from being sceptics to the willing |
---|
0:43:22 | now they've got good results and they're all believers again. and so i think i wanna sort of talk about |
---|
0:43:28 | these very briefly. so to the sceptics i would sort of say— yes, you know, one thing |
---|
0:43:32 | that i think is interesting is there's been research in robustness in speech recognition going on for a long time |
---|
0:43:38 | in lots of sessions |
---|
0:43:40 | lots of papers, lots of talks |
---|
0:43:43 | but if you look at the tasks that have become standard for robust speech recognition, like the ones i talked |
---|
0:43:47 | about today, they're all very small vocabulary tasks |
---|
0:43:49 | compared to today's state-of-the-art systems, things like switchboard and gale and meeting recognition |
---|
0:43:54 | and in these very large scale systems, like switchboard and gale and meetings |
---|
0:43:58 | robustness techniques are not really a part of the puzzle there. and so i think it's very fair to ask: |
---|
0:44:03 | are all these methods really necessary in any sort of |
---|
0:44:06 | real deployed system? i would say to that— i would just say, yes, it depends. and i sorta wanna |
---|
0:44:10 | give a few very anecdotal examples to sort of motivate why i think this is |
---|
0:44:16 | if you think of the examples— in production quality systems that do have all the bells and whistles that |
---|
0:44:22 | everyone knows about, that are common in large scale systems |
---|
0:44:26 | we see in things like voice search, you know, in fact the gains are small, and so, you know, it's |
---|
0:44:30 | not really a huge win to employ these techniques, and so it's a fair critique to say we don't need— |
---|
0:44:35 | we don't need robustness |
---|
0:44:36 | as you move to something like the car, it turns out that actually the gains are pretty big
---|
0:44:41 | and, you know, you can make these systems much more usable by incorporating some elements of
---|
0:44:47 | noise robustness into
---|
0:44:49 | your system
---|
0:44:51 | finally i would actually say, with the xbox kinect,
---|
0:44:54 | it turns out that, i would say, these systems are actually unusable
---|
0:44:57 | if, you know, if i consider robustness as the entire sort of audio processing front-end plus whatever happens
---|
0:45:03 | in the recognizer
---|
0:45:04 | if we
---|
0:45:05 | throw all that away, just use the microphone to listen, and do everything in the model space, the systems are
---|
0:45:09 | actually unusable
---|
0:45:11 | and so there actually is a large place for this
---|
0:45:13 | technology in certain scenarios
---|
0:45:15 | so
---|
0:45:16 | turning to the willing: if someone says, well, you know, what's the easiest way to try this stuff,
---|
0:45:22 | the thing to try, and sort of the biggest bang for the buck, is what
---|
0:45:27 | i would say li deng called noise adaptive training in the feature space
---|
0:45:32 | the idea is very simple: you have some training data,
---|
0:45:35 | and you believe you have some way to enhance the data at run time; you take the training data, pass it
---|
0:45:39 | through the same exact process, and retrain your acoustic model. you know, this is
---|
0:45:44 | basically very akin to speaker adaptive training: you're basically updating your features
---|
0:45:50 | before you retrain your model. it turns out that if you do this, you tend to get performance
---|
0:45:55 | that generally is far superior to trying to compensate noisy speech to recognize it with a clean-trained hmm
---|
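A minimal sketch of feature-space noise adaptive training as just described: run the training data through the exact enhancement used at run time, then retrain the acoustic model on the result. `enhance`, `train_acoustic_model`, and `recognizer` are hypothetical stand-ins for a real front end, trainer, and decoder.

```python
def noise_adaptive_training(noisy_training_utts, transcripts,
                            enhance, train_acoustic_model):
    # Enhance every training utterance with the same front end used at test
    # time, so the model is trained in the matched, enhanced domain.
    enhanced = [enhance(utt) for utt in noisy_training_utts]
    # Retrain on the enhanced features: the model learns whatever residual
    # distortion the enhancement leaves behind.
    return train_acoustic_model(enhanced, transcripts)

def decode(noisy_utt, model, enhance, recognizer):
    # At run time, apply the exact same enhancement before recognition; the
    # matched condition is what gives the gain over a clean-trained model.
    return recognizer(model, enhance(noisy_utt))
```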
0:46:02 | and if you are going to try this, i think, you know, the standard algorithms are fine, things like spectral subtraction; i
---|
0:46:07 | mean
---|
0:46:08 | the fanciest ones work great, but the incremental improvements are small; i think getting the basics working is important
---|
0:46:14 | but the important thing is you need to sort of tune and optimize the right objective function. i've, you know,
---|
0:46:19 | talked to people who say, oh, we got, you know, a spectral subtraction component from
---|
0:46:23 | my friend who's in the speech enhancement part of our lab, and i just tried it, and, you
---|
0:46:26 | know, it didn't work at all. and the reason is that these things are optimized completely differently, and so
---|
0:46:31 | we need to really, you know, understand that
---|
0:46:32 | you do need to understand all the details and nuances of what's happening. generally there's a whole
---|
0:46:36 | set of parameters and floors and weights
---|
0:46:39 | and things
---|
0:46:40 | and those things can all be tuned, and you can tune them to, you know,
---|
0:46:44 | minimize word error rate, and that would be great. you can do that in a greedy way: let's just
---|
0:46:48 | sweep a whole bunch of parameters until we get the best performance
---|
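As a concrete illustration (a sketch, not the talk's system), here is basic magnitude-domain spectral subtraction with the kinds of knobs just mentioned, an oversubtraction weight and a spectral floor, plus the greedy sweep against word error rate. `run_recognizer_and_score` is a hypothetical helper that decodes a dev set and returns its WER.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, floor=0.05):
    """noisy_mag: (frames, bins) STFT magnitudes; noise_mag: (bins,) estimate.

    alpha is the oversubtraction weight; floor keeps a fraction of the noisy
    magnitude as a lower bound, the classic defense against spectral holes
    that turn into musical noise.
    """
    clean_est = noisy_mag - alpha * noise_mag
    return np.maximum(clean_est, floor * noisy_mag)

def greedy_tune(dev_noisy_mag, noise_mag, run_recognizer_and_score):
    # Sweep the two knobs and keep whatever minimizes word error rate on a
    # dev set; the grids here are purely illustrative.
    best_params, best_wer = None, float("inf")
    for alpha in (1.0, 1.5, 2.0, 3.0):
        for floor in (0.01, 0.05, 0.1):
            wer = run_recognizer_and_score(
                spectral_subtraction(dev_noisy_mag, noise_mag, alpha, floor))
            if wer < best_wer:
                best_params, best_wer = (alpha, floor), wer
    return best_params, best_wer
```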
0:46:50 | you can also use something called pesq, which is a computational proxy; it stands for the perceptual evaluation of speech
---|
0:46:56 | quality. it's basically, you know, a model of what
---|
0:46:59 | human listeners would say, and it turns out that pesq scores are quite correlated with speech recognition performance. so if
---|
0:47:05 | you can maximize that, or if your, yeah, signal processing buddies have some algorithm that maximizes pesq, that's
---|
0:47:12 | a good place to start. and it turns out that doing things like maximizing snr is about the worst thing you can
---|
0:47:16 | do; it creates all kinds of
---|
0:47:17 | distortions
---|
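A sketch of tuning against PESQ instead of WER, assuming the third-party `pesq` Python package (an ITU-T P.862 implementation) and a small parallel dev set of clean references; `enhance` and `param_grid` are hypothetical stand-ins for your enhancer and its candidate settings.

```python
import numpy as np
from pesq import pesq  # pip install pesq; ITU-T P.862 implementation

def mean_pesq(clean_refs, enhanced, fs=16000):
    # Average wideband PESQ over a parallel dev set of 1-D float waveforms.
    return float(np.mean([pesq(fs, ref, deg, "wb")
                          for ref, deg in zip(clean_refs, enhanced)]))

def tune_by_pesq(clean_refs, noisy, enhance, param_grid, fs=16000):
    # Pick the enhancement setting with the highest mean PESQ; per the talk,
    # PESQ tracks recognition accuracy far better than raw SNR does.
    scored = [(mean_pesq(clean_refs, [enhance(x, **p) for x in noisy], fs), p)
              for p in param_grid]
    return max(scored, key=lambda sp: sp[0])[1]
```

The design point is simply that the objective, not the algorithm, is what transfers: the same enhancer tuned for SNR and tuned for PESQ can behave very differently in front of a recognizer.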
0:47:20 | so |
---|
0:47:21 | with that i just want to conclude and say that we've proposed that potentially there's goodness to be
---|
0:47:27 | had by using existing data, and, you know, we sort of put this under the banner of going green
---|
0:47:34 | i've just tried, in this case, to provide, you know, one example of the way that
---|
0:47:38 | we can reduce, recycle, and reuse
---|
0:47:41 | the data that we have, either from an environmental mismatch point of view, a bandwidth point of view, or a speaker
---|
0:47:47 | adaptation point of view. there's many other
---|
0:47:50 | ways to do this; i've just talked about a few, and of course there's more work to be done
---|
0:47:54 | and so with that i will thank you
---|
0:47:58 | let's thank the speaker
---|
0:48:05 | oh we have plenty of time for questions |
---|
0:48:11 | so, thanks, mike
---|
0:48:13 | great talk. i was wondering if you can address
---|
0:48:16 | some other problems in the robustness area, for example
---|
0:48:22 | there are many cases where there are gross nonlinear distortions that are going to be applied to the
---|
0:48:29 | signal as it transits the communication channel, and of what you talked about, i mean,
---|
0:48:36 | the transform techniques could obviously have a go at it, but i'm wondering if you have any comments on what
---|
0:48:42 | you do when there are
---|
0:48:43 | gross nonlinear distortions of the signal, where the signal is still basically intelligible, but it doesn't fit any
---|
0:48:51 | of the classical speech-plus-noise models
---|
0:48:55 | well
---|
0:48:56 | the one thing i would say is
---|
0:48:59 | that
---|
0:49:03 | that is a hard problem
---|
0:49:04 | i
---|
0:49:07 | think we can all agree on that
---|
0:49:11 | so, that feature-space adaptive training technique
---|
0:49:15 | is nicely generic across any kind of distortion. so if you actually have the ability, if you know what
---|
0:49:19 | that coding is and can model it somehow, you can just pass data through that; that's probably the best way to
---|
0:49:24 | model it, sort of.
---|
0:49:25 | it's not very fancy, but i think it'll work
---|
0:49:30 | the other thing is, a lot of these distortions are bursty
---|
0:49:33 | and i find that you can actually just detect them
---|
0:49:36 | by building, you know, whatever classifier, and at that point, you know, you can for example say, i'm
---|
0:49:41 | going to, you know, compute my decoder score by just giving up on these frames; there's no content here. that's
---|
0:49:46 | another way you can do it
---|
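A minimal sketch of the detect-and-give-up idea (an editorial illustration, not a system from the talk): flag frames whose energy jumps well above a running noise-floor estimate, and have the decoder substitute a flat score bias for those frames. The thresholds are placeholders, and `score_fn` is a hypothetical per-state acoustic scorer.

```python
import numpy as np

def detect_bursts(frame_energies_db, margin_db=15.0, smooth=0.98):
    """Flag frames whose energy jumps far above a running noise-floor
    estimate. A real system would train a proper burst-versus-speech
    classifier; this energy rule is only a placeholder."""
    floor = frame_energies_db[0]
    mask = np.zeros(len(frame_energies_db), dtype=bool)
    for t, e in enumerate(frame_energies_db):
        mask[t] = e > floor + margin_db
        if not mask[t]:  # track the floor only on non-burst frames
            floor = smooth * floor + (1.0 - smooth) * e
    return mask

def acoustic_score(frame, state, is_burst, score_fn, bias=-5.0):
    # For burst frames there is "no content here": skip the real acoustic
    # score and return a flat bias so no hypothesis is rewarded or punished.
    return bias if is_burst else score_fn(frame, state)
```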
0:49:49 | i think sort of trying to have an explicit model for, you know,
---|
0:49:52 | every nonlinearity you might encounter,
---|
0:49:54 | i think that probably won't work
---|
0:49:55 | and i'd like to believe that there is some way, you know, that we can extend the linear transformation scheme to
---|
0:50:00 | nonlinear transformations, like some kind of an mlp-
---|
0:50:03 | mllr kind of thing, but, you know, that remains to be seen, and again that doesn't really quite
---|
0:50:07 | get at, i sort of
---|
0:50:09 | think, the kind of occasional
---|
0:50:11 | gobbledygook that comes in; i don't think that would really address that, so i think it's those two other techniques
---|
0:50:15 | so
---|
0:50:21 | i think the one thing that's interesting is the correlation between how people speak and the noise background,
---|
0:50:28 | a kind of
---|
0:50:29 | coupling with it,
---|
0:50:31 | rather than just additive noise. so the lombard effect has the obvious
---|
0:50:36 | loudness-of-speech aspect, which we're pretty well able to compensate for,
---|
0:50:40 | you know, we normalize stuff
---|
0:50:42 | but there's the lombard spectral tilt,
---|
0:50:45 | which means that the louder the noise is, the more vocal effort there is, and the more tilt there is
---|
0:50:51 | to the spectrum and all that sort of thing. how do the techniques you're talking about address that? it
---|
0:50:57 | is a whole kind of different problem, because
---|
0:51:00 | the environment
---|
0:51:01 | model really doesn't
---|
0:51:02 | capture it, yeah, unless you know the signal-to-noise ratio
---|
0:51:06 | directly
---|
0:51:07 | right. so i think
---|
0:51:11 | what's interesting about those is those are
---|
0:51:15 | speaker
---|
0:51:16 | effects that are manifested by the environment
---|
0:51:19 | and so, like you said, having environment models is not going to capture that at all. it's more like maybe having,
---|
0:51:24 | well, you may want to have some kind of, you know, so i don't know, i don't have the exact
---|
0:51:28 | answer, although i would think that having an environment-informed
---|
0:51:33 | speaker
---|
0:51:34 | transform kind of thing would be useful. so, you know, potentially
---|
0:51:38 | your choice of, you know, vtln warp parameters, for example, could be affected by what you perceive in the environment
---|
0:51:44 | and the level of speaker effort
---|
0:51:45 | that
---|
0:51:46 | you detect
---|
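Since the answer above is explicitly speculative, here is only a sketch of what such an environment-informed speaker transform might look like: standard maximum-likelihood VTLN warp selection, with the candidate warp range shifted when the estimated SNR suggests Lombard-style effort. `warp_features` is a hypothetical frequency-warping front end, and the SNR threshold and offsets are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_warp(waveform, snr_db, warp_features, gmm: GaussianMixture):
    # Under loud noise, speakers raise vocal effort and shift their spectra,
    # so bias the candidate warps accordingly (offsets are illustrative).
    base = np.arange(0.88, 1.13, 0.02)
    candidates = base + (0.02 if snr_db < 10.0 else 0.0)
    # Maximum-likelihood warp selection against a trained speech GMM:
    # score the warped features under the model and keep the best warp.
    scores = [gmm.score(warp_features(waveform, a)) for a in candidates]
    return candidates[int(np.argmax(scores))]
```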
0:51:49 | and the other thing, of course, is sort of
---|
0:51:51 | the poor man's answer, which would be, you know, i'm not sure how much of this can be modeled
---|
0:51:56 | by existing speaker adaptation techniques, you know
---|
0:52:00 | again, i think a lot of the effects end up being nonlinear
---|
0:52:04 | and so it's hard to
---|
0:52:05 | sweep them under the rug with an mllr transform
---|
0:52:08 | but, so, i think that comes at it, you know, in a sense opposite to what i was trying
---|
0:52:11 | to talk about,
---|
0:52:12 | orthogonalisation of
---|
0:52:14 | the speech and the noise; i think you're actually suggesting the opposite, which is a jointly informed
---|
0:52:20 | transform, which i think is a very enticing area
---|
0:52:22 | i can't imagine there's been too much work on it
---|
0:52:31 | wouldn't
---|
0:52:32 | it be greener if the features that came in were themselves insensitive
---|
0:52:37 | to the noise in the first place?
---|
0:52:40 | well, if i agreed with you, then i'd be through giving talks, so
---|
0:52:43 | i can't agree with you now;
---|
0:52:45 | maybe at the coffee break i can agree with you. but no, i think that's,
---|
0:52:50 | that's true, right. i think a lot of this comes with the biologically inspired kind of
---|
0:52:56 | features, and i think that's true. and i think actually, in fact, the work that
---|
0:53:02 | oriol and suman did kind of shows that; if i remember
---|
0:53:06 | correctly, they trained a deep net on aurora and got, you know, a high degree of noise robustness
---|
0:53:11 | just from running the network
---|
0:53:13 | it potentially learned some kind of
---|
0:53:15 | noise-invariant
---|
0:53:16 | features
---|
0:53:17 | you know, i think
---|
0:53:19 | that's right. and so, no, i think that's true. the problem is, i think, right now,
---|
0:53:23 | where we are,
---|
0:53:24 | it's hard to come up with sort of a one-size-fits-all
---|
0:53:28 | scheme. so, there's one other thing,
---|
0:53:32 | it's about
---|
0:53:34 | using gmms to enhance the data, in the specific example you gave
---|
0:53:40 | basically, as far as i understand, the gmm you mentioned was trained unsupervised, and it basically doesn't consider the transitions
---|
0:53:48 | i
---|
0:53:49 | in the gmm case
---|
0:53:51 | right, but you could also do an hmm there
---|
0:53:54 | well, but that's easy if you can see the transcriptions, like phone-level transcriptions
---|
0:54:00 | can you improve that signal? absolutely, yeah, that's what was shown. so, with a technique
---|
0:54:07 | that's purely speech-feature based, is that also possible?
---|
0:54:11 | yeah, well, that's a good question. well, yes, but i think you don't necessarily need a very
---|
0:54:17 | strong model
---|
0:54:18 | so, you know, you could, for example, have a phone-loop hmm
---|
0:54:25 | in the front end; that is using, like, a model-based technique, but
---|
0:54:29 | you know, getting the state sequence right actually is a problem in the feature technique, because
---|
0:54:33 | if you don't put heuristics on the search space, you
---|
0:54:38 | can have hypotheses skipping around states
---|
0:54:41 | you have inconsistent hypotheses for what the missing band is
---|
0:54:45 | and you can avoid that to some extent if you do a sort of cheap
---|
0:54:48 | decoding in the front end, where there's a phone hmm with a phone language model
---|
0:54:53 | and you could do that just so that you have, you know,
---|
0:54:56 | the benefit of the model actually
---|
0:54:58 | constraining your state space to sort of plausible sequences of phones
---|
0:55:03 | once you have that, i think whether you use it to enhance features or work in the model domain is,
---|
0:55:09 | you know, well, both are options
---|
0:55:11 | yeah, i mean, i would also agree, i think,
---|
0:55:13 | that the model domain
---|
0:55:15 | would be optimal
---|
0:55:16 | i think if you start saying, well, my system runs with
---|
0:55:19 | eleven frames of hlda and all this other stuff, it becomes a little harder
---|
0:55:24 | to do that. you know, you can sort of just assume it's going to be a blind transform like mllr,
---|
0:55:28 | but if you want to put structure in the transform
---|
0:55:30 | to map the low to the high frequencies, that gets a little more difficult
---|
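A minimal sketch of the phone-loop front end discussed in this exchange, using hmmlearn's GaussianHMM as a stand-in for a real phone loop: a cheap decode constrains the state sequence to plausible phone successions, and each decoded state's Gaussian then supplies a crude estimate of the missing band. The low/high band split and the mean substitution are illustrative, not the talk's actual bandwidth-extension algorithm, and `phone_loop` is assumed to be already trained.

```python
import numpy as np
from hmmlearn import hmm

def enhance_high_band(feats, phone_loop: hmm.GaussianHMM, n_low_dims: int):
    """feats: (frames x dims) array; the first n_low_dims dims are reliable."""
    # 1. Cheap decode: the phone loop constrains the state sequence to
    #    plausible phone successions instead of letting hypotheses skip
    #    around states with inconsistent guesses for the missing band.
    _, states = phone_loop.decode(feats)
    # 2. Reconstruct the unreliable upper band from each decoded state's
    #    Gaussian mean for those dimensions (a crude conditional estimate;
    #    a real system would use a proper conditional Gaussian or transform).
    enhanced = feats.copy()
    enhanced[:, n_low_dims:] = phone_loop.means_[states][:, n_low_dims:]
    return enhanced
```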
0:55:36 | okay
---|
0:55:37 | let's thank the speaker again
---|