0:00:13 | Thanks, Malcolm, and thanks to you and Josh for scheduling this presentation. |
---|
0:00:20 | It's great that everyone's here; the talks have been really great, and I've been |
---|
0:00:24 | making notes to cross-correlate with what everyone is talking about. |
---|
0:00:30 | In our work we are interested in speech signals, |
---|
0:00:37 | the raw data being, from a sensor specific to speech, the microphone |
---|
0:00:44 | audio signal sampled at a given rate, |
---|
0:00:48 | using a parametric model called the source-filter model; it's been around for a while. What we're |
---|
0:00:54 | extending it to do is joint source-filter modeling using wavelets, |
---|
0:01:00 | which we term flexible basis functions. |
---|
0:01:03 | This is work with Dan Rudoy and Patrick Wolfe at Harvard. |
---|
0:01:09 | So, where we fit in the session: we've seen that |
---|
0:01:15 | representations of speech can come in various domains, |
---|
0:01:20 | including machine learning from the raw data, which we just looked at, |
---|
0:01:26 | and cortical or auditory-like representations that would be very appropriate for |
---|
0:01:31 | perceptually based performance. |
---|
0:01:35 | However, we're focused on the opposite side, production-based models, |
---|
0:01:43 | and I'll tell you where the motivation comes from: the clinic. |
---|
0:01:49 | Along the lines of what the model is, we then look at how we characterize and parameterize it, |
---|
0:01:56 | that is, in which subspace; that's the subspace representation we'll talk about. |
---|
0:02:02 | I won't go into these higher-level approaches, |
---|
0:02:06 | but I should tie this up with the auditory-like representations that were |
---|
0:02:13 | just talked about in terms of summary statistics. |
---|
0:02:17 | Those are relevant here because we use a noise model, |
---|
0:02:21 | which is a good segue into what the model is for speech. |
---|
0:02:26 | And LPC coefficients, why and how they were developed, come from a model where there's a stochastic input, |
---|
0:02:34 | such as white Gaussian noise, |
---|
0:02:37 | here this w[n] function, |
---|
0:02:40 | that is input into a linear filter to give the resonances |
---|
0:02:45 | of the vocal tract; we'll restrict ourselves at the moment to |
---|
0:02:49 | vowels. |
---|
0:02:51 | But in reality there's a mismatch between that model and |
---|
0:02:57 | the source that we see in speech, which from the production side is both deterministic |
---|
0:03:02 | and stochastic. |
---|
0:03:04 | If we just had the noise, that whispered-speech model is |
---|
0:03:11 | appropriate for certain speech, but if we want to analyze both voiced and unvoiced speech then |
---|
0:03:17 | we need to add another term, |
---|
0:03:18 | and understand how to estimate this |
---|
0:03:21 | second source term, which we term u[n]. |
---|
0:03:24 | Sorry, this is a little bit low on the slide. |
---|
0:03:27 | So everything is the same except this added parameter, which will fit the deterministic, non-stochastic component, |
---|
0:03:36 | which comes from the glottal pulses, |
---|
0:03:39 | which may be periodic but don't have to be periodic. |
---|
0:03:44 | And how we estimate these is a question of in which subspace this |
---|
0:03:51 | representation of the voiced part of speech lies. |
---|
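The extended source-filter model described above can be sketched numerically as follows; every concrete value here (sampling rate, pulse spacing, pole radius) is an illustrative assumption, not a figure from the talk. The all-pole (AR) vocal-tract filter is driven by the sum of a deterministic voicing term u[n] and white Gaussian noise w[n].

```python
import numpy as np

fs = 8000                           # sampling rate in Hz (illustrative)
N = 1024                            # frame length in samples

# Deterministic source u[n]: glottal pulses (they need not be periodic).
u = np.zeros(N)
u[::100] = 1.0                      # one pulse every 100 samples (~80 Hz)

# Stochastic source w[n]: white Gaussian noise.
rng = np.random.default_rng(0)
w = 0.05 * rng.standard_normal(N)

# All-pole (AR) vocal-tract filter with one resonance near 2 kHz:
#   x[n] = (u[n] + w[n]) - a1*x[n-1] - a2*x[n-2]
fc, r = 2000.0, 0.95                # resonance frequency and pole radius
a1 = -2 * r * np.cos(2 * np.pi * fc / fs)
a2 = r ** 2

e = u + w                           # combined source
x = np.zeros(N)                     # synthesized voiced "speech" frame
for t in range(N):
    x[t] = e[t] - a1 * (x[t - 1] if t >= 1 else 0.0) \
                - a2 * (x[t - 2] if t >= 2 else 0.0)
```

The estimation problem discussed next is the inverse of this synthesis: recover a1, a2 and the source terms given only x.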
0:03:56 | I'll preempt the earlier question by saying that we're assuming clean speech. |
---|
0:04:06 | My PhD research was at Massachusetts General Hospital, and in the voice clinic |
---|
0:04:15 | we're able to get clean speech in a very stable environment, in a sound booth, |
---|
0:04:21 | to analyze features and compare them directly to vocal characteristics. |
---|
0:04:29 | So that's kind of a limitation in real environments. |
---|
0:04:35 | So this is not new in terms of just including that other parameter; |
---|
0:04:42 | that's not new. This is ARX, the X being an exogenous input in the AR model, |
---|
0:04:49 | and it's typically handled |
---|
0:04:51 | with parametric models such as the LF model, which |
---|
0:04:54 | is a parameterization of the glottal flow derivative, the airflow pulse, |
---|
0:05:01 | in addition to other linear models; but both of these do require |
---|
0:05:04 | good knowledge of when the source, when the |
---|
0:05:09 | input pulses happen: |
---|
0:05:11 | when the vocal folds open and close. |
---|
0:05:14 | So what we're proposing is to extend that model |
---|
0:05:19 | and find a subspace that is not |
---|
0:05:22 | limited or restricted to |
---|
0:05:25 | needing these time-based |
---|
0:05:30 | estimates. |
---|
0:05:31 | So a solution, which our group presented last year, was using wavelets, |
---|
0:05:38 | these g functions g[n]: a bunch of basis functions with weighting coefficients |
---|
0:05:44 | beta; |
---|
0:05:46 | simply summing those will equal this |
---|
0:05:49 | non-stochastic voicing component. |
---|
0:05:52 | And this |
---|
0:05:54 | is robust to variations in fundamental frequency: pitch glides, irregular pitch periods |
---|
0:06:01 | in certain disordered voices that we see in the clinic. |
---|
0:06:05 | And furthermore, wavelets being time-localized |
---|
0:06:09 | allows us to shift |
---|
0:06:11 | the bases in time and frequency |
---|
0:06:15 | without knowing a priori some of the source properties. |
---|
0:06:20 | So, I alluded to this in the beginning: why do we care about this? The |
---|
0:06:26 | data I typically deal with is from clinical applications; voice assessment informs |
---|
0:06:32 | speech pathologists and surgeons who do |
---|
0:06:36 | their work on the vocal folds and want to see, from the production side, |
---|
0:06:40 | how a certain technique will affect |
---|
0:06:42 | a feature. |
---|
0:06:45 | In addition, there are |
---|
0:06:47 | characteristics that are important to look at: vocal health, |
---|
0:06:50 | emotional state, and speaker identity, such as stature. |
---|
0:06:55 | So, |
---|
0:06:56 | how do we do this subspace selection? |
---|
0:06:59 | This is just one set of equations; all I'm showing you here is |
---|
0:07:05 | the least-squares solution to |
---|
0:07:08 | the problem I showed before. |
---|
0:07:09 | In the LPC case, the stochastic-only case, |
---|
0:07:15 | this boils down to the classic model where you get LPC coefficients. |
---|
0:07:19 | What we have extra is this G matrix, which |
---|
0:07:23 | holds the subspace of the wavelets that will model the voicing on |
---|
0:07:27 | segments, plus the Gaussian noise of a given variance. |
---|
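A minimal sketch of that joint least-squares step, in my own notation rather than the paper's exact formulation: stack the AR regressors built from past samples next to the wavelet subspace matrix G, and solve for the LPC coefficients and the wavelet weights together.

```python
import numpy as np

def joint_ls(x, G, p):
    """Jointly fit AR coefficients and wavelet weights by least squares.

    x : (N,)   speech frame
    G : (N, K) matrix whose columns span the wavelet voicing subspace
    p : AR (LPC) model order
    Returns (a, beta): AR coefficients and wavelet weights.
    """
    N = len(x)
    # Regressors x[n-1] ... x[n-p], zero-padded at the frame start.
    X = np.column_stack([np.r_[np.zeros(k), x[:N - k]]
                         for k in range(1, p + 1)])
    A = np.hstack([X, G])                    # joint design matrix
    # Note: as K approaches N this system becomes ill-conditioned,
    # which is the motivation for thresholding the wavelet set.
    theta, *_ = np.linalg.lstsq(A, x, rcond=None)
    return theta[:p], theta[p:]
```

Usage would look like `a, beta = joint_ls(frame, G, p=12)` with G built from, e.g., a truncated wavelet dictionary.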
0:07:31 | What I wanted to show you is the critical |
---|
0:07:34 | issue: simply estimating |
---|
0:07:37 | the wavelet basis on all the samples (say you have 256 samples |
---|
0:07:45 | of the raw speech waveform and want to estimate 256 basis functions to parameterize the signal that way) |
---|
0:07:52 | turns into an ill-conditioned problem, because of the matrix inversion that we see here. |
---|
0:07:57 | And that conditioning problem we address |
---|
0:08:02 | in this paper |
---|
0:08:05 | by thresholding |
---|
0:08:08 | the wavelets, so we don't take all |
---|
0:08:09 | of the wavelet functions in this space. |
---|
0:08:12 | So we propose (sorry for the folks in the back) |
---|
0:08:15 | three wavelet shrinkage algorithms, which are simply: |
---|
0:08:20 | we don't need all 256 wavelets, so how many do we need, thirty-two? sixty-four? |
---|
0:08:24 | And that threshold can be specified using |
---|
0:08:29 | iterative methods: |
---|
0:08:31 | by hard thresholding of the wavelet coefficients; by soft thresholding, |
---|
0:08:35 | in addition to |
---|
0:08:38 | joint estimation of the parameters, if we want to use a top-N |
---|
0:08:41 | set of wavelet coefficients; |
---|
0:08:43 | and finally we present |
---|
0:08:45 | another optimization technique |
---|
0:08:48 | using an L1 norm |
---|
0:08:50 | as the criterion. I'm not going into the details here because |
---|
0:08:55 | I want to give an intuition of what some of the performance |
---|
0:08:59 | issues are, |
---|
0:09:01 | as well as the results; |
---|
0:09:04 | the paper goes into detail about the algorithms themselves. |
---|
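The two simplest of those shrinkage rules can be sketched generically as follows; this is textbook hard and soft thresholding applied to the wavelet weights beta, not a reproduction of the paper's iterative or L1-based algorithms.

```python
import numpy as np

def hard_threshold(beta, keep):
    """Keep only the `keep` largest-magnitude wavelet coefficients, zero the rest."""
    out = np.zeros_like(beta)
    idx = np.argsort(np.abs(beta))[-keep:]   # indices of the top-`keep` coefficients
    out[idx] = beta[idx]
    return out

def soft_threshold(beta, lam):
    """Shrink every coefficient toward zero by lam (the L1-norm proximal step)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```

Either rule would be applied to the beta vector between (or jointly with) least-squares refits, keeping the retained wavelets as the voicing subspace.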
0:09:08 | So, |
---|
0:09:09 | what do we get? |
---|
0:09:12 | What I'm showing here |
---|
0:09:14 | is synthesized speech; I'll also show you a real example. |
---|
0:09:19 | When we |
---|
0:09:20 | synthesize with a given source, we know |
---|
0:09:23 | the pitch, we can time-vary the pitch |
---|
0:09:26 | and see its effect on the various algorithms. |
---|
0:09:30 | The proposed algorithm |
---|
0:09:32 | uses the hard-thresholding |
---|
0:09:35 | iterative approach; |
---|
0:09:37 | "hard-64" is the top sixty-four wavelets, |
---|
0:09:41 | kind of a second step after hard thresholding. |
---|
0:09:44 | And what I want you to first focus on is |
---|
0:09:49 | the last output, which is the residual, |
---|
0:09:52 | the AR residual; that's giving the source waveform that |
---|
0:09:56 | we can parameterize to find voicing properties of speech. |
---|
0:10:00 | As it stands, |
---|
0:10:03 | it includes both components, the stochastic and |
---|
0:10:06 | non-stochastic |
---|
0:10:09 | parts; |
---|
0:10:10 | what our algorithm does is separate those in a principled manner to begin with. |
---|
0:10:15 | You can also do separation, |
---|
0:10:17 | a deterministic-stochastic separation, on the AR residual, but |
---|
0:10:22 | we do this as part of the model. |
---|
0:10:25 | And we get a better root-mean-square error |
---|
0:10:29 | almost because of that; but |
---|
0:10:31 | more importantly, the source property estimation is better |
---|
0:10:35 | and the filter property estimation is better. |
---|
0:10:38 | So, |
---|
0:10:39 | the classic case is when you have a filter with given |
---|
0:10:44 | formant values: |
---|
0:10:46 | if you just use the AR model, |
---|
0:10:49 | there's a bias in the formant frequencies, |
---|
0:10:52 | which I'm showing in the last curve, the bottom curve: |
---|
0:10:57 | with the AR residual |
---|
0:10:58 | and the LPC coefficients, when you look at the spectrum, |
---|
0:11:03 | there is a bias in F2 and F3 |
---|
0:11:05 | relative to the |
---|
0:11:08 | ground truth. |
---|
0:11:10 | And our algorithm handles that by putting the proper energy in the source and the filter, |
---|
0:11:15 | in a way solving that separation problem, the inverse filtering problem. |
---|
0:11:21 | The paper also goes into one theoretical observation, where |
---|
0:11:25 | we derive |
---|
0:11:28 | the Cramér-Rao lower bounds, which are |
---|
0:11:32 | lower bounds on this estimator's |
---|
0:11:35 | variance. So we have |
---|
0:11:37 | three parameters |
---|
0:11:39 | in this toy example. |
---|
0:11:40 | We took |
---|
0:11:43 | a filter with |
---|
0:11:46 | one pole pair, a two-pole filter with one resonance, |
---|
0:11:49 | and that resonance is at two kHz, |
---|
0:11:54 | so a two-kHz |
---|
0:11:56 | single-resonance function; |
---|
0:11:58 | and we swept |
---|
0:12:00 | the source |
---|
0:12:01 | sinusoid, so we have an input source sinusoid that was swept |
---|
0:12:03 | over frequency, and what you're seeing is from zero to eight kHz. |
---|
0:12:07 | The estimators for the LPC coefficients |
---|
0:12:11 | a1 and a2, |
---|
0:12:12 | and the |
---|
0:12:15 | beta coefficient for the sinusoid, |
---|
0:12:18 | are giving us the |
---|
0:12:20 | variance, and what we see, as expected, is that |
---|
0:12:25 | we are able to estimate the filter pretty well |
---|
0:12:31 | across the bandwidth, except that |
---|
0:12:34 | the source properties are only able to be |
---|
0:12:38 | estimated well when they don't |
---|
0:12:39 | overlap the center frequency. This is the classic problem when we're sampling the spectrum |
---|
0:12:44 | at the center frequency: |
---|
0:12:47 | as the sinusoid lines up with the resonance, |
---|
0:12:51 | we get an increase in the lower bound of that estimator. |
---|
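The toy setup is easy to reproduce in spirit. Assuming fs = 16 kHz (implied by the zero-to-eight-kHz sweep) and an illustrative pole radius, the two-pole filter's gain peaks at the 2 kHz resonance, exactly where the talk says the bound on the source estimate blows up.

```python
import numpy as np

fs, fc, r = 16000.0, 2000.0, 0.97   # pole radius r is an illustrative choice
a1 = -2 * r * np.cos(2 * np.pi * fc / fs)
a2 = r ** 2

def gain(f):
    """|H(e^{j 2 pi f / fs})| of the all-pole filter 1 / (1 + a1 z^-1 + a2 z^-2)."""
    z = np.exp(1j * 2 * np.pi * f / fs)
    return 1.0 / abs(1.0 + a1 / z + a2 / z ** 2)

# The response peaks near the 2 kHz resonance, where the sinusoidal source
# and the filter contribution become hardest to separate.
assert gain(2000.0) > gain(500.0)
```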
0:12:56 | That's this example, and we can also show it for the bandwidth: |
---|
0:13:03 | the best that we can do depends on the bandwidth |
---|
0:13:07 | of the resonance |
---|
0:13:09 | of the filter. |
---|
0:13:12 | We fixed the resonance of the filter, |
---|
0:13:16 | sorry, fixed the |
---|
0:13:19 | frequency of the sinusoid |
---|
0:13:21 | at two kHz, |
---|
0:13:23 | and varied the bandwidth; we show that the estimator variance, as expected, gets better as |
---|
0:13:28 | the bandwidth gets smaller, |
---|
0:13:30 | so that the filter and the source are |
---|
0:13:34 | easier to discriminate. But |
---|
0:13:37 | what it's showing is that the problem gets more and more difficult when |
---|
0:13:42 | F0, the fundamental frequency of the source, crosses |
---|
0:13:45 | F1, for example, |
---|
0:13:48 | in real speech. |
---|
0:13:50 | so |
---|
0:13:54 | that gives you a high level overview of what what we're doing and and the last that's are uh using |
---|
0:14:00 | i on the technical side doing uh |
---|
0:14:02 | but thresholding be level dependent thresholding |
---|
0:14:06 | uh and not just a a hard a hard thresholding |
---|
0:14:09 | other ways a penalising uh |
---|
0:14:12 | the the wavelet coefficients to |
---|
0:14:13 | to get a better representation of that source |
---|
0:14:16 | Here's an example of natural speech, where you can see the waveform |
---|
0:14:21 | of a glottalized |
---|
0:14:23 | voice quality, |
---|
0:14:26 | which sounds like the onset of an aperiodic |
---|
0:14:31 | source. |
---|
0:14:34 | Other methods can |
---|
0:14:35 | be sensitive to that, because finding the exact impulse times is difficult when things |
---|
0:14:42 | are aperiodic; |
---|
0:14:44 | whereas wavelets, not really caring about what the period is, can |
---|
0:14:49 | grab those |
---|
0:14:50 | impulses, and then we can find features from |
---|
0:14:53 | the reconstructed source. |
---|
0:14:57 | And finally, |
---|
0:14:59 | I'm sorry this is low, |
---|
0:15:03 | here's an example of one of the clinical applications: |
---|
0:15:06 | there have been |
---|
0:15:08 | advances in high-speed videoendoscopy of the larynx, |
---|
0:15:13 | and I get to work with these videos daily and relate them |
---|
0:15:19 | to the source features that we estimate, theoretically on synthesized vowels and real vowels, but also by looking |
---|
0:15:25 | down the throat or estimating airflow. |
---|
0:15:29 | And this is kind of our next step: |
---|
0:15:31 | to look at whether these features actually relate to what we can measure |
---|
0:15:35 | in human subjects. |
---|
0:15:37 | As you can see, this is not an easy setup |
---|
0:15:40 | for collecting airflow in addition to |
---|
0:15:43 | endoscopic views of the larynx. |
---|
0:15:48 | So that's what I had, and I welcome questions. |
---|
0:15:52 | Thank you. |
---|
0:16:03 | Well, I can also repeat the question; |
---|
0:16:05 | a microphone may be coming. |
---|
0:16:12 | [Audience] Yeah, thank you for your talk. |
---|
0:16:15 | My question relates to: |
---|
0:16:18 | what is the main difference of your approach compared to what is done for CELP coding, where you also have |
---|
0:16:24 | LPC and a sum of, or a number of, |
---|
0:16:30 | dictionary atoms into the LPC as a source? |
---|
0:16:35 | Sure. In code-excited linear prediction you have |
---|
0:16:41 | kind of a dictionary of, |
---|
0:16:48 | I'll go back to the slide, a dictionary of sources |
---|
0:16:50 | that you can reuse, |
---|
0:16:53 | and that aspect is not limited |
---|
0:16:57 | to simply noise in CELP. |
---|
0:17:01 | It's similar; what we're doing is then parameterizing that codebook, each one of the |
---|
0:17:07 | entries in the codebook, |
---|
0:17:09 | to then find features that relate to the source, |
---|
0:17:12 | instead of just the raw |
---|
0:17:13 | source. In CELP you have kind of the raw samples, and |
---|
0:17:18 | your cellphone processor uses them and it sounds natural, so the output is kind of |
---|
0:17:23 | a natural-sounding resynthesis, |
---|
0:17:25 | whereas we're interested, once we get the parameters, in then deriving |
---|
0:17:29 | features from those parameters. |
---|
0:17:34 | But that's a good point. |
---|
0:17:36 | [Session chair] Okay, I'll ask a question: it sounds like this would do better for noise |
---|
0:17:39 | than regular LPC. |
---|
0:17:41 | Or wouldn't it? It depends on the noise, I guess; when would it do better, when would it do worse? |
---|
0:17:46 | So, |
---|
0:17:47 | my master's thesis actually found that we can extend this model one step further, because |
---|
0:17:54 | the noise in voiced speech can, |
---|
0:17:56 | as has been observed, be modulated; so it's not actually white Gaussian noise, but can be |
---|
0:18:01 | modulated white Gaussian noise. |
---|
0:18:04 | So even this model is a simplification, but it's always a tradeoff between how much |
---|
0:18:09 | we care about modulations in the noise or not. |
---|
0:18:14 | Where it may not perform as well is that |
---|
0:18:17 | this model puts the noise function before the filter. |
---|
0:18:21 | Given a speech |
---|
0:18:24 | model, if it's reverberation noise, or if it's |
---|
0:18:26 | a car going past, that's not filtered by the same |
---|
0:18:31 | transfer function, so |
---|
0:18:34 | there's a larger model mismatch there. |
---|
0:18:37 | Good. |
---|
0:18:38 | That's great, thank you. |
---|
0:18:39 | Okay. |
---|