0:00:13Thank you,
0:00:14and thanks to all of you for coming.
0:00:17So this is the outline of my talk.
0:00:19First I will give a brief introduction to the topic,
0:00:22and then describe the phase-based features which are used in this work.
0:00:27Then I will show you the results of our experimental evaluation of these features
0:00:31within the framework of voice pathology detection,
0:00:34and finally I will conclude.
0:00:38In the great majority of speech processing applications, the focus is on the use of the amplitude
0:00:44spectrum of the Fourier transform.
0:00:47Nonetheless, there might be a benefit in also considering the phase information.
0:00:53Just as an example,
0:00:57there have been approaches using phase-based features
0:01:00for speaker recognition or automatic speech recognition,
0:01:05as for example in the work of Murthy or Hegde,
0:01:09and
0:01:12there has been an improvement by using phase-based features in these systems.
0:01:16And this is possible since
0:01:18the phase provides a complementary source of information
0:01:21with regard to the amplitude spectrum.
0:01:24Therefore, investigating the usefulness of
0:01:28the phase information
0:01:30seems to be a promising approach in speech analysis.
0:01:36I will now describe the phase-based features which are the object
0:01:39of this work.
0:01:42So we focus on the group delay function. The group delay is defined as minus
0:01:46the first derivative of the unwrapped phase of the Fourier transform,
0:01:52and this can be written as follows,
0:01:55here,
0:01:56where you can see that the subscripts R and I denote the real and imaginary parts of the
0:02:00Fourier transform,
0:02:02while Y is the Fourier transform of n multiplied by x(n).
0:02:06An advantage of using that equation is that it doesn't require any phase unwrapping.
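As a rough illustration of the formula described here, the following Python sketch computes the group delay without any phase unwrapping as (X_R·Y_R + X_I·Y_I) / |X|², where X is the Fourier transform of x(n) and Y that of n·x(n). The function name `group_delay`, the FFT size and the small floor `eps` on the denominator are assumptions made for this example, not details given in the talk.

```python
import numpy as np

def group_delay(x, n_fft=512, eps=1e-12):
    """Group delay of x(n) computed without phase unwrapping (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # Fourier transform of x(n)
    Y = np.fft.rfft(n * x, n_fft)    # Fourier transform of n * x(n)
    numerator = X.real * Y.real + X.imag * Y.imag
    denominator = np.abs(X) ** 2
    # eps is an assumed safeguard against division by zero, not part of the definition
    return numerator / np.maximum(denominator, eps)

# A pure delay of three samples has a flat group delay equal to 3.
delayed_impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(group_delay(delayed_impulse, n_fft=64)[:4])
```

When |X| becomes very small at some frequency, the ratio grows very large, which is the source of the spikes discussed in the talk.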
0:02:11And now you can understand why the group delay function has, most of the time, been avoided
0:02:16in conventional speech processing applications.
0:02:19If you consider the z-transform of the signal,
0:02:25there might be some zeros close to the unit circle, and this is especially true for the speech signal,
0:02:30as is displayed there.
0:02:32The Fourier transform is evaluated at frequencies located on the unit circle
0:02:37in the z-plane,
0:02:39and for zeros
0:02:40close to the unit circle
0:02:42the variation in the phase information is quite high,
0:02:46which results in high spikes in the group delay function.
0:02:51You can also understand that
0:02:54because, for certain frequencies,
0:02:56the denominator of that expression becomes very low,
0:03:00resulting in spikes in the group delay.
0:03:04So there have been some approaches
0:03:07aiming at reducing these spikes in the group delay.
0:03:12A first approach is the modified group delay proposed by Hegde.
0:03:16You can see here that, in the denominator,
0:03:19the Fourier transform has been replaced by S of omega, which is a cepstrally smoothed version
0:03:24of the Fourier transform.
0:03:26This representation also makes use of
0:03:30two smoothing parameters, alpha and gamma,
0:03:33which also aim at reducing the spikes in the group delay.
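A minimal sketch of a modified group delay along these lines, assuming the commonly cited form in which the denominator |X|² is replaced by |S|^(2·gamma), with S a cepstrally smoothed magnitude spectrum, and the result is compressed as sign(tau)·|tau|^alpha. The function name, the default values of `alpha` and `gamma`, and the lifter length `n_lifter` are hypothetical placeholders, not the settings used in this work.

```python
import numpy as np

def modified_group_delay(x, n_fft=512, alpha=0.4, gamma=0.9, n_lifter=30):
    """Modified group delay with a cepstrally smoothed denominator (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x, n_fft)
    Y = np.fft.fft(n * x, n_fft)

    # Cepstral smoothing: keep only the low-quefrency part of the real cepstrum.
    cepstrum = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0   # mirrored part, keeps the cepstrum symmetric
    S = np.exp(np.fft.fft(cepstrum * lifter).real)   # smoothed magnitude spectrum

    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    mgd = np.sign(tau) * np.abs(tau) ** alpha        # compress the remaining peaks
    return mgd[: n_fft // 2 + 1]                     # keep non-negative frequencies
```

Because S is smooth and strictly positive, the denominator no longer collapses near zeros of the spectrum, and the exponents only control how strongly the remaining peaks are compressed.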
0:03:38Another version is the product of the power spectrum and the group delay,
0:03:41proposed by Zhu.
0:03:43Here they consider that the main source of the spikes in the group
0:03:47delay comes from the denominator,
0:03:50therefore they just get rid of it and consider only the numerator of the expression.
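Keeping only the numerator of the group delay expression amounts to multiplying the group delay by the power spectrum |X|². A short sketch, with a hypothetical function name and FFT size:

```python
import numpy as np

def product_spectrum(x, n_fft=512):
    """Product of the power spectrum and the group delay (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    # Numerator of the group delay expression; there is no division,
    # so small values of |X| cannot produce spikes.
    return X.real * Y.real + X.imag * Y.imag
```

For an impulse of amplitude 2 delayed by three samples, the power spectrum is 4 and the group delay is 3, so the function returns 12 at every frequency.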
0:03:59We have also investigated the chirp group delay proposed by Bozkurt.
0:04:05This actually uses another contour in the z-plane:
0:04:08instead of the unit circle,
0:04:11another circle in the z-plane
0:04:14is used to evaluate the z-transform, and this leads to
0:04:17a both smoother and higher-resolution representation of the peaks in the speech spectrum.
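Evaluating the z-transform on a circle of radius rho, that is at z = rho·e^(jω), is equivalent to taking the Fourier transform of x(n)·rho^(-n). The sketch below uses that identity to compute a group delay away from the unit circle. The function name and the default radius are assumptions for illustration, and the sketch does not reproduce any additional processing the original method may apply.

```python
import numpy as np

def chirp_group_delay(x, rho=1.12, n_fft=512, eps=1e-12):
    """Group delay evaluated on the circle |z| = rho (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    # X(rho * e^{jw}) = sum_n x(n) * rho^{-n} * e^{-jwn}
    x_rho = x * np.power(float(rho), -n.astype(float))
    X = np.fft.rfft(x_rho, n_fft)
    Y = np.fft.rfft(n * x_rho, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(np.abs(X) ** 2, eps)

# x(n) = [1, -1] has a zero exactly on the unit circle (z = 1), where the usual
# group delay is undefined; on the circle of radius 1.12 it stays finite.
print(chirp_group_delay([1.0, -1.0], rho=1.12, n_fft=64)[:3])
```

With `rho=1.0` the function reduces to the ordinary group delay on the unit circle.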
0:04:24For comparison purposes, we also use the STRAIGHT spectrogram by Kawahara,
0:04:29and this is,
0:04:30let's say, a pitch-adaptive time smoothing of the speech
0:04:37magnitude
0:04:38spectrum.
0:04:40As a baseline, we also consider the Fourier magnitude, so the magnitude spectrum of the Fourier transform.
0:04:45Here I give an example of what these
0:04:48five spectrograms
0:04:49look like
0:04:52for a sustained vowel
0:04:53produced by a normophonic patient, above,
0:04:56and below for a dysphonic patient.
0:04:59So here you have the Fourier magnitude, the STRAIGHT spectrum,
0:05:01the modified group delay, the product of the power spectrum and the group delay, and the chirp group delay.
0:05:06You can see that for the normophonic patient
0:05:09we have a structure which is very regular in time,
0:05:13while this is not true for the dysphonic patient,
0:05:17and this is especially apparent in the chirp group delay.
0:05:20So,
0:05:21basically, to explain this:
0:05:23during the production of a sustained vowel, you can assume that the vocal tract shape is constant,
0:05:29so that the vocal tract function can be assumed to be stationary.
0:05:33So if you find some irregularities,
0:05:35they come from the turbulences occurring
0:05:38during the glottal production.
0:05:42Besides these five spectrograms, we also use features derived from the mixed-phase decomposition.
0:05:49To introduce the mixed-phase decomposition, just consider the source-filter approach:
0:05:53we have the glottal flow derivative, which is convolved in the time domain with the vocal tract
0:05:57response
0:05:58to give the speech signal.
0:06:01The mixed-phase model of speech says that the glottal source is maximum phase, which
0:06:07means that it is an anticausal signal,
0:06:10while the vocal tract is minimum phase, that is to say
0:06:13causal.
0:06:15So
0:06:16the key idea of the mixed-phase decomposition is to separate the minimum- and maximum-phase components of speech,
0:06:24and this is possible, for example, in the zeros of the z-transform domain, as proposed by Bozkurt.
0:06:29Here we can see the zeros:
0:06:33this is the z-plane in polar coordinates,
0:06:36and you can see that the zeros related to the glottal flow are outside the unit circle,
0:06:40while for the vocal tract they are inside it, since the vocal tract is
0:06:45a causal, minimum-phase system.
0:06:47So you can see that in this domain
0:06:50there is a possible, easy separation between the minimum- and maximum-phase components of speech.
0:06:56And we have shown that it is also possible in the complex cepstrum domain,
0:06:59just using the quefrency origin as a boundary for the separation.
0:07:05In this work we focus on the use of the complex cepstrum decomposition.
0:07:10So basically we have the speech signal, to which we apply a
0:07:12specific window,
0:07:14which is synchronous with the glottal closure instant, the GCI,
0:07:17and two pitch periods long,
0:07:19and then we compute the complex cepstrum.
0:07:21In the complex cepstrum domain it is very easy: just keeping the negative indexes,
0:07:27by inverse complex cepstrum we get the maximum-phase component of speech,
0:07:31which is mainly related to the glottal flow,
0:07:34while for the positive indexes
0:07:36we get the minimum-phase
0:07:37component of speech, which is mainly influenced by the vocal tract.
0:07:41In this work
0:07:44we just isolated the maximum-phase component of speech,
0:07:47which is a kind of glottal flow estimate.
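The separation in the complex cepstrum domain can be sketched as follows. This is only an illustration under stated assumptions: the input frame is taken to be already windowed (the GCI-synchronous, two-pitch-period window is not implemented here), the function names and FFT size are hypothetical, and the gain term at quefrency zero is simply left out of the maximum-phase part.

```python
import numpy as np

def complex_cepstrum(frame, n_fft=1024):
    """Complex cepstrum of an already windowed frame (illustrative sketch)."""
    X = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(X) + 1e-12)
    phase = np.unwrap(np.angle(X))
    # Remove the linear-phase term (a pure delay) before the inverse transform.
    half = n_fft // 2
    n_delay = int(np.round(phase[half] / np.pi))
    phase = phase - np.pi * n_delay * np.arange(n_fft) / half
    return np.fft.ifft(log_mag + 1j * phase).real

def maximum_phase_component(frame, n_fft=1024):
    """Keep the negative quefrencies and invert the complex cepstrum."""
    cep = complex_cepstrum(frame, n_fft)
    anticausal = np.zeros(n_fft)
    # Indices above n_fft/2 hold the negative quefrencies.
    anticausal[n_fft // 2 + 1:] = cep[n_fft // 2 + 1:]
    return np.fft.ifft(np.exp(np.fft.fft(anticausal))).real

# Toy mixed-phase signal: a zero inside the unit circle (z = -0.5)
# convolved with a zero outside the unit circle (z = -2).
x = np.convolve([1.0, 0.5], [0.5, 1.0])
x_max = maximum_phase_component(x, n_fft=256)
print(x_max[0], x_max[-1])   # anticausal part: samples at n = 0 and n = -1
```

In this toy case the recovered component is the anticausal sequence with value 1 at n = 0 and 0.5 at n = -1, which corresponds to the factor whose zero lies outside the unit circle.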
0:07:52Here you have an example of two cycles of the maximum-phase component.
0:07:57Above,
0:07:59one would say that the mixed-phase model is respected, that is to say
0:08:03we obtain waveforms which resemble those of the glottal flow,
0:08:07such as those of a glottal model,
0:08:10while for some other frames
0:08:12the decomposition fails
0:08:14and we have such an
0:08:15irrelevant waveform.
0:08:17So, in order
0:08:19to know whether the mixed-phase model was respected or not,
0:08:22we just computed two time parameters from the
0:08:27maximum-phase waveform.
0:08:33So now, the experimental evaluation of these features.
0:08:36For the database, we use the Kay database, which is made of
0:08:41the productions of
0:08:43fifty-three normophonic and six hundred fifty-seven dysphonic patients,
0:08:47and we just consider the production of the sustained vowel.
0:08:52As features we use the
0:08:53frame-to-frame variation for the five spectrograms.
0:08:56As I said, if you assume that the vocal tract shape
0:08:59is constant during the production of this sustained vowel,
0:09:02the frame-to-frame variations
0:09:04are mainly due to
0:09:08turbulences in the glottal production.
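The talk does not spell out how the frame-to-frame variation is measured. As one plausible reading, the sketch below takes a spectrogram (one row per frame) and returns the relative change between consecutive frames, using the Euclidean norm. The function name, the choice of norm and the floor `eps` are assumptions.

```python
import numpy as np

def frame_to_frame_variation(spectrogram, eps=1e-12):
    """Relative change between consecutive frames (illustrative sketch)."""
    S = np.asarray(spectrogram, dtype=float)          # shape: (n_frames, n_bins)
    change = np.linalg.norm(S[1:] - S[:-1], axis=1)   # distance to the previous frame
    reference = np.linalg.norm(S[:-1], axis=1)        # size of the previous frame
    return change / np.maximum(reference, eps)

# Two identical frames give 0; doubling the frame gives a relative change of 1.
print(frame_to_frame_variation([[1.0, 2.0], [1.0, 2.0], [2.0, 4.0]]))
```

Under the stationarity assumption for the vocal tract, a perfectly regular phonation would give values close to zero for every pair of frames.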
0:09:12We also use the two time parameters for the respect of the mixed-phase
0:09:18model,
0:09:19and for comparison purposes we also use
0:09:21three perceptual spectral balances,
0:09:24which are extracted from the
0:09:26Fourier magnitude spectrum.
0:09:28Actually, these
0:09:30are obtained using three distinct subbands in the spectrum, and they are used here because they were
0:09:36the most
0:09:37informative in our previous study.
0:09:43Here you have an example of the distribution of
0:09:47some features.
0:09:49So here you have the Fourier magnitude, the modified group delay and the chirp group delay,
0:09:53and you can see that it is the frame-to-frame variation, in relative terms.
0:09:58You can see that for normophonic patients
0:10:01we have
0:10:03values which are much lower than for dysphonic patients,
0:10:06and this is especially true for the chirp group delay representation.
0:10:11And on the right you have the histogram for T1, the
0:10:16time constant
0:10:18for the
0:10:19respect of the mixed-phase model.
0:10:21Actually, if the waveform corresponds
0:10:25to that of the glottal flow, we expect values around zero,
0:10:31and this is true for the great majority of the normophonic frames,
0:10:35but you can see that for dysphonic patients,
0:10:37most of the time,
0:10:40the mixed-phase decomposition fails.
0:10:44We have assessed these features in terms of mutual information,
0:10:49so basically this is the percentage of useful information that the features
0:10:54bring to the classification problem.
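One way to express mutual information as a percentage is to divide it by the entropy of the class labels, so that 100 % means the feature fully determines the class. The sketch below does this for a feature that has already been quantized into bins. The normalization, the function names and the prior binning step are assumptions, since the talk does not detail the estimator.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in Counter(symbols).values())

def relative_mutual_information(feature_bins, classes):
    """Mutual information between a binned feature and the class, in % of H(class)."""
    h_class = entropy(classes)
    if h_class == 0.0:
        return 0.0   # a single class carries no information to explain
    h_feature = entropy(feature_bins)
    h_joint = entropy(list(zip(feature_bins, classes)))
    return 100.0 * (h_class + h_feature - h_joint) / h_class

classes = ["normo", "normo", "dys", "dys"]
print(relative_mutual_information([0, 0, 1, 1], classes))   # feature follows the class
print(relative_mutual_information([0, 1, 0, 1], classes))   # feature independent of the class
```

The same function can be applied to a pair of features by zipping their bins into tuples, which is how redundancy between two features would show up as a joint score well below the sum of the individual ones.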
0:10:59So
0:10:59here we have the five spectrograms, and you can see that
0:11:02the chirp group delay gives the highest amount of useful information
0:11:06for the
0:11:08classification problem between normophonic and dysphonic patients.
0:11:11You can also see the values for the modified group delay and for the two time parameters for
0:11:17the respect of the mixed-phase model.
0:11:20As expected
0:11:21from our previous study, you can see that the spectral balances give a high amount
0:11:26of information.
0:11:28But you have to note that this only reflects
0:11:30the intrinsic discrimination power of each feature
0:11:35considered separately.
0:11:37If you combine them, for example in a combination of two features,
0:11:41if you use Bal1 with Bal2,
0:11:44you can see that it only brings sixty-four percent of mutual information,
0:11:47because they are highly redundant,
0:11:50while the best combination of two features is Bal1 with T2,
0:11:55which leads to seventy-nine percent of mutual information.
0:11:58This is possible because these two sources of information
0:12:02are mainly complementary and
0:12:07not that much redundant.
0:12:12We also use a classifier-based evaluation,
0:12:16using an artificial neural network with sixteen neurons.
0:12:21We use a ten-fold cross-validation, and for the performance measure we use the error rate,
0:12:27both at the frame and the patient levels.
0:12:29A patient is diagnosed as normophonic or dysphonic
0:12:33using
0:12:37a majority decision strategy
0:12:40considering the frames.
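The patient-level decision can be sketched as a simple majority vote over the frame-level decisions of the classifier. The function names and label strings below are hypothetical, and how ties are broken is not stated in the talk; here the label seen first wins.

```python
from collections import Counter

def patient_decision(frame_labels):
    """Majority vote over the frame-level labels of one patient."""
    # On a tie, Counter.most_common keeps the label that was encountered first.
    return Counter(frame_labels).most_common(1)[0][0]

def error_rate(predicted, truth):
    """Percentage of wrong decisions, usable at the frame or the patient level."""
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return 100.0 * wrong / len(truth)

frames = ["dys", "normo", "dys", "dys", "normo"]
print(patient_decision(frames))                       # majority of frames is "dys"
print(error_rate(frames, ["dys"] * len(frames)))      # frame-level error in percent
```

This also shows why the patient-level error can be much lower than the frame-level error: isolated misclassified frames are outvoted by the rest of the recording.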
0:12:46So here are the results.
0:12:47Just using a single feature,
0:12:49you can see the comparison between the baseline, the Fourier magnitude, and the chirp group delay,
0:12:54and you can see an improvement using the chirp group delay representation,
0:12:58both at the frame level and the patient level.
0:13:02Now, using
0:13:04two features, you have the two time constants for the respect of the mixed-phase model,
0:13:09and the best combination of two features,
0:13:12Bal1 and T2.
0:13:14So we can see
0:13:16that, up to now, the chirp group delay gives the best result
0:13:20at the patient level,
0:13:22but at the frame level we obtain the best
0:13:24result with Bal1 and T2.
0:13:27Now, with three
0:13:29features:
0:13:31let's say, our previous representation using the perceptual balances,
0:13:35and you can see that with these
0:13:37three features
0:13:39we obtain
0:13:41a worse result than just using the chirp group delay at the patient level.
0:13:47And now you can also see, highlighted there,
0:13:50the very interesting result just using the three group delay representations,
0:13:55with a very low error rate
0:13:57both at the frame
0:13:58and the patient level.
0:14:01Now, using five features, so we add the Fourier magnitude and STRAIGHT, or the two time constants, to the
0:14:07three group delay representations. Actually you can see,
0:14:11comparing with this line,
0:14:13that it actually doesn't bring anything
0:14:17more.
0:14:19Finally, just using the whole feature set,
0:14:22so the ten features,
0:14:25we have obviously the best error rate
0:14:28for the
0:14:29frame level,
0:14:30but considering the patient level we get
0:14:34about zero point eight percent, which was already obtained just using the three group
0:14:39delay representations.
0:14:44As a conclusion, we have shown that phase-based features are appropriate for characterizing
0:14:49irregularities in the phonation during sustained vowels,
0:14:52and these phase-based features are actually complementary to
0:14:57the features derived from the magnitude spectrum
0:14:59commonly used in speech processing,
0:15:02and we obtain
0:15:03quite good performance just using the three features of the group delay representation.
0:15:08a but the bank or of you one
0:15:10If you have any question or comment, you are welcome.
0:15:12Thanks.
0:15:13Thank you.
0:15:21Do we have questions?
0:15:24A comment?
0:15:26Yes, please.
0:15:32So your observation is that the mixed-phase decomposition fails on dysphonic speech,
0:15:39but you did not explain what the reason for that is.
0:15:42Oh, okay.
0:15:45So theoretically one would say that the production does not respect the mixed-phase model, but,
0:15:50as I said, there is also the windowing.
0:15:53First of all, for the windowing you have
0:15:55to apply a GCI-synchronous and two-pitch-period-long window,
0:15:59so for some dysphonic patients the GCIs are not well marked or are not
0:16:05present at all, and also the pitch estimation may fail, too.
0:16:10That might explain some of the bad results,
0:16:15yeah, maybe because of this windowing.
0:16:26Hi, thanks for the talk. I just want to ask: if you have increased interaction between
0:16:31the vocal tract and the source,
0:16:33would the sensitivity go up, maybe for the dysphonic patients,
0:16:39if they happen to have marked coupling?
0:16:41Would that affect
0:16:44the phase,
0:16:45the mixed-phase model?
0:16:46Okay.
0:16:53I cannot really answer that question,
0:16:55but anyway you find a maximum- and a minimum-phase component. Just to say whether it is really relevant
0:17:01to consider that the maximum-phase component is a glottal flow estimate,
0:17:05I would not say that I am sure.
0:17:09Okay.
0:17:12But in my experience with the decomposition, for normophonic speech, let's say in speech synthesis,
0:17:17it works well,
0:17:19but when you have more,
0:17:21let's say,
0:17:21coupling between the vocal tract and the glottal source,
0:17:24I cannot answer.
0:17:26Hi,
0:17:27thanks for the talk.
0:17:30Just a question:
0:17:31you are assuming the glottal source to be
0:17:34just a
0:17:35maximum-phase component,
0:17:37but there is also a minimum-phase component, which is due to the
0:17:41spectral tilt,
0:17:44the spectral tilt of the glottal source.
0:17:46Yeah.
0:17:47And that might also vary from frame to frame,
0:17:50so that component might also,
0:17:53if you take it into account, let you get better
0:17:57results.
0:17:59Thanks.
0:18:01Okay.
0:18:01So the spectral tilt is mainly due to the glottal return phase, which is a minimum-phase signal,
0:18:06so it is mixed, let's say,
0:18:08in the minimum-phase component,
0:18:11yeah,
0:18:12which is not the object, I mean,
0:18:16of this work.
0:18:17We just focus on the analysis of the maximum-phase component,
0:18:20and also, for the features derived from the
0:18:24mixed-phase model, we just consider
0:18:26just two parameters.
0:18:28And I think that might also answer the previous question:
0:18:33even though it is not really a glottal flow estimate,
0:18:37you might have,
0:18:38okay,
0:18:39just to show that: here we have, let's say,
0:18:42a relevant waveform, while this one is very noisy,
0:18:45meaning that the mixed-phase decomposition just fails.
0:18:49So you cannot interpret that as a glottal flow estimate;
0:18:53nonetheless you might have an idea of the respect of the decomposition.
0:19:01Does it answer the question,
0:19:03or not?
0:19:07I have a question myself, again.
0:19:10In the dysphonic database, I guess you have different classes of dysphonia.
0:19:14Could you comment on that, and whether you tried to distinguish those classes as well?
0:19:24uh
0:19:24normal for the got this warning and also for the that's in the uh database
0:19:28you might have very use um
0:19:31but image which it's for a single
0:19:33or a a single patient
0:19:34we just consider a uh a and you know at the location
0:19:43So that
0:19:44concludes the discussion. Let's thank the speaker again.