0:00:13 | thank you |
---|
0:00:14 | thanks to all of you for coming |
---|
0:00:16 | they |
---|
0:00:17 | so this is the outline of my talk |
---|
0:00:19 | first i will give a brief introduction on the topic |
---|
0:00:22 | and then describe the phase-based features which are studied in this work |
---|
0:00:27 | then i will show you the results of our experimental evaluation of these features |
---|
0:00:31 | within the framework of voice pathology detection |
---|
0:00:34 | and finally i will conclude |
---|
0:00:38 | in the great majority of speech processing applications, the focus is on the use of the amplitude |
---|
0:00:44 | spectrum of the Fourier transform |
---|
0:00:47 | nonetheless, there might be a benefit in also considering the phase information |
---|
0:00:53 | and just for example was |
---|
0:00:55 | uh i was down for uh |
---|
0:00:57 | there have been approaches using phase-based features |
---|
0:01:00 | for speaker recognition or automatic speech recognition |
---|
0:01:05 | such as, for example, the work of Murthy or Hegde |
---|
0:01:09 | and this |
---|
0:01:10 | a is |
---|
0:01:11 | i mean |
---|
0:01:12 | there has been an improvement by using phase-based features in these systems |
---|
0:01:16 | and this is possible since |
---|
0:01:18 | phase provides a complementary source of information |
---|
0:01:21 | with regard to the amplitude spectrum |
---|
0:01:24 | and therefore investigating the usefulness of |
---|
0:01:28 | the phase information |
---|
0:01:30 | seems to be a promising approach in speech analysis |
---|
0:01:36 | and now i will describe the phase-based features which are the object |
---|
0:01:39 | of this work |
---|
0:01:42 | so we focus on the group delay function, and group delay is defined as minus |
---|
0:01:46 | the first derivative of the unwrapped phase of the Fourier transform |
---|
0:01:52 | and this can be written as follows |
---|
0:01:55 | here |
---|
0:01:56 | so you can see that the subscripts R and I denote the real and imaginary parts of the |
---|
0:02:00 | Fourier transform |
---|
0:02:02 | while Y is the Fourier transform of n multiplied by x of n |
---|
0:02:06 | so an advantage in using that equation is that it doesn't require any phase unwrapping |
---|
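The formula described at this point in the talk can be sketched in code. The following is a minimal illustration, assuming the standard identity tau(w) = (X_R·Y_R + X_I·Y_I) / |X|^2, where X is the Fourier transform of x[n] and Y that of n·x[n]. The function name `group_delay`, the FFT size and the `eps` floor are assumptions made for this example and do not come from the talk.

```python
import numpy as np

def group_delay(x, nfft=512, eps=1e-12):
    """Group delay without phase unwrapping (illustrative sketch).

    tau(w) = (X_R * Y_R + X_I * Y_I) / |X|^2,
    with X = FT{x[n]} and Y = FT{n * x[n]}.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)
    numerator = X.real * Y.real + X.imag * Y.imag
    denominator = np.abs(X) ** 2
    # eps is an assumed safeguard: it avoids a division by zero when a
    # zero of the z-transform lies exactly on the unit circle.
    return numerator / np.maximum(denominator, eps)

# A pure delay of 3 samples has a constant group delay of 3 samples.
impulse = np.zeros(8)
impulse[3] = 1.0
tau = group_delay(impulse, nfft=64)
```

For the delayed impulse above, every bin of `tau` equals the delay, which is a quick sanity check of the sign convention.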
0:02:11 | and you can understand why the group delay function has been, most of the time, avoided |
---|
0:02:16 | in conventional speech processing applications |
---|
0:02:19 | consider the z-transform of the signal |
---|
0:02:23 | and a |
---|
0:02:24 | then my |
---|
0:02:25 | there might be some zeros close to the unit circle, and this is especially true for the speech signal |
---|
0:02:30 | so that it is there |
---|
0:02:32 | so you evaluate the Fourier transform at frequencies located on the unit circle |
---|
0:02:37 | in the z-plane |
---|
0:02:39 | and for zeros |
---|
0:02:40 | close to the unit circle |
---|
0:02:42 | the variation in the phase information is quite high |
---|
0:02:46 | and this results in high spikes in the group delay function |
---|
0:02:51 | so you can also understand that |
---|
0:02:52 | uh |
---|
0:02:54 | because for such frequencies |
---|
0:02:56 | the denominator of that expression becomes low |
---|
0:03:00 | resulting in spikes in the group delay |
---|
0:03:04 | so there have been some approaches |
---|
0:03:07 | aiming at reducing these spikes in the group delay |
---|
0:03:11 | and |
---|
0:03:12 | a first approach is the modified group delay proposed by Hegde |
---|
0:03:16 | and you can see that here in the denominator |
---|
0:03:19 | it has been replaced by S of omega, which is a cepstrally smoothed version |
---|
0:03:24 | of the Fourier transform |
---|
0:03:25 | and |
---|
0:03:26 | this representation also makes use of |
---|
0:03:30 | two smoothing parameters, alpha and gamma |
---|
0:03:33 | which also aim at reducing the spikes in the group delay |
---|
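As a rough illustration of the modified group delay just described, the sketch below replaces |X|^2 in the denominator by a cepstrally smoothed magnitude S raised to the power 2·gamma, and then compresses the result with an exponent alpha. The lifter length and the default values of `alpha` and `gamma` are placeholders chosen for this example; the talk does not state the values that were used.

```python
import numpy as np

def cepstrally_smoothed_magnitude(x, nfft, lifter):
    """Smooth |X| by keeping only the low-quefrency part of the real cepstrum."""
    X = np.fft.fft(x, nfft)
    log_mag = np.log(np.abs(X) + 1e-12)
    cep = np.fft.ifft(log_mag).real
    # Zero every quefrency except indices -(lifter-1) .. (lifter-1).
    cep[lifter:nfft - lifter + 1] = 0.0
    return np.exp(np.fft.fft(cep).real)

def modified_group_delay(x, nfft=512, lifter=8, alpha=0.4, gamma=0.9):
    """Modified group delay (sketch; parameter values are assumptions)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x, nfft)
    Y = np.fft.fft(n * x, nfft)
    S = cepstrally_smoothed_magnitude(x, nfft, lifter)
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    tau_mod = np.sign(tau) * np.abs(tau) ** alpha
    return tau_mod[: nfft // 2 + 1]  # keep non-negative frequencies only

impulse = np.zeros(8)
impulse[3] = 1.0
mgd = modified_group_delay(impulse, nfft=64, lifter=4, alpha=0.4, gamma=0.9)
```

Because the smoothed spectrum S has no deep valleys, the division no longer blows up near zeros of the z-transform, which is the purpose stated in the talk.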
0:03:38 | so another version is the product of the power and the group delay |
---|
0:03:41 | proposed by Zhu |
---|
0:03:43 | and here you can see that the main source of all the spikes in the group |
---|
0:03:47 | delay comes from the denominator |
---|
0:03:50 | therefore they just get rid of it and just consider the numerator of the expression |
---|
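Keeping only the numerator amounts to multiplying the group delay by the power spectrum. A minimal sketch, with the function name `product_spectrum` assumed for this example:

```python
import numpy as np

def product_spectrum(x, nfft=512):
    """Product of power spectrum and group delay: X_R * Y_R + X_I * Y_I."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)      # Fourier transform of x[n]
    Y = np.fft.rfft(n * x, nfft)  # Fourier transform of n * x[n]
    return X.real * Y.real + X.imag * Y.imag

# An impulse of amplitude 2 delayed by 3 samples: power 4 times delay 3.
impulse = np.zeros(8)
impulse[3] = 2.0
ps = product_spectrum(impulse, nfft=64)
```

No division is involved, so the result stays finite even for a signal such as `[1, -1]`, whose z-transform has a zero exactly on the unit circle.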
0:03:59 | so we have also investigated the chirp group delay proposed by Bozkurt |
---|
0:04:04 | and |
---|
0:04:05 | this actually uses another contour in the z-plane |
---|
0:04:08 | instead of the unit circle, so |
---|
0:04:11 | another circle in the z-plane |
---|
0:04:14 | to evaluate the z-transform, and this leads to |
---|
0:04:17 | a representation of the peaks in the speech spectrum that is both smooth and highly resolved |
---|
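Evaluating the z-transform on a circle of radius rho is equivalent to weighting x[n] by rho^(-n) before taking the Fourier transform, so the idea can be sketched as follows. This is a simplified illustration only: the radius value is an arbitrary assumption, and any additional processing in the published chirp group delay method is not reproduced here.

```python
import numpy as np

def chirp_group_delay(x, radius=1.12, nfft=512, eps=1e-12):
    """Group delay of X(z) evaluated on the circle |z| = radius (sketch)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    # X(radius * e^{jw}) = sum_n x[n] * radius^{-n} * e^{-jwn}
    xw = x * radius ** (-n.astype(float))
    X = np.fft.rfft(xw, nfft)
    Y = np.fft.rfft(n * xw, nfft)
    numerator = X.real * Y.real + X.imag * Y.imag
    return numerator / np.maximum(np.abs(X) ** 2, eps)

# [1, -1] has a zero exactly on the unit circle (z = 1); moving the
# analysis circle away from it keeps the group delay bounded.
cgd = chirp_group_delay([1.0, -1.0], radius=1.12, nfft=64)
```

With the analysis circle away from the zeros, the denominator never gets close to zero, which is why the spikes disappear.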
0:04:24 | for comparison purposes we also use the STRAIGHT spectrogram by Kawahara |
---|
0:04:29 | and this is a |
---|
0:04:30 | let's say a pitch-adaptive time smoothing of the speech |
---|
0:04:36 | but uh |
---|
0:04:37 | Fourier magnitude |
---|
0:04:38 | spectrum |
---|
0:04:40 | as a baseline we also consider the Fourier magnitude, so the magnitude spectrum of the Fourier transform |
---|
0:04:45 | and here you have an example of what these |
---|
0:04:48 | five spectrograms |
---|
0:04:49 | look like |
---|
0:04:51 | uh |
---|
0:04:52 | for a sustained vowel |
---|
0:04:53 | produced by a normophonic patient, above |
---|
0:04:56 | and below for a dysphonic patient |
---|
0:04:59 | so here you have the Fourier magnitude, the STRAIGHT spectrum |
---|
0:05:01 | the modified group delay, the product of the power and the group delay, and the chirp group delay |
---|
0:05:06 | and you can see that for the normophonic patient |
---|
0:05:09 | we have a structure which is very regular in time |
---|
0:05:13 | while this is not true for the dysphonic patient |
---|
0:05:16 | and |
---|
0:05:17 | this is especially evident in the chirp group delay |
---|
0:05:20 | so |
---|
0:05:21 | basically, to explain this |
---|
0:05:23 | during the production of a sustained vowel you can assume that the vocal tract shape is constant |
---|
0:05:28 | i is |
---|
0:05:29 | so that the vocal tract function can be assumed to be stationary |
---|
0:05:33 | so if you find some irregularities |
---|
0:05:35 | they come from turbulences during the |
---|
0:05:38 | glottal production |
---|
0:05:42 | so in this work we also use features derived from the mixed-phase decomposition |
---|
0:05:49 | so to introduce the mixed-phase decomposition, just consider the source-filter approach |
---|
0:05:53 | so we have a glottal flow derivative which is convolved in the time domain with the vocal tract |
---|
0:05:57 | response |
---|
0:05:58 | to give the speech signal |
---|
0:06:01 | and the mixed-phase model of speech says that the glottal flow is maximum phase, which |
---|
0:06:07 | means that it is an anticausal signal |
---|
0:06:10 | while the vocal tract is minimum phase, that is to say |
---|
0:06:13 | a causal one |
---|
0:06:15 | so the |
---|
0:06:16 | the key idea of the mixed-phase decomposition is to separate the minimum- and maximum-phase components of speech |
---|
0:06:23 | and |
---|
0:06:24 | this is possible, for example, in the zeros of the z-transform domain, as proposed by Bozkurt |
---|
0:06:29 | here we can see the zeros |
---|
0:06:32 | so |
---|
0:06:33 | this is the z-plane in polar coordinates |
---|
0:06:36 | and you can see that the zeros related to the glottal flow are outside the unit circle |
---|
0:06:40 | while for the vocal tract they are inside it, since the vocal tract is a |
---|
0:06:45 | causal, minimum-phase system |
---|
0:06:47 | so you can see that in this domain |
---|
0:06:50 | there is a possible, easy separation between the minimum- and maximum-phase components of speech |
---|
0:06:56 | and we have shown that it's also possible in the complex cepstrum domain |
---|
0:06:59 | just using the quefrency origin as a boundary for the separation |
---|
0:07:05 | so in this work we focus on the use of the complex cepstrum decomposition |
---|
0:07:10 | so basically we have a speech signal to which we apply a |
---|
0:07:12 | specific window |
---|
0:07:14 | which is synchronous with the glottal closure instant |
---|
0:07:17 | and two pitch periods long |
---|
0:07:19 | and then we compute the complex cepstrum |
---|
0:07:21 | and in the complex cepstrum domain it's very easy: just keeping the negative indexes |
---|
0:07:27 | by inverse complex cepstrum we get the maximum-phase component of speech |
---|
0:07:31 | which is mainly related to the glottal flow |
---|
0:07:34 | while for the positive indexes |
---|
0:07:36 | we get the minimum phase |
---|
0:07:37 | component of speech, which is mainly influenced by the vocal tract |
---|
0:07:41 | so in this work we just extract |
---|
0:07:44 | we just isolate the maximum-phase component of speech |
---|
0:07:47 | which is a kind of glottal flow estimate |
---|
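The separation at the quefrency origin can be sketched on a toy signal. The code below is an illustrative assumption of how such a decomposition might look, not the authors' implementation: it omits the windowing step described in the talk (synchronous with the glottal closure instant, two pitch periods long), it assigns the gain term c[0] to the minimum-phase part by convention, and it assumes the signal has no zero exactly on the unit circle.

```python
import numpy as np

def complex_cepstrum(x, nfft):
    """Complex cepstrum with the linear-phase term removed (sketch)."""
    X = np.fft.fft(x, nfft)
    log_mag = np.log(np.maximum(np.abs(X), 1e-12))
    phase = np.unwrap(np.angle(X))
    # The unwrapped phase at the Nyquist bin is -pi * delay for a real signal.
    delay = int(np.round(-phase[nfft // 2] / np.pi))
    phase = phase + np.pi * delay * np.arange(nfft) / (nfft // 2)
    return np.fft.ifft(log_mag + 1j * phase).real

def mixed_phase_decompose(x, nfft=1024):
    """Split x into minimum- and maximum-phase parts at the quefrency origin."""
    cep = complex_cepstrum(np.asarray(x, dtype=float), nfft)
    half = nfft // 2
    cep_min = np.zeros(nfft)
    cep_max = np.zeros(nfft)
    cep_min[0] = cep[0]                    # gain term (convention, see lead-in)
    cep_min[1:half] = cep[1:half]          # positive quefrencies
    cep_max[half + 1:] = cep[half + 1:]    # negative quefrencies
    min_phase = np.fft.ifft(np.exp(np.fft.fft(cep_min))).real
    max_phase = np.fft.ifft(np.exp(np.fft.fft(cep_max))).real
    return min_phase, max_phase

# Toy mixed-phase signal: [1, 0.5] has its zero inside the unit circle
# (minimum phase), [0.5, 1] has its zero outside (maximum phase).
x = np.convolve([1.0, 0.5], [0.5, 1.0])
min_phase, max_phase = mixed_phase_decompose(x, nfft=1024)
```

In this toy case the minimum-phase output starts with 1, 0.5, while the maximum-phase output is anticausal: its samples at negative times wrap around to the end of the array.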
0:07:52 | so here you have an example of two cycles of the maximum-phase component |
---|
0:07:56 | uh |
---|
0:07:57 | yeah O but one uh |
---|
0:07:59 | one, i would say, where the mixed-phase model is respected, that is to say |
---|
0:08:03 | we obtain waveforms which corroborate those of the glottal flow |
---|
0:08:07 | such as a a a a a lot more the |
---|
0:08:10 | while for some other frames |
---|
0:08:12 | the mixed-phase decomposition fails |
---|
0:08:14 | and we have such |
---|
0:08:15 | an irrelevant waveform |
---|
0:08:17 | so we need |
---|
0:08:19 | to know whether the mixed-phase model is respected or not |
---|
0:08:22 | so we just compute two time parameters from the |
---|
0:08:27 | maximum-phase waveform |
---|
0:08:33 | so now the experimental evaluation of these features |
---|
0:08:36 | so for the database, we have the Kay database, which is made of |
---|
0:08:41 | the productions of |
---|
0:08:43 | fifty-three normophonic and six hundred fifty-seven dysphonic patients |
---|
0:08:47 | and we just consider the production of the sustained vowel |
---|
0:08:52 | as features we use the |
---|
0:08:53 | frame-to- |
---|
0:08:53 | frame variation for the five spectrograms |
---|
0:08:56 | so as i said, if you assume that the vocal tract shape |
---|
0:08:59 | is constant during the production of these sustained vowels |
---|
0:09:02 | the frame-to-frame variations |
---|
0:09:04 | are mainly due to |
---|
0:09:08 | turbulences in the glottal production |
---|
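The talk does not give the exact formula for the frame-to-frame variation. One plausible reading, sketched here purely as an assumption, is the norm of the difference between consecutive frames of a spectrogram, normalized by the norm of the earlier frame.

```python
import numpy as np

def frame_to_frame_variation(spectrogram, eps=1e-12):
    """Relative change between consecutive frames (assumed definition).

    spectrogram: array of shape (n_frames, n_bins).
    Returns an array of length n_frames - 1.
    """
    S = np.asarray(spectrogram, dtype=float)
    difference = np.linalg.norm(S[1:] - S[:-1], axis=1)
    reference = np.linalg.norm(S[:-1], axis=1)
    return difference / np.maximum(reference, eps)

# A perfectly stationary spectrogram gives zero variation everywhere.
stationary = np.tile(np.array([1.0, 2.0, 3.0]), (5, 1))
variation = frame_to_frame_variation(stationary)
```

Under the stationarity assumption made in the talk, values far from zero would then point to irregularities in the glottal production rather than to vocal tract movement.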
0:09:12 | so we also use the two time parameters characterizing the respect of the mixed-phase |
---|
0:09:18 | model |
---|
0:09:18 | and |
---|
0:09:19 | for comparison purposes we also use |
---|
0:09:21 | three perceptual spectral balances |
---|
0:09:24 | which are extracted from the |
---|
0:09:26 | Fourier magnitude spectrum |
---|
0:09:28 | so actually these three |
---|
0:09:30 | are obtained using three distinct subbands in the spectrum, and they are used here because they were |
---|
0:09:36 | the most |
---|
0:09:37 | informative in our previous study |
---|
0:09:43 | so here you have an example of the distribution of |
---|
0:09:46 | um of |
---|
0:09:47 | some of these features |
---|
0:09:49 | so here you |
---|
0:09:49 | so you have the Fourier magnitude, the modified group delay and the chirp group delay |
---|
0:09:53 | and you can see that this is the frame-to-frame variation, in relative terms |
---|
0:09:58 | so you can see that for normophonic patients |
---|
0:10:01 | we have |
---|
0:10:03 | values which are much lower than for dysphonic patients |
---|
0:10:06 | and this is especially true for the chirp group delay representation |
---|
0:10:11 | so here on the right you have the histogram for T1, so the |
---|
0:10:16 | time constant |
---|
0:10:18 | for the |
---|
0:10:19 | respect of the mixed-phase model |
---|
0:10:20 | and |
---|
0:10:21 | actually, if the waveform corroborates |
---|
0:10:25 | that's so that a that's of the group of low we expect values are or zero but do you |
---|
0:10:30 | and |
---|
0:10:31 | this is true for the great majority of the normophonic frames |
---|
0:10:35 | but you can see that for dysphonic patients |
---|
0:10:37 | most of the time |
---|
0:10:40 | the mixed-phase decomposition fails |
---|
0:10:44 | so we have assessed these features in terms of mutual information |
---|
0:10:49 | so basically this is the percentage of useful information that the features |
---|
0:10:54 | bring to the classification problem |
---|
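One common way to express "the percentage of useful information a feature brings" is the mutual information between the feature and the class label, divided by the entropy of the class. The sketch below uses a simple histogram estimate. The estimator, the number of bins and the function name are assumptions for this example; the talk does not describe how the values were computed.

```python
import numpy as np

def normalized_mutual_information(feature, labels, n_bins=10):
    """I(feature; class) / H(class), estimated from a histogram (sketch).

    Requires at least two distinct classes, otherwise H(class) is zero.
    """
    feature = np.asarray(feature, dtype=float)
    classes, class_idx = np.unique(np.asarray(labels), return_inverse=True)
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    bin_idx = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (bin_idx, class_idx), 1.0)
    joint /= joint.sum()
    p_feature = joint.sum(axis=1, keepdims=True)
    p_class = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    mutual_info = np.sum(
        joint[nonzero] * np.log2(joint[nonzero] / (p_feature @ p_class)[nonzero])
    )
    class_entropy = -np.sum(p_class[p_class > 0] * np.log2(p_class[p_class > 0]))
    return mutual_info / class_entropy

# A feature that separates the classes perfectly carries all the information,
# a constant feature carries none.
perfect = normalized_mutual_information([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1], n_bins=2)
useless = normalized_mutual_information([3.0, 3.0, 3.0, 3.0], [0, 0, 1, 1], n_bins=2)
```

The same ratio can be computed for a pair of features by binning them jointly, which is how redundancy between two features would show up as a smaller gain than the sum of their individual scores.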
0:10:59 | so |
---|
0:10:59 | here we have the five spectrograms and you can see that |
---|
0:11:02 | the chirp group delay gives the highest amount of useful information |
---|
0:11:06 | for the |
---|
0:11:08 | classification problem between normophonic and dysphonic patients |
---|
0:11:11 | you can also see the values for the modified group delay and for the two time parameters for |
---|
0:11:17 | the respect of the mixed-phase model |
---|
0:11:20 | so as expected |
---|
0:11:21 | from our previous study, you can see that the spectral balances have the highest amount |
---|
0:11:26 | of information |
---|
0:11:28 | but you have to note that this is, let's say |
---|
0:11:30 | the intrinsic discrimination power of each feature |
---|
0:11:35 | considered separately |
---|
0:11:37 | but if you consider, for example, the combination of two features |
---|
0:11:41 | if you use Bal1 with Bal2 |
---|
0:11:44 | you can see that it only brings sixty-four percent of mutual information |
---|
0:11:47 | because they are highly redundant |
---|
0:11:50 | while the best combination of two features is Bal1 with T2 |
---|
0:11:54 | this do you |
---|
0:11:55 | which leads to seventy-nine percent of mutual information |
---|
0:11:58 | and this is possible because these two sources of information |
---|
0:12:02 | are mainly complementary and |
---|
0:12:07 | not that much redundant |
---|
0:12:12 | so we also use a classifier-based evaluation |
---|
0:12:16 | using an artificial neural network with sixteen neurons |
---|
0:12:21 | we use a ten-fold cross-validation, and for the performance measure we use the error rate |
---|
0:12:27 | both at the frame and the patient levels |
---|
0:12:29 | so a patient is diagnosed as normophonic or dysphonic |
---|
0:12:33 | so we use |
---|
0:12:35 | for that |
---|
0:12:37 | a majority decision strategy |
---|
0:12:40 | considering the frames |
---|
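The patient-level decision can be sketched as a simple majority vote over the frame-level decisions. The label coding (1 for dysphonic, 0 for normophonic) and the tie-breaking rule are assumptions made for this example.

```python
import numpy as np

def patient_decision(frame_labels):
    """Majority vote over frame-level decisions (1 = dysphonic, 0 = normophonic).

    Ties are resolved as normophonic here, which is an arbitrary choice.
    """
    labels = np.asarray(frame_labels, dtype=int)
    return int(2 * labels.sum() > labels.size)

decision_a = patient_decision([1, 1, 0, 1, 0])  # three of five frames dysphonic
decision_b = patient_decision([0, 1, 0, 0])     # one of four frames dysphonic
```

This aggregation explains why the patient-level error rate can be much lower than the frame-level one: isolated frame errors are outvoted.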
0:12:46 | so here are the results |
---|
0:12:47 | just using a single feature |
---|
0:12:49 | you can see a comparison between the baseline Fourier magnitude and the chirp group delay |
---|
0:12:54 | and you can see an improvement using that representation |
---|
0:12:58 | both at the frame level and the patient level |
---|
0:13:02 | now using |
---|
0:13:04 | two features, you have the two time constants for the respect of the mixed-phase model |
---|
0:13:09 | and the best combination of two features |
---|
0:13:12 | Bal1 and T2 |
---|
0:13:14 | so we can see |
---|
0:13:15 | that |
---|
0:13:16 | up to now, the chirp group delay gives the best result |
---|
0:13:20 | at the patient level |
---|
0:13:22 | but at the frame level we obtain the best |
---|
0:13:24 | results with Bal1 and T2 |
---|
0:13:27 | now we |
---|
0:13:28 | to |
---|
0:13:29 | features |
---|
0:13:30 | so |
---|
0:13:31 | let's say the conventional representation using the perceptual spectral balances |
---|
0:13:35 | and you can see that with these |
---|
0:13:37 | three features |
---|
0:13:39 | we obtain |
---|
0:13:41 | let's say a worse result than just using the chirp group delay at the patient level |
---|
0:13:47 | and now you can also see here |
---|
0:13:50 | the very interesting result just using the three group delay representations |
---|
0:13:55 | with a very low error rate |
---|
0:13:57 | both at the frame |
---|
0:13:58 | and the patient level |
---|
0:14:01 | now using five features, so we add the Fourier magnitude and STRAIGHT, or the two time constants, to the |
---|
0:14:07 | three group delay representations, actually you can see |
---|
0:14:11 | comparing with this line |
---|
0:14:13 | that it actually doesn't bring anything |
---|
0:14:17 | more |
---|
0:14:19 | so finally, just using the whole feature set |
---|
0:14:22 | so the ten features |
---|
0:14:23 | uh |
---|
0:14:25 | we obviously get the best result for the error rate |
---|
0:14:28 | at the |
---|
0:14:29 | frame level |
---|
0:14:30 | but considering the patient level we get |
---|
0:14:34 | about zero point eight percent, which was already obtained just using the three group |
---|
0:14:39 | delay representations |
---|
0:14:44 | so as a conclusion, we have shown that phase-based features are appropriate for characterizing |
---|
0:14:49 | irregularities in the phonation during sustained vowels |
---|
0:14:52 | and these phase-based features are actually complementary to the |
---|
0:14:57 | features derived from the magnitude spectrum |
---|
0:14:59 | commonly used in speech processing |
---|
0:15:02 | and we obtain |
---|
0:15:03 | quite good performance just using the three features from the group delay representations |
---|
0:15:08 | a but the bank or of you one |
---|
0:15:10 | if you have any questions or comments, they are welcome |
---|
0:15:12 | thanks |
---|
0:15:13 | thank you |
---|
0:15:21 | any questions |
---|
0:15:24 | or comments |
---|
0:15:26 | yes please |
---|
0:15:27 | i was so |
---|
0:15:31 | a |
---|
0:15:32 | so your observation is that the mixed-phase decomposition fails on dysphonic speech |
---|
0:15:39 | but you did not explain what the reason for that is |
---|
0:15:42 | oh okay |
---|
0:15:45 | so basically i would say that the production doesn't respect the mixed-phase model, but |
---|
0:15:50 | as i said, there is also the windowing |
---|
0:15:53 | first of all, for the windowing you have to |
---|
0:15:55 | apply a GCI-synchronous, two-pitch-period-long window |
---|
0:15:59 | so for some dysphonic patients the GCIs are not well marked or are not |
---|
0:16:05 | present at all, and also the pitch estimation may fail too |
---|
0:16:09 | so |
---|
0:16:10 | that might explain some of the bad results |
---|
0:16:12 | and um |
---|
0:16:15 | yeah, maybe because of this |
---|
0:16:21 | is the |
---|
0:16:26 | thanks for the talk, i just wanted to ask: if you have increased interaction between |
---|
0:16:31 | the vocal tract and the source |
---|
0:16:33 | would the sensitivity go up, maybe for the dysphonic patients |
---|
0:16:39 | if they happen to have more coupling |
---|
0:16:41 | would that affect |
---|
0:16:43 | um |
---|
0:16:44 | the phase |
---|
0:16:45 | the mixed-phase model |
---|
0:16:46 | okay |
---|
0:16:49 | um |
---|
0:16:52 | i to that |
---|
0:16:53 | i cannot answer that question |
---|
0:16:55 | anyway you find a maximum- and a minimum-phase component, but as to whether it is relevant |
---|
0:17:01 | to consider that the maximum-phase component is a glottal flow estimate |
---|
0:17:05 | i would not say that, i'm not sure |
---|
0:17:08 | but |
---|
0:17:09 | okay |
---|
0:17:10 | and the back |
---|
0:17:12 | but in my experience, for the decomposition for normophonic speech, let's say a speech synthesis |
---|
0:17:17 | database, it works |
---|
0:17:19 | but when you have more |
---|
0:17:21 | let's say |
---|
0:17:21 | coupling between the vocal tract and the glottal source |
---|
0:17:24 | i cannot |
---|
0:17:25 | i cannot answer |
---|
0:17:26 | hi |
---|
0:17:27 | thanks for your talk |
---|
0:17:29 | a |
---|
0:17:30 | just a question |
---|
0:17:31 | you were saying the glottal source is |
---|
0:17:34 | just have to me |
---|
0:17:35 | maximum-phase components |
---|
0:17:37 | but there is also a minimum-phase component, which is due to the |
---|
0:17:41 | so to yield |
---|
0:17:43 | so a and that my |
---|
0:17:44 | spectral tilt of the glottal source |
---|
0:17:46 | yeah |
---|
0:17:47 | and that might also vary from frame to frame |
---|
0:17:50 | so it seems that that component might also |
---|
0:17:53 | if you take it into account, you could get better |
---|
0:17:57 | and uh |
---|
0:17:57 | results |
---|
0:17:59 | thanks |
---|
0:18:01 | a you okay |
---|
0:18:01 | so the spectral tilt is mainly due to the glottal return phase, which is a minimum-phase signal |
---|
0:18:06 | so it is mixed in, let's say |
---|
0:18:08 | in the minimum-phase component |
---|
0:18:11 | yeah |
---|
0:18:12 | which is |
---|
0:18:12 | which is not the object, i mean |
---|
0:18:14 | which is not |
---|
0:18:15 | studied |
---|
0:18:16 | in this work |
---|
0:18:17 | so we just focus on the analysis of the maximum-phase component |
---|
0:18:20 | and also for the features derived from the |
---|
0:18:24 | mixed-phase model, we just consider |
---|
0:18:26 | just two parameters |
---|
0:18:28 | so that might also answer the previous question |
---|
0:18:33 | even though it is not really a glottal flow estimate |
---|
0:18:37 | you might have |
---|
0:18:38 | okay |
---|
0:18:39 | just so that here we have, let's say |
---|
0:18:42 | a relevant waveform, and this one is very noisy |
---|
0:18:45 | meaning that the mixed-phase decomposition just fails |
---|
0:18:49 | so you cannot interpret that as a glottal flow estimate |
---|
0:18:53 | not the last you might have a a a at of expect the composition |
---|
0:19:01 | does it answer your question |
---|
0:19:03 | or |
---|
0:19:07 | i have a question myself |
---|
0:19:09 | uh |
---|
0:19:10 | in the dysphonic database i guess you have different classes of dysphonia |
---|
0:19:14 | could you comment on that, and on whether you tried to distinguish those classes as well |
---|
0:19:19 | so we did not do that work, we just did, let's say, a binary decision, so a classification |
---|
0:19:24 | uh |
---|
0:19:24 | normophonic versus dysphonic, and also, let's say, in the database |
---|
0:19:28 | you might have various |
---|
0:19:31 | pathologies for a single |
---|
0:19:33 | or a single patient |
---|
0:19:34 | we just consider a binary classification |
---|
0:19:43 | so that |
---|
0:19:44 | concludes the discussion, let's thank the speaker again |
---|