Speech transcript - PHASE-BASED INFORMATION FOR VOICE PATHOLOGY DETECTION

0:00:13	Thank you.
0:00:14	Thanks to all of you for coming
0:00:16	today.
0:00:17	So this is the outline of my talk.
0:00:19	First I will give a brief introduction to the topic,
0:00:22	and then describe the phase-based features which are studied in this work.
0:00:27	Then I will show you the results of our experimental evaluation of these features
0:00:31	within the framework of voice pathology detection,
0:00:34	and finally I will conclude.
0:00:38	In the great majority of speech processing applications, the focus is on the use of the amplitude
0:00:44	spectrum of the Fourier transform.
0:00:47	Nonetheless, there might be a benefit in also considering the phase information.
0:00:53	Just as an example,
0:00:57	there have been approaches using phase-based features
0:01:00	for speaker recognition or automatic speech recognition,
0:01:05	such as, for example, the work of Murthy or Hegde,
0:01:09	and this
0:01:12	has shown an improvement from using phase-based features in these systems.
0:01:16	This is possible since
0:01:18	phase provides a complementary source of information
0:01:21	with regard to the amplitude spectrum,
0:01:24	and therefore investigating the usefulness of
0:01:28	the phase information
0:01:30	seems to be a promising approach in speech analysis.
0:01:36	Now I will describe the phase-based features which are the object of
0:01:39	this work.
0:01:42	So we focus on the group delay function. The group delay is defined as minus
0:01:46	the first derivative of the unwrapped phase of the Fourier transform,
0:01:52	and this can be written as follows,
0:01:55	here,
0:01:56	where you can see that the subscripts R and I denote the real and imaginary parts of the
0:02:00	Fourier transform,
0:02:02	and Y is the Fourier transform of n multiplied by x(n).
0:02:06	An advantage of using that equation is that it doesn't require any phase unwrapping.
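
The phase-unwrapping-free expression described above can be sketched in Python with NumPy. This is only an illustrative sketch, not the authors' code: the function name `group_delay`, the FFT size and the small `eps` guard against division by zero are assumptions made for the example.

```python
import numpy as np

def group_delay(x, n_fft=512, eps=1e-12):
    """Group delay of a frame x, computed without phase unwrapping.

    Uses tau(w) = (X_R * Y_R + X_I * Y_I) / |X|^2, where X is the DFT of
    x[n] and Y is the DFT of n * x[n]. `eps` (an assumption of this sketch)
    only avoids a division by zero when |X| vanishes.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)       # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)   # spectrum of n * x[n]
    power = X.real ** 2 + X.imag ** 2
    return (X.real * Y.real + X.imag * Y.imag) / (power + eps)

# A pure delay of 5 samples has a constant group delay of 5 samples.
impulse = np.zeros(16)
impulse[5] = 1.0
print(group_delay(impulse)[:3])
```

The denominator is the power spectrum, so any frequency where it gets close to zero produces a very large value, which is the spike problem discussed next in the talk.
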
0:02:11	You can understand why the group delay function has most of the time been avoided
0:02:16	in conventional speech processing applications:
0:02:19	if you consider the z-transform of the signal,
0:02:24	there
0:02:25	might be some zeros close to the unit circle, and this is especially true for the speech signal.
0:02:32	The Fourier transform is evaluated at frequencies located on the unit circle
0:02:37	in the z-plane,
0:02:39	and for zeros
0:02:40	close to the unit circle
0:02:42	the variation in the phase information is quite high,
0:02:46	which results in high spikes in the group delay function.
0:02:51	You can also understand that
0:02:54	because, for such frequencies,
0:02:56	the denominator of that expression becomes very low,
0:03:00	resulting in spikes in the group delay.
0:03:04	So there have been some approaches
0:03:07	aiming at reducing these spikes in the group delay.
0:03:12	A first approach is the modified group delay proposed by Hegde.
0:03:16	You can see that here, in the denominator,
0:03:19	the spectrum has been replaced by S, which is a cepstrally smoothed version
0:03:24	of the Fourier transform.
0:03:26	This representation also makes use of
0:03:30	two smoothing parameters, alpha and gamma,
0:03:33	which also aim at reducing the spikes in the group delay.
0:03:38	Another version is the product of the power spectrum and the group delay,
0:03:41	proposed by Zhu.
0:03:43	Here you can see that the main source of the spikes in the group
0:03:47	delay comes from the denominator,
0:03:50	so they just get rid of it and only consider the numerator of the expression.
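
The two variants just described can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names, the lifter length `n_lifter` and the default values of `alpha` and `gamma` are chosen for the example and are not taken from the talk.

```python
import numpy as np

def _spectra(x, n_fft):
    """Return X (DFT of x[n]) and Y (DFT of n * x[n])."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.fft.fft(x, n_fft), np.fft.fft(n * x, n_fft)

def cepstral_smooth(mag, n_lifter):
    """Smooth a full-length magnitude spectrum by keeping low quefrencies."""
    n_fft = len(mag)
    cep = np.fft.ifft(np.log(mag + 1e-12)).real
    idx = np.arange(n_fft)
    keep = np.minimum(idx, n_fft - idx) < n_lifter  # symmetric lifter
    return np.exp(np.fft.fft(cep * keep).real)

def modified_group_delay(x, n_fft=512, alpha=0.4, gamma=0.9, n_lifter=8):
    """Group delay with |X|^2 replaced by S^(2*gamma), then compressed by alpha.

    alpha, gamma and n_lifter are illustrative values (assumptions).
    """
    X, Y = _spectra(x, n_fft)
    S = cepstral_smooth(np.abs(X), n_lifter)
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    mgd = np.sign(tau) * np.abs(tau) ** alpha
    return mgd[: n_fft // 2 + 1]

def product_spectrum(x, n_fft=512):
    """Product of power spectrum and group delay: the numerator only."""
    X, Y = _spectra(x, n_fft)
    return (X.real * Y.real + X.imag * Y.imag)[: n_fft // 2 + 1]
```

Because the smoothed spectrum `S` is the exponential of a real sequence, it is strictly positive, so the division cannot blow up the way the plain group delay does near spectral zeros.
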
0:03:59	We have also investigated the chirp group delay proposed by Bozkurt.
0:04:05	This actually uses another contour in the z-plane:
0:04:08	instead of the unit circle,
0:04:11	another circle in the z-plane
0:04:14	is used to evaluate the z-transform, and this leads to
0:04:17	a both smooth and high-resolution representation of the peaks in the speech spectrum.
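
Evaluating the z-transform on a circle of radius rho instead of the unit circle amounts to weighting x[n] by rho^(-n) before taking the DFT. A minimal sketch of that idea follows; the function name and the default radius are assumptions for illustration, and the published chirp group delay may include further processing steps not shown here.

```python
import numpy as np

def chirp_group_delay(x, rho=1.12, n_fft=512, eps=1e-12):
    """Group delay evaluated on the circle |z| = rho in the z-plane.

    X(rho * e^{jw}) = sum_n x[n] * rho^(-n) * e^{-jwn}, so the signal is
    weighted by rho^(-n) and the usual group delay formula is applied.
    The default rho is an illustrative value (assumption).
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    xw = x * rho ** (-n.astype(float))  # move the evaluation contour
    X = np.fft.rfft(xw, n_fft)
    Y = np.fft.rfft(n * xw, n_fft)
    power = X.real ** 2 + X.imag ** 2
    return (X.real * Y.real + X.imag * Y.imag) / (power + eps)
```

With `rho=1.0` this reduces to the ordinary group delay on the unit circle; moving the contour away from zeros that lie close to the unit circle is what tames the spikes.
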
0:04:24	For comparison purposes we also use the STRAIGHT spectrogram by Kawahara,
0:04:29	and this is,
0:04:30	let's say, a pitch-adaptive smoothing of the speech
0:04:38	spectrum.
0:04:40	As a baseline we also consider the Fourier magnitude, so the magnitude spectrum of the Fourier transform.
0:04:45	Here I give an example of what these
0:04:48	five spectrograms
0:04:49	look like
0:04:52	for a sustained vowel
0:04:53	produced by a normophonic patient, above,
0:04:56	and below for a dysphonic patient.
0:04:59	So here you have the Fourier magnitude, the STRAIGHT spectrum,
0:05:01	the modified group delay, the product of the power spectrum and the group delay, and the chirp group delay.
0:05:06	You can see that for the normophonic patient
0:05:09	we have a structure which is very regular in time,
0:05:13	while this is not true for the dysphonic patient,
0:05:17	and this is especially apparent in the chirp group delay.
0:05:21	Basically, to explain this:
0:05:23	during the production of a sustained vowel you can assume that the vocal tract shape is constant,
0:05:29	so that the vocal tract function can be assumed to be stationary.
0:05:33	So if you find some irregularities,
0:05:35	they come from the turbulences occurring during the
0:05:38	glottal production.
0:05:42	Besides these five spectrograms, we also use features derived from the mixed-phase decomposition.
0:05:49	To explain the mixed-phase decomposition, just consider the source-filter approach:
0:05:53	we have a glottal flow derivative which is convolved in the time domain with the vocal tract
0:05:57	response
0:05:58	to give the speech signal.
0:06:01	The mixed-phase model of speech says that the glottal flow is maximum-phase, which
0:06:07	means that it is an anticausal signal,
0:06:10	while the vocal tract is minimum-phase, that is to say
0:06:13	causal.
0:06:15	So the
0:06:16	key idea of the mixed-phase decomposition is to separate the minimum- and maximum-phase components of speech,
0:06:23	and
0:06:24	this is possible, for example, in the zeros of the z-transform domain, as proposed by Bozkurt.
0:06:29	Here you can see the zeros:
0:06:33	this is the z-plane in polar coordinates,
0:06:36	and you can see that the zeros related to the glottal flow are outside the unit circle,
0:06:40	while for the vocal tract they are inside it, since the vocal tract is a
0:06:45	causal, minimum-phase system.
0:06:47	So you can see that in this domain
0:06:50	there is a possible separation between the minimum- and maximum-phase components of speech.
0:06:56	We have shown that it is also possible in the complex cepstrum domain,
0:06:59	just using the quefrency origin as a boundary for the separation.
0:07:05	In this work we focus on the use of the complex cepstrum decomposition.
0:07:10	Basically we have a speech signal and we apply a
0:07:12	specific window,
0:07:14	which is synchronous with the glottal closure instant
0:07:17	and two pitch periods long,
0:07:19	and then we compute the complex cepstrum.
0:07:21	In the complex cepstrum domain it is very easy: just keeping the negative indices,
0:07:27	by inverse complex cepstrum we get the maximum-phase component of speech,
0:07:31	which is mainly related to the glottal flow,
0:07:34	while for the positive indices
0:07:36	we get the minimum-phase
0:07:37	component of speech, which is mainly influenced by the vocal tract.
0:07:41	So in this work
0:07:44	we just isolated the maximum-phase component of speech,
0:07:47	which is a kind of glottal flow estimate.
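
The separation at the quefrency origin can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes the frame has already been windowed as described (GCI-synchronous, two pitch periods long), the function names are hypothetical, and assigning the zero-quefrency (gain) term to the minimum-phase part is an arbitrary choice of this sketch.

```python
import numpy as np

def complex_cepstrum(frame, n_fft):
    """Complex cepstrum via log-magnitude and unwrapped phase."""
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    phase = np.unwrap(np.angle(spectrum))
    # Remove the linear phase term (a pure delay) so that the phase is
    # periodic before the inverse transform.
    center = n_fft // 2
    n_delay = np.round(phase[center] / np.pi)
    phase = phase - np.pi * n_delay * np.arange(n_fft) / center
    return np.fft.ifft(log_mag + 1j * phase).real

def mixed_phase_split(frame, n_fft=4096):
    """Split a windowed frame into maximum- and minimum-phase components.

    Negative quefrencies (stored in the upper half of the array) give the
    maximum-phase, anticausal part; non-negative ones give the
    minimum-phase part.
    """
    ccep = complex_cepstrum(np.asarray(frame, dtype=float), n_fft)
    half = n_fft // 2
    neg = np.zeros(n_fft)
    pos = np.zeros(n_fft)
    neg[half + 1:] = ccep[half + 1:]   # quefrencies -1, -2, ...
    pos[:half] = ccep[:half]           # quefrencies 0, 1, 2, ...
    x_max = np.fft.ifft(np.exp(np.fft.fft(neg))).real
    x_min = np.fft.ifft(np.exp(np.fft.fft(pos))).real
    return x_max, x_min
```

In the returned `x_max`, anticausal samples are stored at the end of the array, so `x_max[-1]` is the sample at time index -1. A toy check is to convolve a minimum-phase sequence such as `[1.0, 0.5]` with a maximum-phase one such as `[0.5, 1.0]` and verify that the two factors are recovered up to a delay.
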
0:07:52	Here you have an example of two cycles of the maximum-phase component.
0:07:57	On the one hand,
0:07:59	when the mixed-phase model is respected,
0:08:03	we obtain waveforms which resemble the glottal flow, those of the top row,
0:08:07	such as in a glottal model,
0:08:10	while for some other frames
0:08:12	the mixed-phase decomposition fails
0:08:14	and we get
0:08:15	irrelevant waveforms.
0:08:17	So, in order
0:08:19	to know whether the mixed-phase model is respected or not,
0:08:22	we just compute these two time parameters from the
0:08:27	maximum-phase waveform.
0:08:33	So now the experimental evaluation of these features.
0:08:36	As for the database, we use the Kay database, which is made of
0:08:41	the productions of
0:08:43	fifty-three normophonic and six hundred fifty-seven dysphonic patients,
0:08:47	and we just consider the production of the sustained vowel.
0:08:52	As features we use the
0:08:53	frame-to-
0:08:53	frame variation for the five spectrograms.
0:08:56	As I said, if you assume that the vocal tract shape
0:08:59	is constant during the production of these sustained vowels,
0:09:02	the frame-to-frame variations
0:09:04	are mainly due to
0:09:08	turbulences in the glottal production.
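
One simple way to quantify a relative frame-to-frame variation of a spectrogram is sketched below. The exact definition used in the work is not given in the talk, so the normalisation by the previous frame and the function name are assumptions of this example.

```python
import numpy as np

def frame_to_frame_variation(spectrogram, eps=1e-12):
    """Relative variation between consecutive frames of a spectrogram.

    `spectrogram` has shape (n_frames, n_bins). For each pair of
    consecutive frames, the summed absolute difference is divided by the
    summed magnitude of the earlier frame (an assumed normalisation).
    """
    S = np.asarray(spectrogram, dtype=float)
    diff = np.abs(S[1:] - S[:-1]).sum(axis=1)
    ref = np.abs(S[:-1]).sum(axis=1)
    return diff / (ref + eps)
```

A perfectly stationary spectrogram gives zeros, so larger values indicate a less regular structure in time.
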
0:09:12	We also use the two time parameters characterizing the respect of the mixed-phase model,
0:09:19	and
0:09:19	for comparison purposes we also use
0:09:21	three perceptual spectral balances,
0:09:24	which are extracted from the
0:09:26	Fourier magnitude spectrum.
0:09:28	Actually these
0:09:30	consist in using three distinct subbands in the spectrum, and they are given here because they were
0:09:36	the most
0:09:37	informative in our previous study.
0:09:43	Here you have an example of the distribution of
0:09:47	some features.
0:09:49	Here
0:09:49	you have the Fourier magnitude, the modified group delay and the chirp group delay,
0:09:53	and what you see is the frame-to-frame variation, in relative terms.
0:09:58	You can see that for normophonic patients
0:10:01	we have
0:10:03	values which are much lower than for dysphonic patients,
0:10:06	and this is especially true for the chirp group delay representation.
0:10:11	On the right you have the histogram for T1, so the
0:10:16	time constant
0:10:18	for the
0:10:19	respect of the mixed-phase model.
0:10:21	Actually, if the waveform corroborates
0:10:25	the shape of the glottal flow, we expect values around [unclear],
0:10:31	and this is true for the great majority of the normophonic patients.
0:10:35	But you can see that for dysphonic patients,
0:10:37	most of the time,
0:10:40	the mixed-phase decomposition fails.
0:10:44	We have assessed these features in terms of mutual information.
0:10:49	Basically, this is the percentage of useful information that the features
0:10:54	bring to the classification problem.
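
A percentage of this kind can be obtained by normalising the mutual information between a feature and the class label by the entropy of the class label. The sketch below uses a plain histogram discretisation; the estimator, the number of bins and the function name are assumptions, since the talk does not specify them.

```python
import numpy as np

def normalized_mutual_information(feature, labels, n_bins=10):
    """I(feature; class) / H(class), with a histogram-based estimate.

    Returns a value between 0 and 1. Assumes at least two classes are
    present, otherwise H(class) is zero.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    f_idx = np.digitize(feature, edges[1:-1])       # bin index 0..n_bins-1
    classes, c_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (f_idx, c_idx), 1.0)           # joint histogram
    joint /= joint.sum()
    p_f = joint.sum(axis=1, keepdims=True)          # feature marginal
    p_c = joint.sum(axis=0, keepdims=True)          # class marginal
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log2(joint[nz] / (p_f @ p_c)[nz]))
    h_c = -np.sum(p_c[p_c > 0] * np.log2(p_c[p_c > 0]))
    return mi / h_c
```

A feature that separates the two classes perfectly yields 1.0, and a feature that is independent of the class yields 0.0. The same function applied to a jointly discretised pair of features would show how redundancy limits the gain from combining them.
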
0:10:59	Here
0:10:59	we have the five spectrograms, and you can see that
0:11:02	the chirp group delay gives the highest amount of useful information
0:11:06	for the
0:11:08	classification problem between normophonic and dysphonic patients.
0:11:11	You can also see the values for the modified group delay and for the two time parameters for
0:11:17	the respect of the mixed-phase model.
0:11:20	As expected
0:11:21	from our previous studies, you can see that the spectral balances bring the highest amount
0:11:26	of information.
0:11:28	But you have to note that this is, let's say,
0:11:30	the intrinsic discrimination power of each feature
0:11:35	considered separately.
0:11:37	If you combine them, for example with a combination of two features,
0:11:41	if you use Bal1 with Bal2,
0:11:44	you can see that it only brings sixty-four percent of mutual information,
0:11:47	because they are highly redundant,
0:11:50	while the best combination of two features is Bal1 with T2,
0:11:54	this one,
0:11:55	which leads to seventy-nine percent of mutual information,
0:11:58	and this is possible because these two sources of information
0:12:02	are mainly complementary and
0:12:07	not that much redundant.
0:12:12	We also performed a classifier-based evaluation
0:12:16	using an artificial neural network with sixteen neurons.
0:12:21	We use a ten-fold cross-validation, and for the performance measure we use the error rate,
0:12:27	both at the frame and the patient levels.
0:12:29	A patient is diagnosed as normophonic or dysphonic
0:12:33	using
0:12:37	a majority decision strategy
0:12:40	considering the frames.
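
The patient-level decision by majority vote over frame-level decisions can be sketched as follows. The label convention (1 for dysphonic, 0 for normophonic), the tie-breaking rule and the function names are assumptions of this example.

```python
import numpy as np

def patient_level_decisions(frame_pred, patient_ids):
    """Majority vote over the frames of each patient.

    Labels are assumed to be 1 (dysphonic) or 0 (normophonic); a tie is
    resolved as 0, which is an arbitrary choice of this sketch.
    """
    frame_pred = np.asarray(frame_pred)
    patient_ids = np.asarray(patient_ids)
    decisions = {}
    for pid in np.unique(patient_ids):
        votes = frame_pred[patient_ids == pid]
        decisions[pid] = int(votes.mean() > 0.5)
    return decisions

def error_rates(frame_pred, frame_true, patient_ids):
    """Return (frame-level error rate, patient-level error rate)."""
    frame_pred = np.asarray(frame_pred)
    frame_true = np.asarray(frame_true)
    patient_ids = np.asarray(patient_ids)
    frame_err = float(np.mean(frame_pred != frame_true))
    decisions = patient_level_decisions(frame_pred, patient_ids)
    wrong = 0
    for pid, decision in decisions.items():
        truth = frame_true[patient_ids == pid][0]  # one label per patient
        wrong += int(decision != truth)
    return frame_err, wrong / len(decisions)
```

Voting explains why the patient-level error can be much lower than the frame-level error: a patient is still classified correctly as long as most of that patient's frames are.
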
0:12:46	So here are the results.
0:12:47	Just using a single feature,
0:12:49	you can see the comparison between, let's say, the baseline Fourier magnitude and the chirp group delay,
0:12:54	and you can see an improvement using the chirp group delay representation,
0:12:58	both at the frame level and at the patient level.
0:13:02	Using now
0:13:04	two features, you have the two time constants for the respect of the mixed-phase model,
0:13:09	and the best combination of two features,
0:13:12	Bal1 and T2.
0:13:14	So we can see
0:13:15	that,
0:13:16	up to now, the chirp group delay gives the best result
0:13:20	at the patient level,
0:13:22	but at the frame level we obtain the best
0:13:24	results with Bal1 and T2.
0:13:27	Now with
0:13:28	three
0:13:29	features:
0:13:31	let's say the conventional representation using the perceptual spectral balances,
0:13:35	and you can see that with these
0:13:37	three features
0:13:39	we obtain,
0:13:41	let's say, a worse result than just using the chirp group delay at the patient level.
0:13:47	And now you can also see here
0:13:50	the very interesting result of just using the three group delay representations,
0:13:55	with a very low error rate
0:13:57	both at the frame
0:13:58	and the patient levels.
0:14:01	Now using five features, so we add the Fourier magnitude and STRAIGHT, or the two time constants, to the
0:14:07	three group delay representations, actually you can see,
0:14:11	comparing with this line,
0:14:13	that it actually doesn't bring anything
0:14:17	more.
0:14:19	So finally, just using the full feature set,
0:14:22	so the ten features,
0:14:25	we obtain obviously the best result in terms of error rate
0:14:28	for the
0:14:29	frame level,
0:14:30	but considering the patient level we get
0:14:34	about 0.8 percent, which was already obtained just using the three group
0:14:39	delay representations.
0:14:44	As a conclusion, we have shown that phase-based features are appropriate for characterizing
0:14:49	irregularities in the phonation during sustained vowels,
0:14:52	and these phase-based features are actually complementary to the
0:14:57	usual features derived from the magnitude spectrum
0:14:59	commonly used in speech processing.
0:15:02	We obtain a
0:15:03	quite good performance just using three features, the three group delay representations.
0:15:08	So thank you all,
0:15:10	and if you have any questions or comments, they are welcome.
0:15:12	Thanks.
0:15:13	Thank you.
0:15:21	Any questions
0:15:24	or comments?
0:15:26	Yes, please.
0:15:32	So one observation is that your mixed-phase decomposition fails on dysphonic speech,
0:15:39	but you do not explain what the reason for that is.
0:15:42	Oh, okay.
0:15:45	So theoretically you would say that the production does not respect the mixed-phase model, but,
0:15:50	as I said, there is also the windowing.
0:15:53	First of all, for the windowing you have to
0:15:55	apply a GCI-synchronous, two-pitch-period-long window.
0:15:59	For some dysphonic patients the GCIs are not well marked, or are not
0:16:05	present at all, and also the pitch estimation may fail too,
0:16:09	so
0:16:10	that might explain some of the bad results,
0:16:12	and
0:16:15	yeah, maybe it is because of this kind of thing.
0:16:21	Is there another question?
0:16:26	Thanks for the talk. I just wanted to ask: if you have increased interaction between
0:16:31	the vocal tract and the source,
0:16:33	would the sensitivity go up, maybe for the dysphonic patients,
0:16:39	if they happen to have more coupling?
0:16:41	Would that affect
0:16:44	the
0:16:45	mixed-phase model?
0:16:46	Okay.
0:16:52	I cannot
0:16:53	really answer that question.
0:16:55	Anyway, you find a maximum- and a minimum-phase component, but to say that it is relevant
0:17:01	to consider that the maximum-phase component is a glottal flow estimate,
0:17:05	I would not say that; I'm not sure.
0:17:09	Okay.
0:17:12	But in my experience, the decomposition for normophonic speakers, on, let's say, a speech synthesis
0:17:17	database, works.
0:17:19	But when you have more,
0:17:21	let's say,
0:17:21	coupling between the vocal tract and the glottal source,
0:17:24	I cannot
0:17:25	answer.
0:17:26	Hi,
0:17:27	thanks for the talk.
0:17:30	Just a question:
0:17:31	you are assuming the glottal source to be
0:17:34	just a
0:17:35	maximum-phase component,
0:17:37	but there is also a minimum-phase component, which is due to the
0:17:41	spectral tilt,
0:17:43	so the
0:17:44	spectral tilt of the glottal source.
0:17:46	Yeah.
0:17:47	And that might also vary from frame to frame,
0:17:50	so that component might also help:
0:17:53	if you took it into account, you could get better
0:17:57	results.
0:17:59	Thank you.
0:18:01	Okay.
0:18:01	So the spectral tilt is mainly due to the glottal return phase, which is a minimum-phase signal,
0:18:06	so it is mixed in, let's say,
0:18:08	in the minimum-phase component,
0:18:11	yeah,
0:18:12	which is
0:18:14	not the object of study
0:18:16	in this work.
0:18:17	So we just focus on the analysis of the maximum-phase component,
0:18:20	and also, for the features derived from the
0:18:24	mixed-phase model, we consider just two,
0:18:26	just two parameters.
0:18:28	Maybe that might also answer the previous question:
0:18:33	even though it is not really a glottal flow estimate,
0:18:37	you might have a
0:18:38	okay,
0:18:39	just to show that here we have, let's say,
0:18:42	a relevant waveform, while this one is very noisy,
0:18:45	meaning that the mixed-phase decomposition just fails,
0:18:49	so you cannot interpret that as a glottal flow estimate.
0:18:53	Nonetheless, you get an indication of the quality of the mixed-phase decomposition.
0:19:01	Is there another question,
0:19:03	or
0:19:07	I have a question myself again.
0:19:10	In the dysphonic database I guess you have different classes of dysphonia.
0:19:14	Could you comment on that, and on whether you tried to distinguish between those classes as well?
0:19:19	No, we did not do that in this work; we just made, let's say, a binary classification,
0:19:24	normophonic versus dysphonic. Also, in the database
0:19:28	you might have various
0:19:31	pathologies for a single
0:19:33	patient,
0:19:34	so we just consider a binary classification.
0:19:43	So that
0:19:44	concludes the discussion. Let's thank the speaker again.

PHASE-BASED INFORMATION FOR VOICE PATHOLOGY DETECTION

Modeling and Analysis of Speech Production

Speaker: Drugman Thomas, Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit, University of Mons, Belgium