Speech Transcript - IMPROVING TEXT-INDEPENDENT PHONETIC SEGMENTATION BASED ON THE MICROCANONICAL MULTISCALE FORMALISM

0:00:13	and
0:00:13	just that the stance was you met and the statistics in as creation ask data
0:00:18	and the subject of my talk is to introduce an improved method for text-independent
0:00:23	phonetic segmentation based on the microcanonical multiscale formalism
0:00:28	in brief
0:00:29	i will first focus on what it means to view speech as a complex signal in the
0:00:33	physical sense
0:00:34	that is to say, to view it
0:00:37	as a realisation of a complex system
0:00:40	after that
0:00:41	i will briefly introduce a formalism from the study of complex systems that can be used as a powerful tool
0:00:47	to capture the nonlinear character of the speech signal
0:00:49	this is called the microcanonical multiscale formalism, MMF
0:00:53	and i will show the general potential of MMF to be applied in speech
0:00:58	analysis
0:00:59	and then i will turn to the
0:01:02	application of this formalism to phonetic segmentation of the speech signal, and i will introduce
0:01:07	a basic and an improved algorithm for segmentation
0:01:10	and finally i will take some time to present experimental results and to conclude
0:01:16	so it has been
0:01:17	theoretically and experimentally established that there exist
0:01:21	nonlinear phenomena in the production process of the speech
0:01:25	signal. for example, the Reynolds number, which is a
0:01:28	number characterising different flow regimes,
0:01:31	is reported to be in the thousands
0:01:33	which corresponds to turbulent flow
0:01:35	while, as we know, most of the
0:01:38	approaches in speech processing are
0:01:40	based on the linear source-filter model, which cannot adequately take into account
0:01:45	the nonlinear character of the speech signal
0:01:48	hence our goal here is to find and evaluate key parameters which are responsible for the complex
0:01:54	character of the speech signal
0:01:55	previous studies have shown that such parameters do exist, but they are very hard to estimate
0:02:02	our strategy is to take the
0:02:04	knowledge coming from statistical physics and to relate the complexity to the predictability of each point inside
0:02:10	the signal
0:02:11	and in practice we need to
0:02:13	develop computationally efficient tools
0:02:16	yeah
0:02:19	to estimate these parameters, if they exist, and to use them for practical applications
0:02:25	as important one
0:02:27	in the study of complex systems, the first phase started in the late forties with the classical work
0:02:32	of Kolmogorov
0:02:34	and
0:02:34	which was the basis for the later approaches in this domain
0:02:38	which are based on the study of structure functions
0:02:41	the main result of these methods is to
0:02:44	recognise globally the existence of a multiscale structure without giving access to
0:02:50	state there
0:02:51	i mean
0:02:53	oh is a use is two
0:02:55	side
0:02:56	because they are based on statistical averages and on the stationarity assumption
0:03:01	they can be used to decide whether a system is complex or not, but not much more information
0:03:06	in the second phase we try to
0:03:08	uh
0:03:09	determine geometrically, inside the signal, where the complexity happens and how it organises itself
0:03:15	in more
0:03:16	precise terms, we try to find a subset inside the signal which has the highest
0:03:20	information content, and we try to explain how
0:03:23	the transfer of
0:03:24	information between different scales
0:03:28	organises itself
0:03:30	these methods have been made possible by the approach in statistical physics to the study of
0:03:35	i lily system and the two size
0:03:38	a study of the notion of
0:03:40	transition inside a complex system
0:03:44	it has been shown that a geometric multiscale organisation is responsible for the complexity inside
0:03:50	a signal
0:03:51	a typical example of this is the cascade of energy in fully developed turbulence
0:03:57	its fingerprint in practice is the existence of power-law behaviour in the temporal correlation function
0:04:03	which has to be
0:04:04	evaluated without any stationarity assumption, at each point inside the signal, and the
0:04:09	exponents related to this power law are, as we will see shortly,
0:04:14	called singularity exponents, and it can be shown that they completely explain the
0:04:19	organisation of the multiscale structures
0:04:23	and
0:04:26	an example of this
0:04:28	is the canonical formalism, which is used in the study of multifractal signals
0:04:33	the canonical formalism was the first attempt to estimate
0:04:37	singularity exponents, as a global property of the signal, through
0:04:41	what is called the Legendre spectrum. in this equation we have
0:04:45	a complex signal s
0:04:47	and a multiresolution functional Γ_r operating at the scale r
0:04:53	and the brackets stand for expectation over
0:04:56	a statistical ensemble
0:04:59	the exponent of this power law, τ(p), can be related to the
0:05:03	distribution of singularity exponents
0:05:05	through a Legendre transform, but the main problem is that it is a global description: it does not give access to
0:05:11	the geometric
0:05:12	and local dynamics of the signal
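For reference, the canonical relation described here is usually written as below in the multifractal literature. This is a reconstruction in standard notation, not a copy of the speaker's slide; the symbols d (dimension of the signal domain, 1 for a speech signal) and D(h) (the singularity spectrum) are assumptions that the talk does not name.

```latex
\left\langle \left| \Gamma_r s \right|^{p} \right\rangle \;\sim\; r^{\tau(p)} \quad (r \to 0),
\qquad
D(h) \;=\; \inf_{p} \bigl\{\, p\,h + d - \tau(p) \,\bigr\}
```

The average over the ensemble is what makes the description global: the exponents τ(p) summarise the whole signal and carry no time index.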
0:05:15	so in the
0:05:17	microcanonical formalism we try,
0:05:19	instead of relying on statistical averages, to look
0:05:23	at the signal itself
0:05:26	and we try to introduce
0:05:27	singularity exponents geometrically,
0:05:29	that is, related to the geometric location inside the signal, via
0:05:33	the time index t here, and uh
0:05:35	yeah
0:05:36	the multiresolution functional Γ_r
0:05:38	and this quantity here, the
0:05:41	exponent of this power law, is what we call the singularity exponent
0:05:45	and if it
0:05:46	can be estimated
0:05:47	precisely, it
0:05:49	reveals the transition fronts of the signal
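The pointwise power law referred to here is commonly written as follows. Again this is a reconstruction in standard notation under the assumption that the slide used the usual microcanonical form; α(t) is a signal-dependent prefactor that the talk does not mention.

```latex
\Gamma_r\bigl(s(t)\bigr) \;=\; \alpha(t)\, r^{\,h(t)} \;+\; o\!\bigl(r^{\,h(t)}\bigr), \qquad r \to 0
```

Unlike the canonical relation, there is no ensemble average: the exponent h(t) is attached to each time index t.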
0:05:51	yeah
0:05:52	the main problem is the precise estimation of these parameters
0:05:56	and in this regard one of the crucial
0:06:00	problems is the choice of the functional Γ_r. for example, we can
0:06:05	simply use linear increments
0:06:07	but it has been shown that this does not give a precise estimation of h(t) because of
0:06:12	the
0:06:13	instability and sensitivity of these
0:06:16	linear increments
0:06:17	a better choice
0:06:19	is the gradient-modulus measure, which is defined as the integral of the gradient modulus over the
0:06:25	ball
0:06:26	B_r(t) in this equation, normalised by the Lebesgue measure on the real line
0:06:31	this is defined from the typical characterisation of
0:06:35	kinetic energy in turbulence, and
0:06:37	it has been shown that it
0:06:40	can
0:06:41	it is related to the information content of each point if we use this measure for
0:06:46	yeah
0:06:47	the calculation of h(t)
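A minimal sketch of how h(t) could be estimated from such a measure, assuming a plain log-log regression across a few scales. The function name, the choice of scales and the use of `np.gradient` as the derivative are illustrative assumptions; the talk does not describe the authors' actual estimator, which may be more refined.

```python
import numpy as np

def singularity_exponents(signal, scales=(1, 2, 4, 8)):
    """Estimate h(t) as the slope of log Gamma_r(t) against log r (hypothetical estimator)."""
    s = np.asarray(signal, dtype=float)
    # Gradient modulus; the small floor avoids log(0) on flat regions.
    grad = np.abs(np.gradient(s)) + 1e-12
    log_r = np.log(np.asarray(scales, dtype=float))
    log_gamma = np.empty((len(scales), s.size))
    for i, r in enumerate(scales):
        # Average of |s'| over the ball [t - r, t + r], i.e. the integral
        # normalised by the length of the ball. Edges are zero-padded,
        # so the first and last few samples are biased.
        kernel = np.ones(2 * r + 1) / (2 * r + 1)
        log_gamma[i] = np.log(np.convolve(grad, kernel, mode="same"))
    # Least-squares slope, computed for every sample at once.
    x = log_r - log_r.mean()
    y = log_gamma - log_gamma.mean(axis=0)
    return (x[:, None] * y).sum(axis=0) / (x ** 2).sum()
```

On a step signal this gives a clearly negative exponent at the jump and an exponent near zero in the flat regions, which matches the idea that sharp transitions carry the lowest exponents.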
0:06:50	so, if we can have a good estimate of h(t)
0:06:55	we can identify
0:06:57	a very important subset inside the signal which is called the most singular manifold. this corresponds to
0:07:02	the
0:07:02	points inside the signal which have the lowest values of singularity exponents
0:07:06	it has been shown that the
0:07:08	lower the value of the singularity exponent, the higher
0:07:12	the information content at the given point
0:07:13	so the critical transitions of the signal are happening
0:07:19	at these points
0:07:20	and a reconstruction formula has been proposed
0:07:24	and it has been shown in many applications that we can reconstruct the whole signal having access to only
0:07:29	this small subset of the data
0:07:31	so this illustrates the importance of the singularity exponents
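As a rough illustration of selecting the points with the lowest exponents, the sketch below keeps a fixed fraction of the samples. The quantile rule and the `fraction` parameter are assumptions made for the example; the talk does not say how the subset is thresholded, and the reconstruction formula is not covered here.

```python
import numpy as np

def most_singular_points(h, fraction=0.1):
    """Indices of the samples whose exponents fall in the lowest `fraction` (hypothetical rule)."""
    h = np.asarray(h, dtype=float)
    threshold = np.quantile(h, fraction)
    return np.flatnonzero(h <= threshold)
```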
0:07:35	having said that, we can turn to see how they can be applied to the speech signal
0:07:39	previously we have presented the estimation procedure of h(t) for a speech signal and we have shown
0:07:45	that we can have
0:07:46	a good estimate of h(t) for the majority of points in the speech signal. here we
0:07:51	have a speech signal extracted from the
0:07:54	TIMIT database, with vertical red lines which show the
0:07:57	phoneme boundaries taken from the manual transcriptions provided in the TIMIT database, and
0:08:02	of course the objective of text-independent phonetic segmentation is to identify these phoneme boundaries
0:08:08	within a
0:08:09	tolerance window
0:08:12	so
0:08:14	since there are
0:08:15	different phonemes
0:08:16	and we know that they have different statistical properties, we
0:08:20	expect the singularity exponents to have different behaviours
0:08:24	to show this we studied
0:08:27	the
0:08:27	time evolution of the distribution of singularity exponents
0:08:33	so we take windows of length thirty milliseconds, we compute the
0:08:36	histogram of h(t)
0:08:38	and we plot its
0:08:40	evolution over time
0:08:42	and one can easily note in this graphical representation, which is the
0:08:48	histogram of singularity exponents conditioned on time,
0:08:52	a remarkable change in the distribution of singularity exponents between different phonemes
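The time-conditioned histogram described above can be sketched as follows. The 30 ms window comes from the talk and 16 kHz is the TIMIT sampling rate; the non-overlapping windows, the number of bins and the function name are assumptions made for the example.

```python
import numpy as np

def conditional_histograms(h, sample_rate=16000, window_ms=30.0, n_bins=32):
    """Normalised histogram of h(t) in consecutive windows (rows = time, columns = bins)."""
    h = np.asarray(h, dtype=float)
    win = int(round(sample_rate * window_ms / 1000.0))
    # Shared bin edges so that the rows are directly comparable.
    edges = np.linspace(h.min(), h.max(), n_bins + 1)
    n_windows = h.size // win
    hist = np.empty((n_windows, n_bins))
    for k in range(n_windows):
        counts, _ = np.histogram(h[k * win:(k + 1) * win], bins=edges)
        hist[k] = counts / counts.sum()
    return hist, edges
```

Plotting `hist` as an image, with time on one axis and exponent value on the other, gives the kind of picture the speaker refers to.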
0:08:59	this has been extensively
0:09:02	evaluated over different speech
0:09:04	signals
0:09:05	but the problem is that we cannot use this
0:09:08	graphical representation for developing
0:09:11	an automatic segmentation algorithm
0:09:14	we need to provide a
0:09:16	measure that is easier to use in an automatic algorithm
0:09:19	we know that the easiest interpretation of this change in distribution is a change in the average
0:09:25	so we define a new measure, we call it ACC, which is simply the primitive of the singularity exponents
0:09:30	and
0:09:30	this could be considered as the instantaneous average of the singularity exponents
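In discrete time the primitive is just a running sum, so the ACC can be sketched in one line. The function name is an assumption; the definition as the primitive of the exponents is from the talk.

```python
import numpy as np

def acc(h):
    """ACC: running sum (discrete primitive) of the singularity exponents."""
    return np.cumsum(np.asarray(h, dtype=float))
```

If the exponents average +1 over one phoneme and -2 over the next, the ACC rises with slope +1 and then falls with slope -2, so a change in the local average shows up as a change in slope.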
0:09:37	here we can see the resulting functional
0:09:39	and it is clear that it shows
0:09:43	the difference in distributions more clearly
0:09:46	so inside each phoneme the
0:09:48	ACC is
0:09:50	more or less linear, and we note a change in
0:09:53	slope at the phoneme boundaries
0:09:56	however
0:09:56	to develop an automatic
0:09:58	segmentation algorithm, a very simple method is to fit a piecewise linear curve to this
0:10:04	ACC by minimising the mean square error
0:10:07	here we have
0:10:09	the ACC along with the fitted curve
0:10:12	and we have identified the breaking points as candidate boundaries
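One simple way to obtain such breaking points is a top-down split: fit a line, and if the mean square error is above a threshold, split where two lines fit best and recurse. This is only a sketch under that assumption; the talk does not say which fitting procedure the authors use, and `threshold` and `min_len` are hypothetical parameters.

```python
import numpy as np

def line_mse(y, start, stop):
    """Mean square error of the least-squares line fitted to y[start:stop]."""
    x = np.arange(start, stop, dtype=float)
    seg = y[start:stop]
    slope, intercept = np.polyfit(x, seg, 1)
    return float(np.mean((seg - (slope * x + intercept)) ** 2))

def best_split(y, start, stop, min_len):
    """Split index that minimises the pooled error of two separate line fits."""
    best_k, best_err = None, np.inf
    for k in range(start + min_len, stop - min_len + 1):
        n1, n2 = k - start, stop - k
        err = (n1 * line_mse(y, start, k) + n2 * line_mse(y, k, stop)) / (n1 + n2)
        if err < best_err:
            best_k, best_err = k, err
    return best_k

def piecewise_linear_breakpoints(y, threshold, min_len=5):
    """Breaking points of a piecewise linear fit (exhaustive search, for short signals only)."""
    y = np.asarray(y, dtype=float)
    breakpoints = []
    stack = [(0, y.size)]
    while stack:
        start, stop = stack.pop()
        if stop - start < 2 * min_len:
            continue  # too short to split further
        if line_mse(y, start, stop) <= threshold:
            continue  # one line already fits well enough
        k = best_split(y, start, stop, min_len)
        breakpoints.append(k)
        stack.append((start, k))
        stack.append((k, stop))
    return sorted(breakpoints)
```

The exhaustive search costs one pair of line fits per candidate split, which is fine for illustration but would need a faster formulation for full utterances at the sampling rate.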
0:10:17	we see that we have identified
0:10:19	most of the
0:10:21	boundaries with very good resolution, because
0:10:23	a there are the
0:10:25	because we do not have any windowing
0:10:27	problem in this method: we have
0:10:29	access to the highest possible resolution, which is the sampling frequency of the speech signal
0:10:33	so
0:10:34	the preliminary simulations show that this
0:10:37	very simple method
0:10:38	has comparable results with the state of the art. this was presented in our previous work
0:10:44	and
0:10:45	another advantage is that it is not
0:10:50	sensitive to the threshold
0:10:51	selection, as we will see in the experimental results
0:10:55	but by performing an error analysis of this method we observed that
0:11:00	there are
0:11:01	uh
0:11:03	cases where
0:11:04	there is a distinctive difference in the distribution of singularity exponents but the ACC is not able to reveal
0:11:10	it to
0:11:11	identify the
0:11:13	boundaries
0:11:15	and there are points where there is no distinctive
0:11:17	change in the distributions but the ACC and linear curve fitting make some mistakes
0:11:23	hence we try to use a
0:11:24	classical approach in the
0:11:26	detection of change
0:11:28	change detection, which has been widely used in segmentation
0:11:33	which is a two-step procedure: first
0:11:35	to select a set of candidate boundaries
0:11:38	and then to make the decision, to
0:11:43	see whether each candidate corresponds to a change in the
0:11:47	features or not
0:11:50	so for the candidate selection step we have two observations: first, we saw that some of the missed
0:11:55	boundaries correspond to the
0:11:56	transitions from fricatives and stops to vowels
0:12:00	and uh
0:12:02	second, we saw that the easiest
0:12:04	transitions to detect are the transitions between
0:12:07	low-energy segments, silences or pauses, and phonemes, because
0:12:11	in silence we have
0:12:13	high positive values of singularity exponents, and in active parts we have
0:12:17	highly negative values
0:12:18	so it is easy to
0:12:20	detect the change in the
0:12:22	slope of the ACC
0:12:24	hence we decided to
0:12:26	apply a low-pass filter to the original signal and do exactly the same:
0:12:33	to compute the singularity exponents and the ACC for the low-pass signal. as an example, in the
0:12:37	left
0:12:39	figure you can see the ACC of the original signal, and in the right one you can
0:12:43	see the ACC of the low-pass
0:12:46	filtered
0:12:47	signal. we know that fricatives and stops are
0:12:51	essentially high-band signals, and low-pass filtering
0:12:53	turns them into
0:12:56	low-energy signals
0:12:58	you can see that in the left
0:13:00	figure we have some change in the
0:13:02	shape of the ACC, but it is not easy to detect with the
0:13:05	linear curve fitting, but on the right-hand side it is
0:13:10	much easier to detect. here is another example; again i emphasise that we have a change in
0:13:15	the original ACC
0:13:16	but it is
0:13:17	not easy to detect
0:13:18	but in the low-pass version on the right-hand side
0:13:21	it is really easy to detect
0:13:24	so as the first step we apply the ACC-based algorithm
0:13:28	to the
0:13:29	signal and to its low-pass filtered version
0:13:31	and we
0:13:32	gather all the breaking points as candidates
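The first step can be sketched as a low-pass filter plus a merge of the two candidate lists. The filter design (a Hamming-windowed sinc), the cutoff frequency passed by the caller and the `min_gap` rule for dropping near-duplicates are all assumptions for illustration; the talk gives none of these details.

```python
import numpy as np

def lowpass(signal, cutoff_hz, sample_rate, n_taps=101):
    """Windowed-sinc FIR low-pass filter with zero delay (hypothetical design)."""
    t = np.arange(n_taps) - (n_taps - 1) / 2.0
    fc = cutoff_hz / sample_rate              # cutoff in cycles per sample
    taps = 2 * fc * np.sinc(2 * fc * t) * np.hamming(n_taps)
    taps /= taps.sum()                        # unit gain at 0 Hz
    return np.convolve(signal, taps, mode="same")

def merge_candidates(original, filtered, min_gap=1):
    """Union of two candidate lists; a candidate closer than min_gap to the last kept one is dropped."""
    merged = []
    for c in sorted(set(original) | set(filtered)):
        if not merged or c - merged[-1] >= min_gap:
            merged.append(c)
    return merged
```

The same exponent estimation, ACC and curve fitting would then be run on both `signal` and `lowpass(signal, ...)`, and the two breakpoint lists passed to `merge_candidates`.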
0:13:36	and in the second
0:13:37	step we perform
0:13:41	dynamic windowing
0:13:42	followed by a log-likelihood ratio hypothesis test to see,
0:13:46	for each of the candidates, whether it actually corresponds to a change in the distribution of singularity exponents or not
0:13:51	i emphasise that we do this on the singularity exponents of the signal itself, because we are interested
0:13:57	in showing the strength of singularity exponents. the low-pass filtered version
0:14:02	does not have any real meaning; it is just some diversity we add to our algorithm
0:14:07	so this is the dynamic windowing procedure: for each point
0:14:11	we consider three windows like this
0:14:14	and we have two hypotheses
0:14:17	a question
0:14:18	and
0:14:19	the hypotheses are that the singularity exponents are generated by a single
0:14:23	gaussian or
0:14:24	that they are generated by two gaussians
0:14:27	X or we click
0:14:28	so if the likelihood of H1 is
0:14:31	greater than that of H0, we take the candidate as a boundary, otherwise we remove
0:14:36	it from the candidate list. then
0:14:39	we go to the next
0:14:41	three windows
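A minimal sketch of the second step, assuming that the window for each candidate runs from the last accepted boundary to the next candidate, and that H0 is a single Gaussian over the whole window while H1 is one Gaussian on each side of the candidate. With maximum-likelihood parameters the two-Gaussian model can never score lower than the single one, so the sketch requires a minimum gain; `min_gain` and `min_len` are hypothetical parameters, and the exact windowing rule used by the authors is not recoverable from the talk.

```python
import numpy as np

def gaussian_max_loglik(x):
    """Log-likelihood of x under a Gaussian with maximum-likelihood mean and variance."""
    x = np.asarray(x, dtype=float)
    var = max(x.var(), 1e-12)   # floor avoids log(0) on constant data
    return -0.5 * x.size * (np.log(2.0 * np.pi * var) + 1.0)

def log_likelihood_ratio(h, left, candidate, right):
    """log L(H1: two Gaussians split at candidate) - log L(H0: one Gaussian)."""
    h0 = gaussian_max_loglik(h[left:right])
    h1 = gaussian_max_loglik(h[left:candidate]) + gaussian_max_loglik(h[candidate:right])
    return h1 - h0

def prune_candidates(h, candidates, min_gain=10.0, min_len=10):
    """Keep the candidates at which the exponents change distribution (assumed windowing rule)."""
    h = np.asarray(h, dtype=float)
    cands = sorted(candidates)
    kept = []
    for i, c in enumerate(cands):
        left = kept[-1] if kept else 0
        right = cands[i + 1] if i + 1 < len(cands) else h.size
        if c - left < min_len or right - c < min_len:
            continue  # too few samples on one side to estimate a Gaussian
        if log_likelihood_ratio(h, left, c, right) > min_gain:
            kept.append(c)
    return kept
```

Because rejected candidates are removed before the next test, the window to the left of a candidate grows until a boundary is accepted, which is one way to read the "dynamic" part of the procedure.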
0:14:42	so
0:14:43	our experiments were done on the TIMIT database, on the full training part of TIMIT, which consists
0:14:50	of four thousand six hundred
0:14:51	sentences, and we have developed the
0:14:54	algorithm over randomly chosen files from this database
0:14:58	we have
0:14:59	tried to report all the possible performance measures, because it is difficult in the literature to compare
0:15:06	we have reported all of them to simplify later comparisons
0:15:10	there are two categories of
0:15:11	scores. the partial scores are hit rate, which shows the
0:15:17	rate
0:15:18	of correctly detected boundaries; over-segmentation, which shows
0:15:23	how many more boundaries we have detected than exist; and false alarm, which shows
0:15:26	how much
0:15:27	i
0:15:27	how many false boundaries we have detected
0:15:30	the problem with these partial scores is that
0:15:33	they can go in opposite directions: for example, an improvement in hit rate
0:15:37	could correspond to an increase in false alarm rate, so we cannot make a
0:15:41	comparison based only on the partial scores. but there are global scores
0:15:45	that combine these partial scores into a single one
0:15:48	for example, F1
0:15:50	takes hit rate and false alarm into account; R-value takes hit rate and
0:15:54	over-segmentation into account, with
0:15:56	much emphasis on the over-segmentation rate. so
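The scores named above can be sketched as follows. The talk does not spell out the formulas, so the definitions here are assumptions: false alarm is taken as the share of detections that match no reference boundary, F1 as the harmonic mean of hit rate and precision, and the R-value in the form commonly used in the speech segmentation literature. The one-to-one greedy matching is also an illustrative choice, and both lists are assumed non-empty.

```python
import numpy as np

def count_hits(detected, reference, tolerance):
    """Number of reference boundaries matched by a distinct detection within the tolerance."""
    detected = sorted(detected)
    used = set()
    hits = 0
    for ref in sorted(reference):
        best = None
        for i, d in enumerate(detected):
            if i in used or abs(d - ref) > tolerance:
                continue
            if best is None or abs(d - ref) < abs(detected[best] - ref):
                best = i
        if best is not None:
            used.add(best)
            hits += 1
    return hits

def segmentation_scores(detected, reference, tolerance):
    """Partial scores in percent, F1 in percent, R-value between 0 and 1 (assumed definitions)."""
    n_det, n_ref = len(detected), len(reference)
    hits = count_hits(detected, reference, tolerance)
    hr = 100.0 * hits / n_ref                 # hit rate
    fa = 100.0 * (n_det - hits) / n_det       # false alarm rate
    os_ = 100.0 * (n_det / n_ref - 1.0)       # over-segmentation
    precision = 100.0 - fa
    f1 = 2.0 * precision * hr / (precision + hr) if precision + hr > 0 else 0.0
    r1 = np.sqrt((100.0 - hr) ** 2 + os_ ** 2)
    r2 = (-os_ + hr - 100.0) / np.sqrt(2.0)
    r_value = 1.0 - (abs(r1) + abs(r2)) / 200.0
    return {"HR": hr, "FA": fa, "OS": os_, "F1": f1, "R": float(r_value)}
```

Boundaries and tolerance must be in the same unit, for example samples, so a 25 ms tolerance at 16 kHz is 400 samples.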
0:16:00	now the experimental results. first we compare
0:16:04	the ACC-based algorithm and the improved algorithm
0:16:08	and on the
0:16:09	for different sets of utterances
0:16:12	we can see that we have about
0:16:13	two or three percent
0:16:15	improvement in false alarm rate, and about
0:16:18	four percent in over-segmentation, while hit rates are more or less the same
0:16:23	and this shows the
0:16:25	improvement over the previous algorithm
0:16:27	that compared
0:16:28	then we compare to the
0:16:31	reference, which is the
0:16:34	state of the art in the literature
0:16:36	we can see that for the tolerance of twenty-five milliseconds we have almost the same
0:16:41	hit rate
0:16:42	and an improvement in the false alarm rate, and we have
0:16:46	ten percent improvement in over-segmentation
0:16:49	rate
0:16:50	more importantly, if we go to
0:16:53	lower tolerances, for five milliseconds, we can see that
0:16:57	for
0:16:57	all of these we have about
0:16:59	more than ten percent improvement in hit rate, false alarm and over-segmentation. this is because of the
0:17:04	high resolution of the ACC function
0:17:08	as i mentioned before
0:17:10	without windowing: we do not have windowing, so we have access to the finest possible resolution
0:17:17	in terms of the global measures we can see
0:17:19	that for lower tolerances we have more than ten percent improvement in both of the
0:17:25	okay
0:17:26	in both of the
0:17:27	um
0:17:28	a
0:17:29	scores, and for twenty-five milliseconds we have about six or four percent
0:17:34	improvement in R-value and F1
0:17:37	i also mentioned that the method is not sensitive to the threshold, which is a
0:17:42	problem of the
0:17:44	classical
0:17:45	so
0:17:46	text-independent methods of phonetic segmentation
0:17:50	here we
0:17:51	have shown the
0:17:53	sensitivity of the results to the curve-fitting threshold
0:17:57	we have changed the threshold by over four hundred percent
0:18:01	of the value of the threshold, and the
0:18:02	score has only changed by zero point five percent. this shows that
0:18:07	the choice of the threshold is not important at all in this algorithm
0:18:12	i choose a
0:18:14	for a independent is an important feature
0:18:18	of
0:18:20	we have
0:18:21	with this study we have emphasised the strength of singularity exponents in the detection of
0:18:26	transition fronts in the speech signal
0:18:31	and, more importantly, the promising
0:18:34	and encouraging results in phonetic segmentation show the
0:18:38	potential of MMF in the analysis of the local dynamics of the speech signal. hence
0:18:43	our future work is to use MMF in
0:18:46	other domains of speech technology
0:18:48	and to use the
0:18:50	reconstruction formula and the concept of the most singular manifold, which is ongoing research
0:18:56	and
0:18:57	i hope to have good results on that
0:18:59	front
0:19:00	thank you very much
0:19:06	right on time
0:19:11	we can take questions one on one, but this is officially the end of the session
0:19:15	oh
0:19:16	okay
0:19:17	yeah
0:19:18	i

IMPROVING TEXT-INDEPENDENT PHONETIC SEGMENTATION BASED ON THE MICROCANONICAL MULTISCALE FORMALISM

Speech Analysis

Presented by: Vahid Khanagha, Author(s): Vahid Khanagha, Khalid Daoudi, Oriol Pont, Hussein Yahia, INRIA Bordeaux Sud-Ouest, France