0:00:13 | and |
---|
0:00:13 | just that the stance was you met and the statistics in as creation ask data |
---|
0:00:18 | and the subject of my talk is to introduce an improved method for text-independent |
---|
0:00:23 | phonetic segmentation based on the microcanonical multiscale formalism |
---|
0:00:28 | in brief |
---|
0:00:29 | I will first focus on why we view speech as a complex signal |
---|
0:00:33 | in the physical sense, |
---|
0:00:34 | that is to say, to view it |
---|
0:00:37 | as a realisation of a complex system |
---|
0:00:40 | and after that |
---|
0:00:41 | I will briefly introduce recent advances in the study of complex systems, which might be used as a powerful tool |
---|
0:00:47 | to catch the nonlinear character of the speech signal |
---|
0:00:49 | this is called the microcanonical multiscale formalism, MMF |
---|
0:00:53 | and I will show the general potential of MMF to be applied in speech |
---|
0:00:58 | analysis |
---|
0:00:59 | and then I will turn to the |
---|
0:01:02 | application of this formalism to phonetic segmentation of the speech signal, and I will introduce |
---|
0:01:07 | a basic and an improved algorithm for segmentation |
---|
0:01:10 | and finally I will take some time to present experimental results and to conclude |
---|
0:01:16 | so it has been |
---|
0:01:17 | theoretically and experimentally established that there exist |
---|
0:01:21 | nonlinear phenomena in the production process of the speech |
---|
0:01:25 | signal; for example the Reynolds number, which is a |
---|
0:01:28 | number characterising different flow regimes, |
---|
0:01:31 | is reported to be in the thousands, |
---|
0:01:33 | which corresponds to turbulent flow |
---|
0:01:35 | well, as we know, most of the |
---|
0:01:38 | approaches in speech processing are |
---|
0:01:40 | based on the linear source-filter model, which cannot adequately take into account |
---|
0:01:45 | the nonlinear character of the speech signal |
---|
0:01:48 | hence our goal here is to find invariant key parameters which are responsible for the complex |
---|
0:01:54 | character of the speech signal |
---|
0:01:55 | previous studies have shown that such parameters do exist but they are very hard to estimate |
---|
0:02:02 | our strategy is to take the |
---|
0:02:04 | methodologies coming from statistical physics and to relate the complexity with the predictability of each point inside |
---|
0:02:10 | the signal |
---|
0:02:11 | and in practice we need to |
---|
0:02:13 | develop computationally efficient tools to |
---|
0:02:16 | yeah |
---|
0:02:19 | estimate these parameters, if they exist, and to use them for practical applications |
---|
0:02:25 | as important one |
---|
0:02:27 | in the study of complex systems, the first phase started in the late forties with the classical work |
---|
0:02:32 | of Kolmogorov |
---|
0:02:34 | and |
---|
0:02:34 | which was the basis for the later approaches in this domain |
---|
0:02:38 | which are based on the study of structure functions |
---|
0:02:41 | the main result of these methods is to |
---|
0:02:44 | recognise globally the existence of a multiscale structure without giving access to |
---|
0:02:50 | state there |
---|
0:02:51 | i mean |
---|
0:02:53 | oh is a use is two |
---|
0:02:55 | side |
---|
0:02:56 | because they are based on statistical averages and on the stationarity assumption |
---|
0:03:01 | they can be used to decide whether a system is complex or not, but they do not give much more information |
---|
0:03:06 | in the second phase methods we try to |
---|
0:03:08 | uh |
---|
0:03:09 | determine geometrically, inside the signal, where the complexity happens and how it organises itself; in |
---|
0:03:15 | more |
---|
0:03:16 | precise terms, we try to find a subset inside the signal which has the highest |
---|
0:03:20 | information content, and we try to explain how |
---|
0:03:23 | the transfer of |
---|
0:03:24 | information between different scales |
---|
0:03:28 | organises itself |
---|
0:03:30 | these methods have been made possible by the approach of statistical physics in the study of |
---|
0:03:35 | i lily system and the two size |
---|
0:03:38 | and the study of the notion of |
---|
0:03:40 | transition inside complex systems |
---|
0:03:44 | it has been shown that geometric multiscale organization is responsible for the complexity inside |
---|
0:03:50 | a signal |
---|
0:03:51 | a typical example of this is the cascade of energy in fully developed turbulence |
---|
0:03:57 | its fingerprint is the existence of a power-law behavior in the temporal correlation function |
---|
0:04:03 | which has to be |
---|
0:04:04 | evaluated without any stationarity assumption, at each point inside the signal, and |
---|
0:04:09 | a singularity exponent is related to this power law, as we will see shortly |
---|
0:04:14 | for the set of singularity exponents, it can be shown that it completely explains the |
---|
0:04:19 | organization of multiscale structures |
---|
0:04:23 | and |
---|
0:04:26 | an example in this |
---|
0:04:28 | is the canonical formalism, which is used in the study of multiscale signals |
---|
0:04:33 | the canonical formalism was the first attempt to estimate |
---|
0:04:37 | singularity exponents as a global property of the signal, which leads |
---|
0:04:41 | to what is called the Legendre spectrum; in this equation we have |
---|
0:04:45 | a complex signal s |
---|
0:04:47 | and a multiresolution function Gamma working at the scale r |
---|
0:04:53 | and the brackets stand for expectation over |
---|
0:04:56 | a statistical ensemble |
---|
0:04:59 | the exponent of this power law, tau of p, can be related to the |
---|
0:05:03 | distribution of singularity exponents |
---|
0:05:05 | through a Legendre transform, but the main problem is that it is a global description; it does not give access to |
---|
0:05:11 | the geometry |
---|
0:05:12 | and the local dynamics of the signal |
---|
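As a reading aid, the global scaling relation described here can be written as follows. The slide is not part of the transcript, so the notation is an assumption taken from the spoken description: $s$ is the signal, $\Gamma_r$ the multiresolution function at scale $r$, and $\tau_p$ the exponent of the power law. The Legendre-transform line follows the usual multifractal convention, with $d$ the dimension of the signal domain ($d = 1$ for speech), which is also an assumption.

```latex
\langle\, |\Gamma_r s|^{p} \,\rangle \;\propto\; r^{\tau_p} \quad (r \to 0),
\qquad
D(h) \;=\; \inf_{p} \,\{\, p\,h + d - \tau_p \,\}
```

Here $D(h)$ plays the role of the Legendre spectrum mentioned by the speaker: it describes how often each exponent value occurs, but not where in the signal it occurs.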
0:05:15 | so in the |
---|
0:05:17 | microcanonical formalism we try, |
---|
0:05:19 | instead of relying on statistical averages, to look |
---|
0:05:23 | inside the signal |
---|
0:05:26 | we try to introduce |
---|
0:05:27 | singularity exponents which |
---|
0:05:29 | are related to the geometric location inside the signal via |
---|
0:05:33 | the time index t here, and |
---|
0:05:35 | yeah |
---|
0:05:36 | the multiresolution function Gamma at scale r |
---|
0:05:38 | scales here as a power law |
---|
0:05:41 | the exponent of this power law is called the singularity exponent |
---|
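The local scaling relation described in words above can be sketched as the following equation. The symbols are assumptions based on the spoken description ($\Gamma_r$ for the multiresolution function, $h(t)$ for the singularity exponent at time $t$); the prefactor $\alpha(t)$ and the remainder term follow the usual way such power laws are written and do not come from the transcript.

```latex
\Gamma_r\big(s(t)\big) \;=\; \alpha(t)\, r^{\,h(t)} \;+\; o\!\big(r^{\,h(t)}\big), \qquad r \to 0
```

The difference from the global relation is that no ensemble average is taken, so each time index $t$ gets its own exponent.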
0:05:45 | and |
---|
0:05:46 | can be estimated |
---|
0:05:47 | precisely to |
---|
0:05:49 | a we of the transition phones of the signal |
---|
0:05:51 | yeah |
---|
0:05:52 | the main problem is the precise estimation of these parameters |
---|
0:05:56 | and in this regard one of the crucial |
---|
0:06:00 | choices is the choice of the functional Gamma r; for example we can use |
---|
0:06:05 | simply the linear increments |
---|
0:06:07 | but it has been shown that this does not give a precise estimation of h of t because of |
---|
0:06:12 | the |
---|
0:06:13 | instability and sensitivity of these |
---|
0:06:16 | linear increments |
---|
0:06:17 | a better choice for that |
---|
0:06:19 | is the gradient-modulus measure, which is defined as the integral of the gradient modulus over the |
---|
0:06:25 | but i |
---|
0:06:26 | oh use a B R teen this equation and normalized but the robust me on the real i |
---|
0:06:31 | that's is defined from be typical characterisation of |
---|
0:06:35 | can take energy into a real and |
---|
0:06:37 | it has been shown that it |
---|
0:06:40 | can |
---|
0:06:41 | it is related to the information content of each point, if we use this measure for the |
---|
0:06:46 | yeah |
---|
0:06:47 | calculation of h of t |
---|
0:06:50 | so with this measure we can have a good estimate of h of t |
---|
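To make the estimation step concrete, here is a minimal sketch of one way a local exponent could be obtained from a gradient-modulus measure: sum |s'| over balls of growing radius and regress the log of the measure against the log of the scale at every sample. This is not the speaker's estimator. The function name `singularity_exponents`, the choice of scales, and the convention that the measure scales as r^(d + h(t)) with d = 1 are all assumptions made for illustration.

```python
import numpy as np

def singularity_exponents(signal, scales=(1, 2, 4, 8)):
    """Rough per-sample estimate of h(t) from a log-log regression over scales."""
    # Gradient modulus |s'(t)|, approximated by finite differences.
    grad = np.abs(np.gradient(np.asarray(signal, dtype=float)))
    log_r = np.log(np.asarray(scales, dtype=float))
    log_mu = []
    for r in scales:
        # Measure of the ball B_r(t): sum of |s'| over the 2r+1 samples around t.
        mu = np.convolve(grad, np.ones(2 * r + 1), mode="same")
        log_mu.append(np.log(mu + 1e-12))  # small offset avoids log(0)
    log_mu = np.stack(log_mu)  # shape (n_scales, n_samples)
    # Least-squares slope of log(mu) against log(r), computed for every sample.
    x = log_r - log_r.mean()
    slope = (x[:, None] * (log_mu - log_mu.mean(axis=0))).sum(axis=0) / (x ** 2).sum()
    # Assumed convention: mu(B_r(t)) ~ r^(d + h(t)) with d = 1 for a 1-D signal.
    return slope - 1.0
```

With this sketch, a sample sitting on an abrupt jump receives a lower exponent than a sample in a smooth region, which matches the qualitative behaviour described in the talk.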
0:06:55 | we can mark |
---|
0:06:57 | a very important subset inside the signal, which is called the most singular manifold; this corresponds to |
---|
0:07:02 | the |
---|
0:07:02 | points inside the signal which have the lowest values of singularity exponents |
---|
0:07:06 | it has been shown that the |
---|
0:07:08 | lower the value of a singularity exponent is, the higher |
---|
0:07:12 | is the information content at the given point |
---|
0:07:13 | so the critical transitions of the signal are happening |
---|
0:07:19 | at these points |
---|
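As an illustration of the idea of a most singular subset, the sketch below simply keeps the samples with the lowest exponents. The talk does not say how the subset is delimited, so the quantile rule and the `fraction` parameter are hypothetical choices.

```python
import numpy as np

def most_singular_points(h, fraction=0.1):
    """Indices of the samples whose exponents fall in the lowest `fraction`."""
    h = np.asarray(h, dtype=float)
    threshold = np.quantile(h, fraction)  # hypothetical cut-off rule
    return np.flatnonzero(h <= threshold)
```

For example, `most_singular_points(h, fraction=0.05)` would return the positions of the 5 percent most singular samples under this assumed rule.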
0:07:20 | and a reconstruction formula has been proposed |
---|
0:07:24 | and it has been shown in many applications that we can reconstruct the whole signal having access to only |
---|
0:07:29 | this small subset of the data |
---|
0:07:31 | so this just shows the importance of the singularity exponents |
---|
0:07:35 | having said that, we can turn to see how they can be applied to the speech signal |
---|
0:07:39 | previously we have shown the estimation procedure of h of t for a speech signal, and we have shown |
---|
0:07:45 | that we can have |
---|
0:07:46 | a good estimate of h of t for the majority of points in the speech signal; here we |
---|
0:07:51 | have a speech signal extracted from the |
---|
0:07:54 | TIMIT database, with vertical red lines which show the |
---|
0:07:57 | phoneme boundaries taken from the manual transcriptions provided in the TIMIT database, and |
---|
0:08:02 | of course the objective of text-independent phonetic segmentation is to identify these phoneme boundaries |
---|
0:08:08 | within a |
---|
0:08:09 | tolerance window |
---|
0:08:12 | so |
---|
0:08:14 | since there are |
---|
0:08:15 | different phonemes |
---|
0:08:16 | and we know that they have different statistical properties, we |
---|
0:08:20 | expect the singularity exponents to have different behaviours |
---|
0:08:24 | to show this we studied the |
---|
0:08:27 | uh |
---|
0:08:27 | distribution of the singularity exponents, the time evolution of the distribution of singularity exponents |
---|
0:08:33 | so we take windows of length thirty milliseconds, we compute the |
---|
0:08:36 | histogram of h |
---|
0:08:38 | and we plot its |
---|
0:08:40 | evolution over time |
---|
0:08:42 | and one can easily see it in this graphical representation, which is the |
---|
0:08:48 | histogram of singularity exponents conditioned on time |
---|
0:08:52 | one can easily note a remarkable change in the distribution of singularity exponents between different phonemes |
---|
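The time-conditioned histogram described above can be sketched as follows. The thirty-millisecond window length comes from the talk and 16 kHz is the TIMIT sampling rate; the non-overlapping windows, the number of bins, and the function name `exponent_histograms` are assumptions for illustration.

```python
import numpy as np

def exponent_histograms(h, sample_rate=16000, window_ms=30, bins=20):
    """Normalised histogram of the exponents in each consecutive window."""
    h = np.asarray(h, dtype=float)
    win = int(sample_rate * window_ms / 1000)       # samples per window
    edges = np.linspace(h.min(), h.max(), bins + 1)  # shared bin edges
    n_windows = len(h) // win                        # trailing samples are dropped
    hist = np.empty((n_windows, bins))
    for k in range(n_windows):
        counts, _ = np.histogram(h[k * win:(k + 1) * win], bins=edges)
        hist[k] = counts / win                       # each row sums to one
    return hist, edges
```

Plotting `hist.T` as an image, with time on the horizontal axis, gives a picture of the kind the speaker refers to: the rows of mass shift when the phoneme changes.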
0:08:59 | this has been extensively |
---|
0:09:02 | evaluated over different speech |
---|
0:09:04 | signals |
---|
0:09:05 | but the problem is that we cannot use this |
---|
0:09:08 | graphical representation for developing |
---|
0:09:11 | an automatic segmentation algorithm |
---|
0:09:14 | we have to provide a |
---|
0:09:16 | criterion to be used by an automatic algorithm |
---|
0:09:19 | we see that the easiest interpretation of this change in distribution is a change in the average |
---|
0:09:25 | so we define a new measure, which we call ACC, which is simply the primitive of the exponents |
---|
0:09:30 | and |
---|
0:09:30 | this could be considered as the instantaneous average of the singularity exponents |
---|
0:09:37 | we can see the resulting functional |
---|
0:09:39 | and it is clear that it shows |
---|
0:09:43 | the difference in distributions more clearly |
---|
0:09:46 | so inside each phoneme the |
---|
0:09:48 | ACC is |
---|
0:09:50 | more or less linear, and we note a change in |
---|
0:09:53 | slope at each phoneme boundary |
---|
0:09:56 | however |
---|
0:09:56 | to develop an automatic |
---|
0:09:58 | segmentation algorithm, a very simple method is to fit a piecewise linear curve to this |
---|
0:10:04 | ACC by minimizing the mean square error |
---|
0:10:07 | here we have |
---|
0:10:09 | the ACC along with the fitted curve |
---|
0:10:12 | and we have identified the breaking points as candidate points |
---|
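The two steps just described, building the ACC as the running sum of the exponents and fitting a piecewise linear curve to it, can be sketched as below. The talk does not say which fitting algorithm is used; this sketch swaps in a plain top-down splitting scheme that cuts a segment while its best single-line fit has a mean squared error above a threshold. The names `acc`, `breakpoints`, `threshold` and `min_len` are hypothetical.

```python
import numpy as np

def acc(h):
    """Primitive (running sum) of the singularity exponents."""
    return np.cumsum(np.asarray(h, dtype=float))

def _line_fit_error(y, start, stop):
    """Mean squared error of the least-squares line over y[start:stop]."""
    x = np.arange(start, stop)
    coeffs = np.polyfit(x, y[start:stop], 1)
    return float(np.mean((np.polyval(coeffs, x) - y[start:stop]) ** 2))

def breakpoints(y, threshold, min_len=8):
    """Top-down splitting: cut a segment in two while one line fits it poorly."""
    found = []
    stack = [(0, len(y))]
    while stack:
        start, stop = stack.pop()
        if stop - start < 2 * min_len or _line_fit_error(y, start, stop) <= threshold:
            continue  # segment is too short to split or already well fitted
        cuts = range(start + min_len, stop - min_len + 1)
        # Pick the cut that minimises the total squared error of the two lines.
        best = min(cuts, key=lambda c: (c - start) * _line_fit_error(y, start, c)
                   + (stop - c) * _line_fit_error(y, c, stop))
        found.append(best)
        stack.extend([(start, best), (best, stop)])
    return sorted(found)
```

Calling `breakpoints(acc(h), threshold)` returns sample indices, so the candidates keep the full sampling-rate resolution. The exhaustive search over cut positions is quadratic in the segment length, which is acceptable for a sketch but would need a faster scheme on full utterances.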
0:10:17 | see that you have a a twenty five many |
---|
0:10:19 | most of the |
---|
0:10:21 | boundaries with very good resolution, because |
---|
0:10:23 | uh |
---|
0:10:25 | because we don't have any windowing |
---|
0:10:27 | problem in this; we have |
---|
0:10:29 | access to the highest possible resolution, which is the sampling frequency of the speech signal |
---|
0:10:33 | so |
---|
0:10:34 | the preliminary simulations show that this |
---|
0:10:37 | very simple method |
---|
0:10:38 | has comparable results with the state of the art; this was presented in our previous work |
---|
0:10:44 | and |
---|
0:10:45 | another advantage is that it is not |
---|
0:10:50 | sensitive to the threshold |
---|
0:10:51 | selection, as we will see in the experimental results |
---|
0:10:55 | but by performing an error analysis of this method we observed that |
---|
0:11:00 | the i mean see in the |
---|
0:11:01 | uh |
---|
0:11:03 | that's |
---|
0:11:04 | there are distinct differences in the distribution of singularity exponents, but the ACC is not able to reveal |
---|
0:11:10 | them to |
---|
0:11:11 | identify the |
---|
0:11:13 | boundaries |
---|
0:11:15 | and there are points where there is no distinctive |
---|
0:11:17 | change in the distributions, but the ACC and linear curve fitting make some mistakes |
---|
0:11:23 | hence we try to use a |
---|
0:11:24 | classical approach in the |
---|
0:11:26 | detection of change, |
---|
0:11:28 | change detection, which has been widely used in segmentation of regions |
---|
0:11:33 | which is a two-step procedure: first |
---|
0:11:35 | to select a set of candidate boundaries |
---|
0:11:38 | and then to use a hypothesis test to make the decision, to |
---|
0:11:43 | see whether each candidate corresponds to a change in the |
---|
0:11:47 | can you know features or not |
---|
0:11:50 | so for the candidate selection we have two observations: first, we saw that some of the missed |
---|
0:11:55 | boundaries correspond to the |
---|
0:11:56 | transitions between fricatives, stops and vowels |
---|
0:12:00 | and uh |
---|
0:12:02 | second, we saw that the easiest |
---|
0:12:04 | positions to detect are the transitions between |
---|
0:12:07 | inactive segments, silences or pauses, and phonemes, because |
---|
0:12:11 | in silence we have |
---|
0:12:13 | high positive values of singularity exponents and in active parts we have |
---|
0:12:17 | mainly negative values |
---|
0:12:18 | so it is easy to |
---|
0:12:20 | detect a change in the |
---|
0:12:22 | slope of the ACC |
---|
0:12:24 | hence we thought to |
---|
0:12:26 | apply a low-pass filter to the original signal and do exactly the same, |
---|
0:12:33 | to compute the singularity exponents and the ACC for the low-passed signal; as an example, in the |
---|
0:12:37 | left |
---|
0:12:39 | figure you can see the ACC of the original signal and in the right one you can |
---|
0:12:43 | see the ACC of the low-pass filtered |
---|
0:12:46 | uh |
---|
0:12:47 | signal; we know that fricatives, stops and affricates are |
---|
0:12:51 | essentially high-band signals, and the low-pass filter |
---|
0:12:53 | turns them into low-energy, |
---|
0:12:56 | into low-energy signals |
---|
0:12:58 | you can see that in the left |
---|
0:13:00 | figure we have some change in the |
---|
0:13:02 | shape of the ACC but it is not easy to detect with the |
---|
0:13:05 | linear curve fitting, but on the right-hand side it is |
---|
0:13:10 | much easier to detect; here is another example, again I emphasise that we have a change in |
---|
0:13:15 | the original ACC |
---|
0:13:16 | but it is |
---|
0:13:17 | not easy to detect |
---|
0:13:18 | but in the low-pass version on the right-hand side |
---|
0:13:21 | it is really easy to detect |
---|
0:13:24 | so as the first step we apply the MMF ACC algorithm to |
---|
0:13:28 | the |
---|
0:13:29 | signal and its low-pass filtered version |
---|
0:13:31 | and we take |
---|
0:13:32 | all the breaking points as candidates |
---|
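The candidate-selection step, running the same detector on the signal and on a low-pass filtered copy and pooling the breaking points, can be sketched as follows. The talk gives neither the filter design nor the cut-off frequency, so the windowed-sinc FIR filter, the `cutoff_hz` parameter and the helper names below are assumptions.

```python
import numpy as np

def lowpass(signal, cutoff_hz, sample_rate=16000, num_taps=101):
    """Zero-phase FIR low-pass filter built from a Hamming-windowed sinc."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate                  # cut-off in cycles per sample
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()                            # unit gain at zero frequency
    return np.convolve(signal, taps, mode="same")

def merge_candidates(from_original, from_lowpass):
    """Pool the breaking points found on both versions of the signal."""
    return sorted(set(from_original) | set(from_lowpass))
```

In use, the exponents and the ACC would be computed once on `signal` and once on `lowpass(signal, cutoff_hz)`, and the two lists of breaking points passed to `merge_candidates`.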
0:13:36 | and in the second |
---|
0:13:37 | step we perform |
---|
0:13:41 | dynamic windowing |
---|
0:13:42 | followed by a log-likelihood ratio hypothesis test to see, |
---|
0:13:46 | for each of the candidates, whether it actually corresponds to a change in the distribution of singularity exponents or not |
---|
0:13:51 | I emphasise that we do this on the singularity exponents of the signal itself, because we are interested |
---|
0:13:57 | to show the strength of singularity exponents; the low-pass filtered version |
---|
0:14:02 | does not have any real meaning, it is just some diversity for our algorithm |
---|
0:14:07 | so this is the dynamic windowing procedure: for each point |
---|
0:14:11 | we consider two windows around the candidate |
---|
0:14:14 | and we have two hypotheses based on |
---|
0:14:17 | Gaussians |
---|
0:14:18 | and |
---|
0:14:19 | the hypotheses are that the singularity exponents are either generated by a single |
---|
0:14:23 | Gaussian or |
---|
0:14:24 | they are generated by two Gaussians |
---|
0:14:27 | X or we click |
---|
0:14:28 | so if H1 is more |
---|
0:14:31 | likely than H0 we take the candidate as a boundary, otherwise we remove |
---|
0:14:36 | it from the candidate list, then |
---|
0:14:39 | we go to the next |
---|
0:14:41 | candidate |
---|
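The decision step can be sketched as a log-likelihood ratio between "one Gaussian for both windows" and "one Gaussian per window", with the Gaussian parameters set to their maximum-likelihood values. How the windows are chosen and how the ratio is thresholded is not stated in the talk, so the window rule (from the last accepted boundary to the next candidate, capped at `max_window` samples) and the `threshold` parameter are assumptions.

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of x under a Gaussian with maximum-likelihood mean and variance."""
    var = max(float(np.var(x)), 1e-12)  # floor keeps the log finite
    return -0.5 * len(x) * (np.log(2.0 * np.pi * var) + 1.0)

def log_likelihood_ratio(left, right):
    """log L(one Gaussian per window) - log L(a single Gaussian for both windows)."""
    both = np.concatenate([left, right])
    return gaussian_loglik(left) + gaussian_loglik(right) - gaussian_loglik(both)

def prune_candidates(h, candidates, threshold, max_window=480):
    """Keep candidates where the exponents on both sides look differently distributed."""
    h = np.asarray(h, dtype=float)
    ordered = sorted(candidates)
    kept = []
    for i, c in enumerate(ordered):
        # Dynamic windows: from the last accepted boundary up to the next candidate,
        # capped at max_window samples on each side (an assumed design).
        start = max(kept[-1] if kept else 0, c - max_window)
        nxt = ordered[i + 1] if i + 1 < len(ordered) else len(h)
        stop = min(nxt, c + max_window)
        if c - start < 2 or stop - c < 2:
            continue  # too few samples to estimate a variance
        if log_likelihood_ratio(h[start:c], h[c:stop]) > threshold:
            kept.append(c)
    return kept
```

Because the two-Gaussian model contains the single-Gaussian one, the ratio is never negative, so the threshold has to be strictly positive for the test to reject anything.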
0:14:42 | so |
---|
0:14:43 | our experiments were done on the TIMIT database, on the full training part of TIMIT, which consists |
---|
0:14:50 | of four thousand and six hundred |
---|
0:14:51 | sentences, and we have developed the |
---|
0:14:54 | algorithm over randomly chosen files from this database |
---|
0:14:58 | we have |
---|
0:14:59 | tried to report all the possible performance measures, because it is difficult in the literature to compare |
---|
0:15:06 | so we have reported all of them to simplify later comparisons |
---|
0:15:10 | there are two categories of |
---|
0:15:11 | scores: partial scores, where we have the hit rate, which shows the |
---|
0:15:17 | rate of |
---|
0:15:18 | correctly detected boundaries, the over-segmentation, which shows |
---|
0:15:23 | how many more boundaries we have detected, and the false alarm, which shows |
---|
0:15:26 | how many |
---|
0:15:27 | false boundaries we have detected |
---|
0:15:30 | the problem with these partial scores is that |
---|
0:15:33 | they can go in opposite directions; for example an improvement in hit rate |
---|
0:15:37 | could correspond to an increase in false alarm rate, so we cannot |
---|
0:15:41 | judge based only on the partial scores, but there are global scores |
---|
0:15:45 | which take these partial scores into account |
---|
0:15:48 | for example the F1 |
---|
0:15:50 | value takes hit rate and false alarm into account, and the R-value takes hit rate and |
---|
0:15:54 | over-segmentation into account, |
---|
0:15:56 | with more emphasis on the over-segmentation rate, so |
---|
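The scores mentioned above can be computed along the following lines. The talk does not spell out the formulas, so the definitions used here are the ones commonly found in the phonetic segmentation literature and should be read as assumptions: hit rate as hits over reference boundaries, over-segmentation as the relative excess of detected boundaries, false alarm as the share of detections that are not hits, F1 as the harmonic mean of precision and recall, and the R-value as a distance-based combination of hit rate and over-segmentation. The greedy one-to-one matching in `count_hits` is also an illustrative choice, and both input lists are assumed to be non-empty.

```python
import math

def count_hits(reference, detected, tolerance):
    """Count reference boundaries matched by a distinct detection within the tolerance."""
    free = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        close = [d for d in free if abs(d - ref) <= tolerance]
        if close:
            free.remove(min(close, key=lambda d: abs(d - ref)))  # use each detection once
            hits += 1
    return hits

def segmentation_scores(reference, detected, tolerance):
    """Partial scores (HR, OS, FA in percent) and global scores (F1, R-value)."""
    n_ref, n_det = len(reference), len(detected)
    n_hit = count_hits(reference, detected, tolerance)
    hr = 100.0 * n_hit / n_ref                    # hit rate
    os_ = 100.0 * (n_det / n_ref - 1.0)           # over-segmentation
    fa = 100.0 * (n_det - n_hit) / n_det          # false alarm
    precision, recall = n_hit / n_det, n_hit / n_ref
    f1 = 2 * precision * recall / (precision + recall) if n_hit else 0.0
    r1 = math.sqrt((100.0 - hr) ** 2 + os_ ** 2)  # distance to the ideal point
    r2 = (-os_ + hr - 100.0) / math.sqrt(2.0)     # distance to the HR = OS + 100 line
    r_value = 1.0 - (abs(r1) + abs(r2)) / 200.0
    return {"HR": hr, "OS": os_, "FA": fa, "F1": f1, "R": r_value}
```

For instance, with reference boundaries at 100, 200, 300 and 400 ms, detections at 98, 210, 305, 350 and 402 ms and a 20 ms tolerance, every reference boundary is hit but one detection is spurious, so the hit rate is perfect while the over-segmentation and both global scores are penalised.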
0:16:00 | in the experimental results, first we compare |
---|
0:16:04 | the ACC-based algorithm with the improved one |
---|
0:16:08 | and on the |
---|
0:16:09 | for different tolerances |
---|
0:16:12 | we can see that we have like |
---|
0:16:13 | two or three percent |
---|
0:16:15 | improvement in false alarm rate and like |
---|
0:16:18 | four percent in over-segmentation, and hit rates are more or less the same |
---|
0:16:23 | and this shows the |
---|
0:16:25 | improvement over the previous algorithm |
---|
0:16:27 | uh |
---|
0:16:28 | then we compare to the |
---|
0:16:31 | a friends number so and which is the |
---|
0:16:34 | state of the art in the literature |
---|
0:16:36 | we can see that for the tolerance of twenty-five milliseconds the hit rates are almost the same |
---|
0:16:41 | on the contrary |
---|
0:16:42 | we have an improvement in the false alarm rate, and we have |
---|
0:16:46 | ten percent improvement in over-segmentation |
---|
0:16:49 | uh right |
---|
0:16:50 | more importantly, if we go to |
---|
0:16:53 | lower tolerances, for five milliseconds, we can see that |
---|
0:16:57 | for |
---|
0:16:57 | all of these we have like |
---|
0:16:59 | more than ten percent improvement in hit rate, false alarm and over-segmentation; this is because of the |
---|
0:17:04 | high resolution of the ACC function |
---|
0:17:08 | uh |
---|
0:17:10 | we don't have windowing, we have access to the finest possible resolution |
---|
0:17:17 | in terms of global measures we can see |
---|
0:17:19 | that for lower tolerances we have more than ten percent improvement in both of the |
---|
0:17:25 | okay |
---|
0:17:26 | in both of the |
---|
0:17:27 | um |
---|
0:17:28 | uh |
---|
0:17:29 | scores, and for twenty-five milliseconds we have like six or four percent |
---|
0:17:34 | improvement in R-value and F-value |
---|
0:17:37 | I mentioned that the method is not sensitive to the threshold, which is a |
---|
0:17:42 | problem of the |
---|
0:17:44 | classical |
---|
0:17:45 | uh |
---|
0:17:46 | text-independent methods of phonetic segmentation |
---|
0:17:50 | here we |
---|
0:17:51 | have shown the |
---|
0:17:53 | sensitivity of the algorithm to the curve-fitting threshold |
---|
0:17:57 | we have changed the threshold by over four hundred percent, |
---|
0:18:01 | the value of the threshold, and the |
---|
0:18:02 | value of the score has only changed by about zero point five percent; this shows that |
---|
0:18:07 | the choice of the threshold is not important at all in this algorithm |
---|
0:18:12 | so threshold |
---|
0:18:14 | independence is an important feature |
---|
0:18:18 | of |
---|
0:18:20 | we have |
---|
0:18:21 | in this talk we have emphasised the strength of singularity exponents in the detection of |
---|
0:18:26 | transition fronts in the speech signal |
---|
0:18:31 | more importantly, the promising |
---|
0:18:34 | and encouraging results in phonetic segmentation show the |
---|
0:18:38 | potential of MMF in the analysis of the local dynamics of the speech signal; hence the |
---|
0:18:43 | future work is to use MMF in |
---|
0:18:46 | other domains of speech technology |
---|
0:18:48 | and to use the |
---|
0:18:50 | constructions from or or or the concept of what to model they've that which is an ongoing research and |
---|
0:18:56 | result |
---|
0:18:57 | I hope to have good results in the |
---|
0:18:59 | future |
---|
0:19:00 | thank you very much for your attention |
---|
0:19:06 | right on time |
---|
0:19:11 | i can take questions one and one but this is officially the end of the fact |
---|
0:19:15 | oh |
---|
0:19:16 | okay |
---|
0:19:17 | yeah |
---|
0:19:18 | i |
---|