0:00:14 | a a but not well as you uh |
---|
0:00:17 | uh |
---|
0:00:18 | this talk is uh |
---|
0:00:20 | uh |
---|
0:00:21 | is a |
---|
0:00:22 | go clap of it but by an and the question that's and that's me and all lot |
---|
0:00:26 | and well |
---|
0:00:27 | uh the title is resolving non-uniqueness in the |
---|
0:00:30 | acoustic-to-articulatory mapping which i would for the for it was it a mapping |
---|
0:00:35 | uh |
---|
0:00:36 | i think i'll skip this slide because the last two presentations were pretty much about |
---|
0:00:40 | the same thing |
---|
0:00:41 | and it's basically just to give an idea as to what acoustic-to-articulatory mapping is, or inversion |
---|
0:00:47 | is |
---|
0:00:49 | uh |
---|
0:00:49 | uh i'll do it |
---|
0:00:51 | jump to the main focus of this talk, which is actually the non-uniqueness in this mapping, |
---|
0:00:57 | which has been referred to |
---|
0:00:59 | by the few authors |
---|
0:01:01 | who spoke before me |
---|
0:01:03 | uh so in the literature we have uh |
---|
0:01:06 | things like the lossless tube models of the vocal tract, or a parametric articulatory |
---|
0:01:11 | model of speech synthesis, and |
---|
0:01:14 | you can see that the inverse mapping from acoustics is actually to a class of area functions, |
---|
0:01:19 | not exactly one |
---|
0:01:21 | and you have similar results from other such experiments |
---|
0:01:24 | and there are some experiments called bite-block experiments, where the |
---|
0:01:29 | speaker is |
---|
0:01:31 | constrained, |
---|
0:01:32 | but still the speakers can produce perceptually similar sounds |
---|
0:01:38 | in spite of the unnatural articulation, so this gives an indication of non-uniqueness |
---|
0:01:42 | of course these are artificial situations, and |
---|
0:01:45 | this may not really occur in natural speech |
---|
0:01:48 | so what about in continuous speech? |
---|
0:01:50 | you can have different forms of data to collect for this, |
---|
0:01:55 | which i have listed here |
---|
0:01:57 | in our case we use the MOCHA-TIMIT database, just like |
---|
0:02:00 | the previous two talks, so i won't go into that |
---|
0:02:03 | too much |
---|
0:02:05 | uh |
---|
0:02:06 | so this is an example from the data set, and here we have a phoneme |
---|
0:02:10 | uh a |
---|
0:02:11 | and the red and the blue lines here give |
---|
0:02:15 | the spectrum, the magnitude spectrum, |
---|
0:02:19 | uh |
---|
0:02:20 | from two instances |
---|
0:02:21 | and uh |
---|
0:02:23 | uh the figure two uh |
---|
0:02:26 | uh |
---|
0:02:27 | at the bottom right are actually the positions of the articulator coils |
---|
0:02:32 | and you can see that |
---|
0:02:33 | even though the acoustics are |
---|
0:02:36 | quite similar |
---|
0:02:37 | the articulatory positions |
---|
0:02:39 | are quite different |
---|
0:02:41 | but is this non-uniqueness? |
---|
0:02:44 | i mean, you still can't really say that, just because there is a difference in the acoustics |
---|
0:02:48 | so can this difference in acoustics be explained |
---|
0:02:54 | by this |
---|
0:02:55 | variation in the position of the articulators? |
---|
0:02:58 | so that |
---|
0:03:00 | sort of comes to the problem when you have this kind of data, a limited database, |
---|
0:03:06 | that you cannot get exactly the same |
---|
0:03:08 | acoustics and exactly the same articulation, |
---|
0:03:12 | or exactly the same acoustics with different articulation, so that's |
---|
0:03:16 | the difficulty with the data |
---|
0:03:18 | so |
---|
0:03:20 | the key questions in this talk are: |
---|
0:03:22 | how does one estimate non-uniqueness in limited data? |
---|
0:03:25 | and that we do with statistical modeling, based on one of our previous papers |
---|
0:03:29 | how do these non-unique |
---|
0:03:31 | instances of coding friends agreed to |
---|
0:03:32 | acoustic-articulatory frame |
---|
0:03:35 | does applying continuity constraints help |
---|
0:03:38 | resolve non-uniqueness? |
---|
0:03:39 | these are the main questions |
---|
0:03:41 | so we have a toy example here, and you can see that |
---|
0:03:44 | the figure on the top here is |
---|
0:03:47 | uh |
---|
0:03:48 | the acoustic parameters |
---|
0:03:50 | belonging to, say, one phoneme |
---|
0:03:52 | and these are the articulatory parameters belonging to one phoneme, and you can see that |
---|
0:03:56 | the acoustics is unimodal but the |
---|
0:03:58 | articulatory parameters are bimodal, so is this non-unique? |
---|
0:04:04 | so if you look at the data points here, |
---|
0:04:06 | i don't know whether the |
---|
0:04:07 | points are very clear, but you can see that it's not completely true, i mean |
---|
0:04:12 | you can see that there are some clusters here |
---|
0:04:14 | if you look at |
---|
0:04:16 | the joint articulatory and acoustic space |
---|
0:04:20 | and therefore what we do is we |
---|
0:04:23 | model this sort of data in the joint space, |
---|
0:04:25 | the articulatory-acoustic space |
---|
0:04:27 | and then we can look at one value of the acoustic parameter, as shown by |
---|
0:04:32 | the blue line there, that's a test sample, |
---|
0:04:34 | and we can find the conditional probability distribution, and in this case it is bimodal, which says |
---|
0:04:40 | that |
---|
0:04:40 | at this value of the acoustic parameter |
---|
0:04:43 | the mapping is non-unique |
---|
0:04:46 | but if you look at another acoustic parameter here, which belongs to |
---|
0:04:50 | the same |
---|
0:04:51 | acoustic cluster, you can see that it's unimodal and it's not |
---|
0:04:54 | non-unique |
---|
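
The toy example above can be sketched in code. This is a minimal illustration, not the speaker's implementation: it assumes a hand-specified Gaussian mixture over one acoustic parameter `x` and one articulatory parameter `y`, and the helper names (`gauss`, `condition_on_acoustics`, `count_modes`) and all numeric values are hypothetical. It conditions the joint mixture on an acoustic test value and counts the modes of the resulting distribution over the articulatory parameter.

```python
import math

def gauss(v, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def condition_on_acoustics(components, x):
    """Return p(y | x) as a list of (weight, mean, variance) tuples.

    Each joint component is (weight, mean_x, mean_y, var_x, var_y, cov_xy),
    with x the acoustic parameter and y the articulatory parameter.
    """
    cond = []
    for w, mx, my, vx, vy, cxy in components:
        resp = w * gauss(x, mx, vx)        # how well this component explains x
        mean = my + cxy / vx * (x - mx)    # conditional mean of y given x
        var = vy - cxy ** 2 / vx           # conditional variance of y given x
        cond.append((resp, mean, var))
    total = sum(r for r, _, _ in cond)
    return [(r / total, m, v) for r, m, v in cond]

def count_modes(cond, lo=-5.0, hi=5.0, steps=1000, rel_floor=0.05):
    """Count local maxima of the conditional density on a regular grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [sum(w * gauss(y, m, v) for w, m, v in cond) for y in grid]
    floor = rel_floor * max(dens)          # ignore negligible bumps
    return sum(
        1 for i in range(1, steps)
        if dens[i] > dens[i - 1] and dens[i] > dens[i + 1] and dens[i] > floor
    )

# Hypothetical toy model: two clusters share the same acoustics (x near 0)
# but differ in articulation (y near -2 and +2); a third cluster has its
# own acoustics (x near 5) and a single articulation (y near 0).
toy = [
    (1 / 3, 0.0, -2.0, 0.25, 0.25, 0.0),
    (1 / 3, 0.0, 2.0, 0.25, 0.25, 0.0),
    (1 / 3, 5.0, 0.0, 0.25, 0.25, 0.0),
]
modes_shared = count_modes(condition_on_acoustics(toy, 0.0))
modes_isolated = count_modes(condition_on_acoustics(toy, 5.0))
print(modes_shared, modes_isolated)
```

With these toy values the conditional distribution at `x = 0` has two modes (a non-unique mapping in the sense used in the talk) and the one at `x = 5` has a single mode.
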
0:04:56 | of course there is the question of the variance, |
---|
0:04:59 | which is also non-uniqueness of some sort, because for one value of the |
---|
0:05:03 | acoustic parameter you can have different |
---|
0:05:05 | values of the articulatory parameter |
---|
0:05:07 | but we don't look at this sort of non-uniqueness in this paper |
---|
0:05:11 | and uh |
---|
0:05:12 | we just look at this bimodal kind of non-uniqueness |
---|
0:05:17 | this is the parameterization of the data, and again it's very similar to what has been used |
---|
0:05:22 | in state-of-the-art |
---|
0:05:24 | acoustic-to-articulatory mapping systems, and also similar to the one which was used in the previous paper |
---|
0:05:29 | uh this is |
---|
0:05:31 | an example of non-uniqueness, so what these |
---|
0:05:34 | these uh |
---|
0:05:35 | but blocks are actually |
---|
0:05:37 | the conditional distributions |
---|
0:05:39 | uh |
---|
0:05:40 | given one vector of acoustics |
---|
0:05:42 | these pop but with lots of the the conditional distributions of the uh of the |
---|
0:05:47 | articulate records |
---|
0:05:49 | so in this case the blue dots and triangles, they are |
---|
0:05:54 | actually the peaks of these different modes |
---|
0:05:56 | and the green line |
---|
0:05:58 | i uh |
---|
0:06:00 | i hope that's clear in the presentation, the green line is actually the recorded positions |
---|
0:06:04 | in this case you can see that they are |
---|
0:06:06 | closer to one of the peaks than the other peak |
---|
0:06:08 | and the other peak is actually |
---|
0:06:10 | uh |
---|
0:06:11 | the non-unique |
---|
0:06:13 | estimate for this |
---|
0:06:17 | particular acoustics |
---|
0:06:19 | now we look at this in a trajectory |
---|
0:06:21 | so in this case you can see that they are all unimodal, |
---|
0:06:25 | all the conditional distributions are unimodal |
---|
0:06:28 | but you look at the next frame, and in this case you can start seeing that in the tongue |
---|
0:06:32 | tip |
---|
0:06:33 | which is here |
---|
0:06:34 | you can start seeing that there is a |
---|
0:06:37 | there's another but which of uh which uh |
---|
0:06:39 | you can see the same thing |
---|
0:06:41 | in the lower lip, which is here, |
---|
0:06:43 | and the tongue dorsum |
---|
0:06:45 | oh |
---|
0:06:47 | and it's and so |
---|
0:06:48 | but you can see at the same time that the recorded positions are actually |
---|
0:06:53 | always closer to |
---|
0:06:55 | one of the |
---|
0:06:58 | modes than the other |
---|
0:07:01 | and uh |
---|
0:07:03 | but the the to another example of uh not uh |
---|
0:07:06 | following this and this is the |
---|
0:07:08 | uh another example |
---|
0:07:09 | in this case you can see that this is largely unimodal |
---|
0:07:13 | uh |
---|
0:07:14 | um this is |
---|
0:07:15 | one frame but to |
---|
0:07:16 | uh are in the shop |
---|
0:07:19 | uh |
---|
0:07:20 | and |
---|
0:07:21 | you can start seeing that there is a |
---|
0:07:24 | i the |
---|
0:07:25 | the second mode starts appearing somewhere here |
---|
0:07:30 | and you can see that it's |
---|
0:07:31 | there |
---|
0:07:32 | and |
---|
0:07:33 | the next estimate here post |
---|
0:07:34 | it shifts on to the new mode |
---|
0:07:37 | so |
---|
0:07:38 | so whereas in the first example |
---|
0:07:41 | the second mode appeared and then sort of disappeared |
---|
0:07:45 | from the estimates, |
---|
0:07:46 | in this case it seems like there's a switch from |
---|
0:07:49 | the first set of modes to the second set |
---|
0:07:51 | so we have new questions here, such as: what is the difference between the two examples? |
---|
0:07:55 | how often does each type of |
---|
0:07:56 | non-uniqueness occur? |
---|
0:07:58 | and what role does it play in the predictability of the |
---|
0:08:03 | articulation? |
---|
0:08:05 | and what we do now is that |
---|
0:08:07 | we just shift: the previous examples were in the articulatory space, the midsagittal plane, |
---|
0:08:13 | whereas this one is actually in space-time, |
---|
0:08:17 | these are plots in space-time, so |
---|
0:08:18 | the blue and the pink lines are actually the peaks of these |
---|
0:08:22 | modes that you saw, and the black line is the recorded trajectory |
---|
0:08:26 | so you can see that in type one, which we call "along the same path", |
---|
0:08:30 | these um |
---|
0:08:32 | the recorded positions sort of stick to one |
---|
0:08:36 | of the trajectories |
---|
0:08:37 | whereas you can see that there are some non-unique estimates |
---|
0:08:41 | for some part of this trajectory, which we call a non-unique patch |
---|
0:08:46 | and in the second example |
---|
0:08:49 | uh |
---|
0:08:50 | you can see that there is a sort of shifting |
---|
0:08:54 | from one of these |
---|
0:08:56 | paths that can be taken to the second path, that's from the blue to the pink, |
---|
0:09:00 | which we call "with change in path" |
---|
0:09:02 | so it's obvious that type one is easy to estimate by |
---|
0:09:07 | using |
---|
0:09:08 | information about the previous frames, but that's not the case with type two |
---|
0:09:11 | uh |
---|
0:09:13 | in this case you also need the succeeding frames, you need to know |
---|
0:09:17 | in which direction |
---|
0:09:20 | but there are some exceptions; for example, you can see here that |
---|
0:09:25 | a |
---|
0:09:26 | the expected type is "along the same path", |
---|
0:09:28 | but in fact the recorded positions go to WCP, |
---|
0:09:33 | that is, "with change in path" |
---|
0:09:36 | so we just want to see how often |
---|
0:09:39 | you get these kinds of exceptions |
---|
0:09:41 | so what we do is that |
---|
0:09:44 | we have three conditions and we find the RMS error |
---|
0:09:48 | the first one is we apply |
---|
0:09:50 | continuity constraints |
---|
0:09:52 | based on dynamic programming from the preceding context |
---|
0:09:55 | and then we select one of the peaks; in the second one we select the mean between |
---|
0:09:59 | the two peaks; actually these are not really valid articulatory positions, but we do it just |
---|
0:10:04 | to |
---|
0:10:04 | see whether it reduces the RMS error |
---|
0:10:09 | and the last one is that |
---|
0:10:11 | we estimate which of |
---|
0:10:14 | the peaks actually |
---|
0:10:16 | gives the lowest RMS error, so we don't apply continuity constraints, we just |
---|
0:10:20 | see which of the peaks is closest to the recorded position |
---|
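
The three conditions just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: each frame is assumed to come with a list of candidate peak positions for a single articulatory coordinate, the continuity constraint is approximated by a Viterbi-style dynamic program over the whole sequence that minimises the summed squared frame-to-frame jump (the talk mentions dynamic programming from the preceding context, without giving the cost), and all function names and numbers are hypothetical.

```python
import math

def continuity_path(candidates):
    """Pick one peak per frame so the summed squared frame-to-frame jump is minimal."""
    cost = [0.0] * len(candidates[0])   # best cost of ending at each peak so far
    back = []                           # back[t][k]: best predecessor of peak k in frame t + 1
    for prev, cur in zip(candidates, candidates[1:]):
        new_cost, pointers = [], []
        for c in cur:
            j = min(range(len(prev)), key=lambda i: cost[i] + (c - prev[i]) ** 2)
            new_cost.append(cost[j] + (c - prev[j]) ** 2)
            pointers.append(j)
        cost = new_cost
        back.append(pointers)
    k = min(range(len(cost)), key=lambda i: cost[i])
    path = [candidates[-1][k]]
    for t in range(len(back) - 1, -1, -1):   # trace the best path backwards
        k = back[t][k]
        path.append(candidates[t][k])
    return path[::-1]

def mean_of_peaks(candidates):
    """Second condition: take the mean of the peaks in every frame."""
    return [sum(c) / len(c) for c in candidates]

def oracle_peaks(candidates, recorded):
    """Third condition: pick the peak closest to the recorded position."""
    return [min(c, key=lambda p: abs(p - r)) for c, r in zip(candidates, recorded)]

def rms_error(estimate, recorded):
    """Root-mean-square error between an estimated and a recorded trajectory."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, recorded)) / len(recorded))

# Hypothetical "along the same path" case: frames 2 and 3 form a non-unique patch.
candidates = [[1.0], [1.1], [1.2, 3.0], [1.3, 3.1], [1.4]]
recorded = [1.0, 1.1, 1.2, 1.3, 1.4]

by_continuity = continuity_path(candidates)
by_mean = mean_of_peaks(candidates)
by_oracle = oracle_peaks(candidates, recorded)
for name, est in [("continuity", by_continuity), ("mean", by_mean), ("oracle", by_oracle)]:
    print(name, round(rms_error(est, recorded), 3))
```

In this made-up case the continuity-constrained path and the oracle both follow the recorded trajectory, while the mean of the peaks lands between the two modes and has a larger RMS error.
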
0:10:24 | so i'll just go to the graph |
---|
0:10:28 | uh so it's uh |
---|
0:10:30 | the first thing we see: the x- |
---|
0:10:35 | axis is actually the length of the patch, |
---|
0:10:37 | so how many |
---|
0:10:39 | successive frames you get |
---|
0:10:41 | where you have non-unique estimates, |
---|
0:10:43 | that's the x-axis, and the number of occurrences is on the y-axis |
---|
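
The histogram described here (patch length on the x-axis, number of occurrences on the y-axis) can be computed from a per-frame flag sequence. A minimal sketch, assuming each frame has already been labelled as non-unique or not; the function name and the example flags are hypothetical.

```python
from collections import Counter
from itertools import groupby

def patch_length_histogram(non_unique):
    """Map patch length (number of successive non-unique frames) to its number of occurrences."""
    return Counter(
        sum(1 for _ in run)                  # length of one run of equal flags
        for flag, run in groupby(non_unique)
        if flag                              # keep only runs of non-unique frames
    )

# Hypothetical per-frame flags: True means the conditional distribution was multimodal.
flags = [False, True, True, False, True, False, True, True, True, False, True, True]
hist = patch_length_histogram(flags)
for length in sorted(hist):
    print(length, hist[length])
```
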
0:10:48 | so you can see that it form sort of a the if and a function |
---|
0:10:51 | and the number of |
---|
0:10:55 | instances of non-uniqueness |
---|
0:10:57 | sort of decreases with length |
---|
0:10:59 | and |
---|
0:11:01 | and that's that's even uh it's more so for the uh uh with change and but case that from yeah |
---|
0:11:06 | from the long the same part in long the same point you see that |
---|
0:11:09 | for |
---|
0:11:09 | a to uh uh um |
---|
0:11:12 | for two consecutive frames you get a lot more occurrences |
---|
0:11:17 | oh |
---|
0:11:18 | so the frequency of occurrence of "along the same path" is higher than the |
---|
0:11:22 | WCP, but only for shorter patches; |
---|
0:11:25 | for longer patches |
---|
0:11:26 | it seems like it's |
---|
0:11:27 | more "with change" |
---|
0:11:30 | uh |
---|
0:11:31 | so fifty-one to |
---|
0:11:32 | fifty-three percent of the frames are resolved |
---|
0:11:36 | with continuity constraints for ASP |
---|
0:11:38 | but |
---|
0:11:39 | it's much lower for WCP, as expected, |
---|
0:11:42 | it's only twenty-nine to thirty-three percent |
---|
0:11:44 | and it keeps reducing with |
---|
0:11:49 | the length of |
---|
0:11:50 | uh |
---|
0:11:51 | selecting the mean |
---|
0:11:53 | between the two paths actually works pretty well for the WCP in many |
---|
0:11:57 | of the cases |
---|
0:11:58 | which is actually not completely intuitive |
---|
0:12:01 | but |
---|
0:12:02 | it seems to work |
---|
0:12:03 | some |
---|
0:12:05 | but this is probably because you don't know at what point |
---|
0:12:07 | the trajectory switches from one of these paths |
---|
0:12:11 | to the other path |
---|
0:12:12 | so selecting the mean is actually kind of pragmatic |
---|
0:12:15 | to use that seven |
---|
0:12:16 | um |
---|
0:12:18 | and uh but but this but |
---|
0:12:20 | the uh |
---|
0:12:22 | uh but |
---|
0:12:23 | uh this method actually by selecting the mean actually |
---|
0:12:25 | decreases as the length of the uh uh green |
---|
0:12:30 | around eight percent for the ASP and twenty-two |
---|
0:12:34 | percent for the WCP are unresolved, in the sense that |
---|
0:12:39 | the uh |
---|
0:12:40 | the mode which actually gives you the best results |
---|
0:12:43 | cannot be estimated using continuity constraints |
---|
0:12:46 | so uh that's uh |
---|
0:12:48 | the other |
---|
0:12:49 | i result from from this uh paper |
---|
0:12:51 | it has uh that |
---|
0:12:54 | yes |
---|
0:12:54 | in acoustic-to-articulatory inversion, |
---|
0:12:58 | the non-uniqueness in this |
---|
0:13:00 | inversion |
---|
0:13:01 | can be estimated statistically |
---|
0:13:03 | continuity constraints help, but not for all instances |
---|
0:13:07 | we probably need some other information rather than just continuity, for example like the motion state or |
---|
0:13:12 | that that of speech and some some some the time because |
---|
0:13:15 | the estimate |
---|
0:13:17 | there are some semi-definite conclusions: |
---|
0:13:20 | human beings make use of non-unique articulator positions, so this is clear, but |
---|
0:13:25 | uh |
---|
0:13:26 | this cannot be verified unless we have exactly the same |
---|
0:13:30 | acoustics |
---|
0:13:31 | or for you the same of six with to for |
---|
0:13:33 | so it's semi-definite |
---|
0:13:36 | there are some rather open questions as well |
---|
0:13:41 | the main question here is |
---|
0:13:43 | and |
---|
0:13:44 | do these non-unique coil positions |
---|
0:13:46 | change the area function of the vocal tract? |
---|
0:13:49 | it might seem intuitive that they do, but |
---|
0:13:51 | that has at least to be verified, and |
---|
0:13:53 | we can hope that we get some MRI, dynamic MRI |
---|
0:13:57 | results |
---|
0:13:58 | to verify that |
---|
0:14:00 | and what kind of compensatory mechanism is used to make these non-unique |
---|
0:14:04 | articulations, sorry, |
---|
0:14:07 | non-unique articulations for the same acoustics? |
---|
0:14:10 | and uh |
---|
0:14:11 | given that we have non-uniqueness |
---|
0:14:13 | in this mapping, what is its role in learning, so how do infants figure this |
---|
0:14:18 | out? |
---|
0:14:19 | vol |
---|
0:14:20 | um |
---|
0:14:21 | and i like when my speech |
---|
0:14:30 | so that |
---|
0:14:31 | it's open for discussion |
---|
0:14:39 | any questions? |
---|
0:14:42 | okay over there |
---|
0:14:46 | that's to |
---|
0:14:47 | um |
---|
0:14:48 | so i have a comment, and in your last slide you had a question about what |
---|
0:14:52 | uh do not unique assistance of forty political right |
---|
0:14:56 | so |
---|
0:14:57 | maybe can show some i uh you know |
---|
0:14:59 | that's some comments on |
---|
0:15:01 | no the way to measure but the articulation right that these three positions or or even a sagittal it'll some |
---|
0:15:06 | of the image image right |
---|
0:15:08 | provides this sort of projection of a complex geometry that's also moving in time, so we have |
---|
0:15:14 | a |
---|
0:15:15 | restrictions of spatiotemporal sampling |
---|
0:15:17 | and we are again trying to map it to some acoustic feature vector, which is also some sort of projection |
---|
0:15:23 | of the signal |
---|
0:15:24 | uh |
---|
0:15:24 | so it's really |
---|
0:15:26 | in often times not the point or whether it this actually a uh are be mapping the same things or |
---|
0:15:31 | or for uh and trying to find something that's not there |
---|
0:15:34 | and |
---|
0:15:35 | well this this just that it results and all all of four or yeah uh has as shown no can |
---|
0:15:40 | to some extent we can show this |
---|
0:15:42 | a so what are you thoughts on an know how one would actually |
---|
0:15:45 | who were there |
---|
0:15:46 | gaps that one could still |
---|
0:15:48 | well the |
---|
0:15:49 | yeah, that's a very valid question in this field of research, because |
---|
0:15:53 | as you said, they are all projections of what the reality is |
---|
0:15:57 | and uh |
---|
0:15:58 | i mean this is sort of a much larger question |
---|
0:16:03 | in many senses |
---|
0:16:04 | but what i would like to say is that uh |
---|
0:16:06 | B |
---|
0:16:07 | by looking at these statistical methods |
---|
0:16:09 | the let's say that we just do we don't use |
---|
0:16:12 | the acoustic parameters that we use |
---|
0:16:14 | and we instead use some other acoustic parameters |
---|
0:16:16 | uh |
---|
0:16:17 | or we use articulatory parameters |
---|
0:16:20 | which are different |
---|
0:16:22 | or instead of using positions of the coils we use area functions, for example |
---|
0:16:26 | uh |
---|
0:16:27 | the the the thing is that |
---|
0:16:29 | by using this kind of statistical method we can |
---|
0:16:33 | find out whether it is non-unique or not |
---|
0:16:36 | in a reasonable way |
---|
0:16:37 | uh |
---|
0:16:38 | this kind of this of us |
---|
0:16:40 | paper that we worked on sort of |
---|
0:16:44 | tells you the problems that come when we try to do |
---|
0:16:47 | statistically based acoustic-to-articulatory mapping; it is quite clear that you are |
---|
0:16:51 | going to have these problems when you do statistical acoustic-to-articulatory |
---|
0:16:54 | mapping |
---|
0:16:56 | i |
---|
0:16:57 | which is why we put it in the semi-definite conclusions, i mean |
---|
0:17:00 | because |
---|
0:17:01 | we can't be sure |
---|
0:17:02 | as of |
---|
0:17:03 | so i i i i |
---|
0:17:05 | don't know how to go ahead |
---|
0:17:06 | based on the |
---|
0:17:07 | unless, i suppose, we have 3D MRI |
---|
0:17:09 | that think uh |
---|
0:17:14 | yeah |
---|
0:17:15 | yeah |
---|
0:17:17 | in front of you |
---|
0:17:19 | uh |
---|
0:17:19 | it |
---|
0:17:21 | if i understand it correctly, in this work you're addressing the question of |
---|
0:17:25 | non-uniqueness within the speaker, is that right? yeah, it's within the speaker. so |
---|
0:17:30 | is there a way to extend this to cross-speaker non-uniqueness? because |
---|
0:17:33 | that might be important. yes, actually that's quite clear, i mean, there is |
---|
0:17:38 | much other evidence which shows that across speakers people use different strategies |
---|
0:17:44 | to produce the same kind of sound |
---|
0:17:47 | but the problem there is not exactly the same, because you would not produce exactly the |
---|
0:17:52 | same, i mean, the acoustics |
---|
0:17:53 | varies by other factors also, such as the shape of the vocal tract, and |
---|
0:17:57 | so |
---|
0:17:58 | you can produce the same phonemes |
---|
0:17:59 | the same sounds that we classify as the same phonemes |
---|
0:18:02 | different people use different strategies, and there are several results which show that |
---|
0:18:06 | but |
---|
0:18:06 | can we produce exactly the same acoustics |
---|
0:18:09 | by different articulatory configurations? |
---|
0:18:12 | i think that question is more relevant if you look at a single speaker |
---|
0:18:17 | so you would say that this is a big telling instant |
---|
0:18:19 | i no |
---|
0:18:20 | oh have was |
---|
0:18:21 | okay |
---|
0:18:22 | i mean there just |
---|
0:18:23 | different |
---|
0:18:24 | questions |
---|
0:18:24 | okay |
---|
0:18:25 | thanks |
---|
0:18:28 | yeah |
---|
0:18:31 | so i think this |
---|
0:18:32 | spring cell session try that can does so as i "'cause" you have those and and the people on the |
---|
0:18:38 | flap but just |
---|