0:00:14a a but not well as you uh
0:00:17uh
0:00:18the this stock is uh
0:00:20uh
0:00:21is a
0:00:22go clap of it but by an and the question that's and that's me and all lot
0:00:26and well
0:00:27the title is resolving non-uniqueness in the
0:00:30acoustic-to-articulatory mapping which i would for the for it was it a mapping
0:00:35uh
0:00:36i think i'll skip this slide because the last two presentations were pretty much about
0:00:40the same thing
0:00:41and it's basically just to give an idea as to what acoustic-to-articulatory mapping or inversion
0:00:47is
0:00:49i'll directly
0:00:51jump to the main focus of this talk which is actually the non-uniqueness in this mapping
0:00:57which has been referred to by
0:00:59the few authors
0:01:01who spoke before me
0:01:03so in the literature we have
0:01:06things like lossless tube models of the vocal tract and parametric articulatory
0:01:11models of speech synthesis and
0:01:14you can see that the inverse mapping from acoustics is actually to a class of area functions and
0:01:19not exactly one
0:01:21and you have similar results from other such experiments
0:01:24and there are some experiments called bite-block experiments where the
0:01:29speaker is
0:01:31constrained
0:01:32but still the speakers can produce perceptually similar sounds
0:01:38in spite of the unnatural articulation so this gives an indication of non-uniqueness
0:01:42of course these are artificial situations and
0:01:45this may not really happen in natural speech
0:01:48so what about in continuous speech
0:01:50you can have different forms of data to look at this
0:01:55which i have listed here
0:01:57in our case we use the EMA data of the MOCHA-TIMIT database just like
0:02:00the previous two talks so i won't go into that
0:02:03too much
0:02:05uh
0:02:06so this is an example from the data set and here we have a phoneme
0:02:11and the red and the blue lines here are
0:02:15the spectrum the magnitude spectrum
0:02:20from two instances
0:02:23and the figure
0:02:27to the bottom right is actually the positions of the articulators
0:02:32and you can see that
0:02:33even though the acoustics are
0:02:36quite similar
0:02:37the articulator positions
0:02:39are quite different
0:02:41but is this non-uniqueness
0:02:44i mean you still can't really say that because there is a difference in the acoustics
0:02:48so can this difference in acoustics be explained
0:02:54by this
0:02:55variation in position of the articulators
0:02:58so that
0:03:00sort of comes to the problem when you have this kind of limited database
0:03:06that you cannot get exactly the same
0:03:08acoustics and exactly the same articulators
0:03:12or exactly the same acoustics with different articulators so that's
0:03:16the difficulty with data
0:03:18so
0:03:20the key questions in this talk are
0:03:22how does one estimate non-uniqueness in limited data
0:03:25and we do that by statistical modeling based on one of our previous papers
0:03:29how do these non-unique
0:03:31instances occur in consecutive
0:03:32acoustic-articulatory frames
0:03:35does applying continuity constraints help
0:03:38resolve non-uniqueness
0:03:39these are the main questions
0:03:41so we have a toy example here and you can see that
0:03:44the figure on the top here is
0:03:48the acoustic parameters
0:03:50belonging to say one phoneme
0:03:52and this is the articulatory parameters belonging to one phoneme and you can see that
0:03:56the acoustics is unimodal but
0:03:58the articulatory parameters are bimodal so is this non-unique
0:04:04well if you look at the data points here
0:04:06i don't know whether the
0:04:07points are very clear but you can see that it's not completely true i mean
0:04:12you can see that there are some clusters here
0:04:14if you look at
0:04:16the joint articulatory-acoustic space
0:04:20and therefore what we do is we
0:04:23model this sort of data in the joint
0:04:25articulatory-acoustic space
0:04:27and then we can look at one value of the acoustic parameter that is shown by
0:04:32the blue line there that's a test sample
0:04:34and we can find the conditional probability distribution and in this case it is bimodal which says
0:04:40that
0:04:40at this value of the acoustic parameter
0:04:43the mapping is non-unique
0:04:46but if you look at another acoustic parameter here which belongs to
0:04:50the same
0:04:51acoustic cluster you can see that it's unimodal and
0:04:54it's not non-unique
0:04:56of course there's the question of the variance
0:04:59which is also a non-uniqueness of some sort because for one value of the
0:05:03acoustic parameter you can have different
0:05:05values of the articulatory parameter
0:05:07but we don't look at this sort of non-uniqueness in this paper
0:05:12we just look at this bimodal kind of non-uniqueness
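The idea described above (model the joint articulatory-acoustic space, condition on one acoustic value, and check whether the conditional distribution is bimodal) can be sketched as follows. This is only an illustrative sketch and not the speaker's implementation: it assumes the joint model is a Gaussian mixture, uses one scalar acoustic and one scalar articulatory parameter as in the toy example, and the names `conditional_gmm_1d`, `count_modes` and `rel_height` are hypothetical.

```python
import numpy as np

def conditional_gmm_1d(weights, means, covs, x):
    """Condition a 2-D joint GMM over (acoustic x, articulatory y) on x.

    Returns the mixture weights, means and variances of p(y | x).
    Assumption: the joint density is a Gaussian mixture (not confirmed by the talk).
    """
    weights = np.asarray(weights, float)
    means = np.asarray(means, float)   # shape (K, 2): [mean_x, mean_y]
    covs = np.asarray(covs, float)     # shape (K, 2, 2)
    sxx, sxy, syy = covs[:, 0, 0], covs[:, 0, 1], covs[:, 1, 1]
    # How well each component explains the observed acoustic value.
    px = np.exp(-0.5 * (x - means[:, 0]) ** 2 / sxx) / np.sqrt(2 * np.pi * sxx)
    cond_w = weights * px
    cond_w /= cond_w.sum()
    # Standard Gaussian conditioning, applied per component.
    cond_mean = means[:, 1] + sxy / sxx * (x - means[:, 0])
    cond_var = syy - sxy ** 2 / sxx
    return cond_w, cond_mean, cond_var

def count_modes(cond_w, cond_mean, cond_var, grid, rel_height=0.1):
    """Count local maxima of p(y | x) on a grid.

    Peaks lower than rel_height times the highest density value are ignored
    (rel_height is an illustrative threshold, not taken from the talk).
    """
    diff = grid[:, None] - cond_mean[None, :]
    comp = cond_w * np.exp(-0.5 * diff ** 2 / cond_var) / np.sqrt(2 * np.pi * cond_var)
    dens = comp.sum(axis=1)
    inner = dens[1:-1]
    is_peak = (inner > dens[:-2]) & (inner > dens[2:]) & (inner >= rel_height * dens.max())
    return int(is_peak.sum()), grid[1:-1][is_peak]

# Toy data: two components share the acoustic mean 0 but differ in articulation,
# a third component sits at a different acoustic value.
w = [0.25, 0.25, 0.5]
mu = [[0.0, -2.0], [0.0, 2.0], [6.0, 0.0]]
cov = [np.diag([1.0, 0.25])] * 3
grid = np.linspace(-5.0, 5.0, 1001)

n_modes_at_0, peaks_at_0 = count_modes(*conditional_gmm_1d(w, mu, cov, 0.0), grid)
n_modes_at_6, peaks_at_6 = count_modes(*conditional_gmm_1d(w, mu, cov, 6.0), grid)
print(n_modes_at_0, n_modes_at_6)
```

In this toy setup the acoustic value 0 gives a bimodal conditional distribution (a non-unique mapping in the sense used in the talk), while the acoustic value 6 gives a unimodal one.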
0:05:17this is the parameterization of the data and again it's very similar to what has been used
0:05:22in the state-of-the-art
0:05:24inversion mapping systems and the same as the one which was used in the previous paper
0:05:29this is
0:05:31an example of non-uniqueness so
0:05:35these blobs are actually
0:05:37the conditional distributions
0:05:40given one vector of acoustics
0:05:42these are plots of the conditional distributions of the
0:05:47articulators
0:05:49so in this case the blue dots and triangles are
0:05:54actually the peaks of these different modes
0:05:56and the green line
0:06:00if that's clear in the presentation the green line is actually the recorded positions
0:06:04in this case you can see that they are
0:06:06closer to one of the peaks than the other peak
0:06:08and the other peak is actually
0:06:11the non-unique
0:06:13estimate for this
0:06:17particular acoustic
0:06:19now we look at this in a trajectory
0:06:21so in this case you can see that they are all unimodal
0:06:25all the conditional distributions are unimodal
0:06:28but you look at the next frame and in this case you can start seeing that the tongue
0:06:32tip
0:06:33which is here
0:06:34you can start seeing that
0:06:37there's another mode
0:06:39and you can see the same thing
0:06:41in the lower lip which is here
0:06:43and the tongue dorsum
0:06:47and so on
0:06:48but you can see at the same time that the recorded positions
0:06:53are always closer to
0:06:55one of the
0:06:58modes than the other
0:07:03but there is another example
0:07:06not following this and this is
0:07:08that example
0:07:09in this case you can see that this is largely unimodal
0:07:13uh
0:07:14um this is
0:07:15one frame but to
0:07:16uh are in the shop
0:07:19uh
0:07:20and
0:07:21you can start seeing that
0:07:25the second mode starts appearing somewhere here
0:07:30and you can see that it's
0:07:31there
0:07:32and
0:07:33the next estimate here
0:07:34shifts on to the new mode
0:07:37so
0:07:38in the first example
0:07:41the second mode appeared and then sort of disappeared
0:07:45from the estimates
0:07:46in this case it seems like there's a switch from
0:07:49the first set of modes to the second set
0:07:51so we have new questions here which are what is the difference between the two examples
0:07:55how often does each type of
0:07:56non-uniqueness occur
0:07:58and what role does it play in the predictability of the
0:08:03articulation
0:08:05and what we do now is that
0:08:07we just shift the view so the previous examples were in the articulatory space the midsagittal plane
0:08:13whereas this one is actually in space-time
0:08:17these are plots in space-time so
0:08:18the blue and the pink lines are actually the peaks of
0:08:22the modes that you saw and the black line is the recorded trajectory
0:08:26so you can see that in type one which we call along the same path
0:08:32the recorded positions sort of stick to one
0:08:36of the trajectories
0:08:37whereas you can see that there are some non-unique estimates
0:08:41for some part of this trajectory which we call the non-unique path
0:08:46and in the second example
0:08:50you can see that there is a sort of shifting
0:08:54from one of these
0:08:56paths to the second path that's from the blue to the pink
0:09:00and we call that with change in path
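One possible way to label a non-unique path as "along the same path" (ASP) or "with change in path" (WCP) is sketched below. It is an illustration under stated assumptions, not the rule used in the paper: it assumes exactly two peak tracks per path, a single articulator coordinate per frame, and it calls a path WCP as soon as the recorded trajectory is nearest to different tracks in different frames. The function name `classify_path` is hypothetical.

```python
import numpy as np

def classify_path(track_a, track_b, recorded):
    """Label a non-unique path as 'ASP' or 'WCP'.

    track_a, track_b: peak positions of the two modes, one value per frame.
    recorded: recorded articulator position, one value per frame.
    Assumed rule: ASP if the recording stays nearest to the same track in
    every frame, WCP if the nearest track changes somewhere along the path.
    """
    track_a, track_b, recorded = (np.asarray(v, float) for v in (track_a, track_b, recorded))
    nearest_is_b = np.abs(recorded - track_b) < np.abs(recorded - track_a)
    if nearest_is_b.all() or not nearest_is_b.any():
        return "ASP"
    return "WCP"

# Toy tracks: the two modes stay at 0 and 1 for four frames.
blue = [0.0, 0.0, 0.0, 0.0]
pink = [1.0, 1.0, 1.0, 1.0]
label_stay = classify_path(blue, pink, [0.1, 0.2, 0.1, 0.0])    # sticks to blue
label_switch = classify_path(blue, pink, [0.1, 0.3, 0.8, 0.9])  # moves blue -> pink
print(label_stay, label_switch)
```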
0:09:02so it's obvious that type one is easy to estimate
0:09:07using
0:09:08information about the previous frames but that's not the case with type two
0:09:13in this case you also need the succeeding frames and you need to know
0:09:17in which direction
0:09:20but there are some exceptions for example you can see here that
0:09:26the expected type is along the same path
0:09:28but in fact the recorded positions go to WCP
0:09:33that is with change in path
0:09:36so we just want to see how often
0:09:39you get these kinds of exceptions
0:09:41so what we do is that we
0:09:44have three conditions and we find the RMS error
0:09:48the first one is we apply
0:09:50continuity constraints
0:09:52based on dynamic programming from the preceding context
0:09:55and then we select one of the peaks and in the second one we select the mean between
0:09:59the two peaks actually these are not really valid articulator positions but we do it just
0:10:04to
0:10:04see whether it reduces the RMS error
0:10:09and the last one is that we
0:10:11estimate which of
0:10:14the peaks actually
0:10:16gives the lowest RMS error so we don't apply the continuity constraints but we just
0:10:20see which of the peaks is closest to the recorded position
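The three conditions just listed can be sketched as below. This is a minimal sketch under assumptions that are not stated in the talk: one articulator coordinate per frame, a fixed number of candidate peaks per frame, a squared-jump cost for the dynamic programming, and a hypothetical `start` value standing in for the position known from the preceding context. All function names are illustrative.

```python
import numpy as np

def rmse(estimate, recorded):
    """Root-mean-square error between an estimated and a recorded trajectory."""
    diff = np.asarray(estimate, float) - np.asarray(recorded, float)
    return float(np.sqrt(np.mean(diff ** 2)))

def select_by_continuity(peaks, start):
    """Condition 1: pick one peak per frame by dynamic programming.

    peaks: array of shape (T, K) with K candidate peaks per frame.
    start: position from the preceding context (hypothetical parameter).
    The assumed cost is the sum of squared frame-to-frame jumps.
    """
    peaks = np.asarray(peaks, float)
    n_frames, n_peaks = peaks.shape
    cost = (peaks[0] - start) ** 2
    back = np.zeros((n_frames, n_peaks), dtype=int)
    for t in range(1, n_frames):
        # jump[i, j]: cost of moving from peak i at t-1 to peak j at t.
        jump = (peaks[t][None, :] - peaks[t - 1][:, None]) ** 2
        total = cost[:, None] + jump
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0)
    idx = np.empty(n_frames, dtype=int)
    idx[-1] = cost.argmin()
    for t in range(n_frames - 1, 0, -1):
        idx[t - 1] = back[t, idx[t]]
    return peaks[np.arange(n_frames), idx]

def select_mean(peaks):
    """Condition 2: take the mean of the candidate peaks in every frame."""
    return np.asarray(peaks, float).mean(axis=1)

def select_oracle(peaks, recorded):
    """Condition 3: pick, per frame, the peak closest to the recorded position."""
    peaks = np.asarray(peaks, float)
    idx = np.abs(peaks - np.asarray(recorded, float)[:, None]).argmin(axis=1)
    return peaks[np.arange(len(peaks)), idx]

# Toy path with a change in path: the recording moves from the peak at 0 to the peak at 1.
peaks = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
recorded = [0.1, 0.6, 0.9]
err_continuity = rmse(select_by_continuity(peaks, start=0.1), recorded)
err_mean = rmse(select_mean(peaks), recorded)
err_oracle = rmse(select_oracle(peaks, recorded), recorded)
print(err_continuity, err_mean, err_oracle)
```

In this toy case the continuity-based choice stays on the first peak, so the mean of the two peaks gives a lower error, and the closest-peak selection gives the lowest error of the three.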
0:10:24so i'll just go to the graph
0:10:30so the first thing we see is that the x
0:10:35axis is actually the length of the path
0:10:37so how many
0:10:39successive frames you get
0:10:41where you have non-unique estimates
0:10:43that's the x-axis and the number of occurrences is on the y-axis
0:10:48so you can see that it forms sort of an exponential function
0:10:51and the number of
0:10:55non-unique instances
0:10:57sort of decreases with length
0:11:01and that's even more so for the with change in path case compared to
0:11:06along the same path and in along the same path you see that
0:11:09for
0:11:12two consecutive frames you get a lot more occurrences
0:11:18so the frequency of occurrence of along the same path is higher than with change in path
0:11:22but only for shorter paths
0:11:25for longer paths
0:11:26it seems like it's
0:11:27more with change in path
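The graph described above counts how many times a run of successive non-unique frames of a given length occurs. A small sketch of that counting step follows; it assumes a per-frame boolean flag that says whether the estimate was non-unique, and the function name `path_length_histogram` is hypothetical.

```python
from collections import Counter
from itertools import groupby

def path_length_histogram(is_non_unique):
    """Map path length (in frames) to the number of non-unique paths of that length.

    is_non_unique: sequence of booleans, one per frame, True where the
    estimate for that frame was non-unique.
    """
    lengths = [sum(1 for _ in run) for flag, run in groupby(is_non_unique) if flag]
    return Counter(lengths)

# Toy frame flags: runs of non-unique frames with lengths 2, 1, 3 and 2.
flags = [False, True, True, False, True, False, True, True, True, False, True, True]
histogram = path_length_histogram(flags)
print(sorted(histogram.items()))
```

Running this separately on the paths labelled along the same path and with change in path would give the two curves compared in the talk.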
0:11:30uh
0:11:31so fifty want to
0:11:32a three percent of the ASP frames are resolved
0:11:36with continuity constraints
0:11:38but
0:11:39it's much lower for WCP as expected
0:11:42it's only twenty-nine to thirty-three percent
0:11:44and it keeps reducing with
0:11:49the length of the path
0:11:50uh
0:11:51taking the mean
0:11:53between the two peaks actually works pretty well for ASP and WCP in many
0:11:57of the cases
0:11:58which is not completely intuitive
0:12:01but
0:12:02it seems to work
0:12:05this is probably because you don't know at what point
0:12:07the trajectory switches from one of these paths
0:12:11to the other path
0:12:12so selecting the mean is kind of pragmatic
0:12:15to use that seven
0:12:16um
0:12:18and uh but but this but
0:12:20the uh
0:12:22uh but
0:12:23uh this method actually by selecting the mean actually
0:12:25decreases as the length of the uh uh green
0:12:30a a a a a a a around it percent for that is for that is P and twenty two
0:12:34percent for the WCP are unresolved in the sense that
0:12:40the mode which actually gives you the best results
0:12:43cannot be estimated using continuity constraints
0:12:46so uh that's uh
0:12:48the other
0:12:49i result from from this uh paper
0:12:51it has uh that
0:12:54yes
0:12:54in acoustic-to-articulatory inversion
0:12:58the non-uniqueness in this
0:13:00inversion
0:13:01can be estimated statistically
0:13:03continuity constraints help but not for all instances
0:13:07we probably need some other information rather than just continuity for example like the emotional state or
0:13:12that that of speech and some some some the time because
0:13:15the estimate
0:13:17there are some semi-definite conclusions which is that
0:13:20human beings make use of non-unique articulator positions so this is clear but
0:13:26this cannot be verified unless we have exactly the same
0:13:30acoustics
0:13:31or exactly the same acoustics with different articulators
0:13:33so it's semi-definite
0:13:36there are some or rather many unanswered questions as well
0:13:41the main question here is
0:13:44do these non-unique articulator positions
0:13:46change the area function of the vocal tract
0:13:49it might seem intuitive that they do but
0:13:51that has at least to be verified and
0:13:53we can hope that we get some MRI dynamic MRI
0:13:57results
0:13:58to verify it
0:14:00and what kind of compensation mechanism is used to make these non-unique
0:14:04articulations
0:14:07non-unique articulations for the same acoustics
0:14:11and given that we have non-uniqueness
0:14:13in this mapping what is its role for learning so how do infants figure it
0:14:18out
0:14:20um
0:14:21and i like when my speech
0:14:30so the
0:14:31floor is open for discussion
0:14:39any questions
0:14:42okay over there
0:14:46that's to
0:14:47um
0:14:48so i we have a common to a type and your last slide you had a a question about what
0:14:52uh do not unique assistance of forty political right
0:14:56so
0:14:57maybe can show some i uh you know
0:14:59that's some comments on
0:15:01the way we measure the articulation right these positions or even a midsagittal
0:15:06image right
0:15:08provides a sort of projection of this complex geometry that's also moving in time so we have
0:15:15restrictions of spatio-temporal sampling
0:15:17which we are again trying to map to some acoustic feature vector that's also some sort of projection
0:15:23of the signal
0:15:24so it's really
0:15:26oftentimes not clear whether we are actually mapping the same things
0:15:31or trying to find something that's not there
0:15:34and
0:15:35well this this just that it results and all all of four or yeah uh has as shown no can
0:15:40to some extent we can show this
0:15:42a so what are you thoughts on an know how one would actually
0:15:45who were there
0:15:46gaps that one could still
0:15:48well the
0:15:49yeah i mean that's a very valid question in this field of research because
0:15:53as you said they are all projections of what the reality is
0:15:58and i mean this is sort of a much larger question
0:16:03in many senses
0:16:04but what i would like to say is that
0:16:07by looking at these statistical methods
0:16:09let's say that we don't use
0:16:12the acoustic parameters that we use
0:16:14and we instead use some other acoustic parameters
0:16:17or we use articulatory parameters
0:16:20which are different
0:16:22or instead of using positions of the articulators we use area functions for example
0:16:27the thing is that
0:16:29by using this kind of statistical method we can
0:16:33find out whether it is non-unique or not
0:16:36in a reasonable way
0:16:37uh
0:16:38this
0:16:40paper that we worked on sort of
0:16:44tells you the problems that come when we try to do
0:16:47statistical-based acoustic-to-articulatory mapping and it is quite clear that you are
0:16:51going to have these problems when you do statistical-based
0:16:54acoustic-to-articulatory mapping
0:16:56i
0:16:57that's why we put it in the semi-definite conclusions i mean
0:17:00because
0:17:01we can't be sure
0:17:03so i
0:17:05don't know how to go ahead
0:17:06based on that
0:17:07unless suppose we have 3-D MRI
0:17:09that think uh
0:17:14yeah
0:17:15yeah
0:17:17in front of you
0:17:19uh
0:17:19it
0:17:21if i understand it correctly in this work you're looking at the question of
0:17:25non-uniqueness within the speaker is that right yeah it's within the speaker so
0:17:30is there a way to extend this to cross-speaker non-uniqueness because
0:17:33that might be important for yes actually that's quite clear i mean there are
0:17:38many other evidences which show that across speakers people use different strategies
0:17:44to produce the same kind of sound
0:17:47but the problem there is not exactly the same because you would not produce exactly the
0:17:52same acoustics i mean the acoustics
0:17:53vary because of other factors also like the shape of the vocal tract
0:17:57so
0:17:58you can produce the same phonemes
0:17:59the same sounds that we classify as the same phonemes
0:18:02different people use different strategies and there are several results which show that
0:18:06but
0:18:06can we produce exactly the same acoustics
0:18:09by different articulatory configurations
0:18:12i think that question is more relevant when you look at a single speaker
0:18:17so you would say that this is a big telling instant
0:18:19i no
0:18:20oh have was
0:18:21okay
0:18:22i mean there just
0:18:23different
0:18:24questions
0:18:24okay
0:18:25thanks
0:18:28yeah
0:18:31so i think this
0:18:32spring cell session try that can does so as i "'cause" you have those and and the people on the
0:18:38flap but just