0:00:15 | Good morning, my name is [inaudible]. Today I will be talking about subject-independent acoustic-to-articulatory inversion. Let me first review what inversion is. |
0:00:34 | So, acoustic-to-articulatory inversion: as you saw from the previous talk, you are given the speech signal, and what you are trying to estimate is the articulatory trajectories. But obviously you cannot work in the raw speech signal domain or the raw articulatory domain, so you typically compute acoustic features, and then, to define the problem, what you are really trying to do is estimate the articulatory feature vectors from the synchronized acoustic features. So, given a test utterance, you compute the acoustic feature vectors from the speech signal, and then you try to estimate the corresponding articulatory feature vectors. |
0:01:19 | Now, before getting to the subject-independent part, let me first discuss what subject-dependent inversion means, because this is usually how it is addressed in the literature. As I mentioned, you are trying to estimate the articulatory features given a test utterance. What makes it subject dependent is that, in the traditional approach, the training and test subjects are identical. You have parallel acoustic and articulatory feature vectors from a single subject, you use these to develop a model of the mapping from acoustic to articulatory feature vectors, and then you use that model to estimate articulatory feature vectors for test utterances from the same subject. This is in some sense easier, because you do it on the same subject, so the acoustic space is matched and the articulatory space is matched. |
0:02:10 | But what we try to look into in this paper is subject-independent acoustic-to-articulatory inversion: given training data from a completely different subject, how do we use it to develop a model that can then be used to estimate the articulatory features for utterances from a different subject? It should immediately start becoming obvious what kinds of challenges you will see. The acoustic spaces of the two subjects are not very similar, so you cannot directly define a similarity metric between these two subjects; at the same time, the articulatory spaces are not similar either, since each subject has their own articulator ranges. So both the acoustic and the articulatory spaces are different. |
0:02:52 | At this point I would actually like to make clear what we estimate as the articulatory trajectories. You can think of it in this fashion: the estimated articulatory features are those that the training subject would have produced if he or she were trying to imitate or repeat the test utterance. |
0:03:22 | But why would you even want to do subject-independent inversion; why not just stick to the traditional subject-dependent version? The subject-dependent version is simple and easy, and it works very well. The problem is extending it to an arbitrary number of talkers, because you need articulatory data from each speaker, and as you may know, articulatory data is not as simple to collect as acoustic data. So if we could have an approach where you collect parallel articulatory and acoustic data for just one talker, one subject, and then extend it to an arbitrary number of speakers for whom you have only the acoustic data, that would be highly desirable. You could then think of applications like joint articulatory-acoustic speech recognition or speech synthesis, since once you have a model from just one subject you can adapt it to any number of speakers you want. So that is the motivation for the subject-independent inversion. |
0:04:23 | Now I will move to the details of how we propose to do it. We do it by minimization of a smoothness criterion for the articulatory trajectories. This criterion was initially proposed for subject-dependent inversion, but we found that it could be extended to subject-independent inversion as well, and the crucial element in the subject-independent version is the second term, which I will describe in somewhat more depth. |
0:04:50 | So when you estimate with this criterion, what it is doing is that you essentially have a different cost function for each of the trajectories: if you have K articulatory features, you have K of these smoothness criterion cost functions. That is because each articulator moves with a different speed, so you need different levels of smoothness. The first term is basically the smoothness penalty; you can think of it as the smoothness energy. The filter in each cost function is basically a high-pass filter that is optimized on a development set for that particular articulator; you do this for each articulator in turn, and you get these K criterion functions. The first term basically penalizes high-frequency content, which prevents the unrealistically jagged trajectories that you might estimate otherwise. The second term is the data, or probability, term, which actually does the estimation from the training data. It involves two quantities, y and p: the y are the probable values that your articulatory features could take, and the p are the corresponding probabilities. |
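To make the shape of these per-articulator cost functions concrete, here is a minimal Python sketch. It reflects our reading of the talk, not the paper's exact equations: the data term is taken to be a probability-weighted squared error to the candidate values y, and the smoothness term the energy of the high-pass-filtered trajectory. All names, the weighting, and the optimizer choice are illustrative.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.optimize import minimize

def inversion_cost(x, y_cand, p_cand, hp_taps, smooth_weight):
    """Per-articulator cost: smoothness energy plus probability-weighted data fit.
    x: (T,) candidate trajectory; y_cand, p_cand: (T, J) candidate values/probs;
    hp_taps: FIR high-pass filter taps, tuned per articulator on a dev set."""
    smooth = np.sum(lfilter(hp_taps, [1.0], x) ** 2)    # high-frequency energy
    data = np.sum(p_cand * (x[:, None] - y_cand) ** 2)  # fit to probable values
    return smooth_weight * smooth + data

def estimate_trajectory(y_cand, p_cand, hp_taps, smooth_weight=1.0):
    x0 = np.sum(p_cand * y_cand, axis=1)                # init: expected value per frame
    res = minimize(inversion_cost, x0,
                   args=(y_cand, p_cand, hp_taps, smooth_weight))
    return res.x
```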
0:06:02 | Now let's look at how we can estimate these y and p. The filter that I mentioned is optimized on a separate development set for each of the articulatory features. |
0:06:13 | So, to estimate these y and p, let me describe how it is done in the simpler subject-dependent case first; I will come to the subject-independent case just after this. Essentially, since the acoustic space is the same in the subject-dependent case, what you can do is estimate these probable articulatory features by proximity in the acoustic space. Given the test utterance and a test acoustic feature vector, you can find the closest acoustic feature vectors in the training data and take their corresponding articulatory features. That is, you compute some sort of Euclidean distance from each of the acoustic features in your training data, and once you have done that, you select, say, the L closest; the articulatory features corresponding to these acoustic features then become the probable values y that your articulatory features can take. And you can estimate the probabilities as inversely proportional to the distance, because if a training acoustic feature is very close to the test utterance, then the probability you assign to its articulatory value being the one you want should be high. So all we are using in the subject-dependent case is a simple Euclidean distance in the acoustic feature space, and this is the subject-dependent version of the generalized smoothness criterion. |
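As a sketch of this subject-dependent candidate selection (the number of neighbors L and the small epsilon guard are illustrative choices, not from the talk):

```python
import numpy as np

def candidates_subject_dependent(test_ac, train_ac, train_art, L=10):
    """Pick the L training frames closest in acoustic space; return their
    articulatory vectors y and probabilities p inversely proportional to distance.
    test_ac: (D,); train_ac: (N, D); train_art: (N, K)."""
    dists = np.linalg.norm(train_ac - test_ac, axis=1)  # Euclidean acoustic distance
    idx = np.argsort(dists)[:L]                         # L nearest training frames
    weights = 1.0 / (dists[idx] + 1e-8)                 # closer frame -> higher probability
    probs = weights / weights.sum()
    return train_art[idx], probs                        # the y and p of the criterion
```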
0:07:51 | But the question is how you would extend this when the two acoustic spaces are not the same. If the acoustic spaces are different, you cannot compute something like a simple Euclidean distance. So what we propose in this paper is to map these acoustic features to a space where you can then compare them using a simple distance metric that you know of. |
0:08:18 | We achieve this through the concept of a general acoustic space. You can think of the general acoustic space in this fashion: it consists of acoustic feature vectors from a number of different talkers, and it does not contain either of the two talkers in question. It is basically something you can think of as capturing our perception of how talkers in general sound. This general acoustic space, which contains acoustic feature vectors from a number of different speakers, is clustered into a number of different clusters, and each of these clusters is modeled using a Gaussian mixture model. Given this acoustic space, you can transform each acoustic feature vector, both the training acoustic feature vectors and the test acoustic feature vectors, into a new feature vector, which contains the posterior probabilities of that acoustic feature vector having come from each of these clusters. These posteriors are normalized so that they sum to one. So now that you have mapped your acoustic feature vectors to these probability vectors, you can think of computing something better suited to probability distributions than a simple Euclidean distance metric, and that is the modified distance metric which is used for the subject-independent case. So whereas we compute a simple Euclidean distance among the acoustic features in the subject-dependent case, in the subject-independent case we map them to the probability space and compute this modified distance there. |
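Here is a minimal sketch of the posterior mapping onto such a general acoustic space, assuming k-means clustering of pooled multi-speaker frames with one Gaussian mixture model per cluster; the cluster count, mixture size, and the normalization step are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_general_space(pooled_ac, n_clusters=32, n_mix=4, seed=0):
    """Cluster frames pooled from many talkers; fit one GMM per cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(pooled_ac)
    return [GaussianMixture(n_components=n_mix, random_state=seed)
            .fit(pooled_ac[labels == c]) for c in range(n_clusters)]

def posterior_vector(frame, gmms):
    """Map one acoustic frame to a probability vector over the clusters."""
    loglik = np.array([g.score_samples(frame[None, :])[0] for g in gmms])
    w = np.exp(loglik - loglik.max())   # stabilized exponentiation
    return w / w.sum()                  # normalized so the components sum to one
```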
0:09:51 | But the problem now is that we still have almost five hundred thousand frames, and this was just for one frame and one articulator, so the computational cost was immense. So, to further reduce the computational cost, we proposed comparing against only the relevant probability vectors, and the idea is as follows. |
0:10:16 | Remember that now we just have probability vectors; we do not have acoustic vectors anymore. So we take all the training probability vectors and distribute them into a number of different bins: the first bin contains all the probability vectors for which the first component is the highest, the second bin contains all the probability vectors for which the second component is the highest, and so on up to the K-th bin. Now, when you get a test utterance, you compute the probability vector for each test acoustic feature vector and find the index at which its component is the highest; you then compare it only against the probability vectors in the bin for that index, and you do not consider the other bins. For an even further reduction in cost, you can compute the distance using only the values around that index. You then do the same kind of search within that bin that you did before, and again you obtain the probabilities from the inverse of these distances, which are plugged into the generalized criterion. This provides a large reduction in the number of multiplications, and in computation in general, without any loss in performance. |
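A sketch of this binning trick, under the assumption that bins are keyed by the argmax component of each probability vector; the tie handling and the empty-bin fallback are our guesses:

```python
import numpy as np
from collections import defaultdict

def build_bins(train_probs):
    """Index training probability vectors by their largest component."""
    bins = defaultdict(list)
    for i, vec in enumerate(train_probs):
        bins[int(np.argmax(vec))].append(i)
    return bins

def search_bin(test_prob, train_probs, bins, L=10):
    """Compare the test vector only against its matching bin."""
    cand = bins.get(int(np.argmax(test_prob)), [])
    if not cand:                                   # fallback if the bin is empty
        cand = list(range(len(train_probs)))
    cand = np.asarray(cand)
    dists = np.linalg.norm(train_probs[cand] - test_prob, axis=1)
    order = np.argsort(dists)[:L]
    return cand[order], dists[order]               # indices and distances of L closest
```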
0:11:43 | Having said that, let me now tell you about the experimental setup. The experiments were done on the MOCHA database, which contains parallel acoustic and articulatory data for one male and one female British English speaker, each of whom read four hundred and sixty sentences. The acoustic features used were MFCCs, and the articulatory features were electromagnetic articulography (EMA) recordings of articulators such as the upper and lower lip, the lower incisor, the tongue, and the velum; we used these for the evaluation. A development set was used to optimize the hyperparameters that we have, that is, the high-pass filter for each articulator, by cross-validation on this data. And for building the general acoustic space, the TIMIT database was used. |
0:13:06 | For the evaluation of the proposed subject-independent inversion scheme, we compare three setups. The first is the subject-dependent scheme, where training and testing are on the same subject. The second is a mismatched scheme, where the training and test subjects are different and the mismatch is simply ignored, with no normalization of any kind. And the third is the proposed scheme, where we address the mismatch through the general acoustic space. |
0:13:50 | So what you see here is the setup for one of the speakers. In the subject-dependent scheme, training and testing are done on the same subject. In the mismatched scheme, the model trained on one subject is used directly on the other subject. And in the proposed scheme, the model trained on the other subject is used together with the general acoustic space. |
0:14:35 | And what you see here are the results of the proposed scheme. The correlation coefficient between the estimated and the measured trajectories was used as the evaluation metric; these are correlations averaged over all test utterances, for each of the articulatory features. |
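For reference, the evaluation metric mentioned here can be computed as below; averaging over utterance-articulator pairs in this particular way is our assumption:

```python
import numpy as np

def trajectory_correlation(estimated, measured):
    """Pearson correlation between one estimated and one measured trajectory."""
    return np.corrcoef(estimated, measured)[0, 1]

def average_correlation(pairs):
    """Mean correlation over (estimated, measured) trajectory pairs."""
    return float(np.mean([trajectory_correlation(e, m) for e, m in pairs]))
```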
0:15:02 | You see that the subject-dependent scheme does the best, because you are training and testing on the same subject. But the interesting thing here is the mismatched result: if you take the model trained on one subject and use it directly on the other, ignoring the mismatch, it does not work well anymore. |
0:15:28 | With the proposed scheme, where we apply the normalization by the general acoustic space and then do the estimation, the performance is similar to, and comes close to, the subject-dependent performance. So the subject-independent inversion works reasonably well. |
0:15:53 | What we take away from this is that you can build inversion models on articulatory data from just one subject and then apply them to any new speaker from whom you have collected only acoustic data, and this is done by, in effect, normalizing the talker-specific acoustics through the general acoustic space. |
0:16:16 | There is one important caveat here: whatever sounds you are looking for, because of the language in your data, you should also have represented in your general acoustic space. There are a few more things along these lines to keep in mind. |
0:16:41 | We have also run speech recognition experiments using the estimated articulatory features jointly with the acoustic features, and this has shown some improvement in recognition accuracy. We would also like to investigate acoustic features for this task other than, you know, the MFCCs. And in addition, in our lab we are trying to collect articulatory data with video sequences for American English talkers, and you can find more about it online. Thank you for listening. |
0:17:27 | I will be happy to answer any questions. |
0:17:36 | So we do have time for questions. Yes, please. |
0:17:40 | So the main thing is that, in this case, you have exactly the same sentences from the two different subjects; that is the type of scenario. Is your approach tied to that? |
0:17:58 | Oh no. The question is whether exactly the same sentences were used; no, actually, the test data was not the same. |
0:18:17 | The sentences come from a general corpus, so those were randomized, actually, in a sense. We wish we had more robust corpora. |
0:18:45 | The results are actually pretty stable. Even if you go somewhat lower than the roughly four hundred and fifty sentences we typically use, the results stay pretty stable; we tried that, actually, just to check. |
0:19:12 | And the point is that the correlation just gives us some feel for how it works, but there is still more to do; the phonetic discrimination results are interesting, though it is not clear what those are correlating with, actually. Okay. |
0:19:34 | Any other questions? Well, actually, I have a quick one. In the definition of your smoothness criterion you have two terms: one tries to measure the smoothness and the other the data accuracy, and you use a high-pass filter, which could possibly introduce some delay if it is a causal filter. Would that in practice cause some leak between the smoothness definition and the data accuracy? |
0:20:09 | For each trajectory we do a joint design; we have a pretty fast recursive design for that, so the delay is very small. |
0:20:18 | So the delay is very small? Okay, yeah. |
0:20:26 | So if there are no more questions, we will move on to the next speaker. |