0:00:15 | thank you — it's a real honor to be here |
0:00:18 | and to be part of this meeting in august twenty sixteen |
0:00:21 | thank you for having me here |
0:00:27 | so the talk i'm going to give today is |
0:00:31 | about a very classic question in speech communication: understanding variability and invariance in |
0:00:38 | speech |
0:00:40 | people have been asking this for a long time |
0:00:43 | the specific focus today is on the vocal instrument we have to |
0:00:48 | produce speech |
0:00:52 | here are six different people, shown as midsagittal slices of their vocal tracts |
0:00:58 | and we can see immediately that each has a very uniquely shaped vocal instrument |
0:01:03 | with which they produce speech — and that is what we are trying to use for |
0:01:07 | doing speaker recognition: speech signals produced by this vocal instrument |
0:01:11 | in fact, let me just orient you, if you're not familiar with this kind of view: we are looking |
0:01:15 | into the lips, the |
0:01:17 | mouth, |
0:01:19 | the nose, the tongue, and the velum — |
0:01:22 | the soft palate — that |
0:01:25 | sits back there; i point these out because you'll see a lot of these pictures in my talk today |
0:01:30 | here are a few more people |
0:01:33 | all of them trying to produce the same well-known sound |
0:01:36 | you can take just a quick look and see that even though these |
0:01:40 | people try to produce the same sound, they do it slightly differently; if we look at another |
0:01:43 | example, |
0:01:44 | say the first and second speakers: both |
0:01:50 | make the gesture for the same sound, but they make it slightly differently |
0:01:55 | so |
0:01:57 | we know that both the structure within |
0:02:02 | which speech production happens and how we produce speech vary across people |
0:02:07 | and some of it is reflected in the speech signal |
0:02:09 | and that is just what we are trying to get at |
0:02:14 | so the aim of this line of work is to ask: what role |
0:02:17 | can speech science play in understanding and supporting speech technology development? not only |
0:02:24 | do we want to recognize speakers, we want to know what makes them different |
0:02:29 | so specifically, the focus today |
0:02:33 | is to look at vocal tract structure — the physical instrument a person is given — and function — |
0:02:37 | the behaviour within that apparatus for producing speech — |
0:02:41 | and the interplay between them |
0:02:42 | by structure i mean the physical characteristics of this vocal tract apparatus that we have: |
0:02:48 | the hard palate geometry, the oral volume, |
0:02:51 | the length of the vocal tract, the velum, the tongue mass |
0:02:54 | function typically refers to the dynamic characteristics of speech articulation: |
0:02:58 | how we dynamically move, for example, to produce consonants by forming constrictions in the vocal |
0:03:03 | tract — to make a sound like "s" we raise the tongue tip |
0:03:09 | and create a narrow channel to |
0:03:11 | create turbulence |
0:03:15 | so |
0:03:16 | this leads to the very specific questions we ask: how are individual vocal tract differences — |
0:03:21 | like the ones in these pictures of people — reflected in the speech acoustics? |
0:03:25 | can the inverse problem be solved — can they be predicted from the acoustics? |
0:03:30 | how do people adapt to their structural differences to |
0:03:34 | create phonetic equivalence — because we all try to communicate using a shared phonological code and language? |
0:03:40 | and, as is often pointed out, what contributes to distinguishing speakers from one another from their |
0:03:44 | speech? |
0:03:45 | so i want to emphasise: not only are we trying to differentiate individuals from |
0:03:50 | their speech signal, but to understand what makes them different, from structure and function |
0:03:55 | so today i'll touch on some of this, |
0:03:59 | at a fairly high level |
0:04:02 | we'll try to see how we can quantify individual variability in vocal tract morphology, |
0:04:07 | try to see if we can predict some of it from the signal — and |
0:04:10 | what the bounds of that are, and so on — |
0:04:13 | ask how individual articulatory strategies differ, and whether we can exploit that for automatic speaker |
0:04:19 | recognition type applications, |
0:04:23 | and offer some interpretation while doing so |
0:04:25 | so, the approach we take in our laboratory: |
0:04:29 | one of my research groups is called SPAN, the Speech Production and Articulation kNowledge |
0:04:33 | group; it looks at a lot of different questions, including questions of variability, and we take a multimodal |
0:04:39 | approach |
0:04:39 | we look at different ways of getting at speech production: |
0:04:44 | MRI, which i'll talk about a lot today, audio, and other kinds of measurement |
0:04:48 | technologies, with a whole lot of multimodal processing — image processing, |
0:04:53 | speech processing — and modelling based on that |
0:04:57 | and we try to use |
0:04:58 | these kinds of engineering advances to gain insights about the dynamics of production, speaker variability, |
0:05:06 | and questions about speaking style, prosody, and emotions |
0:05:10 | so the rest of the talk is structured as follows |
0:05:14 | i'll focus the first part on how we can measure speech production — |
0:05:19 | how we get those images and so on — with a particular focus on |
0:05:24 | MRI, magnetic resonance imaging, something that we've been trying to develop a lot |
0:05:27 | and then, given the data, how do we analyze it, along with |
0:05:33 | some modeling questions |
0:05:35 | so |
0:05:36 | how do you do vocal tract imaging? |
0:05:39 | it has been very central to speech science for a long time — |
0:05:44 | the aim to observe and measure articulatory details, the tongue surface and so on — and |
0:05:50 | there are a number of techniques, each with its own strengths and limitations |
0:05:54 | for example, there are the early x-ray movies that were made, |
0:05:58 | like the ones Ken Stevens and others appear in |
0:06:03 | x-ray has pretty good temporal resolution, but it's not safe for |
0:06:07 | people, so it's no longer used; and then there are a number of other techniques, like ultrasound, |
0:06:13 | which provides you a partial view of the insides, not necessarily ideal for the kinds |
0:06:18 | of modeling we are after, and things like electropalatography, shown in this picture |
0:06:23 | so here actually is an x-ray — |
0:06:31 | in fact it is of Ken Stevens |
0:06:34 | with x-rays you only see some surfaces and parts of the structures — you can |
0:06:39 | see the edges |
0:06:41 | and this is electropalatography: an artificial palate that people speak with, worn like |
0:06:47 | a retainer, with contact electrodes |
0:06:50 | and so when we speak, the contact made by the tongue with the palate provides |
0:06:55 | some insight about the timing and coordination in speech |
0:07:01 | and finally, |
0:07:03 | electromagnetic articulography: we dry a person's tongue, |
0:07:06 | put little rice-crispy-like sensors on it, and measure the dynamics — |
0:07:12 | so these kinds of point measurements give you part of the picture |
0:07:15 | new possibilities were created with advances in MRI, |
0:07:19 | which provides very good soft tissue contrast |
0:07:24 | what it relies on is the water content of tissue — the hydrogen protons — |
0:07:30 | which varies across soft tissues, so we make use of that by |
0:07:34 | exciting the protons; as they relax, signals are generated, |
0:07:38 | and then we can image them |
0:07:41 | it's very exciting because it |
0:07:45 | provides you very good quality images, but traditional MRI is very slow |
0:07:50 | and it also has a lot of other challenges: it's very noisy — as you know if |
0:07:53 | you have ever been inside a scanner — |
0:07:55 | so producing speech sounds for experiments is a challenge; these are the things we have been contending |
0:08:00 | with over the last ten years |
0:08:01 | so the very first step, |
0:08:06 | the first breakthrough, around two thousand four, |
0:08:11 | was getting to |
0:08:12 | real-time imaging, that is, |
0:08:15 | getting to speeds — |
0:08:16 | sampling rates — that are higher than |
0:08:18 | the rates of speech itself, |
0:08:23 | on a par with syllable or articulation rates |
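The idea that imaging rates must outpace articulation rates can be sketched as a back-of-the-envelope Nyquist-style check. The event rates below are illustrative assumptions for this sketch, not figures from the talk.

```python
# Rough check of the frame rate needed to capture fast articulatory events.
# The event rates below are illustrative assumptions, not measured values.
def min_frame_rate(event_hz: float, margin: float = 2.0) -> float:
    """Minimum imaging rate (fps) to resolve a periodic event of event_hz,
    using a safety margin equal to the Nyquist factor."""
    return margin * event_hz

syllable_rate_hz = 5.0   # typical conversational syllable rate (assumed)
trill_rate_hz = 28.0     # tongue-tip trill cycle rate (assumed, ~25-30 Hz)

print(min_frame_rate(syllable_rate_hz))  # 10.0 fps suffices for syllables
print(min_frame_rate(trill_rate_hz))     # 56.0 fps needed for trills
```

On these assumed numbers, slow events are fine at the early real-time rates, while trill-like events motivate the much faster imaging described next.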
0:08:28 | let me show you a session |
0:08:34 | [plays real-time MRI video with synchronized audio] |
0:08:41 | so |
0:08:41 | if you're familiar with the rainbow passage, that's what people were reading; it was |
0:08:46 | very exciting for us to actually be able to do this |
0:08:49 | we were doing acoustic recordings — with a lot of speech enhancement work — inside the |
0:08:53 | MRI, and it was synchronised, so it kind of opened up a lot of different possibilities |
0:08:58 | for doing this kind of science |
0:09:00 | now, that rate |
0:09:02 | was not bad — |
0:09:04 | in principle good for a wide range of sounds — but we have been trying |
0:09:09 | to see if we can make it even better |
0:09:11 | because when you actually look at the kinds of rates |
0:09:16 | of the various gestures in speech, it's not one constant: we use a lot of |
0:09:19 | different articulators, each with its own movement rate, |
0:09:21 | from tongue-tip trills, like the "r" in spanish, |
0:09:23 | to other kinds of sounds — |
0:09:27 | they all have different rates |
0:09:28 | so if we could get at those kinds of rates, that would be really cool |
0:09:33 | so |
0:09:34 | in fact, last year we were able to make a breakthrough |
0:09:38 | and get up to around one hundred frames per second doing real-time MRI, |
0:09:44 | work with my students and postdocs |
0:09:46 | and not only can we image very fast speech — at fast speaking rates you can really |
0:09:51 | see what the tongue tip is doing — |
0:09:54 | but you can also image multiple planes simultaneously: what you see here is a sagittal |
0:10:00 | slice, like this, |
0:10:02 | or slices taken axially, like that, or coronally, like this; so we can get simultaneous |
0:10:07 | views of the vocal tract |
0:10:09 | so it's really exciting to be able to image at these really high rates |
0:10:12 | and get these |
0:10:14 | insights |
0:10:16 | and this was made possible by both hardware and algorithmic advances |
0:10:22 | we developed a custom receiver coil for |
0:10:27 | this purpose |
0:10:27 | and made a lot of progress in pulse sequence design, |
0:10:31 | but also in reconstruction using compressed sensing, things that have been happening in |
0:10:35 | signal processing broadly |
0:10:36 | so we were able to really |
0:10:38 | speed this up, and we're quite excited about it; this is how you run |
0:10:42 | an experiment: |
0:10:44 | someone is sitting there doing the audio collection; we reprogrammed the scanner |
0:10:48 | so that the audio is synchronised with the imaging |
0:10:53 | and we have an interactive |
0:10:56 | control system to select the scan plane and so on |
0:11:01 | [plays demo: the subject repeats test utterances while being imaged] |
0:11:24 | you get the idea — you can really see things; on |
0:11:28 | the projector it doesn't look that good, but the actual |
0:11:31 | data are really good; so now we are looking at production data at |
0:11:36 | scales which are conducive to the kinds of machine learning approaches one could use |
0:11:41 | although i won't be talking much about it, |
0:11:44 | we are also still working on the acoustic noise problem |
0:11:46 | in addition to doing single-plane or multi-plane slice imaging, we are also very interested in |
0:11:51 | volumetric imaging, if you want to characterize speakers — which is one of |
0:11:54 | the topics of research interest today |
0:11:58 | we really want the full 3d geometry while people are speaking |
0:12:03 | and we made some advances there too: with about seven seconds of holding a sound, |
0:12:09 | we can do full sweeps of |
0:12:11 | the entire vocal tract, and so we can get volumetric geometries of people's vocal tracts for a |
0:12:16 | set of sustained postures |
0:12:18 | in addition, |
0:12:19 | we can also do really high-resolution imaging of the anatomical structures — the tongue, the velum, and |
0:12:25 | so on — which we can do with classical static 3d MRI; and i'll |
0:12:30 | show you why we are doing all these things, making all these kinds of measurements: |
0:12:33 | we really want a comprehensive way of characterizing speakers, acoustically |
0:12:39 | and in terms of the vocal instrument and its behaviour |
0:12:43 | one of the things we decided recently was to release |
0:12:47 | a lot of these data; for speaker characterization, we released a multi-speaker dataset |
0:12:50 | with repeated sentences from each speaker, |
0:12:55 | with alignments and the image features and so on — it's all available |
0:13:00 | for free download |
0:13:04 | so here are some examples of that kind of data |
0:13:07 | [plays example recordings with imaging] |
0:13:20 | it's got five male and five female speakers |
0:13:22 | here are some of them |
0:13:26 | [plays more examples] |
0:13:33 | and we also have alignment — basically coregistration — of these data, and some algorithms for that |
0:13:38 | have also been released; so we have this kind of data that we can work |
0:13:42 | with — so what do you do with this stuff? |
0:13:45 | so i'll introduce some preliminary analyses |
0:13:49 | there's a lot of image processing to do; the very first thing is actually getting |
0:13:54 | at the structural details of the human vocal apparatus — people interested in |
0:14:00 | anatomy and morphometrics have devised ways |
0:14:04 | of measuring things: the length of the palate |
0:14:08 | and so on |
0:14:10 | and we wanted to do that very carefully with this kind of |
0:14:14 | imaging |
0:14:16 | on top of that, we also want to track the articulators, since |
0:14:20 | articulators serve important, specific tasks |
0:14:23 | so we want to be able to automatically process these things |
0:14:26 | so |
0:14:26 | the methodology we proposed was a model-based segmentation approach |
0:14:33 | it's a very nice mathematical formulation, work done by one of our students, |
0:14:38 | and he was able to create a segmentation algorithm that works fairly well |
0:14:45 | it does things like this: |
0:14:49 | [shows automatic segmentation running on real-time MRI video] |
0:14:52 | so doing that, we can capture the variables and timing automatically |
0:14:57 | from these vast amounts of data; one way to think about it is as a |
0:15:00 | kind of feature extraction |
0:15:04 | and then we can map these onto events that are linguistically meaningful to us |
0:15:09 | one of my close collaborators is Louis Goldstein, one of the founders of articulatory |
0:15:15 | phonology, together with Catherine Browman |
0:15:18 | they conceptualise speech production as a dynamical system |
0:15:22 | in which the various articulators are involved in tasks — basically forming and releasing constrictions as we |
0:15:29 | move around |
0:15:30 | so we are interested in features like, for example, |
0:15:33 | lip aperture and protrusion, |
0:15:36 | constriction degree and location, and so on, and we are able to extract these |
0:15:42 | automatically |
0:15:43 | [shows another video of automatically tracked constriction variables] |
0:15:50 | so we need to do all of this automatically: going from images to segmentations, |
0:15:56 | and then actually extracting linguistically meaningful |
0:16:02 | features |
0:16:06 | and then we can do things like extracting other kinds of representations — |
0:16:11 | for example, applying pca on these contours to |
0:16:14 | look at the contributions of different articulators, |
0:16:18 | and so on; so this provides some ways of |
0:16:22 | objectively characterizing this production information |
0:16:26 | and it is speaker specific |
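The contour-PCA step described here can be sketched minimally as follows. The data are synthetic stand-ins (each "frame" is a contour of K points flattened to a vector); the contour size, noise level, and variable names are illustrative assumptions, not the group's actual pipeline.

```python
import numpy as np

# PCA over vocal tract contours: each frame is a contour of K (x, y)
# points flattened to a 2K vector. Data here are synthetic.
rng = np.random.default_rng(0)
K = 50                                       # contour points per frame (assumed)
n_frames = 200
base = np.sin(np.linspace(0, np.pi, 2 * K))  # a fixed mean shape
frames = base + 0.1 * rng.standard_normal((n_frames, 2 * K))

X = frames - frames.mean(axis=0)             # center before PCA
# SVD-based PCA: rows of Vt are the principal deformation modes
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per mode

# Project each contour onto the first few modes: a low-dimensional code
codes = X @ Vt[:3].T
print(codes.shape)                           # (200, 3)
```

With real contours, the leading modes would correspond to dominant articulator movements, and the per-frame codes give compact, interpretable features.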
0:16:30 | so, so far i have told you about how |
0:16:34 | we get the data, and some of the basic analysis, with which we |
0:16:39 | can now start looking at speaker-specific properties |
0:16:45 | as i mentioned earlier, the first analysis is anatomical: how to characterise every |
0:16:50 | single vocal instrument |
0:16:52 | this kind of task has been studied pretty well in the anatomy literature, so |
0:16:56 | we went to look at |
0:16:57 | all that literature |
0:16:59 | and compiled a whole bunch of landmarks — you may call them |
0:17:05 | the landmarks of the vocal tract — |
0:17:07 | and came up with these kinds of measures that we can compute, like |
0:17:11 | vocal tract length, and the oral and pharyngeal cavity lengths, |
0:17:16 | and so on, which we can measure from these |
0:17:20 | kinds of very high-contrast images; so that's one source of speaker-specific information |
0:17:27 | as an aside: since we have many repetitions of the same tokens by |
0:17:31 | these people at different sessions, |
0:17:34 | we were interested in how consistent people are, and it was |
0:17:39 | reaffirming that people are fairly consistent in how they produce these tokens — |
0:17:45 | the measurements were very consistent |
0:17:48 | this, for example, shows the correlations between session means — |
0:17:51 | something that was presented at interspeech |
0:17:55 | so here's the striking thing: we have this physical environment — |
0:18:00 | the vocal tract within which we produce speech behaviour — and we want to know |
0:18:05 | how much of the behaviour is dictated by that environment, versus strategies that |
0:18:09 | are adopted by speakers and unique to them for various reasons which we |
0:18:13 | can't always pinpoint — |
0:18:15 | learning that they have done, or what the environment affords; so now we can |
0:18:21 | start deconstructing this a little bit |
0:18:25 | so next i'll use a few examples to illustrate this direction |
0:18:29 | for example, in this picture i want you to focus on the palatal |
0:18:33 | variation — the roof of the mouth, the hard palate, |
0:18:37 | the hard part behind the teeth — which is an important part of the |
0:18:41 | vocal apparatus; so here we see |
0:18:43 | this person |
0:18:45 | [traces the palate outlines with the cursor] |
0:19:05 | so we see that this one's hard palate is very domed, |
0:19:11 | this one is more posterior, |
0:19:14 | this one more anterior, here it's sharper — |
0:19:17 | and that is just six different people |
0:19:19 | so now, how do we begin to quantify what you are qualitatively seeing? |
0:19:30 | what one of my students did |
0:19:32 | was to take these kinds of extracted palate shapes and |
0:19:37 | start doing even simple pca analysis, |
0:19:41 | and showed that most of the variance could be explained by the |
0:19:45 | first few factors, |
0:19:47 | which were akin to interpretable properties: the first was the concavity of the palate; |
0:19:51 | the next was how front or back |
0:19:56 | this concavity sat; and then how sharp it was — so |
0:20:01 | these factors have interpretations, and they are actually very objective |
0:20:14 | and then we can actually |
---|
0:20:15 | plug in these kinds of things into models right the like for example "'cause" you |
---|
0:20:20 | coupons see what acoustic consequences of these variations |
---|
0:20:24 | right |
---|
0:20:24 | so one of things you finite is that |
---|
0:20:27 | that is very word that that's the first performance very much |
---|
0:20:32 | where like the anti r g how four or five or this that the product |
---|
0:20:36 | shapes a incorrectly if you sharpness really didn't matter at least from these for star |
---|
0:20:41 | simulations |
---|
0:20:42 | so from a data to zero |
---|
0:20:45 | a morphological characters we can actually see pretty interpret what a casino once we can |
---|
0:20:50 | expect |
---|
0:20:51 | right |
---|
0:20:52 | in fact, we can put this into an articulatory synthesiser and hear the |
0:20:57 | differences |
0:20:59 | [plays synthesized examples] |
0:21:02 | so with basically the same articulation, you can hear how different palate |
0:21:09 | shapes change the sound in different ways |
0:21:13 | so we can do this kind of analysis very carefully |
0:21:18 | of course, we are also interested now in the inverse problem: can we estimate |
0:21:22 | these shapes given the acoustic signal? how much of the |
0:21:27 | palate shape detail is available to us? |
0:21:30 | so we did the classic thing: |
0:21:34 | we extracted all kinds of features from the |
0:21:37 | acoustic signal — but we have to realise |
0:21:42 | that the shape influences the signal in two ways: |
0:21:46 | through the environment itself, and through the movements, the behaviours — |
0:21:50 | because |
0:21:52 | the environment constrains how we articulate, |
0:21:55 | and what we have — |
0:21:56 | both influence the signal |
0:21:59 | so now let's see what the signal offers |
0:22:01 | and we showed, in a very simple first experiment, that we can |
0:22:05 | detect the palate shape class — |
0:22:06 | concave versus flat — about sixty-some percent of the time: we can guess what kind |
0:22:10 | of palate somebody has just from the acoustic signal; so the morphological information is |
0:22:14 | available |
0:22:15 | a more interesting question concerns |
0:22:18 | a very classic morphological parameter that we've been using a lot: vocal |
0:22:24 | tract length; this is something that has of course been important in speech recognition |
0:22:28 | and elsewhere — |
0:22:31 | we want to |
0:22:33 | normalize for it, and also to estimate it, for things like age recognition and |
0:22:39 | so on |
0:22:39 | right, so here again the same question: |
0:22:42 | we have some speaker-specific anatomy |
0:22:46 | reflected in the signal, |
0:22:47 | and we want to see how much we can grab to pinpoint the speaker |
0:22:52 | we know that to some extent speakers compensate for whatever |
0:22:57 | environment they have, and we want to know how much |
0:23:02 | of it is residual — something you can actually |
0:23:05 | infer |
0:23:05 | again, vocal tract length: i start with this because it's a classic |
0:23:09 | question that people have been asking; so for example, here is data from Vorperian |
0:23:13 | and colleagues from two thousand nine |
0:23:16 | showing vocal tract length growth with age — |
0:23:21 | over the years it goes from about six centimetres up to seventeen-and-a-half or |
0:23:27 | eighteen centimetres long |
0:23:29 | and there's some |
0:23:30 | differentiation that happens empirically between males and females after puberty |
0:23:35 | and correspondingly this |
0:23:37 | affects things like the formant space in the spectrum |
0:23:40 | now, |
0:23:42 | by zeroing in on the first formant and its range, |
0:23:47 | we can see that a shorter vocal tract and |
0:23:52 | a longer vocal tract have formant spaces |
0:23:56 | that |
0:23:58 | get compressed |
0:23:59 | and shifted — these kinds of things happen |
0:24:02 | and what people have been doing, implicitly or explicitly, when we do vtln |
0:24:07 | is basically to normalize for this effect |
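The normalization idea behind VTLN can be sketched with the conventional piecewise-linear frequency warp. The warp factor, cut-off, and band edge below are illustrative assumptions, not the talk's settings.

```python
def vtln_warp(f_hz: float, alpha: float,
              f_cut: float = 4800.0, f_max: float = 8000.0) -> float:
    """Piecewise-linear VTLN warp of a frequency axis.
    Below f_cut the axis is scaled by alpha; above it, a second linear
    segment maps f_max to f_max so the band edge is preserved.
    alpha < 1 mimics a longer vocal tract; alpha > 1 a shorter one."""
    if f_hz <= f_cut:
        return alpha * f_hz
    # second segment through (f_cut, alpha*f_cut) and (f_max, f_max)
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f_hz - f_cut)

print(vtln_warp(1000.0, 0.9))   # ~900.0: low band scaled down
print(vtln_warp(8000.0, 0.9))   # ~8000.0: band edge preserved
```

In a recognizer, alpha is typically chosen per speaker (e.g., by maximum likelihood over a grid) and the warp is applied to the filterbank rather than to formants directly.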
0:24:12 | so the classic estimation of vocal tract length goes back |
0:24:17 | to a very simple model: |
0:24:21 | treating the vocal tract at rest as a uniform tube, we can estimate the length |
0:24:26 | of the vocal tract from |
0:24:28 | the formants |
0:24:29 | right — so given some measured formant frequencies, you can estimate |
0:24:35 | the length parameter |
0:24:39 | one of the early works on this kind of prediction |
0:24:44 | relies on the third and fourth formants, and other |
0:24:50 | people have proposed variants |
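The uniform-tube model mentioned here can be written down in a few lines. A tube closed at the glottis and open at the lips resonates at odd multiples of c/(4L); the speed-of-sound constant and the example length are textbook approximations, not values from the talk.

```python
# Quarter-wavelength tube: the textbook model behind formant-based
# vocal tract length estimates. Constants are approximate.
C_SOUND = 35000.0   # speed of sound in warm, moist air, cm/s (approx.)

def tube_formants(length_cm: float, n: int = 4) -> list:
    """First n resonances (Hz) of a uniform quarter-wave tube."""
    return [(2 * k - 1) * C_SOUND / (4.0 * length_cm) for k in range(1, n + 1)]

def length_from_formant(f_hz: float, k: int) -> float:
    """Invert the model: tube length (cm) implied by the k-th formant."""
    return (2 * k - 1) * C_SOUND / (4.0 * f_hz)

print(tube_formants(17.5))             # [500.0, 1500.0, 2500.0, 3500.0]
print(length_from_formant(3500.0, 4))  # 17.5
```

Higher formants (F3, F4) are preferred for inversion because they are less perturbed by vowel-specific tongue shaping than F1 and F2.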
0:24:51 | what we decided was: now that we actually have |
0:24:54 | direct measurements of vocal tract length together with the acoustics, |
0:24:57 | can we come up with better regression models? |
0:25:00 | and sure enough, we showed, on our MRI corpus, |
0:25:05 | that we can get really good estimates, with very high correlations |
0:25:10 | to the measured vocal tract lengths |
0:25:12 | and this is very interesting: we are able to |
0:25:15 | regress a good model, estimate the model parameters, |
0:25:18 | and now we are able to estimate vocal tract length as yet |
0:25:22 | another morphological detail of the person from the acoustics — |
0:25:25 | that's kind of exciting |
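The regression idea can be sketched on toy data. Here the training pairs are generated from the uniform-tube approximation plus noise; the noise level, population range, and feature choice are assumptions for the sketch, and with real articulatory measurements the same least-squares recipe applies.

```python
import numpy as np

# Toy regression from formants to vocal tract length, trained on
# synthetic pairs generated from the uniform-tube approximation.
rng = np.random.default_rng(1)
C = 35000.0                              # cm/s, approximate
lengths = rng.uniform(12.0, 19.0, 500)   # assumed population range, cm
# third and fourth "formants" from the tube model, perturbed by 2% noise
F = np.stack([(2 * k - 1) * C / (4 * lengths) for k in (3, 4)], axis=1)
F *= 1.0 + 0.02 * rng.standard_normal(F.shape)

# Regress length on inverse formants (1/F is proportional to length)
X = np.hstack([1.0 / F, np.ones((len(lengths), 1))])
coef, *_ = np.linalg.lstsq(X, lengths, rcond=None)
pred = X @ coef
r = np.corrcoef(pred, lengths)[0, 1]
print(round(float(r), 3))   # correlation close to 1 on this toy data
```

The point is only the shape of the recipe: choose acoustic features with a physical link to length, fit on ground-truth lengths, and report the held-out correlation.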
0:25:27 | one last slide on this: |
0:25:32 | it summarizes what i just said — computation, better estimation, |
0:25:36 | the availability of data, and good statistical methods allow us to get |
0:25:40 | better insights |
0:25:42 | now, |
0:25:42 | moving on, |
0:25:44 | let's look inside the vocal tract; the vocal tract is kind of a confined construct — |
0:25:49 | it's bounded by hard structure — and the tongue, |
0:25:54 | which is pretty remarkable, actually plays a big role in how we shape |
0:25:59 | the tract |
0:26:01 | so the question we ask is: okay, |
0:26:04 | we have |
0:26:06 | vocal tract length and formant frequencies, as in the same chart i was showing before; |
0:26:10 | we normalize using linear normalization, which is what is typically done, but |
0:26:15 | we still have residual differences that are unexplained; people have |
0:26:21 | proposed nonlinear vocal tract normalisation, but it is very limited in what it can do; |
0:26:26 | so what we want to know is whether the residual effect |
0:26:30 | actually |
0:26:32 | tells us something about the size of the tongue that people have, |
0:26:36 | and whether that can be accounted for automatically |
0:26:38 | so |
0:26:40 | the hypothesis here is that tongue size, and relative tongue shape |
0:26:44 | differences |
0:26:47 | across people, |
0:26:49 | will explain some of the vowel space differences |
0:26:52 | okay |
0:26:53 | so |
0:26:54 | the questions we have on this slide are: |
0:27:00 | how does one define and measure tongue size? |
0:27:03 | how does tongue size vary across the population? |
0:27:09 | what is the effect of tongue size on articulation? |
0:27:11 | and |
0:27:13 | is it |
0:27:14 | visible in the acoustics — |
0:27:16 | can it be predicted and normalized? |
0:27:19 | the same questions as before; there is very little published work on this kind of thing |
0:27:23 | people know that there is a coordinated, global growth of the vocal |
0:27:28 | tract as we develop |
0:27:30 | and there are some disorders that are usually |
0:27:35 | associated with large tongue sizes |
0:27:39 | so what effect does a large tongue have |
0:27:42 | on how we produce speech? things like palatalization of coronal sounds — sounds |
0:27:50 | made with the tongue tip and blade, |
0:27:52 | like "s" and "t" and so on — |
0:27:56 | palatalization being when the tongue body gets drawn up toward the palate — |
0:28:01 | or using the tongue almost like an assist, lingual involvement in producing |
0:28:06 | bilabial sounds like "p" and "b", |
0:28:10 | and other coarticulation effects, and slowing of speech rate, because you have a larger mass to control, |
0:28:15 | and so on |
0:28:16 | these are things that have been mentioned, but not |
0:28:18 | much quantified |
0:28:21 | so |
0:28:22 | we set out and said: well, we have lots of data — |
0:28:25 | can we estimate a mean tongue posture |
0:28:29 | from the segmentations |
0:28:34 | and come up with some proxy measure for tongue size? there are more things you can do with it, |
0:28:38 | but once you do that, we can actually plot the distributions of |
0:28:42 | tongue size across the male and female speakers in our corpus |
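One plausible form such a proxy could take — midsagittal area from a binary segmentation mask, scaled by pixel size — is sketched below on a synthetic mask. The function, resolution, and disc-shaped "tongue" are illustrative assumptions, not the talk's actual measure.

```python
import numpy as np

# A simple tongue-size proxy: count pixels in a binary midsagittal
# segmentation mask and scale by the pixel area. The mask here is a
# synthetic disc; mask extraction itself is the hard part in practice.
def tongue_area_cm2(mask: np.ndarray, pixel_mm: float) -> float:
    """Midsagittal tongue area (cm^2) from a boolean mask and pixel size."""
    pixel_cm2 = (pixel_mm / 10.0) ** 2
    return float(mask.sum()) * pixel_cm2

# synthetic "tongue": a disc of radius 20 pixels at 1 mm resolution
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 <= 20 ** 2
area = tongue_area_cm2(mask, pixel_mm=1.0)
print(round(area, 2))   # close to pi * (2 cm)^2, about 12.57
```

Averaging such areas over frames of a neutral posture would give one per-speaker number whose distribution can then be compared across groups, as in the plot described next.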
0:28:46 | so what we see it |
---|
0:28:48 | the green |
---|
0:28:49 | e |
---|
0:28:50 | female i'm all your |
---|
0:28:53 | i don't average so there's significant setup |
---|
0:28:57 | six difference easy |
---|
0:28:58 | in the time |
---|
0:29:00 | size so yet another we can get added from the acoustic signal |
---|
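A minimal sketch of the kind of group comparison being described here, assuming hypothetical per-speaker tongue-size proxy values; the numbers and variable names are illustrative, not the corpus data:

```python
import statistics

# Hypothetical tongue-size proxy values (arbitrary units), one per speaker.
# Real values would come from segmented MRI volumes; these are illustrative.
male_proxy = [62.1, 58.4, 65.0, 60.2, 63.7]
female_proxy = [51.3, 54.0, 49.8, 52.6, 50.9]

def cohens_d(a, b):
    """Effect size for the difference between two group means
    (difference of means divided by the pooled standard deviation)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled ** 0.5

d = cohens_d(male_proxy, female_proxy)
print(f"mean male={statistics.mean(male_proxy):.1f}, "
      f"female={statistics.mean(female_proxy):.1f}, d={d:.2f}")
```

A formal claim of "significant sex differences" would additionally need a test such as a two-sample t-test on the real proxy measures.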
0:29:04 | yet another sort of interpretable |
---|
0:29:06 | sort of |
---|
0:29:08 | marker |
---|
0:29:09 | that said |
---|
0:29:11 | because of that |
---|
0:29:13 | how well we can estimate this particular structural property from the signal |
---|
0:29:17 | is still not really well established it remains an open question how you really |
---|
0:29:23 | assess this thing |
---|
0:29:25 | but |
---|
0:29:26 | we have taken sort of a shot |
---|
0:29:28 | so we computed two different kinds of normalization factors looking at rest shape and |
---|
0:29:33 | shape during movement there is not much difference between them they are pretty highly correlated |
---|
0:29:39 | so once you have that |
---|
0:29:41 | we can actually use this information in simulations say for example with an articulatory |
---|
0:29:45 | model people still study speech production |
---|
0:29:48 | with models we inherited from |
---|
0:29:52 | people like Maeda and Gunnar Fant |
---|
0:29:56 | there you can actually reflect this back and try to study it through analysis by |
---|
0:30:00 | synthesis |
---|
0:30:01 | so if you have a larger tongue you can expect longer constrictions and so on so |
---|
0:30:05 | what we did was to vary based on the measurements we had |
---|
0:30:09 | different constriction degrees and |
---|
0:30:13 | locations just to see how tongue size differences would play a role in the acoustics of |
---|
0:30:18 | vowels |
---|
0:30:19 | so what we observed was that the tongue size differences in the population we had |
---|
0:30:24 | and what was estimated by the simulations were very well correlated in terms of formant |
---|
0:30:29 | patterns |
---|
0:30:30 | so that was very nice so what you can see here is that |
---|
0:30:33 | in the simulations the formants |
---|
0:30:36 | moved in |
---|
0:30:39 | the same direction as in the real data |
---|
0:30:43 | so the general trends held up |
---|
0:30:46 | so all in all in this pilot we saw that tongue size |
---|
0:30:51 | varies across speakers quite a bit fifteen up to thirty percent |
---|
0:30:56 | and a consequence of a larger tongue is |
---|
0:31:00 | longer constrictions being made in the vocal tract as we produce sounds and constrictions |
---|
0:31:04 | are very central to how we produce the various speech sounds |
---|
0:31:08 | the data suggest this stretches or twists the vowel space so that is |
---|
0:31:14 | something the signal reflects |
---|
0:31:15 | but |
---|
0:31:17 | this |
---|
0:31:18 | interplay between constriction formation and tongue size is fairly complex and requires much more sophisticated |
---|
0:31:24 | modeling |
---|
0:31:25 | to get at |
---|
0:31:27 | but hopefully with data these things can be pursued |
---|
0:31:32 | further |
---|
0:31:33 | so the final thing sort of another note on speaker specific |
---|
0:31:36 | behaviour |
---|
0:31:37 | is to actually talk about articulatory strategies |
---|
0:31:40 | okay what I mean by that is how talkers move their vocal tracts right so |
---|
0:31:45 | as you know the vocal tract is actually a pretty clever system a very |
---|
0:31:48 | redundant system it has got a lot of tolerance built in |
---|
0:31:52 | you can use different articulators to create the same sound to complete |
---|
0:31:57 | the same task for example |
---|
0:31:59 | you can move the jaw or the lips to |
---|
0:32:01 | contribute to bilabial constrictions like when making p and b and one |
---|
0:32:06 | person may use the jaw more and another the lips |
---|
0:32:09 | and people have several ways to change their vocal tract shapes to do this |
---|
0:32:13 | so we call these articulatory strategies and some of these are speaker specific some |
---|
0:32:16 | of these are language specific so we wanted to get at this because it is again yet another piece |
---|
0:32:22 | of the puzzle as we try to understand what makes |
---|
0:32:25 | me different from you when we produce the speech signal |
---|
0:32:29 | beyond just knowing that I am different from you from the speech |
---|
0:32:33 | okay |
---|
0:32:33 | so the approach here is again very early work |
---|
0:32:36 | so we have lots of |
---|
0:32:38 | real time MRI data |
---|
0:32:40 | the database we collected includes a pilot |
---|
0:32:45 | study of eighteen speakers with all the volumetric scans and all that |
---|
0:32:48 | in very rich detail |
---|
0:32:50 | and so we can actually |
---|
0:32:53 | get at characterizing the morphology of each speaker |
---|
0:32:57 | once we have that we established what we call speaker specific forward maps |
---|
0:33:01 | from the vocal tract shapes to the constrictions so imagine |
---|
0:33:07 | the shape changes that create a task being governed by a dynamical system we can actually |
---|
0:33:12 | estimate the forward maps |
---|
0:33:15 | in a data driven sense |
---|
0:33:17 | and then we can |
---|
0:33:19 | take each of these speakers' forward maps |
---|
0:33:21 | plug them back into a synthesis model |
---|
0:33:24 | which is the dynamical systems model we use called task dynamics |
---|
0:33:28 | and see the contributions of the various articulators people use to predict |
---|
0:33:33 | the strategies people adopt |
---|
0:33:37 | so |
---|
0:33:38 | again as a reminder we can go from the data to extract sort |
---|
0:33:43 | of air tissue contours and do PCA to extract |
---|
0:33:47 | factors for the articulators how much the jaw contributes what the tongue factors are |
---|
0:33:52 | and so on |
---|
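The contour-to-factor step can be sketched generically: PCA over flattened vocal-tract contours yields a mean shape plus principal factors, and the per-frame factor scores indicate how strongly each factor is expressed in each frame. The array shapes and random data here are illustrative, not the actual pipeline:

```python
import numpy as np

def articulator_factors(contours, n_factors=2):
    """PCA on flattened contours: returns the mean shape, the leading
    shape factors, and per-frame factor scores. With real data the
    leading factors tend to align with articulators (jaw, tongue, etc.)."""
    X = contours.reshape(contours.shape[0], -1)      # frames x (points * 2)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD gives the principal directions without forming the covariance matrix.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    factors = Vt[:n_factors]                         # (n_factors, points * 2)
    scores = Xc @ factors.T                          # contribution per frame
    return mean, factors, scores

# Illustrative data: 100 frames of a 20-point midsagittal contour (x, y).
rng = np.random.default_rng(0)
contours = rng.normal(size=(100, 20, 2))
mean, factors, scores = articulator_factors(contours)
print(scores.shape)
```

In practice the contours would come from air-tissue boundary segmentation of the MRI frames rather than random numbers.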
0:33:52 | and then |
---|
0:33:53 | from that we can go and estimate the various constrictions at the places of articulation |
---|
0:33:58 | you are probably familiar with |
---|
0:34:00 | along the vocal tract we mark six different anatomical regions like the |
---|
0:34:05 | alveolar ridge the hard palate the velum the pharyngeal regions and so on |
---|
0:34:10 | and we can |
---|
0:34:12 | automatically estimate |
---|
0:34:13 | the degree of constriction people make at each of these |
---|
0:34:16 | so |
---|
0:34:18 | we have some insights from the about eighteen speakers that we analyzed this is again |
---|
0:34:23 | work by Tanner Sorensen |
---|
0:34:25 | briefly as presented at Interspeech the way we went about it was to use a model |
---|
0:34:31 | based approach |
---|
0:34:32 | so |
---|
0:34:33 | we approximated the speaker specific forward map for each speaker from the MRI |
---|
0:34:37 | data from the eighteen speakers |
---|
0:34:40 | then simulated with task dynamics which you may know from the |
---|
0:34:44 | motor control sort of |
---|
0:34:46 | literature as a dynamical systems framework |
---|
0:34:48 | the dynamical systems are basically |
---|
0:34:52 | control systems written in state space form |
---|
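As a rough illustration of the state-space dynamical systems mentioned here: in task-dynamics style models a constriction goal is commonly treated as a critically damped second-order point attractor. This is a generic sketch under that assumption, not the authors' implementation, and the gains and units are arbitrary:

```python
def point_attractor(target, x0, k=100.0, dt=0.001, steps=500):
    """Critically damped second-order system
        x'' = -k (x - target) - 2 sqrt(k) x'
    integrated with forward Euler. Returns the trajectory of the task
    variable (e.g. a lip-aperture constriction degree)."""
    b = 2.0 * k ** 0.5          # critical damping: no overshoot
    x, v = x0, 0.0
    traj = [x]
    for _ in range(steps):
        a = -k * (x - target) - b * v
        v += a * dt
        x += v * dt
        traj.append(x)
    return traj

traj = point_attractor(target=0.0, x0=1.0)
print(f"final aperture: {traj[-1]:.3f}")  # approaches the target smoothly
```

Speaker-specific forward maps then translate such task-variable trajectories into individual articulator motions.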
0:34:54 | and then we were able to interpret the results so one of the results here |
---|
0:34:58 | that I would like to show basically represents the ratio of lips to |
---|
0:35:03 | jaw that is how much lip |
---|
0:35:06 | and/or jaw speakers use to create the various constrictions bilabial alveolar palatal |
---|
0:35:12 | velar pharyngeal along the vocal tract |
---|
0:35:15 | and you see that there is |
---|
0:35:17 | a different |
---|
0:35:18 | ratio of how much people use each one |
---|
0:35:22 | one means relying more on the lips |
---|
0:35:25 | zero means using more |
---|
0:35:27 | jaw so for different constrictions people use |
---|
0:35:29 | different ways of creating the constrictions in fact |
---|
0:35:34 | for the labial sounds we see that the lips generally |
---|
0:35:39 | contribute more than the jaw |
---|
0:35:42 | except for constrictions close to the front of the mouth that involve the tongue and |
---|
0:35:48 | the speakers in our set each individual speaker |
---|
0:35:51 | varies in how they choose to create the same kinds of constrictions so |
---|
0:35:56 | people differ in their articulatory strategies |
---|
0:35:59 | so this is a very early insight into how much a speaker |
---|
0:36:04 | uses the jaw versus the lips whether there is a functional specificity to it and what it says |
---|
0:36:09 | about their motor planning |
---|
0:36:10 | these are questions that are begging for a more computational approach |
---|
0:36:16 | and now with the data in hand we can go and see |
---|
0:36:20 | how people actually use the vocal instrument in producing these sounds |
---|
0:36:26 | that we call speech |
---|
0:36:29 | so for the final piece now we get to the kind of slides we have been seeing |
---|
0:36:32 | at this conference |
---|
0:36:35 | so we have also explored a little bit |
---|
0:36:38 | whether production information can be of use in |
---|
0:36:42 | speaker recognition type experiments so we did a little bit of work on speaker |
---|
0:36:48 | verification with production data there is not much data so it is not conclusive |
---|
0:36:54 | but people have pretty much wondered about things like this so the question one |
---|
0:36:58 | can ask is |
---|
0:37:00 | will speech production data be of any use at all in speaker verification |
---|
0:37:04 | now we know that at this point acquiring data like what I have been showing |
---|
0:37:10 | x-ray or MRI |
---|
0:37:11 | is not |
---|
0:37:12 | feasible in operational conditions |
---|
0:37:15 | right so we need to be able to have some articulatory type representation so people |
---|
0:37:20 | have been working on the inversion problem that is |
---|
0:37:23 | given the |
---|
0:37:25 | acoustics |
---|
0:37:26 | can we estimate articulatory parameters this is a classic and classically ill-posed problem |
---|
0:37:31 | where I feel that deep learning type approaches can be very powerful because |
---|
0:37:35 | it is a very nonlinear process so these things are very conducive to learning these |
---|
0:37:39 | mappings |
---|
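A minimal sketch of frame-wise acoustic-to-articulatory inversion posed as supervised regression, with synthetic data standing in for real parallel acoustic and articulatory recordings. Actual systems, including the deep-learning approaches mentioned here, use nonlinear models; a regularized linear map is just the simplest baseline:

```python
import numpy as np

def fit_inversion_map(acoustic, articulatory, reg=1e-3):
    """Ridge-regularized least-squares map from acoustic frames
    (e.g. MFCC vectors) to articulator positions."""
    A = np.hstack([acoustic, np.ones((len(acoustic), 1))])   # add bias column
    W = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]),
                        A.T @ articulatory)
    return W

def invert(acoustic, W):
    """Predict articulator positions for new acoustic frames."""
    A = np.hstack([acoustic, np.ones((len(acoustic), 1))])
    return A @ W

# Synthetic stand-in: articulator positions are a noisy linear function
# of 13-dimensional "MFCC" frames; real data would replace X and Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
true_map = rng.normal(size=(13, 6))          # 6 articulator channels
Y = X @ true_map + 0.01 * rng.normal(size=(500, 6))

W = fit_inversion_map(X, Y)
Y_hat = invert(X, W)
print(f"RMS error: {np.sqrt(np.mean((Y - Y_hat) ** 2)):.4f}")
```

The ill-posedness shows up in real data as a one-to-many mapping, which is one reason nonlinear and context-aware models outperform a frame-wise linear fit.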
0:37:41 | nevertheless what we wanted was to do a speaker independent mapping |
---|
0:37:45 | right so in this work from a few years ago we |
---|
0:37:52 | said well |
---|
0:37:52 | if I cannot really learn the |
---|
0:37:54 | acoustic to articulatory mapping across people |
---|
0:37:56 | why not use an exemplary talker I have lots of data from one |
---|
0:38:00 | single speaker just like in synthesis where you often take |
---|
0:38:03 | the properties from one talker and then try to reproduce them |
---|
0:38:08 | and then we can project anyone else's acoustics onto this |
---|
0:38:12 | speaker's maps to see how this talker would have produced those acoustics |
---|
0:38:17 | to get some semblance of an articulatory representation |
---|
0:38:20 | so |
---|
0:38:22 | that we can derive speaker independent sort of measures so that was sort |
---|
0:38:26 | of the idea so we said well we can use a reference speaker |
---|
0:38:31 | to create an articulatory acoustic mapping train the inverse model and |
---|
0:38:37 | then when you get a test speaker's |
---|
0:38:39 | acoustic signal |
---|
0:38:42 | we can actually derive inverted sort of features and use these to see if |
---|
0:38:48 | there is |
---|
0:38:48 | any benefit the rationale there is that |
---|
0:38:53 | it produces projections in a |
---|
0:38:57 | more robust way and constrains them in a way that |
---|
0:39:00 | provides sort of |
---|
0:39:02 | physically meaningful constraints on how we partition the signal so |
---|
0:39:05 | there might be some advantage to be had |
---|
0:39:08 | so this was published |
---|
0:39:11 | earlier this year |
---|
0:39:13 | in CSL |
---|
0:39:15 | so |
---|
0:39:15 | for the front end in some of these first experiments we used the |
---|
0:39:20 | x-ray microbeam database which is available and has a lot of speakers |
---|
0:39:25 | and a standard |
---|
0:39:27 | GMM based back end because you do not have that much data |
---|
0:39:32 | and here are some of the initial results if you use just |
---|
0:39:37 | MFCCs only |
---|
0:39:39 | you get for this small set and it is a pretty noisy data set |
---|
0:39:44 | about |
---|
0:39:46 | seven point five percent equal error rate but if you actually add the |
---|
0:39:50 | real articulation |
---|
0:39:52 | the measured articulation you actually get a nice boost |
---|
0:39:56 | in performance |
---|
0:39:57 | providing sort of nice complementary information which is kind of encouraging you might |
---|
0:40:02 | think about it as an oracle experiment or upper bound if you have the measurements |
---|
0:40:06 | now if you use the inverted sort of measurements instead what we showed is that |
---|
0:40:12 | they compare really well slightly better and putting them together actually provides |
---|
0:40:17 | an additional boost that is pretty significant actually |
---|
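The combination described here can be sketched at the score level: per-trial scores from an acoustic system and an articulatory system are merged with a tuned weight, and the fused system is evaluated by equal error rate. The weight and the trial scores below are illustrative; real systems tune the combiner on held-out data:

```python
def fuse_scores(acoustic_scores, artic_scores, w=0.7):
    """Weighted-sum score-level fusion of two recognizers.
    w would normally be tuned on a development set."""
    return [w * a + (1.0 - w) * b
            for a, b in zip(acoustic_scores, artic_scores)]

def equal_error_rate(target, nontarget):
    """Coarse EER: smallest max(false-accept, miss) over a threshold sweep."""
    best = 1.0
    for t in sorted(set(target) | set(nontarget)):
        fa = sum(s >= t for s in nontarget) / len(nontarget)
        miss = sum(s < t for s in target) / len(target)
        best = min(best, max(fa, miss))
    return best

# Illustrative trial scores (higher = more likely the same speaker).
tgt_ac, non_ac = [2.1, 1.8, 2.5, 0.9], [0.2, -0.5, 1.0, 0.1]
tgt_ar, non_ar = [1.2, 1.9, 1.1, 1.5], [0.3, 0.8, -0.2, 0.4]
fused_tgt = fuse_scores(tgt_ac, tgt_ar)
fused_non = fuse_scores(non_ac, non_ar)
print(f"fused EER: {equal_error_rate(fused_tgt, fused_non):.2f}")
```

Whether fusion helps depends on the two systems carrying complementary information, which is exactly the claim being tested with measured versus inverted articulation.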
0:40:20 | so this is encouraging if you have lots of data that is |
---|
0:40:24 | if you have |
---|
0:40:25 | enough data to create these maps for speakers and we need just one exemplar |
---|
0:40:29 | in each case |
---|
0:40:30 | then we can provide an additional source of information |
---|
0:40:32 | perhaps it will give us some wins but maybe also some insight into why |
---|
0:40:37 | people are different or which categories of articulation or structure or strategy |
---|
0:40:42 | differ and by how much |
---|
0:40:47 | so this is just the standard setup |
---|
0:40:51 | the first |
---|
0:40:52 | experiment showing the same trends on the full |
---|
0:40:55 | x-ray microbeam database |
---|
0:40:57 | so |
---|
0:40:59 | to summarize the speaker recognition experiments the note I will sound is that the step |
---|
0:41:04 | of using both acoustic and articulatory information |
---|
0:41:07 | yields a significant benefit |
---|
0:41:10 | if you use the measured articulatory information with the standard acoustic features |
---|
0:41:16 | and the gains are more modest |
---|
0:41:18 | if you instead use estimated articulatory information |
---|
0:41:22 | so what would be nice is to actually look at new ways of doing inversion |
---|
0:41:27 | with the kinds of advances that are happening right now |
---|
0:41:31 | in neural networks |
---|
0:41:32 | and the availability of data more and more data |
---|
0:41:35 | to do |
---|
0:41:36 | all of |
---|
0:41:37 | this better |
---|
0:41:38 | and to be able to evaluate larger sort of acoustic data sets from SRE |
---|
0:41:42 | like evaluation campaigns |
---|
0:41:45 | so moving forward from this |
---|
0:41:48 | we are very excited about some of this actually |
---|
0:41:52 | the earlier work was done with my collaborators at Lincoln Laboratory |
---|
0:41:58 | and others |
---|
0:41:59 | and there was parallel work by my students as well |
---|
0:42:03 | and so we had some initial pilot work and then |
---|
0:42:06 | I recently got an NSF grant actually to extend this line of work |
---|
0:42:10 | to actually |
---|
0:42:12 | look at what we are doing in speech science |
---|
0:42:14 | so we are excited about it |
---|
0:42:16 | and so our idea is to do this very systematically we are set to collect from about |
---|
0:42:21 | two hundred subjects |
---|
0:42:22 | all of this |
---|
0:42:24 | real time and volumetric MRI in |
---|
0:42:26 | detail and share it with people |
---|
0:42:28 | and |
---|
0:42:31 | we describe this sort of protocol in an upcoming paper |
---|
0:42:36 | and this is the kind of material if you are interested I will show the slides and |
---|
0:42:41 | we welcome suggestions we have collected data from about ten |
---|
0:42:44 | speakers on this protocol so far |
---|
0:42:47 | since the project started |
---|
0:42:49 | everything from read material like the rainbow passage to |
---|
0:42:52 | all kinds of spontaneous speech and so on |
---|
0:42:56 | if you have any suggestions or ideas on what would be useful for speaker modeling |
---|
0:43:01 | from data like this we would be happy to consider them |
---|
0:43:04 | most of the subjects will be native speakers of English and about twenty percent will be |
---|
0:43:08 | nonnative speakers recorded in English |
---|
0:43:11 | but we have other projects that collect data from people speaking other languages everything |
---|
0:43:17 | from African languages to others |
---|
0:43:20 | so finally beyond getting insights into inter speaker variability we can also look at |
---|
0:43:25 | some of these clinical use cases |
---|
0:43:27 | such as the developing vocal tract going from kids to adults |
---|
0:43:32 | and how speaker variability manifests in the signal so for example |
---|
0:43:37 | we have been working with people who have had operations for oral cancer |
---|
0:43:42 | where the surgical interventions essentially remove |
---|
0:43:47 | parts of the tongue |
---|
0:43:48 | on top of that there are other therapeutic sorts of treatments like radiation and |
---|
0:43:51 | so on |
---|
0:43:53 | for these people |
---|
0:43:53 | all of which modify or physically damage the structures |
---|
0:43:58 | so here we see two |
---|
0:44:00 | such patients |
---|
0:44:02 | where |
---|
0:44:03 | one basically lost much of the tongue to cancer at the tongue base |
---|
0:44:08 | that is the rear portion |
---|
0:44:10 | and it was replaced reconstructed with a flap from the forearm |
---|
0:44:15 | so you see quite a variation compared with typical anatomy here |
---|
0:44:21 | so how does their speech cope because being able to speak is |
---|
0:44:25 | one of the big quality of life measures |
---|
0:44:26 | so we think this also gives us additional insights when looking |
---|
0:44:31 | at speaker variability |
---|
0:44:35 | the interesting thing is in some of these cases we have recordings from before |
---|
0:44:39 | the surgery |
---|
0:44:39 | and we have access to these speakers |
---|
0:44:44 | and have collected a lot of data from them |
---|
0:44:46 | and so we can compare |
---|
0:44:49 | how they compensate what strategies they use because such a person often |
---|
0:44:54 | still speaks surprisingly intelligibly so |
---|
0:44:57 | this provides an additional source of information to understand this question of individual variability |
---|
0:45:05 | so in conclusion |
---|
0:45:07 | the point I want to make is that data are integral to advancing speech communication |
---|
0:45:13 | research and vocal tract information plays a crucial part of this picture |
---|
0:45:18 | I believe |
---|
0:45:20 | so to do that we need to gather data from lots of different sources |
---|
0:45:24 | to get a complete picture of speech production |
---|
0:45:27 | and that is |
---|
0:45:28 | quite challenging from a technological and computational |
---|
0:45:32 | as well as a conceptual and theoretical perspective |
---|
0:45:35 | but |
---|
0:45:36 | I do believe that the rewards are there for many applications including machine |
---|
0:45:41 | speech recognition and speaker modeling |
---|
0:45:44 | but I think this sort of |
---|
0:45:47 | approach is inherently interdisciplinary so people have to come together to work on |
---|
0:45:51 | these topics |
---|
0:45:52 | and share |
---|
0:45:53 | so these are some of the people in my speech production group |
---|
0:45:56 | over the years |
---|
0:45:58 | along the bottom row are the people who are currently there |
---|
0:46:04 | who contributed to this particular collection of work |
---|
0:46:08 | a colleague who does all the imaging work |
---|
0:46:11 | and colleagues who are MR scientists |
---|
0:46:13 | and several of the linguists as |
---|
0:46:16 | well |
---|
0:46:16 | the linguists provide the conceptual framework for how we |
---|
0:46:20 | approach |
---|
0:46:21 | all of this work |
---|
0:46:23 | including the speech and language pathology work and the work on morphology that I was |
---|
0:46:29 | talking a lot about where |
---|
0:46:32 | much of |
---|
0:46:34 | that is now actually translating into speaker verification |
---|
0:46:39 | and I thank my collaborators for their amazing work and for making the |
---|
0:46:44 | data available |
---|
0:46:46 | and finally a colleague who has been very supportive |
---|
0:46:52 | he has been important for this and keeps pushing us to |
---|
0:46:56 | look at these kinds of things too |
---|
0:46:58 | so with that I thank all of you for listening to me |
---|
0:47:02 | if people want to find out more |
---|
0:47:04 | all of this is available online if you are interested including the data I showed |
---|
0:47:09 | thank you very much |
---|
0:47:32 | thank you very much that was fascinating |
---|
0:47:35 | two questions first of all |
---|
0:47:38 | when are you going to get to the larynx |
---|
0:47:42 | because I am asking from the |
---|
0:47:46 | perspective of |
---|
0:47:48 | the forensic phoneticians |
---|
0:47:51 | and |
---|
0:47:54 | we are conscious of between speaker differences from the larynx down to |
---|
0:48:02 | spectral slope and that sort of thing but that is a pressing issue |
---|
0:48:05 | and also somewhat residually |
---|
0:48:09 | related to that what I would |
---|
0:48:11 | say is almost more robust and useful is our knowledge about speaker variability in |
---|
0:48:20 | the nasal |
---|
0:48:22 | basically the nasal cavity the sinuses that sort of thing |
---|
0:48:26 | what does that tell us about speaker identity |
---|
0:48:30 | though it is a problem because you are not going to get it in |
---|
0:48:32 | telephone speech and so forth anything above |
---|
0:48:35 | three kilohertz is gone |
---|
0:48:37 | thanks so the first question is about the larynx |
---|
0:48:41 | so here we are in this region |
---|
0:48:43 | so |
---|
0:48:45 | the glottal the voice source phenomena happen at a |
---|
0:48:49 | much higher rate |
---|
0:48:50 | and our MRI is still not fast enough it is about |
---|
0:48:54 | on the order of a hundred frames per second here |
---|
0:48:58 | so what people have been doing particularly colleagues elsewhere is |
---|
0:49:02 | to do |
---|
0:49:04 | high speed imaging of the larynx with a camera through the nose |
---|
0:49:08 | which is |
---|
0:49:10 | a little bit of an intervention |
---|
0:49:13 | invasive |
---|
0:49:14 | on the other hand |
---|
0:49:15 | what we can do with MRI is look at things like |
---|
0:49:19 | larynx height and other things and also get some |
---|
0:49:22 | F zero related information |
---|
0:49:25 | and particularly one of the things MRI offers is a complete view of the interior regions so |
---|
0:49:30 | we can really |
---|
0:49:31 | see things not available in any of the other modalities people use |
---|
0:49:34 | so you can look at the pharyngeal and laryngeal sort of |
---|
0:49:40 | behavior phenomena |
---|
0:49:42 | and in terms of actually characterizing things like the sinuses and so |
---|
0:49:45 | on which do not change very much during speech we can characterize those with |
---|
0:49:49 | T2 weighted images |
---|
0:49:51 | to really characterize every speaker by the anatomy they have |
---|
0:49:55 | in terms of |
---|
0:49:57 | which we can actually get |
---|
0:50:00 | a good anatomical characterization of a speaker and see how to relate or account for |
---|
0:50:05 | it in the signal |
---|
0:50:06 | and so |
---|
0:50:08 | we are trying to see how we can |
---|
0:50:10 | in a controlled way do some multimodal imaging of the voice source which we have tried |
---|
0:50:14 | a bit |
---|
0:50:15 | but there is quite a small window into this thing |
---|
0:50:19 | we want to see the high speed stuff |
---|
0:50:23 | so it is still an open question in terms of how to really image it |
---|
0:50:29 | so let me put up the references |
---|
0:50:33 | from the previous slides for the organisers and people interested |
---|
0:50:40 | any more questions I was just |
---|
0:50:46 | looking around |
---|
0:50:50 | yes |
---|
0:50:53 | is it possible to say broadly |
---|
0:50:55 | if there are any particular areas that show the greatest amount of between |
---|
0:51:00 | speaker difference |
---|
0:51:02 | that is to say |
---|
0:51:03 | if you are going to look for where speakers differ is there a place where it is completely |
---|
0:51:08 | individual or is it just that people differ in all |
---|
0:51:11 | sorts of different ways |
---|
0:51:14 | so I think the latter is what my guess would be right now although |
---|
0:51:18 | we do think they will begin to cluster |
---|
0:51:21 | as we increase the numbers |
---|
0:51:25 | just like what we do with eigenvoices and |
---|
0:51:28 | eigenfaces I am sure the prime factors will start clustering |
---|
0:51:32 | as we get more data |
---|
0:51:33 | but for now the source of variability seems to be |
---|
0:51:36 | from a structural point of view |
---|
0:51:38 | all over the place |
---|
0:51:40 | plus how people behave how they move |
---|
0:51:42 | also varies quite a bit because of |
---|
0:51:46 | where they come from and how they learned and so on and the practices people |
---|
0:51:50 | use |
---|
0:51:51 | there are other pieces of work that I can talk about on articulatory |
---|
0:51:54 | settings and |
---|
0:51:56 | ideas about |
---|
0:51:59 | how people sort of actually |
---|
0:52:02 | behave but how we |
---|
0:52:04 | extract parameters |
---|
0:52:06 | from a motor control point of view why people prefer what they do and whether it can be |
---|
0:52:09 | linked to language or |
---|
0:52:11 | background or other kinds of things is still an open question |
---|
0:52:15 | but what I feel heartened by is that these are we are talking |
---|
0:52:18 | about very small datasets compared to what you have been used to just on |
---|
0:52:23 | the speech side |
---|
0:52:25 | but if we increase this to some extent |
---|
0:52:28 | and bring in the kinds of computational tools and advances that you are making I think |
---|
0:52:33 | slowly we can begin to understand this at the level we want to |
---|
0:52:40 | it is an open question |
---|
0:52:49 | so let me make a comment first and then ask a question |
---|
0:52:53 | you put up a kind of acoustic tube model and I want to point |
---|
0:52:57 | out one thing from one of the workshops from |
---|
0:53:00 | the early nineties |
---|
0:53:02 | from the mid sixties up until the late eighties early nineties we used a straight acoustic tube |
---|
0:53:09 | model as if you were lying flat |
---|
0:53:12 | and we had a summer student who basically spent the summer saying well |
---|
0:53:18 | actually the vocal tract has a right angle turn and no one had really thought |
---|
0:53:23 | about how much that right angle actually impacts formant locations |
---|
0:53:28 | and bandwidths |
---|
0:53:29 | so he formulated a kind of closed form solution and I think what he saw |
---|
0:53:34 | was between one and three percent shifts in formant locations and bandwidths so very much |
---|
0:53:39 | taking the physiological part into account matters now my one basic question |
---|
0:53:44 | you focused on speaker id |
---|
0:53:47 | I am assuming many of your speakers here are bilingual have you thought about looking at language |
---|
0:53:52 | id to see if the physiological production systematically changes when people speak one language versus |
---|
0:53:59 | another |
---|
0:54:00 | absolutely so along the lines of the first comment John Hansen made |
---|
0:54:05 | regarding the bend in the vocal tube we can actually do the simulations |
---|
0:54:11 | now |
---|
0:54:12 | for |
---|
0:54:13 | articulation acoustics and the effect of the bend in fact there is a classic paper |
---|
0:54:17 | on this |
---|
0:54:19 | published |
---|
0:54:21 | a long time ago |
---|
0:54:23 | that estimated it at about three to five percent and a student verified it with |
---|
0:54:27 | simulations later on |
---|
0:54:31 | so |
---|
0:54:34 | and |
---|
0:54:36 | so I think the more recent models try to do this you know |
---|
0:54:40 | with finite element simulations and the ones we can do with this |
---|
0:54:44 | data have access to those postures like the ones I talked about |
---|
0:54:48 | for all the postures from all the speakers we have |
---|
0:54:50 | so with high performance computing |
---|
0:54:53 | this is becoming a reality we can actually run what we want to |
---|
0:54:56 | now it is |
---|
0:54:58 | possible |
---|
0:55:00 | the second question |
---|
0:55:03 | John can you remind me |
---|
0:55:07 | ah the language id yes of course we have actually |
---|
0:55:10 | about |
---|
0:55:11 | forty or fifty different first languages with English as the second language |
---|
0:55:17 | spoken in our datasets across the various linguistic experiments we have been doing |
---|
0:55:22 | so one of the things we |
---|
0:55:24 | have looked at with the data |
---|
0:55:25 | a little bit not as much maybe as |
---|
0:55:27 | people with intuitions about language id |
---|
0:55:31 | may have hypotheses about is things like articulatory setting |
---|
0:55:34 | which is |
---|
0:55:36 | the place from which you start executing a task from rest |
---|
0:55:40 | so if you think about it as a dynamical system as you know from |
---|
0:55:44 | sequence modelling the initial state is important it is the state from |
---|
0:55:48 | which you go to another state and where you settle when you |
---|
0:55:53 | release a particular task and go to the next as you make one constriction after |
---|
0:55:57 | another and so we found that people have preferred sort of settings from which |
---|
0:56:02 | they start executing and that is very language specific we showed this for German speakers |
---|
0:56:07 | and Spanish speakers versus English speakers so these kinds of things can be estimated from |
---|
0:56:11 | articulatory data |
---|
0:56:13 | with inversion it has not to my knowledge been done |
---|
0:56:17 | but it is quite possible and we are happy to share the data |
---|
0:56:21 | with people who want to try |
---|
0:56:26 | okay |
---|
0:56:27 | you first okay |
---|
0:56:32 | okay so |
---|
0:56:34 | I have a comment I would like you to respond to |
---|
0:56:37 | one of the oldest problems in speaker recognition is what happens between the headset |
---|
0:56:44 | and the speech right |
---|
0:56:48 | the first line of defense is |
---|
0:56:51 | cepstral mean subtraction |
---|
0:56:54 | but basically that throws away the average shape of the vocal tract |
---|
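The questioner's point can be made concrete with a minimal sketch of cepstral mean subtraction; the frame values are illustrative. Subtracting the per-utterance mean from each cepstral dimension removes stationary convolutive effects, which covers the channel but also, as noted, the long-term average vocal-tract spectrum:

```python
def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension.
    cepstra is a list of frames, each a list of cepstral coefficients."""
    n = len(cepstra)
    dims = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in cepstra]

# Illustrative 2-dimensional cepstra over three frames.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
normalized = cepstral_mean_subtraction(frames)
print(normalized)  # each dimension now has zero mean across frames
```

Because convolution in the time domain is addition in the cepstral domain, any fixed spectral shaping, microphone or speaker, ends up in that subtracted mean.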
0:57:02 | how does that sort of |
---|
0:57:04 | impact on what you |
---|
0:57:07 | right, so i didn't talk about the channel effects and channel normalization
---|
0:57:11 | things that depend on the recording conditions and so on, right, so
---|
0:57:16 | one of the things that we are contemplating is, like, you know, like many people
---|
0:57:19 | have been doing with joint factor analysis, or even with these
---|
0:57:24 | new deep learning systems, right
---|
0:57:27 | you could model these multiple factors jointly together to see how
---|
0:57:31 | we can have speaker-specific variability sort of measures
---|
0:57:35 | and separate out things that are caused by sort of other
---|
0:57:39 | extraneous setup
---|
0:57:42 | interferences or other kinds of transformations that might happen
---|
0:57:46 | so that's why we're doing first-principles type things, right: the way we
---|
0:57:51 | want to do it is not just make the jump into throwing all of this into
---|
0:57:55 | some, you know, machine learning tool and blindly estimating, but
---|
0:58:01 | systematically trying to look at linguistic theory, speech science, acoustic features, analysis-by-synthesis
---|
0:58:07 | type approaches, and then we can see whether, if you have other
---|
0:58:11 | kinds of conditions
---|
0:58:15 | both
---|
0:58:16 | open-environment speech recordings and
---|
0:58:19 | for instance distant speech recordings, which are of much interest to others
---|
0:58:23 | for various reasons
---|
0:58:26 | we can account for these things, so i tend to believe in that kind of
---|
0:58:30 | more organic approach
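(Editor's note: the cepstral mean subtraction raised in the question can be sketched in a few lines. The helper name and toy array dimensions below are illustrative, not from the talk.)

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: (num_frames, num_coeffs) array, e.g. MFCC frames.
    A stationary convolutive channel adds a constant offset in the
    cepstral domain, so subtracting the time average removes it --
    along with the average vocal-tract shape, which is exactly the
    questioner's point.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# toy check: a constant "channel" offset disappears after CMS
frames = np.random.randn(100, 13)   # stand-in for MFCC frames
offset = np.arange(13.0)            # stand-in for a fixed channel
clean = cepstral_mean_subtraction(frames)
shifted = cepstral_mean_subtraction(frames + offset)
```

Because the channel and the average vocal-tract shape are both constant offsets here, CMS cannot distinguish them, which is the tension the question and the answer are circling.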
---|
0:58:38 | we have time for one more question, maybe; please be fast
---|
0:58:44 | i |
---|
0:58:47 | i'm sorry, i'll be fast
---|
0:58:50 | i |
---|
0:58:51 | i want first to thank you, it was very nice
---|
0:58:57 | to see this bridge between speech
---|
0:58:59 | science
---|
0:59:00 | and technology, particularly in speaker recognition and in the forensic area, so
---|
0:59:06 | just my comment is to remind us of the difference between speaker recognition and forensic voice
---|
0:59:12 | comparison
---|
0:59:14 | because it concerns both, and
---|
0:59:17 | the field
---|
0:59:18 | you presented
---|
0:59:20 | because
---|
0:59:21 | when we try to do some articulatory analysis like
---|
0:59:26 | that
---|
0:59:27 | we have a huge difference between read speech, which is
---|
0:59:32 | quite constrained, and spontaneous
---|
0:59:37 | speech, right
---|
0:59:40 | for speaker recognition we could imagine that the speakers are trying to
---|
0:59:46 | be
---|
0:59:47 | cooperative, the classical case, and would not disguise themselves
---|
0:59:50 | in forensic voice
---|
0:59:51 | comparison
---|
0:59:52 | we could imagine exactly the opposite, and here is my question
---|
0:59:58 | could a speaker deliberately modify
---|
1:00:01 | these
---|
1:00:03 | constrictions, or the articulation strategies, you know, and
---|
1:00:08 | is that a challenge for the approach you propose
---|
1:00:12 | yes, and you're right, because there are certain things we can change and certain things we can't, right
---|
1:00:16 | they're given, right; that's one of the things that we are trying to go after
---|
1:00:20 | there are things that are given in our physical instrument; one can compensate
---|
1:00:25 | somewhat, but we still see the residual effects, and we want to see whether we can get
---|
1:00:29 | at this residual effect; maybe
---|
1:00:31 | the bounds are not there; so, you know, i have a big debt to information theory
---|
1:00:35 | so i'm always interested in bounding the limits of things: how much can we actually recover
---|
1:00:39 | after all, we have
---|
1:00:40 | a one-dimensional signal from which we project onto all kinds of feature spaces and
---|
1:00:44 | do all our computation based on that, to do all the inference problems: target the speaker
---|
1:00:49 | or whatever it is, and so
---|
1:00:52 | say you manipulate the strategies; that's only one degree of freedom, or you
---|
1:00:58 | know, a few
---|
1:01:00 | and then it causes some differences, but still, if we can account for this somehow
---|
1:01:04 | can we still see the residual effects of the instrument that they have, or the
---|
1:01:10 | specific ways they are
---|
1:01:11 | changing things? there is a common constraint they all have, right: you can't
---|
1:01:17 | just do random things with your articulation to create the speech sounds, right, so that's
---|
1:01:23 | why this joint modelling of, you know, the structure and the function would be very
---|
1:01:28 | interesting, to see how much can be spoofed by people, like, you know, if
---|
1:01:32 | they're really getting at it
---|
1:01:33 | it remains to be seen, i don't know
---|
1:01:36 | but i'm hoping that, like, by, you know
---|
1:01:38 | being very microscopic with these analyses we can get some insight into it
---|
1:01:43 | you know, and in a way that is very objective, not, you know
---|
1:01:46 | just
---|
1:01:48 | impressionistic, you know, "this speaker is definitely…", the way these experts used to talk about it
---|
1:01:52 | you know, in court
---|
1:01:55 | i think that's one of the reasons
---|
1:01:57 | the community here was very
---|
1:01:59 | supportive of the idea: no, let's go at it in as objective a way, you know, as scientifically
---|
1:02:03 | grounded a way, as possible
---|
1:02:06 | so we don't lose sight of where
---|
1:02:11 | it can go
---|
1:02:13 | so let's thank the speaker again, thank you. thank you
---|