Thank you, that's very kind of you.

It's a pleasure to be here in August 2016 and to be a part of this meeting; thank you for inviting me.

The talk I'm going to give today is about a very classic problem in speech communication: understanding variability and invariance in speech.

People have been asking this question for a long time.

So the specific focus is on this very vocal instrument we have with which we produce speech.

Here are six different people, showing midsagittal slices of their vocal tracts, and we can see immediately that each has a very uniquely shaped vocal instrument with which they produce speech — and the speech signals produced by this vocal instrument are what we are trying to use for doing speaker recognition.

In fact, let me just orient you, if you're not familiar with this kind of view of looking into the mouth: here are the nose and the tongue, and the velum, the soft palate, which moves like this. I point this out because you'll see a lot of these pictures in my talk today.

Here is the same sound being produced by more people, all of them trying to produce the same well-known vowel. Even a quick look shows that the strategies these people use to produce the sound are slightly different. If we look at another example — say, the first and second speakers — both make the gesture for the same vowel, but they do it slightly differently.

So we know that both the structure within which speech production happens and the way we produce speech vary across people, and some of that variation is reflected in the speech signal. That is what we are trying to get at.

So the aim of this line of work is to ask: what role can speech science play in understanding and supporting the development of speech technologies? Not only do we want to recognize speakers, we want to understand what makes them different.

Specifically, the focus today is to look at vocal tract structure — the physical instrument we are given — and function, the behavior within that instrument for producing speech, and the interplay between the two.

By structure I mean the physical characteristics of this vocal tract apparatus that we have: the hard palate geometry, the tongue volume, the length of the vocal tract, the velum, and so on.

Function typically refers to the dynamic characteristics of speech articulation: how we dynamically move — for example, to produce consonants by forming constrictions in the vocal tract, say for a sound like /s/, where the tongue is raised to create a narrow channel and create turbulence.

So this leads to very specific questions. How are individual vocal tract differences — the physical structures of people — reflected in the speech acoustics? Can they, in the inverse problem, be predicted from the acoustics? How do people, with all sorts of structural differences, create phonetic equivalence — because we all manage to communicate using a shared speech code and language? And, as has been pointed out, what contributes to distinguishing speakers from one another from the speech?

I want to emphasize that not only are we trying to differentiate individuals from their speech signal, but to understand what makes them different, in terms of structure and function.

So let me unpack some of this, one piece at a time. We'll try to see how we can quantify individual variability in vocal tract morphology; try to see if we can predict some of it from the signal, and what the bounds of that are; look at how individual articulatory strategies differ; see whether we can exploit this in automatic speaker recognition type applications; and offer some interpretation while doing so.

So, the approach we take in our laboratory — one of my research groups, the SPAN speech production and articulation knowledge group, looks at a lot of different questions, including questions of variability — is a multimodal approach.

We look at different ways of getting at speech production: real-time MRI, which I'll talk about a lot today, audio, and other kinds of measurement technologies, with a whole lot of multimodal processing — image processing, speech processing — and modeling built on top of that.

And we try to use these kinds of engineering advances to gain insights about the dynamics of production, speaker variability, and questions about speaking style, prosody, and emotions.

So the rest of the talk is structured as follows. I'll focus the first part on how we can measure speech production — how we get those images and so on — with a particular focus on MRI, magnetic resonance imaging, something we've been working to develop a lot. Then, given those datasets, I'll talk about how we analyze them, and I'll end with some modeling questions.

So, how do you get at vocal tract imaging?

This has been very central to speech science for a long time: the quest to observe and measure articulatory details, the tongue surface and so on. There are a number of techniques, each with its own strengths and limitations. For example, there are the classic X-ray films that were made early on; X-ray has pretty good temporal resolution, but it's not safe for people, so it's no longer a usable methodology. Then there are a number of other techniques, like ultrasound, which gives you only a partial view of the inside and is not necessarily helpful for the kinds of modeling I'll show later, and things like electropalatography, which I'll show you in a picture.

So here, actually, is an X-ray film — this one in fact features Ken Stevens.

Here is ultrasound: you only see the tongue surface, or parts of it; you see the edges.

And this is the palatal device I mentioned: people speak with an artificial palate, like the one shown here, fitted with contact electrodes. When we speak, the contacts made by the tongue against the palate provide insights about timing and coordination in speech, and that has been used to study it.

And finally, electromagnetic articulography: we glue little rice-crispy-like sensors onto the tongue and measure their dynamics. So each of these provides you something.

But new possibilities were created with advances in MRI, which provides very good soft-tissue contrast. What it relies on is the water content of tissue — the proton density — which varies across soft tissues. We make use of that by exciting the protons, and as they relax, signals are generated according to their surroundings, and then we can image them.

It's very exciting because it gives you very good quality images, but traditional MRI is very slow, and it has other drawbacks: it's very noisy, and you have to lie down inside the scanner to produce speech sounds, so the experimental setting is a little constrained. These are some of the things we have been contending with over the past years.

So the very first step, the main thrust starting around 2004, was getting into real-time imaging — that is, getting to speeds, or sampling rates, that are higher than the rates of speech itself, the roughly ten-odd hertz of syllable and articulation rates. Let me show you a sample.

[video: real-time MRI movie with synchronized audio]

If you're familiar with the rainbow passage, that's what the subject was reading. It was very exciting for us to actually be able to do this. We were doing acoustic recordings — and a lot of speech enhancement work, because of the MRI noise — synchronized with the imaging, so it opened up a lot of different possibilities for this research.

That gave us signals good for a wide range of questions, but we have been trying to see if we can make it even better.

When you actually look at the rates of the various gestures in speech, it's not one constant rate: we are using a lot of different articulators, each with its own movement task. From trills, like in Spanish, to other kinds of sounds, they all have different rates. So if we could image at those kinds of rates, that would be really cool.

In fact, last year we were able to make a breakthrough and get up to about one hundred frames per second doing real-time MRI, thanks to more than one postdoc. Not only can we image very fast speech — you can really see the tongue tip when someone speaks quickly — but we can also image multiple planes simultaneously. What you see here is a sagittal slice, and we can also slice axially, like this, or coronally, like this, so we can get simultaneous views of the vocal tract. It's really exciting to be able to image at these very high rates and gain new insights.

This was made possible by both hardware and algorithmic advances. We developed a custom receiver coil for the imaging, and made a lot of progress in both pulse sequence design and in sparse reconstruction using compressed sensing — things that have been happening in signal processing more broadly.
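To give a flavor of what sparse reconstruction means, here is a minimal, illustrative sketch of iterative soft-thresholding (ISTA) on a toy one-dimensional problem. This is not our actual MRI reconstruction pipeline — just the core compressed-sensing idea of recovering a sparse signal from undersampled measurements; all sizes and values are made up.

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=500):
    """Iterative soft-thresholding: recover a sparse x from y = A @ x."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / (largest singular value)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x + step * A.T @ (y - A @ x)        # gradient step on the data-fit term
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # shrinkage
    return x

# toy example: 40 random measurements of a 100-dimensional, 3-sparse signal
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.0, -2.0, 1.5]
x_hat = ista(A, A @ x_true)
```

Even with fewer measurements than unknowns, the sparsity prior lets the iteration land on the correct support.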

So we were able to really speed this up, and we're quite excited about it. This is what an experiment looks like: someone is sitting there doing the audio collection; we reprogrammed the scanner so that the audio is synchronized with the imaging; and we have an interactive control system to select the scan plane and so on.

[video: interactive real-time MRI demo with synchronized audio]

You get the idea — you can really see things. On the projector it doesn't look that good, but the actual images are really good. Importantly, we are now looking at production data at scales that are conducive to the kinds of machine learning approaches one could use, although I won't be talking much about that part today; we are only scratching the surface of that problem.

In addition to single-plane or multi-plane slice imaging, we are also very interested in volumetric imaging, because if you are interested in characterizing speakers — one of the topics of our research interest — you really want to capture the full geometry while people are speaking. We have made some advances there too: with about seven seconds of sustained sound production, we can do a full sweep of the entire vocal tract, and so we can get three-dimensional vocal tract geometries of people for a set of postures.

In addition, we can do accelerated imaging of the anatomical structures themselves, notably the tongue. I'll show you why we are doing all these things: for the kinds of measures we want, we need a comprehensive way of characterizing speakers, acoustically and in terms of the vocal instrument and its behavior.

One of the things we decided to do recently is to release a lot of these data. For speaker characterization work, we have released recordings from multiple different speakers — 460 read sentences each — with alignments, the image sequences, and features, all available for free download.

Here are some examples of that kind of data.

[video: example data from the released corpus]

It's got five male and five female speakers — you may even recognize some of them — and we also have alignments, basically co-registration of the data, with some algorithms for that also released. So we have this kind of data that we can work with.

Now, what do you do with this stuff?

I'll introduce some preliminary analysis — a lot of image processing. The very first thing is actually getting at the structural details of the human vocal apparatus. For people interested in anatomy and morphometrics, there is a tradition of measuring things like the length of the palate and so on, and we wanted to do that very carefully with this kind of imaging.

On top of that, we also want to track articulators, since articulators serve specific, important tasks — so we want to be able to process these things automatically.

The methodology we proposed was a region-based segmentation model — a very nice mathematical formulation, actually, work done by one of my students — and he was able to create a segmentation algorithm that works fairly well. It looks something like this.

So with that, we can now capture the variation and timing automatically from these vast amounts of data — one way to think about it is as a kind of feature extraction — and from it we can build descriptions that are more linguistically meaningful to us.

One of my close collaborators is among the founders of Articulatory Phonology, and in that view we conceptualize speech production as a dynamical system: the various articulators are engaged in tasks, basically creating, forming, and releasing constrictions as we move around.

So we are interested in features like lip aperture, or tongue-tip constriction degree and location, and so on, and we are able to extract these automatically.

We need to do all of this automatically: going from images to segmentation, and then extracting this set of linguistically meaningful features. From there we can compute other kinds of representations — for example, running guided PCA on these contours to look at the contributions of different articulators, and so on. This provides ways of objectively characterizing the production information, including what is speaker-specific.

So far I've told you about how to get the data, and some of the basic analysis; with that, we can now start looking at speaker-specific properties.

So, as I mentioned earlier, one analysis goal is anatomical: how do we characterize each individual vocal instrument structurally?

This question has been treated pretty well in the anatomy literature, so we went and surveyed that literature, compiled a whole bunch of landmarks — you may not have encountered some of these landmarks in speech work — and came up with measures that we can extract, like vocal tract length and the oral and pharyngeal cavity lengths and so on, which we can measure from these kinds of very high-contrast images. That's one source of speaker-specific structural information.

As an aside, since we had many repetitions of the same tokens by these people at different sessions, we were also interested in how consistent people are, and it was very reassuring that people are fairly consistent in how they produce these tokens — the measurements, for example for the female speakers here, were very consistent. This is, for example, looking at correlations across repetitions, something we presented at Interspeech.

So here is the crux: we have this fixed anatomical environment within which we produce speech behavior, and we want to know how much of the behavior is dictated by the environment we have, versus strategies adopted by speakers that are unique to them for reasons we can't fully pinpoint — learning they have done, or the environment they grew up in. With the data, we can start deconstructing this a little bit.

so for example this picture want you to focus on the following and the palatal

variation thought it is like you know your battery genders and think the heart circus

we put you don't know right that's about the art part which is like important

product or

vocal apparatus so here we see

but this person

course my mouse

that it

so in a we see that this have i voices are very don't about it

here a more posterior

then i interior here is sharper drown

that is just six different people

So how do we begin to quantify what you are qualitatively seeing? What one of my students did was to take these extracted palate shapes and run even a simple PCA analysis, and he showed that about ninety-six percent of the variance could be explained by the first five factors. These factors were akin to the concavity of the palate, how anterior or posterior that concavity sits, how curved it is, and how sharp it is. So the factors had reasonable interpretations while being fully objective, and we can begin to quantify and cluster people along these low-dimensional shape variables.
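To make that concrete, here is a minimal sketch of PCA on aligned contour data. The contours here are synthetic stand-ins, not our measurements; by construction the dominant factor is a concavity-like scaling of a base shape.

```python
import numpy as np

# Hypothetical data: each row is one speaker's palate contour (y-coordinates
# at fixed x positions), after alignment to common landmarks.
rng = np.random.default_rng(1)
n_speakers, n_points = 30, 20
base = np.sin(np.linspace(0, np.pi, n_points))        # a generic concave shape
contours = (base
            + 0.3 * rng.standard_normal((n_speakers, 1)) * base  # concavity factor
            + 0.05 * rng.standard_normal((n_speakers, n_points)))  # measurement noise

# PCA via SVD of the centered data matrix
X = contours - contours.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)        # fraction of variance per factor
scores = X @ Vt.T                      # low-dimensional speaker coordinates

print(f"first factor explains {explained[0]:.0%} of variance")
```

The speaker `scores` along the leading components are exactly the kind of low-dimensional shape variables one can cluster or correlate with acoustics.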

And then we can plug these kinds of variables into models — for example, to see what the acoustic consequences of these shape variations are. One of the things we found is that some of the shape factors affected the formants very much, while others, like the sharpness, really didn't seem to matter, at least in these first-pass simulations.

So, going from data to a morphological characterization, we can start to interpret what acoustic consequences to expect.

In fact, we can put this into an articulatory synthesizer and show that the same words produced with different palate shapes sound a little different — you can hear the vowels coming out differently in the different simulated tracts. So we can do this kind of analysis very carefully.

Of course, we are now also interested in the inverse problem: can we estimate these shapes from the acoustic signal — how much information about the shape details is available to us? So we did the classic thing and extracted all kinds of features from the acoustic signal. One thing to realize is that the shaping of the signal as we speak is influenced both by the environment — the anatomy — and by the movements, the behaviors; both influences show up in the signal. So now we ask how much we can decode from the signal alone.

We showed, in a very simple first experiment, that we can detect the palate shape class — concave versus flat — about sixty-some percent of the time, just from the acoustic signal. So that information is indeed available.

A more interesting question involves a very classic morphological parameter that we've been using a lot: vocal tract length. This is something that has been important in speech recognition and in speaker and sound modeling — we use it to normalize, and also to estimate attributes, for example when doing age recognition and so on.

So here again is the same question: how much of this speaker-specific attribute is reflected in the signal, and how much can we grab from it to pinpoint the speaker? You know that, to some extent, speakers compensate for the anatomical environment they have, and we want to know how much of the effect is residual, so that we can actually estimate it.

I start with vocal tract length because it's a classic question people have been asking. For example, here is data from a 2009 study of vocal tract growth: vocal tract length grows with age over the years, going from about six centimeters up to seventeen and a half or eighteen centimeters, and a differentiation between males and females emerges empirically as well.

Correspondingly, this affects the formant space in the spectrum. Zeroing in on the first two formants, we can see that shorter and longer vocal tracts have vowel spaces that get compressed and shifted in systematic ways. And what people have been doing, implicitly or explicitly, when we do VTLN — vocal tract length normalization — is basically to normalize away this effect.
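The compression and shifting follow from simple tube acoustics: for a uniform tube closed at the glottis and open at the lips, the resonances are F_k = (2k − 1)·c / 4L, so formants scale inversely with tract length. A small sketch of that scaling:

```python
import numpy as np

C = 35000.0  # approximate speed of sound in warm, moist air, cm/s

def uniform_tube_formants(length_cm, n=4):
    """Formants of a uniform tube closed at the glottis, open at the lips:
    F_k = (2k - 1) * c / (4 L)."""
    k = np.arange(1, n + 1)
    return (2 * k - 1) * C / (4.0 * length_cm)

adult = uniform_tube_formants(17.5)   # ~17.5 cm tract: 500, 1500, 2500, 3500 Hz
child = uniform_tube_formants(12.0)   # shorter tract: every formant shifted up
print(adult, child)
```

Every formant of the shorter tract is higher by the same inverse length ratio, which is exactly why a single warp factor per speaker captures so much of the variation.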

Classic estimation of vocal tract length has gone back to very simple models — like a uniform tube at a rest state — from which you can estimate the length of the vocal tract from the formant parameters: given an assumed relation between length and formants, you can estimate the length. One early and influential line of work on this kind of prediction relied on the third and fourth formants, and other people have proposed variants.

What we decided was this: now that we actually have direct evidence of vocal tract length together with the acoustics, can we come up with better regression models? And sure enough, we showed on our corpus that we can get really good estimates, with very high correlations between estimated and measured vocal tract length.

This is very interesting: we can fit a good regression model, estimate its parameters, and then estimate vocal tract length as yet another interpretable morphological detail of the person, from the signal. That's kind of exciting.
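As an illustration of the regression idea — not our actual model or data — here is a sketch that fits length against inverse formants (the physically linear form, since L ∝ 1/F for a uniform tube) on synthetic tube data with perturbed formants:

```python
import numpy as np

rng = np.random.default_rng(2)
C = 35000.0  # speed of sound, cm/s

# Hypothetical training set: true lengths, with formants from the uniform-tube
# relation F_k = (2k - 1) c / (4 L), perturbed to mimic vowel-dependent variation.
L_true = rng.uniform(13.0, 19.0, size=200)                 # cm
k = np.arange(1, 5)
F = (2 * k - 1)[None, :] * C / (4.0 * L_true[:, None])
F *= 1.0 + 0.05 * rng.standard_normal(F.shape)             # 5% perturbation

# Linear regression of length on inverse formants plus a bias term
X = np.hstack([1.0 / F, np.ones((len(F), 1))])
w, *_ = np.linalg.lstsq(X, L_true, rcond=None)
L_hat = X @ w
r = np.corrcoef(L_hat, L_true)[0, 1]
print(f"correlation between estimated and true length: {r:.3f}")
```

With ground-truth lengths available for training, even this simple least-squares fit recovers length estimates that correlate strongly with the truth.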

To summarize this part: the combination of direct measurement, availability of data, and good statistical methods allows us to get much better insights.

Moving on, let's look at the tongue. The vocal tract is a soft, deformable construct — it's very hard to define its boundaries — and the tongue in particular plays a big role in how we shape the sounds.

So the question we ask is this. We have vocal tract length and the formant data — the same kind of chart I was showing you before — and we normalize using linear normalization, which is what is typically done. But we still have residual differences that are unexplained. People have proposed things like nonlinear vocal tract normalization, but those have had limited success and are hard to interpret. What we want to know is whether that residual effect actually says something about the size of the tongue that people have — something we could account for automatically.

So the hypothesis on this slide is that the relative tongue size of different people will explain some of the vowel-space differences.

So the questions we pose are: how does one define and measure tongue size? How does tongue size vary across the population? What is the effect of tongue size on articulation? And what part of that is visible in the acoustics — can it be predicted and normalized? The same questions as before. There is very little published work on this kind of thing.

People know that there is a coordinated, global development of the size of the vocal tract, and there are some disorders associated with unusually large tongue sizes.

What effects would tongue size have on how we produce speech? Clinically reported ones include changes in how certain classes of lingual sounds are made, lingual participation even in producing bilabial sounds like p and b, other compensatory articulations, and slowing of speech rate, because there is a larger mass to control. These things get mentioned, but not much quantified.

So we set out and said: we have lots of data; can we estimate a mean tongue posture, take the segmentations, and come up with some proxy measure for tongue size? There are more refined things one could do, but once we have that, we can plot the distributions of tongue size across the male and female speakers in our corpus.

What we see — the green is female, the other male, with averages marked — is a significant sex difference in tongue size. So this is yet another interpretable marker we may be able to get at from the acoustic signal.
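A minimal sketch of how one might test such a group difference, with hypothetical proxy tongue-size values (the numbers are invented, not our measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical proxy tongue sizes (e.g., midsagittal area in cm^2) per group
male = rng.normal(24.0, 2.0, size=40)
female = rng.normal(21.0, 2.0, size=40)

def welch_t(a, b):
    """Welch's t statistic: two-sample comparison without assuming equal variance."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

t = welch_t(male, female)
print(f"Welch t = {t:.2f}")
```

A t statistic well above 2 for samples of this size indicates the group difference is very unlikely to be a sampling artifact.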

Now, how well we adapt to this part of the structure, the tongue, is still not well established — it remains an open question how you even assess this — but we have taken a shot. We tried different kinds of normalization, for example looking at rest posture versus posture during movement; there is not much difference between them — they are pretty highly correlated.

Once we have that, we can use this information in simulations — say, in an articulatory synthesis model of the kind the field inherited from classic work; people still study speech production this way — and reflect the measurements back, studying the question by analysis-by-synthesis. With a larger tongue, we would expect longer constrictions, and so on. So what we did was to vary, based on the measurements we had, the different constriction lengths and locations, to see how tongue-size differences would play out in the acoustics — in the formants.

What we observed was that the tongue-size differences in our population, and what was estimated by simulation, were very well correlated in terms of the formant patterns: in the simulations the formants move in the same direction as in the real data, so the general trends agree.

So, all in all, what we saw in this pilot is that tongue size varies across speakers quite a bit — up to fifteen to thirty percent. A consequence of a larger tongue is longer constrictions made in the vocal tract as people produce sounds — and constrictions are very central to how we produce speech sounds — and the data suggest this stretches and twists the vowel space, so there is signal there to exploit.

But this interplay between constrictions, formants, and tongue size is fairly complex, and requires much more sophisticated modeling and learning; hopefully, with the data we are releasing, these things can be pursued.

So the final piece on the side of speaker-specific behavior is articulatory strategies. What I mean by that is how talkers move their vocal tracts. As you know, the vocal tract is a pretty clever system — like many biological systems, it has motor equivalence built in: you can use different articulators to achieve the same task. For example, you can move the jaw or the lips — both articulators contribute to bilabial constrictions, as in making p and b — and you can use more jaw or more lips. People have several ways of shaping their airway to do this. We call these articulatory strategies; some are speaker-specific, some are language-specific, and we want to get at them because this is yet another piece of the puzzle in understanding what makes me different from you when we produce the speech signal — beyond merely detecting that I'm different from you from the speech.

So here is the approach — again, very early work. We have lots of real-time MRI data; the database we are collecting includes a pilot study of eighteen speakers, with all the volumetric imaging and so on, in very detailed ways, so we can characterize both the morphology and the speaking behavior.

Once we have that, we establish what we call speaker-specific forward maps from the vocal tract shapes to the constrictions: imagine the shape changes that create these tasks, modeled as a dynamical system, and we estimate the forward maps in a regression sense. Then we can plug each speaker's forward map back into a synthesis model — a dynamical-systems model of the kind used in task dynamics — and examine the contributions of the various articulators people actually use, to characterize what strategies they adopt.

As a reminder, we can go from the data to extracted air-tissue contours, and do guided PCA to obtain factors that tell us, for example, how much the jaw contributes, what the tongue factors are, and so on. From that we can estimate the various constrictions at the places of articulation: along the vocal tract we define about six different anatomical regions — at the lips, the alveolar ridge, the palate, the velum, and in the pharynx — and we can automatically estimate the constrictions there, as the baseline description of what people do.
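As a concrete illustration of a constriction-degree feature, here is a minimal sketch computing the minimum tongue-to-palate distance from two toy contours; the contours and the region definition are hypothetical, not our extracted data:

```python
import numpy as np

def constriction_degree(tongue, palate):
    """Minimum Euclidean distance between a tongue contour and a palate
    contour (each an (N, 2) array of x, y points): a simple proxy for
    constriction degree in that region."""
    d = np.linalg.norm(tongue[:, None, :] - palate[None, :, :], axis=-1)
    return d.min()

# toy contours: a flat palate at y = 1 and a tongue bulging toward it
x = np.linspace(0.0, 1.0, 50)
palate = np.stack([x, np.ones_like(x)], axis=1)
tongue = np.stack([x, 0.9 * np.exp(-((x - 0.5) ** 2) / 0.02)], axis=1)
print(constriction_degree(tongue, palate))   # narrowest gap, ~0.1 here
```

Restricting both contours to one of the anatomical regions before taking the minimum gives a per-region constriction measure.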

So here are some insights from the roughly eighteen speakers we analyzed — again, early work, presented at Interspeech — where we used a model-based approach. We approximated the speaker-specific forward map from the MRI data of these eighteen speakers, and simulated with a dynamical-systems model of the kind used in the motor-control literature: control systems written in state-space form.

Then we were able to interpret the results. One of the measures here basically represents the ratio of lip use to jaw use by speakers to create the various constrictions — bilabial, alveolar, palatal, and so on — along the vocal tract. You see that there are different ratios of how much people use each: a value of one means relying mostly on the lips, zero means using mostly the jaw. Different constrictions are achieved differently, and people use different ways of creating them. In this plot we see, for example, that for some constriction types the lips contribute more than the jaw, while for others the jaw or the tongue dominates. And the speakers in our set vary in how they create the same kinds of constrictions — so people genuinely differ in their strategies.
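The dynamical systems underlying task dynamics are typically critically damped second-order systems driving an articulatory variable toward a constriction target. A minimal sketch, with made-up stiffness and target values:

```python
import numpy as np

def simulate_task(z0, target, k=100.0, dt=0.001, T=0.5):
    """Critically damped second-order task dynamics:
        z_ddot = -k (z - target) - b z_dot,   b = 2 sqrt(k)
    integrated with simple Euler steps; returns the trajectory of z."""
    b = 2.0 * np.sqrt(k)          # critical damping: reaches the target with no overshoot
    z, v = float(z0), 0.0
    traj = []
    for _ in range(int(T / dt)):
        a = -k * (z - target) - b * v
        v += a * dt
        z += v * dt
        traj.append(z)
    return np.array(traj)

# lip aperture closing from 10 mm toward a 0 mm bilabial target
traj = simulate_task(10.0, 0.0)
```

In a full task-dynamics model, the task variable z is then distributed across articulators (jaw, lips, tongue) by speaker-specific weights — which is exactly the lip-versus-jaw ratio discussed above.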

So this is a very early insight into how much speakers use the jaw versus the lips, whether there is a functional specificity to it, and what that says about their motor planning. These are questions begging for more computational approaches; now, with the data in hand, we can go and see how people actually use the vocal instrument in producing these sounds that we call speech.

So the final part brings us to the kinds of slides we've been seeing at this conference: we also explored a little whether production information can be of use in speaker recognition type experiments. We did a little work on speaker verification with production data — there is not much data, nothing like the scale people are used to — but the question is: would speech production data be of any use at all for speaker verification?

Now, at test time, getting data like what I was showing — X-ray or MRI — is obviously not feasible in operational conditions. So we need some articulatory-type representation derived from the audio alone. People have been working on inversion problems — given acoustics, can we estimate articulatory parameters? — which is a classic, ill-posed problem, and one where I feel deep learning approaches are very powerful, because the forward process is highly nonlinear, so these methods are very conducive to learning the mapping.

Nevertheless, what we wanted was a speaker-independent mapping. So this work, done by a postdoc in the group a few years ago, said: if I can learn an acoustic-to-articulatory mapping for one exemplar talker — you often have lots of data from one single speaker, as in speech synthesis, where you take one talker's properties and build from them — then we can project anyone else's acoustics through this speaker's map, to see how that talker would have produced those acoustics, and get some semblance of an articulatory representation.

so

that we can do speaker independent sort of you know measures so that was sort

of the i so we said well we can use a reference speaker

to create a articulate acoustic target like to map and to the inverse model and

then when you get that speakers

for one acoustic signal

we can actually do inverted sort of features and use these to a few

the three
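The reference-talker trick just described might look like this in miniature. Again, the data are synthetic and the nonlinear forward map and all shapes are hypothetical; the point is only that one model, trained on a single talker, is applied to everyone.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Reference ("exemplary") talker: plenty of parallel acoustic-articulatory
# data from one single speaker, as in a synthesis corpus.
ref_mfcc = rng.normal(size=(2000, 13))
true_map = rng.normal(size=(13, 14))
ref_artic = np.tanh(ref_mfcc @ true_map)  # toy nonlinear forward map

inverter = MLPRegressor(hidden_layer_sizes=(64,), max_iter=400,
                        random_state=0)
inverter.fit(ref_mfcc, ref_artic)

# Any other speaker's acoustics are pushed through the *same* model,
# yielding pseudo-articulatory trajectories: "how would the reference
# talker have articulated to produce these acoustics?"
test_mfcc = rng.normal(size=(300, 13))
pseudo_artic = inverter.predict(test_mfcc)
print(pseudo_artic.shape)
```

The output trajectories are then treated as a feature stream like any other.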

Is there any benefit? The rationale is that the inversion produces projections in a low-dimensional, robust, constrained space: it provides physically meaningful constraints on how we parameterize the signal, so there might be some advantage to be had. This appeared earlier this year in Computer Speech and Language.
so

The front end for some of these early experiments used the X-ray Microbeam database, which is publicly available and has a good number of speakers, together with standard GMM models, because we do not have that much data. Here are some initial results. With MFCCs only, for this small and fairly noisy data set, we get about 7.5% EER. If you instead use the real, measured articulation, you get a clear boost: it provides nice complementary information, which is encouraging. You might think of that as an oracle experiment, an upper bound, if you had access to the actual measurements.

Now, if you use the inverted measurements instead, we do about as well, in fact slightly better, and putting the two together actually provides an additional boost that is pretty significant.
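A toy version of this kind of two-stream GMM system, with a simple weighted score fusion, could be sketched as follows. The data, component counts, and fusion weights are illustrative assumptions, not the reported experimental setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

def enroll(frames):
    # A small diagonal-covariance GMM per speaker and per stream.
    return GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(frames)

def speaker_frames(offset, n=400):
    # Toy stand-ins for an acoustic (MFCC-like) stream and an
    # articulatory (measured or inverted) stream.
    return (rng.normal(loc=offset, size=(n, 13)),
            rng.normal(loc=-offset, size=(n, 6)))

ac_a, ar_a = speaker_frames(0.0)
ac_b, ar_b = speaker_frames(1.0)
models = {"A": (enroll(ac_a), enroll(ar_a)),
          "B": (enroll(ac_b), enroll(ar_b))}

# Score a test segment from speaker B: fuse the two streams with a
# weighted sum of per-stream average log-likelihoods.
test_ac, test_ar = speaker_frames(1.0, n=100)
scores = {spk: 0.5 * m_ac.score(test_ac) + 0.5 * m_ar.score(test_ar)
          for spk, (m_ac, m_ar) in models.items()}
print(max(scores, key=scores.get))  # the true speaker "B" should win
```

A real system would add a universal background model and compute EER over many trials; the fusion idea is the same.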

The encouraging thing is that if we have enough data to create these maps across speakers, we need only an exemplar in each case, and we can provide an additional source of information. Perhaps that will give us some wins, but maybe also some insight into why people are different, and into which categories of articulation, structural or strategic, differ.

This slide just shows the standard pellet placement used in the X-ray Microbeam database.

so

To summarize the speaker recognition experiments, the take-away is that using both acoustic and articulatory information helps: there is a significant EER improvement if you use measured articulatory information alongside standard acoustic features, and more modest but honest gains if you use estimated articulatory information. What would be nice is to look at new ways of doing inversion, with the kinds of advances happening right now in neural networks and with the growing availability of data, to do this better, and to be able to evaluate on larger acoustic data sets from SRE-like campaigns.

So, moving forward, we are very excited about some of this. The earlier work was done with my collaborators at Lincoln Laboratory, along with parallel work there, and after some initial pilot work we recently received a grant to widen this line of work. Our idea is to do this very systematically: to collect real-time and volumetric MRI data, in detail, from about two hundred subjects, and share it with people. We describe this in an upcoming paper, and I will share the slides if you are interested. We are early in the collection, about ten speakers so far since the project started, with everything from canonical material such as the rainbow passage to all kinds of spontaneous speech and so on. If you have any suggestions or ideas about what would be useful for speaker modeling, now is the time, and we will consider them. Most subjects will be native speakers of English, and about twenty percent will be nonnative speakers of English; in other projects we collect data from people speaking other languages, everything from African languages onwards.

Finally, besides giving insight into inter-speaker variability, this also lets us study specific use cases: for instance, the developing vocal tracts of kids, or how speaker variability manifests in the signal in clinical populations. For example, we have been working with people who have undergone treatment for oral cancer. The standard surgical intervention removes parts of the tongue, and on top of that there are other therapeutic treatments such as radiation, so the cancer and its treatment modify or damage the physical structure of the instrument. Here we see two such patients. One basically lost most of the tongue body to the cancer, and it was reconstructed with a flap taken from the forearm; you can see the variation in the vocal tract relative to normal anatomy. So how does their speech cope? Getting speech back is one of the big quality-of-life measures. Interestingly, some of these patients report good speech ability, so we have access to such speakers, collect a lot of data from them, and can compare how they compensate, what strategies they use, how a person manages to speak quite intelligibly, quite well, despite all this. This provides an additional source of information for understanding the question of individual variability.

So, in conclusion. Yes, imaging data are integral to advancing speech communication research, and vocal tract information plays a crucial part in this, I believe. To do that we need to gather data from lots of different sources to get a complete picture of speech production. That is challenging from a technological and computational standpoint, as well as from a conceptual and theoretical perspective, but I believe there are rich applications, including machine speech recognition and speaker modeling. This approach is inherently interdisciplinary, so people have to come together to work on these topics, and to share.

These are some of the people in my speech production group, alumni at the top and current members at the bottom, who contributed to this work and in particular to the data collection: the colleagues who do all the imaging work, the imaging scientists, and the linguists, who provide the conceptual framework for how we approach questions such as the morphology and articulatory strategy work I was talking about, much of which is now translating to speaker verification. I also want to thank the colleagues who made their i-vector systems and expertise available to us, and our program sponsors, whose support has been important for this work and who keep pushing us to share these things with people. With that, I thank all of you for listening. The material is available online if you are interested, including what I did not get to. Thank you very much.

[Question] Thank you very much, that was fascinating. Two questions. First of all, when are you going to get to the larynx? I am asking from the perspective of the forensic phoneticians. We are conscious of between-speaker differences arising from the larynx, in spectral slope and that sort of thing, but in this work that is suppressed. And also, supralaryngeally, what would give us almost more robust information is knowledge about speaker variability in the nasal cavity and the sinuses, that sort of thing: what is known about speaker identity there? The catch, of course, is that you are not going to get it with telephone speech and so forth, where anything above three kilohertz is gone.

[Answer] Thanks. So the first question was about the larynx, down in this region. The glottal, voice-source phenomena happen at a much higher rate, and our MRI frame rate is still not good enough for that. What people have been doing is high-speed imaging of the larynx by putting a camera in through the nose, which is a bit of an intervention. On the other hand, what we can do with MRI is look at things like larynx height and other such behaviors, and also get structural information. In particular, one of the things the newer approaches give us is complete 3D imaging of the region, so we can really look at, for example, the sinus region, which is not available in any of the other modalities people use, for both structural and behavioral phenomena. And for characterizing things like the sinuses, which do not change very much during speech, we can use T2-weighted images to characterize each speaker anatomically, by what structures they have, and then see how we can relate that to, or account for it in, the signal. We are also trying to see how we can do some multimodal imaging of the voice source; we have tried ultrasound, but it offers quite a small window into this, and we really want the high-speed imaging. So it is still an open question in terms of how best to image it; there are references in the previous slides for people who are interested.

[Question] Is it possible to say, broadly, whether there are any particular areas that show the greatest amount of between-speaker difference? In other words, if you are going to look for where speakers differ, is there one place that, goodness knows, carries it all, or do people simply differ in all sorts of ways?

[Answer] I think it is the latter; that is my guess right now. Although I do think things will begin to cluster as we increase the numbers, just as with eigenvoices and eigenfaces: I am sure certain kinds of things will start clustering once we get there. But right now the sources of variability seem, from a perceptual point of view, to be all over the place. Plus, how people acquired speech also varies quite a bit, because where they come from, how they learned, and the practices they use all differ. There is another piece of work, which I can talk about, on articulatory setting, with ideas about how people differ when you extract parameters from a motor-control point of view, and whether that can be tied to language background or other factors; still an open question. What I feel has been missing is that these are very small datasets compared to what you are used to on the speech side. But if we increase them to some extent, and bring in the kinds of computational tools and advances that you are making, I think we can slowly begin to understand this at the level we want to. Open question.
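The eigenvoice-style clustering intuition can be illustrated with a quick eigen-analysis of synthetic speaker vectors; all data, group structure, and dimensions here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy "speaker vectors": two latent groups of speakers embedded in a
# higher-dimensional feature space (stand-in for articulatory measures).
group = rng.integers(0, 2, size=40)
centers = np.array([[2.0, 0.0], [-2.0, 0.0]])
latent = centers[group] + 0.3 * rng.normal(size=(40, 2))
basis = rng.normal(size=(2, 20))
speakers = latent @ basis

# Eigen-analysis (PCA via SVD) of mean-centered speaker vectors: the
# leading component should separate the two clusters.
X = speakers - speakers.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
proj = X @ vt[0]
separated = (np.sign(proj[group == 0].mean())
             != np.sign(proj[group == 1].mean()))
print(separated)
```

With more speakers, the same decomposition is what would reveal whether articulatory variability clusters or stays diffuse.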

[Question] First, a comment. You put up a kind of acoustic tube model, and I remember one thing from one of the workshops in the early nineties. From the mid sixties up until the late eighties or early nineties, we used an acoustic tube model that was straight, like a flute, and a summer student basically spent the summer saying: actually, the vocal tract has a right-angle turn in it, and no one had really thought about how much that right-angle bend actually impacts formant locations and bandwidths. He formulated a closed-form solution, and I think it came to between one and three percent shifts in formant locations and bandwidths, so very much in the spirit of taking the physiological details into account. My one basic question: you focused on speaker ID, and I am assuming many of your speakers here are bilingual. Have you thought about looking at language ID, to see whether the physiological production systematically changes when people speak one language versus another?

[Answer] Absolutely, along those lines. On the first point, the comment John Hansen made about the bend in the vocal tube: people have done the simulations of articulation-to-acoustics and the effect of the bend. In fact, there is a classic paper from a long time ago that estimates the effect at about three to five percent, and simulations later on verified something in that range.
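For scale, the classic straight-tube model puts formants at odd quarter-wavelength resonances, so a shift of a few percent can be translated into hertz directly. The tube length and sound speed below are textbook illustrative values, not measurements from any speaker discussed here.

```python
# Quarter-wavelength resonances of a uniform tube closed at the glottis
# and open at the lips: F_n = (2n - 1) * c / (4L).
c = 35000.0   # speed of sound in warm moist air, cm/s (approximate)
L = 17.5      # vocal tract length, cm (typical adult male textbook value)

formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print([round(f) for f in formants])          # neutral-tube formants in Hz

# A 2% shift, in the 1-3% range attributed to the right-angle bend:
shifts = [0.02 * f for f in formants]
print([round(s) for s in shifts])            # corresponding shifts in Hz
```

So even a small percentage bend effect is tens of hertz at the higher formants, which is why the geometry is worth simulating properly.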

The more recent models try to do this with finite-element-style simulations, and that is the kind of thing we can now do with access to the volumetric data I talked about, for all the postures from all these speakers. With high-performance computing, what we want to do there is becoming a reality.

The second question, as John reminded me, was about language ID. Yes, of course: we actually have about forty or fifty different languages represented, people with English as a second language speaking English, in our datasets, across the various linguistic experiments we have been doing. One of the things we have looked at in the real data, a little, not so much for language ID per se but where people have intuitions and hypotheses, is articulatory setting: the posture from which you start executing a speech task, going from rest to speech. If you think of it as a dynamical system, the initial state from which you move to another state matters; you release one task and go on to the next, making one constriction after another. We found that people have preferred settings from which they start executing, and that these are very language-specific: we showed this, for example, for German speakers versus Spanish speakers versus English speakers. So these kinds of things can be estimated from articulatory data.
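A minimal way to quantify an articulatory setting from such data would be to average articulator positions over the non-speech (rest) frames per language group. This sketch uses synthetic trajectories with a built-in language offset; the dimensions and the speech/pause labeling are assumptions, not the actual measurement protocol.

```python
import numpy as np

rng = np.random.default_rng(3)

def setting(traj, is_speech):
    # Summarize the "setting" as the mean posture over pause frames,
    # i.e. the position from which speech tasks are initiated.
    return traj[~is_speech].mean(axis=0)

# Toy articulatory trajectories (frames x sensor dimensions) with a
# shared speech/pause flag and a language-specific rest-posture bias.
n, d = 1000, 6
is_speech = rng.random(n) < 0.7
traj_l1 = rng.normal(size=(n, d)); traj_l1[~is_speech] += 0.5
traj_l2 = rng.normal(size=(n, d)); traj_l2[~is_speech] -= 0.5

s1 = setting(traj_l1, is_speech)
s2 = setting(traj_l2, is_speech)
print(np.round(s1 - s2, 1))  # language-specific offset per dimension
```

On real data the contrast between language groups would of course be tested statistically rather than read off directly.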

The inversion equivalent has not been done yet, to my knowledge, but it is quite possible, and we are happy to share data and talk to people about it.


[Question] I have a comment I would like you to respond to. One of the first processing steps in speaker recognition pipelines is cepstral mean subtraction: basically you throw away the average size of the vocal tract. How does that impact what you are doing?

[Answer] Right. I did not talk about the channel effects and the channel normalization that deal with recording conditions and so on. One of the things we are contemplating, as many people have discussed with joint factor analysis and related models, even with the new deep learning systems, is to model these multiple factors jointly, so that we can obtain speaker-specific variability measures separated from the effects caused by other, extraneous interferences or other kinds of transformations that might occur. That is why we are approaching this from first principles: we do not want to just make the jump to throwing everything into some machine learning engine and estimating blindly, but to look systematically at linguistic theory and speech science, at the features, and at analysis-by-synthesis type approaches. Then, when we face other kinds of conditions, such as open-environment or distant speech recordings, which are of much interest for various reasons, we can account for these things. So I tend to believe in that kind of more organic approach.
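The cepstral mean subtraction the questioner refers to is essentially a one-liner. The sketch below uses synthetic cepstra with an additive offset standing in for the stationary channel, and for the average vocal tract component, that the normalization discards.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy cepstral frames (frames x coefficients) with a constant additive
# offset standing in for channel and average-vocal-tract effects.
cepstra = rng.normal(size=(500, 13)) + 3.0

# Cepstral mean subtraction: remove the per-utterance mean of each
# coefficient, discarding anything stationary over the utterance.
cms = cepstra - cepstra.mean(axis=0, keepdims=True)

print(np.allclose(cms.mean(axis=0), 0.0))  # True: the mean is gone
```

This is exactly why any speaker information carried by the long-term average, including average vocal tract size, is removed along with the channel.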

[Session chair] We have time for one more question, if we are fast.

[Question] I am sorry, I will be fast. I want first to thank you; it is very nice to see science informing speech technology, particularly speaker recognition and forensics. My comment is just to recall the difference between speaker recognition and forensic voice comparison, because both fields are present here. In speaker recognition we already face a huge difference between, say, read speech in training and spontaneous speech in test. In speaker recognition we can imagine the speaker is trying to cooperate, as in the classical setting; in forensic voice comparison we should imagine exactly the opposite. So my question: would deliberately modified constrictions, or altered compensation strategies, challenge the approach you have exposed?

[Answer] Yes, and that is right: there are certain things we can change and certain things we cannot. That is one of the things we are trying to go after. There is something that is given in our physical instrument; we can compensate, but only up to a point, and we still see residual effects, and we want to see whether we can get at those residual effects. Maybe the bounds are not there yet; I have a background in information theory, so it is always interesting to bound the limits of things: how much can we actually recover? After all, we have a one-dimensional signal, from which we project onto all kinds of feature spaces and do all our computation and all our inference, whether the target is the speaker or something else. Say you manipulate your strategies: that is only one degree of freedom, and it causes some differences, but if we can account for it somehow, can we still see the residual effects of the instrument, given the specific ways in which they are changing? You cannot speak by doing just random things with your articulation to create speech sounds. That is why joint modeling of the structure and the function would be very interesting to see, along with how much can be spoofed by people who are determined to do so.

That remains to be seen, but I am hoping that, by being very microscopic with these analyses, we can get some insight into it, insight that is very objective, not just impressionistic, not just a single expert willing to talk about it in court. I think that is one of the reasons the forensic community has been very supportive of the idea: let us approach it in as objective and scientifically grounded a way as possible.

[Session chair] Since we are out of time, let us thank the speaker again. Thank you.