Thank you, that's very kind of you, and it's a pleasure to be here at this meeting. I'm going to talk about a very classic question in speech communication: understanding variability and invariance in speech. People have been asking about this for a long time.
The specific focus is on this remarkable vocal instrument we each have for producing speech. Here are six different people; these are midsagittal slices of their vocal tracts. You can see immediately that each has a very uniquely shaped vocal instrument with which they produce speech, and it is this instrument that produces the speech signal you are trying to use for speaker recognition. Let me orient you, in case you're not familiar with this kind of view into the mouth: here are the lips, here the nose, the tongue, and up here the velum, the soft palate. Keep that in mind, because you'll see a lot of these pictures in my talk today.
Here are more people, all of them trying to produce the same well-known vowel. Just a quick look and you can see that even as these people produce the same sound, they do it slightly differently. In another example, the first and second speakers both make the gesture for the same sound, but again in slightly different ways. So we know that both the structure within which speech production happens and the way we produce speech vary across people, and some of that is reflected in the speech signal. That is exactly what we are trying to get at.
So the aim of this line of work is to ask what role speech science can play in understanding and supporting the development of speech technologies. Not only do we want to recognize speakers, we want to know what makes them different. Specifically, the focus today is on vocal tract structure, the physical instrument a person is given; on function, the behavior with which we produce speech; and on the interplay between the two. By structure I mean the physical characteristics of the vocal tract apparatus: the hard palate geometry, the tongue volume, the length of the vocal tract, the velum, and so on. Function refers to the dynamic characteristics of speech articulation: how we move, for example, to produce consonants by forming constrictions in the vocal tract. To make a sound like /s/, the tongue tip is raised to create a narrow channel and generate turbulence.
This leads to very specific questions. How are individual vocal tract differences, like the ones in those pictures, reflected in the speech acoustics? Can the inverse problem be solved; can they be predicted from the acoustics? How do people with all these structural differences create phonetic equivalence, given that we all manage to communicate using a shared spoken language? And, as is often pointed out, what contributes to distinguishing speakers from one another from their speech? I want to emphasize that we are not only trying to differentiate individuals from their speech signal, but to understand what makes them different, in structure and in behavior.
Today I'll sample some of this work. We'll see how we can quantify individual variability given vocal tract data; whether we can predict some of it from the signal, and what the bounds on that are; and how individual articulatory strategies differ, and whether we can exploit them in automatic speaker recognition type applications, offering some interpretation along the way.
As for the approach we take in our laboratory: one of my research groups is the SPAN speech production and articulation group, which looks at a lot of different questions, including questions of variability, and we take a multimodal approach. We look at different ways of getting at speech production: real-time MRI, which I'll talk about a lot today, audio, and other measurement technologies, together with a whole lot of multimodal signal processing, image processing, speech processing, and modeling based on that. We try to use these engineering advances to gain insights into the dynamics of production and speaker variability, and into questions about speaking style, prosody, and emotions.
The rest of the talk is structured as follows. The first part will focus on how we measure speech production, how we get those images, with a particular emphasis on MRI, magnetic resonance imaging, something we have worked hard to develop. Then, given such datasets, how do we analyze them; and I'll end with some modeling questions.
So how do we image the vocal tract? Imaging has been central to speech science for a long time: we want to observe and measure articulatory details, the tongue surface and so on, and there are a number of techniques, each with its own strengths and limitations. For example, there are the early X-ray films; here, in fact, is an X-ray of Ken Stevens himself. X-ray has pretty good temporal resolution, but it is not safe for people, so it is no longer a usable methodology. There are other techniques, like ultrasound, which gives you only a partial view of the inside: you only see the tongue surface, really just the edges, and that is not necessarily enough for the kinds of modeling we are after. There is electropalatography, where the subject speaks with an artificial palate fitted with contact electrodes; the contact made by the tongue against the palate provides insights about timing and coordination in speech. And finally there is electromagnetic articulography, where we glue little Rice-Krispie-like sensors onto the tongue and measure their dynamics.
New possibilities were created by advances in MRI, which provides very good soft-tissue contrast. It relies on the water content of tissue, which varies across soft tissues; we make use of that by exciting the protons and imaging the signals they generate as they relax. It is very exciting because it gives very good quality images, but traditional MRI is very slow, and it comes with other challenges: it is very noisy, you lie supine inside the scanner, and producing speech for an experiment in there is a little odd. These are the things we have been contending with over the past decade or so.
The very first step, going back to around 2004, was to move to real-time imaging: to get to speeds, or sampling rates, higher than the rates of speech articulation. Let me show you a sample.

[video: real-time MRI of a subject reading aloud]
If you're familiar with the rainbow passage, that is what the subject is reading, and it was very exciting for us to be able to do this. We were also making synchronized acoustic recordings, which required a lot of speech enhancement work because of the MRI noise, and that opened up a lot of different possibilities. But we did not stop there. These are perfectly usable signals for a wide range of questions, but we have been trying to make them even better.
Consider the kinds of rates involved. Speech is not one uniform motion; we use a lot of different articulators for different tasks, and different sounds, from trills to sounds like /s/, have different rates. If we could image at those kinds of rates, that would be really cool.
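As a back-of-the-envelope illustration of why such frame rates matter, here is a Nyquist-style calculation. The movement bandwidths are ballpark figures I am assuming for illustration (a few hertz for jaw and syllable cycles, a couple of tens of hertz for trill-like tongue-tip motion); they are not measurements from this work.

```python
def min_fps(movement_hz, oversample=2.0):
    """Minimum frames per second to track a quasi-periodic movement:
    the Nyquist rate (2x the movement frequency) times an oversampling
    factor so that trajectories are smooth enough to analyze."""
    return 2.0 * movement_hz * oversample

# Ballpark articulator movement rates (illustrative assumptions).
for articulator, hz in [("jaw / syllable cycle", 4),
                        ("tongue body", 8),
                        ("tongue-tip trill", 26)]:
    print(f"{articulator}: needs ~{min_fps(hz):.0f} fps")
```

With these assumed numbers, trill-like motion alone already calls for on the order of a hundred frames per second, which is roughly where the imaging described next ends up.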
In fact, last year we were able to make a breakthrough and reach about one hundred frames per second of real-time MRI, thanks to the work of one of my postdocs. Not only is it very fast, so that at rapid speech rates you can really see what the tongue tip is doing, but you can also image multiple planes simultaneously: a midsagittal slice like this, an axial slice like that, or a coronal slice like this, giving simultaneous views of the vocal tract. It is really exciting to be able to image at this high rate and gain these insights.
This was made possible by both hardware and algorithmic advances. We developed a custom receiver coil for the upper airway, and we made a lot of progress both in pulse sequence design and in image reconstruction, using compressed sensing and related advances from the signal processing community. So we were able to really speed this up, and we are quite excited about it. Here is what an experiment looks like: someone sits at the console doing the audio collection; we reprogrammed the scanner so that the audio is synchronized with the imaging; and we have an interactive control system to select the scan planes.

[video: real-time MRI acquisition with synchronized audio of short test utterances]
You get the idea: you can really see things. On the projector it does not look as good as the actual data, but the point is that we are now collecting production data at scales that are conducive to the kinds of machine learning approaches one could imagine, although I will not be talking much about that side of the problem today.
In addition to single-plane or multi-plane slice imaging, we are also very interested in volumetric imaging, especially for characterizing speakers; one of our research interests is to capture the full geometry of the vocal tract while people are speaking. We have made some advances there too: with about seven seconds of a sustained posture we can do a full sweep of the entire vocal tract, so we can get exemplary 3D geometries of people's vocal tracts. We can also image the anatomical structures, the palate and so on, in detail with classical static MRI. I will show you why we are doing all of these things: for the kinds of measures we want, we need a comprehensive characterization of each speaker's vocal apparatus, the vocal instrument and its behavior. One of the things we decided recently was to release a lot of these data: multiple speakers reading sentences, with audio, alignments, image features and so on, all available for free download.
Here are some examples of that kind of data.

[video: samples from the released real-time MRI corpus]

It has five male and five female speakers; you may even recognize some of them. We also provide alignment, basically coregistration of the data across speakers, and the algorithms for that have been released as well, so we have this kind of data to work with.
So what do you do with this stuff? Let me introduce some preliminary analysis. A lot of it is image processing. The very first task is getting at the structural details of the human vocal apparatus; for people interested in anatomy and morphometrics, this is a device for measuring everything, the length of the palate, the cavities, and so on, and we wanted to do that very carefully with high-resolution imaging. On top of that, we also want to track the articulators, since each articulator serves important, specific tasks, and we want to process these data automatically. The methodology we proposed was a segmentation model with a very nice mathematical formulation, work done by one of my students, and it yields a segmentation algorithm that works fairly well.

[video: automatic segmentation of real-time MRI frames]

With that, we can capture the various air-tissue boundaries automatically from these vast amounts of data; you can almost think of it as one kind of feature extraction.
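As a toy sketch of what air-tissue boundary extraction does, deliberately simplified relative to the actual region-based formulation, here is a threshold-based boundary finder on a synthetic intensity grid rather than real MRI.

```python
def air_tissue_boundary(img, thresh=0.5):
    """For each column of an intensity image (rows run top to bottom),
    return the row index where dark airway first gives way to bright
    tissue, i.e. the first pixel with intensity >= thresh."""
    rows = len(img)
    return [next((r for r in range(rows) if img[r][c] >= thresh), None)
            for c in range(len(img[0]))]

# A 4x5 synthetic "slice": air (0.1) above, tissue (0.9) below,
# with the boundary rising toward the right, like a palate outline.
slice_ = [
    [0.1, 0.1, 0.1, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.9, 0.9],
    [0.1, 0.9, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9],
]
print(air_tissue_boundary(slice_))  # [3, 2, 2, 1, 0]
```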
From there we can build representations that are linguistically more meaningful to us. One of my close collaborators, Louis Goldstein, is one of the founders of Articulatory Phonology, which conceptualizes speech production as a dynamical system: the various articulators are recruited for tasks, basically creating, forming, and releasing constrictions as we move around. So we are interested in features like lip aperture and constriction degree and location, and we are able to extract these automatically. Going from images to segmentations, we can extract such linguistically meaningful features, and we can derive other kinds of representations too, for example by running PCA on the contours to look at the contributions of different articulators. This gives us ways of objectively characterizing production information, including what is speaker-specific.
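As a small illustration, a constriction-degree feature of the kind just described can be computed directly from two extracted contours. The contour points below are hand-made, not real segmentations.

```python
import math

def constriction_degree(tongue, palate):
    """Minimum Euclidean distance between a tongue contour and a palate
    contour (each a list of (x, y) points); a small value means a
    narrow constriction."""
    return min(math.hypot(tx - px, ty - py)
               for tx, ty in tongue
               for px, py in palate)

# Hypothetical contour points in cm, for illustration only.
palate = [(0.0, 3.0), (1.0, 3.2), (2.0, 3.0)]
tongue = [(0.0, 1.0), (1.0, 2.6), (2.0, 1.5)]
print(round(constriction_degree(tongue, palate), 2))  # 0.6
```

Lip aperture works the same way, as a distance between upper- and lower-lip landmark points per frame.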
So far I have told you how we get the data and about some of the basic analysis; with that in hand we can start looking at speaker-specific properties. As I mentioned earlier, the first step was to characterize each vocal instrument anatomically. This connects to a substantial anatomy literature, so we went through that literature, compiled a whole set of landmarks relevant to speech, and came up with measures we can extract, such as vocal tract length and the oral and pharyngeal cavity lengths, which we can take from these very high-contrast images. That is one source of speaker-specific information.
As an aside, since many repetitions of the same tokens were recorded from these people at different sessions, we were interested in how consistent people are, and the result was reassuring: people are fairly consistent in how they produce these postures, and the measurements were very consistent across sessions. This figure, for example, shows the correlations across repetitions; this was presented at Interspeech.
So here is the striking thing: we each have this fixed articulatory environment within which we learn to produce speech behavior. We want to know how much of the behavior is dictated by the environment we have, and how much reflects strategies adopted by speakers that are unique to them for reasons we cannot really pinpoint, whether learning they have done or what their environment affords. The question is whether we can start deconstructing this a little bit.
Next I will walk through a few examples along this direction. In this picture I want you to focus on the palatal variation, the roof of the mouth, the hard palate, which is an important part of the vocal apparatus. Here we see that this person's palate is very domed; this one's dome sits more posterior; this one is more anterior and sharper. And that is just six different people.
Now, how do we quantify what you are qualitatively seeing? What one of my students did was take these extracted palate shapes and run even a simple PCA analysis, and he showed that about ninety-six percent of the variance could be explained by the first five factors. The first was akin to how domed, how concave, the palate is; the next captured how forward or backward that concavity sits; and another captured how sharp it is. So these factors have interpretations, yet they are fully objective, and we can begin to quantify and cluster people along these low-dimensional latent variables. We can then plug them into models, for example acoustic simulations, to see what the acoustic consequences of these variations are.
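The factor analysis itself is standard PCA on stacked palate shapes. A minimal sketch on synthetic palate profiles follows; the dome template, the speaker variation, and the dominance of the first factor are fabricated to mirror the idea, not taken from the data.

```python
import numpy as np

# Each row: one "speaker's" palate height profile sampled at 20 points
# along the front-back axis; these synthetic speakers differ mainly in
# dome depth, so one factor should dominate.
rng = np.random.default_rng(0)
dome = np.sin(np.linspace(0.0, np.pi, 20))      # generic dome template
X = np.stack([a * dome + rng.normal(0.0, 0.02, 20)
              for a in rng.uniform(0.5, 1.5, 12)])

Xc = X - X.mean(axis=0)                          # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
print(f"factor 1 explains {explained[0]:.0%} of the variance")
```

With real palate contours, the loading vectors in `Vt` are what get read off as "concavity", "front-back position", and "sharpness" factors.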
One of the things we found is that the degree of concavity affects the formants very much, whereas how anterior or posterior the dome sits, and how sharp it is, did not really matter, at least in these first simulations. So from data to morphological characteristics, we can interpret what acoustic consequences to expect. In fact, we can put this into an articulatory synthesizer and play the same word produced with different palate shapes:

[audio: the same word synthesized with different palate shapes]

You can hear how the sound changes with the palate. So we can do this kind of analysis very carefully.
Of course, we are also interested in the inverse problem: can we estimate these shapes given the acoustic signal, and how much of the palate shape detail is available to us? So we did the classic thing and extracted all kinds of features from the acoustic signal. Remember, the shaping of the airway as we speak is influenced both by the environment, the structure, and by the movements, the behavior; both influence the signal. So now we ask what a single recording can tell us. In a very simple first experiment on shape-class detection, concave versus flat, we showed that we can guess what kind of palate a person has just from the acoustic signal about sixty-some percent of the time. So the information is available.
A more interesting question concerns a very classic morphological parameter that we have been using a lot: vocal tract length. This is something that has been important in speech recognition and elsewhere; we normalize for it, and we also estimate it for things like age recognition. Here again, the same question: how much of this speaker-specific structure is reflected in the signal, and how precisely can we pin it down for a given speaker? We know that speakers compensate to some extent for the environment they have, and we want to know how much of it is residual, something we can actually estimate.
I start with vocal tract length because it is a classic question that people have been asking. For example, here is growth data, from Vorperian and colleagues in 2009, showing vocal tract length against age: across development, vocal tracts grow from about six centimeters to seventeen and a half or eighteen centimeters, and the trajectories for males and females diverge empirically around puberty. Correspondingly, this strongly affects the formant space of the spectrum. Zeroing in on the first two formants, we can see that in going from shorter to longer vocal tracts the vowel space gets compressed and shifted. What people have been doing, implicitly or explicitly, when we do VTLN is basically normalizing for this effect.
Classic estimation of vocal tract length from acoustics goes back a long way: from a very simple uniform-tube model at a neutral, rest-like configuration, we can estimate the length of the vocal tract from the formant parameters. One of the early formant-based predictors relies on the third and fourth formants, and others have been proposed since. What we asked was this: now that we have direct evidence, measured vocal tract lengths together with the acoustics, can we come up with better regression models? And sure enough, we showed, applying this to the TIMIT corpus as well, that we can get really good estimates, with very high correlations between estimated and measured vocal tract length. So we can fit a good regression model, estimate its parameters, and obtain vocal tract length as yet another piece of morphological detail about a person from the signal. That is kind of exciting.
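For reference, the textbook uniform-tube estimate that underlies these approaches: a tube closed at the glottis and open at the lips has resonances F_n = (2n-1)c/(4L), so each formant yields a length estimate, with the higher formants (such as the third and fourth mentioned above) commonly preferred. The formant values below are illustrative, not from our data.

```python
def vtl_from_formant(n, fn_hz, c_cm_per_s=35000.0):
    """Uniform tube closed at one end: Fn = (2n-1)*c/(4*L), hence
    L = (2n-1)*c/(4*Fn). Speed of sound c in cm/s, length L in cm."""
    return (2 * n - 1) * c_cm_per_s / (4.0 * fn_hz)

# Illustrative neutral-vowel formants for an adult male (Hz).
f3, f4 = 2500.0, 3500.0
print(vtl_from_formant(3, f3), vtl_from_formant(4, f4))  # 17.5 17.5 (cm)
```

The regression models described above replace this single-tube idealization with coefficients fit to measured lengths, which is why they correlate better with ground truth.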
This last slide summarizes what I just said: computational advances in estimation, the availability of data, and good statistical methods allow us to get much better insights.
Now, moving on, let's look at the tongue. The tongue is a fascinating structure: it is very hard to define where it begins and ends, it is pretty funky, and it plays a big role in how we shape the airway. The question we ask is this. Take vocal tract length and the formant-space picture I was showing you before: we normalize for length using linear normalization, which is what is typically done, but we are still left with residual differences that are unexplained. People have proposed approaches like nonlinear vocal tract length normalization, but these are limited and do not identify the cause. So what we want to know is whether that residual effect actually says something about the size of the tongue people have, something we could get at automatically. The hypothesis on this slide is that relative tongue size differences between people could explain some of the vowel space differences.
So these are the questions we ask in this part of the work. How does one define and measure tongue size? How does tongue size vary across the population? What is the effect of tongue size on articulation? How much of it is visible in the acoustics; can it be predicted and normalized for? There is very little published work on this kind of thing. People know that the tongue grows in a coordinated way with the overall size of the vocal tract as we develop, and there are some disorders, Down syndrome for example, that are usually associated with large tongue size.
Tongue size plausibly affects how we produce speech: palatalization of coronal sounds, the sounds made at the front of the mouth like /s/ and /t/; labialization, that is, relying more on the lips, or using the tongue in producing bilabial sounds like /p/ and /b/; other coarticulation effects; and a slowing of speech rate, because there is a larger mass to move around. These things have been mentioned in the literature, but not much quantified.
So we set out, given that we have lots of data, to estimate a mean tongue posture from the segmentations and derive a proxy measure for tongue size. Once we have that, we can plot the distributions of tongue size across the male and female speakers in our corpus. What we see, with the green curve for the female speakers and the other for the male speakers, is a significant sex difference in tongue size. So this is yet another interpretable marker that we might hope to recover from the acoustic signal.
That said, how well we can separate this structural component from the behavior riding on top of it is still not well established; how you really assess the size of a constantly moving tongue remains an open question. But we have taken a first shot: we tried different kinds of normalization, looking both at rest posture and during movement, and there is not much difference between them; they are pretty highly correlated.
Once you have that, we can use this information in simulations, for example in an articulatory synthesis model; people still study speech production with such models. We can reflect the measurements back into the model and study the effect by analysis-by-synthesis: with a larger tongue we would expect longer constrictions, and so on. What we did was vary the constriction lengths and locations, based on the measurements we made, to see how tongue size differences play out in the acoustics, in the vowel space for example. What we observed was that the tongue size differences in the population and what the simulations predicted were very well correlated in terms of the formant patterns: in the simulations the vowel space moves the same way it does in the data, so the general trends agree. All in all, we saw that tongue size varies across speakers quite a bit, on the order of fifteen to thirty percent, and that a consequence of a larger tongue is longer constrictions in the vocal tract as we produce sounds. Because constrictions are central to how we produce speech sounds, this stretches and twists the vowel space, and that is reflected in the signal. The interplay among constrictions, formants, and tongue size is fairly complex, though, and requires much more sophisticated modeling and learning; hopefully, with data of this kind, these questions can be pursued.
The final thread on speaker-specific behavior concerns articulatory strategy, by which I mean how talkers move their vocal tracts. As you know, the vocal tract is a pretty clever system: it is redundant, with built-in tolerance, so you can use different articulators to accomplish the same task. For example, you can move the jaw or the lips, both of which contribute to bilabial constrictions as in /p/ and /b/, and people have several ways of changing their airway shape to do this. We call these articulatory strategies; some are speaker-specific and some are language-specific, and we want to get at them because they are yet another piece of the puzzle as we try to understand what makes me different from you when we produce speech, beyond merely detecting that I am different from you from the speech.
Here is the approach, again fairly early work. We have lots of real-time MRI data; the database includes a pilot study of eighteen speakers, with all the volumetric and detailed anatomical scans as well, so we can characterize both the morphology and the speaking behavior. Once we have that, we estimate what we call speaker-specific forward maps, from the vocal tract shapes to the constrictions. Imagine the shape changes that accomplish each task, as in a dynamical system; we estimate the forward map in a direct kinematic sense. Then we can take each speaker's forward map, plug it back into a synthesis model, a dynamical-systems model in the spirit of task dynamics, and examine the contributions of the various articulators each person uses, in order to predict how articulatory strategies differ across people.
Again, as a reminder of the pipeline: we go from data to extracted contours, run PCA to obtain factors that tell us, for example, how much the jaw contributes and what the tongue factors are, and from those we estimate the various constrictions at the places of articulation. Along the vocal tract we mark roughly six anatomical regions, such as the lips, the alveolar ridge, the palate, the velum, and the pharynx, and we automatically estimate the degree of constriction people make at each.
So we have some insights from the eighteen speakers we analyzed; this is work led by Sorensen, presented at Interspeech, where we went further with a model-based approach. We approximated the speaker-specific forward map from the real-time MRI data of the eighteen speakers and fit it within a task-dynamics model from the motor control literature, in which the dynamical systems are basically critically damped second-order control systems written in state-space form.
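To make "critically damped second-order system" concrete, here is a minimal point-attractor simulation of a single constriction variable being driven to its target. The gains, time step, and target are illustrative; the actual model fits such parameters per speaker.

```python
import math

def reach_target(target, k=100.0, z0=0.0, dt=0.001, steps=1000):
    """Integrate z'' = -k*(z - target) - 2*sqrt(k)*z' with forward
    Euler. The damping coefficient 2*sqrt(k) is the critical value,
    so the constriction variable approaches its target without
    overshoot, the behavior assumed in task dynamics."""
    z, v = z0, 0.0
    b = 2.0 * math.sqrt(k)
    for _ in range(steps):
        a = -k * (z - target) - b * v
        v += a * dt
        z += v * dt
    return z

print(reach_target(1.0))  # settles very close to 1.0 after 1 s
```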
Then we were able to interpret the results. One result is shown here: this measure basically represents the ratio of lip use to jaw use by speakers in creating constrictions at various places along the vocal tract, bilabial, alveolar, and palatal. You can see that there are different ratios of how much people use the lips versus the jaw: a value of one means relying mostly on the lips, zero means using mostly the jaw. Different constrictions are made differently, with different ways of creating the transitions. In this work we see, for example, that the lips tend to contribute more than the jaw, except near the alveolar targets, which are made with the tongue tip; and the speakers in our set, one line per speaker, vary in how they create the same kinds of constrictions. So people really do differ in their strategies. This is a very early insight into how much a speaker uses the jaw versus the lips and how that function is speaker-specific, and it is begging for a more computational approach. With the data and these insights we can go and see how people actually use the vocal instrument in producing the sounds we call speech.
The final part brings us to the familiar territory of the slides we have been seeing at this conference. We also explored a little whether production information can be of use in speaker recognition experiments. We did a small proof-of-concept study on speaker verification with production data; there is not much data, so this is not state-of-the-art, but the question was basic: can speech production data be of any use at all for speaker verification? We know that at run time we are not going to have data like I have been showing; X-ray or MRI is not available in operational conditions. So we need to derive some articulatory-type representation from the acoustics, and people have been working on the inversion problem: given acoustics, can we estimate articulatory parameters? This is a classic, mathematically ill-posed problem, and one where I feel deep learning approaches can be very powerful, because the mapping is highly nonlinear and these methods are well suited to such mappings.
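A minimal sketch of the reference-speaker idea described next, with plain linear least squares standing in for the nonlinear inversion methods mentioned in the talk; all data here are synthetic and all variable names are mine.

```python
import numpy as np

# Learn an acoustic-to-articulatory map on ONE talker's paired data,
# then push a NEW talker's acoustics through the same map.
rng = np.random.default_rng(1)
A_ref = rng.normal(size=(200, 13))                 # MFCC-like acoustics
M_true = rng.normal(size=(13, 6))                  # hidden "ground truth" map
X_ref = A_ref @ M_true + rng.normal(0.0, 0.1, (200, 6))  # articulator tracks

W, *_ = np.linalg.lstsq(A_ref, X_ref, rcond=None)  # fit the map once

A_test = rng.normal(size=(50, 13))                 # another talker's acoustics
pseudo_artic = A_test @ W                          # "how the reference talker
print(pseudo_artic.shape)                          #  would have said it"
```

The output is not the test talker's true articulation; it is a consistent pseudo-articulatory projection, which is exactly what makes it usable as a feature stream.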
Nevertheless, what we wanted was a speaker-independent mapping. In work one of my students did a few years ago, the idea was this: suppose I can learn the acoustic-to-articulatory mapping really well for one exemplary talker, for whom I have lots of data, much as in synthesis, where you take one talker's data and build everything from it. Then we can project anyone else's acoustics through this reference speaker's map, asking in effect how that talker would have produced these acoustics, and get some semblance of an articulatory representation, so that we can compute speaker-independent measures. That was the idea: use a reference speaker to build an articulatory-acoustic map, invert it, and when a test speaker's acoustic signal comes in, compute these inverted features and use them for verification, to see if there is any benefit. The rationale is that this produces the projections in a robust, constrained way; it imposes physically meaningful constraints on how we parameterize the signal, so there might be some advantage to be had. This was published earlier this year in Computer Speech and Language.
so, the front end I described was used for some of these first experiments on the X-ray microbeam database, which is publicly available and has a good number of speakers, with a standard GMM model, because we don't have that much data. and here are some of the initial results. if you use MFCCs only, for this small set, and it's a pretty noisy data set, you get about 7.5% EER. but if you actually have the real, measured articulation, you get a boost in the result: it provides nice complementary information, which is encouraging. you might think of that as an oracle experiment, or an upper bound, if you had the measurements at test time. now, if you use the inverted measurements instead, we do about as well, they compare really well, slightly better in fact, and putting them together actually provides an additional boost, which is pretty significant
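To make the setup concrete, here is a heavily simplified, hypothetical sketch of verification with fused features: a single diagonal Gaussian per model standing in for the GMMs actually used, on invented synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

def fuse(mfcc, artic):
    """Frame-level fusion: concatenate acoustic and articulatory dimensions."""
    return np.hstack([mfcc, artic])

# Synthetic frames for a target speaker, a background pool, and two test trials.
target_train = fuse(rng.normal(0.0, 1, (300, 13)), rng.normal(0.5, 1, (300, 6)))
background   = fuse(rng.normal(2.0, 1, (300, 13)), rng.normal(-1.0, 1, (300, 6)))
same_spk     = fuse(rng.normal(0.0, 1, (100, 13)), rng.normal(0.5, 1, (100, 6)))
impostor     = fuse(rng.normal(2.0, 1, (100, 13)), rng.normal(-1.0, 1, (100, 6)))

def diag_gauss_llk(X, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    return float(np.mean(-0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)))

def train(X):
    return X.mean(axis=0), X.var(axis=0) + 1e-6

spk = train(target_train)
ubm = train(np.vstack([target_train, background]))  # crude "universal" model

def score(trial):
    # Log-likelihood ratio: claimed-speaker model vs. background model.
    return diag_gauss_llk(trial, *spk) - diag_gauss_llk(trial, *ubm)

print(score(same_spk), score(impostor))  # target trial should score higher
```

The point of the sketch is only the fusion step: articulatory dimensions ride along with the acoustic ones through exactly the same back end.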
so this is encouraging. if we have enough data to create these maps across speakers, and we need only examples of each case, then this can provide an additional source of information. perhaps it will give us some gains, but maybe also some insight into why people are different, or which categories of articulation and structure differ across speakers. these are the standard results I just showed, all on the X-ray microbeam database
so, to summarize the speaker recognition experiments: combining acoustic and articulatory information gives a significant gain in EER if you use measured articulatory information together with the standard acoustic features, and the gains more or less remain if you instead use estimated articulatory information. what would be nice now is to look at new ways of doing the inversion, with the kinds of advances that are happening right now in neural modeling, and with the availability of more data, to do this better, and to be able to evaluate on larger acoustic data sets from SRE-like campaigns
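As a reminder of the metric being quoted, equal error rate can be computed from trial scores like this (a minimal sketch on made-up scores):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the false-acceptance
    rate (impostors accepted) equals the false-rejection rate (targets
    rejected), found by sweeping the threshold over all observed scores."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    best = (1.0, 1.0)  # (|FAR - FRR|, EER estimate)
    for thr in np.unique(np.concatenate([tgt, imp])):
        far = np.mean(imp >= thr)   # impostors wrongly accepted
        frr = np.mean(tgt < thr)    # targets wrongly rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Made-up scores: higher means "more likely the claimed speaker".
print(eer([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1]))  # → 0.25
```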
so, moving on to the most recent work, and we're very excited about some of this. the early pilot work was done with my collaborators at Lincoln Laboratory, with parallel work on the voice source done there as well. we had some initial pilot work, and I recently got an NSF grant to actually expand this line of work. so our idea is to do this very systematically: our goal is to collect about two hundred subjects, all with real-time and volumetric MRI, in detail, and share the data with people. we describe this in an upcoming paper, and this is the kind of material involved; I'll show the slides, and if people want to make suggestions, we're all ears. we've collected what, ten speakers or so far, with the project just getting started, recording everything from read material like the rainbow passage to all kinds of spontaneous speech and so on
if you have any suggestions or ideas about what would be useful for speaker modeling, now is the time; we can still take them into account. most subjects are going to be native speakers of English, and about twenty percent will be nonnative speakers of English. and we have other projects collecting data from people speaking other languages, everything from African languages onward
finally, toward getting insight into inter-speaker variability, we can also look at some clinical use cases. one is modeling vocal tract growth in kids; another is how speaker variability manifests in the signal when the instrument itself has changed. for example, we've been working with people who have had operations for oral cancer, glossectomy patients, where the surgical intervention essentially removes parts of the tongue. on top of that there are other therapeutic treatments, like radiation. so the cancer treatment modifies the physical structure of the instrument. here we see two such patients. one basically lost most of the tongue base to the cancer, and it was replaced by reconstruction with a flap from the forearm. you can see the variation in the vocal tract compared with the normal anatomy here. so how does their speech cope? regaining speech is one of the big quality-of-life measures after treatment. looking at these cases also gives us additional insight into speaker variability. what is interesting is that in some of these cases we know the speaker's history before the treatment, so we have access to these speakers and can collect a lot of data from them, and then compare: how do they compensate, what strategies do they use; some of these people speak pretty intelligibly, pretty well. so this provides an additional source of information for understanding this question of individual variability
so, in conclusion. as someone said earlier, data is very much integral to advancing speech communication research, and vocal tract information is a crucial piece of this puzzle, I believe. to get there, we need to gather data from lots of different sources to build a complete picture of speech production. that is challenging, from a technological and computational standpoint as well as from a conceptual and theoretical perspective. but I believe there are rich applications, including in machine speech recognition and speaker modeling. and this approach is inherently interdisciplinary, so people have to come together to work on these topics, and to share
so these are some of the people in my speech production group, alumni at the top and the people currently there at the bottom, in particular those who contributed to this data collection: my colleague who leads all the imaging work, the imaging scientists, and the linguists, who provide the conceptual framework for how we approach all of this, including the work on vocal tract morphology and the modeling I was talking about, a lot of which actually translated to the speaker verification experiments, and the colleague who did the i-vector work; I'm really grateful for everything they have contributed. and finally, our program support has been very important for this, and for pushing us to bring more people into these kinds of questions. so with that, I thank all of you for listening. all of this material is available online if you're interested. thank you very much
thank you very much, that was fascinating. two questions. first of all: when are you going to get to the larynx? I'm asking from the perspective of the forensic phonetician, and we are conscious of between-speaker differences from the larynx, spectral slope and that sort of thing, but in this talk that was suppressed. and also, supralaryngeally: what I would call almost more robust, useful knowledge about speaker variability is in the nasal cavity and the sinuses, that sort of thing, which tells us a lot about speaker identity. though it's tricky, because with telephone speech and so forth anything above about three kilohertz is gone
so the first question was about the larynx, here in this region. the glottal, voice-source phenomena happen at a much higher rate, and our MRI frame rate is still not good enough for that. what people have been doing is high-speed imaging of the larynx, with a camera through the nose, which is a little bit more of an intervention. on the other hand, what we can do with MRI is look at things like larynx height and other positional information. and in particular, one of the newer approaches gives a complete view of the laryngeal region, which is not available in any of the other modalities, so you can look at vertical laryngeal behavior and related phenomena. in terms of actually characterizing things like the sinuses, which don't change very much during speech, we can acquire T2-weighted structural images to characterize each speaker anatomically, what cavities and structures they have, and then see how we can relate that to, or account for it in, the signal. and we are trying to see how we can do some multimodal imaging of the voice source, but the windows we have into this are still quite small; we would like to see the high-speed behavior, and it's still an open question how to image it concurrently. I've put the references in the previous slides for anyone interested
do we have more questions? is it possible to say, broadly, whether there are any particular areas that show the greatest amount of between-speaker difference? that is, if you're going to look, is there one place where the differences concentrate, or is it just that people differ in all sorts of ways?
so I think the latter is what my guess would be right now. but I do think the differences will begin to cluster as we increase the numbers, just like what we do with eigenvoices and eigenfaces; I'm sure the prominent things will start clustering and give us the directions of variation. right now the sources of variability seem to be, from a perceptual point of view, all over the place. also, how people compensate varies quite a bit, because of where they come from, how they learned, and the practices people use. there is another piece of work I could talk about, on articulatory setting, with ideas about how people actually set up to execute speech, and why, from a motor control point of view, people prefer certain strategies, whether that can be tied to language background or other factors; still an open question. but what I feel is needed is bigger data: these are very small data sets compared to what you've been used to on the speech side. if we increase this to some extent, and bring in the kinds of computational tools and advances that you're making, I think we can slowly begin to understand this at the level it deserves. open question
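The eigenvoice/eigenface analogy here, finding the dominant directions of between-speaker variation, can be sketched on made-up per-speaker summary vectors as a plain PCA:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: one 20-dim summary vector per speaker (e.g. mean fused
# acoustic/articulatory features) for 50 speakers. Two hidden directions
# carry most of the between-speaker variation.
directions = rng.normal(size=(2, 20))
weights = rng.normal(size=(50, 2)) * np.array([3.0, 1.5])  # strong factors
speakers = weights @ directions + 0.1 * rng.normal(size=(50, 20))

# PCA via SVD of the centered speaker matrix.
centered = speakers - speakers.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)

# The top two "eigenvoice" directions should capture most of the variance.
print(explained[:3])
```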
let me make a comment before my question. you put up a kind of acoustic tube model, and I'll point out one thing from one of the workshops in the early nineties. from the mid-sixties up until the late eighties and early nineties, we all used a straight acoustic tube model, as if the vocal tract were laid out flat. at one of those workshops a summer student basically spent the summer asking: the vocal tract actually has a right-angle bend, and no one had really thought about how much that bend impacts formant locations and bandwidths. so he formulated a closed-form solution, and I think the shift was between one and three percent in formant locations and bandwidths, so it's very much worth taking the physiological state into account, as you might expect. my basic question: you focused on speaker ID. I'm assuming many of your speakers are bilingual; have you thought about looking at language ID, to see whether the physiological production systematically changes when people speak one language versus another?
absolutely, along those lines. first, on the comment John Hansen made regarding the bent vocal tract: people have run simulations of articulation-to-acoustics with the effect of the bend; in fact there is a classic paper from a long time ago that estimated the shift at about three to five percent, and it was verified with simulations later on. the more recent models try to do this with finite-element-style simulations, which we can now feed with real data, with access to the vocal tract postures from all these speakers like the ones I talked about. and with high-performance computing this is becoming a reality; we can actually do what we've been wanting to do
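For context, the straight-tube baseline being discussed predicts odd-quarter-wavelength resonances, and the one-to-few-percent bend correction quoted above would shift these values only slightly. A quick sketch using the textbook uniform-tube formula (not the closed-form bent-tube solution itself):

```python
# Resonances of a uniform tube closed at the glottis and open at the lips:
# F_n = (2n - 1) * c / (4 * L), the classic straight-tube approximation.
def tube_formants(length_m, c=350.0, n=3):
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# A 17.5 cm vocal tract gives the textbook neutral-vowel formants.
print(tube_formants(0.175))  # ≈ [500, 1500, 2500] Hz

# A hypothetical 2% shift from the bend, in the range quoted above:
print([round(f * 1.02, 1) for f in tube_formants(0.175)])
```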
the second question, John, remind me — right, the language ID: yes, of course. we actually have about forty or fifty different first languages represented, speakers who learned English as a second language, in our data sets, across the cross-linguistic experiments we've been doing. one of the things we've looked at with the real data, a little bit, not as much as we'd like, relates to intuitions about language ID; we have some hypotheses. we looked at things like the articulatory setting, which is the posture from which you start executing a speech task, from rest to ready. if you think of it as a dynamical system, then for an individual utterance the initial state you model is important: the state from which you go to another state, where you execute a particular task, then move on to the next part of the construction, and so on. and we found that people have preferred settings from which they start executing speech, and these are very language-specific: we showed this comparing German, Spanish, and English speakers. so these kinds of things can be estimated from articulatory data. with inversion it hasn't been done yet, but that's quite possible, and we're happy to share data with people
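As a toy illustration of that idea, an articulatory setting can be estimated as the mean articulator configuration over inter-speech (pause) frames; all names and numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

def articulatory_setting(frames, is_speech):
    """Estimate the 'ready' posture as the mean articulator configuration
    over non-speech (pause) frames within an utterance."""
    pause = frames[~is_speech]
    return pause.mean(axis=0)

# Invented data: 200 frames x 4 articulator coordinates. During pauses the
# talker sits near a preferred posture; during speech, positions vary widely.
setting_true = np.array([0.2, -0.1, 0.4, 0.0])
is_speech = rng.random(200) < 0.7
frames = np.where(is_speech[:, None],
                  rng.normal(0.0, 1.0, (200, 4)),               # speech: varied
                  setting_true + rng.normal(0, 0.05, (200, 4))) # pause: near setting

est = articulatory_setting(frames, is_speech)
print(np.round(est, 2))  # should land close to setting_true
```

Comparing such settings across a talker's languages is one way to test whether the setting is language-specific, as the cross-linguistic result described above suggests.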
okay, one more over here. — I have a comment I'd like you to respond to. one of the perennial problems in speaker recognition is what happens between the talker and the recorded speech. the first thing almost every pipeline does is cepstral mean subtraction, which basically throws away the average shape of the vocal tract. how does that impact what you're proposing?
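Cepstral mean subtraction itself is simple, per-utterance mean removal over cepstral frames, which cancels stationary convolutive effects such as the channel, and, as the questioner notes, any constant speaker-dependent offset with it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension.
    A stationary convolutive channel adds a constant to every cepstral
    frame, so removing the time average cancels it -- along with any
    constant speaker-dependent offset, which is the questioner's point."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(4)
frames = rng.normal(size=(100, 13))     # stand-in cepstral frames
channel = rng.normal(size=(1, 13))      # fixed convolutive channel offset
observed = frames + channel

clean = cepstral_mean_subtraction(observed)
print(np.abs(clean.mean(axis=0)).max())  # ~0: the mean (and channel) are gone
```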
right, so I didn't talk about channel effects and channel normalization, the things that happen because of the recording conditions and so on. one of the things we are contemplating, like many people who have been talking about joint factor analysis and those kinds of approaches, even with these new deep-learning systems, is to model these multiple factors jointly, to see how we can separate speaker-specific variability from variability caused by other, extraneous interferences and transformations that may happen along the way. that's why we want to work from first principles, rather than just making the jump of throwing all of this into some machine-learning engine and estimating things blindly: by systematically connecting linguistic theory and speech science to the features, with analysis-by-synthesis types of approaches, we can then see whether we can account for other kinds of conditions, say open-environment or distant speech recording, which is of much interest for various reasons. so I tend to believe in that kind of more organic approach
we have time for maybe one more question. — I'm sorry, I'll be fast. I want first to thank you; it's very nice to see science brought into speech technology, and particularly into speaker recognition and forensics. just a comment, to recall the difference between speaker recognition and forensic voice comparison; your data is really relevant to both fields. in speaker recognition we can imagine that the speakers are trying to cooperate, the classical case, and cannot be coached. in forensic voice comparison we can imagine exactly the opposite. so here is my question: could a suspect adopt a counter-strategy, a deliberate voice modification or disguise, to defeat the approach you've presented?
yes, and the point is that there are certain things we can change and certain things we can't: they are given. that's one of the things we are trying to go after. some things are given in our physical instrument; a talker can compensate only so much, and we still see the residual effects, and we want to see whether we can get at those residual effects. the bounds aren't established yet, but I have a background in information theory, so I'm always interested in the bounds, the limits of things: how much can actually be recovered? after all, we have a one-dimensional signal, from which we project into all kinds of feature spaces and do all our computation and inference, about the speaker or whatever else. say you manipulate your strategies: that's only one degree of freedom, or a few. it causes some differences, but if we can account for them, can we still see the residual effects of the instrument and the specific ways it has been changed? you can't just do random things with your articulation and still create speech sounds. that's why the joint modeling of structure and function would be very interesting to pursue, and how much can be spoofed by people remains to be seen; I don't know. but I'm hoping that by being very microscopic in these analyses we can get insight that is objective, not just the impressionistic "this is definitely him" that experts are willing to state in court. I think that's one of the reasons our sponsor was very supportive of the idea: let's go at this in as objective, scientifically grounded a way as possible. — let's thank the speaker again. thank you