So: Tecumseh Fitch comes to us from the University of Vienna, where he is a professor at the Department of Cognitive Biology. His main interests are in the evolution of language and vocal communication in vertebrates, and what makes this also very interesting for us is that he uses synthetic speech to investigate these questions and to test his hypotheses. And Bart de Boer is from the Artificial Intelligence Lab at the Vrije Universiteit Brussel. He is also interested in the cognitive bases of language, and he uses machine learning and speech technology to investigate how its combinatorial structure can be modeled. They are both very well known for their work, including the recent paper "Monkey vocal tracts are speech-ready", which we will hear about today. [partly inaudible] Please join me in welcoming them.
Thank you, Michael, for the kind introduction. This is the first time Bart and I have tried to do a tag-team talk like this, so we'll see how well it works. I'll start off, and then Bart will give you the more technical details of the sort I'm sure you're all hungry for on a Saturday morning. But I'll start by giving some perspective on why a biologist like myself, who is interested in animal communication, would dive into speech science. I actually studied speech science with people like Ken Stevens at MIT, and the first part of our arc is that we used the tools you folks invented to investigate how animals make their sounds and what those sounds mean. In other words: using the technology of speech science to create animal sounds, in order to understand animal communication. Then, in the second part of that arc, we'll turn it around and ask how we can use an understanding of the animal vocal tract to understand the evolution of human speech. And the answer may surprise some of you.
Okay, so why would anyone want to synthesize animal vocalisations? Why would you want to make a synthetic cat's meow or a synthetic bark? As I said, my main reason is that I'm a biologist. I'm interested in understanding the biology of animal communication from the point of view of physics and physiology, and because speech scientists have done so much of that work, we can essentially borrow it to understand animal communication. Then we'll turn to the second part, where we try to understand how our own speech arose.
I'm sure this is very familiar to you, but I just want to very quickly run through the source-filter theory as it applies to human language; what you might be more surprised by is how broadly this theory applies across vertebrates. With the possible exception of fish, dolphins and other toothed whales, and probably a few others, like some rodent high-frequency sounds, this theory, developed to understand our own speech apparatus basically from the nineteen-thirties through the nineteen-seventies, turns out to apply to virtually all the other sounds you might think of: dogs barking, cows mooing, birds singing, and so on. The basic idea, of course, is that we can break the speech production process into two components: the source, which turns steady (DC) airflow into sound, and the filter, which then modifies that sound via formant frequencies, the vocal tract resonances that filter out certain frequencies.
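To make the two components concrete, here is a minimal sketch (my illustration, not the speakers' code) of source-filter synthesis in Python: an impulse train standing in for the glottal source, and a cascade of second-order resonators acting as the formant filter. The formant values and bandwidths are arbitrary round numbers.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz)
f0 = 110                         # fundamental frequency of the source (Hz)
n = int(0.5 * fs)                # half a second of samples

# Source: an impulse train at f0, a crude stand-in for glottal pulses.
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

def formant_filter(x, F, B, fs):
    """One formant: a two-pole resonator with center F and bandwidth B."""
    r = np.exp(-np.pi * B / fs)           # pole radius from bandwidth
    theta = 2 * np.pi * F / fs            # pole angle from center frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1 - r], a, x)         # rough gain normalization

# Filter: cascade three formants (roughly a neutral vowel).
speech = source
for F, B in [(500, 80), (1500, 100), (2500, 120)]:
    speech = formant_filter(speech, F, B, fs)
```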
And this is an image that may look familiar. These are vocal folds, except these are the vocal folds of a Siberian tiger. This is a larynx whose vocal folds are about this long, so of course it makes very low-frequency vocalisations, but you can see that the basic process, this aerodynamically excited vibration, is pretty much the same as what you would see in human vocal folds. And of course the vibration rate of these vocal folds, the rate at which they slap together, determines the pitch of the sound.
You may be wondering how we did this. We didn't have a live tiger vocalising with an endoscope down its throat, nor would I want to try. This is a dead tiger: the larynx was removed from an animal that had been euthanised and put on a table; we blew air through it and videotaped the result. What that shows is that, just like in humans, we don't need active neural firing at the rate of the fundamental frequency to create the source. And that seems to be true for the vast majority of sounds: songbirds are actually vocalising at fundamentals of eight kilohertz, whales are vocalising at fundamentals of ten kilohertz, all using the same principle.
There are a few exceptions, and my favourite one, which many of you will be familiar with, is a cat's purr. That's a situation where there is an actual muscle contraction: each contraction of the muscle that generates the purr is driven by the brain. So that's one of the few exceptions where it's not this kind of passive vibration. But for the vast majority of the sounds we're talking about, including everything we know from nonhuman primates, this is the way it works.
So that source, whether it's noisy or harmonic, passes through the vocal tract. When I show my students this image, the formants are like windows that allow certain frequencies to pass through, but it's certainly much more fun to listen to what a formant is. What I've done here is use LPC resynthesis. First, the human speech, which is of course the source and the filter combined: [plays audio]. Now I'm going to take the formants of that speech and apply them to this source (this is a bison bellowing), and this is what we hear as a result:
[plays audio] I think everybody can understand the words, even though it sounds more terrifying when it's a bison saying it. Just another random example: this is a narwhal, and here is the narwhal with my formants. [plays audio] Okay, I think that illustrates the point. The vocal signal we hear is this composite of source and filter; in these cases we can hear the filter doing the phonetic work, but the source still comes through loud and clear.
Taking these basic principles of source-filter theory, we started thinking: okay, what kinds of cues, other than speech, might there be in animal signals? One of the first things that has now been really extensively investigated was based on the idea that vocal tract length correlates with body size, and because formant frequencies are determined by vocal tract length, maybe formants provide a cue to body size in other species. The first part of this is easy: we just take MRIs or X-rays and measure the vocal tract length; you can do that on anaesthetised animals. It's a little harder to get them to vocalise, but when we do that and measure the formants, we find (this is just one of many cases; these are monkeys) that vocal tract length correlates with formant dispersion, which is the average spacing between the formants. And because vocal tract length correlates with body size, body length correlates very nicely with formants.
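Since formant dispersion comes up repeatedly, it may help to state the underlying relation (a uniform-tube idealisation, my addition): dispersion is the average spacing of adjacent formants,

$$ D_f \;=\; \frac{1}{N-1}\sum_{i=1}^{N-1}\left(F_{i+1}-F_i\right) \;=\; \frac{F_N - F_1}{N-1}, $$

and for a uniform tube of length $L$ closed at the glottis, $F_i = (2i-1)\,c/4L$, so $D_f = c/2L$, with $c$ the speed of sound: the longer the vocal tract, the smaller the dispersion. Real tracts only approximate the uniform tube, but this is why dispersion tracks vocal tract length.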
I first showed this in monkeys, but it has since been found in pigs; it's true in humans; it's true in deer. It seems like a fundamental aspect of the voice signal: it carries information about body size. Now, this is something that we as scientists can see objectively; we can measure it. But the question is: do animals pay attention to it? It's fine if I go and measure formants and show that formants correlate with body size, but that's kind of meaningless for animal communication unless the animals themselves perceive that signal.
This is where animal sound synthesis comes in. How do we ask that question; how do we find out whether an animal is paying attention to formants? This was a long time ago: some of you may recognise this old version of MATLAB running on an old Macintosh. I built this animal sound synthesizer using very standard technology that most of you will be familiar with: basically linear prediction. You predict the formants, subtract them away, and you have an error signal which we can use as a source; then we can shift only the formants, leaving everything else the same, and ask whether the animals perceive that shift in formants.
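A rough reconstruction of that LPC trick in Python (my sketch of the standard technique, not the original MATLAB code): estimate the all-pole filter, inverse-filter to get the residual "source", rotate the pole angles to shift the formants, and refilter with the new poles.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a

def shift_formants(frame, order=12, factor=1.2):
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)     # inverse filter: the "error" source
    poles = np.roots(a)
    # Rotating each pole's angle moves every resonance by `factor`
    # while leaving the residual (and hence f0, amplitude) untouched.
    shifted = np.abs(poles) * np.exp(1j * np.angle(poles) * factor)
    a_new = np.poly(shifted).real
    return lfilter([1.0], a_new, residual)  # same source, shifted formants
```

In practice you would do this frame by frame with overlap-add; the point for the playback stimuli is exactly this factorisation: only the formants move, everything else stays fixed.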
Now, how do we do these experiments? How do you ask an animal whether it perceives that? We usually use something called habituation-dishabituation, where we play a bunch of sounds in which the formants remain the same but other aspects vary: the fundamental frequency, the length, et cetera all vary, but the formants are fixed. Once our listening animal stops paying attention (it may take ten plays or a hundred plays before the animal finally stops looking toward the sound), once it has habituated to the original sounds, we play sounds where we change the formants, or whatever variable is of interest. If the animal pays attention to that, if it perceives it and finds it salient enough to be noticeable, then it should look again. Okay.
The first species I actually tried this with was whooping cranes, and I'll explain why in a second. Let me walk you through the experiment. These are whooping crane contact calls. What we did is play a bunch of the actual calls from one particular bird, and they sound like this: [plays audio]. Here's another one; they sound pretty similar to our ears. We keep playing those (these are recordings played from a laptop) and we watch whether the listening bird looks up. We wait until the bird goes back to its feeding, we play one of these sounds, and it looks up, because it sounds like there's another whooping crane around.
The logic is pretty simple. In the case of whooping cranes we had to do this in the winter, and it takes these birds hundreds of trials before they stop paying attention; the laptop dies, it starts snowing, et cetera, et cetera. But eventually we were able to do it: you get the bird habituated by playing these kinds of sounds over and over. Then, just to be safe, we play a synthetic replica that we've run through the synthesizer but without changing the formants; if everything is fine, they shouldn't dishabituate to that. Here's what that sounds like: [plays audio]. Pretty similar.
And now here's the key moment: we play either the formants lowered or the formants raised. [plays audio] Of course you can all hear that, because you're humans; we already knew you perceive formants. The question is whether the birds do. When we run this, we find that initially the birds respond eighty percent of the time on average, but as we go through twenty-five or thirty trials, the last habituation trial, by definition, is one where they don't look at all, and we actually require three of those in a row. Then we play the synthetic replica and they don't look, which shows our synthesizer is working. And when we finally play the test stimuli, we get a massive jump: a dishabituation.
We've now done this with many different species and always found the same thing. Paying attention to formant frequency shifts in this kind of context seems to be a very basic vertebrate thing: birds do it, monkeys do it, dogs do it, pigs do it, and of course people do.
Now you might ask whether we can go further with this. For example, two colleagues have used animal sound synthesis to look at what other species use these formant frequencies for. In this case we can show that red deer and koalas use these sounds as indicators of body size, and the kind of evidence we have is, for example, that males played a playback of another male with lower formant frequencies, that is, with an elongated vocal tract, run away and are afraid, while females find the lower formants more attractive, et cetera. This has now been done with many species. Many of you have probably heard deer, but you might not have heard a koala; this is a koala, and they have a very impressive vocalisation. [plays audio] If you're wondering how a little teddy-bear-sized animal makes that terrifying sound, it's because they have a descended larynx: they've pulled the larynx down to make their vocal tract much longer than it would be in a normal animal. By elongating their vocal tract, they make themselves sound bigger.
These are just a few of the many publications that use the approach I've been telling you about to dig deeper into animal communication, so I hope that makes the case that this is a worthwhile thing to do, in a wide variety of species.
Okay. Now, getting to something that's closer to what a lot of you do (this is supposed to say "Part Two"; sorry, we just put this together yesterday): how can you turn this around and start asking questions about human communication, based on what we understand about animals?
The first fact, a core fact that many people in the world of speech science have been trying to understand for a long time, is that we humans are amazing at imitating sounds. We not only imitate the speech sounds of our environment; we learn to sing songs; we can even imitate animal sounds. Basically, kids will imitate whatever sounds they hear. And it turns out that our nearest living relatives, the great apes, can't do this at all. These are examples of apes that have been raised in human homes. A human child, by the age of about one, is already starting to say its first words, making the sounds of its environment, the sounds it hears, in its native language's phonology. No ape has ever done that: no ape has even spontaneously said "mama", much less learned complex vocalisations.
People have known this for a long time, and the question that has been driving this field for at least a hundred years, since Darwin's time, is: why? Why is it that an animal seemingly so similar to us, one that can learn to do complex things, even drive a car, can't produce even the most basic speech sounds with its vocal tract? That's the driving force behind the second part of the talk.
There are two theories; Darwin had already mentioned both. One is that it has something to do with the peripheral vocal apparatus, and the other is that it has more to do with the brain. Darwin said: well, they probably both matter, but the brain is probably more important. What we're going to try to convince you of now is that it is indeed the brain that's key, and that vocal tract differences, although they exist, are not what keeps a monkey or an ape from producing speech.
The most famous example of a difference between us and apes is illustrated by these MRIs. On the left we see a chimpanzee, and the red line marks the vocal folds; that's the larynx. In humans, of course, the larynx has descended in the vocal tract; it has pulled down into the throat. In the chimpanzee, the larynx sits in a high position, engaged with the nasal passage most of the time, and that means the tongue rests flat in the mouth; the body of the tongue is basically sitting like this. What happened in humans is that we essentially swallowed the back of our tongue: our larynx descended, pulling the tongue with it, so that we have this two-part tongue that we can move up and down and back and forth. That's how we get this wide variety of speech sounds.
So the idea, which goes back to Darwin's time but really became concrete in the nineteen-sixties, is that with a tongue like that you simply can't make the sounds of speech; therefore, no matter what brain was in control, that vocal tract could not make the sounds you would need to imitate speech. And it's a plausible hypothesis. It goes back to my mentor Phil Lieberman, who was my PhD thesis supervisor; he published a series of papers in the late sixties and early seventies. What he did was take a dead monkey, make a cast of its vocal tract, and use that to build a computer program simulating the sounds that vocal tract could make. There was a lot of guesswork involved, because it was one dead monkey and one cast, but they did the best they could.
What they found (this is an F1-F2 formant space) is shown here: the famous three point vowels of English, /i/, /a/ and /u/, which are found in most languages, and all the numbers in there are what the monkey vocal tract, or rather the computer model of the monkey vocal tract, could do. So they concluded that the acoustic vowel space of a rhesus monkey is quite restricted: monkeys lack the output mechanism for speech, full stop.
And this is one of those ideas that, as I said, is well founded in acoustics. If you look at what we actually do when we produce speech (here are just a couple of videos that will be familiar: "a rainbow is a division of white light into many beautiful colours"), you see the tongue dancing around in that two-dimensional space. Here it is slowed down a bit. So we really do use that additional space gained by swallowing the back of our tongue; we clearly use it to its full extent when we produce speech. So I think the Lieberman hypothesis is quite plausible.
I became suspicious of it when we first started to take X-rays of animals as they vocalise, instead of looking at dead animals. This is the classic way of analysing the animal vocal tract: take a dead goat, cut it in half, and draw conclusions from that. Getting a goat to vocalise in the X-ray apparatus is harder than it may seem; I have footage of many animals sitting in a setup like this without vocalising at all. But this little goat was one of our first subjects: when we played it its mother's bleats, it would respond, and this is what we saw in the X-ray. I want you to look at this region right here. Based on static anatomy, it has been claimed that the engagement of the larynx with the nasal passage prevents mouth breathing; in other words, the idea from static anatomy is that a goat can't breathe through its mouth. And here's what we actually see: the larynx pulling down, such that every one of those vocalisations passes out through the mouth of the goat.
Now, this shouldn't be that surprising if you think about it: if you want to make a loud sound, you should radiate it through your mouth and not through your nose. But again, this is what the anatomical data said was impossible, up until we started doing this work. We've seen it in other animals too. This is a dog, and you're going to see a very extensive pulling down of the larynx, a descent of the larynx, when the dog barks; this is slow motion. That's the larynx, right there. What you can see is that every time the dog barks, the larynx pulls down, pulling the back of the tongue with it, basically going into a human-like vocal tract configuration, but only while the animal is vocalising. The unusual thing about us is that our larynx stays low: we keep it low all the time, not only while we're vocalising.
When we first got these data, almost twenty years ago now, I became convinced that the descent of the larynx can't be the crucial factor keeping animals from speaking. But unfortunately, the textbooks continued to say that the reason monkeys can't vocalise, that apes can't vocalise, is peripheral anatomy: that they just don't have the vocal tract for it. And then I saw the Simpsons episode where the main guy in the Simpsons (Bart? No, that's you; the old guy: Homer) gets this monkey, and the monkey can't talk, so Homer is learning sign language, and they keep saying it's because the monkey doesn't have the vocal tract.
So that's when we decided: okay, this dog and goat stuff isn't enough; we have to do it with nonhuman primates. Working together with Asif Ghazanfar, whose monkeys they were, and with Bart, who is going to take over from here, we took X-rays like this one of the monkey vocalising. You'll see there's a little movement of the larynx, just as we saw in the goat and the dog. We then traced those frames to create a vocal tract model, and this is where Bart takes over. Do you want to take this? That looks good. Okay.
So: how do we actually build a model to create vocalisations of the monkey? If you think about it, this is a very different problem, one requiring a very different solution, from what we use for human speech, because we're trying to figure out what the monkey could do in principle with its vocal tract, not what it is actually doing; the whole point is that we know monkeys don't talk. So we don't have a corpus of data on which we could apply some kind of machine learning. What we need instead is a truly predictive approach, based on what is in a sense a very old-fashioned way of going about speech synthesis: articulatory synthesis. I'm not going to recap in detail how it works; I assume you're all intimately familiar with it. But what I would like to stress is that even though we're talking about biology and speech science, these methods were developed by people who were actually engineers, people interested in putting as many phone conversations as possible on transatlantic cables. So this is very much theory developed by engineers, by people working with the same goals as you guys.
How does articulatory synthesis work? You start with an articulatory model, an idea of how the vocal tract works, and with that model you can create different positions of the tongue, lips, et cetera. From that you need to calculate what is called an area function: the cross-sectional area of the vocal tract at each position along the tract. It turns out that the precise details of the shape don't matter much; the area is the thing that counts. For instance, there is a right-angle bend in the vocal tract, but because of the wavelengths involved you can ignore it and basically model the tract as a straight tube with a circular cross-section. The area is the important thing. Of course, if you want to model this in a computer you have to discretise it, so what you end up with is what's called a tube model: a number of tubes along the length of the vocal tract, from the larynx to the lips.
On the basis of that, you can calculate the acoustic response, either in the time domain or in the frequency domain. So that's what we're going to do; how did we do it for the monkey model? This is the X-ray image that Tecumseh just showed, with the outline: in red you can see the outline of the vocal tract. This is what we start with, and we had about a hundred of these; the tracings were made by hand. What we first need to do is figure out how the sound waves propagate through this tract.
For that, the technique we use is called the medial axis transform. Basically, you try to squeeze a circle through the tract; that circle represents the propagating acoustic wavefront, the line through the middle is the centre of the wavefront, and the diameter of the circle is the local diameter of the vocal tract.
This is what you end up with: for each position in the vocal tract, from the glottis to the lips, you can then read off the diameter. So you have a function giving the diameter of the vocal tract at each point along it.
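As a sketch of that step (not their actual pipeline): for a binarised tracing, standard image-processing tools give the medial axis and the local radius in one call. The elliptical mask below is a stand-in for a hand-traced tract outline.

```python
import numpy as np
from skimage.morphology import medial_axis

# Stand-in mask: an ellipse, where the real input would be the filled,
# hand-traced vocal tract outline from one X-ray frame.
yy, xx = np.mgrid[0:200, 0:400]
tract_mask = ((xx - 200) / 180.0) ** 2 + ((yy - 100) / 40.0) ** 2 < 1

# On the medial axis, the distance transform equals the radius of the
# largest inscribed circle, i.e. half the local tract "diameter".
skeleton, distance = medial_axis(tract_mask, return_distance=True)
diameters = 2 * distance[skeleton]
# For a real tract you would still order these samples along the
# midline, glottis to lips, to get diameter as a function of position.
```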
However, this is only part of what we need: we need the area, and the diameter alone isn't enough. The problem is to calculate the area on the basis of the observed diameter. Fortunately, it turns out that, to a good approximation, for the monkey vocal tract the function converting diameter to area is more or less the same everywhere along the tract. How did we figure that out?
Apart from the X-ray movies, we also had a few MRI scans of an anaesthetised monkey. If you look at this side view (this is where the monkey's lips are, this is its vocal tract, here's the larynx), you can make cross-sectional cuts, and you can see that the shape of the vocal tract at these different cross-sections (it's not quite a parabola, but this particular shape) is more or less the same everywhere.
So what you want to know is: for a given opening of the vocal tract, how large is the area? Suppose the diameter is about this much; then the area is this. If you open up further, obviously the area gets bigger, and it turns out (it's just a matter of integration) that the area is proportional to a constant times the diameter to the power of one point four. There's no deep theoretical reason for that value of one point four; it's something we learned from observation.
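Since the exponent was learned from observation, the natural way to recover it is a straight-line fit in log-log space over paired (diameter, area) measurements from the MRI slices. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical paired measurements from MRI cross-sections (cm, cm^2).
d = np.array([0.4, 0.7, 1.0, 1.4, 1.9])
A = 0.8 * d ** 1.4              # idealised data following the power law

# log A = log k + p * log d, so an ordinary linear fit recovers p and k.
p, log_k = np.polyfit(np.log(d), np.log(A), 1)

def area_function(diameters, k=np.exp(log_k), p=p):
    """Convert medial-axis diameters to cross-sectional areas."""
    return k * diameters ** p
```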
Now, by applying that function to the diameters we observe, we obtain the area function: for each position along the vocal tract, the cross-sectional area at that point.
The next step is turning that into formants, and for that we use, again, a very old-fashioned classical approach to acoustic modeling: an electric line analog of the vocal tract. Here again you can see that historically a lot of this theory was developed by electrical engineers, because it is an electronic circuit: for each of the discrete tubes, the electric line analog models the physical wave equation with a little electrical circuit.
From that we can then calculate the formant frequencies. So for each of those hundred tracings we calculated the first, second and third formants, and these are the values we obtained for all those points.
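The electric line analog can be sketched directly in the frequency domain: each tube section contributes a lossless transmission-line (chain) matrix, the matrices multiply from glottis to lips, and with an idealised open lip end the formants are the peaks of the volume-velocity transfer function. This is a generic textbook version under lossless assumptions, with a made-up five-tube area function, not their calibrated model.

```python
import numpy as np
from scipy.signal import find_peaks

c, rho = 35000.0, 0.00114      # speed of sound (cm/s), air density (g/cm^3)

def formants(areas, lengths, freqs):
    """Resonances of a concatenated-tube tract, glottis to lips."""
    gain = np.empty(len(freqs))
    for n, f in enumerate(freqs):
        k = 2 * np.pi * f / c                 # wavenumber
        M = np.eye(2, dtype=complex)
        for A, L in zip(areas, lengths):
            Z = rho * c / A                   # characteristic impedance
            M = M @ np.array([[np.cos(k * L), 1j * Z * np.sin(k * L)],
                              [1j * np.sin(k * L) / Z, np.cos(k * L)]])
        # Ideal open end: lip pressure ~ 0, so U_lips/U_glottis = 1/M[1,1].
        gain[n] = 1.0 / abs(M[1, 1])
    peaks, _ = find_peaks(gain)
    return freqs[peaks]

freqs = np.arange(50.0, 5000.0, 10.0)
areas = [0.5, 0.8, 1.5, 2.0, 1.0]    # cm^2, made-up 5-tube area function
lengths = [2.0] * 5                  # cm each, roughly a 10 cm tract
print(formants(areas, lengths, freqs))   # first three peaks ~ F1, F2, F3
```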
At this point we have determined what the acoustic abilities of the monkey vocal tract are, or are not. From there, there are different things you could do. In principle, on the basis of this kind of data, you can build a computer articulatory model. That is something Shinji Maeda did in 1989, again quite some time ago, on the basis of very similar data about the human vocal tract.
But it's not certain we have enough data to do the same thing. Maeda made a thousand tracings of the vocal tract, and if you know how difficult it is to make a single tracing, you can imagine how much time he must have spent on that model. What he then did was basically submit these articulations to a factor analysis and derive an articulatory model, an articulatory synthesizer, which you could then use to synthesize new sounds. The problem is that we don't have that many tracings, so we probably couldn't make a good-quality model.
What we wanted to do, and what Tecumseh is going to explain in a moment, is resynthesize some sounds, and that is still very challenging with an articulatory synthesizer; it wasn't really necessary for our purposes, so we took a slightly different approach.
One of the things we wanted to do was simply quantify the articulatory abilities of monkeys and compare them to humans. To do that, we could measure the acoustic range of the monkey vocalisations, and one way to do that is by calculating the convex hull. Again, I assume you're all familiar with what a convex hull is, so let me just very quickly show how we did it. To compute the convex hull you start with one of the extreme points and then fit a line around the points, as if you took a rubber band and squeezed it around them. Then you can do several things: you can calculate the area of the convex hull, or you can calculate the extent of the points along the first formant or the second formant. We based ourselves on the area and the extent.
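With the hundred (F1, F2) points in hand, the hull, its area, and the formant extents are one call each in SciPy. A sketch with random stand-in points:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Stand-in for the (F1, F2) pairs computed from the hundred tracings.
rng = np.random.default_rng(0)
points = rng.uniform([300, 900], [1200, 3000], size=(100, 2))

hull = ConvexHull(points)
hull_area = hull.volume               # NB: in 2D, .volume is the area
f1_extent = np.ptp(points[:, 0])      # range along the first formant
f2_extent = np.ptp(points[:, 1])      # range along the second formant
corners = points[hull.vertices]       # the "rubber band" vertices
```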
One of the other things we wanted to know is how the monkey would sound if it were speaking. To do that, we modified some human recordings, in a way very similar to what Tecumseh showed earlier.
So we take a sentence spoken by a human and decompose it into the formant tracks, which represent the filter, and the source. Then we modify those formants to make them more like a monkey vocal tract. In the examples Tecumseh played for you, the formants were just shifted up or down; we did a little more. We needed to shift the formants up a bit, because the monkey vocal tract is shorter than the human vocal tract, so its formants tend to be higher. In addition, we found that the range of the second formant is somewhat reduced in the monkey vocal tract compared to the human one, so we also compressed the range of the second formant.
Then we resynthesized the sound. Now, the thing about an analysis in terms of source and filter is that it is complete: if you have the source information and the filter information, you can resynthesize the sound perfectly; there is no loss. So if we had used the human source with the modified formants, the result would probably have sounded too perfect. We wanted a source that was more monkey-like, so we also synthesized a new source, based on a very simple model of the monkey vocal folds, which vibrate in a much more irregular way than human vocal folds do. We took our monkey source, applied the modified formant filter to it, and got a realistic monkey vocalization.
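The recipe as described (shift the tracks up for a shorter tract, compress F2 about its mean, and drive the filter with a jittery pulse train instead of the clean human residual) can be sketched as follows; the shift and jitter numbers are illustrative, not the published parameters.

```python
import numpy as np

def monkeyize_formants(f1, f2, shift=1.1, f2_squeeze=0.7):
    """Map human formant tracks toward a shorter, F2-compressed tract."""
    f2_mid = np.mean(f2)
    f2_new = f2_mid + f2_squeeze * (f2 - f2_mid)  # compress the F2 range
    return f1 * shift, f2_new * shift             # shorter tract: all formants up

def irregular_source(n, fs=16000, f0=140.0, jitter=0.08, seed=0):
    """Pulse train with jittered periods: a crude 'monkey vocal fold' source."""
    rng = np.random.default_rng(seed)
    src = np.zeros(n)
    i = 0.0
    while i < n:
        src[int(i)] = 1.0
        period = fs / (f0 * (1.0 + jitter * rng.standard_normal()))
        i += max(period, 1.0)
    return src
```

The modified tracks then drive time-varying resonators like the ones sketched after the source-filter introduction.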
And this is where Tecumseh takes over again. Okay.
So, hopefully that satisfied your morning need for technical details. As a synopsis of the whole process: we X-rayed the monkey making a hundred different vocal tract configurations, basically everything the monkey did while he was in our X-ray apparatus; we traced those; we used the medial axis transform and then the diameter-to-area conversion to create the area function, the model of the vocal tract; and from that we synthesized the formants.
Here's what we get, compared with the original data from Lieberman that I showed you at the beginning. The red triangle represents a human female's F1-F2 range, with /i/, /a/ and /u/ making up the corners, and the little blue triangle is what the old model from Lieberman said a monkey could do. This is what our model looks like in comparison: unlike Lieberman's model, which is very restricted, we see that what a monkey actually does, or could do, spans quite a wide range in the first formant, with a somewhat compressed second formant.
We used that to create monkey vowels, artificial monkey vowels occupying the corners of that convex hull. With five monkey vowels in a discrimination task, humans are basically at ceiling: they do just as well with the monkey vowels as with human vowels. What that shows is that the monkey's capacity to produce a set of distinct vowels, the same number as in most human languages, namely five, is absolutely intact; the monkey's vocal tract has no problem doing that. We also have good indications that bilabial and glottal stops, et cetera, many of the different consonants, would be possible. So clearly the monkey vocal tract is capable of producing a wide range of phonemes.
Now, that all sounds very dry, so it's more interesting to hear what our model sounds like when it tries to imitate human speech. The human model for this was my wife: we had her speak a bunch of sentences, but rather than playing her version first, I'm going to play the monkey model first and see if you can understand what the monkey says. [plays audio] Everybody got it, right?
Okay, and this is my wife's formants with that synthetic monkey source. [plays audio] So:
what you can hear is that the phonetic content is basically preserved. The human formants are lower, which makes sense because humans are larger than monkeys, so the human version has a more bassy, less squeaky sound to it, but the phonetic content is present in both. What this shows us is that whatever keeps a monkey, or an ape raised in a human home, from speaking, it is not the peripheral vocal tract; it is not the anatomy of their vocal apparatus. That is the conclusion we drew in this paper, which was called "Monkey vocal tracts are speech-ready".
What that tells us is that rather than looking ever more closely at the anatomy of the vocal tract, we should be paying attention to the brain that is in charge. It would take another talk to explain; we have lots of evidence about what it is about the human brain that gives us such exquisite control over our vocal apparatus, but it doesn't seem that the vocal apparatus itself is the crucial thing. Put another way: we've done this with a monkey, but I'm quite sure the same would be true of a dog or a pig or a cow. If a human brain were in control of a dog or a cow or a pig or a monkey, that vocal tract would be perfectly able to communicate in English. So there's a lot of work to do before we make talking animals, but it's going to involve the brain, not the vocal tract.
Okay, that's our story; that went faster than we thought. Our general conclusions: you can use these methods, which were mainly developed by physicists and engineers to understand human speech, to understand and synthesize a wide variety of vertebrate sounds. I myself mainly work with birds and mammals, but other people have used the same methods on alligators and frogs, so these are very general principles: what you all learned in your intro speech class actually applies to most of the species we know about. And it's not the vocal tract that keeps most mammals from talking; it's really their neural control over that vocal tract.
And I think the more general message, which is probably meaningful to pretty much everybody in this room, is that a better understanding of the physics and physiology of the vocal production system, whether it's in a dog, a monkey, a deer or a whale, can, and should, play a key role in speech synthesis.
Bart, do you want to add a few extra words of wisdom? No? Okay, then I think we have plenty of time for questions. Thanks to all the people who did this work, and thank you for your attention. Shall we take questions at the microphone, or should I...?
Question: My question was inspired by your use of the voice box, the vocal folds. First, in the dog example, can we be sure about the larynx behaviour, the dynamics: the dog isn't trying to imitate a human; it's just what dogs do when they bark. That's one point. And the second: in the last part you said that the key difference lies in neural mechanisms. So my question is about the vocal fold dynamics and the mapping that happens in the subject: is there any kind of cue for this? [partly inaudible]

Just to clarify: are you asking about the recovery of the source properties, or...?

Question: I'm asking about the neural mechanism that is responsible for speech.
For auditory perception, or for production? Okay. So, what we know (I don't have a slide for this) is that in humans there are direct connections from the motor cortex onto the motor neurons that actually control the laryngeal and tongue muscles. Those direct connections from cortex onto the laryngeal motor neurons are not present in most mammals: they are absent in other primates, and they appear to be absent in dogs and cats and so on. But in those species that are good vocal imitators (and this includes many birds, like the parrots and mynah birds, but also some bats, elephants, and various cetaceans), in all of the groups that have been investigated, these direct connections, the equivalent of what we humans have, are present. So the current theory for what it is about our brains that gives us this control is that we have direct connections onto the motor neurons, whereas in most animals there are only indirect connections, via various brainstem intermediaries, onto the vocal system itself. In other words, it's essentially like a new gear shift on this ancient vocal tract, one that gives our brains more control over it than we would otherwise have.
Question: Thanks for a most interesting talk. I have pets at home, and it would be quite attractive to be able to communicate with them. You are probably aware of the papers recently published about converting brain signals directly to speech, that is, using speech synthesis for the reconstruction of speech from brain recordings. Do you think it would be possible to do something similar for our pets, so that they could "vocalise"? Would such a signal possibly be sufficient?
That's an interesting question. So: given that we can use neural signals, from fMRI or EEG, to synthesize okay speech, could we do the same thing for animals? My answer, for most animals, would be no, and the reason follows from my answer to the first question. In humans there is a correspondence between the cortical signals we can measure with something like fMRI or EEG and the actual sounds produced; but in most animals it is mainly the brainstem and the midbrain that control vocalisation when a cat meows or a dog barks. In fact, you can remove the cortex and a cat will still meow and a dog will still bark, in the same way that a human baby born without a cortex will still cry and laugh in a normal way.
I would also say, and this would be a lot easier to do and probably a better use of grant money: see if you can synthesize laughter and crying from a cortical signal. My prediction would be that you won't really manage it even in humans. A fake laugh, like when I go "ha ha" (that's not a real laugh), should be cortically controlled, but when I really laugh or really cry, that's coming from the subcortical brain, which is very hard to measure. So you shouldn't be able to synthesize realistic laughter and crying even with EEG, maybe.
Question: Do you have any evidence of the point in evolution at which this connection between the brain and the vocal tract starts appearing?
The unfortunate answer is no. Probably many of you know there's a whole field (I have a slide about this somewhere) that is essentially trying to reconstruct, from fossils, when in the common history of human evolution this capacity for speech occurred, and the old argument was always: if we could know when the larynx descended, we would know when speech occurred. But what I think I've shown you with all this work is that it's not laryngeal descent that's crucial for speech; it's these direct connections, and for those, unfortunately, there is just no fossil cue. That's soft tissue that simply isn't preserved, even for a Neanderthal, much less deeper in the fossil record; you would need detailed neuroanatomy at the micron level to answer the question, so it's hard even with genetics.
Please. Well, Tecumseh and I agree on the importance of the neural control, of course, but we may disagree on the exact interpretation of what the vocal tract data mean. In a sense, you could say there has been some fine-tuning of the human vocal tract for vocalization, and if you are a little liberal in interpreting what we find in the fossil record, you could say it happened somewhere between three million and three hundred thousand years ago. It's not very precise. The evidence for this is based on various cues from the base of the skull that supposedly indicate what the position of the larynx and tongue would have been.
"'cause" i have these slides and i took them out "'cause" i thought we'd be
too long i want to show you some examples on animals that have independently modify
their vocal tract
in a way that has nothing to do with speech so the way you can
make your vocal tract longer is one make your nose longer like this process monkey
or lots of various animals like elephants course you can stick your lips out which
many species do so if you do this you sound bigger and if you do
this you sound smaller or you can do more bizarre things like
make an extension to your nasal tract that forms a big crest like that dinosaur
up there or these birds which because sources at the base of the trachea have
elongated trachea and all of these adaptations seem to be ways of making that animal
sound bigger
it's just a nice example this is an animal with the permanently descended larynx is
a red deer and you'll find this a pretty impressive sound
[plays video] The first thing you probably noticed in that clip is the pumping of the penis; we're going to ignore that. Look at what's happening at the front of the animal, and you'll see something moving back and forth. When we first saw these videos we wondered what it was, and it turns out to be the resting position of the larynx: this is a permanently descended larynx in a nonhuman animal. Now watch what it does when the deer vocalises. [plays video]
I think we can all agree that that is a much more impressive descent of the larynx than the few centimetres that happens in humans. And these are not the only such species, because in our own species there is a secondary descent of the larynx that happens only in men, and only at puberty; I think that's exactly the same kind of adaptation that makes this deer, or a bird, sound bigger. So I guess that's where we differ: even if we knew when the larynx descended in humans, it could have been an adaptation just to sound bigger, and it might have been a million years after that before we started using it for speech. That's why I really don't think the fossils are going to give us the answer. The only way we're going to get it, I think, is from genetics: now that we're recovering genomes from ancient DNA, from the Neanderthals and the Denisovans, that might help us answer this question about the origins of speech.
Question: My question is about how animals communicate by voice. You talked mostly about the vocal tract, but a lot seems to have to do with the voice source. Do you have an idea of the vocabulary available to a species, using the voice not only for emotions but for social behaviours?
well not use the vocal really over emotions so for sure of social behaviors
we we've got actually quite a lot of evidence about sort of overall vocabulary size
for different species but most of that comes from relatively intuitive
scientist listen and they say it in a there's about five sounds there is about
twenty sounds there
only a few species have we really don't what we need to do which is
played back experiments to see what the animals discriminate from others and i would say
in many cases that shows us that something that we think is one thing what's
a i'm not i'm now or a bark or ground actually has multiple a variance
so but i think a conservative number for animal vocabularies is something like fifteen thumbs
and a less conservative number would be something like fifty difference
and in some birds it goes a lot larger than that but if you're talking
about your average mammal it somewhere in that right so roughly thirty would be a
good nonhuman primate
vocabulary size of discriminable so that have different meetings
of course there are sounds animals like can make thousands of different sounds
but they do this for example birds in their songs or wales in their songs
but they don't appear to use this to second of different meetings so then we
can talk about vocabulary anymore we have to just start talking about
it's more like
phonemes or syllables types router and then meetings
Can I add something? Sorry, there's somebody else first.

Question: What do we know about the frequency resolution of monkey hearing: could they hear the relative positions of all the formants?

To perceive it? Absolutely. Most monkeys have a higher high-frequency cutoff than us; most can hear up to forty or even sixty kilohertz, so their high-frequency range is more extensive than ours. But where it counts, in the low frequencies, they have perfectly good frequency resolution: from five hundred hertz up to twenty-five hundred or thirty-five hundred hertz, which is where all the formant information is, they hear fine. That's why an animal like a dog or a chimpanzee, or basically any other species you care to name, can learn to discriminate different human words: virtually every dog knows its name, and in some cases you can train a dog to discriminate between hundreds or even thousands of words. So the speech perception apparatus seems to be built on basically widely shared perceptual machinery.
Question: Sorry, I'm not in speech synthesis, so this may be a naive thing to ask, but why did you actually need to do all this? What if you had done the more standard phonetic thing and just recorded loads and loads of monkey vocalisations and measured the formants? What would happen if you did that?
Well, we've done that, and we've actually looked at that subset of the sounds. Remember, what we have is the vocal tract doing everything a monkey vocal tract does, and that includes things like feeding, chewing, swallowing, et cetera. It also includes a class of nonvocal displays that most monkeys and apes do, things like this, which is called lip-smacking. It's a very typical primate thing, but it's virtually silent: they make a tiny little bit of sound, and in some species they actually vocalise while they do it. It turns out that the monkey is doing a lot more with its vocal tract in these visual displays than it does in its auditory displays. So if we took only the vocal tract configurations where the monkey is making a sound, we would get a subset of what the vocal tract can actually do; in particular, these nonvocal communicative signals (you could call them visual communication signals) contain a lot of the interesting variance in vocal tract shape. And because they are silent, we had to figure out what they would sound like if the monkey vocalised through them. That's why we had to do all this work, and that's why it took years.
and then adjust and to that
well i guess coincidentally almost at the same time as our paper came out that
we change the way and according to which just mentioned here in the front and
came up with the paper where they get exactly what you would use it and
they five and basically that
actually what the user to different monkey species act-utterance but and they can produce a
surprisingly large range of silence that especially surprising if you compared to what the lieberman
had claimed that they could produce
but not as large as the range of sounds that are mobile produced so
they do mainly not produce in their in their actual productions the potential that they
have with their vocal tract
Question: I would like to confirm that I understood correctly what you said on that slide: that phonation is generally passive. Your evidence, or at least the experiment, is that just airflow is coming through, and from that we can say that the vibration is generally passive? I think this is too risky, because this is exactly what would happen if I were dead and you pushed airflow through my vocal folds; I don't think much would be different. In order to say that it is generally passive, I think you have to go and look more at neuronal activity, and not just at this experiment. I respect this work, but I think it is too dangerous to say this.
On that slide, I think there may be a misunderstanding, because we're not saying that you don't need muscles to put the larynx into phonatory position. Of course you do, and in this case I moved the tiger's larynx into the phonatory position myself. What we're saying is that the individual pulses that make up the fundamental frequency, the openings and closings of the glottis, are passively determined by things like muscle tension and air pressure. So we're not saying that muscle activity plays no role; we're saying it doesn't have to happen at the periodicity of the fundamental frequency. And that's obvious if you think about a bat producing sounds at forty thousand hertz: there's no way neurons can fire that fast; neurons basically can't fire faster than about a thousand hertz. So even if such direct driving could work for something like an elephant (and it does work for a cat purring at thirty hertz), it could never work for the higher vocalisations: not even a cat at two thousand hertz, and certainly not these animals producing in the high kilohertz range. It has to be passive, because there's no way neurons can fire, or muscles can twitch, that rapidly. So the claim is not that in humans, or any animal, you don't need muscles to control the larynx; you do. The claim is only that you don't need muscle activity at the fundamental frequency itself. Does that make sense? That's better.
Question: I'm just curious. You and Lieberman both did work trying to figure out exactly the same thing, on the same subject, and came to radically different conclusions. So what was the issue? Was Lieberman's approach never going to work, or what distinguished what you did from what he did? And what can that teach us, for other things we want to do, about not drawing wrong conclusions?
I would say, from the point of view of the technology (maybe you can comment on this too), that what we do to understand how you go from a vocal tract to formant frequencies hasn't changed much. They did a pretty good job given the computers they had; their simulation was pretty good. Their problem was in the biology: they took a single dead animal and expected that dead animal to tell them the range of motions possible in a living animal's vocal tract. So they had no indication of the dynamics of the vocal tract from their data, and that's why we needed these X-rays of a behaving monkey to find out.
Question: Okay, but you're not saying that you can never figure out what's going on from a dead animal?

By the way, that is Klatt, which should be a familiar name to people working on speech synthesis: Dennis Klatt is a co-author on one of those papers, and he was basically the person who did the acoustic modeling work. At the time there were a few competing labs working on speech synthesis, and the acoustic model I used for our monkey model is basically contemporaneous with that; indeed, you know, classic stuff. So basically they just didn't have the data. It's a bit like nineteen-eighties neural nets versus Google: they just didn't have the data, and we do. Yes.
Question: You said different species can make something like fifteen to fifty distinct calls. What are the semantics they are trying to express? Is the set of meanings essentially the same across species, or are they very different in what they're trying to express?
There's a certain set of core vocalisations that are very widely shared among species. For example, sounds that signal threat, that say "I'm big and scary", tend to be low-pitched and have very low formants; sounds that are appeasing, that say "don't hurt me, I'm just a little guy", tend to be high-frequency. We see that class of vocalisations, varying somewhat, across mammals and birds. Then there's a class of mating vocalisations that a lot of species have, but they typically sound very different: sometimes it's males just grunting like that, and sometimes it's much more interesting and complicated. Then there are mother-infant communications, the sounds a mother uses, particularly in mammals, to communicate with her offspring; again very widespread. And then there's the really weird stuff, like whale songs or echolocation clicks, calls that are found only in particular groups. So I'd say there's a kind of shared core of semantics, and then, such being biology, all kinds of weird stuff in the corners. But if you count parental care, aggression, affiliation, and also alarm calls and threat calls, which are pretty common, a handful of maybe five semantic axes would probably cover the standard repertoire.
There are also some vocalisations that basically say "I'm here", and other vocalisations that try their best to hide the caller: a very high-frequency, quiet call that tails off is hard to localise, and various alarm calls are like that.
Question: [partly inaudible] A dog, for example, can in fact understand quite a lot of human words, right? But the vocabulary it can express is so small. If it can understand so much, and its responses are so constrained, isn't that kind of frustrating for the animal?
Well, I think that is a fundamental finding of animal communication: animals understand a lot more than they can say. We have many species, for example, that understand not only their own species' calls but also learn the alarm calls of other species in their environment, and of course animals raised with humans learn to understand human words that none of those species ever produces. It's just like the child, or indeed any of us: our receptive vocabulary, the words we understand, is much larger than the number of words we typically say. For most animals, the receptive vocabulary is large and the productive vocabulary is very limited. Whether they find that frustrating or not, I don't know; that's harder to say.
Question: [partly inaudible] Humans have more control overall, but in your model you also used an excitation signal that was much more irregular. Could you say more about how you model that?
Going back to this image: we've now done a lot of excised-larynx work, and one of the things we found is that most species can very easily be driven into a chaotic state, where instead of the nice regular harmonic process you see here, you essentially get coupled oscillators in the vocal folds generating chaos. You can see the classic steps, from biphonation through period doubling into chaos, in the vocal folds of virtually every species we've looked at. It seems to be very easy for most animals to go into a chaotic state, and that's reflected in the fact that many of the sounds we hear animals produce have a chaotic source. Monkeys, for example, do this all the time, and even dog barks are like that: they let themselves use chaos much more than we do in speech. Unless you're Batman, nobody talks like that; we favour this harmonic source for most things, though if you listen to a baby crying you'll hear plenty of chaos. So what's hard to say is this: humans can produce chaos with their vocal folds, but do we just choose to use this nice regular harmonic, clear-pitch signal, because it's better for understanding or because it sounds nice, or are our vocal folds actually less inclined to go chaotic than those of other species? That's a question I don't think we can answer at present. But we certainly use a lot less chaos. In monkeys it's the most common thing you'll hear; these threat grunts are chaotic, and that's what we were trying to model in the synthesis.
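The "period doubling into chaos" route he describes is the classic nonlinear-dynamics cascade. It is easiest to see in the logistic map; this is an analogy for the qualitative behaviour, not a vocal fold model.

```python
# Logistic map x -> r*x*(1-x): as the control parameter r grows, a steady
# state gives way to period-2, then period-4 oscillation, and then chaos;
# qualitatively the same cascade seen in excised-larynx experiments.
for r in (2.8, 3.2, 3.5, 3.9):
    x = 0.4
    for _ in range(500):            # let the transient die out
        x = r * x * (1 - x)
    orbit = []
    for _ in range(8):
        x = r * x * (1 - x)
        orbit.append(round(x, 3))
    print(r, orbit)   # 1 repeated value, then 2, then 4, then irregular
```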
I've made a few models where there's interaction between the vocal tract and the vocal folds, also looking at chaotic vibrations, and one of the things you find is that with these chaotic vibrations it's somewhat (well, quite a bit) harder to control vocal fold onset; it tends to be more gradual, which makes it almost impossible, for instance, to make a distinction between voiced and voiceless consonants, which are pretty important in speech. So, just throwing it out there, it seems that the more regular vibration of the human vocal folds is useful for speech. Whether it's used by speech because it happened to be that way, or whether it became that way because it's useful for speech, that's another question.
Okay. Thank you very much.