Speech Transcript - Synthesizing animal vocalizations and modelling animal speech

0:00:09	so
0:00:11	tecumseh fitch is from the university of vienna he's a professor there
0:00:16	at the
0:00:16	department of cognitive biology
0:00:19	and his main interests are in the evolution of language and the
0:00:25	vocal communication in
0:00:27	vertebrates
0:00:29	and what makes this
0:00:30	also very interesting for us is that he
0:00:34	also uses synthetic speech
0:00:36	to
0:00:37	investigate these questions and to
0:00:40	test his hypotheses
0:00:43	and
0:00:44	bart de boer is
0:00:45	from the
0:00:48	ai lab the artificial intelligence lab of the vrije universiteit
0:00:52	brussel
0:00:53	and he's
0:00:55	interested also in the cognitive
0:00:58	bases of language and
0:01:01	also uses machine learning and speech technology for
0:01:06	investigating how
0:01:09	this
0:01:09	combinatorial
0:01:11	factor can
0:01:12	somehow be modeled
0:01:14	and
0:01:15	they're also very well known for their work and
0:01:19	also for the work on the
0:01:22	nine q
0:01:23	the monkey vocal tract being speech ready which we will hear today
0:01:29	this is what i'm because
0:01:36	my family
0:01:37	is that sounds pretty good fact there are not you
0:01:40	i'll try not to put
0:01:43	thank you michael effective for the kind introduction so this is the first time bart and
0:01:47	i have tried to do a
0:01:48	tag team keynote like this so we'll see how well it works but
0:01:52	i'll start off and then bart will
0:01:55	give you more technical details of the sort that i'm sure you're all hungry for
0:01:59	on saturday morning
0:02:00	but i'll try and start off by giving some
0:02:05	just perspective on why a biologist like myself
0:02:08	who's interested in animal communication would dive into speech science i actually studied speech science
0:02:14	with people like ken stevens at mit when i was a postdoc
0:02:18	and used what you guys invented to investigate how animals
0:02:24	make their sounds and what those sounds mean
0:02:29	and then we're basically gonna talk
0:02:32	so in other words that using the technology of speech science
0:02:36	to create animal sounds to understand animal communication and then in the second part
0:02:41	bart will turn that around and say how can we use an understanding of
0:02:45	the animal vocal tract
0:02:48	to understand the evolution of human speech
0:02:50	and the answer may surprise some of you
0:02:55	okay so why would why would anyone want to synthesize animal vocalisations why would you
0:03:00	wanna make a synthetic cat's
0:03:02	meow or a synthetic bark
0:03:05	and as i said
0:03:07	my main reason for this is because i'm a biologist
0:03:11	i'm interested in understanding the
0:03:13	biology of animal communication from the point of view of physics and physiology and
0:03:18	because speech scientists have done so much of that work we can essentially borrow that
0:03:22	to understand animal communication
0:03:25	and then we'll turn to the second part where we try and understand how our
0:03:29	speech act
0:03:30	so i'm sure this is very familiar to you but i just wanna very quickly
0:03:35	run through the source-filter theory i'm sure virtually all of you are familiar with this
0:03:39	theory
0:03:40	as it applies to human language what you might be more surprised by is how
0:03:45	broadly this theory applies across vertebrates
0:03:48	so with the possible exception of fish dolphins and other toothed whales and probably a few
0:03:56	others like some rodent high frequency sounds
0:03:59	this theory that was developed to understand our own speech apparatus and you know basically
0:04:03	from the nineteen thirties onto the nineteen seventies turns out to apply to virtually all
0:04:09	other sounds that you might think of dogs barking cows mooing birds singing et cetera
0:04:15	the basic idea of course is that we can break the speech production
0:04:19	process into two components the source which turns silent airflow into sound and the
0:04:25	filter which then modifies that sound
0:04:27	using formant frequencies which are vocal tract resonances that filter out certain frequencies
0:04:32	and this is an image that may look familiar
0:04:35	these are vocal folds except these are the vocal folds of a siberian tiger
0:04:40	so this is a larynx where the vocal folds are about that long
0:04:44	so of course it makes very low frequency vocalisations but you can see that the
0:04:49	basic process this aerodynamically excited vibration is pretty much the same as what you
0:04:55	would see in human vocal folds
0:04:58	and of course the vibration rate of these vocal folds the rate at which they
0:05:01	slap together determines the pitch of the sound
0:05:05	and you may be wondering how we did this we didn't have a live tiger
0:05:09	vocalising with an endoscope i wouldn't want to do that this is a dead
0:05:14	tiger so this larynx was removed from an animal that was euthanised put
0:05:17	on a table we blew air through it and we videotaped that and what that
0:05:21	shows is just like in humans
0:05:24	we don't need active neural firing at the rate of the fundamental frequency to create
0:05:30	the source
0:05:31	and that seems to be true in the vast majority of sounds bird song bats
0:05:36	are actually vocalising at fundamentals of eight khz
0:05:40	whales are vocalising at fundamentals of ten khz
0:05:43	all using the same principle
0:05:45	there are a few exceptions and my favourite one that many of you will be
0:05:48	familiar with
0:05:49	is a cat's purr
0:05:51	that's a situation where there is an actual muscle contraction well each contraction of the muscle
0:05:57	that generates the purr is driven by the brain so that's one of the few
0:06:02	exceptions where it's not this kind of passive vibration
0:06:05	but again for the vast majority of sounds that we're talking about including everything we
0:06:09	know from nonhuman primates this is the way
0:06:12	so then that source sound whether it's noisy or harmonic passes through the vocal tract
0:06:18	which
0:06:19	i show my students this image of the formants being like windows that allow certain
0:06:23	frequencies to pass through
0:06:25	but it's certainly much more fun to listen to what a formant is
0:06:29	what i've done here is used lpc resynthesis
0:06:32	to take human speech which is of course the source
0:06:36	and the filter combined
0:06:38	where and or
0:06:40	and now i'm gonna take the formants of that speech
0:06:43	and apply them to this source this is a bison roaring
0:06:48	and this is what we hear as a result
0:06:50	i
0:06:54	i think everybody can understand the words even though it sounds more
0:06:58	terrifying when it's a bison saying it
0:07:00	just another random example this is a narwhal
0:07:05	and here is the narwhal with my formants
0:07:11	okay so i think that illustrates the point the vocal signal we
0:07:15	hear is this composite of source and filter
0:07:18	and in these cases we can hear the filter doing the phonetic work
0:07:22	but the source still comes through loud
0:07:25	so taking these basic principles of source-filter theory we started thinking
0:07:30	okay what kind of
0:07:31	cues other than speech might be there in animal signals and one of the first
0:07:36	things that's now been
0:07:37	really extensively investigated was based on the idea that vocal tract length correlates with body
0:07:44	size and because formant frequencies are determined by vocal tract length maybe formants provide a
0:07:50	cue to body size in other species
0:07:52	so the first part of this is easy we just get mris or x
0:07:56	rays and measure the vocal tract length you can do that on anaesthetised animals
0:08:00	and then it's a little harder to get them to vocalise but when we
0:08:04	do that and measure the formants we find this is just one of many
0:08:07	cases these are monkeys that vocal tract length correlates with formant dispersion which is the
0:08:12	average spacing between the formants and because vocal tract length correlates with body size that
0:08:18	means that body length correlates very nicely
0:08:21	with well sorry this one shows body length correlates very nicely with formants
0:08:26	and i first showed this in monkeys but then we did it in dogs and in pigs it's
0:08:30	true in humans it's true in deer this seems like a kind of fundamental
0:08:34	aspect of the voice signal that it carries information about body size
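The link between formant dispersion and vocal tract length described above can be sketched numerically. This is a minimal illustration, not the speaker's analysis code: it assumes the textbook idealisation of the vocal tract as a uniform tube closed at the glottis and open at the lips, and an assumed speed of sound of 350 m/s. The function names are hypothetical.

```python
def formant_dispersion(formants_hz):
    """Average spacing between adjacent formants, in Hz."""
    f = sorted(formants_hz)
    if len(f) < 2:
        raise ValueError("need at least two formants")
    # The mean of adjacent differences telescopes to (last - first) / (n - 1).
    return (f[-1] - f[0]) / (len(f) - 1)


def uniform_tube_formants(length_m, n=3, c=350.0):
    """Resonances of an idealised uniform tube, closed at one end, open at the other.

    F_i = (2i - 1) * c / (4 * L); c (speed of sound, m/s) is an assumed value.
    """
    return [(2 * i - 1) * c / (4.0 * length_m) for i in range(1, n + 1)]


def vocal_tract_length_from_dispersion(dispersion_hz, c=350.0):
    """Invert the uniform-tube relation: spacing = c / (2 * L)."""
    return c / (2.0 * dispersion_hz)


# Usage: a 17 cm tube and a 10 cm tube; the shorter tract has the wider spacing.
long_tract = formant_dispersion(uniform_tube_formants(0.17))
short_tract = formant_dispersion(uniform_tube_formants(0.10))
print(round(long_tract), round(short_tract))
```

Real vocal tracts are not uniform tubes, so a length estimated this way is only an apparent length; the point is simply that longer tracts give more closely spaced formants.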
0:08:42	so
0:08:43	this is something that we can see as scientists objectively we can measure this
0:08:48	but the question is do animals pay attention to that
0:08:51	so it's fine if i go and i measure formants and i can say formants
0:08:54	correlate with body size but that's kind of meaningless for animal communication unless the animals
0:08:59	themselves perceive that signal
0:09:02	so
0:09:03	this is where animal sound synthesis comes in how do we ask that question how
0:09:07	do we find out whether an animal is paying attention to formants
0:09:10	and the answer this is a long time ago some of you
0:09:13	may recognise this old version of matlab running on an old macintosh is that i created
0:09:19	this animal sound synthesizer using very standard technology that most of you will be
0:09:24	familiar with basically
0:09:26	linear prediction predict the formants subtract those away and we have an error signal
0:09:30	which we can use as a source and then we can change the formants shift
0:09:34	only the formants leaving everything else the same and ask if the animals perceive that
0:09:39	shift in formants
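The linear-prediction step just described (estimate the filter, subtract it away to get an error signal, then drive a filter with that error signal as the source) can be sketched as follows. This is an illustrative pure-Python version using the autocorrelation method with the Levinson-Durbin recursion, which is an assumption on my part; the talk does not say which LPC variant the original MATLAB synthesizer used. The formant-shifting step itself (modifying the filter before resynthesis) is not shown, and all function names and the demo signal are hypothetical.

```python
import math
import random


def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n + k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns ([1, a1, ..., ap], error energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def inverse_filter(x, a):
    """Error (source) signal: e[n] = x[n] + a1*x[n-1] + ... + ap*x[n-p]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """All-pole resynthesis: y[n] = e[n] - a1*y[n-1] - ... - ap*y[n-p]."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc)
    return y


def demo_signal(n=400, seed=0):
    """Toy 'call': white noise passed through a single two-pole resonance."""
    rng = random.Random(seed)
    radius, theta = 0.95, 2 * math.pi * 0.1
    b1, b2 = 2 * radius * math.cos(theta), -radius * radius
    y = []
    for i in range(n):
        v = rng.gauss(0.0, 1.0)
        if i >= 1:
            v += b1 * y[i - 1]
        if i >= 2:
            v += b2 * y[i - 2]
        y.append(v)
    return y


x = demo_signal()
a, _ = levinson_durbin(autocorrelation(x, 4), 4)
source = inverse_filter(x, a)       # the "error signal" used as a source
replica = synthesize(source, a)     # unmodified filter: reproduces the input
```

With the filter left unchanged the resynthesis reproduces the input, which corresponds to the synthetic replica control used in the playback experiments; a formant-shifted stimulus would come from altering the coefficients `a` before calling `synthesize`.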
0:09:42	now the way we do these experiments how do you ask an animal whether it
0:09:45	perceives that we usually use something called habituation dishabituation
0:09:49	where we play a bunch of sounds that
0:09:52	in this case the formants remain the same but other aspects vary the fundamental
0:09:57	frequency the length et cetera vary but the formants are fixed
0:10:00	and now once
0:10:02	our listening animal
0:10:03	stops paying attention
0:10:05	so it may take
0:10:06	ten plays or a hundred plays before the animal finally stops looking at the
0:10:11	sound but once it's habituated to the original sounds then we play the sounds where
0:10:17	we change the formants or change whatever variable is of interest
0:10:20	and we
0:10:21	if the animal pays attention to that
0:10:23	if they perceive it
0:10:24	and find it
0:10:25	salient enough to be noticeable then they should look again
0:10:29	okay
0:10:30	so the first species i actually tried this with is whooping cranes i'll
0:10:34	explain why in a second
0:10:36	so what i'm gonna do now is sort of walk you through this experiment
0:10:39	these are whooping crane contact calls
0:10:41	and what we did is play a bunch of the actual calls from one particular
0:10:45	crane
0:10:46	and they sound like this
0:10:50	or
0:10:51	here's another one sounds pretty similar to our ears
0:10:56	and we keep playing those so these are recorded we're playing these
0:11:00	from a laptop and now we see if the listening bird looks up so we
0:11:05	wait till the bird goes down it's feeding we play one of these sounds and
0:11:09	it looks up
0:11:10	because it sounds like there's another whooping crane
0:11:12	so the logic is pretty simple
0:11:14	in the case of whooping cranes we had to do this in the winter
0:11:17	it takes these birds hundreds of trials before they stop listening before they stop paying
0:11:21	attention the laptop dies and it starts snowing et cetera et cetera
0:11:25	but eventually we were able to do this
0:11:27	where you get the bird habituated by playing these kinds of sounds
0:11:31	over and over
0:11:36	anyway and then
0:11:37	just to be safe
0:11:39	we play a synthetic replica that we've run through the synthesizer but without changing the
0:11:43	formants and if everything's fine they shouldn't dishabituate to that here's what
0:11:48	that sounds like
0:11:53	pretty similar
0:11:54	and now here's the key moment
0:11:56	we play either the formants lowered
0:11:59	or the formants higher
0:12:01	or
0:12:03	and of course you all can hear that because you're humans and we already
0:12:06	knew you perceive formants so the question is whether the birds do
0:12:09	and when we do this what we find is that initially
0:12:13	the birds respond eighty percent of the time on average but as we go as
0:12:17	we get to twenty five or thirty trials finally the last habituation
0:12:22	trial
0:12:22	by definition is the one where they don't look at all we actually get three
0:12:25	of those in a row now we play that synthetic replica they don't look
0:12:30	so that means our synthesizer is working and then finally we play these test stimuli
0:12:34	and
0:12:35	we get a massive dishabituation
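The trial logic of the habituation-dishabituation procedure described above can be written down as a small bookkeeping sketch. It assumes, following the talk, a criterion of three consecutive trials with no look, then a synthetic replica control, then the formant-shifted test stimulus. The function names, the outcome labels, and the example data are hypothetical and are not taken from the actual experiments.

```python
def trials_to_criterion(looked, run_length=3):
    """Return the 1-based trial on which `run_length` no-look trials in a row
    is first reached, or None if the animal never habituates."""
    run = 0
    for trial, did_look in enumerate(looked, start=1):
        run = 0 if did_look else run + 1
        if run == run_length:
            return trial
    return None


def classify_session(habituation_looks, replica_look, test_look, run_length=3):
    """Label one playback session from its look / no-look record."""
    if trials_to_criterion(habituation_looks, run_length) is None:
        return "not habituated"
    if replica_look:
        # Looking at the unmodified replica suggests the synthesis itself
        # is audible, so the test trial cannot be interpreted.
        return "synthesis artefact"
    return "dishabituation" if test_look else "no response to shift"


# Made-up example: the bird looks less and less, ignores the replica,
# then looks up again when the formants are shifted.
looks = [True, True, False, True, False, False, False]
print(trials_to_criterion(looks))             # 7
print(classify_session(looks, False, True))   # dishabituation
```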
0:12:38	so we've done this now with many different
0:12:40	species and always found the same thing it seems like paying attention to formant frequency
0:12:44	shifts
0:12:45	in this kind of context is a basic mammalian thing
0:12:49	birds do it monkeys do it dogs do it pigs do it and of course
0:12:54	people
0:12:55	so now you might ask can we go further with that and for example these
0:12:59	are two colleagues who have used animal sound synthesis
0:13:03	to basically look at what other species are using these formant frequencies for
0:13:10	in this case we can show that the deer or the koalas
0:13:14	are using these sounds as indicators of body size and the kind of evidence we
0:13:18	have is for example males played another male with lower formant frequencies
0:13:25	that is with an elongated vocal tract run away and are afraid females find them more attractive
0:13:30	et cetera et cetera this has again been done with many species
0:13:34	probably many of you have heard deer but you might not have heard
0:13:38	the koala this is a koala they have a very impressive vocalisation
0:13:48	if you're wondering how a little teddy bear sized animal
0:13:52	makes that terrifying sound
0:13:54	it's because they actually have a trick which is that they've
0:13:57	pulled the larynx down to make their vocal tract much longer than it would be
0:14:02	in a normal animal so by elongating their vocal tract they make themselves
0:14:06	sound bigger
0:14:08	these are a few of the many publications that use this approach that i've
0:14:13	just been telling you about to dig deeper into animal communication so i hope that
0:14:20	makes the case that this is a worthwhile thing to do again in a
0:14:23	wide variety of species
0:14:26	okay so now maybe getting to something that's closer to what a lot of you do
0:14:29	i wanna turn to the to the this is supposed to be part two sorry
0:14:33	we just
0:14:34	put this together yesterday
0:14:37	why would you
0:14:38	i mean how can you turn this around to start asking questions about
0:14:43	human communication based on what we understand about animals
0:14:46	and the first fact the kind of core fact that many people in the world
0:14:50	of speech science have been trying to understand for a long time is the fact that
0:14:54	we humans are amazing at imitating sounds we not only imitate the speech sounds of
0:14:59	our environment
0:15:00	but we learn to sing songs we can even imitate animal sounds
0:15:04	basically kids will imitate whatever sounds they hear
0:15:07	and it turns out that our nearest living relatives the great apes can't do this
0:15:11	at all
0:15:13	so this is just one example all these are examples of apes that have been raised
0:15:17	in human homes
0:15:19	and of course a human child by the age of about one is already making
0:15:22	sounds it is already starting to say its first words and making
0:15:26	the sounds of its environment that it hears in its native language phonology or
0:15:30	phonologies and no ape has ever done that no ape has even spontaneously said
0:15:35	mama much less learned complex vocalisations
0:15:39	and the question i mean people have known this for a long time
0:15:42	the question that has been driving this field for at least a hundred years
0:15:47	since darwin's time is why is
0:15:49	why is it that
0:15:51	an animal
0:15:52	that's seemingly so similar to us that can
0:15:55	where to do things like i h
0:15:57	and drive a car
0:16:00	can't even produce the most basic
0:16:02	speech sounds
0:16:03	with its vocal tract
0:16:06	so that's the sort of driving force behind the second part of
0:16:09	the talk
0:16:10	and there's two theories darwin had already mentioned this one is that it has something to
0:16:15	do with the peripheral vocal apparatus
0:16:17	and the other is that it has more to do with the brain and darwin
0:16:20	said well they probably both matter but the brain is probably more important what we're
0:16:24	gonna try and convince you now is that it is actually the brain that's key
0:16:29	and vocal tract differences although they exist are not what are keeping a monkey or
0:16:34	an ape from producing speech
0:16:37	now the most famous example of
0:16:40	a difference between us and apes is illustrated by these mris on
0:16:45	the left side we see here a chimpanzee and the red line marks
0:16:50	the vocal folds so that's the larynx
0:16:52	and of course in humans the larynx is descended in the vocal tract it pulls
0:16:57	down in the throat
0:16:58	whereas in the chimpanzee the larynx is in a high position engaged in the nasal
0:17:03	passage most of the time
0:17:04	and that means that the tongue
0:17:06	rests flat in the mouth the tongue is basically sitting
0:17:10	like this
0:17:11	what happens in humans
0:17:13	is that we essentially swallow the back of our tongue our larynx descends
0:17:18	pulling the tongue with it so that we have this two part tongue that we
0:17:21	can move up and down and back and forth and that's how we get this
0:17:25	wide variety of speech
0:17:27	so the idea and this goes back to darwin's time but it really became concrete
0:17:32	in the nineteen sixties is that
0:17:34	with a tongue like that
0:17:35	you simply can't make the sounds of speech and therefore no matter what brain was
0:17:40	in control that vocal tract can't make the sounds that you would need to imitate
0:17:44	speech
0:17:46	and it's a plausible hypothesis
0:17:48	it goes back to actually my mentor phil lieberman who was my phd
0:17:52	thesis supervisor who published a series of papers in the late sixties and early seventies
0:17:57	and what he did was take a dead monkey and make a cast of the
0:18:01	vocal tract of this monkey
0:18:03	they used that to produce a computer program to simulate the sounds that
0:18:07	vocal tract can make there was a lot of guesswork involved because it was one
0:18:11	dead monkey and one cast
0:18:13	but they did the best they could
0:18:14	and what they found this is a formant one
0:18:18	formant two space
0:18:19	what they found is here's the famous three vowels the point vowels of english
0:18:23	i a
0:18:25	and u that are found in most languages and all those things in there all
0:18:28	the numbers are what the monkey vocal tract or what the computer model of the
0:18:32	monkey vocal tract could do
0:18:34	so they concluded that the acoustic vowel space of a rhesus monkey is quite
0:18:38	restricted they lack the output mechanism
0:18:42	for speech per
0:18:44	and this is one of those ideas like i said it's well-founded in acoustics
0:18:47	if you look at what we actually do when we produce speech these are just a
0:18:51	couple of videos that will be familiar
0:18:54	a rainbow is a division of white light into many beautiful colours
0:18:57	you see that tongue dancing around in that two dimensional space
0:19:01	here it is slowed down a bit
0:19:10	so we use that additional space caused by swallowing the back of our
0:19:15	tongue we clearly are using that to its full extent when we produce speech
0:19:21	so i think this lieberman hypothesis is quite plausible
0:19:26	i became suspicious of this when we first started to do x rays of
0:19:29	animals as they vocalise instead of looking at dead animals like this this is the classic
0:19:34	way of analysing the animal vocal tract take a dead goat cut it in half and
0:19:38	draw conclusions from that we tried to get a goat vocalising in the x ray
0:19:44	harder than it may seem
0:19:46	i have had many animals sitting in a situation like this without vocalising at all
0:19:51	but this little goat was one of our first subjects and if we played it its
0:19:54	mother's bleats it would respond
0:19:56	and this is what we saw in the x ray
0:20:06	also use again i want you to look in this region right there
0:20:09	when you look at this the anatomists claimed
0:20:13	that the glottis prevents mouth breathing so in other words the idea based on the
0:20:17	static anatomy is that a goat can't breathe through its mouth
0:20:21	and so here's what we actually see
0:20:25	this i
0:20:26	pulling down a
0:20:30	such that every one of those vocalisations passes out through the mouth of the goat
0:20:34	now this shouldn't be that surprising if you think about it if you wanna make a loud
0:20:38	sound you should radiate through your mouth and not through your nose but
0:20:42	again this is what anatomists claimed was impossible up until we started doing
0:20:46	this work we've seen it in other animals so this is a dog you're gonna see
0:20:49	a very extensive pulling down of the larynx descent of the larynx when the
0:20:53	dog barks this is slow motion
0:21:01	however
0:21:05	that's the larynx
0:21:08	right
0:21:09	what you can see here is that every time the dog barks
0:21:12	the larynx pulls down pulling the back of the tongue with it and basically going
0:21:17	into a human like vocal configuration but only
0:21:22	while it's vocalising
0:21:24	the unusual thing about us is that our larynx stays low we keep our larynx low
0:21:28	all the time not only while we're vocalising
0:21:31	so when we first got these data more than it's almost twenty years ago i
0:21:36	became convinced that the descent of the larynx can't be the crucial factor
0:21:41	keeping animals from vocalising
0:21:43	but unfortunately in the text books it continued to be said that the reason monkeys can't vocalise apes
0:21:48	can't vocalise
0:21:49	is based on peripheral anatomy they just don't have the vocal tract
0:21:53	and it was when i saw the simpsons episode where
0:21:56	where
0:21:57	it system
0:21:58	the simpsons the main guy
0:22:01	bart no the old guy
0:22:03	homer homer thank you
0:22:04	where he gets this monkey
0:22:06	and the monkey can't talk so homer's learning sign language and they kept saying it's because
0:22:09	he doesn't have the vocal tract
0:22:11	so that's when we decided okay this dog and goat stuff isn't enough we have
0:22:15	to do it with nonhuman primates and working together with asif ghazanfar whose monkeys
0:22:21	they were and bart who's gonna take over from here we took x rays like
0:22:25	this one
0:22:27	the monkey vocalising
0:22:29	and you'll see there's a little movement of the larynx just the same as we
0:22:32	saw in the goat and in the dog and then we traced those to create a
0:22:36	vocal tract model and this is where bart's gonna
0:22:42	i
0:22:49	do you wanna take this
0:22:55	that looks good
0:22:56	a reality
0:22:58	okay
0:23:00	so
0:23:01	yes how we actually
0:23:06	and model to
0:23:10	to create
0:23:11	vocalisations of the monkey now
0:23:14	if you think about it it's a very different problem or a problem that requires
0:23:21	a very different solution from what we use for human speech because what we're trying
0:23:26	to do is to figure out what the monkey
0:23:30	could do in principle with its vocal tract and it's not based on what it's
0:23:34	actually doing the whole point is that monkeys don't talk so
0:23:40	so what we don't have is a corpus of data on which we could use
0:23:46	some kind of machine learning problem
0:23:49	so what we need to do is
0:23:52	take a really predictive approach
0:23:54	based on
0:23:56	what is in a sense a very old fashioned way of going about speech synthesis
0:24:01	which is articulatory synthesis i'll just recap very quickly
0:24:07	how it works for you but i assume you're all intimately familiar with it and
0:24:13	what i would like to stress however is that even though we're going to be
0:24:18	talking about biology and about speech science
0:24:22	these methods were developed by people who were actually engineers they were also people interested
0:24:28	in trying to be able to put as many phone conversations on transatlantic cables
0:24:35	as possible
0:24:37	and so this is very much
0:24:40	theory that has been developed by engineers by people who were working with
0:24:46	the same goals
0:24:48	as you guys
0:24:49	so how does articulatory synthesis work well you start with an articulatory model you start
0:24:55	with an idea of how the vocal tract works
0:25:00	and from
0:25:03	with a model you can create different positions of the tongue and lips et cetera
0:25:08	and from that you need to calculate what is called an area function so an
0:25:14	area function is basically the cross sectional area of the vocal tract at each position
0:25:20	in the vocal tract
0:25:22	and it turns out that the precise details of that area function
0:25:28	well the area is the thing that counts the precise shape in the sense that
0:25:34	for instance there is a
0:25:37	right angle here in the vocal tract but because of the wavelengths involved you
0:25:43	can ignore that so you can basically model it as a straight tube with a circular
0:25:51	cross sectional shape but the area is the important thing now of course if you
0:25:56	want to
0:25:58	model that in a computer model you have to discretise that so what
0:26:02	you end up
0:26:04	with is called a tube model so a number of tubes along the length
0:26:09	of the vocal tract from the
0:26:12	larynx basically to the lips
0:26:14	and then on the basis of that you can calculate the acoustic response either in
0:26:20	the time domain or the frequency domain so that's what we're going to do so how did
0:26:24	we do that for the monkey model
0:26:26	this is the x-ray image that tecumseh just showed
0:26:32	with the outline
0:26:34	and in red here you can see the outline of the vocal tract
0:26:39	so this is what we have this is what we start with we have we
0:26:43	had about a hundred of these
0:26:46	and i guess they were made by hand the tracings were made by hand and
0:26:50	so what we first need to do is to figure out
0:26:54	how the sound waves propagate through this tract
0:26:58	and for that the technique that we use is called a medial axis transform so
0:27:04	it's basically you're trying to squeeze
0:27:08	a circle
0:27:09	through that tract and that circle basically represents the propagating acoustic wavefront and the
0:27:18	line in the middle is kind of the center of the wavefront and the radius
0:27:23	of the circle
0:27:24	or the diameter of the circle is the diameter of the vocal tract
0:27:32	so this is what you end up with
0:27:38	and so
0:27:40	you can then calculate for each position
0:27:43	in the vocal tract
0:27:45	from the glottis to the lips
0:27:48	the diameter
0:27:52	okay so you have it
0:27:54	a function
0:27:57	the diameter of the vocal tract
0:27:59	at each point in the vocal tract
0:28:01	however the problem is that this is just
0:28:05	part of what we need we need to have the area
0:28:08	the diameter isn't enough so the problem is
0:28:14	we need to calculate the area on the basis of the observed diameter
0:28:21	now fortunately it turns out that to a good approximation for those monkey vocal tracts the
0:28:28	function converting diameter to area
0:28:32	is more or less the same everywhere in the vocal tract so how did we
0:28:36	figure that out
0:28:39	apart from the x-ray movies we also had a few mri scans of an
0:28:45	anaesthetised monkey
0:28:48	and if you look at that
0:28:51	so this is the side view so this is where basically the monkey's
0:28:55	lips are
0:28:58	this is its vocal tract
0:29:00	here's the larynx
0:29:01	and so you can make a few cross
0:29:04	sectional cuts there and you can see that the shape of the vocal tract
0:29:12	along these different
0:29:14	cross sections
0:29:16	follows this it's not quite a parabola but
0:29:20	this particular shape is kind of the same everywhere
0:29:24	and so what you want to know is
0:29:28	for a given opening of the vocal tract how large is that area so suppose
0:29:34	that the
0:29:35	the diameter would be
0:29:37	about
0:29:39	about this
0:29:42	so the area would be this now if you open up further then obviously
0:29:47	the area gets bigger and it turns out that follows you know it's just a matter
0:29:53	of integration and it turns out that what you find is that the area is proportional to
0:29:59	some constant
0:30:00	times the diameter to the power of
0:30:04	one point four there's no deep theoretical reason for that value of one point four
0:30:09	it's something that we learned from observing
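The diameter-to-area conversion just described is a simple power law, sketched below. Only the exponent of 1.4 comes from the talk; the constant `k` is not given there, so the default of 1.0 is a placeholder assumption, as are the function names and the example diameters.

```python
def area_from_diameter(diameter, k=1.0, exponent=1.4):
    """Cross-sectional area from midsagittal diameter: A = k * d ** exponent.

    exponent=1.4 is the empirically observed value quoted in the talk;
    k is a hypothetical placeholder (the talk does not state it).
    """
    if diameter < 0:
        raise ValueError("diameter must be non-negative")
    return k * diameter ** exponent


def area_function(diameters, k=1.0, exponent=1.4):
    """Apply the conversion at each position from glottis to lips."""
    return [area_from_diameter(d, k, exponent) for d in diameters]


# Usage with made-up diameters (cm) measured along the medial axis:
print(area_function([0.5, 1.0, 2.0]))
```

Because the exponent is below 2, doubling the opening multiplies the area by about 2.64 rather than by 4, which is what one would expect for a cross-section that is wider than it is tall.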
0:30:13	so now by applying that function to the diameters that we observe we actually find
0:30:20	a
0:30:23	the area function so this is
0:30:26	the position
0:30:27	and the area at each point
0:30:30	in the vocal tract now
0:30:34	the next step is turning that into sounds
0:30:39	and for that we use again a very old fashioned classical approach an acoustic
0:30:46	model an electric line analog of the vocal tract again you can kind of see
0:30:51	that historically a lot of this theory was
0:30:57	developed by electrical engineers because it's an electronic circuit so for each of those
0:31:05	discrete tubes
0:31:07	the electric line analog model basically models the physical wave equation with
0:31:13	a little electrical circuit
0:31:16	and from that
0:31:18	we can then calculate the
0:31:21	formant frequencies
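The step from an area function to formant frequencies can be sketched with a simplified frequency-domain tube model. This is not the speakers' implementation: instead of a full electric line analog with losses, it assumes lossless tube sections, a closed glottis driven by a volume-velocity source, and an ideal open end at the lips (zero pressure, no radiation load). Each section contributes a transmission-line chain matrix, and formants are the frequencies where the lips-to-glottis volume-velocity ratio goes to zero. The values for the speed of sound and air density, and all names, are assumptions.

```python
import math


def chain_d(freq_hz, areas, section_len, c=350.0, rho=1.2):
    """D element of the chain-matrix product from glottis to lips.

    For lossless sections the product keeps the form [[A, j*b], [j*cc, D]],
    so four real numbers are enough. With zero pressure at the lips,
    U_glottis = D * U_lips, so resonances are the zeros of D.
    """
    k = 2.0 * math.pi * freq_hz / c
    cs, sn = math.cos(k * section_len), math.sin(k * section_len)
    A, b, cc, D = 1.0, 0.0, 0.0, 1.0          # identity matrix
    for area in areas:                         # ordered glottis -> lips
        z = rho * c / area                     # characteristic impedance
        a2, b2, c2, d2 = cs, z * sn, sn / z, cs
        A, b, cc, D = (A * a2 - b * c2, A * b2 + b * d2,
                       cc * a2 + D * c2, D * d2 - cc * b2)
    return D


def formants(areas, section_len, n_formants=3, f_max=6000.0, step=10.0):
    """Scan for sign changes of D(f) and refine each one by bisection."""
    found = []
    f_lo, d_lo = step, chain_d(step, areas, section_len)
    while f_lo < f_max and len(found) < n_formants:
        f_hi = f_lo + step
        d_hi = chain_d(f_hi, areas, section_len)
        if d_lo * d_hi < 0:
            lo, hi, dl = f_lo, f_hi, d_lo
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                dm = chain_d(mid, areas, section_len)
                if dl * dm <= 0:
                    hi = mid
                else:
                    lo, dl = mid, dm
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Usage: a uniform 17 cm tube in 1 cm sections (areas in m^2, lengths in m)
# should give roughly odd multiples of c / (4 * L).
print([round(f) for f in formants([3.0e-4] * 17, 0.01)])
```

A realistic model would add wall losses, radiation impedance at the lips and possibly side branches, which shift and damp the resonances; the sketch only shows how the shape of the area function determines where the formants fall.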
0:31:26	so for each of those hundred points
0:31:29	we
0:31:32	we can calculate the first and the second and third formant and these are the
0:31:37	values we actually calculated for all those
0:31:42	all those points
0:31:46	and
0:31:49	at this point we've kind of
0:31:55	determined what the acoustic abilities of the monkey vocal tract are now
0:31:59	from there there's different things that you could do
0:32:03	in principle
0:32:06	on the basis of this kind of data you can actually make a computer articulatory
0:32:10	model
0:32:11	and so this is something that shinji maeda has done in nineteen eighty
0:32:16	nine again you know quite some time ago on the basis of very similar
0:32:21	data about the human vocal tract
0:32:26	but
0:32:28	it's not certain that we have enough data to actually do the same thing so
0:32:33	shinji maeda what he did was he made a thousand
0:32:39	tracings of the vocal tract and if you know how
0:32:42	difficult it is to make a single tracing
0:32:45	you can imagine how much time he must've spent on making this model
0:32:51	and what he then did is basically
0:32:55	subject these articulations to a factor analysis and basically derive an articulatory
0:33:02	model
0:33:03	an articulatory synthesizer so you could basically then use that model to synthesize new sounds
0:33:10	now the problem is we don't have that many tracings so we probably
0:33:15	couldn't make a good quality model
0:33:21	what we wanted to do and what tecumseh is going to explain in a
0:33:24	moment is re-synthesize some of these sounds and that's still very
0:33:30	challenging with an articulatory synthesizer and it wasn't really necessary for what we wanted to
0:33:37	do so we took a slightly different approach
0:33:40	now
0:33:43	one of the things we wanted to do was just quantify the
0:33:48	articulatory abilities of monkeys and compare them to humans
0:33:53	and in order to do that
0:33:55	we could measure the
0:33:58	acoustic range of the monkey vocalisations and one way to do that is by calculating
0:34:04	the convex hull now again i assume you're all familiar with what a convex hull
0:34:09	is i'll just very quickly show you how we did it basically if you want to
0:34:14	calculate the convex hull
0:34:16	you start with one of the extreme points
0:34:21	and then you
0:34:23	basically
0:34:26	fit a line
0:34:27	around those points like if you would take a rubber band and
0:34:32	just
0:34:33	squeeze it around the points and then you can do several things you can calculate
0:34:37	the area of the convex hull or you can calculate the extent of these
0:34:42	things in the f one the first formant or the second formant and the
0:34:47	thing that we did was we based ourselves on the extent
0:34:52	well on the area and the extent
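The hull, its area, and the extent in each formant can be sketched as below. This is only an illustration under stated assumptions: it uses Andrew's monotone chain algorithm rather than the rubber-band wrapping described above (both yield the same hull), and the (F1, F2) values in the example are made up.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull in counterclockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


def hull_area(hull):
    """Polygon area by the shoelace formula."""
    total = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def extents(points):
    """Range covered along F1 and along F2."""
    f1 = [p[0] for p in points]
    f2 = [p[1] for p in points]
    return max(f1) - min(f1), max(f2) - min(f2)


# Hypothetical (F1, F2) pairs in Hz, one per vocal tract configuration
vowel_space = [(300.0, 800.0), (300.0, 2300.0), (800.0, 800.0),
               (800.0, 2300.0), (500.0, 1500.0)]
hull = convex_hull(vowel_space)
print(hull_area(hull), extents(vowel_space))
```

Comparing the area and the two extents between species is one simple way to express "how large" each acoustic space is.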
0:34:55	and one of the things we did is that we
0:35:00	wanted to know how the monkey would sound if
0:35:03	it would be speaking
0:35:06	and in order to do that we
0:35:08	modified some human sounds in a way very similar to what tecumseh just showed
0:35:16	remote recordings
0:35:18	and so this is a
0:35:24	sentence spoken by a human we split this into the
0:35:30	formant tracks which basically represent the
0:35:35	filter and the source
0:35:38	and then we modified those formants
0:35:42	in a
0:35:44	in a way to make it more similar to a monkey vocal tract so what
0:35:47	you've seen so far in the examples that tecumseh has played to you is
0:35:53	where the formants were just shifted up or shifted down we did a little more
0:35:58	so we modified them
0:36:01	didn't just so the
0:36:05	we need to shift the formants up a little bit because the monkey vocal tract
0:36:10	is shorter than the human vocal tract so that the formants tend to be higher
0:36:15	but in addition what we found is that the range of the second formant is
0:36:21	somewhat reduced in the monkey vocal tract
0:36:24	in comparison to the
0:36:27	human vocal tract so we also
0:36:30	compressed the range of the second formant
0:36:33	and then we resynthesized the sound
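The formant modification just described (shift everything up, then narrow the second formant range) could look something like the sketch below. The function name, the shift factor, the compression factor, and the choice to compress around the mean F2 of the utterance are all assumptions for illustration; the talk does not give the actual values or procedure.

```python
def monkey_like_formants(track, shift=1.2, f2_compression=0.6):
    """Scale all formants up and compress the F2 range around its mean.

    track: list of (f1, f2, f3) tuples in Hz, one per analysis frame.
    shift: multiplicative factor standing in for a shorter vocal tract.
    f2_compression: 1.0 keeps the F2 range, smaller values narrow it.
    """
    f2_center = sum(frame[1] for frame in track) / len(track)
    modified = []
    for f1, f2, f3 in track:
        f2_narrow = f2_center + f2_compression * (f2 - f2_center)
        modified.append((f1 * shift, f2_narrow * shift, f3 * shift))
    return modified


# Two hypothetical frames: one back-vowel-like, one front-vowel-like
human_track = [(300.0, 1000.0, 2500.0), (700.0, 2000.0, 2500.0)]
print(monkey_like_formants(human_track))
```

The modified track would then drive the filter part of a source-filter resynthesis.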
0:36:36	now
0:36:37	the thing with
0:36:40	an analysis in terms of source and filter
0:36:44	is that it's complete so if you have the source information and the filter information
0:36:52	you can basically
0:36:54	re-synthesize the sound perfectly there's no loss
0:36:59	so if we would have used just
0:37:01	the human source with the modified formants the sound would probably have sounded too perfect
0:37:09	so what we wanted to do is use a source that was more monkey like
0:37:14	so we actually also synthesized a new source which was based on a very simple
0:37:20	model
0:37:23	of the monkey vocal folds which vibrate in a much more irregular way than human vocal folds
0:37:29	do so we took our monkey source
0:37:34	applied
0:37:35	the modified formant filter to it
0:37:38	and then we got a real monkey vocalization
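A source-filter resynthesis with an irregular source can be sketched as follows. This is not the source model from the talk, which is only described as "very simple": here the irregularity is approximated by random jitter on the pulse periods, and the filter is a cascade of standard two-pole digital resonators. All names and parameter values are hypothetical.

```python
import math
import random


def irregular_pulse_train(n_samples, sample_rate, f0, jitter, seed=0):
    """Unit pulses whose periods vary randomly by up to +/- jitter (0 <= jitter < 1)."""
    rng = random.Random(seed)
    signal = [0.0] * n_samples
    position = 0.0
    while True:
        period = (sample_rate / f0) * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        position += period
        index = int(round(position))
        if index >= n_samples:
            break
        signal[index] = 1.0
    return signal


def resonator(signal, sample_rate, freq, bandwidth):
    """Two-pole resonator with unity gain at 0 Hz, one per formant."""
    r = math.exp(-math.pi * bandwidth / sample_rate)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sample_rate)
    b2 = -r * r
    gain = 1.0 - b1 - b2
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(formant_specs, n_samples, sample_rate, f0, jitter):
    """Pass the irregular source through one resonator per (freq, bandwidth) pair."""
    wave = irregular_pulse_train(n_samples, sample_rate, f0, jitter)
    for freq, bandwidth in formant_specs:
        wave = resonator(wave, sample_rate, freq, bandwidth)
    return wave


# One second of a static vowel-like sound with made-up formant values
samples = synthesize([(600.0, 80.0), (1500.0, 100.0), (2800.0, 120.0)],
                     16000, 16000, 100.0, jitter=0.2)
```

With jitter set to 0 the source is perfectly periodic, which corresponds to the "too perfect" case mentioned above; raising it makes the voice rougher. A time-varying version would update the resonator frequencies frame by frame from the modified formant tracks.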
0:37:42	and this is where tecumseh takes over again
0:37:44	okay
0:37:45	so
0:37:51	hopefully that satisfied your morning need for technical details but now you must all be
0:37:57	wondering after this is just a synopsis of the whole process that we x-ray the
0:38:00	monkey making a hundred different vocal tract configurations
0:38:04	basically everything that monkey did while he was in our x ray
0:38:08	we traced those
0:38:09	we used the medial axis and then this complex diameter to area function to
0:38:15	create the
0:38:16	model of the vocal tract and then we can synthesize formants from that
0:38:21	and so what we get here's the original data from lieberman that i showed you
0:38:26	at the beginning so the red triangle represents a human female's vocal the f one
0:38:32	f two range of a human female with i a and u making up
0:38:36	the points
0:38:37	and that little blue triangle is what the old model from lieberman said a monkey
0:38:42	could do
0:38:43	and this is what our model looks like compared to that
0:38:47	so unlike lieberman's model which is very restricted we can see that
0:38:51	what a real monkey actually does leads to a quite wide variety in the
0:38:56	first formant but a somewhat compressed second formant
0:39:01	we used that to create monkey vowels so artificial monkey vowels that occupy the
0:39:07	corners of that convex hull so with five monkey vowels in a discrimination
0:39:11	task humans are basically at ceiling so they do just as well with the
0:39:15	monkey vowels as they do with human vowels and what that shows us
0:39:19	is that a monkey's capacity to produce a diverse set of vowels the same
0:39:23	as the number in most human languages namely five
0:39:26	is absolutely intact so the monkey's vocal tract
0:39:29	has no problem doing that
0:39:31	we also have good indications that things like bilabial and glottal stops et cetera et
0:39:37	cetera many of the different consonants would be possible so clearly the monkey vocal tract
0:39:42	is capable of producing a wide range of sounds
0:39:45	now that all sounds very dry so it's kind of more interesting to hear what our
0:39:49	model sounds like if we're trying to imitate human speech
0:39:53	so the model for this was my wife
0:39:57	so we had her speak a bunch of sentences but rather than play her first
0:40:01	which you should understand i'm gonna play the monkey model first and see if you
0:40:04	can understand what the monkey says
0:40:06	right i
0:40:09	right
0:40:11	everybody got it right
0:40:14	okay and there this is my wife's formants with that synthetic monkey source
0:40:21	i
0:40:23	okay
0:40:24	right i
0:40:27	time so
0:40:28	what you can hear is that the phonetic content is basically preserved the human
0:40:33	formants are lower which makes sense because humans are larger than monkeys so it has
0:40:38	a more based c and less where you're the sound to it but i
0:40:43	that the phonetic content is basically present so what this shows us is that whatever
0:40:48	it is that keeps a monkey or an ape raised in a human home from speaking
0:40:53	it's not the peripheral vocal tract it's not the anatomy of their vocal tract
0:40:59	and that's basically the conclusion that we drew from this paper the paper was called
0:41:02	monkey vocal tracts are speech ready
0:41:05	and what that tells us is that rather than looking more at the anatomy of
0:41:09	the vocal tract
0:41:10	we should be paying attention to the brain that's in charge and that
0:41:16	would be another talk to explain we have lots of evidence about what it is about
0:41:19	the human brain that gives us such exquisite control over our vocal apparatus but it
0:41:23	doesn't seem that the vocal apparatus itself is
0:41:26	the crucial thing and put in other terms we've done it with the monkey but
0:41:30	i'm quite sure that the same thing would be true with a dog or a
0:41:33	pig or a cow if a human brain were in control a dog or a
0:41:38	cow or a pig or a monkey
0:41:41	vocal tract would be perfectly able to communicate english
0:41:45	so
0:41:46	there's a lot of work to do before we make talking animals but it's gonna
0:41:49	involve the brain and not the vocal tract
0:41:53	okay so that is our story that was actually faster than we thought just to
0:41:57	say our general conclusion is that
0:42:01	you can use these methods that were mainly developed by physicists and engineers to understand
0:42:06	human language or human speech to basically understand and synthesize a wide variety of vertebrate
0:42:13	sounds
0:42:14	i mainly work with birds and mammals but other people have used
0:42:18	these same methods to do things like alligators and frogs so these are very general
0:42:24	principles what you all learned in your sort of intro to speech class actually applies
0:42:28	to most of the species we know about
0:42:31	it's not the vocal tract that keeps most mammals from talking it's really their neural
0:42:36	control of that vocal tract
0:42:38	and i think the more general message that's probably
0:42:42	meaningful to pretty much everybody in this room is a better understanding of the physics
0:42:47	and physiology of the vocal production system whether it's in a dog a monkey
0:42:52	a deer or a walrus can really play a key role it should play a key
0:42:56	role in speech synthesis
0:42:59	and bart do you wanna say a few extra words of wisdom i guess
0:43:03	no
0:43:06	okay so we i think we have plenty of time for questions so thanks to
0:43:10	all the people who did this work and thank you for
0:43:29	it'll take the question mike or should i
0:43:31	i
0:43:34	a cushion is able to
0:43:37	inspired by using the women the ball box
0:43:43	the vocal folds
0:43:45	them again example can force for by using the like behaviour the dynamics will say
0:43:52	he's trying to imitate a human it's just what dogs do when they bark it's
0:43:56	the ways a second this is one point so and the second is that at
0:44:01	the last part of the user that
0:44:03	the key by the key difference lies in new mechanisms was really in the what
0:44:07	no mechanism yes neural mechanism so my question is able
0:44:13	as sometimes because of the dot plot the that this happens so will be disabilities
0:44:18	but actually act was again and almost a result of the bit if
0:44:23	it is not gonna but only in time
0:44:26	so my question was
0:44:29	i just talked that the debut the end of the vocal fold dynamics for the
0:44:33	ball but
0:44:34	and the most mapping that happens in the subject
0:44:37	because of that these so is there any kind of q for this was a
0:44:41	good use ms
0:44:42	question i two r are you asking about the recovery of the source properties or
0:44:47	i'm asking about the new them again is on that is responsible because for that
0:44:51	piece was good
0:44:53	for the auditory perception or for the production okay so what we know i don't
0:45:00	have a slide for this but we know that in humans there are direct connections
0:45:04	from the motor cortex onto the neurons that actually control the laryngeal
0:45:09	and the tongue muscles
0:45:11	those direct connections from cortex onto the laryngeal motor neurons are not present
0:45:16	in most mammals
0:45:17	so these are absent in other primates they appear to be absent in dogs and cats
0:45:22	and travel et cetera but in those species which are good vocal imitators and
0:45:27	this includes many birds the parrots and mynah birds but it also includes some bats
0:45:32	it includes elephants it includes various cetaceans
0:45:36	so in all of those groups that have been investigated these direct connections the equivalent
0:45:40	of what we humans have are present so the current theory for what it is
0:45:46	about our brains that gives us this control is that we have direct connections
0:45:51	onto the motor neurons
0:45:52	and in most animals there's only indirect connections via various brain stem intermediaries onto the
0:45:59	vocal system itself
0:46:00	so in other words we've got this new it's essentially like a new gear
0:46:04	shift on this ancient vocal tract that we've got
0:46:08	that gives our brains more control over it than we would otherwise have
0:46:15	a lot more interesting talk
0:46:18	so myself i have a free pass at home and a white or evidence we
0:46:23	nitpick
0:46:24	and so i also works for that it would be quite directional at all be
0:46:28	remote or police and what they are saying yes i don't is you are there
0:46:32	are also paper published in a channel about converting bring thing last told to speech
0:46:37	that the much using speech synthesis for a construction
0:46:40	of speech from right how do thing how it is possible to actual and or
0:46:44	something similar for our pets to be able to evangelise handle task a signal possible
0:46:51	sufficient
0:46:52	but that's an interesting question so if
0:46:55	given that we can use neural signals from fmri or eeg to synthesize
0:47:02	speech
0:47:04	could we do the same thing for animals and my answer for most animals because
0:47:09	of my answer to the first question would be no the reason is that there
0:47:14	is a correspondence between the cortical signals that we can measure with something like fmri
0:47:21	or eeg and the actual sounds that are produced
0:47:24	because in most animals it's mainly the brain stem and the midbrain that are controlling
0:47:30	these as when a cat is meowing or a dog barks
0:47:33	in fact you can remove the cortex and a cat will still meow
0:47:37	a dog will still bark
0:47:39	in the same way that a human baby who's born without cortex will still cry
0:47:43	and laugh
0:47:44	in a normal way
0:47:45	so i but also say if i would be a lot easier to do this
0:47:48	is probably better usage rent money
0:47:51	see if you can synthesize laughter and crying
0:47:54	from a cortical signal y prediction would be you and if you can do that
0:47:59	humans then you won't be able to do it in so i would predict a
0:48:03	fake laugh like when i go ha that's not a real laugh that should be
0:48:08	cortically controlled but when i really laugh or i really cry
0:48:12	that's gonna be coming from this core brain that's very hard to measure and so
0:48:16	you should be able to synthesize realistic laughter crying even it easy maybe
0:48:29	do you have any evidence of what the which point enables cmbp connection between the
0:48:33	brain and the vocal tract it starts appearing
0:48:36	that's the unfortunate answer to that is no probably many of you know there's a
0:48:41	there's a whole field in this you have a slide about this there's a whole
0:48:45	field that's essentially trying to reconstruct
0:48:49	based on fossils when in our history when in the
0:48:54	history of our evolution our capacity for speech occurred and the old argument
0:49:01	was always based on if we could know when the larynx descended then we would
0:49:05	know when speech occurred
0:49:06	but what i think i've shown you in all this work is that it's not
0:49:10	laryngeal descent
0:49:11	that's crucial for speech it's these direct connections
0:49:14	and for those unfortunately there's just no fossil cue
0:49:18	as to whether there's direct connections that's basically the stuff that really doesn't get preserved even for
0:49:23	an hour
0:49:25	much less in the fossil record you would need
0:49:27	detailed neuroanatomy on the micron level to answer that question so it
0:49:32	even it's even hard with again
0:49:34	please
0:49:37	so tecumseh and i
0:49:40	well we agree on the importance of the neural control of course
0:49:45	but we kind of disagree on the
0:49:48	exact precise interpretation of and what the vocal tract data means and video clip
0:49:58	i can we do this you know how we think we're
0:50:04	so in a sense you could say that there has been some fine tuning of
0:50:09	the human vocal tract for vocalization and if you
0:50:15	you know if you're a little liberal in the interpretation of what we
0:50:18	find in the fossil record you can say
0:50:23	it happened somewhere between three million and three hundred thousand years ago
0:50:29	it's not very precise i
0:50:34	so the evidence for this is based on various cues that supposedly indicate based
0:50:40	on the base of the skull what the position of the larynx and tongue would
0:50:44	be just 'cause
0:50:46	'cause i have these slides and i took them out 'cause i thought we'd be
0:50:48	too long i want to show you some examples of animals that have independently modified
0:50:54	their vocal tract
0:50:56	in a way that has nothing to do with speech so the way you can
0:50:58	make your vocal tract longer is one make your nose longer like this proboscis monkey
0:51:02	or lots of various animals like elephants of course you can stick your lips out which
0:51:07	many species do so if you do this you sound bigger and if you do
0:51:11	this you sound smaller or you can do more bizarre things like
0:51:14	make an extension to your nasal tract that forms a big crest like that dinosaur
0:51:19	up there or these birds which because their source is at the base of the trachea have
0:51:24	elongated trachea and all of these adaptations seem to be ways of making that animal
0:51:29	sound bigger
0:51:30	just a nice example this is an animal with a permanently descended larynx it's
0:51:35	a red deer and you'll find this a pretty impressive sound
0:51:39	wow
0:51:41	wow
0:51:43	so the first thing you probably noticed in that images that pinnits pumping that we're
0:51:47	going back that ignore that look at what's happening
0:51:50	okay what's happening in the front of the animal and you'll see
0:51:54	i as well
0:51:57	back and forth
0:51:58	and so when we first saw these videos we were like what is this and
0:52:01	it turns out what this is that's the resting position of the larynx that is a
0:52:06	permanently descended larynx in a nonhuman animal and watch what it does when it vocalises
0:52:13	i
0:52:16	i
0:52:22	so i think we could all agree that's a much more impressive descent of
0:52:26	the larynx than the few centimetres that happens in humans
0:52:30	and it turns out
0:52:32	these are not the only species because in our own species there's a secondary
0:52:36	descent of the larynx that happens only in men and only at puberty and
0:52:40	i think that's exactly the same kind of adaptation that makes this deer
0:52:44	sound bigger the aurora or a bird sound bigger so i guess that's where we
0:52:49	differ i think that
0:52:50	even if we know when the larynx descended in humans it could have
0:52:54	been an adaptation to just make yourself sound bigger and it might have been a
0:52:58	million years after that
0:53:00	that we started using that for speech
0:53:02	so that's why i really don't think the fossils are gonna answer because we do
0:53:05	not have any answer the only way we're gonna get it i think is by
0:53:08	is from genetics now we're recovering
0:53:11	the genome from denisovans and neanderthals and these might help us answer
0:53:16	this question about the recognition
0:53:22	i've just want to mention that the result where you know scores against based on
0:53:26	the part of the story my question is about earlier you and more to communicate
0:53:33	of course okay bye divorce so
0:53:36	you know you're talking about the vocal tract varies with a voice source of for
0:53:42	really downtime it's whatever
0:53:45	a lot of seems to do with a with a voice source do have an
0:53:48	idea of video poker bring
0:53:50	which is i don't aboard to
0:53:53	to use pieces
0:53:57	well not use the vocal really over emotions so for sure of social behaviors
0:54:03	we we've got actually quite a lot of evidence about sort of overall vocabulary size
0:54:08	for different species but most of that comes from relatively intuitive
0:54:13	scientists listen and they say there's about five sounds here there is about
0:54:18	twenty sounds there
0:54:20	only in a few species have we really done what we need to do which is
0:54:23	playback experiments to see what the animals discriminate from others and i would say
0:54:28	in many cases that shows us that something that we think is one thing like
0:54:32	a meow or a bark or a growl actually has multiple variants
0:54:39	so but i think a conservative number for animal vocabularies is something like fifteen sounds
0:54:45	and a less conservative number would be something like fifty different sounds
0:54:49	and in some birds it goes a lot larger than that but if you're talking
0:54:52	about your average mammal it's somewhere in that range so roughly thirty would be a
0:54:57	good nonhuman primate
0:55:00	vocabulary size of discriminable sounds that have different meanings
0:55:04	of course there are animals that can make thousands of different sounds
0:55:09	but they do this for example birds in their songs or whales in their songs
0:55:13	but they don't appear to use this to signal different meanings so then we
0:55:18	can't talk about vocabulary anymore we have to just start talking about
0:55:23	it's more like
0:55:24	phonemes or syllable types rather than meanings
0:55:29	we will say something
0:55:32	sorry
0:55:36	is there's somebody else but who and what do we know what is the frequency
0:55:41	resolution of the monkey hearing
0:55:44	so that they could hear the relative position of all the formants but
0:55:48	to reproduce it absolutely i mean most monkeys have a higher high
0:55:53	frequency cutoff most monkeys can hear up to forty or even sixty khz so
0:55:58	the high frequencies are more extensive than ours
0:56:01	but where it counts in the low frequencies they have perfectly good frequency resolution so from five
0:56:06	hundred hz to twenty five or thirty five hundred hertz which is where all that
0:56:09	formant information is they can they can
0:56:12	and that's why of course an animal like a dog or a chimpanzee or basically any
0:56:16	other species that hears can learn to discriminate different human words
0:56:21	virtually every dog knows its name and in some cases you can train a dog
0:56:24	to discriminate between hundreds or even thousands of words
0:56:27	and they can do that
0:56:29	so the speech perception apparatus seems to be built on basically widely shared
0:56:34	perceptual mechanisms
0:56:38	sorry
0:56:39	i'm nothing and speech synthesis and of course leaving about how to
0:56:42	it would be a place to say that but
0:56:44	why
0:56:45	actually did you
0:56:47	need to do this synthesis why did you not do the sort of more
0:56:50	standard phonetic thing and just
0:56:52	record loads and loads of monkey vocalizations and measure the formants and what
0:56:58	would happen if you did that
0:57:00	well we we've done that and we've actually looked at the subset of the sounds
0:57:04	so remember what we have is a sample of these vocal tracts doing what monkey vocal
0:57:09	tracts do and that includes things like feeding chewing swallowing et cetera it
0:57:15	also includes a class of
0:57:17	non vocal displays that most nonhuman primates well most monkeys and apes do
0:57:23	things like this
0:57:25	which is called lip smacking it's a very typical primate thing but it's virtually silent
0:57:31	so they make maybe a little tiny bit of sound and in one species
0:57:36	they actually vocalise when they do it it turns out that the monkey is
0:57:40	doing a lot more with its vocal tract in these visual displays than it does in
0:57:44	its auditory displays
0:57:45	so if we just take the vocal tract configurations where the monkey is making
0:57:50	a sound it's a subset of what the vocal tract can actually do and in
0:57:54	particular these nonvocal communicative so you
0:57:58	could call them visual communication signals have a lot of the
0:58:02	interesting variance of the vocal tract shape in there
0:58:05	and because those are silent we have to figure out what it would sound like if
0:58:09	the monkey was vocalising so that's why we have to that's why we had to
0:58:12	do all this work that's why it took
0:58:14	years to do this
0:58:16	and then just to add to that
0:58:20	well i guess coincidentally almost at the same time as our paper came out that
0:58:25	we change the way and according to which just mentioned here in the front and
0:58:29	came up with the paper where they get exactly what you would use it and
0:58:33	they five and basically that
0:58:36	actually what the user to different monkey species act-utterance but and they can produce a
0:58:42	surprisingly large range of sounds that's especially surprising if you compare it to what lieberman
0:58:50	had claimed that they could produce
0:58:52	but not as large as the range of sounds that our model produced so
0:58:58	they mainly do not use in their actual productions the potential that they
0:59:06	have with their vocal tract
0:59:11	i would like to come from that i understood correctly what you say on this
0:59:16	slide
0:59:17	that there is that more generally
0:59:22	it is generally passive
0:59:24	is the output or at least experiment that
0:59:30	generally this woman from give the in two thousand then
0:59:35	that just air flow is coming out
0:59:39	and then we can say that the vibration rate is generally a c
0:59:45	i think this is too risky
0:59:48	because this is exactly what would happen if you i'm dead and you bust a
0:59:53	are thrown
0:59:54	air flow through my vocal folds
0:59:57	i don't think we mush my much will be different
1:00:02	and in order to do that even though to say that is generally passive i
1:00:07	think you have to go and look
1:00:11	more about neuronal activity
1:00:15	and not just about experiment i respect teachers work but i think this is to
1:00:24	dangers to
1:00:25	to say these
1:00:27	on that slide i think there may be a misunderstanding because we're
1:00:33	not saying that you don't need muscles to put the larynx into phonatory position
1:00:38	of course you do that work in this case i move you tigers larynx in
1:00:43	the phonatory position
1:00:45	what we're saying is that the individual pulses that represent the fundamental frequency so the
1:00:49	openings and closings of the glottis that's what is passively determined by things
1:00:55	like muscle tension and pressure
1:00:58	so we're not saying that muscle activity doesn't play a role what we're saying is that
1:01:03	it doesn't have to happen at the periodicity of the fundamental frequency
1:01:08	and that's an obvious thing if you think about a bat that's producing sounds at forty
1:01:12	thousand at forty thousand hz there's no way neurons can fire that fast neurons
1:01:17	basically can't fire faster than a thousand hz
1:01:20	so even if it did work for something like an elephant and it does work
1:01:24	for something like a cat at thirty hz
1:01:26	it could never work for most of the higher vocalizations
1:01:29	even a cat at two thousand hz and certainly not these animals that are producing in
1:01:34	the high khz range it has to be passive because there's no way neurons can
1:01:38	fire or muscles can twitch
1:01:40	that rapidly
1:01:41	so the claim is not that in humans or any animal you don't
1:01:44	need to use muscles to control the larynx you do
1:01:49	but only that you don't need muscle activity at the frequency of the fundamental frequency
1:01:53	does that make sense
1:01:57	it's better
1:02:04	and i'm just curious
1:02:06	lieberman and you both did work trying to figure out exactly the same
1:02:11	thing and came to radically different conclusions so
1:02:17	was the lieberman what's the improvements is that approach never going to work or what
1:02:22	was the issue that distinguished and that you know that made the difference between what
1:02:27	you did and he did and what can that teach us for other things we want
1:02:31	to do as well do not draw conclusions
1:02:34	i would say from the i mean maybe you can comment on this too but
1:02:38	from the point of view of the technology
1:02:41	what we're doing to understand how you go from a vocal tract to formant frequencies
1:02:47	not much has changed they did a pretty good job given the computers they
1:02:51	had their simulation was pretty good their problem was in the biology their problem was
1:02:55	that they took a single dead animal and they expected that
1:02:59	the dead animal was gonna tell them the range of motions that are possible in a
1:03:04	living animal's vocal tract
1:03:05	so they had no indication of what the dynamics
1:03:08	of the vocal tract were
1:03:10	from looking at the data and that's why we needed these x rays of a
1:03:13	living monkey to be able to find out
1:03:16	okay so but you don't saying that you can never figure out what to do
1:03:22	is going on from a dead animal what so if you
1:03:27	so
1:03:29	so by the way that is klatt which should be a familiar name to people working
1:03:33	on speech synthesis he was a coauthor on one of these papers here and so he
1:03:38	was basically the guy who did the acoustic modeling
1:03:42	work and so at the time there were two competing labs working on speech synthesis
1:03:48	and basically the acoustic model i used for my model is basically contemporaneous with
1:03:55	a ten is quite small so indeed you know classic stuff
1:03:59	so basically they just didn't have the data it's kind of like old eighties neural
1:04:03	nets versus google
1:04:05	they just didn't have the data and we have the data
1:04:13	and
1:04:15	yes
1:04:16	and i think it's a very as
1:04:20	defined benefit different bands right okay not can make it and fifteen t fact there
1:04:25	is no
1:04:28	something like fifteen to fifty as a session one and here is to now
1:04:32	if the semantics of a time to express
1:04:35	i was trained praying all rights a very different
1:04:38	set
1:04:39	just a it is a fiction planes are the and in my state is virtually
1:04:44	pains they're very different to what they're trying to express
1:04:47	there's a certain set of core vocalisations that are very widely shared among species
1:04:53	so for example sounds that mean threat sounds that say i'm being mean and scary
1:04:58	tend to be low and have very low formants
1:05:02	sounds that are appeasing and saying please don't hurt me i'm just a little
1:05:05	guy tend to be high frequency
1:05:07	so we see that class of vocalisations very widely across mammals and birds
1:05:12	then we have this class of kind of mating vocalisations that a lot of species
1:05:17	do but they typically sound very different sometimes it's males just going well like that
1:05:22	and sometimes it's much more interesting and complicated
1:05:24	and then there's typically mother infant communications and so there's usually sounds that
1:05:31	a mother uses this is particularly in mammals that the mother uses to communicate
1:05:36	again very widespread
1:05:38	and then there's really weird stuff like whale songs or echolocation clicks at
1:05:44	all phones that are really only found in particular groups so i'd say there's a
1:05:48	kind of shared core of semantics and then various it's biology so there's all kinds
1:05:54	of weird stuff in the corners but if you say parental care
1:06:00	aggression affiliation
1:06:01	and
1:06:03	there's also alarm calls and food calls are pretty common but a handful of maybe
1:06:08	five semantic axes would probably do it from a standard
1:06:20	well the there are some vocalisations that basically saying i'm here
1:06:25	and there are other vocalisations that try their best to hide that so a
1:06:29	very high frequency quiet thing that tails off it makes it hard to find so
1:06:33	various alarm calls are like that
1:06:36	it like a there is an active basis is it
1:06:39	so for fact that market i block
1:06:42	in fact it is quite a lot of human where it's that's right
1:06:46	but if a vocabulary but can express is so small that maps model about what
1:06:53	making this pen
1:06:55	seven or something that brightness
1:06:57	to various have
1:07:00	i do not put it in a fight
1:07:02	if i if i and response
1:07:04	and then where it's at an unconstrained a few
1:07:08	that's kind of frustration very
1:07:14	well i think that is a fundamental finding of animal communication is that animals understand
1:07:20	a lot more the then they can say
1:07:22	so essentially we have many species for example that understand not only their own species
1:07:27	but they can learn the alarm calls of other species in their environment and of
1:07:31	course animals raised with humans learn to understand human words and none of those species
1:07:36	ever produce those
1:07:38	so just as with a child or any of us our receptive vocabulary the words
1:07:42	we understand is much larger than the number of words we say typically
1:07:46	for most animals i think the receptive vocabulary is large and the productive vocabulary is
1:07:52	very limited
1:07:53	whether they find that frustrating or not
1:07:55	i don't know that's harder to say
1:08:03	so the
1:08:07	humans have more control over all or there are also in the water no value
1:08:13	model to use the excitation signal was much working or
1:08:18	so project was to every other mean and what we present more clearer and more
1:08:23	how to model
1:08:25	this goes back to this image we've done a lot of work now doing excised
1:08:31	larynx work and one of the things we found is that most species can very
1:08:35	easily be driven into a chaotic state
1:08:38	where rather than this nice regular harmonic process that we see here you get essentially
1:08:45	coupled oscillators in the vocal folds generating chaos and you can see the classic steps
1:08:50	from biphonation into triphonation period doubling to chaos in vocal folds in
1:08:55	virtually every species that we've looked at
1:08:58	now and it seems to be very easy for most animals to go into a
1:09:02	chaotic state and that's reflected by the fact that many sounds we hear animals produce
1:09:07	have a chaotic source
1:09:09	so for example monkeys do this all the time they do this
1:09:13	and even dog barks are like that they let themselves use chaos much
1:09:18	more in speech you could talk like this
1:09:22	but unless you're batman
1:09:23	you know
1:09:25	nobody does that we favour this harmonic source for most things if you
1:09:30	listen to a baby crying you'll hear plenty of chaos
1:09:33	so i think what's hard to say is whether humans
1:09:37	we can produce chaos with our vocal folds but do we just choose to use
1:09:41	this nice regular harmonic nice clear pitch signal
1:09:44	because it's
1:09:46	you know better for understanding or it sounds nice or are our vocal folds actually less
1:09:52	inclined to go chaotic
1:09:54	than those of other species
1:09:55	that's a question that i don't think we can answer at present
1:09:58	but we certainly do a lot less chaos in monkeys it's the most common thing you're
1:10:02	gonna hear these threat grunts
1:10:04	are chaotic and so that's what we were trying to model in the synthesis
1:10:08	so i've done a few
1:10:11	models where there's interaction between the vocal tract and the vocal folds and also looking
1:10:16	at chaotic vibrations and one of the other things that you find even if you
1:10:21	get these chaotic vibrations is it's somewhat well it's
1:10:25	quite a bit harder to control vocal fold onset so it tends to be more gradual
1:10:30	which makes it for instance almost impossible to make a distinction between voiced and
1:10:35	voiceless
1:10:36	consonants which is pretty important in speech and so i'm just putting it out
1:10:42	there but it seems that this
1:10:46	more
1:10:47	regular vibration of the human vocal fold is useful for speech whether it's you know
1:10:54	being
1:10:56	used by speech because it is that way or whether it has become that
1:11:01	way because it's useful for speech that's another question
1:11:12	okay
1:11:17	thank you very much

Synthesizing animal vocalizations and modelling animal speech

Keynotes

Tecumseh Fitch (University of Vienna, Austria), Bart de Boer (Vrije Universiteit Brussel, Belgium)