0:00:09 | so |
---|
0:00:11 | tecumseh fitch is from the university of vienna he's a professor |
---|
0:00:16 | at the |
---|
0:00:16 | department of cognitive biology |
---|
0:00:19 | and his main interests are in the evolution of language and the |
---|
0:00:25 | vocal communication in |
---|
0:00:27 | vertebrates |
---|
0:00:29 | and what makes this |
---|
0:00:30 | also very interesting for us is that he |
---|
0:00:34 | also uses synthetic speech |
---|
0:00:36 | to |
---|
0:00:37 | investigate his questions and to |
---|
0:00:40 | test his hypotheses |
---|
0:00:43 | and |
---|
0:00:44 | bart de boer is |
---|
0:00:45 | from the |
---|
0:00:48 | ai lab the artificial intelligence lab of the vrije universiteit |
---|
0:00:52 | brussels |
---|
0:00:53 | and he's |
---|
0:00:55 | interested in the also in the cognitive |
---|
0:00:58 | it uses of language and |
---|
0:01:01 | also uses machine learning and speech technology for |
---|
0:01:06 | investigating how |
---|
0:01:09 | this |
---|
0:01:09 | combinatorial |
---|
0:01:11 | factor can |
---|
0:01:12 | somehow be modeled |
---|
0:01:14 | and |
---|
0:01:15 | they are also very well known for their work and |
---|
0:01:19 | also for the work on the |
---|
0:01:22 | nine q |
---|
0:01:23 | the monkey vocal tract being speech ready which we will hear today |
---|
0:01:29 | this is what i'm because |
---|
0:01:36 | my family |
---|
0:01:37 | is that sounds pretty good fact there are not you |
---|
0:01:40 | i'll try not to put |
---|
0:01:43 | thank you michael effective for the kind introduction so this is the first time bart and |
---|
0:01:47 | i have tried to do a |
---|
0:01:48 | tag team keynote like this so we'll see how well it works but |
---|
0:01:52 | i'll start off and then bart will |
---|
0:01:55 | give you more technical details of the sort that i'm sure you're all hungry for |
---|
0:01:59 | on saturday morning |
---|
0:02:00 | but i'll try and start off the start by giving some |
---|
0:02:05 | just perspective on why a biologist like myself |
---|
0:02:08 | who's interested in animal communication would dive into speech science i actually studied speech science |
---|
0:02:14 | with people like ken stevens at mit when i was a postdoc |
---|
0:02:18 | and used what you guys invented to investigate how animals |
---|
0:02:24 | make their sounds and what those sounds mean |
---|
0:02:29 | and then we're basically gonna talk |
---|
0:02:32 | so in other words that using the technology of speech science |
---|
0:02:36 | to create animal sounds to understand animal communication and then in the second part of |
---|
0:02:41 | the talk we'll turn that around and say how can we use an understanding of |
---|
0:02:45 | the animal vocal tract |
---|
0:02:48 | to understand the evolution of human speech |
---|
0:02:50 | and the answer may surprise some of you |
---|
0:02:55 | okay so why would why would anyone want to synthesize animal vocalisations why would you |
---|
0:03:00 | wanna make a synthetic cat |
---|
0:03:02 | meow or a synthetic bark |
---|
0:03:05 | and as i said |
---|
0:03:07 | my drive my main reason for this is because i'm a biologist |
---|
0:03:11 | i'm interested in understanding the |
---|
0:03:13 | the biology of animal communication from the point of view of physics and physiology and |
---|
0:03:18 | because speech scientists have done so much of that work we can essentially borrow that |
---|
0:03:22 | to understand animal communication |
---|
0:03:25 | and then we'll turn to the second part where we try and understand how our |
---|
0:03:29 | speech act |
---|
0:03:30 | so i'm sure this is very familiar to you but i just wanna very quickly |
---|
0:03:35 | run through the source-filter theory i'm sure virtually all of you are familiar with this |
---|
0:03:39 | theory |
---|
0:03:40 | as it applies to human language what you might be more surprised by is how |
---|
0:03:45 | broadly this theory applies across vertebrates |
---|
0:03:48 | so with the possible exception of fish dolphins and other toothed whales and probably a few |
---|
0:03:56 | others like some rodent high frequency sounds |
---|
0:03:59 | this theory that was developed to understand our own speech apparatus and you know basically |
---|
0:04:03 | from the nineteen thirties onto the nineteen seventies turns out to apply to virtually all |
---|
0:04:09 | other sounds that you might think of dogs barking cows mooing birds singing et cetera |
---|
0:04:15 | the basic idea of course is that we can break the speech production |
---|
0:04:19 | process into two components the source which turns airflow into sound and the |
---|
0:04:25 | filter which then modifies that sound |
---|
0:04:27 | using formant frequencies which are vocal tract resonances that filter out certain frequencies |
---|
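The source-filter idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's own software: the source is a bare impulse train, each formant is a standard two-pole resonator, and the function names, sample rate and formant values are all made up for the example.

```python
import math

def impulse_train(f0, duration, sr):
    """Source: one unit impulse per vibration cycle (a crude stand-in for vocal fold pulses)."""
    n = int(duration * sr)
    out = [0.0] * n
    t = 0.0
    while t < n:
        out[int(t)] = 1.0
        t += sr / f0
    return out

def resonator(signal, freq, bandwidth, sr):
    """Filter: a two-pole resonator standing in for a single formant."""
    r = math.exp(-math.pi * bandwidth / sr)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq / sr)
    c = -r * r
    gain = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0, formants, duration=0.5, sr=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0, duration, sr)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sr)
    return signal

def magnitude_at(signal, freq, sr):
    """Magnitude of the discrete Fourier transform at a single frequency."""
    re = sum(x * math.cos(2.0 * math.pi * freq * i / sr) for i, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * freq * i / sr) for i, x in enumerate(signal))
    return math.hypot(re, im)

# Illustrative values only: a 100 Hz source with formants near 500 Hz and 1500 Hz.
sound = synthesize(100.0, [(500.0, 80.0), (1500.0, 100.0)])
```

Changing `f0` alters only the pitch of the source, while changing the formant list alters only the filter, which is the separation the theory relies on.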
0:04:32 | and this is an image that may look familiar |
---|
0:04:35 | these are vocal folds except these are the vocal folds of a siberian tiger |
---|
0:04:40 | so this is a larynx where the vocal folds are about that long |
---|
0:04:44 | so of course it makes very low frequency vocalisations but you can see that the |
---|
0:04:49 | basic process this aerodynamically excited vibration is pretty much the same as what you |
---|
0:04:55 | would see in human vocal folds |
---|
0:04:58 | and of course the vibration rate of these vocal folds the rate at which they |
---|
0:05:01 | slap together determines the pitch of the sound |
---|
0:05:05 | and you may be wondering how we did this we didn't have a live tiger |
---|
0:05:09 | vocalising with an endoscope i wouldn't want to do that this is a dead |
---|
0:05:14 | tiger so this larynx was removed from an animal that was euthanised put |
---|
0:05:17 | on a table we blew air through it and we videotape that and what that |
---|
0:05:21 | shows is just like in humans |
---|
0:05:24 | we don't need active neural firing at the rate of the fundamental frequency to create |
---|
0:05:30 | the source |
---|
0:05:31 | and that seems to be true in the vast majority of sounds bird songs acts |
---|
0:05:36 | are actually vocalising at fundamentals of eight khz |
---|
0:05:40 | whales are vocalising at fundamentals of ten khz |
---|
0:05:43 | all using the same principle |
---|
0:05:45 | there are a few exceptions and my favourite one that many of you will be |
---|
0:05:48 | familiar with |
---|
0:05:49 | is a cat's purr |
---|
0:05:51 | that's a situation where the there is an actual contraction well each contraction of muscle |
---|
0:05:57 | that generates the purr is driven by the brain so that's one of the few |
---|
0:06:02 | exceptions where it's not this kind of passive vibration |
---|
0:06:05 | but again for the vast majority of sounds that we're talking about including everything we |
---|
0:06:09 | know from nonhuman primates this is the way |
---|
0:06:12 | so then that source sound whether it's noisy or harmonic passes through the vocal tract |
---|
0:06:18 | which |
---|
0:06:19 | i show my students this image of the formants being like windows that allow certain |
---|
0:06:23 | frequencies to pass through |
---|
0:06:25 | but it's certainly much more fun to listen to what a formant is |
---|
0:06:29 | what i've done here is used lpc resynthesis |
---|
0:06:32 | to take the human speech which is of course the source |
---|
0:06:36 | and the filter combined |
---|
0:06:38 | where and or |
---|
0:06:40 | and now i'm gonna take the formants of that speech |
---|
0:06:43 | and apply them to this source this is a bison roaring |
---|
0:06:48 | and this is what we hear as a result |
---|
0:06:50 | i |
---|
0:06:54 | i think everybody can understand the words even though it sounds more |
---|
0:06:58 | terrifying when it's a bison saying it |
---|
0:07:00 | just another random example this is a narwhal |
---|
0:07:05 | and here is the narwhal with my formants |
---|
0:07:11 | okay so i think that illustrates the point the vocal signal we |
---|
0:07:15 | hear is this composite of source and filter |
---|
0:07:18 | and in these cases we can hear the filter doing the phonetic work |
---|
0:07:22 | but the source still comes through loud |
---|
0:07:25 | so taking these basic principles of source-filter theory we started thinking |
---|
0:07:30 | okay what kind of |
---|
0:07:31 | cues other than speech might be there in animal signals and one of the first |
---|
0:07:36 | things that's now been |
---|
0:07:37 | really extensively investigated was based on the idea that vocal tract length correlates with body |
---|
0:07:44 | size and because formant frequencies are determined by vocal tract length maybe formants provide a |
---|
0:07:50 | cue to body size in other species |
---|
0:07:52 | so the first part of this is easy we just get mris or x |
---|
0:07:56 | rays and measure the vocal tract length you can do that on anaesthetised animals |
---|
0:08:00 | and then it's a little harder to get them to vocalise but when we |
---|
0:08:04 | do that and measure the formants we find this is just one of many |
---|
0:08:07 | cases these are monkeys that vocal tract length correlates with formant dispersion which is the |
---|
0:08:12 | average spacing between the formants and because vocal tract length correlates with body size that |
---|
0:08:18 | means the body length correlates very nicely |
---|
0:08:21 | with well sorry this is one body length correlates very nicely with formants |
---|
0:08:26 | and i first did this in monkeys but then we did it in dogs and in pigs it's |
---|
0:08:30 | true in humans it's true in deer this seems like a kind of fundamental |
---|
0:08:34 | aspect of the voice signal that it carries information about body size |
---|
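The link between formant dispersion and vocal tract length can be made concrete with a small Python sketch. It assumes the textbook idealisation of a uniform tube closed at the glottis and open at the lips, where formants fall at odd multiples of c/(4L) and the spacing between them is therefore c/(2L). The speed of sound value and the function names are assumptions for this example, not values from the talk.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed: roughly 350 m/s in warm, humid air

def formant_dispersion(formants_hz):
    """Average spacing between successive formants, in Hz."""
    if len(formants_hz) < 2:
        raise ValueError("need at least two formants")
    ordered = sorted(formants_hz)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)

def estimated_tract_length_cm(formants_hz, c=SPEED_OF_SOUND_CM_S):
    """Tract length implied by the dispersion under the uniform-tube idealisation."""
    return c / (2.0 * formant_dispersion(formants_hz))

# Illustrative formants only: evenly spaced 1000 Hz apart.
print(estimated_tract_length_cm([500.0, 1500.0, 2500.0]))
```

A lower dispersion implies a longer tract, which is why formant spacing can act as a cue to body size.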
0:08:42 | so |
---|
0:08:43 | this is something that we can see as scientists objectively we can measure this |
---|
0:08:48 | but the question is do animals pay attention to that |
---|
0:08:51 | so it's fine if i go and i measure formants and i can say formants |
---|
0:08:54 | correlate with body size but that's kind of meaningless for animal communication unless the animals |
---|
0:08:59 | themselves perceive that signal |
---|
0:09:02 | so |
---|
0:09:03 | this is where animal sound synthesis comes in how do we ask that question how |
---|
0:09:07 | do we find out whether an animal is paying attention to formants |
---|
0:09:10 | and the answer this is a long time ago some of you |
---|
0:09:13 | may recognise this old version of matlab running on an old macintosh is that i generated |
---|
0:09:19 | this animal sound synthesizer using very standard technology that most of you will be |
---|
0:09:24 | familiar with basically |
---|
0:09:26 | linear prediction predict the formants subtract those away and we have an error signal |
---|
0:09:30 | which we can use as a source and then we can change the formants shift |
---|
0:09:34 | only the formants leaving everything else the same and ask if the animals perceive that |
---|
0:09:39 | shift in formants |
---|
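The analysis and resynthesis loop just described can be sketched in plain Python. This is not the original MATLAB synthesizer: it is a minimal autocorrelation-method linear predictor with hypothetical function names, and the formant shift is shown only for the simplest case of a second-order predictor with a single resonance. Shifting formants in a higher-order model would need the poles of the predictor polynomial, which is left out here.

```python
import math

def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that x[n] is approximated by sum a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def lpc(x, order):
    return levinson_durbin(autocorrelation(x, order), order)

def residual(x, a):
    """Inverse filter: subtract the prediction, leaving the error signal used as the source."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

def resynthesize(source, a):
    """All-pole filter: drive the predictor with a source signal."""
    y = []
    for n, e in enumerate(source):
        y.append(e + sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0))
    return y

def shift_single_formant(a, factor):
    """Scale the resonance frequency of a second-order predictor, keeping its bandwidth."""
    if len(a) != 2 or a[1] >= 0.0:
        raise ValueError("expected a second-order predictor with complex poles")
    r = math.sqrt(-a[1])
    cos_theta = a[0] / (2.0 * r)
    if abs(cos_theta) > 1.0:
        raise ValueError("poles are real, so there is no formant to shift")
    return [2.0 * r * math.cos(math.acos(cos_theta) * factor), a[1]]
```

Resynthesising the residual with the unchanged coefficients gives back the original signal, which corresponds to the unshifted synthetic replica used as a control. Resynthesising with shifted coefficients changes the formant while the source stays the same.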
0:09:42 | now the way we do these experiments how do you ask an animal whether it |
---|
0:09:45 | perceives that we usually use something called habituation dishabituation |
---|
0:09:49 | where we play a bunch of sounds that |
---|
0:09:52 | the in this case the formants remain the same but other aspects vary the fundamental |
---|
0:09:57 | frequency the length et cetera vary but the formants are fixed |
---|
0:10:00 | and now once |
---|
0:10:02 | our listening animal |
---|
0:10:03 | stops paying attention |
---|
0:10:05 | so it may take |
---|
0:10:06 | ten plays or a hundred plays before the animal finally stops looking at the |
---|
0:10:11 | sound but once it's habituated to the original sounds then we play the sounds where |
---|
0:10:17 | we change the formants or change whatever variable is of interest |
---|
0:10:20 | and we |
---|
0:10:21 | if the animal pays attention to that |
---|
0:10:23 | if they perceive it |
---|
0:10:24 | and find it |
---|
0:10:25 | salient enough to be noticeable then they should look again |
---|
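The logic of a habituation and dishabituation trial sequence can be written down as a short Python sketch. The function names are hypothetical. The default criterion of three consecutive trials without a look follows the whooping crane description later in the talk, and other studies may use a different one.

```python
def trials_to_habituation(looked, criterion=3):
    """Index of the trial on which the subject has ignored `criterion` plays in a row.

    `looked` is a sequence of booleans, one per habituation playback.
    Returns None if the criterion is never reached.
    """
    streak = 0
    for i, did_look in enumerate(looked):
        streak = 0 if did_look else streak + 1
        if streak == criterion:
            return i
    return None

def dishabituated(replica_looked, test_looked):
    """Interpretable outcome: the unchanged replica is ignored but the changed stimulus is not."""
    return (not replica_looked) and test_looked

# Made-up response record: the criterion is met on the seventh play (index 6).
print(trials_to_habituation([True, True, False, True, False, False, False]))
```

A look at the unchanged replica would suggest the synthesizer itself introduces an audible artefact, so that case is treated as uninterpretable rather than as evidence of formant perception.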
0:10:29 | okay |
---|
0:10:30 | so the first species i actually tried this with is whooping cranes and i'll |
---|
0:10:34 | explain why in a second |
---|
0:10:36 | so what i'm gonna do now is sort of walk you through this experiment |
---|
0:10:39 | these are whooping crane contact calls |
---|
0:10:41 | and what we did is play a bunch of the actual calls from one particular |
---|
0:10:45 | bird |
---|
0:10:46 | and they sound like this |
---|
0:10:50 | or |
---|
0:10:51 | here's another one sounds pretty similar to our ears |
---|
0:10:56 | and we keep playing those in cell are so these are recorded we're playing these |
---|
0:11:00 | from a laptop and now we see if the listening bird looks up so we |
---|
0:11:05 | wait till the bird goes down it's feeding we play one of these sounds and |
---|
0:11:09 | it looks up |
---|
0:11:10 | because it sounds like there's another whooping crane |
---|
0:11:12 | so the logic is pretty simple |
---|
0:11:14 | in the case of whooping cranes we had to do this in the winter |
---|
0:11:17 | it takes these birds hundreds of trials before they stop listening before they stop paying |
---|
0:11:21 | attention the laptop dies and it starts snowing et cetera et cetera |
---|
0:11:25 | but eventually we were able to do this |
---|
0:11:27 | where you get the bird habituated by playing these kinds of sounds |
---|
0:11:31 | over and over |
---|
0:11:36 | anyway and then |
---|
0:11:37 | just to be safe |
---|
0:11:39 | we play a synthetic replica that we've run through the synthesizer but without changing the |
---|
0:11:43 | formants and if everything's fine they shouldn't dishabituate to that here's what |
---|
0:11:48 | that sounds like |
---|
0:11:53 | pretty similar |
---|
0:11:54 | and now here's the key moment |
---|
0:11:56 | we play either the formants lowered |
---|
0:11:59 | or the formants higher |
---|
0:12:01 | or |
---|
0:12:03 | and of course you all can hear that because you're humans and we already |
---|
0:12:06 | knew you perceive formants so the question is what do the birds do |
---|
0:12:09 | and when we do this what we find is that initially |
---|
0:12:13 | the birds respond eighty percent of the time on average but as we go as |
---|
0:12:17 | we get to twenty five or thirty trials finally the last habituation |
---|
0:12:22 | trial |
---|
0:12:22 | by definition is the one where they don't look at all we actually get three |
---|
0:12:25 | of those in a row now we play that synthetic replica they don't look |
---|
0:12:30 | so that means our synthesizer is working and then finally we play these test stimuli |
---|
0:12:34 | and |
---|
0:12:35 | we get a massive dishabituation |
---|
0:12:38 | so we've done this with many different |
---|
0:12:40 | species and always found the same thing it seems like paying attention to formant frequency |
---|
0:12:44 | shifts |
---|
0:12:45 | in this kind of context is a basic mammalian thing |
---|
0:12:49 | birds do it monkeys do it dogs do it pigs do it and of course |
---|
0:12:54 | people |
---|
0:12:55 | so now you might ask can we go further with that and for example these |
---|
0:12:59 | are two colleagues who have used animal sound synthesis |
---|
0:13:03 | to basically look at what other species are using these formant frequencies for |
---|
0:13:10 | in this case we can show that the deer or the koalas |
---|
0:13:14 | are using these sounds as indicators of body size and the kind of evidence we |
---|
0:13:18 | have is for example males played back another male with lower formant frequencies |
---|
0:13:25 | that is with an elongated vocal tract run away and are afraid females find them more attractive |
---|
0:13:30 | et cetera et cetera this has again been done with many species |
---|
0:13:34 | many of you have probably heard deer but you might not have heard |
---|
0:13:38 | the koala this is a koala they have a very impressive vocalisation |
---|
0:13:48 | if you're wondering how a little teddy bear sized animal |
---|
0:13:52 | makes that terrifying sound |
---|
0:13:54 | it's because they actually have a trick which is that they've |
---|
0:13:57 | pulled the larynx down to make their vocal tract much longer than it would be |
---|
0:14:02 | in a normal animal so by elongating their vocal tract they make themselves |
---|
0:14:06 | sound bigger |
---|
0:14:08 | these are just a few of the many publications that use this approach that i've |
---|
0:14:13 | just been telling you about to dig deeper into animal communication so i hope that |
---|
0:14:20 | makes the case that this is a worthwhile thing to do again in a |
---|
0:14:23 | wide variety of species |
---|
0:14:26 | okay so now maybe getting to something that's closer to what a lot of you do |
---|
0:14:29 | i wanna turn to the to the this is supposed to be part two sorry |
---|
0:14:33 | we just |
---|
0:14:34 | put this together yesterday |
---|
0:14:37 | why would you |
---|
0:14:38 | what i mean how can you turn this around to start asking questions about |
---|
0:14:43 | human communication based on what we understand about animals |
---|
0:14:46 | and the first fact the kind of core fact that many people in the world |
---|
0:14:50 | of speech science have been trying to understand for a long time is the fact that |
---|
0:14:54 | we humans are amazing at imitating sounds we not only imitate the speech sounds of |
---|
0:14:59 | our environment |
---|
0:15:00 | but we learn to sing songs we can even imitate animal sounds or |
---|
0:15:04 | basically kids will imitate whatever sounds they hear |
---|
0:15:07 | and it turns out that our nearest living relatives the great apes can't do this |
---|
0:15:11 | at all |
---|
0:15:13 | so this is just one example all these are examples of apes that have been raised |
---|
0:15:17 | in human homes |
---|
0:15:19 | and of course a human child by the age of about one is already making |
---|
0:15:22 | the sounds it is already starting to say its first words and making |
---|
0:15:26 | the sounds of its environment that it hears in its native language phonology or |
---|
0:15:30 | phonologies and no ape has ever done that no ape has even spontaneously said |
---|
0:15:35 | mama much less learned complex vocalisations |
---|
0:15:39 | and the question that has i mean people have known this for a long time |
---|
0:15:42 | the question that has been driving this field for at least a hundred years |
---|
0:15:47 | since darwin's time is why is |
---|
0:15:49 | why is it that |
---|
0:15:51 | an animal |
---|
0:15:52 | that's seemingly so similar to us that can |
---|
0:15:55 | learn to do things like i h |
---|
0:15:57 | and drive a car |
---|
0:16:00 | can't even produce the most basic |
---|
0:16:02 | speech sounds |
---|
0:16:03 | with its vocal tract |
---|
0:16:06 | so that's the sort of driving force behind the second part of |
---|
0:16:09 | the talk |
---|
0:16:10 | and there's two theories darwin had already mentioned this one is that it has something to |
---|
0:16:15 | do with the peripheral vocal apparatus |
---|
0:16:17 | and the other is that it has more to do with the brain and darwin |
---|
0:16:20 | said well they probably both matter but the brain is probably more important what we're |
---|
0:16:24 | gonna try and convince you now is that it is actually the brain that's key |
---|
0:16:29 | and vocal tract differences although they exist are not what are keeping a monkey or |
---|
0:16:34 | an ape from producing speech |
---|
0:16:37 | now the most famous example of |
---|
0:16:40 | a difference between us and apes is illustrated by these mris on |
---|
0:16:45 | the on the left side we see here a chimpanzee and the red line marks |
---|
0:16:50 | the vocal folds so that's the larynx |
---|
0:16:52 | and of course in humans the larynx is descended in the vocal tract it pulls |
---|
0:16:57 | down in the throat |
---|
0:16:58 | whereas in the chimpanzee the larynx is in a high position engaged in the nasal |
---|
0:17:03 | passage most of the time |
---|
0:17:04 | and that means the tongue |
---|
0:17:06 | rests flat in the mouth the tongue is basically sitting |
---|
0:17:10 | like this |
---|
0:17:11 | what happens in humans |
---|
0:17:13 | is that we essentially swallow the back of our tongue our larynx descends |
---|
0:17:18 | pulling the tongue with it so that we have this two part tongue that we |
---|
0:17:21 | can move up and down and back and forth and that's how we get this |
---|
0:17:25 | wide variety of speech |
---|
0:17:27 | so the idea and this goes back to darwin's time but it really became concrete |
---|
0:17:32 | in the nineteen sixties is that |
---|
0:17:34 | with a tongue like that |
---|
0:17:35 | you simply can't make the sounds of speech and therefore no matter what brain was |
---|
0:17:40 | in control that vocal tract couldn't make the sounds that you would need to imitate |
---|
0:17:44 | speech |
---|
0:17:46 | and it's a plausible hypothesis |
---|
0:17:48 | it goes back to actually my own mentor phil lieberman who was my phd |
---|
0:17:52 | thesis supervisor published a series of papers in the late sixties and early seventies |
---|
0:17:57 | and what he did was take a dead monkey and made a cast of the |
---|
0:18:01 | vocal tract of this monkey |
---|
0:18:03 | they used that to produce a computer program to simulate the sounds that |
---|
0:18:07 | vocal tract can make there was a lot of guesswork involved because it was one |
---|
0:18:11 | dead monkey and one cast |
---|
0:18:13 | but they did the best they could |
---|
0:18:14 | and what they found this is a formant one formant |
---|
0:18:18 | two space |
---|
0:18:19 | what they found is here's the famous three vowels the point vowels of english |
---|
0:18:23 | e |
---|
0:18:25 | and are that are found in most languages and all those things in there all |
---|
0:18:28 | the numbers are what the monkey vocal tract or what the computer model of the |
---|
0:18:32 | monkey vocal tract could do |
---|
0:18:34 | so they concluded that the acoustic vowel space of a rhesus monkey is quite |
---|
0:18:38 | restricted they lack the output mechanism |
---|
0:18:42 | for speech production |
---|
0:18:44 | and this is one of those ideas like i said it's well-founded in acoustics |
---|
0:18:47 | if you look at what we actually do when we produce speech these are just a |
---|
0:18:51 | couple videos that will be familiar |
---|
0:18:54 | a rainbow is a division of white light into many beautiful colours |
---|
0:18:57 | you see that tongue dancing around in that two dimensional space |
---|
0:19:01 | here it is slowed down a bit |
---|
0:19:10 | so we use that additional space caused by swallowing the back of our |
---|
0:19:15 | tongue we clearly are using that to its full extent when we produce speech |
---|
0:19:21 | so i think this lieberman hypothesis is quite plausible |
---|
0:19:26 | i became suspicious of this when we first started to do x rays of |
---|
0:19:29 | animals as they vocalise instead of looking at dead animals like this is the classic |
---|
0:19:34 | way of analysing the animal vocal tract take a dead goat cut it in half and |
---|
0:19:38 | draw conclusions from that but trying to get a goat vocalising in the x ray |
---|
0:19:44 | is harder than it may seem |
---|
0:19:46 | i have had many animals sitting in a situation like this without vocalising at all |
---|
0:19:51 | but this little goat was one of our first subjects and when we played it its |
---|
0:19:54 | mother's bleats it would respond |
---|
0:19:56 | and this is what we saw in the x ray |
---|
0:20:06 | also use again i want you to look in this region right there |
---|
0:20:09 | where the anatomists claimed |
---|
0:20:13 | that the epiglottis prevents mouth breathing so in other words the idea based on the |
---|
0:20:17 | static anatomy is that a goat can't breathe through its mouth |
---|
0:20:21 | and so here's what we actually see |
---|
0:20:25 | this i |
---|
0:20:26 | pulling down a |
---|
0:20:30 | such that every one of those vocalisations passes out through the mouth of the goat |
---|
0:20:34 | now this shouldn't be that surprising if you think about it if you wanna make a loud |
---|
0:20:38 | sound you should radiate through your mouth and not through your nose but |
---|
0:20:42 | again this is what anatomists claimed was impossible up until we started doing |
---|
0:20:46 | this work we've seen it in other animals so this is a dog you're gonna see |
---|
0:20:49 | a very extensive pulling down of the larynx descent of the larynx when the |
---|
0:20:53 | dog barks this is slow motion |
---|
0:21:01 | however |
---|
0:21:05 | that's the larynx |
---|
0:21:08 | right |
---|
0:21:09 | what you can see here is that every time the dog barks |
---|
0:21:12 | the larynx pulls down pulling the back of the tongue with it and basically going |
---|
0:21:17 | into a human like vocal configuration but only while the animal is vocalising only |
---|
0:21:22 | while it's vocalising |
---|
0:21:24 | the unusual thing about us is that our larynx stays low we keep our larynx low |
---|
0:21:28 | all the time not only while we're vocalising |
---|
0:21:31 | so when we first got these data more than it's almost twenty years ago i |
---|
0:21:36 | became convinced that the descent of the larynx can't be the crucial factor |
---|
0:21:41 | keeping animals from vocalising |
---|
0:21:43 | but unfortunately in the text books it continued to be said the reason monkeys can't vocalise or apes |
---|
0:21:48 | can't vocalise |
---|
0:21:49 | is based on peripheral anatomy they just don't have the vocal tract |
---|
0:21:53 | and it was when i saw the simpsons episode where |
---|
0:21:56 | where |
---|
0:21:57 | it system |
---|
0:21:58 | the simpsons the main guy |
---|
0:22:01 | part no the old guy |
---|
0:22:03 | homer homework like you |
---|
0:22:04 | where he gets this monkey |
---|
0:22:06 | and the monkey can't talk so homer's learning sign language and they kept saying it's because |
---|
0:22:09 | he doesn't have the vocal tract |
---|
0:22:11 | so that's when we decided okay this dog and goat stuff isn't enough we have |
---|
0:22:15 | to do it with nonhuman primates and working together with asif ghazanfar whose monkeys |
---|
0:22:21 | they were and bart who's gonna take over from here we took x rays like |
---|
0:22:25 | this one |
---|
0:22:27 | the monkey vocalising |
---|
0:22:29 | and you'll see there's a little movement of the larynx just the same as we |
---|
0:22:32 | saw in the goat and in the dog and then we traced those to create a |
---|
0:22:36 | vocal tract model and this is where bart's gonna |
---|
0:22:42 | i |
---|
0:22:49 | do you wanna take this |
---|
0:22:55 | that looks good |
---|
0:22:56 | a reality |
---|
0:22:58 | okay |
---|
0:23:00 | so |
---|
0:23:01 | yes how we actually |
---|
0:23:06 | and model to |
---|
0:23:10 | to create |
---|
0:23:11 | vocalisations of the monkey now |
---|
0:23:14 | if you think about it it's very different problem from or a problem that requires |
---|
0:23:21 | a very different solution from what we use for human speech because what we're trying |
---|
0:23:26 | to do is to figure out what the monkey |
---|
0:23:30 | could do in principle with its vocal tract and it's not based on what it's |
---|
0:23:34 | actually doing the whole point is that monkeys don't talk so |
---|
0:23:40 | so what we don't have is a corpus of data on which we could use |
---|
0:23:46 | some kind of machine learning problem |
---|
0:23:49 | so what we need to do is |
---|
0:23:52 | a really predictive approach |
---|
0:23:54 | based on |
---|
0:23:56 | what is in a sense a very old fashioned way of going about speech synthesis |
---|
0:24:01 | which is articulatory synthesis i'll just recap very briefly |
---|
0:24:07 | how it works for you but i assume you're all intimately familiar with it and |
---|
0:24:13 | what i would like to stress however is that even though we happen to be |
---|
0:24:18 | talking about biology and about speech science |
---|
0:24:22 | these methods were developed by people who were actually engineers they were also people interested |
---|
0:24:28 | in trying to be able to put as many phone conversations on transatlantic cables |
---|
0:24:35 | as possible |
---|
0:24:37 | and so this is very much |
---|
0:24:40 | the theory that has been developed by engineers by people who were working with |
---|
0:24:46 | the same goals |
---|
0:24:48 | as you guys |
---|
0:24:49 | so how does articulatory synthesis work well you start with an articulatory model you start |
---|
0:24:55 | with an idea of how the vocal tract works |
---|
0:25:00 | and from |
---|
0:25:03 | with that model you can create different positions of the tongue and lips et cetera |
---|
0:25:08 | and from that you need to calculate what is called an area function so an |
---|
0:25:14 | area function is basically the cross sectional area of the vocal tract at each position |
---|
0:25:20 | in the vocal tract |
---|
0:25:22 | and it turns out that the precise details of that area function |
---|
0:25:28 | well the area is the thing that counts the precise shape in the sense that |
---|
0:25:34 | for instance there is a |
---|
0:25:37 | right angle here in the vocal tract that's because of the wavelengths involved you |
---|
0:25:43 | can ignore that so you can basically model it as a straight tube with a circular |
---|
0:25:51 | cross sectional shape but the area is the important thing now of course if you |
---|
0:25:56 | want to |
---|
0:25:58 | model that in a computer model you have to discretise that so what |
---|
0:26:02 | you end up with is |
---|
0:26:04 | what is called an n-tube model so a number of tubes along the length |
---|
0:26:09 | of the vocal tract from the |
---|
0:26:12 | larynx basically to the lips |
---|
0:26:14 | and then on the basis of that you can calculate the acoustic response either in |
---|
0:26:20 | the time domain or the frequency domain so that's what we're going to do so how did |
---|
0:26:24 | we do that for the monkey model |
---|
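A frequency-domain calculation for a discretised tube model can be sketched as follows. This is a simplified illustration, not the model used in the study: it assumes lossless plane-wave propagation, a closed glottis and an ideal open end at the lips, with assumed values for the speed of sound and air density. Each tube section contributes a 2x2 chain matrix, and resonances are the frequencies where the volume-velocity element of the product crosses zero.

```python
import math

SPEED_OF_SOUND = 35000.0  # cm/s, assumed
AIR_DENSITY = 0.00114     # g/cm^3, assumed

def chain_d(areas_cm2, section_len_cm, freq_hz):
    """D element of the lossless chain matrix, glottis to lips.

    Each section has the matrix [[cos kl, j z sin kl], [j sin(kl)/z, cos kl]] with
    z = rho*c/area. The product keeps the form [[a, j b], [j c, d]] with real
    a, b, c, d, so only real arithmetic is needed.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    cs, sn = math.cos(k * section_len_cm), math.sin(k * section_len_cm)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas_cm2:
        z = AIR_DENSITY * SPEED_OF_SOUND / area
        a, b, c, d = (a * cs - b * sn / z, a * z * sn + b * cs,
                      c * cs + d * sn / z, d * cs - c * z * sn)
    return d

def resonances(areas_cm2, section_len_cm, f_max=5000.0, step=5.0):
    """Scan for sign changes of D and refine each one by bisection."""
    found = []
    d_lo = chain_d(areas_cm2, section_len_cm, 0.0)
    for i in range(1, int(f_max / step) + 1):
        f_hi = i * step
        d_hi = chain_d(areas_cm2, section_len_cm, f_hi)
        if d_hi == 0.0:
            found.append(f_hi)
        elif d_lo * d_hi < 0.0:
            lo, hi = f_hi - step, f_hi
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if chain_d(areas_cm2, section_len_cm, lo) * chain_d(areas_cm2, section_len_cm, mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        d_lo = d_hi
    return found

# Illustrative area functions only (16 sections of 1 cm), not measured data.
print(resonances([3.0] * 16, 1.0)[:3])             # uniform tube
print(resonances([1.0] * 8 + [8.0] * 8, 1.0)[:3])  # narrow back, wide front
```

For the uniform tube the result matches the quarter-wavelength formula, and narrowing the back half while widening the front half raises the first resonance. A realistic model would add wall losses and lip radiation, which shift and damp the resonances.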
0:26:26 | this is the x-ray image that tecumseh just showed |
---|
0:26:32 | with the outline |
---|
0:26:34 | and in red here you can see the outline of the vocal tract |
---|
0:26:39 | so this is what we have this is what we start with we have we |
---|
0:26:43 | had about a hundred of these |
---|
0:26:46 | and i guess they were made by hand the tracings were made by hand and |
---|
0:26:50 | so what we first need to do is to figure out |
---|
0:26:54 | how the sound waves propagate through this tract |
---|
0:26:58 | and for that the technique that we use is called a medial axis transform so |
---|
0:27:04 | it's basically you're trying to squeeze |
---|
0:27:08 | a circle |
---|
0:27:09 | through that tract and that circle basically represents the propagating acoustic wavefront and the |
---|
0:27:18 | line in the middle is kind of the center of the wavefront and the radius |
---|
0:27:23 | of the circle |
---|
0:27:24 | or the diameter of the circle is the diameter of the vocal tract |
---|
0:27:32 | so this is what you end up with |
---|
0:27:38 | and so |
---|
0:27:40 | you can then calculate for each position |
---|
0:27:43 | in the vocal tract |
---|
0:27:45 | from the glottis to the lips |
---|
0:27:48 | the diameter |
---|
0:27:52 | okay so you have it |
---|
0:27:54 | a function |
---|
0:27:57 | the diameter of the vocal tract |
---|
0:27:59 | at each point in the vocal tract |
---|
0:28:01 | however the problem is that this is just |
---|
0:28:05 | part of what we need we need to have the area the |
---|
0:28:08 | diameter isn't enough so the problem is |
---|
0:28:14 | we need to calculate the area on the basis of the observed diameter |
---|
0:28:21 | now fortunately it turns out that to good approximation for those monkey vocal tracts the |
---|
0:28:28 | function converting diameter to area |
---|
0:28:32 | is more or less the same everywhere in the vocal tract so how did we |
---|
0:28:36 | figure that out |
---|
0:28:39 | apart from the x-ray movies we also had a few mri scans of an |
---|
0:28:45 | anaesthetised monkey |
---|
0:28:48 | and if you look at that |
---|
0:28:51 | so this is the side view so this is where basically the monkey's |
---|
0:28:55 | lips are |
---|
0:28:58 | this is it's vocal tract |
---|
0:29:00 | here's the larynx |
---|
0:29:01 | and so you can make a few cross |
---|
0:29:04 | section cuts there and you can see that the shape of the vocal tract |
---|
0:29:12 | along these different |
---|
0:29:14 | cross sections |
---|
0:29:16 | follows this it's not quite a parabola but |
---|
0:29:20 | this particular shape is kind of the same everywhere |
---|
0:29:24 | and so what you want to know is |
---|
0:29:28 | for a given opening of the vocal tract how large is that area so suppose |
---|
0:29:34 | that the |
---|
0:29:35 | the diameter would be |
---|
0:29:37 | about |
---|
0:29:39 | about this |
---|
0:29:42 | so the area would be this now if you open up further then obviously |
---|
0:29:47 | the area gets bigger and it turns out that you know it's just a matter |
---|
0:29:53 | of integration and it turns out that what you find is that the area is proportional to |
---|
0:29:59 | some constant |
---|
0:30:00 | times the diameter to the power of |
---|
0:30:04 | one point four there's no deep theoretical reason for that value of one point four |
---|
0:30:09 | it's something that we learned from observing |
---|
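
The diameter-to-area conversion described here is a power law, area = k * diameter^1.4. The sketch below shows how such a function could be applied, and how the constant and exponent might be estimated from paired measurements with a log-log least-squares fit. The fitting method, the function names and the default constant `k = 1.0` are illustrative assumptions; the talk only states that the exponent of about 1.4 was found empirically.

```python
import math

def area_from_diameter(d, k=1.0, p=1.4):
    """Power-law conversion: area = k * d**p (k is a placeholder value)."""
    return k * d ** p

def fit_power_law(diameters, areas):
    """Estimate (k, p) by ordinary least squares on log(area) vs log(diameter).

    This is one plausible way to obtain the exponent from measured
    cross sections; it is not claimed to be the authors' procedure.
    """
    xs = [math.log(d) for d in diameters]
    ys = [math.log(a) for a in areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                          # slope in log-log space
    k = math.exp(mean_y - p * mean_x)      # intercept back-transformed
    return k, p
```

Applying `area_from_diameter` to every point of the diameter function gives the area function that the next step needs.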
0:30:13 | so now by applying that function to the diameters that we observe we actually find |
---|
0:30:20 | a |
---|
0:30:23 | the area function so this is |
---|
0:30:26 | the position |
---|
0:30:27 | and the area that at each point |
---|
0:30:30 | in the vocal tract now |
---|
0:30:34 | the next step is turning that into sounds |
---|
0:30:39 | and for that we use again a very old fashioned classical approach an acoustic |
---|
0:30:46 | model an electric line analog of the vocal tract again you can kind of see |
---|
0:30:51 | that historically a lot of this theory was |
---|
0:30:57 | developed by electrical engineers "'cause" it's an electronic circuit so for each of those |
---|
0:31:05 | discrete tubes |
---|
0:31:07 | the electric line analog model basically models the physical wave equation with |
---|
0:31:13 | a little electrical circuit |
---|
0:31:16 | and from that |
---|
0:31:18 | we can then calculate the |
---|
0:31:21 | formant frequencies |
---|
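
As a rough illustration of how formants follow from an area function, the sketch below chains lossless tube sections using their transmission (ABCD) matrices and looks for the frequencies where the glottis-to-lips volume velocity transfer blows up. This is a simplified stand-in, not the electric line analog used in the talk: it ignores wall losses and lip radiation, assumes an ideally open lip end, and the constants `C` and `RHO` and all function names are assumptions made for the example.

```python
import math

C = 350.0    # assumed speed of sound in warm, humid air (m/s)
RHO = 1.14   # assumed air density (kg/m^3)

def chain_d(areas, section_len, f):
    """D element of the chained matrix for equal-length lossless sections.

    Each section has the form [[cos, j*z*sin], [j*sin/z, cos]] with
    z = RHO*C/area, so the product keeps real diagonal and imaginary
    off-diagonal terms; only the real coefficients are tracked here.
    """
    k = 2.0 * math.pi * f / C
    cs, sn = math.cos(k * section_len), math.sin(k * section_len)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0          # identity matrix
    for area in areas:                        # glottis first, lips last
        z = RHO * C / area
        a2, b2, c2, d2 = cs, z * sn, sn / z, cs
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d

def formants(areas, section_len, n=3, fmax=6000.0, step=10.0):
    """First n resonances: zeros of D(f), found by scanning and bisection."""
    found = []
    f_prev, d_prev = step, chain_d(areas, section_len, step)
    f = 2.0 * step
    while f <= fmax and len(found) < n:
        d = chain_d(areas, section_len, f)
        if d_prev != 0.0 and d_prev * d <= 0.0:
            lo, hi = f_prev, f
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                if chain_d(areas, section_len, lo) * chain_d(areas, section_len, mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        f_prev, d_prev = f, d
        f += step
    return found
```

For a uniform tube this reduces to the familiar quarter-wavelength resonances, which makes a convenient sanity check before feeding in a measured area function.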
0:31:26 | so for each of those hundred points |
---|
0:31:29 | we |
---|
0:31:32 | we can calculate the first and the second and third formant and these are the |
---|
0:31:37 | values we actually calculated for all those |
---|
0:31:42 | all those points |
---|
0:31:46 | and |
---|
0:31:49 | then from this point we've kind of |
---|
0:31:55 | determined what the acoustic abilities of the monkey vocal tract are now |
---|
0:31:59 | from there there's different things that you could do |
---|
0:32:03 | in principle |
---|
0:32:06 | on the basis of this kind of data you can actually make a computer articulatory |
---|
0:32:10 | model |
---|
0:32:11 | and so this is something that shinji maeda has done in nineteen eighty |
---|
0:32:16 | nine again you know quite some time ago on the basis of very similar |
---|
0:32:21 | data about the human vocal tract |
---|
0:32:26 | but |
---|
0:32:28 | it's not certain that we have enough data to actually do the same thing so |
---|
0:32:33 | shinji maeda what he did was he made a thousand |
---|
0:32:39 | tracings of the vocal tract and if you know how |
---|
0:32:42 | difficult it is to make a single tracing |
---|
0:32:45 | you can imagine how much time he must've spent on making this model |
---|
0:32:51 | and what he then did is basically |
---|
0:32:55 | look at these articulations do a factor analysis and basically derive an articulatory |
---|
0:33:02 | model |
---|
0:33:03 | an articulatory synthesizer so you could basically then use that model to synthesize new sounds |
---|
0:33:10 | now the problem is we don't have that many tracings so we probably |
---|
0:33:15 | couldn't make a good quality model |
---|
0:33:21 | what we wanted to do and what tecumseh is going to explain in a |
---|
0:33:24 | moment is re-synthesize some of these sounds and that's still very |
---|
0:33:30 | challenging with an articulatory synthesizer and it wasn't really necessary for what we wanted to |
---|
0:33:37 | do so we took slightly different approach |
---|
0:33:40 | now |
---|
0:33:43 | one of the things we wanted to do was just quantify the |
---|
0:33:48 | articulatory abilities of monkeys and compare them to humans |
---|
0:33:53 | and in order to do that |
---|
0:33:55 | we could measure the |
---|
0:33:58 | acoustic range of the monkey vocalisations and one way to do that is by calculating |
---|
0:34:04 | the convex hull now again i assume you're all familiar with what a convex hull |
---|
0:34:09 | is i'll just very quickly show you how we did it basically if you wanna |
---|
0:34:14 | calculate the convex hull |
---|
0:34:16 | you start with one of the extreme points |
---|
0:34:21 | and then you |
---|
0:34:23 | basically |
---|
0:34:26 | fit a line |
---|
0:34:27 | around those points like if you would take a rubber band and |
---|
0:34:32 | just |
---|
0:34:33 | squeeze it around the points and then you can do several things you can calculate |
---|
0:34:37 | the area of the convex hull or you can calculate the extent of these |
---|
0:34:42 | things in the f one or the first formant or the second formant and the |
---|
0:34:47 | thing that we did was we based ourselves on the extent |
---|
0:34:52 | well in the area and the extent |
---|
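
The rubber-band picture can be made concrete with a few lines of code. The sketch below computes the convex hull of a set of (F1, F2) points, its area, and the extent along each formant axis. It uses Andrew's monotone chain algorithm rather than the wrap-around (gift wrapping) procedure described in the talk; both yield the same hull. All function names are illustrative.

```python
def convex_hull(points):
    """Convex hull of 2-D points (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(hull):
    """Polygon area by the shoelace formula."""
    total = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def extent(points, axis):
    """Range covered along one axis (0 for F1, 1 for F2)."""
    values = [p[axis] for p in points]
    return max(values) - min(values)
```

With formant pairs in hertz, `hull_area` is in Hz squared and `extent(points, 1)` gives the F2 range that the talk later compares between monkey and human.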
0:34:55 | and one of the things we did is that we |
---|
0:35:00 | wanted to know how the monkey would sound if |
---|
0:35:03 | it would be speaking |
---|
0:35:06 | and in order to do that we |
---|
0:35:08 | modified some human sounds in a way very similar to what tecumseh just showed |
---|
0:35:16 | remote recordings |
---|
0:35:18 | and so this is it |
---|
0:35:24 | a sentence spoken by a human we split this into the |
---|
0:35:30 | formant tracks which is basically which represents the |
---|
0:35:35 | the filter and the source |
---|
0:35:38 | and then we modified those formants |
---|
0:35:42 | in a |
---|
0:35:44 | in a way to make it more similar to a monkey vocal tract so what |
---|
0:35:47 | you've seen so far in the examples that tecumseh played to you is |
---|
0:35:53 | where the formants were just shifted up or shifted down we did a little more |
---|
0:35:58 | so we modified them |
---|
0:36:01 | didn't just so the |
---|
0:36:05 | we need to shift the formants up a little bit because the monkey vocal tract |
---|
0:36:10 | is shorter than the human vocal tract so that the formants tend to be higher |
---|
0:36:15 | but in addition what we found is that the range of the second formant is |
---|
0:36:21 | somewhat reduced in the monkey vocal tract |
---|
0:36:24 | in comparison to the |
---|
0:36:27 | human vocal tract so we also |
---|
0:36:30 | compressed the range of the second formant |
---|
0:36:33 | and then we resynthesized the sound |
---|
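
The two formant modifications described here, an upward shift because the tract is shorter and a compression of the second formant range, can be sketched as a simple per-frame transformation. The factors `shift=1.2` and `f2_compression=0.7` are placeholder values chosen for illustration, and compressing F2 around its mean is one possible choice; the talk does not give the actual numbers or the exact mapping.

```python
def monkeyfy_formants(tracks, shift=1.2, f2_compression=0.7):
    """Shift all formants up and shrink the F2 range around its mean.

    tracks: list of (f1, f2, f3) tuples in Hz, one per analysis frame.
    shift and f2_compression are hypothetical illustration values.
    """
    f2_mean = sum(frame[1] for frame in tracks) / len(tracks)
    out = []
    for f1, f2, f3 in tracks:
        # pull F2 toward its mean, then scale everything for the shorter tract
        f2_squeezed = f2_mean + f2_compression * (f2 - f2_mean)
        out.append((f1 * shift, f2_squeezed * shift, f3 * shift))
    return out
```

The modified tracks would then drive the filter in the resynthesis step.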
0:36:36 | now |
---|
0:36:37 | the thing with |
---|
0:36:40 | an analysis in terms of source and filter |
---|
0:36:44 | is that it's complete so if you have the source information and the filter information |
---|
0:36:52 | you can basically |
---|
0:36:54 | re-synthesize the sound perfectly there's no loss |
---|
0:36:59 | so if we would use just |
---|
0:37:01 | the human source with the modified formants the sound would probably have sounded too perfect |
---|
0:37:09 | so what we wanted to do is use a source that was more monkey like |
---|
0:37:14 | so we actually also synthesized a new source which was based on a very simple |
---|
0:37:20 | model |
---|
0:37:23 | of the monkey vocal folds which vibrate in a much more irregular way than human vocal folds |
---|
0:37:29 | do so we took our monkey source |
---|
0:37:34 | applied |
---|
0:37:35 | the modified formant filter to it |
---|
0:37:38 | and then we got a real monkey vocalization |
---|
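
To give a feel for what a more irregular source could look like, the sketch below generates a glottal pulse train whose period varies randomly from cycle to cycle. This is only a stand-in under stated assumptions: the talk mentions a very simple model of the monkey vocal folds without describing it, so the uniform jitter, the parameter names and the default sample rate are all hypothetical.

```python
import random

def jittered_pulse_train(f0, jitter, duration, sample_rate=16000, seed=0):
    """Unit pulses with randomly perturbed periods.

    f0:      nominal fundamental frequency in Hz
    jitter:  maximum relative period deviation (0.0 gives a regular train)
    Returns a list of samples, 1.0 at each pulse and 0.0 elsewhere.
    """
    rng = random.Random(seed)
    n = int(round(duration * sample_rate))
    signal = [0.0] * n
    nominal_period = sample_rate / f0        # in samples
    pos = 0
    while pos < n:
        signal[pos] = 1.0
        period = nominal_period * (1.0 + rng.uniform(-jitter, jitter))
        pos += max(1, int(round(period)))    # always advance at least one sample
    return signal
```

Passing such a source through a filter built from the modified formants corresponds to the source-filter resynthesis described in the talk; a realistic source would also need a shaped glottal pulse rather than single-sample impulses.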
0:37:42 | and this is where tecumseh takes over again |
---|
0:37:44 | okay |
---|
0:37:45 | so |
---|
0:37:51 | hopefully that satisfied your morning need for technical details but now you must all be |
---|
0:37:57 | wondering after this is just a synopsis of the whole process that we x-rayed the |
---|
0:38:00 | monkey making a hundred different vocal tract configurations |
---|
0:38:04 | basically everything that monkey did while he was in our x ray |
---|
0:38:08 | we traced those |
---|
0:38:09 | we used the medial axis and then this diameter to area function to |
---|
0:38:15 | create the |
---|
0:38:16 | model of the vocal tract and then we can synthesize formants from that |
---|
0:38:21 | and so what we get here is the original data from lieberman that i showed you |
---|
0:38:26 | at the beginning so the red triangle represents a human female's vocal the f one |
---|
0:38:32 | f two range of a human female with i a and u making up |
---|
0:38:36 | the points |
---|
0:38:37 | and that little blue triangle is what the old model from lieberman said a monkey |
---|
0:38:42 | could do |
---|
0:38:43 | and this is what our macaque model looks like compared to that |
---|
0:38:47 | so unlike lieberman's model which is very restricted we can see that |
---|
0:38:51 | what a real monkey actually does would lead to quite a wide variety in the |
---|
0:38:56 | first formant but a somewhat compressed second formant |
---|
0:39:01 | we used that to create monkey vowels so artificial monkey vowels that occupy the |
---|
0:39:07 | corners of that convex hull so with five monkey vowels in a discrimination |
---|
0:39:11 | task humans are basically at ceiling so they do just as well with the |
---|
0:39:15 | monkey vowels as they do with human vowels and what that shows us |
---|
0:39:19 | is that the macaque's capacity to produce a diverse set of vowels the same |
---|
0:39:23 | as the number in most human languages namely five |
---|
0:39:26 | is absolutely intact so the monkey's vocal tract |
---|
0:39:29 | has no problem doing that |
---|
0:39:31 | we also have good indications that things like bilabial and glottal stops et cetera et |
---|
0:39:37 | cetera many of the different consonants would be possible so clearly the monkey vocal tract |
---|
0:39:42 | is capable of producing a wide range of sounds |
---|
0:39:45 | now that all sounds very dry so it's kind of more interesting to hear what our |
---|
0:39:49 | model sounds like if we're trying to imitate human speech |
---|
0:39:53 | so the model for this was my wife |
---|
0:39:57 | so we had her speak a bunch of sentences but rather than play her first |
---|
0:40:01 | which you should understand i'm gonna play the monkey model first and see if you |
---|
0:40:04 | can understand what this monkey is saying |
---|
0:40:06 | right i |
---|
0:40:09 | right |
---|
0:40:11 | everybody got it right |
---|
0:40:14 | okay and here this is my wife's formants with that synthetic monkey source |
---|
0:40:21 | i |
---|
0:40:23 | okay |
---|
0:40:24 | right i |
---|
0:40:27 | time so |
---|
0:40:28 | what you can hear is that the phonetic content is basically preserved the human |
---|
0:40:33 | formants are lower which makes sense because humans are larger than monkeys so it has |
---|
0:40:38 | a more bassy and less weird sound to it but |
---|
0:40:43 | the phonetic content is basically present so what this shows us is that whatever |
---|
0:40:48 | it is that keeps a monkey or an ape raised in a human home from speaking |
---|
0:40:53 | it's not the peripheral vocal tract it's not the anatomy of their vocal tract |
---|
0:40:59 | and that's basically the conclusion that we drew from this paper the paper was called |
---|
0:41:02 | monkey vocal tracts are speech ready |
---|
0:41:05 | and what that tells us is that rather than looking more at the anatomy of |
---|
0:41:09 | the vocal tract |
---|
0:41:10 | we should be paying attention to the brain that's in charge and that |
---|
0:41:16 | would be another talk to explain we have lots of evidence about what it is about |
---|
0:41:19 | the human brain that gives us such exquisite control over our vocal apparatus but it |
---|
0:41:23 | doesn't seem that the vocal apparatus itself is |
---|
0:41:26 | the crucial thing and put in other terms we've done it with the monkey but |
---|
0:41:30 | i'm quite sure that the same thing would be true with a dog or a |
---|
0:41:33 | pig or a cow if a human brain were in control of a dog or a |
---|
0:41:38 | cow or a pig or a monkey |
---|
0:41:41 | the vocal tract would be perfectly able to communicate english |
---|
0:41:45 | so |
---|
0:41:46 | there's a lot of work to do before we make talking animals but it's gonna |
---|
0:41:49 | involve the brain and not the vocal tract |
---|
0:41:53 | okay so that is our story that was actually faster than we thought just to |
---|
0:41:57 | summarize our general conclusion is that |
---|
0:42:01 | you can use these methods that were mainly developed by physicists and engineers to understand |
---|
0:42:06 | human language or human speech to basically understand and synthesize a wide variety of vertebrate |
---|
0:42:13 | sounds |
---|
0:42:14 | i mainly work with birds and mammals but other people have used |
---|
0:42:18 | these same methods to do things like alligators and frogs so these are very general |
---|
0:42:24 | principles what you all learned in your sort of intro to speech class actually applies |
---|
0:42:28 | to most of the species we know about |
---|
0:42:31 | it's not the vocal tract that keeps most mammals from talking it's really their neural |
---|
0:42:36 | control of that vocal tract |
---|
0:42:38 | and i think the more general message that probably |
---|
0:42:42 | meaningful to pretty much everybody in this room is a better understanding of the physics |
---|
0:42:47 | and physiology of the vocal production system whether it's in a dog a monkey or |
---|
0:42:52 | a dirac a wall can really play a key role it should play a key |
---|
0:42:56 | role in speech synthesis |
---|
0:42:59 | and bart do you wanna say a few extra words of wisdom i guess |
---|
0:43:03 | no |
---|
0:43:06 | okay so we i think we have plenty of time for questions so thanks to |
---|
0:43:10 | all the people who did this work and thank you for |
---|
0:43:29 | it'll take the question mike or should i |
---|
0:43:31 | i |
---|
0:43:34 | a cushion is able to |
---|
0:43:37 | inspired by using the women the ball box |
---|
0:43:43 | the vocal folds |
---|
0:43:45 | them again example can force for by using the like behaviour the dynamics will say |
---|
0:43:52 | he's trying to imitate a human it's just what dogs do when they bark it's |
---|
0:43:56 | the ways a second this is one point so and the second is that at |
---|
0:44:01 | the last part of the user that |
---|
0:44:03 | the key difference lies in neural mechanisms was it really in the |
---|
0:44:07 | neural mechanism yes neural mechanism so my question is about |
---|
0:44:13 | as sometimes because of the dot plot the that this happens so will be disabilities |
---|
0:44:18 | but actually act was again and almost a result of the bit if |
---|
0:44:23 | it is not gonna but only in time |
---|
0:44:26 | so my question was |
---|
0:44:29 | i just talked that the debut the end of the vocal fold dynamics for the |
---|
0:44:33 | ball but |
---|
0:44:34 | and the most mapping that happens in the subject |
---|
0:44:37 | because of that these so is there any kind of q for this was a |
---|
0:44:41 | good use ms |
---|
0:44:42 | question are you asking about the recovery of the source properties or |
---|
0:44:47 | i'm asking about the neural mechanism that is responsible because for that |
---|
0:44:51 | piece was good |
---|
0:44:53 | for the auditory perception or for the production okay so what we know i don't |
---|
0:45:00 | have a slide for this but we know that in humans there are direct connections |
---|
0:45:04 | from the motor cortex onto the neurons that actually control the laryngeal |
---|
0:45:09 | and the tongue muscles |
---|
0:45:11 | those direct connections from cortex onto the laryngeal motor neurons are not present |
---|
0:45:16 | in most mammals |
---|
0:45:17 | so these are absent in other primates they appear to be absent in dogs and cats |
---|
0:45:22 | and travel et cetera but in those species which are good vocal imitators and |
---|
0:45:27 | this includes many birds the parrots and mynah birds but it also includes some bats |
---|
0:45:32 | includes elephants it includes various cetaceans |
---|
0:45:36 | so in all of those groups that have been investigated these direct connections the equivalent |
---|
0:45:40 | of what we humans have are present so the current theory for what is it |
---|
0:45:46 | about our brains that gives us this control is that we have direct connections |
---|
0:45:51 | onto the motor neurons |
---|
0:45:52 | and in most animals there's only indirect connections via various brain stem intermediaries onto the |
---|
0:45:59 | vocal system itself |
---|
0:46:00 | so in other words we've got this new it's essentially like a new gear |
---|
0:46:04 | shift on this ancient vocal tract that we've got |
---|
0:46:08 | that gives our brains more control over it than we would otherwise have |
---|
0:46:15 | a lot more interesting talk |
---|
0:46:18 | so myself i have a free pass at home and a white or evidence we |
---|
0:46:23 | nitpick |
---|
0:46:24 | and so i also works for that it would be quite directional at all be |
---|
0:46:28 | remote or police and what they are saying yes i don't is you are there |
---|
0:46:32 | are also paper published in a channel about converting bring thing last told to speech |
---|
0:46:37 | that the much using speech synthesis for a construction |
---|
0:46:40 | of speech from right how do thing how it is possible to actual and or |
---|
0:46:44 | something similar for our pets to be able to evangelise handle task a signal possible |
---|
0:46:51 | sufficient |
---|
0:46:52 | but that's an interesting question so if |
---|
0:46:55 | given that we can use neural signals from fmri or eeg to synthesize okay |
---|
0:47:02 | speech |
---|
0:47:04 | could we do the same thing for animals and my answer for most animals because |
---|
0:47:09 | of my answer to the first question would be no the reason is that there |
---|
0:47:14 | isn't a correspondence between the cortical signals that we can measure with something like fmri |
---|
0:47:21 | or eeg and the actual sounds that are produced |
---|
0:47:24 | because in most animals it's mainly the brain stem and the midbrain that are controlling |
---|
0:47:30 | these so when a cat meows or a dog barks |
---|
0:47:33 | it doesn't in fact you can remove the cortex and a cat will still meow |
---|
0:47:37 | and a dog will still bark |
---|
0:47:39 | in the same way that a human baby who's born without cortex will still cry |
---|
0:47:43 | and laugh |
---|
0:47:44 | in a normal way |
---|
0:47:45 | so i would also say it would be a lot easier to do this |
---|
0:47:48 | and it's probably a better use of grant money |
---|
0:47:51 | see if you can synthesize laughter and crying |
---|
0:47:54 | from a cortical signal my prediction would be you can't and if you can't do that in |
---|
0:47:59 | humans then you won't be able to do it in animals so i would predict a |
---|
0:48:03 | fake laugh like when i go ha that's not a real laugh that i should be able to |
---|
0:48:08 | cortically control but when i really laugh or i really cry |
---|
0:48:12 | that's gonna be coming from this core brain that's very hard to measure and so |
---|
0:48:16 | you shouldn't be able to synthesize realistic laughter or crying even with eeg maybe |
---|
0:48:29 | do you have any evidence of at which point in evolution this connection between the |
---|
0:48:33 | brain and the vocal tract it starts appearing |
---|
0:48:36 | that's the unfortunate answer to that is no probably many of you know there's a |
---|
0:48:41 | there's a whole field in this you have a slide about this there's a whole |
---|
0:48:45 | field that's essentially trying to reconstruct |
---|
0:48:49 | based on fossils when in our history when in the |
---|
0:48:54 | history of our evolution our capacity for speech occurred and the old argument |
---|
0:49:01 | was always based on if we could know when the larynx descended then we would |
---|
0:49:05 | know when speech occurred |
---|
0:49:06 | but what i think i've shown you in all this work is that it's not |
---|
0:49:10 | laryngeal descent |
---|
0:49:11 | that's crucial for speech it's these direct connections |
---|
0:49:14 | and those unfortunately there's just no fossil cue |
---|
0:49:18 | to whether there's direct connections that's basically the stuff that really doesn't preserve even for |
---|
0:49:23 | an hour |
---|
0:49:25 | much less in the fossil record you would need |
---|
0:49:27 | detailed neuroanatomy at the micron level to answer that question so it |
---|
0:49:32 | even it's even hard with again |
---|
0:49:34 | please |
---|
0:49:37 | so tecumseh and i |
---|
0:49:40 | well we agree on the importance of the of the neural control of course and |
---|
0:49:45 | but we can disagree on the |
---|
0:49:48 | exact precise interpretation of and what the vocal tract data means and video clip |
---|
0:49:58 | i can we do this you know how we think we're |
---|
0:50:04 | so in a sense you could say that there has been some fine tuning of |
---|
0:50:09 | the human vocal tract for vocalization and if you |
---|
0:50:15 | you know if you're a little liberal in the interpretation of what we |
---|
0:50:18 | find in the fossil record you can say |
---|
0:50:23 | it happened somewhere between three million and three hundred thousand years ago |
---|
0:50:29 | it's not very precise i |
---|
0:50:34 | so that the evidence for this is based on various cues that supposedly indicate based |
---|
0:50:40 | on the base of the skull what the position of the larynx and tongue would |
---|
0:50:44 | be it just "'cause" with |
---|
0:50:46 | "'cause" i have these slides and i took them out "'cause" i thought we'd be |
---|
0:50:48 | too long i want to show you some examples of animals that have independently modified |
---|
0:50:54 | their vocal tract |
---|
0:50:56 | in a way that has nothing to do with speech so the way you can |
---|
0:50:58 | make your vocal tract longer is one make your nose longer like this proboscis monkey |
---|
0:51:02 | or lots of various animals like elephants of course you can stick your lips out which |
---|
0:51:07 | many species do so if you do this you sound bigger and if you do |
---|
0:51:11 | this you sound smaller or you can do more bizarre things like |
---|
0:51:14 | make an extension to your nasal tract that forms a big crest like that dinosaur |
---|
0:51:19 | up there or these birds which because the source is at the base of the trachea have |
---|
0:51:24 | elongated trachea and all of these adaptations seem to be ways of making that animal |
---|
0:51:29 | sound bigger |
---|
0:51:30 | it's just a nice example this is an animal with a permanently descended larynx this is |
---|
0:51:35 | a red deer and you'll find this a pretty impressive sound |
---|
0:51:39 | wow |
---|
0:51:41 | wow |
---|
0:51:43 | so the first thing you probably noticed in that image is that penis pumping that was |
---|
0:51:47 | going on back there ignore that look at what's happening |
---|
0:51:50 | okay what's happening in the front of the animal and you'll see |
---|
0:51:54 | i as well |
---|
0:51:57 | back and forth |
---|
0:51:58 | and so when we first saw these videos we were like what is this and |
---|
0:52:01 | it turns out what this is is that's the resting position of the larynx this is a |
---|
0:52:06 | permanently descended larynx in a nonhuman animal and watch what it does when it vocalises |
---|
0:52:13 | i |
---|
0:52:16 | i |
---|
0:52:22 | so i think we could all agree that's a much more impressive descent of |
---|
0:52:26 | the larynx than the few centimetres that happens in humans |
---|
0:52:30 | and it turns out |
---|
0:52:32 | these are not the only species because in our own species there's a secondary |
---|
0:52:36 | descent of the larynx that happens only in men and only at puberty and |
---|
0:52:40 | i think that's exactly the same kind of adaptation that makes this deer |
---|
0:52:44 | sound bigger or a bird sound bigger so i guess that's where we |
---|
0:52:49 | differ i think that |
---|
0:52:50 | even if we know when the larynx descended in humans it could have |
---|
0:52:54 | been an adaptation to just make yourself sound bigger and it might have been a |
---|
0:52:58 | million years after that |
---|
0:53:00 | that we started using that for speech |
---|
0:53:02 | so that's why i really don't think the fossils are gonna answer because we do |
---|
0:53:05 | not have any answer the only way we're gonna get it i think is by |
---|
0:53:08 | is from genetics now we're recovering |
---|
0:53:11 | the genome from denisovans and neanderthals and that might help us answer |
---|
0:53:16 | this question about the recognition |
---|
0:53:22 | i've just want to mention that the result where you know scores against based on |
---|
0:53:26 | the part of the story my question is about earlier you and more to communicate |
---|
0:53:33 | of course okay bye divorce so |
---|
0:53:36 | you know you're talking about the vocal tract varies with a voice source of for |
---|
0:53:42 | really downtime it's whatever |
---|
0:53:45 | a lot of seems to do with a with a voice source do have an |
---|
0:53:48 | idea of video poker bring |
---|
0:53:50 | which is i don't aboard to |
---|
0:53:53 | to use pieces |
---|
0:53:57 | well not use the vocal really over emotions so for sure of social behaviors |
---|
0:54:03 | we we've got actually quite a lot of evidence about sort of overall vocabulary size |
---|
0:54:08 | for different species but most of that comes from relatively intuitive |
---|
0:54:13 | scientists listen and they say there's about five sounds there or there's about |
---|
0:54:18 | twenty sounds there |
---|
0:54:20 | in only a few species have we really done what we need to do which is |
---|
0:54:23 | playback experiments to see what the animals discriminate from others and i would say |
---|
0:54:28 | in many cases that shows us that something that we think is one thing like |
---|
0:54:32 | a meow or a bark or a growl actually has multiple variants |
---|
0:54:39 | so but i think a conservative number for animal vocabularies is something like fifteen sounds |
---|
0:54:45 | and a less conservative number would be something like fifty different sounds |
---|
0:54:49 | and in some birds it goes a lot larger than that but if you're talking |
---|
0:54:52 | about your average mammal it's somewhere in that range so roughly thirty would be a |
---|
0:54:57 | good nonhuman primate |
---|
0:55:00 | vocabulary size of discriminable sounds that have different meanings |
---|
0:55:04 | of course there are animals that can make thousands of different sounds |
---|
0:55:09 | they do this for example birds in their songs or whales in their songs |
---|
0:55:13 | but they don't appear to use this to signify different meanings so then we |
---|
0:55:18 | can't talk about vocabulary anymore we have to just start talking about |
---|
0:55:23 | it's more like |
---|
0:55:24 | phonemes or syllable types rather than meanings |
---|
0:55:29 | we will say something |
---|
0:55:32 | sorry |
---|
0:55:36 | is there's somebody else but who and what do we know what is the frequency |
---|
0:55:41 | resolution of the monkey hearing |
---|
0:55:44 | so that we could hear the relative position of all the formants but |
---|
0:55:48 | to reproduce it absolutely i mean most monkeys have a higher free a higher high |
---|
0:55:53 | frequency cutoff so most monkeys can hear up to forty or even sixty khz so |
---|
0:55:58 | the high frequencies are more extensive than ours |
---|
0:56:01 | but where it counts in the low frequencies they have perfectly good frequency resolution so from five |
---|
0:56:06 | hundred hz to twenty five or thirty five hundred hertz which is where all that |
---|
0:56:09 | formant information is they can they can |
---|
0:56:12 | and that's why of course an animal like a dog or a chimpanzee or basically any |
---|
0:56:16 | other species you care to name can learn to discriminate different human words |
---|
0:56:21 | virtually every dog knows its name and in some cases you can train a dog |
---|
0:56:24 | to discriminate between hundreds or even thousands of words |
---|
0:56:27 | and they can do that |
---|
0:56:29 | so the speech perception apparatus seems to be built on basically widely shared |
---|
0:56:34 | perceptual mechanisms |
---|
0:56:38 | sorry |
---|
0:56:39 | i'm nothing and speech synthesis and of course leaving about how to |
---|
0:56:42 | it would be a place to say that but |
---|
0:56:44 | why |
---|
0:56:45 | actually did you |
---|
0:56:47 | need to do this in this is what we do not to sort of more |
---|
0:56:50 | standard phonetic thing just flew |
---|
0:56:52 | record loads and loads of monkey vocalizations and measure the formants and what |
---|
0:56:58 | would happen if we did that |
---|
0:57:00 | well we we've done that and we've actually looked at the subset of the sounds |
---|
0:57:04 | so remember what we have is a sample of these vocal tracts doing what monkey vocal |
---|
0:57:09 | tracts do and that includes things like feeding chewing swallowing et cetera it |
---|
0:57:15 | also includes a class of |
---|
0:57:17 | non vocal displays that most nonhuman primates well most monkeys and apes do |
---|
0:57:23 | things like this |
---|
0:57:25 | which is called lip smacking it's a very typical primate thing but it's virtually silent |
---|
0:57:31 | so they make maybe a little tiny bit of sound and in some species |
---|
0:57:36 | they actually vocalise when they do it it turns out that the monkey is |
---|
0:57:40 | doing a lot more with its vocal tract in these visual displays than it does in |
---|
0:57:44 | its auditory displays |
---|
0:57:45 | so if we just take the vocal tract configurations where the monkey is making |
---|
0:57:50 | a sound it's a subset of what the vocal tract can actually do and in |
---|
0:57:54 | particular these nonvocal communicative so you |
---|
0:57:58 | could call them visual communication signals have a lot of the a lot of the |
---|
0:58:02 | interesting variants of the vocal tract shape are there |
---|
0:58:05 | and because those are silent we have to figure out what it would sound like if |
---|
0:58:09 | the monkey was vocalising so that's why we have to that's why we had to |
---|
0:58:12 | do all this work that's why it took |
---|
0:58:14 | years to do this |
---|
0:58:16 | and then just to add to that |
---|
0:58:20 | well i guess coincidentally almost at the same time as our paper came out that |
---|
0:58:25 | we change the way and according to which just mentioned here in the front and |
---|
0:58:29 | came out with a paper where they did exactly what you suggested and |
---|
0:58:33 | they found basically that |
---|
0:58:36 | actually, what they used was a different monkey species, and they can produce a |
---|
0:58:42 | surprisingly large range of sounds, that's especially surprising if you compare it to what lieberman |
---|
0:58:50 | had claimed that they could produce |
---|
0:58:52 | but not as large as the range of sounds that our model produced, so |
---|
0:58:58 | they do mainly not use in their actual productions the potential that they |
---|
0:59:06 | have with their vocal tract |
---|
0:59:11 | i would like to confirm that i understood correctly what you say on this |
---|
0:59:16 | slide |
---|
0:59:17 | that there is that more generally |
---|
0:59:22 | it is generally passive |
---|
0:59:24 | is the output or at least experiment that |
---|
0:59:30 | generally this woman from give the in two thousand then |
---|
0:59:35 | that just air flow is coming out |
---|
0:59:39 | and then we can say that the vibration rate is generally passive |
---|
0:59:45 | i think this is too risky |
---|
0:59:48 | because this is exactly what would happen if i'm dead and you pass a |
---|
0:59:53 | strong |
---|
0:59:54 | air flow through my vocal folds |
---|
0:59:57 | i don't think much will be different |
---|
1:00:02 | and in order to do that, even to say that it is generally passive, i |
---|
1:00:07 | think you have to go and look |
---|
1:00:11 | more about neuronal activity |
---|
1:00:15 | and not just about the experiment; i respect tecumseh's work but i think this is too |
---|
1:00:24 | dangerous to |
---|
1:00:25 | to say this |
---|
1:00:27 | yeah, on that slide i think there may be a misunderstanding, because we're |
---|
1:00:33 | not saying that you don't need muscles to put the larynx into phonatory position |
---|
1:00:38 | of course you do that work, in this case i moved the tiger's larynx into |
---|
1:00:43 | the phonatory position |
---|
1:00:45 | what we're saying is that the individual pulses that represent the fundamental frequency so the |
---|
1:00:49 | openings and closings of the glottis, that's what is passively determined by things |
---|
1:00:55 | like muscle tension and pressure |
---|
1:00:58 | so we're not saying that muscle activity doesn't play a role, what we're saying is that |
---|
1:01:03 | it doesn't have to happen at the periodicity of the fundamental frequency |
---|
1:01:08 | and that's an obvious thing if you think about a bat that's producing sounds at forty |
---|
1:01:12 | thousand, at forty thousand hz, there's no way neurons can fire that fast; neurons |
---|
1:01:17 | basically can't fire faster than a thousand |
---|
1:01:20 | so even if it did work for something like an elephant, and it does work |
---|
1:01:24 | for something like a cat at thirty hz |
---|
1:01:26 | it could never work for most of the high vocalisations |
---|
1:01:29 | even a cat at two thousand hz, and certainly not these animals that are producing in |
---|
1:01:34 | the high khz range; it has to be passive because there's no way neurons can |
---|
1:01:38 | fire or muscles can twitch |
---|
1:01:40 | that rapidly |
---|
1:01:41 | so the claim is not that in humans or any animal you don't |
---|
1:01:44 | need to use muscles to position and to control the larynx, you do |
---|
1:01:49 | but only that you don't need muscle activity at the frequency, the fundamental frequency |
---|
1:01:53 | does that make sense |
---|
1:01:57 | it's better |
---|
1:02:04 | and i'm just curious |
---|
1:02:06 | lieberman and you both did work trying to figure out exactly the same |
---|
1:02:11 | thing, the same subject, and came to radically different conclusions, so |
---|
1:02:17 | was lieberman's, what's the improvement, was that approach never going to work, or what |
---|
1:02:22 | was the issue that distinguished, and that, you know, that made the difference between what |
---|
1:02:27 | you did and he did, and what can that teach us for other things we want |
---|
1:02:31 | to do as well do not draw conclusions |
---|
1:02:34 | i would say from the, i mean maybe you can comment on this too, but |
---|
1:02:38 | from the point of view of the technology |
---|
1:02:41 | what we're doing to understand how you go from a vocal tract to formant frequencies |
---|
1:02:47 | not much has changed; they did a pretty good job given the computers they |
---|
1:02:51 | had, their simulation was pretty good; their problem was in the biology, their problem was |
---|
1:02:55 | that they took a single dead animal and they expected that |
---|
1:02:59 | dead animal was gonna tell them the range of motions that are possible in a |
---|
1:03:04 | living animal's vocal tract |
---|
1:03:05 | so they had no indication of what the dynamics |
---|
1:03:08 | of the vocal tract are |
---|
1:03:10 | from looking at the data and that's why we needed these x rays of a |
---|
1:03:13 | living monkey to be able to find out |
---|
1:03:16 | okay, so, but are you saying that you can never figure out what |
---|
1:03:22 | is going on from a dead animal, what so if you |
---|
1:03:27 | so |
---|
1:03:29 | so by the way, that is klatt, which should be a familiar name to people working |
---|
1:03:33 | on speech synthesis, who was the coauthor on one of these papers here, and so he |
---|
1:03:38 | was basically the guy who did the acoustic modeling |
---|
1:03:42 | work and so at the time there are q competing labs working on speech synthesis |
---|
1:03:48 | and i basically the acoustic model i used for my model is basically contemporaneous with |
---|
1:03:55 | a ten is quite small so indeed you know classic stuff |
---|
1:03:59 | so basically they just didn't have the data, it's kind of like old eighties neural |
---|
1:04:03 | nets versus google |
---|
1:04:05 | they just didn't have the data and we have the data |
---|
1:04:13 | and |
---|
1:04:15 | yes |
---|
1:04:16 | and i think it's a very as |
---|
1:04:20 | defined benefit different bands right okay not can make it and fifteen t fact there |
---|
1:04:25 | is no |
---|
1:04:28 | something like fifteen to fifty as a session one and here is to now |
---|
1:04:32 | if the semantics of a time to express |
---|
1:04:35 | i was trained praying all rights a very different |
---|
1:04:38 | set |
---|
1:04:39 | just a it is a fiction planes are the and in my state is virtually |
---|
1:04:44 | pains they're very different to what they're trying to express |
---|
1:04:47 | there's a certain set of core vocalisations that are very widely shared among species |
---|
1:04:53 | so for example sounds that mean threat, sounds that say i'm being mean and scary |
---|
1:04:58 | those tend to be low and have very low formants |
---|
1:05:02 | sounds that are appeasing and saying please don't hurt me, i'm just a little |
---|
1:05:05 | guy, tend to be high frequency |
---|
1:05:07 | so we see that class of vocalisations very widely across mammals and birds |
---|
1:05:12 | then we have this class of kind of mating vocalisations that a lot of species |
---|
1:05:17 | do but they typically sound very different, sometimes it's males just going well like that |
---|
1:05:22 | and sometimes it's much more interesting and complicated |
---|
1:05:24 | and then there's typically mother infant communication, and so there's usually sounds that |
---|
1:05:31 | a mother uses, this is particularly in mammals, that the mother uses to communicate |
---|
1:05:36 | again very widespread |
---|
1:05:38 | and then there's really weird stuff like whale songs or echolocation clicks of |
---|
1:05:44 | dolphins that are really only found in particular groups, so i'd say there's a |
---|
1:05:48 | kind of shared core of semantics and then variations, it's biology, so there's all kinds |
---|
1:05:54 | of weird stuff in the corners but if you say parental care |
---|
1:06:00 | aggression affiliation |
---|
1:06:01 | and |
---|
1:06:03 | there's also alarm calls and three calls are pretty common but a handful of maybe |
---|
1:06:08 | five semantic axes would probably do it from a standard |
---|
1:06:20 | well, there are some vocalisations that are basically saying i'm here |
---|
1:06:25 | and there are other vocalisations that try their best to hide that, so like a |
---|
1:06:29 | very high frequency quiet thing that tails off it makes it hard to find so |
---|
1:06:33 | various alarm calls are like that |
---|
1:06:36 | it like a there is an active basis is it |
---|
1:06:39 | so for fact that market i block |
---|
1:06:42 | in fact it is quite a lot of human where it's that's right |
---|
1:06:46 | but if a vocabulary but can express is so small that maps model about what |
---|
1:06:53 | making this pen |
---|
1:06:55 | seven or something that brightness |
---|
1:06:57 | to various have |
---|
1:07:00 | i do not put it in a fight |
---|
1:07:02 | if i if i and response |
---|
1:07:04 | and then where it's at an unconstrained a few |
---|
1:07:08 | that's kind of frustration very |
---|
1:07:14 | well, i think a fundamental finding of animal communication is that animals understand |
---|
1:07:20 | a lot more than they can say |
---|
1:07:22 | so essentially we have many species for example that understand not only their own species |
---|
1:07:27 | but they can learn the alarm calls of other species in their environment and of |
---|
1:07:31 | course animals raised with humans learn to understand human words, and none of the species |
---|
1:07:36 | ever produce those |
---|
1:07:38 | so just as with a child, right, or any of us, our receptive vocabulary, the words |
---|
1:07:42 | we understand, is much larger than the number of words we say typically |
---|
1:07:46 | for most animals i think the receptive vocabulary is large and the productive vocabulary is |
---|
1:07:52 | very limited |
---|
1:07:53 | whether they find that frustrating or not |
---|
1:07:55 | i don't know that's harder so |
---|
1:08:03 | so the |
---|
1:08:07 | humans have more control over all or there are also in the water no value |
---|
1:08:13 | model to use the excitation signal was much working or |
---|
1:08:18 | so project was to every other mean and what we present more clearer and more |
---|
1:08:23 | how to model |
---|
1:08:25 | going back to this image, we've done a lot of work now doing excised |
---|
1:08:31 | larynx work, and one of the things we found is that most species can very |
---|
1:08:35 | easily be driven into a chaotic state |
---|
1:08:38 | where rather than this nice regular harmonic process that we see here you get essentially |
---|
1:08:45 | coupled oscillators and the vocal folds generating chaos and you can see the classic steps |
---|
1:08:50 | from biphonation into a triphone a period doubling to chaos in vocal folds in |
---|
1:08:55 | virtually every species that we looked at |
---|
1:08:58 | now and it seems to be very easy for most animals to go into a |
---|
1:09:02 | chaotic state and that's reflected by the fact that many sounds we hear animals produce |
---|
1:09:07 | have a chaotic source |
---|
1:09:09 | so for example monkeys do this all the time they do this |
---|
1:09:13 | and even dog barks are like that; they let themselves use chaos much |
---|
1:09:18 | more; in speech, you know, like this |
---|
1:09:22 | but unless you're batman |
---|
1:09:23 | you know |
---|
1:09:25 | nobody does that, we favour this harmonic source for most things; if you |
---|
1:09:30 | listen to a baby crying you'll hear plenty of chaos |
---|
1:09:33 | so i think what's hard to say is whether humans |
---|
1:09:37 | we can produce chaos with our vocal folds, but do we just choose to use |
---|
1:09:41 | this nice regular harmonic, nice clear pitch signal |
---|
1:09:44 | because it's |
---|
1:09:46 | you know, better for understanding or it sounds nice, or are our vocal folds actually less |
---|
1:09:52 | inclined to go chaotic |
---|
1:09:54 | than those of other species |
---|
1:09:55 | that's a question that i don't think we can answer at present |
---|
1:09:58 | but we certainly do a lot less chaos; in monkeys it's the most common thing you're |
---|
1:10:02 | gonna hear, these threat grunts |
---|
1:10:04 | are chaotic and so that's what we were trying to model in the synthesis |
---|
1:10:08 | so i've done a few |
---|
1:10:11 | models where there's interaction between the vocal tract and the vocal folds and also looking |
---|
1:10:16 | at chaotic vibrations and one of the other things that you find even if you |
---|
1:10:21 | get these chaotic vibrations is it's somewhat well it's |
---|
1:10:25 | quite a bit harder to control vocal fold onset, so it tends to be more gradual |
---|
1:10:30 | and which makes for instance it almost impossible to make a distinction between voiced and |
---|
1:10:35 | voiceless |
---|
1:10:36 | consonants, which are pretty important in speech, and so, i'm just putting it out |
---|
1:10:42 | there but it seems that this |
---|
1:10:46 | more |
---|
1:10:47 | regular vibration of the human vocal fold is useful for speech; whether it's, you know |
---|
1:10:54 | being |
---|
1:10:56 | used by speech because it is that way, or whether it has become that |
---|
1:11:01 | way because it's useful for speech, that's another question |
---|
1:11:12 | okay |
---|
1:11:17 | thank you very much |
---|
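
Editor's note on the passive-vibration argument made around 1:01:08. The speaker's point is that the glottis cannot be opened and closed by one neural or muscular event per cycle once the fundamental frequency exceeds what neurons can fire at. The short Python sketch below only restates that arithmetic; it is not a physiological model. The 1000 Hz ceiling and the example frequencies are the round numbers quoted in the discussion, and the function name and labels are hypothetical, chosen for illustration.

```python
# Rough plausibility check of the argument in the discussion, not a model.
# ASSUMPTION: ~1000 Hz is the upper bound on neural firing rate, as quoted
# by the speaker; the f0 values are the round numbers from the transcript.

MAX_FIRING_RATE_HZ = 1000.0


def per_cycle_control_possible(f0_hz, max_rate_hz=MAX_FIRING_RATE_HZ):
    """Return True if one neural event per glottal cycle is even conceivable."""
    return f0_hz <= max_rate_hz


# Hypothetical labels for the cases mentioned in the talk.
examples = {
    "cat at 30 Hz": 30.0,
    "cat at 2000 Hz": 2000.0,
    "bat at 40000 Hz": 40000.0,
}

for label, f0 in examples.items():
    period_ms = 1000.0 / f0  # duration of one glottal cycle in milliseconds
    if per_cycle_control_possible(f0):
        verdict = "per-cycle muscle control conceivable"
    else:
        verdict = "vibration must be passive"
    print(f"{label:16s} period {period_ms:8.3f} ms -> {verdict}")
```

Under these assumptions only the 30 Hz case falls below the firing-rate ceiling, which matches the speaker's claim that muscles set the larynx configuration while the individual pulses at higher frequencies have to arise passively.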