0:00:13 | K in on in of from true halogen of them i don't |
---|
0:00:16 | and um I'll be presenting our preliminary work on processing 'yup' |
---|
0:00:20 | and short utterances in interactive speech |
---|
0:00:22 | so I'll just |
---|
0:00:23 | start by saying that this is primarily the work of |
---|
0:00:26 | Professor Campbell, who couldn't be here |
---|
0:00:29 | but also a lot of the work was by Helena Moniz, who visited us in Trinity for um |
---|
0:00:34 | for a month last summer to work on the corpus |
---|
0:00:38 | so we're interested in spoken conversation, we're interested in describing, characterising, modelling and ultimately synthesizing this spoken |
---|
0:00:46 | conversation |
---|
0:00:47 | and probably the most striking aspect of spoken conversation is that it's massively interactive |
---|
0:00:53 | so you have two people, say, having a conversation |
---|
0:00:55 | it could be that one person is doing most of the talking, which happens quite a lot |
---|
0:00:59 | and |
---|
0:01:00 | perhaps it's |
---|
0:01:01 | somebody talking about an issue they had at work |
---|
0:01:03 | to a friend |
---|
0:01:05 | even though one is doing most |
---|
0:01:07 | of the talking |
---|
0:01:08 | the interaction is still massively interactive and um |
---|
0:01:12 | the other participant is |
---|
0:01:13 | providing constant feedback to the main speaker |
---|
0:01:16 | about whether they understand, if |
---|
0:01:18 | something needs to be reiterated, that they need to go faster, or provide background information |
---|
0:01:22 | this sort of thing |
---|
0:01:24 | and |
---|
0:01:25 | spoken conversation is |
---|
0:01:26 | in stark contrast to the type of monologue-like speech you get in |
---|
0:01:29 | news broadcasts, lectures or, well, talks like this |
---|
0:01:33 | and in spoken interaction meaning is built up |
---|
0:01:35 | uh adaptively and collaboratively |
---|
0:01:38 | so instead of being a kind of linear transfer of information |
---|
0:01:41 | it's |
---|
0:01:41 | rather that it goes forward and back and forth, and even around at times |
---|
0:01:46 | if you look at uh any spoken interaction corpus it's |
---|
0:01:50 | very apparent from it that there's a very high frequency of short utterances |
---|
0:01:54 | and |
---|
0:01:54 | short utterances have um |
---|
0:01:57 | a function in spoken conversation which is disproportionate to their length |
---|
0:02:00 | they're |
---|
0:02:01 | very very useful and very important in managing spoken discourse |
---|
0:02:06 | you have here um |
---|
0:02:07 | a graph of a telephone conversation |
---|
0:02:09 | so each of the panels |
---|
0:02:11 | um corresponds to each speaker |
---|
0:02:13 | and the x-axis is time and the y-axis is |
---|
0:02:16 | well, speech density in ten-second frames |
---|
0:02:20 | and so speech density here is |
---|
0:02:22 | a measure of talking time per frame |
---|
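As a rough sketch of the speech density measure just described: the fraction of each ten-second frame in which one speaker is talking. The function name `speech_density`, the (start, end) interval input in seconds, and the normalisation to a 0 to 1 fraction are all assumptions made for illustration; the talk only says that density measures talking time per frame.

```python
import math


def speech_density(intervals, total_duration, frame_len=10.0):
    """Fraction of each frame that one speaker spends talking.

    intervals: non-overlapping (start, end) utterance times in seconds
               for a single speaker (hypothetical input format).
    total_duration: length of the conversation in seconds.
    frame_len: frame length in seconds (ten seconds in the talk).
    """
    n_frames = int(math.ceil(total_duration / frame_len))
    density = []
    for k in range(n_frames):
        frame_start = k * frame_len
        frame_end = min(frame_start + frame_len, total_duration)
        talk_time = 0.0
        for start, end in intervals:
            # Portion of this utterance that falls inside the frame.
            overlap = min(end, frame_end) - max(start, frame_start)
            if overlap > 0:
                talk_time += overlap
        density.append(talk_time / (frame_end - frame_start))
    return density


# Three utterances in a 30-second stretch of conversation.
example_density = speech_density([(0, 4), (8, 13), (25, 26)], total_duration=30)
```

Plotting one such list per speaker against frame index would give one panel per speaker, as in the graph being described.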
0:02:26 | and |
---|
0:02:27 | from this, in this conversation the first speaker is very much |
---|
0:02:30 | doing most of the talking, and you can see a high frequency |
---|
0:02:34 | of both uh long and short utterances here |
---|
0:02:38 | the second speaker, even though they're not |
---|
0:02:41 | using as many long utterances, they're still very active in terms of the amount of short utterances |
---|
0:02:46 | that they use |
---|
0:02:47 | so both conversational partners are highly active; the number of short utterances |
---|
0:02:51 | is extremely high. if you look at the transcription of these utterances |
---|
0:02:54 | the linguistic content |
---|
0:02:56 | is |
---|
0:02:56 | really very repetitive and |
---|
0:02:58 | the variation is only quite |
---|
0:03:00 | quite minimal |
---|
0:03:02 | of course when we're engaging in spoken conversation, if we're not using linguistic content as a tool |
---|
0:03:06 | then we have to use some other tool |
---|
0:03:09 | to provide differentiation to the speaker |
---|
0:03:11 | and hence the importance of |
---|
0:03:13 | prosody and voice quality or vocal timbre |
---|
0:03:17 | so just to illustrate this a little bit I'm gonna play |
---|
0:03:20 | uh a sequence of short utterances from a single male speaker from the D64 corpus |
---|
0:03:26 | which I'll describe in a few minutes |
---|
0:03:28 | um so I'll just play it first |
---|
0:03:33 | [audio: sequence of short utterances played] |
---|
0:03:58 | okay um |
---|
0:03:59 | just kind of what I wanted to show from that is just uh I'm sure just by listening to them |
---|
0:04:03 | you can hear that |
---|
0:04:05 | in a spoken conversation those same linguistic units have |
---|
0:04:09 | very different pragmatic functions in discourse |
---|
0:04:12 | and the element that we believe |
---|
0:04:15 | provides this sort of differentiation, well, a lot of it, is |
---|
0:04:18 | the prosody and voice quality, which I think you could hear in some of those short utterances |
---|
0:04:23 | one of the |
---|
0:04:24 | corpora that Professor Campbell worked on was the Expressive Speech Processing corpus, which was one thousand five hundred hours of |
---|
0:04:30 | interactive speech |
---|
0:04:31 | recorded in Japan between two thousand and two thousand and five or six |
---|
0:04:35 | and um |
---|
0:04:36 | the most common words, |
---|
0:04:39 | like single words, made up more than half of the total utterance count |
---|
0:04:43 | these single words came in a wide range of prosodic conditions |
---|
0:04:47 | carrying entirely different messages |
---|
0:04:49 | one of the um examples that Professor Campbell sometimes gives |
---|
0:04:53 | is the word 'honma' |
---|
0:04:54 | which is an Osaka dialect word which roughly translates as 'really' |
---|
0:04:59 | in English |
---|
0:05:00 | and in their data set from the corpus |
---|
0:05:02 | um |
---|
0:05:03 | it had twenty different, at least twenty different |
---|
0:05:07 | pragmatic functions, that single word |
---|
0:05:10 | and again prosody and voice quality are essential to provide the differentiation in |
---|
0:05:14 | spoken conversation |
---|
0:05:16 | just um a final graph just to um |
---|
0:05:21 | illustrate the frequency of short utterances |
---|
0:05:24 | also in larger party conversations, with |
---|
0:05:27 | a graph here from the uh FreeTalk corpus |
---|
0:05:30 | so |
---|
0:05:31 | there are five speakers involved in the conversation |
---|
0:05:35 | each of the different colours represents a different speaker and the length of the bar represents the length of the utterance |
---|
0:05:40 | and again if you look at this there is a high frequency of short utterances, sometimes by a single speaker |
---|
0:05:45 | and sometimes |
---|
0:05:46 | and |
---|
0:05:47 | by more than one at the same time |
---|
0:05:51 | okay so this brings us on to the corpus used in the current study, so um at the end of |
---|
0:05:57 | two thousand and nine the D64 corpus was recorded in um |
---|
0:06:01 | in an uh apartment in Dublin |
---|
0:06:04 | and the goal of the corpus was to um richly |
---|
0:06:08 | record |
---|
0:06:10 | highly naturalistic spoken conversation data |
---|
0:06:14 | um |
---|
0:06:15 | so there were twelve audio lines, five video cameras, two three-sixty degree videos and six |
---|
0:06:20 | OptiTrack motion capture |
---|
0:06:22 | and there were five participants |
---|
0:06:23 | three male and two female |
---|
0:06:25 | the social interaction was completely unstructured, non-scripted, and |
---|
0:06:28 | there was no particular conversational goal |
---|
0:06:30 | and |
---|
0:06:31 | for this reason the topics |
---|
0:06:32 | varied um |
---|
0:06:34 | very widely |
---|
0:06:35 | so there were four sessions over two days, and in the current study we look at the first two sessions |
---|
0:06:40 | the first session, I don't think it was even meant to be recorded, but |
---|
0:06:44 | the |
---|
0:06:44 | the three male speakers in the room at the time |
---|
0:06:47 | all had |
---|
0:06:48 | um headset mikes on |
---|
0:06:51 | and um |
---|
0:06:52 | there was only a short amount of time before the other two, female, participants arrived |
---|
0:06:56 | and because of |
---|
0:06:58 | problems with Cubase and other technical issues |
---|
0:07:01 | it was a fairly tense |
---|
0:07:03 | and stressful environment, and this is very apparent from the speech data when you listen to it after |
---|
0:07:08 | the second session was um |
---|
0:07:10 | when the two female participants arrived, was um |
---|
0:07:13 | a much more relaxed |
---|
0:07:15 | kind of um people sitting and drinking cups of coffee, talking a little bit about themselves |
---|
0:07:20 | and |
---|
0:07:20 | so the two sessions are starkly contrasting in terms |
---|
0:07:23 | yeah in this regard |
---|
0:07:25 | and I should state that only one of the female speakers is in our analysis |
---|
0:07:30 | in the current work |
---|
0:07:32 | so when Helena Moniz was over with us |
---|
0:07:34 | last summer um |
---|
0:07:36 | she annotated twelve |
---|
0:07:38 | she used twelve annotation labels for the short utterances |
---|
0:07:42 | in these two sessions |
---|
0:07:43 | so I'm not gonna go through all of them but just highlight the most frequent ones |
---|
0:07:47 | so backchannels are clearly the um |
---|
0:07:50 | the most frequent; so backchannels are kind of the kind of feedback that a speaker might |
---|
0:07:55 | be given, like 'yeah', 'okay', 'right' |
---|
0:07:58 | uh also very common were filled pauses |
---|
0:08:00 | so like 'um', 'uh' |
---|
0:08:02 | 'like' |
---|
0:08:03 | these sort of things |
---|
0:08:04 | and also apparently interjections and repetitions were quite frequent |
---|
0:08:09 | we conducted some prosodic analysis on these |
---|
0:08:12 | short utterances |
---|
0:08:13 | and we measured fundamental frequency mean, max and |
---|
0:08:18 | position of the peak as a percentage, |
---|
0:08:21 | the location of the peak in the utterance |
---|
0:08:23 | and power, the same measures |
---|
0:08:24 | we used two |
---|
0:08:26 | fairly crude voice quality measures: the difference between the first two harmonics of the speech spectrum |
---|
0:08:30 | and the difference between the first |
---|
0:08:32 | harmonic and the harmonic |
---|
0:08:33 | closest to the third formant region |
---|
0:08:36 | and we also measured duration |
---|
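The fundamental frequency measures listed above (mean, maximum, and the position of the peak as a percentage of the way through the utterance) can be sketched as follows. This is a minimal illustration, assuming an F0 track sampled at evenly spaced frames with 0 marking unvoiced frames; the function name `f0_summary` and the input format are hypothetical, and the talk does not say which pitch tracker or frame rate was used.

```python
def f0_summary(f0_track):
    """Mean, maximum and relative peak position of an F0 contour.

    f0_track: F0 values in Hz at evenly spaced frames across one
              utterance; 0 marks an unvoiced frame (assumed convention).
    """
    voiced = [(i, f) for i, f in enumerate(f0_track) if f > 0]
    if not voiced:
        return None  # no voiced frames, nothing to measure
    values = [f for _, f in voiced]
    # First frame holding the highest F0 value.
    peak_index, peak_value = max(voiced, key=lambda pair: pair[1])
    last = len(f0_track) - 1
    peak_pct = 100.0 * peak_index / last if last > 0 else 0.0
    return {
        "mean_hz": sum(values) / len(values),
        "max_hz": peak_value,
        "peak_position_pct": peak_pct,
    }


# A six-frame utterance whose pitch peaks a little after the midpoint.
example = f0_summary([0, 100, 120, 150, 110, 0])
```

Expressing the peak location as a percentage makes utterances of different durations comparable, which matters when the utterances are as short as the ones discussed here.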
0:08:39 | we then carried out principal component analysis and that showed |
---|
0:08:42 | the first loading to be dominated by power values, the second to be dominated by F zero values and the third to be |
---|
0:08:47 | dominated by both voice quality and duration values |
---|
0:08:50 | so this kind of suggested |
---|
0:08:52 | uh the independence of these |
---|
0:08:54 | of these groups |
---|
0:08:55 | and the first five loadings accounted for seventy percent of the variance |
---|
0:09:00 | we wanted to look further at the voice quality involved in this |
---|
0:09:03 | and so we wanted to look at voice qualities across the breathy-to-tense |
---|
0:09:06 | continuum |
---|
0:09:08 | so phonation mode, or mode of vocal fold vibration, is |
---|
0:09:12 | uh critical to these voice qualities |
---|
0:09:15 | and I've shown here |
---|
0:09:16 | a kind of image of the larynx taken from above |
---|
0:09:20 | with the three main muscular tensions um highlighted |
---|
0:09:24 | so for breathy voice quality, when the vocal folds are vibrating you have these low levels of tension |
---|
0:09:29 | low adductive tension |
---|
0:09:30 | so this |
---|
0:09:31 | means that there's not full closure |
---|
0:09:34 | and you get this |
---|
0:09:35 | and you get this chink at the posterior end of the vocal folds that allows air in there |
---|
0:09:40 | to pass through the vocal folds, and this is the main contributor to the |
---|
0:09:45 | sort of breathy perceptual quality |
---|
0:09:47 | with tense voice quality at the other end of the spectrum |
---|
0:09:50 | you've |
---|
0:09:51 | high levels of the three main laryngeal tensions |
---|
0:09:54 | producing a uh tenser voice quality; so we wanted to use some acoustic measures to |
---|
0:09:58 | measure these |
---|
0:10:00 | physiological correlates |
---|
0:10:01 | we used a three-step method: first we |
---|
0:10:04 | measured glottal closure instants |
---|
0:10:06 | using |
---|
0:10:07 | using uh |
---|
0:10:09 | the DYPSA method; so glottal closure instants |
---|
0:10:11 | correspond to the moments where the vocal folds uh come together |
---|
0:10:16 | then we used um |
---|
0:10:18 | an inverse filtering method; so inverse filtering is basically to remove the contribution of the vocal tract |
---|
0:10:23 | from the speech signal, giving an estimate of the glottal source signal, the uh signal that was created |
---|
0:10:28 | by the vocal folds in the larynx |
---|
0:10:30 | so we used the iterative adaptive inverse filtering method |
---|
0:10:33 | described by Alku |
---|
0:10:34 | I won't go through the block diagram, just |
---|
0:10:36 | that the method attempts to compensate for the spectral roll-off of the voice source signal |
---|
0:10:41 | and uses LPC analysis to try and get an all-pole model |
---|
0:10:44 | of the vocal tract transfer function |
---|
0:10:47 | this is done in a couple of iterations and uh |
---|
0:10:50 | the output is the estimate of the glottal source signal |
---|
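To make the inverse filtering idea concrete, here is a deliberately simplified, single-pass sketch: fit an all-pole model with LPC (autocorrelation method, Levinson-Durbin recursion) and apply its inverse as an FIR filter. This is not the iterative adaptive method referred to in the talk, which repeats the estimation and also compensates for the spectral roll-off of the source. The function names, the toy two-pole "vocal tract" and the impulse excitation are all assumptions chosen so the example can be checked by hand.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion on the autocorrelation of x.

    Returns [1, a1, ..., ap], the coefficients of the inverse filter
    A(z) = 1 + a1 z^-1 + ... + ap z^-p; 1/A(z) is the all-pole model.
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) to x, removing the modelled resonances."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Toy check: one impulse passed through a known two-pole filter,
# y[n] = e[n] + 1.2 y[n-1] - 0.6 y[n-2].
excitation = [1.0] + [0.0] * 199
speech = []
for n, e in enumerate(excitation):
    y1 = speech[n - 1] if n >= 1 else 0.0
    y2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(e + 1.2 * y1 - 0.6 * y2)

coeffs = lpc(speech, order=2)
residual = inverse_filter(speech, coeffs)
```

In this toy case the residual recovers the original impulse, because the signal really was produced by an all-pole filter. With real speech the source is not an impulse, which is one reason an iterative scheme with roll-off compensation is used instead of a single LPC pass.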
0:10:53 | um so then with this output signal we wanted |
---|
0:10:57 | we used these glottal gradients described by Lugger and Yang in two thousand and six, which is kind of |
---|
0:11:02 | follow-on work from Stevens and Hanson |
---|
0:11:04 | the reason for using these glottal gradients was |
---|
0:11:06 | they were described in previous work to be um |
---|
0:11:09 | to be useful even in less than ideal recording conditions, which inevitably happen when you um |
---|
0:11:16 | um when you're dealing with |
---|
0:11:18 | kind of interactive speech like this |
---|
0:11:20 | so we chose the two glottal gradients that, from our own work on carefully controlled data, |
---|
0:11:25 | showed the best differentiation of voice qualities across the breathy-to-tense dimension |
---|
0:11:30 | just highlighted here are the two glottal gradients: |
---|
0:11:33 | GOG, glottal opening gradient, and RCG, rate of closure gradient |
---|
0:11:38 | and so uh |
---|
0:11:39 | I won't describe these any further, but just to state that low levels of these two values |
---|
0:11:44 | suggest |
---|
0:11:45 | tenser voice qualities and |
---|
0:11:47 | higher levels |
---|
0:11:48 | suggest um a breathier voice quality |
---|
0:11:52 | okay so we carried |
---|
0:11:54 | we carried out this, uh we analysed the short utterances using |
---|
0:11:59 | this method |
---|
0:12:01 | and I should have mentioned earlier that in the annotation there was also annotation of overlapping and non- |
---|
0:12:07 | overlapping segments |
---|
0:12:08 | so we found that um |
---|
0:12:10 | RCG values are significantly lower |
---|
0:12:12 | in overlapping speech, taking uh taking all um |
---|
0:12:15 | all four speakers |
---|
0:12:16 | uh and there were also lower GOG values |
---|
0:12:19 | this trend was also seen in each of the speakers individually |
---|
0:12:23 | um for |
---|
0:12:24 | comparing session one and session two we only used the three male speakers because one of the females wasn't present in |
---|
0:12:29 | in the first |
---|
0:12:30 | we found lower RCG and lower GOG values |
---|
0:12:34 | but when we looked at the speakers individually |
---|
0:12:37 | two of the male speakers showed significantly lower RCG values whereas |
---|
0:12:41 | the other one of the male speakers showed significantly higher RCG values, so this was |
---|
0:12:46 | this was a little bit curious |
---|
0:12:49 | so |
---|
0:12:49 | how do we interpret this? well we interpret this as a tenser overall voice quality in the first |
---|
0:12:53 | session |
---|
0:12:54 | and during overlapping speech; this is reasonably intuitive |
---|
0:12:57 | and perhaps in overlapping speech a tenser voice quality could be a mechanism for competing for the turn |
---|
0:13:05 | also um as I stated at the beginning, this kind of |
---|
0:13:08 | more stressful first session |
---|
0:13:10 | may lead to an overall tenser |
---|
0:13:12 | um voice production; but |
---|
0:13:16 | when participant two showed the opposite trend across sessions we spoke to him after |
---|
0:13:20 | and um |
---|
0:13:21 | he actually described |
---|
0:13:23 | the first session as uh an environment with more computer equipment set up; some people can find |
---|
0:13:27 | that to be more controlled, and he didn't |
---|
0:13:29 | feel the stress that the others did |
---|
0:13:31 | whereas |
---|
0:13:32 | a kind of get-to-know-you session can actually be very socially awkward for some people, and |
---|
0:13:37 | he admitted to this in the second session |
---|
0:13:40 | overall the kind of take-home message is just that short utterances vary substantially in terms of prosody and voice quality |
---|
0:13:46 | in spoken conversation |
---|
0:13:48 | traditional speech recognition systems |
---|
0:13:51 | don't take account of these |
---|
0:13:53 | these aspects |
---|
0:13:54 | of speech |
---|
0:13:55 | and if we want to |
---|
0:13:57 | properly model uh |
---|
0:14:00 | the type of naturalistic speech we have in spoken conversation |
---|
0:14:03 | then we feel that these aspects |
---|
0:14:05 | need to be taken care of |
---|
0:14:07 | so just finally, just to state what we're doing with this: we currently have um |
---|
0:14:12 | an exhibition in the Science Gallery in Trinity College |
---|
0:14:15 | where Herme the robot is, like, a robot with |
---|
0:14:18 | with |
---|
0:14:19 | video and audio recording |
---|
0:14:23 | and uh basically tracks people's faces and looks towards them, then strikes up a conversation |
---|
0:14:27 | so we're using this for data collection on spoken conversation and short utterances |
---|
0:14:31 | and also as uh a platform for testing our hypotheses about |
---|
0:14:36 | short utterances |
---|
0:14:38 | and so |
---|
0:14:39 | yeah um |
---|
0:14:40 | Nick uh Nick Campbell and I are both funded by SFI |
---|
0:14:44 | and Helena Moniz was supported by FCT |
---|
0:14:46 | let |
---|
0:14:48 | and that's |
---|
0:14:48 | i stuff |
---|
0:14:54 | so we can have |
---|
0:14:55 | time for two questions |
---|
0:15:02 | maybe i thought |
---|
0:15:03 | all there it one |
---|
0:15:08 | i |
---|
0:15:08 | the on |
---|
0:15:10 | i K |
---|
0:15:10 | i i just my i would maybe i |
---|
0:15:13 | is it something that you're trying, to differentiate the different types of utterances that you have labelled |
---|
0:15:18 | uh no, we didn't; well, that was done in the annotation but we didn't match that |
---|
0:15:24 | to the acoustics in the |
---|
0:15:25 | description here |
---|
0:15:26 | but that would be something that |
---|
0:15:29 | and thus |
---|
0:15:30 | press covers |
---|
0:15:31 | yeah and i think you my work along along those lines with with |
---|
0:15:35 | but some the measurements we use but i that didn't |
---|
0:15:37 | that wasn't |
---|
0:15:38 | yeah |
---|
0:15:38 | i don't have that that |
---|
0:15:45 | um |
---|
0:15:46 | why you're my a very similar but |
---|
0:15:48 | a in the very fact one again what |
---|
0:15:50 | for |
---|
0:15:52 | was going from over that you that |
---|
0:15:54 | and i i'm as |
---|
0:15:56 | may may maybe be N Z might know little that better than me and this but um |
---|
0:16:00 | uh i i i i |
---|
0:16:01 | just |
---|
0:16:02 | what I think is true is that Professor Campbell has said that |
---|
0:16:06 | he does want other people to be using it; I think he wants a system where |
---|
0:16:10 | any annotation that people would do would be |
---|
0:16:12 | contributed back into the overall project |
---|
0:16:15 | but if you contact him at nick at TCD dot |
---|
0:16:19 | IE, he's definitely open to exchange |
---|
0:16:23 | thank you my |
---|
0:16:24 | oh |
---|