0:00:18 | Marcin will be presenting the next talk.
0:00:26 | Is this on?
0:00:28 | So how do I... how do I do that?
0:00:35 | I need some help here, I think. Or maybe...
0:00:38 | Oops, I'm sorry. Stop... my computer.
0:00:44 | The presentation is on this computer, but I can't find the...
0:00:48 | There is no pointer there now, right?
0:00:53 | Right.
0:01:02 | Is the other...
0:01:07 | Well, I can start while this is happening. I can start by saying that the work I am going to be presenting was really Kornel Laskowski's work, and he very generously invited us, Mattias and me, to collaborate with him on this.
0:01:31 | And then it turned out that he cannot make it today, which means that you are stuck with me here. I will try not to make too much of a mess of his talk.
0:01:45 | So the question that we are tackling here is a very old question in speech science: the question of whether, or to what extent, pitch plays a role in the management of speaker changes.
0:02:01 | This has generated, to date, a huge and steady stream of papers. But if you look across those papers, you can extract some broad consensus: first of all, pitch does play some role; and secondly, there is this binary opposition between flat pitch signalling, or being linked to, turn-holding, and any kind of pitch movement, dynamic pitch, being linked to turn-yielding.
0:02:27 | And that's it, that's the whole story. Except of course it is not, because there are still a number of questions that you might want to ask about the contribution of pitch to turn-taking.
0:02:37 | Such as: does it matter whether you are looking at spontaneous or task-oriented material? Does it matter whether the speakers can see each other, or whether they know each other? What is the actual contribution of pitch over lexical or syntactic cues?
0:02:55 | And finally, I am a linguist by training, a phonetician, so we know that different languages use pitch linguistically to different extents, and the question is whether this is also reflected in how they use pitch for pragmatic purposes such as turn-taking.
0:03:12 | And then there is a whole other list of questions about how you transform, how you represent, pitch in your model. Do you do some kind of perceptual stylisation based on perceptual thresholds? Do you do some sort of curve fitting: polynomials, functional data analysis, what have you? Do you use a log scale? Do you transform to semitones? And how far back do you look for those cues: ten milliseconds, a hundred, one second, ten seconds?
0:03:41 | These are all interesting and important questions, but it is very difficult to answer them in a systematic way, because any two studies you point to will vary across so many dimensions that it is very difficult to estimate, to quantify, the contribution of any of these factors to the actual contribution of pitch to turn-taking.
0:04:02 | So what we are trying to do here is propose a way of evaluating the role of pitch in turn-taking, and it is a method with three properties we think are important. First, it is scalable: it is applicable to material of any size. Second, it is not reliant on manual annotation. And third, it gives you a quantitative index of the contribution of pitch, or of any other feature for that matter, because in the long term this method can be applied to any candidate turn-taking cue.
0:04:43 | So, the way we chose to showcase, and also to evaluate, this method was to ask three questions which we thought were interesting for us and which we hope are interesting to some of you. The first question is whether there is any benefit in having pitch information for the prediction of speech activity in dialogue. The second is, if it does make a difference, how best to represent your pitch information. And the third is how far back you have to look for these cues.
0:05:15 | So these are the questions that we will be asking, and we will be trying to answer them...
0:05:20 | ...using Switchboard, which we divided into three speaker-disjoint sets; there is no speaker in more than one of them. And instead of running our own voice activity detection, we just used the forced alignments of the manual transcriptions that come with Switchboard.
0:05:39 | And what we did, and this is the idea that lies at the heart of this method, and I am sure you have seen it before, is to use conversational chronography, which is a sort of discretised, quantised speech/silence annotation. You have a frame of predefined duration, here we used a hundred milliseconds, and for each of those frames, and for each of the speakers, you indicate whether that person was speaking or silent during that interval.
0:06:05 | So here we have speaker A speaking for four hundred milliseconds, then there is a hundred milliseconds of overlap, speaker B takes four frames of speech, there is a hundred milliseconds of silence, and then speaker A comes in again.
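To make the representation concrete, here is a minimal sketch of how such a speech/silence chronogram could be built from segment start and end times. The helper name and the toy segment times are illustrative assumptions; in the talk the segments come from Switchboard's forced alignments.

```python
import numpy as np

FRAME_S = 0.1  # 100 ms frames, as in the talk

def speech_activity(segments, n_frames):
    """Binary speech/silence vector from (start, end) times in seconds."""
    act = np.zeros(n_frames, dtype=np.int8)
    for start, end in segments:
        act[round(start / FRAME_S):round(end / FRAME_S)] = 1
    return act

# The example from the talk: A speaks for 4 frames, 1 frame of overlap,
# B speaks for 4 frames, 1 frame of silence, then A comes in again.
a = speech_activity([(0.0, 0.4), (0.8, 1.1)], 11)
b = speech_activity([(0.3, 0.7)], 11)
print(a)  # [1 1 1 1 0 0 0 0 1 1 1]
print(b)  # [0 0 0 1 1 1 1 0 0 0 0]
```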
0:06:24 | And once you have this sort of representation, you can of course predict speech activity very simply. You take one speaker's speech activity history; we call this speaker the target speaker. You can also, if you are interested in that, take the other person's speech activity history. And then what you do is try to predict whether the target speaker is going to be silent or speaking in the next hundred milliseconds.
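Continuing the snippet above, the windowing could look something like this; `make_examples` is a hypothetical helper, not the authors' actual code.

```python
def make_examples(act_target, act_other, k=10):
    """k frames (1 s) of both speakers' activity history as input,
    the target speaker's next frame as the prediction target."""
    X, y = [], []
    for t in range(k, len(act_target)):
        X.append(np.concatenate([act_target[t - k:t], act_other[t - k:t]]))
        y.append(act_target[t])
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)
```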
0:06:57 | This kind of model can serve as a very neat baseline onto which you can then keep adding other features, in our case pitch. And what you can do then is compare this speech-activity-only model, the baseline, against the composite model with speech activity and, in our case, pitch. And of course you can also compare the different types of pitch parameterisation with one another.
0:07:23 | Of course, the only thing you have to do before this kind of exercise is to somehow take the continuously varying pitch values and cast them into this chronogram-like, matrix-like representation. What we did here was the simplest possible solution: for each hundred-millisecond frame we calculated the average pitch in that interval, or we just left it as a missing value if there was no voicing in that interval.
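A minimal sketch of that framing step, assuming `times` and `f0` are NumPy arrays holding a pitch track of voiced samples from some pitch tracker (the talk does not name the tracker):

```python
def frame_pitch(times, f0, n_frames):
    """Mean F0 per 100 ms frame; NaN marks frames without voicing."""
    out = np.full(n_frames, np.nan)
    frame_idx = (times / FRAME_S).astype(int)
    for i in range(n_frames):
        voiced = f0[frame_idx == i]
        if voiced.size:
            out[i] = voiced.mean()
    return out
```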
0:07:57 | And then we ran those prediction experiments using quite simple feed-forward networks with a single hidden layer. For all the experiments that I am talking about here we had two units in that hidden layer; there are some more in the paper which I will not be talking about here.
0:08:13 | And you will note that this is a non-recurrent network, and there is a reason for this: since we are actually interested in the length of the usable pitch history, we want to have control over how much history the network has access to.
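A sketch of that kind of model in PyTorch; the two-unit hidden layer, the non-recurrent architecture and the binary next-frame target follow the talk, while the hidden activation and any other details are assumptions.

```python
import torch.nn as nn

k = 10  # one second of history at 100 ms frames
model = nn.Sequential(
    nn.Linear(2 * k, 2),  # both speakers' activity history in, 2 hidden units
    nn.Sigmoid(),         # assumed activation; the talk does not specify one
    nn.Linear(2, 1),
    nn.Sigmoid(),         # P(target speaker speaks in the next frame)
)
loss_fn = nn.BCELoss()    # binary cross-entropy, matching the evaluation metric
```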
0:08:33 | And before we go on: the differences were compared using cross-entropy, expressed in bits per hundred-millisecond frame. There will be a lot of comparisons here, so there will be lots of pictures; there are even more in the paper. I have taken the liberty of picking out the more boring ones, which I think is fine as long as you don't tell Kornel. So if you know them, don't tell.
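In code, the metric is ordinary binary cross-entropy taken in base 2; this is the standard definition rather than a formula given in the talk.

```python
def bits_per_frame(y, p, eps=1e-12):
    """Mean cross-entropy in bits per frame between the observed
    activity y (0/1) and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p)))
```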
0:08:54 | So, the first two questions were: first of all, is there any benefit in having access to pitch history when doing speech activity prediction? And second, what is the optimal representation of pitch values in such a system?
0:09:17 | What we do here is start with the speech-activity-only baseline, and we will be seeing this kind of picture a lot. What we have here is the training set, the dev set and the test set; these are the cross-entropy rates for all those systems; and on the x-axis is the conditioning context. So this is a system trained on one hundred milliseconds of speech activity history, and this is a system trained on one second of speech activity history. You can see that across all three sets the cross-entropies drop, as you would expect, so there is an improvement in prediction.
0:09:56 | And what we will be doing from now on is taking this guy, the system trained on one second of speech activity history of both speakers, and adding more and more pitch history. So it is always ten frames of speech activity history for both speakers, and then pitch on top of that.
0:10:22 | What we did first was just add absolute pitch, on a linear scale, in Hz. And surprisingly, even this simple pitch representation helps quite a bit: you can see that even having one frame of pitch history is already better than the baseline here, and then it improves even further and starts to settle around three hundred milliseconds.
0:10:50 | So that is good news: it seems to suggest that pitch information is somehow relevant for speech activity prediction.
0:10:56 | But clearly, representing pitch in absolute terms is a kind of laughable idea: it is completely speaker-dependent. So what you want to do is make it speaker-independent somehow; you want to do speaker normalisation. And what we did here was again the simplest thing: we just z-scored the pitch values. And surprisingly, this did not really make much of a difference, which is surprising.
0:11:27 | You would expect some improvement, but if you think about it, this actually introduces more confusion, because what z-scoring does, of course, is bring the mean to zero, and the voiceless frames are also represented as zeros in the model. So these models are just confusing those two phenomena.
0:11:49 | This can be quite easily improved by just adding another feature vector, one which is just a binary voicing feature: it is one when there is voicing and zero when there is not. This allows the model to disambiguate zeros which are due to being close to the speaker's mean from zeros which are due to voicelessness.
0:12:12 | And when you do this, you actually get quite a substantial drop in cross-entropy rates, which suggests this is a good representation. This drop was actually greater than if you add voicing on top of absolute pitch; again, that is not something I am showing here, but it is in the paper.
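A sketch of that feature pair, assuming the per-speaker mean and standard deviation are known in advance (as the talk later concedes it assumed):

```python
def pitch_features(f0_frames, mean, std):
    """Z-scored pitch plus a binary voicing flag. Voiceless (NaN)
    frames become 0 after normalisation, which collides with values
    near the speaker's mean; the flag disambiguates the two."""
    voiced = ~np.isnan(f0_frames)
    z = np.where(voiced, (f0_frames - mean) / std, 0.0)
    return np.stack([z, voiced.astype(float)])
```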
0:12:31 | And then of course you can go on and say: well, we know that pitch is really perceived on a semitone scale, that is, on a log scale. So does it actually matter if we convert the Hz values to semitones before z-scoring? And it actually does, a little bit: there is a slight improvement, which generalises to the test set.
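The Hz-to-semitone conversion is the standard twelve-semitones-per-octave mapping; the reference frequency below is an arbitrary assumption, since the subsequent z-scoring removes any constant offset anyway.

```python
def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz."""
    return 12.0 * np.log2(f0_hz / ref_hz)
```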
0:12:52 | And the last thing we asked: all along we had only been using the pitch history of the target speaker, but you can also ask whether it helps to know the pitch history of the interlocutor. And again, there is a slight but consistent improvement if you use both speakers' histories.
0:13:12 | So this is our answer, or a preliminary answer anyway, to questions number one and two.
0:13:18 | Then we have question number three, which is how far back you have to look, and for this we have this sort of diagram. The top line is as before, the speech-activity-only model, except that previously we ended here, at this blue dot, and here we extended it for another ten frames, so this model is trained on two seconds of speech activity history. You can see that it continues dropping, but a little less abruptly.
0:13:48 | This curve here is exactly the curve that we had before, so pitch plus one second of speech activity history. And this one is more and more pitch history plus two seconds of speech activity history.
0:14:08 | And this is quite interesting, actually, and a little bit puzzling, in that these curves are quite similar: they all still start settling around four hundred milliseconds, but this one is just shifted down. What this means, basically, is that the same amount of pitch history is more helpful if you have more speech activity history. That is kind of interesting; we have some ideas about it, but frankly we do not know why that is.
0:14:35 | One possibility is that it could be something to do with the backchannel versus non-backchannel distinction: those four hundred milliseconds of pitch cues might only be useful when the person has been talking for sufficiently long.
0:14:56 | Right, so as I said, there is more in the paper, but this is all I wanted to show you here. So what have we learned? The three questions were: first, does pitch help in the prediction of speech activity in dialogue? The answer is yes. What is the optimal representation? From what we have seen, it seems to be the combination of binary voicing, for the disambiguation of voicelessness, and z-score-normalised pitch on a semitone scale. And how far back should one look? Well, it seems that four hundred milliseconds of context is sufficient.
0:15:39 | We have also seen that, in terms of the absolute reduction in cross-entropy, the best-performing pitch representation resulted in a reduction corresponding to roughly seventy-five percent of the reduction you get in the speech-activity-only model when you go from one frame to ten frames. So it is quite substantial in those terms.
0:16:07 | We have also seen that four hundred milliseconds seems to be enough, which is not much if you think about the study that Kornel did in two thousand twelve, where they found that with speech activity history only you can go back as much as eight seconds and still keep improving.
0:16:29 | But on the other hand, if you think about the prosodic domain within which any kind of pitch cue could be embedded, then something on the order of magnitude of a prosodic foot, so something like four hundred milliseconds long, makes perfect sense to me.
0:16:50 | One thing we did was, of course, cheat a little bit, in that when we did the z-scoring of the pitch we used speaker means and standard deviations that we assumed were known a priori. This, of course, would not be the case if you were to run this analysis in a real-time scenario; these would then have to be estimated incrementally.
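One standard way to do that incremental estimation would be Welford's online algorithm; this is a sketch of the idea, not necessarily what the authors have in mind.

```python
class RunningStats:
    """Incrementally updated mean and standard deviation
    (Welford's algorithm), for online z-scoring of pitch."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else float("nan")
```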
0:17:17 | I want to finish here and go back to the rationale for doing all this analysis, all this playing around, which was really to come up with a better way of doing automated analysis of large speech material, and especially to be able to produce results across different corpora and make them comparable.
0:17:39 | So one thing you could do with this, for instance: we ran this on Switchboard; you can take the same thing and run it on CallHome, which is also dyadic and also telephone speech, but where the people know each other. And what you can then do is compare those results, and you can see to what extent familiarity between speakers, for instance, plays a role in how pitch is employed for turn management.
0:18:11 | And of course, and this is kind of what got Kornel and me excited about this, there is nothing that limits these things to pitch: there is nothing stopping you from doing intensity, any kind of voice quality features, or multimodal features. So this really opens the way, in a sense, for doing a lot of interesting things.
0:18:33 | And of course, in the long term, whatever you find out could potentially also be used in some sort of mixed-initiative dialogue system, but that really is something that you know more about than I do. So I will stop here. Thank you.
0:18:53 | We have plenty of time for questions.
0:19:03 | I have a hidden slide with Kornel's phone number, by the way.
0:19:08 | Perhaps I missed this, but how are you handling cases where you are not able to find the pitch, where pitch is not defined because the frame is voiceless? Do you do anything in particular?
0:19:16 | I mean, originally it is left as a missing value, but then, because of all the shenanigans that happen inside, I understand those just get transformed into zeros. So that is why there is this confusion between voicelessness and the mean pitch after z-scoring.
0:19:42 | Other questions?
0:19:50 | Thanks for the interesting talk. I was wondering: absolute pitch is very different for male voices and female voices, so I am wondering if you ran more than one model, one for male voices and one for female voices, for instance.
0:20:20 | Well, maybe, but how would that information be useful for the prediction of who is speaking in the next hundred milliseconds?
0:20:33 | Still, your result is very surprising, that absolute pitch helps. Right, I think so too, because you do not assume that speaking at a hundred and sixty-five hertz signals something all the time, right? I agree that it is surprising.
0:20:52 | But of course, if you compare the absolute pitch and the speaker-normalised pitch, there is clearly a lot that the absolute pitch misses, so there is a lot to improve on; there must be some information that is still...
0:21:15 | How do you mean? Maybe there was some kind of clustering inside the network, as if it sort of had one classifier for men and one for women.
0:21:28 | Yes, actually, I think you just keyed into my question. I am wondering how much the modelling is doing: you are proposing a certain representation, you binarise, you pre-process, but obviously the model is probably also doing something on top of that, and I am not sure if you have looked into whether you can disentangle that. Because if someone takes a different approach, say constructs features that are temporal in nature, like looking at slopes and that kind of stuff, how much of that is the model accounting for?
0:21:57 | It is hard to say; I cannot answer this. Of course you do not know what the model is actually doing, yes, absolutely. But the thing is, I think this is one way of approaching this problem while producing results which are comparable across studies.
0:22:23 | You mentioned at the beginning that flat pitch might signal turn-holding. And since you do not use a recurrent model, did you also consider taking the derivative of the pitch, not only the absolute values?
0:22:42 | No, we didn't. But is that not something that the network could potentially figure out? That is the question; I mean, I think so.
0:22:54 | A question about whether, and I do not think you have done this, but are you planning to take this outside the corpus and see whether the kinds of differentiation your models are finding might be used productively to change the behaviour of the other speaker, like if you alter the pitch?
0:23:12 | Right, if you were to generate speech, absolutely, that could be done.
0:23:18 | And the other question: I was wondering what you would need to change if it was a multi-speaker situation, not just two speakers but three or four.
0:23:29 | Possibly. This is something that we have discussed a lot. We had a paper at Interspeech in two thousand seventeen where we did this kind of modelling for respiratory data and turn-taking, and there we had three speakers, and you can absolutely do it. But then you would have another row here, and what you have to do is keep shifting those speakers around, because you do not want your model to rely on the fact that speaker B was on row two and speaker C was on row three.
0:24:13 | So with three speakers it is still doable; once you go into really multiparty settings, this just explodes. Then you would have to do it somehow differently, perhaps only take into account the speakers who were speaking within, say, the last five minutes or something, and then incrementally, dynamically, produce those subsets of speakers that you predict for.
0:24:39 | Any more questions?
0:24:48 | Just wondering whether you have looked into the granularity here: you picked a hundred milliseconds; did you look at other time windows?
0:24:56 | We did not, but I think this is a key problem that should somehow be addressed, absolutely. But the method itself is agnostic about this: whatever your pitch extraction is, it will produce different pitch tracks, and whatever your voice activity detection is, likewise. This is all, in some sense, preprocessing. But still, I think... absolutely. Absolutely.
0:25:41 | All right, let's thank our speaker again.