0:00:14 | Hello everyone. |
0:00:16 | I'm working at the Idiap Research Institute, and I |
0:00:19 | will be presenting the talk on modeling overlapping speech using vector Taylor series. |
0:00:24 | This work has been done together with my co-authors. |
0:00:30 | First, I will present the motivation for working on this problem. |
0:00:35 | Then I will briefly discuss the previous approaches for detecting overlapping speech. |
0:00:41 | Then I will move to the vector Taylor series approach, |
0:00:44 | which has two parts: the first part uses the standard VTS approach, and the next part is the multiclass VTS algorithm, which we have proposed in this work. Then we will discuss the experiments and results. |
0:00:59 | So the motivation comes from the problem of speaker diarization, |
0:01:03 | which is the task of determining who spoke when in a meeting audio recording. |
0:01:07 | So, given an audio recording, you want to find the different portions which belong to each of the speakers. |
0:01:15 | One challenge is that the number of speakers is not known a priori, |
0:01:18 | so it has to be determined in an unsupervised manner. |
0:01:24 | Now, in this task, overlapping speech becomes a very big source of error. |
0:01:32 | First I will define overlapping speech: it occurs whenever two or more speakers speak simultaneously. It might happen when people are debating or arguing, |
0:01:45 | or when they are agreeing or disagreeing, or when they are laughing together. |
0:01:52 | So what happens is that when you have overlapping speech in your audio, you cannot model the target speaker models very precisely. |
0:02:01 | Or, when you are doing speaker recognition and you assign one speaker identity to a portion in which there are actually two people speaking, that also results in errors in speaker recognition. |
0:02:14 | Previous studies have shown that in meetings sometimes more than twenty percent of the spoken time can be overlapping, if the participants are very active. |
0:02:28 | Now, the previous approaches. One of the first works was done by Boakye, |
0:02:32 | in which an HMM was used to segment the audio into three classes: speech, non-speech, and overlapping speech. |
0:02:39 | This was the baseline. |
0:02:40 | Then people have used knowledge such as the silence distribution, |
0:02:45 | and cues such as speaker changes, because it has been found that people tend to overlap when the speaker changes. |
0:02:53 | The state of the art is based on convolutive non-negative sparse coding, in which |
0:03:00 | they have learned a basis for each speaker, |
0:03:05 | and then they try to find the activity of each speaker in each frame. |
0:03:09 | They have also used the same features with long short-term memory (LSTM) neural networks. |
0:03:18 | Now we come to our problem. |
0:03:20 | Before I move to overlapping speech, there is the analogous problem of how to model speech which is corrupted by noise. |
0:03:28 | So if you have a noisy speech signal y, you can express it in the signal domain as |
0:03:33 | x convolved with h plus n, where x is your clean speech, |
0:03:37 | h is the channel noise, and n is the additive noise. |
0:03:43 | So in the MFCC domain: these are the mel-scale filter-bank power spectra; |
0:03:49 | you take the log and the DCT, and then you get the MFCC features. |
0:03:53 | So in the MFCC domain, this simple expression here |
0:03:57 | becomes a quite complex expression, in which you have a linear part and a nonlinear part. |
0:04:04 | Here C is the DCT matrix and C-pseudoinverse is the pseudoinverse of that matrix. |
0:04:09 | We call this nonlinear part g, |
0:04:12 | so you have y equal to x plus h plus this nonlinear part. |
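For reference, a reconstruction of the relations being described here, following the standard VTS formulation (inferred from context; the slide itself is not reproduced in this transcript):

```latex
% Signal domain: clean speech x, channel h, additive noise n
y[t] = x[t] \ast h[t] + n[t]
% Cepstral (MFCC) domain, with DCT matrix C and pseudoinverse C^{\dagger}:
y = x + h + \underbrace{C \log\!\big(1 + e^{\,C^{\dagger}(n - x - h)}\big)}_{g(x,h,n)}
```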
0:04:16 | Now, we want to model this equation, and we use the vector Taylor series |
0:04:23 | to expand this expression here. |
0:04:26 | The Taylor series is simply an expansion of a given function about a point, where you |
0:04:31 | have the zeroth-order term, |
0:04:33 | and then the first-order term, for which you take the first derivative. |
0:04:37 | So we expand this expression for the noisy speech. |
0:04:41 | We expand it around the point (mu_x, mu_n, mu_h), which are the mean of the clean speech, the mean of the additive noise, and the mean of the |
0:04:49 | channel noise. |
0:04:51 | So you get this expression here, in which the first line |
0:04:54 | is the evaluation of y at this expansion point, |
0:04:58 | and the second line is the first-order term. |
0:05:01 | Here, this capital G and this capital F |
0:05:07 | are the derivatives of y with respect to x and n. |
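A hedged reconstruction of the first-order expansion being described, in the notation of the standard VTS literature (not copied verbatim from the slide):

```latex
y \;\approx\; \mu_x + \mu_h + g(\mu_x,\mu_h,\mu_n)
   \;+\; G\,(x-\mu_x) \;+\; G\,(h-\mu_h) \;+\; F\,(n-\mu_n),
\qquad G = \frac{\partial y}{\partial x},\quad F = \frac{\partial y}{\partial n}
```

Because g depends only on the difference n - x - h, these Jacobians satisfy G = I - F.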
0:05:15 | So in the standard vector Taylor series, when you are trying to model |
0:05:21 | this y here, |
0:05:23 | what people do is that they fit a GMM for x, |
0:05:27 | and a single Gaussian for the additive noise n; |
0:05:29 | this is because the noise is assumed stationary, |
0:05:32 | and h is the channel noise. |
0:05:37 | So the clean-speech GMM is corrupted by the additive noise, and then, using the VTS approximation, you can obtain the noisy-speech GMM. |
0:05:49 | These are Gaussians, so they look alike, but they are the components of the resulting GMM. |
0:05:57 | Now we come to the overlapping speech. What we propose is that overlapping speech is actually just a superposition of two or more individual speakers. |
0:06:06 | So if we look at the model for the noisy speech, we can make an analogous model for overlapping speech, in which this x is x1, |
0:06:16 | which we call the main speaker, |
0:06:18 | and this x2 here is the corrupting speaker, |
0:06:21 | which acts like the additive noise. |
0:06:24 | For simplicity we ignore the channel noise h, because |
0:06:30 | the recordings for all the speakers come from the same room, |
0:06:34 | so we are not going to deal with h here. |
0:06:38 | So, following the analogy, we have this expression, in which the overlapping speech is now a combination of this linear term plus this nonlinear term, |
0:06:50 | and this nonlinear term remains the same as in the case of the noisy speech. |
0:06:56 | Again, analogous to the noisy-speech case, we have the main-speaker GMM here, |
0:07:01 | and the corrupting speaker, which is represented by a single Gaussian here, like the additive noise. |
0:07:07 | The equations are completely analogous to those for the noisy speech, |
0:07:11 | and you can see here, from the subscript m, that each component of y is computed using the corresponding component from the main speaker, plus some contribution from the corrupting speaker. |
0:07:26 | And this G and F, which are the derivatives of y, are also different for each component. |
0:07:35 | Now, if you take the expectation of this y here, you get the mean for the overlapping speech, and likewise the variance for the overlapping speech. This is the final overlapping speech model which we want to estimate. |
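In the same assumed notation, dropping the channel term as the speaker does, the per-component statistics of the overlapping-speech model would read:

```latex
\mu_{y,m} = \mu_{x_1,m} + g(\mu_{x_1,m}, \mu_{x_2}), \qquad
\Sigma_{y,m} = G_m\,\Sigma_{x_1,m}\,G_m^{\top} + F_m\,\Sigma_{x_2}\,F_m^{\top}
```

where m indexes the main-speaker GMM components and G_m, F_m are the Jacobians evaluated at the component mean and the corrupting-speaker mean.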
0:07:53 | Now, for estimating that model, we are going to use the EM algorithm, for which this is the Q function. |
0:08:00 | So, given the overlapping speech data, frames x_1 to x_T, |
0:08:05 | we model the likelihood of this data using the overlapping speech model, (mu_y,m, Sigma_y,m), |
0:08:15 | and we optimize this function Q |
0:08:17 | with respect to the mean of the corrupting speaker, mu_x2. |
0:08:21 | So this is the update equation for the mean: mu_x2(0) is the previous value of the corrupting-speaker mean, and mu_x2 is the new value. |
0:08:34 | One thing that you can notice here is that this mean mu_x2 represents the corrupting speaker only coarsely, because it is being updated |
0:08:42 | using all the mixture components |
0:08:45 | from the overlapping speech model. |
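A sketch of the update being referred to, modeled on the standard VTS noise-mean re-estimation (the exact slide equation is not in the transcript); gamma_m(t) denotes the posterior of component m at frame t:

```latex
\mu_{x_2} = \mu_{x_2}^{0}
 + \Big(\sum_{t}\sum_{m} \gamma_m(t)\, F_m^{\top}\Sigma_{y,m}^{-1}F_m\Big)^{-1}
   \sum_{t}\sum_{m} \gamma_m(t)\, F_m^{\top}\Sigma_{y,m}^{-1}\big(y_t - \mu_{y,m}\big)
```

The sum over all components m is what makes this single mean a coarse representation of the corrupting speaker.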
0:08:49 | So the whole VTS algorithm goes something like this. Initially, we initialize the mean of the corrupting speaker and its covariance. |
0:08:59 | Then we compute the overlapping speech model using these expressions. |
0:09:04 | After that we run the EM loop, in which we optimize the Q function |
0:09:08 | and replace the old corrupting-speaker mean mu_x2(0) by the new mean mu_x2. |
0:09:15 | In this work we are not going to update Sigma_x2, because it is very heavy computationally. |
0:09:24 | Then, when this loop converges, we finally get the overlapping speech model for y, |
0:09:30 | which we use for overlapping speech detection. |
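To make the loop concrete, here is a minimal Python sketch; the helper names (compensate_vts, posteriors, update_corrupting_mean) are hypothetical stand-ins for the steps described above, not functions from the paper:

```python
def vts_overlap_model(frames, main_gmm, mu_x2, cov_x2, n_iters=10):
    """Iterative VTS-EM estimation of the overlapping speech model.

    frames   -- (T, D) MFCC frames from one analysis window
    main_gmm -- GMM of the main speaker
    mu_x2    -- (D,) initial mean of the corrupting speaker
    cov_x2   -- (D, D) corrupting-speaker covariance; kept fixed,
                as in the talk, to limit computation
    """
    for _ in range(n_iters):
        # Compensate the main-speaker GMM with the corrupting-speaker
        # statistics (VTS approximation): yields mu_y,m and Sigma_y,m.
        overlap_gmm, F = compensate_vts(main_gmm, mu_x2, cov_x2)  # hypothetical
        # E-step: posterior gamma[t, m] of each component per frame.
        gamma = posteriors(overlap_gmm, frames)                   # hypothetical
        # M-step: closed-form update of the corrupting-speaker mean
        # (see the update equation above); Sigma_x2 is not re-estimated.
        mu_x2 = update_corrupting_mean(frames, overlap_gmm, F, gamma, mu_x2)
    return overlap_gmm
```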
0:09:36 | So, the overlapping speech detection system. As input it takes the meeting audio recordings, |
0:09:42 | and the recordings are first divided into speech segments, which we get using a speech activity detection system. |
0:09:48 | Then one major task is to get the initial speaker models for the main speaker and the corrupting speaker; the question is how to get those. |
0:09:56 | So there are two options: either you can use the oracle speaker segmentation, |
0:10:01 | or we take the models from the diarization output. |
0:10:04 | The latter is a much more challenging task, because when you take the speaker alignments from the diarization output, |
0:10:12 | you don't know how many speakers there actually are in your audio, |
0:10:16 | so you might get more than the actual number of speakers in the diarization output. |
0:10:22 | The output which we finally want is the detection of the overlaps. |
0:10:28 | Now, given the audio recording, this blue box shows a speech segment given by the speech activity detection. |
0:10:37 | We move a sliding analysis window over it. |
0:10:40 | For each analysis window we can have n-squared hypotheses: in an overlap there would be two speakers overlapping, so if you have n speakers, then the total number of overlapping speech models is n squared minus n |
0:10:55 | (for example, with four speakers that is twelve ordered speaker pairs), and the remaining n are the single-speaker models, for when only one speaker is speaking. |
0:11:01 | This is a huge number, so what we do is that for each speech segment we first determine the main speaker, and then we compute the overlapping speech models |
0:11:11 | in which that main speaker is being corrupted by |
0:11:13 | each of the other speakers. |
0:11:16 | Finally, we have the overlap models, in which the speaker |
0:11:21 | i is being corrupted by speaker j, |
0:11:23 | and the single-speaker models, |
0:11:25 | in which speaker i is speaking alone. We compare all these likelihood ratios to determine whether we have overlapping speech or single-speaker speech. |
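A minimal sketch of this per-window decision, assuming each model exposes an average log-likelihood scorer (the names and the exact decision rule are illustrative, not from the paper):

```python
def classify_window(frames, single_models, overlap_models, threshold=0.0):
    """Decide between single-speaker and overlapping speech for one window.

    single_models  -- {i: model of speaker i alone}
    overlap_models -- {(i, j): model of speaker i corrupted by speaker j}
    Each model is assumed to provide avg_loglik(frames).
    """
    # Main speaker: the single-speaker model with the highest likelihood.
    main = max(single_models, key=lambda i: single_models[i].avg_loglik(frames))
    single_score = single_models[main].avg_loglik(frames)

    # Best overlap hypothesis that keeps this main speaker.
    pairs = {j: m for (i, j), m in overlap_models.items() if i == main}
    if not pairs:
        return ("single", main, None)
    best_j = max(pairs, key=lambda j: pairs[j].avg_loglik(frames))

    # Likelihood-ratio test: a positive margin means overlap.
    if pairs[best_j].avg_loglik(frames) - single_score > threshold:
        return ("overlap", main, best_j)
    return ("single", main, None)
```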
0:11:37 | So, up to here, that was the standard vector Taylor series framework. Now we move to the multiclass vector Taylor series algorithm. You will have seen that in the standard VTS we used only one single Gaussian distribution for the noise. |
0:11:52 | That might be fine in the case when we are dealing with noise, but in the case of overlapping speech, |
0:11:58 | the corrupting source, the corrupting speaker, is himself a human being, |
0:12:04 | so it's not like noise: he might utter multiple phonemes within that window. |
0:12:09 | So we want to represent him using more detailed, |
0:12:15 | better modeling. |
0:12:16 | So what we propose is that, instead of having one single Gaussian here, we assume that all the Gaussians in the GMM of x2 are also present. |
0:12:29 | So now we are going to apply the vector Taylor series to this combination of two GMMs, with this GMM for the corrupting speaker. |
0:12:40 | So what we do here is that we first start with the assumption that each of these Gaussians might have been active in that analysis window. |
0:12:49 | Then, for each of the Gaussians, we compute a gamma value, which is the average number of frames assigned to that Gaussian component in that analysis window. |
0:12:58 | If this gamma value happens to be lower than a threshold gamma-zero, |
0:13:03 | then we cluster it with the nearest Gaussian component. |
0:13:06 | So we get this kind of clustering, |
0:13:11 | and then we say that the Gaussian which has the highest gamma in this cluster becomes the cluster centroid. So, where previously all these components would have been compensated by one single Gaussian, now each of these Gaussians is compensated by the centroid of its cluster. |
0:13:30 | We can make that assumption because all these Gaussian mixture models have been derived from the same UBM. |
0:13:39 | So, in the overlapped speech y, |
0:13:42 | the Gaussian here |
0:13:43 | would be computed using the Gaussian here plus a contribution from the corrupting speaker, |
0:13:50 | from this component. |
0:13:53 | If you set gamma-zero to zero, then you don't set any threshold and there will be no clustering, |
0:13:59 | and each Gaussian would be combined one-to-one to give you the overlapped speech. |
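A minimal Python sketch of the clustering step, under assumed details (greedy merging of weak components into the nearest surviving one; the paper's exact procedure may differ):

```python
import numpy as np

def cluster_components(means, gamma, gamma0):
    """Merge weak corrupting-speaker components into nearby strong ones.

    means  -- (M, D) component means of the corrupting-speaker GMM
    gamma  -- (M,) average frames assigned to each component in the window
    gamma0 -- threshold in frames; gamma0 = 0 disables clustering
    Returns centroid[m]: index of the cluster centroid for component m.
    """
    means, gamma = np.asarray(means), np.asarray(gamma)
    centroid = np.arange(len(means))   # initially every component is its own centroid
    strong = gamma >= gamma0           # components that survive on their own
    if not strong.any():               # degenerate window: keep the best component
        strong[np.argmax(gamma)] = True
    strong_idx = np.flatnonzero(strong)
    for m in np.flatnonzero(~strong):
        # Attach each weak component to the nearest surviving component;
        # the highest-gamma member of a cluster then acts as its centroid.
        d = np.linalg.norm(means[strong_idx] - means[m], axis=1)
        centroid[m] = strong_idx[np.argmin(d)]
    return centroid
```

With gamma0 = 0 every component stays its own centroid, which reproduces the one-to-one combination case mentioned above.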
0:14:09 | The equations for the mean update in the case of the multiclass VTS, which we show here, are the same as in the previous case; the only difference is that |
0:14:17 | now you have a subscript c here, which denotes the cluster: |
0:14:20 | for each cluster you have a different centroid, |
0:14:23 | and that centroid is updated using this equation. |
0:14:27 | And, as I showed in the previous slide, |
0:14:31 | there this mean was being computed using all the Gaussian components, but now this equation only takes into account |
0:14:40 | the Gaussians which are in the cluster c. |
0:14:46 | Similarly, all the other equations are identical, the only difference being that, instead of having the single-Gaussian representation for x2, we now do everything cluster-wise, |
0:14:57 | so you have a subscript c everywhere. |
0:15:02 | So that's the multiclass vector Taylor series algorithmic framework. |
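By analogy with the single-Gaussian update shown earlier (again an inference, not the verbatim slide), the cluster-wise centroid update restricts the component sum to cluster c:

```latex
\mu_{x_2,c} = \mu_{x_2,c}^{0}
 + \Big(\sum_{t}\sum_{m \in c} \gamma_m(t)\, F_m^{\top}\Sigma_{y,m}^{-1}F_m\Big)^{-1}
   \sum_{t}\sum_{m \in c} \gamma_m(t)\, F_m^{\top}\Sigma_{y,m}^{-1}\big(y_t - \mu_{y,m}\big)
```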
0:15:07 | Now coming to the experiments. The dataset we ran on is the AMI corpus, |
0:15:11 | which is a meeting dataset. |
0:15:12 | The meetings are like this: there is a group of three or four people who are trying to design a remote control or something similar, so they are |
0:15:20 | discussing, arguing, debating, |
0:15:22 | and the meeting durations vary from seventeen to fifty-seven minutes. |
0:15:27 | The audio which we take |
0:15:28 | is from a single distant microphone, which is the most difficult condition. |
0:15:34 | We use MFCC features, |
0:15:36 | and for the single-speaker models we use MAP-adapted |
0:15:41 | GMMs. |
0:15:45 | Now, the error measure that we use is called the overlap detection error, which is the false alarm time plus the miss time, |
0:15:51 | divided by the labeled speaker-overlap time. One thing to note is that the false alarms come from the regions where only a single speaker is speaking, |
0:15:59 | and those regions are much larger than the overlapping speech, |
0:16:03 | so this whole expression can take values over a hundred percent. |
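In symbols (notation assumed for clarity):

```latex
\text{Overlap detection error} \;=\;
\frac{T_{\text{false alarm}} + T_{\text{miss}}}{T_{\text{overlap}}} \times 100\%
```

Since the single-speaker regions that generate false alarms are much longer than the total overlap time, this ratio is not bounded by one hundred percent.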
0:16:11 | The first experiment which we did was using the standard VTS, where we have only |
0:16:15 | one Gaussian representing the corrupting speaker. |
0:16:18 | We wanted to determine the analysis window size which would work the best. |
0:16:22 | So we found that |
0:16:24 | when using a window size of three point two seconds, the error rate was |
0:16:28 | lower compared to the smaller windows like these. |
0:16:32 | Above this, |
0:16:33 | the error rate did not change that much, |
0:16:36 | but the computation time instead increased a lot, because then you are applying the same computational burden |
0:16:41 | over a larger window. |
0:16:44 | So in the next experiments we are going to use this window size. |
0:16:50 | These are the curves for the previous table, showing recall against precision, and the curve on the top is the one with the window size of three point two seconds. |
0:17:03 | So now the results for the multiclass VTS. In the standard VTS, |
0:17:07 | the overlap detection error rate was ninety-six point two percent. |
0:17:11 | When we used the multiclass VTS, |
0:17:13 | it improved by an absolute value of sixteen percent. |
0:17:17 | These four experiments were done to determine |
0:17:22 | what would be an optimal value of the clustering threshold. |
0:17:26 | So, in a window of three point two seconds we have three hundred |
0:17:30 | and twenty frames, |
0:17:31 | and if we set a |
0:17:34 | threshold of five frames for each Gaussian, |
0:17:38 | then these values here denote how much clustering happened; I mean, we start |
0:17:44 | from sixty-four clusters in the beginning, |
0:17:46 | and if we have a gamma-zero of five, then we end up with around ten clusters. |
0:17:50 | We found that the best results were |
0:17:52 | when we were using a threshold of one frame. |
0:17:55 | In that case |
0:17:56 | the overlap detection error reduces to eighty percent, |
0:18:00 | which is quite good, |
0:18:02 | and the final number of clusters that we get is |
0:18:07 | twenty-four point seven: so, beginning with sixty-four, we end up having twenty-four |
0:18:10 | point seven clusters on average. |
0:18:14 | So, |
0:18:15 | as I said, we have two different options for modeling the speakers: |
0:18:20 | either we model the speakers from the oracle segmentation, |
0:18:23 | or we model the speakers from the diarization output. |
0:18:26 | In the oracle case, the speaker models are quite pure to begin with, so |
0:18:29 | that's why |
0:18:30 | the results there are quite good. |
0:18:32 | But when we start with the diarization output, |
0:18:35 | we might get, say, seven target speakers |
0:18:38 | when there are actually only four speakers, so |
0:18:41 | that brings a set of problems; given that, |
0:18:43 | the error rate is ninety-three point three percent, |
0:18:46 | which is still better than the standard VTS approach. |
0:18:53 | So these are the curves from the previous table, |
0:18:56 | but added is the curve when using the diarization system. The diarization system works in a totally unsupervised manner, and the final goal we have is to take this diarization-based curve and |
0:19:09 | improve it |
0:19:10 | up to this point, which is with the oracle models. |
0:19:12 | So we are trying to reduce this gap. |
0:19:16 | Comparing to the other works: |
0:19:18 | the MFCC-GMM system, which was proposed by Boakye, |
0:19:24 | works at ninety-two point four percent, |
0:19:27 | while the state of the art, which uses LSTM networks, works at |
0:19:31 | seventy-six point nine percent. |
0:19:33 | The best error that we have in this work is eighty percent, |
0:19:37 | but that is when using the oracle models; |
0:19:40 | the completely unsupervised system works at an error rate of ninety-three point three percent. |
0:19:49 | So, |
0:19:50 | coming |
0:19:52 | to |
0:19:54 | the conclusions: we have proposed a new approach for modeling overlapping speech, |
0:19:59 | and we extended the VTS framework to the multiclass VTS |
0:20:04 | system. We analyzed the analysis window and found that three point two seconds |
0:20:09 | works better, |
0:20:11 | and then we were able to reach |
0:20:14 | precisions of up to around seventy percent. |
0:20:17 | One thing to note here is that in the LSTM approach |
0:20:21 | they had very good precision, but in our case the recall is much better |
0:20:25 | than that. |
0:20:28 | The future work which we want to do is to include the covariance adaptation and the delta features, and in the case of the diarization output we want better speaker models |
0:20:36 | to |
0:20:37 | use. |
0:20:38 | After that, we also extended this work for an Interspeech submission, |
0:20:43 | in which we make use of the diarization output iteratively. |
0:20:45 | So we have been able to improve these numbers: from eighty down to seventy-eight, |
0:20:50 | and this ninety-three point three down to eighty-nine. |
0:20:53 | But still, going from eighty-nine to seventy-six, we have to work for that. |
0:20:59 | So although we cannot say that it is working on par with the |
0:21:03 | state-of-the-art system, we think that this is a very promising approach, |
0:21:06 | and this can maybe also be used for other tasks, for example if you want |
0:21:11 | to model speech corrupted with noise, but noise which is much more complex. |
0:21:18 | With that, thank you. |
0:21:34 | [Question] So, I'm having problems understanding: when you go from ninety-six to ninety-three percent |
0:21:38 | error, that's a big improvement, |
0:21:41 | and I'm not questioning that. |
0:21:44 | What might help, I guess, is if someone could do a test |
0:21:48 | like this: what is the performance that you think |
0:21:52 | is necessary for a usable system to work? |
0:21:56 | You said seventy-six is kind of the state of the art. |
0:21:59 | Has anyone done any test where maybe you take clean data that doesn't have |
0:22:03 | any overlap at all, |
0:22:05 | and insert certain controlled amounts of overlap, |
0:22:09 | where you can run that performance metric and decide whether humans, or whether the |
0:22:14 | subsequent diarization system, |
0:22:18 | find it acceptable when it hits, you know, an error rate of fifty percent? I'm |
0:22:23 | not sure what number you actually have to hit before you can say it's a viable |
0:22:27 | solution, because going from ninety-six to ninety-three a year, |
0:22:32 | the numbers just seem too high to make it practically usable. |
0:22:38 | [Answer] Okay, so for the first part: I'm not aware of any work |
0:22:42 | where they have artificially created or inserted overlaps in the audio. |
0:22:47 | But, |
0:22:48 | the main purpose of doing all this is to improve the speaker |
0:22:52 | diarization system, so we want to know whether it is finally useful for improving diarization. |
0:23:01 | The state of the art using LSTMs had an error rate of seventy-six point nine, but |
0:23:06 | I think in that paper they have not given the diarization error rate which |
0:23:10 | they achieved using that system. |
0:23:12 | For our system, |
0:23:15 | we have a paper at Interspeech where we also present |
0:23:19 | the effect of this overlap detection on diarization, |
0:23:24 | in the case where we have eighty-nine percent error. |
0:23:27 | So, |
0:23:29 | this value of ninety-three point three, we have been able to reduce it to |
0:23:32 | eighty-nine percent, and when we used that system for diarization we had a marginal improvement |
0:23:37 | over the baseline. |
0:23:39 | So I hope that when someone, or when we, get the overlap detection error |
0:23:44 | rate |
0:23:45 | below eighty, it would give quite a significant improvement in diarization. |
0:23:59 | [Question, partly inaudible] ... Could this handle |
0:24:07 | more speakers at once? |
0:24:09 | And the second question: how do you define who is the main speaker and who is the corrupting one? |
0:24:17 | [Answer] So, |
0:24:19 | for the first part: that's a valid question. |
0:24:21 | That was an assumption, to keep the number of models low, and |
0:24:25 | we have seen |
0:24:28 | the overlaps, and I don't remember the exact values, but |
0:24:33 | unless people are laughing together or having a very uncontrolled |
0:24:38 | meeting or discussion, they do not tend to speak all |
0:24:41 | together; rather, |
0:24:43 | when one speaker is speaking and then some other speaker starts speaking, at |
0:24:47 | that moment you might have an overlap of two speakers. |
0:24:50 | And this particular formulation of VTS, |
0:24:56 | at this moment we cannot extend it to three speakers, |
0:25:00 | because in the formulation we are assuming one additive noise source. |
0:25:06 | And could you repeat the second question? |
0:25:12 | Okay, |
0:25:13 | so for the main speaker: we have speaker models for all the speakers, |
0:25:18 | so we directly use them |
0:25:20 | to find out which one gives the highest likelihood for that analysis window, |
0:25:25 | and we use that to determine the main speaker. |
0:25:31 | [Question] I'm just wondering about the inter-annotator agreement on this task; it seems to |
0:25:36 | be a very difficult task, even for humans. |
0:25:38 | So, are those numbers in the range of the inter-annotator agreement? |
0:25:43 | I mean, |
0:25:44 | do you have any idea on this point? |
0:25:47 | [Answer] The annotations which we have come from ICSI, and I have seen the |
0:25:51 | annotations; they are quite accurate. Even the small overlaps, like backchannels, |
0:25:56 | have been annotated. |
0:25:59 | But I'm not sure about the inter-annotator agreement. |