0:00:06 | yeah we'll come back to the session |
---|
0:00:08 | so now let's |
---|
0:00:10 | and not only from my seat deployable on recent |
---|
0:00:13 | so now we change the topic today to speaker diarization |
---|
0:00:17 | and uh |
---|
0:00:18 | now in speaker diarization |
---|
0:00:20 | uh |
---|
0:00:20 | one of the important things is to guess the number of speakers |
---|
0:00:23 | right |
---|
0:00:25 | let me give you that |
---|
0:00:26 | we have speakers |
---|
0:00:27 | and for each one |
---|
0:00:29 | you need to |
---|
0:00:30 | do the segmentation |
---|
0:00:33 | uh |
---|
0:00:35 | and uh so we have four papers and the first paper is on |
---|
0:00:39 | online diarization of telephone conversations |
---|
0:00:43 | uh presented by all three penthouse |
---|
0:00:45 | please |
---|
0:00:46 | i know |
---|
0:00:47 | giving him |
---|
0:00:49 | the topic of the presentation is online diarization of telephone conversations |
---|
0:00:54 | yeah i will begin by presenting the speaker |
---|
0:00:57 | diarization problem |
---|
0:00:59 | and afterwards shortly |
---|
0:01:01 | talk about online speaker diarization |
---|
0:01:04 | and give a short overview of current speaker diarization systems |
---|
0:01:10 | i will then present the suggested online speaker diarization system |
---|
0:01:14 | including a description and the diarization time complexity and performance |
---|
0:01:19 | and i will |
---|
0:01:20 | conclude |
---|
0:01:21 | of course |
---|
0:01:22 | with the conclusions |
---|
0:01:23 | the task of speaker diarization system is to assign temporal segments of speech |
---|
0:01:28 | to the various |
---|
0:01:29 | participants in a conversation |
---|
0:01:32 | speaker diarization basically attempts |
---|
0:01:34 | to |
---|
0:01:35 | segment |
---|
0:01:36 | and cluster the conversation |
---|
0:01:39 | such that |
---|
0:01:40 | if we look here, on the left is a manual diarization of a conversation done |
---|
0:01:44 | by a human listener |
---|
0:01:46 | and on the right |
---|
0:01:48 | automatic diarisation |
---|
0:01:49 | produced by |
---|
0:01:51 | the suggested speaker diarization system |
---|
0:01:56 | most |
---|
0:01:57 | state of the art speaker diarization systems operate in an offline manner |
---|
0:02:01 | that is |
---|
0:02:03 | conversation samples are |
---|
0:02:05 | gathered until the conversation ends |
---|
0:02:08 | followed by |
---|
0:02:08 | an application of the diarization system |
---|
0:02:11 | however |
---|
0:02:12 | for some applications such as forensic or |
---|
0:02:15 | a speech recognition |
---|
0:02:17 | online diarization could be beneficial |
---|
0:02:19 | that is if |
---|
0:02:20 | we want to |
---|
0:02:21 | apply some automatic speaker recognition system |
---|
0:02:24 | we would uh |
---|
0:02:26 | be able to see the |
---|
0:02:27 | diarization of the conversation until the point |
---|
0:02:30 | where we want to apply it |
---|
0:02:33 | online or semi online diarization can be achieved by removing |
---|
0:02:36 | or minimising the size of the |
---|
0:02:39 | uh buffer |
---|
0:02:40 | and |
---|
0:02:42 | however this |
---|
0:02:43 | introduces some |
---|
0:02:45 | difficulty to the system because |
---|
0:02:47 | the amount of data is reduced |
---|
0:02:53 | most of the offline diarization systems operate in a two stage uh process |
---|
0:02:57 | first the conversation is |
---|
0:03:00 | over segmented by some change detection algorithm |
---|
0:03:05 | and then a |
---|
0:03:07 | hierarchical clustering |
---|
0:03:09 | algorithm is applied |
---|
0:03:11 | in which |
---|
0:03:12 | segments are merged |
---|
0:03:14 | until some termination conditions are met |
---|
0:03:16 | generally the number of the |
---|
0:03:18 | final speakers in the conversation |
---|
0:03:23 | some recent approaches in uh offline diarization system |
---|
0:03:26 | include GMM-UBM |
---|
0:03:28 | based |
---|
0:03:28 | speaker modelling |
---|
0:03:30 | speaker identification clustering |
---|
0:03:32 | and the fusion of several systems with several |
---|
0:03:35 | feature sets |
---|
0:03:37 | in order to apply |
---|
0:03:38 | diarization |
---|
0:03:41 | online speaker diarization systems encountered in the literature |
---|
0:03:45 | include |
---|
0:03:46 | online gmm learning |
---|
0:03:48 | and some novelty detection algorithms applied |
---|
0:03:51 | to detecting when a new speaker appears in a conversation |
---|
0:03:56 | and a GMM-UBM |
---|
0:03:58 | based scheme |
---|
0:04:02 | most of the |
---|
0:04:03 | state of the art diarization systems |
---|
0:04:05 | online and offline |
---|
0:04:07 | encountered in the literature require some |
---|
0:04:10 | offline training of background, channel or gender |
---|
0:04:14 | models in order to apply |
---|
0:04:16 | the diarization algorithms |
---|
0:04:18 | some require several sets of features |
---|
0:04:21 | and the |
---|
0:04:23 | practically all require a large amount of the |
---|
0:04:26 | computation power |
---|
0:04:30 | the suggested online diarization system operates in a two stage process |
---|
0:04:35 | first |
---|
0:04:35 | an unsupervised algorithm is applied |
---|
0:04:38 | over an initial training segment |
---|
0:04:41 | of the conversation |
---|
0:04:43 | followed by |
---|
0:04:44 | the use of the models generated in the first stage in order to |
---|
0:04:48 | apply |
---|
0:04:49 | a |
---|
0:04:50 | viterbi segmentation of the conversation |
---|
0:04:52 | on demand |
---|
0:04:55 | that is |
---|
0:04:56 | the |
---|
0:04:57 | sound |
---|
0:04:58 | uh |
---|
0:04:59 | samples are entered into the |
---|
0:05:01 | preprocessing stage |
---|
0:05:03 | feature extraction |
---|
0:05:04 | and |
---|
0:05:05 | uh into the buffer |
---|
0:05:06 | which incorporates the uh initial training segments |
---|
0:05:10 | diarization is applied only on the initial training segment |
---|
0:05:13 | and models are generated from the initial |
---|
0:05:17 | training segment |
---|
0:05:19 | once the models are available we could |
---|
0:05:21 | apply or perform segmentation of the conversation |
---|
0:05:25 | based on these |
---|
0:05:26 | initial models |
---|
0:05:29 | however a major assumption a |
---|
0:05:31 | is that |
---|
0:05:32 | all of the speakers in the conversation must participate in this initial training segment |
---|
0:05:37 | or else |
---|
0:05:38 | then a |
---|
0:05:39 | model for these speakers will not |
---|
0:05:41 | be available |
---|
0:05:42 | for the rest of the segmentation process |
---|
0:05:46 | the first stage of the diarization |
---|
0:05:48 | is to provide a |
---|
0:05:51 | telephone conversation diarization over the initial training segment |
---|
0:05:55 | yeah |
---|
0:05:55 | in which |
---|
0:05:56 | the samples in the initial training segment are preprocessed |
---|
0:05:59 | feature extraction |
---|
0:06:01 | is applied on the initial training segment |
---|
0:06:03 | and an initial assignment algorithm |
---|
0:06:06 | that is in a conversation and let's assume a telephone conversation once we have |
---|
0:06:10 | we have |
---|
0:06:11 | successfully identified the non speech |
---|
0:06:13 | we still have two speakers |
---|
0:06:15 | to assign |
---|
0:06:16 | features to |
---|
0:06:18 | that is while we |
---|
0:06:19 | can identify the speech |
---|
0:06:21 | we must apply some kind of algorithm to assign features |
---|
0:06:25 | to either of the speakers |
---|
0:06:28 | once features are assigned to each of the speakers |
---|
0:06:30 | uh an iterative process of modelling |
---|
0:06:33 | and time series clustering |
---|
0:06:35 | is applied until termination conditions are met |
---|
0:06:39 | once termination conditions are met we can provide |
---|
0:06:43 | the segmentation |
---|
0:06:44 | modelling in this paper |
---|
0:06:46 | or in this work is done by SOM and the time series processing is done by |
---|
0:06:51 | it's some variant of the |
---|
0:06:53 | hidden markov model |
---|
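The iterative modelling and time-series clustering loop described here can be sketched as follows. The helper callables (`assign_init`, `train_model`, `viterbi_segment`) are placeholders standing in for the talk's initial assignment, SOM training and HMM resegmentation, not the authors' actual code:

```python
import numpy as np

def iterative_diarization(features, assign_init, train_model, viterbi_segment,
                          max_iters=10):
    """Alternate speaker modelling and time-series clustering until the
    frame assignment stops changing (a hypothetical termination condition)."""
    labels = assign_init(features)                       # initial assignment
    for _ in range(max_iters):
        # train one model per speaker from the frames currently assigned to it
        models = {spk: train_model(features[labels == spk])
                  for spk in np.unique(labels)}
        new_labels = viterbi_segment(features, models)   # resegmentation
        if np.array_equal(new_labels, labels):           # converged
            break
        labels = new_labels
    return labels
```

Each pass re-trains the speaker models on the current assignment and then resegments the whole conversation with them.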
0:06:56 | when we apply diarization over short segments of speech eh |
---|
0:07:00 | two main issues arise |
---|
0:07:02 | one |
---|
0:07:03 | is that a low model complexity is required |
---|
0:07:06 | because of the sparse amount of data |
---|
0:07:09 | and another problem is the clustering constraints, that is we would not like the |
---|
0:07:14 | uh |
---|
0:07:15 | segmentation |
---|
0:07:16 | to |
---|
0:07:17 | switch between speakers too rapidly; we would like to |
---|
0:07:19 | uh |
---|
0:07:20 | enforce |
---|
0:07:21 | minimum duration |
---|
0:07:21 | constraints on the time of |
---|
0:07:23 | speech |
---|
0:07:24 | for each |
---|
0:07:24 | speaker |
---|
0:07:26 | the first problem is tackled by replacing the common gmm models |
---|
0:07:30 | by a self organising map |
---|
0:07:33 | that is we train a self organising map |
---|
0:07:35 | for each of the speakers |
---|
0:07:37 | self organising maps were |
---|
0:07:40 | presented by kohonen |
---|
0:07:42 | and are composed of three main stages: the first is |
---|
0:07:46 | initialisation |
---|
0:07:48 | the second is a rough |
---|
0:07:50 | training |
---|
0:07:51 | and finally |
---|
0:07:52 | a fine tuning |
---|
0:07:54 | of the neurons or the |
---|
0:07:56 | centroids |
---|
0:07:57 | into the distribution of |
---|
0:07:59 | points |
---|
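A minimal sketch of Kohonen-style SOM training with the three stages just listed (initialisation, rough training, fine tuning). The map size, learning rates and decay schedules here are illustrative assumptions, not the values used in the presented system:

```python
import numpy as np

def train_som(data, n_neurons=8, rough_iters=200, fine_iters=200, seed=0):
    """1-D self-organising map: initialisation from data samples, a rough
    pass with a wide neighbourhood, then fine tuning with a narrow one."""
    rng = np.random.default_rng(seed)
    idx = np.linspace(0, len(data) - 1, n_neurons).astype(int)
    neurons = data[idx].astype(float).copy()          # stage 1: initialisation
    grid = np.arange(n_neurons)

    def training_pass(iters, sigma0, lr0):
        for t in range(iters):
            x = data[rng.integers(len(data))]
            bmu = np.argmin(np.linalg.norm(neurons - x, axis=1))  # best matching unit
            frac = 1.0 - t / iters                    # linearly decaying schedules
            sigma, lr = sigma0 * frac + 1e-3, lr0 * frac + 1e-3
            h = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))   # neighbourhood
            neurons[:] += lr * h[:, None] * (x - neurons)

    training_pass(rough_iters, sigma0=n_neurons / 2, lr0=0.5)     # stage 2: rough
    training_pass(fine_iters, sigma0=1.0, lr0=0.05)               # stage 3: fine tuning
    return neurons
```

The wide neighbourhood orders the map globally; the fine pass only nudges each best-matching neuron into the local point distribution.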
0:08:04 | once we have |
---|
0:08:05 | uh |
---|
0:08:06 | trained a model for each of the speakers in the conversation |
---|
0:08:09 | a we would require some means to estimate |
---|
0:08:13 | the likelihood |
---|
0:08:14 | of |
---|
0:08:15 | a new feature, that is |
---|
0:08:17 | we would like to |
---|
0:08:19 | estimate the probability or the likelihood of the uh feature observation given the model |
---|
0:08:25 | under the assumption of |
---|
0:08:27 | normality, that is each centroid in the self organising map |
---|
0:08:31 | is |
---|
0:08:32 | the mean of |
---|
0:08:33 | a |
---|
0:08:34 | gaussian |
---|
0:08:35 | probability |
---|
0:08:36 | uh with a unit covariance matrix |
---|
0:08:38 | we could apply |
---|
0:08:40 | the following equation in order to estimate |
---|
0:08:42 | the log likelihood |
---|
0:08:44 | or the minus log likelihood of the |
---|
0:08:47 | and |
---|
0:08:48 | observation |
---|
0:08:51 | note that we estimate the log likelihood only with a single neuron |
---|
0:08:55 | because |
---|
0:08:56 | generally it will |
---|
0:08:58 | contain the most |
---|
0:08:59 | to um |
---|
0:09:00 | most of the information regarding the closest |
---|
0:09:03 | observation point |
---|
0:09:07 | the joint likelihood of |
---|
0:09:09 | a set of features |
---|
0:09:11 | could be estimated by a sum |
---|
0:09:13 | of the log likelihoods of the single features |
---|
0:09:15 | given, that is |
---|
0:09:17 | that |
---|
0:09:17 | they are independent |
---|
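A sketch of the scoring just described: the minus log-likelihood of one feature under a speaker's SOM uses only the closest neuron, treated as the mean of a unit-covariance Gaussian (constant terms dropped, since they are shared by all models), and a feature set is scored by summing under the independence assumption:

```python
import numpy as np

def neg_log_likelihood(x, neurons):
    """Minus log-likelihood of one feature vector under a SOM-based model:
    only the closest neuron is used, as the mean of a unit-covariance
    Gaussian (additive constants dropped)."""
    d2 = ((neurons - x) ** 2).sum(axis=1)   # squared distance to each neuron
    return 0.5 * d2.min()                   # the closest neuron dominates

def joint_neg_log_likelihood(X, neurons):
    """Joint score of a feature set: sum of per-feature scores, assuming
    the features are independent."""
    return sum(neg_log_likelihood(x, neurons) for x in X)
```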
0:09:21 | clustering constraints are enabled using |
---|
0:09:24 | a |
---|
0:09:25 | hidden markov model, or a minimum duration hidden markov model |
---|
0:09:30 | in this model each speaker |
---|
0:09:31 | is modelled using a |
---|
0:09:33 | hyper state, that is |
---|
0:09:35 | in each hyper state we enforce a minimum duration using transitions from |
---|
0:09:40 | uh |
---|
0:09:40 | one state |
---|
0:09:42 | to another state |
---|
0:09:43 | and in this manner we could use the |
---|
0:09:46 | hidden markov model in order to enforce the minimum duration time |
---|
0:09:50 | a for each of the |
---|
0:09:51 | speakers |
---|
0:09:52 | each state in the minimum duration hidden markov model is a left to right |
---|
0:09:56 | hyper state |
---|
0:09:58 | in which SOMs are used |
---|
0:09:59 | to estimate |
---|
0:10:00 | the the log likelihood |
---|
0:10:02 | or the emission probability |
---|
0:10:03 | for each of the observation |
---|
0:10:09 | note that |
---|
0:10:10 | in the |
---|
0:10:11 | uh |
---|
0:10:12 | transition matrix of the hidden markov model, elements on the diagonal |
---|
0:10:17 | form the hyper state transition matrices |
---|
0:10:20 | and the elements that do not belong to a hyper state form the inter hyper state transition matrix |
---|
0:10:25 | and this matrix is updated as |
---|
0:10:27 | part of the training process |
---|
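The hyper-state construction can be sketched as a block-structured transition matrix: each speaker is a left-to-right chain of sub-states, so once entered, at least `min_dur` frames pass before leaving. The `p_stay` value and the uniform spread of the leave probability are illustrative choices only (in the talk this matrix is updated during training):

```python
import numpy as np

def min_duration_transitions(n_speakers, min_dur, p_stay=0.9):
    """Transition matrix for a minimum-duration HMM: one hyper state
    (a chain of min_dur sub-states) per speaker."""
    Q = n_speakers * min_dur
    A = np.zeros((Q, Q))
    for s in range(n_speakers):
        base = s * min_dur
        for d in range(min_dur - 1):
            A[base + d, base + d + 1] = 1.0     # forced forward transition
        last = base + min_dur - 1
        A[last, last] = p_stay                  # keep the current speaker
        others = [o for o in range(n_speakers) if o != s]
        for o in others:                        # leave to another speaker's entry state
            A[last, o * min_dur] = (1 - p_stay) / len(others)
    return A
```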
0:10:33 | segmentation |
---|
0:10:35 | once we have the models for each of the speakers in the hmm segmentation is applied |
---|
0:10:40 | using the viterbi |
---|
0:10:43 | time series clustering algorithm |
---|
0:10:45 | that is |
---|
0:10:47 | samples of the |
---|
0:10:49 | sound wave |
---|
0:10:50 | are entered into a buffer |
---|
0:10:51 | the initial training segment |
---|
0:10:53 | is applied for |
---|
0:10:54 | diarization |
---|
0:10:55 | and a hidden markov model is generated by the diarization system |
---|
0:11:00 | once we have this |
---|
0:11:01 | hidden markov model segmentation is applied almost |
---|
0:11:04 | instantaneously on demand |
---|
0:11:11 | viterbi algorithm |
---|
0:11:12 | computation complexity is in the order of Q squared T, where Q is the number of states in the H |
---|
0:11:19 | M M and T is the number of features |
---|
0:11:21 | uh in the conversation |
---|
0:11:24 | so that |
---|
0:11:25 | initialisation and recursion of the viterbi algorithm could be applied online that is |
---|
0:11:31 | f one |
---|
0:11:33 | which is the first feature |
---|
0:11:35 | is |
---|
0:11:35 | used to initialise |
---|
0:11:37 | the viterbi algorithm |
---|
0:11:39 | followed by f two |
---|
0:11:43 | up to f T |
---|
0:11:43 | in the recursion process |
---|
0:11:44 | once |
---|
0:11:45 | segmentation is demanded |
---|
0:11:47 | um |
---|
0:11:48 | termination |
---|
0:11:49 | and backtracking could be applied online |
---|
0:11:52 | and that is almost instantaneous |
---|
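The split described here — initialisation and recursion online, termination and backtracking on demand — can be sketched as follows. `log_emit` is an assumed callable returning per-state log emission scores (in the talk these come from the SOMs):

```python
import numpy as np

class OnlineViterbi:
    """Viterbi decoding split into an online part (one recursion step per
    incoming feature) and an on-demand part (termination + backtracking
    when a segmentation is requested)."""
    def __init__(self, log_trans, log_init, log_emit):
        self.logA, self.log_init, self.log_emit = log_trans, log_init, log_emit
        self.delta, self.psi = None, []

    def push(self, x):                        # online: O(Q^2) work per feature
        e = self.log_emit(x)
        if self.delta is None:
            self.delta = self.log_init + e    # initialisation with the first feature
        else:
            scores = self.delta[:, None] + self.logA
            self.psi.append(scores.argmax(axis=0))   # best predecessor per state
            self.delta = scores.max(axis=0) + e
        return self

    def segmentation(self):                   # on demand: termination + backtrack
        state = int(self.delta.argmax())
        path = [state]
        for bp in reversed(self.psi):
            state = int(bp[state])
            path.append(state)
        return path[::-1]
```

Because only `delta` and the backpointers are kept, answering a segmentation request is just a backward memory walk, which is why it is nearly instantaneous.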
0:11:58 | a graph |
---|
0:11:59 | stating the time required |
---|
0:12:01 | to generate the segmentation of a conversation as a function of the conversation length |
---|
0:12:05 | uh is given here |
---|
0:12:07 | and it shows that for |
---|
0:12:08 | a |
---|
0:12:09 | four hundred |
---|
0:12:10 | second conversation for example |
---|
0:12:12 | only one millisecond of |
---|
0:12:14 | computer time is required |
---|
0:12:17 | and in the current implementation of the diarization system |
---|
0:12:20 | one second of processing time is enough to diarize about |
---|
0:12:23 | seventy three seconds of audio |
---|
0:12:27 | during the first stage of the diarization |
---|
0:12:31 | in the experimentation the database used was |
---|
0:12:34 | two thousand and forty eight conversations from the nist two thousand and five speaker recognition evaluation |
---|
0:12:40 | recordings of two speaker conversations in a four wire setup which were summed |
---|
0:12:45 | and normalised in order to generate two speaker conversations |
---|
0:12:49 | and |
---|
0:12:50 | the features extracted were |
---|
0:12:52 | twelve |
---|
0:12:53 | mfcc features and twelve mfcc including |
---|
0:12:56 | delta features |
---|
0:12:59 | the entire database was first |
---|
0:13:01 | processed by the diarization system using all of the data available |
---|
0:13:05 | producing |
---|
0:13:06 | twenty percent diarization error rate and six point nine percent |
---|
0:13:10 | speaker error rate |
---|
0:13:15 | diarization error rate |
---|
0:13:17 | the way we measured it was to include |
---|
0:13:20 | all of the errors available that is |
---|
0:13:22 | speaker |
---|
0:13:23 | confusion and the uh |
---|
0:13:25 | also i mean |
---|
0:13:26 | speech and nonspeech |
---|
0:13:28 | also overlapped speech, which are segments of |
---|
0:13:32 | speakers speaking together |
---|
0:13:34 | was also considered as an error |
---|
0:13:36 | for the speaker error rate |
---|
0:13:38 | and |
---|
0:13:39 | we actually eliminated |
---|
0:13:41 | the nonspeech in both of the segmentations |
---|
0:13:44 | in order to generate only the speaker confusion |
---|
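The two scores can be sketched at frame level. This is a simplification, not the official NIST md-eval scoring tool, and it assumes hypothesis speaker labels have already been mapped to reference labels; the exact handling of nonspeech removal is an interpretation of the talk:

```python
import numpy as np

NONSPEECH = -1  # label used for silence frames

def frame_error_rates(ref, hyp):
    """Frame-level scoring: the diarization error counts every mislabelled
    frame, speech/nonspeech confusion included; the speaker error first
    discards frames that are nonspeech in both segmentations, leaving
    only speaker confusion."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    der = float(np.mean(ref != hyp))
    speech = (ref != NONSPEECH) & (hyp != NONSPEECH)   # speech in both
    spk_err = float(np.mean(ref[speech] != hyp[speech]))
    return der, spk_err
```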
0:13:49 | the diarization error rate as a function of the initial segment length |
---|
0:13:53 | it's shown to |
---|
0:13:55 | approach |
---|
0:13:56 | the optimal of the |
---|
0:13:57 | eh |
---|
0:13:58 | performance obtained by applying the diarization system over the entire conversation |
---|
0:14:04 | as we can see |
---|
0:14:05 | for |
---|
0:14:08 | say one hundred and twenty seconds, or two minutes, of initial training segment we achieve twenty four |
---|
0:14:15 | percent diarization error rate |
---|
0:14:17 | and the |
---|
0:14:19 | this |
---|
0:14:21 | behaviour |
---|
0:14:22 | is also presented in the |
---|
0:14:24 | evaluation of the speaker error rate |
---|
0:14:30 | it seems that given |
---|
0:14:32 | two minutes of initial training segment the diarization error rate is |
---|
0:14:35 | sufficiently close |
---|
0:14:36 | uh to the diarization error rate obtained by applying this segmentation |
---|
0:14:40 | the diarization over the entire conversation |
---|
0:14:43 | and using one hundred and twenty |
---|
0:14:44 | seconds of the initial training segment |
---|
0:14:47 | we could obtain twenty three |
---|
0:14:49 | to twenty four percent diarization error rate |
---|
0:14:51 | and |
---|
0:14:52 | ten point six |
---|
0:14:54 | percent |
---|
0:14:54 | speaker error rate |
---|
0:14:56 | while using one hundred and eighty |
---|
0:14:58 | seconds of initial training segment |
---|
0:15:00 | provides twenty two point three percent diarization error rate |
---|
0:15:02 | and about ten percent speaker error rate |
---|
0:15:05 | the delta features |
---|
0:15:07 | eh |
---|
0:15:07 | did not |
---|
0:15:08 | provide an improved performance |
---|
0:15:14 | to conclude |
---|
0:15:14 | a suggested online speaker diarization system |
---|
0:15:17 | uh was presented |
---|
0:15:19 | and it was shown that using as few as |
---|
0:15:21 | one hundred twenty seconds of conversation and we could apply |
---|
0:15:25 | and provide |
---|
0:15:26 | segmentation of the conversation |
---|
0:15:28 | by an increase of |
---|
0:15:29 | four percent |
---|
0:15:30 | when compared to the diarization error rate obtained by applying the diarization system |
---|
0:15:35 | over the entire conversation |
---|
0:15:37 | furthermore |
---|
0:15:38 | for purposes of robustness and simplicity |
---|
0:15:40 | gmm models were replaced by a self organising map |
---|
0:15:46 | a um |
---|
0:15:47 | and |
---|
0:15:48 | we assume no prior information regarding the speakers or the conversation, that is we use |
---|
0:15:53 | no background models of any kind |
---|
0:15:56 | yeah |
---|
0:15:56 | in order to apply |
---|
0:15:58 | diarization |
---|
0:15:59 | and no parameters are required |
---|
0:16:01 | to be trained offline |
---|
0:16:03 | and in order to apply diarization |
---|
0:16:07 | thank you |
---|
0:16:14 | take some questions |
---|
0:16:37 | oh |
---|
0:16:44 | no |
---|
0:16:45 | uh |
---|
0:16:46 | well as opposed to some initialisation |
---|
0:16:48 | uh maybe i missed what is the length of the segment |
---|
0:16:52 | that you get into the SOM |
---|
0:16:55 | okay that's fine |
---|
0:16:56 | we've done this |
---|
0:16:57 | experiment using a variable length of initial training segment that is |
---|
0:17:01 | assuming you are |
---|
0:17:03 | one hundred and twenty seconds of initial training segment |
---|
0:17:06 | some of which belong to speaker A, some of which belong to speaker B |
---|
0:17:10 | and some belongs to non speech |
---|
0:17:12 | that is the the the exact amount of features |
---|
0:17:15 | belonging to each of the speakers was not measured because it's a it's a |
---|
0:17:19 | function of the initialisation algorithm |
---|
0:17:22 | okay |
---|
0:17:22 | but um i i mean |
---|
0:17:24 | what you know |
---|
0:17:26 | do also |
---|
0:17:27 | self organising map |
---|
0:17:29 | is using the short segments |
---|
0:17:31 | from this initialisation |
---|
0:17:32 | yeah |
---|
0:17:33 | and do you have a fixed |
---|
0:17:34 | flanks |
---|
0:17:36 | for the for the segments or is it |
---|
0:17:38 | so the uh |
---|
0:17:40 | segmented okay |
---|
0:17:41 | okay |
---|
0:17:49 | here |
---|
0:17:51 | okay |
---|
0:17:53 | the initial training segment |
---|
0:17:55 | diarization is actually applied on the initial training segment |
---|
0:17:59 | that is |
---|
0:17:59 | first |
---|
0:18:00 | speech or nonspeech is |
---|
0:18:02 | uh detected |
---|
0:18:03 | nonspeech is removed and then the segments belonging |
---|
0:18:05 | uh |
---|
0:18:06 | to speech are |
---|
0:18:07 | distributed among the two speakers |
---|
0:18:10 | in the conversation |
---|
0:18:11 | the distribution of the features to each of the speakers as a function of the initialisation algorithm |
---|
0:18:17 | which is a variant of the K means |
---|
0:18:20 | a clustering algorithm |
---|
0:18:23 | so |
---|
0:18:24 | the exact amount of features assigned to each of the |
---|
0:18:27 | speakers |
---|
0:18:28 | eh |
---|
0:18:28 | was not measured |
---|
0:18:31 | okay |
---|
0:18:32 | um i have a note on the question about the overlapping speech you said that you |
---|
0:18:37 | um overlapping speech in the responses but |
---|
0:18:40 | you score it as an error |
---|
0:18:42 | yeah and that you did not take it |
---|
0:18:44 | into account so we |
---|
0:18:45 | always and they're only one way to |
---|
0:18:47 | yeah and do you have an idea of the amount |
---|
0:18:50 | appeal is that it |
---|
0:18:51 | yes to to your result |
---|
0:18:52 | we have used two databases for uh diarization and |
---|
0:18:56 | the one used here was two thousand and forty eight conversations from the nist |
---|
0:19:00 | uh the |
---|
0:19:01 | two thousand and five speaker recognition evaluation |
---|
0:19:03 | and |
---|
0:19:04 | if |
---|
0:19:04 | i correctly remember it was about |
---|
0:19:07 | three point eight |
---|
0:19:09 | percent |
---|
0:19:09 | of overlapped speech |
---|
0:19:11 | and in average |
---|
0:19:12 | okay |
---|
0:19:14 | like |
---|
0:19:21 | i also have two questions first |
---|
0:19:22 | have you evaluated the degradation you get |
---|
0:19:25 | from replacing the gaussian model with the |
---|
0:19:27 | the uh SOM model |
---|
0:19:29 | and secondly |
---|
0:19:30 | um |
---|
0:19:32 | uh could you i mean you want to use the initial |
---|
0:19:35 | you know so many seconds |
---|
0:19:36 | for for building your your uh |
---|
0:19:39 | you're speaker clusters |
---|
0:19:40 | a could you just redo that every so often i mean most uh |
---|
0:19:45 | machines this dataset more than once if you record |
---|
0:19:47 | uh you can continue doing online segmentation and in the background you can recompute your |
---|
0:19:53 | speaker clusters |
---|
0:19:54 | you know every |
---|
0:19:55 | uh thirty seconds or something like that |
---|
0:19:57 | of course |
---|
0:19:58 | for the first question |
---|
0:19:59 | we have examined |
---|
0:20:01 | self organising maps and gmm models for diarization |
---|
0:20:04 | in papers presented previously |
---|
0:20:07 | that is |
---|
0:20:08 | gmm and SOM for diarization |
---|
0:20:10 | in our studies experiments |
---|
0:20:12 | presented the same performance |
---|
0:20:14 | so we didn't find any reason to use a gmm |
---|
0:20:18 | especially because the training process for the SOM |
---|
0:20:21 | is a lot |
---|
0:20:22 | faster |
---|
0:20:24 | and |
---|
0:20:24 | basically |
---|
0:20:25 | for us more robust |
---|
0:20:27 | for a second question |
---|
0:20:29 | a paper was submitted to interspeech |
---|
0:20:32 | that does |
---|
0:20:34 | exactly that |
---|
0:20:38 | so i |
---|
0:20:39 | two questions |
---|
0:20:40 | here |
---|
0:20:41 | one is the um |
---|
0:20:43 | comment about each set |
---|
0:20:44 | being used |
---|
0:20:45 | yeah |
---|
0:20:46 | it is the first |
---|
0:20:47 | you get good performance going |
---|
0:20:49 | first hundred twenty seconds |
---|
0:20:50 | your initial |
---|
0:20:51 | thing |
---|
0:20:52 | at the door |
---|
0:20:52 | the files are only |
---|
0:20:54 | i mean for |
---|
0:20:54 | five minutes long you're using |
---|
0:20:56 | yeah |
---|
0:20:56 | percent of the data |
---|
0:20:57 | yeah |
---|
0:20:58 | do you think it is realistic to go halfway through a conversation |
---|
0:21:02 | absolutely |
---|
0:21:04 | not because just |
---|
0:21:05 | and |
---|
0:21:07 | if we use about a thirty |
---|
0:21:10 | thirty seconds of the data in order to initialise the diarization |
---|
0:21:14 | the performance |
---|
0:21:15 | degrades, that is |
---|
0:21:16 | uh |
---|
0:21:17 | i mean |
---|
0:21:17 | we get like a thirty three percent diarization error rate and |
---|
0:21:21 | about |
---|
0:21:23 | twenty |
---|
0:21:25 | four percent speaker error rate |
---|
0:21:27 | the amount of data |
---|
0:21:29 | required for the initial training by the diarization system |
---|
0:21:32 | is quite large |
---|
0:21:34 | so |
---|
0:21:36 | if we have |
---|
0:21:37 | uh the possibility to retrain the system online as the conversation goes |
---|
0:21:42 | it would be great |
---|
0:21:43 | that's exactly what we present |
---|
0:21:44 | in |
---|
0:21:45 | in the next |
---|
0:21:47 | uh paper in this |
---|
0:21:48 | session |
---|
0:21:48 | did you see the link that was also it's just to name one |
---|
0:21:52 | they're looking at things like that |
---|
0:21:54 | oh well |
---|
0:21:55 | oh |
---|
0:21:56 | we use that |
---|
0:21:57 | well what where the conversation although ten minutes |
---|
0:22:01 | let's not the duration issues knots of its duration |
---|
0:22:04 | structure |
---|
0:22:04 | right |
---|
0:22:05 | you conversations between street |
---|
0:22:08 | they |
---|
0:22:08 | i take it turns you take |
---|
0:22:10 | you know it's |
---|
0:22:11 | the |
---|
0:22:11 | duty cycle |
---|
0:22:12 | very |
---|
0:22:14 | if you look |
---|
0:22:15 | variation |
---|
0:22:16 | variance |
---|
0:22:17 | format you like |
---|
0:22:17 | i mean |
---|
0:22:18 | E R |
---|
0:22:19 | you know |
---|
0:22:20 | there it should be fine |
---|
0:22:21 | if someone dominates |
---|
0:22:22 | first part |
---|
0:22:23 | conversation you know well |
---|
0:22:25 | exactly |
---|
0:22:25 | that's so |
---|
0:22:26 | and i also think in the call home and call friend |
---|
0:22:30 | but the actually |
---|
0:22:31 | more than |
---|
0:22:32 | two people getting |
---|
0:22:33 | yeah |
---|
0:22:33 | yeah |
---|
0:22:34 | two people on one side getting on |
---|
0:22:36 | sharing |
---|
0:22:37 | um so you have more |
---|
0:22:38 | realistic |
---|
0:22:39 | action |
---|
0:22:39 | so what |
---|
0:22:40 | yeah the point |
---|
0:22:41 | in |
---|
0:22:42 | maybe you had this |
---|
0:22:43 | i |
---|
0:22:43 | really |
---|
0:22:45 | type your address in the |
---|
0:22:46 | online |
---|
0:22:47 | what |
---|
0:22:47 | online |
---|
0:22:48 | what you compare this to so for example the window has |
---|
0:22:52 | at published papers we did this workshop that's me |
---|
0:22:55 | exactly this task |
---|
0:22:56 | you start out blindly |
---|
0:22:58 | you start building up doing online |
---|
0:23:00 | did you use that the baseline |
---|
0:23:02 | did you |
---|
0:23:02 | formant |
---|
0:23:03 | okay |
---|
0:23:03 | yeah |
---|
0:23:05 | no i i think uh to |
---|
0:23:07 | to |
---|
0:23:07 | two papers |
---|
0:23:09 | a which perform this online diarization task |
---|
0:23:12 | but mostly of broadcast news |
---|
0:23:15 | not on telephone i believe |
---|
0:23:17 | so |
---|
0:23:26 | this very little |
---|
0:23:27 | a problem |
---|
0:23:29 | yeah |
---|
0:23:30 | i would we have yeah |
---|
0:23:32 | um you know |
---|
0:23:33 | yeah |
---|
0:23:36 | thank you |
---|
0:23:43 | wanted to know if you have some idea |
---|
0:23:46 | two |
---|
0:23:46 | detect a new cluster, a new speaker; would the system be able to |
---|
0:23:51 | add a cluster |
---|
0:23:52 | during decoding |
---|
0:23:54 | yeah |
---|
0:23:55 | our diarization system is |
---|
0:23:57 | only oriented to telephone conversations between two speakers, that is we already assumed that the number of speakers is |
---|
0:24:04 | too |
---|
0:24:05 | but |
---|
0:24:05 | i have encountered some ideas |
---|
0:24:07 | eh |
---|
0:24:09 | part of which use the leader follower algorithm which is a practically very simple |
---|
0:24:14 | that is the distance |
---|
0:24:15 | all |
---|
0:24:16 | if we take a segmented conversation and take a new segment |
---|
0:24:20 | you can take the distance to the |
---|
0:24:22 | current models you have |
---|
0:24:24 | and if the distance from all of them is over a certain threshold |
---|
0:24:28 | then you |
---|
0:24:29 | create a new model |
---|
0:24:31 | you say that |
---|
0:24:32 | this is a new speaker |
---|
0:24:33 | and you train a new model for it |
---|
0:24:35 | and use it |
---|
0:24:35 | in order to a cluster the conversation |
---|
0:24:38 | later on |
---|
0:24:40 | when you come to the end of the conversation you could also use |
---|
0:24:43 | the distance matrix between the models |
---|
0:24:46 | in order to um |
---|
0:24:47 | merge models which are |
---|
0:24:49 | very very close |
---|
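The leader-follower idea described in this answer (an idea the speaker mentions, not the presented two-speaker system) can be sketched as follows; the `distance`, `new_model` callables and the threshold are illustrative placeholders:

```python
import numpy as np

def leader_follower(segments, distance, new_model, threshold):
    """Leader-follower clustering for open-set diarization: each new
    segment is compared with the existing models; if even the closest
    model is farther than the threshold, a new speaker model is spawned."""
    models, labels = [], []
    for seg in segments:
        dists = [distance(seg, m) for m in models]
        if not dists or min(dists) > threshold:
            models.append(new_model(seg))        # unseen speaker: new model
            labels.append(len(models) - 1)
        else:
            labels.append(int(np.argmin(dists))) # assign to closest model
    return labels, models
```

As the answer notes, models that end up very close to each other could afterwards be merged using their distance matrix.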
0:25:03 | uh |
---|
0:25:04 | i want to make one comment |
---|
0:25:07 | uh when you say that uh |
---|
0:25:09 | uh |
---|
0:25:10 | two minutes |
---|
0:25:11 | out of five; in real life |
---|
0:25:13 | we never know what will be the length |
---|
0:25:15 | of the |
---|
0:25:16 | conversation |
---|
0:25:18 | it can be four minutes |
---|
0:25:19 | it can be ten minutes |
---|
0:25:21 | so |
---|
0:25:21 | yeah |
---|
0:25:22 | to mean |
---|
0:25:23 | oh |
---|
0:25:24 | no |
---|
0:25:25 | oh |
---|
0:25:29 | right |
---|
0:25:30 | just |
---|
0:25:34 | action |
---|
0:25:34 | finished |
---|
0:25:35 | before the meeting |
---|
0:25:37 | just make it |
---|
0:25:38 | cation |
---|
0:25:42 | yeah |
---|
0:25:55 | hmmm |
---|
0:25:58 | no |
---|
0:26:00 | i don't agree because we |
---|
0:26:02 | uh |
---|
0:26:03 | two minutes for initialisation |
---|
0:26:05 | does it matter if after the |
---|
0:26:07 | the |
---|
0:26:08 | yeah the computation |
---|
0:26:09 | before |
---|
0:26:10 | one means more or |
---|
0:26:12 | when you mean |
---|
0:26:13 | do you do need it i mean to me |
---|
0:26:15 | you should |
---|
0:26:16 | them |
---|
0:26:17 | and |
---|
0:26:18 | doesn't matter |
---|
0:26:19 | a piece to me |
---|
0:26:21 | online |
---|
0:26:21 | king |
---|
0:26:22 | the results |
---|
0:26:23 | no matter what |
---|
0:26:24 | and |
---|
0:26:37 | so |
---|
0:26:38 | so |
---|
0:26:39 | if you one day |
---|
0:26:40 | four |
---|
0:26:41 | fig |
---|
0:26:42 | the conversation |
---|
0:26:43 | so |
---|
0:26:43 | oh |
---|
0:26:44 | fig |
---|
0:26:45 | well |
---|
0:26:45 | i see |
---|
0:26:46 | second |
---|
0:26:47 | to initiate |
---|
0:26:48 | then |
---|
0:26:48 | yeah |
---|
0:26:49 | can have better without |
---|
0:26:52 | on the |
---|
0:26:53 | i think you be more if we if we just need to know how many how |
---|
0:26:57 | how many iterations until you get |
---|
0:26:59 | sufficient statistics |
---|
0:27:00 | to cover both speakers |
---|
0:27:03 | right |
---|
0:27:04 | cindy |
---|
0:27:05 | fig |
---|
0:27:07 | this is not i'm |
---|
0:27:08 | it is not only the percentage of the conversation, it's a matter of |
---|
0:27:12 | the amount of statistics required to train |
---|
0:27:14 | the two speaker models, right |
---|
0:27:16 | that is |
---|
0:27:17 | if the conversation would last |
---|
0:27:19 | for half an hour after the two minutes |
---|
0:27:22 | unless the channel changes in such a manner that the models are no longer valid |
---|
0:27:28 | the result will be the same |
---|
0:27:31 | but |
---|
0:27:31 | you are correct we have examined payment |
---|
0:27:34 | in order |
---|
0:27:34 | show that and that we wanted |
---|
0:27:36 | right i think we do not have anything to speak you know what i mean |
---|
0:27:40 | right |
---|
0:27:43 | so you so you have an online system, but i suspect you actually don't; i suspect that your online system |
---|
0:27:47 | is actually an offline system |
---|
0:27:51 | okay |
---|
0:27:52 | do you know what |
---|
0:27:53 | anything before you reach the end of the file |
---|
0:27:57 | in any point |
---|
0:27:58 | where we get results |
---|
0:28:01 | that's the output of that |
---|
0:28:02 | diarisation system |
---|
0:28:05 | do you but you do use an hmm |
---|
0:28:07 | i do have an original |
---|
0:28:09 | so you are deferring your decisions |
---|
0:28:14 | so you output the history as soon as it resolves to a single |
---|
0:28:19 | uh so the |
---|
0:28:20 | the |
---|
0:28:21 | resolves to a single path |
---|
0:28:23 | uh the diarization is all on user request |
---|
0:28:25 | that is |
---|
0:28:26 | yeah |
---|
0:28:27 | using the hmm |
---|
0:28:30 | in order to provide diarization results |
---|
0:28:33 | i only need |
---|
0:28:34 | perform termination and backtrack |
---|
0:28:36 | and this could be done using |
---|
0:28:38 | one millisecond of processing time |
---|
0:28:41 | this stage |
---|
0:28:42 | can all be done online |
---|
0:28:44 | that is initialisation using only the first feature |
---|
0:28:47 | and the rest of the features in the recursion stage |
---|
0:28:52 | for any new feature |
---|
0:28:55 | termination and backtracking is only memory access, so i could provide results |
---|
0:29:02 | instantaneously |
---|
0:29:04 | what |
---|
0:29:04 | instantaneously before the uh uh |
---|
0:29:07 | hmm resolves to a single path |
---|
0:29:09 | yeah |
---|
0:29:10 | hmmm |
---|
0:29:11 | you know |
---|
0:29:11 | what i really want to say is i think that |
---|
0:29:13 | this uh online offline distinction a distinction is really a red herring |
---|
0:29:18 | but |
---|
0:29:19 | uh |
---|
0:29:19 | it would be better i think to um |
---|
0:29:23 | talk about the |
---|
0:29:25 | uh |
---|
0:29:26 | allowed |
---|
0:29:27 | deferral |
---|
0:29:28 | the allowed deferral time before a decision needs to be made |
---|
0:29:31 | uh you know |
---|
0:29:33 | you you you make a distinction between online and offline, but really what you're doing is you're conflating it with that |
---|
0:29:38 | particular approach |
---|
0:29:40 | where |
---|
0:29:40 | you |
---|
0:29:42 | uh |
---|
0:29:42 | create models with an initial segment |
---|
0:29:45 | that |
---|
0:29:48 | to my way of thinking |
---|
0:29:48 | that |
---|
0:29:52 | um doesn't really make the distinction between what is online and what is offline; if i would call it semi online |
---|
0:29:57 | it would be okay with you |
---|
0:29:59 | oh what what i would like |
---|
0:30:00 | to see you |
---|
0:30:01 | oh |
---|
0:30:03 | a specification of the uh |
---|
0:30:05 | amount of time |
---|
0:30:06 | that the decision is allowed to be deferred |
---|
0:30:10 | and uh you know and if you do that then um |
---|
0:30:14 | the um |
---|
0:30:15 | for an offline system the deferral time would be infinite |
---|
0:30:19 | and for an online system the deferral time would be |
---|
0:30:23 | something that is demanded by the application yeah |
---|
0:30:27 | i |
---|
0:30:28 | online |
---|
0:30:29 | definition of the system |
---|
0:30:30 | of the client |
---|
0:30:40 | oh |
---|
0:30:45 | but that |
---|