0:00:26 | so this is the second talk |
---|
0:00:29 | from idiap, again about speaker diarization; what we are trying to focus on here is a multistream approach |
---|
0:00:35 | and the baseline technique which we are using |
---|
0:00:40 | is the same as in the previous talk, which is the |
---|
0:00:43 | information bottleneck system |
---|
0:00:46 | and, as the previous speaker said, we are trying to look at the |
---|
0:00:50 | combination of the outputs, or actually a combination of different feature streams at different levels |
---|
0:00:56 | and these are only acoustic streams, so no prior information |
---|
0:01:00 | from other sources |
---|
0:01:02 | again |
---|
0:01:04 | and the third point here |
---|
0:01:07 | is that this work was mainly done by deepu |
---|
0:01:12 | so i am presenting it for him |
---|
0:01:14 | and |
---|
0:01:16 | the introduction and motivation |
---|
0:01:18 | are the same, or |
---|
0:01:19 | kind of close |
---|
0:01:21 | again we assume that |
---|
0:01:26 | the recordings which we are working with are recorded with multiple distant microphones |
---|
0:01:31 | as for the features, what we are using are two kinds of acoustic features |
---|
0:01:37 | mfcc features, which are kind of standard |
---|
0:01:39 | and then |
---|
0:01:40 | time delay of arrival (tdoa) features |
---|
0:01:45 | which are pretty complementary to mfcc |
---|
0:01:49 | and people nowadays use them quite a lot |
---|
0:01:53 | for diarization |
---|
0:01:56 | actually this combination |
---|
0:01:58 | of acoustic features |
---|
0:02:01 | within the information bottleneck technique |
---|
0:02:06 | achieves state-of-the-art results in meeting diarization |
---|
0:02:12 | so back to our motivation: usually the feature streams are combined at the model level |
---|
0:02:19 | so |
---|
0:02:19 | there are separate models, for example gmm models |
---|
0:02:23 | for the different feature streams |
---|
0:02:25 | and this is the usual way |
---|
0:02:29 | the log-likelihoods are then combined |
---|
0:02:32 | with some weighting |
---|
0:02:34 | and there are also some other approaches, like voting schemes between |
---|
0:02:38 | these diarization systems |
---|
0:02:41 | or the initialisation |
---|
0:02:44 | of one system is done on the output of the other system, or some other integrated approach |
---|
0:02:49 | our actual question is |
---|
0:02:52 | whether these two kinds of different acoustic features |
---|
0:02:56 | can be integrated using independent diarization systems |
---|
0:02:59 | rather than independent |
---|
0:03:01 | models, or in other words |
---|
0:03:03 | whether there is some advantage in using a system combination rather than |
---|
0:03:07 | a model combination |
---|
0:03:08 | what we mean by system and model combination will hopefully become clear |
---|
0:03:14 | in a few slides |
---|
0:03:19 | so maybe a word about the outline of the talk: first let me say a few words |
---|
0:03:25 | about the information bottleneck principle which we use |
---|
0:03:28 | and which is actually applied to single-stream diarization, so no combination of features yet |
---|
0:03:34 | and also a few words about the model-based combination |
---|
0:03:38 | the system-based combination, and some hybrid combination |
---|
0:03:40 | and then the experiments and results |
---|
0:03:43 | again, we get state-of-the-art results using |
---|
0:03:48 | this information bottleneck technique |
---|
0:03:53 | we are getting state-of-the-art results with such a system and there is not too much |
---|
0:03:59 | computational complexity in it |
---|
0:04:02 | so this is kind of the advantage |
---|
0:04:04 | now, how does it work |
---|
0:04:06 | this information bottleneck principle |
---|
0:04:08 | it is actually a kind of intuitive approach which has been borrowed from |
---|
0:04:13 | document clustering, so |
---|
0:04:15 | at the beginning suppose that we have some documents that we want to cluster into |
---|
0:04:20 | C clusters |
---|
0:04:21 | in our terminology |
---|
0:04:27 | and what is actually |
---|
0:04:28 | added as additional information is some variable Y which is going to be |
---|
0:04:33 | of interest |
---|
0:04:35 | we call it the set of relevance variables, which should tell us |
---|
0:04:39 | something about this clustering; so in |
---|
0:04:43 | document clustering these Y variables can be |
---|
0:04:47 | the words |
---|
0:04:49 | of the vocabulary, which |
---|
0:04:51 | of course tell us something about these |
---|
0:04:55 | documents and carry information about |
---|
0:04:57 | the clusters |
---|
0:04:59 | also we actually assume that there is a conditional distribution p(y|x), so Y given |
---|
0:05:04 | X is available |
---|
0:05:07 | and going back to our problem of speaker diarisation |
---|
0:05:11 | our X is actually a set of elements |
---|
0:05:15 | of the speech, so again |
---|
0:05:17 | speech segments |
---|
0:05:20 | again from a uniform segmentation, as we said |
---|
0:05:23 | and these need to be |
---|
0:05:25 | clustered into C clusters |
---|
0:05:29 | so the information bottleneck principle states |
---|
0:05:32 | that the clustering should be preserving as much information as possible between |
---|
0:05:38 | C and Y |
---|
0:05:40 | while minimizing the distortion; this distortion we can see as |
---|
0:05:45 | a compression term, for example |
---|
0:05:48 | or, in our |
---|
0:05:49 | words, it is actually some regularization, so if you do not have |
---|
0:05:54 | this distortion term |
---|
0:05:56 | which in our terms is |
---|
0:05:59 | the mutual information |
---|
0:06:00 | between X and C, I(X;C) |
---|
0:06:03 | if you do not have it, the clustering is probably going to |
---|
0:06:06 | collapse into one global cluster, which is not the case we want |
---|
0:06:11 | so again this is |
---|
0:06:13 | the intuitive approach |
---|
0:06:15 | but in the end it turns out |
---|
0:06:18 | or it can be proved |
---|
0:06:20 | that if we actually |
---|
0:06:24 | optimize this objective function, which is again |
---|
0:06:27 | the mutual information between C and Y |
---|
0:06:30 | minus |
---|
0:06:32 | some trade-off factor times the mutual information between X and C |
---|
0:06:35 | we are going to |
---|
0:06:39 | move the problem to a space where |
---|
0:06:42 | the distributions |
---|
0:06:45 | these p(y|x), are going to be |
---|
0:06:49 | compared using a simple divergence |
---|
0:06:53 | and that is the point: we do not need to look for some |
---|
0:06:55 | special distance measure which says |
---|
0:06:58 | which clusters we should be merging together |
---|
0:07:02 | in this intuitive approach |
---|
0:07:04 | if we do the derivation we will find out that actually the jensen-shannon divergence should be |
---|
0:07:10 | used for clustering |
---|
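(For reference, a minimal sketch of the agglomerative information bottleneck formulation as it is usually written; the notation below is the standard one and is assumed, not copied from the slides.)

```latex
% objective: preserve information about the relevance variables Y while
% compressing the segments X into the clusters C (beta is the trade-off)
\mathcal{F} = I(C;Y) - \frac{1}{\beta}\, I(X;C)

% greedy agglomerative step: merging clusters c_i and c_j costs approximately
\Delta\mathcal{F}(c_i,c_j) = \big(p(c_i)+p(c_j)\big)\,
    \overline{JS}\big(p(y\mid c_i),\, p(y\mid c_j)\big)
% where \overline{JS} is the Jensen-Shannon divergence weighted by the cluster
% priors (the 1/beta compression term is omitted here); at each iteration the
% pair with the smallest cost is merged
```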
0:07:11 | so in the end the approach is pretty simple |
---|
0:07:16 | and greedy |
---|
0:07:17 | here it is actually agglomerative |
---|
0:07:20 | so in each iteration |
---|
0:07:24 | we are merging two clusters together based on the information |
---|
0:07:28 | from this divergence, so we take those clusters which have |
---|
0:07:32 | the smallest divergence and we just merge them |
---|
0:07:34 | and we do that iteratively |
---|
0:07:37 | until some stopping criterion is reached |
---|
0:07:39 | the stopping criterion |
---|
0:07:40 | is again pretty simple and it is actually a normalized |
---|
0:07:44 | version of |
---|
0:07:47 | the mutual information between C and Y |
---|
0:07:51 | so again, to somehow |
---|
0:07:55 | finalize this approach |
---|
0:07:59 | the algorithm is greedy, we have a stopping criterion, and we have actually |
---|
0:08:05 | a way to measure |
---|
0:08:07 | the similarity between clusters |
---|
0:08:11 | and it is pretty simple |
---|
0:08:12 | to code it |
---|
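(A rough Python sketch of the greedy merging just described. This is not the authors' implementation: the uniform segment prior, the omission of the 1/beta term, and stopping at a fixed number of clusters instead of the normalized mutual information criterion mentioned above are simplifying assumptions.)

```python
import numpy as np

def weighted_js(p, q, w_p, w_q):
    """Jensen-Shannon divergence of distributions p and q, weighted by priors."""
    pi_p, pi_q = w_p / (w_p + w_q), w_q / (w_p + w_q)
    m = pi_p * p + pi_q * q
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return pi_p * kl(p, m) + pi_q * kl(q, m)

def aib_cluster(p_y_given_x, n_clusters):
    """Greedy agglomerative IB: start with one cluster per speech segment and
    repeatedly merge the pair of clusters with the smallest information loss.
    (The system described above stops on a normalized mutual information
    threshold; here we simply stop at a fixed number of clusters.)"""
    n = p_y_given_x.shape[0]
    clusters = [[i] for i in range(n)]          # hard partition: segment ids
    p_c = np.full(n, 1.0 / n)                   # cluster priors (uniform segments)
    p_y_c = p_y_given_x.astype(float)           # p(y|c), one row per cluster
    while len(clusters) > n_clusters:
        best_pair, best_cost = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # information loss of merging c_i and c_j (1/beta term omitted)
                cost = (p_c[i] + p_c[j]) * weighted_js(
                    p_y_c[i], p_y_c[j], p_c[i], p_c[j])
                if cost < best_cost:
                    best_pair, best_cost = (i, j), cost
        i, j = best_pair
        w = p_c[i] + p_c[j]
        p_y_c[i] = (p_c[i] * p_y_c[i] + p_c[j] * p_y_c[j]) / w   # merged p(y|c)
        p_c[i] = w
        clusters[i] += clusters[j]
        clusters.pop(j)
        p_c = np.delete(p_c, j)
        p_y_c = np.delete(p_y_c, j, axis=0)
    return clusters
```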
0:08:17 | just a bit more information about those probabilities which appear here |
---|
0:08:21 | first, the probability of C given X, where C is a cluster and X |
---|
0:08:27 | is an input segment |
---|
0:08:28 | is going to be a hard |
---|
0:08:30 | partition, meaning |
---|
0:08:32 | each X belongs only to one cluster |
---|
0:08:34 | there is no |
---|
0:08:35 | soft weighting between several clusters |
---|
0:08:39 | and then the probability of Y given C, which is actually |
---|
0:08:43 | the relevance-variable |
---|
0:08:46 | distribution |
---|
0:08:47 | which is what is actually used to do the merging |
---|
0:08:55 | everything should be clearer |
---|
0:08:58 | in this picture |
---|
0:08:59 | so suppose we have input speech which is uniformly segmented |
---|
0:09:04 | with for example mfcc features, in this single-stream |
---|
0:09:07 | approach |
---|
0:09:09 | these segments are our elements |
---|
0:09:11 | X |
---|
0:09:12 | and the relevance variables |
---|
0:09:13 | i still did not say what they are, but |
---|
0:09:15 | it is probably intuitive: in |
---|
0:09:17 | our case it is just a universal background model |
---|
0:09:20 | trained on the entire speech |
---|
0:09:23 | and this actually defines the variables over which we do everything |
---|
0:09:28 | so what you see in the middle are posterior vectors p(y|x), which |
---|
0:09:33 | are the |
---|
0:09:33 | probabilities |
---|
0:09:35 | of the relevance variables Y given |
---|
0:09:37 | the input segments |
---|
0:09:42 | then the clustering, which is again the agglomerative technique, and in the end we get some initial segmentation |
---|
0:09:48 | and finally we do a refinement |
---|
0:09:51 | training a gmm and doing viterbi decoding |
---|
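(A minimal sketch of how such segment posteriors p(y|x) over the UBM components could be computed, here using scikit-learn; the UBM size, the segment length, and the simple frame-posterior averaging are assumptions for illustration, not the exact recipe from the paper.)

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_posteriors(features, seg_len=250, n_components=32):
    """Train a UBM (GMM) on all frames and return, for each uniform segment,
    the averaged posterior over the UBM components: one p(y|x) row per segment.
    features: (n_frames, n_dims) array of e.g. MFCC frames."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(features)                                  # UBM on the entire speech
    frame_post = ubm.predict_proba(features)           # p(y | frame)
    n_segs = len(features) // seg_len
    p_y_x = np.stack([
        frame_post[s * seg_len:(s + 1) * seg_len].mean(axis=0)
        for s in range(n_segs)
    ])
    return p_y_x / p_y_x.sum(axis=1, keepdims=True)    # each row sums to one
```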
0:09:58 | now let us go back to |
---|
0:09:59 | the feature combination |
---|
0:10:02 | so in the case of |
---|
0:10:04 | feature combination based on the background models, suppose that we have two |
---|
0:10:09 | features, again mfcc and tdoa |
---|
0:10:13 | and we have two background models |
---|
0:10:15 | each trained on one of these features |
---|
0:10:18 | what we can simply do is |
---|
0:10:19 | just linearly weight |
---|
0:10:23 | these p(y|x) |
---|
0:10:25 | vectors of probabilities |
---|
0:10:27 | with some weight |
---|
0:10:28 | and this is going to give us a new matrix |
---|
0:10:31 | of these probabilities for the segments |
---|
0:10:34 | as for these weights |
---|
0:10:36 | how do we get them: of course we train them, or estimate them, on development data, so |
---|
0:10:41 | they should generalize to different data |
---|
0:10:45 | once we have this p(y|x) matrix, the rest of the diarization system is the same |
---|
0:10:49 | we actually act just at the beginning, where we combine these |
---|
0:10:53 | probability matrices |
---|
0:10:54 | and then we just do the iterative |
---|
0:10:58 | clustering as before |
---|
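(A minimal sketch of this model-level combination, assuming the two posterior matrices were computed as in the previous snippet; the weighted concatenation over the union of both streams' components is one way such a linear weighting can be realized, and the 0.7 weight for the mfcc stream is only illustrative, taken from the weights mentioned later in the talk.)

```python
import numpy as np

def combine_model_level(p_y_x_mfcc, p_y_x_tdoa, w_mfcc=0.7):
    """Model-level (feature-level) combination: the combined relevance-variable
    set is taken as the union of the two streams' UBM components, and the two
    posterior matrices are linearly weighted before any clustering is done.
    Rows of each input matrix sum to one, so rows of the output do too."""
    return np.hstack([w_mfcc * p_y_x_mfcc,
                      (1.0 - w_mfcc) * p_y_x_tdoa])

# the combined matrix is then clustered exactly as in the single-stream case,
# e.g. clusters = aib_cluster(combine_model_level(p_mfcc, p_tdoa), n_clusters=4)
```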
0:11:00 | actually this is not new, this has already been |
---|
0:11:03 | published, at a previous interspeech |
---|
0:11:06 | this is just again a diagram of how it is done |
---|
0:11:10 | again there is a matrix |
---|
0:11:11 | of these p(y|x) |
---|
0:11:15 | probability vectors |
---|
0:11:17 | and they are simply |
---|
0:11:18 | combined by a weighted average |
---|
0:11:21 | and then there is the clustering operation and the refinement |
---|
0:11:25 | now, what is actually new, and what we are trying in this paper, is |
---|
0:11:30 | multiple system combination |
---|
0:11:33 | so, instead of doing the combination before clustering, what would happen if we do the combination after clustering |
---|
0:11:41 | again, assume that there are two background models |
---|
0:11:44 | trained on the different features |
---|
0:11:46 | and there are two diarization systems in the end, so |
---|
0:11:50 | we actually iteratively |
---|
0:11:52 | get some clusters |
---|
0:11:53 | the stopping criterion actually can be different |
---|
0:11:56 | meaning |
---|
0:11:57 | we can have a different number of clusters |
---|
0:11:59 | for each feature stream |
---|
0:12:02 | in the end we get these p(y|x) |
---|
0:12:06 | or p(y|c), actually |
---|
0:12:09 | and then to go back |
---|
0:12:11 | from these clusters to an initial segmentation |
---|
0:12:14 | that is, to get back the p(y|x) |
---|
0:12:16 | all we have to do is just a simple bayesian operation |
---|
0:12:21 | again, there is |
---|
0:12:23 | an image of how this is done |
---|
0:12:25 | so again there are two diarization systems |
---|
0:12:29 | which each do the complete clustering |
---|
0:12:32 | and in the end we are again getting |
---|
0:12:37 | some clusters, and to actually get back |
---|
0:12:40 | to these initial-segment p(y|x) |
---|
0:12:43 | we just apply this simple operation |
---|
0:12:47 | and simply |
---|
0:12:48 | integrate over all the clusters C |
---|
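(A minimal sketch of this system-level combination under the same assumptions as the earlier snippets: each stream has been clustered on its own, the segment posteriors are rebuilt from the cluster-level distributions via p(y|x) = sum_c p(y|c) p(c|x), which with the hard partition simply becomes p(y|c(x)), and only then are the two streams weighted together; the names and the weight value are illustrative.)

```python
import numpy as np

def posteriors_from_clusters(p_y_given_c, labels):
    """Rebuild segment-level posteriors from cluster-level ones:
    p(y|x) = sum_c p(y|c) p(c|x); with a hard partition p(c|x) is an
    indicator, so each segment just inherits its cluster's distribution."""
    return p_y_given_c[labels]          # (n_segments, n_relevance_vars)

def combine_system_level(p_y_c_mfcc, labels_mfcc,
                         p_y_c_tdoa, labels_tdoa, w_mfcc=0.8):
    """System-level combination: each stream has already been clustered
    independently; the recovered p(y|x) are now estimated from whole
    clusters (lots of data) rather than from short segments."""
    p_mfcc = posteriors_from_clusters(p_y_c_mfcc, labels_mfcc)
    p_tdoa = posteriors_from_clusters(p_y_c_tdoa, labels_tdoa)
    return np.hstack([w_mfcc * p_mfcc, (1.0 - w_mfcc) * p_tdoa])

# the combined matrix can then be clustered again as in the single-stream case,
# followed by the usual refinement step
```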
0:12:59 | why this should actually work is again pretty intuitive |
---|
0:13:02 | in this case these p(y|x) |
---|
0:13:05 | after combination are actually estimated on |
---|
0:13:09 | a large amount of data, so they are not estimated on those short segments |
---|
0:13:12 | as in the case of the combination |
---|
0:13:13 | before clustering |
---|
0:13:16 | now each p(y|x) is |
---|
0:13:17 | estimated |
---|
0:13:18 | on a lot of data, because you have just a few clusters in the end, of course |
---|
0:13:25 | the third approach |
---|
0:13:27 | is actually a hybrid system, which is just the combination of those two |
---|
0:13:33 | before clustering and after clustering |
---|
0:13:36 | so in one case |
---|
0:13:38 | what we can do is just |
---|
0:13:43 | for one single stream just do |
---|
0:13:47 | the system combination, and then we just |
---|
0:13:50 | combine such output with |
---|
0:13:53 | the other |
---|
0:13:54 | stream |
---|
0:13:56 | before its clustering; so maybe it is more clearly seen here |
---|
0:13:59 | there are two streams |
---|
0:14:00 | in one case we do this system combination, so we do the clustering, and from these p(y|c) matrices |
---|
0:14:06 | we go back to p(y|x) |
---|
0:14:08 | to get the initial |
---|
0:14:09 | segmentation, or the initial probabilities for the segmentation |
---|
0:14:13 | and in the second case |
---|
0:14:17 | here with the |
---|
0:14:20 | tdoa stream |
---|
0:14:24 | we just do the combination before |
---|
0:14:27 | the clustering |
---|
0:14:28 | then those two streams are simply combined, of course |
---|
0:14:31 | so we have some p(y|x) |
---|
0:14:34 | matrix |
---|
0:14:35 | and a p(y|c)-based matrix, and they are combined the same way as before |
---|
0:14:38 | of course there are two possible cases, depending on which is done on which kind of stream |
---|
0:14:43 | and this is going to be seen in the results table, but again |
---|
0:14:47 | maybe it is intuitive how this should be done; so now let me say a few words about the |
---|
0:14:51 | experiments |
---|
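(And a minimal sketch of one direction of this hybrid scheme; which stream plays the "system" role and which the "model" role, and the weight, are choices to be made on development data and are assumptions here.)

```python
import numpy as np

def combine_hybrid(p_y_c_sys, labels_sys, p_y_x_raw, w_sys=0.7):
    """Hybrid combination: one stream is clustered first, so its segment
    posteriors are rebuilt from its clusters (p(y|x) = p(y|c(x)), the
    'system' side); the other stream contributes its raw segment posteriors
    (the 'model' side). The two are weighted together and clustered once more."""
    p_sys = p_y_c_sys[labels_sys]            # each segment inherits its cluster's p(y|c)
    return np.hstack([w_sys * p_sys, (1.0 - w_sys) * p_y_x_raw])
```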
0:14:52 | we are using the same rich transcription data as in the previous talk, seventeen meetings, so no |
---|
0:14:58 | ami data, only rich transcription data |
---|
0:15:01 | the mfcc features and these |
---|
0:15:04 | tdoa features |
---|
0:15:08 | and the speech is coming, again |
---|
0:15:13 | from a single enhanced speech signal |
---|
0:15:16 | again, the weights which we need to estimate are estimated on a development set |
---|
0:15:21 | and as before we are only measuring the diarization error rate with respect to speaker errors, so no speech |
---|
0:15:28 | / non-speech errors |
---|
0:15:31 | here are the results; if you remember from the previous talk |
---|
0:15:36 | the baseline was around fifteen, or |
---|
0:15:38 | fifteen point five percent |
---|
0:15:41 | that was |
---|
0:15:43 | a single-stream technique, so just mfcc features |
---|
0:15:46 | if you do the combination |
---|
0:15:48 | of mfcc and tdoa features |
---|
0:15:51 | in the case of the information bottleneck technique and |
---|
0:15:54 | the kind of hmm/gmm baseline |
---|
0:15:56 | we may see the decrease: we get to around twelve percent |
---|
0:16:01 | and the second point is just about the weights; those are the weights for |
---|
0:16:08 | weighting the different features, and |
---|
0:16:10 | these are different quantities: in our case it is some probabilities which |
---|
0:16:15 | we are combining |
---|
0:16:17 | in the case of the hmm/gmm those are log-likelihoods |
---|
0:16:21 | so that is also why the weights are different |
---|
0:16:25 | and again, in our case the combination is done using the relevance variables |
---|
0:16:29 | and this, as you can see, outperforms |
---|
0:16:33 | the hmm/gmm system |
---|
0:16:35 | so these are the results for the combination |
---|
0:16:38 | but the combination |
---|
0:16:43 | done actually after clustering |
---|
0:16:45 | the combination at the system level, as we call it |
---|
0:16:47 | so the baseline of around twelve percent comes from the previous table |
---|
0:16:52 | if you do the system |
---|
0:16:54 | combination, meaning we combine these relevance variables after |
---|
0:16:58 | clustering, you may see we are getting a pretty high, almost forty percent, improvement |
---|
0:17:04 | and then there are of course the two possible combinations of system and model |
---|
0:17:08 | weighting |
---|
0:17:13 | and again it is pretty straightforward that |
---|
0:17:15 | it is better to |
---|
0:17:16 | do the system |
---|
0:17:17 | combination, or system weighting, with the tdoa features, because they are usually |
---|
0:17:22 | more noisy |
---|
0:17:24 | and they probably need more data to be well estimated, or at least those relevance variables |
---|
0:17:30 | need more data to be well estimated |
---|
0:17:32 | in the case of the mfcc features it looks like it works much better |
---|
0:17:36 | so that is the reason |
---|
0:17:38 | also you may look at the table |
---|
0:17:41 | and see that the weights go closer to |
---|
0:17:44 | the system combination: those weights which we need to estimate, instead of |
---|
0:17:50 | zero point seven / zero point three |
---|
0:17:52 | go to zero point eight |
---|
0:17:53 | and they are estimated on different data but |
---|
0:17:56 | seem to generalise |
---|
0:17:58 | for this case |
---|
0:18:02 | just a bit to explain why |
---|
0:18:04 | possibly, we are getting such an improvement |
---|
0:18:07 | if you look at the single-stream |
---|
0:18:09 | results |
---|
0:18:10 | for each meeting, seventeen meetings in this case |
---|
0:18:14 | here are the model combination and the system combination |
---|
0:18:18 | and if you look at the bottom rows, which are just the simple mfcc and tdoa |
---|
0:18:23 | single-stream information bottleneck techniques, so there is no combination of different features |
---|
0:18:27 | you may see that |
---|
0:18:28 | most of the improvement comes in cases |
---|
0:18:31 | where there is a big gap between |
---|
0:18:32 | those two single-stream techniques |
---|
0:18:36 | of course you do not always get the improvement when there is |
---|
0:18:38 | a big gap between the mfcc and tdoa single streams |
---|
0:18:42 | but the system combination works pretty well for such meetings |
---|
0:18:52 | so, just to conclude the paper |
---|
0:18:54 | here we presented a new technique, or a new way, of combining the streams of |
---|
0:19:00 | acoustic features |
---|
0:19:01 | so rather than, as we did before |
---|
0:19:04 | weighting the acoustic features before clustering, here we present a technique which |
---|
0:19:09 | actually tries to do it after clustering |
---|
0:19:11 | and the reason is simple: these relevance variables, which |
---|
0:19:16 | are then used to match the different clusters or different segments |
---|
0:19:22 | are |
---|
0:19:23 | going to be estimated on more data |
---|
0:19:25 | and not just on |
---|
0:19:27 | short segments |
---|
0:19:29 | and actually, as was seen in |
---|
0:19:34 | the results, we are getting a pretty good improvement with |
---|
0:19:37 | such a technique, almost forty percent |
---|
0:19:40 | over all seventeen meetings |
---|
0:19:43 | um i think i'm done |
---|
0:19:46 | (session chair) [inaudible] |
---|
0:19:55 | i do not think there are any specific questions |
---|
0:20:03 | [inaudible closing remarks] |
---|