0:00:20 | Hello, in this talk we are going to present |
---|
0:00:24 | binary key speaker diarization. |
---|
0:00:26 | This is joint work with the University of Avignon, and I myself am coming from |
---|
0:00:30 | Telefonica Research. |
---|
0:00:32 | So, without further delay, |
---|
0:00:39 | here is the outline. |
---|
0:00:40 | I am first going to review what speaker diarization is, at least |
---|
0:00:44 | for those of you that do not remember it from the previous |
---|
0:00:46 | talk. |
---|
0:00:47 | Then I will talk about binary speaker modeling, |
---|
0:00:50 | and then how we turn these two things into the binary speaker diarization system that we just developed, |
---|
0:00:55 | then experiments, and then I will conclude with future work. |
---|
0:00:59 | First, speaker diarization. As most of you already know, in diarization we split |
---|
0:01:03 | the audio into its speakers: |
---|
0:01:05 | we determine who spoke when, and we do not know a priori |
---|
0:01:09 | who the speakers are or how many speakers there are. |
---|
0:01:16 | So what is the state of the art these days? |
---|
0:01:18 | Well, what have we achieved? |
---|
0:01:20 | As a community, over the last years we have gotten |
---|
0:01:23 | down to around seven to ten percent error for broadcast news, even though this is |
---|
0:01:27 | something that since two thousand four |
---|
0:01:29 | is not part of the NIST evaluations, and I bet nowadays it is |
---|
0:01:33 | even lower than that. |
---|
0:01:35 | And we have gotten to twelve to fourteen percent for meetings, even maybe nine percent now on |
---|
0:01:41 | meeting data. |
---|
0:01:43 | These are great results; they mean that diarization should be usable |
---|
0:01:47 | as a building block for other applications, like |
---|
0:01:52 | speaker ID when there are multiple speakers in the data. |
---|
0:01:55 | But we still have a problem: |
---|
0:01:56 | it is too slow. |
---|
0:01:58 | To give some example numbers: in standard systems, if you develop a diarization system and do not do |
---|
0:02:04 | anything about speed, |
---|
0:02:06 | it is most probably going to run way above one times real time. |
---|
0:02:10 | And if you try doing something about it: |
---|
0:02:14 | these are the two systems I have seen from people who were trying to do something about it. |
---|
0:02:19 | The first one, |
---|
0:02:20 | from a couple of years ago, |
---|
0:02:22 | was going down to zero point ninety-seven times real time on a single core; they were |
---|
0:02:28 | doing some tweaks |
---|
0:02:29 | to the GMM-based algorithms of a hierarchical bottom-up system, |
---|
0:02:35 | and they were getting to just under real time. |
---|
0:02:38 | And further on they said, okay, let us go to the GPU, |
---|
0:02:41 | so we can use sixteen cores or however many the architecture gives us, |
---|
0:02:45 | and they went down to zero point zero seven times real time; nowadays this is probably even |
---|
0:02:49 | faster, but it is tied to the GPU, so |
---|
0:02:52 | you cannot have it in a mobile phone, and you do not want these |
---|
0:02:56 | systems to have to work |
---|
0:02:59 | only on one particular architecture. |
---|
0:03:02 | And this is what we want: |
---|
0:03:04 | to have a system |
---|
0:03:05 | that really is very fast, where it does not matter what architecture you are running it on, |
---|
0:03:12 | and that still performs well. |
---|
0:03:14 | In this case we did it by adapting a recently proposed technique called binary speaker modeling. |
---|
0:03:21 | We also have another poster on using this for speaker ID. |
---|
0:03:27 | Here we adapted it to diarization, and I will tell you |
---|
0:03:31 | how we did it. |
---|
0:03:34 | To understand what we will do, you need to know the basics of binary speaker modeling, so |
---|
0:03:39 | I am going to explain it a little bit more now. |
---|
0:03:43 | So, |
---|
0:03:44 | this is the basic idea: we have some input acoustic |
---|
0:03:47 | data, and at the end we want |
---|
0:03:50 | a vector |
---|
0:03:52 | of J |
---|
0:03:54 | zeros and ones. |
---|
0:03:55 | The way it is done, in a very general way as explained here: we first extract some acoustic |
---|
0:04:02 | parameters, MFCC or whatever we want, |
---|
0:04:05 | and we use |
---|
0:04:07 | a binary key background model, the KBM, which is basically a UBM but trained in a different way, |
---|
0:04:13 | to fit |
---|
0:04:14 | this acoustic data; and then with this KBM we obtain |
---|
0:04:18 | these binary keys |
---|
0:04:19 | for each acoustic |
---|
0:04:21 | segment, which could be the data for one speaker or the data for a couple of seconds. |
---|
0:04:28 | Now, what is this KBM, |
---|
0:04:30 | this KBM model? You can understand it in different ways; it is basically a set of Gaussians |
---|
0:04:35 | positioned |
---|
0:04:36 | in a particular way in the acoustic space, the multi-dimensional acoustic space; |
---|
0:04:40 | here you have just one dimension shown. |
---|
0:04:43 | We can see it in the example: |
---|
0:04:44 | we first position the |
---|
0:04:46 | Gaussians in the space, and then we take input data, acoustic data of |
---|
0:04:50 | a speaker, |
---|
0:04:51 | and we see which of these Gaussians |
---|
0:04:54 | are most present, which best represent our data. |
---|
0:04:58 | From there we extract a binary fingerprint, which |
---|
0:05:03 | has zeros |
---|
0:05:05 | in the positions of the Gaussians that do not represent the data well, and ones |
---|
0:05:10 | for the Gaussians that best match our data. |
---|
0:05:14 | And that is it. |
---|
0:05:16 | So how do we do it for diarization; how do we put all of this together? |
---|
0:05:20 | Well, as we can see here, on the left side we have the input signal, |
---|
0:05:25 | from which we compute the MFCC acoustic features, and on the right side we have the |
---|
0:05:30 | KBM. |
---|
0:05:32 | And in the middle, |
---|
0:05:34 | the vertical vectors: |
---|
0:05:36 | here we have vectors |
---|
0:05:38 | whose dimensionality is N, which is the number of Gaussians we have |
---|
0:05:43 | in our KBM model. |
---|
0:05:46 | And for each input feature vector we select |
---|
0:05:49 | the best Gaussians; |
---|
0:05:51 | we could take the top one percent, the top two percent, the ten best, whatever |
---|
0:05:56 | we want to use. |
---|
0:05:57 | After processing all of the |
---|
0:06:00 | input feature vectors, |
---|
0:06:02 | from x one to x T, |
---|
0:06:06 | against our KBM model, |
---|
0:06:08 | we get down to this counting vector, |
---|
0:06:11 | the sum of the resulting vectors, which basically |
---|
0:06:15 | counts how many times |
---|
0:06:16 | each of these Gaussians |
---|
0:06:18 | has been selected as one of the best representing Gaussians for the acoustic data. |
---|
0:06:23 | And then we just say, okay, |
---|
0:06:25 | the |
---|
0:06:27 | top N percent or whatever of the Gaussians most present in the data become ones, and the rest |
---|
0:06:31 | are set to zeros. |
---|
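The count-and-binarize extraction just described (score each frame against the KBM, keep the top-scoring Gaussians per frame, accumulate counts over the segment, then binarize the most frequent) can be sketched as follows. This is a minimal illustration assuming one-dimensional features and single Gaussians; the function names and the `top_k` and `top_ratio` values are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

def log_gauss(x, mean, var):
    """Log-likelihood of a scalar feature under a single 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def binary_key(frames, kbm, top_k=1, top_ratio=0.25):
    """For each frame, select the top_k best-scoring KBM Gaussians; count
    how often each Gaussian is selected over the whole segment; then set
    to one the top_ratio fraction of most-selected Gaussians, zero the rest."""
    counts = Counter()
    for x in frames:
        ranked = sorted(range(len(kbm)),
                        key=lambda g: log_gauss(x, *kbm[g]), reverse=True)
        counts.update(ranked[:top_k])
    n_ones = max(1, round(top_ratio * len(kbm)))
    ones = {g for g, _ in counts.most_common(n_ones)}
    return [1 if g in ones else 0 for g in range(len(kbm))]
```

With a four-Gaussian KBM, for example, frames clustered around one of the means produce a key with a single one at that Gaussian's position.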
0:06:39 | So, |
---|
0:06:40 | once we have |
---|
0:06:42 | a binary vector |
---|
0:06:43 | for two speakers, or for two sets of acoustic data, |
---|
0:06:47 | it is very fast and very easy |
---|
0:06:49 | to compare them, to compute how close they are. |
---|
0:06:52 | Here is just an example; the exact type of metric to be used is somewhat free. |
---|
0:06:57 | The one shown |
---|
0:07:00 | at the top is the one we used, |
---|
0:07:01 | and it is just one possibility; |
---|
0:07:03 | there are many possibilities when working in the binary domain; you just need to find a |
---|
0:07:07 | way to compare two binary signals. |
---|
0:07:10 | In the one we used in this paper, |
---|
0:07:12 | the numerator is just a sum |
---|
0:07:16 | that adds one whenever in the two vectors we both have a |
---|
0:07:20 | one, |
---|
0:07:22 | and the denominator |
---|
0:07:24 | adds one whenever in either of the two vectors |
---|
0:07:28 | we have a one. |
---|
0:07:29 | And this gives us a score from zero to one, |
---|
0:07:32 | where zero means the vectors |
---|
0:07:33 | are not similar at all, and one means they are the same vector. |
---|
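The similarity just described (ones in common in the numerator, ones in either vector in the denominator, a Jaccard-style score) is essentially a one-liner; this sketch is only an illustration of that idea:

```python
def binary_similarity(a, b):
    """Adds one to the numerator wherever both vectors have a one, and one
    to the denominator wherever either vector has a one; returns 0..1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0
```

Because the keys are plain bit vectors, this comparison needs only AND/OR-style operations, which is why the later clustering stage is so cheap.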
0:07:41 | That is it for binary speaker modeling; |
---|
0:07:43 | as I said, we have a poster with more experiments about it, and you can go back to |
---|
0:07:48 | our earlier publication. |
---|
0:07:50 | Let us see now how we apply it |
---|
0:07:52 | to |
---|
0:07:53 | speaker diarization. |
---|
0:07:57 | So this is basically the new system that we developed. |
---|
0:08:01 | This is, |
---|
0:08:03 | even if it looks a bit different or strange, |
---|
0:08:04 | just an agglomerative bottom-up system. |
---|
0:08:10 | We can see that there is iterative clustering at the bottom, and we have |
---|
0:08:14 | a kind of a stopping criterion, or rather a cluster selection. |
---|
0:08:17 | Let us see |
---|
0:08:19 | the blocks one by one, starting from the bottom. |
---|
0:08:22 | First, the feature extraction, to extract MFCCs or whatever we want. |
---|
0:08:27 | Then the next stage: |
---|
0:08:28 | we need to train the KBM model; in this case we train it from the |
---|
0:08:32 | test data itself, we do not use external data. |
---|
0:08:35 | Then the features are binarized. |
---|
0:08:38 | Then we take the acoustic features |
---|
0:08:40 | and, |
---|
0:08:41 | like |
---|
0:08:42 | always in diarization, we need to initialize the system; as we are doing |
---|
0:08:46 | a bottom-up system, we need |
---|
0:08:47 | many more clusters than there are actual speakers, so |
---|
0:08:51 | we somehow need to create those clusters. |
---|
0:08:53 | And |
---|
0:08:55 | this part of the processing |
---|
0:08:57 | is done |
---|
0:08:59 | in a way that uses |
---|
0:09:02 | just a little bit of the computational time of the system. |
---|
0:09:05 | After that comes the agglomerative clustering, |
---|
0:09:08 | in which we keep joining together those clusters that are closest; this is |
---|
0:09:14 | all done in the binary space. |
---|
0:09:15 | And finally, and this is one difference from a standard |
---|
0:09:19 | agglomerative clustering system, we go |
---|
0:09:21 | from N clusters down to one. |
---|
0:09:23 | Once we have reached one, we |
---|
0:09:24 | use an algorithm to select how many |
---|
0:09:26 | clusters we optimally have. |
---|
0:09:30 | As I said, |
---|
0:09:31 | for the MFCCs |
---|
0:09:33 | we use a standard setup, computed every ten milliseconds over twenty-five-millisecond windows. |
---|
0:09:39 | And the KBM, as I said, is |
---|
0:09:42 | a model that is trained in a special way. |
---|
0:09:46 | Why in a special way? |
---|
0:09:48 | If you use a UBM model trained with standard EM-ML techniques, |
---|
0:09:52 | you are going to have the Gaussians positioned at the average points, |
---|
0:09:56 | modeling the data optimally on average, and it has been shown that they |
---|
0:10:01 | do not |
---|
0:10:02 | represent well the particularities, the discriminative information that the |
---|
0:10:07 | speakers in your audio have. |
---|
0:10:09 | So we try to do something different that can model that. |
---|
0:10:12 | Regarding the size N: it can be anything above five hundred Gaussians; we can |
---|
0:10:18 | go to ten thousand and the performance |
---|
0:10:20 | does not change, neither do the error rates. |
---|
0:10:24 | How do we do this? |
---|
0:10:26 | In this case, |
---|
0:10:28 | in this paper, we do it in the following way. |
---|
0:10:31 | We take the input audio and we first train |
---|
0:10:35 | one single Gaussian for, |
---|
0:10:37 | I believe it is every two seconds of speech, with some overlap. |
---|
0:10:41 | So at the end this gives us, |
---|
0:10:42 | say, |
---|
0:10:43 | a thousand Gaussians |
---|
0:10:45 | or two thousand Gaussians in total. |
---|
0:10:48 | These Gaussians model very small portions of the audio, so wherever there is a speaker they represent that speaker |
---|
0:10:54 | very discriminatively. |
---|
0:10:55 | And then we use a greedy iterative metric to adaptively |
---|
0:11:00 | choose those Gaussians that |
---|
0:11:02 | optimally |
---|
0:11:05 | model the space, |
---|
0:11:06 | Gaussians as separate from each other as possible, covering |
---|
0:11:09 | the whole acoustic space. |
---|
0:11:11 | And that is it. |
---|
0:11:12 | This is actually much faster than doing it with iterative splitting and EM-ML. |
---|
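A rough sketch of this selection idea: train one Gaussian per short window, then greedily keep Gaussians that are far from the ones already kept, so the selection spreads over the acoustic space. The symmetric-KL distance and the farthest-first rule below are plausible choices for illustration, not necessarily the exact criterion used in the paper, and the Gaussians are one-dimensional `(mean, var)` pairs for simplicity.

```python
import math

def sym_kl(g0, g1):
    """Symmetric KL divergence between two 1-D Gaussians (mean, var)."""
    def kl(a, b):
        (m0, v0), (m1, v1) = a, b
        return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)
    return kl(g0, g1) + kl(g1, g0)

def select_kbm(pool, n):
    """Greedy farthest-first selection: start from the first candidate and
    repeatedly add the pool Gaussian whose minimum distance to the current
    selection is largest, until n Gaussians cover the space."""
    selected = [pool[0]]
    remaining = list(pool[1:])
    while len(selected) < n and remaining:
        far = max(remaining, key=lambda g: min(sym_kl(g, s) for s in selected))
        remaining.remove(far)
        selected.append(far)
    return selected
```

Note how a near-duplicate Gaussian is skipped in favor of one that covers a new region of the space, which is the point of training the KBM this way rather than with EM-ML.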
0:11:18 | Next: |
---|
0:11:19 | how do we derive |
---|
0:11:20 | the binary vectors |
---|
0:11:22 | from the acoustic data? We do it in two steps. |
---|
0:11:27 | The first step, |
---|
0:11:28 | which is |
---|
0:11:30 | finding the |
---|
0:11:31 | K best |
---|
0:11:34 | Gaussians for each acoustic feature, we have to do |
---|
0:11:37 | one time only; and then, in the second step, |
---|
0:11:40 | for every subset of features that we want to compute a fingerprint from, we only need |
---|
0:11:48 | the accumulation step, |
---|
0:11:51 | run every time we need it, and that is actually very fast. |
---|
0:11:53 | So: |
---|
0:11:55 | we have the MFCC vectors |
---|
0:11:57 | at the top, |
---|
0:11:58 | and for each of them we get its |
---|
0:12:00 | best Gaussians; here we are working with the |
---|
0:12:04 | top five. |
---|
0:12:05 | That is our first part, and we can store it in memory, |
---|
0:12:09 | and that is done only one time. This is a little expensive, because we are evaluating Gaussian mixture models, |
---|
0:12:13 | but it is |
---|
0:12:15 | one time only. |
---|
0:12:16 | Then, every time we need |
---|
0:12:19 | a speaker model, we just have to get |
---|
0:12:21 | the stored indices, |
---|
0:12:23 | accumulate |
---|
0:12:24 | the counts, |
---|
0:12:25 | and from those counts get the binary vector. |
---|
0:12:27 | And this is lightning fast. |
---|
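The two-step trick just described can be sketched like this: the expensive Gaussian scoring runs once per recording, and only index counting runs during clustering. The `score` callback and the table layout are assumptions made for illustration.

```python
from collections import Counter

def precompute_top_k(frames, kbm, top_k, score):
    """Step 1, run once: for every frame, store the indices of the top_k
    best-scoring KBM Gaussians (score(frame, gaussian) -> likelihood)."""
    return [sorted(range(len(kbm)), key=lambda g: score(x, kbm[g]),
                   reverse=True)[:top_k]
            for x in frames]

def segment_key(table, start, end, n_gaussians, n_ones):
    """Step 2, run per segment: count the stored indices over the segment
    and keep the n_ones most frequent Gaussians as the ones of the key."""
    counts = Counter(g for frame in table[start:end] for g in frame)
    ones = {g for g, _ in counts.most_common(n_ones)}
    return [1 if g in ones else 0 for g in range(n_gaussians)]
```

Any segment's key then costs only a pass over stored integers, which is why the clustering iterations stay cheap no matter how many merges are evaluated.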
0:12:31 | Next, |
---|
0:12:33 | I have to talk about initialization. |
---|
0:12:36 | Here we did something simple: for simplicity we just reuse the KBM, |
---|
0:12:42 | the KBM Gaussians, |
---|
0:12:44 | and build the initial clusters from the Gaussians that were |
---|
0:12:49 | chosen first. |
---|
0:12:51 | That is, |
---|
0:12:52 | we do an initial segmentation with them, and with those we assign |
---|
0:12:58 | the data to the clusters that |
---|
0:13:00 | match it the most. |
---|
0:13:03 | Now we are in the binary domain, |
---|
0:13:05 | okay, and once we have that, |
---|
0:13:07 | this step for us is exactly the same as, |
---|
0:13:11 | for example, what the ICSI system does for the agglomerative clustering, |
---|
0:13:14 | except that now everything is in the binary domain. |
---|
0:13:17 | So, for example, |
---|
0:13:20 | the fingerprints for our clusters come |
---|
0:13:22 | from the KBM Gaussians, |
---|
0:13:25 | and the cluster comparison is completely binary: |
---|
0:13:28 | comparing between all the cluster models and just choosing the two that are closest |
---|
0:13:34 | to merge them. |
---|
0:13:36 | For |
---|
0:13:37 | the reassignment, |
---|
0:13:39 | we just take |
---|
0:13:41 | three seconds of data, |
---|
0:13:43 | in steps of one second at a time, compute a fingerprint for each of them, |
---|
0:13:47 | and assign it to the best matching speaker model. |
---|
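The clustering loop just outlined, done entirely in the binary domain, can be sketched as below. Representing a merged cluster by the element-wise OR of its members' keys is a simplification for illustration; the actual system recomputes the key from the pooled counts.

```python
def key_similarity(a, b):
    """Ones in common over ones in either vector (0..1)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def cluster_binary_keys(keys, stop_at):
    """Agglomerative clustering: repeatedly find the pair of clusters with
    the highest key similarity and merge them, until stop_at remain."""
    clusters = [list(k) for k in keys]
    while len(clusters) > stop_at:
        _, i, j = max((key_similarity(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = [a | b for a, b in zip(clusters[i], clusters[j])]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters
```

Since each comparison is a pass over two bit vectors, the full pairwise comparison at every merge stays far cheaper than the likelihood evaluations a GMM-based system would need.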
0:13:55 | Last but not least, |
---|
0:13:56 | the last part of the system. |
---|
0:13:58 | Once we get down to one cluster, we have to choose how many clusters is |
---|
0:14:02 | our optimum number of clusters. |
---|
0:14:04 | For that |
---|
0:14:06 | we adapted |
---|
0:14:09 | a metric that was presented |
---|
0:14:11 | by other people at Interspeech two thousand eight. |
---|
0:14:14 | In the interest of time, |
---|
0:14:17 | I will leave the details to the paper; we just estimate the |
---|
0:14:22 | relation between the intra- and inter-cluster distances, |
---|
0:14:26 | which allows us to select the optimal number of clusters. |
---|
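One simple way to realize this intra- versus inter-cluster idea is sketched below; it is not the exact published metric the system adapts, just an illustration of the principle: score each candidate partition by the ratio of average within-cluster to average between-cluster key similarity, and keep the partition level that maximizes it.

```python
def key_similarity(a, b):
    """Ones in common over ones in either vector (0..1)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def partition_score(clusters):
    """clusters: list of clusters, each a list of binary keys. Returns the
    mean within-cluster similarity over the mean between-cluster
    similarity; tight, well-separated partitions score high, so comparing
    this value across clustering levels suggests a number of speakers."""
    within, between = [], []
    for i, ci in enumerate(clusters):
        within += [key_similarity(a, b)
                   for k, a in enumerate(ci) for b in ci[k + 1:]]
        for cj in clusters[i + 1:]:
            between += [key_similarity(a, b) for a in ci for b in cj]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(within) / mean(between) if mean(between) else float("inf")
```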
0:14:29 | Although I have to say, |
---|
0:14:30 | this is the part of the system that I am least happy about, and we will have to improve it. |
---|
0:14:37 | About the evaluation: |
---|
0:14:38 | of course we use the diarization error rate, but we also use the real-time factor. |
---|
0:14:43 | And because |
---|
0:14:44 | diarization results can fluctuate quite a bit, we decided to use |
---|
0:14:48 | all of the NIST rich transcription evaluation data, which is about thirty-six |
---|
0:14:53 | shows. |
---|
0:14:55 | And I have |
---|
0:14:55 | to say that it |
---|
0:14:56 | runs in just about an hour on a laptop PC, so it is pretty fast. |
---|
0:15:03 | Now, |
---|
0:15:03 | the results. |
---|
0:15:05 | The first line shows the results using a baseline GMM system, just an implementation of |
---|
0:15:12 | the |
---|
0:15:13 | basic |
---|
0:15:14 | one. |
---|
0:15:15 | It gets about twenty-three percent average diarization error, with a running time of about |
---|
0:15:21 | one point one nine times real time. |
---|
0:15:24 | There is no optimization here; it is just an implementation, |
---|
0:15:28 | the standard implementation. |
---|
0:15:31 | In the last two lines |
---|
0:15:33 | we have two configurations, depending on the number of Gaussians we take |
---|
0:15:38 | for the KBM: |
---|
0:15:39 | two possible implementations of the binary system. |
---|
0:15:42 | We can see that |
---|
0:15:45 | the diarization error rate is slightly higher than the baseline system, |
---|
0:15:51 | but the real-time factor is ten times |
---|
0:15:53 | faster, |
---|
0:15:54 | so this is pretty good. |
---|
0:15:55 | And to show the importance of the training of the KBM: |
---|
0:15:59 | when instead |
---|
0:16:01 | of that we used just a standard GMM trained on the data, |
---|
0:16:05 | that is the second line of results, we see that it just breaks. |
---|
0:16:08 | I mean, unless the Gaussians are speaker-characteristic, |
---|
0:16:11 | speaker-discriminant, as shown, it just does not work. |
---|
0:16:15 | I also mentioned that |
---|
0:16:18 | the selection of the number of clusters |
---|
0:16:20 | still does not |
---|
0:16:21 | do the job well: |
---|
0:16:23 | if we select the optimal number of clusters after running the system, |
---|
0:16:26 | we actually get an error rate |
---|
0:16:30 | which is |
---|
0:16:31 | better than our baseline. |
---|
0:16:34 | This plot is just to show how the diarization error rate varies |
---|
0:16:37 | depending on the number of Gaussians; |
---|
0:16:41 | the black line is the average. |
---|
0:16:44 | We can see that, |
---|
0:16:47 | somewhere between five hundred and nine hundred Gaussians for the KBM, the results become more or |
---|
0:16:52 | less flat, |
---|
0:16:53 | so it does not matter much whether we use five hundred or six hundred; |
---|
0:16:56 | it is fine. |
---|
0:16:58 | And this last plot shows, for all of the meetings, |
---|
0:17:02 | our |
---|
0:17:02 | proposed system versus the baseline; we can see that in most cases they perform |
---|
0:17:07 | more or less the same. Of course some shows are worse, |
---|
0:17:10 | maybe by a two percent difference, |
---|
0:17:12 | but |
---|
0:17:14 | there are also a couple of shows that are better. |
---|
0:17:18 | So, |
---|
0:17:19 | we |
---|
0:17:20 | showed that diarization was kind of at a standstill: |
---|
0:17:23 | the trend was to put more and more things on top of a standard system in order to get |
---|
0:17:29 | these little gains in performance. |
---|
0:17:31 | But |
---|
0:17:33 | by starting over with a new kind of system, we can hopefully go even further. |
---|
0:17:39 | In the work coming next, we plan to improve the binary key fingerprinting, |
---|
0:17:44 | we are going to find a better stopping criterion, hopefully, |
---|
0:17:47 | and also |
---|
0:17:49 | keep the system always mono-core, and maybe get it working in cell phones as well. |
---|
0:17:55 | Thank you very much. |
---|
0:18:02 | [The session chair thanks the speaker; a question from the audience follows, largely inaudible.] |
---|
0:18:19 | No, |
---|
0:18:20 | no, this is on the MDM data. |
---|
0:18:25 | Oh, |
---|
0:18:27 | oh sorry, the channel merging and speech detection are right at the beginning, |
---|
0:18:31 | at the very beginning. |
---|
0:18:32 | So it is just like a standard |
---|
0:18:35 | speech detection system, |
---|
0:18:36 | nothing special. |
---|
0:18:42 | Let me see if the slide goes back: |
---|
0:18:46 | it is just in the acoustic feature extraction at the beginning of the system, |
---|
0:18:49 | and we used the speech detection output from an external system. |
---|
0:18:53 | Thanks for that. |
---|
0:18:59 | No, no, I just use acoustic features; |
---|
0:19:02 | I do not merge. |
---|
0:19:03 | I use MDM, that is, multiple microphones, but just beamformed and then used as a single channel. |
---|
0:19:13 | There are many ideas, but we have not worked on that yet; |
---|
0:19:16 | we have to try. |
---|
0:19:18 | Okay, since we ran out of time, let us thank the speaker again. |
---|