0:00:18 | Thank you for the introduction. |
0:00:20 | So this talk is going to be about actually implementing, or investigating, role n-grams for modelling in speaker diarization. |
0:00:36 | From what you can check, I am the third author of this paper; the main author, who is mainly known for his speaker diarization work, is not here, so it is me who will present it. |
0:00:52 | All of us are coming from the Idiap Research Institute. |
0:01:00 | So let me say a few words about speaker diarization. |
0:01:03 | As you can see, speaker diarization tries to segment the input speech according to the speaker. |
0:01:13 | Recent applications focus a lot on meeting data, on speech that is actually recorded by multiple distant microphones, and they also try to look at spontaneous conversations, and so on. |
0:01:29 | Our main work so far has again concentrated on combining different kinds of acoustic features, like the time delay of arrival, MFCCs, or their combination, which is also what we do in this work. |
0:01:51 | But most speaker diarization systems ignore the fact that the data actually arise from sequences of human conversation, so statistics and information which can be estimated from, let's say, conversation analysis are not usually used. |
0:02:12 | Here there is just a simple plot which actually depicts the output of speaker diarization: the input speech and the segments for each speaker. |
0:02:27 | So, going back to the actual motivation of the work. |
0:02:32 | If you look at conversational speech, it usually varies between being constrained and spontaneous, but it is also governed by principles, and there are actually some laws from which we can predict some behaviour. |
0:02:47 | One of those is, for example, turn-taking, which is assumed to be observable there. |
0:02:55 | Each speaker during a meeting, or in a conversation, has some role, and based on that role people actually take their turns in the conversation. |
0:03:08 | Just for the terminology, roles typically fall into a couple of classes, such as formal roles or social roles. |
0:03:19 | Linked to that motivation: to perform analyses such as conversation analysis, or actually role recognition, people like using information and statistics from conversations, like the turn-taking, that is who takes the turns, or the turn duration, which is actually the length of those speaker segments. |
0:03:44 | The existing meeting corpora, as you probably know, are the ICSI meetings or the AMI meetings, and the AMI meetings, for example, were collected especially, or mainly, with this kind of scenario in mind, and that is what we are using here. |
0:04:00 | Any of those statistics are usually expected to be obtained from a speaker diarization output, of course. |
0:04:07 | But what we are going to present in this paper is that we actually try to use the information from the analysis of conversations as prior information, and to use it back in the speaker diarization. |
0:04:20 | So here is a diagram of our technique, where you can see that the acoustic features simply enter the speaker diarization, which does the clustering into the speaker turns or segments. |
0:04:35 | From the output of the speaker diarization you can then actually compute, or estimate, the statistics based on this turn-taking, meaning for example statistics about the roles. |
0:04:46 | This information can then be somehow used back in the speaker diarization, so it is a kind of prior information which can somehow help where the acoustic features do not provide enough information. |
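As a purely illustrative aside, the "estimate statistics from the diarization output" box of that diagram can be sketched in a few lines of Python. The segment layout, the speaker labels, and the role abbreviations (PM, UI, ME, ID) below are assumptions made for the example, not the actual data format used in the paper.

```python
# Illustrative only: extracting a role sequence and simple turn statistics
# from a (hypothetical) diarization output.
from collections import Counter

# (start_time_s, duration_s, speaker_id) -- invented diarization output
segments = [
    (0.0, 4.2, "spk1"), (4.2, 1.1, "spk3"), (5.3, 7.8, "spk1"),
    (13.1, 2.4, "spk2"), (15.5, 3.0, "spk4"), (18.5, 6.1, "spk1"),
]
speaker_to_role = {"spk1": "PM", "spk2": "UI", "spk3": "ME", "spk4": "ID"}

role_sequence = [speaker_to_role[spk] for _, _, spk in segments]

# Turn-taking statistics: who follows whom (role bigrams) and total turn time per role.
role_bigrams = Counter(zip(role_sequence, role_sequence[1:]))
turn_duration = Counter()
for _, dur, spk in segments:
    turn_duration[speaker_to_role[spk]] += dur

print(role_bigrams.most_common(3))
print(turn_duration)
```

These are exactly the kinds of turn-taking counts that the later slides turn into a role n-gram model.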
0:05:03 | Let me say a few words about the data set. As I said at the beginning, in this work we are again using the AMI meeting database, approximately a hundred hours of data, split of course into training, development and test parts. |
0:05:23 | Also, the scenario is actually kind of static; it does not change much over the recordings or the different meetings. There are always four participants, each of which has a given role: project manager, user interface expert, marketing expert and industrial designer, and those people are somehow talking about developing some remote control device or something like that. |
0:05:52 | We also assume that there is one supervisor, like the project manager, who is actually directing the meeting. |
0:06:00 | And here you can see the data about how many meetings were used for training, testing and so on; in the end around twenty recordings, I think, were used. |
0:06:12 | Going back to the technique, and trying to somehow formalise it: we have tried to simplify the task, in the sense that the speech regions are supposed to be separated by pauses longer than a certain number of milliseconds, and we do not care much about cross-talk, so we actually ignore the cross-talk. |
0:06:38 | As the previous speaker also showed, the output is just a list of speaker segments, where each speaker segment has some beginning and duration, and each of these speech segments is associated with some speaker, which has some role in the meeting. |
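Purely as an illustration of this simplification, here is a minimal Python sketch that merges adjacent segments of the same speaker whenever the pause between them is shorter than a threshold; the 300 ms value and the tuple layout are assumptions for the example, not the paper's actual settings.

```python
# Illustrative sketch: adjacent speech segments of the same speaker are merged
# into one speaker turn whenever the pause between them is shorter than a
# threshold; overlapping speech is simply ignored in this simplified view.
MIN_PAUSE = 0.3  # seconds; hypothetical value

def merge_into_turns(segments):
    """segments: list of (start, end, speaker), sorted by start time."""
    turns = []
    for start, end, spk in segments:
        if turns and turns[-1][2] == spk and start - turns[-1][1] < MIN_PAUSE:
            prev_start, _, _ = turns[-1]
            turns[-1] = (prev_start, end, spk)   # extend the previous turn
        else:
            turns.append((start, end, spk))
    return turns

print(merge_into_turns([(0.0, 1.2, "A"), (1.3, 2.0, "A"), (2.8, 4.0, "B")]))
# -> [(0.0, 2.0, 'A'), (2.8, 4.0, 'B')]
```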
0:06:59 | Again going back to those n-grams: how do we use this prior information? |
0:07:05 | As said, we have a sequence of speakers, which we get from the diarization, and we also have actually a one-to-one mapping between speakers and roles, that is, which role each speaker had in each meeting. |
0:07:23 | For the speaker sequence at the top here there is then of course the corresponding sequence of roles, which is this sequence here. |
0:07:36 | What we are doing is we try to estimate the probability of the speakers, through their roles, conditioned on the previous speakers; so we are simply applying something like the language model used in automatic speech recognition. |
0:07:55 | And of course, to begin with, we are trying simple bigrams and trigrams. |
0:08:01 | Here you can see the traditional equations for training such a language model, so the equation for the probability of the current role given the previous roles, and a table with the perplexities that were obtained on the test data. |
0:08:19 | For the unigram the perplexity is of course going to be four, but then you may see that for bigrams and trigrams the perplexity is decreasing, so there is hope that such information can actually be added to the acoustic information. |
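To ground this, here is a small, self-contained sketch of estimating a role bigram model and measuring its perplexity on a held-out role sequence. The toy sequences, the add-one smoothing, and the role abbreviations are assumptions for illustration, not the paper's actual training setup.

```python
# Minimal role bigram model P(r_i | r_{i-1}) with add-alpha smoothing,
# plus perplexity on a held-out role sequence.
import math
from collections import Counter

ROLES = ["PM", "UI", "ME", "ID"]

def train_bigram(sequences, alpha=1.0):
    uni, bi = Counter(), Counter()
    for seq in sequences:
        uni.update(seq[:-1])                 # history counts
        bi.update(zip(seq, seq[1:]))         # bigram counts
    def prob(prev, cur):                     # smoothed conditional probability
        return (bi[(prev, cur)] + alpha) / (uni[prev] + alpha * len(ROLES))
    return prob

def perplexity(prob, seq):
    log_sum = sum(math.log(prob(p, c)) for p, c in zip(seq, seq[1:]))
    return math.exp(-log_sum / (len(seq) - 1))

train = [["PM", "UI", "PM", "ME", "PM", "ID", "PM", "UI", "PM"],
         ["PM", "ME", "PM", "UI", "ID", "PM", "ME", "PM"]]
test = ["PM", "UI", "PM", "ID", "PM", "ME"]

bigram = train_bigram(train)
print(f"bigram perplexity: {perplexity(bigram, test):.2f}")
print(f"unigram upper bound (uniform over 4 roles): {len(ROLES)}")
```

A perplexity noticeably below four on held-out meetings is what signals that the role sequence carries usable predictive information.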
0:08:40 | Now briefly to the diarization system itself; I do not want to say too much about it, because there is the second talk, where I have one or two slides about the diarization technique itself. |
0:08:55 | So, to repeat, we are combining the acoustic scores with the language model scores, which are actually those role n-grams. |
0:09:05 | The diarization system, our method, is based on the information bottleneck principle, which we already discussed last year, in the paper cited here. |
0:09:14 | What we assume as input is a speech/non-speech segmentation, or rather a segmentation of the speech into initial segments, and we actually do a uniform segmentation there, so nothing too fancy. |
0:09:29 | Then there is a kind of clustering operation, so we are trying to cluster the input speech into the speaker segments, starting from those initial segments. |
0:09:42 | So in the end we have some estimate of the clustering, and we actually refine it by a realignment system: GMMs are retrained on the speaker speech data, meaning for each speaker we train one GMM, and then we just do a simple Viterbi decoding, which is going to give us the sequence of the speakers. |
0:10:10 | At this point we of course did not yet bring in any prior information from the roles, but this can simply be done during the Viterbi decoding. |
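As a rough illustration of that realignment step, the sketch below fits one single-component, diagonal-covariance Gaussian per speaker cluster and reassigns each segment to its best-scoring speaker. The real system uses full GMMs and Viterbi decoding with duration constraints, so treat this only as a simplified stand-in with invented data.

```python
# Simplified stand-in for the realignment: one diagonal Gaussian per speaker
# cluster, then each segment is reassigned to its best-scoring speaker.
import numpy as np

def fit_diag_gaussian(frames):
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6          # floor the variance for stability
    return mean, var

def loglik(frames, mean, var):
    # Sum of per-frame diagonal-Gaussian log-likelihoods.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)

def realign(segments, labels):
    """segments: list of (n_frames, dim) arrays; labels: initial cluster id per segment."""
    speakers = sorted(set(labels))
    models = {s: fit_diag_gaussian(np.vstack([seg for seg, l in zip(segments, labels) if l == s]))
              for s in speakers}
    return [max(speakers, key=lambda s: loglik(seg, *models[s])) for seg in segments]

rng = np.random.default_rng(0)
segs = [rng.normal(0, 1, (50, 12)), rng.normal(3, 1, (50, 12)), rng.normal(0, 1, (40, 12))]
print(realign(segs, ["spk1", "spk2", "spk1"]))   # -> ['spk1', 'spk2', 'spk1']
```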
0:10:21 | I should just mention that there have already been some attempts at actually using such information; for example, in this paper from two thousand nine they were using meeting-dependent statistics of the utterances between speakers. |
0:10:41 | So, to the experiments, or the testing: we want to show that what we are doing works well, or improves things, for sure. We consider two cases. |
0:10:53 | In the first case we know the one-to-one mapping between the speakers and the roles; this is of course kind of cheating, because we know this information beforehand, so we know that speaker one is, for example, the project manager, and so on. |
0:11:09 | Then, to obtain the output, the diarization system, as I said, pretty simply uses the Viterbi decoding, where not only the acoustic likelihoods are used but also these prior probabilities coming from the roles. |
0:11:31 | Here is the classical equation for the Viterbi decoding, and we also use some scaling factors and an insertion penalty to tune the output; of course these factors are tuned on a development set, which we have from AMI as well, the development meetings. |
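A minimal sketch of this kind of decoding is given below: a segment-level Viterbi that adds a scaled role-bigram log-probability, and an insertion penalty at speaker changes, to the acoustic log-likelihoods. The weights, scores, role map and the exact way the penalty is applied are assumptions for the example, not the system's actual equation or tuned values.

```python
# Illustrative segment-level Viterbi combining acoustic log-likelihoods with a
# role bigram prior, a language-model scale, and an insertion penalty.
import math

def viterbi_roles(acoustic, role_of, role_bigram, lm_scale=5.0, ins_penalty=2.0):
    """acoustic[t][s]: acoustic log-likelihood of segment t under speaker s."""
    speakers = list(acoustic[0].keys())
    best = {s: acoustic[0][s] for s in speakers}   # first segment: acoustic only
    back = []
    for t in range(1, len(acoustic)):
        new_best, ptr = {}, {}
        for s in speakers:
            def score(prev):
                trans = lm_scale * math.log(role_bigram[(role_of[prev], role_of[s])])
                penalty = ins_penalty if prev != s else 0.0   # penalise speaker changes
                return best[prev] + trans - penalty + acoustic[t][s]
            prev_star = max(speakers, key=score)
            new_best[s], ptr[s] = score(prev_star), prev_star
        best, back = new_best, back + [ptr]
    path = [max(best, key=best.get)]               # backtrace
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

roles = {"spk1": "PM", "spk2": "UI"}
bigram = {(a, b): 0.5 for a in ["PM", "UI"] for b in ["PM", "UI"]}   # flat toy prior
ac = [{"spk1": -10.0, "spk2": -12.0}, {"spk1": -11.0, "spk2": -11.5}]
print(viterbi_roles(ac, roles, bigram))   # -> ['spk1', 'spk1']
```

In this toy run the insertion penalty keeps the second segment with the same speaker even though its acoustic scores are nearly tied, which is exactly the kind of help the prior is meant to give on short segments.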
0:11:57 | The second case is more difficult, but it is more realistic: we suppose that we do not know the mapping between the speakers and the roles. |
0:12:08 | So we need to somehow estimate it, and we do that from the estimated speaker segments which come from the first Viterbi decoding. |
0:12:20 | What we do is pretty simple, although it takes some time and computation: we just do an exhaustive search over all possible combinations of the mapping, and we look for the one with the maximum likelihood. |
0:12:44 | So in the end we get some estimate of which role each speaker actually has in the meeting, and we can then apply this prior information from the role statistics; after that the decoding is again pretty simple, we just do the Viterbi decoding. |
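The sketch below illustrates that exhaustive search: every one-to-one assignment of speakers to roles is scored by the role-bigram likelihood of the first-pass speaker sequence, and the best one is kept. The speaker sequence and the (unnormalised) bigram scores are invented for the example.

```python
# Exhaustive search over speaker-to-role assignments, scored by the role
# bigram likelihood of the first-pass speaker sequence.
import math
from itertools import permutations

def best_role_mapping(speaker_sequence, speakers, roles, role_bigram):
    best_map, best_ll = None, -math.inf
    for perm in permutations(roles, len(speakers)):        # all one-to-one mappings
        mapping = dict(zip(speakers, perm))
        role_seq = [mapping[s] for s in speaker_sequence]
        ll = sum(math.log(role_bigram[(p, c)]) for p, c in zip(role_seq, role_seq[1:]))
        if ll > best_ll:
            best_map, best_ll = mapping, ll
    return best_map, best_ll

speakers = ["spk1", "spk2", "spk3", "spk4"]
roles = ["PM", "UI", "ME", "ID"]
# Toy (unnormalised) bigram scores favouring transitions through the project manager.
role_bigram = {(a, b): (0.4 if ("PM" in (a, b) and a != b) else 0.1)
               for a in roles for b in roles}
decoded = ["spk1", "spk2", "spk1", "spk3", "spk1", "spk4", "spk1"]
mapping, ll = best_role_mapping(decoded, speakers, roles, role_bigram)
print(mapping["spk1"])   # -> 'PM' (the speaker who keeps taking the floor back)
```

With only four speakers and four roles there are just twenty-four permutations, so the brute-force search stays cheap, which matches the "it takes some time but is simple" remark above.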
0:13:04 | Here are the block diagrams for the two cases. |
0:13:07 | The first case, the one where we actually know this one-to-one mapping, is the bottom one in the figure shown. |
0:13:21 | The second case, where we need to estimate this mapping, is the first one: there we try to estimate the mapping from those speaker clusters which come directly from the clustering, before the GMM modelling. |
0:13:34 | Now to the results. Again AMI meetings, in this case four speakers and four roles, so we actually force the clustering to converge to four speakers. |
0:13:45 | The results are reported as diarization error rates; we do not count the speech/non-speech errors, because they are the same for all systems, so we are only accounting for the speaker errors in this case. |
0:14:01 | For case one, the baseline gives around fourteen percent error rate, but using the role n-grams you may see that there is a decrease in the error rate, by about two percent. |
0:14:15 | For case two, which is the realistic one, where we do not know the mapping between speakers and roles, we still keep a good part of that gain, up to around seventy percent of it for the trigram. |
0:14:27 | Some more results: here we look at the speaking time, the speaking activity attributed to each of the four speakers. You may see that for the project manager, which is the kind of person directing the meeting, we are getting most of the gain, compared to the other three participants in the meeting. |
0:14:47 | Also, from the analysis we have seen that the proposed method outperforms the previous one, the baseline, especially for short segments, where the acoustic scores are not properly estimated and where this prior information can help. |
0:15:03 | The question is how this generalises to other data, because everything so far has been done on AMI meeting speech. |
0:15:11 | For this we are using the Rich Transcription data from two thousand six and two thousand seven, seventeen meetings. |
0:15:19 | The distant microphones are beamformed, so we get one single enhanced speech signal in the end. |
0:15:26 | Actually, the roles are not annotated for this data, and there are up to nine participants in the meetings, but we somehow assume that part of the setup still holds: that there is one project manager, the person who was leading the meeting, and that the others can take any of the remaining, simpler roles. |
0:15:47 | If you look at the results, we may again see that, compared to the baseline, applying the trigram gives a decrease in the speaker error, down to twelve percent. |
0:16:05 | This plot shows the speaker error rate for each meeting, in this case for each Rich Transcription meeting. We may see that in most of the meetings there is really some gain; for, I think, two of those Rich Transcription meetings we do not get an improvement, but the other fifteen actually improve with this information. |
0:16:36 | So that is it; to the conclusions. |
0:16:42 | What we are presenting here is a speaker diarization system where we are attempting to use the prior information from the roles, or from conversation analysis, back in the speaker diarization. |
0:17:02 | Also, as I will show in the second talk, the technique can be improved further by combining different sorts of features, and in that case we still use this prior information. |
0:17:19 | So mainly we are getting something like a ten percent improvement over the AMI data. |
0:17:25 | What is also important is that the same technique, or the prior information, this language model which we estimated from the AMI data, generalises also to different data, which means that this technique has the potential to be readily used for other data as well. |
0:17:51 | And, as the last item says, in our case we just considered a very simple set of four roles in a meeting, and of course this can be somehow improved by actually exploiting other information, for other roles or other types of data. |
0:18:15 | I think that is all for this talk. Thank you. |
0:18:25 | (Session chair) Are there any questions? |
0:18:37 | (Audience question, largely inaudible: roughly, whether the improvement depends on the error rate of the meetings, and whether the meetings that already have a very high error rate gain anything at all.) |
0:19:46 | No, the language model is trained on held-out development data, on other meetings of course; I think it was something like twenty meetings. And then we applied such a language model, or n-gram, to the test data, so it is not meeting-specific, for sure. |
0:20:07 | (Follow-up question, inaudible.) |
0:20:10 | We have been using unigrams, bigrams and trigrams, so kind of traditional n-grams, and, even looking at the perplexities, if you look at the results we were able to achieve with the technique, it seems that it also still works for trigrams. |
0:20:35 | (Session chair) Are there any other questions? |
0:20:50 | (Further audience question and discussion, inaudible.) |
---|