0:00:15 | Hello, I'm here to present this work, the influence of transition costs |
---|
0:00:20 | in the segmentation stage of speaker diarization. I'm Beatriz |
---|
0:00:24 | and I work in the Speech Technology Group at the UPM, Universidad |
---|
0:00:29 | Politécnica de Madrid, in Spain. |
---|
0:00:32 | So here is the outline I'm going to follow in this presentation. First of |
---|
0:00:36 | all I'm going to explain the baseline system, the baseline architecture that we |
---|
0:00:41 | have used, and what the modules and the stages are, and so |
---|
0:00:47 | on, |
---|
0:00:47 | focusing on the segmentation and clustering stage, which is |
---|
0:00:52 | basically the diarization stage, |
---|
0:00:55 | which is where we have actually been changing things and analysing results. |
---|
0:01:02 | And what we |
---|
0:01:05 | wanted in this work was to analyze the effect of |
---|
0:01:10 | the parameters involved in the duration of the speaker turns. |
---|
0:01:15 | I mean, |
---|
0:01:17 | a turn can be very long or very short: which parameters are involved in this decision |
---|
0:01:22 | and how much they are affecting our system. |
---|
0:01:28 | Then I will present the experiments we have done with the development |
---|
0:01:34 | dataset, and all the analysis and some conclusions. |
---|
0:01:39 | Here is the baseline system architecture. We work with multiple |
---|
0:01:45 | signals from multiple microphones, so that is our input, which we first filter to reduce |
---|
0:01:52 | noise, |
---|
0:01:53 | and then we extract from these various signals the delay between them, that is the |
---|
0:01:59 | time delay of arrival, and we use this information for two things. |
---|
0:02:04 | First |
---|
0:02:05 | is the acoustic fusion, which is to create, as you |
---|
0:02:09 | probably know, |
---|
0:02:10 | to create an enhanced signal just by summing all |
---|
0:02:14 | the signals from the different microphones, delaying one with respect to |
---|
0:02:19 | the other, |
---|
0:02:22 | probably about the proposed for them |
---|
0:02:25 | one delayed with respect to the other, so that at the end we have |
---|
0:02:29 | a single acoustic signal, summed and less noisy. |
---|
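A minimal sketch of the delay-and-sum idea described in this part of the talk, assuming the delay of each channel has already been estimated and rounded to a whole number of samples. The function name `delay_and_sum` and the toy signals are illustrative and are not taken from the talk.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them.

    channels: list of sample lists, one per microphone, same sampling rate.
    delays:   non-negative integer delay of each channel relative to the
              earliest one (assumed to be known already).
    """
    # Keep only the part that every channel covers once it is aligned.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    aligned = [ch[d:d + length] for ch, d in zip(channels, delays)]
    # Average sample by sample across microphones.
    return [sum(samples) / len(aligned) for samples in zip(*aligned)]


source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
mic_a = source                 # the signal arrives here first
mic_b = [0.0, 0.0] + source    # same signal, two samples later
enhanced = delay_and_sum([mic_a, mic_b], [0, 2])
```

With real recordings the channels also carry independent noise, which is what the averaging is meant to reduce; the toy signals here are noise-free so that the alignment is easy to check.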
0:02:36 | Now we use this signal to extract the cepstral features, the mel frequency |
---|
0:02:41 | cepstral coefficients, |
---|
0:02:43 | and also to extract information about where there is speech and where there is |
---|
0:02:48 | not, with the voice activity detector. |
---|
0:02:52 | On the other hand, the delays, the TDOAs, are used |
---|
0:02:58 | as an input to |
---|
0:03:00 | the last stage, that is the segmentation and agglomerative clustering |
---|
0:03:04 | stage; this is the diarization stage, |
---|
0:03:09 | where we actually decide who is speaking and when, so where |
---|
0:03:14 | we perform the diarization. |
---|
0:03:17 | it's a calm |
---|
0:03:19 | what it doesn't matter |
---|
0:03:22 | that these |
---|
0:03:23 | segmentation and agglomerative clustering, I mean. |
---|
0:03:27 | Here is a more detailed diagram of what this stage |
---|
0:03:33 | performs. We first do an initialization, a segmentation, an initial |
---|
0:03:39 | segmentation |
---|
0:03:40 | that for us is uniform. For the baseline system we uniformly |
---|
0:03:46 | segment into N clusters, for us |
---|
0:03:48 | sixteen. It could be sixteen or more, but we use sixteen because then we |
---|
0:03:53 | iteratively are going to |
---|
0:03:55 | reduce the number by merging the hypothesized clusters. |
---|
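The uniform initialization can be sketched as follows. This assumes the simplest reading of "uniform", one contiguous block of equal length per cluster; the actual system may distribute the data differently, and the function name is illustrative.

```python
def uniform_initialization(num_frames, num_clusters=16):
    """Label every frame by cutting the recording into equal contiguous blocks."""
    return [i * num_clusters // num_frames for i in range(num_frames)]


# Toy recording of 3200 frames split into the sixteen initial clusters.
labels = uniform_initialization(num_frames=3200)
```

Starting with more clusters than expected speakers leaves room for the later merging steps to bring the number down.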
0:04:03 | Then, |
---|
0:04:04 | after this initial segmentation, we start the segmentation and training |
---|
0:04:10 | stage. In this stage, with this segmentation, we create the models |
---|
0:04:15 | that we use to re-segment the signal, |
---|
0:04:20 | and at the end we will have a better segmentation according to |
---|
0:04:26 | the speaker models we have trained. |
---|
0:04:29 | After the re-segmentation we compare these |
---|
0:04:34 | clusters in pairs and we merge those that are most |
---|
0:04:40 | similar. |
---|
0:04:42 | We use for that the Bayesian information criterion. |
---|
0:04:47 | we use |
---|
0:04:48 | in general |
---|
0:04:50 | We do all this iteratively until there are no |
---|
0:04:55 | more clusters to merge, because |
---|
0:04:57 | the system considers that there are no more clusters that are |
---|
0:05:00 | similar enough to be merged, and it |
---|
0:05:03 | finishes. |
---|
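The merge loop described here (compare clusters in pairs, merge the most similar, stop when no pair is similar enough) can be sketched as below. This is only a sketch under stated assumptions: `toy_merge_score` is a hypothetical stand-in for the real merge criterion, and the re-segmentation and retraining that the system performs between merges are left out.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def toy_merge_score(a, b):
    """Stand-in for the real merge criterion: positive only for close means."""
    return 1.0 - abs(mean(a) - mean(b))


def agglomerate(clusters, merge_score, threshold=0.0):
    """Merge the best-scoring pair repeatedly until no pair beats the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        best_score, i, j = max(
            (merge_score(clusters[i], clusters[j]), i, j) for i, j in pairs
        )
        if best_score <= threshold:
            break  # nothing is similar enough any more
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is still valid
    return clusters


final = agglomerate([[0.0, 0.1], [0.2], [5.0], [5.1, 5.2]], toy_merge_score)
```

In this toy run the four starting clusters collapse into two, one around 0.1 and one around 5.1, and the loop stops because the remaining pair scores below the threshold.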
0:05:04 | score |
---|
0:05:05 | that's features |
---|
0:05:10 | Well, here, |
---|
0:05:12 | we are moving to the point. This is a diagram where you can see |
---|
0:05:17 | all the parameters that are actually involved in the |
---|
0:05:21 | duration of the speaker turns. We have one parameter called minimum duration |
---|
0:05:26 | that |
---|
0:05:28 | for us is two fifty frames, two hundred fifty frames, |
---|
0:05:32 | so around two point five seconds. That is like: |
---|
0:05:36 | okay, I want my speaker turns to be at least two point |
---|
0:05:42 | five seconds long, because if they are shorter, well, I'm not so |
---|
0:05:46 | much interested in them, so let's force the system to go at least |
---|
0:05:52 | two fifty. |
---|
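The arithmetic behind the minimum duration can be checked with a small sketch. The 10 ms frame step is an assumption, inferred from 250 frames being described as about 2.5 seconds; the helper names are illustrative.

```python
from itertools import groupby

FRAME_STEP_S = 0.01  # assumption: 10 ms per frame, so 250 frames is about 2.5 s


def turn_durations(labels):
    """Return (speaker, duration in seconds) for each run of equal labels."""
    return [(spk, sum(1 for _ in run) * FRAME_STEP_S) for spk, run in groupby(labels)]


def respects_minimum(labels, min_frames=250):
    """True if every speaker turn lasts at least min_frames frames."""
    return all(sum(1 for _ in run) >= min_frames for _, run in groupby(labels))


ok_segmentation = [0] * 250 + [1] * 300
short_segmentation = [0] * 250 + [1] * 100
```

Here `respects_minimum(ok_segmentation)` holds, while `short_segmentation` contains a 100-frame turn, which is shorter than the minimum the talk describes.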
0:05:53 | Then there are these parameters, alpha and beta. We wanted to look at them because |
---|
0:05:58 | they define the influence that the transition has on the duration of |
---|
0:06:04 | the speaker |
---|
0:06:06 | turn. |
---|
0:06:07 | Alpha, I mean, is the probability you would apply to |
---|
0:06:14 | remain in the same speaker, not moving to another, and beta is to |
---|
0:06:19 | move to another cluster, to another speaker. So we |
---|
0:06:23 | set alpha to one, |
---|
0:06:24 | and as it is one, that is |
---|
0:06:28 | like saying that it actually has no influence on |
---|
0:06:34 | the final decision. |
---|
0:06:35 | But beta, in the original system used in the |
---|
0:06:39 | experiments, is one over M, and we |
---|
0:06:44 | found that it came from the original system. And |
---|
0:06:49 | the problem for us is that it is not just this parameter that is |
---|
0:06:54 | having some influence on the decision of moving from one speaker to |
---|
0:06:58 | another, but that this M is the number of active clusters, |
---|
0:07:02 | and |
---|
0:07:04 | our system iteratively reduces this number of clusters. It goes from sixteen, sixteen |
---|
0:07:10 | for us because that is what we use, |
---|
0:07:12 | to, well, the final number, and that could be two |
---|
0:07:16 | or one, |
---|
0:07:17 | well, one would be just the limit, but two, three, four; and in each iteration |
---|
0:07:22 | it is going to change. |
---|
0:07:26 | Here you see when the system actually decides to change. First we have |
---|
0:07:33 | the likelihood, well, the basic inequality is: |
---|
0:07:38 | the likelihood of some frames |
---|
0:07:41 | to belong to one cluster and, on the other side, the likelihood of the same frames to |
---|
0:07:46 | belong to another cluster. |
---|
0:07:48 | Those two are related to the data we have, so we are okay |
---|
0:07:54 | with this. But this other parameter, which we have called logarithm of K, |
---|
0:08:00 | is independent of the data. K is what we presented in |
---|
0:08:06 | the diagram of the previous slide, that |
---|
0:08:11 | is, one over M. |
---|
0:08:14 | So |
---|
0:08:14 | it is independent, it has nothing to do with the data, and |
---|
0:08:18 | actually, if it is lower than one, it is penalizing changes, |
---|
0:08:24 | because the logarithm of |
---|
0:08:26 | zero point whatever is negative, |
---|
0:08:29 | and if it is higher, |
---|
0:08:31 | well, |
---|
0:08:32 | it favours these changes. |
---|
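A hedged sketch of the decision rule described in this passage: the log-likelihood of the frames under the current cluster is compared with their log-likelihood under another cluster plus log K. The exact form of the inequality on the slide is not recoverable from the transcript, so the function below is an assumption about its shape, and the numbers are made up for illustration.

```python
import math


def should_switch(loglik_current, loglik_other, k):
    """Switch speaker only if the other cluster wins after adding log(K).

    k < 1 gives a negative log(K) and penalizes changes,
    k = 1 gives log(K) = 0 so only the data decides,
    k > 1 gives a positive log(K) and favours changes.
    """
    return loglik_other + math.log(k) > loglik_current


# Toy log-likelihoods: the other cluster fits the frames slightly better.
current, other = -10.5, -10.0
decisions = {k: should_switch(current, other, k) for k in (1 / 16, 1.0, 3.0)}
```

With these toy numbers the change is rejected for K = 1/16, because log K is about -2.77 and outweighs the 0.5 advantage, and it is accepted for K = 1 and K = 3.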
0:08:36 | And as I said, |
---|
0:08:38 | when, |
---|
0:08:39 | as we have that K is one over M, and M |
---|
0:08:43 | decreases in every iteration, K also increases in every iteration, but it is still |
---|
0:08:48 | going to be always lower than one. So in our baseline system |
---|
0:08:52 | we are always penalizing transitions, even though we really don't know if |
---|
0:08:57 | we want to make |
---|
0:08:59 | shorter turns or longer ones; we are doing it. So, |
---|
0:09:03 | well, we see that if we have a lower number of speakers, because |
---|
0:09:08 | we decrease from sixteen to whatever, if we have a lower number of speakers |
---|
0:09:13 | we have a higher probability of having changes, I don't know, I assume that |
---|
0:09:19 | and all isolated so more these transitions |
---|
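The way the baseline weight drifts over iterations can be tabulated with a short sketch, assuming K = 1/M with M going from sixteen active clusters down to two, as stated in the talk. The function name is illustrative.

```python
import math


def baseline_log_k(num_active_clusters):
    """log(K) for the baseline setting K = 1 / M."""
    return math.log(1.0 / num_active_clusters)


# M shrinks by one cluster per merge: 16, 15, ..., 2.
schedule = [(m, baseline_log_k(m)) for m in range(16, 1, -1)]
values = [log_k for _, log_k in schedule]
```

Every entry in `values` is negative and each one is larger than the previous, which matches the point being made: the penalty on speaker changes gets weaker as clusters are merged, yet it never disappears.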
0:09:22 | So we thought, well, let's set K constant; |
---|
0:09:26 | maybe it works fine, and with this we remove this variability and take the decision |
---|
0:09:32 | based only on the data. And also, |
---|
0:09:35 | of course, as we decided to do this experiment, we decided also |
---|
0:09:39 | to say, okay, |
---|
0:09:40 | let's set this K to a fixed value, |
---|
0:09:43 | negative maybe, but what we actually wanted to look at is whether we could |
---|
0:09:49 | favour, that is, |
---|
0:09:51 | the transition is penalized but it doesn't change over iterations, |
---|
0:09:56 | but also |
---|
0:09:58 | maybe a positive value, so we favour transitions, changes of |
---|
0:10:04 | speakers. |
---|
0:10:06 | Those are the experiments. |
---|
0:10:07 | Here is the database I used. |
---|
0:10:12 | We have the development set, that is |
---|
0:10:15 | probably switch task is to evaluate somebody eight meetings |
---|
0:10:18 | from yes |
---|
0:10:19 | you see two thousand two thousand and five two thousand six and seven |
---|
0:10:27 | and we have used that as the development dataset, and then the test set, |
---|
0:10:31 | that is RT09 from |
---|
0:10:33 | come on |
---|
0:10:35 | Well, the development set is from these, and the results presented here |
---|
0:10:39 | are for two thousand nine. |
---|
0:10:43 | Here are all the experiments we have done to analyze, |
---|
0:10:47 | to study, the effect of these |
---|
0:10:49 | parameters. We wanted to check the effect on the system when we have |
---|
0:10:55 | this K |
---|
0:10:56 | as a constant weight, because it is like, |
---|
0:10:59 | well i applied to taxis you have |
---|
0:11:03 | and we wanted also to test, to evaluate, its influence |
---|
0:11:10 | if we are also taking into account the minimum duration parameter I told you |
---|
0:11:15 | about, |
---|
0:11:16 | because, well, both of them are actually influencing the duration of the speaker turn. |
---|
0:11:23 | We used, like in the baseline, two fifty frames, so there |
---|
0:11:29 | is the baseline, which is the flat line. |
---|
0:11:32 | This is only for reference, of course, because the transition weight in the baseline |
---|
0:11:36 | is equal to one over M, so it changes throughout the process. |
---|
0:11:41 | And then we have all these other |
---|
0:11:44 | experiments here, where the transition weight can be, well, if it is one, I |
---|
0:11:49 | want you to notice that when the transition weight is one it is like a constant |
---|
0:11:54 | whose effect, the logarithm of one, is zero, so |
---|
0:11:57 | no effect at all, it is only the data. |
---|
0:12:00 | And if it is very low, it means that changes are |
---|
0:12:05 | very few, actually. |
---|
0:12:08 | I haven't |
---|
0:12:09 | put the value for a transition weight equal to zero, |
---|
0:12:14 | because |
---|
0:12:15 | it's like fifty |
---|
0:12:16 | and a vertical line was there i have the two with the only needs and |
---|
0:12:24 | it was, well, very high, because the system, for a transition weight equal to zero, |
---|
0:12:28 | at the end assigns all the recording to one speaker, which is obviously |
---|
0:12:34 | a very high error rate. |
---|
0:12:38 | Then we saw, okay, with minimum duration equal to two hundred we actually have |
---|
0:12:44 | a very stable area with low error rate, a very stable section |
---|
0:12:49 | around a transition weight of one, |
---|
0:12:52 | and with a lower error rate than the baseline, so |
---|
0:12:57 | maybe it is good to take this into consideration. |
---|
0:13:02 | So let's see |
---|
0:13:03 | what happened in the end, |
---|
0:13:07 | what happened in the end |
---|
0:13:08 | if we fix |
---|
0:13:10 | this with the values that we chose. |
---|
0:13:12 | Three points; all of those points that we have evaluated |
---|
0:13:17 | with the development dataset were |
---|
0:13:19 | better than the baseline. Three: |
---|
0:13:22 | one, two, three. |
---|
0:13:23 | We also computed the diarization for |
---|
0:13:30 | a transition weight equal to |
---|
0:13:33 | one over M, which is the baseline, but with a minimum duration of |
---|
0:13:37 | two hundred, so that |
---|
0:13:39 | we could compare |
---|
0:13:42 | the improvement due to the transition weight variation and due to the minimum duration |
---|
0:13:48 | separately, because the baseline uses a minimum duration of two fifty. |
---|
0:13:54 | I liked very much the idea of setting it constant because, well, |
---|
0:13:59 | it was |
---|
0:14:00 | good to see: |
---|
0:14:02 | a parameter is independent, then if we can keep it constant and have better results, |
---|
0:14:06 | or at least results that are good enough, |
---|
0:14:11 | why not? |
---|
0:14:12 | Something less to train for future experiments. |
---|
0:14:16 | The problem is actually the test set: it didn't go very well. |
---|
0:14:22 | Not very badly, and if we compute the average of the two error |
---|
0:14:28 | rates it is good, but |
---|
0:14:30 | it was worse. |
---|
0:14:32 | we have what was barry we thought well |
---|
0:14:36 | And the best results are for a |
---|
0:14:39 | transition weight equal to three, which is favouring, actually favouring, changes |
---|
0:14:44 | of speaker, |
---|
0:14:45 | and reducing the minimum duration of any speaker turn. |
---|
0:14:52 | Conclusions. Well, the conclusions I think I gave more or less during |
---|
0:14:57 | the presentation; this is more like a summary. What I think is that this transition |
---|
0:15:03 | weight, I don't think we discovered it, because |
---|
0:15:06 | it was previously set statically, it came from ICSI; well, maybe some of you |
---|
0:15:14 | have worked with it. |
---|
0:15:18 | unleashed |
---|
0:15:18 | For us, we discovered that very small changes can affect the diarization |
---|
0:15:24 | very much, and it may not look like that at the beginning, |
---|
0:15:28 | but if you want to keep it constant, at least you have to know |
---|
0:15:31 | that it exists. If you want to change the duration, or to |
---|
0:15:37 | work with the duration of the speaker turns, it is important |
---|
0:15:43 | to run experiments |
---|
0:15:46 | with both the transition weight and the minimum duration, because, well, varying both |
---|
0:15:52 | of these |
---|
0:15:54 | we |
---|
0:15:54 | actually got |
---|
0:15:56 | better results. |
---|
0:15:57 | For us that is good, but the main thing we learned from this is that, |
---|
0:16:04 | and |
---|
0:16:05 | if the variability with one parameter is very high, |
---|
0:16:09 | or can be very high, you must |
---|
0:16:13 | try to take it into account, and maybe setting it constant |
---|
0:16:21 | to one is the best option, so you can |
---|
0:16:24 | make the system more robust for future experiments. |
---|
0:16:30 | Well, that's more or less it, I think. |
---|
0:16:39 | so we then proposed |
---|
0:16:51 | thanks to multiple english |
---|
0:16:55 | first of all |
---|
0:16:57 | we look for this so that it's much platoons good solution from the whole circle |
---|
0:17:05 | two six should also and so each time constant |
---|
0:17:15 | so smooth the lasted a okay a sycamore to all the weights |
---|
0:17:22 | we should |
---|
0:17:25 | flirt |
---|
0:17:25 | the phone but it's very important to train them |
---|
0:17:30 | the show a high constant |
---|
0:17:34 | i know not so the remote were used for training the transition probabilities |
---|
0:17:40 | in rooms do not want to work with them or whatever how to cope with |
---|
0:17:47 | this remote is |
---|
0:17:50 | and it's as much as the solution |
---|
0:17:55 | this transition |
---|
0:17:58 | the motivation and the results |
---|
0:18:04 | i dunno why the snow |
---|
0:18:06 | okay well |
---|
0:18:18 | i don't use the word |
---|
0:18:21 | those in differences |
---|
0:18:24 | all rates were all the routes two goals in the logo to go all this |
---|
0:18:30 | is |
---|
0:18:33 | as a constant |
---|
0:18:35 | it's a cost and is |
---|
0:18:38 | one two three doesn't matter at all |
---|
0:18:41 | so in quantum o one with the home and speaker of the |
---|
0:18:47 | because why you try |
---|
0:18:53 | well |
---|
0:18:56 | It is three, and you know it is a constant value, but it is a different |
---|
0:19:01 | number, and the decision is taken when this inequality is fulfilled. I mean, |
---|
0:19:07 | this inequality |
---|
0:19:08 | is like: okay, the likelihood that these frames belong to |
---|
0:19:14 | this cluster, the likelihood that these same frames belong to another cluster, completely different, |
---|
0:19:20 | and then this K; and if it is high it favours that you, okay, |
---|
0:19:25 | change cluster, and if it is very low |
---|
0:19:29 | it forbids you |
---|
0:19:30 | to go to another cluster. |
---|
0:19:32 | That is why it is a variable: you favour the transitions more, the changes, or you penalise |
---|
0:19:37 | them. |
---|
0:19:43 | questions |
---|
0:19:51 | so why i of course there is also probably |
---|
0:19:56 | you transition words so |
---|
0:19:59 | we can use |
---|
0:20:00 | okay for volume of english could be thing |
---|
0:20:04 | in there as well |
---|
0:20:05 | first for ratio between the core model on the |
---|
0:20:11 | the |
---|
0:20:12 | moreover maybe |
---|
0:20:13 | become the new speaker |
---|
0:20:15 | so it's |
---|
0:20:17 | So do you think this sort of threshold adjustment |
---|
0:20:21 | would be dependent on the task or the database? I just have this one still, because |
---|
0:20:27 | i haven't actually an take a nice okay i have a right |
---|
0:20:32 | That is why I think it is better for the system to be more robust, |
---|
0:20:37 | to be used in future tasks or other databases, and |
---|
0:20:43 | well it is the speaker out of the rights and in different meetings before and |
---|
0:20:47 | databases that have a slightly longer duration maybe i speak and a lot at all |
---|
0:20:53 | their interface the on a sorta |
---|
0:20:56 | and if you are in that room with four people just don't well it depends |
---|
0:21:01 | on the basically see that that's why i tied states yes okay if a if |
---|
0:21:07 | you can that's a similar results yes have you |
---|
0:21:12 | is a time you don't have to train and that's always |
---|
0:21:18 | if you it you have a similar result or something in that this would to |
---|
0:21:23 | know what you have you will have a less work to do you know used |
---|
0:21:28 | to let that the c sent one you that bayes and unique the menu or |
---|
0:21:32 | and |
---|
0:21:34 | straightforward |
---|
0:21:36 | get rid of this problem, right. I also, because this is like a |
---|
0:21:44 | preliminary work, I would like to maybe use this, |
---|
0:21:49 | if I |
---|
0:21:50 | somehow could; I really don't have any clear idea of |
---|
0:21:55 | how to do it, |
---|
0:21:58 | but when i a get a good resampling that |
---|
0:22:01 | if I somehow had an idea of how long the speaker turns are |
---|
0:22:05 | going to be, |
---|
0:22:06 | or how many speakers, or maybe if I have some information about the role of |
---|
0:22:10 | the speakers in the room, that could |
---|
0:22:13 | actually |
---|
0:22:14 | not would be used to i think smiling at that is aligned and that that's |
---|
0:22:19 | all a lot |
---|
0:22:21 | actually staying i think this kind of the probability of this is a low enough |
---|
0:22:26 | to these one or something some way of extracting this information in |
---|
0:22:32 | unsupervised diarization could be tricky, but still I think then you could |
---|
0:22:38 | adjust this parameter to get better results. |
---|
0:22:44 | but not |
---|
0:22:46 | okay |
---|
0:22:48 | show you questions |
---|