0:00:15 | okay |
---|
0:00:15 | thank you, my name is Abraham |
---|
0:00:18 | i'll present the work we have carried out by extracting i-vectors from |
---|
0:00:23 | short- and long-term speech features for speaker clustering |
---|
0:00:26 | and |
---|
0:00:27 | this is joint work with Jordi Luque and Javier Hernando |
---|
0:00:32 | so the outline of |
---|
0:00:34 | the presentation is as follows: we will describe |
---|
0:00:38 | the objectives of our research |
---|
0:00:40 | we will also describe the main |
---|
0:00:44 | long-term features that are used in our experiments, and we will also mention the |
---|
0:00:49 | baseline and the proposed speaker diarization architectures |
---|
0:00:53 | and then we will |
---|
0:00:55 | describe the fusion techniques that are carried out in the speaker segmentation and speaker clustering |
---|
0:01:00 | and finally the experimental setup and the conclusions will be presented |
---|
0:01:06 | so first of all |
---|
0:01:08 | speaker diarization consists of two main tasks, and these are |
---|
0:01:12 | speaker segmentation and speaker clustering |
---|
0:01:14 | and in speaker segmentation |
---|
0:01:16 | a given audio stream is |
---|
0:01:19 | split into homogeneous segments, and in speaker clustering |
---|
0:01:23 | speech clusters that belong to a given speaker are grouped together |
---|
0:01:28 | so the main motivation for this is that in our previous |
---|
0:01:32 | work |
---|
0:01:32 | we have shown that the use of jitter and shimmer and |
---|
0:01:36 | prosodic features have improved |
---|
0:01:39 | the performance of |
---|
0:01:41 | gmm based speaker diarization systems, so based on this |
---|
0:01:45 | we have proposed the extraction of i-vectors from these |
---|
0:01:49 | voice quality and prosodic features |
---|
0:01:51 | and then to fuse their cosine distance scores with those of the |
---|
0:01:56 | mfcc for the speaker clustering task |
---|
0:02:00 | so here in the feature selection |
---|
0:02:02 | we select different sets of features from the voice quality and from the prosodic ones |
---|
0:02:08 | from the voice quality ones we extract |
---|
0:02:10 | features called absolute jitter, absolute shimmer and shimmer apq3, and from the prosodic ones we extract |
---|
0:02:16 | the pitch |
---|
0:02:18 | intensity and the first four formant frequencies |
---|
0:02:21 | once these features are extracted, they are stacked in the same feature vector |
---|
0:02:27 | then we extract two different sets of i-vectors: the first i-vector is from |
---|
0:02:32 | the mfcc |
---|
0:02:33 | and the second i-vector is from the long-term features |
---|
0:02:37 | then the cosine similarity of these two |
---|
0:02:41 | i-vectors is used for the speaker clustering task |
---|
0:02:46 | so these are the main speech features that are used in our experiments: we have the mfcc |
---|
0:02:52 | the voice quality ones, that is jitter and shimmer, and we have also used the prosodic ones |
---|
0:02:59 | so from the voice qualities we have selected three different measurements based on previous |
---|
0:03:04 | studies. these are the absolute jitter, which measures the variation between |
---|
0:03:09 | two consecutive periods |
---|
0:03:11 | and we have also used the absolute shimmer |
---|
0:03:15 | which measures the variation of the amplitude between consecutive periods, and also |
---|
0:03:20 | the shimmer apq3 |
---|
0:03:22 | which is similar to absolute shimmer except that |
---|
0:03:26 | it takes into consideration three consecutive periods |
---|
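For concreteness, here is a minimal sketch of how these three voice-quality measurements can be computed from per-cycle pitch periods and peak amplitudes; the cycle detection itself (e.g. Praat-style) is assumed, and the function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def voice_quality_measures(periods, amps):
    """Absolute jitter, absolute shimmer (dB) and shimmer APQ3.

    periods: per-cycle pitch periods in seconds (assumed input)
    amps:    per-cycle peak amplitudes (assumed input)
    """
    periods = np.asarray(periods, dtype=float)
    amps = np.asarray(amps, dtype=float)

    # Absolute jitter: mean absolute difference between consecutive periods.
    abs_jitter = np.mean(np.abs(np.diff(periods)))

    # Absolute shimmer: mean absolute dB ratio of consecutive amplitudes.
    abs_shimmer = np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1])))

    # Shimmer APQ3: mean absolute difference between each amplitude and the
    # average over three consecutive periods, normalised by the mean amplitude.
    three_avg = (amps[:-2] + amps[1:-1] + amps[2:]) / 3.0
    apq3 = np.mean(np.abs(amps[1:-1] - three_avg)) / np.mean(amps)

    return abs_jitter, abs_shimmer, apq3
```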
0:03:31 | so from prosody we have extracted pitch |
---|
0:03:34 | intensity and formant frequencies |
---|
0:03:38 | so when it comes to the speaker diarization architecture first i'll try to describe the |
---|
0:03:43 | baseline system |
---|
0:03:45 | so given a speech signal |
---|
0:03:48 | so we |
---|
0:03:49 | first take the speech/non-speech boundaries from the oracle sad |
---|
0:03:53 | the main reason we are using the oracle sad is |
---|
0:03:56 | that we are really interested in the speaker errors |
---|
0:03:59 | rather than the speech activity detection errors |
---|
0:04:03 | then we extract the mfcc, the jitter and shimmer and the prosodic ones only for |
---|
0:04:08 | the speech frames |
---|
0:04:10 | then the jitter and shimmer and the prosodic ones are stacked in the same feature |
---|
0:04:14 | vector |
---|
0:04:17 | so based on the size of the data, the |
---|
0:04:19 | initial number of clusters is initialized: if we have |
---|
0:04:23 | more data, that is if the |
---|
0:04:25 | size of the data is larger, or if the |
---|
0:04:29 | show is longer, we have a larger number of clusters, and if it is shorter we |
---|
0:04:33 | have a smaller number of clusters, so the initial number of clusters |
---|
0:04:38 | depends just on the duration of the audio signal |
---|
0:04:42 | then we assign segments sequentially to these initialized clusters |
---|
0:04:48 | then we perform the hmm decoding and |
---|
0:04:51 | training process, and then we get two different log-likelihood scores: the first one is |
---|
0:04:56 | for the |
---|
0:04:57 | short-term spectral features |
---|
0:04:58 | and then we also get another score for |
---|
0:05:01 | the long-term features |
---|
0:05:02 | then these two scores are fused linearly in the speaker segmentation and |
---|
0:05:07 | we get the speaker segmentation, which then gives us |
---|
0:05:10 | a set of clusters |
---|
0:05:12 | so we use a classical bic |
---|
0:05:15 | computation technique and compute |
---|
0:05:17 | pairwise similarity between |
---|
0:05:19 | all pairs of clusters, and at each iteration the two clusters that have |
---|
0:05:25 | the highest |
---|
0:05:28 | bic score |
---|
0:05:29 | will be merged, and this process |
---|
0:05:33 | iterates until the highest bic value among the clusters is less than a |
---|
0:05:38 | specified threshold value |
---|
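A minimal sketch of the agglomerative loop just described, assuming a `delta_bic(x, y)` scoring function for a candidate merge (higher means the pair is better modelled as one speaker); this is an illustration, not the authors' implementation:

```python
import numpy as np

def bic_clustering(clusters, delta_bic, threshold=0.0):
    """Greedy agglomerative clustering with a BIC stopping criterion.

    clusters:  list of per-cluster feature matrices (frames x dims)
    delta_bic: assumed pairwise merge score (higher = more similar)
    """
    while len(clusters) > 1:
        # Compute pairwise similarity between all pairs of clusters.
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        # Stop when even the best pair is below the specified threshold.
        if best_score < threshold:
            break
        i, j = best_pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```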
0:05:40 | so this is the classical bic computation; in our work |
---|
0:05:44 | the initialization and the speaker segmentation are the same |
---|
0:05:47 | the main contribution is in the speaker clustering, where the |
---|
0:05:52 | gmm bic computation is replaced by the i-vector clustering one |
---|
0:06:00 | so this is our proposed architecture so given a set of clusters |
---|
0:06:05 | that are |
---|
0:06:06 | the output of the viterbi segmentation we extract |
---|
0:06:09 | two different sets of i-vectors |
---|
0:06:11 | the first set of i-vectors is from the mfcc |
---|
0:06:14 | and the second one is from the jitter and shimmer and the prosodic ones |
---|
0:06:18 | and we use two different |
---|
0:06:20 | universal background models: the first one is for the |
---|
0:06:25 | short-term spectral features |
---|
0:06:26 | and the second one is for |
---|
0:06:29 | the |
---|
0:06:31 | long-term features |
---|
0:06:33 | so the ubms and the t matrices are trained using the same source: from |
---|
0:06:38 | ami we have selected one hundred shows with a total duration of forty hours to |
---|
0:06:43 | train |
---|
0:06:44 | the ubm |
---|
0:06:46 | and the i-vectors are extracted using the alize toolkit |
---|
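For reference, both extractions follow the standard total variability model: a cluster's GMM mean supervector M is written as an offset from the UBM mean supervector m,

```latex
M = m + T w
```

where T is the low-rank total variability matrix (one per feature stream here) and the posterior mean of the latent factor w is the i-vector.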
0:06:49 | so the stopping of the clustering |
---|
0:06:52 | is normally based on a |
---|
0:06:54 | specified threshold value, so if the score falls below the |
---|
0:06:59 | specified one the |
---|
0:07:01 | system stops merging |
---|
0:07:04 | so |
---|
0:07:05 | to find the optimum |
---|
0:07:07 | threshold value we have used a semi-automatic way of |
---|
0:07:11 | finding |
---|
0:07:12 | the threshold value |
---|
0:07:14 | for example in this figure |
---|
0:07:17 | we have displayed how we have selected |
---|
0:07:20 | the lambda value and the stopping criterion for five shows from the development set |
---|
0:07:25 | so these ones, the red ones, show the highest |
---|
0:07:29 | cosine distance scores at each iteration |
---|
0:07:32 | and |
---|
0:07:34 | these black ones are the diarization error rates at each iteration, so the |
---|
0:07:40 | horizontal dashed line is the lambda selected |
---|
0:07:44 | as a threshold to stop the merging process, for example |
---|
0:07:48 | if we talk about the first |
---|
0:07:49 | show |
---|
0:07:52 | the system |
---|
0:07:53 | stops at the fourth iteration because in the fourth iteration |
---|
0:07:56 | the |
---|
0:07:57 | maximum |
---|
0:07:58 | cosine distance score value is less than this threshold value, so we have applied this |
---|
0:08:04 | technique on |
---|
0:08:06 | all the development shows, and this lambda value is applied directly on the test |
---|
0:08:12 | set |
---|
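A sketch of this stopping rule, assuming a `fused_cosine(a, b)` function that returns the (fused) cosine similarity of the i-vectors of two clusters and a threshold `lam` chosen as above; the names are illustrative:

```python
def cluster_with_lambda(clusters, fused_cosine, lam):
    """Merge the most similar pair at each iteration; stop as soon as the
    best fused cosine score falls below the tuned threshold lam."""
    while len(clusters) > 1:
        # Score every pair of clusters (i-vector re-extraction after a merge
        # is assumed to happen inside fused_cosine).
        pairs = [(fused_cosine(a, b), ia, ib)
                 for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        best, ia, ib = max(pairs)
        if best < lam:  # the horizontal dashed line in the figure
            break
        clusters[ia] = clusters[ia] + clusters[ib]  # concatenate segments
        del clusters[ib]
    return clusters
```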
0:08:15 | so we have used two different fusion techniques, one on the speaker segmentation and the |
---|
0:08:21 | other on speaker clustering |
---|
0:08:23 | so in the |
---|
0:08:25 | segmentation |
---|
0:08:26 | the fusion technique is based on log-likelihood scores, so we get |
---|
0:08:31 | two different scores for a given segment, one from |
---|
0:08:35 | the short-term spectral features and the other from the long-term features, so |
---|
0:08:40 | we get |
---|
0:08:41 | a model |
---|
0:08:43 | for |
---|
0:08:44 | the short-term spectral features, so we get the log-likelihood score and this is multiplied by |
---|
0:08:48 | alpha, and again similarly for the |
---|
0:08:52 | long-term features we |
---|
0:08:54 | extract |
---|
0:08:55 | the log-likelihood score and this is multiplied by one minus alpha, and the alphas |
---|
0:09:00 | have to be tuned on the development data set |
---|
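Written out, the fused frame score used in the Viterbi segmentation takes the usual linear form (the notation is mine; x is a feature frame, s a cluster model, and alpha is tuned on the development set):

```latex
\log L(x \mid s) \;=\; \alpha \, \log L_{\mathrm{MFCC}}(x \mid s) \;+\; (1 - \alpha) \, \log L_{\mathrm{long}}(x \mid s)
```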
0:09:06 | so the fusion technique in the speaker clustering is carried out as follows: we have |
---|
0:09:11 | three different sets of features, the mfcc, the voice quality and the prosodic ones |
---|
0:09:17 | so the long-term features are stacked together |
---|
0:09:21 | then we extract two different sets of i-vectors from the mfcc and from the long |
---|
0:09:26 | term ones |
---|
0:09:27 | then the cosine similarity between |
---|
0:09:30 | these two sets of i-vectors |
---|
0:09:33 | is fused by |
---|
0:09:34 | a linear weighting function |
---|
0:09:36 | so in the fused score, each cosine similarity is multiplied by |
---|
0:09:46 | a weight |
---|
0:09:48 | so the beta in this one is |
---|
0:09:51 | the weight |
---|
0:09:52 | that is applied to the cosine distance scores extracted from |
---|
0:09:57 | the spectral features, and one minus beta is |
---|
0:10:00 | the weight assigned |
---|
0:10:02 | for the cosine distance scores |
---|
0:10:04 | extracted from the long-term features |
---|
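So the fused similarity between clusters i and j can be written as follows (again with assumed notation, w denoting the i-vectors of the respective feature streams):

```latex
S(i, j) \;=\; \beta \, \cos\!\left(w_i^{\mathrm{MFCC}}, w_j^{\mathrm{MFCC}}\right) \;+\; (1 - \beta) \, \cos\!\left(w_i^{\mathrm{long}}, w_j^{\mathrm{long}}\right)
```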
0:10:10 | so when we come to the experimental setup |
---|
0:10:13 | we have |
---|
0:10:15 | developed and tested our experiments on the ami corpus, which is |
---|
0:10:19 | a multi-party and spontaneous set of meeting recordings |
---|
0:10:24 | so normally in the ami shows the number of speakers is |
---|
0:10:27 | between two |
---|
0:10:29 | and five, but mostly |
---|
0:10:31 | the number of speakers is four, and |
---|
0:10:34 | it is a meeting corpus and it is multi-channel with different recording |
---|
0:10:37 | conditions |
---|
0:10:38 | so we have selected ten shows as a development set to tune the different parameters, that is |
---|
0:10:43 | the weight values |
---|
0:10:45 | and the threshold values |
---|
0:10:48 | then we have defined |
---|
0:10:50 | two experimental setups: the first one is a single site, so ten shows |
---|
0:10:54 | have been selected from idiap |
---|
0:10:58 | and the other one is a multiple sites |
---|
0:11:00 | we have selected ten shows from idiap |
---|
0:11:03 | edinburgh and tno sites, so |
---|
0:11:06 | the |
---|
0:11:07 | optimum parameters that are obtained from the development set are directly used on these |
---|
0:11:14 | single and multiple site shows. so we have used two different |
---|
0:11:18 | sets of i-vectors |
---|
0:11:20 | for the short and long term features, and these are also |
---|
0:11:24 | tuned on the development set, and we have |
---|
0:11:27 | used the oracle sad, that is the speech references |
---|
0:11:30 | as the speech activity detection, so the der that is reported in this |
---|
0:11:35 | work |
---|
0:11:36 | corresponds mainly to the speaker errors; missed speech and false alarm thus have |
---|
0:11:40 | a zero value |
---|
0:11:44 | so here if we see |
---|
0:11:45 | the results, the baseline system that is based on mfcc and gmm bic |
---|
0:11:51 | clustering, which |
---|
0:11:52 | is the state of the art |
---|
0:11:54 | but when we are using jitter and shimmer and prosody both in the gmm and |
---|
0:12:00 | i-vector |
---|
0:12:02 | clustering techniques, it improves |
---|
0:12:05 | a lot compared to |
---|
0:12:07 | the baseline |
---|
0:12:08 | and if we compare these two, the i-vector |
---|
0:12:12 | clustering techniques with the gmm ones |
---|
0:12:16 | the i-vector clustering techniques |
---|
0:12:19 | again provide better results than |
---|
0:12:22 | the gmm clustering technique |
---|
0:12:24 | and we can also conclude that |
---|
0:12:26 | if we compare the same clustering technique, the i-vector clustering techniques, that is this one based |
---|
0:12:31 | on only short-term spectral features and this one |
---|
0:12:34 | using two different sets of features, the latter |
---|
0:12:37 | provides us better results than |
---|
0:12:40 | using one i-vector from the |
---|
0:12:43 | short-term features |
---|
0:12:48 | so we have |
---|
0:12:49 | also done |
---|
0:12:50 | some post-submission processing work |
---|
0:12:53 | after the submission, to get better results |
---|
0:12:55 | so we have |
---|
0:12:57 | also tested |
---|
0:12:59 | the plda scoring |
---|
0:13:01 | in the clustering stage |
---|
0:13:02 | and the plda clustering, as it is shown in the table |
---|
0:13:06 | whether it uses only one set of i-vectors or |
---|
0:13:09 | two sets of i-vectors |
---|
0:13:11 | it provides better diarization results than both the gmm and cosine scoring techniques |
---|
0:13:19 | so one of the issues in speaker diarization is that the diarization error rate among |
---|
0:13:24 | the different shows is |
---|
0:13:28 | quite variable |
---|
0:13:31 | it varies from one show to another show, for example one show may give us |
---|
0:13:34 | a small der like five percent and another show may give us a der of |
---|
0:13:39 | like fifty percent |
---|
0:13:41 | so for example this box plot shows the |
---|
0:13:45 | der variation over the multiple and the single site, so |
---|
0:13:50 | this one is the der variation for the single site and the grey one |
---|
0:13:54 | is |
---|
0:13:55 | the der variation for the multiple site |
---|
0:13:57 | so this is the highest der and this is the lowest der |
---|
0:14:01 | so we can see that there is |
---|
0:14:02 | a huge variation |
---|
0:14:04 | between |
---|
0:14:05 | the maximum and the minimum |
---|
0:14:09 | so if we see |
---|
0:14:12 | here the use of long-term features |
---|
0:14:15 | both in the gmm and i-vector clustering technique |
---|
0:14:18 | helps us to reduce |
---|
0:14:21 | the der variation among the different shows |
---|
0:14:24 | and the other thing we can see is that both |
---|
0:14:27 | i-vector clustering techniques that are based on |
---|
0:14:30 | short-term and short-term plus long-term features |
---|
0:14:33 | give us |
---|
0:14:35 | smaller errors |
---|
0:14:37 | at least we can say it reduces again |
---|
0:14:39 | the der variations among |
---|
0:14:42 | the different shows |
---|
0:14:43 | and finally this one that is the i-vector clustering technique based on |
---|
0:14:48 | short-term and long-term features gives us |
---|
0:14:51 | the lowest |
---|
0:14:52 | variations among |
---|
0:14:53 | the different shows |
---|
0:14:58 | so in conclusion |
---|
0:15:00 | we have proposed the extraction of i-vectors from |
---|
0:15:04 | short and long term speech features for |
---|
0:15:06 | speaker clustering task |
---|
0:15:09 | and the experiments carried out demonstrate that the |
---|
0:15:12 | i-vector clustering techniques provide |
---|
0:15:15 | better diarization error rates than the gmm clustering ones |
---|
0:15:20 | and also the extraction of i-vectors from the |
---|
0:15:24 | long-term features |
---|
0:15:25 | in addition to the |
---|
0:15:27 | the short-term ones |
---|
0:15:29 | helps us to reduce the der |
---|
0:15:32 | so in conclusion we can say that the extraction of i-vectors |
---|
0:15:37 | and the use of |
---|
0:15:39 | i-vector clustering techniques are helpful for speaker diarization systems |
---|
0:15:43 | and thank you |
---|
0:15:52 | then it's time for questions |
---|
0:16:12 | so i have |
---|
0:16:19 | i was wondering if you could explain the process you are using for calculating the |
---|
0:16:26 | jitter and shimmer, and did you find it to be a robust process across the |
---|
0:16:32 | tv shows |
---|
0:16:37 | normally our |
---|
0:16:40 | shows are meeting domains |
---|
0:16:42 | but |
---|
0:16:44 | it is |
---|
0:16:45 | it is |
---|
0:16:46 | a meeting domain, it's not a tv show |
---|
0:16:49 | but when we extract jitter and shimmer |
---|
0:16:53 | the problem we face is, if |
---|
0:16:56 | the speech is unvoiced |
---|
0:16:59 | we get zero values |
---|
0:17:01 | so we compensate for them by averaging over a five hundred millisecond duration |
---|
0:17:06 | we extract the features over a thirty millisecond duration |
---|
0:17:10 | so to compensate the zero values for the unvoiced frames we average over a five hundred millisecond |
---|
0:17:16 | duration |
---|
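A sketch of this compensation, assuming frame-level jitter/shimmer tracks in which unvoiced frames carry zeros and a 10 ms frame step (so a 500 ms window is roughly 50 frames); both numbers are assumptions for illustration:

```python
import numpy as np

def smooth_voice_quality(values, win=50):
    """Replace zero (unvoiced) frame values with the mean of the voiced
    values inside a centred window (about 500 ms at a 10 ms frame step)."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    half = win // 2
    for t in np.where(values == 0.0)[0]:
        seg = values[max(0, t - half): t + half + 1]
        voiced = seg[seg != 0.0]
        if voiced.size:  # leave zero if the whole window is unvoiced
            out[t] = voiced.mean()
    return out
```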
0:17:27 | you also said in one of your slides that the training is done |
---|
0:17:31 | on the development set |
---|
0:17:33 | how did you find it or train it, how did you find that threshold |
---|
0:17:38 | and did you experiment with changing the threshold value |
---|
0:17:42 | you mean for the segmentation, i think, this one |
---|
0:17:47 | no, in the formula where you |
---|
0:17:51 | represent the segmentation |
---|
0:17:57 | this one |
---|
0:17:58 | or you mean here |
---|
0:18:01 | so you mean the alpha values, they have been |
---|
0:18:05 | manually tuned on the development set |
---|
0:18:08 | we test different weight values, that is the weight |
---|
0:18:11 | values |
---|
0:18:12 | for the two features |
---|
0:18:14 | and |
---|
0:18:15 | these values are directly applied on the test set |
---|
0:18:22 | okay, so they are fixed? yes, in the test experiments they are fixed |
---|
0:18:41 | thank you, very clear presentation, i just wanted to understand a little bit about |
---|
0:18:48 | the physical motivation, do you have an explanation for why you went to jitter |
---|
0:18:54 | shimmer and prosody |
---|
0:18:56 | so for example, in experience we find pitch to be quite, well |
---|
0:19:00 | quite important, how did you sort of converge on these two, did you go |
---|
0:19:06 | through a selection process to get to them, do you have any intuition or |
---|
0:19:10 | explanation for them |
---|
0:19:11 | so you're saying why we are interested in the extraction of the jitter, shimmer and |
---|
0:19:15 | prosodic features? how did you zero in on them, what's your sort of physical |
---|
0:19:19 | intuition for using them as opposed to other long-term features |
---|
0:19:25 | because they are voice quality measurements |
---|
0:19:27 | not spectral measurements |
---|
0:19:29 | so they can be used to discriminate |
---|
0:19:33 | the speech of one person from another one. so your hypothesis is that there |
---|
0:19:38 | would be a significant difference between speakers, as we have seen, and that |
---|
0:19:41 | this will be robust to whatever channel it is going through |
---|
0:19:46 | but jitter and shimmer are extremely delicate, if you will, so |
---|
0:19:53 | if you had to extend this outside this dataset, for example to real life recordings |
---|
0:19:59 | were you going to worry about the sensitivity of these features that you are looking at |
---|
0:20:04 | okay, for example jitter and shimmer have also been used in speaker |
---|
0:20:10 | verification and recognition on these databases |
---|
0:20:14 | so we know normally |
---|
0:20:16 | that it works; that is the reason why we applied it on speaker diarization |
---|
0:20:20 | and we have checked the jitter and shimmer on the ami corpus |
---|
0:20:25 | that's what i'm presenting here; we have also extracted them on another corpus, it is a |
---|
0:20:30 | catalan broadcast tv show |
---|
0:20:32 | and there also we got some improvements |
---|
0:20:36 | so your combination helps, and |
---|
0:20:40 | would there be any others, do you think |
---|
0:20:42 | i wonder if there would be others you think that you could add to the |
---|
0:20:46 | two |
---|
0:20:47 | note that there are different types; we have about ten or eleven types of jitter |
---|
0:20:51 | and shimmer measurements |
---|
0:20:53 | but we have selected these three based on previous studies for speaker recognition, and |
---|
0:20:59 | maybe we can check with the others also |
---|
0:21:08 | any other question |
---|
0:21:14 | i do have a wee question, it's about the stopping criterion, so you are |
---|
0:21:22 | not assuming that you know the number of speakers beforehand |
---|
0:21:26 | that's right, no, we don't know the number of speakers; we only use the oracle sad and we |
---|
0:21:29 | know the oracle speech/non-speech conditions |
---|
0:21:37 | so any other questions |
---|
0:21:42 | there are no more questions, so let's thank the speaker again |
---|