0:00:15 Okay, thank you. My name is Abraham.
0:00:18 I will present the work we have carried out on extracting i-vectors from
0:00:23 short- and long-term speech features for speaker clustering.
0:00:26 This is also joint work with Jordi Luque and Javier Hernando.
0:00:32 So the outline of
0:00:34 the presentation is as follows. First, we will describe
0:00:38 the objectives of our research.
0:00:40 We will also describe the main
0:00:44 long-term features that are used in our experiments, and we will also mention the
0:00:49 baseline and the proposed speaker diarization architectures.
0:00:53 Then we will
0:00:55 describe the fusion techniques that are carried out in speaker segmentation and speaker clustering,
0:01:00 and finally the experimental setup and conclusions will be presented.
0:01:06 So, first of all,
0:01:08 speaker diarization consists of two main tasks, and these are
0:01:12 speaker segmentation and speaker clustering.
0:01:14 In speaker segmentation,
0:01:16 a given audio stream is
0:01:19 split into speaker-homogeneous segments, and in speaker clustering,
0:01:23 speech segments that belong to a given speaker are grouped together.
0:01:28 So the main motivation for this work is that, in our previous
0:01:32 work,
0:01:32 we have shown that the use of jitter and shimmer and
0:01:36 prosodic features improved
0:01:39 the performance of
0:01:41 GMM-based speaker diarization systems. So based on this,
0:01:45 we have proposed the extraction of i-vectors from these
0:01:49 voice quality and prosodic features,
0:01:51 and then we fuse their cosine distance scores with those of the
0:01:56 MFCCs for the speaker clustering task.
0:02:00 So here, in the feature selection,
0:02:02 we select different sets of features from the voice quality and from the prosodic ones.
0:02:08 From the voice quality ones we extract
0:02:10 features called absolute jitter, absolute shimmer and shimmer APQ3, and from the prosodic ones we extract
0:02:16 the pitch,
0:02:18 intensity and the first four formant frequencies.
0:02:21 Once these features are extracted, they are stacked into the same feature vector.
0:02:27 Then we extract two different sets of i-vectors: the first i-vector is from
0:02:32 the MFCCs
0:02:33 and the second i-vector is from the long-term features.
0:02:37 Then the cosine similarity of these two
0:02:41 i-vectors is used for the speaker clustering task.
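(A minimal sketch of this per-stream scoring step, assuming the i-vectors themselves have already been extracted by an external toolkit; the dimensionalities below are placeholders, not the ones used in this work.)

```python
import numpy as np

def cosine_score(ivec_a, ivec_b):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(ivec_a, ivec_b) /
                 (np.linalg.norm(ivec_a) * np.linalg.norm(ivec_b)))

# Hypothetical i-vectors, one per cluster and per feature stream:
# *_mfcc_* from the MFCC stream, *_long_* from the stacked
# jitter/shimmer + prosodic (long-term) stream.
ivec_mfcc_a, ivec_mfcc_b = np.random.randn(400), np.random.randn(400)
ivec_long_a, ivec_long_b = np.random.randn(200), np.random.randn(200)

score_mfcc = cosine_score(ivec_mfcc_a, ivec_mfcc_b)   # short-term stream
score_long = cosine_score(ivec_long_a, ivec_long_b)   # long-term stream
```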
0:02:46 So these are the main speech features that are used in our experiments: we have the MFCCs,
0:02:52 the voice quality ones, that is, jitter and shimmer, and we have also used the prosodic ones.
0:02:59 From the voice quality features we have selected three different measurements based on previous
0:03:04 studies. These are the absolute jitter, which measures the variation between
0:03:09 two consecutive periods,
0:03:11 and we also have the absolute shimmer,
0:03:15 which measures the variation of the amplitude between consecutive periods, and also
0:03:20 the shimmer APQ3.
0:03:22 It is similar to shimmer, but the difference is that
0:03:26 it takes into consideration three consecutive periods.
0:03:31 So from prosody we have extracted the pitch,
0:03:34 intensity and formant frequencies.
0:03:38 So when it comes to the speaker diarization architecture, first I will try to describe the
0:03:43 baseline system.
0:03:45 So, given a speech signal,
0:03:48 we
0:03:49 first take the speech/non-speech mapping from an oracle.
0:03:53 The main reason we are using the oracle labels is
0:03:56 that we are mainly interested in the speaker errors
0:03:59 rather than in the speech/non-speech errors.
0:04:03 Then we extract the MFCCs, the jitter and shimmer, and the prosodic ones only for
0:04:08 the speech frames.
0:04:10 Then the jitter and shimmer and the prosodic ones are stacked into the same feature
0:04:14 vector.
0:04:17 So based on the size of the data, the
0:04:19 initial number of clusters is set: if
0:04:25 the size of the data is larger, or if the
0:04:29 show is longer, we have a higher number of clusters, and if it is shorter, we
0:04:33 have a smaller number of clusters. So the initial number of clusters
0:04:38 depends just on the duration of the audio signal.
0:04:42 Then we initially assign segments to these initialized clusters.
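(A rough sketch of such a duration-based initialization; the exact rule is not given in the talk, so the constants below are assumptions only.)

```python
def initial_num_clusters(audio_duration_s, seconds_per_cluster=60.0,
                         min_clusters=2, max_clusters=16):
    """Duration-based cluster initialization: longer recordings start with
    more clusters.  The constants are illustrative assumptions, not the
    values used in this work."""
    n = int(round(audio_duration_s / seconds_per_cluster))
    return max(min_clusters, min(max_clusters, n))
```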
0:04:48 Then we perform the HMM decoding and
0:04:51 training process, and then we get two different log-likelihood scores: the first one is
0:04:56 for the
0:04:57 short-term spectral features,
0:04:58 and then we also get another score for
0:05:01 the long-term features.
0:05:02 Then these two scores are fused linearly in the speaker segmentation, and
0:05:07 the speaker segmentation then gives us
0:05:10 a set of clusters.
0:05:12 So we use the classical BIC
0:05:15 computation technique and compute the
0:05:17 pairwise similarity between
0:05:19 all pairs of clusters, and at each iteration the two clusters that have
0:05:25 the highest
0:05:28 BIC score
0:05:29 will be merged, and this process
0:05:33 iterates until the highest BIC value among the clusters is less than a
0:05:38 specified threshold value.
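(As a minimal sketch, assuming `clusters` holds the segment data of each cluster and `bic_score(a, b)` returns the pairwise BIC merge score, the agglomerative loop described above behaves roughly like this; the real system also re-trains the models and re-segments after every merge.)

```python
def bic_agglomerative_clustering(clusters, bic_score, threshold):
    """Greedy pairwise merging as in classical BIC-based clustering."""
    while len(clusters) > 1:
        # Find the pair of clusters with the highest BIC score.
        best_pair, best_score = None, float("-inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = bic_score(clusters[i], clusters[j])
                if score > best_score:
                    best_pair, best_score = (i, j), score
        if best_score < threshold:          # stopping criterion
            break
        i, j = best_pair
        merged = clusters[i] + clusters[j]  # assumes '+' concatenates cluster data
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```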
0:05:40 So this is the classical BIC computation. In our work,
0:05:44 the initialization and the speaker segmentation are the same;
0:05:47 the main change we introduce is in the speaker clustering, where the
0:05:52 GMM/BIC computation is replaced by the i-vector clustering one.
0:06:00 So this is our proposed architecture. Given a set of clusters
0:06:06 that are the output of the Viterbi segmentation, we extract
0:06:09 two different sets of i-vectors.
0:06:11 The first i-vector is from the MFCCs
0:06:14 and the second one is from the jitter, shimmer and prosodic ones,
0:06:18 and we use two different
0:06:20 universal background models: the first one is for the
0:06:25 short-term spectral features
0:06:26 and the second one is for
0:06:31 the long-term features.
0:06:33 So the UBMs and the T matrices are trained using the same corpus:
0:06:38 we have selected one hundred shows with a total duration of forty hours to
0:06:43 train
0:06:44 the UBMs,
0:06:46 and the i-vectors are extracted using the ALIZE toolkit.
0:06:49 So the stopping criterion of the clustering
0:06:52 is normally based on a
0:06:54 specified threshold value: if the score falls below the
0:06:59 specified threshold, the
0:07:01 system stops merging.
0:07:04 So,
0:07:05 to find the optimum
0:07:07 threshold value, we have used a semi-automatic way of
0:07:11 finding
0:07:12 the threshold value.
0:07:14 For example, in this figure
0:07:17 we have displayed how we have selected
0:07:20 the lambda value, that is, the stopping criterion, for five shows from the development set.
0:07:25 The red curves show the highest
0:07:29 cosine distance scores at each iteration,
0:07:32 and
0:07:34 the black curves are the diarization error rates at each iteration. The
0:07:40 horizontal dashed line is the lambda value selected
0:07:44 as the threshold to stop the process. For example,
0:07:48 if we look at the first
0:07:49 show,
0:07:52 the system
0:07:53 stops at the fourth iteration, because at the fourth iteration
0:07:57 the maximum
0:07:58 cosine distance score value is smaller than this threshold value. So we have applied this
0:08:04 technique on
0:08:06 the whole development shows, and this lambda value is applied directly on the test
0:08:12 set.
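(Expressed as a tiny sketch, the stopping rule amounts to the check below, where `pairwise_cosine_scores` would hold the fused cosine scores of all cluster pairs at the current iteration and `lam` is the lambda selected from the development shows.)

```python
def should_stop(pairwise_cosine_scores, lam):
    """Stop merging when even the best (maximum) cosine score of the
    current iteration falls below the lambda tuned on the development set."""
    return max(pairwise_cosine_scores) < lam
```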
0:08:15 So we have used two different fusion techniques, one in the speaker segmentation and the
0:08:21 other in the speaker clustering.
0:08:23 So in the
0:08:25 segmentation,
0:08:26 the fusion technique is based on the log-likelihood scores. So we get
0:08:31 two different scores for a given segment: one from
0:08:35 the short-term spectral features and the other from the long-term features.
0:08:41 We get
0:08:43 a model
0:08:44 for the short-term spectral features, so we get its log-likelihood score, and this is multiplied by
0:08:48 alpha, and similarly for the
0:08:52 long-term features we
0:08:54 extract
0:08:55 the log-likelihood score, and this is multiplied by one minus alpha, and the alpha value
0:09:00 has to be tuned on the development data set.
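(In code form, this linear log-likelihood fusion is simply the following; the default alpha is a placeholder, not the tuned value.)

```python
def fused_log_likelihood(ll_short, ll_long, alpha=0.9):
    """Linear fusion of the two stream log-likelihoods used during
    segmentation/decoding: alpha weights the short-term (MFCC) model and
    1 - alpha the long-term (jitter/shimmer + prosodic) model.
    alpha=0.9 is only a placeholder; the real value is tuned on the
    development set."""
    return alpha * ll_short + (1.0 - alpha) * ll_long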
0:09:06 So the fusion technique in the speaker clustering is carried out as follows. We have
0:09:11 three different sets of features: the MFCCs, the voice quality and the prosodic ones.
0:09:17 So the long-term features are stacked together.
0:09:21 Then we extract two different sets of i-vectors, from the MFCCs and from the long-term
0:09:26 ones.
0:09:27 Then the cosine similarity between
0:09:30 these two sets of i-vectors
0:09:33 is fused by
0:09:34 a linear weighting function.
0:09:36 So the fused score, that is, the cosine similarity, is multiplied by
0:09:46 weighting factors,
0:09:48 where the beta in this one is
0:09:51 the weight
0:09:52 applied to the cosine distance scores extracted from
0:09:57 the spectral features, and one minus beta is
0:10:00 the weight assigned
0:10:02 to the cosine distance scores
0:10:04 extracted from the long-term features.
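(Reusing the per-stream cosine scores from the earlier sketch, the fused clustering score can be written as below; beta=0.8 is a placeholder, the real weight is tuned on the development set.)

```python
def fused_cluster_score(score_mfcc, score_long, beta=0.8):
    """Linear fusion of the per-pair cosine scores: beta weights the score
    from the MFCC i-vectors and 1 - beta the score from the long-term
    (jitter/shimmer + prosodic) i-vectors."""
    return beta * score_mfcc + (1.0 - beta) * score_long
```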
0:10:10 So when we come to the experimental setup,
0:10:13 we have
0:10:15 developed and tested the experiments on the AMI corpus, which is
0:10:19 a multi-party and spontaneous dataset of meeting recordings.
0:10:24 So normally, in the AMI shows, the number of speakers
0:10:27 ranges from two
0:10:29 up to five, but mostly
0:10:31 the number of speakers is four, and
0:10:34 they are meeting recordings, and we use a single channel with the far-field
0:10:37 condition.
0:10:38 So we have selected ten shows as a development set to tune the different parameters, that is,
0:10:43 the weight values
0:10:45 and the threshold values.
0:10:48 Then we have defined
0:10:50 two experimental setups. The first one is a single-site one, where ten shows
0:10:54 have been selected from the Idiap site,
0:10:58 and the other one is a multiple-site one:
0:11:00 we have selected ten shows from the Idiap,
0:11:03 Edinburgh and TNO sites. So
0:11:06 the
0:11:07 optimum parameters that are obtained from the development set are directly used on these
0:11:14 single-site and multiple-site shows. So we have used two different
0:11:18 sets of i-vectors
0:11:20 for the short- and long-term features, and their dimension is also
0:11:24 tuned on the development set, and we have
0:11:27 used the oracle SAD for the speech segments,
0:11:30 that is, the speech activity detection. So the DER that is reported in this
0:11:35 work
0:11:36 corresponds mainly to the speaker errors; the missed speech and the false alarms have
0:11:40 a zero value.
0:11:44 So if we look at
0:11:45 the results: the baseline system, that is based on MFCC and GMM/BIC
0:11:51 clustering,
0:11:52 is the state of the art.
0:11:54 But when we are using jitter and shimmer and prosody, both in the GMM and
0:12:00 i-vector
0:12:02 clustering techniques, it improves
0:12:05 a lot compared to
0:12:07 the baseline.
0:12:08 And if we compare the i-vector
0:12:12 clustering techniques with the GMM ones,
0:12:16 the i-vector clustering techniques
0:12:19 again provide better results than
0:12:22 the GMM clustering techniques.
0:12:24 And we can also conclude that,
0:12:26 if we compare the i-vector clustering techniques themselves, this one based
0:12:31 on only short-term spectral features and this one
0:12:34 using the two different sets of features,
0:12:37 the latter provides better results than
0:12:40 using one i-vector from the
0:12:43 short-term features.
0:12:48 So we have
0:12:49 also done
0:12:50 some further work
0:12:53 after the submission of this paper.
0:12:55 So we have
0:12:57 also tested
0:12:59 PLDA scoring
0:13:01 in the clustering stage,
0:13:02 and the PLDA clustering, as it is shown in the table,
0:13:06 whether it uses only one set of i-vectors or
0:13:09 two sets of i-vectors,
0:13:11 provides better diarization results than both the GMM and cosine scoring techniques.
0:13:19 So one of the issues in speaker diarization is that the diarization error rate among
0:13:24 the different shows
0:13:28 varies a lot:
0:13:31 it varies from one show to another show. For example, one show may give us
0:13:34 a small DER, like five percent, and another show may give us a DER of
0:13:39 like fifty percent.
0:13:41 So, for example, this box plot shows the
0:13:45 DER variation over the multiple-site and the single-site shows. So
0:13:50 this one is the DER variation for the single site, and the grey one
0:13:54 is
0:13:55 the DER variation for the multiple site.
0:13:57 So this is the highest DER and this is the lowest DER,
0:14:01 so we can see that there is
0:14:02 a huge variation
0:14:04 between
0:14:05 the maximum and the minimum.
0:14:09 So if we look
0:14:12 here, the use of long-term features,
0:14:15 both in the GMM and i-vector clustering techniques,
0:14:18 helps us to reduce the
0:14:21 DER variation among the different shows.
0:14:24 And the other thing we can see is that both
0:14:27 i-vector clustering techniques, that are based on
0:14:30 short-term and short-term plus long-term features,
0:14:33 give us
0:14:35 fewer errors;
0:14:37 at least we can say they reduce again
0:14:39 the DER variations among
0:14:42 the different shows.
0:14:43 And finally, this one, that is, the i-vector clustering technique based on
0:14:48 short-term and long-term features, gives us
0:14:51 the lowest
0:14:52 variation among
0:14:53 the different shows.
0:14:58 So, in conclusion,
0:15:00 we have proposed the extraction of i-vectors from
0:15:04 short- and long-term speech features for the
0:15:06 speaker clustering task,
0:15:09 and the experimental results have shown that the
0:15:12 i-vector clustering techniques provide
0:15:15 better diarization error rates than the GMM clustering ones,
0:15:20 and also that the extraction of i-vectors from the
0:15:24 long-term features,
0:15:25 in addition to the
0:15:27 short-term ones,
0:15:29 helps us to reduce the DER.
0:15:32 So in conclusion we can say that the extraction of i-vectors
0:15:37 and the use of
0:15:39 i-vector clustering techniques are helpful for speaker diarization systems.
0:15:43 And thank you.
0:15:52 Then it's time for questions.
0:16:12 So, I have a question:
0:16:19 I was wondering if you could explain the process you are using for calculating the
0:16:26 jitter and shimmer, and did you find it to be a robust process across the
0:16:32 TV shows?
0:16:37 Normally our
0:16:40 shows are in the meeting domain;
0:16:44 it is
0:16:46 a meeting domain, it's not a TV show.
0:16:49 But when we extract jitter and shimmer,
0:16:53 the problem we face is that if
0:16:56 the speech is unvoiced
0:16:59 we get zero values.
0:17:01 So we compensate for them by averaging over a five-hundred-millisecond duration:
0:17:06 we extract the features over a thirty-millisecond duration,
0:17:10 and we compensate the zero values of the unvoiced frames by averaging over a five-hundred-millisecond
0:17:16 duration.
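(A minimal sketch of that compensation step, assuming one jitter or shimmer value per frame and zeros on unvoiced frames; the frame shift and window length below are illustrative assumptions, not the exact configuration used in this work.)

```python
import numpy as np

def smooth_voice_quality(frame_values, frame_shift_ms=10.0, window_ms=500.0):
    """Average per-frame jitter/shimmer values over a longer window,
    skipping the zeros produced on unvoiced frames."""
    frame_values = np.asarray(frame_values, dtype=float)
    half = int(round(window_ms / frame_shift_ms / 2))
    smoothed = np.zeros_like(frame_values)
    for i in range(len(frame_values)):
        window = frame_values[max(0, i - half): i + half + 1]
        voiced = window[window > 0]          # keep only voiced-frame values
        smoothed[i] = voiced.mean() if voiced.size else 0.0
    return smoothed
```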
0:17:27 You also said, in one of your slides, that the thresholds are trained
0:17:31 on the development set.
0:17:33 How did you find them or train them? How did you find that threshold,
0:17:38 and did you experiment with changing the threshold value?
0:17:42 You mean for the segmentation? I think this one?
0:17:47 No, in the formula you
0:17:51 presented for the segmentation.
0:17:57 This one,
0:17:58 or here?
0:18:01 So you mean the alpha values? They have been
0:18:05 tuned on the development set:
0:18:08 we tested different weight values, varying the weight
0:18:11 values
0:18:12 for the two features,
0:18:14 and
0:18:15 these values are directly applied on the test set.
0:18:22 Okay, so they are fixed? Yes, exactly, in the test experiments they are fixed.
0:18:41 Thank you, very clear presentation. I just wanted to understand a little bit about
0:18:48 the physical motivation: do you have an explanation of why you went to jitter,
0:18:54 shimmer and prosody?
0:18:56 So, for example, intuitively we would expect pitch to be quite,
0:19:00 quite important. How did you sort of converge on these two? Did you go
0:19:06 through a selection process to get to them? Do you have any intuition or
0:19:10 explanation for that?
0:19:11 So you are asking why we are interested in the extraction of the jitter, shimmer and
0:19:15 prosodic features? How did you zero in on them? What's your sort of physical
0:19:19 intuition for using these as opposed to other long-term features?
0:19:25 Because they are voice quality measurements
0:19:27 and speaker-specific measurements,
0:19:29 so they can be used to discriminate
0:19:33 the speech of one person from another one. So your hypothesis is that there
0:19:38 would be a significant difference between speakers, we have seen that there is, and that
0:19:41 this will be robust to whatever channel it is going through?
0:19:46 But jitter and shimmer are extremely delicate, if you will, so
0:19:53 if you had to extend this outside this dataset, for example to real-life recordings,
0:19:59 are we going to worry about the sensitivity of these features that you are looking at?
0:20:04 Okay, for example, jitter and shimmer have also been used in speaker
0:20:10 verification and recognition on these databases,
0:20:14 so, normally,
0:20:16 that is actually the reason why we applied them to speaker diarization,
0:20:20 and we have checked jitter and shimmer on the AMI corpus,
0:20:25 which is what I am presenting. We have also extracted them on another corpus; it is a
0:20:30 Catalan-project TV show,
0:20:32 and there also we got some improvements.
0:20:36 So you have verified that it helps, and
0:20:40 would there be any others, do you think,
0:20:42 that you could add to these
0:20:46 two?
0:20:47 There are different types of jitter: we have about ten or eleven types of jitter
0:20:51 and shimmer measurements,
0:20:53 but we have selected these three based on previous studies on speaker recognition, and
0:20:59 maybe we can check the others also.
0:21:08 Any other question?
0:21:14 I do have one question; it's about the stopping criterion: so you are
0:21:22 not assuming that you know the number of speakers beforehand?
0:21:26 That's right, we do not assume that we know the number of speakers,
0:21:29 unless it is the oracle condition.
0:21:37 So, any other questions?
0:21:42 If there are no more questions, let's thank the speaker again.