0:00:14 | Good morning. I am from Ghent University, and I recently worked on soft voice activity detection in factor analysis based speaker segmentation of broadcast news.
0:00:26 | This work was done in the context of a subtitling project with the VRT.
0:00:31 | The VRT is the public broadcaster of Flanders, the Dutch-speaking region of Belgium.
0:00:38 | The idea is to use speech technology to speed up the process of creating subtitles for TV shows.
0:00:47 | Another use case is for journalists who want a fast track to put their report online with subtitles; they can use the speech technology to generate the subtitles.
0:00:58 | The quality may be a bit lower, but for online use the speed is more important than the quality of the subtitles.
0:01:07 | So the idea is that subtitling is a very time-consuming manual process, and we want to use speech technology to speed it up.
0:01:15 | In this presentation we will focus on the diarization. Why do we want to solve this "who spoke when" problem?
0:01:23 | First of all, we want to add colours to the subtitles.
0:01:27 | If we want to generate subtitles, it can also be useful to use speaker-adapted models: once we have speaker labels, we can use such adapted models.
0:01:36 | Another thing is that detected speaker changes can provide extra information to the language model of the speech recognizer about where sentences begin and end, so this can also help the recognition.
0:01:51 | At Interspeech there will be a show-and-tell session in which we will demonstrate the complete subtitling platform.
0:02:00 | It will show how you can upload a video and then start the whole chain of speech/non-speech segmentation, speaker diarization, language detection and speech recognition.
0:02:09 | But that is not the final step: we then have to produce short sentences to display them on the screen.
0:02:18 | So what does the concept look like? We get our audio signal, and the first step is the speech/non-speech segmentation: we have to remove laughter and we have to remove music.
0:02:28 | Once we have detected the speech segments, we can start the speaker diarization.
0:02:33 | This includes detecting the speaker change points and finding homogeneous segments.
0:02:39 | Once we have found those segments, we can cluster them to assign a speaker label to all of these segments.
0:02:45 | Then we make the hypothesis that each speaker only uses one language, and because in Flanders we are interested in Flemish, we only keep the Flemish segments.
0:02:55 | Then we do the speech recognition.
0:02:58 | The output of the speech recognizer will need some post-processing to make the sentences short enough to display them on the screen.
0:03:05 | Here we will focus on more accurate speaker segmentation, because segments that are too short cannot provide enough data for reliable speaker models. In the kind of files we will use we sometimes have fifty speakers in one audio file, so the longer the homogeneous speaker segments are, the more reliable the clustering will be.
0:03:26 | Obviously, if we do not detect a speaker change, this results in non-homogeneous segments, and that leads to error propagation during the clustering process. And if we make the segments too short, the clustering becomes a lot slower because we have to compute a lot more distances between segments.
0:03:46 | We propose a two-pass system. In the first pass, the speech segments are generated by the speech/non-speech segmentation.
0:03:55 | Then we do the speaker segmentation with a standard eigenvoice approach. We call these generic eigenvoices, because they are constructed to model every speaker that can appear.
0:04:07 | Once we have detected the speaker segments, we can do standard speaker clustering.
0:04:12 | The output of that speaker clustering, the speaker clusters, is then used to retrain our eigenvoice model: we now know which speakers are active in the broadcast news file, so we retrain eigenvoices that match those speakers.
0:04:27 | We also have the speech segments, so we can retrain our universal background model as well.
0:04:33 | Then we go to the second pass: we again start from our baseline speech segments and do the speaker segmentation again, but now with the specific eigenvoices matching the speakers inside the audio file.
0:04:44 | Then we do the speaker clustering again, and hopefully we get better speaker clusters than in the first pass.
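A rough sketch of this two-pass flow, written as Python pseudocode; every helper passed in here is a hypothetical placeholder, not the actual system code:

```python
def diarize_two_pass(audio, segment_speech, segment_speakers,
                     cluster_speakers, retrain_models):
    """Sketch of the two-pass flow; all arguments are hypothetical callables."""
    # Pass 1: segmentation with generic eigenvoices that cover any speaker.
    speech = segment_speech(audio)
    turns = segment_speakers(speech, models="generic eigenvoices")
    clusters = cluster_speakers(turns)

    # Adaptation: retrain the UBM and the eigenvoices on this file's speech
    # and on the speaker clusters found in the first pass.
    adapted_models = retrain_models(speech, clusters)

    # Pass 2: redo the segmentation and clustering with the adapted models.
    turns = segment_speakers(speech, models=adapted_models)
    return cluster_speakers(turns)
```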
0:04:52 | The first step of our speaker segmentation is the boundary generation, which generates candidate speaker change points.
0:05:02 | We use a sliding window approach with two comparison windows, a left window and a right window. There are two hypotheses: either we have the same speaker in the two windows, or we have a different speaker.
0:05:16 | We use a measure that looks for maximal dissimilarity between the distributions of the acoustic features; if there is a peak in dissimilarity, this indicates that there was a speaker change.
0:05:30 | Also note that the speech/non-speech segmentation does not eliminate short pauses: it is tuned to detect laughter and music segments longer than one second.
0:05:41 | So there can be a short pause between speaker turns.
0:05:45 | If we were to use adjacent comparison windows, this would generate several maxima around a speaker change: maxima can appear at the beginning and at the end of the pause, because there the dissimilarity between the acoustic features in the two windows is maximal.
0:06:05 | Instead, we propose to use overlapping comparison windows.
0:06:09 | If you look at the regions covered by the pause, these actually make the comparison windows more similar to each other.
0:06:21 | When the overlapped region between both comparison windows matches the pause, the dissimilarity between the windows is maximal and the speaker change is inserted in the middle of the pause, which is what we want; it is the more logical thing to do.
0:06:39 | So when we apply this to our sliding window approach, we simply use two overlapping sliding windows, a left window and a right window.
0:06:53 | For each comparison window we want to extract speaker-specific information, and we do this with factor analysis.
0:07:03 | Because we use a sliding window approach, we use very low-dimensional models, since we have to extract the speaker factors for each frame.
0:07:12 | We use a GMM-UBM speech model with thirty-two components and a low-dimensional speaker variability or eigenvoice matrix with only twenty eigenvoices.
0:07:23 | We use a window of one second that we slide across the signal, and for each frame we extract those twenty speaker factors.
0:07:30 | I should mention that as training data we used English broadcast news data.
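As an illustration of what the per-window extraction could look like, here is a minimal numpy sketch of a textbook eigenvoice point estimate; the parameter names are illustrative and the actual implementation in the talk may differ in detail:

```python
import numpy as np

def speaker_factors(frames, ubm_means, ubm_covs, ubm_weights, V):
    """Posterior-mean estimate of the speaker factors for one window.

    frames:    (T, D) acoustic features inside the one-second window
    ubm_*:     parameters of a diagonal-covariance GMM-UBM (C components,
               C would be 32 in the talk)
    V:         (C*D, R) eigenvoice matrix (R would be 20 in the talk)
    """
    C, D = ubm_means.shape
    # Component posteriors (occupation probabilities) for each frame.
    log_lik = np.stack([
        -0.5 * np.sum((frames - ubm_means[c]) ** 2 / ubm_covs[c]
                      + np.log(2 * np.pi * ubm_covs[c]), axis=1)
        for c in range(C)], axis=1) + np.log(ubm_weights)
    post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    # Baum-Welch statistics: zeroth order N_c and centered first order F_c.
    N = post.sum(axis=0)                                   # (C,)
    F = post.T @ frames - N[:, None] * ubm_means           # (C, D)

    # y = (I + V' Sigma^-1 N V)^-1  V' Sigma^-1 F
    Sigma_inv = (1.0 / ubm_covs).reshape(-1)               # (C*D,)
    N_expand = np.repeat(N, D)                             # (C*D,)
    VtSi = V.T * Sigma_inv                                 # (R, C*D)
    A = np.eye(V.shape[1]) + (VtSi * N_expand) @ V
    b = VtSi @ F.reshape(-1)
    return np.linalg.solve(A, b)                           # (R,) speaker factors
```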
0:07:39 | Now that we have the speaker factors per frame, we look for significant local changes between the speaker factors, because these indicate a speaker change.
0:07:50 | We use an extraction window of one second, so it is quite obvious that the phonetic content of this one-second window has a huge impact on the speaker factors.
0:08:00 | We therefore propose to estimate this phonetic variability, this intra-speaker variability, on the test data itself. We have our per-frame speaker factor extraction, and if we look at the segment to the left and make the hypothesis that it contains the same speaker, we can use a Gaussian model to estimate the phonetic or intra-speaker variability on that left signal. The same for the right speaker: we estimate the phonetic variability on the right signal.
0:08:34 | We want to find changes in the speaker factors that are not explained by this phonetic variability; we want to look for changes that occurred because of a real speaker change.
0:08:46 | If we use a Mahalanobis-based distance, we can look for changes in other directions than those caused by the phonetic variability.
0:08:54 | So we propose a Mahalanobis-based distance with two components: one under the hypothesis that we have the left speaker, where we look for changes in the speaker factors that are not explained by the phonetic variability of the left speaker, and a second component looking for changes that are not explained by the phonetic variability of the right speaker.
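A minimal numpy sketch of such a two-component, Mahalanobis-style distance, assuming the within-window covariances are used as the estimate of the phonetic variability; the exact formulation in the paper may differ:

```python
import numpy as np

def segmentation_distance(Y_left, Y_right, reg=1e-3):
    """Two-component Mahalanobis-style dissimilarity between the per-frame
    speaker factors of the left and right comparison windows.

    Y_left, Y_right: (n_frames, R) speaker factors per frame.
    """
    mu_l, mu_r = Y_left.mean(axis=0), Y_right.mean(axis=0)
    d = mu_r - mu_l

    # Within-window covariances model the phonetic / intra-speaker variability.
    cov_l = np.cov(Y_left, rowvar=False) + reg * np.eye(d.size)
    cov_r = np.cov(Y_right, rowvar=False) + reg * np.eye(d.size)

    # Component 1: change not explained by the left speaker's variability.
    # Component 2: change not explained by the right speaker's variability.
    m_l = d @ np.linalg.solve(cov_l, d)
    m_r = d @ np.linalg.solve(cov_r, d)
    return m_l + m_r
```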
0:09:15 | Here we have a speech segment, and this plot shows our distance metric.
0:09:21 | I also included the Euclidean distance to compare it with the Mahalanobis distance.
0:09:26 | The red lines are the selected maxima. Since we have a distance measure and look for maximum distances, we need a peak selection algorithm: we average our distance measure, then according to the length of the speech segment we select a number of maxima, and we also enforce a minimum speaker turn duration of one second.
0:09:47 | The red lines indicate the detected change points, and the black lines are the real speaker turns. We can see that the Mahalanobis distance emphasises the real speaker changes: it successfully detects the two speaker turns, while the Euclidean distance does not.
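A small sketch of a peak selection of this kind; the rule of one candidate peak per second of speech is an assumption of this sketch, only the smoothing and the one-second minimum turn duration come from the talk:

```python
import numpy as np

def select_boundaries(distance, frames_per_sec=100):
    """Peak selection sketch for the per-frame distance curve."""
    d = np.asarray(distance, dtype=float)
    # Moving-average smoothing ("we average our distance measure").
    win = max(1, frames_per_sec // 4)
    d = np.convolve(d, np.ones(win) / win, mode="same")

    n_peaks = max(1, len(d) // frames_per_sec)   # assumption: one per second
    min_gap = frames_per_sec                     # one-second minimum turn

    selected = []
    for idx in np.argsort(d)[::-1]:              # highest distances first
        if all(abs(int(idx) - s) >= min_gap for s in selected):
            selected.append(int(idx))
        if len(selected) == n_peaks:
            break
    return sorted(selected)
```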
0:10:09 | Once we have our candidate speaker change points, we can do some clustering of the adjacent segments to eliminate false alarms.
0:10:19 | Again we have the two-pass system. In the first pass of the system we use delta-BIC clustering of the adjacent speaker turns, to see whether there is enough acoustic similarity between the segments; if they are quite similar, we simply eliminate the boundary.
0:10:39 | In the second pass we have the specific eigenvoice model that matches the speakers in the file, so we can extract speaker factors per homogeneous segment and use the cosine distance to compare the speaker factors.
0:10:54 | If they are similar, we eliminate the candidate change point; if they are dissimilar, it is a speaker change point.
0:11:00 | We can use thresholds on both criteria to control the number of eliminated boundaries.
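A minimal sketch of the cosine-distance boundary elimination, simplified to a single pass over the original adjacent segments; the threshold value is illustrative and would be tuned on development data:

```python
import numpy as np

def eliminate_boundaries(segment_factors, boundaries, threshold=0.5):
    """Drop candidate change points whose adjacent segments look alike.

    segment_factors: list of speaker-factor vectors, one per segment
    boundaries:      candidate change point between segment i and i + 1
    """
    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    kept = []
    for i, b in enumerate(boundaries):
        if cosine_distance(segment_factors[i], segment_factors[i + 1]) > threshold:
            kept.append(b)   # dissimilar enough: keep the speaker change
        # otherwise the boundary is eliminated and the segments are merged
    return kept
```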
0:11:09 | We tested this on the COST278 broadcast news test set, which is a set with twelve languages.
0:11:16 | We used one language as development data to tune our parameters, and the eleven remaining sets were used as test data.
0:11:26 | This amounts to thirty hours of data and four thousand four hundred speaker turns.
0:11:33 | For the evaluation we make a mapping between the estimated change points and the real speaker change points, with a margin of five hundred milliseconds, and we compute the precision and recall based on this mapping.
0:11:46 | The recall is the percentage of real boundaries that are mapped to computed ones, and the precision is the percentage of computed boundaries that are actually mapped.
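A small sketch of how such a boundary precision and recall could be computed; the greedy one-to-one matching used here is an assumption, the talk only states that a mapping with a 500 ms margin is made:

```python
def boundary_precision_recall(reference, hypothesis, margin=0.5):
    """Greedy one-to-one mapping between reference and hypothesised speaker
    change points (in seconds) with a +/- 0.5 s tolerance."""
    unused = sorted(hypothesis)
    matched = 0
    for ref in sorted(reference):
        # Take the closest still-unmapped hypothesis within the margin.
        best = min(unused, key=lambda h: abs(h - ref), default=None)
        if best is not None and abs(best - ref) <= margin:
            unused.remove(best)
            matched += 1
    precision = matched / len(hypothesis) if hypothesis else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall
```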
0:12:03 | We compare this speaker change detection with a delta-BIC baseline.
0:12:11 | We can see that for a low precision we reach a maximum recall that is clearly higher than that of the delta-BIC baseline.
0:12:21 | Once we have these precision-recall curves, we can select an operating point according to the threshold of the boundary elimination algorithm, and we use this operating point to start our speaker clustering.
0:12:39 | Now some more details about our two-pass adaptive speaker segmentation system. In the first pass we obtain our speaker turns, and clusters are generated by clustering the speaker turns from that first pass. We then retrain the UBM and the eigenvoice model on the speech and the speaker clusters of the test file.
0:12:58 | Then we repeat the boundary generation, and we eliminate boundaries with the cosine distance instead of the delta-BIC elimination.
0:13:07 | Here the yellow line indicates our system, and we can see that the cosine distance boundary elimination now outperforms the BIC elimination that we used in the first pass.
0:13:21 | So now we can choose an operating point on the output of the second pass.
0:13:30 | Now, when we extract speaker factors for each comparison window, so far this did not differentiate between the speech and non-speech frames in the test file.
0:13:41 | The idea is to give the speech frames in the windows more weight during the speaker factor extraction.
0:13:47 | We integrate a GMM-based soft voice activity detection: we estimate a speech UBM and a non-speech UBM, and then we use a softmax to convert the log-likelihoods of the speech UBM and the non-speech UBM into speech posteriors per frame.
0:14:07 | We then weight the Baum-Welch statistics that are used during the speaker factor extraction with these speech posteriors.
0:14:16 | It is also important to note that we use the speech UBM to estimate the occupation probabilities of each frame.
0:14:25 | We also use the speech posteriors in the second pass of the system: we do not only retrain the speech UBM, we also retrain the non-speech UBM on the test audio. We have non-speech segments with music and applause, and we also use the low-energy frames inside the speech segments to retrain the non-speech UBM.
0:14:45 | And also during the boundary elimination, to remove the false positives, we use the soft voice activity detection when we extract the speaker factors and then apply the cosine distance boundary elimination.
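A minimal numpy sketch of this soft VAD weighting: softmax speech posteriors from a speech UBM and a non-speech UBM, used to weight the Baum-Welch statistics; the `loglik`/`posteriors` interfaces are hypothetical:

```python
import numpy as np

def soft_vad_weighted_stats(frames, speech_gmm, nonspeech_gmm, ubm_means):
    """Weight the Baum-Welch statistics with per-frame speech posteriors."""
    ll_speech = speech_gmm.loglik(frames)        # (T,) hypothetical interface
    ll_nonspeech = nonspeech_gmm.loglik(frames)  # (T,)

    # Softmax over the two hypotheses -> per-frame speech posterior.
    m = np.maximum(ll_speech, ll_nonspeech)
    p_speech = np.exp(ll_speech - m) / (np.exp(ll_speech - m)
                                        + np.exp(ll_nonspeech - m))

    # Occupation probabilities come from the speech UBM only, as in the talk.
    gamma = speech_gmm.posteriors(frames)        # (T, C) hypothetical interface
    gamma = gamma * p_speech[:, None]            # down-weight non-speech frames

    N = gamma.sum(axis=0)                          # weighted zeroth-order stats
    F = gamma.T @ frames - N[:, None] * ubm_means  # weighted first-order stats
    return N, F
```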
0:15:00 | Here we again plot the BIC baseline, and this is our speaker factor extraction without the soft voice activity detection. We see that if we do not use the two-pass system, the soft voice activity detection does not really improve the results.
0:15:15 | But if we use the two-pass system, where we use the cosine distance for the boundary elimination, we see that we can further improve the results. So the soft voice activity detection is really useful if we use the two-pass system.
0:15:29 | Once we have this best precision-recall curve, we choose an operating point to start our clustering.
0:15:39 | This clustering is an agglomerative clustering. First we do a traditional BIC clustering across the whole data, and this is quite important to get enough data for the i-vector PLDA clustering in the second stage.
0:15:53 | The idea is to extract an i-vector for each cluster that comes out of the BIC clustering.
0:16:00 | Then we use PLDA to test the hypothesis of whether we have the same speaker or a different speaker.
0:16:07 | If the PLDA indicates that it is the same speaker, we merge that cluster pair.
0:16:15 | For this merged cluster we again sum up the sufficient statistics, extract a new i-vector, and test the hypothesis again with the PLDA.
0:16:26 | We iterate this whole clustering process until the PLDA outputs a low probability of the same speaker.
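A greedy sketch of this agglomerative i-vector/PLDA clustering, merging clusters by summing their sufficient statistics; `extract_ivector` and `plda_llr` are hypothetical callables:

```python
import numpy as np

def agglomerative_ivector_plda(stats, extract_ivector, plda_llr, threshold=0.0):
    """Merge the most PLDA-similar cluster pair until no pair looks like the
    same speaker any more.

    stats: list of per-cluster sufficient statistics (N, F) as numpy arrays.
    """
    clusters = list(stats)
    ivecs = [extract_ivector(s) for s in clusters]

    while len(clusters) > 1:
        # Find the most similar pair under the PLDA model.
        best, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = plda_llr(ivecs[i], ivecs[j])
                if score > best:
                    best, best_pair = score, (i, j)
        if best < threshold:        # PLDA says "different speakers": stop
            break
        i, j = best_pair
        merged = (clusters[i][0] + clusters[j][0],   # sum zeroth-order stats
                  clusters[i][1] + clusters[j][1])   # sum first-order stats
        for k in sorted((i, j), reverse=True):
            del clusters[k]
            del ivecs[k]
        clusters.append(merged)
        ivecs.append(extract_ivector(merged))        # new i-vector for the merge
    return clusters
```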
0:16:37 | So what are the results after clustering? Again we use the COST278 broadcast news data sets.
0:16:43 | We evaluate the diarization error rate, which is the percentage of frames that is attributed to the wrong speaker after a mapping between the clusters and the real speakers.
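A small sketch of a frame-level diarization error rate of this kind, using an optimal cluster-to-speaker mapping; overlap handling and forgiveness collars are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def diarization_error_rate(ref, hyp):
    """Percentage of frames assigned to the wrong speaker after an optimal
    one-to-one mapping between hypothesis clusters and reference speakers.

    ref, hyp: integer speaker/cluster label per frame (same length).
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    r_ids, h_ids = np.unique(ref), np.unique(hyp)
    # Confusion matrix: frames of reference speaker r assigned to cluster h.
    conf = np.array([[np.sum((ref == r) & (hyp == h)) for h in h_ids]
                     for r in r_ids])
    # The optimal mapping maximises the number of correctly attributed frames.
    rows, cols = linear_sum_assignment(-conf)
    correct = conf[rows, cols].sum()
    return 100.0 * (1.0 - correct / len(ref))
```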
0:16:54 | Here we have the popular delta-BIC segmentation, which gives a diarization error rate of ten point one percent.
0:17:01 | We see that the detected boundaries are not that accurate when we use a margin of five hundred milliseconds.
0:17:07 | If we look for local changes between the speaker factors, we see a slight improvement in the diarization error rate, but the big change is clearly in the accuracy of the boundaries: the speaker factor extraction is much more accurate in detecting the boundaries.
0:17:23 | The same when we use the two-pass system: we see a slight improvement in the precision and the recall.
0:17:29 | But if we use the two-pass system with the soft voice activity detection, the boundaries get better still: we get a ten percent relative improvement in our diarization error rate, a boundary precision of eighty-one percent and a recall of eighty-five percent, which is clearly better than the popular standard BIC segmentation.
0:17:51 | I also want to note that it is popular to use Viterbi re-segmentation to find more accurate boundaries after the clustering, but if we use our speaker factor approach, this actually deteriorates the results.
0:18:14 | Thank you very much. Time for some questions.
0:18:24 | [Audience question, largely inaudible, about the two-pass approach and how the speaker factors are modelled.]
0:18:46 | The boundary generation works directly on the speaker factors. I did try to fit Gaussian models on the speaker factors, but that did not give the same results: using a distance measure actually gave better results than trying to fit Gaussian models on them.
0:19:18 | [Audience question, largely inaudible.]
0:19:39 | One thing about this approach is that the number of speaker changes we hypothesise depends on the length of the speech segments, so we make an assumption about how many speaker changes can actually occur inside a speech segment; you would have to find a solution for that. But I think it is possible to use this i-vector approach to find boundaries between speech and non-speech segments as well; it would probably generate even more accurate boundaries than the HMM system that I use now, but that is a hypothesis that I should test.
0:20:24 | [Audience question, partly inaudible:] You use a GMM-based voice activity detection; is it adapted to the given recordings, or trained beforehand?
0:20:35 | The HMM system is also adapted to the file; it is again a two-pass system. We have two models for the non-speech: a music model and a background noise model. For speech we also have different models: clean speech, speech with background noise, and speech with music. We go through the file once, estimate posteriors and adapt the models, and then we go through it a second time.
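A frame-wise sketch of that idea with the five classes mentioned; the real system is an HMM-based two-pass segmentation, so the HMM smoothing is omitted here and the `loglik`/`adapt` interfaces are hypothetical:

```python
import numpy as np

def speech_nonspeech_two_pass(frames, models, adapt):
    """Classify frames with five GMM classes, adapt the models to the file,
    and classify again; `models` maps class name to a GMM with `loglik`."""
    classes = ["music", "noise", "speech", "speech+noise", "speech+music"]

    def classify(models):
        ll = np.stack([models[c].loglik(frames) for c in classes], axis=1)
        post = np.exp(ll - ll.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        return post

    post = classify(models)                    # first pass
    models = adapt(models, frames, post)       # adapt the GMMs to this file
    post = classify(models)                    # second pass
    is_speech = post[:, 2:].sum(axis=1) > 0.5  # any of the speech classes wins
    return is_speech
```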
0:21:18 | [Audience question:] To what extent are your error rates affected by overlapping speech, where two speakers talk at the same time? What proportion of your data is overlapped?
0:21:33 | So you are talking about overlapping speech? In this dataset we do not have annotations of overlapping speech, so I cannot comment on what impact it has on the results.
0:21:48 | [Follow-up, partly inaudible:] But what would you get in your clusters if you have two speakers speaking at the same time?
0:22:07 | In most cases this would be detected as a separate cluster; I think if I look at the files manually, the overlap could be detected as a separate cluster. It also occurs that the overlapping speech is assigned to one of the two speakers, but I did notice that sometimes the detector puts it in a separate cluster.
0:22:37 | I think we have time for some more questions.
0:22:45 | [Audience question:] This method is meant for TV subtitling; how would this method be adapted to online diarization, in real time?
0:23:01 | You are talking about the second pass of the system?
0:23:15 | It is not an online system: the idea is that the journalist uploads the file, starts the process and comes back in one hour, for example. The first goal is not to make an online system, but there might be techniques to make it online; I would have to think about that.
0:23:44 | [Audience question:] In this system you do not explicitly model the number of speakers. How many speakers were there in reality, and how many speakers were estimated?
0:23:56 | If we combine the BIC clustering and then the i-vector PLDA clustering, the ratio is very close to one. But I have to note that if you do not use the initial BIC clustering, the i-vector PLDA system still achieves a low diarization error rate, while the ratio between clusters and speakers is off by about a factor of two. So in this system it is quite important to do the initial BIC clustering to make that ratio close to one, even though the diarization error rate does not suffer that much from just using the i-vector clustering.
0:24:30 | All right, I think that is it. If there are no further questions, let us thank the speaker, and all the speakers, once again.