Good morning. Our next speaker, from Ghent University, has recently been working on soft voice activity detection in factor analysis based speaker segmentation of broadcast news.
So, this work has been done in the context of the STON project. The VRT is the public broadcaster of Flanders, the Dutch-speaking region of Belgium.
The idea is to use speech technology to speed up the process of creating subtitles for TV shows. Another use case is journalists who need a fast track to put a report online with subtitles: they can use the speech technology to generate the subtitles. The quality may be a bit lower, but for online use the speed is more important than the quality of the subtitles.
So the point is that subtitling is a very time-consuming manual process, and we want to use speech technology to speed it up.
In this presentation we will focus on the diarization. Why do we want to solve this "who spoke when" problem? First of all, we want to add colours to the subtitles. If we want to generate subtitles it can also be useful to use speaker-adapted models: once we have speaker labels, we can use those adapted models. Another thing is that detected speaker changes can provide extra information to the language model of the speech recognizer, for instance on where sentences begin, so this can also help recognition.
At Interspeech we will have a show-and-tell session in which we will demonstrate the complete subtitling platform. It will show how a file can be uploaded to start the whole chain of speech/non-speech segmentation, speaker diarization, language detection and speech recognition. But that is not the final step: we then still have to produce short sentences so that they can be displayed on the screen.
Okay, so what does the concept look like? We get the audio signal, and the first step is the speech/non-speech segmentation: we have to remove laughter and we have to remove music. Once we have detected the speech segments we can start the speaker diarization. This includes detecting the speaker change points and finding homogeneous segments, and once we have found those segments we can cluster them to assign a speaker label to each of them. Then we make the hypothesis that each speaker only uses one language, and because in Flanders we are interested in Flemish, we only keep the Flemish segments. Then we do the speech recognition, and the output of the speech recognizer needs some post-processing to make the sentences short enough to display them on the screen.
Here we will focus on more accurate speaker segmentation, because segments that are too short cannot provide enough data for reliable speaker models. In the kind of files we use we sometimes have fifty speakers in one audio file, so the longer the homogeneous speaker segments are, the more reliable the clustering will be. Obviously, if we fail to detect a speaker change, this results in non-homogeneous segments and that leads to error propagation during the clustering process. And if we make the segments too short, clustering becomes a lot slower, because we have to compute a lot more distances between segments.
Okay, we propose a two-pass system. In the first pass the speech segments are generated by the speech/non-speech segmentation. Then we do a first speaker segmentation with a standard eigenvoice approach; we call these generic eigenvoices because they are trained to model every speaker that could appear. Once we have detected the speaker segments we can do standard speaker clustering. The output of this clustering, the speaker clusters, is then used to retrain our eigenvoice model: now that we know which speakers are active in the broadcast news file, we retrain eigenvoices that match those speakers. We also have the speech segments, so we can retrain our universal background model as well. Then we go to the second pass: we again start from the baseline speech segments and do the speaker segmentation again, but now with the specific eigenvoices matching the speakers inside the audio file. We then do the speaker clustering again, and hopefully we retrieve better speaker clusters than in the first pass.
Okay, the first step of our speaker segmentation is boundary generation, that is, the generation of candidate speaker change points. We use a sliding window approach with two comparison windows, a left window and a right window, and there are two hypotheses: either we have the same speaker in the two windows, or we have different speakers. We use a measure that looks for maximal dissimilarity between the distributions of the acoustic features; if there is a peak in the dissimilarity, this indicates a speaker change.
Also, the speech/non-speech segmentation does not eliminate short pauses: it is tuned to detect only laughter and music segments longer than one second, so there can be a short pause between speakers. If we were to use adjacent comparison windows, this would generate several maxima around a speaker change: we argue that maxima can appear both at the beginning and at the end of the pause, because at those points the dissimilarity between the acoustic features in the two windows is maximal. Instead, we propose to use overlapping comparison windows. If you look at the regions covered by the pause, the red regions on the slide, these make the comparison windows more similar; but when the overlapping region between both comparison windows matches the pause, the dissimilarity between the windows is maximal, and the speaker change is inserted at the middle of the pause, which is exactly what we want. It is just the more logical thing to do.
So when we apply this to our sliding window approach, we simply use two overlapping sliding windows, a left window and a right window, as sketched below.
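To make this windowing concrete, here is a minimal sketch of the scanning loop, written under my own assumptions: the window and overlap sizes are placeholders, and dissim stands for whatever dissimilarity measure is plugged in (later in the talk this becomes a Mahalanobis-based distance on per-frame speaker factors).

    def sliding_change_scores(features, dissim, win=100, overlap=20):
        """Slide two overlapping comparison windows over per-frame features.

        features : (T, D) per-frame features (e.g. per-frame speaker factors)
        dissim   : function (left_window, right_window) -> dissimilarity score
        win      : comparison window length in frames (e.g. about 1 s)
        overlap  : number of frames shared by the left and right windows
        Returns one score per window position; local maxima of this curve
        are the candidate speaker change points.
        """
        scores = []
        for t in range(win, len(features) - win + overlap):
            left = features[t - win:t]                         # left comparison window
            right = features[t - overlap:t - overlap + win]    # starts inside the left one
            scores.append(dissim(left, right))
        return scores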
Okay, for each comparison window we want to extract speaker-specific information, and we do this with factor analysis. Because of the sliding window approach we use very low-dimensional models, as we have to extract the speaker factors for every frame position. We use a GMM-UBM speech model with thirty-two components and a low-dimensional speaker variability or eigenvoice matrix with only twenty eigenvoices. So we take a window of one second, slide it across the file frame by frame, and extract the twenty speaker factors at each position. I should mention that as training data for these models we use English broadcast news data.
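As a rough illustration of what this extraction computes, here is a sketch of the standard eigenvoice point estimate of the speaker factors from the Baum-Welch statistics of one window. It is my own minimal NumPy version, not the implementation behind the talk; the UBM parameters and the eigenvoice matrix V are assumed to be trained beforehand.

    import numpy as np

    def speaker_factors(frames, ubm_weights, ubm_means, ubm_vars, V):
        """Point estimate of the speaker factors y for one extraction window.

        frames      : (T, D) acoustic features of the one-second window
        ubm_weights : (C,)   GMM-UBM mixture weights (C = 32 in the talk)
        ubm_means   : (C, D) GMM-UBM means
        ubm_vars    : (C, D) GMM-UBM diagonal covariances
        V           : (C*D, R) eigenvoice matrix (R = 20 in the talk)
        """
        C, D = ubm_means.shape
        # frame-level occupation probabilities (responsibilities) under the UBM
        log_g = np.stack([
            -0.5 * np.sum((frames - ubm_means[c]) ** 2 / ubm_vars[c]
                          + np.log(2 * np.pi * ubm_vars[c]), axis=1)
            + np.log(ubm_weights[c])
            for c in range(C)], axis=1)                          # (T, C)
        gamma = np.exp(log_g - log_g.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # zeroth-order and centered first-order Baum-Welch statistics
        N = gamma.sum(axis=0)                                    # (C,)
        F = gamma.T @ frames - N[:, None] * ubm_means            # (C, D)
        # y = (I + V^T Sigma^-1 N V)^-1  V^T Sigma^-1 f
        inv_var = (1.0 / ubm_vars).reshape(-1)                   # (C*D,)
        N_rep = np.repeat(N, D)                                  # N expanded per feature dimension
        A = np.eye(V.shape[1]) + (V * (N_rep * inv_var)[:, None]).T @ V
        b = V.T @ (inv_var * F.reshape(-1))
        return np.linalg.solve(A, b)                             # (R,) speaker factors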
Okay, now that we have the speaker factors per frame, we look for significant local changes between the speaker factors, because these indicate a speaker change. We use an extraction window of one second, so it is quite obvious that the phonetic content of this one-second window has a huge impact on the speaker factors.
So we propose to estimate this phonetic variability, the intra-speaker variability, on the test data itself. Given the speaker factor extraction, if we look at the segment to the left and make the hypothesis that it comes from a single speaker, we can use a Gaussian model to estimate the phonetic or intra-speaker variability of that left signal. Similarly, under the hypothesis of the right speaker, we estimate the phonetic variability of the right signal. We then want to find changes in the speaker factors that are not explained by this phonetic variability; we want to look for changes that occurred because of a real speaker change.
If we use a Mahalanobis-based distance, we can look for changes that lie in directions other than those caused by the phonetic variability. So we propose a Mahalanobis-based distance with two components: one under the hypothesis of the left speaker, looking for changes in the speaker factors that are not explained by the phonetic variability of the left speaker, and a second component looking for changes that are not explained by the phonetic variability of the right speaker.
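A minimal sketch of such a two-component distance, under my own assumptions (the per-frame speaker factors of the two comparison windows are given, and a small regularization term keeps the within-window covariances invertible):

    import numpy as np

    def mahalanobis_change_score(left_factors, right_factors, reg=1e-3):
        """Two-component Mahalanobis-based distance between comparison windows.

        left_factors, right_factors : (T, R) per-frame speaker factors of the
        left and right comparison windows. Each within-window covariance is an
        estimate of the phonetic (intra-speaker) variability under the
        hypothesis that the window contains a single speaker.
        """
        mu_l, mu_r = left_factors.mean(axis=0), right_factors.mean(axis=0)
        d = mu_r - mu_l
        C_l = np.cov(left_factors, rowvar=False) + reg * np.eye(d.size)
        C_r = np.cov(right_factors, rowvar=False) + reg * np.eye(d.size)
        # changes not explained by the left-speaker variability ...
        score_l = np.sqrt(d @ np.linalg.solve(C_l, d))
        # ... plus changes not explained by the right-speaker variability
        score_r = np.sqrt(d @ np.linalg.solve(C_r, d))
        return score_l + score_r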
Okay, here we have a speech segment, and this plot shows our distance metric; I have also included the Euclidean distance for comparison with the Mahalanobis distance. We have the distance measure, but for the actual change points we need a peak selection algorithm: we smooth the distance measure, select a number of maxima according to the length of the speech segment, and enforce a minimum duration of one second per speaker turn. The red lines indicate the detected changes and the black lines are the real speaker turns. We see that the Mahalanobis distance emphasizes the real speaker changes and successfully detects the two speaker turns, unlike the Euclidean distance.
Okay, once we have our candidate speaker change points, we can do some clustering of adjacent segments to eliminate false alarms. Again, this fits the two-pass system. In the first pass we use ΔBIC clustering of the adjacent speaker turns to check whether there is enough acoustic similarity between the segments; if they are quite similar, we simply eliminate the boundary. In the second pass we have the specific eigenvoice model that matches the speakers in the file, so we can extract speaker factors per homogeneous segment and use the cosine distance to compare them. If they are similar we eliminate the candidate change point; if they are dissimilar it is kept as a speaker change point. A threshold on both criteria controls the number of eliminated boundaries.
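As an illustration of the cosine-distance variant, here is a minimal sketch under my own assumptions (one speaker-factor vector per segment is given; in a full implementation one would merge segments after eliminating a boundary and re-extract the factors before testing the next one). Sweeping the threshold is what traces the precision-recall curves shown later.

    import numpy as np

    def eliminate_boundaries(segment_factors, threshold=0.5):
        """Drop candidate change points whose adjacent segments look alike.

        segment_factors : list of speaker-factor vectors, one per homogeneous
                          segment, in temporal order.
        threshold       : cosine-distance threshold; a higher value eliminates
                          more boundaries.
        Returns the indices of the boundaries that are kept.
        """
        kept = []
        for i in range(len(segment_factors) - 1):
            a, b = segment_factors[i], segment_factors[i + 1]
            cos_dist = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if cos_dist >= threshold:      # dissimilar -> keep the change point
                kept.append(i)
        return kept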
Okay, we test this on the COST278 broadcast news test set, which is a set with twelve languages. We used one language as development data to tune our parameters, and the eleven remaining languages were used as test data. This amounts to thirty hours of data and four thousand four hundred speaker turns. For the evaluation we make a mapping between the estimated change points and the real speaker change points with a margin of five hundred milliseconds, and we compute the precision and recall based on this mapping. The recall is the percentage of real boundaries that are mapped to computed ones, and the precision is the percentage of computed boundaries that are actually mapped to real ones.
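To make the metric concrete, here is a simple sketch of such a margin-based matching; the greedy one-to-one mapping below is my own choice and the exact mapping used in the evaluation may differ.

    def boundary_precision_recall(hyp, ref, margin=0.5):
        """Precision/recall of detected change points within a time margin.

        hyp, ref : lists of boundary times in seconds (hypothesis / reference).
        margin   : maximum allowed distance for a match (0.5 s in the talk).
        Each reference boundary is matched to at most one hypothesis boundary
        and vice versa.
        """
        used = [False] * len(hyp)
        matched = 0
        for r in ref:
            best, best_d = None, margin
            for i, h in enumerate(hyp):
                if not used[i] and abs(h - r) <= best_d:
                    best, best_d = i, abs(h - r)
            if best is not None:
                used[best] = True
                matched += 1
        precision = matched / len(hyp) if hyp else 0.0
        recall = matched / len(ref) if ref else 0.0
        return precision, recall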
We compare this speaker change detection with the ΔBIC baseline, and we can see that for a low precision we reach a maximum recall of nineteen point six percent, a bit higher than that of the ΔBIC baseline. Once we have these precision-recall curves, we can select an operating point via the threshold of the boundary elimination algorithm, and we use this operating point to start the speaker clustering.
Okay, now some more details about our two-pass adaptive speaker segmentation system. In the first pass we obtain speaker turns, and clusters are generated by clustering those speaker turns. We then retrain the UBM and the eigenvoice model on the speech and the speaker clusters of the test file, we repeat the boundary generation, and we eliminate boundaries with the cosine distance instead of the ΔBIC elimination. Here the yellow line indicates our system, and we can see that the cosine-distance boundary elimination now outperforms the BIC elimination that we used in the first pass. So now we can choose an operating point on the output of the second pass.
Okay, now for what we propose here: if we extract speaker factors for each comparison window as described, this does not differentiate between the speech and the non-speech frames in the test file. The idea is to give the speech frames in the windows more weight during the speaker factor extraction. So we integrate a GMM-based soft voice activity detection: we estimate a speech UBM and a non-speech UBM, and we use a softmax to convert the log-likelihoods of the speech UBM and the non-speech UBM into speech posteriors per frame. We then weight the Baum-Welch statistics that are used in the speaker factor extraction with these speech posteriors.
It is also important to note that we use the speech UBM to estimate the occupation probabilities of each frame. Because the speech posteriors are also used in the second pass of the system, we do not only retrain the speech UBM, we also retrain the non-speech UBM on the test file: we have non-speech segments with music and applause, and we also use the low-energy frames inside the speech segments to retrain the non-speech UBM.
Also during the boundary elimination, to reduce the false positives, we use the soft voice activity detection when extracting the speaker factors and then apply the cosine-distance boundary elimination.
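Here is a sketch of how this soft weighting could be wired into the statistics from the earlier extraction sketch. It is my own illustration under the stated assumptions: diagonal-covariance UBMs, and a per-frame speech posterior obtained as a softmax (equivalently, a sigmoid of the log-likelihood ratio) over the two UBM log-likelihoods.

    import numpy as np
    from scipy.special import logsumexp

    def gmm_loglik(frames, weights, means, variances):
        """Log-likelihood of each frame under a diagonal-covariance GMM."""
        comp = np.stack([
            -0.5 * np.sum((frames - means[c]) ** 2 / variances[c]
                          + np.log(2 * np.pi * variances[c]), axis=1)
            + np.log(weights[c])
            for c in range(len(weights))], axis=1)
        return logsumexp(comp, axis=1)                       # (T,)

    def speech_posteriors(frames, speech_ubm, nonspeech_ubm):
        """Per-frame speech posteriors from a softmax over two UBM log-likelihoods."""
        ll_s = gmm_loglik(frames, *speech_ubm)               # speech UBM
        ll_n = gmm_loglik(frames, *nonspeech_ubm)            # non-speech UBM
        return np.exp(ll_s - np.logaddexp(ll_s, ll_n))       # (T,)

    def soft_vad_stats(frames, gamma, ubm_means, post):
        """Baum-Welch statistics weighted by the per-frame speech posteriors.

        gamma : (T, C) occupation probabilities from the speech UBM
        post  : (T,)   speech posteriors from the soft VAD
        """
        g = gamma * post[:, None]                            # down-weight non-speech frames
        N = g.sum(axis=0)                                    # weighted zeroth-order stats
        F = g.T @ frames - N[:, None] * ubm_means            # weighted first-order stats
        return N, F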
Okay, here we again plot the ΔBIC baseline, and this is our speaker factor extraction without the soft voice activity detection. We see that if we do not use the two-pass system, the soft voice activity detection does not really improve the results. But if we do use the two-pass system, with the cosine distance for the boundary elimination, we see that we can further improve the results. So the soft voice activity detection is really useful when it is combined with the two-pass system.
Once we have this best precision-recall graph, we choose an operating point to start the clustering. The clustering is an agglomerative clustering: first we do traditional BIC clustering across the whole file, and this is quite important to get enough data for the i-vector PLDA clustering in the second stage.
The idea is to extract an i-vector for each cluster we get from the BIC clustering, and then use PLDA to test the hypothesis whether two clusters contain the same speaker or different speakers. If the PLDA indicates that it is the same speaker, we merge the most likely cluster pair. For the merged cluster we again extract an i-vector, by summing up the sufficient statistics and extracting a new i-vector, and we test the hypothesis again with the PLDA. We iterate this whole clustering process until the PLDA outputs a low probability of the same speaker.
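A schematic of that merge loop, as a minimal sketch with my own placeholder names: extract_ivector and plda_llr stand in for the trained i-vector extractor and PLDA scorer, and the stopping threshold is illustrative.

    def agglomerative_plda_clustering(stats, extract_ivector, plda_llr, stop_llr=0.0):
        """Merge clusters as long as the PLDA favours the same-speaker hypothesis.

        stats           : list of per-cluster sufficient statistics (N, F) tuples
        extract_ivector : function (N, F) -> i-vector
        plda_llr        : function (iv1, iv2) -> same/different-speaker log-likelihood ratio
        stop_llr        : stop merging once the best pair falls below this value
        Returns a list mapping each initial cluster to a final cluster label.
        """
        labels = list(range(len(stats)))                  # current label per initial cluster
        active = {i: stats[i] for i in labels}
        ivecs = {i: extract_ivector(*s) for i, s in active.items()}
        while len(active) > 1:
            # most likely same-speaker pair among the remaining clusters
            best_llr, a, b = max((plda_llr(ivecs[p], ivecs[q]), p, q)
                                 for p in active for q in active if p < q)
            if best_llr < stop_llr:                       # PLDA no longer says "same speaker"
                break
            # merge b into a: sum the sufficient statistics, re-extract the i-vector
            Na, Fa = active[a]
            Nb, Fb = active[b]
            active[a] = (Na + Nb, Fa + Fb)
            ivecs[a] = extract_ivector(*active[a])
            del active[b], ivecs[b]
            labels = [a if l == b else l for l in labels]
        return labels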
Okay, what are the results after clustering? Again we use the COST278 broadcast news data set. We evaluate the diarization error rate, which is the percentage of frames that are attributed to the wrong speaker after a mapping between the clusters and the real speakers.
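As a reference point for what this metric measures, a simplified frame-level version is sketched below; it ignores missed and false-alarm speech and collars, which a full DER scoring tool would also account for.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def diarization_error_rate(ref_labels, hyp_labels):
        """Fraction of frames attributed to the wrong speaker after an optimal
        one-to-one mapping between hypothesis clusters and reference speakers.

        ref_labels, hyp_labels : equal-length sequences of per-frame labels.
        """
        ref_ids = {r: i for i, r in enumerate(sorted(set(ref_labels)))}
        hyp_ids = {h: i for i, h in enumerate(sorted(set(hyp_labels)))}
        overlap = np.zeros((len(ref_ids), len(hyp_ids)), dtype=int)
        for r, h in zip(ref_labels, hyp_labels):
            overlap[ref_ids[r], hyp_ids[h]] += 1
        # the optimal mapping maximizes the number of correctly attributed frames
        rows, cols = linear_sum_assignment(-overlap)
        correct = overlap[rows, cols].sum()
        return 1.0 - correct / len(ref_labels)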
With the popular ΔBIC segmentation we get a diarization error rate of ten point one percent, and we see that the detected boundaries are not that accurate when we apply the margin of five hundred milliseconds. If we look for local changes between the speaker factors instead, we see a slight improvement in the diarization error rate, but the big change is clearly in the accuracy of the boundaries: the speaker factor approach is much more accurate at detecting the boundaries.
The same holds when we use the two-pass system: we see a slight improvement in the precision and the recall. But if we use the two-pass system with the soft voice activity detection, the boundaries apparently get better, and besides that we get a ten percent relative improvement in the diarization error rate, with a boundary precision of eighty-one percent and a recall of eighty-five percent, which is clearly better than the popular standard BIC segmentation. I also want to note that it is popular to use Viterbi re-segmentation to find more accurate boundaries after the clustering, but applied to our speaker factor approach this actually deteriorated the results.
Thank you very much.
Q: [partly inaudible] The two-pass adaptation seems to work quite well for speaker diarization, but could you model the speaker factors directly instead of using a distance measure?
A: These first results are on the speaker factors. I did try to put Gaussian models on the speaker factors, but that did not give the same results; using a distance measure actually gave better results than trying to fit Gaussian models on the speaker factors.
Q: [largely inaudible question about detecting speech/non-speech boundaries with this approach]
A: One thing about this approach is that the number of speaker changes we hypothesize depends on the length of the speech segments, so we use that to limit how many speaker changes can occur inside a speech segment; you would have to find a solution for that. But I do think it is possible to use this i-vector approach to find boundaries between speech and non-speech segments. It would probably even generate more accurate boundaries than the HMM system that I use now, but that is a hypothesis that I should test.
Q: [partly inaudible] So you use a GMM-based voice activity detection; how does the HMM-based speech/non-speech segmentation work?
A: The HMM system is also a two-pass system. We have two models for non-speech, a music model and a background noise model, and for speech we have different models as well: clean speech, speech with background noise, and speech with music. We go through the file once, estimate posteriors and adapt the models, and then we go through it a second time.
Q: To what extent are your error rates affected by overlapping speech, where two speakers talk at the same time? What proportion of the data has two speakers in one region?
A: You are talking about overlapping speech? In this data set we do not have annotations of overlapping speech, so I cannot comment on how this impacts the results.
Q: But what would happen in your clustering if you have two speakers speaking in the same region?
A: In most cases each of these regions would be detected as a separate cluster, I think; when I look at the files manually, the overlap can show up as a separate cluster. I think it also happens that the overlapping speech is assigned to one of the two speakers, but I did notice that sometimes the system makes a separate cluster for it. [inaudible]
Q: Okay, this method is intended for TV subtitling; how would this method work for online, real-time diarization?
A: You are talking about the second pass of the system? It is not an online system: the idea is that the journalist uploads the file, starts the process and comes back an hour later, for example. So the first goal was not to make an online system. There might be techniques to make it online, but I would have to think about that.
Q: In this diarization system, how do you handle the number of speakers? How many speakers were there in reality, and how many speakers were estimated?
A: Okay, if we combine the BIC clustering and then the i-vector PLDA clustering, that ratio is very close to one. But I have to note that if you do not use the initial BIC clustering, the i-vector PLDA system still reaches a low diarization error rate, but the ratio between clusters and speakers is quite off, by about a factor of two. So in this system it is quite important to do the initial BIC clustering to keep the ratio close to one; the diarization error rate, however, does not suffer that much when just using the i-vector PLDA clustering.
Alright. If there are no further questions, I would like to thank the speaker, and all the speakers once again.