OK, so the title of my talk follows from the offline supervector-based speaker diarization system which we presented at the last Odyssey, two years ago.
OK, so this is the outline of my talk.
OK, so for those who are not familiar with the baseline algorithm: the idea is to take a conversation between two speakers, usually on a single channel, and do speaker diarization.
The main principle is the following. Look at this illustration: it shows the acoustic feature space, with one speaker in blue and the other in red. If we did not have the color coding, we would not be able to separate these two speakers.
So the idea is to take the speech and do some kind of parameterization into a series of supervectors representing overlapping short segments. What we get is what we see here: now there is some separation between the two speakers, and we can also see that each speaker can roughly be modeled by a unique PDF. This is thanks to the supervector representation.
The next step is to improve the separation between the speakers by removing some of the intra-session intra-speaker variability.
This is a sketch of the algorithm, and here are the actual steps.
First there is the audio parameterization: the session is taken and a conversation-dependent UBM is estimated. So basically this algorithm doesn't need any development data; the UBM is estimated from the conversation itself. Then the conversation is segmented into overlapping one-second superframes, and each superframe is represented by a supervector adapted from the UBM.
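As a rough illustration of this step, here is a minimal sketch in Python (numpy plus scikit-learn), assuming MFCCs at 100 frames per second, a fitted GaussianMixture as the UBM, and standard MAP mean-only adaptation; the window, hop, and relevance factor are illustrative, not the talk's exact settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_supervectors(mfcc, ubm, win=100, hop=50, r=16.0):
    """Cut an MFCC stream (frames x dims, ~100 fps assumed) into
    overlapping 1-second superframes and MAP-adapt the UBM means to
    each superframe; the stacked adapted means form the supervector."""
    svs = []
    for start in range(0, len(mfcc) - win + 1, hop):
        seg = mfcc[start:start + win]
        post = ubm.predict_proba(seg)            # (win, n_components)
        n_k = post.sum(axis=0)                   # zeroth-order stats
        f_k = post.T @ seg                       # first-order stats
        alpha = (n_k / (n_k + r))[:, None]       # MAP adaptation weights
        mean_k = f_k / np.maximum(n_k, 1e-8)[:, None]
        svs.append((alpha * mean_k + (1 - alpha) * ubm.means_).ravel())
    return np.array(svs)
```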
Then there is another step which I am not going to detail, because it is something we have already presented: we estimate the intra-speaker variability on the fly, from the conversation itself, and compensate for it in order to improve accuracy.
The next step is to score each superframe as coming from either speaker 1 or speaker 2. This is done by first computing the covariance matrix of the compensated supervectors, then applying PCA to this covariance matrix, identifying the largest eigenvector, and projecting everything onto it. Then we use Viterbi decoding to do some smoothing, and finally we do Viterbi resegmentation in the MFCC space.
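The scoring step can be sketched like this (a minimal illustration, not the exact implementation; the Viterbi smoothing and the MFCC-space resegmentation are omitted):

```python
import numpy as np

def superframe_scores(compensated_svs):
    """Project each compensated supervector onto the largest eigenvector
    of the supervector covariance matrix; which side of the threshold a
    1-D projection falls on is the per-superframe speaker score that
    Viterbi later smooths."""
    centered = compensated_svs - compensated_svs.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    return centered @ eigvecs[:, -1]             # largest eigenvector
```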
So this is the baseline.
There are a few shortcomings to this algorithm. First, we found that when we apply it to short sessions (and by short I mean 15 or 30 seconds) it doesn't work that well. This is first of all because there is insufficient data for estimating all the models and parameters from a single short session, and also because the probability of an imbalance in the speakers' representation increases when we are dealing with short sessions, while this algorithm relies heavily on there being some kind of balance between the two speakers. Another issue is that this algorithm is inherently offline, and several of our customers require an online solution. So these are the shortcomings.
So first I'll talk about robustness on short sessions, which is important in itself, but is also the first step towards the online algorithm.
The basic idea is to do everything we can offline, from a development set. Instead of training the UBM from the conversation, we train it on the development set, and the NAP intra-speaker variability compensation is also trained on the development set. But we don't need any labeling of the development set, because our algorithm is unsupervised: it needs neither speaker labels nor speaker-turn labels, just the raw audio. So we take the development set, estimate the UBM, estimate the NAP transform, and also train the GMM models, all to make the system more robust to short sessions.
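The offline training idea can be sketched as follows (a minimal sketch: the component count is illustrative, and dev_mfcc is placeholder data standing in for real, unlabeled development MFCCs):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train the UBM once, offline, on pooled unlabeled development audio
# instead of re-estimating it from each conversation.
dev_mfcc = [np.random.randn(3000, 20) for _ in range(8)]   # placeholder data
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      max_iter=50, random_state=0).fit(np.vstack(dev_mfcc))
# The NAP intra-speaker variability transform is likewise estimated from
# development-set supervectors (see the adjacent-pair sketch in the Q&A).
```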
The next thing is what we call outlier-emphasizing PCA. Contrary to robust PCA, which some of you may be familiar with, in our case we are actually interested in the outliers: we want to emphasize them and give them high weight when doing PCA. To see why, look at this illustration with two speakers. When they are balanced, with the same amount of data from each, and certain conditions hold, we can just take the supervectors and apply PCA, and the largest eigenvector will actually give us the decision boundary. But if the speakers are unbalanced, then in many cases the PCA will be dominated by the dominant speaker, and we won't get the right decision boundary. So what we do is assign a higher weight to outliers, which are found by selecting the top 10% of supervectors in the given session with the largest distance from the sample mean. That is, we compute the center of gravity, the sample mean, select the 10% of supervectors most distant from it (these are the outliers), and give them a higher weight. With this weighting, the PCA suddenly works well in this example.
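A minimal sketch of outlier-emphasizing PCA along those lines (the weight value is illustrative, not a number from the talk):

```python
import numpy as np

def outlier_emphasizing_pca(X, top_frac=0.10, weight=10.0):
    """Weighted PCA that up-weights the 10% of supervectors farthest
    from the sample mean, so a minority speaker can still pull the
    largest eigenvector toward the between-speaker direction."""
    mean = X.mean(axis=0)
    dist = np.linalg.norm(X - mean, axis=1)
    w = np.ones(len(X))
    w[dist >= np.quantile(dist, 1.0 - top_frac)] = weight  # emphasize outliers
    Xc = X - mean
    cov = (Xc * w[:, None]).T @ Xc / w.sum()     # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]                        # largest eigenvector
```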
Another problem is how to choose the threshold, because in this case, for example, where the speakers are imbalanced, if we just take the center of gravity as the threshold, we will not be able to distinguish the two speakers correctly. So, following the same principle, we compute the 10th and 90th percentiles of the values along the largest eigenvector, take these two values, average them, and set the threshold more robustly from that.
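That thresholding rule is short enough to sketch directly:

```python
import numpy as np

def robust_threshold(scores):
    """Instead of thresholding the 1-D projections at their mean,
    average the 10th and 90th percentiles; this stays near the
    between-speaker boundary even when one speaker dominates."""
    lo, hi = np.percentile(scores, [10, 90])
    return 0.5 * (lo + hi)
```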
OK, so before talking about online diarization, here are a few experiments for the short-session setting. We used the NIST 2005 dataset for this evaluation. One important point is that we compute the speaker error rate without discarding the margin around speaker turns. This is contrary to the standard protocol, and it is because we are dealing with short sessions: when we tried to throw away data, we found it caused some numerical problems. Basically this means that the results I present are somewhat more pessimistic than what we would get with the standard method. Another important issue is that we throw away short sessions with less than 3 seconds per speaker.
What we actually do is take the 5-minute sessions from NIST and chop them into short sessions. Sometimes when doing that we may get short sessions, of for example 15 seconds, with only a single speaker, or with only one second from the second speaker. In this work we do not try to deal with the problem of detecting such situations, where we effectively have only a single speaker, and therefore we remove such sessions.
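For concreteness, a minimal sketch of that chopping and filtering step (the frame rate and chunk length are illustrative, and frame_labels stands for the reference labeling used only to filter the evaluation data):

```python
import numpy as np

def chop_and_filter(frame_labels, chunk_sec=15, fps=100, min_sec=3.0):
    """Chop a session's per-frame reference speaker labels (1 or 2) into
    fixed-length chunks and keep only the chunks containing at least
    `min_sec` seconds of speech from each speaker."""
    chunk = chunk_sec * fps
    kept = []
    for start in range(0, len(frame_labels) - chunk + 1, chunk):
        seg = frame_labels[start:start + chunk]
        if min(np.sum(seg == 1), np.sum(seg == 2)) >= min_sec * fps:
            kept.append((start, start + chunk))   # both speakers present
    return kept
```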
Here are the results for the techniques I talked about. Basically, what we can see is that for long sessions we get neither improvement nor degradation; however, for short sessions we get roughly a 15% error reduction using this technique.
OK, so now let's talk about online diarization. The framework here is the following: we take a prefix of the session, and the prefix is something that we process offline. Of course, you want the prefix to be as short as possible, and we actually set its length adaptively: we start with a short prefix, and according to a confidence estimate we verify whether this prefix is good enough for the processing, or whether we should take a longer prefix and redo the processing. So we take the prefix of the session and process it offline, simply applying our algorithm to it. The result of this processing is the segmentation for the prefix, together with some model parameters, for example the PCA and its threshold. We then take these model and threshold parameters and process the rest of the session online, using these models as a starting point and updating them periodically. The online processing usually runs with some delay, because we need some kind of backtracking; so we always have some latency, which can be a second or less.
We first apply this to voice activity detection; I won't go over all the details, since it is quite standard. Once voice activity detection is done online, we have to do the speaker diarization.
First there is the front end, which we run online: we compute the MFCCs, extract the supervectors, and compensate for intra-speaker variability. Then we take the prefix, compute the PCA on the supervectors in the prefix, project all the supervectors onto the largest eigenvector, and do Viterbi segmentation. Then, for the rest of the session, we take the PCA statistics from the prefix and accumulate them online; we periodically recompute the PCA and adjust our decision boundary, and we also run Viterbi with partial backtracking, which introduces some latency.
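Putting the pieces together, here is a minimal sketch of the online stage (it reuses outlier_emphasizing_pca and robust_threshold from the earlier sketches; the refresh schedule is illustrative, and the Viterbi smoothing with partial backtracking is omitted):

```python
import numpy as np

def fit_axis_and_threshold(svs):
    """Re-fit the decision boundary from the supervectors seen so far,
    using the earlier sketches."""
    axis = outlier_emphasizing_pca(svs)
    return axis, robust_threshold((svs - svs.mean(axis=0)) @ axis)

def online_diarize(prefix_svs, stream_svs, refresh=50):
    """Start from the model fitted on the prefix, label each incoming
    superframe, and periodically re-estimate the PCA axis and the
    threshold as statistics accumulate."""
    seen = list(prefix_svs)
    axis, thr = fit_axis_and_threshold(np.array(seen))
    labels = []
    for i, sv in enumerate(stream_svs):
        seen.append(sv)
        score = (sv - np.mean(seen, axis=0)) @ axis
        labels.append(1 if score > thr else 2)
        if (i + 1) % refresh == 0:               # periodic retraining
            axis, thr = fit_axis_and_threshold(np.array(seen))
    return labels
```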
Here are some results. First we analyze the sensitivity to the delay parameter: the delay parameter is the delay we have when doing online diarization on the rest of the conversation, since we still need some delay for the Viterbi smoothing. We found that 0.2 seconds was good enough for this algorithm. Then we ran some experiments to verify the sensitivity to the prefix length, and we found that, starting from a speaker error rate of 4.4, we see significant degradation: it gets to 9.0 for a 15-second prefix.
Then we ran a control experiment: we did the same experiments, but threw away all the sessions that did not have at least 3 seconds per speaker in the prefix. For example, for this column, we throw away all the sessions whose first 15 seconds do not contain at least 3 seconds per speaker. When we do that we see quite good results, and the explanation is that most of the degradation is due to the fact that sometimes both speakers are not represented in the prefix. The remedy we introduce is to apply the confidence criterion I will talk about shortly.
But before talking about the confidence: the overall latency of the system is 1.3 seconds, apart from the prefix. So if we have a 5-minute conversation, the first, say, 15 seconds are not processed online but offline, and after this prefix we get a latency of 1.3 seconds.
So now, the issue of the confidence-based prefix. We saw that sometimes 15 seconds is enough and sometimes it is not, and this is largely governed by the requirement that both speakers be present in the prefix. So what we do is start with a short prefix, do diarization, estimate the confidence in the diarization, and if the confidence is not high enough, we simply expand the prefix and start over.
We tried several confidence measures and finally chose the Davies-Bouldin index, which is the ratio between the average intra-class standard deviation and the inter-class distance; we can compute it once we have the diarization.
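A sketch of this adaptive-prefix loop, using scikit-learn's davies_bouldin_score (the candidate lengths and the max_db cutoff are illustrative, not values from the talk; diarize stands for the offline algorithm above):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def confident_prefix(svs, diarize, sv_per_sec=2,
                     lengths=(15, 30, 45, 60), max_db=1.0):
    """Expand the prefix until the Davies-Bouldin index of the two-way
    clustering is low enough (lower = tighter, better-separated
    clusters), then hand over to the online stage."""
    for sec in lengths:
        n = sec * sv_per_sec                      # superframes in prefix
        labels = np.asarray(diarize(svs[:n]))
        if len(np.unique(labels)) == 2 and \
                davies_bouldin_score(svs[:n], labels) <= max_db:
            return sec, labels                    # confident enough: stop
    return lengths[-1], labels                    # fall back to longest prefix
```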
OK, I won't go into all the details of this slide and the next ones, but the main idea is that you can actually get nice gains by using this confidence measure. For example, with 30-second prefixes, 50% of the sessions need to be extended to get almost as good a result, but for the other 50% of the sessions you can just stop. So you can start with a prefix of 30 seconds, do diarization, and compute this confidence measure; for 50% of the sessions you can decide that it is OK to stop there and switch to online processing, and for the rest of the sessions you would need, for example, 45 to 60 seconds of prefix to get optimal results.
OK, so what is the time complexity of the offline and online systems? This is a question many people asked me after the previous presentation at the last Odyssey. So we ran an experimental analysis of this algorithm on 5-minute sessions. No optimization of any sort was done; it is plain research code. What we see here is that the baseline system is 5 times faster than real time.
We can actually improve the accuracy of the system by applying some of the algorithms I presented. If we take all the components I talked about, some of them actually degrade accuracy slightly (for example, training the UBM offline gives some degradation), so we get back to 4.4, but we get a speed-up factor of 50: 50 times faster than real time.
For the online system, with a prefix of 30 seconds and a delay of 0.2 seconds, the speed-up factor is controlled by the retraining parameter, which determines how frequently we re-estimate our PCA model and our GMMs. We control it in a variable way: we start with a high frequency at the beginning of the conversation, and towards the end of the conversation we stop retraining, or do it at a very low frequency. For the online system we managed to get a speaker error rate of 7.8 with a speed-up factor of 30.
OK, before concluding, I'll talk about a specific task which we are interested in: speaker diarization for speaker verification. Here we are not really interested in getting a very accurate, high-resolution diarization; we just don't want to suffer a large degradation in the equal error rate of the speaker recognition on two-wire data. We presented initial work at Interspeech 2011, and here we have some improvements that integrate all the components I talked about in this presentation into this variant of our system.
So we divide the audio into overlapping 5-second superframes, because we don't need high resolution, and we score each superframe independently against the target speaker model. What we then have to do is classify, or cluster, these superframes into the two speakers. So we do a partial diarization: we cluster the superframes into two clusters, and we also deemphasize superframes that lie on the borderline between the clusters. Because we are actually interested in speaker verification, not speaker diarization, we can simply throw away superframes for which we are not certain to which speaker they belong. For the clustering we use eigenvoice-based dimensionality reduction and k-means.
We found that the silhouette measure was the best criterion for deemphasizing some of the superframes.
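A minimal sketch of that hard, silhouette-based selection (keep_frac is an illustrative setting, and the eigenvoice-based dimensionality reduction applied beforehand is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def confident_superframes(svs, keep_frac=0.8):
    """Cluster the 5-second superframes into two speakers with k-means
    and keep only the superframes with the highest silhouette values,
    dropping the borderline ones before verification scoring."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(svs)
    sil = silhouette_samples(svs, labels)         # per-superframe confidence
    keep = sil >= np.quantile(sil, 1.0 - keep_frac)
    return labels[keep], np.where(keep)[0]        # labels + kept indices
```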
We also run this online: we use the same prefix framework, where the prefix is processed offline and then we simply adapt for the rest of the conversation. For verification we use the GMM-NAP-SVM system developed for NIST 2004 and 2006, evaluated on NIST 2005, male speakers only.
We see that we get some improvement compared to the results we presented at Interspeech, and we also observed that with this new technique of using the silhouette confidence measure to remove superframes, the hard decision gives the optimal result, compared to using a soft decision or no removal at all.
So, to summarize: we extended our speaker diarization method to work with short sessions and to run online. We proposed the following novelties: offline unsupervised estimation of intra-session intra-speaker variability (again, we use a development set to estimate this variability, but it is not labeled at all; we don't need labeled data), and outlier-emphasizing PCA for improving speaker clustering, together with adaptive threshold setting. The overall latency is 1.3 seconds, apart from the prefix, and the speed is 50 times faster than real time for the offline system and between 30 and 40 times for the online system. Also, for the speaker verification task (there is more on this in the paper than in the presentation), we managed to substantially reduce the delay for speaker verification on two-wire data. OK, thank you.
Q: For initialization, did you consider trying an online speaker segmentation algorithm, where you just find the first speaker change, so that you are sure the second speaker, or the first speaker, appears within the next 15 seconds?
A: Yeah, what we are trying to do now is to stay with the prefix framework, start with a very short prefix, and try to expand it while assessing whether there is a single speaker or not in the prefix. That turns out to be hard; yeah, that is why we don't have it in the paper.
Q: The rate you report is the speaker diarization rate, the diarization error rate? A: It is the speaker error rate, without voice activity detection; just the speaker confusion, that's all.
Q: Can we go back to the results for recognition? Do you know how the baseline was done? A: Nothing was done, just scoring. Q: Do you have the number? A: We have it in the last Interspeech paper; that number is there.
Q: The last question is about the PCA itself. One of the things NAP does is remove the channel first. In the PCA, do you do any kind of channel compensation? A: Channel, no. There is something we actually do, though: we try to use the same techniques as are used for speaker verification, namely the NAP technique.
What we do is take pairs of adjacent supervectors and assume they belong to the same speaker, which is usually the case; once in a while it is not, because of a speaker change, but usually they are from the same speaker. From these pairs we estimate the intra-speaker variability.
Q: So you only estimate short-term variability? A: Short-term variability, yes.
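A minimal sketch of that adjacent-pair NAP estimation (the rank is an illustrative choice):

```python
import numpy as np

def nap_projection(supervectors, rank=10):
    """Estimate intra-speaker variability from adjacent supervector
    pairs (assumed to come from the same speaker, which is usually
    true) and build a NAP projection that removes the top `rank`
    directions of that variability."""
    diffs = supervectors[1:] - supervectors[:-1]    # adjacent pairs
    cov = diffs.T @ diffs / len(diffs)
    eigvals, eigvecs = np.linalg.eigh(cov)
    U = eigvecs[:, -rank:]                          # top variability directions
    P = np.eye(supervectors.shape[1]) - U @ U.T     # removes those directions
    return P                                        # apply as supervectors @ P
```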
Q: I don't understand the reason online diarization is needed; what is the motivation?
A: OK, this started because there were actually two clients. One of them is, for example, a call-center scenario. Let's assume it is two-wire; in practice that is often the case nowadays, at least with one of the vendors. The idea of the project was to run speech recognition online on the call-center data and to present the agent with a summary of the conversation. In order to produce the summary, they need speaker diarization, and everything must be done online, but it can be done with some latency; a 30-second prefix, for example, is OK, because these are usually longer conversations.
Q: When you use Viterbi, do you always go all the way back to the beginning, or do you just...? A: In the online case, no; we backtrack just over a small chunk.
Q: How far do you go back? A: It depends. Of course we also tried going all the way back; it does not really cause a problem, but we found that we can save a bit by not doing that, though it is not very important. The latency is caused by the future, not the past; the past is something you can process very quickly.
Q: One more question: did you try the algorithm on the multi-speaker diarization task that is used with meeting data?
A: Actually, we are now working within the framework of a European project that deals with a meeting-type scenario. We will have to take this algorithm and run it there; we will have to modify it, of course.
Alright, let's thank the speaker again. [Applause]