Welcome to my presentation of the paper "Improving Diarization Robustness Using Diversification, Randomization, and the DOVER Algorithm".
As a brief overview: I will start with a review of the DOVER algorithm, something we introduced recently to combine the outputs of multiple diarization systems. The natural use of that is for information fusion. In this paper we are going to focus on another application: using it to achieve more robustness in diarization. We then describe our experiments and results, and conclude with a summary and an outlook.
I am sure everybody is familiar with the speaker diarization task: it answers the question "who spoke when". Given an audio input, you label it according to speaker identity without having any prior knowledge of the speakers, so the labels are anonymous labels such as Speaker 1, Speaker 2, and so on. Diarization is needed in order to track the interaction among multiple speakers in a conversation or meeting.
It is also useful to attribute speaker labels to the output of a speech recognition system, in order to produce a readable transcript. And you can use it for things like speaker retrieval, where you need to identify all the speech coming from the same speaker source.
The diarization error rate is the standard metric for this task. It is the ratio of the total duration of missed speech, false alarm speech, and speaker-confusion speech (that is, speech labeled with the wrong speaker), normalized by the total duration of speech in the reference.
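Written as a formula, that is:

\[
\mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ error}}}{T_{\mathrm{total\ reference\ speech}}}
\]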
A critical element of the diarization error computation, which will be important later on, is the mapping between the speaker labels that occur in the reference versus the hypothesis. The labels in the reference have nothing to do with the labels of the hypothesized clusters, so we need to construct a mapping that minimizes the error rate.
So in this example we would map Speaker 1 to Speaker A, and Speaker 2 to Speaker B, and leave Speaker 3 unmapped, because it is in effect an extra speaker relative to the reference. Once we have done the mapping, we can compute the missed, false alarm, and speaker-confusion speech.
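To make the mapping step concrete, here is a minimal Python sketch (not the actual scoring tool used for the paper): it treats the reference-to-hypothesis mapping as an assignment problem over overlap durations and solves it with the Hungarian algorithm. The function name and the toy numbers are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_speaker_mapping(overlap):
    """overlap[i, j] = seconds during which reference speaker i and
    hypothesis speaker j are active simultaneously.  Returns a dict
    {hypothesis speaker -> reference speaker}; hypothesis speakers left
    out of the assignment are 'extra' speakers and stay unmapped."""
    ref_idx, hyp_idx = linear_sum_assignment(-overlap)  # maximize total overlap
    return {h: r for r, h in zip(ref_idx, hyp_idx)}

# Toy example: two reference speakers (rows), three hypothesis speakers (columns).
overlap = np.array([[50.0,  2.0, 1.0],
                    [ 3.0, 40.0, 0.5]])
print(best_speaker_mapping(overlap))  # {0: 0, 1: 1}; hypothesis speaker 2 stays unmapped
```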
Now, system combination, or ensemble methods, or voting methods, are very popular in machine learning applications, because combining multiple classifiers is a very powerful way to achieve a better result. The voting can be hard voting, simply letting the majority determine the output, or soft voting, such as combining different scores in some kind of weighted sum, or combining posterior outputs by interpolation, in order to achieve a more accurate estimate of the posterior probability and, therefore, of the output labels. This can be done in an unweighted or a weighted manner: if you have a reason to attribute more trust to some of the inputs, you can give them more weight in the voting algorithm.
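As a toy illustration of the two flavors of voting just mentioned (the numbers and weights here are invented, not from the paper):

```python
import numpy as np

# Hard voting: the majority label wins.
labels = ["spk1", "spk2", "spk1"]
hard_decision = max(set(labels), key=labels.count)            # -> "spk1"

# Soft (weighted) voting: interpolate posteriors, then take the argmax.
posteriors = np.array([[0.60, 0.40],    # classifier 1
                       [0.30, 0.70],    # classifier 2
                       [0.55, 0.45]])   # classifier 3
weights = np.array([0.5, 0.25, 0.25])   # more trust in the first classifier
soft_decision = (weights[:, None] * posteriors).sum(axis=0).argmax()  # -> class 0
```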
A popular version of this for speech recognition is the ROVER algorithm, and also confusion network combination. There, the word labels from multiple ASR systems are aligned with each other, and voting is performed among the different hypotheses. Usually this gives a gain when the input systems are about equally good but have different error distributions, that is, largely independent errors.
Now, how can we use this idea for diarization? There is a problem, because the labels coming from different diarization hypotheses are not inherently related. They are anonymous, as we said, so it is not clear how to vote among them. We can solve this problem by estimating a mapping between the different label sets, and then performing the voting. The way we construct this mapping of the labels is in fact a kind of alignment, not in the word space but in the label space.
We do this incrementally, just as in ROVER: we start with the first hypothesis and use it as our initial alignment, and then we iterate over all the remaining outputs. For each one, we construct a mapping to the previously processed outputs such that the diarization error between the label sequences is minimized. Once all labels are mapped into a common space, we can simply perform the voting and find the majority label for all time instants. This is what was described in our paper at ASRU last year.
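Here is a rough Python sketch of this incremental label alignment. It is a simplification, not the exact DOVER implementation: the cost is plain overlap duration, the assignment is greedy, and each hypothesis is assumed to be a list of (start, end, label) segments.

```python
def pairwise_overlap(hyp_a, hyp_b):
    """Total overlap duration for every label pair of two hypotheses,
    each given as a list of (start, end, label) segments."""
    overlap = {}
    for sa, ea, la in hyp_a:
        for sb, eb, lb in hyp_b:
            dur = min(ea, eb) - max(sa, sb)
            if dur > 0:
                overlap[(la, lb)] = overlap.get((la, lb), 0.0) + dur
    return overlap

def map_into_common_space(hypotheses):
    """Incrementally map every hypothesis into the label space of the first one."""
    anchored = [hypotheses[0]]                          # the first output anchors the label space
    for hyp in hypotheses[1:]:
        pooled = [seg for h in anchored for seg in h]   # all previously processed outputs
        ov = pairwise_overlap(pooled, hyp)
        mapping, used = {}, set()
        # Greedily give each new label the anchored label it overlaps most with.
        for (la, lb), _ in sorted(ov.items(), key=lambda kv: -kv[1]):
            if lb not in mapping and la not in used:
                mapping[lb] = la
                used.add(la)
        # Labels without a correspondence keep their own (extra-speaker) name.
        anchored.append([(s, e, mapping.get(l, l)) for s, e, l in hyp])
    return anchored
```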
OK, here is an example. We have three systems, A, B, and C, whose labels are disjoint. We first start with system A and then compute the best mapping of the second system's labels to the labels of the first system. In this case we would map B1 to A1 and B2 to A2, while B3 would be an extra speaker label, so it remains as it is. We then relabel everything, so now we have system A and system B in the same label space.
We do the same thing again with system C. We can see here that C1 should be mapped to A1 and C3 should be mapped to A2, while C2 remains unmapped and keeps its own label, because it does not have a correspondence.
So now we have all three outputs in the same label space, and we can perform the voting for each time instant. At first the only label is A1, up to this point. Then we enter a region where there is actually a tie between A1 and another label, so we can break the tie in any way: take the label from the first input, for example, or, if there are weights attached to the inputs, take the one with the highest weight. Then we have A2 as the consensus, and later we transition back to A1. The extra label never appears in the output, because it is always in the minority.
And we can use the same idea to decide on speech versus nonspeech: we output speech only in those regions where at least half of the inputs say there is speech.
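A matching sketch of the voting step, again simplified relative to the real DOVER (fixed time step, equal or user-supplied weights, ties broken toward the label encountered first):

```python
from collections import Counter

def vote(mapped_hypotheses, step=0.01, weights=None):
    """Frame-level voting over hypotheses that already share a label space.
    Returns a list of (time, label); label is None for nonspeech, which is
    output whenever fewer than half of the (weighted) inputs see speech."""
    if weights is None:
        weights = [1.0] * len(mapped_hypotheses)
    end_time = max(e for hyp in mapped_hypotheses for _, e, _ in hyp)
    decisions, t = [], 0.0
    while t < end_time:
        tally, speech_votes = Counter(), 0.0
        for hyp, w in zip(mapped_hypotheses, weights):
            label = next((l for s, e, l in hyp if s <= t < e), None)
            if label is not None:
                tally[label] += w
                speech_votes += w
        if speech_votes >= 0.5 * sum(weights):        # speech/nonspeech majority
            decisions.append((t, tally.most_common(1)[0][0]))
        else:
            decisions.append((t, None))
        t += step
    return decisions
```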
Now, again, the natural use of this is for information fusion; that is, we run diarization on the inputs independently and then fuse. For example, if we have multiple microphones, we can diarize them independently and fuse the outputs using DOVER. Or we could have a single input but different feature streams, and again we can diarize them independently and fuse.
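Using the two sketches above, a multi-microphone fusion would look roughly like this; run_channel_diarization and channels are hypothetical placeholders for whatever per-channel system is available:

```python
# One diarization hypothesis per microphone channel, each a list of
# (start, end, label) segments, e.g. parsed from per-channel RTTM files.
channel_outputs = [run_channel_diarization(ch) for ch in channels]  # hypothetical helper

mapped = map_into_common_space(channel_outputs)  # align the anonymous labels
consensus = vote(mapped, step=0.01)              # DOVER-style consensus output
```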
We used this for multiple microphones in the original DOVER paper. We had meeting recordings with seven microphones, and you can see here that if you run a clustering-based diarization on each channel separately, you get a wide range of results depending on which channel you choose. DOVER actually gives you a result that is slightly better than the best single channel, so you are freed from having to figure out which is the best channel.
If you do the diarization using speaker identification, because your speakers are actually enrolled in the system, you get the same effect, of course at a much lower diarization error rate overall. Again, you have the best single channel and you have the worst single channel, and the DOVER combination of all these outputs gives you a result that is actually better than the minimum over the individual channels.
Now, for this paper we are looking into a different application of DOVER. It starts with the observation that diarization algorithms are often quite sensitive to the choice of hyperparameters. I will give some examples later, but it is basically because, when you do clustering, you make hard decisions based on comparing real-valued scores, and small differences in the inputs can actually yield large differences in the output. Also, the clustering is often greedy and iterative, so small differences somewhere early on can lead to very large differences later on.
This can be remedied by averaging over different runs, essentially: you run the system with different hyperparameters and average the results using DOVER, or you can use DOVER to combine the outputs of multiple different randomized clustering solutions.
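Schematically, the hyperparameter-averaging idea amounts to something like the following; run_diarization and the swept tdoa_weight values are hypothetical stand-ins for whichever system and hyperparameter are being varied:

```python
# Generate several hypotheses from the *same* input by sweeping one
# hyperparameter, then fuse them instead of committing to a single value.
stream_weights = [0.1, 0.2, 0.3, 0.4, 0.5]                                     # illustrative grid
hypotheses = [run_diarization(audio, tdoa_weight=w) for w in stream_weights]   # hypothetical helper
consensus = vote(map_into_common_space(hypotheses))                            # equal-weight fusion
```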
To experiment with this, we used a classic speaker clustering algorithm for diarization developed at ICSI. You start with an equal-length segmentation of the audio into short segments; then each segment is modeled by a mixture of Gaussians, and the similarity between different segments is evaluated by asking whether merging two GMMs yields a higher overall likelihood or not. The iteration proceeds by merging the two best clusters, then resegmenting and re-estimating the GMMs, and you keep doing this until a Bayesian information criterion tells you to stop the clustering.
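Here is a much simplified, runnable sketch of that agglomerative loop using scikit-learn GMMs. The resegmentation step, the feature-stream weighting, and the exact ICSI stopping criterion are omitted; the likelihood-gain comparison below is only a crude stand-in for the BIC-style test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(frames, n_components):
    return GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(frames)

def merge_gain(fi, fj, gi, gj):
    """Gain from modeling two clusters with one merged GMM whose component
    count equals the sum of the two (so the parameter counts match)."""
    both = np.vstack([fi, fj])
    merged = fit_gmm(both, gi.n_components + gj.n_components)
    return (merged.score(both) * len(both)
            - gi.score(fi) * len(fi) - gj.score(fj) * len(fj))

def agglomerative_clustering(chunks, n_components=3):
    """chunks: equal-length blocks of feature frames (one np.ndarray each)."""
    frames = [np.asarray(c) for c in chunks]
    gmms = [fit_gmm(f, n_components) for f in frames]
    while len(frames) > 1:
        pairs = [(i, j) for i in range(len(frames)) for j in range(i + 1, len(frames))]
        gains = [merge_gain(frames[i], frames[j], gmms[i], gmms[j]) for i, j in pairs]
        best = int(np.argmax(gains))
        if gains[best] < 0:                      # BIC-like stopping rule
            break
        i, j = pairs[best]
        frames[i] = np.vstack([frames[i], frames[j]])
        gmms[i] = fit_gmm(frames[i], gmms[i].n_components + gmms[j].n_components)
        del frames[j], gmms[j]
    return frames, gmms                          # remaining clusters ~ speakers
```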
We applied this algorithm to a collection of meeting recordings, from which we extracted two feature streams: MFCCs computed after beamforming — so we had multiple microphone channels, but we merged them by beamforming at the signal level and then extracted MFCCs — and the time delays of arrival, which the beamformer also gives us and which are an important feature because they indicate where the speakers are situated.
Now, there are two ways to generate more hypotheses from a single input in this case. One is what I call diversification, meaning we vary a hyperparameter over some range around a single working value. For example, I can vary the relative weight of the feature streams, or I can vary the initial number of clusters in the clustering algorithm. These are the ones I will discuss here; further variations are given in the paper, which I skip in the interest of time.
The other way is to randomize: I can manipulate the clustering algorithm so that it will not always pick the first-best pair of clusters to merge, but sometimes takes the second-best pair instead, flipping a coin in order to make these decisions. Doing this repeatedly, I can generate multiple clusterings, and of course I use DOVER to fuse the final results with equal weights. Note that all of the outputs use the same speech/nonspeech classifier, so they only differ in the speaker labels, not in the speech/nonspeech decisions, and any difference in the diarization error is in fact in the speaker error.
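In terms of the clustering sketch above, the randomization is a small change to how the merge pair is selected; the p_second value and helper names are placeholders (the talk mentions a probability of roughly one third):

```python
import random

def pick_merge_pair(pairs, gains, p_second=1/3, rng=random):
    """With probability p_second, merge the second-best cluster pair
    instead of the best one (when a second pair exists)."""
    order = sorted(range(len(pairs)), key=lambda k: gains[k], reverse=True)
    if len(order) > 1 and rng.random() < p_second:
        return pairs[order[1]]
    return pairs[order[0]]

# Different random seeds give different clusterings of the same input;
# these are then fused with equal weights, e.g. (hypothetical helper):
#   hyps = [run_diarization(audio, seed=s) for s in range(10)]
#   consensus = vote(map_into_common_space(hyps))
```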
The test data was from the NIST meeting Rich Transcription evaluations of 2007 and 2009. We used all of the microphone channels, but combined them with beamforming. The variety in this data is actually quite considerable: there are different recording sites and different speakers, from small meetings of three or four participants, with sixteen and twenty-one speakers in total in the two sets, respectively, so it was quite heterogeneous. That is why it is a challenge to actually optimize the hyperparameters on this data and carry them over from one subset to the other.
Here is what happens when you vary the feature stream weight, one of the hyperparameters. You can see that varying it over a range gives some variation in the output — this is the speaker error rate — and, more importantly, the best value on the test set is not the best value on the eval set; conversely, the best value on the eval set is not the best choice for the test set. This is what I mean by the robustness problem. Now, when we do a DOVER combination over all the different results, we actually get a nice result: it is either better than the single best result, for the test set, or very close to the single best result, on the eval set.
Similarly, when we vary the initial number of clusters of the algorithm, we also get a spread in the speaker error rate according to the variation of the cluster number, and again the best choice for the test set is not the best choice for the eval set. When you do the DOVER combination, you again get a good result; in fact, it is always better than the second-best individual choice on either data set.
Finally, we did the randomization of the clustering: specifically, we flip a coin and with probability one third use the second-best cluster pair in each merging step. The result, perhaps surprisingly, is sometimes better than with the best-first clustering. You can see here that with different random seeds we get a range of results, sometimes worse, but often better than the best-first clustering, and the same is true for the eval set. Of course we cannot expect the best seed on one data set to also be the best on the eval set; instead, we need to do the combination in order to get a robust result. And we actually improve on the best-first clustering consistently by doing the DOVER combination over the different randomized results.
To summarize: the DOVER algorithm allows us to vote among multiple diarization hypotheses. We can use this to achieve robustness in diarization by combining multiple hypotheses obtained from a single input. One way we do this is by varying hyperparameters, introducing diversity, if you will, into the results; we find that the DOVER combination over the hyperparameter variations essentially frees us from the need to do that hyperparameter optimization, and adds robustness that way.
The clustering can also be randomized, to overcome the limitations of the best-first search in clustering, and the combination of the randomized results actually achieves higher accuracy than the single best-first output.
Finally, there are many more things we can try. We can combine the different techniques, for example hyperparameter variation along multiple dimensions, or combining that with randomization, all in one big DOVER combination. We can also try this with different kinds of diarization, since the approach is agnostic to the actual form of the diarization algorithm, so we can try it with x-vector or spectral clustering or neural network based systems; of course, we would need some way of obtaining multiple hypotheses from those in order to make the approach work.
Two other things we are currently working on are combining different diarization algorithms, as well as generalizing DOVER to handle overlapping speech.
Thank you very much for your time. Please enter your questions through the conference website, and I look forward to the discussion.