Welcome to my presentation of the paper "Improving Diarization Robustness Using Diversification, Randomization, and the DOVER Algorithm".
As a brief overview: I will start with a review of the DOVER algorithm, something we introduced recently to combine the outputs of multiple diarization systems. The natural use of that is for information fusion. In this paper we are going to focus on another application: using it to achieve more robustness in diarization. We then describe our experiments and results, and conclude with a summary and an outlook.
I am sure everybody is familiar with the speaker diarization task: it answers the question "who spoke when". Given an audio input, you label it according to speaker identity without having any prior knowledge of the speakers, so the labels are anonymous labels such as Speaker 1, Speaker 2, and so on. Diarization is needed in order to track the interaction among multiple speakers in a conversation or meeting.
It is also useful to attribute speaker labels to the output of a speech recognition system, in order to produce a readable transcript. And you can use it for things like speaker retrieval, where you need to identify all the speech coming from the same speaker source.
The diarization error rate is the standard metric for this task. It is the ratio of the total duration of missed speech, false alarm speech, and speaker-confusion speech (that is, speech labeled with the wrong speaker), normalized by the total duration of speech in the reference.
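Written as a formula, that is:

\[
\mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ error}}}{T_{\mathrm{total\ reference\ speech}}}
\]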
A critical element of the diarization error computation, which will be important later on, is the mapping between the speaker labels that occur in the reference versus the hypothesis. The labels in the reference have nothing to do with the labels of the hypothesized clusters, so we need to construct a mapping that minimizes the error rate.
So in this example we would map Speaker 1 to Speaker A, and Speaker 2 to Speaker B, and leave Speaker 3 unmapped, because it is in effect an extra speaker relative to the reference. Once we have done the mapping, we can compute the missed, false alarm, and speaker-confusion speech.
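To make the mapping step concrete, here is a minimal Python sketch (not the actual scoring tool used for the paper): it treats the reference-to-hypothesis mapping as an assignment problem over overlap durations and solves it with the Hungarian algorithm. The function name and the toy numbers are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_speaker_mapping(overlap):
    """overlap[i, j] = seconds during which reference speaker i and
    hypothesis speaker j are active simultaneously.  Returns a dict
    {hypothesis speaker -> reference speaker}; hypothesis speakers left
    out of the assignment are 'extra' speakers and stay unmapped."""
    ref_idx, hyp_idx = linear_sum_assignment(-overlap)  # maximize total overlap
    return {h: r for r, h in zip(ref_idx, hyp_idx)}

# Toy example: two reference speakers (rows), three hypothesis speakers (columns).
overlap = np.array([[50.0,  2.0, 1.0],
                    [ 3.0, 40.0, 0.5]])
print(best_speaker_mapping(overlap))  # {0: 0, 1: 1}; hypothesis speaker 2 stays unmapped
```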
Now, system combination, or ensemble methods, or voting methods, are very popular in machine learning applications, because combining multiple classifiers is a very powerful way to achieve a better result. The voting can be hard voting, simply letting the majority determine the output, or soft voting, such as combining different scores in some kind of weighted sum, or combining posterior outputs by interpolation, in order to achieve a more accurate estimate of the posterior probability and, therefore, of the output labels. This can be done in an unweighted or a weighted manner: if you have a reason to attribute more trust to some of the inputs, you can give them more weight in the voting algorithm.
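As a toy illustration of the two flavors of voting just mentioned (the numbers and weights here are invented, not from the paper):

```python
import numpy as np

# Hard voting: the majority label wins.
labels = ["spk1", "spk2", "spk1"]
hard_decision = max(set(labels), key=labels.count)            # -> "spk1"

# Soft (weighted) voting: interpolate posteriors, then take the argmax.
posteriors = np.array([[0.60, 0.40],    # classifier 1
                       [0.30, 0.70],    # classifier 2
                       [0.55, 0.45]])   # classifier 3
weights = np.array([0.5, 0.25, 0.25])   # more trust in the first classifier
soft_decision = (weights[:, None] * posteriors).sum(axis=0).argmax()  # -> class 0
```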
A popular version of this for speech recognition is the ROVER algorithm, and also confusion network combination. There, the word labels from multiple ASR systems are aligned with each other, and voting is performed among the different hypotheses. Usually this gives a gain when the input systems are about equally good but have different error distributions, that is, largely independent errors.
Now, how can we use this idea for diarization? There is a problem, because the labels coming from different diarization hypotheses are not inherently related. They are anonymous, as we said, so it is not clear how to vote among them. We can solve this problem by estimating a mapping between the different label sets, and then performing the voting. The way we construct this mapping of the labels is in fact a kind of alignment, not in the word space but in the label space.
We do this incrementally, just as in ROVER: we start with the first hypothesis and use it as our initial alignment, and then we iterate over all the remaining outputs. For each one, we construct a mapping to the previously processed outputs such that the diarization error between the label sequences is minimized. Once all labels are mapped into a common space, we can simply perform the voting and find the majority label for all time instants. This is what was described in our paper at ASRU last year.
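Here is a rough Python sketch of this incremental label alignment. It is a simplification, not the exact DOVER implementation: the cost is plain overlap duration, the assignment is greedy, and each hypothesis is assumed to be a list of (start, end, label) segments.

```python
def pairwise_overlap(hyp_a, hyp_b):
    """Total overlap duration for every label pair of two hypotheses,
    each given as a list of (start, end, label) segments."""
    overlap = {}
    for sa, ea, la in hyp_a:
        for sb, eb, lb in hyp_b:
            dur = min(ea, eb) - max(sa, sb)
            if dur > 0:
                overlap[(la, lb)] = overlap.get((la, lb), 0.0) + dur
    return overlap

def map_into_common_space(hypotheses):
    """Incrementally map every hypothesis into the label space of the first one."""
    anchored = [hypotheses[0]]                          # the first output anchors the label space
    for hyp in hypotheses[1:]:
        pooled = [seg for h in anchored for seg in h]   # all previously processed outputs
        ov = pairwise_overlap(pooled, hyp)
        mapping, used = {}, set()
        # Greedily give each new label the anchored label it overlaps most with.
        for (la, lb), _ in sorted(ov.items(), key=lambda kv: -kv[1]):
            if lb not in mapping and la not in used:
                mapping[lb] = la
                used.add(la)
        # Labels without a correspondence keep their own (extra-speaker) name.
        anchored.append([(s, e, mapping.get(l, l)) for s, e, l in hyp])
    return anchored
```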
OK, here is an example. We have three systems, A, B, and C, whose labels are disjoint. We first start with system A and then compute the best mapping of the second system's labels to the labels of the first system. In this case we would map B1 to A1 and B2 to A2, while B3 would be an extra speaker label, so it remains as it is. We then relabel everything, so now we have system A and system B in the same label space.
We do the same thing again with system C. We can see here that C1 should be mapped to A1 and C3 should be mapped to A2, while C2 remains unmapped and keeps its own label, because it does not have a correspondence.
So now we have all three outputs in the same label space, and we can perform the voting for each time instant. At first the only label is A1, up to this point. Then we enter a region where there is actually a tie between A1 and another label, so we can break the tie in any way: take the label from the first input, for example, or, if there are weights attached to the inputs, take the one with the highest weight. Then we have A2 as the consensus, and later we transition back to A1. The extra label never appears in the output, because it is always in the minority.
And we can use the same idea to decide on speech versus nonspeech: we output speech only in those regions where at least half of the inputs say there is speech.
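A matching sketch of the voting step, again simplified relative to the real DOVER (fixed time step, equal or user-supplied weights, ties broken toward the label encountered first):

```python
from collections import Counter

def vote(mapped_hypotheses, step=0.01, weights=None):
    """Frame-level voting over hypotheses that already share a label space.
    Returns a list of (time, label); label is None for nonspeech, which is
    output whenever fewer than half of the (weighted) inputs see speech."""
    if weights is None:
        weights = [1.0] * len(mapped_hypotheses)
    end_time = max(e for hyp in mapped_hypotheses for _, e, _ in hyp)
    decisions, t = [], 0.0
    while t < end_time:
        tally, speech_votes = Counter(), 0.0
        for hyp, w in zip(mapped_hypotheses, weights):
            label = next((l for s, e, l in hyp if s <= t < e), None)
            if label is not None:
                tally[label] += w
                speech_votes += w
        if speech_votes >= 0.5 * sum(weights):        # speech/nonspeech majority
            decisions.append((t, tally.most_common(1)[0][0]))
        else:
            decisions.append((t, None))
        t += step
    return decisions
```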
Now, again, the natural use of this is for information fusion; that is, we run diarization on the inputs independently and then fuse. For example, if we have multiple microphones, we can diarize them independently and fuse the outputs using DOVER. Or we could have a single input but different feature streams, and again we can diarize them independently and fuse.
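Using the two sketches above, a multi-microphone fusion would look roughly like this; run_channel_diarization and channels are hypothetical placeholders for whatever per-channel system is available:

```python
# One diarization hypothesis per microphone channel, each a list of
# (start, end, label) segments, e.g. parsed from per-channel RTTM files.
channel_outputs = [run_channel_diarization(ch) for ch in channels]  # hypothetical helper

mapped = map_into_common_space(channel_outputs)  # align the anonymous labels
consensus = vote(mapped, step=0.01)              # DOVER-style consensus output
```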
We used this for multiple microphones in the original DOVER paper. We had meeting recordings with seven microphones, and you can see here that if you run a clustering-based diarization on each channel separately, you get a wide range of results depending on which channel you choose. DOVER actually gives you a result that is slightly better than the best single channel, so you are freed from having to figure out which is the best channel.
If you do the diarization using speaker identification, because your speakers are actually enrolled in the system, you get the same effect, of course at a much lower diarization error rate overall. Again, you have the best single channel and you have the worst single channel, and the DOVER combination of all these outputs gives you a result that is actually better than the minimum over the individual channels.
Now, for this paper we are looking into a different application of DOVER. It starts with the observation that diarization algorithms are often quite sensitive to the choice of hyperparameters. I will give some examples later, but it is basically because, when you do clustering, you make hard decisions based on comparing real-valued scores, and small differences in the inputs can actually yield large differences in the output. Also, the clustering is often greedy and iterative, so small differences somewhere early on can lead to very large differences later on.
This can be remedied by averaging over different runs, essentially: you run the system with different hyperparameters and average the results using DOVER, or you can use DOVER to combine the outputs of multiple different randomized clustering solutions.
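Schematically, the hyperparameter-averaging idea amounts to something like the following; run_diarization and the swept tdoa_weight values are hypothetical stand-ins for whichever system and hyperparameter are being varied:

```python
# Generate several hypotheses from the *same* input by sweeping one
# hyperparameter, then fuse them instead of committing to a single value.
stream_weights = [0.1, 0.2, 0.3, 0.4, 0.5]                                     # illustrative grid
hypotheses = [run_diarization(audio, tdoa_weight=w) for w in stream_weights]   # hypothetical helper
consensus = vote(map_into_common_space(hypotheses))                            # equal-weight fusion
```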
To experiment with this, we used a classic speaker clustering algorithm for diarization developed at ICSI. You start with an equal-length segmentation of the audio into short segments; then each segment is modeled by a mixture of Gaussians, and the similarity between different segments is evaluated by asking whether merging two GMMs yields a higher overall likelihood or not. The iteration proceeds by merging the two best clusters, then resegmenting and re-estimating the GMMs, and you keep doing this until a Bayesian information criterion tells you to stop the clustering.
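Here is a much simplified, runnable sketch of that agglomerative loop using scikit-learn GMMs. The resegmentation step, the feature-stream weighting, and the exact ICSI stopping criterion are omitted; the likelihood-gain comparison below is only a crude stand-in for the BIC-style test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(frames, n_components):
    return GaussianMixture(n_components, covariance_type="diag",
                           random_state=0).fit(frames)

def merge_gain(fi, fj, gi, gj):
    """Gain from modeling two clusters with one merged GMM whose component
    count equals the sum of the two (so the parameter counts match)."""
    both = np.vstack([fi, fj])
    merged = fit_gmm(both, gi.n_components + gj.n_components)
    return (merged.score(both) * len(both)
            - gi.score(fi) * len(fi) - gj.score(fj) * len(fj))

def agglomerative_clustering(chunks, n_components=3):
    """chunks: equal-length blocks of feature frames (one np.ndarray each)."""
    frames = [np.asarray(c) for c in chunks]
    gmms = [fit_gmm(f, n_components) for f in frames]
    while len(frames) > 1:
        pairs = [(i, j) for i in range(len(frames)) for j in range(i + 1, len(frames))]
        gains = [merge_gain(frames[i], frames[j], gmms[i], gmms[j]) for i, j in pairs]
        best = int(np.argmax(gains))
        if gains[best] < 0:                      # BIC-like stopping rule
            break
        i, j = pairs[best]
        frames[i] = np.vstack([frames[i], frames[j]])
        gmms[i] = fit_gmm(frames[i], gmms[i].n_components + gmms[j].n_components)
        del frames[j], gmms[j]
    return frames, gmms                          # remaining clusters ~ speakers
```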
We applied this algorithm to a collection of meeting recordings, from which we extracted two feature streams: MFCCs computed after beamforming — so we had multiple microphone channels, but we merged them by beamforming at the signal level and then extracted MFCCs — and the time delays of arrival, which the beamformer also gives us and which are an important feature because they indicate where the speakers are situated.
Now, there are two ways to generate more hypotheses from a single input in this case. One is what I call diversification, meaning we vary a hyperparameter over some range around a single working value. For example, I can vary the relative weight of the feature streams, or I can vary the initial number of clusters in the clustering algorithm. These are the ones I will discuss here; further variations are given in the paper, which I skip in the interest of time.
The other way is to randomize: I can manipulate the clustering algorithm so that it will not always pick the first-best pair of clusters to merge, but sometimes takes the second-best pair instead, flipping a coin in order to make these decisions. Doing this repeatedly, I can generate multiple clusterings, and of course I use DOVER to fuse the final results with equal weights. Note that all of the outputs use the same speech/nonspeech classifier, so they only differ in the speaker labels, not in the speech/nonspeech decisions, and any difference in the diarization error is in fact in the speaker error.
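In terms of the clustering sketch above, the randomization is a small change to how the merge pair is selected; the p_second value and helper names are placeholders (the talk mentions a probability of roughly one third):

```python
import random

def pick_merge_pair(pairs, gains, p_second=1/3, rng=random):
    """With probability p_second, merge the second-best cluster pair
    instead of the best one (when a second pair exists)."""
    order = sorted(range(len(pairs)), key=lambda k: gains[k], reverse=True)
    if len(order) > 1 and rng.random() < p_second:
        return pairs[order[1]]
    return pairs[order[0]]

# Different random seeds give different clusterings of the same input;
# these are then fused with equal weights, e.g. (hypothetical helper):
#   hyps = [run_diarization(audio, seed=s) for s in range(10)]
#   consensus = vote(map_into_common_space(hyps))
```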
The test data was from the NIST meeting Rich Transcription evaluations of 2007 and 2009. We used all of the microphone channels, but combined them with beamforming. The variety in this data is actually quite considerable: there are different recording sites and different speakers, from small meetings of three or four participants, with sixteen and twenty-one speakers in total in the two sets, respectively, so it was quite heterogeneous. That is why it is a challenge to actually optimize the hyperparameters on this data and carry them over from one subset to the other.
Here is what happens when you vary the feature stream weight, one of the hyperparameters. You can see that varying it over a range gives some variation in the output — this is the speaker error rate — and, more importantly, the best value on the test set is not the best value on the eval set; conversely, the best value on the eval set is not the best choice for the test set. This is what I mean by the robustness problem. Now, when we do a DOVER combination over all the different results, we actually get a nice result: it is either better than the single best result, for the test set, or very close to the single best result, on the eval set.
Similarly, when we vary the initial number of clusters of the algorithm, we also get a spread in the speaker error rate according to the variation of the cluster number, and again the best choice for the test set is not the best choice for the eval set. When you do the DOVER combination, you again get a good result; in fact, it is always better than the second-best individual choice on either data set.
Finally, we did the randomization of the clustering: specifically, we flip a coin and with probability one third use the second-best cluster pair in each merging step. The result, perhaps surprisingly, is sometimes better than with the best-first clustering. You can see here that with different random seeds we get a range of results, sometimes worse, but often better than the best-first clustering, and the same is true for the eval set. Of course we cannot expect the best seed on one data set to also be the best on the eval set; instead, we need to do the combination in order to get a robust result. And we actually improve on the best-first clustering consistently by doing the DOVER combination over the different randomized results.
To summarize: the DOVER algorithm allows us to vote among multiple diarization hypotheses. We can use this to achieve robustness in diarization by combining multiple hypotheses obtained from a single input. One way we do this is by varying hyperparameters, introducing diversity, if you will, into the results; we find that the DOVER combination over the hyperparameter variations essentially frees us from the need to do that hyperparameter optimization, and adds robustness that way.
The clustering can also be randomized, to overcome the limitations of the best-first search in clustering, and the combination of the randomized results actually achieves higher accuracy than the single best-first output.
Finally, there are many more things we can try. We can combine the different techniques, for example hyperparameter variation along multiple dimensions, or combining that with randomization, all in one big DOVER combination. We can also try this with different kinds of diarization, since the approach is agnostic to the actual form of the diarization algorithm, so we can try it with x-vector or spectral clustering or neural network based systems; of course, we would need some way of obtaining multiple hypotheses from those in order to make the approach work.
Two other things we are currently working on are combining different diarization algorithms, as well as generalizing DOVER to handle overlapping speech.
Thank you very much for your time. Please enter your questions through the conference website, and I look forward to the discussion.