OK, so the title of my talk follows from the offline supervector-based speaker diarization system which we presented at the last Odyssey, two years ago.
OK, so this is the outline of my talk.
OK, so for those who are not familiar with the baseline algorithm: the idea is to take a conversation between two speakers, usually on a single channel, and do speaker diarization.
The main principle is the following. Look at this illustration: it shows the acoustic feature space, with one speaker in blue and the other in red. If we did not have the color coding, we would not be able to separate these two speakers.
So the idea is to take the speech and do some kind of parameterization into a series of supervectors representing overlapping short segments. What we get is what we see here: now there is some separation between the two speakers, and we can also see that each speaker can roughly be modeled by a unique PDF. This is thanks to the supervector representation.
The next step is to improve the separation between the speakers by removing some of the intra-session intra-speaker variability.
This is a sketch of the algorithm, and here are the actual steps.
First there is the audio parameterization: the session is taken and a conversation-dependent UBM is estimated. So basically this algorithm doesn't need any development data; the UBM is estimated from the conversation itself. Then the conversation is segmented into overlapping one-second superframes, and each superframe is represented by a supervector adapted from the UBM.
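As a rough illustration of this step, here is a minimal sketch in Python (numpy plus scikit-learn), assuming MFCCs at 100 frames per second, a fitted GaussianMixture as the UBM, and standard MAP mean-only adaptation; the window, hop, and relevance factor are illustrative, not the talk's exact settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_supervectors(mfcc, ubm, win=100, hop=50, r=16.0):
    """Cut an MFCC stream (frames x dims, ~100 fps assumed) into
    overlapping 1-second superframes and MAP-adapt the UBM means to
    each superframe; the stacked adapted means form the supervector."""
    svs = []
    for start in range(0, len(mfcc) - win + 1, hop):
        seg = mfcc[start:start + win]
        post = ubm.predict_proba(seg)            # (win, n_components)
        n_k = post.sum(axis=0)                   # zeroth-order stats
        f_k = post.T @ seg                       # first-order stats
        alpha = (n_k / (n_k + r))[:, None]       # MAP adaptation weights
        mean_k = f_k / np.maximum(n_k, 1e-8)[:, None]
        svs.append((alpha * mean_k + (1 - alpha) * ubm.means_).ravel())
    return np.array(svs)
```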
Then there is another step which I am not going to detail, because it is something we have already presented: we estimate the intra-speaker variability on the fly, from the conversation itself, and compensate for it in order to improve accuracy.
The next step is to score each superframe as coming from either speaker 1 or speaker 2. This is done by first computing the covariance matrix of the compensated supervectors, then applying PCA to this covariance matrix, identifying the largest eigenvector, and projecting everything onto it. Then we use Viterbi decoding to do some smoothing, and finally we do Viterbi resegmentation in the MFCC space.
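The scoring step can be sketched like this (a minimal illustration, not the exact implementation; the Viterbi smoothing and the MFCC-space resegmentation are omitted):

```python
import numpy as np

def superframe_scores(compensated_svs):
    """Project each compensated supervector onto the largest eigenvector
    of the supervector covariance matrix; which side of the threshold a
    1-D projection falls on is the per-superframe speaker score that
    Viterbi later smooths."""
    centered = compensated_svs - compensated_svs.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    return centered @ eigvecs[:, -1]             # largest eigenvector
```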
So this is the baseline.
There are a few shortcomings to this algorithm. First, we found that when we apply it to short sessions (and by short I mean 15 or 30 seconds) it doesn't work that well. This is first of all because there is insufficient data for estimating all the models and parameters from a single short session, and also because the probability of an imbalance in the speakers' representation increases when we are dealing with short sessions, while this algorithm relies heavily on there being some kind of balance between the two speakers. Another issue is that this algorithm is inherently offline, and several of our customers require an online solution. So these are the shortcomings.
So first I'll talk about robustness on short sessions, which is important in itself, but is also the first step towards the online algorithm.
The basic idea is to do everything we can offline, from a development set. Instead of training the UBM from the conversation, we train it on the development set, and the NAP intra-speaker variability compensation is also trained on the development set. But we don't need any labeling of the development set, because our algorithm is unsupervised: it needs neither speaker labels nor speaker-turn labels, just the raw audio. So we take the development set, estimate the UBM, estimate the NAP transform, and also train the GMM models, all to make the system more robust to short sessions.
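The offline training idea can be sketched as follows (a minimal sketch: the component count is illustrative, and dev_mfcc is placeholder data standing in for real, unlabeled development MFCCs):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train the UBM once, offline, on pooled unlabeled development audio
# instead of re-estimating it from each conversation.
dev_mfcc = [np.random.randn(3000, 20) for _ in range(8)]   # placeholder data
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      max_iter=50, random_state=0).fit(np.vstack(dev_mfcc))
# The NAP intra-speaker variability transform is likewise estimated from
# development-set supervectors (see the adjacent-pair sketch in the Q&A).
```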
The next thing is what we call outlier-emphasizing PCA. Contrary to robust PCA, which some of you may be familiar with, in our case we are actually interested in the outliers: we want to emphasize them and give them high weight when doing PCA. To see why, look at this illustration with two speakers. When they are balanced, with the same amount of data from each, and certain conditions hold, we can just take the supervectors and apply PCA, and the largest eigenvector will actually give us the decision boundary. But if the speakers are unbalanced, then in many cases the PCA will be dominated by the dominant speaker, and we won't get the right decision boundary. So what we do is assign a higher weight to outliers, which are found by selecting the top 10% of supervectors in the given session with the largest distance from the sample mean. That is, we compute the center of gravity, the sample mean, select the 10% of supervectors most distant from it (these are the outliers), and give them a higher weight. With this weighting, the PCA suddenly works well in this example.
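A minimal sketch of outlier-emphasizing PCA along those lines (the weight value is illustrative, not a number from the talk):

```python
import numpy as np

def outlier_emphasizing_pca(X, top_frac=0.10, weight=10.0):
    """Weighted PCA that up-weights the 10% of supervectors farthest
    from the sample mean, so a minority speaker can still pull the
    largest eigenvector toward the between-speaker direction."""
    mean = X.mean(axis=0)
    dist = np.linalg.norm(X - mean, axis=1)
    w = np.ones(len(X))
    w[dist >= np.quantile(dist, 1.0 - top_frac)] = weight  # emphasize outliers
    Xc = X - mean
    cov = (Xc * w[:, None]).T @ Xc / w.sum()     # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]                        # largest eigenvector
```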
Another problem is how to choose the threshold, because in this case, for example, where the speakers are imbalanced, if we just take the center of gravity as the threshold, we will not be able to distinguish the two speakers correctly. So, following the same principle, we compute the 10th and 90th percentiles of the values along the largest eigenvector, take these two values, average them, and set the threshold more robustly from that.
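That thresholding rule is short enough to sketch directly:

```python
import numpy as np

def robust_threshold(scores):
    """Instead of thresholding the 1-D projections at their mean,
    average the 10th and 90th percentiles; this stays near the
    between-speaker boundary even when one speaker dominates."""
    lo, hi = np.percentile(scores, [10, 90])
    return 0.5 * (lo + hi)
```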
OK, so before talking about online diarization, here are a few experiments for the short-session setting. We used the NIST 2005 dataset for this evaluation. One important point is that we compute the speaker error rate without discarding the margin around speaker turns. This is contrary to the standard protocol, and it is because we are dealing with short sessions: when we tried to throw away data, we found it caused some numerical problems. Basically this means that the results I present are somewhat more pessimistic than what we would get with the standard method. Another important issue is that we throw away short sessions with less than 3 seconds per speaker.
What we actually do is take the 5-minute sessions from NIST and chop them into short sessions. Sometimes when doing that we may get short sessions, of for example 15 seconds, with only a single speaker, or with only one second from the second speaker. In this work we do not try to deal with the problem of detecting such situations, where we effectively have only a single speaker, and therefore we remove such sessions.
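For concreteness, a minimal sketch of that chopping and filtering step (the frame rate and chunk length are illustrative, and frame_labels stands for the reference labeling used only to filter the evaluation data):

```python
import numpy as np

def chop_and_filter(frame_labels, chunk_sec=15, fps=100, min_sec=3.0):
    """Chop a session's per-frame reference speaker labels (1 or 2) into
    fixed-length chunks and keep only the chunks containing at least
    `min_sec` seconds of speech from each speaker."""
    chunk = chunk_sec * fps
    kept = []
    for start in range(0, len(frame_labels) - chunk + 1, chunk):
        seg = frame_labels[start:start + chunk]
        if min(np.sum(seg == 1), np.sum(seg == 2)) >= min_sec * fps:
            kept.append((start, start + chunk))   # both speakers present
    return kept
```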
Here are the results for the techniques I talked about. Basically, what we can see is that for long sessions we get neither improvement nor degradation; however, for short sessions we get roughly a 15% error reduction using this technique.
OK, so now let's talk about online diarization. The framework here is the following: we take a prefix of the session, and the prefix is something that we process offline. Of course, you want the prefix to be as short as possible, and we actually set its length adaptively: we start with a short prefix, and according to a confidence estimate we verify whether this prefix is good enough for the processing, or whether we should take a longer prefix and redo the processing. So we take the prefix of the session and process it offline, simply applying our algorithm to it. The result of this processing is the segmentation for the prefix, together with some model parameters, for example the PCA and its threshold. We then take these model and threshold parameters and process the rest of the session online, using these models as a starting point and updating them periodically. The online processing usually runs with some delay, because we need some kind of backtracking; so we always have some latency, which can be a second or less.
We first apply this to voice activity detection; I won't go over all the details, since it is quite standard. Once voice activity detection is done online, we have to do the speaker diarization.
First there is the front end, which we run online: we compute the MFCCs, extract the supervectors, and compensate for intra-speaker variability. Then we take the prefix, compute the PCA on the supervectors in the prefix, project all the supervectors onto the largest eigenvector, and do Viterbi segmentation. Then, for the rest of the session, we take the PCA statistics from the prefix and accumulate them online; we periodically recompute the PCA and adjust our decision boundary, and we also run Viterbi with partial backtracking, which introduces some latency.
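Putting the pieces together, here is a minimal sketch of the online stage (it reuses outlier_emphasizing_pca and robust_threshold from the earlier sketches; the refresh schedule is illustrative, and the Viterbi smoothing with partial backtracking is omitted):

```python
import numpy as np

def fit_axis_and_threshold(svs):
    """Re-fit the decision boundary from the supervectors seen so far,
    using the earlier sketches."""
    axis = outlier_emphasizing_pca(svs)
    return axis, robust_threshold((svs - svs.mean(axis=0)) @ axis)

def online_diarize(prefix_svs, stream_svs, refresh=50):
    """Start from the model fitted on the prefix, label each incoming
    superframe, and periodically re-estimate the PCA axis and the
    threshold as statistics accumulate."""
    seen = list(prefix_svs)
    axis, thr = fit_axis_and_threshold(np.array(seen))
    labels = []
    for i, sv in enumerate(stream_svs):
        seen.append(sv)
        score = (sv - np.mean(seen, axis=0)) @ axis
        labels.append(1 if score > thr else 2)
        if (i + 1) % refresh == 0:               # periodic retraining
            axis, thr = fit_axis_and_threshold(np.array(seen))
    return labels
```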
Here are some results. First we analyze the sensitivity to the delay parameter: the delay parameter is the delay we have when doing online diarization on the rest of the conversation, since we still need some delay for the Viterbi smoothing. We found that 0.2 seconds was good enough for this algorithm. Then we ran some experiments to verify the sensitivity to the prefix length, and we found that, starting from a speaker error rate of 4.4, we see significant degradation: it gets to 9.0 for a 15-second prefix.
Then we ran a control experiment: we did the same experiments, but threw away all the sessions that did not have at least 3 seconds per speaker in the prefix. For example, for this column, we throw away all the sessions whose first 15 seconds do not contain at least 3 seconds per speaker. When we do that we see quite good results, and the explanation is that most of the degradation is due to the fact that sometimes both speakers are not represented in the prefix. The remedy we introduce is to apply the confidence criterion I will talk about shortly.
But before talking about the confidence: the overall latency of the system is 1.3 seconds, apart from the prefix. So if we have a 5-minute conversation, the first, say, 15 seconds are not processed online but offline, and after this prefix we get a latency of 1.3 seconds.
So now, the issue of the confidence-based prefix. We saw that sometimes 15 seconds is enough and sometimes it is not, and this is largely governed by the requirement that both speakers be present in the prefix. So what we do is start with a short prefix, do diarization, estimate the confidence in the diarization, and if the confidence is not high enough, we simply expand the prefix and start over.
We tried several confidence measures and finally chose the Davies-Bouldin index, which is the ratio between the average intra-class standard deviation and the inter-class distance; we can compute it once we have the diarization.
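A sketch of this adaptive-prefix loop, using scikit-learn's davies_bouldin_score (the candidate lengths and the max_db cutoff are illustrative, not values from the talk; diarize stands for the offline algorithm above):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def confident_prefix(svs, diarize, sv_per_sec=2,
                     lengths=(15, 30, 45, 60), max_db=1.0):
    """Expand the prefix until the Davies-Bouldin index of the two-way
    clustering is low enough (lower = tighter, better-separated
    clusters), then hand over to the online stage."""
    for sec in lengths:
        n = sec * sv_per_sec                      # superframes in prefix
        labels = np.asarray(diarize(svs[:n]))
        if len(np.unique(labels)) == 2 and \
                davies_bouldin_score(svs[:n], labels) <= max_db:
            return sec, labels                    # confident enough: stop
    return lengths[-1], labels                    # fall back to longest prefix
```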
OK, I won't go into all the details of this slide and the next ones, but the main idea is that you can actually get nice gains by using this confidence measure. For example, with 30-second prefixes, 50% of the sessions need to be extended to get almost as good a result, but for the other 50% of the sessions you can just stop. So you can start with a prefix of 30 seconds, do diarization, and compute this confidence measure; for 50% of the sessions you can decide that it is OK to stop there and switch to online processing, and for the rest of the sessions you would need, for example, 45 to 60 seconds of prefix to get optimal results.
OK, so what is the time complexity of the offline and online systems? This is a question many people asked me after the previous presentation at the last Odyssey. So we ran an experimental analysis of this algorithm on 5-minute sessions. No optimization of any sort was done; it is plain research code. What we see here is that the baseline system is 5 times faster than real time.
We can actually improve the accuracy of the system by applying some of the algorithms I presented. If we take all the components I talked about, some of them actually degrade accuracy slightly (for example, training the UBM offline gives some degradation), so we get back to 4.4, but we get a speed-up factor of 50: 50 times faster than real time.
For the online system, with a prefix of 30 seconds and a delay of 0.2 seconds, the speed-up factor is controlled by the retraining parameter, which determines how frequently we re-estimate our PCA model and our GMMs. We control it in a variable way: we start with a high frequency at the beginning of the conversation, and towards the end of the conversation we stop retraining, or do it at a very low frequency. For the online system we managed to get a speaker error rate of 7.8 with a speed-up factor of 30.
OK, before concluding, I'll talk about a specific task which we are interested in: speaker diarization for speaker verification. Here we are not really interested in getting a very accurate, high-resolution diarization; we just don't want to suffer a large degradation in the equal error rate of the speaker recognition on two-wire data. We presented initial work at Interspeech 2011, and here we have some improvements that integrate all the components I talked about in this presentation into this variant of our system.
So we divide the audio into overlapping 5-second superframes, because we don't need high resolution, and we score each superframe independently against the target speaker model. What we then have to do is classify, or cluster, these superframes into the two speakers. So we do a partial diarization: we cluster the superframes into two clusters, and we also deemphasize superframes that lie on the borderline between the clusters. Because we are actually interested in speaker verification, not speaker diarization, we can simply throw away superframes for which we are not certain to which speaker they belong. For the clustering we use eigenvoice-based dimensionality reduction and k-means.
We found that the silhouette measure was the best criterion for deemphasizing some of the superframes.
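A minimal sketch of that hard, silhouette-based selection (keep_frac is an illustrative setting, and the eigenvoice-based dimensionality reduction applied beforehand is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def confident_superframes(svs, keep_frac=0.8):
    """Cluster the 5-second superframes into two speakers with k-means
    and keep only the superframes with the highest silhouette values,
    dropping the borderline ones before verification scoring."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(svs)
    sil = silhouette_samples(svs, labels)         # per-superframe confidence
    keep = sil >= np.quantile(sil, 1.0 - keep_frac)
    return labels[keep], np.where(keep)[0]        # labels + kept indices
```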
We also run this online: we use the same prefix framework, where the prefix is processed offline and then we simply adapt for the rest of the conversation. For verification we use the GMM-NAP-SVM system developed for NIST 2004 and 2006, evaluated on NIST 2005, male speakers only.
We see that we get some improvement compared to the results we presented at Interspeech, and we also observed that with this new technique of using the silhouette confidence measure to remove superframes, the hard decision gives the optimal result, compared to using a soft decision or no removal at all.
So, to summarize: we extended our speaker diarization method to work with short sessions and to run online. We proposed the following novelties: offline unsupervised estimation of intra-session intra-speaker variability (again, we use a development set to estimate this variability, but it is not labeled at all; we don't need labeled data), and outlier-emphasizing PCA for improving speaker clustering, together with adaptive threshold setting. The overall latency is 1.3 seconds, apart from the prefix, and the speed is 50 times faster than real time for the offline system and between 30 and 40 times for the online system. Also, for the speaker verification task (there is more on this in the paper than in the presentation), we managed to substantially reduce the delay for speaker verification on two-wire data. OK, thank you.
Q: For initialization, did you consider trying an online speaker segmentation algorithm, where you just find the first speaker change, so that you are sure the second speaker, or the first speaker, appears within the next 15 seconds?
A: Yeah, what we are trying to do now is to stay with the prefix framework, start with a very short prefix, and try to expand it while assessing whether there is a single speaker or not in the prefix. That turns out to be hard; yeah, that is why we don't have it in the paper.
Q: The rate you report is the speaker diarization rate, the diarization error rate? A: It is the speaker error rate, without voice activity detection; just the speaker confusion, that's all.
Q: Can we go back to the results for recognition? Do you know how the baseline was done? A: Nothing was done, just scoring. Q: Do you have the number? A: We have it in the last Interspeech paper; that number is there.
Q: The last question is about the PCA itself. One of the things NAP does is remove the channel first. In the PCA, do you do any kind of channel compensation? A: Channel, no. There is something we actually do, though: we try to use the same techniques as are used for speaker verification, namely the NAP technique.
What we do is take pairs of adjacent supervectors and assume they belong to the same speaker, which is usually the case; once in a while it is not, because of a speaker change, but usually they are from the same speaker. From these pairs we estimate the intra-speaker variability.
Q: So you only estimate short-term variability? A: Short-term variability, yes.
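A minimal sketch of that adjacent-pair NAP estimation (the rank is an illustrative choice):

```python
import numpy as np

def nap_projection(supervectors, rank=10):
    """Estimate intra-speaker variability from adjacent supervector
    pairs (assumed to come from the same speaker, which is usually
    true) and build a NAP projection that removes the top `rank`
    directions of that variability."""
    diffs = supervectors[1:] - supervectors[:-1]    # adjacent pairs
    cov = diffs.T @ diffs / len(diffs)
    eigvals, eigvecs = np.linalg.eigh(cov)
    U = eigvecs[:, -rank:]                          # top variability directions
    P = np.eye(supervectors.shape[1]) - U @ U.T     # removes those directions
    return P                                        # apply as supervectors @ P
```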
Q: I don't understand the reason online diarization is needed; what is the motivation?
A: OK, this started because there were actually two clients. One of them is, for example, a call-center scenario. Let's assume it is two-wire; in practice that is often the case nowadays, at least with one of the vendors. The idea of the project was to run speech recognition online on the call-center data and to present the agent with a summary of the conversation. In order to produce the summary, they need speaker diarization, and everything must be done online, but it can be done with some latency; a 30-second prefix, for example, is OK, because these are usually longer conversations.
Q: When you use Viterbi, do you always go all the way back to the beginning, or do you just...? A: In the online case, no; we backtrack just over a small chunk.
Q: How far do you go back? A: It depends. Of course we also tried going all the way back; it does not really cause a problem, but we found that we can save a bit by not doing that, though it is not very important. The latency is caused by the future, not the past; the past is something you can process very quickly.
Q: One more question: did you try the algorithm on the multi-speaker diarization task that is used with meeting data?
A: Actually, we are now working within the framework of a European project that deals with a meeting-type scenario. We will have to take this algorithm and run it there; we will have to modify it, of course.
Alright, let's thank the speaker again. [Applause]