Good afternoon everybody. Today I am going to present our work, which addresses the problem of online speaker diarization. In contrast to other works, which as we have seen are mainly offline, our system works online and in a semi-supervised scenario. I will first provide a brief introduction and the motivation, then describe the system implementation, and finally present some experimental results and the conclusions.
I guess most of you are familiar with the problem of speaker diarization: basically, given an audio stream, we want to determine who spoke when. We want to determine both the segmentation, where the segment boundaries represent speaker changes, and the speaker sequence, that is, the assignment of each segment to a specific speaker.
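To make the task concrete, here is a toy sketch (not from the talk; all times and speaker labels are made up) of a diarization hypothesis as a speaker-labelled segmentation of the timeline:

```python
# Toy illustration: a diarization hypothesis is a speaker-labelled segmentation
# of the audio timeline. All times (seconds) and speaker labels are made up.

def who_spoke_when(hypothesis, t):
    """Return the speaker active at time t, or None during non-speech."""
    for start, end, speaker in hypothesis:
        if start <= t < end:
            return speaker
    return None

hypothesis = [
    (0.0, 3.2, "spk1"),   # segment boundaries mark speaker changes
    (3.2, 7.5, "spk2"),
    (8.1, 12.0, "spk1"),  # the gap 7.5-8.1 is non-speech
]

print(who_spoke_when(hypothesis, 5.0))   # -> spk2
print(who_spoke_when(hypothesis, 7.8))   # -> None (non-speech)
```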
Most state-of-the-art diarization systems revolve around the offline scenario. However, with the diffusion of smart objects, intelligent meeting rooms and smart homes, online diarization has attracted increasing interest in recent years. In the literature only a few online diarization systems have been presented, mainly focusing on plenary speeches and broadcast news, where the speaker turns are longer. Our previous work addressed the problem of unsupervised online diarization of meeting data with a single distant microphone, a harder condition. Unfortunately, although the results were aligned with previous work, that system did not perform well enough for practical applications; there is still a shortfall in online diarization.
Basically, in online diarization we have to deal with the problem of speaker model initialization. Assuming that we encounter speech and want to initialize a speaker model, the question is which analysis window, that is, which amount of speech, we take to initialize that model. We can choose a short amount of speech to decrease the latency of the system; however, as everybody probably knows, the error rates are then much higher, because the speaker models are not well initialized with little data. Otherwise we can take longer windows, a longer amount of speech, but in that case, due to speaker variation, a window may contain multiple speakers, since the number of speakers in a window tends to increase with longer windows.
One way to improve online diarization is to allow the speaker models to be initialized with some initial labeled training data. Our contribution is therefore the present work, a semi-supervised online diarization system. A similar idea was already presented in earlier work, but in the context of offline diarization. The question we try to address is: what amount of seed data is required to reach a performance similar to that of an offline diarization system?
Okay, I will continue by explaining the incremental MAP adaptation that we use to update the models in our system. Suppose we have a sequence of speech segments from a particular speaker s, where each segment is parameterized by a set of acoustic features, and suppose we have an initial GMM-UBM model with a given number of Gaussian components. We found that in most of the systems reported in the literature, the authors initialize the speaker model by MAP-adapting the UBM model with the first speech segment, obtaining the first speaker model s1, and then use the next speaker segment to adapt the previous model s1, obtaining a new model s2, and so on. However, we found that with this procedure the final model is not the same as the model obtained by adapting the UBM model once with all the segments. So, although it is a modest contribution, we found that by calculating the sufficient statistics of each available speech segment against the UBM model each time, and by accumulating those sufficient statistics, the final model is more consistent: it coincides with the offline MAP-adapted model.
As you know, the sufficient statistics for a Gaussian component are the zeroth-order, first-order and second-order statistics; basically, they represent the posterior probability of each feature contained in the segment against the UBM model, and we use all the available segments to compute them against the UBM. To obtain the new parameter estimates, we basically compute a trade-off, a ratio between the sufficient statistics and the original parameters. This ratio depends on the relevance factor, which tells us how much importance we give to the initial parameters rather than to the new data when estimating the new parameters. There is also an additional normalization so that the estimated weights sum to one.
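The update described above can be sketched as follows. This is a minimal 1-D illustration of MAP mean adaptation with made-up UBM parameters, features and relevance factor, not the actual system:

```python
import math

# Minimal 1-D sketch of MAP mean adaptation of a GMM-UBM.
# All UBM parameters and feature values are made-up toy numbers.

def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def sufficient_stats(features, weights, means, variances):
    """Zeroth-order (n) and first-order (f) statistics against the UBM."""
    n = [0.0] * len(weights)
    f = [0.0] * len(weights)
    for x in features:
        post = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(post)
        for k, p in enumerate(post):
            gamma = p / total          # responsibility of component k for x
            n[k] += gamma
            f[k] += gamma * x
    return n, f

def map_adapt_means(n, f, means, relevance=16.0):
    """Interpolate the data mean and the prior mean; the relevance factor
    controls how much weight the UBM keeps when data is scarce."""
    new_means = []
    for k, m in enumerate(means):
        alpha = n[k] / (n[k] + relevance)        # data-dependent mixing ratio
        data_mean = f[k] / n[k] if n[k] > 0 else m
        new_means.append(alpha * data_mean + (1 - alpha) * m)
    return new_means

# Toy 2-component UBM and one short speech segment
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
segment = [0.8, 1.1, 1.3, 0.9]
n, f = sufficient_stats(segment, weights, means, variances)
adapted = map_adapt_means(n, f, means)
```

With little data the ratio alpha stays small, so the adapted means move only slightly away from the UBM, which is exactly the behaviour the relevance factor is meant to give.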
So, in sequential MAP adaptation, we use the first segment to adapt the UBM model, obtaining the first speaker model s1; then we calculate the sufficient statistics of the next segment against the speaker model s1 to train a new model s2, and, more in general, given the speaker segment i+1, the sufficient statistics are calculated against the previous model s_i, so all the sufficient statistics are calculated against the latest model. In the incremental MAP adaptation instead, we calculate the sufficient statistics of the first segment against the UBM model to obtain the first speaker model; with the second segment we again calculate the sufficient statistics against the UBM model, accumulate them with the previous ones, and adapt the UBM with the accumulated statistics. In this way the final model is the same as with offline MAP adaptation.
So, to summarize: we train the initial speaker model with the first segment; then, given segment i+1, the sufficient statistics are obtained by accumulating the statistics of the features in the last segment, calculated against the UBM, with the previously accumulated statistics. As I said, it is a modest contribution, but we found that it brings good improvements in the final diarization results, as we will see later.
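The equivalence with offline adaptation can be checked on a toy example (1-D, two components, made-up parameters): accumulating per-segment sufficient statistics against the fixed UBM gives exactly the statistics of all segments pooled together.

```python
import math

# Toy check of the incremental-adaptation property: accumulating each
# segment's sufficient statistics against the *fixed* UBM equals the
# statistics of all the speech pooled offline. 1-D, made-up parameters.

def stats(features, weights, means, variances):
    n = [0.0] * len(weights)
    f = [0.0] * len(weights)
    for x in features:
        post = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(post)
        for k, p in enumerate(post):
            n[k] += p / total
            f[k] += (p / total) * x
    return n, f

ubm = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
segments = [[0.9, 1.2], [1.0, 0.7, 1.4], [0.8]]

# Incremental: accumulate each segment's stats against the UBM.
acc_n, acc_f = [0.0, 0.0], [0.0, 0.0]
for seg in segments:
    n, f = stats(seg, *ubm)
    acc_n = [a + b for a, b in zip(acc_n, n)]
    acc_f = [a + b for a, b in zip(acc_f, f)]

# Offline: one pass over all the speech pooled together.
pooled_n, pooled_f = stats([x for seg in segments for x in seg], *ubm)

assert all(abs(a - b) < 1e-9 for a, b in zip(acc_n, pooled_n))
assert all(abs(a - b) < 1e-9 for a, b in zip(acc_f, pooled_f))
```

This holds because the statistics are sums over frames against a fixed model, so they are additive across segments; sequential adaptation breaks this by scoring each segment against a moving model.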
Okay, the system implementation. We have a supervised phase and an unsupervised phase. In the supervised phase we assume we are allowed a given amount of labeled speech segments per speaker, and with those, after feature extraction, we initialize the models for all the people speaking in the meeting. In the online phase, which is unsupervised instead, we classify each incoming speech segment after dividing it into segments of a maximum duration ts, which represents our latency. These segments are classified against the available speaker models: we determine which speaker model is the most likely, we label the segment according to the speaker model that maximizes the likelihood, and we update that model by incremental or sequential MAP adaptation; we will show results for both. During the online processing we also accumulate the sufficient statistics that are used to update the speaker models. So, in the online processing, we assign each segment to one of the speaker models according to a maximum-likelihood criterion: the model that maximizes the likelihood of the features contained in the segment is used to label the segment, and that speaker model is then adapted by either sequential or incremental MAP adaptation. This is the implementation of the system we use.
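The assignment step of the online phase can be sketched as follows. The speaker models here are toy 1-D single-component Gaussians with made-up parameters, standing in for the real adapted GMMs:

```python
import math

# Sketch of the online assignment step: each incoming segment is labelled
# with the speaker model that maximises its log-likelihood. The speaker
# models below are toy 1-D "GMMs" with made-up parameters.

def log_likelihood(features, weights, means, variances):
    ll = 0.0
    for x in features:
        p = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances))
        ll += math.log(p)
    return ll

speaker_models = {
    "spk1": ([1.0], [-2.0], [1.0]),   # (weights, means, variances)
    "spk2": ([1.0], [2.0], [1.0]),
}

def classify(segment):
    """Label a segment with the maximum-likelihood speaker model."""
    return max(speaker_models, key=lambda s: log_likelihood(segment, *speaker_models[s]))

print(classify([1.8, 2.3, 2.1]))   # features near spk2's mean -> spk2
```

In the real system the winning model would then be updated with the segment's sufficient statistics, by sequential or incremental MAP adaptation.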
Now I will present the experimental setup and the experimental results. We used four different datasets compiled from the NIST Rich Transcription activities. The first dataset, used to train the UBM, is a set of sixty meeting shows from the NIST RT'04 evaluation. The development dataset, a set of fifteen meeting shows from the RT'05 and RT'06 evaluations, is used to develop the system. The evaluation sets used to evaluate the system are a set of eight meeting shows from the RT'07 evaluation and a set of seventeen shows from the RT'09 evaluation; we show the results independently for these two datasets, to allow a better comparison with previous work.
For the experimental setup, we use nineteen mel-frequency cepstral coefficients augmented by energy, with twenty-millisecond windows and ten-millisecond shifts, meaning a ten-millisecond overlap. The UBM is trained with ten EM iterations and sixty-four Gaussian components. The analysis window, that is, the segment duration that corresponds to the latency of the system, is evaluated for four latencies, from 0.25 seconds and 0.5 seconds up to 4 seconds, and the amount of training data used to initialize the models varies from one to thirty-nine seconds. We show results for both sequential and incremental MAP adaptation, and the relevance factor for the MAP adaptation is kept fixed.
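The 20 ms / 10 ms framing just mentioned can be sketched as follows; the 16 kHz sample rate is an assumption for illustration, not stated in the talk:

```python
# Sketch of the framing used for feature extraction: 20 ms analysis windows
# with a 10 ms shift (i.e. 10 ms overlap between consecutive windows).
# The 16 kHz sample rate is an assumption, not stated in the talk.

SAMPLE_RATE = 16000
WIN = int(0.020 * SAMPLE_RATE)    # 320 samples per analysis window
HOP = int(0.010 * SAMPLE_RATE)    # 160 samples between window starts

def frames(signal):
    """Split a 1-D signal into overlapping analysis windows."""
    return [signal[i:i + WIN] for i in range(0, len(signal) - WIN + 1, HOP)]

one_second = [0.0] * SAMPLE_RATE
print(len(frames(one_second)))    # -> 99 full windows in one second of audio
```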
Overlapped speech is removed according to the transcriptions, which is a known limitation of this setup. The offline baseline is a top-down diarization system, which we use as a reference for the results.
In the first plot, on the left, we present the results using the sequential MAP adaptation approach. We can see that, by allowing a certain amount of labeled training data to initialize the models, we manage to perform better than the offline diarization system. On the right are the results using incremental MAP adaptation, which allows for better profiles of the curves, because the models accumulate the statistics. We can see that we reach the offline diarization performance with only five seconds in the sequential case, while incremental works better, needing only three seconds, and by allowing more training data we reach a diarization error rate of ten percent. Note that, for the lowest latencies, the system does not perform well, because the latencies are really too low.
In this table we present the results for different amounts of training data, three seconds, five seconds and seven seconds, for the different datasets; all these results correspond to a latency of three seconds. We can see that in all cases incremental MAP adaptation works better than sequential MAP adaptation, so accumulating the statistics as we go provides better results.
Finally, this graph represents the amount of training data, as a function of the latency, at which we reach the offline diarization performance; all points correspond to a DER of seventeen percent. Again we see here that incremental MAP adaptation works better than sequential MAP adaptation. For future work, the goal is to reduce both the latency and the amount of training data needed to reach better performance.
To conclude, we proposed a semi-supervised online diarization system, and we showed that, in the case of the RT'07 dataset, the system can outperform an offline diarization system with only three seconds of speaker seed data and a latency of three seconds, when using the incremental MAP adaptation approach. By allowing higher latency or more seed data we obtain even lower error rates. Also, having tested the convenience of initializing the speaker models with some labeled training data, this opens the direction towards the development of supervised or semi-supervised speaker-discriminative feature transformations, both to reduce the latency and the amount of data needed.
Thank you. We have time for a few questions.

Thank you for the talk. For your system to work, do you need to know how many speakers are in the conversation?

Yes. Usually we assume that we know in advance the number of speakers, in order to better initialize the models, and we are searching for other ways to introduce new speakers that are unknown at the beginning.

At the beginning you divide the data between the speakers, so you assume that all the speakers speak from the start? Each speaker presents himself, or something like this?

Yes, exactly; it means that every speaker has to speak at the beginning.

Your reference system does not assume that the number of speakers is known, so it is not a totally fair comparison, is it?

I agree with the comment. When we ran these experiments we had only that offline diarization system, and it was difficult to initialize the speaker models otherwise, so since we already had that baseline we decided to stick with it as a reference for the comparison.

And the last question: what practical use do you see for this online system, for the first segmentation, for the filtering?

Okay, there are other applications for which this work can be useful. For example, when you interact with a smart object, you can provide some initial data for the people that usually use that smart object, so the system has some initial data for the users it is going to serve.

Thank you very much. Any other questions?
I have a question. From your presentation I understand that you are adapting all the parameters: means, weights and covariances. Have you tried to check what happens if you adapt fewer parameters?

Yes. In the sequential MAP adaptation case we adapt only the means: with sequential adaptation, updating all the parameters gave worse results, because each update relies on only a few data. In the incremental case, instead, we are carrying along the accumulated statistics, and those statistics are also useful to update the variances and the weights as well; so in the incremental MAP adaptation case, updating all the parameters brought better results than updating only the means.

And, for example, do you think it would make sense to compare in terms of the number of parameters, maybe varying the number of Gaussians in the UBM?

Sorry, you mean, for a fairer comparison, reducing the number of Gaussians?

No, I was thinking of increasing the number of Gaussians as you go: as the speaker model becomes more reliable, you increase the number of components.

Maybe. We did not try it in this case; we used sixty-four Gaussian components because it was the best setting in the experiments that we did. But that might be a good idea.
There is still time for questions.

Could you give an explanation of why your system does better than the offline system?

Sorry?

Why does your system do better than the offline system? I would not expect that.

Okay, in that case: in previous work we tried a totally unsupervised system, in which we also estimated the number of speakers, and the performance was much worse than the offline diarization system. As I said before, that comparison was fairer, because the offline diarization system also estimates the number of speakers, whereas in this case we allow ourselves to know the number of speakers in advance, to get better performance and to target practical applications. Knowing the number of speakers already adds a lot of information to the problem, information that the offline diarization system does not have.

Understanding that, an offline system could exploit it too: I mean, you can imitate the online setting by using an offline system.

Okay, but an offline system basically needs all the audio from the beginning, while here we do not have to wait for the whole audio. Also, the offline system that we used is really computationally heavy, since it runs a lot of segmentation and clustering iterations; so from the point of view of latency it is much worse than the online diarization system.
Any more questions? Then let's thank the speaker again.