I'm here to present this work, the influence of transition costs in the
segmentation stage of a speaker diarization system. I'm [name unintelligible]
and I work in the Speech Technology Group at the UPM [unintelligible].
So here is the outline I'm going to follow in this presentation. First of
all I'm going to explain the baseline system that we have used, with its
modules, its stages and so on,
focusing on the segmentation and clustering stage, which is basically the
diarization stage itself,
and which is where we have actually been making changes and analysing results.
What we wanted in this work was to analyse the effect of
the parameters involved in the duration of the speaker turns.
I mean,
a turn can be very long or very short: which parameters are involved in this
decision, and how much are they affecting our system?
Then I will present the experiments we have done with the development
dataset, all the analysis, and some conclusions.
Here is the baseline system architecture. We work with multiple signals
from multiple microphones, so that's our input, which we first filter to reduce
noise.
Then we extract from these various signals the delay between them, that is, the
time delay of arrival (TDOA), and we use this information for two things.
The first
is the acoustic fusion, which is, as you
probably know,
to create one acoustic signal just by summing all the
signals from the different microphones, delaying one with respect to
the other,
so that at the end the
voice in the fused acoustic signal is reinforced over the noise.
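As a rough illustration of the acoustic fusion just described, here is a minimal delay-and-sum sketch in Python. It is not the code of the system in the talk: it assumes integer sample delays and equal channel weights, and the name `delay_and_sum` and its arguments are made up for this example.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its delay (in samples) and average them.

    channels: list of equal-length sample lists, one per microphone.
    delays:   per-channel delay relative to the reference channel;
              a positive value means that channel lags the reference.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # sample of this channel that lines up with time t
            if 0 <= src < n:
                out[t] += samples[src]
    return [value / len(channels) for value in out]


reference = [0, 1, 2, 3, 4, 5, 6, 7]
lagged = [0, 0] + reference[:-2]  # same signal arriving two samples later
fused = delay_and_sum([reference, lagged], [0, 2])
```

Once the lag is compensated, the two copies of the signal add coherently, which is the reason for estimating the delays first.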
Now, we use this signal to extract the MFCCs, the mel-frequency
cepstral coefficients,
and also to extract information about where there is speech and where there is
not, with the voice activity detector.
On the other hand, the delays, the TDOAs, are also used,
actually, as an input to,
as you see, the last stage, that is, the segmentation and agglomerative clustering of
speech [unintelligible],
where we actually decide who is speaking and when each one speaks, and so where
we perform the diarization.
Whatever I call it,
it doesn't matter:
I mean this
segmentation and clustering stage.
Here is a more detailed diagram of what this stage
performs. We first start with an initialization, an initial
segmentation
that for the baseline system is uniform: we uniformly
segment the recording into N clusters, for us
sixteen. It could be sixteen or more, but we use sixteen, because then we
are iteratively going to
reduce the number by merging the hypothesised clusters.
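The uniform initialization can be sketched in a few lines of Python. This is only an illustration under simple assumptions (one contiguous block of frames per cluster); the function name `uniform_init` is hypothetical and the real system may split the recording differently.

```python
def uniform_init(num_frames, num_clusters=16):
    """Label frame t with cluster floor(t * num_clusters / num_frames)."""
    return [t * num_clusters // num_frames for t in range(num_frames)]


labels = uniform_init(160, 16)  # 160 frames split into 16 blocks of 10 frames
```

Starting with more clusters than expected speakers leaves room for the later merging iterations to reduce the number.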
Then,
after this initial segmentation, we start the segmentation and training.
In this stage, with the current segmentation we create models, which
we then use to re-segment the signal.
At the end we will have a better segmentation, according to
the speaker models we have trained.
After this re-segmentation we compare the
clusters one to one, in pairs, and we merge those that are most
similar.
We use for that the Bayesian information criterion,
which we use
in general.
We do all this iteratively until there are no
more clusters to merge, because,
well, the system considers that there are no more clusters that are
similar enough to be merged, and
it finishes.
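The merge loop described above can be sketched as follows. This is a toy version under loud assumptions: each cluster is modelled by a single one-dimensional Gaussian instead of the speaker models of the real system, the merge score is a textbook delta-BIC, and all names (`gaussian_loglik`, `delta_bic`, `agglomerate`) are hypothetical.

```python
import math


def gaussian_loglik(xs):
    """Log-likelihood of xs under a 1-D Gaussian fitted to xs itself."""
    n = len(xs)
    mean = sum(xs) / n
    sq = sum((x - mean) ** 2 for x in xs)
    var = max(sq / n, 1e-6)  # variance floor avoids log(0)
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq / var


def delta_bic(a, b, penalty_weight=1.0):
    """Positive when one shared Gaussian is preferred over two separate ones."""
    n = len(a) + len(b)
    # Merging saves two parameters (a mean and a variance): 0.5 * 2 * log(n).
    penalty = penalty_weight * math.log(n)
    return gaussian_loglik(a + b) - gaussian_loglik(a) - gaussian_loglik(b) + penalty


def agglomerate(clusters):
    """Merge the most similar pair until no pair has a positive score."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, (i, j) = max(
            (delta_bic(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if best <= 0:
            break  # stopping criterion: nothing is similar enough to merge
        clusters[i].extend(clusters.pop(j))
    return clusters


result = agglomerate([
    [0.0, 0.1, -0.1, 0.05],
    [0.02, -0.03, 0.08, -0.06],
    [10.0, 10.1, 9.9, 10.05],
])
```

In this toy data the first two clusters come from the same "speaker" and are merged, while the third stays separate. In the full system a re-segmentation and retraining step would run between merges.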
score
that's features
Well, here
we are getting to the point. This is the diagram where you can see
all the parameters that are actually involved in the
duration of the speaker turns. We have one parameter called minimum duration,
which
for us is 250 frames, 250 frames of ten milliseconds,
so around 2.5 seconds. That is like saying:
okay, I want my speaker turns to be at least
2.5 seconds long, because if they are shorter, well, I'm not very
much interested in them. So let's force the system to stay at least
250 frames.
Then there are these parameters, alpha and beta. We wanted to look at them because
they define a constant influence on the duration of
the speaker turns.
Alpha is the weight applied to
remaining in the same speaker, not moving to another, and beta is the weight applied to
moving from one cluster to that of another speaker. We
set alpha to one.
We know that they then do not sum to one, so they are not
really probabilities, but that way alpha actually has no influence on
the final decision.
But beta, in the ICSI system that we use, is set to
one over M, and we
discovered it was like that in the ICSI system.
The problem for us is not just that this parameter
has some influence on the decision of moving from one speaker to
another, but that M is the number of active clusters.
And
our system iteratively reduces this number of clusters. It goes from sixteen,
sixteen for us because it is what we use,
down to, well, the final number, which could be two,
or one.
One would be just the limit, but two, three, four; and in each iteration
it is going to change.
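To make the dependence on the iteration concrete, this small Python sketch prints the logarithm of beta = 1/M as the number of active clusters M shrinks. The function name `log_transition_weight` is illustrative only.

```python
import math


def log_transition_weight(num_active_clusters):
    """log(beta) for beta = 1 / M, the baseline setting described above."""
    return math.log(1.0 / num_active_clusters)


for m in (16, 8, 4, 2, 1):
    print(m, round(log_transition_weight(m), 3))
```

The value is a penalty of about 2.77 (in natural-log units) with sixteen clusters and shrinks towards zero as clusters are merged, so the same acoustic evidence is judged differently at different iterations.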
Here you see when the system decides to change. First we have,
in this basic inequality,
the likelihood of some frames
belonging to one cluster; on the other side, the likelihood of the same frames
belonging to another cluster.
Those two are related to the data we have, so we are okay
with them. But this other parameter, which we have called the logarithm of k,
is independent of the data. In the
diagram of the previous slide it was beta,
which is one over M.
So
it is independent, it has nothing to do with the data.
Actually, if k is lower than one it is penalising changes,
because the logarithm of
zero point whatever is negative,
and if it is higher,
well,
it favours these changes.
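The slide itself is not part of this transcript, so the following is only a reconstruction of the inequality from the spoken description. The symbols X (the frames under consideration) and theta (cluster models) are notation assumed here, not taken from the talk.

```latex
\log p(X \mid \theta_{\text{other}}) + \log k \;>\; \log p(X \mid \theta_{\text{current}}),
\qquad k = \beta = \frac{1}{M} \ \text{in the baseline}
```

Read this way, k < 1 adds a negative term and penalises a change of cluster, k = 1 adds zero and leaves the decision to the data, and k > 1 adds a positive term and favours a change.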
And as I said,
as we have that k is one over M,
and M decreases in every iteration, k increases in every iteration, but it is still
going to be always lower than one. So the baseline system is
always penalising changes, even though we really don't know if
we want
shorter turns or longer ones; we are doing it anyway.
And if we have a lower number of speakers, because
we go down from sixteen to whatever,
we have a higher probability of changes [unintelligible],
so more of these transitions.
So we thought: well, let's set it constant.
Maybe it works fine, and this way we remove this variability and take the decision
only from the data. And also,
of course, as we decided to do this experiment, we decided also
to say: okay,
let's set this k to a fixed value.
Negative in the log domain, maybe, but what we actually wanted to look at is whether we could
fix
this transition weight so it doesn't change over iterations,
but also
maybe a positive value, so we favour transitions, changes of
speakers.
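The interaction between the minimum duration and a fixed transition weight can be sketched with a small Viterbi decoder. This is an illustrative toy, not the decoder from the talk: it assumes per-frame log-likelihoods are already available, models the minimum duration as a chain of sub-states per cluster, and uses hypothetical names (`viterbi_min_duration`, `log_alpha`, `log_beta`).

```python
NEG_INF = float("-inf")


def viterbi_min_duration(loglik, min_dur, log_alpha=0.0, log_beta=0.0):
    """Best cluster sequence for per-frame log-likelihoods loglik[t][c].

    Each cluster is a chain of min_dur sub-states sharing one emission score.
    log_alpha weights the self-loop on the last sub-state (staying) and
    log_beta weights the jump from there into another cluster (changing).
    """
    num_frames, num_clusters = len(loglik), len(loglik[0])
    last = min_dur - 1
    # A path must start in the first sub-state of some cluster.
    score = [[loglik[0][c] if d == 0 else NEG_INF for d in range(min_dur)]
             for c in range(num_clusters)]
    backpointers = []
    for t in range(1, num_frames):
        new = [[NEG_INF] * min_dur for _ in range(num_clusters)]
        ptr = [[None] * min_dur for _ in range(num_clusters)]
        for c in range(num_clusters):
            # Enter cluster c from the last sub-state of a different cluster.
            for p in range(num_clusters):
                cand = score[p][last] + log_beta
                if p != c and cand > new[c][0]:
                    new[c][0], ptr[c][0] = cand, (p, last)
            # Forced advance through the chain: this is the minimum duration.
            for d in range(1, min_dur):
                if score[c][d - 1] > new[c][d]:
                    new[c][d], ptr[c][d] = score[c][d - 1], (c, d - 1)
            # Self-loop on the last sub-state: stay with the same speaker.
            cand = score[c][last] + log_alpha
            if cand > new[c][last]:
                new[c][last], ptr[c][last] = cand, (c, last)
            for d in range(min_dur):
                new[c][d] += loglik[t][c]
        score = new
        backpointers.append(ptr)
    # Trace back from the best final sub-state.
    c, d = max(((c, d) for c in range(num_clusters) for d in range(min_dur)),
               key=lambda cd: score[cd[0]][cd[1]])
    path = [c]
    for ptr in reversed(backpointers):
        c, d = ptr[c][d]
        path.append(c)
    return path[::-1]


# Toy data: cluster 0 fits frames 0-5 (with a one-frame blip at frame 3),
# cluster 1 fits frames 6-11.
toy = [[0, -1]] * 3 + [[-1, 0]] + [[0, -1]] * 2 + [[-1, 0]] * 6
neutral = viterbi_min_duration(toy, min_dur=3, log_beta=0.0)
penalised = viterbi_min_duration(toy, min_dur=3, log_beta=-100.0)
```

With `log_beta = 0` (k = 1) the decision depends only on the data and the minimum duration, so the one-frame blip is ignored and the change at frame 6 is found. With a strongly negative `log_beta` the decoder never changes cluster and the whole toy recording goes to a single speaker.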
Those are the experiments.
Here is the database I have used.
We have the development set, which is
made up of meetings [number unintelligible]
from the evaluations of
two thousand [unintelligible], 2005, 2006 and 2007,
and we have used all that as the development dataset; and then the test set,
which is RT09.
So that is the development set, and the test results presented here
are for 2009.
Here are all the experiments we have done to analyse,
to study, the effect of these
parameters. We wanted to check the effect on the system when we set
this k, this
transition weight, to a constant. I call it that because it is like
a weight applied to the transitions you have.
And we wanted also to evaluate its influence
when we are also taking into account the minimum duration parameter,
because, well, both of them are actually influencing the duration of the speaker turns.
We usually work in the baseline with 250 frames, so there is
the baseline, which is the flat line.
It is only a reference, of course, because the transition weight in the baseline
is set to one over M, so it changes throughout the process.
And then we have all these other
experiments here. The transition weight can be, well, if it is one, I
want you to notice that when the transition weight is one it is like
multiplying by one, or adding zero in the logarithm, so
no effect at all: the decision is only data.
And if it is very low, it means that changes are
very much penalised. Actually,
I also
put the value of the transition weight equal to zero
[unintelligible],
and the result
was, well, very high, because with that value the system
at the end assigns the whole recording to one speaker, which is obviously
a very high error rate.
Then we saw that with minimum duration equal to 200 we actually have
a fairly stable area with a low error rate, a stable section
in which to find a good value,
and with a lower error rate than the baseline. So
maybe it is good to take this into consideration.
So let's see
what happened in the end.
What happened in the end,
when we fixed
this and went to the test set: we chose
three points. All of those points we had evaluated
with the development dataset, where
they were better than the baseline. We chose three:
one, two, three.
We also computed the diarization for
the transition weight equal to
one over M, which is the baseline, but with a minimum duration of
200, so
that we could actually compare
the improvement due to the transition weight variation and due to the minimum duration
separately, because the baseline uses minimum duration 250.
I liked very much the idea of setting it constant because, well,
it was
good to see that, if a
parameter is independent of the data, and we can set it constant and get better results,
or at least results that are good enough,
then why not:
it is something less to tune for future experiments.
The problem is actually the test set: it didn't go very well.
Not very much worse, and maybe if we compute the average of the two error
rates it is good, but
it was worse.
[unintelligible]
And those are the results for
[unclear] three, which strongly favours changes
of speaker
and reduces the minimum duration of any speaker turn.
Conclusions. I think I have gone through these more or less during
the presentation, so this is more like a summary. What I think about this transition
weight: we did not invent it, we discovered it, because
it was previously set statically; it came from ICSI, so maybe some of you
have worked with it
unknowingly.
For us, we discovered that very small changes can affect the system very much,
and that's why I would like, to begin with, to have it constant.
But even if you don't want to set it constant, at least you have to know
that it exists. If you want to change the duration, or to
work with the duration of the speaker turns, it is important
to run experiments
with both the transition weight and the minimum duration, also because, well, varying
these
we
actually got
better results.
For us that is good, but the main thing we learned from this is that
if the variability with one parameter is very high,
or can be very high, you must
try to take it into account, and maybe with this technique
setting it constant is the best option, so you can
make the system more robust for future experiments.
Well, that's more or less it, I think.
so we then proposed
thanks to multiple english
first of all
we look for this so that it's much platoons good solution from the whole circle
two six should also and so each time constant
so smooth the lasted a okay a sycamore to all the weights
we should
flirt
the phone but it's very important to train them
the show a high constant
i know not so the remote were used for training the transition probabilities
in rooms do not want to work with them or whatever how to cope with
this remote is
and it's as much as the solution
this transition
the motivation and the results
i dunno why the snow
okay well
i don't use the word
those in differences
all rates were all the routes two goals in the logo to go all this
is
as a constant
it's a cost and is
one two three doesn't matter at all
so in quantum o one with the home and speaker of the
because why you try
well
If it is three, you know, it is a constant value, but it is a different
number, and the decision is taken when this inequality is fulfilled.
This inequality
is like: okay, the likelihood that these frames belong to
this cluster, against the likelihood that these same frames belong to another, completely different cluster,
and then k. If it is high, you force the
change of cluster; if it is very low,
it is harder
to go to another cluster.
That's why, with its value, you either favour the changes more easily or you penalise
them.
questions
so why i of course there is also probably
you transition words so
we can use
okay for volume of english could be thing
in there as well
first for ratio between the core model on the
the
moreover maybe
become the new speaker
so it's
So do you think this sort of threshold
would be dependent on the task or the database? Do you just have one setting?
I haven't actually [unintelligible].
That's why I think this is for the system to be more robust when
it is used in future tasks or with other databases.
Well, it depends on the speaker turns in different meetings and
databases: some have slightly longer durations [unintelligible],
and if you are in a room with four people, well, it depends
basically on them. That's why I say: okay, if
you can get similar results with a constant,
it is one thing you don't have to train, and that's always good.
If you have a similar result, you
will have less work to do [unintelligible].
To get rid of this problem, right. I also want to say that this is like a
preliminary work, and I would like to maybe use this
somehow. I really don't know, I don't have any clear idea of
what to do,
but
if I somehow had an idea of how long the speaker turns are
going to be,
or how many speakers there are, or maybe if I had some information about the role of
the speakers in the room, that could
actually
be used to tune this parameter [unintelligible].
Some way of extracting this information in
unsupervised diarization could be tricky, but still I think you then could
adjust this parameter to get better results.
but not
okay
show you questions