This work was done by Jeff and me, and, as you have seen, there are no i-vectors in it. We really tried to put i-vectors in, but we didn't find where to put them. We don't claim that this is the state of the art. But, to do something new, sometimes the best thing is to use something very old that everyone has forgotten about.
So this work basically goes back to nineteen fifty-five, when these models were, approximately, first defined. At that time two types of HMMs were defined. One is the type we know well: we have transitions from one state to another, and at each state a distribution that defines the distribution of the data in that state. It is named the Moore HMM. In the other type, the distribution depends on where the data came from, so both the transition probability and the distribution of the data are on the arcs, not at the states. It is named the Mealy HMM.
In the control-systems community they worked a lot on both types of HMMs, but they were interested in other questions: they didn't try to estimate the parameters or to find the best path via Viterbi. They worked more on discrete distributions and asked questions like: what is an equivalent model that they can find, or what is the minimal HMM that they can find.
We will look at it from the other perspective, of the Mealy HMM, compare it to the Moore HMM, the HMM we know, and try to apply it to diarization of telephone conversations.
So, I will give a short summary of the HMM, just to set up the notation, not to say anything new. Then I will present the Mealy HMM, show how we applied it to speaker diarization, and some results will follow.
So, in the HMM we know, if we have a K-state model, it is defined by the initial probability vector, the transition matrix, and the vector of state distributions; in the GMM case, each state distribution is a GMM. So the triple pi, A and B defines the model.
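Just to pin down that notation, here is a minimal sketch in standard HMM symbols (the names pi, A, B follow the talk; the rest is the usual textbook convention):

```latex
% Moore HMM: a K-state model is the triple
\lambda = (\pi, A, B), \qquad
\pi_i = P(q_1 = i), \qquad
a_{ij} = P(q_t = j \mid q_{t-1} = i),
% and in the GMM case each state pdf is
b_j(x) = \sum_{m=1}^{M} w_{jm}\, \mathcal{N}(x;\, \mu_{jm}, \Sigma_{jm}).
```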
In the Moore HMM, as we know, there are three problems: to compute the probability, or likelihood, of the model given the data; the Viterbi problem, to find the best path; and to estimate the model. We can estimate it via Viterbi statistics or by Baum-Welch. In our case, in diarization, we are more interested in the Viterbi statistics.
The motivation to use the Mealy HMM can be seen in this toy example. Suppose we are looking at state two. A little data arrives from state one to state two, call it n21: only two hundred points. On the other hand, from state three to state two arrives much more data, nine times more. The distributions of the data arriving from each state are different Gaussians. But if we try to estimate for state two a GMM of size two, it will basically look almost like the data that arrived from state three; the state-one data will have very small influence on the distribution.
So suppose we want to emphasize the data that derived from state one, and treat the transitions into this state properly according to where the data came from. These are the distributions of the two Gaussians, of state one and of state two, and above them the same distributions after we multiply each of them by the transition probability. We can see that we will almost never take the transition from state one to state two, because the blue line is above, and we will always decide to stay in state one. But if we look only at the data on the transitions, on the arcs, at the arc-specific data, then we see a totally different picture: it is much preferable to move from state one to state two than to stay in state one, which is the blue line.
So, if we have a specific distribution on each arc, and not at the state level, we can better decide when to move from one state to another, compared with assuming that the data in a state is identically distributed no matter from which previous state we arrived. This was the motivation to try to move from the Moore HMM to the Mealy HMM.
In this case we define our model as follows: we have the initial vector, but that initial vector is not a vector of probabilities; it is a vector of pdfs, of distribution functions, so it depends also on the data, not only on which state you are going to. And we have a matrix A, which is again a matrix of functions, depending on from which state to which state the datum transitions, and depending also on the data. So now we have a model which is a couple, only of pi and A.
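Written out, under the same conventions as before (a sketch; the arc functions a_ij(x) are my notation for what the talk calls the matrix of functions):

```latex
% Mealy HMM: the model is the couple
\lambda = \big(\pi(x),\, A(x)\big), \qquad
\pi(x) = \big(\pi_1(x), \dots, \pi_K(x)\big), \qquad
A(x) = \big(a_{ij}(x)\big)_{i,j=1}^{K},
% where a_{ij}(x) is the density of emitting x
% while moving from state i to state j.
```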
We have the same three problems as in the Moore HMM: to compute the likelihood of the model given the data, to find the best path, and to estimate the parameters via Viterbi statistics or Baum-Welch. Again, Baum-Welch is not the focus of this talk; we will only touch on it a little bit later.
So we can see that if we want to estimate the likelihood, it becomes very easy: it is just the product of the initial vector multiplied by the matrices, and then, to sum it up, we multiply by a row vector of ones. If we compare it to the Moore HMM, of course, there we have to sum over all the possible paths, and we use the forward and backward coefficients to do it; but still, the recursion is much more complex than the matrix multiplication that we have in the Mealy representation.
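In symbols, the Mealy likelihood is just a chain of matrix products (a sketch, assuming pi(x) is treated as a row vector and 1 is a column vector of ones):

```latex
P(x_1, \dots, x_T \mid \lambda)
  = \pi(x_1)\, A(x_2)\, A(x_3) \cdots A(x_T)\, \mathbf{1}.
```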
To find the best Viterbi path is also a known problem: we just have to take these products over the best transitions and maximize, taking the argmax over the sequence of states we want. I will go over it briefly, because it is well known: at each time stamp we have a vector of the best likelihoods of the partial sequence, and a vector of where we arrived from, just as in the Moore case.
We initialize the delta vector and the psi vector very simply, but in the recursion the equation becomes very simple, much simpler than it was in the Moore HMM: you just have an element-wise product between the vector of previous likelihoods and a column of the matrix. We take the maximum of these products, which gives the maximum likelihood of the path, and the argmax gives the previous state where we came from. Then, as in the Moore HMM, we have the termination, and the backtracking equation does not change at all.
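As a concrete illustration, here is a minimal sketch of that max-product recursion in Python; the function name and the convention of passing the per-frame matrices A(x_t) as a list are my own choices, not from the talk:

```python
import numpy as np

def mealy_viterbi(pi_x, A_x):
    """Viterbi decoding for a Mealy HMM, where the emission pdfs sit on
    the arcs. pi_x: length-K vector pi(x_1) of initial arc likelihoods;
    A_x: list of K x K matrices A(x_t) for t = 2..T, with
    A_x[t][i, j] = a_ij(x_t). Returns the best state sequence and its
    log-likelihood (hypothetical sketch)."""
    delta = np.log(pi_x)                     # delta_1(j) = log pi_j(x_1)
    psi = []                                 # backpointers
    for A in A_x:
        scores = delta[:, None] + np.log(A)  # delta_{t-1}(i) + log a_ij(x_t)
        psi.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0)       # single max-product step
    # termination and backtracking, exactly as in the Moore case
    path = [int(np.argmax(delta))]
    for bp in reversed(psi):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(np.max(delta))

# toy usage with random positive arc likelihoods
rng = np.random.default_rng(0)
pi = rng.random(3)
As = [rng.random((3, 3)) for _ in range(5)]  # A(x_2) .. A(x_6)
print(mealy_viterbi(pi, As))
```

The termination and backtracking lines are literally the Moore ones; only the score changes, from delta_{t-1}(i) a_ij b_j(x_t) to delta_{t-1}(i) a_ij(x_t).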
If you want to estimate the parameters using Viterbi statistics, this is the cost function. The difference from the Moore case is in the Lagrange multiplier: now the constraint is not that the weights of each state GMM have to sum to one; the summation to one has to be over all the weights of all the transitions out of a state. If you are going from state one, we take all the weights on the self-loop of state one, plus all the weights to state two and to state three. And this is the only difference.
In the end it converges to a very simple re-estimation equation: we just take the data on each transition and train a GMM, like we do in the Moore case, but then we have to scale the weights of each GMM by this fraction. Everyone sees what this fraction is, yes? This fraction is actually the same as the transition probability in the Moore HMM. So we can see that on each arc there is not a pdf, but a pdf multiplied by the probability of the transition.
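Spelled out, in my notation from before, the re-estimation result says each arc carries a scaled GMM (a sketch of the idea, not a verbatim slide equation):

```latex
% each arc function is a GMM whose weights sum to one
% only across ALL outgoing arcs of a state:
a_{ij}(x) = \sum_{m} w_{ijm}\, \mathcal{N}(x;\, \mu_{ijm}, \Sigma_{ijm}),
\qquad \sum_{j=1}^{K} \sum_{m} w_{ijm} = 1 \quad \text{for each } i,
% so the scaling fraction
p_{ij} = \sum_{m} w_{ijm}
% plays exactly the role of the Moore transition probability:
a_{ij}(x) = p_{ij}\, b_{ij}(x).
```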
If you want to do Baum-Welch, I will not give the equations; they are big and ugly, and there is not much information in them. I will just show that we have to define the hidden variables a little bit differently. We need a hidden variable for the initial state, saying that the m-th mixture of the k-th initial state emitted x1, and similarly we define the hidden variable for any other time which is not one.
Then the question can arise: does it really matter whether you use a Moore HMM or a Mealy HMM? Yes and no. Yes, because we will see that it makes life easier, as we will show shortly. No, because it was already shown that any Moore HMM can be represented as a Mealy HMM, and vice versa, any Mealy HMM can be represented as a Moore HMM.
So, if we define the set of all possible sequences — to give an example with binary sequences, say every value can be only zero or one, so X* is all possible sequences — then the string probability P is a mapping from X* to [0, 1], and we can define equivalent models. Two models, which can be both Moore HMMs, both Mealy HMMs, or one Moore and one Mealy, are defined as equivalent if for each sequence we get that P equals P prime.
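As a formula, with X* for the set of all finite sequences, this equivalence is simply (my rendering of the definition):

```latex
P_{\lambda}, P_{\lambda'} : X^{*} \to [0, 1], \qquad
\lambda \equiv \lambda' \iff
P_{\lambda}(x) = P_{\lambda'}(x) \quad \forall\, x \in X^{*}.
```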
Then we can define the Moore-minimal model: it is the model in the equivalence class that has the smallest number of states. The Mealy-minimal model is defined the same way: if you have two equivalent Mealy models, with the same P, we take the one with the smaller number of states. It is still an open question how to find the minimal model.
But what is more interesting: it can be shown that for any K-state Moore HMM we can find an equivalent Mealy HMM with no more than K states. Vice versa it is not so easy: for a K-state Mealy HMM it can happen that the minimal Moore model will have K-squared states, so we increase the number of states to the power of two.
It is very easy to show how to move from Moore to Mealy: on the arcs you just put the pdfs of the states multiplied by the transitions, and we have an equivalent model. But if we are going from a Mealy HMM, we have to build a structure where part of the transitions are zero, and specify in a very precise way how to build it. I am not sure that this will be the minimal Moore model, but it was shown that this Moore model will be equivalent to the Mealy model; however, we increase the transition matrix and so on.
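The easy direction, Moore to Mealy, is just this one-line substitution (in my earlier notation):

```latex
% put the destination state's pdf, scaled by the transition
% probability, on every arc of the Mealy model:
a_{ij}(x) = p_{ij}\, b_j(x),
```

which keeps the same K states; the reverse direction needs the sparse-transition construction described above and may blow up to K-squared states.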
This is the case when we know which state belongs to which event. If we don't know, we will have to somehow estimate it, that state s1 belongs to event one and s2 to event two, and that is not very simple.
We applied it to speaker diarization. We have voice activity detection, overlapped speech removal, and initialization of the HMM, and then we apply fixed-duration GMM clustering, both for the Mealy and for the Moore model. The minimum duration was two hundred milliseconds, which means we stay twenty time steps in the same model. We have three hyper-states, for speaker one, speaker two and non-speech, because we know that this is a telephone conversation, so we know in advance there are only two speakers.
In the case of the Moore HMM, this is the picture: in our case we stay twenty times in the same model, and then we can transition to any other. In the Mealy HMM it is very similar, but now we stay in one model d minus one, nineteen, times in the same model, and the distributions are now on the transitions.
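A tiny sketch of how such a minimum-duration hyper-state can be wired up in code; the function and parameter names are hypothetical, and the exit mass going to the other hyper-states is left abstract:

```python
import numpy as np

def min_duration_chain(d=20, p_stay=0.9):
    """Build the d x d transition block of one hyper-state (a speaker or
    non-speech): a left-to-right chain of d sub-states forces a minimum
    duration of d frames (200 ms at a 10 ms frame rate), after which we
    either stay in the model or leave it. Hypothetical sketch."""
    A = np.zeros((d, d))
    for i in range(d - 1):
        A[i, i + 1] = 1.0        # d - 1 forced transitions down the chain
    A[d - 1, d - 1] = p_stay     # self-loop once the minimum is reached
    # the remaining 1 - p_stay mass exits to the other hyper-states;
    # in the Mealy version each nonzero entry also carries an arc GMM
    return A

print(min_duration_chain(4, 0.9))  # small toy block for display
```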
The results were on an LDC database, one hundred and eight conversations of approximately ten minutes each. We tried different models. For the Moore models, twenty-one and twenty-four full-covariance Gaussians gave the best results; above twenty-four the results dropped, so we don't show them here.
Then we tried different configurations of the Mealy HMM. On the left side we see the total number of Gaussians that we have in the whole HMM, and on the right side the diarization error rate. We can see, basically, that we have more GMMs to estimate, but we can achieve the same results as in the Moore HMM with about twenty percent fewer Gaussians overall.
Why? Because we are able to model the data on the transitions. We cannot be sure that speaker one speaking after speaker two has the same dynamics as, for example, someone who starts speaking after silence; maybe they speak differently, and we want to capture these transition effects, and we define them on the arcs.
So we can obtain the same results with fewer Gaussians, or a little bit better results when we use more Gaussians.
So, we presented the Mealy HMM, showed that it works similarly, and showed the relation between Mealy and Moore. We saw that we can do telephone diarization without any loss of performance when we use the Mealy model, and even get better performance with less complexity. We know that the HMM is not always used as a standalone diarization system; also, in bigger diarization systems, there is often a refinement at the end which is done by an HMM. We know that in i-vector based diarization, between phase one and phase two, there is an HMM that does re-segmentation; we can replace that Moore HMM by a Mealy HMM and maybe get some improvement in those systems.
So, this is the last thing I wanted to say. Thank you.

Yes, a question over there.
So, in speaker diarization we usually use GMMs, right, which is well known, and you are using an ergodic HMM. So can you comment on the advantage of using an ergodic approach compared to that? — In diarization we use not plain GMMs but, like in this system, an HMM. The HMM is ergodic because we basically assume that we can move from each speaker to each speaker, in an ergodic way. — And the question is about the state distributions now. — The state distributions so far are GMMs, and we replaced them also with GMMs, but on the arcs instead of on the states. They stay GMMs.
Okay, but you are not using the notion of, you know, the universal background model? — No, this we don't use, because we work with several companies, and when we tried to get data for a universal background model, they said that they have no data; the channels are changing very much, and maybe they can give us one or one and a half hours of data. I am not sure that we can build a very good UBM with one or even two hours of data. So we use a standalone model that does not rely on some background model. But if there is a background model, if we have the data, we can use an extended HMM, like a UBM-based i-vector system, and just encapsulate the GMMs as part of it; that's not a problem.
The next paper is on broadcast data, so it may have more details for you.