This work was done by Jeff and me, and, as you have seen, there are no i-vectors in it. We really tried to put i-vectors in, but we didn't find where to put them. We don't claim that this is the state of the art. But, to do something new, sometimes the best thing is to use something very old that everyone has forgotten about.
So this work basically goes back to nineteen fifty-five, when these models were, approximately, first defined. At that time two types of HMMs were defined. One is the type we know well: we have transitions from one state to another, and at each state a distribution that defines the distribution of the data in that state. It is named the Moore HMM. In the other type, the distribution depends on where the data came from, so both the transition probability and the distribution of the data are on the arcs, not at the states. It is named the Mealy HMM.
In the control-systems community they worked a lot on both types of HMMs, but they were interested in other questions: they didn't try to estimate the parameters or to find the best path via Viterbi. They worked more on discrete distributions and asked questions like: what is an equivalent model that they can find, or what is the minimal HMM that they can find.
We will look at it from the other perspective, of the Mealy HMM, compare it to the Moore HMM, the HMM we know, and try to apply it to diarization of telephone conversations.
So, I will give a short summary of the HMM, just to set up the notation, not to say anything new. Then I will present the Mealy HMM, show how we applied it to speaker diarization, and some results will follow.
So, in the HMM we know, if we have a K-state model, it is defined by the initial probability vector, the transition matrix, and the vector of state distributions; in the GMM case, each state distribution is a GMM. So the triple pi, A and B defines the model.
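Just to pin down that notation, here is a minimal sketch in standard HMM symbols (the names pi, A, B follow the talk; the rest is the usual textbook convention):

```latex
% Moore HMM: a K-state model is the triple
\lambda = (\pi, A, B), \qquad
\pi_i = P(q_1 = i), \qquad
a_{ij} = P(q_t = j \mid q_{t-1} = i),
% and in the GMM case each state pdf is
b_j(x) = \sum_{m=1}^{M} w_{jm}\, \mathcal{N}(x;\, \mu_{jm}, \Sigma_{jm}).
```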
In the Moore HMM, as we know, there are three problems: to compute the probability, or likelihood, of the model given the data; the Viterbi problem, to find the best path; and to estimate the model. We can estimate it via Viterbi statistics or by Baum-Welch. In our case, in diarization, we are more interested in the Viterbi statistics.
The motivation to use the Mealy HMM can be seen in this toy example. Suppose we are looking at state two. A little data arrives from state one to state two, call it n21: only two hundred points. On the other hand, from state three to state two arrives much more data, nine times more. The distributions of the data arriving from each state are different Gaussians. But if we try to estimate for state two a GMM of size two, it will basically look almost like the data that arrived from state three; the state-one data will have very small influence on the distribution.
So suppose we want to emphasize the data that derived from state one, and treat the transitions into this state properly according to where the data came from. These are the distributions of the two Gaussians, of state one and of state two, and above them the same distributions after we multiply each of them by the transition probability. We can see that we will almost never take the transition from state one to state two, because the blue line is above, and we will always decide to stay in state one. But if we look only at the data on the transitions, on the arcs, at the arc-specific data, then we see a totally different picture: it is much preferable to move from state one to state two than to stay in state one, which is the blue line.
So, if we have a specific distribution on each arc, and not at the state level, we can better decide when to move from one state to another, compared with assuming that the data in a state is identically distributed no matter from which previous state we arrived. This was the motivation to try to move from the Moore HMM to the Mealy HMM.
In this case we define our model as follows: we have the initial vector, but that initial vector is not a vector of probabilities; it is a vector of pdfs, of distribution functions, so it depends also on the data, not only on which state you are going to. And we have a matrix A, which is again a matrix of functions, depending on from which state to which state the datum transitions, and depending also on the data. So now we have a model which is a couple, only of pi and A.
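Written out, under the same conventions as before (a sketch; the arc functions a_ij(x) are my notation for what the talk calls the matrix of functions):

```latex
% Mealy HMM: the model is the couple
\lambda = \big(\pi(x),\, A(x)\big), \qquad
\pi(x) = \big(\pi_1(x), \dots, \pi_K(x)\big), \qquad
A(x) = \big(a_{ij}(x)\big)_{i,j=1}^{K},
% where a_{ij}(x) is the density of emitting x
% while moving from state i to state j.
```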
We have the same three problems as in the Moore HMM: to compute the likelihood of the model given the data, to find the best path, and to estimate the parameters via Viterbi statistics or Baum-Welch. Again, Baum-Welch is not the focus of this talk; we will only touch on it a little bit later.
So we can see that if we want to estimate the likelihood, it becomes very easy: it is just the product of the initial vector multiplied by the matrices, and then, to sum it up, we multiply by a row vector of ones. If we compare it to the Moore HMM, of course, there we have to sum over all the possible paths, and we use the forward and backward coefficients to do it; but still, the recursion is much more complex than the matrix multiplication that we have in the Mealy representation.
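In symbols, the Mealy likelihood is just a chain of matrix products (a sketch, assuming pi(x) is treated as a row vector and 1 is a column vector of ones):

```latex
P(x_1, \dots, x_T \mid \lambda)
  = \pi(x_1)\, A(x_2)\, A(x_3) \cdots A(x_T)\, \mathbf{1}.
```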
To find the best Viterbi path is also a known problem: we just have to take these products over the best transitions and maximize, taking the argmax over the sequence of states we want. I will go over it briefly, because it is well known: at each time stamp we have a vector of the best likelihoods of the partial sequence, and a vector of where we arrived from, just as in the Moore case.
We initialize the delta vector and the psi vector very simply, but in the recursion the equation becomes very simple, much simpler than it was in the Moore HMM: you just have an element-wise product between the vector of previous likelihoods and a column of the matrix. We take the maximum of these products, which gives the maximum likelihood of the path, and the argmax gives the previous state where we came from. Then, as in the Moore HMM, we have the termination, and the backtracking equation does not change at all.
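As a concrete illustration, here is a minimal sketch of that max-product recursion in Python; the function name and the convention of passing the per-frame matrices A(x_t) as a list are my own choices, not from the talk:

```python
import numpy as np

def mealy_viterbi(pi_x, A_x):
    """Viterbi decoding for a Mealy HMM, where the emission pdfs sit on
    the arcs. pi_x: length-K vector pi(x_1) of initial arc likelihoods;
    A_x: list of K x K matrices A(x_t) for t = 2..T, with
    A_x[t][i, j] = a_ij(x_t). Returns the best state sequence and its
    log-likelihood (hypothetical sketch)."""
    delta = np.log(pi_x)                     # delta_1(j) = log pi_j(x_1)
    psi = []                                 # backpointers
    for A in A_x:
        scores = delta[:, None] + np.log(A)  # delta_{t-1}(i) + log a_ij(x_t)
        psi.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0)       # single max-product step
    # termination and backtracking, exactly as in the Moore case
    path = [int(np.argmax(delta))]
    for bp in reversed(psi):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(np.max(delta))

# toy usage with random positive arc likelihoods
rng = np.random.default_rng(0)
pi = rng.random(3)
As = [rng.random((3, 3)) for _ in range(5)]  # A(x_2) .. A(x_6)
print(mealy_viterbi(pi, As))
```

The termination and backtracking lines are literally the Moore ones; only the score changes, from delta_{t-1}(i) a_ij b_j(x_t) to delta_{t-1}(i) a_ij(x_t).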
If you want to estimate the parameters using Viterbi statistics, this is the cost function. The difference from the Moore case is in the Lagrange multiplier: now the constraint is not that the weights of each state GMM have to sum to one; the summation to one has to be over all the weights of all the transitions out of a state. If you are going from state one, we take all the weights on the self-loop of state one, plus all the weights to state two and to state three. And this is the only difference.
In the end it converges to a very simple re-estimation equation: we just take the data on each transition and train a GMM, like we do in the Moore case, but then we have to scale the weights of each GMM by this fraction. Everyone sees what this fraction is, yes? This fraction is actually the same as the transition probability in the Moore HMM. So we can see that on each arc there is not a pdf, but a pdf multiplied by the probability of the transition.
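Spelled out, in my notation from before, the re-estimation result says each arc carries a scaled GMM (a sketch of the idea, not a verbatim slide equation):

```latex
% each arc function is a GMM whose weights sum to one
% only across ALL outgoing arcs of a state:
a_{ij}(x) = \sum_{m} w_{ijm}\, \mathcal{N}(x;\, \mu_{ijm}, \Sigma_{ijm}),
\qquad \sum_{j=1}^{K} \sum_{m} w_{ijm} = 1 \quad \text{for each } i,
% so the scaling fraction
p_{ij} = \sum_{m} w_{ijm}
% plays exactly the role of the Moore transition probability:
a_{ij}(x) = p_{ij}\, b_{ij}(x).
```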
If you want to do Baum-Welch, I will not give the equations; they are big and ugly, and there is not much information in them. I will just show that we have to define the hidden variables a little bit differently. We need a hidden variable for the initial state, saying that the m-th mixture of the k-th initial state emitted x1, and similarly we define the hidden variable for any other time which is not one.
Then the question can arise: does it really matter whether you use a Moore HMM or a Mealy HMM? Yes and no. Yes, because we will see that it makes life easier, as we will show shortly. No, because it was already shown that any Moore HMM can be represented as a Mealy HMM, and vice versa, any Mealy HMM can be represented as a Moore HMM.
So, if we define the set of all possible sequences — to give an example with binary sequences, say every value can be only zero or one, so X* is all possible sequences — then the string probability P is a mapping from X* to [0, 1], and we can define equivalent models. Two models, which can be both Moore HMMs, both Mealy HMMs, or one Moore and one Mealy, are defined as equivalent if for each sequence we get that P equals P prime.
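As a formula, with X* for the set of all finite sequences, this equivalence is simply (my rendering of the definition):

```latex
P_{\lambda}, P_{\lambda'} : X^{*} \to [0, 1], \qquad
\lambda \equiv \lambda' \iff
P_{\lambda}(x) = P_{\lambda'}(x) \quad \forall\, x \in X^{*}.
```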
Then we can define the Moore-minimal model: it is the model in the equivalence class that has the smallest number of states. The Mealy-minimal model is defined the same way: if you have two equivalent Mealy models, with the same P, we take the one with the smaller number of states. It is still an open question how to find the minimal model.
But what is more interesting: it can be shown that for any K-state Moore HMM we can find an equivalent Mealy HMM with no more than K states. Vice versa it is not so easy: for a K-state Mealy HMM it can happen that the minimal Moore model will have K-squared states, so we increase the number of states to the power of two.
It is very easy to show how to move from Moore to Mealy: on the arcs you just put the pdfs of the states multiplied by the transitions, and we have an equivalent model. But if we are going from a Mealy HMM, we have to build a structure where part of the transitions are zero, and specify in a very precise way how to build it. I am not sure that this will be the minimal Moore model, but it was shown that this Moore model will be equivalent to the Mealy model; however, we increase the transition matrix and so on.
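The easy direction, Moore to Mealy, is just this one-line substitution (in my earlier notation):

```latex
% put the destination state's pdf, scaled by the transition
% probability, on every arc of the Mealy model:
a_{ij}(x) = p_{ij}\, b_j(x),
```

which keeps the same K states; the reverse direction needs the sparse-transition construction described above and may blow up to K-squared states.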
This is the case when we know which state belongs to which event. If we don't know, we will have to somehow estimate it, that state s1 belongs to event one and s2 to event two, and that is not very simple.
We applied it to speaker diarization. We have voice activity detection, overlapped speech removal, and initialization of the HMM, and then we apply fixed-duration GMM clustering, both for the Mealy and for the Moore model. The minimum duration was two hundred milliseconds, which means we stay twenty time steps in the same model. We have three hyper-states, for speaker one, speaker two and non-speech, because we know that this is a telephone conversation, so we know in advance there are only two speakers.
In the case of the Moore HMM, this is the picture: in our case we stay twenty times in the same model, and then we can transition to any other. In the Mealy HMM it is very similar, but now we stay in one model d minus one, nineteen, times in the same model, and the distributions are now on the transitions.
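A tiny sketch of how such a minimum-duration hyper-state can be wired up in code; the function and parameter names are hypothetical, and the exit mass going to the other hyper-states is left abstract:

```python
import numpy as np

def min_duration_chain(d=20, p_stay=0.9):
    """Build the d x d transition block of one hyper-state (a speaker or
    non-speech): a left-to-right chain of d sub-states forces a minimum
    duration of d frames (200 ms at a 10 ms frame rate), after which we
    either stay in the model or leave it. Hypothetical sketch."""
    A = np.zeros((d, d))
    for i in range(d - 1):
        A[i, i + 1] = 1.0        # d - 1 forced transitions down the chain
    A[d - 1, d - 1] = p_stay     # self-loop once the minimum is reached
    # the remaining 1 - p_stay mass exits to the other hyper-states;
    # in the Mealy version each nonzero entry also carries an arc GMM
    return A

print(min_duration_chain(4, 0.9))  # small toy block for display
```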
The results were on an LDC database, one hundred and eight conversations of approximately ten minutes each. We tried different models. For the Moore models, twenty-one and twenty-four full-covariance Gaussians gave the best results; above twenty-four the results dropped, so we don't show them here.
Then we tried different configurations of the Mealy HMM. On the left side we see the total number of Gaussians that we have in the whole HMM, and on the right side the diarization error rate. We can see, basically, that we have more GMMs to estimate, but we can achieve the same results as in the Moore HMM with about twenty percent fewer Gaussians overall.
Why? Because we are able to model the data on the transitions. We cannot be sure that speaker one speaking after speaker two has the same dynamics as, for example, someone who starts speaking after silence; maybe they speak differently, and we want to capture these transition effects, and we define them on the arcs.
So we can obtain the same results with fewer Gaussians, or a little bit better results when we use more Gaussians.
So, we presented the Mealy HMM, showed that it works similarly, and showed the relation between Mealy and Moore. We saw that we can do telephone diarization without any loss of performance when we use the Mealy model, and even get better performance with less complexity. We know that the HMM is not always used as a standalone diarization system; also, in bigger diarization systems, there is often a refinement at the end which is done by an HMM. We know that in i-vector based diarization, between phase one and phase two, there is an HMM that does re-segmentation; we can replace that Moore HMM by a Mealy HMM and maybe get some improvement in those systems.
So, this is the last thing I wanted to say. Thank you.

Yes, a question over there.
So, in speaker diarization we usually use GMMs, right, which is well known, and you are using an ergodic HMM. So can you comment on the advantage of using an ergodic approach compared to that? — In diarization we use not plain GMMs but, like in this system, an HMM. The HMM is ergodic because we basically assume that we can move from each speaker to each speaker, in an ergodic way. — And the question is about the state distributions now. — The state distributions so far are GMMs, and we replaced them also with GMMs, but on the arcs instead of on the states. They stay GMMs.
Okay, but you are not using the notion of, you know, the universal background model? — No, this we don't use, because we work with several companies, and when we tried to get data for a universal background model, they said that they have no data; the channels are changing very much, and maybe they can give us one or one and a half hours of data. I am not sure that we can build a very good UBM with one or even two hours of data. So we use a standalone model that does not rely on some background model. But if there is a background model, if we have the data, we can use an extended HMM, like a UBM-based i-vector system, and just encapsulate the GMMs as part of it; that's not a problem.
The next paper is on broadcast data, so it may have more details for you.