Hello, everyone.

Thanks for watching this video.

I'm a student from ... University.

Today, I will give you a brief introduction to our paper,

Optimal Mapping Loss: A Faster Loss for End-to-End Speaker Diarization.

Recently, neural network based approaches have become more and more popular for

the subtasks of speaker diarization,

such as voice activity detection,

speaker embedding extraction, and clustering.

However,

end-to-end speaker diarization still remains challenging,

partly due to the difficulty of loss design caused by the speaker label ambiguity problem.

The permutation invariant training loss, known as the PIT loss, can be a possible solution,

and it has been applied to the SA-EEND network.

But its time complexity increases factorially as the number of speakers increases.

In this paper,

we first investigate an improvement on the calculation, and then propose a novel optimal mapping loss,

which directly computes the best match between the output speaker sequence and the ground-truth speaker sequence via the Hungarian algorithm.

Our proposed loss significantly reduces the time cost to polynomial time,

while keeping the same performance as the PIT loss.

So what is speaker label ambiguity?

Given an audio with ground-truth labels,

say A, B, B, C, where A, B, and C are the speakers,

we naturally map all the speakers into integers

and get the encoded labels

1, 2, 2, 3.

For outputs like 1, 2, 2, 3 and 2, 1, 1, 3,

both should be correct from the view of speaker diarization.

However,

traditional loss functions like binary cross-entropy assign the first output a zero loss,

while assigning the second output a high loss.

This obviously does not meet our expectations.

The reason behind this is that speaker diarization only focuses on the relative differences of speaker identities.
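To make this concrete, here is a minimal PyTorch sketch of our own (an illustration, not code from the paper) showing how plain binary cross-entropy punishes a permuted but equivalent output:

```python
import torch
import torch.nn.functional as F

# Ground-truth speaker activities for 4 frames and 3 speakers,
# i.e. the label sequence 1, 2, 2, 3 with one column per speaker.
Y = torch.tensor([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])

# A confident output matching 2, 1, 1, 3: the same diarization
# with speakers 1 and 2 swapped.
Y_hat = 0.98 * Y[:, [1, 0, 2]] + 0.01

print(F.binary_cross_entropy(Y, Y).item())      # 0.0: exact match
print(F.binary_cross_entropy(Y_hat, Y).item())  # large, though equivalent
```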

Let's have a review of SA-EEND, the latest end-to-end speaker diarization system,

and see how it solves the problem.

SA-EEND is composed of stacked Transformer encoders, with the positional encoding removed.

To be specific,

given an audio recording's log-Mel filterbank features,

it directly generates the speaker posteriors from the model.

In addition,

the PIT loss is used to cope with the speaker label ambiguity problem.

Figure 2 shows an overview of SA-EEND for the two-speaker case.

Given a sequence of features

x_1, x_2, ..., x_T,

and the corresponding ground-truth labels as follows,

where T is the duration and N is the number of speakers in the audio,

the speaker label y_t indicates the joint activity of the N speakers at time t.

It should be noted that y_t is a multi-label binary vector, but not a one-hot vector.

For non-speech regions,

y_t is an all-zero vector.
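As a tiny illustration (our own, with hypothetical values), such label vectors look like this:

```python
import torch

# Hypothetical y_t vectors for N = 3 speakers: multi-label binary per frame.
y_overlap = torch.tensor([1., 1., 0.])  # speakers 1 and 2 speak at once
y_single  = torch.tensor([0., 0., 1.])  # only speaker 3 is active
y_silence = torch.tensor([0., 0., 0.])  # non-speech: all zeros, not one-hot
```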

The stacked Transformer encoder is employed as the network,

and the input is handled as follows.

First,

the raw features are transformed by a linear layer,

and then fed into multiple stacked Transformer encoder layers.

The output is then passed through a second linear layer and the sigmoid function,

generating the N speakers' posteriors at each moment.

We believe that you are familiar with the Transformer,

so we skip the detailed introduction.

for more details

please read the paper
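For reference, here is a minimal PyTorch sketch of the forward pass just described; the hyperparameters are our own placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SAEENDSketch(nn.Module):
    """A sketch of the SA-EEND forward pass; all sizes are illustrative."""
    def __init__(self, feat_dim=345, d_model=256, n_heads=4,
                 n_layers=4, n_speakers=2):
        super().__init__()
        self.linear_in = nn.Linear(feat_dim, d_model)
        # Positional encoding is deliberately omitted, as in SA-EEND.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.linear_out = nn.Linear(d_model, n_speakers)

    def forward(self, x):                          # x: (batch, T, feat_dim)
        h = self.linear_in(x)                      # first linear layer
        h = self.encoder(h)                        # stacked Transformer encoders
        return torch.sigmoid(self.linear_out(h))   # per-speaker posteriors
```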

The system outputs Ŷ and the ground-truth labels Y can be columnized

into the following forms,

where ŷ_n and y_n have the same shape of T,

which indicate the posteriors and the labels of speaker n over time, respectively.

The speaker label ambiguity problem means that,

no matter how you shuffle Y by column,

it is still a valid label description.

To cope with that problem,

the PIT loss considers all column-wise permutations of Y, and computes the binary cross-entropy loss between Ŷ and each permutation,

where φ(1), φ(2), ..., φ(N) denote the permuted indexes of 1, 2, ..., N,

and the minimum loss is returned for the best matched permutation.

In brief,

the PIT loss function can be written as follows,

where perm(N) denotes all possible permutations of 1 to N.

The PIT loss computation time complexity is

O(T × N × N!).
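A minimal sketch of this naive PIT computation, assuming (T, N)-shaped tensors (normalization details may differ from the paper), could look as follows:

```python
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_loss(y_hat, y):
    """Naive PIT: try every column permutation of the labels and keep
    the minimum BCE. y_hat, y: (T, N) tensors. Cost: O(T * N * N!)."""
    n = y.shape[1]
    losses = [F.binary_cross_entropy(y_hat, y[:, list(p)])
              for p in permutations(range(n))]
    return torch.stack(losses).min()
```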

The time cost of the PIT loss becomes expensive as the number of speakers increases.

To deal with that issue,

we come up with the first improvement, a faster PIT loss.

We found redundant computations in the process of the PIT loss computation.

To explain,

we reformulate the equation as follows.

Since both n and φ(n) range from 1 to N,

only N² pairs actually enter the computation of the BCE loss function.

However,

in the permutation process,

the function is called N × N! times,

so each pair is repeatedly computed.

Our proposed idea is simple.

We first compute the BCE losses of all N² pairs and store them in the loss matrix L.

Then, in the permutation process,

given a ŷ_n and y_φ(n) pair,

we just index and return the corresponding element.

The details are shown in Algorithm 1,

and the time complexity is

O(T × N²) + O(N × N!),

where O(T × N²) comes from the construction of the loss matrix L.

With this, we achieve a significant improvement on the calculation.
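A sketch of this faster variant, under the same assumptions as the previous snippet, could look like this:

```python
from itertools import permutations
import torch
import torch.nn.functional as F

def faster_pit_loss(y_hat, y):
    """Faster PIT: precompute the N x N pairwise BCE matrix once, so the
    permutation search only needs table lookups. O(T*N^2 + N*N!)."""
    n = y.shape[1]
    # loss_mat[i, j] = BCE between output speaker i and label speaker j
    loss_mat = torch.stack([
        torch.stack([F.binary_cross_entropy(y_hat[:, i], y[:, j])
                     for j in range(n)])
        for i in range(n)])
    # Each permutation now costs N lookups instead of N BCE computations.
    best = min(sum(loss_mat[i, p[i]] for i in range(n))
               for p in permutations(range(n)))
    return best / n
```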

However,

the computational cost still increases in factorial time

when N is large.

To deal with the problem,

we must remove the permutation process.

Since the relationship between ŷ_n and y_φ(n) is described by the BCE loss,

and each ŷ_n must be assigned one and only one optimal y_φ(n) in the final result,

as shown in Figure 4,

this is a typical assignment problem.

Thus, we propose the optimal mapping loss,

which employs the Hungarian algorithm to find the best matching between Ŷ and Y.

The Hungarian algorithm solves the assignment problem and finds the optimal matching in polynomial time.

In our case,

the cost matrix element (i, j) is defined as the cost of assigning ground-truth speaker j

to output speaker i.

The goal is to find the optimal assignment indexes,

so that the overall cost is the lowest.

The Hungarian method can be described as follows.

Its time complexity is

O(N³),

and our optimal mapping loss is shown in Algorithm 2.

In total,

the complexity is

O(T × N²) + O(N³).
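Here is a sketch of the optimal mapping loss using SciPy's linear_sum_assignment as the Hungarian solver; again, this is our illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def optimal_mapping_loss(y_hat, y):
    """Optimal mapping loss sketch: build the N x N cost matrix, then let
    the Hungarian algorithm choose the assignment. O(T*N^2) + O(N^3)."""
    n = y.shape[1]
    # cost[i, j] = BCE of assigning ground-truth speaker j to output speaker i
    cost = torch.stack([
        torch.stack([F.binary_cross_entropy(y_hat[:, i], y[:, j])
                     for j in range(n)])
        for i in range(n)])
    # The matching runs on a detached copy; gradients flow only through
    # the selected matrix entries.
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return cost[torch.from_numpy(rows), torch.from_numpy(cols)].mean()
```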

Now let's move on to the experimental results.

In the first experiment,

we randomly generate floating-point outputs

ranging from 0 to 1, and binary ground-truth labels.

In the settings,

the batch size is set to 128 and T is set to 500.

The number of speakers N ranges from 1 to 10.

For each N,

we repeat the process of data generation and loss computation

using different loss functions for one hundred times.

The average time costs are reported in Table 2.

The experiments are carried out on both CPU and GPU platforms.

For the CPU case,

we show the results for all the speaker numbers.

We can see that the optimal mapping loss is the fastest.

When N is larger than 4, its time costs are relatively stable.

In contrast,

the other two functions are too slow to use

when the number of speakers reaches ten.

In addition, we also repeated the SA-EEND experiments using different loss functions,

to see whether our proposed function is compatible with network training.

The results are shown in Table 3.

As expected,

the models trained with different loss functions result in the same diarization error rate.

Therefore,

all the loss functions are equally effective.

That's all.

Thank you.