Hello everyone, and thanks for watching this video. I'm [inaudible] from [inaudible] University. Here I will give a brief introduction to our paper, Optimal Mapping Loss.
Let's start with related works on end-to-end speaker diarization. Recently, neural network based approaches have become more and more popular for the subtasks of speaker diarization, such as voice activity detection, speaker embedding extraction, and clustering.
However, end-to-end speaker diarization still remains challenging, partly due to the difficulty of loss design, that is, the speaker label ambiguity problem. The permutation invariant training loss, or PIT loss, can be a possible solution, and it has been applied to the SA-EEND network. But its time complexity increases factorially as the number of speakers increases.
In this paper, we investigate improvements on the PIT calculation and further propose a novel optimal mapping loss, which directly computes the best match between the output speaker sequence and the ground-truth speaker sequence with the Hungarian algorithm. Our proposed loss significantly reduces the cost to polynomial time, while keeping the same performance as the PIT loss.
So what is speaker label ambiguity? Given an audio with the ground-truth label sequence A, B, B, C, where A, B, and C are the speakers, we naturally map all the speakers into integers and get the encoded labels 1, 2, 2, 3. For outputs like 1, 2, 2, 3 and 2, 1, 1, 3, both should be judged correct from the view of speaker diarization.
However, traditional loss functions like the binary cross-entropy loss assess the first output with a low loss, while assessing the second output with a high loss. This obviously does not meet our expectations. The reason behind this is that speaker diarization only focuses on the relative difference of speaker identities.
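To make this concrete, here is a minimal NumPy sketch with made-up toy values; it shows how a plain binary cross-entropy loss penalizes an output that differs from the ground truth only by a speaker permutation.

import numpy as np

def bce(p, y, eps=1e-7):
    # Element-wise binary cross-entropy, averaged over frames and speakers.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Ground-truth activities for 4 frames and 2 speakers (one column per speaker).
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

out_a = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])  # matches y
out_b = out_a[:, ::-1]                                              # columns swapped

print(bce(out_a, y))  # small loss (about 0.11)
print(bce(out_b, y))  # large loss (about 2.30), although only the speaker order differs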
Now let's have a review of the latest end-to-end speaker diarization system, SA-EEND, and see how it solves the problem. SA-EEND is built upon the encoder of the Transformer model, with the positional encoding removed. To be specific, given an input sequence of log-mel filterbank features, it directly generates the speaker posteriors with the model. In addition, the PIT loss is used to cope with the speaker label ambiguity problem.
Figure 2 shows an overview of SA-EEND for the two-speaker case. Given a sequence of features x_1, x_2, ..., x_T, SA-EEND encodes the ground-truth labels as a sequence y_1, y_2, ..., y_T, where T is the duration and N is the number of speakers in the audio. The speaker label y_t = [y_{t,1}, ..., y_{t,N}] indicates the joint activity of the N speakers at time t. It should be noted that y_t is a binary vector, but not a one-hot vector, so overlapped speech can be represented; for non-speech regions, y_t is an all-zero vector.
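As a small illustration, here is how such labels look in code, with invented activity values:

import numpy as np

N = 2  # number of speakers
Y = np.array([
    [1, 0],  # only speaker 1 active
    [1, 1],  # overlap: both speakers active (binary, not one-hot)
    [0, 1],  # only speaker 2 active
    [0, 0],  # non-speech: all-zero vector
], dtype=float)
print(Y.shape)  # (T, N) with T = 4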
The Transformer model is employed as the network, and the input is handled as follows. First, the raw features are transformed by a linear layer, and then fed into multiple stacked Transformer encoder layers. The output is then passed through a second linear layer and the sigmoid function, generating the estimated speaker posteriors at each time moment.
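This forward pass can be sketched in a few lines of PyTorch; the layer sizes below are illustrative choices, not the paper's configuration:

import torch
import torch.nn as nn

class TinyEEND(nn.Module):
    def __init__(self, feat_dim=345, d_model=256, n_layers=4, n_speakers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_speakers)

    def forward(self, x):          # x: (batch, T, feat_dim)
        h = self.proj(x)           # note: no positional encoding is added
        h = self.encoder(h)
        return torch.sigmoid(self.head(h))  # speaker posteriors in (0, 1)

posteriors = TinyEEND()(torch.randn(1, 500, 345))
print(posteriors.shape)  # (1, 500, 2)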
We believe that you are familiar with the Transformer, so let's skip its introduction. For more details, please read the paper.
The system outputs Ŷ and the ground-truth labels Y are organized by columns in the following formats, where ŷ_n and y_n have the same shape of T; they indicate the posteriors and labels of speaker n over time, respectively. The speaker label ambiguity problem implies that, no matter how you shuffle Y by columns, it is still an effective label description. To cope with that problem, the PIT loss considers all possible column-wise permutations of Y and computes the binary cross-entropy loss between Ŷ and each permutation, where a permutation φ is a reordering of the indexes 1, 2, ..., N. Then the minimal loss is returned, as well as the optimal permutation.
In brief, the PIT loss function can be written as L_PIT = min over φ in perm(1, ..., N) of (1 / (T × N)) × Σ_t BCE(ŷ_t, y_t^φ), where perm(1, ..., N) denotes all possible permutations of 1 to N. The time complexity of the PIT loss computation is O(T × N × N!), so its time cost becomes expensive as the number of speakers increases.
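As a reference point, here is a minimal PyTorch sketch of this naive PIT computation; the function name and tensor shapes are our own choices, not the paper's code:

from itertools import permutations
import torch
import torch.nn.functional as F

def pit_loss(pred, label):
    # pred, label: (T, N) posteriors and binary labels.
    N = label.shape[1]
    perms = list(permutations(range(N)))
    # One BCE evaluation per permutation of the label columns: N! calls.
    losses = torch.stack([F.binary_cross_entropy(pred, label[:, list(p)])
                          for p in perms])
    best = torch.argmin(losses)
    return losses[best], perms[best]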
To deal with that, we come up with our first improvement: the faster PIT loss. We notice redundant computations in the process of PIT computation. To explain, let us rewrite the equation as follows. Since both n and φ(n) range from 1 to N, only N² distinct pairs actually occur in the computation of the PIT loss function. However, in the permutation process, the BCE function is called N × N! times, so each pair is repeatedly computed.
Our proposed idea is simple. We first compute the BCE values of all N² pairs and store them in a loss matrix. Then, in the permutation process, given a (ŷ_n, y_φ(n)) pair, we just index and return the corresponding amount. The details are shown in Algorithm 1. The time complexity is O(T × N²) plus O(N × N!), where the first term is due to the construction of the loss matrix L.
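A minimal sketch of this idea, again with our own helper names, could look like this:

from itertools import permutations
import torch
import torch.nn.functional as F

def faster_pit_loss(pred, label):
    T, N = label.shape
    # L[i, j] = BCE(output column i, label column j), each pair computed once.
    L = torch.stack([
        torch.stack([F.binary_cross_entropy(pred[:, i], label[:, j])
                     for j in range(N)])
        for i in range(N)
    ])
    # The permutation search now only indexes and sums the cached values.
    perms = list(permutations(range(N)))
    totals = torch.stack([L[torch.arange(N), torch.tensor(p)].mean()
                          for p in perms])
    best = torch.argmin(totals)
    return totals[best], perms[best]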
With this, we have achieved a significant improvement on the calculation. However, the computational cost still increases in factorial time when N is large. To deal with that problem, we must remove the permutation process entirely.
Because the relationship between each ŷ_n and y_φ(n) is described by the BCE loss, and each ŷ_n must finally be assigned one and only one optimal y_φ(n), as shown in Figure 4, this is a typical job assignment problem. So we propose the optimal mapping loss, which employs the Hungarian algorithm to find the best matching between Ŷ and Y. The Hungarian algorithm is a classic combinatorial method that solves the assignment problem in polynomial time.
In our case, the cost matrix element (i, j) is defined as the BCE cost of assigning ground-truth speaker j to output speaker i. The goal is to find the optimal assignment indexes so that the overall cost is the lowest. The Hungarian algorithm can be described as follows; its time complexity is O(N³).
Our optimal mapping loss is shown in Algorithm 2. In total, the complexity is O(T × N²) plus O(N³).
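Here is a minimal sketch of this loss built on SciPy's Hungarian solver, linear_sum_assignment; the function name is ours. Gradients flow through the selected matrix entries, while the assignment itself is a discrete, non-differentiable step:

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def optimal_mapping_loss(pred, label):
    T, N = label.shape
    # Same N x N pairwise BCE loss matrix as in the faster PIT loss.
    L = torch.stack([
        torch.stack([F.binary_cross_entropy(pred[:, i], label[:, j])
                     for j in range(N)])
        for i in range(N)
    ])
    # Hungarian algorithm: best one-to-one assignment in O(N^3).
    rows, cols = linear_sum_assignment(L.detach().numpy())
    mapped = L[torch.as_tensor(rows), torch.as_tensor(cols)]
    return mapped.mean(), cols  # loss and the optimal speaker mapping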
Now let's move on to the experimental results.
In the first experiment, we randomly generate floating-point outputs ranging from 0 to 1 and random binary ground-truth labels. The shapes are batch size times T times N; the batch size is set to 128 and T is set to 500. The number of speakers N ranges from 2 to 10. For each N, we repeat the process of data generation and loss computation with the different loss functions 100 times, and the average time costs are reported in Table 2.
The experiments are carried out on both CPU and GPU platforms, and for each case we report the results over all the speaker numbers.
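A rough sketch of such a timing harness, timing one example at a time rather than the full batch, might look like this (helper names are ours):

import time
import torch

def time_loss(loss_fn, n_speakers, T=500, repeats=100):
    total = 0.0
    for _ in range(repeats):
        pred = torch.rand(T, n_speakers)                       # fake posteriors in [0, 1)
        label = torch.randint(0, 2, (T, n_speakers)).float()   # random binary labels
        start = time.perf_counter()
        loss_fn(pred, label)
        total += time.perf_counter() - start
    return total / repeats

# e.g. time_loss(optimal_mapping_loss, n_speakers=10)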
We see that the optimal mapping loss is the fastest, and when N is larger than 4 its time costs remain relatively stable. In contrast, the other two loss functions become too slow to use when the number of speakers reaches 10.
In addition, we also repeated the SA-EEND experiments with the different loss functions to see whether our proposed functions are compatible with network training. Results are shown in Table 3. As expected, the models trained with the different loss functions result in the same diarization error rate. Therefore, all the loss functions are equally effective.
That is all. Thank you.