Hi, thanks for watching this video. I am from Duke Kunshan University.
Here I will give a brief introduction to our paper, "DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team".
In this paper, we present the submitted system for the second DIHARD speech diarization challenge.
Our diarization system includes multiple modules, namely voice activity detection, speaker embedding extraction, similarity measurement, clustering, re-segmentation and overlap detection.
For each module, we explore different technologies to enhance the performance.
Our final submission employs the ResNet-LSTM based VAD, the deep ResNet based speaker embedding, the LSTM based similarity scoring and spectral clustering.
VB diarization is also applied in the re-segmentation stage, and overlap detection brings a slight further improvement.
Our proposed system achieves 18.84 percent DER for track 1 and 27.90 percent DER for track 2.
Although our systems have reduced the DERs by 27.5 percent and 31.7 percent relatively against the official baselines,
we believe that the diarization task is still very difficult.
Next, the data analysis. We carry out a metadata analysis on the development set to show how hard the competition is.
Several indicators are considered, including the duration of the audios, the number of speakers, the speech percentage and the overlap ratio.
The overlap ratio determines the minimum diarization error rate a system is able to achieve without handling overlapped speech. Roughly speaking, it measures the proportion of speech time where more than one speaker is active, where S_i denotes the speech regions of speaker i.
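As a concrete sketch, the following helper computes such an overlap ratio from reference speaker turns by counting, on a 10 ms grid, how much of the active speech time is covered by two or more speakers; the grid step and the (start, end) turn format standing in for the regions S_i are my own illustrative assumptions, not details from the paper.

def overlap_ratio(speaker_turns, step=0.01):
    """speaker_turns: dict speaker -> list of (start, end) speech regions S_i.
    Counts, on a coarse time grid, how much active speech has 2+ speakers."""
    t_max = max(end for turns in speaker_turns.values() for _, end in turns)
    speech = overlap = 0
    t = 0.0
    while t < t_max:
        active = sum(any(s <= t < e for s, e in turns)
                     for turns in speaker_turns.values())
        speech += active >= 1
        overlap += active >= 2
        t += step
    return overlap / max(speech, 1)

print(overlap_ratio({"A": [(0.0, 5.0)], "B": [(4.0, 8.0)]}))   # roughly 0.125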
In summary, the competition is challenging because, first, the audios are drawn from a diverse set of challenging domains; second, the number of speakers varies in a very large range; and third, high overlap errors account for a large part of the DER.
Here are the datasets employed in our experiments for training. Note that in VoxCeleb we combine the short utterances from the same speakers, making them suitable for speaker embedding training.
Most of the training audios are drawn from databases in the meeting and telephone domains. The meeting data consists of corpora such as ICSI, ISL and NIST, while the telephone data comes from the multilingual CALLHOME dataset, including Arabic, English, German, Japanese, Mandarin and Spanish.
They are used for training the voice activity detection, similarity measurement and overlap detection models. MUSAN and the RIR corpus are employed for data augmentation.
Let's start with voice activity detection.
WebRTC VAD is the official baseline system for track 2. It splits the audio stream into frames of 20 milliseconds duration, and for each input frame it generates a speech or non-speech decision. An optional setting of WebRTC is the aggressiveness mode; among the available modes, mode 3 is the most aggressive about filtering out non-speech.
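For reference, here is a minimal usage sketch of the open-source webrtcvad Python package, assuming 16 kHz, 16-bit mono PCM audio and 20 ms frames, with the aggressiveness mode set to 3 as mentioned above.

import webrtcvad

vad = webrtcvad.Vad(3)                      # aggressiveness mode 0-3; 3 is the most aggressive
sample_rate = 16000
frame_bytes = int(sample_rate * 0.02) * 2   # 20 ms of 16-bit samples

def frame_decisions(pcm: bytes):
    """Yield a speech/non-speech decision for every 20 ms frame of raw PCM."""
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield vad.is_speech(pcm[start:start + frame_bytes], sample_rate)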
We also propose a neural network based approach for the VAD task. The network, as shown in Figure 2, consists of a ResNet module, multiple bidirectional LSTM layers and linear layers. Our motivation is that the ResNet module generates representative feature maps for speech and non-speech, and the bidirectional LSTMs then capture the sequential information.
The input is a long sequence of frame-level features. Each frame in the sequence is fed into the ResNet, generating multi-channel feature maps. We apply average pooling on each channel and obtain a C-dimensional vector per frame. Next, the bidirectional LSTM layers capture the forward and backward sequential information. Finally, the outputs of the bidirectional LSTMs are passed to the linear layers with a sigmoid function, generating the speech posteriors.
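As an illustration only, here is a minimal PyTorch-style sketch of this kind of architecture; the layer sizes are arbitrary, and a small convolutional stack stands in for the actual ResNet front-end described in the paper.

import torch
import torch.nn as nn

class ConvBiLSTMVAD(nn.Module):
    """Simplified VAD sketch: conv front-end, per-channel average pooling,
    bidirectional LSTMs, then a linear layer with sigmoid for speech posteriors."""

    def __init__(self, channels=32, hidden=128):
        super().__init__()
        self.frontend = nn.Sequential(            # stand-in for the ResNet module
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.blstm = nn.LSTM(channels, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                     # feats: (batch, frames, feat_dim)
        fmap = self.frontend(feats.unsqueeze(1))  # (batch, C, frames, feat_dim)
        frame_vec = fmap.mean(dim=3).transpose(1, 2)   # average pool per channel -> (batch, frames, C)
        seq, _ = self.blstm(frame_vec)
        return torch.sigmoid(self.head(seq)).squeeze(-1)   # per-frame speech posteriors

posteriors = ConvBiLSTMVAD()(torch.randn(1, 200, 64))      # one utterance, 200 frames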
After voice activity detection, a sliding window of 1.5 seconds length and 0.75 seconds shift splits the speech into short segments.
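A tiny sketch of that segmentation step, using the window length and shift mentioned above; keeping the shorter tail segment at the end of each speech region is my own choice.

def sliding_segments(speech_regions, win=1.5, shift=0.75):
    """speech_regions: list of (start, end) intervals in seconds from the VAD."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + win <= end:
            segments.append((t, t + win))
            t += shift
        if t < end:                       # keep the remaining shorter tail
            segments.append((t, end))
    return segments

print(sliding_segments([(0.0, 4.0)]))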
Speaker embeddings are then extracted from these segments. Here we consider three models: the i-vector, the x-vector and the ResNet-vector.
For the i-vector extractor, we follow the Kaldi CALLHOME diarization v1 recipe, which uses telephone audios for system training. For the x-vector, we also follow the corresponding Kaldi recipe to train the model.
As for the ResNet-vector, it consists of three main components: a ResNet front-end, a two-dimensional statistics pooling layer and a feed-forward network. The feed-forward network includes two linear layers, with a dropout of 0.5 in between. Given a sequence of input features, the ResNet first converts them into multi-channel feature maps. Then the statistics pooling layer calculates the mean and the standard deviation for each channel, generating the utterance-level representation by concatenation. Last, the feed-forward network transforms the utterance-level feature representation to speaker posteriors. The embedding dimension is 128.
The training also adopts data augmentation, and the detailed parameters can be viewed in Table 3.
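A minimal sketch of such a two-dimensional statistics pooling layer: it computes the per-channel mean and standard deviation over the whole time-frequency map and concatenates them into an utterance-level vector. The shapes are illustrative; in the described model a feed-forward head would then map this vector to the 128-dimensional embedding and the speaker posteriors.

import torch
import torch.nn as nn

class StatsPool2d(nn.Module):
    """Mean and standard deviation over the time-frequency map, per channel."""
    def forward(self, fmap):                       # fmap: (batch, C, T, F)
        flat = fmap.flatten(2)                     # (batch, C, T*F)
        return torch.cat([flat.mean(dim=2), flat.std(dim=2)], dim=1)   # (batch, 2C)

pooled = StatsPool2d()(torch.randn(4, 128, 150, 8))    # -> (4, 256) utterance-level vectors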
Given the speaker embedding sequence x_1, x_2, ..., x_N, we compute the similarity score s_ij between any two speaker embeddings x_i and x_j, and obtain the similarity matrix S of size N by N.
The first method for the similarity measurement is PLDA. It can be expressed as follows, where one hypothesis assumes that the embeddings x_i and x_j are from different speakers, while the other one assumes that they are from the same speaker.
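In the usual formulation (my notation here, not necessarily the exact equation in the paper), the PLDA score is the log-likelihood ratio between these two hypotheses:

s_{ij} = \log \frac{p(x_i, x_j \mid H_s)}{p(x_i, x_j \mid H_d)}

where H_s is the same-speaker hypothesis and H_d is the different-speaker hypothesis.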
The PLDA model is trained on our training data and whitened with the DIHARD development set.
However, we must note that PLDA scores the speaker embeddings in a pairwise manner, which ignores the sequential information.
Therefore, we propose an LSTM based scoring model to capture the forward and backward messages. In comparison with PLDA, scores are calculated between a vector and a sequence rather than between a vector and a vector. Given the speaker embeddings x_1, x_2, ..., x_N, each embedding x_i is compared with the whole sequence: we feed the sequence into an LSTM network and generate the scores from the concatenated input vectors, as shown in Equation 7. The scoring network includes two bidirectional LSTM layers and two linear layers, and the output layer is one-dimensional, connected with a sigmoid function.
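To make the idea concrete, here is a hedged PyTorch-style sketch of such a scorer, where one anchor embedding is compared against the whole sequence and one row of the similarity matrix is produced; the concatenation scheme and the layer sizes are my assumptions about one reasonable realization, not the verified configuration.

import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Scores one anchor embedding against an embedding sequence."""

    def __init__(self, emb_dim=128, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(2 * emb_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, anchor, sequence):
        # anchor: (batch, D); sequence: (batch, N, D)
        tiled = anchor.unsqueeze(1).expand_as(sequence)
        pairs = torch.cat([tiled, sequence], dim=-1)        # concatenated input vectors
        out, _ = self.blstm(pairs)
        return torch.sigmoid(self.head(out)).squeeze(-1)    # (batch, N) similarity row

emb = torch.randn(1, 40, 128)                               # 40 segment embeddings
row_3 = LSTMScorer()(emb[:, 3], emb)                        # scores of x_3 against all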
In the clustering stage, two methods are adopted. The first method is agglomerative hierarchical clustering, which merges clusters iteratively in a bottom-up manner. Segments are initialized as individual clusters, and each time the two clusters with the highest score are merged, until the chosen stopping threshold is met.
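A plain sketch of that merging loop; average linkage between clusters is my assumption, and in practice the stopping threshold would be tuned on the development set.

import numpy as np

def ahc(scores, threshold):
    """Greedy agglomerative clustering on a similarity matrix; returns a label per segment."""
    clusters = [[i] for i in range(scores.shape[0])]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([scores[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:                # no pair scores above the threshold
            break
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(scores.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels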
The other method is spectral clustering, which works on a graph built from the score matrix. Given the similarity matrix S, we can consider s_ij as the weight of the edge between node i and node j in an undirected graph. By removing weak edges with small weights, spectral clustering divides the original graph into multiple subgraphs, where each subgraph is a cluster.
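A sketch of that graph partitioning using scikit-learn's SpectralClustering on a precomputed affinity matrix; the per-row pruning ratio and the assumption that the number of speakers is known (or estimated elsewhere) are illustrative choices rather than the paper's exact procedure.

import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_labels(scores, n_speakers, keep=0.3):
    """Prune weak edges of the similarity graph, then partition it into clusters."""
    affinity = np.clip(scores, 0.0, None)          # non-negative edge weights
    np.fill_diagonal(affinity, 0.0)
    for row in affinity:                           # keep only the strongest edges per node
        row[row < np.quantile(row, 1.0 - keep)] = 0.0
    affinity = np.maximum(affinity, affinity.T)    # re-symmetrize after pruning
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(affinity)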
The re-segmentation stage is applied to refine the clustering results at the frame level. GMM re-segmentation starts with constructing the speaker-specific GMMs for each speaker according to the clustering results. Then, for each frame in the audio, we assign it to the GMM with the highest posterior. The process is repeated until convergence.
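Here is a minimal sketch of that procedure: equal speaker priors, diagonal covariances, the component count and the iteration limit are my illustrative choices, and each speaker is assumed to have enough frames to fit its GMM.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_resegmentation(frames, labels, n_iter=5, n_components=8):
    """Refit one GMM per speaker and reassign each frame to the most likely speaker."""
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        speakers = np.unique(labels)
        gmms = [GaussianMixture(n_components, covariance_type="diag",
                                random_state=0).fit(frames[labels == s])
                for s in speakers]
        loglik = np.stack([g.score_samples(frames) for g in gmms], axis=1)
        new_labels = speakers[np.argmax(loglik, axis=1)]
        if np.array_equal(new_labels, labels):      # converged
            break
        labels = new_labels
    return labels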
The other re-segmentation method is VB diarization. It also constructs a GMM model for each speaker, but with eigenvoice priors. In this case, all speaker-specific GMMs share the same component weights and covariance matrices, and the mean vectors are projected from a total variability subspace. With such priors, VB diarization brings better re-segmentation performance.
The last module we consider is overlap detection. The model structure, data and training configuration are all the same as those in the ResNet-LSTM based voice activity detection system, except that we change the labels from speech/non-speech to overlap/non-overlap. At test time, if a segment is inferred as overlapped speech, we extend its boundary by 20 frames and take the two speakers appearing in the extended segment as the labels of the original segment.
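A small sketch of that relabeling heuristic; the 10 ms frame duration, the symmetric extension on both sides and the (speaker, start, end) turn format are assumptions for illustration.

def assign_overlap_speakers(segment, turns, extend_frames=20, frame_dur=0.01):
    """Label a detected overlap segment with the speakers found in its extended neighborhood."""
    start, end = segment
    pad = extend_frames * frame_dur
    lo, hi = start - pad, end + pad
    speakers = []
    for spk, s, e in turns:                     # turns: diarization output (speaker, start, end)
        if s < hi and e > lo and spk not in speakers:
            speakers.append(spk)
    return speakers[:2]                         # keep at most two speaker labels

print(assign_overlap_speakers((10.0, 10.5),
                              [("A", 8.0, 10.2), ("B", 10.1, 12.0), ("C", 20.0, 25.0)]))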
Now let's move to the experimental results. We first evaluate the voice activity detection performance with an independent evaluation of the proposed ResNet-LSTM based VAD. The metric used here is the accuracy rate, and the results are shown in Table 4.
Without model adaptation, our proposed model is only slightly better than the official baseline. However, if we fine-tune the model on the DIHARD development set, the accuracy is increased to 91.4 percent on the eval set. Recall that our training data is drawn from the meeting and telephone domains, while the DIHARD database covers eleven domains. The domain mismatch hurts the performance, while model adaptation brings a significant improvement.
In Table 5, we compare different combinations of the speaker embedding, similarity scoring and clustering methods on track 1. It is observed that the deep ResNet embedding outperforms the i-vector extractor in all combinations.
Besides, the LSTM based scoring followed by spectral clustering achieves a better DER in comparison to PLDA with AHC. Our best single system is system 6, which achieves a DER of 20.87 percent.
When we fuse the candidate systems by averaging their score matrices, the DER is further reduced to around 20 percent.
Re-segmentation is then carried out on the best single system and the fusion system, and the results are shown in Table 6.
In our expectation, the VB algorithm should outperform the GMM based one, and re-segmentation should bring a similar improvement to both systems. To our surprise, for the fusion system the diarization predictions after re-segmentation do not become more accurate. The most obvious improvement is achieved by system 6 with VB diarization, which decreases the DER by 1.65 percent absolutely.
The last module in our diarization system is overlap detection. Since a considerable amount of overlapped speech is present on the development set, it is reasonable to assume that around ten percent of the speech time on the eval set is also overlapped.
Experiments are carried out on system 6 with VB diarization, and the results are shown in Table 7.
However, handling the overlapped speech only slightly improves the result, by less than one percent on track 1 and 0.69 percent on track 2. It remains very challenging, because we recover less than ten percent of the overlapped speech.
Last, to understand how our system performs with respect to each domain, we compute the DERs on the development set with system 6 for each domain. The results are shown in Figure 3.
The system performs worst on the restaurant, web video and meeting domains, and that is mainly due to high overlap errors.
The child domain, despite its low overlap error rate, also reaches a high DER.
This is probably because the audios are drawn from the SEEDLingS corpus and involve very young children, which is a mismatch with the speaker composition of our training database.
As a result, system 6 performs poorly on these four challenging domains.
Thank you for watching.