Good morning. Our next speaker, from Ghent University, has recently been working on soft voice activity detection in factor analysis based speaker segmentation of broadcast news.
So, this work has been done in the context of the STON project. The VRT is the public broadcaster of Flanders, the Dutch-speaking region of Belgium.
The idea is to use speech technology to speed up the process of creating subtitles for TV shows. Another use case is journalists who need a fast track to put a report online with subtitles: they can use the speech technology to generate the subtitles. The quality may be a bit lower, but for online use the speed is more important than the quality of the subtitles.
So the point is that subtitling is a very time-consuming manual process, and we want to use speech technology to speed it up.
In this presentation we will focus on the diarization. Why do we want to solve this "who spoke when" problem? First of all, we want to add colours to the subtitles. If we want to generate subtitles it can also be useful to use speaker-adapted models: once we have speaker labels, we can use those adapted models. Another thing is that detected speaker changes can provide extra information to the language model of the speech recognizer, for instance on where sentences begin, so this can also help recognition.
At Interspeech we will have a show-and-tell session in which we will demonstrate the complete subtitling platform. It will show how a file can be uploaded to start the whole chain of speech/non-speech segmentation, speaker diarization, language detection and speech recognition. But that is not the final step: we then still have to produce short sentences so that they can be displayed on the screen.
Okay, so what does the concept look like? We get the audio signal, and the first step is the speech/non-speech segmentation: we have to remove laughter and we have to remove music. Once we have detected the speech segments we can start the speaker diarization. This includes detecting the speaker change points and finding homogeneous segments, and once we have found those segments we can cluster them to assign a speaker label to each of them. Then we make the hypothesis that each speaker only uses one language, and because in Flanders we are interested in Flemish, we only keep the Flemish segments. Then we do the speech recognition, and the output of the speech recognizer needs some post-processing to make the sentences short enough to display them on the screen.
Here we will focus on more accurate speaker segmentation, because segments that are too short cannot provide enough data for reliable speaker models. In the kind of files we use we sometimes have fifty speakers in one audio file, so the longer the homogeneous speaker segments are, the more reliable the clustering will be. Obviously, if we fail to detect a speaker change, this results in non-homogeneous segments and that leads to error propagation during the clustering process. And if we make the segments too short, clustering becomes a lot slower, because we have to compute a lot more distances between segments.
Okay, we propose a two-pass system. In the first pass the speech segments are generated by the speech/non-speech segmentation. Then we do a first speaker segmentation with a standard eigenvoice approach; we call these generic eigenvoices because they are trained to model every speaker that could appear. Once we have detected the speaker segments we can do standard speaker clustering. The output of this clustering, the speaker clusters, is then used to retrain our eigenvoice model: now that we know which speakers are active in the broadcast news file, we retrain eigenvoices that match those speakers. We also have the speech segments, so we can retrain our universal background model as well. Then we go to the second pass: we again start from the baseline speech segments and do the speaker segmentation again, but now with the specific eigenvoices matching the speakers inside the audio file. We then do the speaker clustering again, and hopefully we retrieve better speaker clusters than in the first pass.
Okay, the first step of our speaker segmentation is boundary generation, that is, the generation of candidate speaker change points. We use a sliding window approach with two comparison windows, a left window and a right window, and there are two hypotheses: either we have the same speaker in the two windows, or we have different speakers. We use a measure that looks for maximal dissimilarity between the distributions of the acoustic features; if there is a peak in the dissimilarity, this indicates a speaker change.
Also, the speech/non-speech segmentation does not eliminate short pauses: it is tuned to detect only laughter and music segments longer than one second, so there can be a short pause between speakers. If we were to use adjacent comparison windows, this would generate several maxima around a speaker change: we argue that maxima can appear both at the beginning and at the end of the pause, because at those points the dissimilarity between the acoustic features in the two windows is maximal. Instead, we propose to use overlapping comparison windows. If you look at the regions covered by the pause, the red regions on the slide, these make the comparison windows more similar; but when the overlapping region between both comparison windows matches the pause, the dissimilarity between the windows is maximal, and the speaker change is inserted at the middle of the pause, which is exactly what we want. It is just the more logical thing to do.
So when we apply this to our sliding window approach, we simply use two overlapping sliding windows, a left window and a right window, as sketched below.
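To make this windowing concrete, here is a minimal sketch of the scanning loop, written under my own assumptions: the window and overlap sizes are placeholders, and dissim stands for whatever dissimilarity measure is plugged in (later in the talk this becomes a Mahalanobis-based distance on per-frame speaker factors).

    def sliding_change_scores(features, dissim, win=100, overlap=20):
        """Slide two overlapping comparison windows over per-frame features.

        features : (T, D) per-frame features (e.g. per-frame speaker factors)
        dissim   : function (left_window, right_window) -> dissimilarity score
        win      : comparison window length in frames (e.g. about 1 s)
        overlap  : number of frames shared by the left and right windows
        Returns one score per window position; local maxima of this curve
        are the candidate speaker change points.
        """
        scores = []
        for t in range(win, len(features) - win + overlap):
            left = features[t - win:t]                         # left comparison window
            right = features[t - overlap:t - overlap + win]    # starts inside the left one
            scores.append(dissim(left, right))
        return scores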
Okay, for each comparison window we want to extract speaker-specific information, and we do this with factor analysis. Because of the sliding window approach we use very low-dimensional models, as we have to extract the speaker factors for every frame position. We use a GMM-UBM speech model with thirty-two components and a low-dimensional speaker variability or eigenvoice matrix with only twenty eigenvoices. So we take a window of one second, slide it across the file frame by frame, and extract the twenty speaker factors at each position. I should mention that as training data for these models we use English broadcast news data.
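As a rough illustration of what this extraction computes, here is a sketch of the standard eigenvoice point estimate of the speaker factors from the Baum-Welch statistics of one window. It is my own minimal NumPy version, not the implementation behind the talk; the UBM parameters and the eigenvoice matrix V are assumed to be trained beforehand.

    import numpy as np

    def speaker_factors(frames, ubm_weights, ubm_means, ubm_vars, V):
        """Point estimate of the speaker factors y for one extraction window.

        frames      : (T, D) acoustic features of the one-second window
        ubm_weights : (C,)   GMM-UBM mixture weights (C = 32 in the talk)
        ubm_means   : (C, D) GMM-UBM means
        ubm_vars    : (C, D) GMM-UBM diagonal covariances
        V           : (C*D, R) eigenvoice matrix (R = 20 in the talk)
        """
        C, D = ubm_means.shape
        # frame-level occupation probabilities (responsibilities) under the UBM
        log_g = np.stack([
            -0.5 * np.sum((frames - ubm_means[c]) ** 2 / ubm_vars[c]
                          + np.log(2 * np.pi * ubm_vars[c]), axis=1)
            + np.log(ubm_weights[c])
            for c in range(C)], axis=1)                          # (T, C)
        gamma = np.exp(log_g - log_g.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)
        # zeroth-order and centered first-order Baum-Welch statistics
        N = gamma.sum(axis=0)                                    # (C,)
        F = gamma.T @ frames - N[:, None] * ubm_means            # (C, D)
        # y = (I + V^T Sigma^-1 N V)^-1  V^T Sigma^-1 f
        inv_var = (1.0 / ubm_vars).reshape(-1)                   # (C*D,)
        N_rep = np.repeat(N, D)                                  # N expanded per feature dimension
        A = np.eye(V.shape[1]) + (V * (N_rep * inv_var)[:, None]).T @ V
        b = V.T @ (inv_var * F.reshape(-1))
        return np.linalg.solve(A, b)                             # (R,) speaker factors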
Okay, now that we have the speaker factors per frame, we look for significant local changes between the speaker factors, because these indicate a speaker change. We use an extraction window of one second, so it is quite obvious that the phonetic content of this one-second window has a huge impact on the speaker factors.
So we propose to estimate this phonetic variability, the intra-speaker variability, on the test data itself. Given the speaker factor extraction, if we look at the segment to the left and make the hypothesis that it comes from a single speaker, we can use a Gaussian model to estimate the phonetic or intra-speaker variability of that left signal. Similarly, under the hypothesis of the right speaker, we estimate the phonetic variability of the right signal. We then want to find changes in the speaker factors that are not explained by this phonetic variability; we want to look for changes that occurred because of a real speaker change.
If we use a Mahalanobis-based distance, we can look for changes that lie in directions other than those caused by the phonetic variability. So we propose a Mahalanobis-based distance with two components: one under the hypothesis of the left speaker, looking for changes in the speaker factors that are not explained by the phonetic variability of the left speaker, and a second component looking for changes that are not explained by the phonetic variability of the right speaker.
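A minimal sketch of such a two-component distance, under my own assumptions (the per-frame speaker factors of the two comparison windows are given, and a small regularization term keeps the within-window covariances invertible):

    import numpy as np

    def mahalanobis_change_score(left_factors, right_factors, reg=1e-3):
        """Two-component Mahalanobis-based distance between comparison windows.

        left_factors, right_factors : (T, R) per-frame speaker factors of the
        left and right comparison windows. Each within-window covariance is an
        estimate of the phonetic (intra-speaker) variability under the
        hypothesis that the window contains a single speaker.
        """
        mu_l, mu_r = left_factors.mean(axis=0), right_factors.mean(axis=0)
        d = mu_r - mu_l
        C_l = np.cov(left_factors, rowvar=False) + reg * np.eye(d.size)
        C_r = np.cov(right_factors, rowvar=False) + reg * np.eye(d.size)
        # changes not explained by the left-speaker variability ...
        score_l = np.sqrt(d @ np.linalg.solve(C_l, d))
        # ... plus changes not explained by the right-speaker variability
        score_r = np.sqrt(d @ np.linalg.solve(C_r, d))
        return score_l + score_r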
Okay, here we have a speech segment, and this plot shows our distance metric; I have also included the Euclidean distance for comparison with the Mahalanobis distance. We have the distance measure, but for the actual change points we need a peak selection algorithm: we smooth the distance measure, select a number of maxima according to the length of the speech segment, and enforce a minimum duration of one second per speaker turn. The red lines indicate the detected changes and the black lines are the real speaker turns. We see that the Mahalanobis distance emphasizes the real speaker changes and successfully detects the two speaker turns, unlike the Euclidean distance.
Okay, once we have our candidate speaker change points, we can do some clustering of adjacent segments to eliminate false alarms. Again, this fits the two-pass system. In the first pass we use ΔBIC clustering of the adjacent speaker turns to check whether there is enough acoustic similarity between the segments; if they are quite similar, we simply eliminate the boundary. In the second pass we have the specific eigenvoice model that matches the speakers in the file, so we can extract speaker factors per homogeneous segment and use the cosine distance to compare them. If they are similar we eliminate the candidate change point; if they are dissimilar it is kept as a speaker change point. A threshold on both criteria controls the number of eliminated boundaries.
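As an illustration of the cosine-distance variant, here is a minimal sketch under my own assumptions (one speaker-factor vector per segment is given; in a full implementation one would merge segments after eliminating a boundary and re-extract the factors before testing the next one). Sweeping the threshold is what traces the precision-recall curves shown later.

    import numpy as np

    def eliminate_boundaries(segment_factors, threshold=0.5):
        """Drop candidate change points whose adjacent segments look alike.

        segment_factors : list of speaker-factor vectors, one per homogeneous
                          segment, in temporal order.
        threshold       : cosine-distance threshold; a higher value eliminates
                          more boundaries.
        Returns the indices of the boundaries that are kept.
        """
        kept = []
        for i in range(len(segment_factors) - 1):
            a, b = segment_factors[i], segment_factors[i + 1]
            cos_dist = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if cos_dist >= threshold:      # dissimilar -> keep the change point
                kept.append(i)
        return kept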
Okay, we test this on the COST278 broadcast news test set, which is a set with twelve languages. We used one language as development data to tune our parameters, and the eleven remaining languages were used as test data. This amounts to thirty hours of data and four thousand four hundred speaker turns. For the evaluation we make a mapping between the estimated change points and the real speaker change points with a margin of five hundred milliseconds, and we compute the precision and recall based on this mapping. The recall is the percentage of real boundaries that are mapped to computed ones, and the precision is the percentage of computed boundaries that are actually mapped to real ones.
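To make the metric concrete, here is a simple sketch of such a margin-based matching; the greedy one-to-one mapping below is my own choice and the exact mapping used in the evaluation may differ.

    def boundary_precision_recall(hyp, ref, margin=0.5):
        """Precision/recall of detected change points within a time margin.

        hyp, ref : lists of boundary times in seconds (hypothesis / reference).
        margin   : maximum allowed distance for a match (0.5 s in the talk).
        Each reference boundary is matched to at most one hypothesis boundary
        and vice versa.
        """
        used = [False] * len(hyp)
        matched = 0
        for r in ref:
            best, best_d = None, margin
            for i, h in enumerate(hyp):
                if not used[i] and abs(h - r) <= best_d:
                    best, best_d = i, abs(h - r)
            if best is not None:
                used[best] = True
                matched += 1
        precision = matched / len(hyp) if hyp else 0.0
        recall = matched / len(ref) if ref else 0.0
        return precision, recall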
We compare this speaker change detection with the ΔBIC baseline, and we can see that for a low precision we reach a maximum recall of nineteen point six percent, a bit higher than that of the ΔBIC baseline. Once we have these precision-recall curves, we can select an operating point via the threshold of the boundary elimination algorithm, and we use this operating point to start the speaker clustering.
Okay, now some more details about our two-pass adaptive speaker segmentation system. In the first pass we obtain speaker turns, and clusters are generated by clustering those speaker turns. We then retrain the UBM and the eigenvoice model on the speech and the speaker clusters of the test file, we repeat the boundary generation, and we eliminate boundaries with the cosine distance instead of the ΔBIC elimination. Here the yellow line indicates our system, and we can see that the cosine-distance boundary elimination now outperforms the BIC elimination that we used in the first pass. So now we can choose an operating point on the output of the second pass.
Okay, now for what we propose here: if we extract speaker factors for each comparison window as described, this does not differentiate between the speech and the non-speech frames in the test file. The idea is to give the speech frames in the windows more weight during the speaker factor extraction. So we integrate a GMM-based soft voice activity detection: we estimate a speech UBM and a non-speech UBM, and we use a softmax to convert the log-likelihoods of the speech UBM and the non-speech UBM into speech posteriors per frame. We then weight the Baum-Welch statistics that are used in the speaker factor extraction with these speech posteriors.
It is also important to note that we use the speech UBM to estimate the occupation probabilities of each frame. Because the speech posteriors are also used in the second pass of the system, we do not only retrain the speech UBM, we also retrain the non-speech UBM on the test file: we have non-speech segments with music and applause, and we also use the low-energy frames inside the speech segments to retrain the non-speech UBM.
Also during the boundary elimination, to reduce the false positives, we use the soft voice activity detection when extracting the speaker factors and then apply the cosine-distance boundary elimination.
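Here is a sketch of how this soft weighting could be wired into the statistics from the earlier extraction sketch. It is my own illustration under the stated assumptions: diagonal-covariance UBMs, and a per-frame speech posterior obtained as a softmax (equivalently, a sigmoid of the log-likelihood ratio) over the two UBM log-likelihoods.

    import numpy as np
    from scipy.special import logsumexp

    def gmm_loglik(frames, weights, means, variances):
        """Log-likelihood of each frame under a diagonal-covariance GMM."""
        comp = np.stack([
            -0.5 * np.sum((frames - means[c]) ** 2 / variances[c]
                          + np.log(2 * np.pi * variances[c]), axis=1)
            + np.log(weights[c])
            for c in range(len(weights))], axis=1)
        return logsumexp(comp, axis=1)                       # (T,)

    def speech_posteriors(frames, speech_ubm, nonspeech_ubm):
        """Per-frame speech posteriors from a softmax over two UBM log-likelihoods."""
        ll_s = gmm_loglik(frames, *speech_ubm)               # speech UBM
        ll_n = gmm_loglik(frames, *nonspeech_ubm)            # non-speech UBM
        return np.exp(ll_s - np.logaddexp(ll_s, ll_n))       # (T,)

    def soft_vad_stats(frames, gamma, ubm_means, post):
        """Baum-Welch statistics weighted by the per-frame speech posteriors.

        gamma : (T, C) occupation probabilities from the speech UBM
        post  : (T,)   speech posteriors from the soft VAD
        """
        g = gamma * post[:, None]                            # down-weight non-speech frames
        N = g.sum(axis=0)                                    # weighted zeroth-order stats
        F = g.T @ frames - N[:, None] * ubm_means            # weighted first-order stats
        return N, F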
Okay, here we again plot the ΔBIC baseline, and this is our speaker factor extraction without the soft voice activity detection. We see that if we do not use the two-pass system, the soft voice activity detection does not really improve the results. But if we do use the two-pass system, with the cosine distance for the boundary elimination, we see that we can further improve the results. So the soft voice activity detection is really useful when it is combined with the two-pass system.
Once we have this best precision-recall graph, we choose an operating point to start the clustering. The clustering is an agglomerative clustering: first we do traditional BIC clustering across the whole file, and this is quite important to get enough data for the i-vector PLDA clustering in the second stage.
The idea is to extract an i-vector for each cluster we get from the BIC clustering, and then use PLDA to test the hypothesis whether two clusters contain the same speaker or different speakers. If the PLDA indicates that it is the same speaker, we merge the most likely cluster pair. For the merged cluster we again extract an i-vector, by summing up the sufficient statistics and extracting a new i-vector, and we test the hypothesis again with the PLDA. We iterate this whole clustering process until the PLDA outputs a low probability of the same speaker.
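A schematic of that merge loop, as a minimal sketch with my own placeholder names: extract_ivector and plda_llr stand in for the trained i-vector extractor and PLDA scorer, and the stopping threshold is illustrative.

    def agglomerative_plda_clustering(stats, extract_ivector, plda_llr, stop_llr=0.0):
        """Merge clusters as long as the PLDA favours the same-speaker hypothesis.

        stats           : list of per-cluster sufficient statistics (N, F) tuples
        extract_ivector : function (N, F) -> i-vector
        plda_llr        : function (iv1, iv2) -> same/different-speaker log-likelihood ratio
        stop_llr        : stop merging once the best pair falls below this value
        Returns a list mapping each initial cluster to a final cluster label.
        """
        labels = list(range(len(stats)))                  # current label per initial cluster
        active = {i: stats[i] for i in labels}
        ivecs = {i: extract_ivector(*s) for i, s in active.items()}
        while len(active) > 1:
            # most likely same-speaker pair among the remaining clusters
            best_llr, a, b = max((plda_llr(ivecs[p], ivecs[q]), p, q)
                                 for p in active for q in active if p < q)
            if best_llr < stop_llr:                       # PLDA no longer says "same speaker"
                break
            # merge b into a: sum the sufficient statistics, re-extract the i-vector
            Na, Fa = active[a]
            Nb, Fb = active[b]
            active[a] = (Na + Nb, Fa + Fb)
            ivecs[a] = extract_ivector(*active[a])
            del active[b], ivecs[b]
            labels = [a if l == b else l for l in labels]
        return labels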
Okay, what are the results after clustering? Again we use the COST278 broadcast news data set. We evaluate the diarization error rate, which is the percentage of frames that are attributed to the wrong speaker after a mapping between the clusters and the real speakers.
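As a reference point for what this metric measures, a simplified frame-level version is sketched below; it ignores missed and false-alarm speech and collars, which a full DER scoring tool would also account for.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def diarization_error_rate(ref_labels, hyp_labels):
        """Fraction of frames attributed to the wrong speaker after an optimal
        one-to-one mapping between hypothesis clusters and reference speakers.

        ref_labels, hyp_labels : equal-length sequences of per-frame labels.
        """
        ref_ids = {r: i for i, r in enumerate(sorted(set(ref_labels)))}
        hyp_ids = {h: i for i, h in enumerate(sorted(set(hyp_labels)))}
        overlap = np.zeros((len(ref_ids), len(hyp_ids)), dtype=int)
        for r, h in zip(ref_labels, hyp_labels):
            overlap[ref_ids[r], hyp_ids[h]] += 1
        # the optimal mapping maximizes the number of correctly attributed frames
        rows, cols = linear_sum_assignment(-overlap)
        correct = overlap[rows, cols].sum()
        return 1.0 - correct / len(ref_labels)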
With the popular ΔBIC segmentation we get a diarization error rate of ten point one percent, and we see that the detected boundaries are not that accurate when we apply the margin of five hundred milliseconds. If we look for local changes between the speaker factors instead, we see a slight improvement in the diarization error rate, but the big change is clearly in the accuracy of the boundaries: the speaker factor approach is much more accurate at detecting the boundaries.
The same holds when we use the two-pass system: we see a slight improvement in the precision and the recall. But if we use the two-pass system with the soft voice activity detection, the boundaries apparently get better, and besides that we get a ten percent relative improvement in the diarization error rate, with a boundary precision of eighty-one percent and a recall of eighty-five percent, which is clearly better than the popular standard BIC segmentation. I also want to note that it is popular to use Viterbi re-segmentation to find more accurate boundaries after the clustering, but applied to our speaker factor approach this actually deteriorated the results.
Thank you very much.
Q: [partly inaudible] The two-pass adaptation seems to work quite well for speaker diarization, but could you model the speaker factors directly instead of using a distance measure?
A: These first results are on the speaker factors. I did try to put Gaussian models on the speaker factors, but that did not give the same results; using a distance measure actually gave better results than trying to fit Gaussian models on the speaker factors.
Q: [largely inaudible question about detecting speech/non-speech boundaries with this approach]
A: One thing about this approach is that the number of speaker changes we hypothesize depends on the length of the speech segments, so we use that to limit how many speaker changes can occur inside a speech segment; you would have to find a solution for that. But I do think it is possible to use this i-vector approach to find boundaries between speech and non-speech segments. It would probably even generate more accurate boundaries than the HMM system that I use now, but that is a hypothesis that I should test.
Q: [partly inaudible] So you use a GMM-based voice activity detection; how does the HMM-based speech/non-speech segmentation work?
A: The HMM system is also a two-pass system. We have two models for non-speech, a music model and a background noise model, and for speech we have different models as well: clean speech, speech with background noise, and speech with music. We go through the file once, estimate posteriors and adapt the models, and then we go through it a second time.
Q: To what extent are your error rates affected by overlapping speech, where two speakers talk at the same time? What proportion of the data has two speakers in one region?
A: You are talking about overlapping speech? In this data set we do not have annotations of overlapping speech, so I cannot comment on how this impacts the results.
Q: But what would happen in your clustering if you have two speakers speaking in the same region?
A: In most cases each of these regions would be detected as a separate cluster, I think; when I look at the files manually, the overlap can show up as a separate cluster. I think it also happens that the overlapping speech is assigned to one of the two speakers, but I did notice that sometimes the system makes a separate cluster for it. [inaudible]
Q: Okay, this method is intended for TV subtitling; how would this method work for online, real-time diarization?
A: You are talking about the second pass of the system? It is not an online system: the idea is that the journalist uploads the file, starts the process and comes back an hour later, for example. So the first goal was not to make an online system. There might be techniques to make it online, but I would have to think about that.
Q: In this diarization system, how do you handle the number of speakers? How many speakers were there in reality, and how many speakers were estimated?
A: Okay, if we combine the BIC clustering and then the i-vector PLDA clustering, that ratio is very close to one. But I have to note that if you do not use the initial BIC clustering, the i-vector PLDA system still reaches a low diarization error rate, but the ratio between clusters and speakers is quite off, by about a factor of two. So in this system it is quite important to do the initial BIC clustering to keep the ratio close to one; the diarization error rate, however, does not suffer that much when just using the i-vector PLDA clustering.
Alright. If there are no further questions, I would like to thank the speaker, and all the speakers once again.