Okay.
Thank you. My name is Abraham. I'll present the work we have carried out on extracting i-vectors from short- and long-term speech features for speaker clustering.
This is joint work with Jordi Luque and Javier Hernando.
So the outline of the presentation is as follows. We will describe the objectives of our research and the main long-term features that are used in our experiments. We will also present the baseline and the proposed speaker diarization architectures, and then describe the fusion techniques that are carried out in the speaker segmentation and speaker clustering. Finally, the experimental setup and conclusions will be presented.
So, first of all, speaker diarization consists of two main tasks: speaker segmentation and speaker clustering. In speaker segmentation, a given audio stream is split into homogeneous segments, and in speaker clustering, the segments that belong to a given speaker are grouped together.
So the main motivation is that, in our previous work, we have shown that the use of jitter and shimmer and prosodic features improves the performance of GMM-based speaker diarization systems. Based on this, we have proposed the extraction of i-vectors from these voice quality and prosodic features, and then fuse their cosine distance scores with those of the MFCCs for the speaker clustering task.
So here, in the feature selection, we select different sets of features from the voice quality and prosodic ones. From the voice quality ones we extract features called absolute jitter, absolute shimmer and shimmer APQ3, and from the prosodic ones we extract the pitch, the intensity and the first four formant frequencies. Once these features are extracted, they are stacked in the same feature vectors. Then we extract two different sets of i-vectors: the first one from the MFCCs and the second one from the long-term features. The cosine similarity of these two i-vectors is then used for the speaker clustering task.
So these are the main speech features used in our experiments: the MFCCs, the voice quality ones, that is jitter and shimmer, and the prosodic ones. From the voice quality measurements we have selected three, based on previous studies. These are the absolute jitter, which measures the variation between two consecutive periods; the absolute shimmer, which measures the variation of the amplitude between consecutive periods; and the shimmer APQ3 which, similarly to shimmer, takes into consideration three consecutive periods.
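A minimal sketch of how these three measurements could be computed from per-cycle period and amplitude sequences; the exact estimators used in our system may differ:

```python
import numpy as np

def absolute_jitter(periods):
    """Mean absolute difference between consecutive pitch periods (seconds)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))

def absolute_shimmer(amplitudes):
    """Mean absolute difference between peak amplitudes of consecutive periods."""
    amps = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amps)))

def shimmer_apq3(amplitudes):
    """Three-point amplitude perturbation quotient: each amplitude is compared
    with the average of itself and its two neighbours, normalized by the mean."""
    a = np.asarray(amplitudes, dtype=float)
    local_avg = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return np.mean(np.abs(a[1:-1] - local_avg)) / np.mean(a)
```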
So from prosody, we have extracted the pitch, the intensity and the formant frequencies.
So when it comes to the speaker diarization architecture, first I'll describe the baseline system. Given a speech signal, we first take the speech/non-speech references from the oracle. The main reason we are using the oracle references is that we are mainly interested in the speaker errors, rather than the speech activity detection errors. Then we extract the MFCCs, the jitter and shimmer, and the prosodic features only for the speech frames, and the jitter and shimmer and the prosodic ones are stacked in the same feature vectors.
Then, based on the size of the data, an initial number of clusters is set: if the show is longer we have more clusters, and if it is shorter we have fewer clusters, so the initial number of clusters depends only on the duration of the audio signal.
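A minimal sketch of such a duration-based initialization; the constants here are assumptions, not the values used in our system:

```python
def initial_num_clusters(duration_s, seconds_per_cluster=60.0,
                         min_clusters=2, max_clusters=30):
    """One initial cluster per fixed amount of audio, clamped to a range."""
    n = int(duration_s // seconds_per_cluster)
    return max(min_clusters, min(max_clusters, n))
```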
Then we assign segments to these initialized clusters. Then we perform the HMM decoding and training process, and we get two different log-likelihood scores: the first one for the short-time spectral features, and the second one for the long-term features. These two scores are fused linearly in the speaker segmentation, and the speaker segmentation then gives us a set of clusters.
So we use a classical BIC computation technique that computes the pairwise similarity between all pairs of clusters, and at each iteration the two clusters that have the highest BIC score are merged. This process iterates until the highest BIC value among the clusters is less than a specified threshold value.
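A minimal sketch of this agglomerative loop, assuming each cluster is a list of segments and `bic_score` is a stand-in for the pairwise delta-BIC measure:

```python
import itertools

def agglomerative_bic(clusters, bic_score, threshold=0.0):
    """Merge the most similar cluster pair until no pair exceeds the threshold."""
    clusters = list(clusters)
    while len(clusters) > 1:
        # Score every pair and pick the most similar one.
        (i, j), best = max(
            (((i, j), bic_score(clusters[i], clusters[j]))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best < threshold:  # stopping criterion
            break
        merged = clusters[i] + clusters[j]  # merge the two clusters
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```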
So this is the classical BIC computation. In our work, the initialization and the speaker segmentation are the same; the main contribution is in the speaker clustering, where the GMM-BIC computation is replaced by i-vector clustering.
So this is our proposed architecture. Given the set of clusters that are the output of the Viterbi segmentation, we extract two different sets of i-vectors: the first one from the MFCCs, and the second one from the jitter and shimmer and the prosodic features. We use two different universal background models: the first one for the short-term spectral features, and the second one for the long-term features.
The UBMs and the T matrices are trained using the same source: we have selected one hundred shows, with a total duration of forty hours, to train the UBMs, and the i-vectors are extracted using an existing toolkit.
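The talk does not detail the tool, but conceptually this step amounts to training one diagonal-covariance GMM per feature stream; a minimal sketch using scikit-learn as a stand-in, where file names and component counts are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical pre-extracted training features, one matrix per stream.
mfcc_frames = np.load("train_mfcc.npy")          # (n_frames, n_mfcc)
longterm_frames = np.load("train_longterm.npy")  # jitter/shimmer + prosody stack

# One UBM per feature stream; 512 and 64 components are assumed values.
ubm_short = GaussianMixture(n_components=512, covariance_type="diag").fit(mfcc_frames)
ubm_long = GaussianMixture(n_components=64, covariance_type="diag").fit(longterm_frames)
```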
So the stopping criterion is normally based on a specified threshold value: if the score drops below the specified threshold, the system stops merging. To find the optimum threshold value, we have used a semi-automatic way of selecting it.
For example, in this figure we have displayed how we have selected the lambda value, the stopping criterion, for five shows from the development set. The red curves show the highest cosine distance score at each iteration, and the black ones are the diarization error rates at each iteration. The horizontal dashed line is the lambda value selected as the threshold to stop the process. For example, for the first show, the system stops at the fourth iteration, because at the fourth iteration the maximum cosine distance score is lower than this threshold value. We have applied this technique on the whole development set, and the resulting lambda value is applied directly on the test sets.
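One way to make that selection concrete, assuming we keep, for each development show, the per-iteration maximum cosine score and DER, is to take the score at each show's DER minimum and average over shows; the averaging step is an assumption about how the semi-automatic choice is resolved:

```python
import numpy as np

def select_lambda(dev_runs):
    """dev_runs: list of (max_cosine_per_iter, der_per_iter) pairs, one per show."""
    best_scores = []
    for scores, ders in dev_runs:
        best_iter = int(np.argmin(ders))       # iteration with the lowest DER
        best_scores.append(scores[best_iter])  # score at which we should stop
    return float(np.mean(best_scores))
```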
So we have used two different fusion techniques, one in the speaker segmentation and the other in the speaker clustering. In the segmentation, the fusion is based on log-likelihood scores: we get two different scores for a given segment, one from the short-term spectral features and the other from the long-term features. The log-likelihood score of the short-term spectral features is multiplied by alpha, and similarly the log-likelihood score of the long-term features is multiplied by one minus alpha. The alpha value has to be tuned on the development data sets.
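In other words, the segmentation fusion reduces to a single weighted sum; a sketch, with the alpha value itself an assumed placeholder:

```python
ALPHA = 0.9  # assumed placeholder; in practice tuned on the development set

def fused_log_likelihood(ll_short, ll_long, alpha=ALPHA):
    """Linear fusion of short-term and long-term log-likelihood scores."""
    return alpha * ll_short + (1.0 - alpha) * ll_long
```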
The fusion technique in the speaker clustering is carried out as follows. We have three different sets of features: the MFCCs, the voice quality and the prosodic ones. The long-term features are stacked together, and then we extract two different sets of i-vectors, from the MFCCs and from the long-term ones. The cosine similarities between these two sets of i-vectors are then fused by a linear weighting function: beta is the weight applied to the cosine distance scores extracted from the spectral features, and one minus beta is the weight assigned to the cosine distance scores extracted from the long-term features.
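A sketch of this clustering-stage fusion, with an assumed placeholder for beta:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_similarity(iv_short_a, iv_short_b, iv_long_a, iv_long_b, beta=0.8):
    # beta weights the MFCC i-vector score, (1 - beta) the long-term one;
    # beta = 0.8 is an assumed placeholder, tuned on the development set.
    return (beta * cosine(iv_short_a, iv_short_b)
            + (1.0 - beta) * cosine(iv_long_a, iv_long_b))
```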
So, when we come to the experimental setup: we have developed and tested our experiments on the AMI corpus, which is a multi-party and spontaneous dataset of meeting recordings. Normally, in the AMI shows the number of speakers ranges from two to five; mostly the number of speakers is four. They are meeting recordings, recorded in a far-field microphone condition. We have selected ten shows as a development set to tune the different parameters, that is, the weight values and the threshold values. Then we have defined two experimental setups: the first one is single-site, with ten shows selected from Idiap, and the other one is multi-site, with ten shows selected from all the sites. The optimum parameters obtained on the development set are directly used on these single- and multi-site shows. We have used two different sets of i-vectors for the short- and long-term features, and these are also tuned on the development set. And we have used the oracle speech references for the speech activity detection, so the diarization error rate reported in this work corresponds mainly to the speaker errors; the missed speech and false alarm errors have a zero value.
So, if we look at the results: the baseline system, which is based on MFCCs and GMM-BIC clustering, is the state of the art. But when we use jitter and shimmer and prosody, both in the GMM and in the i-vector clustering techniques, the performance improves a lot compared to the baseline. And if we compare the i-vector clustering techniques with the GMM ones, the i-vector clustering techniques again provide better results than the GMM clustering technique. We can also conclude, comparing the i-vector clustering technique based only on short-term spectral features with the one using two different sets of features, that the latter provides better results than using one set of i-vectors from the short-term features alone.
We have also done some post-submission work to improve the results further. We have also tested PLDA scoring in the clustering stage, and the PLDA clustering, as shown in the table, whether it uses one set of i-vectors or two sets of i-vectors, provides better diarization results than both the GMM and the cosine scoring techniques.
One of the issues in speaker diarization is that the diarization error rate varies a lot among the different shows: for example, one show may give us a small DER like five percent, and another show may give us a big DER like fifty percent. This box plot shows the DER variation over the multi-site and single-site sets: this one is the DER variation for the single site, and the grey one is the DER variation for the multiple sites. This is the highest DER and this is the lowest DER, and we can see that there is a huge variation between the maximum and the minimum.
If we look here, the use of long-term features, both in the GMM and in the i-vector clustering techniques, helps us to reduce the DER variation among the different shows. The other thing we can see is that the i-vector clustering techniques based on short-term plus long-term features give us lower errors; at least we can say that they again reduce the DER variation among the different shows. And finally, this one, the i-vector clustering technique based on short-term and long-term features, has the lowest variation among the different shows.
So, in conclusion, we have proposed the extraction of i-vectors from short- and long-term speech features for the speaker clustering task. The experiments demonstrate that the i-vector clustering techniques provide better diarization error rates than the GMM clustering ones, and also that the extraction of i-vectors from the long-term features, in addition to the short-term ones, helps us to reduce the DER. So we can say that the extraction of i-vectors and the use of i-vector clustering techniques are helpful for speaker diarization systems.
Thank you.
Then it's time for questions.
So I have one question: could you explain the process you are using for calculating the jitter and shimmer, and did you find it to be a robust process across the TV shows?
Normally our shows are from the meeting domain; it is a meeting domain, not a TV show. But when we extract these features, the problem that arises is that if the speech is unvoiced, we get zero values. So we compensate for them by averaging over five-hundred-millisecond durations: we extract the features at the frame level, and then compensate for the zero values of the unvoiced frames by averaging over five hundred milliseconds.
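A sketch of that compensation, assuming a 100 Hz frame rate and treating exact zeros as unvoiced (both assumptions):

```python
import numpy as np

def smooth_unvoiced(values, frame_rate_hz=100, window_s=0.5):
    """Replace each frame value by the mean of the voiced (non-zero)
    values inside a 500 ms window centred on it."""
    values = np.asarray(values, dtype=float)
    half = int(window_s * frame_rate_hz / 2)
    out = values.copy()
    for i in range(len(values)):
        win = values[max(0, i - half): i + half + 1]
        voiced = win[win != 0]
        out[i] = voiced.mean() if voiced.size else 0.0
    return out
```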
You also said, in one of your slides, that the threshold is trained on the development set. How did you find it or train it, and did you experiment with changing the threshold value?
You mean in the segmentation, I think? This one?
No, in the formula where you present the segmentation.
This one? Or here?
So you mean the alpha values? They have been tuned on the development sets: we tested different weight values for the two features, and these values are directly applied on the test sets.
Okay, so they are fixed? They are fixed in the test experiments?
Thank you, a very clear presentation. I just wanted to understand a little bit about the physical motivation: do you have an explanation for why you went for jitter and shimmer and prosody? For example, we find pitch to be quite important. How did you converge on these two, did you go through a selection process to get to them, or do you have any intuition or explanation for it?
So you are asking why we are interested in the extraction of the jitter and shimmer and the prosodic features?
How did you zero in on them, and what is your physical intuition for using them as opposed to other long-term features?
Because they are voice quality measurements, they can be used to discriminate the speech of one person from another.
So your hypothesis is that there would be a significant difference between speakers, and that this would be robust to whatever channel it is going through? But these measurements are extremely delicate, if you will; if you extended this outside this dataset, for example to real-life recordings, would you worry about the sensitivity of these features you are looking at?
Okay, for example, jitter and shimmer have also been used in speaker verification and recognition on this database, and that is the reason why we applied them to speaker diarization. We have checked jitter and shimmer on the AMI corpus, which is what I am presenting here, and we have also extracted them on another corpus, a TV show from a Catalan project, and there we also got some improvements.
So you are confident it helps. Would there be any others that you think you could add to the two?
Note that there are different types of jitter: we have about ten or eleven types of jitter and shimmer measurements, but we have selected these three based on previous studies for speaker recognition. Maybe we can also check with the others.
Any other question?
I have a question about the stopping criterion: you are not assuming that you know the number of speakers beforehand?
That's right, we do not know the number of speakers beforehand; we only assume the oracle speech/non-speech conditions.
Any other questions? There are no more questions, so let's thank the speaker again.