Hi everyone. I'm working with Orange Labs and the LIUM, in France, and I'm going to talk about the concept of self-trained speaker diarization.
The application we are working on is the task of cross-recording speaker diarization, applied to TV archives, French TV archives. The goal is to index the speakers of collections of multiple recordings, in order, for example, to provide new means of dataset exploration by creating links between different episodes.
Our system is based on a two-pass approach: we first process each recording separately, applying some kind of speaker segmentation and clustering, and then we perform cross-recording speaker linking and try to link all within-recording clusters across the whole collection.
Our framework is based on the state-of-the-art speaker recognition framework: we are using i-vector/PLDA modeling, and for clustering we use hierarchical agglomerative clustering.
We know that the goal of PLDA is to maximize the between-speaker variability while minimizing the within-speaker variability.
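(As an illustration of that objective, here is a minimal sketch of PLDA pairwise scoring in its two-covariance form, written with plain numpy/scipy; the between- and within-speaker covariances B and W are toy inputs of this sketch, not the actual implementation used in the talk.)

```python
# Minimal sketch of PLDA scoring in its two-covariance form (illustration only;
# the system described in the talk relies on SIDEKIT/S4D, not on this code).
# Assumption: B models between-speaker variability, W within-speaker variability.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, mu, B, W):
    """Log-likelihood ratio of same-speaker vs. different-speaker hypotheses."""
    d = len(mu)
    stacked = np.concatenate([x1, x2])
    mean = np.concatenate([mu, mu])
    # Same speaker: the shared speaker variable correlates the two i-vectors.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different speakers: the two i-vectors are independent.
    cov_diff = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])
    return (multivariate_normal.logpdf(stacked, mean, cov_same)
            - multivariate_normal.logpdf(stacked, mean, cov_diff))
```

Making the between-speaker covariance large relative to the within-speaker covariance is exactly what makes this ratio discriminative.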
What we want to investigate in our paper is: can we use the target data as training material, and how well can we estimate the speaker variability?
First, I'm going to present our diarization framework. Let's take an audio file from the target data; our target data is unlabeled, so we just have audio files.
First, we extract some features; we are using MFCC features with delta and delta-delta. Then we perform a combination of speech activity detection and BIC clustering to extract speaker segments. On top of those segments, we can extract i-vectors, using a pre-trained UBM and total variability matrix.
Once we have obtained our i-vectors, we are able to score all i-vectors against each other and compute a similarity score matrix; for that we use the PLDA likelihood ratio, whose parameters are estimated beforehand. Once we have our similarity matrix, we can apply speaker clustering, and the result of the diarization is a set of speaker clusters.
We can repeat the process for each of the recordings. Once we've done that, we can compute a collection-wide similarity matrix and repeat the clustering process; this time I call it speaker linking, because the goal is to link the within-recording clusters across the whole collection. After the linking part, we obtain the cross-recording diarization.
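(To make the two-pass structure concrete, here is a minimal sketch assuming i-vectors are already extracted and that a score_matrix function, cosine- or PLDA-based, is available; the helper names are hypothetical, not the talk's S4D code, and the clustering threshold here lives in a converted distance space.)

```python
# Sketch of the two-pass structure: within-recording clustering, then
# cross-recording linking on cluster centroids (illustration only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_scores(scores, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix."""
    if scores.shape[0] < 2:
        return np.ones(scores.shape[0], dtype=int)
    dist = scores.max() - scores              # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")

def two_pass_diarization(recordings, score_matrix, threshold):
    """recordings: dict mapping recording id -> (n, d) array of i-vectors."""
    centroids, owners = [], []
    for rec_id, ivectors in recordings.items():       # pass 1: within recording
        labels = cluster_from_scores(score_matrix(ivectors), threshold)
        for lab in np.unique(labels):
            centroids.append(ivectors[labels == lab].mean(axis=0))
            owners.append(rec_id)
    # pass 2: speaker linking over the whole collection of cluster centroids
    linked = cluster_from_scores(score_matrix(np.vstack(centroids)), threshold)
    return list(zip(owners, linked))
```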
The usual way of training the UBM and TV matrix and estimating the PLDA parameters is to use a training dataset which is labeled by speaker; the training procedure is then pretty straightforward. The problem when we apply this technique is that we have some kind of mismatch between the target and training data: first, we don't have the same acoustic conditions, and second, we don't necessarily have the same speakers in the target and training data. So if we could use information about the target data, maybe we could have better results.
What we want to investigate is the concept of self-trained diarization, meaning we would like to use only the target data itself to estimate the parameters; then we are going to compare the results with a combination of target and training data.
The goal of self-trained diarization is to avoid the acoustic mismatch between the training and target data. What do we need to train an i-vector/PLDA system? To train the UBM and the TV matrix, we only need clean speech segments; the training is then straightforward. As for the PLDA parameter estimation, we need several sessions per speaker, in various acoustic conditions. So what we need to investigate is: do we have several speakers appearing in different episodes in our target data, and, assuming we know how to effectively cluster the target data in terms of speakers, can we estimate PLDA parameters with those clusters?
Let's have a look at the data. We have around two hundred hours of French broadcast news drawn from previous French evaluation campaigns, so it's a combination of TV and radio data. Among these two hundred hours, we selected two shows as target corpora: LCP Info and BFM Story. We took all the other available recordings to build what we call the train corpus.
If we take a look at the data, we see that we have more than forty episodes for each target show, and what we can note is the speech proportion of what I call the recurring speakers, which is above fifty percent for both corpora. A recurring speaker is a speaker who appears in more than one episode, as opposed to a one-time speaker, who only appears in one episode.
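(A toy sketch of how such corpus statistics can be computed from a table of speech turns; the (speaker, episode, duration) layout is a hypothetical simplification, not the actual corpus tooling.)

```python
# Toy sketch: statistics of recurring speakers from a
# (speaker, episode, speech_duration_in_seconds) table.
from collections import defaultdict

def recurring_speaker_stats(turns):
    """turns: iterable of (speaker, episode, duration) tuples."""
    episodes_per_spk = defaultdict(set)
    speech_per_spk = defaultdict(float)
    total_speech = 0.0
    for spk, ep, dur in turns:
        episodes_per_spk[spk].add(ep)
        speech_per_spk[spk] += dur
        total_speech += dur
    recurring = {s for s, eps in episodes_per_spk.items() if len(eps) > 1}
    speech_proportion = sum(speech_per_spk[s] for s in recurring) / total_speech
    avg_sessions = sum(len(episodes_per_spk[s]) for s in recurring) / len(recurring)
    return len(recurring), speech_proportion, avg_sessions
```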
So the answer to the first question is yes: we have several speakers appearing in different episodes in our target data. Next, we decided to train an oracle system, meaning we suppose we know how to cluster the target data, so we use the target data labels. In real life we do not have those labels, but for this experiment we decided to use them.
To train the UBM and the TV matrix and estimate the PLDA parameters, we proceed the same way as with the training data; we just replace the labeled training data with the labeled target data.
The results are presented in terms of diarization error rate, the cross-recording diarization error rate. What we see is that for the LCP show, we are able to obtain a result; as for the BFM show, we were not able to estimate the PLDA parameters, and we suppose we don't have enough data to do so, so we're going to investigate that. If we compare with the baseline results, we see that if we use the information about speakers in the target data, we should be able to improve on the baseline system.
What we want to investigate is the minimum amount of data we need to estimate the PLDA parameters, because we saw that for the BFM show we were not able to train the PLDA, while for the LCP show we were. So we decided to find out the minimum number of episodes we could take from the LCP show to estimate suitable PLDA parameters. The graph that you see here is the DER on the LCP show, as a function of the number of episodes taken to estimate the PLDA parameters. The total number of episodes is forty-five, and we started the experiments with thirty episodes, because below that the results were not usable.
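(A schematic sketch of that sweep; the three callables are hypothetical stand-ins for the oracle-labeled PLDA training, the cross-recording diarization and the DER scoring steps of the actual pipeline.)

```python
# Schematic sketch of the episode sweep: train PLDA on the first n labeled
# episodes of the target show and measure the cross-recording DER.
def sweep_min_episodes(episodes, reference, train_plda, run_diarization,
                       compute_der, n_min=30, n_max=45):
    results = {}
    for n in range(n_min, n_max + 1):
        plda = train_plda(episodes[:n])   # PLDA from the first n episodes
        if plda is None:                  # training may fail with too little data
            results[n] = None
            continue
        hypothesis = run_diarization(episodes, plda)
        results[n] = compute_der(hypothesis, reference)
    return results
```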
What's interesting to see is that we need around thirty-seven episodes to be able to improve on the baseline results, and when we have thirty-seven episodes, we have forty recurring speakers. What's also interesting is that we have the same number of speakers here and here, for different numbers of episodes, yet the resulting DER keeps improving; so we have the same speaker count, and what's happening is simply that more and more data is gathered for each speaker. We need a minimum amount of data for each speaker: if we take a look at the average number of sessions per speaker, it's around seven when we have thirty-seven episodes.
As for the BFM show, when we take all its episodes, we only have thirty-five recurring speakers, appearing in five episodes each on average, so it's far less than for the LCP corpus, and that's why we are not able to train the PLDA parameters.
Now let's move to the real case: we are no longer allowed to use the target data labels. First, to train the UBM and TV matrix, we need clean speech, so we just decided to take the output of the speaker segmentation and compute the UBM and TV matrix from it. But we don't have any information about the speakers, so we are not able to estimate the PLDA parameters; we therefore replace the PLDA likelihood scoring by cosine-based scoring. We then have a working system; when we look at the results, they are not as good as when using PLDA, but that's not a surprise, we expected that.
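(The cosine replacement is a minimal step; a sketch, assuming i-vectors stacked in a numpy array:)

```python
# Minimal sketch of cosine-based scoring between i-vectors, used here in
# place of PLDA when no speaker labels are available (illustration only).
import numpy as np

def cosine_score_matrix(ivectors):
    """ivectors: (n, d) array; returns an (n, n) cosine similarity matrix."""
    normed = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    return normed @ normed.T
```

Cosine scoring needs no labeled data at all, which is why it can bootstrap this first unsupervised pass.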
Now we obtain speaker clusters, so the idea is to use those speaker clusters and try to estimate the PLDA parameters with them. When we do so, the training procedure doesn't succeed: we saw in the oracle experiment that the amount of data was limited, and we also suspect that the purity of the clusters we use is too bad to allow us to estimate the PLDA parameters.
To summarize the self-training setup: for the UBM and TV training, we selected segments produced by the speaker segmentation; we only keep the segments with a duration above ten seconds, and we also chose the BIC parameters so that the segments are considered pure, because to train the TV matrix we need clean segments with only one speaker in each segment. As for the PLDA, we need several sessions per speaker from various episodes, so first we perform an i-vector clustering based diarization and use the output speaker clusters to perform i-vector normalization and estimate the PLDA parameters; we only select the output speaker clusters with i-vectors coming from more than three episodes.
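(A sketch of those two selection rules with hypothetical, simplified data structures; the thresholds mirror the values quoted above.)

```python
# Sketch of the selection rules: keep only long segments for UBM/TV training,
# and only clusters spanning enough episodes for PLDA estimation.
def select_tv_segments(segments, min_duration=10.0):
    """segments: list of dicts with a 'duration' field in seconds."""
    return [s for s in segments if s["duration"] >= min_duration]

def select_plda_clusters(clusters, min_episodes=3):
    """clusters: dict mapping cluster id -> list of (episode_id, ivector)."""
    return {cid: items for cid, items in clusters.items()
            if len({ep for ep, _ in items}) > min_episodes}
```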
Now, we saw that we are not able to train a sufficient system with only the target data, so we decided to add some training data into the mix; this is the classic idea of domain adaptation. The main difference of this system compared with the baseline is that we replace the UBM and TV matrices: in this experiment, the UBM and TV matrices are trained on the target data instead of the training data, and then we extract i-vectors from the training data and estimate the PLDA parameters on the training data.
When replacing the UBM and TV matrix, we are able to improve by around one percent absolute in terms of DER.
Now, why not try to apply the same process as in the self-training experiments and take the speaker clusters to estimate new PLDA parameters? As before, the estimation of the PLDA parameters fails; we think we really don't have enough data to do so. So we just decided to combine the use of training data and target data to update the PLDA parameters, the classic domain adaptation scenario, but we don't use any weighting parameter to balance the influence of training and target data: we just take the i-vectors from the training data and the i-vectors from the output speaker clusters, combine them, and train new PLDA parameters.
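(A sketch of this unweighted pooling, assuming i-vectors stored in numpy arrays; the label prefixing is my own illustration of keeping the two label spaces disjoint, and the PLDA trainer itself is not shown.)

```python
# Sketch of the unweighted pooling: i-vectors from the labeled training set
# and from the automatically obtained target clusters are simply concatenated
# before PLDA training (illustration only).
import numpy as np

def pool_for_plda(train_ivectors, train_labels, target_ivectors, target_labels):
    # Keep label spaces disjoint so a target cluster is never merged
    # with a training speaker by accident.
    target_labels = ["target_" + str(lab) for lab in target_labels]
    ivectors = np.vstack([train_ivectors, target_ivectors])
    labels = list(train_labels) + target_labels
    return ivectors, labels   # feed these to the PLDA trainer
```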
When we combine the data, we again improve on the baseline system, gaining around one percent in terms of DER.
Now that we've done that, why not try to iterate? As long as we obtain speaker clusters, we can always use them and try to improve the estimation of the PLDA parameters. Well, it doesn't work: iterating doesn't improve the system. We tried up to four iterations, but it doesn't help.
Let's have a look at the system parameters. We use the S4D diarization toolkit (SIDEKIT for diarization); it's a package built on top of the SIDEKIT library. For the front end, we use thirteen MFCCs with delta and delta-delta. We use two hundred and fifty-six components to train the UBM, with a diagonal covariance matrix. The dimension of the TV matrix is two hundred, the dimension of the PLDA eigenvoice matrix is one hundred, and we don't use any eigenchannel matrix. For the speaker clustering task, we use a combination of connected components clustering and hierarchical agglomerative clustering. As I said before, the metric is the cross-recording diarization error rate, and we use a two hundred and fifty millisecond collar.
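(For readability, the same parameters written as a plain configuration dictionary; this is my own summary, not a file from the S4D toolkit.)

```python
# Summary of the system parameters quoted in the talk.
SYSTEM_CONFIG = {
    "features": {"mfcc": 13, "deltas": True, "delta_deltas": True},
    "ubm": {"components": 256, "covariance": "diagonal"},
    "total_variability_dim": 200,
    "plda": {"eigenvoice_dim": 100, "eigenchannel_dim": 0},
    "clustering": ["connected_components", "hierarchical_agglomerative"],
    "metric": {"name": "cross-recording DER", "collar_ms": 250},
}
```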
To summarize, we compared four different systems. First, we performed supervised training using only external data: this is the baseline. Then we used the same training process but replaced the training data with the labeled target data: this is the oracle experiment. Then we focused on unsupervised training using only the target data, and we saw that it's not good enough when compared with the baseline system. So we decided to take back some training data, applying some kind of unsupervised domain adaptation combining train and target data.
To conclude, we can say that if we don't have enough data, we absolutely need to use external data to bootstrap the system. But even using unlabeled target data, which is imperfectly clustered, we are able to improve the system with some kind of domain adaptation.
In our future work, we want to focus on the adaptation framework, where we would like to introduce a weighting parameter between train and target data. We would also like to work on the iterative procedure, because we think that if we are able to better estimate the PLDA parameters after one adaptation iteration, we should be able to improve the quality of the clusters, so some kind of iteration should be possible.
In fact, this work has already been done: we submitted a paper to Interspeech, and it will be presented there. I can already say that, using the weighting parameter, the results really do get better, and the iterative procedure also works: with two or three iterations, we are able to slowly improve the DER. Another way to improve remains to be seen, but we would like to try to bootstrap the system with any unlabeled data: for example, we could take the training data, ignore its labels, and perform cosine-based clustering, because we saw that in our approach maybe we didn't have enough data in the target corpus to apply this idea. So maybe bootstrapping with more unlabeled data could work.
Well, thank you, that was wonderful. Any comments or questions?
Thank you for the talk. I think this is more a comment than a question, but I believe that some of your problems with the EM for the PLDA are that your speaker subspace dimension is a high number.
I think that's part of the problem, with the dimensions I mentioned for the TV and PLDA: when we don't have enough target data, the problem is that it is difficult to estimate the one-hundred-dimensional PLDA parameters if you don't have that many speakers.
Did you try to reduce it? No, I didn't focus on that.
Thanks for the presentation. I think that in previous work ILP clustering was used for this task; how does it compare with the agglomerative clustering you used here?
Well, in my experiments, the results are not very different between ILP and agglomerative clustering; I just decided to use agglomerative clustering because it's simpler, and also faster in computation time, but there is not really a big difference between the two, I think.
dealing with these different internal extra so one thing i
see here and work was
what to use a way that i
why each latest specifically a little white here
No, we didn't weight the data; we just took the target clusters and the training clusters and put them together in the same dataset. If you look at the equations, it's the same as using a weighting parameter whose value is the relative amount of target data compared to training data, so it is almost equal to zero. That's why we need to work on the weighting, because for now we are not balancing it at all.
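(A tiny illustration of that implicit weight, with toy counts that are not the ones from the experiments:)

```python
# Illustration of the implicit weight when pooling without balancing:
# the effective weight of the target data is just its share of the pooled set.
n_train_ivectors = 20000   # hypothetical size of the labeled training set
n_target_ivectors = 300    # hypothetical size of the target speaker clusters
implicit_target_weight = n_target_ivectors / (n_train_ivectors + n_target_ivectors)
print(implicit_target_weight)  # ~0.015, i.e. almost zero, as stated above
```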
On a different note: in your clustering experiments, how do you decide how many clusters there are?
Well, the clustering is a function of a threshold, and we just select the threshold by experiment. That's why we chose two target corpora: this way we are able to do an exhaustive search for the threshold on one corpus and then check whether the same threshold applies to the other corpus. The clustering threshold is around zero.
We still have time for a few questions.
Okay, so I was curious: you mentioned that in this work the iteration did not seem to be helpful, but then you were somehow able to fix it in the next one. Can we know what it was? I mean, what do you think was the main problem?
In this work, the problem is that we don't introduce a weighting: we don't balance the influence of training and target data. In the combination of training and target data, we have so much training data that the implicit weighting is really in favor of the training data. When we change the balance between training and target data and give more importance to the target data, the system gets better results, and then you see that with the iterations you can improve some more, over two or three iterations.
We also did some kind of score normalization, because when you use the target data to obtain the PLDA parameters, the distribution of the PLDA scores also tends to shift a lot, so you need to normalize them to keep the same clustering threshold; otherwise you don't cluster at the same operating point at all.
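(The talk does not give the exact normalization recipe; one plausible form is to standardize the pairwise scores so that a fixed clustering threshold keeps roughly the same meaning after PLDA re-estimation, sketched below.)

```python
# One plausible score normalization (not necessarily the one used in the talk):
# zero-mean, unit-variance scaling computed from the off-diagonal scores.
import numpy as np

def normalize_score_matrix(scores):
    """scores: (n, n) similarity matrix; returns a standardized copy."""
    mask = ~np.eye(scores.shape[0], dtype=bool)
    mu, sigma = scores[mask].mean(), scores[mask].std()
    return (scores - mu) / sigma
```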
Okay, if there are no further questions, let's thank the speaker.