Hi everyone. I'm working with Orange Labs and the LIUM, in France, and I'm going to talk about the concept of self-trained speaker diarization.
The application we are working on is the task of cross-recording speaker diarization, applied to TV archives, French TV archives. The goal is to index the speakers of collections of multiple recordings, in order, for example, to provide new means of dataset exploration by creating links between different episodes.
Our system is based on a two-pass approach: we first process each recording separately, applying some kind of speaker segmentation and clustering, and then we perform cross-recording speaker linking and try to link all within-recording clusters across the whole collection.
Our framework is based on the state-of-the-art speaker recognition framework: we are using i-vector/PLDA modeling, and for clustering we use hierarchical agglomerative clustering.
We know that the goal of PLDA is to maximize the between-speaker variability while minimizing the within-speaker variability.
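(As an illustration of that objective, here is a minimal sketch of PLDA pairwise scoring in its two-covariance form, written with plain numpy/scipy; the between- and within-speaker covariances B and W are toy inputs of this sketch, not the actual implementation used in the talk.)

```python
# Minimal sketch of PLDA scoring in its two-covariance form (illustration only;
# the system described in the talk relies on SIDEKIT/S4D, not on this code).
# Assumption: B models between-speaker variability, W within-speaker variability.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, mu, B, W):
    """Log-likelihood ratio of same-speaker vs. different-speaker hypotheses."""
    d = len(mu)
    stacked = np.concatenate([x1, x2])
    mean = np.concatenate([mu, mu])
    # Same speaker: the shared speaker variable correlates the two i-vectors.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different speakers: the two i-vectors are independent.
    cov_diff = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])
    return (multivariate_normal.logpdf(stacked, mean, cov_same)
            - multivariate_normal.logpdf(stacked, mean, cov_diff))
```

Making the between-speaker covariance large relative to the within-speaker covariance is exactly what makes this ratio discriminative.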
What we want to investigate in our paper is: can we use the target data as training material, and how well can we estimate the speaker variability?
First, I'm going to present our diarization framework. Let's take an audio file from the target data; our target data is unlabeled, so we just have audio files.
First, we extract some features; we are using MFCC features with delta and delta-delta. Then we perform a combination of speech activity detection and BIC clustering to extract speaker segments. On top of those segments, we can extract i-vectors, using a pre-trained UBM and total variability matrix.
Once we have obtained our i-vectors, we are able to score all i-vectors against each other and compute a similarity score matrix; for that we use the PLDA likelihood ratio, whose parameters are estimated beforehand. Once we have our similarity matrix, we can apply speaker clustering, and the result of the diarization is a set of speaker clusters.
We can repeat the process for each of the recordings. Once we've done that, we can compute a collection-wide similarity matrix and repeat the clustering process; this time I call it speaker linking, because the goal is to link the within-recording clusters across the whole collection. After the linking part, we obtain the cross-recording diarization.
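(To make the two-pass structure concrete, here is a minimal sketch assuming i-vectors are already extracted and that a score_matrix function, cosine- or PLDA-based, is available; the helper names are hypothetical, not the talk's S4D code, and the clustering threshold here lives in a converted distance space.)

```python
# Sketch of the two-pass structure: within-recording clustering, then
# cross-recording linking on cluster centroids (illustration only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_scores(scores, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix."""
    if scores.shape[0] < 2:
        return np.ones(scores.shape[0], dtype=int)
    dist = scores.max() - scores              # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")

def two_pass_diarization(recordings, score_matrix, threshold):
    """recordings: dict mapping recording id -> (n, d) array of i-vectors."""
    centroids, owners = [], []
    for rec_id, ivectors in recordings.items():       # pass 1: within recording
        labels = cluster_from_scores(score_matrix(ivectors), threshold)
        for lab in np.unique(labels):
            centroids.append(ivectors[labels == lab].mean(axis=0))
            owners.append(rec_id)
    # pass 2: speaker linking over the whole collection of cluster centroids
    linked = cluster_from_scores(score_matrix(np.vstack(centroids)), threshold)
    return list(zip(owners, linked))
```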
The usual way of training the UBM and TV matrix and estimating the PLDA parameters is to use a training dataset which is labeled by speaker; the training procedure is then pretty straightforward. The problem when we apply this technique is that we have some kind of mismatch between the target and training data: first, we don't have the same acoustic conditions, and second, we don't necessarily have the same speakers in the target and training data. So if we could use information about the target data, maybe we could have better results.
What we want to investigate is the concept of self-trained diarization, meaning we would like to use only the target data itself to estimate the parameters; then we are going to compare the results with a combination of target and training data.
The goal of self-trained diarization is to avoid the acoustic mismatch between the training and target data. What do we need to train an i-vector/PLDA system? To train the UBM and the TV matrix, we only need clean speech segments; the training is then straightforward. As for the PLDA parameter estimation, we need several sessions per speaker, in various acoustic conditions. So what we need to investigate is: do we have several speakers appearing in different episodes in our target data, and, assuming we know how to effectively cluster the target data in terms of speakers, can we estimate PLDA parameters with those clusters?
Let's have a look at the data. We have around two hundred hours of French broadcast news drawn from previous French evaluation campaigns, so it's a combination of TV and radio data. Among these two hundred hours, we selected two shows as target corpora: LCP Info and BFM Story. We took all the other available recordings to build what we call the train corpus.
If we take a look at the data, we see that we have more than forty episodes for each target show, and what we can note is the speech proportion of what I call the recurring speakers, which is above fifty percent for both corpora. A recurring speaker is a speaker who appears in more than one episode, as opposed to a one-time speaker, who only appears in one episode.
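(A toy sketch of how such corpus statistics can be computed from a table of speech turns; the (speaker, episode, duration) layout is a hypothetical simplification, not the actual corpus tooling.)

```python
# Toy sketch: statistics of recurring speakers from a
# (speaker, episode, speech_duration_in_seconds) table.
from collections import defaultdict

def recurring_speaker_stats(turns):
    """turns: iterable of (speaker, episode, duration) tuples."""
    episodes_per_spk = defaultdict(set)
    speech_per_spk = defaultdict(float)
    total_speech = 0.0
    for spk, ep, dur in turns:
        episodes_per_spk[spk].add(ep)
        speech_per_spk[spk] += dur
        total_speech += dur
    recurring = {s for s, eps in episodes_per_spk.items() if len(eps) > 1}
    speech_proportion = sum(speech_per_spk[s] for s in recurring) / total_speech
    avg_sessions = sum(len(episodes_per_spk[s]) for s in recurring) / len(recurring)
    return len(recurring), speech_proportion, avg_sessions
```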
So the answer to the first question is yes: we have several speakers appearing in different episodes in our target data. Next, we decided to train an oracle system, meaning we suppose we know how to cluster the target data, so we use the target data labels. In real life we do not have those labels, but for this experiment we decided to use them.
To train the UBM and the TV matrix and estimate the PLDA parameters, we proceed the same way as with the training data; we just replace the labeled training data with the labeled target data.
The results are presented in terms of diarization error rate, the cross-recording diarization error rate. What we see is that for the LCP show, we are able to obtain a result; as for the BFM show, we were not able to estimate the PLDA parameters, and we suppose we don't have enough data to do so, so we're going to investigate that. If we compare with the baseline results, we see that if we use the information about speakers in the target data, we should be able to improve on the baseline system.
What we want to investigate is the minimum amount of data we need to estimate the PLDA parameters, because we saw that for the BFM show we were not able to train the PLDA, while for the LCP show we were. So we decided to find out the minimum number of episodes we could take from the LCP show to estimate suitable PLDA parameters. The graph that you see here is the DER on the LCP show, as a function of the number of episodes taken to estimate the PLDA parameters. The total number of episodes is forty-five, and we started the experiments with thirty episodes, because below that the results were not usable.
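(A schematic sketch of that sweep; the three callables are hypothetical stand-ins for the oracle-labeled PLDA training, the cross-recording diarization and the DER scoring steps of the actual pipeline.)

```python
# Schematic sketch of the episode sweep: train PLDA on the first n labeled
# episodes of the target show and measure the cross-recording DER.
def sweep_min_episodes(episodes, reference, train_plda, run_diarization,
                       compute_der, n_min=30, n_max=45):
    results = {}
    for n in range(n_min, n_max + 1):
        plda = train_plda(episodes[:n])   # PLDA from the first n episodes
        if plda is None:                  # training may fail with too little data
            results[n] = None
            continue
        hypothesis = run_diarization(episodes, plda)
        results[n] = compute_der(hypothesis, reference)
    return results
```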
What's interesting to see is that we need around thirty-seven episodes to be able to improve on the baseline results, and when we have thirty-seven episodes, we have forty recurring speakers. What's also interesting is that we have the same number of speakers here and here, for different numbers of episodes, yet the resulting DER keeps improving; so we have the same speaker count, and what's happening is simply that more and more data is gathered for each speaker. We need a minimum amount of data for each speaker: if we take a look at the average number of sessions per speaker, it's around seven when we have thirty-seven episodes.
As for the BFM show, when we take all its episodes, we only have thirty-five recurring speakers, appearing in five episodes each on average, so it's far less than for the LCP corpus, and that's why we are not able to train the PLDA parameters.
Now let's move to the real case: we are no longer allowed to use the target data labels. First, to train the UBM and TV matrix, we need clean speech, so we just decided to take the output of the speaker segmentation and compute the UBM and TV matrix from it. But we don't have any information about the speakers, so we are not able to estimate the PLDA parameters; we therefore replace the PLDA likelihood scoring by cosine-based scoring. We then have a working system; when we look at the results, they are not as good as when using PLDA, but that's not a surprise, we expected that.
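(The cosine replacement is a minimal step; a sketch, assuming i-vectors stacked in a numpy array:)

```python
# Minimal sketch of cosine-based scoring between i-vectors, used here in
# place of PLDA when no speaker labels are available (illustration only).
import numpy as np

def cosine_score_matrix(ivectors):
    """ivectors: (n, d) array; returns an (n, n) cosine similarity matrix."""
    normed = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    return normed @ normed.T
```

Cosine scoring needs no labeled data at all, which is why it can bootstrap this first unsupervised pass.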
Now we obtain speaker clusters, so the idea is to use those speaker clusters and try to estimate the PLDA parameters with them. When we do so, the training procedure doesn't succeed: we saw in the oracle experiment that the amount of data was limited, and we also suspect that the purity of the clusters we use is too bad to allow us to estimate the PLDA parameters.
To summarize the self-training setup: for the UBM and TV training, we selected segments produced by the speaker segmentation; we only keep the segments with a duration above ten seconds, and we also chose the BIC parameters so that the segments are considered pure, because to train the TV matrix we need clean segments with only one speaker in each segment. As for the PLDA, we need several sessions per speaker from various episodes, so first we perform an i-vector clustering based diarization and use the output speaker clusters to perform i-vector normalization and estimate the PLDA parameters; we only select the output speaker clusters with i-vectors coming from more than three episodes.
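(A sketch of those two selection rules with hypothetical, simplified data structures; the thresholds mirror the values quoted above.)

```python
# Sketch of the selection rules: keep only long segments for UBM/TV training,
# and only clusters spanning enough episodes for PLDA estimation.
def select_tv_segments(segments, min_duration=10.0):
    """segments: list of dicts with a 'duration' field in seconds."""
    return [s for s in segments if s["duration"] >= min_duration]

def select_plda_clusters(clusters, min_episodes=3):
    """clusters: dict mapping cluster id -> list of (episode_id, ivector)."""
    return {cid: items for cid, items in clusters.items()
            if len({ep for ep, _ in items}) > min_episodes}
```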
Now, we saw that we are not able to train a sufficient system with only the target data, so we decided to add some training data into the mix; this is the classic idea of domain adaptation. The main difference of this system compared with the baseline is that we replace the UBM and TV matrices: in this experiment, the UBM and TV matrices are trained on the target data instead of the training data, and then we extract i-vectors from the training data and estimate the PLDA parameters on the training data.
When replacing the UBM and TV matrix, we are able to improve by around one percent absolute in terms of DER.
Now, why not try to apply the same process as in the self-training experiments and take the speaker clusters to estimate new PLDA parameters? As before, the estimation of the PLDA parameters fails; we think we really don't have enough data to do so. So we just decided to combine the use of training data and target data to update the PLDA parameters, the classic domain adaptation scenario, but we don't use any weighting parameter to balance the influence of training and target data: we just take the i-vectors from the training data and the i-vectors from the output speaker clusters, combine them, and train new PLDA parameters.
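(A sketch of this unweighted pooling, assuming i-vectors stored in numpy arrays; the label prefixing is my own illustration of keeping the two label spaces disjoint, and the PLDA trainer itself is not shown.)

```python
# Sketch of the unweighted pooling: i-vectors from the labeled training set
# and from the automatically obtained target clusters are simply concatenated
# before PLDA training (illustration only).
import numpy as np

def pool_for_plda(train_ivectors, train_labels, target_ivectors, target_labels):
    # Keep label spaces disjoint so a target cluster is never merged
    # with a training speaker by accident.
    target_labels = ["target_" + str(lab) for lab in target_labels]
    ivectors = np.vstack([train_ivectors, target_ivectors])
    labels = list(train_labels) + target_labels
    return ivectors, labels   # feed these to the PLDA trainer
```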
When we combine the data, we again improve on the baseline system, gaining around one percent in terms of DER.
Now that we've done that, why not try to iterate? As long as we obtain speaker clusters, we can always use them and try to improve the estimation of the PLDA parameters. Well, it doesn't work: iterating doesn't improve the system. We tried up to four iterations, but it doesn't help.
Let's have a look at the system parameters. We use the S4D diarization toolkit (SIDEKIT for diarization); it's a package built on top of the SIDEKIT library. For the front end, we use thirteen MFCCs with delta and delta-delta. We use two hundred and fifty-six components to train the UBM, with a diagonal covariance matrix. The dimension of the TV matrix is two hundred, the dimension of the PLDA eigenvoice matrix is one hundred, and we don't use any eigenchannel matrix. For the speaker clustering task, we use a combination of connected components clustering and hierarchical agglomerative clustering. As I said before, the metric is the cross-recording diarization error rate, and we use a two hundred and fifty millisecond collar.
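(For readability, the same parameters written as a plain configuration dictionary; this is my own summary, not a file from the S4D toolkit.)

```python
# Summary of the system parameters quoted in the talk.
SYSTEM_CONFIG = {
    "features": {"mfcc": 13, "deltas": True, "delta_deltas": True},
    "ubm": {"components": 256, "covariance": "diagonal"},
    "total_variability_dim": 200,
    "plda": {"eigenvoice_dim": 100, "eigenchannel_dim": 0},
    "clustering": ["connected_components", "hierarchical_agglomerative"],
    "metric": {"name": "cross-recording DER", "collar_ms": 250},
}
```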
To summarize, we compared four different systems. First, we performed supervised training using only external data: this is the baseline. Then we used the same training process but replaced the training data with the labeled target data: this is the oracle experiment. Then we focused on unsupervised training using only the target data, and we saw that it's not good enough when compared with the baseline system. So we decided to take back some training data, applying some kind of unsupervised domain adaptation combining train and target data.
To conclude, we can say that if we don't have enough data, we absolutely need to use external data to bootstrap the system. But even using unlabeled target data, which is imperfectly clustered, we are able to improve the system with some kind of domain adaptation.
In our future work, we want to focus on the adaptation framework, where we would like to introduce a weighting parameter between train and target data. We would also like to work on the iterative procedure, because we think that if we are able to better estimate the PLDA parameters after one adaptation iteration, we should be able to improve the quality of the clusters, so some kind of iteration should be possible.
In fact, this work has already been done: we submitted a paper to Interspeech, and it will be presented there. I can already say that, using the weighting parameter, the results really do get better, and the iterative procedure also works: with two or three iterations, we are able to slowly improve the DER. Another way to improve remains to be seen, but we would like to try to bootstrap the system with any unlabeled data: for example, we could take the training data, ignore its labels, and perform cosine-based clustering, because we saw that in our approach maybe we didn't have enough data in the target corpus to apply this idea. So maybe bootstrapping with more unlabeled data could work.
Well, thank you, that was wonderful. Any comments or questions?
Thank you for the talk. I think this is more a comment than a question, but I believe that some of your problems with the EM for the PLDA are that your speaker subspace dimension is a high number.
I think that's part of the problem, with the dimensions I mentioned for the TV and PLDA: when we don't have enough target data, the problem is that it is difficult to estimate the one-hundred-dimensional PLDA parameters if you don't have that many speakers.
Did you try to reduce it? No, I didn't focus on that.
Thanks for the presentation. I think that in previous work ILP clustering was used for this task; how does it compare with the agglomerative clustering you used here?
Well, in my experiments, the results are not very different between ILP and agglomerative clustering; I just decided to use agglomerative clustering because it's simpler, and also faster in computation time, but there is not really a big difference between the two, I think.
dealing with these different internal extra so one thing i
see here and work was
what to use a way that i
why each latest specifically a little white here
No, we didn't weight the data; we just took the target clusters and the training clusters and put them together in the same dataset. If you look at the equations, it's the same as using a weighting parameter whose value is the relative amount of target data compared to training data, so it is almost equal to zero. That's why we need to work on the weighting, because for now we are not balancing it at all.
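(A tiny illustration of that implicit weight, with toy counts that are not the ones from the experiments:)

```python
# Illustration of the implicit weight when pooling without balancing:
# the effective weight of the target data is just its share of the pooled set.
n_train_ivectors = 20000   # hypothetical size of the labeled training set
n_target_ivectors = 300    # hypothetical size of the target speaker clusters
implicit_target_weight = n_target_ivectors / (n_train_ivectors + n_target_ivectors)
print(implicit_target_weight)  # ~0.015, i.e. almost zero, as stated above
```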
On a different note: in your clustering experiments, how do you decide how many clusters there are?
Well, the clustering is a function of a threshold, and we just select the threshold by experiment. That's why we chose two target corpora: this way we are able to do an exhaustive search for the threshold on one corpus and then check whether the same threshold applies to the other corpus. The clustering threshold is around zero.
We still have time for a few questions.
Okay, so I was curious: you mentioned that in this work the iteration did not seem to be helpful, but then you were somehow able to fix it in the next one. Can we know what it was? I mean, what do you think was the main problem?
In this work, the problem is that we don't introduce a weighting: we don't balance the influence of training and target data. In the combination of training and target data, we have so much training data that the implicit weighting is really in favor of the training data. When we change the balance between training and target data and give more importance to the target data, the system gets better results, and then you see that with the iterations you can improve some more, over two or three iterations.
We also did some kind of score normalization, because when you use the target data to obtain the PLDA parameters, the distribution of the PLDA scores also tends to shift a lot, so you need to normalize them to keep the same clustering threshold; otherwise you don't cluster at the same operating point at all.
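(The talk does not give the exact normalization recipe; one plausible form is to standardize the pairwise scores so that a fixed clustering threshold keeps roughly the same meaning after PLDA re-estimation, sketched below.)

```python
# One plausible score normalization (not necessarily the one used in the talk):
# zero-mean, unit-variance scaling computed from the off-diagonal scores.
import numpy as np

def normalize_score_matrix(scores):
    """scores: (n, n) similarity matrix; returns a standardized copy."""
    mask = ~np.eye(scores.shape[0], dtype=bool)
    mu, sigma = scores[mask].mean(), scores[mask].std()
    return (scores - mu) / sigma
```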
Okay, if there are no further questions, let's thank the speaker.