good morning everybody, my name is Giovanni Soldi, and I'm going to present our work on short-duration speaker modelling with phone adaptive training
I will first start with a brief introduction in which I will explain the main motivation of our work. I will then continue by explaining our approach, phone adaptive training, and then present the experimental setup and results. I will finally conclude with some conclusions and some future work directions.
So, phonetic variation is a significant source of variability, as already shown in many speech processing applications, such as short-duration speaker verification and speaker diarization. In both cases we find ourselves dealing with short-duration segments when we want to learn a speaker model.
So let's suppose we have a UBM model, and that we want to learn a speaker model from it by MAP adaptation. When we use a long utterance, or we have plenty of data, the estimated speaker model will be near to the ideal speaker model, since the phonetic variation will be marginalised. But when we use a short utterance, for example three seconds, in the MAP adaptation process, the estimated speaker model is far from the ideal speaker model, since the phonetic variation is not marginalised.
So the objective of our work is to improve speaker modelling by decreasing the variation due to the phonetic content while increasing the speaker discrimination. To do this, we started from a method called speaker adaptive training, or SAT, which is a technique commonly used in automatic speech recognition to reduce the speaker variation, to estimate models that are more phonetically discriminant, and to get better estimations.
So, the idea is that in the original acoustic feature space we can discriminate between speakers, and we can also discriminate between phones. In the SAT scenario, what SAT does is to project the acoustic features into a space in which all the phonetic information is retained while the speaker information is discarded.
So, if we interchange the roles of speakers and phones, we can reach the opposite result: suppress the phonetic variation while emphasizing the speaker variation. Again we have our original acoustic feature space, in which we can discriminate between phones and speakers, and what PAT does, following the idea of SAT, is to project the features into a new acoustic feature space in which all the speaker information is retained while we can no longer discriminate between phones.
So, PAT was first applied to speaker diarization, where we obtained a relative increase in speaker discrimination of twenty-seven percent and a corresponding relative decrease in phone discrimination; speaker and phone discrimination were calculated with the Fisher score, as we will see later. However, the improvement in the diarization error rate was disappointing.
So, what we do in this work is to try to optimize and evaluate PAT in isolation from the complexity of speaker diarization. This is done by addressing the problem of speaker modelling in the case where the training data is scarce, and by performing small-scale speaker verification experiments using a database that is manually labelled at the phonetic level.
So, moving on to our approach: for phone adaptive training we make extensive use of constrained maximum likelihood linear regression, CMLLR. CMLLR is a technique to reduce the mismatch between an adaptation dataset and an initial model by estimating an affine transformation.
So let's assume we have a Gaussian mixture model with initial mean mu and initial covariance sigma. CMLLR estimates an n-by-n transform matrix A, where n is the dimension of the feature space, and an n-dimensional bias vector b. We can then adapt the mean and the covariance of our initial model through these equations: we simply shift the mean, and we compute the covariance in this way.
The really important thing about CMLLR, and the reason we use it, is that besides transforming the model, we have the possibility of transforming the acoustic features: an adapted feature can be mapped back to our initial model by applying the inverse transformation.
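The slide equations are not recoverable from the recording; the following is a rough reconstruction consistent with the description above, under one common CMLLR parameterization (sign conventions vary across the literature, so take this as an assumption):

```latex
% Model-space adaptation: shift the mean, rotate/scale the covariance.
\hat{\mu} = A\mu - b, \qquad \hat{\Sigma} = A\,\Sigma\,A^{\top}
% Equivalent feature-space normalization: the inverse transform maps an
% observation back to the initial (unadapted) model.
\hat{x} = A^{-1}(x + b)
```

With this convention the two views agree: if x is distributed according to the adapted model, then A^{-1}(x + b) is distributed according to the initial one.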
So CMLLR is at the basis of PAT, which, as we said, aims to suppress the phonetic variability in order to provide more speaker-discriminative features.
So we suppose we have a set of S speakers and P phones, and a set of utterances that are parameterized by a set of acoustic features. We suppose also that we have a set of initial speaker models, which are again Gaussian mixture models.
What PAT does is to estimate a set of Gaussian mixture models that are normalized across phones: for each phone p we aim to estimate a CMLLR transform that captures the phone variation across the speakers. This is done by solving a maximum likelihood problem, and it is done iteratively.
So let's suppose we have a fixed number of iterations. In the first iteration we start by giving as input the initial features and the initial speaker models. For each phone we estimate the CMLLR transform that is common to all the speaker models, and we estimate this transform by using the data for that particular phone from all the speakers. We then use the transform to normalize the feature vectors, by applying the inverse transformation that we saw before. The normalized features are then used to estimate the set of speaker models that are normalized across phones. If we are at the last iteration, we obtain our final normalized features and our final speaker models, normalized across the phones; otherwise we feed the obtained features and the obtained speaker models into the next iteration.
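As a rough illustration of this loop (a sketch, not the authors' implementation): the code below keeps the structure of the iteration but replaces the full maximum-likelihood CMLLR estimation with a simple per-phone diagonal affine transform that maps each phone's statistics onto the global statistics, and it leaves the speaker-model re-estimation step as a comment. The helper names and the `features_by_phone` layout are my own.

```python
import numpy as np

def estimate_phone_transform(frames, global_mean, global_std):
    """Stand-in for CMLLR estimation: a diagonal affine transform (A, b)
    mapping this phone's frames onto the global statistics. The real
    method solves a maximum-likelihood problem against the speaker GMMs."""
    mean, std = frames.mean(axis=0), frames.std(axis=0) + 1e-8
    A = np.diag(global_std / std)   # diagonal scaling
    b = global_mean - A @ mean      # bias aligning the means
    return A, b

def pat_iterations(features_by_phone, n_iter=10):
    """Simplified PAT loop.  features_by_phone: dict phone -> (n, dim)
    array of frames, pooled across all speakers for that phone."""
    all_frames = np.vstack(list(features_by_phone.values()))
    g_mean, g_std = all_frames.mean(axis=0), all_frames.std(axis=0)
    feats = dict(features_by_phone)
    for _ in range(n_iter):
        # 1) one transform per phone (or acoustic class), shared by all speakers
        for phone, frames in feats.items():
            A, b = estimate_phone_transform(frames, g_mean, g_std)
            # 2) normalize the features (the real method applies the
            #    inverse CMLLR transform; here the transform is direct)
            feats[phone] = frames @ A.T + b
        # 3) re-estimate the speaker models on the normalized features
        #    (e.g. MAP adaptation from the UBM; omitted in this sketch)
    return feats

# Toy usage: two "phones" with different offsets and spreads.
rng = np.random.default_rng(0)
feats = {"aa": rng.normal(2.0, 1.0, (200, 3)),
         "iy": rng.normal(-1.0, 0.5, (200, 3))}
normalized = pat_iterations(feats, n_iter=3)
```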
So, in our case, when we deal with short-duration utterances, of course we don't have enough data to estimate a transform for each phone. What we do instead is to estimate the transform for an acoustic class, that is, a set of phones. The transforms and the acoustic classes are determined by using a binary regression tree, in which the root node is initialized with all the phones. According to linguistic rules, we split the nodes by choosing the split that maximizes the likelihood of the training data, and this is repeated as long as the increase in the likelihood is above a fixed threshold. When we reach the last split, we calculate one transform for each acoustic class, and the phones in an acoustic class share the same transform.
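A rough sketch of that greedy splitting follows (hypothetical helpers, not the authors' code): each candidate split of a node's phone set is scored by the gain in log-likelihood under single diagonal Gaussians, and splitting stops when the gain falls below the threshold. In the real system the candidate splits come from linguistic questions; here they are just arbitrary phone subsets passed in by the caller.

```python
import numpy as np

def loglik(frames):
    """Log-likelihood of frames under one diagonal Gaussian fit to them."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-8
    n, d = frames.shape
    return -0.5 * (n * d * np.log(2 * np.pi) + n * np.log(var).sum()
                   + ((frames - mean) ** 2 / var).sum())

def split_node(phones, data, questions, threshold):
    """Recursively split a phone set into acoustic classes.

    data: dict phone -> (n, dim) frame array; questions: candidate phone
    subsets (linguistic questions in the real system).  Returns the list
    of leaf classes."""
    node_frames = np.vstack([data[p] for p in phones])
    best_gain, best_split = -np.inf, None
    for q in questions:
        left = [p for p in phones if p in q]
        right = [p for p in phones if p not in q]
        if not left or not right:
            continue  # the question does not partition this node
        gain = (loglik(np.vstack([data[p] for p in left]))
                + loglik(np.vstack([data[p] for p in right]))
                - loglik(node_frames))
        if gain > best_gain:
            best_gain, best_split = gain, (left, right)
    if best_split is None or best_gain < threshold:
        return [phones]  # leaf: one acoustic class, one shared transform
    left, right = best_split
    return (split_node(left, data, questions, threshold)
            + split_node(right, data, questions, threshold))
```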
So, for our experimental setup, what we need in order to evaluate and optimize PAT in an ideal scenario is a database in which we have short-duration sentences, clear and accurate phonetic transcriptions, and a limited level of noise and channel variation. This is because we want to estimate the performance of PAT in an ideal, optimal scenario.
Taking these considerations into account, we concluded that the NIST databases do not fit our needs, due to the lack of phonetic transcriptions for the target speakers, to the channel variation, and to the different types of noise that compromise the recordings.
So we based our choice on the TIMIT database, because it is a collection of high-quality read speech sentences; each sentence lasts three seconds on average and is manually transcribed at the phonetic level; and the database has very limited noise and no channel variation.
The database is composed of six hundred and thirty speakers, of which four hundred and thirty-eight are male and one hundred and ninety-two are female. Each speaker contributes ten sentences with an average duration of three seconds.
We divided the database so that the data from four hundred and sixty-two speakers was used to learn the UBM, while the recordings from the remaining one hundred and sixty-eight speakers were used for the automatic speaker verification experiments. PAT performance is analyzed using from one to seven sentences to learn each speaker model.
So the first step was to segment our utterances into speech and non-speech segments according to the ground-truth transcriptions.
We then extracted the features, which were the classical mel-frequency cepstral coefficients: twelve coefficients plus energy, plus delta and acceleration coefficients.
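For illustration, an equivalent 39-dimensional front-end could be configured as below (a sketch using librosa, not the authors' tooling; the file path is a placeholder, and here the 0th cepstral coefficient stands in for the energy term):

```python
import librosa
import numpy as np

# Load a waveform (path is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 static coefficients: c1..c12 plus c0 as an energy proxy.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)            # first-order (velocity)
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)

features = np.vstack([mfcc, delta, delta2])    # shape (39, n_frames)
```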
We then estimated speaker models by MAP adaptation from the UBM models, with from four to one thousand and twenty-four GMM components.
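As a sketch of that step, here is standard relevance-MAP mean adaptation in the style of Reynolds et al. (not the authors' exact code; only the means are adapted, which is the common choice), given an already fitted UBM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0):
    """Relevance-MAP adaptation of the UBM means (weights and covariances
    are kept from the UBM).

    X: (n_frames, dim) features of one enrolment utterance;
    r: relevance factor controlling the strength of the UBM prior."""
    post = ubm.predict_proba(X)                   # (n, K) responsibilities
    n_k = post.sum(axis=0)                        # soft counts per component
    # First-order statistics, guarded against empty components.
    ex = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + r))[:, None]            # data/prior interpolation
    return alpha * ex + (1.0 - alpha) * ubm.means_
```

With a short three-second utterance most components receive tiny counts, so alpha stays near zero and the adapted means stay close to the UBM, which is exactly the data-scarcity problem described in the introduction.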
Then, using the initial features and the initial speaker models, we applied PAT, starting from acoustic classes that were obtained from the initial set of thirty-eight phones, and we finally obtained our normalized features and our normalized speaker models.
PAT performance was assessed on two different ASV systems: a traditional GMM-UBM system and a state-of-the-art i-vector PLDA system. We performed our baseline experiments with the initial set of features that we defined before, while the ASV experiments with PAT were performed using the PAT-normalized features and speaker models.
So now let me show the experimental results.
As I said before, to assess the speaker and phone discrimination we decided to use the Fisher score. Let's suppose we have a set of classes and a set of N labelled features, meaning that each feature is annotated with the class it belongs to.
The speaker or phone discrimination is then calculated through the Fisher score, which has at the numerator the inter-class distance, where mu is the mean of each class, and at the denominator the intra-class distance, which basically represents the spread of the features around their own mean within each class. So if the inter-class distance increases, the numerator is higher and the score increases; while if the features are more spread out around their class mean, the denominator is higher and the score decreases.
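In symbols, a reconstruction of the slide formula from this description (the class-size weighting n_c is my assumption), for classes c = 1, ..., C with class means mu_c, global mean mu, and within-class variances sigma_c^2:

```latex
F \;=\; \frac{\sum_{c=1}^{C} n_c \,\lVert \mu_c - \mu \rVert^{2}}
             {\sum_{c=1}^{C} n_c \,\sigma_c^{2}}
```

A larger F means the classes (speakers or phones) are better separated relative to their internal spread, which is why PAT should drive the speaker Fisher score up and the phone Fisher score down.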
In our experiments we calculated the speaker discrimination and the phone discrimination after ten iterations of PAT, and we show that the speaker discrimination has a relative increase of forty percent after ten iterations, while the phone discrimination has a relative decrease of fifty percent.
This is good, because it goes along with the results that we obtained in our previous work. Now let me pass to the automatic speaker verification experiments.
As you can see, we ran our speaker verification experiments using models from four to one thousand and twenty-four GMM components, for both the GMM-UBM and the i-vector PLDA systems. The first thing to notice is that the i-vector PLDA system performs much better than the GMM-UBM system; note that the scale is different between the two plots. We can also see that with PAT we always obtain better performance than the baseline system. Another thing to note is that with a lower model complexity we can reach better, or at least similar, performance compared to the baseline.
As a result, when we deal with one sentence, and also with seven sentences, we achieve comparable performance with the baseline but with a lower model complexity. For example, with only thirty-two GMM-UBM components we get the same performance as the baseline using two hundred and fifty-six components, and the same happens in the i-vector system, where with PAT we got better performance with thirty-two components compared to the baseline system with two hundred and fifty-six components.
In these two tables I present the results independently of the model size, that is, the results obtained using the optimal model size for the speaker model. We can see that for the i-vector PLDA system there is more than a fifty percent improvement in performance, of course in an ideal and optimized environment, while for the GMM-UBM system we get better performance when using one, two and three training sentences, and comparable results when using five and seven sentences.
In these two plots we can see the same results as before, when using one single sentence. We can see that we have a fifty percent decrease in the EER of the i-vector PLDA system, while in the GMM-UBM system the lines are less far apart, but we still have a decrease in EER from around forty-two to thirty-six percent.
So, to conclude: in this work we addressed the problem of speaker modelling in the case where training data is scarce, that is, when using short-duration utterances. We optimized and evaluated PAT at the speaker modelling level by performing small-scale speaker verification experiments using the TIMIT database, which is labelled at the phonetic level.
We showed that PAT improves significantly the performance of both systems, GMM-UBM and i-vector PLDA. It is also worth noting that PAT is able to provide equivalent or better performance with a lower model complexity.
For future work, we aim to go back to our original goal, which is PAT for speaker diarization. We want to explore automatic approaches to infer acoustic class transcriptions, because PAT does not actually need phonetic transcriptions: as long as we are able to label the data in a way that lets us map the features to a particular acoustic class, we can calculate the transform for that particular class and finally improve the performance of the system. One final direction is to explore speaker-independent approaches to phone normalization.
[Audience] A question: your i-vector extractor was trained with short sentences as well? Because the PLDA was obviously trained, I think, with the sentences. So you didn't, for example, merge sentences from the same speaker into one big sentence and train the i-vector extractor that way, for example using all the sentences of one speaker put together?

[Speaker] Okay, so just to be sure I understood correctly: we used short sentences, so that is exactly what you said.

[Moderator] Other questions? We have a couple of minutes.
[Audience] I think TIMIT is a much too simple database, because the error rates you obtain are close to zero percent. So what are the challenges, in text-dependent recognition for example? I mean, this should be applied everywhere if it works so well; what are the challenges? Sorry, what I mean is: in real life this is not used in many systems, right? It should be employed everywhere.
[Speaker] So, actually the goal of this work was to optimize and evaluate PAT in an ideal scenario because, as I said, we first tried to apply it to speaker diarization, but the problem was that we could not optimize it well enough there, and the result was disappointing: we got a really small improvement in the diarization error rate. So before saying that PAT does not work, we tried to find the upper-limit performance that can be shown by PAT.

[Audience] I mean, you have been using TIMIT, but there are many versions of TIMIT: TIMIT with noise conditions, or with telephone bandwidth conditions; it has been transformed in many ways. So why don't you use those?

[Speaker] Because, as I said, TIMIT has the phonetic transcriptions, and also, since we wanted to optimize PAT in ideal conditions, we chose this one.
[Audience] I think it would be interesting to see that. If I can raise a more primary point very quickly: the major impediment to progress in this field is the lack of data. I2R has been very generous in making available the RSR data, but that is pretty much the only dataset of this sort that we have to work on. I mean, my own experience with real data is that the problem really is hard. We are not going to be able to make progress, as far as I can see, unless we find some way of sharing resources among researchers; we need that on this problem. If you are working with an industrial partner that collects some data, there might be mutual benefit in sharing it; then we could probably make some progress. Otherwise I do not see how we are going to solve this one.
[Moderator] Thank you for those points, Patrick. I just want to mention that at Odyssey two thousand and one, when we became Odyssey, there was considerable effort put forward to creating standard text-dependent corpora to distribute to the participants, both existing corpora and new ones. We put together these nice text-dependent datasets, we distributed them to the Odyssey members in advance, and we planned to have a whole track on text-dependent speaker verification; and the sad news was that only a couple of sites participated. So I think Craig Greenberg was implying maybe a similar issue with the HASR evaluations. A lot of these efforts, you know, have to be a two-way street: somebody goes to the effort and expense of putting together corpora, and then a reasonable number of participants has to want to take on the challenge. So if there has been a shift in interest towards text-dependent verification, I think it would be good, as a community, to get together, figure that out, and put together some evaluation.