Good morning, everyone. I'm not sure if you noticed, but this is the only speaker recognition talk in this session, which makes me feel somehow like the distant relative that the family invites but, you know, doesn't really want to.
Today I'm going to present some of the recent advances in our speaker recognition system, and I will share some results that we obtained with the system on the NIST SRE 2010 extended core tasks, with an emphasis on the telephony condition, which is condition 5. This is joint work with Sriram Ganapathy, who is now an assistant professor at IISc Bangalore, India, and with Jason Pelecanos.
I will start with a brief overview of some of the recent state-of-the-art work in speaker recognition, and then I will share the objectives of my talk. I will present our speaker recognition system and the key components that contributed the most towards the end results. I'll describe our experimental setup: the data we used, the DNN acoustic models and their configurations, as well as the speaker recognition system configuration. And I'll share with you, as I said, the results we obtained with the system on the NIST SRE 2010 extended core tasks, mostly on condition 5, and then compare them with prior work.
When we look at the recent state-of-the-art work on speaker recognition, first of all, most state-of-the-art systems are i-vector based, and they use a universal background model (UBM) in some form to generate the statistics needed to compute the i-vectors. Looking at this over time, we started with traditional unsupervised Gaussian mixture models to represent the UBMs, and then more recently moved to phonetically-aware UBMs, which are derived from an ASR system.
I would like to emphasize here that, even though this work done at IBM does not get much credit, it was in fact the first work that used senones to compute the hyper-parameters of the UBM for speaker recognition, and it achieved state-of-the-art results as a single system on the NIST SRE 2010. After this came the work from SRI, which used DNN-based senone posteriors to compute the UBM parameters.
More recently, there was the work from Johns Hopkins University, which used TDNN-based posteriors to compute the UBM parameters. In fact, they found that, contrary to what SRI had found with diagonal covariance matrices, if you estimate the UBM parameters with full covariance matrices, you can use a supervised UBM directly to compute the statistics and from there compute the i-vectors, without necessarily going through the hassle of the DNN-based system. This saves a lot of computation, and they had nice gains as well.
Some of the state-of-the-art systems do not use the DNN posteriors to compute the UBM hyper-parameters; instead, they use DNN bottleneck features, and the rest of the i-vector-based speaker recognition pipeline remains the same.
I have mentioned some of the recent work here, but I would also like to give some credit to the Heck et al. work from 1998, which was the first to explore bottleneck-based features for speaker recognition.
So, the objectives of my talk today. I will be sharing our state-of-the-art results on the NIST SRE 2010 extended core tasks; again, our emphasis is on the telephony condition, which is condition 5. I will be presenting the key system components that contributed the most towards achieving these results. Namely, I will talk about the fMLLR-based features that we used, and compare them with more traditional raw acoustic features such as MFCCs. We also used a DNN-based acoustic model, in place of an unsupervised GMM acoustic model, for the UBM.
This is technically not novel: DNN-based i-vectors have been around for a while now. What we did here is nearly double the size of the senone set, and we wanted to see how that impacts speaker recognition performance.
Finally, we explored nearest neighbor discriminant analysis (NDA) to achieve inter-session variability compensation in the i-vector space, and we compared its performance with the more commonly used LDA. We also quantify the contribution of each of these three system components towards the overall performance; in fact, we will also see how varying, for example, the size of the senone set impacts the performance.
Now let's take a look at our speaker recognition system. You can see its flowchart here; this assumes that all the model parameters are already trained, so we have the DNN acoustic model, the i-vector extractor, and the NDA and PLDA models.

The three components I just mentioned, let me repeat them. First, fMLLR-based features, which are used both to train and evaluate the DNN and to compute the sufficient statistics for i-vector extraction; with fMLLRs you can achieve speaker and channel normalization. Second, a DNN acoustic model, instead of an unsupervised GMM acoustic model, to compute the i-vectors; again, compared to the previous work, we nearly doubled the size of the senone set. Third, we replaced the more commonly used LDA with NDA for inter-session variability compensation, and used PLDA scoring, which I'm sure you're familiar with.
If we look at the previous work with DNN senone i-vectors, what we observe is that many systems use two different sets of features: one to compute the posteriors, and one to compute the sufficient statistics. Typically, ASR features are different from speaker recognition features, which makes sense. In this work, we wanted to see what happens if we unify them, that is, use the same set of features both to train and evaluate the DNN and to compute the sufficient statistics for i-vector extraction. Towards that end, we considered feature-space maximum likelihood linear regression (fMLLR) transforms, whose outputs are used as the features for our DNN system.
The fMLLR transform is a linear transform like this, which can be decomposed into a linear part A and a translation (bias) b. These parameters are obtained using the alignments from a first pass through a GMM-HMM system, and maximum likelihood estimation then gives us A and b; I will not repeat the details here. Applying this transform to raw acoustic features such as MFCCs, or even to already transformed features such as LDA features, gives us speaker- and channel-normalized features.
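Concretely, once the transform has been estimated, applying it is just an affine mapping per frame. Here is a toy sketch in NumPy; the transform values and dimensions below are hypothetical, and in practice W would come from the maximum likelihood estimation against the GMM-HMM alignments mentioned above:

```python
import numpy as np

# Apply a speaker-specific fMLLR transform W = [A | b] to a matrix of
# acoustic features, one frame per row: y_t = A x_t + b.
def apply_fmllr(features, W):
    """features: (T, d) frames; W: (d, d+1) fMLLR transform [A | b]."""
    A, b = W[:, :-1], W[:, -1]
    return features @ A.T + b

d = 4                                        # toy feature dimension
rng = np.random.default_rng(0)
X = rng.standard_normal((10, d))             # 10 frames of raw features
W = np.hstack([np.eye(d) * 0.9,              # toy linear part A
               np.ones((d, 1)) * 0.1])       # toy bias b
Y = apply_fmllr(X, W)
print(Y.shape)  # (10, 4)
```

The same transformed frames then feed both the DNN and the statistics accumulation, which is what unifies the ASR and speaker recognition front ends.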
This may sound contradictory, by the way, because fMLLR is used to reduce speaker variability; but, as we know, there are two kinds of speaker variability: within-speaker and between-speaker. Here, we believe that the within-speaker normalization fMLLR provides dominates the between-speaker normalization. We also get the benefit of channel normalization: if we have different handsets, for example, we think that fMLLR can take care of that as well.
Now, as I mentioned, DNN senone i-vectors have been around for a while, so there is nothing technically new in this slide. The only difference is that we nearly doubled the size of our senone set, compared to the previous work, to compute the posteriors and, from there, the sufficient statistics.
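As a reminder of how the senone posteriors enter the pipeline, here is a minimal sketch, with toy shapes and random data, of the zeroth- and first-order Baum-Welch statistics, where the DNN's per-frame senone posteriors play the role the GMM component occupancies play in a classical UBM system (the function name is mine, not from any particular toolkit):

```python
import numpy as np

def sufficient_stats(features, gammas):
    """features: (T, d) frames; gammas: (T, C) per-frame senone posteriors.
    Returns zeroth-order stats N (C,) and first-order stats F (C, d)."""
    N = gammas.sum(axis=0)       # N_c = sum_t gamma_c(t)
    F = gammas.T @ features      # F_c = sum_t gamma_c(t) * x_t
    return N, F

T, d, C = 50, 3, 8               # frames, feature dim, senones (toy sizes)
rng = np.random.default_rng(1)
X = rng.standard_normal((T, d))
logits = rng.standard_normal((T, C))
G = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
N, F = sufficient_stats(X, G)
print(N.shape, F.shape)  # (8,) (8, 3)
```

The i-vector extractor consumes these statistics exactly as in the GMM case; only the source of the posteriors changes.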
So I'm not going to spend much time on this; at this point, we know how to compute i-vectors even with 10k senones.
Just to connect this work to one of the presentations yesterday: the speaker talked about how i-vector distributions are not necessarily Gaussian, and he actually showed us some distributions, and that was even on clean data, not on noisy data. Now, LDA is formulated based on Gaussian distribution assumptions for the individual classes; even if they are not Gaussian, they need to be at least unimodal. Therefore, LDA cannot effectively handle multimodal data, which is typical in the NIST SRE type of scenario, because the data come from various sources: we have Switchboard data and Mixer data, and that causes multimodality in the i-vectors. Also, for applications such as language recognition, because we only have a few classes, the LDA transform can be rank-deficient, so we might take a hit from that as well.
So, instead of trying to transform the i-vector space so that it is more Gaussian-like, as was presented yesterday, here we tried to use a transform that does not assume Gaussianity and does not use the global structure of the classes to compute the between-class scatter matrix. LDA uses the class centroids: the differences between class centroids, the arrows you see here, are what it uses to compute the between-class scatter. In NDA, we do not assume any global structure for the individual classes; rather, we assume that classes are only locally structured. So we use local means, computed from the k nearest neighbors of each individual sample, and then use those differences to compute the between-class scatter matrix.
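For contrast, the centroid-based between-class scatter that LDA relies on can be sketched in a few lines; this is an illustrative toy implementation with made-up data, not production code:

```python
import numpy as np

def lda_between_scatter(X, y):
    """Classical LDA between-class scatter: class-size-weighted outer
    products of (class centroid - global mean)."""
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sb

# two tiny 2-D classes
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 1.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
Sb = lda_between_scatter(X, y)
print(Sb.shape)  # (2, 2)
```

Everything here is driven by a single centroid per class, which is exactly the global-structure assumption that NDA replaces with local nearest-neighbor means.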
Another point is that we introduce a weighting function, which emphasizes the samples near the classification boundary, since those are the most important ones for discriminating between the classes; a sample far from the boundary gets a very small weight, because it contributes little towards the class discrimination.
Also, unlike LDA, given enough examples for the different classes, NDA can always be full rank. It is therefore very useful for applications such as language identification; we published that work, I believe in 2015, and we actually obtained some gains over LDA there.
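To make the NDA between-class scatter concrete, here is a minimal sketch of the local-mean construction just described; this is my own illustrative implementation, not our production code, and the choices of k, the distance exponent alpha, and the toy data are arbitrary:

```python
import numpy as np

def knn_mean_and_dist(x, data, k):
    """Local mean of the k nearest neighbours of x in `data`, and the
    distance from x to its k-th nearest neighbour."""
    d = np.linalg.norm(data - x, axis=1)
    idx = np.argsort(d)[:k]
    return data[idx].mean(axis=0), d[idx[-1]]

def nda_between_scatter(X, y, k=3, alpha=2):
    """Weighted between-class scatter built from local k-NN means
    instead of class centroids, as in NDA."""
    dim = X.shape[1]
    Sb = np.zeros((dim, dim))
    for i in np.unique(y):
        Xi = X[y == i]
        for j in np.unique(y):
            if j == i:
                continue
            Xj = X[y == j]
            for t in range(len(Xi)):
                x = Xi[t]
                # within-class neighbours must exclude the sample itself
                _, d_i = knn_mean_and_dist(x, np.delete(Xi, t, axis=0), k)
                m_j, d_j = knn_mean_and_dist(x, Xj, k)
                # boundary-emphasising weight: ~0.5 near the boundary,
                # ~0 deep inside the class
                w = min(d_i**alpha, d_j**alpha) / (d_i**alpha + d_j**alpha)
                diff = (x - m_j)[:, None]
                Sb += w * (diff @ diff.T)
    return Sb

rng = np.random.default_rng(0)
# two toy clusters standing in for i-vector classes
X = np.vstack([rng.normal(0, 1, (12, 5)), rng.normal(3, 1, (12, 5))])
y = np.array([0] * 12 + [1] * 12)
Sb = nda_between_scatter(X, y)
print(Sb.shape)  # (5, 5)
```

Together with the usual within-class scatter, this matrix defines the NDA projection through the same generalized eigenvalue problem used for LDA.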
Our experimental setup. For training data, we extracted English telephony and microphone data from the NIST 2004 through 2008 SRE data; we also used Switchboard data, both cellular and landline. This resulted in a total of 60k recordings to train our system hyper-parameters. For evaluation, we considered the NIST 2010 SRE extended evaluation set; the reason we chose the NIST SRE 2010 over 2012 is that we had some anchors with which to compare the performance of our system against other sites. The conditions we considered were condition 1 to condition 5; you can see the details here, but I want to emphasize again that our focus is on condition 5, which is telephone speech with a mismatch between enrollment and test: the types of phones used in enrollment and test are not necessarily the same.
Our DNN acoustic model had seven hidden layers, six of them with 2048 hidden units each, plus a bottleneck layer with 512 units; we used Fisher data to train it. In addition to the original 10k senones, we also considered 2.4k posteriors, basically to see how varying the granularity of the output layer affects speaker recognition performance.
As for the setup of our speaker recognition system, we used a 500-dimensional total variability subspace, which was reduced to 250 dimensions using NDA or simply LDA, trained on the entire training set; we report equal error rate (EER) as well as minDCF08 and minDCF10. Note that we also considered both the 2.4k and the 10k senone sets for i-vector extraction.
In terms of results, let us first compare NDA and LDA. These results were obtained with MFCC features, a 2048-component Gaussian mixture model, and the 10k DNN, and are reported on condition 5. As we can see, no matter which type of acoustic model we use, NDA always provided a nice benefit over LDA, across all three metrics. The reason, as I mentioned, is that NDA can handle non-Gaussian and multimodal data more effectively than LDA.

For the comparison of MFCCs versus fMLLR features, again on condition 5 with the 10k DNN, we can see that, whether with LDA or NDA, we always get an improvement with fMLLR features over MFCCs; the reason is that fMLLRs provide speaker and channel normalization. Also note that we unified the speaker recognition and speech recognition features this way, so the system is even simpler; but we should also take into account the fact that, in order to compute the fMLLR transforms, we need a two-pass system rather than a single-pass one.

To measure the impact of the senone set size, we considered 2.4k versus 10k posteriors; as we increase the senone set size, the results improve. We also ran a 32k senone experiment, just to see how it impacts performance; by the time that experiment finished, we did not see much gain with 32k senones.
I just want to emphasize here that, in contrast to what we see with the DNN, if you increase the number of components in a GMM, at least with diagonal covariance matrices, you do not see these gains: increasing the number of GMM components beyond 2k yields marginal gains, if not degradations.
Now, as they say, a picture is worth a thousand words, so I have distilled this work into a table. First, you can see how NDA compares to LDA with both the GMM-based and the DNN-based systems; the performance gap is larger when we use GMMs to compute the posteriors for the i-vectors, and with the DNN, as we increase the size of the senone set, this gap in performance narrows. Secondly, we can compare the 2.4k versus the 10k DNN senone performance.
Here is the progression of our system over time. We started with a very basic system: GMMs, MFCCs, and LDA. We replaced the LDA with NDA and got a gain; we then replaced the GMM with the DNN, and the MFCCs with fMLLRs, and got a further boost in performance. As a result, we have the best published performance on the NIST SRE 2010, at least on condition 5; for the other conditions, we believe those are also the best published performances, but I refer you to the paper for more details.
Because we claim the best published results, I wanted to give credit to two other works. One of them reports a 1.09% equal error rate, but with a gender-dependent system, whereas our system is gender-independent. The other work also used gender-dependent systems and only reported results on female trials, so I am not sure how we can compare against its numbers.
In conclusion, I presented our speaker recognition system and the components it is built from; I shared our results with you and quantified the contribution of the different components. If you are interested in further progress on our system, please come visit us at the IBM speech group; or, you know, if you buy me a cookie after this, I might be able to share more details with you. Thank you.

We have time for some questions.
Q: Thank you for your presentation. My question is about the weights in the NDA computation: are those the weights from the original NDA formulation that you mentioned?
A: Yes, those weights are as in the original NDA. You said that the data points close to the boundaries get more weight; let's take a look at how things are computed. The numerator is the minimum of two distances: the distance between a sample and its k-th nearest neighbor within its own class, and the distance to its k-th nearest neighbor from the competing class j. This is divided by the sum of the two distances. So, if the sample is not close to the boundary, it is going to be much closer to its own class's k nearest neighbors than to the other class's; the numerator is then small compared to the denominator, and you get a weight close to zero. If, on the other hand, the sample is close to the boundary, the two distances become comparable, so the numerator approaches half the denominator. That means samples near a classification boundary get a weight of about 0.5, and samples far from the boundary get a weight near zero.
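A tiny numeric illustration of that ratio, with hypothetical distances and a quadratic distance exponent as one possible choice:

```python
def nda_weight(d_own, d_other, alpha=2):
    """Boundary-emphasising NDA weight from the two k-th nearest
    neighbour distances: own class vs. competing class."""
    return min(d_own**alpha, d_other**alpha) / (d_own**alpha + d_other**alpha)

# Deep inside its own class: own-class neighbours are much closer.
print(round(nda_weight(0.1, 5.0), 4))  # 0.0004 -> essentially ignored
# Right on the boundary: the two distances are comparable.
print(round(nda_weight(1.0, 1.1), 4))  # 0.4525 -> close to the 0.5 cap
```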
Q: Can you explain conceptually what that means? Why do the samples near the boundaries get more weight? The samples which are far from the boundary are far anyway, but what do they contribute, or fail to contribute, to the estimation of the between-class scatter, if one assumes the classes are Gaussian?
A: Well, even if a class is Gaussian, the data points that are far away from its mean are like outliers, right? The samples near the boundary are the ones that can be confused between classes, so they are the ones that carry the discriminative information. And keep in mind that the training set already has the labels: we know those distant samples belong to their class and are far from the classification boundary, so they add little to the class discrimination.
Okay? Thank you.
Q: Thank you for your talk. I actually have a question regarding the implementation of the NDA. In the papers I have seen several variants, for example for the within-class covariance: did you use the classical one, or something specific to this work?

A: For this work we used the classical one: we computed the within-class scatter matrix in exactly the same way it is computed for LDA.
A: For the k nearest neighbors, we used one-versus-rest: that means that for each class, you consider that class versus all the other classes and compute the nearest neighbors accordingly.

Q: That was actually my follow-up question: apart from the computational time, which would obviously differ, do the final results change at all with the pairwise alternative?
A: It was so slow that I never explored it.

Q: Thank you.