Hi everybody. In this talk I'm going to present a method that uses HMMs for aligning frames to the states and to the Gaussian components in text-dependent speaker verification, and that also uses deep neural networks for improving the performance of text-dependent speaker verification.

Text-dependent speaker verification is the task of verifying both the speaker and the phrase, and this phrase information can be used for improving the performance.
We propose phrase-dependent HMM models for aligning the frames to the states and also to the Gaussian components. By using an HMM we use the phrase information, and we can also take the frame order into account. Using the HMM also reduces the uncertainty in the i-vector estimation: if we take the average of the i-vector posterior covariance as the uncertainty, this method reduces the uncertainty by about twenty percent compared to the GMM.
In addition, we tried using deep neural networks for reducing the gap between the GMM and HMM alignments, and also for improving the performance of the system.

First, briefly, the general i-vector based system. In the i-vector system we model the utterance-dependent GMM mean supervector s with this equation: s = m + T w, where m is the UBM mean supervector, T is the total variability matrix, and w is the i-vector.
In the i-vector system we need the zero and first order statistics for training and for extracting i-vectors. You can see the equations: N_c = sum_t gamma_t(c) and F_c = sum_t gamma_t(c) o_t. In these equations gamma_t(c) is the posterior probability that frame o_t was generated by the specific Gaussian component c. Gamma can be computed in different ways: from the GMM-UBM, or also from an HMM model.
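To make the statistics concrete, here is a minimal numpy sketch of collecting them from a diagonal-covariance GMM-UBM; the parameter names (weights, means, covs) are placeholders of my choosing, not anything from the talk:

```python
import numpy as np
from scipy.stats import multivariate_normal

def collect_stats(frames, weights, means, covs):
    """Zero- and first-order Baum-Welch statistics for one utterance.

    frames : (T, D) array of feature vectors o_t
    weights, means, covs : UBM parameters for C diagonal Gaussians
    Returns N of shape (C,) and F of shape (C, D).
    """
    C = len(weights)
    T = frames.shape[0]
    log_lik = np.empty((T, C))
    for c in range(C):
        log_lik[:, c] = np.log(weights[c]) + multivariate_normal.logpdf(
            frames, mean=means[c], cov=np.diag(covs[c]))
    # gamma_t(c): posterior of component c for frame t (normalize per frame)
    log_norm = np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    gamma = np.exp(log_lik - log_norm)
    N = gamma.sum(axis=0)      # zero-order statistics
    F = gamma.T @ frames       # first-order statistics
    return N, F
```

With HMM alignment, gamma would instead come from the forced-alignment state occupancies described later.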
When you want to use an HMM instead of the UBM in text-dependent speaker verification, you have several choices. The first one is using phrase-dependent HMM models; in this case you have to train an i-vector extractor for each phrase. This is suitable for a common passphrase and also for text-prompted speaker verification, but you need sufficient training data for each phrase, so it is not practical for real applications of text-dependent speaker verification. Another choice is tied-mixture HMMs, and the last method is phrase-independent HMM models.
In the phrase-independent method we use a monophone structure, the same as in speech recognition, and we train a model for each phrase by concatenating the monophone models. We then extract sufficient statistics for each phrase and convert them into the same shape for all phrases, so that we can, for example, train one i-vector extractor for all phrases. In this method we don't need a large amount of training data for each of the phrases: the HMMs can be trained using any transcribed data.
The first stage of this method is training a phone recognizer and constructing a monophone left-to-right HMM for each phrase, and then doing Viterbi forced alignment to align the frames to the states. In each state we then extract sufficient statistics in the same way as for a simple GMM.
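As an illustration of the alignment step, here is a minimal numpy sketch of Viterbi forced alignment over a left-to-right HMM; the per-frame emission log-likelihoods are assumed to be precomputed, and this toy function is a stand-in, not the decoder we actually used:

```python
import numpy as np

def viterbi_force_align(emit_logp, trans_logp):
    """Forced alignment of T frames to S left-to-right HMM states.

    emit_logp  : (T, S) per-frame state emission log-likelihoods
    trans_logp : (S, S) transition log-probs (left-to-right: only
                 self-loops and moves to the next state are finite)
    Returns the best state index for every frame.
    """
    T, S = emit_logp.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = emit_logp[0, 0]          # must start in the first state
    for t in range(1, T):
        scores = delta[t - 1][:, None] + trans_logp   # scores[i, j]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + emit_logp[t]
    path = np.zeros(T, dtype=int)
    path[-1] = S - 1                       # must end in the last state
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```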
The statistics extracted this way have a different shape for each phrase, and you have to change them to a unique shape to be able to train one i-vector extractor for all of the phrases.
In the bottom of this figure you can see the zero and first order statistics; the colors mark the phrase-specific statistics. We simply sum the parts of the statistics that are associated with the same state of the same phone. After that, we train an i-vector extractor exactly as in text-independent speaker verification.
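A minimal sketch of this summing step, assuming a hypothetical state_map table that tells which shared monophone state each phrase-HMM state is tied to:

```python
import numpy as np

def pool_stats(N_phrase, F_phrase, state_map, n_shared):
    """Map phrase-specific statistics to a shared monophone-state shape.

    N_phrase  : (S,) zero-order stats of one phrase HMM with S states
    F_phrase  : (S, D) first-order stats
    state_map : (S,) index of the shared monophone state each phrase
                state corresponds to (hypothetical mapping table)
    n_shared  : total number of shared monophone states
    Stats of phrase states tied to the same monophone state are summed,
    so every phrase ends up with statistics of the same shape.
    """
    D = F_phrase.shape[1]
    N = np.zeros(n_shared)
    F = np.zeros((n_shared, D))
    np.add.at(N, state_map, N_phrase)
    np.add.at(F, state_map, F_phrase)
    return N, F
```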
For channel compensation and scoring, text-dependent speaker verification has a problem with PLDA: it has been shown that the performance of PLDA is not very good, and sometimes the performance of the baseline GMM-UBM is better than PLDA. Also, because in text-dependent speaker verification the training data is really limited, both in the number of speakers and in the number of samples per phrase, we cannot use a simple LDA. To reduce the effect of the small sample size, we instead use a regularized WCCN: we just add some regularization to the within-class covariance matrix of each class, and everything else is exactly the same as the simple WCCN.
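A minimal numpy sketch of one plausible form of this regularization, adding a scaled identity to each within-class covariance; the exact regularizer here is an assumption for illustration, not necessarily the one we used:

```python
import numpy as np

def regularized_wccn(ivectors, labels, alpha=0.1):
    """WCCN with a simple regularizer for small per-class sample sizes.

    ivectors : (N, D) length-normalized i-vectors
    labels   : (N,) class labels (e.g. speaker, or speaker+phrase)
    alpha    : regularization weight (assumed form: scaled identity
               added to each within-class covariance)
    Returns B such that a transformed i-vector is  w @ B.
    """
    D = ivectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((D, D))
    for c in classes:
        X = ivectors[labels == c]
        Xc = X - X.mean(axis=0)
        W += Xc.T @ Xc / len(X) + alpha * np.eye(D)
    W /= len(classes)
    # the WCCN transform is the Cholesky factor of the inverse
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B
```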
Also, in text-dependent speaker verification, because the utterances are very short, you have to use phrase-dependent transforms, which for us means the phrase-dependent regularized WCCN, and also phrase-dependent score normalization, especially when you use the HMM for aligning the frames. We use cosine similarity for scoring and s-norm for score normalization.
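A short sketch of the scoring and normalization, assuming the cohort scores for s-norm are computed against a phrase-dependent impostor cohort:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between two (transformed) i-vectors."""
    return w_enroll @ w_test / (
        np.linalg.norm(w_enroll) * np.linalg.norm(w_test))

def s_norm(score, enroll_cohort, test_cohort):
    """Symmetric score normalization.

    enroll_cohort : scores of the enrollment i-vector against a cohort
                    of impostor i-vectors (phrase-dependent in our case)
    test_cohort   : scores of the test i-vector against the same cohort
    """
    z = (score - enroll_cohort.mean()) / enroll_cohort.std()
    t = (score - test_cohort.mean()) / test_cohort.std()
    return 0.5 * (z + t)
```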
For reducing the gap between the HMM and GMM alignments, we can use a DNN in two scenarios. The first one is to use the DNN for calculating the posterior probabilities, exactly the same as is done in text-independent speaker verification. The other choice is using the DNN for extracting bottleneck features to improve the GMM alignment: with better, more phoneme-aware features, the clustering obtained by the GMM, and hence the performance of the GMM-based system, improves.
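In the first scenario only the source of gamma changes; reusing the earlier statistics sketch, it could look like this (the senone posteriors are assumed to come from some frame-level DNN, which is a stand-in here):

```python
import numpy as np

def collect_stats_dnn(frames, senone_post):
    """Zero/first-order stats with DNN posteriors as the alignment.

    frames      : (T, D) acoustic features used for the statistics
    senone_post : (T, C) per-frame senone posteriors from the DNN,
                  playing the role of gamma_t(c) instead of GMM posteriors
    """
    N = senone_post.sum(axis=0)
    F = senone_post.T @ frames
    return N, F
```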
For the networks, we use stacked bottleneck features. In this topology, two bottleneck networks are connected to each other: the bottleneck layer outputs of the first stage form the input of the second stage, and we use the bottleneck layer outputs of the second stage as the features for the whole system. We used two different networks: one is used only for extracting bottleneck features and has about eight thousand senone targets, and another one is used both for extracting bottlenecks and for calculating the posterior probabilities, and has about one thousand senones. As input features we used log Mel-scale filterbank outputs and also three fundamental frequency features.
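As a rough illustration of the stacked-bottleneck topology (untrained, randomly initialized stand-in networks; the layer sizes and context offsets here are assumptions for the sketch, not our exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_bottleneck(x, dims, bn_index):
    """Toy MLP that returns the linear bottleneck layer activations.

    Weights are random stand-ins; a real system would train the network
    to classify senones and then cut it at the bottleneck layer.
    """
    h = x
    for i in range(1, len(dims)):
        W = rng.standard_normal((dims[i - 1], dims[i])) * 0.01
        h = h @ W
        if i == bn_index:
            return h            # linear bottleneck output
        h = np.tanh(h)

def stack_context(feats, offsets):
    """Concatenate the frames at the given offsets around each frame."""
    T = feats.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    return feats[idx].reshape(T, -1)

# first stage: filterbank(+F0) input -> 80-dim bottleneck
fbank = rng.standard_normal((200, 27))          # placeholder features
bn1 = mlp_bottleneck(stack_context(fbank, range(-5, 6)),
                     [27 * 11, 1500, 80, 1500], bn_index=2)
# second stage: sub-sampled context of first-stage bottlenecks -> final features
sbn = mlp_bottleneck(stack_context(bn1, [-10, -5, 0, 5, 10]),
                     [80 * 5, 1500, 80, 1500], bn_index=2)
print(sbn.shape)                                # (200, 80)
```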
For the experiments we used Part 1 of the RSR2015 dataset. In RSR2015 there are three hundred speakers, one hundred and fifty-seven males and one hundred and forty-three females, each of which pronounces thirty different phrases from TIMIT in nine distinct sessions. Three sessions are used for enrollment, by averaging the i-vectors, and the others for testing. We just use the background set for training, and the results are reported on the evaluation set.
For training the DNNs we used the Switchboard dataset. As features we used different acoustic features: thirty-nine-dimensional PLP features and also sixty-dimensional MFCC features, both of them extracted from the 16 kHz signal, and two versions of the bottleneck features, which were extracted from 8 kHz data. For VAD we used a supervised method, a silence model that just drops the initial and final silences of each utterance. After that we applied cepstral mean and variance normalization.
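The normalization step is straightforward; a per-utterance sketch:

```python
import numpy as np

def cmvn(feats):
    """Per-utterance cepstral mean and variance normalization.

    feats : (T, D) feature matrix; each dimension is shifted to zero
    mean and scaled to unit variance over the utterance.
    """
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)
```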
We used four-hundred-dimensional i-vectors, which are length-normalized before the regularized WCCN, and, as I said, we use the phrase-dependent regularized WCCN and s-norm with the cosine distance for scoring.
In this table you can see the comparison between different features and also different alignment methods. In the first section of the table you can compare the performance of the GMM and HMM alignments, and you can see that the HMM significantly improves the performance. Comparing the DNN alignment with the HMM, you can see that the DNN alignment also improves the performance; especially for females the performance is better than with the HMM alignment when we use just cepstral features.

When we use bottleneck features, the performance of the GMM alignment increases, as you can see by comparing these two numbers and also the others. For the HMM-based system the performance is better for females, while for males we get some deterioration in performance. For the DNN, when we use bottleneck features together with the DNN alignment, you can see some deterioration in performance when we use both of them.

In the last section you can see the results of the bottleneck features concatenated with the MFCC features. In this case we got the best results: for both the HMM and the GMM case, you can see that with these features the performance of the GMM is very close to the HMM one, but again for the DNN the performance is not as good. Because the performance of the HMM alignment is better than the others, we report only the results of this method in the next table.
In the next table, in the first section, we compare the performance of the different features: MFCC, PLP, and two bottlenecks, one of them extracted from a smaller network. You can see that MFCC and PLP perform almost the same, and the bottleneck scores are worse for males but better for females. When we reduce the size of the network, the performance of the bottleneck decreases, as you can see. For both PLP and MFCC, when we concatenate them with the bottleneck, we get a big improvement.

In the last section of this table you can see the results of fusion in the score domain; compare it with the second section, which is fusion in the feature domain. You can see that in almost all cases the performance of fusion in the score domain is better than in the feature domain for text-dependent speaker verification. This is interesting, because in text-independent verification the performance of feature concatenation is usually better than fusing the scores of two features. The problem is the training data: the training data here is very limited, and the larger concatenated features would need more training data. You can see that by fusing the bottleneck features with PLP and MFCC we get a big improvement, and the best result comes from fusing the scores of three different features.
At the end, we showed that we can get very good results with i-vectors in text-dependent speaker verification. We verified that in text-dependent speaker verification the performance of the HMM alignment is good, and in some cases the DNN alignment gives similar or better results than the HMM alignment. We also got excellent results using bottleneck features in text-dependent speaker verification, especially when they are concatenated with the other cepstral features. In text-dependent speaker verification, score-domain fusion is better than feature-level fusion, and we got the best results by fusing three different features. One more note: in text-dependent speaker verification you have to use phrase-dependent transforms and score normalization, because the durations are very short, and when you use the HMM for aligning the frames to the states, be careful not to use the phrase-independent transforms and score normalization.

Questions?
Question: Okay, maybe a quick question before lunch. Very nice work on the i-vectors; did you try this on RedDots?

Answer: Yes, these are also results from our RedDots experiments; you can see here the results that we will present at Interspeech. You can see a comparison between the GMM-UBM, the GMM i-vector, and the HMM i-vector on three different non-target trial types. Especially for the target-wrong trials, where the phrase content is what matters, the performance of the HMM alignment is much better than the other two methods, and for the impostor-correct case the performance of the HMM is better too.
Okay, time for one more question.
Question: Just a quick question on the fusion of the GMM and DNN systems: were the HMMs trained using context-dependent units, or [inaudible]?

Answer: No, I didn't try that.