okay this is going to be a somewhat technical talk, on the problem of uncertainty modeling in text-dependent speaker recognition
one of the issues that i'm concerned with here is what to do in a speaker recognition context where you have very little data, so that the features you extract are necessarily going to be noisy in the statistical sense
okay, this comes straight to the fore in text-dependent speaker recognition, where you may have just two seconds of data. it's also an important problem in text-independent speaker recognition, because of the need to set a uniform threshold even in cases where your test utterances are of variable duration. it will be interesting to see what happens with that particular problem in the forthcoming nist evaluation
some progress has been made with subspace methods: with i-vectors you try to quantify the statistical noise in the i-vector extraction process and feed that into the plda model. but i've taken that possibility off the table for present purposes, and said: look, subspace methods in general are not going to work in text-dependent speaker recognition because of the data distribution
okay, so what i attempted to do was to tackle this problem of modeling speaker variability in a situation where one is not able to characterize it by subspace methods
i realised while preparing the presentation that the paper in the proceedings is very dense and rather difficult to read. but the idea, although it's a bit tricky, is fairly simple, so i have made an effort in the slides to communicate the core idea, and if you are interested i recommend that you look at the slides rather than the paper. i'll post the slides on my web page
so for this task we took rsr2015 part 3, that's the random-digits portion of the rsr data i just mentioned. let me just mention two things about this. because of the design, we have five random digits at test time, whereas all ten digits are repeated three times at enrollment time in random order, so you only see half of the digits at test time. and it actually turns out that under those conditions gmm methods have an advantage, because you can use all of your enrollment data no matter what the test utterance is, whereas if you pre-segment the data into digits you are constraining yourself to using only the enrollment data that corresponds to the digits that actually occur in the test utterance
one other thing i should mention is that this paper is about the back end. we used a standard sixty-dimensional plp front end, which is maybe not ideal. i think this will come up in the next talk: you can get much better results on female speakers if you use a low-dimensional front end, which i think the authors of the next talk were the first to discover
so the model that i was using here is one which uses low-dimensional hidden variables to characterize channel effects, but which does not attempt to characterize speakers using subspace methods. instead there is a z-factor that characterizes speakers, and this z-vector is used as a feature vector for speaker recognition
and the problem i wanted to address was to design a back end that would take account of the fact that the number of observations available to estimate the components of this vector is very small. in general you have about one frame per mixture component: if you have a two-second utterance and a ubm with five hundred and twelve gaussians, you can do the calculation and see that you have extremely sparse data (two seconds at a typical frame rate of one hundred frames per second gives only about two hundred frames)
so there are two back ends that i'll present. one, the joint density back end, uses point estimates of the features that are extracted at enrollment time and at test time, and models the correlation between the two in order to construct a likelihood ratio. the innovation in this paper, the hidden supervector back end, treats those two feature vectors as hidden variables, as in the original formulation of jfa, so the key ingredient is to supply a prior distribution on the correlations between those hidden variables
how much time do i have left? sorry, i should have checked. how much? okay, good
okay, so let me just digress for a minute. the way uncertainty modeling is usually tackled in text-independent speaker recognition is that you try to characterize the uncertainty in a point estimate of an i-vector using a posterior covariance matrix, which is calculated using the zero-order statistics, and you do this on the enrollment side and on the test side independently. if you think about this, you realise that it isn't quite the right way to do it
okay, the reason is that if you are hypothesizing a target trial, then what you see on the test side has to be highly correlated with what you see on the enrollment side; they are not statistically independent. and there has to be a benefit from using those correlations to quantify the uncertainty in the feature that comes out of the test utterance. there is something called the law of total variance that says that, on average, when you condition one random variable on another, you reduce the variance
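For reference, this is the decomposition being appealed to here, in standard notation (not the paper's):

\[
\operatorname{Var}(Y) \;=\; \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big] \;+\; \operatorname{Var}\big(\mathbb{E}[Y \mid X]\big) \;\;\ge\;\; \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big],
\]

so conditioning the test-side variable on the enrollment-side one can only reduce the expected uncertainty.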
okay, so the critical thing that i introduce in this paper is this correlation between the enrollment side and the test side
okay, so here are the mechanics of how the joint density back end works; it's pretty straightforward. the features are treated as point estimates. this was inspired by sandro cumani's work at the last odyssey: he implemented it at the level of i-vectors, and there's nothing to stop you from doing it at the level of supervectors as well. you obviously can't train correlation matrices of supervector dimension, but you can implement the idea at the level of individual mixture components. so that gives you a trainable back end for text-dependent speaker recognition even if you can't use subspace methods. that's our best back end, and it's the one that i used as a benchmark for these experiments
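To make the mechanics concrete, here is a minimal sketch of a per-component joint density score under my reading of the setup; the function names and the scipy-based scoring are my own, not the paper's code.

```python
# A per-mixture-component joint density score: enrollment and test point
# estimates are stacked and scored under a joint Gaussian whose cross-covariance
# encodes the target-trial correlation; the non-target model zeroes that block.
import numpy as np
from scipy.stats import multivariate_normal

def joint_density_llr(z_enrol, z_test, mu, C_target):
    """z_enrol, z_test: (F,) point estimates for one mixture component.
    mu: (2F,) joint mean; C_target: (2F, 2F) joint covariance trained on target trials."""
    F = len(z_enrol)
    x = np.concatenate([z_enrol, z_test])
    # Non-target hypothesis: same marginals, but enrollment and test independent.
    C_nontarget = C_target.copy()
    C_nontarget[:F, F:] = 0.0
    C_nontarget[F:, :F] = 0.0
    return (multivariate_normal.logpdf(x, mu, C_target)
            - multivariate_normal.logpdf(x, mu, C_nontarget))

# The score for a trial would then be the sum of these per-component LLRs.
```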
so the supervector back end is the hidden version of this. it says: you are in a position to observe baum-welch statistics, but you are not in a position to observe the z-factors; you have to make inferences about the posterior distribution of those features and base your likelihood ratio on that calculation
now it turns out that the probability calculations are formally, mathematically equivalent to calculations with an i-vector extractor that has just two gaussians in it. you pick a mixture component of the ubm; say you observe that mixture component once on the enrollment side and once on the test side, so you have two hidden gaussians. you have a variable number of observations on the enrollment side and a variable number of observations on the test side, and that's exactly the type of situation that we model with an i-vector extractor. so there is an i-vector extractor here, but it's only being used to do probability calculations; it is not going to be used to extract features
now, one thing about this i-vector extractor is that you're not going to use it to impose subspace constraints, because it only has the two gaussians. you don't need to say that those two gaussians lie in a low-dimensional subspace of the supervector space. so you might as well just take the total variability matrix to be the identity matrix and shift all of the burden of modeling the data onto the prior distribution
in i-vector modeling we always take a standard normal prior, zero mean and identity covariance matrix. that's because there is in fact nothing in general to be gained by using a non-standard prior: you can always compensate for a non-standard prior by fiddling with the total variability matrix. here we take the total variability matrix to be the identity, so instead you have to train the prior
and that involves doing... well, you can do the posterior calculations: if you look at those formulas, you see that they look just like the standard ones, except that i now have a mean and a precision matrix, which would be zero and the identity matrix in the case of the standard normal prior
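As an illustration, here is a minimal sketch of that posterior calculation for one mixture component, in my own notation: a general Gaussian prior on the stacked enrollment/test vector, reducing to the standard formulas when mu0 = 0 and P0 = I.

```python
# Gaussian posterior for the stacked hidden vector [z_enrol; z_test] of one
# mixture component, given a trained prior N(mu0, P0^{-1}) and centered
# Baum-Welch statistics.
import numpy as np

def hidden_posterior(mu0, P0, N_e, f_e, N_t, f_t, S_inv):
    """mu0: (2F,) prior mean; P0: (2F, 2F) prior precision.
    N_e, N_t: zero-order stats (scalars); f_e, f_t: (F,) centered first-order stats.
    S_inv: (F, F) inverse of the UBM covariance for this component."""
    F = len(f_e)
    # Precision contributed by the observations (block-diagonal).
    P_obs = np.zeros((2 * F, 2 * F))
    P_obs[:F, :F] = N_e * S_inv
    P_obs[F:, F:] = N_t * S_inv
    P_post = P0 + P_obs
    rhs = P0 @ mu0 + np.concatenate([S_inv @ f_e, S_inv @ f_t])
    mu_post = np.linalg.solve(P_post, rhs)
    return mu_post, np.linalg.inv(P_post)
```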
and you can do minimum divergence estimation, which is in effect a way of training the prior. if you think about the way minimum divergence estimation works, what you are doing is in fact estimating a prior; normally we then say, well, there's no gain in using a non-standard prior, so we standardize the prior and modify the total variability matrix instead. here we just estimate the prior. i put "estimate" in inverted commas, because estimating a prior is not something bayesians are supposed to do, but we do it all the time and it works
so how would you train this? you would have to organise your training data into target trials. for each trial and each mixture component in the ubm, you would have an observation on the enrollment side and an observation on the test side (or multiple observations), so you have baum-welch statistics, and you just implement this minimum divergence estimation procedure. you then get a prior distribution that tells you what correlations to expect between the enrollment data and the test data in the case of a target trial. if you want to handle non-target trials, you just impose a statistical independence assumption: you zero out the correlations
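A minimal sketch of how I understand that training step; the update follows the usual minimum divergence recipe of re-estimating the prior from aggregated posterior moments, and the function names are mine.

```python
# Re-estimate the target-trial prior from the posterior moments of the stacked
# hidden vectors over a set of training target trials; iterate to convergence.
import numpy as np

def minimum_divergence_update(post_means, post_covs):
    """post_means: list of (2F,) posterior means, one per training target trial.
    post_covs: list of (2F, 2F) posterior covariances."""
    mu = np.mean(post_means, axis=0)
    centered = [m - mu for m in post_means]
    Sigma = np.mean([C + np.outer(d, d) for C, d in zip(post_covs, centered)], axis=0)
    return mu, Sigma

def nontarget_prior(mu, Sigma, F):
    """Zero the enrollment/test cross-correlations for the non-target hypothesis."""
    Sigma_non = Sigma.copy()
    Sigma_non[:F, F:] = 0.0
    Sigma_non[F:, :F] = 0.0
    return mu, Sigma_non
```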
okay, so the way you would use this machinery to calculate a likelihood ratio is that, given enrollment data and test data, you calculate the evidence. that is just the likelihood of the data that you get when you integrate out the hidden variables. it's not usually done, but i think everybody who has an implementation of i-vectors should always calculate the evidence, because it is a very good diagnostic that tells you whether your implementation is correct. you have to evaluate an integral, but it's a gaussian integral, and the answer can be expressed in closed form in terms of the baum-welch statistics, as in the paper. so, in order to use this for speaker recognition, you evaluate the evidence in two different ways, one with the prior for target trials and one with the prior for non-target trials; you take the ratio of the two, and that gives you your likelihood ratio for speaker recognition
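To illustrate, here is a minimal sketch of that likelihood ratio for one mixture component, under the linear-Gaussian assumptions above and in my own notation; only the prior-dependent part of the evidence is needed, since the rest cancels in the ratio. This is my own derivation, not the paper's exact formulas.

```python
# Marginally, the centered sample means on the two sides are jointly Gaussian
# with covariance (prior covariance + observation noise), so the evidence ratio
# is a difference of two Gaussian log-densities.
import numpy as np
from scipy.stats import multivariate_normal

def evidence_llr(zbar_e, zbar_t, N_e, N_t, S, mu, Sigma_tar, Sigma_non):
    """zbar_e, zbar_t: (F,) centered sample means (first-order stats / zero-order stats).
    N_e, N_t: zero-order statistics (assumed > 0); S: (F, F) UBM covariance.
    mu, Sigma_tar, Sigma_non: trained prior for the target / non-target hypotheses."""
    x = np.concatenate([zbar_e, zbar_t])
    F = len(zbar_e)
    D = np.zeros((2 * F, 2 * F))          # observation-noise covariance
    D[:F, :F] = S / N_e
    D[F:, F:] = S / N_t
    return (multivariate_normal.logpdf(x, mu, Sigma_tar + D)
            - multivariate_normal.logpdf(x, mu, Sigma_non + D))
```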
so the mechanics of getting this to work depend critically on how you prepare the baum-welch statistics that summarize the enrollment data and the test data. the first thing you need to do concerns the enrollment utterances: each of those is potentially contaminated by channel effects, so you take the raw baum-welch statistics and filter out the channel effects, just using the jfa model. in that way you get a set of synthetic baum-welch statistics which characterizes the speaker: you simply pool the baum-welch statistics together after you have filtered out the channel effects. you do that on the enrollment side, you do the same thing on the test side, and you end up, in any given trial, having to compare one set of baum-welch statistics with another using this hidden supervector back end
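A minimal sketch of that preparation step as I understand it, assuming the usual JFA notation in which U x is the channel offset in the supervector domain; function and variable names are mine.

```python
# Filter channel effects out of raw Baum-Welch statistics and pool the resulting
# "synthetic" statistics across utterances.
import numpy as np

def filter_and_pool(stats, U, x_posteriors):
    """stats: list of (N, F) pairs per utterance, where N is (C,) zero-order stats
    and F is (C, feat_dim) first-order stats; U: (C*feat_dim, R) channel loading
    matrix; x_posteriors: list of (R,) posterior means of the channel factors."""
    C, feat_dim = stats[0][1].shape
    N_pooled = np.zeros(C)
    F_pooled = np.zeros((C, feat_dim))
    for (N, F_raw), x in zip(stats, x_posteriors):
        channel_offset = (U @ x).reshape(C, feat_dim)   # per-component channel shift
        # Remove the channel contribution from the first-order statistics.
        F_clean = F_raw - N[:, None] * channel_offset
        N_pooled += N
        F_pooled += F_clean
    return N_pooled, F_pooled
```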
now here's a new wrinkle that really makes this work. we know that the sort of achilles' heel of jfa models is the gaussian assumption, and the reason we do length normalization in between extracting i-vectors and feeding them to plda is to fix up the gaussian assumptions in plda. we have to do a similar trick here, but the normalization is a bit tricky, because you have to normalize baum-welch statistics; you are not normalizing a vector. obviously the magnitude of the first-order statistics is going to depend on the zero-order statistics, so it is not immediately obvious what the right thing to do is. the recipe that i used comes from going back to the jfa model and looking at how the jfa model is trained: the z-vectors are treated as hidden variables that come with both a point estimate and an uncertainty, a posterior covariance matrix that tells you how much the observations tell you about the underlying hidden vector. and the thing that turns out to be convenient to normalize is the expected norm of that hidden variable: instead of making the norm equal to one, you make the expected norm equal to one
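In my own notation, with \(\hat{z}\) the posterior mean and \(\operatorname{Cov}(z)\) the posterior covariance of the hidden vector, the quantity being normalized decomposes as

\[
\mathbb{E}\,\lVert z \rVert^{2} \;=\; \lVert \hat{z} \rVert^{2} \;+\; \operatorname{tr}\big(\operatorname{Cov}(z)\big),
\]

and the normalization rescales so that this expected squared norm equals one.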
okay, so a curious thing is that the second term on the right-hand side, the trace of the posterior covariance matrix, is actually the dominant term. that can happen because the uncertainty is so large, and there is an experiment in the paper that shows that you had better not neglect that term. as for the role of the relevance factor in the experiments that i report in the paper: as you fiddle with the relevance factor, you are actually fiddling with the relative magnitude of this term, so you do have to sweep over all possible relevance factors in order to get this thing working properly
okay, so here are some results using what i call global z-vectors; that's where we don't bother to pre-segment the data into digits. remember i said at the beginning that there was an advantage on this task to not segmenting, in other words to just ignoring the left-to-right structure that you are given in the problem
so there is a gmm-ubm benchmark, the joint density benchmark, and two versions of the hidden supervector back end, one without length normalization applied to the baum-welch statistics and one with it. you can see that the length normalization really is the key to getting this thing to work. i should mention that there is a further reduction in error rate available on the female side, from eight percent to six percent, which appears to have to do with the front end: we fixed a standard front end for these experiments, but it appears that if you use lower-dimensional feature vectors for female speakers you get better results, and i think that's the explanation
there is actually a fairly big improvement if you go from one hundred and twenty-eight gaussians to five hundred and twelve, even though the uncertainty in the case of five hundred and twelve is necessarily going to be larger. it was this phenomenon that originally motivated us to look at the uncertainty modelling problem
you can also implement this if you pre-segment the data into digits and extract what i call local z-vectors in the paper, and it works in that case as well. there is a trick that we use here, something i call component fusion: you can break the likelihood ratio up into contributions from the individual gaussians and weight them, where the weights are calculated using logistic regression. that helps quite a lot, but it requires that you have a development set in order to choose the fusion weights
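A minimal sketch of that fusion step, assuming per-component LLRs and a labelled development set are available; the use of scikit-learn here is my choice, not the paper's.

```python
# Component fusion: per-Gaussian likelihood-ratio contributions are combined
# with weights learned by (regularized) logistic regression on a dev set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion_weights(dev_llrs, dev_labels, C=1.0):
    """dev_llrs: (n_trials, n_components) per-component LLRs on the dev set.
    dev_labels: (n_trials,) 1 for target trials, 0 for non-target trials."""
    clf = LogisticRegression(C=C)          # C controls the regularization strength
    clf.fit(dev_llrs, dev_labels)
    return clf

def fused_score(clf, trial_llrs):
    """Weighted combination of the per-component LLRs (log-odds of the classifier)."""
    return clf.decision_function(trial_llrs.reshape(1, -1))[0]
```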
in fact you will see in the paper that, with these local z-vectors, although we did obtain an improvement on the evaluation set, it was not as big an improvement as we obtained on the development set. you need data if you are going to use a regularized logistic regression
so we found a way around that. instead of pre-segmenting the data into individual digits, we used a speech recognition system to collect the baum-welch statistics. i mean, in text-independent speaker recognition, if you can use a senone-discriminant neural network to collect baum-welch statistics, then the obvious thing to do in text-dependent speaker recognition, where you know the phonetic transcription, is just to use a speech recognizer to collect the baum-welch statistics. because individual senones are very unlikely to occur more than once in a digit string, you are implicitly imposing a left-to-right structure, but you don't have to do it explicitly. and that works just as well
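For reference, a minimal sketch of the statistics collection itself, using the standard definitions; the frame posteriors could come either from UBM components or from a recognizer's senones, and the code is mine.

```python
# Collect Baum-Welch statistics from frame-level posteriors: zero-order stats
# are the summed posteriors, first-order stats are posterior-weighted sums of
# the mean-centered frames.
import numpy as np

def collect_baum_welch_stats(frames, gammas, means):
    """frames: (T, feat_dim) acoustic features; gammas: (T, C) frame posteriors
    over the C classes; means: (C, feat_dim) class means used for centering."""
    N = gammas.sum(axis=0)                              # (C,) zero-order stats
    F = gammas.T @ frames - N[:, None] * means          # (C, feat_dim) centered first-order stats
    return N, F
```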
okay, so there are some fusion results for the two approaches, with and without paying attention to the left-to-right structure; if you fuse them, you do get better results
okay, so just to summarize. one thing that i didn't dwell on is that this can be implemented very efficiently: you can basically set things up in such a way that the linear algebra that needs to be performed at runtime involves only diagonal matrices. so it's nothing like the i-vector back end that i presented at the last interspeech conference, which was just a trial run for this and wasn't intended to be a realistic solution to the problem; that involved essentially extracting an i-vector per trial, which is not something you would normally do. this, on the other hand, is computationally very reasonable, so it is effective in practice. okay, that's all i have to say. thank you
okay, do we have questions?
you are normalising out the channel effects in the baum-welch statistics; do you also normalize for the phoneme variability there as well?
well, as i said, that's future work that i intend to do something about, but i think it is a problem we should pay attention to. we have done some preliminary work on it, but it isn't something i can report on here. in text-dependent speaker recognition it's not so much of an issue, because the phonetic content is essentially nailed down for you; it really matters in text-independent recognition, where the alignment is going to come from a neural network that's trying to discriminate senones
a second question: when you remove the channel effects, are you using a point estimate of the channel factors?
that's right; that's what the recipe calls for, and i think it can be justified. it's what the jfa recipe calls for: even though the channel variables are treated as hidden variables that have a posterior expectation and a posterior covariance matrix, if you look at the role they play, and if you are merely interested in filtering out the channel effects, it turns out that all you need is the posterior expectation. that is just what the model says
so the model is very simple; that's really all there is to it. okay, let's thank the speaker