Okay, so I ran into some difficulty with this, and that is what these slides are about. It's a talk about neural networks, primarily recurrent neural networks, for text-dependent speaker verification. On paper at least, this is a very natural fit between a model and a problem, and it's something that Google has got to work very successfully. We tried it, and unfortunately we came to the conclusion that we were a couple of orders of magnitude short in the amount of background data we had to work with.
So I'll walk through what we did and explain why it didn't work. I would recommend that you read the paper I referred to; it can be read as a survey article, and I think it's worth reading on those grounds. But I'm not going to spend the whole period talking about this particular problem.
I'd like to explain what our plans are for getting these neural networks to work, and I'm talking specifically about speaker-discriminant neural networks, getting them to work in text-independent speaker recognition. My student's thesis project will specifically be about getting convolutional neural networks to work, and I personally am particularly interested in what the right back-end architecture is for this type of problem.
So what I plan to do, since I don't have any results to present, is spend maybe five or ten minutes talking about why this is a difficult problem, but also why the difficulties are not insuperable, and, if possible, explain what we're hoping to do by way of a system for the NIST evaluation based on speaker-discriminant neural networks. All this is in the hope of provoking a discussion; I would be particularly interested in hearing from other people who might be trying to do something similar.
Okay. So the problem was to use neural networks to extract utterance-level features which could be used to characterize speakers, in the context of a classical text-dependent speaker recognition task where you have a fixed pass phrase and the phonetic variability is partially nailed down. The easiest way to do this is with an ordinary feed-forward deep neural network, but we were particularly interested in trying to get it to work with recurrent neural networks, largely inspired by recent work in machine translation, which I'll describe briefly.
So here's the problem. I'll just mention at the outset that we were specifically interested in the case of getting this to work with a modest amount of background data. Most of us working in text-dependent speaker recognition are confronted by a very hard constraint: if we're lucky, we will be able to get data from one hundred speakers, whereas if you read the Google paper you will see that they have literally tens of millions of recordings, all instances of the same phrase.
okay
So, what you would do in designing a deep neural network for this purpose is just feed a three-hundred-millisecond window into a classical feed-forward neural network with a softmax on the output, where you have one output for each speaker in your development population, and train it up with a classical cross-entropy criterion. You would then get utterance-level features simply by averaging the outputs over all frames. This was implemented successfully by Google; they called it the d-vector approach.
It works fairly well on our task as well, although it's not competitive with the GMM-UBM. This is just the classical feed-forward architecture; I don't think it needs any further comment.
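To make the frame-level recipe concrete, here is a minimal sketch of that kind of classifier (not Google's actual implementation; the layer sizes, the feature dimensions, and the choice of averaging hidden-layer activations rather than outputs are my own illustrative assumptions):

```python
# A minimal sketch: a frame-level feed-forward speaker classifier trained with
# cross-entropy, with an utterance-level feature obtained by averaging over frames.
import torch
import torch.nn as nn

N_SPEAKERS = 100          # development population (the ~100 speakers mentioned above)
FRAME_DIM = 40 * 30       # e.g. 40 filterbank coefficients stacked over a ~300 ms window

class FrameClassifier(nn.Module):
    def __init__(self, in_dim=FRAME_DIM, hidden=256, n_spk=N_SPEAKERS):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
        )
        self.softmax_layer = nn.Linear(hidden, n_spk)   # one output per dev speaker

    def forward(self, frames):                          # frames: (n_frames, in_dim)
        return self.softmax_layer(self.hidden(frames))

    def utterance_feature(self, frames):
        # average frame-level activations over the whole utterance
        with torch.no_grad():
            return self.hidden(frames).mean(dim=0)

model = FrameClassifier()
loss_fn = nn.CrossEntropyLoss()                         # classical cross-entropy criterion
frames = torch.randn(200, FRAME_DIM)                    # one utterance, 200 stacked frames
labels = torch.full((200,), 7, dtype=torch.long)        # all frames belong to dev speaker 7
loss = loss_fn(model(frames), labels)
loss.backward()
d_vector = model.utterance_feature(frames)              # utterance-level speaker feature
```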
What was, I think, most remarkable about the RNN architecture, which I'll describe next, is that Google managed to get it to work as an end-to-end speaker recognition system: not merely a feature extractor, but one which could make a binary decision concerning a trial, as to whether it's a target trial or a non-target trial. This has been seen as a sort of pot of gold at the end of the rainbow in our field for a very long time. People have been able to get it to work with i-vectors, but a direct approach to that problem has generally been resistant to our best efforts. Google got it to work with their RNN system, and you can see that they used an awful lot of data: that figure of twenty-two million recordings is not a misprint.
So, the RNN architecture: the diagrams in the slides refer just to the classical memory module, where, in addition to an input vector at each time step, you also have a hidden layer that encodes the context seen so far. What the neural network does at each time step is append the input to the hidden activation, squash the dimension back down to the dimension of the hidden activation, and feed the result through a nonlinearity, so that you keep on updating a memory of the history of the utterance. That's a very natural sort of model for data with a left-to-right structure, as in classical text-dependent speaker recognition, or even machine translation.
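As a minimal sketch of that memory module (the tanh nonlinearity and the dimensions are illustrative assumptions, not the exact system):

```python
# Classical recurrent memory module: at each time step the input is appended to
# the previous hidden activation, the dimension is squashed back down, and a
# nonlinearity is applied, so the hidden state accumulates a memory of the utterance.
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        # maps [input ; previous hidden] back down to the hidden dimension
        self.squash = nn.Linear(in_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.squash(torch.cat([x_t, h_prev], dim=-1)))

cell = SimpleRNNCell(in_dim=40, hidden_dim=128)
h = torch.zeros(128)                      # memory of the history of the utterance
for x_t in torch.randn(300, 40):          # one feature vector per time step
    h = cell(x_t, h)                      # h now summarizes the whole utterance
```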
Okay, so that is the classical RNN architecture.
There was an extraordinary paper on machine translation published in two thousand and fourteen which showed that it was possible to train a neural network for the French-to-English translation problem using an RNN architecture with a very special feature, namely that there was a single softmax. In what they call the encoder, the network reads French-language sentences, and it was trained in such a way that the hidden activation at the last time step was capable of memorising the entire French sentence, so that all the information you needed in order to do machine translation from French to English was summarized in the hidden activation at the last word of the sentence.
To get this to work they had to use four layers of LSTM units. It wasn't easy, but they were able to get state-of-the-art results on a machine translation task with sentences of about thirty words. Obviously this must eventually break down: you can't memorise sentences of indefinite duration this way, just because the memory has a finite capacity. But Google figured that if it works for machine translation, it is definitely going to work for text-dependent speaker recognition: it should be possible to memorise a speaker's utterance of a fixed pass phrase.
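To make the single-softmax encoder idea concrete, here is a minimal sketch (my own illustrative reading of the setup, not the actual translation or speaker system): a recurrent encoder reads the whole sequence and only the hidden state at the last time step is passed to a single classification softmax.

```python
# Encode a whole sequence and classify it from the hidden state at the last time
# step alone: one softmax per sequence, not one per frame. Sizes are illustrative.
import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    def __init__(self, in_dim=40, hidden_dim=128, n_classes=100):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.softmax_layer = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                       # x: (batch, n_steps, in_dim)
        _, (h_last, _) = self.encoder(x)        # h_last: (1, batch, hidden_dim)
        return self.softmax_layer(h_last[-1])   # summary of the whole sequence

model = LastStepClassifier()
logits = model(torch.randn(8, 300, 40))         # 8 utterances, 300 frames each
```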
There are various ways this has since been improved on. An obvious thing to do, instead of using the activation at the last time step to memorise an utterance, would be to average the activations over all time steps. But once again you would be taking the average activation and feeding it into a single softmax to do the memorisation; it's not one softmax per frame.
There was a bit of controversy, as you can imagine, in the machine translation field as to whether this really was the right way to memorise entire sentences, and that led to a flurry of activity on something called attention modeling. The argument was that if you're going to translate from French to English, then in the course of producing the English translation, as you proceed word by word, you want to direct your attention to the appropriate place in the French utterance, and that correspondence is not necessarily going to be monotonic, because word ordering can change as you go from one language to the other. A model was developed along these lines which, I think, remains the state-of-the-art in automatic machine translation.
What Google set out to do was to take that idea and, instead of using this sort of attention mechanism to weight the individual frames in the utterance, learn an optimal summary of a speaker's production of the pass phrase. And that was the thing that actually worked best for them.
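Here is a minimal sketch of attention-style pooling (my own simplified reading of the idea, not the exact Google recipe): instead of keeping only the last hidden activation, or a plain average, learn a weight for each time step and use the weighted sum as the utterance summary.

```python
# Attention-style pooling: a learned scalar score per time step is normalized with
# a softmax over time, and the weighted sum of the hidden activations is the summary.
import torch
import torch.nn as nn

class AttentionSummary(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)    # one scalar score per time step

    def forward(self, hiddens):                  # hiddens: (n_steps, hidden_dim)
        weights = torch.softmax(self.score(hiddens).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * hiddens).sum(dim=0)   # learned summary

pool = AttentionSummary(hidden_dim=128)
hiddens = torch.randn(300, 128)                  # recurrent activations, one per frame
summary = pool(hiddens)                          # fed to a single softmax, not one per frame
```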
So that describes the task: a fairly classical text-dependent speaker recognition task. The language was German, and the data was provided to us by an outside partner. As for the results: well, although the standard tricks worked as advertised (rectified linear units, dropout and so on each gave an incremental improvement in performance), we were not able to match the performance of the GMM-UBM.
And of course the same thing happened with the RNNs: doing intelligent summaries of the data helped, but the results were ultimately disappointing. The reason was quite clear: with just one hundred development speakers we were going to hopelessly overfit to the data. So these methods are not going to work unless we have very large amounts of data.
Very large amounts of data may be on the way; I was talking just this morning about the possibility of getting data at a scale where this sort of thing could be a viable, plausible solution. But it's clear that my student isn't going to get a thesis out of waiting for that problem to be solved, and he's been bitten by the neural network bug, so his task will be to try to get convolutional neural networks working: convolutional neural networks trained to discriminate between speakers, working as feature extractors for text-independent speaker recognition.
So what I would like to do is just talk about what our plans are for that. What I thought I would do was, first of all, explain why this is a difficult problem, why we cannot expect out-of-the-box solutions already existing in the neural network literature to work for us, and why, nonetheless, it's not an insuperably difficult problem and we ought to be able to do something about it. We are presently committed to getting this to work: we are going to submit some sort of system for the NIST evaluation, but I think it's going to take a bit longer to actually iron all the kinks out of this.
so
it seems to believe that
it approach in this problem there are two fundamental questions that we need to be
able to answer and how we answer them is probably going to dictate
well direction we actually terry
the car restroom about the backend which i'm particularly interested then
but it's i actually of secondary importance
So the first question, as I see it, is this: if we look at the successes in fields like face recognition, which is a very similar biometric pattern recognition problem (I'm thinking in particular of DeepFace), why is it that this has worked so spectacularly for them, but we still haven't been able to get it to work? That's one question.
A second question would be: if we look at the current state-of-the-art in text-independent speaker recognition, where we have a neural network trained to discriminate between senones collecting Baum-Welch statistics for an i-vector extractor in a cascade, why is it that, if we simply train a neural network to discriminate between speakers in the NIST data, we haven't been able to get that architecture to work satisfactorily in speaker recognition?
To my knowledge, several people have tried this but haven't yet obtained even a publishable result. I may be wrong about that, and I'd be happy to be shown to be wrong about it, but I believe that this is where things stand at present.
So if we look at the DeepFace architecture: what these guys did at Facebook was take a population of four thousand development subjects, with one thousand images per subject. They trained a convolutional neural network to discriminate between the subjects in the development population and used it as a feature extractor; at run time they just fed the output into a cosine distance classifier. Their output was a few thousand dimensions, but Google later showed that you could do this with a hundred and twenty-eight dimensions, the same order of magnitude that we have found to be appropriate for characterizing speakers in text-independent speaker recognition. Of course, the fact that they have one thousand instances per subject obviously does make life a lot easier than the situation we're in, where we have maybe ten on average.
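Here is a minimal sketch of the run-time use described above (the embedding dimension and the threshold are illustrative assumptions, not values from any of the systems mentioned): embeddings extracted for the two sides of a trial are compared with a cosine distance, and the score is thresholded to make the target versus non-target decision.

```python
# Cosine-distance back-end for embeddings produced by a speaker-discriminant network.
import torch
import torch.nn.functional as F

def cosine_score(enrol_emb: torch.Tensor, test_emb: torch.Tensor) -> float:
    return F.cosine_similarity(enrol_emb, test_emb, dim=0).item()

enrol = torch.randn(128)      # embedding of the enrolment utterance(s)
test = torch.randn(128)       # embedding of the test utterance
score = cosine_score(enrol, test)
decision = "target" if score > 0.5 else "non-target"   # threshold is arbitrary here
```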
But some people have raised a more fundamental concern. In our case we're not really trying to extract features from something that's analogous to static images, because of the time dimension. We're confronted not only with utterances of variable duration, rather than a fixed dimension, but also with the fact that the order of phonetic events is a nuisance for us: we need a representation that's invariant under permutations of the order of phonetic events.
I do think a convolutional neural network should be able to solve both of those problems in principle, because it will produce a representation that's invariant under permutations in the time dimension, and in principle it will be able to handle utterances of variable duration. There is an analogue in automatic segmentation in image processing: you will see that they do use convolutional neural networks with images of variable size.
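As a minimal illustration of why a convolutional architecture can cope with both issues (this is only a sketch with assumed dimensions, not our system): a one-dimensional convolution over time followed by global average pooling yields a fixed-size representation regardless of utterance length, and the pooling step is insensitive to the order of the pooled frames.

```python
# 1-D convolutions over time plus global average pooling: fixed-dimension output
# for inputs of any duration, insensitive to the order of the pooled frames.
import torch
import torch.nn as nn

class ConvSpeakerEmbedder(nn.Module):
    def __init__(self, in_dim=40, channels=128, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(channels, emb_dim)

    def forward(self, feats):                  # feats: (batch, in_dim, n_frames)
        pooled = self.conv(feats).mean(dim=2)  # global average over time
        return self.proj(pooled)               # fixed-size embedding

embedder = ConvSpeakerEmbedder()
short = embedder(torch.randn(1, 40, 150))      # 1.5 s utterance
long = embedder(torch.randn(1, 40, 1000))      # 10 s utterance, same output size
```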
So I don't think it's hopeless, but this would be my answer to the question of why senone-discriminant neural networks work and speaker-discriminant neural networks don't: I think trying to discriminate between speakers on very short time scales is going to be a very hard problem, and I think we should just stay away from it for the time being. The reason is very simple: the primary variability in the signal at short time scales is necessarily phonetic variability, not speaker variability. If it were speaker variability rather than phonetic variability, then speech recognition would not be possible.
So what happens if we take the same architecture as is used in senone-discriminant neural networks, with a ten-millisecond frame advance and a three-hundred-millisecond window? We're just going to get swamped by the problem of phonetic variability. On the other hand, it's actually quite easy to get neural networks working as feature extractors if you use whole utterances as the input: just encode the utterance as an i-vector and you will get bottleneck features that do a very good job of discriminating between speakers. So if you feed in whole utterances, the problem is solvable, but it's actually too easy to be interesting, and you're not going to get away from i-vectors. If you go down to ten milliseconds, I think you're just going to get killed by the problem of phonetic variability. The sweet spot for the short term, I think, should be something like ten seconds.
That has worked in language recognition, and you'll see several papers in these proceedings showing that neural networks are good at extracting features for language recognition if you give them utterances of three seconds or ten seconds or whatever. But I would say that the particular problem of getting down to short time scales is one that we should eventually be able to solve, and we shouldn't give up on it.
I think if you want to use neural networks as feature extractors, not merely for speaker recognition but also for speaker diarization, then you are going to have to confront this problem: you can't have a window of more than, say, five hundred milliseconds in speaker diarization, or you're going to miss speaker turns. So we are eventually going to have to confront the problem of how to normalize for the phonetic variability in utterances of short duration if we're to train neural networks to discriminate between speakers.
I'll just mention a paper being presented here that attempts to deal with that problem using factor analysis methods.
The idea, which I think is going to work eventually, is that we should think of phonetic content as a short-term channel effect; when I say short-term, I mean maybe five frames or ten frames. In the normal way we think about channels this would be sort of hopeless: we can model channel effects that persist over entire utterances, but not at the level of, say, ten milliseconds. However, we do have the benefit of supervision, which could be supplied by something like a senone-discriminant neural network that tells you, at each time step, what the probable phonetic content is. So it is actually possible to model phonetic content as a short-lived channel effect, and you can do that using factor analysis methods. That was the topic of the presentation I mentioned; it's just a first experiment, but I think the solution of that particular problem is going to be a key element in getting neural networks to discriminate between speakers at short time scales.
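One hedged way to write that idea down (my own formulation for illustration, not necessarily the model in the paper referred to): each frame gets a speaker offset that is constant over the utterance plus a short-lived phonetic "channel" offset whose subspace is selected by the senone posteriors supplied by the senone-discriminant network,

$$\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{y} + \sum_{s} \gamma_t(s)\,\mathbf{U}_s\,\mathbf{z}_t + \boldsymbol{\epsilon}_t,$$

where $\mathbf{y}$ is a speaker factor, $\mathbf{z}_t$ is a channel factor that persists for only a handful of frames, $\gamma_t(s)$ is the posterior probability of senone $s$ at time $t$, and $\boldsymbol{\epsilon}_t$ is residual noise.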
Okay, so that's all I wanted to say about that.
Okay, so I think you said that you want to reduce the within-speaker variability; how are you thinking about doing that? Were you thinking about a softmax over the target speakers? For example, I can tell you what we are interested in working on, which is trying to learn the cosine similarity between speakers: we have a Siamese setup that tries to say whether two utterances come from the same speaker or different speakers by learning a cosine similarity, and tries to push the clusters further apart.
Well, my view about this (and this is just an opinion) is that, in order to get neural networks to work in speaker recognition, in the long run we are going to have to combine them with a generative model. The way I see it working is that, analogously to the DeepFace architecture, we can hope to get neural networks working as feature extractors: they would be trained to discriminate between speakers in the development set, but used as feature extractors at runtime.
I would expect that we would have these neural networks output feature vectors at regular intervals as you go through an utterance, and I believe that the interesting problem is how to design a back-end to deal with that. In fact it involves modeling counts, which will be the topic of your presentation.
Although I believe there are other models which are just waiting to be used for this. I'm thinking particularly of latent Dirichlet allocation, which is the analogue, for count data, of eigenvoices for continuous data. One of the things you can do is build an i-vector extractor using latent Dirichlet allocation for count data, and if you can do eigenvoices you can also do an analogue of PLDA. It would behave very differently from PLDA: it wouldn't have Gaussian assumptions, and it wouldn't even have this assumption of statistical independence between speaker effects and channel effects, so there is a whole theory there. And you can actually train on unlabeled data: training the model with unlabeled data is something latent Dirichlet allocation allows. So there's actually a very big field here waiting to be explored.
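Just to illustrate the analogy (this is a hedged sketch of my own, not the back-end being proposed; treating the counts of which development speaker each frame's softmax votes for as the count data is a hypothetical choice):

```python
# Latent Dirichlet allocation fit to per-utterance count vectors, with no speaker
# labels, giving a low-dimensional utterance representation in the spirit of an
# i-vector for count data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# rows: utterances, columns: counts of frames assigned to each of 100 dev speakers
counts = rng.poisson(lam=3.0, size=(50, 100))

lda = LatentDirichletAllocation(n_components=20, random_state=0)
utterance_repr = lda.fit_transform(counts)   # 20-dimensional representation per utterance
# Note: no labels are used here, which is the unsupervised-training property mentioned above.
```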
The only question is: do we want to go in the direction of supervised training of a softmax, or do we want to go in the direction of representation learning?
Personally (and this is just one opinion), I believe that neural networks on their own are not equal to our task. We could never hope to do it with training on labeled data alone: with the amount of labeled data we have, a neural network cannot learn to discriminate between speakers it doesn't know anything about. So I think they will need to be complemented by a back-end which is waiting to be developed, and not the back-ends that we have at present. Okay.