Okay, so I ran into some difficulty with this, and that is what these slides are about. It's a talk about neural networks, primarily recurrent neural networks, for text-dependent speaker verification. On paper at least, this is a very natural fit between a model and a problem, and it's something that Google has got to work very successfully. We tried it, and unfortunately we came to the conclusion that we were a couple of orders of magnitude short in the amount of background data we had to work with.
So I'll walk through what we did and explain why it didn't work. I would recommend that you read the paper I referred to; it can be read as a survey article, and I think it's worth reading on those grounds. But I'm not going to spend the whole period talking about this particular problem.
I'd like to explain what our plans are for getting these neural networks to work, and I'm talking specifically about speaker-discriminant neural networks, getting them to work in text-independent speaker recognition. My student's thesis project will specifically be about getting convolutional neural networks to work, and I personally am particularly interested in what the right back-end architecture is for this type of problem.
So what I plan to do, since I don't have any results to present, is spend maybe five or ten minutes talking about why this is a difficult problem, but also why the difficulties are not insuperable, and, if possible, explain what we're hoping to do by way of a system for the NIST evaluation based on speaker-discriminant neural networks. All this is in the hope of provoking a discussion; I would be particularly interested in hearing from other people who might be trying to do something similar.
Okay. So the problem was to use neural networks to extract utterance-level features which could be used to characterize speakers, in the context of a classical text-dependent speaker recognition task where you have a fixed pass phrase and the phonetic variability is partially nailed down. The easiest way to do this is with an ordinary feed-forward deep neural network, but we were particularly interested in trying to get it to work with recurrent neural networks, largely inspired by recent work in machine translation, which I'll describe briefly.
So here's the problem. I'll just mention at the outset that we were specifically interested in the case of getting this to work with a modest amount of background data. Most of us working in text-dependent speaker recognition are confronted by a very hard constraint: if we're lucky, we will be able to get data from one hundred speakers, whereas if you read the Google paper you will see that they have literally tens of millions of recordings, all instances of the same phrase.
okay
So, what you would do in designing a deep neural network for this purpose is just feed a three-hundred-millisecond window into a classical feed-forward neural network with a softmax on the output, where you have one output for each speaker in your development population, and train it up with a classical cross-entropy criterion. You would then get utterance-level features simply by averaging the outputs over all frames. This was implemented successfully by Google; they called it the d-vector approach.
It works fairly well on our task as well, although it's not competitive with the GMM-UBM. This is just the classical feed-forward architecture; I don't think it needs any further comment.
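To make the frame-level recipe concrete, here is a minimal sketch of that kind of classifier (not Google's actual implementation; the layer sizes, the feature dimensions, and the choice of averaging hidden-layer activations rather than outputs are my own illustrative assumptions):

```python
# A minimal sketch: a frame-level feed-forward speaker classifier trained with
# cross-entropy, with an utterance-level feature obtained by averaging over frames.
import torch
import torch.nn as nn

N_SPEAKERS = 100          # development population (the ~100 speakers mentioned above)
FRAME_DIM = 40 * 30       # e.g. 40 filterbank coefficients stacked over a ~300 ms window

class FrameClassifier(nn.Module):
    def __init__(self, in_dim=FRAME_DIM, hidden=256, n_spk=N_SPEAKERS):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
        )
        self.softmax_layer = nn.Linear(hidden, n_spk)   # one output per dev speaker

    def forward(self, frames):                          # frames: (n_frames, in_dim)
        return self.softmax_layer(self.hidden(frames))

    def utterance_feature(self, frames):
        # average frame-level activations over the whole utterance
        with torch.no_grad():
            return self.hidden(frames).mean(dim=0)

model = FrameClassifier()
loss_fn = nn.CrossEntropyLoss()                         # classical cross-entropy criterion
frames = torch.randn(200, FRAME_DIM)                    # one utterance, 200 stacked frames
labels = torch.full((200,), 7, dtype=torch.long)        # all frames belong to dev speaker 7
loss = loss_fn(model(frames), labels)
loss.backward()
d_vector = model.utterance_feature(frames)              # utterance-level speaker feature
```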
What was, I think, most remarkable about the RNN architecture, which I'll describe next, is that Google managed to get it to work as an end-to-end speaker recognition system: not merely a feature extractor, but one which could make a binary decision concerning a trial, as to whether it's a target trial or a non-target trial. This has been seen as a sort of pot of gold at the end of the rainbow in our field for a very long time. People have been able to get it to work with i-vectors, but a direct approach to that problem has generally been resistant to our best efforts. Google got it to work with their RNN system, and you can see that they used an awful lot of data: that figure of twenty-two million recordings is not a misprint.
So, the RNN architecture: the diagrams in the slides refer just to the classical memory module, where, in addition to an input vector at each time step, you also have a hidden layer that encodes the context seen so far. What the neural network does at each time step is append the input to the hidden activation, squash the dimension back down to the dimension of the hidden activation, and feed the result through a nonlinearity, so that you keep on updating a memory of the history of the utterance. That's a very natural sort of model for data with a left-to-right structure, as in classical text-dependent speaker recognition, or even machine translation.
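As a minimal sketch of that memory module (the tanh nonlinearity and the dimensions are illustrative assumptions, not the exact system):

```python
# Classical recurrent memory module: at each time step the input is appended to
# the previous hidden activation, the dimension is squashed back down, and a
# nonlinearity is applied, so the hidden state accumulates a memory of the utterance.
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        # maps [input ; previous hidden] back down to the hidden dimension
        self.squash = nn.Linear(in_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.squash(torch.cat([x_t, h_prev], dim=-1)))

cell = SimpleRNNCell(in_dim=40, hidden_dim=128)
h = torch.zeros(128)                      # memory of the history of the utterance
for x_t in torch.randn(300, 40):          # one feature vector per time step
    h = cell(x_t, h)                      # h now summarizes the whole utterance
```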
Okay, so that is the classical RNN architecture.
There was an extraordinary paper on machine translation published in two thousand and fourteen which showed that it was possible to train a neural network for the French-to-English translation problem using an RNN architecture with a very special feature, namely that there was a single softmax. In what they call the encoder, the network reads French-language sentences, and it was trained in such a way that the hidden activation at the last time step was capable of memorising the entire French sentence, so that all the information you needed in order to do machine translation from French to English was summarized in the hidden activation at the last word of the sentence.
To get this to work they had to use four layers of LSTM units. It wasn't easy, but they were able to get state-of-the-art results on a machine translation task with sentences of about thirty words. Obviously this must eventually break down: you can't memorise sentences of indefinite duration this way, just because the memory has a finite capacity. But Google figured that if it works for machine translation, it is definitely going to work for text-dependent speaker recognition: it should be possible to memorise a speaker's utterance of a fixed pass phrase.
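To make the single-softmax encoder idea concrete, here is a minimal sketch (my own illustrative reading of the setup, not the actual translation or speaker system): a recurrent encoder reads the whole sequence and only the hidden state at the last time step is passed to a single classification softmax.

```python
# Encode a whole sequence and classify it from the hidden state at the last time
# step alone: one softmax per sequence, not one per frame. Sizes are illustrative.
import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    def __init__(self, in_dim=40, hidden_dim=128, n_classes=100):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.softmax_layer = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                       # x: (batch, n_steps, in_dim)
        _, (h_last, _) = self.encoder(x)        # h_last: (1, batch, hidden_dim)
        return self.softmax_layer(h_last[-1])   # summary of the whole sequence

model = LastStepClassifier()
logits = model(torch.randn(8, 300, 40))         # 8 utterances, 300 frames each
```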
There are various ways this has since been improved on. An obvious thing to do, instead of using the activation at the last time step to memorise an utterance, would be to average the activations over all time steps. But once again you would be taking the average activation and feeding it into a single softmax to do the memorisation; it's not one softmax per frame.
There was a bit of controversy, as you can imagine, in the machine translation field as to whether this really was the right way to memorise entire sentences, and that led to a flurry of activity on something called attention modeling. The argument was that if you're going to translate from French to English, then in the course of producing the English translation, as you proceed word by word, you want to direct your attention to the appropriate place in the French utterance, and that correspondence is not necessarily going to be monotonic, because word ordering can change as you go from one language to the other. A model was developed along these lines which, I think, remains the state-of-the-art in automatic machine translation.
What Google set out to do was to take that idea and, instead of using this sort of attention mechanism to weight the individual frames in the utterance, learn an optimal summary of a speaker's production of the pass phrase. And that was the thing that actually worked best for them.
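Here is a minimal sketch of attention-style pooling (my own simplified reading of the idea, not the exact Google recipe): instead of keeping only the last hidden activation, or a plain average, learn a weight for each time step and use the weighted sum as the utterance summary.

```python
# Attention-style pooling: a learned scalar score per time step is normalized with
# a softmax over time, and the weighted sum of the hidden activations is the summary.
import torch
import torch.nn as nn

class AttentionSummary(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)    # one scalar score per time step

    def forward(self, hiddens):                  # hiddens: (n_steps, hidden_dim)
        weights = torch.softmax(self.score(hiddens).squeeze(-1), dim=0)
        return (weights.unsqueeze(-1) * hiddens).sum(dim=0)   # learned summary

pool = AttentionSummary(hidden_dim=128)
hiddens = torch.randn(300, 128)                  # recurrent activations, one per frame
summary = pool(hiddens)                          # fed to a single softmax, not one per frame
```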
So that describes the task: a fairly classical text-dependent speaker recognition task. The language was German, and the data was provided to us by an outside partner. As for the results: well, although the standard tricks worked as advertised (rectified linear units, dropout and so on each gave an incremental improvement in performance), we were not able to match the performance of the GMM-UBM.
And of course the same thing happened with the RNNs: doing intelligent summaries of the data helped, but the results were ultimately disappointing. The reason was quite clear: with just one hundred development speakers we were going to hopelessly overfit to the data. So these methods are not going to work unless we have very large amounts of data.
Very large amounts of data may be on the way; I was talking just this morning about the possibility of getting data at a scale where this sort of thing could be a viable, plausible solution. But it's clear that my student isn't going to get a thesis out of waiting for that problem to be solved, and he's been bitten by the neural network bug, so his task will be to try to get convolutional neural networks working: convolutional neural networks trained to discriminate between speakers, working as feature extractors for text-independent speaker recognition.
So what I would like to do is just talk about what our plans are for that. What I thought I would do was, first of all, explain why this is a difficult problem, why we cannot expect out-of-the-box solutions already existing in the neural network literature to work for us, and why, nonetheless, it's not an insuperably difficult problem and we ought to be able to do something about it. We are presently committed to getting this to work: we are going to submit some sort of system for the NIST evaluation, but I think it's going to take a bit longer to actually iron all the kinks out of this.
so
it seems to believe that
it approach in this problem there are two fundamental questions that we need to be
able to answer and how we answer them is probably going to dictate
well direction we actually terry
the car restroom about the backend which i'm particularly interested then
but it's i actually of secondary importance
So the first question, as I see it, is this: if we look at the successes in fields like face recognition, which is a very similar biometric pattern recognition problem (I'm thinking in particular of DeepFace), why is it that this has worked so spectacularly for them, but we still haven't been able to get it to work? That's one question.
A second question would be: if we look at the current state-of-the-art in text-independent speaker recognition, where we have a neural network trained to discriminate between senones collecting Baum-Welch statistics for an i-vector extractor in a cascade, why is it that, if we simply train a neural network to discriminate between speakers in the NIST data, we haven't been able to get that architecture to work satisfactorily in speaker recognition?
To my knowledge, several people have tried this but haven't yet obtained even a publishable result. I may be wrong about that, and I'd be happy to be shown to be wrong about it, but I believe that this is where things stand at present.
So if we look at the DeepFace architecture: what these guys did at Facebook was take a population of four thousand development subjects, with one thousand images per subject. They trained a convolutional neural network to discriminate between the subjects in the development population and used it as a feature extractor; at run time they just fed the output into a cosine distance classifier. Their output was a few thousand dimensions, but Google later showed that you could do this with a hundred and twenty-eight dimensions, the same order of magnitude that we have found to be appropriate for characterizing speakers in text-independent speaker recognition. Of course, the fact that they have one thousand instances per subject obviously does make life a lot easier than the situation we're in, where we have maybe ten on average.
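Here is a minimal sketch of the run-time use described above (the embedding dimension and the threshold are illustrative assumptions, not values from any of the systems mentioned): embeddings extracted for the two sides of a trial are compared with a cosine distance, and the score is thresholded to make the target versus non-target decision.

```python
# Cosine-distance back-end for embeddings produced by a speaker-discriminant network.
import torch
import torch.nn.functional as F

def cosine_score(enrol_emb: torch.Tensor, test_emb: torch.Tensor) -> float:
    return F.cosine_similarity(enrol_emb, test_emb, dim=0).item()

enrol = torch.randn(128)      # embedding of the enrolment utterance(s)
test = torch.randn(128)       # embedding of the test utterance
score = cosine_score(enrol, test)
decision = "target" if score > 0.5 else "non-target"   # threshold is arbitrary here
```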
But some people have raised a more fundamental concern. In our case we're not really trying to extract features from something that's analogous to static images, because of the time dimension. We're confronted not only with utterances of variable duration, rather than a fixed dimension, but also with the fact that the order of phonetic events is a nuisance for us: we need a representation that's invariant under permutations of the order of phonetic events.
I do think a convolutional neural network should be able to solve both of those problems in principle, because it will produce a representation that's invariant under permutations in the time dimension, and in principle it will be able to handle utterances of variable duration. There is an analogue in automatic segmentation in image processing: you will see that they do use convolutional neural networks with images of variable size.
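As a minimal illustration of why a convolutional architecture can cope with both issues (this is only a sketch with assumed dimensions, not our system): a one-dimensional convolution over time followed by global average pooling yields a fixed-size representation regardless of utterance length, and the pooling step is insensitive to the order of the pooled frames.

```python
# 1-D convolutions over time plus global average pooling: fixed-dimension output
# for inputs of any duration, insensitive to the order of the pooled frames.
import torch
import torch.nn as nn

class ConvSpeakerEmbedder(nn.Module):
    def __init__(self, in_dim=40, channels=128, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(channels, emb_dim)

    def forward(self, feats):                  # feats: (batch, in_dim, n_frames)
        pooled = self.conv(feats).mean(dim=2)  # global average over time
        return self.proj(pooled)               # fixed-size embedding

embedder = ConvSpeakerEmbedder()
short = embedder(torch.randn(1, 40, 150))      # 1.5 s utterance
long = embedder(torch.randn(1, 40, 1000))      # 10 s utterance, same output size
```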
So I don't think it's hopeless, but this would be my answer to the question of why senone-discriminant neural networks work and speaker-discriminant neural networks don't: I think trying to discriminate between speakers on very short time scales is going to be a very hard problem, and I think we should just stay away from it for the time being. The reason is very simple: the primary variability in the signal at short time scales is necessarily phonetic variability, not speaker variability. If it were speaker variability rather than phonetic variability, then speech recognition would not be possible.
So what happens if we take the same architecture as is used in senone-discriminant neural networks, with a ten-millisecond frame advance and a three-hundred-millisecond window? We're just going to get swamped by the problem of phonetic variability. On the other hand, it's actually quite easy to get neural networks working as feature extractors if you use whole utterances as the input: just encode the utterance as an i-vector and you will get bottleneck features that do a very good job of discriminating between speakers. So if you feed in whole utterances, the problem is solvable, but it's actually too easy to be interesting, and you're not going to get away from i-vectors. If you go down to ten milliseconds, I think you're just going to get killed by the problem of phonetic variability. The sweet spot for the short term, I think, should be something like ten seconds.
That has worked in language recognition, and you'll see several papers in these proceedings showing that neural networks are good at extracting features for language recognition if you give them utterances of three seconds or ten seconds or whatever. But I would say that the particular problem of getting down to short time scales is one that we should eventually be able to solve, and we shouldn't give up on it.
I think if you want to use neural networks as feature extractors, not merely for speaker recognition but also for speaker diarization, then you are going to have to confront this problem: you can't have a window of more than, say, five hundred milliseconds in speaker diarization, or you're going to miss speaker turns. So we are eventually going to have to confront the problem of how to normalize for the phonetic variability in utterances of short duration if we're to train neural networks to discriminate between speakers.
I'll just mention a paper being presented here that attempts to deal with that problem using factor analysis methods.
The idea, which I think is going to work eventually, is that we should think of phonetic content as a short-term channel effect; when I say short-term, I mean maybe five frames or ten frames. In the normal way we think about channels this would be sort of hopeless: we can model channel effects that persist over entire utterances, but not at the level of, say, ten milliseconds. However, we do have the benefit of supervision, which could be supplied by something like a senone-discriminant neural network that tells you, at each time step, what the probable phonetic content is. So it is actually possible to model phonetic content as a short-lived channel effect, and you can do that using factor analysis methods. That was the topic of the presentation I mentioned; it's just a first experiment, but I think the solution of that particular problem is going to be a key element in getting neural networks to discriminate between speakers at short time scales.
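One hedged way to write that idea down (my own formulation for illustration, not necessarily the model in the paper referred to): each frame gets a speaker offset that is constant over the utterance plus a short-lived phonetic "channel" offset whose subspace is selected by the senone posteriors supplied by the senone-discriminant network,

$$\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{y} + \sum_{s} \gamma_t(s)\,\mathbf{U}_s\,\mathbf{z}_t + \boldsymbol{\epsilon}_t,$$

where $\mathbf{y}$ is a speaker factor, $\mathbf{z}_t$ is a channel factor that persists for only a handful of frames, $\gamma_t(s)$ is the posterior probability of senone $s$ at time $t$, and $\boldsymbol{\epsilon}_t$ is residual noise.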
Okay, so that's all I wanted to say about that.
Okay, so I think you said that you want to reduce the within-speaker variability; how are you thinking about doing that? Were you thinking about a softmax over the target speakers? For example, I can tell you what we are interested in working on, which is trying to learn the cosine similarity between speakers: we have a Siamese setup that tries to say whether two utterances come from the same speaker or different speakers by learning a cosine similarity, and tries to push the clusters further apart.
Well, my view about this (and this is just an opinion) is that, in order to get neural networks to work in speaker recognition, in the long run we are going to have to combine them with a generative model. The way I see it working is that, analogously to the DeepFace architecture, we can hope to get neural networks working as feature extractors: they would be trained to discriminate between speakers in the development set, but used as feature extractors at runtime.
I would expect that we would have these neural networks output feature vectors at regular intervals as you go through an utterance, and I believe that the interesting problem is how to design a back-end to deal with that. In fact it involves modeling counts, which will be the topic of your presentation.
Although I believe there are other models which are just waiting to be used for this. I'm thinking particularly of latent Dirichlet allocation, which is the analogue, for count data, of eigenvoices for continuous data. One of the things you can do is build an i-vector extractor using latent Dirichlet allocation for count data, and if you can do eigenvoices you can also do an analogue of PLDA. It would behave very differently from PLDA: it wouldn't have Gaussian assumptions, and it wouldn't even have this assumption of statistical independence between speaker effects and channel effects, so there is a whole theory there. And you can actually train on unlabeled data: training the model with unlabeled data is something latent Dirichlet allocation allows. So there's actually a very big field here waiting to be explored.
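Just to illustrate the analogy (this is a hedged sketch of my own, not the back-end being proposed; treating the counts of which development speaker each frame's softmax votes for as the count data is a hypothetical choice):

```python
# Latent Dirichlet allocation fit to per-utterance count vectors, with no speaker
# labels, giving a low-dimensional utterance representation in the spirit of an
# i-vector for count data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# rows: utterances, columns: counts of frames assigned to each of 100 dev speakers
counts = rng.poisson(lam=3.0, size=(50, 100))

lda = LatentDirichletAllocation(n_components=20, random_state=0)
utterance_repr = lda.fit_transform(counts)   # 20-dimensional representation per utterance
# Note: no labels are used here, which is the unsupervised-training property mentioned above.
```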
The only question is: do we want to go in the direction of supervised training of a softmax, or do we want to go in the direction of representation learning?
Personally (and this is just one opinion), I believe that neural networks on their own are not equal to our task. We could never hope to do it with training on labeled data alone: with the amount of labeled data we have, a neural network cannot learn to discriminate between speakers it doesn't know anything about. So I think they will need to be complemented by a back-end which is waiting to be developed, and not the back-ends that we have at present. Okay.