0:00:15 | okay so we ran into some difficulty with this so these are |
---|
0:00:19 | the slides i sent in for them |
---|
0:00:23 | it's a talk about neural networks primarily recurrent neural networks |
---|
0:00:27 | for text dependent speaker verification |
---|
0:00:31 | this is on paper at least a very natural fit between a model and a |
---|
0:00:36 | problem |
---|
0:00:37 | and it's something that google has got to work very successfully |
---|
0:00:43 | so we tried it unfortunately we came to the conclusion that we were a |
---|
0:00:49 | couple of orders of magnitude short in the amount of background data we had |
---|
0:00:55 | so |
---|
0:00:57 | i'll just run through what we did and explain why it didn't work |
---|
0:01:04 | i would recommend that you read this paper i intended it to be read |
---|
0:01:09 | as a survey article |
---|
0:01:11 | and i think it's worth reading on those grounds |
---|
0:01:15 | but i'm not going to spend the whole period talking about this particular problem |
---|
0:01:21 | i'd like to explain what our plans are for getting these neural networks to |
---|
0:01:28 | work i'm talking specifically about |
---|
0:01:30 | speaker discriminant neural networks |
---|
0:01:33 | getting them to work in text |
---|
0:01:35 | independent speaker recognition |
---|
0:01:40 | gautam's thesis project will be specifically on getting convolutional neural networks to work |
---|
0:01:47 | and i personally am particularly interested in |
---|
0:01:52 | what is the right back end architecture |
---|
0:01:54 | for this type of problem |
---|
0:01:58 | so what i plan to do given that i don't |
---|
0:02:02 | have any results to present is spend maybe five or ten minutes |
---|
0:02:06 | talking about |
---|
0:02:08 | why this is a difficult problem but why the difficulties are not insuperable |
---|
0:02:14 | and |
---|
0:02:16 | if possible i'd like to explain what we're |
---|
0:02:19 | hoping to do by way of a |
---|
0:02:23 | system |
---|
0:02:25 | for the nist evaluation based on speaker discriminant neural networks |
---|
0:02:31 | all this in the hope of provoking a discussion i would be particularly interested in |
---|
0:02:36 | hearing |
---|
0:02:38 | from any of the other people who might be trying to do something similar |
---|
0:02:43 | okay so |
---|
0:02:45 | so the problem on this task was to use neural networks |
---|
0:02:51 | to extract utterance level features |
---|
0:02:55 | which could be used to characterize speakers |
---|
0:02:59 | in the context of a classical text dependent speaker recognition task where you have a |
---|
0:03:04 | fixed |
---|
0:03:05 | pass phrase and the phonetic variability is partially nailed down |
---|
0:03:11 | the easiest |
---|
0:03:12 | way to do this is using an ordinary feed forward deep neural network |
---|
0:03:19 | but we were particularly interested in trying to get this to work with recurrent neural |
---|
0:03:23 | networks |
---|
0:03:25 | largely inspired by |
---|
0:03:27 | recent work in machine translation which |
---|
0:03:30 | i'll describe briefly |
---|
0:03:33 | so |
---|
0:03:36 | so here's the problem i'll just mention at the outset that we were specifically interested |
---|
0:03:42 | in the case of getting this to work with a modest amount of background data |
---|
0:03:47 | most of us working in |
---|
0:03:49 | text dependent speaker recognition are confronted by a very hard constraint if we're lucky we |
---|
0:03:55 | will be able to get data from |
---|
0:03:58 | one hundred speakers |
---|
0:04:00 | whereas if you read the google paper you will see that they have |
---|
0:04:04 | literally tens of millions of recordings |
---|
0:04:08 | all instances of the same phrase |
---|
0:04:10 | okay |
---|
0:04:13 | so |
---|
0:04:16 | well what you would do in designing a deep neural network for this purpose is you |
---|
0:04:21 | would just feed a three hundred millisecond |
---|
0:04:25 | window into a classical feed forward neural network |
---|
0:04:30 | with a softmax on the outputs where you have one output for each |
---|
0:04:37 | speaker in your development population and train it up with a classical cross entropy criterion |
---|
0:04:45 | you would then get utterance level features simply by averaging the outputs |
---|
0:04:52 | over all frames this was implemented successfully by google they called it the |
---|
0:04:57 | d vector approach |
---|
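(A minimal numpy sketch of the d-vector recipe just described. The single hidden layer, the layer sizes, and the random weights are stand-in assumptions, and training with the cross-entropy criterion is omitted; this is an illustration of the idea, not Google's implementation.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in sizes: a stacked ~300 ms window of acoustic features and
# a softmax output with one class per development speaker.
window_dim, hidden_dim, n_speakers = 300, 256, 100

W1 = rng.normal(0, 0.01, (hidden_dim, window_dim))   # hidden layer weights
W2 = rng.normal(0, 0.01, (n_speakers, hidden_dim))   # softmax layer weights

def frame_posteriors(window):
    """One window -> softmax posterior over the development speakers."""
    h = np.maximum(W1 @ window, 0.0)                 # ReLU hidden layer
    logits = W2 @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Utterance-level feature: average the per-frame outputs over all frames.
windows = rng.normal(size=(50, window_dim))          # 50 sliding windows
d_vector = np.mean([frame_posteriors(w) for w in windows], axis=0)
```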
0:05:01 | and |
---|
0:05:03 | it works fairly well on our task as well although it's not competitive with a plain |
---|
0:05:09 | gmm ubm |
---|
0:05:12 | so this is just the |
---|
0:05:15 | classical feed forward architecture i don't think it needs any further comment |
---|
0:05:23 | what was i think most remarkable about the |
---|
0:05:28 | rnn architecture which i'll |
---|
0:05:31 | describe next |
---|
0:05:34 | is that google managed to get this to work as an end-to-end |
---|
0:05:39 | speaker recognition system not merely |
---|
0:05:42 | a feature extractor |
---|
0:05:44 | but one which could make a binary decision concerning a trial as to |
---|
0:05:48 | whether it's a |
---|
0:05:50 | a target trial or a non-target trial |
---|
0:05:53 | this has been seen as a sort of pot of gold at the end of |
---|
0:05:57 | the rainbow in our field for a very long time |
---|
0:06:00 | now |
---|
0:06:01 | people have been able to get it to work with i-vectors |
---|
0:06:07 | but a direct approach to that problem has generally been resistant to our |
---|
0:06:14 | best efforts google got it to work with their rnn system |
---|
0:06:20 | so you see that they used an awful lot of data that figure of |
---|
0:06:24 | twenty two million recordings is not a misprint |
---|
0:06:30 | so the rnn architecture the diagrams in the slides refer just to |
---|
0:06:39 | the |
---|
0:06:39 | classical memory module not the lstm a memory module where |
---|
0:06:47 | in addition to an input vector at each time step you also have a hidden |
---|
0:06:52 | layer that encodes the past history |
---|
0:06:55 | and what the neural network does at each time step is append the input to |
---|
0:07:00 | the |
---|
0:07:01 | hidden activation |
---|
0:07:04 | then squash the dimension back down to the dimension of the hidden activation |
---|
0:07:10 | and feed the result into a nonlinearity so you |
---|
0:07:13 | keep on updating a memory of the history of the utterance and that's |
---|
0:07:21 | a very natural sort of model |
---|
0:07:24 | for data with a left-to-right structure as in classical text dependent speaker recognition |
---|
0:07:31 | or even machine translation |
---|
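(A minimal numpy sketch of the recurrent update just described: concatenate the input with the previous hidden activation, project back down to the hidden dimension, and squash through a nonlinearity. The sizes and the tanh squashing are illustrative assumptions.)

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hid_dim = 20, 64

# One weight matrix acting on [input; previous hidden state].
W = rng.normal(0, 0.1, (hid_dim, in_dim + hid_dim))
b = np.zeros(hid_dim)

def rnn_step(x_t, h_prev):
    """Append the input to the hidden activation, squash the dimension
    back down, and pass the result through a nonlinearity."""
    return np.tanh(W @ np.concatenate([x_t, h_prev]) + b)

h = np.zeros(hid_dim)
for x_t in rng.normal(size=(100, in_dim)):  # one left-to-right pass
    h = rnn_step(x_t, h)                    # h is the memory of the history
```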
0:07:33 | and there was a |
---|
0:07:35 | paper about this |
---|
0:07:37 | okay so this is the classical rnn architecture |
---|
0:07:42 | there was an extraordinary paper on machine translation published in two thousand and fourteen |
---|
0:07:48 | which showed that it was possible to train a neural network for the |
---|
0:07:54 | french to english translation problem |
---|
0:07:57 | using an rnn architecture with a very special feature namely |
---|
0:08:04 | that there was a single softmax |
---|
0:08:07 | okay in what they called the encoder the encoder read french language sentences |
---|
0:08:14 | and |
---|
0:08:15 | the |
---|
0:08:16 | it was trained in such a way that the hidden activation at the last time step |
---|
0:08:21 | was capable of memorising the entire french sentence |
---|
0:08:29 | so that all the information you needed in order to do machine |
---|
0:08:34 | translation from french to english was summarized in the hidden activation at the last word |
---|
0:08:41 | of the sentence |
---|
0:08:44 | to get this to work they had to use four layers of lstm units |
---|
0:08:49 | it wasn't easy but they were able to get |
---|
0:08:54 | state-of-the-art results on a machine translation task |
---|
0:08:57 | with sentences of about thirty words obviously that must eventually break down |
---|
0:09:04 | you can't memorise sentences of indefinite duration this way just because the memory has |
---|
0:09:13 | a finite capacity |
---|
0:09:15 | but google figured that if it works for machine translation it's definitely going to work |
---|
0:09:20 | in |
---|
0:09:22 | text dependent speaker recognition it will be possible to |
---|
0:09:26 | memorise a speaker's utterance of a fixed pass phrase |
---|
0:09:33 | so |
---|
0:09:35 | there are various ways this basic approach has been improved on |
---|
0:09:42 | an obvious thing to do instead of |
---|
0:09:46 | using the activation of the last time step to memorise an utterance would be to |
---|
0:09:51 | average the activations of all time steps |
---|
0:09:54 | but once again you would be taking the average activation and feeding it into a |
---|
0:09:58 | single softmax to do the memorising it's not one softmax per frame |
---|
0:10:07 | there was a bit of controversy as you can imagine in the machine translation field |
---|
0:10:11 | as to whether this really was the right way to memorise entire sentences and |
---|
0:10:17 | that led to a flurry of activity on something called |
---|
0:10:22 | attention modeling |
---|
0:10:24 | okay where |
---|
0:10:25 | i mean the argument was that if you're going to translate from french to english |
---|
0:10:30 | then in the course of the english translation as you proceed word by word you |
---|
0:10:35 | want to direct your attention to the appropriate place in the french utterance |
---|
0:10:41 | and that correspondence is not necessarily going to be monotonic because word ordering can change |
---|
0:10:48 | as you go from one language to the other |
---|
0:10:51 | and there was a model developed along these lines |
---|
0:10:59 | which i think |
---|
0:11:01 | claims to be the state-of-the-art in |
---|
0:11:07 | automatic machine translation |
---|
0:11:09 | and what google set out to do was to |
---|
0:11:15 | take that idea and use this sort of attention mechanism to weight the |
---|
0:11:23 | individual frames |
---|
0:11:25 | in the utterance to learn an optimal |
---|
0:11:28 | summary of a speaker's production of the pass phrase |
---|
0:11:36 | and that was the thing that actually worked best for them |
---|
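(A sketch of attention pooling over frames, as opposed to keeping only the last hidden activation: a softmax over per-frame relevance scores gives the weights for the summary. In a real system the attention vector `v` is learned jointly with the rest of the network; here it is just a random stand-in.)

```python
import numpy as np

rng = np.random.default_rng(0)
hid_dim = 64
H = rng.normal(size=(100, hid_dim))  # hidden activations, one per time step

v = rng.normal(size=hid_dim)         # stand-in for a learned attention vector

scores = H @ v                       # one relevance score per frame
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax over the time steps

summary = weights @ H                # weighted summary of the utterance
```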
0:11:40 | so this describes the task a fairly classical text dependent speaker recognition task |
---|
0:11:48 | the language was german the data was provided for us by voicetrust |
---|
0:11:57 | the results with the dnn well although the standard tricks worked |
---|
0:12:04 | as advertised they were you know |
---|
0:12:10 | the so-called relu units rectified linear units dropout and so |
---|
0:12:15 | on each of them gave an incremental improvement in performance but |
---|
0:12:20 | we weren't able to match the performance of a gmm ubm |
---|
0:12:25 | and of course the same thing happened with rnns the intelligent summaries |
---|
0:12:34 | of the data helped but the results were ultimately disappointing |
---|
0:12:39 | and the reason |
---|
0:12:41 | it was quite clear what the reason was |
---|
0:12:44 | with just one hundred development speakers we were going to |
---|
0:12:49 | hopelessly overfit to the data so |
---|
0:12:53 | these methods are not going to work unless we have very large |
---|
0:12:58 | amounts of data |
---|
0:13:03 | very large amounts of data may be on the way i was |
---|
0:13:08 | talking |
---|
0:13:08 | just this morning to someone who said that there might be the possibility of |
---|
0:13:12 | getting such data |
---|
0:13:15 | where this sort of thing could be seriously considered as a viable plausible |
---|
0:13:24 | solution but it's clear that gautam isn't going to get a thesis by waiting until that |
---|
0:13:30 | is solved |
---|
0:13:31 | he's |
---|
0:13:32 | been bitten by the |
---|
0:13:36 | by the neural network bug so his task will be to try to get |
---|
0:13:40 | convolutional neural networks working |
---|
0:13:45 | convolutional neural networks trained to discriminate between speakers working as feature extractors |
---|
0:13:52 | for text independent speaker recognition |
---|
0:13:55 | so |
---|
0:13:57 | what i would like to do is just |
---|
0:13:59 | talk about what our plans are for that |
---|
0:14:07 | what i thought i would do was first of all explain why this |
---|
0:14:11 | is a difficult problem |
---|
0:14:13 | okay why |
---|
0:14:15 | we cannot expect out of the box solutions |
---|
0:14:20 | already existing in the neural network literature to work for us |
---|
0:14:25 | and why nonetheless it's not an insuperably difficult problem and we ought to be |
---|
0:14:29 | able to do something about it |
---|
0:14:31 | we're presently committed |
---|
0:14:33 | to getting this to work |
---|
0:14:36 | we are going to submit some sort of system for the nist evaluation |
---|
0:14:42 | but i think it's going to take a bit longer to actually iron |
---|
0:14:47 | all the kinks out of this |
---|
0:14:50 | so |
---|
0:14:51 | it seems to me that |
---|
0:14:55 | in approaching this problem there are two fundamental questions that we need to be |
---|
0:14:59 | able to answer and how we answer them is probably going to dictate |
---|
0:15:06 | what direction we actually take |
---|
0:15:11 | there's the question about the backend which i'm particularly interested in |
---|
0:15:15 | but that's actually of secondary importance |
---|
0:15:20 | so the first question as i see it is if we look at the success in fields |
---|
0:15:26 | like face recognition |
---|
0:15:29 | where we have |
---|
0:15:31 | a very similar biometric pattern recognition problem i'm thinking in particular of deep face |
---|
0:15:38 | why is it that this has worked so spectacularly for them but we still haven't |
---|
0:15:43 | been able to get it to work |
---|
0:15:44 | that's one question |
---|
0:15:47 | a second question would be |
---|
0:15:51 | if we look at the current state-of-the-art in text independent speaker recognition |
---|
0:15:57 | because that's where we have a |
---|
0:16:02 | neural network trained to discriminate between senones |
---|
0:16:06 | collecting baum-welch statistics for |
---|
0:16:10 | an i-vector extractor that is a cascade |
---|
0:16:12 | why is it |
---|
0:16:14 | that if we simply train a neural network to discriminate between speakers |
---|
0:16:21 | on the nist data we haven't been able to |
---|
0:16:25 | train that architecture |
---|
0:16:28 | okay to get it to work satisfactorily |
---|
0:16:30 | in speaker recognition |
---|
0:16:34 | to my knowledge |
---|
0:16:36 | several people have tried this but haven't yet obtained even a publishable result |
---|
0:16:42 | okay i may be wrong about this i'd be happy to be shown that i'm wrong about |
---|
0:16:47 | this but i believe that this is where things stand at present |
---|
0:16:53 | so if we if we look at the |
---|
0:16:57 | at the deep face architecture |
---|
0:17:01 | so what these guys did at facebook they had a population of four thousand development |
---|
0:17:06 | subjects one thousand images per |
---|
0:17:11 | subject okay |
---|
0:17:13 | one thousand images per subject they |
---|
0:17:16 | trained a convolutional neural network to |
---|
0:17:20 | discriminate |
---|
0:17:22 | between the subjects in the development population |
---|
0:17:26 | and used that as a feature extractor at run time just feeding the output |
---|
0:17:33 | into a cosine distance classifier |
---|
0:17:35 | their output was a few thousand dimensions but |
---|
0:17:38 | google later showed that you could do this with a hundred and twenty-eight dimensions which is the |
---|
0:17:44 | same order of magnitude that we have found |
---|
0:17:47 | to be appropriate for characterizing speakers in |
---|
0:17:52 | text independent speaker recognition |
---|
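(For concreteness, cosine-distance scoring of two embeddings might look like the following sketch; the 128-dimensional size is an assumption in line with the figure quoted above, and the random vectors are placeholders for real embeddings.)

```python
import numpy as np

def cosine_score(emb_enrol, emb_test):
    """Trial score for two utterance (or face) embeddings."""
    a = emb_enrol / np.linalg.norm(emb_enrol)
    b = emb_test / np.linalg.norm(emb_test)
    return float(a @ b)  # accept the trial if this exceeds a threshold

rng = np.random.default_rng(0)
print(cosine_score(rng.normal(size=128), rng.normal(size=128)))
```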
0:17:55 | of course the fact that they have one thousand instances per subject obviously does |
---|
0:18:00 | make life a lot easier |
---|
0:18:02 | than |
---|
0:18:04 | it is for us where we have maybe ten on average |
---|
0:18:09 | but some people have raised a sort of more fundamental concern |
---|
0:18:13 | in our case we're not really trying to extract features from something that's |
---|
0:18:19 | analogous to static images |
---|
0:18:23 | because of the time dimension we're confronted with the fact that not only |
---|
0:18:29 | are we dealing with utterances of variable duration rather than a fixed dimension but |
---|
0:18:34 | the |
---|
0:18:37 | order of phonetic events is something that is a nuisance for us |
---|
0:18:43 | okay we need to get a representation that's |
---|
0:18:47 | invariant under permutations with respect to the |
---|
0:18:51 | order of phonetic events |
---|
0:18:54 | i think |
---|
0:18:55 | a convolutional neural network should be able to solve both of these |
---|
0:18:59 | problems in principle |
---|
0:19:02 | because it will produce a representation that's invariant under permutations in the time dimension |
---|
0:19:07 | and in principle it will be able to handle |
---|
0:19:11 | utterances of variable duration |
---|
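(A toy numpy sketch of why a convolutional network with global pooling over time yields a fixed-size representation for any duration, and one that is insensitive to the ordering of well-separated segments. The filter width and the average pool are illustrative choices, not a prescription.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_filters, feat_dim, width = 32, 40, 5
filters = rng.normal(0, 0.1, (n_filters, width, feat_dim))

def cnn_embed(utterance):
    """Convolve along time with ReLU, then average-pool over all of time."""
    T = utterance.shape[0]
    acts = np.empty((T - width + 1, n_filters))
    for t in range(T - width + 1):
        patch = utterance[t:t + width]  # (width, feat_dim) slice of frames
        acts[t] = np.maximum((filters * patch).sum(axis=(1, 2)), 0.0)
    return acts.mean(axis=0)            # global pooling: fixed-size output

print(cnn_embed(rng.normal(size=(300, feat_dim))).shape)  # (32,)
print(cnn_embed(rng.normal(size=(812, feat_dim))).shape)  # (32,) as well
```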
0:19:16 | if you look at automatic segmentation in image processing you'll see that they do use convolutional |
---|
0:19:21 | neural networks with images of variable |
---|
0:19:25 | size |
---|
0:19:28 | so i don't think it's hopeless but this would be my answer to the question okay |
---|
0:19:34 | why |
---|
0:19:35 | do |
---|
0:19:37 | senone discriminant neural networks work but not speaker discriminant neural networks it's because i think |
---|
0:19:42 | trying to discriminate between speakers on very short time scales is going to be a very |
---|
0:19:48 | hard problem |
---|
0:19:49 | i think we should just stay away from that |
---|
0:19:51 | for the time being and the reason is very simple |
---|
0:19:54 | but the |
---|
0:19:58 | primary |
---|
0:20:00 | variability in the signal at short time scales is necessarily phonetic variability |
---|
0:20:06 | not speaker variability |
---|
0:20:08 | if it were not mostly phonetic variability then |
---|
0:20:13 | speech recognition would not be possible |
---|
0:20:17 | okay so what happens then if we take the same architecture |
---|
0:20:22 | as is used in senone discriminant neural networks with a ten millisecond frame advance and a three |
---|
0:20:29 | hundred millisecond window |
---|
0:20:31 | then we're just going to get swamped by the problem of phonetic variability |
---|
0:20:36 | so |
---|
0:20:38 | it's actually quite easy okay to get neural networks working as feature extractors |
---|
0:20:45 | if you use whole utterances as the input i mean just encode the utterance as |
---|
0:20:50 | an i-vector and you will get a bottleneck feature that |
---|
0:20:53 | does a very good job of discriminating between speakers |
---|
0:20:56 | so |
---|
0:20:57 | if you feed in whole utterances the problem is solvable but it is |
---|
0:21:02 | actually too easy to be interesting and you're not going to get away from i-vectors |
---|
0:21:06 | if you go down to ten milliseconds i think you're just going to get killed |
---|
0:21:09 | by the problem of phonetic variability and |
---|
0:21:13 | the sweet spot for the short term i think should be something like ten seconds |
---|
0:21:17 | okay that's what works in |
---|
0:21:19 | language recognition |
---|
0:21:21 | and you'll see actually several papers in these proceedings |
---|
0:21:27 | that show that neural networks are good at extracting features in language recognition |
---|
0:21:33 | if you give them utterances of three seconds or ten seconds whatever |
---|
0:21:39 | but i would say that particular problem of |
---|
0:21:43 | getting down to short time scales is one that we should eventually be able to |
---|
0:21:47 | solve and we should have a go at it |
---|
0:21:50 | okay i think if you want to |
---|
0:21:53 | use |
---|
0:21:55 | neural networks as feature extractors not merely for speaker recognition but also |
---|
0:22:00 | for speaker diarization then you are going to have to confront the problem |
---|
0:22:04 | okay you can't have a window of more than |
---|
0:22:08 | say five hundred milliseconds in speaker diarization or you're going to miss speaker turns okay |
---|
0:22:15 | so |
---|
0:22:16 | we are eventually going to have to confront the problem of how to normalize for |
---|
0:22:21 | phonetic variability |
---|
0:22:24 | in utterances of short duration if we're to train |
---|
0:22:28 | neural networks to discriminate between speakers |
---|
0:22:32 | i'll just mention the |
---|
0:22:35 | paper that |
---|
0:22:37 | themos will be presenting that attempts to deal with that problem with factor analysis |
---|
0:22:41 | methods |
---|
0:22:44 | in the very last session i think it will be |
---|
0:22:49 | the idea would be |
---|
0:22:52 | i think this is going to work eventually okay we should |
---|
0:22:56 | think of phonetic content as a |
---|
0:23:01 | short term |
---|
0:23:03 | channel effect |
---|
0:23:05 | okay when i say short term i mean maybe five |
---|
0:23:10 | frames or ten frames in the normal |
---|
0:23:15 | way we think about channels this would be sort of |
---|
0:23:18 | hopeless okay we can model channel effects that we presume to be |
---|
0:23:24 | persistent over entire utterances but not at the level of say ten milliseconds however we |
---|
0:23:33 | do have the benefit of supervision |
---|
0:23:37 | which could be supplied by something like a senone discriminant neural network that tells |
---|
0:23:42 | you at each time step what the |
---|
0:23:46 | probable phonetic content |
---|
0:23:48 | is |
---|
0:23:49 | so it is actually possible to model phonetic content as |
---|
0:23:55 | a short lived channel effect and you can do that using factor analysis methods |
---|
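(A hedged way to write down the kind of model being described; this is illustrative notation, not necessarily the exact model in that paper:

$$\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{z}_{b(t)} + \boldsymbol{\epsilon}_t$$

where $\mathbf{y}$ is a speaker factor tied across the whole utterance, $\mathbf{z}_{b(t)}$ is a "channel" factor tied only across a short block $b(t)$ of five or ten frames, representing the short-lived phonetic effect, and the senone posteriors from an ASR network supervise the estimation by indicating the probable phonetic content at each time step.)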
0:24:01 | and that is the topic of themos's presentation it's just a first experiment |
---|
0:24:06 | but i think that particular problem is going to be |
---|
0:24:11 | the solution of that problem is going to be a key element |
---|
0:24:15 | in |
---|
0:24:17 | getting |
---|
0:24:19 | neural networks to discriminate between speakers at short time scales |
---|
0:24:25 | okay so that's all i have to say about that so |
---|
0:24:52 | okay so i think that you said that you want to use a dnn |
---|
0:24:58 | to learn the speaker variability how are you thinking about how |
---|
0:25:03 | you would do that are you thinking about the softmax over the target speakers |
---|
0:25:08 | or you know for example i can tell you what we are interested in working on |
---|
0:25:12 | which is trying to learn the cosine similarity between speakers so we have |
---|
0:25:18 | a siamese network |
---|
0:25:19 | trying to mimic saying this is the same speaker or a different speaker |
---|
0:25:24 | by learning some cosine similarity and trying to push the clusters further apart |
---|
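(A minimal sketch of the kind of pairwise cosine objective described in the question; the margin value and the exact pairing scheme are assumptions about that setup rather than a quotation of it.)

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pair_loss(emb1, emb2, same_speaker, margin=0.5):
    """Pull same-speaker pairs toward cosine 1; push different-speaker
    pairs below a margin, instead of training a per-speaker softmax."""
    c = cosine(emb1, emb2)
    return (1.0 - c) if same_speaker else max(0.0, c - margin)

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=64), rng.normal(size=64)
print(pair_loss(e1, e2, True), pair_loss(e1, e2, False))
```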
0:25:30 | well my view about this and this is just an opinion okay is that |
---|
0:25:37 | i believe that in order to get neural networks to work in speaker |
---|
0:25:44 | recognition in the long run we are going to have to combine them with a |
---|
0:25:48 | generative model okay |
---|
0:25:51 | the way i see it working is that |
---|
0:25:56 | analogously to the deep face architecture we can hope to get neural networks working |
---|
0:26:03 | as feature extractors that would be trained to discriminate between speakers in the development set |
---|
0:26:09 | but used as feature extractors |
---|
0:26:13 | at runtime |
---|
0:26:14 | i would expect |
---|
0:26:16 | that |
---|
0:26:17 | we would have these neural networks outputting things |
---|
0:26:21 | okay at regular intervals as you go through an utterance |
---|
0:26:25 | and the problem |
---|
0:26:27 | i believe that the interesting problem |
---|
0:26:30 | is how to design a backend |
---|
0:26:33 | to deal with that |
---|
0:26:36 | okay it might in fact involve modeling counts which will be |
---|
0:26:44 | the topic of your presentation |
---|
0:26:47 | although i believe |
---|
0:26:49 | there are other models which are just waiting to be used |
---|
0:26:53 | here i'm thinking particularly of latent dirichlet allocation |
---|
0:26:57 | which is the |
---|
0:26:59 | analogue for |
---|
0:27:01 | count data of eigenvoices |
---|
0:27:05 | for continuous data |
---|
0:27:08 | and |
---|
0:27:12 | one of the things you can do is you can |
---|
0:27:17 | build an i-vector extractor using latent dirichlet allocation for count data |
---|
0:27:22 | and if you can do eigenvoices you can also do |
---|
0:27:26 | an analogue of plda |
---|
0:27:29 | it'll behave very differently from the plda we know |
---|
0:27:33 | because it won't have gaussian assumptions |
---|
0:27:35 | it won't even have the assumption of statistical independence between speaker effects and channel effects |
---|
0:27:42 | that gives you a whole lot of liberty |
---|
0:27:44 | okay you can actually find a basis for the data by |
---|
0:27:49 | training the lda with unlabeled data you can do that with latent dirichlet allocation |
---|
0:27:56 | so there's actually a very big |
---|
0:27:58 | opportunity here waiting to be exploited |
---|
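(A sketch of the LDA-as-i-vector-extractor idea using scikit-learn's LatentDirichletAllocation; the synthetic count matrix and the sizes, 512 events and 50 components, are made up for illustration. Note that the fit is unsupervised, so no speaker labels are needed.)

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical per-utterance occupation counts over 512 events
# (e.g. mixture components or senones): the count-data analogue
# of zero-order Baum-Welch statistics.
counts = rng.poisson(2.0, size=(200, 512))

lda = LatentDirichletAllocation(n_components=50, random_state=0)
lda.fit(counts)                   # unsupervised, no speaker labels needed
utt_repr = lda.transform(counts)  # 50-dim i-vector-like representation
```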
0:28:02 | the only question is do we want to go in |
---|
0:28:06 | the direction of training a softmax or do we want to go in the direction of your representation |
---|
0:28:11 | i think personally and this is just an opinion |
---|
0:28:16 | personally i believe |
---|
0:28:18 | that |
---|
0:28:20 | neural networks |
---|
0:28:22 | okay are not up to our task okay |
---|
0:28:28 | we could never hope to do |
---|
0:28:30 | training on unlabeled data |
---|
0:28:33 | with just a neural network a neural network cannot discriminate between speakers it |
---|
0:28:37 | hasn't listened to |
---|
0:28:39 | so i think they will need to be complemented by a backend which is waiting |
---|
0:28:47 | to be developed |
---|
0:28:48 | not the backend that we have at present |
---|
0:28:54 | okay |
---|