0:01:04 | So the outline will be like this: I am going to have a short introduction to |
---|
0:01:09 | Restricted Boltzmann Machines. |
---|
0:01:10 | And then I will talk a little bit about deep and sparse Boltzmann Machines |
---|
0:01:15 | Then I am going to propose some topologies that are relevant to speaker recognition. |
---|
0:01:22 | And some experiments will follow. |
---|
0:01:28 | So, an RBM, as you already know from the keynote speaker, is a bipartite undirected graphical |
---|
0:01:34 | model |
---|
0:01:34 | with visible and hidden layers. |
---|
0:01:37 | They are the building blocks of deep belief nets and deep Boltzmann machines. |
---|
0:01:45 | Of course they are generative models. |
---|
0:01:49 | Although you can turn them into discriminative ones, we won't do that. |
---|
0:01:56 | Another key thing you have to know is the fact that the joint distribution forms an |
---|
0:02:03 | exponential family. |
---|
0:02:04 | And that is why many of the expressions are going to look very familiar to |
---|
0:02:08 | you. |
---|
0:02:12 | The main thing to see here is that there are no connections between nodes of |
---|
0:02:19 | the same layer. |
---|
0:02:22 | This allows a very fast training scheme, namely block Gibbs sampling. |
---|
0:02:28 | Meaning that we can sample a layer at once, |
---|
0:02:35 | but not node by node. |
---|
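As a rough illustration of what "sampling a layer at once" looks like, here is a minimal NumPy sketch of one block-Gibbs sweep; the binary units and the names (W for weights, b and c for biases) are my assumptions, not something shown in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def block_gibbs_sweep(v, W, b, c):
    """One block-Gibbs sweep for a binary RBM. Because there are no within-layer
    connections, each whole layer is sampled in parallel, not node by node."""
    p_h = sigmoid(c + v @ W)                          # p(h_j = 1 | v) for all j at once
    h = (rng.random(p_h.shape) < p_h).astype(float)   # sample the entire hidden layer
    p_v = sigmoid(b + h @ W.T)                        # p(v_i = 1 | h) for all i at once
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```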
0:02:42 | The main thing to know here is that although we don't have such connections within the |
---|
0:02:50 | visible layer, |
---|
0:02:51 | correlation is still present |
---|
0:02:54 | when you consider the marginal likelihood of your data, |
---|
0:02:58 | of your incomplete data, the v. |
---|
0:03:02 | As a feature extractor, you can see that the hidden variables capture higher-level, |
---|
0:03:11 | more structured information. |
---|
0:03:16 | Here are two examples, you have MNIST digits, the standard database. |
---|
0:03:23 | The W below is called the receptive field. |
---|
0:03:29 | It is not too much to say that they look like eigenvectors, as in some kind of analysis. |
---|
0:03:35 | This is the higher-order information that is able to be captured. |
---|
0:03:40 | As a generative model, what you need in order to reconstruct, say, pixel |
---|
0:03:49 | i, is the p_i. |
---|
0:03:49 | You simply project the h onto the i-th row. |
---|
0:03:55 | This transpose, by the way, is unnecessary. |
---|
0:03:58 | This gives you the p_i; p_i is the parameter of the Bernoulli, |
---|
0:04:06 | so if you need to go back and binarize it, simply do it by sampling a Bernoulli |
---|
0:04:14 | distribution with this p_i. |
---|
0:04:17 | The g is the logistic function, the sigmoid, that maps continuously from zero to one. |
---|
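A minimal sketch of the reconstruction step just described, assuming binary units and zero biases: project h onto the i-th row of W, pass it through the logistic g, and, if a binary value is needed, draw from a Bernoulli with that p_i. The names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_pixel(W, h, i):
    """p_i = g(w_i . h): project h onto the i-th row of W and squash with the sigmoid;
    binarizing, if needed, is a Bernoulli draw with parameter p_i (biases omitted)."""
    p_i = 1.0 / (1.0 + np.exp(-W[i, :] @ h))  # g maps the activation into (0, 1)
    v_i = rng.binomial(1, p_i)                # optional binarization step
    return p_i, v_i
```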
0:04:28 | Some useful expressions, the joint distribution looks like this. I denote with a star, p-star, |
---|
0:04:35 | the unnormalized density. Zeta is the so-called partition function, as you can see very |
---|
0:04:43 | clearly. |
---|
0:04:45 | And it forms an exponential family. |
---|
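For reference, the expressions being pointed to are presumably the standard binary-RBM ones, written here with a visible bias b and hidden bias c (the slide's notation may differ slightly):

```latex
% Standard binary-RBM energy, unnormalized density, and partition function.
E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h}
  - \mathbf{v}^{\top}\mathbf{W}\mathbf{h}, \qquad
p^{*}(\mathbf{v},\mathbf{h}) = e^{-E(\mathbf{v},\mathbf{h})}, \qquad
p(\mathbf{v},\mathbf{h}) = \frac{p^{*}(\mathbf{v},\mathbf{h})}{Z}, \qquad
Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}
```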
0:04:58 | So consider binary units and forget about the biases; assume they are zero. |
---|
0:05:05 | You see that the conditionals of both v and h |
---|
0:05:11 | have this nice product form. This is not an approximation; this is due to the restricted |
---|
0:05:20 | structure of the RBM. |
---|
0:05:22 | And this is a very useful result when you do the learning. |
---|
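The product-form conditionals being referred to are, in their usual form (binary units, biases dropped as assumed above):

```latex
% Factorized conditionals of a binary RBM with zero biases.
p(\mathbf{h}\mid\mathbf{v}) = \prod_{j} p(h_j \mid \mathbf{v}), \qquad
p(h_j = 1 \mid \mathbf{v}) = g\!\left(\mathbf{w}_{:j}^{\top}\mathbf{v}\right), \qquad
p(\mathbf{v}\mid\mathbf{h}) = \prod_{i} p(v_i \mid \mathbf{h}), \qquad
p(v_i = 1 \mid \mathbf{h}) = g\!\left(\mathbf{w}_{i:}\,\mathbf{h}\right)
```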
0:05:27 | So how do you do learning? |
---|
0:05:29 | How do you do it? You simply maximize the log-likelihood of theta given some observations. |
---|
0:05:40 | Simply consider that you want to estimate, for example, the W matrix, assuming that the biases |
---|
0:05:46 | are all zero. |
---|
0:05:48 | If you take the derivative here, you end up with this familiar expression. |
---|
0:05:53 | So, we have the data-dependent term and the data-independent term. |
---|
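That familiar expression, with its data-dependent and data-independent (model) terms, is usually written as:

```latex
% Weight gradient of the RBM log-likelihood: data term minus model term.
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{W}}
  = \mathbb{E}_{\text{data}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right]
  - \mathbb{E}_{\text{model}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right]
```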
0:05:58 | In the case of the RBM, it is exactly this product form that you exploit. |
---|
0:06:04 | It's very trivial to calculate the first term, the data dependent term. |
---|
0:06:08 | You have your data, the empirical distribution. All you have to do is to complete |
---|
0:06:13 | them. |
---|
0:06:13 | Based on the conditional of the h, which, given the product form, is very trivial. |
---|
0:06:22 | However, |
---|
0:06:23 | the second term, that is the model dependent term, |
---|
0:06:28 | is really hard to compute. By the way, what does this term mean? |
---|
0:06:33 | This term is simply a different expression, a different parameterization of W. |
---|
0:06:42 | So you have a current estimate of your W, of your model, but it is |
---|
0:06:48 | defined on a different space; it is defined on the canonical space, for the natural |
---|
0:06:53 | parameterization. |
---|
0:06:54 | What you want to do is to map it to the expectation space, that is |
---|
0:06:59 | where your sufficient statistics are defined. |
---|
0:07:02 | So all you need for the training here is nothing more than to map |
---|
0:07:08 | this W to a different space, the space of the sufficient statistics, to form the |
---|
0:07:15 | difference. |
---|
0:07:19 | So, Contrastive Divergence. |
---|
0:07:21 | First of all, how do you proceed? You have a batch, you split it into minibatches. |
---|
0:07:28 | Say one hundred samples each, a typical size. |
---|
0:07:34 | You proceed with one minibatch at a time, and you do this for several epochs. |
---|
0:07:40 | A momentum term is used to make the updates smoother, and it decreases with the epoch |
---|
0:07:48 | count. |
---|
0:07:49 | So contrastive divergence goes like this. |
---|
0:07:53 | What Hinton found was, in fact, that if you start, not randomly, but at |
---|
0:08:00 | each data point. |
---|
0:08:01 | And then you can simply sample, by successive conditioning. |
---|
0:08:08 | And you can just sample one step. And if you do so, you have a |
---|
0:08:13 | pretty nice algorithm to train it very fast. |
---|
0:08:15 | But that's not what we always do; that's good for initialization. |
---|
0:08:23 | If we want to do serious Gibbs sampling, then we have to start from random |
---|
0:08:31 | and let the chain loop for many steps. |
---|
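A compressed CD-1 sketch of the recipe just described: start the chain at the data, take one block-Gibbs step, and use the difference of the two outer products as the gradient. Minibatch scheduling, momentum, and biases are left out, and all names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.05):
    """One CD-1 step on a minibatch v0 (rows are samples); biases omitted.
    W has shape (n_visible, n_hidden)."""
    # Positive phase: start the chain at the data, not at random.
    ph0 = sigmoid(v0 @ W)                              # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample the whole hidden layer
    # Negative phase: a single step of block Gibbs sampling.
    pv1 = sigmoid(h0 @ W.T)                            # reconstruction of the visibles
    ph1 = sigmoid(pv1 @ W)                             # re-inferred hidden probabilities
    # Gradient: data-dependent term minus the one-step "model" term.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```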
0:08:35 | So, having completed this short introduction to RBMs: |
---|
0:08:41 | Deep Boltzmann Machines and Deep Belief Nets. |
---|
0:08:46 | I'm not going to say much about the belief nets. |
---|
0:08:50 | As you see, the RBM is the main building block for both. |
---|
0:08:56 | The Boltzmann Machines are completely undirected; they are MRFs actually, with hidden variables. |
---|
0:09:01 | And both of them are constructed to capture information at higher levels. |
---|
0:09:10 | Here is the typical Deep Boltzmann Machine. |
---|
0:09:13 | So, to train this thing, |
---|
0:09:15 | you start with the conventional greedy layer-by-layer pretraining, |
---|
0:09:21 | and then you refine it with the so-called Persistent Contrastive Divergence. |
---|
0:09:27 | What you have to know here is that |
---|
0:09:29 | this nice product form of the conditional breaks down here. |
---|
0:09:35 | So you have to apply a mean-field approximation |
---|
0:09:38 | to approximate the first term. |
---|
0:09:42 | The second term, which is the model term, is the same; all you have |
---|
0:09:47 | to do is transform it from one space to another. |
---|
0:09:54 | So here is the log-likelihood. |
---|
0:09:57 | It's very straightforward. |
---|
0:10:08 | You also have this L that connects visible with visible units. |
---|
0:10:13 | And this J that connects hidden with hidden layers. |
---|
0:10:19 | So there are three matrices, plus the biases, that you want to train. |
---|
0:10:26 | With respect to every other node. |
---|
0:10:29 | The g, again, is just the logistic function. |
---|
0:10:33 | These are the closed-form expressions. |
---|
0:10:36 | However, the posterior is not the product of these conditionals. |
---|
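A small sketch of what those closed-form, node-wise conditionals look like for a Boltzmann machine with the three matrices the speaker mentions, W (visible-hidden), L (visible-visible) and J (hidden-hidden); binary units and zero biases are assumed, and the names are mine.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def p_h_given_rest(W, J, v, h, j):
    """p(h_j = 1 | v, h_{-j}). The self-term is subtracted because a node has no
    connection to itself (diag(J) is normally zero anyway)."""
    return sigmoid(W[:, j] @ v + J[j, :] @ h - J[j, j] * h[j])

def p_v_given_rest(W, L, v, h, i):
    """p(v_i = 1 | h, v_{-i}) with L the visible-visible couplings."""
    return sigmoid(W[i, :] @ h + L[i, :] @ v - L[i, i] * v[i])
```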
0:10:49 | So that's the way to proceed. Assume a factorization; again, this is standard, mean-field |
---|
0:10:56 | based. Next, assume a factorized posterior of this form. |
---|
0:11:00 | And recall that the log likelihood is like this. |
---|
0:11:03 | And simply consider the variational Bayesian lower bound. |
---|
0:11:06 | This is typical if you do that. H is the entropy of this posterior. |
---|
0:11:14 | This is the mean-field posterior. |
---|
0:11:17 | It replaces h with its expectation, that is, the mu. |
---|
0:11:24 | The other is just the formula for the entropy. |
---|
0:11:28 | I repeat. |
---|
0:11:31 | This is the formula to estimate the entropy H(q). |
---|
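The bound being described is, in its usual mean-field form, with a fully factorized q over the hidden units and mu_j = q(h_j = 1):

```latex
% Variational lower bound with a factorized posterior over the hidden units.
\log p(\mathbf{v};\theta) \;\ge\;
  \mathbb{E}_{q(\mathbf{h})}\!\left[\log p^{*}(\mathbf{v},\mathbf{h};\theta)\right]
  + \mathcal{H}(q) - \log Z(\theta), \qquad
\mathcal{H}(q) = -\sum_{j}\left[\mu_j\log\mu_j + (1-\mu_j)\log(1-\mu_j)\right]
```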
0:11:37 | So during training, what you have to do is to complete your data. |
---|
0:11:44 | You have the visible, you have to complete the data. |
---|
0:11:47 | You complete your data with the estimate of h. So what must you do? |
---|
0:11:52 | You approximate the posterior. And when you evaluate, |
---|
0:11:58 | you use the variational lower bound instead of the marginal log-likelihood. |
---|
0:12:05 | So this is how Persistent Contrastive Divergence works; this is the complete picture. |
---|
0:12:10 | You first initialize with ?. You might have initialized it already with some contrastive |
---|
0:12:19 | divergence training, as pretraining. |
---|
0:12:21 | And for each batch and minibatch and epoch, repeat until convergence. |
---|
0:12:27 | First, do the variational approximation; you need that in order to approximate the first term. |
---|
0:12:32 | So that you complete your data. |
---|
0:12:36 | So you do this iteratively until it converges. |
---|
0:12:42 | And then you have the stochastic approximation. |
---|
0:12:44 | That is, to transform the current estimate to the expectation parameterization. |
---|
0:12:50 | How do you do that? With Gibbs Sampling. |
---|
0:12:53 | That's how you do that. |
---|
0:12:55 | And then you do the parameter update. |
---|
0:12:58 | There is a W here, but the other matrices also follow |
---|
0:13:03 | relatively the same formulas. |
---|
0:13:05 | You see here, the first step is to approximate using this, and the other using |
---|
0:13:12 | this. That's the stochastic approximation. |
---|
0:13:14 | And of course you have a learning rate that decreases with the epoch |
---|
0:13:22 | count. |
---|
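A very compressed sketch of one such update for a single weight matrix, under my own naming and with momentum, learning-rate decay, and biases omitted; in a real DBM the first step would iterate mean-field updates across layers rather than use a single exact conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(W, v, fantasy_v, lr=0.01, gibbs_steps=1):
    """One PCD-style update. v: minibatch of data; fantasy_v: persistent chain
    states kept across updates (the 'model' side of the gradient)."""
    # 1) Variational step: "complete" the data with posterior expectations mu.
    mu = sigmoid(v @ W)
    # 2) Stochastic approximation: advance the persistent Gibbs chains.
    for _ in range(gibbs_steps):
        ph = sigmoid(fantasy_v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T)
        fantasy_v = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(fantasy_v @ W)
    # 3) Parameter update: data-dependent term minus model (fantasy) term.
    W = W + lr * (v.T @ mu / len(v) - fantasy_v.T @ ph_model / len(fantasy_v))
    return W, fantasy_v
```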
0:13:23 | So, how can you do classification? Some examples. |
---|
0:13:29 | Here is the Boltzmann Machine, you can use the outermost layer for the labels. |
---|
0:13:34 | You may consider that as your data. |
---|
0:13:44 | You want to evaluate this thing; how can you do it? |
---|
0:13:47 | Well, like hypothesis testing. So you have a v and you want to classify |
---|
0:13:56 | it. |
---|
0:13:57 | You test each hypothesis by setting one of the label nodes on, out of all |
---|
0:14:06 | your label nodes. |
---|
0:14:07 | And that's why you calculate the largest ? for each class. |
---|
0:14:13 | The point here is that you are not required to estimate zeta, the normalizer, which is |
---|
0:14:20 | really hard. You know why? The likelihood ratio does not involve it at all. |
---|
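A sketch of that idea for an RBM whose hidden layer is connected both to the data v and to a one-hot label vector (a hypothetical layout; all parameter names are mine). Because the partition function is the same under every label hypothesis, comparing unnormalized log-probabilities is enough:

```python
import numpy as np

def log_p_star(v, y_onehot, W_v, W_y, b_h):
    """Unnormalized log p*(v, y); visible/label biases omitted for brevity.
    Summing out the binary hidden units gives a sum of softplus terms."""
    pre = b_h + v @ W_v + y_onehot @ W_y
    return np.sum(np.logaddexp(0.0, pre))          # sum_j softplus(pre_j)

def classify(v, W_v, W_y, b_h, n_classes):
    """Pick the label whose hypothesis gives the largest unnormalized probability;
    the partition function cancels across hypotheses, so it is never computed."""
    scores = []
    for k in range(n_classes):
        y = np.zeros(n_classes)
        y[k] = 1.0
        scores.append(log_p_star(v, y, W_v, W_y, b_h))
    return int(np.argmax(scores))
```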
0:14:27 | PLDA: this is another example; you have the simple RBM. |
---|
0:14:37 | We are going to represent it like this. |
---|
0:14:43 | This is the typical example of how you can do PLDA. |
---|
0:14:59 | So this is the model we will examine. |
---|
0:15:02 | It's called the Siamese twin. |
---|
0:15:07 | What does it model? The first model, on your right, is the h0 hypothesis, |
---|
0:15:13 | which says that the two speakers are not the same. |
---|
0:15:18 | With the RBM, we model the distribution of the complete supervector. |
---|
0:15:30 | {Q&A} |
---|
0:15:56 | How do you train this model? |
---|
0:15:57 | You first train this, which is simply modeling the distribution of the i-vectors. |
---|
0:16:08 | And then the h1 model, the Siamese twin, |
---|
0:16:12 | to capture correlation between the layers. |
---|
0:16:19 | These are symmetric matrices, I mean the x and y are symmetric matrices, and we |
---|
0:16:25 | try to capture the correlation. |
---|
0:16:27 | The h0 hypothesis completely relies on the statistical independence assumption. |
---|
0:16:34 | We don't try to model the h0 hypothesis using negative examples. |
---|
0:16:43 | We are going to compute this statistic. |
---|
0:16:51 | So, how do we train that? As I told you, we first train the singleton |
---|
0:16:59 | model, which is simply an RBM. |
---|
0:17:02 | And then you collect pairs of i-vectors of the same speaker. |
---|
0:17:07 | And then split them into minibatches. |
---|
0:17:11 | And then, based on the w0, your singleton model, you initialize your twin model. |
---|
0:17:19 | Apply several epochs of this contrastive divergence algorithm. |
---|
0:17:25 | To evaluate it, similarly to the other models, you use the variational Bayesian lower bound for both |
---|
0:17:34 | hypotheses. |
---|
0:17:34 | Partition functions are not required, the threshold will absorb them. |
---|
0:17:43 | They are data independent. |
---|
0:17:45 | So there is no reason actually to compute the partition function; it will be absorbed by |
---|
0:17:52 | your threshold. |
---|
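A sketch of the trial score this implies, assuming the two variational bounds have already been computed; the function and variable names are mine. The unknown log-partition terms add the same constant to every trial, which is exactly what the threshold absorbs:

```python
def verification_score(bound_h1, bound_h0_enroll, bound_h0_test):
    """Trial score as a difference of variational lower bounds (log domain).
    bound_h1: bound under the Siamese-twin (correlation) model;
    bound_h0_*: bounds under the singleton model, using the independence assumption."""
    return bound_h1 - (bound_h0_enroll + bound_h0_test)

# Hypothetical usage: accept the same-speaker hypothesis if the score clears a
# threshold calibrated on development data.
# accept = verification_score(b1, b0_enroll, b0_test) > threshold
```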
0:17:53 | So, experiment. |
---|
0:19:24 | So, this is the configuration. |
---|
0:19:28 | It's a standard configuration; we applied it like this. Unfortunately we tried more, but we |
---|
0:19:36 | failed. |
---|
0:19:36 | So, in that case, let's at least use this standard ?, to see whether we are |
---|
0:19:46 | doing better than cosine distance. |
---|
0:19:48 | That's what we are doing. |
---|
0:19:52 | To address that, there is our work in Interspeech that somehow tries to make some |
---|
0:20:02 | of these supervised learning approaches work using, again, Deep Boltzmann Machines. |
---|
0:20:10 | The results are like this. |
---|
0:20:15 | That notation means: Boltzmann Machine, how many nodes in the first layer, how |
---|
0:20:22 | many nodes in the second layer. |
---|
0:20:24 | So you see that the configuration of two hundred compared to ? was the best, |
---|
0:20:31 | and this is the cosine distance. |
---|
0:21:24 | These are the results for evaluating on the female portion. |
---|
0:21:35 | I think in terms of error rate, they are quite comparable. |
---|
0:21:54 | Conclusions: Boltzmann Machines form a really full-fledged framework for combining generative and |
---|
0:22:05 | discriminative models. |
---|
0:22:07 | It's ideal when a large amount of unlabeled data is available along with some limited |
---|
0:22:14 | amount of labeled data. |
---|
0:22:16 | It's an alternative way to introduce hierarchies and extract higher level representations, and maybe Bayesian |
---|
0:22:25 | inference can be applied, although you have some ? approach. |
---|