0:01:04 | So the outline will be like this: I am going to have a short introduction to |
---|
0:01:09 | Restricted Boltzmann Machines. |
---|
0:01:10 | And then I will talk a little bit about deep and sparse Boltzmann Machines |
---|
0:01:15 | Then I am going to propose some topologies that are relevant to speaker recognition. |
---|
0:01:22 | And some experiments will follow. |
---|
0:01:28 | So, an RBM, as you already know from the keynote speaker, is a bipartite undirected graphical |
---|
0:01:34 | model |
---|
0:01:34 | with visible and hidden layers. |
---|
0:01:37 | They are the building blocks of deep belief nets and deep Boltzmann machines. |
---|
0:01:45 | Of course they are generative models. |
---|
0:01:49 | Although you can turn them into discriminative ones, we won't do that. |
---|
0:01:56 | Another key thing you have to know is the fact that the joint distribution forms an |
---|
0:02:03 | exponential family. |
---|
0:02:04 | And that is why many of the expressions are going to look very familiar to |
---|
0:02:08 | you. |
---|
0:02:12 | The main thing to see here is that there are no connections between nodes of |
---|
0:02:19 | the same layer. |
---|
0:02:22 | This allows a very fast training scheme, namely block Gibbs sampling. |
---|
0:02:28 | Meaning that we can sample a layer at once, |
---|
0:02:35 | but not node by node. |
---|
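As a rough illustration of what "sampling a layer at once" looks like, here is a minimal NumPy sketch of one block-Gibbs sweep; the binary units and the names (W for weights, b and c for biases) are my assumptions, not something shown in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def block_gibbs_sweep(v, W, b, c):
    """One block-Gibbs sweep for a binary RBM. Because there are no within-layer
    connections, each whole layer is sampled in parallel, not node by node."""
    p_h = sigmoid(c + v @ W)                          # p(h_j = 1 | v) for all j at once
    h = (rng.random(p_h.shape) < p_h).astype(float)   # sample the entire hidden layer
    p_v = sigmoid(b + h @ W.T)                        # p(v_i = 1 | h) for all i at once
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```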
0:02:42 | The main thing to know here is that although we don't have such connections within the |
---|
0:02:50 | visible layer, |
---|
0:02:51 | correlation is still present |
---|
0:02:54 | when you consider the marginal likelihood of your data, |
---|
0:02:58 | of your incomplete data, the v. |
---|
0:03:02 | As a feature extractor, you can see that the hidden variables capture higher-level, |
---|
0:03:11 | more structured information. |
---|
0:03:16 | Here are two examples, you have MNIST digits, the standard database. |
---|
0:03:23 | The W below is called the receptive field. |
---|
0:03:29 | It is not too much to say that they look like eigenvectors, as in some kind of analysis. |
---|
0:03:35 | This is the higher-order information that is able to be captured. |
---|
0:03:40 | As a generative model, what you need in order to reconstruct, say, pixel |
---|
0:03:49 | i, is the p_i. |
---|
0:03:49 | You simply project the h onto the i-th row. |
---|
0:03:55 | This transpose, by the way, is unnecessary. |
---|
0:03:58 | This gives you the p_i; p_i is the parameter of the Bernoulli, |
---|
0:04:06 | so if you need to go back and binarize it, simply do it by sampling a Bernoulli |
---|
0:04:14 | distribution with this p_i. |
---|
0:04:17 | The g is the logistic function, the sigmoid, that maps continuously from zero to one. |
---|
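A minimal sketch of the reconstruction step just described, assuming binary units and zero biases: project h onto the i-th row of W, pass it through the logistic g, and, if a binary value is needed, draw from a Bernoulli with that p_i. The names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_pixel(W, h, i):
    """p_i = g(w_i . h): project h onto the i-th row of W and squash with the sigmoid;
    binarizing, if needed, is a Bernoulli draw with parameter p_i (biases omitted)."""
    p_i = 1.0 / (1.0 + np.exp(-W[i, :] @ h))  # g maps the activation into (0, 1)
    v_i = rng.binomial(1, p_i)                # optional binarization step
    return p_i, v_i
```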
0:04:28 | Some useful expressions, the joint distribution looks like this. I denote with a star, p-star, |
---|
0:04:35 | the unnormalized density. Zeta is the so-called partition function, as you can see very |
---|
0:04:43 | clearly. |
---|
0:04:45 | And it forms an exponential family. |
---|
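For reference, the expressions being pointed to are presumably the standard binary-RBM ones, written here with a visible bias b and hidden bias c (the slide's notation may differ slightly):

```latex
% Standard binary-RBM energy, unnormalized density, and partition function.
E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{\top}\mathbf{h}
  - \mathbf{v}^{\top}\mathbf{W}\mathbf{h}, \qquad
p^{*}(\mathbf{v},\mathbf{h}) = e^{-E(\mathbf{v},\mathbf{h})}, \qquad
p(\mathbf{v},\mathbf{h}) = \frac{p^{*}(\mathbf{v},\mathbf{h})}{Z}, \qquad
Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}
```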
0:04:58 | So consider binary units and forget about the biases; assume they are zero. |
---|
0:05:05 | You see that the conditionals of both v and h |
---|
0:05:11 | have this nice product form. This is not an approximation; this is due to the restricted |
---|
0:05:20 | structure of the RBM. |
---|
0:05:22 | And this is a very useful result when you do the learning. |
---|
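The product-form conditionals being referred to are, in their usual form (binary units, biases dropped as assumed above):

```latex
% Factorized conditionals of a binary RBM with zero biases.
p(\mathbf{h}\mid\mathbf{v}) = \prod_{j} p(h_j \mid \mathbf{v}), \qquad
p(h_j = 1 \mid \mathbf{v}) = g\!\left(\mathbf{w}_{:j}^{\top}\mathbf{v}\right), \qquad
p(\mathbf{v}\mid\mathbf{h}) = \prod_{i} p(v_i \mid \mathbf{h}), \qquad
p(v_i = 1 \mid \mathbf{h}) = g\!\left(\mathbf{w}_{i:}\,\mathbf{h}\right)
```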
0:05:27 | So how do you do learning? |
---|
0:05:29 | How do you do it? You simply maximize the log-likelihood of theta given some observations. |
---|
0:05:40 | Simply consider that you want to estimate, for example, the W matrix, assuming that the biases |
---|
0:05:46 | are all zero. |
---|
0:05:48 | If you take the derivative here, you end up with this familiar expression. |
---|
0:05:53 | So, we have the data-dependent term and the data-independent term. |
---|
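That familiar expression, with its data-dependent and data-independent (model) terms, is usually written as:

```latex
% Weight gradient of the RBM log-likelihood: data term minus model term.
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{W}}
  = \mathbb{E}_{\text{data}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right]
  - \mathbb{E}_{\text{model}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right]
```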
0:05:58 | In the case of the RBM, it is exactly this product form that you exploit. |
---|
0:06:04 | It's very trivial to calculate the first term, the data dependent term. |
---|
0:06:08 | You have your data, the empirical distribution. All you have to do is to complete |
---|
0:06:13 | them. |
---|
0:06:13 | Based on the conditional of the h, which, given the product form, is very trivial. |
---|
0:06:22 | However, |
---|
0:06:23 | the second term, that is the model dependent term, |
---|
0:06:28 | is really hard to compute. By the way, what does this term mean? |
---|
0:06:33 | This term is simply a different expression, a different parameterization of W. |
---|
0:06:42 | So you have a current estimate of your W, of your model, but it is |
---|
0:06:48 | defined on a different space; it is defined on the canonical space, for the natural |
---|
0:06:53 | parameterization. |
---|
0:06:54 | What you want to do is to map it to the expectation space, that is |
---|
0:06:59 | where your sufficient statistics are defined. |
---|
0:07:02 | So all you need for the training here is nothing more than to map |
---|
0:07:08 | this W to a different space, the space of the sufficient statistics, to form the |
---|
0:07:15 | difference. |
---|
0:07:19 | So, Contrastive Divergence. |
---|
0:07:21 | First of all, how do you proceed? You have a batch, you split it into minibatches. |
---|
0:07:28 | Say one hundred samples each, a typical size. |
---|
0:07:34 | You proceed with one minibatch at a time, and you do this for several epochs. |
---|
0:07:40 | A momentum term is used to make the updates smoother, and it decreases with the epoch |
---|
0:07:48 | count. |
---|
0:07:49 | So contrastive divergence goes like this. |
---|
0:07:53 | What Hinton found was, in fact, that if you start, not randomly, but at |
---|
0:08:00 | each data point. |
---|
0:08:01 | And then you can simply sample, by successive conditioning. |
---|
0:08:08 | And you can just sample one step. And if you do so, you have a |
---|
0:08:13 | pretty nice algorithm to train it very fast. |
---|
0:08:15 | But that's not what we always do; that's good for initialization. |
---|
0:08:23 | If we want to do serious Gibbs sampling, then we have to start from random |
---|
0:08:31 | and let the chain loop for many steps. |
---|
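A compressed CD-1 sketch of the recipe just described: start the chain at the data, take one block-Gibbs step, and use the difference of the two outer products as the gradient. Minibatch scheduling, momentum, and biases are left out, and all names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.05):
    """One CD-1 step on a minibatch v0 (rows are samples); biases omitted.
    W has shape (n_visible, n_hidden)."""
    # Positive phase: start the chain at the data, not at random.
    ph0 = sigmoid(v0 @ W)                              # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample the whole hidden layer
    # Negative phase: a single step of block Gibbs sampling.
    pv1 = sigmoid(h0 @ W.T)                            # reconstruction of the visibles
    ph1 = sigmoid(pv1 @ W)                             # re-inferred hidden probabilities
    # Gradient: data-dependent term minus the one-step "model" term.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + lr * grad
```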
0:08:35 | So, having completed this short introduction to RBMs: |
---|
0:08:41 | Deep Boltzmann Machines and Deep Belief Nets. |
---|
0:08:46 | I'm not going to say much about the belief nets. |
---|
0:08:50 | As you see, the RBM is the main building block for both. |
---|
0:08:56 | The Boltzmann Machines are completely undirected; they are MRFs actually, with hidden variables. |
---|
0:09:01 | And both of them are constructed to capture information at higher levels. |
---|
0:09:10 | Here is the typical Deep Boltzmann Machine. |
---|
0:09:13 | So, to train this thing, |
---|
0:09:15 | you start with the conventional greedy layer-by-layer pretraining, |
---|
0:09:21 | and then you refine it with the so-called Persistent Contrastive Divergence. |
---|
0:09:27 | What you have to know here is that |
---|
0:09:29 | this nice product form of the conditional breaks down here. |
---|
0:09:35 | So you have to apply a mean-field approximation |
---|
0:09:38 | to approximate the first term. |
---|
0:09:42 | The second term, which is the model term, is the same; all you have |
---|
0:09:47 | to do is transform it from one space to another. |
---|
0:09:54 | So here is the log-likelihood. |
---|
0:09:57 | It's very straightforward. |
---|
0:10:08 | You also have this L that connects visible with visible units. |
---|
0:10:13 | And this J that connects hidden with hidden layers. |
---|
0:10:19 | So there are three matrices, plus the biases, that you want to train. |
---|
0:10:26 | With respect to every other node. |
---|
0:10:29 | The g, again, is just the logistic function. |
---|
0:10:33 | These are the closed-form expressions. |
---|
0:10:36 | However, the posterior is not the product of these conditionals. |
---|
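A small sketch of what those closed-form, node-wise conditionals look like for a Boltzmann machine with the three matrices the speaker mentions, W (visible-hidden), L (visible-visible) and J (hidden-hidden); binary units and zero biases are assumed, and the names are mine.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def p_h_given_rest(W, J, v, h, j):
    """p(h_j = 1 | v, h_{-j}). The self-term is subtracted because a node has no
    connection to itself (diag(J) is normally zero anyway)."""
    return sigmoid(W[:, j] @ v + J[j, :] @ h - J[j, j] * h[j])

def p_v_given_rest(W, L, v, h, i):
    """p(v_i = 1 | h, v_{-i}) with L the visible-visible couplings."""
    return sigmoid(W[i, :] @ h + L[i, :] @ v - L[i, i] * v[i])
```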
0:10:49 | So that's the way to proceed. Assume a factorization; again, this is standard, mean-field |
---|
0:10:56 | based. Next, assume a factorized posterior of this form. |
---|
0:11:00 | And recall that the log likelihood is like this. |
---|
0:11:03 | And simply consider the variational Bayesian lower bound. |
---|
0:11:06 | This is typical if you do that. H is the entropy of this posterior. |
---|
0:11:14 | This is the mean-field posterior. |
---|
0:11:17 | It replaces h with its expectation, that is, the mu. |
---|
0:11:24 | The other is just the formula for the entropy. |
---|
0:11:28 | I repeat. |
---|
0:11:31 | This is the formula to estimate the entropy H(q). |
---|
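The bound being described is, in its usual mean-field form, with a fully factorized q over the hidden units and mu_j = q(h_j = 1):

```latex
% Variational lower bound with a factorized posterior over the hidden units.
\log p(\mathbf{v};\theta) \;\ge\;
  \mathbb{E}_{q(\mathbf{h})}\!\left[\log p^{*}(\mathbf{v},\mathbf{h};\theta)\right]
  + \mathcal{H}(q) - \log Z(\theta), \qquad
\mathcal{H}(q) = -\sum_{j}\left[\mu_j\log\mu_j + (1-\mu_j)\log(1-\mu_j)\right]
```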
0:11:37 | So during training, what you have to do is to complete your data. |
---|
0:11:44 | You have the visible, you have to complete the data. |
---|
0:11:47 | You complete your data with the estimate of h. So what must you do? |
---|
0:11:52 | You approximate the posterior. And when you evaluate, |
---|
0:11:58 | you use the variational lower bound instead of the marginal log-likelihood. |
---|
0:12:05 | So this is how Persistent Contrastive Divergence works; this is the complete picture. |
---|
0:12:10 | You first initialize with ?. You might have initialized it already with some contrastive |
---|
0:12:19 | divergence training, as pretraining. |
---|
0:12:21 | And for each batch and minibatch and epoch, repeat until convergence. |
---|
0:12:27 | First, do the variational approximation; you need that in order to approximate the first term. |
---|
0:12:32 | So that you complete your data. |
---|
0:12:36 | So you do this iteratively until it converges. |
---|
0:12:42 | And then you have the stochastic approximation. |
---|
0:12:44 | That is, to transform the current estimate to the expectation parameterization. |
---|
0:12:50 | How do you do that? With Gibbs Sampling. |
---|
0:12:53 | That's how you do that. |
---|
0:12:55 | And then you do the parameter update. |
---|
0:12:58 | There is a W here, but the other matrices also follow |
---|
0:13:03 | relatively the same formulas. |
---|
0:13:05 | You see here, the first step is to approximate using this, and the other using |
---|
0:13:12 | this. That's the stochastic approximation. |
---|
0:13:14 | And of course you have a learning rate that decreases with the epoch |
---|
0:13:22 | count. |
---|
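A very compressed sketch of one such update for a single weight matrix, under my own naming and with momentum, learning-rate decay, and biases omitted; in a real DBM the first step would iterate mean-field updates across layers rather than use a single exact conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(W, v, fantasy_v, lr=0.01, gibbs_steps=1):
    """One PCD-style update. v: minibatch of data; fantasy_v: persistent chain
    states kept across updates (the 'model' side of the gradient)."""
    # 1) Variational step: "complete" the data with posterior expectations mu.
    mu = sigmoid(v @ W)
    # 2) Stochastic approximation: advance the persistent Gibbs chains.
    for _ in range(gibbs_steps):
        ph = sigmoid(fantasy_v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T)
        fantasy_v = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(fantasy_v @ W)
    # 3) Parameter update: data-dependent term minus model (fantasy) term.
    W = W + lr * (v.T @ mu / len(v) - fantasy_v.T @ ph_model / len(fantasy_v))
    return W, fantasy_v
```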
0:13:23 | So, how can you do classification? Some examples. |
---|
0:13:29 | Here is the Boltzmann Machine, you can use the outermost layer for the labels. |
---|
0:13:34 | You may consider that as your data. |
---|
0:13:44 | You want to evaluate this thing; how can you do it? |
---|
0:13:47 | Well, like hypothesis testing. So you have a v and you want to classify |
---|
0:13:56 | it. |
---|
0:13:57 | You test each hypothesis by setting one of the label nodes on, out of all |
---|
0:14:06 | your label nodes. |
---|
0:14:07 | And that's why you calculate the largest ? for each class. |
---|
0:14:13 | The point here is that you are not required to estimate zeta, the normalizer, which is |
---|
0:14:20 | really hard. You know why? The likelihood ratio does not involve it at all. |
---|
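A sketch of that idea for an RBM whose hidden layer is connected both to the data v and to a one-hot label vector (a hypothetical layout; all parameter names are mine). Because the partition function is the same under every label hypothesis, comparing unnormalized log-probabilities is enough:

```python
import numpy as np

def log_p_star(v, y_onehot, W_v, W_y, b_h):
    """Unnormalized log p*(v, y); visible/label biases omitted for brevity.
    Summing out the binary hidden units gives a sum of softplus terms."""
    pre = b_h + v @ W_v + y_onehot @ W_y
    return np.sum(np.logaddexp(0.0, pre))          # sum_j softplus(pre_j)

def classify(v, W_v, W_y, b_h, n_classes):
    """Pick the label whose hypothesis gives the largest unnormalized probability;
    the partition function cancels across hypotheses, so it is never computed."""
    scores = []
    for k in range(n_classes):
        y = np.zeros(n_classes)
        y[k] = 1.0
        scores.append(log_p_star(v, y, W_v, W_y, b_h))
    return int(np.argmax(scores))
```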
0:14:27 | PLDA: this is another example; you have the simple RBM. |
---|
0:14:37 | We are going to represent it like this. |
---|
0:14:43 | This is the typical example of how you can do PLDA. |
---|
0:14:59 | So this is the model we will examine. |
---|
0:15:02 | It's called the Siamese twin. |
---|
0:15:07 | What does it model? The first model, on your right, is the h0 hypothesis, |
---|
0:15:13 | which says that the two speakers are not the same. |
---|
0:15:18 | With the RBM, we model the distribution of the complete supervector. |
---|
0:15:30 | {Q&A} |
---|
0:15:56 | How do you train this model? |
---|
0:15:57 | You first train this, which is simply modeling the distribution of the i-vectors. |
---|
0:16:08 | And then the h1 model, the Siamese twin, |
---|
0:16:12 | to capture correlation between the layers. |
---|
0:16:19 | These are symmetric matrices, I mean the x and y are symmetric matrices, and we |
---|
0:16:25 | try to capture the correlation. |
---|
0:16:27 | The h0 hypothesis completely relies on the statistical independence assumption. |
---|
0:16:34 | We don't try to model the h0 hypothesis using negative examples. |
---|
0:16:43 | We are going to compute this statistic. |
---|
0:16:51 | So, how do we train that? As I told you, we first train the singleton |
---|
0:16:59 | model, which is simply an RBM. |
---|
0:17:02 | And then you collect pairs of i-vectors of the same speaker. |
---|
0:17:07 | And then split them into minibatches. |
---|
0:17:11 | And then, based on the w0, your singleton model, you initialize your twin model. |
---|
0:17:19 | Apply several epochs of this contrastive divergence algorithm. |
---|
0:17:25 | To evaluate it, similarly to the other models, you use the variational Bayesian lower bound for both |
---|
0:17:34 | hypotheses. |
---|
0:17:34 | Partition functions are not required, the threshold will absorb them. |
---|
0:17:43 | They are data independent. |
---|
0:17:45 | So there is no reason actually to compute the partition function; it will be absorbed by |
---|
0:17:52 | your threshold. |
---|
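A sketch of the trial score this implies, assuming the two variational bounds have already been computed; the function and variable names are mine. The unknown log-partition terms add the same constant to every trial, which is exactly what the threshold absorbs:

```python
def verification_score(bound_h1, bound_h0_enroll, bound_h0_test):
    """Trial score as a difference of variational lower bounds (log domain).
    bound_h1: bound under the Siamese-twin (correlation) model;
    bound_h0_*: bounds under the singleton model, using the independence assumption."""
    return bound_h1 - (bound_h0_enroll + bound_h0_test)

# Hypothetical usage: accept the same-speaker hypothesis if the score clears a
# threshold calibrated on development data.
# accept = verification_score(b1, b0_enroll, b0_test) > threshold
```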
0:17:53 | So, experiment. |
---|
0:19:24 | So, this is the configuration. |
---|
0:19:28 | It's a standard configuration; we applied it like this. Unfortunately we tried more, but we |
---|
0:19:36 | failed. |
---|
0:19:36 | So, in that case, let's at least use this standard ?, to see whether we are |
---|
0:19:46 | doing better than cosine distance. |
---|
0:19:48 | That's what we are doing. |
---|
0:19:52 | To address that, there is our work in Interspeech that somehow tries to make some |
---|
0:20:02 | of these supervised learning approaches work using, again, Deep Boltzmann Machines. |
---|
0:20:10 | The results are like this. |
---|
0:20:15 | That notation means: Boltzmann Machine, how many nodes in the first layer, how |
---|
0:20:22 | many nodes in the second layer. |
---|
0:20:24 | So you see that the configuration of two hundred compared to ? was the best, |
---|
0:20:31 | and this is the cosine distance. |
---|
0:21:24 | These are the results for evaluating on the female portion. |
---|
0:21:35 | I think in terms of error rate, they are quite comparable. |
---|
0:21:54 | Conclusions: Boltzmann Machines form a really full-fledged framework for combining generative and |
---|
0:22:05 | discriminative models. |
---|
0:22:07 | It's ideal when a large amount of unlabeled data is available along with some limited |
---|
0:22:14 | amount of labeled data. |
---|
0:22:16 | It's an alternative way to introduce hierarchies and extract higher level representations, and maybe Bayesian |
---|
0:22:25 | inference can be applied, although you have some ? approach. |
---|