0:00:16 | okay last undo |
---|
0:00:18 | i'm going to present our work on i-vector transformation and scaling for PLDA-based |
---|
0:00:23 | speaker recognition. |
---|
0:00:24 | and the goal of this work |
---|
0:00:26 | is |
---|
0:00:27 | to present a way to transform our i-vectors so that they better fit the PLDA |
---|
0:00:33 | assumptions, |
---|
0:00:34 | and at the same time to introduce a way |
---|
0:00:37 | to perform some sort of dataset mismatch compensation, similar to what length normalization |
---|
0:00:43 | provides for PLDA. |
---|
0:00:46 | so |
---|
0:00:47 | as we all know, PLDA assumes that the latent variables are gaussian, which means |
---|
0:00:54 | that the resulting i-vectors, if we assume they are independently sampled, would |
---|
0:01:00 | follow a gaussian distribution. |
---|
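(as background for the gaussian assumption just mentioned, here is a common simplified form of the PLDA model; the exact variant used by the authors is not stated in the talk, so take this as an illustrative sketch only.)

```latex
% simplified PLDA-style generative model: a gaussian latent speaker factor y_i
% and a gaussian residual imply a gaussian marginal for the i-vectors phi_ij
\phi_{ij} = \mu + V y_i + \varepsilon_{ij},\qquad
y_i \sim \mathcal{N}(0, I),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \Lambda^{-1})
\;\Longrightarrow\;
\phi_{ij} \sim \mathcal{N}\!\left(\mu,\; V V^{\top} + \Lambda^{-1}\right)
```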
0:01:02 | now we all know this is not really the case |
---|
0:01:06 | indeed |
---|
0:01:07 | we have two main problems. first of all, |
---|
0:01:10 | our |
---|
0:01:11 | i-vectors do not really look like they should if they were sampled from a gaussian |
---|
0:01:16 | distribution |
---|
0:01:17 | for example here on the right |
---|
0:01:19 | i'm plotting one dimension of the i-vectors, the dimension with the highest skewness; |
---|
0:01:26 | i plot its histogram, and it's quite clear that |
---|
0:01:29 | the histogram doesn't really resemble anything like a gaussian distribution; it's even almost multimodal. |
---|
0:01:37 | then the other problem is that we have |
---|
0:01:39 | a quite evident mismatch between development and evaluation |
---|
0:01:43 | i-vectors. |
---|
0:01:45 | for example if we look at the left |
---|
0:01:49 | there is a plot of the histogram of the squared i-vector norms for both |
---|
0:01:53 | our development set, which is the SRE 10 female set, |
---|
0:01:57 | and the evaluation set, which is the condition 5 female set, also of SRE 10, |
---|
0:02:01 | and we can see two things first of all |
---|
0:02:05 | the distributions of the evaluation and development sets are |
---|
0:02:10 | quite different from each other, |
---|
0:02:12 | and none of them resembles what we should expect if |
---|
0:02:16 | these i-vectors had been sampled from a standard normal distribution. |
---|
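(for reference, the expectation alluded to here: if the M-dimensional i-vectors really were standard normal, their squared norms would follow a chi-squared distribution with M degrees of freedom, which is the reference curve such histograms are usually compared against.)

```latex
\phi \sim \mathcal{N}(0, I_M)
\;\Longrightarrow\;
\|\phi\|^{2} = \sum_{m=1}^{M}\phi_m^{2} \sim \chi^{2}_{M},
\qquad \mathbb{E}\!\left[\|\phi\|^{2}\right] = M
```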
0:02:21 | now |
---|
0:02:22 | up to now we have had |
---|
0:02:24 | mainly two ways to approach |
---|
0:02:27 | the issues i've presented. |
---|
0:02:29 | the first one was heavy-tailed PLDA, presented yesterday by Patrick Kenny, which mainly tries to deal with the |
---|
0:02:34 | non-gaussian behaviour; |
---|
0:02:36 | the way it deals with the gaussian assumption is that it removes the gaussian priors |
---|
0:02:40 | and assumes that the i-vector distributions are heavy-tailed. |
---|
0:02:44 | and the second one is length normalization, |
---|
0:02:47 | which in our opinion is not really making things more gaussian, but is mainly |
---|
0:02:53 | dealing with the dataset mismatch that we have between evaluation and development i-vectors. |
---|
0:03:00 | indeed, here i'm doing the same plot that i was doing before on the most skewed |
---|
0:03:04 | dimension of the i-vectors, before and after length normalization, and we can see that even if we apply |
---|
0:03:09 | length normalization, it cannot compensate for things like the |
---|
0:03:12 | multimodal distribution we see in our i-vectors. |
---|
0:03:15 | it might actually compensate for heavy-tailed behaviour, that's for sure, but still we |
---|
0:03:19 | don't get things which are really |
---|
0:03:21 | gaussian-like. |
---|
0:03:24 | now, in this work we want to address |
---|
0:03:27 | the problem of transforming our i-vectors so that they better fit the |
---|
0:03:33 | PLDA assumptions, so we try to somehow gaussianize our i-vectors; |
---|
0:03:37 | and at the same time we propose a |
---|
0:03:40 | way to perform dataset compensation similar to length normalization, the difference being that |
---|
0:03:46 | this dataset compensation is tuned |
---|
0:03:49 | to our transformation, |
---|
0:03:52 | and we estimate both at the same time. |
---|
0:03:55 | okay so |
---|
0:03:57 | how do we perform this? |
---|
0:04:00 | let's first focus on how we |
---|
0:04:03 | can transform i-vectors so that they better fit the gaussian assumption. |
---|
0:04:07 | to do that, we assume that i-vectors are sampled from a random variable Phi |
---|
0:04:13 | whose |
---|
0:04:14 | pdf we don't know; however, we assume that we can express this random variable as |
---|
0:04:19 | a function |
---|
0:04:20 | of a standard normal random variable. |
---|
0:04:23 | now, if we do this, then we can express the pdf of this random |
---|
0:04:28 | variable Phi as |
---|
0:04:30 | the standard normal pdf |
---|
0:04:32 | evaluated at the samples |
---|
0:04:34 | transformed through the inverse of the function, times a |
---|
0:04:40 | term given by the determinant of the jacobian of the transformation. |
---|
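(written out, the change-of-variables relation being described; the notation is chosen here for illustration, with f mapping a standard normal variable Z to the i-vector variable Phi.)

```latex
\Phi = f(Z),\; Z \sim \mathcal{N}(0, I)
\;\Longrightarrow\;
\log p_{\Phi}(\phi)
 = \log \mathcal{N}\!\left(f^{-1}(\phi);\, 0, I\right)
 + \log\left|\det \frac{\partial f^{-1}(\phi)}{\partial \phi}\right|
```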
0:04:45 | now, the good thing is that we can |
---|
0:04:47 | do two things with this model. first of all, we can estimate the function f |
---|
0:04:52 | so as to maximize the likelihood of our i-vectors, |
---|
0:04:56 | and in that way we would obtain something which |
---|
0:05:00 | is also the pdf of the i-vectors, which is not standard gaussian anymore but depends |
---|
0:05:06 | on the transformation. |
---|
0:05:08 | and the other thing is that we can also employ this function to transform the |
---|
0:05:12 | i-vectors, so that samples which follow the distribution of Phi |
---|
0:05:17 | become transformed into samples which follow a |
---|
0:05:21 | standard normal distribution. |
---|
0:05:25 | now, |
---|
0:05:26 | to |
---|
0:05:27 | model this unknown function we decided to follow a |
---|
0:05:33 | framework which is quite similar to the neural network framework, |
---|
0:05:37 | that is, we assume that we can express this transformation function as a composition of |
---|
0:05:42 | several simple functions, |
---|
0:05:46 | which can be interpreted as layers of a neural network. |
---|
0:05:50 | now |
---|
0:05:51 | the only constraint that we have with respect to a standard neural network here is |
---|
0:05:55 | that we want to work with functions which are invertible, so our layers are of the |
---|
0:06:00 | same size and the transformations they |
---|
0:06:02 | produce need to be invertible. |
---|
0:06:05 | as we said, we perform maximum likelihood estimation of the parameters of the transformation, |
---|
0:06:10 | and then instead of using the pdf directly we use the transformation function to map |
---|
0:06:15 | back |
---|
0:06:16 | our i-vectors to, |
---|
0:06:18 | let's say, gaussian-distributed i-vectors. |
---|
0:06:21 | here i have a small example on one-dimensional data; this is again |
---|
0:06:28 | the most skewed component of our training i-vectors, |
---|
0:06:36 | and on the top left is the original histogram, and on the right i plot the transformation |
---|
0:06:41 | that we estimated. |
---|
0:06:43 | so, as you can see from the top left, |
---|
0:06:45 | if we directly use the transformation |
---|
0:06:48 | to evaluate the log pdf of the |
---|
0:06:51 | original |
---|
0:06:53 | i-vectors, we actually obtain a pdf which very closely matches the histogram of our |
---|
0:06:58 | i-vectors |
---|
0:07:00 | then if we apply the inverse transformation to these data points we obtain what we |
---|
0:07:05 | see in the bottom view here, |
---|
0:07:08 | and what |
---|
0:07:09 | does that show? it shows that we managed to obtain a histogram of i-vectors which |
---|
0:07:13 | very closely matches the gaussian |
---|
0:07:16 | pdf, which is plotted there; i don't know if it's visible, but there is the pdf |
---|
0:07:20 | of the standard gaussian, which is pretty much on top of the histogram of |
---|
0:07:25 | the transformed vectors. |
---|
0:07:29 | now, |
---|
0:07:30 | in this work |
---|
0:07:32 | we decided to use a simple selection for our layers. in particular, we have |
---|
0:07:37 | one kind of layer which does just an affine transformation that is we can interpret |
---|
0:07:42 | it just as the weights |
---|
0:07:44 | of a neural network; |
---|
0:07:45 | and what we call a non-linear |
---|
0:07:48 | layer, |
---|
0:07:49 | which performs the non-linearity. |
---|
0:07:51 | now, the reason we chose this particular kind of non-linearity is that it has |
---|
0:07:56 | nice properties; for example, with a single layer we can already |
---|
0:08:00 | represent pdfs |
---|
0:08:02 | of random variables which are at the same time heavy-tailed and |
---|
0:08:07 | skewed, and |
---|
0:08:09 | if we add more layers we increase the |
---|
0:08:12 | modelling capabilities of the approach, although this creates some problems of overfitting, as i will |
---|
0:08:16 | discuss |
---|
0:08:18 | later. |
---|
0:08:20 | now, as we said, we use a maximum likelihood criterion to estimate the transformation, and |
---|
0:08:25 | the nice thing |
---|
0:08:27 | is that we can use a general-purpose optimizer, to which we provide |
---|
0:08:31 | the objective function and its gradients; and the gradients |
---|
0:08:34 | can be computed with |
---|
0:08:36 | an algorithm which quite closely resembles that of back-propagation with a mean squared error objective in |
---|
0:08:42 | a neural network. |
---|
0:08:44 | the main difference is that we need to take into account also the contribution of the |
---|
0:08:48 | log-determinant terms, which |
---|
0:08:50 | increase the complexity of the training, but the training time is pretty much the same |
---|
0:08:54 | as what we would have with a standard neural network. |
---|
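(to make the objective concrete, a minimal numpy sketch of the ingredients described above: an affine layer plus an element-wise invertible non-linearity, and the maximum-likelihood objective whose value and gradients would be handed to a general-purpose optimizer. the sinh-arcsinh form of the non-linearity below is only an illustrative choice of an invertible function able to model skew and tail weight, not necessarily the one used by the authors.)

```python
import numpy as np

# The stack below plays the role of f^{-1}: it maps i-vectors towards a
# standard normal latent, and the log-likelihood of an i-vector is
#   log N(z; 0, I) + sum of the per-layer log|det Jacobian| terms.

def affine_forward(x, A, b):
    """Affine layer (the analogue of a weight layer). x: (n, M), A: (M, M)."""
    z = x @ A.T + b
    _, logdet = np.linalg.slogdet(A)            # same value for every sample
    return z, np.full(x.shape[0], logdet)

def nonlin_forward(x, delta, eps):
    """Element-wise invertible non-linearity (sinh-arcsinh style), able to
    represent skewed and heavy-/light-tailed marginals; requires delta > 0."""
    u = np.arcsinh(x)
    z = np.sinh(delta * u + eps)
    # dz/dx = delta * cosh(delta*u + eps) / sqrt(1 + x^2)
    logdet = np.sum(np.log(delta) + np.log(np.cosh(delta * u + eps))
                    - 0.5 * np.log1p(x ** 2), axis=1)
    return z, logdet

def neg_log_likelihood(x, A, b, delta, eps):
    """Objective (to minimize); its gradients can be obtained with a
    back-propagation-like pass that also accounts for the log-determinant
    terms, or simply by automatic differentiation."""
    z, ld1 = affine_forward(x, A, b)
    z, ld2 = nonlin_forward(z, delta, eps)
    log_norm = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * z.shape[1] * np.log(2.0 * np.pi)
    return -np.mean(log_norm + ld1 + ld2)
```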
0:08:58 | now, this is a first set of experiments. here we still didn't couple length |
---|
0:09:03 | normalization or any other kind of |
---|
0:09:06 | compensation approach; what i'm showing here is what happens when we estimate |
---|
0:09:11 | this transformation on our |
---|
0:09:12 | training data and we apply it to transform our i-vectors. |
---|
0:09:17 | as you can see on top here, on the left are the same histograms of the |
---|
0:09:21 | squared norms i was presenting before, and on the right the squared norms of the |
---|
0:09:25 | transformed i-vectors. |
---|
0:09:27 | here, |
---|
0:09:28 | i'm using a transformation with just one non-linear layer. |
---|
0:09:33 | now, of course, as we can see, the squared norm is still not exactly what |
---|
0:09:37 | we would expect from |
---|
0:09:39 | standard normally distributed samples, but it |
---|
0:09:43 | matches our expectation more closely and, more important, we also somehow |
---|
0:09:49 | reduce the mismatch between evaluation and development squared norms, which means that our i-vectors are |
---|
0:09:55 | more similar. |
---|
0:09:57 | and this gets reflected in the results: on the first and second lines you |
---|
0:10:01 | have PLDA, and |
---|
0:10:03 | the same PLDA but trained with the transformed i-vectors; |
---|
0:10:07 | as i said, here we are not |
---|
0:10:08 | using any kind of length normalization. we can see that our model allows us to achieve |
---|
0:10:13 | much better performance compared to standard PLDA. |
---|
0:10:16 | on the last line, though, |
---|
0:10:18 | we can still see that length normalization is compensating for the dataset mismatch |
---|
0:10:23 | better, which allows PLDA with length-normalized i-vectors to perform better than our model, |
---|
0:10:29 | right |
---|
0:10:31 | so |
---|
0:10:31 | the next part is: how can we |
---|
0:10:35 | incorporate this kind of preprocessing into our model? of course we could try to length-normalize the |
---|
0:10:39 | transformed i-vectors, but we can do better by |
---|
0:10:42 | casting this |
---|
0:10:44 | kind of transformation directly into our model. |
---|
0:10:47 | to this end, |
---|
0:10:49 | we first need to give a different interpretation to length norm, and in particular we |
---|
0:10:54 | need to see |
---|
0:10:55 | length |
---|
0:10:57 | normalization as the maximum likelihood solution of a quite simple model, |
---|
0:11:01 | where our i-vectors are not i.i.d. anymore, in the sense that |
---|
0:11:05 | we assume that each i-vector is sampled from a different random variable whose distribution |
---|
0:11:10 | is normal. |
---|
0:11:12 | all these random variables share a common term, which is the |
---|
0:11:17 | same |
---|
0:11:18 | covariance matrix, but this covariance matrix is scaled for each i-vector by a scalar |
---|
0:11:23 | term. |
---|
0:11:24 | this is quite similar to a heavy-tailed distribution, but instead of putting priors |
---|
0:11:29 | on these terms, |
---|
0:11:30 | we just optimize them by maximum likelihood. |
---|
0:11:34 | now, if we perform a two-step optimization, where we first estimate Sigma assuming that |
---|
0:11:39 | the alpha terms are one, |
---|
0:11:41 | and then we fix that Sigma and estimate the optimal alpha terms, we are going to |
---|
0:11:46 | end up with something which is |
---|
0:11:49 | very similar to length norm; indeed, the optimal alpha |
---|
0:11:53 | is the norm of the whitened i-vectors divided by the |
---|
0:11:57 | square root of the dimensionality of the i-vectors. |
---|
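(the simple model and its two-step solution, written out; assuming zero-mean i-vectors, the per-i-vector maximum-likelihood scale given Sigma is the whitened norm divided by the square root of the dimensionality M, so dividing each i-vector by its estimated alpha is length normalization of the whitened i-vector, up to the constant sqrt(M).)

```latex
\phi_i \sim \mathcal{N}\!\left(0,\; \alpha_i^{2}\,\Sigma\right)
\;\Longrightarrow\;
\hat{\alpha}_i
 = \arg\max_{\alpha}\, \log\mathcal{N}\!\left(\phi_i;\,0,\,\alpha^{2}\Sigma\right)
 = \frac{\left\|\Sigma^{-1/2}\phi_i\right\|}{\sqrt{M}},
\qquad
\frac{\phi_i}{\hat{\alpha}_i}
 = \sqrt{M}\,\frac{\phi_i}{\left\|\Sigma^{-1/2}\phi_i\right\|}
```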
0:12:01 | now, why is this interesting? because this |
---|
0:12:03 | random variable can be represented as a transformation of a standard normal random variable, where the transformation |
---|
0:12:10 | has a parameter which is i-vector dependent. |
---|
0:12:13 | now, if we had to estimate this |
---|
0:12:15 | parameter using an iterative strategy, where we first estimate |
---|
0:12:20 | Sigma and then the alphas, and then we |
---|
0:12:23 | were to apply the inverse transformation, we would recover exactly what we are doing right |
---|
0:12:27 | now with length normalization. |
---|
0:12:30 | so this shows us |
---|
0:12:32 | how to implement a similar strategy in our model. |
---|
0:12:37 | we introduce what we call a scaling layer, which has a |
---|
0:12:41 | single parameter, and this parameter is i-vector dependent, so for each i-vector we have to estimate |
---|
0:12:46 | its maximum likelihood solution. |
---|
0:12:48 | now, our transformation is the cascade of this |
---|
0:12:52 | scaling layer and what we were proposing before, that is, |
---|
0:12:56 | the |
---|
0:12:57 | composition of affine and non-linear layers. |
---|
0:13:01 | there is one comment here: |
---|
0:13:03 | in order to |
---|
0:13:04 | train this thing, we |
---|
0:13:06 | still have to resort to a sort of alternating training; that is, we first estimate |
---|
0:13:12 | the shared parameters, then we fix the shared parameters and optimize the |
---|
0:13:15 | alphas. |
---|
0:13:16 | and one more thing that we need to take into account is that at test |
---|
0:13:20 | time, |
---|
0:13:21 | while with the original model we don't need to do anything other than transform the |
---|
0:13:24 | i-vectors, with this model we also need to estimate the alpha, by selecting |
---|
0:13:29 | the optimal scaling factor. |
---|
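(a rough sketch of what the scaling layer adds at test time. the exact placement of the scaling is not given in the talk, so the choice below, dividing the i-vector by its own alpha before the shared transformation, is an assumption made here for illustration; alpha is estimated by maximizing the same change-of-variables likelihood, and during training one alternates between updating the shared parameters with the alphas fixed and re-estimating the alphas, as described.)

```python
import numpy as np
from scipy.optimize import minimize_scalar

# flow_forward is the shared, already-trained transformation: it takes one
# i-vector (shape (M,)) and returns (z, logdet), i.e. the transformed vector
# and the log-determinant of the Jacobian of the map.

def log_lik(x, alpha, flow_forward):
    """Change-of-variables log-likelihood of a single i-vector x for a given
    per-i-vector scaling alpha (the scaling layer contributes -M*log(alpha))."""
    M = x.shape[0]
    z, logdet = flow_forward(x / alpha)
    log_norm = -0.5 * np.dot(z, z) - 0.5 * M * np.log(2.0 * np.pi)
    return log_norm + logdet - M * np.log(alpha)

def estimate_alpha(x, flow_forward):
    """Per-i-vector ML scaling factor, needed for every test i-vector."""
    res = minimize_scalar(lambda a: -log_lik(x, a, flow_forward),
                          bounds=(1e-3, 1e3), method='bounded')
    return res.x
```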
0:13:32 | however, this |
---|
0:13:34 | gives us a great improvement, as you can see: the first line is the |
---|
0:13:38 | same i was presenting before, |
---|
0:13:41 | and then the last three lines are PLDA with length normalization, |
---|
0:13:45 | then PLDA with our transformation and the alpha scaling, with one iteration |
---|
0:13:49 | of |
---|
0:13:51 | the alpha estimates, and with three iterations of the alpha estimates. |
---|
0:13:55 | and as you can see |
---|
0:13:57 | the model with three iterations clearly outperforms PLDA with length norm in all conditions |
---|
0:14:03 | on the SRE 10 female dataset. |
---|
0:14:08 | now, |
---|
0:14:10 | i guess we get to the conclusions. we |
---|
0:14:14 | investigated here an approach to estimate a transformation which allows us to modify our i-vectors |
---|
0:14:20 | so that they better fit the PLDA assumptions; |
---|
0:14:22 | when we apply this transformation we obtain i-vectors which are more gaussian-like, and |
---|
0:14:28 | we incorporated in the model a |
---|
0:14:30 | proper way to perform a compensation which is similar to length |
---|
0:14:35 | norm |
---|
0:14:36 | but is |
---|
0:14:37 | tuned to the particular layers that we are using in the transformation. |
---|
0:14:41 | this transformation is estimated using a maximum likelihood criterion, and the transformation function itself |
---|
0:14:47 | is implemented using a framework which is very similar to that |
---|
0:14:51 | of neural networks, |
---|
0:14:53 | with, as i said, some constraints, because we want our layers to be invertible, in such a way |
---|
0:14:57 | that we can compute, |
---|
0:14:59 | or we can guarantee the existence of, the log-determinant of our jacobians. |
---|
0:15:06 | now, this approach allows us to |
---|
0:15:09 | improve the results, as we have seen, on the SRE 10 |
---|
0:15:13 | data; we also have experiments in the paper, which |
---|
0:15:17 | i don't report here, which show that this also works on NIST two |
---|
0:15:21 | thousand twelve data. |
---|
0:15:23 | there is one caveat: as i said before, here we are using a single-layer |
---|
0:15:27 | transformation. the reason is that this kind of model tends to |
---|
0:15:31 | overfit quite easily, |
---|
0:15:33 | so our first experiments with more than one non-linear layer |
---|
0:15:38 | were not very satisfactory, in the sense that they were decreasing the performance. |
---|
0:15:43 | now we are managing to get interesting results by changing things |
---|
0:15:47 | in two ways: the first one is changing the kind of non-linearity, |
---|
0:15:51 | adding |
---|
0:15:52 | some constraints inside the function itself which limit this |
---|
0:15:57 | overfitting behaviour; |
---|
0:15:59 | and on the other hand, we also defined some structures where we impose constraints on |
---|
0:16:03 | the parameters of the transformation which again |
---|
0:16:06 | reduce the overfitting behaviour, and this allows us to train networks which have more layers, |
---|
0:16:11 | although up to now we have obtained mixed results, in the sense that we managed |
---|
0:16:15 | to |
---|
0:16:16 | train transformations which behave much better |
---|
0:16:19 | if we don't |
---|
0:16:20 | use the scaling term, but after we insert the scaling term into the |
---|
0:16:24 | whole |
---|
0:16:26 | framework, |
---|
0:16:27 | in the end we more or less converge to the same results that were shown |
---|
0:16:30 | here. so there is still work in progress to understand why we have this strange behaviour, |
---|
0:16:36 | where we can |
---|
0:16:37 | improve the performance of the transformation itself but we cannot improve |
---|
0:16:42 | anymore when we add the scaling term. |
---|
0:16:46 | so on |
---|
0:16:52 | do we have some questions? we have time for a few. |
---|
0:17:05 | how does this compare to just straight gaussianization? |
---|
0:17:10 | okay, the |
---|
0:17:11 | thing is: how would we implement gaussianization with one-hundred-and-fifty-dimensional vectors? i mean, |
---|
0:17:17 | would you gaussianize each dimension on its own? |
---|
0:17:20 | well, if you gaussianize each dimension on its own... we tried |
---|
0:17:24 | something like that with this model, in which the transformation, or well, the function itself, |
---|
0:17:29 | can, |
---|
0:17:30 | so to speak, |
---|
0:17:31 | produce that kind of behaviour; and by the way, when working with one-dimensional synthetic |
---|
0:17:36 | data, this gave very good results with many kinds of different distributions, but here the results were already much |
---|
0:17:42 | worse. |
---|
0:17:43 | so my guess is that it would not be sufficient to independently |
---|
0:17:47 | gaussianize each dimension on its own. |
---|
0:17:50 | but, excuse me, i'm sorry, you tried it and it didn't work? |
---|
0:17:54 | no, i didn't try exactly that. i tried the same model i'm presenting here with a |
---|
0:17:59 | transformation which is applied independently to each component, and my experience is that when working on |
---|
0:18:06 | single-dimensional data points |
---|
0:18:09 | it gaussianizes very well; |
---|
0:18:11 | it does not have overfitting problems, even if i model synthetic data |
---|
0:18:16 | with several kinds of distributions. the only difference, perhaps, is that gaussianization can |
---|
0:18:20 | learn exactly the inverse function; it's not an approximation to it. |
---|
0:18:24 | no, but it learns, let's say, a parametric approximation to it, and that's what i |
---|
0:18:28 | get here, and it doesn't work, so my guess is that approximating the real thing with |
---|
0:18:31 | marginal gaussianization would still not work. |
---|
0:18:39 | i don't use the sensitivity |
---|
0:18:43 | in this approach, how did you come up with the activation function for the DNN, and |
---|
0:18:47 | what is the justification for choosing it, so that it models the distribution well? |
---|
0:18:55 | first of all, the original transformation i was using, you know, is this last |
---|
0:19:00 | one, which, it can be shown, can be split into several layers, but |
---|
0:19:05 | it has different properties: first of all, it can represent the identity transformation, |
---|
0:19:10 | so if our data are already gaussian |
---|
0:19:13 | they are kept like that; |
---|
0:19:15 | then it has some nice properties which can be shown; there are some references in |
---|
0:19:19 | our paper where you can find that |
---|
0:19:22 | this kind of |
---|
0:19:24 | let's say, single layer can already represent a whole set of distributions which are both |
---|
0:19:29 | heavy-tailed and skewed at the same time. |
---|
0:19:32 | so the reason we chose this |
---|
0:19:34 | kind of, let's say, non-linear layer is essentially because it was already shown that it |
---|
0:19:39 | can model a quite broad family of distributions. |
---|
0:19:44 | well, that's all. |
---|
0:19:49 | so, i have two strange questions. |
---|
0:19:52 | first: is it possible to look at your estimated parameters and try to understand what are |
---|
0:19:58 | the characteristics |
---|
0:20:00 | of your training set, |
---|
0:20:02 | in terms of, let's say, the most |
---|
0:20:05 | important session effects or channel effects? |
---|
0:20:08 | you mean... what do you mean? i mean... |
---|
0:20:11 | look at your transformation and try to understand, let's say, if you see, i don't know, |
---|
0:20:18 | whether the |
---|
0:20:19 | mismatch between your training sets, or inside the training set, is due to the presence |
---|
0:20:25 | of, |
---|
0:20:26 | say, cell phone data. |
---|
0:20:27 | okay, so this could be applied separately on different sets: |
---|
0:20:33 | if you have some way to |
---|
0:20:36 | analyse the model, to see what is the difference in your distribution before and after the transformation, |
---|
0:20:41 | you can apply the same technique, for example, |
---|
0:20:44 | as well, |
---|
0:20:46 | to transform independently two different sets and see if this reveals the differences or not. |
---|
0:20:52 | what i can say here is that, |
---|
0:20:54 | pretty much, |
---|
0:20:56 | it looks like, at least if we consider that evaluation and development are two |
---|
0:21:00 | different sets with different distributions, it is somehow able to |
---|
0:21:04 | partly compensate for that. |
---|
0:21:06 | now, the transformation itself is only partly responsible for this, because, let's |
---|
0:21:11 | say, due to its heavy-tailed behaviour, it allows us to stretch the norms which are far |
---|
0:21:18 | from what we would expect |
---|
0:21:20 | and move them towards the middle of the distribution; |
---|
0:21:24 | on the other hand, |
---|
0:21:25 | the real thing which does this compensation is the scaling anyway. so that scaling |
---|
0:21:30 | is very similar to length norm, but it is tuned to the transformation that i'm applying; |
---|
0:21:34 | length norm is done blindly, |
---|
0:21:36 | while here i'm learning a transformation of my i-vectors and estimating at the same time the |
---|
0:21:41 | transformation and the scaling. |
---|
0:21:44 | okay, that is the part which, in my opinion, is really responsible for compensating |
---|
0:21:49 | the mismatch in the datasets used. |
---|
0:21:51 | then, another thing that i can note is |
---|
0:21:54 | that what would be much |
---|
0:21:57 | better |
---|
0:21:58 | would be to really model the speaker factors and the channel factors, as in |
---|
0:22:03 | PLDA, for example. |
---|
0:22:05 | the problem is that |
---|
0:22:06 | already like this, it takes |
---|
0:22:08 | several hours, if not days, to train the transformation function; at test time it's |
---|
0:22:14 | very fast, but training is quite slow, and if we moved to |
---|
0:22:18 | using a kind of PLDA-style model, where we treat the different terms differently, the training time |
---|
0:22:23 | would really explode, and so would the computational cost at test time, |
---|
0:22:27 | because we would need to consider |
---|
0:22:29 | the cases where the i-vectors are from the same speaker or not, and in that |
---|
0:22:33 | case the cost would grow. |
---|
0:22:35 | you would have |
---|
0:22:36 | something |
---|
0:22:38 | similar to what we have with uncertainty propagation, where you have to do this |
---|
0:22:43 | kind of computation for everything, but much worse. |
---|
0:22:48 | okay it's just |
---|
0:22:49 | that, in fact, i wanted to try to |
---|
0:22:55 | exploit as much as possible your parameters. and my second question, which is related to the first |
---|
0:23:00 | one: |
---|
0:23:02 | is it possible, somehow, to use this approach to |
---|
0:23:07 | determine if one, let's say, incoming i-vector |
---|
0:23:11 | is in-domain or out-of-domain, |
---|
0:23:15 | so you could use it to detect, say, okay, |
---|
0:23:20 | my operational data is different? |
---|
0:23:21 | probably not, really. i mean, length normalization is not affected that much by this, |
---|
0:23:25 | but this model is, |
---|
0:23:27 | and the problem with this thing is that if i have a really huge mismatch, |
---|
0:23:31 | then it gets amplified by the transformation itself, |
---|
0:23:35 | because the data points i am transforming are not where they should be, so the way |
---|
0:23:40 | the non-linear function warps them |
---|
0:23:42 | is probably going to increase my mismatch instead of reducing it. |
---|
0:23:46 | so up to some point this is still going to work better than the standard approach, |
---|
0:23:50 | but after some point it would just get worse and worse |
---|
0:23:57 | with mismatched datasets. |
---|
0:23:59 | okay, thanks. |
---|
0:24:03 | okay, let's thank the speaker. |
---|