0:00:15 | so hi everyone, i'll present the iterative bayesian and mmse-based noise compensation
---|
0:00:22 | techniques for speaker recognition in the i-vector space |
---|
0:00:28 | so let's |
---|
0:00:29 | start by setting up the problem |
---|
0:00:32 | here we are working on noise, as noise is one of the biggest problems in
---|
0:00:38 | speaker recognition |
---|
0:00:41 | and a lot of techniques have been proposed in the past
---|
0:00:45 | years to deal with it in different domains |
---|
0:00:48 | such as speech enhancement techniques |
---|
0:00:51 | feature compensation, model compensation and robust scoring, and in the last years dnn-based
---|
0:00:57 | techniques
---|
0:00:58 | for robust feature extraction, robust computation of statistics, or
---|
0:01:08 | i-vector-like representations of speech
---|
0:01:12 | so what we are proposing here is a combination of two algorithms
---|
0:01:19 | in order to clean up noisy i-vectors
---|
0:01:23 | so we are using a |
---|
0:01:25 | clean front end, so a system trained using clean data, and a clean back end, so
---|
0:01:33 | a clean scoring model
---|
0:01:36 | so the first algorithm |
---|
0:01:39 | in previous work we presented i-map
---|
0:01:45 | it's an additive noise model operating in the i-vector space |
---|
0:01:49 | it's based on two hypotheses
---|
0:01:53 | the gaussianity of |
---|
0:01:55 | the i-vector distribution and the gaussianity of the noise distribution
---|
0:02:00 | in the i-vector space |
---|
0:02:02 | here i'm not saying that noise is additive in the i-vector space, i'm just
---|
0:02:06 | using this model to represent the relationship between clean and noisy i-vectors
---|
0:02:11 | just to be clear
---|
0:02:14 | so using the map criterion we can
---|
0:02:19 | derive this equation
---|
0:02:22 | and we end up with a model where, given a noisy i-vector y zero
---|
0:02:31 | we can
---|
0:02:33 | denoise it
---|
0:02:34 | clean it up using
---|
0:02:37 | the clean i-vector distribution hyper-parameters and the noise distribution hyper-parameters
---|
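As a reference for this step, here is a minimal sketch of the MAP estimate implied by the two Gaussian hypotheses above. The notation is mine (x for the clean i-vector, n for the additive noise term in the i-vector space) and may differ from the paper's.

```latex
% assumed additive model in the i-vector space:
%   y_0 = x + n,   x ~ N(mu_x, Sigma_x),   n ~ N(mu_n, Sigma_n)
\hat{x}_{\mathrm{MAP}}
  = \arg\max_{x}\, p(x \mid y_0)
  = \bigl(\Sigma_x^{-1} + \Sigma_n^{-1}\bigr)^{-1}
    \bigl(\Sigma_x^{-1}\mu_x + \Sigma_n^{-1}(y_0 - \mu_n)\bigr)
```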
0:02:46 | so in practice this algorithm is implemented like this: given a test segment we start
---|
0:02:54 | by checking its snr level, and if the segment is clean then we
---|
0:03:00 | are okay
---|
0:03:02 | if it's not |
---|
0:03:04 | we extract the noisy version of the i-vector y zero and then using a voice
---|
0:03:12 | activity detection system we extract |
---|
0:03:15 | noise from the signal using the silence intervals |
---|
0:03:19 | and then we inject |
---|
0:03:22 | this noise |
---|
0:03:25 | into clean training utterances |
---|
0:03:28 | this way we have clean i-vectors and their noisy counterparts generated using the test noise
---|
0:03:36 | so we can build the noise model |
---|
0:03:39 | using the gaussian distribution and then we can use the previous equation to clean up |
---|
0:03:44 | the noisy i-vectors |
---|
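A rough sketch of this test-time procedure, assuming hypothetical helper functions (extract_ivector, vad_silence_noise, add_noise_to) that are not from the paper; fitting the noise model on the noisy-minus-clean i-vector differences is my reading of the description above, not a confirmed detail.

```python
import numpy as np

def imap_denoise(y0, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of the clean i-vector under Gaussian clean/noise models."""
    prec = np.linalg.inv(cov_x) + np.linalg.inv(cov_n)
    rhs = np.linalg.inv(cov_x) @ mu_x + np.linalg.inv(cov_n) @ (y0 - mu_n)
    return np.linalg.solve(prec, rhs)

def clean_test_ivector(test_wav, snr_db, clean_train_wavs,
                       extract_ivector, vad_silence_noise, add_noise_to,
                       clean_snr_threshold=15.0):
    # 1. extract the (possibly noisy) test i-vector; if the segment is clean, keep it
    y0 = extract_ivector(test_wav)
    if snr_db >= clean_snr_threshold:
        return y0
    # 2. grab the test noise from the silence intervals found by the VAD
    noise = vad_silence_noise(test_wav)
    # 3. inject that noise into clean training utterances to get (clean, noisy) pairs
    clean_ivs = np.array([extract_ivector(w) for w in clean_train_wavs])
    noisy_ivs = np.array([extract_ivector(add_noise_to(w, noise)) for w in clean_train_wavs])
    # 4. Gaussian noise model in the i-vector space (assumed: fitted on the differences)
    diffs = noisy_ivs - clean_ivs
    mu_n, cov_n = diffs.mean(axis=0), np.cov(diffs, rowvar=False)
    # 5. Gaussian prior on clean i-vectors (assumed: fitted on the clean training i-vectors)
    mu_x, cov_x = clean_ivs.mean(axis=0), np.cov(clean_ivs, rowvar=False)
    # 6. apply the MAP denoising equation from the previous slide
    return imap_denoise(y0, mu_x, cov_x, mu_n, cov_n)
```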
0:03:49 | so |
---|
0:03:52 | the novelty of this paper is how we can improve i-map
---|
0:03:59 | so the problem is that we cannot apply i-map many times
---|
0:04:05 | successively
---|
0:04:08 | iteratively, because we cannot guarantee the gaussian hypothesis on the residual noise
---|
0:04:15 | so the solution that we came up with is to use another algorithm and to |
---|
0:04:20 | iterate between these two algorithms in order to achieve better cleaning of the i-vectors
---|
0:04:28 | so this second algorithm is called the kabsch algorithm, it's used mainly in chemistry to
---|
0:04:39 | align different molecules, so here we're applying it on i-vectors and we're starting from
---|
0:04:46 | noisy i-vectors |
---|
0:04:48 | and we want to estimate the best translation and rotation matrix |
---|
0:04:53 | in order to go to the clean version |
---|
0:04:58 | so formally for the formulation of the problem |
---|
0:05:04 | it's called the |
---|
0:05:07 | procrustes
---|
0:05:09 | problem, and it starts with two data matrices, the noisy i-vectors
---|
0:05:16 | presented as a matrix and the clean version
---|
0:05:20 | this way we can estimate the best rotation matrix, r here
---|
0:05:25 | that relates the two |
---|
0:05:28 | so in the training we start by |
---|
0:05:34 | recalling that we are estimating a translation vector and a rotation matrix, so
---|
0:05:38 | to get rid of the translation we start by centering the data, we
---|
0:05:44 | compute the centroid of the clean data and the noisy data and then
---|
0:05:50 | we center
---|
0:05:52 | the clean and noisy i-vectors
---|
0:05:56 | then |
---|
0:05:58 | now we can compute the |
---|
0:06:01 | best rotation matrix between the noisy i-vectors and their clean versions using svd
---|
0:06:09 | decomposition |
---|
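A minimal numpy sketch of this training step, assuming the noisy i-vectors and their clean counterparts come as row-aligned matrices; this is the textbook Kabsch solution, not necessarily the exact variant used in the paper.

```python
import numpy as np

def kabsch_train(noisy, clean):
    """Estimate the centroids and the rotation mapping centered noisy
    i-vectors onto their centered clean counterparts (rows are paired)."""
    c_noisy = noisy.mean(axis=0)      # centroid of the noisy i-vectors
    c_clean = clean.mean(axis=0)      # centroid of the clean i-vectors
    P = noisy - c_noisy               # centered noisy data
    Q = clean - c_clean               # centered clean data
    H = P.T @ Q                       # cross-covariance between the two point sets
    U, _, Vt = np.linalg.svd(H)
    # sign correction so R is a proper rotation (det = +1), as in standard Kabsch
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.eye(H.shape[0])
    D[-1, -1] = d
    R = Vt.T @ D @ U.T
    return c_noisy, c_clean, R
```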
0:06:14 | once we've done this, we have the best translation and rotation for a
---|
0:06:21 | given noise
---|
0:06:23 | on the test
---|
0:06:24 | we can
---|
0:06:27 | extract the test i-vector
---|
0:06:29 | we start by applying the translation, that is
---|
0:06:34 | here we subtract the centroid of the
---|
0:06:38 | noisy i-vectors, and then we apply the rotation and then the other translation to end
---|
0:06:45 | up with its clean version
---|
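In other words, with the estimated centroids and rotation (notation mine: c_noisy, c_clean and R), the cleaned-up test i-vector would be:

```latex
\hat{w}_{\mathrm{clean}} = R\,\bigl(w_{\mathrm{noisy}} - c_{\mathrm{noisy}}\bigr) + c_{\mathrm{clean}}
```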
0:06:51 | so we use nist and switchboard data for training
---|
0:06:56 | and nist two thousand and eight for test, the det seven condition, we are using
---|
0:07:03 | nineteen mfcc coefficients plus energy plus their first and second derivatives |
---|
0:07:11 | a five hundred twelve component gmm
---|
0:07:16 | our i-vectors have four hundred components and we are using two-covariance scoring
---|
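For reference, two-covariance scoring is commonly written as the log-likelihood ratio below, with a speaker variable y distributed with between-speaker covariance B and i-vectors distributed around it with within-speaker covariance W; this is the standard formulation, given here as a reminder rather than the paper's exact equations.

```latex
% model: y ~ N(mu, B),   w | y ~ N(y, W)
s(w_1, w_2) =
  \log \frac{\int \mathcal{N}(w_1; y, W)\,\mathcal{N}(w_2; y, W)\,\mathcal{N}(y; \mu, B)\,\mathrm{d}y}
            {\int \mathcal{N}(w_1; y, W)\,\mathcal{N}(y; \mu, B)\,\mathrm{d}y
             \int \mathcal{N}(w_2; y, W)\,\mathcal{N}(y; \mu, B)\,\mathrm{d}y}
```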
0:07:24 | so here we are applying |
---|
0:07:26 | each algorithm independently and then we're combining the two
---|
0:07:33 | so |
---|
0:07:33 | with the first algorithm, i-map, we can achieve from forty to sixty percent
---|
0:07:39 | of relative equal error rate improvement
---|
0:07:43 | for each noise
---|
0:07:45 | for the second algorithm we can achieve up to forty five percent of equal error
---|
0:07:50 | rate improvement but
---|
0:07:53 | when we combine the two |
---|
0:07:55 | for one iteration or two we can end up with up to
---|
0:08:01 | eighty five percent of equal error rate improvement
---|
0:08:08 | here i presented the results for male data
---|
0:08:14 | for male data, and well for female
---|
0:08:21 | the error rates are a little bit higher but it's efficient for both
---|
0:08:29 | and here we compare the two algorithms and their combination
---|
0:08:34 | on a heterogeneous setup, that is when we use a lot of noisy and
---|
0:08:42 | clean data for enrollment and test, with different snr levels on the target and test sides
---|
0:08:49 | and we can see that it remains efficient in this context
---|
0:08:57 | so as a summary |
---|
0:09:00 | using |
---|
0:09:03 | i-map or the kabsch algorithm we can improve the equal error rate from
---|
0:09:09 | forty to sixty percent but the interesting part is that combining the two |
---|
0:09:15 | can achieve |
---|
0:09:18 | far better gains
---|
0:09:22 | thank you |
---|
0:09:30 | so, do we have questions
---|
0:09:42 | is the rotation matrix noise independent
---|
0:09:47 | or noise dependent? yes, it's really different for each noise, sorry
---|
0:09:55 | yes here we're estimating for each different noise a different translation and rotation matrix
---|
0:10:02 | we just want to show the efficiency of this technique, but in the future
---|
0:10:08 | in another paper that will be published in interspeech i guess, well it's accepted
---|
0:10:16 | so |
---|
0:10:17 | in it
---|
0:10:20 | we propose another approach that does not
---|
0:10:26 | assume a certain model of noise in the i-vector space
---|
0:10:29 | and that can be used for many noises
---|
0:10:33 | that can be trained using many noises and used efficiently
---|
0:10:38 | on test data with different noises
---|
0:10:40 | so here it's just to show how far we can go in the
---|
0:10:46 | best case scenario |
---|
0:10:48 | but in another paper we show how we can extend this to deal with many
---|
0:10:53 | noises |
---|
0:11:03 | a nice presentation, so
---|
0:11:06 | if you go back many years ago, lim and oppenheim had a sequential map estimation
---|
0:11:13 | that was used for speech enhancement, it iterated back and forth between noise suppression filters and
---|
0:11:19 | speech parameterization, so you're iterating back and forth between two algorithms here
---|
0:11:25 | you show results with one iteration, two iterations, is there any way to come
---|
0:11:29 | up with some, well maybe two questions here, any way to come up with some form
---|
0:11:34 | of convergence criteria that you can assess and second is there any way to look |
---|
0:11:39 | at the i-vectors as you go through the two iterations to see |
---|
0:11:44 | which i-vectors are actually changing the most that might tell you a little bit more |
---|
0:11:47 | about which vectors are more sensitive to the type of noise |
---|
0:11:54 | so the first question |
---|
0:11:56 | so the first question was, is there any way to look at a convergence criterion
---|
0:12:01 | because when you say one or two iterations you need to know whether you converged, and
---|
0:12:06 | okay |
---|
0:12:07 | so well, here what we've done is just to iterate many times and see
---|
0:12:13 | at which
---|
0:12:15 | from which level we
---|
0:12:17 | start
---|
0:12:20 | making the results worse, so it's not really
---|
0:12:26 | it's not that
---|
0:12:29 | we haven't gone that far in that direction
---|
0:12:34 | so if you look at the two noise types you were using, fan noise and i
---|
0:12:38 | think you had
---|
0:12:40 | car noise so both are low frequency type noises can you see if you have |
---|
0:12:45 | similar changes in the i-vectors in both those noise types |
---|
0:12:50 | yes |
---|
0:12:53 | maybe i can't comment on that because i haven't done the full analysis but
---|
0:12:59 | just from the results we can
---|
0:13:03 | what i can tell you for sure is that the efficiency depends on
---|
0:13:08 | the
---|
0:13:11 | on which noise you're applying it on, so
---|
0:13:15 | it's efficient for sure, but it can be
---|
0:13:21 | that there is something in the way that makes it more efficient if we have different noises
---|
0:13:26 | between enrollment and test
---|
0:13:40 | thank you for the nice presentation |
---|
0:13:43 | a while ago i tried to read the original i-map paper, so if you don't
---|
0:13:47 | mind i'll just ask a question about the original i-map, not the iterative one
---|
0:13:51 | sorry, i didn't understand, the original i-map?
---|
0:13:54 | yes, not the iterative one
---|
0:13:58 | okay so, like, i mean in the block diagram that you showed
---|
0:14:06 | can you go back to the block diagram of this
---|
0:14:08 | algorithm
---|
0:14:11 | yes |
---|
0:14:11 | so you're extracting noise from the signal, or somehow estimating the noise in
---|
0:14:18 | the signal
---|
0:14:19 | and then you go up to the noisy end of zero db, where
---|
0:14:24 | the speech and noise are of similar or the same strength, so can you
---|
0:14:28 | tell us how you are extracting noise from the signal at zero db
---|
0:14:34 | so here we're using an energy based voice activity detection system, but we're just
---|
0:14:40 | making the threshold more strict in order to avoid ending up with speech
---|
0:14:48 | confused as noise, so it's not that
---|
0:14:53 | we developed a very sophisticated voice activity detection system for this specific task
---|
0:14:59 | we're just avoiding, as much as possible, ending up with
---|
0:15:03 | speech, by using a very strict threshold on the energy
---|
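A toy illustration of that idea, where the strictness is expressed as a low energy percentile (the frame sizes and the 10th-percentile threshold are arbitrary choices of mine, not values from the talk): only the quietest frames are kept as noise, so speech is unlikely to leak into the noise estimate even around 0 dB.

```python
import numpy as np

def extract_noise_frames(signal, frame_len=400, hop=160, percentile=10.0):
    """Keep only the frames whose log-energy falls below a strict percentile
    threshold, treating them as noise-only material for the noise model."""
    n_frames = 1 + (len(signal) - frame_len) // hop   # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    threshold = np.percentile(log_energy, percentile)  # strict: very low energy only
    return frames[log_energy <= threshold]
```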
0:15:10 | see, it's just, it's quite amazing the level of improvement you gain, from
---|
0:15:14 | twenty something percent to what you presented, it is quite something, it
---|
0:15:19 | feels like you have a very good model of noise here, and if you have such a
---|
0:15:24 | thing then it would make sense also to just check with speech enhancement, i
---|
0:15:29 | mean you have this
---|
0:15:30 | mmse based approach like wiener filtering, if you have a good model to characterize
---|
0:15:34 | the noise then it is good to also compare with that, or to do
---|
0:15:38 | like feature enhancement, noise reduction, and compare with that as well, just a comment
---|
0:15:42 | yes okay |
---|
0:15:54 | okay, there don't seem to be any more questions, so let's thank the speaker
---|