0:00:15 | hi i'm with lincoln laboratory and i'm going to talk about our work on channel compensation |
---|
0:00:21 | using adapted plda and a denoising neural network |
---|
0:00:26 | i'll give a brief overview we're looking at multichannel speaker recognition on the mixer corpora |
---|
0:00:32 | and the baseline system is an i-vector system trained on only telephone |
---|
0:00:36 | speech |
---|
0:00:38 | and there are two approaches we're looking at one adapts the plda parameters from |
---|
0:00:42 | telephone data to microphone data |
---|
0:00:45 | and in the other approach we try to compensate the features coming into the system and retrain |
---|
0:00:49 | our system that sort of forms a hybrid system i'll give results along |
---|
0:00:54 | the way |
---|
0:00:57 | so the basic idea is that we have a system that's trained on switchboard data and |
---|
0:01:01 | works pretty well when the data we're testing on is also conversational telephone speech |
---|
0:01:06 | but as is well known if you try to evaluate microphone trials on the same system |
---|
0:01:10 | out of the box the performance is really bad |
---|
0:01:14 | and |
---|
0:01:15 | two approaches people have used to do this one is sort of an adaptation of the plda |
---|
0:01:21 | it isn't exactly the usual adaptation the reasoning was to bring in some |
---|
0:01:25 | of the subspace to move the plda parameters toward the microphone data |
---|
0:01:31 | and we also tried feature enhancement and on that side we tried different approaches to |
---|
0:01:37 | do the |
---|
0:01:40 | i'm sorry what we do in this process is use a neural network to do this compensation |
---|
0:01:47 | and actually it's not new in general i should mention that for the aspire challenge |
---|
0:01:51 | a lot of people used this technique and it works very well for speech recognition on |
---|
0:01:55 | that test where they had microphone data as well |
---|
0:01:58 | so for these two techniques one we're taking i-vectors from a telephone-trained system |
---|
0:02:04 | and we're adapting those toward this microphone data to do that we take the |
---|
0:02:09 | within-class and across-class covariance parameters used in plda scoring |
---|
0:02:14 | and we adapt those parameters toward the microphone data using relevance map which is |
---|
0:02:19 | just a lambda interpolation |
---|
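The relevance-MAP adaptation just described is a lambda interpolation between the telephone-trained and microphone-estimated PLDA covariances. A minimal illustrative sketch, not the authors' code; the function name and toy matrices are hypothetical:

```python
import numpy as np

def adapt_plda_covariances(S_tel, S_mic, lam=0.5):
    """Relevance-MAP-style adaptation: interpolate a telephone-trained
    covariance toward one estimated on microphone data."""
    return lam * S_mic + (1.0 - lam) * S_tel

# toy 2x2 within-class covariances, purely for illustration
Sw_tel = np.array([[1.0, 0.0], [0.0, 1.0]])   # telephone-trained
Sw_mic = np.array([[2.0, 0.2], [0.2, 2.0]])   # estimated on microphone data
Sw_adapted = adapt_plda_covariances(Sw_tel, Sw_mic, lam=0.5)
```

With lam=0 the system is unchanged; with lam=1 it is fully moved to the microphone estimate, which matches the lambda sweep discussed later in the talk.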
0:02:21 | and there we found some calibration issues we do pretty well for eer we |
---|
0:02:26 | get a nice gain at the eer level but for mindcf we don't see much |
---|
0:02:29 | gain |
---|
0:02:31 | on the other hand it is a very simple technique in that you don't change |
---|
0:02:33 | your system you just retrain these two covariance parameters with existing i-vectors or you |
---|
0:02:38 | extract new i-vectors from the microphone data we don't change the system itself |
---|
0:02:42 | the dnn approach requires a little more work in that you have to train the |
---|
0:02:46 | network |
---|
0:02:47 | and the dnn is trained to take parallel data that's noisy and try |
---|
0:02:51 | to clean it up to try to reconstruct a clean signal given a noisy representation |
---|
0:02:56 | of the same data |
---|
0:02:58 | and that's actually a very robust technique it works very well but it does mean you want |
---|
0:03:02 | to retrain your system with that new front end |
---|
0:03:07 | so for this work we're using three datasets one is switchboard one and two that's |
---|
0:03:11 | what we used for training the baseline system and all the i-vector parameters are trained with |
---|
0:03:16 | just that data |
---|
0:03:17 | and then there's mixer two which is a collection from two thousand four it's a |
---|
0:03:22 | multi-microphone collection |
---|
0:03:24 | it had a clean telephone channel and then eight microphones in the room that collected |
---|
0:03:28 | the data in parallel for two hundred forty speakers and up to six sessions i think |
---|
0:03:33 | that was collected in two thousand four and this dataset actually has not been released |
---|
0:03:38 | and then this is mixer six for that one they did the same type of |
---|
0:03:42 | collection but for a different set of speakers in different rooms and with fourteen microphones as well |
---|
0:03:46 | as the telephone channel |
---|
0:03:48 | and for the sre they focused a lot on the interview condition for that where |
---|
0:03:53 | the interviewer is in the room with the interviewee and you had to separate the two so to |
---|
0:03:58 | not deal with that issue we just took the other portion of the sessions |
---|
0:04:02 | which is a conversation the person is having over the phone so it's the same room collection |
---|
0:04:07 | but it's conversational data |
---|
0:04:08 | and that matches the mixer two style so these are disjoint |
---|
0:04:12 | collections mixer two and mixer six |
---|
0:04:15 | we use mixer two for developing the system either for training the dnn or |
---|
0:04:18 | for adapting our parameters and mixer six we're using for testing to see how |
---|
0:04:22 | well it works |
---|
0:04:26 | so just to give you an idea of what these collections comprise mixer |
---|
0:04:30 | one and two was collected over eight microphones |
---|
0:04:32 | and mixer six was over fourteen |
---|
0:04:35 | we found it would generate a huge dataset if we used all fourteen so we just selected six |
---|
0:04:40 | of them based on the distance from the speaker the mixer six collection comes |
---|
0:04:43 | with documentation about where the microphones were positioned and that's what we used here |
---|
0:04:49 | mixer one and two was available to us but we've actually given this to the ldc |
---|
0:04:54 | and they are planning on making a release if people want to work with this data so |
---|
0:04:59 | it should probably be available fairly soon i think |
---|
0:05:02 | and i should mention we're only evaluating on same-mic trials in the mixer six condition |
---|
0:05:06 | so the trials always have the target speaker and the non-target speakers on |
---|
0:05:10 | the same mic |
---|
0:05:13 | the baseline system is |
---|
0:05:15 | exactly what everybody else is doing with an i-vector system |
---|
0:05:18 | we start with a ubm that's trained on switchboard one and two extract the zeroth and |
---|
0:05:23 | first order statistics to create a supervector and then we take the map point |
---|
0:05:27 | estimate to get the i-vector a six hundred dimensional i-vector |
---|
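The MAP point estimate mentioned here is the posterior mean of the total-variability latent factor. A minimal sketch with toy dimensions, assuming a diagonal UBM covariance; the variable names are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, R = 4, 3, 2                          # toy sizes: UBM components, feature dim, i-vector dim
T = rng.standard_normal((C * F, R))        # total-variability matrix (stacked per component)
Sigma = np.eye(C * F)                      # UBM covariances, assumed diagonal and stacked
N = np.repeat(rng.uniform(1, 5, C), F)     # zeroth-order stats, expanded to each feature dim
F_stats = rng.standard_normal(C * F)       # centered first-order stats, stacked supervector

# posterior precision of the latent factor: I + T' N Sigma^-1 T
precision = np.eye(R) + T.T @ (N[:, None] * np.linalg.solve(Sigma, T))
# MAP point estimate (posterior mean): precision^-1 T' Sigma^-1 F
ivector = np.linalg.solve(precision, T.T @ np.linalg.solve(Sigma, F_stats))
```

In the real system R would be 600 (the "six hundred dimensional i-vector") and T would be trained on the Switchboard statistics.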
0:05:32 | the whitening is done with switchboard two data as well for the dnn |
---|
0:05:36 | case |
---|
0:05:37 | for the microphone map adaptation for the map-adapted case we actually did the |
---|
0:05:41 | whitening using the mixer two microphone data and then the |
---|
0:05:46 | within-class and across-class covariance parameters are the ones being adapted for the plda |
---|
0:05:50 | adaptation |
---|
0:05:53 | so starting with the baseline results |
---|
0:05:55 | well the first result in the table is on sre ten and that's just the telephone |
---|
0:05:59 | results this is sort of the out-of-domain task we have the system trained on switchboard |
---|
0:06:04 | and then the eval data is this sre ten mixer data so you |
---|
0:06:08 | don't have mixer data as part of training the system |
---|
0:06:11 | that's about five point seven percent equal error rate and a point six two mindcf and if |
---|
0:06:16 | you take that system |
---|
0:06:17 | and evaluate it with the mixer six trials the microphone trials |
---|
0:06:22 | you can see the equal error rate goes up by a factor of two |
---|
0:06:25 | or so and mindcf really takes a hit as well |
---|
0:06:29 | and the first number there is the average that's just taking the eer for each channel |
---|
0:06:34 | and then averaging that number's kind of unrealistic because typically you'd have to pick one |
---|
0:06:37 | threshold for everything so the pooled number i think is the more practical metric and that |
---|
0:06:42 | one's even worse you take a bigger hit there because of the calibration problem |
---|
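The average-versus-pooled distinction can be illustrated with a toy experiment: two channels whose score distributions are offset from each other (a calibration mismatch) give a per-channel average EER that looks fine, while pooling all scores under one implicit threshold hurts. A rough sketch with synthetic Gaussian scores, not the paper's data:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep thresholds over the observed scores and
    return (miss + false-alarm) / 2 where the two rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, best_eer = np.inf, 0.5
    for t in thresholds:
        miss = np.mean(target_scores < t)       # targets rejected
        fa = np.mean(nontarget_scores >= t)     # non-targets accepted
        if abs(miss - fa) < best_gap:
            best_gap, best_eer = abs(miss - fa), (miss + fa) / 2.0
    return best_eer

rng = np.random.default_rng(1)
channels = []
for offset in (0.0, 3.0):                       # second channel is miscalibrated
    tar = rng.normal(2.0 + offset, 1.0, 500)
    non = rng.normal(0.0 + offset, 1.0, 500)
    channels.append((tar, non))

avg_eer = float(np.mean([eer(t, n) for t, n in channels]))
pooled_eer = eer(np.concatenate([t for t, _ in channels]),
                 np.concatenate([n for _, n in channels]))
```

Each channel alone has an EER near 16%, but pooling the offset scores under a single threshold roughly doubles it, which is the effect described in the talk.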
0:06:47 | and |
---|
0:06:49 | so for the remaining results i'll report the pooled number since i think that's the more practical |
---|
0:06:53 | metric |
---|
0:06:55 | so first the map-adapted results and here you can see that the mindcf |
---|
0:06:58 | really doesn't improve very much although you do get a pretty big improvement eer goes |
---|
0:07:03 | down by about thirty one percent |
---|
0:07:04 | so that part's nice but you'd really like to see mindcf get a little |
---|
0:07:08 | better |
---|
0:07:10 | and yes i should mention that for lambda we used point five and the reason |
---|
0:07:16 | for that is i did sort of a sweep and you can see |
---|
0:07:19 | there are nice curves at eer because that's where i get a gain |
---|
0:07:22 | and point five looks like it's fairly optimal across microphones the three |
---|
0:07:27 | d plot shows for each microphone the eers as we sweep the lambda we used |
---|
0:07:32 | for doing the adaptation |
---|
0:07:34 | and around point five is where we're seeing a sweet spot for that |
---|
0:07:38 | but if you look at mindcf it doesn't really change very much and that's where we were seeing |
---|
0:07:41 | the problem with this technique |
---|
0:07:44 | so moving on to the enhancement idea we're training a neural network to try |
---|
0:07:49 | to reconstruct a clean signal given a noisy version of it so we have |
---|
0:07:54 | the person talking on the telephone the telephone is our clean version and we also have |
---|
0:07:58 | microphones in the room collecting the microphone-corrupted versions |
---|
0:08:02 | and we just train it as a regression it's a very simple thing we have |
---|
0:08:04 | a windowed set of feature vectors coming into the dnn and we have the same vector we're |
---|
0:08:09 | trying to reconstruct and we just train it over all the samples |
---|
0:08:13 | one key thing and i think this is important is that we include the clean |
---|
0:08:17 | samples as well we'd really like this neural network to not change the clean data but to |
---|
0:08:21 | try to also improve the noisy data to make it more like the clean |
---|
0:08:27 | and just to give you some idea of how this data was collected |
---|
0:08:30 | the ldc did these parallel collections and they have a couple of rooms like one or |
---|
0:08:34 | two rooms which is not really that many rooms but this is how it's |
---|
0:08:37 | done |
---|
0:08:39 | and you have people come in and sit down and they have the microphones around |
---|
0:08:41 | and have all the equipment running |
---|
0:08:43 | and one of the problems is that if you realise later that you want one more microphone |
---|
0:08:47 | it's really hard to come back and collect more data so what people really do especially on the asr |
---|
0:08:53 | side is generate synthetic parallel datasets using rirs available online and point |
---|
0:08:59 | noise sources and just generating tons of parallel data |
---|
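Generating synthetic parallel data as described, by convolving clean speech with a room impulse response and adding noise at a target SNR, can be sketched as follows. Toy signals and a hypothetical RIR; real pipelines use measured RIRs and recorded noise:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 8000
clean = rng.standard_normal(sr)          # one second of toy "clean" speech

# hypothetical room impulse response: a direct path plus a few decaying echoes
rir = np.zeros(400)
rir[0] = 1.0
rir[[80, 200, 350]] = [0.6, 0.3, 0.1]

reverbed = np.convolve(clean, rir)[: len(clean)]   # reverberant copy
noise = rng.standard_normal(len(clean))

# scale the noise for a target SNR of 10 dB, then mix
target_snr_db = 10.0
scale = np.sqrt(np.mean(reverbed ** 2)
                / (np.mean(noise ** 2) * 10 ** (target_snr_db / 10)))
noisy = reverbed + scale * noise                   # synthetic "microphone" channel
```

The (clean, noisy) pair then plays the same role as the real Mixer parallel channels when training the denoising network.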
0:09:02 | and we've actually been working on that more recently there's another paper at interspeech on that |
---|
0:09:06 | and that actually works quite well as well i think that's in the long term |
---|
0:09:10 | the way we want to do it but we had this corpus already available and we wanted |
---|
0:09:13 | to start with that for this work |
---|
0:09:17 | so this is the hybrid system where you have the channel-compensating neural network at the |
---|
0:09:21 | front of it and then you have the i-vector system the baseline from |
---|
0:09:26 | before and we just retrain this pipeline after we train the denoising neural network we |
---|
0:09:30 | retrain the i-vector system on the switchboard data |
---|
0:09:35 | and for the dnn system we're using all the mixer two data for training of course |
---|
0:09:38 | and then we're also using forty mfccs and that's the dimensionality of the output of |
---|
0:09:43 | the neural net we're trying to reconstruct forty mfccs and that includes twenty deltas which |
---|
0:09:50 | may seem kind of counterintuitive but it was actually important to include the delta |
---|
0:09:54 | coefficients |
---|
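Delta coefficients of the kind appended to the static MFCCs here are usually computed with the standard regression formula over a few surrounding frames. A small sketch; the `deltas` helper is illustrative, not the authors' implementation:

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta coefficients over a +/-N frame span
    (edge frames use repeated padding), for a (frames, dims) matrix."""
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
               for n in range(1, N + 1)) / denom

# toy 10-frame, 20-dim "static MFCCs": a linear ramp over time
static = np.tile(np.arange(10.0)[:, None], (1, 20))
full = np.hstack([static, deltas(static)])   # 20 statics + 20 deltas = 40 dims
```

For the linear ramp, every interior frame's delta is exactly 1.0, which is an easy way to sanity-check the indexing.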
0:09:55 | we used a five layer neural network with two thousand forty eight nodes |
---|
0:09:59 | and a twenty one frame input context mainly because that's what we used for bottleneck features before |
---|
0:10:05 | we just adapted that system to this problem |
---|
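As a heavily scaled-down sketch of the kind of feedforward MSE regression described (the real system used a 21-frame context, 40-dim output, and five 2048-unit layers), here is a tiny one-hidden-layer denoiser trained with plain gradient descent on synthetic data. Purely illustrative, not the authors' network:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in: a windowed context of noisy frames in, one clean frame out
context, n_mfcc, hidden = 5, 8, 32
X = rng.standard_normal((256, context * n_mfcc))        # noisy windowed input
W_true = 0.1 * rng.standard_normal((context * n_mfcc, n_mfcc))
Y = X @ W_true                                          # synthetic "clean" targets

# one-hidden-layer regression net, full-batch gradient descent on MSE
W1 = 0.1 * rng.standard_normal((context * n_mfcc, hidden)); b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, n_mfcc));           b2 = np.zeros(n_mfcc)

def forward(X):
    h = np.maximum(0.0, X @ W1 + b1)                    # ReLU hidden layer
    return h, h @ W2 + b2

_, pred = forward(X)
loss_before = np.mean((pred - Y) ** 2)

lr = 0.01
for _ in range(200):
    h, pred = forward(X)
    g = 2.0 * (pred - Y) / len(X)                       # dMSE/dpred
    gW2, gb2 = h.T @ g, g.sum(axis=0)
    gh = (g @ W2.T) * (h > 0)                           # back through the ReLU
    gW1, gb1 = X.T @ gh, gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
loss_after = np.mean((pred - Y) ** 2)
```

The reconstruction loss drops as training proceeds; the real system does the same thing at much larger scale, then feeds the denoised frames into the retrained i-vector pipeline.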
0:10:08 | and then we have the one clean channel and the eight noisy ones coming in |
---|
0:10:13 | and you can see we get a pretty big gain in mindcf and everything it's almost |
---|
0:10:17 | a thirty percent gain in mindcf and that's a cool result |
---|
0:10:20 | and a fifty percent gain in eer so this is really doing what we were hoping which is to |
---|
0:10:23 | get an improvement at mindcf and at eer as well |
---|
0:10:29 | so that was actually a nice gain |
---|
0:10:31 | and |
---|
0:10:32 | i should mention we tried a number of different things initially i think at |
---|
0:10:35 | first we were trying to see if we could do this with log mel-frequency filter banks |
---|
0:10:40 | so some of the work that's been done just on the enhancement side |
---|
0:10:43 | is to try to improve the filter banks and then you can if you want |
---|
0:10:47 | synthesise cepstra from those cleaned-up filter banks |
---|
0:10:51 | but what we found is that the deltas were actually important so going to mfccs |
---|
0:10:56 | plus deltas gave us a bigger gain than using filter banks |
---|
0:10:59 | it's also critical and other people have mentioned this too that |
---|
0:11:03 | you have to do some type of mean and variance normalisation on the data |
---|
0:11:06 | for training the neural net just to get the thing to converge |
---|
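The mean and variance normalisation mentioned here is typically per-utterance CMVN over the feature matrix. A minimal sketch:

```python
import numpy as np

def cmvn(feats, eps=1e-8):
    """Per-utterance cepstral mean and variance normalisation:
    zero mean and unit variance per feature dimension."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

# toy (frames, dims) feature matrix with an arbitrary offset and scale
x = np.random.default_rng(0).normal(5.0, 3.0, (100, 40))
y = cmvn(x)
```

Applied before the denoising network's input (and to its targets), this keeps the regression well-conditioned, which is the convergence point made in the talk.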
0:11:09 | and we also found the architecture had a pretty big impact so i am |
---|
0:11:12 | reporting results on the two thousand forty eight node dnn and you can see we |
---|
0:11:16 | take a bit of a hit if we go down to ten |
---|
0:11:19 | twenty four nodes especially in dcf and then if we go down further we |
---|
0:11:22 | take an even bigger hit |
---|
0:11:24 | but honestly the two thousand forty eight node dnn takes a long |
---|
0:11:27 | time to train it took us weeks to train that one and that's maybe our fault we |
---|
0:11:31 | don't have a parallel training mechanism |
---|
0:11:33 | that was the problem there |
---|
0:11:37 | it's worth seeing what the telephone performance is you don't want a system that is robust |
---|
0:11:40 | to microphone data but doesn't work well for telephone data and so this was actually |
---|
0:11:44 | kind of a nice surprise we get a small gain of a few percent relative |
---|
0:11:47 | just on the telephone task |
---|
0:11:50 | and that was for the dnn system the map-adapted |
---|
0:11:52 | plda falls apart when you go back to telephone data because you moved all those parameters |
---|
0:11:57 | toward this microphone set they are not well matched to telephone data anymore |
---|
0:12:01 | so there's a trade off there |
---|
0:12:04 | so we see a nice gain using this dnn channel compensation technique and it doesn't |
---|
0:12:10 | incur a loss on the telephone data |
---|
0:12:13 | so you don't need to do any kind of channel detection to switch back and |
---|
0:12:16 | forth |
---|
0:12:17 | the map-adapted plda unfortunately so far hasn't worked well for us it does |
---|
0:12:22 | give a gain in eer but the mindcf doesn't really change very much |
---|
0:12:26 | it is really easy to implement though if you have an existing i-vector system you just |
---|
0:12:29 | run on that data to train the parameters |
---|
0:12:32 | the other issue is that we've been using real parallel data for this which is not |
---|
0:12:37 | really very practical so the synthetic parallel corpora make a lot of sense |
---|
0:12:40 | and lastly in the future we're really looking into using recurrent networks we've been |
---|
0:12:46 | doing a lot with feed forward networks with the big context window to |
---|
0:12:49 | allow for that but i think rnns are going to be the way to go looking |
---|
0:12:54 | forward |
---|
0:12:56 | thanks very much |
---|
0:13:02 | how to the sre five |
---|
0:13:09 | think that recent training |
---|
0:13:34 | you said you didn't |
---|
0:13:37 | or did you think about the size of the input window you used twenty one frames |
---|
0:13:42 | i'm just curious about that |
---|
0:13:43 | do you have some |
---|
0:13:45 | input or some ideas do you think that for channel compensation for example you |
---|
0:13:50 | need a longer window and would that differ if you were doing only speaker |
---|
0:13:56 | recognition or |
---|
0:13:58 | you know actually i would really recommend looking at the aspire papers from i think |
---|
0:14:03 | it was from |
---|
0:14:06 | maybe asru i'm not sure it's one of the speech recognition workshops |
---|
0:14:11 | there was one where they actually similarly trained the denoising network but i |
---|
0:14:15 | think they were using the fft outputs the power spectrum |
---|
0:14:20 | outputs and they had a really long window something like three hundred frames or |
---|
0:14:25 | something huge like that and they trained a giant network |
---|
0:14:28 | and they had very impressive results and i've been meaning to see if i can |
---|
0:14:32 | recreate that but it will take me forever to train |
---|
0:14:34 | so i think we want a faster training algorithm first but i would encourage looking |
---|
0:14:38 | at those results and in particular looking at the other aspire systems |
---|
0:14:42 | and i think there was a nice comparison there of whether you |
---|
0:14:45 | do joint training of the whole system the way one group was |
---|
0:14:49 | doing it where you do a multi style sorry multi condition training with |
---|
0:14:55 | a whole bunch of data where your targets are always the clean signals while some |
---|
0:14:59 | people tried to decouple it so the asr system was trained independently |
---|
0:15:03 | and then they trained the denoising network and just used those features and one issue |
---|
0:15:07 | i haven't addressed here is the idea of not retraining the i-vector system |
---|
0:15:11 | so could you actually do okay if the features were coming from the denoising network |
---|
0:15:16 | but you're still using |
---|
0:15:18 | the same i-vector system |
---|
0:15:20 | i wasn't worried about that right now but i think it's worth testing |
---|
0:15:31 | before i start could you go back to one of your earlier slides |
---|
0:15:35 | where you were highlighting the different microphones between mixer two and mixer six |
---|
0:15:41 | yes so |
---|
0:15:43 | so i was looking at mixer one and two and i'm a little |
---|
0:15:48 | concerned i guess channel number five is the kind of the |
---|
0:15:53 | jabra or |
---|
0:15:55 | okay i'm thinking of the star wars style ear wrap cellphone mic and there's also the earbud |
---|
0:16:00 | one so you've got two there actually i mean i don't think you |
---|
0:16:04 | used five and six from mixer one or mixer two is that correct no for |
---|
0:16:10 | mixer one and two we used all the data all of it so i'm thinking that some of |
---|
0:16:15 | those when you have two mics that are actually both configured around the ear they |
---|
0:16:20 | are literally near each other you know i mean it's a mic you're gonna have some |
---|
0:16:25 | i imagine interference between the two |
---|
0:16:28 | so maybe i don't know it's an honest question did you check that okay so |
---|
0:16:34 | one of the things the main question i was gonna ask is when you're |
---|
0:16:37 | looking at the map adaptation you had the |
---|
0:16:44 | denoising enhancement piece when you're looking across the different mics going from one mic to |
---|
0:16:49 | another some mics are closer in terms of their characteristics than others did you |
---|
0:16:54 | see any benefit in moving from one to the other |
---|
0:16:59 | i guess what we're asking is whether we could subset a set of unique mics |
---|
0:17:02 | right and we haven't that's a really good question i think actually moving forward anyway |
---|
0:17:08 | i mean real data is kinda nice because you can reality check but i think |
---|
0:17:11 | actually moving towards the synthetic data you can really move to very different you |
---|
0:17:16 | know |
---|
0:17:16 | room conditions i mean this was collected in exactly two rooms so it's not really diverse and i'm just |
---|
0:17:22 | thinking given the transfer characteristics for all the mics you could kind of look at your solutions |
---|
0:17:26 | to see |
---|
0:17:27 | why if you're moving from one mic to another sometimes if it's a closer |
---|
0:17:32 | one one solution does better than another |
---|
0:17:34 | that's actually an analysis we could try to do we could try to see which features |
---|
0:17:37 | look closer across the parallel data sets |
---|
0:17:40 | i wasn't trying to ask you to burn more compute cycles on it it's just a nice question |
---|
0:17:55 | that's a good point we have to note i couldn't find placement information for |
---|
0:17:59 | mixer one and two it probably exists somewhere but i ran out of luck trying to |
---|
0:18:02 | find it mixer six has a lot of information |
---|
0:18:09 | so mixer two it was at three locations i think there's a site at the ldc |
---|
0:18:14 | and |
---|
0:18:17 | and i think i think there are three and then mixer six i |
---|
0:18:20 | think is two i believe that's right |
---|
0:18:25 | although to be sure you would have to start with reading the documentation |
---|
0:18:34 | i had a question on the denoising network so when we applied that kind of thing |
---|
0:18:39 | we found it was important to |
---|
0:18:42 | apply speech activity detection first and then train the network because if we send the |
---|
0:18:48 | silence frames to it |
---|
0:18:49 | it learns this easy but low-value mapping because it's just zeros and then |
---|
0:18:54 | it goes through the rest of the network and the network is zapping that so that |
---|
0:18:58 | is actually a good point we ran a |
---|
0:19:02 | we eliminated the silence that's right i |
---|
0:19:05 | so i think we might have run that on the clean channel for training and |
---|
0:19:09 | applied it to the other ones for decoding we always ran it on whatever the data |
---|
0:19:13 | was |
---|
0:19:13 | we tried to optimize that you know to be a realistic condition but for training i |
---|
0:19:19 | think we might have done it on the telephone data which matched our sad |
---|
0:19:22 | system best and then used those as |
---|
0:19:24 | the speech marks across channels |
---|
0:19:32 | any more questions |
---|
0:19:35 | okay let's thank the speaker |
---|