0:00:14 | you know all that from you the |
---|
0:00:18 | i will be presenting what we did for lre fifteen, and probably |
---|
0:00:23 | a great part of you have already seen most of this presentation |
---|
0:00:27 | at the workshop |
---|
0:00:29 | we have changed a few things, corrected some errors |
---|
0:00:32 | and i will give you the presentation again |
---|
0:00:36 | so |
---|
0:00:38 | well, let's see, as john already said it was a collaboration between |
---|
0:00:43 | brno and agnitio, and only technically the third one, you know |
---|
0:00:46 | i included almost the full list of people who participated in |
---|
0:00:51 | our team. it was a lot of concentrated fun during the autumn and we really |
---|
0:00:58 | enjoyed that |
---|
0:01:00 | so |
---|
0:01:02 | let's go straight to the system that we used. we decided |
---|
0:01:07 | to participate in both nist conditions, the fixed data condition and open data condition |
---|
0:01:12 | in the fixed data condition we joined some efforts with mit and they provided |
---|
0:01:20 | some definitions of the |
---|
0:01:21 | of the development set and the short cuts, so we split all of the data we |
---|
0:01:26 | had available for |
---|
0:01:27 | training and dev: we kept sixty percent for training and forty percent for dev |
---|
0:01:33 | and we also generated some short cuts out of the long segments, with durations |
---|
0:01:38 | uniformly distributed from three to thirty seconds, because that's what we were |
---|
0:01:44 | expecting in the eval data according to the evaluation plan |
---|
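[editorial note] the 60/40 split and the uniform 3-30 second cuts just described can be sketched roughly as below; the helper, file names and seed are made-up illustrations, not the actual tooling used by the team.

```python
# Hypothetical sketch of the data preparation described above: keep 60 % of
# the files for training, 40 % for dev, and draw one target cut duration per
# dev segment uniformly from 3-30 seconds.
import random

def split_and_cut(files, train_frac=0.6, min_s=3.0, max_s=30.0, seed=0):
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    train, dev = shuffled[:n_train], shuffled[n_train:]
    # one uniformly distributed target duration per dev segment
    cuts = {f: rng.uniform(min_s, max_s) for f in dev}
    return train, dev, cuts

train, dev, cuts = split_and_cut([f"utt{i:03d}" for i in range(100)])
```

the real split was of course done over many corpora at once; the point is only the 60/40 ratio and the uniform duration sampling.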
0:01:47 | for the open training data condition |
---|
0:01:51 | we tried to harvest all of the data from the hard drives that we could find |
---|
0:01:56 | we also asked our friends |
---|
0:01:58 | from here, from bilbao, to provide some other databases, and also najim from mit, so |
---|
0:02:04 | these are databases that you might not be using in your systems regularly, for example |
---|
0:02:11 | we took european spanish and british english |
---|
0:02:14 | and from the al jazeera free speech corpus we took some arabic dialects; otherwise it was |
---|
0:02:22 | just all the data that we harvested for nist lre o nine from the radios, |
---|
0:02:28 | from the voice of america and so on. just to let you know, we didn't |
---|
0:02:32 | use any babel data |
---|
0:02:34 | for the classifier training; we just used the babel data to train some |
---|
0:02:41 | bottleneck feature extractors, i will speak about it later |
---|
0:02:47 | bottleneck features, that's really the core of our system, so |
---|
0:02:52 | i think that most of you are already familiar with this architecture: we train a |
---|
0:02:56 | neural network to classify phoneme states, it's just a little bit special in its architecture because |
---|
0:03:04 | it is a stacked bottleneck so |
---|
0:03:06 | the structure is here on the picture |
---|
0:03:08 | the stacked means that |
---|
0:03:10 | we first train the classical network to classify the phoneme states, then we cut it at |
---|
0:03:15 | the bottleneck |
---|
0:03:16 | and then stack these bottlenecks in time and train again |
---|
0:03:20 | so we train another stage and we take the bottlenecks |
---|
0:03:24 | from the second stage, from the second network, so that's why the stacked bottlenecks |
---|
0:03:30 | the effect is that |
---|
0:03:31 | in the end they see a longer context and |
---|
0:03:35 | from our experience they work pretty well, but if you do |
---|
0:03:39 | some tuning you can |
---|
0:03:42 | just use the first bottlenecks, it's enough, especially for speaker id i'd say |
---|
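[editorial note] the "stacking" step between the two networks can be sketched as follows; the concrete time offsets here are illustrative assumptions, not necessarily the configuration used in this system.

```python
# Minimal sketch of stacked bottlenecks: take the first-stage bottleneck
# outputs and concatenate them at several time offsets to form the input
# of the second-stage network (offsets clamped at utterance edges).
import numpy as np

def stack_bottlenecks(bn, offsets=(-10, -5, 0, 5, 10)):
    """bn: (T, D) first-stage bottleneck frames -> (T, D * len(offsets))."""
    T = bn.shape[0]
    idx = np.arange(T)
    cols = [bn[np.clip(idx + o, 0, T - 1)] for o in offsets]  # clamp at edges
    return np.hstack(cols)

bn = np.random.RandomState(0).randn(200, 80)   # 200 frames, 80-dim bottleneck
stacked = stack_bottlenecks(bn)                # input for the second network
```

the second network is then trained on `stacked` against the same phoneme-state targets, and its own bottleneck layer gives the final features.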
0:03:49 | so for the fixed training condition apparently we had to use switchboard, and the network |
---|
0:03:54 | had approximately seven thousand triphone states in total |
---|
0:03:58 | and we were trying a new technique with automatic acoustic unit discovery |
---|
0:04:06 | and we trained the bottleneck on these units, and for that we used lre fifteen data |
---|
0:04:11 | for the open training |
---|
0:04:13 | condition |
---|
0:04:15 | we used the babel data, and later, after the evaluation, we trained another |
---|
0:04:19 | network that has seventeen languages of babel, and it is indeed the one |
---|
0:04:26 | that we would like to use if you can use |
---|
0:04:30 | all kinds of data |
---|
0:04:33 | so, general system overview: as i already said, the basis of our |
---|
0:04:38 | system are the bottlenecks, either based on switchboard or babel data, and then as references we |
---|
0:04:45 | had the mfcc shifted delta cepstra system, we had a pllr system, we also tried |
---|
0:04:52 | some |
---|
0:04:54 | phonotactic systems and modeled the |
---|
0:04:56 | expected n-gram counts with the multinomial subspace model, and techniques like that, but |
---|
0:05:03 | they didn't make it to the fusion |
---|
0:05:07 | and our favourite classifier is just the simple gaussian linear classifier |
---|
0:05:12 | and if you can, along with it, it's good to include the i-vector uncertainty in the |
---|
0:05:18 | computation of scores; that helps quite a bit with the calibration and also |
---|
0:05:24 | provides a slight |
---|
0:05:27 | performance boost |
---|
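[editorial note] the gaussian linear classifier mentioned above can be sketched on toy data as below: one gaussian per language with a shared covariance, which makes the log-likelihood linear in the i-vector. the data are synthetic, and propagating the full i-vector uncertainty (as the talk recommends) is not shown.

```python
# Toy Gaussian linear classifier: per-class means, shared within-class
# covariance, so scores are an affine function of the input vector.
import numpy as np

def train_glc(X, y, n_classes):
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    W = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum()
            for c in range(n_classes)) / len(X)   # pooled within-class cov
    P = np.linalg.inv(W)
    A = means @ P                                  # per-class weight vectors
    b = -0.5 * np.einsum('cd,dk,ck->c', means, P, means)  # per-class biases
    return A, b

rng = np.random.RandomState(1)
n, d = 200, 5
means_true = np.array([[2.0] * d, [-2.0] * d])
y = rng.randint(0, 2, n)
X = means_true[y] + rng.randn(n, d)   # synthetic "i-vectors"
A, b = train_glc(X, y, 2)
scores = X @ A.T + b                  # (n, 2) linear log-likelihood scores
pred = scores.argmax(axis=1)
```

in the real system the inputs would be i-vectors and the covariance would ideally be augmented with the per-utterance i-vector uncertainty.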
0:05:30 | and |
---|
0:05:31 | we had a new thing, |
---|
0:05:33 | a sequence summarizing neural network, |
---|
0:05:36 | i will speak about it |
---|
0:05:40 | just later, because it was a little bit of a disaster, as you will see |
---|
0:05:45 | the fusion |
---|
0:05:47 | fusion was a little bit different, we tried to reflect the nist criteria, because |
---|
0:05:52 | the c average was computed over the clusters and then averaged, so |
---|
0:05:59 | we reflected this, and otherwise |
---|
0:06:04 | we had one weight |
---|
0:06:06 | per system and one bias per language |
---|
0:06:08 | and the cluster priors: we assigned the cluster specific priors for the data |
---|
0:06:14 | for each cluster, and all of the other data, |
---|
0:06:17 | the other sets, had the prior set to zero, and we trained over |
---|
0:06:22 | all clusters in the end, so that |
---|
0:06:26 | i think that it improved the results on the nist metric quite substantially |
---|
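[editorial note] the fusion just described, one weight per input system plus one bias per language with scoring restricted to a cluster, can be sketched as below; the weights, biases and the two-language cluster are made-up numbers, and the training of the fusion parameters is not shown.

```python
# Sketch of cluster-aware score fusion: fused score = sum over systems of
# (system weight * system scores) + per-language bias; posteriors are then
# computed inside one language cluster, i.e. zero prior outside the cluster.
import numpy as np

def fuse(system_scores, weights, bias):
    """system_scores: (K, N, L) scores from K systems -> (N, L) fused."""
    return np.tensordot(weights, system_scores, axes=1) + bias

def cluster_posteriors(fused, cluster):
    """Softmax restricted to the languages of one cluster."""
    s = fused[:, cluster]
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K, N, L = 2, 4, 6
rng = np.random.RandomState(2)
scores = rng.randn(K, N, L)          # raw scores from two systems
w = np.array([0.7, 0.3])             # one weight per system
b = np.zeros(L)                      # one bias per language
fused = fuse(scores, w, b)
post = cluster_posteriors(fused, cluster=[0, 1])  # a two-language cluster
```

in the real system the weights and biases would be trained jointly over all clusters, with cluster-specific priors on the target languages.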
0:06:33 | and also we gave nist a system that was |
---|
0:06:36 | a classical multiclass system so that they could do some between-cluster analysis on |
---|
0:06:41 | this, because if we gave them just the one that we calibrated and fused |
---|
0:06:47 | this way |
---|
0:06:48 | they would be out of luck with doing anything with that, because of course |
---|
0:06:53 | they asked for |
---|
0:06:54 | log likelihood ratios, not log likelihoods. i hope that next time they |
---|
0:06:58 | will rectify this |
---|
0:07:02 | this is all we had in the end in our submissions |
---|
0:07:07 | most of the systems are stacked bottleneck based, and cd means the cluster |
---|
0:07:11 | dependent system, i will speak about it just two slides later |
---|
0:07:15 | and then there was this a sequence summarizing network |
---|
0:07:19 | and as you can see |
---|
0:07:21 | it is clearly the worst system, it would never make it to |
---|
0:07:26 | the fusion, but at the nist workshop i was presenting this as a |
---|
0:07:29 | system that could almost perfectly classify the dev data. that's not the case, there was |
---|
0:07:34 | a bug, of course, |
---|
0:07:37 | some devel data in the training data |
---|
0:07:40 | so now it's the worst system |
---|
0:07:43 | so anyway, we were so scared that it worked so well on our test data |
---|
0:07:47 | that we didn't include it in the primary system anyway, so the red arrow shows |
---|
0:07:52 | what we had as a primary system, and the |
---|
0:07:56 | alternate system would be with the |
---|
0:07:59 | sequence summarizing network included. what i report here is the c average; the star |
---|
0:08:04 | means that the calibration was performed on the dev set |
---|
0:08:09 | i don't show the c average for the dev set, because during |
---|
0:08:13 | the development we were doing cheating calibration, i think, |
---|
0:08:16 | which is |
---|
0:08:18 | not in this slide anymore |
---|
0:08:21 | and so these are the results on the dev set |
---|
0:08:25 | it's pretty good, let's skip to the |
---|
0:08:28 | results on the evals |
---|
0:08:31 | there is nothing much to say, just that we see quite some calibration |
---|
0:08:36 | loss on the eval data |
---|
0:08:39 | and |
---|
0:08:42 | that was not the case on our test data, especially on the fixed |
---|
0:08:46 | set, because it proved to be |
---|
0:08:49 | quite an easier set than the one we designed for the open data condition |
---|
0:08:56 | so |
---|
0:08:57 | so that's it, that's our system for the |
---|
0:09:01 | fixed training condition |
---|
0:09:03 | so now let's talk about the specialities we had there, the first one being the cluster |
---|
0:09:08 | dependent i-vector system |
---|
0:09:10 | cluster dependent means that we train |
---|
0:09:12 | per cluster: we train the ubm separately per cluster and then the i-vector extractor, and the rest |
---|
0:09:18 | of the system is trained on the whole data |
---|
0:09:23 | they provide, |
---|
0:09:25 | you can see, there are six independent systems which provide the scores, and then we |
---|
0:09:29 | fuse them here |
---|
0:09:32 | with a simple average just to provide some robustness; we calibrate them later anyway |
---|
0:09:38 | so this proved to be quite effective during the development, you just need |
---|
0:09:45 | to take care about the amount of data in the clusters, so |
---|
0:09:50 | the result shown here indicates that there is not enough data, and |
---|
0:09:55 | if you use a diagonal ubm you have, |
---|
0:09:58 | you have a better result in the end, which i believe is caused by not having |
---|
0:10:03 | enough data per cluster to fit all of the parameters of the full |
---|
0:10:06 | covariance ubm |
---|
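[editorial note] the diagonal-versus-full-covariance observation above can be made concrete with a quick parameter count; the component count and feature dimension below are typical values chosen for illustration, not necessarily the ones used in this system.

```python
# Back-of-the-envelope UBM parameter counts: a full-covariance component has
# D*(D+1)/2 covariance parameters versus D for a diagonal one, so per-cluster
# data has to fit far more parameters in the full-covariance case.
def ubm_params(n_components, dim, full_covariance):
    cov = dim * (dim + 1) // 2 if full_covariance else dim
    return n_components * (dim + cov + 1)  # means + covariances + weights

full = ubm_params(1024, 60, full_covariance=True)
diag = ubm_params(1024, 60, full_covariance=False)
```

with these assumed sizes the full-covariance model has over fifteen times more parameters, which is why limited per-cluster data favours the diagonal ubm.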
0:10:10 | and the sequence summarizing neural network, which doesn't work |
---|
0:10:14 | it's, |
---|
0:10:14 | i don't know if you have ever used it; for language id it's basically |
---|
0:10:20 | you take a sequence, a short utterance |
---|
0:10:23 | and |
---|
0:10:24 | and pass it through the network to summarise it; there is a summarisation layer |
---|
0:10:29 | inside |
---|
0:10:31 | where you take the mean of the frames, then you propagate the rest |
---|
0:10:33 | till the end where you have the |
---|
0:10:36 | probabilities of the classes, and you do it all over again over all the data |
---|
0:10:41 | and |
---|
0:10:43 | that's it |
---|
0:10:45 | the |
---|
0:10:47 | and the trick is that you can use the sequence summarizing layer |
---|
0:10:50 | as some sort of feature extractor and model its output later differently |
---|
0:10:56 | and apparently it works a little bit better than just using the network to do |
---|
0:11:01 | the final classification |
---|
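[editorial note] the sequence summarizing idea can be sketched as a tiny forward pass: frame-level layers, a mean-pooling "summarization" layer over the whole utterance, then a classification layer. layer sizes and weights below are random placeholders, not the actual architecture.

```python
# Minimal sequence summarizing network sketch: the mean over time turns a
# variable-length utterance into one fixed vector, which can also be reused
# as an utterance embedding for a separate classifier.
import numpy as np

def ssnn_forward(frames, W1, W2):
    h = np.maximum(frames @ W1, 0.0)   # frame-level hidden layer (ReLU)
    summary = h.mean(axis=0)           # summarization layer: mean over time
    logits = summary @ W2              # utterance-level class scores
    e = np.exp(logits - logits.max())
    return summary, e / e.sum()        # embedding and class probabilities

rng = np.random.RandomState(3)
frames = rng.randn(150, 40)            # one utterance: 150 frames, 40-dim
W1 = rng.randn(40, 64) * 0.1
W2 = rng.randn(64, 10) * 0.1
summary, probs = ssnn_forward(frames, W1, W2)
```

the `summary` vector is what the talk proposes to extract and model with a different back-end instead of taking the network's own class probabilities.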
0:11:03 | we had some partial results with the sequence summarizing network when we |
---|
0:11:09 | tried it on lre o nine, but here the task is so much tougher |
---|
0:11:14 | and |
---|
0:11:15 | the system was a complete disaster |
---|
0:11:18 | open training data condition |
---|
0:11:21 | it's almost the same scenario, just we had a little bit more variability in |
---|
0:11:26 | the features; here specifically i would like to point out the multilingual features, multilingual bottleneck features, |
---|
0:11:33 | that is the ml seven system |
---|
0:11:37 | and |
---|
0:11:39 | you can see that if you include this whole machinery and all of the data |
---|
0:11:43 | and a nice network that can really cluster the space |
---|
0:11:47 | of the languages, you get clearly the best system that you can get |
---|
0:11:54 | and it is also the case on the eval data |
---|
0:11:59 | here i can even show you what the difference is when you |
---|
0:12:04 | use the covariance in the gaussian linear classifier to obtain the scores |
---|
0:12:11 | it's the last line versus the second line of the table; there is not so |
---|
0:12:16 | much gain on the dev data because they're already |
---|
0:12:22 | close to whatever we are training on, but there is a nice gain, |
---|
0:12:28 | a nice gain, on the eval data |
---|
0:12:34 | if we had submitted just the single system, that would probably have been the best |
---|
0:12:39 | but of course |
---|
0:12:41 | we hadn't seen the |
---|
0:12:43 | results on the eval data before submitting and |
---|
0:12:46 | tried the whole fusion, which is |
---|
0:12:50 | slightly worse than the single best system |
---|
0:12:57 | some analysis of the training data |
---|
0:13:00 | we |
---|
0:13:01 | we had a little time constraint and we thought that, |
---|
0:13:06 | from our experience, |
---|
0:13:07 | it's always good, or even |
---|
0:13:11 | necessary, to retrain the final classifier; i mean, when you have the i-vectors, to retrain |
---|
0:13:16 | the logistic regression or the gaussian classifier to get your class posteriors |
---|
0:13:22 | but unfortunately that was not the case for the open data condition; we decided, |
---|
0:13:27 | okay, we have this ubm and i-vector extractor, let's just use those and retrain |
---|
0:13:34 | the system we will use for our submission for the open data condition |
---|
0:13:39 | as |
---|
0:13:41 | and we didn't train a new ubm and i-vector extractor; of course we did it |
---|
0:13:45 | afterwards |
---|
0:13:46 | and you can see that |
---|
0:13:48 | the column just below the submission is the one that we would get if we |
---|
0:13:54 | took the time and retrained both the ubm, the i-vector extractor and the classifier on top, on our |
---|
0:14:02 | dataset |
---|
0:14:05 | so we hurt ourselves quite a bit here as well |
---|
0:14:12 | so features |
---|
0:14:14 | as i already said, the bottleneck features are the best ones that we were able |
---|
0:14:19 | to |
---|
0:14:20 | train |
---|
0:14:22 | if you compare them with the mfcc with shifted delta cepstra there |
---|
0:14:29 | is a huge gain, and i think that |
---|
0:14:33 | the bottleneck system should be the basis of |
---|
0:14:37 | any serious |
---|
0:14:38 | language id system nowadays |
---|
0:14:42 | the bottlenecks out of the network that was trained on the automatically derived units |
---|
0:14:47 | didn't perform very well, but of course |
---|
0:14:51 | that was a very new thing and we didn't want to only |
---|
0:14:56 | run the bottlenecks and |
---|
0:14:59 | be done with the evaluation, so we tried it. you can see that it still |
---|
0:15:04 | really depends on whether you can derive some |
---|
0:15:08 | meaningful units and, |
---|
0:15:09 | more specifically, whether |
---|
0:15:11 | the eval data would match the data they were trained on, because |
---|
0:15:16 | then the units |
---|
0:15:18 | would correspond and probably the bottlenecks would be better |
---|
0:15:22 | so far it doesn't work that well |
---|
0:15:29 | the french cluster: yesterday i saw many people present results here already without |
---|
0:15:34 | the french cluster, inspired by what was done at the nist workshop, where it was |
---|
0:15:40 | excluded from the results. i think that we should not do that; i spoke |
---|
0:15:46 | to ldc |
---|
0:15:47 | and the data are completely okay, people can recognise it, there is just a problem |
---|
0:15:51 | with the channel, as they gave us |
---|
0:15:53 | one channel in training and another one in the test, they basically swapped it |
---|
0:15:59 | and because this is a cluster of just two languages, we all built a very |
---|
0:16:02 | nice channel detector |
---|
0:16:04 | so |
---|
0:16:06 | that is something we should deal with, and not exclude the french cluster |
---|
0:16:09 | from the evaluation |
---|
0:16:12 | just please fix it |
---|
0:16:14 | well, we will try, but we haven't had time to really do that, so all of |
---|
0:16:18 | the results i will show here of course include the french cluster |
---|
0:16:22 | and |
---|
0:16:23 | there |
---|
0:16:25 | they're pretty good if you take the multilingual bottleneck features, but we |
---|
0:16:29 | have to be careful even when you're doing analysis with the french cluster |
---|
0:16:36 | the creole from the french cluster is actually in babel, so if you happen to have |
---|
0:16:39 | some babel data, be careful about it, rather not use it or use it |
---|
0:16:43 | carefully |
---|
0:16:45 | or you might be surprised how well you do on this problem |
---|
0:16:47 | well, for us it didn't help at all |
---|
0:16:52 | so |
---|
0:16:53 | we of course tried a bunch of classifiers on top of the i-vectors and |
---|
0:16:58 | i can say that |
---|
0:17:00 | it's all about the same |
---|
0:17:04 | and the classifier of choice is the simplest one, just the gaussian linear classifier that |
---|
0:17:10 | you can build |
---|
0:17:11 | right away out of i-vectors |
---|
0:17:13 | niko was experimenting with some different language dependent i-vectors, where you extract the i-vectors |
---|
0:17:20 | with the language priors involved |
---|
0:17:24 | it was performing nicely |
---|
0:17:28 | but |
---|
0:17:30 | not really beating |
---|
0:17:32 | the simple gaussian linear classifier. we tried a |
---|
0:17:37 | fully bayesian classifier, we tried a neural network and logistic regression; you can see |
---|
0:17:42 | that all the columns here are pretty much the same |
---|
0:17:48 | and |
---|
0:17:49 | we still have a few minutes, so i can again briefly say something |
---|
0:17:53 | about the automatically derived units: it's a variational bayes method, we |
---|
0:17:58 | train a dirichlet process mixture of hmms and we try to fit an |
---|
0:18:04 | open phoneme loop on the data to estimate the |
---|
0:18:09 | units |
---|
0:18:11 | and then we use this to somehow transcribe the data |
---|
0:18:16 | and use these units |
---|
0:18:19 | as the source for training the neural network which would include the |
---|
0:18:24 | bottleneck, and then |
---|
0:18:25 | we would have some |
---|
0:18:27 | unsupervised bottleneck |
---|
0:18:31 | well, maybe there is, there is |
---|
0:18:34 | still some hope for this, and i hope that people at the jhu workshop |
---|
0:18:37 | will move this thing forward and we will see. the good thing is that |
---|
0:18:43 | we were able to surpass the mfcc baseline on the dev set with this system |
---|
0:18:50 | and i think that's already impressive |
---|
0:18:55 | so the conclusions |
---|
0:18:59 | again |
---|
0:19:00 | use the bottleneck system in your lid system; the gaussian linear classifier is enough |
---|
0:19:07 | and if you can, just include the uncertainty in the score computation |
---|
0:19:13 | and we tried a bunch of phonotactic systems and they performed |
---|
0:19:20 | okay, but they didn't make it to the fusion |
---|
0:19:24 | and |
---|
0:19:26 | i would say that it's always good to have some exercise with the data engineering |
---|
0:19:31 | and try to see the |
---|
0:19:33 | data that we have and try to collect something and |
---|
0:19:36 | work with the data, not only with the systems |
---|
0:19:40 | we tried a bunch of other things like denoising and dereverberation; we didn't see |
---|
0:19:45 | any gains on the dev set, and there were very slight gains on the evaluation |
---|
0:19:49 | set |
---|
0:19:52 | for the phonotactic systems we were using switchboard to train them |
---|
0:19:56 | and |
---|
0:19:57 | we also tried a phone nn, which |
---|
0:20:00 | was pretty bad |
---|
0:20:02 | so that's all, thank you |
---|
0:20:11 | okay time for some questions |
---|
0:20:20 | so my question is more related to the stacked bottlenecks that you presented recently; |
---|
0:20:25 | you mentioned that it's good for language id, but you didn't get such good |
---|
0:20:31 | results for speaker id |
---|
0:20:32 | well, we got good results for speaker id, it's just that we got as good |
---|
0:20:38 | results with the bottlenecks that were not stacked, so you can train the |
---|
0:20:44 | first network |
---|
0:20:45 | only and take the classical bottlenecks, you don't need to do this exercise with |
---|
0:20:50 | stacking the bottlenecks and training another network |
---|
0:20:53 | well, but do they perform well for speaker id as well, or is that not right? |
---|
0:20:59 | i wouldn't say, i wouldn't say that it's worth it |
---|
0:21:03 | but maybe we'll be using it for sre sixteen, just don't use it as an |
---|
0:21:08 | excuse |
---|
0:21:11 | and the other question is, |
---|
0:21:13 | i guess that using these stacked bottleneck features and then six ubms, one per |
---|
0:21:21 | language cluster, your solution was heavy in terms of time, right? |
---|
0:21:26 | well, that is indeed an |
---|
0:21:29 | oracle system |
---|
0:21:30 | from the point of view of the design, but it worked slightly better |
---|
0:21:37 | i wouldn't be in favour of building such a system for five percent relative |
---|
0:21:42 | gain, or even ten percent relative, but in an evaluation |
---|
0:21:46 | the numbers matter; the usability is |
---|
0:21:49 | the second thing |
---|
0:21:54 | more questions |
---|
0:22:06 | thank you for the presentation. i'm sorry, because my question is also related to |
---|
0:22:10 | the stacked bottlenecks. i was wondering if you have made any analysis on the |
---|
0:22:15 | alignment provided by both |
---|
0:22:17 | the first bottlenecks and the stacked ones, to see if there is really an evolution |
---|
0:22:21 | in the process of |
---|
0:22:23 | alignment |
---|
0:22:25 | you mean, you mean the performance of the system, or some |
---|
0:22:29 | no, i'm talking rather about the alignment on your ubm, to see how |
---|
0:22:33 | the distribution of the features evolves |
---|
0:22:37 | i don't think we made this comparison |
---|
0:22:40 | sorry |
---|
0:22:46 | can i ask a question also? my question regards the context you're looking at, plus minus ten |
---|
0:22:53 | frames; did you |
---|
0:22:55 | somehow, you know, kind of explore it, or was that fixed to this set? of |
---|
0:22:59 | course this is not the ideal number, we explored |
---|
0:23:03 | a bunch of numbers. if you're having just the first network, i think that you |
---|
0:23:07 | can play more with the context |
---|
0:23:12 | you should aim for something like three hundred milliseconds of context. if you're |
---|
0:23:16 | using the stacked bottlenecks, the context is longer because you stack |
---|
0:23:21 | several bottlenecks and |
---|
0:23:22 | use that in the second stage, so |
---|
0:23:24 | that's why there we use something like plus minus ten |
---|
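[editorial note] the numbers in this answer can be turned into quick receptive-field arithmetic; the 10 ms frame shift, the first-stage context of plus minus fifteen frames and the stacking offsets below are assumptions for illustration, not confirmed configuration values.

```python
# Rough context arithmetic: with a 10 ms frame shift, a first-stage input
# context of +/-15 frames is about the 300 ms mentioned above; stacking
# first-stage outputs at +/-10 frames extends the total span further.
FRAME_SHIFT_MS = 10
first_stage_ctx = 15                 # frames to each side, first network
stack_offsets = (-10, -5, 0, 5, 10)  # second-stage stacking offsets

first_span_ms = (2 * first_stage_ctx + 1) * FRAME_SHIFT_MS
total_ctx = first_stage_ctx + max(abs(o) for o in stack_offsets)
stacked_span_ms = (2 * total_ctx + 1) * FRAME_SHIFT_MS
```

under these assumptions the first network sees roughly a 310 ms span and the stacked bottleneck roughly 510 ms, which matches the speaker's point that stacking lengthens the effective context.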
0:23:27 | i was thinking it may be more sensitive |
---|
0:23:29 | to the background noise, because in your other systems you said you did |
---|
0:23:33 | some denoising, so i was wondering what's more sensitive to noise. the bottleneck is pretty good at |
---|
0:23:38 | dealing with the noise actually; i had a paper at interspeech where we trained a denoising |
---|
0:23:44 | autoencoder |
---|
0:23:45 | and it works pretty well on the mfccs |
---|
0:23:48 | then we used the denoised spectra to generate the bottlenecks |
---|
0:23:53 | and |
---|
0:23:54 | well, basically repeated all the experiments with the bottlenecks, and the gains were much, |
---|
0:23:59 | much smaller |
---|
0:24:04 | discussion |
---|
0:24:12 | so this is more of |
---|
0:24:14 | a comment on the french cluster you were speaking about, and i agree, you know, it |
---|
0:24:19 | showed up as problematic, and you said ignoring it is not the answer to it |
---|
0:24:24 | i would point out that we do have a contradiction going on, in the sense that |
---|
0:24:29 | you labelled it as a simple channel thing, right |
---|
0:24:33 | but we know from lre o nine and other ones we've done with |
---|
0:24:37 | narrowband audio brought over from broadcast, and we haven't seen this massive shift before |
---|
0:24:43 | so we have a contradiction: in the past we used this successfully with telephony speech |
---|
0:24:48 | pulled from broadcast and so forth. there is an interesting point here, which is that, |
---|
0:24:54 | again, ldc went out and did say that it's not, it was not |
---|
0:24:58 | mislabelled, there were no errors in there, but |
---|
0:25:01 | there's a chance that the formality of the language changes based on whether you're broadcast: you |
---|
0:25:06 | might be at a higher, you know, high versus low register, whereas telephony... so i |
---|
0:25:11 | just bring this up in general, because there's been talk coming on this; it may |
---|
0:25:15 | be that there is something about the actual |
---|
0:25:18 | dialect shift that happens based on how it's produced, not so much the channel |
---|
0:25:23 | we don't know yet |
---|
0:25:25 | i agree |
---|
0:25:27 | okay, let's thank the speaker again |
---|