0:00:15 | so first, thank you very much to the odyssey conference for giving us the chance |
---|
0:00:20 | to present our language recognition system. my name's raymond; we're from the university of sheffield |
---|
0:00:26 | and the chinese university of hong kong |
---|
0:00:30 | so the design of a language recognition system is pretty |
---|
0:00:35 | fundamental and standard |
---|
0:00:37 | so the motivation of the paper and the talk today will be basically to |
---|
0:00:43 | go through the key points of the core systems, and also the system |
---|
0:00:47 | fusion as well as the calibration |
---|
0:00:53 | a bit of background: language recognition is about recognising the language from a speech segment |
---|
0:00:57 | if we go through the classical methods of language recognition, we can see researchers |
---|
0:01:03 | using acoustic or phonotactic features working on that |
---|
0:01:07 | and then there are shifted delta cepstral features, which take a longer temporal span of |
---|
0:01:11 | the signal, which helps language recognition. more recently, i-vectors, dnns and |
---|
0:01:18 | the combination of these methods proved to be useful in language recognition |
---|
0:01:23 | for ourselves, we submitted a combination of three systems to the |
---|
0:01:30 | nist language recognition evaluation last year |
---|
0:01:32 | the first one is a standard i-vector system, then we have a phonotactic system, and the |
---|
0:01:36 | third one is a frame-based dnn system. after the evaluation we |
---|
0:01:41 | got a little bit of enhancement combining the bottleneck features and i-vectors; we'll go through the details |
---|
0:01:45 | of that later |
---|
0:01:47 | so this is just a brief recap on the training data and also the |
---|
0:01:51 | target languages. we have the switchboard data used as telephone speech training data, and also some |
---|
0:01:56 | multilingual |
---|
0:01:59 | lre training data from past evaluations |
---|
0:02:03 | and the training set of this year's evaluation |
---|
0:02:04 | so there are twenty languages in the language recognition evaluation, and they fall into six language clusters, and |
---|
0:02:09 | the task of language recognition is to identify languages within the clusters, where the languages |
---|
0:02:15 | are closely related |
---|
0:02:18 | the training data of the language recognition evaluation comes as a raw set of files of |
---|
0:02:23 | about seven hundred to eight hundred hours. to start with the training we run some |
---|
0:02:28 | voice activity detection. to train our voice activity detector we use the conversational |
---|
0:02:34 | telephone speech data: |
---|
0:02:37 | taking our switchboard phone tokenizer model, we |
---|
0:02:40 | run |
---|
0:02:41 | a forced alignment on the data, and then we just treat the silence labels as non- |
---|
0:02:46 | speech and the non-silence labels as speech. we also take some of the past lre training |
---|
0:02:52 | data from voice of america broadcast speech to train the voice activity detector for |
---|
0:02:57 | that channel |
---|
0:02:58 | for this data we just take the raw speech/non-speech labels |
---|
0:03:02 | the amounts of voiced and unvoiced speech in the different corpora are shown in the table |
---|
0:03:11 | we train a two-layer dnn for the vad. so this |
---|
0:03:15 | is a standard dnn |
---|
0:03:18 | which we train on filter bank features, with feature splicing of fifteen |
---|
0:03:23 | frames on the left and the right |
---|
0:03:25 | the output of the dnn is |
---|
0:03:28 | two neurons, which give the voiced and unvoiced posterior probabilities |
---|
0:03:33 | we have sequence alignment using a two-state hmm, enforcing a minimum duration of twenty |
---|
0:03:38 | frames for voiced and unvoiced. on top of that we have a heuristic to bridge |
---|
0:03:43 | the non-speech gaps which are shorter than two seconds |
---|
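A minimal sketch of the vad post-processing just described, assuming frame-level speech/non-speech decisions at 100 frames per second; the two-state hmm minimum-duration decoding is approximated here by a simple run-merging pass, so this is a reconstruction, not the authors' code:

```python
# Hypothetical post-processing for a frame-level VAD: merge runs shorter than
# a minimum duration, then bridge short non-speech gaps between speech runs.
from itertools import groupby

MIN_DUR = 20    # minimum run length in frames (0.2 s at 100 frames/s)
MAX_GAP = 200   # bridge non-speech gaps shorter than 2 s

def smooth_vad(labels):
    """labels: per-frame 1 (speech) / 0 (non-speech) decisions."""
    runs = [[lab, len(list(g))] for lab, g in groupby(labels)]
    # enforce the minimum duration by absorbing short runs into the left neighbour
    merged = []
    for lab, n in runs:
        if n < MIN_DUR and merged:
            merged[-1][1] += n
        else:
            merged.append([lab, n])
    # bridge short non-speech gaps that sit between two speech runs
    for i in range(1, len(merged) - 1):
        lab, n = merged[i]
        if lab == 0 and n < MAX_GAP and merged[i-1][0] == merged[i+1][0] == 1:
            merged[i][0] = 1
    return [lab for lab, n in merged for _ in range(n)]
```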
0:03:47 | for the results: |
---|
0:03:49 | on the switchboard test data we have miss and false alarm rates of around |
---|
0:03:53 | two percent |
---|
0:03:55 | but for the voa data, the broadcast |
---|
0:04:00 | data, the error rates are much higher, so we did an aural inspection of the data |
---|
0:04:05 | and |
---|
0:04:07 | we believe it's down to the inaccuracy of the reference labels. so we kept this first |
---|
0:04:10 | vad system and continued to build our language recognition system |
---|
0:04:14 | now |
---|
0:04:15 | we established |
---|
0:04:17 | and refined the training set in the course of the system development. these are the |
---|
0:04:23 | two corpora that we use, v1 and v3 |
---|
0:04:26 | the v1 data is an early version of the training data. we directly |
---|
0:04:30 | take the vad results |
---|
0:04:33 | and then extract |
---|
0:04:34 | the whole segments whose durations lie between twenty and forty-five seconds, and then we |
---|
0:04:39 | train specifically for the thirty-second condition. so in the development |
---|
0:04:46 | we |
---|
0:04:47 | from the very beginning divided the test and training data into three-second, ten-second and thirty-second |
---|
0:04:53 | durations; we are not sure whether this was the right decision or not |
---|
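As an illustration of the v1 set construction just described, a sketch (our assumptions about the data structures, not the authors' tooling) of selecting 20-45 s segments and cutting duration-conditioned chunks:

```python
# Hypothetical slicing of VAD segments (start, end, in seconds) into
# duration-conditioned training examples for the 30 s / 10 s / 3 s conditions.

def select_v1(segments, lo=20.0, hi=45.0):
    """v1-style selection: keep whole segments of 20-45 s for the 30 s condition."""
    return [(s, e) for s, e in segments if lo <= e - s <= hi]

def cut_condition(segments, target):
    """Cut each segment into consecutive chunks of `target` seconds."""
    chunks = []
    for s, e in segments:
        t = s
        while t + target <= e:
            chunks.append((t, t + target))
            t += target
    return chunks

# usage: thirty = select_v1(vad_segments); three = cut_condition(vad_segments, 3.0)
```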
0:04:57 | for the |
---|
0:04:58 | v3 data, we |
---|
0:05:00 | actually ran a different tokenizer all over again on the whole training set of the data |
---|
0:05:05 | and with that we redid the v1 segmentation, so that we have shorter |
---|
0:05:09 | segments for decoding, to speed up the decoding process in |
---|
0:05:14 | the first round |
---|
0:05:15 | then we ran re-segmentation with different silence thresholds |
---|
0:05:19 | and we derived three |
---|
0:05:21 | training sets matching the nominal evaluation durations of thirty seconds, ten seconds and three seconds |
---|
0:05:26 | so these are three distinct sets with a little bit of overlap |
---|
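The re-segmentation step could look like the following sketch, where speech chunks are merged across silences up to a threshold until a target duration is reached; the exact mechanism is not given in the talk, so this is purely illustrative:

```python
def resegment(chunks, target, sil_thresh):
    """chunks: sorted (start, end) speech intervals in seconds.
    Merge chunks across silences <= sil_thresh until `target` s is reached."""
    segments, cur = [], None
    for s, e in chunks:
        if cur is None:
            cur = [s, e]
        elif s - cur[1] <= sil_thresh:
            cur[1] = e                      # bridge the short silence
        else:
            segments.append(tuple(cur))     # long silence: close the segment
            cur = [s, e]
        if cur is not None and cur[1] - cur[0] >= target:
            segments.append(tuple(cur))     # reached the target duration
            cur = None
    if cur is not None:
        segments.append(tuple(cur))
    return segments

# e.g. thirty = resegment(chunks, 30.0, 1.0); three = resegment(chunks, 3.0, 0.3)
```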
0:05:30 | for the data partitions, for each of the sets we have |
---|
0:05:33 | eighty percent of the data for training, ten percent for development, and we're going |
---|
0:05:37 | to report the internal test results in the early parts of the experiments on the remaining ten |
---|
0:05:41 | percent internal test set |
---|
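One plausible way to realise the 80/10/10 partition deterministically (the talk does not specify the mechanism, so this split function is an assumption):

```python
import hashlib

def partition(recording_id):
    """Deterministic 80/10/10 split at the recording level, so that segments
    from one recording never straddle the train/dev/test boundary."""
    h = int(hashlib.md5(recording_id.encode()).hexdigest(), 16) % 10
    return "train" if h < 8 else "dev" if h == 8 else "test"
```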
0:05:45 | so this is the system diagram for our language recognition system. on the left |
---|
0:05:50 | you can see the i-vector system, and there is the phonotactic system. the phonotactic system |
---|
0:05:55 | generates bottleneck features to feed into |
---|
0:05:57 | the dnn system, which is the frame-based language recognition system |
---|
0:06:03 | the i-vector system follows the standard kaldi recipe: for |
---|
0:06:11 | the features, shifted delta cepstra with mean normalization, and also frame-based vad |
---|
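Shifted delta cepstra stack deltas taken at several shifted offsets into one long feature vector; a minimal sketch with the common 7-1-3-7 configuration (the talk does not state the parameters, so N=7, d=1, P=3, k=7 is an assumption):

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t, stack the deltas
    c[t+i*P+d] - c[t+i*P-d] for i = 0..k-1, giving N*k dims per frame.
    cep: (T, >=N) array of cepstral coefficients."""
    T = cep.shape[0]
    # edge-pad so the shifted deltas are defined at the utterance boundaries
    pad = np.pad(cep[:, :N], ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    feats = []
    for i in range(k):
        off = i * P
        feats.append(pad[off + 2 * d : off + 2 * d + T] - pad[off : off + T])
    return np.hstack(feats)   # shape (T, N*k)
```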
0:06:17 | to start with, we trained a two-thousand-and-forty-eight-component ubm and the total variability matrix to |
---|
0:06:22 | extract six-hundred-dimensional i-vectors. we tried two language classifiers, a support vector machine |
---|
0:06:28 | and logistic regression, and the focus of the study here is to |
---|
0:06:33 | compare the use of |
---|
0:06:35 | different datasets in the training of the ubm, the total variability matrix |
---|
0:06:39 | and the language classifier, and also the comparison of global and cluster-dependent classifiers |
---|
0:06:47 | by a global classifier i mean a classifier which |
---|
0:06:51 | classifies all the twenty languages in one go |
---|
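A sketch contrasting the global and cluster-dependent classifiers on i-vectors, with scikit-learn as a stand-in for the actual toolkit (which the talk does not name):

```python
# Hypothetical global vs cluster-dependent language classifiers.
# X: (n, 600) array of i-vectors; y: list of language labels;
# cluster_of: dict mapping each language to its cluster.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_global(X, y):
    """One classifier over all twenty target languages in one go."""
    return LogisticRegression(max_iter=1000).fit(X, y)

def train_cluster_dependent(X, y, cluster_of):
    """One classifier per cluster, trained only on that cluster's languages."""
    models = {}
    for c in {cluster_of[lang] for lang in y}:
        idx = np.array([i for i, lang in enumerate(y) if cluster_of[lang] == c])
        models[c] = LogisticRegression(max_iter=1000).fit(X[idx], np.array(y)[idx])
    return models
```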
0:06:54 | so we have four configurations here, starting with condition a. from a to condition |
---|
0:06:59 | b, we increase the amount of data for the ubm and total variability matrix training; |
---|
0:07:03 | from b to c |
---|
0:07:05 | we replace the svm with a logistic regression classifier; and from c to d we further increase |
---|
0:07:11 | the amount of training data for the logistic regression classifier |
---|
0:07:15 | and the bar chart on the right shows the |
---|
0:07:19 | minimum average cost, the min cavg score, for the different configurations of the |
---|
0:07:24 | i-vector system, and the results are reported on the internal test v1 data |
---|
0:07:28 | which has |
---|
0:07:29 | thirty-second durations |
---|
0:07:32 | when we look at the two red bars |
---|
0:07:36 | here in the middle, then we can see |
---|
0:07:39 | the comparison between using a smaller and a larger amount of training data for the ubm; |
---|
0:07:46 | it gives some improvement there |
---|
0:07:48 | and we also see some difference |
---|
0:07:52 | between having a global classifier and within-cluster classifiers. we did not manage |
---|
0:07:57 | to try all the combinations listed here, just because of the time constraint |
---|
0:08:02 | but for this set of experiments |
---|
0:08:05 | what we conclude is that we tend to use |
---|
0:08:09 | the full set of raw training data, then segmented, for the training of the ubm |
---|
0:08:14 | and the total variability matrix, and also that within-cluster classifiers outperform the global |
---|
0:08:19 | classifiers |
---|
0:08:20 | and then as our training progressed, we moved to the v3 data |
---|
0:08:26 | we have similar conclusions as i just mentioned, and then we tried |
---|
0:08:31 | to use different amounts of training data for the logistic regression classifier, as shown by |
---|
0:08:36 | the three red bars here |
---|
0:08:39 | basically, the left bar here uses a small amount of training data, only one |
---|
0:08:43 | hundred hours |
---|
0:08:44 | then we use three hundred hours of data |
---|
0:08:48 | and for the third one we use the raw set of data, which comprises about |
---|
0:08:52 | eight hundred hours. so this shows |
---|
0:08:55 | a trade-off between using more data and whether the data are well structured |
---|
0:09:00 | and segmented or not, and we ended up using three hundred hours |
---|
0:09:04 | of segmented data to train the logistic regression classifier |
---|
0:09:09 | the two red bars on the far left and right are about the |
---|
0:09:15 | use of the svm versus the use of |
---|
0:09:19 | logistic regression in language recognition. again this shows the |
---|
0:09:25 | improvement |
---|
0:09:26 | from using the logistic regression classifier |
---|
0:09:31 | then that comes to our second system, the phonotactic language recognition system |
---|
0:09:37 | there are two components in the phonotactic system: first a phone tokenizer, and second |
---|
0:09:43 | the language classifier. the phone tokenizer is based on the standard kaldi setup: we have |
---|
0:09:49 | lda, mllt and sat speaker adaptation |
---|
0:09:52 | then on top of that a dnn with six layers, where each layer contains around |
---|
0:09:57 | two thousand neurons |
---|
0:09:59 | we used a phone bigram language model with a very low grammar scale factor of |
---|
0:10:04 | zero point five. we also tried a higher scale factor of two, and |
---|
0:10:07 | it |
---|
0:10:08 | turns out the low factor |
---|
0:10:09 | gives better results on our internal test sets |
---|
0:10:12 | optionally, we tried to run sequence training on the switchboard training data, but bear |
---|
0:10:19 | in mind this is english training data, so we were not sure whether |
---|
0:10:22 | discriminative training would give an over-trained neural network; we'll see in the results |
---|
0:10:28 | for the language classifier, we designed svm classifiers |
---|
0:10:31 | which are trained on the tf-idf statistics of the phone n-grams, which we tried from |
---|
0:10:38 | bigrams and from trigrams. the reason we back off to bigrams is that we trained |
---|
0:10:43 | on |
---|
0:10:45 | position-dependent phones, and we ended up with |
---|
0:10:48 | roughly five million dimensions of trigram statistics; we |
---|
0:10:51 | worry that there may be sparsity issues |
---|
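A minimal sketch of the phonotactic classifier just described: tf-idf statistics over phone n-grams fed to a linear svm (scikit-learn is used here as a stand-in; the authors' actual toolkit and settings are not given):

```python
# Hypothetical phonotactic classifier: each utterance is represented by the
# tf-idf statistics of its phone bigrams, classified with a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# phone_strings: decoded phone sequences, e.g. "sil ay m s eh d ..."
# langs: the target language label of each utterance
def train_phonotactic(phone_strings, langs):
    model = make_pipeline(
        TfidfVectorizer(analyzer="word", ngram_range=(2, 2)),  # phone bigrams
        LinearSVC(),
    )
    return model.fit(phone_strings, langs)
```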
0:10:55 | so this is the performance on the internal test sets |
---|
0:10:59 | with the different setups |
---|
0:11:00 | as we see, the trigram setup gives better performance in terms of a lower min |
---|
0:11:06 | cavg score. this is valid for the thirty-second data, but you may see |
---|
0:11:10 | in a while that it may break when it comes to very short duration segments |
---|
0:11:17 | the purple bars are the results with the discriminatively trained dnn phone tokenizers. again |
---|
0:11:23 | this shows that we have an over-trained dnn here, and it |
---|
0:11:27 | gives a higher word error rate |
---|
0:11:29 | sorry, a higher min cavg score, i mean |
---|
0:11:36 | the third system is the frame-based dnn system for language recognition |
---|
0:11:42 | we take the sixty-four-dimensional bottleneck features from the switchboard tokenizer |
---|
0:11:47 | and there is feature splicing with four frames on the left and four frames |
---|
0:11:51 | on the right |
---|
0:11:52 | the dnn is a four-layer dnn with seven hundred neurons |
---|
0:11:58 | we have a prior normalization, where |
---|
0:12:02 | we multiply the posterior probability with the inverse of the language prior, and the decision |
---|
0:12:07 | of the language recognition system comes by averaging the frame-based language posterior probabilities |
---|
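A sketch of the prior normalization and frame averaging just described; whether the averaging is done in the linear or log domain is not stated, so linear-domain averaging here is an assumption:

```python
import numpy as np

def utterance_scores(frame_post, prior):
    """frame_post: (T, L) per-frame language posteriors from the DNN.
    prior: (L,) language prior implied by the training data.
    Divide out the prior, renormalize per frame, then average over frames."""
    p = frame_post / prior                 # multiply by the inverse prior
    p /= p.sum(axis=1, keepdims=True)      # renormalize each frame
    return p.mean(axis=0)                  # utterance-level score per language

def decide(frame_post, prior):
    return int(np.argmax(utterance_scores(frame_post, prior)))
```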
0:12:17 | so this is |
---|
0:12:18 | a summary of the frame-based language recognition system on the different test sets |
---|
0:12:26 | there are two trends we observe. the first, very obviously, is that when the duration is shorter, the |
---|
0:12:33 | min cavg score is higher. the second is that generally |
---|
0:12:38 | the error here is higher than for the phonotactic system and the i-vector system, but |
---|
0:12:45 | it becomes more robust when it comes to the very short duration |
---|
0:12:49 | condition |
---|
0:12:51 | so after the evaluation we have an enhanced system, which we call the bottleneck |
---|
0:12:55 | i-vector system; it is also a basic system |
---|
0:12:59 | we take the |
---|
0:13:00 | bottleneck features from the switchboard tokenizer, replace the mfcc in the i-vector system with the |
---|
0:13:06 | bottleneck features, and build another system for language recognition |
---|
0:13:11 | a bit of the details: |
---|
0:13:13 | we take the sixty-four-dimensional bottleneck features |
---|
0:13:16 | there is no vtln and no normalization or shifted delta cepstra, but there is frame- |
---|
0:13:22 | based vad here |
---|
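For illustration, bottleneck features are simply the activations of a narrow hidden layer of the phone tokenizer network; a sketch with pytorch as a stand-in (the authors' setup is kaldi-based, and the layer widths here are assumptions apart from the 64-dim bottleneck):

```python
import torch.nn as nn

# Hypothetical tokenizer DNN with a narrow (64-unit) bottleneck layer.
class BottleneckDNN(nn.Module):
    def __init__(self, in_dim, n_phones, width=2000, bn_dim=64):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, width), nn.Sigmoid(),
            nn.Linear(width, width), nn.Sigmoid(),
            nn.Linear(width, bn_dim),            # the bottleneck layer
        )
        self.back = nn.Sequential(nn.Sigmoid(), nn.Linear(bn_dim, n_phones))

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck(self, x):
        # the 64-dim activations reused as features for the i-vector system
        return self.front(x)
```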
0:13:25 | so this is a side-by-side comparison between the i-vector system and the bottleneck |
---|
0:13:30 | system, where the mfcc features are replaced by the bottleneck features |
---|
0:13:33 | we can see roughly a relative improvement of fifteen to twenty-five percent from replacing |
---|
0:13:40 | the mfcc with the bottleneck features |
---|
0:13:45 | for system calibration and fusion, we train target-language-dependent gaussian backends |
---|
0:13:53 | and the gaussian |
---|
0:13:54 | has, for each language, sixteen components; these are trained on the training data |
---|
0:13:59 | of the thirty-second condition |
---|
0:14:02 | then for the system fusion we run logistic regression |
---|
0:14:06 | that comprises the log-likelihood ratio conversion and the system combination, |
---|
0:14:12 | that is, the calibration |
---|
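A sketch of the calibration and fusion backend: per-language gaussian mixtures over the score vectors, followed by multiclass logistic regression over the concatenated calibrated scores of all systems (our reconstruction with scikit-learn; dedicated calibration toolkits are typically used for this):

```python
# Hypothetical calibration/fusion: per-language GMM backends turn raw system
# score vectors into log-likelihoods; multiclass logistic regression then
# fuses the calibrated scores of several systems.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def train_backend(scores, labels, langs, n_comp=16):
    """scores: (n, d) array; labels: (n,) array. One GMM per target language."""
    return {l: GaussianMixture(n_comp).fit(scores[labels == l]) for l in langs}

def backend_llks(backend, scores):
    """(n, L) matrix of per-language log-likelihoods."""
    return np.column_stack([backend[l].score_samples(scores)
                            for l in sorted(backend)])

def train_fusion(per_system_llks, labels):
    """Concatenate the calibrated scores of all systems and fuse."""
    X = np.hstack(per_system_llks)
    return LogisticRegression(max_iter=1000).fit(X, labels)
```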
0:14:15 | so we applied that separately to the three systems: the i-vector system, the dnn system |
---|
0:14:20 | and the phonotactic system. we found that |
---|
0:14:22 | the |
---|
0:14:24 | gaussian backend did not work for the i-vector system, so we do not use |
---|
0:14:27 | that in the |
---|
0:14:30 | final evaluation |
---|
0:14:31 | and then for the dnn and the phonotactic systems, the technique gives a |
---|
0:14:34 | significant improvement |
---|
0:14:38 | and this is the fusion result on our internal test set. so |
---|
0:14:42 | for the thirty-second data |
---|
0:14:44 | the i-vector system gives |
---|
0:14:46 | the best results among the three |
---|
0:14:49 | submission systems |
---|
0:14:50 | and |
---|
0:14:51 | the dnn and the phonotactic systems have roughly the same performance |
---|
0:14:56 | system fusion gives some performance improvement, actually a noticeable improvement on the internal test sets |
---|
0:15:03 | we have |
---|
0:15:04 | and the bottleneck system alone did not give better results, but when we incorporate the |
---|
0:15:08 | four systems, then those are the best results we have |
---|
0:15:12 | when it comes down to three seconds, as i've said, the phonotactic system |
---|
0:15:20 | behaves much worse here |
---|
0:15:22 | that may be because of the sparsity issues of the particular setup of our |
---|
0:15:27 | n-gram statistics |
---|
0:15:28 | and |
---|
0:15:30 | when |
---|
0:15:31 | we compare the i-vector system and the bottleneck system, then we see a significant improvement for |
---|
0:15:36 | the bottleneck system, and a further improvement with the fusion |
---|
0:15:41 | then here we show the results on the formal evaluation |
---|
0:15:46 | dataset |
---|
0:15:48 | the i-vector system |
---|
0:15:50 | the phonotactic system and the dnn system perform |
---|
0:15:54 | roughly as expected |
---|
0:15:55 | and the bottleneck system again has |
---|
0:15:59 | more than ten percent relative improvement on top of the i-vector system |
---|
0:16:03 | and the system fusion |
---|
0:16:05 | gives a marginal improvement |
---|
0:16:07 | on top of the best system here |
---|
0:16:10 | then finally i'm going to show a table about the pairwise system contributions |
---|
0:16:16 | to see each system's contribution to the combined system in our language recognition setup |
---|
0:16:22 | so now you see clusters of bars here. for each cluster, on the very left |
---|
0:16:27 | bar we have a single system |
---|
0:16:29 | and then with this single system, for example here the bottleneck i-vector system |
---|
0:16:34 | we make a fusion of this system with one other system, and the |
---|
0:16:39 | order is that we take the worst system to fuse with |
---|
0:16:43 | and then we take the second worst, and so on |
---|
0:16:46 | so the interesting thing here is that, generally, apart from fusion with |
---|
0:16:51 | the dnn system, which is the worst system |
---|
0:16:53 | pairwise fusion works in every case |
---|
0:16:58 | maybe you can argue we may be in a different operating region of the |
---|
0:17:02 | error curve, and that |
---|
0:17:04 | may be why it ceases to work there |
---|
0:17:07 | and then another interesting thing is that the |
---|
0:17:11 | performance of the fused system is basically in proportion to the performance of the single systems |
---|
0:17:16 | which means that when we fuse with a better system, then we get better |
---|
0:17:19 | results here |
---|
0:17:22 | so as a summary, we introduced the three language recognition component systems submitted to |
---|
0:17:28 | the nist lre two thousand fifteen, and described the segmentation, data selection |
---|
0:17:35 | and classifier training. we then have an enhanced bottleneck i-vector system |
---|
0:17:40 | which demonstrates a performance improvement. for the future work, we want to work a bit |
---|
0:17:46 | on the data selection and augmentation, as other teams did |
---|
0:17:50 | and also we are interested in multilingual neural networks and the adaptation of them, |
---|
0:17:55 | maybe some unsupervised training on them as well, to improve the bottleneck features; also |
---|
0:18:01 | some variability compensation to deal with the huge mismatch between the training and development datasets and |
---|
0:18:07 | the evaluation dataset |
---|
0:18:08 | any suggestions or maybe collaborations are all welcome. thank you very much for your attention |
---|
0:18:20 | do we have any questions? |
---|
0:18:34 | thanks for the talk. when you're talking about the language clusters |
---|
0:18:41 | are the clusters defined according to some linguists? yes? |
---|
0:18:48 | in our own small experiments |
---|
0:18:52 | the linguistic clusters were based on the |
---|
0:18:57 | definitions of linguists rather than on the structure of the data |
---|
0:19:03 | but what i am asking is |
---|
0:19:06 | if you derive the clusters from the features |
---|
0:19:10 | would the gain be bigger |
---|
0:19:13 | when compared to the results with clusters that are made by linguists? |
---|
0:19:19 | did you try clustering the languages from the data? |
---|
0:19:24 | yes, i think that's a scientific question, an interesting question. we follow the language clusters basically |
---|
0:19:30 | by a narrow definition, exactly following what the nist language recognition evaluation told |
---|
0:19:36 | us to use. and you're absolutely right, there are some cases where the training |
---|
0:19:41 | would |
---|
0:19:43 | just become a distinction between |
---|
0:19:47 | even dialects, or other unwanted factors which are not directly related to language clusters |
---|
0:19:54 | at all. so yes, definitely this is something we want to look at, particularly |
---|
0:19:57 | for some dialects we're interested in, for example the chinese dialects; we're interested and we want |
---|
0:20:02 | to do more there |
---|
0:20:06 | any other questions? |
---|
0:20:11 | i have one quick question. so in an lre, most teams, when doing a score, |
---|
0:20:17 | most would typically go with sixty percent for training; maybe going to seventy percent uses a |
---|
0:20:22 | little bit more, or you want to go to eighty percent. so my question is, once you did |
---|
0:20:28 | your development, when you actually submitted |
---|
0:20:31 | the final results, did you do a full retrain with all the data, or did |
---|
0:20:35 | you just stick with the original eighty-percent-trained system that you had? |
---|
0:20:38 | we trained with the original system with eighty percent, and we now doubt whether this |
---|
0:20:43 | should be the case. and then we also handicapped ourselves a little bit because |
---|
0:20:49 | even in the very early stage we |
---|
0:20:51 | divided the data into three-second, ten-second and thirty-second conditions, and that again |
---|
0:20:56 | reduced the amount of training data; that's a decision we now question, as we |
---|
0:21:01 | tried to use eighty percent and seventy percent |
---|
0:21:05 | if there are more suggestions on |
---|
0:21:07 | how to use the data, i think we can work a bit on the data segmentation and |
---|
0:21:13 | selection part |
---|
0:21:16 | are there any other questions? |
---|
0:21:20 | if not, let's thank the speaker again |
---|