0:00:23 | okay, thank you very much. So the BUT two seventy-six system was a collaborative work between the Brno University of Technology and MIT Lincoln Laboratory. |
---|
0:00:36 | so let's start. As already introduced, we had twenty-four languages to deal with, and we had a new metric. The list of the languages somewhat skewed the task, and so we have the new metric and the two seventy-six language pairs; that's where the name comes from. We had to select the twenty-four worst pairs in terms of minDCF and then compute the average actual DCF. |
---|
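The metric described above can be sketched as follows; this is a minimal illustration, and the `lre11_metric` helper name and all values are hypothetical:

```python
# Sketch of the pair-wise metric described above: from the 276
# language-pair DCFs (24 choose 2), pick the 24 worst pairs by minDCF
# and report the average actual DCF over those pairs.
from itertools import combinations

def lre11_metric(min_dcf, act_dcf, n_worst=24):
    """min_dcf / act_dcf: dicts mapping (lang_a, lang_b) -> DCF."""
    worst = sorted(min_dcf, key=min_dcf.get, reverse=True)[:n_worst]
    return sum(act_dcf[p] for p in worst) / len(worst)

langs = [f"L{i}" for i in range(24)]   # placeholder language names
pairs = list(combinations(langs, 2))   # 276 pairs for 24 languages
assert len(pairs) == 276
```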
0:01:05 | in order to be able to deal with those languages we had to collect our data. So this is basically the list of data that we used from past evaluations: CallFriend, Fisher, Mixer data from the SRE evaluations, previous LRE evaluations, OGI data of foreign-accented English, some speech data for the previously used European languages, Switchboard, and some broadcast data from the Voice of America and Radio Free Europe, and then some Iraqi Arabic conversational speech and Arabic broadcast speech. |
---|
0:01:48 | as Doug showed, there were some languages for which we didn't have enough data, so what we did is that we added additional radio data from public sources: Radio Free Europe, Radio Free Asia, some Czech broadcasts, and Voice of America. This is the list of languages we covered: Czech, Farsi, Lao and Panjabi, and again Iraqi Arabic and Maghrebi Arabic, Mandarin, and I guess a couple of others. |
---|
0:02:23 | so what we did is that we ran phone-call detection, so we detected the parts of the broadcasts where telephone conversations were present. And for each language we ran automatic speaker labeling, because we didn't want the speakers to overlap between the train and test sets. |
---|
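The speaker-disjoint split described above can be sketched like this; the `split_by_speaker` helper and its deterministic selection are illustrative assumptions, not the actual tooling used:

```python
# Minimal sketch of a speaker-disjoint train/test split: segments carry
# an (automatically assigned) speaker label, and whole speakers are
# routed to one side so no speaker appears in both sets.
def split_by_speaker(segments, test_fraction=0.3):
    """segments: list of (segment_id, speaker_id). Returns (train, test)."""
    speakers = sorted({spk for _, spk in segments})
    n_test = max(1, int(len(speakers) * test_fraction))
    test_spk = set(speakers[:n_test])  # deterministic sketch; shuffle in practice
    train = [s for s in segments if s[1] not in test_spk]
    test = [s for s in segments if s[1] in test_spk]
    return train, test
```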
0:02:52 | this is the scheme for our development set. We used the LRE eleven development data and made two sets. The first one was the dev set, which was based on the NIST thirty-second cut definition; we again ran automatic speaker labeling and split it into non-overlapping training and test parts. Then we took the entire conversations and cut thirty-second excerpts out of them; the result was presented as thirty-second segments, and no splits from one conversation side could end up in both the train and test set. Again there was some automatic speaker labeling. Of course we had more data this way, but it was less reliable; that was used in our contrastive system. |
---|
0:03:47 | doing this, having a proper partition of the development data, helped a little bit. |
---|
0:03:59 | so, to give a little bit of statistics on our dataset: we had the train set, which was sixty-six thousand segments, and it was based on all kinds of sources. And we had the test set, which was thirty-eight thousand segments and was based basically on previous LRE evaluations. And then the dev and test sets, which were sourced from the LRE eleven development data. |
---|
0:04:35 | so, a little overview of our systems: we had a submission of three systems, one primary and two contrastive. The primary system consisted of one acoustic subsystem, which was based on i-vectors (you will see the descriptions later), and then three phonotactic subsystems, so that we would have diverse systems. We had a binary decision tree system based on the English tokenizer, then we had a PCA-reduction system based on the Hungarian tokenizer, and then we had a multinomial subspace i-vector system based on the Russian tokenizer. |
---|
0:05:17 | the first contrastive system was the same as the primary; what we did is that we excluded the dev2 data, that means the entire conversations. We'll see the results later. And then the second contrastive system was just a fusion of the two best systems, the acoustic and the English phonotactic. The problem with this fusion was that on the development data it gave very good results, but on the evaluation it was kind of a problem, as we'll see. |
---|
0:05:48 | so here is a little diagram of our system. At the very left we have the front ends: the acoustic i-vector extractor, the phonotactic i-vector extractor, and the PCA, which basically convert the input of some form into fixed-length vectors, either i-vectors for the acoustic i-vector extractor, or what we also call i-vectors for the phonotactic i-vector extractor and the PCA. After that we do the scoring. And then we have the binary decision tree model, which is based basically on a log-likelihood evaluation of the n-gram counts themselves, so there we already get the scores, which can then go to pre-calibration. Both the scoring and the pre-calibration are based on logistic regression, and the fusion is also based on logistic regression, out of which we get twenty-four scores, the log-likelihoods, and then we compute a pair-wise log-likelihood ratio for each of the pairs. |
---|
0:06:58 | this is just to show how we used the data described in the previous section: the train database was used for the front-end training and for the scoring classifier training, and the dev and test databases were used for the backend, the pre-calibration and the fusion. |
---|
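The final pair-wise step described above can be sketched as follows; the `pairwise_llrs` helper name is an assumption for illustration:

```python
# Sketch of the pair-wise scoring step: with one calibrated
# log-likelihood per language, the detection score for a language pair
# is simply the difference of the two log-likelihoods.
from itertools import combinations

def pairwise_llrs(loglik):
    """loglik: dict mapping language -> log-likelihood for one segment."""
    return {(a, b): loglik[a] - loglik[b]
            for a, b in combinations(sorted(loglik), 2)}
```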
0:07:24 | so, for the acoustic system we used a Hungarian phoneme-recognizer-based VAD, basically to remove the silence. Then we used VTLN, feature dithering, and cepstral mean and variance normalization with RASTA processing, basically quite similar to what was presented previously. The modeling was based on a full-covariance UBM with two thousand forty-eight components, and the i-vector size was six hundred. |
---|
0:08:00 | for the phonotactic systems, we used a diversity of techniques for feature extraction and tokenization. The PCA feature extraction was based on trigram counts from the Hungarian tokenizer. What we do is that we take the square roots of the counts and run PPCA on top of that, reducing the dimensionality to six hundred, and we basically use the result in the same way as the acoustic i-vector. |
---|
0:08:35 | then we had the multinomial subspace modeling of the trigram counts, which was based on the Russian tokenizer. This is something slightly newer: the idea is basically modeling the n-gram counts in a subspace of the simplex. The output of such an approach is also an i-vector-like feature, which we then again process the same way as the i-vectors. |
---|
0:09:10 | and then we had the binary decision tree system, which is basically a novel technique where decision trees are used to cluster the n-gram counts, and a plain likelihood evaluation is used to get the score. |
---|
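The first step of the PCA phonotactic front end above can be sketched like this; only the square-rooted trigram counts are shown (the PPCA projection is omitted), and the `sqrt_trigram_counts` helper and phone string are made up:

```python
# Sketch of the phonotactic feature extraction: collect trigram counts
# over a phone-token sequence and take their square roots. In the real
# system a PPCA projection to a low-dimensional vector follows.
import math
from collections import Counter

def sqrt_trigram_counts(tokens):
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {tg: math.sqrt(c) for tg, c in trigrams.items()}
```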
0:09:25 | so, the scoring for the acoustic i-vector and the two phonotactic i-vector systems was the following: the input was the i-vector, six hundred dimensional, or one thousand dimensional in the case of the PCA. We perform length normalization, then we perform within-class covariance normalization, and after that, as the classifier, we use regularized multiclass logistic regression with a cross-entropy objective function. The regularizer was the L2 regularizer, but the penalty was chosen without cross-validation. It was trained on the train database, and the output was twenty-four scores. |
---|
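A toy sketch of that scoring backend follows; the helper names and weights are hypothetical placeholders, and in the real system the weights would come from the trained L2-regularized multiclass logistic regression:

```python
# Sketch of the scoring backend: length-normalize an i-vector, then
# score it with a multiclass linear classifier (softmax over per-class
# linear scores, i.e. multinomial logistic regression at test time).
import math

def length_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_log_posteriors(v, weights, biases):
    """One (weight vector, bias) per language; softmax over linear scores."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, v)) + b
              for w, b in zip(weights, biases)]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [s - log_z for s in scores]
```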
0:10:13 | what we did with each set of twenty-four scores is pre-calibration of each system. That was a full affine transform, and we used regularized logistic regression, which was trained on the test and dev databases. |
---|
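The full affine transform mentioned above can be sketched as below; the `affine_calibrate` helper and the tiny 2x2 matrix are illustrative stand-ins for the trained 24x24 transform:

```python
# Sketch of per-system pre-calibration: a full affine transform of the
# score vector (matrix times scores, plus an offset vector).
def affine_calibrate(scores, matrix, offset):
    return [sum(m_ij * s_j for m_ij, s_j in zip(row, scores)) + b
            for row, b in zip(matrix, offset)]
```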
0:10:37 | and in the end we fused the four systems with a constrained affine transform: instead of assigning each of the twenty-four scores of each of the systems an individual scale constant, we had one scale constant per system, and we had a vector of offsets. This logistic regression was also regularized, and it was trained on the test and dev databases. |
---|
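The constrained fusion above can be sketched like this; the `fuse` helper, the alphas, and the offsets are made-up illustrations, whereas in the real system they are trained by regularized logistic regression:

```python
# Sketch of constrained fusion: each subsystem contributes its score
# vector scaled by a single per-system constant, plus one shared
# offset vector.
def fuse(system_scores, alphas, offset):
    """system_scores: list of equal-length score lists, one per subsystem."""
    fused = list(offset)
    for scores, alpha in zip(system_scores, alphas):
        for i in range(len(fused)):
            fused[i] += alpha * scores[i]
    return fused
```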
0:11:13 | as I said, the decisions were done using the log-likelihood ratios that came out of the fusion: the decisions for the two seventy-six pairs were converted from the twenty-four scores as log-likelihood ratios among all those pairs, so each pair score is just the subtraction of the two scores, and the decisions were all made with a threshold of zero. Just a little comment: these decisions are invariant to a common scaling of the log-likelihoods and to translations shared by both languages of a pair, so it is not really full calibration that is needed here. |
---|
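That invariance can be checked numerically; the helper names below are illustrative:

```python
# Quick check of the invariance remarked on above: thresholding the
# pair-wise difference of log-likelihoods at zero is unaffected by a
# common positive scaling or a common offset applied to all
# log-likelihoods.
def pair_decision(ll_a, ll_b):
    return ll_a - ll_b > 0.0

def transformed(ll, scale, offset):
    return scale * ll + offset
```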
0:12:07 | so, for the analysis: the setup we use is the one used when designing the system, that is, we fix the twenty-four worst pairs on our thirty-second evaluations, and we compare three different numbers: the actual DCF, the minimum DCF, and a DCF which was based on the recipe mentioned on Monday, based on the log-likelihood pre-calibration. |
---|
0:12:45 | we will compare the development and the evaluation sets, and we will present a comparison of eight systems: the individual systems, that is the Hungarian phonotactic, the Russian phonotactic, the English phonotactic, and the acoustic i-vectors, and then four fusions: the primary, the first contrastive, the second contrastive, and a three-system fusion which excluded the English phonotactic system, which somehow misbehaved, as we will see. |
---|
0:13:24 | so this is the result on three seconds, with the pairs fixed on thirty seconds. As we see, this is the misbehavior of the system: the left bars are the development set and the right bars are the evaluation set, and we see the trend going like this on the development set, but the English phonotactic system behaves very differently on the evaluation. |
---|
0:14:04 | this is for the ten seconds. The contrastive one system is the system where we excluded the dev2 data, which comprise the entire segments, and we see that there is a slight hit compared to the primary system. These two systems, to remind you, are the very same, except that in the calibration and scoring there are some data left out. |
---|
0:14:39 | and again, the English system misbehaved here. We see that the difference between the blue bar and the red one, which is the difference between the minimum and the actual DCF, shows the miscalibration; it was not a tragedy, but we didn't do the calibration very well. It is simpler on the thirty-second condition, where we see that the miscalibration is much more reasonable, especially for the fusions. |
---|
0:15:21 | here, on the contrastive one versus the primary, we see that excluding the data really hurts; that's one thing. Then, on the fusions with systems excluded: the three-system fusion is equivalent to the primary with the English system excluded. We see that on the development set it didn't do much; in fact the fusion is slightly worse when excluding the English system, because the English performed very well on the development set, but it hurt really badly on the evaluation. |
---|
0:16:07 | so, again, just to summarise the observations: there was a big deterioration in the minDCF between the development and the evaluation data. Then, there were no calibration disasters, but on the thirty seconds, as I pointed out, we could have done better. The binary tree system was kind of broken, and what we found out later is that if we apply similar dimensionality-reduction plus scoring techniques even to the English tokens, the system was good again, so it was due to the plain likelihood evaluation. And the acoustic outperforms the phonotactic almost everywhere; there were a couple of language pairs where the phonotactic was better. |
---|
0:17:03 | we also did an analysis of our system versus the MIT system, since the MIT one was the best. There was a weak correlation between sites and difficulty of pairs. The minDCFs were very similar; the minDCF for the worst twenty-four pairs was slightly worse for us than for MIT, and on the actual DCF we had a big calibration hit. |
---|
0:17:34 | there's an interesting plot here which compares some of the selected Arabic dialects versus Slavic languages, where, somehow, we knew that MIT got more data for the Arabic dialects. So we see that we do very poorly on some of the pairs, Iraqi Arabic versus Pashto et cetera, mostly due to the lack of data, while on the Slavic languages we do better on some selected pairs. |
---|
0:18:10 | so this is just to show that the amount of data really matters. This is a correlation plot of the per-pair results, with BUT on one axis and MIT on the other axis. If we did the same on every pair, all the points would be aligned on the diagonal, but we see that on some of the pairs we did very differently. |
---|
0:18:47 | and this is just to show the worst pairs' minDCF versus actual DCF for the MIT and the BUT systems. These are averaged points, so we see that on average MIT did better; MIT's points lie more on a single line, while our system's points are more scattered, so this again shows the calibration hit. |
---|
0:19:24 | so, to conclude: we built several systems, but we only selected four for the primary fusion. Again, the acoustic outperforms the phonotactic. For the phonotactic we tried different backends, and we saw that the dimensionality reduction really helps. We took a big hit on the English phonotactic system; at the time we did not know why, and we have since revised the scoring. And probably we could use special detectors for selected pairs. That is it, thank you. |
---|
0:20:25 | [inaudible audience question] |
---|
0:20:40 | well, the deltas and the shifted deltas? Yeah, we used that. So we used the six MFCCs plus C zero, with shifted deltas again. |
---|
0:21:12 | [inaudible audience question] |
---|
0:21:14 | for the regularization: so we used the regularization in our scoring and in our pre-calibration, a little L two regularization. |
---|