0:00:06 | Good morning. What I would like to present here today is our Language Recognition Evaluation 2009 submission. We did a lot of work after the evaluation to figure out what happened and to dig into our results, because we actually saw a big difference between the performance on our development data and that on the actual evaluation. |
---|
0:00:36 | So first I will try to explain what is new in language recognition, what happened in the year 2009 with the new data. Then I will go through a very quick and brief description of our whole system, and then I will try to concentrate on the issues of calibration and data selection and on how we resolved the problems with our original development set. Then I will try to conclude our work. |
---|
0:01:12 | So, what was new in 2009: a new source of data came into language recognition. These data are broadcasts from the Voice of America. A big archive covering many languages was found, and out of this archive only the detected telephone calls were used. This data brought a big variability on top of the original CTS data we had always used for the training of our language ID systems, so it brought some new problems with calibration and channel compensation. |
---|
0:01:57 | These are the languages which are present (I would have to check whether they are all still present) in the Voice of America archive. As you can see, the number of languages here is very large, and it provides a very nice dataset on which to test our systems and our ability to improve language recognition systems to classify more languages. |
---|
0:02:31 | For the 2009 NIST LRE, these are the twenty-three target languages, and the bold ones are the languages for which we had only data coming from the Voice of America archive, so there was no CTS data for training for these languages. |
---|
0:02:56 | For the other languages we also had normal conversational telephone speech data recorded by the LDC, both in previous times and also for the 2009 evaluation. So we had to deal with this issue and do proper calibration and channel compensation. |
---|
0:03:22 | What motivated us after the evaluation to do this work, to work again on our development set and to run a lot of experiments, was that we saw a huge difference between the performance on our original development set and on the eval set which was collected by NIST. |
---|
0:03:47 | All of the numbers you will see here will be the average detection cost defined by NIST. |
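For reference, this average detection cost treats each target language as a separate detector and averages the miss and false-alarm costs over languages. A minimal closed-set sketch, assuming unit costs, a 0.5 target prior, and hard decisions at the Bayes threshold of zero (the toy scores and labels are hypothetical):

```python
import numpy as np

def c_avg(scores, labels, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Closed-set NIST average detection cost, C_avg.

    scores[i, j]: detection log-likelihood ratio of trial i for language j
    labels[i]:    index of the true language of trial i
    Hard decisions are taken at the Bayes threshold of 0 (unit costs,
    p_target = 0.5)."""
    n_lang = scores.shape[1]
    decisions = scores > 0.0
    cost = 0.0
    for t in range(n_lang):  # each target language in turn
        p_miss = np.mean(~decisions[labels == t, t])
        p_fa = np.mean([np.mean(decisions[labels == nt, t])
                        for nt in range(n_lang) if nt != t])
        cost += c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return cost / n_lang

# hypothetical toy example: 3 languages, 6 trials
scores = np.array([[ 2.0, -1.0, -3.0], [ 1.5, -2.0, -1.0],
                   [-1.0,  0.5, -2.0], [-2.0,  1.0, -1.5],
                   [-3.0, -1.0,  2.0], [-1.0, -2.0,  0.7]])
labels = np.array([0, 0, 1, 1, 2, 2])
print(c_avg(scores, labels))  # 0.0 for this perfectly separated toy set
```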
---|
0:03:55 | At the language recognition workshop there were a lot of discussions about the crafting of development sets for LRE systems. Some people created rather small and very clean development sets; we had actually a very, very huge development set containing a lot of data, which brought some computational issues for training the systems, but we decided to go with this development set, the big one. |
---|
0:04:33 | And in the end it did not turn out to be maybe the best decision, but we had to deal with that. |
---|
0:04:42 | So, here is a presentation of our system, what we had in the submission. |
---|
0:04:51 | We had two types of front ends. The first are acoustic front ends, which are based on GMM modelling; the features are MFCC-derived, actually the popular shifted delta cepstral features. For the systems, we had there a JFA system; we tried a new feature extraction based on the RDLT; and then we had a GMM trained with the maximum mutual information criterion using the channel-compensated features. Also, we tried a normal GMM with RDLT features without any channel compensation. We performed vocal tract length normalisation, cepstral mean and variance normalisation, and we were doing the voice activity detection using our Hungarian phoneme recogniser, where we mapped all of the phonemes to speech and non-speech classes to decide. |
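As an aside, shifted delta cepstra stack delayed delta blocks over a long temporal context; a minimal sketch with the common N-d-P-k = 7-1-3-7 configuration (the talk does not spell out the exact parameters, so these are the usual defaults):

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted delta cepstra: for each frame t, stack k delta blocks
    c[t + i*p + d] - c[t + i*p - d] for i = 0..k-1 (indices clipped at
    the utterance edges), giving (T, N*k) features from (T, N) MFCCs.
    The common LRE configuration is N-d-P-k = 7-1-3-7."""
    T, N = cepstra.shape
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            hi = min(T - 1, max(0, t + i * p + d))
            lo = min(T - 1, max(0, t + i * p - d))
            out[t, i * N:(i + 1) * N] = cepstra[hi] - cepstra[lo]
    return out

# hypothetical usage on random 7-dimensional MFCCs:
feats = sdc(np.random.randn(200, 7))  # -> (200, 49) SDC features
```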
---|
0:05:55 | Then, this is a standard JFA system; as you can see, this time of course without the eigenvoices, there is only channel variability present. So we have a supervector of GMM means for every speech segment, which is then channel dependent. The channel loading matrix was trained using the EM algorithm, and five hundred sessions per language were used to train it. The language-dependent supervectors were MAP adapted using relevance MAP, and these were also trained using the five hundred segments per language. |
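A minimal sketch of what such channel compensation can look like, assuming the usual channel-only JFA equations with diagonal UBM covariances; the notation and shapes below are ours, not details given in the talk:

```python
import numpy as np

def channel_compensate(F, counts, m, U, sigma_inv):
    """Channel compensation in a channel-only JFA, minimal sketch.

    The segment supervector is modelled as s = m + U x, x ~ N(0, I).
    Given Baum-Welch statistics of one segment (F: first-order stats,
    counts: per-Gaussian occupations), take the MAP point estimate of
    the channel factor x and subtract the channel offset from the stats.
    Shapes (C Gaussians, D dims, R channel factors): F (C*D,),
    counts (C,), m (C*D,), U (C*D, R), sigma_inv (C*D,) diagonal UBM
    precisions."""
    D = F.size // counts.size
    n = np.repeat(counts, D)                 # occupation per dimension
    Fc = F - n * m                           # center on the language mean
    # posterior precision of x: L = I + U^T Sigma^-1 diag(n) U
    L = np.eye(U.shape[1]) + U.T @ (sigma_inv[:, None] * n[:, None] * U)
    x_hat = np.linalg.solve(L, U.T @ (sigma_inv * Fc))
    return Fc - n * (U @ x_hat)              # compensated first-order stats
```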
---|
0:06:49 | Actually, this is the core acoustic system here. It also uses our RDLT features but, as you will see later on, we decided to drop the RDLT features and use just the JFA scheme with the shifted delta cepstra. |
---|
0:07:11 | Here we tried a new discriminative technique to derive our features. This technique is based on region dependent linear transforms (RDLT); it was introduced in speech recognition, where it is known as fMPE. The idea is that we have some linear transformations which take our features, and we then take a linear combination of the transformations to form a new feature, which should be discriminatively trained. I have a picture here, and I will try to at least very briefly describe what is going on. At the start we have some linear transformations; in the beginning they are initialised to create just the shifted delta cepstral features. We have a GMM which is trained over all languages and which is supposed to select the transformations in every step; it actually provides the weights with which we combine these transformations. So for every twenty-one frames, we take the twenty-one frames of MFCCs, put them into the GMM, and then we take the dominant Gaussian components, which provide us the weights, and according to these weights we combine the linear transformations. Usually it happened that only one to three Gaussian components were non-zero for these twenty-one frames, so not all of the transformations were linearly combined; all the other weights were set to zero. Then we take the linearly combined transformations and sum them up, and then there is a GMM which evaluates these features, and according to the training criterion we update the linear transforms; then we move to the next twenty-one frames and train the system further. In the end, after the training, these will be the features we feed into our JFA. |
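A minimal sketch of the RDLT inference pass described above, in our own notation (the discriminative update of the transforms is omitted; only the 21-frame context and the one-to-three dominant Gaussians are from the talk):

```python
import numpy as np

def rdlt(frames, w, mu, var, A, top=3):
    """Region dependent linear transforms, inference pass only.
    For each frame, 21 frames of context are stacked; a language-independent
    diagonal-covariance GMM (w, mu, var) evaluated on the center frame gives
    posteriors that pick which per-Gaussian transforms A[g] are active, and
    the output feature is the posterior-weighted sum of the transformed
    context.  frames: (T, N); A: (G, out_dim, 21*N)."""
    T, N = frames.shape
    out = np.zeros((T, A.shape[1]))
    for t in range(10, T - 10):
        ctx = frames[t - 10:t + 11].reshape(-1)   # 21-frame context
        # component log-likelihoods (the 2*pi constant cancels in posteriors)
        ll = np.log(w) - 0.5 * np.sum(np.log(var)
                                      + (frames[t] - mu) ** 2 / var, axis=1)
        post = np.exp(ll - ll.max())
        post /= post.sum()
        for g in np.argsort(post)[-top:]:         # dominant components only
            out[t] += post[g] * (A[g] @ ctx)
    return out
```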
---|
0:09:43 | The next acoustic system was a GMM which was discriminatively trained using the maximum mutual information criterion, and we used features which were channel compensated. So that was it for the acoustic subsystems. |
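For context, the maximum mutual information criterion maximises the posterior of the correct language; in the usual formulation (the notation below is ours, not from the talk), for segments $X_r$ with true languages $L_r$ and per-language models $\lambda_L$, the objective is

$$\mathcal{F}_{\mathrm{MMI}} = \sum_r \log \frac{p(X_r \mid \lambda_{L_r})\, P(L_r)}{\sum_{L} p(X_r \mid \lambda_{L})\, P(L)},$$

so the numerator pulls the correct model towards the data while the denominator pushes all competing models away.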
---|
0:10:07 | Then, the phonotactic systems. The core of our phonotactic systems were of course our phoneme recognisers. The first one, the English one, is a GMM-based phoneme recogniser built on the triphone acoustic models from an LVCSR, decoded with just a simple language model. The two other phoneme recognisers, for Russian and Hungarian, are neural network based: the neural network estimates the posterior probabilities of the phonemes and then feeds them to the HMM for the decoding. These phoneme recognisers were used to build three binary decision tree language models and one SVM, which was based on the Hungarian phoneme recogniser. Here four-grams were used, and the SVM was actually using only the trigram lattice counts as features. |
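A minimal sketch of such a phonotactic SVM, with 1-best phone strings and a toy phone set standing in for the expected trigram counts that are actually collected from lattices (all names and data below are hypothetical):

```python
import numpy as np
from itertools import product
from sklearn.svm import LinearSVC

PHONES = list("aeiouptk")  # toy phone set
TRIGRAM_INDEX = {tg: j for j, tg in enumerate(product(PHONES, repeat=3))}

def trigram_vector(phones):
    """Relative trigram frequencies of one decoded segment."""
    v = np.zeros(len(TRIGRAM_INDEX))
    for i in range(len(phones) - 2):
        v[TRIGRAM_INDEX[tuple(phones[i:i + 3])]] += 1.0
    return v / max(1.0, v.sum())

# toy stand-ins for decoded phone strings and their language labels
train_strings = ["patika", "patuka", "epetoi", "opetei"]
train_langs = ["L1", "L1", "L2", "L2"]
X = np.array([trigram_vector(s) for s in train_strings])
svm = LinearSVC(C=1.0).fit(X, train_langs)          # one-vs-rest linear SVM
raw_scores = svm.decision_function(np.array([trigram_vector("patiko")]))
```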
---|
0:11:21 | Then we were doing a fusion. We used the multiclass logistic regression from the FoCal toolkit. The thing is that this time we didn't train three separate backends, one for each duration condition; we tried to do a duration-independent fusion. Every system was outputting some raw scores, and in addition it was also outputting some information about the length of the segment, which for the acoustic systems was the number of frames, while the phonotactic systems provided the number of phonemes. These raw scores for every system were then fed into Gaussian backends; we had three parallel Gaussian backends per system, because we used three kinds of length normalisation: either we divided the scores by the length, or by its square root, or we didn't do anything. Then we put all of the outputs of these Gaussian backends into the multiclass logistic regression, discriminatively trained, and the output was the calibrated language log-likelihood scores. |
---|
0:12:51 | So here is the scheme of the fusion. Again, each system outputs raw scores, and the score is either taken as it is, or normalised by the square root of the length, or divided by the length; then the outputs of the Gaussian backends go, together with the information about the length, into the discriminatively trained multiclass logistic regression. |
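A minimal sketch of this duration-independent backend and fusion, with sklearn standing in for the FoCal toolkit (the exact FoCal recipe differs; everything below is illustrative only, and for brevity the backends are trained and applied on the same data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def gaussian_backend(train_s, train_y, test_s, n_lang):
    """Gaussian backend: one Gaussian per language over the score vectors,
    with a shared covariance; returns per-language log-likelihoods (up to
    a class-independent constant)."""
    mu = np.array([train_s[train_y == c].mean(axis=0) for c in range(n_lang)])
    centered = train_s - mu[train_y]
    cov = centered.T @ centered / len(train_s)
    prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return np.array([-0.5 * np.sum((test_s - m) @ prec * (test_s - m), axis=1)
                     for m in mu]).T

def duration_independent_fusion(raw, lengths, y, n_lang):
    """Three length normalisations of the raw score vectors (none, /len,
    /sqrt(len)), a Gaussian backend on each, then a discriminatively
    trained multiclass logistic regression over the concatenated backend
    outputs plus the log-duration."""
    variants = [raw, raw / lengths[:, None], raw / np.sqrt(lengths)[:, None]]
    llk = np.hstack([gaussian_backend(v, y, v, n_lang) for v in variants])
    feats = np.hstack([llk, np.log(lengths)[:, None]])
    return LogisticRegression(max_iter=1000).fit(feats, y)
```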
---|
0:13:22 | So, the actual core of this paper was to go through our development set and decide where the problem was. Here I would like to thank our friends in Torino, who provided us with their development set, so we were able to do this analysis. Actually, the team in Torino had a much smaller development set than we had: it contained, if I remember correctly, about ten thousand segments of thirty-three or thirty-four languages. Our development set was very huge: it contained data from fifty-seven languages and about sixty thousand segments. |
---|
0:14:14 | So we did the experiment. We tried to recreate their setup, putting together the whole training set and development set, and of course we also had our own training and development sets. Then we ran four types of experiment: either we were training our systems and calibrating them on the Torino data set; or we trained on the LPT set and then calibrated on our set; or we trained on our set and calibrated on the LPT one; or we trained on our set and calibrated on ours. The white columns are our original scores. |
---|
0:15:01 | This analysis, of course, was done using only one of our acoustic subsystems, the JFA system, because it would not have been feasible to run the training of all the systems again. |
---|
0:15:16 | As you can see, we had some serious issues for some languages; actually, these were the languages where only the Voice of America data were available. The Bosnian language was an issue: you can see a big difference between the LPT set and our set. The blue column is training on our set and using the Torino development set for calibration. So there must have been some bothersome issue in our development set. |
---|
0:15:50 | The problems were there in Bosnian and Farsi, and also on the final score we were getting some performance loss everywhere. So we tried to focus on these languages and find what issue we had in our development set. |
---|
0:16:15 | The first issue we found was ridiculous: we had mislabelled one language in our development set. Actually, there was a label for Farsi and a label for Persian, and we treated them as different languages. We corrected this error, and the problems for this language mostly disappeared. |
---|
0:16:36 | The next problem we addressed was finding the repeating speakers between the training and development sets, because based on the discussions at the language recognition workshop we already suspected this could be a problem for our training and development data. So what we did: we took our speaker ID system from previous evaluations, which is a GMM-based speaker ID system, trained a model for every training segment inside a language, and tested against the segments in the development set. |
---|
0:17:21 | What we ended up with was this bimodal distribution of scores. This part here, these are the high speaker ID scores, and it suggests there are some recurring speakers between the training and the development sets. |
---|
0:17:46 | So when we looked at these pictures, we decided to threshold the data and to discard from our development set everything with a score higher than the threshold, which for this Ukrainian language was a score of twenty. |
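A minimal sketch of this filtering step, assuming a matrix of speaker-ID scores between development and training segments of the same language (the shapes and names are ours):

```python
import numpy as np

def speaker_filter(spk_scores, threshold=20.0):
    """Speaker-overlap filtering, minimal sketch.  spk_scores[i, j] is the
    speaker-ID score of development segment i against the model trained on
    training segment j of the same language; any dev segment whose best
    score exceeds the threshold (20 in the Ukrainian example) is assumed
    to share a speaker with the training set and is discarded."""
    keep = spk_scores.max(axis=1) <= threshold
    return np.flatnonzero(keep)  # indices of dev segments to retain

# hypothetical usage:
# dev_keep = speaker_filter(scores_dev_vs_train, threshold=20.0)
```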
---|
0:18:06 | When we did this experiment, we discovered that for some languages we were discarding almost everything from our development set. For example, for Bosnian we ended up with just fourteen segments in our development set, and for the other languages where we were doing the speaker identification filtering we also discarded a lot of the data; for Ukrainian, for example, only twelve segments remained. |
---|
0:18:39 | So what was the performance change when we did this? Merely correcting the label was easy, and it already showed some improvement; and then the speaker ID filtering made quite a huge difference in the performance. Again, these are the results for our acoustic subsystem, the JFA with the RDLT features. |
---|
0:19:17 | When we had done this, we decided to run the whole fusion on our filtered data. Note that we didn't change anything else: we didn't retrain any JFA system we had in the submission for the NIST language recognition evaluation; we just filtered out scores from our development set and ran the fusion again. And we were gaining some quite substantial performance improvements. |
---|
0:19:49 | For the thirty-second condition, the C average went from 2.3 to 1.93, which is quite a nice improvement, and if you look at the table, there is an improvement for every duration; I think there is no number which deteriorated. So it worked over all the conditions, over the whole set, for every language and for every duration. |
---|
0:20:25 | What we also saw here was a little deterioration of the results on our development set. The cause could be that our systems were actually trained to the speakers, and for some languages they can recognise the speaker more easily than the language. |
---|
0:20:55 | So then we decided to work on our acoustic system, the JFA with the RDLT features, because we wanted to do other experiments to improve the final fusion. What we did was simply to discard the RDLT features and use the plain shifted delta cepstra; we trained the system, and there was some improvement out of this. |
---|
0:21:27 | Also, what we did was to train the JFA using all the segments per language instead of five hundred segments per language, and this brought some nice improvement. |
---|
0:21:41 | So when we did the final fusion, we discarded the RDLT JFA system and replaced it with the normal JFA system; the MMI system still remained in the fusion. And instead of all the other binary trees and that one SVM, we put there actually a lot of SVM systems, which are phonotactic based and are built on all of our phoneme recognisers. I will not say much about these, because we have a talk at 2 PM where more will be explained about this system. |
---|
0:22:25 | When we did this, the final fusion went from 1.9, as we saw previously, to 1.57, which is a very competitive result; of course, it is a post-evaluation result. |
---|
0:22:44 | So what are the conclusions of this work? We have to really care about our development data, and rather than creating a huge development set, it is better to pay attention and have a smaller but filtered and clean development set. We actually did experiments with trying to give it more data, and it didn't help us. The problem of the repeating speakers between the training and the development sets was quite large, and we should pay attention when we are doing the next evaluations so that this is handled well. |
---|
0:23:26 | So, thank you. |
---|
0:23:33 | [Audience question, partly inaudible, about speaker filtering and the Torino development set.] |
---|
0:24:02 | We looked at this, and we talked with them at the workshop; they were doing the speaker filtering themselves. But we didn't filter their set according to our training set, so even if some repeating speakers remained there, we just used it as it was. |
---|
0:24:26 | [Audience question, partly inaudible, about whether the same speakers also appear in the evaluation set.] |
---|
0:24:36 | We don't know that, and we didn't check it. We just wanted to treat our evaluation set as an evaluation set, and we didn't look at it. |
---|
0:24:47 | [Follow-up remark, partly inaudible, suggesting that something could probably be gained there.] |
---|
0:24:53 | Well, I think that there are probably not that many speakers repeating in the eval set, because as I understood it, NIST was using some previously recorded data, and it is then much less likely that there will be repeating speakers again. For some of them it can of course happen, but we didn't actually check this. |
---|
0:25:19 | [Audience question, partly inaudible, suggesting that the simpler systems actually turned out better than the more complex ones.] |
---|
0:25:33 | Yeah, it is like that. We were making a lot of effort to try this new RDLT technique of ours, which didn't work. What was working, though, was combining scores of many phonotactic systems, as you did in your submission; merely combining the thirteen PCA-based SVM systems built on our phoneme recognisers was actually giving very nice results, quite comparable if you will: the number was 1.78. Just these SVM systems were better than our final submission, even after the filtering and the recalibration. |
---|
0:26:16 | [Inaudible exchange; session ends.] |
---|