0:00:15 | okay moving on so this is joint work of mine with |
---|
0:00:20 | bob dunn |
---|
0:00:23 | first i'll describe the i-vector challenge |
---|
0:00:27 | then i'll explain our approach which is based on ladder networks |
---|
0:00:33 | i'll show how it can be applied |
---|
0:00:35 | to this task and then show experiments |
---|
0:00:39 | so we start with the challenge itself |
---|
0:00:45 | so we have |
---|
0:00:47 | some labeled data and we have a lot of |
---|
0:00:53 | unlabeled data given as a development set and we have a test set |
---|
0:00:59 | actually it's |
---|
0:01:01 | a classification task given the features it's just a standard classification task but |
---|
0:01:08 | there are two differences first we have |
---|
0:01:11 | unlabeled data that we want to use as part of the classification and we |
---|
0:01:15 | have out of set |
---|
0:01:17 | data that we want to take care of |
---|
0:01:19 | so in this talk i will focus on these |
---|
0:01:22 | two challenges how to use the unlabeled data |
---|
0:01:26 | as part of training and how to take care of out of set data |
---|
0:01:30 | so there are fifty in-set languages |
---|
0:01:34 | some of them are very similar to each other some of them are different |
---|
0:01:40 | this is how the data is constructed and from the construction we can roughly assume |
---|
0:01:45 | that |
---|
0:01:45 | one quarter of the test set and the unlabeled data is |
---|
0:01:51 | out of set |
---|
0:01:53 | we will use this fact |
---|
0:01:55 | okay |
---|
0:01:57 | so |
---|
0:01:59 | i want to discuss how we can use the unlabeled data for training in the case |
---|
0:02:04 | of deep learning in the framework of deep networks |
---|
0:02:07 | so the standard way to use unlabeled data |
---|
0:02:11 | is to use it for pre-training |
---|
0:02:13 | instead of a random initialization of the network we use the unlabeled data |
---|
0:02:19 | to pre-train the network |
---|
0:02:23 | there are two popular methods to do pre-training |
---|
0:02:26 | the first is based on |
---|
0:02:28 | a restricted boltzmann machine |
---|
0:02:31 | the second method is based on the denoising autoencoder |
---|
0:02:36 | but in both of them pre-training |
---|
0:02:38 | is used only to initialize the network |
---|
0:02:40 | and then we essentially forget about |
---|
0:02:42 | the unlabeled data |
---|
0:02:44 | so how does |
---|
0:02:46 | a denoising autoencoder work |
---|
0:02:50 | well we have |
---|
0:02:53 | a data point we add noise to it and we try to reconstruct from the noisy data a structure |
---|
0:03:02 | that is |
---|
0:03:02 | similar if not equal to the clean data |
---|
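To make the denoising-autoencoder idea just described concrete, here is a minimal sketch in PyTorch; the input dimension, hidden size, and noise level are illustrative assumptions, not settings reported in the talk.

```python
# Minimal denoising autoencoder sketch (PyTorch). Dimensions and noise level
# are illustrative assumptions, not the settings used by the authors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=400, hidden=256, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Linear(dim, hidden)
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        x_noisy = x + self.noise_std * torch.randn_like(x)  # corrupt the clean input
        h = torch.relu(self.encoder(x_noisy))
        x_hat = self.decoder(h)
        return F.mse_loss(x_hat, x)  # reconstruction should match the clean input

# usage: loss = DenoisingAutoencoder()(batch)  where batch has shape (N, 400)
```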
0:03:07 | in our approach we use a generalization |
---|
0:03:11 | of the autoencoder which is called the ladder network |
---|
0:03:15 | the idea is |
---|
0:03:18 | not to denoise just the data points but to add noise across the entire network |
---|
0:03:26 | so what we actually do is |
---|
0:03:29 | that the cost function is not only the reconstruction of |
---|
0:03:33 | the input but also the reconstruction of the hidden layers |
---|
0:03:37 | i will explain this in more detail |
---|
0:03:40 | so this is the ladder network |
---|
0:03:43 | this is a standard |
---|
0:03:45 | feed-forward network we have |
---|
0:03:47 | the input the hidden layers and a softmax classification on top of the architecture |
---|
0:03:55 | and this is what we apply to the labeled data |
---|
0:04:00 | in the case of unlabeled data we use the same network |
---|
0:04:05 | the same parameters but at each step we add noise |
---|
0:04:10 | to the |
---|
0:04:11 | data and most importantly we add noise to each of the hidden layers and we |
---|
0:04:16 | try to reconstruct there is a decoder that tries to reconstruct |
---|
0:04:21 | the hidden layers so each reconstruction is based on the noisy version of the |
---|
0:04:27 | layer |
---|
0:04:28 | and on the reconstruction |
---|
0:04:31 | of the previous layer |
---|
0:04:33 | and the cost function is that the reconstruction |
---|
0:04:37 | will be |
---|
0:04:38 | very close to the clean |
---|
0:04:41 | hidden layers |
---|
0:04:45 | this is the formulation |
---|
0:04:49 | more formally so we have an encoder and a decoder in the encoder |
---|
0:04:55 | each hidden layer is |
---|
0:04:56 | corrupted with noise |
---|
0:05:00 | at each step |
---|
0:05:02 | and in the decoder we reconstruct the denoised hidden layers |
---|
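A schematic of the clean and noisy encoder passes and a simplified decoder reconstruction, roughly in the spirit of what was just described. The layer sizes, noise level, and the per-layer denoiser are assumptions; the actual ladder decoder, discussed next, also uses the reconstruction coming from the layer above.

```python
# Schematic ladder-style passes (PyTorch): one clean encoder pass, one noisy
# pass through the same weights, and a simplified decoder that denoises each
# hidden layer. Layer sizes, noise level, and the simple per-layer decoder are
# assumptions for illustration only.
import torch
import torch.nn as nn

sizes = [400, 500, 500, 51]   # assumed: i-vector input, two hidden layers, 51 classes
enc = nn.ModuleList([nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])])
dec = nn.ModuleList([nn.Linear(b, b) for b in sizes[1:]])
noise_std = 0.3

def encode(x, noisy):
    zs, h = [], x
    for layer in enc:
        z = layer(h)
        if noisy:
            z = z + noise_std * torch.randn_like(z)  # corrupt every layer
        zs.append(z)
        h = torch.relu(z)
    return zs

x = torch.randn(8, sizes[0])              # a dummy batch of i-vectors
clean_zs = encode(x, noisy=False)         # clean targets
noisy_zs = encode(x, noisy=True)          # what the decoder must denoise
recon_cost = sum(((d(zn) - zc.detach()) ** 2).mean()
                 for d, zn, zc in zip(dec, noisy_zs, clean_zs))
```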
0:05:08 | so to be more specific the main problem |
---|
0:05:13 | in this model is of course how we can reconstruct the hidden |
---|
0:05:17 | layer based on the noisy |
---|
0:05:20 | layer and of course the reconstruction of the previous layer |
---|
0:05:25 | here we assume that we apply additive gaussian noise |
---|
0:05:31 | to the hidden layers so we use a linear estimation |
---|
0:05:37 | namely we estimate |
---|
0:05:41 | the clean hidden layer as a linear function |
---|
0:05:45 | of the noisy hidden layer |
---|
0:05:49 | where the coefficients of the linear function are taken |
---|
0:05:54 | in a data-driven way from |
---|
0:05:57 | the reconstruction of the previous layer the denoising |
---|
0:06:02 | is modeled by a linear function plus a sigmoid function |
---|
0:06:08 | and we do this for each neuron |
---|
0:06:10 | separately the concept is |
---|
0:06:13 | similar in all the layers |
---|
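This per-neuron denoising function can be written roughly as below. The parameterization shown is the common ladder-network combinator, a linear term plus a sigmoid term with per-neuron coefficients driven by the reconstruction from the layer above; it matches the description here but is an assumption rather than the speaker's exact formula.

```python
# Sketch of a per-neuron denoising (combinator) function: the clean value is
# estimated as a linear function of the noisy value, with coefficients produced
# from the upper-layer reconstruction via linear-plus-sigmoid expressions.
# This particular parameterization is assumed, not taken from the talk.
import torch
import torch.nn as nn

class Combinator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # ten per-neuron parameters, as in the usual ladder combinator
        self.a = nn.Parameter(torch.zeros(10, dim))
        with torch.no_grad():
            self.a[3].fill_(1.0)   # initialize so that mu(u) is roughly u

    def forward(self, z_noisy, u):   # u: reconstruction from the layer above
        a = self.a
        mu = a[0] * torch.sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
        v  = a[5] * torch.sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
        # linear estimate of the clean value from the noisy one, per neuron
        return (z_noisy - mu) * v + mu
```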
0:06:15 | so this is the intuition |
---|
0:06:18 | the idea here is to reconstruct the noisy layer based on the |
---|
0:06:28 | previously reconstructed layer |
---|
0:06:32 | okay so once we have this |
---|
0:06:35 | we actually have training data that is based on |
---|
0:06:39 | both labeled data |
---|
0:06:42 | and unlabeled data |
---|
0:06:44 | the training cost function will be |
---|
0:06:48 | standard cross entropy |
---|
0:06:50 | applied on the labeled data |
---|
0:06:52 | and reconstruction error applied on all the data both labeled and unlabeled and by reconstruction i mean |
---|
0:06:59 | not just the input but |
---|
0:07:01 | the reconstruction of each of the hidden layers |
---|
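Schematically, the combined objective just described (cross entropy on labeled batches plus layer-wise reconstruction on all batches) could be written as below; the model interface and the weight lam are assumptions made for the sketch.

```python
# Sketch of the combined training cost: cross entropy on labeled data plus
# layer-wise reconstruction error on all data (labeled and unlabeled). The
# interface of `model` (returning logits, clean hidden layers, and their
# reconstructions) and the weight `lam` are assumptions for illustration.
import torch.nn.functional as F

def total_cost(model, x_labeled, y, x_unlabeled, lam=1.0):
    logits, clean, recon = model(x_labeled)
    cost = F.cross_entropy(logits, y)                     # supervised term
    cost = cost + lam * sum(F.mse_loss(r, c.detach()) for r, c in zip(recon, clean))
    _, clean_u, recon_u = model(x_unlabeled)              # unlabeled data also contributes
    cost = cost + lam * sum(F.mse_loss(r, c.detach()) for r, c in zip(recon_u, clean_u))
    return cost
```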
0:07:04 | so |
---|
0:07:10 | if we go back to this picture |
---|
0:07:14 | for the unlabeled data |
---|
0:07:18 | we inject the noisy version into the network and then try to reconstruct it such |
---|
0:07:25 | that it will be very similar to the clean data |
---|
0:07:32 | so in this way we use the unlabeled data not just as pre-training |
---|
0:07:37 | and then forget about it but we use it as part |
---|
0:07:42 | of the training |
---|
0:07:43 | the training of the neural network is explicitly |
---|
0:07:47 | based on both the labeled and the unlabeled data |
---|
0:07:54 | this is an illustration of the power of ladder networks |
---|
0:08:00 | this is a result on the standard mnist |
---|
0:08:06 | dataset |
---|
0:08:08 | this is the number of labels and this is the classification |
---|
0:08:12 | error and we can see that using ladder networks |
---|
0:08:16 | we can have |
---|
0:08:19 | performance that is equivalent to one based on |
---|
0:08:23 | only something like one or two hundred labeled examples |
---|
0:08:29 | while all the other |
---|
0:08:32 | images are used as unlabeled data |
---|
0:08:35 | okay so this is the idea of ladder networks |
---|
0:08:40 | now we will apply it to this challenge |
---|
0:08:46 | now we want to discuss how we can incorporate out of set in this framework |
---|
0:08:52 | so we use the same network architecture but we add |
---|
0:08:58 | another class a fifty-first so we have fifty classes one for each of the labels |
---|
0:09:05 | for each of the known languages and in addition an out-of-set class |
---|
0:09:10 | and how can we train the out-of-set label |
---|
0:09:15 | for that we use a label distribution regularization |
---|
0:09:20 | so what do we mean |
---|
0:09:22 | assuming we do batch training we can compute the frequency |
---|
0:09:29 | of the |
---|
0:09:33 | classification decisions the frequency of |
---|
0:09:39 | each of the classified languages so we can count how many times we classified the |
---|
0:09:44 | language as english how many times we classified it as hindi |
---|
0:09:47 | and so on and how many times we classified it as out of set |
---|
0:09:52 | and we have |
---|
0:09:54 | a rough estimation of what this histogram what this distribution should be |
---|
0:10:00 | we can assume that |
---|
0:10:03 | all languages should appear roughly |
---|
0:10:05 | equally |
---|
0:10:07 | and out of set should be roughly one quarter |
---|
0:10:10 | of the data |
---|
0:10:12 | so we can use |
---|
0:10:14 | a cross entropy |
---|
0:10:16 | score function |
---|
0:10:21 | to define a score for |
---|
0:10:24 | the label distribution of the classifier and the main point is that we can |
---|
0:10:30 | do it because we have |
---|
0:10:32 | out of set |
---|
0:10:34 | in the unlabeled data otherwise we couldn't |
---|
0:10:36 | in this challenge this is the main point |
---|
0:10:39 | in this challenge we have out of set in the unlabeled data so we can |
---|
0:10:43 | assume that in the unlabeled data some of |
---|
0:10:46 | the labels should be out of set |
---|
0:10:50 | okay so this defines an additional cost function so we have |
---|
0:10:55 | two |
---|
0:10:56 | cost functions one is the ladder cost function that is the semi-supervised cost function |
---|
0:11:01 | and the other one is |
---|
0:11:06 | the discrepancy score of the label distributions |
---|
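A sketch of how such a label-distribution score could be computed on a batch. The prior follows the talk (the fifty languages roughly equally likely and out-of-set about one quarter of the data); the exact form, a cross entropy on the batch-average softmax, is an assumption about the implementation.

```python
# Sketch of the label-distribution regularizer: compare the batch-average
# predicted class distribution with a rough prior in which the fifty in-set
# languages share three quarters of the mass and out-of-set takes one quarter.
# The cross-entropy-on-mean-softmax form is an assumption.
import torch
import torch.nn.functional as F

def label_distribution_score(logits):          # logits: (batch, 51)
    prior = torch.full((51,), 0.75 / 50)
    prior[50] = 0.25                           # class 50 = out-of-set
    prior = prior.to(logits.device)
    batch_dist = F.softmax(logits, dim=1).mean(dim=0)   # empirical decision frequencies
    return -(prior * torch.log(batch_dist + 1e-8)).sum()
```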
0:11:11 | okay now i go to the experiments on the i-vector challenge data |
---|
0:11:17 | the input is the i-vectors we use |
---|
0:11:21 | a neural network with fully connected |
---|
0:11:24 | hidden layers |
---|
0:11:25 | and we have a softmax output with fifty-one |
---|
0:11:30 | classes |
---|
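Putting the pieces together, the classifier described here (i-vector input, fully connected hidden layers, a 51-way softmax over the fifty languages plus out-of-set) could look like the sketch below; the i-vector dimensionality and hidden-layer widths are assumptions.

```python
# Sketch of the classification network described above. The i-vector dimension
# (400) and hidden widths (500) are assumptions; only the 51-way output
# (fifty languages plus one out-of-set class) follows the talk.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(400, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 51),          # logits; index 50 is the out-of-set class
)
# predictions: classifier(ivectors).argmax(dim=1)
```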
0:11:31 | so this is an example |
---|
0:11:34 | this is a simulation where |
---|
0:11:37 | we take some of the languages |
---|
0:11:40 | as in-set and the other languages |
---|
0:11:43 | as out of set so in this simulation we know |
---|
0:11:47 | all the labels |
---|
0:11:49 | and here is an example of what happens if we use |
---|
0:11:53 | the baseline |
---|
0:11:54 | without the ladder and if we add |
---|
0:11:59 | the ladder |
---|
0:12:01 | so we can see that |
---|
0:12:03 | we gain |
---|
0:12:06 | a significant improvement using the unlabeled data |
---|
0:12:10 | the price of doing it is that it's more difficult to learn the network |
---|
0:12:17 | the price is |
---|
0:12:18 | that we need to do more epochs but it's not a big issue it just |
---|
0:12:23 | takes more time |
---|
0:12:26 | so these are the results on the progress set |
---|
0:12:31 | so we either use the ladder or not and |
---|
0:12:37 | either use the label statistics |
---|
0:12:39 | score or not |
---|
0:12:43 | so this is the baseline for our case |
---|
0:12:48 | so if we use the ladder |
---|
0:12:51 | we get an improvement |
---|
0:12:54 | if we use the label statistics |
---|
0:12:57 | we also get an improvement but not as much and if we |
---|
0:13:02 | combine |
---|
0:13:03 | the two strategies |
---|
0:13:06 | the first strategy for the unlabeled data and the second strategy |
---|
0:13:10 | for out of set we gain a significant |
---|
0:13:14 | improvement |
---|
0:13:15 | and this is |
---|
0:13:19 | a problem with the out of set statistics |
---|
0:13:23 | that the |
---|
0:13:25 | system provides |
---|
0:13:28 | for example here |
---|
0:13:32 | we classified about thirty percent |
---|
0:13:36 | of the development set as out of set |
---|
0:13:39 | so we tried |
---|
0:13:41 | to |
---|
0:13:45 | adjust the number of out of set to be one quarter |
---|
0:13:50 | because we know that roughly this should be the number but |
---|
0:13:54 | in the baseline we got an improvement but here |
---|
0:13:58 | it doesn't help |
---|
0:14:01 | actually the performance decreases |
---|
0:14:04 | so still this was |
---|
0:14:09 | the best result |
---|
0:14:12 | okay so to conclude we tried to apply here a recent deep learning strategy that |
---|
0:14:19 | takes care of both |
---|
0:14:21 | challenges of this task |
---|
0:14:25 | unlabeled data and out of set for the unlabeled data |
---|
0:14:28 | we use the ladder network that explicitly |
---|
0:14:31 | takes the unlabeled data into account while training |
---|
0:14:35 | for out of set we use a label distribution score |
---|
0:14:40 | that is also |
---|
0:14:43 | used in the training |
---|
0:14:45 | we show that |
---|
0:14:46 | combining |
---|
0:14:47 | these two methodologies we can |
---|
0:14:50 | improve the results |
---|
0:14:54 | okay thank you |
---|
0:15:01 | we have time for questions |
---|
0:15:13 | can you tell us exactly how much this unsupervised data helps you in the training |
---|
0:15:21 | like |
---|
0:15:22 | for example i imagine you could do the autoencoder reconstruction on the same training |
---|
0:15:27 | data that you have like a regularization on the classification did you |
---|
0:15:32 | compare between adding it just as a regularization on the supervised data and using the supervised plus unsupervised |
---|
0:15:37 | setup to measure how much you gain by adding the unsupervised data it's a |
---|
0:15:42 | good question |
---|
0:15:43 | i didn't try that but |
---|
0:15:48 | the ladder is used also as a regularization |
---|
0:15:52 | i think that it acts like a dropout strategy |
---|
0:16:00 | but i'm not |
---|
0:16:02 | not sure |
---|
0:16:04 | this is just intuition though |
---|
0:16:06 | we did try it |
---|
0:16:08 | and if i remember it helps |
---|
0:16:12 | but anyhow we need |
---|
0:16:14 | the unlabeled data because it has out of set data |
---|
0:16:28 | but it's still |
---|
0:16:33 | i want to know if you were applying some kind of pre-processing for this |
---|
0:16:37 | for the i-vectors |
---|
0:16:38 | like normalization or |
---|
0:16:42 | anything like that |
---|
0:16:49 | so the i-vectors were provided by nist |
---|
0:16:52 | they might have done some preprocessing |
---|
0:16:56 | i don't know but we used the raw data |
---|
0:17:11 | if there are no other questions let's thank the speaker again |
---|
0:17:18 | so i think we're at the end of the session i think we have |
---|
0:17:22 | a few |
---|