0:00:17 | My name is [inaudible], and I am going to present our joint work on the structured output layer neural network language model. We are presenting work done in France. |
0:00:30 | First I would like to briefly introduce neural network language models, then move on to hierarchical models, which actually motivated the structured output layer model, and finally present the core of this work, the structured output layer (SOUL) neural network language model. |
0:00:46 | So, neural network language models. When we talk about the usual n-gram language models, we know that they have been very successful; they were introduced several decades ago. But the drawbacks of these models are also well known. One of these drawbacks is data sparsity and the lack of generalization. |
0:01:04 | One of the major reasons for this is that conventional n-gram models use a flat vocabulary: each word is associated with an index in the vocabulary. This way, the models do not make use of the hidden semantic relationships between different words. |
0:01:25 | Neural network language models were introduced about ten years ago to estimate n-gram probabilities in continuous space, and these models have been successfully applied to speech recognition. |
0:01:38 | Why should neural network language models work? Because similar words are expected to have similar feature vectors in continuous space, and the probability function is a smooth function of the feature values, so similar features induce only a small change in probability for similar words. |
0:02:02 | Here is a brief overview of how neural network language models are trained and used. First, we represent each word in the vocabulary as a one-of-n vector: all zeros except for one index that is set to one. |
0:02:19 | Then we project this vector into continuous space by adding a second, fully connected layer called the context layer or projection layer. If we work at the 4-gram level, for example, we feed the neural network the history, that is, the three previous words: we project the three previous words into continuous space and concatenate the resulting vectors to obtain the projection, the context vector for the history. |
0:02:56 | Once we have the vector for the history in the projection layer, we add a hidden layer with a nonlinearity, a hyperbolic tangent activation function, which serves to create a feature vector for the word to be predicted in the continuous prediction space. |
0:03:21 | Then we also need the output layer to estimate the probabilities of all words given the history; in this layer we use the softmax function. All the parameters of the neural network must be learned during training. |
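To make the forward pass just described concrete, here is a minimal sketch of such a feed-forward n-gram neural network language model in Python with NumPy. The layer sizes, random initialization, and variable names are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes: vocabulary V, projection dimension P, hidden size H, history of 3 words.
V, P, H, CONTEXT = 10_000, 200, 200, 3
rng = np.random.default_rng(0)
R = rng.normal(scale=0.01, size=(V, P))                # projection (lookup) matrix
W_h = rng.normal(scale=0.01, size=(H, CONTEXT * P))    # context layer -> hidden layer
W_o = rng.normal(scale=0.01, size=(V, H))              # hidden layer -> output layer

def forward(history_ids):
    """history_ids: indices of the n-1 previous words (most recent last)."""
    # Multiplying a one-hot vector by R is just a row lookup, so the projection is cheap.
    context = np.concatenate([R[i] for i in history_ids])   # shape (CONTEXT * P,)
    hidden = np.tanh(W_h @ context)                          # shape (H,)
    return softmax(W_o @ hidden)                             # P(w | history) over the whole vocabulary

probs = forward([42, 7, 1234])
print(probs.shape, float(probs.sum()))   # (10000,) 1.0
```

The projection step is only a table lookup; the cost sits in the two matrix products, which is exactly the complexity issue discussed next.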
0:03:39 | The key points are that these neural network language models use a projection into continuous space, which reduces the sparsity issues, and that the projection and the prediction are learned simultaneously. In practice this gives significant and systematic improvements in both speech recognition and machine translation tasks. |
0:04:02 | Moreover, neural network language models are used to complement conventional n-gram language models, that is, they are interpolated with them. |
0:04:12 | So the point is: everybody should use them. But there is a small problem. Even with small training sets the training time is very large, and with large training sets it is even larger. |
0:04:28 | Why is it so long? Let us look at just one inference. What do we have to do? First we have to project the histories, and this is just a matrix row selection operation. |
0:04:47 | Then imagine that we have two hundred nodes in the projection vector and a history of three words, so all together we have six hundred, and two hundred nodes in the hidden layer. We have to perform one matrix multiplication with these values, and then another matrix multiplication whose cost depends on the size of the hidden layer, that is two hundred, and on the size of the output vocabulary. |
0:05:22 | If we look at the complexity issues, we can see that the input vocabulary can be as large as we want, since it adds almost nothing to the overall complexity. Increasing the n-gram order does not drastically increase the complexity either, as the increase is at most linear. The problem is the output layer, the output vocabulary size: if the vocabulary is large, the training and inference times grow along with it. |
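As a rough worked example with the sizes quoted above (projection dimension 200, a history of three words, a hidden layer of 200 nodes) and the 56,000-word recognition vocabulary mentioned later in the talk, the two matrix products cost roughly

$$
\underbrace{(3 \times 200) \times 200}_{\text{context}\rightarrow\text{hidden}} = 1.2\times 10^{5}
\qquad \text{vs.} \qquad
\underbrace{200 \times 56{,}000}_{\text{hidden}\rightarrow\text{output}} \approx 1.1\times 10^{7}
$$

multiplications per n-gram, so the output layer dominates by roughly two orders of magnitude. This is why the output vocabulary size, rather than the input vocabulary or the n-gram order, drives the complexity.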
0:05:57 | There are a number of usual tricks that are used to speed up training and inference. One group of tricks deals with resampling the training data, using a different portion of the data in each epoch of neural network language model training, and with batch training mode, that is, propagating together n-grams that share the same history, so that we spend less time. |
0:06:25 | The other type of trick is reducing the output vocabulary, which is what is called shortlist neural network language models. We use the neural network to predict only the K most frequent words, normally up to twenty thousand words, and we use a conventional back-off n-gram language model to give the probabilities of all the rest. |
0:06:53 | In this scheme we have to keep the conventional n-gram language model for back-off and for renormalization, because we have to renormalize the probabilities so that they sum to one for each history. |
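One common way to write this renormalization (the usual shortlist scheme; the exact variant used in this system may differ in details) is

$$
P(w \mid h) =
\begin{cases}
P_{\mathrm{NN}}(w \mid h)\,\displaystyle\sum_{v \in \mathcal{S}} P_{\mathrm{B}}(v \mid h), & w \in \mathcal{S},\\[6pt]
P_{\mathrm{B}}(w \mid h), & w \notin \mathcal{S},
\end{cases}
$$

where $\mathcal{S}$ is the shortlist, $P_{\mathrm{NN}}$ is the neural network distribution over the shortlist, and $P_{\mathrm{B}}$ is the back-off n-gram model. The probability mass the back-off model assigns to the shortlist is redistributed according to the neural network, so the probabilities still sum to one for every history.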
0:07:11 | Now I would like to talk about hierarchical models, which were introduced to tackle this problem of dealing with large output vocabularies. |
0:07:23 | One of the first ideas appeared in work on class-based maximum entropy models, where normalization poses exactly the same problem; a previous talk in this session touched on this. |
0:07:36 | What was proposed about ten years ago was, instead of computing the conditional probability directly, to make use of word clustering, so that classes are introduced into the computation. |
0:07:50 | Then, if we have for example a vocabulary of ten thousand words and we cluster them into one hundred classes, such that each of these classes has exactly one hundred words inside, then instead of doing one normalization over ten thousand words we have to do two normalizations, each over one hundred outputs. We only normalize over two hundred outputs, so we can reduce the computation by a factor of fifty. That was the idea. |
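In equation form, the class-based decomposition is

$$
P(w \mid h) \;=\; P\big(c(w) \mid h\big)\; P\big(w \mid c(w), h\big),
$$

so with $|V| = 10{,}000$ words split into $100$ classes of $100$ words each, every prediction needs one softmax over the $100$ classes plus one softmax over the $100$ words of the selected class: $200$ normalized outputs instead of $10{,}000$, a factor of $50$ fewer.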
0:08:24 | This idea then inspired, I think, the work on the hierarchical probabilistic neural network language model. The idea there is to cluster the output vocabulary at the output layer of the neural network and to predict words through their paths in the clustering tree. |
0:08:49 | The clustering in that work was constrained by the WordNet semantic hierarchy. In this framework we do not predict the word directly at the output layer; instead, at each node we predict the next bit in the hierarchy, zero or one, left or right in the binary tree, given the code of this node and the history. So one additional parameter enters the calculation: the binary code of the node, the path we have to follow to get to it. |
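Written out, the word probability becomes a product of binary decisions along the path from the root to the word's leaf:

$$
P(w \mid h) \;=\; \prod_{d=1}^{D(w)} P\big(b_d(w) \,\big|\, b_1(w), \ldots, b_{d-1}(w),\, h\big),
$$

where $b_1(w), \ldots, b_{D(w)}(w)$ is the binary code of $w$ in the tree. For a balanced tree over $|V|$ words this replaces one $|V|$-way softmax with about $\log_2 |V|$ binary decisions, which is where the speed-up comes from.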
0:09:32 | The experimental results were reported on the rather small Brown corpus with a ten-thousand-word vocabulary. A significant speed-up was shown, about two orders of magnitude, but at the same time there was a loss in perplexity; probably this loss was due to using the WordNet semantic hierarchy. |
0:09:55 | In the work called "A Scalable Hierarchical Distributed Language Model", automatic clustering was used instead of WordNet. The model itself was implemented as a log-bilinear model, without a nonlinearity, and a one-to-many word-to-class mapping turned out to be important, so that a word could belong to more than one class. |
0:10:19 | The results were reported on a larger dataset with an eighteen-thousand-word vocabulary. A perplexity improvement over the n-gram model was shown, a speed-up of course, and performance similar to the non-hierarchical log-bilinear model. |
0:10:38 | Now I am going to talk about the major part of this work, the structured output layer neural network language model. What are the main ideas of the structured output layer neural network language model? |
0:10:52 | First, if we compare it with the hierarchical models I have just been talking about, the trees used to cluster the output vocabulary are not binary anymore. Because of this we use multiple output layers in the neural network, with a softmax in each; I will talk about this in more detail a bit later. |
0:11:16 | Then, we do not perform clustering for frequent words. We still borrow some ideas from the shortlist neural networks: we keep the shortlist without clustering, and we cluster only the infrequent words. |
0:11:32 | And then we use what we think is an efficient clustering scheme: we use the word vectors in the projection space for the clustering. |
0:11:41 | The task is to improve a state-of-the-art speech-to-text system that already makes use of shortlist neural network language models, and that is characterized by a large vocabulary and a baseline n-gram language model trained on billions of words. |
0:11:57 | So what clustering do we do? First we associate each frequent word with its own single class, and then we cluster all the infrequent words. |
0:12:09 | Since the clustering trees we use in our research are not binary, as opposed to binary clustering trees ours are quite shallow: normally in our experiments the depth of the trees is three or four. |
0:12:31 | Here you can see the formula for the computation of the probability. In each leaf of this clustering tree we end up with a word as a single class, and at each level we have a softmax function. |
0:12:49 | At the top level we have the shortlist words, which are not clustered, so each of these words is its own class in its own node, plus one node for the infrequent, out-of-shortlist words, which we then cluster; at the lowest level we end up again with one word per class. |
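Written out, and using notation chosen here rather than copied from the slide, the word probability is a short product of softmax decisions along the word's path in the shallow clustering tree:

$$
P(w \mid h) \;=\; P\big(c_1(w) \mid h\big)\; \prod_{d=2}^{D} P\big(c_d(w) \,\big|\, h,\, c_1(w), \ldots, c_{d-1}(w)\big),
$$

where $c_1(w), \ldots, c_D(w)$ is the sequence of classes assigned to $w$. For a shortlist word the path has length one, since $c_1(w)$ is the word itself, predicted by the top softmax; an out-of-shortlist word first goes through the shared out-of-shortlist node and then through two or three more class softmaxes, down to a leaf that again contains a single word.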
0:13:13 | If we represent our model in this more convenient way, we can say that normally a neural network has one output layer. In our scheme we have one output layer, the first one, that deals with the frequent words, and then additional layers that deal with the subclasses in the clustering tree, and each output layer has a softmax function. |
0:13:45 | So the more classes we have in the clustering, the more output layers we have in our neural network. |
0:13:56 | How do we train our structured output layer neural network language model? First we train a standard neural network language model with a shortlist as output, that is, a shortlist neural network language model, but we train it for only three epochs; normally we use fifteen to twenty epochs to train such a model fully, so here it is only partially trained. |
0:14:23 | Then we reduce the dimension of the context space using principal component analysis; in our experiments the final dimension is ten. Next we perform a recursive k-means word clustering based on this distributed representation induced by the continuous space, except for the words in the shortlist, because we do not have to cluster them. And finally we train the whole model. |
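Here is a small sketch of that clustering stage, assuming the projection-layer vectors of the partially trained shortlist model are available as a matrix. The dimensions, the branching factor, the helper name build_class_tree, and the use of scikit-learn are illustrative assumptions, not the exact recipe of the system.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def build_class_tree(vectors, depth, branching, seed=0):
    """Recursively cluster word embeddings into a shallow class tree.

    Returns a dict mapping word index -> tuple of class ids (its path),
    a simplified stand-in for the SOUL word-to-class assignment.
    """
    def recurse(indices, level):
        if level == depth or len(indices) <= branching:
            # Leaf level: each remaining word becomes its own class.
            return {w: (i,) for i, w in enumerate(indices)}
        km = KMeans(n_clusters=branching, n_init=10, random_state=seed)
        labels = km.fit_predict(vectors[indices])
        paths = {}
        for c in range(branching):
            members = indices[labels == c]
            for w, sub in recurse(members, level + 1).items():
                paths[w] = (c,) + sub
        return paths

    return recurse(np.arange(len(vectors)), 0)

# Assumed input: 200-dimensional projection vectors of the out-of-shortlist words,
# taken from a shortlist NNLM trained for a few epochs (numbers are illustrative).
rng = np.random.default_rng(0)
oos_vectors = rng.normal(size=(5000, 200))

reduced = PCA(n_components=10).fit_transform(oos_vectors)   # context space reduced to 10 dimensions
tree = build_class_tree(reduced, depth=3, branching=100)    # shallow tree, as in the talk
print(len(tree), "words assigned; example path:", tree[0])
```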
0:14:55 | The results we report in this paper are on the Mandarin GALE task. We use the LIMSI Mandarin speech-to-text system, which is characterized by a fifty-six-thousand-word vocabulary; this is a word vocabulary. |
0:15:10 | First we segment the Chinese data into words using the maximum-length approach, and then we train our word-based language models on it. The baseline n-gram language model is trained on 3.2 billion words; it is actually built from many subcomponent LMs that are statically interpolated together, with tuned interpolation weights. |
0:15:37 | Then we train the neural network language models, using at each iteration about twenty-five million words obtained after resampling, because at each iteration we sample different data. |
0:15:52 | In the table you can see results on the Mandarin GALE task, first with the baseline 4-gram LM alone, and then with this baseline 4-gram LM interpolated with neural network language models of different types: the 4-gram neural network language model with an eight-thousand-word shortlist, the one with a twelve-thousand-word shortlist, and the structured output layer neural network language model. |
0:16:23 | What we can see is that the SOUL neural network language model consistently outperforms the shortlist-based neural network language models, not to mention the baseline 4-gram language model. |
0:16:39 | We can also see that the improvement for 4-grams is about 0.1 to 0.2, and when we switch to the 6-gram scenario with the neural network language models, the gain we get from the SOUL neural network language model is a bit larger: we gain between 0.2 and 0.3 over our best shortlist neural network language model. |
0:17:07 | Why do we use shortlists of eight thousand and twelve thousand words? These are the shortlist sizes we normally use in our experiments and our systems. Also, when we train our SOUL neural network language model, we use the eight-thousand-word shortlist for the part that is not clustered, and we use four thousand classes at the upper level. |
0:17:34 | So in terms of complexity this model is pretty much the same as the shortlist model with twelve thousand words in the shortlist. |
0:17:46 | What are the conclusions? The SOUL neural network language model is actually a combination of a neural network and a class-based language model. It can deal with output vocabularies of arbitrary size: in this research the vocabulary was fifty-six thousand words, but we have recently run experiments with a vocabulary of three hundred thousand. |
0:18:19 | Speech recognition improvements are achieved on a large-scale task over very challenging baselines, and we have also noted that structured output layer neural networks give larger improvements for longer contexts. |
0:18:37 | And that is it. Questions? |
0:18:54 | [Audience question, largely inaudible, concerning the input layer.] |
0:19:13 | So, here you mean? The operation we do at this point is just a matrix row selection; we do not have to do any multiplication, so at this point it has nothing to do with the increase in complexity. If you look here, what we have to do is just a matrix row selection. |
0:19:55 | [Inaudible follow-up from the audience.] |
0:20:00 | No, but this part is trained very fast. |
0:20:25 | Perhaps we can discuss it later, in the questions, because I cannot go fully into it now. |
0:20:31 | (Session chair) Time for one more. |
0:20:38 | (Audience) Thanks. I just have two quick questions on your results. The number of output nodes you use ranges from a few thousand up to around twenty thousand; in your experiments, did you try to go up to twenty thousand to see what happens? Because your class-based output configuration spans the entire vocabulary, while the eight-thousand and twelve-thousand shortlist ones do not; could you make it even larger, and what happens then? |
0:21:11 | The maximum we tried was a twelve-thousand-word shortlist, because already with twelve thousand, and certainly with an output vocabulary of more than that, say thirty thousand, the training time becomes too large. |
0:21:23 | (Audience) Okay, so it is too long and you do not have an experiment for that. Also, just one basic extreme case: if you do not split the clustering tree and simply cluster all the other words, the ones out of the shortlist, into one class, how does that model fare against your multiple-class configuration? |
0:21:42 | We prefer to use this configuration because, well, that is not this story, that is another paper, but we prefer to keep the shortlist part stable, because in other experiments we use much more data to train the out-of-shortlist part. But that is another story. |
0:22:05 | [Inaudible follow-up.] I do not know, I have never tried that. |
0:22:13 | (Session chair) Okay, let us thank the speaker. |