0:00:15 | Okay, so this is not intended to be particularly formal; also, as you can see, we have put away the screen, so there will be no slides.
0:00:23 | What I was encouraging everybody on the panel to do is to take about five minutes to give a sort of oral summary of their poster, to encourage people to come see it, because it will be up for the rest of the session. Then we can open up the floor for questions; I might have a few, and we'll see where the discussion goes.
0:00:47 | So why don't we get started, and since you're sitting closest, maybe you can go first.
0:00:55 | Okay, so I'm probably the odd one out here, because I'm basically going to advocate GMMs.
0:01:01 | What I did is I looked at the neural networks, tried to figure out why they work so well, and tried to port that back to the GMMs.
0:01:13 | So, why GMMs? We have, for years, built up lots of techniques: model-based techniques, model-based adaptation, speaker adaptation, noise adaptation, uncertainty decoding, all kinds of techniques that are based on maximum-likelihood trained HMM-GMM systems. If we just put DNNs at the front end, you basically lose a lot of that.
0:01:39 | Another reason is that GMMs are fast and very efficient: with few parameters, you can make a speech recognizer with ten times fewer parameters, and it decodes very fast.
0:01:51 | The final and last reason: when you do speech recognition, you try to understand how it works. If you are going to replay the recognizer in the top of your head, with a black-box method like a deep neural network you don't learn much in the end; a somewhat more modular system, where you have building blocks that are at least doing something you understand, is nice to have.
0:02:16 | The second part is: what are we going to port from the DNN world to the GMM world? If you look at DNNs, they take a very large window of frames and map that to context-dependent states, which are basically long-span symbolic units. Going from long-span temporal patterns to long-span symbolic units is a fairly complex mapping; that's probably why they need lots of layers, and also why they go wide, with something like two to four thousand nodes in between. So that's a pretty big pipeline.
0:02:56 | The deep and the wide: those are two important properties of a neural network, together with the long window of frames. Another thing is that neural networks are advertised as being a product of experts: basically, every node is used for every input and is trained on every output, so there is lots of training data for every weight.
0:03:25 | Okay, so the next step is: let's try to port all these ideas to the HMM-GMM world. Basically, I didn't invent anything new; I used existing techniques.
0:03:39 | If you want to handle large windows of frames, you have to do feature reduction, because GMMs don't like two-hundred-dimensional input features, so we use something like LDA, linear discriminant analysis, to do the feature reduction.
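(As a rough illustration of this step, here is a minimal sketch of frame stacking followed by LDA reduction, assuming scikit-learn's LDA; the features, window size, and state labels are invented stand-ins, and in a real system the labels would come from a forced alignment:)

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical shapes: 39-dim features, stacked over a 9-frame window.
n_frames, feat_dim, context = 10000, 39, 9
feats = np.random.randn(n_frames, feat_dim)   # stand-in for real features
states = np.random.randint(0, 100, n_frames)  # stand-in for aligned states

# Stack each frame with its neighbours -> one large spliced input vector.
pad = np.pad(feats, ((context // 2, context // 2), (0, 0)), mode="edge")
stacked = np.hstack([pad[i:i + n_frames] for i in range(context)])

# LDA projects the ~350-dim stacked vector down to something a
# diagonal-covariance GMM can cope with (here 40 dimensions).
lda = LinearDiscriminantAnalysis(n_components=40)
reduced = lda.fit_transform(stacked, states)
print(stacked.shape, "->", reduced.shape)     # (10000, 351) -> (10000, 40)
```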
0:03:57 | But that loses lots of information, so in parallel with that you can, for example, use multiple streams. Multiple streams are not new: in the old discrete-HMM world you had static features, delta features, and double-delta features as multiple parallel streams, fused at the end. You can still do that today.
0:04:15 | So that already copes with a large input window of frames.
0:04:23 | Going wider we also already had: we have multiple streams in parallel. You can see that as a way of coping with a large-dimensional input feature stream, or you can say they are little models.
0:04:39 | Then, going deeper: that's basically done by adding a log-linear layer on top of the other layers. Nothing new, nothing special; conditional random fields or maximum entropy models, they go around under lots of names, and it's just the softmax layer in neural networks. So that's nothing special, but it's more or less the simplest extra layer you can add.
0:05:05 | It is in essence a product-of-experts model: it combines values in a sum, which in the log domain is basically a product, and makes new values, so it's very good at fusing things. I added frame stacking in front of it just to increase the feature dimension.
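(A minimal sketch of that kind of log-linear fusion layer; the stream scores and combination weights here are toy values, and in the real system the weights would be trained discriminatively:)

```python
import numpy as np

def log_linear_fusion(stream_log_scores, weights):
    """Fuse per-stream log-scores for each state with a softmax.

    stream_log_scores: (n_streams, n_states) log-likelihoods per stream.
    weights: (n_streams,) trained combination weights.
    Summing weighted log-scores is a weighted *product* of the original
    scores, hence 'product of experts'.
    """
    z = weights @ stream_log_scores            # (n_states,)
    z -= z.max()                               # numerical stability
    p = np.exp(z)
    return p / p.sum()                         # posterior over states

# Toy example: 3 feature streams scoring 5 states.
scores = np.log(np.random.rand(3, 5))
posterior = log_linear_fusion(scores, np.array([1.0, 0.7, 0.5]))
print(posterior.round(3), posterior.sum())
```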
0:05:26 | So these are basically all existing, very simple techniques. I forgot one: parameter tying, but that's also very simple; we use tied states, as our systems have done for years. That basically means that every Gaussian is trained not on every output and every input, but on a lot of them: every Gaussian is reused in over a hundred of the output states, so it sees lots of the frames anyhow.
0:05:54 | And if you combine all these things, you end up with results that are competitive with last year's DNN results. This year's DNN results add things like segmental or sequence training, convolutional neural networks, dropout training: new techniques that I don't know yet how I'm going to map to my system, but the sequence training is very simple to add and will probably improve the systems.
0:06:21 | So the end message is: GMMs and HMMs are not dead yet.
0:06:31 | Okay, thank you, Kris. Hank?
0:06:35 | So I work on YouTube. I did work on voice search, and then some work on YouTube, where we actually published results, and I thought it would be great to share some of the things we've done with YouTube.
0:06:50 | If you don't know YouTube, it's a video sharing site where you can share all sorts of things; I think the most popular videos are things like dogs or cats running around, but there's actually some useful data there. A billion users visit YouTube every month, they watch six billion hours of video, and over a hundred hours of video are uploaded every minute. So there's a lot of content there and a lot of people watching.
0:07:18 | One thing we'd like to do is provide captions, to make YouTube more accessible for those that are hard of hearing or don't speak the language. Also, imagine if we could provide automatic captions on YouTube: that would help with searching for videos, or with actually navigating within a video if you want particular instances of words in the video.
0:07:39 | And people have found non-trivial uses for this: for instance, you could imagine using this indexing technology to find the instances where a public figure says particular words in a speech.
0:08:01 | So there are some compelling applications here.
0:08:07 | I looked at this from a couple of aspects. The first is the data aspect: we have a lot of data, so what are some of the ways that we can leverage that data?
0:08:20 | For example, users have uploaded twenty-six thousand hours' worth of caption tracks, online text captions, for these videos; they tend to add captions because, you know, they find it useful to have them. But some of those captions don't in any fashion match the video; they're just advertising things.
0:08:44 | People have looked before at how to use this sort of found data for training, and we do much the same thing that everyone else does: we try to figure out what sort of aligns and what doesn't align, and we have this "islands of confidence" technique. Basically, areas where a lot of agreement happens between the recognition result and the user-provided captions, those islands of coherence, are then used as training ground truth.
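(A toy sketch of that idea. The alignment here uses Python's difflib for illustration, not the actual production aligner; "islands" are simply runs of words where the recognizer and the user caption agree:)

```python
from difflib import SequenceMatcher

def islands_of_confidence(recognized, caption, min_len=3):
    """Return word spans where ASR output and user caption agree.

    Runs of at least `min_len` matching words are treated as
    reliable 'islands' usable as training ground truth.
    """
    matcher = SequenceMatcher(a=recognized, b=caption)
    return [recognized[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]

asr = "the cat sat on the mat and then um it ran away".split()
cap = "my cat sat on the mat and suddenly it ran away fast".split()
for island in islands_of_confidence(asr, cap):
    print(" ".join(island))
# -> "cat sat on the mat and" and "it ran away"
```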
0:09:13 | After filtering out everything that doesn't align well, we got an initial corpus of about a thousand hours, which we compared to about a hundred and fifty hours of supervised, actually hand-transcribed data. So we were able to run some comparisons on that.
0:09:34 | The other aspect is: well, we have so much data, can we improve the modelling techniques in different ways? For example, people typically talk about having a few thousand CD state units; typically we all work with around seven thousand CD states, and I think Frank's systems went up to thirty-three thousand. So we ran experiments in that regime, around twenty thousand and forty-five thousand CD states, to use more data, and with more outputs we got a better model.
0:10:07 | But that makes a really large output layer: with a softmax of over forty thousand output nodes, you get tens of millions of parameters just in that one layer. So actually, inspired a little bit by Tara Sainath's ICASSP work on low-rank factorization, I was keen to try this on our data and see how it goes, and in the paper we looked at using various ranks for this factorization, which basically inserts a lower-dimensional linear layer just before the output layer.
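(A back-of-the-envelope sketch of the parameter savings; the layer sizes here are illustrative, assuming a 2048-unit hidden layer feeding a 40k-state softmax:)

```python
import numpy as np

hidden, states, rank = 2048, 40000, 256

# Full final layer: one big weight matrix into the softmax.
full_params = hidden * states                      # ~82M parameters

# Low-rank factorization: insert a small linear bottleneck layer,
# replacing W (hidden x states) with A (hidden x rank) @ B (rank x states).
factored_params = hidden * rank + rank * states    # ~10.8M parameters

print(f"full: {full_params/1e6:.1f}M, "
      f"rank-{rank}: {factored_params/1e6:.1f}M, "
      f"{full_params/factored_params:.1f}x smaller")

# The factored layer computes the same shape of output:
x = np.random.randn(hidden)
A = np.random.randn(hidden, rank) * 0.01
B = np.random.randn(rank, states) * 0.01
logits = x @ A @ B                                 # (states,)
print(logits.shape)
```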
0:10:41 | Basically, our result was that we can use the semi-supervised data, where we use the user captions, reasonably well: we can build a model that's better than our GMM system by about ten percent relative. Our GMM system initially was at over fifty percent word error rate.
0:10:58 | I think there were some issues with the GMM system; Cambridge built a comparable system for us and they got below fifty percent, but not by much. But with the same semi-supervised data, with no supervised training, we did pretty well. When we actually used the supervised data, we got better results with less data than with the semi-supervised data models, but that's expected; and when we combined the two, nothing argues against combining.
0:11:25 | And with the low-rank factorization we found that with fewer parameters we were able to get comparable, actually slightly better, results; maybe it's just regularization.
0:11:36 | We found that overall, by having all this extra data, we got better results on general YouTube test sets. But when we evaluated on a domain-specific test set, for example YouTube news, which is similar to broadcast news, we actually got a degradation by adding all this semi-supervised data. That was interesting: in the neural network world people say bigger, better, more data, but it seems we still have some issues with cross-domain training. So that's where things stand.
0:12:12 | Okay, thanks, Hank. Tara?
0:12:17 | Okay, so Frank showed earlier today that one of the first DNN results on LVCSR was on Switchboard, showing about thirty percent relative improvement over a speaker-independent system; and Microsoft, as well as IBM and others, have shown that if you use speaker-adapted features for the DNN, the results are better.
0:12:37 | Then earlier this year we showed that using very simple log-mel features with a convolutional neural network, you can actually improve performance by between four and seven percent relative over a DNN trained with speaker-adapted features. One of the reasons, we think, is that this sort of speaker adaptation is learned jointly with the rest of the network for the actual objective function at hand, be it cross-entropy or sequence-level.
0:13:03 | So the idea of this filter learning work is: why have we been starting from log-mel? Let's start with a much simpler feature, such as the power spectrum, and have the network learn a filterbank which is appropriate for the speech recognition task at hand, rather than using a filterbank which is perceptually motivated.
0:13:22 | If you think about how log-mel is computed, you take the power spectrum, multiply by a filterbank, and then take the log; that is effectively one layer of a neural network, a weight multiplication followed by a nonlinearity.
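(A minimal numpy sketch of that observation; the mel matrix here is a crude triangular-filter construction for illustration, not a reference implementation:)

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft_bins=257, sr=16000):
    """Crude triangular mel filterbank matrix (n_filters x n_fft_bins)."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(0, mel(sr / 2), n_filters + 2))
    bins = np.floor(pts / (sr / 2) * (n_fft_bins - 1)).astype(int)
    W = np.zeros((n_filters, n_fft_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        W[i, l:c + 1] = np.linspace(0, 1, c - l + 1)
        W[i, c:r + 1] = np.linspace(1, 0, r - c + 1)
    return W

# log-mel = fixed linear layer (mel weights) + log nonlinearity.
power_spectrum = np.abs(np.random.randn(257)) ** 2  # stand-in for a real FFT
W = mel_filterbank()
log_mel = np.log(W @ power_spectrum + 1e-8)
print(log_mel.shape)  # (40,)
```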
0:13:35 | So the idea in this filter learning work was to start with the power spectrum and learn the filterbank layer jointly with the rest of the convolutional neural network.
0:13:44 | When we first tried this idea we got very modest improvements, and one of the reasons is that you have to normalize not the input to the convolutional network, but the input to the filter learning layer; there's a lot of work showing that you should normalize the input features going into a network.
0:14:10 | We found that by normalizing the input into the filterbank layer, and by using a trick very similar to one used in RASTA processing to ensure that what goes into the log stays positive, we were able to get about a four percent relative improvement over using a fixed filterbank, on a fifty-hour broadcast news task.
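(One way to keep the log's argument positive is to learn the filter weights in the exponent domain, so the effective weights are always positive. The sketch below illustrates that idea under that assumption — the paper's exact trick may differ — with a single hand-written gradient step:)

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_filters = 257, 40

# Learn log-weights; exp() guarantees positive effective filter weights,
# so the filterbank output (and hence the log's argument) stays positive.
log_W = rng.normal(-3.0, 0.1, size=(n_filters, n_bins))

def forward(power_spectrum, log_W):
    W = np.exp(log_W)
    energies = W @ power_spectrum          # always > 0 for positive input
    return np.log(energies), energies, W

x = np.abs(rng.normal(size=n_bins)) ** 2   # toy normalized power spectrum
out, energies, W = forward(x, log_W)

# Backprop one step for some upstream gradient d_out (toy target here):
d_out = out - rng.normal(size=n_filters)   # pretend loss gradient
d_energies = d_out / energies              # d log(e) / d e = 1 / e
d_W = np.outer(d_energies, x)              # chain through the matmul
d_logW = d_W * W                           # chain through exp()
log_W -= 0.01 * d_logW                     # SGD update in exponent domain
print(out.shape, d_logW.shape)
```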
0:14:37 | We then showed that the filterbank layer can be seen as a convolutional layer with limited weight sharing, so you can apply tricks such as pooling; with pooling you can get about a five percent relative improvement over the baseline with the fixed mel filterbank.
0:14:56 | We then tried other things, like increasing the filterbank size to give a lot more freedom to the filters; that didn't seem to help much, probably because there's a lot of correlation between the different filters. We also found the filter weights were very peaky, probably picking up the many harmonics in the signal; we tried smoothing that out, and that didn't seem to help much either.
0:15:19 | So it seems that the extra peaks that are learned in the filterbank layer are actually beneficial.
0:15:29 | Finally, instead of enforcing positive weights and using the log nonlinearity, we tried letting the weights be negative and using something like a sigmoid or ReLU nonlinearity, and that also didn't seem to help; so it seems that the log nonlinearity, which is perceptually motivated, actually does matter.
0:15:46 | So in summary: we looked at filterbank learning as opposed to using a fixed mel filterbank, and we were able to get about a five percent relative improvement.
0:15:59 | Thank you, Tara. Karel?
0:16:14 | Okay. So in principle I was trying to solve a similar problem to the one Hank was describing, but with one difference: YouTube has probably several thousands or even tens of thousands of hours of training data that could be leveraged to improve the word error rates, while in our case the dataset was much more modest, but still very nice to play with.
0:16:44 | In our case we had ten hours of transcribed data and seventy-four hours of untranscribed data; this was data from the IARPA Babel program, under one of its conditions, the limited language pack condition.
0:17:07 | I tried to find some heuristics for how best to leverage this data for the results. The idea is that I used two different confidence measures on two different levels: one was the sentence level and the other the frame level, so that we can select the data for training.
0:17:35 | The sentence-level confidence was computed as basically the average posterior of the best words from the confusion network.
0:17:49 | For the frame-level confidence measure, imagine you have lattices. The way the semi-supervised training is done is that at the beginning we take the transcribed data and build some system, and with this system we can decode the data for which we have no transcripts, so we can take the best paths from the lattices as if they were the reference.
0:18:20 | So when we have the lattices, we can take the best path, compute the posterior probabilities, then read off the posteriors which lie under the best path, and use those as the frame-level confidence measures.
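(A toy sketch of the two measures. Here the "confusion network" and "best path" are just arrays of per-word and per-frame posteriors, which is all the computation needs; real lattices would supply these via forward-backward:)

```python
import numpy as np

def sentence_confidence(best_word_posteriors):
    """Average posterior of the best word in each confusion-network bin."""
    return float(np.mean(best_word_posteriors))

def select_frames(frame_posteriors, threshold=0.7):
    """Keep frames whose best-path posterior exceeds a tuned threshold."""
    frame_posteriors = np.asarray(frame_posteriors)
    return np.where(frame_posteriors > threshold)[0]

# Toy utterance: 5 words, 12 frames.
word_post = [0.95, 0.80, 0.40, 0.99, 0.85]
frame_post = [0.9, 0.95, 0.97, 0.3, 0.2, 0.88, 0.92, 0.99, 0.5, 0.8, 0.9, 0.6]

print(sentence_confidence(word_post))   # 0.798 -> used to rank sentences
print(select_frames(frame_post))        # indices of trusted frames
```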
0:18:41 | Then we started the experiments, first with frame cross-entropy training, and I tried to make the steps systematic: start at the larger granularity and then go to the smaller one. So at the beginning I was sorting the sentences according to the confidence, and surprisingly I could keep adding more and more sentences, up to all of them, and still the system kept improving steadily; there was no degradation, which was very surprising.
0:19:23 | This gave about a one point one percent absolute improvement. Then there was still the imbalance that we had roughly ten hours of transcribed speech against seventy hours of untranscribed speech, so we tried to multiply the amount of transcribed speech, and we tried different multiplication factors for the system; three was the good one, and that gave an improvement of zero point three percent absolute.
0:20:09 | Finally we went down to the lower level, the frame level, and found that frame-level selection, with an appropriately tuned threshold, would add another zero point eight to zero point nine percent. So that brought the overall improvement to around two point two percent absolute.
0:20:42 | As the full recipe also includes sequence-discriminative training, I did some experiments with the sMBR criterion to improve the results at that stage, and I tried to use a similar data-selection framework; but in the end, the safest option was to take just the transcribed data and run sMBR on that alone, and a large part of the improvement that we had obtained at the frame cross-entropy level persisted in those systems.
0:21:35 | That is pretty much all the experiments we did, so I'd like to invite you to come see the poster, and I would also like to thank the colleagues who worked on this with me.
0:22:07 | Thanks, Karel. Next we have Pawel.
0:22:12 | So our poster paper is about how to learn a speech representation from multiple or single distant channels; we did distant speech recognition, which, as we know, is much more difficult to cope with because of many aspects, like for example a poor signal-to-noise ratio or the interference effects of other acoustic sources.
0:22:35 | What people usually do in distant speech recognition is capture the acoustics using multiple distant microphones and then apply, on top, some sort of combining algorithm, like beamforming, which turns the signals into a single channel; then you build whatever acoustic model you want on top of that.
0:23:09 | We were interested in how to use multiple distant microphones without a beamformer, so in addition to the usual approaches we tried to explore ways to let the model combine the channels. We used neural networks for that, and there are two obvious ways to follow.
0:23:33 | The first one is simple concatenation: you take the acoustics captured by the multiple channels and present them as one large spliced input to the network, and you train it with a single set of targets, as you usually would. The other way to do it is multi-style training, and multi-style training allows you to use multiple distant microphones during training, while you can still recognise with a single distant microphone.
0:24:11 | Getting back to concatenation: with just a simple concatenation we were able to recover around fifty percent of the beamforming gain. So we weren't able to beat our best DNN model trained on the beamformed channels, but we were able to improve by around fifty percent, relative to the gain of the beamformed DNN, of course.
0:24:43 | With multi-style training we trained the network in a multi-task fashion, where we actually shared the representation across the channels: we presented random batches of data drawn from random channels, without telling the network which channel each one came from. That apparently forces the network to account for some of the variability in the channels, and in the end multi-style training gave us the same gains as the simple concatenation. So it's basically a very attractive approach, because you do not need multiple distant microphones in the test scenario, which is a nice finding.
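(A minimal sketch contrasting the two schemes; the array shapes and the batch loop are invented for illustration, and `train_step` stands in for whatever DNN update you use:)

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, n_frames, feat_dim = 4, 5000, 40
channels = rng.normal(size=(n_ch, n_frames, feat_dim))  # per-channel features
targets = rng.integers(0, 1000, n_frames)               # shared CD-state targets

def train_step(batch_x, batch_y):
    pass  # stand-in for one DNN gradient update

# Scheme 1: concatenation -- one wide input built from all channels.
concat_inputs = np.concatenate([channels[c] for c in range(n_ch)], axis=1)
for i in range(0, n_frames, 256):
    train_step(concat_inputs[i:i + 256], targets[i:i + 256])  # (256, 160)

# Scheme 2: multi-style -- same network, single-channel-sized input;
# each minibatch comes from a randomly chosen channel with the same
# targets, so the representation is shared across channels and test
# time needs only ONE distant microphone.
for i in range(0, n_frames, 256):
    c = rng.integers(n_ch)
    train_step(channels[c, i:i + 256], targets[i:i + 256])    # (256, 40)
```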
0:25:28 | In the paper we also point out some open challenges: for example, overlapping speech is still a huge issue, and not many researchers actually try to address it; the simplest thing is to just ignore it. And we also present a complete set of numbers on the AMI dataset; all these numbers should be easy to reproduce if someone is interested. So I invite anyone who's interested to come by, and we can discuss some more. Thank you.
0:26:14 | Okay, thanks, Pawel. And finally, Alex.
0:26:17 | Thank you. So, just to start with a little bit of motivation: I've had a kind of longstanding ambition to do speech recognition with a recurrent neural network alone, to have one network do the acoustic modelling, the language modelling, the state transitions, and have it all combined in a single network. That turns out to be difficult, as you probably won't be surprised to hear.
0:26:54 | So I was eventually persuaded, mostly by my coworkers, that maybe I should just try the conventional thing: take a hybrid system and replace the feedforward neural network with a recurrent one. And that's basically what we did.
0:27:18 | And it's really fairly straightforward; it's a standard hybrid system, and the only thing that would be novel to the people here is the network architecture. One thing about recurrent neural networks is that you can feed them a single frame of input features and the recurrence brings in the context, so you don't really need to stack a window of frames like you do with an MLP.
0:27:56 | There are various other improvements to the basic recurrent network architecture that have been accumulating. I guess the two main ones are, first, making it bidirectional: instead of having a single network that starts at the beginning of the sequence, you have two recurrent networks, one going forward and one going back, so that you have access to both past and future context, and you can stack that same structure to make it deep as well as bidirectional. What you actually find is that the network's use of context spreads out as it goes deeper.
0:28:48 | The other novel thing, I guess, is using this long short-term memory architecture, which I won't try to describe in detail; the basic idea is that it's better at storing information over time, which gives you access to longer-range context. A common problem that everyone sees when they try ordinary recurrent networks for speech is that the vanishing gradient makes it difficult to store information.
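(A minimal numpy sketch of the bidirectional idea with a plain tanh recurrence; in the actual system each direction would be an LSTM layer, and the layers would be stacked:)

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 100, 40, 64   # frames, feature dim, hidden units

def rnn_pass(x, Wx, Wh):
    """Run a simple tanh RNN over time; returns hidden states (T, n_hid)."""
    h = np.zeros(Wh.shape[0])
    out = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

x = rng.normal(size=(T, n_in))
Wx_f, Wh_f = rng.normal(size=(n_in, n_hid)) * 0.1, rng.normal(size=(n_hid, n_hid)) * 0.1
Wx_b, Wh_b = rng.normal(size=(n_in, n_hid)) * 0.1, rng.normal(size=(n_hid, n_hid)) * 0.1

h_fwd = rnn_pass(x, Wx_f, Wh_f)              # sees the past at each frame
h_bwd = rnn_pass(x[::-1], Wx_b, Wh_b)[::-1]  # sees the future at each frame

# Each frame's representation now conditions on the whole utterance.
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2 * n_hid)
print(h_bi.shape)
```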
0:29:25 | Otherwise it was a standard recipe for the training, on about fifteen hours of the Wall Street Journal corpus, because we wanted to compare the system with the kind of models other places were using, and we actually got a comparable system. But the gains from using these bidirectional RNNs with cross-entropy, frame-level training were pretty small.
0:30:10 | One possible reason is that Wall Street Journal is maybe not the most challenging corpus, and we should move to something more essential, like Switchboard. But my feeling is that what we really need is to go beyond this frame-level cross-entropy training to something where the word error rate actually carries through to the objective, something like sequence training.
0:30:53 | Thanks. So at this point we can open up the floor for questions or comments from the audience, directed either at the panel or at anybody else in the room. Any takers?
0:31:12 | Following up on what Tara was showing: for your input you actually used the power spectrum. Do you think the networks would be capable of going further backward still, if you wanted, to the waveform?
0:31:30 | I think that's definitely something to try, and I believe there has been some work on that already.
0:31:40 | I think you're right, I think there's been a little bit of work on that; you might know, Alex, about using convolutional neural network like approaches, do you remember?
0:31:52 | My impression is that there has been some work on this, but they generally do something on top of it, like taking the absolute value or the log and so on; there are things in there that are kind of hard to reproduce just by pretending you don't know that these things are any good.
0:32:09 | Actually, I was going to ask you: I was trying to recall, did you end up taking the log still?
0:32:19 | Yes, we take the log right inside the neural network; twice, I think.
0:32:24 | Right, so that's interesting: we've got these powerful learning machines, and we still have to take the log for them. I wonder about that.
0:32:45 | Okay, I have a question which can be directed at Morgan and Hynek, and Alex Waibel if he's in the room. One of the themes that came up earlier in the day was that some of this stuff was done back in the nineties, and due to limitations on the amount of data we had to work with and the amount of computation available, there were things that couldn't be explored, or couldn't viably be explored. So the question now is: are there papers from the nineties that current practitioners should be going back to, rereading, and trying to, yes, plagiarize ideas from, that we can improve on now? And if so, which ones?
0:33:32 | There is a lot, I mean... It depends on what people are interested in, right? Like this morning there were questions about adaptation, and I can't recall off the top of my head which papers, but if you're interested in adaptation, there were a bunch of papers on neural net adaptation, for instance from groups in Cambridge.
0:34:00 | There is a large number of papers on the basic methods, on the sequence training we were talking about at lunch; there are papers where sequence training was done, I think around ninety-five or something. What we were doing at the time was using the gammas as the targets for the net training.
0:34:33 | I mean, it isn't just the computation and the storage and the amount of data; it's also that oftentimes these things are cyclic. You try some things out: for example, we did the sequence training and it helped a tiny little bit in the examples we were looking at, and it was a lot more hassle, so we didn't pursue it further. We had a couple of years where we were really looking into it, but it wasn't so great. So there were probably some things that we weren't doing quite right, and now it's coming back. And also, when you're enthusiastic about stuff, you look at a point two percent increase a lot differently than when you're not.
0:35:30 | How about some other questions for the panel? They had lots of interesting things they were talking about.
0:35:48 | A question for Pawel on your multi-microphone work.
0:35:51 | Yes?
0:35:56 | The multi-microphone experiment you did, I guess that was with the AMI corpus?
0:35:59 | Yes.
0:36:02 | So you got this, I guess nowadays predictable, result that if you just concatenate the features from the different channels, you perform better than any beamforming, Wiener filtering, or whatever else you were doing. Is that correct?
0:36:18 | No.
0:36:24 | When you concatenate, you get some improvement over a single distant microphone, but the message from the paper is that if you can beamform, you probably should beamform.
0:36:36 | Okay, but with the concatenated features going into the neural network, is that assuming that the speaker is sort of static? I mean, if my speaker were to walk around... My observation is that the network isn't really learning beamforming.
0:36:52 | Well, it actually gives you something more like adapting to the most meaningful, the strongest signal. Basically, if you have multiple distant microphones, one of the speakers is always, in some way, closer to a given microphone than the others, and that's something the network can actually exploit; that is what happened in the scenario we applied it to. Because when you put multiple frames in the input, you have a very coarse time resolution, so you actually cannot learn the time delays in this setup; it just picks the best channels, really. You could do it in a more explicit way: for example, you can apply convolutional acoustic models with max-pooling on top; that also gives some gains, but that's follow-up work.
0:38:14 | Let me be a little bit courageous and respond to what Brian was asking. Because, you know, I'm pretty bad at reading other people's papers, I only have examples of papers which I or my colleagues and students wrote, and people should read them very critically; I don't mean that they are wonderful, but I still think they are interesting. This is the work on TRAPs, which we started at a time when it looked pretty crazy: we just took the temporal trajectory of spectral energy at a given frequency, one second long, and we asked, can you estimate what's happening in the center of this trajectory?
0:38:55 | The first result, of course, was that you get maybe twenty percent correct at best; and of course you get this at a number of frequencies, so after that you take all these posteriors, feed them into another net, and then you again estimate the phoneme in the center. So it was kind of a deep neural net, I would say, and it was also wide, because it had the trajectories at all the different frequencies.
0:39:26 | And it worked surprisingly well. So people could look at it and ask what we could have done better. Of course, we never retrained the whole thing end to end, which probably we should have done, and we used context-independent phonemes, which maybe we shouldn't have; a number of things happened at the time, so the results were never entirely comparable to the mainstream numbers at ICSI and Hopkins and so on, and people didn't worry about it all that much. But I still think people should look at it and tell us what was wrong, or why it is that it works:
0:40:07 | you try to recognize a context-independent phoneme out of one second of context, and you actually do very well; if you look at the posteriorgrams, they are amazingly good. It seems to me like something somebody else should look at critically. So, sorry for promoting my own work, but, as I said, I don't know other people's work so well.
0:40:53 | This question is mostly for Hank to address, though others may have something to say. I actually work on spoken content retrieval, and I'm coming at this from that angle. For example, with all the video being recorded, we're going to be able to search it for keywords online, and I think the keywords that get typed into such a system are going to be the new terms, the names, the rare words. So my concern is: with deep neural networks, are we optimising our acoustic models for the really frequent words and leaving the infrequent words behind? And the other thing is, if you're analysing with word error rate over your entire vocabulary, is that really getting at the performance we want? If we want to understand it, I think it would be interesting to look at all of this from the standpoint of spoken content retrieval, the long tail of it.
0:42:25 | Maybe I can address some of that. I don't think the neural networks are just focused on the head of the distribution; I think they do pretty well on the tails as well. But there are two aspects here: there are the words that are in the vocabulary, and there are the words that are out of vocabulary, which we don't have in the model at test time, and that's a different, kind of orthogonal issue. I see you shake your head, but I think, if we incorporate it well into how we do our searches: we have a static decoder graph, and we can actually incorporate a dynamic vocabulary into that graph.
0:43:08 | I think when we do that we can actually recognise out-of-vocabulary words, words that we haven't seen at training time. For example, I worked on voicemail years ago, and people's names come up all the time; our program manager, for some reason, was always misrecognised by the recognizer as something like "ten cents". But once we switched his name into the dynamic vocabulary checked into the static graph, his name got recognized, and the same happened for lots of other words. So right now the system doesn't actually incorporate a dynamic vocabulary, and I think that should change.
0:43:53 | The metrics you talk about also bias us toward working on the broad range, and that makes it harder to justify a technique that handles the tail but only gives us a point one percent improvement overall; that's a shame, and I think we really do need metrics that look at the long tail. But I think there's still a lot of work that can be done in recognition and in language modeling to handle these rare but useful words.
0:44:34 | I'll chime in a little bit on this one too, since I can speak from experience doing keyword search in lots of languages for the Babel program, which we'll be hearing about tomorrow from Mary Harper. What we found is that word error rate actually is a pretty good basic metric, even when we're doing search for words that are out of vocabulary in the training. So there's not a perfect correlation between word error rate and retrieval performance on that kind of task, but at least to first order, large improvements in word error rate, like we see using neural networks instead of GMMs, definitely lead to better retrieval performance, even on out-of-vocabulary terms. So it's not a perfect metric, but it's one that we've used for many years, and it works pretty well.
0:45:25 | The interesting thing is pronunciations: you'll find problems with those words, since it's very recognition-dependent. But, as in this work here where we're trying to derive pronunciations... so I'm not dismissing it, just...
0:45:42 | I actually want to argue a little bit in favour of the direction the questioner is suggesting, because I think it's useful to separate out the decoding and so forth from what's happening in whatever your acoustic model is, whether it's GMMs or DNNs or MLPs with many layers or whatever. It's true that you tend to do better on things for which you see lots of examples, and this is also true even if you look at particular units, say triphones: the triphones that occur less often you are not going to estimate as well. But what you're saying is true too, that it doesn't completely kill you.
0:46:31 | I agree. I mean, there are issues where we have some queries that just don't get recognised by the recognizer, compared to the ones that do get recognised, where you find all five instances in context; the systems aren't directly trained to do that, so something does need to be addressed.
0:46:52 | First a technical comment on the Superlectures service, of course: we take a very pragmatic engineering approach, and basically the recognizer's vocabulary is fed from the proceedings and everything generated around them, so the new words are not that new anymore.
0:47:12 | But I had another question, for Hank and maybe also Brian, about sequence-discriminative training on the lightly transcribed or untranscribed portions. We found that basically the safe option was to run sMBR only on the transcribed portion of the data, and not on the data loosely transcribed by the recognizer. What's your experience on YouTube videos? And maybe Brian can comment on this as well.
0:47:44 | I'll go first, because we actually have done sequence training experiments on this. Well, let's see: I personally don't have a lot of experience with it, but I think when we report numbers on three hundred hours of broadcast news, about half of it is manually transcribed and half of it is lightly transcribed, and I'm pretty sure we see some nice gains on that, around ten percent relative. The gains on the fifty-hour broadcast news setup, going from cross-entropy to sequence training, are larger than what we see at four hundred hours; I don't know if that's the amount of data, or the data being lightly transcribed like this. But that's with a reasonably good baseline, and again with a pretty good proportion of the training data being lightly supervised.
0:48:41 | Anybody else have comments on that? Karel?
0:48:46 | My comment would be that someone should investigate this deeper; I truly believe that there are more percents to be achieved if we use those words right.
0:49:02 | Okay, other comments or questions? Okay, Thomas.
0:49:13 | This is a very general question: how much training data do we really need in the future, if we go big with the DNNs?
0:49:23 | Well, I guess that's what I was trying to motivate with my work: we just built our system with a lot of data and big networks, and it takes a while to train big networks. But I think it's a good sort of challenge question: if we had, say, ten thousand hours of data to use for training, and we maybe increased the number of context-dependent outputs to a hundred thousand, what would we get? It would be interesting just to know what we would have to change around to train models at those sizes.
0:50:00 | I would also say more is simply better, if the transcriptions are good enough.
0:50:07 | That sounds like a great intro to the next comment, which was about more data.
0:50:13 | I just wanted to mention some results, with numbers, where we actually did some selection of the data for the acoustic modeling, and it lowered the word error rate for some words but not for others. So I think, rather than just piling data in, if we are more careful about what we train on, the performance improves; more of the gain was coming from being thoughtful about the data than from the model.
0:51:03 | I'm blanking on the name, but there was a visitor from Google who gave a talk at ICSI, showing what looked like a definite asymptoting of performance when going up to a hundred thousand, two hundred thousand hours and so on. So I think more data helps, but after a while not that much.
0:51:24 | And I'm surprised that you've been quiet all day, so there, you're making me happy.
0:51:34 | So, on the issue of selection: I think you can certainly argue that selection cannot be the right thing to do; instead you should always do weighting. Whatever data you have — and I certainly agree that there's good data and bad data — the bad data is not worthless, it's just less good than the good data.
0:52:02 | For example, we have a paper here on semi-supervised training, which people have done for a long time: you make a model, recognise some untranscribed data, and then use that for training. When the error rates are relatively low, where "low" is fifty percent or below, you can do that with your eyes closed. When the error rate gets really high, like seventy percent, that does break down; but that doesn't mean you should discard the data, you should just give it a lower weight. And you can show that you always get better performance if you include the data; the weight just gets lower. Yes, in principle the weight could go to zero, but you let the system decide that, and the weights don't really go to zero, they just get smaller: weights like one third or one half, at error rates of eighty percent, were still giving gains. That's been our experience, at least.
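(A minimal sketch of weighting rather than selecting. Here a per-example weight simply scales each example's gradient contribution in a softmax cross-entropy update; the weights, features, and labels are toy values:)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 10, 4                      # examples, feature dim, classes
X = rng.normal(size=(n, d))
y = rng.integers(0, k, n)
# Supervised data gets weight 1.0; recognizer-transcribed data gets a
# smaller weight instead of being thrown away.
w = np.array([1.0, 1.0, 0.5, 0.5, 0.33, 0.33])

W = np.zeros((d, k))
for _ in range(100):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad_logits = p.copy()
    grad_logits[np.arange(n), y] -= 1.0
    grad_logits *= w[:, None]           # <-- weighting, not selection
    W -= 0.1 * (X.T @ grad_logits) / w.sum()

logits = X @ W                          # final check after training
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
print(p[np.arange(n), y].round(2))      # probability of each true class
```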
0:53:14 | So more data, weighted, may not necessarily be the whole answer. I agree with what you're saying, that there's always value in data; but whatever source we get the utterances from, we should also pay some attention to the distributional properties, the rare names and so on. That's really one part of the problem: are we sampling the space correctly?
0:53:49 | I think I showed that with my paper: with general YouTube data, when we went to a particular vertical, like news, we were getting much better error rates with matched data; adding all the data to training, even with a bigger neural network with more parameters, we were actually getting losses on that specific domain. So there are some issues of generalization there.
0:54:15 | I'd like to add a little bit on data; it will be a bit different from what Andreas was saying. I agree that of course more data is always better, but I think we can also get away with using less and less data. So if the question is how much data we will need, I would hold it's less and less, because we are learning more and more about speech: we are actually learning now how to train the nets on one language and use them on another, and so on. And maybe that's also the point of the Babel program — which I call "bobble" — I think we are going to learn how to reuse the knowledge from existing databases on new tasks. That is at least my hope, so I'd like to end up on this positive note: less and less data, that's what I see.
0:55:05 | Just to follow up on what you're saying: I think the lower parts of the network are learning language-independent or task-independent information, so if you feed a lot of data into those layers and less data into the upper parts, that might be an approach to get there; I think that's promising.
0:55:21 | Actually, when we started working on GALE, we had a bunch of nets trained on English, and as we were working on this with SRI and trying to move to Arabic, we didn't have much Arabic data yet, so we just used the nets from English to begin with; they still did something good.
0:55:38 | One point I'd like to make, thinking about human recognition: if you get ten times more data, the system doesn't learn ten times as much; the intuition is that there's a limit to what the extra data buys you.
0:56:09 | I think we don't have any other pressing questions, and actually it's nearly time...
0:56:16 | To what Andreas was saying about weighting the data: I actually did a contrastive experiment, in one case using frame selection and in the other case frame weighting, and I obtained identical word error rates for both systems. So maybe, if what Andreas says is true, there should be some post-processing of the confidence scores; and it is true that those are not uniform at all, the distribution looks more like an exponential with a few groups.
0:57:10 | groups |
---|
0:57:15 | bring some more like |
---|
0:57:20 | something else in the more data so |
---|
0:57:23 | there are several because you want and are the ones don't speaker variability |
---|
0:57:27 | okay then you have more speakers but if you want you know is a list |
---|
0:57:30 | of our other robust against reverberation you can just make the data |
---|
0:57:35 | and then you so does present in the same data yes variation added noise but |
---|
0:57:41 | for a reverberation |
---|
0:57:43 | just train a system on for room acoustics |
---|
0:57:47 | makes it very robust against the for this microphones that's a very cheap trick and |
---|
0:57:53 | it works |
---|
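(A minimal sketch of that cheap trick: convolving clean speech with a room impulse response. The exponentially decaying noise used as an RIR here is a crude stand-in for a measured or simulated one:)

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
sr = 16000
clean = rng.normal(size=2 * sr)              # stand-in for 2 s of clean speech

# Crude synthetic room impulse response: decaying noise tail (~0.3 s).
t = np.arange(int(0.3 * sr)) / sr
rir = rng.normal(size=t.size) * np.exp(-t / 0.05)
rir /= np.abs(rir).max()

# 'Make the data': every clean utterance yields extra reverberant copies,
# one per room; train on the union of all of them.
reverberant = fftconvolve(clean, rir)[:clean.size]
print(clean.shape, reverberant.shape)
```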
0:57:54 | And something else about more data: if you look at the very good neural networks presented here, in everybody's heads, they're not trained with that much data. Google already has more data, so if more data were the strong point in making better networks, they could just do it; so why aren't the systems that much better? Because we don't know how yet.
0:58:19 | I think we are out of time, in principle, so we should turn this over to the conference organisers, and thank the panelists.
0:58:37 | Thank you, Morgan. This will be short. First, before we go, a couple of practical things. For the people that subscribed to the microbrewery tour: be aware it is not a one-way trip, so it is very important that we meet at seven. And we begin tomorrow at eight-forty with the limited-resources session.
0:59:04 | Just one last practical comment: there is a carpooling table on the message board, for whoever goes to Prague and other places; there's free space, so just write yourself in, and maybe we'll have some nice rides.
0:59:21 | Well, I would like to give thanks. I don't know what order is more or less important, but let's first thank the audience, because you stayed, almost everyone; I'd like to thank you very much. Then the panelists, and all the speakers; and of course my greatest thanks go to today's organisers. And I still have one point left for Brian, because he has one more announcement. So, this is...