okay so this is not intended to be particularly formal. We have put away the screen, so there will be no slides. What I was encouraging everybody who is on the panel to do is to take about five minutes and give sort of an oral summary of their poster, so that we encourage people to come see it, because it is going to be up for the rest of the session. Then we can open up the floor for questions; I might have a few, and we will see where the discussion goes. So why don't we get started, and since you are the closest, maybe you can present first.
okay, so I am a bit of the odd one out here, because what I have basically got is a nice GMM system. What I did is I looked at the neural networks, tried to figure out why they work so well, and tried to port that back to a GMM.
So, why GMMs? We have been working with them for years, and we have lots of techniques: model-based techniques, model-based adaptation, speaker adaptation, noise adaptation, uncertainty decoding, all kinds of techniques that are based on maximum-likelihood trained HMM-GMM systems. If we just put DNNs at the front, you basically lose a lot of that. Another reason is that GMMs are fast and very efficient: with a few parameters you can make a speech recognizer with ten times fewer parameters, and it will go very fast. The final reason is that in speech recognition we try to understand how things work. If you are going to replace the network in the top of your head with a black-box method like a deep neural network, what have you learned in the end? So maybe it is better to have a modular system with building blocks that are at least doing something you understand. It is nice to have that.
The second part is: what are we going to port from the DNN world to the GMM world? If you look at DNNs, they take a very large window of frames and they map that to context-dependent states, which are basically long-span symbolic units. Going from long-span temporal patterns to long-span symbolic units is a fairly complex mapping, and that is probably why they need lots of layers. They also go wide: they have something like two thousand to four thousand nodes in between, so it is a pretty big pipeline. So there we already have two important properties of a neural network, the depth and the width, plus a long window of frames. Another thing is that neural networks are advertised as being a product of experts: every hidden node sees the full input and is trained on all outputs, so there is lots of training data for every weight.
Okay, so the next step is: let's try to port all these ideas to the HMM-GMM world. Basically I did not invent anything new; I used existing techniques. If you want to handle large windows of frames, you have to do feature reduction, because GMMs do not like two-hundred-dimensional input features, so we use something like LDA, linear discriminant analysis. But that loses lots of information, so in parallel with that you can, for example, use multiple streams. Multiple streams are not new: in the old discrete HMM world you had static features, delta features, and double-delta features as multiple parallel streams, fused at the end, and you can still do that today. So that already copes with a large input window of frames. Going wider we also already had: we have multiple streams in parallel, which you can either see as a way of coping with a large-dimensional input feature stream, or you can view as a set of little parallel models. Going deeper is basically done by adding a log-linear layer on top, but there is nothing new or special there: conditional random fields or maximum entropy models, they go around under lots of names, and it is just the softmax from the neural networks. So nothing special, but it is the simplest extra layer you can add, more or less, and it is a product-of-experts model: it combines values in a sum, which in the log domain is basically a product, and produces new values, so it is very good at fusing things. I added frame stacking on top of that, just to increase the feature dimension. So basically all existing, very simple techniques. I forgot one, parameter tying, but that is also very simple: we use tied states, as in the systems we have had for years. That basically means that every Gaussian is trained not on all outputs and all inputs, but still, every Gaussian is used in over a hundred of the tied output states, so it gets lots of views; it sees almost every frame anyhow.
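To make that pipeline concrete, here is a minimal sketch (my own illustration, not the speaker's code) of the ingredients just described: frame stacking to get a long temporal window, LDA to bring the dimensionality back down to something a GMM can model, and a log-linear (softmax-style) layer that fuses per-stream GMM scores into state posteriors. The context size, dimensions, and function names are assumptions made for illustration.

```python
import numpy as np

def stack_frames(feats, context=15):
    """Stack +/- context frames around each centre frame (long temporal window)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[t:t + T] for t in range(2 * context + 1)])  # (T, (2*context+1)*D)

def lda_project(stacked, lda_matrix):
    """Reduce the high-dimensional stacked window to a size a GMM can handle, e.g. 40."""
    return stacked @ lda_matrix

def log_linear_fusion(stream_scores, weights, bias):
    """Log-linear (softmax) layer fusing per-stream GMM log-likelihoods into state posteriors."""
    scores = stream_scores @ weights + bias
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)
```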
If you combine all these things, you end up with results that are competitive with last year's DNN results. This year's DNN results add things like segmental or sequence training, convolutional neural networks, and dropout training; those are new techniques, and I do not know yet how I am going to map them onto my system, except that sequence training is very simple to add and will probably improve the system. So the end message is: the GMMs and HMMs are not dead yet.
Okay, thank you. Next, Hank.

I mostly work on voice search, and I have done some work on YouTube. We have not actually published these results, but I thought it would be nice to share some of them with you. If you know YouTube, it is a video-sharing site where you can share all sorts of things; I think the most popular videos are dogs or cats running around and the like, but there is actually some useful data there. A billion users visit YouTube every month, they watch six billion videos, and there is over a hundred hours of video being uploaded every minute, so there is a lot of content and a lot of people watching. One thing we would like to do with YouTube is to provide captions, to make it more accessible for those who are hard of hearing or do not speak the language. Also, imagine if we could provide automatic captions on YouTube: that would help with searching for videos, or with navigating inside a video if you want to find particular instances of words. And some people upload non-trivial content, for example political speeches, where you obviously want to be able to find the words that were said; you could imagine people using this indexing technology to find the instances where a politician says something in a speech.
So there is some point to applying ASR here, and I looked at it from a couple of aspects. One angle is the data: we have a lot of data, so what are some ways we can leverage it? For example, users have uploaded twenty-six thousand hours of caption text, online text captions, for these videos. They went to the trouble of adding captions because they find it useful to have them, but some of those captions only match the video in some loose fashion, or they are just advertising things. So we looked at how to use this sort of found data for training, and we do pretty much what I think everyone else does: we try to figure out what aligns and what does not align, and we have this islands-of-confidence technique. Basically, in areas where there is a lot of agreement between the recognition result and the user-provided captions, we take those islands of coherence and use them as the training ground truth.
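A minimal sketch (my illustration, not Google's actual pipeline) of that islands-of-confidence idea: align the recognizer's hypothesis with the user-uploaded caption and keep only long runs where the two agree, treating those runs as training ground truth. The minimum island length is an assumed parameter.

```python
from difflib import SequenceMatcher

def confidence_islands(hyp_words, caption_words, min_len=4):
    """Return caption word spans where the hypothesis and the caption agree."""
    matcher = SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
    islands = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:                       # keep only long agreements
            islands.append(caption_words[block.b:block.b + block.size])
    return islands

hyp = "the cat sat on a mat while it was raining".split()
cap = "the cat sat on the mat while it was raining outside".split()
print(confidence_islands(hyp, cap))
# [['the', 'cat', 'sat', 'on'], ['mat', 'while', 'it', 'was', 'raining']]
```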
After filtering the found data down to what actually aligns well, we got an initial corpus of about a thousand hours, compared to about a hundred and fifty hours of supervised, actually hand-transcribed data, so we were able to do some comparisons between the two. The other aspect is: we have so much data, can we improve the modelling techniques in different ways? Earlier people talked about having a thousand CD state units, and typically we all work with around seven thousand CD state units; I think Frank said they went up to thirty-three thousand. We really do run around twenty thousand to forty-five thousand CD state units, and with more data and more output states we got a better model. But that means a really large output layer: with the softmax there are forty-odd thousand output nodes times a thousand hidden nodes, which is tens of millions of parameters just in that one layer. There is a nice piece of work presented at ICASSP on low-rank factorization of that layer, and we were keen to try it on this data and see how it goes; in the paper we looked at using various ranks for this task. Basically you insert a small linear bottleneck layer before the softmax, which factorizes the big weight matrix into two low-rank pieces.
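Here is the rough parameter arithmetic behind that low-rank softmax factorization (my own illustration; the exact sizes are assumptions, not numbers from the talk):

```python
hidden, outputs, rank = 1000, 40000, 256

full_softmax = hidden * outputs                  # 40,000,000 weights in the final layer
low_rank     = hidden * rank + rank * outputs    # 1000*256 + 256*40000 = 10,496,000
print(full_softmax, low_rank, low_rank / full_softmax)   # roughly a quarter of the original size
```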
Basically our results were: using the semi-supervised data, where we used the captions as transcripts, we can build a model that is better than our GMM system by about ten percent relative. Our GMM system initially was at fifty-plus percent error rate, and I think there are some issues with it; I believe Cambridge has better numbers than us. They got below fifty percent, but not by much, and with the same semi-supervised data, no supervised training, we did pretty well. When we used the supervised data, the results with less data were actually better than with the semi-supervised models, but that is expected, and combining them works; there is nothing against combining. With the low-rank factorization we found that with fewer parameters we were able to get results that were actually slightly better; maybe it is just regularization. We found that overall, by adding all this extra data, we got better results on general YouTube test sets; but when we looked at a domain-specific test set, for example YouTube videos that are essentially broadcast news, we actually got a degradation by adding all the semi-supervised data. That was interesting, because as neural-network people we like to say bigger is better, more data is better. So we still have some issues with cross-domain training, and I am still looking at how things work out. That's it.
Okay, thanks. Next, Tara.
Okay. Frank showed earlier today one of the first DNN results on LVCSR, on Switchboard, with about a thirty percent relative improvement over a speaker-independent system. Microsoft, as well as IBM and others, have shown that if you use speaker-adapted features for the DNN, the results are better. Then earlier this year we showed that using very simple log-mel features and a convolutional neural network, you can actually improve performance by between four and seven percent relative over a DNN trained with speaker-adapted features. One of the reasons, we think, is that you are learning this speaker adaptation jointly with the rest of the network, for the actual objective function at hand, either cross entropy or sequence. So the idea of this filter learning work was: why have we been starting from log-mel? Let's start with a much simpler feature, such as the power spectrum, and have the network learn a filterbank that is appropriate for the speech recognition task at hand, rather than using a filterbank that is perceptually motivated. If you think about how log-mel is computed, you take the power spectrum, you multiply by a filterbank, and then you take the log, which is effectively one layer of a neural net: a weight multiplication followed by a nonlinearity.
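As a hedged restatement in formula form (the notation is mine, not from the talk): with power-spectrum bins $|X_k|^2$ and mel filter weights $H_{ik}$, the $i$-th log-mel coefficient is

$$ m_i = \log\Big( \sum_k H_{ik}\, |X_k|^2 \Big), $$

which has exactly the shape of one network layer, a linear transform $H$ followed by the $\log(\cdot)$ nonlinearity; filter learning simply treats $H$ as a trainable weight matrix.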
So the idea in this filter learning work was to start with the power spectrum and learn the filter bank layer jointly with the rest of the convolutional neural network. When we first tried this idea we got very modest improvements, and one of the reasons is that you have to normalize the input, not into the convolutional network, but into the filter learning layer; there is a lot of work showing that you should normalize the features going into a network. We found that by normalizing the input into the filter bank layer, and by using a trick very similar to what is done in RASTA processing to ensure that the input into the filter learning layer stays positive, we were able to get about a four percent relative improvement over a fixed filterbank on a broadcast news task. We then showed that the filter bank layer can be seen as a convolutional layer with limited weight sharing, so you can apply tricks such as pooling; if you pool, you can get about a five percent relative improvement over the baseline with the fixed mel filterbank. We then tried other things, like increasing the filter bank size to give more freedom to the filters; that did not seem to help much, probably because there is a lot of correlation between the different filters. We also found that the learned filter weights are very peaky, probably picking up the harmonics in the signal; we tried smoothing that out, and that did not seem to help either, so it seems that the extra peaks learned in the filter bank layer are actually beneficial. Finally, instead of enforcing positive weights together with the log nonlinearity, we tried letting the weights go negative and using something like a sigmoid or a rectified-linear nonlinearity, and that also did not help, so it seems that using the log nonlinearity, which is perceptually motivated, actually does best. In summary, we looked at filter bank learning, and compared with a fixed mel filterbank we were able to get about a five percent relative improvement.
Thank you, Tara. Karel?

Okay. In principle I was trying to solve a similar problem to the one Hank described, but there is one difference: Google can probably use several thousands or even tens of thousands of hours of training data to improve the word error rates, while in our case the dataset was much more modest in scale, but very nice to play with. We had ten hours of transcribed data and seventy-four hours of untranscribed data. This came from the IARPA Babel program, and it is one of the conditions there, the limited language pack condition.
i try to find some heuristics to how to leverage the best
they don't
the results so what idea is that i used to different confidence the measures on
different levels
one level was to sometimes level
and the other was or frame level
so that we can select the data for training
the way that the sentence-level condition was computed it was
basically the average posterior from the confusion from the confusion network
the best word
and the frame level
confidence measure was the
imagine you have some
let these
you well the weighted semi supervised training is done that the beginning be able to
transcribe a to be built some system
and with this system we can decode the
data we don't transcripts and so we can
take the best parts from the lady sees as if you was the reference
and
so when we have the let this is we can take the best file and
we can compute the posteriors abilities and we can then read the posteriors which lie
under the best path and use those as this confidence measures
to use
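A minimal sketch (my own illustration, not the author's recipe) of the two confidence measures just described, assuming you already have the confusion-network posteriors of the best words and the per-frame posteriors of the arcs on the lattice best path; the threshold value and array layouts are assumptions.

```python
import numpy as np

def sentence_confidence(best_word_posteriors):
    """Average confusion-network posterior of the best words in the utterance."""
    return float(np.mean(best_word_posteriors))

def select_frames(best_path_frame_posteriors, threshold=0.7):
    """Keep only frames whose best-path posterior exceeds the threshold."""
    mask = np.asarray(best_path_frame_posteriors) > threshold
    return mask   # use this mask to pick frames (and their state targets) for training

print(sentence_confidence([0.9, 0.8, 0.95]))   # 0.8833...
print(select_frames([0.95, 0.4, 0.85, 0.6]))   # [ True False  True False]
```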
Then we started the experiments. The first experiments were with frame cross-entropy training, and I tried to proceed in systematic steps, first on the larger granularity and then going to the smaller one. At the beginning I sorted the sentences according to the confidence, and surprisingly, as I kept adding more and more of those sentences, in the end I could add all of them and the system kept improving; there was no degradation anywhere, which was surprising. This gave about a one point one percent improvement in absolute terms. Then there was still the situation that we had roughly ten hours of transcribed speech and seventy hours of untranscribed speech, so there was an imbalance, and we could multiply the amount of transcribed speech, say by two or three; we tried different multiplication factors, and three was the good one, which gave an improvement of zero point three percent absolute. Finally we went down to the lower level, the frame level, and found that frame-level selection with an appropriately tuned threshold gives another zero point eight or zero point nine percent. So the overall improvement was about two point two percent absolute.
Since the full recipe we use also includes sequence-discriminative training, I did some experiments with the sMBR criterion to improve the results at that stage, and I tried to use a similar data-selection framework. In the end, the safest option was to take just the transcribed data and run sMBR on that, and a large part of the improvement that we obtained at the frame cross-entropy level persisted in those systems. That is pretty much all the experiments we did, so I would like to invite you to come see the poster, and I would also like to thank the team of colleagues who worked on this.
Thanks, Karel. Next we have Pawel.

Okay, so our poster paper is about how to learn a speech representation from multiple or single distant channels; we work on distant speech recognition, which as we know is much more difficult to cope with because of many aspects, for example poor signal-to-noise ratio or interference from other acoustic sources. What people usually do in distant speech recognition is capture the acoustics using multiple distant microphones, and then apply on top some sort of combining algorithm, like beamforming, which enhances the signals into a single channel, and then you build whatever acoustic model you want on top of that. We were interested in how to use multiple distant microphones without a beamformer, letting the neural network do that job in addition to the usual acoustic modelling.
So we tried to explore ways of combining the channels, and we used neural networks for that. There are two obvious ways to follow. The first one is simple concatenation: you take the acoustics captured by the multiple channels and feed them as one large spliced input to the network, and you train it with a single set of targets, as you normally would. The other way is multi-style training, and multi-style training allows you to use multiple distant microphones while you are training and then recognise with a single distant microphone.
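Here is a minimal sketch (my own illustration, with assumed shapes and names) contrasting the two options just described: splicing all channels into one wide input, versus multi-style training where each mini-batch example is drawn from a randomly chosen channel so a single shared representation is learned.

```python
import numpy as np

def concat_input(channel_feats):
    """Channel concatenation: splice all channels into one wide input vector per frame."""
    # channel_feats: list of (T, D) arrays, one per microphone
    return np.concatenate(channel_feats, axis=1)        # (T, n_channels * D)

def multistyle_batch(channel_feats, targets, batch_size, rng=np.random):
    """Multi-style training: each example comes from a randomly chosen channel,
    so the network shares one representation across channels and can be used
    with a single distant microphone at test time."""
    T = targets.shape[0]
    frames = rng.randint(T, size=batch_size)
    channels = rng.randint(len(channel_feats), size=batch_size)
    x = np.stack([channel_feats[c][t] for c, t in zip(channels, frames)])
    return x, targets[frames]
```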
Getting back to concatenation: with just a simple concatenation we were able to recover around fifty percent of the beamforming gain. So we were not able to beat our best DNN model trained on the beamformed channels, but we recovered around fifty percent of the improvement, relative to the gain of the DNN on the beamformed audio, of course. With multi-style training, we train the network in a multi-task fashion where we share the representation across the channels: we present a random batch of data from random channels and adapt the network to that. That apparently forces the network to figure out some of the variability there is in the channels, and in the end multi-style training gave basically the same gains as the simple concatenation. So it is a very attractive approach, because you do not need multiple distant microphones in the test scenario, which is a nice finding. In the poster we also point out some open challenges; for example, overlapping speech is still a huge issue, and not many researchers actually try to address it; the simplest thing is just to ignore it. We also present a complete set of numbers for the AMI dataset; all these numbers should be easy to reproduce if someone is interested. So anyone who is interested is invited to come by and we can discuss some more. Thank you.
Okay, thanks, Pawel. And finally, Alex.

Thank you.
Just to start with a little bit of background: a kind of longstanding ambition I have for speech recognition is to do the whole thing with recurrent neural networks, to have one network doing the acoustic modelling, the language modelling, the state transitions, and have it all combined in a single network. That turns out to be difficult, which probably will not surprise anyone here. So I was eventually persuaded, mostly by my coworker, that maybe I should just try plugging one of these things into a standard system, replacing the usual neural network with a recurrent one, and that is basically what we did. It is really fairly straightforward: it is a standard hybrid system, and the only thing different from what other people here are doing is the network architecture. One thing you can do is take just an ordinary recurrent neural network and give it a single input feature frame at a time, and it brings in its own context; you can really increase the amount of context it uses. Beyond that, there are various other improvements to the basic recurrent network architecture that I have been accumulating over the years. I guess the two main ones are, first, being bidirectional: instead of having a single network starting at the beginning of the sequence, you have two recurrent networks, one going forward and one going backward, so you have access not only to the past but also to the future context, and you can stack that same structure just as you would a normal deep network, so it is a deep bidirectional network. What you actually find is that the network's use of context spreads out as you go up through the layers. The other, perhaps more novel thing is the use of this long short-term memory architecture, which I will not try to describe in detail; the basic idea is that it is better at storing information over time, so it gives you access to longer-range context. A common problem everyone hits when they try ordinary recurrent networks for speech is the vanishing gradient, which makes it difficult to store information.
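A minimal numpy sketch (my own illustration, not the speaker's code) of the bidirectional idea: one recurrent pass over the sequence going forward, one going backward, with the two hidden states concatenated at every frame so each output sees both past and future context. A real system would use LSTM cells here; plain tanh units are used only to keep the sketch short.

```python
import numpy as np

def rnn_pass(x, W_in, W_rec, reverse=False):
    """One recurrent pass over a (T, D) feature sequence."""
    T = x.shape[0]
    H = W_rec.shape[0]
    h, out = np.zeros(H), np.zeros((T, H))
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(x[t] @ W_in + h @ W_rec)
        out[t] = h
    return out

def bidirectional_layer(x, fwd_weights, bwd_weights):
    """Concatenate a forward and a backward pass; stack such layers for a deep network."""
    fwd = rnn_pass(x, *fwd_weights)
    bwd = rnn_pass(x, *bwd_weights, reverse=True)
    return np.concatenate([fwd, bwd], axis=1)   # (T, 2H)
```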
Other than that, it was a standard recipe for the training, on about fifteen hours of data, because we wanted to be able to compare the system with the kind of models other people have produced, and with that pipeline we actually got a reasonable system. Then we went to the Wall Street Journal corpus, and there the results were kind of inconclusive: using these bidirectional RNNs with cross-entropy training at the frame level, the gains are pretty small. One possible reason is that Wall Street Journal is maybe not the most challenging corpus, and it would be interesting to try a harder one such as Switchboard. But my feeling is that what we really have to do is go beyond frame-level cross-entropy training, towards training where the word error rate is actually the thing we care about, something like sequence training.
Thanks. At this point we can open up the floor for questions or comments from the audience, either directed at the panel or at anybody else in the room. Any takers?
Following up on what Tara was doing: you start from the power spectrum, so do you think the networks would be capable of going even further back?

Right, you can go further back if you want, all the way to the waveform. I think that is definitely something to try, and I am not sure anybody has done much work on it yet.

I think you are right, I think there has been a little bit of work, but not much. Alex, do you know of work on using convolutional neural-network-like approaches on the waveform, do you remember?

As I mentioned, there has been some work on this, but they generally do something on top of it, like take the log, or take the absolute value followed by the log, and so on; there are things in there that are kind of hard to reproduce just by pretending you do not know these things are any good. Actually, I was going to ask you, I was trying to recall: did you end up taking a log still?

Yes, we take the log right inside the neural network, I think twice, right.

Right, so that is interesting: we have these powerful learning machines and we still have to take the log.

I do not know about that.
Okay, I have a question which can probably be directed more to Morgan and Hynek, and Alex Waibel if he is in the room. One of the themes that came up earlier in the day was that some of this stuff was done back in the nineties, and due to limitations on the data we had to work with and the amount of computation available, there were things that could not be explored, or could not viably be explored. So the question is: are there papers from the nineties that a current practitioner should be going back to, rereading, and borrowing ideas from, things that we can perhaps improve on now? And if so, which ones?

There is a lot, I mean, it depends on what people are interested in. Like this morning there were questions about adaptation, and I cannot recall off the top of my head which papers, but there were a bunch of papers on neural-net adaptation, including improvements from Cambridge, if you are interested in adaptation. There is a large number of papers on the basic methods, and on the sequence training we were talking about at lunch: there are papers on that, and there is work done at AT&T where they did sequence training, I think around ninety-five or something; what we were doing at the time was using the gammas as the targets for the net training. And it is not just the computation, the storage, and the amount of data; it is also that oftentimes these things are cyclic. You try some things out; for example, when we did the sequence training it helped a tiny little bit in the examples we were looking at, and it was a lot more hassle, so we did not pursue it further. We had a couple of years where we were really looking into it, but it was not so great. So there were probably some things that we were not doing quite right, and now it is coming back. Also, people's perception changes: when you are enthusiastic about something, you look at a point two percent increase a lot differently than when you are not.
Are there other questions for the panel? They had lots of interesting things they were talking about.

A question for Pawel on the multiple microphones. The multi-microphone experiment you did, I guess that was with the AMI corpus?

Yes.

So you got, I guess, the nowadays predictable result that if you just concatenate the features from the different channels, you perform better than any beamforming, Wiener filtering, or whatever else you might be doing. Is that correct?

No. When you concatenate, you get some improvement over a single distant microphone, but the message from the paper is that if you can beamform, you probably should beamform.

Okay, but with the concatenated features going into the neural network, is that assuming that the speaker stays put? I mean, if my speaker were to walk around, I can imagine...

Our observation is that the network is not really learning beamforming; it is more like adapting to the most meaningful signal, to the strongest signal. So basically, if you have multiple distant microphones, one of the speakers is always, in some way, closer to a given microphone, and that is something the DNN can actually exploit, and that is the scenario we applied it to. Also, because you put multiple frames in the input, you have a very small time resolution, so you cannot really learn any time delays in this setup; it is really just picking out the strongest channel. You can do it in a more explicit way, for example by applying convolutional layers in the acoustic model with max-pooling across channels, and that also gives some gains, but that is follow-up work.
Let me be a little bit courageous and respond to what Brian was asking, because, you know, I am pretty bad at reading other people's papers, so I only have examples of papers which I wrote with my colleagues and students, and which people should read very critically. I do not mean that they are wonderful, but I still think they are interesting, and one example is this work on TRAPs, which we started at a time when it looked pretty crazy. We just took the temporal trajectory of spectral energy at a given frequency, one second long, and we asked: can you estimate what is happening in the centre of this trajectory? Of course the first result was that you got about twenty percent correct at best. And of course you get as many estimates as there are frequencies, so after that you take all these posteriors, feed them into another net, and use that to estimate the phoneme in the centre. So it was a kind of deep neural net, I would say, because it was deep, and it also worked in parallel, because it had trajectories at different frequencies. And it worked surprisingly well. So people could look at it; quite possibly we should have done better, and of course we never retrained the whole thing, which probably we should have done, and we used context-independent phonemes, which maybe we should not have. A number of things happened at the time, so the results are not entirely comparable to the mainstream numbers; this was at OGI and at Hopkins and so on, and the gains were not all that much. But I still think people should look at it and tell us what was wrong, or how it is that it works at all: you try to recognize a context-independent phoneme out of one second of context, and you actually do very well; if you look at the posteriorgram it is amazingly good, it looks more or less the way you would expect it to, I would say. Somebody else should look at it critically.
Sorry, this is about other people's work. The question is mostly one that Hank can address, but others may have something to say. I have actually spent some time trying to build spoken content retrieval, and I have worked on this kind of problem. For example, imagine a program where videos are being recorded and you want to be able to search them for keywords online. The keywords that people type into such a system are going to be rare words, they are going to be names, they are going to be new words. So the question is: what are the deep neural networks optimising? Are the acoustic models really tuned to the frequent words, leaving out the infrequent words? And the other thing is, if you are evaluating with word error rate over your entire vocabulary, is that really getting at the performance we want? I think it would be interesting to look at this from the standpoint of spoken content retrieval, restricted to the rare words people actually query.
Maybe I can address some of that. I do not think the neural networks are just focused on the head of the distribution; they do pretty well on the tails as well. But there are two aspects here: there are rare words that are in the vocabulary, and there are words that are out-of-vocabulary, that we do not have in the model at test time, and that is a different, kind of orthogonal issue. I see you shaking your head, but I think we can incorporate them. For what we do in our searches, we have a decoder graph, and we can actually incorporate a dynamic vocabulary into that graph. When we do that, we can actually recognise out-of-vocabulary words, words that we have not seen at training time. For example, I worked on a voicemail system years ago, and people's names come up all the time; our program manager's name, for some reason, was always recognised as something else, but once we switched on the dynamic vocabulary and had his name checked into the decoder graph, it was recognised, and the same happened for lots of other names. So right now the system does not actually incorporate a dynamic vocabulary, and I think the metrics you talk about also bias us towards working on the broad range; that makes it harder to justify introducing a dedicated technique that only gives us a point one or point two improvement overall, and that is a shame. I think we really do need techniques that look at the long tail, and there is still a lot of work that can be done in recognition and in language modeling to analyse whether we are getting these rare but useful words right.
I will chime in a little bit on this one too, since I can speak from experience on doing keyword search in lots of languages thanks to the Babel program, which we will be hearing about tomorrow from Mary Harper. What we found is that word error rate actually is a pretty good basic metric, even when we are doing search for words that are out-of-vocabulary in the training. The correlation between word error rate and retrieval performance on this task is not perfect, but at least to first order, large improvements in word error rate, like the ones we see using neural networks instead of GMMs, definitely lead to better retrieval performance, even on out-of-vocabulary terms. So it is not a perfect metric, but it is one that we have used for many years, and it works pretty well.
It is interesting, though: with pronunciations you will find problems with those rare words, and that affects their recognition. But as you can see from this work, we are trying to drive this in the right direction, so I am not dismissing it.
I actually want to argue a bit in favour of the direction of what the questioner is saying, because I think you can separate out the decoding and so forth from what is happening in whatever your acoustic model is, whether it is GMMs or DNNs or whatever, MLPs with many layers. It is true that you are simply going to do better on things for which you see lots of examples, and this is also true even if you are looking for particular units, you know, triphones or whatever: if those triphones occur less often, then you are not going to estimate them as well. So what you are saying is true, but it does not completely kill you.

I agree. I mean, there are issues where we have some queries that just do not get recognised by the recognizer, and for the ones that do not get recognised you find out there are only five instances of that context in the training data, so the systems are doing what they were trained to do; but something does need to be addressed.
First, one technical comment on the out-of-vocabulary words: we take a very pragmatic engineering approach, and basically the recognizer is fed by the proceedings and the slides and everything, so the new words generated that way are not that new anymore. But I had another question, to Karel and maybe also Brian, about the sequence-discriminative training on the lightly transcribed or untranscribed portion. Do you run it only on the well-transcribed portion of the data, or also on the data loosely transcribed by the recognizer? What is your experience on the YouTube videos, and maybe Brian can comment on this as well.
I will go first, because we have actually done sequence training experiments on this. Let's see; I personally do not have a lot of experience, but I think when we report numbers on three or four hundred hours of broadcast news, about half of it is manually transcribed and half of it is lightly transcribed, and I am pretty sure we see some nice gains on that, around ten percent relative, similar to the gains we see going from cross entropy to sequence training on the fifty-hour broadcast news setup. When we went to four hundred hours I do not know whether it is the amount of data or the data being lightly transcribed like this, but it is a reasonably good baseline, and again with a pretty good proportion of the training data being lightly supervised.

Anybody else have comments on that? Karel?

My comment would be that this should be investigated more deeply, but I truly believe that there are more percents to be gained there.
Okay, other comments or questions? Okay, Thomas.

This is a very general question: how much training data do we really think we will need in the future, with the DNNs?

Well, I guess that is what I was trying to motivate with my work: we are just at the initial stage of systems where, with a lot of data, you train big networks, and it takes a while. I do not know; we trained big networks, but I think it is a good challenge question: if we had ten thousand or a hundred thousand hours of data to use for training, and maybe we increased the number of context-dependent outputs to a hundred thousand, what would we get? It would be interesting just to know whether, if we did that, we would have to change the way we train models at those sizes. I would also say that more is simply better, if the transcriptions are good enough.

That sounds like a great introduction to Mark's comment, which was next.
I just wanted to mention some results that we published, where we actually did some amount of selection of the raw data for acoustic modelling and looked at the word error rate with one amount of data versus others. I think that if you are more careful about what you select, the performance improves more than it does from simply piling in more data; a lot of the improvement was coming from being more thoughtful about the data going into the model.

I am blanking on the name, but there was a visitor from Google who gave a talk at ICSI showing what looked like a definite asymptoting of performance when going up to a hundred thousand, two hundred thousand hours and so on. So I think more data helps, but it levels off after a while.
Thanks very much. I think Rich was next; I am surprised that you have been quiet all day.

So now I am making you happy.
On the issue of selection, I think you can certainly argue that selection cannot be the right thing to do; instead you should always do weighting. Whatever data you have, I certainly agree that there is good data and bad data, but the bad data is not worthless, it is just less good than the good data. For example, we have a paper here on something like that, for semi-supervised training, which we have done for a long time in the past: you transcribe some data, make a model, recognise some untranscribed data, and then use it for training. When the error rates are relatively low, where low is fifty percent or below, you can do that with your eyes closed. When the error rate gets really high, like seventy percent, that does break down, but that does not mean you should discard the data; you should just give it a lower weight, and you can show that you always get better performance if you include the data, the weight just gets lower. Yes, in principle the weight could go to zero, but you let the system decide that, and the weights do not really go to zero, they just get smaller; weights like one third or one half at error rates of eighty percent are still giving gains. That has been our experience at least.
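A minimal sketch (my own illustration, not the speaker's system) of the distinction being made here between hard selection and weighting of semi-supervised data: instead of dropping low-confidence utterances, every utterance contributes to the training objective with a weight derived from its confidence, so the weight shrinks but never has to hit zero. The weighting function itself is an assumption for illustration.

```python
import numpy as np

def selection_weights(confidences, threshold=0.5):
    """Hard selection: keep an utterance only if its confidence clears the threshold."""
    return (np.asarray(confidences) >= threshold).astype(float)

def soft_weights(confidences, floor=0.0, power=1.0):
    """Soft weighting: every utterance is kept, with weight shrinking as confidence drops."""
    return np.maximum(np.asarray(confidences) ** power, floor)

conf = [0.9, 0.7, 0.45, 0.2]
print(selection_weights(conf))   # [1. 1. 0. 0.]      -- low-confidence data discarded
print(soft_weights(conf))        # [0.9  0.7  0.45 0.2] -- down-weighted, not dropped
# The per-utterance weight would then scale that utterance's contribution
# (e.g. its gradient or frame counts) in the training objective.
```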
So just piling in more data blindly may not necessarily be the right thing. I agree with what you are saying, that there is always value in data; but whatever weight you give an utterance, you should also pay some attention to the distributional properties of things like names. So that is one part of the problem for sure: sampling the space correctly, that is really the point.
I think I showed that with my paper, where I was using the more general YouTube data: on that particular vertical, the YouTube news data, where we were getting much better error rates, adding all the data to training and building a bigger neural network with more parameters actually gave us losses on that specific domain. So there are some issues of generalization there.
I would like to add a little bit on the data question, where I will be a bit different from what has been said. I agree that of course more data is always better, but I think we can also be using less and less data. So if the question is how much data we will need, I would hold that it is less and less, because we are learning more and more about speech, and we are actually learning now how to train the nets on one language and use them on another, and so on. And maybe this anticipates a bit the talk about Babel tomorrow, however you pronounce it: I think that we are going to learn how to reuse knowledge from existing databases on new tasks. That is at least my hope, so I would like to end on this positive note: less and less, that is what I see.

Just to follow up on what you are saying: I think the lower part of the network is learning language-independent or task-independent information, so if you feed a lot of data to those layers and less data to the upper parts, that might be an approach to get there.

Actually, when we started working on GALE we had a bunch of nets trained on English, and when we were working with SRI and trying to move to Arabic, we did not have much Arabic data yet, so we just used the nets from English to begin with, and they still did something good.
One point I would like to make, thinking about human recognition: if you have maybe ten times more data than a person ever hears and the system still does not learn as well, that suggests something is limited; these numbers do not match any intuition about what should be needed.
I think we do not have any other pressing questions; actually there is time for one more.

Rich was saying that we should weight the data. Actually, I did do the contrastive experiment: in one case I used frame selection and in the other case frame weighting, and I obtained identical word error rates for both systems. So maybe, if what Rich says is true, there should be some post-processing of the confidence scores, or it is simply true that they are not uniform at all; the distribution looks more like an exponential, with a few distinct groups.
Something else on the more-data question: it depends on which kind of variability you want to cover. If you want to cover speaker variability, okay, then you need more speakers. But if you want, for example, to be robust against reverberation, you can just make the data: you present the same data with variations, added noise, and also reverberation; just train a system on several room acoustics and it becomes very robust against distant microphones. That is a very cheap trick and it works.
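A minimal sketch (my own illustration) of that cheap multi-condition trick: take the same clean training data and convolve it with a few room impulse responses, optionally adding noise, so the model becomes robust to reverberant, distant-microphone speech. The function names, the RIR list, and the noise level are assumptions.

```python
import numpy as np

def reverberate(clean, room_impulse_response):
    """Simulate a distant microphone by convolving clean speech with a room impulse response."""
    wet = np.convolve(clean, room_impulse_response)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-8)            # renormalise to avoid clipping

def augment(clean_utterances, rirs, noise_level=0.01, rng=np.random):
    """Yield one reverberated (and slightly noised) copy of every utterance per room acoustic."""
    for clean in clean_utterances:
        for rir in rirs:
            noisy = reverberate(clean, rir) + noise_level * rng.randn(len(clean))
            yield noisy                                   # train on these alongside the clean data
```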
And something else about more data: if you look at the very good neural networks present in everybody's head, they are not trained with that much data. Google already has more data, so if more data were the key to making better networks, you could just do it; so why is the gap not bigger?

Because we do not know how yet, but we could get there.
I think we are out of time, in principle, so I think we should turn this over to the conference organisers, and thank the panelists.
Thank you, Morgan. This will be short. Before we go, I would like to mention a couple of practical things. For the people that subscribed to the microbrewery tour: be aware that it is not a one-way trip, so it is very important to meet at seven. We begin tomorrow morning with the limited-resources session. One last practical comment: there is a carpooling table on the message board, for whoever is going to Prague or to other places; there is free space, so just write yourself in, and maybe we will find some nice arrangements.

Well, I would like to give thanks. I do not know which order is more or less important, but let us first thank the audience, because almost everyone is still here; thank you very much. Then thanks to the panelists and to all the speakers, and of course my greatest thanks go to today's organisers. And I still have one point left for Brian, because you have one more announcement. So this is...