So, I mean, this is basically all one session, so let's go back all the way to the first day. What is it that you didn't ask, whatever it is that you didn't say? Maybe now is a good opportunity, because it happens very often that questions come up, and then you come home and say, "Oh, I wish I had asked that." So, Alex, definitely, yes.
Sorry if I'm repeating something, but I want to close a circle. At the beginning of the meeting we learned a lot about what happens in people's brains and so forth, and I think those systems are an existence proof that language can be understood. You basically told us all that stuff, but the message I took away right after that talk is that what we are doing now is probably the wrong idea. So: why, or what, do we need to learn from those systems, and how can we do that?
You have two choices. One choice is to treat the biological system as an existence proof, or something like it. The other is to think about the computation and build a model that way; the models don't have to be the same, and there are certainly different paths. I think learning, for instance, is a very bright area, and I pay some attention to that. We can also learn a lot from the auditory system, to the extent that we understand it, and we do understand part of it. So I think there are two avenues of information to mine: one is physiology, the other is computation.
But still, what does the model suggest? What kind of evidence can we take advantage of?
I have a couple of thoughts on this question too. Regarding Monday and Tuesday, the low-resource work, and the zero-resource talks we had: it turns out that when you actually start removing supervision from the system, the things that allow you to discover units of speech automatically are not the same features that we use for supervised processing, and not the same models that we use for supervised processing. A lot of people might not be interested in that extreme of research, because it might not always be practical from a "you don't have a system you can sell" point of view. But I think that style of work, where you're forced to connect yourself to something like, for example, what Emmanuel was talking about with human language acquisition, and to make something consistent between those things, can send you to new classes of models and new representations that you're forced into, and that could eventually be fed back into the supervised case.
I'm glad that you also went back to the early days, Monday and Tuesday, which were more full of optimism, and not to Thursday, where we all got a bit discouraged. I'd like to remind you that the community is indeed diving into new types of models, for better or for worse, because of course whenever a new paradigm starts, everybody suddenly jumps in, and you may quickly get discouraged again. But these nonlinear systems, these deep neural networks or whatever they are called, are very good at letting you construct all kinds of architectures, highly parallel architectures. We have to think about new ways of setting up and training the models; maximum likelihood is gone and discriminative training is in, so I think there is plenty of work to do there. If I may speak for myself, I'm a big believer in highly parallel, layered systems.
There are many views of speech being provided, and then the big issue is how you pick up the most appropriate one, the one which might be appropriate for the situation. So, adaptation not by adapting the parameters of the model but by picking the right processing stream. Very much along those lines, I was quite impressed by what Chris was telling us: when he added a lot of noise, of course many neurons were no good, but the ones that were good were still very good. So essentially my view, and I'm speaking for myself, is that the system should be highly parallel, trained on whatever data are available, but not as one global model: many parallel models, possibly different and independent ones, and then the big issue is how you pick up a good one. That is one direction I'm thinking about; I don't know what other people think about it. But I think a whole new area of research, a whole possibility for new paradigms, is coming. I mean, that's what we have seen over the past few years with the reinvention, or rediscovery, of alternatives to GMM models.
I didn't mean to speak; I just want to give you some space for thinking about what you want to say or ask.
I would just like to pose a question about the possible eventual fate of the field. I'm not old enough to have seen this myself, but it happened, for example, with coding: after a strong technology transfer, once the standards were well established, the research field — it didn't die, but it shrank terribly, right? And this will happen one day with automatic speech recognition: we will have some standard methods, and then there won't be that much left to research; at some point it becomes applied work. And I was wondering how much time we have, because we already see a very strong technology transfer and a lot of investment by all the major technology players in the market. So are we close to really solving it? I don't mean solving semantic understanding, that's another question, but are we close to setting some standards, and then it's done? Because then what would we do research on? How close are we, ten years, twenty years? Because that would be my whole career, right? Maybe that's a side effect I'll have to live with.
I think that's a good question, and a dangerous one for your funding sources. We can all hope that it isn't going to be solved soon, and that we will stick with it. What I tell my students, and I still do, is that if they are getting into speech recognition, they are safe for life; that was my experience, somehow, I think.
Comparing speech coding to speech recognition just doesn't fly at all. I mean, speech coding, unless you're going to try for the utopia of three hundred bits per second, which then requires synthesis coding, there's just no comparison: it's very straightforward, and eventually, yes, standards were set and the field shrank. The same could be said about the coding of pictures: it's fairly trivial to code pictures; we have MPEG-3 and MPEG-4; it's all done. Picture understanding, which is the analogous thing, is a different story, sort of an open book. I do think that this field is very far from that. But I think the field will kill itself if it assumes that it has the solutions, and then continues to plough through, just working the solutions that we have right now.
So, one other thing that I would like to see happen, rather than sitting around and talking about what's wrong with the field, is possibly to construct certain experiments that could point to what's going on. Just for example, when Steve was talking before, I was thinking: you have a mismatch in acoustics and you have a mismatch in language. Try to fix one without the other, and see what the result is and where it falls.
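As a concrete illustration of that kind of controlled experiment, here is a minimal sketch; `decode` and `wer` are hypothetical stand-ins for whatever recognizer and scoring tool are actually in use, not any specific toolkit:

```python
# Sketch of the 2x2 design described above: matched/mismatched acoustic
# model crossed with matched/mismatched language model. decode() and
# wer() are placeholders for the real recognizer and scorer.
from itertools import product

def run_mismatch_grid(decode, wer, audio, refs, acoustic_models, language_models):
    """acoustic_models / language_models: {"matched": ..., "mismatched": ...}"""
    results = {}
    for am_cond, lm_cond in product(("matched", "mismatched"), repeat=2):
        hyps = decode(audio,
                      am=acoustic_models[am_cond],
                      lm=language_models[lm_cond])
        results[(am_cond, lm_cond)] = wer(refs, hyps)
    return results   # four WERs, one per cell of the grid
```

The value of the factorial layout is that each cell changes exactly one factor relative to its neighbours, so the experiment has an informative answer whether the error moves as predicted or not.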
I think that's wonderful. I want to remind people that John Pierce advised us to design clear experiments with definite answers, so that the science of speech can grow steadily, step by step, rather than through the rapture of computers and unproven theories.
I have maybe a couple of observations. We talk about neural nets right now as an improvement, and I'm sure it obviously is an improvement, but it actually goes in the opposite direction from what we're all advising ourselves to do. That is, it does nothing about any independence assumption; it's just building a better GMM, which is the piece where we said the problem wasn't. It's not modeling dependence, except to the extent that we model longer feature sequences, which we tried to do with the GMMs also.
In terms of when we will solve it: obviously not in five years, but that doesn't mean never. It would be nice if we could come up with the right model; obviously that would be the best answer. But I'm not sure about that. Speech coding and image coding, I don't believe they were solved by coming up with the right answer; I think they were solved by coming up with good-enough answers that wouldn't have been practical twenty-five years ago, because the computing wasn't enough to implement those solutions, but it is now. And so those fairly simple, fairly brute-force, expensive methods are now practical and work just well enough. So I think speech recognition could go the same way. If someone very smart picks the right answer, that's great. But if you look at how much we've improved over, say, the last twenty-five to fifty years, there's been a big improvement, say, in twenty-five years.
And if you imagine the improvement from twenty-five years ago to now happening maybe two more times — and this grows exponentially — then fifty years from now I think we could say with almost absolute certainty that speech recognition will be completely solved to all intents and purposes. That is, it'll work for all the things you want to do, it'll work very well, it'll be fast, it'll be cheap, and there will be no more research in it. Because you will have computers with, I don't know what the right term is, ten-to-the-ninth memory and ten-to-the-fifteenth computation, and you will have modeled all those differences by brute force. It still would never work to train on one thing and test on another, but you won't have to: you will have trained on everything, on samples of everything, so that it just works. So the doom and gloom doesn't have to go that way; it would just be nicer to find a more elegant solution sooner.
There is also a positive way to see that, just for fun. There are probably a few more data people in this room, but here is an actual point: there are ten to the ninth or so neurons in the auditory cortex, so that must be around ten to the ninth weights. So brute-forcing the problem at that scale — maybe it is the right way to go.
I think there is another aspect that's missing. We keep looking at speech recognition as just an acoustic signal and a model for it. I think we need to bring in the context, and we are moving towards that future, where the models know about the context, about your personality. The personalization, all these things, should be incorporated into whatever model we use, and that will resolve some of the ambiguities that you face if you are looking only at the acoustics. That's another, you know, avenue.
Actually, I would also like to continue on what she was telling us: there isn't just one solution to speech recognition, there are many, right? Just like there are many cars and many bicycles and many whatever. What I mean is, we need many solutions to a problem, and of course what we keep thinking about all the time is that we will find the one solution. I think it's okay to find many smaller solutions instead. There is no question in my mind that recognition has made enormous progress; even I use it here and there, for example Google Voice, and that is already quite something. Google Voice is a good example, since we have Google here: the solution came to the point where it's becoming useful, just like a car is useful. Do we all agree that the car is not the ideal way of moving people from one place to another? Still, it works to some extent. So maybe we should also think not only about the solution but about many solutions.
I was about to say that, and it relates to the point about data. One thing we see, at least, is that given our models — language models, acoustic models — of a particular size, what you said fits, because you were kind of suggesting ensembles of classifiers, and Rocky was suggesting personalization, and that sits well with us. We also know that if I build a model just for you, an acoustic model just for you and a language model just for you, it really works well. Maybe it's not the most elegant solution, but given enough data, enough context, and enough computational resources, it works really well, and I think you will see a lot of work in that direction. The price you will have to pay is that you have to let whoever is building the recognizer for you, Google or Microsoft or whatever, access your data. Without that, you will have to live with a speaker-independent and context-independent system, which might be good, but not as good as it can be.
Or you may also provide the means for the user to modify the technology in such a way that it works best for that given user and that given task, right? You don't necessarily have to depend on the big brother, whoever it is, to "work for me, thanks." If you provide the technology the way most technology we use is provided — think about the car: you can drive it fast, you can drive it slow, you can drive it crazily or safely, it's a little bit up to you — then the user can adapt it to his or her needs and use. So that is one way. The other way is what we keep trying: to build one big, huge model which will encompass everything. I'm more a believer in many parallel models, very much along the lines of human perception in general, because wherever you look in sensory perception you typically find many channels, each of them looking at the problem in a different way, and what we have available to us is to pick up the best one at any given time. This is something we have to work on, and perhaps — I don't want to push only the particular direction I'm thinking about, but my belief is that building one solution for everything is maybe not the best way.
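A minimal sketch of that many-parallel-models idea, under assumptions of my own (the `Stream` type and its confidence-returning `transcribe` callable are hypothetical, not an existing API):

```python
# Several independently trained recognizers, each specialized for a
# different condition, with the most confident stream chosen per
# utterance instead of adapting one global model's parameters.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Stream:
    name: str             # e.g. "clean", "babble-noise", "far-field"
    transcribe: Callable  # audio -> (hypothesis, confidence)

def recognize_parallel(audio, streams: List[Stream]) -> Tuple[str, str]:
    """Run every stream and keep the hypothesis of the most confident one."""
    best = max((s.transcribe(audio) + (s.name,) for s in streams),
               key=lambda t: t[1])    # t = (hypothesis, confidence, name)
    hypothesis, _, name = best
    return name, hypothesis
```

Hard selection by confidence is only one choice here: the confidence could be an average frame posterior or a negative entropy, and system combination (ROVER-style voting) is the obvious alternative to picking a single stream.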
I just wanted to say that the world is a dramatically different place now than it was in nineteen-whatever, and the constraints that drove the current formalism don't exist anymore. I think somebody put it in a nutshell earlier, and I agree: if somebody who didn't know anything about the way we do this started afresh, and thought about it in the current context, it would be remarkable if that person came up with the formalism that we have now. And I think we should spend more time — I don't know what we should do, but I certainly will — thinking about how to do this in a different way, given what we have and what we now know about the brain. I mean, it's remarkable how much more we know about humans.
Just a comment concerning the speaker-dependent stuff that you brought up: it seems appealing, but it's not really solving the problem. I mean, you can make a really very good speaker-dependent model, but then the person, I don't know, switches the microphone, and you are lost again. Or the call goes through some obscure digital coding which is completely clear to human beings, but because of some strange digital artifacts your whole algorithm breaks again. So I think this works for the people who are getting business done in a completely speaker-dependent environment, and I assume that for people in such an environment it must be quite powerful, because you have a huge amount of speaker-dependent data. But it's not really solving the problem; it's masking the problem. We bring down our error rate and everything, obviously, because you can train to the speaker, but it's not really the solution you're looking for. That was just a comment. And then, also, my intuition or feeling is that if I understand what people are talking about, it is easier for me to perform speech recognition. So it has to do something with semantics; it has to do something with semantics and with intelligence, and I don't know how to go on from there, but that is my intuition.
I have a comment about the semantics. My perception is that many groups, I mean many companies, not the low-resource ones, tend to treat the recognizer as a black box, and the semantic models are built on top of it. Maybe they do a little bit of patching, like phonetic matches, just in case the recognizer makes a mistake. That's okay to get something up and running, but I think it's a stupid mistake: the semantics and the recognition should be closer together. I have to say it's difficult to convince some of the people doing semantics, who don't have any speech background, that things could be done differently, but I believe the influence should flow back and forth.
It was mentioned that someone starting fresh wouldn't start with the approach we use, and that's probably really true. One way you can see it: if someone mailed out a brand new approach tomorrow, and we now apply all the instrumentation — the speaker adaptation, all the compensation, the well-developed features, now the neural networks — someone who doesn't have all that, it's just not going to work right out of the box. You cannot compensate overnight for the thousands of hours that have gone into the current framework. Remember the renaissance of neural networks: Morgan kept using neural networks in the hybrid formalism when nobody, you know, was that interested, because all the other things were working so well, and why would anyone in their right mind do it, right? But then all of a sudden we're back in this zone where people are doing it. So all I'm saying is, the lesson I take from that is: if you can get something that makes sense and that is demonstrated to be really good on a small problem, well, then maybe that would be pretty compelling.
I agree with you, though: successes like that are pretty rare. You know, if I had something like that, am I going to say we've been thinking about this for forty years? Well, we all know, thirty-six.
And maybe one thing we should stop doing is designing experiments where we just say, "I will show you that my method works a little bit better on a state-of-the-art system," because that is not really very scientific, is it? A scientific experiment is one where you isolate one problem, you change the conditions, and you see whether things go up or whether things go down. It is a well-designed experiment if things get worse when you predicted they would get worse, given your hypotheses; then you know you are reasoning right. We almost never report results like that, because our belief is that the only way to convince our peers that what we are doing is useful is to get the lowest possible word error rate on a state-of-the-art system with the currently accepted task, whatever it is at the moment. So, designing good experiments — again going back, quite seriously, to John Pierce: design clear, definite experiments so that science can grow step by step.
I think we have to learn how to do that. And since you mentioned neural networks, I want to share my personal experience. Steve is here somewhere — he may not even remember — but a long time ago, as a postdoc at ICSI, he ran an experiment where he had a context-independent HMM, a context-independent phoneme model, and a neural-net model, and the neural-net model was doing twice as well as the HMM. That convinced me. I mean, you know why we stuck with neural nets throughout the dark ages of neural nets: partially because we had never invested in HMM LVCSR systems, but also partially because I truly believed in them, because that experiment was very convincing to me. If I have a simple GMM model without any context dependency — it was too early, of course, to build a context-dependent system — I mean a context-independent HMM, which was the only thing we knew how to build at the time, and the neural net is doing twice as well as the HMM, why wouldn't I stick with this neural-net-like model? I'm glad that we did. I don't know, Steve, if you remember this experiment, but I think it even got published in Transactions eventually, right?
You know, one other way you can get out of a local optimum is to change the evaluation criteria, right? I think that is in part what Mary has done with the Babel program: keyword search is the task, and ATWV has replaced word error rate. It's not always perfect. And I think another thing that people really ought to report, when they report a word error rate, is not just the mean word error rate but the variance across the utterances. You can have a five percent word error rate, but if a quarter of your utterances are at essentially eighty percent word error rate, which can happen, then, you know, that's a good place to start figuring out how to make your technology a little more reliable.
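A small, self-contained sketch of that kind of reporting (a plain word-level Levenshtein WER, no scoring toolkit assumed; the 50% tail threshold is an arbitrary choice for illustration):

```python
# Report a per-utterance WER distribution, not just a single mean.
from statistics import mean, pstdev

def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))            # DP row for the empty reference
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i               # prev = old d[j-1] (diagonal)
        for j, hw in enumerate(h, 1):
            cur = d[j]                     # old row value, diagonal for j+1
            d[j] = min(d[j] + 1,           # deletion
                       d[j - 1] + 1,       # insertion
                       prev + (rw != hw))  # substitution or match
            prev = cur
    return d[-1] / max(len(r), 1)

def wer_report(refs, hyps, tail=0.5):
    """Mean, spread, and bad tail of per-utterance WERs."""
    per_utt = [wer(r, h) for r, h in zip(refs, hyps)]
    return {
        "mean": mean(per_utt),             # unweighted by utterance length
        "stdev": pstdev(per_utt),
        "frac_above_tail": sum(w > tail for w in per_utt) / len(per_utt),
    }
```

Note that the unweighted per-utterance mean is not the same as corpus WER, which weights by reference length; the point here is the spread and the bad tail, not the headline number.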
I was hoping you would have a comment.

I feel obligated to talk about ancient history, since I'm getting a little older now. I remember when HMMs started; we were certainly not the first to use them, we were sort of in the middle of that previous revolution.
There were two big criticisms of HMMs relative to the previous method. The previous method was: just write the rules, because we all know about speech, so just say how it works. And those systems — I wrote systems like that back in the early seventies, because I was a late adopter of HMMs — were very simple, easy to understand, extremely fast, and needed no training data. That sounds nice, right? And they could do very well on simple problems without training data. And the government argued, and other people argued, and sometimes we argued, that HMMs were too complicated, required too much storage, too much training, too much memory, and would never be practical.
Well, obviously things changed, and it wasn't only computing power, although that was a big factor; it was also learning how to make it more efficient, and we did a combination of all of those things: not being so rigid as to say we have to do it with zero data and just what I learned in my acoustic phonetics class. We could use data, and more data always helped; we learned to do speaker adaptation rather than speaker-dependent models. Okay, neural nets: neural nets worked on simple problems but not on more complicated problems, and I'd say the reason they work now is that we can now do what, two or three years ago, required two months of computation, which is just completely unacceptable — some bold people did it anyway, that's great, and then they figured out how to get better computers. All of this argues that each revolution, which happens on about a twenty-five-year cycle,
is the realization that all of the intelligent things that we thought we knew — Ken Stevens would tell us what happens with formant frequencies, and I learned all those things — all of those were not the way to go. The real understanding was not the way to go, which bothered us, because we'd like to think about, you know, the phonemes and things like that. But we know that phonemes are abstractions; we know that formants are an oversimplification; everything that we learn is an oversimplification, and computers are simply more powerful than we are. Not more powerful than the brain, but more powerful than anything that we can write down in a program.
So I think that would argue against — I'm not saying that you shouldn't keep trying to find the right answer, but I think history has told us that the right answer is to think about more efficient ways. Computing has increased by a factor of a thousand over the last twenty-five years, in cycles, memory, and storage, and it will increase by a factor of a thousand every twenty-five years forever, and that's a big number in fifty years. At the same time we can think about algorithms that are a thousand times more efficient; that has happened, and it will happen again. Google collects lots of data, other people can collect lots of data, and I think it will happen that we will have corpora that include the speech of millions of people from hundreds of languages in hundreds of environments. And if you just imagine — let's just posit — that it were simple and easy to collect millions of hours from all these environments, and to memorize all of it, learn what to do with it, compute it, store it all in something that fits in the chip that's embedded in your hand, or in your head — well, then it just works. You don't know why or how it works, but it works.
So while I have the same desire to understand intellectually what's going on, I would bet almost anything that that will be the solution that eventually works.
I'd like to make the other side of the argument. The other side is that, if you look at the history of science, the truly stupendous advances have come from understanding where our current models don't work. It's not that we shouldn't try to push models, but the thing that you're describing is engineering. I'm a fan of engineering, but true understanding comes from looking at the places where our current models fail, and all of the things that we've been doing for the past twenty years are data for the next theory. We should be paying attention to where we fail; that's where we're going to find the success.
To add to that a little bit: the old story is, if you take an infinite number of monkeys and give them an infinite number of typewriters, eventually you will get Shakespeare, and I think that's what you're suggesting. But you have a few problems. Number one, Moore's law has pretty much come to an end, and the industry is facing that problem: unless there is a dramatic technological shift, you're not going to get the kind of doubling that we've seen every eighteen months. Basically, quantum mechanics eventually gets in your way; the line widths are so narrow now that there are not enough atoms left to allow them to keep shrinking.
Somebody else said something about what would happen if people started this research all over again: would they find the same solution? I'm reading a marvellous book now, "Design in Nature," which tries to explain evolution, not just of humans but of rivers and everything else, in terms of physical laws. I highly suggest reading it; it's very entertaining. But basically, going back to the coding: I think when the coding was done, it really was fundamental, in the sense that we understood that pitch and spectrum were the essence. For example, the coder that runs on your cell phone is really meant to code speech; if music is playing in the background, it totally destroys it, because the coder is adapted to the speech signal. So it wasn't just a random brute-force process; it really depended on LPC first, then on coding the residual, and all of that, and that's why we have such good coders. And I think the theory needed there was of course much more trivial than it is in language.
So I do think that we need to continue the work that we're doing, but on the other hand to hope for some paradigm shifts that would be more than just increasing the stochastic machinery by introducing neural nets. From where I sit, a thousand miles up, neural nets are essentially a generalization of HMMs: they are both stochastic models; it's just that in an HMM you have essentially a single layer.
I think the point about how much data we need to solve the problem by brute force also comes down to the question of artificial intelligence, right? So, to continue with these doomsday scenarios, an even scarier one is that one day we're going to reach the singularity, right? And when this happens, at that moment we're going to lose control of abstraction: machines are going to be better than us at creating their own abstract structures. So all this prior knowledge we want to put into our models is going to be our way of seeing things, but machines are going to have their own way of seeing things. And in these discussions about having to look at the problem and think the way humans do, I think, well, it is already happening that machines create their own abstractions, and they are not intuitive to us; but since they are going to do better than us in the long term, we might be better off thinking not about how I like to think about the problem but about how I can express the problem. Okay, a generative model is intuitive to me, but maybe it should instead be intuitive to the machine, or to the hardware, right? Deep neural networks are, to some extent, like that. Okay, we are still very far away from that singularity, right? But when we reach it, maybe we'll be better off thinking that way.
And AI people are really always looking where the light is: basically, after fifty years, artificial intelligence has essentially developed tremendous methods for optimization and classification, and very little on inference and logic.
So, I'm very glad the field is alive and well, as I can see from this discussion. It really reminds me of what we were reminded of earlier about one of the first ASRU workshops, and I also remember, even in my introduction, that people were discussing and fighting, and there was always this desire to move the field further. I'm very happy that, I think, we have succeeded in that to a large extent in this ASRU too, so let's just keep it going. Otherwise, I will pass the microphone to Honza, who has something to say about whether it is time for the poster room.
Basically, let me first make one comment on this discussion. I think what we were discussing — the data, the models, the adequacy of the models, measured by accuracy — turned out a little bit speech-centric, a little bit too selfish, I find. I think we forgot about the users of our technologies, because I have the impression that rarely would people just take the output of ASR and say, this is the output, and it finishes there; most of the time it is just an intermediate product that will be further used by someone. So I actually like what was said about that: for the web, the right metric is not WER but the click-through rate; for call-center traffic it might be customer satisfaction, and they have measures for it; for a government agency it might be the number of caught bad guys; and so on and so on. So I think there is still quite some work to do in propagating these target metrics back to our field. I don't know if there has been sufficient work on this; maybe those users are not that interested in ATWV or WER and stuff like this; they just need to get their work done.
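As a toy illustration of scoring by a downstream target metric instead of WER, one might compare systems by whether the terms a user will actually search for survive in the transcript; the task and the names here are invented for illustration:

```python
# Toy downstream metric: the fraction of utterances whose query term
# is still findable in the ASR output, a crude proxy for click-through
# or retrieval success rather than transcription accuracy.
def task_success_rate(hypotheses, queries):
    hits = sum(q.lower() in h.lower() for h, q in zip(hypotheses, queries))
    return hits / len(queries)
```

Under a metric like this, two systems can trade places relative to their WER ranking: the one that garbles filler words but preserves the query terms wins, which is exactly the kind of signal that never propagates back when only WER is reported.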
Okay, so we can close. Sorry, I didn't mean that to be the final technical comment. No comments on this one? Last one, then.