So, well, thank you — I want to thank the organizers for allowing me to be a sort of surprise talker. I'm going to tell you a little bit about what we have been doing in terms of trying to understand language acquisition. Now, as a parent, when we try to understand how babies are learning language, it seems very obvious: we just use intuition, and, well, maybe babies just have to listen to what we're telling them, right? It's very simple.
Now, as psychologists, we have been trained to try to take the place of the baby: how does it feel to be a baby in this situation? Well, things get a lot more complicated, because the baby doesn't understand what it is being told; it just has the signal. What I would like to do here is take a third perspective, that of an engineer. I'm not an engineer myself, I'm a psychologist, but the idea is to see how we could basically construct a system that does what the baby does. That's the perspective we would like to push.
So, this is basically what we know — or think we know — about babies' early language acquisition. The timeline here is in months: this is birth, and this is the first birthday. As you can see, babies learn quite a number of things quite quickly. Around here babies start to say their first few words, and before that they produce various vocalizations, but actually, before they produce their first word, they learn quite a bit about their own language. For instance, they start to be sensitive to the sound patterns that are typical of the language, they start to build some representation of the consonants and vowels, and they start to build what are basically language models with sequential dependencies, et cetera. So this is taking place very early, way before they can show us that they have learned these things. At the same time, they are also learning other aspects of language, in prosody and in the lexical domain.
So, how do we know that babies are doing this? Well, this is the job of us psychologists: to interrogate babies that don't talk. We have to find clever ways to build situations in which the baby will, for instance, look at a screen, or suck a little harder on a pacifier, and we use this behavior of the baby as a way to control the stimuli they are presented with. In the typical experiment — the one that basically started this field in the seventies —
Eimas did a study where you present the same syllable over and over again, each time the baby produces this little sucking behavior — so the syllable is 'ba', and you can see here that the frequency of the sucking decreases, because it's boring, it's always the same syllable. But then suddenly you change the syllable, say to 'pa', and now the baby sucks a lot more. That means the baby has noticed that there was a change, and this is compared to a control condition in which exactly the same syllable continues rather than switching to a slightly different one.
With this kind of design you can basically probe babies' perception of speech sounds, and ask: can they discriminate 'ba' from 'pa', 'dot' from 'got', and so on? You can also probe memory: have they memorized, have they segmented out, particular frequent or interesting parts of their environment — the sounds in that environment? There is also fancier equipment for doing the same kind of experiments, but I'm not going to talk about that.
So, the question that really interests me here is: how can we understand what babies are doing? If you open up linguistics — I mean psycholinguistics or developmental psychology — journals, you find hypotheses that are interesting, but I'm not going to talk about them, because unfortunately these theories do not really give us an idea of the mechanisms that babies are using for understanding speech. You do find, in psychology and also in linguistics journals, publications that try to cut the learning problem down into sub-problems to be solved. For instance, some people — and I'm going to talk more about this — have studied how you could find phonemes from raw speech using some kind of coarse unsupervised clustering; others, how, once you have the phonemes, you could find the word forms; or, once you have the word forms, how you could learn some semantics, et cetera, et cetera.
These papers rely on much less technology than engineering ones — they are not done by engineers — and one particular aspect of them is that they focus on a really small part of the learning problem, while making a lot of assumptions about the rest of the system. So the question we can ask ourselves is: could we make a global system that would learn much of what the babies learn by concatenating these elements? What I will try to demonstrate to you is that such a system simply does not work: it doesn't scale, it has to incorporate language-specific particularities, and it doesn't capture what the babies are doing anyway.
I'm going to focus on this particular part — and we have talked a lot about it; at least two talks today focused on how you could discover units of speech from raw speech. In psychology, people really believe that the way babies do this is by accumulating evidence and doing some kind of unsupervised clustering.
A couple of papers were published showing basically that babies at six months are able to distinguish sounds that are not contrastive in their language — say, two variants of 'da' that a speaker of a language with that contrast hears right away, but that most of you wouldn't hear. And contrary to what they can do at six months, by twelve months the babies have lost that ability, because the contrast is not used in their language. So the hypothesis about how babies do this is that they accumulate evidence and do some kind of statistical clustering based on the input that is available in the language.
Now, a number of papers have tried to demonstrate that you can build a system like this. However, most of these papers have dealt with a very small number of categories: they are proof-of-principle papers that construct data according to chosen distributions and show that you can recover those distributions by doing some kind of clustering. That's nice, but does it scale? As everybody here knows, speech is more complicated than that: with running speech, and even more with conversational speech, the sounds are not separated, they are not so easily segmented — segmentation is part of the problem, et cetera.
So this is where I started to get involved in this problem, working with Aren Jansen at Hopkins. The idea was to apply a really simple unsupervised clustering algorithm to raw speech, to running speech, and see what we get: could we get phonemes out of it?
This is what we did. You start with a very simple Markov model with just one state, and then you split the states in various ways: you can split in the time domain, or in parallel, so that you have two different versions of each sound, and you let this graph-growing process iterate until you have a very complicated network. In order to analyze what the system was doing, what we did was to apply a decoding using finite-state transducers, so that you can have some interpretation of what the states mean.
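To make the procedure concrete, here is a minimal sketch of such a state-growing loop — my own simplification built on hmmlearn's GaussianHMM rather than the actual successive-state-splitting code: the number of states is repeatedly doubled, and a "split" is kept only while it improves held-out likelihood.

```python
# Minimal sketch of a bottom-up state-growing loop (not the original algorithm):
# grow an HMM over speech frames by doubling the number of states and keeping
# the larger model only if it improves held-out log-likelihood.
import numpy as np
from hmmlearn import hmm

def grow_hmm(train_feats, dev_feats, max_states=64, seed=0):
    """train_feats, dev_feats: (n_frames, n_dims) arrays, e.g. MFCC frames."""
    best_model, best_score, n_states = None, -np.inf, 1
    while n_states <= max_states:
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=20, random_state=seed)
        model.fit(train_feats)
        score = model.score(dev_feats)      # held-out log-likelihood
        if score <= best_score:             # stop when splitting no longer helps
            break
        best_model, best_score = model, score
        n_states *= 2                       # "split" every state in two
    return best_model
```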
What we found was that the units discovered by this kind of system are very small — smaller than phonemes — and even if you concatenate them (these are the most frequent string concatenations), they correspond not to phonemes but rather to contextual allophones. There is also another problem, which is that the units are not very talker-invariant.
Now, this is perhaps not very surprising for those of you who work with speech — and that's the majority of people here — because we all know that phonemes are not going to be found in such a way. But there is one problem I want to insist on, because I think it is quite crucial and we touched on it in earlier discussions: languages do contain elements that you will discover if you do unsupervised clustering, but that there is no way to merge into abstract phonemes, and this is due to the existence of allophones. In many languages — in most languages — you have allophonic rules; in French, for instance, the /r/ has a voiced variant in some contexts and a devoiced one in others. These two sounds exist in the language, there is nothing you can do about that, and they actually are two different phonemes in some other languages. So you are, in fact, going to discover these units, and with a purely bottom-up approach there is no way to get rid of them.
Well, you could say — and that was actually one of the questions discussed before — how many phonemes, how many units, do you want to discover? And it was sort of said that it doesn't really matter: we can take sixty-four, we can take a hundred. Well, actually it does matter for the rest of the processing; at least that's what we discovered with a PhD student of mine. What we did was to vary the number of allophones used to transcribe the speech, and then feed that into a word segmentation algorithm — the kind that was also referred to before; we used one of Sharon Goldwater's type of algorithms. Here what you have is the segmentation F-score, and this is the number of allophones, expressed as the number of alternate word forms that the allophones create, and you can see that performance is affected: it is dropping. This is shown here for English, French and Japanese, and in some languages, like Japanese, it has a really detrimental effect: if you have lots of allophones, it becomes extremely difficult to find words, because the statistics just break down. So it does matter to start with good units.
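For reference, here is a toy version of the segmentation F-score plotted on that graph — my own implementation, so the published evaluation may differ in its details: a predicted word boundary counts as correct only if it falls at exactly the same position as a gold boundary.

```python
# Toy segmentation F-score: an utterance is a list of words, each word a list
# of phones; boundaries are the positions between phones where a word ends.
def boundaries(utterance):
    """e.g. [['k', 'a', 'n'], ['a', 'r']] -> {3}"""
    pos, cuts = 0, set()
    for word in utterance[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts

def segmentation_fscore(gold, pred):
    """gold, pred: two segmentations of the same phone sequence."""
    g, p = boundaries(gold), boundaries(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)
```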
Here is another experiment, reported by Aren, where again you replace the phonemes with some kind of unsupervised units, and if you feed that into a word segmentation system you end up with very poor performance. So that means that phonemes, at least with a simple-minded clustering system, cannot be learned acoustically. There are two ideas from there that I want to discuss: one is to use a better auditory model, and the other is to use top-down information.
So, this is just a summary of what I said. What we have right now is this: with somewhat simplified or fake input, some unsupervised clustering approaches have been successful; with more realistic input, we have systems that work, but they use heavily supervised models; and the question is whether we can build systems that combine this portion of the space.
Now, I'm not going to present much of the work we did on unsupervised phoneme discovery itself, because for me there was a preliminary, very important question: how do we evaluate unsupervised phoneme discovery? Imagine you have a system that discovered units — how can you evaluate whether these units are good or not? Traditionally, people use phone error rate, which means you train a phone decoder; that is what we did with the successive state splitting — it was this finite-state transducer that translated the states of the system into phonemes. The problem, of course, is that when you do this, a lot of the performance at the end may be due to the decoder. It may not be that the units are good; it may just be that you have trained a good decoder. And also, we don't even know that phonemes are the relevant units for infants; maybe they are using something else — syllables, diphones, some other kind of unit.
So the idea is to use an evaluation technique that is suited to this kind of work. The central idea is that we don't really care whether babies, or the system, are discovering phonemes as such; what we care about is that they are able to distinguish words that mean different things. 'Dot' and 'doll' mean different things, so they should be distinguished, but we don't care how the system manages to do it. This is the idea underlying the same-different task that Aren has been pushing all these years, and we have a slightly different version of it, which we call the ABX task.
The same-different task goes like this: you are given two word tokens and you have to say whether they are the same word. You compute the acoustic distance between them, and these are the distributions of the acoustic distances. What Aren showed was that if you compare tokens from the same talker, the two distributions are quite different, so it is easy to say whether it is the same word or not; but if the talker is different, it becomes a lot harder.
What we did was to build on this and ask a slightly different question. I give you three things: 'dot' said by one talker, 'doll' said by the same talker, and then 'dot' said by a second talker. Now you have to say whether this last item is closer to the first one or to the second one. So it is a simple psychoacoustic task.
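Here is a minimal sketch of the decision rule behind this ABX task, under the assumption of a DTW distance over frame-level features (the actual studies use their own distance functions): X is assigned to whichever of A or B it is closer to, and the proportion of correct assignments over many triplets gives the ABX score.

```python
# Sketch of an ABX decision with a simple DTW distance (illustrative only).
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(a, b):
    """a, b: (n_frames, n_dims) feature matrices, e.g. MFCCs of two tokens."""
    cost = cdist(a, b)                                  # frame-to-frame distances
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[len(a), len(b)] / (len(a) + len(b))      # length-normalized

def abx_correct(A, B, X):
    """True if X (same word as A, different talker) is closer to A than to B."""
    return dtw_distance(X, A) < dtw_distance(X, B)
```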
For me this is really inspired by the type of experiments we do with babies: with it we can compute d-primes, we can compute values that have a psychological interpretation, but we can also carry out a very fine-grained analysis of the kinds of errors the system makes. We applied this task to a database of syllables recorded in English across talkers, and this is the performance you get. What's nice with this kind of task is that you can really compare humans and machines: this is the performance of humans, this is the performance of MFCC coefficients, and you can see that there is quite a bit of difference between these two kinds of systems.
Now, these trials were actually run on meaningful items, so it could be the case that humans were using meaning to do the task. But this kind of task can then be used to test different kinds of features, which is nice, and that's what we did with a student of mine, Thomas, and also with Hynek.
Here it is the same task I was just talking about — cross-talker phone discrimination. You can apply a typical processing pipeline, where you start with the signal, compute the power spectrum and all kinds of further transformations, and see whether each of these successive approximations you apply to the signal actually improves that particular task or not. In this graph you have performance as a function of the number of spectral channels, and what we found was that the phone discrimination task requires fewer channels than, for instance, a talker discrimination task, which we can also run: there you have 'dot' spoken by two speakers, then a further item that is a different word spoken by one of the first two talkers, and you have to say which talker it is.
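As an illustration of the knob being varied on that graph, here is a small sketch — assuming librosa, and a filterbank that may differ from the one used in the actual study — that computes a log-mel representation with a configurable number of spectral channels.

```python
# Sketch: a log-mel front end whose number of channels can be varied.
import numpy as np
import librosa

def logmel(wav, sr, n_channels):
    """wav: mono waveform, sr: sample rate, n_channels: number of mel bands."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_channels)
    return np.log(mel + 1e-8).T             # (n_frames, n_channels)
```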
I'm not going to say more about that, but the idea is that specifying the proper evaluation tasks is going to help us devise the proper features — features that would work for unsupervised learning.
This is work we started with another postdoc of mine, who applied deep belief networks to this problem — we already learned a lot about these during the first day of talks. What you can do is compare the performance of the deep belief network's representations, at each of its levels, on this kind of discrimination task. This is the MFCC baseline, and this is what you get at the first level of the DBN without any training — so you are actually doing better; this is the error rate here. And if you do some unsupervised training, like restricted Boltzmann machine pre-training, it actually gets slightly worse on that task. Now, it's not that this pre-training is useless: it helps a lot when you do supervised training afterwards; but if you don't do supervised training, it is actually not doing much. So, again, that's what I'm saying: it is important to have a good evaluation task for unsupervised problems, because then you can discover whether your unsupervised units are actually any good or not.
Okay, so in the time that remains I would like to talk a little bit about the other idea, the idea of using top-down information. That's an idea that was not, at least to me, very natural, because I had this notion that babies should first learn the phonemes, the elements of the language, before learning higher-level information. But of course phonemes are part of a bigger system, and so maybe the very definition of the phonemes emerges out of that bigger system. The intuition is that maybe babies are trying to learn the whole system, and while they do that, they refine each of its components.
Of all the different things we tried, I'm going to talk about the idea of using lexical information. It is a very simple idea, and it goes as follows. Typically, if you take two words at random — or you take your whole lexicon and try to find minimal pairs, pairs that differ in only one segment, for instance 'canal' and 'canard' — you don't find a lot of them. You do find some, but they are statistically very infrequent. So now imagine you are looking at your lexicon: you have some initial proposal for the word forms, and you look at the set of apparent minimal pairs you can find. If you find a lot of minimal pairs for a given contrast, you can be pretty sure that it is not really a phonemic contrast; those two sounds are probably allophones. That's the intuition.
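Here is a toy sketch of that counting strategy — my own illustration, not the code from the study: for every pair of phones, count the apparent minimal pairs in the lexicon; a contrast that generates many of them is probably allophonic (the "pairs" are just variants of the same words), while a true phonemic contrast generates only a few.

```python
# Count apparent minimal pairs per phone contrast in a (proto-)lexicon.
from collections import Counter
from itertools import combinations

def minimal_pair_counts(lexicon):
    """lexicon: iterable of word forms as tuples of phones, e.g. ('k','a','n','o')."""
    counts = Counter()
    for w1, w2 in combinations(set(lexicon), 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1:                       # differ in exactly one segment
            counts[tuple(sorted(diffs[0]))] += 1
    return counts   # high count -> likely allophones, low count -> likely phonemic
```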
How did we test that? We started with a transcribed corpus, which we converted into phonemes; then we generated random allophones and re-transcribed the phonemic transcription into a very fine-grained one containing all these allophones, varying the number of allophones. That is how we generated the corpus, and the task is then to take this corpus and find which pairs of phones belong to the same phoneme, using only the information in the corpus. So that's what we did: we compute the measure — namely, the number of distinct minimal pairs you have for each contrast — and we compute the area under the curve, which is this axis right here, against the number of allophones. Don't look at this curve here; this other one is the relevant one: it shows the effect of using the strategy of counting minimal pairs. The performance is quite good, and it is actually not really affected negatively by the number of allophones you had.
So this strategy works quite well, but of course it is cheating, right? Because there I assumed that the babies had the boundaries of words, and they don't — in fact, I showed you just before that it is extremely difficult, if you have lots of allophones, to find the boundaries of words. So that is a kind of circularity that we would like to avoid.
The idea that Martin, the postdoc, had — which was great — was to say: well, maybe we don't need an exact lexicon; maybe babies can build a proto-lexicon with whatever segmentation heuristics they have. It is going to be incomplete, it is going to be wrong, it will have many non-words in it, but still it might be a useful thing to have. And that is what we found. There we used a really, extremely rudimentary segmentation heuristic based on n-grams — we took the ten percent most frequent n-grams in the corpus and called that the lexicon — so it was really pretty awful. But it still provided performance that was almost as good as the gold lexicon.
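A minimal sketch of such a proto-lexicon builder follows — the n-gram lengths are hypothetical; the talk only specifies keeping the ten percent most frequent n-grams.

```python
# Build a crude proto-lexicon: the most frequent phone n-grams in the corpus.
from collections import Counter

def proto_lexicon(utterances, n_min=2, n_max=5, keep_fraction=0.10):
    """utterances: list of phone sequences (lists of phone symbols)."""
    counts = Counter()
    for utt in utterances:
        for n in range(n_min, n_max + 1):
            for i in range(len(utt) - n + 1):
                counts[tuple(utt[i:i + n])] += 1
    n_keep = max(1, int(len(counts) * keep_fraction))
    return [ngram for ngram, _ in counts.most_common(n_keep)]
```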
Then Martin went to Japan, and I had a doctoral student who said: well, we could go even further than that — maybe babies can construct some approximate semantics.
The reason why that could be useful is the following. Under the lexical strategy, two surface variants get declared allophones when they show up as an apparent minimal pair, which is what we want for genuine allophones. But what about this case? These are two genuinely different words in French, 'canal' and 'canard', and if I applied the same strategy I would declare that /l/ and /r/ are allophones, which is wrong — I would end up with a sort of Japanese French, and that is not what we want. On the other hand, if we have some idea, even a vague one, of the meaning of 'canard' — we may not know exactly what it is, but it is some kind of bird, whereas the other one is some kind of water thing — then maybe that is sufficient to distinguish such cases from the genuine allophone cases.
So what we did there was to run the same kind of pipeline, but we made the problem more realistic: instead of random allophones, we generated the allophones using tied three-state HMMs. That actually makes the detection much more difficult — the allophones are more realistic, but the lexical strategy I presented before starts having trouble with them. The idea then is that you take this corpus and you don't cheat anymore: you try to recover possible words from it, you do some semantic estimation, and then you compute the semantic distance between pairs of phones.
How does it work? For word segmentation we used state-of-the-art minimum description length or adaptor grammar algorithms. We know they work, but we know they don't work very well — especially with lots of allophones, they give a pretty bad estimate of the lexicon.
But we still take that as our lexicon, and we apply latent semantic analysis, which basically counts how many times the different terms occur in the different documents; here we took as documents chunks of ten sentences. So we take the whole corpus, segment it into ten-sentence documents, and compute this matrix of counts, which we then decompose, arriving at a semantic representation in which each word is a vector. People in NLP now do much more sophisticated things than this; it is a pretty old-fashioned kind of semantic analysis. But what is nice is that we can compute the cosine between the proposed semantic representations of two word forms, and the idea is that if they are two allophonic variants of the same word, they should have quite similar vectors, because they occur in the same contexts.
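Here is a compact sketch of that pipeline — assuming scikit-learn; the document size, dimensionality and tokenization are illustrative: ten-sentence documents, a truncated SVD of the term-document count matrix, and a cosine similarity between the resulting word vectors.

```python
# LSA sketch: word vectors from a term-document matrix of 10-sentence chunks.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_vectors(sentences, chunk=10, n_dims=50):
    """sentences: whitespace-tokenized sentences (strings of word forms).
    n_dims must be smaller than the number of resulting documents."""
    docs = [" ".join(sentences[i:i + chunk]) for i in range(0, len(sentences), chunk)]
    vectorizer = CountVectorizer(token_pattern=r"\S+")   # keep any token
    term_doc = vectorizer.fit_transform(docs).T          # terms x documents
    vecs = TruncatedSVD(n_components=n_dims).fit_transform(term_doc)
    return dict(zip(vectorizer.get_feature_names_out(), vecs))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```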
So here are the results. In this study, because we generated the allophones from HMMs, we can also compute the acoustic distance between them. Obviously, acoustic distance is going to help: if two phones are quite close to one another, maybe they should be grouped together, because they are likely allophones. But we also know that this alone is not enough: with only the bottom-up information it does not work — the performance is not bad, but it is still not perfect. So here is the performance on a task where I give you two phones and you have to tell me whether they are allophones or not; chance is fifty percent. This is the percent correct if you use the acoustics only, for English and Japanese — something is missing on the slide, namely the number of allophones, which goes from a hundred to five hundred to a thousand. And this is the effect of the acoustic distance and of the semantic distance: the semantic distance alone is almost as good as the acoustic distance, and when you combine them you get very good performance.
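As an illustration of the combination step — my own sketch, not the exact classifier used in the study — the acoustic and semantic distances of each candidate pair can be fed to a simple logistic regression that decides whether the pair are allophones of the same phoneme, with accuracy compared against the fifty-percent chance level.

```python
# Combine acoustic and semantic distances to classify candidate allophone pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def allophone_pair_accuracy(acoustic_d, semantic_d, labels):
    """acoustic_d, semantic_d: arrays of distances, one entry per phone pair;
    labels: 1 if the pair are allophones of the same phoneme, else 0."""
    X = np.column_stack([acoustic_d, semantic_d])
    clf = LogisticRegression()
    return cross_val_score(clf, X, labels, cv=5).mean()   # chance is 0.5
```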
That shows you can use this kind of semantic representation even though it is computed from an extremely bad lexicon. At this point, the proportion of real words that you find with the adaptor-grammar type of framework is about twenty percent — so your lexicon is twenty percent real words and eighty percent junk. Nevertheless, that lexicon is enough to give you some semantic top-down information, which shows that the semantic information is very strong.
Alright, I'm going to wrap up very quickly. I started with the idea that babies would proceed in a bottom-up fashion. That does not work: it does not scale, and it does not account for the fact that babies are learning words before they have really zoomed in on the inventory of phonemes. In fact, there are now results showing that even at six months babies have an idea of the semantic representations of some words. So basically babies are learning everything at once. So we would like to replace that scenario with a system like this one, where you start from raw speech and you try to learn all the levels at the same time. Of course you are going to do a bad job: the phonemes will be wrong, the word segmentation will be wrong, the semantics will be awful. But then you combine all of this and you make a better next iteration. That would be the proposed architecture for babies.
Now, for this to work we have to do a lot more work. We have to, of course, stop relying on target-language transcriptions and try to approximate, as much as we can, the real input that babies are getting. We have to quantify what it means to have a proto-lexicon or proto-semantics: I gave you an idea of an evaluation procedure for what it is to have proto-phonemes, and we have to do the same for proto-words, proto-semantics, et cetera, because these are all very approximate representations. And then there are the synergies, which are what I just described: when you try to learn the phonological units alone you do a rather bad job, and semantic representations alone are difficult too, but if you try to learn a joint model you are going to do better. And there are a lot of potential synergies you could imagine.
The last thing I have to do as a psychologist, of course, is to go back to the babies and test whether they are actually doing this, but I'm not going to talk about that now. Finally, why should we do this? I think this reverse engineering of the human infant is really a new kind of challenge, and I think both sides can bring a lot to it. Psychologists can bring ideas, we can bring interesting corpora, and we can test whether the ideas correspond to realistic capabilities; engineers can bring algorithms and also large-scale tests on real data, which is very important.
And we have a lot to work on, because this would be one possible overall architecture: I tried to put in everything that has been documented somewhere in terms of potential links between the different levels you are trying to approximate, and I guess you would have to add a lot more. Babies are also actually articulating, so maybe this articulation feeds back to help in constructing the sub-lexical units; they are also interested in the faces of caretakers, and they have a lot of rich semantic input for acquisition. All these representations have to be put in at some point, but I think what we have to do is establish whether we really do have interesting synergies or not; if we don't, then we can refactor the system into separate subsystems. And that is all I have to say. This is the team — these are the very nice colleagues who helped with this work.
Okay, so we are going to have an abbreviated panel, so we don't have a whole lot of time for questions, but let's take one or two.
What do you think about infants learning something between phonemes and words, like syllables, which have a nice sort of acoustic chunkiness to them?
I mean, that was actually the hypothesis I had when I did my own PhD — the role of the syllable. I guess that is perfectly possible. The thing is that, in a way, I think what deep belief networks are doing by having these input representations where you stack about a hundred and fifty milliseconds of signal is already going in that direction — a hundred and fifty milliseconds is basically the size of a syllable.
So I guess that is behind it: if this is what people are using, the reason is that this is basically the domain of the syllable, where you have coarticulation, so that is where you can capture the essential information for recovering the contrasts. I think there are many ways in which syllable-type units could play a role: it could be in an implicit fashion, like I just said, or you could actually try to build recognition systems with units that have the syllable shape, which is another way to do it. We know that infants count syllables, for instance: at birth, if you present them with three-syllable words and then switch to two syllables, they notice the change; they do not notice the change if you go from four phonemes to six phonemes, which is the same ratio of change. So we have evidence that syllables — or at least syllable nuclei — are things they pay a lot of attention to.
Thank you for your talk, Emmanuel. I think you once told me that almost from day one infants can sort of imitate articulatory gestures — that this is somehow hardcoded in. Okay, I don't know how you do that experiment, but on the other hand, all of this — acquiring the phoneme inventory, word boundary segmentation, lexicons — seems to be part of the plasticity of learning in infants. So why isn't some notion of starting from articulatory gestures, since that seems to be there from the beginning, part of your model? Or should it be?
So I have actually, of course, proposed working on this, and a number of people have tried to incorporate it, using deep learning systems that learn speech features and articulatory features at the same time. If you train like this and then present only the speech features, you obtain better decoding than if you had learned from speech features alone. So we know there is a sense in which this could work; but of course this work was done with real adult articulation, and the baby's articulation is much more primitive, so it is not clear it will help as much. Still, that is one of the things we want to try.
So I think we are out of time, but Emmanuel, I believe you are going to be here tomorrow as well, so I encourage people to go and ask all these questions — I think this is very relevant work for this community.