So, well, thank you. I want to thank the organisers for allowing me to be a sort of surprise talker. I'm going to tell you a little bit about what we have been doing in terms of trying to understand language acquisition.

Now, as a parent, when we try to understand how babies are learning languages, it seems very obvious: we just use our intuition, and maybe the babies just have to listen to what we're telling them, right? It's very simple. As psychologists, though, we have been trained to try to take the place of the baby. How does it feel to be a baby in this situation? Well, it's going to be a lot more complicated, because the baby doesn't understand anything of what we tell it; it just has the signal.

What I would like to do is to take a third perspective, which is the perspective of an engineer. I'm not an engineer myself, I'm a psychologist, but the idea here is to see how we could actually construct a system that does what the baby does. That's the perspective we would like to push.

So this is what we know, or think we know, about babies' early language acquisition. The timeline here is in months: this is birth and this is the end of the first year. As you can see, babies are learning quite a number of things quite quickly. Here babies are starting to say their first words, and before that they are producing various vocalizations. But actually, before they utter their first word, they are learning quite a bit of information about their own language. For instance, they start to be sensitive to the vowels that are typical of the language, they start to build some representation of the consonants, and here they are basically starting to build language models with sequential dependencies, et cetera. So this is taking place very early, way before they can show us that they have learned these things.

At the same time, they are also learning other aspects of language, in the prosodic and in the lexical domain.

So how do we know that babies are doing this? Well, it's the job of psychologists to try to interrogate babies that don't talk. We have to find clever ways to build situations where the baby is going to, for instance, look at a screen, or suck a little harder on a pacifier, and where this behaviour of the baby is a way to probe the stimuli they are being presented with. A typical experiment, which was basically the beginning of this field in the seventies, goes like this.

In this study you present the same syllable over and over again; each time the baby produces this little sucking behaviour, we present the same syllable, say 'ba'. You can see here that the frequency of the sucking is decreasing, because it's boring, it's always the same syllable. But then suddenly you change to another syllable, say 'pa', and now the baby sucks a lot more. That means the baby has noticed that there was a change, and this is compared to a control condition where the same syllable just continues: exactly the same syllable versus a slightly different one.

With this kind of idea you can basically probe babies' perception of speech sounds. You can ask whether they discriminate 'ba' and 'pa', 'da' and 'ga', and all kinds of sounds. You can also probe memory: have they memorised, have they segmented out particular frequent or interesting parts of the sound environment? There is also more fancy equipment for doing the same kind of experiments, but I'm not going to talk about that.

The question that really interests me here is how we can understand what babies are doing. If you open up psycholinguistic or developmental psychology journals, you do find hypotheses that are interesting, but I'm not going to talk about them, because unfortunately these theories do not allow us to get an idea of the mechanisms that babies are actually using for understanding speech.

You do find, in psychology and also in linguistics journals, publications trying to cut the learning problem down into subproblems. For instance, some people — I'm going to talk more about that — have studied how you could find phonemes from raw speech using some kind of coarse unsupervised clustering; others, once you have the phones, how to find the word forms; or, once you have the word forms, how to learn some semantics, et cetera.

These papers use rather less technology than in engineering — they are not done by engineers — and one particular aspect of them is that they focus on only a small part of the learning problem, while making a lot of assumptions about the rest of the system.

So the question we can ask ourselves is: could we make a global system that would learn much of what babies learn, by concatenating these elements? What I will try to demonstrate to you is that such a system simply does not work. It doesn't work because it doesn't scale, it incorporates some circularities, and it doesn't represent what babies are doing anyway.

I'm going to focus on this particular part — and we have talked a lot about it, at least two talks today focused on it — how you could discover units of speech from raw speech. In psychology, people really believe that the way babies do this is by accumulating evidence and doing some kind of unsupervised clustering.

A couple of papers were published establishing that babies at six months are able to distinguish sounds that are not in their language. For instance, they can distinguish a dental from a retroflex 'da': if your language uses that contrast you hear the difference right away, but most of you wouldn't hear it. Babies can do it at six months, but by twelve months they have lost that ability, because the contrast is not used in their language. The hypothesis about how babies do this is that they are accumulating evidence and doing some kind of statistical clustering based on the input that is available in the language.

A number of papers have tried to demonstrate that you can build a system like this. However, most of these papers have dealt with a very small number of categories. They are proof-of-principle papers: they construct data according to known distributions and then show that you can recover the categories by doing some kind of clustering. That's nice, but does it scale? As everybody here knows, speech is more complicated than this. Real speech is running speech, conversational speech; the sounds are not separated, not easily segmented — segmentation is itself part of the problem.

This is where I started to get involved in this problem, working with colleagues at Johns Hopkins. The idea was to apply a really simple unsupervised clustering algorithm to raw, running speech and see what we get: could we get phonemes out of it?

This is what we did. The idea is that you start with a very simple Markov model with just one state, and then you split the states in various ways: you can split in the time domain, or horizontally, so that you have two different versions of each sound. You keep iterating this graph-growing process until you have a very complicated network.
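To give a flavour of this "grow by splitting" idea, here is a minimal sketch — not the successive-state-splitting HMM itself, but a much simpler analogue that grows an inventory of acoustic units by repeatedly splitting and re-estimating Gaussian codewords over feature frames; the number of units, split size and refinement steps are arbitrary choices of mine.

```python
# Toy analogue of growing an acoustic-unit inventory by iterated splitting
# (not the actual state-splitting HMM): start from one mean vector over all
# feature frames, split every codeword into two perturbed copies, re-estimate,
# and repeat until the desired number of units is reached.
import numpy as np

def grow_units(frames, n_units=8, n_refine=10, eps=1e-3):
    codebook = [frames.mean(axis=0)]
    while len(codebook) < n_units:
        # split each codeword into two slightly perturbed versions
        codebook = [c + d for c in codebook for d in (eps, -eps)]
        for _ in range(n_refine):  # k-means style re-estimation
            dists = np.stack([np.linalg.norm(frames - c, axis=1) for c in codebook])
            assign = dists.argmin(axis=0)
            codebook = [frames[assign == k].mean(axis=0) if np.any(assign == k)
                        else codebook[k] for k in range(len(codebook))]
    return np.stack(codebook)
```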

In order to analyze what the system was doing, what my colleagues did was to apply decoding using finite-state transducers, so that you can have some interpretation of what the states mean. What was discovered was that the units found by this kind of system are very small, smaller than phonemes.

Even if you concatenate them — these are the most frequent strings of concatenated units — they correspond not to phonemes but rather to contextual allophones. There is also the problem that the units are not very talker-invariant. Now, this is not very surprising for those of you who work with speech, which is the majority of people here, because we all know that phonemes are not going to be found in such a way.

One problem I want to insist on, because I think it's quite crucial — and we talked about it in earlier discussions — is the fact that languages do contain elements that you will discover if you do some kind of unsupervised clustering, but that there is no way to merge into abstract phonemes. This is due to the existence of allophones. In most languages you have allophonic rules: for instance, in French the uvular /r/ is voiced in some words and unvoiced in others. These sounds exist in the language, there is nothing you can do about that, and they actually are two different phonemes in some other languages.

So you are going to end up, in fact, discovering these units, and with a purely bottom-up approach there is no way to get rid of them.

Now, you could say — and that was actually one of the questions discussed before — how many phones, how many units do you want to discover? It was sort of said that it doesn't really matter: we can take sixty-four, we can take a hundred. Well, actually it does matter for the rest of the processing, at least that's what we discovered with a PhD student of mine. What we did was to vary the number of allophones used to transcribe the speech, and then we used this other algorithm, a word segmentation algorithm that was also referred to before — one of Sharon Goldwater's type of algorithms.

What we found is shown here: on one axis you have the segmentation F-score, and on the other the number of allophones, converted into the number of alternate word forms that these allophones create. You can see that the performance is affected — it is dropping. This is shown for English, French and Japanese, and in a language like Japanese it has a really detrimental effect: if you have lots of allophones it becomes extremely difficult to find words, because the statistics just break down. So it does matter to start with good units.
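For reference, the kind of token F-score plotted on this slide can be computed roughly as follows; this is a generic sketch, not the exact evaluation script used in the study.

```python
# Generic token F-score for word segmentation: compare predicted word spans
# against gold word spans over the same unsegmented phone string.
def token_spans(words):
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def segmentation_fscore(gold_words, predicted_words):
    gold, pred = token_spans(gold_words), token_spans(predicted_words)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(segmentation_fscore(["the", "dog"], ["thedog"]))      # 0.0, boundary missed
print(segmentation_fscore(["the", "dog"], ["the", "dog"]))  # 1.0, perfect segmentation
```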

This is another experiment, reported by Aren, where again you basically replace the phonemes with some kind of unsupervised units and feed them into a word segmentation system; you end up with very poor performance. So that means that phonemes cannot be found acoustically, at least not with a simple-minded clustering system.

There are two ideas from there that I want to discuss: one is to use a better auditory model, and the other is to use top-down information.

Regarding this, here is just a summary of what I said. What we have right now is that with simplified or fake input, some unsupervised clustering approaches have been successful; with more realistic input, we have systems that work, but they use heavily supervised models. The question is whether we can build systems that combine this portion of the space: unsupervised learning on realistic input.

I'm not going to present much of the work we did on unsupervised phoneme discovery itself, because for me there was a preliminary and very important question first: how do we evaluate unsupervised phoneme discovery? Imagine you have a system that has discovered some units; how can you evaluate whether these units are good or not?

Traditionally people use the phone error rate, which basically means you train a phone decoder. That is what we did with the successive-state-splitting system: it was this finite-state transducer that translated the states of the system into phonemes. Of course, the problem is that when you do that, a lot of the performance at the end may be due to the decoder. It may not be that the units are good; it may just be that you have trained a good decoder. And also, we don't even know that phonemes are the relevant units for infants. Maybe they are using something else: syllables, diphones, some other kind of unit.

So the idea is to use an evaluation technique that is suited to this kind of work. The underlying idea is that we don't really care whether babies, or the system, are discovering phonemes; what we care about is that they are able to distinguish words that mean different things. 'Dog' and 'doll' mean different things, so they should be distinguishable by the system, no matter how it encodes them.

This is the idea underlying the same-different task that Aren has been pushing all these years, and we have a slightly different version of it, which we call the ABX task.

The same-different task goes like this: you are given two word tokens and you have to say whether they are the same word. You compute the acoustic distance between them, and these are the distributions of the distances. What Aren showed was that if you do this within the same talker, the two distributions are quite different, so it's easy to say whether it is the same word or not; but if the tokens come from different talkers, it becomes a lot harder.

What we did was to build on this and ask a slightly different question. I give you three items: 'dog' said by one talker, 'doll' said by the same talker, and then 'dog' said by a second talker. Now you have to say whether this item X is closer to this one or to that one.
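To make the task concrete, here is a minimal sketch of how such an ABX decision can be computed over acoustic features. This is my own illustration, using dynamic time warping over frame-level cosine distances; the actual distance metric used in the study may differ.

```python
# ABX decision: X (same category as A, different talker) counts as correct
# if it is closer to A than to B under a DTW-based acoustic distance.
import numpy as np

def frame_distance(u, v):
    # cosine distance between two feature frames (e.g. MFCC vectors)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalised path cost

def abx_correct(a, b, x):
    return dtw_distance(x, a) < dtw_distance(x, b)
```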

This is a simple psychoacoustic task. For me it is really inspired by the type of experiments we do with babies, and with it we can compute d-primes — values that have a psychological interpretation — but we can also do a very fine-grained analysis of the type of errors the system is making. We applied this task to a database of syllables that had been recorded in English across talkers.

This is the performance you get. What's nice with this kind of task is that you can really compare humans and machines: this is the performance of humans, and this is the performance of MFCC coefficients, so you can see there is quite a bit of difference between these two kinds of systems. These tests are actually run on meaningless syllables, so it cannot be the case that humans are using meaning to do the task. And this kind of task can then be used to test different kinds of features, which is nice.

That's what we did with a student of mine, Thomas, and also with Hynek. Here it is the same task I was just talking about, cross-talker phone discrimination. You can apply a typical processing pipeline, where you start with the signal, compute the power spectrum, and apply all kinds of transformations, and you can see whether each of these transformations applied to the signal actually improves that particular task or not.

In this graph you see performance as a function of the number of spectral channels. What we found was that the phone discrimination task actually requires fewer channels than, for instance, a talker discrimination task, which we can also run: you take 'dog' spoken by two speakers, then a third item which is a different word spoken by one of the first two talkers, and you have to decide which talker said it.

I'm not going to say more about that, but the idea is that trying to specify the proper evaluation tasks is going to help in devising proper features that would work for unsupervised learning.

This is work we started with another postdoc of mine. What he did was to apply a deep belief network to this problem — we already heard a lot about these during the first day of the workshop. What you can then do is compare the representations at each level of the deep belief network on this same discrimination task.

This is the MFCC baseline, for instance, and this is what you get at the first level of the DBN without any training — so you are actually doing better; this is the error rate here. If you do some unsupervised training, like restricted Boltzmann machine pre-training, it actually gets slightly worse on that task. Now, it's not that this pre-training is useless — it helps a lot when you do supervised training afterwards — but if you don't do supervised training, it's not doing much. That's why I'm saying it is important to have a good evaluation task for unsupervised problems, because then you can discover whether your unsupervised units are actually any good or not.

Now, in the time that remains, I would like to talk a little bit about this other idea, the idea of using top-down information. That's an idea that was not, at least to me, very natural, because I had this idea that maybe you should first learn the phonemes, the elements of the language, before learning higher-order information. But of course phonemes are part of a big system, and maybe the very definition of the phonemes emerges out of that big system. So the intuition is that maybe babies are trying to learn the whole system, and while they do that, they are going to find the phonemes.

Of all the different things we tried, I'm going to talk about this idea of using lexical information. Lexical information relies on a very simple idea, which is the following. If you take two words at random, or if you take your whole lexicon and try to find minimal pairs — pairs of words that differ on only one segment, for instance 'canal' and 'canard' — you don't find a lot of them. You do find some, but they are statistically very infrequent.

Now imagine you have some initial procedure for finding word forms, and you look at your proto-lexicon: if, for a given pair of sounds, you find a lot of apparent minimal pairs corresponding to that contrast, then you can be pretty sure that it's not really a phonemic contrast — these are probably allophones of the same phoneme. That's the intuition.
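A toy sketch of that strategy, assuming we already have a list of word forms as phone strings (my illustration, not the original implementation): for each pair of phones, count how many pairs of forms differ only in that one segment, and flag pairs with many such alternations as likely allophones.

```python
# Count, for each pair of phones, how many word-form pairs differ in exactly
# one segment; a high count suggests an allophonic alternation rather than a
# true phonemic contrast (real minimal pairs are rare).
from collections import Counter
from itertools import combinations

def alternation_counts(lexicon):
    counts = Counter()
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1:  # the two forms differ in exactly one segment
            counts[tuple(sorted(diffs[0]))] += 1
    return counts

# The voiced and unvoiced variants of the French /r/ create alternating forms,
# so the pair (R_unvoiced, R_voiced) accumulates counts.
lexicon = [("k", "a", "n", "a", "R_voiced"), ("k", "a", "n", "a", "R_unvoiced")]
print(alternation_counts(lexicon))
```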

So how did we test that? We started with a transcribed corpus, transcribed it into phonemes, and then we generated random allophones. We then re-transcribed the phonemic transcription into a very fine-grained description with all these allophones, varying the number of allophones. That's how we generated the corpus, and the task is then to take it and find which pairs of phones belong to the same phoneme, using just the information in the corpus.

So that's what we do. We compute, as a distance, the number of distinct minimal pairs you have for each contrast, and we compute the area under the curve, which is plotted here against the number of allophones. Don't look at this curve here; this one is the relevant one — it shows the effect of using the strategy of counting minimal pairs. The performance is quite good, and it is not really negatively affected by the number of allophones you add.

So this strategy works quite well, but of course it's cheating, because I assumed that the babies have the boundaries of words. They don't, and in fact I showed you just before that it's actually extremely difficult to find the boundaries of words if you have lots of allophones. So that is a kind of circularity we would like to avoid.

The idea that the postdoc, Andy Martin, had — which was great — was to say: well, maybe we don't need an exact lexicon. Maybe babies can build a proto-lexicon with whatever segmentation heuristic they have. It's going to be incomplete, it's going to be wrong, it will have many non-words in it, but it could still be a useful thing to have. And that's what we find here. There we used a really rudimentary segmentation heuristic — take the ten percent most frequent n-grams in the corpus, and that's your lexicon — so it was pretty awful, but it still gave performance that was almost as good as the gold lexicon.
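A rough reconstruction of that heuristic might look like the sketch below; the ten-percent cutoff follows what I just said, but the range of n-gram lengths is an assumption of mine.

```python
# Rudimentary proto-lexicon: collect phone n-grams from the unsegmented corpus
# and keep only the most frequent ones as candidate "words".
from collections import Counter

def proto_lexicon(utterances, min_len=2, max_len=7, keep_fraction=0.10):
    ngrams = Counter()
    for utt in utterances:  # each utterance is a list of phone symbols
        for n in range(min_len, max_len + 1):
            for i in range(len(utt) - n + 1):
                ngrams[tuple(utt[i:i + n])] += 1
    ranked = [ng for ng, _ in ngrams.most_common()]
    return set(ranked[:max(1, int(len(ranked) * keep_fraction))])
```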

Then Andy went to Japan, and I had a doctoral student who said: well, we could go even further than that — maybe babies could construct some approximate semantics. The reason why this could be useful is the following. Take the two allophonic pronunciations of 'canard': they form an alternating pair, so the strategy correctly merges them. But what about this pair? These are two different words in French, 'canal' and 'canard'. If I were to apply the same strategy, I would declare that /l/ and /r/ are allophones, which is wrong, and I would end up with a Japanese-like version of French. That's not what we want.

On the other hand, if we have some idea, even a vague idea, of the meaning — that 'canard' is some kind of bird, whereas 'canal' is some kind of water thing, while both pronunciations of 'canard' still refer to the same kind of bird — then maybe that is sufficient to distinguish the two cases.

So there we ran the same kind of pipeline, but we made the problem more realistic: instead of random allophones, we generated the allophones using tied three-state HMMs. That actually makes the problem much more difficult — the allophones are more realistic, but the lexical strategy I presented before starts having trouble with them. Then the idea is that you take this corpus and you don't cheat anymore: you try to recover possible word forms from it, you do some semantic estimation, and you then compute a semantic distance between pairs of phones.

How does it work? For word segmentation we use state-of-the-art systems: minimum description length or adaptor grammars. We know they work, but not very well — especially with lots of allophones, they give a pretty bad estimate of the lexicon.

But we still take that as the lexicon, and then we apply latent semantic analysis, which basically counts how many times the different terms occur in different documents. Here we took documents of ten sentences in length: we segmented the whole corpus into blocks of ten sentences, computed this matrix of counts, decomposed it, and arrived at a semantic representation where each word form is now a vector.

People in NLP now do much more sophisticated things than this; it is a pretty old-school semantic analysis. But what's nice is that we can compute the cosine between the semantic representations of two word forms, and the idea is that if two forms are allophonic variants of the same word, they should have quite similar vectors, because they occur in the same contexts.
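A minimal sketch of this kind of pipeline, as I understand it (the ten-sentence document size matches what I said; the dimensionality is arbitrary): build a term-by-document count matrix, factor it with an SVD, and compare two word forms by the cosine of their reduced vectors.

```python
# Latent semantic analysis in miniature: term-by-document counts, truncated
# SVD, and cosine similarity between the resulting word vectors. Two allophonic
# variants of the same word should end up with a high cosine.
import numpy as np

def lsa_vectors(documents, vocabulary, dim=50):
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = np.zeros((len(vocabulary), len(documents)))
    for j, doc in enumerate(documents):  # each document: a block of ~10 sentences of tokens
        for token in doc:
            if token in index:
                counts[index[token], j] += 1.0
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(dim, len(s))
    return {w: u[i, :k] * s[:k] for w, i in index.items()}

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))
```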

So here is the result. In this study, because we generated the allophones on the basis of HMMs, we can also compute the acoustic distance between them. Obviously acoustic distance is going to help: if two phones are acoustically quite close to one another, maybe they should be grouped together, because they are likely allophones. But we also know that this alone is not enough — the purely bottom-up strategy doesn't work. The performance is not bad, but it's still not perfect. This is the performance on the task where I give you two phones and you have to tell whether they are allophones of the same phoneme or not; chance is fifty percent.

This is the percent correct if you use acoustics only, for English and Japanese. One thing missing here is the number of allophones: a hundred, five hundred, a thousand. This is the effect of the acoustic distance, and this is the semantic distance; its performance is almost as good as that of the acoustic distance. And when you combine them, you get very good performance.
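Purely as an illustration of combining the two cues — the weighting and threshold here are assumptions of mine, not the values from the study — a decision could look like this:

```python
# Combine the two cues for the "same phoneme or not?" decision: a small
# acoustic distance and a high semantic cosine both favour merging.
def allophone_score(acoustic_distance, semantic_cosine, w_acoustic=0.5, w_semantic=0.5):
    return w_acoustic * (1.0 - acoustic_distance) + w_semantic * semantic_cosine

def same_phoneme(acoustic_distance, semantic_cosine, threshold=0.5):
    return allophone_score(acoustic_distance, semantic_cosine) > threshold
```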

That shows you can use this kind of semantic representation even though it is computed on the basis of an extremely bad lexicon. At this level here, the number of real words you find with the adaptor-grammar type of framework is about twenty percent, so your lexicon is twenty percent real words and eighty percent junk. Nevertheless, that lexicon is enough to give you some semantic top-down information, which shows that the semantic information is very robust.

Alright, I'm going to wrap up very quickly. I started with this idea that babies would proceed in a sort of bottom-up fashion. That doesn't work: it doesn't scale, it has circularities, and it doesn't account for the fact that babies are learning words before they have really zoomed in on their inventory of phonemes. In fact, there is now work showing that even at six months babies have an idea of the semantic representations of some words. So basically, babies are learning everything at once.

So now we would like to replace that scenario with a system like this, where you start with raw speech and try to learn all the levels at the same time. Of course you're going to do a bad job: the phonemes are going to be wrong, the word segmentation is going to be wrong, the semantics is going to be awful. But then you combine all of this and you make a better next iteration. That would be the proposed architecture for babies.

Now, for this to work we have to do a lot more work. We have to stop using transcriptions of the target language, of course, and try to approximate the real input that babies are getting as closely as we can. We have to quantify what it means to have a proto-lexicon or proto-semantics: I gave you an idea of an evaluation procedure for what it is to have proto-phonemes, and we have to do the same thing for proto-words and proto-semantics, et cetera, because these are really approximate representations.

And then there are the synergies, which are what I just described: when you try to learn the phonological units alone you do a rather bad job, and learning semantic representations alone is difficult, but if you try to learn a sort of joint model you are going to do better. There are a lot of potential synergies you could imagine.

The last thing I have to do as a psychologist, of course, is to go back to the babies and test whether they are actually doing this, but I'm not going to talk about that now.

Finally, why should we do this? I think this reverse engineering of the human infant is really an interesting new challenge, and I think both sides can bring a lot to it. Psychologists can bring ideas, we can bring interesting corpora, and we can test whether the proposed ideas correspond to realistic possibilities; engineers can bring algorithms, and also the large-scale testing on real data, which is very important.

And we have a lot to work on. This would be some of the potential architecture: I tried to put in everything that has been documented somewhere in terms of potential links between the different levels you are trying to approximate, and I guess you could add a lot of things. Babies are also actually articulating, so maybe this articulation feeds back to help in constructing the sub-lexical units. They are also interested in the faces of caretakers, and they have a lot of visual and semantic input for acquisition. So all these representations have to be put in at some point, but I think what we have to do is to establish whether we really do have interesting synergies or not; if we don't, then we can refactor the system into separate subsystems.

And that's all I have to say. This is the team: these are the very nice colleagues who helped with this work.

Okay, so we're going to have an abbreviated panel, so we don't have a whole lot of time for questions, but one or two.

What do you think about infants learning something between phonemes and words, like syllables, which have a nice sort of acoustic chunkiness to them?

That was actually basically the hypothesis I had when I did my thesis, about the role of syllables, I guess.

I mean, that's perfectly possible. The thing is, in a way, I think that what deep belief networks are doing, by having these input representations where you stack about a hundred and fifty milliseconds of signal, is already going somewhat in that direction: a hundred and fifty milliseconds is basically the size of a syllable.

So I guess that's what is behind it: if this is what people are using, the reason is that it is basically the domain of the syllable, where you have coarticulation, so that's where you can capture the essential information for recovering the contrasts. I think there are many ways in which syllable-type units could play a role. It could be in an implicit fashion, like I just said, or you could actually try to build recognition units that have a syllable shape, which is another way to do it.

We also know that infants count syllables, for instance. At birth, if you present them with three-syllable words and then you switch to two-syllable words, they notice the change; they don't notice the change if you go from four to six phonemes, which is the same ratio of change. So we have evidence that syllables, or at least syllable nuclei, are things that they pay a lot of attention to.

Thank you for your talk, Emmanuel. I think you once told me that almost from day one infants can somehow imitate articulatory gestures — that it is somehow hardwired — and I don't know how you do that experiment. But on the other hand, all of this acquisition of phonemic inventories, word boundary segmentation and lexicons seems to be part of the plasticity of learning in infants. So why isn't there some notion of starting with articulatory gestures, since that is sort of there from the beginning? Is that part of your model, or should it be?

So yes, I have actually proposed working on this, and a number of people have tried to incorporate it, using deep learning systems that try to learn speech features and articulatory features at the same time. If you train like this and then present only the speech features, you obtain better decoding than if you had only ever learned from speech features. So we know there is a sense in which that could work. But of course that work was done with real adult articulation; baby articulation is much more primitive, so it's not clear that it's going to help as much. Still, that's one of the things we want to try.

So I think we're out of time, but Emmanuel, I believe you're going to be here tomorrow as well, so I encourage people to come and ask all these questions; I think they're very relevant to the work we all do in this community.