0:00:16 So, well, thank you; I'd like to thank the organisers for allowing me to be a sort of surprise talk here. I'm going to tell you a little bit about what we have been doing in terms of trying to understand language acquisition.
0:00:35 Now, as parents, when we try to understand how babies learn languages, it seems very obvious: we just use intuition, and, well, maybe they just have to listen to what we are telling them, right? It's very simple.
0:00:51 Now, as psychologists, we have been trained to try to take the place of the baby. How does it feel to be a baby in this situation? Well, it's going to be a lot more complicated, because we don't understand the words we are told; we just have the signal.
0:01:09 And now, what I would like to do is take a third perspective, which is the perspective of an engineer. I'm not an engineer myself, I'm a psychologist, but here the idea is to see how we could basically construct a system that does what the baby does. That's the perspective we would like to push.
0:01:29 So, OK. This is what we basically know, or think we know, about babies' early language acquisition. The timeline here is in months: this is birth, and this is the end of the first year. As you can see, babies learn quite a number of things quite quickly. Here babies start to say their first words, and before that they utter various vocalisations; but actually, before they produce their first word, they are already learning quite a bit about their own language. For instance, they start to become sensitive to the sound patterns that are typical of the language, they start to build some representation of the consonants, and they start to build what are basically language models, with sequential dependencies, et cetera. So this is taking place very early, way before they can show us that they have learned these things.
0:02:28 At the same time, they are also learning other aspects of language, in the prosodic and in the lexical domains.
0:02:34 So how do we know that babies are doing this? Well, that is the job of psychologists: to interrogate babies that cannot talk. We have to find clever ways to build situations where the baby will, for instance, look at a screen, or suck on a little blind nipple, and this behaviour of the baby gives us a way to measure how they react to the stimuli they are presented with. So, take the typical experiment that was basically the beginning of this field, in the seventies.
0:03:07 Eimas ran a study where you present the same syllable over and over again, each time the baby produces this little behaviour, the sucking behaviour here: 'ba', 'ba', 'ba'. You can see that the frequency of the sucking decreases, because it's boring, it's always the same syllable. But then suddenly you change the syllable, from 'ba' to 'pa', and now the baby sucks a lot more. That means the baby has noticed that there was a change. And this is compared to a control condition where the same syllable simply continues: exactly the same syllable rather than a slightly different one.
0:03:47 With this kind of idea you can basically probe babies' perception of speech sounds: you can ask whether they discriminate 'ba' and 'pa', 'dot' and 'got', and all these kinds of sounds. You can also probe memory: have they memorised, have they segmented out particular frequent or interesting parts of the sounds in their environment? There are also fancier types of equipment for doing the same kinds of experiments, but I'm not going to talk about them.
0:04:18 So the question that really interests me here is: how can we understand what babies are doing? Now, if you open up linguistics journals, or rather psycholinguistics and developmental psychology journals, you find some hypotheses that are interesting, but I'm not going to talk about them, because unfortunately these theories do not allow us to get an idea of the mechanisms that babies are using for understanding speech.
0:04:49 You do find, in psychology and also in linguistics journals, publications trying to cut the learning problem down into subproblems. For instance, some people (I'm going to talk more about that) have studied how you could find phonemes from raw speech using some kind of unsupervised clustering; but also, once you have the phones, how to find the word forms; or once you have the word forms, how to learn some semantics, et cetera, et cetera. These papers rely on basically less technology than in engineering, they are not done by engineers, and one particular aspect of them is that they each focus on a really small part of the learning problem, while making a lot of assumptions about the rest of the system.
0:05:41 So the question we can ask ourselves is: could we make a global system that would learn much of what babies learn by concatenating these elements? And what I will try to demonstrate to you is that such a system simply does not work. It doesn't work because it doesn't scale, because it incorporates some circularities, and because it doesn't correspond to what babies are doing anyway.
0:06:07 So I'm going to focus on this particular part; we have talked a lot about it, and at least two talks today focused on how you could discover units of speech from raw speech. In psychology, people really believe that the way babies do this is by accumulating evidence and doing some kind of unsupervised clustering.
0:06:35 A couple of papers were published establishing that babies at six months are able to distinguish sounds that are not in their language. For instance, they can distinguish the dental and retroflex stops of Hindi: a Hindi speaker hears these as two completely different consonants, but most of you wouldn't hear the difference. And contrary to adults, babies hear it at six months; by twelve months, though, they have lost that ability, because the contrast is not used in their language. So the hypothesis about how babies do this is that they are accumulating evidence and doing some kind of statistical clustering based on the input that's available in the language.
0:07:15 Now, a number of papers have tried to demonstrate that you can build a system that does this. However, most of these papers have dealt with a very small number of categories. They are sort of proof-of-principle papers: they construct data according to preset category distributions, and they show that you can recover these categories by doing some kind of clustering. That's nice, but does it scale?
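To make the proof-of-principle setup concrete, here is a minimal sketch (my illustration, not the published models): sample a one-dimensional cue such as voice onset time from two preset categories, and check that unsupervised clustering recovers them.

```python
# Minimal sketch of distributional phonetic category learning
# (illustrative only; the published models differ in detail).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate a 1-D cue (e.g., voice onset time in ms) drawn from two
# hypothetical categories: short-lag /d/-like vs long-lag /t/-like.
vot = np.concatenate([rng.normal(15, 8, 500),    # /d/-like tokens
                      rng.normal(70, 12, 500)])  # /t/-like tokens

# Unsupervised clustering: fit a 2-component Gaussian mixture.
gmm = GaussianMixture(n_components=2, random_state=0).fit(vot.reshape(-1, 1))
print("recovered category means:", sorted(gmm.means_.ravel()))  # ~[15, 70]
```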
0:07:44 Well, as everybody here knows, speech is more complicated than that. This is basically running speech, and if you go to more conversational speech, it is not separated into neatly segmented units; segmentation is part of the problem, et cetera.
0:08:00 OK, so this is where I started to get involved in this problem, working with colleagues at Johns Hopkins. The idea was to try to apply a really simple unsupervised clustering algorithm to raw, running speech and see what we would get: could we get phonemes out of that?
0:08:25 So this is what we did. The idea is that you start with a very simple Markov model with just one state, and then you split the states into various possibilities: you can split in the time domain, or horizontally, as if you had two different versions of each sound, and then you iterate this graph-growing process until you have a very complicated network.
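The following is a drastically simplified stand-in for that procedure (a sketch under assumed details, not the actual Hopkins system, which scores candidate temporal and contextual splits of individual states): grow a Gaussian HMM one state at a time and keep growing while the data likelihood improves.

```python
# Toy sketch of the split-and-refit idea (illustrative only).
import numpy as np
from hmmlearn import hmm

def fit_hmm(n_states, X):
    """Fit a diagonal-covariance Gaussian HMM; return it with its score."""
    m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, random_state=0)
    m.fit(X)
    return m, m.score(X)

X = np.random.randn(2000, 13)      # placeholder for real MFCC frames

model, ll = fit_hmm(1, X)          # start from a single-state model
for _ in range(10):                # grow the network iteratively
    cand, cand_ll = fit_hmm(model.n_components + 1, X)
    if cand_ll <= ll:              # stop when growing stops helping
        break                      # (held-out data would be better here)
    model, ll = cand, cand_ll
print("states discovered:", model.n_components)
```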
0:08:50 Then, in order to analyse what the system was doing, what we did was to apply decoding using finite-state transducers, so that you can have some interpretation of what the states mean. And what was discovered was that the units found by this kind of system are very small, smaller than phonemes. Even if you concatenate them (these are the most frequent strings of concatenated units), they correspond not to phonemes but rather to contextual allophones. There is also the problem that the units are not very talker-invariant.
0:09:34 Now, this result is not very surprising for those of you who work with speech, and that's the majority of people here, because we all know that phonemes are not going to be found in such a way. But there is one problem I want to insist on, because I think it's quite crucial and we talked about it in earlier discussions: it is the fact that languages do contain elements that you will discover if you do some kind of unsupervised clustering, but that there is no way to merge into abstract phonemes. This is due to the existence of allophones. In many languages, in most languages, you have allophonic rules: for instance, in French the /r/ is voiced in some words and unvoiced in others. These two sounds exist in the language, there is nothing you can do about that, and they actually are two different phonemes in some other languages. So you will in fact end up discovering these units, and in a purely bottom-up fashion there is no way to get rid of them.
0:10:46 OK. So you could say, and this actually was one of the questions that was discussed before: how many phones, how many units do you want to discover? And it was sort of said that it doesn't really matter; we can take sixty-four, we can take a hundred. Well, actually, it does matter for the rest of the processing, at least that's what we discovered with a PhD student of mine. What we did there was to vary the number of allophones used to transcribe the speech, and then we ran a word segmentation algorithm on it, the kind that was also referred to before; we used one of Sharon Goldwater's family of algorithms.
0:11:37 And this is what we found. Here you have the segmentation F-score, and on this axis is the number of phones, converted into the average number of alternate word forms that the allophones create. You can see that the performance drops. These are the results for English, French and Japanese, and in some languages, like Japanese, the effect is really detrimental: if you have lots of allophones, it becomes extremely difficult to find words, because the algorithms just break down. So it does matter to start with good units.
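For concreteness, here is a minimal version of the kind of metric on that y-axis (my sketch; the published studies use the standard word segmentation scoring): the token F-score of a hypothesised segmentation against the gold one.

```python
# Minimal sketch: token F-score for word segmentation against a gold
# standard, the kind of metric plotted against the number of allophones.
def tokens(utterance):
    """Return word tokens as (start, end) character spans."""
    spans, i = set(), 0
    for w in utterance.split():
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

def segmentation_fscore(gold, predicted):
    g, p = tokens(gold), tokens(predicted)
    hits = len(g & p)                      # tokens with both edges right
    if hits == 0:
        return 0.0
    precision, recall = hits / len(p), hits / len(g)
    return 2 * precision * recall / (precision + recall)

# Same phone string, two hypothesised segmentations:
print(segmentation_fscore("the dog ate", "the do gate"))  # ~0.33
```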
0:12:18 This is another experiment, reported by Aren Jansen, where again you replace the phones with some kind of unsupervised units, and when you try to feed those into a word segmentation system you end up with very poor performance. So that means that phonemes, at least with a simple-minded clustering system, cannot be learned acoustically.
0:12:49 So from there, there are two ideas I want to discuss: one is to use a better auditory model, and the other is to use top-down information.
0:13:00 So, this is just a summary of what I have said so far. What we have right now is this: with somewhat simplified or fake input, some unsupervised clustering approaches have been successful; with more realistic input, we have systems that work, but they use heavily supervised models. The question is whether we can build systems that cover the remaining portion of this space: unsupervised learning with realistic input.
0:13:33 Now, I'm not going to present much of the work we did on unsupervised phoneme discovery itself, because for me there is a preliminary, very important question, which is: how do we evaluate unsupervised phoneme discovery? Imagine you have a system that discovered units. How can you evaluate whether these units are good or not?
0:14:02and decoder which is what we did with this is successive state splitting
0:14:06it was this
0:14:10finite state transducer that translated the find that the states of the system into phonemes
0:14:17of course the problem is that when you do that then maybe a lot of
0:14:20the performance at the end is due to the decoder
0:14:25it may not be
0:14:27the fact that these units are good it just that you have trained a good
0:14:30because
0:14:32and also we don't even know that phonemes of the relevant units for this for
0:14:36inference okay
0:14:37so maybe they are using something else maybe they're using
0:14:39syllables diphones some other kind of you
0:14:43 So the idea is to use an evaluation technique that is suited to this kind of work. And the underlying idea is that we don't really care whether babies, or the system, are discovering phonemes; what we care about is that they are able to distinguish words that mean different things. 'Talk' and 'dog' mean different things, so they should be distinguishable by the system, no matter how it codes them. This is the idea underlying the same-different task that Aren Jansen has been pushing all these years, and we have a slightly different version of it, which we call the ABX task.
0:15:24 The same-different task goes like this: you are given two words, produced by two talkers, and you have to say whether they are the same word. You compute the acoustic distance between them, and these are the distributions of the acoustic distances. What Aren showed is that if you stay within the same talker, the two distributions are quite different, so it's easy to say whether it is the same word or not; but across different talkers, it becomes a lot harder.
0:15:56 So what we did was to build on this and ask a slightly different question. I give you three items: say, 'dog' said by one talker, 'doll' said by the same talker, and then 'dog' said by a second talker. Now you have to say whether this last item is closer to the first one or to the second one. So it is a simple psychoacoustic task.
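In machine form, the ABX task reduces to a distance comparison. Here is a minimal sketch, assuming the three items come as feature matrices and using dynamic time warping over frame-wise cosine distances (one common choice; the actual studies may use a different distance).

```python
# Minimal ABX sketch: X ("dog", talker 2) is answered correctly when it
# lies closer to A ("dog", talker 1) than to B ("doll", talker 1).
import numpy as np

def dtw(a, b):
    """DTW distance between two (frames x dims) feature matrices,
    with frame-wise cosine distance as the local cost."""
    na = a / np.linalg.norm(a, axis=1, keepdims=True)
    nb = b / np.linalg.norm(b, axis=1, keepdims=True)
    cost = 1.0 - na @ nb.T                       # local cost matrix
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[len(a), len(b)]

def abx_correct(A, B, X):
    """True if X is closer to A than to B (A and X share the category)."""
    return dtw(A, X) < dtw(B, X)

# A, B, X are feature matrices (e.g., MFCC frames) of the three items;
# the ABX error rate is the fraction of triplets answered incorrectly.
```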
0:16:22 For me, it is really inspired by the type of experiments we do with babies, and with it we can compute d-primes, values that have a psychological interpretation; but we can also do a very fine-grained analysis of the types of errors the system makes. We applied this task to a database of syllables recorded in English across talkers, and this is the performance that you get. What's nice with this kind of task is that you can really compare humans and machines: this is the performance of humans, and this is the performance of MFCC coefficients, and you can see there is quite a bit of difference between these two kinds of systems. Note that these are run on meaningless syllables, so it cannot be the case that humans are using meaning to do the task.
0:17:21 This kind of task can then be used to test different kinds of features, which is nice. That's what we did with a student of mine, Thomas Schatz, and also with Hynek Hermansky.
0:17:36 Here the task is the same as the one I was just describing: cross-talker phone discrimination. You can apply a typical processing pipeline, where you start with the signal, compute the power spectrum, and apply all kinds of transformations, and you can see whether each of these transformations of the signal actually improves things or not on that particular task. In this graph you have the effect on performance of the number of spectral channels, and what we found was that the phone discrimination task actually requires fewer channels than, for instance, a talker discrimination task. We can run that one too: 'dog' spoken by two speakers, then a third item that is a different word, spoken by one of the first two talkers, and you have to say which talker said it.
0:18:26 I'm not going to say more about that, but the idea is that trying to specify the proper evaluation tasks is going to help us devise proper features that would work for unsupervised learning.
0:18:46 This is work we started with another postdoc of mine. What he did was to apply deep belief networks to this problem; we already learned a lot about those during the first day of talks. What you can do is compare the performance of the deep belief network representations, at each of the levels, on this kind of discrimination task.
0:19:13 This is the MFCC baseline, for instance, and this is what you have at the first level of the DBN without any training; so actually you are already doing better (this is the error rate here). And if you do some unsupervised training, like restricted Boltzmann machine training, it actually gets slightly worse on that task. Now, that is not to say that this pre-training doesn't help; it helps a lot when you do supervised training afterwards. But if you don't do supervised training, it is actually not doing much. So, as I'm saying, it is important to have a good evaluation task for unsupervised problems, because then you can discover whether your unsupervised units are actually any good or not.
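Schematically, the evaluation logic looks like this (my sketch, not the original code): train one unsupervised layer at a time and score each level's representation on the same discrimination task. Here `abx_error_rate` is a hypothetical scoring function in the spirit of the ABX sketch above.

```python
# Sketch: layer-by-layer unsupervised pre-training, each level scored
# with the same discrimination task (abx_error_rate is hypothetical).
from sklearn.neural_network import BernoulliRBM

def evaluate_layers(frames, triplets, n_layers=3):
    """frames: (n_frames x dims) features scaled to [0, 1];
    triplets: the ABX test items; returns one error rate per level."""
    scores, features = [], frames
    for _ in range(n_layers):
        rbm = BernoulliRBM(n_components=256, n_iter=10, random_state=0)
        features = rbm.fit_transform(features)   # unsupervised training
        scores.append(abx_error_rate(features, triplets))
    return scores  # nothing guarantees the error improves with depth
```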
0:19:59 OK, so in the time that remains I would like to talk a little bit about the other idea: the idea of using top-down information. That's an idea that was not, at least to me, very natural, because I had this idea that babies should maybe learn the phonemes first, the elements of the language, before learning higher-order information. But of course phonemes are part of a bigger system, and so maybe the meaning, the definition, of the phonemes emerges out of that bigger system. The intuition there is that maybe babies are trying to learn the whole system, and while they do that, they will find the phonemes.
0:20:42 So, of all the different things we tried, I'm going to talk about this idea of using lexical information. Lexical information rests on a very simple observation, which is the following. Typically, when you take two words at random, or when you take your whole lexicon and try to find minimal pairs, pairs of words that differ on only one segment, you don't find a lot of them. You do find some, but they are statistically very infrequent. So now, looking at your lexicon (imagine you have some initial way of segmenting out the words), if among all the word pairs you find there are a lot of minimal pairs corresponding to a given contrast, then you can be pretty sure that it is not really a phonemic contrast: those pairs are probably variants of the same words, and the two sounds are probably allophones. That's the intuition.
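A minimal sketch of that counting strategy (my reconstruction; the published statistic is defined more carefully): for every pair of phones, count the lexical entries that differ only in that pair, and treat high counts as evidence of allophony.

```python
# Sketch: count apparent minimal pairs per phone contrast in a lexicon.
# An allophonic rule creates near-duplicate entries across the lexicon,
# so a contrast generating many such pairs is suspiciously allophonic.
from collections import Counter
from itertools import combinations

def minimal_pair_counts(lexicon):
    """lexicon: iterable of words, each a tuple of phone symbols."""
    counts = Counter()
    for w1, w2 in combinations(set(lexicon), 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1:                  # differ in exactly one phone
            counts[tuple(sorted(diffs[0]))] += 1
    return counts

# Toy lexicon where "r" and "R" are allophonic variants of one phoneme:
lex = [("k","a","n","a","r"), ("k","a","n","a","R"),
       ("b","a","r"), ("b","a","R"), ("k","a","n","a","l")]
print(minimal_pair_counts(lex))  # (r, R) counted twice: likely allophones
```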
0:21:45 So how did we test that? We started with a transcribed corpus, which we transcribed into phonemes; then we generated random allophones, and we re-transcribed the phonemic transcription into a very fine-grained description with all these allophones, varying the number of allophones. That's how we generated the corpus. The task is then to take this corpus and find which pairs of phones belong to the same phoneme, using just the information in the corpus.
0:22:28 So that's what we do. We compute a distance based on the number of different minimal pairs that you have for each contrast, and we compute the area under the curve; that's this axis here, and this is the number of phones. Don't look at this curve here; this one is the relevant curve: it is the effect of using the strategy of counting the number of minimal pairs. The performance is quite good, and it is not really affected negatively by the number of phones that you add.
0:23:09 So this strategy works quite well, but of course it's cheating, right? Because there I assumed that the babies had the boundaries of words. They don't; and in fact I showed you just before that if you have lots of allophones, it is actually extremely difficult to find the boundaries of words. So that's a kind of circularity that we would like to avoid.
0:23:35 And so the idea that Andrew Martin, the postdoc, had, which was great, was to say: well, maybe we don't need an exact lexicon. Maybe babies can go ahead and build a proto-lexicon with whatever segmentation routine they have. It will be incomplete, it will be wrong, it will have many non-words in it; but still, it may be a useful thing to have. And that's what we found here. We used a really, extremely rudimentary segmentation procedure based on n-grams: we took the ten percent most frequent n-grams in the corpus, and that was the lexicon. So it was really pretty awful; but still, it provided performance that was almost as good as the gold lexicon.
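A minimal sketch of such a rudimentary segmenter, as I read the description: collect phone n-grams from the unsegmented corpus and keep the most frequent ones as the proto-lexicon.

```python
# Sketch: a deliberately crude proto-lexicon, as described in the talk:
# the most frequent phone n-grams of the corpus, no real segmentation.
from collections import Counter

def proto_lexicon(utterances, n_max=4, keep_fraction=0.10):
    """utterances: list of phone-symbol sequences without word boundaries."""
    ngrams = Counter()
    for u in utterances:
        for n in range(1, n_max + 1):
            for i in range(len(u) - n + 1):
                ngrams[tuple(u[i:i + n])] += 1
    ranked = [g for g, _ in ngrams.most_common()]
    return set(ranked[:max(1, int(len(ranked) * keep_fraction))])

# Such a lexicon is mostly junk, yet in the study it supported the
# minimal-pair statistic almost as well as the gold lexicon did.
```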
0:24:22 Then Andrew Martin went to Japan, and I had a doctoral student who said: well, we could go even further than that. Maybe babies could construct some approximate semantics. And the reason why that could be useful is this. Take the two variants of 'canard': they are correctly declared allophones, because they show up as an apparent minimal pair. But what about this pair? These are two words in French, 'canal' and 'canard'. If I were to apply the same strategy, I would declare that /l/ and /r/ are allophones, which is wrong, and I would end up with a sort of Japanese French, a system without the /r/-/l/ distinction. That's not what we want. On the other hand, if we have some idea, even a vague idea, of the meanings, namely that 'canard' is some kind of bird whereas 'canal' is some kind of water thing, then maybe that's sufficient to distinguish these two cases.
0:25:31 So there, we ran the same kind of pipeline, but we made the problem more realistic: instead of generating random allophones, we generated the allophones using tied three-state HMMs. That actually makes the problem much more difficult; the allophones are more realistic, but the lexical strategy I presented before starts having trouble with them. And then the idea is this: you take that corpus, no cheating anymore, you try to recover possible words from it, then you do some semantic estimation, and then you compute the semantic distance between the two members of each candidate pair of phones.
0:26:16 So how does it work? For word segmentation, we used state-of-the-art systems: minimum description length, or adaptor grammars. We know they work, but we know they don't work very well; especially with lots of allophones, they give a pretty bad estimate of the lexicon. But we still take that as our lexicon, and we apply latent semantic analysis, which basically counts how many times the different terms occur in different documents. Here we took documents of ten sentences: we took the whole corpus, segmented it into ten-sentence chunks, and computed this matrix of counts, which we then decomposed, arriving at a semantic representation where each word is now a vector. Now, people today do much more sophisticated things than this; this is pretty much first-generation semantic analysis. But what's nice is that we can compute the cosine between the proto-semantic representations of two words, and the idea is that if you have two allophones of the same phoneme, they should have quite similar vectors, because they occur in the same contexts.
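Here is a compact sketch of that pipeline (my illustration, with truncated SVD standing in for the decomposition and the corpus assumed to be pre-cut into ten-sentence documents):

```python
# Sketch: first-generation LSA over 10-sentence documents, then cosine
# similarity between the vectors of two candidate allophonic word forms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_vectors(documents, n_dims=100):
    """documents: list of strings, each one a 10-sentence chunk of the
    corpus transcribed with the (noisy) proto-lexicon.
    n_dims must be smaller than the vocabulary size."""
    vectorizer = CountVectorizer().fit(documents)
    X = vectorizer.transform(documents)          # docs x terms counts
    svd = TruncatedSVD(n_components=n_dims, random_state=0)
    term_vecs = svd.fit(X).components_.T         # terms x dims
    return vectorizer.vocabulary_, term_vecs     # word -> row index

def semantic_similarity(w1, w2, vocab, term_vecs):
    v1 = term_vecs[vocab[w1]].reshape(1, -1)
    v2 = term_vecs[vocab[w2]].reshape(1, -1)
    # High for allophonic variants of one word, lower for real word pairs.
    return cosine_similarity(v1, v2)[0, 0]
```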
0:27:34 OK, so here is the result. In this study, because we had generated the allophones on the basis of HMMs, we could compute the acoustic distance within each pair. Obviously, acoustic distance is going to help: if two phones are quite close to one another, maybe they should be grouped together, because they are likely allophones. We also know that this alone is not enough; the bottom-up strategy doesn't work by itself. The performance is not bad, but it is still not perfect. So here is the performance in the task where I give you two phones and you have to tell me whether they are allophones or not; chance is fifty percent. This is the percent correct if you use acoustic distance only, for English and Japanese. Something is missing there, which is the number of allophones: it ranges from five hundred to a thousand. And this is the effect of the semantic distance: the performance is almost as good as with the acoustic distance. And when you combine them, you get very good performance.
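A minimal sketch of the combination (my illustration; the actual study may combine the cues differently): feed both distances to a simple logistic regression that predicts whether a pair is allophonic.

```python
# Sketch: combine the bottom-up and top-down cues in one classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def combined_classifier(acoustic, semantic, labels):
    """acoustic: HMM-based acoustic distances, one per candidate pair;
    semantic: 1 - cosine similarity of the pairs' LSA vectors;
    labels: 1 if the pair is truly allophonic, else 0 (for training)."""
    X = np.column_stack([acoustic, semantic])
    clf = LogisticRegression().fit(X, labels)
    return clf  # clf.predict_proba(X_new)[:, 1] scores unseen pairs

# Each cue alone is above chance (50%); in the study the combination
# was clearly better than either one by itself.
```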
0:28:48 So that shows that you can use this kind of semantic representation even though it is computed on the basis of an extremely bad lexicon. I mean, at this level here, the proportion of real words that you find with the adaptor-grammar type of framework is about twenty percent; so your lexicon is twenty percent real words and eighty percent junk. Nevertheless, that lexicon is enough to give you some semantic top-down information, which shows that the semantic information is very strong.
0:29:26 All right, I'm going to wrap up very quickly. We started with this idea that babies would proceed in a sort of bottom-up fashion. That doesn't work: it doesn't scale, and it is circular. It also doesn't account for the fact that babies are learning words before they have really zoomed in on the inventory of phonemes. In fact, there are now results showing that even at six months, babies have an idea of the semantic representations of some words. So basically, babies learn everything at the same time.
0:30:05 So now we would like to replace that scenario with a system like this, where you start with raw speech and you try to learn all the levels at the same time. And of course you're going to do a bad job: the phonemes will be wrong, the word segmentation will be wrong, the semantics will be awful. But then you combine all of this and you make a better next iteration. That would be the proposed architecture for babies.
0:30:34 Now, for this to work, we have to do a lot more work. We have to stop, of course, using the target language as supervision, and try to approximate, as much as we can, the real input that babies are getting. We have to quantify what it means to have a proto-lexicon or proto-semantics: I gave you an idea of an evaluation procedure for what it is to have proto-phonemes, and we have to do the same thing for proto-words, proto-semantics, et cetera, because these are all really approximate representations. And then there are the synergies, which is what I described just now: when you try to learn the phonological units alone, you do a pretty bad job; semantic representations alone, it's difficult; but if you try to learn a sort of joint model, you are going to do better. And there are a lot of potential synergies you could imagine.
0:31:33 The last thing that I have to do as a psychologist, of course, is to go back to the babies and test whether they are actually doing this, but I'm not going to talk about that now.
0:31:44 Finally, why should we do this? I think this reverse engineering of the human infant is really a new kind of challenge, and I think both sides can bring a lot to it. Psychologists can bring ideas, we can bring some interesting corpora, and we can test whether the ideas correspond to realistic possibilities; and engineers can bring algorithms, and also large-scale tests on real data, which is very important.
0:32:17 And we have a lot to work on. Here is what some of the potential architecture could be: I tried to put in everything that has been documented somewhere in terms of potential links between the different levels we are trying to approximate. And there, I guess, you would have to add a lot more. Babies are also actually articulating, so maybe this articulation feeds back to help in constructing the sub-lexical units; they are also interested in the faces of caretakers; and they have a lot of rich semantic input for acquisition. All these representations have to be put in at some point. But I think what we have to do first is to establish whether we do have interesting synergies or not; if we don't, then we can refactor the system into separate subsystems.
0:33:10 And that's all I have to say. This is the team; these are the very nice colleagues who helped with this work.
0:33:29 [Session chair] OK, so we are going to have an abbreviated panel, so we don't have a whole lot of time for questions, but one or two.
0:33:47 [Question] What do you think about infants learning something between phonemes and words, like syllables, which have a nice sort of acoustic chunkiness to them?
0:33:58 That was actually basically the hypothesis I had when I was a student: the role of the syllable. I mean, it's perfectly possible. The thing is, I think that what deep belief networks are doing, by having these input representations where you stack about a hundred and fifty milliseconds of signal, is something going in that direction: a hundred and fifty milliseconds is basically the size of a syllable.
0:34:39 So I guess that is what lies behind this. I mean, if this is what people are using most recently, the reason is that this is basically the domain of the syllable, where you have coarticulation; that's where you can capture the essential information for recovering contrasts. So I think there are many ways in which syllable-type units could play a role: it could be in an implicit fashion, like I just said, or you could actually try to build recognition systems with units that have the syllable shape, which is another way to do it.
0:35:15 I mean, we know that infants count syllables. For instance, at birth, if you present them with three-syllable words and then you switch to two-syllable words, they notice the change; they don't notice the change if you go from four phonemes to six phonemes, which is the same ratio of change. So we have evidence that syllables, or syllable nuclei, are things that they pay a lot of attention to.
0:35:44 [Question] Thank you for your talk, Emmanuel. I think you once told me that almost from day one, infants can imitate articulatory gestures, that this is somehow hard-coded. I don't know how you do that experiment, but on the other hand, all of this acquiring of phoneme inventories and word boundary segmentation and lexicons seems to be part of the plasticity of learning and inference. So why isn't some notion of starting with articulatory gestures, since that is sort of there at the beginning, part of your model? Or should it be?
0:36:31 So I have actually, of course, proposed working on this, and a number of people have tried to incorporate it, using deep learning systems that try to learn speech features and articulatory features at the same time. If you train like this, and at test time you present only the speech features, you get better decoding than if you had learned from speech features alone. So we know there is some sense in which that could work. But of course, this work was done with real adult articulation; a baby's articulation is much more primitive, so it's not clear that it would help as much. But that's one of the things we want to try.
0:37:15 [Session chair] I think we're out of time, but Emmanuel, I believe you're going to be here tomorrow as well, so I encourage people to go and ask all these questions; I think they're very relevant to the work of this community.