0:00:16 | So, well, thank you. I want to thank the organisers for allowing me to be a sort of surprise talker. I'm going to tell you a little bit about what we have been doing in terms of trying to understand language acquisition.
0:00:35 | As a parent, when we try to understand how babies learn languages, it seems very obvious: we just use our intuition, and maybe all we have to do is listen to what they are learning. Right? It's very simple.
0:00:51 | Now, as psychologists, we have been trained to try to take the place of the baby: how does it feel to be a baby in this situation? Well, it's going to be a lot more complicated, because the baby doesn't understand what it is being told; it just has the signal.
0:01:09 | And now, what I would like to do is take a third perspective, which would be the perspective of an engineer. I'm not an engineer myself, I'm a psychologist, but here the idea is to see how we could basically construct a system that does what the baby does. That's the perspective we would like to push.
0:01:29 | So, this is what we know, or think we know, about babies' early language acquisition. This timeline here is the model: this is birth, and this is the first birthday.
0:01:44 | As you can see, babies are learning quite a number of things quite quickly. Basically, here babies are starting to say their first few words, and before that they are producing various vocalisations. But actually, before they produce their first word, they are learning quite a bit of information about their own language: for instance, they start to be sensitive to the vowels that are typical of the language, then they start to build some representation of the consonants, and they are starting to build basically language models with sequential dependencies, et cetera. So this is taking place very early, way before they can show us that they have learned these things. At the same time they are also learning other aspects of language, in the prosodic and in the lexical domain.
0:02:34 | So how do we know that babies are doing this? Well, this is the job of psychologists: to interrogate babies that don't talk. We have to find clever ways to build situations where the baby is going to, for instance, look at a screen, or suck on a little pacifier, and this behaviour of the baby is a way for them to control the stimuli that they are going to be presented with. The typical experiment, which was basically the beginning of this field in the seventies, goes as follows.
0:03:07 | Eimas and colleagues ran a study where you present the same syllable over and over again, each time the baby produces this little sucking behaviour: we present, say, 'ba'. You can see here that the frequency of the sucking is decreasing, because it's boring, it's always the same syllable. But then suddenly you change the syllable, to 'pa', and now the baby sucks a lot more. That means that the baby has noticed that there was a change. And this is compared to control conditions where the same syllable continues: exactly the same syllable versus a slightly different one.
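(To make the logic of this paradigm concrete, here is a minimal sketch, not from the talk, of how such a sucking experiment can be scored: habituation is a declining sucking rate, and dishabituation is a rebound after the switch, tested against a control group that keeps hearing the same syllable. All numbers are invented.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sucking_rates(n_babies, rebound):
    """Per-minute sucking rates: exponential habituation to a repeated
    syllable, then a post-switch phase raised by `rebound` sucks/min."""
    habituation = 60 * np.exp(-0.15 * np.arange(8))      # minutes 1-8: 'ba'
    post_switch = np.full(4, habituation[-1] + rebound)  # minutes 9-12
    curve = np.concatenate([habituation, post_switch])
    return curve + rng.normal(0, 5, size=(n_babies, curve.size))

experimental = sucking_rates(16, rebound=15.0)  # switch 'ba' -> 'pa'
control      = sucking_rates(16, rebound=0.0)   # 'ba' continues unchanged

def dishabituation(group):
    """Mean post-switch rate minus the last pre-switch minute, per baby."""
    return group[:, 8:].mean(axis=1) - group[:, 7]

t, p = stats.ttest_ind(dishabituation(experimental), dishabituation(control))
print(f"t = {t:.2f}, p = {p:.4f}  (a low p means the change was detected)")
```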
0:03:47 | So with this kind of idea you can basically probe babies' perception of speech sounds, and you can ask yourself: do they discriminate 'ba' and 'pa', 'dot' and 'got', and all these kinds of sounds? You can also probe memory: have they memorised, have they segmented out particular frequent or interesting parts of the sound environment? There are also fancier types of equipment to do the same kind of experiments, but I'm not going to talk about them.
0:04:18 | So the question that really interests me here is: how can we understand what babies are doing? Now, if you open up linguistics, I mean psycholinguistics or developmental psychology journals, you find some hypotheses that are interesting, but I'm not going to talk about them, because unfortunately these theories do not allow us to get an idea of what the mechanisms are that babies are using for understanding speech.
0:04:49 | You do find, in psychology and also in linguistics journals, publications trying to basically cut the learning problem down into sub-problems. For instance, some people (I'm going to talk more about that) have studied how you could find phonemes from raw speech using some kind of coarse unsupervised clustering; others, once you have the phonemes, how to find the word forms; or once you have the word forms, how to learn some semantics, et cetera.
0:05:17 | These papers use basically less technology than in engineering; they are not done by engineers. One particular aspect of them is that they focus on a really small part of the learning problem, and they also make a lot of assumptions about the rest of the system.
0:05:41 | So the question we can ask ourselves is: could we make a global system that would learn much of what babies are doing by concatenating these elements? What I will try to demonstrate to you is that such a system simply does not work. It doesn't work because it doesn't scale, it incorporates some circularities, and it doesn't represent what babies are doing anyway.
0:06:07 | So I'm going to focus on this particular part, and we have talked a lot about it: at least two talks today focused on how you could discover some units of speech from raw speech. In psychology, people really believe that the way babies do that is by accumulating evidence and doing some kind of unsupervised clustering.
0:06:35 | A couple of papers were published establishing that babies at six months are able to distinguish sounds that are not in their language: they can distinguish contrasts that, if you speak a language that uses them, you hear right away, but that most of you wouldn't hear. And contrary to the six-month-olds, by twelve months they have lost that ability, because the contrast is not used in their language. The hypothesis about how babies do that is that they are basically accumulating evidence and doing some kind of statistical clustering based on the input that's available in the language.
0:07:15 | Now, a number of papers have tried to demonstrate that you can build a system like this. However, most of these papers have dealt with a very small number of categories. These are sort of proof-of-principle papers: they construct data according to known distributions, and they show that you can recover these by doing some kind of clustering. That's nice, but does it scale?
0:07:44 | And as everybody here knows, speech is more complicated than this. This is running speech, and when you get more conversational speech, the sounds are not separated, not so easily segmented; segmentation is part of the problem, et cetera.
0:08:00 | So this is where I started to get involved in this problem, working with colleagues at Johns Hopkins. The idea was basically to try to apply a really simple unsupervised clustering algorithm to raw, running speech and see what we get: could we get phonemes out of it?
0:08:23 | This is what we did. The idea is that you start with a very simple Markov model with just one state, and then you split the states in various possible ways: you can split in the time domain, or 'horizontally', so that you have two different versions of each sound. Then you continue to iterate this graph-growing process until you have a very complicated network.
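(The following is a toy sketch of this 'grow by splitting' idea, in the spirit of LBG codebook splitting rather than the actual successive-state-splitting HMM used in that work; the random frames stand in for MFCCs of running speech.)

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_by_splitting(frames, n_rounds=4, n_iters=10):
    """Start from one 'unit' (the global mean) and repeatedly split every
    centroid in two, refining with k-means steps after each split."""
    centroids = frames.mean(axis=0, keepdims=True)
    for _ in range(n_rounds):
        eps = 0.01 * frames.std(axis=0)               # small perturbation
        centroids = np.vstack([centroids + eps, centroids - eps])
        for _ in range(n_iters):                      # k-means refinement
            d = ((frames[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
            labels = d.argmin(axis=1)
            for k in range(len(centroids)):
                if np.any(labels == k):
                    centroids[k] = frames[labels == k].mean(axis=0)
    return centroids

frames = rng.normal(size=(1000, 13))   # stand-in for 13-dim MFCC frames
units = grow_by_splitting(frames)
print(len(units), "sub-phonemic units after 4 rounds of splitting")
```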
0:08:50 | In order to analyse what the system was doing, the approach was to apply decoding using finite-state transducers, so that you can have some interpretation of what the states mean.
0:09:08 | And what was discovered was that the units that are found by this kind of system are very small, smaller than phonemes. Even if you concatenate them (these being the most frequent concatenations of states), they correspond not to phonemes but more to contextual allophones. There is also another problem, which is that the units are not very talker invariant.
0:09:34 | Now, these problems are not very surprising for those of you who work with speech, which is the majority of people here, because we all know that phonemes are not going to be found in such a way. But there is one problem I want to insist on, because I think it's quite crucial, and we talked about it in earlier discussions.
0:09:55 | It is the fact that languages do contain elements that you will discover if you do some kind of unsupervised clustering, but for which there is no way to merge them into abstract phonemes, and this is due to the existence of allophones. In most languages you have allophonic rules. For instance, in French you have a voiced 'r' in one word and an unvoiced 'r' in another: both sounds exist in the language, there is nothing you can do about that, and they actually are two different phonemes in some other language.
0:10:34 | So you are going, in fact, to discover these units, and in a purely bottom-up fashion there is no way to remove them.
0:10:46 | Well, you could say (and that was actually one of the questions that was discussed before): how many phones, how many units do you want to discover? It was sort of said that it doesn't really matter; we can take sixty-four, we can take a hundred.
0:11:08 | Well, actually it does matter for the rest of the processing, at least that's what we discovered with a PhD student of mine. What we did there was basically to vary the number of allophones that you use to transcribe speech, and then we used this other algorithm, a word segmentation algorithm that was also referred to before: one of the Sharon Goldwater type of algorithms.
0:11:37 | What we found: here what you have is the segmentation f-score, and this is the number of phones, converted into the number of alternate word forms that the allophones create. You can see that the performance is affected, it is dropping. This is the rate here for English, French and Japanese, and in some languages like Japanese it's really having a very detrimental effect: if you have lots of allophones, it becomes extremely difficult to find words, because these algorithms just break down. So it does matter to start with good units.
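(A sketch of why this happens, with invented data: each extra context-dependent variant splits a word's statistics across several surface strings, so any segmenter that keys on recurring strings has less to work with. This is an illustration, not the Goldwater model itself.)

```python
import random
random.seed(0)

words = ["dada", "tata", "kuku", "dita"]
corpus = [[random.choice(words) for _ in range(4)] for _ in range(2000)]

def allophonize(word, prev, n_variants):
    """Rewrite each phoneme as a variant indexed by its left context
    (consistent within a run), mimicking the allophonized corpora."""
    out, ctx = [], prev
    for ph in word:
        out.append(ph + str(hash((ctx, ph)) % n_variants))
        ctx = ph
    return ".".join(out)

for n in (1, 2, 4, 8):
    surface_forms = set()
    for utt in corpus:
        prev = "#"
        for w in utt:
            surface_forms.add((w, allophonize(w, prev, n)))
            prev = w[-1]
    print(f"{n} variant(s) per phoneme -> {len(surface_forms)} surface forms "
          f"for {len(words)} underlying words")
```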
0:12:18 | This is another experiment, reported by Aren, where again you basically replace the phonemes with some kind of unsupervised units and you try to feed that into a word segmentation system: you end up with very poor performance.
0:12:38 | So that means that phonemes, at least with a simple-minded clustering system, are not learnable acoustically. There are two ideas from there which I want to discuss: one is to use a better auditory model, and the other is to use top-down information.
0:13:00 | This is just a summary of what I said. What we have right now is: with somewhat simplified or fake input, some unsupervised clustering approaches have been successful; with more realistic input, we have systems that work, but they use heavily supervised models. The question is whether we can build systems that combine this portion of the space.
0:13:33 | Now, I'm not going to present much of the work that we did on unsupervised phoneme discovery itself, because for me there was a preliminary, very important question first, which is: how do we evaluate unsupervised phoneme discovery? Imagine you have a system that discovered units. How can you evaluate whether these units are good or not?
0:13:56 | Traditionally, people use phone error rate, which means you train a phone decoder; this is what we did with the successive state splitting: it was this finite-state transducer that translated the states of the system into phonemes. Of course, the problem is that when you do that, then maybe a lot of the performance at the end is due to the decoder. It may not be that these units are good; it may just be that you have trained a good decoder.
0:14:32 | And also, we don't even know that phonemes are the relevant units for infants. Maybe they are using something else: syllables, diphones, some other kind of unit.
0:14:43 | So the idea is to use an evaluation technique that is suited to this kind of work. The idea is that we don't really care whether babies, or the system, are discovering phonemes as such; what we care about is that they are able to distinguish words that mean different things. 'Dog' and 'doll' mean different things, so they should be distinguished; how the system copes with that is up to the system.
0:15:12 | This is the idea underlying the same-different task that Aren has been pushing all these years, and we have a slightly different version of it, which we call the ABX task.
0:15:24 | The same-different task goes like this: you are given two word tokens, and you have to say whether they are the same word. You compute the acoustic distance between them, and you get the distributions of the two acoustic distances. What Aren showed was that if you are comparing tokens within the same talker, the two distributions are quite different, so it's easy to say whether it's the same word or not; but if it's two different talkers, it becomes a lot harder.
0:15:56 | What we did was to build on this and ask a slightly different question. I give you three things: 'dog' said by one talker, 'doll' said by the same talker, and then 'dog' said by a second talker. Now you have to say whether this X item here is closer to this one or to that one.
0:16:19 | so this is simple psychoacoustic task |
---|
0:16:22 | that's |
---|
0:16:23 | for me it's really inspired but what by the type of experiments we do we |
---|
0:16:27 | babysit apples and with that we can compute the primes you can compute |
---|
0:16:32 | the values that are that have |
---|
0:16:35 | i mean a psychological interpretation but also we can |
---|
0:16:39 | basic you'd have a very fine grained analysis of the type of errors the system |
---|
0:16:43 | is doing so there we apply this task to a database of syllables that have |
---|
0:16:49 | been recalled in english across talkers |
---|
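(Here is a minimal sketch of the ABX decision rule on top of any frame-level representation, with a plain dynamic-time-warping distance; the random arrays stand in for MFCC matrices of the three recordings, so this shows the machinery, not real results.)

```python
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(X, Y):
    """DTW alignment cost between two (n_frames, n_dims) sequences,
    with cosine frame distance, normalised by sequence lengths."""
    D = cdist(X, Y, metric="cosine")
    n, m = D.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def abx_correct(A, B, X):
    """1 if X (same word as A, other talker) is closer to A than to B."""
    return int(dtw_distance(A, X) < dtw_distance(B, X))

rng = np.random.default_rng(0)
mfcc = lambda n_frames: rng.normal(size=(n_frames, 13))  # MFCC stand-ins
A, B, X = mfcc(30), mfcc(35), mfcc(32)  # 'dog' t1, 'doll' t1, 'dog' t2
print("correct" if abx_correct(A, B, X) else "error")
```

Averaging this 0/1 score over many (A, B, X) triples, balanced across talkers and contexts, gives the kind of discrimination scores (or d-primes) compared below.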
0:16:51 | And this is the performance that you get across these trials. What's nice with this kind of task is that you can really compare human and machine: this is the performance of humans, and this is the performance of MFCC coefficients. You can see there is quite a bit of difference between these two kinds of systems. Note that these runs are on meaningless syllables, so it cannot be the case that humans are using meaning to do the task.
0:17:21 | This kind of task can then be used to test different kinds of features, which is nice. That's what we did with a student of mine, Thomas Schatz, and also with Hynek.
0:17:36 | Here, the same task as I was talking about, cross-talker phoneme discrimination: you can apply a typical processing pipeline, where you start with the signal, take the power spectrum, and apply all kinds of transformations, and you can see whether each of these different operations you apply to the signal actually improves that particular task or not.
0:18:00 | In this graph you have the performance depending on the number of spectral channels, and what we found was that the phoneme discrimination task actually requires fewer channels than, for instance, a talker discrimination task, which we can now also run: 'dog' spoken by two speakers, and then a third item, a different word, spoken by one of the first two talkers, and you judge the talker.
0:18:26 | I'm not going to say more about that, but the idea is that trying to specify the proper evaluation tasks is going to help in devising proper features that would work for unsupervised learning.
0:18:46 | This is work we started with another postdoc of mine. What he did was to apply deep belief networks to this problem; we already learned a lot about these during the first day of talks. What you can do is compare the performance of the deep belief network representations, at each of the levels, on this kind of discrimination task.
0:19:13 | This is MFCC, for instance, and this is what you have at the first level of the DBN without any training: you are actually doing better (this is the error rate here). And if you do some unsupervised training, like restricted Boltzmann machine training, you actually get slightly worse on that task. Now, note that this pre-training helps a lot when you do supervised training afterwards; but if you don't do supervised training, it's actually not doing much.
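(A sketch of the shape of that comparison, using scikit-learn's BernoulliRBM as the unsupervised layer; the talk's experiments used deep belief networks on spectral features, so treat this as an outline, with invented data.)

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(0)

# Stand-in acoustic frames, scaled to [0, 1] as the Bernoulli RBM expects.
frames = minmax_scale(rng.normal(size=(2000, 39)))

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20,
                   random_state=0)
hidden = rbm.fit_transform(frames)      # purely unsupervised pre-training

# Both `frames` and `hidden` can now be fed to the same ABX evaluation
# (see the DTW sketch above); nothing in the RBM objective guarantees
# that `hidden` separates phonemes better, which is what the plot showed.
print(frames.shape, "->", hidden.shape)
```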
0:19:46 | So, that's what I'm saying: it's important to have a good evaluation task for unsupervised problems, because then you can discover whether your unsupervised units are actually any good or not.
0:19:59 | Okay, so in the time that remains I would like to talk a little bit about the other idea, the idea of using top-down information. That's an idea that was not, at least to me, very natural, because I had this idea that babies should maybe learn the phonemes, the elements of the language, first, before learning higher-order information. But of course phonemes are part of a big system, and so maybe the definition of the phonemes emerges out of the big system. The intuition there is that maybe babies are trying to learn the whole system, and while they do that, they are going to find the phonemes.
0:20:42 | Of all the different things we tried, I'm going to talk about this idea of using lexical information. Lexical information is a very simple idea, and it's the following. Typically, if you take your whole lexicon and you try to find minimal pairs, pairs of words that differ on only one segment, for instance 'canard' and 'canal', you do find some, but they are statistically very infrequent.
0:21:20 | Now, imagine you are looking at your lexicon: you have some initial procedure to find the words, and then you look at the set of word forms that you find. If you find a lot of minimal pairs that correspond to a given contrast, then you can be pretty sure that it's not really a phonemic contrast; these are probably allophones. That's the intuition.
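(A minimal sketch of that counting strategy over a toy lexicon of surface forms; here 'R' and 'r' are meant as two allophonic variants of one rhotic, so they alternate within the same words and rack up minimal pairs, while genuinely contrastive pairs stay rare. The words are invented.)

```python
from collections import Counter
from itertools import combinations

# Toy surface lexicon: 'R'/'r' alternate freely (allophones); 'l' contrasts.
lexicon = {"kanaR", "kanar", "kanal", "Ruz", "ruz", "Rak", "rak", "lak"}

pair_counts = Counter()
for w1, w2 in combinations(sorted(lexicon), 2):
    if len(w1) == len(w2):
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1:                      # a minimal pair
            pair_counts[tuple(sorted(diffs[0]))] += 1

# Many minimal pairs for one phone pair is suspicious: likely allophones.
for pair, n in pair_counts.most_common():
    verdict = "likely allophones" if n >= 3 else "plausibly phonemic"
    print(pair, n, verdict)
```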
0:21:45 | So how did we test that? We started with a transcribed corpus, we transcribed it into phonemes, then we made random allophones, and then we transcribed the phonemic transcription again into a very fine transcription with all these allophones, and we varied the number of allophones. That's how we generated the corpus, and then the task is to take this and basically find which pairs of phones belong to the same phoneme, using just the information in the corpus.
0:22:26 | So that's what we do, and this is basically it: we compute a distance, namely the number of distinct minimal pairs that you have for each contrast, and we compute the area under the curve; that's this axis right here, and this is the number of phones. Don't look at this curve here; this other curve is the relevant one: the effect of using the strategy of counting the number of minimal pairs. The performance is quite good, and it's actually not really affected negatively by the number of allophones you add.
0:23:11 | This strategy works quite well, but of course it's cheating, right? Because there I assumed that babies have the boundaries of words. But they don't, and in fact I showed you just before that it's actually extremely difficult, if you have lots of allophones, to find the boundaries of words. So that's a kind of circularity that we would like to avoid.
0:23:35 | So the idea that Andrew Martin, the postdoc, had, which was great, was to say: well, maybe we don't need to have an exact lexicon. Maybe babies can go and build a proto-lexicon with whatever segmentation heuristics they have; it's going to be incomplete, it's going to be wrong, it will have many wrong entries in it, but it could still be a useful thing to have. And that's what we found. Here we used a really extremely rudimentary segmentation heuristic based on n-grams, taking the ten percent most frequent n-grams in the corpus as the lexicon. It was really pretty awful, but it still provided performance that was almost as good as the gold lexicon.
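(A sketch of such a rudimentary proto-lexicon, under the assumption stated above that we simply keep the most frequent phoneme n-grams of an unsegmented corpus; the utterance strings are invented.)

```python
from collections import Counter

# Unsegmented phoneme strings standing in for child-directed speech.
utterances = ["yuwantthedog", "thedogisthere", "wheresthedog",
              "lookatthedoll", "thedollisnice"] * 50

counts = Counter()
for utt in utterances:
    for n in range(2, 7):                 # substrings of 2..6 'phonemes'
        for i in range(len(utt) - n + 1):
            counts[utt[i:i + n]] += 1

# Proto-lexicon: the 10% most frequent n-grams, real words or not.
k = max(1, len(counts) // 10)
proto_lexicon = [w for w, _ in counts.most_common(k)]
print(f"{len(proto_lexicon)} proto-words, e.g. {proto_lexicon[:8]}")
```

Most entries of such a lexicon are junk strings straddling word boundaries, and only a minority are real words, which is exactly the regime the experiment above is about.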
0:24:22 | Then Andrew Martin went to Japan, and I had a doctoral student, Abdellah Fourtassi, who said: well, we could go even further than that. Maybe babies could construct some approximate semantics. The reason why it could be useful to do that is the following.
0:24:42 | Take 'canard': its two surface variants are allophonic, and yet they show up as a minimal pair. But what about this one? 'Canard' and 'canal' are two words in French, and if I were to apply the same strategy, I would declare that 'r' and 'l' are allophones, which is wrong; I would end up with a Japanese-French type of system, and that's not what we want. On the other hand, if we have some idea, even a vague idea, of the meaning, that 'canard' is some kind of bird whereas this one, 'canal', is some kind of water thing, then maybe that's sufficient to help distinguish these two cases.
0:25:31 | So there, what we did was the same kind of pipeline, but we made the problem more realistic: instead of having random allophones, we generated allophones by using tied three-state HMMs. That actually makes the problem much more difficult; the allophones are more realistic, but the lexical strategy I presented before starts having trouble with them.
0:26:01 | And then the idea is that you take that, and now you don't cheat anymore: you try to recover possible words from it, then you do some semantic estimation, and then you compute the semantic distance between pairs of phones.
0:26:16 | So how does it work? For word segmentation, a state-of-the-art system: minimum description length, or an adaptor grammar. We know it works, but we know it doesn't work very well, especially if we have lots of allophones: it's going to produce a pretty bad estimate of the lexicon. But we still take that as a lexicon, and then we apply latent semantic analysis.
0:26:43 | That basically counts how many times the different terms occur in the different documents, and here we took documents of ten sentences in length: we take the whole corpus, we segment it into chunks of ten sentences, and we compute this matrix of counts, which we then decompose, and we arrive at a semantic representation where each word is now a vector.
0:27:05 | I mean, people now do much more sophisticated things than this one; this is pretty first-generation, old-school semantic analysis.
0:27:17 | What's nice is that we can compute the cosine between the proto-semantic representations of two words, and the idea now is that if you have two allophones, they should have quite similar vectors, because they occur in the same contexts.
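(A compact sketch of that pipeline with made-up data: count words over ten-sentence 'documents', decompose with an SVD, and compare word vectors by cosine. Here 'kanar' and 'kanaR' are surface variants of the same word, so they should come out nearly parallel, while 'kanal' should not.)

```python
import numpy as np

# Each 'document' stands for ten sentences' worth of (proto-)words.
docs = [["kanar", "pond", "bird"], ["kanaR", "bird", "fly"],
        ["kanal", "water", "boat"], ["kanar", "fly", "pond"],
        ["kanal", "boat", "water"], ["kanaR", "pond", "bird"]]

vocab = sorted({w for d in docs for w in d})
row = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        counts[row[w], j] += 1

# Truncated SVD: every word becomes a low-dimensional 'semantic' vector.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
vecs = U[:, :2] * S[:2]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("kanar ~ kanaR:", round(cosine(vecs[row["kanar"]], vecs[row["kanaR"]]), 2))
print("kanar ~ kanal:", round(cosine(vecs[row["kanar"]], vecs[row["kanal"]]), 2))
```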
0:27:34 | So here is the result. In this study, because we generated the allophones on the basis of HMMs, we can compute an acoustic distance within that. Obviously, acoustic distance is going to help: if you have two phones that are quite close to one another, maybe they should be grouped together, because they are likely allophones. We also know that this alone is not going to work, because the bottom-up strategy doesn't work; performance is not bad, but it's still not perfect. So that's the performance in the task where I give you two phones and you have to tell whether they're allophones or not; chance is fifty percent.
0:28:23 | This is the percent correct if you use the acoustic distance only, for English and Japanese; what's on this axis is the number of allophones, so one hundred, five hundred, a thousand allophones. And this is the effect of the semantic distance: the performance is almost as good as with the acoustic distance. And when you combine them, you get very good performance.
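(A sketch of that combination step, framed as binary classification with the two distances as features; the distributions are simulated so that each cue is only partially reliable on its own, which is the situation described above.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)          # 1 = the two phones are allophones

# Allophone pairs tend to be acoustically close AND semantically similar,
# but both cues overlap across the two classes (each is noisy alone).
acoustic_dist = rng.normal(loc=np.where(y == 1, 1.0, 2.0), scale=0.8)
semantic_sim  = rng.normal(loc=np.where(y == 1, 0.8, 0.3), scale=0.25)

X = np.column_stack([acoustic_dist, semantic_sim])
for name, feats in [("acoustic only", X[:, :1]),
                    ("semantic only", X[:, 1:]),
                    ("combined", X)]:
    acc = LogisticRegression().fit(feats, y).score(feats, y)
    print(f"{name:14s} accuracy: {acc:.2f}")
```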
0:28:48 | That shows that you can basically use this kind of semantic representation even though it is computed on the basis of an extremely bad lexicon. At this level here, the number of real words that you find with the adaptor grammar type of framework is about twenty percent, so your lexicon is twenty percent real words and eighty percent junk. But nevertheless, that lexicon is enough to give you some semantic top-down information, which shows that the semantic information is very strong.
0:29:26 | Alright, so I'm going to wrap up very quickly. We started with this idea that babies would go in a sort of bottom-up fashion. That doesn't work: it doesn't scale, it has circularities, and it doesn't really account for the fact that babies are learning words before they have really zoomed in on their inventory of phonemes. In fact, there are now results showing that even at six months, babies have an idea of the semantic representations of some words. So basically, babies are learning everything at once.
0:30:05 | So now we would like to replace this scenario with a system like this, where you start with raw speech and you try to learn all the levels at the same time. Of course, you're going to do a bad job: the phonemes are going to be wrong, the word segmentation is going to be wrong, everything in the semantics is going to be awful. But then you combine all of this and you make a better next iteration. That would be the proposed architecture for babies.
0:30:34 | Now, for this to work, we have to do a lot more work. We have to basically stop using the target language transcriptions and try to approximate the real input that babies are getting as much as we can. We have to quantify what it means to have a proto-lexicon or proto-semantics: I gave you an idea of an evaluation procedure for what it is to have a proto-phoneme, and we have to do the same thing for proto-words and proto-semantics, et cetera, because these are really approximate representations.
0:31:07 | And then there are the synergies, which is what I described just now: when you try to learn the phonological units alone, you do a bad job; semantic representations alone, it's difficult; but if you try to learn a sort of joint model, you are going to do better. And there are a lot of potential synergies you could imagine.
0:31:33 | The last thing that I have to do as a psychologist, of course, is to go back to the babies and test whether they are actually doing this, but I'm not going to talk about that now.
0:31:43 | Finally, why should we do this? I think this reverse engineering of the human infant is really a new kind of challenge, and I think both sides can bring a lot. Psychologists can bring ideas and some interesting corpora, and we can test whether the proposed ideas correspond to realistic capabilities; engineers can bring algorithms, and also large-scale tests on real data, which is very important.
0:32:17 | And we have a lot to work on, because this would be some of the potential architecture: I tried to put in everything that has been documented somewhere in terms of potential links between the different levels that you are trying to approximate, and I guess you could add a lot of things. Babies are also actually articulating, so maybe this articulation is feeding back to help in constructing the sub-lexical units.
0:32:46 | They are also seeing the faces of caretakers, and they have a lot of rich semantic input for acquisition, so all these representations have to be put in at some point. But I think what we have to do is establish whether we do have interesting synergies or not; if we don't, then we can refactor the system into separate subsystems.
0:33:10 | And that's what I have to say. This is the team; these are very nice colleagues who helped with this work.
0:33:29 | [Session chair] Okay, so we're going to have an abbreviated schedule, so we don't have a whole lot of time for questions, but one or two.
0:33:47 | [Question] What do you think about infants learning something between phonemes and words, like syllables, which have a nice sort of acoustic chunkiness to them?
0:33:58 | [Answer] That was actually basically the hypothesis I had when I did my own thesis work on the role of syllables, so it's perfectly possible. The thing is, I think what deep belief networks are doing, by having these input representations where you stack about a hundred and fifty milliseconds of signal, is somewhat going in that direction: a hundred and fifty milliseconds is basically the size of a syllable. I guess that's what's behind it: if this is what people are using, the reason is that that's basically the domain of the syllable, where you have coarticulation, so that's where you can capture the essential information for recovering contrasts.
0:34:55 | So I think there are many ways in which syllable-type units could play a role. It could be in an implicit fashion, like I just said, or you could actually try to build recognition units that have the syllable shape, which is another way to do it.
0:35:15 | We know that infants are counting syllables, for instance: at birth, if you present them with three-syllable words and then you switch to two syllables, they notice the change; they don't notice the change if you go from four to six phonemes, which is the same ratio of change. So we have evidence that at least syllable nuclei are things that they pay a lot of attention to.
0:35:44 | [Question] Thank you for your talk, Emmanuel. I think you once told me that almost from day one, infants can imitate articulatory gestures, that this is somehow hardcoded, and I don't know how you do that experiment. But on the other hand, all of this acquisition of phonemic inventories, word boundary segmentation, and lexicons seems to be part of the plasticity of learning in infants. So why isn't there some notion of starting with articulatory gestures, since that is sort of there from the beginning? Is that part of your model, or should it be?
0:36:31 | [Answer] I actually do have a proposal to work on this, and there are actually a number of papers that have tried to incorporate it, using deep learning systems that try to learn speech features and articulatory features at the same time. If you train like this and then you only present speech features, you obtain better decoding than if you had been learning from speech features alone. So we know there is some sense in which that could work. But of course, this work was done with real adult articulation; baby articulation is much more primitive, so it's not clear that it's going to help as much. But that's one of the things we want to try.
0:37:15 | [Session chair] So I think we're out of time, but Emmanuel, I believe you're going to be here tomorrow as well, so I encourage people to go and ask all these questions; I think they're very relevant to the work we all do in this community.