0:00:13 | [pre-talk microphone setup, largely inaudible] |
0:00:34 | So, uh, I'm going to present some work on language ID using a, uh, combination of an articulatory based approach and prosody based information. |
0:00:41 | Uh, this work was done by, um, two of my staff members, [names inaudible]. |
0:00:48 | Uh, I will be presenting all of it, uh, later today. |
0:00:51 | Um, the work, uh, we're focused on, uh, is to develop a closed-set, uh, language |
0:00:55 | ID task. |
0:00:57 | Uh, the approach that we're using is a phonological feature based scheme. |
0:01:01 | We've used this, uh, for a number of applications in accent classification, |
0:01:06 | as well as dialect ID, |
0:01:08 | and, uh, here we're going to be using this for language ID. |
0:01:11 | And the main, uh, aspect, or advancement, here is combining the prosody and articulatory based, uh, structure. |
0:01:17 | Um, our evaluations are going to benchmark performance against a, uh, parallel bank of phone recognizers with language |
0:01:24 | models, or PPRLM, |
0:01:26 | uh, using a closed set, uh — that is, uh, |
0:01:29 | our corpus of, uh, five Indian languages. |
0:01:35 | So, |
0:01:36 | what is the motivation for using articulatory based, uh, features? Well, |
0:01:40 | uh, languages, accents, and dialects have different dominant, uh, articulatory traits, and we believe that emphasizing these components will |
0:01:48 | help improve language ID, |
0:01:50 | as opposed to just a pure statistical based approach. |
0:01:53 | Um, some of these traits, for example — we can look at nasalization of vowels, rounded vowels, uh, a lack of retro- |
0:01:59 | flex type phonemes, |
0:02:01 | a lack of consonant-consonant type clusters. |
0:02:03 | Um, if you're looking, for example, at diphthongs — so, uh, languages like Danish may have seventy-five to a hundred |
0:02:09 | diphthongs, |
0:02:10 | whereas languages like Japanese have no diphthongs. So, |
0:02:13 | uh, the presence or absence of some of these traits |
0:02:17 | would be, uh, useful to see in the articulatory, uh, domain. |
0:02:21 | And in addition to this, uh, automatically learning these, uh, articulatory traits, uh, will allow us to, hopefully, |
0:02:26 | build models that contribute to improved, uh, language classification. |
0:02:31 | So this slide is, uh, a little busy, but it's, uh, key for the proposed system. |
0:02:37 | So we kind of start off first, |
0:02:39 | uh, by extracting our phonological features. So this is a time-frequency representation here, |
0:02:44 | and we parse that out, over time, as a phonological feature representation. We tend to use government phonology based approaches |
0:02:51 | for, uh, our partitioning. |
0:02:53 | Each of these blocks here represents a different articulatory-type trait, like lip rounding, |
0:02:58 | uh, tongue height, and whether a vowel |
0:03:00 | is, uh, front or back. |
0:03:01 | Uh, in addition to that, uh, this approach — the, uh, phonological feature extraction scheme — |
0:03:07 | is a traditional HMM based approach, uh, trained on |
0:03:10 | the Switchboard corpus. |
0:03:12 | Um, and so that's the first step here. |
0:03:15 | Uh, the, uh, bottom — |
0:03:17 | do I have a pointer here? — |
0:03:18 | um, |
0:03:19 | uh, |
0:03:20 | steps four and five |
0:03:24 | are basically the phonological feature, |
0:03:26 | uh, language feature extraction, |
0:03:28 | uh, and, uh, two and three are basically the |
0:03:32 | uh, prosodic feature extraction phase. So when we look at the prosodic feature phase, |
0:03:37 | we're analyzing, uh, consonant-bounded, uh, clusters. |
0:03:41 | Um, this is all done in an unsupervised manner, so we don't know what the phone sequence is, of course. |
0:03:47 | So we, uh, break up, uh, the sequence into |
0:03:51 | uh, pseudo-syllables — uh, our consonant-vowel, our consonant-vowel type clusters. |
0:03:56 | Um, and after that we extract, uh, prosody based traits, and this includes |
0:04:00 | uh, both pitch contours and energy contours, |
0:04:03 | at, uh, this pseudo-syllable level. |
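To make the pseudo-syllable step concrete, here is a minimal Python sketch. It assumes a framewise consonant/vowel decision is already available (e.g., from the phonological feature detectors), and the grouping rule — close a span at the first consonant after a vowel run — is an illustrative assumption, not necessarily the talk's exact parser.

```python
# Hypothetical sketch: group a framewise consonant (C) / vowel (V) stream
# into pseudo-syllable spans, unsupervised (no phone sequence needed).
def pseudo_syllables(cv_labels):
    """cv_labels: e.g. ['C','C','V','V','C','V','C'] (one label per frame).
    Returns (start_frame, end_frame) spans, end exclusive."""
    spans, start, in_vowel = [], 0, False
    for i, lab in enumerate(cv_labels):
        if lab == 'V':
            in_vowel = True
        elif in_vowel:  # first C after a vowel run closes the pseudo-syllable
            spans.append((start, i))
            start, in_vowel = i, False
    spans.append((start, len(cv_labels)))  # trailing material
    return spans

print(pseudo_syllables(list("CCVVCVC")))  # [(0, 4), (4, 6), (6, 7)]
```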
0:04:06 | Uh, on the phonological side, over on this side, we, uh, wind up extracting |
0:04:11 | uh, the static, uh, language features at the, um — |
0:04:15 | uh, at the frame level, and so that's what's done here in step four. |
0:04:18 | Um, this gives us a static snapshot, |
0:04:21 | um, |
0:04:22 | of the phonological feature values that we get at a particular, given time. |
0:04:26 | And each static, uh, feature is also, um, |
0:04:29 | augmented into a, uh, unigram, uh, bigram, and trigram type of representation, so we have an expanded |
0:04:36 | language feature set. |
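As an illustration of this unigram/bigram/trigram expansion, here is a small sketch; the feature symbols are fabricated, and collapsing consecutive repeats reflects the skipping of identical snapshots the speaker describes later.

```python
from collections import Counter
from itertools import groupby

def ngram_features(symbols, max_n=3):
    """Count 1- to max_n-grams over a symbol sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(symbols) - n + 1):
            counts["_".join(symbols[i:i + n])] += 1
    return counts

# Hypothetical per-frame static snapshots; collapse consecutive repeats first,
# since only a change in the snapshot is informative.
frames = ["round+high", "round+high", "front+mid", "back+low"]
snapshots = [s for s, _ in groupby(frames)]  # ['round+high', 'front+mid', 'back+low']
feats = ngram_features(snapshots)
print(feats["round+high_front+mid"])         # bigram count: 1
```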
0:04:38 | Um, that's the static side. On the, uh, dynamic side, |
0:04:42 | uh, at step five, |
0:04:44 | uh, we extract features along time and phonological feature. So if you look at, uh — |
0:04:50 | if I can get a nice |
0:04:51 | pointer here — um, |
0:04:52 | you can see here we have phonological features going this way, and time going this way. So this plot kind of shows |
0:04:58 | uh, movement along the phonological features, and then across time. So |
0:05:01 | this movement here kind of shows, uh, |
0:05:03 | a pattern |
0:05:05 | where we have movement through the articulatory — or, the phonological — features, as well as |
0:05:09 | movement across time. |
0:05:11 | Uh, so this, uh, gives us, uh, a new feature: |
0:05:13 | uh, it generates, uh, a change for every phonological feature change — so I'll show you that on the |
0:05:19 | next slide here. So, |
0:05:20 | in essence, |
0:05:21 | uh, the long, uh, bars here represent, uh, phonological feature |
0:05:27 | values across time, |
0:05:28 | and each of the dots that you see in here — |
0:05:31 | they represent, uh, a change in the values, much like a delta of the phonological features that change. |
0:05:38 | So here we can kind of show some examples of what this might look like, |
0:05:41 | uh, for articulatory inspired language features. |
0:05:44 | Um, we can see static-type combinations of phonological features — so that would be the combination of all of these |
0:05:49 | here. |
0:05:50 | We might see dynamic changes — so, in here, we see the move from this stage to here: |
0:05:55 | uh, we have, uh, one phonological feature turning off and another turning on, so we can look at this transition. |
0:06:00 | And we also have the absence of phonological features, which might be present in the combination here — |
0:06:05 | which may be an indication of a particular language, |
0:06:08 | uh, a trait that is unique to, uh, the language that's spoken. |
0:06:13 | Uh, in addition to that, we can also look at, uh, static features — the static combinations of features — |
0:06:17 | if you work across, uh, |
0:06:19 | the phonological feature types for a particular time, |
0:06:22 | uh, block here. |
0:06:24 | And what we do is we tend to skip — uh, we only need to get a, uh, |
0:06:28 | snapshot here, and, if you will, we'll skip the next one, because it's the same. |
0:06:32 | So we're just capturing the individual, |
0:06:34 | uh, static, uh, phonological feature vectors that are unique, uh, at each step there. |
0:06:41 | Um, so an example might be, um — |
0:06:43 | uh, for a particular language feature, a vowel might be front and very high for that particular tongue |
0:06:49 | position, |
0:06:50 | and this gives you a static, uh, representation, for that time, for, uh, backness and height. |
0:06:54 | Okay. |
0:06:55 | These are also augmented, uh, with their unigram, bigram, and trigram type combinations, for the static language features that we're |
0:07:01 | using. |
0:07:02 | Uh, and, |
0:07:03 | uh, because of that, they'll allow us to have some type of |
0:07:06 | allophonic-type variation that can be captured in the, uh, language model. |
0:07:11 | In addition to that, we can look at, uh, extracting the dynamic features here, and in this context we're |
0:07:16 | obviously also looking at transitions, where you have movement. |
0:07:19 | So when things are static, |
0:07:20 | uh, they don't change; and when there is a movement here, you kind of, uh, identify those parts. |
0:07:25 | So we're looking at those, uh, value pairs when a phonological feature changes, |
0:07:29 | and we skip, uh, the things that are, uh, not changing |
0:07:33 | uh, over time. So an example would be, like, uh, |
0:07:36 | a language feature that would be a place change, uh, |
0:07:39 | from alveolar to a labial-type position — |
0:07:41 | and that would show you a movement, uh, articulatorily. |
0:07:45 | And, uh, bigram — uh, sorry — unigram, bigram, and trigram type language model combinations of the dynamic language, uh, features |
0:07:53 | are incorporated at this phase as well. |
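A hedged sketch of these dynamic language features: emit a (previous, current) value pair whenever a phonological feature changes, and skip frames where nothing moves. Feature names and values here are illustrative only.

```python
# Sketch: one event per phonological feature change, static frames skipped.
def dynamic_events(frames):
    """frames: list of dicts mapping feature name -> value, one per frame.
    Yields (feature, old_value, new_value) change events."""
    for prev, cur in zip(frames, frames[1:]):
        for feat, val in cur.items():
            if prev.get(feat) != val:
                yield (feat, prev.get(feat), val)

frames = [{"place": "alveolar", "round": 0},
          {"place": "alveolar", "round": 0},   # static frame: no event
          {"place": "labial",   "round": 1}]
print(list(dynamic_events(frames)))
# [('place', 'alveolar', 'labial'), ('round', 0, 1)]
```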
0:07:56 | We use a maximum entropy classification framework. Again, as I said, uh, it's a closed set, uh — |
0:08:01 | five languages we're working with. |
0:08:04 | We extract the evidence from, uh, these language features, |
0:08:07 | uh, represented here, and we find a maximum entropy classifier for a particular language. |
0:08:13 | Uh, and the language features themselves could be articulatory, prosodic, or a combination of those; and the prosodic cases would be energy |
0:08:19 | and pitch — we look at those at this phase. |
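Since multinomial logistic regression is the standard realization of a maximum entropy classifier, this step might look like the following sketch; the feature dictionaries and labels are fabricated for illustration.

```python
# Sketch of maxent classification over n-gram count features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One dict of language-feature counts per utterance (see earlier sketches).
X_dicts = [{"round+high": 3, "round+high_front+mid": 1},
           {"back+low": 4, "place:alveolar->labial": 2}]
y = ["kannada", "marathi"]  # closed-set labels

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(vec.transform([{"round+high": 2}])))
```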
0:08:22 | So, the prosody based language features — the motivation here: we do know that, uh, perception of, um, |
0:08:28 | languages by humans shows that, uh, prosody is an important factor. |
0:08:32 | Uh, we extract the language features from pitch and energy contours. The extraction strategy uses the RAPT, uh, algorithm for |
0:08:38 | pitch information, |
0:08:40 | and we normalize, uh, the pitch values |
0:08:42 | uh, for the mean, so we remove, uh, some of the speaker dependency there. |
0:08:46 | The contours themselves are broken up, as I mentioned, into pseudo-syllable form, |
0:08:51 | and that's done using the phonological feature based parsing scheme. |
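The talk names RAPT but not an implementation; assuming the pysptk binding of Talkin's RAPT tracker, the pitch extraction and mean normalization might look like this sketch (the file name is hypothetical, and mono audio is assumed).

```python
import numpy as np
import pysptk
import soundfile as sf

x, fs = sf.read("utterance.wav")  # hypothetical input; assumed mono
f0 = pysptk.rapt(x.astype(np.float32), fs=fs, hopsize=int(fs * 0.01),
                 min=60, max=400, otype="f0")  # one F0 value per 10 ms frame
voiced = f0 > 0                   # RAPT returns 0 for unvoiced frames
logf0 = np.log(f0[voiced])
logf0 -= logf0.mean()             # per-utterance mean removal to reduce speaker dependency
```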
0:08:55 | And the log-pitch and log-energy contours are then approximated using a Lagrange — uh, um, sorry — a Legendre |
0:09:00 | polynomial basis, uh, and those coefficients are used |
0:09:04 | in a GMM based, uh, classifier |
0:09:06 | uh, for the language. |
0:09:08 | So these are the Legendre, uh, polynomials that are used — they're defined over values |
0:09:13 | uh, varying between minus one and plus one — so we get different shapes here, and we approximate the contours for |
0:09:18 | the energy and pitch using |
0:09:20 | these polynomials; we have three of them here. |
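A minimal sketch of this approximation with NumPy's Legendre module: each pseudo-syllable contour is mapped onto [-1, +1] and fit with three basis polynomials (order 2), matching the three shown; the contour values are fabricated.

```python
import numpy as np
from numpy.polynomial import legendre

contour = np.array([4.8, 4.9, 5.0, 5.05, 5.0, 4.9])  # fabricated log-F0 values
t = np.linspace(-1.0, 1.0, len(contour))             # map syllable span to [-1, +1]
coeffs = legendre.legfit(t, contour, deg=2)          # 3 Legendre coefficients
approx = legendre.legval(t, coeffs)                  # reconstructed contour
print(coeffs)  # these per-syllable coefficients feed the GMM codebook
```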
0:09:24 | So, the, uh, prosody based, uh, language features are set up this way: uh, the coefficients themselves are used |
0:09:29 | to train GMMs; the GMMs |
0:09:31 | uh, are just, uh, |
0:09:33 | the cluster centroids — the codebook — for those particular prosody, uh, components. |
0:09:38 | Um, and, uh, the vectors themselves form the language features we're going to be working with, and again, |
0:09:44 | uh, unigram, bigram, trigram language models over the codebook entries are used. |
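The codebook step might be sketched as follows: cluster the per-syllable coefficient vectors with a GMM and tokenize each syllable by its most likely component, so the earlier n-gram machinery can run over prosodic symbols. The codebook size of 64 is an assumption; the talk does not give one.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

coeff_vectors = np.random.randn(500, 3)      # fabricated per-syllable coefficients
gmm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(coeff_vectors)
tokens = gmm.predict(coeff_vectors)          # one codebook index per syllable
prosody_symbols = [f"p{t}" for t in tokens]  # feed into unigram/bigram/trigram counts
```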
0:09:50 | Uh, for the evaluation, uh, we used, uh, five languages, uh — |
0:09:54 | uh, Indian languages — a hundred speakers per language; seventy-five hours |
0:10:00 | uh, of speech per language; uh, there's ninety hours of spontaneous speech. This work has focused on the |
0:10:06 | read speech. |
0:10:08 | Um, the languages are Kannada, Telugu, Tamil, Marathi, and Malayalam. |
0:10:17 | Uh, so Marathi is the one that belongs to the Indo-Aryan, uh, type |
0:10:21 | languages; |
0:10:22 | the other four — Kannada, Telugu, Tamil, and Malayalam — |
0:10:25 | are, uh, Dravidian languages. |
0:10:27 | So, uh, we are benchmarking this approach against a, um, |
0:10:32 | parallel bank of phone recognizers with language models, |
0:10:35 | uh, to see what the performance would be. So, |
0:10:38 | uh, we |
0:10:40 | collaborated, or, uh — |
0:10:42 | uh, I guess, with Brno — |
0:10:43 | sorry — |
0:10:44 | uh, BUT — and we're using the BUT, uh — |
0:10:47 | uh, a phone recognition setup. |
0:10:48 | So, in the first set of, uh, phone recognizers from BUT, we used the, uh, German, Hindi, Japanese, Mandarin, and |
0:10:55 | uh, Spanish, uh, set; |
0:10:57 | uh, the second set was, uh, |
0:10:59 | uh, English, Czech, Hungarian, and Russian. So |
0:11:02 | uh, we wanted to see, if we left out any Indian-type languages, whether that would actually |
0:11:06 | make a difference. |
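For reference, here is a toy sketch of how PPRLM scoring works in general: each parallel phone recognizer tokenizes the utterance, a per-(recognizer, language) phonotactic n-gram model scores the token string, and scores are summed across recognizers. The recognizers and models are stubs supplied by the caller; this is not the BUT setup itself.

```python
import math

def bigram_logprob(tokens, lm, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a phone-token string.
    lm: dict with Counter 'bi' over token pairs and Counter 'uni' over tokens."""
    lp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        num = lm["bi"][(a, b)] + alpha
        den = lm["uni"][a] + alpha * vocab_size
        lp += math.log(num / den)
    return lp

def pprlm_decide(utt_audio, recognizers, lms, languages):
    """recognizers: name -> callable returning a phone-token list.
    lms: (recognizer_name, language) -> phonotactic LM dict with key 'V'."""
    scores = {lang: 0.0 for lang in languages}
    for rec_name, recognize in recognizers.items():  # e.g. the parallel BUT recognizers
        tokens = recognize(utt_audio)                # phone string from this recognizer
        for lang in languages:
            lm = lms[(rec_name, lang)]
            scores[lang] += bigram_logprob(tokens, lm, lm["V"])
    return max(scores, key=scores.get)
```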
0:11:08 | These are some of the results for articulatory based language features, evaluated on the read speech. |
0:11:13 | And we can see, using just the static features themselves: a fifty-nine percent, uh, language ID rate |
0:11:18 | here. |
0:11:19 | Um, |
0:11:20 | Marathi is the, uh, one that actually, uh, scores the best, and there's the most confusion between |
0:11:26 | Kannada and Telugu. |
0:11:28 | Um, |
0:11:29 | if you focus just, um, on the dynamic language features, again, |
0:11:33 | uh, we have a significant improvement — |
0:11:37 | at least the Marathi, uh, ID, uh, stays kind of the same, |
0:11:42 | and some improvement, uh, on the Malayalam. So the |
0:11:45 | uh, dynamic features increase this to seventy-one point nine percent. |
0:11:48 | If you combine, uh, the static and dynamic language, uh, features: uh, seventy-four percent. |
0:11:54 | Okay. |
0:11:55 | Uh, next, uh, we looked at, uh, incorporating — or, uh, considering — the prosody based, uh, language features. |
0:12:01 | Um, this included pitch, energy, and the combinations of pitch and energy contours. |
0:12:06 | Of course, prosody by itself is not, uh, an overriding factor that you can |
0:12:10 | build a language ID system on, but, uh, if you augment the right |
0:12:14 | uh, spectral based, uh, structure, you can improve things. |
0:12:17 | So again, chance here — it's a five-way classification, so twenty percent chance — so you can see that there is |
0:12:22 | an improvement, |
0:12:23 | um, by combining both pitch and energy based contours, using the |
0:12:27 | uh, Legendre polynomial type modeling scheme: |
0:12:30 | forty-seven percent classification. |
0:12:34 | Next, uh, we combined the phonological feature and prosodic based, uh, setups; uh, performance actually increases, uh, to seventy- |
0:12:41 | nine, uh, point, uh, five percent, |
0:12:44 | uh, using this combination. |
0:12:46 | And using the PPRLM, |
0:12:48 | we were at, uh, eighty-two point, uh, two percent; |
0:12:52 | in some other experiments we had another run that hit eighty-two point seven percent. |
0:12:55 | Um, and so we're still getting better performance with the PPRLM versus the phonological feature plus prosodic based, uh, structure. |
0:13:02 | Um, we did, uh, do some additional experiments, uh, adding the prosodic based pieces to the PPRLM, |
0:13:09 | uh, and we see that that would improve performance. And so these are the final results: |
0:13:13 | using static, uh, language, uh, features from, um — |
0:13:18 | using |
0:13:19 | static language features from the phonological feature based scheme: |
0:13:22 | uh, fifty-nine percent. Uh, using dynamic, um, |
0:13:25 | language features — doubling the feature set size, to seven thousand — |
0:13:30 | uh, seventy-one percent; and with the combination of static and dynamic, uh, we get seventy-four percent. |
0:13:36 | Uh, using the prosodic-type structure, you can see you don't get much improvement there, but if you combine pitch |
0:13:41 | and energy contour information, |
0:13:44 | it improves. |
0:13:45 | And the combination of phonological and prosodic based schemes, uh, gives you seventy-five percent. |
0:13:50 | Um, the PPRLM alone gives more improvement, but if you do |
0:13:54 | system fusion with that, |
0:13:56 | you get a four percent increase, |
0:13:59 | absolute, in, uh, language ID for the five languages. |
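The talk reports this fusion gain but not the fusion rule; a score-level linear fusion, sketched here with a weighted sum and an assumed weight, is one common way such a combination is done.

```python
# Hedged sketch of score-level fusion between the proposed system and PPRLM.
def fuse(scores_a, scores_b, w=0.5):
    """scores_*: dict language -> log-likelihood-like score; w is an assumed weight."""
    return {lang: w * scores_a[lang] + (1 - w) * scores_b[lang]
            for lang in scores_a}

proposed = {"kannada": -11.2, "marathi": -9.8, "tamil": -12.0}  # fabricated scores
pprlm    = {"kannada": -10.5, "marathi": -10.1, "tamil": -11.4}
fused = fuse(proposed, pprlm, w=0.4)
print(max(fused, key=fused.get))  # fused language decision
```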
0:14:01 | So, |
0:14:02 | um, it does show some improvement, uh, incorporating the phonological feature and the prosody based, uh, structure |
0:14:08 | here. |
0:14:09 | So, in conclusion: we presented a new framework for using articulatory and prosodic information for language identification. |
0:14:16 | We developed a new methodology for extracting language features from phonological representations, |
0:14:22 | and the language features themselves |
0:14:24 | uh, are automatically learned using maximum entropy based techniques. |
0:14:28 | Um, the combination of prosodic and articulatory type information was shown to be useful for improved, uh, language ID, |
0:14:35 | and, uh, the proposed system, um, |
0:14:37 | shows some further improvements when combined with a, uh, PPRLM-type system. |
0:14:42 | In the future, we're going to expand, uh, this to |
0:14:44 | uh, additional new languages, and also consider performance on the spontaneous speech — which sees some changes in, uh, |
0:14:51 | production-type traits for the |
0:14:52 | spontaneous speech. |
0:14:54 | And, uh, there are some references, uh, on the page. |
0:14:57 | Thank you. |
0:15:05 | [applause; inaudible audience comment follows] |
0:15:36 | We agree — because we actually ran, uh, the same types of experiments, um, on a similar five-dialect Arabic corpus, for |
0:15:41 | dialect ID. |
0:15:43 | And, uh — |
0:15:44 | so you can clearly see that, for the four very confusable South Indian languages, |
0:15:49 | it's much more challenging. We've seen some, some differences when we look at accent structure, in |
0:15:54 | previous Interspeech and ICASSP papers. |
0:15:56 | That's why we think that we can look at languages that are particularly close together; |
0:16:00 | we think that the subtle differences that you might see in these languages may come out a little |
0:16:05 | bit more in the articulatory-type patterns |
0:16:08 | than if you look at the larger statistical based schemes, like the PPRLM |
0:16:12 | approaches. |
0:16:13 | Um, |
0:16:14 | I think, somehow, uh, |
0:16:16 | if there are big differences between, uh, the languages, I think it's maybe a little bit more difficult to kind |
0:16:21 | of weed those things out, sometimes. |
0:16:24 | Uh, if there's channel or microphone mismatch — like what Fred, uh, talked about, using MAP — |
0:16:29 | sometimes that tends to dominate the differences between the sets. So |
0:16:32 | unless they're all collected in the same way, you can't be sure of that. So |
0:16:36 | I do agree — we'd try to make the task more challenging. |
0:16:39 | We are participating in the LRE this year, so we hope that this might, uh, |
0:16:43 | come to fruition better, when |
0:16:45 | there is a wider — |
0:16:48 | [inaudible audience question about how the corpus was recorded] |
0:17:01 | It was recorded, uh — |
0:17:10 | It was actually recorded, uh, on the street, |
0:17:14 | in each of the different regions. |
0:17:16 | So, uh, people were recorded in kind of a quiet area when they were reading, and then they were in |
0:17:20 | more public settings when they were |
0:17:24 | in spontaneous mode. So we have spontaneous speech, and we have a more noisy version of that. |
0:17:32 | [inaudible follow-up question] At each region, yeah. |
0:17:40 | Yes — so we have information on, uh — we've also run, uh, listener tests on this. We were looking — |
0:17:44 | so we had a paper on this, in dialect ID, but we're very careful, because these are languages and |
0:17:49 | not dialects. |
0:17:51 | Um, but we have listener tests where we had listeners that |
0:17:55 | were, uh, multilingual — spoke either two or three of the five languages — |
0:17:59 | to see if they could assess the differences. |
0:18:02 | We had some of those results in the previous Interspeech conference. But yes, there are some differences. I think |
0:18:06 | what you're asking is whether they were all recorded |
0:18:09 | in a consistent space: |
0:18:11 | so, the recording setup was the same, |
0:18:13 | but they were recorded in different regional locations. |
0:18:18 | [inaudible audience question] |
0:18:39 | Yeah — well, as you've seen, when we look at accent and dialect on read speech, you don't get |
0:18:44 | good performance. Uh, we've actually seen that tremendously in accent: you just |
0:18:49 | don't get much accent-sensing information from read speech. |
0:18:54 | The spontaneous speech is really what you have to focus on, and so |
0:18:57 | we've done this primarily because we can get the results |
0:19:00 | a little bit faster on the read speech. We are running experiments right now on the spontaneous part; |
0:19:05 | um, we just didn't have them ready in time to get them into this particular paper. What we do know — |
0:19:11 | [recording cuts off] |
---|