0:00:22 | oh |
---|
0:00:22 | can have a |
---|
0:00:25 | okay good |
---|
0:00:27 | i okay |
---|
0:00:28 | mean |
---|
0:00:28 | um |
---|
0:00:29 | today i'll be talking about how we generalise and adapt the concept of pronunciation modeling |
---|
0:00:36 | and use that to design a framework to help analyse dialects |
---|
0:00:41 | so here is the structure of the talk |
---|
0:00:43 | and i'll first start from the motivation |
---|
0:00:46 | um of speech science and engineering |
---|
0:00:48 | that |
---|
0:00:49 | model |
---|
0:00:53 | so |
---|
0:00:54 | dialect recognition |
---|
0:00:55 | in dialect research there are different |
---|
0:00:58 | branches of work |
---|
0:01:00 | on the one hand there's speech science |
---|
0:01:02 | so for example |
---|
0:01:03 | speech |
---|
0:01:04 | scientists |
---|
0:01:06 | such as sociolinguists |
---|
0:01:08 | who work on |
---|
0:01:10 | rules |
---|
0:01:11 | across dialects to understand why these dialects are different |
---|
0:01:15 | um this is |
---|
0:01:16 | very important |
---|
0:01:17 | but their analysis is often manual |
---|
0:01:21 | so it's very time consuming |
---|
0:01:23 | which limits the amount of data that can be analysed |
---|
0:01:26 | and without enough data |
---|
0:01:30 | it is |
---|
0:01:31 | possible that some of these rules might be over |
---|
0:01:34 | or under specified |
---|
0:01:38 | on the other |
---|
0:01:39 | hand we have speech technology |
---|
0:01:41 | so for example speech engineers |
---|
0:01:45 | who design |
---|
0:01:46 | automatic dialect recognition systems |
---|
0:01:50 | i |
---|
0:01:51 | i |
---|
0:01:52 | and um |
---|
0:01:54 | and i to of these not |
---|
0:01:57 | and so it can process |
---|
0:01:58 | data very efficiently even if there is a lot, and can also reach decent |
---|
0:02:03 | performance |
---|
0:02:06 | that we model these two then the commands |
---|
0:02:09 | i'm do these dialect differences |
---|
0:02:13 | for |
---|
0:02:15 | and a work |
---|
0:02:16 | we decided to combine the strengths of these two research communities |
---|
0:02:20 | to bridge the gap between speech science and technology |
---|
0:02:24 | in particular we want to design automatic systems that are able to explicitly learn these rules |
---|
0:02:31 | across dialects |
---|
0:02:32 | and use that to inform human analysts |
---|
0:02:35 | so because of the informative nature of the |
---|
0:02:39 | results of the system we term this approach informative dialect recognition |
---|
0:02:47 | so to give you a taste of what i mean by what our system can do |
---|
0:02:54 | here is an example |
---|
0:02:55 | so in the input we have the word transcript and the audio signal |
---|
0:03:00 | which could be used to generate the reference pronunciation and the dialect specific pronunciation |
---|
0:03:06 | in red here |
---|
0:03:09 | is the model that learns the mapping between this reference pronunciation and dialect specific pronunciation |
---|
0:03:16 | so that in the output |
---|
0:03:17 | we can get these phonetic transformations, these phonetic rules |
---|
0:03:21 | um |
---|
0:03:22 | that tell you how the dialects are different |
---|
0:03:24 | so for example in this case |
---|
0:03:27 | we see that R is deleted when it's followed by a consonant |
---|
0:03:31 | and in addition we can quantify the occurrence frequency and know how often this happens |
---|
0:03:38 | and that kind of information is extremely important for forensic phoneticians |
---|
0:03:42 | which is one of the big motivations behind our work |
---|
0:03:48 | so before i go into more of the details of our proposed model |
---|
0:03:52 | i'd like to formally introduce what i mean by phonetic transformation because i will be |
---|
0:03:58 | we will be characterising dialect differences using phonetic transformations |
---|
0:04:03 | so um |
---|
0:04:05 | we represent a word in the reference dialect as reference phones |
---|
0:04:11 | and in the dialect of interest we represent the pronunciation as surface phones |
---|
0:04:16 | and this mapping between the reference phones and the surface phones is what we call phonetic transformation |
---|
0:04:22 | so okay |
---|
0:04:24 | if we're given the word |
---|
0:04:25 | bath |
---|
0:04:26 | and choose general american english |
---|
0:04:29 | as the reference dialect |
---|
0:04:31 | um |
---|
0:04:32 | and british english as a dialect of interest |
---|
0:04:35 | now we have the reference phones and surface phones of the word bath |
---|
0:04:39 | and here you see |
---|
0:04:41 | [ae] in the reference phones is mapped to [aa] in the surface phones |
---|
0:04:45 | so this is an example of a |
---|
0:04:47 | substitution, which is one kind of phonetic transformation |
---|
0:04:51 | there are two other kinds |
---|
0:04:53 | that will be shown later, and i'll say |
---|
0:04:56 | more about them then |
---|
0:04:57 | all right, but this is what i mean by phonetic transformation |
---|
0:05:02 | and |
---|
0:05:02 | now on to our proposed model |
---|
0:05:05 | and |
---|
0:05:06 | we want to make |
---|
0:05:08 | a model that can explicitly model these phonetic transformations |
---|
0:05:13 | so it is called the phonetic pronunciation model |
---|
0:05:16 | yeah and |
---|
0:05:18 | we want to answer the following questions using this model |
---|
0:05:22 | so first |
---|
0:05:23 | when comparing a dialect to a reference dialect |
---|
0:05:28 | what kinds of phonetic transformations occur |
---|
0:05:31 | are they substitutions |
---|
0:05:32 | insertions or deletions |
---|
0:05:34 | and if they occur, do they occur in only certain phonetic contexts |
---|
0:05:41 | and |
---|
0:05:42 | how often do they occur |
---|
0:05:43 | so to answer these questions we have two parts in |
---|
0:05:48 | the model; the first is |
---|
0:05:50 | a hidden markov model |
---|
0:05:52 | and we use that to help us automatically align the reference phones with the surface phones |
---|
0:05:57 | the second part is |
---|
0:05:59 | decision tree clustering, which helps us generalise the phonetic rules |
---|
0:06:06 | so here is the slide where |
---|
0:06:08 | i show three |
---|
0:06:09 | kinds of phonetic transformations, each with an example |
---|
0:06:13 | and in the examples |
---|
0:06:16 | american english is the reference dialect and british english is |
---|
0:06:21 | the dialect of interest |
---|
0:06:24 | so the first |
---|
0:06:26 | case is the substitution of [ae]; in american english it's pronounced bath and in |
---|
0:06:33 | british english it sounds like bahth |
---|
0:06:35 | the second is the deletion example, where |
---|
0:06:39 | R is deleted when it's followed by a consonant, so in american english |
---|
0:06:43 | part |
---|
0:06:44 | would sound something like paht |
---|
0:06:46 | in british english |
---|
0:06:47 | and the third |
---|
0:06:49 | example of phonetic transformations is insertions |
---|
0:06:52 | so here, what happens when a word ends with a vowel and the |
---|
0:06:57 | word following it |
---|
0:06:59 | also starts with a vowel |
---|
0:07:02 | is that an R might be inserted in between |
---|
0:07:06 | when it's a british english speaker |
---|
0:07:09 | so the phrase saw a film would sound more like sawr a film |
---|
0:07:15 | so these are some of the examples of the phonetic transformations |
---|
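The three transformation types described above can be illustrated with a minimal sketch. It assumes a plain minimum edit-distance alignment as a stand-in for the HMM alignment used in the talk; the phone symbols and the names `align` and `classify` are hypothetical, chosen only for illustration.

```python
from collections import Counter

def align(ref, surf):
    """Minimum edit-distance alignment of reference and surface phones.

    Returns (ref_phone, surf_phone) pairs; None marks a gap on that side.
    This is a stand-in for the HMM alignment, not the talk's method.
    """
    n, m = len(ref), len(surf)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != surf[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != surf[j - 1]):
            pairs.append((ref[i - 1], surf[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))   # reference phone has no surface phone
            i -= 1
        else:
            pairs.append((None, surf[j - 1]))  # surface phone has no reference phone
            j -= 1
    return pairs[::-1]

def classify(pair):
    """Label one aligned pair with its transformation type."""
    ref_phone, surf_phone = pair
    if ref_phone is None:
        return "insertion"
    if surf_phone is None:
        return "deletion"
    return "match" if ref_phone == surf_phone else "substitution"

# Toy examples mirroring the talk: bath, part, and "saw a film".
examples = [
    (["b", "ae", "th"], ["b", "aa", "th"]),
    (["p", "aa", "r", "t"], ["p", "aa", "t"]),
    (["s", "ao", "ax"], ["s", "ao", "r", "ax"]),
]
counts = Counter(classify(p) for ref, surf in examples for p in align(ref, surf))
```

Counting the labels over a large corpus, as `counts` does here on three toy words, is one simple way to get the occurrence frequencies mentioned in the talk.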
0:07:19 | and in the following slides i will illustrate how these examples fit into our proposed HMM network |
---|
0:07:28 | so here is a traditional hmm network |
---|
0:07:32 | where the circles represent the states and the squares represent the observations |
---|
0:07:36 | and the arrows are the state transitions |
---|
0:07:40 | so this is a trivial case where |
---|
0:07:42 | the reference phones and the surface phones are the same, so there are no dialect differences |
---|
0:07:47 | and this is the case of a substitution |
---|
0:07:49 | where |
---|
0:07:50 | [ae] becomes [aa] |
---|
0:07:51 | and in this case the traditional hmm system can handle it adequately |
---|
0:07:57 | however |
---|
0:07:58 | what about an insertion? so if we have an insertion of R here, we see that this surface phone |
---|
0:08:04 | does not have any corresponding state |
---|
0:08:08 | to map to |
---|
0:08:09 | so our solution is that now we have a one to two mapping between the reference phones and the states |
---|
0:08:17 | so each reference phone |
---|
0:08:19 | is represented by |
---|
0:08:21 | two |
---|
0:08:22 | states; the first one is the right circle |
---|
0:08:24 | which indicates a normal state |
---|
0:08:27 | and then it's followed by an insertion state, the green circle |
---|
0:08:31 | and so now you see that the observation R |
---|
0:08:36 | has a corresponding state to be mapped to |
---|
0:08:41 | and in addition we also further categorise our state transitions |
---|
0:08:46 | according to the |
---|
0:08:49 | phonetic transformations |
---|
0:08:50 | so now if |
---|
0:08:52 | a state transition enters an insertion state, like the red arrow here in the graph |
---|
0:08:58 | then we call it an insertion state transition |
---|
0:09:03 | okay so we can handle the case of insertions |
---|
0:09:06 | how about deletions then |
---|
0:09:09 | so here we see the example where |
---|
0:09:12 | the state |
---|
0:09:13 | R has no corresponding surface phone or observation |
---|
0:09:17 | and to solve this problem we introduce a deletion state transition |
---|
0:09:22 | which skips a normal state |
---|
0:09:24 | so in this case |
---|
0:09:26 | the state are is skipped |
---|
0:09:27 | so it no longer needs to be mapped to an observation |
---|
0:09:32 | so these are some of the highlights of the differences between our proposed hmm network |
---|
0:09:37 | and the traditional one to help us more explicitly model the phonetic transformations in a richer way |
---|
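The network layout just described (two states per reference phone, plus typed transitions) can be sketched as a small graph builder. This is an illustrative assumption about the topology, not the authors' implementation: the `START`/`END` nodes, the tuple encoding of states, and the function name `build_network` are hypothetical, and transition probabilities are left out.

```python
def build_network(ref_phones):
    """Return typed arcs (source, target, kind) for a reference phone sequence.

    Each reference phone i gets a normal state ("N", i, phone) and an
    insertion state ("I", i, phone). Arc kinds:
      "normal"    : move on to the next normal state
      "insertion" : enter an insertion state (an extra surface phone)
      "deletion"  : skip over one normal state (a deleted reference phone)
    """
    n = len(ref_phones)

    def normal(i):
        # Position n stands for the end of the word.
        return "END" if i == n else ("N", i, ref_phones[i])

    arcs = []
    for i in range(-1, n):
        if i == -1:
            sources = ["START"]
        else:
            ins = ("I", i, ref_phones[i])
            arcs.append((normal(i), ins, "insertion"))
            sources = [normal(i), ins]
        for src in sources:
            arcs.append((src, normal(i + 1), "normal"))
            if i + 2 <= n:
                # Skips normal state i + 1, which then emits no observation.
                arcs.append((src, normal(i + 2), "deletion"))
    return arcs

# Example: the reference phones of "part"; the arc that skips R models its deletion.
arcs = build_network(["p", "aa", "r", "t"])
```

Keeping the transition type on every arc is what later allows deletions and insertions to be counted separately from ordinary substitutions.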
0:09:45 | so now |
---|
0:09:45 | after training the hmm system using triphones |
---|
0:09:49 | we could find rules like these on the right |
---|
0:09:53 | so for example |
---|
0:09:55 | [ae] becomes [aa] when it's followed by a TH |
---|
0:09:58 | so bath becomes bahth |
---|
0:10:01 | also |
---|
0:10:01 | [ae] becomes [aa] when it's followed by an S |
---|
0:10:04 | so class becomes clahss |
---|
0:10:06 | and |
---|
0:10:07 | another example |
---|
0:10:09 | hmmm |
---|
0:10:11 | is that laugh becomes lahf |
---|
0:10:14 | the question here is whether |
---|
0:10:17 | these observed rules |
---|
0:10:19 | um |
---|
0:10:19 | are actually originating from a more general underlying rule |
---|
0:10:24 | and if they are, how can we find that |
---|
0:10:27 | so here we use decision tree clustering to help us |
---|
0:10:31 | so from the results of decision tree clustering |
---|
0:10:34 | we can find that by clustering |
---|
0:10:37 | these observed rules we get an underlying rule |
---|
0:10:40 | so here the underlying rule we found was that |
---|
0:10:43 | when [ae] is followed by a voiceless fricative, the phonetic transformation of [ae] to |
---|
0:10:49 | [aa] will occur |
---|
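How a decision tree can prefer the general question "is the following phone a voiceless fricative?" over a phone-specific one can be sketched with a single split. The entropy-based information gain used here is an assumption chosen for illustration (the talk does not state its splitting criterion), and the toy samples, the phone set, and all function names are hypothetical.

```python
import math
from collections import Counter

# Assumed phonetic class used as a candidate decision-tree question.
VOICELESS_FRICATIVES = {"f", "th", "s", "sh"}

def entropy(labels):
    """Shannon entropy (in bits) of a list of surface-phone labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, question):
    """Entropy reduction when samples are split by a yes/no question on the context."""
    yes = [out for ctx, out in samples if question(ctx)]
    no = [out for ctx, out in samples if not question(ctx)]
    total = len(samples)
    remainder = sum(len(part) / total * entropy(part) for part in (yes, no) if part)
    return entropy([out for _, out in samples]) - remainder

# Toy data: (right context, surface phone) for the reference phone "ae".
samples = [
    ("th", "aa"), ("s", "aa"), ("f", "aa"),   # bath, class, laugh
    ("t", "ae"), ("k", "ae"), ("n", "ae"),    # contexts where no change occurs
]

gain_class = information_gain(samples, lambda ctx: ctx in VOICELESS_FRICATIVES)
gain_single = information_gain(samples, lambda ctx: ctx == "th")
```

On this toy data the class-level question separates the two outcomes completely, while the single-phone question does not, which is the sense in which clustering recovers one underlying rule from several observed triphone rules.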
0:10:54 | so i just talked about the highlights of our model and now |
---|
0:10:58 | we're going into the evaluation stage |
---|
0:11:00 | and we've done a series of experiments |
---|
0:11:03 | um and |
---|
0:11:04 | because of the time constraint i will not be able to share all of this information |
---|
0:11:08 | so the dialect recognition task |
---|
0:11:11 | will not be talked about, but you can read a lot of the details in our paper |
---|
0:11:18 | i'll be focusing on the other two; the first one is the pronunciation generation experiment |
---|
0:11:23 | where |
---|
0:11:25 | basically we assess the ability of the model by seeing how well it can convert one dialect's pronunciation into |
---|
0:11:31 | another dialect's pronunciation |
---|
0:11:35 | the data we used is |
---|
0:11:38 | an arabic database; it has five different arabic dialect regions |
---|
0:11:43 | UAE |
---|
0:11:44 | egypt |
---|
0:11:45 | iraq |
---|
0:11:46 | palestine and syria |
---|
0:11:48 | and they are all conversational telephone speech |
---|
0:11:51 | and here we chose iraqi as the reference dialect |
---|
0:11:55 | and in this table you can see the data partition for our experiment |
---|
0:12:01 | so |
---|
0:12:03 | in this experiment the assumption is that if we trained a |
---|
0:12:06 | pronunciation model well, so that it has learned these phonetic rules across dialects correctly |
---|
0:12:12 | then the model should be able to convert |
---|
0:12:14 | the reference |
---|
0:12:16 | phones into another dialect's surface phones |
---|
0:12:19 | very well |
---|
0:12:20 | so here after training |
---|
0:12:23 | the phonetic pronunciation model |
---|
0:12:26 | we give it the |
---|
0:12:28 | reference phones of the test set |
---|
0:12:30 | and |
---|
0:12:32 | it will generate the most likely surface phones of the other arabic dialects |
---|
0:12:37 | by comparing these surface phones |
---|
0:12:40 | that were generated |
---|
0:12:41 | to the ground truth surface phones we can see how well our model is converting |
---|
0:12:47 | one dialect's pronunciation to another |
---|
0:12:51 | and here are the results; the orange bar is the monophone version of the pronunciation model |
---|
0:12:58 | and the blue one is the decision tree pronunciation model |
---|
0:13:02 | and we see here that the decision |
---|
0:13:04 | tree |
---|
0:13:04 | helps improve the recovery rate by one point seven percent relative |
---|
0:13:10 | meaning that the decision tree results help us |
---|
0:13:14 | convert these pronunciations better |
---|
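The comparison of generated and ground-truth surface phones can be sketched as follows. The talk does not spell out how its score is computed, so this assumes a phone error rate based on edit distance and the usual definition of a relative change; the function names and the numbers in the usage lines are hypothetical and are not the talk's results.

```python
def edit_distance(truth, generated):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(generated) + 1))
    for i, t in enumerate(truth, 1):
        cur = [i]
        for j, g in enumerate(generated, 1):
            cur.append(min(prev[j] + 1,              # phone missing from generated
                           cur[j - 1] + 1,           # extra phone in generated
                           prev[j - 1] + (t != g)))  # match or substitution
        prev = cur
    return prev[-1]

def phone_error_rate(truth_words, generated_words):
    """Total edit errors divided by the number of ground-truth phones."""
    errors = sum(edit_distance(t, g) for t, g in zip(truth_words, generated_words))
    return errors / sum(len(t) for t in truth_words)

def relative_reduction(baseline, improved):
    """Relative change of an error rate with respect to a baseline system."""
    return (baseline - improved) / baseline

# Toy usage with made-up phone strings.
truth = [["b", "aa", "th"], ["p", "aa", "t"]]
generated = [["b", "ae", "th"], ["p", "aa", "t"]]
per = phone_error_rate(truth, generated)
```

A relative figure such as the one quoted in the talk compares two systems' scores against the baseline score rather than reporting the absolute difference between them.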
0:13:18 | here i'd like to mention a side note: we also did a lot of further |
---|
0:13:23 | analysis and found that there are word usage differences across arabic dialects |
---|
0:13:29 | and this can potentially complicate the evaluation of our |
---|
0:13:33 | system |
---|
0:13:35 | therefore |
---|
0:13:36 | we also did the same experiment |
---|
0:13:38 | using a phonetic pronunciation model on multiple english corpora without these word usage differences that would cause complications |
---|
0:13:47 | and the results are very good |
---|
0:13:49 | unfortunately i cannot share these with you today because it will be covered at |
---|
0:13:54 | interspeech |
---|
0:13:55 | but that means you should all come to my talk at interspeech as well |
---|
0:14:01 | so the last |
---|
0:14:02 | evaluation is the rule evaluation, where we check whether our learned rules match the ones |
---|
0:14:09 | in the linguistic literature |
---|
0:14:11 | so here on the left you see the linguistic |
---|
0:14:14 | descriptions of the four arabic dialects |
---|
0:14:16 | that are from the literature |
---|
0:14:18 | on the right you see the learned rules from our proposed system |
---|
0:14:22 | and |
---|
0:14:24 | you can see that the learned rules from our proposed system actually |
---|
0:14:28 | correspond with these linguistic descriptions |
---|
0:14:31 | and furthermore they might sometimes find the phonetic contexts in which these rules occur |
---|
0:14:39 | and most importantly |
---|
0:14:41 | we can also quantify |
---|
0:14:43 | the occurrence |
---|
0:14:44 | frequencies of these rules given the phonetic context |
---|
0:14:48 | and this information is very important |
---|
0:14:51 | for forensic phoneticians |
---|
0:14:54 | but is rarely documented |
---|
0:14:56 | in the literature |
---|
0:14:58 | i'd like to conclude my talk by talking about the contributions of this work |
---|
0:15:03 | so here we propose an automatic yet informative approach to analysing dialects |
---|
0:15:09 | and we call this informative dialect recognition |
---|
0:15:13 | we use a mathematical framework to characterise phonetic transformations |
---|
0:15:17 | across dialects |
---|
0:15:18 | in a very explicit manner in order to learn these rules |
---|
0:15:22 | and our proposed system is able to postulate rules |
---|
0:15:26 | from large corpora to discover, |
---|
0:15:28 | refine and quantify dialect specific rules |
---|
0:15:34 | so um |
---|
0:15:35 | if people have questions or issues that they would like to ask me about the talk i would be happy |
---|
0:15:40 | to do so |
---|
0:15:42 | [audience question, inaudible] |
---|
0:18:28 | thank you |
---|
0:18:29 | and so i don't know if i can remember all of them to respond to them but |
---|
0:18:34 | that that's one yes that is the uh we are well yeah that point and it's just a i system |
---|
0:18:39 | is also able to go to these |
---|
0:18:42 | tension differences that may not actually be a phonetic rule in the |
---|
0:18:46 | but existing or not existing know when error is one of them |
---|
0:18:49 | and um |
---|
0:18:51 | john wells |
---|
0:18:52 | has established a lot of very good literature on dialect differences, and actually i'll |
---|
0:18:59 | be using a lot of that in my next talk, so |
---|
0:19:03 | so that is what you could |
---|
0:19:05 | be looking forward to |
---|
0:19:06 | and um |
---|
0:19:07 | you mentioned something else about the reference dialect; for the selection of the reference dialect there are |
---|
0:19:13 | both engineering and linguistic considerations, so we actually considered |
---|
0:19:19 | from the linguistic side |
---|
0:19:23 | i made some decisions such as i would not want to use egyptian arabic as |
---|
0:19:28 | the reference dialect because it seems like the native speakers of arabic that i know usually |
---|
0:19:34 | know how their dialect is different |
---|
0:19:36 | from the egyptian dialect and so |
---|
0:19:38 | since i don't really understand arabic and we had them to help me assess |
---|
0:19:41 | whether the model or the system is going in the right direction, it will be easier for them |
---|
0:19:46 | to tell me if |
---|
0:19:49 | these phonetic transformations are occurring when the egyptian one is not the reference |
---|
0:19:53 | and then for palestinian and syrian |
---|
0:19:56 | we know that they both belong to the levantine arabic family |
---|
0:20:00 | so i was more reluctant to use them as references because since they are more closely |
---|
0:20:07 | related, if i use palestinian then i may not be able to see syrian differences very easily |
---|
0:20:13 | and in the initial |
---|
0:20:15 | establishment of |
---|
0:20:17 | the system it might be better to have more dialect differences |
---|
0:20:21 | and finally from the engineering perspective we actually have a lot more iraqi data so that we can train systems |
---|
0:20:28 | on and so |
---|
0:20:30 | that was the reason why we chose iraqi, and this was a very difficult |
---|
0:20:35 | decision but it worked out okay in this case |
---|
0:20:38 | um um |
---|
0:20:39 | and so uh are there any other questions |
---|
0:20:42 | no no okay |
---|
0:20:49 | know |
---|
0:20:51 | hmmm |
---|