0:00:15 | our next speaker is professor minematsu from the university of tokyo and he would |
---|
0:00:20 | like to talk about speaker-basis accent clustering of world englishes using invariant structure |
---|
0:00:25 | analysis and the speech accent archive |
---|
0:00:28 | so all the miss |
---|
0:00:38 | no |
---|
0:00:44 | how can it go |
---|
0:00:54 | okay |
---|
0:00:56 | alright thank you so this is the outline of my presentation so |
---|
0:01:02 | first background and objective |
---|
0:01:03 | and then what kind of corpus we used and what kind of method of speech |
---|
0:01:06 | analysis we used so after that i will show you a very interesting result of |
---|
0:01:10 | a previous study so that |
---|
0:01:12 | i then show the experiments done in the current paper |
---|
0:01:17 | so |
---|
0:01:18 | in this talk i focus on english |
---|
0:01:21 | not only american english or british english but |
---|
0:01:23 | as you know english is used as the only |
---|
0:01:27 | global language an international language spoken by everybody ok |
---|
0:01:31 | so |
---|
0:01:32 | due to this we can find more and more researchers and more teachers |
---|
0:01:38 | treating english not as 'the' english but as world englishes |
---|
0:01:43 | so what is world englishes linguists say it's a set of localised versions of |
---|
0:01:47 | english |
---|
0:01:48 | so they claim that there is no standard pronunciation of english and american english and |
---|
0:01:53 | british english are just two major examples of accented english ok |
---|
0:01:58 | so |
---|
0:01:59 | and this is the very well known three-circle model of world |
---|
0:02:03 | englishes |
---|
0:02:03 | the inner circle means english as native language and the outer circle is english as |
---|
0:02:09 | official language like singapore and the expanding circle is english as foreign language japan helsinki |
---|
0:02:16 | and brazil |
---|
0:02:18 | so |
---|
0:02:19 | and in this situation what kind of research interest is found |
---|
0:02:24 | in this kind of study in linguistics |
---|
0:02:26 | so great interest lies in how one type of pronunciation compares to other varieties not |
---|
0:02:32 | in whether one type of pronunciation is incorrect compared to american english or british english ok |
---|
0:02:38 | so i |
---|
0:02:40 | here i am asking a simple question what is the minimum unit of accent diversity |
---|
0:02:44 | of world englishes so some people may say maybe country american accent or japanese accent |
---|
0:02:50 | and finnish accent others may say maybe city tokyo accent new york accent |
---|
0:02:55 | and helsinki accent |
---|
0:02:57 | or it could be town or village |
---|
0:02:59 | but if we consider |
---|
0:03:00 | the reason of accent |
---|
0:03:02 | it will be |
---|
0:03:03 | personal history of learning english |
---|
0:03:07 | so |
---|
0:03:07 | the minimum unit will be the individual my english your english his |
---|
0:03:12 | english and her english so how many how many different kinds of englishes i mean |
---|
0:03:16 | users of english one point five billion |
---|
0:03:19 | so we can say there are one point five billion |
---|
0:03:23 | different |
---|
0:03:23 | englishes on this planet |
---|
0:03:25 | okay so the aim of this study is the technical feasibility of |
---|
0:03:30 | speaker-basis accent clustering of world englishes |
---|
0:03:34 | so if you do bottom-up clustering you have to prepare a distance matrix among |
---|
0:03:38 | all the elements among all the speakers |
---|
0:03:41 | so then |
---|
0:03:41 | the aim of this study is the feasibility the technical feasibility to estimate |
---|
0:03:46 | inter-speaker accent distance |
---|
0:03:51 | so what kind of corpus we used the speech accent |
---|
0:03:54 | archive so this is a very interesting very useful corpus for us well developed |
---|
0:04:00 | by steven weinberger at george mason university |
---|
0:04:04 | so |
---|
0:04:06 | in the development of the corpus he asked |
---|
0:04:10 | lots and lots of international users of english to read this common paragraph |
---|
0:04:16 | okay |
---|
0:04:16 | please call stella and so on so this paragraph was designed to achieve high |
---|
0:04:21 | phonetic coverage high phonetic coverage of american english |
---|
0:04:27 | weinberger is an american speaker so this corpus focuses on general |
---|
0:04:30 | american english |
---|
0:04:31 | so i will show you one example |
---|
0:04:34 | of the speech accent archive |
---|
0:04:39 | sure i have to click this |
---|
0:04:44 | [plays a sample utterance of the paragraph] okay |
---|
0:04:51 | so this is a speaker from the czech republic |
---|
0:04:55 | so in the speech accent archive this kind of variously accented english can be |
---|
0:05:00 | found |
---|
0:05:00 | and also |
---|
0:05:01 | this corpus is very useful because it provides us with the ipa transcripts |
---|
0:05:07 | okay narrow transcripts |
---|
0:05:09 | or something like this |
---|
0:05:17 | sorry |
---|
0:05:18 | okay |
---|
0:05:24 | sorry this is the narrow transcript |
---|
0:05:27 | so |
---|
0:05:28 | so using this we can train a predictor of the accent distances so this is very useful |
---|
0:05:34 | so the next one |
---|
0:05:36 | so what is the technical challenge here okay |
---|
0:05:39 | so here i can say that the acoustic difference the acoustic distance between two |
---|
0:05:44 | speakers is not the |
---|
0:05:45 | accent distance |
---|
0:05:47 | so i will show you three examples three utterances |
---|
0:05:51 | sorry |
---|
0:05:51 | three utterances reading |
---|
0:05:54 | the same sentence |
---|
0:05:55 | one is from an american female speaker |
---|
0:05:57 | and the other two are from my pronunciation |
---|
0:06:01 | one is my normal english normal english and the other is |
---|
0:06:05 | much japanized english listen |
---|
0:06:11 | [plays utterance a from the american speaker] |
---|
0:06:14 | [plays utterance x my normal english] |
---|
0:06:17 | [plays utterance b my japanized english] |
---|
0:06:22 | so those are the three utterances reading the same sentence |
---|
0:06:27 | the question is whether this x is closer to a or closer to b right |
---|
0:06:34 | so if you focus on the acoustic difference between two speakers x has to be much |
---|
0:06:39 | closer to b because |
---|
0:06:41 | both sounds are generated by the same speaker okay but if you focus on the accent |
---|
0:06:45 | difference or phonetic difference i think x will be |
---|
0:06:50 | will be judged to be close to a ok |
---|
0:06:53 | so how to extract how to estimate the accent distance between two speakers |
---|
0:06:57 | some methods are possible but in this talk our focus is on |
---|
0:07:03 | the speech features |
---|
0:07:04 | used for that task |
---|
0:07:06 | so we try to remove or suppress |
---|
0:07:10 | non-linguistic factors such as age and gender so these are totally irrelevant |
---|
0:07:13 | factors we have to remove those things |
---|
0:07:16 | so in |
---|
0:07:16 | normal acoustic analysis of speech the phase information is removed and pitch harmonics are |
---|
0:07:22 | removed ok but what about speaker identity how to remove its effects such as formant shifts from speech |
---|
0:07:27 | this is a question |
---|
0:07:29 | so for that |
---|
0:07:30 | something like a pronunciation skeleton has to be extracted for comparison |
---|
0:07:34 | so how to do that |
---|
0:07:36 | so in a previous study we proposed invariant speech structure |
---|
0:07:42 | analysis that is a speaker-invariant or speaker-independent |
---|
0:07:45 | representation of speech |
---|
0:07:48 | okay |
---|
0:07:48 | so |
---|
0:07:50 | how to extract the skeleton the pronunciation skeleton must be extracted |
---|
0:07:53 | so |
---|
0:07:54 | good features in this task good features should be insensitive |
---|
0:08:00 | to age and gender differences and features should be sensitive to accent differences |
---|
0:08:05 | so this shows the age differences and gender differences of the japanese vowels' formant frequencies |
---|
0:08:10 | i think you are familiar with this ok |
---|
0:08:13 | but this is the accent difference among american english speakers' dialects |
---|
0:08:18 | for different dialect regions |
---|
0:08:21 | looking at this graph and these charts |
---|
0:08:24 | so |
---|
0:08:25 | we can say that a good feature seems to be not feature instances |
---|
0:08:29 | okay |
---|
0:08:29 | but feature relations so the distribution pattern is very similar among speakers of the |
---|
0:08:35 | same dialect but for different dialects the feature distributions are totally different |
---|
0:08:40 | so |
---|
0:08:41 | in this talk we focus on relations on the relative distribution so this is the |
---|
0:08:46 | focus and it can be represented geometrically as a distance matrix |
---|
0:08:52 | the question here is whether this kind of distance matrix is speaker-independent or speaker-invariant |
---|
0:08:58 | so |
---|
0:09:01 | invariance and variability so how to extract how to define |
---|
0:09:06 | the invariant distance |
---|
0:09:08 | between two speech units between two speech events |
---|
0:09:11 | so |
---|
0:09:13 | in studies of voice conversion speech or speaker variability is often modeled |
---|
0:09:17 | as a transformation of acoustic space |
---|
0:09:21 | see for example this is the acoustic space of speaker a |
---|
0:09:24 | and this one is the speech space of speaker b |
---|
0:09:28 | one trajectory representing one utterance good morning and this is |
---|
0:09:32 | good morning of speaker b |
---|
0:09:34 | so how to extract speaker-independent features from here |
---|
0:09:38 | okay |
---|
0:09:39 | so speaker independence speaker invariance can be interpreted as transform invariance |
---|
0:09:44 | so the question here is what is the completely |
---|
0:09:49 | transform-invariant feature measure |
---|
0:09:51 | so we |
---|
0:09:52 | found out f-divergence is a very good candidate for that |
---|
0:09:56 | so and |
---|
0:10:00 | so here |
---|
0:10:01 | every speech event is characterised as a distribution not a point in acoustic space so |
---|
0:10:08 | we calculate f-divergences |
---|
0:10:10 | so this divergence measure is invariant with any kind of differentiable |
---|
0:10:16 | and continuous transform |
---|
0:10:18 | and it is interesting that if we want a measure to be completely |
---|
0:10:23 | invariant it has to be |
---|
0:10:25 | an f-divergence |
---|
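As a side note on the invariance claim above: a minimal numerical sketch (illustrative parameters only, not the talk's actual implementation) can check it for the bhattacharyya distance between two 1-D gaussians, which is a monotone function of the hellinger f-divergence and therefore inherits the invariance; applying the same invertible affine transform to both distributions leaves the distance unchanged:

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between two 1-D Gaussians."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))

# Two speech events modeled as distributions (made-up numbers).
a = (1.0, 0.5)   # (mean, variance)
b = (3.0, 2.0)
d_orig = bhattacharyya_gaussian(a[0], a[1], b[0], b[1])

# The same invertible affine transform x -> 2x + 5 applied to both events,
# mimicking a speaker-dependent warping of the acoustic space.
scale, shift = 2.0, 5.0
a_t = (scale * a[0] + shift, scale ** 2 * a[1])
b_t = (scale * b[0] + shift, scale ** 2 * b[1])
d_trans = bhattacharyya_gaussian(a_t[0], a_t[1], b_t[0], b_t[1])

assert abs(d_orig - d_trans) < 1e-12  # the divergence is unchanged
```

The point distance |mu1 - mu2|, by contrast, is doubled by the same transform, which is why speech events are modeled here as distributions rather than points.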
0:10:27 | so speech contrasts i mean this talk is based on a structure-based method which consists of |
---|
0:10:32 | these divergence features |
---|
0:10:34 | so let's use these let's use these to represent pronunciation to represent speech |
---|
0:10:40 | this is our approach this is a trajectory in acoustic space and it represents |
---|
0:10:45 | one utterance converted into a sequence of distributions |
---|
0:10:48 | okay the distribution sequence is estimated first |
---|
0:10:51 | and after that we calculate the f-divergence between any pair of distributions |
---|
0:10:56 | so |
---|
0:10:57 | in this talk we use the bhattacharyya distance and the bhattacharyya distance is |
---|
0:11:00 | one of the f-divergence measures |
---|
0:11:03 | so this is the same procedure looked at from a different |
---|
0:11:06 | viewpoint we implement this procedure as |
---|
0:11:10 | training of an hmm |
---|
0:11:11 | and calculating the distance between |
---|
0:11:14 | any pair of states so from one utterance one utterance hmm is built |
---|
0:11:19 | and then we extract not only local contrasts but also distant contrasts |
---|
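A rough sketch of the structure computation just described, with hypothetical 1-D gaussian state models standing in for real HMM state densities: one utterance yields a full matrix of divergences over every state pair, so both local and distant contrasts are kept:

```python
import math

def bhattacharyya(mu1, var1, mu2, var2):
    # Closed-form Bhattacharyya distance between two 1-D Gaussians.
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))

def structure_matrix(states):
    """Contrast matrix over ALL state pairs, not just adjacent ones."""
    n = len(states)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            (mi, vi), (mj, vj) = states[i], states[j]
            d[i][j] = d[j][i] = bhattacharyya(mi, vi, mj, vj)
    return d

# Hypothetical per-state (mean, variance) pairs from one utterance HMM.
states = [(0.0, 1.0), (1.5, 0.8), (3.0, 1.2), (0.5, 0.9)]
d = structure_matrix(states)
# d[0][1] is a local contrast, d[0][3] a distant one; both enter the structure.
```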
0:11:28 | okay so well i explained the background objective and corpus and the method |
---|
0:11:34 | and i'm gonna show you some interesting results of the previous work |
---|
0:11:39 | so well in two thousand |
---|
0:11:41 | six maybe |
---|
0:11:43 | we did speaker-basis accent clustering but this experiment used |
---|
0:11:47 | simulated data simulated japanese english |
---|
0:11:51 | so |
---|
0:11:52 | in this work we used twelve japanese university students all returnees from the us |
---|
0:11:58 | so they can speak japanese of course and they are |
---|
0:12:02 | very good speakers of american english |
---|
0:12:04 | so we asked them to pronounce |
---|
0:12:08 | american english vowels in b-vowel-t words beat bit bet bat and so on and |
---|
0:12:14 | also we asked them to pronounce |
---|
0:12:16 | the japanese vowels in corresponding japanese words |
---|
0:12:21 | ok |
---|
0:12:22 | so |
---|
0:12:22 | and then we extracted the vowel segments automatically and we |
---|
0:12:26 | created we formed vowel-based structures vowel-based |
---|
0:12:30 | structures |
---|
0:12:33 | so |
---|
0:12:34 | but we wanted to simulate variously accented japanese english so for that we |
---|
0:12:40 | did replacement of some american english vowels with japanese vowels |
---|
0:12:45 | so this axis is the american english vowels and one to |
---|
0:12:49 | eight is the degree of replacement so eight |
---|
0:12:52 | is no replacement |
---|
0:12:54 | so there |
---|
0:12:56 | all the vowels are |
---|
0:12:56 | genuine american english vowels |
---|
0:12:59 | and at degree one |
---|
0:13:01 | all the vowels all the american english vowels are replaced by japanese vowels |
---|
0:13:05 | totally japanese accented vowels |
---|
0:13:07 | and at degrees two to seven partially japanese partially american english |
---|
0:13:12 | so for example when this vowel is replaced |
---|
0:13:16 | what kind of japanese vowel is used this is the replacement table so these vowels |
---|
0:13:21 | are replaced by this japanese vowel |
---|
0:13:25 | for example |
---|
0:13:27 | so |
---|
0:13:29 | we have twelve speakers from a to l and eight pronunciation accents one to |
---|
0:13:34 | eight ok |
---|
0:13:35 | so we can have |
---|
0:13:37 | ninety six simulated learners |
---|
0:13:40 | so let's cluster these |
---|
0:13:42 | these learners |
---|
0:13:44 | so from the vowel samples from the vowel samples we can get the |
---|
0:13:50 | vowel distributions and then we can get a distance matrix that characterizes |
---|
0:13:55 | the structure |
---|
0:13:58 | so well |
---|
0:14:00 | to cluster ninety six speakers we have to calculate a ninety six by ninety six |
---|
0:14:05 | distance matrix |
---|
0:14:07 | okay |
---|
0:14:07 | but one speaker is modeled as a |
---|
0:14:10 | structure so how to |
---|
0:14:11 | define |
---|
0:14:12 | the distance measure between two structures so we prepared two kinds of structure-to-structure |
---|
0:14:18 | distance measure |
---|
0:14:19 | so this is the first one so this is a very simple definition of the distance |
---|
0:14:24 | between two structures it's the euclidean distance between the two speakers' two structures |
---|
0:14:29 | so speaker a is the blue one |
---|
0:14:31 | and the green one |
---|
0:14:33 | so let's calculate the euclidean distance between these two |
---|
0:14:36 | and this is the other definition of the distance between two structures |
---|
0:14:41 | so in this case |
---|
0:14:42 | let's focus let's focus on the vowel a of speaker s |
---|
0:14:48 | and the vowel a of speaker t and calculate the difference between these two here we |
---|
0:14:52 | used |
---|
0:14:54 | for vowel i of speaker s and t the bhattacharyya distance |
---|
0:14:58 | and |
---|
0:14:58 | then a summation over all the vowels |
---|
0:15:00 | two different |
---|
0:15:01 | two different definitions of the distance ok |
---|
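The two definitions can be contrasted in code. This toy example (1-D gaussian vowel models, invented numbers) gives speaker t the same vowel relations as speaker s but in a shifted acoustic space, mimicking a pure speaker difference: definition 1, the euclidean distance between contrast matrices, sees nothing, while definition 2, the per-vowel sum, is large:

```python
import math

def bhatt(p, q):
    (m1, v1), (m2, v2) = p, q
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def contrast_matrix(vowels):
    n = len(vowels)
    return [[bhatt(vowels[i], vowels[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def structure_distance(vs, vt):
    # Definition 1: euclidean distance between contrast matrices
    # (a difference of differences -> tends toward accent clustering).
    ds, dt = contrast_matrix(vs), contrast_matrix(vt)
    n = len(vs)
    return math.sqrt(sum((ds[i][j] - dt[i][j]) ** 2
                         for i in range(n) for j in range(i + 1, n)))

def instance_distance(vs, vt):
    # Definition 2: sum of per-vowel distances between the two speakers
    # (a first-order difference -> tends toward speaker clustering).
    return sum(bhatt(a, b) for a, b in zip(vs, vt))

# Hypothetical vowel models (mean, variance) for two speakers.
spk_s = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)]
spk_t = [(5.0, 1.0), (7.0, 1.0), (9.0, 1.0)]  # same relations, shifted space

print(structure_distance(spk_s, spk_t))  # 0.0: identical contrast patterns
print(instance_distance(spk_s, spk_t))   # 9.375: vowel instances differ
```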
0:15:04 | so using these two |
---|
0:15:09 | we can have two |
---|
0:15:11 | ninety six by ninety six distance matrices |
---|
0:15:14 | among the speakers |
---|
0:15:15 | so if we if we draw dendrograms from these two distance matrices |
---|
0:15:21 | then the question is what kind of results we can obtain |
---|
0:15:24 | so |
---|
0:15:25 | if the result is like this we are very happy |
---|
0:15:28 | because one two three four are the pronunciation accents and |
---|
0:15:31 | if the result is something like this a b c d well it's |
---|
0:15:34 | speaker clustering |
---|
0:15:36 | we're not happy |
---|
0:15:37 | so what kind of result did we obtain |
---|
0:15:42 | so this is the result |
---|
0:15:44 | of the contrast-based euclidean distance |
---|
0:15:48 | and this is the result of the instance-based distance measure |
---|
0:15:51 | the second definition of the distance |
---|
0:15:53 | so you can see |
---|
0:15:54 | one three and five well some noise can be found here but overall this is |
---|
0:16:00 | rather good |
---|
0:16:02 | accent clustering |
---|
0:16:03 | but what about this one j l k and so on well complete speaker clustering |
---|
0:16:11 | so |
---|
0:16:11 | a big difference in the results of the two dendrograms so why |
---|
0:16:16 | such a big difference |
---|
0:16:18 | so that big difference is caused |
---|
0:16:21 | by |
---|
0:16:22 | this difference this small difference of the distance definition between two structures |
---|
0:16:27 | so this one is |
---|
0:16:28 | just a difference of two vowel sounds |
---|
0:16:31 | but this one is a difference of differences |
---|
0:16:34 | so i think this is the first-order difference that gives |
---|
0:16:38 | you gives you speaker clustering and this is the second-order difference that gives you accent |
---|
0:16:43 | clustering |
---|
0:16:44 | that is the interesting thing |
---|
0:16:45 | so let's |
---|
0:16:47 | use this for |
---|
0:16:50 | all for |
---|
0:16:51 | real data |
---|
0:16:53 | the speech accent archive |
---|
0:16:56 | so we have data of a large number of speakers |
---|
0:17:00 | all reading the same paragraph |
---|
0:17:02 | okay |
---|
0:17:03 | so let's cluster these speakers |
---|
0:17:07 | but the |
---|
0:17:14 | sorry |
---|
0:17:18 | but the |
---|
0:17:19 | in this work we |
---|
0:17:22 | adopted a little bit different strategy from the one used in the previous study so in the previous |
---|
0:17:27 | study we calculated just the euclidean distance between two structures but in this study |
---|
0:17:31 | we |
---|
0:17:33 | treated the distance calculation as a regression problem |
---|
0:17:39 | so first we prepared |
---|
0:17:42 | reference distances between two speakers so these reference distances are given from the ipa |
---|
0:17:48 | transcripts |
---|
0:17:49 | so first we did dtw between two transcripts |
---|
0:17:53 | between two speakers so that we can define the reference distances |
---|
0:17:57 | and this is the target of prediction so for prediction we used a regression model and |
---|
0:18:03 | here we used as input features the structure-based features |
---|
0:18:07 | so |
---|
0:18:08 | for comparison we did another experiment |
---|
0:18:11 | so this is the distance between two |
---|
0:18:15 | phonemic transcripts so in this case in the other experiment |
---|
0:18:20 | phonemic transcripts are used |
---|
0:18:23 | so the phonetic transcripts are converted by phone-to-phoneme conversion ok |
---|
0:18:27 | it's a kind of rough transcript |
---|
0:18:29 | so |
---|
0:18:30 | then we calculate the dtw distance between these two and this corresponds to a rough calculation of |
---|
0:18:36 | accent distance |
---|
0:18:38 | so |
---|
0:18:40 | for the dtw the ipa-based reference distance we did dtw |
---|
0:18:46 | between two transcripts |
---|
0:18:48 | but for that |
---|
0:18:49 | we have to prepare |
---|
0:18:50 | a distance matrix over all the ipa phones all the kinds of phones that might |
---|
0:18:55 | be found in the archive |
---|
0:18:57 | so well the number of ipa phones is very large more than |
---|
0:19:01 | three hundred |
---|
0:19:02 | but we found that one hundred fifty three ipa symbols can cover |
---|
0:19:06 | ninety five percent of all the phone instances in the saa |
---|
0:19:09 | so |
---|
0:19:10 | we asked a phonetician to produce |
---|
0:19:12 | each of these symbols twenty times so we built speaker-dependent phone models not phoneme |
---|
0:19:18 | but phone |
---|
0:19:19 | phone hmms were built so we calculated the bhattacharyya distance between any pair of |
---|
0:19:24 | phones |
---|
0:19:24 | ipa phones |
---|
0:19:26 | so then we prepared a phone-based distance matrix and using that we calculated the |
---|
0:19:32 | transcript-to-transcript distance |
---|
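A minimal sketch of this reference-distance computation: DTW between two phone transcripts whose local cost is the phone-to-phone distance. The three-phone set and the cost table below are invented stand-ins for the 153-symbol bhattacharyya-based matrix built from the phonetician's recordings:

```python
def dtw_distance(seq_a, seq_b, cost):
    """DTW alignment cost between two phone transcripts, where cost(p, q)
    is the acoustically derived distance between phones p and q."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = c + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Hypothetical symmetric phone-to-phone distance table.
table = {
    ("i", "i"): 0.0, ("e", "e"): 0.0, ("a", "a"): 0.0,
    ("i", "e"): 0.3, ("e", "a"): 0.4, ("i", "a"): 0.9,
}
def cost(p, q):
    return table.get((p, q), table.get((q, p), 1.0))

ref = ["i", "e", "a"]   # one speaker's transcript
hyp = ["i", "i", "a"]   # another speaker's transcript
print(dtw_distance(ref, hyp, cost))  # prints 0.3
```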
0:19:35 | but for this calculation we selected the speakers from the saa |
---|
0:19:39 | only a part of the speakers was useful for this task because |
---|
0:19:45 | many speakers of the saa inserted or deleted some words for example |
---|
0:19:49 | some small words okay |
---|
0:19:51 | so it's a it's a kind of nonnativeness okay |
---|
0:19:54 | so we deleted these speakers and the number of speakers was drastically reduced |
---|
0:19:59 | so the number of original speakers is more than |
---|
0:20:03 | one thousand eight hundred but the effective number of speakers is only three hundred three |
---|
0:20:09 | hundred seventy but the number of speaker pairs |
---|
0:20:13 | is still very large |
---|
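The reason the pair count stays large: it grows quadratically in the number of speakers, so even roughly 370 effective speakers give tens of thousands of pairs:

```python
def n_pairs(n: int) -> int:
    # Number of unordered speaker pairs needed for a full distance matrix.
    return n * (n - 1) // 2

print(n_pairs(370))  # 68265 pairs from about 370 effective speakers
```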
0:20:15 | so |
---|
0:20:17 | so using this reference distance |
---|
0:20:21 | we ran the test so what kind of features we |
---|
0:20:24 | used features and regression model so first we built a ubm hmm corresponding to the |
---|
0:20:31 | stella paragraph okay so to build it we used all the speech accent archive speech so we built |
---|
0:20:36 | a phoneme hmm concatenation |
---|
0:20:39 | and the ubm was built |
---|
0:20:40 | so |
---|
0:20:41 | each utterance is used for map adaptation to adapt a speaker-dependent hmm a paragraph-based hmm |
---|
0:20:48 | and then the structure calculation is done so well since the paragraph contains two |
---|
0:20:53 | hundred twenty one phoneme instances by referring to by referring to the cmu dictionary a two |
---|
0:20:58 | twenty one by two twenty one distance matrix is obtained so this is a kind of |
---|
0:21:02 | pronunciation skeleton accent skeleton |
---|
0:21:05 | so but |
---|
0:21:08 | what we want to predict is the accent distance between two speakers so the input |
---|
0:21:13 | features to the svr should be differential features between two speakers speaker |
---|
0:21:19 | s and t so here we used the difference of the two matrices just a subtraction |
---|
0:21:26 | of s and t and where |
---|
0:21:28 | in previous works we did |
---|
0:21:30 | the squared sum of these features i mean the euclidean distance but in this |
---|
0:21:35 | study we separate each of them and then |
---|
0:21:39 | these features are used as input features into the svr |
---|
0:21:43 | how many elements the number of dimensions is quite huge about twenty four thousand |
---|
0:21:48 | so one |
---|
0:21:49 | high-dimensional vector can represent accent characteristics |
---|
0:21:53 | okay i think this is kind of similar to a gmm supervector where one high-dimensional |
---|
0:21:58 | vector can represent speaker characteristics |
---|
0:22:01 | so this is used as input features to the svr and as the svr a |
---|
0:22:06 | very general one well |
---|
0:22:07 | a standard one is used |
---|
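A sketch of this feature construction (toy 4x4 matrices instead of the real 221x221 ones, random values, no actual svr training): the element-wise difference of two speakers' contrast matrices is flattened over the upper triangle into one vector, and that vector is what would feed the svr:

```python
import random

def difference_features(d_s, d_t):
    """Flatten the element-wise difference of two speakers' contrast
    matrices (upper triangle only) into one feature vector."""
    n = len(d_s)
    return [d_s[i][j] - d_t[i][j] for i in range(n) for j in range(i + 1, n)]

def toy_matrix(n):
    # Random symmetric contrast matrix with zero diagonal.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = random.random()
    return m

random.seed(0)
x = difference_features(toy_matrix(4), toy_matrix(4))
print(len(x))  # 6 = 4*3/2 upper-triangle elements

# For the talk's 221-phoneme paragraph the dimensionality would be:
print(221 * 220 // 2)  # 24310, matching the "about 24,000" in the talk
```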
0:22:10 | and then the last one |
---|
0:22:12 | so this is for comparison |
---|
0:22:15 | transcript-to-transcript distance so two kinds of phoneme-based transcripts are used one |
---|
0:22:21 | is the oracle transcripts |
---|
0:22:23 | manual transcripts |
---|
0:22:25 | and the other one is transcripts generated from a phoneme recognizer a phoneme |
---|
0:22:30 | recognition system |
---|
0:22:32 | the accuracy is about seventy three point five percent so dtw is done between the transcripts of the two |
---|
0:22:37 | speakers |
---|
0:22:39 | so both the oracle transcript and the recognized phoneme transcript |
---|
0:22:41 | correspond to a rough estimation of the accent distance |
---|
0:22:46 | okay so the results |
---|
0:22:50 | two conditions and results |
---|
0:22:52 | so we did |
---|
0:22:54 | prediction experiments using the models with two conditions |
---|
0:22:59 | one is the speaker-pair open mode |
---|
0:23:02 | the other one the speaker open mode |
---|
0:23:04 | so |
---|
0:23:05 | what we want to do is prediction of the speaker distance the accent distance between two |
---|
0:23:10 | speakers so then |
---|
0:23:11 | the unit |
---|
0:23:14 | the unit |
---|
0:23:15 | i mean the unit of input to the svr is a speaker |
---|
0:23:19 | pair so in speaker-pair open mode |
---|
0:23:25 | not a single speaker pair is found simultaneously in training and testing |
---|
0:23:31 | that's speaker-pair open mode |
---|
0:23:32 | and speaker open mode is also tested not a single speaker |
---|
0:23:37 | is found simultaneously in training and testing |
---|
0:23:41 | so two modes |
---|
0:23:43 | and the results of accent distance prediction so we did cross-validation and the performance metric |
---|
0:23:49 | is the correlation to the ipa-based reference distance |
---|
0:23:53 | so this is the result |
---|
0:23:54 | in speaker-pair open mode the correlation is very high okay so this is the |
---|
0:23:59 | resulting correlation graph the reference distance ipa-based and the predicted distance |
---|
0:24:04 | but in speaker open mode |
---|
0:24:06 | the correlation is not so high it dropped quite a lot |
---|
0:24:09 | the oracle transcription gives a phoneme-based |
---|
0:24:12 | rough estimation of the accent distance |
---|
0:24:16 | and in this case you can find that the speaker-pair open mode prediction |
---|
0:24:20 | is higher than the oracle transcription |
---|
0:24:22 | but this one is lower the speaker open mode is lower but still higher than |
---|
0:24:26 | using the transcription generated from the asr |
---|
0:24:30 | so why is it so low in speaker open mode |
---|
0:24:34 | so |
---|
0:24:35 | if we consider the mechanism of the svr we can say that |
---|
0:24:39 | the variety of accent pairs that has to be estimated is |
---|
0:24:43 | of the order of n |
---|
0:24:44 | in the speaker-pair open mode but of the order of n squared in speaker open mode |
---|
0:24:49 | where n is the number of speakers available so |
---|
0:24:52 | speaker-pair open mode |
---|
0:24:54 | is the easier mode |
---|
0:24:55 | so the magnitude of task difficulty can be seen that way the first is close to simple |
---|
0:25:01 | averaging over seen pairs and the second is a more complicated version of the task |
---|
0:25:06 | okay so well let me conclude this work yes |
---|
0:25:12 | summary |
---|
0:25:12 | the ultimate goal of this study is to create a global really global individual-basis |
---|
0:25:18 | map of world englishes |
---|
0:25:20 | and then for that we have to produce a |
---|
0:25:24 | technique to estimate the accent distance between any pair of speakers |
---|
0:25:29 | so for that we used |
---|
0:25:30 | the speech accent archive and invariant speech structure analysis was used as the speech |
---|
0:25:36 | analysis method and experiments showed that |
---|
0:25:39 | a high correlation was found in speaker-pair open mode but in speaker |
---|
0:25:44 | open mode it was not so high sorry |
---|
0:25:46 | future directions so well i think structure vector plus svr is a framework somewhat similar |
---|
0:25:52 | to the gmm supervector a high-dimensional vector that can characterize speaker identity plus |
---|
0:25:57 | svm so but these days lots of people lots of researchers use i-vectors so i-vector-based |
---|
0:26:02 | features might be might be used for this |
---|
0:26:05 | and also i think feature engineering is still needed |
---|
0:26:09 | and other machine learning techniques should be should be used and |
---|
0:26:13 | also we are interested in more extensive collection of data using crowdsourcing |
---|
0:26:20 | that's all thank you |
---|
0:26:26 | [question] thank you i have a quick question about the correlations |
---|
0:26:32 | you're getting in the open speaker set |
---|
0:26:37 | were those european speakers [answer] so i used all the speakers available |
---|
0:26:42 | so asian speakers european speakers and african speakers so well i just selected the speakers |
---|
0:26:49 | who read the paragraph without inserting or deleting words |
---|
0:26:54 | [question] my question is all right that is still based on a perfectly read or |
---|
0:27:00 | nearly perfectly read paragraph right sure and studies have shown that |
---|
0:27:04 | when you're looking at accent if you're reading prepared text versus spontaneous or conversational speech |
---|
0:27:12 | you get much more accent yes sure in conversational speech than in read speech sure |
---|
0:27:18 | so could you comment on whether you think spontaneous speech should be collected for each speaker |
---|
0:27:24 | so |
---|
0:27:25 | before coming here i visited helsinki university yesterday i skipped the first half of this conference |
---|
0:27:30 | why |
---|
0:27:31 | because |
---|
0:27:31 | there's a research team there collecting spontaneous natural non-native english okay so |
---|
0:27:37 | some other research groups are collecting data of spontaneous speech between non-native |
---|
0:27:44 | speakers |
---|
0:27:45 | it's kind of messy |
---|
0:27:47 | okay messy data but such analyses can find unexpected things |
---|
0:27:51 | so this database the saa is a very artificial |
---|
0:27:54 | controlled dataset ok |
---|
0:27:56 | so but |
---|
0:27:57 | what is possible with |
---|
0:27:59 | spontaneous speech and what is possible with controlled data so i think something is possible |
---|
0:28:05 | only with controlled data and some other things become possible only with |
---|
0:28:11 | spontaneous data |
---|
0:28:12 | so |
---|
0:28:14 | my proposal to those |
---|
0:28:15 | researchers is to do the |
---|
0:28:19 | collection of controlled data and spontaneous data at the same time ok |
---|
0:28:23 | so for example |
---|
0:28:25 | the saa paragraph please call stella that paragraph is collected from the speakers |
---|
0:28:30 | and then also you collect spontaneous data from those |
---|
0:28:35 | speakers |
---|
0:28:36 | so then the accent clustering is done well by using the |
---|
0:28:41 | controlled data and then the clustering result can be used to explain what is |
---|
0:28:47 | happening in non-native conversations |
---|
0:28:49 | so |
---|
0:28:51 | so i think what is needed is to collect both |
---|
0:28:55 | kinds of data controlled data and spontaneous data |
---|
0:28:59 | so i know that some other researchers claim that the saa is not |
---|
0:29:03 | really non-native data it's just an artificial collection of data but i think from a |
---|
0:29:11 | technical point of view this kind of dataset is very useful |
---|