Our next talk is from the University of Tokyo.

The talk is about speaker-basis accent clustering of world Englishes using invariant structure analysis and the Speech Accent Archive.

Alright, thank you.

First, I will give the background and objective, then explain what kind of corpus we used and what kind of method of speech signal analysis we used. After that, I will show you a very interesting result from a previous study, followed by the experiments done in the current paper.

so

In this talk I focus on English: not only native-speaker English, but English used as a global language, an international language spoken by everybody.

These days, we can find more and more researchers and teachers treating English as "world Englishes." What is world Englishes? For those linguists, it is a set of localized versions of English. They claim that there is no standard pronunciation of English, and that American English and British English are just two major examples of accented English.

This is the very well-known three-circle model of world Englishes. The inner circle is English as a native language; the outer circle is English as an official language, as in Singapore; and the expanding circle is English as a foreign language, as in Japan, Finland, and Brazil.

In this situation, what kind of research interest is found? In this study, as in linguistics, great interest lies in how one type of pronunciation compares to other varieties, not in how one type of pronunciation is incorrect compared to American English or British English.

Here I ask a simple question: what is the minimum unit of accent diversity in world Englishes? Some people may say it is the country: American-accented English, Japanese-accented English, Finnish-accented English. Others may say it is the town: a New York accent, a Helsinki accent. It could be a town or a village.

But if we consider the reason for accents, it is the personal history of learning and using English. So the minimum unit will be the individual: my English, your English, his English, her English. How many different kinds of English are there? The number of English users is said to be 1.5 billion, so we can say there are 1.5 billion different Englishes on this planet.

The aim of this study is to examine the technical feasibility of speaker-basis accent clustering of world Englishes. If you do bottom-up clustering, you have to prepare a pairwise distance matrix among all the elements, among all the speakers. So the aim of this study is the technical feasibility of estimating inter-speaker accent distances.
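As a minimal illustration of the bottom-up clustering just described, here is a single-linkage sketch; the speaker names and the distance matrix are invented for illustration, and a real run would use the estimated inter-speaker accent distances:

```python
# Single-linkage bottom-up clustering from a pairwise distance matrix.
# Speaker names and distances are made up for illustration.

def bottom_up_cluster(names, dist):
    """Repeatedly merge the two closest clusters until one remains."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: cluster distance = closest member pair
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

names = ["A", "B", "C"]
dist = {"A": {"A": 0, "B": 1, "C": 4},
        "B": {"A": 1, "B": 0, "C": 4},
        "C": {"A": 4, "B": 4, "C": 0}}
merges = bottom_up_cluster(names, dist)
# A and B merge first (distance 1), then {A, B} joins C (distance 4)
```

The merge history is exactly what a dendrogram plots, which is how the clustering results later in the talk are displayed.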

What kind of corpus did we use? The Speech Accent Archive. This is a very interesting and very useful corpus for us, developed by Weinberger at George Mason University. In developing the corpus, he asked lots and lots of international users of English to read a common paragraph, the "Please call Stella" passage. This paragraph was designed to achieve high phonetic coverage of American English (Weinberger is an American speaker), so the corpus focuses on General American English.

I will show you one example from the Speech Accent Archive. [An audio sample is played.] This is a speaker from the Czech Republic. In the Speech Accent Archive, this kind of variously accented English can be found.

Also, this corpus is very useful because it provides us with IPA transcripts, narrow transcripts, like this one. Using these, we can construct a predictor of accent distances; this is very useful.

So what is the technical challenge here? I can say that the acoustic difference, the acoustic distance, between two speakers is not the accent distance. Let me show you three examples: three utterances reading the same sentence.

Utterance A is from an American female speaker, and the other two are my own pronunciations: B is my normal English, and X is my intentionally Japanized English. [The three utterances are played.]

The question is whether X is closer to A or closer to B. If you focus on the acoustic difference between two speakers, X has to be much closer to B, because both sounds are generated by the same speaker. But if you focus on the accent difference, the phonetic difference, I think X will be just as close to the two.

So how do we estimate the accent distance between two speakers? Several methods are possible, but in this talk I focus on the speech features used for that task. We try to remove, or suppress, the non-linguistic factors such as age and gender; these are totally irrelevant factors and have to be removed.

In normal acoustic analysis of speech, phase information is removed and pitch harmonics are removed. But what about speaker identity? How do we remove it from speech? This is the question. For that, something like a pronunciation skeleton has to be extracted for comparison. So how do we do that?

In a previous study, we proposed invariant speech structure analysis: a speaker-invariant, or speaker-independent, representation of speech. So how do we extract the skeleton, the pronunciation skeleton?

Good features for this task should be insensitive to age and gender differences, and sensitive to accent differences. This figure shows the age and gender differences of the formant frequencies of the Japanese vowels; I think you are familiar with this kind of plot. And this figure shows the accent difference between speakers of two American English dialects. Looking at these graphs, we can say that good features seem to be not feature instances but feature relations: the distributional pattern of the features. The pattern is similar among speakers of the same dialect, but for different dialects the feature distributions are totally different.

In this talk we focus on the relations, the distribution patterns, and they can be represented geometrically as a distance matrix. The question here is what kind of distance measure is speaker-independent, or speaker-invariant. So, invariance and variability: how do we define an invariant distance between two speech events?

In studies of voice conversion, speaker variability is often modeled as a transformation of the acoustic space. For example, this is the acoustic space of speaker A, and this is the acoustic space of speaker B. One trajectory represents one utterance, "good morning," of speaker A, and this is "good morning" of speaker B. So how do we extract speaker-independent features from here?

Speaker independence, speaker invariance, can be interpreted as invariance with respect to the transformation. So the question here is: what is completely transform-invariant as a feature measure? We found that f-divergence is a very good candidate for that.

Here, every speech event is characterized as a distribution, not as a point in the acoustic space, and we calculate f-divergences between the distributions. The f-divergence measure is invariant under any differentiable and continuous invertible transform. And it is interesting that if we want a measure that is completely invariant, it has to be an f-divergence.
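The invariance property can be checked numerically in the simplest setting: two 1-D Gaussians under an affine "speaker transform" x -> ax + b. The talk's claim covers any differentiable, invertible transform and general distributions; the parameter values below are arbitrary:

```python
import math

def bhattacharyya_1d(m1, s1, m2, s2):
    """Bhattacharyya distance (an f-divergence) between two 1-D Gaussians."""
    v = (s1 * s1 + s2 * s2) / 2.0
    return (m1 - m2) ** 2 / (8.0 * v) + 0.5 * math.log(v / (s1 * s2))

# Two distributions standing in for two speech events of speaker A
d_before = bhattacharyya_1d(1.0, 0.5, 2.5, 0.8)

# Map both events through the same invertible transform (speaker A -> speaker B):
# a Gaussian N(m, s) becomes N(a*m + b, |a|*s)
a, b = 1.7, -3.0
d_after = bhattacharyya_1d(a * 1.0 + b, abs(a) * 0.5, a * 2.5 + b, abs(a) * 0.8)

# d_before and d_after agree to machine precision: the divergence is unchanged
```

Both the distance term and the log-variance term cancel the transform parameters exactly, which is why contrasts between distributions survive a change of speaker while the distributions themselves do not.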

Our structure-based method consists of the following steps. Let's use distributions to represent pronunciation, to represent speech. This is our approach: the trajectory in a feature space representing one utterance is converted into a sequence of distributions. After that, we calculate the f-divergence between any pair of distributions. In this talk we use the Bhattacharyya distance, which is one of the f-divergence measures.

The same procedure, looked at from a different viewpoint: we implement it as training of an HMM and calculating the distance between any pair of states. From one utterance, one utterance HMM is built, and then we extract not only the local contrasts but also the distant contrasts.
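The structure extraction step can be sketched as follows, with a few diagonal-Gaussian "states" standing in for the states of the trained utterance HMM (the parameter values are invented): every state pair, local or distant, contributes one Bhattacharyya distance, and the full pairwise matrix is the structure.

```python
import math

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = (v1 + v2) / 2.0
        d += (m1 - m2) ** 2 / (8.0 * v) + 0.5 * math.log(v / math.sqrt(v1 * v2))
    return d

def structure(states):
    """Full pairwise distance matrix over the states of one utterance."""
    n = len(states)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mi, vi = states[i]
            mj, vj = states[j]
            M[i][j] = M[j][i] = bhattacharyya_diag(mi, vi, mj, vj)
    return M

# Four made-up states (2-dimensional means and variances)
states = [([0.0, 1.0], [1.0, 1.0]),
          ([0.5, 0.5], [1.2, 0.8]),
          ([2.0, -1.0], [0.9, 1.1]),
          ([1.5, 0.0], [1.0, 1.0])]
M = structure(states)  # symmetric, zero diagonal
```

In the real system each state is estimated from speech, but the structure itself is just this matrix of all pairwise divergences.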

So far I have explained the background, objective, corpus, and method. Now I will show you some interesting results from previous work. Around 2006, we did speaker-basis accent clustering, but that experiment used simulated data: simulated Japanese English.

In that work, we used twelve Japanese students who were returnees from the US. They are of course very good speakers of Japanese, and they are also very good speakers of American English. We asked them to pronounce eleven American English vowel words ("beat," "bit," "bet," "bat," and so on), and also to pronounce Japanese vowel words. Then we extracted the vowel segments and formed vowel-based structures.

We wanted to simulate variously accented Japanese English, so we replaced some of the American English vowels with Japanese vowels. This gives a degree-of-replacement scale from 1 to 8. Accent 8 is no replacement: all the vowels are American English vowels. Accent 1 is full replacement: all the American English vowels are replaced by Japanese vowels, giving totally Japanese-accented words. Accents 2 to 7 are partly Japanese, partly American English.

For example, which Japanese vowels are the American English vowels replaced with? This is the replacement table: these vowels are replaced by the Japanese vowel /a/, and so on.

We have twelve speakers, A to L, and eight pronunciation accents, 1 to 8, so we have ninety-six simulated learners. Let's cluster these.

From the vowel samples we can estimate the vowel distributions, and then we can get a distance matrix, the matrix that represents the structure. To cluster the ninety-six speakers, we have to calculate a 96-by-96 distance matrix.

But one speaker is modeled as a structure, so how do we define the distance measure between two structures? We prepared two kinds of structure-to-structure distance measures.

This is the first one, a very simple definition of the distance between two structures: the Euclidean distance between the two speakers' structures. One speaker's structure is the blue one and the other's is the green one; we calculate the Euclidean distance between these two.

This is the other definition of the distance between two structures. In this case, let's focus on vowel /a/ of speaker S and vowel /a/ of speaker T, and calculate the difference between these two distributions: the Bhattacharyya distance between vowel i of speaker S and vowel i of speaker T. Then we take the summation over the vowels. So we have two different definitions of the distance.
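The two structure-to-structure measures can be written side by side: a first-order one (sum of distances between corresponding vowel distributions) and a second-order one (Euclidean distance between the speakers' within-speaker contrast sets). A minimal sketch with 1-D Gaussian vowel models and invented numbers:

```python
import math

def bd(g1, g2):
    """Bhattacharyya distance between two 1-D Gaussians (mean, stddev)."""
    (m1, s1), (m2, s2) = g1, g2
    v = (s1 * s1 + s2 * s2) / 2.0
    return (m1 - m2) ** 2 / (8.0 * v) + 0.5 * math.log(v / (s1 * s2))

def instance_distance(spk_s, spk_t):
    """First-order: sum of distances between corresponding vowels."""
    return sum(bd(a, b) for a, b in zip(spk_s, spk_t))

def structure_distance(spk_s, spk_t):
    """Second-order: Euclidean distance between within-speaker contrasts."""
    n = len(spk_s)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (bd(spk_s[i], spk_s[j]) - bd(spk_t[i], spk_t[j])) ** 2
    return math.sqrt(total)

# Speaker T produces the same vowels as speaker S, shifted by a constant
# (e.g. a longer vocal tract): the instances move, the contrasts do not.
spk_s = [(1.0, 0.5), (2.0, 0.5), (3.5, 0.5)]
spk_t = [(m + 0.7, s) for m, s in spk_s]

d_instance = instance_distance(spk_s, spk_t)    # clearly nonzero
d_structure = structure_distance(spk_s, spk_t)  # (near) zero: same skeleton
```

This is the mechanism behind the dendrogram results that follow: the first-order measure still sees the speaker, while the second-order one sees only the pronunciation pattern.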

Using these two, we can obtain two 96-by-96 distance matrices among the speakers. If we draw dendrograms for these two distance matrices, what matters is what kind of result we obtain.

If the result is like this, 1, 2, 3, 4, we are very happy, because 1 to 8 are the pronunciation accents. If the result is something like this, A, B, C, D, it is speaker clustering, and we are not happy.

So what kind of result did we obtain? This is the result of the contrast-based Euclidean distance, and this is the result of the instance-based distance measure, the second definition. Here you can see 1, 3, 5; some noise can be found, but overall it is rather good accent clustering. But what about this one? J, L, K, A, ..., D: it is complete speaker clustering.

Why such a big difference between the two dendrograms? The big difference is caused by the difference in the distance definition between two structures. This one is just a difference of two vowel sets, but this one is a difference of differences. The first-order difference gives you speaker clustering, but the second-order difference gives you accent clustering. That is the interesting thing.

So let's use this for real data: the Speech Accent Archive. We have data from a large number of speakers, all reading the same paragraph. Let's cluster these speakers.

But in this work, we adopted a little different strategy from the previous study. In the previous study, we calculated just the Euclidean distance between two structures; in this study, we treated the distance calculation as a regression problem.

First, we prepared reference distances between speakers. The reference distances are given from the IPA transcripts: we ran DTW between the transcripts of two speakers, and that defines the reference distance. This is the prediction target. For prediction we used a regression model, and as input features we used the structure-based features.
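The reference-distance step can be sketched with a minimal DTW over two phone-symbol sequences. The toy binary substitution cost and the example transcripts are invented; the actual system uses phone-to-phone Bhattacharyya distances, described later, as the substitution costs:

```python
def dtw_distance(seq1, seq2, cost):
    """Dynamic time warping distance between two symbol sequences."""
    n, m = len(seq1), len(seq2)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq1[i - 1], seq2[j - 1])
            # match, or absorb an extra symbol on either side
            D[i][j] = c + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# Toy cost: 0 for identical phone symbols, 1 otherwise
cost = lambda a, b: 0.0 if a == b else 1.0

t1 = ["p", "l", "i", "z"]        # "please", roughly
t2 = ["p", "l", "i", "s"]        # a devoiced final consonant
d = dtw_distance(t1, t2, cost)   # one substitution along the best path
```

With a graded cost table instead of the binary one, the accumulated path cost becomes a continuous accent distance between two speakers' transcripts.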

For comparison, we ran another experiment using the distance between two phonemic transcripts. In this case, the phonetic transcripts are converted into phonemic transcripts, a kind of rough transcript, and we calculate the DTW distance between the two; this corresponds to a rough characterization of the accent difference.

For the DTW-based IPA reference distance, we had to prepare substitution costs between IPA phones: distances between all the kinds of phones found in the archive. The number of IPA phone symbols is very large, more than three hundred, but we found that 153 IPA symbols can cover 95 percent of all the phone instances in the Speech Accent Archive.

We asked a phonetician to produce each of these symbols twenty times, and speaker-dependent phone HMMs (not phoneme, but phone HMMs) were built. Then we calculated the Bhattacharyya distance between any pair of phones, IPA phones, and prepared a phone-based distance matrix. Using that, we calculate the transcript-to-transcript distances.

For this calculation, we had to select speakers from the Speech Accent Archive: only part of the speakers were usable for this task, because many speakers in the archive inserted or deleted some words, a kind of non-nativeness. So we deleted those speakers, and the number of speakers was drastically reduced: the number of original speakers is more than a thousand, but the effective number of speakers is only about three hundred seventy. Still, the number of speaker pairs is very large.

Using these reference distances, we ran the experiments. What kind of features and regression model did we use? First, we built a UBM HMM corresponding to the whole paragraph: using all the Speech Accent Archive speech, a paragraph HMM, a concatenation of phoneme HMMs, was built as the UBM.

Each speaker's utterance is used for MAP adaptation, giving a speaker-dependent, paragraph-based HMM, and then the structure calculation is done. Since the paragraph contains 221 phoneme instances (by referring to the CMU dictionary), a 221-by-221 distance matrix is obtained. This is a kind of pronunciation skeleton, an accent skeleton.

What we want to predict is the accent distance between two speakers, so the input features to the SVR should be differential features between two speakers, S and T. Here we used the difference matrix, just the subtraction of the two structure matrices. In previous work we used the square sum of these differences, i.e., the Euclidean distance, but in this study we keep each element separate, and these features are used as the input features to the SVR.
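The input vector to the regressor can be sketched as the element-wise difference of the two speakers' structure matrices, flattened over the upper triangle. For the 221-state paragraph HMM this gives 221 x 220 / 2 = 24,310 dimensions, matching the roughly twenty-four-thousand figure mentioned next; the function name is mine:

```python
def pair_features(struct_s, struct_t):
    """Upper-triangle flattening of the difference of two structure matrices."""
    n = len(struct_s)
    return [struct_s[i][j] - struct_t[i][j]
            for i in range(n) for j in range(i + 1, n)]

# With the paragraph-sized 221x221 structures, the feature dimension is 24,310
n = 221
zeros = [[0.0] * n for _ in range(n)]
dim = len(pair_features(zeros, zeros))  # 221 * 220 // 2
```

Each training example for the regressor is one such vector for a speaker pair, paired with that pair's IPA-based reference distance as the target.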

How many elements? The input dimension is quite huge: about twenty-four thousand. One high-dimensional vector represents the accent characteristics. I think this is similar to a GMM supervector, where one high-dimensional vector represents the speaker characteristics. This is used as the input features to the SVR, support vector regression.

And one more thing, for comparison: the transcript-to-transcript distance. Two kinds of phoneme-based transcripts are used. One is the oracle transcript. The other is transcripts generated from a phoneme recognizer, whose accuracy is about 73.5 percent. DTW is run between the phonemic transcripts of the two speakers; the phonemic transcript distance corresponds to a rough estimation of the accent difference.

Okay, so the results: two conditions and the results. We did prediction experiments with two conditions. One is the speaker-pair open mode, the other is the speaker open mode. What we want to do is prediction of the accent distance between two speakers, so the unit of input to the SVR is a speaker pair. In the speaker-pair open mode, not a single speaker pair is found simultaneously in training and testing. In the speaker open mode, not a single speaker is found simultaneously in training and testing. So, two modes.
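The two evaluation modes can be made concrete with a small split sketch (the speaker names are invented): in the pair-open split, train and test share no speaker pair but may share speakers; in the speaker-open split, they share no speaker at all.

```python
import itertools
import random

def pair_open_split(speakers, test_frac=0.2, seed=0):
    """No speaker pair appears in both sets; individual speakers may."""
    pairs = list(itertools.combinations(sorted(speakers), 2))
    random.Random(seed).shuffle(pairs)
    k = int(len(pairs) * test_frac)
    return pairs[k:], pairs[:k]          # (train pairs, test pairs)

def speaker_open_split(speakers, test_frac=0.2, seed=0):
    """No speaker appears in both sets; pairs are formed within each side."""
    spk = sorted(speakers)
    random.Random(seed).shuffle(spk)
    k = int(len(spk) * test_frac)
    test_spk, train_spk = spk[:k], spk[k:]
    return (list(itertools.combinations(sorted(train_spk), 2)),
            list(itertools.combinations(sorted(test_spk), 2)))

speakers = ["s%02d" % i for i in range(10)]
tr_pairs, te_pairs = pair_open_split(speakers)    # pair-open mode
tr_so, te_so = speaker_open_split(speakers)       # speaker-open mode
```

The speaker-open split discards every pair that crosses the boundary, which is one way to see why it is the harder, more realistic condition.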

Here are the results of accent distance prediction. We did cross-validation, and the performance metric is the correlation to the IPA-based reference distances. This is the result: in the speaker-pair open mode, the correlation is very high. This is the scatter plot of the IPA-based reference distances and the predicted distances.

But in the speaker open mode, the correlation is not so high; it is quite low. The oracle transcription gives a phoneme-based, rough estimation of the accent distance. You can find that the speaker-pair open mode prediction is higher than the oracle transcription; the speaker open mode is lower than that, but still higher than using the transcription generated from the ASR.

So why is the speaker open mode so low? If we consider the mechanism of SVR, we can say that the magnitude of the task difficulty can be estimated as O(n) in the speaker-pair open mode but O(n squared) in the speaker open mode, where n is the number of available speakers. In the speaker-pair open mode, every test speaker has been seen in training paired with other speakers, so the prediction can be viewed as a complicated version of simple averaging over the training data.

Okay, let me conclude this work. Summary: the ultimate goal of this study is to create a global, really global, individual-basis map of world Englishes. For that, we have to develop a technique to estimate the accent distance between any pair of speakers. We used the Speech Accent Archive, and invariant speech structure analysis was used as the speech analysis method. Experiments showed that a high correlation was found in the speaker-pair open mode, but the speaker open mode is not satisfactory.

Future directions: I think the structure vector fed to the SVR is somewhat similar to a GMM supervector, one high-dimensional vector that can characterize a speaker's identity. But these days lots of researchers use i-vectors, so i-vector-based features might be usable for this task. Some feature engineering is still needed, I think, and the machine learning techniques should be updated. Also, we are interested in more extensive collection of data, for example using crowdsourcing.

Alright, any questions? Q: About the correlation you're getting in the speaker open mode: which speaker set did you use? A: I used all the speakers available: European speakers, Asian speakers, African speakers. I selected the speakers who read the paragraph without inserting or deleting words.

Q: My question is: this is still based on a perfectly read paragraph, right? Studies have shown that when you're looking at accent, reading prepared text versus spontaneous or conversational speech, you get much more variation in conversational speech. Could you comment on whether you think this would hold for each speaker's spontaneous speech?

A: Before coming here, I visited Helsinki; that is why I skipped the first half of this conference. There is a research team there collecting spontaneous, natural non-native English, and some other research groups are also collecting spontaneous conversational data from non-native speakers. That kind of data is messy, and messy data often leads the analysis to unexpected things. This database, by contrast, is a very artificial, controlled dataset.

But what is possible with spontaneous speech, and what is possible with controlled data? I think some things are possible only with controlled data, and other things become possible only with spontaneous data. So my proposal to those researchers is to collect controlled data and spontaneous data at the same time.

For example, the "Please call Stella" paragraph could be collected from the speakers they are using, and spontaneous data could also be collected from those same speakers. Then accent clustering is done with the controlled data, and the clustering result can be used to explain what is happening in the non-native conversations.

So I think what is needed is to collect both kinds of data: controlled data and spontaneous data. I know that some researchers claim that the Speech Accent Archive is not really non-native data, that it is just an artificial collection of data, but from a technical point of view, I think that kind of dataset is very useful.