Our next speaker, from the University of Tokyo, will talk about speaker-basis accent clustering of World Englishes using invariant structure analysis and the Speech Accent Archive.
Alright, thank you. This is the outline of my presentation: first the background and objective, then the corpus we used and the method of speech analysis we used. After that I will show you a very interesting result from a previous study, and then the experiments done in the current paper.
In this talk I focus on English, and not only American English: as you know, English is used as the only truly global language, an international language spoken by everybody.
Due to this, we can find more and more researchers and teachers treating English not as "the English" but as World Englishes. What is World Englishes? Linguists define it as a set of localized versions of English. They claim that there is no standard pronunciation of English, and that American English and British English are just two major examples of accented English.
This is the very well-known three-circle model of World Englishes: the inner circle uses English as a native language, the outer circle uses English as an official language, like Singapore, and the expanding circle uses English as a foreign language, as in Japan, Finland, and Brazil.
In this situation, the stance taken in this study is a linguistic one: great interest lies in how one type of pronunciation compares to other varieties, not in how incorrect one type of pronunciation is compared to American English or British English.
Here I ask a simple question: what is the minimum unit of accent diversity of World Englishes? Some people may say the country: an American accent, a Japanese accent, a Finnish accent. Others may say the city: a New York accent, a Helsinki accent. A city, a town, or a village? But if we consider the reason for accents, it will be the personal history of learning English. So the minimum unit will be the individual: my English, your English, his English, her English. How many different kinds of English are there? The number of English users is said to be 1.5 billion, so we can say that there are 1.5 billion different Englishes on this planet.
Okay, so the aim of this study is the technical feasibility of speaker-basis accent clustering of World Englishes. If you do bottom-up clustering, you have to prepare a distance matrix among all the elements, that is, among all the speakers. So the aim of this study is the technical feasibility of estimating inter-speaker accent distances.
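A minimal sketch of that bottom-up clustering step, with random placeholder distances instead of real accent distances; it only illustrates that once an inter-speaker distance matrix exists, the speaker tree follows from standard agglomerative clustering:

```python
# Toy bottom-up (agglomerative) clustering from an inter-speaker distance matrix.
# The distances here are random placeholders, not estimated accent distances.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_speakers = 6
D = rng.uniform(0.1, 1.0, size=(n_speakers, n_speakers))
D = (D + D.T) / 2.0                            # make the matrix symmetric
np.fill_diagonal(D, 0.0)                       # zero self-distances
Z = linkage(squareform(D), method="average")   # bottom-up clustering
print(dendrogram(Z, no_plot=True)["ivl"])      # leaf order of the speaker tree
```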
So what kind of corpus did we use? The Speech Accent Archive. This is a very interesting and very useful corpus for us, developed by Weinberger at George Mason University. In developing the corpus, he asked lots and lots of international users of English to read a common paragraph, "Please call Stella...". The first part of the paragraph was designed to achieve high phonetic coverage of American English, and since Weinberger is an American speaker, the corpus is centered on General American English.
Let me show you one example from the Speech Accent Archive. I have to click this... okay, we had a small problem with the audio. This is a speaker from the Czech Republic. In the Speech Accent Archive, this kind of variously accented English can be found. The corpus is also very useful because it provides us with IPA transcripts, narrow transcripts, like this one. Sorry, these are the narrow transcripts. Using these, we can train a predictor of the distances, so this corpus is very useful.
So, the next slide: what is the technical challenge here? Here I can say that the acoustic difference, the acoustic distance, between two speakers is not the accent distance. I will show you three example utterances, three readings of the same sentence. A is from an American female speaker, and the other two, B and X, are my own pronunciations: one is my normal English and the other is my heavily Japanized English. [The three utterances are played.] The question is whether X is closer to A or closer to B. If you focus on the acoustic difference between the two speakers, X has to be much closer to B, because both utterances are generated by the same speaker. But if you focus on the accent difference, the phonetic difference, I think X will be judged as being close to A. So how can we estimate the accent distance between two speakers? Several methods are possible, but in this talk we focus on the spectral features used for that task.
We try to remove, or suppress, non-linguistic factors such as age and gender. These are totally irrelevant factors, so they have to be removed. In the normal acoustic analysis of speech, phase information is removed and pitch harmonics are removed. But what about speaker identity? How can we remove speaker identity, which shows up mainly as formant shifts, from speech? That is the question. For that, something like a pronunciation skeleton has to be extracted for comparison. How can we do that? In a previous study, we proposed invariant speech structure analysis, which is a speaker-invariant, or speaker-independent, representation of speech.
Okay, so how do we extract the skeleton, the pronunciation skeleton? Good features for this task should be insensitive to age and gender differences, but sensitive to accent differences. This figure shows the age and gender differences in the formant frequencies of the Japanese vowels; I think you are familiar with this kind of plot. And this one shows the accent differences between two groups of American English dialect speakers. Looking at these graphs, we can say that a good feature seems to be not the feature instances but the feature relations, that is, the distribution pattern: the pattern is similar among speakers of the same dialect, but for different dialects the feature distributions are totally different. So in this talk we focus on the relations, the holistic distribution pattern, which can be represented geometrically as a distance matrix. The question here is how to make this distance matrix speaker-independent, or speaker-invariant.
So, invariance and variability: how do we define an invariant distance between two speech events? In studies of voice conversion, speaker variability is often modeled as a transformation of the acoustic space. For example, this is the acoustic space of speaker A and this is the acoustic space of speaker B; one trajectory represents one utterance, "good morning", and this is the "good morning" of speaker B. How can we extract speaker-independent features from here?
Speaker independence, or speaker invariance, can be interpreted as transform invariance, so the question here is: what is a completely transform-invariant feature? We found that f-divergence is a very good candidate. Here, every speech event is characterized as a distribution, not as a point in the acoustic space. If we calculate the f-divergence between two distributions, this divergence measure is invariant under any differentiable and invertible transform, and, interestingly, if we want complete invariance, the measure has to be an f-divergence.
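A minimal numerical sketch of this invariance, assuming one-dimensional Gaussian speech events and an affine transform as a stand-in for a speaker change (toy values, not taken from the experiments):

```python
# Bhattacharyya distance (an f-divergence-style measure) between two 1-D Gaussians,
# computed before and after pushing both through the same invertible affine map.
import numpy as np

def bhattacharyya_gauss(m1, s1, m2, s2):
    """Bhattacharyya distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (0.25 * (m1 - m2) ** 2 / (s1 ** 2 + s2 ** 2)
            + 0.5 * np.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2)))

m1, s1 = 1.0, 0.5        # toy "speech event" 1
m2, s2 = 2.5, 0.8        # toy "speech event" 2
a, b = 1.7, -3.0         # invertible affine transform x -> a*x + b ("speaker change")

d_before = bhattacharyya_gauss(m1, s1, m2, s2)
d_after = bhattacharyya_gauss(a * m1 + b, abs(a) * s1, a * m2 + b, abs(a) * s2)
print(d_before, d_after)  # identical values: the divergence is transform-invariant
```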
The speech structure is a holistic, contrast-based representation. Let's use distributions to represent pronunciation, to represent speech. This is our approach: a trajectory in the acoustic space, representing one utterance, is converted into a sequence of distributions; distributions have to be used. After that, we calculate the f-divergence between every pair of distributions. In this talk we use the Bhattacharyya distance, which is one of the f-divergence measures. Looking at the same procedure from a different viewpoint, we implement it as the training of an HMM and the calculation of a distance between every pair of states: from one utterance, one HMM is built. In this way we extract not only local contrasts but also distant contrasts.
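A minimal sketch of that structure computation under simple assumptions: each "state" is a diagonal-covariance Gaussian with random stand-in parameters rather than a trained HMM state, and the structure is the matrix of pairwise Bhattacharyya distances.

```python
# Build one utterance's "speech structure": an N x N matrix of Bhattacharyya
# distances between all pairs of state distributions (random toy parameters).
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    v = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term2 = 0.5 * (np.sum(np.log(v))
                   - 0.5 * (np.sum(np.log(var1)) + np.sum(np.log(var2))))
    return term1 + term2

def structure_matrix(means, variances):
    """Pairwise state-to-state distances: the 'pronunciation skeleton'."""
    n = len(means)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = bhattacharyya_diag(means[i], variances[i],
                                                   means[j], variances[j])
    return S

rng = np.random.default_rng(0)
n_states, dim = 25, 12                        # e.g. 25 states, 12 cepstral dimensions
means = rng.normal(size=(n_states, dim))
variances = rng.uniform(0.5, 1.5, size=(n_states, dim))
print(structure_matrix(means, variances).shape)   # (25, 25) structure for one speaker
```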
Okay, so I have explained the background, the objective, the corpus, and the method, and now I am going to show you some interesting results from previous work. Around 2006, we did speaker-basis accent clustering, but that experiment used simulated data, simulated Japanese English.
In that work, we used twelve Japanese university students who were returnees from the US. They speak Japanese, of course, and they are also very good speakers of American English. We asked them to pronounce the American English vowels, in words such as "beat", "bit", "bet", and "bad", and we also asked them to pronounce the corresponding Japanese vowels and words. Then we extracted the vowel segments automatically and formed vowel-based structures.
But we wanted to simulate variously accented Japanese English, so we replaced some of the American English vowels with Japanese vowels. This axis is the American English vowel set, and 1 to 8 is the degree of replacement: 8 means no replacement, so all the vowels are the American English vowels, and 1 means that all the American English vowels are replaced by Japanese vowels, totally Japanese-accented vowels. 2 to 7 are partly Japanese, partly American English. For example, which Japanese vowel replaces this vowel? This is the replacement table: it shows which American English vowels are replaced by which of the Japanese vowels /a/, /i/, /u/, /e/, /o/.
We have twelve speakers, A to L, and eight pronunciation proficiencies, 1 to 8, so we have 96 simulated learners. Let's cluster these 96 learners.
From the vowel samples of each simulated learner, we can estimate the vowel distributions, and then we can obtain a distance matrix, that is, the vowel-based structure. To cluster the 96 speakers, we have to calculate a 96 by 96 distance matrix.
But one speaker is modeled as a structure, so how do we define the distance between two structures? We prepared two kinds of structure-to-structure distance measures. The first one is a very simple definition: the Euclidean distance between two structures, that is, between the two speakers' distance matrices. Speaker A is the blue one and speaker B is the green one, and we calculate the Euclidean distance between these two. The second definition is different: here we focus on, say, vowel /a/ of speaker S and vowel /a/ of speaker T, calculate the Bhattacharyya distance between these two distributions, do the same for every vowel i of speakers S and T, and sum them up. So we have two different definitions of the structure-to-structure distance.
Using these two definitions, we can compute two 96 by 96 distance matrices among the speakers, and we can then draw dendrograms from these two distance matrices.
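A toy sketch of the two definitions, assuming one-dimensional Gaussian vowels with made-up parameters; it also hints at why the two behave so differently when two speakers share the same pronunciation pattern but have different voices:

```python
# Definition 1: Euclidean distance between the two speakers' own distance matrices
#               (a difference of differences, i.e. second-order).
# Definition 2: sum of Bhattacharyya distances between corresponding vowels
#               (a direct, first-order difference).
import numpy as np

def bd(m1, s1, m2, s2):
    """Bhattacharyya distance between two 1-D Gaussians."""
    return (0.25 * (m1 - m2) ** 2 / (s1 ** 2 + s2 ** 2)
            + 0.5 * np.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2)))

def structure(vowels):
    """Within-speaker distance matrix over that speaker's own vowels."""
    n = len(vowels)
    return np.array([[bd(*vowels[i], *vowels[j]) for j in range(n)] for i in range(n)])

def contrast_based_distance(vowels_s, vowels_t):      # definition 1
    return np.linalg.norm(structure(vowels_s) - structure(vowels_t))

def instance_based_distance(vowels_s, vowels_t):      # definition 2
    return sum(bd(*vs, *vt) for vs, vt in zip(vowels_s, vowels_t))

# Two toy speakers with the same vowel pattern, shifted as a whole by voice differences.
speaker_s = [(1.0, 0.3), (2.0, 0.3), (3.5, 0.3)]      # (mean, std) per vowel
speaker_t = [(1.8, 0.3), (2.8, 0.3), (4.3, 0.3)]
print(contrast_based_distance(speaker_s, speaker_t))  # ~0: same pronunciation pattern
print(instance_based_distance(speaker_s, speaker_t))  # >0: dominated by the voice shift
```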
What matters is what kind of result we obtain. If the result looks like this, we are very happy, because 1 to 8 is the pronunciation proficiency, so the tree groups the speakers by accent. But if the result looks like this, A, B, C, D, it is just speaker clustering, and we are not happy.
So what did we actually obtain? This is the result of the contrast-based Euclidean distance, and this is the result of the instance-based distance measure, the second definition. You can see 1, 3, 5... some noise can be found here, but overall the first one gives rather good accent clustering. But what about this one? J, L, K, A... it is completely speaker clustering.
So there is a big difference between the two dendrograms. Why such a big difference? Because it is caused by the difference in the definition of the distance between two structures. This one is just the difference between the two vowel sets; I would call it a first-order difference, and it gives you speaker clustering. But this one is a difference of differences, a second-order difference, and it gives you accent clustering. That is the interesting point.
So let's use this framework for real data, the Speech Accent Archive. We have data from a very large number of speakers, all reading the same paragraph. Let's cluster these speakers.
But, sorry, in this work we adopted a slightly different strategy from the previous study. In the previous study we simply calculated the Euclidean distance between two structures, but in this study we treated the distance calculation as a regression problem. First, we prepared reference distances between pairs of speakers; these reference distances are derived from the IPA transcripts. We ran DTW between the transcripts of two speakers, and that defines the reference distance. This is the target of the prediction. For the prediction we used a regression model, and as input features we used the structure-based features.
For comparison, we ran another experiment using the distance between two phonemic transcripts. In that experiment, phonemic transcripts are used: the phonetic transcripts are converted into phonemic transcripts, which are a kind of rough transcript. Then we calculate the DTW distance between these two transcripts, which corresponds to a rough calculation of accent distance.
For the DTW-based IPA reference distance, we run DTW between two transcripts, but for that we have to prepare a distance matrix over all the IPA phones that may be found in the archive. The number of IPA phone symbols is very large, more than three hundred, but we found that 153 IPA symbols can cover 95 percent of all the phone instances in the Speech Accent Archive. We asked a phonetician to produce each of these symbols twenty times, and we built speaker-dependent phone HMMs, not phoneme HMMs but phone HMMs, and calculated the Bhattacharyya distance between every pair of phones, of IPA symbols. In this way we prepared a phone-based distance matrix, and using it we calculate the transcript-to-transcript distances.
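A schematic sketch of that transcript-to-transcript calculation, with a hypothetical three-phone inventory and made-up substitution costs standing in for the Bhattacharyya-based phone distance matrix:

```python
# DTW between two phone strings, where the local cost of aligning two phones is
# looked up in a phone-to-phone distance table (toy symbols and costs).
import numpy as np

def dtw_distance(seq_a, seq_b, cost):
    """Length-normalised DTW alignment cost between two symbol sequences."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost[(seq_a[i - 1], seq_b[j - 1])]
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

phones = ["a", "ae", "i"]                                  # hypothetical inventory
cost = {(p, q): (0.0 if p == q else 1.0) for p in phones for q in phones}
cost[("a", "ae")] = cost[("ae", "a")] = 0.3                # acoustically close phones

print(dtw_distance(["a", "i", "a"], ["ae", "i", "a"], cost))
```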
For this calculation, however, we had to select a subset of the speakers from the Speech Accent Archive. Only part of the speakers could be used for this task, because many speakers of the archive delete or insert some words; it is a kind of non-nativeness. We removed those speakers, so the number of speakers was drastically reduced: the original number of speakers in the archive is far larger, but the effective number of speakers is only about 370. The number of speaker pairs, however, is still very large.
Using these reference distances, we ran the experiments. What kind of features and what regression model did we use? First, we built a UBM HMM corresponding to the whole paragraph: using the Speech Accent Archive speech, we built the paragraph HMM as a concatenation of phoneme HMMs; this is the UBM. Each speaker's utterance is then used for MAP adaptation, which gives a speaker-dependent, paragraph-based HMM, and from this the structure is calculated. Since the paragraph contains 221 phoneme instances, by referring to the CMU pronunciation dictionary, a 221 by 221 distance matrix is obtained for each speaker. This is the pronunciation skeleton, the accent skeleton.
But what we want to predict is the accent distance between two speakers, so the input features to the SVR should be differential features between two speakers, speaker S and speaker T. Here we used the difference matrix, a simple element-wise subtraction of the two structure matrices of S and T. In previous works we took the squared sum of these differences, that is, a Euclidean distance, but in this study we keep each element separate, and these elements are used as input features to the SVR. The number of dimensions is quite large, about 24,000, so one high-dimensional vector represents the accent characteristics. I think this is somewhat similar to a GMM supervector, where one high-dimensional vector represents speaker characteristics. These vectors are used as the input features to the SVR.
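A minimal, toy sketch of that regression setup: random placeholder structures and reference values stand in for the real data, and scikit-learn's SVR is assumed as the regression model; only the feature construction mirrors what is described above.

```python
# Train a regressor that maps the vectorised difference of two speakers'
# 221 x 221 structure matrices to that pair's reference accent distance.
import numpy as np
from sklearn.svm import SVR

N_PHONES = 221                           # phoneme instances in the paragraph (CMU dict)
IU = np.triu_indices(N_PHONES, k=1)      # upper triangle: roughly 24,000 dimensions

def pair_feature(S_s, S_t):
    """Element-wise difference of two structure matrices, vectorised."""
    return (S_s - S_t)[IU]

rng = np.random.default_rng(0)
structures = rng.normal(size=(12, N_PHONES, N_PHONES))   # 12 placeholder speakers
X, y = [], []
for s in range(12):
    for t in range(s + 1, 12):
        X.append(pair_feature(structures[s], structures[t]))
        y.append(rng.uniform())          # placeholder for the DTW/IPA reference distance
reg = SVR().fit(np.array(X), np.array(y))
print(reg.predict(np.array(X[:1])))      # predicted accent distance for one pair
```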
And then, as a baseline for comparison, there is the transcript-to-transcript distance. Two kinds of phoneme-based transcripts are used: one is the oracle transcripts, and the other is transcripts generated by a phoneme recognizer, whose accuracy is about 73.5 percent. DTW is applied between the phoneme transcripts of the two speakers, and the resulting distances correspond to a rough estimation of the accent distance.
Okay, so the results: two conditions and the results. We ran the prediction experiments in two modes: one is the speaker-pair open mode, and the other is the speaker open mode. What we want to do is to predict the accent distance between two speakers, so the unit of input to the SVR is a speaker pair. In the speaker-pair open mode, not a single speaker pair is found simultaneously in training and testing. In the speaker open mode, not a single speaker is found simultaneously in training and testing. So there are two modes. For the results of accent distance prediction, we did cross-validation, and the performance metric is the correlation to the IPA-based reference distances.
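A small sketch of how the two evaluation conditions could be constructed from a speaker list; the helper and the held-out choices here are hypothetical, only the two definitions above come from the talk.

```python
# Speaker-pair open: held-out pairs never appear in training, though their speakers may.
# Speaker open:      held-out speakers never appear in training at all.
from itertools import combinations
import random

def make_splits(speakers, held_out_speakers, held_out_pairs):
    pairs = list(combinations(speakers, 2))
    pair_open_train = [p for p in pairs if p not in held_out_pairs]
    pair_open_test = list(held_out_pairs)
    spk_open_train = [p for p in pairs if not (set(p) & held_out_speakers)]
    spk_open_test = [p for p in pairs if set(p) <= held_out_speakers]
    return (pair_open_train, pair_open_test), (spk_open_train, spk_open_test)

speakers = [f"spk{i:02d}" for i in range(10)]
random.seed(0)
held_speakers = set(random.sample(speakers, 3))
held_pairs = set(random.sample(list(combinations(speakers, 2)), 5))
pair_open, spk_open = make_splits(speakers, held_speakers, held_pairs)
print(len(pair_open[0]), len(pair_open[1]), len(spk_open[0]), len(spk_open[1]))
```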
This is the result. In the speaker-pair open mode, the correlation is very high; this is the scatter plot of the IPA-based reference distances against the predicted distances. But in the speaker open mode, the correlation is not so high; it is quite low. The oracle transcription, the phoneme-based one, gives a rough estimation of the accent distance, and you can see that the prediction in the speaker-pair open mode is higher than the oracle transcription. The speaker open mode is lower than that, but it is still higher than using the transcription generated by the ASR. So why is the speaker open mode so low?
If we consider the mechanism of this speaker-pair prediction, we can say that the magnitude of the task difficulty can be roughly estimated as being on the order of N in the speaker-pair open mode but on the order of N squared in the speaker open mode, where N is the number of speakers available. In other words, in the speaker-pair open mode the distance of a test pair can to some extent be estimated by a kind of averaging over the training pairs, while the speaker open mode is a much more complicated version of the task.
Okay, let me conclude this work. Summary: the ultimate goal of this study is to create a truly global, individual-basis map of World Englishes. For that, we have to develop a technique to estimate the accent distance between any pair of speakers. We used the Speech Accent Archive, and invariant speech structure analysis was used as the speech analysis method. The experiments showed that a high correlation was found in the speaker-pair open mode, but the speaker open mode was not so good, sorry.
Future directions: I think the structure vector fed to the SVR is somewhat similar to a GMM supervector, a high-dimensional vector that can characterize speaker identity. These days lots of researchers use i-vectors, so i-vector-based features might also be usable for this task. Some feature engineering is still needed, I think, and machine learning techniques should be exploited more. We are also interested in a more extensive collection of data using crowdsourcing. That's all, thank you.
[Question] Alright, your correlation, you're getting 0.19, 0.5, was that with the open speaker set? With several European speakers? [Answer] I used all the speakers available, Asian speakers and African speakers as well; I just selected the speakers who read the paragraph without inserting or deleting words. [Question] But my question is: this is still based on a perfectly read paragraph, right? A large number of studies have shown that when you look at accent, reading prepared text versus spontaneous or conversational speech, you get much more variation in conversational and non-read speech. Could you comment on whether you think this would carry over to each speaker's spontaneous speech?
Before coming here, I visited Helsinki University, which is why I skipped the first half of this conference: there is a research team there collecting spontaneous, natural non-native English. Some other research groups are also collecting spontaneous speech data from non-native speakers. That is kind of messy data, and when you analyze messy data you find unexpected things. This database, the Speech Accent Archive, is a very artificial, controlled dataset.
But what is possible with spontaneous speech and what is possible with controlled data? I think some things are possible with controlled data, and other things become possible with spontaneous data. So my proposal to those researchers is to collect controlled data and spontaneous data at the same time. For example, the Speech Accent Archive paragraph, "Please call Stella...", is collected from the speakers, and spontaneous data is also collected from those same speakers. Then accent clustering can be done by using the controlled data, and the clustering result can be used to explain what is happening in the non-native conversations.
So I think what is needed is to collect both kinds of data, controlled and spontaneous. I know that some researchers claim that the Speech Accent Archive is not really non-native data, that it is just an artificial collection of data, but I think that, from a technical point of view, that kind of dataset is very useful.