Hello, my name is Raghuveer.
I am a PhD student in the Signal Analysis and Interpretation Laboratory at the University of Southern California, Los Angeles. Today I will be presenting our work titled "An empirical analysis of information encoded in disentangled neural speaker representations." Here are the people who have collaborated with me on this work.
So, first, I'll introduce what I refer to as speaker embeddings in the rest of the talk. Speaker embeddings are low-dimensional representations that are discriminative of speaker identity. These have several applications, such as voice biometrics, where the task is to verify a person's identity from their speech, and they have applications in speaker-adapted ASR models. They can also be used in speaker diarization, where the task is to determine who spoke when in multiparty conversations. This can be of particular use in meeting analysis and many other applications.
Good speaker embeddings should satisfy two properties: first, they should be discriminative of speaker factors; second, they should be invariant to other factors.
So, what are the factors of information that could be encoded in a speaker embedding? For ease of analysis, we broadly categorize them as follows. First are the speaker factors; these are related to the speaker's identity, for example gender, age, et cetera. Next are the content factors; these are acquired during speech production by the speaker, for example the emotional state expressed in the speech signal, the sentiment (whether it is a positive or a negative one), the language being spoken, and, most importantly, the lexical content in the signal. Last are the channel factors; these are acquired when the signal is captured by the microphone. They could be the room acoustics, the microphone nonlinearities, ambient acoustic noise, and also the artifacts related to compression of the signal.
As I mentioned previously, good speaker embeddings are supposed to be invariant to nuisance factors; these are the factors that are unrelated to the speaker's identity. Such embeddings are useful for robust speaker recognition in the presence of background acoustic noise. They are also useful for detecting a speaker's identity irrespective of the emotional state of the speaker, and independent of what the speaker says. This is particularly useful in text-independent speaker verification applications.
With that as the motivation, the goals of our work are twofold. First, we quantify the amount of nuisance information in speaker embeddings. Second, we investigate to what extent unsupervised learning can help to remove that nuisance information.
Most existing studies perform analyses based on only one or two datasets, so a comprehensive analysis is lacking. Also, most of these works do not consider the dependencies between the individual variables in the datasets. For example, in one dataset we analyzed, the lexical content and the speaker identity are entangled, because some sentences are spoken by only a subset of the speakers; therefore, it could be possible to predict the speakers based on the lexical content alone. We aim to mitigate these limitations of previous work by making the following contributions.
Firstly, we use multiple datasets to comprehensively analyze the information encoded in neural speaker representations. Secondly, we analyze the effect of disentangling speaker factors from nuisance factors on the encoded information. Let me briefly explain what we mean by disentanglement in the rest of this talk: we define disentanglement broadly as the task of separating out information streams from an input signal.
As a toy example, consider an input speech signal from a person who is happy that they have just bought something new. It contains information related to various factors: information about the person's identity, including their gender and age; information pertaining to their emotional state; and, more importantly, the language identity and the lexical content are also present in the signal. The goal of a disentangled embedding extractor is to separate all of these information streams. In the context of speaker embeddings, which are supposed to capture speaker identity information, all other factors, such as the emotional state and the lexical content, are considered nuisance factors. It is these factors that we propose to remove from the speaker embeddings, to make them more robust.
Now I'll explain the methodology behind the disentangled speaker embedding extraction. This is the model we use. As input, we can use any speech representation, such as a spectrogram, or even speaker embeddings from pre-trained models, such as x-vectors. Using unsupervised disentanglement, adapted from a method that was previously proposed in the computer vision domain, we try to separate the speaker-related information from the nuisance information. Please note that this method was proposed in our earlier work, and you can find more details in that paper; however, for completeness, I'll explain it here briefly.
The architecture comprises two modules: the main module, which is shown in the green blocks here, and the adversarial module, shown in blue. The input is first processed by an encoder, which splits it into two embeddings, h1 and h2, as shown in the figure. The embedding h1 is fed into the predictor, which predicts the speaker labels. The embedding h2 is concatenated with a noisy version of h1, which is denoted by h1' here. h1' is obtained by feeding h1 to a dropout module, which randomly removes certain elements of h1. Then h2, along with the noisy h1 (that is, h1'), is concatenated and fed into a decoder, which tries to reconstruct the original input x. The motivation behind using the dropout is to make sure that h1 is an unreliable source of information for the reconstruction task. Training in this manner makes sure that the nuisance information required for reconstruction is not stored in h1, and that only the information required to predict speakers is stored there. In addition, we also use two disentangler models. These models are trained adversarially, so that they perform poorly in predicting h1 from h2, and h2 from h1. The goal of these models is to ensure that h1 and h2 are not predictable from each other, which makes sure that they do not contain similar information. This way, we can enforce disentanglement between the two embeddings.
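To make this architecture concrete, here is a minimal PyTorch sketch of such a two-branch model. The layer types and sizes (simple feed-forward blocks, 128-dimensional h1 and h2) are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of a UAI-style disentanglement model (illustrative;
# not the exact architecture or sizes used in this work).
import torch
import torch.nn as nn

class DisentangledEmbedder(nn.Module):
    def __init__(self, input_dim=512, spk_dim=128, nui_dim=128,
                 num_speakers=7200, dropout_p=0.5):
        super().__init__()
        # Encoder splits the input into two embeddings: h1 (speaker), h2 (nuisance).
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                     nn.Linear(512, spk_dim + nui_dim))
        self.spk_dim = spk_dim
        # Predictor maps h1 to speaker posteriors.
        self.predictor = nn.Linear(spk_dim, num_speakers)
        # Dropout makes h1 an unreliable source for reconstruction (h1').
        self.dropout = nn.Dropout(dropout_p)
        # Decoder reconstructs the input from the concatenation [h1', h2].
        self.decoder = nn.Sequential(nn.Linear(spk_dim + nui_dim, 512), nn.ReLU(),
                                     nn.Linear(512, input_dim))
        # Disentanglers try to predict one embedding from the other.
        self.dis_1to2 = nn.Linear(spk_dim, nui_dim)   # h1 -> h2
        self.dis_2to1 = nn.Linear(nui_dim, spk_dim)   # h2 -> h1

    def forward(self, x):
        h = self.encoder(x)
        h1, h2 = h[:, :self.spk_dim], h[:, self.spk_dim:]
        logits = self.predictor(h1)                   # speaker prediction
        h1_prime = self.dropout(h1)                   # noisy copy of h1
        x_hat = self.decoder(torch.cat([h1_prime, h2], dim=1))
        return h1, h2, logits, x_hat
```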
The loss functions that we use are presented here. The main module produces two losses: one is the standard cross-entropy loss from the predictor, which predicts the speakers, and the second is the mean squared error reconstruction loss from the decoder. The adversarial module uses a mean squared error loss. The overall loss function is shown here: we try to minimize the loss with respect to the main module, while adversarially maximizing the disentanglement losses.
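As a rough illustration of this min-max objective, one common recipe alternates updates between the main module and the disentanglers; the loss weights and the alternating schedule below are assumptions for illustration, building on the model sketch above.

```python
# Sketch of one alternating training step for the min-max objective
# (loss weights alpha/beta/gamma and the schedule are illustrative).
# opt_main covers the encoder/predictor/decoder parameters;
# opt_adv covers the two disentanglers.
import torch.nn.functional as F

def training_step(model, x, speaker_labels, opt_main, opt_adv,
                  alpha=1.0, beta=1.0, gamma=1.0):
    h1, h2, logits, x_hat = model(x)

    # Main-module losses: speaker cross-entropy + MSE reconstruction.
    ce = F.cross_entropy(logits, speaker_labels)
    recon = F.mse_loss(x_hat, x)
    # Disentanglement losses: MSE of predicting one embedding from the other.
    dis = F.mse_loss(model.dis_1to2(h1), h2) + F.mse_loss(model.dis_2to1(h2), h1)

    # Step 1: update the main module to minimize CE + reconstruction while
    # *maximizing* the disentanglers' loss (note the minus sign).
    opt_main.zero_grad()
    (alpha * ce + beta * recon - gamma * dis).backward()
    opt_main.step()

    # Step 2: update the disentanglers to minimize their prediction loss.
    h1, h2, _, _ = model(x)
    dis = F.mse_loss(model.dis_1to2(h1), h2) + F.mse_loss(model.dis_2to1(h2), h1)
    opt_adv.zero_grad()
    dis.backward()
    opt_adv.step()
```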
This training procedure sets our work apart from previous work, which, as I mentioned before, applied this technique to a digit recognition task. Upon successful training, the embedding h1 is expected to capture the speaker-discriminative information, and the embedding h2 is expected to capture the nuisance information. Notice that we have not used any labels of the nuisance factors, such as noise type, channel conditions, et cetera.
For training the models, we use the standard VoxCeleb training corpus, which consists of interviews with celebrities. We add additive noise and reverberation, which is standard practice in data augmentation; this results in 2.4 million utterances from around 7,200 speakers.
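As an aside, here is a minimal NumPy/SciPy sketch of this style of augmentation. Real recipes typically draw noises and room impulse responses from dedicated corpora; the SNR handling below is an illustrative assumption, not the exact recipe used here.

```python
# Sketch of additive-noise and reverberation augmentation (illustrative;
# real recipes typically use corpora such as MUSAN noises and measured RIRs).
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the length."""
    return fftconvolve(speech, rir, mode="full")[: len(speech)]
```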
As mentioned before, we could use either spectrograms as inputs, or speaker embeddings from pre-trained models; the latter is what we do in this work. So we use x-vectors, extracted from a publicly available pre-trained model, as input.
X-vectors, as most of you already know, are speaker embeddings extracted from a hidden layer of a deep neural network that is trained to classify speakers on a large dataset artificially augmented with noise and reverberation. This model has been shown to provide state-of-the-art performance on multiple tasks that require speaker-discriminative embeddings.
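As a side note for those who want to experiment, pretrained x-vector extractors are publicly available; for example, SpeechBrain distributes one trained on VoxCeleb. This is just one option, not necessarily the model used in this work, and the audio file name is a placeholder.

```python
# One way to obtain x-vector speaker embeddings today, using SpeechBrain's
# pretrained VoxCeleb model (not necessarily the model used in this work).
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb")
signal, fs = torchaudio.load("utterance.wav")   # placeholder; 16 kHz mono expected
embedding = classifier.encode_batch(signal)     # x-vector-style embedding
```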
We use multiple datasets in our evaluations, as mentioned here. By evaluating some factors, for example emotion, on more than one dataset, we can also control for the issue of dataset bias creeping into the analysis. Following others in the literature, we make the assumption that better classification performance for a factor, using the speaker embeddings, implies that more information is present in the embedding with respect to that factor.
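In other words, information is measured with probing classifiers. A minimal sketch of such a probe, assuming the embeddings and factor labels are already available as arrays, could look like this:

```python
# Probing classifier: higher held-out accuracy is read as more information
# about the factor being present in the embeddings (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def probe(embeddings, factor_labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, factor_labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```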
As a baseline, we use the x-vector speaker embeddings, since our model accepts them as input; we can consider our speaker embeddings as a refinement of x-vectors, where speaker-discriminative information is retained and nuisance factors are removed. We also reduce the dimension of the x-vectors using PCA, to match the dimension of the embeddings from our models.
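The dimensionality matching itself is straightforward; for example, with scikit-learn (the sizes below are illustrative assumptions, and the random array stands in for real x-vectors):

```python
# Reduce x-vectors to the dimensionality of the disentangled embeddings,
# so that any difference is not simply due to a tighter bottleneck (sketch).
import numpy as np
from sklearn.decomposition import PCA

xvectors = np.random.randn(1000, 512)   # stand-in for real x-vectors [N, 512]
pca = PCA(n_components=128)             # match the disentangled embedding size
xvec_reduced = pca.fit_transform(xvectors)
```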
Now, on to the results. The first set of results shows the accuracy of predicting the speaker factors, using x-vectors, shown in blue, and using our embeddings; in this case, higher is better. The first two graphs here show speaker classification accuracy, and the other two show gender prediction accuracy. We find that, in general, both x-vectors and our embeddings perform comparably well in predicting speakers and genders. We see a slight degradation when using our embeddings; however, the differences are minimal. One other observation is that on IEMOCAP, the performance of both x-vectors and our model degrades. We conjecture that this degradation could be due to speaker overlap, and also that this dataset is not ideally suited for the speaker recognition task, since its purpose was emotion recognition.
Now for the more interesting results. Here, I show the results of predicting the content factors using x-vectors and our speaker embeddings. In this case, since these are nuisance factors, lower is better. We find that in all the cases, our model reduces the nuisance information; in particular, emotion and lexical information are reduced to a greater extent. Here, the lexical accuracy is the accuracy of predicting the sentence spoken, given the speaker embedding of that sentence. Apart from the emotion and lexical content, we also see a reduction in information pertaining to sentiment, which is closely related to emotion, and also to language.
On this slide, I report the results of predicting the channel factors using x-vectors and our speaker embeddings. Again, in this case, lower is better. In particular, we focus on three factors: the room, the microphone distance (or the microphone location), and the noise type. We find that, in predicting the location of the microphone used and the type of noise present, x-vectors have a much higher accuracy than our embeddings. This means that we are able to successfully remove a lot of this nuisance information from the x-vectors. However, we notice that in predicting the room in which the recording was made, both embeddings show similar performance, suggesting that the disentanglement is not very effective for this factor; this needs further investigation.
Next, we show the results of a task-level evaluation, where we evaluate the models on the speaker verification task. We compare the detection error tradeoff (DET) curves, where the false positive rate and the false negative rate are plotted on a normal deviate scale; the closer a model's curve gets to the origin, the better the model.
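For reference, a DET curve of this kind can be computed from trial labels and scores along these lines (a sketch with scikit-learn and SciPy; the trial data here are random stand-ins):

```python
# Sketch: plot a DET curve (FPR vs. FNR on a normal-deviate scale)
# from speaker verification trial labels and scores.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.metrics import det_curve

# Stand-ins for real verification trials: 1 = same speaker, 0 = different.
trial_labels = np.random.randint(0, 2, size=1000)
trial_scores = np.random.randn(1000) + trial_labels

fpr, fnr, _ = det_curve(trial_labels, trial_scores)
fpr = np.clip(fpr, 1e-4, 1 - 1e-4)       # avoid infinities at the endpoints
fnr = np.clip(fnr, 1e-4, 1 - 1e-4)
plt.plot(norm.ppf(fpr), norm.ppf(fnr))   # probit (normal-deviate) axes
plt.xlabel("False positive rate (normal deviate)")
plt.ylabel("False negative rate (normal deviate)")
plt.show()
```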
The black dotted lines show the x-vector model, and all the other lines show our models, with and without LDA-based dimensionality reduction. We found statistically significant differences only in the graphs marked here. Most notably, in the challenging scenarios with babble and television noise in the background, all our models perform better than x-vectors. Also, in the distant-microphone condition, our models perform significantly better than x-vectors. We also found that the model trained with augmented data performs slightly better than the model trained without augmentation; this actually conforms with what we expected.
Finally, I'd like to quickly present a discussion based on our experiments, which will hopefully provide useful pointers for future research in this domain. First, we find that speaker embeddings capture a variety of information pertaining to nuisance factors, and this can sometimes be detrimental to robustness. We also found that just introducing a bottleneck on the dimension of the speaker embeddings, by using PCA, does not remove this information; this points to the need for explicitly modeling the nuisance factors. Using the unsupervised adversarial invariance technique, which is the technique used in our model, we can reduce the nuisance information in the speaker embeddings; an added advantage is that labels of the nuisance factors are not required for this method. We also found that the disentanglement retains gender information. This suggests that speaker gender, as captured by neural embeddings, is a crucial part of identity, which is quite intuitive from a human perception point of view: essentially, it shows that the notion of speaker identity captured by these embeddings is consistent with human perception. Finally, the disentangled speaker representations showed better verification performance in the presence of noisy conditions, particularly babble and television noise, which are considered very challenging for this task.
Going forward, we would like to explore methods to further improve the disentanglement. So far, as I mentioned, we have not used any nuisance labels; we would like to see if, by using whatever labeled data is available, we can achieve better disentanglement.
That brings me to the end of my presentation. Finally, I would like to acknowledge the sources of support for this work. Please feel free to reach out to me with any questions or suggestions you might have. Thank you.