Hi, my name is Chau Luu, from the University of Edinburgh, CSTR,

and I'd like to talk about our paper,

"Dropping classes for deep speaker representation learning", at Speaker Odyssey 2020.

It's a shame that we can't all meet up in Tokyo for this, but I

hope everyone is safe at this time and enjoying the virtual conference instead.

This work proposes two techniques for training deep speaker embeddings.

Speaker embeddings are vector representations of an input utterance that characterize the speaker of that utterance,

and these have become crucial for many

speaker recognition related tasks, such as speaker verification and speaker diarization,

and feature heavily in state-of-the-art methods and pipelines for such tasks.

The historically successful i-vector technique, based on factor analysis, has largely been overtaken in popularity

in recent times by embeddings extracted from deep neural networks,

such as x-vectors,

and the general term for this kind of technique is deep speaker embeddings.

In both

x-vectors and i-vectors, previous work has shown that there are many sources of variability encoded in

the embeddings

which are not strictly related to the speaker,

and in the case of deep speaker embeddings specifically, there is a selection of works exploring

this phenomenon.

In this work by Williams and King,

they explore speaking style related to emotional prosody

encoded in x-vectors and i-vectors.

This other work by Pappagari et al. explored emotion, also in x-vectors.

This work by Raj et al. performed a formidable study into other encoded information, including channel information, transcription

information such as sentences, words and phones, along with other meta-information

such as utterance duration and augmentation type.

We can analyse these kinds of information regarding these embeddings, and so broadly

classify the sources of this variability.

The first category that we'd like to consider is speaker-related factors,

such as attributes related to the speaker identity.

Examples of these

things might be gender, accent, or age.

The second category is a reasonable grouping of non-speaker-related factors,

such as the environment and channel information of utterances,

for example the room or recording conditions.

It should be said that these categories are not strict or universal by any means,

and it's

possible to consider attributes in one category that may overlap with the other.

For example, the channel information indicative of a radio show may be very much linked with the identity

of the radio presenter.

OK, so how do these sources of variability link into the properties we would ideally want

from the speaker embeddings used for verification and diarization?

In the case of non-speaker-related information, we typically want the embeddings to be invariant to it.

For example, we would ideally like the same speaker to be mapped into the same

area of the embedding space regardless of the recording environment.

This goal is reflected in previous literature related to domain- and channel-invariant speaker

representations,

such as our work at ICASSP 2020 on channel-invariant embeddings using adversarial training,

in addition to this work at Naver Corporation on augmentation-invariant speaker embeddings.

As for speaker-related information, we would ideally like our embeddings

to capture this information in a discriminative fashion,

meaning all these sources of variability are in some way encoded.

We will use the term "speaker distribution" as an umbrella term to describe this

speaker-related information.

If we consider aspects of speaker variability as a whole,

such as those attributes of gender, accent or age,

the embedding space should ideally match the manifold of possible speakers.

In these tasks, speakers in the test set are

not typically seen at training time, or are held out,

and the concern here is that these unseen speakers may not follow the distribution seen at

training time.


So we ask whether such a distribution mismatch exists, and if it does,

is that a problem?

So how might we establish whether such a mismatch exists?

First, we should describe the overall structure of these speaker embedding extractors, such as an

x-vector network.

Deep speaker embeddings are typically extracted from an intermediate layer of a network trained on speaker classification.

In this classification task, by learning to discriminate between the training set of speakers, the

intermediate layer maps to a space that can be used to represent the speaker identity

of any given utterance.


If we put the unseen test utterances of some dataset,

or test speakers,

through the full classification network,

and we evaluate the mean softmax probabilities the network predicts,

we can get some kind of approximation of which training classes the network believes

it is seeing when given the unseen test speakers as input.

So for N

test utterances, we can calculate the softmax probability at the output of the whole network,

where h_i is the penultimate representation of the i-th input utterance

in the network,

before it is transformed into the class probabilities by the final affine weight matrix W and

the softmax operation.

And this is what this p-average term is:

an average of the softmax speaker class probabilities from some set of input utterances,

or some measure of the probability mass assigned to each training speaker by the network.
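
As a rough illustration of the p-average calculation just described, here is a minimal Python sketch. It assumes p-average is the mean over the N utterances of softmax(W h_i); the bias term is left out, and the function names and toy numbers are hypothetical, not taken from the talk.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def p_average(embeddings, weight):
    """Mean softmax output over a set of utterance representations.

    embeddings: list of N vectors h_i (each a list of floats).
    weight: final affine matrix W as a list of rows, one row per training class.
    The bias term is omitted (an assumption of this sketch).
    """
    totals = [0.0] * len(weight)
    for h in embeddings:
        logits = [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in weight]
        for c, p in enumerate(softmax(logits)):
            totals[c] += p
    return [t / len(embeddings) for t in totals]

# Toy example: 3 training classes, 2-dimensional representations.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
H = [[2.0, 0.1], [1.5, -0.2], [0.3, 0.2]]
p_avg = p_average(H, W)
```

Note that no test labels are needed: only the network outputs on the test utterances are averaged.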


If we examine this in the case of an x-vector system trained on VoxCeleb 2,

and the test set of VoxCeleb 1, which is 40 unique speakers with 42 utterances each,

we can put them through the network and calculate p-average, and we also did

the same with a set of 40 unique speakers, well, 40 speakers, but from the

training set, also with 42 utterances each.

This figure displays the 5994 training classes,

ranked according to the p-average value produced.

As we can see, the network surprisingly appears to be much more confident on

test speakers than on training speakers,

and more specifically it seems to predict a small number of speakers with much higher probability than

the rest.


The red line for the train speakers here might look like a flat line, but if we

move to a log plot instead, on the next slide,

we can see that this is not the case. And as there are many more

combinations of training speakers

when choosing 40 out of 5994,

we sampled these 300 times

to provide a robust mean for the red line.
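
The repeated sampling of training speakers could be sketched as follows. This is only an illustration under assumptions: the talk does not say how the 300 curves were averaged, so ranking each curve before averaging rank by rank is a guess, and `mean_ranked_curve` and the toy curves are hypothetical.

```python
import random

def mean_ranked_curve(curves):
    """Sort each curve in descending order, then average rank by rank."""
    ranked = [sorted(curve, reverse=True) for curve in curves]
    return [sum(values) / len(values) for values in zip(*ranked)]

rng = random.Random(0)
NUM_TRAIN_SPEAKERS, SUBSET_SIZE, NUM_DRAWS = 5994, 40, 300

# 300 random draws of 40 distinct training speaker ids.
speaker_subsets = [
    rng.sample(range(NUM_TRAIN_SPEAKERS), SUBSET_SIZE) for _ in range(NUM_DRAWS)
]

# In practice each subset would yield one p-average curve; two toy curves
# stand in for them here.
toy_curves = [[0.1, 0.7, 0.2], [0.5, 0.2, 0.3]]
robust_mean = mean_ranked_curve(toy_curves)
```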

For context, this is a fairly competitive model,

scoring 3.04% equal error rate on the VoxCeleb 1 test set

using cosine similarity scoring.

Why might this be the case?

Why does this network appear to predict some training speakers with such confidence from the

test speakers?

This isn't immediately clear, as the VoxCeleb 1 test set has a good balance of

male and female speakers

and was chosen because the names began with the letter E,

nothing particularly specific.

Is this mismatch even a bad thing?

If we consider each of the training classes as a general template for a type of speaker,

instead of a specific identity, the predicted types of speaker for this test set would indicate

somewhat of a class imbalance.

The negative effects that class imbalance problems can cause are fairly well documented in classification tasks,

and these issues also extend

to representation learning too, with this work

proposing cost-sensitive training as a means to mitigate this.

Returning to our work: this work proposes two methods that address these issues,

one in the form of robustness to different speaker distributions,

and the other in the form of

adaptation to a different speaker distribution.

Both of these methods work by dropping classes from the output layer.

The first method that we propose

we call DropClass, and it works by periodically dropping D classes from the

training data and the output of the classification layer.

So every T training batches,

we remove D classes from the dataset,

such that during the next T training batches these classes are not seen.

We also remove the corresponding rows from the final affine weight matrix.

In combination with an angular penalty softmax,

this continually changes the decision boundary distinguishing between the subsets of classes.
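
The DropClass schedule described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the helper `drop_classes`, the random selection of classes, and the values of T and D are assumptions, and the actual training step (including the angular penalty softmax) is only indicated by a comment.

```python
import random

def drop_classes(labels, weight, num_drop, rng):
    """Pick num_drop classes at random and return the reduced task.

    Returns (kept_examples, reduced_weight, label_map), where label_map
    sends each surviving original class id to its row in reduced_weight.
    """
    num_classes = len(weight)
    dropped = set(rng.sample(range(num_classes), num_drop))
    kept_classes = [c for c in range(num_classes) if c not in dropped]
    label_map = {c: i for i, c in enumerate(kept_classes)}
    kept_examples = [i for i, y in enumerate(labels) if y not in dropped]
    reduced_weight = [weight[c] for c in kept_classes]
    return kept_examples, reduced_weight, label_map

rng = random.Random(0)
labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]        # toy dataset: class id per utterance
weight = [[float(c), 1.0] for c in range(5)]   # toy W: one row per class
T, D, total_iterations = 3, 2, 9               # hypothetical values

history = []
for iteration in range(total_iterations):
    if iteration % T == 0:
        # Re-draw from the full class set each time, so dropped classes return.
        kept_examples, reduced_weight, label_map = drop_classes(labels, weight, D, rng)
    # A real training step would draw a batch from kept_examples and apply
    # the classification loss against reduced_weight here.
    history.append(len(reduced_weight))
```

Because the reduced matrix holds references to rows of the full matrix, updates made during a period would carry over when the dropped classes return.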


This could be seen essentially as a form of dropout on the classification layer,

also synchronised with

the data provided.

Dropout has been theoretically justified in the past as continually sampling from many thinned networks

and then averaging at test time, resulting in a more robust model. And in this

respect, we think that this method is sampling from many different subset classification tasks.

Here's a slightly more detailed diagram of how this method works,

and we can see that

the embedding generator is essentially trained on multiple different subset classification tasks during training.

OK, here is the second method we propose, which we call DropAdapt, which

is a means of adapting to an unlabeled

test set in a somewhat unsupervised fashion.

From the fully trained model, we calculate this p-average value for the test utterances,

with no need for the test set labels.

We isolate the low-probability classes, the lowest p-average classes,

and drop those

from the output layer and the dataset.

Alternatively, we can combine those low-probability classes into a single new class.

We then fine-tune on the remaining training classes and then repeat this process iteratively,

calculating p-average again and further dropping classes.
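
One round of the class selection in DropAdapt might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, the p-average values are made up, and fine-tuning between rounds is not shown. The `combine` flag switches between discarding the low-probability classes and merging them into a single new class.

```python
def lowest_probability_classes(p_avg, num_drop):
    """Return the ids of the num_drop classes with the lowest p-average."""
    order = sorted(range(len(p_avg)), key=lambda c: p_avg[c])
    return set(order[:num_drop])

def relabel(labels, dropped, combine):
    """Build the training labels for the next fine-tuning round.

    combine=False: utterances of dropped classes are discarded.
    combine=True: they are all assigned to one new merged class.
    """
    kept = sorted(set(labels) - dropped)
    label_map = {c: i for i, c in enumerate(kept)}
    merged_id = len(kept)
    new_labels = []
    for y in labels:
        if y in label_map:
            new_labels.append(label_map[y])
        elif combine:
            new_labels.append(merged_id)
    return new_labels

# Toy example: 5 training classes, p-average computed on unlabeled test data.
p_avg = [0.40, 0.05, 0.30, 0.02, 0.23]
labels = [0, 1, 2, 3, 4, 0, 3]
dropped = lowest_probability_classes(p_avg, num_drop=2)
labels_discard = relabel(labels, dropped, combine=False)
labels_combine = relabel(labels, dropped, combine=True)
```

In the full procedure this selection would be repeated after each fine-tuning round, with p-average recomputed from the updated network.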


We think this can be viewed as a sort of oversampling of the classes

which produce high p-average values

over the lifetime of the network's training,

classes which we believe are in some way,

due to the high p-average, more relevant

to the test utterances,

or more closely matched to the test speaker distribution.

We'll go quickly over the basic experimental setup. The primary datasets used are speaker verification

datasets, namely VoxCeleb and Speakers in the Wild.

Models were trained using VoxCeleb 2 only, utilizing the Kaldi augmentation,

and if you're interested, our experimental code can be found on GitHub.

For the DropClass experiments, we explored varying the number of iterations to wait before dropping,

T, and the number of classes to drop, D.

For DropAdapt, we varied the exact method: discarding the classes entirely, or combining

them into a single new class,

or dropping only the data and continuing to use the full final affine weight matrix.

And finally, there were two control experiments: training the baseline for longer without dropping any classes, to

eliminate the advantage of the extra training iterations that DropAdapt has,

and the other control experiment was to drop random classes and ignore the p-average values.

Here we have the results of the DropClass experiments.

On the left here is an exploration into varying the number of iterations to wait, T.

As we can see, for both VoxCeleb and the dev portion of Speakers in

the Wild, dropping 5000 classes, for configurations of T less than 750,

appears to improve performance over the baseline.

On the right, choosing the best configuration on the left, of 250

for T, we varied the number of classes dropped every 250 iterations,

and as we can see, pretty much all configurations outperform the baseline.


And now we have the DropAdapt results.

For a budget of 30,000 additional iterations, remembering that this method starts with a

fully trained model,

we performed six rounds of dropping classes, with the configuration

dropping D classes for 5000 iterations each.

As we can see, for both VoxCeleb and Speakers in the Wild, running further iterations

did not improve performance over the baseline,

and

dropping random classes, which ignores the p-average values, as a second control,

reassuringly also didn't improve performance.

But for the methods dropping the data with low p-average values, such as DropAdapt,

DropAdapt-Combine, and dropping only the data,

performance improved over the baseline, suggesting that this oversampling of high p-average

classes over the lifetime of the training of the network can help improve performance on

a specific test set,

with sometimes dropping the classes from the final affine weight matrix improving things.


We also performed an extension experiment to examine what's going on during this DropAdapt fine-tuning.

Here we plot the KL divergence between the p-average distribution and the uniform distribution

at each round of dropping classes, for DropAdapt-Combine on VoxCeleb,

alongside the equal error rate.

The KL divergence is some measure of how close that p-average distribution

is to being more uniform and less imbalanced.
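
For reference, the KL divergence from a uniform distribution can be computed as in the short sketch below. The direction KL(p || u) is an assumption, since the talk does not state which way the divergence was taken; the function name and example distributions are hypothetical.

```python
import math

def kl_from_uniform(p):
    """KL(p || u) in nats, where u is uniform over len(p) classes.

    Uses the identity p_c * log(p_c / (1/n)) = p_c * log(p_c * n);
    zero-probability classes contribute nothing to the sum.
    """
    n = len(p)
    return sum(p_c * math.log(p_c * n) for p_c in p if p_c > 0.0)

uniform_case = kl_from_uniform([0.25, 0.25, 0.25, 0.25])  # perfectly balanced
skewed_case = kl_from_uniform([0.7, 0.1, 0.1, 0.1])       # imbalanced
one_hot_case = kl_from_uniform([1.0, 0.0, 0.0, 0.0])      # all mass on one class
```

The value is zero for a perfectly uniform p-average and grows as probability mass concentrates on fewer classes, up to log(n) when a single class receives all of it.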

Here we can see that as the KL divergence drops,

the

verification performance increases, suggesting that this distribution-matching metric may be somewhat of a good indication of

how well matched the network is to a certain test set.

But clearly this doesn't hold in general, as a network which predicts every class uniformly

all the time will clearly not be effective,

and there are clearly limits, such as a minimum number of test speakers, for this observation

to hold, in that

we need some good sample of the overall speaker distribution of some test set.


So, our conclusions: DropClass can improve verification performance of the extracted embeddings,


and we suggest this has a similar effect to dropout.

Similarly, DropAdapt can improve verification performance, with oversampling of the high

p-average classes appearing to be the key step.

We also see that as the p-average distribution over training classes

for a set of test speakers becomes more uniform, verification performance increases too.

Thanks for listening. I hope you enjoy the rest of the virtual conference, and stay safe.
