Hi, my name is Chau Luu, from the University of Edinburgh CSTR, and I'm here to talk about our paper, "Dropping Classes for Deep Speaker Representation Learning", at Speaker Odyssey 2020.
It's a shame that we can't meet up and talk about this in person, but I hope everyone is staying safe and enjoying the virtual conference instead.
This work proposes two techniques for training deep speaker embeddings.
Speaker embeddings are vector representations of an input utterance that characterise the speaker of that utterance, and they have become crucial for many speaker-recognition-related tasks such as speaker verification and speaker diarization, featuring heavily in state-of-the-art methods and pipelines for such tasks.
The historically successful i-vector technique, based on factor analysis, has largely been overtaken in popularity in recent times by embeddings extracted from deep neural networks, such as x-vectors.
The term "deep speaker embeddings" is generally used for these neural techniques.
For both x-vectors and i-vectors, previous work has shown that there are many sources of variability encoded into the embeddings which are not strictly related to the speaker, and in the case of deep speaker embeddings specifically, there is a selection of works exploring this phenomenon.
In this work by Williams and King, they explore speaking style related to emotional prosody encoded in x-vectors and i-vectors.
Another work explores emotion encoded in x-vectors.
And this work by Raj et al. performed a more formal study into the encoded information, including channel information, transcription information such as sentences, words and phones, along with other metadata such as utterance duration and utterance type.
We can analyse these kinds of encoded information and broadly classify the sources of this variability.
The first category that we'd like to consider is speaker-related factors, such as attributes related to the speaker's identity; these might be gender, accent or age.
The second is a reasonable grouping of non-speaker-related factors, such as the environment and channel information of the utterances, for example the room or recording conditions.
It should be said that these categories are not strict or universal by any means, and it's possible to consider attributes in one category that may overlap with the other; for example, the channel information indicative of a radio show may be very much entangled with the identity of the radio presenter.
Okay, so how do these sources of variability link to the properties we would ideally want from speaker embeddings used for verification and diarization?
In the case of non-speaker-related information, we typically want the embeddings to be invariant; for example, we would ideally like the same speaker to be mapped to the same area of the embedding space regardless of the recording environment.
This goal is reflected in previous work relating to domain- and channel-invariant speaker representations, such as this work at ICASSP 2020 on channel-invariant embeddings using adversarial training, in addition to this work on augmentation-invariant speaker embeddings.
As for speaker-related information, we would ideally like the embeddings to capture this information in a somewhat indiscriminate fashion, meaning all these sources of variability are in some way encoded, and we will use the term "speaker distribution" as a sort of umbrella term to describe this speaker-related information.
If we consider the aspects of speaker variability as a whole, such as those attributes of gender, accent or age, the embedding space should ideally match the manifold of possible speakers.
In these tasks, the speakers in the test set are typically not seen at training time, or are held out, and the concern here is that these unseen speakers may not follow the distribution seen at training time.
We ask ourselves whether such a distribution mismatch exists and, if it does, whether it is a problem.
So how might we establish whether such a mismatch exists?
First, we should describe the overall structure of these speaker embedding extractors, such as the x-vector network.
Deep speaker embeddings are typically extracted from an intermediate layer of a network trained on a speaker classification task.
In this classification task, by learning to discriminate between the training set of speakers, the intermediate layer maps to a space that can be used to represent the speaker identity of any given utterance.
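As a toy illustration of this structure (an assumed minimal stand-in rather than a real x-vector architecture), the embedding is simply the activation of a chosen intermediate layer of a speaker classification network:

    import torch
    import torch.nn as nn

    class ToySpeakerNet(nn.Module):
        # Minimal stand-in for a speaker embedding extractor: frame-level encoder,
        # average pooling over time, an embedding layer, and a final affine layer
        # (the weight matrix W) over the training speakers.
        def __init__(self, feat_dim=30, emb_dim=512, num_speakers=5994):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                         nn.Linear(512, 512), nn.ReLU())
            self.embedding = nn.Linear(512, emb_dim)                        # intermediate embedding layer
            self.classifier = nn.Linear(emb_dim, num_speakers, bias=False)  # final affine weights W

        def forward(self, feats):                     # feats: (batch, frames, feat_dim)
            pooled = self.encoder(feats).mean(dim=1)  # simple average pooling over time
            return self.embedding(pooled)             # h_i: used as the speaker embedding

        def logits(self, emb):
            return self.classifier(emb)               # class scores over the training speakers

At test time only the forward pass up to the embedding is kept; the classification head exists purely to provide the training objective.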
Now, if we put the unseen test utterances of some dataset, or test speakers, through the full classification network, and we evaluate the mean softmax probabilities the network predicts, we can get some kind of approximation of which training classes the network believes it is seeing when given the unseen test speakers as input.
So for N test utterances, we can calculate the softmax probabilities at the output of the whole network, where h_i is the penultimate representation of the i-th input utterance, before it is transformed into class probabilities by the final affine weight matrix W and the softmax operation.
This is what this P_average term is: an average of the softmax speaker class probabilities over some set of input utterances, or in other words a measure of the probability mass assigned to each training speaker by the network.
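Written out, under my reconstruction of the notation described above (the paper's exact symbols may differ), this is:

    P_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \operatorname{softmax}(W h_i)

where h_i is the penultimate-layer representation of the i-th test utterance, and each component of the resulting vector is the average probability mass assigned to one of the training speakers.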
So, if we examine the case of an x-vector system trained on VoxCeleb 2: the test set of VoxCeleb 1 has 40 unique speakers with 42 utterances each, and we can put them through the network and calculate this P_average.
We also did the same with a set of 40 unique speakers from the training set, also with 42 utterances each.
This figure displays the 5,994 training classes ranked according to the P_average value produced.
As we can see, the network, surprisingly, appears to be much more confident on test speakers than on training speakers, and more specifically it seems to predict a small number of training speakers with much higher probability than the rest.
You might think the red line for the training speakers here is simply a flat line, but if we zoom in on that part on the next slide, we can see that this is not the case, as there is much more variation across the training speakers.
Since we are using 40 out of the 5,994 training speakers, we sampled these 300 times to provide error bars around this mean red line.
For context, this is a fairly competitive model, scoring 3.04% equal error rate on the VoxCeleb test set using cosine similarity scoring.
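For anyone unfamiliar, cosine similarity scoring here just means comparing a pair of embeddings by their cosine similarity and thresholding the score; a minimal sketch, with an arbitrary threshold value:

    import torch.nn.functional as F

    def verify(emb_a, emb_b, threshold=0.5):
        # Cosine similarity between two speaker embeddings; accept the trial if above the threshold.
        score = F.cosine_similarity(emb_a, emb_b, dim=-1)
        return score, score > threshold

The equal error rate is then the error rate at the threshold where the false acceptance and false rejection rates are equal.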
Why might this be the case? Why does this network appear to predict some training speakers with such confidence when given test speakers?
This isn't immediately clear, as the VoxCeleb test set has a good balance of male and female speakers, and was chosen simply as the speakers whose names begin with the letter "E", with no particular specific criteria.
Is this mismatch even a bad thing?
If we consider each of the training speakers as a general template or type of speaker, instead of a specific identity, the predicted types of speaker produced on the test set would indicate somewhat of a class imbalance.
The negative effects that class imbalance problems can cause are fairly well documented in classification tasks, and these issues also extend to representation learning, with this work proposing cost-sensitive training of deep representations as a means to mitigate it.
To that end, this work proposes two methods to study these issues: one in the form of robustness to different speaker distributions, and the other in the form of adaptation to a different speaker distribution.
Both of these methods work by periodically dropping classes from the output layer.
The first method that we propose is called DropClass, and it works by periodically dropping D classes from the dataset and from the classification layer.
So every T training batches, we remove D classes from the dataset, such that during the next T training batches these classes are not seen, and we also remove the corresponding rows of the final affine weight matrix.
In combination with an angular penalty softmax, this continually changes the decision boundaries distinguishing between the subsets of classes.
This acts essentially as a form of dropout on the classification layer, synchronised with the data provided.
Dropout has been theoretically justified in the past as continually sampling from many thinned networks and then averaging at test time, resulting in a more robust model; in a similar fashion, we can think of this method as sampling from many different subsets of classification tasks.
Here is a more detailed diagram of how this method works, and we can see that the embedding generator is essentially trained on multiple different subset classification tasks throughout training.
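To make the mechanics concrete, here is a minimal PyTorch-style sketch of the idea rather than the exact implementation from our repository; the sample_batch helper, the use of plain cross-entropy, and the hyperparameter values are illustrative assumptions, and model(feats) is assumed to return the embedding while classifier_weight is the final affine weight matrix (e.g. model.classifier.weight in the toy network sketched earlier).

    import torch
    import torch.nn.functional as F

    def train_dropclass(model, classifier_weight, dataset, optimizer,
                        num_classes=5994, drop_d=2500, period_t=250, total_iters=50000):
        # DropClass sketch: every period_t batches, drop drop_d random classes
        # from both the data sampler and the final affine weight matrix.
        active = torch.ones(num_classes, dtype=torch.bool)
        for it in range(total_iters):
            if it % period_t == 0:
                # pick a fresh random subset of classes to drop for the next period_t batches
                active = torch.ones(num_classes, dtype=torch.bool)
                active[torch.randperm(num_classes)[:drop_d]] = False
            # assumed helper: yields a batch containing only utterances from active classes
            feats, labels = dataset.sample_batch(allowed_classes=active)
            emb = model(feats)                              # penultimate embedding h_i
            logits = emb @ classifier_weight[active].t()    # keep only the rows of W for active classes
            remap = torch.cumsum(active.long(), dim=0) - 1  # original label -> index within the subset
            # plain cross-entropy for brevity; the method as described uses an angular penalty softmax
            loss = F.cross_entropy(logits, remap[labels])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()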
Okay, here is the second method we propose, which we call DropAdapt, which is a means of adapting to an unlabelled test set in a somewhat unsupervised fashion.
From the fully trained model, we calculate this P_average value for the test utterances, with no need for the test set labels.
We then isolate the low-probability classes, i.e. the low P_average classes, and drop those from the output layer and the dataset.
Alternatively, we combine those low-probability classes into a single new class.
We then fine-tune on the remaining training classes and repeat this process iteratively, calculating P_average again and further dropping classes.
We think this can be viewed as a form of oversampling of the classes which produce high P_average values over the lifetime of the network's training, which we believe, due to their high P_average values, are in some way more relevant to the test utterances, or more closely matched to the speaker distribution of the test set.
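Sketched in the same PyTorch style, and again with the fine_tune_on_active helper and the hyperparameter values as illustrative assumptions rather than our exact code, the adaptation loop looks roughly like this:

    import torch

    @torch.no_grad()
    def p_average(model, classifier_weight, test_batches):
        # Mean softmax over the training classes for a set of unlabelled test utterances.
        probs = [torch.softmax(model(feats) @ classifier_weight.t(), dim=-1)
                 for feats in test_batches]
        return torch.cat(probs).mean(dim=0)   # one probability per training class

    def drop_adapt(model, classifier_weight, train_data, test_batches, optimizer,
                   rounds=6, drop_per_round=500, iters_per_round=5000):
        # DropAdapt sketch: iteratively drop the training classes that the unlabelled
        # test set assigns the least probability mass to, then fine-tune on the rest.
        active = torch.ones(classifier_weight.size(0), dtype=torch.bool)
        for _ in range(rounds):
            p_avg = p_average(model, classifier_weight, test_batches)
            p_avg[~active] = float('inf')                           # never re-drop an inactive class
            active[torch.argsort(p_avg)[:drop_per_round]] = False   # drop the lowest P_average classes
            # assumed helper: fine-tune the network and classifier on the active classes only;
            # the 'combine' variant would instead merge the dropped classes into one new class here
            fine_tune_on_active(model, classifier_weight, train_data,
                                optimizer, active, iters_per_round)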
Moving quickly over the basic experimental setup: the primary datasets used are speaker verification datasets, namely VoxCeleb and Speakers in the Wild.
Models were trained using VoxCeleb 2 only, utilising Kaldi-style data augmentation, and if you're interested, our experimental code can be found on GitHub.
For the DropClass experiments, we explored varying the number of iterations to wait before dropping, T, and the number of classes to drop, D.
For DropAdapt, we varied the exact method: discarding the classes entirely, combining them into a single new class, or dropping only the data and keeping the full final affine weight matrix.
Finally, there were two control experiments: training the baseline for longer without dropping any classes, to eliminate the advantage of the extra training iterations that DropAdapt gets, and the other control experiment was to drop random classes and ignore the P_average value.
Here we have the results of the DropClass experiments.
On the left is an exploration into varying the number of iterations to wait before dropping, and as we can see, for both VoxCeleb and the dev portion of Speakers in the Wild, dropping 5,000 classes for configurations of T less than 750 appears to improve performance over the baseline.
On the right, choosing the best configuration from the left, of 250 for T, we varied the number of classes dropped every 250 iterations, and as we can see, pretty much all configurations outperform the baseline.
And here we have the DropAdapt results.
For a budget of 30,000 additional iterations, remembering that this method starts from a fully trained model, we performed six rounds of dropping classes, dropping D classes and then fine-tuning for 5,000 iterations in each round.
As we can see, for both VoxCeleb and Speakers in the Wild, simply running further iterations did not improve performance, and likewise dropping random classes, which ignores the P_average value, as the other control experiment, reassuringly did not improve performance either.
But for the methods dropping the data with low P_average values, such as DropAdapt, DropAdapt-combine, and DropAdapt dropping only the data, performance improved over the baseline, suggesting that this oversampling of high P_average classes over the lifetime of the network's training can help improve performance on a specific test set, with dropping the classes from the final affine weight matrix as well sometimes improving things too.
We also performed an extension experiment to examine what is going on during this DropAdapt fine-tuning.
Here we plot the KL divergence between the P_average distribution and the uniform distribution at each round of dropping classes, for DropAdapt-combine on VoxCeleb, alongside the equal error rate.
This KL divergence is a measure of how close the P_average distribution is to being uniform, i.e. less imbalanced.
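Concretely, under my reconstruction of the notation (the direction of the divergence here is my assumption), with K remaining training classes and U the uniform distribution over them, the quantity plotted is:

    \mathrm{KL}\big(P_{\text{avg}} \,\|\, U\big) = \sum_{k=1}^{K} P_{\text{avg},k} \, \log \frac{P_{\text{avg},k}}{1/K}

which is zero exactly when the probability mass is spread evenly over the remaining training speakers.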
Here we can see that as the KL divergence drops, so too does the equal error rate, i.e. verification performance increases, suggesting that this distribution-matching metric may be somewhat of a good indication of how well matched the network is to a certain test set.
But clearly this has its limits: a network which predicts every class uniformly for every input will clearly not be effective, and there is clearly a minimum number of test speakers needed for this observation to hold, such that we have a good sample of what the overall speaker distribution of the test set is.
So, our conclusions: DropClass can improve the verification performance of extracted embeddings with good configurations, and we suggest this is due to a similar effect to dropout.
Similarly, DropAdapt can improve verification performance, with our assumption being that the oversampling of the high P_average classes is the key step.
Finally, it seems that as the P_average distribution over training classes, when the network is shown a set of test speakers, becomes more uniform, verification performance increases too.
Thanks for listening. I hope you enjoy the rest of the virtual conference, and stay safe.