0:00:14 Hi, my name is Chau Luu, from the University of Edinburgh CSTR, and I'm here to talk about our paper, Dropping Classes for Deep Speaker Representation Learning, at Speaker Odyssey 2020. It's a shame that we can't all meet up and talk about this in person, but I hope everyone is having a good time and enjoying the virtual conference instead.
0:00:35 This work proposes two techniques for training deep speaker embeddings. Speaker embeddings are vector representations of an input utterance that characterise the speaker of that utterance, and these have become crucial for many speaker recognition related tasks, such as speaker verification and speaker diarization, featuring heavily in state-of-the-art methods and pipelines for such tasks.
0:00:58 The historically successful i-vector technique, based on factor analysis, has largely been overtaken in popularity in recent times by embeddings extracted from deep neural networks, such as x-vectors.
0:01:09 In general, the term used for these is deep speaker embeddings. In both x-vectors and i-vectors, previous work has shown that there are many sources of variability encoded into the embeddings which are not strictly related to the speaker.
0:01:25 In the case of deep speaker embeddings specifically, there is a selection of works exploring this phenomenon.
0:01:31 In this work by Williams and King, they explore speaking style related to emotional prosody encoded in x-vectors and i-vectors.
0:01:39 Another work explores the emotional content encoded in x-vectors.
0:01:47 And this work by Raj et al. performed a formidable study into the information encoded in these embeddings, including channel information, transcription information such as sentences, words and phones, along with metadata such as utterance duration and augmentation type.
0:02:04 We can analyse these kinds of information encoded in the embeddings and broadly categorise the sources of this variability.
0:02:11 The first category that we'd like to consider is speaker-related factors, such as attributes related to the speaker identity. These might be gender, accent or age.
0:02:25 The second is a broad grouping of non-speaker related factors, such as the environment and channel information of the utterances, for example the room or recording conditions.
0:02:35 It should be said that these categories are not strict or universal by any means, and it is possible to consider attributes in one category that may overlap with the other. For example, the channel information indicative of a radio show may very much be linked with the identity of the radio presenter.
0:02:52 Okay, so how do these sources of variability link to the properties we would ideally want from speaker embeddings used for verification and diarization?
0:03:01 In the case of non-speaker related information, we typically want the embeddings to be invariant. For example, we would ideally like the same speaker to be mapped to the same area of the embedding space regardless of the recording environment.
0:03:12 This goal is reflected in previous literature on domain and channel invariant speaker representations, such as this ICASSP 2020 work on channel-invariant embeddings using adversarial training, in addition to this work on augmentation-invariant speaker embeddings.
0:03:31 As for speaker-related information, we would ideally like the embeddings to capture this information in some discriminative fashion, meaning all these sources of variability are in some way encoded. We will use the term speaker distribution as somewhat of an umbrella term to describe this speaker-related information.
0:03:50 If we consider aspects of speaker variability as a whole, such as those attributes of gender, accent or age, the embedding space should ideally match the manifold of possible speakers.
0:04:00 In these tasks, speakers in the test set are typically not seen at training time, or are held out. The concern here is that these unseen speakers may not follow the distribution seen at training time. We ask whether such a distribution mismatch exists, and if it does, is that a problem?
0:04:22 So how might we establish whether such a mismatch exists?
0:04:27 First, we should describe the overall structure of these speaker embedding extractors, such as an x-vector network. Deep speaker embeddings are typically extracted from an intermediate layer of a network trained on a speaker classification task.
0:04:41 In this classification task, by learning to discriminate between the training set of speakers, the intermediate layer maps to a space that can be used to represent the speaker identity of any given utterance.
0:04:54 Now, if we put the unseen test utterances of some dataset, or test speakers, through the full classification network, and we evaluate the mean softmax probabilities the network predicts, we can get some kind of approximation of which training classes the network believes it is seeing when given the unseen test speakers as input.
0:05:15 So, for N test utterances, we can calculate the softmax probabilities at the output of the whole network and average them:

p_avg = (1/N) * sum_{i=1}^{N} softmax(W h_i)

where h_i is the penultimate representation of the i-th input utterance, before it is transformed into the class probabilities by the final affine weight matrix W and the softmax operation.
0:05:35 This is what the p_average term is: an average of the softmax speaker class probabilities over some set of input utterances, or a measure of the probability mass assigned to each training speaker by the network.
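As a rough illustration of this p_average computation, here is a minimal PyTorch sketch, assuming a generic embedding network and a final affine layer; the function and variable names (compute_p_average, final_affine and so on) are placeholders for illustration, not the actual experimental code.

```python
import torch
import torch.nn.functional as F

def compute_p_average(embedding_net, final_affine, utterances):
    """Mean softmax class probabilities over a set of input utterances.

    embedding_net : maps input features to the penultimate representation h_i
    final_affine  : final affine layer (weight matrix W) producing class logits
    utterances    : iterable of feature tensors, one per utterance
    """
    embedding_net.eval()
    probs = []
    with torch.no_grad():
        for feats in utterances:
            h = embedding_net(feats.unsqueeze(0))   # penultimate representation h_i
            logits = final_affine(h)                # W h_i
            probs.append(F.softmax(logits, dim=-1))
    # p_average: probability mass assigned to each training speaker,
    # averaged over all utterances in the set
    return torch.cat(probs, dim=0).mean(dim=0)
```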
0:05:51 So, let us examine the case of an x-vector system trained on VoxCeleb 2. The VoxCeleb 1 test set has 40 unique speakers with 42 utterances each, and we can put these through the network and calculate p_average. We also did the same with a set of 40 speakers from the training set, also with 42 utterances each.
0:06:12 This figure displays the 5,994 training classes ranked according to the p_average value produced. As we can see, somewhat surprisingly, the network appears to be much more confident on test speakers than on training speakers, and more specifically it seems to predict a small number of speakers with much higher probability than the rest.
0:06:33 You might think the red line for the training speakers here is merely a flat line, but if we zoom in, as shown on the next slide, we can see that this is not the case. As there are many more combinations of training speakers, choosing 40 out of 5,994, we sampled these 300 times to provide error bars around the mean red line. For context, this is a fairly competitive model, scoring 3.04% equal error rate on the VoxCeleb test set using cosine similarity scoring.
0:07:05 Why might this be the case? Why does this network appear to predict some training speakers with such confidence when given test speakers? This isn't immediately clear, as the VoxCeleb test set has a good balance of male and female speakers, and was chosen simply because the speakers' names began with the letter E, with nothing else particularly specific about it.
0:07:26 Is this mismatch even a bad thing? If we consider each of the training speakers as general templates or archetypes of speaker, instead of specific identities, the predicted types of speaker produced on the test set would indicate somewhat of a class imbalance.
0:07:41 The negative effects that class imbalance problems can cause are fairly well documented for classification tasks, and these issues also extend to representation learning, with previous work proposing cost-sensitive training as a means to mitigate this.
0:07:57 To that end, this work proposes two methods to study these issues: one in the form of robustness to different speaker distributions, and the other in the form of adaptation to a specific speaker distribution. Both of these methods work by periodically dropping classes from the output layer.
0:08:20 The first method that we propose is called DropClass, and it works by periodically dropping D classes from the dataset and the output of the classification layer. So, every T training batches, we remove D classes from the dataset, such that during the next T training batches these classes are not seen. We also remove the corresponding rows of the final affine weight matrix.
0:08:43 In combination with an angular penalty softmax, this continually changes the decision boundary distinguishing between the subsets of classes. This acts essentially as a form of dropout on the classification layer, synchronised with the data provided. Dropout has been theoretically justified in the past as continually sampling from many thinned networks and then averaging at test time, resulting in a more robust model. In a similar fashion, we think of this method as sampling from many different subset classification tasks.
0:09:18 Here is a slightly more detailed diagram of how this method works, and we can see that the embedding generator is essentially trained on multiple different subset classification tasks throughout training.
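To make this mechanism concrete, here is a minimal sketch of one DropClass period in PyTorch; it uses a plain linear output layer rather than the angular penalty softmax mentioned above, and names such as dropclass_step and reduced_logits are assumed purely for illustration.

```python
import random
import torch
import torch.nn.functional as F

def dropclass_step(num_classes, num_drop):
    """Choose the classes to drop for the next T training batches.

    Returns the kept class indices and a mapping from original class labels
    to compact 0..K-1 labels for the reduced classification task.
    """
    dropped = set(random.sample(range(num_classes), num_drop))
    kept = [c for c in range(num_classes) if c not in dropped]
    remap = {c: i for i, c in enumerate(kept)}
    return kept, remap

def reduced_logits(final_affine, h, kept):
    """Class logits using only the kept rows of the final affine weight matrix
    (and the matching bias entries), mirroring the removal of dropped rows."""
    kept_idx = torch.tensor(kept, device=h.device)
    W = final_affine.weight[kept_idx]
    b = final_affine.bias[kept_idx] if final_affine.bias is not None else None
    return F.linear(h, W, b)

# In training: every T batches, call dropclass_step again, rebuild the data
# sampler so the dropped classes are not sampled, and train on reduced_logits
# with the remapped labels for the next T batches.
```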
0:09:32 Okay, here is the second method we propose, which we call DropAdapt, and which is a means of adapting to an unlabelled test set in a somewhat unsupervised fashion. From the fully trained model, we calculate the p_average value for the test utterances, with no need for the test set labels. We then isolate the low probability classes, that is, the low p_average classes, and drop those from the output layer and the dataset.
0:09:59 Alternatively, we combine those low probability classes into a single new class. We then fine-tune on the remaining training classes, and repeat this process iteratively, calculating p_average again and dropping further classes.
0:10:11 We think this can be viewed as an oversampling, over the lifetime of the network's training, of the classes which produce high p_average values, classes which we believe, due to their high p_average values, are in some way more relevant to the test utterances, or more closely matched to the speaker distribution of the test set.
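A compact sketch of one DropAdapt round might look like the following, reusing the p_average computation sketched earlier; the keep_fraction parameter and the function names are hypothetical choices for illustration, not values taken from the experiments.

```python
import torch

def dropadapt_select(p_avg, keep_fraction):
    """Rank training classes by their p_average on the unlabelled test set and
    keep only the highest-probability fraction; the remainder are dropped
    (or, in the 'combine' variant, merged into a single new class)."""
    num_keep = max(1, int(len(p_avg) * keep_fraction))
    kept = torch.argsort(p_avg, descending=True)[:num_keep]
    return kept.tolist()

# One round of the iterative procedure, in outline:
#   p_avg = compute_p_average(embedding_net, final_affine, test_utterances)
#   kept  = dropadapt_select(p_avg, keep_fraction=0.9)   # hypothetical fraction
#   ...restrict the training data and the rows of W to `kept`, fine-tune for a
#      fixed number of iterations, then recompute p_avg and repeat...
```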
0:10:34 We'll go quickly over the basic experimental setup. The primary datasets used were speaker verification datasets, namely VoxCeleb and Speakers in the Wild. Models were trained using VoxCeleb 2 only, utilising Kaldi-style augmentation, and if you're interested, our experimental code can be found on GitHub.
0:10:54 For the DropClass experiments, we explored varying the number of iterations to wait before dropping, T, and the number of classes to drop, D. For DropAdapt, we varied the exact method: discarding the classes entirely, combining them into a single new class, or dropping the classes only from the final affine weight matrix. Finally, we ran two control experiments: training the baseline for longer without dropping any classes, to eliminate the advantage of the extra training iterations that DropAdapt provides, and dropping random classes, ignoring the p_average value.
0:11:31 Here we have the results of the DropClass experiments. On the left is an exploration into varying the number of iterations to wait before dropping, and as we can see, for both VoxCeleb and the dev portion of Speakers in the Wild, dropping 5,000 classes with configurations of T less than 750 appears to improve performance over the baseline. On the right, choosing the best configuration from the left, of 250 for T, we varied the number of classes dropped at each iteration, and as we can see, pretty much all configurations outperform the baseline.
0:12:05 And here we have the DropAdapt results. For a budget of 30,000 additional iterations, remembering that this method starts from a fully trained model, we performed six rounds of dropping classes, with each configuration dropping D classes and fine-tuning for 5,000 iterations each. As we can see, for both VoxCeleb and Speakers in the Wild, running further iterations of the baseline did not improve performance.
0:12:37 Similarly, dropping random classes, which ignores the p_average value, was one of our control experiments and, reassuringly, did not improve performance. But the methods dropping the data with low p_average values, such as DropAdapt and DropAdapt-combine, did improve performance over the baseline, suggesting that this oversampling of high p_average classes over the lifetime of the network's training can help improve performance on a specific test set, with dropping the classes from the final affine weight matrix sometimes improving things too.
0:13:12 We also performed an extension experiment to examine what is going on during this DropAdapt fine-tuning. Here we plot the KL divergence between the p_average distribution and the uniform distribution at each round of dropping classes, for DropAdapt-combine on VoxCeleb, alongside the equal error rate. The KL divergence is a measure of how close the p_average distribution is to being uniform, i.e. less imbalanced. We can see that as the KL divergence drops, the verification performance increases, suggesting that this distribution matching metric may be a good indication of how well matched the network is to a certain test set.
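For reference, this distribution matching metric can be computed along the following lines; this is a generic KL-from-uniform sketch rather than the exact code behind the plot.

```python
import torch

def kl_from_uniform(p_avg, eps=1e-12):
    """KL(p_average || uniform): how far the averaged class probabilities are
    from assigning equal mass to every training speaker."""
    k = p_avg.numel()
    p = p_avg.clamp_min(eps)
    p = p / p.sum()                       # renormalise to a proper distribution
    uniform = torch.full_like(p, 1.0 / k)
    return torch.sum(p * (p.log() - uniform.log()))
```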
0:13:54 There are clearly some caveats to this, as a network which predicts every class uniformly all the time will clearly not be effective, and there is clearly some minimum number of test speakers needed for this observation to hold, such that we have a good sample of what the overall speaker distribution of the test set is.
0:14:13 So, to conclude: DropClass can improve the verification performance of the extracted embeddings with good configurations, and we suggest this is due to a similar effect to dropout. Similarly, DropAdapt can improve verification performance, with the assumption that the high p_average classes are the key to this. It seems that as the p_average distribution of training classes, with respect to a given set of test speakers, becomes more uniform, verification performance increases too.
0:14:48 Thanks for listening. I hope you enjoy the rest of the virtual conference, and stay safe. Bye.