0:00:14 | Hi, my name is Chau Luu, from the University of Edinburgh CSTR. |
---|
0:00:18 | I'm here to talk about our paper, "Dropping Classes for Deep Speaker Representation Learning", at Speaker Odyssey 2020. |
---|
0:00:26 | It's a shame that we can't all meet up and talk about this in person, but I hope everyone is staying safe and enjoying the virtual conference instead. |
---|
0:00:35 | This work proposes two techniques for training deep speaker embeddings. |
---|
0:00:40 | Speaker embeddings are fixed-vector representations of an input utterance that characterise the speaker of that utterance. |
---|
0:00:45 | These have become crucial for many speaker recognition related tasks, such as speaker verification and speaker diarization, and feature heavily in state-of-the-art methods and pipelines for such tasks. |
---|
0:00:58 | The historically successful i-vector technique, based on factor analysis, has largely been overtaken in popularity in recent times by embeddings extracted from deep neural networks, such as x-vectors. |
---|
0:01:09 | We will generally use the term "deep speaker embeddings" for the techniques discussed here. |
---|
0:01:15 | For both x-vectors and i-vectors, previous work has shown that there are many sources of variability encoded into the embeddings which are not strictly related to the speaker. |
---|
0:01:25 | In the case of deep speaker embeddings specifically, there is a selection of works exploring this phenomenon. |
---|
0:01:31 | One work explored speaking style related to emotion and prosody encoded in x-vectors and i-vectors, |
---|
0:01:39 | and another explored the emotion information encoded in x-vectors. |
---|
0:01:47 | A further work by Raj et al. performed a comprehensive study into the information encoded in x-vectors, including channel information, transcription information such as sentences, words and phones, along with other metadata such as utterance duration and augmentation type. |
---|
0:02:04 | We can broadly classify these sources of variability into two categories. |
---|
0:02:11 | The first category we would like to consider is speaker-related factors, such as attributes related to the speaker identity: these might be gender, accent or age. |
---|
0:02:25 | The second is a rough grouping of non-speaker related factors, such as the environment and channel information of an utterance, for example the room and recording conditions. |
---|
0:02:35 | It should be said that these categories are not strict or universal by any means, and it is possible to consider attributes where one domain overlaps with the other. |
---|
0:02:43 | For example, the channel information indicative of a radio show may very much be linked with the identity of the radio presenter. |
---|
0:02:52 | So how do these sources of variability link to the properties we would ideally want from speaker embeddings used for verification and diarization? |
---|
0:03:01 | In the case of non-speaker related information, we typically want the embedding to be invariant to it. |
---|
0:03:05 | For example, we would ideally like the same speaker to be mapped to the same area of the embedding space regardless of the recording environment. |
---|
0:03:12 | This goal is reflected in previous work on domain and channel invariant speaker representations, such as work on channel-invariant embeddings using adversarial training, in addition to work on augmentation-invariant speaker embeddings. |
---|
0:03:31 | As for speaker-related information, we would ideally like the embeddings to capture this information in a discriminative fashion, meaning all these sources of variability are in some way encoded. |
---|
0:03:42 | We will use the term "speaker distribution" as an umbrella term to describe this speaker-related information. |
---|
0:03:50 | If we consider the aspects of speaker variability as a whole, such as the attributes of gender, accent or age, the embedding space should ideally match the manifold of possible speakers. |
---|
0:04:00 | In these tasks, speakers in the test set are typically not seen at training time; they are held out. |
---|
0:04:08 | The concern here is that these unseen speakers may not follow the distribution seen at training time. |
---|
0:04:15 | We ask whether such a distribution mismatch exists, and if it does, is it a problem? |
---|
0:04:22 | So how might we establish whether such a mismatch exists? |
---|
0:04:27 | First, we should describe the overall structure of these speaker embedding extractors, such as an x-vector network. |
---|
0:04:35 | Deep speaker embeddings are typically extracted from an intermediate layer of a network trained on a speaker classification task. |
---|
0:04:42 | In this classification task, by learning to discriminate between the training set of speakers, the intermediate layer maps to a space that can be used to represent the speaker identity of any given utterance. |
---|
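As a minimal sketch of this kind of classification-trained extractor (illustrative only; the layer sizes and structure here are simplified assumptions, not the exact x-vector recipe):

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Toy extractor: frame encoder -> stats pooling -> embedding -> speaker class logits."""
    def __init__(self, feat_dim=30, emb_dim=512, num_speakers=5994):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding_layer = nn.Linear(2 * 512, emb_dim)               # stats pooling doubles the dim
        self.classifier = nn.Linear(emb_dim, num_speakers, bias=False)   # final affine weight matrix W

    def forward(self, feats):                  # feats: (batch, feat_dim, frames)
        frames = self.frame_encoder(feats)     # (batch, 512, frames)
        stats = torch.cat([frames.mean(-1), frames.std(-1)], dim=-1)     # mean + std over frames
        h = self.embedding_layer(stats)        # speaker embedding (penultimate representation)
        logits = self.classifier(h)            # speaker class logits
        return h, logits
```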
0:04:54 | Now, if we put the unseen test utterances of some dataset, i.e. the test speakers, through this classification network, and evaluate the mean softmax probabilities the network predicts, we get some kind of approximation of which training classes the network believes it is seeing when given the unseen test speakers as input. |
---|
0:05:15 | So for N test utterances, we can calculate the softmax probabilities at the output of the whole network and average them: P_avg = (1/N) * sum_{i=1..N} softmax(W h_i), where h_i is the penultimate representation of the i-th input utterance, before it is transformed into class probabilities by the final affine weight matrix W and the softmax operation. |
---|
0:05:35 | This is what the P_average term is: an average of the softmaxed speaker class probabilities over some set of input utterances, or in other words a measure of the probability mass the network assigns to each training speaker. |
---|
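A rough sketch, in PyTorch, of how this P_average quantity could be computed for a set of test utterances, reusing the toy classifier sketched above (function and variable names are illustrative, not from the released experimental code):

```python
import torch

@torch.no_grad()
def compute_p_average(model, test_utterances):
    """Average softmaxed speaker-class probabilities over a set of utterances.

    test_utterances: iterable of feature tensors shaped (feat_dim, frames).
    Returns a vector of length num_speakers that sums to 1.
    """
    probs = []
    for feats in test_utterances:
        _, logits = model(feats.unsqueeze(0))          # add a batch dimension
        probs.append(torch.softmax(logits, dim=-1))    # per-utterance class probabilities
    return torch.cat(probs, dim=0).mean(dim=0)         # P_avg over all utterances

# Example: rank training classes by the probability mass assigned to them
# p_avg = compute_p_average(model, test_feats)
# ranked = torch.argsort(p_avg, descending=True)
```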
0:05:51 | So, consider the case of an x-vector system trained on VoxCeleb 2. The VoxCeleb 1 test set has 40 unique speakers, with 42 utterances each. |
---|
0:06:01 | We put these through the network and calculated P_average, and we also did the same with a set of 40 speakers from the training set, also with 42 utterances each. |
---|
0:06:12 | This figure displays the 5,994 training classes ranked according to the P_average value produced. |
---|
0:06:20 | As we can see, somewhat surprisingly, the network appears to be much more confident on test speakers than on training speakers, and more specifically it assigns a small number of classes much higher probability than the rest. |
---|
0:06:33 | You might think the red line for the training speakers here is just a flat line, but if we look at the plot more closely on the next slide, we can see that this is not the case. |
---|
0:06:41 | As there are many more ways of choosing 40 training speakers out of 5,994, we sampled these 300 times to provide error bars around the mean red line. |
---|
0:06:54 | For context, this is a fairly competitive model, scoring 3.04% equal error rate on the VoxCeleb test set using cosine similarity scoring. |
---|
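For reference, cosine similarity scoring of a verification trial can be sketched roughly as follows (the threshold and helper names are illustrative assumptions, not the evaluation code used here):

```python
import torch
import torch.nn.functional as F

def cosine_score(model, feats_a, feats_b):
    """Score a verification trial as the cosine similarity between two embeddings."""
    with torch.no_grad():
        emb_a, _ = model(feats_a.unsqueeze(0))
        emb_b, _ = model(feats_b.unsqueeze(0))
    return F.cosine_similarity(emb_a, emb_b).item()   # higher means more likely the same speaker

# Accept the trial as "same speaker" if the score exceeds a threshold tuned on a dev set,
# e.g. accept = cosine_score(model, enrol_feats, test_feats) > 0.5
```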
0:07:05 | Why might this be the case? Why does this network appear to predict some training speakers with such confidence when given unseen test speakers? |
---|
0:07:13 | This isn't immediately clear, as the VoxCeleb test set has a good balance of male and female speakers, and was chosen simply as the speakers whose names begin with the letter "E", with no other specific criteria. |
---|
0:07:26 | Is this mismatch even a bad thing? |
---|
0:07:30 | If we consider each of the training classes as a general template or archetype of speaker, rather than a specific identity, the predicted types of speaker produced for the test set would indicate somewhat of a class imbalance. |
---|
0:07:41 | The negative effects that class imbalance can cause are fairly well documented for classification tasks, and these issues also extend to representation learning, with previous work proposing cost-sensitive training as a means to mitigate this. |
---|
0:07:57 | Returning to our work, we propose two methods to study these issues: one in the form of robustness to different speaker distributions, and the other in the form of adaptation to a particular speaker distribution. |
---|
0:08:12 | Both of these methods work by periodically dropping classes from the output layer. |
---|
0:08:20 | The first method we propose, which we call DropClass, works by periodically dropping D classes from the dataset and from the output of the classification layer. |
---|
0:08:30 | So, every T training batches we remove D classes from the dataset, such that during the next T training batches these classes are not seen. |
---|
0:08:39 | We also remove the corresponding rows of the final affine weight matrix. |
---|
0:08:43 | Used in combination with an angular penalty softmax, this continually changes the decision boundaries distinguishing between the subsets of classes. |
---|
0:08:53 | This acts essentially as a form of dropout on the classification layer, also synchronised with the data provider. |
---|
0:09:02 | Dropout has been theoretically justified in the past as continually sampling from many thinned networks and then averaging at test time, resulting in a more robust model; in this light, we can think of this method as sampling from many different subsets of classification tasks. |
---|
0:09:18 | Here is a slightly more detailed diagram of how this method works: we can see that the embedding generator is essentially trained on multiple different subset classification tasks throughout training. |
---|
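A minimal sketch of what a DropClass-style training loop could look like, reusing the toy classifier above; the `dataset.batches` helper and the choice of loss function are illustrative assumptions, not the released implementation:

```python
import random
import torch

def dropclass_train(model, dataset, optimizer, loss_fn,
                    total_steps=100_000, T=250, D=5000, num_speakers=5994):
    """Every T batches, drop D classes from the data and from the output layer."""
    step = 0
    while step < total_steps:
        # Sample the subset of classes kept for the next T batches.
        kept = sorted(random.sample(range(num_speakers), num_speakers - D))
        remap = {c: i for i, c in enumerate(kept)}                 # original id -> subset id

        for feats, labels in dataset.batches(classes=kept):        # hypothetical helper: batches from kept classes only
            W_subset = model.classifier.weight[kept]               # surviving rows of the final affine matrix
            h, _ = model(feats)                                    # speaker embeddings
            logits = h @ W_subset.t()                              # logits over the kept classes only
            subset_labels = torch.tensor([remap[int(l)] for l in labels])
            loss = loss_fn(logits, subset_labels)                  # e.g. an angular penalty softmax loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % T == 0:
                break                                              # resample which classes are kept
```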
0:09:32 | The second method we propose, which we call DropAdapt, is a means of adapting to an unlabelled test set in a somewhat unsupervised fashion. |
---|
0:09:42 | From the fully trained model, we calculate the P_average values for the test utterances, with no need for the test set labels. |
---|
0:09:52 | We then isolate the low probability classes, i.e. those with the lowest P_average values, and either drop them from the output layer and the dataset, or combine those low probability classes into a single new class. |
---|
0:10:04 | We then fine-tune on the remaining training classes, and repeat this process iteratively, calculating P_average again and dropping further classes. |
---|
0:10:11 | We think this can be viewed as a form of oversampling, over the lifetime of the network's training, of the classes which produce high P_average values, classes which we believe, due to those high P_average values, are in some way more relevant to the test utterances, or more closely matched to the test speaker distribution. |
---|
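A rough sketch of one round of DropAdapt-style adaptation, reusing `compute_p_average` from above; `keep_fraction`, `restrict_to` and `fine_tune` are illustrative stand-ins, not the released implementation:

```python
import torch

def dropadapt_round(model, train_dataset, test_feats, keep_fraction=0.5):
    """One round: keep the classes with the highest P_average, then fine-tune on them."""
    p_avg = compute_p_average(model, test_feats)                  # no test labels needed
    n_keep = int(keep_fraction * p_avg.numel())
    kept = torch.argsort(p_avg, descending=True)[:n_keep]         # highest-probability training classes
    kept = sorted(int(c) for c in kept)

    # Drop the low P_average classes from the final affine weight matrix and the training data
    # (labels would also need remapping onto the surviving class indices).
    model.classifier.weight.data = model.classifier.weight.data[kept]
    train_dataset.restrict_to(kept)                               # hypothetical dataset helper
    fine_tune(model, train_dataset, steps=5000)                   # stand-in for the usual training loop
    return kept

# Repeat for several rounds, recomputing P_average after each round of fine-tuning.
```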
0:10:34 | To quickly go over the basic experimental setup: the primary datasets used were speaker verification datasets, namely VoxCeleb and Speakers in the Wild. |
---|
0:10:43 | Models were trained using VoxCeleb 2 only, utilising Kaldi data augmentation. |
---|
0:10:49 | If you're interested, our experimental code can be found on GitHub. |
---|
0:10:54 | For the DropClass experiments, we explored varying the number of iterations to wait before dropping, T, and the number of classes to drop, D. |
---|
0:11:02 | For DropAdapt, we varied the exact method: discarding the classes entirely, combining them into a single new class, or dropping the classes only from the final affine weight matrix. |
---|
0:11:14 | Finally, there were two control experiments: training the baseline for longer without dropping any classes, to eliminate the advantage of the extra training iterations that DropAdapt receives, and dropping random classes while ignoring the P_average values. |
---|
0:11:31 | Here we have the results of the DropClass experiments. |
---|
0:11:33 | On the left is an exploration into varying the number of iterations to wait before dropping, and as we can see, for both VoxCeleb and the dev portion of Speakers in the Wild, dropping 5,000 classes with configurations of T less than 750 appears to improve performance over the baseline. |
---|
0:11:51 | On the right, choosing the best configuration from the left, of T equal to 250, we varied the number of classes dropped at each round, and as we can see, pretty much all configurations outperform the baseline. |
---|
0:12:05 | Now we have the DropAdapt results. |
---|
0:12:11 | For a budget of 30,000 additional iterations, remembering that this method starts from a fully trained model, we performed six rounds of dropping classes, with each configuration dropping D classes and then training for 5,000 iterations. |
---|
0:12:27 | As we can see, for both VoxCeleb and Speakers in the Wild, simply running further baseline iterations did not improve performance. |
---|
0:12:37 | Similarly, dropping random classes, which ignores the P_average values and serves as the other control experiment, reassuringly did not improve performance. |
---|
0:12:45 | But the methods dropping the data with low P_average values, such as DropAdapt and DropAdapt-combine, did improve performance over the baseline, suggesting that this oversampling of high P_average classes over the lifetime of the network's training can help improve performance on a specific test set, with dropping the classes from the final affine weight matrix sometimes improving things too. |
---|
0:13:12 | We also performed an additional experiment to examine what is going on during this DropAdapt fine-tuning. |
---|
0:13:17 | Here we plot the KL divergence between the P_average distribution and a uniform distribution at each round of dropping classes, for DropAdapt-combine on VoxCeleb, alongside the equal error rate. |
---|
0:13:29 | The KL divergence is a measure of how far the P_average distribution is from being uniform, i.e. how imbalanced it is. |
---|
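As a small illustrative sketch, this distribution-matching metric could be computed as follows (reusing `compute_p_average` from above):

```python
import torch

def kl_from_uniform(p_avg, eps=1e-12):
    """KL(P_average || uniform): 0 when probability mass is spread evenly over all classes."""
    n = p_avg.numel()
    uniform = torch.full_like(p_avg, 1.0 / n)
    return torch.sum(p_avg * (torch.log(p_avg + eps) - torch.log(uniform)))

# kl = kl_from_uniform(compute_p_average(model, test_feats))
# Smaller values indicate the network spreads probability more evenly over its training classes.
```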
0:13:37 | Here we can see that as the KL divergence drops, i.e. as the P_average distribution becomes more uniform, verification performance increases, suggesting that this distribution matching metric may be somewhat of a good indicator of how well matched the network is to a certain test set. |
---|
0:13:54 | But clearly this has its limits: a network which predicts every class uniformly all the time will clearly not be effective, and there is clearly a minimum number of test speakers needed for this observation to hold, so that we have a good sample of what the overall speaker distribution of the test set is. |
---|
0:14:13 | So, our conclusions: DropClass can improve the verification performance of the extracted embeddings with good configurations, and we suggest this is due to an effect similar to dropout. |
---|
0:14:23 | Similarly, DropAdapt can improve verification performance, with our assumption being that the oversampling of high P_average classes is the key step. |
---|
0:14:37 | Finally, it seems that as the P_average distribution over training classes, when the network is shown a set of test speakers, becomes more uniform, verification performance increases too. |
---|
0:14:48 | Thanks for listening. I hope you enjoy the rest of the virtual conference, and stay safe. Bye! |
---|