0:00:14 | Hello everyone. My name is Raghuveer, and I am a graduate student at the Signal Analysis and Interpretation Laboratory in Los Angeles.
0:00:23 | Today I will be presenting our work, titled "An Empirical Analysis of Information Encoded in Disentangled Neural Speaker Representations".
0:00:32 | And here are the people who have collaborated with me on this work.
0:00:38 | So first, I will introduce what I refer to as speaker embeddings in the rest of the talk.
0:00:44 | Speaker embeddings are low-dimensional representations that are discriminative of speaker identity.
0:00:52 | These have applications such as voice biometrics, where the task is to verify a person's identity from their speech.
0:01:01 | They also find application in speaker adaptation of ASR models.
0:01:06 | They can also be used in speaker diarization, where the task is to determine who spoke when in multiparty conversations.
0:01:14 | This can be of particular use in meeting analysis and many other applications.
0:01:19 | Good speaker embeddings should satisfy two properties: first, they should be discriminative of speaker factors; second, they should ideally be invariant to other factors.
0:01:30 | So, what are the factors of information that could be encoded in a speaker embedding? For ease of analysis, we broadly categorize them as follows.
0:01:39 | First are the speaker factors. These are related to the speaker's identity: for example, gender, age, etc.
0:01:47 | Next are the content factors. These are acquired during speech production by the speaker: for example, the emotional state portrayed in the speech signal, the sentiment (whether it is a positive or a negative one), the language being spoken, and, most importantly, the lexical content in the signal.
0:02:06 | Last are the channel factors. These are acquired as the signal is captured by the microphone. They could include the room acoustics, the microphone nonlinearities, ambient acoustic noise, and also the artifacts related to compression of the signal, etc.
0:02:26 | As I mentioned previously, good speaker embeddings are supposed to be invariant to nuisance factors. These are the factors that are unrelated to the speaker's identity.
0:02:34 | Such invariance is useful for robust speaker recognition in the presence of noise and other adverse acoustic conditions.
0:02:42 | It is also useful for detecting a speaker's identity irrespective of the emotional state of the speaker, and independent of what the speaker says. This is particularly useful in text-independent speaker verification applications.
0:02:58 | With that as the motivation, the goals of our work are twofold: first, to quantify the amount of nuisance information in speaker embeddings; second, to investigate to what extent unsupervised learning can help to remove that nuisance information.
0:03:18 | Most existing studies perform analyses based on only one or two datasets, and a comprehensive analysis is lacking.
0:03:27 | Also, most of these works do not consider the dependence between the individual variables in a dataset. For example, in one widely used dataset, the lexical content and the speaker identity are entangled, because some sentences are spoken only by certain speakers. Therefore, it could be possible to predict the speakers based on the lexical content alone.
0:03:47 | We aim to mitigate these limitations of previous work by making the following contributions. Firstly, we use multiple datasets to comprehensively analyze the information encoded in speaker embeddings. Secondly, we analyze the effect of disentangling speaker factors from nuisance factors on the encoded information.
0:04:11 | Let me briefly clarify what we mean by disentanglement in the remainder of the talk. We define disentanglement broadly as the task of separating out information streams from an incoming signal.
0:04:24 | As a toy example, consider an input speech signal from a speaker who is happy because they have just bought something they like.
0:04:33 | The signal contains information related to various factors. It contains information about the speaker's identity, including, for example, their gender and age. Information pertaining to their emotional state is also encoded. More importantly, the language identity and the lexical content are also embedded in the signal.
0:04:52 | The goal of an ideal embedding extractor is to separate all of these information streams.
0:04:59 | In the context of speaker embeddings, which are supposed to capture speaker identity information, all other factors, such as the emotional state and the lexical content, are considered nuisance factors. It is these factors that we propose to remove from the speaker embeddings to make them more robust.
0:05:18 | Now I will explain the methodology behind disentangled speaker embedding extraction. This is the model we use. As input, we can use any speech representation, such as spectrograms, or even speaker embeddings from pre-trained models, such as x-vectors.
0:05:34 | Using unsupervised disentanglement, adapted from a method that was previously proposed in the computer vision domain, we try to separate the speaker-related information from the nuisance information.
0:05:47 | Please note that this method was previously proposed in our earlier work, and you can find more details in that paper. However, for completeness, I will explain it here briefly.
0:06:01 | The architecture comprises two parts: the main model, which is shown in the green blocks here, and the adversarial models, shown in blue.
0:06:12 | The input x is first processed by an encoder, which splits it into two embeddings, h1 and h2, as shown in the figure. The embedding h1 is fed into the predictor, which predicts the speaker labels, denoted y-hat.
0:06:25 | The embedding h2 is concatenated with a noisy version of h1, which is denoted by h1-prime here. h1-prime is obtained by feeding h1 through a dropout module, which randomly removes certain elements of h1.
0:06:40 | h2, along with the noisy h1 (that is, h1-prime), is concatenated and fed into a decoder, which tries to reconstruct the original input x.
0:06:54 | The motivation behind using the dropout is to ensure that h1 is an unreliable source of information for the reconstruction task. Training in this manner ensures that the information required for reconstruction is not stored in h1, and that only the information required for the speaker embedding is stored there.
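To make the data flow concrete, here is a minimal PyTorch sketch of the two-branch model just described. The layer sizes, dropout rate, and speaker count are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class DisentangledSpeakerModel(nn.Module):
    def __init__(self, in_dim=512, emb_dim=128, n_speakers=7200):
        super().__init__()
        # Encoder trunk, split into the two embeddings h1 and h2.
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_h1 = nn.Linear(256, emb_dim)   # speaker branch
        self.to_h2 = nn.Linear(256, emb_dim)   # nuisance branch
        # Predictor: speaker labels y_hat from h1 only.
        self.predictor = nn.Linear(emb_dim, n_speakers)
        # Dropout makes h1 an unreliable input for reconstruction.
        self.dropout = nn.Dropout(p=0.75)
        # Decoder: reconstructs x from h2 and the corrupted h1 (h1').
        self.decoder = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.trunk(x)
        h1, h2 = self.to_h1(z), self.to_h2(z)
        y_hat = self.predictor(h1)
        h1_prime = self.dropout(h1)   # noisy version of h1
        x_hat = self.decoder(torch.cat([h1_prime, h2], dim=-1))
        return h1, h2, y_hat, x_hat
```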
0:07:16 | In addition, we also use two disentangler models, shown here. These models are jointly trained, in an adversarial fashion, so that they perform poorly in predicting h1 from h2, and h2 from h1. The goal of these models is to ensure that h1 and h2 are not predictive of each other, which ensures that they do not contain similar information. This way, we can enforce disentanglement between the two embeddings.
0:07:44 | The loss functions that we use are presented here. The main model produces two losses: one is the standard cross-entropy loss from the predictor, which predicts the speakers, and the second is the mean squared error reconstruction loss from the decoder. The adversarial models use mean squared error losses as well.
0:08:04 | The overall loss function is shown here. We try to minimize the loss with respect to the main model's parameters, while adversarially maximizing it with respect to the disentanglers.
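The minimax objective just described can be sketched as follows, continuing the PyTorch example from above. The alternating update order and the loss weights alpha and beta are assumptions for illustration; the exact optimization schedule may differ from the paper's.

```python
import torch
import torch.nn as nn

model = DisentangledSpeakerModel()
dis1 = nn.Linear(128, 128)   # tries to predict h2 from h1
dis2 = nn.Linear(128, 128)   # tries to predict h1 from h2
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
opt_main = torch.optim.Adam(model.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(
    list(dis1.parameters()) + list(dis2.parameters()), lr=1e-4)

def disentangler_step(x):
    # Disentanglers are trained to predict WELL (embeddings detached).
    h1, h2, _, _ = model(x)
    loss = (mse(dis1(h1.detach()), h2.detach())
            + mse(dis2(h2.detach()), h1.detach()))
    opt_dis.zero_grad(); loss.backward(); opt_dis.step()

def main_step(x, y, alpha=1.0, beta=1.0):
    # Main model: minimize speaker and reconstruction losses while
    # MAXIMIZING the disentanglers' errors (hence the minus sign).
    h1, h2, y_hat, x_hat = model(x)
    adv = mse(dis1(h1), h2) + mse(dis2(h2), h1)
    loss = ce(y_hat, y) + alpha * mse(x_hat, x) - beta * adv
    opt_main.zero_grad(); loss.backward(); opt_main.step()
```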
0:08:14 | This training procedure is adapted from previous work, which, as I mentioned before, applied the technique to a digit recognition task.
0:08:24 | Upon successful training, the embedding h1 is expected to capture speaker-discriminative information, and the embedding h2 is expected to capture the nuisance information.
0:08:34 | Note that we have not used any labels for the nuisance factors, such as noise type, channel conditions, etc.
0:08:44 | For training the models, we use the standard VoxCeleb training corpora, which consist of "in the wild" speech from interviews with celebrities. We add additive noise and reverberation, which is standard practice in data augmentation. This results in 2.4 million utterances from around 7,200 speakers.
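As an illustration of the additive-noise part of such augmentation, here is a toy sketch that mixes a noise signal into clean speech at a chosen signal-to-noise ratio; reverberation would analogously convolve the speech with a room impulse response. The function and signals are placeholders, not the actual augmentation pipeline used here.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested SNR (in dB)."""
    noise = np.resize(noise, speech.shape)   # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```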
0:09:01 | As mentioned before, we could either use spectrograms as inputs, or we could use embeddings from pre-trained models, which is what we do in this work. Specifically, we use x-vectors extracted from a publicly available pre-trained model as input.
0:09:17 | X-vectors, as most of you already know, are speaker embeddings obtained from a time-delay neural network that is trained to classify speakers on a large dataset artificially augmented with noise and reverberation. This model has been shown to provide state-of-the-art performance on multiple tasks that require speaker-discriminative information.
0:09:39 | We use multiple datasets in our evaluations, as mentioned here. By evaluating some factors, for example emotion, on multiple datasets, we can also control for the issue of dataset bias creeping into the analysis.
0:09:55 | Following others in the literature, we make the assumption that better classification performance of a factor from the speaker embedding implies that more information is present in the embedding with respect to that factor.
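This assumption is typically operationalized with probing classifiers. A minimal sketch, assuming hypothetical arrays of embeddings and factor labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(embeddings: np.ndarray, factor_labels: np.ndarray) -> float:
    """Cross-validated accuracy of predicting a factor from embeddings."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, embeddings, factor_labels, cv=5).mean()

# For a nuisance factor such as noise type, lower probe accuracy on the
# disentangled embeddings than on x-vectors would indicate successful removal:
# probe_accuracy(h1_embeddings, noise_labels) < probe_accuracy(xvectors, noise_labels)
```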
0:10:11 | As a baseline, we use x-vector speaker embeddings, since our model accepts x-vectors as input. We can consider our speaker embeddings to be a refinement of x-vectors, where speaker-relevant information is retained and nuisance factors are removed.
0:10:26 | We also reduce the dimensionality of the x-vectors using PCA, to match the dimensionality of the embeddings from our models.
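A small sketch of this PCA baseline, with illustrative sizes (512-dimensional x-vectors reduced to a 128-dimensional bottleneck) and placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xvectors = rng.standard_normal((1000, 512))   # placeholder x-vectors
pca = PCA(n_components=128)                   # match the embedding dimension
xvectors_reduced = pca.fit_transform(xvectors)
```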
0:10:37 | Now, on to the results.
0:10:41 | The first set of results shows the accuracy of predicting the speaker factors using x-vectors, shown in blue, and using our embeddings. In this case, higher is better. The first two graphs show speaker classification accuracy, and the other two show gender prediction accuracy.
0:11:00 | We find that, in general, both x-vectors and our embeddings perform competitively in classifying speakers and genders. We see a slight degradation when using our embeddings; however, the differences are quite minimal.
0:11:14 | One other observation is that on IEMOCAP, the performance of both x-vectors and our model is lower. We conjecture that this degradation could be due to speaker overlap, and also because this dataset is not ideally suited for the speaker recognition task, since its purpose was emotion recognition.
0:11:36 | Now for the more interesting results. Here we show the results of predicting the content factors using x-vectors and our speaker embeddings. In this case, since these are nuisance factors, lower is better.
0:11:49 | We find that in all cases our model reduces the nuisance information. In particular, emotion and lexical information are reduced to a greater extent. Here, the lexical accuracy is the accuracy of predicting the sentence spoken, given the speaker embedding of that sentence.
0:12:10 | Apart from the emotion and lexical content, we also see a reduction in the information pertaining to sentiment, which is closely related to emotion, and also to language.
0:12:25 | In this slide, we report the results of predicting the channel factors using x-vectors and our speaker embeddings. Again, lower is better. In particular, we focus on three factors: the room, the microphone distance (or the microphone location), and the noise type.
0:12:44 | We find that, in predicting the location of the microphone used and the type of noise present, x-vectors yield a much higher accuracy than our embeddings. This means that we are able to successfully reduce this channel information from the x-vectors.
0:13:00 | However, we notice that in predicting the room in which the recording was made, the disentanglement does not seem to have been very effective; this needs further investigation.
0:13:18 | Next, we show the results of the task evaluation. We evaluated the models on the speaker verification task, and we compare the detection error tradeoff (DET) curves, where the false positive rate and the miss rate are plotted against each other on nonlinearly scaled axes. The closer a curve is to the origin, the better the model.
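For reference, a DET curve can be computed and plotted as in the following sketch, using synthetic target and impostor scores; the probit (normal-deviate) warp is what makes the axes nonlinear.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.metrics import det_curve

rng = np.random.default_rng(0)
# Synthetic verification scores: targets score higher than impostors.
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

fpr, fnr, _ = det_curve(labels, scores)
eps = 1e-4  # avoid infinities at the probit extremes
plt.plot(norm.ppf(fpr.clip(eps, 1 - eps)), norm.ppf(fnr.clip(eps, 1 - eps)))
plt.xlabel("false alarm rate (probit scale)")
plt.ylabel("miss rate (probit scale)")
plt.title("DET curve (synthetic scores)")
plt.show()
```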
0:13:44 | The black dotted lines show the x-vector model, and all the other lines show our models, trained with and without LDA-based dimensionality reduction.
0:13:57 | We found statistically significant differences only in the graphs shown here.
0:14:05 | Most notably, in challenging scenarios, with babble or television noise in the background, our models perform better than x-vectors. Also, in the distant-microphone condition, our models perform significantly better than x-vectors.
0:14:21 | We also found that the model variant trained with metadata performed slightly better than the variant trained without such additional conditioning. This conformed with our expectations.
0:14:38 | Finally, I would like to quickly present a discussion based on our experiments, which will hopefully provide useful pointers for future research in this domain.
0:14:48 | First, we find that speaker embeddings capture a range of information pertaining to nuisance factors, and this can sometimes be detrimental to robustness.
0:14:58 | We also found that just introducing a bottleneck on the dimensionality of the speaker embedding by using PCA does not remove this information. This points to the need to explicitly model the nuisance factors.
0:15:13 | Using the unsupervised adversarial invariance technique, which is the disentanglement technique used in our model, we can reduce the nuisance information in the speaker embeddings. An added advantage is that labels for the nuisance factors are not required for this method.
0:15:31 | We also found that adversarial disentanglement retains gender information. This suggests that speaker gender, as captured by neural embeddings, is a crucial part of identity. This is quite intuitive from a human perception point of view; essentially, it shows that the learned representations are consistent with human perception.
0:15:54 | Finally, the disentangled speaker representations show better verification performance in the presence of a variety of challenging conditions, particularly babble and television noise, which are considered very challenging for this task.
0:16:10 | Going forward, we would like to explore methods to further improve the disentanglement. So far, as I mentioned, we have not used any nuisance labels, so we would like to see whether, by using whatever labeled data is available, we can achieve better disentanglement.
0:16:29 | So, that brings me to the end of my talk.
0:16:36 | Finally, I would like to acknowledge the sources of support for this work.
0:16:42 | And with that, I conclude my presentation. Please feel free to reach out to me with any questions or suggestions you might have.
0:16:49 | Thank you.