0:00:15 | Hello, my name is Daniel Garcia-Romero, and I will be presenting joint work with my colleagues Greg Sell and Alan McCree from the Human Language Technology Center of Excellence at Johns Hopkins University. |
---|
0:00:28 | The title of our work is MagNetO: x-vector magnitude estimation network plus offset for improving speaker recognition. |
---|
0:00:41 | The current state of the art in text-independent speaker recognition is based on DNN embeddings trained with a classification loss, for example multiclass cross-entropy. |
---|
0:00:52 | If there is no severe mismatch between the DNN training data and the deployment environment, the cosine similarity between embeddings from a system trained with an angular margin softmax provides very good speaker discrimination. |
---|
0:01:09 | For example, in the most recent NIST SRE evaluation, which uses audio extracted from online videos, the top-performing single system on the audio track was based on this approach. |
---|
0:01:26 | Unfortunately, even though cosine similarity provides good speaker discrimination, directly using those scores does not allow us to make use of the theoretically optimal threshold, because these scores are not calibrated. |
---|
0:01:45 | The typical way to address this problem is to use an affine mapping to transform the scores into log-likelihood ratios that are well calibrated. |
---|
0:01:55 | This is typically done using linear logistic regression, and we learn two numbers: a scale and an offset. |
---|
0:02:04 | Looking at the top equation, the raw score is denoted by s_ij, which is the cosine similarity between two embeddings. Since the embeddings are length-normalized, it can be expressed as the inner product of the unit-length embeddings, s_ij = x̃_i^T x̃_j, so it is nothing more than an inner product of unit-length vectors. |
---|
0:02:30 | Once we learn a calibration mapping with the parameters a and b, we can transform this score into a log-likelihood ratio, llr_ij = a * s_ij + b, and then we can make use of the Bayes threshold to make optimal decisions. |
---|
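To make this recipe concrete, here is a minimal sketch (my own illustration, not code from the paper) of cosine scoring followed by the affine calibration llr = a*s + b and a Bayes decision; the embedding dimension, the calibration parameters, and the prior/cost values below are arbitrary assumptions.

```python
import numpy as np

def cosine_score(x_i, x_j):
    """Raw score s_ij: inner product of the length-normalized embeddings."""
    x_i = x_i / np.linalg.norm(x_i)
    x_j = x_j / np.linalg.norm(x_j)
    return float(x_i @ x_j)

def calibrate(score, a, b):
    """Affine calibration learned with linear logistic regression: llr = a * s + b."""
    return a * score + b

def bayes_decision(llr, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Accept the trial if the log-likelihood ratio exceeds the Bayes threshold."""
    threshold = np.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))
    return llr > threshold

# Toy usage with random embeddings; a and b would be learned on held-out trials.
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=256), rng.normal(size=256)
llr = calibrate(cosine_score(x_i, x_j), a=8.0, b=-2.0)
print(llr, bayes_decision(llr))
```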
0:02:45 | In this work we propose a different approach. One way to look at it is that the scale a can be thought of as simply assigning a constant magnitude to the unit-length embeddings, so every embedding gets the same magnitude. |
---|
0:03:06 | Instead, we suggest that it is probably better for every embedding to have its own magnitude, and we want to use a neural network to estimate the optimal value of those magnitudes. |
---|
0:03:19 | We also use a global offset so that the mapping produces log-likelihood ratios. |
---|
0:03:26 | Note that this new approach may result in a non-monotonic mapping, which means that it has the potential to not only produce calibrated scores but also to improve discrimination by increasing the separation between the classes. |
---|
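To contrast this with the global calibrator, here is a small sketch (my own illustration of the scoring rule just described, not code from the paper) of scoring with per-embedding magnitudes plus a global offset; the magnitudes m_i and m_j stand in for the outputs of the proposed magnitude estimation network.

```python
import numpy as np

def magnet_score(x_i, x_j, m_i, m_j, offset):
    """Score two embeddings with per-embedding magnitudes and a single global offset.

    m_i and m_j are the magnitudes the proposed network would assign to each
    embedding; offset is the learned global offset.  With m_i = m_j = sqrt(a) and
    offset = b this reduces to the global affine calibration a * s + b.
    """
    x_i = x_i / np.linalg.norm(x_i)
    x_j = x_j / np.linalg.norm(x_j)
    # Equivalent to the inner product of the scaled embeddings (m_i * x_i) and (m_j * x_j), plus the offset.
    return m_i * m_j * float(x_i @ x_j) + offset
```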
0:03:43 | To train this magnitude network we use a binary classification task, so we draw target and non-target trials from a training set, and the loss function is a weighted binary cross-entropy, where alpha is the prior of a target trial. |
---|
0:04:02 | Here l_ij is the log posterior odds, which can be decomposed in terms of the log-likelihood ratio plus the log prior odds. |
---|
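Below is a minimal sketch of a prior-weighted binary cross-entropy of this kind (my own paraphrase of the loss described here, not the paper's code); the prior value and the per-class weighting scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def prior_weighted_bce(llr, labels, alpha=0.05):
    """Prior-weighted binary cross-entropy over a batch of trials.

    llr    : predicted log-likelihood ratios for the trials
    labels : 1 for target trials, 0 for non-target trials
    alpha  : assumed prior probability of a target trial
    """
    # Log posterior odds = log-likelihood ratio + log prior odds.
    log_posterior_odds = llr + torch.log(torch.tensor(alpha / (1.0 - alpha)))
    ce = F.binary_cross_entropy_with_logits(log_posterior_odds, labels.float(), reduction="none")
    # Reweight so targets contribute with weight alpha and non-targets with 1 - alpha,
    # each averaged within its own class.
    p_tar = labels.float().mean()
    weights = torch.where(labels == 1, alpha / p_tar, (1.0 - alpha) / (1.0 - p_tar))
    return (weights * ce).mean()
```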
0:04:13 | The overall system architecture that we are going to use is trained in three steps. |
---|
0:04:20 | On the left is a block diagram of our baseline architecture. We use 2D convolutions in a ResNet architecture, followed by temporal pooling that produces a high-dimensional pooled activation, and then an affine layer acts as a bottleneck so that we can obtain the embedding, which is 256-dimensional. |
---|
0:04:47 | The star is used to denote the node of the network where the embedding is extracted. |
---|
0:04:52 | The model is trained using multiclass cross-entropy with a softmax classification head that uses an additive margin. |
---|
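For illustration, here is a loose PyTorch sketch of this kind of embedding extractor and additive-margin softmax head; the pooled dimension, embedding size, number of speakers, margin, and scale below are assumed values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingExtractor(nn.Module):
    """ResNet-style 2D-conv trunk -> temporal pooling -> affine bottleneck embedding."""
    def __init__(self, trunk: nn.Module, pooled_dim=4096, emb_dim=256):
        super().__init__()
        self.trunk = trunk                    # 2D conv ResNet over (batch, 1, freq, time)
        self.bottleneck = nn.Linear(pooled_dim, emb_dim)  # the "star" node output

    def forward(self, feats):
        h = self.trunk(feats)                 # (batch, channels, freq, time)
        h = h.flatten(1, 2)                   # (batch, channels*freq, time)
        pooled = torch.cat([h.mean(-1), h.std(-1)], dim=-1)  # mean+std pooling over time
        return self.bottleneck(pooled)        # pooled_dim must equal 2*channels*freq

class AdditiveMarginSoftmax(nn.Module):
    """Multiclass cross-entropy head with an additive margin on the cosine logits."""
    def __init__(self, emb_dim=256, num_speakers=6000, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # cosine to each class
        one_hot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * one_hot)          # margin on the true class
        return F.cross_entropy(logits, labels)
```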
0:05:02 | The first step of the training process is to use short segments to train the network. |
---|
0:05:08 | In the past we have seen this to be a good compromise, because the short sequences allow for good use of GPU memory with large batches, and at the same time they make the task harder, so that even though we have a very powerful classification head we still get errors and can back-propagate useful gradients. |
---|
0:05:32 | As the second step, we propose to freeze the most memory-intensive layers, which are typically the early layers that operate at the frame level, and then fine-tune the post-pooling layers with whole recordings, using the full sequence of the audio recording, which might be up to two minutes of speech. |
---|
0:05:54 | By freezing the pre-pooling layers we reduce the memory demands, and therefore we can use the long sequences; we also avoid overfitting to the easier problem posed by the long sequences. |
---|
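A rough sketch of how this second, full-length refinement stage could look (hypothetical module names such as model.trunk and model.bottleneck; the paper does not prescribe this code): the frame-level trunk is frozen and run without gradients, and only the post-pooling parameters are updated on whole recordings.

```python
import torch

def pool(h):
    """Mean + standard-deviation pooling over the time axis of the frame activations."""
    h = h.flatten(1, 2)
    return torch.cat([h.mean(-1), h.std(-1)], dim=-1)

def full_length_refinement_step(model, loss_head, optimizer, full_recording_feats, labels):
    """One fine-tuning step on whole recordings with the frame-level trunk frozen."""
    # Freeze the memory-intensive frame-level layers (assumed to live in model.trunk).
    for p in model.trunk.parameters():
        p.requires_grad_(False)

    optimizer.zero_grad()
    # No graph is built for the frozen trunk, so long sequences fit in GPU memory.
    with torch.no_grad():
        frame_activations = model.trunk(full_recording_feats)
    emb = model.bottleneck(pool(frame_activations))   # only post-pooling layers get gradients
    loss = loss_head(emb, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```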
0:06:11 | Finally, in the third step we train the magnitude estimation network. |
---|
0:06:17 | The first thing we do is discard the multiclass classification head, and we use a binary classification loss instead. |
---|
0:06:27 | We use a Siamese structure, which is depicted here by drawing the network twice, but the parameters are shared; this is just for illustration purposes. |
---|
0:06:36 | Notice that we also freeze the affine layer corresponding to the embedding; this is denoted by the grey colour. |
---|
0:06:45 | So at this point we are fixing the embeddings, and we add a magnitude estimation network that takes the pooling activations, which are very high dimensional, and learns a scalar magnitude that, together with the unit-length x-vector, is optimized to minimize the binary cross-entropy. |
---|
0:07:09 | We also keep the global offset as part of the optimization problem. |
---|
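Putting this third stage together, here is a rough sketch of a feed-forward magnitude estimation network on top of the frozen extractor, together with the trial score that feeds the weighted binary cross-entropy from before; the layer widths and the Softplus positivity constraint are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeNet(nn.Module):
    """Feed-forward network mapping the high-dimensional pooling activation to one scalar magnitude."""
    def __init__(self, pooled_dim=4096, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pooled_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # keep the magnitude positive (an assumption)
        )

    def forward(self, pooled):
        return self.net(pooled).squeeze(-1)

def trial_llr(pooled_i, pooled_j, emb_i, emb_j, mag_net, offset):
    """Trial score: product of the two estimated magnitudes times the cosine similarity,
    plus the single learned global offset."""
    m_i, m_j = mag_net(pooled_i), mag_net(pooled_j)
    cos = F.cosine_similarity(emb_i, emb_j, dim=-1)
    return m_i * m_j * cos + offset

# In this stage only mag_net and the scalar offset are optimized (with the weighted
# binary cross-entropy over target/non-target trials); the embedding extractor,
# and hence pooled_i/j and emb_i/j, stays frozen from the earlier stages.
```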
0:07:16 | To validate our ideas we use the following setup. |
---|
0:07:21 | As our baseline system we use a modification of the ResNet34 x-vector topology proposed by Zeinali and colleagues. |
---|
0:07:33 | The modification we make is to allocate more channels to the early layers, because we have seen that this improves performance. |
---|
0:07:42 | At the same time, to control the number of parameters, we change the expansion rates of the different layers so that we do not increase the channels as much in the deeper layers, and in that way we control the number of parameters without degrading performance. |
---|
0:08:00 | To train the DNN we use the VoxCeleb2 dev data, which comprises about six thousand speakers and a million utterances, and this is wideband audio at sixteen kHz. |
---|
0:08:12 | Note that we process the data differently when we use it for the short-segment stage and for the full-length refinement, in terms of how we apply augmentations, and I refer you to the paper for the details; those are very important for good performance and also for generalization. |
---|
0:08:29 | To make sure that we do not overfit to a single evaluation set, we benchmark against four different sets. |
---|
0:08:38 | Speakers in the Wild and VoxCeleb1 are acoustically close to VoxCeleb2, so there is not much domain mismatch between those two evaluation sets and the training data. |
---|
0:08:51 | The SRE19 audio-from-video portion and CHiME-5 have some domain shift compared to the training data, and I will point it out in the results later. |
---|
0:09:01 | Mostly this is, in the case of SRE19, because the test audio contains multiple speakers and there is a need for diarization, and in the CHiME-5 case there are far-field microphone recordings with a lot of overlapping speech and higher levels of reverberation, so it is a very challenging setup. |
---|
0:09:22 | Also, the CHiME-5 results will be split between the close-talking microphone and the far-field microphones. |
---|
0:09:31 | Let us start by looking at the baseline system that we are proposing. |
---|
0:09:35 | We present results in terms of equal error rate and two other operating points; we do this to facilitate the comparison with prior work. |
---|
0:09:45 | If you look at the right of the table, we list the best single-system (no fusion) numbers that we were able to find in the literature for all the benchmarks. |
---|
0:09:56 | Not all of the operating points were reported in prior work, but our baseline seems to do a good job compared to it, outperforming it at most of the operating points. |
---|
0:10:12 | Note that we are not doing any particular tuning for each evaluation set. |
---|
0:10:17 | The one small caveat is that, as I said, SRE19 requires diarization, so we diarize the test segments, and then for each detected speaker we extract an x-vector. |
---|
0:10:31 | We then score the enrollment against all the test x-vectors and use the maximum score for the trial. |
---|
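A quick sketch of this enrollment-versus-diarized-test scoring rule (my own illustration of the procedure just described; score_fn would be the calibrated cosine scoring used elsewhere):

```python
def score_diarized_trial(enroll_xvector, test_cluster_xvectors, score_fn):
    """Score an enrollment x-vector against the x-vector of every diarized speaker
    cluster in the test recording and keep the maximum as the trial score."""
    return max(score_fn(enroll_xvector, x) for x in test_cluster_xvectors)
```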
0:10:43 | To check the improvement that the full-length refinement of the second stage brings, we can compare against the baseline in this table. |
---|
0:10:56 | Overall we see positive trends across all the datasets and operating points, but the gains are larger for Speakers in the Wild, and this makes sense because it is the set for which the evaluation data has a longer duration compared to the four-second segments that were used to train the DNN. |
---|
0:11:17 | This validates the recent findings of our Interspeech paper, in which we saw that full-length refinement is a good way to mitigate the duration mismatch between the training phase and the test phase. |
---|
0:11:36 | Regarding the magnitude estimation network, we explored multiple topologies. |
---|
0:11:42 | All of them were feed-forward architectures, and we explored variations in depth and width; here we present three representative cases that differ in terms of the number of layers and the width of the layers, with the parameter count going from 1.5 million to 20 million. |
---|
0:12:01 | When we compare the performance of these three architectures across all the tasks, we do not see large changes, so the performance is quite stable across networks, which is probably a strength. |
---|
0:12:17 | To strike a good trade-off between the number of parameters and performance, we are going to use the second, medium-sized architecture for the remaining experiments. |
---|
0:12:31 | Here we present the overall gains in discrimination due to the three stages. |
---|
0:12:39 | In the graphs, the horizontal axis shows the different benchmarks. |
---|
0:12:47 | We place the far-field microphone conditions in a different plot just to facilitate the visualization, because they are in a different dynamic range. |
---|
0:12:58 | On the vertical axis we depict one of the cost operating points, and the colour coding indicates the system: one colour for the baseline, orange for the full-length refinement applied to that baseline, and grey for the magnitude estimation applied on top of the full-length refinement. |
---|
0:13:20 | Overall we can see that both additional training stages, the full-length refinement and the magnitude estimation, produce gains, and we see that across all data sets. |
---|
0:13:32 | In terms of EER we are getting about a twelve percent gain, and for the other two operating points we are getting gains of around twenty percent on average. |
---|
0:13:43 | Even though I am only showing one operating point here, in the paper you can find the results for the other two operating points. |
---|
0:13:52 | So finally, let us look at the calibration results. |
---|
0:13:57 | Both the global calibrator and the magnitude network are trained on the VoxCeleb2 dev dataset. |
---|
0:14:04 | This is a very good match for the VoxCeleb1 and Speakers in the Wild evaluation sets, but it is not such a good match for CHiME-5 and SRE19, where, as discussed before, there is a domain shift. |
---|
0:14:17 | Looking at the global calibrator, we can see that we obtain good calibration performance, in terms of the actual cost compared to the minimum cost, for both VoxCeleb1 and Speakers in the Wild. |
---|
0:14:29 | But when we move to the other datasets, we struggle to obtain good calibration with the global calibrator. |
---|
0:14:37 | Looking at the magnitude estimation network, we see a similar trend: for VoxCeleb1 and Speakers in the Wild we obtain very good calibration, but the system also struggles on the other sets. |
---|
0:14:51 | I think that a fair statement is to say that the magnitude estimation does not solve the domain shift, but it outperforms the global linear calibration at all the operating points and for all data sets. |
---|
0:15:06 | To gain some understanding of what the magnitude estimation is doing, we did some analysis. |
---|
0:15:14 | The bottom plot on the right shows the histogram of the cosine scores for the non-target and the target distributions; the red colour indicates the non-target scores and the blue colour indicates the target scores. |
---|
0:15:29 | The top two panels show the cosine score plotted against the product of the magnitudes of the two embeddings involved in the trial. |
---|
0:15:42 | The horizontal line indicates the global scale, or magnitude, that the global calibrator assigns to every embedding. |
---|
0:15:53 | The scores used for this analysis are from the Speakers in the Wild evaluation. |
---|
0:15:58 | Since the magnitude estimation network improves discrimination, we expect two trends. |
---|
0:16:05 | For the low-cosine-score targets, we expect that the product of the magnitudes should be bigger than the global scale. |
---|
0:16:15 | On the other hand, for the high-cosine-score non-target trials, we expect the opposite, that is, that the product of the magnitudes will be smaller than the global scale. |
---|
0:16:29 | The expected trends are actually present in these plots. |
---|
0:16:32 | If we look at the top plot, we see that there is an upward tilt, and the magnitudes for the low cosine scores tend to be above the constant magnitude that would be assigned by the global calibrator. |
---|
0:16:50 | On the other hand, we see that a large portion of the non-targets are under the global scale, and the ones that get very high cosine scores are also quite attenuated. |
---|
0:17:04 | This is consistent with the observation that the magnitude estimation network improves discrimination. |
---|
0:17:10 | So, to conclude, we have introduced a magnitude estimation network together with a global offset. |
---|
0:17:17 | The idea is to assign a magnitude to each one of the unit-length x-vectors that are trained with an angular margin softmax. |
---|
0:17:27 | The resulting scaled x-vectors can be directly compared using inner products to produce calibrated scores, and we have also seen that this increases the discrimination between speakers. |
---|
0:17:39 | Although the domain shift still remains a challenge, these are significant improvements: the proposed system outperforms a very strong baseline on the four common benchmarks that we presented. |
---|
0:17:53 | We also further validated the use of full-recording refinement to help with the duration mismatch that we usually have between the training and test phases. |
---|
0:18:05 | If you found this work interesting, I suggest that you also take a look at the related concurrent work from our group that is being presented at this workshop. |
---|
0:18:18 | If you have any questions, you can reach me at my email, and I look forward to interacting with you in the live sessions. |
---|
0:18:28 | Thanks for your time. |
---|