0:00:14 | Hi. |
0:00:15 | This is a presentation from the LEAP lab, Indian Institute of Science, Bangalore. |
0:00:22 | I will be presenting our paper, the LEAP system for the SRE19 CTS challenge: improvements in data analysis. |
0:00:31 | The goal of this paper [inaudible]. |
0:00:42 | Let's go on to the outline of this presentation. |
0:00:45 | I will first introduce a brief overview of how speaker recognition systems work, |
0:00:51 | discuss the SRE19 challenge performance metrics, |
0:00:56 | talk about the front-end and back-end modeling in our systems, |
0:01:01 | discuss the results of these systems, |
0:01:04 | and then some analysis of post-evaluation results before concluding the presentation. |
0:01:12 | This is a brief overview of how speaker verification, or speaker recognition, systems work. |
0:01:19 | In the first phase, we take the raw speech and extract features like MFCCs from it. |
0:01:26 | These features are then processed with some voice activity detection and normalization. |
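The front-end step described here can be sketched in a few lines. This is a minimal illustration, not the exact pipeline used in the paper: a toy energy-based voice activity detector plus cepstral mean and variance normalization, with the feature matrix, threshold, and sizes all being assumed placeholders.

```python
import numpy as np

def energy_vad(frames, threshold_db=-30.0):
    """Flag frames whose log energy is within threshold_db of the loudest
    frame. A minimal energy-based voice activity detector; real systems
    use more robust detectors, but the idea is the same: drop silence."""
    log_e = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    return log_e > (log_e.max() + threshold_db)

def cmvn(features):
    """Cepstral mean and variance normalization over the utterance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-10
    return (features - mu) / sigma

# Toy usage: 100 frames of 20-dim "MFCC" features, half of them near-silent.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 20))
feats[50:] *= 0.001                     # simulate silent frames
voiced = energy_vad(feats)
clean = cmvn(feats[voiced])
```

Only the voiced frames survive, and the retained features end up zero-mean per dimension.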
0:01:33 | Then these features are given as input to train the deep neural network model parameters. |
0:01:40 | The most popular neural network based embedding extractors in the last few years have been the x-vector models. |
0:01:47 | Once the extractor training phase is done, we enter the PLDA training phase. |
0:01:54 | The extracted x-vectors have some processing done on them, like centering and LDA; |
0:02:00 | they are then unit length normalized before training the PLDA model. |
0:02:06 | Most popular state-of-the-art systems use a generative Gaussian PLDA model for the back-end system. |
0:02:14 | In the verification phase, we have a trial, which consists of an enrollment utterance and an utterance under test. |
0:02:22 | The objective of the speaker recognition system is to determine whether the test utterance belongs to the target speaker or a non-target speaker. |
0:02:34 | Thus, once we extract x-vector embeddings for the enrollment and test utterances, we compute log-likelihood ratio scores using the PLDA back-end model, |
0:02:47 | and using these scores we determine if the trial is a target one or a non-target one. |
0:02:57 | Let's look at the SRE19 performance metrics. |
0:03:01 | The NIST SRE challenge in 2019 consisted of two tracks: |
0:03:07 | the first one, speaker detection on conversational telephone speech, or CTS, |
0:03:13 | and the second was the multimedia speaker recognition. |
0:03:18 | Our work was on the first track, the CTS challenge. |
0:03:22 | The normalized detection cost function, or DCF, is defined as in equation 1: |
0:03:29 | C_Norm(beta, theta) = P_Miss(theta) + beta * P_FA(theta), |
0:03:38 | where P_Miss and P_FA are the probabilities of miss and false alarm, respectively. |
0:03:45 | A miss is when the speaker recognition system declares a target trial as a non-target one; that is, the system wrongly decides the enrollment and test utterances are not of the same speaker. |
0:04:00 | A false alarm is when a non-target trial is erroneously detected as a target trial. |
0:04:07 | P_Miss and P_FA are computed by applying a detection threshold theta on the log-likelihood ratios. |
0:04:15 | The primary cost metric of the NIST SRE19 for the conversational telephone speech is given by equation 2, |
0:04:24 | where beta_1 is equal to 99 and beta_2 is equal to 199. |
0:04:32 | The minimum detection cost, known as minDCF or C_min, is computed using the detection thresholds that minimize the detection cost. |
0:04:44 | Equation 3 aims to minimize equation 2 over the thresholds theta_1 and theta_2. |
0:04:52 | The equal error rate, EER, is the value of P_FA and P_Miss computed at the threshold at which P_FA and P_Miss are equal. |
0:05:02 | We report the results in terms of EER, C_min, and C_Primary for all of our systems. |
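The metrics described here can be sketched directly from their definitions. This is a minimal illustration on toy scores, with the beta values taken from the talk (99 and 199) and C_Primary assumed to be the average of the two minimum costs:

```python
import numpy as np

def error_rates(target_scores, nontarget_scores, threshold):
    """P_miss and P_fa for a given LLR detection threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)
    p_fa = np.mean(np.asarray(nontarget_scores) >= threshold)
    return p_miss, p_fa

def min_dcf(target_scores, nontarget_scores, beta):
    """Min over thresholds of C_norm(beta, theta) = P_miss + beta * P_fa."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    return min(p_m + beta * p_f
               for p_m, p_f in (error_rates(target_scores, nontarget_scores, t)
                                for t in thresholds))

def eer(target_scores, nontarget_scores):
    """Sweep thresholds until P_miss crosses P_fa; return their average."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    for t in thresholds:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        if p_miss >= p_fa:
            return (p_miss + p_fa) / 2
    return 0.5

# Toy scores: well-separated target and non-target LLRs.
tgt = np.array([2.0, 3.0, 4.0, 5.0])
non = np.array([-4.0, -3.0, -2.0, 1.0])
c_primary = (min_dcf(tgt, non, 99) + min_dcf(tgt, non, 199)) / 2
```

With perfectly separable toy scores, both the EER and the minimum costs are zero.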
0:05:11 | The SRE19 evaluation set consisted of over 2.5 million trials from 14,561 segments. |
0:05:22 | Let's look at the front-end modeling in our systems. |
0:05:26 | We trained three x-vector models with different subsets of the training data, which are described in the next slide. |
0:05:34 | We used the extended time delay neural network architecture. |
0:05:39 | The extended TDNN architecture consisted of twelve hidden layers and ReLU nonlinearities. |
0:05:46 | The model is trained to discriminate among the speakers in the training set. |
0:05:52 | The first ten hidden layers operate at the frame level, while the last two operate at the segment level. |
0:05:59 | There is a 1500-dimensional statistics pooling layer between the frame-level and segment-level layers; it computes the mean and standard deviation. |
0:06:11 | After training, embeddings are extracted from the 512-dimensional affine component of the eleventh layer, which is the first segment-level layer. |
0:06:22 | These embeddings are the x-vectors we use. |
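The statistics pooling step described here maps a variable-length sequence of frame-level activations to one fixed-length segment vector. A minimal sketch, with the frame count and dimensions as illustrative assumptions rather than the exact network sizes:

```python
import numpy as np

def stats_pooling(frame_outputs):
    """Map variable-length frame-level activations to a fixed segment
    vector by concatenating the per-dimension mean and standard
    deviation, doubling the dimension in the process."""
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])

# Toy usage: 200 frames of 750-dim activations -> one 1500-dim vector.
frames = np.random.default_rng(1).normal(size=(200, 750))
pooled = stats_pooling(frames)
```

The pooled vector is what the segment-level layers, and ultimately the x-vector embedding, are computed from.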
0:06:28 | This table describes the details of the training and development datasets used in the SRE19 evaluation systems. |
0:06:38 | XVec1, the x-vector-1 model, was trained entirely on the VoxCeleb corpus. |
0:06:46 | XVec2 used the Mixer 6 and Switchboard corpora. |
0:06:52 | XVec3 was the full x-vector system, which was trained on both VoxCeleb and previous SRE datasets. |
0:07:02 | The data partitions used in the back-end models of the individual systems submitted are indicated in table 2. |
0:07:14 | Now let's look at the back-end model. |
0:07:18 | Most of the popular systems in speaker verification use the generative Gaussian PLDA, or G-PLDA, as the back-end modeling approach. |
0:07:28 | Once the x-vectors are extracted, there is some preprocessing done on them: they are centered (the mean is removed), transformed using LDA, and then unit length normalized. |
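The preprocessing chain just described is only a few matrix operations. A minimal sketch, where the mean and LDA projection would be estimated on training data but are random placeholders here, and the 512-to-170 dimension reduction is an assumed example:

```python
import numpy as np

def preprocess(xvectors, mean, lda_matrix):
    """Center, LDA-project, and unit-length-normalize a batch of x-vectors."""
    centered = xvectors - mean                    # centering
    projected = centered @ lda_matrix             # LDA dimensionality reduction
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-10)   # unit length normalization

rng = np.random.default_rng(2)
x = rng.normal(size=(10, 512))
mean = x.mean(axis=0)
lda = rng.normal(size=(512, 170))   # placeholder projection, 512 -> 170
processed = preprocess(x, mean, lda)
```

After this step every embedding lies on the unit sphere, which is the form the PLDA model is trained on.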
0:07:41 | The PLDA model on this processed x-vector of a particular recording is given by equation 4. |
0:07:49 | Here eta_r is the x-vector for the particular recording, omega is the latent speaker factor with a Gaussian prior, Phi characterizes the speaker subspace matrix, and epsilon_r is a Gaussian residual. |
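The generative model in equation 4 can be sketched as a sampler: eta_r = mu + Phi @ omega + eps_r. The sizes, the subspace matrix, and the noise level below are illustrative placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, spk_dim = 170, 50                        # illustrative sizes

mu = rng.normal(size=dim)                     # global mean
phi = 0.5 * rng.normal(size=(dim, spk_dim))   # speaker subspace matrix
sigma = 0.1                                   # residual std (kept diagonal/iso)

def sample_recording(omega):
    """Equation 4 as a generative sketch: eta_r = mu + Phi @ omega + eps_r,
    where omega is the latent speaker factor and eps_r is Gaussian noise."""
    eps = rng.normal(scale=sigma, size=dim)
    return mu + phi @ omega + eps

# Two recordings of the same speaker share one omega, so they land close.
omega = rng.normal(size=spk_dim)
rec_a = sample_recording(omega)
rec_b = sample_recording(omega)
other = sample_recording(rng.normal(size=spk_dim))
```

Same-speaker recordings differ only by the small residual, while different speakers differ through the subspace term, which is the structure the verification score exploits.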
0:08:06 | For the scoring, a pair of x-vectors, one from the enrollment recording, denoted eta_e, and one from the test recording, denoted eta_t, are used with the G-PLDA model to compute the log-likelihood ratio score given in equation 5. |
0:08:27 | Equation 5 is a quadratic function in eta_e and eta_t. |
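The quadratic form of the PLDA log-likelihood ratio can be written out directly. In this sketch the matrices P and Q, which would be derived from the trained PLDA parameters, are stand-in placeholders chosen only to show the shape of the computation:

```python
import numpy as np

def plda_llr(eta_e, eta_t, P, Q, const=0.0):
    """G-PLDA log-likelihood ratio as a quadratic function of the pair:
    s = eta_e' Q eta_e + eta_t' Q eta_t + 2 eta_e' P eta_t + const."""
    return (eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
            + 2.0 * eta_e @ P @ eta_t + const)

rng = np.random.default_rng(4)
d = 8
P = np.eye(d)           # placeholder cross term
Q = -0.5 * np.eye(d)    # placeholder quadratic term
enroll = rng.normal(size=d)
test_same = enroll + 0.01 * rng.normal(size=d)   # near-duplicate embedding
test_diff = -enroll                              # very different embedding
```

Even with placeholder matrices, a matching pair scores higher than a mismatched one, which is the behavior the trial decision thresholds.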
0:08:32 | Along with the G-PLDA approach, we proposed a neural PLDA model, or NPLDA, for back-end modeling. |
0:08:45 | What we have here is a pairwise discriminative network. |
0:08:50 | The blue portion of the network corresponds to the enrollment embeddings, and the pink portion of the network corresponds to the test embedding. |
0:09:01 | We construct the preprocessing steps of the generative G-PLDA as layers in the neural network: |
0:09:10 | LDA as the first affine layer, |
0:09:14 | then unit length normalization as a nonlinear activation, |
0:09:18 | and then PLDA centering and diagonalization as another affine transformation. |
0:09:25 | The final pairwise scoring, which is given in equation 5 in the previous slide, is implemented as a quadratic layer. |
0:09:36 | The parameters of this model are optimized using an approximation of the minimum detection cost function, or minDCF. |
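The NPLDA-style forward pass and the smoothed cost can be sketched as plain functions. Everything below is an illustrative assumption: the parameters are random placeholders (in the actual system they are initialized from the generative G-PLDA and then trained discriminatively), and the sigmoid-smoothed cost is one common way to approximate the hard minDCF:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nplda_score(eta_e, eta_t, W1, b1, W2, b2, P, Q):
    """Pairwise discriminative (NPLDA-style) forward pass: affine (LDA-like)
    -> unit length norm as the nonlinearity -> affine (centering/diagonal-
    izing) -> quadratic scoring layer."""
    def branch(x):
        h = W1 @ x + b1                  # first affine layer (LDA)
        h = h / np.linalg.norm(h)        # length normalization
        return W2 @ h + b2               # second affine layer
    e, t = branch(eta_e), branch(eta_t)
    return e @ Q @ e + t @ Q @ t + 2.0 * e @ P @ t

def soft_detection_cost(scores, labels, beta, theta, alpha=10.0):
    """Differentiable approximation of C_norm = P_miss + beta * P_fa:
    the hard threshold count is replaced by a steep sigmoid."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(sigmoid(alpha * (theta - scores[labels == 1])))
    p_fa = np.mean(sigmoid(alpha * (scores[labels == 0] - theta)))
    return p_miss + beta * p_fa

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(4, 6)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
P, Q = np.eye(4), -0.5 * np.eye(4)
s = nplda_score(rng.normal(size=6), rng.normal(size=6), W1, b1, W2, b2, P, Q)
```

Because the smoothed cost is differentiable in the scores, gradients flow through the quadratic layer back into all the preprocessing layers.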
0:09:49 | Now let's look at our submitted systems and the results. |
0:09:54 | The table here shows details about the seven individual models that we submitted, and a couple of fusion systems. |
0:10:04 | The best individual system was the combination of XVec3, which is the full x-vector extractor, with the proposed NPLDA model. |
0:10:16 | For the SRE18 development set, it had a score of 5.31% EER and 0.28 C_min, |
0:10:25 | and the best scores for the SRE19 evaluation were 4.97% EER and 0.42 C_min. |
0:10:35 | The fusion systems gave some gains over the individual systems. |
0:10:41 | Overall, the full x-vector system XVec3 performs significantly better than the VoxCeleb-based XVec1 and the XVec2 systems, for any choice of back end. |
0:10:57 | The systems trained with the NPLDA back end beat their G-PLDA counterparts in C_Primary, and it is observed that the NPLDA model handles in-domain and out-of-domain data better than the Gaussian PLDA. |
0:11:13 | Let's talk about some post-evaluation experiments and analysis. |
0:11:18 | One of the factors that we found we did not do optimally was calibration. |
0:11:24 | In our previous work for SRE18, we proposed an alternative approach to calibration, where the target and non-target scores were modeled as Gaussian distributions with a shared variance. |
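Under that model, the calibrated log-likelihood ratio is an affine function of the raw score, so fitting the calibration reduces to estimating two means and a pooled variance. A minimal sketch of that idea (the toy scores are illustrative):

```python
import numpy as np

def gaussian_calibration(target_scores, nontarget_scores):
    """Model target and non-target scores as Gaussians with a shared
    variance; the calibrated LLR of a raw score s is then a*s + b.
    Returns (a, b)."""
    t = np.asarray(target_scores, dtype=float)
    n = np.asarray(nontarget_scores, dtype=float)
    mu_t, mu_n = t.mean(), n.mean()
    var = np.concatenate([t - mu_t, n - mu_n]).var()   # pooled variance
    a = (mu_t - mu_n) / var
    b = (mu_n ** 2 - mu_t ** 2) / (2.0 * var)
    return a, b

# Toy usage: classes that differ only in mean, symmetric about zero.
tgt = np.array([4.0, 5.0, 6.0])
non = np.array([-6.0, -5.0, -4.0])
a, b = gaussian_calibration(tgt, non)
calibrated = a * 0.0 + b     # calibrated LLR of a raw score of 0
```

The catch discussed on this slide is that a and b are only as good as the development data they are fitted on; a mismatched development set yields a mis-placed threshold.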
0:11:39 | As SRE19 did not have an explicitly matched development dataset provided, the aforementioned calibration using the SRE18 development dataset, when applied on SRE19, turned out to be ineffective. |
0:11:55 | This was done for all of our submitted systems, and thus the calibration was not as optimal as we wanted. |
0:12:03 | The graph on the right shows how the SRE18 development and SRE19 evaluation datasets are not matched, |
0:12:12 | and the thresholds chosen accordingly were not optimal for our submitted systems. |
0:12:21 | We performed some score normalization techniques to improve our scores. |
0:12:26 | We performed adaptive symmetric normalization, or AS-norm, using the SRE18 development unlabeled set as the cohort. |
0:12:36 | We achieved 24% relative improvement for XVec1, which is the VoxCeleb x-vector system, and 21% relative improvement for the full x-vector system XVec3, on the SRE18 development set. |
0:12:51 | We got comparatively lower but consistent improvements of about 14% on average across all our systems for the SRE19 evaluation set. |
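The AS-norm step can be sketched as two z-normalizations averaged together. The `top_k` value and the cohort scores below are illustrative assumptions; the system used the SRE18 development unlabeled set as the cohort:

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=3):
    """Adaptive symmetric score normalization (AS-norm) sketch: z-normalize
    the trial score with the statistics of the top-k highest-scoring cohort
    trials against the enrollment side and against the test side, then
    average the two normalized scores."""
    def z(stats):
        top = np.sort(np.asarray(stats))[-top_k:]    # adaptive cohort subset
        return (score - top.mean()) / (top.std() + 1e-10)
    return 0.5 * (z(enroll_cohort_scores) + z(test_cohort_scores))

# Toy usage: cohort scores against the enrollment and test sides.
enroll_cohort = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
test_cohort = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
normalized = as_norm(2.5, enroll_cohort, test_cohort)
```

Because each trial is normalized against its own adapted cohort statistics, scores become more comparable across trials, which is what makes a single calibration threshold work better.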
0:13:02 | The table shows the best values that we got for the SRE18 development and the SRE19 evaluation sets. |
0:13:11 | We got an EER of 4.7% and a C_min of 0.27 as best scores for the SRE18 development set, |
0:13:20 | and an EER of 4.51%, a C_min of 0.36, and a C_Primary of 0.39 for the SRE19 evaluation systems. |
0:13:33 | To summarize, we trained three x-vector extractors and back-end models on different partitions of the available datasets. |
0:13:42 | We also explored a novel discriminative back-end model called NPLDA, which is inspired by neural network architectures and the generative Gaussian PLDA model. |
0:13:54 | We observed that the NPLDA consistently outperforms the G-PLDA system for various datasets. |
0:14:02 | The errors that were caused by calibration with the mismatched development datasets were discussed, |
0:14:09 | as were the significant performance gains that were achieved by using the cohort-based AS-norm adaptive score normalization technique for various systems. |
0:14:21 | These are some of the references that we used. |
0:14:25 | Thank you. |