0:00:14 | i'm going to present this work about domain adaptation in speaker recognition |
---|
0:00:18 | and a labeling strategy performed from scratch, a new element in speaker recognition |
---|
0:00:26 | we want to carry out speaker recognition on a new domain without a decrease |
---|
0:00:30 | in detection performance |
---|
0:00:32 | thanks to adaptation techniques |
---|
0:00:35 | but we want |
---|
0:00:36 | to take into account the difficulties of the task in real-life situations |
---|
0:00:42 | the cost of data collection and also the cost of labeling the large |
---|
0:00:48 | available in-domain dataset |
---|
0:00:52 | so we assume that a unique and unlabeled in-domain development dataset is available |
---|
0:00:58 | possibly reduced in size in terms of speakers and also of segments per speaker |
---|
0:01:05 | this dataset is used to learn an adapted speaker recognition model |
---|
0:01:10 | first we want to know how the performance increases depending on the amount |
---|
0:01:15 | of unlabeled in-domain data |
---|
0:01:18 | in terms of segments |
---|
0:01:19 | and also of speakers or |
---|
0:01:23 | of sample size of segments per speaker |
---|
0:01:31 | second, estimating the optimal number of clusters is usually done thanks to another labeled in-domain |
---|
0:01:36 | dataset |
---|
0:01:37 | so, as a second point |
---|
0:01:42 | we want to |
---|
0:01:43 | carry out clustering without this requirement for preexisting |
---|
0:01:48 | in-domain |
---|
0:01:49 | labeled |
---|
0:01:51 | data |
---|
0:01:53 | this is explained later in this presentation |
---|
0:02:01 | this slide shows the back-end process for speaker recognition systems based on embeddings |
---|
0:02:08 | and the different adaptation techniques that can be included |
---|
0:02:13 | methods such as coral aim at |
---|
0:02:15 | transforming vectors to reduce the shift between target and out-of-domain distributions |
---|
0:02:21 | covariance alignment |
---|
0:02:23 | maps the feature distribution of the out-of-domain |
---|
0:02:29 | data to the target one |
---|
0:02:32 | leading to transformed out-of-domain data usable as pseudo in-domain data |
---|
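[Editor's note: the CORAL-style covariance alignment described above can be sketched in a few lines of NumPy. This is a minimal illustration under the usual CORAL definition (whiten with the out-of-domain covariance, re-color with the target one), not the speakers' exact implementation; the function name and arguments are assumptions.]

```python
import numpy as np

def coral_adapt(X_out, X_in, eps=1e-6):
    """Map out-of-domain embeddings X_out (n x d) so that their covariance
    matches that of in-domain embeddings X_in (m x d): whiten with the
    out-of-domain covariance, then re-color with the in-domain one."""
    d = X_out.shape[1]
    C_out = np.cov(X_out, rowvar=False) + eps * np.eye(d)
    C_in = np.cov(X_in, rowvar=False) + eps * np.eye(d)

    def mat_pow(C, p):
        # symmetric matrix power via eigendecomposition (C is SPD)
        w, V = np.linalg.eigh(C)
        return (V * np.maximum(w, eps) ** p) @ V.T

    A = mat_pow(C_out, -0.5) @ mat_pow(C_in, 0.5)  # whiten, then re-color
    return X_out @ A
```

After this transform, the empirical covariance of the adapted vectors matches the in-domain one, so they can serve as pseudo in-domain training data.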
0:02:40 | when speaker labels of in-domain samples are available |
---|
0:02:44 | supervised adaptation can be carried out |
---|
0:02:47 | that's the kind of |
---|
0:02:49 | approach |
---|
0:02:51 | best known as the linear interpolation between in-domain and out-of-domain plda parameters |
---|
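[Editor's note: the linear interpolation of PLDA parameters mentioned here can be sketched as below. A minimal illustration assuming Gaussian PLDA parameters stored as arrays; the keys `mu`, `B`, `W` (mean, between- and within-class covariances) are hypothetical names, not the speakers' code.]

```python
import numpy as np

def interpolate_plda(in_dom, out_dom, alpha=0.5):
    """Linearly interpolate Gaussian PLDA parameters between an in-domain
    and an out-of-domain model; alpha weights the in-domain side."""
    return {k: alpha * in_dom[k] + (1.0 - alpha) * out_dom[k]
            for k in ("mu", "B", "W")}
```

As the talk notes later, the same interpolation scheme can be generalized to earlier stages of the pipeline (LDA, whitening) by interpolating their means and covariances in the same way.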
0:02:58 | also score normalizations can be considered as an unsupervised adaptation |
---|
0:03:03 | as they use an unlabeled in-domain subset for the impostor cohort |
---|
0:03:09 | note that we generalize this interpolation of the plda parameters |
---|
0:03:14 | to all preprocessing stages of the system: lda and whitening |
---|
0:03:18 | this tactic improves performance by about one percent |
---|
0:03:22 | in all our experiments |
---|
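[Editor's note: the cohort-based score normalization mentioned above can be sketched as a symmetric S-norm. This is a minimal illustration under the standard definition, not necessarily the exact variant used by the speakers.]

```python
import numpy as np

def s_norm(raw_score, enroll_cohort, test_cohort):
    """Symmetric score normalization: the raw trial score is z-normalized
    twice, against the scores of the enrollment utterance and of the test
    utterance versus an unlabeled in-domain impostor cohort, then averaged."""
    e, t = np.asarray(enroll_cohort), np.asarray(test_cohort)
    return 0.5 * ((raw_score - e.mean()) / e.std()
                  + (raw_score - t.mean()) / t.std())
```

Because the cohort is just an unlabeled in-domain subset, this counts as unsupervised adaptation: no speaker labels are needed.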
0:03:29 | so how does the performance increase depending on the |
---|
0:03:33 | amount of data |
---|
0:03:36 | we carried out |
---|
0:03:37 | experiments |
---|
0:03:41 | focusing on the gain of adapted systems as a function of the available data |
---|
0:03:47 | three parameters are varied in the course of these experiments |
---|
0:03:54 | number of speakers |
---|
0:03:56 | segments per speaker |
---|
0:03:57 | and |
---|
0:03:58 | adaptation technique |
---|
0:04:02 | here is a description of the experimental setup for our |
---|
0:04:07 | analysis |
---|
0:04:09 | we use acoustic features consisting of twenty three cepstral coefficients |
---|
0:04:14 | cepstral mean normalization with a window size |
---|
0:04:16 | of three seconds |
---|
0:04:18 | then vad based on the c zero component |
---|
0:04:23 | the extractor of x-vectors is the one of the kaldi toolkit |
---|
0:04:28 | with an attentive statistics pooling layer |
---|
0:04:32 | this extractor is trained on switchboard and nist sre |
---|
0:04:36 | datasets |
---|
0:04:39 | using a five-fold data augmentation strategy with reverberation |
---|
0:04:46 | noise, music |
---|
0:04:48 | and babble from musan |
---|
0:04:52 | so the domain of interest is an arabic language, tunisian arabic |
---|
0:04:56 | as in the nist speaker recognition evaluations |
---|
0:05:00 | two thousand eighteen cmn2 |
---|
0:05:05 | and two thousand nineteen |
---|
0:05:06 | cts |
---|
0:05:10 | this language is absent from the nist speaker recognition training databases |
---|
0:05:15 | which leads to a domain mismatch |
---|
0:05:22 | the in-domain corpus for development and test is described in this table |
---|
0:05:28 | the development dataset is made of the enrollment and test segments derived from nist |
---|
0:05:32 | sre eighteen development and test sets |
---|
0:05:35 | and half of the enrollment segments derived from nist sre nineteen test |
---|
0:05:42 | the other half is set aside for making up the trial dataset of the test |
---|
0:05:47 | the fifty percent split takes genders into account so that both subsets are |
---|
0:05:52 | gender-balanced |
---|
0:05:54 | the test set contains trial pairs |
---|
0:05:57 | randomly and uniformly picked up with the constraint of being balanced by gender |
---|
0:06:03 | and of a target prior |
---|
0:06:04 | equal to one percent |
---|
0:06:07 | when analysing the adaptation strategy |
---|
0:06:10 | the number of speakers and the number of segments per speaker are varied |
---|
0:06:16 | in order to vary the total amount of segments and also |
---|
0:06:21 | given a fixed amount, to assess the impact of speaker class variability |
---|
0:06:26 | each time a subset is picked up from the three hundred and ten speaker |
---|
0:06:31 | development dataset and used for training the models |
---|
0:06:36 | the test dataset |
---|
0:06:38 | is fixed and only intended for testing |
---|
0:06:42 | three alternatives are considered and experimented |
---|
0:06:45 | a system applying unsupervised adaptation only |
---|
0:06:49 | a system applying supervised adaptation only |
---|
0:06:52 | and a system applying the full pipeline |
---|
0:06:55 | unsupervised then supervised |
---|
0:06:57 | the goal is to assess the usefulness |
---|
0:07:00 | of unsupervised techniques when speaker labels are available |
---|
0:07:07 | this figure shows the results of our analyses |
---|
0:07:12 | performance in terms of equal error rate of unsupervised and supervised |
---|
0:07:17 | adapted systems depending on the number of speakers |
---|
0:07:22 | and segments per speaker |
---|
0:07:25 | of the in-domain development dataset |
---|
0:07:28 | the case |
---|
0:07:30 | of all segments per speaker corresponds to keeping all segments of the selected speakers |
---|
0:07:36 | so the reported number is the mean |
---|
0:07:39 | the x axis is the number of speakers |
---|
0:07:42 | and each curve corresponds to a number of segments per speaker |
---|
0:07:47 | it can be observed that |
---|
0:07:49 | combining unsupervised and supervised adaptation is the best way; having labeled data does not |
---|
0:07:55 | make unsupervised techniques questionable |
---|
0:08:01 | also we observe that |
---|
0:08:03 | even with a small in-domain dataset, here fifty speakers, there is |
---|
0:08:08 | a significant gain of performance with adaptation compared to the baseline of |
---|
0:08:14 | twelve point one two percent |
---|
0:08:16 | now let's look at the dashed curves in the figure |
---|
0:08:21 | they correspond to a fixed total amount of segments |
---|
0:08:28 | for example |
---|
0:08:29 | this last row corresponds to the same amount of two thousand five hundred segments |
---|
0:08:37 | possibly |
---|
0:08:39 | fifty speakers and fifty segments |
---|
0:08:42 | per speaker, or one hundred |
---|
0:08:45 | speakers |
---|
0:08:48 | by sweeping the curve |
---|
0:08:51 | we can observe that |
---|
0:08:53 | given a total amount of segments, performance improves with the number of speakers |
---|
0:08:58 | gathering data from a few speakers, even with many utterances per speaker |
---|
0:09:03 | limits the gain of adapted systems |
---|
0:09:07 | now let's talk about clustering |
---|
0:09:10 | the goal is to automatically label the in-domain dataset by using |
---|
0:09:15 | unsupervised clustering, then identifying the provided classes |
---|
0:09:20 | with speaker labels |
---|
0:09:23 | the dataset x |
---|
0:09:26 | is clustered |
---|
0:09:27 | the result |
---|
0:09:29 | is an estimate of the actual speaker labels |
---|
0:09:34 | note that we use |
---|
0:09:36 | a preexisting labeled dataset y from in-domain data |
---|
0:09:40 | a plda model is computed |
---|
0:09:42 | using the out-of-domain training dataset |
---|
0:09:45 | then the score matrix of dataset x is used for carrying out |
---|
0:09:51 | an agglomerative hierarchical clustering, using it as |
---|
0:09:56 | a similarity matrix |
---|
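[Editor's note: the clustering step described here, agglomerative hierarchical clustering over a PLDA score matrix used as similarity, can be sketched with SciPy. A minimal illustration assuming higher scores mean more similar; not the speakers' exact implementation.]

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_labels(scores, n_clusters):
    """Cluster segments from their pairwise similarity matrix `scores`
    (e.g. PLDA scores) with agglomerative hierarchical clustering."""
    dist = scores.max() - scores            # similarity -> distance
    dist = 0.5 * (dist + dist.T)            # enforce symmetry
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Sweeping `n_clusters` then yields one candidate labeling per hypothesized number of speakers.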
0:09:59 | a given issue of this clustering problem is how to determine the actual number |
---|
0:10:05 | of classes |
---|
0:10:08 | by sweeping the number of clusters: for each number, a model is estimated which |
---|
0:10:12 | includes the adapted plda parameters |
---|
0:10:16 | and the preexisting in-domain labeled dataset y is used for error rate |
---|
0:10:21 | computation |
---|
0:10:27 | then we select the class labels corresponding to the number of classes q that minimizes |
---|
0:10:32 | the error rate |
---|
0:10:37 | the drawback of this approach is clear |
---|
0:10:42 | it actually requires a preexisting labeled development set that is not always available |
---|
0:10:46 | so a method from scratch, without labeled in-domain data, is expected |
---|
0:10:56 | so we propose a method for clustering the in-domain dataset and determining the |
---|
0:11:01 | optimal number of classes from scratch, without requirement of a preexisting in-domain labeled set |
---|
0:11:10 | here is the algorithm |
---|
0:11:11 | first |
---|
0:11:12 | this algorithm is identical to the previous one |
---|
0:11:15 | then |
---|
0:11:16 | for each number of classes q |
---|
0:11:18 | we assimilate each class to a speaker |
---|
0:11:21 | and build a trial key accordingly |
---|
0:11:27 | then we use |
---|
0:11:28 | this set of artificial keys |
---|
0:11:31 | for computing the error rate |
---|
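[Editor's note: building the artificial trial key from cluster labels can be sketched as below: a trial between two segments is labeled target iff both fall in the same cluster. Minimal illustration with a hypothetical function name.]

```python
import numpy as np

def trial_key(labels):
    """Artificial trial key from cluster labels: entry (i, j) is True
    (target trial) iff segments i and j share a cluster."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]
```

Scoring all pairs of segments against this key gives the pseudo error rate used in the sweep over the number of classes.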
0:11:37 | now we have to determine the optimal number of classes |
---|
0:11:42 | we use the elbow criterion, a well-known one in the field of clustering |
---|
0:11:48 | on display are the error rates and the criteria for determining the optimal number of |
---|
0:11:53 | clusters |
---|
0:11:55 | the reported values correspond to the loop of the algorithm from scratch |
---|
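[Editor's note: one common way to implement the elbow criterion mentioned here is to take the point of the error-rate curve farthest from the chord joining its endpoints. This is a generic sketch, not necessarily the exact criterion used in the talk.]

```python
import numpy as np

def knee_index(x, y):
    """Index of the elbow of a decreasing curve y(x): the point with the
    largest perpendicular distance to the chord joining the endpoints."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    p0 = np.array([x[0], y[0]])
    chord = np.array([x[-1], y[-1]]) - p0
    chord /= np.linalg.norm(chord)
    pts = np.stack([x, y], axis=1) - p0
    proj = np.outer(pts @ chord, chord)   # projections onto the chord
    return int(np.argmax(np.linalg.norm(pts - proj, axis=1)))
```

Applied to the pseudo-EER as a function of the number of clusters, the returned index points at the candidate number of classes.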
0:12:01 | we can see that the slope of equal error rate goes down, then it slows |
---|
0:12:05 | down in the neighbourhood, by excess, of the exact number of speakers |
---|
0:12:11 | which is |
---|
0:12:12 | two hundred and fifty |
---|
0:12:15 | moreover the values of the min dcf over the operating points |
---|
0:12:20 | reach local minima before converging to zero |
---|
0:12:25 | the first one in the same neighbourhood |
---|
0:12:31 | of two hundred and fifty |
---|
0:12:38 | so the elbow method finally gives around |
---|
0:12:42 | three hundred |
---|
0:12:44 | classes |
---|
0:12:45 | as the corner point, beyond which the dcf increases again |
---|
0:12:55 | now we display the performance of the adapted system using clustering from scratch as a function |
---|
0:13:01 | of the number of clusters |
---|
0:13:04 | compared to unsupervised adaptation and to supervised adaptation with the exact speaker labels |
---|
0:13:09 | with |
---|
0:13:12 | exact speaker labels and supervised adaptation, the equal error rate is around six percent, and |
---|
0:13:19 | with only unsupervised adaptation, performance is around seven percent |
---|
0:13:25 | and we can see the curve of the results obtained by varying the number of classes |
---|
0:13:33 | from the clustering |
---|
0:13:35 | from scratch that we propose |
---|
0:13:42 | we can see that the method overestimates the number of speakers but manages to |
---|
0:13:47 | attain interesting performance in terms of equal error rate and min dcf |
---|
0:13:53 | close to the performance |
---|
0:13:56 | with exact labels and supervised adaptation |
---|
0:14:03 | here are the results |
---|
0:14:05 | with various numbers of segments per speaker |
---|
0:14:09 | five, ten or more |
---|
0:14:11 | for example |
---|
0:14:13 | in the last line we can see the results obtained by clustering from scratch |
---|
0:14:17 | they are |
---|
0:14:18 | similar to those produced with a labeled development set |
---|
0:14:24 | but also close to the ones with the exact speaker labels |
---|
0:14:31 | now we conclude |
---|
0:14:35 | the analyses that we carried out |
---|
0:14:38 | show that improvement of performance is due to supervised but also unsupervised domain adaptation techniques |
---|
0:14:46 | like coral or plda interpolation |
---|
0:14:49 | these techniques combine well, one in the model domain |
---|
0:14:53 | the other in the feature domain, to achieve the best performance |
---|
0:14:59 | also |
---|
0:15:01 | it is observed that a small sample of in-domain data can significantly reduce the |
---|
0:15:05 | gap of performance |
---|
0:15:08 | above all when favoring the number of speakers |
---|
0:15:11 | rather than the number of segments per speaker |
---|
0:15:18 | lastly, a new approach for speaker labeling has been introduced here |
---|
0:15:23 | done from scratch |
---|
0:15:25 | without preexisting in-domain labeled data |
---|
0:15:29 | for clustering |
---|
0:15:31 | while actually achieving good performance |
---|
0:15:36 | thank you for your attention |
---|
0:15:38 | do not hesitate to contact us for more details on this study |
---|
0:15:41 | bye bye |
---|