0:00:15 | Good morning everybody. Today I am going to present our work on short-duration speaker modelling with phone adaptive training.
0:00:23 | I will first start with a brief introduction, in which I will explain the main motivation of our work.
0:00:31 | I will then continue by explaining our approach, phone adaptive training, present the experimental setup and results, and finally close with some conclusions and some future work directions.
0:00:45 | So, phonetic variation is a significant source of error in many speech processing applications, such as short-duration speaker verification and speaker diarization.
0:00:57 | In both cases we find ourselves having to deal with short-duration segments when we want to learn a speaker model.
0:01:04 | So let's suppose we have a UBM, and we want to learn a speaker model from it by MAP adaptation.
0:01:10 | If we use a long utterance, or we have plenty of data with which to estimate the speaker model, it will be close to the ideal speaker model, since the phonetic variation will be marginalised.
0:01:23 | But when we use a short utterance, for example three seconds, in the MAP adaptation process, then the estimated speaker model is far from the ideal speaker model, since the phonetic variation is not marginalised.
0:01:38 | So the objective of our work is to improve speaker modelling by decreasing the variation due to the phonetic content while increasing the speaker discrimination.
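As context for why scarce data hurts here: the standard relevance-MAP mean update (Reynolds-style adaptation; the talk does not spell out the exact variant used) interpolates between the adaptation data and the UBM:

$$\hat{\mu}_c = \alpha_c\,\bar{x}_c + (1-\alpha_c)\,\mu_c, \qquad \alpha_c = \frac{n_c}{n_c + r},$$

where $\bar{x}_c$ is the posterior-weighted mean of the adaptation data for mixture component $c$, $n_c$ the soft frame count, and $r$ the relevance factor. With a three-second utterance the counts $n_c$ are small, so $\alpha_c \approx 0$ and the adapted means barely move away from the UBM, which is exactly the effect described above.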
0:01:51 | To do this, we started from a method called speaker adaptive training (SAT), which is a technique commonly used in automatic speech recognition to reduce the speaker variation, in order to estimate models that are more phonetically discriminant and to get better estimates.
0:02:16 | So the idea is that in the original acoustic feature space we can discriminate between speakers and we can also discriminate between phones.
0:02:28 | In the SAT scenario, what SAT does is project the acoustic features into a space in which all the phonetic information is retained while the speaker information is discarded.
0:02:43 | If we interchange the roles of speakers and phones, we can reach the opposite result, that is, suppress the phonetic variation while emphasising the speaker variation.
0:02:59 | So we again have our original acoustic feature space, in which we can discriminate between phones and speakers, and what PAT, this inverted idea of SAT, does is project the features into an acoustic feature space in which all the speaker information is retained while we can no longer discriminate between phones.
0:03:20 | When PAT was first applied to speaker diarization, we obtained a relative increase in speaker discrimination of twenty-seven percent and a relative decrease in phone discrimination; speaker and phone discrimination were calculated through the Fisher score, as we will see later.
0:03:41 | However, the improvement in the diarization error rate was disappointing.
0:03:47 | So what we do now is try to optimise and evaluate PAT in isolation from the convoluted complexity of speaker diarization.
0:03:59 | This is done by addressing the problem of speaker modelling in the case where the training data is scarce, and by performing small-scale speaker verification experiments using a database that is manually labelled at the phonetic level.
0:04:21 | Moving on to our approach: for phone adaptive training we make extensive use of constrained maximum likelihood linear regression, CMLLR.
0:04:32 | CMLLR is a technique to reduce the mismatch between an adaptation dataset and an initial model by estimating an affine transformation.
0:04:48 | So let's assume we have a Gaussian mixture model with initial mean mu and initial covariance sigma. CMLLR estimates an n-by-n matrix A, where n is the dimension of the feature space, and an n-dimensional bias vector b.
0:05:08 | We can then adapt the mean and the covariance of our initial model through these equations: we shift the mean, and we calculate the covariance accordingly.
0:05:17 | The really important thing about CMLLR, which we exploit in PAT, is that we also have the possibility of transforming an acoustic feature, an adaptation feature, back towards our initial model by applying the inverse transformation.
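For reference, the relations just described can be written out as follows (the standard CMLLR form, reconstructed here from the description rather than copied from the slides). The model is adapted as

$$\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A\,\Sigma\,A^{\top},$$

and, equivalently, an observation $x$ can be mapped back towards the initial model through the inverse transformation

$$\hat{x} = A^{-1}(x - b).$$

The two views are consistent: if $x \sim \mathcal{N}(A\mu + b,\, A\Sigma A^{\top})$, then $A^{-1}(x - b) \sim \mathcal{N}(\mu, \Sigma)$.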
0:05:40 | So at the base of PAT there is CMLLR. As we said, PAT aims to suppress the phonetic variability in order to provide more speaker-discriminative features.
0:05:52 | Suppose we have a set of S speakers and P phones, a set of utterances parameterised by a set of acoustic features, and also a set of initial speaker models, which are Gaussian mixture models.
0:06:10 | What PAT does is estimate a set of Gaussian mixture models that are normalised across phones: for each phone p we aim to estimate a CMLLR transform that captures the phone variation across the speakers. This is done by solving a maximum likelihood problem, iteratively.
0:06:35 | So let's suppose we have a fixed number of iterations. In the initial iteration we start by giving as input the initial features and the initial speaker models.
0:06:51 | For each phone we estimate the CMLLR transform that is common to all the speaker models, and we estimate this transform by using the data for that particular phone from all the speakers.
0:07:07 | We then use the transform to normalise the feature vectors, using the transformation we saw before, that is, by applying the inverse transformation.
0:07:19 | The normalised features are then used to estimate the set of speaker models that are normalised across phones.
0:07:28 | If we are at the last iteration, we obtain the normalised features and, finally, the speaker models that are normalised across phones; otherwise we give the obtained features and the updated speaker models as input to the next iteration.
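A minimal runnable sketch of this loop, assuming single-Gaussian stand-ins for the speaker GMMs and a moment-matching stand-in for the true maximum-likelihood CMLLR estimation (which is iterative and considerably more involved); all function and variable names here are illustrative, not from the paper:

```python
import numpy as np
from scipy.linalg import cholesky, inv

def estimate_transform(frames, mu, sigma):
    """Moment-matching stand-in for ML CMLLR: returns (A, b) such that
    A^{-1}(x - b) has the initial model's mean and covariance."""
    m = frames.mean(0)
    S = np.cov(frames.T) + 1e-6 * np.eye(frames.shape[1])
    A = cholesky(S, lower=True) @ inv(cholesky(sigma, lower=True))
    return A, m - A @ mu

def phone_adaptive_training(frames, phones, speakers, n_iters=10):
    """frames: (N, d) features; phones, speakers: length-N label arrays."""
    feats = frames.astype(float).copy()
    for _ in range(n_iters):
        # A pooled Gaussian stands in for the set of speaker models.
        mu = feats.mean(0)
        sigma = np.cov(feats.T) + 1e-6 * np.eye(feats.shape[1])
        for p in np.unique(phones):
            idx = phones == p          # data for phone p from ALL speakers
            A, b = estimate_transform(feats[idx], mu, sigma)
            feats[idx] = (inv(A) @ (feats[idx] - b).T).T  # x' = A^{-1}(x - b)
        # Speaker models are re-estimated on the phone-normalised features.
        models = {s: (feats[speakers == s].mean(0),
                      np.cov(feats[speakers == s].T))
                  for s in np.unique(speakers)}
    return feats, models
```

In the real system the per-speaker GMMs from the previous iteration drive the CMLLR estimation, and the transforms are tied to acoustic classes rather than to individual phones, as explained next.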
0:07:51 | In our case, when we deal with short-duration utterances, we of course do not have much data with which to estimate a transform for each phone.
0:08:02 | So what we do is estimate a transform for an acoustic class, that is, a set of phones. The transforms and acoustic classes are determined using a binary regression tree, in which the root node is initialised with all the phones.
0:08:23 | According to linguistic rules we split these nodes, choosing the split that maximises the likelihood of the training data, and this is done as long as the increase in likelihood is above a fixed threshold.
0:08:41 | When the splitting stops, we calculate one transform for each acoustic class, and the phones in an acoustic class share the same transform.
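A rough sketch of how such a tree can be grown greedily. Here the split candidates and the likelihood criterion are simplistic stand-ins (the actual system splits according to linguistic questions and uses the CMLLR training-data likelihood), so treat every name as illustrative:

```python
import numpy as np

def gauss_loglik(x):
    """Total log-likelihood of frames x under a single diagonal Gaussian."""
    mu, var = x.mean(0), x.var(0) + 1e-6
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def grow_classes(frames_by_phone, threshold=50.0):
    """Greedily split the root class (all phones) while the best split's
    likelihood gain exceeds the threshold; returns the leaf phone classes."""
    leaves = [list(frames_by_phone)]
    stack = lambda c: np.vstack([frames_by_phone[q] for q in c])
    while True:
        best = None
        for leaf in leaves:
            if len(leaf) < 2:
                continue
            # Candidate splits: in the real system these come from linguistic
            # questions; here we just try leave-one-phone-out splits.
            for p in leaf:
                left, right = [p], [q for q in leaf if q != p]
                gain = (gauss_loglik(stack(left)) + gauss_loglik(stack(right))
                        - gauss_loglik(stack(leaf)))
                if best is None or gain > best[0]:
                    best = (gain, leaf, left, right)
        if best is None or best[0] < threshold:
            return leaves  # one CMLLR transform is then tied per leaf class
        _, leaf, left, right = best
        leaves.remove(leaf)
        leaves += [left, right]
```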
0:08:59 | so |
---|
0:09:00 | in our experimental setup so what we need the two |
---|
0:09:05 | to evaluate and we'll to my spotting ideal scenario |
---|
0:09:09 | is to a database you which we have short duration sentences |
---|
0:09:15 | a |
---|
0:09:16 | we have clear and accurate phonetic transcription and the we have a limit the level |
---|
0:09:23 | of noise and channel variation this is because we want to |
---|
0:09:26 | to see to estimate where |
---|
0:09:30 | performance of part in that the l and able to my scenario |
---|
0:09:33 | Taking these considerations into account, we concluded that the NIST databases do not fit our needs, due to the lack of phonetic transcriptions for the target speakers, to the channel variation, and to the different types of noise that compromise the recordings.
0:10:01 | So we based our choice on the TIMIT database, because it is a collection of high-quality, read-speech sentences; each sentence lasts three seconds on average and is manually transcribed at the phonetic level; and the database has very limited noise and no channel variation.
0:10:33 | The database is composed of six hundred and thirty speakers, of which four hundred and thirty-eight are male and one hundred and ninety-two are female, and each speaker contributes ten sentences with an average duration of three seconds.
0:10:49 | We divided the database so that the data from four hundred and sixty-two speakers were used to learn the UBM, while the recordings from the remaining one hundred and sixty-eight speakers were used for the automatic speaker verification experiments.
0:11:10 | PAT performance is analysed using from one to seven sentences to learn the speaker model.
0:11:21 | The first step was to segment our utterances into speech and non-speech segments according to the ground-truth transcriptions.
0:11:33 | We then extracted the features, which were canonical mel-frequency cepstral coefficients: twelve MFCCs plus energy, plus delta and acceleration coefficients.
0:11:43 | We then estimated speaker models by MAP adaptation from the UBM models, with model sizes from four to one thousand and twenty-four GMM components.
0:11:57 | Using the initial features and the initial speaker models, we applied PAT, starting from acoustic classes obtained from the initial set of thirty-eight phones, and we finally obtained our normalised features and our normalised speaker models.
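A sketch of this front-end using librosa for illustration; the talk does not name a feature-extraction toolkit, and HTK-style MFCCs would differ in detail (here c0 stands in for the energy term):

```python
import librosa
import numpy as np

def extract_features(wav_path):
    """Front-end sketch: 13 static coefficients (c0 standing in for the
    energy term) plus deltas and accelerations -> 39 dims per frame."""
    y, sr = librosa.load(wav_path, sr=16000)              # TIMIT is 16 kHz
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)
    delta = librosa.feature.delta(static)
    accel = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, accel]).T            # (n_frames, 39)
```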
0:12:22 | PAT performance was assessed on two different ASV systems: a traditional GMM-UBM system and a state-of-the-art i-vector PLDA system.
0:12:33 | We performed our baseline experiments with the initial set of features defined before, while the PAT experiments were performed using the phone-normalised features.
0:12:53 | Now to the experimental results. As I said before, to assess the speaker and the phone discrimination we decided to use the Fisher score.
0:13:05 | Suppose we have S classes and a set of labelled features, that is, each feature is annotated with the class it belongs to.
0:13:19 | The speaker or phone discrimination is calculated through the Fisher score, which has at the numerator the inter-class distance, where mu_i is the mean of each class, and at the denominator the intra-class distance, which basically represents the spread of the features around their own class mean.
0:13:47 | So if the inter-class distance increases, the numerator gets higher, while if the features are more spread out around their class means, the denominator gets higher.
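A standard form of the Fisher score consistent with this description (the exact normalisation used in the paper may differ) is

$$F = \frac{\sum_{i=1}^{S} \lVert \mu_i - \mu \rVert^2}{\sum_{i=1}^{S} \frac{1}{n_i} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2},$$

where $C_i$ is class $i$ with $n_i$ features and mean $\mu_i$, and $\mu$ is the global mean: the numerator grows with the inter-class distance and the denominator with the intra-class spread, so a larger $F$ means better class discrimination.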
0:14:10 | In our experiments we calculated the speaker discrimination and the phone discrimination after ten iterations of PAT, and we found that the speaker discrimination shows a relative increase of forty percent after ten iterations, while the phone discrimination shows a relative decrease of fifty percent.
0:14:31 | This is good, because it is in line with the results we obtained in our previous work.
0:14:42 | However, let us now pass to the automatic speaker verification experiments.
0:14:49 | As you can observe, we performed our speaker verification experiments using models from four to one thousand and twenty-four GMM components, both for the GMM-UBM system and for the i-vector PLDA system.
0:15:03 | The first thing to note is that the i-vector PLDA system performs much better than the GMM-UBM system; note that the scale of the two plots is different.
0:15:16 | We can also see that PAT performance is always better than that of the baseline system.
0:15:28 | Another thing to note is that for lower model complexity we can reach better or similar performance compared with the baseline, both when the models are trained with one sentence and with seven sentences.
0:15:48 | With seven sentences we achieve performance comparable to the baseline, but with lower model complexity: for example, PAT with far fewer GMM-UBM components reaches the same performance as the baseline using two hundred and fifty-six components, and likewise in the i-vector system PAT with that smaller model gives better performance than the baseline with two hundred and fifty-six components.
0:16:27 | In these two tables I present the results independently of the model size used, that is, the results obtained with the optimal model size for the speaker model.
0:16:46 | We can see that for the i-vector PLDA system there is more than a fifty percent improvement in performance, of course in an ideal and optimised environment, while for the GMM-UBM system we get better performance when using one and three training sentences, and comparable results when using five and seven sentences.
0:17:16 | In these two plots we can see the same results as before; these are the results when using one single sentence. We can see that we have a fifty percent decrease in the EER of the i-vector PLDA system, while in the GMM-UBM system the lines are less far apart, but we still have a decrease in the EER, from 4.2 to 3.6 percent.
0:17:47 | To conclude: in this work we addressed the problem of speaker modelling in the case where training data is scarce, that is, when using short-duration utterances.
0:17:58 | We optimised and evaluated PAT at the speaker modelling level by performing small-scale speaker verification experiments, using the TIMIT database, which is labelled at the phonetic level.
0:18:15 | We showed that PAT improves significantly the performance of both systems, GMM-UBM and i-vector PLDA.
0:18:28 | What is also worth noting is that PAT is able to provide equivalent or better performance using a lower model complexity.
0:18:41 | For the future work, we aim to go back to our original goal, which is PAT for speaker diarization.
0:18:49 | We want to explore automatic approaches to infer acoustic class transcriptions, because PAT does not actually need phonetic transcriptions: as long as we are able to label the data in a way that lets us map the features to a particular acoustic class, we can calculate the transform for that particular class and finally improve the performance of the system.
0:19:23 | One final goal is to investigate speaker-independent approaches to phone normalisation.
0:19:33 | Thank you for your attention.
0:20:02 | Q: Your i-vector extractor was trained with short sentences? And the PLDA as well was obviously trained with short sentences? So you didn't merge, for example, sentences from the same channel and speaker to create one big sentence and train the i-vector extractor that way?
0:20:21 | A: No; for example, for one speaker we could have taken all the sentences and put them together, but we did not.
0:20:26 | Q: OK, so just to understand: you used the short sentences directly. A: Exactly.
0:20:50 | Q: If we still have a couple of minutes: I think both RSR and TIMIT are much too simple databases for this, because the EER will be close to zero percent. So what are the challenges in, say, text-dependent verification? I mean, this should be applied everywhere if it works so well, but it is not being used in many systems, right? What are the challenges in real life?
0:21:40 | A: The goal of this work was to optimise and evaluate PAT in an ideal setting, because, as I said, we first tried to apply it to speaker diarization, but there we did not control the conditions enough, and the result was disappointing: we got only a very small improvement in the diarization error rate. So we said, OK, before dismissing PAT, let us try to find the upper limit of the performance it can reach.
0:22:12 | Q: But you have been using TIMIT, and there are many versions of TIMIT, with noise conditions or with telephone-bandwidth conditions; it has been transformed in many ways. So why don't you use those?
0:22:34 | A: Because, as I said, TIMIT has the phonetic transcriptions, and since we wanted to optimise PAT in the ideal condition, we chose the clean version.
0:22:47 | Q: I still think it would be interesting to see that.
0:22:57 | At the risk of repeating what we all know, very quickly: the major impediment to progress in this field is the lack of data.
0:23:12 | One site has been very generous in making its data available, but that is pretty much the only recent dataset that we have to work on.
0:23:26 | In my own experience, we are not going to be able to make progress, as far as I can see, unless we find some way of sharing resources among the researchers working on this problem.
0:23:48 | You are working with an industrial partner that probably collects some data; if there were a mutual benefit to sharing data, then we could probably make sure progress happens. Otherwise, I do not see how we are going to manage it.
0:24:30 | Thank you for those points, Patrick. I just want to mention that around Odyssey 2001, when we became Odyssey, there was considerable effort put towards creating standard text-dependent corpora to distribute to the participants.
0:24:54 | Several sites put together these nice text-dependent datasets; we distributed them to the Odyssey members in advance and planned to have a whole track on text-dependent speaker verification, and the sad news was that only a couple of sites participated.
0:25:13 | I think Craig Greenberg was implying maybe a similar issue with the HASR evaluations. A lot of these things have to be a two-way street: someone goes to the effort and expense of putting together corpora, and then you need a reasonable number of participants willing to take on the challenge.
0:25:31 | So if there has been a shift in interest towards text-dependent verification, I think it would be good, as a community, to get together, figure that out, and put together some evaluation.