0:00:15 Good morning everybody. I am going to present our work on short-duration speaker modelling with phone adaptive training (PAT).
0:00:23 I will first give a brief introduction in which I explain the main motivation of our work. I will then continue by explaining our approach, phone adaptive training, and present the experimental setup and results. I will finally conclude with some conclusions and some future work directions.
0:00:45 So, linguistic variation is a significant source of error in many speech processing applications, such as short-duration speaker verification and speaker diarization. In both cases we have to deal with short-duration segments when we want to learn a speaker model.
0:01:04 Let us suppose we have a UBM model and that we want to learn a speaker model from it by MAP adaptation. If we use a long utterance, that is, we have plenty of data, the estimated speaker model will be close to the ideal speaker model, since the phonetic variation will be marginalised. When we instead use a short utterance, for example three seconds, in the same MAP adaptation process, the estimated speaker model is far from the ideal speaker model, since the phonetic variation is not marginalised.
0:01:38 So the objective of our work is to improve speaker modelling by decreasing the variation due to the phonetic content while increasing the speaker discrimination.
0:01:51 To do this, we started from a method called speaker adaptive training (SAT), a technique commonly used in automatic speech recognition to reduce the speaker variation, in order to estimate models that are more phonetically discriminant and better estimated.
0:02:16 So the idea is this: in the original acoustic feature space we can discriminate between speakers, and we can also discriminate between phones. In the SAT scenario, what SAT does is to project the acoustic features into a space in which all the phonetic information is retained while the speaker information is discarded.
0:02:43 If we interchange the roles of speakers and phones, we can reach the opposite result, that is, suppress the phonetic variation while emphasizing the speaker variation. So we again have our original acoustic feature space, in which we can discriminate between phones and speakers, and what PAT, the analogue of SAT in this setting, does is to project the features into an acoustic feature space in which all the speaker information is retained while we can no longer discriminate between phones.
0:03:20 PAT was first applied to speaker diarization, where we obtained a relative increase in speaker discrimination of twenty-seven percent together with a clear relative decrease in phone discrimination; speaker and phone discrimination were computed with the Fisher score, as we will see later. However, the improvement in diarization error rate was disappointing.
0:03:47 What we do now in this work is to try to optimize and evaluate PAT in isolation from the convoluted complexity of speaker diarization. This is done by addressing the problem of speaker modelling in the case where the training data is scarce, and by performing small-scale speaker verification experiments using a database that is manually labelled at the phonetic level.
0:04:21 Let me proceed to our approach. For phone adaptive training we make extensive use of constrained maximum likelihood linear regression, CMLLR. CMLLR is a technique to reduce the mismatch between an adaptation dataset and an initial model by estimating an affine transformation.
0:04:48 Let us assume we have a Gaussian mixture model with initial mean mu and initial covariance sigma. CMLLR estimates an n-by-n matrix A, where n is the dimension of the feature space, and an n-dimensional bias vector b, such that we can adapt the mean and the covariance of our initial model: the mean is shifted through the affine transform and the covariance is transformed accordingly.
0:05:17 The really important property of CMLLR, and the reason we use it in PAT, is that it gives us the possibility of transforming, or normalizing, an acoustic feature so that it maps back to our initial model, by applying the inverse transformation.
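For reference, the adaptation and normalization equations referred to here were shown on the slide; a standard CMLLR formulation consistent with this description (my reconstruction, not a verbatim copy of the slide) is:

```latex
\hat{\mu} = A\mu + b, \qquad
\hat{\Sigma} = A\,\Sigma\,A^{\top}, \qquad
\hat{o}_t = A^{-1}\,(o_t - b)
```

The first two equations adapt the model, shifting the mean with the affine transform and transforming the covariance with the same matrix; the third is the equivalent feature-space normalization obtained through the inverse transformation.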
0:05:38 CMLLR is at the basis of PAT, which, as we said, aims to suppress the phonetic variability in order to provide more speaker-discriminative features.
0:05:52 Suppose we have a set of S speakers and P phones, a set of utterances parameterized by a set of acoustic features, and suppose also that we have a set of initial speaker models, which are again Gaussian mixture models.
0:06:10 What PAT does is to estimate a set of Gaussian mixture models that are normalized across phones. For each phone p we aim to estimate a CMLLR transform that captures the phone variation across the speakers. This is done by solving a maximum likelihood problem, and it is done iteratively.
0:06:35 Suppose we have a fixed number of iterations. In the first iteration we start by giving as input the initial features and the initial speaker models. For each phone we estimate the CMLLR transform, which is common to all the speaker models, and we estimate this transform by using the data for that particular phone from all the speakers.
0:07:07 We then use the transform to normalize the feature vectors, using the transformation we saw before, by applying the inverse transformation. The normalized features are then used to estimate the set of speaker models that are normalized across phones.
0:07:28 If we are at the last iteration, the outputs are the normalized features and, finally, the speaker models that are normalized across the phones. Otherwise, we feed the obtained features and the adapted speaker models back in as input to the next iteration.
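As a summary of the loop just described, here is a minimal sketch in Python. The helpers estimate_cmllr and map_adapt are hypothetical placeholders for the maximum-likelihood CMLLR estimation and the MAP re-estimation of the speaker models; this illustrates the structure of PAT, not the authors' actual implementation:

```python
import numpy as np

def phone_adaptive_training(features, phone_labels, speaker_labels,
                            initial_models, ubm, n_iterations=10):
    """Sketch of the PAT loop (helper functions are hypothetical)."""
    feats, models = features, initial_models
    for _ in range(n_iterations):
        normalised = np.empty_like(feats)
        for phone in np.unique(phone_labels):
            idx = phone_labels == phone
            # One maximum-likelihood CMLLR transform per phone, common to
            # all speaker models, estimated on this phone's data from all
            # the speakers.
            A, b = estimate_cmllr(feats[idx], speaker_labels[idx], models)
            # Normalise the features with the inverse transformation.
            normalised[idx] = (feats[idx] - b) @ np.linalg.inv(A).T
        # Re-estimate the speaker models on the phone-normalised features.
        models = {spk: map_adapt(ubm, normalised[speaker_labels == spk])
                  for spk in models}
        feats = normalised
    return feats, models
```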
0:07:51 In our case, when we deal with short-duration utterances, we of course do not have much data with which to estimate a transform for each phone. What we do instead is to estimate a transform for each acoustic class, that is, for a set of phones. The transforms and the acoustic classes are determined using a binary regression tree, whose root node is initialized with all the phones. According to linguistic rules, we split the nodes by choosing the split that maximizes the likelihood of the training data, and this is done as long as the increase in likelihood remains above a fixed threshold.
0:08:41 When we reach the leaves of the tree, we calculate a transform for each acoustic class, and the phones in an acoustic class share the same transform.
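The class-building step could be sketched as follows, assuming (hypothetically) that candidate_splits encodes the linguistic rules and that likelihood returns the training-data log-likelihood obtained with a single transform estimated on a node's phones:

```python
def build_acoustic_classes(phones, candidate_splits, likelihood, threshold):
    """Greedy binary regression tree over phones (illustrative only)."""
    classes, frontier = [], [tuple(phones)]  # root node holds all phones
    while frontier:
        node = frontier.pop()
        base = likelihood(node)
        best_gain, best_split = 0.0, None
        for left, right in candidate_splits(node):
            gain = likelihood(left) + likelihood(right) - base
            if gain > best_gain:
                best_gain, best_split = gain, (left, right)
        # Keep splitting only while the likelihood gain stays above the
        # fixed threshold; each surviving leaf becomes one acoustic class.
        if best_split is not None and best_gain > threshold:
            frontier.extend(best_split)
        else:
            classes.append(node)  # these phones share one CMLLR transform
    return classes
```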
0:08:59 For our experimental setup, what we need in order to evaluate and optimize PAT in an ideal scenario is a database in which we have short-duration sentences, clear and accurate phonetic transcriptions, and a limited level of noise and channel variation. This is because we want to estimate the performance of PAT in an ideal, optimal scenario.
0:09:33 Taking these considerations into account, we concluded that the NIST databases do not fit our needs, due to the lack of phonetic transcriptions for the target speakers, and because the channel variation and the different types of noise compromise the recordings.
0:10:01 We therefore based our choice on the TIMIT database, because it is a collection of high-quality read speech sentences. Each sentence lasts three seconds on average and is manually transcribed at the phonetic level; moreover, the database has very limited noise and there is no channel variation.
0:10:33 The TIMIT database is composed of 630 speakers, of which 438 are male and 192 are female, and each speaker contributes ten sentences with an average duration of three seconds.
0:10:49 We divided the database so that the data from 462 speakers was used to learn the UBM, while the recordings from the remaining 168 speakers were used for the automatic speaker verification experiments. PAT performance was analyzed using from one to seven sentences to learn each speaker model.
0:11:21 The first step was to segment our utterances into speech and non-speech segments according to the ground-truth transcriptions. We then extracted the features, which were canonical mel-frequency cepstral coefficients: 12 MFCCs plus energy, with delta and acceleration coefficients.
0:11:43 We then estimated speaker models by MAP adaptation from UBM models ranging from 4 to 1024 GMM components. Using the initial features and the initial speaker models, we applied PAT, starting from acoustic classes obtained from the initial set of 38 phones, and we finally obtained our normalized features and our normalized speaker models.
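For concreteness, the front-end just described could be sketched with librosa; extract_features is an illustrative name, and the 13 static cepstra below merely stand in for the exact 12-coefficients-plus-energy configuration of the talk:

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """MFCCs with delta and acceleration coefficients (39-dim frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # static cepstra
    delta = librosa.feature.delta(mfcc)                 # first derivatives
    accel = librosa.feature.delta(mfcc, order=2)        # second derivatives
    return np.vstack([mfcc, delta, accel]).T            # (frames, 39)
```

Selecting the speech frames according to the ground-truth transcriptions would then simply drop the non-speech frames before modelling.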
0:12:22 PAT performance was assessed on two different systems: a traditional GMM-UBM system and a state-of-the-art i-vector PLDA system. We performed our baseline experiments with the initial set of features defined before, while the PAT experiments were performed using the phone-normalized features.
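For readers less familiar with the GMM-UBM pipeline, here is a minimal scikit-learn sketch of UBM training, MAP adaptation of the means, and log-likelihood-ratio scoring. It follows the textbook recipe; the relevance factor of 16 is a common default, not a value quoted in the talk:

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=256):
    """Diagonal-covariance UBM; the talk sweeps 4 to 1024 components."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(features)

def map_adapt(ubm, features, relevance=16.0):
    """Classic MAP adaptation of the UBM means to one speaker's data."""
    resp = ubm.predict_proba(features)          # (T, K) responsibilities
    n_k = resp.sum(axis=0)                      # soft counts per component
    x_k = (resp.T @ features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]  # adaptation coefficients
    model = copy.deepcopy(ubm)
    model.means_ = alpha * x_k + (1.0 - alpha) * ubm.means_
    return model

def verification_score(model, ubm, features):
    """Average frame log-likelihood ratio of speaker model vs. UBM."""
    return float(np.mean(model.score_samples(features)
                         - ubm.score_samples(features)))
```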
0:12:53 Now for the experimental results. As I said before, to assess the speaker and phone discrimination we decided to use the Fisher score. Suppose we have a set of classes and a set of labelled features, meaning that each feature is annotated with the class it belongs to.
0:13:19 The speaker or phone discrimination is then calculated through the Fisher score, which has at the numerator the inter-class distance, where mu_c is the mean of each class, and at the denominator the intra-class distance, which basically represents the spread of the features around their own class means.
0:13:46 So if the inter-class distance increases, the numerator gets larger, while if the features are more spread out around their own means, the denominator gets larger; a higher Fisher score therefore indicates better class discrimination.
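Written out, a common form of the Fisher score consistent with this description is (my notation):

```latex
F \;=\; \frac{\sum_{c=1}^{C} n_c \,\lVert \mu_c - \mu \rVert^2}
             {\sum_{c=1}^{C} n_c \,\sigma_c^2}
```

where mu_c, sigma_c^2 and n_c are the mean, variance and number of features of class c, and mu is the global mean; a larger F means the classes are better separated relative to their internal spread.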
0:14:08 In our experiments we calculated the speaker discrimination and the phone discrimination after ten iterations of PAT, and we show that the speaker discrimination has a relative increase of forty percent over the ten iterations, while the phone discrimination has a relative decrease of fifty percent.
0:14:31 This is good, because it goes along with the results we obtained in our previous work.
0:14:42 Let me now pass to the automatic speaker verification experiments. As you can see, we ran our speaker verification experiments using models from 4 to 1024 GMM components, for both the GMM-UBM and the i-vector PLDA systems. The first thing to notice is that the i-vector PLDA system performs much better than the GMM-UBM system; the scale of the plots is different because of that. We can also see that PAT always gives better performance than the baseline system.
0:15:28 Another thing to note is that with a lower model complexity we can reach performance that is better than or similar to the baseline. These are the results when models are trained with one sentence; for seven training sentences we achieve performance comparable with the baseline, but with a lower model complexity.
0:15:58 For example, with 32 GMM-UBM components we get the same performance as the baseline using 256 components.
0:16:09 The same happens in the i-vector system, where by using PAT we got better performance with 32 components than the baseline system with 256 components.
0:16:25 In these two tables I present the results independently of the model size, that is, the results obtained with the optimal model size for each speaker model. We can see that for the i-vector PLDA system there is a more than fifty percent improvement in performance, of course in an ideal and optimized environment, while for the GMM-UBM system PAT improves the performance when using one and three training sentences, and gives comparable results when using five and seven sentences.
0:17:14 In this last plot we see the same results as before, now when using one single training sentence. We can see a fifty percent decrease in the EER of the i-vector PLDA system, while in the GMM-UBM system the lines are less far apart, but we still have a decrease in the EER from around forty-two to thirty-six percent.
0:17:45 To conclude, in this work we addressed the problem of speaker modelling in the case where training data is scarce, that is, when using short-duration utterances. We optimized and evaluated PAT at the speaker modelling level by performing small-scale speaker verification experiments, using the TIMIT database, which is labelled at the phonetic level. We showed that PAT improves significantly the performance of the two systems, GMM-UBM and i-vector PLDA. It is also worth noting that PAT is able to provide equivalent or better performance using a lower model complexity.
0:18:41 For future work we aim to go back to our original goal, which is PAT for speaker diarization. We want to explore automatic approaches to acoustic class transcription, because PAT does not actually need phonetic transcriptions: as long as we are able to label the data in such a way that we can map the features to a particular acoustic class, we can calculate the transform for that particular class and finally improve the performance of the system. One final direction is to explore speaker-independent approaches to phone normalization.
0:20:02 Your i-vector extractor was trained with short sentences as well? Because the PLDA was obviously trained, I think, with short sentences. So you did not merge, for example, sentences from the same channel and speaker to create one big sentence and train the i-vector extractor that way, for example putting all the sentences of one speaker together?
0:20:26 OK, so, just to be sure I understood: we used short sentences throughout, so that is exactly right.
0:20:50 Do we have a couple of minutes for questions?
0:20:55 I think TIMIT is a much too simple database for this, because the EER will be well close to zero percent. So what are the challenges? I mean, if it works so well, this should be applied everywhere, for example in text-dependent verification. What are the challenges in real life? I mean, this is not used in many systems, right? It should be employed everywhere if it works.
0:21:40 The actual aim of this work was to optimize and evaluate PAT in an ideal scenario because, as I said, we first tried to apply it to speaker diarization, but the results there were disappointing: we got a really small improvement in the diarization error rate. So we said, OK, let us try to find the upper limit of the performance that can be obtained with PAT.
0:22:12 But I mean, you have been using TIMIT, and there are many versions of TIMIT, with noise conditions or telephone bandwidth conditions; it has been transformed in many ways. So why don't you use those?
0:22:34 Because, as I said, TIMIT has the phonetic transcriptions, and also because we wanted to optimize PAT in the ideal condition.
0:22:47 So I would think it would be interesting to see it on those other versions as well.
0:22:57 At the risk of repeating my primary point very quickly: the major impediment to progress in this field is the lack of data. I2R has been very generous in making their data available, but that is pretty much the only recent dataset for this research that we have to work on. My own experience with real-world data is that the problem really is hard, and as far as I can see we are not going to be able to make progress unless we find some way of sharing resources among the researchers who are working on this problem. If you are working with an industrial partner that collects data, and there is mutual benefit to sharing that data, then we could probably make sure progress is made; otherwise it is not obvious how we are going to move forward.
0:24:30 Thank you for those points, Patrick. I just want to mention that in Odyssey 2001, when we became Odyssey, considerable effort was put forward to create standard text-dependent corpora to distribute to the participants. Nice text-dependent datasets were put together and distributed to the Odyssey members in advance, and we planned to have a whole track on text-dependent speaker verification; the sad news was that only a couple of sites participated. I think Craig Greenberg was implying maybe a similar issue with the HASR evaluations. A lot of this, you know, has to be a two-way street: someone goes to the effort and expense of putting together corpora, and then you need a reasonable number of participants who want to take on the challenge. So if there has been a shift in interest towards text-dependent verification, I think it would be good, as a community, to get together, figure that out, and put together some evaluation.