0:00:15 Good morning everyone. I'm not sure if you noticed, but this is the only speaker recognition talk in this session, which makes me feel somehow like the distant relative that the family invites but doesn't really like.
0:00:35 So today I'm going to present some of the recent advances in our speaker recognition system, and I will share some results that we obtained with the system on the NIST SRE 2010 extended core tasks, with an emphasis on the telephony condition, which is condition 5.
0:01:01 This is joint work with Sriram Ganapathy, who is now an assistant professor at IISc Bangalore, India, and Jason Pelecanos.
0:01:13 I will start with a brief overview of some of the recent state-of-the-art work in speaker recognition, and then I will share with you the objectives of my talk. I will present our speaker recognition system and the key components that contributed the most towards the end results. 0:01:37 I'll describe our experimental setup: the data we used, the DNN acoustic models and their configurations, as well as the speaker recognition system configuration. And I'll share with you, as I said, the results we obtained with the system on the NIST SRE 2010 extended core task, mostly on condition 5, and then conclude.
0:02:02 When we look at the recent state-of-the-art work on speaker recognition, first of all, most of the state-of-the-art systems are i-vector based, and they somehow use a universal background model to generate the statistics to compute the i-vectors. 0:02:22 Now when we look at this through time, we started with traditional unsupervised Gaussian mixture models to represent the UBMs, and more recently we have used phonetically aware UBMs, which are derived from ASR systems.
0:02:43 I would like to emphasize here that even though this work done at IBM does not get much credit, it was the first work that in fact used senones to compute the hyperparameters of the UBM for speaker recognition, and it achieved state-of-the-art results as a single system on the NIST SRE 2010.
0:03:07and then a this work game the work from best buy which basically used the
0:03:13nn based a scene on posteriors to compute the ubm parameters
0:03:20 More recently there was the work from Johns Hopkins University that used TDNN-based senone posteriors to compute the UBM parameters. In fact they found that, contrary to what SRI found with the UBM that uses diagonal covariance matrices, you can estimate the UBM parameters with full covariance matrices and reduce a lot of the computation; then you don't necessarily need to go through the hassle of running the DNN-based system at test time. 0:03:58 You can directly use the supervised UBM to compute the statistics, and then from there compute the i-vectors, and they got nice gains as well.
0:04:10 Also, some of the state-of-the-art systems don't use the DNN posteriors to compute the UBM hyperparameters; instead they use DNN bottleneck features, and the rest of the pipeline in an i-vector based speaker recognition system remains the same.
0:04:27 I mentioned some of this work here, and I would like to give some credit to Heck et al.'s work in 1998, which was the first to explore bottleneck-based features for speaker recognition.
0:04:41 So, the objectives of my talk today. I will be sharing our state-of-the-art results on the NIST SRE 2010 extended core tasks; again, our emphasis is on the telephony condition, which is condition 5. I will be presenting the key system components that contributed the most towards achieving these results.
0:05:05 Namely, I will talk about the fMLLR-based features that we used, and in fact compare them with more traditional raw acoustic features such as MFCCs. We also used a DNN-based acoustic model in place of an unsupervised GMM acoustic model, that is, the UBM.
0:05:29 So, this is technically not novel; DNN-based i-vectors have been around for a while now. What we did here was nearly double the size of the senone set, and we wanted to see how that impacts speaker recognition performance.
0:05:47 Finally, we explored nearest neighbor discriminant analysis (NDA) to achieve inter-session variability compensation in the i-vector space, and we compared its performance with the more commonly used LDA.
0:06:02 We also quantify the contribution of these three system components towards the performance. In fact, we will also see how varying, for example, the size of the senone set impacts the performance.
0:06:22 Now let's take a look at our speaker recognition system. You can see the flowchart of our speaker recognition system here. This assumes that all the model parameters are already trained, so we have the DNN acoustic model trained, the i-vector extractor, and the NDA and PLDA models.
0:06:42 So, the three components I just mentioned, let me repeat them. We have fMLLR-based features, which can be used to train and evaluate the DNN as well as to compute the sufficient statistics for i-vector extraction; with fMLLR you can achieve both speaker and channel normalization. 0:07:07 We have the DNN acoustic model, instead of an unsupervised GMM acoustic model, to compute the i-vectors; again, compared to the previous work, we nearly doubled the size of the senone set. And then we replaced the more commonly used LDA with NDA for inter-session variability compensation, and used PLDA, which I'm sure you're familiar with, for scoring.
0:07:34 If we look at the previous work with the DNN-based senone i-vectors, what we observe is that many systems used two different sets of features: one to compute the posteriors and one to compute the sufficient statistics. Typically ASR features are different from speaker recognition features, which makes sense. 0:08:03 So in this work we wanted to see what happens if we unify them, that is, use the same set of features both to train and evaluate the DNN and to compute the sufficient statistics for i-vector extraction. Towards that end, we considered feature-space maximum likelihood linear regression (fMLLR) transforms, and the resulting fMLLR-based features are used as features for our DNN system.
0:08:30 An fMLLR transform is a linear transform like this, which can basically be decomposed into a linear part and a translation. These parameters can be obtained using the alignments from a first pass through a GMM-HMM system; maximum likelihood estimation then gives us the transform parameters. 0:09:04 And the product of this transform with raw acoustic features, such as MFCCs, or even transformed features like LDA-transformed features, gives us speaker- and channel-normalized features.
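To put the transform in symbols, this is its standard affine form (my notation, not necessarily the slide's):

```latex
\hat{\mathbf{x}}_t = \mathbf{A}\,\mathbf{x}_t + \mathbf{b}
                   = \mathbf{W} \begin{bmatrix} \mathbf{x}_t \\ 1 \end{bmatrix},
\qquad \mathbf{W} = [\,\mathbf{A} \;\; \mathbf{b}\,]
```

Here W is estimated per speaker (or per conversation side) by maximizing the likelihood of the data under the GMM-HMM model, given the first-pass alignments.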
0:09:15 This may sound contradictory, by the way, because fMLLR is usually used to reduce speaker variability. But as we know, speaker variability can be within or across speakers, and here we believe that the within-speaker normalization that fMLLR provides dominates the between-speaker normalization. We also get the benefit of channel normalization, so if we have different handsets, for example, we think fMLLR can take care of that as well.
0:09:53 Now, as I mentioned, DNN senone i-vectors have been around for a while, so there is nothing technically new in this slide. The only difference is that we nearly doubled the size of our senone set compared to the previous work to compute the posteriors, and from there the sufficient statistics. 0:10:19 So I'm not going to spend much time on this; we know by now how to compute i-vectors, even with 10k senones.
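For readers who want the mechanics, here is a minimal sketch of how senone posteriors turn into sufficient statistics and an i-vector; all variable names are hypothetical, and the extractor parameters (T, the per-senone covariances and means) are assumed to be already trained:

```python
import numpy as np

def extract_ivector(feats, post, T, Sigma_inv, means, R=500):
    """Minimal sketch: senone-posterior Baum-Welch stats -> i-vector.

    feats:     (n_frames, d) speaker recognition features (e.g. fMLLR)
    post:      (n_frames, K) DNN senone posteriors, rows sum to 1
    T:         (K * d, R) total variability matrix
    Sigma_inv: (K, d) inverse of per-senone diagonal covariances
    means:     (K, d) per-senone means, accumulated with the same posteriors
    """
    # Zeroth-order and centered first-order statistics per senone
    N = post.sum(axis=0)                         # (K,)
    F = post.T @ feats - N[:, None] * means      # (K, d)

    # Closed-form i-vector posterior mean: w = (I + sum_k N_k T_k' S_k^-1 T_k)^-1 b
    K, d = F.shape
    L = np.eye(R)
    b = np.zeros(R)
    for k in range(K):
        Tk = T[k * d:(k + 1) * d]                # (d, R) block for senone k
        TS = Tk.T * Sigma_inv[k]                 # (R, d): scale columns by precision
        L += N[k] * (TS @ Tk)
        b += TS @ F[k]
    return np.linalg.solve(L, b)                 # point estimate of the i-vector
```

The only thing the DNN changes relative to a GMM-UBM is where `post` comes from; the statistics and the i-vector closed form are the standard ones.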
0:10:34 Just to connect this work to one of the presentations yesterday: the speaker talked about how i-vector distributions are not necessarily Gaussian, and he actually showed us some distributions, and that was even on clean data, not noisy data.
0:10:54 Now, LDA is formulated based on Gaussian distribution assumptions for the individual classes; or even if they are not Gaussian, they need to be at least unimodal. 0:11:09 Therefore LDA cannot effectively handle multimodal data, which is typical in NIST SRE types of scenarios, because the data come from various sources: we have Switchboard data and Mixer data, and that causes multimodality in the i-vectors.
0:11:28 Also, for applications such as language recognition, because we only have a few classes, the LDA transform can be rank deficient, so we might take a hit from that as well.
0:11:44 So instead of trying to transform the i-vector space so that it is more Gaussian-like, as was presented yesterday, here we try to use a transform that does not assume Gaussianity and does not use the global structure of the classes to compute the between-class scatter matrix. 0:12:07 If you look at LDA, it uses the class centroids: the differences between class centroids, the arrows here, are used to compute the between-class scatter. Now in NDA, we do not assume any global structure for the individual classes; rather, we assume that classes are only locally structured. So we use local means, computed from the k nearest neighbors of each individual sample, and then use those differences to compute the between-class scatter matrix.
0:12:46 Another thing is that we introduce this weighting function here, which basically emphasizes the samples near the classification boundary, which are more important for discriminating between classes, rather than a sample here, which should get a really small weight because it does not contribute to the class discrimination.
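To make the construction concrete, here is a minimal sketch of the weighted NDA between-class scatter, assuming Euclidean distances and the one-versus-rest neighbor search described later in the Q&A; the names and the exact weighting exponent are my assumptions:

```python
import numpy as np

def nda_between_scatter(X, y, K=10, alpha=2):
    """Sketch of the weighted NDA between-class scatter (Fukunaga-Mantock style).

    X: (n, dim) i-vectors; y: (n,) class labels; K: neighborhood size;
    alpha: weighting exponent (a tunable assumption here).
    """
    n, dim = X.shape
    Sb = np.zeros((dim, dim))
    for l in range(n):
        x = X[l]
        same = X[y == y[l]]                      # x's own class
        rest = X[y != y[l]]                      # one-versus-rest pool
        d_same = np.sort(np.linalg.norm(same - x, axis=1))[1:K + 1]  # skip x itself
        d_rest = np.linalg.norm(rest - x, axis=1)
        nn = np.argsort(d_rest)[:K]
        m_rest = rest[nn].mean(axis=0)           # local out-of-class mean
        # Boundary weight: ~0.5 near the boundary, -> 0 deep inside the class
        a, b = d_same[-1] ** alpha, d_rest[nn[-1]] ** alpha
        w = min(a, b) / (a + b)
        diff = (x - m_rest)[:, None]
        Sb += w * (diff @ diff.T)
    return Sb / n
```

The within-class scatter is computed exactly as in LDA (the speaker confirms this in the Q&A), and the projection then comes from the leading eigenvectors of Sw^{-1} Sb, just as in LDA.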
0:13:09 And another thing is that, unlike LDA, NDA, given that we have enough examples for the different classes, can always be full rank. It is therefore very useful for applications such as language ID; we published that work in 2015, and we actually obtained some gains over LDA there.
0:13:34 Now, our experimental setup. For training data we extracted English telephony and microphone data from the NIST 2004 through 2008 SRE data; we also used Switchboard data, both cellular and landline. This resulted in a total of 60k recordings to train our system hyperparameters. 0:13:58 For evaluation we considered the NIST 2010 SRE extended evaluation set. The reason we considered NIST SRE 2010 rather than 2012 is that we had some anchors to compare the performance of our system with other sites.
0:14:23 The conditions we considered were conditions 1 through 5. You can see the details here, but I want to emphasize again that our focus is on condition 5, which is telephone speech where there is a mismatch between enrollment and test: the types of phones used in enrollment and test are not necessarily the same.
0:14:48 Our DNN acoustic model had seven hidden layers, six of which had 2048 hidden units, plus a bottleneck layer with 512 units; we used Fisher data to train it. 0:15:04 In addition to the original 10k senones, we also considered 2.4k posteriors, to see basically how varying the granularity of the output layer affects speaker recognition performance.
0:15:24 We used a typical setup for our speaker recognition system: a 500-dimensional total variability subspace, which was reduced to 250 using either NDA or LDA trained on the entire training set, and we report equal error rate as well as minDCF08 and minDCF10.
0:15:47 Note that we considered both the 2.4k and 10k senone sets for i-vector extraction.
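As a compact summary of the setup just described (values taken from the talk; the dictionary layout itself is only illustrative):

```python
# Experimental configuration as stated in the talk
config = {
    "training_data": "SRE 2004-2008 + Switchboard (cellular, landline), ~60k recordings",
    "dnn": {
        "hidden_layers": 7,              # six of them with 2048 units
        "bottleneck_units": 512,
        "training_corpus": "Fisher",
        "senone_sets": [2_400, 10_000],  # 32k was also tried later, with little gain
    },
    "ivector_dim": 500,                  # total variability subspace
    "projected_dim": 250,                # after LDA or NDA
    "metrics": ["EER", "minDCF08", "minDCF10"],
}
```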
0:15:54 In terms of results, let's first compare NDA and LDA. These results were obtained with MFCC features, a 2048-component Gaussian mixture model, and the 10k DNN, and they are reported on condition 5. As we can see, no matter what type of acoustic model we use, NDA always provided a nice benefit over LDA, across all three metrics. 0:16:18 The reason, as I mentioned, is that NDA can handle non-Gaussian and multimodal data more effectively than LDA. 0:16:30 For a comparison of MFCCs versus fMLLR features, again on condition 5 with the 10k DNN, we can see that, whether with LDA or with NDA, we always have an improvement with fMLLR features over MFCCs, and the reason is that fMLLR features provide speaker and channel normalization.
0:16:53 Also note that we unified the speaker recognition and speech recognition features this way, so the system is even simpler. But we should also take into account the fact that in order to compute the fMLLR transforms we need a two-pass system, as opposed to a single-pass one.
0:17:12 To measure the impact of the senone set size, we compared 2.4k versus 10k posteriors, and as expected, as we increase the senone set size the results improve. We also considered 32k senones, just to see how it impacts performance; we did not see much gain with the 32k senone experiment.
0:17:42 I just want to emphasize here that, in contrast to what we see with the DNN, if you increase the number of components in a GMM, at least with diagonal covariance matrices, you don't see these gains: if you increase the number of GMM components from 2k to 4k to 6k to 8k, you probably see no gains, if not degradations.
0:18:08 Now, as they say, a picture is worth a thousand words, so here is this table plotted. You can see how NDA compares to LDA with both the GMM-based and the DNN-based systems: the performance gap is larger when we use GMMs to compute the posteriors for the i-vectors, and with the DNN, as we increase the size of the senone set, this gap in performance narrows. And secondly, we can compare the 2.4k versus 10k DNN senone performance.
0:18:50 Here is the progression of our system over time. We started with a very basic system, GMMs with MFCCs and LDA; we replaced the LDA with NDA and got a nice gain, and then replaced the GMM with the DNN and the MFCCs with fMLLRs and got a further boost in performance. So we have the best published performance on the NIST SRE 2010 condition 5, at least; for the other conditions we believe these are also the best published performances, but I refer you to the paper for more details.
0:19:23 Because we rely on previous best results, I wanted to give credit to these two works. This one achieved a 1.09 equal error rate, but they had a gender-dependent system, whereas our system is gender-independent. And this work also used gender-dependent systems, and they only reported results on female trials, so I'm not sure how we can compare with those numbers.
0:19:53 So, in conclusion, I presented our speaker recognition system and its components, shared with you our results, and quantified the contributions of the different components. 0:20:05 If you're interested in further progress on our system, please come visit us at IBM; or, you know, if you buy me a cookie after this, I might be able to share more details with you. Thank you.
0:20:27 We have time for some questions.
0:20:37 Thank you for your presentation. My question is about the weights in the NDA computation: are those the weights from the original NDA that you mentioned?
0:20:54 Those weights are in the original NDA, yes. And you are asking whether the data that are close to the boundaries get more weight; 0:21:09 let's take a look at how things are computed. The numerator is the minimum of the distance between each sample and its k nearest neighbors within its own class, and the distance to its k nearest neighbors from the j-th class; that is divided by the sum of the two. 0:21:24 So of course, if the sample is not close to the boundary, it is going to be closer to its k nearest neighbors from its own class, so the numerator is going to be small compared to the denominator and you get a number close to zero. Whereas if the sample is close to the boundary, the two distances become comparable: the distance from the sample of class i to the mean of its k nearest neighbors from class j comes close to the within-class distance. 0:22:12 So for samples near the classification boundary you get 0.5, and for samples that are far from the boundary you get a weight near zero.
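In symbols, the weighting function being described has this standard NDA form (alpha is a tunable exponent; the notation is mine, not the slide's):

```latex
% d(x, NN_K^c(x)) = distance from x to its K-th nearest neighbor in class c
w_{ij}(x_l) =
\frac{\min\!\big\{ d^{\alpha}\!\big(x_l,\mathrm{NN}_K^{i}(x_l)\big),\;
                   d^{\alpha}\!\big(x_l,\mathrm{NN}_K^{j}(x_l)\big) \big\}}
     {d^{\alpha}\!\big(x_l,\mathrm{NN}_K^{i}(x_l)\big) +
      d^{\alpha}\!\big(x_l,\mathrm{NN}_K^{j}(x_l)\big)}
```

This is at most 0.5, reached when the two distances are equal, i.e. right at the boundary, and it tends to zero deep inside a class.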
0:22:24 Can you tell us conceptually what that means? Why do the samples near the boundaries get more weight? The samples which are far from the boundary are far anyway, so what do they contribute to the estimation of the between-class scatter? Because sometimes, if you assume the classes are Gaussian, the samples that are near...
0:22:52 Well, first of all, even if a class is Gaussian, the data points that are away from the mean are like outliers, right? The samples near the boundary are the ones that are hard to distinguish, so we weight those more than the samples that are more representative of the training set. 0:23:18 And the training set already has the labels: we know that those samples are in that class even though they are far from the classification boundary.
0:23:33 Okay, thank you.
0:23:45 Thank you for your talk. I actually have a question regarding the implementation of the NDA. In the papers I have seen several variants: is the within-class covariance the classical one, or did you use something else? For this work we used the classical one; we compute the within-class scatter matrix exactly the same way you compute it for LDA.
0:24:14 For the k nearest neighbors we used one-versus-rest: that means that for each class, you consider that class versus all the other classes, and you compute the nearest neighbors from that pool. 0:24:25 That was my second question: apart from the computational time, which is going to be much higher, in terms of results, does it change anything? The pairwise version would be so slow that I have never explored it.
0:24:44 Thank you.