0:00:14 Hello, I am a PhD student from The Chinese University of Hong Kong. 0:00:20 I will present the work "Bayesian Neural Network based x-vector System for 0:00:27 Speaker Verification", 0:00:28 accepted to Odyssey 2020.
0:00:33 This work proposes to incorporate Bayesian neural networks into automatic speaker verification 0:00:38 systems to improve the systems' generalization ability.
0:00:43 In this presentation, I will first introduce ASV systems and the strategies widely 0:00:48 used in developing these systems. 0:00:52 This is followed by some related works that adopt Bayesian learning and 0:00:56 Bayesian modeling in the machine learning community. 0:01:00 Then I will talk about our approach, including the motivation and how to apply Bayesian learning 0:01:06 to ASV systems. 0:01:07 Next, our experimental setup and results will be presented to verify the effectiveness of 0:01:14 our approach. 0:01:15 This is followed by the final 0:01:17 conclusions.
0:01:20 Automatic speaker verification (ASV) systems aim at confirming a spoken utterance's speaker identity claim. 0:01:28 We have observed an ever-increasing use of ASV systems in our daily life, 0:01:33 including voice authentication in electronic devices, 0:01:37 e-banking authentication, and so on.
0:01:40 There are three most representative frameworks for developing ASV systems. 0:01:46 i-vector based systems were proposed to model the speaker and channel variations 0:01:52 together, 0:01:53 and use a speaker-discriminative back-end for verification. 0:01:58 Benefiting from the powerful discriminative ability of deep neural networks, 0:02:03 speaker embedding systems were proposed to extract speaker-discriminative representations from utterances; 0:02:11 these achieve state-of-the-art performance. 0:02:15 With the development of end-to-end learning, 0:02:19 many researchers also focus on constructing ASV systems 0:02:25 in an end-to-end manner.
0:02:29 A challenge for ASV system development is the mismatch between 0:02:33 the training and evaluation data, 0:02:36 such as speaker population mismatch and variations in channel and environmental 0:02:41 background.
0:02:43 The speaker populations used for training and evaluation commonly have no overlap, 0:02:48 especially in practical applications. 0:02:52 To work well on this mismatched data pair 0:02:56 requires the speaker representations to generalize well on unseen speaker data.
0:03:03 Channel and environmental variations also commonly exist in 0:03:07 practical applications, 0:03:09 where the training and evaluation data are collected from different types of recorders 0:03:15 and environments. 0:03:16 These mismatches also place a high demand on the model's generalization 0:03:22 ability.
0:03:24 To address this issue, 0:03:26 previous efforts have applied adversarial training to alleviate the channel and environmental variations. 0:03:34 These approaches achieve improvements by alleviating the effects 0:03:38 of channel and environmental mismatches, 0:03:41 but they barely consider the speaker population mismatch, which could also lead to 0:03:46 system performance degradation.
0:03:50 In this work, 0:03:51 we focus on the x-vector system and try to incorporate Bayesian neural 0:03:56 networks 0:03:57 to improve 0:03:58 the system's generalization ability 0:04:01 across all these three kinds of mismatches.
0:04:06 Bayesian learning approaches have been shown to be effective in improving 0:04:10 the generalization ability of discriminatively trained DNN systems. 0:04:15 In the machine learning community, 0:04:17 Barber et al. proposed 0:04:20 ensemble learning with variational inference for Bayesian neural networks, 0:04:26 and Blundell et al. proposed a novel 0:04:29 backpropagation- 0:04:31 compatible algorithm for learning the probability distribution 0:04:36 of the network parameters.
0:04:37 In the speech area, 0:04:41 Bayesian neural networks have been applied to speech recognition, 0:04:46 Bayesian learning of hidden unit contributions has been adopted for speaker adaptation, 0:04:53 and Bayesian learning has also been applied to language modeling.
0:05:00 Now we introduce our approach. As the motivation, we first briefly talk about the 0:05:07 traditional x-vector system, 0:05:09 whose system parameters are estimated under the maximum likelihood strategy, 0:05:14 i.e., a fixed point estimation, as shown in Figure 1. 0:05:20 It tends to 0:05:21 overfit when given limited training data, or when there 0:05:26 exists a mismatch between the training and evaluation data.
0:05:31 In the case of speaker population mismatch, 0:05:34 the overfitted model parameters may result in speaker representations following a distribution 0:05:41 that compactly supports the training speaker identities; 0:05:44 however, this may not generalize well on 0:05:48 unseen speaker data.
0:05:50 The cases of channel and environmental mismatch 0:05:54 are similar. For instance, for channel mismatch, the overfitted model parameters may 0:06:00 partially rely on the channel information 0:06:02 to classify speakers, due to the use of different recorders for different speakers in the training data. 0:06:10 Once encountering channel mismatch in the evaluation data, 0:06:14 the original channel-to-speaker-identity mapping is broken, and the trained reliance on 0:06:20 channel information leads to misclassification.
0:06:27 It has also been shown 0:06:29 that the speaker representations extracted from x-vector systems 0:06:35 still contain speaker-unrelated information, such as channel, 0:06:40 transcription, and utterance length. 0:06:42 This information degrades the verification performance, especially on mismatched 0:06:47 evaluation data.
0:06:49 Bayesian neural networks, which have recently attracted great interest, instead model the parameters by a 0:06:55 posterior distribution, as shown in Figure 2. 0:06:58 These probabilistic parameters could help the model generalize on mismatched data.
0:07:02 To address the speaker population mismatch issue, 0:07:06 they could smooth the distributions of speaker representations for 0:07:12 better generalization on unseen speaker data. 0:07:15 To alleviate the mismatches caused by channel and environmental variance, the probabilistic parameter modeling 0:07:20 could reduce the risk of 0:07:23 overfitting on channel information, by encouraging the model parameters to consider other possible values 0:07:29 that don't rely on channel information for speaker classification.
0:07:34 The contribution of this work is to incorporate 0:07:38 Bayesian neural networks into the x-vector system by replacing certain layers with Bayesian layers, to 0:07:45 improve its generalization ability.
0:07:51 An ASV system consists of two parts: a front-end used for extracting 0:07:56 utterance-level speaker embeddings, and a verification scoring back-end. 0:08:01 The front-end compresses utterances of different durations into fixed-dimension 0:08:06 speaker-related embeddings. 0:08:09 Based on these embeddings, different scoring schemes can be used to predict whether two utterances 0:08:15 belong to the same speaker or not.
0:08:18 In this work, we focus on the x-vector front-end, and choose probabilistic linear discriminant 0:08:24 analysis (PLDA) and cosine similarity as the two 0:08:28 scoring back-ends for the performance evaluation.
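[Editor's illustration, not part of the talk: the cosine back-end is simple enough to sketch in a few lines of Python. The threshold below is a hypothetical placeholder; in practice it is tuned on development data.]

```python
import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (e.g., x-vectors)."""
    return float(np.dot(emb1, emb2) /
                 (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def verify(enroll_emb, test_emb, threshold=0.5):
    # Accept the speaker identity claim when the score exceeds the threshold.
    # threshold=0.5 is a placeholder, not a value from the talk.
    return cosine_score(enroll_emb, test_emb) >= threshold
```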
0:08:32 The x-vector extractor is a neural network 0:08:35 trained on a speaker discrimination task. As shown in Figure 3, it consists of frame-level 0:08:41 and utterance-level structures. 0:08:44 At the frame level, several layers of 0:08:47 time-delay neural network (TDNN) are used to model the 0:08:53 short-term temporal characteristics 0:08:56 of acoustic features.
0:08:58 Then 0:09:00 a statistics pooling layer aggregates all the frame-level outputs from the last TDNN layer 0:09:06 and computes their mean and standard deviation. 0:09:10 The computed statistics are propagated through several embedding layers and 0:09:17 finally a softmax output layer.
0:09:20 The cross-entropy loss is used to train the network parameters during the training stage. 0:09:26 In the testing stage, 0:09:27 given the acoustic features of an utterance, the embedding layer output is extracted as the 0:09:34 x-vector.
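[To make the structure concrete, here is a minimal PyTorch sketch of such a front-end, added by the editor. The layer sizes follow common x-vector configurations and are assumptions, not necessarily the talk's Table 1.]

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Minimal x-vector-style front-end: TDNN frame-level layers,
    statistics pooling, then an embedding layer and a softmax classifier."""
    def __init__(self, feat_dim=30, emb_dim=512, num_speakers=1251):
        super().__init__()
        # Frame-level TDNN layers, realized as dilated 1-D convolutions over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, emb_dim)   # pooled mean + std stats
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):            # x: (batch, feat_dim, num_frames)
        h = self.frame_layers(x)
        # Statistics pooling: mean and standard deviation over the time axis.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)  # taken as the x-vector at test time
        return self.classifier(emb), emb  # logits for cross-entropy training
```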
0:09:39 Bayesian neural networks 0:09:40 learn the parameters' posterior distribution p(w|D) to model the weight uncertainty, 0:09:47 which effectively enables an infinite 0:09:50 number of possible model parameters to be considered. 0:09:55 This smooths the modeling and helps the model generalize well on unseen data.
0:10:03 During the testing stage, 0:10:05 the model output distribution, 0:10:07 given the input x, 0:10:10 is computed as the expectation 0:10:11 over the weights' posterior distribution 0:10:14 p(w|D), 0:10:15 as shown in Equation 1.
0:10:20 However, the exact estimation of p(w|D) 0:10:23 is intractable for neural networks of any practical size, 0:10:29 since the number of possible weight values could be infinite. 0:10:33 So a variational approximation is commonly adopted to estimate the posterior distribution.
0:10:40 The variational 0:10:42 approximation aims to use a distribution q(w), with a set of parameters, 0:10:48 to approximate 0:10:49 the posterior distribution p(w|D). 0:10:54 This is achieved by minimizing the Kullback-Leibler 0:10:59 divergence between these two distributions, as shown in Equation 2.
0:11:05 From Equation 2 to Equation 4, 0:11:08 we apply Bayes' theorem and drop the constant term 0:11:14 log p(D), 0:11:16 which doesn't affect the minimization objective.
0:11:23 Equation 4 basically means that the loss 0:11:26 could be decomposed into two parts: 0:11:31 one is 0:11:32 the KL divergence between 0:11:35 the approximation distribution q(w) 0:11:40 and the prior distribution p(w); 0:11:45 the other is 0:11:49 the negative expectation of the log-likelihood of the training data over the approximation distribution 0:11:55 q(w).
0:11:56 This decomposed objective is used as the loss function to be minimized in the training 0:12:01 process.
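[Again a reconstruction by the editor, since the slide itself is not in the audio; the derivation from Equation 2 to Equation 4 described above can be written as:]

```latex
q^{*}(w) = \arg\min_{q} \ \mathrm{KL}\big(q(w) \,\|\, p(w \mid \mathcal{D})\big) \tag{2}
         = \arg\min_{q} \ \mathbb{E}_{q(w)}\!\left[\log \frac{q(w)\, p(\mathcal{D})}{p(\mathcal{D} \mid w)\, p(w)}\right] \tag{3}
         = \arg\min_{q} \ \mathrm{KL}\big(q(w) \,\|\, p(w)\big)
           - \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] \tag{4}
```

[Bayes' theorem is applied to reach (3), and the constant log p(D) is dropped to reach (4).]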
0:12:07 As commonly adopted, we assume that both the variational approximation q(w) and the 0:12:13 prior distribution p(w) follow diagonal Gaussian distributions, whose parameters 0:12:20 are composed of the mean μ_q and standard deviation σ_q, 0:12:24 and the mean μ_p and standard deviation σ_p, respectively.
0:12:30 The two parts of the loss function can then be formulated as 0:12:34 Equations 7 and 8, respectively. Because the integration in Equation 8 is intractable, we apply 0:12:40 Monte Carlo sampling 0:12:42 to approximate the integration 0:12:44 process.
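[Under the diagonal Gaussian assumption, a standard reconstruction of these two terms, numbered to match the talk's Equations 7 and 8: the KL term in closed form, and the likelihood term approximated with N Monte Carlo samples drawn via the reparameterization trick.]

```latex
\mathrm{KL}\big(q(w) \,\|\, p(w)\big)
  = \sum_{i}\left[\log\frac{\sigma_{p,i}}{\sigma_{q,i}}
      + \frac{\sigma_{q,i}^{2} + (\mu_{q,i} - \mu_{p,i})^{2}}{2\,\sigma_{p,i}^{2}}
      - \frac{1}{2}\right] \tag{7}

\mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big]
  \approx \frac{1}{N}\sum_{n=1}^{N} \log p\big(\mathcal{D} \mid w^{(n)}\big),
  \quad w^{(n)} = \mu_{q} + \sigma_{q} \odot \epsilon^{(n)},\;
  \epsilon^{(n)} \sim \mathcal{N}(0, I) \tag{8}
```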
0:12:45 Finally, 0:12:47 combining Equations 7 and 8, we have the final loss function, 0:12:54 which could be directly used in the training process.
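[As a concrete illustration of this training recipe, here is a minimal sketch under the editor's assumptions, not the talk's exact configuration: a Bayesian linear layer with a diagonal Gaussian posterior, one Monte Carlo weight sample per forward pass, and the closed-form Gaussian KL term.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a diagonal Gaussian variational posterior over its weights."""
    def __init__(self, in_dim, out_dim, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_dim, in_dim))          # posterior means
        self.rho = nn.Parameter(torch.full((out_dim, in_dim), -3.0))  # softplus(rho) = std
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.rho)
        # Reparameterization trick: one Monte Carlo weight sample per forward pass.
        weight = self.mu + sigma * torch.randn_like(sigma)
        return F.linear(x, weight)

    def kl_loss(self):
        # Closed-form KL between N(mu, sigma^2) and the prior N(0, prior_std^2),
        # i.e., Equation 7 with mu_p = 0 and sigma_p = prior_std.
        sigma = F.softplus(self.rho)
        return (torch.log(self.prior_std / sigma)
                + (sigma ** 2 + self.mu ** 2) / (2 * self.prior_std ** 2)
                - 0.5).sum()

# Per-batch training loss (sketch): cross-entropy approximates the expected
# negative log-likelihood with one Monte Carlo sample (Eq. 8), and the KL
# term (Eq. 7) is scaled by the number of training samples:
# loss = F.cross_entropy(logits, labels) + bayes_layer.kl_loss() / num_train_samples
```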
0:12:59 In order to evaluate the effectiveness of Bayesian learning for speaker verification in both 0:13:04 short and long utterance conditions, 0:13:06 we performed experiments on two datasets.
0:13:09 For the short utterance condition, we consider the VoxCeleb1 dataset, 0:13:15 with in total 0:13:16 148,642 utterances from 1,251 0:13:25 celebrities.
0:13:28 We adopt 4,874 utterances from 40 speakers for evaluation, 0:13:37 and the remaining utterances are used for training 0:13:40 the ASV system parameters.
0:13:42 For the long utterance condition, the core task in the NIST Speaker Recognition Evaluation (SRE) 2010 is used for benchmarking our models. 0:13:52 For the training set, we adopt the previous 0:13:54 SRE corpora since 2004. 0:13:56 In total, we have around 0:13:59 sixty-four thousand recordings from over six thousand 0:14:03 speakers in this dataset.
0:14:06 We evaluate the systems' generalization ability 0:14:10 based on in-domain and out-of-domain 0:14:12 evaluations with different mismatch degrees. 0:14:18 When the training and evaluation stages are performed on the same 0:14:21 dataset, it is an in-domain evaluation; 0:14:23 when executed on different datasets, it is an out-of-domain evaluation.
0:14:30 Thirty-dimensional mel-frequency cepstral coefficients (MFCCs) are adopted as the acoustic features in our experiments. 0:14:37 After extracting the MFCCs, mean normalization is applied, and voice activity detection filters out non-speech frames. 0:14:43 The x-vector training structure configuration is shown in Table 1. 0:14:47 Linear discriminant analysis is applied to reduce the x-vectors' dimension.
0:14:53 To make a fair comparison, the Bayesian x-vector system is configured to have the 0:14:58 same architecture as the baseline system, 0:15:00 except that 0:15:01 the first TDNN layer is replaced by a Bayesian layer with the same number 0:15:06 of units.
0:15:08 Stochastic gradient descent is adopted as the optimizer during training. 0:15:12 The evaluation metrics adopted in this work are the commonly used equal error rate (EER) and the minimum 0:15:18 detection 0:15:19 cost function (minDCF).
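[For reference, EER is the error rate at the operating point where false acceptances and false rejections are equally likely; a minimal computation sketch from the editor, not the talk's evaluation tooling:]

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the error rate at the threshold where the false-acceptance rate
    (FAR) and the false-rejection rate (FRR) are equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2
```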
0:15:21 Here are the in-domain evaluation results. We observed that 0:15:26 the EERs 0:15:27 consistently decrease after incorporating Bayesian learning on both datasets. 0:15:34 On each dataset, we consider the average relative EER decrease 0:15:39 across the cosine and PLDA back-ends.
0:15:45 On the 0:15:46 VoxCeleb1 dataset, the average relative EER decrease for the Bayesian x-vector system is 2.66 percent, 0:15:55 and the fusion system could achieve further improvement, with an average relative EER decrease of 0:15:59 7.24 percent.
0:16:04 On the NIST SRE10 dataset, the average relative EER decrease is 0:16:10 2.32 percent for the 0:16:12 Bayesian x-vector system and 3.8 0:16:16 percent for the fusion system.
0:16:19 We also observed consistent improvements in the detection cost function performance after applying Bayesian learning 0:16:27 and with the fusion system. 0:16:29 These observations verify the improved generalization ability brought by applying Bayesian neural 0:16:35 networks.
0:16:37 Figure 4 illustrates 0:16:39 the detailed detection error tradeoff (DET) curves 0:16:41 of the systems with the cosine back-end, benchmarked on the VoxCeleb1 dataset. 0:16:47 It shows that the proposed Bayesian system outperforms the baseline for all operating 0:16:52 points, 0:16:53 and the fusion system shows further improvements, due to the 0:16:57 complementary advantages of the baseline and Bayesian systems.
0:17:02 Here are the out-of-domain evaluation results: 0:17:05 the model trained on VoxCeleb1 was evaluated on NIST SRE10, 0:17:10 and vice versa. 0:17:12 System performance degrades significantly due to the larger mismatch between the training 0:17:17 and evaluation data.
0:17:22 From the table, we observed that 0:17:25 the systems could benefit more from the generalization ability brought by Bayesian learning. 0:17:29 We also consider the average relative EER decrease across the cosine and 0:17:35 PLDA back-ends for performance evaluation.
0:17:39 In the experiments evaluated on the NIST SRE10 dataset, the average relative EER 0:17:45 decreases are 0:17:47 4.69 percent and 6.03 0:17:51 percent over the baseline, for the Bayesian system and the fusion system respectively.
0:17:56 For the experiments evaluated on the VoxCeleb1 dataset, the average relative 0:18:02 EER decrease is 3.07 percent for the Bayesian x-vector system, 0:18:07 and the fusion system achieves further improvement, 0:18:11 with an average relative EER decrease of 6.41 0:18:15 percent.
0:18:18 The larger relative EER decreases compared to the in-domain evaluations 0:18:24 suggest that Bayesian learning could be 0:18:26 more beneficial when a larger mismatch exists 0:18:29 between the training and evaluation data. The last column in the table shows the corresponding minimum 0:18:36 detection cost function performance, 0:18:38 and we also see consistent improvements by applying Bayesian learning and with the fusion 0:18:44 system.
0:18:46 Similar to the observation in Figure 4, 0:18:49 the detection error tradeoff curves in Figure 5 also show consistent improvements by applying Bayesian learning 0:18:56 and the fusion system, 0:18:58 for all operating points.
0:19:02 In this work, 0:19:03 we incorporated Bayesian neural networks into the 0:19:08 x-vector system to improve the 0:19:10 model's generalization ability. 0:19:12 Our experimental results verify that Bayesian learning enables consistent 0:19:17 generalization ability improvement over the x-vector system in both 0:19:22 short and long utterance conditions,
0:19:25 and the fusion system achieves further improvements over all system configurations. 0:19:29 The larger improvement brought by Bayesian learning in the out-of-domain evaluation results suggests that 0:19:35 it is 0:19:36 more beneficial when larger mismatches exist between the training and evaluation data.
0:19:42 Possible future research will focus on 0:19:45 incorporating Bayesian learning into end-to-end speaker verification systems. 0:19:49 Thanks for listening.