Hello, I'm a PhD student from The Chinese University of Hong Kong.

I will present our work on a Bayesian neural network based x-vector system for speaker verification.

This work was accepted to Odyssey 2020.

This work proposes to incorporate Bayesian neural networks into automatic speaker verification systems to improve the systems' generalization ability.

In this presentation, I will first introduce ASV systems and the challenges encountered in developing them.

This is followed by some related works on Bayesian learning and Bayesian modeling in the machine learning community.

Then I will talk about our approach, including the motivation and how to apply Bayesian learning to ASV systems.

Next, our experimental setup and results will be presented to verify the effectiveness of our approach.

This is followed by the final conclusions.

Automatic speaker verification (ASV) systems aim at confirming whether a spoken utterance matches a claimed speaker identity.

We have seen an ever-increasing use of ASV systems in a wide range of applications, including voice interaction with electronic devices, e-banking authentication, and so on.

There are three representative frameworks for developing ASV systems.

First, i-vector based ASV systems were proposed to model the speaker and channel variations, and they use a speaker-discriminative back-end for verification scoring.

Second, benefiting from the powerful discriminative ability of neural networks, speaker embedding systems were proposed to extract speaker-discriminative representations from utterances; these achieve state-of-the-art performance.

Third, with the development of deep learning, many researchers also focus on constructing ASV systems in an end-to-end manner.

A key challenge for ASV system development is the mismatch between the training and evaluation data,

such as the speaker population mismatch and the variations in channel and environmental background.

The speaker populations used for training and evaluation commonly have no overlap, especially in practical applications.

Coping with this mismatch requires the speaker representations to generalize well on unseen speaker data.

Channel and environmental variations also commonly exist in practical applications, where the training and evaluation data are collected from different types of recorders and environments.

These mismatches also impose a high demand on the model's generalization ability.

To address this issue, previous efforts have applied adversarial learning to alleviate the channel and environmental variations in speaker embeddings.

While these approaches achieve improvements by alleviating the effects of channel and environmental mismatches, they do not consider the speaker population mismatch, which can also lead to system performance degradation.

In this work, we focus on the x-vector system and try to incorporate Bayesian neural networks to improve the system's generalization ability across all three kinds of mismatches.

Bayesian learning has been verified to be effective in improving the generalization ability of discriminatively trained DNN systems.

In the machine learning community, Barber et al. proposed variational inference methods for Bayesian neural networks, and Blundell et al. proposed a novel backpropagation-compatible algorithm for learning a probability distribution over the network parameters.

In the speech area, Bayesian neural networks have been applied to speech recognition, Bayesian learning of hidden unit contributions has been studied for speaker adaptation, and Bayesian learning has also been applied to language modeling.

Before introducing our approach and its motivation, let me briefly recap the traditional x-vector system.

Its system parameters are estimated with a maximum likelihood strategy.

This point estimation of the parameters makes the model tend to overfit when given limited training data, and it lacks the ability to handle the mismatch between the training and evaluation data.

In the case of speaker population mismatch, the overfitted model parameters may result in speaker representations whose distribution compactly supports only the training speaker identities.

However, this distribution may not generalize well on unseen speaker data.

The cases of channel and environmental mismatch are similar. For instance, under channel mismatch, the overfitted model parameters may partially rely on channel information to classify speakers, due to the different recorders used by different speakers in the training data.

However, when encountering channel mismatch in the evaluation data, the original channel-speaker correspondence is broken, and the trained model's reliance on channel information leads to misclassification.

Moreover, it has been shown that the speaker representations extracted from x-vector systems still contain speaker-unrelated information, such as channel, transcription, and utterance length.

Such information can affect the verification performance, especially on mismatched evaluation data.

Bayesian neural networks, which have recently attracted great interest, model the network weights with a posterior distribution, as shown in Figure 2. Such probabilistic parameters could generalize better on unseen data.

To address the speaker population mismatch issue, Bayesian learning could help smooth the distributions of speaker representations for better generalization on unseen speaker data.

For the mismatches caused by channel and environmental variations, the probabilistic parameter modeling could reduce the risk of overfitting on channel information, by smoothing the model parameters to consider other possible values that do not rely on channel information for speaker classification.

Motivated by this, in this work we incorporate Bayesian neural networks into the x-vector system by replacing some of its layers with Bayesian layers, to improve its generalization ability.

The x-vector system consists of two parts: a front-end for extracting utterance-level speaker embeddings, and a verification scoring back-end.

The front-end compresses utterances of different durations into fixed-dimension speaker-related embeddings.

Based on these embeddings, different scoring schemes can be used to predict whether two utterances belong to the same speaker or not.
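As a minimal sketch of one such scoring scheme, cosine scoring, consider the following (the embeddings here are made-up toy vectors, not outputs of the actual system):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings.

    Scores close to 1 suggest the same speaker; a threshold tuned on
    development data turns the score into an accept/reject decision.
    """
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

# Toy example with hypothetical 3-dimensional "embeddings":
same = cosine_score(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
diff = cosine_score(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

Parallel embeddings score 1, orthogonal ones score 0; real x-vectors are several hundred dimensions, but the scoring rule is the same.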

In this work, we focus on the front-end, and choose probabilistic linear discriminant analysis (PLDA) and cosine similarity as the two back-ends for performance evaluation.

The x-vector extractor is a neural network trained on a speaker discrimination task. As shown in Figure 3, it consists of frame-level and utterance-level structures.

At the frame level, several time-delay neural network (TDNN) layers are used to model the temporal characteristics of the acoustic features.

Then a statistics pooling layer aggregates all the frame-level outputs from the last TDNN layer and computes their mean and standard deviation.

The computed statistics are propagated through several embedding layers and finally a softmax output layer.

The cross-entropy loss is used as the training objective.

In the testing stage, given the acoustic features of an utterance, the embedding layer output is extracted as the x-vector.
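The statistics pooling step described above can be sketched in a few lines (a simplified numpy sketch; the layer sizes are illustrative, not the paper's exact configuration):

```python
import numpy as np

def stats_pooling(frame_feats: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features (T x D) into a fixed-size
    utterance-level vector by concatenating the per-dimension mean
    and standard deviation over time, as in statistics pooling."""
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])  # shape (2*D,)

# Utterances of different lengths map to the same output dimension:
short = stats_pooling(np.random.randn(50, 512))
long_ = stats_pooling(np.random.randn(300, 512))
```

This is what lets the front-end compress variable-duration utterances into fixed-dimension embeddings.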

Bayesian neural networks estimate the parameters' posterior distribution p(w|D) to model the weight uncertainty, which effectively enables an infinite number of possible model parameters to describe the training data.

This weight uncertainty modeling smooths the model parameters and helps the model generalize well on unseen data.

During the testing stage, the model output y given an input x is computed as the expectation over the weights' posterior distribution p(w|D), as shown in Equation 1.

However, directly computing this integration is intractable for neural networks of any practical size, since the number of possible weight values is infinite.
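For reference, the expectation referred to as Equation 1 is the standard Bayesian predictive distribution (reconstructed here, since the slide equations are not in the transcript; the notation may differ slightly from the slides):

```latex
p(y \mid x, \mathcal{D})
  = \mathbb{E}_{p(w \mid \mathcal{D})}\!\left[\, p(y \mid x, w) \,\right]
  = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
```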

So a variational approximation is commonly adopted to estimate the posterior distribution.

The variational approximation aims to find a set of parameters theta for a distribution q_theta(w) to approximate the posterior distribution p(w|D).

This is achieved by minimizing the Kullback-Leibler (KL) divergence between these two distributions, as shown in Equation 2.

From Equation 2 to Equation 4, we apply Bayes' rule and drop the constant term log p(D), which does not affect the minimization.

Notice that the resulting objective can be decomposed into two parts: one is the KL divergence between the approximation distribution q(w) and the prior distribution p(w); the other is the expectation of the log-likelihood of the training data over the approximation distribution q(w).

This objective is used as the loss function to be minimized in the training process.
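The derivation just described follows the standard variational inference recipe; reconstructed (the exact slide notation may differ), the objective is:

```latex
\theta^{*}
  = \arg\min_{\theta}\; \mathrm{KL}\!\left( q_{\theta}(w) \,\big\|\, p(w \mid \mathcal{D}) \right)
  = \arg\min_{\theta}\; \int q_{\theta}(w) \log
      \frac{q_{\theta}(w)}{p(w)\, p(\mathcal{D} \mid w)} \,\mathrm{d}w
  = \arg\min_{\theta}\;
      \underbrace{\mathrm{KL}\!\left( q_{\theta}(w) \,\|\, p(w) \right)}_{\text{prior term}}
      \;-\;
      \underbrace{\mathbb{E}_{q_{\theta}(w)}\!\left[ \log p(\mathcal{D} \mid w) \right]}_{\text{data term}}
```

where the constant term log p(D) has been dropped between the second and third lines.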

As commonly adopted, we assume that both the variational approximation q(w) and the prior distribution p(w) follow diagonal Gaussian distributions, with the variational parameters composed of mu_q and sigma_q, and the prior controlled by mu_p and sigma_p, respectively.

The two parts of the loss function can then be formulated as Equations 7 and 8, respectively. Since the expectation term has no closed-form solution, we apply Monte Carlo sampling to approximate the integration.

Finally, combining Equations 7 and 8, we obtain the final loss function, which can be directly optimized in the training process.

In order to evaluate the effectiveness of Bayesian learning for speaker verification in both short and long utterance conditions, we performed experiments on two datasets.

For the short utterance condition, we consider the VoxCeleb1 dataset. We adopted 4,874 utterances from 40 speakers for evaluation, and the remaining 148,642 utterances from 1,211 speakers are used for training the ASV system parameters.

For the long utterance condition, the NIST speaker recognition evaluation (SRE) 2010 dataset, commonly used for benchmarking long utterances, is adopted for evaluation. For the training set, we adopt the previous SRE corpora since 2004.

In total, we have around sixty thousand recordings from over six thousand speakers in the training set.

We evaluate the systems' generalization ability under evaluations with different mismatch degrees.

We performed in-domain and out-of-domain evaluations: when the training and testing stages are conducted on the same dataset, it is an in-domain evaluation; when they are executed on different datasets, it is an out-of-domain evaluation.

Thirty-dimensional mel-frequency cepstral coefficients (MFCCs) are adopted as the acoustic features in our experiments. The extracted MFCCs undergo mean normalization, and then voice activity detection filters out all non-speech frames.
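A toy sketch of this preprocessing chain (the simple energy threshold stands in for a real VAD module, and the frame values are random placeholders):

```python
import numpy as np

def cmn_and_vad(mfcc: np.ndarray, energy: np.ndarray, thresh: float) -> np.ndarray:
    """Cepstral mean normalization followed by dropping non-speech frames.

    `mfcc` is (T x D) frame features; `energy` is per-frame log-energy (T,).
    Frames whose energy falls below `thresh` are treated as non-speech."""
    normalized = mfcc - mfcc.mean(axis=0, keepdims=True)
    speech = energy > thresh
    return normalized[speech]

feats = np.random.randn(100, 30)  # hypothetical 30-dim MFCC frames
energy = np.concatenate([np.full(20, -5.0), np.full(80, 1.0)])
kept = cmn_and_vad(feats, energy, thresh=0.0)  # drops the 20 "silent" frames
```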

The x-vector network structure configuration is shown in Table 1.

Linear discriminant analysis is applied to reduce the x-vectors' dimension.

To make a fair comparison, the Bayesian x-vector system is configured with the same architecture as the baseline system, except that the first TDNN layer is replaced by its Bayesian version with the same number of units.

Stochastic gradient descent is adopted as the optimizer.

The evaluation metrics adopted in this work are the commonly used equal error rate (EER) and minimum detection cost function.
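As a rough sketch of how the EER can be computed from trial scores (toy scores and a simple threshold sweep over observed values; a production implementation would interpolate the ROC curve):

```python
import numpy as np

def equal_error_rate(target_scores: np.ndarray, nontarget_scores: np.ndarray) -> float:
    """EER: the operating point where the false rejection rate (FRR)
    equals the false acceptance rate (FAR), found by sweeping a
    decision threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = float(np.mean(target_scores < t))      # targets rejected
        far = float(np.mean(nontarget_scores >= t))  # non-targets accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Perfectly separated toy scores give an EER of 0:
eer = equal_error_rate(np.array([0.8, 0.9, 0.95]), np.array([0.1, 0.2, 0.3]))
```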

Here are the in-domain evaluation results. We observe that the EERs consistently decrease after incorporating Bayesian learning, on both datasets.

On each dataset, we consider the average relative EER decrease across the cosine and PLDA back-ends.
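For clarity, this comparison metric can be sketched as follows (the EER values below are made up for illustration and are not the paper's reported numbers):

```python
def avg_relative_eer_decrease(baseline_eers, new_eers) -> float:
    """Average, across back-ends, of (baseline - new) / baseline,
    expressed as a percentage."""
    rel = [(b - n) / b for b, n in zip(baseline_eers, new_eers)]
    return 100.0 * sum(rel) / len(rel)

# Hypothetical EERs for the cosine and PLDA back-ends:
decrease = avg_relative_eer_decrease([10.0, 8.0], [9.0, 7.6])
```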

On the VoxCeleb1 dataset, the average relative EER decrease of the Bayesian x-vector system over the baseline is around 2.6 percent, and the fusion system achieves a further relative EER decrease.

On the NIST SRE 2010 database, the average relative EER decrease is around 3.2 percent for the Bayesian x-vector system and around 3.8 percent for the fusion system.

We also observe consistent improvements in detection cost function performance after applying Bayesian learning. These observations verify the improved generalization ability brought by applying Bayesian neural networks.

Figure 4 illustrates the detection error tradeoff (DET) curves of all systems with the cosine back-end, benchmarked on the VoxCeleb1 dataset.

It shows that the proposed Bayesian system outperforms the baseline at all operating points, and the fusion system shows further improvements, owing to the complementary advantages of the baseline and Bayesian systems.

Here are the out-of-domain evaluation results, where the model trained on VoxCeleb1 was evaluated on NIST SRE 2010, and vice versa.

System performance degrades significantly due to the large mismatch between the training and evaluation data.

From the table, we observe that the systems benefit more from the generalization ability brought by Bayesian learning in this setting.

We also consider the average relative EER decrease across the cosine and PLDA back-ends for performance evaluation.

In the experiments evaluated on the NIST SRE 2010 database, the average relative EER decreases over the baseline system are over four percent for the Bayesian x-vector system and over six percent for the fusion system.

For the experiments evaluated on the VoxCeleb1 dataset, the average relative EER decrease is 3.07 percent for the Bayesian x-vector system, and the fusion system goes further, with an average relative EER decrease of 6.41 percent.

The larger relative EER decreases compared to the in-domain evaluations suggest that Bayesian learning could be more beneficial when a larger mismatch exists between the training and evaluation data. The last column in the table shows the corresponding minimum detection cost function performance, where we also see consistent improvements from applying Bayesian learning and from the fusion system.

Similar to the observation in Figure 4, the detection error tradeoff curves in Figure 5 also show consistent improvements from applying Bayesian learning and from the fusion system, at all operating points.

In this work, we incorporated Bayesian neural networks into the x-vector system to improve the model's generalization ability.

Our experimental results verify that Bayesian learning enables a consistent generalization ability improvement over the x-vector system in both short and long utterance conditions.

The fusion system achieves further improvements in overall system performance. The larger improvements observed in the out-of-domain evaluation results suggest that Bayesian learning is more beneficial when a larger mismatch exists between the training and evaluation data.

Possible future research will focus on incorporating Bayesian learning into end-to-end speaker verification systems.

Thanks for listening.