Welcome to the presentation of the paper "Delving into VoxCeleb: environment invariant speaker recognition".

This is joint work by Joon Son Chung, Jaesung Huh and Seongkyu Mun.

The goal of this work is speaker recognition.

Speaker recognition is identifying a person from the characteristics of their voice; in effect, determining who is speaking.

It can be categorised into a closed-set problem or an open-set problem.

In the closed-set setting, all testing identities are enrolled during training, so it can be addressed as a classification problem. We normally refer to this problem as speaker identification.

On the other hand, in the open-set setting, testing identities are not seen during training, which is closer to practice. We call this problem speaker verification.

In speaker verification, we extract the speaker representations of two speech signals and compare them to determine whether the two speech segments are from the same person or not.

Like many other research areas, progress in speaker verification has been facilitated by the availability of a large-scale dataset called VoxCeleb, a dataset consisting of short clips of human speech extracted from interview videos.

The speakers span a wide range of ethnicities, accents, professions and ages.

The videos included in the dataset are shot in a large number of challenging visual and auditory environments. The dataset has been widely used for training and testing speaker recognition models.

VoxCeleb1 contains over 100,000 utterances from 1,251 celebrities, while VoxCeleb2 contains over one million utterances from over 6,000 celebrities, extracted from videos uploaded to YouTube.

Most of the previous research only focuses on boosting the performance of speaker verification on a given test dataset. There have been a number of works proposing more advanced architectures or loss functions suitable for the task, but these works do not consider what information is learned by the models, and whether it is useful information or undesirable biases present in the dataset.

In speaker recognition, the challenge comes down to the ability to separate the voice characteristics from the environments in which the voices are recorded.

In real-world scenarios, a person usually enrolls and verifies their voice in the same environment. Therefore, if the voice characteristics and the environment information are entangled, the voice features of the same person speaking in different environments can be far apart in the embedding space.

Moreover, the VoxCeleb dataset consists of recordings from diverse but finite environments for each speaker, making it possible for the model to overfit to the environments as well as to the voice characteristics.

Therefore, we do not know whether a network learns only the voice characteristics, or environment and session biases as well. In order to prevent this, we must look beyond classification accuracy as the only learning objective.

So, the objective of this work is to learn a speaker-discriminative and environment-invariant speaker verification network.

In this work, we introduce an environment adversarial training framework in which the network learns embeddings that are discriminative of the speaker but invariant to the recording environment, using only the video-level labels that are already available in the dataset. We show that our environment adversarial training allows the network to generalize better to unseen conditions.

The motivation of our training framework is that a model should not be able to discriminate between two clips of the same speaker from the same video and those of the same speaker from different videos.

Now, let's talk about our training framework. This is the overview of the training phase; we will explain it in detail in the later slides.

First, we will talk about the batch formation. Each mini-batch consists of three two-second audio segments drawn from different speakers. Two of the three audio segments from each speaker are from the same video, and the other is from a different video.

The two same-video segments can be either from different parts of the same audio clip, or from another clip of the same YouTube video. Such batch selection can be performed using the video labels in the VoxCeleb dataset, as shown on the left.

Here, our assumption is that the audio clips from the same video would have similar recording environments, whereas the clips from different videos would have more different channel characteristics.
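As a rough illustration of this batch formation, here is a minimal Python sketch. The metadata layout (speaker → video → segment ids) and the function name are hypothetical stand-ins, not the authors' code:

```python
import random

def make_minibatch(metadata, n_speakers=3, seed=0):
    """Sample a mini-batch of three segments per speaker: an anchor and a
    positive from one video (assumed similar environment) and a negative
    from a different video of the same speaker.

    `metadata` maps speaker id -> {video id -> list of segment ids};
    this layout is a hypothetical stand-in for the VoxCeleb metadata.
    """
    rng = random.Random(seed)
    batch = []
    for spk in rng.sample(sorted(metadata), n_speakers):
        videos = metadata[spk]
        vid_a, vid_b = rng.sample(sorted(videos), 2)
        # two segments from the same video ...
        anchor, positive = rng.sample(videos[vid_a], 2)
        # ... and one from a different video of the same speaker
        negative = rng.choice(videos[vid_b])
        batch.append((spk, anchor, positive, negative))
    return batch
```

Each speaker needs at least two videos, and each sampled video at least two segments, for this sketch to apply.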

Now, let's talk about the environment phase. The environment network is trained to predict whether or not two audio segments come from the same environment, that is, the same video.

With a triplet loss, the anchor and the segment from the same video as the anchor form a positive pair, and the anchor and the segment from a different video form a negative pair. The gradient is backpropagated only to the environment network, so the CNN feature extractor is not optimized during this phase.
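A distance-based triplet loss of the kind described here can be sketched as follows; the margin value and function names are illustrative, not taken from the paper:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def env_triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embeddings: pull the same-video
    (positive) pair together, push the different-video (negative)
    pair apart by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

In a framework such as PyTorch, restricting the gradient to the environment network would be done by detaching the CNN features (e.g. `features.detach()`) before feeding them to this loss.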

In the speaker phase, the CNN feature extractor and the speaker recognition network are trained simultaneously using the standard cross-entropy loss.

In addition, the confusion loss penalizes the network's ability to discriminate between the clips originating from the same environment and those from different environments. This is done by minimising the KL divergence between the softmax over the pair distances and the uniform distribution.

The environment or channel information can be seen as an undesirable source of variation, so it should be absent from an ideal speaker embedding.
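The confusion loss described here, a KL divergence between a softmax over pair distances and the uniform distribution, can be sketched in plain Python; the exact quantities fed to the softmax in the paper may differ:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def confusion_loss(pair_distances):
    """KL(p || uniform) where p = softmax(pair_distances).
    KL(p || u) = sum_i p_i * log(p_i * n), since u_i = 1/n."""
    p = softmax(pair_distances)
    n = len(p)
    return sum(pi * math.log(pi * n) for pi in p)
```

The loss is exactly zero when the distribution is uniform, i.e. when the embedding gives the environment classifier no signal to distinguish the pairs.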

The extent to which the confusion loss contributes to the overall loss function is controlled by a variable alpha. Alpha equal to zero means normal training of the speaker verification network. As alpha increases, the effect of the confusion loss increases too.
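Under assumed notation (these symbol names are ours, not necessarily the paper's), the weighting just described can be written as:

```latex
L_{\text{total}} = L_{\text{CE}} + \alpha \, L_{\text{conf}}
```

where L_CE is the speaker cross-entropy loss, L_conf the confusion loss, and alpha >= 0 the trade-off weight; alpha = 0 recovers standard speaker verification training.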

Now, we are going to explain our experiments. Our experiments are performed on two different architectures, VGG-M-40 and Thin ResNet-34.

Although the original VGG-M is not a state-of-the-art network, it is known for high efficiency and good classification performance. VGG-M-40 is a modification of the network that takes 40-dimensional Mel filterbanks as input instead of the whole spectrogram, significantly reducing the number of computations.

Thin ResNet-34 is the same as the original 34-layer ResNet, except with only a quarter of the channels in each residual block, in order to reduce computational cost.

For aggregation, we use two types of pooling: temporal average pooling (TAP) and self-attentive pooling (SAP). TAP simply takes the mean of the features along the time domain, while SAP is introduced to pay more attention to the frames that are more informative for utterance-level speaker recognition.
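The two pooling schemes can be sketched as follows. In the real network the SAP attention scores come from a learned attention layer; here they are passed in directly for illustration:

```python
import math

def tap(frames):
    """Temporal average pooling: mean of frame features over time.
    `frames` is a list of frames, each a list of floats."""
    T, D = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / T for d in range(D)]

def sap(frames, scores):
    """Self-attentive pooling (sketch): a weighted mean where the
    weights are a softmax over per-frame scores, so informative
    frames contribute more to the utterance-level embedding."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    D = len(frames[0])
    return [sum(wi * f[d] for wi, f in zip(w, frames)) for d in range(D)]
```

With equal scores, SAP reduces to TAP.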

For the speaker and environment networks, a single fully connected layer is used for the speaker network, and a two-layer fully connected network with ReLU activations is used for the environment network.

We train and evaluate our models on the VoxCeleb1 dataset. Unlike the original protocol, we train on the overlapping part of the development sets for identification and verification, so that the models trained for identification can also be used to evaluate verification.

This makes speaker identification a 1,211-way classification task, with a test set consisting of unseen utterances from the speakers seen during training. For verification, all speech segments from the 1,211 development-set speakers are used for training, and the trained model is then evaluated on the entirely unseen test speakers.

During training, we use a fixed-length two-second temporal segment, extracted randomly from each utterance. Spectrograms are extracted with a Hamming window of 25-millisecond width and 10-millisecond step, and 257-dimensional spectral frames are used as the input to the Thin ResNet network.

For VGG-M-40, 40-dimensional Mel filterbank coefficients are used as the input instead. Mean and variance normalization is performed on every frequency bin of the spectrogram at the utterance level.

No voice activity detection or data augmentation is used during training.
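The framing step behind these numbers can be sketched as follows, assuming a 16 kHz sampling rate (the usual rate for VoxCeleb, though not stated in the talk):

```python
import math

SAMPLE_RATE = 16000                   # assumed sampling rate
WIN = int(0.025 * SAMPLE_RATE)        # 25 ms Hamming window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)        # 10 ms step -> 160 samples

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal):
    """Slice a waveform into overlapping windowed frames, as done before
    the FFT when extracting the spectrogram. A 512-point FFT of each
    400-sample frame would yield the 257 frequency bins mentioned in
    the talk (512 // 2 + 1)."""
    win = hamming(WIN)
    frames = []
    for start in range(0, len(signal) - WIN + 1, HOP):
        chunk = signal[start:start + WIN]
        frames.append([w * s for w, s in zip(win, chunk)])
    return frames
```

A two-second segment at 16 kHz (32,000 samples) then yields 198 frames.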

The networks are trained on a classification task, but the verification task requires a measure of similarity. In our work, the final layer for the classification task is replaced with a low-dimensional embedding layer, and this layer is trained with a contrastive loss with hard negative mining. The CNN feature extractor is not fine-tuned with the contrastive loss.

The networks are trained using stochastic gradient descent, with an initial learning rate of 0.01, decreased by a factor of 0.9 every epoch. Training runs for up to one hundred epochs, and stops whenever the validation error rate stops improving.

The replay experiment measures the performance on the same subset of the test set used in the speaker verification task, but with the audio replayed on a loudspeaker and re-recorded using a different microphone. This results in a significant change in channel characteristics and a degradation of sound quality. The models are identical to those used in the previous experiments, and are not fine-tuned on the replayed segments.

This table reports the results for the multiple models used for evaluation, across both the speaker identification and verification tasks. The models trained with the proposed adversarial strategy, i.e. with alpha greater than zero, consistently outperform those trained without it, i.e. with alpha equal to zero. The replay equal error rate is the result of the replay experiment mentioned in the previous slide.

The improvement in performance as alpha increases is more pronounced in this setting, which suggests that the models trained with the proposed adversarial training generalize much better to unseen environments or channels.

Our next experiment examines how well the adversarial training removes environment information from the embedding. The test list for evaluating environment recognition consists of 9,486 same-speaker pairs, half of which come from the same video and the other half from different videos.

A lower equal error rate means that the network is better at predicting whether or not a pair of audio segments comes from the same video. The results demonstrate that environment recognition performance decreases with increasing alpha, which shows that the unwanted environment information is removed from the speaker embedding to an extent.
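The equal error rate used in this evaluation can be computed with a simple threshold sweep; this is a generic sketch, not the authors' scoring script:

```python
def equal_error_rate(scores, labels):
    """Equal error rate: sweep thresholds over the scores and return the
    operating point where false acceptance and false rejection rates are
    closest. `labels` is 1 for a same-video pair, 0 for a different-video
    pair; higher scores mean 'same video'."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # false rejection: positive pairs scored below the threshold
        fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold) / pos
        # false acceptance: negative pairs scored at or above it
        fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold) / neg
        if abs(fa - fr) < best_gap:
            best_gap, eer = abs(fa - fr), (fa + fr) / 2
    return eer
```

Perfectly separable scores give an EER of zero; chance-level scores give around 0.5.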

To summarize our work:

First, we proposed an environment adversarial training framework to learn a speaker-discriminative and environment-invariant network.

Secondly, our proposed method improves speaker verification performance in evaluation on the VoxCeleb1 dataset.

Finally, we also probed the network's ability to identify the environment information, and showed that the adversarial training removes it from the embedding.

Thank you.