i know my mean is shocking ending this the ubiquity of these vectors and training

workshop

I will present our paper, "Selective deep speaker embedding enhancement for speaker verification".

The contents are as follows. I will start with an introduction and the motivation.

Next, I will go over the VOiCES dataset.

Then I will introduce the baseline system.

We use RawNet. After that, the two proposed systems are explained.

Experiments and the corresponding results will then be presented, followed by our conclusion.

I will start with the introduction.

Recently, deep neural networks have been used to achieve state-of-the-art performance in speaker verification.

However, distant utterances are well known to degrade performance because they contain environmental factors such as reverberation and noise.

To study such cases, the Voices Obscured in Complex Environmental Settings (VOiCES) from a Distance Challenge was held, together with the VOiCES dataset.

Previously, several studies have proposed compensation for the performance degradation caused by distant environments.

However, two problems remain in the existing compensation methods.

The first is the degradation on close-talk utterances.

Applying the compensation gave a good improvement in the recognition of distant utterances.

However, when the distant compensation technique was applied to close-talk utterances, the performance deteriorated.

Owing to this phenomenon, it is difficult to use a compensation system when utterances come from various distances.

Second, there is a dependency on the speaker embedding system.

When a new speaker embedding structure is proposed, the corresponding studies on compensation should be conducted again as well.

To alleviate these problems, we want to build a system with the following properties.

First, it should be independent of the front-end speaker embedding extractor.

Second, the proposed system should perform selective enhancement, while considering the distance between the speech and the microphone.

Thirdly, both close-talk and distant utterances can be input into the proposed system.

why not only

the problem of the system comprise all you late we simply architecture

the cost minima or had to store all honestly cross that line

We propose two distant utterance compensation systems.

The first proposed system adjusts the amount of enhancement according to the required compensation.

We design a skip connection gating to determine the level of noise and reverberation, and apply compensation accordingly.

The second proposed system is based on the auto-encoder framework, with the key being the division of the hidden representation into two subspaces.

One subspace is targeted to contain clean speaker information, using an additional objective function on this hidden layer.

The other subspace is targeted to contain subsidiary information such as reverberation and noise.

Next, the dataset used in this study will be described.

that was dataset was collected by clinton levers this dataset

so one loss or

It was recorded under various acoustic conditions.

The acoustic conditions were altered according to the room, the type and location of the microphone, the placement angle, and the distractors.

In the VOiCES dataset, there are three hundred speakers.

The development set comprises utterances from two hundred speakers, and the evaluation set comprises utterances from the other one hundred speakers.

Next, I introduce RawNet, which is used as the baseline.

RawNet is a raw waveform speaker embedding extractor.

that you will know where one time actually

when can as four or so used to extract speaker embedding

Mel-frequency cepstral coefficients, log mel-filterbank energies, or spectrograms are widely used.

These acoustic features use human knowledge to emphasize discriminative features.

A convolutional neural network, which is frequently used as an embedding extractor, gradually increases its receptive field.

From this perspective, when such features are input to the CNN, the receptive field can consider only adjacent time and frequency regions in the layers that are close to the input layer.

Although these conventional acoustic features are still widely used, many studies have also explored the raw waveform as a direct input to DNNs.

It is expected that data-driven learning can better extract discriminative information using more layers.

When raw waveforms are processed by DNNs, additional frequency responses can be extracted.

In addition, the processing is optimized to the data and the task.

RawNet adopts a convolutional-recurrent architecture, where residual blocks of CNNs extract frame-level representations, as illustrated here.

no one installation the plot is similar to the original last night

well the whole mess clean a year

These representations are then fed into a uni-directional gated recurrent unit layer, which aggregates them into a single utterance-level representation.

A fully connected layer with one thousand twenty-four nodes then conducts an affine transformation, and its output is later used as the speaker embedding.

In this section we introduce the two systems for speaker embedding enhancement.

The first proposed system is the skip connection-based selective enhancement (SCSE).

The figure on the right shows the framework of SCSE.

this system comprise all p n in that in a speaker embedding asking condition

in on the other segments kiss each and unit

sc cantonese out you know is able to encoder

and sat in a decidedly stencil activity in the skin condition similar to the case

becomes you

During the training phase, the SEDNN is trained to minimize a mean squared error objective function.

routine do not include any in a speaker embedding

when a source utterances include

sc on the only on structural be included

On the other hand, when a distant utterance is input, the SEDNN is trained to output the embedding of the source utterance that was used to make the distant utterance.

The SGDNN is trained to minimize a binary cross-entropy objective function.

When a source utterance is input, the binary label is one, to make the skip connection fully working.

When a distant utterance is input, the binary label is zero, to deactivate the skip connection.

In the figure below, the top panel presents the training phase of our proposed SCSE.

According to a previous study, when compensation is conducted in the speaker embedding space, the compensation may not be effective even though the mean squared error is low.

This phenomenon is analyzed as a result of losing the discriminative power of the speaker embedding by changing values in the high-dimensional embedding space.

Based on this knowledge, we add another component to the proposed system: speaker identification, for which the categorical cross-entropy loss function is used.

So the final loss function used to train the SCSE is as described here.

The first term is the reconstruction error.

The second term measures the distant detection error.

The third term measures the speaker identification error.
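As a rough illustration of how the three error terms just described could be combined, here is a minimal pure-Python sketch. The function names, the unweighted sum, and the toy inputs are assumptions made for this example; the talk does not state how the terms are weighted.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors (reconstruction error)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 label (distant detection error)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def cce(logits, speaker_index):
    """Categorical cross-entropy from raw logits (speaker identification error)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[speaker_index]


def combined_loss(enhanced, target, gate_prob, is_source, logits, speaker_index):
    """Sum of the three terms; the equal weighting is an assumption of this sketch."""
    return (mse(enhanced, target)
            + bce(gate_prob, is_source)
            + cce(logits, speaker_index))
```

For example, `combined_loss([1.0, 2.0], [1.0, 2.0], 0.5, 1, [0.0, 0.0], 0)` has zero reconstruction error, so only the detection and identification terms contribute.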

this entire in a speaker and battery

In the test phase, the speaker embedding is input to both the SEDNN and the SGDNN.

The skip connection that connects the input and output of the SEDNN is not simply on or off.

Instead, the SGDNN ends with a sigmoid activation function.

It outputs a value between zero and one and produces a scaled skip connection.

The enhanced speaker embedding is obtained by adding the output of the SEDNN and the scaled skip connection.
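The test-time computation described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: both networks are replaced by single affine layers, and the names `enhancement` and `gate`, as well as all weights, are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_skip_forward(x, enh_w, enh_b, gate_w, gate_b):
    """Return (enhanced embedding, gate value) for input embedding x.

    enh_w/enh_b stand in for the enhancement network and gate_w/gate_b
    for the gating network; real systems would use deeper DNNs.
    """
    enhancement = linear(enh_w, enh_b, x)
    gate = sigmoid(linear([gate_w], [gate_b], x)[0])  # scalar in (0, 1)
    # Enhanced embedding = enhancement output + gate-scaled copy of the input.
    enhanced = [e + gate * v for e, v in zip(enhancement, x)]
    return enhanced, gate
```

With a gate close to one the input embedding passes through almost unchanged, which is the behaviour wanted for clean source utterances; with a gate close to zero the output is dominated by the enhancement network.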

In the figure below, the bottom panel represents the test process of our proposed SCSE.

The second proposed system is referred to as the selective enhancement discriminative auto-encoder (SEDA).

It is composed of an encoder, a decoder, and two intermediate hidden layers.

The architecture design follows the discriminative auto-encoder structure.

Inspired by this structure, SEDA uses one intermediate hidden layer to collect the reverberation and noise, and the other hidden layer to contain clean speaker information.

so that i used an intermediate human lay your next time s ideally and always

isolated

you has been very

When training SEDA, the loss functions correspond to minimizing the intra-class variance and maximizing the inter-class variance.

We utilize center loss and internal dispersion loss.

Center loss is used to minimize the intra-class variance, which makes the embedding more discriminative.

Internal dispersion loss was used to maximize the inter-class variance.
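To make the intra-class term concrete, here is a minimal sketch of a center-loss style penalty in pure Python. It assumes that each class center is simply the mean of the embeddings of that speaker in the batch; in practice centers are usually learned or updated gradually during training, and the function names here are hypothetical.

```python
def class_centers(embeddings, labels):
    """Mean embedding per label (an assumption: centers are plain batch means)."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance of each embedding to its class center."""
    total = 0.0
    for emb, lab in zip(embeddings, labels):
        total += sum((v - c) ** 2 for v, c in zip(emb, centers[lab]))
    return 0.5 * total / len(embeddings)
```

A small value means that embeddings of the same speaker lie close together, which is what minimizing the intra-class variance aims for.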

In the same way as in the previous SCSE, a binary cross-entropy loss function was used for training.

To address the imbalance between the number of source utterances and distant utterances in the training set, different sample weights are given according to the input.
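The idea of giving different sample weights to source and distant inputs can be sketched as a weighted binary cross-entropy. The weight values passed in below are placeholders chosen for illustration, not the values used in the talk, and the function name is hypothetical.

```python
import math


def weighted_bce(probs, labels, w_source, w_distant, eps=1e-12):
    """Weighted binary cross-entropy.

    labels: 1 for a source utterance, 0 for a distant utterance.
    w_source / w_distant: per-class sample weights (illustrative values only).
    """
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        w = w_source if y == 1 else w_distant
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum  # normalise by the total weight
```

Giving the rarer class a larger weight keeps the more frequent class from dominating the gradient.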

The categorical cross-entropy loss function is also used, for speaker identification.

The final loss function of the proposed SEDA system is described below.

here can my is all hyper parameter the scale the omission or try to this

time

and at times all hyper parameter the combined always function gender roles and inter racial

noticed

Now let us move on to the experiments and results.

The train set comprises the VOiCES development set and the VoxCeleb1 and VoxCeleb2 datasets.

baseline alone a system

in cologne where called is a two

it in nine thousand

what's a nice sample which a car or was to recognise that was second

we're meeting that's construction

to the so

we had to click a short utterance and a common and the call me

All the details are presented in the paper.

The baseline system uses the RawNet architecture, with some modifications.

first set and the number of the articulators no to seven about it

by on the sisters tree

to consider more speakers

Secondly, we increased the dimensionality of the speaker embedding to one thousand twenty-four.

"'kay" the glow described here top on it in a single system o'connor's from the

always the challenge

and our baseline system with various congregation

target comparison between the current system in our baseline

kind of in may going to the occurrence in the

input feature

tries the congregation

and binary classifiers

We also describe the results when using the VOiCES dataset for training, with all three train sets.

We first trained on VoxCeleb1 and VoxCeleb2, then replaced the top layer and conducted fine-tuning with the VOiCES set.

The results show that training on all three datasets simultaneously provides the best performance.

For the proposed SCSE, we explored the learning rate, the scheduler, and the optimizer.

The best performance was obtained when we used SGD and a cosine annealing scheduler.
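For reference, a cosine annealing schedule without restarts can be written as a one-line formula. The sketch below uses hypothetical parameter names and makes no claim about the actual learning rates or step counts used in the experiments.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`.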

sc show six point

it's by orson the year

where the test set and then the only channels three percent laid our reduction of

compared to the baseline

We experimented with the proposed SEDA using various batch sizes and optimizers.

The best performance was achieved when we used Adam.

and set aside to ten thousand

the set i shows system only or seven percent a year or the test set

and fifteen point nine seven percent are

compared to the baseline

Score normalization techniques are frequently applied under various acoustic mismatch conditions.

Most of the participants in the VOiCES 2019 challenge also used score normalization techniques, such as z-norm, t-norm, and s-norm.

We experimented with these techniques on our baseline RawNet and on the two proposed systems, SCSE and SEDA.

The results are reported in the table below.

The results show that z-norm demonstrates the best performance in most cases in our experiments.

In addition, score normalization brought additional improvement for the two proposed systems.

We obtained an EER of six point one nine percent with z-norm.
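As a reminder of what z-norm does, here is a minimal sketch: a raw trial score is standardised with the mean and standard deviation of the scores that the same enrolment obtains against a cohort of impostor utterances. The cohort scores below are made-up numbers and the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_norm(raw_score, cohort_scores):
    """Standardise a raw score using impostor cohort statistics (z-norm)."""
    mu = mean(cohort_scores)
    sigma = pstdev(cohort_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mu) / sigma


# Illustrative use with made-up impostor scores for one enrolment.
normalised = z_norm(0.8, [0.1, 0.3, 0.5])
```

After normalisation, scores from different enrolments are on a comparable scale, which makes a single decision threshold more meaningful.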

Finally, we present the conclusion.

In this study we proposed two speaker embedding enhancement systems.

Both proposed systems are independent of the front-end speaker embedding extractor.

They can process not only distant utterances but also close-talk utterances.

This property can prevent the performance degradation that occurs when close-talk utterances are input into the speaker embedding enhancement system together with distant utterances.

Compared to the baseline system, the two proposed systems, SCSE and SEDA, showed relative error reductions of eleven point two or three percent and fourteen point nine three percent, respectively.

These results show that the proposed systems are effective on both close-talk and distant utterances.

Our future work is integrating the two proposed systems into a single speaker embedding enhancement system.

Thank you for listening.