Hello, my name is [name unclear], and this is the Odyssey speaker and language recognition
workshop.
I will present our paper, selective deep speaker embedding enhancement for speaker verification.
These are the contents. We will start with an introduction and the motivation.
Next, we are going to describe the VOiCES dataset,
and then we introduce the baseline system,
for which we use RawNet. After that, the two proposed systems are explained.
Experiments and the corresponding results will then be presented, followed by our conclusion.
I will start with the introduction.
Recently,
deep neural networks have been achieving state-of-the-art performance in speaker verification.
However, distant utterances are well known to degrade performance because they contain environmental factors
such as reverberation and noise.
Speaker verification is often used for security in complex environments, so to address this
problem the VOiCES challenge was held
and the VOiCES dataset was released.
Previously,
several studies have proposed compensation for the performance degradation in distant environments.
However, two problems remain in the existing compensation methods.
First,
there is a degradation on close-talk utterances.
Applying the compensation gave a good improvement in the recognition of distant utterances.
However, when the distant compensation technique was applied to close-talk utterances, the performance
deteriorated.
In real-world scenarios, a compensation system is used where utterances come
from various distances.
Second,
there is a dependency on the speaker embedding extraction system.
When a new speaker embedding extractor is proposed,
the corresponding compensation studies have to be conducted again as well.
To overcome these
previous problems,
we want to build a system with the following properties.
First,
it should be independent of the front-end speaker embedding extractor.
Second,
the proposed system should perform selective enhancement
while considering the distance between the speech and the microphone.
Third,
both close-talk and distant utterances can be input
into the proposed system.
Finally,
the proposed systems comprise relatively simple architectures,
so the cost they add to the overall speaker verification pipeline is minimal.
We propose two distant utterance compensation systems.
The first proposed system uses a skip connection so that it can enhance the embedding only as much as required.
For selective compensation,
we design the skip connection to determine the level of noise and reverberation
and apply compensation accordingly.
The second proposed system is based on the auto-encoder framework,
dividing the hidden representation
into two subspaces, so that one subspace contains strictly speaker information,
improving the quality of the encoded embedding.
One subspace is targeted to contain clean speaker information only, by applying an objective
function to this hidden layer,
and the other subspace is targeted to
contain nuisance information such as reverberation and noise.
Next, the dataset used in this study will be described.
The VOiCES dataset was collected by [unclear].
It consists of
real recordings made in rooms under various distance and acoustic conditions.
The acoustic conditions vary according to the room,
the microphone,
the loudspeaker angle, and the distractors.
In the VOiCES dataset
there are three hundred speakers.
The development set comprises utterances [count unclear] from
two hundred speakers, and the evaluation set comprises utterances [count unclear] from
one hundred speakers.
We introduce RawNet, which is used as the baseline.
RawNet is a neural speaker embedding extractor
that takes raw waveforms as input directly.
Conventionally, acoustic features are used to extract speaker embeddings:
mel-frequency cepstral coefficients
or log mel spectrograms are the most widely used.
These acoustic features apply human knowledge to emphasize discriminative features.
A convolutional neural network, which is frequently used as an embedding extractor,
gradually increases its receptive field.
From this perspective, when a spectrogram is input to a CNN, the receptive field can
consider only a limited time and frequency region
in the layers
that are close to the input layer.
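The point about the receptive field growing layer by layer can be illustrated with a small calculation. This is only a sketch of the standard receptive-field recurrence for stacked one-dimensional convolution or pooling layers; the kernel sizes and strides in the example are hypothetical and are not taken from the talk.

```python
def receptive_field(layers):
    """Receptive field (in input samples) after each stacked 1-D layer.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    field = 1  # before any layer, one unit sees one input sample
    jump = 1   # spacing, in input samples, between neighbouring units
    history = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        history.append(field)
    return history


# Hypothetical stack: four layers, each with kernel size 3 and stride 3.
growth = receptive_field([(3, 3)] * 4)
print(growth)  # [3, 9, 27, 81]
```

Layers near the input see only a few samples, and only the deeper layers cover a wide context, which is the limitation the speaker describes for early layers.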
Although
these conventional acoustic features are widely used,
many recent studies also explore raw waveforms as the input to DNNs.
It is expected that data-driven learning can better extract discriminative information.
When raw waveforms are processed by DNNs,
arbitrary frequency responses can be learned,
and the whole spectral band can be exploited.
In addition, the processing is optimized to the data and the task.
RawNet adopts a CNN-GRU architecture, where residual CNN blocks
extract frame-level representations,
as illustrated here.
The residual block is similar to that of the original ResNet.
[one phrase unclear]
These representations are fed into a uni-directional gated recurrent unit layer
to aggregate them into a single utterance-level representation.
A fully connected layer with one thousand twenty-four nodes
then conducts an affine transformation, and its output is later used as the speaker embedding.
In this section we introduce the two systems for speaker embedding enhancement.
The first proposed system is the skip connection-based selective enhancement,
which we abbreviate as SCSE.
This system comprises a DNN that enhances a speaker embedding, a skip connection,
and another DNN that acts as a binary classifier.
The enhancement DNN is similar to an auto-encoder,
and the binary classifier DNN decides the activation of the skip connection, similar to a gate
mechanism.
During the training phase,
the enhancement DNN is trained to minimize a mean squared error objective function
between its output and the target speaker embedding.
When a source utterance is input,
SCSE is trained to reconstruct the input.
On the other hand, when a distant utterance is input,
SCSE is trained to
output the embedding of the source utterance
that was used to make the distant utterance.
The binary classifier DNN is trained to minimize the binary cross-entropy objective
function.
When a source utterance is input, the binary label is one, to make the skip
connection fully working,
and when a distant utterance is input, the binary label is zero, to make the
skip connection inactive.
In the figure below,
the top panel presents the training phase of our proposed
SCSE.
According to a previous study,
when compensation is conducted in the speaker embedding space,
the compensation may not be effective even though the reconstruction error is low.
This phenomenon is analyzed as being caused by losing the discriminative power
of the speaker embedding by changing values
in the high-dimensional speaker embedding space.
Based on this knowledge, we add a component to the proposed system:
a speaker identification term, for which the categorical cross-entropy loss function is
used.
So the final loss function used to train the SCSE is
as
described here.
The MSE loss measures the reconstruction error,
the BCE loss measures the distance detection error,
and the CCE loss measures the speaker identification error
of the enhanced speaker embedding.
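The three terms named here (reconstruction, distance detection, speaker identification) can be combined as in the following sketch. It assumes an unweighted sum, which the talk does not confirm, and every function and parameter name is hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label):
    """BCE for one probability in (0, 1) and a 0/1 label."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))


def categorical_cross_entropy(logits, speaker_index):
    """Negative log-softmax of the true speaker, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[speaker_index]


def scse_loss(enhanced, target, gate_prob, is_source,
              speaker_logits, speaker_index):
    """Assumed unweighted sum of the three training terms."""
    return (mse(enhanced, target)
            + binary_cross_entropy(gate_prob, is_source)
            + categorical_cross_entropy(speaker_logits, speaker_index))


# Toy values: perfect reconstruction, undecided gate, two-speaker tie.
total = scse_loss([0.1, 0.2], [0.1, 0.2], 0.5, 1, [0.0, 0.0], 0)
```

In this toy call the reconstruction term is zero, so the total is the two cross-entropy terms, each equal to the natural log of two.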
In the test phase, the speaker embedding is input to the enhancement DNN
and the binary classifier DNN.
The skip connection that connects the input and the output of the enhancement DNN is multiplied
by the output of the binary classifier DNN,
which has a sigmoid activation function.
This value ranges between zero and one and scales the skip connection.
Finally, the speaker embedding is enhanced by adding the output of the
enhancement DNN
and the scaled skip connection.
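The test-time combination described here, a skip connection scaled by a sigmoid value and added to the enhancement output, can be sketched as follows. The two networks are replaced by precomputed values, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def selective_enhance(embedding, enhanced, gate_logit):
    """Add the enhancement output to the gated skip connection.

    embedding  : input speaker embedding (list of floats)
    enhanced   : output of the enhancement network for that embedding
    gate_logit : pre-sigmoid output of the binary classifier network
    """
    gate = sigmoid(gate_logit)
    return [e + gate * x for e, x in zip(enhanced, embedding)]


# A gate logit of 0 gives a gate value of 0.5.
result = selective_enhance([1.0, 2.0], [0.5, 0.5], 0.0)
print(result)  # [1.0, 1.5]
```

With a strongly negative logit the gate approaches zero and only the enhancement output remains; with a strongly positive logit the input embedding passes through the skip connection almost unchanged.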
In the figure below, the bottom panel represents the test process of our proposed
SCSE.
The second proposed system is the selective enhancement discriminative auto-encoder,
SEDA.
It is composed of an encoder, a decoder, and two intermediate hidden layers,
as you can see here in the SEDA architecture.
The architecture design follows the discriminative auto-encoder structure.
Inspired by it, SEDA uses its intermediate hidden layers
to collect the reverberation and noise in one layer
and to contain
clean speaker information in the other layer.
So one intermediate hidden layer ideally contains only the speaker identity,
isolated
from the nuisance factors.
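The split of the hidden representation into a speaker part and a nuisance part can be sketched with plain linear maps. This is an illustration only: the layer sizes, the random weights, and the use of single linear layers with no activation are assumptions, not the architecture from the paper.

```python
import random


def linear(weights, vector):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMB_DIM, SPK_DIM, NUISANCE_DIM = 8, 4, 2  # hypothetical sizes

enc_speaker = random_matrix(SPK_DIM, EMB_DIM, rng)        # speaker subspace
enc_nuisance = random_matrix(NUISANCE_DIM, EMB_DIM, rng)  # noise/reverb subspace
decoder = random_matrix(EMB_DIM, SPK_DIM + NUISANCE_DIM, rng)


def seda_forward(embedding):
    """Encode into two codes, then reconstruct from their concatenation."""
    speaker_code = linear(enc_speaker, embedding)
    nuisance_code = linear(enc_nuisance, embedding)
    reconstruction = linear(decoder, speaker_code + nuisance_code)
    return speaker_code, nuisance_code, reconstruction


spk, nui, rec = seda_forward([0.5] * EMB_DIM)
```

Only the speaker code would be passed on for verification in this sketch; the nuisance code exists so that the decoder can still reconstruct the input.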
When training SEDA,
we use objective functions that correspond to minimizing the intra-class variance and maximizing the
inter-class variance:
we utilize center loss and internal dispersion loss.
Center loss is used to minimize the intra-class variance, making the embedding more
discriminative.
Internal dispersion loss was used in the discriminative auto-encoder to maximize the inter-class
variance.
In the same manner as the previous SCSE, the MSE loss function is used to train
the reconstruction.
To address the imbalance between the number of source utterances
and distant utterances in the training set,
a different sample weight [values unclear]
is given according to the input.
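Weighting samples by input type to offset the source/distant imbalance might look like the sketch below. The weight values in the talk are not recoverable from the transcript, so the defaults and the example weights here are placeholders, and the function name is hypothetical.

```python
def weighted_mse(preds, targets, is_source,
                 source_weight=1.0, distant_weight=1.0):
    """Weighted mean of per-sample MSE, with one weight per input type."""
    total, weight_sum = 0.0, 0.0
    for pred, target, src in zip(preds, targets, is_source):
        err = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        weight = source_weight if src else distant_weight
        total += weight * err
        weight_sum += weight
    return total / weight_sum


# Placeholder weights: source samples count three times as much.
loss = weighted_mse([[1.0], [0.0]], [[0.0], [0.0]], [True, False],
                    source_weight=3.0, distant_weight=1.0)
print(loss)  # 0.75
```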
The CCE loss function is also used, for
speaker identification.
The final loss function of the proposed SEDA system
is described below.
Here gamma is a hyperparameter that scales [term unclear],
and a second hyperparameter weights the combination of the loss functions, center loss and
internal dispersion
loss.
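A rough sketch of how an intra-class term and an inter-class term can be combined follows. The center loss is the usual half mean squared distance to the class center. The dispersion term is a simple stand-in (negative mean pairwise squared distance between class centers) and is not necessarily the internal dispersion loss used in the paper; the weight and all names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance of each embedding to its class center."""
    total = sum(squared_distance(e, centers[y])
                for e, y in zip(embeddings, labels))
    return total / (2 * len(embeddings))


def center_dispersion(centers):
    """Stand-in term: negative mean pairwise squared distance of centers."""
    keys = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    spread = sum(squared_distance(centers[a], centers[b]) for a, b in pairs)
    return -spread / len(pairs)


def discriminative_term(embeddings, labels, centers, weight=1.0):
    """Pull samples toward their center and push centers apart."""
    return (center_loss(embeddings, labels, centers)
            + weight * center_dispersion(centers))


centers = {0: [0.0, 0.0], 1: [2.0, 0.0]}
value = discriminative_term([[1.0, 0.0], [2.0, 0.0]], [0, 1], centers,
                            weight=0.5)
print(value)  # -1.75
```

Minimizing the first term tightens each speaker's cluster, and minimizing the second spreads the clusters apart, which matches the stated goals of low intra-class and high inter-class variance.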
Now let us move on to experiments and results.
The training set comprises the VOiCES development set
and the VoxCeleb1 and VoxCeleb2 datasets.
For the baseline RawNet system,
the input raw waveform is fixed to
fifty-nine thousand
forty-nine samples, which corresponds to a few seconds,
for mini-batch construction.
To do so,
utterances are cropped or duplicated to that length.
All the details are presented in the paper.
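Bringing every utterance to one fixed length for mini-batch construction can be sketched as follows. The default of 59,049 samples is an assumption based on the usual RawNet setup, cropping from the start is a simplification (random cropping is also common), and the function name is hypothetical.

```python
def fix_length(samples, target_len=59049):
    """Crop a long utterance or tile a short one to exactly target_len."""
    if not samples:
        raise ValueError("empty utterance")
    if len(samples) >= target_len:
        return samples[:target_len]  # simplification: crop from the start
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


short = fix_length([1, 2, 3], target_len=7)
print(short)  # [1, 2, 3, 1, 2, 3, 1]
```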
The baseline system uses the RawNet architecture,
with some modifications.
First, we changed the number of [unclear; possibly filters or blocks]
[unclear]
to consider more speakers.
Second,
we increased the dimensionality of the speaker embedding to one thousand twenty-four.
The table described here compares single systems from the
VOiCES challenge
with our baseline system under various configurations.
A direct comparison between the challenge systems and our baseline is limited,
because there are differences in the
input feature,
the training configuration,
and the back-end classifiers.
These rows describe the results when also using the VOiCES dataset for training.
To train on all three datasets,
we first trained on VoxCeleb1 and VoxCeleb2,
then replaced
the top layer
and conducted fine-tuning with the VOiCES set.
These results are compared with training on all three datasets from scratch.
Training on all three datasets simultaneously provides the best performance.
For the proposed SCSE, we explored the learning rate scheduler and the optimizer.
The best performance was achieved when we used SGD and a cosine annealing
scheduler.
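For reference, a cosine annealing schedule decays the learning rate along half a cosine period. The sketch below is the common closed form without restarts; the learning rates and step count are hypothetical, and the talk does not say which variant was used.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, decaying from 0.1 to 0.001.
schedule = [cosine_annealing_lr(s, 100, 0.1, 0.001) for s in range(101)]
```

The schedule starts at `lr_max`, passes through the midpoint of the two rates halfway through, and ends at `lr_min`.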
SCSE shows an EER of six point
[digits unclear] percent
on the test set, which is roughly an eleven percent relative error reduction
compared to the baseline.
We experimented with the proposed SEDA, varying the size and the optimizer.
The best performance was achieved when we used Adam
and set the size to ten thousand.
SEDA shows an EER of [digits unclear] percent on the test set,
a fifteen point nine seven percent relative error reduction
compared to the baseline.
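The relative error reductions quoted in this part are computed against the baseline EER. A one-line sketch of that arithmetic follows; the EER values in the example are made up for illustration and are not results from the talk.

```python
def relative_error_reduction(baseline_eer, system_eer):
    """Relative reduction, in percent, of system_eer against baseline_eer."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


# Made-up values: a baseline at 10.0 percent and a system at 8.5 percent.
reduction = relative_error_reduction(10.0, 8.5)
print(reduction)  # 15.0
```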
Score normalization techniques are frequently applied in various acoustic mismatch conditions.
Most of the participants in the VOiCES two thousand nineteen challenge also
used score normalization techniques such as z score normalization,
t score normalization, and s score normalization.
We experimented with applying these techniques to our baseline RawNet and the two proposed systems,
SCSE and SEDA,
and report the results in the table below.
The results show that z-norm demonstrates the best performance in most cases in our experiments.
In addition, score normalization on the two proposed systems
brings additional improvement,
with an EER of
six point one nine percent for z-norm.
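As a reminder of what z-norm does, the sketch below standardizes a trial score with the mean and standard deviation of scores obtained by comparing the enrolment embedding against a cohort of impostor utterances. The cohort scores are made up, and the function name is hypothetical.

```python
import statistics


def z_norm(raw_score, enrol_cohort_scores):
    """Standardize a trial score with the enrolment side's cohort statistics."""
    mean = statistics.mean(enrol_cohort_scores)
    std = statistics.pstdev(enrol_cohort_scores)
    if std == 0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mean) / std


# Made-up cohort: the enrolment embedding scored against two impostors.
normalized = z_norm(0.5, [0.1, 0.3])  # approximately 3.0
```

Because every enrolment model gets its own mean and standard deviation, scores from different speakers end up on a comparable scale before a single threshold is applied.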
Finally, we introduce the conclusion.
In this study we proposed two speaker embedding enhancement systems.
Both proposed systems are independent from the front-end speaker embedding extractor,
and these systems can process not only distant utterances but also close-talk utterances.
This property can ensure that there is no degradation
when close-talk utterances are input into the speaker embedding enhancement system
at the same time as distant utterances.
Compared to the baseline system, the two proposed systems, SCSE and
SEDA, improved by a relative amount of roughly eleven percent
and fourteen point nine three percent respectively.
These results show that the proposed systems are effective for both close-talk and distant utterances.
In our future work, we are considering integrating the two proposed systems into a single speaker
embedding enhancement system.
Thank you for listening.