Welcome to the presentation of the paper "Delving into VoxCeleb: environment-invariant speaker recognition". This is joint work by Jaesung Huh, Joon Son Chung and Seongkyu Mun.
The goal of this work is speaker recognition.
Speaker recognition is identifying a person
from the characteristics of their voice,
in effect detecting who is speaking.
It can be categorised into a closed-set problem
or an open-set problem.
In the closed-set setting, all testing identities are seen during training, so it can
be addressed as a classification problem.
We normally refer to this problem as speaker identification.
On the other hand, in the open-set setting, the testing identities are not seen during training,
which is closer to practical use.
We call this problem speaker verification.
In speaker verification, we extract the speaker representations of two speech signals and compare them to determine whether
the two speech segments
come from the same person or not.
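To make the comparison step concrete, here is a minimal sketch of a verification decision, assuming cosine similarity between embeddings and an illustrative (not paper-specified) threshold:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1, emb2, threshold=0.5):
    """Accept the verification trial if the two embeddings are similar enough.
    The threshold here is a placeholder; in practice it is tuned on a
    held-out validation set."""
    return cosine_similarity(emb1, emb2) >= threshold
```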
Like many other research areas, progress in speaker verification has been facilitated by the availability
of a large-scale dataset called VoxCeleb.
VoxCeleb is a dataset consisting of short clips of human speech
extracted from interview videos.
The speakers span a wide range of
ethnicities, accents,
professions and ages.
The videos included in the dataset are shot in a large number of challenging visual
and auditory environments.
This dataset has been widely used for training and testing speaker recognition models.
VoxCeleb1 contains over one hundred thousand utterances from one thousand two hundred
and fifty-one celebrities,
while VoxCeleb2 contains over one million utterances
from
over six thousand celebrities,
extracted from videos uploaded to YouTube.
Most of the previous research focuses only on boosting the performance of speaker verification
on a given test dataset.
There have been a number of novel architectures and loss functions
suitable for the task,
but these works do not consider what information is learned
by the models,
whether it is useful information or undesirable biases present in the dataset.
In speaker recognition, the challenge comes down to the ability to separate the voice characteristics from
the environments in which the voices are recorded.
In real-world scenarios,
a person usually enrols
and verifies their voice in the same environment.
Therefore,
if the voice characteristics and the environment information are entangled,
the voice features of the same person speaking in different environments can be far apart in the embedding
space.
Moreover,
the VoxCeleb dataset consists of recordings from diverse
but finite environments for each speaker,
making it possible for the model to overfit to the environment as well as the
voice characteristics.
Therefore,
we do not know whether a network
captures only the voice characteristics,
or
environment and session biases as well.
In order to prevent this,
we must look beyond classification accuracy as the only learning objective.
So,
the objective of this work is
to learn a speaker-discriminative and environment-invariant speaker verification network.
In this work,
we introduce an environment adversarial training framework in which the network
learns speaker-discriminative and environment-invariant embeddings.
We achieve this by using not only the usual classification loss but also an environment confusion loss.
We show that our environment adversarial training helps the network to generalise better
to unseen conditions.
The key motivation of our training framework
is that
a model should not be able to discriminate between two clips of the same speaker from
the same video and two clips of the same speaker from different videos.
now let's talk about our training framework
This is the overview of the training framework;
we will explain it in detail in the later slides.
First, we will talk about the batch formation.
Each mini-batch consists of sets of three two-second audio segments drawn from different speakers.
Two of the three audio segments from each speaker
are from the same video,
and the other is from a different video.
The two segments can be either from different parts of the same audio clip, or
from another clip from the same YouTube video.
This batch formation can be performed with
the VoxCeleb dataset,
as shown on the left.
Here, our assumption is that
the audio clips from the same video would have
similar recording environments, whereas
the clips from different videos would have more different channel characteristics.
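The batch formation above can be sketched as follows. This is a minimal illustration under stated assumptions: every speaker has at least two videos and every video has at least two segments; the dictionary layout and function name are my own, not from the paper:

```python
import random

def sample_batch(speaker_videos, num_speakers=2, seed=None):
    """Sample one mini-batch of (speaker, anchor, positive, negative) tuples.
    For each sampled speaker we pick three segments: two from the same video
    (anchor + same-environment segment) and one from a different video of the
    same speaker. `speaker_videos` maps speaker -> {video_id: [segment ids]}.
    Assumes each speaker has >= 2 videos, each with >= 1 segment (the anchor
    video needs >= 2 segments)."""
    rng = random.Random(seed)
    batch = []
    for spk in rng.sample(list(speaker_videos), num_speakers):
        videos = speaker_videos[spk]
        same_vid, other_vid = rng.sample(list(videos), 2)
        anchor, positive = rng.sample(videos[same_vid], 2)  # same video
        negative = rng.choice(videos[other_vid])            # different video
        batch.append((spk, anchor, positive, negative))
    return batch
```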
Now,
let's talk about the environment phase.
The environment network is trained to predict whether or not two audio segments come
from the same environment,
that is, the same video,
with a binary classification loss.
A pair of the anchor and a segment from the same video as the anchor is
labelled as positive,
and a pair of the anchor and a segment from a different video is labelled as negative.
The gradient is backpropagated only to the environment network,
so the CNN feature extractor is not optimized during this phase.
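The binary objective for this phase can be sketched as a standard binary cross-entropy over pair scores. This is a numpy illustration, not the authors' code; in a real framework the "gradient only to the environment network" part is done by detaching or freezing the feature extractor:

```python
import numpy as np

def environment_loss(pair_scores, same_video_labels):
    """Binary cross-entropy over pairs of embeddings.
    `pair_scores`: environment-network logits, one per pair.
    `same_video_labels`: 1 if the pair comes from the same video, else 0.
    Only the environment network would be updated with this loss; the CNN
    feature extractor is held fixed during this phase."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(pair_scores, dtype=float)))  # sigmoid
    y = np.asarray(same_video_labels, dtype=float)
    eps = 1e-12  # numerical safety for log
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```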
In the speaker phase, the CNN feature extractor and the speaker recognition network
are trained simultaneously using the standard cross-entropy loss.
In addition, the confusion loss penalises the network's ability to discriminate between
the pairs of embeddings from the same environment
and those from different environments.
This is done by minimising the KL divergence between the softmax
of the pairwise distances
and the uniform distribution.
The environment or channel information can be seen as an undesirable source of variation,
so it should be absent from an ideal speaker embedding.
The extent to which the confusion loss contributes to
the overall loss function
is controlled by a variable
alpha.
Alpha equal to zero is the normal setup of speaker verification.
As alpha increases,
the contribution of the confusion loss
to the overall objective
increases too.
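A minimal sketch of the confusion loss and the alpha-weighted overall objective, assuming Euclidean distances from an anchor embedding to the other segments (the exact distance and softmax formulation in the paper may differ):

```python
import numpy as np

def confusion_loss(anchor, others):
    """KL divergence between the softmax over (negative) pairwise distances
    from the anchor to the other segments and the uniform distribution.
    It is minimised when the anchor is equally far from same-video and
    different-video segments, i.e. the embedding carries no environment cue."""
    d = np.array([np.linalg.norm(anchor - o) for o in others])
    logits = -d                          # closer segment -> higher probability
    q = np.exp(logits - logits.max())
    q /= q.sum()                         # softmax over the pairwise distances
    u = np.full_like(q, 1.0 / len(q))    # uniform target distribution
    return float(np.sum(u * np.log(u / q)))

def total_loss(speaker_ce, conf, alpha):
    """Overall objective: speaker cross-entropy plus alpha-weighted confusion
    loss; alpha = 0 recovers the standard speaker verification setup."""
    return speaker_ce + alpha * conf
```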
Now we are going to explain our experiments.
Our experiments are performed on two different architectures,
VGG-M-40 and Thin ResNet-34.
Although the original VGG-M is not a state-of-the-art network, it is known for high
efficiency and good classification performance.
VGG-M-40 is a further modification of the VGG-M network
that takes forty-dimensional Mel filterbanks as input instead of the whole spectrogram,
drastically reducing the number of computations.
Thin ResNet-34 is the same as the original ResNet with thirty-four
layers,
except with only one quarter of
the channels in each residual block,
in order to reduce computational cost.
For aggregation, we use two types of pooling: temporal average pooling
and self-attentive pooling.
Temporal average pooling simply takes the mean of the features along
the time domain,
while self-attentive pooling is introduced to pay more attention to the frames
that are more informative
for utterance-level speaker recognition.
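The two pooling operations above can be sketched as follows. This is a simplified illustration: the attention scores here come from a single learnable vector `w`, whereas the actual self-attentive pooling layer typically uses a small MLP:

```python
import numpy as np

def temporal_average_pooling(frames):
    """TAP: mean of frame-level features over time. frames: (T, D)."""
    return frames.mean(axis=0)

def self_attentive_pooling(frames, w):
    """SAP: attention-weighted mean over time. Frames that score higher
    against the learnable vector `w` contribute more to the utterance-level
    embedding. frames: (T, D), w: (D,)."""
    scores = frames @ w                  # one score per frame, shape (T,)
    att = np.exp(scores - scores.max())
    att /= att.sum()                     # softmax over the time axis
    return att @ frames                  # weighted mean, shape (D,)
```

With zero attention scores the softmax is uniform, so SAP reduces to TAP; informative frames shift the weights away from uniform.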
As for the speaker and environment networks,
for the speaker network a single fully connected layer is used,
and for the environment network two fully connected layers with ReLU activations
are used in this framework.
We train our models end-to-end on the VoxCeleb1 dataset.
For identification, we train on the overlapping part of the development
sets for identification and verification, so that the models trained for identification
can be reused for verification.
This makes speaker identification a one-thousand-two-hundred-and-eleven-way classification task,
and the test set consists of utterances unseen during training from speakers seen during training.
For verification,
all speech segments from the one thousand two hundred and eleven development-set speakers
are used for training,
and the trained model is then evaluated
on the forty unseen test speakers.
During training,
we use a fixed two-second temporal segment
extracted randomly from each utterance.
Spectrograms are extracted with a Hamming window
of width
twenty-five milliseconds and a step of
ten milliseconds,
and the two-hundred-and-fifty-seven-dimensional spectral frames are used as the input to the
ResNet.
For the VGG network,
forty-dimensional Mel filterbank features
are used as the input.
Mean and variance normalisation is performed on every frequency bin of the spectrogram,
at the utterance level.
No voice activity detection or data augmentation is used during training.
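The input dimensions above follow from the STFT settings. As a sanity check, here is a small sketch of the frame and bin arithmetic; the 16 kHz sample rate and 512-point FFT are my assumptions, chosen because a 512-point FFT gives 512 // 2 + 1 = 257 frequency bins:

```python
def spectrogram_shape(num_samples, sample_rate=16000,
                      win_ms=25, hop_ms=10, n_fft=512):
    """Number of STFT frames and frequency bins for the settings above.
    25 ms window at 16 kHz = 400 samples, 10 ms hop = 160 samples."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    n_frames = 1 + (num_samples - win) // hop  # no padding assumed
    n_bins = n_fft // 2 + 1                    # one-sided spectrum: 257
    return n_frames, n_bins
```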
The network is trained on a classification task,
but the verification task requires a measure of similarity.
In our work, the final layer used for classification is replaced with
one
of lower dimension, and this layer is retrained with a contrastive
loss with hard negative mining.
The CNN feature extractor is not fine-tuned
with the contrastive loss.
The networks are trained using stochastic gradient descent,
with an initial learning rate of
0.001, decreasing by a factor of 0.95
every epoch.
The training runs for at most one hundred epochs,
and is halted early whenever the validation error rate does not improve
over several epochs.
The replay experiment measures the performance on the same pairs of the test set used in the
speaker verification task,
but the audio is played out loud over a loudspeaker and re-recorded
using a microphone.
This results in a significant change in channel characteristics
and a degradation of sound quality.
The models are identical to those used in the previous experiments,
and are not fine-tuned on the replayed segments.
This table reports
results for multiple models used for evaluation,
across both the speaker identification and verification tasks.
The models trained with the proposed adversarial strategy,
with alpha greater than zero,
consistently outperform those trained without,
that is, with alpha equal to zero.
The replay equal error rate is the result of the replay experiment mentioned in the
previous slide.
The improvement in performance as alpha increases
is more pronounced
in this setting,
which suggests that
the models trained with the proposed adversarial training generalise much better to unseen environments or
channels.
Our final
experiment examines how much of the adversarial training
has removed the environment information from the embedding.
The test list for evaluating the environment recognition system consists of
nine thousand four hundred and eighty-six same-speaker pairs,
half of which come from the same video and the other half from different videos.
The lower the equal error rate, the better the network is at telling
whether or not
a pair of audio segments comes from the same video.
The results demonstrate that the environment recognition performance
decreases with increasing alpha, which shows that the unwanted environment information
is removed from the speaker embedding to an extent.
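The metric in this experiment, the equal error rate, can be sketched as follows. This is a simple threshold-sweep approximation, not the exact interpolation an evaluation toolkit would use:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: sweep every candidate threshold and return the error
    rate at the point where the false-acceptance rate (different-video pairs
    scored as same) and the false-rejection rate (same-video pairs scored as
    different) are closest. labels: 1 = same video, 0 = different videos."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = float(np.mean(scores[labels == 0] >= t))  # false acceptances
        frr = float(np.mean(scores[labels == 1] < t))   # false rejections
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```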
To summarise our work:
first,
we propose an environment adversarial training framework to learn speaker-discriminative and environment-invariant embeddings.
Secondly,
the performance of
our proposed method exceeds the baselines on the VoxCeleb1 dataset.
We also probe the network to verify that the environment information is to some extent removed from the
embedding.
Thank you.