0:00:13 Welcome to the presentation of the paper "Delving into VoxCeleb: environment invariant speaker recognition".
0:00:20 This is joint work by Joon Son Chung, Jaesung Huh and Seongkyu Mun.
0:00:26 The goal of this work is speaker recognition.
0:00:29 Speaker recognition is identifying a person from the characteristics of their voice,
0:00:33 in effect, determining who is speaking.
0:00:36 It can be categorised into a closed-set problem or an open-set problem.
0:00:41 In the closed-set setting, all testing identities are enrolled during training, so the task can be addressed as a classification problem.
0:00:49 We normally refer to this problem as speaker identification.
0:00:55 On the other hand, in the open-set setting, testing identities are not seen during training,
0:01:00 which is closer to practice.
0:01:03 We call this problem speaker verification.
0:01:07 In speaker verification, we extract the speaker representations of two speech signals and compare whether the two speech segments are from the same person or not.
0:01:18 Like many other research areas, progress in speaker verification has been facilitated by the availability
0:01:24 of a large-scale dataset called VoxCeleb.
0:01:28 VoxCeleb is an audio-visual dataset consisting of short clips of human speech,
0:01:32 extracted from interview videos.
0:01:34 The speakers span a wide range of different ethnicities, accents, professions and ages.
0:01:42 The videos included in the dataset are shot in a large number of challenging visual
0:01:46 and auditory environments.
0:01:49 This dataset has been widely used for training and testing speaker recognition models.
0:01:55 VoxCeleb1 contains over one hundred thousand utterances from one thousand two hundred and fifty-one celebrities,
0:02:02 while VoxCeleb2 contains over one million utterances
0:02:06 from over six thousand celebrities,
0:02:08 extracted from videos uploaded to YouTube.
0:02:11 Most of the previous research only focuses on boosting the performance of speaker verification
0:02:17 on a given test dataset.
0:02:20 There have been a number of works on novel architectures and loss functions
0:02:25 suitable for the task,
0:02:27 but these works do not consider what information is learned
0:02:31 by the models,
0:02:32 whether it is useful information, or undesirable biases that are present in the dataset.
0:02:39 In speaker recognition, the challenge comes down to the ability to separate the voice characteristics from the environments in which the voices are recorded.
0:02:49 In real-world scenarios, a person usually enrols and verifies their voice in the same environment.
0:02:56 Therefore, when the same person speaks in different environments, the voice features can be far apart in the embedding space
0:03:04 if the voice characteristics and the environment information are entangled.
0:03:08 Moreover, the VoxCeleb dataset consists of recordings from diverse
0:03:14 but finite environments for each speaker,
0:03:17 making it possible for the model to overfit to the environments as well as the
0:03:22 voice characteristics.
0:03:24 Therefore, we do not know whether a network has in fact learned the voice characteristics,
0:03:31 or whether environmental or session biases have been learned as well.
0:03:34 In order to prevent this, we must look beyond classification accuracy as the only learning objective.
0:03:43 So, the objective of this work is
0:03:46 to learn a speaker-discriminative and environment-invariant speaker verification network.
0:03:52 In this work, we introduce an environment adversarial training framework in which the network
0:04:00 learns speaker-discriminative and environment-invariant embeddings,
0:04:08 achieving this by using only the video identities already available in the dataset, without additional environment labels.
0:04:14 We show that our environment adversarial training allows the network to generalise better
0:04:19 to unseen conditions.
0:04:22 The key motivation of our training framework
0:04:25 is that
0:04:26 a model should not be able to discriminate between two clips of the same speaker from
0:04:31 the same video and two clips of the same speaker from different videos.
0:04:38 Now let's talk about our training framework.
0:04:41 This is the overview of the training phase.
0:04:44 We will explain it in detail in the later slides.
0:04:47 First, we will talk about the batch formation.
0:04:50 Each mini-batch consists of three two-second audio segments from each of N different speakers.
0:04:57 Two of the three audio segments from each speaker
0:05:00 are from the same video,
0:05:01 and the other is from a different video.
0:05:04 The two segments can be either from different parts of the same audio clip, or
0:05:09 from another clip from the same YouTube video.
0:05:12 This batch formation can be performed using the VoxCeleb dataset,
0:05:17 as shown on the left.
0:05:19 The key assumption here is that
0:05:22 the audio clips from the same video would have been
0:05:24 recorded in the same environment, whereas
0:05:28 the clips from different videos would have more different channel characteristics.
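As an illustration of this batch formation, here is a minimal Python sketch. The data structure, the function name, and the requirement that each speaker has at least two videos are our assumptions for the example, not details given in the talk.

```python
import random

def make_minibatch(utterances, num_speakers):
    """Sample one mini-batch: three 2-second segments per speaker,
    two from the same video and one from a different video.
    `utterances` maps speaker -> {video_id: [segment paths]}; this is a
    hypothetical structure, and it assumes every speaker has at least
    two videos, one of which contains at least two segments."""
    batch = []
    for spk in random.sample(list(utterances), num_speakers):
        videos = [v for v, segs in utterances[spk].items() if len(segs) >= 2]
        same_vid = random.choice(videos)
        other_vid = random.choice([v for v in utterances[spk] if v != same_vid])
        # two segments from the same video: same assumed recording environment
        anchor, positive = random.sample(utterances[spk][same_vid], 2)
        # one segment from a different video: different assumed environment
        negative = random.choice(utterances[spk][other_vid])
        batch.append((anchor, positive, negative))
    return batch
```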
0:05:34 Now,
0:05:35 let's talk about the environment phase.
0:05:38 The environment network is trained to predict whether or not two audio segments come
0:05:43 from the same environment,
0:05:44 that is, the same video,
0:05:46 with a binary classification loss:
0:05:48 the anchor and the segment from the same video as the anchor are
0:05:52 labelled as one, meaning a positive pair,
0:05:54 and the anchor and a segment from a different video are labelled as negative.
0:05:59 The gradient is backpropagated only to the environment network,
0:06:03 so the CNN feature extractor is not optimized during this phase.
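A sketch of this environment phase in PyTorch might look as follows; the pair interface of the environment network (concatenated embeddings in, two logits out) is our assumption, not a detail confirmed in the talk.

```python
import torch
import torch.nn.functional as F

def environment_step(feat_extractor, env_net, anchor, pos, neg):
    """One environment-phase step (a sketch, not the authors' code).
    env_net predicts whether two segments come from the same video;
    the CNN feature extractor receives no gradient in this phase."""
    with torch.no_grad():  # feature extractor is not optimized here
        fa, fp, fn = feat_extractor(anchor), feat_extractor(pos), feat_extractor(neg)
    pos_logits = env_net(torch.cat([fa, fp], dim=-1))  # same video -> label 1
    neg_logits = env_net(torch.cat([fa, fn], dim=-1))  # different video -> label 0
    logits = torch.cat([pos_logits, neg_logits], dim=0)
    labels = torch.cat([torch.ones(len(pos_logits)),
                        torch.zeros(len(neg_logits))]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```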
0:06:08 In the speaker phase, the CNN feature extractor and the speaker recognition network
0:06:13 are trained simultaneously using the standard cross-entropy loss.
0:06:18 In addition, the confusion loss penalizes the environment network's ability to discriminate between
0:06:23 the pairs of segments originating from the same environment
0:06:26 and those from different environments.
0:06:29 This is done by minimizing the KL divergence between the softmax
0:06:33 of the environment network's output
0:06:35 and the uniform distribution.
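The confusion loss described here could be implemented roughly as below; this is a minimal sketch of the KL divergence to the uniform distribution, and the exact formulation in the paper may differ.

```python
import math
import torch.nn.functional as F

def confusion_loss(pair_logits):
    """KL divergence between the softmax of the environment network's
    output and the uniform distribution (sketch). `pair_logits` holds
    the same-video / different-video logits, shape (batch, 2)."""
    probs = F.softmax(pair_logits, dim=-1)
    log_probs = F.log_softmax(pair_logits, dim=-1)
    k = pair_logits.size(-1)
    # KL(p || uniform) = sum_i p_i * (log p_i - log(1/k))
    return (probs * (log_probs - math.log(1.0 / k))).sum(dim=-1).mean()
```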
0:06:38 The environment or channel information can be seen as an undesirable source of variation;
0:06:44 it should be absent from an ideal speaker embedding.
0:06:48 The extent to which the confusion loss contributes to
0:06:51 the overall loss function
0:06:53 is controlled by the variable alpha.
0:06:56 With alpha equal
0:06:57 to zero, this is the normal setup of speaker verification training.
0:07:01 As alpha increases,
0:07:03 the effect of the confusion loss
0:07:07 increases too.
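Putting the two terms together, the speaker-phase objective described on this slide could be sketched as follows, reusing the confusion_loss above; the decomposition into exactly these two terms is our reading of the talk.

```python
import torch.nn.functional as F

def speaker_phase_loss(speaker_logits, speaker_labels, env_pair_logits, alpha):
    """Cross-entropy for the speaker classifier plus alpha times the
    confusion loss; alpha = 0 recovers plain speaker training (sketch)."""
    return (F.cross_entropy(speaker_logits, speaker_labels)
            + alpha * confusion_loss(env_pair_logits))
```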
0:07:09 Now we are going to explain our experiments.
0:07:13 Our experiments are performed on two different architectures,
0:07:16 namely VGG-M-40 and Thin ResNet-34.
0:07:19 Although the original VGG-M is not a state-of-the-art network, it is known for high
0:07:25 efficiency and good classification performance.
0:07:29 VGG-M-40 is a modification of the network
0:07:33 that takes forty-dimensional mel filterbanks as input instead of the whole spectrogram,
0:07:38 thereby reducing the number of computations.
0:07:41 Thin ResNet-34 is the same as the original ResNet with thirty-four
0:07:45 layers,
0:07:46 except with only one quarter
0:07:48 of the channels in each residual block,
0:07:50 in order to reduce computational cost.
0:07:54 For aggregation, we use two types of pooling: temporal average pooling
0:07:58 and self-attentive pooling.
0:08:00 Temporal average pooling simply takes the mean of the features along
0:08:04 the time domain.
0:08:06 Self-attentive pooling is introduced to pay more attention to the frames
0:08:10 that are more informative
0:08:11 for utterance-level speaker recognition.
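For reference, a minimal self-attentive pooling layer could look like this; the layer sizes and the exact attention form are assumptions for the sketch, not specifics from the talk.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Self-attentive pooling (sketch): learn per-frame weights so that
    more informative frames contribute more to the utterance embedding."""
    def __init__(self, dim):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                       nn.Linear(dim, 1))

    def forward(self, x):                              # x: (batch, time, dim)
        weights = torch.softmax(self.attention(x), dim=1)
        return (weights * x).sum(dim=1)                # weighted mean over time

# Temporal average pooling is simply x.mean(dim=1).
```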
0:08:15 As for the speaker and environment networks:
0:08:19 for the speaker network, a single fully connected layer is used,
0:08:23 and for the environment network, two fully connected layers with ReLU activation are
0:08:27 used in this work.
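These two heads might be defined as below; the 512-dimensional embedding size and the pair-concatenation input of the environment network are assumptions for the sketch.

```python
import torch.nn as nn

embed_dim = 512      # assumed embedding size
num_speakers = 1211  # VoxCeleb1 development-set speakers

# Speaker network: a single fully connected layer (classifier).
speaker_net = nn.Linear(embed_dim, num_speakers)

# Environment network: two fully connected layers with ReLU, deciding
# same / different video from a concatenated pair of embeddings.
env_net = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                        nn.ReLU(),
                        nn.Linear(embed_dim, 2))
```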
0:08:29 We train and evaluate our models on the VoxCeleb1 dataset.
0:08:34 Unlike existing protocols, we train on the overlapping part of the development
0:08:39 sets for identification and verification, so that models trained for identification
0:08:44 can also be used for verification.
0:08:48 This makes identification a one-thousand-two-hundred-and-eleven-way classification task,
0:08:54 with a test set consisting of unseen utterances of speakers seen during training.
0:09:00 For verification,
0:09:02 all speech segments from the one thousand two hundred and eleven development-set speakers
0:09:07 are used for training,
0:09:08 and the trained model is then evaluated
0:09:11 on the previously unseen test speakers.
0:09:15 During training, we use a fixed-length two-second temporal segment, extracted randomly from each utterance.
0:09:23 Spectrograms are extracted with a Hamming window of width twenty-five milliseconds and step ten milliseconds.
0:09:33 The two-hundred-and-fifty-seven-dimensional spectral frames are used as the input to the network;
0:09:39 for the VGG-M-40, forty-dimensional mel filterbanks are used as the input instead.
0:09:45 Mean and variance normalization is performed on every frequency bin of the spectrogram, at the utterance level.
0:09:54 No voice activity detection or data augmentation is used in training.
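The input pipeline just described corresponds roughly to the following sketch, here using librosa; a 512-point FFT yields the stated 257 frequency bins, and 25 ms / 10 ms correspond to 400 / 160 samples at an assumed 16 kHz sampling rate.

```python
import numpy as np
import librosa

def extract_input(wav, sr=16000):
    """Random 2-second crop, 25 ms Hamming window with 10 ms step,
    257-bin magnitude spectrogram, then mean/variance normalization per
    frequency bin over the utterance. No VAD, no augmentation (sketch)."""
    seg_len = 2 * sr
    start = np.random.randint(0, max(1, len(wav) - seg_len))
    seg = wav[start:start + seg_len]
    spec = np.abs(librosa.stft(seg, n_fft=512, hop_length=160,
                               win_length=400, window="hamming"))
    spec = (spec - spec.mean(axis=1, keepdims=True)) / \
           (spec.std(axis=1, keepdims=True) + 1e-5)
    return spec  # for VGG-M-40, 40-dim mel filterbanks would be used instead
```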
0:10:02 The networks are trained on a classification task, but the verification task requires a measure of similarity.
0:10:10 In our work, the final layer used for classification is replaced with a new low-dimensional embedding layer,
0:10:15 and this layer is re-trained with a contrastive loss using hard negative mining.
0:10:24 The CNN feature extractor is not fine-tuned with the contrastive loss.
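The contrastive loss used in this fine-tuning stage is, in its standard form, something like the sketch below; the margin value is assumed, and the hard negative mining that selects the training pairs is omitted.

```python
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_speaker, margin=1.0):
    """Standard contrastive loss (sketch): positive pairs are pulled
    together, negative pairs pushed beyond the margin. Only the new
    low-dimensional embedding layer would be updated; the CNN feature
    extractor stays frozen. `same_speaker` is a 0/1 float tensor."""
    dist = F.pairwise_distance(emb1, emb2)
    loss_pos = same_speaker * dist.pow(2)
    loss_neg = (1 - same_speaker) * F.relu(margin - dist).pow(2)
    return (loss_pos + loss_neg).mean()
```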
0:10:30 The networks are trained using stochastic gradient descent,
0:10:35 with an initial learning rate of zero point zero one, decreasing by a factor of zero point nine every epoch.
0:10:41 The training is stopped whenever the validation error rate no longer improves over the course of training.
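In PyTorch, the stated optimization schedule would look roughly like this; `model`, `train_one_epoch`, `validation_improved`, and `max_epochs` are hypothetical placeholders, not names from the talk.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(max_epochs):
    train_one_epoch(model, optimizer)  # hypothetical training helper
    scheduler.step()                   # lr <- lr * 0.9 after every epoch
    if not validation_improved():      # hypothetical early-stopping check
        break
```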
0:10:51 The replay experiment measures the performance on the same VoxCeleb1 test set used in the
0:10:56 speaker verification task,
0:10:58 but the test data is replayed on a loudspeaker and re-recorded using a
0:11:03 microphone.
0:11:05 This results in a significant change in channel characteristics
0:11:09 and a degradation of sound quality.
0:11:12 The models are identical to those used in the previous experiments,
0:11:16 and are not fine-tuned on the replayed segments.
0:11:21 This table reports
0:11:22 results for the multiple models used for evaluation,
0:11:26 across both the speaker identification and verification tasks.
0:11:30 The models trained with the proposed adversarial strategy,
0:11:33 where alpha is greater than zero,
0:11:36 consistently outperform those trained without,
0:11:39 where alpha equals zero.
0:11:42 Replay equal error rate is the result of the replay experiment, as mentioned in the
0:11:47 previous slide.
0:11:48 The improvement in performance as alpha increases
0:11:51 is more pronounced
0:11:53 in this setting,
0:11:54 which suggests that
0:11:55 the models trained with the proposed adversarial training generalize much better to unseen environments or
0:12:01 channels.
0:12:03 Our final
0:12:05 experiment examines whether the adversarial training
0:12:08 has removed the environment information from the embedding.
0:12:12 The test list for evaluating environment recognition consists of
0:12:16 nine thousand four hundred and eighty-six same-speaker pairs,
0:12:20 half of which come from the same video, and the other half from different videos.
0:12:26 A lower equal error rate means that the network is better at predicting
0:12:31 whether or not
0:12:32 a pair of audio segments comes from the same video.
0:12:35 The results demonstrate that the environment recognition performance
0:12:39 decreases with increasing alpha, which shows that unwanted environment information
0:12:45 is indeed removed from the speaker embedding, at least to an extent.
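The equal error rate used throughout these evaluations can be computed from pair scores and labels as in this generic sketch; it is not the authors' evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER for pair scores (higher = more similar) and 0/1 labels
    (1 = same video here, or same speaker in verification)."""
    order = np.argsort(-scores)
    labels = labels[order]
    fnr = 1 - np.cumsum(labels) / labels.sum()        # miss rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false alarm rate
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2
```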
0:12:49 To summarize our work:
0:12:51 first,
0:12:52 we propose an environment adversarial training framework to learn speaker-discriminative and environment-invariant embeddings.
0:13:00 Secondly,
0:13:01 the performance of
0:13:02 our proposed method exceeds that of the baselines on the VoxCeleb1 dataset.
0:13:09 We also probe the network to verify that the environment information has indeed been removed from the
0:13:14 embedding.
0:13:16 Thank you.