0:00:13 Hello, my name is [unclear], from [unclear], and this talk is for the [unclear]
0:00:18 workshop.
0:00:19 I will present our paper, Selective Deep Speaker Embedding Enhancement for Speaker Verification.
0:00:27 These are today's contents. I will start with an introduction and the motivation.
0:00:32 Next, I will go over the VOiCES dataset,
0:00:36 and I will introduce the baseline system
0:00:38 we use, RawNet, and the two proposed speaker embedding enhancement systems.
0:00:44 Experiments and the corresponding results will then be presented, followed by our conclusion.
0:00:49 Let us start with the introduction.
0:00:53 Recently,
0:00:54 deep neural networks have been achieving state-of-the-art performance in speaker verification.
0:01:01 However, distant utterances are well known to degrade performance because they contain environmental factors
0:01:08 such as reverberation and noise.
0:01:11 To address this, the VOiCES (Voices Obscured in Complex Environmental Settings)
0:01:16 from a Distance Challenge was held,
0:01:19 accompanied by its dataset.
0:01:24 Previously,
0:01:25 several studies have proposed compensation for the performance degradation in distant environments.
0:01:33 However, two problems remain in existing compensation methods.
0:01:39 First,
0:01:39 there is the degradation on close-talk utterances.
0:01:44 Applying compensation improved the recognition performance on distant utterances.
0:01:51 However, when the distant compensation technique was applied to close-talk utterances, the performance
0:01:58 degraded.
0:02:01 Owing to this, it is inadequate to use a compensation system when utterances come
0:02:05 from various distances.
0:02:08 Second,
0:02:08 there is a dependency on the speaker recognition system.
0:02:12 When a new speaker embedding extractor is proposed,
0:02:15 the corresponding studies on adequate compensation need to be conducted as well.
0:02:23 To alleviate these
0:02:24 previous problems,
0:02:26 we want to build a system with the following properties.
0:02:31 First,
0:02:32 it should be independent of the front-end speaker embedding extractor.
0:02:36 Second,
0:02:37 the proposed system should perform selective enhancement
0:02:41 while considering the distance between the speaker and the microphone.
0:02:45 Thirdly,
0:02:46 both close-talk and distant utterances can be input
0:02:50 into the proposed system.
0:02:53 Finally,
0:02:53 the proposed system should comprise a relatively simple architecture
0:02:58 that adds minimal overhead to the overall speaker verification pipeline.
0:03:05 We propose two distant utterance compensation systems.
0:03:10 The first proposed system selectively conducts enhancement according to the required
0:03:15 extent of compensation.
0:03:18 We design the system to determine the level of noise and reverberation
0:03:23 and apply compensation accordingly.
0:03:26 The second proposed system is based on the auto-encoder framework,
0:03:31 dividing the hidden representation
0:03:34 into two subspaces. This system is intended to separately extract the speaker information
0:03:40 included in the embedding through the encoding path.
0:03:44 One subspace is targeted to contain clean speaker information by applying additional objective
0:03:50 functions to this hidden layer,
0:03:52 and the other subspace is targeted to
0:03:54 contain the remaining information, such as reverberation and noise.
0:04:01 Next, the dataset used in this study will be described.
0:04:06 The VOiCES dataset was collected by playing the LibriSpeech dataset
0:04:10 through a loudspeaker
0:04:12 and re-recording it with microphones at various distances and acoustic conditions.
0:04:18 The acoustic conditions vary according to the room,
0:04:21 the type and location of the microphone,
0:04:23 the loudspeaker angle, and distractors.
0:04:25 In the VOiCES dataset,
0:04:27 there are three hundred speakers.
0:04:30 The development set comprises [unclear] utterances from
0:04:34 two hundred speakers, and the evaluation set comprises [unclear] utterances from
0:04:40 one hundred speakers.
0:04:44 Now I will introduce RawNet, which is used as the baseline.
0:04:48 RawNet is a deep neural network speaker embedding extractor
0:04:52 that takes raw waveforms as input directly.
0:04:56 When acoustic features are used to extract speaker embeddings,
0:04:59 mel-frequency cepstral coefficients,
0:05:02 log mel-filterbank energies, or spectrograms are widely used.
0:05:05 These acoustic features reflect human knowledge in order to emphasize discriminative features.
0:05:12 The convolutional neural network, which is frequently used as an embedding extractor,
0:05:18 gradually increases its receptive field with depth.
0:05:21 Thus, when a spectrogram is input to a CNN, its receptive field can
0:05:26 consider only adjacent time and frequency regions
0:05:30 in the layers
0:05:31 that are close to the input layer.
0:05:35 Although
0:05:36 these conventional acoustic features are still widely used,
0:05:39 many recent studies also explore raw waveforms as input to DNNs.
0:05:45 It is hypothesized that data-driven learning can better extract discriminative information using the DNN hidden layers.
0:05:53 When raw waveforms are processed by CNNs,
0:05:56 arbitrary frequency responses
0:05:58 over the whole spectrum can be extracted.
0:06:02 In addition, the representation can be optimized for the data and the task.
0:06:09 RawNet adopts a CNN-GRU architecture where the first CNN layers
0:06:14 extract frame-level representations,
0:06:17 as illustrated here.
0:06:19 The residual blocks are similar to those of the original ResNet,
0:06:23 with [unclear].
0:06:27 These representations are then fed into a unidirectional gated recurrent unit layer
0:06:33 to aggregate them into a single utterance-level representation.
0:06:37 A fully connected layer with one thousand twenty-four nodes
0:06:41 then conducts an affine transformation; its output is later used as the speaker embedding.
0:06:49 In this section, we introduce the two proposed systems for speaker embedding enhancement.
0:06:57 The first proposed system is the skip connection-based selective enhancement, SCSE.
0:07:03 The figure on the right shows the framework of SCSE.
0:07:08 This system comprises a DNN that enhances the speaker embedding, a skip connection,
0:07:14 and a second DNN that gates the skip connection.
0:07:18 The first, SCSE-DNN, acts as a denoising auto-encoder,
0:07:20 and the second, SDDNN, decides whether to activate the skip connection, similar to a gating
0:07:26 mechanism.
0:07:29 During the training phase,
0:07:31 SCSE-DNN is trained to minimize the mean squared error objective function
0:07:35 between its output and the clean speaker embedding.
0:07:39 When a source utterance is input,
0:07:41 SCSE-DNN reconstructs the input.
0:07:46 On the other hand, when a distant utterance is input,
0:07:49 SCSE-DNN denoises it, and the target is the
0:07:52 embedding of the source utterance
0:07:54 that was used to make the distant utterance.
0:07:58 SDDNN is trained to minimize the binary cross-entropy objective
0:08:02 function.
0:08:04 When a source utterance is input, the binary label is one, to make the skip
0:08:09 connection work,
0:08:11 and when a distant utterance is input, the binary label is zero, to
0:08:16 deactivate the skip connection.
0:08:18 In the figure below,
0:08:20 the top panel presents the training phase of our proposed
0:08:24 SCSE.
0:08:27 According to a previous study,
0:08:29 when compensation is conducted in the speaker embedding space,
0:08:33 compensation may not be effective even though the MSE loss is low.
0:08:38 This phenomenon is analyzed as being caused by losing the discriminative power
0:08:43 of the speaker embedding by changing values
0:08:46 in the high-dimensional abstract embedding space.
0:08:50 Leveraging this knowledge, the last component of the proposed system
0:08:54 performs speaker identification, for which the categorical cross-entropy loss function is
0:08:59 used.
0:09:01 So the final loss function used to train SCSE is
0:09:05 as
0:09:06 described here.
0:09:09 The first loss term measures the reconstruction error,
0:09:12 the second measures the distant utterance detection error,
0:09:15 and the third measures the speaker identification error
0:09:19 of the enhanced speaker embedding.
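The three terms named here can be sketched as follows. This is a minimal illustration, not the authors' implementation: the equal weighting of the terms, the function names, and the toy values are all assumptions, and the label convention (one for a source utterance, zero for a distant one) is one reading of the talk.

```python
import math

def mse(pred, target):
    # Reconstruction error between enhanced and clean embeddings.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def binary_cross_entropy(prob, label):
    # Detection error for the source/distant decision.
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def categorical_cross_entropy(logits, speaker_index):
    # Speaker identification error, computed with a stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[speaker_index]

def scse_loss(enhanced, clean, gate_prob, is_source, speaker_logits, speaker_index):
    # Assumption: the three terms are summed with equal weight.
    label = 1 if is_source else 0
    return (mse(enhanced, clean)
            + binary_cross_entropy(gate_prob, label)
            + categorical_cross_entropy(speaker_logits, speaker_index))

# Toy distant utterance: gate probability 0.2, true speaker index 0.
loss = scse_loss([0.9, 0.1], [1.0, 0.0], 0.2, False, [2.0, 0.5, -1.0], 0)
```

The identification term is what keeps the enhanced embedding discriminative even when the reconstruction error alone is already small.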
0:09:23 In the test phase, the speaker embedding is input to SCSE-DNN
0:09:27 and SDDNN.
0:09:30 The skip connection that connects the input and output of SCSE-DNN is multiplied
0:09:35 by the output of the other DNN, SDDNN,
0:09:37 which has a sigmoid activation function.
0:09:41 Its output lies between zero and one and scales the skip connection.
0:09:48 The final enhanced speaker embedding is derived by adding the output of
0:09:52 SCSE-DNN
0:09:54 and the scaled skip connection.
0:09:56 In the figure below, the bottom panel represents the test process of our proposed
0:10:02 SCSE.
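The test-time computation described above can be sketched as below. This is a hedged illustration under one reading of the talk, in which the enhanced embedding is the enhancement network's output plus the input scaled by a sigmoid gate. `toy_enhance` and `toy_gate_logit` are hypothetical stand-ins for the two trained networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scse_enhance(embedding, enhance_fn, gate_logit_fn):
    # The gate lies in (0, 1) and scales the skip connection (the raw input).
    gate = sigmoid(gate_logit_fn(embedding))
    enhanced = enhance_fn(embedding)
    output = [e + gate * x for e, x in zip(enhanced, embedding)]
    return output, gate

# Hypothetical stand-ins for the trained enhancement and gating networks.
def toy_enhance(x):
    return [0.5 * v for v in x]

def toy_gate_logit(x):
    return sum(x)

out, gate = scse_enhance([1.0, -1.0, 2.0], toy_enhance, toy_gate_logit)
```

With a gate near one the input passes through the skip connection almost unchanged, and with a gate near zero the output is dominated by the enhancement branch.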
0:10:05 The second proposed system is the selective enhancement discriminative auto-encoder, SEDA,
0:10:30 which is composed of an encoder, a decoder, and two intermediate hidden layers.
0:10:37 The figure here illustrates the SEDA architecture.
0:10:41 The architecture design follows the discriminative auto-encoder structure.
0:10:46 Inspired by [unclear], SEDA uses two intermediate hidden layers:
0:10:51 one to collect the reverberation and noise,
0:10:54 and one to contain
0:10:55 the clean speaker information.
0:11:01 SEDA uses the intermediate hidden layer from which noise is, ideally,
0:11:06 isolated
0:11:07 as the enhanced embedding.
0:11:09 When training SEDA,
0:11:11 loss functions that correspond to minimizing the intra-class variance and maximizing the
0:11:16 inter-class variance are used.
0:11:18 We utilize center loss and internal dispersion loss.
0:11:23 Center loss was proposed to minimize intra-class variance while keeping the embeddings of different classes
0:11:28 discriminative.
0:11:31 Internal dispersion loss was used in [unclear] to maximize the inter-class
0:11:36 variance.
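To make the two variance objectives concrete, here is a small sketch. `center_loss` follows the usual definition of center loss, averaged over the batch, which is an assumption about scaling. `mean_center_distance` is only an illustrative measure of inter-class spread and is not the internal dispersion loss itself, whose exact form is not given in the talk.

```python
def center_loss(embeddings, labels, centers):
    # Half the mean squared distance of each embedding to its class center.
    total = 0.0
    for emb, label in zip(embeddings, labels):
        total += sum((e - c) ** 2 for e, c in zip(emb, centers[label]))
    return 0.5 * total / len(embeddings)

def mean_center_distance(centers):
    # Illustrative inter-class spread: mean squared distance between centers.
    keys = sorted(centers)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    total = sum(
        sum((x - y) ** 2 for x, y in zip(centers[a], centers[b]))
        for a, b in pairs
    )
    return total / len(pairs)

# Toy data: two speakers with hypothetical class centers.
centers = {0: [1.0, 0.0], 1: [4.0, 4.0]}
embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 4.0]]
labels = [0, 0, 1]

intra = center_loss(embeddings, labels, centers)
inter = mean_center_distance(centers)
```

Training would push the first quantity down and the second up, so that embeddings of one speaker cluster tightly while different speakers stay apart.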
0:11:40 In the same way as in the previous SCSE, the MSE loss function was used to train the
0:11:46 auto-encoder.
0:11:48 To address the imbalance between the number of source utterances
0:11:52 and distant utterances in the training set,
0:11:54 sample weights of [unclear]
0:11:58 and one are given according to the input.
0:12:01 The CCE loss function is also used so that the hidden layer can perform
0:12:05 the speaker identification.
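The sample weighting can be sketched as a weighted reconstruction loss. The weight value used below, the choice to give the larger weight to source utterances, and the normalization by the sum of weights are all assumptions made for illustration, since the actual weights are not recoverable from the recording.

```python
def weighted_mse(preds, targets, is_source, source_weight, distant_weight=1.0):
    # Each sample's MSE is scaled by a weight chosen from its input type.
    total = 0.0
    weight_sum = 0.0
    for pred, target, src in zip(preds, targets, is_source):
        weight = source_weight if src else distant_weight
        error = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        total += weight * error
        weight_sum += weight
    return total / weight_sum

# Toy batch: one source and one distant sample, hypothetical weight of 3.
value = weighted_mse(
    preds=[[1.0], [0.0]],
    targets=[[0.0], [0.0]],
    is_source=[True, False],
    source_weight=3.0,
)
```

Without such weights, the more numerous utterance type would dominate the reconstruction objective.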
0:12:08 The final loss function of the proposed SEDA system
0:12:12 is described below.
0:12:14 Here, one hyperparameter scales the MSE
0:12:19 term,
0:12:20 and another hyperparameter weights the combined loss functions, center loss and internal dispersion
0:12:27 loss.
0:12:29 Now let us move on to the experiments and results.
0:12:34 The training set comprises the VOiCES development set
0:12:38 and the VoxCeleb1 and VoxCeleb2 datasets.
0:12:42 The baseline RawNet system
0:12:43 inputs raw waveforms of
0:12:46 fifty-nine thousand
0:12:48 forty-nine samples, which corresponds to about [unclear] seconds,
0:12:51 for mini-batch construction.
0:12:54 To do so,
0:12:55 we had to duplicate short utterances and [unclear].
0:12:59 All the details are presented in the paper.
0:13:04 The baseline system uses the RawNet architecture
0:13:07 with some modifications.
0:13:10 First, we extended the number of output nodes to seven thousand [unclear]
0:13:15 [unclear]
0:13:17 to consider more speakers.
0:13:20 Secondly,
0:13:21 we increased the dimensionality of the speaker embedding to one thousand twenty-four.
0:13:28 The table described here compares top-performing single systems from the
0:13:33 VOiCES challenge
0:13:34 and our baseline system with various configurations.
0:13:38 A direct comparison between those systems and our baseline
0:13:43 cannot be made owing to the differences in the
0:13:46 input feature,
0:13:47 training set configuration,
0:13:49 and back-end classifiers.
0:13:52 The rows here describe the results of different ways of using the VOiCES dataset for training.
0:13:58 In one configuration,
0:14:00 we first trained RawNet using VoxCeleb1 and 2,
0:14:03 and then replaced
0:14:04 the top layer
0:14:06 and conducted fine-tuning with the VOiCES set.
0:14:09 In the other, we trained on all three datasets together.
0:14:15 Training on all three datasets simultaneously provided the best performance.
0:14:23 For the proposed SCSE, we explored the learning rate scheduler and the optimizer.
0:14:29 The best performance was achieved when we used SGD and a cosine annealing
0:14:34 scheduler.
0:14:36 SCSE showed six point
0:14:38 [unclear] percent EER
0:14:40 on the test set and an eleven point zero three percent relative error reduction
0:14:46 compared to the baseline.
0:14:50 We experimented with the proposed SEDA using different batch sizes and optimizers.
0:14:56 The best performance was achieved when using Adam
0:14:59 and setting the batch size to ten thousand.
0:15:03 SEDA showed [unclear] percent EER on the test set
0:15:08 and a fifteen point nine seven percent relative error reduction
0:15:11 compared to the baseline.
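For reference, a relative error reduction is the drop in equal error rate expressed as a fraction of the baseline EER. The sketch below shows the arithmetic only, with hypothetical EER values that are not results from the talk.

```python
def relative_error_reduction(baseline_eer, system_eer):
    # Percentage of the baseline error removed by the system.
    return 100.0 * (baseline_eer - system_eer) / baseline_eer

# Hypothetical EERs in percent, chosen only to illustrate the formula.
rer = relative_error_reduction(baseline_eer=8.0, system_eer=7.0)
```

Here a drop from 8.0 to 7.0 percent EER gives a 12.5 percent relative reduction, which is why a relative figure can look much larger than the absolute change in EER.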
0:15:16 Score normalization techniques are frequently applied in various acoustic mismatch conditions.
0:15:22 Most of the participants in the VOiCES 2019 challenge also
0:15:27 used score normalization techniques such as zero score normalization (z-norm),
0:15:31 test score normalization (t-norm), and symmetric score normalization (s-norm).
0:15:36 We experimented with applying these techniques to our baseline RawNet and the two proposed systems,
0:15:43 SCSE and SEDA,
0:15:45 and report the performance in the table below.
0:15:48 The results show that z-norm demonstrates the best performance in most cases in our experiments.
0:15:55 In addition, a score-level ensemble of the two proposed systems
0:15:59 led to additional performance improvement,
0:16:02 with an EER of
0:16:03 six point one nine percent with z-norm.
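Z-norm and a simple score-level ensemble can be sketched as follows. This is a generic illustration: the cohort scores are toy values, and averaging the two systems' scores is an assumed form of ensemble, since the talk does not specify how the scores were combined.

```python
import statistics

def z_norm(raw_score, cohort_scores):
    # Normalize a trial score by the statistics of the enrollment model's
    # scores against a cohort of impostor utterances.
    mean = statistics.mean(cohort_scores)
    std = statistics.pstdev(cohort_scores)
    return (raw_score - mean) / std

def average_ensemble(scores_a, scores_b):
    # Assumed ensemble: element-wise mean of two systems' trial scores.
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]

cohort = [0.0, 1.0, 2.0]            # toy impostor scores
normalized = z_norm(2.0, cohort)    # a trial score above the cohort mean
fused = average_ensemble([0.2, 0.8], [0.4, 0.6])
```

Normalizing before fusion puts the two systems' scores on a comparable scale, which is one common reason to apply it ahead of an ensemble.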
0:16:08 Finally, I will give the conclusion.
0:16:13 In this study, we proposed two speaker embedding enhancement systems.
0:16:18 Both proposed systems are independent of the front-end speaker embedding extractor,
0:16:23 and these systems can process not only distant utterances but also close-talk utterances.
0:16:29 This property alleviates the performance degradation
0:16:33 that occurs when close-talk utterances are input into a speaker embedding enhancement system
0:16:37 designed for distant utterances.
0:16:41 Compared to the baseline system, the two proposed systems, SCSE and SEDA,
0:16:47 showed relative error reductions of eleven point zero three percent
0:16:51 and fourteen point nine three percent, respectively.
0:16:55 These results show that they are effective on both close-talk and distant utterances.
0:17:01 In our future work, we will investigate integrating the two proposed systems into a single speaker
0:17:07 embedding enhancement system.
0:17:12 Thank you for listening.