0:00:13 | Hello. I will be presenting our paper at this workshop. |
0:00:19 | Our paper proposes selective speaker embedding enhancement for distant utterances. |
0:00:27 | These are the contents. We start with an introduction and the motivation. |
0:00:32 | Next, we describe the VOiCES dataset. |
0:00:36 | We then introduce the baseline system, for which we use RawNet, and the proposed models in the remaining sections. |
0:00:44 | Experiments and corresponding results will then be presented, followed by our conclusion. |
0:00:49 | Now let us move to the introduction. |
0:00:53 | Recently, deep neural networks have been achieving remarkable robustness in speaker verification. |
0:01:01 | However, distant utterances are well known to degrade robustness, because they contain environmental factors |
0:01:08 | such as reverberation and noise. |
0:01:11 | To address such cases, the VOiCES (Voices Obscured in Complex Environmental Settings) |
0:01:16 | from a Distance challenge was held, |
0:01:19 | along with the corresponding dataset. |
0:01:24 | Previously, several studies have explored compensation for the performance degradation caused by distant environments. |
0:01:33 | However, two problems have remained in existing compensation methods. |
0:01:39 | One is the degradation on close-talk utterances. |
0:01:44 | Applying the compensation technique achieves good improvement in recognizing distant utterances. |
0:01:51 | However, when the distant compensation technique was applied to close-talk utterances, the performance degraded. |
0:02:01 | Therefore, it is difficult to use such compensation systems when recordings come from various distances. |
0:02:08 | Second, there is a dependency on the front-end system. |
0:02:12 | When a new speaker embedding structure is proposed, |
0:02:15 | corresponding studies on an adequate compensation method should be conducted all over again. |
0:02:23 | To address the aforementioned previous problems, |
0:02:26 | we want to build a system with the following properties. |
0:02:31 | First, it should be independent of the front-end speaker embedding extractor. |
0:02:36 | Second, the proposed system should perform selective enhancement, |
0:02:41 | considering the distance between the speech source and the microphone. |
0:02:45 | Third, both close-talk and distant utterances can be input into the proposed system. |
0:02:53 | Finally, the proposed system should comprise a relatively simple architecture, so that the additional cost is minimal. |
0:03:05 | We propose two distant utterance compensation systems. |
0:03:10 | The first system selectively compensates utterances according to the required extent of compensation. |
0:03:18 | We design a gating scheme to determine the level of the noise and reverberation, |
0:03:23 | and apply compensation accordingly. |
0:03:26 | The second approach is based on the auto-encoder framework. |
0:03:31 | By dividing the embedding representation into two subspaces, |
0:03:34 | the system is intended to correctly extract speaker information, improving the embedding quality. |
0:03:44 | One subspace is targeted to contain clean speaker information, regardless of the channel or distortion of the input, |
0:03:52 | and the other subspace is targeted to contain the distortion information, such as reverberation and noise. |
0:04:01 | Next, the dataset used in this study will be described. |
0:04:06 | The VOiCES dataset was collected by playing clean utterances through loudspeakers |
0:04:12 | and re-recording them with microphones under various distances and acoustic conditions. |
0:04:18 | The acoustic conditions vary according to the room, microphone type, placement angle, and distractor noise. |
0:04:25 | In the VOiCES dataset, there are three hundred speakers. |
0:04:30 | The development set comprises utterances from two hundred speakers, |
0:04:34 | and the evaluation set comprises utterances from the remaining one hundred speakers. |
0:04:44 | Next, we introduce RawNet, which is used as the baseline. |
0:04:48 | RawNet is a speaker embedding extractor that directly inputs raw waveforms. |
0:04:56 | Conventionally, acoustic features such as mel-frequency cepstral coefficients, spectrograms, or log mel filterbank energies were used to extract speaker embeddings. |
0:05:05 | These acoustic features exploit human knowledge to emphasize discriminative characteristics. |
0:05:12 | A convolutional neural network, which is frequently used as the front-end feature extractor, gradually increases its receptive field. |
0:05:21 | Thus, when a spectrogram is input to the CNN, the layers close to the input can consider only limited time and frequency regions. |
0:05:35 | Although these conventional acoustic features are still widely used, |
0:05:39 | many recent studies also explore modeling raw waveforms directly with DNNs. |
0:05:45 | It is expected that data-driven learning can better extract discriminative information in the hidden layers |
0:05:53 | when raw waveforms are processed by DNNs. |
0:05:56 | An appropriate frequency response can also be learned directly, |
0:06:02 | and, in addition, the process can be optimized for the target task. |
0:06:09 | RawNet adopts a convolutional architecture in which residual blocks extract frame-level representations, |
0:06:17 | as illustrated here. |
0:06:19 | The residual block is similar to the original ResNet block, with minor modifications. |
0:06:27 | The frame-level representations are then fed to a unidirectional gated recurrent unit layer, |
0:06:33 | which aggregates them into a single utterance-level representation. |
0:06:37 | A fully connected layer with one thousand and twenty-four nodes |
0:06:41 | then conducts an affine transformation, and the output of this layer is used as the speaker embedding. |
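The aggregation described above, with frame-level features passed through a unidirectional GRU and then an affine layer, can be sketched in numpy. This is only an illustration of the data flow, not the actual RawNet implementation: the layer sizes, random weights, and random frame-level features are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, HID_DIM, EMB_DIM = 128, 64, 1024  # hypothetical sizes

# Hypothetical frame-level features from the convolutional front end: (time, feat).
frames = rng.standard_normal((50, FRAME_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized GRU cell parameters (illustration only).
Wz, Uz = rng.standard_normal((HID_DIM, FRAME_DIM)), rng.standard_normal((HID_DIM, HID_DIM))
Wr, Ur = rng.standard_normal((HID_DIM, FRAME_DIM)), rng.standard_normal((HID_DIM, HID_DIM))
Wh, Uh = rng.standard_normal((HID_DIM, FRAME_DIM)), rng.standard_normal((HID_DIM, HID_DIM))

h = np.zeros(HID_DIM)
for x in frames:                        # unidirectional pass over time
    z = sigmoid(Wz @ x + Uz @ h)        # update gate
    r = sigmoid(Wr @ x + Ur @ h)        # reset gate
    h_new = np.tanh(Wh @ x + Uh @ (r * h))
    h = (1 - z) * h + z * h_new         # last h = utterance-level representation

# Affine transformation to the 1,024-dimensional speaker embedding.
W_emb = rng.standard_normal((EMB_DIM, HID_DIM)) * 0.01
embedding = W_emb @ h
print(embedding.shape)  # (1024,)
```

The key point is that a variable-length sequence of frame features collapses into one fixed-size utterance-level vector before the embedding layer.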
0:06:49 | In this section, we introduce our two systems that operate on the speaker embeddings extracted by RawNet. |
0:06:57 | The first proposed system is referred to as skip-connection-based selective enhancement, SC for short. |
0:07:03 | The figure on the right shows the diagram of SC. |
0:07:08 | This system comprises a DNN that inputs a speaker embedding, a skip connection, and a gating unit. |
0:07:18 | The SC DNN is similar to an auto-encoder, |
0:07:20 | and a second DNN decides the extent to which the skip connection is activated, similar to a gating mechanism. |
0:07:29 | During the training phase, the SC DNN is trained to minimize a mean squared error objective function between its output and a clean speaker embedding. |
0:07:39 | When a source utterance is input, the SC DNN's reconstruction target is the input itself. |
0:07:46 | On the other hand, when a distant utterance is input, the target is the embedding of the source utterance that was used to make the distant utterance. |
0:07:58 | The gate DNN is trained to minimize a binary cross-entropy objective function. |
0:08:04 | When a source utterance is input, the binary label is one, to make the skip connection fully working, |
0:08:11 | and when a distant utterance is input, the binary label is zero, to deactivate the skip connection. |
0:08:18 | In the figure below, the top panel presents the training phase of our proposed SC system. |
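A minimal sketch of how the two training targets just described could be constructed. The toy embeddings and the helper function are hypothetical; in practice the embeddings come from the front-end extractor.

```python
# Toy four-dimensional embeddings; in practice these are RawNet outputs.
clean_emb   = [0.9, -0.2, 0.4, 0.1]   # from the close-talk (source) utterance
distant_emb = [0.5, -0.6, 0.7, -0.3]  # from a distant re-recording of the same speech

def sc_training_example(embedding, is_source, paired_clean_embedding):
    """Return (input, regression target, gate label) for one SC training step."""
    if is_source:
        # Source utterance: reconstruct itself, gate label 1 (skip fully open).
        return embedding, embedding, 1.0
    # Distant utterance: regress toward the paired clean embedding, gate label 0.
    return embedding, paired_clean_embedding, 0.0

x, target, gate_label = sc_training_example(distant_emb, False, clean_emb)
print(target == clean_emb, gate_label)  # True 0.0
```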
0:08:27 | According to a previous study, |
0:08:29 | when compensation is conducted in the speaker embedding space, |
0:08:33 | compensation may not be needed when both utterances of an evaluation pair are clean. |
0:08:38 | This phenomenon is analyzed as a risk of losing the discriminative power |
0:08:43 | of speaker embeddings by changing their values |
0:08:46 | in the high-dimensional embedding space. |
0:08:50 | Based on this knowledge, the last component of the proposed system |
0:08:54 | performs speaker identification, for which the categorical cross-entropy loss function is used. |
0:09:01 | So the final loss function used to train the SC system is described here. |
0:09:09 | The first term is the reconstruction error, |
0:09:12 | the second term measures the distance detection error, |
0:09:15 | and the last term measures the speaker identification error; |
0:09:19 | together, they train the entire speaker embedding enhancement system. |
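The three-term objective just described can be sketched as a plain sum. The concrete implementations of each term below are standard textbook forms chosen for illustration, not the paper's exact code, and the toy inputs are fabricated.

```python
import math

def mse(pred, target):
    # Reconstruction error between compensated and target embeddings.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(p, label):
    # Distance-detection (gate) error; p is the sigmoid output in (0, 1).
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def cce(probs, speaker_id):
    # Speaker identification error over softmax probabilities.
    return -math.log(probs[speaker_id] + 1e-12)

def sc_loss(pred, target, gate_p, gate_label, spk_probs, spk_id):
    # Final SC objective: reconstruction + distance detection + speaker ID.
    return mse(pred, target) + bce(gate_p, gate_label) + cce(spk_probs, spk_id)

loss = sc_loss([0.5, 0.1], [0.4, 0.0], 0.9, 1.0, [0.1, 0.7, 0.2], 1)
print(loss)
```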
0:09:23 | In the test phase, the speaker embedding is input to both the SC DNN and the gate DNN. |
0:09:30 | The scale of the skip connection that connects the input and output of the SC DNN is not binary; |
0:09:37 | rather, the gate DNN uses a sigmoid activation function, |
0:09:41 | whose output lies between zero and one and represents how close the input is to the source, or clean, condition. |
0:09:48 | The compensated speaker embedding is obtained by adding the output of the SC DNN |
0:09:54 | and its gate-scaled skip connection. |
0:09:56 | In the figure below, the bottom panel represents the test process of our proposed SC system. |
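A toy sketch of the test-time combination as described: the SC DNN output plus the gate-scaled skip connection. The gate value and the SC DNN output here are fabricated stand-ins for real network outputs, and the combination rule is our reading of the description above.

```python
def compensate(embedding, sc_output, gate_value):
    """Compensated embedding: SC DNN output plus the gate-scaled skip connection.

    gate_value lies in (0, 1): it approaches one for clean inputs, so the skip
    connection passes the input embedding through, and approaches zero for
    distant inputs, so the compensated SC DNN output dominates.
    """
    return [o + gate_value * x for x, o in zip(embedding, sc_output)]

distant = [0.5, -0.6]    # hypothetical distant embedding
sc_out = [0.35, -0.25]   # hypothetical SC DNN output for it
compensated = compensate(distant, sc_out, 0.1)
print(compensated)
```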
0:10:05 | The second proposed system utilizes prior knowledge, |
0:10:20 | and is referred to as the discriminative auto-encoder. |
0:10:30 | It is composed of an encoder, a decoder, and two parallel intermediate hidden layers, |
0:10:37 | as the figure here illustrates. |
0:10:41 | The architecture design follows the denoising auto-encoder structure. |
0:10:46 | Inspired by principal component analysis, we devise two intermediate hidden layers: |
0:10:51 | one to capture the reverberation and noise, |
0:10:54 | and one to contain the clean speech information. |
0:11:01 | The intermediate hidden layers are thus designed so that the distortion and the clean speaker information are isolated into separate parts of the embedding. |
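A minimal numpy sketch of the split intermediate representation described above. The layer sizes, random weights, and the plain tanh encoder are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB, CLEAN, DIST = 16, 8, 4   # hypothetical embedding and subspace sizes

W_enc = rng.standard_normal((CLEAN + DIST, EMB)) * 0.1
W_dec = rng.standard_normal((EMB, CLEAN + DIST)) * 0.1

def encode(x):
    h = np.tanh(W_enc @ x)
    # Split the intermediate layer into a clean-speech subspace and a
    # distortion (reverberation + noise) subspace.
    return h[:CLEAN], h[CLEAN:]

def decode(clean, dist):
    # The decoder reconstructs the input from both subspaces.
    return W_dec @ np.concatenate([clean, dist])

x = rng.standard_normal(EMB)
clean_part, dist_part = encode(x)
recon = decode(clean_part, dist_part)
print(clean_part.shape, dist_part.shape, recon.shape)
```

At verification time only the clean subspace would be used as the enhanced speaker representation.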
0:11:09 | In the training phase, |
0:11:11 | the overall loss function corresponds to minimizing the intra-class variance and maximizing the inter-class variance. |
0:11:18 | We utilize the center loss and a margin-based loss for this purpose. |
0:11:23 | The center loss is adopted to minimize the intra-class variance, making the embeddings more discriminative, |
0:11:31 | and an additional term is used in the design to maximize the inter-class variance. |
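The center loss mentioned above penalizes the squared distance between each embedding and its class center. A toy computation, with made-up embeddings and centers:

```python
def center_loss(embeddings, labels, centers):
    """L_c = (1/2) * sum over samples of ||x_i - c_{y_i}||^2."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y])) / 2
    return total

# Fabricated two-dimensional embeddings for two speakers.
embs = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.1]]
labels = [0, 0, 1]
centers = {0: [0.9, 0.1], 1: [-0.9, 0.0]}
print(center_loss(embs, labels, centers))
```

Pulling embeddings toward their speaker's center shrinks intra-class variance, which is exactly the effect described in the talk.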
0:11:40 | As in the previous SC system, a detection loss function is used in training here as well. |
0:11:48 | To address the imbalance between the number of source utterances and distant utterances in the training set, |
0:11:54 | different sample weights are assigned according to the input. |
0:12:01 | The categorical cross-entropy loss function is also used as the objective for speaker identification. |
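One common way to realize such sample weighting is inverse-frequency weighting. The counts below are made up, and the exact weights used in the paper are not specified in this talk.

```python
def class_weights(n_source, n_distant):
    # Weight each class inversely to its count, scaled so the
    # majority class gets weight 1.
    m = max(n_source, n_distant)
    return {"source": m / n_source, "distant": m / n_distant}

# Hypothetical counts: distant re-recordings far outnumber source utterances.
w = class_weights(n_source=1000, n_distant=12000)
print(w)  # source utterances get the larger weight
```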
0:12:08 | The final loss function of the second proposed system is described below. |
0:12:14 | Here, gamma is a hyperparameter that scales the corresponding term, |
0:12:20 | and lambda is a hyperparameter that combines the loss functions, including the center loss and the inter-class loss. |
0:12:29 | Now let us move on to the experiments and results. |
0:12:34 | The training set comprises the VOiCES development set and the VoxCeleb 1 and 2 datasets. |
0:12:42 | The baseline system was trained on this combined data, which covers several thousand speakers. |
0:12:51 | For mini-batch construction, short utterances were duplicated and long utterances were cropped. |
0:12:59 | All the details are presented in the paper. |
0:13:04 | The baseline system uses the RawNet architecture with some modifications. |
0:13:10 | First, the number of output nodes was enlarged to consider more speakers. |
0:13:20 | Secondly, we increased the dimensionality of the speaker embedding to one thousand and twenty-four. |
0:13:28 | The table shown here, in the top panel, lists single systems from the VOiCES challenge |
0:13:34 | and our baseline system with various configurations, |
0:13:38 | for a comparison between existing systems and our baseline. |
0:13:43 | The columns indicate the input feature, the system configuration, and the back-end classifier. |
0:13:52 | Next, we describe the results of different ways of using the VOiCES dataset for training. |
0:13:58 | In one training strategy, we first trained the network using VoxCeleb 1 and 2, |
0:14:03 | then replaced the top layer and conducted fine-tuning with the VOiCES development set. |
0:14:09 | Our results, shown in the table, compare training over the three dataset configurations. |
0:14:15 | Training on all three datasets simultaneously provided the best performance. |
0:14:23 | For the proposed SC system, we explored the learning rate, scheduler, and optimizer. |
0:14:29 | The best performance was obtained with the selected optimizer and a cosine annealing learning-rate scheduler. |
0:14:36 | SC showed an equal error rate of about six percent on the test set, |
0:14:40 | an eleven point two three percent relative error reduction compared to the baseline. |
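Cosine annealing, mentioned above, decays the learning rate along a half cosine over the training run. A minimal sketch, with illustrative maximum and minimum rates:

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

print(cosine_annealing_lr(0, 100))    # lr_max at the start
print(cosine_annealing_lr(100, 100))  # decays to lr_min at the end
```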
0:14:50 | We also experimented with the proposed discriminative auto-encoder while varying the hidden layer sizes and related hyperparameters. |
0:14:56 | The best performance was achieved when the corresponding size was set to ten thousand. |
0:15:03 | This system shows an equal error rate of about six percent on the test set, |
0:15:08 | a fifteen point nine seven percent relative improvement compared to the baseline. |
0:15:16 | Score normalization techniques are frequently explored under various acoustic mismatch conditions. |
0:15:22 | Most of the top systems in the VOiCES 2019 challenge also used score normalization techniques, |
0:15:27 | such as z-norm and s-norm. |
0:15:36 | We experimented by applying these techniques to our baseline and to the two proposed systems, |
0:15:43 | SC and the discriminative auto-encoder, |
0:15:45 | and report the resulting measures in the table below. |
0:15:48 | The results show that z-norm demonstrated the best performance in most cases in our experiments. |
0:15:55 | In addition, score-level fusion of the two proposed systems demonstrated a further improvement, |
0:16:02 | with an equal error rate of six point one nine percent with z-norm. |
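Z-norm, the best-performing technique in these experiments, normalizes each trial score with impostor-cohort statistics estimated for the enrollment model. A toy sketch, with fabricated cohort scores:

```python
from statistics import mean, pstdev

def z_norm(score, cohort_scores):
    """Normalize a trial score: z = (s - mu_impostor) / sigma_impostor."""
    mu = mean(cohort_scores)
    sigma = pstdev(cohort_scores)
    return (score - mu) / sigma

# Hypothetical impostor scores for one enrollment model.
cohort = [0.10, 0.05, -0.02, 0.08, -0.06]
print(z_norm(0.45, cohort))  # a clearly positive normalized score
```

Because the statistics are estimated per enrollment model, z-norm makes scores from different models and acoustic conditions more directly comparable before thresholding.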
0:16:08 | Finally, we present the conclusion. |
0:16:13 | In this study, we proposed two speaker embedding enhancement systems. |
0:16:18 | Both proposed systems are independent of the front-end speaker embedding extraction, |
0:16:23 | and the systems can process not only distant utterances but also close-talk utterances. |
0:16:29 | This property ensures no degradation when close-talk utterances are input into the speaker embedding enhancement system. |
0:16:41 | Compared to the baseline system, the two proposed systems, SC and the discriminative auto-encoder, |
0:16:47 | improved the equal error rate by eleven point two three percent and fourteen point nine three percent relative, respectively. |
0:16:55 | These results show that the systems can process both close-talk and distant utterances. |
0:17:01 | Our future work includes integrating the two proposed systems into a single speaker embedding enhancement system. |
0:17:12 | Thank you for listening. |