0:00:18 | five |
---|
0:00:19 | Hi everyone. |
---|
0:00:21 | Now I'm going to present information preservation pooling |
---|
0:00:26 | for speaker embedding. |
---|
0:00:28 | My name is [inaudible], and I'm from [inaudible]. |
---|
0:00:38 | Okay, these are the contents of my |
---|
0:00:41 | presentation. First, I will |
---|
0:00:44 | briefly introduce the speaker recognition task, |
---|
0:00:49 | and then |
---|
0:00:51 | I will explain the previous works |
---|
0:00:54 | which are related to my research. |
---|
0:00:56 | Then I will explain my proposed method. |
---|
0:01:00 | Then I will |
---|
0:01:02 | show you the experimental settings and the results, |
---|
0:01:06 | and finally I will conclude my presentation. |
---|
0:01:12 | Okay. |
---|
0:01:14 | As you can see, this is the general embedding-based speaker recognition system. |
---|
0:01:19 | The first component is the frame-level network. |
---|
0:01:24 | It is usually implemented with a convolutional neural network or a time delay neural |
---|
0:01:32 | network. |
---|
0:01:33 | So |
---|
0:01:34 | it takes frame-level features, which can be MFCCs or spectrograms, |
---|
0:01:40 | as input, and it outputs the frame-level representations. |
---|
0:01:47 | The next component is the pooling layer. |
---|
0:01:50 | So |
---|
0:01:52 | it can be a simple average, which is known as |
---|
0:01:57 | average pooling, |
---|
0:01:58 | or statistics pooling, which computes the mean and variance vectors of |
---|
0:02:03 | the frame-level features. |
---|
0:02:06 | It aggregates the frame-level outputs |
---|
0:02:09 | from the frame-level network |
---|
0:02:12 | into a single fixed-dimensional vector. |
---|
0:02:18 | Why this is important is that |
---|
0:02:21 | it can make a fixed-dimensional vector from the variable-length |
---|
0:02:26 | frame-level |
---|
0:02:27 | outputs. |
---|
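
To make the pooling step concrete, here is a minimal sketch of statistics pooling as just described. It is an illustrative PyTorch snippet with assumed tensor shapes, not the implementation used in the talk.

```python
import torch

def statistics_pooling(frame_features: torch.Tensor) -> torch.Tensor:
    """Aggregate variable-length frame-level features of shape (batch, frames, dim)
    into a fixed-dimensional utterance-level vector of shape (batch, 2 * dim)
    by concatenating the per-dimension mean and standard deviation."""
    mean = frame_features.mean(dim=1)
    std = frame_features.std(dim=1)
    return torch.cat([mean, std], dim=1)

# e.g. 300 frames of 512-dim frame-level outputs -> one 1024-dim vector per utterance
frames = torch.randn(8, 300, 512)
print(statistics_pooling(frames).shape)  # torch.Size([8, 1024])
```
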
0:02:32 | The last component is the speaker classifier. |
---|
0:02:35 | What this does is that it classifies |
---|
0:02:38 | the speakers from the speaker embedding. |
---|
0:02:42 | Well, |
---|
0:02:44 | it helps the network to learn the |
---|
0:02:48 | speaker-dependent features. |
---|
0:02:51 | But |
---|
0:02:52 | it is only used for training, because |
---|
0:02:55 | in the verification scenario, |
---|
0:02:59 | in the test set you, |
---|
0:03:02 | you, |
---|
0:03:03 | you encounter the |
---|
0:03:05 | unseen speakers. |
---|
0:03:06 | So when testing the system, |
---|
0:03:11 | you use another scoring metric, like cosine similarity or PLDA, |
---|
0:03:17 | for |
---|
0:03:18 | the scoring. |
---|
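
As a small illustration of the test-time scoring just mentioned, here is a hedged sketch of cosine-similarity scoring between two speaker embeddings (PLDA scoring would replace this function; the embedding size is an arbitrary assumption).

```python
import torch
import torch.nn.functional as F

def cosine_score(enroll_embedding: torch.Tensor, test_embedding: torch.Tensor) -> torch.Tensor:
    """Verification score for a trial: cosine similarity between the enrollment
    and test speaker embeddings. Used instead of the classifier head, since the
    test speakers are unseen during training."""
    return F.cosine_similarity(enroll_embedding, test_embedding, dim=-1)

# accept the trial if the score exceeds a tuned decision threshold
score = cosine_score(torch.randn(256), torch.randn(256))
```
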
0:03:24 | This is |
---|
0:03:26 | the x-vector baseline system, |
---|
0:03:28 | the |
---|
0:03:30 | standard and conventional x-vector baseline system. |
---|
0:03:34 | As |
---|
0:03:35 | you can see, it is |
---|
0:03:37 | made up of the frame-level network, |
---|
0:03:40 | the pooling layer, and the segment-level |
---|
0:03:45 | network. |
---|
0:03:46 | So |
---|
0:03:47 | MFCCs are usually used for the input features of the network, |
---|
0:03:51 | and the |
---|
0:03:53 | first five layers are a time delay neural network |
---|
0:03:56 | which works at the frame level. |
---|
0:03:58 | Then the pooling layer aggregates the frame-level representations, |
---|
0:04:03 | and there are |
---|
0:04:05 | additional hidden layers |
---|
0:04:06 | which operate at the segment level. |
---|
0:04:10 | And the last layer is a softmax output layer, |
---|
0:04:13 | so the network is trained by predicting the training speakers. |
---|
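
For orientation, here is a rough PyTorch-style sketch of this kind of x-vector baseline: five frame-level layers (the TDNN modeled as dilated 1-D convolutions), statistics pooling, two segment-level layers, and a softmax speaker classifier. The layer widths, kernel sizes, and speaker count are illustrative assumptions rather than the exact configuration from the talk.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self, feat_dim=30, emb_dim=512, num_speakers=1211):
        super().__init__()
        # five frame-level layers (TDNN written as dilated 1-D convolutions)
        self.frame_net = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # segment-level layers on top of the pooled statistics
        self.segment1 = nn.Linear(2 * 1500, emb_dim)
        self.segment2 = nn.Linear(emb_dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)  # softmax over training speakers

    def forward(self, feats):                                # feats: (batch, frames, feat_dim)
        h = self.frame_net(feats.transpose(1, 2))            # (batch, 1500, frames')
        pooled = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb_a = torch.relu(self.segment1(pooled))
        emb_b = torch.relu(self.segment2(emb_a))
        logits = self.classifier(emb_b)                      # used only during training
        return logits, emb_a, emb_b
```
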
0:04:21 | Now I am going to |
---|
0:04:23 | introduce the mutual information estimation and maximization technique. |
---|
0:04:28 | So, |
---|
0:04:29 | mutual information |
---|
0:04:31 | is a measure of the mutual dependency |
---|
0:04:34 | between two random variables. |
---|
0:04:37 | So |
---|
0:04:38 | mutual information can be viewed as the Kullback-Leibler divergence |
---|
0:04:43 | between the joint distribution and the product of |
---|
0:04:47 | the marginals of the two random variables. |
---|
0:04:51 | A |
---|
0:04:52 | dual representation of the Kullback-Leibler divergence is the |
---|
0:04:58 | key element of the mutual information |
---|
0:05:01 | estimator, which I will explain later. |
---|
0:05:03 | So |
---|
0:05:04 | the following theorem gives a useful representation, |
---|
0:05:09 | which is called the Donsker-Varadhan representation. It gives a lower bound |
---|
0:05:15 | on the mutual information. |
---|
0:05:19 | Next, the |
---|
0:05:20 | mutual information neural estimator, |
---|
0:05:22 | which is called MINE. |
---|
0:05:24 | So |
---|
0:05:25 | the idea of MINE is to model the function |
---|
0:05:29 | T, I mean the function in the Donsker-Varadhan bound, |
---|
0:05:31 | parameterized by a deep neural network |
---|
0:05:34 | with parameters omega. |
---|
0:05:37 | So |
---|
0:05:39 | what this network does is estimate the mutual information |
---|
0:05:43 | by using it |
---|
0:05:45 | to model the Donsker-Varadhan representation of the mutual information. |
---|
0:05:52 | One more thing: |
---|
0:05:54 | using MINE, you can do the mutual information estimation and maximization |
---|
0:06:00 | together. |
---|
0:06:01 | The goal, |
---|
0:06:03 | well, |
---|
0:06:04 | here, is to maximize and estimate the mutual information between the input |
---|
0:06:11 | and output pairs of the encoder |
---|
0:06:15 | E_phi, |
---|
0:06:16 | which is a |
---|
0:06:17 | neural network |
---|
0:06:18 | with the parameters phi. |
---|
0:06:20 | So |
---|
0:06:22 | to do this, it relies on a sampling strategy, |
---|
0:06:25 | so, |
---|
0:06:26 | making positive and negative examples |
---|
0:06:30 | that are |
---|
0:06:30 | drawn from the joint and the product-of-marginals distributions, |
---|
0:06:35 | respectively. |
---|
0:06:37 | So in general, |
---|
0:06:39 | positive samples are collected from the same |
---|
0:06:42 | utterance, |
---|
0:06:43 | in this case the same utterance, |
---|
0:06:45 | or in the image domain it can be the same image, |
---|
0:06:48 | whereas, |
---|
0:06:49 | on the other hand, the |
---|
0:06:50 | negative samples are obtained from |
---|
0:06:53 | another randomly sampled utterance or image. |
---|
0:06:57 | So |
---|
0:06:58 | with this, |
---|
0:06:59 | we can optimize the mutual information, |
---|
0:07:02 | I mean, estimate it and |
---|
0:07:06 | maximize it together, because |
---|
0:07:07 | the Donsker-Varadhan representation is a lower bound. |
---|
0:07:11 | So when you maximize |
---|
0:07:15 | this lower bound, it can estimate and |
---|
0:07:19 | maximize the mutual information at the same time. |
---|
0:07:22 | So this is the MINE objective. |
---|
0:07:26 | It is |
---|
0:07:27 | derived from the Donsker-Varadhan representation directly. |
---|
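
For reference, the Donsker-Varadhan bound referred to above can be written as I(X; Z) >= E_{P_XZ}[T_omega(X, Z)] - log E_{P_X x P_Z}[exp(T_omega(X, Z))], and MINE maximizes the right-hand side over the parameters omega of a statistics network. Below is a minimal PyTorch-style sketch of that estimator with the sampling strategy just described (negatives built by re-pairing across the batch); the critic architecture and sizes are my assumptions, not the speaker's implementation.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_omega(x, z): a small critic scoring how related x and z are."""
    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic, x, z):
    """Donsker-Varadhan lower bound on I(X; Z).
    Positive pairs (x_i, z_i) come from the same utterance (joint distribution);
    negative pairs re-pair x_i with a z from another randomly chosen utterance
    (product of marginals), implemented here by shuffling z across the batch."""
    joint_term = critic(x, z).mean()
    z_neg = z[torch.randperm(z.size(0))]
    marginal_term = torch.logsumexp(critic(x, z_neg), dim=0) - math.log(z.size(0))
    return joint_term - marginal_term  # maximizing this both estimates and raises the MI
```
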
0:07:33 | So |
---|
0:07:35 | now I want to explain |
---|
0:07:38 | my proposed method. |
---|
0:07:39 | So |
---|
0:07:40 | this is |
---|
0:07:43 | information preservation pooling. |
---|
0:07:44 | The goal, |
---|
0:07:45 | the idea of information |
---|
0:07:48 | preservation pooling, which I will call IPP, is |
---|
0:07:52 | to prevent, |
---|
0:07:53 | to prevent information loss in the pooling stage. |
---|
0:07:57 | The way I do this is to use |
---|
0:08:00 | MINE to regularize the |
---|
0:08:03 | utterance-level features |
---|
0:08:05 | to have high mutual information with the |
---|
0:08:10 | frame-level features. |
---|
0:08:12 | So |
---|
0:08:13 | the point I want to emphasize here is: |
---|
0:08:16 | when |
---|
0:08:17 | the frame-level features and utterance-level features are extracted from the |
---|
0:08:23 | same input utterance, |
---|
0:08:25 | then |
---|
0:08:27 | the pair is treated as a |
---|
0:08:30 | joint sample, |
---|
0:08:33 | that is, a sample from the joint distribution, |
---|
0:08:36 | and in the other case, |
---|
0:08:37 | in the other kind of pair, |
---|
0:08:39 | the frame-level features and utterance-level features are extracted from different |
---|
0:08:43 | input utterances, so it is a sample from the product of marginals. |
---|
0:08:53 | So |
---|
0:08:54 | in information preservation pooling, |
---|
0:08:57 | I suggest two different ways to use MINE. |
---|
0:09:02 | One is global mutual information maximization, which I will call GIM, |
---|
0:09:08 | and the second one is |
---|
0:09:10 | local mutual information maximization, |
---|
0:09:12 | which is LIM. |
---|
0:09:14 | So |
---|
0:09:15 | the difference is that in GIM, |
---|
0:09:18 | the motivation is that, |
---|
0:09:22 | to capture the whole information in the frame-level features, I apply MINE to |
---|
0:09:28 | maximize the mutual information between the |
---|
0:09:31 | whole frame-level features and the |
---|
0:09:34 | utterance-level feature. |
---|
0:09:36 | So |
---|
0:09:38 | the two random variables for MINE will be the |
---|
0:09:42 | sequence of |
---|
0:09:43 | frame-level features, |
---|
0:09:48 | which is the whole sequence, |
---|
0:09:51 | and the utterance-level feature |
---|
0:09:54 | W, which is the output of the pooling module. |
---|
0:09:59 | In local mutual information maximization, |
---|
0:10:03 | the difference is the following. |
---|
0:10:07 | It can be that only a few |
---|
0:10:09 | frame-level features |
---|
0:10:10 | will be enough to make the right decision when |
---|
0:10:14 | applying MINE to |
---|
0:10:16 | predict whether a pair is a positive or negative sample, so some useful information can |
---|
0:10:23 | be |
---|
0:10:25 | ignored |
---|
0:10:27 | in the other individual frame-level features. |
---|
0:10:31 | I suggest LIM to prevent this. |
---|
0:10:34 | We maximize the mutual information between each single |
---|
0:10:38 | frame-level feature and the utterance-level feature, |
---|
0:10:42 | so the total loss will be the average of the |
---|
0:10:45 | mutual information between each single frame-level feature and the utterance-level feature. |
---|
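
As a rough sketch of the two objectives just described, reusing the StatisticsNetwork critics and the dv_mi_lower_bound function from the MINE sketch above: GIM computes one bound per batch between the whole frame-level sequence and the utterance-level feature, while LIM computes the bound between every single frame-level feature and the utterance-level feature and averages over frames. Summarizing the whole sequence by a temporal average for the global critic is my assumption; the talk does not spell out how the sequence is fed to the critic.

```python
import torch

def gim_objective(global_critic, frame_feats, utt_feat):
    """Global MI maximization: one score per utterance between the whole
    frame-level feature sequence (summarized here by a temporal average,
    an assumption) and the utterance-level feature from the pooling layer."""
    sequence_summary = frame_feats.mean(dim=1)                # (batch, frame_dim)
    return dv_mi_lower_bound(global_critic, sequence_summary, utt_feat)

def lim_objective(local_critic, frame_feats, utt_feat):
    """Local MI maximization: the bound is computed between each single
    frame-level feature and the utterance-level feature, so no individual
    frame can be ignored, and the result is averaged over all frames."""
    bounds = [dv_mi_lower_bound(local_critic, frame_feats[:, t], utt_feat)
              for t in range(frame_feats.size(1))]
    return torch.stack(bounds).mean()
```
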
0:10:55 | So this is the |
---|
0:10:56 | information |
---|
0:10:59 | preservation pooling architecture. |
---|
0:11:01 | So |
---|
0:11:03 | GIM and LIM can be applied |
---|
0:11:04 | together when training the speaker embedding system. |
---|
0:11:09 | So the IPP losses are optimized jointly with |
---|
0:11:12 | the |
---|
0:11:13 | conventional speaker classification loss |
---|
0:11:16 | during the training. |
---|
0:11:17 | So in this case I used the |
---|
0:11:21 | softmax cross-entropy loss |
---|
0:11:24 | for the speaker |
---|
0:11:25 | classification loss. |
---|
0:11:27 | So the first term |
---|
0:11:29 | is the speaker classification loss, which is the softmax cross-entropy, |
---|
0:11:33 | and the second and |
---|
0:11:35 | third terms are the global and local |
---|
0:11:38 | MINE objectives. |
---|
0:11:42 | So you can see this figure |
---|
0:11:44 | to understand the |
---|
0:11:46 | architecture. |
---|
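
Putting the pieces together, here is a hedged sketch of the joint objective just described, again reusing gim_objective and lim_objective from the sketch above: the speaker softmax cross-entropy plus the global and local MINE terms (the mutual-information bounds are maximized, so they enter the loss with a negative sign). The weights alpha and beta are tunable hyperparameters; the defaults here are placeholders.

```python
import torch.nn.functional as F

def ipp_training_loss(logits, speaker_labels, frame_feats, utt_feat,
                      global_critic, local_critic, alpha=1.0, beta=1.0):
    """Total training loss: speaker classification cross-entropy plus the
    (negated) global and local MI lower bounds, optimized jointly with the
    embedding network and both statistics networks."""
    classification = F.cross_entropy(logits, speaker_labels)
    gim = gim_objective(global_critic, frame_feats, utt_feat)   # from the sketch above
    lim = lim_objective(local_critic, frame_feats, utt_feat)    # from the sketch above
    return classification - alpha * gim - beta * lim
```
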
0:11:51 | Okay, these are the experimental settings. |
---|
0:11:54 | So I used the |
---|
0:11:56 | most commonly used datasets, VoxCeleb |
---|
0:11:59 | 1 and 2. |
---|
0:12:01 | So the input features were |
---|
0:12:05 | 30-dimensional MFCCs extracted with a |
---|
0:12:08 | 25 millisecond Hamming window with a |
---|
0:12:11 | 10 millisecond shift. |
---|
0:12:14 | So |
---|
0:12:16 | during training, each utterance was trimmed to a |
---|
0:12:20 | 2.5 second segment, which was done to make the |
---|
0:12:27 | input batch a |
---|
0:12:28 | fixed dimension. |
---|
0:12:30 | So mean and variance normalization was applied to the extracted |
---|
0:12:35 | MFCCs, and I used |
---|
0:12:38 | no voice activity detection |
---|
0:12:41 | or |
---|
0:12:42 | automatic silence removal of any kind, and |
---|
0:12:45 | also |
---|
0:12:45 | data augmentation was not used. |
---|
0:12:48 | So |
---|
0:12:49 | the network configuration is like this, |
---|
0:12:52 | like this. |
---|
0:12:53 | For the pooling layer I used attentive |
---|
0:12:56 | statistics pooling, which is the most commonly used one. |
---|
0:13:00 | And the |
---|
0:13:02 | feature dimension was too large for the MINE network, because the frame-level |
---|
0:13:08 | network output is one thousand |
---|
0:13:11 | five hundred thirty-six |
---|
0:13:13 | dimensional. |
---|
0:13:14 | So I |
---|
0:13:16 | added an additional dimension |
---|
0:13:19 | reduction network |
---|
0:13:20 | to make |
---|
0:13:22 | the dimension |
---|
0:13:25 | lower. |
---|
0:13:30 | And these are the training details. The batch size was |
---|
0:13:34 | one hundred twenty-eight, |
---|
0:13:36 | and to make the input for the MINE network, |
---|
0:13:39 | the segment-level feature was concatenated to the frame-level features along the |
---|
0:13:47 | feature dimension. |
---|
0:13:51 | And for the optimizer, |
---|
0:13:54 | the initial |
---|
0:13:55 | learning rate was |
---|
0:13:59 | set to |
---|
0:14:01 | ten to the minus three, and it was exponentially decreased at every epoch, so |
---|
0:14:07 | that at the final epoch it will be ten to the minus five. |
---|
0:14:14 | And the whole neural network implementation was done using TensorFlow, |
---|
0:14:21 | and for the back-end scoring metric I used |
---|
0:14:25 | cosine similarity and PLDA. |
---|
0:14:27 | So when using the PLDA, the last, |
---|
0:14:30 | er, not |
---|
0:14:32 | the last hidden layer was used as the |
---|
0:14:34 | speaker embedding. |
---|
0:14:36 | This is for the cosine similarity and this is |
---|
0:14:42 | for PLDA; this is, |
---|
0:14:44 | this is the EER. So when using the cosine similarity, the output of the last |
---|
0:14:49 | hidden layer was used, |
---|
0:14:53 | since its performance was higher for cosine similarity, |
---|
0:14:59 | while, instead of the last hidden layer output, for PLDA the |
---|
0:15:04 | output of the |
---|
0:15:05 | second-to-last hidden layer was used. |
---|
0:15:08 | So |
---|
0:15:09 | before the PLDA training, LDA was applied |
---|
0:15:13 | to reduce the speaker |
---|
0:15:15 | embedding dimension to |
---|
0:15:17 | two hundred, and it is followed by length normalization and whitening. |
---|
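
As a hedged sketch of an acoustic front-end matching the settings just listed (30-dimensional MFCCs, 25 ms Hamming window, 10 ms shift, per-utterance mean and variance normalization, no VAD, no augmentation): this uses torchaudio purely for illustration, while the actual system was implemented in TensorFlow, and the 16 kHz sampling rate, mel filterbank size, and file name are assumptions.

```python
import torch
import torchaudio

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,          # assumed sampling rate
    n_mfcc=30,                  # 30-dimensional MFCCs
    melkwargs={
        "n_fft": 400,           # 25 ms window at 16 kHz
        "win_length": 400,
        "hop_length": 160,      # 10 ms shift
        "window_fn": torch.hamming_window,
        "n_mels": 40,           # assumed mel filterbank size
    },
)

waveform, sample_rate = torchaudio.load("utterance.wav")    # hypothetical file
feats = mfcc(waveform)                                       # (channels, 30, frames)
# per-utterance mean and variance normalization; no VAD, no augmentation
feats = (feats - feats.mean(dim=-1, keepdim=True)) / (feats.std(dim=-1, keepdim=True) + 1e-6)
```
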
0:15:25 | Okay, |
---|
0:15:26 | these are the experiment results. |
---|
0:15:31 | I |
---|
0:15:33 | first show the results when using GIM only and using local, |
---|
0:15:39 | LIM, only. |
---|
0:15:40 | The |
---|
0:15:42 | left |
---|
0:15:43 | table is for the |
---|
0:15:45 | GIM-only case. |
---|
0:15:46 | So |
---|
0:15:50 | the best, the best performance, for the PLDA, was |
---|
0:15:54 | five point |
---|
0:15:55 | [unclear] four, |
---|
0:15:57 | but in the |
---|
0:16:00 | x-vector baseline system |
---|
0:16:03 | it was |
---|
0:16:05 | five point [unclear], |
---|
0:16:07 | so it showed a better performance than the baseline system. |
---|
0:16:13 | So |
---|
0:16:13 | the right table is for the LIM-only case. |
---|
0:16:17 | So the best |
---|
0:16:20 | performance was, |
---|
0:16:23 | for the |
---|
0:16:24 | PLDA, five point one eight |
---|
0:16:28 | percent, |
---|
0:16:29 | but |
---|
0:16:30 | in the baseline system it was five point six six, |
---|
0:16:34 | so it showed a better performance than |
---|
0:16:38 | the baseline system. |
---|
0:16:44 | This is the results table for IPP. |
---|
0:16:48 | So I |
---|
0:16:51 | tried various hyperparameter cases, I mean the |
---|
0:16:55 | weights for the MINE objectives, and I |
---|
0:17:03 | evaluated many cases. |
---|
0:17:04 | The best case, this case, was |
---|
0:17:07 | when the weight for the GIM was |
---|
0:17:10 | zero point zero one and that for LIM was at zero point, zero point |
---|
0:17:15 | one. |
---|
0:17:15 | It shows the |
---|
0:17:17 | best result when using the cosine similarity; |
---|
0:17:20 | it was at six point one four percent. |
---|
0:17:24 | So |
---|
0:17:26 | it showed a better |
---|
0:17:28 | performance than |
---|
0:17:29 | the baseline system, which was |
---|
0:17:32 | six point seven [unclear] percent in the x-vector baseline system. |
---|
0:17:39 | Then, |
---|
0:17:40 | okay, so |
---|
0:17:42 | since I |
---|
0:17:43 | found the |
---|
0:17:44 | best-case hyperparameter settings, I tried applying them to the |
---|
0:17:49 | VoxCeleb 2 |
---|
0:17:52 | dataset. |
---|
0:17:53 | So I trained the |
---|
0:17:54 | system with VoxCeleb 2, and it was evaluated on the same |
---|
0:17:59 | test set as before, which was the |
---|
0:18:00 | VoxCeleb 1 test set, |
---|
0:18:02 | for testing. |
---|
0:18:03 | So |
---|
0:18:05 | the performance |
---|
0:18:07 | was |
---|
0:18:08 | much better. |
---|
0:18:10 | So in the best case, using PLDA, it was |
---|
0:18:15 | three point zero nine percent EER. |
---|
0:18:18 | So it is, |
---|
0:18:19 | the performance showed a better result than the baseline, which was |
---|
0:18:23 | a |
---|
0:18:24 | three point six, |
---|
0:18:25 | three point six two, so it improved by about twenty percent. |
---|
0:18:30 | Well, |
---|
0:18:30 | overall the performance was better |
---|
0:18:33 | in terms of the |
---|
0:18:36 | EER. |
---|
0:18:44 | Okay. |
---|
0:18:46 | Using, |
---|
0:18:47 | so, as shown in these experiments, |
---|
0:18:50 | all |
---|
0:18:51 | of the new methods |
---|
0:18:54 | showed a |
---|
0:18:56 | better performance in every case, so |
---|
0:18:59 | it shows that MINE is very helpful for |
---|
0:19:04 | learning the right |
---|
0:19:06 | features, for |
---|
0:19:07 | preserving more information, speaker-relevant information, |
---|
0:19:10 | when training the speaker embedding system. |
---|
0:19:13 | So |
---|
0:19:15 | for |
---|
0:19:16 | our future research, |
---|
0:19:18 | we should experiment more with |
---|
0:19:22 | other pooling methods |
---|
0:19:25 | besides |
---|
0:19:27 | the attentive statistics pooling which I used, |
---|
0:19:31 | and our further research may be to combine the |
---|
0:19:34 | proposed method |
---|
0:19:36 | with other methods. |
---|
0:19:39 | So, |
---|
0:19:39 | thank you for listening to my presentation, and if you have any questions you can |
---|
0:19:44 | just email me at the address shown. |
---|