0:00:18 Hello, everyone.
0:00:21 Now I'm going to present "Information Preservation Pooling for Speaker Embedding."
0:00:28 My name is [inaudible], and I am from Seoul National University.
0:00:38 Okay, these are the contents of my presentation.
0:00:44 First, I will briefly introduce the speaker recognition task,
0:00:49 and then I will explain the previous works which are related to my research.
0:00:56 Then I will explain my proposed method,
0:01:00 then I will show you the experimental settings and their results,
0:01:06 and finally I will conclude my presentation.
0:01:12 Okay.
0:01:14 As you can see, this is the general DNN-based speaker recognition system.
0:01:19 The first component is the frame-level network, which is commonly implemented with, for example, convolutional or time-delay neural networks.
0:01:34 It takes acoustic features, which can be MFCCs or spectrograms, as input, and it outputs frame-level representations.
0:01:47 The next component is the pooling layer.
0:01:52 It aggregates the frame-level representations, for example by simple averaging, or by statistics pooling, which computes the mean and standard deviation of the frame-level features.
0:02:06 In other words, it aggregates the frame-level outputs from the frame-level network into a fixed-dimensional vector.
0:02:18 This is important because it can produce a fixed-dimensional embedding from a variable-length sequence of frame-level outputs.
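To make the statistics-pooling step described above concrete, here is a minimal numpy sketch; the function name and the feature sizes are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Map a (num_frames, feat_dim) sequence to a fixed 2*feat_dim vector.

    The per-dimension mean and standard deviation are computed over the time
    axis, so the output size does not depend on the utterance length.
    """
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths still yield embeddings of the same size.
short_utt = np.random.randn(120, 512)   # 120 frames of 512-dim frame-level features
long_utt = np.random.randn(980, 512)
print(statistics_pooling(short_utt).shape, statistics_pooling(long_utt).shape)  # (1024,) (1024,)
```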
0:02:32 The last component is the speaker classifier.
0:02:35 What this does is to classify the speakers from the speaker embedding,
0:02:44 which helps the network learn speaker-dependent features.
0:02:52 It is only used for training, because in the verification scenario the test set contains unseen speakers.
0:03:06 So when testing the system, you use another scoring metric, like cosine similarity or PLDA scoring.
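For reference, cosine-similarity scoring of two embeddings is just the normalized dot product; a tiny sketch (the helper name is mine):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; larger values mean
    the two utterances are more likely to come from the same speaker."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12))
```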
0:03:24 This is the x-vector baseline system.
0:03:35 As you can see, it is made up of a frame-level network, a pooling layer, and segment-level networks.
0:03:47 MFCCs are usually used as the input features of the network,
0:03:53 and the first five layers are a time-delay neural network which works at the frame level.
0:03:58 Then the pooling layer aggregates the frame-level representations,
0:04:03 and there are additional hidden layers which operate at the segment level.
0:04:10 The last layer is a softmax output layer, which predicts the speaker.
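Below is a hedged TensorFlow/Keras sketch of an x-vector-style model of the kind described here (five frame-level layers, statistics pooling, segment-level layers, softmax output). The layer sizes follow the commonly published x-vector recipe and are assumptions, not necessarily the exact configuration used in this work.

```python
import tensorflow as tf

def stats_pool(x):
    # Concatenate per-dimension mean and standard deviation over the time axis.
    return tf.concat([tf.reduce_mean(x, axis=1), tf.math.reduce_std(x, axis=1)], axis=-1)

def build_xvector_like(num_speakers: int, feat_dim: int = 30) -> tf.keras.Model:
    inp = tf.keras.Input(shape=(None, feat_dim))  # variable-length MFCC sequence
    x = inp
    # Frame-level network: dilated 1-D convolutions play the role of the TDNN layers.
    for filters, kernel, dilation in [(512, 5, 1), (512, 3, 2), (512, 3, 3), (512, 1, 1), (1500, 1, 1)]:
        x = tf.keras.layers.Conv1D(filters, kernel, dilation_rate=dilation,
                                   padding="same", activation="relu")(x)
    pooled = tf.keras.layers.Lambda(stats_pool)(x)                 # pooling layer
    seg1 = tf.keras.layers.Dense(512, activation="relu")(pooled)   # segment-level layers;
    seg2 = tf.keras.layers.Dense(512, activation="relu")(seg1)     # their outputs serve as embeddings
    out = tf.keras.layers.Dense(num_speakers, activation="softmax")(seg2)  # training-only classifier
    return tf.keras.Model(inp, out)
```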
0:04:21 Now I am going to introduce the mutual information estimation technique.
0:04:29 Mutual information is a measure of the mutual dependency between two random variables.
0:04:38 It can be viewed as the Kullback-Leibler divergence between the joint distribution and the product of the marginals of the two random variables.
0:04:52 The Donsker-Varadhan representation of this divergence is the key element of the mutual information estimator, which I will explain later.
0:05:04 The following theorem gives a useful representation, which is called the Donsker-Varadhan representation; it gives a lower bound of the mutual information.
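Written out, the two facts mentioned on this slide are standard: mutual information equals a KL divergence, and the Donsker-Varadhan representation bounds it from below when the function T is restricted to a parametric family (the notation below is the usual one, not copied from the slide).

```latex
I(X;Z) \;=\; D_{\mathrm{KL}}\!\big(P_{XZ}\,\|\,P_X \otimes P_Z\big)
\;\ge\; \sup_{\omega}\;\; \mathbb{E}_{P_{XZ}}\!\big[T_{\omega}(x,z)\big]
\;-\; \log \mathbb{E}_{P_X \otimes P_Z}\!\big[e^{T_{\omega}(x,z)}\big]
```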
0:05:19 Next is the mutual information neural estimator, which is called MINE.
0:05:25 The idea of MINE is to model the function T in the Donsker-Varadhan representation by a deep neural network with parameters ω.
0:05:39 This network estimates the mutual information by using it to model the Donsker-Varadhan representation of the mutual information.
0:05:54 Using MINE, you can do the mutual information estimation and maximization together.
0:06:04 The goal is to maximize and estimate the mutual information between the input and output pairs of the encoder E_φ, which is a neural network with parameters φ.
0:06:22 The key is to rely on a sampling strategy,
0:06:26 making positive and negative examples that are drawn from the joint distribution and the product of the marginal distributions, respectively.
0:06:37 In general, the positive samples are collected from the same utterance; in this case it is the same utterance, and for images it would be the same image,
0:06:49 while the negative samples are obtained by pairing with another randomly sampled utterance or image.
0:06:58 In this way we can optimize the mutual information, that is, estimate and maximize it together, because the Donsker-Varadhan representation is a lower bound.
0:07:11 So when you maximize this lower bound, you can estimate and maximize the mutual information at the same time.
0:07:22 This is the MINE objective; it is derived directly from the Donsker-Varadhan representation.
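A small numpy sketch of how that lower bound is evaluated from batch samples; the critic scores here are placeholders for the outputs of the real statistics network.

```python
import numpy as np

def dv_mi_lower_bound(scores_joint: np.ndarray, scores_marginal: np.ndarray) -> float:
    """Donsker-Varadhan estimate:  E_joint[T] - log E_marginal[exp(T)].

    `scores_joint` are critic outputs T(x, z) on positive (paired) samples and
    `scores_marginal` are critic outputs on negative (shuffled) samples.
    Because this is a lower bound, maximizing it with respect to the critic and
    the encoder both estimates and increases the mutual information.
    """
    return float(scores_joint.mean() - np.log(np.exp(scores_marginal).mean()))

# Toy check: higher scores on positives than on negatives give a positive estimate.
print(dv_mi_lower_bound(np.array([2.0, 1.5, 1.8]), np.array([-1.0, -0.5, -1.2])))
```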
0:07:35 Now I want to explain my proposed method.
0:07:43 This is information preservation pooling.
0:07:45 The idea of information preservation pooling, which I will call IPP, is to prevent information loss in the pooling stage.
0:07:57 To ensure this, I use MINE to regularize the utterance-level features to have high mutual information with the frame-level features.
0:08:13 What I maximize here with MINE is the following: frame-level features and utterance-level features that are extracted from the same input utterance
0:08:27 make a positive pair, that is, a pair sampled from the joint distribution,
0:08:37 while a pair of frame-level features and utterance-level features extracted from different utterances is a sample from the product of the marginals.
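A sketch of the sampling strategy just described, with names and shapes of my own choosing: positives pair frame-level and utterance-level features from the same utterance, negatives pair them across different utterances by shuffling within the batch.

```python
import numpy as np

def make_ipp_pairs(frame_feats: np.ndarray, utt_feats: np.ndarray, rng=np.random):
    """frame_feats: (batch, num_frames, d_frame); utt_feats: (batch, d_utt).

    Positive pairs approximate samples from the joint distribution (same
    utterance); negative pairs approximate the product of marginals
    (utterance-level features taken from other utterances in the batch).
    """
    perm = rng.permutation(frame_feats.shape[0])   # in practice, avoid self-pairings
    positives = (frame_feats, utt_feats)
    negatives = (frame_feats, utt_feats[perm])
    return positives, negatives
```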
0:08:54 In information preservation pooling, I suggest two different ways to use MINE.
0:09:02 One is global mutual information maximization, which I call GIM,
0:09:08 and the second one is local mutual information maximization, which is LIM.
0:09:15 The difference is that in GIM, to model the utterance-level information of the frame-level features, I apply MINE to maximize the mutual information between all the frame-level features and the utterance-level feature.
0:09:38 So the two random variables for MINE will be the sequence of frame-level features, h_1 through h_T, which means the whole sequence,
0:09:51 and the utterance-level feature w, which is the output of the pooling module.
0:09:59 In local mutual information maximization, the difference is this:
0:10:07 even one or a few frame-level features may be enough to take the right decision,
0:10:14 that is, to predict whether a pair comes from the positive or the negative samples, so some useful information
0:10:25 contained in the other individual frames and features can be ignored.
0:10:31 I suggest this to prevent that:
0:10:34 we maximize the mutual information between each single frame-level feature and the utterance-level feature,
0:10:42 so the total loss is taken over every frame as the mutual information between a single frame-level feature and the utterance-level feature.
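To contrast the two objectives just described, here is a schematic sketch: GIM scores the whole frame-level sequence against the utterance-level feature once, while LIM scores every individual frame against it and averages, so no single frame can dominate the decision. The toy critic stands in for the real MINE networks.

```python
import numpy as np

def toy_critic(x: np.ndarray, w: np.ndarray) -> float:
    """Stand-in for a MINE statistics network T(., w); returns one scalar score."""
    return float(np.tanh(x.mean() + w.mean()))

def gim_score(frame_seq: np.ndarray, w: np.ndarray) -> float:
    # Global objective: the whole frame-level sequence h_1..h_T is paired with
    # the utterance-level feature w, giving a single score per pair.
    return toy_critic(frame_seq, w)

def lim_score(frame_seq: np.ndarray, w: np.ndarray) -> float:
    # Local objective: every single frame is paired with w and the scores are
    # averaged, so the bound cannot be satisfied by only a few "easy" frames.
    return float(np.mean([toy_critic(frame, w) for frame in frame_seq]))
```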
0:10:55 So this is the whole information preservation pooling architecture.
0:11:03 GIM and LIM can be applied together when training the speaker embedding system.
0:11:09 The IPP losses are optimized jointly with the conventional speaker classification loss during training.
0:11:17 In this case I used the softmax cross-entropy loss for the speaker classification loss.
0:11:27 So the first term is the speaker classification loss, which is the softmax cross-entropy,
0:11:33 and the second and the third terms are the global and local MINE objectives.
0:11:42 You can see this figure to understand my architecture.
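In symbols (the λ weights are my notation, and the subtraction reflects that minimizing the combined loss maximizes the mutual-information bounds), the training objective just described has roughly this form:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{softmax-CE}}
\;-\; \lambda_{G}\,\widehat{I}\big(h_{1:T};\,w\big)
\;-\; \lambda_{L}\,\frac{1}{T}\sum_{t=1}^{T}\widehat{I}\big(h_{t};\,w\big)
```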
0:11:51 Okay, these are the experimental settings.
0:11:54 I used the most commonly used datasets, VoxCeleb 1 and 2.
0:12:01 The input features were 30-dimensional MFCCs, extracted with a 25-millisecond Hamming window and a 10-millisecond shift.
0:12:16 During training, each utterance was cropped into 2.5-second segments, which was done to make the input batch have a fixed dimension.
0:12:30 Mean and variance normalization was applied to the extracted MFCCs, and I used no voice activity detection or automatic silence removal of any kind.
0:12:45 Data augmentation was not applied either.
0:12:48 The network configuration is like this.
0:12:53 For the pooling layer I used statistics pooling, which is the most commonly used one.
0:13:02 The dimension was too large for the MINE network, because the output of the last frame-level layer is 1,536-dimensional,
0:13:16 so I added an additional layer to the network to make that dimension lower.
0:13:30 These are the training details. The batch size was 128,
0:13:36 and to make the input for the MINE network, the segment-level feature is concatenated to the frame-level features along the feature dimension.
0:13:51 The optimizer's initial learning rate was 10^-3,
0:13:59 and it was exponentially decayed at every epoch until the final learning rate of 10^-5.
0:14:14 The whole neural network implementation was done using TensorFlow,
0:14:21 and for the back-end scoring metric I used cosine similarity and PLDA.
0:14:27 Which hidden layer output was used as the speaker embedding depended on the back end.
0:14:36 This is for the cosine similarity and this is for PLDA; these are the EERs.
0:14:44 When using the cosine similarity, the output of the last hidden layer gave the higher performance, so for cosine similarity the last hidden layer output was used,
0:15:05 while for PLDA the output of the second-to-last hidden layer was used.
0:15:09 Before the PLDA training, LDA was applied to reduce the speaker embedding dimension to 200, and it was followed by length normalization and whitening.
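A hedged sklearn/numpy sketch of the back-end preprocessing just described, i.e. LDA down to 200 dimensions followed by whitening and length normalization; the exact order and tooling are assumptions, and the PLDA model itself is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess_embeddings(train_emb, train_spk_ids, test_emb, lda_dim=200):
    """Reduce embedding dimension with LDA (requires more than `lda_dim` speakers),
    then whiten and length-normalize, roughly mirroring the steps in the talk."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    train_lda = lda.fit_transform(train_emb, train_spk_ids)
    test_lda = lda.transform(test_emb)

    whitener = PCA(whiten=True).fit(train_lda)
    train_w, test_w = whitener.transform(train_lda), whitener.transform(test_lda)

    # Length normalization: project every embedding onto the unit sphere.
    l2 = lambda m: m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    return l2(train_w), l2(test_w)
```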
0:15:25 Okay, these are the experimental results.
0:15:33 The first experiments are for the cases using GIM only and LIM only.
0:15:43 The left table is for the GIM-only case.
0:15:50 The best performance, obtained with PLDA, was five point [inaudible] four percent,
0:16:00 but in the x-vector baseline system it was five point [inaudible],
0:16:07 so it showed better performance than the baseline system.
0:16:13 The right table is for the LIM-only case.
0:16:17 The best performance, again with PLDA, was 5.18 percent,
0:16:29 while in the baseline system it was 5.66,
0:16:34 so it also showed better performance than the baseline system.
0:16:44 These are the results for the whole IPP.
0:16:48 I searched over various hyperparameter cases, varying the weights for the GIM and LIM MINE objectives.
0:17:03 Among the many cases, the best case was
0:17:07 when the weight for GIM was 0.01 and the weight for LIM was 0.1.
0:17:15 It showed the best performance when using the cosine similarity,
0:17:20 where it was at 6.14 percent,
0:17:24 so it showed better performance than the baseline system,
0:17:32 which was 6.75 percent for the x-vector baseline with cosine similarity.
0:17:39 Then, okay,
0:17:44 I took the best-case hyperparameter settings and applied them to the VoxCeleb2 dataset.
0:17:53 So I trained the system with VoxCeleb2, and it was evaluated on the same test set, which was the VoxCeleb1 test set.
0:18:05 The performance was much better:
0:18:10 in the best case, using PLDA, it was 3.09 percent EER,
0:18:18 so it showed better performance than the baseline, which was 3.62 percent,
0:18:30 and with both scoring metrics the performance was better.
0:18:44 So, to summarize:
0:18:47 as these experimental results show,
0:18:51 both of my proposed methods
0:18:56 showed better performance in every case,
0:18:59 which shows that MINE is very helpful for learning the right features, that is, features that preserve more speaker-relevant information,
0:19:10 when training the speaker embedding system.
0:19:13 So,
0:19:16 in our future research,
0:19:18 we should experiment more with other pooling methods
0:19:27 besides the statistics pooling which I used here,
0:19:31 and our future research may also be to combine the proposed method with other works.
0:19:39 Thank you for listening to my presentation, and if you have any questions, you can just email me at the address shown on the slide.
0:19:49 Thank you.