Now I'm going to present information preservation pooling for speaker embedding. My name is [unclear], and I'm from [unclear].
"'kay" designed the contents of my
presentation first
introduce you briefly about the speaker recognition task
and then
i will explain your bone really borders
which is really to my research
then i will explain you my proposed method
then i will
sure you the experiments settings and its results
and finally i would concludes my presentation
Okay. As you can see, this is the general deep-neural-network-based speaker recognition system. The first component is the frame-level network; it is usually implemented with, for example, convolutional or time delay neural networks. It takes acoustic features, which can be MFCCs or spectrograms, as input and outputs frame-level representations. The next component is the pooling layer. It aggregates the frame-level outputs, either by simple averaging, which is average pooling, or by computing mean and standard deviation vectors over the frame-level features, which is statistics pooling. The pooling layer takes the frame-level outputs from the frame-level network and outputs a fixed-dimensional vector. This is important because it can make a fixed-dimensional embedding from a variable-length sequence of frame-level outputs.
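As a minimal sketch of statistics pooling, assuming the frame-level outputs are a [batch, frames, dim] tensor (the function name and TensorFlow usage here are illustrative, not taken from the slides):

import tensorflow as tf

def statistics_pooling(h, eps=1e-8):
    # h: frame-level outputs with shape [batch, frames, dim].
    # Mean and standard deviation over the time axis give one fixed-size
    # vector per utterance, regardless of the number of frames.
    mean = tf.reduce_mean(h, axis=1)
    std = tf.sqrt(tf.math.reduce_variance(h, axis=1) + eps)
    return tf.concat([mean, std], axis=-1)   # shape [batch, 2 * dim]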
The last component is the speaker identifier. What this does is classify speakers from the speaker embedding; it helps the network learn speaker-dependent features. It is only used during training, because in the verification scenario the test set usually contains unseen speakers. So when testing the system, you use another scoring metric, like cosine similarity or PLDA scoring.
This is the x-vector baseline system. As you can see, it is made up of the frame-level network, the pooling layer, and the segment-level network. MFCCs are usually used as the input features of the network. The first five layers are a time delay neural network, which works at the frame level; then the pooling layer is applied over the frame-level representations. There are additional hidden layers, which operate at the segment level, and the last layer is a softmax output layer, so the system is trained to classify the speakers.
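A very rough Keras sketch of this kind of baseline is below; the layer widths, kernel sizes, and helper name are illustrative placeholders rather than the exact baseline configuration:

import tensorflow as tf
from tensorflow.keras import layers

def build_xvector_like(num_speakers, feat_dim=30, hidden=512):
    feats = layers.Input(shape=(None, feat_dim))              # [frames, feat_dim]
    x = feats
    # Frame-level part: a stack of 1-D (time-delay style) convolutions.
    for kernel in (5, 3, 3, 1, 1):
        x = layers.Conv1D(hidden, kernel, padding="same", activation="relu")(x)
    # Pooling layer: mean and standard deviation over time (statistics pooling).
    mean = layers.GlobalAveragePooling1D()(x)
    std = layers.Lambda(lambda t: tf.math.reduce_std(t, axis=1))(x)
    stats = layers.Concatenate()([mean, std])
    # Segment-level part: hidden layers, then the softmax speaker classifier.
    emb_a = layers.Dense(hidden, activation="relu")(stats)
    emb_b = layers.Dense(hidden, activation="relu")(emb_a)
    out = layers.Dense(num_speakers, activation="softmax")(emb_b)
    return tf.keras.Model(feats, out)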
Now I'm going to introduce the mutual information estimation and maximization technique. Mutual information is a measure of the mutual dependency between two random variables. It can be viewed as the Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions of the two random variables.
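In symbols, this is the standard definition (the notation here is mine, not copied from the slide):

I(X; Z) = D_{\mathrm{KL}}\big( P_{XZ} \,\|\, P_X \otimes P_Z \big)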
The Donsker-Varadhan representation of the Kullback-Leibler divergence is the key element of the mutual information neural estimator, which I will explain later. The following theorem gives a useful representation, which is called the Donsker-Varadhan representation; it gives a lower bound on the mutual information.
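A sketch of the bound as it is usually stated (again, notation mine): for any function T for which the expectations are finite,

I(X; Z) \;\ge\; \mathbb{E}_{P_{XZ}}\!\left[ T(x, z) \right] \;-\; \log \mathbb{E}_{P_X \otimes P_Z}\!\left[ e^{T(x, z)} \right]

and the bound becomes tight at the supremum over T.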
Next is the mutual information neural estimator, which is called MINE. The idea of MINE is to model the function T, the function in the Donsker-Varadhan representation, with a deep neural network with parameters omega. This network estimates the mutual information through the Donsker-Varadhan representation. Using MINE, you can do mutual information estimation and maximization together during training.
In particular, MINE can be used to estimate and maximize the mutual information between the input and output pairs of an encoder E_psi, which is a neural network with parameters psi. The key is the sampling strategy: positive and negative examples are drawn from the joint distribution and from the product of the marginal distributions, respectively.
In general, positive samples are collected from the same utterance, or, in the image domain, from the same image, while the negative samples are obtained from another randomly sampled utterance or image.
We can estimate and maximize the mutual information together, because the Donsker-Varadhan representation is a lower bound: when you maximize the bound, you estimate and maximize the mutual information at the same time. This is the MINE objective; it is derived directly from the Donsker-Varadhan representation.
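A minimal sketch of that objective in TensorFlow 2 is below; the statistics network t_net (the network T with parameters omega) is a hypothetical callable that returns one scalar score per pair, and the within-batch shuffle is one common way to draw negative pairs:

import tensorflow as tf

def mine_lower_bound(t_net, x, z):
    # Positive pairs: (x_i, z_i) produced together, i.e. from the joint distribution.
    t_joint = t_net(x, z)
    # Negative pairs: shuffle z across the batch so each x_i is matched with the
    # z of another sample, approximating the product of marginals.
    t_marginal = t_net(x, tf.random.shuffle(z))
    # Donsker-Varadhan bound: E_P[T] - log E_Q[exp(T)]. Maximizing this both
    # estimates and increases the mutual information.
    return tf.reduce_mean(t_joint) - tf.math.log(tf.reduce_mean(tf.exp(t_marginal)))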
so
the all i want to spend new about my
proposed method
so
this is
information preservation pulling
the log
the idea of all i information fusion
paediatrician pulling one which i will call i d p so
to prevent
to prevent a information vacation in the putting stage
so i will ensure into this use
mine too regularized all
utterance level features
to have a high mutual information we the
frame level features
The point I make here is that when the frame-level features and the utterance-level features are extracted from the same input utterance, the pair is treated as a sample from the joint distribution. On the other hand, when a pair of frame-level features and utterance-level features is extracted from different input utterances, it is treated as a sample from the product of the marginals.
In information preservation pooling, I suggest two different ways to use MINE. One is global mutual information maximization, which I call GIM, and the second one is local mutual information maximization, which is LIM.
In GIM, to model the information of the whole sequence of frame-level features, I apply MINE to maximize the mutual information between all the frame-level features together and the utterance-level feature. So the two random variables for MINE are the sequence of frame-level features, that is, the whole frame-level output sequence, and the utterance-level feature w, which is the output of the pooling module.
In local mutual information maximization, the difference is this: a few of the frame-level features may already be enough to make the right decision when predicting whether a pair comes from the positive or the negative samples, so some useful information in the other individual frame-level features can be ignored. I suggest LIM to prevent this: we maximize the mutual information between each single frame-level feature and the utterance-level feature, so the total loss averages, over every frame, the mutual information between a single frame-level feature and the utterance-level feature.
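As a rough sketch of the two objectives (TensorFlow 2 assumed; h are frame-level features of shape [batch, frames, dim], w and w_neg are the utterance-level features of the same and of a different utterance, and t_net is again a hypothetical statistics network):

import tensorflow as tf

def dv_bound(t_pos, t_neg):
    # Donsker-Varadhan lower bound on the mutual information.
    return tf.reduce_mean(t_pos) - tf.math.log(tf.reduce_mean(tf.exp(t_neg)))

def gim_objective(t_net, h, w, w_neg):
    # Global MI: the whole frame-level sequence is paired with the utterance-level
    # feature of the same utterance (positive) or of another utterance (negative).
    return dv_bound(t_net(h, w), t_net(h, w_neg))

def lim_objective(t_net, h, w, w_neg):
    # Local MI: every single frame is paired with the utterance-level feature,
    # so no individual frame's information can be ignored.
    frames = tf.shape(h)[1]
    tile = lambda v: tf.tile(v[:, tf.newaxis, :], [1, frames, 1])
    return dv_bound(t_net(h, tile(w)), t_net(h, tile(w_neg)))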
This is the whole information preservation pooling architecture. GIM and LIM can be applied together when training the speaker embedding system. The IPP objectives are optimized jointly with the conventional speaker classification loss during training; in this case I used the softmax cross-entropy loss as the speaker classification loss. So the first term is the speaker classification loss, which is softmax cross-entropy, and the second and third terms are the global and local MINE objectives. You can look at this figure to understand the architecture.
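Written out, the joint objective has roughly this form (the symbols and weight names are mine; the lambdas are the objective weights tuned later in the experiments):

\mathcal{L} \;=\; \mathcal{L}_{\mathrm{softmax}} \;-\; \lambda_{G}\, \hat{I}_{\mathrm{GIM}}(H; w) \;-\; \lambda_{L}\, \hat{I}_{\mathrm{LIM}}(h_t; w)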
"'kay" this is so experiment is settings
so i used a
most commonly is dataset me too so the
one and two
so the input features was the
thirty dimensional mfcc extracted read
twenty five milisecond hamming window with a
and there is signal should it's shift
so
during the training each buttress was trying to
to point five second segment well which was a to make
input batch be dull
fixed dimension
so mean and variance normalization was applied to it is extracted
mfccs and i use
no voice activity detection
or
automatic sinus they from or any kind so
i
data augmentation was not
so
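A rough sketch of that front-end, assuming librosa for the MFCC extraction (the talk does not show the actual extraction code):

import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    # 30-dimensional MFCCs, 25 ms Hamming window, 10 ms shift.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=30,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr),
                                window="hamming").T            # [frames, 30]
    # Per-utterance mean and variance normalization; no VAD, no augmentation.
    return (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)

def random_crop(mfcc, seconds=2.5, shift_ms=10):
    # Each training utterance is cropped to a fixed-length segment so that the
    # batch has a fixed time dimension.
    frames = int(seconds * 1000 / shift_ms)
    start = np.random.randint(0, max(1, mfcc.shape[0] - frames + 1))
    return mfcc[start:start + frames]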
the what the competition is like this
like this
so point of pulling there are used a tentative
that is pulling so each is most commonly used one
and the
that image and no was a too large for magnitude because the last frame the
minute took output is so one hundred thousand
five hundred thirty six
dimension
so i
lda the addition of dimension
you just on that works
to make
the automation
lower
These are the training details. The batch size was 128, and to make the input for the MINE network, the segment-level feature was concatenated to the frame-level features along the feature dimension. The initial learning rate was 10^-3, and it was exponentially decayed at every epoch until the final learning rate of 10^-5. The whole neural network implementation was done using TensorFlow.
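A sketch of that schedule in TensorFlow (the decay rate, steps per epoch, and the choice of Adam are placeholders; the recording does not make the exact optimizer clear):

import tensorflow as tf

steps_per_epoch = 1000           # illustrative value
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # starts at 10^-3
    decay_steps=steps_per_epoch, # decay once per epoch
    decay_rate=0.9,              # illustrative rate; run until it reaches ~10^-5
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)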
For the back-end scoring metric, I used cosine similarity and PLDA. When using cosine similarity, the output of the last hidden layer was used as the speaker embedding, since its performance was higher in that case, while when using PLDA, the output of the second-to-last hidden layer was used. Before the PLDA training, LDA was applied to reduce the speaker embedding dimension to 200, followed by length normalization and whitening.
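A small sketch of those back-end steps, assuming scikit-learn for the LDA (the PLDA training and the whitening themselves are not shown here):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_lda(embeddings, speaker_labels, dim=200):
    # Reduce the speaker embeddings to 200 dimensions before PLDA training.
    lda = LinearDiscriminantAnalysis(n_components=dim)
    return lda.fit(embeddings, speaker_labels)

def length_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_score(emb_a, emb_b):
    # Cosine similarity back-end used as the alternative to PLDA scoring.
    a, b = length_normalize(emb_a), length_normalize(emb_b)
    return float(np.dot(a, b))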
Okay, these are the experimental results. The first results are for the cases using GIM only and using LIM only. The left table is for the GIM-only case. The best performance there, with PLDA, was five point [unclear] four percent EER, while the x-vector baseline system was at five point [unclear] percent, so it showed better performance than the baseline system.
The right table is for the LIM-only case. The best performance, with PLDA, was five point one [unclear] percent, but in the baseline system it was five point six six percent, so it showed better performance than the baseline system.
These are the results of the whole IPP, with both objectives. I tried various hyperparameter cases, tuning the weights for the GIM and LIM objectives, and it beat the baseline in many cases. The best case was when the weight for GIM was 0.01 and the weight for LIM was 0.1. This shows the results when using cosine similarity: it was at 6.14 percent EER, which showed better performance than the baseline, which was six point seven [unclear] percent in the x-vector baseline system.
Okay, so I took the best hyperparameter settings I found and then applied them to the VoxCeleb2 dataset. I trained the system with VoxCeleb2 and evaluated it on the same test set as before, which was the VoxCeleb1 test set. The performance was much better: in the best case, using PLDA, it was 3.09 percent EER, which showed better performance than the baseline, which was at three point six two percent, so roughly a 20 percent relative improvement. Overall, the performance was better in terms of the EER.
So, using this scheme, the new methods showed better performance in every case, which shows that MINE is very helpful for learning the right features, that is, for preserving more speaker-relevant information, when training the speaker embedding system.
In our future research, we will experiment more with other pooling methods, besides the statistics pooling which I used here, and it may also be possible to combine the proposed method with other losses.
Thank you for listening to my presentation. If you have any questions, you can just email me.