Hello everyone. Our paper is titled "C-vector based on a factorized time delay neural network for text-independent speaker recognition". The title is a bit long, so let me just begin.
Currently, the most effective text-independent speaker recognition approach is to extract speaker embeddings with a neural network. The x-vector, which extracts embeddings with a time delay neural network (TDNN), has been demonstrated to give among the best performance in recent NIST SRE evaluations.
A speech signal consists of content, speaker, emotion, channel and noise information. For speaker verification the speaker information is the main information, while for speech recognition the content is. Generally, different recognition tasks focus on their own type of target information and ignore the influence of all the other information. However, the fact is that these different components share some common information and cannot be completely separated. Based on this, some multitask learning methods have been proposed, in which hidden layers are shared between the networks of different tasks.
In previous work, we proposed the c-vector, which combines the x-vector and a phonetic vector, and verified that the performance can be further improved by introducing phonetic information. But one of the limitations of the c-vector is that it only uses a simple TDNN network. So in this paper we introduce factorized TDNN layers into the c-vector and propose an extended network called the ft-vector.
Speaker embedding is the mainstream speaker recognition method at this stage. The input of the network is the frame-level acoustic features of the speech. After passing through several layers of time delay architecture, the frame-level information is aggregated in a statistics pooling layer: the mean and standard deviation of each dimension are calculated and concatenated into a segment-level representation. After training the whole network, the output of the layer after the statistics pooling layer is extracted as the speaker embedding, and LDA and PLDA are then used to calculate the score.
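As a concrete illustration, the statistics pooling step described above can be sketched in a few lines of NumPy (a minimal sketch; the 512-dimensional frame-level activations are just an assumed example, not the configuration from our paper):

```python
import numpy as np

def statistics_pooling(frame_features):
    """Aggregate frame-level features (T x D) into one segment-level
    vector of size 2*D: per-dimension mean concatenated with
    per-dimension standard deviation."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

# Example: 300 frames of (hypothetical) 512-dim frame-level activations.
frames = np.random.randn(300, 512)
segment_vector = statistics_pooling(frames)  # shape: (1024,)
```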
Next is the factorized TDNN. The F-TDNN factorizes the weight matrix between TDNN layers into the product of two factor matrices, one of which is constrained to be semi-orthogonal. This method reduces the number of network parameters while maintaining stable network training, and obtains good results. The effectiveness of this architecture has been confirmed in the NIST SRE18 evaluation.
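The parameter saving and the semi-orthogonal constraint can be illustrated with a small NumPy sketch (illustrative sizes only; the constraint is enforced here by plain gradient steps toward M Mᵀ = I, not the exact update schedule used in Kaldi's F-TDNN recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 512x512 TDNN weight factorized through a 128-dim bottleneck:
d, r = 512, 128
full_params = d * d                # 262144 parameters
factored_params = d * r + r * d    # 131072 parameters: a 2x reduction

# Keep the first factor M (r x d) semi-orthogonal, i.e. M M^T = I,
# by repeated gradient steps that shrink ||M M^T - I||.
M = 0.1 * rng.standard_normal((r, d))
alpha = 0.05
for _ in range(1000):
    P = M @ M.T
    M -= alpha * (P - np.eye(r)) @ M

# M is now (numerically) semi-orthogonal.
err = np.max(np.abs(M @ M.T - np.eye(r)))
```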
Although the x-vector network performs speaker detection at the segment level, the phonetic information is ignored in this process. Moreover, the x-vector network is trained with speaker labels, while an ASR network outputs phonetic information at the frame level.
In the phonetic adaptation method, phonetic information is incorporated into the x-vector network: first an ASR model with a bottleneck layer is trained, and then the output of the bottleneck layer is fed as an auxiliary vector into the x-vector network.
In the multitask learning method, the x-vector network is combined with an ASR network so that the two networks share a part of the frame-level layers, and the training process alternates between the two parts of the combined network. The speaker embedding part of the combined network can thus learn more about the common information shared by the speaker features and the phonetic content, while the ASR part supervises the shared layers with phonetic labels.
In this way, phonetic adaptation and multitask learning correspond to two aspects of the speech information: the former directly provides the network with extracted phonetic content, and the latter trains the network to learn more detailed information from the phonetic content. The c-vector network combines these two methods in an attempt to more effectively learn the shared part of the phonetic information and the speaker information.
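A minimal sketch of the shared-frame-level-layers idea in pure NumPy (the layer sizes are made up for illustration; a real system uses several TDNN layers with temporal context rather than a single linear layer):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# One shared frame-level layer, plus one task-specific head per task.
W_shared = rng.standard_normal((40, 256)) * 0.05   # acoustic dim 40 -> 256
W_spk    = rng.standard_normal((256, 128)) * 0.05  # speaker branch
W_asr    = rng.standard_normal((256, 64)) * 0.05   # phonetic branch

frames = rng.standard_normal((200, 40))            # 200 frames of features
h = relu(frames @ W_shared)                        # shared representation

spk_out = h @ W_spk   # frame-level input to the speaker-embedding branch
asr_out = h @ W_asr   # frame-level phonetic predictions
```

Both branches read the same shared representation `h`, which is what lets the speaker branch benefit from phonetically-informed features.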
Similar to the phonetic adaptation method, the ASR network is pre-trained first. Features extracted from its bottleneck layer are fed to the speaker embedding part, which performs multitask learning. During the training of the hybrid multitask learning network, the pre-trained ASR network is no longer updated. After that, the two parts of the hybrid multitask learning network are trained alternately, and the embedding is extracted from the layer behind the statistics pooling layer of the speaker embedding part.
Many experiments have shown that deeper network architectures improve the performance. The factorized TDNN is such an architecture, extended from the TDNN, and we used it in the NIST SRE19 evaluation. While greatly deepening the network, it keeps the number of parameters within a controlled range, and the performance is significantly improved. To combine the good performance of the F-TDNN with the positive impact of phonetic information on speaker recognition, we introduce the F-TDNN into the c-vector and call the result the ft-vector.
The way we include the factorized layers is a little different from the standard F-TDNN. The connection method within the network is changed: the input of a factorized layer is concatenated with the outputs of the preceding layers through skip connections, similar to ResNet.
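A sketch of this concatenation-style skip connection over factorized layers (illustrative NumPy with made-up dimensions; real layers also splice temporal context):

```python
import numpy as np

rng = np.random.default_rng(2)

def factorized_layer(x, W1, W2):
    """One factorized layer: project through a low-rank bottleneck (W1),
    then back up to the full dimension (W2), with a ReLU."""
    return np.maximum(x @ W1 @ W2, 0.0)

d, r, T = 256, 64, 100
x0 = rng.standard_normal((T, d))

# Layer 1 sees only x0; layer 2 sees x0 and layer 1's output
# concatenated -- a ResNet-style skip connection via concatenation.
W1a = rng.standard_normal((d, r)) * 0.1
W1b = rng.standard_normal((r, d)) * 0.1
x1 = factorized_layer(x0, W1a, W1b)

W2a = rng.standard_normal((2 * d, r)) * 0.1
W2b = rng.standard_normal((r, d)) * 0.1
x2 = factorized_layer(np.concatenate([x0, x1], axis=1), W2a, W2b)
```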
First, we use the F-TDNN to replace the TDNN used to extract the embedding in the c-vector network, and we also replace the part that extracts the phonetic adaptation vector. At the same time, we modify an F-TDNN network with a pooling layer to replace the ASR part of the multitask learning c-vector. Furthermore, the first few layers of the two streams can be shared.
For the experiments, everything is performed according to the requirements of the fixed training condition of NIST SRE18. It should be noted that our training data therefore does not include the VoxCeleb and SITW datasets. The Fisher dataset contains phoneme labels, so we use it to train the ASR task, and the remaining datasets are used to train the neural network and the backend. The MUSAN and RIRS noise datasets are used as noise sources to augment the training data, and the amount of training data is doubled.
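The additive-noise augmentation step can be sketched as follows (a generic SNR-based mixing sketch, not the exact augmentation recipe used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(3)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals
    `snr_db` decibels, then mix it into the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech = rng.standard_normal(16000)   # 1 s of fake 16 kHz audio
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(speech, noise, snr_db=10)
```

Each clean utterance gets a noisy copy, which is how the training set is doubled.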
The test sets are the development and evaluation sets of the NIST SRE18 CTS task.
The input features of the network are MFCCs. Since the ASR network is trained using the English datasets Switchboard and Fisher, it does not match well with the language of the SRE18 data. The transcriptions are force-aligned by a GMM-HMM system to generate the phonetic labels.
The extractor and the pre-trained ASR network are trained on this data, and the ASR part has the same structure as the extractor.
All of the above experiments use the same backend processing. After the embeddings are extracted, a 200-dimensional LDA and a PLDA model are trained. Due to the domain mismatch, a common optimization method is to use the SRE18 unlabeled data to perform PLDA adaptation. But we use another method to get better results: unsupervised clustering is first applied to the SRE18 unlabeled data, the clustered data is used to adapt the PLDA, and the adapted PLDA is then used for scoring.
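A common form of unsupervised PLDA domain adaptation is to interpolate the out-of-domain model covariances with covariances estimated on the (clustered) in-domain data. A minimal NumPy sketch of that interpolation (the weight 0.5 and all sizes are arbitrary examples, and this is not necessarily the exact recipe from our paper):

```python
import numpy as np

rng = np.random.default_rng(4)

def adapt_covariance(cov_out, indomain_embeddings, alpha=0.5):
    """Interpolate an out-of-domain covariance with the covariance of
    in-domain embeddings: alpha * in-domain + (1 - alpha) * out-of-domain."""
    cov_in = np.cov(indomain_embeddings, rowvar=False)
    return alpha * cov_in + (1 - alpha) * cov_out

d = 16
cov_out = np.eye(d)                             # out-of-domain model covariance
indomain = rng.standard_normal((500, d)) * 2.0  # unlabeled in-domain embeddings
cov_adapted = adapt_covariance(cov_out, indomain)
```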
The results show that the ft-vector performs better than the baseline systems, and the ft-vector with only the first layer shared gives the best performance. The overall performance of the ft-vectors decreases as the number of shared layers increases; we believe this is partly due to language mismatch.
The training data for the ASR part is spoken English, while the test data is spoken Tunisian Arabic. But the results show that even in this case, the extracted phonetic information can still improve the performance of speaker recognition. That's all, thank you.