0:00:14 Hello everyone. Our paper's title is "Combined Vector Based on the Factorized Time Delay Neural Network for Text-Independent Speaker Recognition". The title is a bit long, so let me just start speaking.
0:00:33 Currently, the most effective text-independent speaker recognition method has turned out to be extracting speaker embeddings. The x-vector extractor based on the time delay neural network has been demonstrated to be among the best-performing systems in recent NIST SRE evaluations.
0:00:55 A speech signal consists of content, speaker, emotion, channel, and noise information, and so on, among which, for speaker recognition, the speech content is not the main information. Generally, different verification tasks focus only on their own type of target information and ignore the influence of the other information. However, the fact is that the different components share some common information and cannot be completely separated.
0:01:31 Based on this, some multi-task learning methods have been proposed, in which some hidden layers are shared between the networks of the different tasks.
0:01:46 In previous work, we proposed the c-vector, which combines the x-vector with phonetic information, and the performance can be further improved by introducing the phonetic information. But one of the limitations of the c-vector is that it only consists of simple TDNN networks. So in this paper, we introduce factorized layers into the c-vector and propose an extended network called the ft-vector.
0:02:20 Speaker embedding has been the mainstream speaker recognition method. At this stage, the input layer takes the frame-level acoustic features of the speech, which then pass through several layers of the time delay architecture. The frame-level information is then aggregated in a statistics pooling layer: the mean and the standard deviation of each dimension are calculated and concatenated, converting the frame-level information into segment-level information. After the whole network is trained, the output of the hidden layer after the statistics pooling layer is extracted as the speaker embedding, and a PLDA backend is used to calculate the score.
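The statistics pooling step described above can be sketched as follows (a minimal numpy sketch; the feature dimensions are illustrative, not the sizes used in the paper):

```python
import numpy as np

def statistics_pooling(frames: np.ndarray) -> np.ndarray:
    """Map a (T, D) sequence of frame-level TDNN activations to a
    single (2*D,) segment-level vector by concatenating the
    per-dimension mean and standard deviation."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# 100 frames of 512-dimensional frame-level features become one
# 1024-dimensional segment-level statistics vector.
frames = np.random.randn(100, 512)
segment = statistics_pooling(frames)
print(segment.shape)  # (1024,)
```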
0:03:12 The factorized TDNN, or FTDNN, is a variant that factorizes the weight matrix between TDNN layers: the weight matrix is factorized into the product of two factor matrices, one of which is constrained to be semi-orthogonal. This method uses fewer network parameters while maintaining stable network training, and obtains good results. The effectiveness of this architecture has been confirmed in the NIST SRE18 evaluation.
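The factorization idea can be sketched like this: one weight matrix becomes the product of a free factor and a semi-orthogonal factor, roughly halving the parameter count at this bottleneck size. The sizes are illustrative, and the SVD projection below only approximates the periodic constraint update used in real FTDNN training:

```python
import numpy as np

def nearest_semi_orthogonal(m: np.ndarray) -> np.ndarray:
    """Project a (rank, dim) factor onto the nearest semi-orthogonal
    matrix (M @ M.T = I) via its SVD."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
dim, rank = 512, 128
# One 512x512 weight (262144 parameters) is replaced by the product
# of a constrained 128x512 factor and a free 512x128 factor
# (2 * 512 * 128 = 131072 parameters).
A = nearest_semi_orthogonal(rng.standard_normal((rank, dim)))
B = rng.standard_normal((dim, rank))
W = B @ A  # effective layer weight, shape (512, 512)
```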
0:03:56 Although the x-vector network performs speaker detection at the segment level, the phonetic information is ignored in this process. The x-vector network is trained with speaker labels, while an ASR network always outputs phonetic information at the frame level. The phonetic adaptation method incorporates such phonetic information into the x-vector network.
0:04:31 First, an ASR model with a bottleneck layer is pre-trained, and the outputs of the bottleneck layer are fed as auxiliary features into the x-vector network.
0:04:46 The multi-task learning method, on the other hand, combines the x-vector network with the ASR network, so that the two networks share a part of the frame-level layers, and the training process alternately updates the two parts of the combined network. The speaker embedding part of the combined network can learn more about the common information shared by the speaker features and the phonetic content, and the recognition performance is improved by this method.
0:05:26 The phonetic adaptation and the multi-task learning correspond to two aspects of the speech information: the former tries to directly supervise the network with the extracted phonetic content, and the latter tries to learn more detailed information from the phonetic content. The c-vector network combines these two methods in an attempt to more effectively learn the shared part of the phonetic information and the speaker information.
0:06:05 Similar to the phonetic adaptation method, the ASR network is pre-trained first, and the features extracted from its bottleneck layer are fed into the speaker embedding part of the multi-task learning network. During the multi-task learning network training, the pre-trained ASR network is no longer updated. After that, the two parts of the multi-task learning network are trained alternately, and the embedding is extracted from the hidden layer behind the pooling layer of the speaker embedding part.
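The alternating schedule can be illustrated with a toy optimization problem in which one shared parameter stands in for the shared frame-level layers and each branch owns a private parameter; each turn updates the shared parameter together with one branch. All losses, targets, and rates below are made up for illustration:

```python
def alternate_train(steps=200, lr=0.1):
    """Alternate gradient steps between two tasks that share one
    parameter (w_shared); each task also updates its own parameter.
    The quadratic losses are toy stand-ins for the real branch losses."""
    w_shared, w_spk, w_pho = 5.0, 5.0, 5.0
    for step in range(steps):
        if step % 2 == 0:  # speaker-branch turn
            # L_spk = (w_shared - 1)^2 + (w_spk - 2)^2
            w_shared -= lr * 2 * (w_shared - 1.0)
            w_spk    -= lr * 2 * (w_spk - 2.0)
        else:              # phonetic-branch turn
            # L_pho = (w_shared - 1)^2 + (w_pho - 3)^2
            w_shared -= lr * 2 * (w_shared - 1.0)
            w_pho    -= lr * 2 * (w_pho - 3.0)
    return w_shared, w_spk, w_pho

print(alternate_train())  # each parameter converges to its optimum
```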
0:06:55 Many experiments have shown that deeper network architectures improve the performance. So the extended factorized TDNN network architecture, which is extended from the FTDNN, is studied; we used this architecture in the NIST SRE19 evaluation. While greatly deepening the network architecture, the number of network parameters is controlled within a certain range, and the performance is significantly improved.
0:07:38 Based on this well-performing network, the impact of the phonetic information on speaker recognition is studied: we introduce it into the c-vector, and the result is called the ft-vector.
0:07:58 The factorized layers we include here are a little different from the standard TDNN layers: the convolution calculation within the layer is split into two factors, and the input of each factorized layer is formed by scaling and adding the outputs of several preceding layers, giving skip connections similar to ResNet.
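A rough sketch of such a layer, assuming a factorized weight plus scaled skip connections; the scale value, ReLU placement, and layer sizes are illustrative guesses, not the exact eFTDNN block:

```python
import numpy as np

def factorized_layer(x, A, B, skip_outputs=(), scale=0.66):
    """One factorized layer with ResNet-style skip connections: the
    weight is the product of two factors, and scaled outputs of
    earlier layers are added to the result."""
    y = np.maximum(0.0, x @ (B @ A))   # factorized weight, then ReLU
    for prev in skip_outputs:
        y = y + scale * prev            # skip connection
    return y

T, dim, rank = 50, 512, 128
x = np.random.randn(T, dim)
A = np.random.randn(rank, dim)   # (rank, dim) factor
B = np.random.randn(dim, rank)   # (dim, rank) factor
h1 = factorized_layer(x, A, B)
h2 = factorized_layer(h1, A, B, skip_outputs=[h1])
print(h2.shape)  # (50, 512)
```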
0:08:32 We use the eFTDNN to replace the part used to extract the embedding in the c-vector network, that is, the part that performs the phonetic adaptation, and the embedding is extracted in the same way as before. At the same time, a simplified eFTDNN network without the pooling layer is used to replace the shared part of the multi-task learning in the c-vector. Furthermore, the first few layers of the two streams share parameters.
0:09:21 For the experiments, training is performed according to the requirements of the fixed training condition of NIST SRE18. It should be noted that our training data doesn't include the VoxCeleb 1 and 2 datasets. The Fisher dataset provides phoneme labels, so we use this data to train the ASR part, and the remaining datasets are used to train the embedding neural network and the back end. The MUSAN and RIRS noise datasets are used as noise sources to augment the training data, and the amount of training data is doubled.
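Additive augmentation of this kind is usually implemented by mixing noise into speech at a chosen signal-to-noise ratio; a minimal sketch (function and parameter names are mine, not from the paper):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add a noise signal to speech at a target SNR in dB by scaling
    the noise relative to the speech power."""
    noise = np.resize(noise, speech.shape)          # tile/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12           # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Each clean utterance then yields a second, noisy copy, which is how the training data amount is doubled.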
0:10:15 The test sets used are the development and evaluation datasets of the NIST SRE18 CTS task. The input features of the network are MFCCs.
0:10:30 The ASR network is trained using English datasets, that is, Switchboard and Fisher, so it doesn't match well with the language of the SRE18 dataset. The transcriptions are force-aligned by a GMM-HMM model to generate the frame-level phonetic labels. The extractor and the pre-trained ASR network are trained on this data and have the same settings as before.
0:11:13 The above experiments all use the same back-end processing. After the embeddings are extracted, a 200-dimensional LDA and a PLDA are trained as the classifier. Due to the domain mismatch, a common optimization method is to use the SRE18 unlabeled data to perform PLDA adaptation. But we use another method to get better results: we apply unsupervised clustering to the SRE18 unlabeled data and use the clustered data to train the PLDA, and then the PLDA is used for scoring.
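The clustering step can be sketched with a minimal k-means that assigns pseudo-speaker labels to the unlabeled embeddings, which can then serve as class labels for PLDA training. The algorithm and cluster count here are stand-ins; the talk does not specify which clustering method was used:

```python
import numpy as np

def kmeans_labels(x: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Cluster embeddings x of shape (N, D) into k groups with plain
    k-means and return the pseudo-labels for each embedding."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest center
        labels = np.argmin(((x[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # move each center to the mean of its members
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels
```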
0:12:01 The results of the ft-vector are better than those of the baseline systems in all the conditions, and the ft-vector with only the first layer shared gives the best performance. The overall effect of the ft-vector decreases as the number of shared layers increases; we believe this is partly due to the language mismatch: the training data for the ASR part is spoken English, while the test dataset is spoken Tunisian Arabic. But from the results, even in this case, the extracted phonetic information can still improve the effect of speaker recognition.
0:12:59 That's all. Thank you.