Hello everyone. The title of our paper is "C-vector Based on Factorized Time Delay Neural Network for Text-Independent Speaker Recognition". The title is a bit long, so let me get started.

Currently, the most effective text-independent speaker recognition method is to extract speaker embeddings with deep neural networks. Among these, the x-vector network has been demonstrated to deliver among the best performance in recent NIST SRE evaluations.

A speech signal consists of many components: linguistic content, speaker characteristics, emotion, channel and noise information, and so on. For speech recognition, for example, the speech content is the main information. Generally, different recognition tasks focus on a different type of target information and ignore the influence of all the other information. However, the fact is that these different components share some common information and cannot be completely separated.

Based on this, some multitask learning methods have been proposed, in which hidden layers are shared between the networks for different tasks.

In previous work, we proposed the c-vector, which combines the x-vector with phonetic information, and showed that the performance can be further improved by introducing phonetic information. But one of the weaknesses of the c-vector is that it is only built on a simple TDNN network.

So in this paper, we introduce factorized TDNN layers into the c-vector and propose an extended network called the FT-vector.

Speaker embedding has become the mainstream speaker recognition method at this stage. The input of the network is the frame-level acoustic features of the speech. These features pass through several layers of a time-delay architecture, and the frame-level information is then aggregated in a statistics pooling layer: the mean and standard deviation of each dimension are calculated, converting the frame-level information into segment-level information. After the whole network is trained, the output of the hidden layer after the statistics pooling layer is extracted as the speaker embedding, and LDA and PLDA are then used to calculate the score.
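
To make this pipeline concrete, here is a minimal PyTorch sketch of an x-vector-style network with statistics pooling; the layer sizes, dilations, and the 23-dimensional input are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Aggregate frame-level features into one segment-level vector by
    concatenating the per-dimension mean and standard deviation."""
    def forward(self, x):                # x: (batch, channels, frames)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

class XVectorNet(nn.Module):
    """Minimal x-vector-style network: dilated 1-D conv (TDNN) frame-level
    layers, statistics pooling, then segment-level layers."""
    def __init__(self, feat_dim=23, num_speakers=1000, emb_dim=512):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.pool = StatsPooling()
        self.embedding = nn.Linear(2 * 1500, emb_dim)   # hidden layer after pooling
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(emb_dim, num_speakers))

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        x = self.frame_layers(feats.transpose(1, 2))
        emb = self.embedding(self.pool(x))   # extracted as the speaker embedding
        return self.classifier(emb), emb
```

During training the classifier output is used with the speaker labels; at test time only the embedding is kept and passed to the LDA/PLDA back end.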

The factorized TDNN, or F-TDNN, is constructed by factorizing the weight matrix between TDNN layers: the weight matrix is factorized into a product of two factor matrices, with one factor constrained to be semi-orthogonal. This method uses fewer network parameters while keeping the network training stable, and it obtains good results. Its effectiveness has been confirmed in the NIST SRE18 evaluation.
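
To show what the factorization looks like, here is a minimal sketch of one factorized layer in the spirit of Povey et al.'s F-TDNN; the bottleneck width and the simple fixed-step semi-orthogonal update are assumptions, not the exact Kaldi recipe.

```python
import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """One TDNN weight matrix replaced by a product of two smaller
    factors through a low-rank bottleneck, cutting the parameter count."""
    def __init__(self, in_dim=512, bottleneck=128, out_dim=512, kernel=3):
        super().__init__()
        self.factor1 = nn.Conv1d(in_dim, bottleneck, kernel)  # constrained factor
        self.factor2 = nn.Conv1d(bottleneck, out_dim, 1)      # free factor
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, in_dim, frames)
        return self.act(self.factor2(self.factor1(x)))

    def semi_orthogonal_step(self, alpha=0.125):
        """Push factor1 toward M M^T = I: one gradient step on the
        penalty ||M M^T - I||_F^2, to be called periodically in training."""
        with torch.no_grad():
            w = self.factor1.weight
            m = w.reshape(w.size(0), -1).clone()   # (bottleneck, in*kernel)
            err = m @ m.t() - torch.eye(m.size(0), device=m.device)
            m -= 4.0 * alpha * err @ m             # gradient of the penalty
            w.copy_(m.reshape_as(w))
```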

Although the x-vector network performs speaker classification at the segment level, some frame-level information is ignored in this process. The x-vector network works with speaker labels, while an ASR network works with phonetic labels and always keeps information at the frame level.

The phonetic adaptation method incorporates phonetic information into the x-vector network: first, an ASR model with a bottleneck layer is trained, and then the output of the bottleneck layer is fed as an auxiliary vector into the x-vector network.
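
A minimal sketch of that adaptation step, assuming 23-dimensional MFCCs and a 64-dimensional bottleneck (both sizes are just for illustration):

```python
import torch

batch, frames = 8, 300
mfcc = torch.randn(batch, frames, 23)            # frame-level acoustic features
asr_bottleneck = torch.randn(batch, frames, 64)  # pre-trained ASR bottleneck output

# The auxiliary phonetic vector is appended to every frame of the
# x-vector input before the TDNN layers.
xvector_input = torch.cat([mfcc, asr_bottleneck], dim=-1)
print(xvector_input.shape)                       # torch.Size([8, 300, 87])
```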

The phonetic multitask learning method combines the x-vector network with an ASR network, so that the two networks share a part of the frame-level layers, and the training process alternately updates the two parts of the combined network. In this way, both the speaker embedding part and the recognition part of the combined network can learn more about the common information shared by speaker features and phonetic content.

In this manner, the phonetic adaptation method and the multitask learning method correspond to two aspects of the phonetic information in speech: the former provides supervised features extracted from the phonetic content, and the latter tries to learn more detailed information from the phonetic content.

The c-vector network combines these two methods in an attempt to learn the shared part of phonetic information and speaker information more effectively.

Similar to the phonetic adaptation method, the ASR network is pre-trained first. Features are extracted from its bottleneck layer and fed into the speaker embedding part of the hybrid multitask learning network. During hybrid multitask learning network training, the pre-trained ASR network is no longer updated. After that, the two parts of the hybrid multitask learning network are updated alternately during training, and the embedding is extracted from the hidden layer behind the pooling layer of the speaker embedding part, roughly as in the following sketch.
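
A minimal sketch of that alternating schedule, assuming a hypothetical `combined` module that exposes `speaker_loss()` and `phonetic_loss()` over shared frame-level layers; the strict 1:1 alternation is also an assumption:

```python
def train_hybrid(combined, spk_batches, phn_batches, optim_spk, optim_phn):
    """Alternately update the two parts of the hybrid multitask network.
    optim_spk covers the speaker-embedding part (plus shared layers);
    optim_phn covers the phonetic-recognition part (plus shared layers)."""
    for spk_batch, phn_batch in zip(spk_batches, phn_batches):
        # Step 1: update the speaker-embedding part.
        optim_spk.zero_grad()
        combined.speaker_loss(*spk_batch).backward()
        optim_spk.step()

        # Step 2: update the phonetic-recognition part.
        optim_phn.zero_grad()
        combined.phonetic_loss(*phn_batch).backward()
        optim_phn.step()
```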

Many experiments have shown that deeper network architectures improve the performance, so we also consider the extended factorized TDNN architecture, which is extended from the TDNN x-vector. We used this architecture in the NIST SRE19 evaluation.

While greatly deepening the network architecture, it keeps the number of network parameters controlled within a certain range, and the performance is significantly improved. To combine this well-performing network with the study of the impact of phonetic information on speaker recognition, we introduce the F-TDNN into the c-vector, and call the result the FT-vector.

The way we include the factorized layers is a little different from the standard TDNN model. Inside a factorized layer, the input is multiplied by the two factor matrices in turn, and skip connections are added between the factorized layers, similar to ResNet.

We use the F-TDNN to replace the part that extracts the embedding in the c-vector network, and we likewise replace the part that extracts the phonetic bottleneck vector with an F-TDNN; the embedding is extracted in exactly the same way as before. At the same time, a simplified F-TDNN network without a pooling layer is used to replace the ASR part of the multitask learning in the c-vector. Furthermore, the first few layers of the two ASR streams are shared, as in the sketch below.
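
A schematic sketch of that combined layout, with the embedding stream and the simplified frame-level ASR stream sharing their first layer; every size here is an illustrative assumption:

```python
import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    """Embedding stream (with statistics pooling) and simplified ASR
    stream (frame-level only, no pooling) over shared first layers."""
    def __init__(self, in_dim=23, hidden=512, n_spk=1000, n_phones=40):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv1d(in_dim, hidden, 5), nn.ReLU())
        self.spk_stream = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, dilation=2), nn.ReLU())
        self.spk_head = nn.Linear(2 * hidden, n_spk)
        self.phn_stream = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, dilation=2), nn.ReLU())
        self.phn_head = nn.Conv1d(hidden, n_phones, 1)  # frame-level logits

    def forward(self, x):                       # x: (batch, in_dim, frames)
        h = self.shared(x)
        s = self.spk_stream(h)
        stats = torch.cat([s.mean(dim=2), s.std(dim=2)], dim=1)
        return self.spk_head(stats), self.phn_head(self.phn_stream(h))
```

In the full networks, each plain convolution here would be a factorized layer as sketched earlier.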

The experiments are performed according to the requirements of the fixed training condition of NIST SRE18. It should be noted that our training data does not include the VoxCeleb and SITW datasets.

The Fisher dataset comes with phonetic labels, so we use this data to train the additional ASR task, and the remaining datasets are used to train the neural network and the back end. The MUSAN and RIR noise datasets are used as noise sources to augment the training data, and the amount of training data is doubled.
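
A minimal sketch of that style of additive-noise augmentation, producing one noisy copy per clean utterance; the SNR value and the plain tiling of the noise clip are assumptions:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix a noise clip into a clean waveform at a target SNR."""
    noise = np.resize(noise, clean.shape)            # tile/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.random.randn(16000)                       # stand-in for one utterance
noise = np.random.randn(8000)                        # stand-in for a MUSAN clip
augmented = add_noise(clean, noise, snr_db=10.0)     # the extra training copy
```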

The test sets are the development and evaluation datasets of the NIST SRE18 CTS task. The input features of the network are 23-dimensional MFCCs.

The ASR network is trained using English datasets, Switchboard and Fisher, which does not match well with the language of the SRE18 dataset. The transcriptions are force-aligned by a GMM-HMM system to obtain frame-level phonetic labels.

The extractor and the pre-trained ASR network are trained on this data, with the same parameters as described above.

The above experiments all use the same back-end processing. After the embeddings are extracted, a 200-dimensional LDA and a PLDA are trained for scoring. Due to domain mismatch, a common method is to use the SRE18 unlabeled data to realize PLDA adaptation, but we use another method to get better results.

That is, we apply unsupervised clustering to the SRE18 unlabeled data and use the clustered data to train the PLDA; the adapted PLDA is then used for scoring.
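
A minimal sketch of that clustering step, assigning pseudo-speaker labels to the unlabeled embeddings; the clustering algorithm, the distance threshold, and the `adapt_plda` helper are illustrative assumptions, not our exact recipe:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

unlabeled = np.random.randn(2000, 200)   # stand-in for LDA-reduced embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=50.0, linkage="average")
pseudo_labels = clusterer.fit_predict(unlabeled)

# plda = adapt_plda(plda, unlabeled, pseudo_labels)   # hypothetical helper
print(f"found {pseudo_labels.max() + 1} pseudo speakers")
```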

The results show that the FT-vector is better than the baseline systems, and the FT-vector with the shared first layer achieves the best performance. The overall performance of the FT-vector decreases as the number of shared layers increases, which we believe is partly due to language mismatch.

The training data for the ASR part is spoken English, while the test dataset is spoken in Tunisian Arabic. But from the results, even in this case, the extracted phonetic information can still improve the performance of speaker recognition.

That's all. Thank you.