0:00:14 | Hello everyone. |
0:00:19 | Our paper presents a combined embedding based on the factorized time delay neural network for text-independent speaker recognition. |
0:00:28 | I will now present it. |
0:00:33 | Currently, the most effective text-independent speaker recognition approach has proven to be extracting speaker embeddings. |
0:00:41 | Among the embedding extractors, the x-vector network based on the time delay neural network has been demonstrated to deliver the best performance in recent NIST SRE evaluations. |
0:00:55 | A speech signal carries content, speaker, emotion, channel, and noise information, among which the speaker and the speech content are the main information. |
0:01:10 | Generally, different verification tasks focus on different types of target information and ignore the influence of the other information. |
0:01:21 | However, the fact is that the different speech components share some common information and cannot be completely separated. |
0:01:31 | Based on this, studies of multitask learning methods have been proposed, in which hidden layers are shared between the networks of the different tasks. |
0:01:46 | In previous work, a combined network, the c-vector, was proposed, and it showed that the performance can be further improved by introducing phonetic information. |
0:01:59 | But one of the shortcomings of the c-vector is that it only consists of simple TDNN layers. |
0:02:07 | So in this paper we introduce factorized TDNN layers into the c-vector and propose an extended network called the ft-vector. |
0:02:20 | Speaker embedding is the mainstream speaker recognition method at this stage. |
0:02:26 | The input of the network is the frame-level acoustic features of the speech. |
0:02:32 | The features pass through several layers of time-delay architecture, and the frame-level information is then aggregated in the statistics pooling layer. |
0:02:44 | The mean and the standard deviation of each dimension are calculated and concatenated in the statistics pooling layer as the segment-level information. |
0:02:55 | After the whole network is trained, the output of the hidden layer after the statistics pooling layer is extracted as the speaker embedding. |
0:03:06 | Then LDA and PLDA are applied to calculate the score. |
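The statistics pooling step just described can be written in a few lines; this is a minimal sketch (the frame count and the 512-dimensional layer width are illustrative assumptions, not values from the talk):

```python
import numpy as np

def stats_pooling(frame_feats):
    """Aggregate frame-level activations (T x D) into a single
    segment-level vector by concatenating mean and std (2*D)."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

# e.g. 200 frames of 512-dimensional frame-level activations
segment = stats_pooling(np.random.randn(200, 512))
```

The segment-level vector then feeds the fully connected layers from which the embedding is taken.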
0:03:12 | After that, the factorized TDNN, or F-TDNN, was proposed, which factorizes the weight matrix between the TDNN layers. |
0:03:24 | This method uses fewer network parameters while keeping the network training stable, and obtains good results. |
0:03:36 | The weight matrix is factorized into the product of two factor matrices, one of which is constrained to be semi-orthogonal. |
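As a rough sketch of this factorization, assuming illustrative sizes (a 512x512 weight replaced by two factors through a 128-dimensional linear bottleneck) and the iterative update commonly used to keep one factor near semi-orthogonality:

```python
import numpy as np

def semi_orthogonal_step(M, alpha=0.125):
    """One update nudging M (rows <= cols) toward M @ M.T = I,
    an iterative scheme of the kind used for factorized TDNNs."""
    err = M @ M.T - np.eye(M.shape[0])
    return M - alpha * err @ M

# W (512x512) is approximated by A @ B with far fewer parameters.
rng = np.random.default_rng(0)
A = rng.standard_normal((512, 128)) * 0.05
B = rng.standard_normal((128, 512)) * 0.05
for _ in range(50):                  # constrain only the first factor
    B = semi_orthogonal_step(B)
deviation = np.linalg.norm(B @ B.T - np.eye(128))
```

Here `A.size + B.size` is half the parameter count of the original square weight, which is the payoff the talk refers to.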
0:03:44 | The effectiveness of this structure has been confirmed in the NIST SRE 2018 evaluation. |
0:03:56 | Although the x-vector network performs speaker classification at the segment level, phonetic information is ignored in this process. |
0:04:07 | Unlike the x-vector network, which is trained with speaker labels, an ASR network always outputs phonetic information at the frame level. |
0:04:21 | The phonetic adaptation method incorporates the phonetic information into the x-vector network. |
0:04:31 | First, an ASR model with a bottleneck layer is trained, and the output of the bottleneck layer is used as an auxiliary vector input into the x-vector network. |
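The auxiliary input amounts to frame-aligned concatenation; a minimal sketch, where the 23-dimensional acoustic features and the 64-dimensional bottleneck width are my assumptions rather than figures from the talk:

```python
import numpy as np

def fuse_inputs(acoustic, bottleneck):
    """Concatenate frame-aligned acoustic features (T x 23) with
    ASR bottleneck features (T x 64) before the speaker network."""
    assert acoustic.shape[0] == bottleneck.shape[0]  # same frame count
    return np.concatenate([acoustic, bottleneck], axis=1)

fused = fuse_inputs(np.zeros((300, 23)), np.zeros((300, 64)))
```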
0:04:48 | Second, the multitask learning method combines the x-vector network with an ASR network, so that the two networks share a part of the frame-level layers, and the training process alternately updates the two parts of the combined network. |
0:05:06 | The speaker embedding part of the combined network can thereby learn more about the common information shared by the speaker features and the phonetic content, and speaker recognition benefits from this. |
0:05:26 | The phonetic adaptation and the multitask learning correspond to two aspects of the speech information. |
0:05:34 | The former tries to use a supervised network to extract the phonetic content, and the latter tries to learn more detailed information from the phonetic content. |
0:05:51 | The c-vector network combines these two methods in an attempt to learn the shared part of the phonetic information and the speaker information more effectively. |
0:06:05 | Similar to the phonetic adaptation method, the ASR network is pre-trained first. |
0:06:11 | Features are extracted from its bottleneck layer and input into the speaker embedding part of the hybrid multitask learning network. |
0:06:22 | During the training of the hybrid multitask learning network, the pre-trained ASR network is no longer updated. |
0:06:32 | After that, the two parts of the hybrid multitask learning network are updated alternately during training. |
0:06:41 | The embedding is extracted from the hidden layer behind the pooling layer of the speaker embedding part. |
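The alternating schedule can be sketched as a loop that switches which part of the combined network gets updated each epoch; the callback names here are illustrative placeholders, not functions from the authors' code:

```python
def train_alternating(num_epochs, speaker_step, phonetic_step):
    """Alternate one epoch of speaker-part updates with one epoch
    of phonetic-task updates; the pre-trained ASR front-end itself
    stays frozen inside both callbacks."""
    log = []
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            log.append(("speaker", speaker_step()))
        else:
            log.append(("phonetic", phonetic_step()))
    return log

schedule = train_alternating(4, lambda: "ok", lambda: "ok")
```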
0:06:55 | Many experiments have shown that deeper network architectures improve performance. |
0:07:04 | So we also extended the factorized TDNN network architecture, which is itself extended from the TDNN. |
0:07:14 | We used this extended architecture in the NIST SRE 2019 evaluation. |
0:07:20 | By greatly deepening the network architecture while controlling the number of network parameters to a suitable range, the performance achieved a significant improvement. |
0:07:38 | On this well-performing network, the impact of phonetic information on speaker recognition had not yet been studied. |
0:07:48 | So we introduced phonetic information into the c-vector in this setting, and the result is called the ft-vector. |
0:07:58 | The way we include the phonetic information is similar, but the factorized layers are a little different from the TDNN layers, so the combination method within the layers is also somewhat different. |
0:08:15 | The input of a factorized layer is likewise multiplied by its factor matrices, and the outputs are combined through skip connections between layers, similar to a residual network. |
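A rough sketch of one such layer; the dimensions, the ReLU placement, and additive skip form are my assumptions for illustration, not details given in the talk:

```python
import numpy as np

def f_tdnn_layer(x, A, B, skip=None):
    """Sketch of one factorized layer: a linear bottleneck through
    factor B, expansion through factor A with a ReLU, plus an
    optional skip connection from an earlier layer's output."""
    h = x @ B.T                    # T x 128 linear bottleneck
    y = np.maximum(h @ A.T, 0.0)   # T x 512, expand + ReLU
    return y if skip is None else y + skip

x = np.zeros((10, 512))
y = f_tdnn_layer(x, np.zeros((512, 128)), np.zeros((128, 512)),
                 skip=np.ones((10, 512)))
```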
0:08:32 | First, we use the F-TDNN to replace the TDNN that extracts the embedding in the c-vector network. |
0:08:44 | We also replace the part that accepts the phonetic bottleneck features with the F-TDNN. |
0:08:52 | The embedding is still extracted in the same way. |
0:08:58 | At the same time, we build a simplified F-TDNN network without the pooling layer to replace the ASR part of the multitask learning in the c-vector. |
0:09:10 | Furthermore, in some configurations the first few layers of the two parts are shared. |
0:09:21 | For the experiments, data selection is performed according to the requirements of the fixed training condition of NIST SRE 2018. |
0:09:32 | It should be noted that our training data does not include the VoxCeleb and SITW datasets. |
0:09:43 | The Fisher dataset comes with phonetic labels, so we use that data to train the additional ASR task, and the remaining datasets are used to train the speaker network and the backend. |
0:10:00 | The MUSAN and RIRS noise datasets are used as noise sources to augment the training data, and the amount of training data is doubled. |
0:10:15 | The test sets are the development and evaluation datasets of the NIST SRE 2018 CTS task. |
0:10:23 | The input features of the network are MFCCs. |
0:10:30 | The ASR network is trained using English datasets, Switchboard and Fisher, so it does not match the language of the SRE 2018 dataset. |
0:10:48 | The transcriptions are force-aligned by a GMM-HMM system to generate the frame-level phonetic labels. |
0:10:58 | The extractor and the pre-trained ASR network are trained on this data, with the same setup as the other extractors. |
0:11:13 | The above experiments all use the same backend processing. |
0:11:19 | After the embeddings are extracted, a 200-dimensional LDA and a PLDA classifier are trained. |
0:11:29 | Due to domain mismatch, a common adaptation method is to use the SRE 2018 unlabeled data to realize PLDA adaptation, but we use another method to get better results. |
0:11:45 | That is, we apply unsupervised clustering to the SRE 2018 unlabeled data and use the clustered data to train the PLDA; then the PLDA is used for scoring. |
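The pseudo-labeling idea can be illustrated with a toy clustering pass over embeddings; the k-means routine, the deterministic initialization, and the blob data below are my illustrative stand-ins, not the authors' actual clustering method:

```python
import numpy as np

def kmeans_labels(embeddings, k, init_idx, iters=10):
    """Toy k-means that pseudo-labels unlabeled embeddings; the
    clustered data can then be treated as labeled speakers when
    training the PLDA backend. Illustrative only."""
    centers = embeddings[np.asarray(init_idx)].astype(float).copy()
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(iters):
        dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)            # assign to nearest center
        for j in range(k):                       # recompute centers
            members = embeddings[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

# Two well-separated blobs standing in for two unknown speakers,
# with one initial center taken from each blob for determinism.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(5.0, 0.1, (20, 5))])
labels = kmeans_labels(X, 2, init_idx=[0, 20])
```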
0:12:01 | The results show that the ft-vector performs better than the baseline under the same backend. |
0:12:10 | And the ft-vector with the shared first layer achieves the best performance. |
0:12:18 | The overall performance of the ft-vectors gradually decreases as the number of shared layers increases, and we believe this is partly due to language mismatch. |
0:12:32 | The training data of the ASR part is spoken English, while the test dataset is spoken in Tunisian Arabic. |
0:12:44 | But from the results we can see that, even in this case, the extracted phonetic information can still improve the performance of speaker recognition. |
0:12:59 | That's all. Thank you. |