0:00:14 | Hello. I am a PhD student from The Chinese University of Hong Kong. |
0:00:20 | I will present the work titled "Bayesian Neural Network based x-vector System for Speaker Verification", |
0:00:28 | accepted by Odyssey 2020. |
0:00:33 | This work proposes to incorporate Bayesian neural networks into automatic speaker verification (ASV) systems to improve the systems' generalization ability. |
0:00:43 | In this presentation, I will first introduce ASV systems and the challenges involved in developing them. |
0:00:52 | This is followed by some related work on Bayesian learning, that is, the use of Bayesian modeling in the machine learning community. |
0:01:00 | Then I will talk about our approach, including the motivation and how to apply Bayesian learning to ASV systems. |
0:01:07 | Next, our experimental setup and results will be presented to verify the effectiveness of our approach. |
0:01:15 | This is followed by the final conclusion. |
0:01:20 | Automatic speaker verification systems aim at confirming whether a spoken utterance matches a claimed speaker identity. |
0:01:28 | We have seen an ever-increasing use of ASV systems in our daily lives, including voice interaction with electronic devices, e-banking authentication, and so on. |
0:01:40 | There are three most representative frameworks for developing ASV systems. |
0:01:46 | First, i-vector based ASV systems were proposed to model the speaker and channel variations together, and they use a speaker-discriminative back-end for verification. |
0:01:58 | Benefiting from the powerful discriminative ability of neural networks, speaker embedding systems were then proposed to extract speaker-discriminative representations from utterances; these could achieve state-of-the-art performance. |
0:02:15 | With the development of end-to-end training, many researchers also focus on constructing ASV systems in an end-to-end manner. |
0:02:29 | A key challenge for ASV system development is the mismatch between the training and evaluation data, such as the speaker population mismatch and the variations in channel and environmental background. |
0:02:43 | The speaker populations used for training and evaluation commonly have no overlap, especially in practical applications. |
0:02:52 | To work well on such mismatched data, the speaker representations are required to generalize well on unseen speaker data. |
0:03:03 | Channel and environmental variations also commonly exist in practical applications, where the training and evaluation data are collected from different types of recorders and environments. |
0:03:16 | These mismatches likewise place a high demand on the model's generalization ability. |
0:03:24 | To address this issue, previous efforts have applied adversarial training to alleviate the channel and environmental variations in speaker embeddings. |
0:03:34 | These approaches improve performance by alleviating the effects of channel and environmental mismatches, but they seldom consider the speaker population mismatch, which could also lead to system performance degradation. |
0:03:50 | In this work, we focus on the x-vector system and try to incorporate Bayesian neural networks to improve the system's generalization ability across all three kinds of mismatches. |
0:04:06 | Bayesian learning approaches have been shown to be effective in improving the generalization ability of discriminatively trained DNN systems in the machine learning community. |
0:04:17 | Barber et al. proposed an efficient variational inference based learning algorithm for Bayesian neural networks. |
0:04:26 | Other work proposed a novel backpropagation-compatible algorithm for learning the network parameters of discrete distributions. |
0:04:37 | In the speech area, Bayesian neural networks have been applied to speech recognition, |
0:04:46 | Bayesian learning of hidden unit contributions has been applied for speaker adaptation, |
0:04:53 | and Bayesian learning has also been applied to recurrent neural network language modeling. |
0:05:00 | Now we introduce our approach; before the details, let me first briefly talk about the traditional x-vector system. |
0:05:09 | Its system parameters are estimated under the maximum likelihood strategy, i.e., as a point estimate, as shown in Figure 1. |
0:05:20 | This tends to overfit when given limited training data, or when there exists a mismatch between the training and evaluation data. |
0:05:31 | In the case of a mismatched speaker population, the overfitted model parameters may result in a speaker representation distribution that is over-tuned to the training speaker identities; however, this cannot generalize well on unseen speaker data. |
0:05:50 | Similar problems arise in the cases of channel and environmental mismatch. For instance, under channel mismatch, the overfitted model parameters may partially rely on the channel information to classify speakers, due to the use of different recorders for different speakers in the training data. |
0:06:10 | Once encountering the channel mismatch in the evaluation data, the original channel-to-speaker correspondence is broken, and the trained reliance on channel information leads to misclassification. |
0:06:27 | On the other hand, the speaker representations extracted from the x-vector system still contain speaker-unrelated information, such as channel, transcription, and utterance length. |
0:06:42 | This information could degrade the verification performance, especially on the NIST SRE evaluation data. |
0:06:49 | A Bayesian neural network, in contrast, estimates the parameter posterior distribution, as shown in Figure 2. |
0:06:58 | These probabilistic parameters could help in two ways. |
0:07:02 | To address the speaker population mismatch issue, they are expected to help smooth the distributions of speaker representations for better generalization on unseen data. |
0:07:15 | To alleviate the mismatches caused by channel and environmental variations, the probabilistic parameter modeling could reduce the risk of overfitting on channel information, by forcing the parameters to consider other possible values that do not rely on channel information for speaker classification. |
0:07:34 | Motivated by this, in this work we incorporate Bayesian neural networks into the x-vector system by replacing some of its layers, to improve its generalization ability. |
0:07:51 | The x-vector system consists of two parts: a front-end utterance-level speaker embedding extraction, and a back-end verification scoring. |
0:08:01 | The front-end compresses utterances of different lengths into fixed-dimension speaker-related embeddings. |
0:08:09 | Based on these embeddings, different scoring schemes can be used to predict whether two utterances belong to the same speaker or not. |
0:08:18 | In this work, we focus on two of the most widely used back-ends, probabilistic linear discriminant analysis (PLDA) and cosine similarity scoring, for the performance evaluation. |
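The simpler of the two back-ends just mentioned admits a minimal sketch. The helper name and the NumPy implementation below are illustrative, not code from the talk:

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings.

    Higher scores mean the two utterances are more likely
    to come from the same speaker."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A threshold on this score then yields the final accept/reject decision.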
0:08:32 | The x-vector extractor is a neural network trained on a speaker discrimination task. As shown in Figure 3, it consists of frame-level and utterance-level structures. |
0:08:44 | At the frame level, several layers of time delay neural network (TDNN) are used to model the temporal correlation characteristics of the acoustic features. |
0:08:58 | Then a statistics pooling layer aggregates all the frame-level outputs from the last TDNN layer and computes their mean and standard deviation. |
0:09:10 | The computed statistics are propagated through several embedding layers and finally a softmax output layer. |
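The statistics pooling step just described can be sketched as follows. This is an illustrative NumPy version, assuming the frame-level activations arrive as a 2-D array:

```python
import numpy as np

def statistics_pooling(frames):
    """Map variable-length frame-level activations to a fixed-length vector.

    frames: array of shape (num_frames, feat_dim), the outputs of the
    last frame-level TDNN layer. Returns the per-dimension mean and
    standard deviation concatenated, shape (2 * feat_dim,)."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

Because the output length no longer depends on the number of frames, the subsequent embedding layers can be ordinary fully connected layers.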
0:09:20 | The cross-entropy loss is used for the speaker classification task in the training stage. |
0:09:26 | In the testing stage, given the acoustic features of an utterance, the embedding layer output is extracted as the x-vector. |
0:09:39 | Bayesian neural networks learn the parameter posterior distribution p(w|D) to model the weight uncertainty, which enables an infinite number of possible model parameters to describe the training data D. |
0:09:55 | This weight uncertainty modeling smooths the model parameters and makes the model generalize well on unseen data. |
0:10:03 | During the testing stage, the model computes the predictive distribution of the output y given an input x, by taking the expectation over the weight posterior distribution p(w|D), as shown in Equation 1. |
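The expectation in Equation 1, presumably p(y|x, D) = E over p(w|D) of p(y|x, w), is in practice approximated by Monte Carlo sampling from the (approximate) posterior. Below is a toy sketch with a single-weight logistic model; the function name and the model are illustrative assumptions, not from the talk:

```python
import numpy as np

def predictive_prob(x, mu, sigma, num_samples=2000, seed=0):
    """Monte Carlo estimate of p(y=1 | x) under a Gaussian weight posterior:
    E_{w ~ N(mu, sigma^2)}[sigmoid(w * x)].

    Each sample draws one weight setting from the posterior and the
    predictions are averaged, mirroring the expectation in Equation 1."""
    rng = np.random.default_rng(seed)
    w = mu + sigma * rng.standard_normal(num_samples)  # posterior samples
    return float(np.mean(1.0 / (1.0 + np.exp(-w * x))))
```
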
0:10:20 | However, directly evaluating the integration in Equation 1 is intractable for neural networks of practical size, since the number of possible weight values is infinite. |
0:10:33 | So a variational approximation is commonly adopted to estimate the posterior distribution. |
0:10:40 | The variational approximation aims at using a distribution q(w), parameterized by theta, to approximate the posterior distribution p(w|D). |
0:10:54 | This is achieved by minimizing the Kullback-Leibler (KL) divergence between these two distributions, as shown in Equation 2. |
0:11:05 | From Equation 2 to Equation 4, we apply Bayes' theorem and drop the constant term log p(D); since this term is not affected by the minimization, Equations 2 and 4 share the same solution. |
0:11:23 | Based on this, the loss in Equation 4 can be decomposed into two parts. |
0:11:31 | One is the KL divergence between the approximation distribution q(w) and the prior distribution p(w). |
0:11:45 | The other one is the expectation of the log-likelihood of the training data over the approximation distribution q(w). |
0:11:56 | This decomposed loss is used as the objective function to be minimized in the training process. |
0:12:07 | As commonly adopted, we assume that both the variational approximation q(w) and the prior distribution p(w) follow diagonal Gaussian distributions, with the variational parameters theta composed of mu_q and sigma_q, and the fixed prior parameters being mu_p and sigma_p, respectively. |
0:12:30 | The two parts of the loss function can then be formulated as Equations 7 and 8, respectively. Because Equation 8 has no closed-form solution, we apply Monte Carlo sampling to approximate the integration. |
0:12:45 | Finally, combining Equations 7 and 8, we obtain the final loss function shown in Equation 9, which we directly use as the objective in the training process. |
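The two loss terms just described can be sketched in the style of variational Bayesian training with diagonal Gaussians. The function names and the NumPy framing are illustrative assumptions; the talk's Equations 7 to 9 apply this to full network layers:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over
    weight dimensions -- the first part of the loss (Equation 7 in spirit)."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    ))

def sample_weights(mu, rho, rng):
    """Reparameterized draw w = mu + softplus(rho) * eps, eps ~ N(0, I),
    used to Monte Carlo estimate the data log-likelihood part of the loss
    (Equation 8 in spirit) while keeping gradients w.r.t. (mu, rho)."""
    sigma = np.log1p(np.exp(rho))  # softplus keeps sigma positive
    return mu + sigma * rng.standard_normal(mu.shape)
```

The KL term acts as a regularizer pulling the variational posterior toward the prior, while the sampled-likelihood term fits the training data.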
0:12:59 | In order to evaluate the effectiveness of Bayesian learning for speaker verification in both short and long utterance conditions, we performed experiments on two datasets. |
0:13:09 | For the short utterance condition, we consider the VoxCeleb1 dataset, with in total 148,642 utterances from 1,251 celebrities. |
0:13:28 | We adopt 4,874 utterances from 40 speakers for evaluation, and the remaining utterances are used for training the ASV system parameters. |
0:13:42 | For the long utterance condition, the female part of the NIST SRE10 speaker recognition evaluation is used for benchmarking our models. |
0:13:52 | For the training set, we adopt the previous SRE corpora since SRE04. |
0:13:56 | In total, we have around 60,000 recordings from about 6,800 speakers in the training set. |
0:14:06 | We evaluate the systems' generalization ability based on evaluations with different degrees of mismatch. |
0:14:15 | We performed in-domain and also out-of-domain evaluations, where the training and evaluation are executed on the same dataset in the in-domain evaluation, and on different datasets in the out-of-domain evaluation. |
0:14:30 | Mel-frequency cepstral coefficients (MFCCs) are adopted as the acoustic features in our experiments. The extracted MFCCs undergo mean normalization, and then voice activity detection filters out all non-speech frames. |
0:14:43 | The x-vector network structure configuration is shown in Table 1. |
0:14:47 | Linear discriminant analysis is applied to reduce the x-vectors' dimension. |
0:14:53 | To make a fair comparison, the Bayesian x-vector system is configured to have the same architecture as the baseline system, except that the first TDNN layer is replaced by a Bayesian layer with the same number of units. |
0:15:08 | Stochastic gradient descent is adopted as the optimizer during training. The evaluation metrics adopted here are the commonly used equal error rate (EER) and minimum detection cost function. |
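For reference, the equal error rate can be computed from trial scores with a simple threshold sweep. This is an illustrative NumPy sketch; toolkit implementations typically interpolate between operating points more carefully:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where the false-rejection rate (FRR)
    equals the false-acceptance rate (FAR), found by sweeping the
    threshold over all observed scores and taking the closest crossing."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        frr = np.mean(target_scores < t)      # true trials rejected
        far = np.mean(nontarget_scores >= t)  # impostor trials accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return float(eer)
```
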
0:15:21 | Here are the in-domain evaluation results. We observe that the EERs consistently decrease after incorporating Bayesian learning, on both datasets. |
0:15:34 | For each dataset, we consider the average relative EER decrease across the cosine and PLDA back-ends. |
0:15:45 | On the VoxCeleb1 dataset, the average relative EER decrease from the baseline to the Bayesian x-vector system is about 2.6 percent, and the fusion system could show a further improvement in the average relative EER decrease. |
0:16:04 | On the NIST SRE10 dataset, the average relative EER decreases are about 3.2 percent for the Bayesian x-vector system and 3.8 percent for the fusion system. |
0:16:19 | We also observe a consistent improvement in the detection cost function performance after applying Bayesian learning and the fusion. |
0:16:29 | These observations verify the improved generalization ability of the applied Bayesian neural networks. |
0:16:37 | Figure 4 illustrates the detection error tradeoff (DET) curves of all systems with the cosine similarity back-end, benchmarked on the VoxCeleb1 dataset. |
0:16:47 | It shows that the proposed Bayesian system outperforms the baseline for all operating points, and the fusion system shows further improvements, due to the complementary advantages of the baseline and Bayesian systems. |
0:17:02 | Here are the out-of-domain evaluation results, where the model trained on VoxCeleb1 was evaluated on NIST SRE10, and vice versa. |
0:17:12 | System performance drops significantly, due to the large data mismatch between the training and evaluation data. |
0:17:22 | From the table, we observe that the systems could benefit more from the generalization ability brought by Bayesian learning. |
0:17:29 | We again consider the average relative EER decrease across the cosine and PLDA back-ends for the performance evaluation. |
0:17:39 | In the experiments evaluated on the NIST SRE10 dataset, the average relative EER decreases are about 4.2 percent and 6.0 percent over the baseline, for the Bayesian x-vector system and the fusion system, respectively. |
0:17:56 | For the experiments evaluated on the VoxCeleb1 dataset, the average relative EER decrease is 3.07 percent for the Bayesian x-vector system, and the fusion system shows further improvement, with an average relative EER decrease of 6.41 percent. |
0:18:18 | The larger relative EER decreases, compared to those in the in-domain evaluations, suggest that Bayesian learning could be more beneficial when larger mismatches exist between the training and evaluation data. |
0:18:29 | The last column in the table shows the corresponding detection cost function performance, and we also see a consistent improvement by applying Bayesian learning and with the fusion system. |
0:18:46 | Similar to the observation in Figure 4, the detection error tradeoff curves in Figure 5 also show consistent improvements from applying Bayesian learning and the fusion system, for all operating points. |
0:19:02 | In this work, we incorporated Bayesian neural networks into the x-vector system to improve the model's generalization ability. |
0:19:12 | Our experimental results verify that Bayesian learning enables consistent generalization ability improvements over the x-vector system in both short and long utterance conditions. |
0:19:25 | The fusion system achieves further improvements in overall system performance, and the larger improvements observed in the out-of-domain evaluation results suggest that Bayesian learning is more beneficial when larger mismatches exist between the training and evaluation data. |
0:19:42 | Possible future research will focus on incorporating Bayesian learning into end-to-end speaker verification systems. |
0:19:49 | Thanks for listening. |