0:00:13 | Hello. |
0:00:14 | My name is [inaudible], and I am one of the researchers working on speech signal processing. |
0:00:19 | I am going to tell you about our work on deep speaker embeddings for speaker |
0:00:24 | recognition on short utterances. |
0:00:34 | The popularity of voice-activated devices such as smart speakers fuels the demand for |
0:00:39 | far-field speaker recognition. |
0:00:41 | The environmental conditions those devices are usually used in are in some cases far |
0:00:46 | from clean, so speech processing algorithms additionally have to be robust to noise. |
0:00:51 | And the last challenge of speaker recognition we would like to draw your attention to |
0:00:57 | here |
0:01:00 | is performance on short-duration test segments. |
0:01:04 | So the main focus of our study |
0:01:07 | was to design a model that would |
0:01:10 | not only perform well on unseen audio samples recorded in noisy environments, but |
0:01:16 | also retain the recognition quality when tested on short speech segments. |
0:01:22 | In order to achieve this, |
0:01:24 | we started with the idea of moving the training data closer to the testing scenario. |
0:01:29 | For that, we investigated the effect |
0:01:31 | on the overall recognition performance of changes in the duration and augmentation conditions of the |
0:01:37 | training data. |
0:01:39 | The second concern was the presence in voice segments of non-speaker-specific |
0:01:45 | information such as background noise and silence, so we prioritized the robustness |
0:01:51 | to noise of the voice activity detector, which was used to remove such parts. |
0:01:59 | Next, we tried different acoustic features as well as embedding extractor architectures. |
0:02:06 | We also investigated the effects of |
0:02:09 | back-end level domain adaptation and score normalization. |
0:02:14 | Since the |
0:02:16 | data is the foundation of every data-dependent experiment, we will first introduce the data used in the current |
0:02:22 | study. |
0:02:24 | So, we have constructed four datasets |
0:02:27 | that are primarily comprised of VoxCeleb 1 and 2 data, |
0:02:32 | except for two of the training datasets, |
0:02:35 | which also have a fraction of additional data mixed in. |
0:02:43 | The most significant |
0:02:45 | difference between these datasets is the augmentation used. |
0:02:49 | For training datasets two and four, the standard Kaldi-style augmentation was used, |
0:02:54 | while training datasets one and three were |
0:02:59 | augmented in a different way. |
0:03:03 | So in contrast to the augmentation scheme developed in the Kaldi recipe, where reverberation, noise, music, and |
0:03:11 | babble speech |
0:03:12 | are used, |
0:03:14 | we generated, for each utterance, room impulse responses for different positions of the sources |
0:03:21 | and receivers. |
0:03:23 | To generate those impulse responses, we used the image-source impulse response generator proposed by |
0:03:31 | Jont Allen and David Berkley. |
0:03:34 | In this way, we have tried to narrow down the gap between real and |
0:03:39 | simulated room impulse responses by creating more realistic reverberation. |
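For readers unfamiliar with the Allen-Berkley image-source method mentioned here, the sketch below shows the core idea for a shoebox room. This is not the generator used by the authors; it is a minimal illustration that assumes a single uniform reflection coefficient for all six walls, which is a simplification.

```python
import numpy as np

def image_source_rir(room, src, mic, fs=16000, c=343.0, beta=0.9,
                     max_order=3, dur=0.25):
    """Simplified Allen & Berkley image-source RIR for a shoebox room.
    room/src/mic are 3-vectors in meters; beta is one reflection
    coefficient shared by all walls (a simplifying assumption)."""
    L, s, m = (np.asarray(v, float) for v in (room, src, mic))
    h = np.zeros(int(fs * dur))
    orders = range(-max_order, max_order + 1)
    for nx in orders:
        for ny in orders:
            for nz in orders:
                n = np.array([nx, ny, nz])
                for qx in (0, 1):
                    for qy in (0, 1):
                        for qz in (0, 1):
                            q = np.array([qx, qy, qz])
                            img = (1 - 2 * q) * s + 2 * n * L  # image source position
                            d = np.linalg.norm(img - m)
                            # reflection count against the two walls on each axis
                            refl = beta ** np.sum(np.abs(n - q) + np.abs(n))
                            idx = int(round(fs * d / c))       # delay in samples
                            if idx < len(h):
                                h[idx] += refl / (4 * np.pi * d)
    return h
```

Reverberated training data is then obtained by convolving clean speech with such a response, e.g. `np.convolve(signal, h)`.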
0:03:46 | Benchmarks of our best speaker recognition systems verified that |
0:03:51 | the described scheme is indeed better than the standard Kaldi augmentation in most conditions, as |
0:03:59 | you can see. |
0:04:01 | So now let's move on to |
0:04:06 | data preprocessing. |
0:04:07 | In terms of acoustic features, we have experimented with MFCCs and with |
0:04:13 | mel filter banks. |
0:04:16 | The extracted acoustic features underwent either local mean normalization |
0:04:22 | followed by global mean and variance normalization, or just a single local normalization. |
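The two normalization steps just described can be sketched as follows. This is a generic illustration, not the authors' code; the 150-frame window is an assumed, typical sliding-window size.

```python
import numpy as np

def local_mean_norm(feats, win=150):
    """Sliding-window (local) mean normalization over frames.
    feats: (T, D) array of per-frame acoustic features."""
    out = np.empty(feats.shape, dtype=float)
    T = len(feats)
    for t in range(T):
        lo, hi = max(0, t - win // 2), min(T, t + win // 2 + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)  # subtract local mean
    return out

def global_mvn(feats):
    """Global mean and variance normalization over the whole utterance."""
    mu = feats.mean(axis=0)
    sd = feats.std(axis=0) + 1e-8  # avoid division by zero
    return (feats - mu) / sd
```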
0:04:30 | If we look at the benchmarks, we will see that the model trained on |
0:04:34 | mel filter banks outperforms the same model trained on MFCCs |
0:04:40 | on the |
0:04:41 | majority of test protocols. |
0:04:45 | The next preprocessing stage we want to draw attention to is voice activity detection. |
0:04:52 | In our previous studies, we found the conventional energy-based voice activity detector to be |
0:04:58 | sensitive to noise, |
0:05:00 | so we decided to create our own neural network based voice activity detector. |
0:05:07 | Our voice activity detector is based on the U-Net architecture, which was initially developed |
0:05:15 | for medical image segmentation. |
0:05:18 | Our version of U-Net is actually reduced in size compared to the original one. |
0:05:24 | We have trained it on |
0:05:25 | telephone |
0:05:27 | data and a small fraction of microphone data, which was downsampled to 8 kHz. |
0:05:33 | The labels for this dataset |
0:05:37 | were obtained either in terms of manual segmentation or using automatic |
0:05:46 | speech recognition based estimation of the segmentation, |
0:05:50 | followed by manual post-processing. |
0:05:55 | As for the results, what we observe is that the U-Net based voice activity |
0:06:00 | detector actually helps us to improve the quality of our systems in difficult conditions |
0:06:07 | compared to the standard Kaldi energy-based voice activity detector. |
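For reference, the kind of energy-based baseline the U-Net detector is compared against can be sketched as below. The frame sizes and the -35 dB threshold are illustrative assumptions, not the settings used in the talk.

```python
import numpy as np

def energy_vad(signal, fs=16000, frame_ms=25, hop_ms=10, thresh_db=-35.0):
    """Energy-based VAD: mark frames whose energy is within `thresh_db`
    of the loudest frame as speech. Returns a boolean mask per frame."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n)])
    energy_db = 10.0 * np.log10(energy + 1e-12)   # floor to avoid log(0)
    return energy_db > energy_db.max() + thresh_db  # True = speech frame
```

Because the threshold is tied to frame energy alone, additive noise raises silent-frame energy and quickly erodes the speech/non-speech margin, which is the sensitivity the talk refers to.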
0:06:17 | Let's now dive into the details of the main component of our system: |
0:06:23 | the embedding extractor structures. |
0:06:26 | The embedding extractor is comprised of, first, a frame-level network, |
0:06:32 | then a statistics pooling layer, and |
0:06:36 | finally a segment-level network. |
0:06:41 | At the frame level, |
0:06:43 | the network actually analyzes the input audio features |
0:06:48 | frame by frame. |
0:06:50 | For the frame level, we have considered two types of widely used neural networks: |
0:06:55 | TDNNs, |
0:06:56 | based on 1D convolutions, |
0:07:00 | and also ResNets, based on 2D convolutions. |
0:07:04 | The main difference between |
0:07:07 | these two is the type of convolutional kernel and the way the input features are |
0:07:12 | processed. |
0:07:15 | The frame-level outputs are followed by a statistics pooling layer |
0:07:19 | that |
0:07:22 | aggregates frame-level features along the time axis. |
0:07:25 | The pooled feature maps are then flattened and passed to the segment-level network that extracts utterance- |
0:07:30 | level information. |
0:07:32 | The resulting embedding vector is then normalized and passed to the classification layer. |
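The statistics pooling step in the middle of this pipeline is small enough to sketch exactly: it maps a variable-length sequence of frame-level features to a fixed-size segment-level vector.

```python
import numpy as np

def stats_pooling(frames):
    """Statistics pooling: map a variable-length (T, D) sequence of
    frame-level features to a fixed (2*D,) vector of per-dimension
    means and standard deviations over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

It is this fixed-size output that lets the segment-level fully connected layers handle utterances of any duration.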
0:07:38 | We have started with the well-known extended version of |
0:07:42 | TDNNs, |
0:07:43 | which interleaves |
0:07:48 | time-delay |
0:07:51 | layers with dense layers. |
0:07:55 | Then we moved to the factorized TDNN architecture, and finally |
0:08:00 | ended our experiments with ResNets, |
0:08:03 | where we tried the ResNet-34 configuration and a wider variant of ResNet |
0:08:10 | with additional skip connections. |
0:08:16 | Now, |
0:08:19 | analyzing the test results for those architectures, we can draw two conclusions. |
0:08:25 | First, |
0:08:26 | our ResNet outperforms the x-vector systems. |
0:08:33 | Second, no improvement is achieved by switching to heavier ResNets. |
0:08:39 | As for the loss functions, we have stuck to additive margin softmax, |
0:08:43 | which is well established in the area of speaker recognition. |
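The additive margin softmax loss mentioned here is easy to write down. The sketch below is a generic NumPy illustration; the scale `s=30` and margin `m=0.2` are typical published values, not necessarily the authors' settings.

```python
import numpy as np

def am_softmax_loss(emb, W, labels, s=30.0, m=0.2):
    """Additive margin softmax: logits are scaled cosine similarities,
    with margin m subtracted from the target-class cosine before softmax.
    emb: (N, D) embeddings; W: (D, C) class weights; labels: (N,) ints."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm embeddings
    w = W / np.linalg.norm(W, axis=0, keepdims=True)       # unit-norm class weights
    cos = e @ w                                            # (N, C) cosines
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * (cos[rows, labels] - m)     # penalize target class
    z = logits - logits.max(axis=1, keepdims=True)         # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[rows, labels].mean()
```

The margin forces the target cosine to exceed all competitors by at least `m`, which tightens within-speaker clusters in the embedding space.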
0:08:49 | We have also tried to train our best model using D-Softmax, |
0:08:55 | which was recently proposed; what it actually does is dissect the softmax |
0:09:01 | into independent intra- and inter-class objectives. |
0:09:06 | However, we were not able to get D-Softmax based training to help performance in our experiments. |
0:09:14 | In this work, we use cosine similarity between embeddings instead of PLDA-based |
0:09:20 | metric learning |
0:09:21 | for scoring. |
0:09:23 | We |
0:09:24 | also used a |
0:09:25 | simple domain adaptation procedure based on centering the data |
0:09:30 | on an in-domain set, whereby we subtract from the speaker embeddings |
0:09:35 | the mean vector calculated using the adaptation set. In addition, |
0:09:40 | we also adaptively normalize the scores with the statistics of the top |
0:09:47 | ten percent best-scoring impostors for each embedding. |
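The adaptive score normalization just described can be sketched as follows. This is a generic adaptive s-norm illustration, assuming cosine scores against an impostor cohort; the exact cohort construction in the talk may differ.

```python
import numpy as np

def adaptive_snorm(score, enroll_cohort, test_cohort, top_frac=0.1):
    """Adaptive s-norm: normalize one trial score using the mean/std of the
    top-scoring fraction of each side's impostor-cohort scores.
    enroll_cohort / test_cohort: 1-D arrays of cosine scores between the
    enrollment (resp. test) embedding and a cohort of impostor embeddings."""
    def top_stats(cohort):
        k = max(2, int(len(cohort) * top_frac))
        top = np.sort(np.asarray(cohort, float))[-k:]  # best-scoring impostors
        return top.mean(), top.std() + 1e-8
    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    # symmetric z-normalization from both sides of the trial
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```

Restricting the statistics to the top-scoring impostors makes the normalization adapt to each trial's hardest competitors rather than to the whole cohort.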
0:09:52 | Mean adaptation allows us to reduce the equal error rate and improve the minimum detection cost, |
0:09:59 | but only slightly. |
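The equal error rate used throughout these comparisons can be computed from trial scores as below; this is a generic threshold-sweep sketch, not the evaluation tooling used in the talk.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: sweep thresholds over all observed scores and return the
    operating point where false-acceptance and false-rejection rates
    are closest to each other."""
    tgt = np.asarray(target_scores, float)
    imp = np.asarray(impostor_scores, float)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(np.concatenate([tgt, imp])):
        far = np.mean(imp >= t)   # impostor trials accepted
        frr = np.mean(tgt < t)    # target trials rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2.0
    return best_eer
```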
0:10:01 | But if we compare it with |
0:10:07 | score normalization, we will see that score normalization outperforms mean adaptation |
0:10:14 | on the majority of the test sets. So now we can summarize |
0:10:19 | the results |
0:10:21 | obtained so far, before moving |
0:10:24 | to the experiments focused on the duration of training samples. |
0:10:30 | So, |
0:10:34 | what we observe is that systems based on ResNet architectures outperform x-vector based |
0:10:39 | systems in all experiments, |
0:10:41 | the U-Net based |
0:10:43 | voice activity detector outperforms the energy-based voice activity detector, |
0:10:48 | and score normalization improves the performance of all extractor types on the majority |
0:10:55 | of the test settings. |
0:10:57 | Also, careful tuning of the training data augmentation can slightly improve the |
0:11:04 | quality of the systems. |
0:11:07 | D-Softmax based training doesn't help to improve the EER |
0:11:12 | or minDCF performance, and also we did not achieve any gain by using more complex, |
0:11:24 | heavier ResNets. |
0:11:28 | For testing our hypotheses on short-duration test segments, |
0:11:32 | we have performed |
0:11:36 | the experiments |
0:11:38 | with test cut lengths ranging from one to five seconds. |
0:11:43 | First, we have seen that, independently of the test sample duration, |
0:11:47 | the TDNN-based systems still do not do better than the ResNets. |
0:11:53 | Secondly, we validated that the ResNet-based architectures |
0:11:59 | indeed |
0:12:04 | beat the ones |
0:12:06 | based on |
0:12:09 | x-vectors in terms of EER and weighted minDCF for the tests |
0:12:13 | on |
0:12:15 | full-duration as well as short two-to-five-second segments. |
0:12:21 | It is |
0:12:22 | also |
0:12:24 | interesting to see that TDNN-based x-vector systems degrade more |
0:12:29 | than ResNet systems on short segments. |
0:12:34 | This figure provides an illustration of the relative differences between moving |
0:12:41 | from testing ResNets on full sample durations to short-length segments, and moving from |
0:12:46 | testing x-vectors on full durations to short-length segments. |
0:12:57 | It is also interesting to see how |
0:13:00 | the augmentation we refer to as more realistic compares to |
0:13:05 | the Kaldi-style augmentation in terms of the performance of the full-duration model. |
0:13:11 | When the model is trained on full-duration segments, we see that |
0:13:16 | there |
0:13:18 | is no one obvious winner: |
0:13:23 | the gap between the |
0:13:26 | metrics in both cases is quite narrow. |
0:13:28 | Now, |
0:13:30 | if we make the training segments shorter, |
0:13:35 | we will |
0:13:36 | see a differentiation: |
0:13:38 | the model trained on data with more realistic room impulse responses |
0:13:43 | outperforms the model trained on the Kaldi-style version of impulse responses, |
0:13:52 | and the gap is getting wider. |
0:13:55 | However, the absolute EER |
0:13:58 | is still not |
0:14:01 | as good. |
0:14:03 | The obvious conclusion we can draw from the results |
0:14:09 | here is that, in case of training the ResNet-based model on short segments, |
0:14:16 | its performance for shorter test durations suffers less degradation. |
0:14:27 | In order to compare our speaker recognition systems' performance on short utterances to |
0:14:33 | results already presented in the literature, |
0:14:35 | we have found a paper describing |
0:14:40 | the protocols used, as well as containing enough detail to reproduce the reported results in a |
0:14:45 | similar experiment. |
0:14:47 | So we were able to |
0:14:51 | generate training and testing protocols mostly identical to those used in the paper |
0:14:56 | of interest, so that, on this slide, |
0:15:01 | you can see how our systems compare to the current level of speaker recognition in this scenario. |
0:15:07 | For reproducibility purposes, we also did not use the U-Net based voice activity detector for |
0:15:15 | this data, |
0:15:17 | so you can see how |
0:15:21 | carefully we tried to recreate the protocol. |
0:15:26 | As for the results, we can say that our testing shows significantly better quality |
0:15:32 | of our model for very short durations, |
0:15:37 | like one-second and two-second utterances, as |
0:15:44 | well as longer durations. |
0:15:47 | And |
0:15:50 | hence, the final slide with the |
0:15:56 | main |
0:15:58 | takeaways |
0:15:59 | of this talk. The obtained results confirm that |
0:16:06 | ResNet architectures outperform the x-vector approach in both full-duration and |
0:16:11 | short-duration scenarios, and |
0:16:14 | appropriate training data preparation can significantly improve the quality of the final speaker recognition systems. |
0:16:22 | Also, the proposed U-Net based voice activity detector outperforms the energy-based voice |
0:16:28 | activity detector. |
0:16:32 | Our best performing system for the VOiCES task |
0:16:37 | is a ResNet-34 based system built on mel filter bank |
0:16:43 | features, |
0:16:44 | and it actually outperforms our previous best single system submitted to the VOiCES |
0:16:50 | challenge. |
0:16:53 | The proposed scoring model with mean adaptation and score normalization techniques provides additional performance gains for speaker recognition. |
0:17:03 | And that's it. |
0:17:06 | Thank you for your attention. If you have any questions, we will be happy to answer them |
0:17:11 | in the Q&A session. |