0:00:13 Hello! My name is Anna, and I am one of the authors of this work. I am going to tell you about our deep speaker embeddings for far-field speaker recognition on short utterances.
0:00:34 The proliferation on the market of voice-driven systems like smart speakers fuels the demand for far-field speaker recognition. The environments these devices are usually used in provide, in many cases, non-clean speech, so the processing algorithms additionally have to be robust to noise and reverberation. And the last challenge for recognition we would like to draw your attention to is performance on short-duration test segments.
0:01:04 So the main focus of our study was to design a model that would not only perform well on unseen audio samples recorded in noisy environments, but would also keep up the recognition quality when tested on short speech segments.
0:01:22 In order to achieve this, we started with the idea of moving the training data closer to the testing scenario. For that, we investigated the effect on the overall recognition performance of changes in the simulated reverberation conditions of the training data. Our second concern was the presence in voice segments of non-speaker-specific information, such as background noise and silence, so we prioritised the robustness-to-noise aspect of the voice activity detector, which was used to discard such parts.
0:01:59 Next, we tried different acoustic features as well as embedding extractor architectures. We also investigated the effects of back-end-level domain adaptation and score normalisation.
0:02:14 Since the data is the foundation of every data-dependent experiment, we will first introduce the data used in the current study.
0:02:24 So, we have constructed four training datasets that are primarily comprised of VoxCeleb 1 and 2 data, except that training datasets 1 and 3 also have a fraction of additional data mixed in. The significant difference between these datasets is the augmentation used: training datasets 2 and 4 use the standard Kaldi-style augmentation scheme, while training datasets 1 and 3 were augmented in a different way.
0:03:03 In contrast to the standard augmentation scheme with its fixed set of room impulse responses and noises, we have generated, for two thousand simulated rooms, impulse responses for four different positions of sources and receivers. To generate those impulse responses, we used the image method proposed by Jont Allen and David Berkley. In this way, we have tried to narrow the gap between real and simulated room impulse responses by creating more realistic ones.
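The Allen-Berkley image method they mention models a rectangular room by mirroring the source across the walls and summing the delayed, attenuated impulses arriving from all image sources. Below is a minimal numpy sketch; the room geometry and the single uniform reflection coefficient `beta` are illustrative assumptions (a real simulator would use per-wall coefficients and fractional-delay filtering):

```python
import numpy as np

def image_method_rir(room, src, mic, beta, fs=16000, order=10, c=343.0, length=4096):
    """Very simplified Allen-Berkley image method: sum attenuated, delayed
    impulses from image sources in a rectangular room.
    room, src, mic: (x, y, z) in meters; beta: wall reflection coefficient."""
    h = np.zeros(length)
    Lx, Ly, Lz = room
    for nx in range(-order, order + 1):
        for ny in range(-order, order + 1):
            for nz in range(-order, order + 1):
                for p in range(8):  # 8 mirror patterns per image cell
                    q = [(p >> i) & 1 for i in range(3)]
                    img = np.array([
                        (1 - 2 * q[0]) * src[0] + 2 * nx * Lx,
                        (1 - 2 * q[1]) * src[1] + 2 * ny * Ly,
                        (1 - 2 * q[2]) * src[2] + 2 * nz * Lz,
                    ])
                    d = np.linalg.norm(img - np.array(mic))
                    sample = int(round(d / c * fs))
                    if sample >= length:
                        continue
                    # attenuation: beta^(number of wall bounces) / spherical spreading
                    refl = beta ** (abs(nx - q[0]) + abs(nx) +
                                    abs(ny - q[1]) + abs(ny) +
                                    abs(nz - q[2]) + abs(nz))
                    h[sample] += refl / (4 * np.pi * max(d, 1e-3))
    return h
```

Convolving clean speech with such a response, plus added noise, yields reverberated training samples of the kind described.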
0:03:46 Benchmarking our speaker recognition systems verified that the described scheme does indeed improve the results on the standard test conditions, as you can see.
0:04:01 So now let's move on to the data preprocessing. As the acoustic features, we have experimented with MFCCs and with mel filter banks. The extracted acoustic features underwent either a local mean normalisation followed by a global mean and variance normalisation, or just a single local normalisation. If we look at the benchmarks, we will see that the model trained on mel filter banks outperforms the same model trained on MFCCs on the majority of the test protocols.
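The local mean normalisation step can be sketched as a sliding-window mean subtraction over the feature matrix; the 300-frame window used here is a common default and only an assumption:

```python
import numpy as np

def sliding_mean_norm(feats, win=300):
    """Subtract, from each frame, the mean over a window of up to `win`
    surrounding frames -- this removes slowly varying channel effects.
    feats: (T, D) array of frame-level acoustic features."""
    T = feats.shape[0]
    half = win // 2
    out = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

A global mean-and-variance normalisation would then additionally standardise each feature dimension across the whole utterance or dataset.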
0:04:45 The next preprocessing stage we want to draw attention to is voice activity detection. In our previous studies, we have seen the initial energy-based voice activity detector being sensitive to noise, so we decided to create our own neural-network-based voice activity detector. Our voice activity detector is based on the U-Net architecture, which was initially developed for medical image segmentation; our U-Net is reduced from the original two-dimensional version to a one-dimensional one. We have trained it on telephone data and a small fraction of microphone data, which was downsampled to 8 kHz. The labels for these datasets were obtained either by manual segmentation or by automatic speech-recognition-based segmentation followed by manual post-processing. As for the results, what we observe is that the U-Net-based voice activity detector actually helps us to improve the quality of the systems in difficult conditions compared to the standard energy-based one.
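For contrast, the baseline they moved away from is essentially a frame-energy threshold. A toy sketch (frame sizes and the 30 dB threshold are arbitrary choices, not the talk's exact settings) makes its noise sensitivity visible: any noise within the threshold of the loudest frame is kept as speech:

```python
import numpy as np

def energy_vad(signal, fs=8000, frame_ms=25, hop_ms=10, threshold_db=30.0):
    """Mark a frame as speech if its log energy is within `threshold_db`
    of the loudest frame -- the classic scheme that fails in noise,
    because noise-only frames can carry substantial energy too."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    log_e = np.array([
        10.0 * np.log10(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n)
    ])
    return log_e > log_e.max() - threshold_db
```

A learned detector such as their U-Net instead classifies each frame from spectral context, so it does not rely on a global energy threshold.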
0:06:17 Let's now dive into the details of the main component of our systems: the embedding extractor. The embedding extractor is comprised of a frame-level network, then a statistics pooling layer, and the segment-level layers. The frame-level network analyses the input audio features at the frame level. For the frame level, we have considered two widely used types of neural networks: TDNN-based and ResNet-based. The main difference between these two is the dimensionality and type of the convolution kernels and the way the input is processed. The frame-level outputs are pooled by the statistics layer, which aggregates the frame-level features along the time axis. The aggregated feature maps are then flattened and passed to the segment-level layers, which extract utterance-level information. The resulting embedding vector is normalized and passed to the classification layer.
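The statistics pooling step in the middle of that pipeline can be sketched in a few lines of numpy; the mean-plus-standard-deviation choice follows the standard x-vector recipe:

```python
import numpy as np

def statistics_pooling(frame_feats):
    """Collapse variable-length frame-level features (T, C) into one fixed
    2C-dimensional utterance vector: per-channel mean and standard deviation."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])

def l2_normalize(vec, eps=1e-8):
    """Length-normalize the final embedding, e.g. before cosine scoring."""
    return vec / (np.linalg.norm(vec) + eps)
```

Because the pooled vector's size is independent of the utterance length T, the segment-level part can be built from ordinary dense layers.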
0:07:38 We started with the well-known extended version of the TDNN, or E-TDNN, which interleaves time-delay layers with dense layers. Then we moved to the factorized TDNN, or F-TDNN, architecture, and finally ended our experiments with ResNets, in a ResNet-34-like configuration of two-dimensional convolutional networks with skip connections.
0:08:16 Examining the test results for those architectures, we can draw two conclusions. First, our ResNet outperforms the TDNN x-vectors. Second, almost no improvement is achieved by switching to the SE-ResNet variant. As for the loss functions, we have stuck to the additive margin softmax, which is well established in the area of speaker recognition.
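Additive margin softmax replaces the plain softmax logits with scaled cosine similarities and subtracts a margin from the target-class similarity, pushing embeddings away from the decision boundary. A numpy sketch; the scale s=30 and margin m=0.2 are typical values, not necessarily the talk's settings:

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, s=30.0, m=0.2):
    """Additive margin softmax (AM-Softmax) cross-entropy.
    embeddings: (B, D); class_weights: (D, N), one column per speaker."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = e @ w                        # cosine similarity logits, (B, N)
    rows = np.arange(len(labels))
    cos[rows, labels] -= m             # penalise the target class by the margin
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())
```

With m = 0 this reduces to ordinary normalized-softmax cross-entropy; the margin makes the objective strictly harder to satisfy, which is what encourages the tighter class clusters.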
0:08:49 We have also tried to train our best model using the D-Softmax loss, which was recently proposed; what it actually does is dissect the softmax loss into independent intra-class and inter-class objectives. However, we were not able to get any improvement from this loss in our training setup.
0:09:14 In this work, we use cosine similarity between the embeddings for scoring. We also used a simple domain adaptation procedure based on centering the data: on an in-domain set, we subtract from the speaker embeddings the mean vector calculated over that adaptation set. We also adaptively normalize the scores with the statistics of the top ten percent best-scoring impostors for each embedding. Mean adaptation allows us to improve the equal error rate and minDCF, but only slightly.
0:10:01 But if we compare it with score normalisation, we will see that score normalisation outperforms mean adaptation on the majority of the test sets, so we can now draw some conclusions.
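The scoring chain described here (cosine similarity after optional in-domain mean subtraction, then adaptive symmetric normalisation against the top-scoring fraction of each side's impostor cohort) can be sketched as follows; the cohort is assumed to be a set of precomputed impostor scores:

```python
import numpy as np

def cosine_score(e1, e2, domain_mean=None):
    """Cosine similarity, optionally after subtracting an in-domain mean
    vector (the simple centering-based domain adaptation)."""
    if domain_mean is not None:
        e1, e2 = e1 - domain_mean, e2 - domain_mean
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def adaptive_s_norm(raw, enroll_cohort, test_cohort, top_frac=0.1):
    """Adaptive S-norm: z-normalise the raw score using the mean and std of
    the top-scoring `top_frac` of each side's impostor cohort scores."""
    def top_stats(scores):
        scores = np.sort(np.asarray(scores))
        k = max(2, int(len(scores) * top_frac))
        top = scores[-k:]
        return top.mean(), top.std()
    me, se = top_stats(enroll_cohort)
    mt, st = top_stats(test_cohort)
    return 0.5 * ((raw - me) / se + (raw - mt) / st)
```

Selecting only the top cohort scores adapts the normalisation to impostors that resemble the trial at hand, which is what makes the scheme "adaptive".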
0:10:19 Here are the results, including how the quality of the proposed models depends on the duration of the training samples. What we observe is that systems based on ResNet architectures outperform x-vector-based systems in all experiments; the U-Net-based voice activity detector outperforms the energy-based voice activity detector; and score normalisation improves the performance of all extractor types on the majority of the test settings. Also, the proposed more realistic reverberation of the training data can slightly improve the quality of the systems. D-Softmax-based training does not help to improve the EER or minDCF performance, and we also did not achieve any gains by using more complex loss functions.
0:11:24 All right. For testing our hypotheses on short-duration test segments, we have moved to the next part of the experiments, with test segment lengths starting from 0.5 seconds. First, we have seen that, independent of the test sample duration, the SE-ResNet still does not do better than the plain ResNet.
0:11:53 Secondly, we validated that the ResNet-based architectures beat the ones based on x-vectors in terms of both EER and minDCF for the tests on one-, two-, and five-second segments.
0:12:21 It is also interesting to see that the TDNN-based x-vector systems degrade more than the ResNet systems on short segments.
0:12:34 This figure is an illustration of the relative differences when moving from testing the ResNet on full sample durations to short-length segments, and when moving from testing the x-vector systems on full durations to short-length segments.
0:12:57 To see how the augmentation we refer to as more realistic compares to the classic Kaldi-style augmentation in terms of the performance of the best full-duration model when it is trained on short-duration segments, we see that the situation changes: it is now not so obvious which one is better, and the gap between the metrics in the two cases is quite narrow.
0:13:30 If we make the training segments even shorter, we see a differentiation: the model trained on data with the more realistic room impulse responses outperforms the model trained on the Kaldi-style version of the impulse responses, and the gap is getting wider. However, the absolute EER values are still not low.
0:14:03 The obvious conclusion we can draw from these results is that, when the ResNet-based model is trained on short utterances, the performance on shorter test durations suffers less degradation.
0:14:27 In order to compare our speaker recognition system's performance on short utterances with the results already presented for this problem, we have chosen a published paper describing the data used, which at the same time enabled us to have directly comparable results in a similar experiment. So we were able to create training and testing protocols mostly identical to those used in the paper "Utterance-Level Aggregation for Speaker Recognition in the Wild".
0:15:07 For reproducibility purposes, we also did not use our U-Net VAD for the test data; as you can see, we really tried to recreate their protocol. As for the results, we can say that the testing showed significantly better quality of our model on very short durations, like one- and two-second utterances, as well as on longer durations.
0:15:47 And hence the final slide: here are the main takeaways of this talk. The obtained results confirm that ResNet architectures beat the x-vector approach in both full-duration and short-duration scenarios. Appropriate training data preparation can significantly improve the quality of the final speaker recognition systems.
0:16:22 Also, the proposed U-Net-based voice activity detector wins over the energy-based voice activity detector. Our best-performing system on the VOiCES test set is a ResNet-34-based system built on mel filter bank features, and it actually outperforms our previous best single system submitted to the VOiCES challenge.
0:16:53 The proposed scoring with mean adaptation and score normalisation techniques provides additional performance gains for speaker verification.
0:17:03 And that's it. Thank you for your attention. If you have any questions, we will be happy to answer them in the Q&A session.