hello
my name is an and i one of the unwieldy timit speech signal just
i'm going to tell you about our work on deep speaker embeddings for far-field speaker
recognition on short utterances
the growth of the market of voice-controlled systems like smart speakers fuels the demand for
far-field speaker recognition
as the environmental conditions those devices are usually used in are in most cases far from
clean speech processing algorithms additionally have to be robust to noise
and the last thing speaker recognition systems have to cope with
is performance on short duration test segments
so the main focus of our study
was to design a model that would
not only perform well on unseen audio samples recorded in noisy environments but
also sustain the recognition quality when tested on short speech segments
in order to achieve this
we started from moving the training data closer to the testing scenario
after that we investigated
how the overall recognition performance changes depending on the duration of the
training data segments
and the second concern was the presence of segments with non speaker specific
information such as background noise or silence so we prioritised the robustness
to noise aspect of the voice activity detector which was used to discard those segments
next we have tried different acoustic features as well as embedding extractor architectures
also we investigated the effect of back-end
level mean adaptation and score normalisation
before moving on to the
experiments we will first introduce the data used in the current
study
so we have constructed four datasets
that are primarily comprised of voxceleb one and two data
except for the training data one and two
one and three that also have a fraction of sitw data mixed in
and the significant
difference between these datasets is the augmentation used
as for the training data two and four the standard kaldi-style augmentation is used
while training data one and three are augmented
in a different way
so in contrast to the augmentation scheme developed in reality or readable moist and speech
rate sure
what two thousand
once we have generated a reading room impulse responses from for different positions of sources
and destructive
to generate those room impulse responses we have used the impulse response generator proposed by
jont allen and david berkley
this way we have tried to narrow down the gap between real and
simulated room impulse responses by making the simulation more realistic
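To make the reverberation step concrete, here is a minimal sketch of applying a room impulse response to a speech signal by convolution. It is an illustration only: `synthetic_rir` is a hypothetical stand-in that uses exponentially decaying noise, not the image method generator mentioned in the talk, and the `rt60_s` value used below is an assumption.

```python
import numpy as np

def synthetic_rir(rt60_s, sample_rate, rng):
    """Toy room impulse response: white noise with an exponential decay.

    Stand-in for illustration only; this is NOT the image method.
    """
    length = int(rt60_s * sample_rate)
    t = np.arange(length) / sample_rate
    # Amplitude falls by 60 dB (a factor of 1e3) after rt60_s seconds.
    decay = np.exp(-np.log(1e3) * t / rt60_s)
    rir = rng.standard_normal(length) * decay
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def reverberate(speech, rir):
    """Convolve speech with a RIR, keeping the original length and peak level."""
    wet = np.convolve(speech, rir)[: len(speech)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(speech)) / peak)
    return wet
```

A call such as `reverberate(speech, synthetic_rir(0.3, 16000, np.random.default_rng(0)))` returns a reverberated copy with the same number of samples as the input.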
the benchmark of our best speaker recognition systems verified that
the described scheme is indeed better than the standard augmentation
as you can see
so now let's start with the
data preprocessing
as for the acoustic features we have experimented with forty dimensional mfccs and eighty
dimensional mel filter banks
extracted acoustic features underwent either local mean normalisation
followed by global mean and variance normalisation or just a single local mean normalisation
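The two normalisation steps can be sketched as follows. This is a minimal sketch under assumptions: the sliding window length (`window=300` frames) is a hypothetical value not given in the talk, and the global statistics are assumed to be computed once over the training set.

```python
import numpy as np

def local_mean_norm(features, window=300):
    """Subtract a sliding-window mean from each frame.

    features: array of shape (frames, dims). The window length is an assumption.
    """
    num_frames = features.shape[0]
    half = window // 2
    out = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        start = max(0, t - half)
        end = min(num_frames, t + half + 1)
        out[t] = features[t] - features[start:end].mean(axis=0)
    return out

def global_mean_var_norm(features, mean, std, eps=1e-8):
    """Apply precomputed global mean and standard deviation per dimension."""
    return (features - mean) / (std + eps)
```

The local step removes slowly varying channel offsets within an utterance, while the global step brings every feature dimension to a comparable scale.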
if we look at the benchmark results we see that the model trained on eighty dimensional
mel filter banks outperforms the same model trained on forty dimensional mfccs
on the
majority of test protocols
and the next preprocessing stage we want to draw attention to is voice activity detection
in our previous studies we have seen the energy based voice activity detector being
sensitive to noise
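A toy example shows why a purely energy based detector struggles in noise. This is a minimal sketch, not the detector used in the study or the Kaldi implementation: the frame length, hop and `threshold_db` values are assumptions chosen for a 16 kHz signal.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=30.0):
    """Mark a frame as speech if its log energy is within threshold_db of the loudest frame."""
    num_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(num_frames)
    ])
    log_energy = 10.0 * np.log10(energies + 1e-10)
    return log_energy > (log_energy.max() - threshold_db)
```

On a clean signal made of silence followed by a tone, only the tone frames pass the threshold. Once strong background noise is added, the noise-only frames also exceed the threshold, because the decision depends on energy alone.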
so we have decided to create our own neural network based voice activity detector
our voice activity detector is based on the u-net architecture which initially was developed
for medical image segmentation
the joyous or unit is actually read to the tree don't betweens you to one
we have trained it on
telephone
data and a small fraction of microphone data which was downsampled to eight khz
labels for this dataset
were obtained either by manual segmentation or using automatic
speech recognition based segmentation
followed by manual post processing
as for the results what we observe is that the u-net based voice activity detector
actually helps us to improve the quality of systems for difficult conditions
compared to the standard kaldi energy based one
let's now dive into the details of the main component of our system
the embedding extractor
the embedding extractor is comprised of a frame level network
then a statistics pooling layer and
also the segment level or
utterance level network
the frame level network analyses the acoustic features at the
frame level
and for the frame level we have considered two types of neural networks
tdnn based and also resnet based
the main difference between
these two is the type of kernel and the way the input is processed
the frame level network is followed by a statistics pooling layer
that
aggregates frame level outputs along time
the pooled feature maps are then flattened and passed to the segment level network that extracts utterance
level information
the resulting embedding vector is normalised and classified
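The pooling step can be illustrated with a short sketch. It assumes the common mean and standard deviation form of statistics pooling; the function names and the `(channels, frames)` layout are hypothetical choices for this example.

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Collapse the time axis into per-channel mean and standard deviation.

    frame_outputs: array of shape (channels, frames).
    Returns a fixed-size vector of shape (2 * channels,).
    """
    mean = frame_outputs.mean(axis=1)
    std = frame_outputs.std(axis=1)
    return np.concatenate([mean, std])

def length_normalise(embedding, eps=1e-12):
    """Scale an embedding vector to unit Euclidean norm."""
    return embedding / (np.linalg.norm(embedding) + eps)
```

Because the time axis is removed, segments of any duration map to a vector of the same size, which is what lets the segment level layers work on variable length input.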
we have started with the well-known extended version of
tdnn
and or place t nine
time delay
lee here with a list here
then we have moved to the factorized tdnn architecture and finally
ended our experiments with resnets
which ride present it and see for configuration and y a v one resonates of
india with a skip looks at it
and
from the test results for those architectures we have drawn two conclusions
first
resnets outperform x-vectors
second no improvement is achieved by switching to heavier resnets
as for the loss functions we have stuck to additive margin softmax
which is well studied in the area of speaker recognition
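For reference, the additive margin softmax loss can be sketched in a few lines of NumPy. This is a minimal sketch of the general technique, not the training code of the study; the `margin` and `scale` defaults are assumed values that the talk does not specify.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Additive margin softmax: subtract a margin from the target cosine, then scale.

    embeddings: (batch, dim), class_weights: (classes, dim), labels: (batch,) ints.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    logits = emb @ w.T  # cosine similarities, shape (batch, classes)
    rows = np.arange(len(labels))
    logits[rows, labels] -= margin  # penalise only the target class
    logits *= scale
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

The margin forces the target cosine to exceed the other class cosines by a fixed amount, so for the same inputs the loss with a positive margin is always larger than with `margin=0.0`.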
also we have tried to train our best model using d-softmax
which was recently proposed and what it actually does is the dissection of the softmax
loss into independent intra and inter class objectives
however we were not able to get d-softmax based training to outperform am-softmax
in this work we use cosine similarity and cosine similarity
metric learning
for scoring
we
also used
a simple domain adaptation procedure based on centering the data
on an in-domain set that is from the speaker embeddings we subtract
the mean vector calculated using the adaptation set in this case
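Mean adaptation followed by cosine scoring is simple enough to sketch directly. The function name and argument names below are hypothetical; the only assumption is that the adaptation mean is the average of embeddings from an in-domain set.

```python
import numpy as np

def cosine_score(enroll, test, adaptation_mean=None):
    """Cosine similarity of two embeddings, optionally centred on an in-domain mean."""
    if adaptation_mean is not None:
        enroll = enroll - adaptation_mean
        test = test - adaptation_mean
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))
```

Centering matters because a shared domain offset makes all embeddings look alike: two vectors that are orthogonal after centering can have a cosine close to one before it.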
we also adaptively normalise the scores with the statistics of the top
ten percent best scoring impostors for each embedding pair
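The adaptive score normalisation described here can be sketched as a symmetric normalisation over the best scoring impostors. This is a minimal sketch assuming the usual adaptive s-norm form; the cohort score arrays are hypothetical inputs holding the scores of each side of the trial against an impostor cohort.

```python
import numpy as np

def adaptive_s_norm(raw_score, enroll_cohort_scores, test_cohort_scores,
                    top_fraction=0.1, eps=1e-8):
    """Normalise a trial score with statistics of the top scoring impostors.

    enroll_cohort_scores / test_cohort_scores: scores of the enrolment and test
    embeddings against an impostor cohort (assumed inputs for this sketch).
    """
    def top_stats(scores):
        scores = np.asarray(scores, dtype=float)
        k = max(1, int(round(top_fraction * len(scores))))
        top = np.sort(scores)[-k:]  # the k highest impostor scores
        return top.mean(), top.std()

    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((raw_score - mu_e) / (sd_e + eps)
                  + (raw_score - mu_t) / (sd_t + eps))
```

Using only the closest impostors adapts the statistics to each embedding, so a raw score is judged against the hardest competitors rather than against the whole cohort.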
mean adaptation allows us to reduce the equal error rate and improve min dcf
but slightly
but if we compare it with
score normalisation we will see that score normalisation outperforms mean adaptation
on the majority of the test sets so now we can make a short summary of
the results
change during training
propose to model for jesus on the duration of training samples
so
what we see is that systems based on resnet architectures outperform x-vector based
systems in all experiments
the u-net based
voice activity detector outperforms the energy based voice activity detector
and score normalisation improves the performance of all extractor types on the majority
of the test settings
also the task oriented training data preparation can slightly improve the
quality of the system
d-softmax based training doesn't help to improve the eer
or min dcf performance and also we did not achieve any gain by using more complex resnet
architectures
for testing our hypothesis on short duration test segments
we have moved on to the following
experiments
with the test segment lengths ranging from zero point five seconds
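One simple way to build short duration test segments is to crop the feature matrix of a full utterance. This is a hedged sketch: the talk does not say how the short segments were produced, and the 10 ms frame shift and the helper name are assumptions.

```python
import numpy as np

def crop_to_duration(features, duration_s, frame_shift_s=0.01, rng=None):
    """Cut a (frames, dims) feature matrix down to roughly duration_s seconds.

    With rng=None the crop starts at the first frame; otherwise the start is random.
    Utterances already shorter than the target are returned unchanged.
    """
    num_frames = int(round(duration_s / frame_shift_s))
    total = features.shape[0]
    if total <= num_frames:
        return features
    start = 0 if rng is None else int(rng.integers(0, total - num_frames + 1))
    return features[start : start + num_frames]
```

For example, `crop_to_duration(features, 0.5)` keeps 50 frames under the assumed frame shift.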
first we have seen that independent wanted to sample duration is here is that it
is still doesn't doing better but it to address that the g
secondly we validated that resnet based architectures
outperform
the ones
based on
x-vector architectures in terms of equal error rate and min dcf for the tests
on a
a four and the while to twenty five second segments
it is
also
interesting to see that tdnn based x-vector systems degrade more
than resnet systems on short segments
this figure is an illustration of the relative differences between moving
from testing resnets on full sample durations to short length segments and moving from
testing x-vectors on full durations to short length segments
it is also interesting to see how
the augmentation we refer to as more realistic compares to
the kaldi-style augmentation in terms of the performance of the model
what is then trained on short duration segments would see that the situation changes in
the way that the we now
is not one obvious how well that
what is that the gap between the
metrics into roles just quite now
no
if we say the training segments not sure that
we would
differentiating we know that are
the model trained on data with more realistic room impulse responses
outperforms the model trained on the kaldi-style version of impulse responses
and the gap is getting wider
how with the but absolute we are
is the still not
we as
the obvious conclusion we can draw from the results
here is that in case of training address that based more on short differences
in the one for shorter duration is that it was tough degradation for
in order to compare our speaker recognition systems performance for sure you rinse as you
those already presented for the probably
we have they publish describing
are used as well as same time nibbling too heavy steel above results on what's
layer experiment
so we were able to
create testing protocols mostly identical to those used in the paper
of interest so this is the second p with the
you can see hold endurance level location of speaker recognition in the war
so for we do stability purposes we also did not use you know what for
just a data
so you can see a how
actually try to trying to create a problem
as for the results we can say that our systems show significantly better quality
for very short durations
like one second and two second
test durations
and
this is the final slide and here are the
main
takeaways
of this talk so the obtained results confirm that
resnet architectures outperform the x-vector approach in both long duration and short
duration scenarios
appropriate training data preparation can significantly improve the quality of the final speaker recognition systems
also the proposed u-net based voice activity detector outperforms the energy based voice
activity detector
and our best performing system for the voices challenge
is a resnet thirty four based system built on eighty dimensional mel filter bank
features
and it actually outperforms our previous best single system submitted to the voices
challenge
the proposed scoring model with mean adaptation and score normalisation techniques provides additional performance gains for speaker recognition
and that's it
thank you for your attention if you have any questions i will be happy to answer them in the q and a
session