Hello. I am one of the authors of this work on speaker recognition.

I am going to tell you about our work on deep speaker embeddings for far-field speaker recognition on short utterances.

The growth of the market of voice-driven devices such as smart speakers fuels the demand for far-field speaker recognition.

The environmental conditions these devices are usually used in are in many cases far from clean, so speech processing algorithms additionally have to be robust to noise and reverberation.

Another factor that can significantly influence the recognition results is performance on short duration test segments.

So the main focus of our study was to design a model that would not only perform well on unseen audio samples recorded in adverse environments, but would also preserve recognition quality when tested on short speech segments.

In order to achieve this, we started with the idea of moving the training data closer to the testing scenario, and investigated how changes in the duration and acoustic conditions of the training data affect the overall recognition performance.

The second concern was the presence of segments that carry no speaker-specific information, such as background noise and silence, in the utterances. So we prioritised the noise robustness of the voice activity detector that was used to discard such segments.

Next, we tried different acoustic features as well as embedding extractor architectures. We also investigated the effects of embedding-level domain adaptation and score normalisation.

Since data is the foundation of every data-dependent experiment, we will first introduce the data used in the current study.

We constructed four training datasets that are primarily comprised of VoxCeleb 1 and 2 data; some of them also have a fraction of in-domain data mixed in. The significant difference between these datasets is the augmentation used: for two of them the standard Kaldi-style augmentation was applied, while the other two were augmented in a different way.

In contrast to the standard augmentation scheme, in which reverberated, noisy, music and overlapping-speech versions of the data are created once with a fixed set of room impulse responses, we generated varying room impulse responses for different positions of sources and receivers.

To generate those impulse responses we used the image method proposed by Jont Allen and David Berkley. In this way we tried to narrow the gap between real and simulated room impulse responses by creating more realistic mixtures.
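As a rough illustration of this kind of augmentation, here is a minimal sketch of per-utterance reverberation with randomly generated room impulse responses; it assumes the pyroomacoustics image-source simulator rather than the authors' own generator, and the room sizes, absorption values and positions are illustrative, not the paper's settings.

    # Sketch of image-method reverberation augmentation (assumed tooling:
    # pyroomacoustics; geometry and absorption are illustrative values).
    import numpy as np
    import pyroomacoustics as pra
    from scipy.signal import fftconvolve

    def reverberate(speech, fs=16000, rng=np.random):
        # Random shoebox room with random source and microphone placement,
        # so every utterance gets its own impulse response.
        room_dim = rng.uniform([3.0, 3.0, 2.5], [8.0, 6.0, 4.0])
        room = pra.ShoeBox(room_dim, fs=fs,
                           materials=pra.Material(rng.uniform(0.2, 0.6)),
                           max_order=17)
        room.add_source(rng.uniform([0.5, 0.5, 1.0], room_dim - 0.5))
        room.add_microphone(rng.uniform([0.5, 0.5, 1.0], room_dim - 0.5))
        room.compute_rir()
        rir = room.rir[0][0]                       # microphone 0, source 0
        wet = fftconvolve(speech, rir)[:len(speech)]
        return wet / (np.max(np.abs(wet)) + 1e-9)  # simple peak normalisation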

Benchmarks of our best speaker recognition systems verified that the described scheme indeed outperforms the standard one in most conditions, as you can see.

Now let's move on to the data preprocessing. As acoustic features, we experimented with MFCCs and higher-dimensional mel filter banks.

The extracted acoustic features underwent either local mean normalisation followed by global mean and variance normalisation, or just a single local normalisation.
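For concreteness, here is a minimal sketch of the two feature front-ends plus a simple utterance-level mean and variance normalisation; it assumes librosa rather than the authors' toolkit, and the window, hop and feature dimensions are illustrative assumptions.

    # Mel filter banks vs MFCCs, with utterance-level normalisation
    # (a sketch; sizes are not the values used in this work).
    import librosa
    import numpy as np

    def mel_fbank(y, sr=16000, n_mels=64):
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                           hop_length=160, n_mels=n_mels)
        return librosa.power_to_db(S)       # (n_mels, frames) log mel filter banks

    def mfcc(y, sr=16000, n_mfcc=23):
        return librosa.feature.mfcc(y=y, sr=sr, n_fft=400,
                                    hop_length=160, n_mfcc=n_mfcc)

    def mvn(feats):
        # Mean and variance normalisation over the time axis.
        return (feats - feats.mean(axis=1, keepdims=True)) / \
               (feats.std(axis=1, keepdims=True) + 1e-8)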

If we look at the benchmark results, we see that the model trained on mel filter banks outperforms the same model trained on MFCCs on the majority of the test protocols.

The next preprocessing stage we want to draw attention to is voice activity detection. In our previous studies we observed that the conventional energy-based voice activity detector is sensitive to noise.

So we decided to create our own neural network based voice activity detector. It is based on the U-Net architecture, which was initially developed for medical image segmentation.

The choice of U-Net is related to treating voice activity detection as a segmentation problem. We trained the detector on telephone data and a small fraction of microphone data, which was downsampled to 8 kHz.

The labels for this dataset were obtained either by manual segmentation or by automatic speech recognition based segmentation followed by manual post-processing.
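To make the idea concrete, here is a minimal sketch of a U-Net-style voice activity detector operating on frame-level features; the framework (PyTorch), depth and layer widths are assumptions for illustration, not the authors' exact configuration.

    # Tiny 1-D U-Net-style VAD sketch: one encoder/decoder level with a skip
    # connection; input (batch, feat_dim, frames), output per-frame speech
    # probability. Sizes are illustrative.
    import torch
    import torch.nn as nn

    class TinyUNetVAD(nn.Module):
        def __init__(self, feat_dim=64, width=32):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv1d(feat_dim, width, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool1d(2)
            self.enc2 = nn.Sequential(nn.Conv1d(width, 2 * width, 3, padding=1), nn.ReLU())
            self.up = nn.ConvTranspose1d(2 * width, width, 2, stride=2)
            self.dec1 = nn.Sequential(nn.Conv1d(2 * width, width, 3, padding=1), nn.ReLU())
            self.head = nn.Conv1d(width, 1, 1)

        def forward(self, x):                     # x: (B, feat_dim, T), T even
            e1 = self.enc1(x)                     # (B, W, T)
            e2 = self.enc2(self.down(e1))         # (B, 2W, T/2)
            u = self.up(e2)                       # (B, W, T)
            d = self.dec1(torch.cat([u, e1], 1))  # U-Net skip connection
            return torch.sigmoid(self.head(d)).squeeze(1)  # (B, T) speech prob.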

As for the results, we observe that the U-Net based voice activity detector actually helps us to improve the quality of the systems in difficult conditions, compared to the standard energy-based voice activity detector.

Let's now dive into the details of the main component of our system, the embedding extractor. The embedding extractor is comprised of a frame-level network, a statistics pooling layer, and the segment-level layers.

The frame-level network analyses the input acoustic features frame by frame. For it we considered two types of neural networks commonly used in speaker recognition: TDNN-based and ResNet-based ones.

The main difference between these two is the dimensionality and the type of the kernels used in the frame-level processing.

The frame-level outputs are aggregated by the statistics pooling layer, which computes statistics of the frame-level features over time. The pooled feature maps are then flattened and passed to the segment-level layers, which extract utterance-level information. The resulting embedding vector is normalised and passed to the classification layer.
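As a sketch of the statistics pooling step just described, here is a generic PyTorch version (not necessarily the authors' exact implementation): it concatenates the mean and standard deviation of the frame-level features over time.

    import torch
    import torch.nn as nn

    class StatsPooling(nn.Module):
        def forward(self, frames):                 # frames: (batch, channels, time)
            mean = frames.mean(dim=2)
            std = frames.std(dim=2)
            return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

    # The pooled vector is passed through the segment-level fully connected
    # layers; the output of one of them is taken as the speaker embedding.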

We started with the well-known extended version of the TDNN, in which time delay layers are interleaved with dense layers. Then we moved to the factorised TDNN architecture, and finally ended our experiments with ResNets, represented by a ResNet-34 configuration, a deep residual network with skip connections.
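For reference, a time delay layer of the kind these TDNN architectures stack is essentially a dilated 1-D convolution over frames; here is a minimal sketch, with the context and channel sizes as illustrative assumptions.

    import torch.nn as nn

    def tdnn_layer(in_dim, out_dim, context=3, dilation=1):
        # One TDNN block: dilated Conv1d over frames + batch norm + ReLU.
        return nn.Sequential(
            nn.Conv1d(in_dim, out_dim, kernel_size=context, dilation=dilation,
                      padding=dilation * (context // 2)),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
        )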

Analysing the test results for those architectures, we draw two conclusions. First, our ResNet-34 outperforms the x-vector systems. Second, no further improvement is achieved by switching to heavier variants of these architectures.

As for the loss functions, we stuck to the additive angular margin softmax, which is well established in the area of speaker recognition.
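Here is a minimal sketch of an additive angular margin softmax head; the margin and scale values are typical defaults chosen for illustration, not the settings used in this work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AAMSoftmaxHead(nn.Module):
        def __init__(self, emb_dim, n_classes, m=0.2, s=30.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
            self.m, self.s = m, s

        def forward(self, emb, labels):
            # Cosine similarity between L2-normalised embeddings and class weights.
            cos = F.linear(F.normalize(emb), F.normalize(self.weight))
            theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
            # Add the angular margin only to the target-class angle.
            target = F.one_hot(labels, cos.size(1)).bool()
            logits = torch.where(target, torch.cos(theta + self.m), cos)
            return F.cross_entropy(self.s * logits, labels)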

We also tried to train our best model using the recently proposed D-softmax loss, which dissects the softmax loss into two independent parts, the intra-class and inter-class objectives. However, we were not able to get any gain from it: D-softmax training did not improve our models.

In this work we use cosine similarity instead of PLDA for embedding scoring.

We also used a simple domain adaptation procedure based on centering the data on an in-domain set, that is, speaker embedding mean subtraction, with the mean vector calculated on the adaptation set. In addition, we adaptively normalise the scores with the statistics of the top ten percent best-scoring impostors for each embedding.
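A minimal sketch of this scoring pipeline, assuming plain numpy arrays for the embeddings; the in-domain mean subtraction and the ten percent top-scoring cohort follow the description above, everything else is illustrative.

    import numpy as np

    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    def cosine(a, b):
        return float(np.dot(l2norm(a), l2norm(b)))

    def adapted_score(enroll, test, adapt_set, cohort, top_frac=0.1):
        # Centre all embeddings with the mean of the in-domain adaptation set.
        mu = adapt_set.mean(axis=0)
        e, t, coh = enroll - mu, test - mu, cohort - mu
        raw = cosine(e, t)
        # Adaptive s-norm: statistics from the top-scoring cohort entries only.
        k = max(1, int(top_frac * len(coh)))
        se = np.sort([cosine(e, c) for c in coh])[-k:]
        st = np.sort([cosine(t, c) for c in coh])[-k:]
        return 0.5 * ((raw - se.mean()) / (se.std() + 1e-12)
                      + (raw - st.mean()) / (st.std() + 1e-12))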

Mean adaptation allows us to reduce the equal error rate and improve the detection cost, but only slightly. If we compare it with score normalisation, we see that score normalisation outperforms mean adaptation on the majority of the test sets. So now we can summarise the results, before moving on to the dependence of the proposed models on the duration of the training and test samples.

so

was so it is that systems based on race that architectures are deformed spectra based

systems in all experiments

The U-Net based voice activity detector outperforms the energy-based voice activity detector, and score normalisation improves the performance of all extractor types on the majority of the test settings.

Also, the proposed more realistic reverberation of the training data can slightly improve the quality of the systems.

D-softmax based loss training does not help to improve the EER or minDCF performance, and we also did not achieve any gain by using more complex architecture variants.

To test our hypotheses on short duration test segments, we moved to experiments with test segment lengths ranging from one to five seconds.
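These short-duration tests can be thought of as rescoring the same trials on cropped audio; a small sketch of such a cropping helper (hypothetical, not the authors' tooling) is shown below.

    import numpy as np

    def crop(signal, sr, seconds, rng=np.random):
        # Cut a random chunk of the requested duration from the test utterance.
        n = int(seconds * sr)
        if len(signal) <= n:
            return signal                     # keep already-short utterances as-is
        start = rng.randint(0, len(signal) - n)
        return signal[start:start + n]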

First, we see that, independent of the test sample duration, the ResNet systems still do better than the TDNN-based ones.

Secondly, we validated that the ResNet-based architectures significantly outperform the ones based on x-vectors, in terms of EER and minDCF, for the tests on the shortened one to five second segments.

It is also interesting to see that the TDNN-based x-vector systems degrade more than the ResNet systems on short segments.

This figure illustrates the relative degradation when moving from testing on full sample durations to cropped short segments, for the ResNet systems compared to the x-vector systems.

We were also curious to see how the augmentation we refer to as more realistic compares to the standard Kaldi-style augmentation in terms of the performance of the full-duration model. When the model is trained on short duration segments, we see that the situation changes in that there is now no obvious winner: the gap between the metrics in the two rows is quite narrow.

If we make the training segments even shorter, we see a difference: the model trained on data with more realistic room impulse responses now outperforms the model trained with the standard Kaldi version of the impulse responses, and the gap gets wider. However, the absolute EER values are still worse than those obtained with longer training segments.

The obvious conclusion we can draw from these results is that training the ResNet-based model on short duration segments reduces the performance degradation on shorter test durations.

In order to compare our speaker recognition systems' performance on short utterances with results already presented in the literature, we found a published paper describing the protocols used, which at the same time gave us state-of-the-art reference results for the short-utterance experiments.

So we were able to create training and testing protocols mostly identical to those used in that paper, and you can see how our results relate to the current level of short-utterance speaker recognition.

For reproducibility purposes we also did not use the U-Net based voice activity detector for this data, so you can see how hard we actually tried to recreate the protocol.

As for the results, we can say that our models show significantly better quality for very short durations, like one and two second utterances, as well as for longer durations.

And hence, the final slide with the main takeaways of this talk.

The obtained results confirm that ResNet architectures outperform the x-vector approach in both full-duration and short-duration scenarios.

Appropriate training data preparation can significantly improve the quality of the final speaker recognition systems.

Also, the proposed U-Net based voice activity detector outperforms the energy-based voice activity detector.

Our best performing system on the VOiCES challenge data is a ResNet-34 based system built on mel filter bank features, and it actually outperforms our previous best single system submitted to the VOiCES challenge.

The proposed scoring with mean adaptation and score normalisation techniques provides additional performance gains for speaker recognition.

And that's it. Thank you for your attention. If you have any questions, we will be happy to answer them in the Q&A session.