Hello, I would like to present our contribution regarding the utilization of the VOiCES corpus for multichannel speaker verification.

According to research papers, there is growing interest in multichannel speaker verification, but the number of datasets is still limited.

Therefore, we wanted to use the VOiCES data for the evaluation of multichannel speaker verification systems. The objectives of our work are as follows.

We analyzed the original trial lists defined for the VOiCES challenge and redefined them so that multichannel speaker verification systems can make use of them. Hence, we created new multichannel trial lists, verified that they are robust, and also utilized the VOiCES data for training our subsystems.

Because we wanted to create a multichannel trial set, we first needed to analyze the original trial set defined for the VOiCES challenge.

Here we can see, first of all, that every set of recordings was recorded in a different room. As regards noise conditions, we can see that the test recordings were recorded with background noise, namely babble noise, television noise, and music, and also without any added noise. The enrollment recordings were recorded without any distractor noise, so they contain just room reverberation and the quiet background.

We can also see that half of the enrollment data for evaluation was taken from the original source recordings.

As regards microphones, enrollment recordings were recorded with two microphones, while test recordings were recorded with eight or eleven microphones. These numbers will be quite important for us.

In terms of speakers, we can see that there are some unique speakers in the enrollment and test portions. Overall, we have about one hundred speakers in enrollment, both for evaluation and development. For development, we have many more speakers in the test portion than in the evaluation set. Regarding utterances, the utterances are disjoint between enrollment and test. Also, the speakers in the development set are different from those in the evaluation set.

So we wanted to create multichannel trials, and when we analyzed the original ones, we realized that for every enrollment recording there are always multiple test recordings containing the same utterance, the same noise, the same speaker, and the same room, but recorded with a different microphone. This is what we made use of.

When creating our multichannel trials, we use a single enrollment recording, and in terms of test recordings, we group several recordings together to create a microphone array.

Now let's look at the creation of the test portions of the development and evaluation sets. Starting with the development set, we can see that for every enrollment utterance there are always eight test recordings containing basically the same utterance, recorded over different microphones; the identifiers shown here denote the individual recordings.

We decided to always group four recordings into one microphone array. That means that instead of eight trials we obtain two trials, which reduced the number of trials from four million to one million for the development set. For the evaluation set, we have eleven recordings for every enrollment utterance. We again grouped four recordings together and were left with three remaining utterances, so we randomly added one more utterance from those already used. This reduced the number of trials from 3.15 million to 980 thousand.
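
To make the grouping concrete, here is a minimal Python sketch of this logic; the function name and the shuffle-then-pad strategy are illustrative assumptions, not our exact implementation:

```python
import random

def group_into_arrays(recordings, group_size=4):
    """Group same-utterance recordings (one per microphone) into arrays of four.

    With 8 recordings we obtain 2 arrays; with 11 we obtain 2 full arrays plus
    a remainder of 3, which is padded with one randomly chosen, already-used
    recording (illustrative assumption)."""
    random.shuffle(recordings)
    groups = [recordings[i:i + group_size]
              for i in range(0, len(recordings), group_size)]
    if len(groups[-1]) < group_size:
        used = [r for g in groups[:-1] for r in g]
        groups[-1] += random.sample(used, group_size - len(groups[-1]))
    return groups

# e.g. 11 test recordings of one utterance -> 3 microphone arrays of 4
arrays = group_into_arrays([f"mic{i:02d}" for i in range(11)])
```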

We tried not only redefining the development and evaluation sets, but we also tried creating training data.

Our multichannel training dataset is based on the full list of recordings from rooms one and two. We completely excluded the recordings from rooms three and four because, as we have seen, all of the original evaluation utterances were recorded in rooms three and four. We also left out the development data, because they were recorded in rooms one and two. Then we again grouped the recordings based on their content, and we obtained microphone arrays containing four microphones.

so

the result was trained dataset comprising fifty seven point eight thousand examples but which use

of two hundred speakers

So it is clear that there is a bias, because this dataset is similar to the development dataset in terms of speakers and also acoustic conditions, but this was already the case for the original dataset.

So now we have all three sets: the development and evaluation sets, and also the training set.

Now let's move on to the explanation of our multichannel approach to speaker verification.

We use a standard system. It contains a front end, which is a beamformer; the single-channel output then goes to an x-vector extractor, and the embeddings are scored using PLDA. So this is a very standard pipeline.
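
As a structural sketch of this pipeline (the callables below are hypothetical stand-ins, not our actual module interfaces):

```python
def score_trial(enroll_mc, test_mc, beamform, xvector, plda_score):
    """Multichannel trial scoring sketch: front end -> x-vector -> PLDA.

    beamform, xvector, and plda_score are hypothetical stand-ins for the
    beamformer front end, the embedding extractor, and the PLDA backend."""
    emb_enroll = xvector(beamform(enroll_mc))  # single channel after beamforming
    emb_test = xvector(beamform(test_mc))
    return plda_score(emb_enroll, emb_test)    # higher score = same speaker
```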

However, our goal was not to propose a novel system, but rather to assess the use of the VOiCES data.

For beamforming, we were able to make use of the VOiCES training data. We also tried using simulated data, and I will explain why and when later in the presentation. The VOiCES training dataset is quite small, and therefore we couldn't use it for training the x-vector extractor; this means that we used VoxCeleb for training the x-vector extractor and also the PLDA.

For front-end processing, we use the GEV (generalized eigenvalue) beamformer. This beamformer utilizes speech and noise statistics and produces a single-channel output.

First, we need to compute or estimate the speech cross-power spectral density (PSD) matrix and the noise one. These two matrices go to a GEV solver, which performs a generalized eigenvalue decomposition. The principal eigenvector is then used to construct the beamformer weights, which are applied to the multichannel input, and we obtain a single-channel output.
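
A minimal numpy/scipy sketch of this per-frequency computation follows; the mask-weighted PSD estimates and the small regularization term are common choices and assumptions here, not necessarily our exact settings:

```python
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(stft, speech_mask, noise_mask):
    """GEV beamforming sketch.

    stft:    (C, T, F) complex multichannel STFT
    *_mask:  (T, F) values in [0, 1] from the mask estimator
    returns: (T, F) single-channel STFT
    """
    C, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]  # (C, T)
        # mask-weighted speech/noise cross-power spectral density matrices
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / max(speech_mask[:, f].sum(), 1e-6)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / max(noise_mask[:, f].sum(), 1e-6)
        phi_n += 1e-6 * np.eye(C)  # regularization for numerical stability
        # generalized eigenvalue problem: phi_s w = lambda * phi_n w
        _, vecs = eigh(phi_s, phi_n)
        w = vecs[:, -1]            # principal eigenvector = beamformer weights
        out[:, f] = w.conj() @ X   # apply w^H to the multichannel input
    return out
```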

In order to estimate the speech and noise PSD matrices, we use a neural network. We have a single network, and it is applied to all of the channels. Given an input spectrum, the network is supposed to output a mask for speech and a mask for noise. The resulting masks are applied to the input spectra, and the noise and speech PSD matrices are estimated.

This pipeline is fully differentiable, as we showed in our previous work.

The architecture of this model is pretty simple. It contains a recurrent layer followed by two linear layers, and then there are two output layers, one producing the speech mask and the other one the noise mask.
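
A PyTorch sketch of such a mask estimator follows; the BLSTM choice for the recurrent layer and the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Shared trunk with two output heads: a speech mask and a noise mask."""

    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, bidirectional=True, batch_first=True)
        self.trunk = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.speech_head = nn.Linear(hidden, n_freq)
        self.noise_head = nn.Linear(hidden, n_freq)

    def forward(self, mag):  # mag: (B, T, F) magnitude spectra of one channel
        h, _ = self.blstm(mag)
        h = self.trunk(h)
        # sigmoids keep both masks in [0, 1]
        return torch.sigmoid(self.speech_head(h)), torch.sigmoid(self.noise_head(h))
```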

In our experiments, we will refer to two models, but essentially they are the same; what differs is the way they are trained.

For the BCE model, we train the weights of the mask estimation system just by optimizing the output masks. Therefore, we first compute ideal binary masks, and then we minimize the binary cross-entropy between the outputs and these ideal masks.
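
A sketch of this objective, reusing the MaskEstimator sketch above; the toy tensors stand in for simulated data where speech and noise are known separately:

```python
import torch

def ibm(speech_mag, noise_mag):
    # ideal binary mask: 1 where speech dominates the time-frequency bin
    return (speech_mag > noise_mag).float()

# toy (batch, time, freq) magnitudes standing in for one simulated mixture
speech_mag = torch.rand(1, 100, 257)
noise_mag = torch.rand(1, 100, 257)
noisy_mag = speech_mag + noise_mag  # rough additive-mixture approximation

masknet = MaskEstimator(n_freq=257)           # sketch from above
speech_mask, noise_mask = masknet(noisy_mag)

bce = torch.nn.BCELoss()
target = ibm(speech_mag, noise_mag)
loss = bce(speech_mask, target) + bce(noise_mask, 1.0 - target)
loss.backward()
```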

In order to compute the ideal binary masks, we need to know the speech and the noise separately. That means that we cannot use the VOiCES dataset, and we had to simulate data for this training.

To create such a simulated dataset, we used the same utterances as in our multichannel VOiCES dataset, and we performed the simulation using the image source method, adding noise of the same types that were also used in the VOiCES dataset.
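
For illustration, a simulation of this kind could be set up with pyroomacoustics, one common implementation of the image source method; the geometry, absorption, and positions below are arbitrary placeholders, not our actual configuration:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(fs * 3)  # placeholder for a source utterance

# shoebox room simulated with the image source method
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, max_order=17,
                   materials=pra.Material(0.3))
room.add_source([1.0, 2.5, 1.5], signal=speech)

# four microphones forming one "array" (positions are illustrative)
mics = np.array([[2.0, 3.0, 4.0, 5.0],
                 [1.0, 1.5, 2.0, 2.5],
                 [1.4, 1.4, 1.4, 1.4]])
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()                        # convolves the source with simulated RIRs
multichannel = room.mic_array.signals  # (4, num_samples)
```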

For the MSE model, we optimize the output of the beamformer; therefore, we minimize the MSE between the output and clean speech. In this case, we can use the multichannel VOiCES training data, because what is needed is the degraded audio and the clean speech, which is taken from LibriSpeech.
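
Sketched in the same style (gev_beamform is a hypothetical differentiable PyTorch counterpart of the front end above, and the tensors are assumed to be prepared elsewhere):

```python
import torch

mse = torch.nn.MSELoss()

# noisy_stft / noisy_mag: multichannel VOiCES training audio features;
# clean_mag: aligned clean (LibriSpeech) magnitudes
speech_mask, noise_mask = masknet(noisy_mag)                  # mask net from above
enhanced = gev_beamform(noisy_stft, speech_mask, noise_mask)  # hypothetical, differentiable
loss = mse(enhanced.abs(), clean_mag)  # match the clean speech magnitudes
loss.backward()                        # gradients flow back into the mask network
```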

So much for the architecture; now let's move on to the experiments.

For reference, we show results for the so-called single-channel case. Here, we used the original trial lists defined for the VOiCES challenge and evaluated our x-vector extractor on them. Our baseline is BeamformIt, which is a well-established tool for beamforming. These are the results.

Then we tried assessing the BCE and MSE models using the same trial lists. It is worth mentioning that the single-channel results cannot be readily compared with BeamformIt, because the numbers of trials differ.

We can see that the BCE model attains better results than the BeamformIt baseline. However, the performance of the MSE model is quite poor. We hypothesize that it is much more difficult to train the neural network to output correct masks for speech and noise just by minimizing the MSE of the beamformer output. Moreover, there is more variability in the training data for the BCE model than in the training data for the MSE model; all of the MSE training data come from the VOiCES corpus.

Further, we can see that the BCE model generalizes better than the MSE model, and this is again because of the variability in the training data.

Then we tried to improve the MSE model while still using the VOiCES dataset and no external data. We made use of SpecAugment, specifically a proposed variant of SpecAugment in which we apply the masks directly to the spectra. More specifically, we use five frequency masks and two time masks.
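
A numpy sketch of this augmentation; the mask counts follow the talk, while the maximum mask widths are illustrative assumptions:

```python
import numpy as np

def spec_masks(spec, n_freq_masks=5, n_time_masks=2, max_f=8, max_t=20):
    """Apply SpecAugment-style masks directly to a (T, F) spectrum."""
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_freq_masks):          # zero out random frequency bands
        w = np.random.randint(0, max_f + 1)
        f0 = np.random.randint(0, max(F - w, 1))
        spec[:, f0:f0 + w] = 0.0
    for _ in range(n_time_masks):          # zero out random time spans
        w = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(T - w, 1))
        spec[t0:t0 + w, :] = 0.0
    return spec

augmented = spec_masks(np.abs(np.random.randn(100, 257)))
```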

We can see that we were able to improve the performance of the MSE model quite substantially, and we can also observe that its performance is now better than the baseline performance. We also tried using SpecAugment with the BCE model and again saw some improvement, but the improvement is not as large as for the MSE model, which is good news for us.

So much for the first experiment; let's turn to the second experiment.

Here we assess the performance of the individual microphones. We hypothesized that some of the microphones can perform poorly when they are used in multi-microphone arrays. In this case, the microphones can be far from each other, as opposed to conventional small microphone arrays. We thought that poorly performing microphones might degrade the overall performance greatly, and that it might be useful to exclude them from the trials.

To assess this, we first needed to evaluate single microphones. We took the original trial lists and created microphone-specific trial lists in which, as you can see, the enrollment recordings are always the same, and the test recordings correspond to the microphone that recorded the specific utterance.
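
As a sketch of this filtering (the trial tuple layout here is a hypothetical illustration):

```python
def mic_specific_trials(trials, mic_id):
    """Keep only trials whose test recording came from the given microphone.

    Each trial is assumed to look like (enroll_id, test_id, test_mic, label)."""
    return [t for t in trials if t[2] == mic_id]

# one trial list per microphone identifier (all_trials: the original list)
per_mic = {m: mic_specific_trials(all_trials, m) for m in mic_ids}
```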

These are the results that we obtained. We can see that our best performing microphone lies right in front of the loudspeaker. The worst microphone is microphone number twelve, which was obstructed, and another poor microphone is number six, the one that is farthest from the loudspeaker. We can see that there is quite some difference between the best and the worst microphone.

This is even more pronounced for the evaluation set, where we can see that the best performing microphone achieves a result of 2.28, whereas the worst performing microphone is almost seven times worse than the best one. Again, microphone number twelve was fully obstructed, and microphone number six is far from the loudspeaker.

We then tried excluding those microphones from the trials. As expected, the numbers that we got are better, but what is more important, the difference is not large. So we decided not to exclude any microphones from the trials.

This concludes the main part of our presentation; now let's move to the outcomes of our work.

We adopted the VOiCES definition of trials and created trial lists for the development and evaluation of multichannel speaker verification. We are aware of the fact that we reduced the number of trials quite substantially, but we verified that the results obtained with the trial lists are reliable; details can be found in our paper.

We identified several drawbacks, such as the small number of speakers and the small variability in the acoustic environments and channels, and we tackled these problems via data augmentation.

In our set of experiments, we confirmed that even with a dataset of this size, and with the mentioned data limitations, we can achieve interesting results and carry out research in the field of multichannel speaker verification.

Thank you for your attention.