Hello, I would like to present our contribution regarding the utilization of the VOiCES corpus for multichannel speaker verification.
According to research papers, there is more and more interest in multichannel speaker verification, but the number of datasets is still limited.
Therefore, we wanted to use the VOiCES data for the evaluation of multichannel speaker verification systems. The objectives of our work are as follows.
We analyzed the original trial lists defined for the VOiCES challenge. We redefined them so that multichannel speaker verification systems can make use of them; hence, we created new multichannel trial lists. We verified that they are robust. And we also used the VOiCES data for training our subsystems.
Because we wanted to create a multichannel trial set, we first needed to analyze the original trial set defined for the VOiCES challenge. We can see that each set of recordings was recorded in a different room.
As regards noise conditions, we can see that the test recordings were recorded with background noise, either babble noise, television noise, or music, and also without anything. The enrollment recordings were recorded without any background noise, so they contain just room reverberation.
And we can see that half of the enrollment data for evaluation was taken from the original source recordings.
As regards microphones, the enrollment recordings were recorded with two microphones, and the test recordings with eight or eleven microphones. These numbers will be quite important for us.
In terms of speakers, we can see that there are some unique speakers in the enrollment and test portions. Overall, we have about one hundred speakers in enrollment, both for evaluation and development. For development, we have many more speakers in the test set than in the enrollment set. Regarding utterances, the utterances are disjoint between enrollment and test. Also, the speakers in the development set are different from those in the evaluation set.
So, since we wanted to create multichannel trials, we analyzed the original ones and realized that for every enrollment recording, there are always multiple test recordings containing the same utterance, the same noise, the same speaker, and the same room, but recorded with a different microphone. And this is what we make use of. While creating our multichannel trials, we use a single enrollment recording, and in terms of test recordings, we group several recordings to create a microphone array.
Now we will look into the creation of the test portions of the development and evaluation sets. For the development set, we can see that for every enrollment utterance, there are always eight test utterances containing basically the same utterance but recorded over different microphones; the channel identifiers represent the individual microphones.
We decided to always group four recordings into one microphone array. That means that instead of eight trials, we obtain two trials, which reduced the number of trials from four million to one million.
For the evaluation set, we have eleven recordings for every enrollment utterance. We again grouped four recordings together and were left with three more utterances; therefore, we randomly added one utterance from those already used. This reduced the number of trials from 3.15 million to 980 thousand trials.
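To make the grouping concrete, here is a minimal sketch of the procedure just described; the function name and the microphone identifiers are our own illustration, not the released tooling.

```python
import random

def group_into_arrays(recordings, array_size=4, seed=0):
    """Group the per-utterance test recordings (one per microphone) into
    fixed-size 'microphone arrays'. If the last group falls short (e.g.
    3 of 11 recordings remain), it is completed by re-using a randomly
    chosen recording that was already assigned, as described above."""
    rng = random.Random(seed)
    recs = list(recordings)
    rng.shuffle(recs)
    arrays = [recs[i:i + array_size] for i in range(0, len(recs), array_size)]
    if len(arrays[-1]) < array_size:
        used = [r for a in arrays[:-1] for r in a]
        while len(arrays[-1]) < array_size:
            arrays[-1].append(rng.choice(used))
    return arrays

# Development: 8 recordings -> 2 arrays; evaluation: 11 -> 3 arrays
print(group_into_arrays([f"mic{i:02d}" for i in range(1, 12)]))
```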
We tried not only creating the development and evaluation sets, but we also tried creating training data. Our multichannel training dataset is based on the full list of recordings from rooms one and two. We completely excluded the recordings from rooms three and four because, as we have seen, all the original evaluation utterances were recorded in rooms three and four. We also excluded the development data because they were recorded in rooms one and two. Then we again grouped the recordings based on their content, and we obtained microphone arrays containing four microphones.
The result was a training dataset comprising 57.8 thousand examples uttered by two hundred speakers. So it is clear that there is a bias, because this dataset is similar to the development dataset in terms of speakers and also acoustic conditions, but this was unavoidable given the design of the original dataset. So now we have all three sets: the development and evaluation sets, and also the training set.
Now let's move on to the explanation of our multichannel approach to speaker verification. We use a standard system: it contains a front end, which is a beamformer, the single-channel output of the beamformer goes to an x-vector extractor, and the embeddings are scored using PLDA. So this is a very standard pipeline.
But our goal was not to propose a novel system, but rather to assess the use of the VOiCES data. For training the beamformer, we were able to make use of the original VOiCES training data. We also tried using simulated data, and I will explain why and when later in the presentation.
The VOiCES training dataset is quite small, and therefore we couldn't use it for training the x-vector extractor. That means that we used VoxCeleb for training the x-vector extractor and also the PLDA backend.
For front-end processing, we use the GEV, the generalized eigenvalue beamformer. This beamformer utilizes second-order statistics and produces a single channel. First, we need to compute, or rather estimate, the speech cross-power spectral density matrix and the noise one. These matrices go to GEVD, which is generalized eigenvalue decomposition. The principal eigenvector is then used to construct the beamformer weights; they are applied to the multichannel input, and we obtain a single-channel output.
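As a rough illustration of the steps just described, the following sketch (ours, not the authors' code) solves the per-frequency generalized eigenvalue problem with NumPy/SciPy and applies the resulting weights:

```python
import numpy as np
from scipy.linalg import eigh

def gev_weights(psd_speech, psd_noise):
    """GEV beamformer: per frequency bin, solve the generalized
    eigenvalue problem  Phi_ss w = lambda Phi_nn w  and keep the
    principal eigenvector as the beamforming weights."""
    n_freq, n_chan, _ = psd_speech.shape
    weights = np.zeros((n_freq, n_chan), dtype=complex)
    for f in range(n_freq):
        # small diagonal loading keeps the noise PSD invertible
        noise = psd_noise[f] + 1e-10 * np.eye(n_chan)
        _, vecs = eigh(psd_speech[f], noise)   # eigenvalues ascending
        weights[f] = vecs[:, -1]               # principal eigenvector
    return weights

def apply_beamformer(weights, stft):
    """stft: multichannel spectrum of shape (freq, channels, time)."""
    return np.einsum("fc,fct->ft", weights.conj(), stft)
```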
In order to estimate the speech and noise PSD matrices, we use a neural network. We have a single network, and it is applied to each of the channels of the given input. This neural network is supposed to output a mask for speech and a mask for noise. The resulting masks are applied to the input spectra, and the noise and speech PSD matrices are estimated.
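A minimal sketch of this mask-weighted PSD estimation could look as follows; it assumes a multichannel STFT of shape (frequency, channels, time) and a single mask per time-frequency bin (pooling the per-channel masks, e.g. by a median, is a common choice, but that detail is our assumption):

```python
import numpy as np

def estimate_psd(stft, mask):
    """Mask-weighted cross-power spectral density matrices.
    stft: (freq, channels, time); mask: (freq, time), values in [0, 1]."""
    psd = np.einsum("ft,fct,fdt->fcd", mask, stft, stft.conj())
    weight = mask.sum(axis=-1)[:, None, None] + 1e-10
    return psd / weight

# psd_speech = estimate_psd(stft, speech_mask)
# psd_noise  = estimate_psd(stft, noise_mask)
```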
This whole pipeline is differentiable, as we showed in our previous work. The architecture of this model is pretty simple: it contains a couple of linear layers, followed by two output layers, one producing the mask for speech and the other one the mask for noise. In our experiments, we will refer to two models, but essentially they are the same; what is different is the way of training.
For the BCE model, we train the weights of the mask-estimation system just by optimizing the output masks. Therefore, we first compute the ideal binary masks, and then we minimize the binary cross-entropy between the outputs and these targets. In order to compute the ideal binary masks, we need to know speech and noise separately. That means that we cannot use the VOiCES dataset, and we need simulated data for training.
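For illustration, the IBM target and the BCE loss might be sketched like this; the exact target definition, such as the SNR margin, is our assumption:

```python
import torch
import torch.nn.functional as F

def ideal_binary_mask(speech_mag, noise_mag, margin_db=0.0):
    """IBM target: 1 where speech dominates noise. Computing it needs
    the speech and noise signals separately, which is exactly why this
    model has to be trained on simulated data."""
    snr_db = 20.0 * torch.log10((speech_mag + 1e-10) / (noise_mag + 1e-10))
    return (snr_db > margin_db).float()

def bce_mask_loss(pred_mask, speech_mag, noise_mag):
    """Binary cross-entropy between the estimated mask and the IBM."""
    target = ideal_binary_mask(speech_mag, noise_mag)
    return F.binary_cross_entropy(pred_mask, target)
```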
To create such a simulated dataset, we used the same utterances as in the multichannel VOiCES dataset, and we performed the simulation using the image source method. We also added noise of the types that were used in the VOiCES dataset.
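We make no claim about the exact simulation tooling, but as an illustration, the image source method is implemented, for instance, in pyroomacoustics; the geometry and absorption values below are made up:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(fs * 3)  # stand-in for a clean utterance

# Shoe-box room simulated with the image source method
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3), max_order=17)
room.add_source([1.0, 2.0, 1.5], signal=speech)

# Four microphones standing in for one of the grouped arrays
mic_positions = np.array([[2.0, 3.0, 4.0, 5.0],
                          [1.0, 2.0, 3.0, 4.0],
                          [1.2, 1.2, 1.2, 1.2]])
room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))

room.simulate()                        # convolve source with simulated RIRs
multichannel = room.mic_array.signals  # shape: (4, num_samples)
# noise of the types used in VOiCES would then be added on top
```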
For the MSE model, we optimize the output of the beamformer. Therefore, we minimize the MSE between the output and clean speech. In this case, we can use the multichannel VOiCES training data, because all we need is the corrupted audio and the clean speech, which is taken from LibriSpeech.
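Because the pipeline is differentiable end to end, one MSE training step can be sketched as follows; mask_net and gev_beamform are placeholders for the components described earlier, not actual identifiers from our code:

```python
import torch

def mse_step(mask_net, gev_beamform, noisy_stft, clean_stft, optimizer):
    """One training step for the MSE model: backpropagate through the
    (differentiable) GEV beamformer into the mask-estimation network."""
    speech_mask, noise_mask = mask_net(noisy_stft)
    enhanced = gev_beamform(noisy_stft, speech_mask, noise_mask)
    loss = torch.mean((enhanced.abs() - clean_stft.abs()) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```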
So much for the description of our architecture; now we will turn to the experiments. For reference, we show results for the so-called single-channel case. Here we used the original trial lists defined for the VOiCES challenge and evaluated our x-vector extractor on them. Our baseline is BeamformIt, which is a well-established tool for beamforming. The results are shown here.
Then we tried assessing the BCE and MSE models using the same trial lists as for the single channel. It is worth mentioning that the single channel cannot be readily compared with BeamformIt, because the number of trials is different.
Then we tried assessing the performance of the BCE and MSE models. We can see that the MSE model attains better results than the BeamformIt baseline. However, the performance of the BCE model is quite poor. We hypothesize that it is a much more difficult task to train the neural network to output correct masks for speech and noise just by minimizing the loss on the masks, with no feedback on how good the beamformed output is. Moreover, there is more variability in the training data for the BCE model than in the training data for the MSE model; the training data for the MSE model all come from the VOiCES corpus.
Further, we can see that the BCE model generalizes better than the MSE model, and this is again because of the variability in the data.
Then we tried to improve the MSE model while still using the VOiCES dataset and no external data. So what we did was use augmentation, specifically a proposed variant of SpecAugment where we apply the masks directly to the spectra. More specifically, we apply five frequency masks and two time masks.
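A possible sketch of such spectrum masking is shown below; the mask counts follow what was just said, while the maximum mask widths are illustrative guesses:

```python
import torch

def augment_spectrum(spec, n_freq_masks=5, n_time_masks=2,
                     max_f=8, max_t=20):
    """Zero out random frequency bands and time spans of an input
    spectrum of shape (freq, time); the masks are applied directly
    to the spectra, as described in the talk."""
    spec = spec.clone()
    n_bins, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = torch.randint(0, max_f + 1, (1,)).item()
        f0 = torch.randint(0, max(n_bins - f, 1), (1,)).item()
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = torch.randint(0, max_t + 1, (1,)).item()
        t0 = torch.randint(0, max(n_frames - t, 1), (1,)).item()
        spec[:, t0:t0 + t] = 0.0
    return spec
```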
We can see that we were able to improve the performance of the MSE model quite substantially. We can also observe that its performance is now better than the baseline performance.
We also tried using SpecAugment with the BCE model, and again we see some improvement, but the improvement is not as large as for the MSE model, which is good news for us.
So much for the first experiment; let's turn to the second one. Here we are assessing the performance of individual microphones. We hypothesized that some of the microphones may perform poorly when they are used in our microphone arrays. In this setup, the microphones can be far from each other, as opposed to conventional small microphone arrays. And we thought that maybe the poorly performing microphones degrade the overall performance greatly, and it might be useful to exclude them from the trials.
To assess this, we first needed to evaluate single microphones. So we took the original trial lists, and then we created microphone-specific trial lists where, as you can see, the enrollment recordings are always the same, and the test recordings correspond to the microphone that recorded the specific utterance.
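Deriving such a microphone-specific list from the original trials could be sketched like this, assuming, purely for illustration, that each test recording identifier encodes its microphone:

```python
def mic_specific_trials(original_trials, mic_id):
    """Keep the enrollment side unchanged; restrict test recordings to
    a single microphone. Assumes each test id embeds its microphone,
    e.g. '...-mc07-...' (the naming scheme here is illustrative)."""
    return [(enroll, test) for enroll, test in original_trials
            if f"-{mic_id}-" in test]

# trials_mic07 = mic_specific_trials(trials, "mc07")
```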
These are the results that we obtained. We can see that the best performing microphone is the one directly in front of the loudspeaker. The worst microphone is the one with number twelve, which was fully obstructed, and another poor microphone is the one that is farthest from the loudspeaker, number six. We can see that there is quite some difference between the best and the worst microphone. This is even more pronounced for the evaluation set, where the best performing microphone attains a performance of 2.28, whereas the worst performing microphone is almost seven times worse than the best one. Again, microphone number twelve was fully obstructed, and microphone number six is far from the loudspeaker.
We then tried excluding those microphones from the trials. As expected, the numbers that we got are better, but, what is more important, the difference is not large. So we decided not to exclude any microphones from the trials.
This brings us to the end of our presentation; now let's move to the outcomes of our work. We adopted the VOiCES definition of trials and created trial lists for the development and evaluation of multichannel speaker verification. We are aware of the fact that we reduced the number of trials quite substantially, but we verified that the results obtained with the trial lists are reliable; details on that can be found in our paper.
We have identified several problems, such as the small number of speakers and the small variability in the acoustic environments and channels, and we tackled these problems via data augmentation. In our set of experiments, we have confirmed that even with a dataset of this size and without additional data, we can achieve interesting results and carry out research in the field of multichannel speaker verification. Thank you for your attention.